Senior Site Reliability Engineer

London

Xtremepush

Turn visitors into loyal players with real-time data, AI and gamification. Channels include web, app & social engagement.


We are seeking a Senior SRE with experience of working with scaled SaaS production infrastructure. The successful candidate will work as part of a team focused on site reliability, security, and scalability, as we manage our rapid growth. 

The ideal candidate will be a proactive and driven individual, who excels at understanding and working on complex technical solutions requiring performance and optimisation at scale. Our core technologies include PHP, MySQL, Vue.js and AWS. Participating in an on-call roster is required as part of this role. 

This is a hybrid role (2 days in the office).
  • Act as a senior member of the SRE team, supporting activities including managing the team's backlog and workload, scoping requirements, peer-reviewing code, and providing feedback to the rest of the team.
  • Represent the team in management and stakeholder meetings. Ensure best practices are kept, and suggest improvements to our development processes where you see gaps.
  • Investigate, test, and resolve technical problems, working closely with other engineers to deliver core product functionality.
  • Define SLOs, SLIs, and SLAs for key metrics that indicate the health, security, stability and uptime of production, staging and development environments.
  • Monitor the above environments and react to alerts and issues that may arise in the day-to-day operation of the product line.
  • Participate in an on-call rota for priority-1 alarms with the rest of the Platform teams.
  • Make ongoing upgrades and improvements to operational processes to optimise performance, stability and cost.
  • Work with the platform engineering team to contribute to the planning of how we carry out application/infrastructure releases and configuration changes.
  • Interact with internal teams and external third-party vendors to troubleshoot and resolve complex problems.
  • 5+ years' experience in an engineering role responsible for supporting a scaled SaaS platform running on Linux in a cloud environment
  • Experience working with high-performance systems and solving complex engineering problems at scale (our platform processes ~100 billion messages per year)
  • Understanding of distributed systems design, including asynchronous tasks, event-driven architecture, scheduling, caching and queue processing
  • Ability to apply distributed systems design knowledge to resolve scaling constraints, and to carry out performance tuning from the API through the application to the database layer of the platform
  • Strong communication skills and ability to explain complex technical solutions simply to others
  • Strong understanding of PHP, Go, MySQL, OpenTelemetry and Prometheus
  • Experience with Cloud and DevOps technologies (AWS, Terraform, CI/CD etc.)
  • Experience with specific technologies in our stack: ClickHouse, Kafka, Pulsar, Python
  • Experience with networking and security concepts
  • Interest or experience with marketing technologies
  • Interest or experience with big data, data analytics, AI and machine learning
Ireland (Dublin) or UK (London or Milton Keynes)

Tags: APIs Architecture AWS Big Data CI/CD Data Analytics DevOps Distributed Systems Engineering Golang Kafka Linux Machine Learning MySQL PHP Pulsar Python Security Terraform Vue

Perks/benefits: Career development

Region: Europe
Country: United Kingdom
