Sr. Cloud Site Reliability Engineer
US - Remote
Full Time Senior-level / Expert USD 150K - 200K
Serve Robotics
Why move a 2-pound burrito in a 2-ton car? Meet Serve, the future of sustainable, self-driving delivery. At Serve Robotics, we’re reimagining how things move in cities. Our personable sidewalk robot is our vision for the future. It’s designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses. The Serve fleet has been delighting merchants, customers, and pedestrians along the way in Los Angeles while doing commercial deliveries.
We’re looking for talented individuals who will grow robotic deliveries from surprising novelty to efficient ubiquity.
Who We Are
We are tech industry veterans in software, hardware, and design who are pooling our skills to build the future we want to live in. We are solving real-world problems leveraging robotics, machine learning and computer vision, among other disciplines, with a mindful eye towards the end-to-end user experience. Our team is agile, diverse, and driven. We believe that the best way to solve complicated dynamic problems is collaboratively and respectfully.
This is a senior-level, individual contributor position. You will balance hands-on responsibilities (building and maintaining critical SRE tooling and processes) with technical leadership (guiding architecture decisions, mentoring others in SRE practices, and steering strategic initiatives to enhance system resiliency and availability). You’ll collaborate across engineering, product, and operations teams to ensure our systems meet strict uptime and performance goals, all while aligning with overarching business objectives.
Responsibilities
Instrumentation & Monitoring
Develop and refine monitoring and observability tools (metrics, logs, traces) to validate system availability and performance.
Implement best practices for instrumentation using tools like Prometheus, Grafana, Datadog, or equivalent.
Reliability Engineering
Collaborate with development teams to design and implement solutions for higher availability in the cloud.
Lead the definition and management of Service Level Indicators (SLIs) and Service Level Objectives (SLOs), ensuring alignment with business goals.
Perform capacity planning, load testing, and performance tuning to ensure systems can handle projected traffic and workloads.
Incident Response & Prevention
Own the incident response process, including on-call rotation, alerts, and root cause analysis.
Proactively identify reliability risks and propose mitigations to reduce system downtime.
Conduct and facilitate postmortems to capture learnings, drive improvements, and prevent recurrence of issues.
Align System Health with Business Metrics
Map system availability metrics to direct business value, ensuring stakeholders understand how reliability impacts overall company objectives.
Create reporting dashboards that connect reliability data with KPIs and business goals.
Technical Leadership & Mentorship
Serve as an in-house SRE expert, advising teams on reliability-oriented designs, coding practices, and testing methodologies.
Mentor junior and mid-level engineers, fostering a culture of continuous learning, automation, and operational excellence.
Collaboration & Education
Work closely with engineering, product, and operations teams to advocate for SRE best practices.
Conduct training sessions and share knowledge to build a culture of reliability throughout the organization.
Qualifications
Experience:
5+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
Demonstrated success implementing SRE best practices in high-availability, large-scale systems.
Technical Skills:
Cloud: Experience with one or more major cloud providers (e.g., Google Cloud, AWS, Azure); familiarity with managed services and best practices for high availability.
Containers & Orchestration: Proficiency in Docker, Kubernetes, or similar containerization/orchestration platforms.
Observability Tools: Hands-on experience with logging, metrics, and tracing tools (e.g., Prometheus, Grafana, Datadog, Splunk, New Relic).
Automation & IaC: Familiarity with Infrastructure-as-Code (Terraform, Ansible, etc.) and scripting (Python, Go, Bash).
CI/CD: Comfort with modern CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, etc.).
Soft Skills:
Leadership: Proven ability to guide teams in adopting SRE principles without direct managerial authority.
Collaboration: Excellent communication skills to work across diverse technical and business teams.
Problem Solving: Strong analytical skills to navigate complex systems and identify root causes.
Adaptability: Comfortable operating in a fast-paced environment with shifting priorities.
Education:
Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
What Makes You Stand Out
Chaos Engineering: Hands-on experience running game days or chaos tests to proactively discover system weaknesses.
Multi-Region Deployments: Familiarity with designing architectures for geo-distributed systems to maximize resilience.
Performance Testing & Optimization: History of significantly reducing latency or resource usage through targeted tuning or innovative solutions.
Open Source Contributions: Demonstrated initiative in the community, such as contributing to key SRE or DevOps tools.
Tags: Agile Ansible Architecture AWS Azure CI/CD Computer Science Computer Vision DevOps Distributed Systems Docker Engineering GCP GitHub GitLab Google Cloud Grafana Jenkins KPIs Kubernetes Machine Learning Open Source Pipelines Python Robotics Splunk Terraform Testing
Perks/benefits: Career development