Site Reliability Engineer (Data)
NAMER
Zapier
Build and scale AI workflows and agents across 8,000+ apps with Zapier, the most connected AI orchestration platform. Trusted by 3 million+ businesses.
About Zapier
We're humans who simply think computers should do more work.
At Zapier, we’re not just making software—we’re building a platform to help millions of businesses globally scale with automation and AI. Our mission is to make automation work for everyone by delivering products that delight our customers. You’ll collaborate with brilliant people, use the latest tools, and leverage the flexibility of remote work. Your work will directly fuel our customers’ success, and as they grow, so will you.
Job Posted: May 9, 2025
Location: Americas
Hi there! 👋 Are you passionate about building reliable systems that help data teams thrive at scale?
Zapier is looking for a Site Reliability Engineer to join our Data Platforms team. In this role, you’ll work alongside our existing SRE to level up the reliability, observability, and operational maturity of the modern data stack that powers internal and customer-facing products across Zapier. From orchestrating workflows in Databricks to tuning performance in our data infrastructure, you’ll play a key role in keeping our data ecosystem healthy, scalable, and developer-friendly.
About You
You’re experienced, but still growing. You have 4+ years of experience in Site Reliability Engineering roles. You’ve worked in production environments, solved real incidents, and shipped platform improvements—but you’re also eager to learn and grow alongside a thoughtful, distributed team.
You know the cloud and how to keep it healthy. You’re familiar with cloud-native architecture and services (we use AWS). You’ve helped teams build and maintain reliable workflows using tools like Terraform, and you understand the tradeoffs behind infrastructure decisions.
You’re observability- and incident-driven. You know how to detect issues before customers feel them. You believe in rich metrics, structured logs, and smart alerting. You’ve contributed to incident response processes and helped teams learn from failure.
You bring an automation- and AI-first mindset. You’re not afraid to write code (Python, TypeScript, or Bash are all great) and believe deeply in Infrastructure as Code. You lean into tools, automation, and AI to reduce toil, improve deployment confidence, and free up teams to focus on meaningful work. You’re open to experimenting with AI tools to decrease toil and increase your impact.
You’re a strong communicator in a remote-first world. You can clearly describe problems, propose solutions, and write clean documentation others can follow. You’re comfortable collaborating asynchronously with cross-functional teams and support partners.
Things You’ll Do
Level up reliability for our modern data stack – Help support and evolve our data platforms (including Databricks, Airflow and our LLMOps tooling) with reliability best practices and clear operational standards.
Improve observability and alerting – Partner with engineering teams to implement monitoring and alerting that supports ownership, reduces noise, and improves incident response metrics like MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve).
Automate and optimize operations – Build and maintain infrastructure-as-code, job orchestration logic, and internal tooling that reduce manual intervention and improve system resilience.
Participate in on-call and incident response – Share in our on-call rotation (~one week per quarter) and work alongside others to improve postmortems, retrospectives, and mitigation strategies.
Contribute to security and compliance readiness – Help evolve our access controls, auditability, and deployment practices in support of growing needs like security compliance for sensitive data.
Be a partner, not a gatekeeper – Work closely with Data Engineers, ML Engineers, and Backend Engineers to ensure platforms are reliable and empowering to use.
Bonus Points
(Not required, but nice to have!)
Experience with tools like Airflow, Databricks, or Kubernetes
Experience with Databricks administration, cost governance, or workspace security
Familiarity with data lake architecture (e.g., Delta Lake, Unity Catalog)
Exposure to compliance-driven environments (HIPAA, SOC 2, etc.)
Demonstrated AI fluency—whether it’s applying AI for troubleshooting, documentation, automation, or infrastructure tooling
How to Apply
At Zapier, we believe that diverse perspectives and experiences make us better, which is why we have a non-standard application process designed to promote inclusion and equity. We're looking for the best fit for each of our roles, regardless of the type of companies in your background, so we encourage you to apply even if your skills and experiences don’t exactly match the job description. All we ask is that you answer a few in-depth questions in our application that would typically be asked at the start of an interview process. This helps speed things up by letting us get to know you and your skillset a bit better right out of the gate. Please be sure to answer each question; the resume and CV fields are optional.
Education is not a requirement for our roles; however, if you receive an offer, you will need to include your most recent educational experience as part of our background check process.
After you apply, you are going to hear back from us—even if we don’t see an immediate fit with our team. In fact, throughout the process, we strive to never go more than seven days without letting you know the status of your application. We know we’ll make mistakes from time to time, so if you ever have questions about where you stand or about the process, just ask your recruiter!
Zapier is an equal-opportunity employer and we're excited to work with talented and empathetic people of all identities. Zapier does not discriminate based on someone's identity in any aspect of hiring or employment as required by law and in line with our commitment to Diversity, Inclusion, Belonging and Equity. Our code of conduct provides a beacon for the kind of company we strive to be, and we celebrate our differences because those differences are what allow us to make a product that serves a global user base. Zapier will consider all qualified applicants, including those with criminal histories, consistent with applicable laws.
Zapier prioritizes the security of our customers' information and is dedicated to adhering to all applicable data privacy laws. You can review our privacy policy here.
Zapier is committed to inclusion. As part of this commitment, Zapier welcomes applications from individuals with disabilities and will work to provide reasonable accommodations. If reasonable accommodations are needed to participate in the job application or interview process, please contact jobs@zapier.com.
Application Deadline:
The anticipated application window is 30 days from the date the job is posted, unless the number of applicants requires it to close sooner or later, or if the position is filled.
Even though we’re an all-remote company, we still need to be thoughtful about where we have Zapiens working. Check out this resource for a list of countries where we currently cannot have Zapiens permanently working.
Tags: Airflow Architecture AWS Databricks Engineering Kubernetes LLMOps Machine Learning Privacy Python Security Terraform TypeScript
Perks/benefits: Career development Equity / stock options Salary bonus