Site Reliability Engineer

San Francisco HQ

Alembic

Uncover marketing success with Alembic's AI-driven analytics. Predict revenue outcomes, optimize media spend, and gain actionable insights in real-time.

About Us

Alembic is where top engineers are solving marketing's hardest problem: proving what actually works. If you're looking for frontier technical challenges at an applied science company, this is the place.

At Alembic, we're not just building software; we're decoding the chaos of modern marketing. Join us to build trusted systems that Fortune 100 companies use to make multimillion-dollar decisions.

We're backed by leading tech luminaries including WndrCo (founded by DreamWorks founder Jeffrey Katzenberg), Jensen Huang, Joe Montana, and many more.

About the Role

We’re looking for an SRE to help evolve and scale the systems that power Alembic. This is a high-impact, foundational role where you’ll drive platform scalability from the ground up. It is particularly well suited to seasoned platform, cloud, or DevOps engineers who are ready to dive into AI infrastructure. You’ll apply your proven expertise in scalable systems while learning to deploy and manage cutting-edge ML workloads, making this an ideal transition role for infrastructure veterans looking to specialize in AI.

What You’ll Do

  • Design, build, integrate, and operate the foundational infrastructure that powers Alembic’s platform—including core services, data pipelines, and distributed AI/ML workloads—across both cloud (primarily AWS) and on-prem environments.

  • Leverage Infrastructure as Code (IaC) tools such as Terraform for cloud resource provisioning and Ansible for configuration management, enabling repeatable, auditable, and environment-agnostic infrastructure deployments.

  • Develop and maintain CI/CD pipelines that enable reliable, low-risk, and rapid deployments using modern tools like GitHub Actions, ArgoCD, Bazel, or equivalent, with automated testing, rollback, and deployment workflows.

  • Establish and operate robust observability systems, including metrics, logging, and distributed tracing, using tools like Prometheus, Grafana, Datadog, and OpenTelemetry to ensure proactive incident detection and diagnosis.

  • Collaborate closely with the AI Research team to deploy and manage novel ML algorithms and drive next-generation GPU-based development work.

  • Serve as a technical mentor and thought leader, promoting best practices in system design, infrastructure reliability, and code quality across the engineering organization.

What Will Help You Succeed

  • 15–20 years of engineering experience, including significant time spent on platform, infrastructure, or DevOps/SRE teams.

  • Deep experience with AWS (or GCP/Azure), container orchestration with Kubernetes, and service discovery at scale.

  • Strong grasp of DevOps principles, infrastructure as code (Terraform, Ansible), and immutable infrastructure.

  • Experience deploying and operating production systems in fast-paced environments, ideally early- or growth-stage startups.

  • Proficiency in a systems or scripting language (e.g., Python, Bash).

  • Experience with secure networking, secrets management, and managing systems in compliance-heavy environments.

  • A bias for simplicity, automation, and building tools that empower developers.

  • A hands-on, in-the-weeds approach and a collaborative mindset. You’re as comfortable fixing a broken pipeline as designing the future of our platform.

This role is right for you if:

  • You're an experienced platform/DevOps engineer ready to apply your infrastructure expertise to the cutting edge of AI. This role offers the perfect bridge between traditional platform engineering and the emerging world of ML/AI systems at scale.

  • You want to build something that is both technologically challenging and solves a real customer need. You want a role with major upside that tackles a massive market opportunity.

Why You Might Be Excited About Alembic

  • Hard problems with real impact: You'll tackle the hardest challenges in marketing analytics while building systems that influence multimillion-dollar decisions at Fortune 100 companies

  • Technical autonomy: You want ownership over technical decisions and the freedom to solve complex problems your way

  • Cutting-edge technology: Work with advanced AI/ML algorithms, composite AI solutions, private NVIDIA DGX clusters, and the latest in data processing at scale

  • Elite team: Join top engineers who thrive on challenging problems and high-impact work

  • Startup upside: Early-stage equity opportunity with experienced leadership and proven product-market fit

Why You Might Not Be Excited

  • You only want to tell people what to build instead of building and coding alongside them

  • You prefer companies with a fully built-out process for every detail

  • You prefer static over dynamic. Projects, priorities, and roles will adapt to your skill set and goals. Though we have real paying customers and a playbook for growth, we proudly remain an early-stage startup


Tags: Ansible AWS Azure Bazel CI/CD Data pipelines DevOps Engineering GCP GitHub GPU Grafana Kubernetes Machine Learning ML infrastructure Pipelines Python Research Terraform Testing

Perks/benefits: Career development Startup environment

Region: North America
Country: United States
