MLOps Lead, Central Technology

Redwood City, CA (Hybrid)


Chan Zuckerberg Initiative

The Chan Zuckerberg Initiative (CZI) is a new kind of philanthropy that’s on a mission to help build a more inclusive, just and healthy future for everyone.


The Chan Zuckerberg Initiative was founded by Priscilla Chan and Mark Zuckerberg in 2015 to help solve some of society’s toughest challenges — from eradicating disease and improving education to addressing the needs of our local communities. Our mission is to build a more inclusive, just, and healthy future for everyone.

The Team

Founded by Priscilla Chan and Mark Zuckerberg in 2015, the Chan Zuckerberg Initiative (CZI) is a new kind of philanthropy that’s leveraging technology to help solve some of the world’s toughest challenges – from eradicating disease, to improving education, to reforming the criminal justice system. Our mission is to create a future for everyone. Across our core Initiative focus areas of Science and Education, we’re pairing engineering with grantmaking, impact investing, policy work, and movement building to help build an inclusive, just, and healthy future for everyone.

Our Values 

  • We believe we can help build a future for everyone. 
  • We aim to be daring, but humble: We look for bold ideas, regardless of structure and stage, and help them scale by pairing engineers with subject matter experts to build tools that accelerate the pace of social progress.
  • We want to learn fast, but build for the long-term: We want to iterate fast and help bring new solutions to the table, but we also realize that important breakthroughs often take decades, or even centuries. 
  • We stay close to the real problems: We engage directly in the communities we serve because no one understands our society’s challenges like those who live them every day.

Our success is dependent on building teams that include people from different backgrounds and experiences who can challenge each other's assumptions with fresh perspectives. To that end, we look for a diverse pool of applicants including those from historically marginalized groups — women, people with disabilities, people of color, formerly incarcerated people, people who are lesbian, gay, bisexual, transgender, and/or gender nonconforming, first and second generation immigrants, veterans, and people from different socioeconomic backgrounds.

The Opportunity

Our Central Tech team provides technology and security support for CZI, the Biohub Network, and our grantees. We believe that Engineering and Security are most effective when in sync and learning from each other on a daily basis. Our AI Infrastructure Engineering team enables our AI Research teams to achieve their goals faster and more securely. We leverage technology to automate manual processes, constantly innovate to optimize operations, provide first-class support, and build solutions to enable the scale and execution of our business partners' strategies and initiatives.

The AI/ML and Data Engineering Infrastructure organization builds shared tools and platforms for use across all of the Chan Zuckerberg Initiative, partnering with and supporting a wide range of Research Scientists, Data Scientists, and AI Research Scientists, as well as a broad range of Engineers focusing on Education and Science domain problems. Members of the shared infrastructure engineering team have an impact on all of CZI's initiatives by enabling the technology solutions used by other engineering teams at CZI to scale. A person in this role will build these technology solutions and help cultivate a culture of shared best practices and knowledge around core engineering.

What You'll Do

  • Provide technical MLOps leadership: manage and lead a team of MLOps Engineers in operating our heterogeneous AI training and inference systems, and collaborate on the design and build of our AI platform components.
  • Drive the application of MLOps and DevOps principles: apply them across our multiple platforms to ensure peak operational efficiency and the process automation necessary for a world-class, large-scale AI model training environment.
  • Provide technical leadership in instrumentation and observability: define the MLOps team's end-to-end metrics program, including full proactive monitoring and alerting systems.
  • Facilitate model training through collaboration with our AI Researchers: work with the rest of the AI Infrastructure Eng team to make sure that the models we train and release to inference follow machine learning and deep learning best practices and, through code automation libraries, are fully resilient to restarts and checkpoint recoveries.
  • Continuously optimize our Kubernetes-based AI Lifecycle platform: apply our IaC-based practices, integrate our MLOps AI Lifecycle platform tooling, and combine the platform with our on-prem HPC systems into a cohesive heterogeneous platform.
  • Collaborate on data systems for our AI model training: work with our Data Infrastructure Eng team and the Science data teams on the end-to-end data usage that drives our AI model training.
  • Lead our MLOps team in supporting our on-call rotation: focus on automation and proactive alerting to reduce on-call load and improve self-healing AI system operations. On-call volume will be low, but we do have 24/7 coverage, and the rotation will include members of the rest of the AI team for escalation and on-call coverage.

What You'll Bring

  • BS, MS, or PhD degree in Computer Science or a related technical discipline or equivalent experience
  • 7+ years of relevant coding and systems experience
  • 5+ years of systems Architecture and Design experience, with a broad range of MLOps experience across Data Infrastructure and AI/ML platforms
  • Proven technical leadership in SRE and MLOps, as well as direct or indirect people management experience
  • Proven SRE and MLOps knowledge and related experience
  • Strong experience scaling containerized applications on Kubernetes or Mesos (Kubernetes preferred), including expertise creating custom containers using secure AMIs and continuous deployment systems that integrate with Kubernetes or Mesos
  • Cloud Platform proficiency with Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, and experience with On-Prem and Colocation Service hosting environments
  • MLOps experience working with medium- to large-scale GPU clusters in Kubernetes (Kubeflow), HPC environments, or large-scale cloud-based ML deployments
  • Working knowledge of Nvidia CUDA and AI/ML custom libraries.
  • Knowledge of Linux systems optimization and administration  
  • Solid coding experience:
      • Proven coding ability with a systems language such as Rust, C/C++, C#, Go, Java, or Scala
      • Expertise with a scripting language such as Python (preferred), PHP, or Ruby
  • Experience in integrating data with the AI Lifecycle:
      • AI/ML Platform Operations experience in an environment with demanding data and systems platform challenges
      • Large-scale streaming data systems integration experience
      • Hadoop, Spark, and/or Kafka deployments, or alternatives such as Pulsar, Flink, and/or Ray
      • Workflow scheduling tools such as Apache Airflow, Dagster, or Apache Beam
  • Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms.
  • PyTorch, Keras, or TensorFlow experience is a strong nice-to-have
  • HPC and Slurm experience is a strong nice-to-have

Compensation

The Redwood City, CA base pay range for this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job-related skills and experience, as evaluated throughout the interview process. 

Work Mode

As we grow, we’re excited to strengthen in-person connections and cultivate a collaborative, team-oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team’s manager. The exact schedule will be at the hiring manager's discretion and communicated during the interview process.

Benefits for the Whole You 

We’re thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible. 

  • CZI provides a generous employer match on employee 401(k) contributions to support planning for the future.
  • Annual benefit for employees that can be used most meaningfully for them and their families, such as housing, student loan repayment, childcare, commuter costs, or other life needs.
  • CZI Life of Service Gifts are awarded to employees to “live the mission” and support the causes closest to them.
  • Paid time off to volunteer at an organization of your choice. 
  • Funding for select family-forming benefits. 
  • Relocation support for employees who need assistance moving to the Bay Area
  • And more!

If you’re interested in a role but your previous experience doesn’t perfectly align with each qualification in the job description, we still encourage you to apply as you may be the perfect fit for this or another role.

Explore our work modes, benefits, and interview process at www.chanzuckerberg.com/careers.
