Machine Learning Operations Manager
Cambridge, MA
Robotics and AI Institute
At the Robotics and AI Institute (RAI Institute), our mission is to solve the most important and fundamental challenges in robotics and AI, enabling future generations of intelligent machines that will help us all live better lives.
Who we are looking for:
We are seeking a Machine Learning Operations (ML-OPs) Manager who is both technically adept and an effective leader. In this role, you will lead a small team of engineers while also being hands-on in designing, building, and maintaining infrastructure that supports the entire lifecycle of Machine Learning (ML) projects. If you have a passion for building scalable ML infrastructure, mentoring engineers, and collaborating with world-class researchers, this is the role for you!
What You Will Do
- Technical Leadership & Strategy: Drive the design, development, and maintenance of company-wide MLOps platforms and tools, leveraging Kubernetes infrastructure for ML and data processing applications.
- Team Management & Mentorship: Manage and mentor a small team of engineers, providing technical guidance, setting priorities, and fostering a collaborative team culture
- Scalability & Performance: Enable self-service access to ML-compute resources across on-prem and cloud environments, ensuring workload scalability, fault tolerance, and efficient job scheduling
- Monitoring & Observability: Enhance system observability through integrations with tools and services such as FluentD, Prometheus, Grafana, and DataDog to improve reliability and debugging
- Experiment & Model Lifecycle Management: Integrate ML applications with experiment tracking and model management services such as Weights and Biases
- Best Practices & Collaboration: Champion engineering best practices, drive improvements in CI/CD, infrastructure automation, and reproducibility. Work closely with ML Engineers, Data Engineers, DevOps teams, and researchers to accelerate research and deployment.
What You Will Bring
- BS or MS in Computer Science, Engineering, or equivalent
- 5+ years of experience in an ML-Ops, DevOps, ML Engineering, or software engineering role
- 2+ years of experience managing or mentoring engineers (can be formal management or technical leadership)
- Strong, hands-on experience with Kubernetes for ML applications
- Experience developing ML-Ops platforms (covering data/artifact management, reproducibility, fault tolerance, experiment tracking, and model serving)
- Proficiency in Python, Docker, and environment management tools (pip, poetry, uv, or similar)
- Familiarity with CI/CD tools (GitHub Actions, ArgoCD) and Infrastructure as Code (Terraform)
Skills We Value
- Experience with job scheduling mechanisms like Kueue
- Hands-on experience with workflow orchestration tools (Airflow, Metaflow, Argo Workflows)
- Experience managing cloud infrastructure (GCP, AWS) and hybrid-cloud environments
- Knowledge of scalable AI/ML platforms like Ray or PyTorch Lightning
- Experience with logging & monitoring tools (FluentD, Prometheus, Grafana, DataDog, or similar)
- Exposure to ML model serving frameworks (TorchServe, ONNX Runtime, or similar)
- Previous experience collaborating with research teams in academic or industrial settings
Perks/benefits: Career development
Region: North America
Country: United States