MLOps Platform Engineer
Canada (Remote)
Zesty.ai
ZestyAI uses artificial intelligence to account for all factors that may impact a property's value and its risk exposure to natural disasters.
Job Summary
ZestyAI is on a mission to revolutionize how the world understands and manages climate and property risk. By combining cutting-edge artificial intelligence with property-specific data, we empower our customers to make smarter, data-driven risk decisions that protect families, communities, and their financial well-being. Our innovative predictive models provide highly accurate, property-level insights, helping clients anticipate and mitigate risks related to climate events like wildfires and severe storms. Join us, and be part of a team that’s redefining the future of property risk assessment.
We’re seeking a skilled and adaptable MLOps Engineer to join our Platform team, building the scalable, AI-focused infrastructure that powers Zesty’s cloud-based machine learning solutions. This role involves close collaboration with our Machine Learning teams to develop and deploy training pipelines, automate workflows, and ensure high-performance, reliable model deployment. As a key contributor, you’ll foster a culture of automation, efficiency, and innovation within the Data Science and Machine Learning teams, driving advancements in backend systems, cloud infrastructure, and ML frameworks like PyTorch. If you’re passionate about creating impactful AI-driven solutions and thrive in a dynamic, collaborative environment, this role is a great fit.
Responsibilities
- Workflow Optimization: Collaborate with ML engineers and data scientists to deploy models, optimize inference latency, and efficiently manage cloud resources.
- Platform Engineering: Design, build, maintain, and support the core infrastructure underlying our application platform, enabling a seamless ML workflow from training to production.
- CI/CD and Deployment Automation: Implement and maintain CI/CD pipelines and automated deployment processes, promoting consistency and scalability across environments.
- Real-Time Monitoring: Develop monitoring and logging solutions for tracking model performance, system health, and data quality. Proactively detect issues such as model drift or degradation.
- System Optimization and Cost Efficiency: Fine-tune infrastructure settings to support large-scale ML workloads, prioritizing resource and cost efficiency through GCP’s autoscaling and optimization capabilities.
- Documentation and Best Practices: Maintain detailed documentation of processes and best practices for ML operations, enabling efficient collaboration and knowledge sharing.
- Cross-Functional Collaboration: Work closely with software engineering, platform engineering, and product teams to design, enhance, and streamline tools, infrastructure, and ML workflows.
Required Skills
- Technical Stack Proficiency: Strong experience in Python, Docker, and building CI/CD pipelines (e.g., GitHub Actions) to automate and streamline development workflows.
- Machine Learning Tools: Proficiency with PyTorch (including torch.compile), familiarity with MLOps tools, and experience with ML training and deployment workflows.
- ML Lifecycle Management Knowledge: Understanding of best practices for model retraining, governance, and production ML lifecycle management.
- Adaptability to New Architectures: Able to adapt to and support new and evolving ML model architectures, ensuring seamless integration into existing workflows.
- Cloud Infrastructure Familiarity: Experienced in using GCP to support ML operations, including application logging, debugging, setting up and maintaining VMs, and optimizing resources and costs within the cloud environment.
- Database Skills: Familiar with building data pipelines that interact with BigQuery or PostgreSQL, capable of writing SQL queries, and experienced in handling data workflows within these databases.
- Container Orchestration Knowledge: Strong understanding of container orchestration using Kubernetes to manage and scale ML workflows effectively.
- Observability for ML Workflows: Skilled in setting up monitoring, logging, and alerting specific to ML models in production, including metrics for drift detection and accuracy monitoring.
- Experiment Tracking and Data Versioning: Experience with tools like MLflow or DVC for tracking experiments and managing data and model versions over time.
- Problem-Solving Skills: Ability to troubleshoot complex technical issues and implement robust solutions with ML teams.
- Collaboration and Communication: Strong communication skills to work effectively with cross-functional teams, especially within the Data Science and Machine Learning (DSML) teams.
Nice to Have
- Technical Skills: Familiarity with Golang, and experience with Triton Inference Server or other model serving frameworks (e.g., TensorFlow Serving, ONNX Runtime).
- Computer Vision and Risk Modeling: Experience with computer vision techniques and risk modeling, particularly in the context of property or climate risk assessment.
- Relevant Experience: Prior experience in a similar MLOps-focused platform role or in supporting ML model deployment and lifecycle management.
Why Join Zesty?
At Zesty, we’re committed to fostering a supportive and innovative environment where team members can grow and succeed. We offer competitive compensation, a flexible, fully remote work schedule, and a collaborative culture that values individual perspectives and talents. Join us in advancing AI-driven solutions that make a real impact.