Senior Machine Learning Operations Engineer

Boulder, Colorado, United States

SciTec

The world brings problems; SciTec builds solutions. Our team is committed to delivering cutting-edge advancements for defense, security, and civil affairs.

Job Summary

We are seeking an experienced Machine Learning Operations (MLOps) Engineer to join and help shape our new MLOps team. This role focuses on deploying and optimizing machine learning models for always-on, high-availability systems in real-world, real-time unclassified and classified environments. As part of a new and growing team, you will have the unique opportunity to evangelize MLOps practices, contribute to the development of an on-premises development platform, and drive innovation in mission-critical applications.

Responsibilities

 

  • Deploy and maintain high-performing ML models (e.g., ensembles of LSTMs and Random Forests) in real-time environments
  • Monitor deployed models for drift or performance degradation and implement automated retraining pipelines
  • Implement advanced deployment strategies (e.g., Blue-Green, Canary, Champion-Challenger)
  • Develop modular and flexible ML pipelines that ensure uptime and reliability
  • Build and manage scalable infrastructure using Kubernetes, Docker, Terraform, and related tools
  • Design and implement an on-premises development platform using Kubeflow to replicate cloud capabilities in classified environments
  • Set up robust monitoring, logging, and alerting systems using Prometheus, Grafana, and Loki
  • Optimize performance metrics like inference latency and system throughput while ensuring fault tolerance
  • Work with cross-functional teams, including Data Engineering, Machine Learning, and DevOps, to integrate and enhance ML systems
  • Define touchpoints and handoffs with DevOps and Data Engineering to ensure seamless integration of ML workflows with existing infrastructure and data pipelines
  • Mentor junior team members and contribute to building a collaborative and innovative team culture
  • Other duties as assigned

Requirements

  • 8+ years of experience, including leading large-scale ML model deployments and scaling production environments
  • Expertise in architecting Python applications for large-scale systems, mentoring junior engineers in Python best practices, and optimizing code for high performance
  • Proven leadership in designing enterprise-grade CI/CD systems, incorporating advanced features like parallel testing, rollback strategies, and security hardening
  • Advanced expertise in designing and optimizing distributed pipelines with Protobufs and ZeroMQ, ensuring fault tolerance and scalability
  • Advanced expertise in designing workflows using MLflow or Kubeflow to streamline experimentation and production deployments
  • Expertise in architecting complex Kubernetes and Terraform configurations for distributed systems, incorporating advanced features like auto-scaling and load balancing

Preferred Qualifications

  • Familiarity with C++ and/or Rust
  • Experience with workflow orchestration tools such as Airflow or Prefect
  • Experience with distributed data processing frameworks such as PySpark
  • Familiarity with SQL and modern database and storage technologies (e.g., MinIO, Yugabyte)
  • Experience with DVC, Ansible, Kustomize, Helm, Prometheus, and Grafana
  • Understanding of secure software development practices and/or experience working in classified environments

Education

  • Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or a related technical field
  • Relevant certifications (e.g., Certified Kubernetes Administrator, Certified Kubernetes Application Developer, Terraform Associate) are a plus

Soft Skills

  • Strong problem-solving and analytical skills
  • Excellent verbal and written communication and collaboration skills
  • Ability to thrive in a dynamic, fast-paced environment
  • Strong attention to detail

Benefits

SciTec offers a highly competitive salary and benefits package, including:

  • Employee Stock Ownership Plan (ESOP)
  • 3% Fully Vested Company 401(k) Contribution (no employee contribution required)
  • 100% company-paid HSA Medical insurance, with a choice of 2 buy-up options
  • 80% company-paid Dental insurance
  • 100% company-paid Vision insurance
  • 100% company-paid Life insurance
  • 100% company-paid Long-term Disability insurance
  • Short-term Disability insurance
  • Annual Profit-Sharing Plan
  • Discretionary Performance Bonus
  • Paid Parental Leave
  • Generous Paid Time Off, including Holiday, Vacation, and Sick Pay
  • Flexible Work Hours

The pay range for this position is $141,000 to $168,000 per year. In extending an offer of employment, SciTec considers several factors, including but not limited to the role and its responsibilities and the candidate's work experience, education/training, and key skills. This is not a guarantee of compensation.

SciTec is committed to hiring and retaining a diverse workforce and is proud to be an Equal Opportunity/Affirmative Action employer. M/F/VETS/Disabled
