Senior MLOps Engineer

Location: Bengaluru, Karnataka, India


While technology is the heart of our business, a global and diverse culture is the heart of our success. We love our people and take pride in fostering a culture built on transparency, diversity, integrity, learning and growth.


If you are interested in working in an environment that encourages you to innovate and excel, not just professionally but personally, you will enjoy your career with Quantiphi!


Required Experience: 2 to 6 Years
 

Roles and Responsibilities:

  • Design, deploy, and maintain distributed systems using Kubernetes and Slurm for optimal resource utilization and workload management.

  • Lead the configuration and optimization of multi-GPU, multi-node deep learning job scheduling, ensuring efficient computation and data processing (see the Slurm/PyTorch sketch after this list).

  • Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions.

  • Work hands-on with on-premises NVIDIA GPU servers.

  • Develop and maintain complex shell scripts for various system automation tasks, enhancing efficiency and reducing manual intervention.

  • Monitor system performance, identify bottlenecks, and implement necessary adjustments to ensure high availability and reliability.

  • Troubleshoot and resolve technical issues related to the distributed system, job scheduling, and deep learning processes.

  • Stay updated with industry trends and emerging technologies in distributed systems, deep learning, and automation.
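
For context on the multi-GPU, multi-node job scheduling responsibilities above, the following is a minimal sketch, assuming a PyTorch training entrypoint launched via `srun`, of how such a script might read the environment variables Slurm exports for each task and use them to initialize torch.distributed. The script, its variable handling, and the launch flags are illustrative assumptions, not Quantiphi-specific code.

```python
# Minimal sketch (illustrative assumption, not Quantiphi code): initializing a
# multi-node, multi-GPU PyTorch job from the environment variables Slurm
# exports when the script is launched with `srun`.
import os

import torch
import torch.distributed as dist


def init_distributed():
    # Slurm starts one process per task; rank and world size come from its env vars.
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])   # total tasks across all nodes
    local_rank = int(os.environ["SLURM_LOCALID"])  # task index on this node -> GPU index

    # NCCL is the usual backend for GPU-to-GPU communication.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # expects MASTER_ADDR / MASTER_PORT to be exported
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank


if __name__ == "__main__":
    rank, world_size, local_rank = init_distributed()
    print(f"rank {rank}/{world_size} using cuda:{local_rank}")
    dist.destroy_process_group()
```

Such a script would typically be launched from an sbatch job via `srun python train.py`, with node, task, and GPU counts set through the usual `--nodes`, `--ntasks-per-node`, and `--gres=gpu:<N>` directives.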

Skill Set Needed:

  • Strong communication and collaboration skills to work effectively within a cross-functional team.

  • Strong proficiency in Python.

  • Hands-on experience with MLOps tooling such as MLflow, Kubeflow, and AutoML (see the MLflow sketch after this list).

  • Working knowledge of at least one ML framework (PyTorch or TensorFlow) is good to have.

  • Experience in shell scripting and Linux.

  • Good understanding of logical networks. 

  • Understanding of NLP (preferred) or Computer Vision.

  • Familiarity with the cloud-native stack.

  • Proven experience in designing, deploying, and managing distributed systems, with a focus on Kubernetes and Slurm.

  • Solid understanding of AI model training and deployment, with a strong background in multi-GPU, multi-node deep learning job scheduling and resource management.

  • Proficiency in Linux systems, particularly Ubuntu, and the ability to navigate and troubleshoot related issues.

  • Extensive experience creating complex shell scripts for automation and system orchestration.

  • Familiarity with continuous integration and deployment (CI/CD) processes.

  • Excellent problem-solving skills and the ability to diagnose and resolve technical issues promptly.
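
As a concrete reference for the MLOps tooling listed above, here is a minimal MLflow tracking sketch; the local tracking URI, the experiment and run names, and the logged parameters and metrics are illustrative assumptions rather than part of this role's actual stack.

```python
# Minimal MLflow tracking sketch (illustrative assumptions: file-based tracking
# URI, hypothetical experiment/run names, placeholder parameter and metric values).
import mlflow

mlflow.set_tracking_uri("file:./mlruns")    # or the URL of an MLflow tracking server
mlflow.set_experiment("gpu-cluster-demo")   # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("nodes", 2)
    mlflow.log_param("gpus_per_node", 4)
    mlflow.log_metric("val_accuracy", 0.91)  # placeholder value
```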

Good to Have:

  • Prior experience with, or good awareness of, the NVIDIA ecosystem: Triton Inference Server, CUDA, etc. (see the Triton client sketch after this list).

  • Good at Slurm, Kubernetes, Linux, and AI Deployment tools.
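
To make the Triton Inference Server point above concrete, here is a small client-side health-check sketch using the `tritonclient` Python package; the server address and the model name "resnet" are illustrative assumptions, and a running Triton server is presumed.

```python
# Minimal Triton Inference Server health-check sketch (illustrative assumptions:
# a server already running on localhost:8000 and serving a model named "resnet").
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("resnet"))
```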
     

If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!

