Lead Cluster Operations Support Engineer
Chicago, Illinois
Thoughtworks
We are laser-focused on transforming your digital journey so you can make extraordinary impact today, tomorrow and beyond.
We are seeking a highly skilled Lead Cluster Operations Support Engineer with extensive experience in cloud infrastructure, Kubernetes, and GPU clusters. The ideal candidate will possess a strong background in operations, cloud architecture, and managing large-scale environments, particularly in the context of machine learning model training and high-performance computing.
This role will play a crucial part in providing 24x7 white-glove support to clients using large blocks of GPUs (6,000+ contiguous GPUs) for short periods of time (6-12 weeks) to perform Managed Post Training (MPT). The Lead Cluster Operations Support Engineer will be responsible for ensuring full utilization of GPU clusters, optimizing infrastructure for ML workloads, and coordinating with teams across four time zones (US, Europe, India, and Australia) to deliver continuous support.
In this role, the Lead Cluster Operations Support Engineer will collaborate effectively with diverse teams, leveraging technical expertise and a DevOps mindset to optimize infrastructure and support operational needs.
Job responsibilities
- You will help shape and iterate on this new white-glove model training support service on large GPU clusters.
- You will work in a collaborative team with Machine Learning Engineers and Infrastructure Engineers.
- You will contribute to accelerator development: finding gaps in the tooling, missing automation, or recurring patterns for which we could develop accelerators that make the next round of this work faster and more efficient. For example: improving observability, automating user onboarding, or bringing in new tools that everyone seems to want to use. This will likely involve a combination of Terraform/Pulumi, Helm Charts, Python, and Shell Scripts.
- You will help assess model training readiness and data preparation.
- You will provide model training support during rotating daytime weekend shifts, carrying a pager, and respond to any issues clients may encounter. These can range from infrastructure issues, such as a configuration change in GKE on GCP affecting training, to data science challenges.
- You will facilitate collaborative problem-solving within the team by actively listening, communicating effectively, and mentoring other engineers.
- You will proactively identify and address challenges related to the white-glove service for continued pre-training, proposing solutions and implementing improvements.
Job qualifications
Technical Skills
- Deep expertise in Kubernetes administration and debugging at scale.
- Extensive experience managing large clusters with thousands of nodes using Kubernetes.
- Knowledge of running training workloads on thousands of GPUs.
- Familiarity with the Lustre filesystem is a plus.
- Experience working with the NVIDIA NeMo Framework (Docker image for model training).
- Experience with NVIDIA NeMo NIMs (Docker images for inference).
- Proficiency with cloud platforms such as GCP, AWS, and Azure.
- Experience with Linux and with Infrastructure-as-Code tools such as Terraform/Pulumi and Helm Charts.
- Familiarity with tools like Run:ai, TrueFoundry, and the Hugging Face platform (training will be provided).
- Experience with HPC technologies such as Slurm is a bonus.
Professional Skills
- You will be part of a high-value, client-facing white-glove service, where a high level of professionalism is required.
- You understand the importance of stakeholder management and can easily liaise between clients and key stakeholders throughout projects, ensuring buy-in and building trust.
- You are resilient in ambiguous situations and can adapt your role to approach challenges from multiple perspectives.
- You do not shy away from risks or conflicts; instead, you take them on and manage them skillfully.
- You are eager to coach, mentor, and motivate others and aspire to influence teammates to take positive action and accountability for their work.
- You enjoy influencing others and always advocate for technical excellence while being open to change when needed.
- You have an insatiable curiosity and a drive to learn new things.
Other things to know
Learning & Development
There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.
Travel
While we’ve traditionally been a traveling consultancy, we have adopted a hybrid working model with the majority of work being completed remotely from either home or local Thoughtworks offices. However, business travel to client locations should be expected when required by Thoughtworks or our clients’ needs.
Company Policies
We are an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity or expression, pregnancy, age, national origin, disability status, genetic information, protected veteran status, or any other characteristic protected by law.
About Thoughtworks
Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let’s be extraordinary.
Salary
Benefits: https://www.thoughtworks.com/en-us/careers/benefits
The annual salary range posted is subject to many factors and may vary depending on experience, geographic location, job responsibilities, performance, skills and/or training.
Salary: $125,330 to $208,880 USD
Tags: Architecture, AWS, Azure, DevOps, Docker, GCP, GloVe, GPU, Helm, HPC, HuggingFace, Kubernetes, Linux, Machine Learning, Model training, Python, Terraform
Perks/benefits: Career development, Salary bonus