Senior Manager, Technical Program Management - DGX Cloud

Santa Clara, CA, United States


NVIDIA

NVIDIA is the inventor of the graphics processing unit (GPU), whose advances drive artificial intelligence and high-performance computing forward.



NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, we're at the forefront of AI innovation, powering breakthroughs in research, autonomous vehicles, robotics, and more. The DGX Cloud team builds and operates the AI infrastructure that fuels this progress. NVIDIA is seeking an experienced and driven Senior Manager of Technical Program Management to lead a high-impact team within our DGX Cloud Infrastructure organization. You will play a critical role in driving sophisticated, cross-functional programs involving the Compute Platform and cluster bring-ups (including cutting-edge systems such as GB200), and in ensuring world-class fleet availability, occupancy optimization, and infrastructure metrics tracking across the global DGX Cloud fleet.

As a DGX Cloud leader within the Technical Program Management team, you will serve as the vital bridge between NVIDIA Research and DGXC Engineering, driving the development of resilient, high-performance infrastructure for AI training and inference. You'll lead and scale a team that supports mission-critical systems empowering over 1,000 researchers. Your mission is to accelerate NVIDIA's research by delivering a world-class AI environment, from GPU clusters to the software stack, that sets industry standards in productivity, performance, and global impact.

What you will be doing:

  • Lead with impact: build and scale a high-performing team of Technical Program Managers focused on delivering a world-class AI platform that empowers over 1,000 NVIDIA researchers. Ensure the team is customer-obsessed, prioritizing developer productivity, platform usability, and end-to-end user experience

  • Apply a deep understanding of Slurm, including its architecture, configuration, workload management, job prioritization and fair-share policies, alternative schedulers, and hybrid scheduling architectures, to drive the capacity management and allocation process across internal NVIDIA research teams

  • Draw on experience with end-to-end cluster bring-ups and integration with MLOps stacks, including deep familiarity with operational models, fleet efficiency metrics, and deployment across hyperscaler environments such as OCI and GCP

  • Apply skills in capacity modeling, demand forecasting, and supply-demand balancing, using prioritization frameworks and collaborating with governance teams to define and implement prioritization strategies

  • Lead initiatives to reduce GPU idle waste and improve cluster utilization metrics. Drive developer-centric programs and own the execution of key initiatives that accelerate internal developer velocity

  • Establish and enforce best-in-class program governance, roadmap planning, and risk management processes. Encourage transparency and accountability throughout engineering and operations by defining clear key metrics and reporting frameworks

  • Develop and execute a communication strategy that keeps stakeholders at all levels, from engineering contributors to NVIDIA leadership, informed about program progress, blockers, and impact

What we need to see:

  • 15+ years of program management experience leading large-scale software, AI/ML, and infrastructure programs in fast-paced, matrixed environments, including 8+ years of managing a team

  • Hands-on experience driving programs that support AI/ML platform development, including workload orchestration, platform reliability, researcher tooling, GPU resource management, hardware readiness states, and integration with customer MLOps pipelines

  • Proven track record of delivering sophisticated AI/ML infrastructure programs at scale, ideally in cloud, hyperscaler, or enterprise datacenter settings, with a deep understanding of system architecture and cluster deployments

  • Strong grasp of capacity modeling, forecasting techniques, and demand/supply reconciliation in compute environments, including fleet-wide metrics such as availability, utilization, and occupancy, and the ability to use that data to drive operational excellence and roadmap prioritization

  • Proficiency with tools like Grafana, Prometheus, or scheduler-native tools to monitor job efficiency, wait times, and node health

  • MS in Computer Science or a related technical field, or equivalent experience

Ways to stand out from the crowd:

  • Highly motivated, with strong communication skills and a proven ability to work successfully with multi-functional teams and coordinate effectively across organizational boundaries and geographies.

  • Solid understanding of cloud technologies is a plus.

  • Experience with new product introduction and with program-managing research teams.

  • Background with productivity tools and process automation is a big plus.

We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our elite engineering teams are growing fast. If you're creative and autonomous with a real passion for technology, we want to hear from you!

#LI-Hybrid

The base salary range is 232,000 USD - 368,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
