Software Engineering Manager, Distributed Task-based Runtimes

US, CA, Santa Clara

Full Time | Mid-level / Intermediate | USD 220K - 339K

NVIDIA

NVIDIA invents the graphics processor and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.

Posted 2 hours ago

We are looking for an experienced software engineering manager to lead the development of NVIDIA’s runtime stack for large-scale distributed computing, which aims to democratize scalable accelerated computing. Around the world, leading commercial and academic organizations are revolutionizing AI, scientific computing, and data analytics using data centers powered by GPUs. Applications of these technologies include LLMs, computer vision, autonomous vehicles, and countless others. Our team develops foundational distributed computing software that greatly simplifies the development of such applications!

In this role, you will lead an engineering team designing, developing, and optimizing the distributed task-based runtime software stack that includes Legate, Legion and Realm. Ideal candidates should have experience leading software product engineering teams, and be motivated to advance the state-of-the-art in a variety of accelerated computing domains. If this sounds exciting, we would love to meet you!

What you'll be doing:

Lead, mentor, and grow your distributed runtime engineering team, and be responsible for the planning and execution of projects as well as the quality and performance of the runtime stack.
Work closely with NVIDIA Research, Engineering, Developer Technology, and Product Management teams in the areas of scientific computing, data analytics, programming systems, and AI to help collect requirements for your products as well as contribute to the development of technology roadmaps.
Interact with external partners and researchers to understand their use cases and requirements.

What we need to see:

BS, MS, or PhD degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
8+ years of overall experience in developing distributed runtimes or at-scale high-performance software.
3+ years of experience recruiting, training and leading software engineering teams.
Background in high-performance computing and performance-critical applications.
Experience implementing, tuning, and debugging runtimes and/or distributed systems for supercomputers or the cloud.
Hands-on experience with design, development, testing, maintenance, and performance optimization of GPU-accelerated software using C, C++ or Python.
Strong collaboration, communication, and documentation habits.
Experience with agile software development practices using project management tools such as JIRA.

Ways to stand out from the crowd:

Experience with development of distributed runtimes such as Legion, Ray, or Dask.
Experience with parallel programming, ideally using CUDA, MPI, or OpenMP.
Good knowledge of CPU and/or GPU hardware architecture.
Development of domain-specific libraries/languages for high-performance computing.
Good understanding of Machine Learning and Deep Learning technologies.

The base salary range is 220,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Categories: Engineering Jobs Leadership Jobs

Tags: Agile Architecture Computer Science Computer Vision CUDA Data Analytics Deep Learning Distributed Systems Engineering GPU HPC Jira LLMs Machine Learning OpenMP PhD Python Research Testing