Deep Learning Solutions Architect – Distributed Training

UK, Remote, United Kingdom

NVIDIA

NVIDIA is the inventor of the GPU, and the advances it has developed drive progress in artificial intelligence and high-performance computing.

NVIDIA’s Worldwide Field Operations (WWFO) team is seeking a Solutions Architect with a strong focus on Deep Learning and a deep understanding of neural network training. The introduction of NVIDIA GB200 NVL72 systems, which bring chip-to-chip NVLink and a significant expansion of the NVLink domain, has enabled a wide range of new neural network architectures and approaches to training. The ideal candidate will be proficient with tools such as NeMo, Megatron-LM, DeepSpeed, PyTorch FSDP or similar, and have strong systems knowledge, enabling customers to make full use of the capabilities of the new Grace Blackwell training systems. This could include helping customers take advantage of much wider neural networks, asynchronous checkpointing, or activation offloading. Hands-on experience in LLM post-training, in particular RL, would help the candidate stand out.

Solutions Architects work with the most exciting computing hardware and software, driving the latest breakthroughs in artificial intelligence! We need individuals who can enable customer productivity and develop lasting relationships with our technology partners, making NVIDIA an integral part of end-user solutions. We are looking for someone passionate about artificial intelligence, someone who can keep up with a fast-paced field, and someone able to coordinate efforts between corporate marketing, industry business development, and engineering. Solutions Architects are the first line of technical expertise between NVIDIA and our customers. Your duties will vary from working on proof-of-concept demonstrations to driving relationships with key executives and managers in order to promote adoption of NVIDIA-based AI technology. Engaging with developers, scientific researchers, data scientists, IT managers, and senior leaders is a significant part of the Solutions Architect role.

What you will be doing:

  • Work directly with key customers to understand their technology and provide the best AI solutions and guidance on the training process in terms of tools and methodology.

  • Perform in-depth analysis and optimization to ensure the best performance on GPU architecture systems (in particular Grace/Arm-based systems). This includes supporting the optimization of distributed training pipelines.

  • Partner with Engineering, Product, and Sales teams to develop and plan the most suitable solutions for customers. Enable development and growth of product features through customer feedback and proof-of-concept evaluations.

What we need to see:

  • Excellent verbal and written communication skills and technical presentation skills in English.

  • MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields.

  • 5+ years of work or research experience with Python, C++, or other software development.

  • Work experience with and knowledge of modern NLP, including a good understanding of transformer, state-space, diffusion, and MoE model architectures. This can include expertise in either training or optimization/compression/operation of DNNs.

  • Understanding of key libraries used for NLP/LLM training (such as Megatron-LM, NeMo, DeepSpeed, etc.) and/or deployment (e.g. TensorRT-LLM, vLLM, Triton Inference Server).

  • Track record in neural network performance optimization and/or training robustness.

  • A person excited to work with multiple levels and teams across organizations (Engineering, Product, Sales, and Marketing teams). Capable of working in a constantly evolving environment without losing focus.

  • Self-starter with a growth mindset, a passion for continuous learning, and a habit of sharing findings across the team.

Ways to stand out from the crowd:

  • Ability to conduct LLM post-training, in particular knowledge of large-scale RL.

  • Track record in running large-scale training/HPC jobs with a focus on training robustness and failure resilience.

  • Understanding of HPC systems: data center design, high-speed interconnects (InfiniBand), and cluster storage and scheduling-related design and/or management experience.

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family at www.nvidiabenefits.com/

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Tags: Architecture Computer Science Deep Learning Engineering FSDP GPU HPC InfiniBand LLMs Mathematics NLP NVLink PhD Physics Pipelines Python PyTorch Research TensorRT vLLM

Perks/benefits: Career development

Regions: Remote/Anywhere Europe
Country: United Kingdom
