Distributed Systems Engineer

Palo Alto, CA

Luma AI

Create, animate & innovate with Luma’s AI. Use text, images, or video to generate realistic motion content with Ray2 and Dream Machine for next-gen storytelling.

We are looking for people with strong ML and distributed systems backgrounds. This role sits within our Research team and collaborates closely with researchers to build the platforms for training our next generation of foundation models.

Responsibilities

  • Work with researchers to scale up the systems required for our next generation of models, trained on multi-thousand-GPU clusters.

  • Profile and optimize our model training codebase to achieve best-in-class hardware efficiency.

  • Build systems to distribute work across massive GPU clusters efficiently.

  • Design and implement methods to robustly train models in the presence of hardware failures.

  • Build tooling to help us better understand problems in our largest training jobs.

Experience

  • 5+ years of work experience.

  • Experience working with multimodal ML pipelines, high-performance computing, and/or low-level systems.

  • Passion for diving deep into system implementations and understanding their fundamentals in order to improve their performance and maintainability.

  • Experience building stable and highly efficient distributed systems.

  • Strong generalist Python and software engineering skills, including significant experience with PyTorch.

  • Nice to have: experience working with high-performance C++ or CUDA.

Your application is reviewed by real people.

Category: Engineering Jobs

Tags: CUDA, Distributed Systems, GPU, HPC, Machine Learning, Model training, Pipelines, Python, PyTorch, Research

Regions: Remote/Anywhere, North America
Country: United States
