Senior Distributed Systems Engineer
Palo Alto, California
Full Time Senior-level / Expert USD 180K - 250K
We are looking for people with strong ML and distributed systems backgrounds. This role sits within our Research team and collaborates closely with researchers to build the platforms for training our next generation of foundation models.
Responsibilities
- Work with researchers to scale up the systems required for our next generation of models trained on multi-thousand GPU clusters.
- Profile and optimize our model training codebase to achieve best-in-class hardware efficiency.
- Build systems to distribute work across massive GPU clusters efficiently.
- Design and implement methods to robustly train models in the presence of hardware failures.
- Build tooling to help us better understand problems in our largest training jobs.
Experience
- 5+ years of work experience.
- Experience working with multi-modal ML pipelines, high-performance computing, and/or low-level systems.
- Passion for diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability.
- Experience building stable and highly efficient distributed systems.
- Strong generalist Python and software skills, including significant experience with PyTorch.
- Nice to have: experience working with high-performance C++ or CUDA.
- Please note this role is not meant for recent grads.
Compensation
- The pay range for this position in California is $180,000 - $250,000/yr; however, base pay offered may vary depending on job-related knowledge, skills, candidate location, and experience. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.
Category:
Engineering Jobs
Tags: CUDA Distributed Systems GPU HPC Machine Learning Model training Pipelines Python PyTorch Research
Perks/benefits: Competitive pay Equity / stock options
Region:
North America
Country:
United States