Principal High-Performance LLM Training Engineer
Santa Clara, CA, United States
USD 272K-431K | Senior-level | Full Time
Tasks
- Build performance models, workload characterizations, and simulation methodologies
- Develop production-quality software tools and benchmarks
- Drive workloads to speed-of-light performance by removing bottlenecks
- Lead end-to-end performance analysis and optimization of LLM training workloads
- Mentor engineers and establish best practices for performance analysis
- Serve as technical authority for AI training performance
- Translate workload insights into hardware and software recommendations
Skills/Tech-stack
Activation checkpointing | Benchmarking | CUDA | Communication and Computation Overlap | Compilers | Data parallelism | Distributed Training | GPU Architecture | High-Performance Computing | JAX | Mixed-precision training | NeMo | Performance Modeling | Pipeline parallelism | Profiling | PyTorch | Runtimes | Tensor Parallelism | Transformer Models