Principal High-Performance LLM Training Engineer
Santa Clara, CA, United States
USD 272K-431K Senior-level Full Time
Tasks
- Build performance models, workload characterizations, and simulation methodologies
- Develop production-quality software tools and benchmarks
- Drive workloads to speed-of-light performance by removing bottlenecks
- Lead end-to-end performance analysis and optimization of LLM training workloads
- Mentor engineers and establish best practices for performance analysis
- Serve as technical authority for AI training performance
- Translate workload insights into hardware and software recommendations
Skills/Tech-stack
Activation checkpointing | Benchmarking | CUDA | Communication and Computation Overlap | Compilers | Data parallelism | Distributed Training | GPU Architecture | High-Performance Computing | JAX | Mixed-precision training | NeMo | Performance Modeling | Pipeline parallelism | Profiling | PyTorch | Runtimes | Tensor Parallelism | Transformer Models