Software Engineering Manager, LLM Training
Tasks
- Architect post-training infrastructure
- Collaborate with responsible AI teams on compliance and safety
- Define engineering performance goals and metrics
- Develop agentic research and performance optimization agents
- Develop post-training platform components
- Drive operational excellence culture
- Enable distributed training parallelism
- Implement observability and profiling for training runs
- Lead and coach engineering team
- Lead containerized training image lifecycle management
- Optimize LLM training performance
Perks/Benefits
- N/A
Skills/Tech-stack
CUDA | Containerization | Context Parallelism | Data parallelism | Distributed Systems | Expert parallelism | Fine Tuning | FlashAttention | High-performance data I/O | Hugging Face | Hugging Face Accelerate | Hugging Face Transformers | Liger Kernels | Low-precision training | Megatron | Model Pruning | Model Quantization | Multi-Teacher Distillation | NCCL | Observability | Pipeline parallelism | Profiling | PyTorch | Ray | Reinforcement Learning | GRPO | Reinforcement Learning from Human Feedback | SGLang | Speculative decoding | Supervised Fine Tuning | Telemetry | Tensor Parallelism | vLLM | VeRL