Senior Engineering Manager, AI Runtime
USD 228K-297K Senior-level Full Time
Tasks
- Architect managed GPU training systems
- Build observability and reliability practices
- Define product and technical roadmap
- Develop operational runbooks
- Drive end-to-end delivery
- Implement checkpointing and failure recovery
- Lead and mentor engineering team
- Partner with recruiting to hire talent
Perks/Benefits
- N/A
Skills/Tech-stack
Checkpointing | Cluster Lifecycle Management | Cluster lifecycle | DeepSpeed | Distributed Training | Elastic Training | FSDP | Fault Tolerance | GPU Performance | GPU Performance Optimization | Lifecycle Management | Megatron-LM | NCCL | Observability | Performance optimization | Pipeline parallelism | PyTorch | Tensor Comprehension | Tensor Parallelism
Roles
AI | AI Engineering | AI Engineering Manager | Engineering | Engineering Manager | Manager