Senior Engineering Manager, AI Runtime
USD 228K-297K | Senior-level | Full Time
Tasks
- Architect managed GPU training systems
- Build observability and reliability practices
- Define product and technical roadmap
- Develop operational runbooks
- Drive end-to-end delivery
- Implement checkpointing and failure recovery
- Lead and mentor engineering team
- Partner with recruiting to hire talent
Perks/Benefits
- N/A
Skills/Tech-stack
Checkpointing | Cluster Lifecycle Management | Cluster lifecycle | DeepSpeed | Distributed Training | Elastic Training | FSDP | Fault Tolerance | GPU Performance | GPU Performance Optimization | Lifecycle Management | Megatron-LM | NCCL | Observability | Performance optimization | Pipeline parallelism | PyTorch | Tensor Comprehension | Tensor Parallelism
Roles
AI | AI Engineering | AI Engineering Manager | Engineering | Engineering Manager | Manager