Senior AI Infrastructure Engineer - Training Platform
San Francisco, CA; Seattle, WA; New York, NY
USD 216K-270K | Senior-level | Full Time
Tasks
- Architect multi-tenant GPU orchestration layer
- Deliver end-to-end project execution
- Design scheduling primitives for training jobs
- Evaluate build-vs-buy decisions
- Identify and isolate hardware failures
- Implement deep observability and automated health checking
- Integrate CNCF and AI ecosystem technologies
- Optimize training job lifecycle
- Participate in on-call operations
- Plan compute capacity
- Scale high-performance training platform
Perks/Benefits
- Commuter stipend
- Comprehensive health, dental and vision coverage
- Generous PTO
- Learning and development stipend
- Retirement benefits
Skills/Tech-stack
AWS | Admission controllers | C++ | CUDA | Custom Resources | DeepSpeed | Device plugins | Distributed Storage | Distributed Training | EFA | FSDP | GCP | Gang Scheduling | Go | InfiniBand | Infrastructure as Code | Kubernetes | Kueue | Lustre | NCCL | NVIDIA hardware | NVIDIA software | Operators | Preemption | PyTorch | Python | Queueing | Quotas | Ray | Rust | S3 | Terraform | Topology Aware Scheduling
Education
N/A