Senior AI Infrastructure Engineer - Training Platform
San Francisco, CA; Seattle, WA; New York, NY
USD 216K-270K | Senior-level | Full Time
Tasks
- Architect multi-tenant GPU orchestration layer
- Deliver end-to-end project execution
- Design scheduling primitives for training jobs
- Evaluate build-vs-buy decisions
- Identify and isolate hardware failures
- Implement deep observability and automated health checking
- Integrate CNCF and AI ecosystem technologies
- Optimize training job lifecycle
- Participate in on-call operations
- Plan compute capacity
- Scale high-performance training platform
Perks/Benefits
- Commuter stipend
- Comprehensive health, dental and vision coverage
- Generous PTO
- Learning and development stipend
- Retirement benefits
Skills/Tech-stack
AWS | Admission controllers | C++ | CUDA | Custom Resources | DeepSpeed | Device plugins | Distributed Storage | Distributed Training | EFA | FSDP | GCP | Gang Scheduling | Go | InfiniBand | Infrastructure as Code | Kubernetes | Kueue | Lustre | NCCL | NVIDIA hardware | NVIDIA software | Operators | Preemption | PyTorch | Python | Queueing | Quotas | Ray | Rust | S3 | Terraform | Topology Aware Scheduling
Education
N/A