Lead Infrastructure and Reliability Engineer (Systems & Scale)
Tasks
- Architect GPU environments
- Build scalable reliability mechanisms
- Design scheduling and resource management
- Hire and develop engineering team
- Improve utilization and performance
- Resolve hardware and software failures
- Scale training and inference infrastructure
- Shape architecture through research and product collaboration
- Translate reliability constraints into platform strategy
Perks/Benefits
- N/A
Skills/Tech-stack
Automation | Code Development | Debugging hardware to orchestration | Distributed Systems | GPU clusters | Kubernetes | Linux | Open source infrastructure | Systems behavior under contention and scale
Education
N/A