Lead Infrastructure and Reliability Engineer (Systems & Scale)
Tasks
- Architect GPU environments
- Build scalable reliability mechanisms
- Design scheduling and resource management
- Hire and develop engineering team
- Improve utilization and performance
- Resolve hardware and software failures
- Scale training and inference infrastructure
- Shape architecture through research and product collaboration
- Translate reliability constraints into platform strategy
Perks/Benefits
- N/A
Skills/Tech-stack
Automation | Code Development | Debugging hardware to orchestration | Distributed Systems | GPU clusters | Kubernetes | Linux | Open source infrastructure | Systems behavior under contention and scale
Education
N/A