Senior Site Reliability Engineer, AI Factory
US, CA, Santa Clara, United States
USD 176K-333K | Senior-level | Full Time
Tasks
- Collaborate with technical teams to define procedures
- Coordinate remote hands operations
- Create documentation for AI Data Centers
- Develop and implement operational strategies
- Develop automation and reliability processes
- Feed requirements to hardware/software teams
- Implement open-source break-fix solutions
- Maintain SLAs and ensure uptime
- Manage firmware versions and hardware communication
- Monitor hardware state and troubleshoot issues
- Optimize performance and capacity
- Run commissioning and provisioning of GPU systems
Skills/Tech-stack
BMS | Commissioning | Configuration Management | Data center | Data center operations | GPU Management | Infrastructure automation | Inventory Management | Message bus | Open Source | Open Source Software | Packer | Power Management | Provisioning | Qcow2 | Reliability Engineering | Workflow Engine | Zero Touch | Zero-touch provisioning
Education
Bachelor | Computer Science | Master in Computer Engineering | Related Field