Senior Site Reliability Engineer, AI Factory
Santa Clara, CA, United States
USD 176K-333K | Senior-level | Full Time | Found 6d ago
Tasks
- Collaborate with technical teams to define procedures
- Coordinate remote hands operations
- Create documentation for AI Data Centers
- Develop and implement operational strategies
- Develop automation and reliability processes
- Feed requirements to hardware/software teams
- Implement open-source break-fix solutions
- Maintain SLAs and ensure uptime
- Manage firmware versions and hardware communication
- Monitor hardware state and troubleshoot issues
- Optimize performance and capacity
- Run commissioning and provisioning of GPU systems
Skills/Tech-stack
BMS | Commissioning | Configuration Management | Data center | Data center operations | GPU Management | Infrastructure automation | Inventory Management | Message bus | Open Source | Open Source Software | Packer | Power Management | Provisioning | Qcow2 | Reliability Engineering | Workflow Engine | Zero Touch | Zero-touch provisioning
Education
Bachelor | Computer Science | Master in Computer Engineering | Related Field