Site Reliability Engineer, ML Compute SRE
Tasks
- Collaborate with infrastructure teams to reduce launch risks
- Define and enhance metrics and SLOs
- Design new ML features
- Improve operational experience
- Participate in on-call support
- Support ML operations
Skills/Tech-stack
Algorithms | Automation | Coding | Complexity analysis | Data Structures | Debugging | Distributed Systems | Large Scale Distributed Systems | Machine Learning | Machine Learning Infrastructure | Scale distributed systems | Software development | System design | Troubleshooting
Related jobs
- Forward Deployed Engineer; USD 170K-256K; AWS | Azure | Bash | Debugging | Distributed Systems; Hybrid work | Travel required; Senior-level Full Time; New York, NY; 1d ago
- Software Engineer, Infrastructure Reliability; USD 255K-405K; AWS | Automation | Best practices | CI/CD | Cloud infrastructure; Senior-level Full Time; San Francisco; 1d ago
- Staff Site Reliability Engineer, Data Platform; USD 131K-210K; Access Management | Azure Kubernetes | Azure Kubernetes Service | Cloud Networking | Configuration Management; 401k | Dental insurance | FSA | HSA | Life insurance; Senior-level Full Time; Austin, TX, United States; 1d ago
- Automation | Fault Tolerance | Incident Response | Machine Learning | Monitoring; Mid-level Full Time; San Jose, California, United States; 2d ago
- Director, Engineering; USD 150K-200K; AI | AWS | Anomaly Detection | Azure | Bash; Equity participation | Health and wellness benefits | Hybrid or remote work | Time off policies; Executive-level Full Time; United States - Remote R; 9d ago
- Senior Infrastructure Engineer, Observe by Snowflake; USD 200K-287K; AWS | Ansible | Automation | Go | Kubernetes; Senior-level Full Time; US-CA-Menlo Park; 10d ago
- Infrastructure Engineer, Observe by Snowflake; USD 160K-210K; AWS | Ansible | Automation | Azure | GCP; Mid-level Full Time; US-CA-Menlo Park; 11d ago
- Site Reliability Engineer - Data (Seattle); USD 177K-341K; Automation | Cloud infrastructure | Flink | Kubernetes | Monitoring Tools; Mid-level Full Time; Seattle, Washington, United States; 15d ago
- Site Reliability Engineer, AI Applications; USD 136K-359K; Automation | Capacity Planning | Documentation | Incident Response | Monitoring; Mid-level Full Time; San Jose, California, United States; 15d ago
- Automation | Cloud infrastructure | Flink | Incident Management | Kubernetes; Development workshops | Social events; Entry-level Internship; San Jose, California, United States; 15d ago
- Tech Lead Cloud Site Reliability Engineer - DCS Cloud; USD 198K-416K; Automation | Cloud Computing | Compliance Standards | Infrastructure Management | Linux; Senior-level Full Time; Seattle, Washington, United States; 16d ago
- Site Reliability Developer 3; USD 79K-158K; Apache Spark | Cloud Computing | Distributed Systems | Go | Kafka; Adoption Assistance | Flexible benefits | Flexible vacation | Medical/Dental/Vision insurance | Paid Holidays; Mid-level Full Time; TN, United States; 16d ago
- Digital - Principal SRE (AI Engineer); USD 77K-154K; AI | AWS | Alerting | Ansible | CI/CD; Collaborative environment | Flexible work arrangement; Senior-level Full Time; Easton Ops Cols C Oh, United …; 17d ago
- Senior Software Engineer, Vertex AI Platform SRE; USD 166K-244K; Automation | Cloud platform | Container Orchestration | Distributed Systems | GCP Networking; Benefits | Work with cutting-edge technology; Senior-level Full Time; Sunnyvale, CA, USA; 17d ago
- Senior Site Reliability Engineer; USD 129K-180K; AWS | Alerting | ArgoCD | Automation | Bash; 401k matching | Healthcare coverage | Online learning platform | Paid time off; Senior-level Full Time; USA - Georgia - Alpharetta - …; 22d ago
- Member of Technical Staff, Site Reliability Engineer (HPC) - MAI SuperIntelligence Team; USD 119K-304K; AWS | Azure | Bash | CI/CD | Capacity Planning; Benefits | Competitive compensation | Equity options; Senior-level Full Time; Mountain View, CA, US; 1mo ago
- Site Reliability Engineer; USD 175K-225K; AI Agent | AI Agent architecture | AI infrastructure | Agent architecture | Caches; Benefits | Equity; Mid-level Full Time; New York, NY; 1mo ago
- Automation | Documentation | Infrastructure tuning | Mentoring | Monitoring; Senior-level Full Time; San Jose, California, United States; 1mo ago
- Automation | Cloud Management | Cloud infrastructure | DevOps | Distributed Systems; Global team participation | Growth opportunities | Innovation environment; Entry-level Full Time; San Jose, California, United States; 1mo ago
- Senior Site Reliability Engineer, ML System; USD 136K-359K; Coding | Distributed Systems | Machine Learning | Monitoring Tools | Performance Analysis; Senior-level Full Time; San Jose, California, United States; 1mo ago
- Automation | Cloud infrastructure | Flink | Kubernetes | Monitoring Frameworks; Entry-level Full Time; Seattle, Washington, United States; 1mo ago
- Site Reliability Engineer - Data; USD 136K-359K; Automation | Cloud infrastructure | Distributed Systems | Flink | Kubernetes; Mid-level Full Time; San Jose, California, United States; 1mo ago
- Senior Site Reliability Engineer - Data Infrastructure; USD 177K-341K; Automation | Cloud infrastructure | Cost efficiency | Cross-Functional Collaboration | Cross-functional; Senior-level Full Time; Seattle, Washington, United States; 1mo ago