Site Reliability Engineer-AI Cloud
Bade, Taiwan, TW
Supermicro
The premier provider of advanced Server Building Block Solutions® for 5G/Edge, Data Center, Cloud, Enterprise, Big Data, HPC and Embedded markets worldwide.
Job Req ID: 26896
About Supermicro:
Supermicro® is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.
Job Summary:
As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure while ensuring their high availability, performance, and security. You’ll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.
Essential Duties and Responsibilities:
Cloud Infra Automation:
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.
Platform Reliability:
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.
Monitoring & Alerting:
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.
Capacity Planning:
Analyze usage patterns and forecast infrastructure needs for AI workloads.
Incident Management:
Lead root cause analysis and manage SLOs/SLIs/SLAs to maintain high availability.
CI/CD Integration:
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.
Security & Compliance:
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).
Documentation & Playbooks:
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience; 3-7 years of experience in the areas below is preferred.
- Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
- Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
- Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
- Strong scripting and coding skills (Bash, Python, or Go).
- Exposure to secure multi-tenant environments and zero trust architectures.
- Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
- Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.