System Engineer (GPU Infrastructure & Platform Engineering)
Crimson House Singapore
Rakuten Asia Pte Ltd
Rakuten Group, Inc. is a leading global company that contributes to society by creating value through innovation and entrepreneurship.
Job Description:
Situated in the heart of Singapore's Central Business District, Rakuten Asia Pte. Ltd. is Rakuten's Asia Regional headquarters. Established in August 2012 as part of Rakuten's global expansion strategy, Rakuten Asia comprises various businesses that provide essential value-added services to Rakuten's global ecosystem. Through advertisement product development, product strategy, and data management, among others, Rakuten Asia is strengthening Rakuten Group's core competencies to take the lead in an increasingly digitalized world.
The AI & Data Division (AIDD) spearheads data science and AI initiatives by leveraging data from Rakuten Group. We build a platform for large-scale field experimentation using cutting-edge technologies, providing critical insights that enable faster and better contributions to our business. Our division boasts an international culture created by talented employees from around the world. Following the strategic vision “Rakuten as a data-driven membership company”, AIDD is expanding its data and AI activities across multiple Rakuten Group companies.
As a System Engineer (GPU Infrastructure & Platform Engineering), you will build, scale, and optimize the GPU cluster infrastructure that supports both training (e.g., ranking models, LLMs) and inference workloads. Your focus will be on designing and building a GPU platform with sophisticated scheduling, elasticity, and quota management, ensuring efficient utilization, scalability, and stability for Rakuten’s AI workloads.
Key Responsibilities
Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation.
Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput and low-latency model deployment.
Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization.
Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks.
Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health.
Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible).
Mandatory Qualifications
3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing.
Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., Kubeflow, Volcano, custom schedulers).
Strong programming skills in Go or Python for platform development, automation and tooling.
Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand).
Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins).
Bachelor’s or higher degree in Computer Science, Engineering, or a related field.
Strong teamwork and communication skills, with a passion for solving infrastructure challenges.
Nice-to-Have Skills
Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed).
Familiarity with NVIDIA Triton or a similar serving framework, and with tuning serving parameters to balance latency against throughput.
Hands-on experience with GPU clusters, including troubleshooting NVIDIA drivers, CUDA, and NCCL issues.
Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data.
Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM).
Why Join Us?
Build and scale cutting-edge GPU infrastructure for ranking models, LLMs, and real-time AI.
Work with global AI/ML teams to solve high-impact infrastructure challenges.
Opportunity to shape the future of Rakuten’s GPU platform for scalability and efficiency.
Rakuten is an equal opportunities employer and welcomes applications regardless of sex, marital status, ethnic origin, sexual orientation, religious belief, or age.
Tags: Ansible CI/CD Computer Science CUDA Data management DDP DevOps Engineering FSDP GitHub GPU Grafana InfiniBand Jenkins Kubeflow Kubernetes Linux LLMs Machine Learning MLOps Model deployment Pipelines Python PyTorch TensorRT Terraform vLLM