System Engineer, GPU Infrastructure & Platform Engineering - GPU Optimization Department (GPUOD)
Rakuten Crimson House, Japan
Rakuten
Job Description:
AI & Data Division (AIDD) spearheads data science & AI initiatives by leveraging data from across Rakuten Group. We build a platform for large-scale field experimentation using cutting-edge technologies, providing critical insights that enable faster and better contributions to our business. Our division boasts an international culture created by talented employees from around the world. Following the strategic vision “Rakuten as a data-driven membership company”, AIDD is expanding its data- and AI-related activities across multiple Rakuten Group companies.
Department Overview
GPU Optimization Department is responsible for the strategic management, optimization, and governance of Rakuten's company-wide AI infrastructure, ensuring high-performance, cost-efficient utilization of compute resources for machine learning workloads. We oversee a large-scale hybrid infrastructure spanning thousands of accelerators, including the latest Hopper and upcoming Blackwell architectures.
As a central enabler for AI innovation, we:
- Optimize compute resource allocation across on-premises and multi-cloud environments, maximizing efficiency for training and inference workloads
- Manage hybrid orchestration of diverse accelerator resources, ensuring seamless scalability and cost-effective deployment
- Develop and enhance frameworks for large-scale distributed training, with special focus on LLMs and generative AI
- Optimize inference performance through model optimization techniques and system-level acceleration
- Collaborate with internal teams to deliver scalable, high-availability inference services tailored to business needs
- Continuously evaluate next-generation hardware solutions, including specialized AI chips optimized for LLM workloads
By effectively managing both conventional and specialized compute resources across on-premises and cloud environments, our team ensures Rakuten's AI ecosystem remains at the forefront of performance, reliability, and cost-efficiency.
Position:
Why We Hire
- Build and scale cutting-edge GPU infrastructure for ranking models, LLMs, and real-time AI.
- Work with global AI/ML teams to solve high-impact infrastructure challenges.
- Opportunity to shape the future of Rakuten’s GPU platform for scalability and efficiency.
Position Details
As a System Engineer (GPU Infrastructure & Platform Engineering), you will build, scale, and optimize the GPU cluster infrastructure that supports both training (e.g., ranking models, LLMs) and inference workloads. Your focus will be on designing and building a GPU platform with sophisticated scheduling, elasticity, and quota management, ensuring efficient utilization, scalability, and stability for Rakuten’s AI workloads.
Key Responsibilities
- Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation.
- Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput, low-latency model deployment.
- Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization.
- Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks.
- Implement observability tooling (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health.
- Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible).
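To make the scheduling and quota responsibilities above concrete, here is a minimal, illustrative Python sketch of quota-aware GPU allocation — the kind of policy a custom Kubernetes scheduler plugin or admission controller would enforce for multi-tenant isolation. All names (`GpuPool`, `Tenant`, `try_allocate`) are hypothetical, not a real Kubernetes API.

```python
# Illustrative model of multi-tenant GPU quota enforcement.
# A tenant's request is granted only if the shared pool has free
# capacity AND the tenant stays within its own quota.
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    quota: int            # max GPUs this tenant may hold at once
    allocated: int = 0    # GPUs currently held


class GpuPool:
    """Tracks a fixed pool of GPUs shared by several tenants."""

    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self.tenants: dict[str, Tenant] = {}

    def register(self, name: str, quota: int) -> None:
        self.tenants[name] = Tenant(name, quota)

    def try_allocate(self, name: str, gpus: int) -> bool:
        """Grant GPUs if both pool capacity and tenant quota allow;
        otherwise reject the request (no preemption in this sketch)."""
        t = self.tenants[name]
        if gpus <= self.free and t.allocated + gpus <= t.quota:
            t.allocated += gpus
            self.free -= gpus
            return True
        return False

    def release(self, name: str, gpus: int) -> None:
        """Return GPUs to the pool, never releasing more than held."""
        t = self.tenants[name]
        freed = min(gpus, t.allocated)
        t.allocated -= freed
        self.free += freed
```

In a real cluster this logic would live behind the Kubernetes scheduler (e.g., via Volcano queues or ResourceQuota objects) rather than in application code; the sketch only shows the invariant being enforced.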
Mandatory Qualifications:
- 3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing.
- Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., KubeFlow, Volcano, custom schedulers).
- Strong programming skills in Go or Python for platform development, automation and tooling.
- Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand).
- Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins).
- Bachelor’s or higher degree in Computer Science, Engineering, or a related field.
- Strong teamwork and communication skills, with a passion for solving infrastructure challenges.
Desired Qualifications:
- Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed).
- Familiarity with the NVIDIA Triton serving framework (or a similar framework), including tuning serving parameters to strike a good trade-off between latency and throughput.
- Hands-on experience with GPU clusters, including troubleshooting NVIDIA drivers, CUDA, and NCCL issues.
- Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data.
- Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM).
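The latency/throughput trade-off mentioned above for serving-parameter tuning can be illustrated with a back-of-envelope model: larger batches amortize fixed per-batch overhead (raising throughput) but add queueing delay while the batch fills (raising latency). The cost model below — fixed overhead plus a linear per-request term, with a fill delay set by the arrival rate — is an assumption for illustration only, not measured Triton or vLLM behavior.

```python
# Toy batching model: how batch size trades latency for throughput.
# All constants (overhead, per-request cost, arrival rate) are
# illustrative assumptions, not benchmarks of any real server.
def serving_metrics(batch_size: int,
                    overhead_ms: float = 10.0,
                    per_request_ms: float = 2.0,
                    arrival_rate_rps: float = 100.0):
    """Return (latency_ms, throughput_rps) for a given batch size."""
    # Queueing delay: time to accumulate a full batch at this arrival rate
    fill_ms = 1000.0 * batch_size / arrival_rate_rps
    # Compute time: fixed per-batch overhead plus a linear per-request term
    compute_ms = overhead_ms + per_request_ms * batch_size
    latency_ms = fill_ms + compute_ms
    throughput_rps = 1000.0 * batch_size / compute_ms
    return latency_ms, throughput_rps
```

Under this model, `serving_metrics(1)` yields low latency but poor throughput, while `serving_metrics(32)` raises throughput several-fold at the cost of much higher latency — the same trade-off that knobs like Triton's dynamic-batching queue delay expose in practice.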
Languages:
English (Overall - 3 - Advanced)