Data Engineer, LLM - Rakuten Institute of Technology Worldwide (RIT) (Tokyo)

Rakuten Crimson House, Japan


Rakuten

This is the corporate website of Rakuten Group, Inc. It features company information, investor relations, press releases, sustainability information, and career opportunities. Rakuten Group aims to empower people and society through innovation.



Job Description:

Business Overview

The Rakuten AI & Data Division (AIDD) creates powerful, customer-focused search, recommendation, data science, advertising, marketing, pricing, and inventory optimization solutions for various businesses in commerce, fintech, and mobile industries.     

Department Overview

The Rakuten Institute of Technology Worldwide (RIT), the AI R&D engine of Rakuten Group, Inc., is a global network of research labs spanning Tokyo, Singapore, Boston, San Mateo, Bengaluru, and Paris. We are dedicated to pioneering advancements in core AI technologies, with a focus on machine learning, deep learning, and generative AI. Our researchers actively explore use cases for large language models, intelligent agent systems, and other cutting-edge applications, driving innovation across Rakuten's diverse ecosystem.

Position:

Why We Hire

To establish and support domain-leading LLMs across critical sectors such as Fintech, Booking services, and E-commerce, we are building a foundational Senior Data Engineering team. This team will play a critical role in designing, building, and maintaining the robust data infrastructure essential for the entire LLM lifecycle, from data collection and preparation for pre-training and fine-tuning through serving and monitoring. You will work closely with ML Engineers, Data Scientists, and Researchers to ensure data quality, accessibility, and scalability, directly impacting the success and performance of our in-house LLM initiatives.

Position Details

Data Pipeline Development for LLMs:

Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
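For illustration, a minimal sketch of such an ETL step in Python; the gzipped-JSONL shard layout, field names, and length threshold are assumptions, not a description of this role's actual stack:

    import gzip
    import json
    from pathlib import Path

    def extract(raw_dir: Path):
        """Yield raw documents from gzipped JSONL shards (layout is assumed)."""
        for shard in sorted(raw_dir.glob("*.jsonl.gz")):
            with gzip.open(shard, "rt", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

    def transform(doc):
        """Normalize whitespace and drop near-empty documents."""
        text = doc.get("text", "").strip()
        if len(text) < 200:  # illustrative minimum-length threshold
            return None
        return {"id": doc["id"], "text": " ".join(text.split())}

    def load(docs, out_path: Path):
        """Write the cleaned corpus as one JSONL file."""
        with out_path.open("w", encoding="utf-8") as f:
            for doc in docs:
                f.write(json.dumps(doc, ensure_ascii=False) + "\n")

    if __name__ == "__main__":
        cleaned = (transform(d) for d in extract(Path("raw")))
        load((d for d in cleaned if d is not None), Path("corpus.jsonl"))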

High-Quality Dataset Creation & Curation:

Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora.
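A rough sketch of two of these steps, assuming an iterator of {"id", "text"} records. The exact-hash deduplication below only catches verbatim duplicates (near-duplicate detection at corpus scale typically relies on MinHash/LSH), and the regex patterns stand in for what would usually be a trained PII detector:

    import hashlib
    import re

    # Illustrative PII patterns; not exhaustive.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def mask_pii(text):
        """Replace obvious e-mail and phone patterns with placeholder tokens."""
        return PHONE_RE.sub("<PHONE>", EMAIL_RE.sub("<EMAIL>", text))

    def dedup_exact(docs):
        """Drop documents whose normalized text hash was already seen."""
        seen = set()
        for doc in docs:
            key = hashlib.sha256(doc["text"].lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                yield doc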

Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis.
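One common synthesis pattern, sketched under the assumption that prompts are templated from seed documents and later completed and quality-filtered by a stronger model; the templates here are invented placeholders:

    # Hypothetical prompt templates; a real pipeline would pair each prompt
    # with a teacher-model completion and filter the results for quality.
    TEMPLATES = [
        "Summarize the following passage:\n{text}",
        "Write three questions a reader might ask about this passage:\n{text}",
    ]

    def synthesize(docs):
        """Expand each seed document into several prompt variants."""
        for doc in docs:
            for i, template in enumerate(TEMPLATES):
                yield {
                    "prompt": template.format(text=doc["text"]),
                    "source_id": doc["id"],  # provenance for lineage tracking
                    "template_id": i,
                }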

Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
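A minimal sketch of what such an automated validation gate could look like, reusing the illustrative e-mail pattern from above; the 1% error budget is an arbitrary assumption:

    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # illustrative pattern

    def record_errors(rec):
        """Return rule violations for one record; an empty list means it passes."""
        errors = []
        if not rec.get("id"):
            errors.append("missing id")
        if not rec.get("text"):
            errors.append("empty text")
        elif EMAIL_RE.search(rec["text"]):
            errors.append("unmasked email")  # PII should be masked upstream
        return errors

    def validate_batch(records, max_error_rate=0.01):
        """Raise (failing the pipeline run) if too many records break a rule.

        `records` is assumed to be a materialized list of cleaned records.
        """
        bad = sum(1 for r in records if record_errors(r))
        if bad > max_error_rate * max(len(records), 1):
            raise ValueError(f"{bad}/{len(records)} records failed validation")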

Data Job Management:

Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle.
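Tools such as DVC or LakeFS provide this off the shelf; the sketch below only illustrates the underlying idea of content-addressed versions chained into a lineage, with the file layout assumed:

    import hashlib
    import json
    from pathlib import Path

    def fingerprint(path):
        """Content hash of a dataset file, usable as an immutable version id."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(data_dir, manifest_path, parent=None):
        """Record each file's hash plus the parent manifest, forming a lineage chain."""
        manifest = {
            "files": {p.name: fingerprint(p) for p in sorted(Path(data_dir).glob("*.jsonl"))},
            "parent": parent,  # fingerprint of the manifest this dataset derives from
        }
        Path(manifest_path).write_text(json.dumps(manifest, indent=2))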

Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.

Data Infrastructure & Orchestration:

Build and maintain scalable data warehouses and data lakes specifically designed for LLM data on both on-premise and public cloud environments.

Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation.
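A minimal Airflow 2.x DAG wiring such a workflow together; the DAG id, schedule, and task bodies are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull raw shards from object storage")  # placeholder body

    def clean():
        print("deduplicate, mask PII, tokenize")  # placeholder body

    def publish():
        print("write versioned dataset and manifest")  # placeholder body

    with DAG(
        dag_id="llm_corpus_prep",          # name is illustrative
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                 # `schedule_interval` on Airflow < 2.4
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="clean", python_callable=clean)
        t3 = PythonOperator(task_id="publish", python_callable=publish)
        t1 >> t2 >> t3  # linear dependency: extract -> clean -> publish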

Mandatory Qualifications:

- Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, plus 3+ years of professional experience in Data Engineering with a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text, image, or voice datasets.

- Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.

- Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).

- Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, LakeFS, or data versioning features within MLOps platforms like MLflow).

- Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective.

- Experience designing and implementing data annotation workflows and pipelines.

- Strong proficiency in Python and extensive experience with its data ecosystem.

- Proficiency in SQL and a good understanding of data warehousing concepts, data modeling, and schema design.

Other Information:

Additional information on English Qualification

English: Fluent

#engineer #infrastructureengineer #aianddatadiv



Region: Asia/Pacific
Country: Japan
