Software Engineer, LLM Storage System Intern - 2025 Start
Singapore
ByteDance
ByteDance is a technology company operating a range of content platforms that inform, educate, entertain and inspire people across languages, cultures and geographies.
Responsibilities
About the Team
The ByteDance Doubao Large Model Team was established in 2023 and is dedicated to developing the industry's most advanced AI large-model technology, becoming a world-class research team, and contributing to the development of technology and society.
The Doubao large model team has a long-term vision and commitment in the field of AI, with research directions covering NLP, CV, speech, and more. The team has laboratories and research positions in China, Singapore, the US, and other locations. Drawing on the platform's abundant data and computing resources, it invests continuously in these fields and has launched self-developed general-purpose large models with multimodal capabilities. These models support more than 50 downstream businesses, including Doubao, Coze, and Dreamina, and are available to enterprise customers through Volcengine. The Doubao app is currently the largest AIGC application in the Chinese market.
1. Design and develop the storage-related components of machine learning systems for diverse large-model inference scenarios (LLM/S2S/VLM/multimodal, etc.), including model distribution and loading, KVCache optimization, data IO performance, and improvement of TTFT (time to first token) and TBT (time between tokens) in LLM serving.
2. Design and implement a multi-level storage system for large-model inference. Combine various media, including HBM, host memory, distributed disk, and remote large-capacity storage systems (HDFS/object storage), for data storage and migration management, building an integrated hierarchy of "near-compute cache + remote large-capacity storage".
3. Optimize the hit rate of the large-model KV Cache. Develop tailored optimization strategies across system dimensions such as the inference framework, traffic scheduling, and the multi-level cache. Improve data IO performance by fully leveraging NVLink, RDMA high-speed networking, and GPU Direct technologies on the near-compute side for efficient data transmission, and optimize the data-replica storage strategy to achieve a balanced distribution of load traffic and stored data.
4. Design and implement efficient, user-friendly data access interfaces that integrate seamlessly with the inference framework and manage the lifecycle of the KV Cache.
5. Handle access, management, operations, and monitoring of the multi-level storage system in Kubernetes environments to ensure stability.
6. Set up the system and implement disaster recovery in multi-datacenter, multi-region, and multi-cloud scenarios, and optimize data placement across clusters.
Qualifications
Minimum Qualifications:
1. Proficiency in one or more of the following programming languages in a Linux environment: C++, Go, Python, Shell, etc.
2. Familiarity with the Kubernetes architecture and ecosystem, as well as cloud-native container storage technologies such as PV/CSI.
3. Solid grasp of distributed systems principles, with experience in the design, development, and maintenance of large-scale distributed systems.
4. Strong logical analysis skills, with the ability to sensibly abstract and decompose business logic.
5. Strong sense of responsibility, good learning ability, communication skills, and self-motivation, with the ability to respond and act promptly.
6. Good documentation habits, with the ability to write and update work processes and technical documents in a timely manner as required.
Preferred Qualifications:
1. Familiarity with at least one mainstream machine learning framework or inference engine (e.g., vLLM, SGLang, PyTorch).
2. Hands-on experience with large-model inference, such as implementing KVCache-related optimizations.
3. Familiarity with distributed cache systems (e.g., Alluxio, JuiceFS, GooseFS, JindoFS).
4. Knowledge of technologies such as NVLink, RDMA, NCCL, and GPU Direct.
5. Publications at top conferences in computer systems and storage (including but not limited to OSDI, SOSP, FAST).
Candidates can apply to a maximum of two positions and will be considered for jobs in the order they apply. The application limit applies to ByteDance and its affiliates' jobs globally.
Applications will be reviewed on a rolling basis - we encourage you to apply early.
Successful candidates must be able to commit to an internship period of at least 3 months.
By submitting an application for this role, you accept and agree to our global applicant privacy policy, which may be accessed here: https://jobs.bytedance.com/en/legal/privacy.
If you have any questions, please reach out to us at apac-earlycareers@bytedance.com
About Doubao (Seed)
Founded in 2023, the ByteDance Doubao (Seed) Team is dedicated to pioneering advanced AI foundation models. Our goal is to lead in cutting-edge research and drive technological and societal advancement.
With a strong commitment to AI, our research areas span deep learning, reinforcement learning, Language, Vision, Audio, AI Infra and AI Safety. Our team has labs and research positions across China, Singapore, and the US.
Why Join ByteDance
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make an impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.