PhD student in Reliable Large Scale AI Infrastructures

München, Germany

Huawei Research Center Germany & Austria

Huawei is a leading global provider of information and communications technology (ICT) infrastructure and smart devices.


Huawei’s TTE RAMS Lab is a corporate competence center responsible for researching high-reliability and high-safety architectures and technologies for complex intelligent systems. Our goal is to provide Huawei products with cutting-edge research and advanced technical solutions for intelligent reliability and safety in carrier-grade ICT and safety-critical systems such as autonomous driving, so that our products deliver the best user experience and performance to our customers.

Position Overview:

We are seeking a highly motivated and talented PhD student to join our cutting-edge research team focused on large-scale reliable AI infrastructures. This position emphasizes ensuring the robustness and reliability of training and inference for large language models (LLMs). The ideal candidate will engage in both theoretical and practical research aimed at overcoming the challenges of scaling AI systems while maintaining reliability, resilience, and efficiency across various AI workflows.

As part of the team, you will have the opportunity to work at the forefront of AI infrastructure, addressing critical issues like fault tolerance, data and model consistency, distributed AI system infrastructure, and the optimization of machine learning pipelines.

Key Responsibilities:

  • Conduct advanced research on scaling and ensuring reliability in the training and inference of large language models (LLMs);
  • Develop innovative methodologies for enhancing the fault tolerance and resilience of AI infrastructures, including the detection of and recovery from system failures;
  • Investigate and improve existing distributed systems architectures and parallel computing frameworks for large-scale machine learning tasks;
  • Collaborate with interdisciplinary teams to design and implement strategies that ensure the robustness of AI systems in production environments;
  • Contribute to the development of new algorithms and models that improve both the scalability and reliability of large-scale training and inference pipelines;
  • Publish research findings in top-tier conferences and journals, and contribute to internal knowledge sharing;
  • Assist in the development of open-source tools and frameworks for the research community.

Required Qualifications:

  • Master’s degree in Computer Science, Engineering, or a related field, with a strong academic record;
  • Demonstrated interest or experience in AI, machine learning, or large-scale distributed systems;
  • Solid understanding of deep learning principles and techniques, particularly as applied to large language models (LLMs);
  • Proficiency in programming languages such as Python, C/C++, or equivalent, and experience with system scripting (e.g., Bash, Perl, sed/awk);
  • Familiarity with deep learning frameworks (e.g., TensorFlow, PyTorch); an understanding of their underlying low-level technical details is a big plus;
  • Strong knowledge of Operating Systems (Linux-based), distributed computing, cloud infrastructure, and containerization technologies (e.g., Kubernetes, Docker);
  • Excellent problem-solving skills, analytical thinking, and attention to detail;
  • Ability to work collaboratively in a multidisciplinary team environment and communicate complex technical concepts effectively.

Preferred Qualifications:

  • Experience in large-scale AI/ML system deployment, optimization, or maintenance;
  • Familiarity with the challenges and best practices in training very large neural networks;
  • Background in systems engineering, cloud architectures, or high-performance computing (HPC);
  • Knowledge of tools and technologies for distributed training (e.g., Horovod, DeepSpeed, Slurm, Ray);
  • Prior research or industry experience in AI model reliability, system fault tolerance, or similar areas;
  • Publications in relevant conferences and workshops are a big plus.

Huawei is a leading global information and communications technology (ICT) solutions provider. Driven by a commitment to sound operations, ongoing innovation, and open collaboration, we have established a competitive ICT portfolio of end-to-end solutions spanning telecom and enterprise networks, devices, and cloud technology and services. Our ICT solutions, products, and services are used in more than 170 countries and regions, serving over one-third of the world's population. With 197,000 employees, Huawei is committed to developing the future information society and building a Better Connected World.

Please send your application and CV (incl. cover letter and reference letters) in English.
