Research Intern

Santa Clara, CA

XPeng Motors



XPENG is a leading smart technology company at the forefront of innovation, integrating advanced AI and autonomous driving technologies into its products, including electric vehicles (EVs), electric vertical take-off and landing (eVTOL) aircraft, and robotics. With a strong focus on intelligent mobility, XPENG is dedicated to reshaping the future of transportation through cutting-edge R&D in AI, machine learning, and smart connectivity.

We are looking for highly motivated research interns to join our AI Infrastructure team to work on automated diagnosis and discovery of failures in large-scale deep learning training systems. This project targets core challenges in distributed training environments, including root cause analysis of training failures, detection of performance anomalies, and improvement of system robustness via automatic recovery.

  Responsibilities
  • Investigate, reproduce, and diagnose failures in PyTorch-based distributed training pipelines.
  • Develop tools and techniques for automatic failure detection using system traces, logs, and hardware-level metrics.
  • Implement failure-aware monitoring for NCCL, CUDA runtime, and communication components.
  • Analyze training disruptions/errors related to GPU/NVLink/network instability, OOMs, deadlocks, and degraded throughput.
  • Integrate with system components such as etcd, CUPTI, or XLA profiling tools to extract telemetry and debug information.
  • Collaborate with senior engineers to design robust, scalable diagnostic frameworks.
  Required Qualifications
  • Strong programming skills in Python and C/C++.
  • Hands-on experience with PyTorch and distributed training (e.g., DDP, NCCL).
  • Solid understanding of Operating Systems and Distributed Systems, especially process management, memory, and networking.
  • Familiarity with debugging and profiling tools (e.g., gdb, perf, nvprof, nsys).
  Preferred Qualifications
  • Experience with failure diagnosis, logging systems, or automated root cause analysis.
  • Understanding of NCCL internals, CUDA architecture, or GPU performance profiling.
  • Experience working with etcd, CUPTI, or other telemetry tools in a production-grade system.
  • Exposure to cloud-native systems or large-scale cluster management.
  What You’ll Gain
  • Exposure to real-world infrastructure challenges in deep learning systems at scale.
  • Mentorship from experienced engineers/researchers in system design and AI infra.
  • Opportunity to contribute to internal tools or publications (if applicable).
  • Hands-on experience with cutting-edge hardware and training platforms.
  • The tools you build will be directly integrated into our production platform, helping our machine learning teams train models faster and more reliably.
  • Potential to file patents and publish papers.
  What We Provide
  • A fun, supportive, and engaging environment.
  • Infrastructure and computational resources to support your work.
  • Opportunity to work on cutting-edge technologies with top talent in the field.
  • Opportunity to make a significant impact on the transportation revolution by advancing autonomous driving.
  • Competitive compensation package.
  • Snacks, lunches, dinners, and fun activities.
  We are an Equal Opportunity Employer. It is our policy to provide equal employment opportunities to all qualified persons without regard to race, age, color, sex, sexual orientation, religion, national origin, disability, veteran status, marital status, or any other prescribed category set forth in federal or state regulations.
Category: Research Jobs

Tags: Architecture Autonomous Driving CUDA DDP Deep Learning Distributed Systems GPU Machine Learning ML infrastructure NVLink Pipelines Python PyTorch R R&D Research Robotics

Perks/benefits: Competitive pay

Region: North America
Country: United States
