Research Intern
Santa Clara, CA
XPeng Motors
XPENG's electric vehicles are designed for performance, safety, and sustainability. Explore our range of smart EVs, advanced technology, and commitment to a greener future.
XPENG is a leading smart technology company at the forefront of innovation, integrating advanced AI and autonomous driving technologies into its vehicles, including electric vehicles (EVs), electric vertical take-off and landing (eVTOL) aircraft, and robotics. With a strong focus on intelligent mobility, XPENG is dedicated to reshaping the future of transportation through cutting-edge R&D in AI, machine learning, and smart connectivity.
We are looking for highly motivated research interns to join our AI Infrastructure team to work on automated diagnosis and discovery of failures in large-scale deep learning training systems. This project targets core challenges in distributed training environments, including root cause analysis of training failures, detecting performance anomalies, and improving system robustness via automatic recovery.
Responsibilities
- Investigate, reproduce, and diagnose failures in PyTorch-based distributed training pipelines.
- Develop tools and techniques for automatic failure detection using system traces, logs, and hardware-level metrics.
- Implement failure-aware monitoring for NCCL, CUDA runtime, and communication components.
- Analyze training disruptions/errors related to GPU/NVLink/network instability, OOMs, deadlocks, and degraded throughput.
- Integrate with system components such as etcd, CUPTI, or XLA profiling tools to extract telemetry and debug information.
- Collaborate with senior engineers to design robust, scalable diagnostic frameworks.
Qualifications
- Strong programming skills in Python and C/C++.
- Hands-on experience with PyTorch and distributed training (e.g., DDP, NCCL).
- Solid understanding of Operating Systems and Distributed Systems, especially process management, memory, and networking.
- Familiarity with debugging and profiling tools (e.g., gdb, perf, nvprof, nsys).
- Experience with failure diagnosis, logging systems, or automated root cause analysis.
- Understanding of NCCL internals, CUDA architecture, or GPU performance profiling.
- Experience working with etcd, CUPTI, or other telemetry tools in a production-grade system.
- Exposure to cloud-native systems or large-scale cluster management.
What we offer
- Exposure to real-world infrastructure challenges in deep learning systems at scale.
- Mentorship from experienced engineers/researchers in system design and AI infra.
- Opportunity to contribute to internal tools or publications (if applicable).
- Hands-on experience with cutting-edge hardware and training platforms.
- The tools you build will be directly integrated into our production platform, helping our machine learning teams train models faster and more reliably.
- Potential to file patents and publish papers.
- A fun, supportive, and engaging environment.
- Infrastructure and computational resources to support your work.
- Opportunity to work on cutting-edge technologies with top talent in the field.
- Opportunity to make a significant impact on the transportation revolution by advancing autonomous driving.
- Competitive compensation package.
- Snacks, lunches, dinners, and fun activities.
Category:
Research Jobs
Tags: Architecture Autonomous Driving CUDA DDP Deep Learning Distributed Systems GPU Machine Learning ML infrastructure NVLink Pipelines Python PyTorch R R&D Research Robotics
Perks/benefits: Competitive pay
Region:
North America
Country:
United States