Staff AI Infrastructure Engineer: Inference Platform
Santa Clara, CA
Full Time | Senior-level / Expert | USD 215K - 364K
XPeng Motors
XPENG's electric vehicles are designed for performance, safety, and sustainability. Explore our range of smart EVs, advanced technology, and commitment to a greener future.
- Design, implement, and operate components of our novel model inference platform (e.g. quota management, job scheduling, and queuing systems). You will play a critical role in scheduling GPU resources.
- Identify performance bottlenecks and optimization opportunities
- Work closely with Machine Learning Engineers to evolve the inference platform to fit their use cases
- Monitor system health, diagnose and troubleshoot issues, and perform routine maintenance tasks to ensure the reliability of the distributed inference infrastructure
- Build and maintain documentation for infrastructure components and systems
- Advanced degree (MS or PhD) in Computer Science or related field
- 5+ years of industry or research experience in ML infrastructure or model inference
- Expertise in programming languages like Python/Java/C++ and experience with distributed computing frameworks
- Experience with high-throughput, fault-tolerant system design
- Proficient in Docker and Kubernetes
- Experience with Jenkins, GitHub CI/CD, or similar tools
- Experience with Prometheus, Grafana, or similar monitoring solutions
- Excellent problem-solving skills and attention to detail
- Strong communication skills and ability to work in a collaborative environment
- Strong background in building and maintaining large-scale distributed systems
- Strong background in performance optimization and system scaling
- Experience in scheduling jobs on heterogeneous computation resources
- Deep understanding of cloud computing platforms
- Deep knowledge of monitoring and observability practices
- Experience with CUDA packages
- Experience with PyTorch, TensorFlow, or similar frameworks
- A dynamic, supportive, and engaging work environment where creativity thrives.
- The opportunity to make a significant impact on the transportation revolution through advancements in autonomous driving.
- Exposure to cutting-edge technologies alongside top industry talent.
- Competitive compensation package.
- Perks include snacks, lunches, and organized fun activities.
Tags: Autonomous Driving CI/CD Computer Science CUDA Distributed Systems Docker GitHub GPU Grafana Java Jenkins Kubernetes Machine Learning ML infrastructure Model inference PhD Python PyTorch R R&D Research TensorFlow
Perks/benefits: Career development Competitive pay Equity / stock options Health care Salary bonus