Senior Software Engineer, HPC Platform Modernization

Foster City, CA

Zoox

Zoox is reinventing personal transportation with an all-electric, fully autonomous ride-hail vehicle, built for riders, not drivers.

Zoox is looking for an experienced Software Engineer to work on key new frameworks and infrastructure modernization for our custom High-Performance Computing (HPC) infrastructure and its supporting ecosystem of tools and services. Zoox HPC services combine industry-best scheduling and workload orchestration technologies, such as Ray.io and SLURM, with value-add workflows built specifically for Autonomous Vehicle development. These HPC services form the backbone of development workflows across all Zoox software teams, from data engineering, to training our AI models in Perception, Planner, and Prediction, to simulation, and more. You will take on a breadth of end-to-end responsibilities, including distributed system design, algorithmic job scheduling, and adaptive cloud scaling, in support of all of Zoox’s computational needs.

The position comes with a high degree of independence and the opportunity to help define Zoox’s compute scaling strategy, both technically and organizationally. You will work closely with stakeholders in Autonomy and Software teams to iterate on world-class developer experiences, incorporating the latest industry tools and best practices.

In this role, you will:

  • Evaluate new distributed system paradigms and technologies to meet Zoox’s ever-growing computational and storage needs.
  • Strike a balance between incremental improvements to Zoox’s existing in-house HPC infrastructure and greenfield services and abstractions.
  • Create production-grade web service APIs, SDKs, and other tools to provide a world-class developer experience for all of Zoox’s software teams.

Qualifications

  • 7+ years of experience
  • Experience with Ray.io, particularly Ray Core and Ray Data
  • Experience with Kubernetes, particularly for heterogeneous workloads and clusters
  • Experience with Ray.io and Kubernetes deployed on Amazon Web Services (AWS) or other similar cloud providers such as Azure or GCP
  • Proficiency with Python

Bonus Qualifications

  • Exposure to machine learning workloads (training, inference, data generation, etc.) from a compute infrastructure service provider perspective
  • Experience with Kubernetes or SLURM at scale (10k+ nodes)
  • Experience with the SLURM workload manager

Compensation

There are three major components to compensation for this position: salary, Amazon Restricted Stock Units (RSUs), and Zoox Stock Appreciation Rights. The salary range for this position is $210,000 to $275,000. A sign-on bonus may be offered as part of the compensation package. Compensation will vary based on geographic location and level. Leveling, as well as positioning within a level, is determined by a range of factors, including, but not limited to, a candidate's relevant years of experience, domain knowledge, and interview performance. The salary range listed in this posting is representative of the range of levels Zoox is considering for this position. Zoox also offers a comprehensive package of benefits including paid time off (e.g. sick leave, vacation, bereavement), unpaid time off, Zoox Stock Appreciation Rights, Amazon RSUs, health insurance, long-term care insurance, long-term and short-term disability insurance, and life insurance.

Category: Engineering Jobs

Tags: APIs AWS Azure Engineering GCP HPC Kubernetes Machine Learning Python

Perks/benefits: Career development Equity / stock options Health care Insurance Salary bonus Signing bonus

Region: North America
Country: United States
