Senior DevOps Engineer, IPP Sanity Engineering
US, CA, Santa Clara, United States
NVIDIA
NVIDIA is the inventor of the graphics processor, and its advances drive progress in artificial intelligence and high-performance computing. NVIDIA is looking for a Senior DevOps Engineer to work in IPP (Infrastructure, Planning and Process) Sanity Engineering and execute on NVIDIA product bringups. IPP is a core software infrastructure organization within NVIDIA. The group works with other groups across NVIDIA Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence and Driverless Cars, to meet their infrastructure needs. Its cloud services run almost half a million automated jobs per day on thousands of distributed datacenters, supporting the productivity of thousands of NVIDIA's software engineers worldwide.
The cloud hosts a heterogeneous mix of machines and devices running various operating systems (Windows/Linux/Android) on a multitude of hardware platforms, including both NVIDIA GPUs and Tegra processors. Are you passionate about distributed infrastructure and looking for a sophisticated workspace? Are you ready to build the next generation of cloud services for chip bringups, design creative solutions, and mine through data to uncover real problems and fix them? We look forward to onboarding a fun-loving person like you.
What you'll be doing
Lead end-to-end infrastructure bringup execution for new NVIDIA GPU products.
Have a thorough understanding of the NVIDIA GPU hardware and display driver stack, SBIOS, and VBIOS, and be able to enhance automation for farm-wide updates.
Solve complex problems on groundbreaking pre-release products: lead GPU product bringups (PCIe & Enterprise) in infrastructure, integrate GPU test suites into the infrastructure harness, and scale multi-site distributed infrastructure.
Optimize farm utilization of GPU resources by identifying the right set of regression test coverage.
Champion config automation using world-class configuration management and infrastructure automation (IaC) tools such as Chef, Puppet, Ansible, and Terraform.
Execute on bringup of specialized products used for accelerated computing and AI in fast-paced, critically important environments.
Own a service charter completely, acting as the responsible pilot in charge of the development, telemetry, and automation of the bringup infrastructure.
Automate and performance-tune regression test frameworks, and create self-healing/automated recovery solutions for multi-geo regression farms.
Engage consistently with collaborators and partner teams across organizations to onboard new products into CI/CD.
Execute seamlessly on multiple parallel bringups scoped in the NVIDIA product bringup landscape.
What we need to see
Bachelor's or Master's Degree in Computer Science or Software Engineering, or equivalent experience.
10+ years of relevant experience.
Hands-on coding and debugging experience, including cross-compiling source code on various platforms and triaging, root-causing, and resolving issues in the bringup infrastructure.
Familiarity with setup and maintenance of Linux and Windows (x64 and Arm) hosts, and of VM- and container-based environments.
Hands-on programming experience in one or more languages, including but not limited to Python (preferred) and Java.
Proficiency with Unix and Tcl shells is expected.
Experience with MySQL (NoSQL is a plus), including the ability to write complex queries.
Experience with version control systems such as Perforce and Git.
Demonstrable experience working with large-scale enterprise production systems. 7+ years of development and operations experience required.
Ways to stand out from the crowd
Background with automating bare metal and VM provisioning.
Prior knowledge of VM isolation for GPUs and NVIDIA Confidential Computing is a plus.
Experience with public clouds (AWS, GCP, Azure) and with VM and container virtualization technologies such as VMware, KVM, Hyper-V, Docker, and Kubernetes clusters.
Experience with debugging GPU performance issues, embedded device software development and automation, software driver development and CUDA/TensorRT applications.
With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world's most desirable employers. As a result, we have some of the most forward-thinking and hardworking people in the world working for us, and due to unprecedented growth, our exclusive engineering teams are rapidly growing. If you're a passionate, creative, driven, and autonomous engineer with a real passion for technology, we want to hear from you and would love to have you join the team.
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Tags: Ansible AWS Azure CI/CD Computer Science CUDA Deep Learning DevOps Docker Engineering GCP Git GPU Java Kubernetes Linux MySQL Puppet Python SQL TensorRT Terraform
Perks/benefits: Career development Competitive pay Equity / stock options Gear