Principal Site Reliability Engineer - Enterprise AI Platform
US, CA, Santa Clara, United States
NVIDIA
NVIDIA is the inventor of the GPU, and the advances it develops are driving AI and high-performance computing forward. NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Join the team and see how you can make a lasting impact on the world.
NVIDIA is looking to hire a deeply technical and creative Site Reliability Engineer to build, support, and maintain the next generation of AI-powered enterprise products that improve engineering efficiency and data security and power our product development. This role offers the opportunity to collaborate with the Cloud and AI/ML workforce in a dynamic and agile working environment.
What you will be doing:
Collaborate on translating business objectives into actionable plans.
Address operational challenges, automate processes, and iterate for efficiency.
Tackle systemic reliability issues with multi-functional teams.
Monitor, optimize, and manage system performance and resources.
Institute validated practices for reliability, remediations, and troubleshooting.
Design, deploy, and automate production support, documenting essential knowledge.
Navigate intricate tasks with a deep understanding of SRE principles.
Lead cross-organizational projects from inception to completion.
Mentor and train junior engineers for professional development.
Serve as a subject matter expert in core team functions.
What we need to see:
15+ years of working experience in cloud, platform, or SRE roles.
A Bachelor's or Master's degree in Engineering, Computer Science, or a related field, or equivalent experience.
Proficient in one or more programming languages: Python, Go, Perl, or Ruby.
Hands-on experience handling and scaling distributed systems in public, private, or hybrid cloud and on-prem environments, 24x7x365.
Has delivered software with a full understanding of deploying applications in Kubernetes clusters, including GPU and CPU pod scheduling (with the ability to understand on-prem environments).
Has maintained and managed microservices relating to AI platforms (Inference, Training, Evaluation, Ingestion).
Hands-on experience in deploying, supporting, and supervising new and existing services, platforms, and application stacks.
Experience with CI/CD systems such as Jenkins, GitHub Actions, etc.
Background with Infrastructure as Code (IaC) methodologies and relevant tools.
Extensive experience working with MS Windows Server and/or Linux operating systems.
Solid communication skills, demonstrating the ability to comprehend and articulate technical issues to a non-technical audience.
Ways to stand out from the crowd:
Cloud expertise in Azure and AWS.
Passionate and experienced in AI methodologies.
Strong background in software design and development.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and enjoy learning while having fun, then what are you waiting for? Apply today!
#LI-Hybrid
The base salary range is 248,000 USD - 391,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Tags: Agile, AWS, Azure, CI/CD, Computer Science, Distributed Systems, Engineering, GitHub, GPU, Jenkins, Kubernetes, Linux, Machine Learning, Perl, Python, Ruby, Security
Perks/benefits: Career development Equity / stock options