Senior Cloud Operations Engineer - PyTorch

San Francisco, CA, United States

Full Time, Senior-level / Expert, USD 125K - 165K

The Linux Foundation

Helping open technology projects build world-class open source software, communities, and companies.

Posted 7 hours ago

Company Description

The Linux Foundation is a driving force in fostering open-source collaboration and supporting communities across a range of projects, including PyTorch. We're dedicated to enhancing and expanding our infrastructure to meet the growing demands of PyTorch and related AI projects. We are seeking a Senior Cloud Operations Engineer who will focus on the infrastructure operations of the PyTorch project, automating processes, optimizing cloud-native tools, and ensuring a robust and scalable cloud environment.

Job Description

The Senior Cloud Operations Engineer will play a crucial role in managing and optimizing our multi-cloud infrastructure and DevOps practices. This position is essential for maintaining and scaling our cloud operations across multiple cloud provider platforms and accelerator technologies. The ideal candidate will combine deep expertise in cloud technologies, hardware accelerators, and DevOps methodologies to ensure our infrastructure remains robust, efficient, and future-proof.

Responsibilities:

Cloud Infrastructure Management

Design and manage multi-cloud environments across AWS, GCP, and Azure
Optimize instance selection and utilization across various compute types including AMD and Intel CPU-based instances
Configure and manage GPU-accelerated instances (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
Implement and maintain infrastructure-as-code using Terraform and other IaC tools
Optimize cloud resource utilization and implement FinOps practices for cost management
Design and implement high-availability solutions across multiple cloud providers

CI/CD and DevOps

Design, implement, and maintain CI/CD pipelines using GitHub Actions
Configure and manage both GitHub-hosted and self-hosted runners
Implement and maintain non-blocking and out-of-tree CI jobs
Design and implement matrix testing strategies across different hardware configurations
Develop and maintain automated testing frameworks for various testing types (unit, integration, performance)
Implement best practices for version control management and branching strategies
Work within agile methodologies and Scrum practices

Performance Optimization and Testing

Develop and implement performance testing frameworks for various hardware accelerators
Optimize workload distribution across different types of compute instances
Implement automated performance regression testing
Design and maintain benchmarking systems for various hardware configurations

Infrastructure Security and Monitoring

Implement security best practices across multi-cloud environments
Develop comprehensive monitoring solutions using cloud-native tools
Participate in on-call rotations supporting operations and incident response
Establish and maintain escalation procedures and resolution processes
Manage access control and security policies across cloud platforms

Qualifications

Required:

Bachelor's degree in Computer Science, Engineering, or related field
7+ years of experience in cloud operations with extensive multi-cloud expertise (AWS, GCP, Azure)
Demonstrated experience with GPU computing (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
Strong knowledge of CPU architectures and instance type optimization (AMD, Intel)
Advanced experience with GitHub Actions, including custom runner configuration and management
Expertise in implementing non-blocking and out-of-tree CI jobs
Strong background in version control systems and branching strategies
Experience with agile methodologies and Scrum practices
Proficiency in infrastructure-as-code tools, particularly Terraform
Strong scripting abilities (Python, Bash, PowerShell, TypeScript)
Experience with containerization and orchestration (Docker, Kubernetes)
Demonstrated experience in implementing automated testing frameworks

Preferred:

Experience optimizing workloads across different hardware accelerators
Background in performance testing and optimization
Contributions to open-source projects
Experience mentoring other engineers
Background in machine learning infrastructure
Experience with Datadog

Benefits:

Competitive salary
Comprehensive health, dental, and vision insurance
Flexible PTO policy
Remote work environment
Professional development opportunities
401(k) matching
Home office stipend

Additional Information

Open to US-based candidates only. West Coast candidates are preferred.

Salary $125,000 - $165,000 USD

About Us:

We maintain a predominantly remote workforce and are committed to hiring top-notch talent. We are passionate about providing a flexible and supportive work culture. Our team values collaboration, innovation, and continuous learning. We embrace diversity and believe in creating an inclusive environment where all team members can thrive.

The Linux Foundation is an Equal Opportunity Employer.