Director of Infrastructure
HQ, United States
Rogue Fitness
Rogue Fitness is the leading manufacturer in strength and conditioning equipment & an official sponsor of the CrossFit Games, Arnold Classic, and USA Weightlifting. From power racks, rigs, and barbells to shoes, apparel & accessories, our...
Job Description:
Overview
Rogue Fitness is accelerating AI across manufacturing, warehousing, and e-commerce and needs a hands-on leader to build the MLOps platform, enhance SRE, and keep our hybrid infrastructure state-of-the-art. You'll own everything from GPU clusters and data pipelines to uptime, security, and factory networks, partnering with Technology and Solutions Directors to deploy AI at speed. Your mandate: deliver measurable gains in throughput, cost savings, and customer experience to keep Rogue ahead of the pack.
Responsibilities
AI Infrastructure:
Build and lead the cross-functional team that designs, scales, and operates Rogue's production MLOps platform, covering data pipelines, model versioning, automated deployments, and real-time monitoring across on-prem and cloud GPU clusters
Own reliability, performance, and cost management for all AI compute and storage: capacity planning, incident response, and continuous optimization to meet SLA/SLO targets
Site Reliability:
Direct the SRE organization that safeguards roguefitness.com and all internal apps: defining SLIs/SLOs, automating CI/CD pipelines, and ensuring release velocity without sacrificing stability
Drive proactive reliability engineering: establish unified observability, conduct capacity and chaos testing, and lead rapid incident response to keep MTTR low and uptime above targets
Own continuous improvement of performance, scalability, and cost efficiency, partnering with product and infrastructure teams to embed reliability best practices from design through deployment
Traditional Infrastructure:
Oversee end-to-end operations of on-premises and cloud infrastructure (Windows/Linux servers, storage, backups, DR, networks, and collaboration platforms), managed through infrastructure-as-code and real-time dashboards
Lead lifecycle planning and execution for upgrades, migrations, and capacity expansions, enforcing disciplined change control, budget stewardship, and clear communication to stakeholders
Establish and monitor service performance and security standards (availability, latency, compliance) while mentoring engineering staff and aligning roadmaps with business objectives
Required Qualifications
Master's degree in Computer Science, Electrical Engineering, or a related technical field
10+ years of experience in hybrid infrastructure, with 3+ years as a manager
4+ years running production MLOps pipelines
4+ years leading SRE/DevOps practices: CI/CD, metrics, rapid rollbacks
Hands-on with Azure/GCP, Windows/AD, Google Workspace, virtualization, Terraform/Bicep, modern observability
Networking & security team lead: VLANs, firewalls, Zero Trust, incident response
Preferred Qualifications
Managed MLOps and SRE teams for highly dynamic companies
Experience in online retail, manufacturing, and/or warehousing companies
OT experience: Modbus, segmented VLANs, factory networks
Tuned edge-security services like Cloudflare, Google Cloud Armor, or AWS WAF
By applying to Rogue, regardless of the platform you choose to use, you are agreeing to Rogue's preferred methods of communication (i.e. text message). Submitting an application, through whatever online forum is ultimately used, constitutes a knowing and voluntary agreement to send and receive text messages during the recruitment process.
Tags: AWS Azure CI/CD Computer Science CX Data pipelines DevOps Engineering GCP Google Cloud GPU Linux Machine Learning ML infrastructure MLOps Pipelines Security Terraform Testing