Director of Infrastructure

HQ, United States

Rogue Fitness

Rogue Fitness is the leading manufacturer of strength and conditioning equipment and an official sponsor of the CrossFit Games, Arnold Classic, and USA Weightlifting. From power racks, rigs, and barbells to shoes, apparel & accessories, our...

Job Description:

Overview

Rogue Fitness is accelerating AI across manufacturing, warehousing, and e‑commerce and needs a hands-on leader to build the MLOps platform, enhance SRE, and keep our hybrid infrastructure state‑of‑the‑art. You’ll own everything from GPU clusters and data pipelines to uptime, security, and factory networks, partnering with Technology and Solutions Directors to deploy AI at speed. Your mandate: deliver measurable gains in throughput, cost savings, and customer experience to keep Rogue ahead of the pack.

Responsibilities

  • AI Infrastructure:

    • Build and lead the cross‑functional team that designs, scales, and operates Rogue’s production MLOps platform—covering data pipelines, model versioning, automated deployments, and real‑time monitoring across on‑prem and cloud GPU clusters

    • Own reliability, performance, and cost management for all AI compute and storage—capacity planning, incident response, and continuous optimization to meet SLA/SLO targets

  • Site Reliability:

    • Direct the SRE organization that safeguards roguefitness.com and all internal apps—defining SLIs/SLOs, automating CI/CD pipelines, and ensuring release velocity without sacrificing stability

    • Drive proactive reliability engineering: establish unified observability, conduct capacity and chaos testing, and lead rapid incident response to keep MTTR low and uptime above targets

    • Own continuous improvement of performance, scalability, and cost efficiency—partnering with product and infrastructure teams to embed reliability best practices from design through deployment

  • Traditional Infrastructure:

    • Oversee end‑to‑end operations of on‑premises and cloud infrastructure—Windows/Linux servers, storage, backups, DR, networks, and collaboration platforms—managed through infrastructure‑as‑code and real‑time dashboards

    • Lead lifecycle planning and execution for upgrades, migrations, and capacity expansions, enforcing disciplined change control, budget stewardship, and clear communication to stakeholders

    • Establish and monitor service performance and security standards (availability, latency, compliance) while mentoring engineering staff and aligning roadmaps with business objectives

Required Qualifications

  • Master’s degree in Computer Science, Electrical Engineering, or a related technical field

  • 10+ years of experience in hybrid infrastructure, with 3+ years as a manager

  • 4+ years running production MLOps pipelines

  • 4+ years leading SRE/DevOps practices: CI/CD, metrics, rapid rollbacks

  • Hands-on experience with Azure/GCP, Windows/AD, Google Workspace, virtualization, Terraform/Bicep, and modern observability

  • Experience leading networking and security teams: VLANs, firewalls, Zero‑Trust, incident response

Preferred Qualifications

  • Experience managing MLOps and SRE teams at highly dynamic companies

  • Experience in online retail, manufacturing, and/or warehousing companies

  • OT experience: Modbus, segmented VLANs, factory networks

  • Experience tuning edge‑security services such as Cloudflare, Google Cloud Armor, or AWS WAF

By applying to Rogue, regardless of the platform you choose to use, you are agreeing to Rogue's preferred methods of communication (i.e. text message). Submitting an application, through whatever online forum is ultimately used, constitutes a knowing and voluntary agreement to send and receive text messages during the recruitment process.

Tags: AWS Azure CI/CD Computer Science CX Data pipelines DevOps Engineering GCP Google Cloud GPU Linux Machine Learning ML infrastructure MLOps Pipelines Security Terraform Testing

Region: North America
Country: United States
