Lead Site Reliability Engineer

Los Angeles, CA

DISQO

With DISQO's CX Platform, go beyond with Brand Lift, Outcomes Lift, Market Research, and Surveys to connect what customers say to what they do.

DISQO’s mission is to build the world’s most trusted ad measurement platform that fuels brand growth. The world’s largest brands, agencies, and media companies trust DISQO for expert insight and AI-driven intelligence about their advertising performance across all platforms. We capture people’s sentiments and journeys, connecting them with the brands they value and the media they consume. With this identity-based approach, brands gain more accurate and authentic insight so they can create more meaningful interactions.
When you join DISQO Nation, you join a community that values trust, transparency and innovation. We invest in our employees and apply a bottom-up management approach, rooted in the concept of servant leadership. We approach each day eager to learn, grow, and make a lasting impact. Best of all, we have fun while doing it!
About the Role:
We are seeking an experienced Lead Site Reliability Engineer to join our engineering team and drive the reliability, scalability, and performance of our production systems through innovative use of AI and automation. In this role, you will lead SRE initiatives, mentor team members, and leverage AI technologies to enhance operational excellence, predictive maintenance, and intelligent automation across our infrastructure.
Key Responsibilities:

Technical Leadership:

  • Design and implement comprehensive monitoring, alerting, and observability solutions, leveraging AI for intelligent anomaly detection and root cause analysis
  • Lead incident response efforts using AI-assisted diagnostics and automated remediation, conduct post-mortems, and drive systemic improvements
  • Develop and maintain service level objectives (SLOs) and error budgets with AI-powered predictive analytics to forecast reliability risks
  • Architect and implement intelligent automation solutions for deployment, scaling, and infrastructure management using machine learning models
  • Drive capacity planning and performance optimization using AI forecasting models and predictive analytics

AI-Enhanced SRE Leadership:

  • Implement and maintain AI-powered incident prediction and prevention systems
  • Design intelligent alerting systems that reduce noise and provide contextual insights using natural language processing and machine learning
  • Develop AI-driven capacity planning models that predict resource needs and optimize cost efficiency
  • Build and maintain chatbots and AI assistants for operational tasks, documentation search, and incident triage
  • Implement automated root cause analysis using AI correlation engines and log analysis

Team Leadership & Collaboration:

  • Mentor junior SREs on integrating AI tools and practices into traditional SRE workflows
  • Partner with engineering teams to embed AI-enhanced reliability principles into the software development lifecycle
  • Lead cross-functional initiatives to implement AI-driven operational improvements
  • Collaborate with data science teams to develop custom AI models for operational use cases
  • Participate in on-call rotations while developing AI systems to minimize toil and improve response efficiency

Strategic Initiatives:

  • Develop and execute an SRE roadmap aligned with business objectives and technological advancement
  • Evaluate and implement new AI tools and technologies to improve system reliability, security and operational efficiency
  • Drive adoption of AI-powered engineering and predictive failure testing
  • Establish metrics and reporting using AI analytics to demonstrate the business value of intelligent reliability investments

Required Qualifications:

  • 6+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure-focused roles
  • 2+ years of experience leading technical teams or initiatives
  • Strong experience with AI/ML tools and frameworks applied to operational use cases (anomaly detection, predictive analytics, NLP)
  • Hands-on experience implementing AI-powered monitoring, alerting, and automation solutions
  • Strong programming skills in Python with experience in AI/ML libraries
  • Extensive experience with cloud platforms (AWS, GCP) and their AI/ML services
  • Knowledge of prompt engineering, LLM integration, and building AI-powered operational tools
  • Proficiency with infrastructure as code and configuration management with AI-enhanced workflows
  • Experience with time series analysis, statistical modeling, and predictive analytics for infrastructure metrics
  • Deep understanding of monitoring and observability tools enhanced with AI capabilities
  • Experience with CI/CD pipelines incorporating AI-driven quality gates and automated decision making
  • Strong knowledge of networking, distributed systems, and database technologies
  • Expert-level knowledge in the following domains: AWS (core services, networking, compute, databases, storage, etc.), Terraform, Kubernetes / Karpenter / Helm
  • Strong experience building in-house observability platforms, including OpenTelemetry, Loki, Grafana, Prometheus, AWS CloudWatch, and AWS X-Ray or Jaeger
  • Experience with Argo CD / Argo Workflows is a big plus
  • Bachelor’s degree in Computer Science, Engineering, Data Science, or equivalent practical experience

Preferred Qualifications:

  • Advanced experience with large language models (LLMs) for operational documentation, code generation, and incident response
  • Experience with automated incident response systems using AI decision engines
  • Experience with microservices architecture and intelligent service mesh management
  • Familiarity with AI-powered security tools and anomaly detection for infrastructure protection
  • Experience building and maintaining AI-driven dashboards and reporting systems
  • Experience with AI-powered cost optimization and resource right-sizing tools
  • Certification in relevant cloud platforms 
At DISQO, we pride ourselves on having a positive, performance-oriented workplace that includes a flexible hybrid approach, competitive medical benefits, and an amazing vacation policy. Read more about our culture on Glassdoor.
You can learn more about what’s happening at DISQO by visiting the DISQO Developer Blog or the DISQO Company Blog.
Perks & Benefits:

  • 100% covered Medical/Dental/Vision for employee, competitive dependent coverage
  • Equity
  • 401K
  • Generous PTO policy
  • Flexible workplace policy
  • Team offsites, social events & happy hours
  • Life Insurance
  • Health FSA
  • Commuter FSA (for hybrid employees)
  • Catered lunch and fully stocked kitchen
  • Paid Maternity/Paternity leave
  • Disability Insurance
  • Travel Assistance Program
  • 24/7 Counseling Services offered to Employees

Note: The benefits noted above are for full-time, US-based employees only.
DISQO is an equal opportunity employer. Discovery, innovation, and growth are possible when we open ourselves to new possibilities, perspectives, and approaches. That’s why, at DISQO, we welcome, support, and empower individuals from diverse backgrounds. Exceptional teams are rooted in extraordinary people, each with a unique story and a compelling set of skills. DISQO does not discriminate against employees based on race, color, religion, sex, national origin, gender identity or expression, age, disability, pregnancy (including childbirth, breastfeeding, or related medical condition), genetic information, protected military or veteran status, sexual orientation, or any other characteristic protected by applicable federal, state or local laws.
*Recruiting firms that submit resumes to DISQO without first entering into a written contract will not be entitled to any compensation on candidates referred by that firm.

Tags: Architecture AWS Chatbots CI/CD Computer Science DevOps Distributed Systems Engineering GCP Helm LLMs Machine Learning Microservices ML models NLP Pipelines Predictive Maintenance Prompt engineering Python Security Statistical modeling Statistics Testing

Perks/benefits: Career development Competitive pay Flex vacation Health care Parental leave Team events Transparency

Region: North America
Country: United States
