Lead Site Reliability Engineer

Los Angeles, CA


Full Time | Senior-level / Expert | USD 146K - 271K* (est.)

DISQO

With DISQO's CX Platform, go beyond with Brand Lift, Outcomes Lift, Market Research, and Surveys to connect what customers say to what they do.

Posted 17 hours ago

DISQO’s mission is to build the world’s most trusted ad measurement platform that fuels brand growth. The world’s largest brands, agencies, and media companies trust DISQO for expert insight and AI-driven intelligence about their advertising performance across all platforms. We capture people’s sentiments and journeys, connecting them with the brands they value and the media they consume. With this identity-based approach, brands gain more accurate and authentic insight so they can create more meaningful interactions.
When you join DISQO Nation, you join a community that values trust, transparency and innovation. We invest in our employees and apply a bottom-up management approach, rooted in the concept of servant leadership. We approach each day eager to learn, grow, and make a lasting impact. Best of all, we have fun while doing it!
About the Role:
We are seeking an experienced Lead Site Reliability Engineer to join our engineering team and drive the reliability, scalability, and performance of our production systems through innovative use of AI and automation. In this role, you will lead SRE initiatives, mentor team members, and leverage AI technologies to enhance operational excellence, predictive maintenance, and intelligent automation across our infrastructure.
Key Responsibilities:

Technical Leadership:

Design and implement comprehensive monitoring, alerting, and observability solutions, leveraging AI for intelligent anomaly detection and root cause analysis
Lead incident response efforts using AI-assisted diagnostics and automated remediation, conduct post-mortems, and drive systemic improvements
Develop and maintain service level objectives (SLOs) and error budgets with AI-powered predictive analytics to forecast reliability risks
Architect and implement intelligent automation solutions for deployment, scaling, and infrastructure management using machine learning models
Drive capacity planning and performance optimization using AI forecasting models and predictive analytics

AI-Enhanced SRE Leadership:

Implement and maintain AI-powered incident prediction and prevention systems
Design intelligent alerting systems that reduce noise and provide contextual insights using natural language processing and machine learning
Develop AI-driven capacity planning models that predict resource needs and optimize cost efficiency
Build and maintain chatbots and AI assistants for operational tasks, documentation search, and incident triage
Implement automated root cause analysis using AI correlation engines and log analysis

Team Leadership & Collaboration:

Mentor junior SREs on integrating AI tools and practices into traditional SRE workflows
Partner with engineering teams to embed AI-enhanced reliability principles into the software development lifecycle
Lead cross-functional initiatives to implement AI-driven operational improvements
Collaborate with data science teams to develop custom AI models for operational use cases
Participate in on-call rotations while developing AI systems to minimize toil and improve response efficiency

Strategic Initiatives:

Develop and execute an SRE roadmap aligned with business objectives and technological advancement
Evaluate and implement new AI tools and technologies to improve system reliability, security and operational efficiency
Drive adoption of AI-powered engineering and predictive failure testing
Establish metrics and reporting using AI analytics to demonstrate the business value of intelligent reliability investments

Required Qualifications:

6+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure-focused roles
2+ years of experience leading technical teams or initiatives
Strong experience with AI/ML tools and frameworks applied to operational use cases (anomaly detection, predictive analytics, NLP)
Hands-on experience implementing AI-powered monitoring, alerting, and automation solutions
Strong programming skills in Python with experience in AI/ML libraries
Extensive experience with cloud platforms (AWS, GCP) and their AI/ML services
Knowledge of prompt engineering, LLM integration, and building AI-powered operational tools
Proficiency with infrastructure as code and configuration management with AI-enhanced workflows
Experience with time series analysis, statistical modeling, and predictive analytics for infrastructure metrics
Deep understanding of monitoring and observability tools enhanced with AI capabilities
Experience with CI/CD pipelines incorporating AI-driven quality gates and automated decision making
Strong knowledge of networking, distributed systems, and database technologies
Expert-level knowledge in the following domains: AWS (core services, networking, compute, databases, storage, etc.), Terraform, and Kubernetes / Karpenter / Helm
Strong experience building in-house observability platforms, including OpenTelemetry, Loki, Grafana, Prometheus, AWS CloudWatch, and AWS X-Ray or Jaeger
Experience with Argo CD / Argo Workflows is a big plus
Bachelor’s degree in Computer Science, Engineering, Data Science, or equivalent practical experience

Preferred Qualifications:

Advanced experience with large language models (LLMs) for operational documentation, code generation, and incident response
Experience with automated incident response systems using AI decision engines
Experience with microservices architecture and intelligent service mesh management
Familiarity with AI-powered security tools and anomaly detection for infrastructure protection
Experience building and maintaining AI-driven dashboards and reporting systems
Experience with AI-powered cost optimization and resource right-sizing tools
Certification in relevant cloud platforms

At DISQO, we pride ourselves on having a positive, performance-oriented workplace that includes a flexible hybrid approach, competitive medical benefits, and an amazing vacation policy. Read more about our culture on Glassdoor.
You can learn more about what’s happening at DISQO by visiting the DISQO Developer Blog or the DISQO Company Blog.
Perks & Benefits:
100% covered Medical/Dental/Vision for employee, competitive dependent coverage
Equity
401K
Generous PTO policy
Flexible workplace policy
Team offsites, social events & happy hours
Life Insurance
Health FSA
Commuter FSA (for hybrid employees)
Catered lunch and fully stocked kitchen
Paid Maternity/Paternity leave
Disability Insurance
Travel Assistance Program
24/7 Counseling Services offered to Employees
Note: The benefits noted above are for full-time, US-based employees only.
DISQO is an equal opportunity employer. Discovery, innovation, and growth are possible when we open ourselves to new possibilities, perspectives, and approaches. That’s why, at DISQO, we welcome, support, and empower individuals from diverse backgrounds. Exceptional teams are rooted in extraordinary people, each with a unique story and a compelling set of skills. DISQO does not discriminate against employees based on race, color, religion, sex, national origin, gender identity or expression, age, disability, pregnancy (including childbirth, breastfeeding, or related medical condition), genetic information, protected military or veteran status, sexual orientation, or any other characteristic protected by applicable federal, state or local laws.
*Recruiting firms that submit resumes to DISQO without first entering into a written contract will not be entitled to any compensation on candidates referred by that firm.

* Salary range is an estimate based on our AI, ML, Data Science Salary Index 💰
