Lead Consultant - HPC DevOps Engineer

India - Chennai

AstraZeneca

AstraZeneca is a global, science-led biopharmaceutical business and our innovative medicines are used by millions of patients worldwide.

Job Title: Lead Consultant - HPC DevOps Engineer

Career level: E

Introduction to role

The Research Data & Analytics Team in R&D IT comprises skilled data and AI engineers and professionals who are dedicated to delivering innovative services and products. Our mission is to transform the way R&D discovers and develops medicine through data, analytics, and AI. We partner with scientific teams to deliver groundbreaking capabilities, products, and platforms that enable scientists to accelerate medicines that are safe and effective for patients.

The Scientific Computing Platform (SCP) is a foundational capability for HPC and scaled computing solutions. Embedded within the Research D&A organization, it is central to analytics products focused on computational chemistry, imaging, multi-omics, structural biology, data science, and AI. The SCP team is accountable for the end-to-end delivery of high-performance analytics products, with an emphasis on augmenting the HPC experience. We combine modern HPC with a powerful DevOps stack and cloud-native technologies to power research and development at AstraZeneca.

Accountabilities

The Observability Engineer will be responsible for designing, implementing, and managing monitoring and logging systems that ensure high availability, performance, and visibility across the platform’s infrastructure and applications. The ideal candidate has deep expertise in Prometheus, Grafana, ELK (Elastic Stack), or a similar stack, with a strong understanding of short-term and long-term storage solutions for metrics and logs. Equally important is leadership and coaching experience, to establish and encourage best practices throughout the platform.

What you'll do:

Prometheus: Metrics Collection and Storage: Design and manage the Prometheus architecture, including identifying high-cardinality metrics and troubleshooting performance issues. Configure short-term and long-term storage solutions using Prometheus-compatible systems (e.g., Thanos, Cortex, or VictoriaMetrics). Implement and optimize Prometheus exporters for collecting custom application metrics. Establish alerting rules in Prometheus and route notifications with Alertmanager.

Grafana: Visualization and Dashboarding: Develop and maintain Grafana dashboards for real-time observability. Integrate Grafana with other systems for unified visualization. Identify key metrics and insights through dashboards for both internal and external consumption.

Logging: Management and Insights: Set up and manage logging solutions, and develop relevant dashboards and queries to provide actionable insights. Integrate logging solutions with other observability tools for cohesive monitoring.

Cross-Tool Integration: Implement integrations between Prometheus, Grafana, and logging solutions to create a unified observability platform. Design solutions for correlation of metrics and logs to streamline root cause analysis.

Performance Tuning and Maintenance: Monitor the performance of observability tools and optimize resource utilization. Conduct regular upgrades and maintenance of all observability components.

Collaboration and Documentation: Work with SCP teams and users to define monitoring and logging requirements. Provide leadership and coaching on observability best practices while aiming for simplification. Focus on offering observability as an easy-to-consume service for other teams and users. Document observability architecture, workflows, and troubleshooting guides.

Essential Skills/Experience

Technical skills

  • Prometheus: Expertise in Prometheus setup, scaling, and federation. Knowledge of Thanos, Cortex, or VictoriaMetrics for long-term storage. Hands-on experience with PromQL for writing complex queries.

  • Grafana: Proficiency in creating dashboards and integrating with multiple data sources.

  • Logging: In-depth experience with ELK, Splunk, Loki, or similar, including their query languages and dashboarding.

  • Infrastructure: Hands-on experience managing observability infrastructure in Kubernetes, Docker, or other container technologies.

  • Scripting and Automation: Proficiency in Python, Bash, or similar scripting languages. Experience with Infrastructure as Code tools like Terraform or Ansible.

Soft skills

  • Strong problem-solving and analytical abilities.

  • Excellent communication and collaboration skills to work across teams and with end users.

  • Ability to streamline complex processes and requirements into simple and elegant solutions.

  • Ability to document complex systems clearly and concisely.

Desirable Skills/Experience

  • Familiarity with other observability tools (e.g., Loki, VictoriaMetrics).

  • Certifications: Prometheus Certified Associate.

When we put unexpected teams in the same room, we unleash bold thinking with the power to inspire life-changing medicines. In-person working gives us the platform we need to connect, work at pace and challenge perceptions. That's why we work, on average, a minimum of three days per week from the office. But that doesn't mean we're not flexible. We balance the expectation of being in the office while respecting individual flexibility. Join us in our unique and ambitious world.

At AstraZeneca, our work has a direct impact on patients by transforming our ability to develop life-changing medicines. We empower the business to perform at its peak by combining groundbreaking science with leading digital technology platforms and data. Here you can innovate, take ownership, explore new solutions, experiment with innovative technology, and tackle challenges in a modern technology environment.

Ready to make a difference? Apply now!

Date Posted

18-Feb-2025

AstraZeneca embraces diversity and equality of opportunity.  We are committed to building an inclusive and diverse team representing all backgrounds, with as wide a range of perspectives as possible, and harnessing industry-leading skills.  We believe that the more inclusive we are, the better our work will be.  We welcome and consider applications to join our team from all qualified candidates, regardless of their characteristics.  We comply with all applicable laws and regulations on non-discrimination in employment (and recruitment), as well as work authorization and employment eligibility verification requirements.
