Instrumentation and Observability Engineer

Remote Worker

AdaptHealth

With 672 locations in 47 states and over 37,000 home deliveries each day, AdaptHealth empowers patients to live their best lives.

Posted 2 months ago

Description

Position Summary:

The Instrumentation and Observability Engineer designs, implements, and maintains the tools, frameworks, and practices that provide visibility into complex distributed systems. This role is critical to building systems that collect data on how applications and infrastructure are performing, making it possible to troubleshoot issues, optimize performance, and understand user experiences. The ideal candidate will bridge the gap between software development, operations, and product teams to build a comprehensive observability strategy.

Essential Functions and Job Responsibilities:

Observability Infrastructure

Design and implement observability platforms that collect, process, store, and visualize telemetry data (metrics, logs, traces).
Develop and maintain instrumentation for applications and infrastructure components.
Create standardized approaches for collecting meaningful telemetry data across all systems.
Establish data retention policies and scaling strategies for observability data storage.
Ensure high availability and performance of the observability stack.

Monitoring and Alerting

Design and implement effective alerting systems that identify critical issues with minimal noise.
Create intuitive dashboards that provide real-time visibility into system health and performance.
Develop SLIs (Service Level Indicators) and help teams establish appropriate SLOs (Service Level Objectives).
Configure anomaly detection systems to identify unusual patterns in system behavior.
Establish on-call procedures and runbooks for incident response.

Engineering Support

Partner with development teams to integrate instrumentation into new and existing applications.
Provide guidance on best practices for logs, metrics, and distributed tracing implementations.
Troubleshoot complex production issues using observability tools and data.
Support incident management with data-driven insights and analysis.
Create documentation for observability systems and train teams on effective usage.

Platform Development

Develop custom tools and integrations for the observability platform as needed.
Automate configuration management for monitoring agents and collectors.
Build data pipelines for processing and enriching telemetry data.
Create self-service tooling for teams to manage their observability configurations.
Continuously evaluate and integrate new technologies into the observability platform.
Maintain patient confidentiality and function within the guidelines of HIPAA.
Complete assigned compliance training and other educational programs as required.
Maintain compliance with AdaptHealth’s Compliance Program.
Perform other related duties as assigned.
Assist in vendor contract reviews with managers and legal.

Competency, Skills, and Abilities:

Strong analytical and problem-solving abilities
Excellent communication skills to explain complex technical concepts.
Collaborative mindset with ability to work across teams.
Self-motivated with a passion for learning new technologies.
Detail-oriented with a focus on system reliability and performance.
Ability to thrive in a fast-paced environment and manage competing priorities.
Proficiency in at least one programming language (Go, Python, Java, etc.)
Experience with distributed tracing systems (Jaeger, Zipkin, OpenTelemetry)
Solid understanding of cloud infrastructure and containerization (Kubernetes, Docker)
Knowledge of log aggregation and analysis systems
Demonstrated ability to troubleshoot complex distributed systems.
Experience with cloud-native observability solutions (AWS CloudWatch, Google Cloud Operations, Azure Monitor)
Knowledge of statistical analysis and data visualization techniques
Experience with service mesh technologies (Istio, Linkerd)
Background in high-scale, distributed systems
Understanding of machine learning techniques for anomaly detection and predictive analytics
Familiarity with GitOps practices for infrastructure and configuration management
Experience with real user monitoring (RUM) and synthetic monitoring techniques
Observability Platforms: Prometheus, Grafana, Datadog, New Relic, Elastic Stack, Splunk
Distributed Tracing: Jaeger, Zipkin, OpenTelemetry
Programming: Python, Go, Java, or similar languages
Infrastructure: Kubernetes, Docker, cloud platforms (AWS, GCP, Azure)
Data Processing: SQL, streaming data, time-series databases
Automation: CI/CD pipelines, Infrastructure as Code

Requirements

Education and Experience Requirements:

Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
3+ years of experience in software engineering, DevOps, or site reliability engineering
Strong experience with observability platforms (such as Prometheus, Grafana, Datadog, New Relic, Elastic Stack, Splunk, or similar tools)

Physical Demands and Work Environment:

Must be able to bend, stoop, stretch, stand, and sit for extended periods.
Ability to perform repetitive motions of wrists, hands, and/or fingers due to extensive computer use.
The work environment may be stressful at times, as overall office activities and work levels fluctuate.
Subject to prolonged periods of sitting and exposure to computer screens.
Ability to utilize a personal computer and other office equipment.
Must be able to lift 30 pounds as needed.
Physical and mental ability to analyze and solve problems and to lead others.
Excellent ability to communicate both verbally and in writing.
