Instrumentation and Observability Engineer
Remote Worker
AdaptHealth
Position Summary:
The Instrumentation and Observability Engineer designs, implements, and maintains the tools, frameworks, and practices that provide visibility into complex distributed systems. This role is critical in creating systems that collect data about how applications and infrastructure are performing, making it possible to troubleshoot issues, optimize performance, and understand user experiences. The ideal candidate will bridge the gap between software development, operations, and product teams to build a comprehensive observability strategy.
Essential Functions and Job Responsibilities:
Observability Infrastructure
- Design and implement observability platforms that collect, process, store, and visualize telemetry data (metrics, logs, traces)
- Develop and maintain instrumentation for applications and infrastructure components.
- Create standardized approaches for collecting meaningful telemetry data across all systems.
- Establish data retention policies and scaling strategies for observability data storage.
- Ensure high availability and performance of the observability stack.
Monitoring and Alerting
- Design and implement effective alerting systems that identify critical issues with minimal noise.
- Create intuitive dashboards that provide real-time visibility into system health and performance.
- Develop SLIs (Service Level Indicators) and help teams establish appropriate SLOs (Service Level Objectives)
- Configure anomaly detection systems to identify unusual patterns in system behavior.
- Establish on-call procedures and runbooks for incident response.
Engineering Support
- Partner with development teams to integrate instrumentation into new and existing applications.
- Provide guidance on best practices for logs, metrics, and distributed tracing implementations.
- Troubleshoot complex production issues using observability tools and data.
- Support incident management with data-driven insights and analysis.
- Create documentation for observability systems and train teams on effective usage.
Platform Development
- Develop custom tools and integrations for the observability platform as needed.
- Automate configuration management for monitoring agents and collectors.
- Build data pipelines for processing and enriching telemetry data.
- Create self-service tooling for teams to manage their observability configurations.
- Continuously evaluate and integrate new technologies into the observability platform.
- Maintain patient confidentiality and function within the guidelines of HIPAA.
- Complete assigned compliance training and other educational programs as required.
- Maintain compliance with AdaptHealth’s Compliance Program.
- Perform other related duties as assigned.
- Assist in vendor contract reviews with managers and legal.
Competency, Skills, and Abilities:
- Strong analytical and problem-solving abilities
- Excellent communication skills to explain complex technical concepts.
- Collaborative mindset with ability to work across teams.
- Self-motivated with a passion for learning new technologies.
- Detail-oriented with a focus on system reliability and performance.
- Ability to thrive in a fast-paced environment and manage competing priorities.
- Proficiency in at least one programming language (Go, Python, Java, etc.)
- Experience with distributed tracing systems (Jaeger, Zipkin, OpenTelemetry)
- Solid understanding of cloud infrastructure and containerization (Kubernetes, Docker)
- Knowledge of log aggregation and analysis systems
- Demonstrated ability to troubleshoot complex distributed systems.
- Experience with cloud-native observability solutions (AWS CloudWatch, Google Cloud Operations, Azure Monitor)
- Knowledge of statistical analysis and data visualization techniques
- Experience with service mesh technologies (Istio, Linkerd)
- Background in high-scale, distributed systems
- Understanding of machine learning techniques for anomaly detection and predictive analytics
- Familiarity with GitOps practices for infrastructure and configuration management
- Experience with real user monitoring (RUM) and synthetic monitoring techniques
- Observability Platforms: Prometheus, Grafana, Datadog, New Relic, Elastic Stack, Splunk
- Distributed Tracing: Jaeger, Zipkin, OpenTelemetry
- Programming: Python, Go, Java, or similar languages
- Infrastructure: Kubernetes, Docker, cloud platforms (AWS, GCP, Azure)
- Data Processing: SQL, streaming data, time-series databases
- Automation: CI/CD pipelines, Infrastructure as Code
Education and Experience Requirements:
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
- 3+ years of experience in software engineering, DevOps, or site reliability engineering
- Strong experience with observability platforms (such as Prometheus, Grafana, Datadog, New Relic, Elastic Stack, Splunk, or similar tools)
Physical Demands and Work Environment:
- Must be able to bend, stoop, stretch, stand, and sit for extended periods.
- Ability to perform repetitive motions of wrists, hands, and/or fingers due to extensive computer use.
- The work environment may be stressful at times, as overall office activities and work levels fluctuate.
- Subject to prolonged periods of sitting and exposure to computer screens.
- Ability to utilize a personal computer and other office equipment.
- Must be able to lift 30 pounds as needed.
- Physical and mental ability to analyze, solve problems and lead others.
- Excellent ability to communicate both verbally and in writing.