AI-Factory Observability Principal
Tasks
- Correlate network behavior with training performance
- Define SLOs, alerting strategy, and on-call signal quality
- Define observability architecture
- Design unified data model for correlation
- Establish instrumentation and telemetry pipelines
- Generate incident summaries with LLMs
- Implement AIOps anomaly detection and forecasting
- Instrument Kubernetes for AI workloads
- Integrate BMS telemetry into observability platform
- Integrate EPMS power telemetry into observability platform
- Mentor engineers and set observability roadmap direction
- Monitor GPU accelerator clusters
- Own metrics, logs, traces, and events
Perks/Benefits
- N/A
Skills/Tech-stack
AIOps | Anomaly Detection | BACnet | BMS | DNP3 | Datacenter Engineering | Datadog | Dynatrace | ECC errors | EPMS | Elasticsearch | Events | Forecasting | gNMI | GPU Monitoring | Grafana | IEC 61850 | IPMI | InfiniBand | Infrastructure as Code | Kubernetes | Kubernetes Scheduler | LLM | Linux | Logs | Loki | Machine Learning | Metrics | Modbus | MQTT | NVIDIA DCGM | NVLink | Natural Language Processing | NetFlow | OPC UA | Observability | OpenTelemetry | OpenConfig | Predictive Maintenance | Prometheus | Python | RDMA | Redfish | RoCE | sFlow | SNMP | SRE | Splunk | Streaming telemetry | Telemetry pipelines | Tempo | Terraform | Thanos | Time Series | Time-series databases | Traces | VictoriaMetrics | WMI | “as-code”
Education
N/A