Principal Engineer, AI Inference Reliability
Remote, California, United States; Sunnyvale, CA or Toronto, Canada
USD 175K-260K (estimate) | Senior-level | Full Time
Tasks
- Build dashboards and alerts
- Build reliability tooling
- Conduct postmortems
- Define SLOs and ensure alignment
- Define and drive reliability strategy
- Design and implement fault detection
- Design for debuggability
- Design for durability
- Design for redundancy
- Develop reliability best practices
- Implement failover
- Implement graceful degradation
- Implement recovery
- Implement throttling
- Inject distributed faults
- Lead incident management
- Measure reliability metrics
- Mentor engineers on reliability engineering
- Monitor service health metrics
- Perform root cause analysis
- Prevent repeat reliability incidents
- Run chaos testing
- Run load simulation
Perks/Benefits
- N/A
Skills/Tech-stack
Alerting | C++ | Chaos Testing | Distributed Fault Injection | Distributed Systems | Distributed Debugging | Failover | Fault Detection | Fault Injection | Go | Graceful Degradation | Incident Response | Load Testing | Monitoring | Observability | Postmortem | Python | Recovery | Rust | SLA | SLI | SLO | Throttling