Senior Software Engineer, AI Runtime
Mountain View, California; San Francisco, California
USD 160K-225K | Senior-level | Full Time
Tasks
- Build fault tolerance and recovery mechanisms
- Build scalable, high-throughput, resilient training systems
- Design APIs, CLI, and developer experience for training jobs
- Develop GPU scheduling and dynamic routing
- Drive architecture for a managed GPU training platform
- Implement checkpoint and restore for long-running jobs
- Implement multi-node orchestration and distributed parallelism
- Improve resilience and observability for multi-node jobs
- Lead end-to-end engineering from design to production rollout
- Mentor engineers and lead technical design reviews
- Optimize high throughput data loading
Perks/Benefits
- N/A
Skills/Tech-stack
Algorithms | Checkpointing | Collective communication | Data Structures | Data parallelism | DeepSpeed | Distributed Systems | Distributed Training | Dynamic routing | FSDP | Fault Tolerance | GPU scheduling | High-Performance Computing | InfiniBand | Megatron | NVLink | Pipeline parallelism | PyTorch | Python | RoCE | Sequence parallelism | System design | Tensor Parallelism