Senior Software Engineer, AI Runtime
Mountain View, California; San Francisco, California
USD 160K-225K | Senior-level | Full Time
Tasks
- Build resilience and observability for multi-node jobs
- Contribute to core systems for accelerators and new regions
- Develop APIs, CLI, and developer experience for training jobs
- Drive architecture and evolution of the managed GPU training platform
- Lead end-to-end engineering from design to production rollout
- Mentor engineers and lead design reviews
- Optimize GPU efficiency and training performance
- Solve large-scale training orchestration and distributed parallelism problems
Perks/Benefits
- N/A
Skills/Tech-stack
Algorithms | Checkpointing | Collective communication | Data Structures | Data parallelism | DeepSpeed | Distributed Systems | FSDP | Fault Tolerance | GPU scheduling | High-Performance Computing | InfiniBand | Megatron | NVLink | Parallelism | Pipeline parallelism | PyTorch | Python | RoCE | Sequence parallelism | System design | Tensor Parallelism