Research Engineer - LLM Training Infrastructure - Seed Infra
San Jose, California, United States
USD 244K-450K | Mid-level | Full Time
Tasks
- Analyze exascale training performance bottlenecks
- Conduct LLM training infrastructure research and development
- Design and optimize distributed LLM training strategies
- Improve system reliability and resilience
- Optimize network scheduling and GPU memory management
- Translate research ideas into scalable production AI infrastructure
Perks/Benefits
- N/A
Skills/Tech-stack
Checkpointing | Data-Driven Optimization | Data-driven | Distributed Training | Fault Tolerance | GPU memory | GPU memory management | Language Models | Large Language Models | Memory Management | Network Optimization | Parallelism | Performance Profiling | Reinforcement Learning | Scheduling
Education
N/A