Senior ML Engineer — Distributed LLM Training Infrastructure
Tasks
- Architect distributed LLM training infrastructure
- Architect end-to-end training pipelines from data ingestion to deployment
- Build containerized systems with Kubernetes and Docker
- Build benchmarking and performance regression suites
- Build fault-tolerant checkpointing and node-failure recovery
- Build resource managers, schedulers, and profiling tools
- Design model sharding for large parameter spaces
- Develop communication backends for unreliable networks
- Develop observability and monitoring tools
- Extend distributed training frameworks for permissionless multi-party training
- Implement CI/CD pipelines for distributed infrastructure
- Implement gradient compression, quantization, and sparsification
- Implement mixed precision, gradient accumulation, and fused kernels
- Implement model parallelism and efficient gradient synchronization
- Optimize GPU memory and operations with CUDA kernels
- Profile and optimize throughput, memory, and communication patterns
Perks/Benefits
- N/A
Skills/Tech-stack
AWS | Azure | C++ | CI/CD | CUDA | Checkpointing | DDP | DeepSpeed | Docker | FSDP | FairScale | GCP | Gradient Accumulation | Gradient Compression | Kubernetes | MPI | Megatron-LM | Mixed Precision | Model Parallelism | NCCL | PyTorch | Python | Quantization | RPC | Sparsification | TorchTitan