AI Platform Engineer, Training and Inference
Tasks
- Assemble context and construct prompts before inference
- Automate Flyte DAG through to canary rollout
- Build RL training infrastructure
- Build and operate LLM inference mesh
- Compose vLLM, SGLang, and NVIDIA Triton deployment graph
- Configure Plasma zero-copy memory sharing
- Configure Ray Data GPU direct streaming
- Configure TorchTrainer and DDP
- Configure fractional GPU allocation
- Define Flyte workflows for RL pipelines
- Define relative quality gates
- Design model routing layer
- Enable continuous batching
- Execute shadow, A/B, and canary rollouts
- Implement capability, version, and tenant routing
- Implement cost-aware fallback between SLMs and cloud LLMs
- Implement golden-signal auto-rollback
- Implement per-engine autoscaling
- Implement spot preemption recovery
- Implement vector similarity search
- Integrate RAG retrieval in inference mesh
- Integrate Ray RLlib or PPO/GRPO loops
- Integrate warm-start fine-tuning pipelines
- Integrate warm-start retraining
- Manage KubeRay on GKE
- Manage checkpoint lifecycle
- Manage replay buffer persistence on GCS
- Operate Plasma distributed object store
- Operate distributed training with Ray Train
- Operate model promotion lifecycle, quality gates, and canary
- Optimize inference performance
- Own Ray ecosystem end-to-end
- Run retrain pipeline with drift detection
- Set up NCCL multi-node training
- Tune KV cache block sizes
- Tune Ray Core task and actor scheduling
Skills/Tech-stack
ANN indexing | BF16 | DDP | Embeddings | FP8 | FSDP | Flyte | GCS | GKE | GRPO | Gradient Checkpointing | k6 | KubeRay | MLflow | Mixed Precision | NCCL | NVIDIA Triton | ONNX | PGVector | PPO | Plasma | PyTorch | Python | Qdrant | RAG | RLHF | Ray Core | Ray Data | Ray RLlib | Ray Serve | Ray Train | S3 | SGLang | Similarity Search | TensorRT | Torch | TorchTrainer | vLLM | Vector similarity | Vector similarity search