Senior AI Engineer
Tasks
- Autoscale inference services and manage dynamic model loading
- Build logging and monitoring for training and inference
- Build model development toolchains and service pipelines
- Cache frequently used models
- Coordinate with DevOps and IT teams
- Create automated data preprocessing, feature engineering, and dataset versioning
- Create runbooks and perform root cause analysis
- Design user interfaces and APIs
- Develop GPU infrastructure and clusters for LLM training
- Develop dashboards for latency, accuracy, and drift detection
- Develop a machine learning platform management system
- Dynamically allocate inference resources based on demand
- Enable distributed model training and hyperparameter optimization
- Evaluate platform performance, scalability, and reliability
- Experiment with serverless architectures
- Implement A/B testing for model deployments
- Implement CI/CD pipelines for model deployment
- Implement access control and security
- Implement alerting for anomalies and performance degradation
- Implement fault-tolerant distributed LLM training
- Implement model memory loading and unloading
- Implement self-healing systems
- Integrate TPUs into training infrastructure
- Manage scheduling for multi-tenant GPU clusters
- Optimize GPU utilization for large-scale training
- Provide technical support to data scientists and engineers
- Refine metrics based on stakeholder feedback
- Support edge inference for lightweight models
- Troubleshoot training and inference issues
Skills/Tech-stack
A/B Testing | Autoscaling | Bash | C++ | CUDA | CUDA kernels | Caching | DALI | DeepSpeed | Distributed Training | Docker | Edge Computing | GPU Cluster | gRPC | Go | Grafana | Horovod | Hyperparameter Optimization | Kubernetes | Logging | Machine Learning | Model Monitoring | Multi-tenant systems | NCCL | Pipeline Parallelism | Prometheus | PyTorch Distributed | Python | Ray | Resource Scheduling | Serverless Architecture | Slurm | TPU | Tensor Parallelism | tf.data | Triton | Weights & Biases