AI Infrastructure Engineer - Large-Scale Cluster Track
Tasks
- Build a Kubernetes-based AI training platform
- Ensure high availability and capacity planning
- Implement checkpointing for training resumption
- Integrate deep learning frameworks for efficient cluster execution
- Maintain Linux drivers, CUDA, and cuDNN
- Maintain NCCL communication libraries
- Manage GPU cluster operations
- Monitor GPUs, the RDMA network, and storage end to end
- Optimize dataloader performance
- Optimize distributed training startup workflow
- Optimize the scheduler for resource isolation and priority scheduling
- Optimize training performance across the software and hardware layers
- Troubleshoot distributed training failures
Perks/Benefits
- N/A
Skills/Tech-stack
Automation tools | CUDA | cuDNN | Ceph | containerd | DCGM | DDP | DeepSpeed | Docker | FSDP | GPFS | Golang | Golang Automation Tools | Grafana | InfiniBand | Kubeflow | Kubernetes | Kubernetes Operator | Language model training | Large Language Model | Large language model training | Linux | Lustre | Megatron | Megatron-LM | MinIO | Model Training | NCCL | NVIDIA A100 | NVIDIA GPU | NVIDIA H100 | NVLink | NVSwitch | Pipeline parallelism | Prometheus | PyTorch | Python | RDMA | RoCE | TensorFlow | Volcano
Roles
AI | AI Infrastructure Engineer | Engineer | Infrastructure Engineer | Platform | Platform Engineer