Site Reliability Engineer, Machine Learning Systems - Singapore
Singapore, Singapore
Mid-level | Full Time | Found 14d ago
Tasks
- Develop monitoring and management tools for ML infrastructure and services
- Ensure ML systems operate efficiently across deployment, training, evaluation, and inference
- Implement disaster recovery plans and cluster governance, and improve operational stability and efficiency
- Maintain stability of offline tasks/services across multi-data-center, multi-region, and multi-cloud environments
- Manage computing and storage resources, plan capacity, and control costs
- Provide on-call support for system and business issues
Perks/Benefits
- N/A
Skills/Tech-stack
Cloud Computing | Cluster Governance | Coding | Disaster Recovery | Distributed Systems | Global Collaboration | Monitoring | Performance Analysis | Resource Management | Scalability | Server Management | Storage Systems | System Stability
Education