Senior HPC Cluster Administrator - Deep Learning Frameworks Infrastructure
Tasks
- Automate infrastructure with IaC
- Build CI/CD pipelines for infrastructure
- Configure GPU compute clusters
- Configure fair share policies
- Contribute to engineering standards
- Deploy container runtimes
- Design and scale storage solutions
- Evaluate new networking technologies
- Evaluate new storage technologies
- Maintain observability with Prometheus and Grafana
- Manage job scheduling with Slurm
- Manage monitoring and deprecation
- Manage reservation scheduling
- Mentor junior engineers
- Optimize GPU partitioning with MIG
- Own GPU compute cluster lifecycle
- Provision GPU compute clusters
- Resolve hardware and software incidents
- Tune cluster configuration for distributed training
- Use DCGM for GPU monitoring
Perks/Benefits
- N/A
Skills/Tech-stack
Ansible | Apptainer | Bash | CI/CD | Cgroups | DCGM | Distributed Storage | Docker | EFA | GitLab | Grafana | IPMI | InfiniBand | Kubernetes | Linux | Lustre | NFS | NVIDIA MIG | NVIDIA NVSwitch | NVIDIA SMI | NVLink | Prometheus | Python | RDMA | Redfish | RoCE | Singularity | Slurm | Terraform | WekaFS
Education