Lead Systems Engineer - GPU Management (AI/HPC)
Palo Alto
At Hippocratic AI, we are at the forefront of technological innovation, leveraging advanced computing resources to solve complex problems. Our dedicated GPU clusters, including high-end NVIDIA A100 and H100 models, are crucial for our data processing, machine learning, and computational tasks, including the development and optimization of Large Language Models (LLMs).
Position Overview:
As Lead Systems Engineer specializing in Slurm, HPC, and GPUs, you will play a crucial role in designing, implementing, and maintaining our advanced computing infrastructure. Your in-depth knowledge of Slurm, HPC principles, and GPU utilization will enable you to optimize system performance, ensure reliable operation, and support our growing computational needs.
Responsibilities:
GPU Cluster Management:
Run high-performance compute services, such as SageMaker, in public cloud environments (AWS, GCP, and Azure).
Knowledge of hardware components, such as GPUs (including high-end models like NVIDIA A100 and H100), and familiarity with NVIDIA Container Toolkit.
Experience in managing GPU nodes in cloud environments, ensuring optimal performance and reliability.
Orchestration and Automation:
Proficiency in Kubernetes for container orchestration and Slurm for workload management to efficiently distribute tasks across the GPU cluster.
Experience in setting up and configuring these orchestration tools to ensure high availability and scalability of cluster resources.
Troubleshooting and Debugging:
Ability to provide in-depth technical support for complex issues, including debugging and troubleshooting high-end GPUs.
Familiarity with debugging tools and techniques specific to GPU hardware and software.
Performance Optimization:
Continuous monitoring of system performance to identify bottlenecks and implement solutions to optimize resource utilization and throughput.
Knowledge of performance tuning techniques for GPU clusters and the ability to apply them effectively.
Security and Compliance:
Ensure adherence to security best practices and compliance requirements for GPU cluster infrastructure.
Implementation and management of security protocols and disaster recovery strategies to safeguard cluster resources and data.
Collaboration and Support:
Work closely with other engineering, research, and applied science teams to understand and support their computational needs.
Offer guidance and expertise on utilizing the GPU cluster efficiently for various tasks and applications.
Participate in planning and executing future expansion or enhancement of cluster capabilities to meet evolving computational requirements.
Requirements:
Education:
Bachelor’s degree in Computer Science, Electrical Engineering, or a related field. Master’s degree preferred.
Experience:
At least 3 years of experience in managing and maintaining GPU clusters, preferably in the cloud, with hands-on experience with NVIDIA A100 and H100 GPUs or similar high-end models.
Technical Skills:
Proficiency in Kubernetes for container orchestration, with experience deploying, scaling, and managing containerized applications in Kubernetes clusters, and familiarity with AWS Kubernetes services for cloud deployment and management.
Experience with Slurm for workload management in GPU cluster environments.
Deep understanding of GPU hardware, including experience with debugging and troubleshooting GPU issues.
Strong background in Linux/Unix administration, scripting (e.g., Bash, Python), and automation tools, with expertise in Ansible for configuration management and automation tasks.
Familiarity with network configuration, storage systems, and security protocols relevant to GPU clusters.
Problem-Solving:
Exceptional analytical and problem-solving skills, with the ability to handle complex technical challenges effectively.
Communication:
Excellent communication and documentation skills, capable of collaborating effectively across diverse teams.
About Hippocratic AI
Hippocratic AI is dedicated to developing a safety-focused large language model (LLM) tailored for the healthcare sector. We firmly believe in the potential of generative AI to significantly enhance global healthcare accessibility, provided it is developed and tested responsibly. Mirroring the principles of the Hippocratic Oath that guides medical professionals, our model is designed with the ethos of "Do No Harm."