High-Performance Computing (HPC) Specialist – AI Training Infrastructure
London, UK
Meta
Giving people the power to build community and bring the world closer together
We are seeking experienced and passionate High-Performance Computing (HPC) Specialists to join our AI Training Infrastructure team. In this role, you will design, optimize, and manage cutting-edge AI training environments for large-scale machine learning models. You will collaborate with a multidisciplinary team to ensure seamless integration and scalability across heterogeneous hardware platforms.
Equal Employment Opportunity
Meta is proud to be an Equal Employment Opportunity employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, reproductive health decisions, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, genetic information, political views or activity, or other applicable legally protected characteristics. You may view our Equal Employment Opportunity notice here.
Meta is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, fill out the Accommodations request form.
High-Performance Computing (HPC) Specialist – AI Training Infrastructure Responsibilities
- Design and implement HPC solutions for large-scale AI/ML training workloads, ensuring high performance, scalability, and efficiency.
- Optimize AI training pipelines and workflows to maximize utilization of GPUs and other specialized accelerators.
- Analyze and troubleshoot hardware bottlenecks, network issues, and performance inefficiencies in large-scale AI training environments.
- Collaborate with AI/ML researchers and data scientists to tailor HPC solutions that meet their specific model training requirements.
- Develop monitoring and profiling systems to ensure efficient utilization of resources across heterogeneous systems.
- Stay updated with advancements in HPC, AI/ML frameworks, and heterogeneous hardware technologies.
- Contribute to documentation, best practices, and knowledge sharing within the team.
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.
- 3+ years of experience in HPC environments, particularly for AI/ML workloads.
- Proficiency in parallel programming, distributed systems, and HPC-specific libraries (e.g., MPI, OpenMP, CUDA, ROCm).
- Hands-on experience with at least one hardware platform (e.g., NVIDIA GPUs, AMD GPUs, TPUs, FPGAs, or custom ASICs).
- Familiarity with PyTorch.
- Understanding of networked storage solutions, interconnects (e.g., InfiniBand, NVLink), and high-speed networking.
- Experience optimizing resource utilization in multi-node training environments.
- Problem-solving, communication, and collaboration skills.
Category: Deep Learning Jobs
Tags: Computer Science CUDA Distributed Systems Engineering HPC InfiniBand Machine Learning ML models Model training NVLink OpenMP Physics Pipelines PyTorch VR
Perks/benefits: Career development
Region: Europe
Country: United Kingdom