Software Engineer, Systems ML - HPC Specialist
Bellevue, WA | Menlo Park, CA | Remote, US
Meta
Giving people the power to build community and bring the world closer together
Meta is seeking an AI Software Engineer to join our Research & Development teams. The ideal candidate will have industry experience working on AI infrastructure topics. The position involves applying these skills to some of the most crucial and exciting problems on the web.
Some aspects of this role as an HPC specialist may include authoring components such as cuBLAS, cuDNN, AITemplate, and FlashAttention, and developing runtimes such as an LLM disaggregated runtime. HPC specialists spend time optimizing programs to reduce accelerator idle time. They also develop debugging tools (such as cuda-gdb) and profilers that utilize the accelerated computing hardware (such as PEs/SFUs in MTIA or the Transformer Engine in H100). They are systems experts who can design, debug, and accelerate AI workloads from single-node scale-up to multi-node scale-out distributed systems. They are also able to influence the next generation of silicon architectures (such as Tensor Cores in V100 and the Transformer Engine in H100) based on evolving AI workload needs.
We are hiring in multiple locations.
Software Engineer, Systems ML - HPC Specialist Responsibilities
- Apply relevant AI and machine learning techniques to build and optimize our intelligent systems that improve Meta's products and experiences
- Develop custom/novel architectures, define use cases, and develop methodology & benchmarks to evaluate different approaches
- Apply in-depth knowledge of how the machine learning system interacts with the other systems around it
- Assist in goal setting related to project impact, AI system design, and ML excellence
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
- 2+ years of experience in HPC and parallel computing.
- Proficiency in GPU programming using CUDA and familiarity with CUDA libraries (cuBLAS, cuDNN, etc.).
- Proven track record of leading successful HPC projects.
- Proven technical expertise in HPC architectures and technologies.
Preferred Qualifications
- PhD in Computer Science, Computer Engineering, or relevant technical field.
- Experience developing AI algorithms or AI-System infrastructure in C/C++ or Python.
- Experience developing AI compilers (such as TorchInductor in PyTorch 2.0).
Individual compensation is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base hourly rate, monthly rate, or annual salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base compensation, Meta offers benefits. Learn more about benefits at Meta.
Categories:
Engineering Jobs
Machine Learning Jobs
Tags: Architecture Computer Science CUDA cuDNN Distributed Systems Engineering GPU HPC LLMs Machine Learning ML infrastructure PhD Physics Python PyTorch R&D Research VR
Perks/benefits: Career development Equity / stock options Health care Salary bonus
Regions:
Remote/Anywhere
North America
Country:
United States