High Performance Computing Engineer
Toronto
Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of deep learning, optimization, NLP, AutoML, and statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.
About The Role
We are looking for a Senior High Performance Computing Engineer to help us operate the GPUs, network, and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking, and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.
You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20 PB of storage, terabit networking, and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.
A day in the life:
- Manage large, private, high-end GPU clusters
- Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting
- Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
- Configure and maintain MAAS, Ceph, Slurm and Kubernetes
- Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
- Configure and maintain the network, including Layer 3 networking
- Learn about new tools and deploy them
You might be a great fit if you have:
- Strong background in high performance computing
- Experience with on-premises data center operations and technologies
- Experience in managing a large hardware cluster
- Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
- Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
- Familiarity with GPU utilization for machine learning workloads and optimization techniques
- Experience managing firmware and system updates, e.g. on Supermicro hardware