Member of Technical Staff - Infrastructure & Systems for Large Models

Markham, Ontario, Canada


Full Time, Contract, Senior-level / Expert, CAD 213K - 397K (est.)

Huawei Technologies Canada Co., Ltd.

Huawei is a leading global provider of information and communications technology (ICT) infrastructure and smart devices.



Posted 1 day ago

Huawei Canada has an immediate 12-month contract opening for a Member of Technical Staff.

About the team:

Founded in 2012, the Noah’s Ark Lab has evolved into a prominent research organization with notable achievements in academia and industry. The lab’s mission focuses on advancing artificial intelligence and related fields to benefit the company and society. Driven by impactful, long-term projects, the lab aims to advance state-of-the-art research while integrating innovations into the company's products and services, including LLMs, RL, NLP, computer vision, AI theory, and autonomous driving.

About the job:

Join a team that maintains the core infrastructure powering large-scale AI training.
Contribute to data loading, training workflows, and checkpointing systems for distributed model training.
Help improve tools that manage training jobs across compute clusters (e.g., GPUs, TPUs, multi-node setups).
Work on monitoring and logging tools to make long-running jobs reliable and observable.
Support optimization efforts (e.g., mixed precision, sharding) to make model training faster and more efficient.
Collaborate closely with machine learning engineers and researchers on new training methods and experiments.
Learn to scale systems, debug complex workloads, and make training pipelines reproducible.
Be part of a team that bridges research and infrastructure to accelerate AI development.

Requirements

About the ideal candidate:

1–2 years of software engineering experience.
Proficient in Python, with basic experience in backend or infrastructure development.
Familiarity with ML frameworks like PyTorch or TensorFlow.
Some exposure to distributed systems, training jobs, or cloud computing is a plus.
Comfortable using Linux, containers (e.g., Docker), and command-line tools.
Understanding of software engineering best practices (e.g., testing, version control).
Eager to learn about large-scale ML systems and infrastructure design.
Strong communication and collaboration skills; enjoys working with cross-functional teams.