Senior ML Engineer - Distributed training and performance

Zürich, Switzerland

Kaiko

Kaiko’s data framework for cancer research provides hospitals and research institutes with data insights, AI support for medical doctors, and the latest developments in machine-learning research.



About kaiko 

In cancer care, treatment decisions can take many days—but patients don’t have that time. One of the reasons for delays? Cancer patients' data is scattered across many places: doctor’s notes, medical imagery, genomics data. At kaiko, we are developing AI foundation models to bring this data together and integrate it into clinical workflows, enabling doctors to make faster, more effective treatment decisions. 

We also collaborate closely with the leading Dutch cancer research institute (NKI) on multiple AI research projects and a joint clinical validation initiative. In 2025, we plan on expanding our partnerships to even more hospitals. 

We have raised significant long-term funding and have offices in Zurich and Amsterdam. Over the past year, our team has nearly doubled in size, now comprising 80+ people from 25 countries, and we were recently recognized as a Rising Innovator at the GenAI conference. 

About the role 

As a Senior ML Engineer specializing in Distributed Training & Performance, you will be a critical force multiplier for our growing team of ML Researchers. Your mission is to optimize the backbone that trains our multimodal models, streamlining their efforts and empowering them to focus on scientific discovery. You will be responsible both for scaling 10–100B+ parameter models across hundreds of GPUs and for diving deep into the code to eliminate performance bottlenecks, creating tools and frameworks that enable our research team to self-serve. The role is based in Zurich.

Some areas of responsibility 

  • Distributed Training & Scaling: Research, design, and develop state-of-the-art distributed training strategies for large-scale models using tensor, pipeline, and data parallelism (e.g., FSDP, DeepSpeed, Megatron-LM). 
  • Performance Engineering & Profiling: Proactively use profiling tools (NVIDIA Nsight, PyTorch Profiler) to interpret traces, diagnose bottlenecks, and optimize everything from NCCL collectives and CUDA kernels to GPU memory, GPUDirect Storage, and high-bandwidth fabrics (NVLink, InfiniBand). 
  • Kernel-Level Optimizations: Develop and optimize custom CUDA or Triton kernels. 
  • Fault-Tolerant Orchestration: Design and manage robust, fault-tolerant training jobs at scale using orchestration frameworks like Ray Train, Kubernetes, or SLURM. 
  • Mentorship & Collaboration: Mentor ML researchers and engineers, turning model requirements into scalable training pipelines and evangelizing best practices in writing high-performance, production-grade code. 
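To make the orchestration side of this work concrete, a minimal sketch of a fault-tolerant multi-node launch under SLURM with `torchrun` is shown below. Node counts, partition name, script name, and config file are illustrative assumptions, not a description of kaiko's actual setup:

```shell
#!/bin/bash
#SBATCH --job-name=fsdp-pretrain        # hypothetical job name
#SBATCH --nodes=16                      # assumed node count
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1             # one torchrun launcher per node
#SBATCH --partition=gpu                 # partition name is an assumption

# Rendezvous endpoint: first node in the allocation
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# torchrun spawns one worker per GPU and handles rank assignment;
# --max-restarts provides basic fault tolerance on worker failure.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc-per-node=8 \
  --max-restarts=3 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="${head_node}:29500" \
  train.py --config pretrain.yaml       # train.py is a placeholder
```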

About you 

  • Expert-level Python and C++/CUDA, with deep knowledge of PyTorch internals. 
  • End-to-end experience training ≥10B-parameter models at scale, preferably multimodal. 
  • Proven ability to design experiments, benchmark collectives, interpret profiling traces, and resolve complex performance issues in GPU-bound workloads. 
  • Strong systems intuition: PCIe, NVLink, HBM, NUMA, CUDA Graphs, InfiniBand, and mixed-precision training (BF16/FP8). 
  • Proven ability to translate complex systems-level constraints (e.g., memory bandwidth, interconnect topology) into actionable guidance and robust tooling for a large, PhD-level research team. 

We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you’re excited about us but don’t fit every single qualification, we still encourage you to apply: we’ve had incredible team members join us who didn’t check every box! 

Why kaiko  
At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value: 

  • Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work. 
  • Collaboration: You’ll approach disagreement with curiosity, build on common ground, and create solutions together. 
  • Ambition: You’ll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients. 

  
In addition, we offer: 

  • An attractive and competitive salary, a good pension plan and 25 vacation days per year. 
  • Great offsites and team events to strengthen the team and celebrate successes together. 
  • A EUR 1000 learning and development budget to help you grow. 
  • Autonomy to do your work the way that works best for you, whether you have a kid or prefer early mornings. 
  • An annual commuting subsidy. 
