AI/HPC Computing Performance Engineer (IT-SD-GSS-2024-185-GRAE)

Geneva, Switzerland

CERN

CERN, the European Organization for Nuclear Research, is one of the world’s largest and most respected centres for scientific research. Its business is fundamental physics, finding out what the Universe is made of and how it works.

Job Description

Your responsibilities

Are you passionate about pushing the boundaries of technology? We are seeking a skilled and motivated Computing Engineer to work on storage performance optimization for Artificial Intelligence and Machine Learning. Join us to architect and fine-tune high-performance storage solutions to meet the demanding needs of AI/ML workloads, ensuring efficient data flow between large-scale distributed storage systems and computing farms with CPUs and GPUs.

You will be part of the Ceph team, which is responsible for providing block, object, and file storage for the CERN Cloud Infrastructure, addressing the storage needs of virtual machines, cloud-native applications, and diverse IT services for the CERN community.

Your work will focus on optimizing the performance of open source storage systems for AI/ML workloads in scientific use cases. In particular, you will work on distributed systems hosting petabytes of datasets that capture physics events and must be delivered to, and processed by, multiple compute nodes efficiently. This poses significant challenges for the storage system hosting the training data, as it must be capable of storing large volumes of data, serving highly parallel concurrent requests, and delivering high throughput to computing clients.

Your responsibilities will include:

  • Investigation of storage requirements for AI/ML applications and identification of bottlenecks when accessing distributed storage systems
  • Architecting efficient access to storage resources, considering volume, performance and cost
  • Measurement and benchmarking of achievable performance on existing large-scale storage clusters
  • Configuration, tuning and code optimization to improve storage access efficiency

Also, as a team member, you will share responsibility for:

  • Operating clusters totaling ~100 PB of raw capacity over 20+ production instances and 2 data centers
  • Evolving the Ceph storage offer following users' needs, technology evolution and capacity requirements
  • Maintaining service documentation, procedures, and reports
  • Providing support to end-users and guiding them to make the best possible use of Ceph storage

Your profile

Skills

This is a junior engineering position. We are seeking a candidate who can demonstrate or readily acquire the following:

Required skills:

  • Linux system administration, in particular tracing and debugging tools (e.g., GDB, eBPF, DTrace)
  • C++ software design and development, or a strong interest in acquiring the relevant competencies. Scripting languages: Python, shell scripting
  • Testing and debugging of distributed applications (storage systems being a plus)
  • Strong motivation to acquire skills on the job to progress on tasks and address problems

Desirable skills:

  • Genuine curiosity about modern, software-defined storage and computing technologies
  • Familiarity with AI/ML workflows and storage requirements
  • Practical experience in performance measurement and benchmarking
  • Knowledge of mainstream monitoring technologies (Prometheus, Grafana)

Language Requirements:

  • Fluent in English

Eligibility criteria:

  • You are a national of a CERN Member or Associate Member State.
  • By the application deadline, you have a maximum of two years of professional experience since graduation in Computing or Physical Sciences (or a related field) and your highest educational qualification is either a Bachelor's or Master's degree.
  • You have never had a CERN fellow or graduate contract before.
  • Applicants without a university degree are not eligible.
  • Applicants with a PhD are not eligible.

Additional Information

Job closing date: 12 November at 23:59 CET.

Contract duration: 24 months, with a possible extension up to a maximum of 36 months.

Working hours: 40 hours per week

Target start date: 01-January-2025

This position involves: Stand-by duty, when required by the needs of the Organization.

Job reference: IT-SD-GSS-2024-185-GRAE

Field of work: Software Engineering and IT

What we offer

  • A monthly stipend ranging between 5134 and 5647 Swiss Francs (net of tax).
  • Coverage by CERN's comprehensive health scheme (for yourself, your spouse and children), and membership of the CERN Pension Fund.
  • Depending on your individual circumstances: installation grant; family, child and infant allowances; payment of travel expenses at the beginning and end of contract.
  • 30 days of paid leave per year.
  • On-the-job and formal training at CERN as well as in-house language courses for English and/or French.

About us

At CERN, the European Organization for Nuclear Research, physicists and engineers are probing the fundamental structure of the universe. Using the world's largest and most complex scientific instruments, they study the basic constituents of matter: fundamental particles that are made to collide together at close to the speed of light. The process gives physicists clues about how particles interact, and provides insights into the fundamental laws of nature. Find out more on http://home.cern.

Diversity has been an integral part of CERN's mission since its foundation and is an established value of the Organization. Employing a diverse workforce is central to our success.
