Evals Software Engineer (Infrastructure focus)
London
Apollo Research
THE OPPORTUNITY
We're seeking a Software Engineer who will design, build, and maintain the secure infrastructure foundation that powers our frontier AI evaluations, as well as the tools which leverage this infrastructure.
As the Infrastructure Lead on our team, you'll have broad decision-making authority over our tech stack and directly enable our research mission through robust, scalable systems.
Key Responsibilities
- Design, implement, scale, and maintain infrastructure for running frontier LLM evals using Infrastructure as Code (IaC)
- Choose and integrate appropriate technologies for our infrastructure stack (e.g. decide whether to adopt Kubernetes)
- Build internal software tools that leverage our infrastructure, for tasks including job orchestration, project access, and results storage (see the sketch after this list)
- Collaborate with researchers to anticipate future infrastructure needs, e.g. the cluster requirements for future agent deployments
- Ensure that our evals run on our infrastructure, and debug issues throughout the technology stack, whether in AWS, Kubernetes, or Python application code
- Administer and secure internal AWS accounts (IAM, resource deployments, spending, etc.)
- Help set up and manage organisation-wide security processes
- As we evolve, co-create and lead the infrastructure team that we need
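To make the job-orchestration work concrete, here is a minimal, illustrative sketch (not our actual stack) using the official Kubernetes Python client to submit an eval run as a batch Job. The image name, namespace, and command are hypothetical placeholders:

```python
# Minimal sketch: submit an eval run as a Kubernetes batch Job.
# Assumes a reachable cluster and a local kubeconfig; the image,
# namespace, and command are hypothetical placeholders.
from kubernetes import client, config


def submit_eval_job(name: str, image: str, command: list[str]) -> None:
    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    container = client.V1Container(name=name, image=image, command=command)
    pod_spec = client.V1PodSpec(restart_policy="Never", containers=[container])
    template = client.V1PodTemplateSpec(spec=pod_spec)
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )
    client.BatchV1Api().create_namespaced_job(namespace="evals", body=job)


if __name__ == "__main__":
    submit_eval_job(
        name="example-eval-001",
        image="registry.example.com/evals-runner:latest",  # hypothetical image
        command=["python", "-m", "evals.run", "--suite", "example"],
    )
```

In practice, tooling like this would sit behind the researcher-facing interfaces described below, handling results storage and access control as well.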
Minimum Qualifications
- Strong software engineering background, preferably in Python
- Experience leading infrastructure projects from start to finish
- Strong hands-on experience with Kubernetes
- Solid knowledge of AWS, including IAM and EKS
- Experience implementing security best practices for cloud and containerised environments (see the sketch after this list)
- Experience with Infrastructure as Code tools (e.g. Terraform)
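As one illustrative example of the cloud-security work this implies (a sketch under assumed read permissions, not a prescribed approach), the snippet below uses boto3 to flag IAM users who have console access but no MFA device:

```python
# Illustrative sketch: flag IAM users with a console password but no MFA
# device. Requires AWS credentials with iam:List* / iam:Get* permissions.
import boto3
from botocore.exceptions import ClientError

iam = boto3.client("iam")

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        name = user["UserName"]
        try:
            iam.get_login_profile(UserName=name)  # raises if no console password
        except ClientError as err:
            if err.response["Error"]["Code"] == "NoSuchEntity":
                continue  # no console access, so MFA is less critical
            raise
        if not iam.list_mfa_devices(UserName=name)["MFADevices"]:
            print(f"WARNING: console user without MFA: {name}")
```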
We want to emphasize that if you don't fulfill all of these qualifications but still think you would be a good fit for the position, we strongly encourage you to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.
At Apollo, we aim for a culture that emphasizes truth-seeking, being goal-oriented, giving and receiving constructive feedback, and being friendly and helpful. If you’re interested in more details about what it’s like working at Apollo, you can find more information here.
What You'll Accomplish in Your First Year
- Accelerate our frontier LLM evaluations research by building an internal platform with both robust backend infrastructure and intuitive researcher-facing interfaces
- Define the vision and roadmap for team infrastructure that securely and reliably supports our research
- Establish a comprehensive security posture across our cloud resources and deployments to protect sensitive data and maintain the trust of our partners
- Optimize infrastructure costs, ensuring efficient use of resources as our demand increases
ABOUT APOLLO RESEARCH
The capabilities of current AI systems are evolving at a rapid pace. While these advancements offer tremendous opportunities, they also present significant risks, such as the potential for deliberate misuse or the deployment of sophisticated yet misaligned models. At Apollo Research, our primary concern lies with deceptive alignment, a phenomenon where a model appears to be aligned but is, in fact, misaligned and capable of evading human oversight.
Our approach focuses on behavioral model evaluations, which we then use to audit real-world models. We also combine black-box approaches with applied interpretability. In our evaluations, we focus on LM agents, i.e. LLMs with agentic scaffolding similar to AIDE or SWE agent. We also study model organisms in controlled environments (see our security policies), e.g. to better understand capabilities related to scheming.
ABOUT THE TEAM
The current evals team consists of Mikita Balesni, Jérémy Scheurer, Alex Meinke, Rusheb Shah, Bronson Schoen, Andrei Matveiakin, Felix Hofstätter, and Axel Højmark. Marius Hobbhahn manages and advises the team, though team members lead individual projects. You would work closely with Rusheb and Andrei, who are the full-time software engineers on the evals team, but you would also interact a lot with everyone else. You can find our full team here.
EVALS TEAM WORK
- Conceptual work on safety cases for scheming, for example, our work on evaluation-based safety cases for scheming
- Building evaluations for scheming-related properties, such as situational awareness or deceptive reasoning
- Conducting evaluations on frontier models and publishing the results either to the general public or to a target audience such as AI developers or governments, for example, our work in OpenAI's o1-preview system card
- Creating model organisms and demonstrations of behavior related to deceptive alignment, e.g. exploring the influence of goal-directedness on scheming
- Designing and evaluating AI control protocols. We have not started these efforts yet but intend to work on them starting Q2 2025
LOGISTICS
- Start Date: Target of 2-3 months after the first interview.
- Time Allocation: Full-time.
- Location: The office is in London, and the building is shared with the London Initiative for Safe AI (LISA) offices. This is an in-person role.
- Work Visas: We can sponsor UK visas.
BENEFITS
- Salary: a competitive UK-based salary.
- Flexible work hours and schedule.
- Unlimited vacation.
- Unlimited sick leave.
- Lunch, dinner, and snacks are provided for all employees on workdays.
- Paid work trips, including staff retreats, business trips, and relevant conferences.
- A yearly $1,000 (USD) professional development budget.
How to apply: Please complete the application form with your CV. A cover letter is optional. Please also feel free to share links to relevant work samples.
About the interview process: Our multi-stage process includes a screening interview, a take-home test (approx. 2 hours), 3 technical interviews, and a final interview with Marius (CEO). The technical interviews will be closely related to tasks the candidate would do on the job. There are no leetcode-style general coding interviews. If you want to prepare for the interviews, we suggest working on hands-on LLM evals projects (e.g. as suggested in our starter guide), such as building LM agent evaluations in Inspect.
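For a sense of what such preparation might look like, here is a minimal toy eval in Inspect, assuming the current inspect_ai API; the question and scorer are illustrative, not one of our evals:

```python
# Minimal sketch of an Inspect eval, assuming the current inspect_ai API.
# The question/answer pair is a toy example, not an Apollo eval.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate


@task
def toy_eval() -> Task:
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),  # single model call, no agent scaffolding
        scorer=includes(),  # passes if the target appears in the output
    )
```

You could then run it with Inspect's CLI, e.g. `inspect eval toy_eval.py --model openai/gpt-4o` (model name illustrative); agent evaluations swap the solver for agentic scaffolding.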
Applications deadline: 22 June 2025. However, we review applications on a rolling basis and encourage early submissions.
* This role is supported by AI Futures Grants, a UK Government program designed to help the next generation of AI leaders meet the costs of relocating to the UK. AI Futures Grants provide financial support to reimburse relocation costs such as work visa fees, immigration health surcharge and travel/subsistence expenses. Successful candidates for this role may be able to get up to £10,000 to meet associated relocation costs, subject to terms and conditions.