Software Engineer, Reliability

Berlin Office

Protect AI

Protect AI is the broadest and most comprehensive platform to secure your AI. It enables you to see, know, and manage AI securely, end to end.

About Protect AI

Protect AI is shaping, defining, and innovating a new category within cybersecurity around the risk and security of AI/ML. Our ML Security Platform enables customers to see, know, and manage security risks so they can defend against unique AI security threats and embrace MLSecOps for a safer AI-powered world. The platform offers a broad set of capabilities, including AI supply chain security, an Auditable Bill of Materials for AI, ML model scanning, signing, attestation, and LLM security.

Join our team to help us solve this critical need of protecting AI!

Role

Protect AI is seeking a Software Engineer, Reliability to join our growing team. Our engineering team builds and operates the Protect AI Platform, covering everything from AI supply chain security and ML Model Scanning to LLM Security. You will work closely with fellow Software Engineers who bring a variety of skill sets to the table, including Backend, Frontend, Infra, ML, and Security Engineering. You will also have the opportunity to work closely with Product Managers, Architects, and our CTO.

As a Software Engineer, Reliability, you will be responsible for designing and developing automation tools, infrastructure services, and frameworks that enhance the reliability, scalability, and performance of our platform. You will focus on writing software that automates infrastructure, improves observability, and ensures high availability of our services.

Responsibilities:

  • Develop and maintain automation tools to manage infrastructure at scale, including Kubernetes CRDs, controllers, and Terraform modules.

  • Build scalable and reliable distributed systems to enhance platform performance and availability.

  • Implement observability solutions using Grafana, Prometheus, Loki, and OpenTelemetry to monitor system health and performance.

  • Enhance CI/CD processes using GitOps workflows via ArgoCD, Helm, and Kubernetes manifests to streamline deployments and automate testing.

  • Apply on-call practices to improve system resiliency and ensure rapid incident response.

Qualifications:

  1. Strong software engineering background (3+ years) with experience developing automation tools and distributed systems on Kubernetes.

  2. Proficiency in at least one programming language, such as Go, Python, Rust, or Java.

  3. Experience developing Kubernetes CRDs, controllers, and automation frameworks.

  4. Hands-on experience with GitOps workflows (ArgoCD), Terraform, Kubernetes manifests, and Helm.

  5. Expertise in monitoring, logging, and observability tools (Grafana stack, Prometheus, OpenTelemetry).

  6. Strong understanding of CI/CD pipelines and build automation tools.

  7. Proven experience handling on-call rotations and improving incident response processes.

  8. Strong problem-solving and debugging skills.

  9. Experience with AWS (or other cloud providers) and cloud-native technologies.

  10. Ability to work in a fast-paced, agile environment, collaborating across engineering teams.

What We Offer:

  • An exciting, collaborative work environment in a fast-growing startup.

  • Competitive salary and benefits package.

  • Opportunities for professional growth and development, including attending meetups and conferences and presenting technical talks at them.

  • A culture that values innovation, accountability, and teamwork.

  • Work with a team of talented and well-accomplished peers.

  • Work with best-in-class tools: M4 MacBook Pro, 34” monitor, modern tech stack, and high-quality collaboration tools.

  • No bureaucracy or legacy systems. You are empowered to innovate and do your best work.

Tags: Agile AWS CI/CD Distributed Systems Engineering Grafana Helm Java Kubernetes LLMs Machine Learning Pipelines Python Rust Security Terraform Testing

Perks/benefits: Career development Competitive pay Conferences Gear Health care Startup environment

Region: Europe
Country: Germany
