Site Reliability Engineer - Platform
London
InstaDeep
InstaDeep delivers AI-powered decision-making systems for the Enterprise. With expertise in both machine intelligence research and concrete business deployments, we provide a competitive advantage to our customers in an AI-first world.
InstaDeep, founded in 2014, is a pioneering AI company at the forefront of innovation. With strategic offices in major cities worldwide, including London, Paris, Berlin, Tunis, Kigali, Cape Town, Boston, and San Francisco, InstaDeep collaborates with giants like Google DeepMind and prestigious educational institutions like MIT, Stanford, Oxford, UCL, and Imperial College London. We are a Google Cloud Partner and a select NVIDIA Elite Service Delivery Partner. We have been listed among notable players in AI, fast-growing companies, and Europe's 1000 fastest-growing companies in 2022 by Statista and the Financial Times. Our recent acquisition by BioNTech has further solidified our commitment to leading the industry.
Join us to be a part of the AI revolution!
As a vital member of both the Platform Squad and the Infrastructure/SRE team at InstaDeep, you will work alongside various stakeholders and report directly to the squad's Lead SRE. Your primary responsibility will be managing the deployment and maintenance of our platform engineering and production systems, and ensuring that our services remain consistently reliable and performant. You will also contribute significantly to enhancing our platform engineering capabilities, including observability and incident management. This role requires bringing forward thoughtful proposals and demonstrating the ability to work independently, backed by strong experience in production environments.
Technical stack:
- CSPs: Google Cloud (mainly), AWS and Azure.
- Observability: Prometheus, Grafana, Alertmanager, Mimir, Loki and Tempo.
- Core: Kubernetes, Crossplane, FluxCD and ArgoCD.
- Development: Mainly Python and Bash.
- Participate in the architecture design of the internal Platform Engineering Framework.
- Develop and maintain custom Crossplane functions and compositions using Python.
- Deploy, architect and maintain production-grade observability components across various environments.
- Define SLOs for products and projects with stakeholders, and ensure they are met to maintain reliability.
- Participate in provisioning and managing infrastructure through Infrastructure as Code, applying GitOps to everything.
Required Qualifications
- MSc degree in Computer Science or a similar engineering discipline.
- Eligibility to work in France or the UK.
- 5+ years of experience in the technology industry (likely as SRE, DevOps or MLOps).
- Strong experience in Kubernetes and containerized architectures.
- Strong expertise with at least one of the major CSPs (likely GCP, AWS or Azure).
- Knowledge of observability with Prometheus and Grafana (or a similar stack), including monitoring Kubernetes services.
- Familiarity with infrastructure technologies such as Helm, Terraform, or similar.
- Familiarity with GitOps workflows.
- Comfortable with UNIX/Linux and systems operations.
- Solid Python development skills (beyond basic scripting) for automating infrastructure.
- Experience in scripting languages like Bash or Python.
Preferred (optional) qualifications
- Comfortable with SRE principles.
- Familiarity with Crossplane or comparable declarative infrastructure management solutions.
- Understanding of virtualization technologies (KVM, VMware, OpenStack).
- A Kubernetes certification (CKA, CKAD, CKS or KCNA).
Right to work: Please note that you will require the legal right to work in the location you are applying for.
Categories: Big Data Jobs, Engineering Jobs
Tags: Architecture, AWS, Azure, Computer Science, DevOps, Engineering, GCP, Google Cloud, Grafana, Helm, Kubernetes, Linux, MLOps, OpenStack, Python, Terraform
Region: Europe
Country: United Kingdom