System Manager (Site Reliability Engineering) (Contract)

MAS: MAS Building, Singapore

The Monetary Authority of Singapore (MAS) is Singapore’s central bank and integrated financial regulator.

As central bank, MAS promotes sustained, non-inflationary economic growth through the conduct of monetary policy and close macroeconomic surveillance and analysis. It manages Singapore’s exchange rate, official foreign reserves, and liquidity in the banking sector.

As an integrated financial supervisor, MAS fosters a sound financial services sector through its prudential oversight of all financial institutions in Singapore – banks, insurers, capital market intermediaries, financial advisors, and stock exchanges. It is also responsible for well-functioning financial markets, sound conduct, and investor education.

MAS also works with the financial industry to promote Singapore as a dynamic international financial centre. It facilitates the development of infrastructure, adoption of technology, and upgrading of skills in the financial industry.

Join us now if you have a genuine interest in helping to shape Singapore’s economic and financial landscape.

[What you will be working on]

You will be part of a dynamic team within the Supervision Platforms Division under the Information Technology Department.

As a Senior Site Reliability Engineer specialising in Observability, you will be instrumental in building and maintaining highly reliable, scalable systems while focusing on observability practices across our division.

In this role, you will be expected to undertake the following areas of work:

(i) Site Reliability Engineering (SRE)

  • Lead the implementation of comprehensive observability strategies across all applications and services in the division

  • Develop and maintain an SRE framework aligned with business objectives

  • Participate in system design discussions to improve application reliability, scalability and fault tolerance

  • Drive cultural transformation towards SRE practices and observability-driven development

(ii) Observability & Monitoring

  • Design and implement robust observability solutions incorporating metrics, logs, and traces

  • Establish and maintain golden signals, SLOs and SLIs across services

  • Optimise monitoring systems by reducing alert fatigue and ensuring a high signal-to-noise ratio

  • Develop advanced alerting strategies using machine learning for anomaly detection

  • Create comprehensive dashboards for system health visibility and performance analytics

(iii) Operations Excellence

  • Architect and maintain runbooks and playbooks for incident response

  • Lead post-incident reviews and implement systematic improvements

  • Drive reviews of system and application resource utilisation trends to proactively identify and address potential areas of concern

  • Develop automation solutions to improve the efficiency and effectiveness of operational tasks (e.g. service recovery, patching and monitoring)

  • Establish best practices for logging standards and log management

  • Guide teams in implementing proper instrumentation of applications for observability

(iv) Operations and Maintenance

  • Review application logging to ensure it can be used for monitoring and alerting

  • Measure, monitor and continuously improve the performance of IT services

  • Ensure that IT operations runbooks are available for all new features and services developed

(v) Governance, Risk and Compliance

  • Ensure compliance with ICT Governance policies and standards

  • Ensure the risk framework is adhered to, and carry out risk management for projects and products

[What we are looking for]

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field

  • 10+ years of experience in SRE, DevOps, or similar roles with a strong focus on observability

  • Deep expertise in observability platforms (ELK Stack, Grafana, Prometheus, Datadog, or similar)

  • Proven experience with cloud platforms, particularly AWS

  • Strong background in SLI/SLO implementation and chaos engineering practices

  • Proficiency in at least one programming language (Python, Go, or Java preferred)

  • Experience with distributed systems and microservices architecture

  • Experience with APM solutions and distributed tracing systems

  • Proficiency in infrastructure as code and configuration management, as well as process optimisation and automation

  • Expertise in log aggregation and analysis tools

  • Familiarity with container orchestration platforms (Kubernetes)

  • Excellence in technical writing and documentation

  • Demonstrated experience in stakeholder management and technical leadership, with effective communication and people management skills

  • Strong project management skills with Jira proficiency, and strong analytical and problem-solving capabilities

  • Experience in mentoring junior engineers and driving technical initiatives

  • Excellent incident management and troubleshooting skills, with a proven track record of delivering results under pressure

  • Understanding of Government Instruction Manual 8 principles and familiarity with Government IT governance policies

  • Self-motivated with strong teamwork orientation, and the ability to influence and drive consensus across diverse teams

As part of the shortlisting process for this role, you may be required to complete a medical declaration and/or undergo further assessment.

This is a contract position until Dec 2029. All applicants will be notified of whether they have been shortlisted within 4 weeks of the closing date of this job posting.
