System Manager (Site Reliability Engineering) (Contract)
MAS: MAS Building, Singapore
The Singapore Public Service
As central bank, MAS promotes sustained, non-inflationary economic growth through the conduct of monetary policy and close macroeconomic surveillance and analysis. It manages Singapore’s exchange rate, official foreign reserves, and liquidity in the banking sector.
As an integrated financial supervisor, MAS fosters a sound financial services sector through its prudential oversight of all financial institutions in Singapore – banks, insurers, capital market intermediaries, financial advisors, and stock exchanges. It is also responsible for well-functioning financial markets, sound conduct, and investor education.
MAS also works with the financial industry to promote Singapore as a dynamic international financial centre. It facilitates the development of infrastructure, adoption of technology, and upgrading of skills in the financial industry.
Join us now if you have a genuine interest in making an impact and helping to shape Singapore’s economic and financial landscape.
[What you will be working on]
You will be part of a dynamic team within Supervision Platforms Division under Information Technology Department.
As a Senior Site Reliability Engineer specialising in Observability, you will be instrumental in building and maintaining highly reliable, scalable systems while focusing on observability practices across our division.
In this role, the officer is expected to undertake the following areas of work:
(i) Site Reliability Engineering (SRE)
Lead the implementation of comprehensive observability strategies across all applications and services in the division
Develop and maintain an SRE framework aligned with business objectives
Participate in system design discussions to improve application reliability, scalability and fault tolerance
Drive cultural transformation towards SRE practices and observability-driven development
(ii) Observability & Monitoring
Design and implement robust observability solutions incorporating metrics, logs, and traces
Establish and maintain golden signals, SLOs and SLIs across services
Optimise monitoring systems by reducing alert fatigue and ensuring high signal-to-noise ratio
Develop advanced alerting strategies using machine learning for anomaly detection
Create comprehensive dashboards for system health visibility and performance analytics
(iii) Operations Excellence
Architect and maintain runbooks and playbooks for incident response
Lead post-incident reviews and implement systematic improvements
Drive reviews of system and application resource utilization trends to proactively identify and address potential areas of concern
Develop automation solutions to improve the efficiency and effectiveness of operational tasks (e.g. service recovery, patching and monitoring)
Establish best practices for logging standards and log management
Guide teams in implementing proper instrumentation of applications for observability
(iv) Operations and maintenance
Review and ensure application logging can be utilized for monitoring and alerting
Measure, monitor and continuously improve the performance of IT services
Ensure that IT operations runbooks are available for all new features and services developed
(v) Governance, Risk and Compliance
Ensure compliance with ICT Governance policies and standards
Ensure that the risk framework is adhered to, and carry out risk management for projects and products
[What we are looking for]
Bachelor's or Master's degree in Computer Science, Engineering, or related field
10+ years of experience in SRE, DevOps, or similar roles with strong focus on observability
Deep expertise in observability platforms (ELK Stack, Grafana, Prometheus, Datadog, or similar)
Proven experience with cloud platforms, particularly AWS
Strong background in SLI/SLO implementation and chaos engineering practices
Proficiency in at least one programming language (Python, Go, or Java preferred)
Experience with distributed systems and microservices architecture
Experience with APM solutions and distributed tracing systems
Proficiency in Infrastructure-as-code and configuration management, as well as process optimisation and automation
Expertise in log aggregation and analysis tools
Familiarity with container orchestration platforms (Kubernetes)
Excellence in technical writing and documentation
Demonstrated experience in stakeholder management and technical leadership, with effective communication and people management skills
Strong project management skills with JIRA proficiency, and strong analytical and problem-solving capabilities
Experience in mentoring junior engineers and driving technical initiatives
Excellent incident management and troubleshooting skills, with proven track record of delivering results under pressure
Understanding of Government Instruction Manual 8 principles and familiarity with Government IT governance policies
Self-motivated with strong teamwork orientation, and the ability to influence and drive consensus across diverse teams
As part of the shortlisting process for this role, you may be required to complete a medical declaration and/or undergo further assessment.
This is a contract position until Dec 2029. All applicants will be notified of whether they have been shortlisted within 4 weeks of the closing date of this job posting.