Principal Reliability Engineering/SRE
Chicago, IL-200 W Madison St, United States
Full Time Senior-level / Expert USD 151K - 226K
The Hartford
Get business, home and car insurance from The Hartford. Choose from a broad selection of business insurance coverages and design the right solution for your company. The Hartford offers AARP members great ways to save on car and home insurance,... We’re determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals – and to help others accomplish theirs, too. Join our team as we help shape the future.
The Hartford’s Corporate / HIMCO IT team is seeking an experienced and highly motivated Principal Engineer who will be responsible for driving Reliability Engineering for multiple applications in the portfolio as well as implementation of Gen AI and AI platform capabilities. The principal engineer will be responsible for building, optimizing, and maintaining the cloud automation capabilities to enable infrastructure provisioning, application availability, testing, quality, application deployment, resiliency, recovery, and efficiency of IT applications and platforms.
The principal engineer will also ensure the implementation of IT Security and service hardening requirements. Key measures of success will include service reliability (such as availability, latency, quality), as well as technical debt reduction and cost efficiency.
This role will have a Hybrid work schedule, with the expectation of working in an office (Hartford, CT or Charlotte, NC) 3 days a week. Candidate must be authorized to work in the US without company sponsorship. The company will not support the STEM OPT I-983 Training Plan endorsement for this position.
Responsibilities:
- Set the strategy and advance the use of best-in-class software engineering standards, tools, and design practices to enable highly available and performant customer-facing applications. Lead adoption of metrics for overall application health: availability, performance, monitoring, alerting, quality, currency, and resiliency.
- Serve as the technical expert for the applications and infrastructure supported, which requires depth and breadth of knowledge in technologies, applications, integration, interfaces, and the business domain.
- Drive the development and implementation of Gen AI and AI platform capabilities, including evaluating and selecting AI and ML frameworks, platforms, and tools. Leverage cutting-edge technologies and methodologies to optimize business operations, enhance customer experience, and drive competitive advantage.
- Develop the strategy to ensure effective tooling, alerts, and response mechanisms to identify and address reliability and security risks, leveraging automation to support problem prevention, detection, mitigation, and resolution.
- Develop the strategy to enhance the velocity of the SDLC by engineering the appropriate solutions to increase delivery speed while adhering to technology standards for sustained reliability.
- Identify, define and implement preventative controls and drive increased automation and self-healing capabilities. Continue to improve cost efficiency baselines.
- Lead the migration of applications to open source platforms, PaaS, containers, serverless, event-based designs, and other cloud technology standards for cloud-enablement and platform agility.
- Set a strategy to drive simplification across the stack, ensuring that all technical designs can be operated cost-efficiently without adding operational complexity.
- Lead inner- and open-sourcing practices to accelerate the development of self-service enterprise capabilities.
- Expert experience in setting up scalable SDLC environments using COTS, PaaS, and SaaS products that cater to data, application, and infrastructure pipeline needs.
- Design a migration plan that builds solutions to move applications to open-source platforms, PaaS, containers, and other cloud technology standards for cloud-enablement and platform agility.
- Ensure operational excellence. Lead the triaging and service restoration of all high impact incidents in order to minimize the mean time to service restoration and impact to the business. Demonstrate end-to-end ownership.
- Partner with infrastructure teams on strategy to design and implement intelligent automation and orchestration systems, enhanced monitoring/alerting capabilities, and rapid service restoration processes. Take proactive measures to prevent high-impact incidents.
Qualifications:
- Bachelor's degree in Computer Science or a related discipline
- 10+ years of work experience in IT systems analysis, design, application development, IT Operations, and tech leadership.
- 5+ years of experience in a Reliability Engineer, Multi Stack Engineer or Data Engineer role with Manager Accountabilities
- Proven experience with FinOps
- End-to-end systems thinking: broad understanding and application of enterprise architectures and complex distributed systems
- 2+ years of experience leading AI and ML engineering organizations, with expertise in building and/or managing large-scale AI, data, and analytics platforms (desirable)
- Knowledge of the principles and practices of FMOps and LLMOps, and of the tools and technologies used for generative AI model operations (desirable)
- Understanding of GenAI, machine learning, and related technologies along with business acumen.
- Proven experience with solution architecture orientation to enable expedient troubleshooting, issue-resolution and root-cause removal in a hybrid cloud environment.
- Proven experience with continuous integration and DevOps methodologies and tools, including GitHub, Jenkins, Nexus, Rally, SonarQube, Jira, Azure DevOps, and AWS CodePipeline.
- Proven experience using performance and observability tools such as Dynatrace, CloudWatch, CloudTrail, AWS X-Ray, and related tools.
- Proven hybrid cloud experience (private and public) across various service delivery models – IaaS, PaaS, SaaS.
- Proven experience with IaC tools such as Terraform and CloudFormation.
- Highly collaborative; partners with peers and stakeholders, with a passion for delighting customers.
- Strong verbal and written communicator at all levels of the enterprise, with influence and negotiation skills and experience working in diverse teams across business units.
- Certified in one or more of the following:
  - AWS Certified Developer
  - AWS Certified Solutions Architect
  - AWS Certified DevOps Engineer
  - Certified Kubernetes Administrator (CKA)
  - Certified Kubernetes Application Developer (CKAD)
Compensation
The listed annualized base pay range is primarily based on analysis of similar positions in the external market. Actual base pay could vary and may be above or below the listed range based on factors including but not limited to performance, proficiency and demonstration of competencies required for the role. The base pay is just one component of The Hartford’s total compensation package for employees. Other rewards may include short-term or annual bonuses, long-term incentives, and on-the-spot recognition. The annualized base pay range for this role is:
$151,280 - $226,920
Equal Opportunity Employer/Females/Minorities/Veterans/Disability/Sexual Orientation/Gender Identity or Expression/Religion/Age
Tags: Architecture AWS Azure Computer Science CX DevOps Distributed Systems Engineering Generative AI GitHub Jenkins Jira Kubernetes LLMOps Machine Learning Open Source SDLC Security STEM Terraform Testing
Perks/benefits: Career development Competitive pay Equity / stock options Health care Insurance