Senior Manager Reliability Engineering

Description:

The Sr. Manager, Infrastructure Reliability and AIOps Engineering is accountable for improving reliability, observability, and automated recovery across Cloud Infrastructure, Networking, Enterprise Tools, and IAM. This leader builds and operates the Operations and Reliability Engineering function using AIOps practices and is accountable for day-to-day operational outcomes, including incident response, escalations, and restoration quality. The role leads Reliability Analysts and partners with domain teams, ITSM/Platform Enablement, and Security to prevent incidents, reduce alert noise, and improve recovery performance.

Scope and accountability

This role is accountable for:

Operational ownership of event-driven incidents, including active participation in incident response, ticket escalation management, and coordination through resolution and restoration.
AIOps outcomes and governance for platform operations: event ingestion, normalization, correlation, alert quality, intelligent routing, and automated event-to-incident workflows.
Reliability outcomes across Cloud Infrastructure, Networking, Enterprise Tools, and IAM (SLO attainment, improved availability/latency where applicable, MTTD/MTTR reduction, reduced repeat incidents).
Signal quality management (alert hygiene, deduplication, suppression, threshold tuning, enrichment, and ownership mapping) to improve signal-to-noise and reduce operational toil.
Event correlation standards and service impact intelligence (dependency mapping, CI/service association, and prioritization logic aligned to CMDB/ITSM).
Automation quality and “production readiness” for self-healing workflows across all platform domains (validation, rollback, auditability, and measurable success criteria).
Reliability operating cadence (incident triage standards, major incident support model, post-incident reviews, problem trend management, and reliability roadmap governance).
Reliability standards for telemetry, runbooks, monitoring coverage, and operational readiness checks (aligned to ITSM practices and security/compliance needs where applicable).
IT operations driven by predictive incident avoidance.

Key Responsibilities

Reliability operations leadership
Own the reliability execution model from signal → event → incident → restoration, including active incident engagement, escalation management, and accountability for ticket progression and resolution quality.
Operate and continuously improve the AIOps layer: event ingestion/normalization, correlation rule design, enrichment, deduplication, suppression, and noise reduction.
Drive measurable improvements in operational performance through alert quality KPIs (false positives, duplicates, unassigned events, time-to-triage).
Lead post-incident reviews with a prevention mindset; convert lessons learned into problem records, reliability backlog items, and automation candidates with clear owners and due dates.
Establish a consistent “incident learning → reliability backlog → automation delivery” feedback loop with Cloud, Network, Tools, and IAM teams.

Organization	Genesys
Industry	Engineering Jobs
Occupational Category	Senior Manager Reliability Engineering
Job Location	London, UK
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2026-04-10 9:15 pm
Expires on	2026-07-16