Description:
The Sr. Manager, Infrastructure Reliability and AIOps Engineering is accountable for improving reliability, observability, and automated recovery across Cloud Infrastructure, Networking, Enterprise Tools, and IAM. This leader builds and operates the Operations and Reliability Engineering function using AIOps practices and is accountable for day-to-day operational outcomes, including incident response, escalations, and restoration quality. The role leads Reliability Analysts and partners with domain teams, ITSM/Platform Enablement, and Security to prevent incidents, reduce alert noise, and improve recovery performance.
Scope and accountability
This role is accountable for:
- Operational ownership of event-driven incidents, including active participation in incident response, ticket escalation management, and coordination through resolution and restoration.
- AIOps outcomes and governance for platform operations: event ingestion, normalization, correlation, alert quality, intelligent routing, and automated event-to-incident workflows.
- Reliability outcomes across Cloud Infrastructure, Networking, Enterprise Tools, and IAM (SLO attainment, improved availability/latency where applicable, MTTD/MTTR reduction, reduced repeat incidents).
- Signal quality management (alert hygiene, deduplication, suppression, threshold tuning, enrichment, and ownership mapping) to improve signal-to-noise and reduce operational toil.
- Event correlation standards and service impact intelligence (dependency mapping, CI/service association, and prioritization logic aligned to CMDB/ITSM).
- Automation quality and “production readiness” for self-healing workflows across all platform domains (validation, rollback, auditability, and measurable success criteria).
- Reliability operating cadence (incident triage standards, major incident support model, post-incident reviews, problem trend management, and reliability roadmap governance).
- Reliability standards for telemetry, runbooks, monitoring coverage, and operational readiness checks (aligned to ITSM practices and security/compliance needs where applicable).
- IT operations driven by predictive incident avoidance.
Key Responsibilities
- Reliability operations leadership
- Own the reliability execution model from signal → event → incident → restoration, including active incident engagement, escalation management, and accountability for ticket progression and resolution quality.
- Operate and continuously improve the AIOps layer: event ingestion/normalization, correlation rule design, enrichment, deduplication, suppression, and noise reduction.
- Drive measurable improvements in operational performance through alert quality KPIs (false positives, duplicates, unassigned events, time-to-triage).
- Lead post-incident reviews with a prevention mindset; convert lessons learned into problem records, reliability backlog items, and automation candidates with clear owners and due dates.
- Establish a consistent “incident learning → reliability backlog → automation delivery” feedback loop with Cloud, Network, Tools, and IAM teams.