Site Reliability Engineer

Description:

We are looking for a Site Reliability Engineer who views "manual effort" as a bug to be fixed. In this role, you won't just be keeping the lights on; you will be the architect of our system’s resilience. We need a proactive engineer who is obsessed with Kubernetes and cloud infrastructure, but who also has a visionary streak: someone eager to experiment with AI-driven operations (AIOps) to predict failures and automate responses. If you enjoy building self-healing systems and staying ahead of the tech curve, this is the place for you.

What you will be doing

Engineering Reliability: Designing and implementing self-healing infrastructure using Kubernetes to maintain high uptime and system integrity
Scaling Cloud Ecosystems: Optimizing our cloud footprint (AWS/GCP/Azure) to ensure our platforms can handle rapid growth without breaking a sweat
Innovating with AI: Proactively identifying opportunities to integrate AI tools into our observability stack to automate incident detection and root-cause analysis
Eliminating Toil: Writing clean, efficient code to automate repetitive operational tasks, turning manual workflows into seamless "set and forget" processes
Defining Observability: Building advanced monitoring and alerting frameworks that provide deep insights into system health and performance

What we are looking for

Kubernetes Power User: Extensive experience managing production-grade K8s environments, including ingress, service mesh, and container security
Cloud Infrastructure Expert: A deep understanding of cloud networking, storage, and compute services within a major provider (AWS, Azure, or GCP)
Proactive Mindset: An engineer who doesn't wait for a ticket, but naturally seeks out system weaknesses and builds solutions to fix them
AI Curiosity: An active interest in the AI landscape and a desire to leverage LLMs or machine learning to improve SRE workflows
Programming Literacy: Ideally, experience with at least one language (such as Java, Python, Go, or Ruby) to bridge the gap between software engineering and operations

Organization	Matillion
Industry	IT / Telecom / Software Jobs
Occupational Category	Site Reliability Engineer
Job Location	Manchester, UK
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2026-01-19 3:19 pm
Expires on	2026-07-17