Site Reliability Engineer

Description:

We are looking for a Site Reliability Engineer who views "manual effort" as a bug to be fixed. In this role, you won't just be keeping the lights on; you will be the architect of our system’s resilience. We need a proactive engineer who is obsessed with Kubernetes and cloud infrastructure but also has a visionary streak: someone eager to experiment with AI-driven operations (AIOps) to predict failures and automate responses. If you enjoy building self-healing systems and staying ahead of the tech curve, this is the place for you.

What you will be doing

  • Engineering Reliability: Designing and implementing self-healing infrastructure using Kubernetes to maintain high uptime and system integrity
  • Scaling Cloud Ecosystems: Optimizing our cloud footprint (AWS/GCP/Azure) to ensure our platforms can handle rapid growth without breaking a sweat
  • Innovating with AI: Proactively identifying opportunities to integrate AI tools into our observability stack to automate incident detection and root-cause analysis
  • Eliminating Toil: Writing clean, efficient code to automate repetitive operational tasks, turning manual workflows into seamless "set and forget" processes
  • Defining Observability: Building advanced monitoring and alerting frameworks that provide deep insights into system health and performance

What we are looking for

  • Kubernetes Power User: Extensive experience managing production-grade K8s environments, including ingress, service mesh, and container security
  • Cloud Infrastructure Expert: A deep understanding of cloud networking, storage, and compute services within a major provider (AWS, Azure, or GCP)
  • Proactive Mindset: An engineer who doesn't wait for a ticket; you naturally seek out system weaknesses and build solutions to strengthen them
  • AI Curiosity: An active interest in the AI landscape and a desire to leverage LLMs or machine learning to improve SRE workflows
  • Programming Literacy: Ideally, experience with at least one programming language (such as Java, Python, Go, or Ruby) to bridge the gap between software engineering and operations

Organization: Matillion
Industry: IT / Telecom / Software Jobs
Occupational Category: Site Reliability Engineer
Job Location: Manchester, UK
Shift Type: Morning
Job Type: Full Time
Gender: No Preference
Career Level: Intermediate
Experience: 2 Years
Posted at: 2026-01-19 3:19 pm
Expires on: 2026-03-05