Description:
We are looking for a Site Reliability Engineer who views "manual effort" as a bug to be fixed. In this role, you won't just be keeping the lights on; you will be the architect of our system’s resilience. We need a proactive engineer who is obsessed with Kubernetes and cloud infrastructure but also has a visionary streak: someone eager to experiment with AI-driven operations (AIOps) to predict failures and automate responses. If you enjoy building self-healing systems and staying ahead of the tech curve, this is the place for you.
What you will be doing
- Engineering Reliability: Designing and implementing self-healing infrastructure using Kubernetes to maintain high uptime and system integrity
- Scaling Cloud Ecosystems: Optimizing our cloud footprint (AWS/GCP/Azure) to ensure our platforms can handle rapid growth without breaking a sweat
- Innovating with AI: Proactively identifying opportunities to integrate AI tools into our observability stack to automate incident detection and root-cause analysis
- Eliminating Toil: Writing clean, efficient code to automate repetitive operational tasks, turning manual workflows into seamless "set and forget" processes
- Defining Observability: Building advanced monitoring and alerting frameworks that provide deep insights into system health and performance
What we are looking for
- Kubernetes Power User: Extensive experience managing production-grade K8s environments, including ingress, service mesh, and container security
- Cloud Infrastructure Expert: A deep understanding of cloud networking, storage, and compute services on at least one major provider (AWS, Azure, or GCP)
- Proactive Mindset: An engineer who doesn't wait for a ticket; you naturally seek out system weaknesses and build solutions to strengthen them
- AI Curiosity: An active interest in the AI landscape and a desire to leverage LLMs or machine learning to improve SRE workflows
- Programming Literacy: Ideally, experience with at least one programming language (such as Java, Python, Go, or Ruby) to bridge the gap between software engineering and operations