Lead Site Reliability Engineer
About the Role
Akka Platform is a cloud-native PaaS that enables teams to build and operate AI-enhanced microservices at scale. Our infrastructure lets customers run their workloads on AWS, GCP, and Azure, and is built around Kubernetes-native primitives – from GitOps delivery and custom Operators to service meshes and zero-trust networking.
The Lead SRE owns reliability, scalability, and security for a multi-tenant platform running customer workloads across EKS, GKE, and AKS. The visible work is what you'd expect: SLOs, capacity planning, cluster and database upgrades, the service mesh, the PKI, the observability stack, on-call. The work that matters more is the work that doesn't show up on a stack diagram: being the person the SOC 2 auditor turns to, owning a disaster recovery plan that has been executed in production rather than discussed in a conference room, and inheriting a three-person team and leaving it stronger than you found it. That last part means the post-mortem that names things honestly, the on-call load that needs rebalancing, and the engineer who isn't growing into the role. This is not a role you grow into. If you've done this work before, not adjacent work, not work at a smaller scale, you'll recognize the job from the description above. If you haven't, the rest of the interview will be both of us finding that out, and we'd rather save everyone the time.
What You’ll Do
Platform Reliability & Operations
- Own Service Level Objectives/Service Level Indicators (SLOs/SLIs) and error budgets across multi-cloud clusters (EKS, GKE, AKS); drive blameless post-mortems and systemic remediation.
- Lead capacity planning with our customers, cluster lifecycle management, and Kubernetes and database upgrade cycles.
- Define and enforce runbooks, on-call rotations, and escalation paths for the wider engineering organisation.
Infrastructure as Code & Delivery
- Own and evolve the IaC layer: Helm charts, Crossplane compositions, and FluxCD GitOps pipelines.
- Design and maintain cloud-resource provisioning workflows that span all three cloud providers, with consistent policy controls.
Networking & Security
- Architect and operate connectivity patterns: AWS PrivateLink / Transit Gateway, GCP NCC, Azure VNet Peering, and cross-region ingress with Contour/Envoy.
- Maintain and evolve the Linkerd service mesh for mTLS, workload identity (OIDC), and zero-trust authorisation policies.
- Drive PKI hygiene with cert-manager: root/intermediate CA rotation, ACME certificate lifecycle, and secret management via KMS-backed Kubernetes vaults.
Observability & Incident Response
- Own the observability stack: Prometheus, Cortex (multi-tenant metrics), OpenTelemetry sidecars, centralised log pipelines, and Groundcover / Grafana dashboards.
- Establish alerting standards and SLO-based alerting rules; ensure distributed traces are actionable across JVM, Rust, and Go workloads.
- Actively participate in on-call and lead the technical response for platform-level incidents.
Technical Leadership & Mentorship
- Set engineering standards and review infrastructure changes across the team.
- Partner with Security, Product, and Application Engineering to translate reliability requirements into platform capabilities.
- Grow a team of 3–5 SREs through code review, architecture sessions, and career conversations.
What We’re Looking For
Required Experience
- 7+ years in SRE, platform engineering, or infrastructure engineering roles.
- Deep, hands-on Kubernetes experience: operating and scaling clusters across at least two of GKE, EKS, and AKS in production.
- Proven IaC ownership: Helm chart authoring, Crossplane provider/composition design, and GitOps with Flux or ArgoCD.
- Strong multi-cloud networking: VPC design, private connectivity (PrivateLink, NCC, VNet Peering), and DNS (Route 53, Cloud DNS, Azure DNS, Cloudflare).
- Production experience with a service mesh (Linkerd or Istio) and Envoy-based ingress.
- Solid observability track record with Prometheus, distributed tracing (OpenTelemetry), and structured logging pipelines.
- Experience securing Kubernetes clusters: RBAC, workload identity / OIDC, mTLS, and secret management with cloud KMS.
- Comfortable reading and writing at least one systems language (Go, Rust, or similar) and shell scripting for automation and operator development.
Nice to Have
- Experience writing Kubernetes Operators / custom controllers (Go preferred).
- Familiarity with JVM workloads on Kubernetes – GC tuning, heap sizing, graceful shutdown.
- Exposure to event-driven / event-sourcing architectures (Akka, Kafka, or similar).
- Experience with Teleport for federated cluster access.
- Background operating Cortex for long-term, multi-tenant metrics storage.
- Knowledge of gRPC service design and debugging.
Who You Are
- You think in systems – you reason about failure modes, blast radius, and cascading effects before cutting tickets.
- You treat infrastructure as a product – reliability, security, and developer experience are non-negotiable features.
- You communicate clearly across all levels: from a detailed post-mortem to a board-level incident summary.
- You raise the bar without creating bottlenecks – you know when to approve quickly and when to push back hard.
What We Offer
- Competitive salary and equity, benchmarked against senior/lead IC roles in your market.
- Remote-first culture with flexible working hours.
- Comprehensive health and wellness benefits.
- Opportunities for professional development and continuous learning.
- Collaborative, inclusive, and innovative company culture.
- A team that has strong opinions, writes good documentation, and builds things they are proud of.
Our Core Values
- We’re Authentic: We value transparency and genuine communication, without politics or games. We're honest and assume good intentions, cultivating trust and accountability within our organization and in our interactions with others outside of Akka.
- We’re Customer-Focused: We value customer outcomes above all else. By prioritizing our customers' interests, and meeting them where they are today, we help ensure their success. We are dedicated to deeply understanding our customers’ needs, anticipating challenges, navigating time constraints, and striving to exceed expectations.
- We’re Nonconventional: We value fearless innovation by challenging the status quo and embracing alternative approaches. Continuous learning and a growth mindset, aimed at improving ourselves, our company, and our products, drive us to push boundaries and explore new solutions. Guided by a bias for action, we leverage industry and customer insights to inspire fresh ideas, enabling optimal future offerings.
- We’re Persistent: We value excellence through continuous experimentation and courageous problem-solving. We recognize that achieving success often demands approaching challenges with tenacity and taking calculated risks to achieve leading-edge solutions.
Akka is an Equal Opportunity Employer. We welcome applications from candidates of all backgrounds and experience levels.