The role sits at the intersection of infrastructure, platform engineering, and site reliability, with ownership across build, deploy, run, and scale phases.
Key Responsibilities
Own the reliability, availability, and performance of large-scale distributed systems
Design, manage, and scale Kubernetes clusters supporting microservices architectures
Build and maintain CI/CD pipelines that enable fast, safe, and repeatable deployments
Implement and manage cloud infrastructure using Infrastructure as Code principles
Improve system resilience through proactive monitoring, alerting, and observability
Participate in production incident response, root cause analysis, and post-incident reviews
Continuously reduce operational toil through automation and self-healing systems
Technology Stack & Tools
Cloud platforms: AWS, Azure, or Google Cloud Platform (GCP)
Containerization and orchestration: Kubernetes, Docker, Helm, Skaffold
Infrastructure as Code: Terraform
Observability stack: Prometheus, Grafana, centralized logging, and distributed tracing
Scripting and automation: Python, Shell scripting, Linux tooling
Required Experience & Skills
3–5 years of hands-on experience in DevOps Engineering, SRE, or Platform Engineering roles
Strong experience managing production Kubernetes environments
Deep understanding of cloud-native architectures and microservices-based systems
Proven experience with CI/CD systems and release automation
Strong troubleshooting skills in live production environments
Solid knowledge of monitoring, alerting, incident management, and postmortem practices
Automation-first mindset with a strong bias toward scalable, repeatable systems
What Makes This Role Impactful
This role directly influences system scalability, platform stability, and customer experience. The work affects engineering velocity, infrastructure costs, and the overall reliability of the business. The position partners closely with software engineering teams to shape platform architecture and long-term technical decisions.
Who This Role Is Best Suited For
Engineers who enjoy solving infrastructure problems at scale, taking ownership of production systems, and building platforms that grow without constant manual intervention. The role is ideal for someone who sees reliability as a core product feature and automation as a non-negotiable standard.