We are looking for a seasoned and strategic Site Reliability Engineering (SRE) Manager to lead and grow our SRE team. You will be responsible for building and managing a team of engineers who ensure the reliability, scalability, and performance of our mission-critical systems and services. As the SRE Manager, you will play a key role in the design and execution of operational best practices while promoting a culture of automation and continuous improvement across the engineering teams.
- Team Leadership and Mentorship: Lead, mentor, and grow a team of Site Reliability Engineers by providing guidance on best practices, technical decisions, and career development.
- Operational Excellence: Own the overall reliability, uptime, and performance of the systems and services, ensuring they meet business SLAs and customer expectations.
- Incident Management: Oversee the incident response process, including monitoring, alerting, incident resolution, and root cause analysis, with a focus on improving response times and minimizing impact.
- Automation and Tooling: Drive the adoption of automation and self-service tools to reduce manual intervention, improve system reliability, and enhance engineering productivity.
- Collaboration with Engineering Teams: Work closely with software developers, QA teams, and other stakeholders to embed reliability into the design and development of applications and services.
- Capacity Planning and Performance Optimization: Manage capacity planning, performance monitoring, and optimization to ensure the infrastructure can scale to meet business needs.
- Infrastructure Management: Collaborate with the DevOps and cloud infrastructure teams to manage, maintain, and optimize cloud infrastructure using modern IaC (Infrastructure as Code) tools and methodologies.
- Budget and Resource Management: Manage budgets, vendor relationships, and resource allocations to ensure efficient use of infrastructure and technology investments.
- Drive SRE Culture: Promote a culture of continuous improvement, emphasizing learning from failure, monitoring, and proactive problem-solving.
- Security and Compliance: Work closely with security teams to implement best practices for secure infrastructure and ensure compliance with internal and external regulations.
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent work experience.
- 5+ years of experience in site reliability engineering, DevOps, or infrastructure roles, including at least 2 years in a leadership or management role.
- Deep knowledge of cloud platforms (AWS, Google Cloud Platform, Azure) and the ability to manage highly available and scalable infrastructure.
- Hands-on experience with monitoring, alerting, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
- Strong expertise in automation tools and practices (CI/CD pipelines, IaC tools such as Terraform).
- Solid understanding of containers and orchestration tools (Docker, Kubernetes).
- Proven experience with incident management, root cause analysis, and post-mortem processes.
- Deep knowledge of Linux/Unix systems administration and networking concepts (DNS, TCP/IP, load balancing).
- Strong communication and leadership skills, with the ability to collaborate across teams and functions.
- Experience with large-scale distributed systems and high-availability architectures.
- Familiarity with security best practices for cloud environments.
- Experience managing multi-region, multi-cloud deployments.
- Prior experience working in Agile or Scrum environments.
- Knowledge of cost optimization strategies for cloud infrastructure.
- Strong organizational and time management skills.
- Ability to influence and inspire teams to adopt best practices.
- Excellent verbal and written communication skills for both technical and non-technical stakeholders.
- Ability to think strategically while staying hands-on when necessary.
- Demonstrated problem-solving skills and a proactive approach to identifying risks and finding solutions.