Site Reliability Engineering Manager at Toshiba

B2B

DevOps

remote

About the Role:

We are looking for a seasoned and strategic Site Reliability Engineering (SRE) Manager to lead and grow our SRE team. You will be responsible for building and managing a team of engineers who ensure the reliability, scalability, and performance of our mission-critical systems and services. As the SRE Manager, you will play a key role in the design and execution of operational best practices while promoting a culture of automation and continuous improvement across the engineering teams.

Key Responsibilities:

Team Leadership and Mentorship: Lead, mentor, and grow a team of Site Reliability Engineers by providing guidance on best practices, technical decisions, and career development.
Operational Excellence: Own the overall reliability, uptime, and performance of the systems and services, ensuring they meet business SLAs and customer expectations.
Incident Management: Oversee the incident response process, including monitoring, alerting, incident resolution, and root cause analysis, with a focus on improving response times and minimizing impact.
Automation and Tooling: Drive the adoption of automation and self-service tools to reduce manual intervention, improve system reliability, and enhance engineering productivity.
Collaboration with Engineering Teams: Work closely with software developers, QA teams, and other stakeholders to embed reliability into the design and development of applications and services.
Capacity Planning and Performance Optimization: Manage capacity planning, performance monitoring, and optimization to ensure the infrastructure can scale to meet business needs.
Infrastructure Management: Collaborate with the DevOps and cloud infrastructure teams to manage, maintain, and optimize cloud infrastructure using modern Infrastructure as Code (IaC) tools and methodologies.
Budget and Resource Management: Manage budgets, vendor relationships, and resource allocations to ensure efficient use of infrastructure and technology investments.
Drive SRE Culture: Promote a culture of continuous improvement, emphasizing learning from failure, monitoring, and proactive problem-solving.
Security and Compliance: Work closely with security teams to implement best practices for secure infrastructure and ensure compliance with internal and external regulations.

Required Qualifications:

Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent work experience.
5+ years of experience in site reliability engineering, DevOps, or infrastructure roles, including at least 2 years in a leadership or management role.
Deep knowledge of cloud platforms (AWS, Google Cloud Platform, Azure) and the ability to manage highly available and scalable infrastructure.
Hands-on experience with monitoring, alerting, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
Strong expertise in automation tools and practices (CI/CD pipelines, IaC tools such as Terraform).
Solid understanding of containers and orchestration tools (Docker, Kubernetes).
Proven experience with incident management, root cause analysis, and post-mortem processes.
Deep knowledge of Linux/Unix systems administration and networking concepts (DNS, TCP/IP, load balancing).
Strong communication and leadership skills, with the ability to collaborate across teams and functions.

Preferred Qualifications:

Experience with large-scale distributed systems and high-availability architectures.
Familiarity with security best practices for cloud environments.
Experience managing multi-region, multi-cloud deployments.
Prior experience working in Agile or Scrum environments.
Knowledge of cost optimization strategies for cloud infrastructure.

Soft Skills:

Strong organizational and time management skills.
Ability to influence and inspire teams to adopt best practices.
Excellent verbal and written communication skills for both technical and non-technical stakeholders.
Ability to think strategically while staying hands-on when necessary.
Demonstrated problem-solving skills and a proactive approach to identifying risks and finding solutions.