We are looking for a Site Reliability Engineer (SRE) to operate our Azure cloud infrastructure and ensure the availability of the applications in our new generation of SaaS products.
- Own observability end-to-end with Datadog — from dashboards and alerts to log pipelines and service monitoring
- Proactively identify reliability risks and performance bottlenecks using Datadog metrics, traces, and logs
- Define and maintain SLOs, SLIs, and error budgets, and partner with teams to improve service reliability
- Participate in on-call rotations, incident response, and root cause analyses (RCAs)
- Lead blameless postmortems and continuously improve incident response and recovery procedures
- Automate infrastructure provisioning and updates using Terraform
- Build and maintain CI/CD pipelines using Jenkins
- Work with cross-functional teams to design and scale systems across Azure, and later across AWS and GCP
- Create automation and tools to reduce toil, improve MTTR, and scale reliability practices
- Mentor teams on SRE principles and best practices
- Solid experience in a Site Reliability Engineering or DevOps role supporting production systems
- Strong hands-on experience with Datadog: creating monitors, dashboards, log pipelines, APM, and alerting strategies
- Strong familiarity with incident management processes, including on-call, triage, communication, and postmortems
- Practical experience with Kubernetes (ideally AKS) in production
- Strong knowledge of Terraform and Jenkins
- Proficiency in one or more scripting languages (e.g., Python, Bash, PowerShell)
- Knowledge of Azure services, especially Blob Storage, Service Bus, and Application Gateway
- Passion for automation, root cause analysis, and driving systems toward high availability
- Experience with AWS, GCP, or managing systems in a multi-cloud environment is desirable
The product provides comprehensive retail chain automation and covers all work processes of large retail chain operators, including retail store management, warehouse management, payment systems integration, logistics management, and hardware/software store automation. It is already adopted by the market, and the biggest US and global retail operators are among its clients.
Toshiba Global Commerce Solutions is a dynamic global company based in Research Triangle Park, NC, providing retail store solutions to your favorite brands. Have you ever been in a hurry and made use of the self-checkout at Lowe’s Foods, earned fuel rewards at Kroger, or just paid for purchases at retailers such as Walmart, Michaels, Carrefour, The Gap, Calvin Klein, Boots, Cencosud, BJ’s, or Costco? These are just a few examples of our in-store solutions and impressive customer base that made us the world’s installed market share leader. The nature of retail is changing quickly, so if you share our ‘Together Commerce’ vision of a seamless two-way, participatory shopping experience, let’s get together to drive the new economy.