We are seeking a senior, hands-on Quality Engineer with a strong SRE mindset to own reliability, resilience, and production quality for our global POS, cloud, and middleware platform.
This is a deeply technical individual-contributor role focused on preventing production incidents through rigorous quality engineering, failure-mode validation, automation, and observability. As AI accelerates development velocity, this role ensures that verification, resilience testing, and release safety scale accordingly.
This role operates close to production systems and partners daily with Software Engineering, SRE, Cloud Operations, and Architecture to ensure services are safe, observable, and resilient before release.
Reliability Engineering & SRE Practices
• Continuously validate SLOs and SLIs.
• Assess releases and block those that violate reliability or operability standards.
• Analyze incidents and near-misses to drive permanent preventive improvements.
• Reduce Sev-1 and Sev-2 incidents, repeat failures, and operational toil.
• Participate in on-call and production support rotations as a reliability specialist.
Resilience & Failure-Mode Testing
• Design and execute:
— chaos engineering experiments
— failover and disaster recovery validation
— load, stress, and soak testing
• Validate real-world failure scenarios including:
— store-mode operation
— payment and transaction workflows
— edge-device and network instability
— partial and cascading service failures
• Ensure systems fail safely, degrade gracefully, and recover automatically.
Performance & Capacity Engineering
• Identify performance bottlenecks and scaling limits early.
• Validate capacity models and peak traffic assumptions.
• Partner with engineers to tune latency, throughput, and resource utilization.
• Ensure performance regressions are detected automatically.
Production Readiness & Architecture Validation
• Lead and contribute to Production Readiness Reviews (PRRs).
• Validate:
— observability completeness
— alert quality and signal-to-noise
— rollback and recovery strategies
— dependency risk and blast radius
• Enforce resilience patterns such as:
— timeouts
— retries with backoff
— circuit breakers
Observability & Telemetry Quality
• Ensure services emit high-quality logs, metrics, traces, and business KPIs.
• Validate that alerts detect customer-impacting issues early.
• Use telemetry to proactively identify reliability risks and hidden failure modes.
• Partner with platform teams on OpenTelemetry standards and adoption.
Release Reliability & Automation
• Partner with AI/ML teams to apply intelligent analysis to test and production data.
Cross-Functional Engineering Collaboration
• Embed with feature teams as a reliability-focused QE.
• Partner with:
— Software Development
— SRE and Cloud Operations
— Functional Quality Engineering
— Architecture and Platform Engineering
• Act as a technical authority on production quality and reliability.
Quality Engineering & SRE Background
• Relevant experience in:
— Quality Engineering
— Site Reliability Engineering
— Reliability or Performance Engineering
Technical Expertise
• Strong hands-on experience with:
— distributed systems and microservices
— Kubernetes and cloud platforms (AKS preferred)
— observability stacks (OpenTelemetry, Grafana, Application Insights, Datadog)
— performance testing and tuning
— fault tolerance and resilience patterns
— database and service scaling
— chaos and failure-mode testing
Automation & Tooling
• Strong coding/scripting skills (Python, Go, Java, or similar).
• Experience building automated test, validation, or reliability tooling.
• Experience integrating automation into CI/CD pipelines.
Operational Excellence
• Strong incident analysis and root-cause prevention skills.
• Comfortable working close to production systems.
• Experience with retail POS, payments, or edge-device environments.
• Experience with hybrid cloud and edge architectures.
• Experience with AI-assisted testing, anomaly detection, or reliability analytics.