Senior SRE Specialist

B2B

DevOps

What project we have for you

We are a publicly traded FTSE 250 FinTech company that runs mobile, web and desktop platforms that help our clients trade stocks and shares, leveraged products, futures and options, and crypto.

We are ambitious. Over 340,000 people already use our platforms. We’re global with offices in 18 countries and products in 16 regions. We’re hungry to move faster, ship better products for our customers and grow our user base. We believe in high autonomy, and we want people who are looking to do things differently in order to create better experiences for our customers.

We work in cross-functional teams and are laser-focused on increasing the number of active clients we serve to drive sustainable growth.

Your team

The SRE Team comprises highly skilled software engineers dedicated to embedding performance and reliability into our trading platform. You’ll work with cutting-edge distributed systems handling high-throughput, low-latency trading operations that demand zero downtime.

As a Site Reliability Engineer, you’ll champion reliability patterns, improve observability, establish 24/7 operations, and drive operational excellence across our crypto trading platform infrastructure and associated applications.

What you will do

System Reliability & Operations

This role excludes on-call support.
Implement comprehensive monitoring and observability using OpenTelemetry and distributed tracing
Establish and maintain operational readiness including automated deployments, blue/green releases, and zero-downtime patching strategies
Define and track Service Level Objectives (SLOs) and Error Budgets for critical crypto trading services
Identify and eliminate single points of failure in distributed systems

Application Instrumentation & Observability

Instrument Java applications with OpenTelemetry spans, metrics, and traces
Work hands-on with development teams to add observability to their code
Guide teams on implementing meaningful SLIs that reflect user experience

Technical Leadership & Enablement

Partner with development teams on system design, capacity planning, and architectural reviews
Provide technical guidance and hands-on support for teams transitioning from traditional deployments to containerized infrastructure
Mentor developers on reliability patterns including circuit breakers, retry logic, and fault tolerance
Lead by example – write production code that demonstrates SRE best practices

Software Development & Automation

Write clean, maintainable code in Java and Python following industry best practices
Build automation tools and CI/CD pipelines that embed reliability practices
Contribute to application codebases to implement instrumentation and reliability patterns
Apply software engineering discipline including version control, code reviews, and testing

What you need for this

Java development experience – Must be able to read, write, and instrument Java code. Deep understanding of JVM internals and experience with complex distributed Java applications

Observability & Instrumentation – Hands-on experience with OpenTelemetry, distributed tracing concepts (spans, trace context propagation), and observability platforms such as Honeycomb, Datadog, Dynatrace, Splunk or Grafana. Strong understanding of OpenTelemetry Collector pipelines, including data transformation, enrichment, and labeling, use of processors (attributes, resource, transform, span, tail sampling), and propagation of custom business identifiers (e.g., customer/tenant/transaction IDs) across services to enable end-to-end trace correlation between heterogeneous systems, applications, and environments.

SLO/SLI Expertise – Proven experience defining SLOs based on SLIs, establishing error budgets, and working with development teams on reliability measurement

Reliability Patterns – Solid understanding of circuit breakers, retry logic, bulkheads, and other fault tolerance patterns

Cloud – AWS & Kubernetes Platform Engineering – Strong hands-on experience with AWS as the primary cloud provider, including production workloads on Amazon EKS. Proven expertise in Kubernetes networking, covering ingress and egress controllers (e.g., ALB / NGINX / Envoy), service configuration and fine-tuning (requests/limits, HPA/VPA, pod disruption budgets, network policies), and traffic management. Demonstrated ability to investigate and optimize performance and reliability using metrics, logs, and traces, complemented by chaos engineering practices (fault injection, node/pod failures, network latency, dependency outages) to validate system resilience and high availability under real-world failure scenarios.

Message Brokers – Production experience with ActiveMQ, Kafka, or similar messaging systems

Containerization – Hands-on experience with container orchestration (Nomad experience is advantageous, Kubernetes acceptable)

CI/CD – Experience building and maintaining deployment pipelines, preferably with GitLab

Experience Requirements

Track record in high-throughput production environments (financial services, trading platforms, or similar mission-critical systems preferred)
Demonstrated ability to improve system reliability and performance at scale
Experience working collaboratively with development teams to implement observability and reliability improvements
Strong troubleshooting skills in distributed systems environments

Core Competencies

Systems thinking approach to problem-solving
Excellent communication skills for cross-functional collaboration and technical enablement
Ability to balance hands-on development work with operational responsibilities
Strong bias toward automation and eliminating manual toil
Comfortable working in a fast-paced environment with evolving requirements