We are a publicly traded FTSE 250 FinTech company that runs mobile, web and desktop platforms to help our clients trade stocks and shares, leveraged products, futures and options, and crypto.
We are ambitious. Over 340,000 people already use our platforms. We’re global, with offices in 18 countries and products in 16 regions. We’re hungry to move faster, ship better products for our customers and grow our user base. We believe in high autonomy, and we want people who are looking to do things differently in order to create better experiences for our customers.
We work in cross-functional teams and are laser-focused on increasing the number of active clients we serve to drive sustainable growth.
Your team
The SRE Team comprises highly skilled software engineers dedicated to embedding performance and reliability into our trading platform. You’ll work with cutting-edge distributed systems handling high-throughput, low-latency trading operations that demand zero downtime.
As a Site Reliability Engineer, you’ll champion reliability patterns, improve observability, establish 24/7 operations, and drive operational excellence across our crypto trading platform infrastructure and associated applications.
System Reliability & Operations
Application Instrumentation & Observability
Technical Leadership & Enablement
Software Development & Automation
Java development experience – Must be able to read, write, and instrument Java code. Deep understanding of JVM internals and experience with complex distributed Java applications.
Observability & Instrumentation – Hands-on experience with OpenTelemetry, distributed tracing concepts (spans, trace context propagation), and observability platforms such as Honeycomb, Datadog, Dynatrace, Splunk or Grafana. Strong understanding of OpenTelemetry Collector pipelines, including data transformation, enrichment, and labeling, use of processors (attributes, resource, transform, span, tail sampling), and propagation of custom business identifiers (e.g., customer/tenant/transaction IDs) across services to enable end-to-end trace correlation between heterogeneous systems, applications, and environments.
SLO/SLI Expertise – Proven experience defining SLOs based on SLIs, establishing error budgets, and working with development teams on reliability measurement
Reliability Patterns – Solid understanding of circuit breakers, retry logic, bulkheads, and other fault tolerance patterns
Cloud – AWS & Kubernetes Platform Engineering – Strong hands-on experience with AWS as the primary cloud provider, including production workloads on Amazon EKS. Proven expertise in Kubernetes networking, covering ingress and egress controllers (e.g., ALB / NGINX / Envoy), service configuration and fine-tuning (requests/limits, HPA/VPA, pod disruption budgets, network policies), and traffic management. Demonstrated ability to investigate and optimize performance and reliability using metrics, logs, and traces, complemented by chaos engineering practices (fault injection, node/pod failures, network latency, dependency outages) to validate system resilience and high availability under real-world failure scenarios.
Message Brokers – Production experience with ActiveMQ, Kafka, or similar messaging systems
Containerization – Hands-on experience with container orchestration (Nomad experience is advantageous, Kubernetes acceptable)
CI/CD – Experience building and maintaining deployment pipelines, preferably with GitLab
Experience Requirements
Core Competencies