A B2B SaaS platform for automating sales, contracts, and revenue. The platform unifies the entire deal lifecycle — from quote creation to revenue recognition.
- Collaborate with engineering teams to design, implement, and maintain reliable and scalable infrastructure.
- Manage and troubleshoot Kubernetes clusters in cloud environments (AWS, Azure, or GCP).
- Maintain and optimize containerized workloads, including Helm charts and CI/CD pipelines.
- Write scripts to automate operational tasks using Bash, Python, or Go.
- Monitor production systems and implement observability solutions using Prometheus, Grafana, and OpenTelemetry.
- Conduct root cause analysis for incidents and implement preventative measures.
- Continuously improve deployment workflows, production reliability, and operational efficiency.
- 3+ years of professional experience in DevOps or Site Reliability Engineering roles.
- Hands-on experience with Kubernetes administration.
- Hands-on experience troubleshooting in cloud environments (AWS, Azure, or GCP).
- Experience managing containerized workloads and maintaining Helm charts.
- Solid understanding of CI/CD principles and tools (e.g., Jenkins, GitLab CI, GitHub Actions).
- Proficiency in scripting languages such as Bash, Python, or Go.
- Knowledge of monitoring and observability ecosystems (Prometheus, Grafana).
- Experience with distributed tracing using OpenTelemetry.
- Strong collaboration and communication skills within engineering teams.
- Strong problem-solving skills with a proven track record of performing root cause analysis.
- Good command of English (written and spoken).
- Experience running, deploying, and debugging Java applications in production.
- Proficiency with Java build tools (Maven or Gradle) and CI pipeline optimization.
- Understanding of JVM fundamentals (memory management, garbage collection).
- Experience maintaining and scaling Elasticsearch clusters.
- Experience managing MongoDB in production environments.