LIVE · OPERATIONAL 📍 NEW JERSEY ⚡ STRIPE / DEVOPS 5+ YRS · STRIPE · DBX · PAYTM

Abhishek Soppanna.

I build the boring, invisible plumbing that keeps payments moving — multi-region Kubernetes, GPU ML infra, and FinOps tooling at Stripe.

Five years of DevOps / SRE across Stripe, Databricks, and Paytm — payments rails, multi-region Kubernetes, IaC, observability, FinOps. The site you're on is a working dashboard, not a brochure: scale the cluster, run a deploy, query the terminal. Click anything that looks clickable.

PROD UPTIME

99.9%

▲ payments-grade SLO held

MTTR

↓25%

▼ obs + automation · Stripe

CLOUD SAVED

$2.4M

FinOps · annualized run-rate

INFRA INCIDENTS

0

infra-caused · while +15% txn growth

/ 01 — TRACE A PAYMENT

Watch $5 travel through 9 systems in ~140ms.

This is what payments infrastructure does in the moment between "Pay" and "Paid." Type any amount, hit trace. Hover any node for the why. Inject a failure to see how it survives.

$ interactive · canonical Stripe-style flow

TOTAL LATENCY 0ms

awaiting trace · click ▶ TRACE to begin · or hit one of the chaos buttons below

/ 02 — THE TRADEOFF

Three dimensions. One impossible job.

Speed, reliability, cost. You don't get all three. Drag the dot. Watch what you'd give up.

P99 LATENCY

142ms

UPTIME

99.85%

COST / MO

$112k

INFRA GRADE

A-

You're balanced for payments at scale — reliability-leaning with disciplined cost and a healthy speed budget. This is what shipping money looks like.

/ 03 — RESILIENCE

What happens when one machine dies?

Click the button. Watch a node fail, traffic re-route, replica promote, alerts settle — under five seconds. No human in the loop.

cluster nominal · 3/3 pods healthy · awaiting chaos

/ 04 — SCALE

Turn the knob. Watch it breathe.

One physical knob controls incoming traffic, from 100 RPS to 100K RPS. Pods auto-scale. Latency creeps. Costs climb. At saturation, watch graceful degradation, not a collapse.

drag to turn · click around the dial · ↑/↓ keys

PODS RUNNING

P99 LATENCY

38ms

COST / HR

$0.36

ERRORS / MIN

NOMINAL cluster cruising · pods at 8% utilization

→ HPA monitoring · target: 70% CPU · scale-up cooldown 30s
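
As a rough illustration of the scaling rule behind a readout like this: the Kubernetes Horizontal Pod Autoscaler documents its core calculation as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and skips scaling when that ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that rule to the 70% CPU target. The function name and the min/max replica bounds are assumptions made for illustration, not the dashboard's actual code, and the 30s cooldown is not modeled.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 70.0, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Replica count suggested by the documented HPA ratio rule.

    min_replicas and max_replicas are hypothetical bounds for this sketch.
    """
    ratio = current_cpu_pct / target_cpu_pct
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 3 pods averaging 140% of their requested CPU would be scaled to 6, while 3 pods at 72% stay at 3 because the ratio is inside the tolerance band.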

/ 05 — SHELL

Ask me anything in zsh.

Real command parser. Try whoami · stripe · kubectl get pods. Use ↑/↓ for history, tab to autocomplete.

~/abhishek — zsh

try help

commands: help whoami skills kubectl deploy contact stripe clear

/ 06 — DEPLOY

Push to prod, watch it ship.

Click "Run pipeline" to simulate a real deploy: tests, build, security scan, staging, prod rollout. Same flow I build for engineering teams.

deploy-payments-api · #4287

trigger: git push origin main

Test

queued

Build

queued

Scan

queued

Staging

queued

Prod

queued

elapsed: 0:00

/ 07 — TRACK RECORD

Where I've shipped.

Three companies, one throughline: making payments infrastructure boring. The numbers below are what I shipped, not what I aspire to.

Stripe.

SEP 2024 — PRESENT

Senior DevOps / Cloud Engineer · California

99.9%

Uptime · payments SLO

↓25%

MTTR · obs + automation

$2.4M

Annualized cloud savings

Designed and operate a multi-region AWS EKS platform with Helm, GitHub Actions, and Ansible — 99.9% uptime through 15% transaction volume growth, zero infrastructure-caused incidents.
Architected AWS VPC networking for K8s workloads across multi-AZ environments — private subnets, ingress, security groups — cutting environment-related deployment issues by 20%.
Integrated Trivy + Open Policy Agent into GitHub Actions, enforcing SOC 2 / PCI-DSS and blocking non-compliant deployments pre-prod.
Partnered with ML and Quant teams on GPU infra (SageMaker + EKS) — reduced model inference latency by 20%.
Built observability + FinOps dashboards in Prometheus / Grafana — cut MTTR by 25% and surfaced cost-optimization across the platform.
Shipped Python predictive failure detection (scikit-learn / TensorFlow) on telemetry — cut incident response time by 30%.
Tuned HA Postgres (RDS) + Redis (ElastiCache) with automated backups, perf tuning, and connection pooling for payment-system data integrity.

Databricks.

NOV 2023 — AUG 2024

DevOps Engineer · California

Engineered Azure IaC with Terraform + Bicep across AKS, App Services, and VNets — multi-team provisioning that went from days to minutes.
Built scalable Azure DevOps YAML pipelines for microservices: blue/green rollouts, automated rollback on canary regression — release cycles up 15%.
Containerized distributed apps on Docker + AKS for portability and environment consistency across staging and prod.
Stood up Azure Monitor + Log Analytics for cross-service tracing and proactive alerting — cut MTTR by 20%.

Paytm.

JUN 2019 — JUL 2022

DevOps Engineer · India

Led SRE for the UPI payment gateway — peak ~30M+ txns/day. Deployed Prometheus + PagerDuty, drove SLO/SLI practices, cut incident resolution time by 15%.
Built CI/CD with Jenkins, GitLab CI, Argo CD, Docker — daily deploys, release reliability up 25%.
Designed automated disaster recovery with Terraform + CloudFormation — RTO 4hr → ~18min, validated quarterly.
Implemented PCI-DSS-compliant IAM, AWS KMS, and HashiCorp Vault to secure cardholder data and harden cloud security posture.

/ 08 — ACTIVE BUILDS

What I'm shipping right now.

Live work at Stripe — payment-grade reliability, ML-driven ops, infrastructure that pays for itself.

INFRASTRUCTURE

Multi-region EKS Platform

Active-active K8s for payment workloads. Helm-based service rollouts, Ansible node config, GitHub Actions for promotion. Built to absorb regional failure without blinking.

99.9% uptime · 0 infra-caused incidents · <300ms regional failover

EKS · Helm · Ansible · GitHub Actions · Route53
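
For a sense of scale, the 99.9% uptime figure quoted for this platform leaves a small downtime budget. The arithmetic below is generic SLO math, not taken from any Stripe tooling; the function name and the 30-day window are assumptions.

```python
def downtime_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_pct / 100.0)
```

At 99.9% over 30 days this works out to roughly 43 minutes of total allowed downtime.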

MACHINE LEARNING

GPU ML Infrastructure

SageMaker + EKS hybrid for fraud / risk model training and inference. Partnered with ML and Quant teams to size GPUs to actual workload shape, not vibes.

↓ 20% inference latency · ↓ 38% idle GPU spend · cost-aware A100 scheduling

SageMaker · EKS · A100 · Karpenter

SECURITY

CI/CD Security Gates

Trivy image scanning + OPA policy-as-code + GitHub Actions. Catches CVEs and policy violations pre-merge — non-compliant deploys never reach prod.

100% scan coverage · SOC 2 / PCI-DSS enforced pre-merge · ~0 audit findings

Trivy · OPA · GitHub Actions · Cosign
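
A minimal sketch of what a scan gate of this kind can look like, assuming a report produced by `trivy image --format json`, whose top level carries a `Results` list with per-target `Vulnerabilities`. The blocking severities, function names, and output wording are illustrative assumptions, not the pipeline described above; the OPA policy step is not shown. Trivy can also enforce a threshold itself via its `--severity` and `--exit-code` flags.

```python
import json

BLOCKING = {"HIGH", "CRITICAL"}  # assumed policy threshold


def blocking_findings(report: dict, blocking=BLOCKING) -> list:
    """Return (target, vulnerability id, severity) for each blocking finding."""
    findings = []
    for result in report.get("Results") or []:
        # Clean targets may omit "Vulnerabilities" or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                findings.append((result.get("Target"),
                                 vuln.get("VulnerabilityID"),
                                 vuln.get("Severity")))
    return findings


def gate(report_json: str) -> int:
    """Exit code for a CI step: 0 passes, 1 blocks the deploy."""
    findings = blocking_findings(json.loads(report_json))
    for target, vuln_id, severity in findings:
        print(f"BLOCKED {severity} {vuln_id} in {target}")
    return 1 if findings else 0
```

In a GitHub Actions job, a step could pass the report text to `gate` and hand the result to `sys.exit`, so a non-zero code fails the workflow before promotion.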

RELIABILITY · ML-OPS

Predictive Failure Detection

scikit-learn + TensorFlow models on Prometheus telemetry. Flags pre-incident anomalies — node thrash, latency drift, GC pressure — before they page on-call.

↓ 30% incident response · 8–14 min early warning · ↓ 22% P1 incidents

Python · scikit-learn · TensorFlow · Prometheus
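
To make the idea of flagging pre-incident drift concrete, here is a deliberately simple stand-in: a rolling z-score over a latency series, using only the standard library. It is not the scikit-learn / TensorFlow models described above; the function name, window size, and threshold are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_alerts(samples, window=5, threshold=3.0):
    """Indices of samples that deviate sharply from the preceding window."""
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(samples):
        # Only score once a full window of earlier samples is available.
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # A flat window has zero spread, so no z-score can be computed.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                alerts.append(i)
        history.append(value)
    return alerts
```

Fed p99 latency samples such as `[40, 41, 39, 40, 42, 41, 40, 95, 41]`, the sketch flags only the spike at index 7. A production detector would also need seasonality handling and alert deduplication, which this omits.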

FINOPS · OBSERVABILITY

FinOps + Observability

Unified Prometheus / Grafana dashboards mapping cost and reliability to service / team / SLO. Engineers see what their deploys cost — that changes behavior fast.

$2.4M annualized savings · adopted across 14 teams · ↓ 25% MTTR

Prometheus · Grafana · AWS CUR · Athena

DATABASES

HA Postgres + Redis

Tuning RDS Postgres + ElastiCache Redis for payment workloads. Automated backups, replica routing, connection pooling — the unsexy work that keeps p99 honest.

p99 < 50ms sustained · automated backups · connection pooling · perf-tuned

RDS · ElastiCache · PgBouncer · Redis

/ 09 — OPEN CHANNEL

Got a gnarly infra problem?
Let's talk.

Best for: payments-grade SRE, multi-cloud platform engineering, FinOps audits, ML-ops infrastructure, and the occasional "why is our cluster on fire" consult.

abhisheksoppannadevops@gmail.com

resume.pdf — download
