abhishek.sh · SYS_OK · UPTIME: 99.9% · P99: 42ms
LIVE · OPERATIONAL 📍 NEW JERSEY ⚡ STRIPE / DEVOPS 5+ YRS · STRIPE · DBX · PAYTM

Abhishek Soppanna.

I build the boring, invisible plumbing that keeps payments moving — multi-region Kubernetes, GPU ML infra, and FinOps tooling at Stripe.

Five years of DevOps / SRE across Stripe, Databricks, and Paytm — payments rails, multi-region Kubernetes, IaC, observability, FinOps. The site you're on is a working dashboard, not a brochure: scale the cluster, run a deploy, query the terminal. Click anything that looks clickable.

PROD UPTIME
99.9%
payments-grade SLO held
MTTR
↓25%
obs + automation · Stripe
CLOUD SAVED
$2.4M
FinOps · annualized run-rate
INFRA INCIDENTS
0
caused · during +15% txn growth
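
For readers who want the arithmetic behind a 99.9% uptime figure, here is a minimal sketch of an error-budget calculation. The 30-day window and the function name are assumptions for illustration; the page does not say which window the SLO is measured over.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO permits.

    window_days=30 is an assumed measurement window, not a figure from this page.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)
    print(f"{budget:.1f} minutes of downtime allowed per 30 days")
```

Under those assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is why automated recovery matters more than heroic manual response.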

Watch $5 travel through 9 systems in ~140ms.

This is what payments infrastructure does in the moment between "Pay" and "Paid." Type any amount, hit trace. Hover any node for the why. Inject a failure to see how it survives.

$ interactive · canonical Stripe-style flow
TOTAL LATENCY 0ms
awaiting trace · click ▶ TRACE to begin · or hit one of the chaos buttons below

Three dimensions. One impossible job.

Speed, reliability, cost. You don't get all three. Drag the dot. Watch what you'd give up.

⚡ SPEED 🛡 RELIABILITY 💰 COST
P99 LATENCY
142ms
UPTIME
99.85%
COST / MO
$112k
INFRA GRADE
A-
You're balanced for payments at scale — reliability-leaning with disciplined cost and a healthy speed budget. This is what shipping money looks like.

What happens when one machine dies?

Click the button. Watch a node fail, traffic re-route, replica promote, alerts settle — under five seconds. No human in the loop.

ALB · multi-AZ
api-pod-7d9 · us-east-1a
api-pod-x4f · us-east-1b
api-pod-2a8 · us-east-1c
RDS primary
📟 us-east-1 · payments-prod
cluster nominal · 3/3 pods healthy · awaiting chaos
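
The re-routing step of that demo can be sketched as a round-robin router that skips pods marked unhealthy. This is an illustration only: the `Router` class and its methods are hypothetical names, the pod names are borrowed from the diagram above, and replica promotion and alerting are left out.

```python
class Router:
    """Round-robin over pods, skipping any that failed a health check."""

    def __init__(self, pods):
        self.healthy = {pod: True for pod in pods}
        self._next = 0  # index of the next pod to try

    def mark_down(self, pod):
        self.healthy[pod] = False

    def mark_up(self, pod):
        self.healthy[pod] = True

    def route(self):
        pods = list(self.healthy)
        # Try each pod at most once per request before giving up.
        for _ in range(len(pods)):
            pod = pods[self._next % len(pods)]
            self._next += 1
            if self.healthy[pod]:
                return pod
        raise RuntimeError("no healthy pods available")


if __name__ == "__main__":
    router = Router(["api-pod-7d9", "api-pod-x4f", "api-pod-2a8"])
    router.mark_down("api-pod-7d9")  # simulate the node failure
    print([router.route() for _ in range(4)])
```

Once a pod is marked down, requests only ever land on the remaining two, with no operator action required.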

Turn the knob. Watch it breathe.

One physical knob controls incoming traffic, from 100 RPS to 100K RPS. Pods auto-scale. Latency creeps. Costs climb. At saturation, watch graceful degradation, not a fall-over.

LOAD 2.5K rps
drag to turn · click around the dial · ↑/↓ keys
PODS RUNNING
4
P99 LATENCY
38ms
COST / HR
$0.36
ERRORS / MIN
0
NOMINAL cluster cruising · pods at 8% utilization
→ HPA monitoring · target: 70% CPU · scale-up cooldown 30s
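
As a rough guide to what an HPA with a 70% CPU target does, the sketch below applies the replica formula from the Kubernetes documentation, desired = ceil(current × currentMetric / targetMetric). The min/max bounds are assumed values, and the HPA's tolerance band and stabilization windows are deliberately omitted.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float = 70.0,
    min_replicas: int = 1,    # assumed bound, not from this page
    max_replicas: int = 100,  # assumed bound, not from this page
) -> int:
    """Core HPA scaling rule, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, 95.0))  # pods running hot
    print(desired_replicas(4, 8.0))   # pods mostly idle
```

In a real cluster the configured minimum replica count usually decides the floor, so an idle deployment does not shrink as far as the raw formula suggests.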

Ask me anything in zsh.

Real command parser. Try whoami · stripe · kubectl get pods. Use ↑/↓ for history, tab to autocomplete.

~/abhishek — zsh
try help
commands: help whoami skills kubectl deploy contact stripe clear
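
A minimal sketch of how a command parser with tab completion might work, assuming the command list shown above. The function names are hypothetical, and the site's actual parser is presumably written in JavaScript and may differ.

```python
import shlex

# Command names copied from the help line above.
COMMANDS = ["help", "whoami", "skills", "kubectl", "deploy", "contact", "stripe", "clear"]


def complete(prefix: str) -> list[str]:
    """Return every known command that starts with the typed prefix."""
    return [name for name in COMMANDS if name.startswith(prefix)]


def parse(line: str) -> tuple[str, list[str]]:
    """Split an input line into (command, arguments), rejecting unknown commands."""
    tokens = shlex.split(line)
    if not tokens:
        return ("", [])
    name, args = tokens[0], tokens[1:]
    if name not in COMMANDS:
        raise ValueError(f"command not found: {name}")
    return (name, args)


if __name__ == "__main__":
    print(parse("kubectl get pods"))
    print(complete("s"))
```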

Push to prod, watch it ship.

Click "Run pipeline" to simulate a real deploy: tests, build, security scan, staging, prod rollout. Same flow I build for engineering teams.

deploy-payments-api · #4287
trigger: git push origin main
01
Test
queued
02
Build
queued
03
Scan
queued
04
Staging
queued
05
Prod
queued
elapsed: 0:00
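
The stage ordering above can be sketched as a runner that executes each stage in turn and stops promoting at the first failure. The `run_pipeline` function and the status strings "passed", "failed", and "skipped" are assumptions for illustration; only the stage names and "queued" come from the page.

```python
from typing import Callable

# Stage names in the order shown in the pipeline widget.
STAGES = ["Test", "Build", "Scan", "Staging", "Prod"]


def run_pipeline(steps: dict[str, Callable[[], bool]]) -> dict[str, str]:
    """Run stages in order; after the first failure, later stages are skipped."""
    status = {name: "queued" for name in STAGES}
    failed = False
    for name in STAGES:
        if failed:
            status[name] = "skipped"
            continue
        ok = steps[name]()
        status[name] = "passed" if ok else "failed"
        failed = not ok
    return status


if __name__ == "__main__":
    steps = {name: (lambda: True) for name in STAGES}
    steps["Scan"] = lambda: False  # simulate a failed security scan
    print(run_pipeline(steps))
```

A failed scan here means Staging and Prod never run, which is the property that matters for a payments deploy.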

Where I've shipped.

Three companies, one throughline: making payments infrastructure boring. The numbers below are what I shipped, not what I aspire to.

Stripe.

SEP 2024 — PRESENT
Senior DevOps / Cloud Engineer · California
99.9%
Uptime · payments SLO
↓25%
MTTR · obs + automation
$2.4M
Annualized cloud savings
  • Designed and operate a multi-region AWS EKS platform with Helm, GitHub Actions, and Ansible — 99.9% uptime through 15% transaction volume growth, zero infrastructure-caused incidents.
  • Architected AWS VPC networking for K8s workloads across multi-AZ environments — private subnets, ingress, security groups — cutting environment-related deployment issues by 20%.
  • Integrated Trivy + Open Policy Agent into GitHub Actions, enforcing SOC 2 / PCI-DSS and blocking non-compliant deployments pre-prod.
  • Partnered with ML and Quant teams on GPU infra (SageMaker + EKS) — reduced model inference latency by 20%.
  • Built observability + FinOps dashboards in Prometheus / Grafana, cutting MTTR by 25% and surfacing cost-optimization opportunities across the platform.
  • Shipped Python predictive failure detection (scikit-learn / TensorFlow) on telemetry — cut incident response time by 30%.
  • Tuned HA Postgres (RDS) + Redis (ElastiCache) with automated backups, perf tuning, and connection pooling for payment-system data integrity.

Databricks.

NOV 2023 — AUG 2024
DevOps Engineer · California
  • Engineered Azure IaC with Terraform + Bicep across AKS, App Services, and VNets — multi-team provisioning that went from days to minutes.
  • Built scalable Azure DevOps YAML pipelines for microservices, with blue/green rollouts and automated rollback on canary regression. Release frequency up 15%.
  • Containerized distributed apps on Docker + AKS for portability and environment consistency across staging and prod.
  • Stood up Azure Monitor + Log Analytics for cross-service tracing and proactive alerting — cut MTTR by 20%.
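
The "automated rollback on canary regression" idea in the list above can be sketched as a comparison of canary and baseline error rates. Every threshold here (the 2x ratio, the 0.1% floor, the 100-request minimum) is an assumed value for illustration, not something stated on this page.

```python
def should_roll_back(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 2.0,    # assumed: canary may err at most 2x baseline
    min_requests: int = 100,   # assumed: below this, too little data to judge
    floor_rate: float = 0.001, # assumed: always tolerate up to 0.1% errors
) -> bool:
    """Return True when the canary's error rate regresses past the allowance."""
    if canary_requests < min_requests:
        return False
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, floor_rate)
    return canary_rate > allowed


if __name__ == "__main__":
    print(should_roll_back(10, 10_000, 30, 1_000))  # canary at 3% vs 0.1% baseline
```

The minimum-request guard keeps a single early error from aborting a rollout before the canary has seen meaningful traffic.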

Paytm.

JUN 2019 — JUL 2022
DevOps Engineer · India
  • Led SRE for the UPI payment gateway, peaking at 30M+ txns/day. Deployed Prometheus + PagerDuty, drove SLO/SLI practices, cut incident resolution time by 15%.
  • Built CI/CD with Jenkins, GitLab CI, Argo CD, Docker — daily deploys, release reliability up 25%.
  • Designed automated disaster recovery with Terraform + CloudFormation — RTO 4hr → ~18min, validated quarterly.
  • Implemented PCI-DSS-compliant IAM, AWS KMS, and HashiCorp Vault to secure cardholder data and harden cloud security posture.

What I'm shipping right now.

Live work at Stripe — payment-grade reliability, ML-driven ops, infrastructure that pays for itself.

INFRASTRUCTURE

Multi-region EKS Platform

Active-active K8s for payment workloads. Helm-based service rollouts, Ansible node config, GitHub Actions for promotion. Built to absorb regional failure without blinking.

99.9% uptime · 0 infra-caused incidents · <300ms regional failover
EKS · Helm · Ansible · GitHub Actions · Route53
MACHINE LEARNING

GPU ML Infrastructure

SageMaker + EKS hybrid for fraud / risk model training and inference. Partnered with ML and Quant teams to size GPUs to actual workload shape, not vibes.

↓ 20% inference latency · ↓ 38% idle GPU spend · cost-aware A100 scheduling
SageMaker · EKS · A100 · Karpenter
SECURITY

CI/CD Security Gates

Trivy image scanning + OPA policy-as-code + GitHub Actions. Catches CVEs and policy violations pre-merge — non-compliant deploys never reach prod.

100% scan coverage · SOC 2 / PCI-DSS enforced pre-merge · ~0 audit findings
Trivy · OPA · GitHub Actions · Cosign
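
A gate of this kind can be sketched as a script that reads a scanner's JSON report and fails the build on blocking severities. The field names below (`Results`, `Vulnerabilities`, `Severity`, `VulnerabilityID`) follow Trivy's JSON report layout as I understand it and should be checked against the Trivy version in use. The blocking set and the sample IDs are made up for illustration.

```python
import json

# Assumed policy: these severities block a deploy.
BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report_json: str) -> list[str]:
    """Return the IDs of findings whose severity should block the deploy."""
    report = json.loads(report_json)
    found = []
    # Both keys can be missing or null when a target has no findings.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


if __name__ == "__main__":
    # Placeholder report; the IDs are not real CVEs.
    sample = json.dumps({
        "Results": [
            {"Target": "app:latest", "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
            ]},
            {"Target": "go.sum", "Vulnerabilities": None},
        ]
    })
    findings = blocking_findings(sample)
    print("blocked:" if findings else "clean", findings)
```

In CI the script would exit non-zero when the list is non-empty, which is what stops the merge.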
RELIABILITY · ML-OPS

Predictive Failure Detection

scikit-learn + TensorFlow models on Prometheus telemetry. Flags pre-incident anomalies — node thrash, latency drift, GC pressure — before they page on-call.

↓ 30% incident response · 8–14 min early warning · ↓ 22% P1 incidents
Python · scikit-learn · TensorFlow · Prometheus
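
To show the general shape of telemetry anomaly flagging, here is a deliberately simple stand-in: a rolling z-score using only the standard library. It is not the scikit-learn / TensorFlow approach described above, and the window size, threshold, and sample latencies are all assumptions.

```python
from statistics import mean, stdev


def anomalies(series: list[float], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indexes whose value sits more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: a z-score is undefined, so skip
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up p99 latency samples in milliseconds, with one spike.
    latency_ms = [40, 42, 41, 39, 40, 41, 120, 40]
    print(anomalies(latency_ms))
```

A trained model can pick up slower drifts that a fixed-window z-score misses; the sketch only covers sudden spikes.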
FINOPS · OBSERVABILITY

FinOps + Observability

Unified Prometheus / Grafana dashboards mapping cost and reliability to service / team / SLO. Engineers see what their deploys cost — that changes behavior fast.

$2.4M annualized savings · adopted across 14 teams · ↓ 25% MTTR
Prometheus · Grafana · AWS CUR · Athena
DATABASES

HA Postgres + Redis

Tuning RDS Postgres + ElastiCache Redis for payment workloads. Automated backups, replica routing, connection pooling — the unsexy work that keeps p99 honest.

p99 < 50ms sustained · automated backups · connection pooling · perf-tuned
RDS · ElastiCache · PgBouncer · Redis
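
Connection pooling, in its simplest form, caps how many connections exist and makes callers wait for a free one. The sketch below is an in-process illustration with a hypothetical `Pool` class and a stand-in `connect` factory. PgBouncer does the equivalent job as a separate proxy in front of Postgres.

```python
import queue
from contextlib import contextmanager


class Pool:
    """Fixed-size pool: connections are created up front and reused."""

    def __init__(self, connect, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


if __name__ == "__main__":
    pool = Pool(connect=object, size=2)  # object() stands in for a real DB connection
    with pool.connection() as conn:
        print("borrowed", conn)
```

Bounding the pool is what protects the database: under a traffic spike, callers queue at the pool instead of opening thousands of new connections.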

Got a gnarly infra problem?
Let's talk.

Best for: payments-grade SRE, multi-cloud platform engineering, FinOps audits, ML-ops infrastructure, and the occasional "why is our cluster on fire" consult.