Intermediate · Technical
Given a RESTful service that is experiencing increased latency under load, outline the step-by-step approach you would take to diagnose and fix the issue, including specific tools, metrics, and code or infrastructure changes you might consider.
Software Engineer
General

Sample Answer

At my last role we saw p95 latency jump from 120ms to 800ms as traffic doubled to ~50k RPS. My first step is gathering data: p50/p95/p99 latencies, error rates, GC pauses, thread pool saturation, DB slow queries, and network stats, using Prometheus/Grafana for metrics, Jaeger for distributed traces, and async-profiler flamegraphs for CPU hotspots. I then reproduce the load in staging with k6 and use the traces to isolate the hotspot; it is usually the database or synchronous downstream calls. If the database is the culprit, I add indexes, rewrite slow queries, introduce read replicas, or put a Redis cache in front of hot read paths to offload the primary. If the app is CPU-bound, I optimize hot code paths, batch work, or move blocking calls to async IO; if we're infrastructure-limited, I autoscale or upgrade instance types. I ship changes incrementally behind feature flags, watch 10-minute rolling windows, and roll back if p99 degrades more than 10%. Within three weeks these steps cut p99 from 1.2s to 180ms for us.
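As one concrete illustration of the caching step, here is a minimal read-through cache sketch in Java. It assumes the Jedis client (4+) and a local Redis instance; the ProductCache class, the key format, the TTL value, and the loadFromDatabase() placeholder are illustrative assumptions rather than code from the actual service.

// Read-through cache sketch for a hot read path. Assumes Jedis 4+ and Redis on localhost;
// loadFromDatabase() stands in for the existing slow query.
import redis.clients.jedis.JedisPooled;

public class ProductCache {
    private static final int TTL_SECONDS = 300;                // short TTL keeps cached data reasonably fresh
    private final JedisPooled redis = new JedisPooled("localhost", 6379);

    public String getProductJson(String productId) {
        String key = "product:" + productId;

        // 1. Try the cache first; a hit skips the database round trip entirely.
        String cached = redis.get(key);
        if (cached != null) {
            return cached;
        }

        // 2. Cache miss: fall back to the database, then populate the cache
        //    with a TTL so stale entries age out on their own.
        String fresh = loadFromDatabase(productId);
        redis.setex(key, TTL_SECONDS, fresh);
        return fresh;
    }

    // Hypothetical placeholder for the existing slow query path.
    private String loadFromDatabase(String productId) {
        return "{\"id\":\"" + productId + "\"}";
    }
}

The same pattern applies to any hot read path; the main design choice is the TTL, which trades data freshness against how much read load is shed from the primary database.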

Keywords

Collecting metrics and traces (Prometheus/Grafana, Jaeger)
Reproducing load in staging and isolating hotspots
Targeted fixes: DB tuning, caching, async processing, autoscaling
Incremental rollout with feature flags and rollback thresholds