Intermediate · Technical
Given a RESTful service that is experiencing increased latency under load, outline the step-by-step approach you would take to diagnose and fix the issue, including specific tools, metrics, and code or infrastructure changes you might consider.
Software Engineer
General

Sample Answer

At my last role we saw p95 latency jump from 120ms to 800ms as traffic doubled to ~50k RPS. My first step is gathering data: p50/p95/p99 latencies, error rates, GC pauses, thread pool saturation, DB slow queries, and network stats, using Prometheus/Grafana for metrics, Jaeger for distributed traces, and async-profiler flamegraphs for CPU hotspots. I then reproduce the load in staging with k6 and use the traces to isolate the hotspot; it is usually the database or synchronous downstream calls. If the database is the culprit, I add indexes, rewrite slow queries, introduce read replicas, or put a Redis cache in front of hot read paths to offload the primary. If the app is CPU-bound, I optimize hot code paths, batch work, or move blocking calls to async IO; if we're infrastructure-limited, I autoscale or upgrade instance types. I ship changes incrementally behind feature flags, watch 10-minute rolling windows, and roll back if p99 degrades more than 10%. Within three weeks these steps cut p99 from 1.2s to 180ms for us.
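As one concrete illustration of the caching step, here is a minimal read-through cache sketch in Java. It assumes the Jedis client (4+) and a local Redis instance; the ProductCache class, the key format, the TTL value, and the loadFromDatabase() placeholder are illustrative assumptions rather than code from the actual service.

// Read-through cache sketch for a hot read path. Assumes Jedis 4+ and Redis on localhost;
// loadFromDatabase() stands in for the existing slow query.
import redis.clients.jedis.JedisPooled;

public class ProductCache {
    private static final int TTL_SECONDS = 300;                // short TTL keeps cached data reasonably fresh
    private final JedisPooled redis = new JedisPooled("localhost", 6379);

    public String getProductJson(String productId) {
        String key = "product:" + productId;

        // 1. Try the cache first; a hit skips the database round trip entirely.
        String cached = redis.get(key);
        if (cached != null) {
            return cached;
        }

        // 2. Cache miss: fall back to the database, then populate the cache
        //    with a TTL so stale entries age out on their own.
        String fresh = loadFromDatabase(productId);
        redis.setex(key, TTL_SECONDS, fresh);
        return fresh;
    }

    // Hypothetical placeholder for the existing slow query path.
    private String loadFromDatabase(String productId) {
        return "{\"id\":\"" + productId + "\"}";
    }
}

The same pattern applies to any hot read path; the main design choice is the TTL, which trades data freshness against how much read load is shed from the primary database.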

Keywords

Collecting metrics and traces (Prometheus/Grafana, Jaeger)
Reproducing load in staging and isolating hotspots
Targeted fixes: DB tuning, caching, async processing, autoscaling
Incremental rollout with feature flags and rollback thresholds