Intermediate / Situational
Given a REST API that is experiencing high latency under load, outline the step-by-step diagnostic approach you would take (including tools and metrics), and list at least three possible backend-level fixes prioritized by impact and risk.
Backend Developer
General

Sample Answer

When I first saw a REST API with 95th-percentile latency spiking from 200ms to 2s under 5k RPS, I started with data. I pulled APM traces (Datadog/New Relic), enabled distributed tracing (OpenTelemetry), and looked at CPU, memory, GC behavior, thread dumps, and p99 latency broken down by endpoint. I correlated that with database slow-query logs and EXPLAIN plans, checked cache hit rates (Redis), and reviewed network and load-balancer metrics.

From there I prioritized backend fixes by impact and risk: 1) increase cache TTLs and materialize heavy responses (low risk, high impact), 2) add DB read replicas and route read-heavy endpoints to them (moderate risk), 3) move non-critical work onto async processing via a message queue (higher effort). I also throttled abusive clients and rolled each change out behind a canary to validate the improvement, shaving p99 latency from 2s to ~250ms within a week.
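The first fix above (caching with a TTL) follows the cache-aside pattern. A minimal sketch, using an in-memory dict as a stand-in for Redis so it is self-contained; `get_report` and `load_from_db` are illustrative names, and in production the same pattern maps to Redis `GET`/`SETEX`:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-key TTL (stand-in for Redis)."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy eviction on read
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def get_report(cache, report_id, load_from_db, ttl_seconds=300):
    """Cache-aside: serve from cache; on a miss, hit the DB and repopulate."""
    key = f"report:{report_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(report_id)   # the slow path we want to avoid
    cache.set(key, value, ttl_seconds)
    return value
```

Raising `ttl_seconds` for heavy, slowly-changing responses trades a little staleness for a much higher hit rate, which is why it ranks as low risk, high impact.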

Keywords

- Use APM/tracing and metrics-first diagnosis (p99, CPU, GC, DB slow queries)
- Prioritize fixes: caching, read scaling, async work
- Validate with canary rollouts and measure impact
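The "async work" fix in the answer means the request handler enqueues non-critical work and returns immediately, while a background worker drains the queue. A minimal sketch, using the standard library's `queue.Queue` and a thread as stand-ins for a real broker (RabbitMQ/SQS) and worker fleet; all names here are illustrative:

```python
import queue
import threading

jobs = queue.Queue()
results = []  # stand-in for the deferred side effect (emails, analytics, ...)

def worker():
    """Background consumer: drains the queue until it sees the None sentinel."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut the worker down
            jobs.task_done()
            break
        results.append(f"processed:{job}")
        jobs.task_done()

def handle_request(payload):
    """Request path: enqueue the slow work and respond right away."""
    jobs.put(payload)
    return {"status": "accepted"}
```

The request path now does O(1) work regardless of how slow the deferred task is, which is why this fix helps tail latency even though it takes more effort to roll out.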