Sample Answer
When a service starts timing out, I treat it like a production incident and work through a clear hypothesis loop.

First, I'd check high-level metrics in something like Grafana: latency, error rate, CPU, memory, and saturation (thread pools, DB connections) across instances. If p95 latency spikes while CPU stays low, that usually points to an external dependency. I'd then look at distributed traces (Jaeger/Tempo) to see where the time is actually spent; in one incident, traces showed roughly 70% of request time blocked on a misconfigured DB connection pool. In parallel, I'd query (or enable, if missing) structured logs filtered by correlation ID for the timed-out requests to confirm patterns: specific endpoints, tenants, or queries. From there, I'd narrow the cause down to slow DB queries (EXPLAIN, index checks), thread pool exhaustion, or downstream timeouts.

In that connection pool incident, once we had identified the issue we right-sized the pool, added backpressure, and introduced a 300 ms timeout with bounded retries. That brought p95 from 4 s back down to 180 ms and kept the timeout rate under 0.2%.
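For illustration, here is a minimal sketch of the kind of fix described above, assuming a Java service using HikariCP; the class name, JDBC URL parameter, pool size, and retry count are hypothetical placeholders, not values from the actual incident:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.SQLException;

public class OrdersDb {

    private final HikariDataSource pool;

    public OrdersDb(String jdbcUrl) {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(jdbcUrl);
        // Right-size the pool instead of letting it grow unbounded;
        // 20 is a placeholder, tuned against DB capacity and instance count.
        cfg.setMaximumPoolSize(20);
        // Fail fast: wait at most 300 ms for a connection instead of queueing indefinitely.
        cfg.setConnectionTimeout(300);
        this.pool = new HikariDataSource(cfg);
    }

    /** Borrow a connection with a small, bounded number of retries. */
    public Connection connectionWithRetry() throws SQLException {
        SQLException last = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                return pool.getConnection();
            } catch (SQLException e) {
                last = e; // pool exhausted or DB unreachable; retry a couple of times
            }
        }
        throw last;
    }
}
```

The key ideas are the same regardless of stack: cap how long a request waits for a connection, keep retries bounded so they don't amplify load on a struggling dependency, and size the pool against measured database capacity.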