Sample Answer
When a service starts timing out, I treat it like a production incident and work through a clear hypothesis loop.

First, I'd check high-level metrics in something like Grafana: latency, error rate, CPU, memory, and saturation (thread pools, DB connections) across instances. If p95 latency spikes while CPU stays low, that usually points to an external dependency. I'd then look at distributed traces (Jaeger/Tempo) to see where the time is actually spent; in one incident, traces showed roughly 70% of request time blocked on a misconfigured DB connection pool. In parallel, I'd query (or enable, if missing) structured logs filtered by correlation ID for the timed-out requests to confirm patterns: specific endpoints, tenants, or queries. From there, I'd narrow the cause down to slow DB queries (EXPLAIN, index checks), thread pool exhaustion, or downstream timeouts.

In that connection pool incident, once we had identified the issue we right-sized the pool, added backpressure, and introduced a 300 ms timeout with bounded retries. That brought p95 from 4 s back down to 180 ms and kept the timeout rate under 0.2%.
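For illustration, here is a minimal sketch of the kind of fix described above, assuming a Java service using HikariCP; the class name, JDBC URL parameter, pool size, and retry count are hypothetical placeholders, not values from the actual incident:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.SQLException;

public class OrdersDb {

    private final HikariDataSource pool;

    public OrdersDb(String jdbcUrl) {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(jdbcUrl);
        // Right-size the pool instead of letting it grow unbounded;
        // 20 is a placeholder, tuned against DB capacity and instance count.
        cfg.setMaximumPoolSize(20);
        // Fail fast: wait at most 300 ms for a connection instead of queueing indefinitely.
        cfg.setConnectionTimeout(300);
        this.pool = new HikariDataSource(cfg);
    }

    /** Borrow a connection with a small, bounded number of retries. */
    public Connection connectionWithRetry() throws SQLException {
        SQLException last = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                return pool.getConnection();
            } catch (SQLException e) {
                last = e; // pool exhausted or DB unreachable; retry a couple of times
            }
        }
        throw last;
    }
}
```

The key ideas are the same regardless of stack: cap how long a request waits for a connection, keep retries bounded so they don't amplify load on a struggling dependency, and size the pool against measured database capacity.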