Intermediate · Situational
You discover a critical memory leak in a service that manifests only under production load and is causing periodic OOM crashes. Walk me through how you would diagnose, mitigate immediately, and plan a permanent fix while minimizing customer impact.
Backend Developer
General

Sample Answer

If I see OOMs only under production load, I first stabilize the situation: put a temporary circuit breaker in front of the service, add autoscaling headroom, and, if possible, lower cache TTLs or disable non-critical features to reduce memory pressure.

In parallel, I'd enable detailed heap dumps on OOM, attach a profiler (such as async-profiler or YourKit) to a canary instance, and mirror a slice of production traffic to a staging environment to reproduce the pattern safely.

Once I identify the leak (for example, a growing in-memory map keyed by request IDs), I'd ship a minimal patch to stop the growth, guarded by feature flags, and monitor memory over a few hours of peak traffic.

After that immediate fix, I'd schedule a deeper refactor: add load tests that simulate 2–3x peak traffic, introduce memory budgets and alerts on RSS/heap growth rate, and run a postmortem so we reduce mean time to detection and avoid similar leaks in other services.
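The leak scenario in the answer (an in-memory map keyed by request IDs that never evicts) and the minimal bounding patch can be sketched in Python. The class name, entry cap, and LRU eviction policy are illustrative assumptions, not code from any specific service:

```python
from collections import OrderedDict


class BoundedRequestCache:
    """LRU-bounded replacement for an unbounded dict keyed by request IDs.

    The original bug pattern: entries are added per request but never
    removed, so memory grows until OOM. The minimal patch caps the size.
    """

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def put(self, request_id: str, value) -> None:
        if request_id in self._data:
            self._data.move_to_end(request_id)
        self._data[request_id] = value
        # The fix: evict the least-recently-used entries once the cap
        # is exceeded, so the map can no longer grow without bound.
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)

    def get(self, request_id: str, default=None):
        if request_id in self._data:
            self._data.move_to_end(request_id)
            return self._data[request_id]
        return default
```

In a real patch this swap would sit behind a feature flag, as the answer describes, so it can be rolled back instantly if eviction breaks a consumer that relied on entries persisting.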

Keywords

- Immediate stabilization with circuit breakers, autoscaling, and feature flags
- Use heap dumps and profilers plus traffic replay to reproduce the issue
- Ship a minimal patch, then monitor under real production load
- Plan long-term fixes: tests, alerts, and postmortem-driven improvements
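The "alerts on RSS/heap growth rate" idea can be sketched as a slope check over periodic memory samples: fit a line to recent (timestamp, RSS) readings and page when the growth rate exceeds a budget. The threshold and sampling scheme here are illustrative assumptions:

```python
def growth_rate_mb_per_min(samples):
    """Least-squares slope of memory usage over time.

    samples: list of (timestamp_seconds, rss_mb) tuples.
    Returns estimated growth in MB per minute.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    # Slope is MB/second; scale to MB/minute for a human-friendly alert.
    return (num / den) * 60 if den else 0.0


def should_alert(samples, threshold_mb_per_min: float = 5.0) -> bool:
    """Fire when sustained growth exceeds the memory budget."""
    return growth_rate_mb_per_min(samples) > threshold_mb_per_min
```

A slope over a window, rather than a single absolute threshold, catches a slow leak hours before it reaches the OOM limit, which is the mean-time-to-detection improvement the postmortem would aim for.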