Intermediate · Situational
You discover a critical memory leak in a service that manifests only under production load and is causing periodic OOM crashes. Walk me through how you would diagnose, mitigate immediately, and plan a permanent fix while minimizing customer impact.
Backend Developer
General

Sample Answer

If I see OOMs only under production load, I first stabilize the situation: put a temporary circuit breaker in front of the service, add autoscaling headroom, and, if possible, lower cache TTLs or disable non-critical features to reduce memory pressure.

In parallel, I'd enable detailed heap dumps on OOM, attach a profiler (such as async-profiler or YourKit) to a canary instance, and mirror a slice of production traffic to a staging environment to reproduce the pattern safely.

Once I identify the leak (for example, a growing in-memory map keyed by request IDs), I'd ship a minimal patch to stop the growth, guarded by feature flags, and monitor memory over a few hours of peak traffic.

After that immediate fix, I'd schedule a deeper refactor: add load tests that simulate 2–3x peak traffic, introduce memory budgets and alerts on RSS/heap growth rate, and run a postmortem so we reduce mean time to detection and avoid similar leaks in other services.
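The leak scenario in the answer (an in-memory map keyed by request IDs that never evicts) and the minimal bounding patch can be sketched in Python. The class name, entry cap, and LRU eviction policy are illustrative assumptions, not code from any specific service:

```python
from collections import OrderedDict


class BoundedRequestCache:
    """LRU-bounded replacement for an unbounded dict keyed by request IDs.

    The original bug pattern: entries are added per request but never
    removed, so memory grows until OOM. The minimal patch caps the size.
    """

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def put(self, request_id: str, value) -> None:
        if request_id in self._data:
            self._data.move_to_end(request_id)
        self._data[request_id] = value
        # The fix: evict the least-recently-used entries once the cap
        # is exceeded, so the map can no longer grow without bound.
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)

    def get(self, request_id: str, default=None):
        if request_id in self._data:
            self._data.move_to_end(request_id)
            return self._data[request_id]
        return default
```

In a real patch this swap would sit behind a feature flag, as the answer describes, so it can be rolled back instantly if eviction breaks a consumer that relied on entries persisting.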

Keywords

- Immediate stabilization with circuit breakers, autoscaling, and feature flags
- Use heap dumps and profilers plus traffic replay to reproduce the issue
- Ship a minimal patch, then monitor under real production load
- Plan long-term fixes: tests, alerts, and postmortem-driven improvements
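The "alerts on RSS/heap growth rate" idea can be sketched as a slope check over periodic memory samples: fit a line to recent (timestamp, RSS) readings and page when the growth rate exceeds a budget. The threshold and sampling scheme here are illustrative assumptions:

```python
def growth_rate_mb_per_min(samples):
    """Least-squares slope of memory usage over time.

    samples: list of (timestamp_seconds, rss_mb) tuples.
    Returns estimated growth in MB per minute.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    # Slope is MB/second; scale to MB/minute for a human-friendly alert.
    return (num / den) * 60 if den else 0.0


def should_alert(samples, threshold_mb_per_min: float = 5.0) -> bool:
    """Fire when sustained growth exceeds the memory budget."""
    return growth_rate_mb_per_min(samples) > threshold_mb_per_min
```

A slope over a window, rather than a single absolute threshold, catches a slow leak hours before it reaches the OOM limit, which is the mean-time-to-detection improvement the postmortem would aim for.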