First, I triage to confirm scope and impact. In one incident, I found that 18% of users were affected and SLAs were at risk. I pause noncritical sprint work and form a focused two-engineer incident team while the rest of the team keeps customer-facing tasks moving. We aim for an initial mitigation or rollback within one hour and a staged hotfix within three hours; in that incident we deployed the hotfix in 2.5 hours, cutting errors by 92%. I communicate on a fixed cadence: an initial incident message to PMs, support, and execs within 15 minutes, covering impact and ETA, then updates every 30–60 minutes. After stabilization, we run a postmortem, implement a root-cause fix (completed in four days in that case), add regression tests, and update runbooks. As a result, recurrence risk dropped and mean time to recovery improved from 4 hours to under 1 hour.