First I quantify the change (for example, an 18% drop in DAU between midnight and 6am) and scope it by segment: platform, region, cohort, and app version. Hypotheses I’d test: ingestion/ETL failure, deployment bug, analytics query change, product issue, or external traffic drop. I run fast queries comparing raw event counts, table row counts per ETL stage, and error logs for the last 48 hours; check deployment and feature-flag logs, CDN/third-party-status pages, and payment or auth provider logs. I’d triangulate with independent sources (backend logs, SRE metrics) and run cohort-level A/B splits to isolate affected groups. If I reproduce the discrepancy in raw logs and find a recent deploy at 02:17 with increased error rate, I’d validate by replaying events or rolling back and measure recovery. I’d deliver an initial incident summary in 2 hours and a 24-hour RCA with remediation steps.
Get AI-powered feedback on your answer and improve your skills
Takes 5-10 minutes