First I profile the data (missingness matrix, histograms, correlation heatmap) to quantify the problems: for example, columns with >40% missing values, 1–3% extreme values, and several feature pairs with |r| > 0.8. For missingness, I drop features that are more than 40% missing. For numeric fields with <5% missing I use simple imputation (median if the distribution is skewed, mean otherwise), and for 5–40% missing, where relationships between features matter, I use KNN or MICE. For outliers, I flag them with the IQR rule and z-scores (|z| > 3), then either winsorize at the 1st/99th percentiles or log-transform when that reduces skew. For multicollinearity, I drop one feature from each pair above the 0.8 threshold or use VIF > 5 to guide removals; if that is not enough I apply Lasso or PCA, preferring Lasso for interpretability. I split 80/20 first, then fit the imputers and RobustScaler on the training set only to avoid leakage, and run 5-fold CV on it. For stakeholders, I show the trade-offs with before/after metrics (R^2, RMSE, VIF) and time estimates (typically 1–2 days for 100k rows) so they can see the gains in stability and interpretability.
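As a rough illustration of the cleaning steps in this answer, here is a minimal pandas sketch. The helper names (`drop_sparse_columns`, `impute_median`, `winsorize`, `drop_correlated`) and the synthetic `demo` frame are hypothetical and exist only for this example; the thresholds are the ones quoted in the answer. It covers only the simplest branch of each step (median imputation rather than KNN or MICE, pairwise correlation rather than VIF) and, for brevity, computes everything on the full frame. In a real project the medians and percentile cut-offs would be estimated on the training split only.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.40):
    """Drop columns whose share of missing values exceeds max_missing."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= max_missing]


def impute_median(df):
    """Fill gaps in numeric columns with each column's median."""
    numeric = df.select_dtypes(include="number").columns
    out = df.copy()
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out


def winsorize(df, lower=0.01, upper=0.99):
    """Clip each numeric column to its own lower/upper percentiles."""
    numeric = df.select_dtypes(include="number").columns
    out = df.copy()
    lo = out[numeric].quantile(lower)
    hi = out[numeric].quantile(upper)
    out[numeric] = out[numeric].clip(lower=lo, upper=hi, axis=1)
    return out


def drop_correlated(df, threshold=0.8):
    """For each pair with |r| > threshold, drop the later of the two columns."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Synthetic data (an assumption for the demo, not real project data).
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
demo = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 * 2.0 + rng.normal(scale=0.01, size=n),  # near-duplicate of x1
    "x2": rng.normal(size=n),
    "mostly_missing": np.where(rng.random(n) < 0.6, np.nan, 1.0),
})
demo.loc[0, "x2"] = 1e6         # one extreme value
demo.loc[1:20, "x2"] = np.nan   # about 2% missing

clean = (
    demo.pipe(drop_sparse_columns)
        .pipe(impute_median)
        .pipe(winsorize)
        .pipe(drop_correlated)
)
print(list(clean.columns))
```

Dropping the later column of each correlated pair is an arbitrary tie-break; in practice the choice would more likely depend on which feature is easier to explain or cheaper to collect.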