I start by profiling the data to understand size and types — for example, a recent 120k-row file showed 12% exact duplicates and an 11.5% missing rate. I’d use Power Query or, for repeatable steps, Python (pandas) and SQL. First I remove exact duplicates, then apply fuzzy matching (fuzzywuzzy or dedupe) to catch near-duplicates; in a past project this cut the remaining duplicates by another 86%. Next I standardize formatting: trim whitespace, normalize case, parse dates with strict formats, and convert numeric types. For missing values I set explicit rules: drop rows that are more than 50% missing, impute the median for continuous fields and the mode for categorical ones, and always flag imputed values. Finally I validate with constraints, generate checksums and a data dictionary, and document the pipeline so it’s reproducible; on that project the whole cleanup took about three hours and brought missing data under 1%.
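A minimal pandas sketch of that workflow is below. The column names (name, email, signup_date, amount, segment), the input file, and the 90-point fuzzy threshold are illustrative assumptions, not details from the answer above; a real pipeline would also block records before the pairwise fuzzy scan rather than comparing every pair.

```python
# Sketch of the cleanup pipeline: profile, dedupe, standardize, impute, validate.
# Column names and thresholds are hypothetical placeholders.
import hashlib

import pandas as pd
from fuzzywuzzy import fuzz  # rapidfuzz.fuzz is a faster drop-in alternative

df = pd.read_csv("customers.csv")  # hypothetical input file

# 1. Profile: size, types, duplicate and missing rates.
print(df.shape)
print("exact duplicate rate:", df.duplicated().mean())
print("overall missing rate:", df.isna().mean().mean())

# 2. Remove exact duplicates.
df = df.drop_duplicates()

# 3. Flag near-duplicates by fuzzy-matching the name field (O(n^2) pairwise scan;
#    block on another key first for large files).
names = df["name"].fillna("").str.lower().tolist()
near_dupe = [False] * len(names)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if fuzz.token_sort_ratio(names[i], names[j]) >= 90:
            near_dupe[j] = True
df = df.loc[[not flag for flag in near_dupe]].copy()

# 4. Standardize formatting: trim whitespace, normalize case, strict date parsing,
#    numeric conversion (unparseable values become NaN/NaT and are handled below).
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# 5. Missing-value rules: drop rows >50% missing, median for continuous,
#    mode for categorical, and flag every imputed value.
df = df.dropna(thresh=(df.shape[1] + 1) // 2)
df["amount_imputed"] = df["amount"].isna()
df["amount"] = df["amount"].fillna(df["amount"].median())
df["segment_imputed"] = df["segment"].isna()
df["segment"] = df["segment"].fillna(df["segment"].mode().iloc[0])

# 6. Validate with constraints, then record a checksum so the output is verifiable.
assert df["amount"].ge(0).all(), "negative amounts found"
assert df["email"].str.contains("@").all(), "malformed emails found"
checksum = hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()
df.to_csv("customers_clean.csv", index=False)
print("rows:", len(df), "checksum:", checksum)
```

The same steps translate directly to Power Query or SQL; the point is that each rule (thresholds, imputation strategy, flag columns) is written down in code so the cleanup can be re-run and audited.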