I'd first profile data sizes and join-key cardinality. If one side is small (under roughly 100–200 MB compressed), I'd force a broadcast join with a broadcast hint; that eliminated the shuffle in a past job and cut runtime from 2 hours to 18 minutes.

If both sides are large, I'd repartition both datasets by the join key (hash or repartitionByRange) and enable adaptive query execution (spark.sql.adaptive.enabled=true) so Spark can switch join strategies at runtime and split skewed partitions. For residual skew I've salted the join key and pre-aggregated before the join where possible.

On the configuration side, I tune spark.sql.shuffle.partitions to roughly 2–3x the total executor cores, increase executor memory or enable off-heap storage, and adjust spark.memory.fraction to reduce OOMs. On the data side, I store inputs as columnar Parquet, prune unused columns before the join, and bucket by the join key when the same join runs routinely. Together, these changes cut shuffle size by about 75% and eliminated OOMs in my last project.
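The salting idea above can be sketched in plain Python (Spark-free, purely for illustration; `salted_join` and `SALT_BUCKETS` are hypothetical names, not a Spark API). The skewed side gets a random salt appended to its key, and the small side is replicated once per salt value so every salted key still finds its match; in Spark the same shape is expressed with an added salt column on each DataFrame before the join.

```python
import random
from collections import defaultdict

SALT_BUCKETS = 4  # hypothetical: number of salt values; tune to the degree of skew

def salted_join(skewed, small):
    """Join two lists of (key, value) pairs, spreading a hot key
    across SALT_BUCKETS buckets by appending a random salt."""
    # Skewed side: append a random salt to each key, so one hot key
    # becomes SALT_BUCKETS distinct keys.
    salted_left = [((k, random.randrange(SALT_BUCKETS)), v) for k, v in skewed]
    # Small side: replicate each row once per salt value so every
    # salted key on the left still finds its match.
    salted_right = [((k, s), v) for k, v in small for s in range(SALT_BUCKETS)]
    # Plain hash join on the salted key.
    index = defaultdict(list)
    for k, v in salted_right:
        index[k].append(v)
    return [(k[0], lv, rv) for (k, lv) in salted_left for rv in index[k]]

# A single hot key joined against a small dimension side still
# produces exactly one output row per left row.
rows = salted_join([("hot", i) for i in range(100)], [("hot", "dim")])
```

The trade-off is that the small side is replicated SALT_BUCKETS times, so this only pays off when the skewed side's hot keys dominate the shuffle; with AQE enabled, Spark's built-in skew-join handling can make manual salting unnecessary.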