# PySpark Data Skew Handling – Complete Guide
## 1. Problem Statement: Skewed Aggregation

```python
df.groupBy("user_id").count()
```

**Issue:** one `user_id` contains ~40% of the total data. Spark sends every row with the same key to the same partition, so:

- one task becomes extremely heavy,
- the other tasks finish early,
- the straggler task slows down the whole job.

## 2. Why Skew Happens

Spark distributes data based on keys:

- `user_id = A` → goes to one partition
- `user_id = B` → goes to another partition

If A accounts for 40% of the data, the partition holding A becomes huge and turns into a bottleneck.

## 3. How to Identify Skew (Practical Approach)

### Method 1: Distribution Check

```python
from pyspark.sql.functions import count

df.groupBy("user_id") \
  .agg(count("*").alias("cnt")) \
  .orderBy("cnt", ascending=False) \
  .show(10)
```

Example output:

| user_id | cnt        | note   |
|---------|------------|--------|
| A       | 40,000,000 | ← skew |
| B       | 1,000      |        |
| C       | 900        |        |

### Method 2: Percentile Analysis

```python
df.groupBy("user_id") \
  .count() \
  .selectExpr( ...
```
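To see concretely why one hot key overloads a single partition, here is a small pure-Python simulation of hash partitioning (a sketch, not Spark itself — Python's built-in `hash()` stands in for Spark's Murmur3-based `HashPartitioner`, and the 40%-hot key `"A"` mirrors the example above):

```python
from collections import Counter

NUM_PARTITIONS = 8

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Same idea as Spark's default HashPartitioner: hash(key) % numPartitions.
    # Every row with the same key lands on the same partition.
    return hash(key) % num_partitions

# 40% of rows share the hot key "A"; the rest are spread over unique keys.
rows = ["A"] * 40_000 + [f"user_{i}" for i in range(60_000)]

load = Counter(partition_for(k) for k in rows)

hot = partition_for("A")
print(f"hot partition {hot} holds {load[hot]} rows; "
      f"even distribution would be ~{len(rows) // NUM_PARTITIONS} per partition")
```

The hot partition receives at least the 40,000 rows for `"A"` (plus its share of the other keys), several times the ~12,500 rows an even split would give, which is exactly the straggler effect described above.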
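The percentile snippet above is truncated in the source, so the exact expression is unknown. The heuristic such a check typically computes — compare a high percentile of the per-key counts against the median — can be sketched in pure Python (the per-key counts below are hypothetical, matching the example table; in Spark you would compute the same statistic over the `cnt` column, e.g. with `percentile_approx`):

```python
import statistics

# Hypothetical per-user_id row counts: one dominant key, many small ones.
counts = [40_000_000, 1_000, 900] + [800] * 97

qs = statistics.quantiles(counts, n=100)  # cut points for percentiles 1..99
median, p99 = qs[49], qs[98]

ratio = p99 / median
print(f"median={median:.0f}, p99={p99:.0f}, p99/median={ratio:.1f}")
```

When the 99th percentile is orders of magnitude above the median, as here, a handful of keys dominate the data and the aggregation is skewed; a ratio close to 1 means the keys are roughly balanced.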