DE Interview: Pivot / Unpivot

 

A complete interview-ready guide for
🔹 PATTERN 13: Pivot / Unpivot
🎯 Core skill: Transform rows ↔ columns

Designed for engineers with 6–8 years of experience, with real interview questions and solutions in:

  • ✅ SQL Basic

  • ✅ SQL Advanced

  • ✅ PySpark Basic

  • ✅ PySpark Advanced / Optimized


🔹 Assumed Interview Dataset

sales

order_id | order_date | region | product | metric   | value
------------------------------------------------------------
1        | 2024-01-01 | East   | Laptop  | revenue  | 1000
2        | 2024-01-01 | East   | Laptop  | quantity | 2
3        | 2024-01-01 | West   | Mobile  | revenue  | 800

1️⃣ Pivot Product Rows into Columns (Total Revenue)

🟢 SQL – Basic (CASE WHEN)

SELECT
    region,
    SUM(CASE WHEN product = 'Laptop' THEN value END) AS laptop_revenue,
    SUM(CASE WHEN product = 'Mobile' THEN value END) AS mobile_revenue
FROM sales
WHERE metric = 'revenue'
GROUP BY region;

📌 Most common interview pivot


🔵 SQL – Advanced (PIVOT)

SELECT *
FROM (
    SELECT region, product, value
    FROM sales
    WHERE metric = 'revenue'
) src
PIVOT (
    SUM(value) FOR product IN ('Laptop', 'Mobile')
);

🟢 PySpark – Basic

from pyspark.sql import functions as F

df = (sales_df
      .filter("metric = 'revenue'")
      .groupBy("region")
      .pivot("product")
      .sum("value"))

🔵 PySpark – Advanced (Null-safe + rename)

df = (sales_df
      .filter(F.col("metric") == "revenue")
      .groupBy("region")
      .pivot("product", ["Laptop", "Mobile"])
      .agg(F.sum("value"))
      .fillna(0)
      .withColumnRenamed("Laptop", "laptop_revenue")
      .withColumnRenamed("Mobile", "mobile_revenue"))
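With the three sample rows above, this null-safe pivot produces one row per region, with missing region/product combinations filled as 0:

region | laptop_revenue | mobile_revenue
-----------------------------------------
East   | 1000           | 0
West   | 0              | 800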

2️⃣ Pivot Multiple Metrics (Revenue & Quantity)

🟢 SQL – Basic

SELECT
    region,
    product,
    SUM(CASE WHEN metric = 'revenue' THEN value END) AS revenue,
    SUM(CASE WHEN metric = 'quantity' THEN value END) AS quantity
FROM sales
GROUP BY region, product;

🔵 SQL – Advanced (Double Pivot Pattern)

SELECT *
FROM (
    SELECT region, product, metric, value
    FROM sales
) src
PIVOT (
    SUM(value) FOR metric IN ('revenue', 'quantity')
);

📌 Very common BI interview question


🟢 PySpark – Basic

df = (sales_df
      .groupBy("region", "product")
      .pivot("metric")
      .sum("value"))

🔵 PySpark – Advanced

df = (sales_df
      .groupBy("region", "product")
      .pivot("metric", ["revenue", "quantity"])
      .agg(F.sum("value")))

3️⃣ Monthly Pivot (Rows → Month Columns)

🟢 SQL – Basic

SELECT
    product,
    SUM(CASE WHEN MONTH(order_date) = 1 THEN value END) AS jan,
    SUM(CASE WHEN MONTH(order_date) = 2 THEN value END) AS feb
FROM sales
WHERE metric = 'revenue'
GROUP BY product;

🔵 SQL – Advanced

-- SQL Server syntax: FORMAT() for month labels, bracketed names in the PIVOT IN list
SELECT *
FROM (
    SELECT product, FORMAT(order_date, 'MMM') AS month, value
    FROM sales
    WHERE metric = 'revenue'
) src
PIVOT (
    SUM(value) FOR month IN ([Jan], [Feb], [Mar])
);

🟢 PySpark – Basic

df = (sales_df
      .filter("metric = 'revenue'")
      .withColumn("month", F.month("order_date"))
      .groupBy("product")
      .pivot("month")
      .sum("value"))

🔵 PySpark – Advanced

df = (sales_df
      .filter("metric = 'revenue'")
      .withColumn("month", F.date_format("order_date", "MMM"))
      .groupBy("product")
      .pivot("month")
      .agg(F.sum("value")))

4️⃣ Unpivot Columns Back to Rows (Very Common!)

🟢 SQL – Basic (UNION ALL)

SELECT product, 'Jan' AS month, jan AS revenue FROM sales_summary
UNION ALL
SELECT product, 'Feb', feb FROM sales_summary;

🔵 SQL – Advanced (UNPIVOT)

SELECT product, month, revenue
FROM sales_summary
UNPIVOT (
    revenue FOR month IN (jan, feb, mar)
);

📌 Senior-level question


🟢 PySpark – Basic

df = sales_summary_df.selectExpr(
    "product",
    "stack(3, 'Jan', jan, 'Feb', feb, 'Mar', mar) as (month, revenue)"
)
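Here stack(3, ...) emits three (month, revenue) rows per input row, pairing each literal month label with the value of the corresponding column.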

🔵 PySpark – Advanced (Explode)

df = (sales_summary_df
      .select(
          "product",
          F.explode(
              F.array(
                  # All structs must share the same field names so array() resolves to one type
                  F.struct(F.lit("Jan").alias("month"), F.col("jan").alias("revenue")),
                  F.struct(F.lit("Feb").alias("month"), F.col("feb").alias("revenue")),
                  F.struct(F.lit("Mar").alias("month"), F.col("mar").alias("revenue"))
              )
          ).alias("kv")
      )
      .select("product", "kv.*"))

5️⃣ Dynamic Pivot (Unknown Values) – Tricky Interview Question

🔵 SQL – Conceptual

Standard SQL cannot pivot on values that are unknown when the query is written; the column list must be generated at runtime with dynamic SQL and then executed, as sketched below.
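A minimal sketch of how this looks in T-SQL (assuming SQL Server 2017+ for STRING_AGG; other engines need their own procedural dialect):

-- Hypothetical T-SQL sketch: build the pivot column list at runtime
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- 1) Collect the distinct product values as a bracketed column list
SELECT @cols = STRING_AGG(QUOTENAME(product), ', ')
FROM (SELECT DISTINCT product FROM sales WHERE metric = 'revenue') AS p;

-- 2) Inject the column list into a PIVOT statement and execute it
SET @sql = N'
SELECT region, ' + @cols + N'
FROM (SELECT region, product, value FROM sales WHERE metric = ''revenue'') src
PIVOT (SUM(value) FOR product IN (' + @cols + N')) AS pvt;';

EXEC sp_executesql @sql;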


🔵 PySpark – Advanced (Dynamic)

pivot_cols = [row[0] for row in sales_df.select("product").distinct().collect()]

df = (sales_df
      .filter("metric = 'revenue'")
      .groupBy("region")
      .pivot("product", pivot_cols)
      .sum("value"))
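Collecting the distinct values costs an extra job, but passing them to pivot() explicitly spares Spark from computing them itself and keeps the output column order deterministic.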

📌 Very strong Spark interview answer


🔥 Interview Traps & Talking Points

Topic          | Interview Insight
---------------|----------------------------
CASE vs PIVOT  | CASE works everywhere
Dynamic pivot  | Easy in Spark, hard in SQL
NULL handling  | Always COALESCE / fillna
Performance    | Pivot = shuffle-heavy
Wide tables    | Bad for analytics
Unpivot        | Needed before ML/BI
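A quick illustration of the NULL-handling point, reusing the CASE-based pivot from question 1 (the PySpark equivalent is the fillna(0) shown earlier):

SELECT
    region,
    COALESCE(SUM(CASE WHEN product = 'Laptop' THEN value END), 0) AS laptop_revenue,
    COALESCE(SUM(CASE WHEN product = 'Mobile' THEN value END), 0) AS mobile_revenue
FROM sales
WHERE metric = 'revenue'
GROUP BY region;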

🧠 When to Pivot vs Not Pivot

  • ❌ Avoid pivot in raw / bronze layers

  • ✅ Use pivot in gold / reporting layers

  • ✅ Unpivot before ML / analytics
