DE Interview Questions: Running Totals & Moving Averages

 

🎯 Running Totals & Moving Averages
Written at the 6–8 years of experience level, with basic plus advanced/optimized approaches for each question.


🔹 Assumed Sample Tables

sales

sale_id | customer_id | product_id | sale_date | amount | status

transactions

txn_id | account_id | txn_date | txn_type (credit/debit) | amount
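
The PySpark answers below assume two DataFrames, sales_df and transactions_df, built on top of these tables. One possible setup (a minimal sketch; table names and app name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("running-totals").getOrCreate()

# Assumed source tables; adjust to your catalog, database, or file paths
sales_df = spark.table("sales")
transactions_df = spark.table("transactions")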

1️⃣ Running Total of Daily Sales

✅ SQL – Basic

SELECT
    sale_date,
    SUM(amount) AS daily_sales,
    SUM(SUM(amount)) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales
GROUP BY sale_date
ORDER BY sale_date;

📌 Interview Tip
Aggregate to the target grain first (here, one row per sale_date), then apply the window function; otherwise the running total is computed over individual sale rows instead of daily totals.


✅ SQL – Advanced (Partitioned Running Total)

SELECT
    customer_id,
    sale_date,
    SUM(amount) AS daily_sales,
    SUM(SUM(amount)) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS running_total
FROM sales
GROUP BY customer_id, sale_date;

✅ PySpark – Basic

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("sale_date")

df = (sales_df
      .groupBy("sale_date")
      .agg(F.sum("amount").alias("daily_sales"))
      .withColumn("running_total", F.sum("daily_sales").over(w)))

✅ PySpark – Optimized (Partition + Sort)

w = (Window
     .partitionBy("customer_id")
     .orderBy("sale_date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = sales_df.withColumn(
    "running_total",
    F.sum("amount").over(w)
)

2️⃣ Running Balance (Credit / Debit)

✅ SQL – Advanced (Conditional Running Total)

SELECT
    account_id,
    txn_date,
    txn_type,
    amount,
    SUM(
        CASE WHEN txn_type = 'credit' THEN amount ELSE -amount END
    ) OVER (
        PARTITION BY account_id
        ORDER BY txn_date
    ) AS running_balance
FROM transactions;

✅ PySpark – Optimized

df = transactions_df.withColumn(
    "signed_amount",
    F.when(F.col("txn_type") == "credit", F.col("amount"))
     .otherwise(-F.col("amount"))
)

w = Window.partitionBy("account_id").orderBy("txn_date")

df = df.withColumn("running_balance", F.sum("signed_amount").over(w))

3️⃣ Excluding Current Row (Previous Running Total)

✅ SQL

SELECT
    sale_date,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS prev_running_total
FROM sales;

📌 Very common interview trap: the frame must end at 1 PRECEDING to exclude the current row, and the first row returns NULL (an empty frame), not 0.


✅ PySpark

w = Window.orderBy("sale_date") \
          .rowsBetween(Window.unboundedPreceding, -1)

df = sales_df.withColumn("prev_running_total", F.sum("amount").over(w))

4️⃣ Moving Average (3-Day Rolling)

✅ SQL – Basic

SELECT
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3d
FROM sales;

✅ SQL – RANGE (Date-Aware)

SELECT
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        RANGE BETWEEN INTERVAL '2' DAY PRECEDING AND CURRENT ROW
    ) AS moving_avg_3d
FROM sales;

📌 ROWS vs RANGE is a must-know interview topic: ROWS counts physical rows, so duplicate or missing dates skew the window, while RANGE is value-based and covers the last 3 calendar days regardless of how many rows fall on each date (note that interval-based RANGE frames are not supported by every database).


✅ PySpark – Rolling Average

w = Window.orderBy("sale_date").rowsBetween(-2, 0)

df = sales_df.withColumn(
    "moving_avg_3d",
    F.avg("amount").over(w)
)
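
PySpark has no INTERVAL-style frame, but the date-aware version can be approximated with rangeBetween over a day number. A minimal sketch, assuming sale_date is a DateType column (day_num is a helper expression introduced here):

# Convert sale_date to an integer day number so rangeBetween counts
# calendar days instead of physical rows
day_num = F.datediff(F.col("sale_date"), F.lit("1970-01-01"))

w_range = Window.orderBy(day_num).rangeBetween(-2, 0)

df = sales_df.withColumn("moving_avg_3d", F.avg("amount").over(w_range))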

5️⃣ Moving Average Per Customer

✅ SQL

SELECT
    customer_id,
    sale_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM sales;

✅ PySpark

w = (Window
     .partitionBy("customer_id")
     .orderBy("sale_date")
     .rowsBetween(-6, 0))

df = sales_df.withColumn(
    "moving_avg_7d",
    F.avg("amount").over(w)
)

📌 Note: 6 PRECEDING / rowsBetween(-6, 0) is a 7-row window, which equals 7 days only when each customer has exactly one row per day; for a true calendar-based window, reuse the RANGE / rangeBetween pattern from question 4.

6️⃣ Year-to-Date (YTD) Sales

✅ SQL – Advanced

SELECT
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY EXTRACT(YEAR FROM sale_date)
        ORDER BY sale_date
    ) AS ytd_sales
FROM sales;

✅ PySpark

df = sales_df.withColumn("year", F.year("sale_date")) w = Window.partitionBy("year").orderBy("sale_date") df = df.withColumn("ytd_sales", F.sum("amount").over(w))

7️⃣ Restart Running Total When Condition Changes

(e.g., status changes)

✅ SQL – Advanced

SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY customer_id, status
        ORDER BY sale_date
    ) AS conditional_running_total
FROM sales;

✅ PySpark

w = (Window
     .partitionBy("customer_id", "status")
     .orderBy("sale_date"))

df = sales_df.withColumn(
    "conditional_running_total",
    F.sum("amount").over(w)
)
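
📌 Caveat: PARTITION BY status only restarts the total if each status value forms one contiguous run per customer. If a status can repeat (e.g., active → inactive → active), a gaps-and-islands step is needed so the total restarts on every change. A minimal PySpark sketch of that approach (the grp column is introduced here for illustration):

w_order = Window.partitionBy("customer_id").orderBy("sale_date")

# Flag rows where status differs from the previous row, then take a running
# sum of the flags to build an island id (grp) that increments on each change
status_changed = (
    F.lag("status").over(w_order).isNull() |
    (F.lag("status").over(w_order) != F.col("status"))
).cast("int")

df = sales_df.withColumn("grp", F.sum(status_changed).over(w_order))

w_grp = Window.partitionBy("customer_id", "grp").orderBy("sale_date")

df = df.withColumn("conditional_running_total", F.sum("amount").over(w_grp))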

8️⃣ Performance-Optimized Spark Pattern (Very Important)

❌ BAD (chained withColumn calls, risking multiple window scans)

df.withColumn("sum1", F.sum("amount").over(w)) \ .withColumn("avg1", F.avg("amount").over(w))

✅ GOOD (single window spec)

df.select(
    "*",
    F.sum("amount").over(w).alias("running_total"),
    F.avg("amount").over(w).alias("moving_avg")
)
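
A quick way to verify the single pass (a sketch; result is just the DataFrame built above) is to inspect the physical plan: all aggregates that share the same window spec should collapse into a single Window operator with one shuffle and sort.

result = df.select(
    "*",
    F.sum("amount").over(w).alias("running_total"),
    F.avg("amount").over(w).alias("moving_avg")
)

# Expect one Window node (and one Exchange + Sort) for the shared spec;
# specs with different partitionBy/orderBy each add their own shuffle and sort
result.explain()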

📌 Senior-level optimization insight: define the window spec once and compute every aggregate over it in a single projection; window specs with different partitionBy/orderBy each trigger their own shuffle and sort, which is where the real cost lies.
