DE Interview Questions: Running Totals & Moving Averages

 

🎯 Running Totals & Moving Averages
Written at the 6–8 years of experience level, with basic plus advanced/optimized approaches for each question.


🔹 Assumed Sample Tables

sales

sale_id | customer_id | product_id | sale_date | amount | status

transactions

txn_id | account_id | txn_date | txn_type (credit/debit) | amount
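
The PySpark answers below assume two DataFrames, sales_df and transactions_df, built on top of these tables. One possible setup (a minimal sketch; table names and app name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("running-totals").getOrCreate()

# Assumed source tables; adjust to your catalog, database, or file paths
sales_df = spark.table("sales")
transactions_df = spark.table("transactions")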

1️⃣ Running Total of Daily Sales

✅ SQL – Basic

SELECT
    sale_date,
    SUM(amount) AS daily_sales,
    SUM(SUM(amount)) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales
GROUP BY sale_date
ORDER BY sale_date;

📌 Interview Tip
Aggregate to the target grain first (here, one row per sale_date), then apply the window function; otherwise the running total is computed over individual sale rows instead of daily totals.


✅ SQL – Advanced (Partitioned Running Total)

SELECT
    customer_id,
    sale_date,
    SUM(amount) AS daily_sales,
    SUM(SUM(amount)) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS running_total
FROM sales
GROUP BY customer_id, sale_date;

✅ PySpark – Basic

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("sale_date")

df = (sales_df
      .groupBy("sale_date")
      .agg(F.sum("amount").alias("daily_sales"))
      .withColumn("running_total", F.sum("daily_sales").over(w)))

✅ PySpark – Optimized (Partition + Sort)

w = (Window
     .partitionBy("customer_id")
     .orderBy("sale_date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = sales_df.withColumn(
    "running_total",
    F.sum("amount").over(w)
)

2️⃣ Running Balance (Credit / Debit)

✅ SQL – Advanced (Conditional Running Total)

SELECT
    account_id,
    txn_date,
    txn_type,
    amount,
    SUM(
        CASE WHEN txn_type = 'credit' THEN amount ELSE -amount END
    ) OVER (
        PARTITION BY account_id
        ORDER BY txn_date
    ) AS running_balance
FROM transactions;

✅ PySpark – Optimized

df = transactions_df.withColumn(
    "signed_amount",
    F.when(F.col("txn_type") == "credit", F.col("amount"))
     .otherwise(-F.col("amount"))
)

w = Window.partitionBy("account_id").orderBy("txn_date")

df = df.withColumn("running_balance", F.sum("signed_amount").over(w))

3️⃣ Excluding Current Row (Previous Running Total)

✅ SQL

SELECT
    sale_date,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS prev_running_total
FROM sales;

📌 Very common interview trap: the frame must end at 1 PRECEDING to exclude the current row, and the first row returns NULL (an empty frame), not 0.


✅ PySpark

w = Window.orderBy("sale_date") \
          .rowsBetween(Window.unboundedPreceding, -1)

df = sales_df.withColumn("prev_running_total", F.sum("amount").over(w))

4️⃣ Moving Average (3-Day Rolling)

✅ SQL – Basic

SELECT
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3d
FROM sales;

✅ SQL – RANGE (Date-Aware)

SELECT
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        RANGE BETWEEN INTERVAL '2' DAY PRECEDING AND CURRENT ROW
    ) AS moving_avg_3d
FROM sales;

📌 ROWS vs RANGE is a must-know interview topic: ROWS counts physical rows, so duplicate or missing dates skew the window, while RANGE is value-based and covers the last 3 calendar days regardless of how many rows fall on each date (note that interval-based RANGE frames are not supported by every database).


✅ PySpark – Rolling Average

w = Window.orderBy("sale_date").rowsBetween(-2, 0)

df = sales_df.withColumn(
    "moving_avg_3d",
    F.avg("amount").over(w)
)
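
PySpark has no INTERVAL-style frame, but the date-aware version can be approximated with rangeBetween over a day number. A minimal sketch, assuming sale_date is a DateType column (day_num is a helper expression introduced here):

# Convert sale_date to an integer day number so rangeBetween counts
# calendar days instead of physical rows
day_num = F.datediff(F.col("sale_date"), F.lit("1970-01-01"))

w_range = Window.orderBy(day_num).rangeBetween(-2, 0)

df = sales_df.withColumn("moving_avg_3d", F.avg("amount").over(w_range))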

5️⃣ Moving Average Per Customer

✅ SQL

SELECT
    customer_id,
    sale_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM sales;

✅ PySpark

w = (Window
     .partitionBy("customer_id")
     .orderBy("sale_date")
     .rowsBetween(-6, 0))

df = sales_df.withColumn(
    "moving_avg_7d",
    F.avg("amount").over(w)
)

📌 Note: 6 PRECEDING / rowsBetween(-6, 0) is a 7-row window, which equals 7 days only when each customer has exactly one row per day; for a true calendar-based window, reuse the RANGE / rangeBetween pattern from question 4.

6️⃣ Year-to-Date (YTD) Sales

✅ SQL – Advanced

SELECT
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY EXTRACT(YEAR FROM sale_date)
        ORDER BY sale_date
    ) AS ytd_sales
FROM sales;

✅ PySpark

df = sales_df.withColumn("year", F.year("sale_date")) w = Window.partitionBy("year").orderBy("sale_date") df = df.withColumn("ytd_sales", F.sum("amount").over(w))

7️⃣ Restart Running Total When Condition Changes

(e.g., status changes)

✅ SQL – Advanced

SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY customer_id, status
        ORDER BY sale_date
    ) AS conditional_running_total
FROM sales;

✅ PySpark

w = (Window
     .partitionBy("customer_id", "status")
     .orderBy("sale_date"))

df = sales_df.withColumn(
    "conditional_running_total",
    F.sum("amount").over(w)
)
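
📌 Caveat: PARTITION BY status only restarts the total if each status value forms one contiguous run per customer. If a status can repeat (e.g., active → inactive → active), a gaps-and-islands step is needed so the total restarts on every change. A minimal PySpark sketch of that approach (the grp column is introduced here for illustration):

w_order = Window.partitionBy("customer_id").orderBy("sale_date")

# Flag rows where status differs from the previous row, then take a running
# sum of the flags to build an island id (grp) that increments on each change
status_changed = (
    F.lag("status").over(w_order).isNull() |
    (F.lag("status").over(w_order) != F.col("status"))
).cast("int")

df = sales_df.withColumn("grp", F.sum(status_changed).over(w_order))

w_grp = Window.partitionBy("customer_id", "grp").orderBy("sale_date")

df = df.withColumn("conditional_running_total", F.sum("amount").over(w_grp))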

8️⃣ Performance-Optimized Spark Pattern (Very Important)

❌ BAD (chained withColumn calls, risking multiple window scans)

df.withColumn("sum1", F.sum("amount").over(w)) \ .withColumn("avg1", F.avg("amount").over(w))

✅ GOOD (single window spec)

df.select(
    "*",
    F.sum("amount").over(w).alias("running_total"),
    F.avg("amount").over(w).alias("moving_avg")
)
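
A quick way to verify the single pass (a sketch; result is just the DataFrame built above) is to inspect the physical plan: all aggregates that share the same window spec should collapse into a single Window operator with one shuffle and sort.

result = df.select(
    "*",
    F.sum("amount").over(w).alias("running_total"),
    F.avg("amount").over(w).alias("moving_avg")
)

# Expect one Window node (and one Exchange + Sort) for the shared spec;
# specs with different partitionBy/orderBy each add their own shuffle and sort
result.explain()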

📌 Senior-level optimization insight: define the window spec once and compute every aggregate over it in a single projection; window specs with different partitionBy/orderBy each trigger their own shuffle and sort, which is where the real cost lies.
