Data Engineering Interview: Ultra Hard (CTE + Window + GROUP BY Mix)

 🔥 ULTRA HARD – Pattern 1 Mixed With CTE + Window + GROUP BY

Level: Senior Data Engineer (8+ years)
These are the kinds of questions asked in interviews at product companies, fintech firms, and FAANG-level companies.

These problems mix:

  • GROUP BY

  • HAVING

  • CTEs

  • Window functions

  • Multi-step aggregations

  • Global vs partition comparisons

  • Ratio & ranking logic

We’ll use 5 realistic datasets.


🧩 DATASET 1: BANKING – Fraud Signal Detection

Table: transactions

txn_id | account_id | txn_date   | amount | txn_type
-------+------------+------------+--------+---------
1      | 101        | 2025-01-01 | 5000   | CREDIT
2      | 101        | 2025-01-02 | 2000   | DEBIT
3      | 101        | 2025-01-03 | 15000  | DEBIT
4      | 102        | 2025-01-01 | 8000   | CREDIT
5      | 102        | 2025-01-02 | 9000   | DEBIT

🔥 Q1. Find accounts whose total debit is above the 90th percentile of total debit across all accounts


✅ SQL (CTE + Window + GROUP BY)

WITH debit_summary AS (
    SELECT account_id,
           SUM(CASE WHEN txn_type = 'DEBIT' THEN amount ELSE 0 END) AS total_debit
    FROM transactions
    GROUP BY account_id
),
percentile_calc AS (
    SELECT account_id,
           total_debit,
           -- Note: PERCENTILE_CONT as a window function works in Oracle;
           -- PostgreSQL only supports it as an aggregate, so there you would
           -- compute p90 in a separate CTE and cross join it.
           PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total_debit) OVER () AS p90
    FROM debit_summary
)
SELECT account_id, total_debit
FROM percentile_calc
WHERE total_debit > p90;

✅ PySpark Advanced

from pyspark.sql.functions import sum, when, col

debit_df = df.groupBy("account_id").agg(
    sum(when(col("txn_type") == "DEBIT", col("amount")).otherwise(0)).alias("total_debit")
)

# approxQuantile returns an approximate percentile (relative error 0.01 here)
p90_value = debit_df.approxQuantile("total_debit", [0.9], 0.01)[0]

debit_df.filter(col("total_debit") > p90_value).show()
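To make the percentile logic concrete, here is a pure-Python sanity check on the sample rows above, with no Spark required. The `percentile_cont` helper is a hand-rolled linear-interpolation percentile mimicking SQL's PERCENTILE_CONT, not a library function:

```python
def percentile_cont(values, q):
    """Linear-interpolation percentile, matching SQL PERCENTILE_CONT."""
    s = sorted(values)
    idx = (len(s) - 1) * q
    lo = int(idx)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)

# Sample rows from the transactions table above
txns = [
    (1, 101, "2025-01-01", 5000, "CREDIT"),
    (2, 101, "2025-01-02", 2000, "DEBIT"),
    (3, 101, "2025-01-03", 15000, "DEBIT"),
    (4, 102, "2025-01-01", 8000, "CREDIT"),
    (5, 102, "2025-01-02", 9000, "DEBIT"),
]

# total debit per account (the debit_summary CTE)
totals = {}
for _, acct, _, amount, ttype in txns:
    if ttype == "DEBIT":
        totals[acct] = totals.get(acct, 0) + amount

p90 = percentile_cont(list(totals.values()), 0.9)
flagged = [a for a, t in totals.items() if t > p90]
# Account 101 (total debit 17000) is above the p90 of 16200
```

Note that Spark's approxQuantile may diverge slightly from this exact interpolation, which is usually acceptable for fraud screening at scale.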

🧩 DATASET 2: HR – Compensation Outlier Detection

Table: employees

emp_id | dept_id | salary | hire_date

🔥 Q2. Find employees earning more than department average AND above overall average


✅ SQL

WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS dept_avg_salary
    FROM employees
    GROUP BY dept_id
),
overall_avg AS (
    SELECT AVG(salary) AS global_avg
    FROM employees
)
SELECT e.*
FROM employees e
JOIN dept_avg d ON e.dept_id = d.dept_id
CROSS JOIN overall_avg o
WHERE e.salary > d.dept_avg_salary
  AND e.salary > o.global_avg;

✅ PySpark Advanced

from pyspark.sql.functions import avg, col

dept_avg = df.groupBy("dept_id").agg(avg("salary").alias("dept_avg"))
global_avg = df.agg(avg("salary").alias("global_avg")).collect()[0][0]

df.join(dept_avg, "dept_id") \
  .filter((col("salary") > col("dept_avg")) & (col("salary") > global_avg)) \
  .show()
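The double-threshold filter is easy to verify in plain Python. The rows below are hypothetical (emp_id, dept_id, salary) tuples invented for illustration:

```python
# Hypothetical sample rows: (emp_id, dept_id, salary)
employees = [
    (1, 10, 50000),
    (2, 10, 90000),
    (3, 20, 60000),
    (4, 20, 80000),
]

# per-department averages (the dept_avg CTE)
dept_totals = {}
for _, dept, sal in employees:
    dept_totals.setdefault(dept, []).append(sal)
dept_avg = {d: sum(v) / len(v) for d, v in dept_totals.items()}

# company-wide average (the overall_avg CTE)
global_avg = sum(s for *_, s in employees) / len(employees)

# keep rows above BOTH averages
outliers = [e for e in employees
            if e[2] > dept_avg[e[1]] and e[2] > global_avg]
```

With these numbers both averages happen to be 70000, so employees 2 and 4 pass both filters; in real data the two thresholds usually differ, which is exactly why the question asks for both.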

🧩 DATASET 3: E-COMMERCE – Power Users

Table: orders

order_id | customer_id | order_date | amount

🔥 Q3. Find customers whose monthly spend is consistently above monthly average for at least 3 months


✅ SQL

WITH monthly_spend AS (
    SELECT customer_id,
           DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
monthly_avg AS (
    SELECT month, AVG(total_spend) AS avg_spend
    FROM monthly_spend
    GROUP BY month
),
flagged AS (
    SELECT m.customer_id,
           m.month,
           m.total_spend,
           a.avg_spend,
           CASE WHEN m.total_spend > a.avg_spend THEN 1 ELSE 0 END AS is_above
    FROM monthly_spend m
    JOIN monthly_avg a ON m.month = a.month
)
SELECT customer_id
FROM flagged
GROUP BY customer_id
HAVING SUM(is_above) >= 3;

✅ PySpark Advanced

from pyspark.sql.functions import sum, avg, when, col, date_trunc

monthly = df.groupBy("customer_id", date_trunc("month", "order_date").alias("month")) \
    .agg(sum("amount").alias("total_spend"))

monthly_avg = monthly.groupBy("month") \
    .agg(avg("total_spend").alias("avg_spend"))

flagged = monthly.join(monthly_avg, "month") \
    .withColumn("is_above", when(col("total_spend") > col("avg_spend"), 1).otherwise(0))

flagged.groupBy("customer_id") \
    .agg(sum("is_above").alias("above_count")) \
    .filter("above_count >= 3") \
    .show()
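The three-step pipeline (monthly totals → monthly averages → above-average count) can be traced in plain Python. The orders below are hypothetical (customer, month, amount) rows invented so that customer "A" beats the monthly average in all three months:

```python
from collections import defaultdict

# Hypothetical (customer_id, month, amount) rows
orders = [
    ("A", "2025-01", 100), ("B", "2025-01", 50),
    ("A", "2025-02", 100), ("B", "2025-02", 50),
    ("A", "2025-03", 100), ("B", "2025-03", 50),
]

# step 1: monthly spend per customer (the monthly_spend CTE)
monthly = defaultdict(float)
for cust, month, amt in orders:
    monthly[(cust, month)] += amt

# step 2: average spend across customers per month (the monthly_avg CTE)
by_month = defaultdict(list)
for (cust, month), total in monthly.items():
    by_month[month].append(total)
month_avg = {m: sum(v) / len(v) for m, v in by_month.items()}

# step 3: count months above average, keep customers with >= 3 (the HAVING)
above = defaultdict(int)
for (cust, month), total in monthly.items():
    if total > month_avg[month]:
        above[cust] += 1

power_users = [c for c, n in above.items() if n >= 3]
```

Note the HAVING counts any 3 above-average months, not necessarily consecutive ones; requiring consecutive months would need an extra gaps-and-islands step.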

🧩 DATASET 4: Platform Logs – Heavy Users

Table: user_sessions

user_id | session_date | duration_minutes

🔥 Q4. Find users whose 7-day average session time exceeds global daily average


✅ SQL (Window + CTE + GROUP BY)

WITH daily_user AS (
    SELECT user_id, session_date, SUM(duration_minutes) AS daily_duration
    FROM user_sessions
    GROUP BY user_id, session_date
),
rolling_avg AS (
    SELECT user_id,
           session_date,
           AVG(daily_duration) OVER (
               PARTITION BY user_id
               ORDER BY session_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7day_avg
    FROM daily_user
),
global_avg AS (
    SELECT AVG(daily_duration) AS global_daily_avg
    FROM daily_user
)
SELECT DISTINCT r.user_id
FROM rolling_avg r
CROSS JOIN global_avg g
WHERE r.rolling_7day_avg > g.global_daily_avg;
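A pure-Python sketch of the same window logic, using a few hypothetical (user, date, minutes) rows. As in the SQL, the "7-day" window is really the last 7 rows per user (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), so gaps in session dates widen the effective calendar span:

```python
from collections import defaultdict

# Hypothetical (user_id, session_date, duration_minutes) rows
sessions = [
    ("X", "2025-01-01", 120), ("X", "2025-01-02", 90),
    ("Y", "2025-01-01", 10),  ("Y", "2025-01-02", 20),
]

# daily totals per user (the daily_user CTE)
daily = defaultdict(float)
for user, date, mins in sessions:
    daily[(user, date)] += mins

# global average over all user-days (the global_avg CTE)
global_avg = sum(daily.values()) / len(daily)

# rolling average over the last 7 rows per user (the rolling_avg CTE);
# a user qualifies if ANY window beats the global average (SELECT DISTINCT)
heavy_users = set()
per_user = defaultdict(list)
for (user, date), dur in sorted(daily.items()):
    per_user[user].append(dur)
    window = per_user[user][-7:]   # ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    if sum(window) / len(window) > global_avg:
        heavy_users.add(user)
```

If the interviewer wants a true 7-calendar-day window, switch the frame to RANGE over a date offset instead of ROWS.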

🧩 DATASET 5: Payments – Merchant Risk Tiering

Table: payments

merchant_id | payment_date | status | amount

🔥 Q5. Classify merchants into HIGH / MEDIUM / LOW risk based on failure ratio percentile


✅ SQL

WITH merchant_stats AS (
    SELECT merchant_id,
           SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS failure_ratio
    FROM payments
    GROUP BY merchant_id
),
ranked AS (
    SELECT merchant_id,
           failure_ratio,
           NTILE(3) OVER (ORDER BY failure_ratio DESC) AS risk_tier
    FROM merchant_stats
)
SELECT merchant_id,
       CASE risk_tier
           WHEN 1 THEN 'HIGH'
           WHEN 2 THEN 'MEDIUM'
           ELSE 'LOW'
       END AS risk_category
FROM ranked;
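NTILE's bucket arithmetic trips up many candidates: when rows don't divide evenly, the first (n mod k) buckets each get one extra row. A pure-Python sketch over hypothetical precomputed failure ratios makes that visible:

```python
# Hypothetical merchant -> failure_ratio values (the merchant_stats CTE)
ratios = {"m1": 0.30, "m2": 0.20, "m3": 0.10, "m4": 0.05}

ordered = sorted(ratios, key=ratios.get, reverse=True)  # worst first
n, k = len(ordered), 3
# NTILE(k): first (n % k) buckets get one extra row each
base, extra = divmod(n, k)

tiers, i = {}, 0
for bucket in range(1, k + 1):
    size = base + (1 if bucket <= extra else 0)
    for m in ordered[i:i + size]:
        tiers[m] = {1: "HIGH", 2: "MEDIUM"}.get(bucket, "LOW")
    i += size
```

With 4 merchants and 3 tiers, the HIGH bucket gets 2 merchants while MEDIUM and LOW get 1 each; mentioning this uneven split (and that NTILE can separate merchants with identical ratios) is a strong interview signal.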
