Data Engineering Interview: Ultra Hard (CTE + Window + GROUP BY Mix)

 🔥 ULTRA HARD – Pattern 1 Mixed With CTE + Window + GROUP BY

Level: Senior Data Engineer (8+ years)
These are the kinds of questions asked in interviews at product companies, fintech firms, and FAANG-level companies.

These problems mix:

  • GROUP BY

  • HAVING

  • CTEs

  • Window functions

  • Multi-step aggregations

  • Global vs partition comparisons

  • Ratio & ranking logic

We’ll use 5 realistic datasets.


🧩 DATASET 1: BANKING – Fraud Signal Detection

Table: transactions

txn_id | account_id | txn_date   | amount | txn_type
-------+------------+------------+--------+---------
1      | 101        | 2025-01-01 | 5000   | CREDIT
2      | 101        | 2025-01-02 | 2000   | DEBIT
3      | 101        | 2025-01-03 | 15000  | DEBIT
4      | 102        | 2025-01-01 | 8000   | CREDIT
5      | 102        | 2025-01-02 | 9000   | DEBIT

🔥 Q1. Find accounts whose total debit is above the 90th percentile of total debit across all accounts


✅ SQL (CTE + Window + GROUP BY)

WITH debit_summary AS (
    SELECT account_id,
           SUM(CASE WHEN txn_type = 'DEBIT' THEN amount ELSE 0 END) AS total_debit
    FROM transactions
    GROUP BY account_id
),
percentile_calc AS (
    SELECT account_id,
           total_debit,
           -- Note: PERCENTILE_CONT as a window function works in Oracle;
           -- PostgreSQL only supports it as an aggregate, so there you would
           -- compute p90 in a separate CTE and cross join it.
           PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total_debit) OVER () AS p90
    FROM debit_summary
)
SELECT account_id, total_debit
FROM percentile_calc
WHERE total_debit > p90;

✅ PySpark Advanced

from pyspark.sql.functions import sum, when, col

debit_df = df.groupBy("account_id").agg(
    sum(when(col("txn_type") == "DEBIT", col("amount")).otherwise(0)).alias("total_debit")
)

# approxQuantile returns an approximate percentile (relative error 0.01 here)
p90_value = debit_df.approxQuantile("total_debit", [0.9], 0.01)[0]

debit_df.filter(col("total_debit") > p90_value).show()
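To make the percentile logic concrete, here is a pure-Python sanity check on the sample rows above, with no Spark required. The `percentile_cont` helper is a hand-rolled linear-interpolation percentile mimicking SQL's PERCENTILE_CONT, not a library function:

```python
def percentile_cont(values, q):
    """Linear-interpolation percentile, matching SQL PERCENTILE_CONT."""
    s = sorted(values)
    idx = (len(s) - 1) * q
    lo = int(idx)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)

# Sample rows from the transactions table above
txns = [
    (1, 101, "2025-01-01", 5000, "CREDIT"),
    (2, 101, "2025-01-02", 2000, "DEBIT"),
    (3, 101, "2025-01-03", 15000, "DEBIT"),
    (4, 102, "2025-01-01", 8000, "CREDIT"),
    (5, 102, "2025-01-02", 9000, "DEBIT"),
]

# total debit per account (the debit_summary CTE)
totals = {}
for _, acct, _, amount, ttype in txns:
    if ttype == "DEBIT":
        totals[acct] = totals.get(acct, 0) + amount

p90 = percentile_cont(list(totals.values()), 0.9)
flagged = [a for a, t in totals.items() if t > p90]
# Account 101 (total debit 17000) is above the p90 of 16200
```

Note that Spark's approxQuantile may diverge slightly from this exact interpolation, which is usually acceptable for fraud screening at scale.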

🧩 DATASET 2: HR – Compensation Outlier Detection

Table: employees

emp_id | dept_id | salary | hire_date

🔥 Q2. Find employees earning more than department average AND above overall average


✅ SQL

WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS dept_avg_salary
    FROM employees
    GROUP BY dept_id
),
overall_avg AS (
    SELECT AVG(salary) AS global_avg
    FROM employees
)
SELECT e.*
FROM employees e
JOIN dept_avg d ON e.dept_id = d.dept_id
CROSS JOIN overall_avg o
WHERE e.salary > d.dept_avg_salary
  AND e.salary > o.global_avg;

✅ PySpark Advanced

from pyspark.sql.functions import avg, col

dept_avg = df.groupBy("dept_id").agg(avg("salary").alias("dept_avg"))
global_avg = df.agg(avg("salary").alias("global_avg")).collect()[0][0]

df.join(dept_avg, "dept_id") \
  .filter((col("salary") > col("dept_avg")) & (col("salary") > global_avg)) \
  .show()
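The double-threshold filter is easy to verify in plain Python. The rows below are hypothetical (emp_id, dept_id, salary) tuples invented for illustration:

```python
# Hypothetical sample rows: (emp_id, dept_id, salary)
employees = [
    (1, 10, 50000),
    (2, 10, 90000),
    (3, 20, 60000),
    (4, 20, 80000),
]

# per-department averages (the dept_avg CTE)
dept_totals = {}
for _, dept, sal in employees:
    dept_totals.setdefault(dept, []).append(sal)
dept_avg = {d: sum(v) / len(v) for d, v in dept_totals.items()}

# company-wide average (the overall_avg CTE)
global_avg = sum(s for *_, s in employees) / len(employees)

# keep rows above BOTH averages
outliers = [e for e in employees
            if e[2] > dept_avg[e[1]] and e[2] > global_avg]
```

With these numbers both averages happen to be 70000, so employees 2 and 4 pass both filters; in real data the two thresholds usually differ, which is exactly why the question asks for both.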

🧩 DATASET 3: E-COMMERCE – Power Users

Table: orders

order_id | customer_id | order_date | amount

🔥 Q3. Find customers whose monthly spend is consistently above monthly average for at least 3 months


✅ SQL

WITH monthly_spend AS (
    SELECT customer_id,
           DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
monthly_avg AS (
    SELECT month, AVG(total_spend) AS avg_spend
    FROM monthly_spend
    GROUP BY month
),
flagged AS (
    SELECT m.customer_id,
           m.month,
           m.total_spend,
           a.avg_spend,
           CASE WHEN m.total_spend > a.avg_spend THEN 1 ELSE 0 END AS is_above
    FROM monthly_spend m
    JOIN monthly_avg a ON m.month = a.month
)
SELECT customer_id
FROM flagged
GROUP BY customer_id
HAVING SUM(is_above) >= 3;

✅ PySpark Advanced

from pyspark.sql.functions import sum, avg, when, col, date_trunc

monthly = df.groupBy("customer_id", date_trunc("month", "order_date").alias("month")) \
    .agg(sum("amount").alias("total_spend"))

monthly_avg = monthly.groupBy("month") \
    .agg(avg("total_spend").alias("avg_spend"))

flagged = monthly.join(monthly_avg, "month") \
    .withColumn("is_above", when(col("total_spend") > col("avg_spend"), 1).otherwise(0))

flagged.groupBy("customer_id") \
    .agg(sum("is_above").alias("above_count")) \
    .filter("above_count >= 3") \
    .show()
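The three-step pipeline (monthly totals → monthly averages → above-average count) can be traced in plain Python. The orders below are hypothetical (customer, month, amount) rows invented so that customer "A" beats the monthly average in all three months:

```python
from collections import defaultdict

# Hypothetical (customer_id, month, amount) rows
orders = [
    ("A", "2025-01", 100), ("B", "2025-01", 50),
    ("A", "2025-02", 100), ("B", "2025-02", 50),
    ("A", "2025-03", 100), ("B", "2025-03", 50),
]

# step 1: monthly spend per customer (the monthly_spend CTE)
monthly = defaultdict(float)
for cust, month, amt in orders:
    monthly[(cust, month)] += amt

# step 2: average spend across customers per month (the monthly_avg CTE)
by_month = defaultdict(list)
for (cust, month), total in monthly.items():
    by_month[month].append(total)
month_avg = {m: sum(v) / len(v) for m, v in by_month.items()}

# step 3: count months above average, keep customers with >= 3 (the HAVING)
above = defaultdict(int)
for (cust, month), total in monthly.items():
    if total > month_avg[month]:
        above[cust] += 1

power_users = [c for c, n in above.items() if n >= 3]
```

Note the HAVING counts any 3 above-average months, not necessarily consecutive ones; requiring consecutive months would need an extra gaps-and-islands step.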

🧩 DATASET 4: Platform Logs – Heavy Users

Table: user_sessions

user_id | session_date | duration_minutes

🔥 Q4. Find users whose 7-day average session time exceeds global daily average


✅ SQL (Window + CTE + GROUP BY)

WITH daily_user AS (
    SELECT user_id, session_date, SUM(duration_minutes) AS daily_duration
    FROM user_sessions
    GROUP BY user_id, session_date
),
rolling_avg AS (
    SELECT user_id,
           session_date,
           AVG(daily_duration) OVER (
               PARTITION BY user_id
               ORDER BY session_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7day_avg
    FROM daily_user
),
global_avg AS (
    SELECT AVG(daily_duration) AS global_daily_avg
    FROM daily_user
)
SELECT DISTINCT r.user_id
FROM rolling_avg r
CROSS JOIN global_avg g
WHERE r.rolling_7day_avg > g.global_daily_avg;
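A pure-Python sketch of the same window logic, using a few hypothetical (user, date, minutes) rows. As in the SQL, the "7-day" window is really the last 7 rows per user (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), so gaps in session dates widen the effective calendar span:

```python
from collections import defaultdict

# Hypothetical (user_id, session_date, duration_minutes) rows
sessions = [
    ("X", "2025-01-01", 120), ("X", "2025-01-02", 90),
    ("Y", "2025-01-01", 10),  ("Y", "2025-01-02", 20),
]

# daily totals per user (the daily_user CTE)
daily = defaultdict(float)
for user, date, mins in sessions:
    daily[(user, date)] += mins

# global average over all user-days (the global_avg CTE)
global_avg = sum(daily.values()) / len(daily)

# rolling average over the last 7 rows per user (the rolling_avg CTE);
# a user qualifies if ANY window beats the global average (SELECT DISTINCT)
heavy_users = set()
per_user = defaultdict(list)
for (user, date), dur in sorted(daily.items()):
    per_user[user].append(dur)
    window = per_user[user][-7:]   # ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    if sum(window) / len(window) > global_avg:
        heavy_users.add(user)
```

If the interviewer wants a true 7-calendar-day window, switch the frame to RANGE over a date offset instead of ROWS.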

🧩 DATASET 5: Payments – Merchant Risk Tiering

Table: payments

merchant_id | payment_date | status | amount

🔥 Q5. Classify merchants into HIGH / MEDIUM / LOW risk based on failure ratio percentile


✅ SQL

WITH merchant_stats AS (
    SELECT merchant_id,
           SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS failure_ratio
    FROM payments
    GROUP BY merchant_id
),
ranked AS (
    SELECT merchant_id,
           failure_ratio,
           NTILE(3) OVER (ORDER BY failure_ratio DESC) AS risk_tier
    FROM merchant_stats
)
SELECT merchant_id,
       CASE risk_tier
           WHEN 1 THEN 'HIGH'
           WHEN 2 THEN 'MEDIUM'
           ELSE 'LOW'
       END AS risk_category
FROM ranked;
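NTILE's bucket arithmetic trips up many candidates: when rows don't divide evenly, the first (n mod k) buckets each get one extra row. A pure-Python sketch over hypothetical precomputed failure ratios makes that visible:

```python
# Hypothetical merchant -> failure_ratio values (the merchant_stats CTE)
ratios = {"m1": 0.30, "m2": 0.20, "m3": 0.10, "m4": 0.05}

ordered = sorted(ratios, key=ratios.get, reverse=True)  # worst first
n, k = len(ordered), 3
# NTILE(k): first (n % k) buckets get one extra row each
base, extra = divmod(n, k)

tiers, i = {}, 0
for bucket in range(1, k + 1):
    size = base + (1 if bucket <= extra else 0)
    for m in ordered[i:i + size]:
        tiers[m] = {1: "HIGH", 2: "MEDIUM"}.get(bucket, "LOW")
    i += size
```

With 4 merchants and 3 tiers, the HIGH bucket gets 2 merchants while MEDIUM and LOW get 1 each; mentioning this uneven split (and that NTILE can separate merchants with identical ratios) is a strong interview signal.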
