VERY HARD – Pattern 1 (GROUP BY + HAVING)

Level: Senior Data Engineer / 5–8+ years

🧩 DATASET 1: BANKING (Fraud + Behavioral Analysis)

Table: transactions

txn_id | account_id | txn_date   | amount | txn_type
-----------------------------------------------------
1      | 101        | 2025-01-01 | 5000   | CREDIT
2      | 101        | 2025-01-01 | 2000   | DEBIT
3      | 101        | 2025-01-02 | 10000  | DEBIT
4      | 102        | 2025-01-01 | 8000   | CREDIT
5      | 102        | 2025-01-02 | 9000   | DEBIT

🔥 Q1. Find accounts where total debit > total credit

✅ SQL Advanced

```sql
SELECT account_id,
       SUM(CASE WHEN txn_type='DEBIT'  THEN amount ELSE 0 END) AS total_debit,
       SUM(CASE WHEN txn_type='CREDIT' THEN amount ELSE 0 END) AS total_credit
FROM transactions
GROUP BY account_id
HAVING SUM(CASE WHEN txn_type='DEBIT'  THEN amount ELSE 0 END)
     > SUM(CASE WHEN txn_type='CREDIT' THEN amount ELSE 0 END);
```

✅ PySpark Advanced

```python
from pyspark.sql.functions import sum, when, col

df.groupBy("account_id").agg(
    sum(when(col("txn_type") == "DEBIT", col("amount")).otherwise(0)).alias("total_debit"),
    sum(when(col("txn_type") == "CREDIT", col("amount")).otherwise(0)).alias("total_credit")
).filter(col("total_debit") > col("total_credit")).show()
```

🔥 Q2. Find accounts where average debit amount > 2× average credit amount

SQL

```sql
SELECT account_id
FROM transactions
GROUP BY account_id
HAVING AVG(CASE WHEN txn_type='DEBIT' THEN amount END)
     > 2 * AVG(CASE WHEN txn_type='CREDIT' THEN amount END);
```

PySpark

```python
from pyspark.sql.functions import avg, when, col

df.groupBy("account_id").agg(
    avg(when(col("txn_type") == "DEBIT", col("amount"))).alias("avg_debit"),
    avg(when(col("txn_type") == "CREDIT", col("amount"))).alias("avg_credit")
).filter(col("avg_debit") > 2 * col("avg_credit")).show()
```

🧩 DATASET 2: HR Analytics

Table: employees

emp_id | dept_id | salary | gender
-----------------------------------
1      | 10      | 50000  | M
2      | 10      | 60000  | F
3      | 10      | 70000  | F
4      | 20      | 30000  | M
5      | 20      | 35000  | M

🔥 Q3. Find departments where female average salary > male average salary

SQL

```sql
SELECT dept_id
FROM employees
GROUP BY dept_id
HAVING AVG(CASE WHEN gender='F' THEN salary END)
     > AVG(CASE WHEN gender='M' THEN salary END);
```

PySpark

```python
from pyspark.sql.functions import avg, when, col

df.groupBy("dept_id").agg(
    avg(when(col("gender") == "F", col("salary"))).alias("avg_female"),
    avg(when(col("gender") == "M", col("salary"))).alias("avg_male")
).filter(col("avg_female") > col("avg_male")).show()
```

🔥 Q4. Find departments where the salary range (max - min) > 10,000

SQL

```sql
SELECT dept_id, MAX(salary) - MIN(salary) AS salary_gap
FROM employees
GROUP BY dept_id
HAVING MAX(salary) - MIN(salary) > 10000;
```

PySpark

```python
from pyspark.sql.functions import max, min

df.groupBy("dept_id").agg(
    (max("salary") - min("salary")).alias("salary_gap")
).filter("salary_gap > 10000").show()
```

🧩 DATASET 3: E-COMMERCE (Behavioral Segmentation)

Table: orders

order_id | customer_id | product_id | amount
------------------------------------------------
1        | 100         | A          | 200
2        | 100         | B          | 300
3        | 100         | A          | 100
4        | 101         | C          | 500

🔥 Q5. Find customers purchasing more than 2 distinct products

SQL

```sql
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING COUNT(DISTINCT product_id) > 2;
```

PySpark

```python
from pyspark.sql.functions import countDistinct

df.groupBy("customer_id") \
  .agg(countDistinct("product_id").alias("prod_count")) \
  .filter("prod_count > 2") \
  .show()
```

🔥 Q6. Find customers whose max order > 50% of their total spend

SQL

```sql
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING MAX(amount) > 0.5 * SUM(amount);
```

PySpark

```python
from pyspark.sql.functions import max, sum, col

df.groupBy("customer_id").agg(
    max("amount").alias("max_amt"),
    sum("amount").alias("total_amt")
).filter(col("max_amt") > 0.5 * col("total_amt")).show()
```

🧩 DATASET 4: App / Platform Logs

Table: user_sessions

user_id | session_id | duration_minutes | device
-------------------------------------------------
1       | S1         | 30               | Android
1       | S2         | 5                | Android
2       | S3         | 100              | iOS
2       | S4         | 120              | iOS

🔥 Q7. Find users whose total session duration > global average duration

SQL

```sql
SELECT user_id
FROM user_sessions
GROUP BY user_id
HAVING SUM(duration_minutes) > (SELECT AVG(duration_minutes) FROM user_sessions);
```

PySpark

```python
from pyspark.sql.functions import avg, sum, col

global_avg = df.agg(avg("duration_minutes")).collect()[0][0]

df.groupBy("user_id") \
  .agg(sum("duration_minutes").alias("total_duration")) \
  .filter(col("total_duration") > global_avg) \
  .show()
```

🔥 Q8. Find devices where average session duration > 2× overall average

SQL

```sql
SELECT device
FROM user_sessions
GROUP BY device
HAVING AVG(duration_minutes) > 2 * (SELECT AVG(duration_minutes) FROM user_sessions);
```

🧩 DATASET 5: Payments (Data Quality Check)

Table: payments

payment_id | merchant_id | status  | amount
---------------------------------------------
1          | M1          | SUCCESS | 100
2          | M1          | FAILED  | 200
3          | M2          | SUCCESS | 300

🔥 Q9. Find merchants where failed transactions > 30% of total

SQL

```sql
SELECT merchant_id
FROM payments
GROUP BY merchant_id
HAVING SUM(CASE WHEN status='FAILED' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) > 0.3;
```

PySpark

```python
from pyspark.sql.functions import sum, when, col, count

df.groupBy("merchant_id").agg(
    (sum(when(col("status") == "FAILED", 1).otherwise(0)) / count("*")).alias("failure_ratio")
).filter("failure_ratio > 0.3").show()
```

🔥 Q10. Find merchants where SUCCESS amount < 50% of total processed amount

SQL

```sql
SELECT merchant_id
FROM payments
GROUP BY merchant_id
HAVING SUM(CASE WHEN status='SUCCESS' THEN amount ELSE 0 END) < 0.5 * SUM(amount);
```

PySpark

```python
from pyspark.sql.functions import sum, when, col

df.groupBy("merchant_id").agg(
    sum(when(col("status") == "SUCCESS", col("amount")).otherwise(0)).alias("success_amt"),
    sum("amount").alias("total_amt")
).filter(col("success_amt") < 0.5 * col("total_amt")).show()
```
