VERY HARD – Pattern 1 (GROUP BY + HAVING)

Level: Senior Data Engineer / 5–8+ years

🧩 DATASET 1: BANKING (Fraud + Behavioral Analysis)

Table: transactions

txn_id | account_id | txn_date   | amount | txn_type
-----------------------------------------------------
1      | 101        | 2025-01-01 | 5000   | CREDIT
2      | 101        | 2025-01-01 | 2000   | DEBIT
3      | 101        | 2025-01-02 | 10000  | DEBIT
4      | 102        | 2025-01-01 | 8000   | CREDIT
5      | 102        | 2025-01-02 | 9000   | DEBIT

🔥 Q1. Find accounts where total debit > total credit

✅ SQL Advanced

```sql
SELECT account_id,
       SUM(CASE WHEN txn_type='DEBIT'  THEN amount ELSE 0 END) AS total_debit,
       SUM(CASE WHEN txn_type='CREDIT' THEN amount ELSE 0 END) AS total_credit
FROM transactions
GROUP BY account_id
HAVING SUM(CASE WHEN txn_type='DEBIT'  THEN amount ELSE 0 END)
     > SUM(CASE WHEN txn_type='CREDIT' THEN amount ELSE 0 END);
```

✅ PySpark Advanced

```python
from pyspark.sql.functions import sum, when, col

df.groupBy("account_id").agg(
    sum(when(col("txn_type") == "DEBIT", col("amount")).otherwise(0)).alias("total_debit"),
    sum(when(col("txn_type") == "CREDIT", col("amount")).otherwise(0)).alias("total_credit")
).filter(col("total_debit") > col("total_credit")).show()
```

🔥 Q2. Find accounts where average debit amount > 2× average credit amount

SQL

```sql
SELECT account_id
FROM transactions
GROUP BY account_id
HAVING AVG(CASE WHEN txn_type='DEBIT' THEN amount END)
     > 2 * AVG(CASE WHEN txn_type='CREDIT' THEN amount END);
```

PySpark

```python
from pyspark.sql.functions import avg, when, col

df.groupBy("account_id").agg(
    avg(when(col("txn_type") == "DEBIT", col("amount"))).alias("avg_debit"),
    avg(when(col("txn_type") == "CREDIT", col("amount"))).alias("avg_credit")
).filter(col("avg_debit") > 2 * col("avg_credit")).show()
```

🧩 DATASET 2: HR Analytics

Table: employees

emp_id | dept_id | salary | gender
-----------------------------------
1      | 10      | 50000  | M
2      | 10      | 60000  | F
3      | 10      | 70000  | F
4      | 20      | 30000  | M
5      | 20      | 35000  | M

🔥 Q3. Find departments where female average salary > male average salary

SQL

```sql
SELECT dept_id
FROM employees
GROUP BY dept_id
HAVING AVG(CASE WHEN gender='F' THEN salary END)
     > AVG(CASE WHEN gender='M' THEN salary END);
```

PySpark

```python
from pyspark.sql.functions import avg, when, col

df.groupBy("dept_id").agg(
    avg(when(col("gender") == "F", col("salary"))).alias("avg_female"),
    avg(when(col("gender") == "M", col("salary"))).alias("avg_male")
).filter(col("avg_female") > col("avg_male")).show()
```

🔥 Q4. Find departments where the salary range (max - min) > 10,000

SQL

```sql
SELECT dept_id, MAX(salary) - MIN(salary) AS salary_gap
FROM employees
GROUP BY dept_id
HAVING MAX(salary) - MIN(salary) > 10000;
```

PySpark

```python
from pyspark.sql.functions import max, min

df.groupBy("dept_id").agg(
    (max("salary") - min("salary")).alias("salary_gap")
).filter("salary_gap > 10000").show()
```

🧩 DATASET 3: E-COMMERCE (Behavioral Segmentation)

Table: orders

order_id | customer_id | product_id | amount
------------------------------------------------
1        | 100         | A          | 200
2        | 100         | B          | 300
3        | 100         | A          | 100
4        | 101         | C          | 500

🔥 Q5. Find customers purchasing more than 2 distinct products

SQL

```sql
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING COUNT(DISTINCT product_id) > 2;
```

PySpark

```python
from pyspark.sql.functions import countDistinct

df.groupBy("customer_id") \
  .agg(countDistinct("product_id").alias("prod_count")) \
  .filter("prod_count > 2") \
  .show()
```

🔥 Q6. Find customers whose max order > 50% of their total spend

SQL

```sql
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING MAX(amount) > 0.5 * SUM(amount);
```

PySpark

```python
from pyspark.sql.functions import max, sum, col

df.groupBy("customer_id").agg(
    max("amount").alias("max_amt"),
    sum("amount").alias("total_amt")
).filter(col("max_amt") > 0.5 * col("total_amt")).show()
```

🧩 DATASET 4: App / Platform Logs

Table: user_sessions

user_id | session_id | duration_minutes | device
-------------------------------------------------
1       | S1         | 30               | Android
1       | S2         | 5                | Android
2       | S3         | 100              | iOS
2       | S4         | 120              | iOS

🔥 Q7. Find users whose total session duration > global average duration

SQL

```sql
SELECT user_id
FROM user_sessions
GROUP BY user_id
HAVING SUM(duration_minutes) > (SELECT AVG(duration_minutes) FROM user_sessions);
```

PySpark

```python
from pyspark.sql.functions import avg, sum, col

global_avg = df.agg(avg("duration_minutes")).collect()[0][0]

df.groupBy("user_id") \
  .agg(sum("duration_minutes").alias("total_duration")) \
  .filter(col("total_duration") > global_avg) \
  .show()
```

🔥 Q8. Find devices where average session duration > 2× overall average

SQL

```sql
SELECT device
FROM user_sessions
GROUP BY device
HAVING AVG(duration_minutes) > 2 * (SELECT AVG(duration_minutes) FROM user_sessions);
```

🧩 DATASET 5: Payments (Data Quality Check)

Table: payments

payment_id | merchant_id | status  | amount
---------------------------------------------
1          | M1          | SUCCESS | 100
2          | M1          | FAILED  | 200
3          | M2          | SUCCESS | 300

🔥 Q9. Find merchants where failed transactions > 30% of total

SQL

```sql
SELECT merchant_id
FROM payments
GROUP BY merchant_id
HAVING SUM(CASE WHEN status='FAILED' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) > 0.3;
```

PySpark

```python
from pyspark.sql.functions import sum, when, col, count

df.groupBy("merchant_id").agg(
    (sum(when(col("status") == "FAILED", 1).otherwise(0)) / count("*")).alias("failure_ratio")
).filter("failure_ratio > 0.3").show()
```

🔥 Q10. Find merchants where SUCCESS amount < 50% of total processed amount

SQL

```sql
SELECT merchant_id
FROM payments
GROUP BY merchant_id
HAVING SUM(CASE WHEN status='SUCCESS' THEN amount ELSE 0 END) < 0.5 * SUM(amount);
```

PySpark

```python
from pyspark.sql.functions import sum, when, col

df.groupBy("merchant_id").agg(
    sum(when(col("status") == "SUCCESS", col("amount")).otherwise(0)).alias("success_amt"),
    sum("amount").alias("total_amt")
).filter(col("success_amt") < 0.5 * col("total_amt")).show()
```
