DE Interview: Self-Join (Hierarchy)

 

a complete, senior-level (6–8 yrs) coverage of
🔹 PATTERN : Self-Join (Hierarchy)
🎯 Core skill: Joining a table to itself



🔹 PATTERN : Self-Join (Hierarchy)

📘 Assumed Table: employees

emp_id | emp_name | manager_id | department | salary
  • manager_id → references emp_id

  • CEO has manager_id = NULL


1️⃣ Employee → Manager Mapping

✅ SQL – Basic

SELECT e.emp_id, e.emp_name, m.emp_name AS manager_name FROM employees e LEFT JOIN employees m ON e.manager_id = m.emp_id;

📌 Basic self-join concept


✅ SQL – Advanced (Include CEO)

SELECT e.emp_id, e.emp_name, COALESCE(m.emp_name, 'CEO') AS manager_name FROM employees e LEFT JOIN employees m ON e.manager_id = m.emp_id;

✅ PySpark – Basic

emp = employees_df.alias("e") mgr = employees_df.alias("m") df = emp.join( mgr, emp.manager_id == mgr.emp_id, "left" ).select( "e.emp_id", "e.emp_name", F.col("m.emp_name").alias("manager_name") )

✅ PySpark – Advanced (Null-safe)

df = emp.join( mgr, emp.manager_id.eqNullSafe(mgr.emp_id), "left" )

2️⃣ Find Employees Reporting to Same Manager

✅ SQL – Basic

SELECT e1.emp_name AS emp1, e2.emp_name AS emp2, e1.manager_id FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.manager_id AND e1.emp_id < e2.emp_id;

📌 Prevents self-matching & duplicates


✅ PySpark – Basic

e1 = employees_df.alias("e1") e2 = employees_df.alias("e2") df = e1.join( e2, (e1.manager_id == e2.manager_id) & (e1.emp_id < e2.emp_id) )

3️⃣ Employee → Manager → Grand-Manager

✅ SQL – Advanced (2-level hierarchy)

SELECT e.emp_name AS employee, m.emp_name AS manager, gm.emp_name AS grand_manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.emp_id LEFT JOIN employees gm ON m.manager_id = gm.emp_id;

📌 Very common interview question


✅ PySpark – Advanced

e = employees_df.alias("e") m = employees_df.alias("m") gm = employees_df.alias("gm") df = (e .join(m, e.manager_id == m.emp_id, "left") .join(gm, m.manager_id == gm.emp_id, "left") .select( F.col("e.emp_name").alias("employee"), F.col("m.emp_name").alias("manager"), F.col("gm.emp_name").alias("grand_manager") ))

4️⃣ Employees Whose Manager Earns Less Than Them

✅ SQL – Advanced

SELECT e.emp_name, e.salary, m.emp_name AS manager, m.salary AS manager_salary FROM employees e JOIN employees m ON e.manager_id = m.emp_id WHERE e.salary > m.salary;

✅ PySpark

df = emp.join( mgr, emp.manager_id == mgr.emp_id ).filter( emp.salary > mgr.salary )

5️⃣ Managers With More Than N Direct Reports

✅ SQL – Advanced

SELECT m.emp_id, m.emp_name, COUNT(e.emp_id) AS reportees FROM employees m JOIN employees e ON e.manager_id = m.emp_id GROUP BY m.emp_id, m.emp_name HAVING COUNT(e.emp_id) > 5;

✅ PySpark – Optimized

df = (employees_df .groupBy("manager_id") .count() .filter("count > 5") .join(employees_df, employees_df.emp_id == F.col("manager_id")))

6️⃣ Detect Circular Reporting (A → B → A)

✅ SQL – Advanced (Data Quality)

SELECT e1.emp_name, e2.emp_name AS reports_to FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.emp_id WHERE e2.manager_id = e1.emp_id;

📌 Very senior-level question


✅ PySpark

df = emp.join( mgr, emp.manager_id == mgr.emp_id ).filter( mgr.manager_id == emp.emp_id )

7️⃣ Employees Reporting Outside Their Department

✅ SQL

SELECT e.emp_name, e.department AS emp_dept, m.department AS mgr_dept FROM employees e JOIN employees m ON e.manager_id = m.emp_id WHERE e.department <> m.department;

✅ PySpark

df = emp.join( mgr, emp.manager_id == mgr.emp_id ).filter( emp.department != mgr.department )

🔥 PySpark Performance Tips (Interview Gold)

  • Self-joins are expensive

  • Always:

    • Select only required columns

    • Use broadcast when one side is small

from pyspark.sql.functions import broadcast df = emp.join(broadcast(mgr), emp.manager_id == mgr.emp_id)
  • Avoid repeated self-joins → cache intermediate results

  • For deep hierarchies → recursive CTE (SQL) or GraphFrames (Spark)


🧠 Common Interview Traps

TrapFix
Self-join duplicatesUse < or !=
CEO missingLEFT JOIN
Circular loopsDetect via multi-joins
PerformancePre-filter + broadcast

Comments

Popular posts from this blog

SyBase Database Migration to SQL Server

Basics of US Healthcare -Medical Billing