DE Interview: Self-Join (Hierarchy)
a complete, senior-level (6–8 yrs) coverage of
🔹 PATTERN : Self-Join (Hierarchy)
🎯 Core skill: Joining a table to itself
🔹 PATTERN : Self-Join (Hierarchy)
📘 Assumed Table: employees
-
manager_id→ referencesemp_id -
CEO has
manager_id = NULL
1️⃣ Employee → Manager Mapping
✅ SQL – Basic
📌 Basic self-join concept
✅ SQL – Advanced (Include CEO)
✅ PySpark – Basic
✅ PySpark – Advanced (Null-safe)
2️⃣ Find Employees Reporting to Same Manager
✅ SQL – Basic
📌 Prevents self-matching & duplicates
✅ PySpark – Basic
3️⃣ Employee → Manager → Grand-Manager
✅ SQL – Advanced (2-level hierarchy)
📌 Very common interview question
✅ PySpark – Advanced
4️⃣ Employees Whose Manager Earns Less Than Them
✅ SQL – Advanced
✅ PySpark
5️⃣ Managers With More Than N Direct Reports
✅ SQL – Advanced
✅ PySpark – Optimized
6️⃣ Detect Circular Reporting (A → B → A)
✅ SQL – Advanced (Data Quality)
📌 Very senior-level question
✅ PySpark
7️⃣ Employees Reporting Outside Their Department
✅ SQL
✅ PySpark
🔥 PySpark Performance Tips (Interview Gold)
-
Self-joins are expensive
-
Always:
-
Select only required columns
-
Use broadcast when one side is small
-
-
Avoid repeated self-joins → cache intermediate results
-
For deep hierarchies → recursive CTE (SQL) or GraphFrames (Spark)
🧠 Common Interview Traps
| Trap | Fix |
|---|---|
| Self-join duplicates | Use < or != |
| CEO missing | LEFT JOIN |
| Circular loops | Detect via multi-joins |
| Performance | Pre-filter + broadcast |
Comments
Post a Comment