DE Interview: Ranking & Ties

 

🔹 PATTERN 3: Ranking & Ties

📘 Assumed Table: employees

emp_id | emp_name | department | salary | join_date

1️⃣ Basic Ranking (No Partition)

✅ SQL – Basic

SELECT emp_id, emp_name, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num, RANK() OVER (ORDER BY salary DESC) AS rank_val, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_val FROM employees;

📌 Interview must-know

  • ROW_NUMBER → unique sequence

  • RANK → gaps after ties

  • DENSE_RANK → no gaps


✅ PySpark – Basic

from pyspark.sql import functions as F from pyspark.sql.window import Window w = Window.orderBy(F.col("salary").desc()) df = employees_df.select( "*", F.row_number().over(w).alias("row_num"), F.rank().over(w).alias("rank_val"), F.dense_rank().over(w).alias("dense_rank_val") )

2️⃣ Ranking Within Groups (Department-wise)

✅ SQL – Basic

SELECT emp_id, emp_name, department, salary, RANK() OVER ( PARTITION BY department ORDER BY salary DESC ) AS dept_rank FROM employees;

✅ PySpark – Basic

w = Window.partitionBy("department") \ .orderBy(F.col("salary").desc()) df = employees_df.withColumn( "dept_rank", F.rank().over(w) )

3️⃣ Top-N per Group (Very Common Interview Pattern)

✅ SQL – Basic (Top 3 salaries per department)

SELECT * FROM ( SELECT *, DENSE_RANK() OVER ( PARTITION BY department ORDER BY salary DESC ) AS dr FROM employees ) t WHERE dr <= 3;

📌 Use DENSE_RANK to include ties


✅ SQL – Advanced (Exact 3 rows per dept)

SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY department ORDER BY salary DESC ) AS rn FROM employees ) t WHERE rn <= 3;

📌 Use ROW_NUMBER for strict limits


✅ PySpark – Basic

w = Window.partitionBy("department") \ .orderBy(F.col("salary").desc()) df = employees_df.withColumn( "dr", F.dense_rank().over(w) ).filter("dr <= 3")

4️⃣ Find Employees With Same Rank (Ties Detection)

✅ SQL – Advanced

SELECT salary, COUNT(*) AS cnt FROM employees GROUP BY salary HAVING COUNT(*) > 1;

✅ SQL – Ranking-based

SELECT * FROM ( SELECT *, RANK() OVER (ORDER BY salary DESC) AS r FROM employees ) t WHERE r IN ( SELECT r FROM ( SELECT RANK() OVER (ORDER BY salary DESC) AS r FROM employees ) x GROUP BY r HAVING COUNT(*) > 1 );

✅ PySpark

df = employees_df.groupBy("salary") \ .count() \ .filter("count > 1")

5️⃣ Nth Highest Salary (Classic Trap Question)

✅ SQL – Advanced (3rd highest salary)

SELECT DISTINCT salary FROM ( SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS dr FROM employees ) t WHERE dr = 3;

📌 RANK ❌ (can skip)
📌 DENSE_RANK


✅ PySpark

df = employees_df.withColumn( "dr", F.dense_rank().over(Window.orderBy(F.col("salary").desc())) ).filter("dr = 3") \ .select("salary") \ .distinct()

6️⃣ Remove Duplicates Using ROW_NUMBER

✅ SQL – Advanced

DELETE FROM employees WHERE emp_id IN ( SELECT emp_id FROM ( SELECT emp_id, ROW_NUMBER() OVER ( PARTITION BY emp_name, department ORDER BY join_date ) AS rn FROM employees ) t WHERE rn > 1 );

📌 ROW_NUMBER is the only safe choice


✅ PySpark – Dedup

w = Window.partitionBy("emp_name", "department") \ .orderBy("join_date") df = employees_df.withColumn( "rn", F.row_number().over(w) ).filter("rn = 1")

7️⃣ Rank by Multiple Columns

✅ SQL – Advanced

SELECT *, RANK() OVER ( PARTITION BY department ORDER BY salary DESC, join_date ASC ) AS rank_val FROM employees;

✅ PySpark

w = Window.partitionBy("department") \ .orderBy(F.col("salary").desc(), F.col("join_date").asc()) df = employees_df.withColumn( "rank_val", F.rank().over(w) )

8️⃣ Performance & Architecture (Senior-Level)

🔥 Key Interview Talking Points

  • Ranking functions require shuffle

  • Always:

    • Partition carefully

    • Select only needed columns

  • Prefer ROW_NUMBER when ties don’t matter

  • Use DENSE_RANK for Top-N per group

  • Avoid multiple window scans:

df.select( "*", F.row_number().over(w).alias("rn"), F.rank().over(w).alias("r") )

🧠 Quick Decision Table (Interview Gold)

RequirementUse
Unique sequenceROW_NUMBER
Allow gapsRANK
No gapsDENSE_RANK
Top-N per groupDENSE_RANK
DeduplicationROW_NUMBER
Nth highestDENSE_RANK

Comments

Popular posts from this blog

SyBase Database Migration to SQL Server

Basics of US Healthcare -Medical Billing