PySpark Data Skew Handling – Complete Guide

🔴 1. Problem Statement: Skewed Aggregation

```python
df.groupBy("user_id").count()
```

❗ Issue:

- One `user_id` contains ~40% of the total data
- Spark sends the same key to the same partition
- Result: one task becomes extremely heavy
- Other tasks finish early
- Straggler problem → slow job

🧠 2. Why Skew Happens

Spark distributes data based on keys:

- `user_id = A` → goes to one partition
- `user_id = B` → another partition

If A accounts for 40% of the data, the partition holding A becomes huge → bottleneck.

🔍 3. How to Identify Skew (Practical Approach)

✅ Method 1: Distribution Check

```python
from pyspark.sql.functions import count

df.groupBy("user_id") \
  .agg(count("*").alias("cnt")) \
  .orderBy("cnt", ascending=False) \
  .show(10)
```

👉 Example output:

| user_id | cnt |
| --- | --- |
| A | 40,000,000 ← skew |
| B | 1,000 |
| C | 900 |

✅ Method 2: Percentile Analysis

```python
df.groupBy("user_id") \
  .count() \
  .selectExpr( ...
```
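The key-to-partition behavior behind the straggler problem can be simulated without a Spark cluster. The sketch below is plain Python, not PySpark; the key mix (one hot key "A" with 40% of rows) and the partition count of 8 are made-up illustration values mirroring the scenario above. It shows how hash partitioning pins every row of a hot key onto a single partition:

```python
# Hypothetical simulation (plain Python, not PySpark): shows how
# hash partitioning sends all rows with the same key to one
# partition, creating a straggler task.
from collections import Counter

NUM_PARTITIONS = 8

# Skewed dataset: key "A" holds 40% of all rows, like the
# user_id example above; the rest are unique keys.
keys = ["A"] * 40_000 + [f"user_{i}" for i in range(60_000)]

def partition_for(key: str) -> int:
    # Mimics hash-based partitioning: hash(key) mod numPartitions.
    return hash(key) % NUM_PARTITIONS

sizes = Counter(partition_for(k) for k in keys)
heaviest = max(sizes.values())

# The partition that receives key "A" carries at least 40% of all
# rows, while the remaining rows spread roughly evenly.
print(f"Heaviest partition holds {heaviest / len(keys):.0%} of all rows")
```

However many partitions you add, the hot key's partition never shrinks below 40% of the data, which is why simply increasing parallelism does not fix skew.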