Welcome to my personal blog. I’m a Data Engineer with hands-on expertise in building robust, scalable, and efficient data pipelines using tools like SQL Server, Azure Data Factory, Databricks, PySpark, and Power BI. This space is where I share technical insights, real-world projects, and practical guides from my journey in the world of data engineering.
Explore my blog posts, projects, and resources to learn more about modern data architecture and analytics.
15-Day PySpark for Data Engineering Master Guide-v2
Get link
Facebook
X
Pinterest
Email
Other Apps
-
Apache Spark Fundamentals
Why Spark exists & core concepts
🔥
What is Spark
A distributed data processing engine for massive datasets, supporting batch, streaming, SQL, ML & Graph processing.
in-memory
🐢→🚀
Why Spark
Replaces slow MapReduce with in-memory computation, complex ETL pipelines, and real-time distributed analytics.
speed
🕸️
Lazy Evaluation
Spark does NOT execute immediately. Execution only triggers on Actions — this allows DAG optimization and reduced cost.
⭐ critical
📊
DAG
Directed Acyclic Graph — Spark's execution plan. Visualizes Stages, Tasks, Lineage, and the optimization flow.
internals
🏗️
Spark Architecture
Driver → Cluster Manager → Executors
🔄
Enterprise ETL Architecture
Bronze → Silver → Gold Medallion Pattern
📡
Source
API / Kafka CSV / DB
🪙
Bronze
Raw landing layer
🔍
Silver
Cleansed validated
🏆
Gold
Aggregated KPIs
📊
Report
Power BI analytics
💻
Core PySpark Patterns
Production-level code examples to memorize
# SparkSession — entry point to everythingfrom pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, broadcast
spark = SparkSession.builder \
.appName("SalesETL") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
# Read with explicit schema — NEVER infer in productionfrom pyspark.sql.types import StructType, StructField, StringType, DoubleType
schema = StructType([
StructField("customer_id", StringType(), True),
StructField("revenue", DoubleType(), True),
])
df = spark.read.schema(schema).csv("s3://bucket/sales/*.csv", header=True)
# Transformation chain — lazy, builds DAG only
result = df \
.filter(col("revenue") > 0) \
.withColumn("tier", when(col("revenue") > 10000, "Gold").otherwise("Standard")) \
.groupBy("tier").agg({"revenue": "sum"})
# Write partitioned Parquet — columnar, compressed, fast
result.write.mode("overwrite") \
.partitionBy("tier") \
.parquet("s3://bucket/gold/sales/") # ACTION — triggers DAG
🗂️
Partitioning Strategy
repartition vs coalesce — critical interview topic
📅
15-Day Execution Plan
Structured daily learning with clear milestones
📘 Week 1 —PySpark Foundation
DAY 1
Spark Basics & Architecture
Spark vs HadoopDistributed ProcessingDAG ⭐Lazy Evaluation ⭐
Brings all data to driver memory. Use show(), take(n), or write to storage instead. Causes OOM crashes in production.
🔢
Prefer built-in functions over Python UDFs
Python UDFs break JVM optimization, add serialization overhead, and disable Catalyst optimizations. Always reach for pyspark.sql.functions first.
📐
Always define explicit schemas
inferSchema reads the entire dataset just to infer types. Define StructType explicitly for production pipelines — faster startup, no surprises with null columns.
📡
Use broadcast joins for small tables
When joining a large table with a small lookup table (<10MB), use broadcast(). Eliminates shuffle entirely and is the single biggest join optimization.
📁
Use Parquet or Delta for all storage
Columnar format enables predicate pushdown and partition pruning. Delta adds ACID, schema evolution, and time travel on top of Parquet.
🔀
Handle data skew with salting
When one partition has 10x more data than others, add a random salt key to distribute load. Then aggregate twice: once with salt, once without.
🎯
Interview Questions
Graded by experience level — know all three tiers
● Basic
What is the difference between repartition() and coalesce()?
repartition() does a full shuffle and can increase or decrease partitions. coalesce() only reduces partitions without a full shuffle by merging existing ones — far more efficient for reducing partition count.
● Basic
What is lazy evaluation in Spark?
Transformations (like filter, map, join) only build the execution plan (DAG). They never run until an Action (like collect, count, write) is called. This allows Catalyst to optimize the entire plan before executing.
● Intermediate
What causes a shuffle, and why is it expensive?
Wide transformations like groupBy, join, and orderBy require data from multiple partitions to be combined. Spark writes intermediate data to disk and redistributes it across executors — this network + disk I/O is the bottleneck.
● Intermediate
How does Adaptive Query Execution (AQE) work?
AQE re-optimizes the query plan at runtime using actual execution statistics. It can automatically coalesce small shuffle partitions, switch to broadcast joins when a table turns out to be small, and handle skew joins dynamically.
● Advanced
Design an SCD Type 2 pipeline in PySpark/Delta Lake.
Read incoming records, join with existing Delta table on business key. For matches, close the old record by setting end_date = current_date and is_active = false. Insert new record with new surrogate key, start_date, and is_active = true. Use Delta MERGE for atomic upserts and to maintain full history.
● Advanced
How do you optimize a slow Spark job in production?
Check Spark UI for skewed stages and long-tail tasks. Enable AQE. Examine shuffle read/write size — reduce with partition pruning or broadcast joins. Review executor memory and GC logs. Check if caching a repeated DataFrame would help. Profile DAG for unnecessary wide transformations.
Below is a COMPLETE, interview-ready guide to SCD Type 2 using MERGE , covering ALL POSSIBLE QUESTIONS (theory + edge cases) AND production-grade SQL code . This is curated for senior data engineer / Azure / Databricks / SQL Server interviews. 🔹 SCD TYPE 2 – INTERVIEW QUESTIONS + MERGE CODE 1️⃣ What is SCD Type 2? Answer: SCD Type 2 maintains full historical changes by: Expiring old records Inserting a new row for every change Typical columns: effective_start_date effective_end_date is_current version (optional) 2️⃣ SCD Type 2 Table Design (Interview MUST) 🎯 Dimension Table CREATE TABLE dim_customer ( customer_sk INT IDENTITY ( 1 , 1 ), customer_id INT , name VARCHAR ( 100 ), city VARCHAR ( 100 ), effective_start_date DATE , effective_end_date DATE , is_current CHAR ( 1 ) ); 🎯 Staging Table CREATE TABLE stg_customer ( customer_id INT , name VARCHAR ( 100 ), city VARCHAR ( 100 ), load_date ...
🔥 TIME-SERIES SQL (MSSQL SERVER ONLY) 📌 Sample Table (Used in All Queries) CREATE TABLE transactions ( user_id INT , txn_time DATETIME, amount INT ); 🟢 SIMPLE LEVEL 1️⃣ Running Total (Cumulative Sum) SELECT user_id, txn_time, amount, SUM(amount) OVER ( PARTITION BY user_id ORDER BY txn_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total FROM transactions; 2️⃣ Row-Based Moving Average (Last 3 Records) SELECT user_id, txn_time, amount, AVG(amount) OVER ( PARTITION BY user_id ORDER BY txn_time ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS moving_avg_3 FROM transactions; 3️⃣ Time Difference Between Events SELECT user_id, txn_time, LAG(txn_time) OVER ( PARTITION BY user_id ORDER BY txn_time ) AS prev_time, DATEDIFF( MINUTE , LAG(txn_time) OVER (PARTITION BY user_id ORDER ...
⏱️ TIME-BASED SQL QUERIES (CODING + LOGIC) 1️⃣ Find Users with N Transactions in X Minutes 🔥 MOST ASKED QUESTION Table: transactions(user_id, txn_id, amount, txn_time) ✅ Requirement Find users with ≥ 3 transactions in 10 minutes SELECT DISTINCT t1.user_id FROM transactions t1 JOIN transactions t2 ON t1.user_id = t2.user_id AND t2.txn_time BETWEEN t1.txn_time AND DATEADD( MINUTE , 10 , t1.txn_time) GROUP BY t1.user_id, t1.txn_time HAVING COUNT ( * ) >= 3 ; 🗣 Explain “This uses a rolling time window using a self-join. Very common in fraud detection.” 2️⃣ Consecutive Days / Events 🔥 Login, sales, activity streak questions Table: logins(user_id, login_date) ✅ Find users with 3 consecutive login days WITH cte AS ( SELECT user_id, login_date, DATEADD( DAY , - ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY login_date ), login_date ) grp FR...
Comments
Post a Comment