15-Day PySpark for Data Engineering Master Guide-v2

 

Apache Spark Fundamentals

Why Spark exists & core concepts
🔥

What is Spark

A distributed data processing engine for massive datasets, supporting batch, streaming, SQL, ML & Graph processing.

in-memory
🐢→🚀

Why Spark

Replaces slow MapReduce with in-memory computation, complex ETL pipelines, and real-time distributed analytics.

speed
🕸️

Lazy Evaluation

Spark does NOT execute immediately. Execution only triggers on Actions — this allows DAG optimization and reduced cost.

⭐ critical
📊

DAG

Directed Acyclic Graph — Spark's execution plan. Visualizes Stages, Tasks, Lineage, and the optimization flow.

internals
🏗️

Spark Architecture

Driver → Cluster Manager → Executors
DRIVERSparkSessionDAG SchedulerTask DispatchersubmitCLUSTERMANAGERYARNKubernetesStandaloneResourcesEXECUTOR 1Task · Task · TaskCache / MemoryPartitionsEXECUTOR 2Task · Task · TaskCache / MemoryPartitionsSTORAGEHDFSS3 / ADLSDelta LakeParquetcontrolsallocatesexecutespersists
JOB → STAGE → TASK HIERARCHYJOBAction triggeredone per actionsplit byshuffleSTAGEWide transforms= new stagesplit bypartitionTASK1 per partitionruns on executorTRANSFORMS▸ Narrow: filter, map▸ Wide: groupBy, join
🔄

Enterprise ETL Architecture

Bronze → Silver → Gold Medallion Pattern
📡
Source
API / Kafka CSV / DB
🪙
Bronze
Raw landing layer
🔍
Silver
Cleansed validated
🏆
Gold
Aggregated KPIs
📊
Report
Power BI analytics
💻

Core PySpark Patterns

Production-level code examples to memorize
# SparkSession — entry point to everything
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, broadcast

spark = SparkSession.builder \
    .appName("SalesETL") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read with explicit schema — NEVER infer in production
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("revenue",     DoubleType(), True),
])
df = spark.read.schema(schema).csv("s3://bucket/sales/*.csv", header=True)

# Transformation chain — lazy, builds DAG only
result = df \
    .filter(col("revenue") > 0) \
    .withColumn("tier", when(col("revenue") > 10000, "Gold").otherwise("Standard")) \
    .groupBy("tier").agg({"revenue": "sum"})

# Write partitioned Parquet — columnar, compressed, fast
result.write.mode("overwrite") \
    .partitionBy("tier") \
    .parquet("s3://bucket/gold/sales/")    # ACTION — triggers DAG
🗂️

Partitioning Strategy

repartition vs coalesce — critical interview topic
REPARTITION vs COALESCErepartition(n)P1P2P3P4⚠ FULL SHUFFLEP1P2P3Increase OR decrease — use for skew fixcoalesce(n)P1P2P3P4✓ NO SHUFFLEP1+P2P3+P4Only reduce — efficient, avoids shuffle
📅

15-Day Execution Plan

Structured daily learning with clear milestones
📘 Week 1 —PySpark Foundation
DAY 1

Spark Basics & Architecture

Spark vs HadoopDistributed ProcessingDAG ⭐Lazy Evaluation ⭐
DAY 2

SparkSession, DataFrames & Schemas

SparkSessionStructType ⭐Explicit SchemaInferSchema pitfalls
DAY 3

Reading & Writing Files

CSV / JSONParquet ⭐ORCSave modesCompression
DAY 4

Transformations & Aggregations

filter / wherewithColumn ⭐groupBy + aggdistinctorderBy
DAY 5

Joins & Window Functions

Broadcast Join ⭐Inner / Left / Fullrow_number ⭐rank / lag / lead
DAY 6

Actions, Built-in Functions & UDFs

collect / countpyspark.sql.functions ⭐Python UDFPandas UDF
DAY 7

Mini ETL Project 🏗️

End-to-end pipelineSales CSV ingestionAggregation outputParquet write
🚀 Week 2 —Advanced PySpark
DAY 8

Partitioning Deep Dive

partitionBy ⭐repartition vs coalesce ⭐Parallelism tuning
DAY 9

Cache, Persist & Memory Management

cache() ⭐MEMORY_AND_DISKGarbage collectionOOM avoidance
DAY 10

Performance Optimization

AQE ⭐Predicate pushdown ⭐Skew handlingSalting
DAY 11

Incremental Loads & CDC

Watermarking ⭐Timestamp filteringChange Data Capture
DAY 12

Delta Lake & SCD

Delta MERGE ⭐SCD Type 2 ⭐ACID transactionsTime travelVacuum
DAY 13

Structured Streaming Basics

Kafka pipelinesWatermarksTrigger intervalsReal-time fraud detection
DAY 14

Mid-Level Projects 🏗️

JSON API PipelineCustomer AnalyticsCDC IncrementalCall Centre KPIs
DAY 15

🎯 Mock Interview + Full Revision

Architecture questionsOptimization scenariosProduction debuggingDesign questions
⚙️

Production Best Practices

What senior engineers always do
🚫

Never use collect() on large datasets

Brings all data to driver memory. Use show(), take(n), or write to storage instead. Causes OOM crashes in production.

🔢

Prefer built-in functions over Python UDFs

Python UDFs break JVM optimization, add serialization overhead, and disable Catalyst optimizations. Always reach for pyspark.sql.functions first.

📐

Always define explicit schemas

inferSchema reads the entire dataset just to infer types. Define StructType explicitly for production pipelines — faster startup, no surprises with null columns.

📡

Use broadcast joins for small tables

When joining a large table with a small lookup table (<10MB), use broadcast(). Eliminates shuffle entirely and is the single biggest join optimization.

📁

Use Parquet or Delta for all storage

Columnar format enables predicate pushdown and partition pruning. Delta adds ACID, schema evolution, and time travel on top of Parquet.

🔀

Handle data skew with salting

When one partition has 10x more data than others, add a random salt key to distribute load. Then aggregate twice: once with salt, once without.

🎯

Interview Questions

Graded by experience level — know all three tiers
● Basic
What is the difference between repartition() and coalesce()?
repartition() does a full shuffle and can increase or decrease partitions. coalesce() only reduces partitions without a full shuffle by merging existing ones — far more efficient for reducing partition count.
● Basic
What is lazy evaluation in Spark?
Transformations (like filter, map, join) only build the execution plan (DAG). They never run until an Action (like collect, count, write) is called. This allows Catalyst to optimize the entire plan before executing.
● Intermediate
What causes a shuffle, and why is it expensive?
Wide transformations like groupBy, join, and orderBy require data from multiple partitions to be combined. Spark writes intermediate data to disk and redistributes it across executors — this network + disk I/O is the bottleneck.
● Intermediate
How does Adaptive Query Execution (AQE) work?
AQE re-optimizes the query plan at runtime using actual execution statistics. It can automatically coalesce small shuffle partitions, switch to broadcast joins when a table turns out to be small, and handle skew joins dynamically.
● Advanced
Design an SCD Type 2 pipeline in PySpark/Delta Lake.
Read incoming records, join with existing Delta table on business key. For matches, close the old record by setting end_date = current_date and is_active = false. Insert new record with new surrogate key, start_date, and is_active = true. Use Delta MERGE for atomic upserts and to maintain full history.
● Advanced
How do you optimize a slow Spark job in production?
Check Spark UI for skewed stages and long-tail tasks. Enable AQE. Examine shuffle read/write size — reduce with partition pruning or broadcast joins. Review executor memory and GC logs. Check if caching a repeated DataFrame would help. Profile DAG for unnecessary wide transformations.

Senior Engineer Mindset

What interviewers expect at 4–10 years
Distributed processing thinking
ETL architecture design
Performance-first mindset
Partitioning strategies
Delta Lake & ACID
Incremental / CDC patterns
Scalability by design
Production debugging
Streaming architectures
Spark UI profiling
Memory optimization
PySpark + SQL integration

Comments

Popular posts from this blog

SCD TYPE 2 – INTERVIEW QUESTIONS + MERGE CODE

TIME-SERIES SQL

TIME-BASED SQL QUERIES