15-Day PySpark for Data Engineering Master Guide-v2

Apache Spark Fundamentals

Why Spark exists & core concepts

🔥

What is Spark

A distributed data processing engine for massive datasets, supporting batch, streaming, SQL, ML & Graph processing.

in-memory

🐢→🚀

Why Spark

Replaces slow MapReduce with in-memory computation, complex ETL pipelines, and real-time distributed analytics.

speed

🕸️

Lazy Evaluation

Spark does NOT execute immediately. Execution only triggers on Actions — this allows DAG optimization and reduced cost.

⭐ critical

📊

DAG

Directed Acyclic Graph — Spark's execution plan. Visualizes Stages, Tasks, Lineage, and the optimization flow.

internals

🏗️

Spark Architecture

Driver → Cluster Manager → Executors

🔄

Enterprise ETL Architecture

Bronze → Silver → Gold Medallion Pattern

📡

Source

API / Kafka CSV / DB

🪙

Bronze

Raw landing layer

🔍

Silver

Cleansed validated

🏆

Gold

Aggregated KPIs

📊

Report

Power BI analytics

💻

Core PySpark Patterns

Production-level code examples to memorize

# SparkSession — entry point to everything
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, broadcast

spark = SparkSession.builder \
    .appName("SalesETL") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read with explicit schema — NEVER infer in production
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("revenue",     DoubleType(), True),
])
df = spark.read.schema(schema).csv("s3://bucket/sales/*.csv", header=True)

# Transformation chain — lazy, builds DAG only
result = df \
    .filter(col("revenue") > 0) \
    .withColumn("tier", when(col("revenue") > 10000, "Gold").otherwise("Standard")) \
    .groupBy("tier").agg({"revenue": "sum"})

# Write partitioned Parquet — columnar, compressed, fast
result.write.mode("overwrite") \
    .partitionBy("tier") \
    .parquet("s3://bucket/gold/sales/")    # ACTION — triggers DAG

🗂️

Partitioning Strategy

repartition vs coalesce — critical interview topic

📅

15-Day Execution Plan

Structured daily learning with clear milestones

📘 Week 1 —PySpark Foundation

DAY 1

Spark Basics & Architecture

Spark vs HadoopDistributed ProcessingDAG ⭐Lazy Evaluation ⭐

DAY 2

SparkSession, DataFrames & Schemas

SparkSessionStructType ⭐Explicit SchemaInferSchema pitfalls

DAY 3

Reading & Writing Files

CSV / JSONParquet ⭐ORCSave modesCompression

DAY 4

Transformations & Aggregations

filter / wherewithColumn ⭐groupBy + aggdistinctorderBy

DAY 5

Joins & Window Functions

Broadcast Join ⭐Inner / Left / Fullrow_number ⭐rank / lag / lead

DAY 6

Actions, Built-in Functions & UDFs

collect / countpyspark.sql.functions ⭐Python UDFPandas UDF

DAY 7

Mini ETL Project 🏗️

End-to-end pipelineSales CSV ingestionAggregation outputParquet write

🚀 Week 2 —Advanced PySpark

DAY 8

Partitioning Deep Dive

partitionBy ⭐repartition vs coalesce ⭐Parallelism tuning

DAY 9

Cache, Persist & Memory Management

cache() ⭐MEMORY_AND_DISKGarbage collectionOOM avoidance

DAY 10

Performance Optimization

AQE ⭐Predicate pushdown ⭐Skew handlingSalting

DAY 11

Incremental Loads & CDC

Watermarking ⭐Timestamp filteringChange Data Capture

DAY 12

Delta Lake & SCD

Delta MERGE ⭐SCD Type 2 ⭐ACID transactionsTime travelVacuum

DAY 13

Structured Streaming Basics

Kafka pipelinesWatermarksTrigger intervalsReal-time fraud detection

DAY 14

Mid-Level Projects 🏗️

JSON API PipelineCustomer AnalyticsCDC IncrementalCall Centre KPIs

DAY 15

🎯 Mock Interview + Full Revision

Architecture questionsOptimization scenariosProduction debuggingDesign questions

⚙️

Production Best Practices

What senior engineers always do

🚫

Never use collect() on large datasets

Brings all data to driver memory. Use show(), take(n), or write to storage instead. Causes OOM crashes in production.

🔢

Prefer built-in functions over Python UDFs

Python UDFs break JVM optimization, add serialization overhead, and disable Catalyst optimizations. Always reach for pyspark.sql.functions first.

📐

Always define explicit schemas

inferSchema reads the entire dataset just to infer types. Define StructType explicitly for production pipelines — faster startup, no surprises with null columns.

📡

Use broadcast joins for small tables

When joining a large table with a small lookup table (<10MB), use broadcast(). Eliminates shuffle entirely and is the single biggest join optimization.

📁

Use Parquet or Delta for all storage

Columnar format enables predicate pushdown and partition pruning. Delta adds ACID, schema evolution, and time travel on top of Parquet.

🔀

Handle data skew with salting

When one partition has 10x more data than others, add a random salt key to distribute load. Then aggregate twice: once with salt, once without.

🎯

Interview Questions

Graded by experience level — know all three tiers

● Basic

What is the difference between repartition() and coalesce()?

repartition() does a full shuffle and can increase or decrease partitions. coalesce() only reduces partitions without a full shuffle by merging existing ones — far more efficient for reducing partition count.

● Basic

What is lazy evaluation in Spark?

Transformations (like filter, map, join) only build the execution plan (DAG). They never run until an Action (like collect, count, write) is called. This allows Catalyst to optimize the entire plan before executing.

● Intermediate

What causes a shuffle, and why is it expensive?

Wide transformations like groupBy, join, and orderBy require data from multiple partitions to be combined. Spark writes intermediate data to disk and redistributes it across executors — this network + disk I/O is the bottleneck.

● Intermediate

How does Adaptive Query Execution (AQE) work?

AQE re-optimizes the query plan at runtime using actual execution statistics. It can automatically coalesce small shuffle partitions, switch to broadcast joins when a table turns out to be small, and handle skew joins dynamically.

● Advanced

Design an SCD Type 2 pipeline in PySpark/Delta Lake.

Read incoming records, join with existing Delta table on business key. For matches, close the old record by setting end_date = current_date and is_active = false. Insert new record with new surrogate key, start_date, and is_active = true. Use Delta MERGE for atomic upserts and to maintain full history.

● Advanced

How do you optimize a slow Spark job in production?

Check Spark UI for skewed stages and long-tail tasks. Enable AQE. Examine shuffle read/write size — reduce with partition pruning or broadcast joins. Review executor memory and GC logs. Check if caching a repeated DataFrame would help. Profile DAG for unnecessary wide transformations.

✅

Senior Engineer Mindset

What interviewers expect at 4–10 years

Distributed processing thinking

ETL architecture design

Performance-first mindset

Partitioning strategies

Delta Lake & ACID

Incremental / CDC patterns

Scalability by design

Production debugging

Streaming architectures

Spark UI profiling

Memory optimization

PySpark + SQL integration