PySpark Interview Questions with Detailed Answers
Spark Internals & Execution
1. Explain the full lifecycle of a Spark job — from code to DAG to task execution.
Definition:
The Spark job lifecycle begins when the driver program initiates an action (e.g., .collect(), .count()). Here's a step-by-step breakdown:
User Code: You write transformations and actions in PySpark.
Logical Plan: Spark creates a logical plan based on your transformations.
Optimization: Catalyst optimizer optimizes the logical plan.
Physical Plan: Spark creates a physical plan with stages.
DAG (Directed Acyclic Graph): Stages are translated into a DAG of tasks.
Task Scheduling: Tasks are distributed across executors.
Execution: Executors run the tasks in parallel.
Healthcare Example: In processing electronic health records (EHR), you may filter patients by diagnosis, join with lab results, and then aggregate. Spark translates this pipeline into optimized stages and tasks.
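As a rough sketch of where these steps appear in code, the hypothetical EHR pipeline below (paths and column names are illustrative) builds only a logical plan until the final action, and .explain(True) prints the optimized logical and physical plans:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ehr-pipeline").getOrCreate()

# Hypothetical inputs: patient diagnoses and lab results (paths are illustrative)
diagnoses = spark.read.parquet("/data/ehr/diagnoses")
labs = spark.read.parquet("/data/ehr/lab_results")

# Transformations only build a logical plan; nothing executes yet
pipeline = (
    diagnoses.filter(F.col("diagnosis_code") == "E11")       # filter patients by diagnosis
             .join(labs, "patient_id")                        # join with lab results
             .groupBy("patient_id")
             .agg(F.avg("result_value").alias("avg_result"))  # aggregate
)

pipeline.explain(True)  # parsed, analyzed, optimized logical plans + physical plan
pipeline.count()        # the action: triggers DAG scheduling and task execution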
2. What is the difference between transformations and actions? Why are transformations lazy?
Definition:
Transformations (e.g., map, filter, join) are lazy operations that define a computation.
Actions (e.g., collect, count, saveAsTable) trigger the execution of transformations.
Why Lazy? Laziness allows Spark to optimize the DAG before execution, combining transformations into efficient tasks.
Healthcare Example:
When filtering patients over age 60 and joining with prescriptions, Spark waits until you call .count() or .show() to execute.
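A minimal sketch of this laziness, assuming hypothetical patients and prescriptions DataFrames:
from pyspark.sql import functions as F

seniors = patients.filter(F.col("age") > 60)        # transformation: lazy, nothing runs yet
joined = seniors.join(prescriptions, "patient_id")  # transformation: still lazy
joined.count()                                      # action: Spark now optimizes and executes the whole chain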
3. What happens under the hood when you call .collect() on a huge dataset?
Definition:
.collect() brings the entire dataset from the executors to the driver.
This can cause an OutOfMemoryError if the data is too large.
Spark executes all transformations, gathers results, and sends them to the driver.
Healthcare Example:
Running .collect() on 1 billion patient lab records will crash your driver. Use .take(n) or write to storage instead.
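A quick sketch of those safer alternatives, assuming a hypothetical lab_records DataFrame and output path:
sample_rows = lab_records.take(100)  # pulls only 100 rows to the driver
lab_records.write.mode("overwrite").parquet("/mnt/output/lab_records")  # results stay distributed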
Memory & Partition Tuning
4. How do you identify and fix data skew in PySpark joins?
Definition: Data skew happens when certain keys (e.g., ZIP code 99999) have far more records, leading to uneven partition sizes.
Fixes:
Use salting (add random prefix to keys).
Use broadcast joins if one side is small.
Repartition using balanced keys.
Healthcare Example:
Joining a patient table with an insurance claims table on patient_id, where a few patients have millions of claims, causes skew. Add a salt column to distribute the join across partitions.
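Here is a minimal salting sketch for that join, assuming hypothetical claims and patients DataFrames and an illustrative salt count:
from pyspark.sql import functions as F

NUM_SALTS = 10  # illustrative; size to the observed skew

# Salt the large, skewed side with a random bucket per row
claims_salted = claims.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the other side across every salt value
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
patients_salted = patients.crossJoin(salts)

# Join on (patient_id, salt) so hot patient_ids are spread over many partitions
joined = claims_salted.join(patients_salted, ["patient_id", "salt"]).drop("salt")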
5. What is the impact of incorrect partitioning on performance?
Definition: Too few partitions → underutilized CPU. Too many → overhead from task scheduling.
Symptoms:
Long shuffle times
Uneven task execution
Best Practice:
Tune using repartition() or coalesce().
Monitor using the Spark UI.
Healthcare Example:
When aggregating hospital data across states, partitioning by state gives balanced load. Partitioning by zip_code may cause imbalance.
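For example (hypothetical hospital_df; partition counts are illustrative):
# Repartition by a balanced key before a wide aggregation
balanced = hospital_df.repartition("state")

# Coalesce before writing so you don't produce thousands of tiny files
balanced.coalesce(8).write.mode("overwrite").parquet("/mnt/output/hospital_agg")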
6. When would you increase spark.sql.shuffle.partitions, and what are the risks?
Definition:
spark.sql.shuffle.partitions controls how many output partitions are created after shuffles.
Use Case:
Increase if data volume is huge to improve parallelism.
Decrease to reduce small file creation and scheduling overhead.
Risk: Too many partitions can overwhelm the cluster with tasks.
Healthcare Example: Running groupBy on 100 million prescriptions: increase partitions to 1000 for better performance.
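A minimal sketch, assuming a hypothetical prescriptions DataFrame (the partition count is illustrative):
spark.conf.set("spark.sql.shuffle.partitions", 1000)  # raise shuffle parallelism for this workload

monthly_counts = prescriptions.groupBy("drug_code", "fill_month").count()
monthly_counts.write.mode("overwrite").parquet("/mnt/output/monthly_counts")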
7. What is the role of broadcast joins? When should you avoid them?
Definition: Broadcast joins send the smaller dataset to all executors to avoid shuffle.
Use When:
The smaller table fits comfortably in memory (the default autoBroadcastJoinThreshold is 10 MB)
The join key repeats across many rows of the large table
Avoid When:
Large broadcast data
OOM errors
Healthcare Example: Joining a large patient table with a 5 MB hospital location dimension table is ideal for a broadcast join.
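A sketch of that join, assuming hypothetical patients and hospital_locations DataFrames:
from pyspark.sql.functions import broadcast

# The broadcast hint ships the small dimension table to every executor, avoiding a shuffle
enriched = patients.join(broadcast(hospital_locations), "hospital_id")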
Window Functions & Complex ETL
8. Use a PySpark window function to get the top 3 orders per customer.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc
window = Window.partitionBy("customer_id").orderBy(desc("order_amount"))
df_top3 = df.withColumn("rank", row_number().over(window)).filter("rank <= 3")
9. How would you implement SCD Type 2 with Delta Lake + PySpark?
Definition: Slowly Changing Dimension (SCD) Type 2 tracks changes over time by preserving history.
Steps:
Mark current records with is_current = true.
Compare new vs existing records.
Set old records to is_current = false; insert new records with is_current = true and timestamps.
Use Delta MERGE for atomicity.
Healthcare Example: Track change in patient address while preserving historical location for audit.
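Below is a simplified two-step sketch of this pattern, assuming a hypothetical Delta table at /mnt/delta/dim_patient with address, is_current, start_date, and end_date columns, and an updates_df of incoming records. A production job would also filter updates_df down to rows that are new or actually changed before Step 2.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forPath(spark, "/mnt/delta/dim_patient")

# Step 1: expire the current version of any patient whose address changed
(target.alias("t")
    .merge(updates_df.alias("s"), "t.patient_id = s.patient_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the incoming records as the new current versions
new_rows = (updates_df
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date")))

new_rows.write.format("delta").mode("append").save("/mnt/delta/dim_patient")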
10. How do you remove duplicates from a dataset based on composite keys?
df_deduped = df.dropDuplicates(["patient_id", "visit_date"])
Delta Lake + Merge Logic
11. What happens if two users try to merge into the same Delta table at the same time?
Definition: Delta uses optimistic concurrency control. One merge may fail with a conflict.
Solution:
Retry the failed merge (see the retry sketch below).
Reduce conflict likelihood by partitioning the table and narrowing the merge condition so concurrent writers touch disjoint data.
OPTIMIZE and VACUUM keep the file layout healthy, but they do not resolve write conflicts on their own.
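A minimal retry sketch, assuming the delta-spark Python package (which, in recent releases, surfaces write conflicts under delta.exceptions) and a run_merge callable that wraps your merge:
import time
from delta.exceptions import DeltaConcurrentModificationException  # base conflict error in delta-spark

def merge_with_retries(run_merge, max_attempts=3, backoff_seconds=5):
    for attempt in range(1, max_attempts + 1):
        try:
            run_merge()          # the merge itself, wrapped in a function
            return
        except DeltaConcurrentModificationException:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # back off, then re-read and retry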
12. How do you perform an upsert using PySpark on a Delta table?
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/mnt/data/patient_data")
delta_table.alias("target").merge(
source_df.alias("source"),
"target.patient_id = source.patient_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
13. How do you handle schema evolution in Delta tables programmatically?
Definition: Schema evolution lets Delta add new columns (and make certain safe type changes) during a write instead of failing on a schema mismatch.
df.write.format("delta").option("mergeSchema", "true").mode("overwrite").save("/mnt/data")
Structured Streaming + Watermarking
14. Explain exactly-once semantics in Spark Structured Streaming.
Definition: Spark guarantees that each input record is processed once and only once using:
Checkpointing
Idempotent sinks (like Delta Lake)
Write-ahead logs
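A minimal sketch tying these together, assuming a hypothetical events_df streaming DataFrame and illustrative paths:
# Checkpointing + an idempotent Delta sink is what gives end-to-end exactly-once
query = (events_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/events")  # tracks offsets and state
    .start("/mnt/delta/events"))                              # Delta's transaction log makes replayed batches idempotent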
15. Late data is arriving after the window closes — how do you handle it with watermarks?
Definition: Watermarks tell Spark how late data is allowed to arrive. Events older than the watermark threshold can be dropped, and their window state is discarded.
df.withWatermark("event_time", "15 minutes")
16. How would you join a real-time stream with a dimension table in PySpark?
Solution: Use broadcast joins with static dimension tables.
from pyspark.sql.functions import broadcast
stream_df.join(broadcast(dim_df), "hospital_id")
Debugging + Monitoring + Failure Handling
17. Your Spark job ran successfully but processed 0 records — how do you debug that?
Checklist:
Check filters, joins, and input paths
Use .explain() to inspect the plan (see the snippet below)
Use the Spark UI to view input size per stage
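A few quick checks in code (DataFrame names and paths are illustrative):
print(spark.read.parquet("/mnt/input/claims").count())  # is the input actually non-empty?
filtered_df.explain(True)                                # are filters or joins pruning everything away?
print(filtered_df.count())                               # count after each suspect filter or join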
18. Explain speculative execution in Spark — when and how do you enable it?
Definition: Spark runs duplicate copies of slow tasks to avoid stragglers.
spark.conf.set("spark.speculation", "true")  # typically set at cluster/session startup
Use Case:
Uneven cluster performance
19. How do you use Spark UI to find slow stages or skewed tasks?
Tips:
Go to Stages tab
Look for stages with long durations or skewed task times
Check Shuffle Read/Write size
Security, CI/CD, and Production Readiness
20. How do you manage secrets securely in a PySpark job running in Databricks or EMR?
Best Practices:
Use Databricks Secrets or AWS Secrets Manager
Never hardcode credentials
Access via environment variables or scope
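Two common patterns, sketched with illustrative scope and secret names (dbutils is only available on Databricks; the boto3 call assumes AWS credentials are already configured):
# Databricks: read a credential from a secret scope
db_password = dbutils.secrets.get(scope="ehr-prod", key="warehouse-password")

# EMR / plain Python: fetch a secret from AWS Secrets Manager
import json
import boto3

response = boto3.client("secretsmanager").get_secret_value(SecretId="ehr/warehouse")
secret = json.loads(response["SecretString"])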
21. Describe your CI/CD pipeline for PySpark using GitHub Actions or Azure DevOps.
Typical Flow:
Code commit triggers build
Linting + unit tests run
PySpark scripts deployed to test
Automated validation
Deploy to production (Databricks, EMR)
22. How do you handle cluster resource allocation (executors, memory) for large jobs?
Tuning Tips:
Set --num-executors, --executor-memory, and --executor-cores (or the equivalent spark.executor.* configs)
Monitor usage in the Spark UI
Avoid memory-heavy operations like .collect()
Healthcare Example: For running monthly EHR analysis across hospitals, allocate more memory and cores based on data volume.
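A sizing sketch with illustrative values; on Databricks you would typically size the cluster itself rather than set these directly:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("monthly-ehr-analysis")
    .config("spark.executor.instances", "20")  # number of executors
    .config("spark.executor.memory", "16g")    # memory per executor
    .config("spark.executor.cores", "4")       # cores per executor
    .getOrCreate())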