Apache Spark Interview Q&A with Detailed Explanations and Examples
1. What is Lazy Evaluation in Spark?
Answer: Lazy Evaluation means Spark does not immediately compute the results of transformations (like map, filter). Instead, it builds a logical execution plan (DAG). Execution only happens when an action (like count(), collect(), saveAsTextFile()) is triggered.
Why it's useful: It allows Spark to optimize the entire data flow and pipeline before execution, reducing the number of passes over the data.
Example:
rdd = sc.textFile("data.txt")
rdd2 = rdd.map(lambda x: x.upper()) # Transformation, not executed yet
result = rdd2.collect()  # Action, triggers execution
2. How do you add a new column in Spark?
Answer: You can use withColumn() to add a new column based on an expression or existing columns.
Example:
df = df.withColumn("new_col", df["col1"] + df["col2"])
Explanation: This creates a new column new_col by adding the values of col1 and col2. It's non-destructive and returns a new DataFrame.
3. What is a Broadcast Join?
Answer: A broadcast join is a technique where a small dataset is broadcast to all the executor nodes. This avoids the need to shuffle the larger dataset, drastically improving performance.
When to use: When one of the datasets is small enough to fit in memory.
Example:
from pyspark.sql.functions import broadcast
large_df.join(broadcast(small_df), "id")
Explanation: Instead of shuffling both datasets, Spark broadcasts small_df to each executor and performs the join locally with large_df.
4. What is the role of Catalyst Optimizer?
Answer: Catalyst Optimizer is the query optimization engine in Spark SQL. It performs logical and physical optimization on queries to generate efficient execution plans.
Key features:
Constant folding
Predicate pushdown
Reordering of joins
Benefit: Speeds up SQL/DataFrame operations without manual tuning.
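You can inspect the plans Catalyst produces with explain(). A minimal sketch (the file and column names here are made up for illustration):
df = spark.read.parquet("events.parquet")
df.filter(df["status"] == "ok").select("id").explain(True)  # prints the parsed, analyzed, optimized, and physical plans
For a Parquet source like this, the physical plan will typically show the status filter pushed down into the file scan rather than applied as a separate step.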
5. What are Accumulators and Broadcast Variables?
Accumulators: Used for counters or sums across executors. Executors can only add to them; only the driver can read the accumulated value.
Broadcast Variables: Allow large read-only data to be cached on all nodes to avoid redundant transmission.
Example of Accumulator:
acc = sc.accumulator(0)
rdd.foreach(lambda x: acc.add(1))
print(acc.value)
Example of Broadcast Variable:
broadcast_var = sc.broadcast([1, 2, 3])
6. What makes Spark better than MapReduce?
Answer:
In-memory computation → faster than disk-based MapReduce
More concise and expressive APIs
Supports batch, streaming, ML, and graph processing
Example: A WordCount program in Spark is 5-10 lines, while in MapReduce it's 100+ lines.
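For reference, here is the classic RDD-based WordCount in PySpark (the input and output paths are illustrative):
text = sc.textFile("input.txt")
counts = (text.flatMap(lambda line: line.split(" "))   # split each line into words
              .map(lambda word: (word, 1))             # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))        # sum the counts per word
counts.saveAsTextFile("output")
The equivalent Hadoop MapReduce job needs separate Mapper, Reducer, and driver classes plus job configuration.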
7. What is Checkpointing in Spark?
Answer: It is the process of saving an RDD/DataFrame to reliable storage such as HDFS to truncate the lineage graph and avoid recomputation after failures.
Use case: Long lineage chains or iterative algorithms like PageRank.
Example:
sc.setCheckpointDir("hdfs://checkpoint_dir")
rdd.checkpoint()  # saved when the next action (e.g., rdd.count()) runs
8. How do you convert an RDD to a DataFrame?
Answer: Use createDataFrame() and specify a schema or column names.
Example:
rdd = sc.parallelize([(1, "Alice"), (2, "Bob")])
df = spark.createDataFrame(rdd, ["id", "name"])
9. How would you handle a 10GB file in Spark and optimize it?
Answer:
Tune the number of partitions: df.repartition(100)
Use caching if the data is reused: df.cache()
Use columnar formats like Parquet for efficiency
Avoid wide transformations like groupByKey
Explanation: Optimizing partitioning, data locality, and memory usage is key for large datasets; the steps above are combined in the sketch below.
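A minimal sketch of such a pipeline (the paths and partition count are illustrative, not tuned values):
df = spark.read.csv("big_file.csv", header=True, inferSchema=True)
df = df.repartition(100)               # spread the work evenly across the cluster
df.cache()                             # keep in memory if reused by several actions
df.write.parquet("big_file.parquet")   # columnar format for cheaper future reads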
10. What is the difference between Persist and Cache?
Answer:
cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK) on DataFrames (for RDDs the default is MEMORY_ONLY)
persist() allows full control over storage levels (e.g., MEMORY_ONLY, DISK_ONLY)
Example:
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)
df.cache()
11. What is Data Skewness? How do you handle it?
Answer: When some partitions have significantly more data than others, causing performance bottlenecks.
Solutions:
Add salting keys
Use broadcast joins
Avoid skewed keys in groupBy
Example (salting a hot key):
import random
salted = rdd.map(lambda x: ((x.key, random.randint(0, 9)), x.value))  # the random salt spreads one hot key across 10 sub-keys
12. What is an Out of Memory (OOM) issue, and how do you deal with it?
Answer: Happens when executors or driver run out of memory.
Solutions:
Increase executor memory
Use efficient formats (Parquet)
Avoid collecting large datasets on driver
Tip: Use persist(StorageLevel.DISK_ONLY) when memory is constrained
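As a sketch, typical memory-related settings when building the session (the values are illustrative and workload-dependent):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("memory-tuning")
         .config("spark.executor.memory", "8g")          # heap per executor
         .config("spark.driver.memory", "4g")            # driver heap; collect() results land here
         .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle tasks
         .getOrCreate())
Note that spark.driver.memory generally must be set before the driver JVM starts (e.g., via spark-submit --driver-memory) rather than changed in an already-running session.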
13. What are Shared Variables in Spark?
Answer: Shared variables include broadcast variables and accumulators, used to share data among tasks in a distributed fashion.
14. How can you read a CSV file without using an external schema?
Answer:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
Explanation: inferSchema=True lets Spark automatically determine column types based on the data.
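You can verify what was inferred by printing the schema:
df.printSchema()  # lists each column with its inferred type
Keep in mind that inferSchema=True costs an extra pass over the file; for large or frequently-read files, supplying an explicit schema is faster.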
15. What is Cache in Spark?
Answer: Caching stores data in memory to avoid recomputation, improving performance for iterative operations.
Usage:
df.cache()
16. Difference between map and flatMap in Spark
map(): Applies a function and returns one output per input.
flatMap(): Can return multiple outputs (flattened) for each input.
Example:
rdd.map(lambda x: x.split(" ")) # List of lists
rdd.flatMap(lambda x: x.split(" ")) # Flattened list
17. How can you add two new columns to a DataFrame with calculated values?
Answer: Use withColumn() multiple times.
Example:
df = df.withColumn("col3", df["col1"] + 1)
df = df.withColumn("col4", df["col2"] * 2)18. What is the advantage of a Parquet file?
Answer:
Columnar storage reduces I/O
Built-in compression
Supports predicate pushdown for faster queries
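A minimal sketch of writing and reading Parquet (the path and column name are illustrative):
df.write.mode("overwrite").parquet("sales.parquet")
# Only the `amount` column is read from disk, and the filter can be pushed into the scan
spark.read.parquet("sales.parquet").filter("amount > 100").select("amount").show()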
19. What is a Broadcast Variable?
Answer: A read-only variable cached on each executor, so tasks can read it without it being shipped from the driver repeatedly.
Example:
bc = sc.broadcast([1, 2, 3])  # executors read the data via bc.value
20. What are Transformations in Spark? What are their types?
Answer: Transformations are lazy operations that define a new RDD/DataFrame from an existing one. They come in two types:
Narrow: map, filter (no shuffling; each output partition depends on a single input partition)
Wide: groupByKey, reduceByKey (requires shuffling data across partitions)
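A quick illustration of both kinds (the data is made up):
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))    # narrow: each partition is processed independently
totals = doubled.reduceByKey(lambda a, b: a + b)    # wide: values for the same key are shuffled together
print(totals.collect())                             # [('a', 8), ('b', 4)] (order may vary)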