Apache Spark Interview Q&A with Detailed Explanations and Examples
1. What is Lazy Evaluation in Spark?
Answer: Lazy Evaluation means Spark does not immediately compute the results of transformations (like map, filter). Instead, it builds a logical execution plan (DAG). Execution only happens when an action (like count(), collect(), saveAsTextFile()) is triggered.
Why it's useful: It allows Spark to optimize the entire data flow and pipeline before execution, reducing the number of passes over the data.
Example:
rdd = sc.textFile("data.txt")
rdd2 = rdd.map(lambda x: x.upper()) # Transformation, not executed yet
result = rdd2.collect()  # Action, triggers execution
2. How do you add a new column in Spark?
Answer: You can use withColumn() to add a new column based on an expression or existing columns.
Example:
df = df.withColumn("new_col", df["col1"] + df["col2"])
Explanation: This creates a new column new_col by adding the values of col1 and col2. It's non-destructive and returns a new DataFrame.
3. What is a Broadcast Join?
Answer: A broadcast join is a technique where a small dataset is broadcast to all the executor nodes. This avoids the need to shuffle the larger dataset, drastically improving performance.
When to use: When one of the datasets is small enough to fit in memory.
Example:
from pyspark.sql.functions import broadcast
large_df.join(broadcast(small_df), "id")
Explanation: Instead of shuffling both datasets, Spark broadcasts small_df to each executor and performs the join locally with large_df.
4. What is the role of Catalyst Optimizer?
Answer: Catalyst Optimizer is the query optimization engine in Spark SQL. It performs logical and physical optimization on queries to generate efficient execution plans.
Key features:
Constant folding
Predicate pushdown
Reordering of joins
Benefit: Speeds up SQL/DataFrame operations without manual tuning.
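You can inspect the plans Catalyst produces with explain(). A minimal sketch (the file and column names here are made up for illustration):
df = spark.read.parquet("events.parquet")
df.filter(df["status"] == "ok").select("id").explain(True)  # prints the parsed, analyzed, optimized, and physical plans
For a Parquet source like this, the physical plan will typically show the status filter pushed down into the file scan rather than applied as a separate step.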
5. What are Accumulators and Broadcast Variables?
Accumulators: Used for counters or sums across executors. Executors can only add to them; only the driver can read the accumulated value.
Broadcast Variables: Allow large read-only data to be cached on all nodes to avoid redundant transmission.
Example of Accumulator:
acc = sc.accumulator(0)
rdd.foreach(lambda x: acc.add(1))
print(acc.value)
Example of Broadcast Variable:
broadcast_var = sc.broadcast([1, 2, 3])
6. What makes Spark better than MapReduce?
Answer:
In-memory computation → faster than disk-based MapReduce
More concise and expressive APIs
Supports batch, streaming, ML, and graph processing
Example: A WordCount program in Spark is 5-10 lines, while in MapReduce it's 100+ lines.
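For reference, here is the classic RDD-based WordCount in PySpark (the input and output paths are illustrative):
text = sc.textFile("input.txt")
counts = (text.flatMap(lambda line: line.split(" "))   # split each line into words
              .map(lambda word: (word, 1))             # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))        # sum the counts per word
counts.saveAsTextFile("output")
The equivalent Hadoop MapReduce job needs separate Mapper, Reducer, and driver classes plus job configuration.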
7. What is Checkpointing in Spark?
Answer: It is the process of saving an RDD/DataFrame to reliable storage such as HDFS to truncate the lineage graph and avoid recomputation after failures.
Use case: Long lineage chains or iterative algorithms like PageRank.
Example:
sc.setCheckpointDir("hdfs://checkpoint_dir")
rdd.checkpoint()  # saved when the next action (e.g., rdd.count()) runs
8. How do you convert an RDD to a DataFrame?
Answer: Use createDataFrame() and specify a schema or column names.
Example:
rdd = sc.parallelize([(1, "Alice"), (2, "Bob")])
df = spark.createDataFrame(rdd, ["id", "name"])
9. How would you handle a 10GB file in Spark and optimize it?
Answer:
Tune the number of partitions: df.repartition(100)
Use caching if the data is reused: df.cache()
Use columnar formats like Parquet for efficiency
Avoid wide transformations like groupByKey
Explanation: Optimizing partitioning, data locality, and memory usage is key for large datasets; the steps above are combined in the sketch below.
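A minimal sketch of such a pipeline (the paths and partition count are illustrative, not tuned values):
df = spark.read.csv("big_file.csv", header=True, inferSchema=True)
df = df.repartition(100)               # spread the work evenly across the cluster
df.cache()                             # keep in memory if reused by several actions
df.write.parquet("big_file.parquet")   # columnar format for cheaper future reads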
10. What is the difference between Persist and Cache?
Answer:
cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK) on DataFrames (for RDDs the default is MEMORY_ONLY)
persist() allows full control over storage levels (e.g., MEMORY_ONLY, DISK_ONLY)
Example:
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)
df.cache()
11. What is Data Skewness? How do you handle it?
Answer: When some partitions have significantly more data than others, causing performance bottlenecks.
Solutions:
Add salting keys
Use broadcast joins
Avoid skewed keys in groupBy
Example (salting a hot key):
import random
salted = rdd.map(lambda x: ((x.key, random.randint(0, 9)), x.value))  # the random salt spreads one hot key across 10 sub-keys
12. What is an Out of Memory (OOM) issue, and how do you deal with it?
Answer: Happens when executors or driver run out of memory.
Solutions:
Increase executor memory
Use efficient formats (Parquet)
Avoid collecting large datasets on driver
Tip: Use persist(StorageLevel.DISK_ONLY) when memory is constrained
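As a sketch, typical memory-related settings when building the session (the values are illustrative and workload-dependent):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("memory-tuning")
         .config("spark.executor.memory", "8g")          # heap per executor
         .config("spark.driver.memory", "4g")            # driver heap; collect() results land here
         .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle tasks
         .getOrCreate())
Note that spark.driver.memory generally must be set before the driver JVM starts (e.g., via spark-submit --driver-memory) rather than changed in an already-running session.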
13. What are Shared Variables in Spark?
Answer: Shared variables include broadcast variables and accumulators, used to share data among tasks in a distributed fashion.
14. How can you read a CSV file without using an external schema?
Answer:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
Explanation: inferSchema=True lets Spark automatically determine column types based on the data.
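You can verify what was inferred by printing the schema:
df.printSchema()  # lists each column with its inferred type
Keep in mind that inferSchema=True costs an extra pass over the file; for large or frequently-read files, supplying an explicit schema is faster.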
15. What is Cache in Spark?
Answer: Caching stores data in memory to avoid recomputation, improving performance for iterative operations.
Usage:
df.cache()
16. Difference between map and flatMap in Spark
map(): Applies a function and returns one output per input.
flatMap(): Can return multiple outputs (flattened) for each input.
Example:
rdd.map(lambda x: x.split(" ")) # List of lists
rdd.flatMap(lambda x: x.split(" ")) # Flattened list
17. How can you add two new columns to a DataFrame with calculated values?
Answer: Use withColumn() multiple times.
Example:
df = df.withColumn("col3", df["col1"] + 1)
df = df.withColumn("col4", df["col2"] * 2)18. What is the advantage of a Parquet file?
Answer:
Columnar storage reduces I/O
Built-in compression
Supports predicate pushdown for faster queries
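A minimal sketch of writing and reading Parquet (the path and column name are illustrative):
df.write.mode("overwrite").parquet("sales.parquet")
# Only the `amount` column is read from disk, and the filter can be pushed into the scan
spark.read.parquet("sales.parquet").filter("amount > 100").select("amount").show()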
19. What is a Broadcast Variable?
Answer: A read-only variable cached on each executor, so tasks can read it without it being shipped from the driver repeatedly.
Example:
bc = sc.broadcast([1, 2, 3])  # executors read the data via bc.value
20. What are Transformations in Spark? What are their types?
Answer: Transformations are lazy operations that define a new RDD/DataFrame from an existing one. They come in two types:
Narrow: map, filter (no shuffling; each output partition depends on a single input partition)
Wide: groupByKey, reduceByKey (requires shuffling data across partitions)
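A quick illustration of both kinds (the data is made up):
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))    # narrow: each partition is processed independently
totals = doubled.reduceByKey(lambda a, b: a + b)    # wide: values for the same key are shuffled together
print(totals.collect())                             # [('a', 8), ('b', 4)] (order may vary)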