Apache Spark Interview Q&A with Detailed Explanations and Examples



1. What is Lazy Evaluation in Spark?

  • Answer: Lazy Evaluation means Spark does not immediately compute the results of transformations (like map, filter). Instead, it builds a logical execution plan (DAG). Execution only happens when an action (like count(), collect(), saveAsTextFile()) is triggered.

  • Why it's useful: It allows Spark to optimize the entire data flow and pipeline before execution, reducing the number of passes over the data.

  • Example:

rdd = sc.textFile("data.txt")
rdd2 = rdd.map(lambda x: x.upper())  # Transformation, not executed yet
result = rdd2.collect()  # Action, triggers execution

2. How do you add a new column in Spark?

  • Answer: You can use withColumn() to add a new column based on an expression or existing columns.

  • Example:

df = df.withColumn("new_col", df["col1"] + df["col2"])
  • Explanation: This creates a new column new_col by adding values of col1 and col2. It's non-destructive and returns a new DataFrame.


3. What is a Broadcast Join?

  • Answer: A broadcast join is a technique where a small dataset is broadcast to all the executor nodes. This avoids the need to shuffle the larger dataset, drastically improving performance.

  • When to use: When one of the datasets is small enough to fit in each executor's memory (Spark also broadcasts automatically when a table is smaller than spark.sql.autoBroadcastJoinThreshold).

  • Example:

from pyspark.sql.functions import broadcast
large_df.join(broadcast(small_df), "id")
  • Explanation: Instead of shuffling both datasets, Spark broadcasts small_df to each executor and performs the join locally with large_df.


4. What is the role of Catalyst Optimizer?

  • Answer: Catalyst Optimizer is the query optimization engine in Spark SQL. It performs logical and physical optimization on queries to generate efficient execution plans.

  • Key features:

    • Constant folding

    • Predicate pushdown

    • Reordering of joins

  • Benefit: Speeds up SQL/DataFrame operations without manual tuning; you can inspect the plans it produces with explain(), as in the sketch below.
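  • Example: A minimal sketch of viewing the plans Catalyst generates, assuming an existing SparkSession named spark:

df = spark.range(1000).withColumnRenamed("id", "value")
filtered = df.filter("value > 10").select("value")
filtered.explain(True)  # prints the parsed, analyzed, optimized logical, and physical plans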


5. What are Accumulators and Broadcast Variables?

  • Accumulators: Used for counters or sums across executors. Executors can only add to them, and only the driver can read their value.

  • Broadcast Variables: Allow large read-only data to be cached on all nodes to avoid redundant transmission.

  • Example of Accumulator:

acc = sc.accumulator(0)
rdd.foreach(lambda x: acc.add(1))
print(acc.value)
  • Example of Broadcast Variable:

broadcast_var = sc.broadcast([1, 2, 3])
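  • Usage sketch (tasks read the broadcast data through .value, reusing broadcast_var above):

rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.filter(lambda x: x in broadcast_var.value)  # executors read the cached list locally
print(result.collect())  # [1, 2, 3]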

6. What makes Spark better than MapReduce?

  • Answer:

    • In-memory computation → faster than disk-based MapReduce

    • More concise and expressive APIs

    • Supports batch, streaming, ML, and graph processing

  • Example: A WordCount program in Spark takes only a handful of lines, while the equivalent Hadoop MapReduce job needs far more Java boilerplate, as the sketch below shows.
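  • Sketch of a complete WordCount in PySpark (assumes an existing SparkContext sc and an input file data.txt; the output path is illustrative):

counts = (sc.textFile("data.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("wordcount_output")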


7. What is Checkpointing in Spark?

  • Answer: It is the process of saving the RDD/DataFrame to a reliable storage like HDFS to truncate the lineage graph and prevent recomputation during failures.

  • Use case: Long lineage chains or iterative algorithms like PageRank.

  • Example:

sc.setCheckpointDir("hdfs://checkpoint_dir")
rdd.checkpoint()

8. How do you convert an RDD to a DataFrame?

  • Answer: Use spark.createDataFrame() and supply a schema or a list of column names.

  • Example:

rdd = sc.parallelize([(1, "Alice"), (2, "Bob")])
df = spark.createDataFrame(rdd, ["id", "name"])

9. How would you handle a 10GB file in Spark and optimize it?

  • Answer:

    • Tune the number of partitions: df.repartition(100)

    • Use caching if reused: df.cache()

    • Use columnar formats like Parquet for efficiency

    • Avoid wide transformations like groupByKey

  • Explanation: Optimizing partitioning, data locality, and memory usage is key for large datasets; a combined sketch follows below.
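  • Example: A rough sketch combining these ideas; the file paths and the partition count are placeholders, not recommendations:

df = spark.read.csv("large_file.csv", header=True, inferSchema=True)
df = df.repartition(100)   # spread the data evenly across tasks
df.cache()                 # cache only if the DataFrame is reused downstream
df.write.mode("overwrite").parquet("large_file_parquet")  # columnar format for later reads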


10. What is the difference between Persist and Cache?

  • Answer:

    • cache() is shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames and Datasets)

    • persist() allows full control over storage levels (e.g., MEMORY_ONLY, DISK_ONLY)

  • Example:

from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)
df.cache()

11. What is Data Skewness? How do you handle it?

  • Answer: Data skew occurs when some partitions hold significantly more data than others, so a few tasks run much longer than the rest and become performance bottlenecks.

  • Solutions:

    • Add a random salt to skewed keys so a heavy key is spread across multiple partitions

    • Use broadcast joins

    • Avoid skewed keys in groupBy

  • Example:

import random
rdd.map(lambda x: (str(x.key) + "_" + str(random.randint(0, 9)), x.value))  # append a random salt to the key

12. What is an Out of Memory (OOM) issue, and how do you deal with it?

  • Answer: An OOM error occurs when an executor or the driver runs out of memory, commonly due to oversized partitions, large shuffles, or collecting too much data on the driver.

  • Solutions:

    • Increase executor memory

    • Use efficient formats (Parquet)

    • Avoid collecting large datasets on driver

  • Tip: Use persist(StorageLevel.DISK_ONLY) when memory is constrained; typical memory settings are sketched below.
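  • Example: Memory limits are usually passed via spark-submit or the session builder; the values and app name below are illustrative placeholders to tune per cluster:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory_tuning_example")        # hypothetical app name
         .config("spark.executor.memory", "8g")   # per-executor heap
         .config("spark.driver.memory", "4g")     # driver heap
         .getOrCreate())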


13. What are Shared Variables in Spark?

  • Answer: Shared variables include broadcast variables (read-only data cached on every executor) and accumulators (add-only counters aggregated at the driver). They are used to share data among tasks in a distributed fashion (see Q5 for examples).


14. How can you read a CSV file without using an external schema?

  • Answer:

df = spark.read.csv("file.csv", header=True, inferSchema=True)
  • Explanation: inferSchema=True lets Spark automatically determine column types based on the data.


15. What is Cache in Spark?

  • Answer: Caching stores data in memory to avoid recomputation, improving performance for iterative operations.

  • Usage:

df.cache()

16. Difference between map and flatMap in Spark

  • map(): Applies a function and returns one output per input.

  • flatMap(): Can return multiple outputs (flattened) for each input.

  • Example:

rdd.map(lambda x: x.split(" "))  # List of lists
rdd.flatMap(lambda x: x.split(" "))  # Flattened list

17. How can you add two new columns to a DataFrame with calculated values?

  • Answer: Use withColumn() multiple times.

  • Example:

df = df.withColumn("col3", df["col1"] + 1)
df = df.withColumn("col4", df["col2"] * 2)

18. What is the advantage of a Parquet file?

  • Answer:

    • Columnar storage reduces I/O

    • Built-in compression

    • Supports predicate pushdown for faster queries
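  • Example: A minimal sketch of writing and reading Parquet; the path and the column name in the filter are illustrative:

df.write.mode("overwrite").parquet("data_parquet")
parquet_df = spark.read.parquet("data_parquet")
parquet_df.filter("id > 100").show()  # the filter can be pushed down to skip row groups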


19. What is a Broadcast Variable?

  • Answer: A read-only variable cached on each executor so tasks can read it locally instead of having it shipped from the driver with every task.

  • Example:

bc = sc.broadcast([1, 2, 3])

20. What are Transformations in Spark? What are their types?

  • Narrow transformations (map, filter): each output partition depends on a single input partition, so no shuffle is needed

  • Wide transformations (groupByKey, reduceByKey, join): output partitions depend on data from multiple input partitions, so a shuffle is required
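  • Example: A minimal sketch contrasting the two, assuming an existing SparkContext sc:

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
narrow = rdd.map(lambda kv: (kv[0], kv[1] * 10))   # narrow: no shuffle needed
wide = rdd.reduceByKey(lambda a, b: a + b)         # wide: shuffles records by key
print(wide.collect())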

