PySpark, SQL, Python & Databricks Interview Questions – With Easy Answers & Real-World Examples
🔹 PySpark
1. Code Challenge
Question:
Write PySpark code that:
- Reads data from a JSON file,
- Adds a new column with the current timestamp, and
- Writes the enriched data back in Delta format.
Answer:
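A minimal sketch (the input and output paths are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.getOrCreate()

# 1. Read the raw JSON file (placeholder path)
df = spark.read.json("/data/input/patients.json")

# 2. Add a column capturing when the record was processed
df_enriched = df.withColumn("processed_at", current_timestamp())

# 3. Write the enriched data back in Delta format
df_enriched.write.format("delta").mode("append").save("/delta/patients")
```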
Real-world Example:
This could be used in a hospital system where new patient data is ingested daily, and a timestamp helps track when it was processed.
2. String Manipulation
Question:
Using PySpark, how do you split a full name into three columns: first name, middle name, and last name?
Answer:
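One way is to split on spaces (this sketch assumes a full_name column with exactly three space-separated parts; production code would also handle missing middle names):

```python
from pyspark.sql.functions import split, col

# Split "First Middle Last" into an array, then pick out each part
parts = split(col("full_name"), " ")

df_names = (df
    .withColumn("first_name", parts.getItem(0))
    .withColumn("middle_name", parts.getItem(1))
    .withColumn("last_name", parts.getItem(2)))
```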
Use Case:
Used in banking systems to separate customer names for KYC (Know Your Customer) processes.
3. Sales Analytics Table
Scenario: You have a table with columns: product, region, and sales.
🔸 Task 1: Total sales per product and per region
🔸 Task 2: % contribution of each region to the product's total sales
🔸 Task 3: Sort products by total sales in descending order
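A sketch covering all three tasks (assumes a DataFrame named sales_df with columns product, region, and sales):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Task 1: total sales per product and per region
region_totals = sales_df.groupBy("product", "region").agg(F.sum("sales").alias("region_sales"))

# Task 2: % contribution of each region to the product's total sales
product_window = Window.partitionBy("product")
with_pct = region_totals.withColumn(
    "pct_of_product",
    F.round(F.col("region_sales") / F.sum("region_sales").over(product_window) * 100, 2)
)

# Task 3: sort products by total sales in descending order
product_totals = (sales_df.groupBy("product")
    .agg(F.sum("sales").alias("total_sales"))
    .orderBy(F.col("total_sales").desc()))
```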
Example:
Used by banks to analyze loan sales across regions.
🔹 SQL
1. Match Outcome Query
Question:
Given a table with columns Team1, Team2, and Won (the winning team), write a query to get each team’s total wins and losses.
Answer:
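One way is to list every appearance of a team (via UNION ALL) and then aggregate; the table name matches is a placeholder, and the query is wrapped in spark.sql so it runs in a Databricks notebook:

```python
result = spark.sql("""
    WITH results AS (
        SELECT Team1 AS team, CASE WHEN Team1 = Won THEN 1 ELSE 0 END AS win FROM matches
        UNION ALL
        SELECT Team2 AS team, CASE WHEN Team2 = Won THEN 1 ELSE 0 END AS win FROM matches
    )
    SELECT team,
           SUM(win)            AS wins,
           COUNT(*) - SUM(win) AS losses
    FROM results
    GROUP BY team
""")
result.show()
```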
Why this is useful:
Great practice for self-joins and aggregations, useful in sports analytics or campaign win/loss summaries.
2. Top Earners by Department
Question:
From the employee and dept tables, return the top 3 highest-paid employees in each department.
Answer:
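A sketch using a window function (column names such as emp_name, salary, dept_id, and dept_name are illustrative):

```python
top_earners = spark.sql("""
    SELECT dept_name, emp_name, salary
    FROM (
        SELECT d.dept_name,
               e.emp_name,
               e.salary,
               DENSE_RANK() OVER (PARTITION BY d.dept_name ORDER BY e.salary DESC) AS rnk
        FROM employee e
        JOIN dept d ON e.dept_id = d.dept_id
    ) ranked
    WHERE rnk <= 3
""")
top_earners.show()
```

DENSE_RANK keeps salary ties together; swap in ROW_NUMBER if exactly three rows per department are required.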
Example:
Used in HR systems to reward top performers in each department.
3. Duplicate Salaries Check
Question:
Find all employees who share the same salary within the same department.
Answer:
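One possible query (assumes an employee table with dept_id and salary columns):

```python
duplicate_salaries = spark.sql("""
    SELECT e.*
    FROM employee e
    JOIN (
        SELECT dept_id, salary
        FROM employee
        GROUP BY dept_id, salary
        HAVING COUNT(*) > 1
    ) dup
      ON e.dept_id = dup.dept_id AND e.salary = dup.salary
""")
duplicate_salaries.show()
```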
Use Case:
Helpful in payroll audit systems to detect anomalies.
🔹 Python
1. Longest Substring Without Repeating Characters
Question:
Write a function to return the longest substring without repeating characters.
Answer:
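A sliding-window sketch in plain Python:

```python
def longest_unique_substring(s: str) -> str:
    """Return the longest substring of s without repeating characters."""
    last_seen = {}              # character -> index of its latest occurrence
    start = 0                   # left edge of the current window
    best_start, best_len = 0, 0

    for i, ch in enumerate(s):
        # If ch already appears inside the current window, shrink the window
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1
        last_seen[ch] = i
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1

    return s[best_start:best_start + best_len]

print(longest_unique_substring("abcabcbb"))  # -> "abc"
```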
2. What is a Decorator in Python?
Answer:
A decorator is a function that modifies the behavior of another function without changing its code.
Example:
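A simple timing decorator as a sketch:

```python
import time
from functools import wraps

def timed(func):
    """Log how long the wrapped function takes to run."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.3f}s")
        return result
    return wrapper

@timed
def load_report():
    time.sleep(0.5)
    return "done"

load_report()  # prints something like: load_report took 0.501s
```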
Use Case:
Common in web apps for logging, security checks, or timing execution.
🔹 Databricks & Spark
1. What Cluster Types are Available?
- All-Purpose Cluster: For development and interactive exploration
- Job Cluster: For production ETL jobs (spins up temporarily)
- Shared Cluster: Multiple users can access (training, sandbox)
Example:
Healthcare analysts use all-purpose clusters; data engineers use job clusters for ETL pipelines.
2. Workflow Reliability in Databricks
How to handle failures:
- Use the Job UI to monitor runs
- Configure email/Slack alerts
- Apply retry policies
- Use checkpointing and Delta Lake time travel (see the sketch below)
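To illustrate the last point, a minimal Delta Lake time travel sketch (the table path is a placeholder):

```python
# Re-read an earlier version of a Delta table after a bad write
previous = (spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("/delta/transactions"))

# Or go back to a specific point in time
yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01")
    .load("/delta/transactions"))
```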
3. Unity Catalog
What it is:
A centralized governance layer for managing data access and auditing across all workspaces.
Benefits:
- Row/column-level security (see the GRANT example below)
- Lineage tracking
- Easier compliance (e.g., HIPAA, SOX in banking)
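Access is granted with standard SQL; a sketch, where the catalog, schema, table, and group names are placeholders:

```python
# Give the analysts group read access to a Unity Catalog table
spark.sql("GRANT SELECT ON TABLE main.claims.patients TO `analysts`")
```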
4. Cache vs. Persist in Spark
| Method | Storage Level | When to Use |
|---|---|---|
| cache() | Memory only (default) | Small data, faster access |
| persist() | Configurable, e.g. memory + disk | Large data, fault tolerance |
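A quick sketch of both (paths are placeholders):

```python
from pyspark import StorageLevel

# cache(): keep a small, frequently reused DataFrame readily available
lookup_df = spark.read.parquet("/data/lookup").cache()

# persist(): pick the storage level explicitly, spilling to disk if memory is tight
big_df = spark.read.parquet("/data/transactions").persist(StorageLevel.MEMORY_AND_DISK)

big_df.count()      # an action materializes the cached/persisted data
big_df.unpersist()  # release the storage when done
```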
5. What is Serialization in Spark?
Serialization is the process of converting objects into bytes so they can be sent across the network or written to disk.
Why it matters:
- Spark runs on distributed nodes
- Efficient serialization (like Kryo) reduces overhead (see the config sketch below)
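Kryo can be enabled through Spark configuration, as sketched below (on Databricks this is usually set in the cluster's Spark config instead):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("kryo-example")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())
```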
6. Spark Optimization Techniques
- Broadcast joins for small tables (example below)
- Predicate pushdown (filter early)
- Caching for reused DataFrames
- Partition pruning
- Use Delta format for performance
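For example, a broadcast join combined with an early filter (the DataFrame names are placeholders):

```python
from pyspark.sql.functions import broadcast

# Filter early (predicate pushdown) and broadcast the small lookup table
result = (transactions_df
    .filter("txn_date >= '2024-01-01'")
    .join(broadcast(country_codes_df), "country_code"))
```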
7. Join Strategies
| Join Type | Best For |
|---|---|
| Broadcast Join | Small tables (e.g., country codes) |
| Shuffle Join | Large tables (customer + transaction) |
8. Schema Evolution in Autoloader
What it is:
Auto Loader can detect new columns that appear in incoming files and evolve the target table's schema automatically.
Use Case:
When banks add new columns such as currency or channel, the schema adapts automatically without manual DDL changes.
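A sketch of Auto Loader with schema evolution enabled (paths are placeholders):

```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/txn_schema")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/raw/transactions"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/txn")
    .option("mergeSchema", "true")
    .start("/delta/transactions"))
```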
🔹 Project Walkthrough
Describe Your Last Project
Answer:
In my last project, I built a data lake for a hospital chain.
- Data Sources: JSON from hospital EHRs, CSV from insurance providers, real-time HL7 API data
- Formats Used: JSON, CSV, Parquet
- Transformations:
  - De-identified patient PII
  - Applied ICD code normalization
  - Used Delta Lake for SCD2 patient history
  - Built ML pipeline for readmission risk scoring
- Governance: Unity Catalog to manage row-level access by role (Doctor, Analyst, Admin)