PySpark, SQL, Python & Databricks Interview Questions – With Easy Answers & Real-World Examples


🔹 PySpark


1. Code Challenge

Question:
Write PySpark code that:

  • Reads data from a JSON file,

  • Adds a new column with the current timestamp, and

  • Writes the enriched data back in Delta format.

Answer:

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("ReadWriteDelta").getOrCreate()

# Read JSON file (e.g., patient records)
df = spark.read.json("/mnt/healthcare/patients.json")

# Add a timestamp column
df = df.withColumn("ingestion_time", current_timestamp())

# Write as Delta
df.write.format("delta").mode("overwrite").save("/mnt/delta/healthcare/patients_enriched")

Real-world Example:
This could be used in a hospital system where new patient data is ingested daily, and a timestamp helps track when it was processed.


2. String Manipulation

Question:
Using PySpark, how do you split a full name into three columns: first name, middle name, and last name?

Answer:

python

from pyspark.sql.functions import split

df = spark.createDataFrame([("John Michael Doe",)], ["fullname"])

df = df.withColumn("first_name", split("fullname", " ")[0]) \
       .withColumn("middle_name", split("fullname", " ")[1]) \
       .withColumn("last_name", split("fullname", " ")[2])

df.show()

Use Case:
Used in banking systems to separate customer names for KYC (Know Your Customer) processes.


3. Sales Analytics Table

Scenario: You have a table with columns: product, region, and sales.

🔸 Task 1: Total sales per product and per region

python

df.groupBy("product", "region").sum("sales").show()

🔸 Task 2: % contribution of each region to the product's total sales

python

from pyspark.sql.window import Window
from pyspark.sql.functions import sum, col, round

windowSpec = Window.partitionBy("product")

df = df.withColumn("total_sales", sum("sales").over(windowSpec)) \
       .withColumn("region_pct", round((col("sales") / col("total_sales")) * 100, 2))

df.show()

🔸 Task 3: Sort products by total sales in descending order

python

df.groupBy("product").sum("sales") \ .withColumnRenamed("sum(sales)", "total_sales") \ .orderBy(col("total_sales").desc()) \ .show()

Example:
Used by banks to analyze loan sales across regions.


🔹 SQL


1. Match Outcome Query

Question:
Given a table with columns Team1, Team2, and Won (the winning team), write a query to get each team’s total wins and losses.

Answer:

sql

SELECT team,
       SUM(CASE WHEN team = won THEN 1 ELSE 0 END) AS wins,
       SUM(CASE WHEN team != won THEN 1 ELSE 0 END) AS losses
FROM (
    SELECT team1 AS team, won FROM matches
    UNION ALL
    SELECT team2 AS team, won FROM matches
) t
GROUP BY team;

Why this is useful:
Great practice for unpivoting with UNION ALL and conditional aggregation, useful in sports analytics or campaign win/loss summaries.


2. Top Earners by Department

Question:
From employee and dept tables, return the top 3 salaried employees for each department.

Answer:

sql

SELECT emp_id, dept_id, dept_name, salary
FROM (
    SELECT e.emp_id,
           e.dept_id,
           d.dept_name,
           e.salary,
           ROW_NUMBER() OVER (PARTITION BY e.dept_id ORDER BY e.salary DESC) AS rn
    FROM employee e
    JOIN dept d ON e.dept_id = d.dept_id
) ranked
WHERE rn <= 3;

Example:
Used in HR systems to reward top performers in each department.


3. Duplicate Salaries Check

Question:
Find all employees who share the same salary within the same department.

Answer:

sql

SELECT e1.*
FROM employee e1
JOIN (
    SELECT dept_id, salary
    FROM employee
    GROUP BY dept_id, salary
    HAVING COUNT(*) > 1
) e2
  ON e1.dept_id = e2.dept_id
 AND e1.salary = e2.salary;

Use Case:
Helpful in payroll audit systems to detect anomalies.


🔹 Python


1. Longest Substring Without Repeating Characters

Question:
Write a function to return the longest substring without repeating characters.

Answer:


def longest_unique_substring(s):
    start = 0
    max_len = 0
    used = {}
    longest = ""
    for i, c in enumerate(s):
        # If the character was already seen inside the current window, move the window start
        if c in used and used[c] >= start:
            start = used[c] + 1
        used[c] = i
        # Track the longest window seen so far
        if i - start + 1 > max_len:
            max_len = i - start + 1
            longest = s[start:i + 1]
    return longest


print(longest_unique_substring("abcddgh"))  # Output: "abcd"

2. What is a Decorator in Python?

Answer:
A decorator is a function that modifies the behavior of another function without changing its code.

Example:


def log(func):
    def wrapper(*args, **kwargs):
        print("Calling:", func.__name__)
        return func(*args, **kwargs)
    return wrapper


@log
def greet(name):
    return f"Hello, {name}"


print(greet("Doctor"))  # Output: logs the call, then returns the greeting

Use Case:
Common in web apps for logging, security checks, or timing execution.


🔹 Databricks & Spark


1. What Cluster Types are Available?

  • All-Purpose Cluster: For development and interactive exploration

  • Job Cluster: For production ETL jobs (spins up temporarily)

  • Shared Cluster: Multiple users can access (training, sandbox)

Example:
Healthcare analysts use all-purpose clusters; data engineers use job clusters for ETL pipelines.
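
For a concrete picture, here is a hedged sketch of how a job cluster might be declared inside a Databricks Jobs API-style payload; the runtime version, node type, paths, and names are illustrative placeholders, not values from this post:

python

# Hedged sketch only: a job that runs a notebook on a job cluster created for the run
# and terminated afterwards. All values below are illustrative.
job_spec = {
    "name": "nightly_patient_etl",
    "tasks": [
        {
            "task_key": "ingest_patients",
            "new_cluster": {                          # job cluster: exists only for this run
                "spark_version": "13.3.x-scala2.12",  # example Databricks runtime
                "node_type_id": "i3.xlarge",          # example node type
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/Repos/etl/ingest_patients"},
            "max_retries": 1,                         # simple retry policy
        }
    ],
}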


2. Workflow Reliability in Databricks

How to handle failures:

  • Use Job UI to monitor

  • Configure email/Slack alerts

  • Apply retry policies

  • Use checkpointing and Delta Lake time travel (see the sketch below)
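
A minimal sketch of the checkpointing idea, assuming a Delta source and sink at placeholder paths:

python

# Streaming job that can restart after a failure and resume from its checkpoint.
# Paths below are placeholders.
query = (
    spark.readStream.format("delta").load("/mnt/delta/events")        # example source
         .writeStream.format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events")     # progress/state used for recovery
         .start("/mnt/delta/events_out")                              # example sink
)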


3. Unity Catalog

What it is:
A centralized governance layer for managing data access and auditing across all workspaces.

Benefits:

  • Row/column-level security

  • Lineage tracking

  • Easier compliance (e.g., HIPAA, SOX in banking)
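
As a small illustration, table-level permissions can be granted from a Databricks notebook with SQL; the catalog, schema, table, and group names below are placeholders:

python

# Grant read access on a Unity Catalog table to an example group.
spark.sql("GRANT SELECT ON TABLE main.healthcare.patients TO `analysts`")

# Revoke it again if the role changes.
spark.sql("REVOKE SELECT ON TABLE main.healthcare.patients FROM `analysts`")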


4. Cache vs. Persist in Spark

Method    | Storage                                                                   | When to Use
cache()   | Default storage level (memory-only for RDDs, memory + disk for DataFrames) | Small data, fast repeated access
persist() | Any StorageLevel you choose (memory, disk, or both)                        | Large data, fault tolerance
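
A quick illustration of the difference, assuming an existing DataFrame df:

python

from pyspark import StorageLevel

df.cache()                                 # default storage level
df.count()                                 # an action materializes the cache

df.unpersist()                             # release it before changing the level
df.persist(StorageLevel.MEMORY_AND_DISK)   # explicitly allow spill to disk
df.count()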


5. What is Serialization in Spark?

Serialization is the process of converting objects into bytes so they can be sent across the network (or written to disk) and reconstructed later.

Why it matters:

  • Spark runs on distributed nodes

  • Efficient serialization (like Kryo) reduces overhead
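
For example, Kryo can be enabled when building the session (a minimal sketch):

python

from pyspark.sql import SparkSession

# Switch the serializer from default Java serialization to Kryo,
# which is faster and more compact for shuffled and cached data.
spark = (
    SparkSession.builder
        .appName("KryoSerializationExample")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
)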


6. Spark Optimization Techniques

  • Broadcast joins for small tables

  • Predicate pushdown (filter early)

  • Caching for reused DataFrames

  • Partition pruning

  • Use Delta format for performance
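
A short sketch combining several of these techniques, assuming a Delta table partitioned by a txn_date column (the path and column names are placeholders):

python

from pyspark.sql.functions import col

# Example path; assumes the Delta table is partitioned by txn_date.
df = spark.read.format("delta").load("/mnt/delta/transactions")

# Filter early: on a partitioned table this enables partition pruning,
# so Spark reads only the matching partitions.
recent = df.filter(col("txn_date") >= "2024-01-01")

# Cache a DataFrame that several downstream aggregations reuse.
recent.cache()
recent.groupBy("region").count().show()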


7. Join Strategies

Join Type      | Best For
Broadcast Join | Small tables (e.g., country codes)
Shuffle Join   | Large tables (e.g., customer + transaction)
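
A minimal broadcast-join sketch, assuming a large transactions DataFrame and a small country_codes lookup DataFrame (both names are placeholders):

python

from pyspark.sql.functions import broadcast

# Broadcasting the small side avoids shuffling the large table across the cluster.
result = transactions.join(broadcast(country_codes), on="country_code", how="left")
result.show()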


8. Schema Evolution in Autoloader

python

spark.readStream.format("cloudFiles") \ .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \ .load("/mnt/banking/transactions/")

Use Case:
When banks add new columns such as currency or channel, the schema adapts automatically.


🔹 Project Walkthrough

Describe Your Last Project

Answer:

In my last project, I built a data lake for a hospital chain.

  • Data Sources:
    JSON from hospital EHRs, CSV from insurance providers, real-time HL7 API data

  • Formats Used: JSON, CSV, Parquet

  • Transformations:

    • De-identified patient PII (see the hashing sketch after this list)

    • Applied ICD code normalization

    • Used Delta Lake for SCD2 patient history

    • Built ML pipeline for readmission risk scoring

  • Governance:
    Unity Catalog to manage row-level access by role (Doctor, Analyst, Admin)
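
To illustrate the de-identification step, here is a minimal sketch that replaces an assumed patient_id column with a one-way hash; the DataFrame and column names are placeholders:

python

from pyspark.sql.functions import sha2, col

# Replace the raw identifier with a one-way hash so records can still be joined
# downstream without exposing the original PII value.
deidentified = (
    patients_df
        .withColumn("patient_key", sha2(col("patient_id").cast("string"), 256))
        .drop("patient_id")
)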
