PySpark, SQL, Python & Databricks Interview Questions – With Easy Answers & Real-World Examples
🔹 PySpark
1. Code Challenge
Question:
Write PySpark code that:
- Reads data from a JSON file,
- Adds a new column with the current timestamp, and
- Writes the enriched data back in Delta format.
Answer:
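A minimal sketch (the input and output paths are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.getOrCreate()

# 1. Read the raw JSON file (placeholder path)
df = spark.read.json("/data/input/patients.json")

# 2. Add a column capturing when the record was processed
df_enriched = df.withColumn("processed_at", current_timestamp())

# 3. Write the enriched data back in Delta format
df_enriched.write.format("delta").mode("append").save("/delta/patients")
```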
Real-world Example:
This could be used in a hospital system where new patient data is ingested daily, and a timestamp helps track when it was processed.
2. String Manipulation
Question:
Using PySpark, how do you split a full name into three columns: first name, middle name, and last name?
Answer:
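One way is to split on spaces (this sketch assumes a full_name column with exactly three space-separated parts; production code would also handle missing middle names):

```python
from pyspark.sql.functions import split, col

# Split "First Middle Last" into an array, then pick out each part
parts = split(col("full_name"), " ")

df_names = (df
    .withColumn("first_name", parts.getItem(0))
    .withColumn("middle_name", parts.getItem(1))
    .withColumn("last_name", parts.getItem(2)))
```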
Use Case:
Used in banking systems to separate customer names for KYC (Know Your Customer) processes.
3. Sales Analytics Table
Scenario: You have a table with columns: product, region, and sales.
🔸 Task 1: Total sales per product and per region
🔸 Task 2: % contribution of each region to the product's total sales
🔸 Task 3: Sort products by total sales in descending order
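A sketch covering all three tasks (assumes a DataFrame named sales_df with columns product, region, and sales):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Task 1: total sales per product and per region
region_totals = sales_df.groupBy("product", "region").agg(F.sum("sales").alias("region_sales"))

# Task 2: % contribution of each region to the product's total sales
product_window = Window.partitionBy("product")
with_pct = region_totals.withColumn(
    "pct_of_product",
    F.round(F.col("region_sales") / F.sum("region_sales").over(product_window) * 100, 2)
)

# Task 3: sort products by total sales in descending order
product_totals = (sales_df.groupBy("product")
    .agg(F.sum("sales").alias("total_sales"))
    .orderBy(F.col("total_sales").desc()))
```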
Example:
Used by banks to analyze loan sales across regions.
🔹 SQL
1. Match Outcome Query
Question:
Given a table with columns Team1, Team2, and Won (the winning team), write a query to get each team’s total wins and losses.
Answer:
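One way is to list every appearance of a team (via UNION ALL) and then aggregate; the table name matches is a placeholder, and the query is wrapped in spark.sql so it runs in a Databricks notebook:

```python
result = spark.sql("""
    WITH results AS (
        SELECT Team1 AS team, CASE WHEN Team1 = Won THEN 1 ELSE 0 END AS win FROM matches
        UNION ALL
        SELECT Team2 AS team, CASE WHEN Team2 = Won THEN 1 ELSE 0 END AS win FROM matches
    )
    SELECT team,
           SUM(win)            AS wins,
           COUNT(*) - SUM(win) AS losses
    FROM results
    GROUP BY team
""")
result.show()
```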
Why this is useful:
Great practice for self-joins and aggregations, useful in sports analytics or campaign win/loss summaries.
2. Top Earners by Department
Question:
From the employee and dept tables, return the top 3 highest-paid employees in each department.
Answer:
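A sketch using a window function (column names such as emp_name, salary, dept_id, and dept_name are illustrative):

```python
top_earners = spark.sql("""
    SELECT dept_name, emp_name, salary
    FROM (
        SELECT d.dept_name,
               e.emp_name,
               e.salary,
               DENSE_RANK() OVER (PARTITION BY d.dept_name ORDER BY e.salary DESC) AS rnk
        FROM employee e
        JOIN dept d ON e.dept_id = d.dept_id
    ) ranked
    WHERE rnk <= 3
""")
top_earners.show()
```

DENSE_RANK keeps salary ties together; swap in ROW_NUMBER if exactly three rows per department are required.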
Example:
Used in HR systems to reward top performers in each department.
3. Duplicate Salaries Check
Question:
Find all employees who share the same salary within the same department.
Answer:
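One possible query (assumes an employee table with dept_id and salary columns):

```python
duplicate_salaries = spark.sql("""
    SELECT e.*
    FROM employee e
    JOIN (
        SELECT dept_id, salary
        FROM employee
        GROUP BY dept_id, salary
        HAVING COUNT(*) > 1
    ) dup
      ON e.dept_id = dup.dept_id AND e.salary = dup.salary
""")
duplicate_salaries.show()
```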
Use Case:
Helpful in payroll audit systems to detect anomalies.
🔹 Python
1. Longest Substring Without Repeating Characters
Question:
Write a function to return the longest substring without repeating characters.
Answer:
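A sliding-window sketch in plain Python:

```python
def longest_unique_substring(s: str) -> str:
    """Return the longest substring of s without repeating characters."""
    last_seen = {}              # character -> index of its latest occurrence
    start = 0                   # left edge of the current window
    best_start, best_len = 0, 0

    for i, ch in enumerate(s):
        # If ch already appears inside the current window, shrink the window
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1
        last_seen[ch] = i
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1

    return s[best_start:best_start + best_len]

print(longest_unique_substring("abcabcbb"))  # -> "abc"
```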
2. What is a Decorator in Python?
Answer:
A decorator is a function that modifies the behavior of another function without changing its code.
Example:
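A simple timing decorator as a sketch:

```python
import time
from functools import wraps

def timed(func):
    """Log how long the wrapped function takes to run."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.3f}s")
        return result
    return wrapper

@timed
def load_report():
    time.sleep(0.5)
    return "done"

load_report()  # prints something like: load_report took 0.501s
```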
Use Case:
Common in web apps for logging, security checks, or timing execution.
🔹 Databricks & Spark
1. What Cluster Types are Available?
- All-Purpose Cluster: For development and interactive exploration
- Job Cluster: For production ETL jobs (spins up temporarily)
- Shared Cluster: Multiple users can access (training, sandbox)
Example:
Healthcare analysts use all-purpose clusters; data engineers use job clusters for ETL pipelines.
2. Workflow Reliability in Databricks
How to handle failures:
- Use the Job UI to monitor runs
- Configure email/Slack alerts
- Apply retry policies
- Use checkpointing and Delta Lake time travel (see the sketch below)
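To illustrate the last point, a minimal Delta Lake time travel sketch (the table path is a placeholder):

```python
# Re-read an earlier version of a Delta table after a bad write
previous = (spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("/delta/transactions"))

# Or go back to a specific point in time
yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01")
    .load("/delta/transactions"))
```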
3. Unity Catalog
What it is:
A centralized governance layer for managing data access and auditing across all workspaces.
Benefits:
- Row/column-level security (see the GRANT example below)
- Lineage tracking
- Easier compliance (e.g., HIPAA, SOX in banking)
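Access is granted with standard SQL; a sketch, where the catalog, schema, table, and group names are placeholders:

```python
# Give the analysts group read access to a Unity Catalog table
spark.sql("GRANT SELECT ON TABLE main.claims.patients TO `analysts`")
```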
4. Cache vs. Persist in Spark
| Method | Storage Level | When to Use |
|---|---|---|
| cache() | Memory only (default) | Small data, faster access |
| persist() | Configurable, e.g. memory + disk | Large data, fault tolerance |
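A quick sketch of both (paths are placeholders):

```python
from pyspark import StorageLevel

# cache(): keep a small, frequently reused DataFrame readily available
lookup_df = spark.read.parquet("/data/lookup").cache()

# persist(): pick the storage level explicitly, spilling to disk if memory is tight
big_df = spark.read.parquet("/data/transactions").persist(StorageLevel.MEMORY_AND_DISK)

big_df.count()      # an action materializes the cached/persisted data
big_df.unpersist()  # release the storage when done
```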
5. What is Serialization in Spark?
Serialization is the process of converting objects into bytes so they can be sent across the network or written to disk.
Why it matters:
- Spark runs on distributed nodes
- Efficient serialization (like Kryo) reduces overhead (see the config sketch below)
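Kryo can be enabled through Spark configuration, as sketched below (on Databricks this is usually set in the cluster's Spark config instead):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("kryo-example")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())
```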
6. Spark Optimization Techniques
- Broadcast joins for small tables (example below)
- Predicate pushdown (filter early)
- Caching for reused DataFrames
- Partition pruning
- Use Delta format for performance
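For example, a broadcast join combined with an early filter (the DataFrame names are placeholders):

```python
from pyspark.sql.functions import broadcast

# Filter early (predicate pushdown) and broadcast the small lookup table
result = (transactions_df
    .filter("txn_date >= '2024-01-01'")
    .join(broadcast(country_codes_df), "country_code"))
```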
7. Join Strategies
| Join Type | Best For |
|---|---|
| Broadcast Join | Small tables (e.g., country codes) |
| Shuffle Join | Large tables (customer + transaction) |
8. Schema Evolution in Autoloader
What it is:
Auto Loader can detect new columns that appear in incoming files and evolve the target table's schema automatically.
Use Case:
When banks add new columns such as currency or channel, the schema adapts automatically without manual DDL changes.
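A sketch of Auto Loader with schema evolution enabled (paths are placeholders):

```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/txn_schema")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/raw/transactions"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/txn")
    .option("mergeSchema", "true")
    .start("/delta/transactions"))
```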
🔹 Project Walkthrough
Describe Your Last Project
Answer:
In my last project, I built a data lake for a hospital chain.
- Data Sources: JSON from hospital EHRs, CSV from insurance providers, real-time HL7 API data
- Formats Used: JSON, CSV, Parquet
- Transformations:
  - De-identified patient PII
  - Applied ICD code normalization
  - Used Delta Lake for SCD2 patient history
  - Built ML pipeline for readmission risk scoring
- Governance: Unity Catalog to manage row-level access by role (Doctor, Analyst, Admin)