ETL Failures & Monitoring (Azure-specific)
🔹 How do you handle job failures in an ETL pipeline?
Tools: Azure Data Factory, Azure Monitor, Log Analytics, Application Insights
Approach:
- Use ADF's built-in alerts for activity failures (email, Logic Apps, Azure Monitor).
- Enable diagnostic logging to send logs to Log Analytics.
- Use ADF's retry policy to auto-retry transient failures.
- Use custom error handling with IfCondition, Switch, and Fail activities.
Example:
An ADF copy activity failed due to a schema mismatch. The pipeline’s OnFailure path was configured to log the error to a Log Analytics table and notify the team via an Azure Logic App (email + Teams).
🔹 What steps do you take when a pipeline is running slower than expected?
Tools: ADF Integration Runtime, Azure Monitor, Azure Databricks, Synapse
Approach:
- Analyze ADF activity run times via the Monitor tab or exported logs.
- Check IR (Integration Runtime) performance (self-hosted or Azure-hosted).
- Optimize Databricks notebooks (use caching, cluster sizing).
- Partition data in Synapse / Databricks to improve load performance.
Example:
A pipeline that copied 100 GB to Synapse slowed down. I switched to a higher-performance Azure IR, batched files, and parallelized the copy activity using foreach with concurrency enabled.
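For the Databricks side of that tuning (caching and repartitioning), here is a minimal PySpark sketch; the storage paths, column names, and partition count are illustrative assumptions, not values from the original pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the staged files once and cache them when several downstream
# transformations in the same notebook reuse the same input.
df = spark.read.parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/sales/")
df.cache()
df.count()  # materialize the cache

# Repartition on the column driving the heaviest join/aggregation so work is
# spread evenly across the cluster, then write partitioned output.
(df.repartition(64, "customer_id")
   .write.mode("overwrite")
   .partitionBy("load_date")
   .parquet("abfss://curated@<storageaccount>.dfs.core.windows.net/sales/"))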
🔹 What if a scheduled job didn't trigger?
Tools: ADF Triggers, Azure Monitor Alerts
Approach:
- Validate trigger status in ADF (enabled, conditions met).
- Check last trigger run logs.
- Set alerts using Azure Monitor to notify if a pipeline hasn't run on schedule.
Example:
A pipeline with a tumbling window trigger missed execution because the previous run failed. I enabled dependency chaining using dependsOn and added logic to catch the first failure.
🔹 How do you manage data pipeline dependencies?
Tools: ADF dependency conditions, Data Factory pipeline orchestration
Approach:
- Use ADF's dependency conditions in pipeline triggers.
- Use Wait, IfCondition, or Until activities to manage flows.
- Use Azure Event Grid + Logic App for event-driven dependencies.
Example:
A downstream ADF pipeline waits for a file to land in ADLS from an upstream process. I used Event Grid trigger to launch the pipeline on blob creation.
🔹 How do you monitor SLAs for pipelines?
Tools: Azure Monitor, Log Analytics, Application Insights
Approach:
- Track pipeline execution time using ADF logs in Log Analytics.
- Create custom Kusto queries to find SLA breaches.
- Send alerts if a pipeline exceeds its expected runtime.
Example:
I created an Azure Monitor Workbook dashboard that tracks all pipeline runs, highlighting those exceeding the SLA. Alerts are sent via email and Teams.
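As a sketch of how such an SLA check could be queried programmatically, assuming diagnostics flow into a resource-specific ADFPipelineRun table in Log Analytics (the workspace ID, the 30-minute SLA, and the exact column names are assumptions):

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Kusto query: pipeline runs from the last day whose duration exceeded the SLA.
kusto = """
ADFPipelineRun
| where Status in ('Succeeded', 'Failed')
| extend DurationMin = datetime_diff('minute', End, Start)
| where DurationMin > 30
| project PipelineName, RunId, Start, End, DurationMin, Status
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=kusto,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)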
💎 Data Quality, Validation & Duplicates (Azure-specific)
🔹 How do you automate data validation during ingestion?
Tools: ADF, Azure Data Lake, Azure Databricks, Azure Purview
Approach:
- Use custom validation notebooks in Azure Databricks.
- In ADF, use Stored Procedure or Data Flow to validate counts, nulls, types.
- Use Azure Purview for profiling and monitoring data quality rules.
Example:
Post ingestion, I used a Databricks notebook to check for duplicate IDs and nulls in critical columns and pushed the result to a Power BI dashboard.
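A minimal sketch of that kind of validation notebook; the Delta path and the critical column names are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/bronze/customers")

# Null counts on critical columns.
null_counts = df.select([
    F.sum(F.col(c).isNull().cast("int")).alias(c)
    for c in ["customer_id", "email"]
]).collect()[0].asDict()

# Duplicate count on the business key.
dup_count = df.groupBy("customer_id").count().filter("count > 1").count()

print(f"Nulls per column: {null_counts}, duplicate customer_ids: {dup_count}")

# Failing the notebook also fails the calling ADF activity, surfacing the issue.
if dup_count > 0 or any(v > 0 for v in null_counts.values()):
    raise ValueError("Data validation failed; see counts above.")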
🔹 What would you do if there’s a mismatch in record counts?
Tools: ADF, SQL Database, Azure Synapse
Approach:
- Use Lookup and IfCondition in ADF to compare source vs. destination.
- Trigger alerts or rerun the pipeline if the mismatch exceeds a threshold.
Example:
In a CDC-based pipeline, I compared source (SQL DB) and sink (Synapse) row counts after ingestion. If mismatch >5%, the pipeline failed intentionally for manual review.
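The equivalent check can also run in a notebook rather than via ADF Lookup/IfCondition; a sketch under the assumption that both tables are readable from Databricks (table names and the 5% threshold are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reads abbreviated: in practice the source would come via JDBC and the sink
# via the Synapse connector or a Delta/external table.
source_count = spark.read.table("source_db.orders").count()
sink_count = spark.read.table("synapse_mirror.orders").count()

mismatch_pct = abs(source_count - sink_count) / max(source_count, 1) * 100

# Fail intentionally when the mismatch exceeds the threshold so the run
# surfaces for manual review instead of silently succeeding.
if mismatch_pct > 5:
    raise RuntimeError(
        f"Row count mismatch {mismatch_pct:.1f}% "
        f"(source={source_count}, sink={sink_count})"
    )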
🔹 How do you handle duplicates?
Tools: Azure Databricks, Synapse, Data Flow
Approach:
- Use Data Flow with Surrogate Key and Window functions for deduplication.
- In Databricks, use dropDuplicates() or ROW_NUMBER() logic.
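A minimal PySpark sketch of both deduplication options; the key and ordering columns are illustrative:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/silver/orders")

# Option 1: drop exact duplicates on the business key.
deduped = df.dropDuplicates(["order_id"])

# Option 2: keep only the latest record per key (ROW_NUMBER-style logic).
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
latest = (df.withColumn("rn", F.row_number().over(w))
            .filter("rn = 1")
            .drop("rn"))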
💎 Performance & Optimization (Azure-specific)
🔹 How do you handle Spark job memory issues in Azure Databricks?
Tools: Azure Databricks
Approach:
- Monitor jobs via the Spark UI on Databricks.
- Tune cluster configs: memory, cores, autoscaling.
- Use persist() and broadcast() wisely.
Example:
A job failed with an OOM error during a join. I broadcast the smaller table and repartitioned the larger dataset to resolve the memory issue.
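A sketch of that fix in PySpark; the table paths, join key, and partition count are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

large = spark.read.format("delta").load("/mnt/silver/transactions")
small = spark.read.format("delta").load("/mnt/silver/dim_customer")

# Broadcasting the small dimension avoids shuffling the large side for the join;
# repartitioning the large table first keeps task sizes even.
joined = (large.repartition(200, "customer_id")
               .join(F.broadcast(small), "customer_id", "left"))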
🔹 How do you handle skewed data in Spark?
Approach:
- Apply salting: add a random prefix to keys.
- Repartition before joins.
Example:
For a large sales table joined on customer_id, 10% of records had one customer ID. I salted the key with a mod hash and recombined after join.
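A minimal salting sketch (the original used a mod-hash salt; this version uses a random salt, and the bucket count, paths, and column names are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.read.format("delta").load("/mnt/silver/sales")
customers = spark.read.format("delta").load("/mnt/silver/customers")

SALT_BUCKETS = 10

# Spread the hot key across buckets by appending a salt on the skewed side...
sales_salted = sales.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("customer_id"), (F.rand() * SALT_BUCKETS).cast("int")))

# ...and replicate the other side across all buckets so every salted key matches.
customers_salted = (customers
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
    .withColumn("salted_key", F.concat_ws("_", F.col("customer_id"), F.col("salt")))
    .drop("customer_id", "salt"))

joined = sales_salted.join(customers_salted, "salted_key").drop("salted_key")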
💎 Schema & Metadata Management (Azure-specific)
🔹 How do you handle schema evolution in ADLS (Parquet/Avro)?
Tools: Azure Data Lake Gen2, Databricks Delta Lake
Approach:
- Use Delta Lake with schema evolution (mergeSchema in Databricks).
- Store schemas in Azure Purview or a custom registry.
Example:
Schema changes in a Parquet source were handled via Delta Lake’s automatic schema evolution when writing to Bronze layer.
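A sketch of that Bronze write with schema evolution enabled; the paths are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

incoming = spark.read.parquet("/mnt/landing/events/2024-06-01/")

# mergeSchema lets newly added source columns be appended to the Delta table's
# schema instead of failing the write on a mismatch.
(incoming.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/bronze/events"))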
🔹 How do you manage metadata across tools (e.g., ADF, Synapse)?
Tools: Azure Purview, Azure Data Catalog
Approach:
- Use Azure Purview for cataloging and lineage tracking.
- Automate metadata ingestion from Synapse, SQL, ADLS.
💎 Partitioning & Scalability (Azure-specific)
🔹 How do you decide partitioning strategy for a Delta table?
Tools: Delta Lake on Azure Databricks
Approach:
- Partition by date or region based on query pattern.
- Avoid high-cardinality columns.
- Periodically OPTIMIZE with Z-ORDER.
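A sketch combining those choices for a Delta table; the table name, partition column, and Z-ORDER column are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/silver/orders")

# Partition on a low-cardinality column matching the dominant filter (date),
# not on a high-cardinality key like order_id.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("order_date")
   .saveAsTable("gold.orders"))

# Periodic maintenance: compact small files and Z-ORDER a frequently filtered column.
spark.sql("OPTIMIZE gold.orders ZORDER BY (customer_id)")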
🔹 What if ingestion from a third-party API fails?
Tools: ADF Web Activity, REST Connector, Logic Apps
Approach:
- Use retry policy on the API activity in ADF.
- Log failures to Azure Monitor.
- Add Until or conditional looping.
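ADF's retry policy is set on the activity itself; when the API call happens in a notebook instead, an analogous retry-with-backoff loop could look like this (the endpoint, retry count, and backoff values are illustrative):

import time
import requests

MAX_RETRIES = 5

def fetch_with_retry(url: str) -> dict:
    """Call the API, retrying transient failures with exponential backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == MAX_RETRIES:
                raise  # surface the failure so monitoring/alerting picks it up
            time.sleep(2 ** attempt)  # back off 2s, 4s, 8s, ...

data = fetch_with_retry("https://api.example.com/v1/orders")  # illustrative endpoint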
🔹 How do you handle malformed files in ADLS?
Tools: ADF, Databricks
Approach:
- Enable faultTolerance mode in ADF's Data Flow.
- Use badRecordsPath in Spark to capture corrupted records.
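A sketch of the Spark-side handling; the source format and paths are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# badRecordsPath (a Databricks option) diverts malformed rows to a quarantine
# location as JSON files instead of failing the whole read.
df = (spark.read
        .format("csv")
        .option("header", "true")
        .option("badRecordsPath", "/mnt/quarantine/bad_records/")
        .load("/mnt/landing/vendor_feed/"))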
🔹 How do you handle late-arriving data in streaming jobs?
Tools: Azure Event Hubs + Azure Databricks Structured Streaming
Approach:
- Use watermarks and windowed aggregations.
- Write late events to a separate table or reprocess batch-wise.
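A Structured Streaming sketch with a watermark; for brevity it streams from a Bronze Delta table rather than showing the Event Hubs connector setup, and the column names and 15-minute tolerance are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.format("delta").load("/mnt/bronze/clickstream")

# Tolerate events arriving up to 15 minutes late; anything later is dropped
# from these aggregates and can be handled by a separate reprocessing path.
agg = (events
    .withWatermark("event_time", "15 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "page")
    .count())

(agg.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/clickstream_agg")
    .start("/mnt/gold/clickstream_5min"))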
💎 Consistency, Security, & Recovery (Azure-specific)
🔹 How do you resolve data consistency issues?
Tools: Azure SQL, Synapse, Data Factory
Approach:
- Use hash comparison between source and sink.
- Use CDC or snapshots for delta reconciliation.
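A sketch of the hash comparison in PySpark; the key and compared columns are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

source = spark.read.table("staging.orders")                 # illustrative source
sink = spark.read.format("delta").load("/mnt/gold/orders")  # illustrative sink

def with_row_hash(df):
    # Hash the business columns so rows can be compared without wide joins.
    return df.withColumn(
        "row_hash", F.sha2(F.concat_ws("||", "order_id", "status", "amount"), 256))

diff = (with_row_hash(source).select("order_id", "row_hash")
        .join(with_row_hash(sink)
                .select("order_id", F.col("row_hash").alias("sink_hash")),
              "order_id", "full_outer")
        .filter("row_hash is null or sink_hash is null or row_hash != sink_hash"))

print(f"Rows differing between source and sink: {diff.count()}")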
🔹 What if your pipeline deletes/overwrites production data?
Approach:
- Implement soft deletes or backups in Synapse.
- Version your data with Delta Lake time travel.
- Use ADF's global parameters to separate environments and avoid accidental prod writes.
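A sketch of the Delta Lake recovery path using time travel; the table path and version number are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect history to find the last good version before the bad write.
spark.sql("DESCRIBE HISTORY delta.`/mnt/gold/customers`").show(truncate=False)

# Read the table as it was at that version (time travel)...
good = (spark.read.format("delta")
          .option("versionAsOf", 42)  # illustrative version number
          .load("/mnt/gold/customers"))

# ...or restore the table in place.
spark.sql("RESTORE TABLE delta.`/mnt/gold/customers` TO VERSION AS OF 42")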
🔹 How do you secure sensitive data like PAN/Aadhaar?
Tools: Azure Key Vault, Data Masking, Encryption
Approach:
- Store secrets in Azure Key Vault.
- Encrypt data at rest using Azure Storage Encryption.
- Apply Dynamic Data Masking in SQL/Synapse.
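A sketch of pulling a connection secret inside a Databricks notebook, assuming a secret scope backed by Azure Key Vault already exists (the scope, key, server, and table names are illustrative):

# dbutils and spark are injected automatically inside a Databricks notebook.
jdbc_password = dbutils.secrets.get(scope="kv-prod-scope", key="sql-etl-password")

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=etl"

df = (spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.customers")
        .option("user", "etl_reader")
        .option("password", jdbc_password)  # pulled from Key Vault, never hard-coded
        .load())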
🔹 How do you backfill historical data?
Tools: ADF, Databricks, Azure SQL
Approach:
- Use parameterized pipelines for date ranges.
- Load in chunks (e.g., monthly).
- Use IfCondition to skip already-loaded data.
Example:
To backfill 3 years of logs into Synapse, I created a parameterized ADF pipeline that took start_date and end_date, looping with a ForEach activity.
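Outside ADF, the same chunked, skip-if-loaded pattern can be sketched in a Databricks notebook; the date range, paths, and load_month marker column are illustrative, and the target Delta table is assumed to already exist:

from datetime import date
from dateutil.relativedelta import relativedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

start, end = date(2021, 1, 1), date(2024, 1, 1)
current = start

while current < end:
    nxt = current + relativedelta(months=1)
    month_key = current.strftime("%Y-%m")

    # Skip months that were already loaded (mirrors the IfCondition check in ADF).
    already_loaded = (spark.read.format("delta").load("/mnt/gold/logs")
                        .filter(F.col("load_month") == month_key)
                        .limit(1).count() > 0)

    if not already_loaded:
        chunk = (spark.read.parquet("/mnt/raw/logs/")
                   .filter((F.col("event_date") >= str(current)) &
                           (F.col("event_date") < str(nxt)))
                   .withColumn("load_month", F.lit(month_key)))
        chunk.write.format("delta").mode("append").save("/mnt/gold/logs")

    current = nxt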