Posts

PySpark Data Skew Handling – Complete Guide

1. Problem Statement: Skewed Aggregation

    df.groupBy("user_id").count()

❗ Issue
- One user_id contains ~40% of the total data
- Spark sends the same key to the same partition

Result:
- One task becomes extremely heavy
- Other tasks finish early
- Straggler problem → slow job

🧠 2. Why Skew Happens

Spark distributes data based on keys:
- user_id = A → goes to one partition
- user_id = B → goes to another partition

If A = 40% of the data, then the partition for A is huge → bottleneck.

🔍 3. How to Identify Skew (Practical Approach)

✅ Method 1: Distribution Check

    from pyspark.sql.functions import count

    df.groupBy("user_id") \
        .agg(count("*").alias("cnt")) \
        .orderBy("cnt", ascending=False) \
        .show(10)

👉 Example Output:

    user_id   cnt
    A         40,000,000   ← skew
    B         1,000
    C         900

✅ Method 2: Percentile Analysis

    df.groupBy("user_id") \
        .count() \
        .selectExpr( ...
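The straggler effect described above can be reproduced without a cluster. The following is a minimal pure-Python sketch, not Spark itself: the sample keys, the helper names `key_distribution` and `partition_sizes`, and the use of `zlib.crc32` as a stand-in for Spark's own hash partitioner are all assumptions made for illustration.

```python
import zlib
from collections import Counter

# Hypothetical sample data: key "A" makes up 40% of 1,000 rows,
# the other 600 rows each have a unique key.
keys = ["A"] * 400 + [f"u{i}" for i in range(600)]


def key_distribution(keys, top_n=3):
    """Return (key, count, share_of_total) for the most frequent keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [(k, c, c / total) for k, c in counts.most_common(top_n)]


def partition_sizes(keys, num_partitions):
    """Assign each row to a partition by hashing its key.

    crc32 is only a deterministic stand-in for a real hash partitioner;
    the point is that equal keys always land in the same partition.
    """
    sizes = [0] * num_partitions
    for k in keys:
        sizes[zlib.crc32(k.encode("utf-8")) % num_partitions] += 1
    return sizes


top = key_distribution(keys)
sizes = partition_sizes(keys, num_partitions=8)

print("heaviest key:", top[0])
print("partition sizes:", sizes)
print("largest / average partition:", max(sizes) / (sum(sizes) / len(sizes)))
```

Because every row with key "A" hashes to the same partition, one partition holds at least 400 of the 1,000 rows regardless of how many partitions are configured. This is the same pattern the distribution check is meant to surface on real data.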

DAX QUERIES

🧮 DAX QUERIES – COMPLETE INTERVIEW GUIDE (AAS / Power BI)

Applies to Azure Analysis Services and Power BI

1️⃣ DAX BASICS (They expect this instantly)

✅ Total Sales

    Total Sales = SUM ( FactSales[SalesAmount] )

🗣 Say: “This is a simple aggregation evaluated in filter context.”

✅ Total Orders

    Total Orders = COUNT ( FactSales[OrderID] )

2️⃣ CALCULATE – MOST IMPORTANT DAX FUNCTION

✅ Sales for Current Year

    Sales CY = CALCULATE ( [Total Sales], DimDate[Year] = YEAR ( TODAY() ) )

🗣 Senior explanation: “CALCULATE modifies the filter context before evaluating the measure.”

✅ Sales for a Specific Region

    Sales US = CALCULATE ( [Total Sales], DimRegion[Country] = "USA" )

3️⃣ TIME INTELLIGENCE (GUARANTEED QUESTIONS)

✅ Year-to-Date (YTD)

    Sales YTD = TOTALYTD ( [Total Sales], DimDate[Date] )

✅ Month-to-Date (MTD)

    Sales MTD = TOTALMTD ( [Total Sales], DimDate[Date] )

✅ Previous Year Sales

    Sales PY = CALCULATE ( [T...
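The idea that CALCULATE modifies the filter context before the measure is evaluated can be mimicked in a few lines of Python. This is only a rough analogy, not how the DAX engine works: the flattened sample rows, the `total_sales` and `calculate` helpers, and the restriction to simple column = value filters are all assumptions for illustration.

```python
# Hypothetical, flattened sample data (a real model would use
# FactSales joined to DimDate and DimRegion).
rows = [
    {"Country": "USA", "Year": 2023, "SalesAmount": 100.0},
    {"Country": "USA", "Year": 2024, "SalesAmount": 150.0},
    {"Country": "Canada", "Year": 2023, "SalesAmount": 30.0},
    {"Country": "Canada", "Year": 2024, "SalesAmount": 70.0},
]


def total_sales(rows, filter_context):
    """Analogue of [Total Sales]: sum the rows visible in the filter context."""
    return sum(
        r["SalesAmount"]
        for r in rows
        if all(r[col] == value for col, value in filter_context.items())
    )


def calculate(measure, rows, filter_context, **filters):
    """Analogue of CALCULATE with simple column = value filters.

    The new filters are merged into a copy of the current context and
    replace any existing filter on the same column; only then is the
    measure evaluated.
    """
    new_context = {**filter_context, **filters}
    return measure(rows, new_context)


# Outer context: a report filtered to Year = 2024.
context = {"Year": 2024}
print(total_sales(rows, context))                            # 220.0
print(calculate(total_sales, rows, context, Country="USA"))  # 150.0
```

In this sketch the "Sales US" pattern corresponds to `calculate(total_sales, rows, context, Country="USA")`: the year filter from the outer context is kept, and a country filter is added or overwritten before the sum runs.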

Azure Analysis Services

🔷 Azure Analysis Services (AAS) – TIERS EXPLAINED

1️⃣ What Are AAS Tiers?

Azure Analysis Services tiers define:
- Compute power
- Memory capacity
- Concurrent users
- Query performance
- Cost

🗣 Senior line: “AAS tiers help balance performance, concurrency, and cost based on BI workload.”

2️⃣ Azure Analysis Services Tier Types

AAS has three main pricing tiers:

    Tier            Purpose
    Developer (D)   Development / Testing
    Basic (B)       Small production workloads
    Standard (S)    Enterprise production

3️⃣ Developer Tier (D1)

    Feature    Details
    Use Case   Dev / QA
    SLA        ❌ No SLA
    Scale      Fixed
    Cost       Low

🗣 Interview explanation: “Developer tier is used for model development and testing, never production.”

4️⃣ Basic Tier (B1–B2)

    Feature       B1           B2
    Use Case      Small prod   Medium prod
    Memory        Low          Medium
    Concurrency   Limited      Moderate
    SLA           ✅ Yes

🗣 When to use: “For limited users and simpler models.”

5️⃣ Standard Tier (S0–S9) ⭐ MOST IMPORTANT

    Feature    Description
    Use Case   Enterprise workloads
    Me...