Data Analysis Using Python (EDA, Univariate, Bivariate)

A collection of commonly used code snippets for exploratory data analysis, focusing on univariate, bivariate, and general university analysis tasks.

These examples use Python with popular libraries such as pandas, NumPy, Matplotlib, and seaborn.

Let's get started with the EDA:


1. Exploratory Data Analysis (EDA):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data set

df = pd.read_csv('data.csv')

# Display information


df.head()        # First 5 records
df.tail()        # Last 5 records
df.info()        # Summary of the dataset (columns, dtypes, non-null counts)
df.describe()    # Statistical summary of the numeric columns

# Checking for missing values

df.isnull().sum()     # Number of null values in each column

# Visualise the missing values:
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap")
plt.show()
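
The null counts above are often easier to compare as percentages. Here is a minimal sketch of that idea. The DataFrame df_example and its columns are made up for illustration, so swap in your own df.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original post); replace with your own df.
df_example = pd.DataFrame({
    "age": [21, np.nan, 23, 22],
    "score": [88.0, 92.5, np.nan, np.nan],
    "course": ["Math", "Physics", None, "Math"],
})

# Count and percentage of missing values per column, worst column first.
missing_summary = pd.DataFrame({
    "missing_count": df_example.isnull().sum(),
    "missing_pct": df_example.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

A column with a very high missing percentage may be a candidate for dropping rather than filling, but that depends on the dataset.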


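# Handling missing values:

# Fill the missing values with the median. Assign the result back to the column:
# fillna(inplace=True) on a selected column may not update df in recent pandas versions.
df['column_name'] = df['column_name'].fillna(df['column_name'].median())

# Drop rows with missing values
df.dropna(inplace=True)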
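
The median only makes sense for numeric columns. For a categorical column, one common option is to fill with the most frequent value (the mode). The sketch below assumes a small made-up DataFrame with a numeric score column and a categorical course column. Treat it as one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df_example = pd.DataFrame({
    "score": [50.0, np.nan, 70.0, 90.0],
    "course": ["Math", "Math", None, "Physics"],
})

# Numeric column: fill missing values with the median of the known values.
df_example["score"] = df_example["score"].fillna(df_example["score"].median())

# Categorical column: fill missing values with the most frequent value.
# mode() returns a Series because ties are possible, so take the first entry.
most_common_course = df_example["course"].mode()[0]
df_example["course"] = df_example["course"].fillna(most_common_course)

print(df_example)
```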



# Data distribution

# Histogram
sns.histplot(df['column_name'], bins=30, kde=True)
plt.title("Distribution of Column Name")
plt.show()
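
Alongside the histogram, a single skewness number can help confirm what the plot suggests. This is a rough sketch on made-up values. The Series name values is an assumption for the example.

```python
import pandas as pd

# Hypothetical values with one very large observation on the right.
values = pd.Series([1, 2, 2, 3, 3, 3, 50])

# Sample skewness: positive means a longer right tail, negative a longer left tail.
skewness = values.skew()

print("Skewness:", round(skewness, 2))
print("Mean:", round(values.mean(), 2), "Median:", values.median())
```

When the mean sits well above the median, as here, the median is usually the safer choice for filling missing values.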


# Boxplot
sns.boxplot(x=df['column_name'])
plt.title("Box Plot of Column Name")
plt.show()

# Outlier detection


# Using the IQR method: values more than 1.5 * IQR below Q1 or above Q3 are flagged
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df['column_name'] < Q1 - 1.5 * IQR) | (df['column_name'] > Q3 + 1.5 * IQR)]
print("Number of outliers:", len(outliers))
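
If you need the same check on several columns, the IQR rule above can be wrapped in a small helper. The function name iqr_outliers and the sample numbers are made up for this sketch. The multiplier k defaults to the usual 1.5.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` that fall outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical scores with one obviously extreme value.
scores = pd.Series([10, 12, 11, 13, 12, 11, 100])

outliers_example = iqr_outliers(scores)
print("Number of outliers:", len(outliers_example))
print(outliers_example)
```

On a real dataset you could call it as iqr_outliers(df['column_name']) for each numeric column of interest.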


2. University Analysis (commonly used code):

# Analysing student demographics


# Countplot for categorical variables (gender, course)
sns.countplot(x='gender', data=df)
plt.title("Gender Distribution")
plt.show()


# Order the bars from the most to the least popular course
sns.countplot(x='course', data=df, order=df['course'].value_counts().index)
plt.title("Course Popularity")
plt.xticks(rotation=45)
plt.show()
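
The count plots show the distribution visually. If you also want the numbers behind them, value_counts gives both counts and shares. A small sketch, assuming a made-up gender column:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_example = pd.DataFrame({"gender": ["F", "M", "F", "F"]})

# Absolute counts per category.
gender_counts = df_example["gender"].value_counts()

# Share of each category as a percentage of all rows.
gender_share = df_example["gender"].value_counts(normalize=True) * 100

print(gender_counts)
print(gender_share)
```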


#Analyzing scores

# Distribution of scores:
sns.histplot(df['score'], bins=20, kde=True)
plt.title("Score Distribution")
plt.show()

#top 10 students based on scores
top_students = df.nlargest(10,'score')
print(top_students[['name','score']])


# Exam performance analysis:

# Average score by course:
avg_score = df.groupby('course')['score'].mean().sort_values(ascending=False)
print(avg_score)

# Visualization:
avg_score.plot(kind='bar', figsize=(10, 6))
plt.title("Average Scores by Course")
plt.ylabel("Average score")
plt.show()
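
An average on its own can hide how many students are behind it. One way to add that context is to compute the count, mean, and median per course in a single step. The sample rows below are invented for the sketch.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_example = pd.DataFrame({
    "course": ["Math", "Math", "Physics", "Physics", "Physics"],
    "score": [80, 90, 60, 70, 80],
})

# Several summary statistics per course, highest mean first.
course_summary = (
    df_example.groupby("course")["score"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)

print(course_summary)
```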




3. Bivariate Analysis (commonly used code):

# Correlation analysis:
# Correlation heatmap (numeric columns only; text columns such as name or course cannot be correlated)
correlation = df.select_dtypes(include='number').corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

# Scatter plots
sns.scatterplot(x='hours_studies', y='score', data=df)
plt.title("Hours Studied vs Score")
plt.show()
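
To put a number on the relationship shown in the scatter plot, you can compute the Pearson correlation for just that pair of columns. The values below are made up. The column names follow the ones used in the scatter plot.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_example = pd.DataFrame({
    "hours_studies": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 78],
})

# Pearson correlation between the two columns (Series.corr defaults to Pearson).
pearson_r = df_example["hours_studies"].corr(df_example["score"])

print("Pearson r:", round(pearson_r, 3))
```

A value close to 1 points to a strong positive linear relationship, but it does not show that studying more causes higher scores.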

















