Retail Product Data Analysis
Project Overview: This project performs exploratory data analysis (EDA) and generates business insights from a retail product dataset that contains missing values. The emphasis is on identifying, analyzing, and handling the missing data: imputation techniques were chosen per column based on data distribution and business reasoning rather than blanket filling. The project demonstrates end-to-end data analysis skills using Python, SQL, and Power BI.
Dataset: Retail Product Dataset with Missing Values
- Records: 4,361 entries across 5 columns: category, price, rating, discount, stock
- The dataset intentionally includes missing values to simulate real-world data challenges.
- Source: Kaggle - https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
Data Cleaning & Preprocessing: The following steps were performed using Python (Pandas):
- Identified missing values in numerical and categorical columns
- Imputed numerical columns using median values
- Imputed categorical columns using mode values
- Removed duplicate records
- Verified data types and corrected inconsistencies
- Created derived fields where necessary for analysis
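The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual notebook: the sample rows are made up (the real data comes from the Kaggle CSV), and the category labels and "In Stock" / "Out of Stock" values are assumptions about how the columns are encoded.

```python
import numpy as np
import pandas as pd

# Made-up sample mimicking the five columns; values are illustrative only.
df = pd.DataFrame({
    "category": ["A", "B", None, "C", "C", "C"],
    "price": [10.0, np.nan, 30.0, 40.0, 1000.0, 1000.0],
    "rating": [4.0, 3.5, np.nan, 4.5, 2.0, 2.0],
    "discount": [5.0, 10.0, 15.0, np.nan, 20.0, 20.0],
    "stock": ["In Stock", "Out of Stock", "In Stock", None,
              "In Stock", "In Stock"],
})

# Step 1: count missing values per column.
missing_counts = df.isna().sum()

# Split columns by type so each group gets its own imputation rule.
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.columns.difference(numeric_cols)

cleaned = df.copy()

# Step 2: median for numeric columns (robust to extreme prices such as 1000).
for col in numeric_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

# Step 3: most frequent value (mode) for categorical columns.
for col in categorical_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

# Step 4: drop exact duplicate rows.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(missing_counts)
print(cleaned)
```

The median is used instead of the mean because a few very high prices would otherwise pull the fill value upward.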
Exploratory Data Analysis (EDA):
- Numerical Analysis: distribution analysis using histograms with KDE; outlier detection using boxplots
- Categorical Analysis: category-wise product count; stock availability comparison (In Stock vs Out of Stock); category-level price comparison using boxplots
- Correlation Analysis: Pearson correlation between Price & Discount, Price & Rating, and Discount & Rating
- Result: All correlations are close to zero, indicating no strong linear relationship between pricing, discounts, and ratings.
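The pairwise Pearson correlations can be computed with pandas along these lines. The numbers below are made up purely to keep the sketch runnable, so they will not reproduce the near-zero values reported for the real dataset.

```python
import pandas as pd

# Made-up values for illustration; not taken from the Kaggle dataset.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "discount": [5.0, 25.0, 10.0, 30.0, 15.0],
    "rating": [4.0, 3.0, 4.5, 2.5, 3.5],
})

# The three column pairs examined in the analysis.
pairs = [("price", "discount"), ("price", "rating"), ("discount", "rating")]

# Series.corr defaults to Pearson; the method is spelled out for clarity.
correlations = {
    f"{a} vs {b}": df[a].corr(df[b], method="pearson") for a, b in pairs
}

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Pearson's r only measures linear association, so a value near zero rules out a strong linear trend but not every possible relationship.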
SQL Analysis: SQL was used to answer business-oriented questions such as:
- Total number of products
- Average price, discount, and rating
- Products with high discounts (≥ 40%)
- Category-wise pricing trends
- Stock availability breakdown
The SQL queries are consolidated and included for easy review and reproducibility.
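As a rough illustration of the questions listed above, the sketch below runs comparable queries against an in-memory SQLite database from Python. The table name `products`, the sample rows, and the assumption that `discount` is stored in percentage points are all hypothetical; the project's own queries may differ.

```python
import sqlite3

# In-memory database with a hypothetical `products` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "category TEXT, price REAL, rating REAL, discount REAL, stock TEXT)"
)

# Made-up rows for illustration only.
rows = [
    ("A", 10.0, 4.0, 5.0, "In Stock"),
    ("B", 20.0, 3.5, 45.0, "Out of Stock"),
    ("C", 30.0, 4.5, 40.0, "In Stock"),
    ("C", 40.0, 2.0, 10.0, "In Stock"),
]
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)

# Total number of products.
total_products = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]

# Average price, discount, and rating.
averages = conn.execute(
    "SELECT AVG(price), AVG(discount), AVG(rating) FROM products"
).fetchone()

# Products with high discounts (>= 40, assuming percentage points).
high_discount = conn.execute(
    "SELECT category, price, discount FROM products "
    "WHERE discount >= 40 ORDER BY discount DESC"
).fetchall()

# Category-wise average price.
price_by_category = conn.execute(
    "SELECT category, AVG(price) FROM products "
    "GROUP BY category ORDER BY category"
).fetchall()

# Stock availability breakdown.
stock_breakdown = conn.execute(
    "SELECT stock, COUNT(*) FROM products GROUP BY stock ORDER BY stock"
).fetchall()

conn.close()

print(total_products, averages)
print(high_discount)
print(price_by_category)
print(stock_breakdown)
```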
Power BI Dashboard: An interactive Power BI dashboard was created to visualize:
- Price and discount distributions (histograms)
- Category-wise price comparison
- Stock availability
- Correlation summary using DAX measures
Key Highlights:
- Custom DAX measures for correlation
- Clean layout with business-focused KPIs
- Designed for stakeholder-friendly interpretation
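For reference, a correlation measure of this kind would typically evaluate the standard Pearson formula over the table rows. The dashboard's actual DAX is not reproduced here, so the expression below is the textbook definition rather than the project's measure, with $x_i$ and $y_i$ standing for two numeric columns (for example price and discount) and $n$ for the number of rows:

$$
r_{xy} = \frac{n \sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i}
{\sqrt{n \sum_{i} x_i^2 - \left(\sum_{i} x_i\right)^2}\;
 \sqrt{n \sum_{i} y_i^2 - \left(\sum_{i} y_i\right)^2}}
$$

Each sum maps naturally onto a row-wise aggregation, which is why the formula is often written in this form when no built-in correlation function is available.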
Tools & Technologies used:
- Python: Pandas, NumPy, Matplotlib, Seaborn
- SQL: SQLite
- Power BI: DAX, interactive dashboards
- Version Control: Git & GitHub
Key Insights:
- Most products fall within a mid-price range with moderate discounts
- Category C dominates the product count
- High discounts do not necessarily correspond to higher ratings
- Price, discount, and rating show no strong linear relationship with one another in this dataset
Conclusion: This project showcases practical data analysis skills, including data cleaning, EDA, SQL querying, and dashboard creation. It reflects real-world scenarios where data is imperfect and insights must be derived through structured analysis.