This project focuses on data loading, exploratory data analysis (EDA), data cleaning, and missing value imputation using the Red Wine Quality dataset from the UCI Machine Learning Repository.
The main goal is to demonstrate good data preprocessing practices before applying Machine Learning models.
- Name: Wine Quality - Red Wine
- Source: UCI Machine Learning Repository
- Observations: 1,599 samples
- Features: 11 physicochemical variables + 1 target variable (quality)
- Python 3
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- SciPy
- Dataset loaded directly from the UCI repository
- Basic inspection of shape, data types, and sample rows
- Descriptive statistics
- Distribution of wine quality
- Alcohol content distribution
- Correlation analysis
- Handling missing values
- Standardizing column names
- Date format unification
- Outlier detection (IQR and z-score)
- Removing duplicates
- Numeric and text normalization
- Feature categorization
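A few of the cleaning steps listed above (standardizing column names, removing duplicates, IQR and z-score outlier detection) can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names, the `k=1.5` and `threshold=3.0` defaults, and the synthetic `toy` frame are assumptions made for the example. `load_red_wine` uses the commonly cited UCI CSV path, which may differ from the URL the project really loads from, and it is not called here so the example runs offline.

```python
import numpy as np
import pandas as pd

# Assumed location of the red wine CSV (semicolon-delimited); not verified
# against the project's own loading code.
UCI_RED_WINE_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def load_red_wine(url: str = UCI_RED_WINE_URL) -> pd.DataFrame:
    """Read the dataset over the network (defined but not called below)."""
    return pd.read_csv(url, sep=";")


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Trim, lower-case, and snake_case the column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    return out


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def zscore_outlier_mask(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """True where the absolute z-score exceeds the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold


# Synthetic stand-in for the real data, with messy column names,
# one extreme alcohol value, and one exact duplicate row.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "Fixed Acidity": rng.normal(8.3, 1.7, 200),
        " Alcohol ": rng.normal(10.4, 1.0, 200),
    }
)
toy.loc[0, " Alcohol "] = 25.0
toy = pd.concat([toy, toy.iloc[[5]]], ignore_index=True)

clean = standardize_columns(toy).drop_duplicates().reset_index(drop=True)
iqr_mask = iqr_outlier_mask(clean["alcohol"])
z_mask = zscore_outlier_mask(clean["alcohol"])

print(clean.columns.tolist())
print("IQR outliers:", int(iqr_mask.sum()), "| z-score outliers:", int(z_mask.sum()))
```

The two detectors need not agree: the IQR rule depends only on quartiles, while the z-score rule is pulled by the mean and standard deviation, which extreme values themselves inflate.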
Artificial missing values (15%) were introduced and then imputed using:
- Mean imputation
- Median imputation
- Mode imputation
- KNN imputation
The methods were compared, and mean imputation was selected for the final dataset.
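One way such a comparison could be set up is sketched below: hide a known share of values, impute them with each strategy, and measure the error on the hidden cells only. The synthetic columns, the fixed random seed, the cell-wise 15% masking, `n_neighbors=5`, and the scaled RMSE metric are all assumptions for illustration. The project's actual comparison criteria are not described here, and this sketch does not reproduce its conclusion.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Synthetic, loosely correlated features standing in for the wine data.
rng = np.random.default_rng(42)
n = 300
alcohol = rng.normal(10.4, 1.0, n)
truth = pd.DataFrame(
    {
        "alcohol": alcohol,
        "density": 1.0 - 0.002 * alcohol + rng.normal(0, 0.0005, n),
        "ph": rng.normal(3.3, 0.15, n),
    }
)

# Hide roughly 15% of the cells completely at random.
mask = rng.random(truth.shape) < 0.15
with_gaps = truth.mask(mask)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "mode": SimpleImputer(strategy="most_frequent"),
    "knn": KNNImputer(n_neighbors=5),
}

truth_np = truth.to_numpy()
scale = truth_np.std(axis=0)  # put columns on a comparable scale

scores = {}
for name, imputer in imputers.items():
    filled = imputer.fit_transform(with_gaps)
    # Error on the artificially hidden cells, in units of column std.
    err = ((filled - truth_np) / scale)[mask]
    scores[name] = float(np.sqrt(np.mean(err ** 2)))

for name, rmse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: scaled RMSE = {rmse:.3f}")
```

Because the true values are known for the hidden cells, the scores are directly comparable across methods. On real data the ranking can change with the missingness pattern and the feature correlations, so a result from synthetic data like this should not be read as a general recommendation.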
- Practice real-world data cleaning techniques
- Understand different imputation strategies
- Prepare data for Machine Learning models
- Build a solid Data Science portfolio project
This project uses publicly available data from the UCI repository and is intended for educational purposes.