This project performs exploratory data analysis (EDA) and builds classification and regression models to analyze wine characteristics and predict wine quality using various machine learning algorithms.
- Dataset
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- Model Training and Evaluation
- Results
- Conclusion
- How to Run
- Dependencies
The dataset contains physicochemical properties and sensory quality ratings of red and white Portuguese "Vinho Verde" wines.
Each record includes attributes such as:
- Fixed acidity, volatile acidity, citric acid
- Residual sugar, chlorides, free sulfur dioxide
- Density, pH, alcohol content
- Quality score (target)
EDA includes:
- Distribution plots of numerical features
- Correlation heatmaps
- Outlier detection
- Wine type comparison (red vs white)
- Quality class distribution
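
The EDA steps listed above can be sketched without any plotting. This is a minimal illustration on synthetic data, not the notebook's code: the column names (`alcohol`, `residual_sugar`, `type`, `quality`), the IQR rule with `k = 1.5`, and the helper `iqr_outlier_mask` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data; column names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
wines = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "residual_sugar": rng.lognormal(1.0, 0.8, n),
    "type": rng.choice(["red", "white"], n),
    "quality": rng.integers(3, 9, n),  # integer scores from 3 to 8
})

numeric_cols = ["alcohol", "residual_sugar", "quality"]

# Pairwise Pearson correlations: the matrix a correlation heatmap would display.
corr = wines[numeric_cols].corr()


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Outlier detection: number of flagged rows per feature.
outlier_counts = {
    col: int(iqr_outlier_mask(wines[col]).sum())
    for col in ["alcohol", "residual_sugar"]
}

# Wine type comparison: per-type feature means.
type_means = wines.groupby("type")[["alcohol", "residual_sugar"]].mean()

# Quality class distribution: number of wines per quality score.
quality_counts = wines["quality"].value_counts().sort_index()

print(corr.round(2))
print(outlier_counts)
print(type_means.round(2))
print(quality_counts)
```

Passing `corr` to `seaborn.heatmap` would turn the first result into the heatmap mentioned in the list.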
Preprocessing steps:
- Imputation: Fill missing values with the column median
- Outlier Handling: Clip extreme values per wine type
- Feature Scaling: Normalize features for distance-based models
- Encoding: One-hot encode wine type (if needed)
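
One way these preprocessing steps might fit together is sketched below. The tiny data frame, the column names, the 5th/95th percentile clipping bounds, and the choice of `StandardScaler` are assumptions for illustration; the notebook may use different thresholds or a different scaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature data set with missing values and one extreme value.
wines = pd.DataFrame({
    "type": ["red", "red", "red", "white", "white", "white"],
    "alcohol": [9.4, np.nan, 10.2, 11.0, 12.5, 30.0],
    "pH": [3.5, 3.2, 3.3, 3.0, np.nan, 3.1],
})
feature_cols = ["alcohol", "pH"]

# 1. Imputation: replace missing values with each column's median.
wines[feature_cols] = wines[feature_cols].fillna(wines[feature_cols].median())

# 2. Outlier handling: clip each feature to its 5th-95th percentile range,
#    computed separately for red and white wines (bounds are an assumption).
wines[feature_cols] = wines.groupby("type")[feature_cols].transform(
    lambda s: s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
)

# 3. Encoding: one-hot encode the wine type.
encoded = pd.get_dummies(wines, columns=["type"], dtype=int)

# 4. Feature scaling: zero mean and unit variance per feature.
scaler = StandardScaler()
encoded[feature_cols] = scaler.fit_transform(encoded[feature_cols])

print(encoded.round(3))
```

When a train/test split is involved, fitting the imputation medians, clipping bounds, and scaler on the training portion only avoids leaking test-set statistics into the model.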

Classification models:
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- Random Forest
- K-Nearest Neighbors (KNN)
- Gaussian Naive Bayes
Evaluation Metrics: Accuracy, Precision, Recall, F1-score
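
The classifiers and metrics listed above could be compared with a loop like the following. It is a sketch under stated assumptions: the data come from `make_classification` as a synthetic binary stand-in (this README does not specify the classification target), and the 80/20 split, default hyperparameters, and scaling pipelines are illustrative choices rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the wine data (assumption).
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Gaussian Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print(name, {k: round(v, 4) for k, v in scores[name].items()})
```

For a multi-class target, `precision_score`, `recall_score`, and `f1_score` need an explicit `average` argument such as `"macro"` or `"weighted"`.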

Regression models:
- Linear Regression
- Huber Regressor
- RANSAC Regressor
- Theil-Sen Regressor
- Decision Tree Regressor
- Random Forest Regressor
- Support Vector Regressor (SVR)
- KNN Regressor
Evaluation Metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE)
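
A comparable loop for the regressors is sketched below. MSE is the mean of the squared prediction errors and RMSE is its square root, so RMSE is expressed in the same units as the quality score. The synthetic data from `make_regression`, the 80/20 split, and the hyperparameters (including `max_subpopulation=1000`, chosen only to keep Theil-Sen fast) are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the wine data (assumption).
X, y = make_regression(n_samples=400, n_features=11, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Huber Regressor": make_pipeline(StandardScaler(), HuberRegressor(max_iter=1000)),
    "RANSAC Regressor": RANSACRegressor(random_state=42),
    "Theil-Sen Regressor": TheilSenRegressor(max_subpopulation=1000, random_state=42),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "KNN Regressor": make_pipeline(StandardScaler(), KNeighborsRegressor()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    # RMSE is the square root of MSE.
    results[name] = {"mse": float(mse), "rmse": float(np.sqrt(mse))}
    print(f"{name}: MSE={mse:.4f}, RMSE={np.sqrt(mse):.4f}")
```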

Classification results:

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Logistic Regression | 97.69% | 97.84% | 99.06% | 98.45% |
| SVM | 92.62% | 92.26% | 98.23% | 95.15% |
| Decision Tree | 98.38% | 98.96% | 98.85% | 98.90% |
| ✅ Random Forest | 99.62% | 99.58% | 99.90% | 99.74% |
| KNN | 95.62% | 96.69% | 97.39% | 97.04% |
| Gaussian Naive Bayes | 97.15% | 98.94% | 97.18% | 98.05% |

Regression results:

| Model | MSE | RMSE |
|---|---|---|
| Linear Regression | 0.5300 | 0.7280 |
| Huber Regressor | 0.5373 | 0.7330 |
| RANSAC Regressor | 0.7293 | 0.8540 |
| Theil-Sen Regressor | 0.5428 | 0.7368 |
| Decision Tree Regressor | 0.7069 | 0.8408 |
| ✅ Random Forest Regressor | 0.3704 | 0.6086 |
| SVR | 0.6099 | 0.7809 |
| KNN Regressor | 0.6318 | 0.7948 |

- ✅ Random Forest Classifier achieved the highest classification performance (99.62% accuracy, 99.74% F1-score).
- ✅ Random Forest Regressor had the lowest error of all regressors in predicting quality ratings (MSE 0.3704, RMSE 0.6086).
- The preprocessing steps (median imputation, per-type outlier clipping, feature scaling) and EDA improved model performance and interpretability.
- Clone the repository
- Install dependencies
- Run the notebook: Wine_prediction.ipynb
Install required packages:
pip install pandas numpy matplotlib seaborn scikit-learn