This project aims to classify forest cover types based on cartographic variables (such as elevation, slope, soil type, etc.) using Machine Learning algorithms. The project utilizes the Forest Cover Type dataset and implements multiple models to compare their performance.
- Name: Forest Cover Type (Covtype)
- Source: UCI Machine Learning Repository / Scikit-Learn
- Target: 7 different forest cover types
- Features: 54 columns (Elevation, Aspect, Slope, Distances to Hydrology/Roadways/Firepoints, Hillshade, Wilderness Areas, Soil Types)
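As a rough sketch of how the data could be loaded through Scikit-Learn, the snippet below wraps `fetch_covtype` in a small helper and spells out how the 54 columns break down (10 continuous measurements, 4 wilderness-area indicators, 40 soil-type indicators). The helper name `load_covtype` and the constants are illustrative assumptions, not names taken from the notebook.

```python
from sklearn.datasets import fetch_covtype

# The 10 continuous cartographic measurements in Covtype.
CONTINUOUS = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
N_WILDERNESS = 4   # one-hot wilderness area indicators
N_SOIL = 40        # one-hot soil type indicators
N_FEATURES = len(CONTINUOUS) + N_WILDERNESS + N_SOIL


def load_covtype():
    """Hypothetical helper: download (on first call) and return X, y.

    X has one row per sample and N_FEATURES columns; y holds the
    cover type label (integers 1 through 7).
    """
    X, y = fetch_covtype(return_X_y=True)
    return X, y
```

Calling `X, y = load_covtype()` downloads the dataset on first use and caches it locally, so the first call needs network access.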
The following models were trained and evaluated:
- Multi-Layer Perceptron (MLP)
  - Type: Neural Network (`MLPClassifier`)
  - Optimization: Hyperparameter tuning using `RandomizedSearchCV`
  - Best Params: (Found via tuning, e.g., hidden layers, activation, alpha)
  - Performance: High Accuracy (~95%) and AUC (~0.99)
- Support Vector Machine (SVM)
  - Type: Linear SVM (`LinearSVC`)
  - Configuration: Wrapped in `CalibratedClassifierCV` for probability estimates.
  - Parameters: `dual=False`, `random_state=42`
  - Performance: Good baseline, efficient for large datasets.
- Logistic Regression (LR)
  - Type: Logistic Regression
  - Parameters: `solver='saga'`, `max_iter=500`, `n_jobs=-1`
  - Performance: Comparable to SVM, serves as a linear baseline.
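The three configurations above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses a small synthetic dataset from `make_classification` as a stand-in for Covtype so that it runs quickly, and the search space in `param_distributions`, the `n_iter` and `cv` values, and the use of `StandardScaler` are assumptions chosen for the example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for Covtype (assumption): 54 features, 7 classes.
X, y = make_classification(
    n_samples=700, n_features=54, n_informative=20,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# MLP tuned with a randomized search (illustrative search space).
param_distributions = {
    "hidden_layer_sizes": [(32,), (64,), (32, 16)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-4, 1e-3, 1e-2],
}
mlp_search = RandomizedSearchCV(
    MLPClassifier(max_iter=200, random_state=42),
    param_distributions, n_iter=3, cv=3, random_state=42,
)
mlp_search.fit(X_train, y_train)

# LinearSVC has no predict_proba, so it is wrapped for calibrated probabilities.
svm = CalibratedClassifierCV(LinearSVC(dual=False, random_state=42), cv=3)
svm.fit(X_train, y_train)

# Logistic regression with the parameters listed above.
lr = LogisticRegression(solver="saga", max_iter=500, n_jobs=-1)
lr.fit(X_train, y_train)

for name, model in [("MLP", mlp_search), ("SVM", svm), ("LR", lr)]:
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

On the real dataset the same structure applies, but training takes considerably longer, and the scores printed here say nothing about the results reported above.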
Models are evaluated using:
- Accuracy Score: Overall correctness of predictions.
- AUC Score (Macro): Area Under the ROC Curve, handling multi-class classification via One-vs-Rest (OvR).
- Confusion Matrix: Visualizing true vs. predicted classes.
- Classification Report: Precision, Recall, and F1-Score for each class.
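The four metrics above map onto functions in `sklearn.metrics`. The sketch below computes them for a single classifier. The synthetic data and the plain `LogisticRegression` stand-in named `clf` are assumptions made to keep the example self-contained; any fitted model exposing `predict` and `predict_proba` could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic 7-class stand-in for Covtype (assumption).
X, y = make_classification(
    n_samples=700, n_features=54, n_informative=20,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)  # shape: (n_samples, n_classes)

# Overall fraction of correct predictions.
acc = accuracy_score(y_test, y_pred)
# Macro-averaged AUC: one-vs-rest AUC per class, then an unweighted mean.
auc = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
# Per-class precision, recall, and F1 as formatted text.
report = classification_report(y_test, y_pred)

print(f"accuracy={acc:.3f}  macro AUC (OvR)={auc:.3f}")
print(cm)
print(report)
```

Macro averaging weights all seven classes equally, so a rare class influences the AUC as much as a common one, which is not the case for accuracy.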
The notebook includes several visualizations to understand the data and model performance:
- Class Distribution: Count plot of the target variable.
- Feature Correlation: Heatmap showing relationships between features.
- Boxplots: Distribution of continuous features by class.
- ROC Curves: Multi-class ROC curves for the MLP model.
- Confusion Matrices: Heatmaps for MLP, SVM, and Logistic Regression.
- Model Comparison: Bar chart comparing Accuracy and AUC across all models.
- Loss Curve: Training loss over iterations for the MLP.
To run this notebook, you need the following Python libraries:
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- joblib
- Ensure all dependencies are installed (`pip install -r requirements.txt` if available, or install individually).
- Open `ML_Project.ipynb` in Jupyter Notebook or VS Code.
- Run all cells sequentially.
- Note: The SVM and Logistic Regression training steps might take a few minutes due to the dataset size.
- The final cells will display the model comparison table and plots.
- `ML_Project.ipynb`: Main project notebook.
- `mlp_model_metrics.csv`: Saved metrics for the MLP model.
- `mlp_covtype_tuned_final_model.joblib`: Saved trained MLP model.
- `scaler_covtype.joblib`: Saved data scaler.
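The two `.joblib` files are presumably meant to be reloaded together: the scaler transforms new rows, then the model predicts on the scaled values. The sketch below shows that round trip with `joblib.dump` and `joblib.load`. It trains a tiny throwaway model on synthetic data and writes to a temporary directory, so it does not touch or depend on the real saved files; the assumption that the scaler must be applied before the model is inferred from the file list, not confirmed by the notebook.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Throwaway model and scaler on synthetic data (stand-ins for the real artifacts).
X, y = make_classification(
    n_samples=300, n_features=54, n_informative=20,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)
scaler = StandardScaler().fit(X)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=50, random_state=42)
model.fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "mlp_covtype_tuned_final_model.joblib"
    scaler_path = Path(tmp) / "scaler_covtype.joblib"

    # Save both artifacts, then load them back as a fresh process would.
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# Apply the same scaling used at training time before predicting.
new_rows = X[:5]
preds = loaded_model.predict(loaded_scaler.transform(new_rows))
print(preds)
```

Files saved with `joblib` are generally only safe to load with a compatible scikit-learn version and should not be loaded from untrusted sources.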