🎓 Student Score Prediction System - End-to-End ML Project

📋 Project Overview

A comprehensive end-to-end machine learning project that predicts student writing scores based on multiple demographic and academic factors. This project demonstrates a complete ML pipeline from data ingestion to model deployment with a modern web interface.

✨ Key Features

  • 🤖 Multiple ML algorithms (Linear Regression, Random Forest, XGBoost, CatBoost, etc.)
  • 📊 Advanced data preprocessing and feature engineering
  • 🎯 Best model R² score: 0.88 (Linear Regression)
  • 🌐 Modern Flask web interface with real-time predictions
  • 📈 Interactive data visualizations and analysis
  • ⚡ Sub-100ms inference time
  • 🔄 Automated hyperparameter tuning with GridSearchCV

🎨 Project Screenshots & Dashboard

Web Interface Overview

[Image 1: Dashboard with Project Overview and Dataset Information]

[Image 2: Interactive Visualizations and Data Analysis]

[Image 3: Model Performance and Metrics Display]


🏗️ Project Architecture

Student Score Prediction System
│
├── Data Layer
│   ├── Raw Data (Students.csv)
│   ├── Processed Data (train.csv, test.csv)
│   └── Artifacts Storage
│
├── ML Pipeline
│   ├── Data Ingestion
│   ├── Data Transformation & Preprocessing
│   ├── Model Training
│   ├── Model Evaluation
│   └── Best Model Selection
│
└── Web Interface
    ├── Flask Backend
    ├── HTML/CSS/JavaScript Frontend
    └── Real-time Prediction API

🛠️ Technology Stack

Backend Technologies

┌─────────────────────────────────────────────┐
│         Machine Learning Stack              │
├─────────────────────────────────────────────┤
│  Python 3.8+          - Core Language       │
│  Scikit-learn         - ML Algorithms       │
│  XGBoost              - Gradient Boosting   │
│  CatBoost             - Categorical Boost   │
│  Pandas               - Data Processing     │
│  NumPy                - Numerical Computing │
│  Pickle/Joblib        - Model Serialization │
└─────────────────────────────────────────────┘

Web Framework

┌─────────────────────────────────────────────┐
│         Web Development Stack               │
├─────────────────────────────────────────────┤
│  Flask                - Web Framework       │
│  Jinja2               - Template Engine     │
│  HTML5/CSS3           - Frontend Design     │
│  JavaScript (ES6)     - Client Interactivity│
│  Bootstrap            - Responsive UI       │
└─────────────────────────────────────────────┘

Data & Visualization

┌─────────────────────────────────────────────┐
│      Data & Visualization Libraries         │
├─────────────────────────────────────────────┤
│  Matplotlib           - Static Plots        │
│  Seaborn              - Statistical Charts  │
│  Plotly               - Interactive Graphs  │
│  Pandas Profiling     - Data Reports        │
└─────────────────────────────────────────────┘

📊 Project Structure

MlProject/
├── app.py                      # Flask application entry point
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
├── setup.py                    # Package setup configuration
│
├── artifacts/                  # Generated model files
│   ├── train.csv              # Training dataset
│   ├── test.csv               # Testing dataset
│   ├── data.csv               # Raw dataset
│   └── model.pkl              # Trained model
│
├── src/                        # Source code
│   ├── __init__.py
│   ├── exception.py            # Custom exceptions
│   ├── logger.py               # Logging configuration
│   ├── utils.py                # Utility functions
│   │
│   ├── components/
│   │   ├── data_ingestion.py   # Load & split data
│   │   ├── data_transformation.py # Preprocessing
│   │   └── model_trainer.py    # Model training
│   │
│   └── pipeline/
│       ├── train_pipeline.py   # Training workflow
│       └── predict_pipeline.py # Prediction workflow
│
├── templates/                  # HTML templates
│   ├── index.html             # Dashboard
│   └── home.html              # Home page
│
├── notebook/                   # Jupyter notebooks
│   ├── Model Training.ipynb    # Model development
│   └── problemstatement.ipynb  # Problem analysis
│
└── logs/                       # Application logs

🔄 ML Pipeline Flow

START
  │
  ├─→ [Data Ingestion]
  │    └─→ Load Students.csv (1000+ records)
  │
  ├─→ [Train/Test Split] (80/20)
  │
  ├─→ [Data Preprocessing]
  │    ├─→ Handle Missing Values
  │    ├─→ Categorical Encoding (One-Hot/Label)
  │    ├─→ Feature Scaling (StandardScaler)
  │    └─→ Outlier Detection
  │
  ├─→ [Feature Engineering]
  │    └─→ Advanced Feature Creation
  │
  ├─→ [Model Training]
  │    ├─→ Random Forest Regressor
  │    ├─→ Gradient Boosting Regressor
  │    ├─→ XGBRegressor
  │    ├─→ CatBoost Regressor
  │    ├─→ Decision Tree Regressor
  │    ├─→ KNN Regressor
  │    ├─→ AdaBoost Regressor
  │    └─→ Linear Regression ⭐ BEST
  │
  ├─→ [Hyperparameter Tuning]
  │    └─→ GridSearchCV (5-Fold CV)
  │
  ├─→ [Model Evaluation]
  │    ├─→ R² Score: 0.88
  │    ├─→ MAE: 2.8 points
  │    ├─→ RMSE: 3.2 points
  │    └─→ Cross-Validation Scores
  │
  ├─→ [Best Model Selection]
  │    └─→ Linear Regression (R² = 0.88)
  │
  └─→ [Deployment]
       └─→ Flask Web Interface
END

📈 Model Performance Metrics

🏆 Best Performing Model: Linear Regression

| Metric | Value |
|---|---|
| Accuracy (R² expressed as a percentage) | 88% |
| R² Score | 0.88 |
| Mean Absolute Error (MAE) | 2.8 points |
| Root Mean Squared Error (RMSE) | 3.2 points |
| Training Time | < 1 second |
| Inference Time | < 10 ms per prediction |
| Cross-Validation Score | 0.87 (5-fold) |
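
For readers unfamiliar with the error metrics listed above, the following is a dependency-free sketch of how R², MAE and RMSE are defined. It is not the project's evaluation code: `regression_metrics` is a hypothetical helper name and the scores are made-up illustration values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for two equal-length sequences of scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "rmse": math.sqrt(mse)}


# Illustrative values only, not taken from the dataset.
actual = [70, 80, 90, 60]
predicted = [72, 78, 91, 61]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE is a non-negative magnitude, and RMSE is never smaller than MAE on the same predictions, which matches the 2.8 and 3.2 point figures reported here.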

📊 Dataset Information

Columns (8 total: 7 input features and 1 target)

| Feature | Type | Description | Range |
|---|---|---|---|
| Gender | Categorical | Male/Female | 2 categories |
| Race/Ethnicity | Categorical | Groups A-E | 5 categories |
| Parental Education | Categorical | Education levels | 6 levels |
| Lunch Type | Categorical | Standard/Free-Reduced | 2 categories |
| Test Preparation | Categorical | None/Completed | 2 states |
| Math Score | Numerical | Math test score | 0-100 |
| Reading Score | Numerical | Reading test score | 0-100 |
| Writing Score | Numerical | Target Variable | 0-100 |
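
The ranges above can double as an input check before a record reaches the model. The sketch below is one possible way to do that, not the project's validation code. The field names follow the API usage example in this README; the exact category spellings (for example `"free/reduced"` and `"none"`) and the `validate_record` helper are assumptions. Parental education is left out because its six levels are not listed here.

```python
# Assumed category spellings; adjust to match the actual dataset values.
ALLOWED = {
    "gender": {"male", "female"},
    "race_ethnicity": {"group A", "group B", "group C", "group D", "group E"},
    "lunch": {"standard", "free/reduced"},
    "test_preparation_course": {"none", "completed"},
}
SCORE_FIELDS = ("math_score", "reading_score")


def validate_record(record):
    """Return a list of problems found in `record`; an empty list means valid."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    for field in SCORE_FIELDS:
        value = record.get(field)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {value!r}")
        elif not 0 <= value <= 100:
            problems.append(f"{field}: {value} is outside the 0-100 range")
    return problems


good = {
    "gender": "male",
    "race_ethnicity": "group A",
    "lunch": "standard",
    "test_preparation_course": "completed",
    "math_score": 85,
    "reading_score": 90,
}
bad = dict(good, gender="unknown", math_score=120)

print(validate_record(good))
print(validate_record(bad))
```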

Dataset Statistics

  • Total Records: 1,000+
  • Training Set: 80% (800+ records)
  • Testing Set: 20% (200+ records)
  • Data Completeness: 100% (no missing values)
  • Source: Kaggle - Students Performance in Exams

🚀 Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • Virtual Environment (recommended)

Step 1: Clone the Repository

git clone https://github.com/yourusername/MlProject.git
cd MlProject

Step 2: Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Run the Application

python app.py

The application will be available at: http://localhost:5000


💻 Usage

Making Predictions via Web Interface

  1. Open http://localhost:5000 in your browser
  2. Fill in the student information form:
    • Select gender and race/ethnicity
    • Choose parental education level
    • Select lunch type and test preparation status
    • Enter math and reading scores
  3. Click "Predict Result" to get the writing score prediction
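
Behind a form like the one described above, a Flask route has to collect the fields, convert the scores to numbers and hand them to the model. The following is a minimal sketch of that flow, not the contents of `app.py`: the `/predict` route, the `FIELDS` tuple and the placeholder `predict_writing_score` function are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Field names follow the API usage example in this README.
FIELDS = (
    "gender", "race_ethnicity", "parental_education", "lunch",
    "test_preparation_course", "math_score", "reading_score",
)


def predict_writing_score(features):
    """Placeholder for the real model call: averages the two known scores."""
    return (features["math_score"] + features["reading_score"]) / 2


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    # Accept either a JSON body or a regular HTML form submission.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = request.form.to_dict()

    missing = [name for name in FIELDS if name not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400

    features = dict(payload)
    try:
        features["math_score"] = float(payload["math_score"])
        features["reading_score"] = float(payload["reading_score"])
    except (TypeError, ValueError):
        return jsonify({"error": "scores must be numeric"}), 400

    return jsonify({"writing_score": predict_writing_score(features)})
```

Calling `app.run()` would start the development server. In a real deployment the placeholder function would be replaced by a call into the prediction pipeline.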

API Usage

from src.pipeline.predict_pipeline import PredictionPipeline

# Create predictor
predictor = PredictionPipeline()

# Make prediction
input_data = {
    'gender': 'male',
    'race_ethnicity': 'group A',
    'parental_education': "bachelor's degree",
    'lunch': 'standard',
    'test_preparation_course': 'completed',
    'math_score': 85,
    'reading_score': 90
}

prediction = predictor.predict(input_data)
print(f"Predicted Writing Score: {prediction}")

🔧 Training the Model

Retraining with New Data

python -c "from src.pipeline.train_pipeline import TrainPipeline; pipeline = TrainPipeline(); pipeline.main()"

Evaluating Model Performance

python -c "from src.pipeline.train_pipeline import TrainPipeline; pipeline = TrainPipeline(); pipeline.evaluate_model()"

📚 Project Components

1. Data Ingestion (src/components/data_ingestion.py)

  • Loads raw student data
  • Splits into training and testing sets
  • Handles data validation
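
As a rough illustration of the load-and-split step, the sketch below reads CSV rows and performs a reproducible 80/20 split using only the standard library. It is not the code in `data_ingestion.py`: `split_rows` is a hypothetical helper, and an in-memory CSV with synthetic values stands in for the raw file.

```python
import csv
import io
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split stable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Build a synthetic in-memory CSV standing in for the raw dataset.
raw_csv = io.StringIO()
writer = csv.writer(raw_csv)
writer.writerow(["math_score", "reading_score", "writing_score"])
for i in range(1000):
    writer.writerow([i % 101, (i * 7) % 101, (i * 3) % 101])
raw_csv.seek(0)

rows = list(csv.DictReader(raw_csv))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 800 200
```

Fixing the seed means that retraining on the same file reproduces the same train and test sets, which keeps metric comparisons between runs meaningful.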

2. Data Transformation (src/components/data_transformation.py)

  • Categorical encoding
  • Feature scaling
  • Missing value handling
  • Outlier detection
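
The encoding and scaling steps listed above are commonly combined with scikit-learn's `ColumnTransformer`. The sketch below shows that pattern on a four-row toy frame. It is an assumption about how such a preprocessor could be built, not the contents of `data_transformation.py`. It covers only two of the categorical columns, imputes only the numeric columns, and does not attempt outlier detection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["gender", "lunch"]
numeric_cols = ["math_score", "reading_score"]

# Numeric columns: fill gaps with the median, then standardise.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        # Unknown categories at prediction time become all-zero rows.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Toy rows for illustration only.
frame = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "math_score": [85, 62, 74, 91],
    "reading_score": [90, 70, 78, 88],
})

features = preprocessor.fit_transform(frame)
print(features.shape)  # (4, 6): 2 scaled columns + 2 + 2 one-hot columns
```

Fitting the preprocessor on the training split only, and reusing the fitted object for the test split and for live predictions, avoids leaking test statistics into the scaler.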

3. Model Training (src/components/model_trainer.py)

  • Trains multiple ML algorithms
  • Performs hyperparameter tuning
  • Selects best performing model
  • Saves model artifacts
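
To make the train, tune and select loop concrete, here is a small sketch using `GridSearchCV` with 5-fold cross-validation over two candidate regressors. The data is synthetic (a noisy linear combination of two score-like columns), the parameter grids are arbitrary, and only two of the algorithms named in this README are included, so the result says nothing about the project's real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two score-like features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 2))
y = 0.4 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate estimators with illustrative (assumed) parameter grids.
candidates = {
    "LinearRegression": (LinearRegression(), {"fit_intercept": [True, False]}),
    "DecisionTreeRegressor": (
        DecisionTreeRegressor(random_state=0),
        {"max_depth": [3, 5, 8]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    # Score the tuned estimator on the held-out test split.
    test_r2 = search.best_estimator_.score(X_test, y_test)
    results[name] = (test_r2, search.best_estimator_)

best_name = max(results, key=lambda name: results[name][0])
best_score, best_model = results[best_name]
print(f"best model: {best_name}, test R^2 = {best_score:.3f}")
```

Selecting on a held-out test score, as done here for brevity, slightly overstates the winner's performance. Selecting on the cross-validation score and reporting the test score once at the end is the more conservative choice.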

4. Prediction Pipeline (src/pipeline/predict_pipeline.py)

  • Loads trained model
  • Processes input data
  • Generates predictions
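
The load-and-predict steps can be sketched as a pickle round trip. The example below trains a toy model, saves it to a temporary `model.pkl`, reloads it and predicts for one student. It is an illustration under stated assumptions and not the code in `predict_pipeline.py`: the toy target is simply the mean of the two input scores, and no preprocessing is applied.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: the target is the mean of the two scores (illustration only).
X = np.array([[60, 70], [80, 90], [50, 40], [90, 100]], dtype=float)
y = X.mean(axis=1)
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"

    # Save the fitted model as an artifact.
    with model_path.open("wb") as fh:
        pickle.dump(model, fh)

    # Later, for example at web-app start-up, load it back.
    with model_path.open("rb") as fh:
        loaded = pickle.load(fh)

# Predict for one student with math=85 and reading=90.
prediction = float(loaded.predict(np.array([[85.0, 90.0]]))[0])
print(round(prediction, 2))
```

Pickle files can execute arbitrary code when loaded, so model artifacts should only be loaded from trusted sources.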

📊 Visualizations Available

The dashboard includes 9 interactive visualizations:

Visualization Gallery

  1. Student Demographics Distribution
  2. Score Distributions and Correlations
  3. Model Performance Metrics
  4. Feature Importance Analysis
  5. Prediction Accuracy Charts
  6. Data Quality Reports
  7. Score Distribution Analysis
  8. Feature Correlations Heatmap
  9. Model Comparison Dashboard


🔐 Error Handling & Logging

The project includes:

  • ✅ Custom exception handling
  • ✅ Detailed logging system
  • ✅ Data validation
  • ✅ Model validation
  • ✅ Error recovery mechanisms
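
A common way to combine custom exceptions with logging is to wrap low-level errors in a project-specific exception that records where the failure happened. The sketch below shows that pattern with the standard library only. `PipelineError`, `parse_score` and the log format are hypothetical and are not taken from `src/exception.py` or `src/logger.py`.

```python
import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("student_score")  # assumed logger name


class PipelineError(Exception):
    """Wraps an exception together with the file and line where it was raised."""

    def __init__(self, original):
        _, _, tb = sys.exc_info()
        location = "unknown location"
        if tb is not None:
            # Walk to the innermost frame, where the error actually occurred.
            while tb.tb_next is not None:
                tb = tb.tb_next
            location = f"{tb.tb_frame.f_code.co_filename}:{tb.tb_lineno}"
        super().__init__(f"{type(original).__name__} at {location}: {original}")
        self.original = original


def parse_score(text):
    """Convert user input to a float, logging and wrapping any failure."""
    try:
        return float(text)
    except ValueError as exc:
        logger.error("could not parse score %r", text)
        raise PipelineError(exc) from exc


print(parse_score("85"))
try:
    parse_score("eighty-five")
except PipelineError as err:
    print(f"handled: {err}")
```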

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


👨‍💻 Author

Your Name - End-to-End ML Project Developer


🙋 Support & Contact

For questions or issues, please open an issue on the project repository.



Made with ❤️ for the ML Community