An end-to-end machine learning project that predicts student writing scores from demographic and academic factors. It demonstrates a complete ML pipeline, from data ingestion to model deployment behind a web interface.
- 🤖 Multiple ML algorithms (Linear Regression, Random Forest, XGBoost, CatBoost, etc.)
- 📊 Advanced data preprocessing and feature engineering
- 🎯 Best model: Linear Regression (R² ≈ 0.88, reported as 88% accuracy)
- 🌐 Modern Flask web interface with real-time predictions
- 📈 Interactive data visualizations and analysis
- ⚡ Sub-100ms inference time
- 🔄 Automated hyperparameter tuning with GridSearchCV
Dashboard with Project Overview and Dataset Information
Student Score Prediction System
│
├── Data Layer
│ ├── Raw Data (Students.csv)
│ ├── Processed Data (train.csv, test.csv)
│ └── Artifacts Storage
│
├── ML Pipeline
│ ├── Data Ingestion
│ ├── Data Transformation & Preprocessing
│ ├── Model Training
│ ├── Model Evaluation
│ └── Best Model Selection
│
└── Web Interface
├── Flask Backend
├── HTML/CSS/JavaScript Frontend
└── Real-time Prediction API
┌─────────────────────────────────────────────┐
│ Machine Learning Stack │
├─────────────────────────────────────────────┤
│ Python 3.8+ - Core Language │
│ Scikit-learn - ML Algorithms │
│ XGBoost - Gradient Boosting │
│ CatBoost - Categorical Boost │
│ Pandas - Data Processing │
│ NumPy - Numerical Computing │
│ Pickle/Joblib - Model Serialization │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Web Development Stack │
├─────────────────────────────────────────────┤
│ Flask - Web Framework │
│ Jinja2 - Template Engine │
│ HTML5/CSS3 - Frontend Design │
│ JavaScript (ES6) - Client Interactivity│
│ Bootstrap - Responsive UI │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Data & Visualization Libraries │
├─────────────────────────────────────────────┤
│ Matplotlib - Static Plots │
│ Seaborn - Statistical Charts │
│ Plotly - Interactive Graphs │
│ Pandas Profiling - Data Reports │
└─────────────────────────────────────────────┘
MlProject/
├── app.py # Flask application entry point
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── setup.py # Package setup configuration
│
├── artifacts/ # Generated model files
│ ├── train.csv # Training dataset
│ ├── test.csv # Testing dataset
│ ├── data.csv # Raw dataset
│ └── model.pkl # Trained model
│
├── src/ # Source code
│ ├── __init__.py
│ ├── exception.py # Custom exceptions
│ ├── logger.py # Logging configuration
│ ├── utils.py # Utility functions
│ │
│ ├── components/
│ │ ├── data_ingestion.py # Load & split data
│ │ ├── data_transformation.py # Preprocessing
│ │ └── model_trainer.py # Model training
│ │
│ └── pipeline/
│ ├── train_pipeline.py # Training workflow
│ └── predict_pipeline.py # Prediction workflow
│
├── templates/ # HTML templates
│ ├── index.html # Dashboard
│ └── home.html # Home page
│
├── notebook/ # Jupyter notebooks
│ ├── Model Training.ipynb # Model development
│ └── problemstatement.ipynb # Problem analysis
│
└── logs/ # Application logs
START
│
├─→ [Data Ingestion]
│ └─→ Load Students.csv (1000+ records)
│
├─→ [Train/Test Split] (80/20)
│
├─→ [Data Preprocessing]
│ ├─→ Handle Missing Values
│ ├─→ Categorical Encoding (One-Hot/Label)
│ ├─→ Feature Scaling (StandardScaler)
│ └─→ Outlier Detection
│
├─→ [Feature Engineering]
│ └─→ Advanced Feature Creation
│
├─→ [Model Training]
│ ├─→ Random Forest Regressor
│ ├─→ Gradient Boosting Regressor
│ ├─→ XGBRegressor
│ ├─→ CatBoost Regressor
│ ├─→ Decision Tree Regressor
│ ├─→ KNN Regressor
│ ├─→ AdaBoost Regressor
│ └─→ Linear Regression ⭐ BEST
│
├─→ [Hyperparameter Tuning]
│ └─→ GridSearchCV (5-Fold CV)
│
├─→ [Model Evaluation]
│ ├─→ R² Score: 0.88
│ ├─→ MAE: 2.8 points
│ ├─→ RMSE: 3.2 points
│ └─→ Cross-Validation Scores
│
├─→ [Best Model Selection]
│ └─→ Linear Regression (R² = 0.88)
│
└─→ [Deployment]
└─→ Flask Web Interface
END
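The training, tuning, and selection steps in the flow above can be sketched as a loop over candidate estimators, each tuned with `GridSearchCV` and compared on its cross-validated R². This is a minimal sketch under stated assumptions: the synthetic data, the two candidates, the parameter grid, and the helper name `select_best_model` are illustrative and are not taken from `model_trainer.py`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two score-like features and a roughly linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 0.3 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(0, 3, size=200)

# Each candidate pairs an estimator with its (possibly empty) parameter grid.
candidates = {
    "LinearRegression": (LinearRegression(), {}),
    "DecisionTreeRegressor": (
        DecisionTreeRegressor(random_state=0),
        {"max_depth": [2, 4, 8]},
    ),
}


def select_best_model(X, y, candidates, cv=5):
    """Tune every candidate with GridSearchCV and return the best by mean CV R²."""
    scores = {}
    fitted = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="r2")
        search.fit(X, y)
        scores[name] = search.best_score_
        fitted[name] = search.best_estimator_
    best_name = max(scores, key=scores.get)
    return best_name, fitted[best_name], scores


best_name, best_model, scores = select_best_model(X, y, candidates)
print(best_name, {name: round(score, 3) for name, score in scores.items()})
```

On data that is close to linear, a plain linear model tends to beat tree-based models in this comparison, which is consistent with Linear Regression being reported as the best model here.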
| Metric | Value |
|---|---|
| Accuracy (R² expressed as a percentage) | 88% |
| R² Score | 0.88 |
| Mean Absolute Error (MAE) | 2.8 points |
| Root Mean Squared Error (RMSE) | 3.2 points |
| Training Time | < 1 second |
| Inference Time | < 10ms per prediction |
| Cross-Validation Score | 0.87 (5-fold) |
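For a regression target such as a writing score, the three error metrics in the table have simple definitions. The stdlib-only sketch below shows how R², MAE, and RMSE are computed; the function name `regression_metrics` and the sample scores are illustrative assumptions, not the project's evaluation code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for two equal-length sequences of scores."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares; ss_tot is zero if all targets are equal.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Made-up scores, only to exercise the formulas.
metrics = regression_metrics([70, 80, 90], [72, 78, 90])
print(metrics)
```

RMSE is never smaller than MAE on the same predictions, so the reported pair (MAE 2.8, RMSE 3.2) is at least internally consistent.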
| Feature | Type | Description | Range |
|---|---|---|---|
| Gender | Categorical | Male/Female | 2 categories |
| Race/Ethnicity | Categorical | Groups A-E | 5 categories |
| Parental Education | Categorical | Education levels | 6 levels |
| Lunch Type | Categorical | Standard/Free-Reduced | 2 categories |
| Test Preparation | Categorical | None/Completed | 2 states |
| Math Score | Numerical | Math test score | 0-100 |
| Reading Score | Numerical | Reading test score | 0-100 |
| Writing Score | Numerical | Target Variable | 0-100 |
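One common way to preprocess a feature set like this is a scikit-learn `ColumnTransformer` that one-hot encodes the categorical columns and standardizes the numerical ones. The sketch below assumes the column names used in the programmatic usage example of this README; the sample rows and the helper `build_preprocessor` are hypothetical, and `data_transformation.py` may differ in detail.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names are assumed to match the keys used in the prediction example.
CATEGORICAL = [
    "gender",
    "race_ethnicity",
    "parental_education",
    "lunch",
    "test_preparation_course",
]
NUMERICAL = ["math_score", "reading_score"]


def build_preprocessor():
    """One-hot encode categorical columns and standardize numerical ones."""
    return ColumnTransformer(
        [
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
            ("num", StandardScaler(), NUMERICAL),
        ],
        sparse_threshold=0,  # always return a dense array
    )


# Three made-up rows, only to show the output shape.
sample = pd.DataFrame(
    {
        "gender": ["male", "female", "female"],
        "race_ethnicity": ["group A", "group B", "group C"],
        "parental_education": ["bachelor's degree", "high school", "high school"],
        "lunch": ["standard", "free/reduced", "standard"],
        "test_preparation_course": ["completed", "none", "none"],
        "math_score": [85, 60, 72],
        "reading_score": [90, 65, 70],
    }
)

features = build_preprocessor().fit_transform(sample)
print(features.shape)
```

The writing score is left out of the transformer because it is the target, not an input feature.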
- Total Records: 1,000+
- Training Set: 80% (800+ records)
- Testing Set: 20% (200+ records)
- Data Completeness: 100% (no missing values)
- Source: Kaggle - Students Performance in Exams
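The 80/20 split described above amounts to shuffling the records once and cutting the list at the 20% mark. The sketch below uses only the standard library to show the idea; the helper name `train_test_split_rows` and the seed are assumptions, and the project's `data_ingestion.py` may well rely on a library helper instead.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integers stand in for the 1,000 student records.
records = list(range(1000))
train_rows, test_rows = train_test_split_rows(records)
print(len(train_rows), len(test_rows))
```

Fixing the seed matters when comparing models, since every candidate should be scored on the same held-out rows.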
- Python 3.8 or higher
- pip (Python package manager)
- Virtual Environment (recommended)
git clone https://github.com/yourusername/MlProject.git
cd MlProject
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
python app.py
The application will be available at: http://localhost:5000
- Open http://localhost:5000 in your browser
- Fill in the student information form:
  - Select gender and race/ethnicity
  - Choose parental education level
  - Select lunch type and test preparation status
  - Enter math and reading scores
- Click "Predict Result" to get the writing score prediction
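Before a submitted form reaches the model, it is usually worth checking that every field is present and that the two scores fall in the 0-100 range listed in the feature table. The following is a hedged, stdlib-only sketch of such a check; `validate_student_input` is a hypothetical helper and the field names are assumed to match the keys used in the programmatic example.

```python
REQUIRED_FIELDS = (
    "gender",
    "race_ethnicity",
    "parental_education",
    "lunch",
    "test_preparation_course",
    "math_score",
    "reading_score",
)
SCORE_FIELDS = ("math_score", "reading_score")


def _is_blank(value):
    """True for a missing value or a string containing only whitespace."""
    return value is None or str(value).strip() == ""


def validate_student_input(form):
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if _is_blank(form.get(field)):
            errors.append(f"{field} is required")
    for field in SCORE_FIELDS:
        raw = form.get(field)
        if _is_blank(raw):
            continue  # already reported as missing above
        try:
            score = float(raw)
        except (TypeError, ValueError):
            errors.append(f"{field} must be a number")
            continue
        if not 0 <= score <= 100:
            errors.append(f"{field} must be between 0 and 100")
    return errors


form = {
    "gender": "male",
    "race_ethnicity": "group A",
    "parental_education": "bachelor's degree",
    "lunch": "standard",
    "test_preparation_course": "completed",
    "math_score": "85",
    "reading_score": "90",
}
print(validate_student_input(form))
```

Returning a list of messages rather than raising on the first problem lets the web form show all errors at once.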
from src.pipeline.predict_pipeline import PredictionPipeline

# Create predictor
predictor = PredictionPipeline()

# Make prediction
input_data = {
    'gender': 'male',
    'race_ethnicity': 'group A',
    'parental_education': 'bachelor\'s degree',
    'lunch': 'standard',
    'test_preparation_course': 'completed',
    'math_score': 85,
    'reading_score': 90
}
prediction = predictor.predict(input_data)
print(f"Predicted Writing Score: {prediction}")

python -c "from src.pipeline.train_pipeline import TrainPipeline; pipeline = TrainPipeline(); pipeline.main()"

python -c "from src.pipeline.train_pipeline import TrainPipeline; pipeline = TrainPipeline(); pipeline.evaluate_model()"

- Loads raw student data
- Splits into training and testing sets
- Handles data validation
- Categorical encoding
- Feature scaling
- Missing value handling
- Outlier detection
- Trains multiple ML algorithms
- Performs hyperparameter tuning
- Selects best performing model
- Saves model artifacts
- Loads trained model
- Processes input data
- Generates predictions
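The "save model artifacts" and "load trained model" steps can be illustrated with `pickle`, which the technology stack lists for model serialization. This is a sketch under assumptions: the helpers `save_object` and `load_object`, the toy training data, and the temporary path are hypothetical stand-ins, not the contents of `src/utils.py` or the real `artifacts/model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression


def save_object(path, obj):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_object(path):
    """Load a pickled object; only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Toy stand-in for the trained model: writing score from math and reading scores.
X_train = np.array([[60, 65], [70, 72], [80, 85], [90, 88]], dtype=float)
y_train = np.array([63, 71, 84, 89], dtype=float)
model = LinearRegression().fit(X_train, y_train)

# Round-trip the model through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_object(model_path, model)
    restored = load_object(model_path)

new_student = np.array([[85.0, 90.0]])
original_prediction = model.predict(new_student)[0]
restored_prediction = restored.predict(new_student)[0]
print(f"Predicted writing score: {restored_prediction:.1f}")
```

In a real pipeline the fitted preprocessor would need to be saved and reloaded alongside the model, so that prediction inputs are encoded exactly as the training data was.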
The dashboard includes 9 interactive visualizations:
1. Student Demographics Distribution

2. Score Distributions and Correlations

4. Feature Importance Analysis

7. Score Distribution Analysis

The project includes:
- ✅ Custom exception handling
- ✅ Detailed logging system
- ✅ Data validation
- ✅ Model validation
- ✅ Error recovery mechanisms
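Custom exception handling and logging of this kind are often combined in a small wrapper that records where a pipeline stage failed. The sketch below is one possible shape, using only the standard library; `PipelineError`, `run_stage`, and the logger name are hypothetical and are not the classes defined in `src/exception.py` or `src/logger.py`.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("student_score")


class PipelineError(Exception):
    """Wrap an exception with the stage name and the location where it was raised."""

    def __init__(self, stage, original):
        tb = original.__traceback__
        location = "unknown location"
        if tb is not None:
            # Walk to the innermost frame, where the error actually occurred.
            while tb.tb_next is not None:
                tb = tb.tb_next
            location = f"{tb.tb_frame.f_code.co_filename}:{tb.tb_lineno}"
        super().__init__(f"{stage} failed at {location}: {original!r}")
        self.stage = stage


def run_stage(stage, func, *args):
    """Run one pipeline stage, logging and re-raising failures as PipelineError."""
    try:
        result = func(*args)
    except Exception as exc:
        logger.exception("Stage %s failed", stage)
        raise PipelineError(stage, exc) from exc
    logger.info("Stage %s completed", stage)
    return result


try:
    run_stage("data_ingestion", lambda: 1 / 0)
except PipelineError as err:
    print(err)
```

Chaining with `raise ... from exc` keeps the original traceback available on `__cause__` for debugging.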
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Your Name - End-to-End ML Project Developer
For questions or issues, please:
- Open an issue on GitHub
- Contact: your.email@example.com
- Check existing documentation
Made with ❤️ for the ML Community




