An AI-powered bus delay analysis and prediction system built with machine learning models. This project provides both backend model training and a modern, interactive frontend for analyzing and predicting transportation delays.
The Transport Delay Predictor system includes:
- Multiple ML Models: XGBoost, Random Forest, Linear Regression, and K-Nearest Neighbors (KNN)
- Interactive Frontend: Web-based UI with real-time predictions and visualizations
- Streamlit Dashboard: Advanced analytics and model performance monitoring
- Data Processing Pipeline: Automated feature engineering and preprocessing
- Delay Prediction: Predict bus delays based on various features (route, weather, passenger count, time, location)
- Model Comparison: Compare predictions across multiple ML models simultaneously
- Performance Metrics: View MAE (Mean Absolute Error) and R² scores for each model
- Data Analysis: Exploratory data analysis with interactive visualizations
- Historical Data Insights: Analyze patterns and trends in historical delay data
| Model | Test MAE (minutes) | Test R² | Status |
|---|---|---|---|
| XGBoost | 56.29 | 0.425 | Recommended |
| Random Forest | 56.29 | 0.427 | Good |
| Linear Regression | 62.53 | 0.185 | Baseline |
| K-Nearest Neighbors | 67.72 | -0.043 | |
- Python 3.8+
- Node.js (optional, for running HTTP server)
- Git
- Clone the repository (if not already done):
git clone <repository-url>
cd "Transport Train Model"
- Create a virtual environment:
python -m venv .venv
.\.venv\Scripts\activate # On Windows
source .venv/bin/activate # On macOS/Linux
- Install dependencies:
pip install -r requirements.txt
- Download the cleaned dataset (if not included):
  - Place cleaned_transport_dataset.csv in the project root
streamlit run app.py
Access at: http://localhost:8501
# In PowerShell/Terminal
python -m http.server 8000
# Or use Node.js
npm install -g http-server
http-server
Access at: http://localhost:8000
├── app.py # Streamlit application (advanced analytics)
├── app.js # Frontend JavaScript logic
├── index.html # Web frontend UI
├── styles.css # Frontend styling
├── train_models.py # Model training script
├── transport_delay_analysis.ipynb # Jupyter notebook for analysis
├── cleaned_transport_dataset.csv # Processed dataset
├── dirty_transport_dataset.csv # Raw dataset
├── requirements.txt # Python dependencies
├── model_evaluation_summary.csv # Model performance metrics
├── README.md # This file
│
├── models/ # Trained model artifacts
│   ├── linear_regression.pkl
│   ├── random_forest.pkl
│   ├── xgboost.pkl
│   ├── knn.pkl
│   ├── scaler.pkl
│   ├── label_encoder_*.pkl
│   └── metadata.json
│
├── tools/ # Utility scripts
│   ├── extract_importances.py
│   └── extract_xgb_importances.py
│
└── old/ # Archived files
    ├── app_old.py
    └── train_models_old.py
To retrain all models with your data:
python train_models.py
This script will:
- Load and preprocess the cleaned dataset
- Perform feature engineering
- Split data into train/test sets
- Train all four ML models
- Evaluate model performance
- Save trained models and metadata
- Generate performance metrics CSV
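The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of train_models.py: the data is synthetic, the hyperparameters and the 80/20 split are assumptions, and XGBoost is left out so that the sketch depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix (hypothetical data).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, so no test statistics leak in.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# Train each model and collect its test-set metrics.
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(f"{name}: MAE={metrics['mae']:.2f}, R2={metrics['r2']:.3f}")
```

Because the synthetic target is linear in the features, Linear Regression scores best here; that ranking says nothing about the real dataset.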
The models use the following features:
- Temporal: Hour, Day of Week, Time of Day, Weekend indicator
- Location: Latitude, Longitude
- Traffic: Route ID, Passenger Count
- Weather: Weather Condition, Weather Severity
- Target variable: Delay (minutes), the actual delay relative to the scheduled time
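As an illustration of the temporal features listed above, the sketch below derives them from a timestamp. The function name, the output keys, and the time-of-day bucket boundaries are assumptions made for this example; the project's actual feature engineering may differ.

```python
from datetime import datetime


def temporal_features(ts: datetime) -> dict:
    """Derive hour, day of week, time of day, and weekend flag from a timestamp."""
    hour = ts.hour
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6

    # Bucket boundaries are illustrative assumptions, not taken from the project.
    if 5 <= hour < 12:
        time_of_day = "morning"
    elif 12 <= hour < 17:
        time_of_day = "afternoon"
    elif 17 <= hour < 21:
        time_of_day = "evening"
    else:
        time_of_day = "night"

    return {
        "hour": hour,
        "day_of_week": day_of_week,
        "time_of_day": time_of_day,
        "is_weekend": int(day_of_week >= 5),
    }


# Saturday, 6 December 2025, 08:30
example = temporal_features(datetime(2025, 12, 6, 8, 30))
print(example)
```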
- Key metrics (total records, mean/median/max delay)
- Delay distribution histogram
- Delay by route analysis
- Weather impact analysis
- Passenger count correlation
- Dataset preview table
- Interactive prediction form with all input parameters
- Model selection dropdown (XGBoost, Random Forest, Linear Regression, KNN)
- Real-time predictions with status badges
- Gauge chart visualization
- All-models comparison table
- Statistics Tab: Comprehensive dataset statistics
- Exploratory Tab: Correlation analysis and visualizations
- Raw Data Tab: Searchable, filterable data table
- MAE comparison chart
- R² score comparison chart
- Model rankings and recommendations
- Detailed performance metrics table
The Streamlit app (app.py) provides advanced features:
- Real-time model retraining interface
- Cross-validation results
- Feature importance visualization
- Shapley value explanations
- Custom prediction scenarios
- Gradient boosting ensemble method
- Strong overall performance (R² = 0.425, with test MAE on par with Random Forest)
- Robust to outliers and non-linear relationships
- Suitable for production use
- Ensemble of decision trees
- Good generalization (R² = 0.427)
- Provides feature importance scores
- Parallel prediction capability
- Baseline statistical model
- Interpretable coefficients
- Moderate performance (R² = 0.185)
- Fast inference
- Instance-based learning
- Reference model for comparison
- Lowest performance (R² = -0.043, slightly worse than always predicting the mean delay)
- Useful for local pattern analysis
Models are evaluated using:
- MAE (Mean Absolute Error): Average prediction error in minutes
- RMSE (Root Mean Squared Error): Penalizes larger errors more heavily
- R² Score: Coefficient of determination (1.0 is a perfect fit; values below 0 mean the model does worse than always predicting the mean delay)
- Cross-Validation: k-fold CV for stability assessment
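To make the metric definitions above concrete, here is a small dependency-free sketch that computes MAE, RMSE, and R² by hand. The function name and the sample values are made up for illustration; in practice scikit-learn's metric functions would normally be used instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical delays in minutes.
y_true = [10.0, 20.0, 30.0, 40.0]
good = regression_metrics(y_true, [12.0, 18.0, 33.0, 39.0])
bad = regression_metrics(y_true, [40.0, 30.0, 20.0, 10.0])

print("good:", good)
print("bad:", bad)
```

The second set of predictions is worse than simply predicting the mean of `y_true`, so its R² is negative. The same effect explains the KNN score of -0.043 in the results table.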
- No personal data is collected or stored
- Dataset contains aggregated transportation metrics only
- All model artifacts are saved locally
- No external API calls for predictions
# Retrain models
python train_models.py
# Change port for HTTP server
python -m http.server 9000
# For Streamlit
streamlit run app.py --server.port 8502
# Upgrade dependencies
pip install --upgrade -r requirements.txt
Ensure CSV files use UTF-8 encoding.
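For the UTF-8 requirement, one quick way to check a file before loading it is to try decoding its bytes. The helper below is a hypothetical sketch (the name `is_valid_utf8` is not part of the project), demonstrated on two throwaway files rather than the real dataset.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Demo on two temporary files with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.csv"
    bad = Path(tmp) / "bad.csv"
    good.write_bytes("route,delay\nCafé Line,5\n".encode("utf-8"))
    bad.write_bytes("route,delay\nCafé Line,5\n".encode("latin-1"))
    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)

print(good_ok, bad_ok)
```

A file that fails this check can usually be re-saved as UTF-8 from a spreadsheet tool or text editor.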
- Added K-Nearest Neighbors model to the set of supported models
- Integrated into all UI components (frontend and Streamlit)
- Added to model comparison visualizations
- Includes performance metrics evaluation
- Deep Learning models (LSTM, Neural Networks)
- Real-time data ingestion
- Geographic heat maps
- Mobile app version
- REST API for external integrations
- Automated retraining pipeline
- Model explainability dashboard
- Anomaly detection for unusual delays
See requirements.txt for full list:
- pandas: Data manipulation
- numpy: Numerical computing
- scikit-learn: ML algorithms & preprocessing
- xgboost: Gradient boosting
- plotly: Interactive visualizations
- streamlit: Web framework
- jupyter: Notebook environment
- joblib: Model serialization
- Train and test the model in the Jupyter notebook
- Add model saving to train_models.py
- Update app.py to load the new model
- Add the model to the frontend model selection in app.js
- Update the performance comparison in index.html
- Run tests and validate predictions
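The save and load steps for a new model might look like the sketch below, assuming the artifacts are serialized with joblib (listed in the dependencies). The model type, the file name gradient_boosting.pkl, and the synthetic data are placeholders; a temporary directory stands in for models/ so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic training data (hypothetical, stands in for the real features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

model = GradientBoostingRegressor(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # In the project this path would live under models/ (assumed layout).
    path = Path(tmp) / "gradient_boosting.pkl"
    joblib.dump(model, path)       # what the training script would do
    restored = joblib.load(path)   # what the app would do at startup

# The restored model should reproduce the original predictions.
same = bool(np.allclose(model.predict(X), restored.predict(X)))
print("round trip OK:", same)
```

Any scaler or encoder fitted during training needs to be saved and reloaded the same way, so that inputs at prediction time are transformed exactly as they were during training.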
- Python: PEP 8 compliance
- JavaScript: ES6+ standards
- HTML/CSS: Semantic markup
This project is provided as-is for educational and operational purposes.
For issues or questions:
- Check the troubleshooting section
- Review model training logs
- Verify dataset format
- Check browser console for frontend errors
For more information about this project, please refer to the model documentation and code comments.
Last Updated: December 2025
Models Included: XGBoost, Random Forest, Linear Regression, K-Nearest Neighbors
Dataset: Transport Delay Analysis (500+ records)