A comprehensive machine learning project for predicting Air Quality Index (AQI) in Kolkata using Linear Regression
This project implements a Linear Regression model to predict Air Quality Index (AQI) values for Kolkata, India. By analyzing pollutant concentrations and temporal patterns, the model provides accurate predictions of air quality conditions, enabling better environmental monitoring and public health awareness.
Built by: valiantProgrammer (Rupayan Dey)
City Day Dataset - Historical air quality measurements from Indian cities
- Time Period: Multi-year air quality records
- Records: 804 daily observations
- Collection Method: Ground-based monitoring stations
| Feature | Unit | Description | Range |
|---|---|---|---|
| PM2.5 | µg/m³ | Fine Particulate Matter | Variable |
| PM10 | µg/m³ | Coarse Particulate Matter | Variable |
| NO2 | ppb | Nitrogen Dioxide | Variable |
| CO | ppm | Carbon Monoxide | Variable |
| SO2 | ppb | Sulfur Dioxide | Variable |
| AQI | Index | Air Quality Index (Target) | 0-500+ |
| AQI_Status | Category | Air Quality Category | Good/Moderate/Poor/Severe |
✓ Extracted Kolkata-specific records (rows 19,042 to 19,846 of the full dataset)
✓ Selected relevant columns for analysis
✓ Renamed columns for clarity
✓ Removed missing values in AQI column
✓ Converted date strings to datetime format
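The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the raw file has `City`, `Date`, `PM2.5`, `PM10`, `NO2`, `CO`, `SO2`, `AQI`, and `AQI_Bucket` columns, it filters on `City` instead of slicing by row position, and the helper name `prepare_kolkata` is hypothetical. A tiny synthetic frame stands in for `city_day.csv`.

```python
import pandas as pd

# Assumed raw column names; adjust them if your copy of city_day.csv differs.
KEEP_COLUMNS = ["Date", "PM2.5", "PM10", "NO2", "CO", "SO2", "AQI", "AQI_Bucket"]
RENAME_MAP = {"PM2.5": "PM-2.5", "PM10": "PM-10", "AQI_Bucket": "AQI_Status"}


def prepare_kolkata(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: filter to Kolkata, keep modelling columns, clean the target."""
    df = raw.loc[raw["City"] == "Kolkata", KEEP_COLUMNS].copy()
    df = df.rename(columns=RENAME_MAP)          # rename columns for clarity
    df = df.dropna(subset=["AQI"])              # drop rows with a missing target
    df["Date"] = pd.to_datetime(df["Date"])     # date strings -> datetime
    return df.reset_index(drop=True)


# Tiny synthetic frame standing in for the real dataset
raw = pd.DataFrame({
    "City": ["Delhi", "Kolkata", "Kolkata", "Kolkata"],
    "Date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-03"],
    "PM2.5": [210.0, 95.0, 88.0, 70.0],
    "PM10": [320.0, 160.0, 150.0, 120.0],
    "NO2": [60.0, 45.0, 40.0, 38.0],
    "CO": [2.1, 1.2, 1.1, 0.9],
    "SO2": [15.0, 9.0, 8.0, 7.5],
    "AQI": [380.0, 190.0, None, 150.0],
    "AQI_Bucket": ["Very Poor", "Moderate", None, "Moderate"],
})

kolkata = prepare_kolkata(raw)
print(kolkata)
```

With the real file, `raw` would come from `pd.read_csv("city_day.csv")`.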
- Data Exploration & Analysis
  - Temporal pattern analysis
  - Pollutant concentration trends
  - Missing data assessment
- Visualization Suite
  - Time series analysis
  - Correlation matrices
  - Distribution analysis
  - Heatmap representations
- Machine Learning Pipeline
  - Feature engineering (date ordinal conversion)
  - Train-test split (80-20)
  - Linear regression modeling
  - Performance evaluation
- Model Evaluation
  - Multiple performance metrics
  - Residual analysis
  - Prediction accuracy visualization
  - Coefficient interpretation
═════════════════════════════════════════════════════
MODEL PERFORMANCE EVALUATION
═════════════════════════════════════════════════════
│ R² Score (Coefficient of Determination): 0.9527 │
│ Mean Absolute Error (MAE): 15.9930 │
│ Mean Squared Error (MSE): 530.3787 │
│ Root Mean Squared Error (RMSE): 23.0300 │
═════════════════════════════════════════════════════
- R² Score of 0.9527: The model explains 95.27% of the variance in AQI values on the test set
- MAE of 15.99: Predictions are off by approximately 16 AQI points on average
- RMSE of 23.03: Higher than the MAE because squaring weights large errors more heavily, so some predictions miss by well over 16 points
- High Accuracy: Relative to an index range of 0-500+, an average error of about 16 points indicates strong predictive capability for real-world AQI forecasting
Insights (AQI heatmap, 01_aqi_heatmap.png):
- Shows seasonal patterns in air quality
- Identifies months with consistently poor air quality
- Reveals year-over-year trends
- Highlights winter months with elevated AQI values
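A month-by-year view like this is typically built from a pivot table of mean AQI. The sketch below is illustrative only: the data is synthetic with an artificial winter peak, and the `Year` and `Month` helper columns are assumptions, not names taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily AQI with a peak in mid-January, standing in for the Kolkata data
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
aqi = 100 + 80 * np.cos(2 * np.pi * (day_of_year - 15) / 365.25)
df = pd.DataFrame({"Date": dates, "AQI": aqi})

# Helper columns (hypothetical names) for the two heatmap axes
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

# Rows are years, columns are months, cells hold the monthly mean AQI
monthly = df.pivot_table(index="Year", columns="Month", values="AQI", aggfunc="mean")
print(monthly.round(1))
```

The resulting frame can be passed to `seaborn.heatmap(monthly)` to render the figure.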
Insights (AQI time series, 02_aqi_timeseries.png):
- Continuous temporal trend of AQI levels
- Identifies seasonal peaks and valleys
- Shows variability in daily measurements
- Reveals long-term air quality trajectory
Insights (pollutant distributions, 03_pollutant_distributions.png):
- PM2.5: Highly variable with multiple peaks
- PM10: Broader distribution pattern
- NO2: Concentrated distribution with right skew
- CO: Shows typical urban contamination pattern
- SO2: Lower concentrations with occasional spikes
- AQI: Composite index reflecting all pollutants
Key Findings:
- Strong correlations: Pollutants show moderate to strong relationships
- PM2.5 & PM10: High positive correlation (common source)
- NO2: Correlated with traffic/combustion sources
- AQI Relationship: AQI strongly correlates with all measured pollutants
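Findings of this kind come from a pairwise correlation matrix. The following is a small sketch on synthetic data, so the numbers it prints say nothing about Kolkata; the column names mirror the renamed columns used elsewhere in this README.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
pm25 = rng.gamma(shape=4.0, scale=20.0, size=n)

# Synthetic pollutants: PM-10 and NO2 are built to share a source with PM-2.5
df = pd.DataFrame({
    "PM-2.5": pm25,
    "PM-10": 1.6 * pm25 + rng.normal(0, 10, n),
    "NO2": 0.3 * pm25 + rng.normal(0, 8, n),
})
df["AQI"] = 1.2 * df["PM-2.5"] + 0.4 * df["PM-10"] + rng.normal(0, 10, n)

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))

# Correlation of each pollutant with the target, strongest first
aqi_corr = corr["AQI"].drop("AQI").sort_values(ascending=False)
print(aqi_corr.round(2))
```

`seaborn.heatmap(corr, annot=True)` is one way to draw the matrix as a heatmap.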
Model Performance Visualization:
- Points cluster tightly around the diagonal line
- Indicates accurate predictions across the AQI range
- Minimal outliers and prediction errors
- Demonstrates model reliability
Residual Diagnostics:
- Histogram: Approximately normal distribution, consistent with the normal-errors assumption of linear regression
- Scatter Plot: Residuals centered around zero
- Pattern: No systematic bias detected
- Spread: Relatively consistent across prediction range
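These diagnostics can also be checked numerically. The sketch below uses synthetic data and two stand-in feature columns; the specific checks (mean residual, spread in the lower and upper half of predictions, share within two standard deviations) are illustrative choices and are not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 200, size=(500, 2))  # two stand-in pollutant columns
y = 20 + 1.1 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 15, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

pred = model.predict(X_test)
residuals = y_test - pred

# Centred near zero: no systematic over- or under-prediction
print(f"mean residual: {residuals.mean():.2f}")

# Consistent spread: compare residual std for low and high predictions
low = residuals[pred <= np.median(pred)]
high = residuals[pred > np.median(pred)]
print(f"std (low preds): {low.std():.2f}, std (high preds): {high.std():.2f}")

# Rough normality check: a normal distribution puts about 95% within 2 SD
within_2sd = np.mean(np.abs(residuals - residuals.mean()) <= 2 * residuals.std())
print(f"share within 2 SD: {within_2sd:.2%}")
```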
Feature Importance:
- Shows the weight/importance of each feature
- Positive coefficients: Features that increase predicted AQI
- Negative coefficients: Features that decrease predicted AQI
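- Relative magnitudes indicate impact per unit change in each feature; because the features are on different scales, magnitudes are only directly comparable after standardization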
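Raw coefficients depend on each feature's units, so a feature with a small numeric range can show a large coefficient without being the main driver. A hedged sketch of one common workaround, standardizing features before fitting, is shown below on synthetic data; `StandardScaler` is not mentioned in this project and is used here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "PM-2.5": rng.uniform(20, 250, n),  # large numeric range
    "CO": rng.uniform(0.2, 3.0, n),     # small numeric range
})
y = 10 + 1.0 * X["PM-2.5"] + 20.0 * X["CO"] + rng.normal(0, 5, n)

# Same model, fitted on raw and on standardized (zero mean, unit variance) features
raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

coef = pd.DataFrame(
    {"raw_coef": raw.coef_, "standardized_coef": scaled.coef_},
    index=X.columns,
)
print(coef.round(2))
```

In this synthetic case CO has the larger raw coefficient, while PM-2.5 has the larger standardized one, which is the comparison that reflects influence on the prediction.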
├─ NumPy # Numerical computations
├─ Pandas # Data manipulation & analysis
├─ Scikit-Learn # Machine learning models
│ ├─ LinearRegression
│ ├─ train_test_split
│ └─ Metrics (R², MAE, MSE)
├─ Matplotlib # Static visualization
├─ Seaborn # Statistical plotting
└─ Jupyter # Interactive notebook environment
- Minimum: Python 3.6+
- Recommended: Python 3.8+
- Python 3.6 or higher
- pip or conda package manager
- Git (optional, for version control)
- Clone the Repository
    git clone https://github.com/valiantProgrammer/AQI-prediction.git
    cd AQI-prediction
- Create Virtual Environment (Recommended)
    python -m venv venv
    # Windows
    venv\Scripts\activate
    # macOS/Linux
    source venv/bin/activate
- Install Dependencies
    pip install pandas numpy scikit-learn matplotlib seaborn jupyter
- Launch Jupyter Notebook
    jupyter notebook 01_aqi_analysis.ipynb
- Open 01_aqi_analysis.ipynb in Jupyter
- Execute cells sequentially (Shift + Enter)
- Review outputs and visualizations after each cell
- Analyze printed metrics and model performance
import pandas as pd
from sklearn.linear_model import LinearRegression
# Assumes `model` is the LinearRegression instance already fitted in the notebook
# Define input pollutant values
input_pm25 = 90.0 # PM2.5 concentration
input_pm10 = 60.0 # PM10 concentration
input_no2 = 60.0 # NO2 concentration
input_co = 3.0 # CO concentration
input_so2 = 8.0 # SO2 concentration
input_date = '2023-01-15' # Date for prediction
# Convert date to ordinal format
input_date_ordinal = pd.to_datetime(input_date).toordinal()
# Create input DataFrame
new_data = pd.DataFrame([[
input_pm25, input_pm10, input_no2, input_co, input_so2, input_date_ordinal
]], columns=['PM-2.5', 'PM-10', 'NO2', 'CO', 'SO2', 'Date_ordinal'])
# Generate prediction
predicted_aqi = model.predict(new_data)
print(f"Predicted AQI: {predicted_aqi[0]:.2f}")
python generate_aqi_visualizations.py
This script will:
- Load the data
- Process and prepare features
- Train the model
- Generate all visualizations
- Save images to the assets/ folder
- Print performance metrics
┌─────────────────────────────────────────────────────────────┐
│ 1. DATA LOADING & EXPLORATION │
│ ├─ Load city_day.csv dataset │
│ ├─ Extract Kolkata records │
│ ├─ Analyze data structure and content │
│ └─ Check for missing values │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 2. DATA PREPROCESSING & CLEANING │
│ ├─ Remove null values in target (AQI) │
│ ├─ Select relevant features │
│ ├─ Rename columns for clarity │
│ └─ Convert date strings to datetime │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 3. EXPLORATORY DATA ANALYSIS │
│ ├─ Generate descriptive statistics │
│ ├─ Create correlation matrices │
│ ├─ Visualize distributions │
│ └─ Identify temporal patterns │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 4. FEATURE ENGINEERING │
│ ├─ Convert Date to ordinal format │
│ ├─ Select independent features (X) │
│ ├─ Identify target variable (y) │
│ └─ Prepare feature matrix │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 5. TRAIN-TEST SPLIT │
│ ├─ 80% Training data │
│ └─ 20% Testing data (validation) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 6. MODEL TRAINING │
│ ├─ Initialize LinearRegression │
│ ├─ Fit model to training data │
│ └─ Extract coefficients and intercept │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 7. MODEL EVALUATION & VALIDATION │
│ ├─ Make predictions on test set │
│ ├─ Calculate R² Score │
│ ├─ Calculate MAE, MSE, RMSE │
│ ├─ Analyze residuals │
│ └─ Validate assumptions │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 8. VISUALIZATION & REPORTING │
│ ├─ Create comprehensive visualizations │
│ ├─ Generate performance plots │
│ ├─ Save visualizations to assets/ │
│ └─ Document findings and insights │
└─────────────────────────────────────────────────────────────┘
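Steps 4 to 7 of the workflow can be sketched end to end as follows. This is a minimal illustration on synthetic data shaped like the cleaned Kolkata frame (804 rows, same feature names as the prediction example); the random seeds and the `random_state` value are arbitrary assumptions, and the metrics it prints are unrelated to the results reported in this README.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

FEATURES = ["PM-2.5", "PM-10", "NO2", "CO", "SO2", "Date_ordinal"]

# Synthetic stand-in for the cleaned data
rng = np.random.default_rng(7)
n = 804
df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=n, freq="D"),
    "PM-2.5": rng.uniform(10, 250, n),
    "PM-10": rng.uniform(20, 400, n),
    "NO2": rng.uniform(5, 100, n),
    "CO": rng.uniform(0.1, 3.0, n),
    "SO2": rng.uniform(2, 30, n),
})
df["AQI"] = 15 + 0.9 * df["PM-2.5"] + 0.3 * df["PM-10"] + rng.normal(0, 20, n)

# Step 4: convert the date to an ordinal integer, then build X and y
df["Date_ordinal"] = df["Date"].apply(lambda d: d.toordinal())
X, y = df[FEATURES], df["AQI"]

# Step 5: 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 6: fit the linear regression and inspect its parameters
model = LinearRegression().fit(X_train, y_train)
print("intercept:", round(model.intercept_, 3))
print("coefficients:", dict(zip(FEATURES, model.coef_.round(3))))

# Step 7: evaluate on the held-out 20%
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"R2: {r2:.4f}  MAE: {mae:.4f}  MSE: {mse:.4f}  RMSE: {rmse:.4f}")
```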
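$$\hat{AQI} = \beta_0 + \beta_1 \cdot PM_{2.5} + \beta_2 \cdot PM_{10} + \beta_3 \cdot NO_2 + \beta_4 \cdot CO + \beta_5 \cdot SO_2 + \beta_6 \cdot Date_{ordinal}$$
Where:
- $\hat{AQI}$ = Predicted Air Quality Index
- $\beta_0$ = Intercept (base AQI level)
- $\beta_1, ..., \beta_6$ = Feature coefficients
- Each coefficient represents the contribution of that feature (five pollutants plus the date ordinal) to AQI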
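A fitted linear regression predicts by adding the intercept to the coefficient-weighted sum of the features. The short sketch below checks this on synthetic data with six feature columns; the coefficient values in `true_beta` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 100, size=(200, 6))  # six feature columns, as in the model
true_beta = np.array([0.9, 0.3, 0.2, 5.0, 0.4, 0.01])  # made-up values
y = 12.0 + X @ true_beta + rng.normal(0, 2, 200)

model = LinearRegression().fit(X, y)

x_new = X[:1]  # a single row, shape (1, 6)
manual = model.intercept_ + x_new @ model.coef_  # beta_0 + sum(beta_i * x_i)
library = model.predict(x_new)
print(manual, library)
```

Both values agree up to floating-point rounding, since `predict` evaluates the same expression.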
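R² Score (Coefficient of Determination):
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Mean Absolute Error:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Root Mean Squared Error:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Here $y_i$ is the observed AQI, $\hat{y}_i$ the predicted AQI, $\bar{y}$ the mean observed AQI, and $n$ the number of test samples.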
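The three metrics can be computed directly from their standard definitions and compared with scikit-learn's implementations. The five AQI values below are invented for the example; the last line only checks that the MSE and RMSE reported in this README are consistent with each other.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented observed and predicted AQI values
y_true = np.array([120.0, 95.0, 210.0, 160.0, 75.0])
y_pred = np.array([110.0, 100.0, 190.0, 170.0, 80.0])

errors = y_true - y_pred
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

print(f"manual : R2={r2:.4f} MAE={mae:.2f} RMSE={rmse:.2f}")
print(
    f"sklearn: R2={r2_score(y_true, y_pred):.4f} "
    f"MAE={mean_absolute_error(y_true, y_pred):.2f} "
    f"RMSE={np.sqrt(mean_squared_error(y_true, y_pred)):.2f}"
)

# Consistency check on the reported figures: RMSE is the square root of MSE
reported_rmse = np.sqrt(530.3787)
print(f"sqrt(530.3787) = {reported_rmse:.2f}")
```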
- Seasonal Patterns: Clear seasonal variation in AQI with winter peaks
- Pollutant Relationships: Strong correlations between PM2.5, PM10, and AQI
- Temporal Trends: Year-over-year variations in air quality
- Urban Impact: Elevated pollutant levels correlate with traffic patterns
- High Predictability: An R² of 0.9527 indicates excellent model fit
- Feature Importance: PM2.5 and PM10 are primary AQI drivers
- Temporal Factor: Date ordinal captures the long-term linear trend; a single ordinal term cannot represent repeating seasonal cycles
- Residual Properties: Approximately normal residuals are consistent with linear regression assumptions
- Seasonal Risk: Winter months require enhanced monitoring
- Policy Relevance: Identifies primary pollution sources
- Public Health: Enables early AQI forecasting
- Mitigation Strategy: Supports targeted pollution control
| Issue | Solution |
|---|---|
| FileNotFoundError: city_day.csv | Ensure CSV is in project directory |
| ModuleNotFoundError | Run pip install -r requirements.txt |
| Jupyter not found | Run pip install jupyter |
| Memory error with large CSV | Use data chunking or filtering |
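For the memory issue, one possible approach is to read the CSV in chunks and keep only the Kolkata rows from each chunk. The sketch below uses an in-memory string in place of `city_day.csv` so it runs on its own; the `City` column name and the chunk size are assumptions.

```python
import io

import pandas as pd

# In-memory stand-in for city_day.csv; pass the real file path instead
csv_text = """City,Date,AQI
Delhi,2019-01-01,380
Kolkata,2019-01-01,190
Kolkata,2019-01-02,175
Mumbai,2019-01-02,120
Kolkata,2019-01-03,150
"""

chunks = []
# chunksize makes read_csv yield DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunks.append(chunk[chunk["City"] == "Kolkata"])  # filter before keeping

kolkata = pd.concat(chunks, ignore_index=True)
print(kolkata)
```

On the real file a much larger `chunksize` (for example several thousand rows) would be more practical.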
AQI-prediction/
├── 01_aqi_analysis.ipynb # Main Jupyter notebook
├── generate_aqi_visualizations.py # Visualization generation script
├── city_day.csv # Dataset
├── readme2.md # This file
└── assets/ # Generated visualizations
├── 01_aqi_heatmap.png
├── 02_aqi_timeseries.png
├── 03_pollutant_distributions.png
├── 04_correlation_heatmap.png
├── 05_actual_vs_predicted.png
├── 06_residuals_analysis.png
└── 07_model_coefficients.png
By working with this project, you will learn:
✅ How to handle time-series environmental data
✅ Data cleaning and preprocessing techniques
✅ Exploratory data analysis with multiple visualizations
✅ Linear regression model implementation
✅ Performance metric evaluation and interpretation
✅ Residual analysis and assumption validation
✅ Temporal feature engineering (date conversion)
✅ Model prediction on new data
✅ Visualization best practices
✅ Environmental data science applications
This project demonstrates techniques applicable to:
- Environmental Monitoring: Real-time air quality prediction
- Public Health: Forecasting health risk periods
- Urban Planning: Identifying pollution hotspots
- Climate Studies: Long-term air quality trends
- Policy Making: Data-driven pollution control strategies
- Research: Environmental data analysis methods
- Advanced Models
  - Implement Random Forest or Gradient Boosting
  - Try neural networks for pattern detection
  - Compare multiple model performances
- Feature Engineering
  - Add weather data (temperature, humidity, wind speed)
  - Create interaction terms between pollutants
  - Add holiday/weekend indicators
- Temporal Analysis
  - Implement time-series forecasting (ARIMA, LSTM)
  - Add lag features for sequential patterns
  - Conduct multi-step ahead predictions
- Additional Features
  - Include traffic data from nearby regions
  - Incorporate industrial activity indices
  - Add geographic/topographic features
- Deployment
  - Create REST API for predictions
  - Build web dashboard for visualization
  - Deploy model to cloud platform
Contributions are welcome! To contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit changes (git commit -m 'Add AmazingFeature')
- Push to branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub: valiantProgrammer
- Email: rupayandey134@gmail.com
- Issues: Please report bugs via GitHub Issues
- Dataset Source: Open government air quality data
- Libraries: Scikit-learn, Pandas, Matplotlib, Seaborn teams
- Jupyter Project: For interactive computing environment
- Community: Data science and environmental monitoring communities
Made with ❤️ by valiantProgrammer






