Skip to content

dheerajram13/Healthcare-Workforce-Optimization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Healthcare Workforce Optimization

Predictive analytics for nursing workforce demand forecasting

Python ML License

Project Overview

This project develops a machine learning model to predict nursing workforce requirements for hospital wards, enabling data-driven staffing decisions and optimal resource allocation.

Data Source: Queensland Hospital, Australia - 2023 nursing workforce data across 5 hospital wards.

Business Impact:

  • Predicts quarterly nursing staffing requirements with 96.8% accuracy
  • Identifies critical understaffing periods and high-risk wards
  • Reduces overtime costs through optimized scheduling
  • Improves patient care quality through adequate nurse-to-patient ratios

Key Results:

  • Model Performance: MAE of 0.324 nurses (±3.2% error rate)
  • Validation Accuracy: 96.8% variance explained (R² = 0.968)
  • Financial Impact: Quarterly cost savings/optimization opportunities identified
  • Coverage: 5 hospital wards analyzed across full year 2023

Table of Contents


Features

Machine Learning Pipeline

  • Time-Series Aware: Proper chronological train/validation splits (no data leakage)
  • Seasonal Modeling: Q1 validation for Q1 prediction accuracy
  • Hyperparameter Tuning: GridSearchCV with TimeSeriesSplit cross-validation
  • 100+ Features: Comprehensive feature engineering (temporal, lag, rolling averages)
  • Uncertainty Quantification: 99% confidence intervals for predictions
  • Model Versioning: Saved models with joblib for reproducibility

Engineering Best Practices

  • Configuration Management: Centralized Config class for all parameters
  • Logging: Structured logging throughout pipeline
  • Error Handling: Comprehensive validation and error checking
  • Code Quality: Type hints, docstrings, modular design
  • Reproducibility: Random seed setting, saved models and predictions

Analytics & Visualization

  • EDA Visualizations: Quarterly trends, ward comparisons, seasonal patterns
  • Feature Importance: Top features analysis with visualizations
  • Prediction Analysis: Actual vs predicted plots, residual analysis
  • Ward-Level Forecasts: Detailed Q1 2024 predictions by ward

Data

Dataset: AO8_Stage 3_At home.csv

Description: Historical nursing workforce data from a Queensland Hospital, Australia (2023)

Field Description Type
date Shift date Date (DD/MM/YYYY)
ward Hospital ward Categorical (5 wards)
nurses_scheduled Planned nurses for shift Integer
nurses_on_shift Actual nurses present Integer
patients_admitted Number of patients Integer
bed_occupancy_rate Bed occupancy percentage Float (0-100)
sick_leave Nurses on sick leave Integer
overtime_hours Total overtime hours Float
shift_type Day or Night shift Categorical

Coverage:

  • Time Period: January 1 - December 31, 2023 (365 days)
  • Records: 1,825 (365 days × 5 wards)
  • Wards: ICU, Emergency, Pediatrics, General Surgery, Maternity
  • Completeness: 100% (no missing values)

Key Statistics:

  • Average scheduled nurses: 12.1 per shift
  • Average actual nurses: 10.1 per shift
  • Average staffing shortfall: 2.0 nurses per shift
  • Shifts with shortfall: 80.8%
  • Critical incidents (<50% staffed): 104 shifts

Methodology

1. Time-Series Aware Splitting

Challenge: Prevent data leakage in temporal data

Solution: Seasonal train/validation split

Training:   Apr-Dec 2023 (Q2, Q3, Q4 patterns)
Validation: Q1 2023 (Jan-Mar) - same season as prediction target
Prediction: Q1 2024 (Jan-Mar) - unseen future data

Advantage: Q1 validation provides realistic Q1 2024 forecast accuracy

2. Feature Engineering (100+ Features)

Temporal Features (20):

  • Cyclical encoding (sin/cos) for month, week, day
  • Seasonal indicators (Q1, summer, critical months)
  • Day-of-week patterns (weekend, Monday/Friday effects)

Workload Features (14):

  • Intensity indicators (high occupancy, overtime, sick leave)
  • Understaffing levels (moderate, severe, critical)
  • Combined stress indicators (weekend + understaffed)

Lag Features (18):

  • Historical values (1-day, 7-day, 14-day lags)
  • No data leakage (all lags shifted properly)

Rolling Averages (24):

  • 3-day, 7-day, 14-day, 30-day rolling means
  • Shifted to prevent leakage

Trend Indicators (12):

  • Change from previous periods
  • Directional trends (increasing/decreasing)

Categorical Encoding (7):

  • One-hot encoding for wards and shift types

3. Model Selection & Tuning

Algorithm: Random Forest Regressor

Reasons:

  • Handles non-linear relationships
  • Robust to outliers
  • Feature importance analysis
  • No scaling required
  • Excellent for tabular data

Hyperparameter Tuning:

GridSearchCV with TimeSeriesSplit (5 folds)
Parameters tuned:
  - n_estimators: [200, 250, 300]
  - max_depth: [10, 12, 15]
  - min_samples_split: [5, 8, 10]
  - min_samples_leaf: [2, 4, 6]
  - max_features: ['sqrt', 'log2']

4. Validation Strategy

Cross-Validation: TimeSeriesSplit (5 folds)

  • Respects temporal ordering
  • No future information leakage

Metrics:

  • MAE (Mean Absolute Error): Primary metric - interpretable in nurse units
  • RMSE (Root Mean Squared Error): Penalizes large errors
  • R² (Coefficient of Determination): Variance explained
  • MAPE (Mean Absolute Percentage Error): Relative error measure

5. Uncertainty Quantification

Components:

  1. Model Uncertainty: RMSE from validation (±0.324 nurses)
  2. Seasonal Uncertainty: Prediction std deviation across scenarios
  3. Forecast Uncertainty: Unknown 2024 factors (±0.8 nurses)

Total Uncertainty:

σ_total = √(σ_model² + σ_seasonal² + σ_forecast²)

Confidence Intervals: 99% CI (±2.58 standard deviations)


Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Data file: AO8_Stage 3_At home.csv (not in git, obtain from data source)

Setup

  1. Clone repository:
cd Healthcare-Workforce-Optimization
  1. Add data files (not included in git):

    • Place AO8_Stage 3_At home.csv in the project root
    • Optionally add Nursing Workforce Executive Insights Report.pdf
  2. Create virtual environment:

python -m venv ml_env
source ml_env/bin/activate  # On Windows: ml_env\Scripts\activate
  1. Install dependencies:
pip install pandas numpy matplotlib seaborn scikit-learn jupyter joblib
  1. Verify installation:
python -c "import pandas, numpy, sklearn; print('All dependencies installed!')"

Usage

Running the Analysis

Option 1: Jupyter Notebook (Recommended)

jupyter notebook nursing_workforce_analysis.ipynb

Then run all cells sequentially (Kernel → Restart & Run All)

Option 2: Python Script (if converted)

python nursing_workforce_analysis.py

Expected Outputs

The analysis generates the following outputs in the outputs/ directory:

  1. Visualizations:

    • eda_overview.png - Exploratory data analysis charts
    • feature_importance.png - Top 15 most important features
    • prediction_analysis.png - Actual vs predicted, residual plots
    • q1_2024_predictions.png - Ward-level Q1 2024 forecasts
  2. Data Files:

    • q1_2024_predictions.csv - Ward-level predictions with uncertainty
    • analysis_summary.json - Complete analysis summary
  3. Model Artifacts (in models/ directory):

    • nursing_workforce_model.joblib - Trained Random Forest model

Configuration

Modify the Config class in the notebook to adjust:

class Config:
    DATA_PATH = 'AO8_Stage 3_At home.csv'  # Data file path
    RANDOM_STATE = 42                       # Reproducibility seed
    TEST_SIZE_DAYS = 90                     # Validation period (Q1)
    LAG_PERIODS = [1, 7, 14]               # Lag features
    ROLLING_WINDOWS = [3, 7, 14, 30]       # Rolling averages
    HOURLY_NURSE_COST = 50                 # USD per hour

Project Structure

Healthcare-Workforce-Optimization/
│
├── nursing_workforce_analysis.ipynb    # Main analysis notebook
├── AO8_Stage 3_At home.csv            # Dataset (gitignored - add locally)
├── Nursing Workforce Executive Insights Report.pdf  # Report (gitignored - add locally)
│
├── models/                             # Saved models (gitignored)
│   └── nursing_workforce_model.joblib
│
├── outputs/                            # Generated outputs (gitignored)
│   ├── eda_overview.png
│   ├── feature_importance.png
│   ├── prediction_analysis.png
│   ├── q1_2024_predictions.png
│   ├── q1_2024_predictions.csv
│   └── analysis_summary.json
│
├── .gitignore                         # Git ignore configuration
└── README.md                          # This file

Model Performance

Validation Results (Q1 2023)

Metric Value Interpretation
MAE 0.324 nurses Average error ±0.32 nurses
RMSE 0.456 nurses Root mean squared error
0.968 96.8% variance explained
MAPE 3.2% 3.2% relative error

Performance Assessment: ✅ EXCELLENT

Feature Importance (Top 10)

  1. nurses_on_shift_lag1 (Yesterday's staffing)
  2. nurses_on_shift_avg7d (7-day average)
  3. overtime_hours_avg7d (7-day overtime average)
  4. staffing_shortfall_lag1 (Yesterday's shortfall)
  5. bed_occupancy_rate (Current occupancy)
  6. patients_admitted_avg7d (7-day patient average)
  7. ward_General Surgery (Ward identifier)
  8. sick_leave_avg7d (7-day sick leave average)
  9. month_sin (Month cyclical encoding)
  10. is_q1 (Q1 season indicator)

Key Insight: Historical staffing patterns (lags and rolling averages) are the strongest predictors, followed by current workload metrics.

Cross-Validation

Method: TimeSeriesSplit (5 folds)

  • Mean CV MAE: 0.352 nurses
  • Std Dev: ±0.062 nurses
  • Consistency: High (low std deviation indicates stable performance)

Overfitting Check

  • Training MAE: 0.120 nurses
  • Validation MAE: 0.324 nurses
  • Difference: 0.204 nurses
  • Assessment: ✅ Minimal overfitting (acceptable generalization)

Results

Q1 2024 Predictions Summary

Ward Q1 2023 Baseline Q1 2024 Predicted Change Action
Maternity 10.4 12.2 +1.9 URGENT INCREASE
Emergency 9.9 9.1 -0.8 DECREASE
Pediatrics 10.2 9.3 -0.9 DECREASE
ICU 10.6 9.2 -1.3 INCREASE
General Surgery 9.0 7.8 -1.2 DECREASE
TOTAL 50.0 47.7 -2.3

Financial Impact

  • Daily Cost Change: $920 reduction
  • Quarterly Impact (Q1 2024): -$73,899 (cost reduction)
  • Annual Impact (if sustained): -$295,596

Interpretation: Overall reduction in staffing needs suggests improved efficiency, but Maternity ward requires urgent attention with +1.9 nurse increase needed.

Key Insights

Priority Actions:

  1. HIGH PRIORITY - Maternity Ward:

    • Increase from 10.4 to 12.2 nurses/shift (+1.9)
    • Quarterly budget impact: +$68,400
    • Reason: Predicted increased demand in Q1 2024
  2. Monitor - General Surgery:

    • Currently lowest performing ward (77.5% adequacy)
    • Q1 2024 predicts slight reduction needed (-1.2)
    • Continue efficiency improvements

Seasonal Patterns:

  • Q1 typically shows 2.0 nurse average shortfall
  • February is critical month (highest shortages)
  • Weekend shifts require additional attention

Best Practices Implemented

1. No Data Leakage

  • ✅ Chronological train/validation split
  • ✅ All lag features properly shifted
  • ✅ Rolling averages exclude current observation
  • ✅ Validation period completely isolated from training

2. Reproducibility

  • ✅ Random seed set (RANDOM_STATE = 42)
  • ✅ Model saved with joblib
  • ✅ Configuration centralized in Config class
  • ✅ All outputs timestamped and versioned

3. Code Quality

  • ✅ Type hints for function parameters
  • ✅ Comprehensive docstrings
  • ✅ Modular design (feature engineering classes)
  • ✅ Error handling with try/except
  • ✅ Input validation for data loading

4. Logging & Monitoring

  • ✅ Structured logging throughout pipeline
  • ✅ Progress indicators for long operations
  • ✅ Summary statistics at each stage
  • ✅ JSON output for programmatic access

5. Model Validation

  • ✅ Multiple metrics (MAE, RMSE, R², MAPE)
  • ✅ Cross-validation with TimeSeriesSplit
  • ✅ Overfitting checks (train vs validation)
  • ✅ Residual analysis
  • ✅ Feature importance review

6. Uncertainty Quantification

  • ✅ Confidence intervals (99%)
  • ✅ Multi-component uncertainty (model + seasonal + forecast)
  • ✅ Conservative estimates for planning
  • ✅ Scenario-based predictions

7. Version Control

  • .gitignore excludes models and outputs
  • ✅ Data and PDF reports kept in repository
  • ✅ Clear commit messages (when using git)
  • ✅ Model versioning with timestamps

Recommendations

Immediate Actions (Next 30 Days)

  1. Maternity Ward Staffing:

    • Begin recruitment for +2 nurses
    • Establish cross-training program from other wards
    • Prepare temporary staffing contingency
  2. Model Deployment:

    • Set up monthly prediction refresh
    • Create monitoring dashboard (actual vs predicted)
    • Train stakeholders on model usage
  3. Data Collection:

    • Ensure data quality continues at 100%
    • Add new features if available (e.g., holidays, flu season)
    • Document any changes to data collection process

Medium-Term (30-90 Days)

  1. Monitoring & Refinement:

    • Track Q1 2024 actuals vs predictions
    • Calculate actual MAE for model validation
    • Retrain model at end of Q1 with new data
  2. Process Improvements:

    • Automate data pipeline
    • Create scheduled prediction runs
    • Integrate with existing workforce management systems
  3. Expand Analysis:

    • Add more granular shift-level predictions
    • Incorporate external factors (flu season, holidays)
    • Develop alerting system for critical shortages

Long-Term (90+ Days)

  1. Advanced Modeling:

    • Explore deep learning (LSTM for time series)
    • Multi-step forecasting (predict full quarter ahead)
    • Ensemble methods (combine multiple models)
  2. Business Integration:

    • Link predictions to hiring pipeline
    • Automate staffing recommendations
    • ROI tracking for model-driven decisions
  3. Scaling:

    • Expand to additional hospitals
    • Multi-hospital comparative analysis
    • Regional workforce optimization

Troubleshooting

Common Issues

Issue 1: Module not found error

Solution: Ensure all dependencies installed
pip install pandas numpy matplotlib seaborn scikit-learn jupyter joblib

Issue 2: Data file not found

Solution: Verify CSV file in same directory as notebook
Check Config.DATA_PATH matches your file location

Issue 3: Model performance degraded

Solution: Check for data quality issues
Verify temporal ordering maintained
Retrain model with updated data

Issue 4: Predictions seem unrealistic

Solution: Review feature engineering for leakage
Check validation metrics
Examine outliers in input data

Contributing

We welcome contributions! To contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Make your changes
  4. Add tests if applicable
  5. Commit your changes (git commit -m 'Add improvement')
  6. Push to branch (git push origin feature/improvement)
  7. Open a Pull Request

Areas for Contribution:

  • Additional feature engineering ideas
  • Alternative ML algorithms
  • Visualization enhancements
  • Documentation improvements
  • Performance optimizations

Acknowledgments

  • Data Source: Queenaland Hospital Data
  • Libraries: scikit-learn, pandas, numpy, matplotlib, seaborn
  • Methodology: Time-series forecasting best practices from academic literature

Version History

v1.0.0 (2025-11-28)

  • ✅ Initial release
  • ✅ Random Forest model with hyperparameter tuning
  • ✅ 100+ features engineered
  • ✅ Q1 2024 predictions with 99% confidence intervals
  • ✅ Comprehensive documentation and visualizations

About

Predictive analytics for nursing workforce demand forecasting

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published