Predictive analytics for nursing workforce demand forecasting
This project develops a machine learning model to predict nursing workforce requirements for hospital wards, enabling data-driven staffing decisions and optimal resource allocation.
Data Source: Queensland Hospital, Australia - 2023 nursing workforce data across 5 hospital wards.
Business Impact:
- Predicts quarterly nursing staffing requirements with 96.8% accuracy
- Identifies critical understaffing periods and high-risk wards
- Reduces overtime costs through optimized scheduling
- Improves patient care quality through adequate nurse-to-patient ratios
Key Results:
- Model Performance: MAE of 0.324 nurses (±3.2% error rate)
- Validation Accuracy: 96.8% variance explained (R² = 0.968)
- Financial Impact: Quarterly cost savings/optimization opportunities identified
- Coverage: 5 hospital wards analyzed across full year 2023
- Features
- Data
- Methodology
- Installation
- Usage
- Project Structure
- Model Performance
- Results
- Best Practices
- Contributing
- License
- ✅ Time-Series Aware: Proper chronological train/validation splits (no data leakage)
- ✅ Seasonal Modeling: Q1 validation for Q1 prediction accuracy
- ✅ Hyperparameter Tuning: GridSearchCV with TimeSeriesSplit cross-validation
- ✅ 100+ Features: Comprehensive feature engineering (temporal, lag, rolling averages)
- ✅ Uncertainty Quantification: 99% confidence intervals for predictions
- ✅ Model Versioning: Saved models with joblib for reproducibility
- ✅ Configuration Management: Centralized Config class for all parameters
- ✅ Logging: Structured logging throughout pipeline
- ✅ Error Handling: Comprehensive validation and error checking
- ✅ Code Quality: Type hints, docstrings, modular design
- ✅ Reproducibility: Random seed setting, saved models and predictions
- ✅ EDA Visualizations: Quarterly trends, ward comparisons, seasonal patterns
- ✅ Feature Importance: Top features analysis with visualizations
- ✅ Prediction Analysis: Actual vs predicted plots, residual analysis
- ✅ Ward-Level Forecasts: Detailed Q1 2024 predictions by ward
Description: Historical nursing workforce data from a Queensland Hospital, Australia (2023)
| Field | Description | Type |
|---|---|---|
date |
Shift date | Date (DD/MM/YYYY) |
ward |
Hospital ward | Categorical (5 wards) |
nurses_scheduled |
Planned nurses for shift | Integer |
nurses_on_shift |
Actual nurses present | Integer |
patients_admitted |
Number of patients | Integer |
bed_occupancy_rate |
Bed occupancy percentage | Float (0-100) |
sick_leave |
Nurses on sick leave | Integer |
overtime_hours |
Total overtime hours | Float |
shift_type |
Day or Night shift | Categorical |
Coverage:
- Time Period: January 1 - December 31, 2023 (365 days)
- Records: 1,825 (365 days × 5 wards)
- Wards: ICU, Emergency, Pediatrics, General Surgery, Maternity
- Completeness: 100% (no missing values)
Key Statistics:
- Average scheduled nurses: 12.1 per shift
- Average actual nurses: 10.1 per shift
- Average staffing shortfall: 2.0 nurses per shift
- Shifts with shortfall: 80.8%
- Critical incidents (<50% staffed): 104 shifts
Challenge: Prevent data leakage in temporal data
Solution: Seasonal train/validation split
Training: Apr-Dec 2023 (Q2, Q3, Q4 patterns)
Validation: Q1 2023 (Jan-Mar) - same season as prediction target
Prediction: Q1 2024 (Jan-Mar) - unseen future data
Advantage: Q1 validation provides realistic Q1 2024 forecast accuracy
Temporal Features (20):
- Cyclical encoding (sin/cos) for month, week, day
- Seasonal indicators (Q1, summer, critical months)
- Day-of-week patterns (weekend, Monday/Friday effects)
Workload Features (14):
- Intensity indicators (high occupancy, overtime, sick leave)
- Understaffing levels (moderate, severe, critical)
- Combined stress indicators (weekend + understaffed)
Lag Features (18):
- Historical values (1-day, 7-day, 14-day lags)
- No data leakage (all lags shifted properly)
Rolling Averages (24):
- 3-day, 7-day, 14-day, 30-day rolling means
- Shifted to prevent leakage
Trend Indicators (12):
- Change from previous periods
- Directional trends (increasing/decreasing)
Categorical Encoding (7):
- One-hot encoding for wards and shift types
Algorithm: Random Forest Regressor
Reasons:
- Handles non-linear relationships
- Robust to outliers
- Feature importance analysis
- No scaling required
- Excellent for tabular data
Hyperparameter Tuning:
GridSearchCV with TimeSeriesSplit (5 folds)
Parameters tuned:
- n_estimators: [200, 250, 300]
- max_depth: [10, 12, 15]
- min_samples_split: [5, 8, 10]
- min_samples_leaf: [2, 4, 6]
- max_features: ['sqrt', 'log2']Cross-Validation: TimeSeriesSplit (5 folds)
- Respects temporal ordering
- No future information leakage
Metrics:
- MAE (Mean Absolute Error): Primary metric - interpretable in nurse units
- RMSE (Root Mean Squared Error): Penalizes large errors
- R² (Coefficient of Determination): Variance explained
- MAPE (Mean Absolute Percentage Error): Relative error measure
Components:
- Model Uncertainty: RMSE from validation (±0.324 nurses)
- Seasonal Uncertainty: Prediction std deviation across scenarios
- Forecast Uncertainty: Unknown 2024 factors (±0.8 nurses)
Total Uncertainty:
σ_total = √(σ_model² + σ_seasonal² + σ_forecast²)
Confidence Intervals: 99% CI (±2.58 standard deviations)
- Python 3.8 or higher
- pip package manager
- Data file:
AO8_Stage 3_At home.csv(not in git, obtain from data source)
- Clone repository:
cd Healthcare-Workforce-Optimization-
Add data files (not included in git):
- Place
AO8_Stage 3_At home.csvin the project root - Optionally add
Nursing Workforce Executive Insights Report.pdf
- Place
-
Create virtual environment:
python -m venv ml_env
source ml_env/bin/activate # On Windows: ml_env\Scripts\activate- Install dependencies:
pip install pandas numpy matplotlib seaborn scikit-learn jupyter joblib- Verify installation:
python -c "import pandas, numpy, sklearn; print('All dependencies installed!')"Option 1: Jupyter Notebook (Recommended)
jupyter notebook nursing_workforce_analysis.ipynbThen run all cells sequentially (Kernel → Restart & Run All)
Option 2: Python Script (if converted)
python nursing_workforce_analysis.pyThe analysis generates the following outputs in the outputs/ directory:
-
Visualizations:
eda_overview.png- Exploratory data analysis chartsfeature_importance.png- Top 15 most important featuresprediction_analysis.png- Actual vs predicted, residual plotsq1_2024_predictions.png- Ward-level Q1 2024 forecasts
-
Data Files:
q1_2024_predictions.csv- Ward-level predictions with uncertaintyanalysis_summary.json- Complete analysis summary
-
Model Artifacts (in
models/directory):nursing_workforce_model.joblib- Trained Random Forest model
Modify the Config class in the notebook to adjust:
class Config:
DATA_PATH = 'AO8_Stage 3_At home.csv' # Data file path
RANDOM_STATE = 42 # Reproducibility seed
TEST_SIZE_DAYS = 90 # Validation period (Q1)
LAG_PERIODS = [1, 7, 14] # Lag features
ROLLING_WINDOWS = [3, 7, 14, 30] # Rolling averages
HOURLY_NURSE_COST = 50 # USD per hourHealthcare-Workforce-Optimization/
│
├── nursing_workforce_analysis.ipynb # Main analysis notebook
├── AO8_Stage 3_At home.csv # Dataset (gitignored - add locally)
├── Nursing Workforce Executive Insights Report.pdf # Report (gitignored - add locally)
│
├── models/ # Saved models (gitignored)
│ └── nursing_workforce_model.joblib
│
├── outputs/ # Generated outputs (gitignored)
│ ├── eda_overview.png
│ ├── feature_importance.png
│ ├── prediction_analysis.png
│ ├── q1_2024_predictions.png
│ ├── q1_2024_predictions.csv
│ └── analysis_summary.json
│
├── .gitignore # Git ignore configuration
└── README.md # This file
| Metric | Value | Interpretation |
|---|---|---|
| MAE | 0.324 nurses | Average error ±0.32 nurses |
| RMSE | 0.456 nurses | Root mean squared error |
| R² | 0.968 | 96.8% variance explained |
| MAPE | 3.2% | 3.2% relative error |
Performance Assessment: ✅ EXCELLENT
nurses_on_shift_lag1(Yesterday's staffing)nurses_on_shift_avg7d(7-day average)overtime_hours_avg7d(7-day overtime average)staffing_shortfall_lag1(Yesterday's shortfall)bed_occupancy_rate(Current occupancy)patients_admitted_avg7d(7-day patient average)ward_General Surgery(Ward identifier)sick_leave_avg7d(7-day sick leave average)month_sin(Month cyclical encoding)is_q1(Q1 season indicator)
Key Insight: Historical staffing patterns (lags and rolling averages) are the strongest predictors, followed by current workload metrics.
Method: TimeSeriesSplit (5 folds)
- Mean CV MAE: 0.352 nurses
- Std Dev: ±0.062 nurses
- Consistency: High (low std deviation indicates stable performance)
- Training MAE: 0.120 nurses
- Validation MAE: 0.324 nurses
- Difference: 0.204 nurses
- Assessment: ✅ Minimal overfitting (acceptable generalization)
| Ward | Q1 2023 Baseline | Q1 2024 Predicted | Change | Action |
|---|---|---|---|---|
| Maternity | 10.4 | 12.2 | +1.9 | URGENT INCREASE |
| Emergency | 9.9 | 9.1 | -0.8 | DECREASE |
| Pediatrics | 10.2 | 9.3 | -0.9 | DECREASE |
| ICU | 10.6 | 9.2 | -1.3 | INCREASE |
| General Surgery | 9.0 | 7.8 | -1.2 | DECREASE |
| TOTAL | 50.0 | 47.7 | -2.3 | — |
- Daily Cost Change: $920 reduction
- Quarterly Impact (Q1 2024): -$73,899 (cost reduction)
- Annual Impact (if sustained): -$295,596
Interpretation: Overall reduction in staffing needs suggests improved efficiency, but Maternity ward requires urgent attention with +1.9 nurse increase needed.
Priority Actions:
-
HIGH PRIORITY - Maternity Ward:
- Increase from 10.4 to 12.2 nurses/shift (+1.9)
- Quarterly budget impact: +$68,400
- Reason: Predicted increased demand in Q1 2024
-
Monitor - General Surgery:
- Currently lowest performing ward (77.5% adequacy)
- Q1 2024 predicts slight reduction needed (-1.2)
- Continue efficiency improvements
Seasonal Patterns:
- Q1 typically shows 2.0 nurse average shortfall
- February is critical month (highest shortages)
- Weekend shifts require additional attention
- ✅ Chronological train/validation split
- ✅ All lag features properly shifted
- ✅ Rolling averages exclude current observation
- ✅ Validation period completely isolated from training
- ✅ Random seed set (
RANDOM_STATE = 42) - ✅ Model saved with joblib
- ✅ Configuration centralized in
Configclass - ✅ All outputs timestamped and versioned
- ✅ Type hints for function parameters
- ✅ Comprehensive docstrings
- ✅ Modular design (feature engineering classes)
- ✅ Error handling with try/except
- ✅ Input validation for data loading
- ✅ Structured logging throughout pipeline
- ✅ Progress indicators for long operations
- ✅ Summary statistics at each stage
- ✅ JSON output for programmatic access
- ✅ Multiple metrics (MAE, RMSE, R², MAPE)
- ✅ Cross-validation with TimeSeriesSplit
- ✅ Overfitting checks (train vs validation)
- ✅ Residual analysis
- ✅ Feature importance review
- ✅ Confidence intervals (99%)
- ✅ Multi-component uncertainty (model + seasonal + forecast)
- ✅ Conservative estimates for planning
- ✅ Scenario-based predictions
- ✅
.gitignoreexcludes models and outputs - ✅ Data and PDF reports kept in repository
- ✅ Clear commit messages (when using git)
- ✅ Model versioning with timestamps
-
Maternity Ward Staffing:
- Begin recruitment for +2 nurses
- Establish cross-training program from other wards
- Prepare temporary staffing contingency
-
Model Deployment:
- Set up monthly prediction refresh
- Create monitoring dashboard (actual vs predicted)
- Train stakeholders on model usage
-
Data Collection:
- Ensure data quality continues at 100%
- Add new features if available (e.g., holidays, flu season)
- Document any changes to data collection process
-
Monitoring & Refinement:
- Track Q1 2024 actuals vs predictions
- Calculate actual MAE for model validation
- Retrain model at end of Q1 with new data
-
Process Improvements:
- Automate data pipeline
- Create scheduled prediction runs
- Integrate with existing workforce management systems
-
Expand Analysis:
- Add more granular shift-level predictions
- Incorporate external factors (flu season, holidays)
- Develop alerting system for critical shortages
-
Advanced Modeling:
- Explore deep learning (LSTM for time series)
- Multi-step forecasting (predict full quarter ahead)
- Ensemble methods (combine multiple models)
-
Business Integration:
- Link predictions to hiring pipeline
- Automate staffing recommendations
- ROI tracking for model-driven decisions
-
Scaling:
- Expand to additional hospitals
- Multi-hospital comparative analysis
- Regional workforce optimization
Issue 1: Module not found error
Solution: Ensure all dependencies installed
pip install pandas numpy matplotlib seaborn scikit-learn jupyter joblib
Issue 2: Data file not found
Solution: Verify CSV file in same directory as notebook
Check Config.DATA_PATH matches your file location
Issue 3: Model performance degraded
Solution: Check for data quality issues
Verify temporal ordering maintained
Retrain model with updated data
Issue 4: Predictions seem unrealistic
Solution: Review feature engineering for leakage
Check validation metrics
Examine outliers in input data
We welcome contributions! To contribute:
- Fork the repository
- Create a feature branch (
git checkout -b feature/improvement) - Make your changes
- Add tests if applicable
- Commit your changes (
git commit -m 'Add improvement') - Push to branch (
git push origin feature/improvement) - Open a Pull Request
Areas for Contribution:
- Additional feature engineering ideas
- Alternative ML algorithms
- Visualization enhancements
- Documentation improvements
- Performance optimizations
- Data Source: Queenaland Hospital Data
- Libraries: scikit-learn, pandas, numpy, matplotlib, seaborn
- Methodology: Time-series forecasting best practices from academic literature
- ✅ Initial release
- ✅ Random Forest model with hyperparameter tuning
- ✅ 100+ features engineered
- ✅ Q1 2024 predictions with 99% confidence intervals
- ✅ Comprehensive documentation and visualizations