Algae Bloom Prediction

Predictive machine learning system for detecting and forecasting harmful algal blooms (HABs) using geospatial data, satellite imagery, and ensemble learning techniques.

Overview

This project implements a machine learning pipeline for early detection and prediction of algal blooms in aquatic environments. It combines multi-temporal satellite imagery with environmental features and ensemble learning methods to identify conditions that lead to harmful algal bloom events.

The project extends the second-place solution from the DrivenData TickTickBloom competition, implementing improvements in feature engineering, model ensembling, and geospatial analysis. This work was developed for the GeoAI course (AG2418) at KTH Royal Institute of Technology.

Key Features

Multi-Source Data Integration: Combines satellite imagery from Planetary Computer, environmental sensor data, and geospatial indices
Advanced Feature Engineering: Implements domain-specific geospatial features and temporal analysis
Ensemble Learning: Leverages multiple gradient boosting algorithms (CatBoost, XGBoost, LightGBM) for robust predictions
Hyperparameter Optimization: Automated tuning of model parameters with Optuna
Geospatial Analysis: Full support for geographic data using GeoPandas and spatial indexing
Production-Ready Pipeline: Modular code structure with data preparation, feature engineering, and model inference stages

Technology Stack

Machine Learning & Data Science:

CatBoost, XGBoost, LightGBM
scikit-learn
Optuna (hyperparameter optimization)
Pandas, NumPy, SciPy

Geospatial & Satellite Imagery:

GeoPandas
Planetary Computer
PySTAC Client
rioxarray, ODC-STAC
OpenCV, Pillow

Visualization & Analysis:

Matplotlib, Seaborn
Jupyter Notebook

Prerequisites

Python 3.9 or later
Anaconda or Miniconda
8GB+ RAM (16GB+ recommended for full pipeline)
Internet connection for downloading satellite data

Quick Start

1. Set up Python Environment

# Create conda environment with GeoPandas (installed first due to complex dependencies)
conda create --name bloom python=3.9 pip geopandas
conda activate bloom

# Install remaining dependencies
pip install -r requirements.txt

# Install Jupyter kernel for notebooks
ipython kernel install --name "bloom_jpy" --user
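
After installation, a quick check that the main libraries can be located may save a failed run later. The snippet below is a minimal sketch: the import names in EXPECTED are assumptions derived from the Technology Stack list, the helper missing_packages is hypothetical, and requirements.txt remains the authoritative list.

```python
import importlib.util

# Import names are assumptions based on the Technology Stack section;
# requirements.txt is the authoritative source.
EXPECTED = [
    "geopandas", "catboost", "xgboost", "lightgbm", "optuna",
    "sklearn", "pystac_client", "planetary_computer", "rioxarray",
]


def missing_packages(names):
    """Return the names that cannot be located in the active environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for broken parent packages or modules without a spec.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(EXPECTED)
    print("All packages found" if not absent else f"Missing: {absent}")
```

The check only locates each package without importing it, so it runs quickly even for heavy geospatial libraries.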

2. Prepare Data

# Run from repository root directory
# This script downloads satellite data and prepares the SQLite database
python main_prepdata.py

3. Train and Evaluate Models

# Train ensemble models and generate predictions
python extension.py

This generates:

Trained models in ./models directory (organized by date)
Analysis figures in ./figures directory
Model performance metrics and visualizations

Usage Guide

Data Preparation

The main_prepdata.py script:

Downloads multi-temporal Sentinel-2 satellite imagery via Planetary Computer
Extracts spectral indices (NDVI, NDWI, etc.)
Aggregates environmental features at target locations
Constructs a SQLite database for efficient data access
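
The spectral indices mentioned above are normalized band differences. The sketch below shows the standard NDVI and NDWI (McFeeters) formulas with NumPy; it is illustrative only and not taken from main_prepdata.py. The function names and the toy reflectance values are assumptions. For Sentinel-2, red, green, and near-infrared correspond to bands B04, B03, and B08.

```python
import numpy as np


def normalized_difference(a, b):
    """Compute (a - b) / (a + b), giving NaN where the denominator is zero."""
    a = np.asarray(a, dtype="float64")
    b = np.asarray(b, dtype="float64")
    denom = a + b
    out = np.full(denom.shape, np.nan)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    return normalized_difference(green, nir)


# Toy 2x2 reflectance patches (made-up values, not real imagery).
green = np.array([[0.10, 0.08], [0.00, 0.12]])
red = np.array([[0.06, 0.05], [0.00, 0.09]])
nir = np.array([[0.30, 0.04], [0.00, 0.02]])

print(ndvi(nir, red))
print(ndwi(green, nir))
```

Pixels where both bands are zero (for example, masked no-data areas) come out as NaN rather than raising a division warning.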

Model Training

The extension.py script:

Loads preprocessed data from SQLite database
Applies feature engineering transformations
Trains ensemble models using optimized hyperparameters
Generates predictions on test datasets
Produces evaluation metrics and visualization plots
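
How extension.py combines the individual models is not described here, so the following is only a sketch of one common approach: a weighted average of per-model predictions, scored with RMSE. The equal default weights, the function names, and the toy predictions are all assumptions.

```python
from math import sqrt
from statistics import fmean


def average_ensemble(predictions, weights=None):
    """Combine per-model prediction lists into one weighted-average list."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models  # assumption: equal weights by default
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)
    ]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


# Made-up predictions standing in for CatBoost, XGBoost, and LightGBM output.
catboost_pred = [1.0, 2.0, 4.0]
xgboost_pred = [2.0, 2.0, 3.0]
lightgbm_pred = [3.0, 2.0, 5.0]

blended = average_ensemble([catboost_pred, xgboost_pred, lightgbm_pred])
print(blended, rmse([2.0, 2.0, 4.0], blended))
```

Averaging tends to reduce the variance of any single model, which is the usual motivation for blending several gradient boosting libraries.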

Hyperparameter Tuning

For custom hyperparameter optimization:

# Run hyperparameter tuning experiments
python main_hypertune.py > hypertune_results.txt

Results provide guidance for model selection and parameter ranges.

Modeling Strategy

Detailed information about the modeling approach, feature engineering decisions, and experimental results is available in:

model_strategy.ipynb

This Jupyter notebook documents:

Exploratory data analysis
Feature importance analysis
Model comparison and validation
Final ensemble configuration

Project Architecture

.
├── main_prepdata.py           # Data download and preparation pipeline
├── extension.py               # Main model training and evaluation script
├── main_hypertune.py          # Hyperparameter optimization experiments
├── model_strategy.ipynb       # Detailed analysis and methodology documentation
├── src/
│   ├── feat.py               # Feature engineering utilities
│   └── mod.py                # Model definitions and training utilities
├── data/                      # Raw and processed datasets
├── models/                    # Trained model artifacts (organized by timestamp)
├── figures/                   # Generated analysis plots and visualizations
└── requirements.txt           # Python package dependencies

Configuration

Key configuration options in the main scripts:

Python Version: Python 3.9 (specified in conda environment creation)
Data Directory: ./data (contains raw and processed datasets)
Model Output: ./models/ (timestamped subdirectories)
Figure Output: ./figures/ (analysis and validation plots)

Environment variables are not currently required. All configuration is embedded in script parameters.
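
The timestamped model subdirectories can be pictured with a small helper like the one below. This is a sketch only: the helper name make_run_dir and the YYYY-MM-DD_HHMMSS naming format are assumptions, and the scripts may name their output folders differently.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(base="models", now=None):
    """Create and return base/<YYYY-MM-DD_HHMMSS>/ for one training run."""
    now = now or datetime.now()
    run_dir = Path(base) / now.strftime("%Y-%m-%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# Example (creates a new folder under ./models when called):
# run_dir = make_run_dir()
# model_path = run_dir / "catboost.cbm"  # hypothetical file name
```

Because the folder name sorts chronologically, the most recent run is simply the last entry in a sorted directory listing.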

Security Features

Data Validation: Input data is validated before processing to prevent malformed datasets
Safe File Operations: All file paths use secure methods to prevent directory traversal
Dependency Verification: All dependencies are pinned in requirements.txt
No Sensitive Data: No API keys, credentials, or sensitive information in repository

See LICENSE for terms of use.

Performance Optimizations

Efficient Data Loading: SQLite database optimized for sequential and random access patterns
Vectorized Operations: NumPy and Pandas vectorization throughout feature engineering
Memory-Efficient Satellite Processing: Chunked processing of large raster data using rioxarray
Parallel Training: XGBoost and LightGBM leverage multi-core processors
Feature Caching: Preprocessed features cached to avoid recomputation
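
As an illustration of SQLite-backed loading and feature caching, the sketch below stores computed features once and reads them back by key. The table name features, its columns, and both helper functions are hypothetical; the actual schema built by main_prepdata.py is not documented here.

```python
import sqlite3


def save_features(conn, rows):
    """Store (uid, ndvi, ndwi) rows, replacing any previously cached values."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features ("
        "uid TEXT PRIMARY KEY, ndvi REAL, ndwi REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO features (uid, ndvi, ndwi) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def load_features(conn, uids):
    """Return cached features for the requested uids as {uid: (ndvi, ndwi)}."""
    if not uids:
        return {}
    placeholders = ",".join("?" * len(uids))
    cur = conn.execute(
        f"SELECT uid, ndvi, ndwi FROM features WHERE uid IN ({placeholders})",
        list(uids),
    )
    return {uid: (ndvi, ndwi) for uid, ndvi, ndwi in cur}


# In-memory database for demonstration; a file path would persist the cache.
conn = sqlite3.connect(":memory:")
save_features(conn, [("a1", 0.42, -0.10), ("b2", 0.15, 0.30)])
print(load_features(conn, ["a1"]))
```

The primary key on uid gives indexed lookups, so fetching a handful of samples does not require scanning the whole table.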

Model Performance

The ensemble combines three gradient boosting algorithms (CatBoost, XGBoost, LightGBM) and is designed for:

Robust predictions across diverse environmental conditions
High sensitivity to early bloom indicators
Generalization to geographic regions outside training data

Performance metrics and detailed validation results are available in the generated analysis figures and the model_strategy.ipynb notebook.

Reproducibility

To ensure reproducibility:

All random seeds are fixed in training scripts
Python version and exact dependencies are specified
Data downloads use deterministic sources (Planetary Computer)
Generated models are timestamped for versioning
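
To show what fixing a seed means in practice, the sketch below performs a train/test split that gives the same result on every run. The seed value 42 and the helper split_ids are assumptions for illustration, not the values used in the training scripts.

```python
import random

SEED = 42  # hypothetical value; the scripts define their own seeds


def split_ids(ids, test_fraction=0.2, seed=SEED):
    """Shuffle ids with a private seeded generator and split into train/test."""
    rng = random.Random(seed)  # private generator, global state is untouched
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_ids(range(10))
print(train_ids, test_ids)

# The same seed can also be passed to the model libraries, for example
# random_seed in CatBoost or random_state in the scikit-learn style
# wrappers of XGBoost and LightGBM.
```

Using a private random.Random instance instead of the module-level functions keeps the split stable even if other code draws random numbers first.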

Contributors

Nils Olivier (nolivier@kth.se), KTH Royal Institute of Technology: extension and implementation
Andy Wheeler (apwheele@gmail.com): original competition solution and methodology (2nd place solution, TickTickBloom competition)

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

DrivenData for the TickTickBloom competition and challenge dataset
KTH Royal Institute of Technology for supporting this research
Planetary Computer for providing satellite imagery and computational resources
The open-source community for excellent geospatial and machine learning libraries
