Complete geospatial data processing pipeline for an archaeological LiDAR segmentation ML competition.
A production-ready notebook that analyzes the dataset:
- Dataset inventory: Scans all 56 locations, catalogs 255 TIFFs and 294 GeoJSON files
- TIFF analysis: Metadata extraction (dimensions, bands, CRS, data types), file size distribution
- GeoJSON analysis: Category distribution, feature counts, geometry types, area statistics
- Visualizations: Interactive plots showing TIFFs overlaid with their annotations
- Data completeness: Checks which locations have complete data
Key insights:
- 56 archaeological sites with LiDAR data
- 4 TIFF types per location (с, сh, g, i - different LiDAR visualizations)
- ~5-6 annotation categories per site (городища "hillforts", дороги "roads", пашня "arable land", ямы "pits", etc.)
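The inventory step above can be sketched with stdlib `pathlib`. This is a minimal sketch, not the notebook's actual code; the assumption is that the data root contains one subdirectory per location holding `.tif`/`.tiff` and `.geojson` files, as described above:

```python
from pathlib import Path

def inventory(data_root: str) -> dict:
    """Catalog TIFF and GeoJSON files per location directory.

    Hypothetical helper: assumes one subdirectory per location
    under data_root, as the layout above suggests.
    """
    catalog = {}
    for loc in sorted(Path(data_root).iterdir()):
        if not loc.is_dir():
            continue
        tiffs = sorted(loc.rglob("*.tif")) + sorted(loc.rglob("*.tiff"))
        geojsons = sorted(loc.rglob("*.geojson"))
        catalog[loc.name] = {
            "n_tiffs": len(tiffs),
            "n_geojsons": len(geojsons),
            # "complete" here just means both file types are present
            "complete": bool(tiffs) and bool(geojsons),
        }
    return catalog
```

A catalog like this is enough to drive the completeness check: any location with `complete == False` is missing either imagery or annotations.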
Three powerful classes for working with TIFF and GeoJSON:
TiffProcessor: Read and process TIFF files
processor = TiffProcessor("path/to/file.tif")
data = processor.read(normalize=True) # Normalized to 0-1
crop = processor.extract_crop(x, y, crop_size=256)

GeoJsonRasterizer: Convert vector annotations to raster masks
rasterizer = GeoJsonRasterizer(geojson_files)
mask = rasterizer.rasterize(reference_tiff)  # Creates multi-class mask

CropExtractor: Extract training crops with aligned masks
extractor = CropExtractor(tiff_path, geojson_paths, crop_size=256)
crops = extractor.extract_all_crops(min_annotation_ratio=0.01)
# Returns: [(image, mask, metadata), ...]

Key features:
- Automatic CRS alignment between TIFF and GeoJSON
- Multi-class annotation support
- Sliding window extraction with overlap control
- Annotation density filtering
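The sliding-window extraction and annotation-density filtering above can be sketched as follows. This is an illustration of the idea behind `CropExtractor`, not its actual implementation; the real class also carries geo-metadata and handles CRS alignment:

```python
import numpy as np

def extract_crops(image: np.ndarray, mask: np.ndarray,
                  crop_size: int = 256, stride: int = 128,
                  min_annotation_ratio: float = 0.01):
    """Slide a window over an aligned image/mask pair and keep only
    crops whose mask has at least min_annotation_ratio annotated
    (non-zero) pixels."""
    crops = []
    h, w = image.shape[:2]
    for y in range(0, h - crop_size + 1, stride):
        for x in range(0, w - crop_size + 1, stride):
            m = mask[y:y + crop_size, x:x + crop_size]
            ratio = np.count_nonzero(m) / m.size
            if ratio >= min_annotation_ratio:
                crops.append((image[y:y + crop_size, x:x + crop_size], m,
                              {"x": x, "y": y, "annotation_ratio": ratio}))
    return crops
```

With `stride < crop_size` adjacent windows overlap, so the same annotated region appears in several crops; the ratio filter then discards mostly-empty background windows.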
Batch processing script to extract crops from all locations:
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline \
--crop-size 256 \
--stride 128

Output structure:
data/processed/baseline/
├── 002_ДЕМИДОВКА_FINAL/
│ ├── images/ # Image crops (PNG)
│ ├── masks/ # Mask crops (PNG with class IDs)
│ └── class_mapping.json # Category to class ID mapping
├── 003_ЛУБНО_FINAL/
│ └── ...
└── global_class_mapping.json # Unified mapping across all locations
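The unified `global_class_mapping.json` can be built by merging the per-location `class_mapping.json` files. The sketch below assumes each file maps category name to a local class ID, and assigns global IDs from sorted category names with 0 reserved for background; this merge rule is an assumption, not necessarily what `prepare_baseline_dataset` does:

```python
import json
from pathlib import Path

def build_global_mapping(location_mapping_files):
    """Merge per-location class_mapping.json files into one
    unified category -> global class ID mapping (0 = background)."""
    categories = set()
    for path in location_mapping_files:
        # Each file is assumed to map category name -> local class ID;
        # only the names matter for the global mapping.
        categories.update(json.loads(Path(path).read_text(encoding="utf-8")))
    return {"background": 0,
            **{name: i for i, name in enumerate(sorted(categories), start=1)}}
```

Sorting the union of names makes the ID assignment deterministic, so retraining after adding a location only shifts IDs when new categories appear.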
# Install dependencies
pip install rasterio geopandas matplotlib seaborn tqdm
# Run the EDA notebook
jupyter notebook notebooks/comprehensive_eda.ipynb

This will give you a complete understanding of your dataset.
# Test with a few locations first
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline_test \
--crop-size 256 \
--stride 128 \
--max-locations 5
# Process all locations
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline \
--crop-size 256 \
--stride 128

Parameters to tune:
- --crop-size: 128, 256, or 512 (balance spatial context vs. speed)
- --stride: crop_size/2 for 50% overlap (more training data)
- --min-annotation-ratio: 0.0 (keep every crop) to 0.1 (keep only crops that are at least 10% annotated)
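The stride/crop-size trade-off can be quantified: a sliding window places positions at 0, stride, 2*stride, ... while the crop still fits, so halving the stride roughly quadruples the crop count. A minimal sketch (the 4096-pixel tile size is a hypothetical example, not a property of this dataset):

```python
def crops_per_axis(extent: int, crop_size: int, stride: int) -> int:
    """Number of sliding-window positions along one axis."""
    if extent < crop_size:
        return 0
    return (extent - crop_size) // stride + 1

# For a hypothetical 4096x4096 tile:
full = crops_per_axis(4096, 256, 256) ** 2   # no overlap -> 256 crops
half = crops_per_axis(4096, 256, 128) ** 2   # 50% overlap -> 961 crops
```

So `--stride 128` yields roughly 3.75x the crops of a non-overlapping grid, at the cost of correlated training samples.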
After crops are extracted, update config and train:
# Check how many classes you have
cat data/processed/baseline/global_class_mapping.json
# Train (adjust num_classes based on mapping)
python -m src.RaijinNetto.train \
data.image_dir=data/processed/baseline/*/images \
data.mask_dir=data/processed/baseline/*/masks \
model.num_classes=8 \
trainer.max_epochs=50

# 1. Explore data
jupyter notebook notebooks/comprehensive_eda.ipynb
# 2. Extract crops (test first!)
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline_test \
--max-locations 3
# 3. Check results
ls -la data/processed/baseline_test/*/images | head
cat data/processed/baseline_test/global_class_mapping.json
# 4. If satisfied, process all
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline
# 5. Train baseline
python -m src.RaijinNetto.train \
data.image_dir=data/processed/baseline/*/images \
data.mask_dir=data/processed/baseline/*/masks \
model.num_classes=8 \
trainer.max_epochs=50 \
data.batch_size=8
# 6. Monitor training
tensorboard --logdir logs/