L0thlorien/ds-expedition

# DS expedition ML competition

A complete geospatial data processing pipeline for an archaeological LiDAR segmentation ML competition:

## 1. Comprehensive EDA Notebook (`notebooks/comprehensive_eda.ipynb`)

A polished notebook that analyzes the dataset end to end:

- **Dataset inventory:** scans all 56 locations, catalogs 255 TIFFs and 294 GeoJSON files
- **TIFF analysis:** metadata extraction (dimensions, bands, CRS, data types) and file-size distribution
- **GeoJSON analysis:** category distribution, feature counts, geometry types, area statistics
- **Visualizations:** interactive plots showing TIFFs overlaid with annotations
- **Data completeness:** checks which locations have complete data
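
The inventory step can be sketched as a plain directory scan; `inventory` and the `.tif`/`.geojson` extensions here are illustrative assumptions, not the notebook's actual code:

```python
from pathlib import Path

def inventory(train_root: str) -> dict:
    """Count TIFF and GeoJSON files in each location directory."""
    counts = {}
    for loc in sorted(Path(train_root).iterdir()):
        if not loc.is_dir():
            continue
        tiffs = list(loc.glob("*.tif")) + list(loc.glob("*.tiff"))
        geojsons = list(loc.glob("*.geojson"))
        counts[loc.name] = {"tiffs": len(tiffs), "geojsons": len(geojsons)}
    return counts
```

Summing the per-location counts is enough to reproduce the totals above and to spot locations with missing files.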

Key insights:

- 56 archaeological sites with LiDAR data
- 4 TIFF types per location (`с`, `сh`, `g`, `i`; different LiDAR visualizations)
- ~5-6 annotation categories per site, e.g. городища (hillforts), дороги (roads), пашня (plowland), ямы (pits)

## 2. Geospatial Processing Utilities (`src/RaijinNetto/geospatial_utils.py`)

Three classes for working with TIFFs and GeoJSONs:

**`TiffProcessor`**: read and process TIFF files

```python
processor = TiffProcessor("path/to/file.tif")
data = processor.read(normalize=True)  # normalized to 0-1
crop = processor.extract_crop(x, y, crop_size=256)
```
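
A plausible reading of `read(normalize=True)` is per-band min-max scaling to [0, 1]; the pure-NumPy sketch below is an assumption about that behavior, and `normalize_band` is a hypothetical helper (the real `TiffProcessor` may normalize differently, e.g. with percentile clipping):

```python
import numpy as np

def normalize_band(band: np.ndarray, nodata=None) -> np.ndarray:
    """Min-max scale a raster band to [0, 1], ignoring nodata pixels."""
    band = band.astype(np.float32)
    valid = band != nodata if nodata is not None else np.isfinite(band)
    lo, hi = band[valid].min(), band[valid].max()
    if hi == lo:
        return np.zeros_like(band)
    out = (band - lo) / (hi - lo)
    out[~valid] = 0.0  # zero out nodata pixels after scaling
    return out
```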

**`GeoJsonRasterizer`**: convert vector annotations to raster masks

```python
rasterizer = GeoJsonRasterizer(geojson_files)
mask = rasterizer.rasterize(reference_tiff)  # creates a multi-class mask
```

**`CropExtractor`**: extract training crops with aligned masks

```python
extractor = CropExtractor(tiff_path, geojson_paths, crop_size=256)
crops = extractor.extract_all_crops(min_annotation_ratio=0.01)
# Returns: [(image, mask, metadata), ...]
```

Key features:

- Automatic CRS alignment between TIFF and GeoJSON
- Multi-class annotation support
- Sliding-window extraction with overlap control
- Annotation-density filtering
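
Sliding-window extraction with overlap control and density filtering could look like the sketch below; `iter_crops` is a hypothetical stand-in, not the actual `CropExtractor` internals:

```python
import numpy as np

def iter_crops(image, mask, crop_size=256, stride=128, min_annotation_ratio=0.01):
    """Yield (image_crop, mask_crop, (y, x)) windows whose annotated
    fraction (nonzero mask pixels) meets the threshold."""
    h, w = image.shape[:2]
    for y in range(0, h - crop_size + 1, stride):
        for x in range(0, w - crop_size + 1, stride):
            m = mask[y:y + crop_size, x:x + crop_size]
            if (m > 0).mean() >= min_annotation_ratio:
                yield image[y:y + crop_size, x:x + crop_size], m, (y, x)
```

With `stride < crop_size`, adjacent windows overlap, trading disk space for more training samples; the ratio filter drops windows that are almost entirely background.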

## 3. Baseline Dataset Preparation Script (`src/RaijinNetto/prepare_baseline_dataset.py`)

Batch processing script to extract crops from all locations:

```shell
python -m src.RaijinNetto.prepare_baseline_dataset \
    --data-root /mnt/e/kaggle-data/train \
    --output-dir data/processed/baseline \
    --crop-size 256 \
    --stride 128
```

Output structure:

```text
data/processed/baseline/
├── 002_ДЕМИДОВКА_FINAL/
│   ├── images/            # Image crops (PNG)
│   ├── masks/             # Mask crops (PNG with class IDs)
│   └── class_mapping.json # Category-to-class-ID mapping
├── 003_ЛУБНО_FINAL/
│   └── ...
└── global_class_mapping.json  # Unified mapping across all locations
```
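
Assuming each `class_mapping.json` is a flat `{category: class_id}` object (an assumption about the schema), a unified mapping can be built by merging the per-location files and re-assigning sequential IDs:

```python
import json
from pathlib import Path

def build_global_mapping(output_dir: str) -> dict:
    """Merge per-location class_mapping.json files into one
    {category: class_id} dict with stable sequential IDs."""
    categories = set()
    for path in sorted(Path(output_dir).glob("*/class_mapping.json")):
        categories.update(json.loads(path.read_text(encoding="utf-8")))
    # 0 is reserved for background; categories get 1..N
    return {cat: i + 1 for i, cat in enumerate(sorted(categories))}
```

Sorting the category names makes the IDs deterministic across reruns, which matters once masks have been written with those IDs baked in.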

## How to Use

### Step 1: Explore the Data

```shell
# Install dependencies
pip install rasterio geopandas matplotlib seaborn tqdm

# Run the EDA notebook
jupyter notebook notebooks/comprehensive_eda.ipynb
```

This will give you a complete understanding of your dataset.

### Step 2: Extract Training Crops

```shell
# Test with a few locations first
python -m src.RaijinNetto.prepare_baseline_dataset \
    --data-root /mnt/e/kaggle-data/train \
    --output-dir data/processed/baseline_test \
    --crop-size 256 \
    --stride 128 \
    --max-locations 5

# Process all locations
python -m src.RaijinNetto.prepare_baseline_dataset \
    --data-root /mnt/e/kaggle-data/train \
    --output-dir data/processed/baseline \
    --crop-size 256 \
    --stride 128
```

Parameters to tune:

- `--crop-size`: 128, 256, or 512 (balance context vs. speed)
- `--stride`: set to half the crop size for 50% overlap (more data)
- `--min-annotation-ratio`: 0.0 (keep all crops) to 0.1 (keep crops at least 10% annotated)
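
As a sanity check when tuning, the number of window positions along one axis is `(dim - crop_size) // stride + 1`; for example, a 2000-px axis with crop size 256 and stride 128 yields 14 positions. A tiny helper (illustrative, not part of the script):

```python
def crops_per_axis(dim: int, crop_size: int = 256, stride: int = 128) -> int:
    """Number of sliding-window positions along one raster axis."""
    if dim < crop_size:
        return 0
    return (dim - crop_size) // stride + 1
```

Total crops per raster are the product of the two axis counts, so halving the stride roughly quadruples the output size.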

### Step 3: Train Baseline Model

After the crops are extracted, update the config and train:

```shell
# Check how many classes you have
cat data/processed/baseline/global_class_mapping.json

# Train (adjust num_classes based on the mapping)
python -m src.RaijinNetto.train \
    data.image_dir=data/processed/baseline/*/images \
    data.mask_dir=data/processed/baseline/*/masks \
    model.num_classes=8 \
    trainer.max_epochs=50
```
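
The `*/images` and `*/masks` globs suggest the loader pairs image and mask crops by filename across location folders; a sketch of that pairing (`pair_crops` is hypothetical, and the PNG naming is assumed):

```python
from pathlib import Path

def pair_crops(baseline_dir: str):
    """Return (image_path, mask_path) pairs matched by filename
    across all per-location images/ and masks/ folders."""
    pairs = []
    for img in sorted(Path(baseline_dir).glob("*/images/*.png")):
        mask = img.parent.parent / "masks" / img.name
        if mask.exists():  # skip crops whose mask was not written
            pairs.append((img, mask))
    return pairs
```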

## Example Workflow

```shell
# 1. Explore data
jupyter notebook notebooks/comprehensive_eda.ipynb

# 2. Extract crops (test first!)
python -m src.RaijinNetto.prepare_baseline_dataset \
    --data-root /mnt/e/kaggle-data/train \
    --output-dir data/processed/baseline_test \
    --max-locations 3

# 3. Check results
ls -la data/processed/baseline_test/*/images | head
cat data/processed/baseline_test/global_class_mapping.json

# 4. If satisfied, process all
python -m src.RaijinNetto.prepare_baseline_dataset \
    --data-root /mnt/e/kaggle-data/train \
    --output-dir data/processed/baseline

# 5. Train baseline
python -m src.RaijinNetto.train \
    data.image_dir=data/processed/baseline/*/images \
    data.mask_dir=data/processed/baseline/*/masks \
    model.num_classes=8 \
    trainer.max_epochs=50 \
    data.batch_size=8

# 6. Monitor training
tensorboard --logdir logs/
```
