Complete geospatial data processing pipeline for an archaeological LiDAR segmentation ML competition.
A production-ready notebook that analyzes the dataset:
- Dataset inventory: Scans all 56 locations, catalogs 255 TIFFs and 294 GeoJSON files
- TIFF analysis: Metadata extraction (dimensions, bands, CRS, data types), file size distribution
- GeoJSON analysis: Category distribution, feature counts, geometry types, area statistics
- Visualizations: Interactive plots showing TIFFs overlaid with their annotations
- Data completeness: Checks which locations have complete data
Key insights:
- 56 archaeological sites with LiDAR data
- 4 TIFF types per location (с, сh, g, i - different LiDAR visualizations)
- ~5-6 annotation categories per site (городища "hillforts", дороги "roads", пашня "arable land", ямы "pits", etc.)
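The inventory step above can be sketched with stdlib `pathlib`. This is a minimal sketch, not the notebook's actual code; the assumption is that the data root contains one subdirectory per location holding `.tif`/`.tiff` and `.geojson` files, as described above:

```python
from pathlib import Path

def inventory(data_root: str) -> dict:
    """Catalog TIFF and GeoJSON files per location directory.

    Hypothetical helper: assumes one subdirectory per location
    under data_root, as the layout above suggests.
    """
    catalog = {}
    for loc in sorted(Path(data_root).iterdir()):
        if not loc.is_dir():
            continue
        tiffs = sorted(loc.rglob("*.tif")) + sorted(loc.rglob("*.tiff"))
        geojsons = sorted(loc.rglob("*.geojson"))
        catalog[loc.name] = {
            "n_tiffs": len(tiffs),
            "n_geojsons": len(geojsons),
            # "complete" here just means both file types are present
            "complete": bool(tiffs) and bool(geojsons),
        }
    return catalog
```

A catalog like this is enough to drive the completeness check: any location with `complete == False` is missing either imagery or annotations.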
Three powerful classes for working with TIFF and GeoJSON:
TiffProcessor: Read and process TIFF files
processor = TiffProcessor("path/to/file.tif")
data = processor.read(normalize=True) # Normalized to 0-1
crop = processor.extract_crop(x, y, crop_size=256)

GeoJsonRasterizer: Convert vector annotations to raster masks
rasterizer = GeoJsonRasterizer(geojson_files)
mask = rasterizer.rasterize(reference_tiff)  # Creates multi-class mask

CropExtractor: Extract training crops with aligned masks
extractor = CropExtractor(tiff_path, geojson_paths, crop_size=256)
crops = extractor.extract_all_crops(min_annotation_ratio=0.01)
# Returns: [(image, mask, metadata), ...]

Key features:
- Automatic CRS alignment between TIFF and GeoJSON
- Multi-class annotation support
- Sliding window extraction with overlap control
- Annotation density filtering
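The sliding-window extraction and annotation-density filtering above can be sketched as follows. This is an illustration of the idea behind `CropExtractor`, not its actual implementation; the real class also carries geo-metadata and handles CRS alignment:

```python
import numpy as np

def extract_crops(image: np.ndarray, mask: np.ndarray,
                  crop_size: int = 256, stride: int = 128,
                  min_annotation_ratio: float = 0.01):
    """Slide a window over an aligned image/mask pair and keep only
    crops whose mask has at least min_annotation_ratio annotated
    (non-zero) pixels."""
    crops = []
    h, w = image.shape[:2]
    for y in range(0, h - crop_size + 1, stride):
        for x in range(0, w - crop_size + 1, stride):
            m = mask[y:y + crop_size, x:x + crop_size]
            ratio = np.count_nonzero(m) / m.size
            if ratio >= min_annotation_ratio:
                crops.append((image[y:y + crop_size, x:x + crop_size], m,
                              {"x": x, "y": y, "annotation_ratio": ratio}))
    return crops
```

With `stride < crop_size` adjacent windows overlap, so the same annotated region appears in several crops; the ratio filter then discards mostly-empty background windows.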
Batch processing script to extract crops from all locations:
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline \
--crop-size 256 \
--stride 128

Output structure:
data/processed/baseline/
├── 002_ДЕМИДОВКА_FINAL/
│ ├── images/ # Image crops (PNG)
│ ├── masks/ # Mask crops (PNG with class IDs)
│ └── class_mapping.json # Category to class ID mapping
├── 003_ЛУБНО_FINAL/
│ └── ...
└── global_class_mapping.json # Unified mapping across all locations
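The unified `global_class_mapping.json` can be built by merging the per-location `class_mapping.json` files. The sketch below assumes each file maps category name to a local class ID, and assigns global IDs from sorted category names with 0 reserved for background; this merge rule is an assumption, not necessarily what `prepare_baseline_dataset` does:

```python
import json
from pathlib import Path

def build_global_mapping(location_mapping_files):
    """Merge per-location class_mapping.json files into one
    unified category -> global class ID mapping (0 = background)."""
    categories = set()
    for path in location_mapping_files:
        # Each file is assumed to map category name -> local class ID;
        # only the names matter for the global mapping.
        categories.update(json.loads(Path(path).read_text(encoding="utf-8")))
    return {"background": 0,
            **{name: i for i, name in enumerate(sorted(categories), start=1)}}
```

Sorting the union of names makes the ID assignment deterministic, so retraining after adding a location only shifts IDs when new categories appear.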
# Install dependencies
pip install rasterio geopandas matplotlib seaborn tqdm
# Run the EDA notebook
jupyter notebook notebooks/comprehensive_eda.ipynb

This will give you a complete understanding of your dataset.
# Test with a few locations first
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline_test \
--crop-size 256 \
--stride 128 \
--max-locations 5
# Process all locations
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline \
--crop-size 256 \
--stride 128

Parameters to tune:
- --crop-size: 128, 256, or 512 (balance spatial context vs. speed)
- --stride: crop_size/2 for 50% overlap (more training data)
- --min-annotation-ratio: 0.0 (keep every crop) to 0.1 (keep only crops that are at least 10% annotated)
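The stride/crop-size trade-off can be quantified: a sliding window places positions at 0, stride, 2*stride, ... while the crop still fits, so halving the stride roughly quadruples the crop count. A minimal sketch (the 4096-pixel tile size is a hypothetical example, not a property of this dataset):

```python
def crops_per_axis(extent: int, crop_size: int, stride: int) -> int:
    """Number of sliding-window positions along one axis."""
    if extent < crop_size:
        return 0
    return (extent - crop_size) // stride + 1

# For a hypothetical 4096x4096 tile:
full = crops_per_axis(4096, 256, 256) ** 2   # no overlap -> 256 crops
half = crops_per_axis(4096, 256, 128) ** 2   # 50% overlap -> 961 crops
```

So `--stride 128` yields roughly 3.75x the crops of a non-overlapping grid, at the cost of correlated training samples.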
After crops are extracted, update config and train:
# Check how many classes you have
cat data/processed/baseline/global_class_mapping.json
# Train (adjust num_classes based on mapping)
python -m src.RaijinNetto.train \
data.image_dir=data/processed/baseline/*/images \
data.mask_dir=data/processed/baseline/*/masks \
model.num_classes=8 \
trainer.max_epochs=50

# 1. Explore data
jupyter notebook notebooks/comprehensive_eda.ipynb
# 2. Extract crops (test first!)
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline_test \
--max-locations 3
# 3. Check results
ls -la data/processed/baseline_test/*/images | head
cat data/processed/baseline_test/global_class_mapping.json
# 4. If satisfied, process all
python -m src.RaijinNetto.prepare_baseline_dataset \
--data-root /mnt/e/kaggle-data/train \
--output-dir data/processed/baseline
# 5. Train baseline
python -m src.RaijinNetto.train \
data.image_dir=data/processed/baseline/*/images \
data.mask_dir=data/processed/baseline/*/masks \
model.num_classes=8 \
trainer.max_epochs=50 \
data.batch_size=8
# 6. Monitor training
tensorboard --logdir logs/