YOLOv11-based region segmentation for OCR-D, modelled on the ocrd_detectron2 wrapper: https://github.com/bertsky/ocrd_detectron2/tree/master

This OCR-D processor uses YOLOv11 models to detect and segment regions in document images.
Install from source:

```sh
pip install -e .
```

Or build and run with Docker Compose:

```sh
docker compose build
docker compose run ocrd-yolo
```

For CPU only:

```sh
docker compose build ocrd-yolo-cpu
docker compose run ocrd-yolo-cpu
```
```sh
# Using a pre-trained model from resources
ocrd-yolo-segment \
  -I OCR-D-IMG \
  -O OCR-D-SEG-REGION \
  -p '{
    "model_weights": "yolo11s-example.pt",
    "categories": ["TextRegion:paragraph", "TextRegion:heading", "Border:page", "TableRegion", "ImageRegion"],
    "min_confidence": 0.5
  }'
```

For segmenting table regions:
```sh
ocrd-yolo-segment \
  -I OCR-D-SEG-BLOCK \
  -O OCR-D-SEG-TABLE \
  -p '{
    "model_weights": "yolo11s-table.pt",
    "categories": ["TextRegion:columns", "TextRegion:header"],
    "level-of-operation": "table",
    "min_confidence": 0.7
  }'
```

- TODO
- `model_weights` (string, required): Path to YOLOv11 model weights
- `categories` (array, required): Maps model classes to PAGE-XML region types
- `level-of-operation` (string, default: `"page"`): `"page"` or `"table"` level processing
- `min_confidence` (float, default: `0.5`): Detection confidence threshold
- `postprocessing` (string, default: `"full"`): Post-processing mode
  - `"full"`: NMS + morphological operations
  - `"only-nms"`: Only non-maximum suppression
  - `"only-morph"`: Only morphological operations
  - `"none"`: No post-processing
- `debug_img` (string, default: `"none"`): Debug visualization
- `device` (string, default: `"cuda"`): Computing device
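To illustrate how the required and default values above fit together, here is a small sketch of resolving a partial parameter dict against the documented defaults (illustrative only; `resolve_params` is a hypothetical helper, not the processor's actual code):

```python
# Sketch: apply the documented defaults to a partial parameter dict.
# (Illustrative only -- not the processor's actual implementation.)
DEFAULTS = {
    "level-of-operation": "page",
    "min_confidence": 0.5,
    "postprocessing": "full",
    "debug_img": "none",
    "device": "cuda",
}
REQUIRED = ("model_weights", "categories")

def resolve_params(params):
    """Check required keys, then fill in the documented defaults."""
    for key in REQUIRED:
        if key not in params:
            raise ValueError(f"missing required parameter: {key}")
    return {**DEFAULTS, **params}

params = resolve_params({
    "model_weights": "yolo11s-example.pt",
    "categories": ["TextRegion:paragraph", "TableRegion"],
    "min_confidence": 0.7,
})
# explicit values win over defaults; unset keys fall back
```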
Convert your annotations to YOLO format with this structure:
```
dataset/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
└── labels/
    ├── train/
    ├── val/
    └── test/
```
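Each label file contains one line per region in the standard YOLO format `class x_center y_center width height`, with all coordinates normalized to the image size. A small helper for converting absolute pixel boxes (the function name is illustrative):

```python
def to_yolo_line(class_id, x0, y0, x1, y1, img_w, img_h):
    """Convert an absolute pixel bounding box to a YOLO label line:
    'class cx cy w h', all coordinates normalized to [0, 1]."""
    cx = (x0 + x1) / 2 / img_w
    cy = (y0 + y1) / 2 / img_h
    w = (x1 - x0) / img_w
    h = (y1 - y0) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# e.g. a 200x100 px region at (100, 50) on a 1000x800 px page:
print(to_yolo_line(0, 100, 50, 300, 150, 1000, 800))
# → 0 0.200000 0.125000 0.200000 0.125000
```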
Categories map model predictions to PAGE-XML region types:
```jsonc
{
  "categories": [
    "TextRegion",            // Simple text region
    "TextRegion:heading",    // Text region with subtype
    "Border:page",           // Page border
    "ImageRegion",           // Image/figure
    "TableRegion",           // Table
    "GraphicRegion",         // Graphics/drawings
    "SeparatorRegion",       // Lines/separators
    "",                      // Skip this class
    "CustomRegion:formula"   // Custom region type
  ]
}
```

```sh
# 1. Import images
ocrd-import ...
# 2. Binarize
ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
# 3. Detect regions with YOLO
ocrd-yolo-segment -I OCR-D-BIN -O OCR-D-SEG-REGION \
  -p '{"model_weights": "yolo11s-example.pt", ...}'
# 4. Detect lines
ocrd-tesserocr-segment-line -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE
# 5. OCR
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR
```

The processor supports incremental segmentation: it won't overwrite existing regions:
```sh
# First pass: detect main regions
ocrd-yolo-segment -I OCR-D-IMG -O OCR-D-SEG-REGION-1 \
  -p '{"model_weights": "model1.pt", ...}'
# Second pass: detect additional regions
ocrd-yolo-segment -I OCR-D-SEG-REGION-1 -O OCR-D-SEG-REGION-2 \
  -p '{"model_weights": "model2.pt", ...}'
```
- Model Selection:
  - Use `yolo11n` for speed (real-time processing)
  - Use `yolo11s` for balanced performance
  - Use `yolo11m` or larger for maximum accuracy
- Batch Processing:
  - Process multiple pages together for better GPU utilization
  - Adjust batch size based on GPU memory
- Resolution:
  - The original image resolution is retained, since it improves YOLO's accuracy
  - This will become configurable in a future version
- Post-processing:
  - Use `"only-nms"` for cleaner text documents
  - Use `"full"` for complex layouts with touching regions
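For reference, non-maximum suppression (the `"only-nms"` mode) greedily keeps the highest-scoring box and discards boxes that overlap it too much. A minimal sketch of the technique, not the processor's implementation:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop all boxes
    overlapping it above the IoU threshold, repeat. Returns indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```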
Reduce batch size or use a smaller model:

```sh
# Use smaller model
-p '{"model_weights": "yolo11n-example.pt"}'
# Or force CPU
-p '{"device": "cpu"}'
```
- Check the confidence threshold:

  ```sh
  -p '{"min_confidence": 0.3}'  # Lower threshold
  ```

- Try different post-processing:

  ```sh
  -p '{"postprocessing": "only-nms"}'
  ```

- Use a model appropriate for the document type

Enable debug visualization to see all detections:

```sh
-p '{"debug_img": "visualize"}'
```

ChatGPT 4.5 and 5, as well as Claude Opus 4.5, have been used to generate this OCR-D extension. At the beginning, the detectron2 extension served as an example: github.com/bertsky/ocrd_detectron2. The generated code has been manually tested and iteratively improved using the listed models.