Tibetan OCR Tools

Overview

This Python project focuses on detecting and recognizing Tibetan text in images. It provides a complete pipeline from dataset generation to OCR:

Dataset Generation: Create synthetic training data by embedding Tibetan text into background images
Model Training: Train a YOLO-based object detection model with optional Weights & Biases logging
Inference: Detect Tibetan text blocks in new images, including support for Staatsbibliothek zu Berlin digital collections
OCR: Apply Tesseract OCR to the detected text blocks to extract the actual text content

Quick Start Guide

Installation

# Clone the repository
git clone https://github.com/nih23/Tibetan-NLP.git
cd Tibetan-NLP

# Install dependencies
pip install -r requirements.txt

# Install Tesseract OCR (required for text recognition)
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
# macOS: brew install tesseract
# Windows: Download installer from https://github.com/UB-Mannheim/tesseract/wiki

Complete Workflow

# 1. Generate dataset
python generate_training_data.py --train_samples 1000 --val_samples 200 --image_size 1024

# 2. Train model
python train_model.py --epochs 100 --export

# 3. Run inference (object detection only)
# On local images:
yolo predict task=detect model=runs/detect/train/weights/best.torchscript imgsz=1024 source=data/my_inference_data/*.jpg

# On Staatsbibliothek zu Berlin data:
python inference_sbb.py --ppn PPN12345678 --model runs/detect/train/weights/best.torchscript

# 4. Run OCR on detected text blocks
# On local images:
python ocr_on_detections.py --source data/my_inference_data/*.jpg --model runs/detect/train/weights/best.torchscript --lang bod

# On Staatsbibliothek zu Berlin data:
python ocr_on_detections.py --ppn PPN12345678 --model runs/detect/train/weights/best.torchscript --lang bod

Features

Automated Dataset Generation: Create training and validation datasets for Tibetan text detection
Customizable Parameters: Configure background images, corpora, font, image size, and more
Text Generation Options: Use synthetic Tibetan text or existing corpora
Advanced Image Processing: Apply rotation and noise augmentation strategies
Multiprocessing Support: Leverage parallel processing for efficient dataset generation
Experiment Tracking: Integration with Weights & Biases for monitoring training progress
Specialized Inference: Support for Staatsbibliothek zu Berlin digital collections
OCR Integration: Extract text from detected text blocks using Tesseract OCR

Code Structure

The project has been refactored to improve modularity and reduce redundancy:

Core Scripts

generate_training_data.py: Creates synthetic training data
train_model.py: Trains the YOLO model with optional W&B logging
inference_sbb.py: Performs inference on Staatsbibliothek zu Berlin data
ocr_on_detections.py: Applies OCR to detected text blocks

Utility Library

The tibetan_utils package contains shared functionality used across the project:

tibetan_utils.config: Configuration constants and settings
tibetan_utils.arg_utils: Command-line argument parsing utilities
tibetan_utils.io_utils: File and directory operations
tibetan_utils.image_utils: Image processing functions
tibetan_utils.model_utils: Model loading, training, and inference
tibetan_utils.ocr_utils: OCR processing utilities
tibetan_utils.sbb_utils: Staatsbibliothek zu Berlin API integration

Detailed Documentation

1. Dataset Generation

The dataset generation script creates synthetic training data by embedding Tibetan text into background images.

python generate_training_data.py --train_samples 1000 --val_samples 200 --augmentation rotate

Dataset Generation Arguments

Argument	Description	Default
`--background_train`	Folder with background images for training	'./data/background_images_train/'
`--background_val`	Folder with background images for validation	'./data/background_images_val/'
`--dataset_name`	Folder for the generated YOLO dataset	'yolo_tibetan/'
`--corpora_folder`	Folder with Tibetan corpora	'./data/corpora/Tibetan Number Words/'
`--train_samples`	Number of training samples to generate	100
`--val_samples`	Number of validation samples to generate	100
`--no_cols`	Number of text columns to generate [1-5]	1
`--font_path`	Path to a Tibetan font file	'ext/Microsoft Himalaya.ttf'
`--single_label`	Use a single label "tibetan" for all files	flag
`--debug`	Enable debug mode for verbose output	flag
`--image_size`	Size of generated images in pixels	1024
`--augmentation`	Type of augmentation to apply ['rotate', 'noise']	'noise'

2. Model Training

After generating the dataset, you can train a YOLO model using our dedicated training script.

python train_model.py --epochs 100 --imgsz 1024 --export

Training Arguments

Argument	Description	Default
`--dataset`	Name of the dataset folder	'yolo_tibetan/'
`--model`	Path to the base model	'yolov8n.pt'
`--epochs`	Number of training epochs	100
`--batch`	Batch size for training	16
`--imgsz`	Image size for training	1024
`--workers`	Number of workers for data loading	8
`--device`	Device for training (e.g., cpu, 0, 0,1,2,3)	''
`--project`	Project name for output	'runs/detect'
`--name`	Experiment name	'train'
`--export`	Export the model after training as TorchScript	flag
`--patience`	EarlyStopping patience in epochs	50

Weights & Biases Integration

The training script includes integration with Weights & Biases for experiment tracking and visualization.

python train_model.py --epochs 100 --wandb --wandb-project TibetanOCR

Argument	Description	Default
`--wandb`	Enable Weights & Biases logging	flag
`--wandb-project`	W&B project name	'TibetanOCR'
`--wandb-entity`	W&B entity (team or username)	None
`--wandb-tags`	Comma-separated tags for the experiment	None
`--wandb-name`	Name of the experiment in wandb	same as --name

When wandb logging is enabled, the script will:

Log training metrics (loss, mAP, precision, recall, etc.)
Upload dataset samples for visualization
Save model checkpoints as artifacts
Generate plots and confusion matrices

Alternative: Ultralytics CLI

You can also use the Ultralytics CLI directly:

# Train the model
yolo detect train data=yolo_tibetan/data.yml epochs=100 imgsz=1024 model=yolov8n.pt

# Export the model
yolo detect export model=runs/detect/train/weights/best.pt

3. Inference

Standard Inference

For inference on local image files:

yolo predict task=detect model=runs/detect/train/weights/best.torchscript imgsz=1024 source=data/my_inference_data/*.jpg

The results are saved to folder runs/detect/predict

Inference on Staatsbibliothek zu Berlin Data

For inference on documents from the Staatsbibliothek zu Berlin, use our specialized script:

python inference_sbb.py --ppn PPN12345678 --model runs/detect/train/weights/best.torchscript

If you encounter SSL certificate issues, you can use the --no-ssl-verify option:

python inference_sbb.py --ppn PPN12345678 --model runs/detect/train/weights/best.torchscript --no-ssl-verify

Argument	Description	Default
`--ppn`	PPN (Pica Production Number) of the document	required
`--model`	Path to the trained model	required
`--imgsz`	Image size for inference	1024
`--conf`	Confidence threshold for detections	0.25
`--download`	Download images instead of processing them directly	flag
`--output`	Directory for saving downloaded images	'sbb_images'
`--max-images`	Maximum number of images for inference (0 = all)	0
`--no-ssl-verify`	Disable SSL certificate verification	flag

4. OCR on Detected Text Blocks

The OCR script applies Tesseract OCR to text blocks detected by the YOLO model:

python ocr_on_detections.py --source image.jpg --model runs/detect/train/weights/best.pt --lang bod

OCR Arguments

Argument	Description	Default
`--source`	Path to image or directory	required (if no --ppn)
`--ppn`	PPN for Staatsbibliothek zu Berlin data	required (if no --source)
`--model`	Path to the trained model	required
`--lang`	Language for Tesseract OCR	'eng+deu'
`--tesseract-config`	Additional Tesseract configuration	''
`--save-crops`	Save cropped text blocks as images	flag
`--output`	Directory for saving results	'ocr_results'
`--no-ssl-verify`	Disable SSL certificate verification	flag

OCR Output Format

The script generates a JSON file for each processed image with the following structure:

{
  "image_name": "example.jpg",
  "detections": [
    {
      "id": 0,
      "box": {
        "x": 0.5,
        "y": 0.3,
        "width": 0.2,
        "height": 0.1
      },
      "confidence": 0.95,
      "class": 0,
      "text": "Recognized Tibetan text from this block"
    }
  ]
}

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Tibetan OCR Tools

Overview

Quick Start Guide

Installation

Complete Workflow

Features

Code Structure

Core Scripts

Utility Library

Detailed Documentation

1. Dataset Generation

Dataset Generation Arguments

2. Model Training

Training Arguments

Weights & Biases Integration

Alternative: Ultralytics CLI

3. Inference

Standard Inference

Inference on Staatsbibliothek zu Berlin Data

4. OCR on Detected Text Blocks

OCR Arguments

OCR Output Format

Contributing

License

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
data		data
deprecated		deprecated
res		res
tibetanDataGenerator		tibetanDataGenerator
tibetan_utils		tibetan_utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
generate_training_data.py		generate_training_data.py
inference_sbb.py		inference_sbb.py
ocr_on_detections.py		ocr_on_detections.py
ocr_utils.py		ocr_utils.py
requirements.txt		requirements.txt
train_model.py		train_model.py

License

CodexAITeam/PechaBridge

Folders and files

Latest commit

History

Repository files navigation

Tibetan OCR Tools

Overview

Quick Start Guide

Installation

Complete Workflow

Features

Code Structure

Core Scripts

Utility Library

Detailed Documentation

1. Dataset Generation

Dataset Generation Arguments

2. Model Training

Training Arguments

Weights & Biases Integration

Alternative: Ultralytics CLI

3. Inference

Standard Inference

Inference on Staatsbibliothek zu Berlin Data

4. OCR on Detected Text Blocks

OCR Arguments

OCR Output Format

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages