39 changes: 35 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -9,16 +9,17 @@ This Python project focuses on detecting and recognizing Tibetan text in images.
3. **Inference**: Detect Tibetan text blocks in new images, including support for Staatsbibliothek zu Berlin digital collections
4. **OCR**: Apply Tesseract OCR to the detected text blocks to extract the actual text content

![Validation results](res/results_val_1.png)
## Example of synthetic data
![generated synthetic data](res/results_val_1.jpg)

## Quick Start Guide

### Installation

```bash
# Clone the repository
git clone https://github.com/nih23/Tibetan-NLP.git
cd Tibetan-NLP
git clone https://github.com/CodexAITeam/PechaBridge
cd PechaBridge

# Install dependencies
pip install -r requirements.txt
@@ -33,7 +34,37 @@ pip install -r requirements.txt

```bash
# 1. Generate dataset
python generate_training_data.py --train_samples 1000 --val_samples 200 --image_size 1024
python generate_training_data.py --train_samples 10 --val_samples 10 --font_path_tibetan ext/Microsoft\ Himalaya.ttf --font_path_chinese ext/simkai.ttf --dataset_name tibetan-yolo

# 1.5 Inspect and validate dataset with Label Studio (optional)
# Install Label Studio if not already installed:
# pip install label-studio label-studio-converter

# Set up environment variables for local file serving
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=$(pwd)/datasets/tibetan-yolo

# Create classes.txt for Label Studio compatibility
# (printf is used because plain `echo` in bash does not expand "\n")
printf "tibetan_no\ntext_body\nchinese_no\n" > datasets/tibetan-yolo/train/classes.txt
printf "tibetan_no\ntext_body\nchinese_no\n" > datasets/tibetan-yolo/val/classes.txt

# Convert YOLO annotations to Label Studio format
label-studio-converter import yolo -i datasets/tibetan-yolo/train -o ls-tasks.json --image-ext ".png" --image-root-url "/data/local-files/?d=train/images"

# Start Label Studio web interface (opens at http://localhost:8080)
label-studio

# In Label Studio:
# 1. Create a new project:
# 1.1 Go to the project settings and select Cloud Storage.
# 1.2 Click Add Source Storage and select Local files from the Storage Type options.
# 1.3 Set the Absolute local path to `$(pwd)/datasets/tibetan-yolo` (replace `$(pwd)` with the actual absolute path)
# 1.4 Click Add storage.
# 2. Import the generated ls-tasks.json file
# 3. Review and validate the generated annotations
# 4. Export corrections if needed

# [1] https://github.com/HumanSignal/label-studio-sdk/tree/master/src/label_studio_sdk/converter#tutorial-importing-yolo-pre-annotated-images-to-label-studio-using-local-storage

# 2. Train model
python train_model.py --epochs 100 --export
@@ -0,0 +1,6 @@
yolo_label,label_name,bbox_x,bbox_y,bbox_width,bbox_height,image_name,image_width,image_height
0,tibetan_no,18,40,54,275,bg_PPN337138764X_00000005.png,1024,361
1,illustration_left,73,40,211,276,bg_PPN337138764X_00000005.png,1024,361
2,text_body,286,40,442,278,bg_PPN337138764X_00000005.png,1024,361
3,illustration_right,731,40,224,277,bg_PPN337138764X_00000005.png,1024,361
4,chinese_no,956,41,52,279,bg_PPN337138764X_00000005.png,1024,361
@@ -0,0 +1,5 @@
0 0.044237 0.491942 0.052760 0.762097
1 0.174513 0.493093 0.206169 0.764399
2 0.495130 0.496546 0.431818 0.771306
3 0.823052 0.494244 0.219156 0.766701
4 0.958604 0.500000 0.050325 0.773609
@@ -0,0 +1,6 @@
yolo_label,label_name,bbox_x,bbox_y,bbox_width,bbox_height,image_name,image_width,image_height
0,tibetan_no,24,28,52,304,bg_PPN337138764X_00000005.png,1024,361
1,illustration_left,80,28,263,303,bg_PPN337138764X_00000005.png,1024,361
2,illustration_centered,348,28,313,304,bg_PPN337138764X_00000005.png,1024,361
3,illustration_right,668,28,267,304,bg_PPN337138764X_00000005.png,1024,361
4,chinese_no,940,28,57,305,bg_PPN337138764X_00000005.png,1024,361
@@ -0,0 +1,5 @@
0 0.049107 0.497698 0.051136 0.842681
1 0.206575 0.496546 0.257305 0.840378
2 0.493101 0.500000 0.306006 0.842681
3 0.782873 0.497698 0.260552 0.842681
4 0.945617 0.498849 0.055195 0.844983
@@ -0,0 +1,4 @@
yolo_label,label_name,bbox_x,bbox_y,bbox_width,bbox_height,image_name,image_width,image_height
0,tibetan_no,14,46,113,283,bg_PPN337138764X_00000005.png,1024,361
1,text_body,130,45,772,284,bg_PPN337138764X_00000005.png,1024,361
2,chinese_no,906,45,107,284,bg_PPN337138764X_00000005.png,1024,361
@@ -0,0 +1,3 @@
0 0.068994 0.519570 0.110390 0.785121
1 0.504464 0.518419 0.754058 0.787423
2 0.937094 0.518419 0.104708 0.787423
Binary file added data/tibetan numbers/backgrounds/bg_IMG_5086.jpg
Binary file added data/tibetan numbers/bg_train/Dalle_1.jpg
Binary file added data/tibetan numbers/bg_train/Dalle_2.jpg
Binary file added data/tibetan numbers/bg_val/Dalle2.jpg
Binary file added data/tibetan numbers/bg_val/Dalle3.jpg
Binary file added data/tibetan numbers/bg_val/Dalle4.jpg
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0001.txt
@@ -0,0 +1 @@
གཅིག་
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0002.txt
@@ -0,0 +1 @@
གཉིས
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0003.txt
@@ -0,0 +1 @@
གསུམ
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0004.txt
@@ -0,0 +1 @@
བཞི
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0005.txt
@@ -0,0 +1 @@
ལྔ
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0006.txt
@@ -0,0 +1 @@
དྲུག་
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0007.txt
@@ -0,0 +1 @@
བདུན་
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0008.txt
@@ -0,0 +1 @@
བརྒྱད
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0009.txt
@@ -0,0 +1 @@
དགུ
1 change: 1 addition & 0 deletions data/tibetan numbers/corpora/tib_no_0010.txt
@@ -0,0 +1 @@
བཅུ
86 changes: 56 additions & 30 deletions generate_training_data.py
@@ -1,12 +1,20 @@
#!/usr/bin/env python3
"""
Script for generating training data for Tibetan OCR.
Creates synthetic images with Tibetan text for YOLO training.
Script for generating multi-class training data for Tibetan OCR.
Creates synthetic images with Tibetan text, Chinese numbers, and general text for YOLO training.

Supports 3 classes:
- Class 0: tibetan_number_word (Tibetan numbers)
- Class 1: tibetan_text (general Tibetan text)
- Class 2: chinese_number_word (Chinese numbers)
"""

from pathlib import Path
from collections import OrderedDict
from ultralytics.data.utils import DATASETS_DIR
try:
    from ultralytics.data.utils import DATASETS_DIR
except ImportError:
    DATASETS_DIR = "./datasets"  # Fallback if ultralytics not installed
from tibetanDataGenerator.dataset_generator import generate_dataset

# Import functions from the tibetan_utils library
@@ -15,38 +23,56 @@


def main():
    # Parse arguments
    # Parse arguments (multi-class support)
    parser = create_generate_dataset_parser()
    args = parser.parse_args()

    # Set dataset path
    datasets_dir = Path(DATASETS_DIR)
    path = str(datasets_dir / args.dataset_name)
    args.dataset_name = path
    print(f"Generating YOLO dataset {args.dataset_name}...")

    # Generate training dataset
    train_dataset_dict = generate_dataset(args, validation=False)

    # Generate validation dataset
    val_dataset_dict = generate_dataset(args, validation=True)

    # Combine train and val dataset information
    dataset_dict = {
        'path': args.dataset_name,
        'train': 'train/images',
        'val': 'val/images',
        'nc': train_dataset_dict['nc'],
        'names': train_dataset_dict['names']
    }

    # Save dataset configuration
    yaml_path = f"{args.dataset_name}/data.yml"
    save_yaml(dataset_dict, yaml_path)

    print("Dataset generation complete.")
    full_dataset_path = Path(args.output_dir) / args.dataset_name
    original_dataset_name = args.dataset_name
    args.dataset_name = str(full_dataset_path)

    print(f"Generating multi-class YOLO dataset in {args.dataset_name}...")
    print("The storage location can be changed via `yolo settings`.")
    print("Supported classes:")
    print(" - Class 0: tibetan_number_word (Tibetan numbers)")
    print(" - Class 1: tibetan_text (general Tibetan text)")
    print(" - Class 2: chinese_number_word (Chinese numbers)")

    # Generate training dataset (multi-class)
    train_dataset_info = generate_dataset(args, validation=False)

    # Generate validation dataset (multi-class)
    val_dataset_info = generate_dataset(args, validation=True)

    # Multi-class YAML configuration
    yaml_content = OrderedDict()
    yaml_content['path'] = original_dataset_name
    yaml_content['train'] = 'train/images'
    yaml_content['val'] = 'val/images'
    yaml_content['test'] = ''

    if 'nc' not in train_dataset_info or 'names' not in train_dataset_info:
        raise ValueError("generate_dataset did not return 'nc' or 'names' in its info dictionary.")
    yaml_content['nc'] = train_dataset_info['nc']
    yaml_content['names'] = train_dataset_info['names']

    # Save the YAML with the correct function
    yaml_file_path = Path(args.output_dir) / f"{original_dataset_name}.yaml"

    # Use the modular save_yaml function
    import yaml
    def represent_ordereddict(dumper, data):
        return dumper.represent_mapping('tag:yaml.org,2002:map', data.items())

    yaml.add_representer(OrderedDict, represent_ordereddict)

    with open(yaml_file_path, 'w', encoding='utf-8') as f:
        yaml.dump(dict(yaml_content), f, sort_keys=False, allow_unicode=True)

    print(f"\nMulti-class dataset generation complete. YAML configuration saved: {yaml_file_path}")
    print("Training can be started with the following command:\n")
    print(f"yolo detect train data={yaml_path} epochs=100 imgsz=1024 model=yolov8n.pt")
    print(f"yolo detect train data={yaml_file_path} epochs=100 imgsz=[{args.image_height},{args.image_width}] model=yolov8n.pt")


if __name__ == "__main__":
Binary file added res/results_val_1.jpg
Binary file removed res/results_val_1.png
Binary file not shown.
Binary file removed res/results_val_2.png
Binary file not shown.
70 changes: 70 additions & 0 deletions tibetanDataGenerator/README.md
@@ -0,0 +1,70 @@
# Tibetan Text Detection Dataset Generator

A tool for generating synthetic YOLO-formatted datasets for detecting Tibetan text, numbers, and their Chinese number counterparts in document images.

## Features
- Generates synthetic document images with Tibetan text, numbers, and Chinese numbers
- Creates corresponding YOLO-format annotations
- Maintains consistent numbering between Tibetan and Chinese number representations
- Supports multiple text corpora with intelligent text placement
- Includes data augmentation options (rotation, noise)
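
The YOLO-format annotations listed above store one box per line as `class x_center y_center width height`, with all four values normalized by the image size. The following is a minimal sketch of that conversion, assuming boxes are given as top-left pixel coordinates plus width and height. The helper name `to_yolo` is hypothetical and not part of this project.

```python
def to_yolo(class_id, x, y, w, h, image_width, image_height):
    """Convert a top-left pixel box into a normalized YOLO label line."""
    x_center = (x + w / 2) / image_width
    y_center = (y + h / 2) / image_height
    return (
        f"{class_id} {x_center:.6f} {y_center:.6f} "
        f"{w / image_width:.6f} {h / image_height:.6f}"
    )


# Example: a 54x275 px box at (18, 40) on a 1024x361 background image.
line = to_yolo(0, 18, 40, 54, 275, 1024, 361)
print(line)
```

Because every value is divided by the image size, the same label stays valid if the image is later rescaled.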

## New Options
    python main.py \
        --corpora_tibetan_numbers_path ./data/corpora/Tibetan\ Number\ Words/ \
        --corpora_tibetan_text_path ./data/corpora/UVA\ Tibetan\ Spoken\ Corpus/ \
        --corpora_chinese_numbers_path ./data/corpora/Chinese\ Number\ Words/ \
        --font_path_tibetan ./fonts/Microsoft\ Himalaya.ttf \
        --font_path_chinese ./fonts/simkai.ttf \
        --image_width 1024 \
        --image_height 361 \
        --annotations_file_path ./data/annotations/tibetan_chinese_no.txt

## Example Usage
    python path/to/main.py \
        --corpora_tibetan_numbers_path "path/to/data/corpora/Tibetan Number Words" \
        --corpora_tibetan_text_path "path/to/data/corpora/UVA Tibetan Spoken Corpus" \
        --corpora_chinese_numbers_path "path/to/data/corpora/Chinese Number Words" \
        --background_train "path/to/data/background_images_train" \
        --background_val "path/to/data/background_images_val" \
        --annotations_file_path "path/to/data/annotations/tibetan_chinese_no/bg_example_0001.txt" \
        --font_path_tibetan "path/to/fonts/Microsoft Himalaya.ttf" \
        --font_path_chinese "path/to/fonts/simkai.ttf" \
        --train_samples 2 \
        --val_samples 2
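
For orientation, the options used in the commands above could be declared with `argparse` roughly as follows. This is only a sketch: the option names come from this README, but the function name `build_parser`, the types, and the default values are assumptions and may differ from the parser the project actually uses.

```python
import argparse


def build_parser():
    """Build a parser for the options shown above (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Generate synthetic Tibetan/Chinese number images."
    )
    path_options = [
        "corpora_tibetan_numbers_path",
        "corpora_tibetan_text_path",
        "corpora_chinese_numbers_path",
        "background_train",
        "background_val",
        "annotations_file_path",
        "font_path_tibetan",
        "font_path_chinese",
    ]
    for name in path_options:
        parser.add_argument(f"--{name}", type=str, default=None)
    # Sample counts and image size; the defaults mirror the example values above.
    parser.add_argument("--train_samples", type=int, default=2)
    parser.add_argument("--val_samples", type=int, default=2)
    parser.add_argument("--image_width", type=int, default=1024)
    parser.add_argument("--image_height", type=int, default=361)
    return parser


args = build_parser().parse_args(
    ["--font_path_tibetan", "path/to/fonts/Microsoft Himalaya.ttf", "--train_samples", "5"]
)
print(args.font_path_tibetan, args.train_samples, args.image_width)
```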

## List of altered scripts
- main.py (to use it correctly, move the script to the [initial project directory](https://github.com/CodexAITeam/TibetanOCR/tree/synthetic_generation_tib_chi_no))
- dataset_generator.py, altered to dataset_generator_tib_chi_no.py
- text_renderer.py, altered to text_renderer_img_size.py

## Script Details
The script maps the corpus paths passed to main.py onto the bounding boxes that carry the matching ann_class_id (the YOLO class ID), so that generate_dataset_tib_chi_no.py renders a different kind of text into each box.
The ann_class_id values are parsed from a preconfigured annotation template named bg_PPN337138764X_00000005.txt, which is located in the Tibetan Layout Analyzer project. See our [Tibetan Numbers Dataset Folder](https://github.com/CodexAITeam/TibetanLayoutAnalyzer/tree/main/data/tibetan%20numbers) for sample files. The script also uses background images from that project at 1024x361 pixels, because this size reflects the format of the original historical data. The argparse option font_path_tibetan sets the font used to render generated Tibetan text, while font_path_chinese sets the font used for Chinese text.

Here is the table of the label mapping:

| Class Name | Corpus Path | Planned Label ID Range* | ann_class_id / YOLO Class ID |
|-----------------------|---------------------------------|-------------------------|------------------------------|
| Tibetan Number Words | `corpora_tibetan_numbers_path` | 000-009 | 0 |
| Tibetan Text Body | `corpora_tibetan_text_path` | 101-110 | 1 |
| Chinese Number Words | `corpora_chinese_numbers_path` | 201-210 | 2 |

\* see Limitations
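
As a rough illustration of the mapping in the table, the sketch below parses a YOLO annotation template and looks up which corpus option would supply the text for each box. The names `CLASS_TO_CORPUS_OPTION` and `parse_template` are hypothetical, and the three label lines are sample values for a 1024x361 image. The actual logic in the generator scripts may differ.

```python
# Assumed lookup from ann_class_id to the CLI option that supplies its corpus.
CLASS_TO_CORPUS_OPTION = {
    0: "corpora_tibetan_numbers_path",
    1: "corpora_tibetan_text_path",
    2: "corpora_chinese_numbers_path",
}


def parse_template(text, image_width=1024, image_height=361):
    """Parse YOLO label lines into (class_id, (left, top, width, height)) in pixels."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        class_id = int(parts[0])
        xc, yc, w, h = (float(p) for p in parts[1:])
        left = round((xc - w / 2) * image_width)
        top = round((yc - h / 2) * image_height)
        boxes.append(
            (class_id, (left, top, round(w * image_width), round(h * image_height)))
        )
    return boxes


template = """0 0.068994 0.519570 0.110390 0.785121
1 0.504464 0.518419 0.754058 0.787423
2 0.937094 0.518419 0.104708 0.787423"""

for class_id, box in parse_template(template):
    print(class_id, CLASS_TO_CORPUS_OPTION[class_id], box)
```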

The text inputs are selected as follows:
- Tibetan Numbers: tib_no_0001.txt to tib_no_0010.txt, randomly selected
- Tibetan Text: uvrip*.txt, randomly selected
- Chinese Numbers: chi_no_0001.txt to chi_no_0010.txt, selected in step with the Tibetan numbers (for instance, chi_no_0001.txt is selected when tib_no_0001.txt is selected)

See our [Corpora Folder](https://github.com/CodexAITeam/TibetanOCR/tree/synthetic_generation_tib_chi_no/data/corpora) for sample files.
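
The paired selection of Tibetan and Chinese number files can be sketched as drawing one random index and reusing it for both file names. This is an illustrative sketch only: `pick_number_pair` is a hypothetical helper, and the file-name pattern is taken from the list above.

```python
import random


def pick_number_pair(rng, count=10):
    """Pick one index and return the matching Tibetan and Chinese file names."""
    index = rng.randint(1, count)  # inclusive on both ends
    return f"tib_no_{index:04d}.txt", f"chi_no_{index:04d}.txt"


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
tib_file, chi_file = pick_number_pair(rng)
print(tib_file, chi_file)
```

Sharing the index is what keeps the Tibetan and Chinese number on one generated page consistent with each other.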

## Generated synthetic image sample
- generated_sample.png

## Limitations and Outline for future development
- Label_dict does not yet produce the Planned Label ID Ranges correctly, because so far it only uses the labels of the Tibetan number files.
- Augmentations are still very limited and will be expanded.

## License
This project is licensed under the MIT License - see the [LICENSE](https://github.com/CodexAITeam/TibetanOCR/blob/synthetic_generation_tib_chi_no/LICENSE) file for details.