CodexAITeam · nih23 · May 18, 2025
diff --git a/README.md b/README.md
@@ -9,16 +9,17 @@ This Python project focuses on detecting and recognizing Tibetan text in images.
 3. **Inference**: Detect Tibetan text blocks in new images, including support for Staatsbibliothek zu Berlin digital collections
 4. **OCR**: Apply Tesseract OCR to the detected text blocks to extract the actual text content
 
-![Validation results](res/results_val_1.png)
+## Example of synthetic data
+![generated synthetic data](res/results_val_1.jpg)
 
 ## Quick Start Guide
 
 ### Installation
 
 ```bash
 # Clone the repository
-git clone https://github.com/nih23/Tibetan-NLP.git
-cd Tibetan-NLP
+git clone https://github.com/CodexAITeam/PechaBridge
+cd PechaBridge
 
 # Install dependencies
 pip install -r requirements.txt
@@ -33,7 +34,37 @@ pip install -r requirements.txt
 
 ```bash
 # 1. Generate dataset
-python generate_training_data.py --train_samples 1000 --val_samples 200 --image_size 1024
+python generate_training_data.py --train_samples 10 --val_samples 10 --font_path_tibetan ext/Microsoft\ Himalaya.ttf --font_path_chinese ext/simkai.ttf --dataset_name tibetan-yolo
+
+# 1.5 Inspect and validate dataset with Label Studio (optional)
+# Install Label Studio if not already installed:
+# pip install label-studio label-studio-converter
+
+# Set up environment variables for local file serving
+export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
+export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=$(pwd)/datasets/tibetan-yolo
+
+# Create classes.txt for Label Studio compatibility (printf interprets \n portably; plain echo may not)
+printf "tibetan_no\ntext_body\nchinese_no\n" > datasets/tibetan-yolo/train/classes.txt
+printf "tibetan_no\ntext_body\nchinese_no\n" > datasets/tibetan-yolo/val/classes.txt
+
+# Convert YOLO annotations to Label Studio format
+label-studio-converter import yolo -i datasets/tibetan-yolo/train -o ls-tasks.json --image-ext ".png" --image-root-url "/data/local-files/?d=train/images"
+
+# Start Label Studio web interface (opens at http://localhost:8080)
+label-studio
+
+# In Label Studio:
+# 1. Create a new project:
+# 1.1 Go to the project settings and select Cloud Storage.
+# 1.2 Click Add Source Storage and select Local files from the Storage Type options.
+# 1.3 Set the Absolute local path to `$(pwd)/datasets/tibetan-yolo` (replace `$(pwd)` with the actual absolute path)
+# 1.4 Click Add storage.
+# 2. Import the generated ls-tasks.json file
+# 3. Review and validate the generated annotations
+# 4. Export corrections if needed
+
+# [1] https://github.com/HumanSignal/label-studio-sdk/tree/master/src/label_studio_sdk/converter#tutorial-importing-yolo-pre-annotated-images-to-label-studio-using-local-storage
 
 # 2. Train model
 python train_model.py --epochs 100 --export

diff --git a/data/tibetan numbers/annotations/complete_layout/bg_PPN337138764X_00000005.csv b/data/tibetan numbers/annotations/complete_layout/bg_PPN337138764X_00000005.csv
@@ -0,0 +1,6 @@
+yolo_label,label_name,bbox_x,bbox_y,bbox_width,bbox_height,image_name,image_width,image_height
+0,tibetan_no,18,40,54,275,bg_PPN337138764X_00000005.png,1024,361
+1,illustration_left,73,40,211,276,bg_PPN337138764X_00000005.png,1024,361
+2,text_body,286,40,442,278,bg_PPN337138764X_00000005.png,1024,361
+3,illustration_right,731,40,224,277,bg_PPN337138764X_00000005.png,1024,361
+4,chinese_no,956,41,52,279,bg_PPN337138764X_00000005.png,1024,361
diff --git a/data/tibetan numbers/annotations/complete_layout/bg_PPN337138764X_00000005.txt b/data/tibetan numbers/annotations/complete_layout/bg_PPN337138764X_00000005.txt
@@ -0,0 +1,5 @@
+0 0.044237 0.491942 0.052760 0.762097
+1 0.174513 0.493093 0.206169 0.764399
+2 0.495130 0.496546 0.431818 0.771306
+3 0.823052 0.494244 0.219156 0.766701
+4 0.958604 0.500000 0.050325 0.773609
diff --git a/data/tibetan numbers/annotations/illustrations/bg_PPN337138764X_00000005.csv b/data/tibetan numbers/annotations/illustrations/bg_PPN337138764X_00000005.csv
@@ -0,0 +1,6 @@
+yolo_label,label_name,bbox_x,bbox_y,bbox_width,bbox_height,image_name,image_width,image_height
+0,tibetan_no,24,28,52,304,bg_PPN337138764X_00000005.png,1024,361
+1,illustration_left,80,28,263,303,bg_PPN337138764X_00000005.png,1024,361
+2,illustration_centered,348,28,313,304,bg_PPN337138764X_00000005.png,1024,361
+3,illustration_right,668,28,267,304,bg_PPN337138764X_00000005.png,1024,361
+4,chinese_no,940,28,57,305,bg_PPN337138764X_00000005.png,1024,361
diff --git a/data/tibetan numbers/annotations/illustrations/bg_PPN337138764X_00000005.txt b/data/tibetan numbers/annotations/illustrations/bg_PPN337138764X_00000005.txt
@@ -0,0 +1,5 @@
+0 0.049107 0.497698 0.051136 0.842681
+1 0.206575 0.496546 0.257305 0.840378
+2 0.493101 0.500000 0.306006 0.842681
+3 0.782873 0.497698 0.260552 0.842681
+4 0.945617 0.498849 0.055195 0.844983
diff --git a/data/tibetan numbers/annotations/tibetan_chinese_no/bg_PPN337138764X_00000005.csv b/data/tibetan numbers/annotations/tibetan_chinese_no/bg_PPN337138764X_00000005.csv
@@ -0,0 +1,4 @@
+yolo_label,label_name,bbox_x,bbox_y,bbox_width,bbox_height,image_name,image_width,image_height
+0,tibetan_no,14,46,113,283,bg_PPN337138764X_00000005.png,1024,361
+1,text_body,130,45,772,284,bg_PPN337138764X_00000005.png,1024,361
+2,chinese_no,906,45,107,284,bg_PPN337138764X_00000005.png,1024,361
diff --git a/data/tibetan numbers/annotations/tibetan_chinese_no/bg_PPN337138764X_00000005.txt b/data/tibetan numbers/annotations/tibetan_chinese_no/bg_PPN337138764X_00000005.txt
@@ -0,0 +1,3 @@
+0 0.068994 0.519570 0.110390 0.785121
+1 0.504464 0.518419 0.754058 0.787423
+2 0.937094 0.518419 0.104708 0.787423
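The `.csv` and `.txt` annotation files above describe the same boxes in two conventions. The following is a minimal sketch of the relationship, assuming the `.txt` lines are YOLO-normalized `class x_center y_center width height` values and the `.csv` columns hold the top-left corner plus size in pixels on a 1024x361 image. The helper name `yolo_to_pixel_box` is hypothetical and not part of the repository.

```python
def yolo_to_pixel_box(line, image_width, image_height):
    """Convert one YOLO label line to (class_id, left, top, width, height) in pixels."""
    class_id, x_center, y_center, width, height = line.split()
    box_w = float(width) * image_width
    box_h = float(height) * image_height
    # YOLO stores the box center; the CSV stores the top-left corner.
    left = float(x_center) * image_width - box_w / 2
    top = float(y_center) * image_height - box_h / 2
    return int(class_id), round(left), round(top), round(box_w), round(box_h)


# Label lines copied from tibetan_chinese_no/bg_PPN337138764X_00000005.txt
labels = [
    "0 0.068994 0.519570 0.110390 0.785121",
    "1 0.504464 0.518419 0.754058 0.787423",
    "2 0.937094 0.518419 0.104708 0.787423",
]

boxes = [yolo_to_pixel_box(line, 1024, 361) for line in labels]
for box in boxes:
    print(box)
```

Rounded to whole pixels, these values agree with the `bbox_x`, `bbox_y`, `bbox_width`, `bbox_height` columns of the matching `.csv`. The reverse direction (pixels to normalized values) only recovers the `.txt` values approximately, because the pixel values are rounded.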
diff --git a/data/tibetan numbers/backgrounds/bg_IMG_5086.jpg b/data/tibetan numbers/backgrounds/bg_IMG_5086.jpg
diff --git a/data/tibetan numbers/backgrounds/bg_PPN3371387534_00000007.jpg b/data/tibetan numbers/backgrounds/bg_PPN3371387534_00000007.jpg
diff --git a/data/tibetan numbers/backgrounds/bg_PPN337138764X_00000005.jpg b/data/tibetan numbers/backgrounds/bg_PPN337138764X_00000005.jpg
diff --git a/data/tibetan numbers/backgrounds/bg_PPN3371388603_00000004.jpg b/data/tibetan numbers/backgrounds/bg_PPN3371388603_00000004.jpg
diff --git a/data/tibetan numbers/backgrounds/bg_PPN3371389286_00000003.jpg b/data/tibetan numbers/backgrounds/bg_PPN3371389286_00000003.jpg
diff --git a/data/tibetan numbers/bg_train/Dalle_1.jpg b/data/tibetan numbers/bg_train/Dalle_1.jpg
diff --git a/data/tibetan numbers/bg_train/Dalle_2.jpg b/data/tibetan numbers/bg_train/Dalle_2.jpg
diff --git a/data/tibetan numbers/bg_val/Dalle2.jpg b/data/tibetan numbers/bg_val/Dalle2.jpg
diff --git a/data/tibetan numbers/bg_val/Dalle3.jpg b/data/tibetan numbers/bg_val/Dalle3.jpg
diff --git a/data/tibetan numbers/bg_val/Dalle4.jpg b/data/tibetan numbers/bg_val/Dalle4.jpg
diff --git a/data/tibetan numbers/buddha_illustrations/buddha_01.png b/data/tibetan numbers/buddha_illustrations/buddha_01.png
diff --git a/data/tibetan numbers/buddha_illustrations/buddha_02.png b/data/tibetan numbers/buddha_illustrations/buddha_02.png
diff --git a/data/tibetan numbers/buddha_illustrations/buddha_03.png b/data/tibetan numbers/buddha_illustrations/buddha_03.png
diff --git a/data/tibetan numbers/buddha_illustrations/buddha_05.png b/data/tibetan numbers/buddha_illustrations/buddha_05.png
diff --git a/data/tibetan numbers/buddha_illustrations/buddha_06.png b/data/tibetan numbers/buddha_illustrations/buddha_06.png
diff --git a/data/tibetan numbers/buddha_illustrations/buddha_07.png b/data/tibetan numbers/buddha_illustrations/buddha_07.png
diff --git a/data/tibetan numbers/corpora/tib_no_0001.txt b/data/tibetan numbers/corpora/tib_no_0001.txt
@@ -0,0 +1 @@
+གཅིག་
diff --git a/data/tibetan numbers/corpora/tib_no_0002.txt b/data/tibetan numbers/corpora/tib_no_0002.txt
@@ -0,0 +1 @@
+གཉིས
diff --git a/data/tibetan numbers/corpora/tib_no_0003.txt b/data/tibetan numbers/corpora/tib_no_0003.txt
@@ -0,0 +1 @@
+གསུམ
diff --git a/data/tibetan numbers/corpora/tib_no_0004.txt b/data/tibetan numbers/corpora/tib_no_0004.txt
@@ -0,0 +1 @@
+བཞི
diff --git a/data/tibetan numbers/corpora/tib_no_0005.txt b/data/tibetan numbers/corpora/tib_no_0005.txt
@@ -0,0 +1 @@
+ལྔ
diff --git a/data/tibetan numbers/corpora/tib_no_0006.txt b/data/tibetan numbers/corpora/tib_no_0006.txt
@@ -0,0 +1 @@
+དྲུག་
diff --git a/data/tibetan numbers/corpora/tib_no_0007.txt b/data/tibetan numbers/corpora/tib_no_0007.txt
@@ -0,0 +1 @@
+བདུན་
diff --git a/data/tibetan numbers/corpora/tib_no_0008.txt b/data/tibetan numbers/corpora/tib_no_0008.txt
@@ -0,0 +1 @@
+བརྒྱད
diff --git a/data/tibetan numbers/corpora/tib_no_0009.txt b/data/tibetan numbers/corpora/tib_no_0009.txt
@@ -0,0 +1 @@
+དགུ
diff --git a/data/tibetan numbers/corpora/tib_no_0010.txt b/data/tibetan numbers/corpora/tib_no_0010.txt
@@ -0,0 +1 @@
+བཅུ
diff --git a/generate_training_data.py b/generate_training_data.py
@@ -1,12 +1,20 @@
 #!/usr/bin/env python3
 """
-Skript zur Generierung von Trainingsdaten für Tibetische OCR.
-Erstellt synthetische Bilder mit Tibetischem Text für YOLO-Training.
+Script for generating multi-class training data for Tibetan OCR.
+Creates synthetic images with Tibetan text, Chinese numbers, and general text for YOLO training.
+
+Supports 3 classes:
+- Class 0: tibetan_number_word (Tibetan numbers)
+- Class 1: tibetan_text (general Tibetan text)
+- Class 2: chinese_number_word (Chinese numbers)
 """
 
 from pathlib import Path
 from collections import OrderedDict
-from ultralytics.data.utils import DATASETS_DIR
+try:
+    from ultralytics.data.utils import DATASETS_DIR
+except ImportError:
+    DATASETS_DIR = "./datasets"  # Fallback if ultralytics not installed
 from tibetanDataGenerator.dataset_generator import generate_dataset
 
 # Importiere Funktionen aus der tibetan_utils-Bibliothek
@@ -15,38 +23,56 @@
 
 
 def main():
-    # Parse arguments
+    # Parse arguments (multi-class support)
     parser = create_generate_dataset_parser()
     args = parser.parse_args()
 
     # Set dataset path
-    datasets_dir = Path(DATASETS_DIR)
-    path = str(datasets_dir / args.dataset_name)
-    args.dataset_name = path
-    print(f"Generiere YOLO-Datensatz {args.dataset_name}...")
-
-    # Generate training dataset
-    train_dataset_dict = generate_dataset(args, validation=False)
-
-    # Generate validation dataset
-    val_dataset_dict = generate_dataset(args, validation=True)
-
-    # Combine train and val dataset information
-    dataset_dict = {
-        'path': args.dataset_name,
-        'train': 'train/images',
-        'val': 'val/images',
-        'nc': train_dataset_dict['nc'],
-        'names': train_dataset_dict['names']
-    }
-
-    # Save dataset configuration
-    yaml_path = f"{args.dataset_name}/data.yml"
-    save_yaml(dataset_dict, yaml_path)
-
-    print("Datensatzgenerierung abgeschlossen.")
+    full_dataset_path = Path(args.output_dir) / args.dataset_name
+    original_dataset_name = args.dataset_name
+    args.dataset_name = str(full_dataset_path)
+
+    print(f"Generiere Multi-Klassen YOLO-Datensatz in {args.dataset_name}...")
+    print("Speicherort kann geändert werden per `yolo settings`.")
+    print("Unterstützte Klassen:")
+    print("  - Klasse 0: tibetan_number_word (Tibetische Zahlen)")
+    print("  - Klasse 1: tibetan_text (Allgemeiner tibetischer Text)")
+    print("  - Klasse 2: chinese_number_word (Chinesische Zahlen)")
+
+    # Generate training dataset (multi-class)
+    train_dataset_info = generate_dataset(args, validation=False)
+
+    # Generate validation dataset (multi-class)
+    val_dataset_info = generate_dataset(args, validation=True)
+
+    # Multi-class YAML configuration
+    yaml_content = OrderedDict()
+    yaml_content['path'] = original_dataset_name
+    yaml_content['train'] = 'train/images'
+    yaml_content['val'] = 'val/images'
+    yaml_content['test'] = ''
+
+    if 'nc' not in train_dataset_info or 'names' not in train_dataset_info:
+        raise ValueError("generate_dataset did not return 'nc' or 'names' in its info dictionary.")
+    yaml_content['nc'] = train_dataset_info['nc']
+    yaml_content['names'] = train_dataset_info['names']
+
+    # Save the YAML configuration next to the dataset directory
+    yaml_file_path = Path(args.output_dir) / f"{original_dataset_name}.yaml"
+
+    # Write the file with PyYAML directly, preserving key order
+    import yaml
+    def represent_ordereddict(dumper, data):
+        return dumper.represent_mapping('tag:yaml.org,2002:map', data.items())
+
+    yaml.add_representer(OrderedDict, represent_ordereddict)
+
+    with open(yaml_file_path, 'w', encoding='utf-8') as f:
+        yaml.dump(dict(yaml_content), f, sort_keys=False, allow_unicode=True)
+
+    print(f"\nMulti-Klassen-Datensatzgenerierung abgeschlossen. YAML-Konfiguration gespeichert: {yaml_file_path}")
     print("Training kann mit folgendem Befehl gestartet werden:\n")
-    print(f"yolo detect train data={yaml_path} epochs=100 imgsz=1024 model=yolov8n.pt")
+    print(f"yolo detect train data={yaml_file_path} epochs=100 imgsz=[{args.image_height},{args.image_width}] model=yolov8n.pt")
 
 
 if __name__ == "__main__":
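The configuration assembly in `main()` could be factored into a small, testable helper. This is only a sketch under assumptions: the function name `build_dataset_config` is hypothetical, and the class names in the example are placeholder values taken from the README's `classes.txt`, not necessarily what `generate_dataset` returns.

```python
from collections import OrderedDict


def build_dataset_config(dataset_name, train_info):
    """Assemble the YOLO dataset config in the key order used by main()."""
    missing = [key for key in ("nc", "names") if key not in train_info]
    if missing:
        raise ValueError(f"generate_dataset info is missing keys: {missing}")

    config = OrderedDict()
    config["path"] = dataset_name
    config["train"] = "train/images"
    config["val"] = "val/images"
    config["test"] = ""
    config["nc"] = train_info["nc"]
    config["names"] = train_info["names"]
    return config


# Example values only; the real info dictionary comes from generate_dataset().
example_info = {"nc": 3, "names": {0: "tibetan_no", 1: "text_body", 2: "chinese_no"}}
config = build_dataset_config("tibetan-yolo", example_info)
print(list(config))
```

Keeping the validation and key order in one place would let the file-writing step stay a plain `yaml.dump(dict(config), f, sort_keys=False)` call, as in the diff above.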

diff --git a/res/results_val_1.jpg b/res/results_val_1.jpg
diff --git a/res/results_val_1.png b/res/results_val_1.png
diff --git a/res/results_val_2.png b/res/results_val_2.png
diff --git a/tibetanDataGenerator/README.md b/tibetanDataGenerator/README.md
@@ -0,0 +1,70 @@
+# Tibetan Text Detection Dataset Generator
+
+A tool for generating synthetic YOLO-formatted datasets for detecting Tibetan text, numbers, and their Chinese number counterparts in document images.
+
+## Features
+- Generates synthetic document images with Tibetan text, numbers, and Chinese numbers
+- Creates corresponding YOLO-format annotations
+- Maintains consistent numbering between Tibetan and Chinese number representations
+- Supports multiple text corpora with intelligent text placement
+- Includes data augmentation options (rotation, noise)
+
+## New Options
+python main.py \
+  --corpora_tibetan_numbers_path ./data/corpora/Tibetan\ Number\ Words/ \
+  --corpora_tibetan_text_path ./data/corpora/UVA\ Tibetan\ Spoken\ Corpus/ \
+  --corpora_chinese_numbers_path ./data/corpora/Chinese\ Number\ Words/ \
+  --font_path_tibetan ./fonts/Microsoft\ Himalaya.ttf \
+  --font_path_chinese ./fonts/simkai.ttf \
+  --image_width 1024 \
+  --image_height 361 \
+  --annotations_file_path ./data/annotations/tibetan_chinese_no.txt
+
+## Example Usage
+python path/to/main.py \
+  --corpora_tibetan_numbers_path "path/to/data/corpora/Tibetan Number Words" \
+  --corpora_tibetan_text_path "path/to/data/corpora/UVA Tibetan Spoken Corpus" \
+  --corpora_chinese_numbers_path "path/to/data/corpora/Chinese Number Words" \
+  --background_train "path/to/data/background_images_train" \
+  --background_val "path/to/data/background_images_val" \
+  --annotations_file_path "path/to/data/annotations/tibetan_chinese_no/bg_example_0001.txt" \
+  --font_path_tibetan "path/to/fonts/Microsoft Himalaya.ttf" \
+  --font_path_chinese "path/to/fonts/simkai.ttf" \
+  --train_samples 2 \
+  --val_samples 2
+
+## List of altered scripts
+- main.py (to use it correctly, move the script to the [initial project directory](https://github.com/CodexAITeam/TibetanOCR/tree/synthetic_generation_tib_chi_no))
+- dataset_generator.py => altered to dataset_generator_tib_chi_no.py
+- text_renderer.py => altered to text_renderer_img_size.py
+
+## Script Details
+The script maps each corpus path passed to main.py to the bounding boxes with the corresponding ann_class_id (YOLO class ID), so that generate_dataset_tib_chi_no.py renders a different kind of text in each box.
+The ann_class_id values are parsed from a preconfigured annotation template named bg_PPN337138764X_00000005.txt, which is located in the Tibetan Layout Analyzer project. See our [Tibetan Numbers Dataset Folder](https://github.com/CodexAITeam/TibetanLayoutAnalyzer/tree/main/data/tibetan%20numbers) for sample files. The script also uses background images from that project at 1024x361 pixels, because this size reflects the format of the original historical data.
+The argparse input font_path_tibetan sets the font used to render the generated Tibetan text, while font_path_chinese sets the font for the Chinese text.
+
+The label mapping is as follows:
+
+| Class Name            | Corpus Path                     | Planned Label ID Range* | ann_class_id / YOLO Class ID |
+|-----------------------|---------------------------------|-------------------------|------------------------------|
+| Tibetan Number Words  | `corpora_tibetan_numbers_path`  | 000-009                 | 0                            |
+| Tibetan Text Body     | `corpora_tibetan_text_path`     | 101-110                 | 1                            |
+| Chinese Number Words  | `corpora_chinese_numbers_path`  | 201-210                 | 2                            |
+
+\* see Limitations
+
+The different text inputs are given by:
+- Tibetan Numbers: tib_no_0001.txt to tib_no_0010.txt: Randomly selected
+- Tibetan Text: uvrip*.txt: Randomly selected
+- Chinese Numbers: chi_no_0001.txt to chi_no_0010.txt: Selected to match the Tibetan number (for instance, chi_no_0001.txt is selected when tib_no_0001.txt is selected)  
+See our [Corpora Folder](https://github.com/CodexAITeam/TibetanOCR/tree/synthetic_generation_tib_chi_no/data/corpora) for sample files.
+
+## Generated synthetic image sample
+- generated_sample.png
+
+## Limitations and Outline for future development
+- Label_dict does not yet produce the Planned Label ID Ranges correctly, because so far it only uses the labels of the Tibetan number files.
+- Augmentations are still very limited and will be expanded.
+
+## License
+This project is licensed under the MIT License - see the [LICENSE](https://github.com/CodexAITeam/TibetanOCR/blob/synthetic_generation_tib_chi_no/LICENSE) file for details.
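The pairing rule described in the tibetanDataGenerator README (a randomly chosen `tib_no_XXXX.txt` determines the matching `chi_no_XXXX.txt`) can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the function names `paired_chinese_name` and `pick_number_pair` are hypothetical, and the sketch assumes both corpora use the same four-digit suffix.

```python
import random
from pathlib import Path

TIBETAN_PREFIX = "tib_no_"
CHINESE_PREFIX = "chi_no_"


def paired_chinese_name(tibetan_name):
    """Map a Tibetan number file name to its Chinese counterpart by shared suffix."""
    if not tibetan_name.startswith(TIBETAN_PREFIX):
        raise ValueError(f"unexpected Tibetan number file name: {tibetan_name}")
    return CHINESE_PREFIX + tibetan_name[len(TIBETAN_PREFIX):]


def pick_number_pair(tibetan_dir, chinese_dir, rng=random):
    """Pick a random Tibetan number file and return it with its paired Chinese file path."""
    tibetan_files = sorted(Path(tibetan_dir).glob(f"{TIBETAN_PREFIX}*.txt"))
    if not tibetan_files:
        raise FileNotFoundError(f"no {TIBETAN_PREFIX}*.txt files in {tibetan_dir}")
    tibetan_file = rng.choice(tibetan_files)
    chinese_file = Path(chinese_dir) / paired_chinese_name(tibetan_file.name)
    return tibetan_file, chinese_file


print(paired_chinese_name("tib_no_0001.txt"))
```

Deriving the Chinese file from the Tibetan file name, rather than drawing it independently, is what keeps the two numbers on a generated page consistent.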