KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents

🎉 Excited to share that our paper, "KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents", has been accepted to the FinTech in AI CUP Special Session @ NTCIR-18 Conference!


Abstract

[Figure: KAP Overview]
Hybrid Retrieval systems, combining Sparse and Dense Retrieval methods, struggle with Traditional Chinese non-narrative documents due to their complex formatting, rich vocabulary, and the insufficient understanding of Chinese synonyms by common embedding models. Previous approaches inadequately address the dual needs of these systems, focusing mainly on general text quality improvement rather than optimizing for retrieval. We propose Knowledge-Aware Preprocessing (KAP), a novel framework that transforms noisy OCR outputs into retrieval-optimized text. KAP adopts a two-stage approach: it first extracts text using OCR, then employs Multimodal Large Language Models to refine the output by integrating visual information from the original documents. This design reduces OCR noise, reconstructs structural elements, and formats the text to satisfy the distinct requirements of sparse and dense retrieval. Empirical results demonstrate that KAP consistently and significantly outperforms conventional preprocessing approaches.

Installation

1. Clone the Repository

Clone the repository and navigate into the project directory:

git clone https://github.com/JustinHsu1019/KAP.git
cd KAP

2. Set Up a Virtual Environment

Create and activate a virtual environment:

python3 -m venv kap_venv
source kap_venv/bin/activate

3. Install Dependencies

Install all required dependencies:

pip install -r requirements.txt

Additionally, install OCR and Docker-related dependencies:

./exp_src/preprocess/ocr/tessocr.sh
./exp_src/docker/docker_install.sh

4. Configure API Keys

Copy the example configuration file and set up your API keys:

cp config.ini config_real.ini

Edit config_real.ini and add your API keys manually.
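For reference, a filled-in config_real.ini might look like the following. The section and key names here are assumptions for illustration; match them to whatever keys appear in the repository's config.ini template:

```ini
; Illustrative only -- use the section/key names from the provided config.ini
[OpenAI]
api_key = sk-your-key-here

[Weaviate]
host = http://localhost:8080
```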

5. Set Up the Database (Weaviate via Docker)

Navigate to the docker directory:

cd exp_src/docker

Modify the docker-compose.yml file:

  • Replace the following line with your actual OpenAI API Key:
    OPENAI_APIKEY: ${OPENAI_APIKEY}

Start the Weaviate database using Docker Compose:

docker-compose up -d
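For orientation, a minimal Weaviate docker-compose.yml with the OpenAI vectorizer module enabled typically looks like the sketch below. The image tag and exact environment variables are assumptions; the repository's exp_src/docker/docker-compose.yml is authoritative:

```yaml
# Minimal Weaviate sketch (illustrative; defer to the repo's compose file)
services:
  weaviate:
    image: semitechnologies/weaviate:1.24.1
    ports:
      - "8080:8080"
    environment:
      ENABLE_MODULES: text2vec-openai
      DEFAULT_VECTORIZER_MODULE: text2vec-openai
      OPENAI_APIKEY: ${OPENAI_APIKEY}
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
```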

6. Obtain and Prepare the Dataset

  • The dataset used in this study is privately provided by E.SUN Bank. You must obtain authorization from E.SUN Bank to access the dataset.
  • If you want to reproduce our methodology, you can use any other dataset with a large number of tabular Chinese PDFs.
  • Once obtained, place the dataset in the data/ directory.

Running the Full Experimental Pipeline

1. Data Augmentation

Generate augmented validation sets for evaluation:

python3 exp_src/auto_runall_pipeline/question_augment.py

2. Convert PDFs to Images

Convert all PDFs into images for downstream processing:

python3 exp_src/convert_pdfs_to_images.py

3. Generate OCR Text

Extract OCR text using the baseline Tesseract OCR:

python3 exp_src/rewrite.py --task Tess

4. Generate Preprocessed Text for All Settings

Run all text preprocessing pipelines, including the ablation settings and our proposed KAP framework:

python3 exp_src/auto_runall_pipeline/run_all_rewrite.py

5. Perform Text Embedding and Store in the Vector Database

Convert the processed text into vector representations and store them in the Weaviate vector database. This step includes:

  • Text embedding using OpenAI’s text-embedding-3-large model (for dense retrieval)
  • Tokenization using Jieba (for BM25 retrieval)
  • Storing the processed embeddings in the vector database

Run the following command to execute the full pipeline:

python3 exp_src/auto_runall_pipeline/run_all_db_insert.py
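On the sparse side, the Jieba tokens are scored with BM25 inside the vector database. As a self-contained illustration of that scoring (not the repository's actual code, which delegates tokenization to Jieba and indexing to Weaviate), here is standard BM25 over pre-tokenized documents:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each term across the corpus
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(s)
    return scores
```

In the actual pipeline, `docs_tokens` would be the Jieba-segmented Chinese text produced by the preprocessing step.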

6. Run Hybrid Retrieval Experiments

Execute retrieval experiments using pure sparse retrieval, dense retrieval, and hybrid retrieval:

python3 exp_src/auto_runall_pipeline/run_all_hybrid.py
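Weaviate performs the score fusion internally, but conceptually hybrid retrieval combines the sparse and dense rankings. A common sketch (an assumption, not necessarily the fusion used in the experiments) is a convex combination of min-max-normalized scores, where `alpha` weights sparse against dense:

```python
def hybrid_fuse(sparse_scores, dense_scores, alpha=0.5):
    """Fuse sparse (BM25) and dense (embedding) scores per document."""
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        if hi == lo:
            return [0.0 for _ in xs]
        return [(x - lo) / (hi - lo) for x in xs]

    s, d = minmax(sparse_scores), minmax(dense_scores)
    # alpha=1.0 is pure sparse retrieval, alpha=0.0 is pure dense retrieval
    return [alpha * a + (1 - alpha) * b for a, b in zip(s, d)]
```

Setting `alpha` to 1.0 or 0.0 recovers the pure sparse and pure dense baselines run by the script.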

7. Reproduce Experimental Results

To validate stability, the experiments reported in the paper were repeated three times. You may repeat steps 1-6 multiple times to reproduce and verify the results.

Additional Information

Prompt Engineering for Post-OCR Correction

The core of our approach is MLLM-assisted Post-OCR enhancement.
To view or modify the prompts used for this step, navigate to:

cd exp_src/preprocess/ocr/

This directory contains all ablation experiments and our framework's prompt designs.
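To give a sense of the shape of such a prompt, here is a purely hypothetical post-OCR correction prompt template; the actual prompts and their ablation variants live in exp_src/preprocess/ocr/ and differ per setting:

```python
# Hypothetical illustration only -- not the prompt used in the paper.
POST_OCR_PROMPT = (
    "You are given a page image and its noisy OCR transcript.\n"
    "Correct recognition errors, restore table structure,\n"
    "and output text suitable for both keyword and semantic retrieval.\n\n"
    "OCR transcript:\n{ocr_text}\n"
)

def build_prompt(ocr_text: str) -> str:
    """Fill the OCR transcript into the prompt template."""
    return POST_OCR_PROMPT.format(ocr_text=ocr_text)
```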

Acknowledgement

This study was supported by E.SUN Bank, which provided the dataset from the "AI CUP 2024 E.SUN Artificial Intelligence Open Competition." We sincerely appreciate E.SUN Bank for its generous data support, which has been invaluable to this research.

Citation

If you find the provided code or our paper useful, please cite our work.

@misc{hsu2025kapmllmassistedocrtext,
      title={KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents}, 
      author={Hsin-Ling Hsu and Ping-Sheng Lin and Jing-Di Lin and Jengnan Tzeng},
      year={2025},
      eprint={2503.08452},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2503.08452}, 
}
