An end-to-end OCR pipeline for invoice document processing. The recognition module (CRNN with MobileNetV3 + BiLSTM + CTC) is fine-tuned on a corpus of 1,413 annotated invoices. A working prototype is exposed via FastAPI and Streamlit, packaged with Docker.
- Overview
- Problem statement
- What the project does
- OCR pipeline
- Dataset
- Preprocessing
- Detection model
- Recognition model
- Model comparison
- Evaluation metrics
- Key results
- API and prototype
- Repository structure
- How to run
- Limitations
- Future improvements
- Tech stack
- License
- Author
Promy covers the full OCR chain for invoice images: image preprocessing, text detection, character recognition, evaluation, and a deployable prototype. It is structured to go beyond a notebook experiment, with a clear separation between research notebooks, training artifacts, and a self-contained deployment package.
Extracting text from invoice images is not straightforward. Invoices vary in layout, font, resolution, and scanning quality. Off-the-shelf OCR models are not always adapted to this type of document. Promy addresses this by fine-tuning a lightweight OCR recognition model specifically on invoice data, within a complete and reproducible pipeline.
The pipeline receives a JPG or PNG image (up to 10 MB) and returns a structured output containing:
- recognized text, line by line
- per-line confidence scores
- preprocessing metadata (deskew angle, original size, processed size)
Output is available as JSON or CSV.
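As an illustration of how the JSON output can be flattened to CSV, here is a minimal sketch. It assumes only the `lines` and `confidences` fields listed for the `/ocr` response; the column names and the `result_to_csv` helper are hypothetical and are not taken from the project's export code.

```python
import csv
import io


def result_to_csv(result):
    """Flatten an OCR result dict into CSV text, one row per recognized line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical column names, chosen for this example only.
    writer.writerow(["line_index", "text", "confidence"])
    for index, (text, confidence) in enumerate(
        zip(result["lines"], result["confidences"])
    ):
        writer.writerow([index, text, f"{confidence:.4f}"])
    return buffer.getvalue()


example = {
    "lines": ["INVOICE #123", "Total, due: 42.00"],
    "confidences": [0.99, 0.875],
}
csv_text = result_to_csv(example)
print(csv_text)
```

The `csv` module quotes any line that contains a comma, which matters for invoice text such as amounts or addresses.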
Raw image (JPG/PNG)
|
v
+---------------------------+
| Preprocessing | LAB grayscale, CLAHE, deskew, denoise, resize
| deployment/preprocessing |
+---------------------------+
|
v
+---------------------------+
| Detection: RapidOCR | bounding boxes (DBNet ONNX)
| (not fine-tuned) |
+---------------------------+
|
v (line crops)
+---------------------------+
| Recognition: PaddleOCR | CRNN: MobileNetV3 + BiLSTM + CTC
| CRNN, fine-tuned | approx. 8M parameters
| deployment/models/rec_infer
+---------------------------+
|
v
Structured output (JSON / CSV)
- lines
- confidences
- preprocessing metadata
Only the recognition module (REC) is fine-tuned. Text detection is handled by RapidOCR without retraining. This is a deliberate scope boundary, documented in the notebooks.
High Quality Invoice Images for OCR (Kaggle, Osama Hosam Abdellatif)
- 1,413 annotated invoices (batch_1, used for training)
- 300 unannotated invoices (batch_2, used for qualitative validation)
- 2 additional out-of-corpus invoices for A/B tests
Dataset link: https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr
The dataset is not redistributed in this repository.
The preprocessing module (deployment/preprocessing.py, also present in notebooks/preprocessing.py) applies the following steps in sequence:
- Grayscale conversion via LAB color space
- CLAHE contrast enhancement
- Skew correction (deskew)
- Light denoising
- Resolution normalization
This module is shared between the notebook environment and the deployed API.
Text detection uses RapidOCR with DBNet in ONNX format. It localizes text regions on the full invoice image and produces bounding boxes, which are cropped and passed to the recognition module.
RapidOCR is used as-is, without fine-tuning. A benchmark of detection alternatives is documented in NB_DET_Benchmark.ipynb.
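The hand-off from detection to recognition can be sketched as below. The example assumes each detected box is a list of four `[x, y]` corner points, which is a common convention for DBNet-style detectors but is not confirmed by this repository. The `crop_lines` helper is hypothetical and uses an axis-aligned crop, which ignores any rotation of the box.

```python
import numpy as np


def crop_lines(image, boxes):
    """Cut one axis-aligned crop per quadrilateral box, clamped to the image."""
    height, width = image.shape[:2]
    crops = []
    for box in boxes:
        points = np.asarray(box, dtype=float)  # shape (4, 2): x, y corners
        x0 = max(int(np.floor(points[:, 0].min())), 0)
        y0 = max(int(np.floor(points[:, 1].min())), 0)
        x1 = min(int(np.ceil(points[:, 0].max())), width)
        y1 = min(int(np.ceil(points[:, 1].max())), height)
        # Skip degenerate boxes that fall entirely outside the image.
        if x1 > x0 and y1 > y0:
            crops.append(image[y0:y1, x0:x1])
    return crops


page = np.zeros((50, 100), dtype=np.uint8)
detected = [
    [[10, 5], [60, 5], [60, 20], [10, 20]],
    [[-5, 40], [30, 40], [30, 60], [-5, 60]],  # partly outside the page
]
line_crops = crop_lines(page, detected)
```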
The recognition model is the PaddleOCR CRNN architecture:
- Backbone: MobileNetV3
- Sequence modeling: BiLSTM
- Decoder: CTC
- Approximate size: 8M parameters
It is fine-tuned on invoice crops generated by pseudo-labelling from batch_1 annotations. Training used a 75/25 anti-leakage split by invoice to avoid data contamination. Training ran for 40 epochs; the best checkpoint was selected at epoch 34 based on val_norm_edit_dis.
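The idea behind a split by invoice is that all crops from one invoice land on the same side, so the validation set never contains lines from a document seen during training. A minimal sketch follows, assuming each crop record carries an `invoice_id` key; the record layout, the seed, and the `split_by_invoice` helper are hypothetical and not taken from NB3.

```python
import random


def split_by_invoice(crops, val_ratio=0.25, seed=42):
    """Split crop records into train/val so that no invoice appears in both."""
    invoice_ids = sorted({crop["invoice_id"] for crop in crops})
    rng = random.Random(seed)
    rng.shuffle(invoice_ids)

    # The ratio applies to invoices, not to individual crops.
    n_val = max(1, round(len(invoice_ids) * val_ratio))
    val_ids = set(invoice_ids[:n_val])

    train = [crop for crop in crops if crop["invoice_id"] not in val_ids]
    val = [crop for crop in crops if crop["invoice_id"] in val_ids]
    return train, val


records = [
    {"invoice_id": f"inv_{i}", "crop": f"inv_{i}_line_{j}.png"}
    for i in range(8)
    for j in range(3)
]
train_set, val_set = split_by_invoice(records)
```

Because the ratio is applied to invoices, the crop-level proportions can drift from 75/25 when invoices contain different numbers of lines.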
The notebook NB_Comparatif.ipynb documents a quantitative comparison between:
- TrOCR (Microsoft, Transformer-based)
- PaddleOCR CRNN (fine-tuned on invoices)
PaddleOCR CRNN was selected for its lower inference latency, smaller model footprint, and better fit for a prototype deployment context. The TrOCR experiment is preserved in NB_experiment_TrOCR.ipynb.
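Fine-tuning uses norm_edit_dis as the training metric. Despite its name, PaddleOCR reports this value as 1 minus the normalized edit distance (higher is better), which makes it roughly equivalent to 1 - CER at the character level. Epoch-by-epoch metrics are logged in: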
- workspace_paddleocr_invoice/runs/metrics/rec_epoch_metrics.csv
- workspace_paddleocr_invoice/runs/metrics/rec_epoch_metrics.png
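To make the link between an edit-distance score and CER concrete, here is a small pure-Python sketch. It is an illustrative re-implementation, not PaddleOCR's code: it assumes each pair is normalized by the longer of the two strings and that the score is 1 minus the mean normalized distance. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def norm_edit_score(predictions, targets):
    """1 - mean normalized edit distance over all (prediction, target) pairs."""
    if not predictions:
        return 0.0
    total = 0.0
    for predicted, target in zip(predictions, targets):
        longest = max(len(predicted), len(target))
        total += levenshtein(predicted, target) / longest if longest else 0.0
    return 1.0 - total / len(predictions)


perfect = norm_edit_score(["INVOICE"], ["INVOICE"])
one_error = norm_edit_score(["abcd"], ["abcf"])
```

CER is usually normalized by the reference length rather than the longer string, which is one reason a score of this kind is best read as a CER proxy.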
Results from the internal validation set (75/25 anti-leakage split by invoice):
- CER proxy: 0.19% on the validation set
- Inference latency: 3.4 ms per crop at batch size 12 on an L4 GPU
These numbers come from a controlled benchmark on a held-out part of the training corpus, so they describe in-distribution performance. Results on out-of-corpus or significantly different invoice formats may vary.
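A per-crop latency figure of this kind is typically obtained by timing whole batches and dividing by the number of crops. The sketch below shows that arithmetic with a stand-in inference function. The `per_crop_latency_ms` helper and the warm-up count are assumptions, and this is not the script that produced `latency_benchmark.json`.

```python
import time


def per_crop_latency_ms(run_batch, crops, batch_size=12, warmup=1):
    """Average wall-clock time per crop, in milliseconds, over batched calls."""
    if not crops:
        raise ValueError("crops must not be empty")
    batches = [
        crops[start:start + batch_size]
        for start in range(0, len(crops), batch_size)
    ]

    # Warm-up calls are excluded from the timing.
    for batch in batches[:warmup]:
        run_batch(batch)

    started = time.perf_counter()
    for batch in batches:
        run_batch(batch)
    elapsed = time.perf_counter() - started
    return 1000.0 * elapsed / len(crops)


seen_batch_sizes = []
latency = per_crop_latency_ms(
    lambda batch: seen_batch_sizes.append(len(batch)),  # stand-in for inference
    list(range(30)),
    batch_size=12,
    warmup=1,
)
```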
The deployment package exposes:
FastAPI (port 8000):
- GET /health - service health check
- POST /ocr - multipart file upload, returns {lines, confidences, mean_confidence, n_segments, preprocessing}
- Swagger docs: http://localhost:8000/docs
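A client call to `POST /ocr` could look like the sketch below, which uses only the Python standard library. The multipart field name `file` is an assumption (check the Swagger docs for the actual name), and both helper functions are hypothetical.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field_name, filename, payload, boundary=None):
    """Encode one file as a multipart/form-data body; returns (body, header)."""
    boundary = boundary or uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_invoice(image_path, url="http://localhost:8000/ocr", field_name="file"):
    """Upload an invoice image and return the decoded JSON response."""
    path = Path(image_path)
    body, content_type = build_multipart(field_name, path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the Docker services running and a local image available:
# result = post_invoice("invoice.png")
# print(result["mean_confidence"], result["n_segments"])
```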
Streamlit (port 8501): a web interface to upload an invoice, adjust the confidence threshold, visualize the output table, and download the CSV.
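The confidence threshold amounts to filtering lines by their per-line score. A minimal sketch follows, assuming the response fields listed for `/ocr` and assuming `n_segments` counts the returned lines. The `filter_by_confidence` helper is hypothetical and is not the Streamlit app's code.

```python
def filter_by_confidence(result, threshold=0.5):
    """Keep only the lines whose confidence is at or above the threshold."""
    kept = [
        (text, confidence)
        for text, confidence in zip(result["lines"], result["confidences"])
        if confidence >= threshold
    ]
    lines = [text for text, _ in kept]
    confidences = [confidence for _, confidence in kept]
    mean_confidence = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "lines": lines,
        "confidences": confidences,
        "mean_confidence": mean_confidence,
        "n_segments": len(lines),
    }


raw = {
    "lines": ["INVOICE", "???", "Total 42.00"],
    "confidences": [0.98, 0.31, 0.91],
}
filtered = filter_by_confidence(raw, threshold=0.5)
```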
Both services are packaged together with Docker Compose.
Promy/
├── deployment/ # Docker prototype (API + frontend)
│ ├── api/ # FastAPI routes (/ocr, /health)
│ ├── front/ # Streamlit app
│ ├── models/rec_infer/ # Fine-tuned CRNN model (inference)
│ ├── preprocessing.py
│ ├── tests/ # pytest tests (API + vendor)
│ ├── Dockerfile
│ ├── docker-compose.yml
│ └── pyproject.toml
│
├── notebooks/ # Research and training notebooks
│ ├── NB1_EDA.ipynb
│ ├── NB2_Preprocessing.ipynb
│ ├── NB3_Fine-tuning_DETRapidOCR_RECPaddleOCR.ipynb
│ ├── NB_Comparatif.ipynb # TrOCR vs PaddleOCR comparison
│ ├── NB_DET_Benchmark.ipynb # Detection benchmark
│ ├── NB_experiment_TrOCR.ipynb # TrOCR experiment archive
│ ├── preprocessing.py
│ └── outputs/
│
├── models/ # Final model and metrics
│ └── PaddleOCR_Invoice_v2/
│ ├── rec_infer/
│ ├── latency_benchmark.json
│ └── README.md
│
├── workspace_paddleocr_invoice/ # Training artifacts (see local README)
│ ├── export/rec_infer/
│ ├── runs/
│ │ ├── rec/
│ │ │ ├── config.yml # Fine-tuning configuration
│ │ │ └── train.log # Full training log (40 epochs)
│ │ └── metrics/
│ ├── prepared_data/
│ ├── testsAB_outputs/
│ └── README.md
│
├── Promy_raw/ # Raw data (see local README)
│ └── datasets/
│
├── .gitignore
├── pyproject.toml
├── uv.lock
├── README.md
└── README.fr.md
For the Docker prototype:
- Docker 24+ and Docker Compose v2
For local notebook execution:
- Python 3.12
- uv for environment management
- Optional: NVIDIA GPU with CUDA for retraining
- Download from Kaggle: https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr
- Extract to:
Promy_raw/datasets/High-Quality Invoice Images for OCR/
├── batch_1/
│ ├── *.csv
│ └── images...
└── batch_2/
 └── images...
- Alternative: in NB3, set ALLOW_KAGGLEHUB_FALLBACK = True (cell 3) to download via kagglehub (requires a configured Kaggle API key).
The two out-of-corpus test images used for A/B tests are already present in Promy_raw/datasets/.
cd deployment
docker compose up -d --build
Once running:
- Streamlit UI: http://localhost:8501
- FastAPI docs: http://localhost:8000/docs
The notebooks are primarily written in French because they document the project methodology in detail. Each notebook includes an English summary at the top to make the workflow understandable for non-French readers.
Recommended reading order:
- NB1_EDA.ipynb - dataset exploration, annotation inventory, biases
- NB2_Preprocessing.ipynb - preprocessing pipeline and design choices
- NB3_Fine-tuning_DETRapidOCR_RECPaddleOCR.ipynb - pseudo-labelling, anti-leakage split, fine-tuning, export, A/B tests
- NB_Comparatif.ipynb - quantitative TrOCR vs PaddleOCR comparison
- NB_DET_Benchmark.ipynb - detection benchmark
- NB_experiment_TrOCR.ipynb - TrOCR experiment archive (narrative)
- Clone PaddleOCR into the workspace:
cd workspace_paddleocr_invoice
git clone https://github.com/PaddlePaddle/PaddleOCR.git .
- Download the pretrained weights referenced in runs/rec/config.yml (section Global.pretrained_model).
- In NB3, set:
  - FORCE_REBUILD_PREPARED_DATA = True (cell 5) to regenerate pseudo-labels and crops
  - RUN_REC_TRAINING = True (cell 5) to start training
- Checkpoints will be written to runs/rec/ and the best model exported to export/rec_infer/.
See workspace_paddleocr_invoice/README.md for details.
- Detection is not fine-tuned. RapidOCR is used as-is. Performance on atypical invoice layouts depends on the DBNet pretrained model.
- Word spacing. The CRNN model can miss spaces between words in some configurations.
- French-language coverage. The base model and training corpus are English-dominant. Performance on French invoices is not fully characterized.
- No structured field extraction. The pipeline outputs raw text lines. It does not extract fields such as amounts, dates, or vendor names.
- Template diversity. Results may degrade on invoice formats significantly different from the training corpus.
- Prototype scope. The Docker deployment is a demonstrator, not a production-ready system.
- Fine-tune the detection module on invoice-specific layouts
- Add a structured field extraction layer (KIE)
- Expand French-language coverage in the training corpus
- Benchmark on a broader range of invoice templates
- Add a CI pipeline for automated regression testing
| Component | Technology |
|---|---|
| Language | Python 3.12 |
| OCR recognition | PaddleOCR (CRNN fine-tuning) |
| OCR detection | RapidOCR (DBNet ONNX) |
| Image preprocessing | OpenCV, NumPy, Pillow |
| Model experiments | PyTorch, Hugging Face Transformers (TrOCR) |
| API | FastAPI |
| Frontend | Streamlit |
| Deployment | Docker, Docker Compose |
| Environment | uv |
- Code: MIT, unless otherwise noted in source files.
- Dataset: the Kaggle dataset license applies. The dataset is not redistributed here.
- PaddleOCR: Apache 2.0 - https://github.com/PaddlePaddle/PaddleOCR
- RapidOCR: Apache 2.0 - https://github.com/RapidAI/RapidOCR
- Out-of-corpus test images in Promy_raw/datasets/: anonymized internal scans, for educational use only.
Valentin Valluet
- GitHub: github.com/V-Vaal
- LinkedIn: linkedin.com/in/valentin-valluet
- X: @val2_x