This project implements a complete, high-performance Optical Character Recognition (OCR) system built from scratch in PyTorch. The primary goal is to accurately digitize text from scanned PDF documents.
The project documents a full engineering journey, starting with a classic but flawed architecture and culminating in a CRNN (Convolutional Recurrent Neural Network) model. The final model is trained exclusively on a "ground truth" dataset generated directly from the source PDFs, which lets it achieve high accuracy on the specific fonts and noise profiles of those documents.
Our first approach was a classic OCR pipeline with a Triage + Expert Model architecture:
- Segmentation: Use OpenCV (cv2.findContours) to draw a bounding box around every individual character on the page.
- Triage Model: A CNN designed to classify each character box as a digit, uppercase, or lowercase letter.
- Expert Models: Three separate CNNs, each specialized in recognizing characters within its assigned class.
Why it Failed: This architecture proved to be fundamentally flawed for real-world scanned documents.
- Segmentation Failure: The system was unable to correctly segment characters that were touching due to font kerning, ligatures (fi, fl), or scanner noise. It often identified whole words or multiple characters as a single, unrecognizable "blob."
- Data Mismatch: The models, trained on perfectly isolated characters, produced garbage output when fed these malformed, multi-character blobs.
The result was incoherent, unusable text, proving that a pre-segmentation step is too fragile for this task.
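The segmentation failure can be reproduced on a toy example. The sketch below is a plain-Python stand-in for contour detection (it is not cv2.findContours itself, and `count_blobs` is a hypothetical helper): it counts connected groups of ink pixels and shows that a single bridging pixel is enough to merge two glyphs into one blob.

```python
from collections import deque

def count_blobs(grid):
    """Count 8-connected groups of '#' pixels in a list of equal-length strings."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            # New unvisited ink pixel: flood-fill everything connected to it.
            blobs += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return blobs

# Two glyphs separated by a blank column.
separated = [
    "##.##",
    "#..#.",
    "##.##",
]
# The same glyphs with one bridging pixel (kerning, a ligature, or scanner noise).
touching = [
    "##.##",
    "####.",
    "##.##",
]

print(count_blobs(separated), count_blobs(touching))
```

Any classifier trained on isolated characters receives the merged blob as a single input, which is the data mismatch described above.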
We pivoted to the industry-standard architecture for OCR: a Convolutional Recurrent Neural Network (CRNN). This approach solves the fundamental flaws of the previous method.
How it Works:
- Input: The model processes an entire line of text as a single image, completely bypassing the need for fragile single-character segmentation.
- CNN Backbone (The "Eyes"): A deep convolutional network scans the line image from left to right, extracting a sequence of rich feature vectors.
- RNN Processor (The "Brain"): A bi-directional LSTM network reads this sequence of features, using the order and context to understand how features form characters and words. This is how it naturally handles touching and connected letters.
- CTC Loss (The "Translator"): The model is trained with a Connectionist Temporal Classification (CTC) loss function. This powerful algorithm allows the model to learn how to align its sequence of predictions with the ground-truth text label, without needing to be told where each character is.
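To make the alignment idea concrete, here is a minimal sketch of greedy CTC decoding: the model emits one class per time step, repeated classes are collapsed, and blanks are dropped. The alphabet, the frame sequence, and the helper name `ctc_greedy_decode` are illustrative assumptions; blank index 0 matches the default of nn.CTCLoss, but whether this project uses that index is not stated.

```python
BLANK = 0  # assumed blank index (the nn.CTCLoss default)

def ctc_greedy_decode(frame_ids, alphabet):
    """Collapse consecutive repeats, then remove blanks."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the class changes and is not blank.
        if idx != previous and idx != BLANK:
            chars.append(alphabet[idx])
        previous = idx
    return "".join(chars)

# Hypothetical alphabet; index 0 is reserved for the blank symbol.
alphabet = ["<blank>", "a", "l", "o", "h", "e"]

# Per-frame argmax of the model output: h h _ e _ l l _ l o o
frames = [4, 4, 0, 5, 0, 2, 2, 0, 2, 3, 3]

decoded = ctc_greedy_decode(frames, alphabet)
print(decoded)  # hello
```

The blank between the two runs of `l` is what allows a doubled letter to survive the collapse step.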
To ensure the highest accuracy, we abandoned purely synthetic data. The create_real_dataset.py script implements a "ground truth" pipeline:
- Rich Text Extraction: It uses PyMuPDF to extract every word from the source PDFs along with its precise (x, y) coordinates on the page.
- Line Image Extraction: It uses OpenCV to find the bounding boxes of text lines on the scanned page image.
- Automatic Alignment: It matches the words-with-coordinates to the line-image-boxes, automatically generating a perfectly labeled ground-truth pair of (real_line_image, "correct_line_text").
- HDF5 Storage: This final, high-quality dataset is stored in a single, efficient real_line_dataset.h5 file for fast training.
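The alignment step could look roughly like the sketch below. It is not the code in create_real_dataset.py: `align_words_to_lines` is a hypothetical helper, the word tuples mirror the first five fields of PyMuPDF's "words" output (x0, y0, x1, y1, text), the line boxes use OpenCV's (x, y, w, h) layout, and both are assumed to already be scaled into the same pixel coordinate space.

```python
def align_words_to_lines(words, line_boxes):
    """Assign each word to the line box containing its center point.

    words:      list of (x0, y0, x1, y1, text)
    line_boxes: list of (x, y, w, h)
    Returns a list of (box, line_text) for every box that received words.
    """
    buckets = {i: [] for i in range(len(line_boxes))}
    for x0, y0, x1, y1, text in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for i, (bx, by, bw, bh) in enumerate(line_boxes):
            if bx <= cx <= bx + bw and by <= cy <= by + bh:
                buckets[i].append((x0, text))
                break  # a word belongs to at most one line
    pairs = []
    for i, box in enumerate(line_boxes):
        if buckets[i]:
            # Sort by left edge to restore reading order within the line.
            line_text = " ".join(text for _, text in sorted(buckets[i]))
            pairs.append((box, line_text))
    return pairs

# Made-up coordinates for illustration only.
words = [
    (120, 52, 180, 68, "world"),
    (40, 50, 110, 68, "Hello"),
    (40, 90, 95, 108, "again"),
]
line_boxes = [(30, 45, 200, 30), (30, 85, 200, 30)]

pairs = align_words_to_lines(words, line_boxes)
for box, text in pairs:
    print(box, text)
```

Each resulting box would then be cropped from the page image to form the image half of the training pair.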
The train_real_data.py script trains the CRNN model from scratch on our custom ground-truth dataset.
Training Details:
- Architecture: Deep CRNN with Batch Normalization.
- Loss Function: nn.CTCLoss.
- Optimizer: Adam.
- Scheduler: StepLR to manage the learning rate.
- Validation: After each epoch, the script performs a full OCR on a real PDF page to provide a true, real-world benchmark of the model's progress.
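The pieces listed above fit together roughly as in the following sketch of a single training step. It is not the model in train_real_data.py: `TinyCRNN`, the layer sizes, the 32-pixel input height, the learning rate, and the StepLR settings are all assumptions chosen to keep the example small, and random tensors stand in for real line images and labels.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Illustrative CRNN: conv features -> bidirectional LSTM -> per-frame scores."""

    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),          # height / 2, width / 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),     # height / 4, width unchanged
        )
        # Assumes input height 32, so the feature map height is 8.
        self.rnn = nn.LSTM(64 * 8, hidden, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                      # x: (N, 1, 32, W)
        f = self.cnn(x)                        # (N, 64, 8, W/2)
        n, c, h, w = f.shape
        # Treat each image column as one time step.
        f = f.permute(3, 0, 1, 2).reshape(w, n, c * h)   # (T, N, C*H)
        out, _ = self.rnn(f)                   # (T, N, 2*hidden)
        return self.fc(out)                    # (T, N, num_classes)

torch.manual_seed(0)
num_classes = 11  # hypothetical: 10 characters plus the blank at index 0
model = TinyCRNN(num_classes)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

images = torch.randn(4, 1, 32, 64)                # random stand-in for line images
targets = torch.randint(1, num_classes, (4, 5))   # labels never use the blank index
target_lengths = torch.full((4,), 5, dtype=torch.long)

logits = model(images)                            # (T=32, N=4, C=11)
log_probs = logits.log_softmax(2)                 # CTCLoss expects log-probabilities
input_lengths = torch.full((4,), logits.size(0), dtype=torch.long)
loss = criterion(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # in a real loop this is usually called once per epoch
print(tuple(logits.shape), float(loss))
```

CTC requires the number of time steps to be at least the label length, which is why the sketch pools the width only once.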
Install the required libraries:
pip install -r requirements.txt
Place your source PDF files in the sample_documents/books/ directory. Then, run the data creation script. This only needs to be done once.
python create_real_dataset.py --clean
Run the training script. This will process the real_line_dataset.h5 file and save the final trained model to models/crnn_final/.
python train_real_data.py
Use the final application script to perform OCR on any page of a PDF.
Example:
python run_crnn_ocr.py "sample_documents/books/Applied-Machine-Learning-and-AI-for-Engineers.pdf" --page 2