📝 Optical Character Recognition (OCR) System

Tesseract OCR + OpenCV + IAM Handwritten Dataset

This project implements an Optical Character Recognition (OCR) system that extracts printed and handwritten text from images. It uses OpenCV for image preprocessing and Tesseract OCR (LSTM engine) for text recognition. The system is tested on the IAM Handwritten Dataset, a widely used dataset for handwriting research.

🚀 Features

Extracts text from printed and handwritten images
Uses a full OpenCV preprocessing pipeline for improved accuracy
Utilizes Tesseract OCR (LSTM neural engine) for recognition
Automatically downloads the dataset using KaggleHub
Works with common image formats (PNG, JPG), including scanned documents
Displays original and processed images for comparison
Suitable for real-world OCR tasks like document scanning and handwriting digitization

📁 Dataset Used: IAM Handwritten Forms

The project uses the IAM Handwriting Dataset, which contains thousands of real handwritten English text samples. This dataset is widely used in academic and industrial handwriting recognition research due to its quality and variety.

🧠 How It Works

1️⃣ Dataset Loading

The IAM dataset is automatically pulled from Kaggle using KaggleHub, providing seamless access to handwriting samples.
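
A minimal sketch of this step, assuming the `kagglehub` package is installed and Kaggle access is configured. This README does not state which Kaggle dataset handle the notebook uses, so the handle is left as a placeholder. `download_dataset` and `find_images` are hypothetical helper names, not functions from the project.

```python
from pathlib import Path

# File extensions treated as images (an assumption for this sketch).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def download_dataset(handle: str) -> Path:
    """Download a Kaggle dataset and return the local directory it was saved to."""
    # Imported lazily so find_images() below works without kagglehub installed.
    import kagglehub

    return Path(kagglehub.dataset_download(handle))


def find_images(root: Path) -> list[Path]:
    """Recursively collect image files under root, sorted for reproducibility."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Example usage (placeholder handle, replace with the dataset you use):
# dataset_dir = download_dataset("<owner>/<dataset-name>")
# image_paths = find_images(dataset_dir)
```

KaggleHub caches downloads locally, so repeated runs should reuse the existing copy rather than fetch the dataset again.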

2️⃣ Image Preprocessing (OpenCV)

To improve recognition accuracy, images undergo:

Grayscale conversion
Noise removal
Blurring
Thresholding (Otsu or adaptive)
Contrast enhancement

These steps create cleaner, OCR-ready images.

3️⃣ Text Extraction (Tesseract OCR)

The processed images are passed to the Tesseract OCR engine, configured to read lines of printed or handwritten text. Tesseract's LSTM-based model improves the recognition of handwritten characters.
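
A minimal sketch of the recognition call, assuming `pytesseract` is installed and the Tesseract binary is on the PATH. The README does not list the exact settings, so the page segmentation mode and language below are assumptions. `build_config` and `extract_text` are hypothetical helper names.

```python
def build_config(psm: int = 6, oem: int = 1) -> str:
    """Build a Tesseract config string.

    --oem 1 selects the LSTM engine. --psm 6 treats the image as a single
    uniform block of text, and --psm 7 treats it as a single text line.
    """
    return f"--oem {oem} --psm {psm}"


def extract_text(image, lang: str = "eng", psm: int = 6) -> str:
    """Run Tesseract on a preprocessed image and return the stripped text."""
    # Imported lazily: pytesseract also requires the Tesseract binary at run time.
    import pytesseract

    text = pytesseract.image_to_string(image, lang=lang, config=build_config(psm))
    return text.strip()


# Example usage with a preprocessed image (NumPy array or PIL image):
# text = extract_text(binary, psm=7)  # psm 7 for a single cropped line
```

For images cropped to one line of handwriting, `--psm 7` is a reasonable starting point; full-page forms usually work better with a block or page mode.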

4️⃣ Visualization

The system displays:

The original input image
The preprocessed image
The final extracted text

This helps users compare and understand the OCR pipeline.
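
One way to produce such a comparison with Matplotlib is sketched below. The layout, titles, and the function name `show_comparison` are assumptions for illustration, not the notebook's actual plotting code.

```python
import matplotlib.pyplot as plt
import numpy as np


def show_comparison(original: np.ndarray, processed: np.ndarray, text: str):
    """Plot the original and preprocessed images side by side, titled with the OCR text."""
    fig, (ax_orig, ax_proc) = plt.subplots(1, 2, figsize=(12, 5))

    # OpenCV loads color images as BGR; reverse the channel axis to get RGB.
    rgb = original[..., ::-1] if original.ndim == 3 else original
    ax_orig.imshow(rgb, cmap="gray")
    ax_orig.set_title("Original")

    ax_proc.imshow(processed, cmap="gray")
    ax_proc.set_title("Preprocessed")

    # Hide axis ticks, which carry no meaning for images.
    for ax in (ax_orig, ax_proc):
        ax.axis("off")

    fig.suptitle(f"Extracted text: {text}")
    return fig


# Example usage:
# show_comparison(image, binary, text)
# plt.show()
```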

📊 Results

The OCR system provides clear and readable text output for:

IAM handwritten samples
Scanned documents
Printed text

Accuracy varies depending on handwriting clarity, but preprocessing greatly improves recognition quality.
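
This README reports no numeric results. One common way to quantify OCR accuracy against a ground-truth transcription is the character error rate (CER), sketched below in plain Python. It is offered as a suggestion under that assumption, not as the evaluation the project performs, and both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between ground truth and OCR output, divided by the ground-truth length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Example: one wrong character out of five gives a CER of 0.2.
# character_error_rate("hello", "hallo")
```

Comparing CER on the same images with and without preprocessing would turn the qualitative claim above into a measurable one.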

🛠 Technologies Used

OpenCV – Image preprocessing
Tesseract OCR – Text recognition engine
pytesseract – Tesseract interface
KaggleHub – Dataset download
Matplotlib – Visualization
Python – Implementation

🎯 Applications

Handwriting digitization
Document scanning systems
Automated form processing
Archiving handwritten notes
Real-time OCR systems
Text extraction for AI NLP pipelines

🤝 Contributing

Contributions are welcome! Feel free to submit issues, improvements, or pull requests.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.
