This repository was archived by the owner on Oct 21, 2024. It is now read-only.

Normalization

Have a look at the Jupyter notebook, which contains helper functions for preprocessing scanned documents before performing knowledge extraction with Forms Understanding or OCR.

The included functionality:

  • Descriptive statistics on the scanned document
  • Normalization
  • Conversion into grayscale
  • Binarization
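
The steps listed above can be illustrated with a minimal, dependency-free sketch. This is not the notebook's code: the function names (`to_grayscale`, `describe`, `normalize`, `otsu_threshold`, `binarize`) are hypothetical, the image is assumed to be a list of rows of `(r, g, b)` tuples, and Otsu's method is one common choice of binarization threshold rather than necessarily the one the notebook uses. In practice you would likely reach for NumPy or OpenCV instead of nested lists.

```python
from statistics import mean, pstdev


def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples into rows of 0-255 luminance values."""
    # ITU-R BT.601 luma weights.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def describe(gray):
    """Descriptive statistics over all pixels of a grayscale image."""
    pixels = [p for row in gray for p in row]
    return {"min": min(pixels), "max": max(pixels),
            "mean": mean(pixels), "std": pstdev(pixels)}


def normalize(gray):
    """Stretch intensities linearly so the darkest pixel is 0 and the brightest 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in gray]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in gray]


def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(gray, threshold):
    """Map pixels above the threshold to 255 (paper) and the rest to 0 (ink)."""
    return [[255 if p > threshold else 0 for p in row] for row in gray]


# Tiny synthetic "scan": grey paper with one row of dark ink.
paper, ink = (200, 200, 200), (20, 20, 20)
scan = [[paper] * 4, [ink] * 4, [paper] * 4]

gray = normalize(to_grayscale(scan))
stats = describe(gray)
binary = binarize(gray, otsu_threshold(gray))
```

Running the statistics first is a cheap way to spot scans with low contrast, which is where the normalization step helps the most before thresholding.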

Removing boxes around text

The form_boxes.py module contains methods for handling forms with boxes to retrieve the individual characters. This is useful when you have a form where the handwriting overlaps the box surrounding it, causing the OCR to misread the characters. For a related technique, see the accelerator Projection to correct image skew and identify text lines.

The included functionality:

  • Form alignment based on the orientation of the boxes
  • Background cleaning
  • Conversion into grayscale
  • Field detection and outlining
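
To make the idea of removing boxes concrete, here is a small sketch of one possible approach: erase any horizontal or vertical run of ink that is too long to be part of a handwritten character. It does not reproduce the form_boxes.py API. The names `long_run_indices` and `remove_boxes`, the `min_run` parameter, and the convention that ink is 0 and paper is 255 are all assumptions made for illustration, and the sketch assumes the form has already been aligned and binarized.

```python
def long_run_indices(line, min_run, ink=0):
    """Return the positions in `line` that belong to ink runs of at least `min_run` pixels."""
    indices = []
    start = None
    # The trailing None is a sentinel that closes a run reaching the end of the line.
    for i, pixel in enumerate(list(line) + [None]):
        if pixel == ink:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                indices.extend(range(start, i))
            start = None
    return indices


def remove_boxes(binary, min_run, ink=0, paper=255):
    """Erase long straight horizontal and vertical ink runs from a binary image."""
    cleaned = [row[:] for row in binary]
    # Horizontal box edges: scan every row.
    for y, row in enumerate(binary):
        for x in long_run_indices(row, min_run, ink):
            cleaned[y][x] = paper
    # Vertical box edges: scan every column of the original image.
    for x, column in enumerate(zip(*binary)):
        for y in long_run_indices(column, min_run, ink):
            cleaned[y][x] = paper
    return cleaned


# A 7x7 field: a box drawn on the border with a short stroke inside it.
size = 7
field = [[0 if x in (0, size - 1) or y in (0, size - 1) else 255
          for x in range(size)] for y in range(size)]
field[3][3] = field[3][4] = 0  # two-pixel "handwriting" stroke

cleaned = remove_boxes(field, min_run=5)
```

The threshold `min_run` needs to be longer than the longest straight stroke you expect in the handwriting and shorter than a box edge. Where a stroke crosses a box edge, the shared pixels are erased together with the edge, so a real pipeline would typically repair those gaps afterwards.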
