OCR with Tesseract 4.0

Tesseract 4.0 is still in beta, but its new LSTM-based recognition engine makes it far superior to Tesseract 3.0.

The first step is to build Tesseract 4.0 on the machine. Instructions for doing this on Ubuntu are here. Issue the build commands from a terminal.

Pre-processing

It is not clear to me that DT-OCR is pre-processing images. However, my experience is that this is a critical step. The clearer the image and the better the resolution, the better the OCR will be.

For pre-processing images we use two packages: OpenCV and ImageMagick.

OpenCV releases are available here. I used these instructions to install/update OpenCV. It appears that ImageMagick can be installed using apt-get.
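Before going further, it can help to confirm that the tools are actually visible from Python. The following is a minimal sketch, not part of image_utils: the function name check_environment is hypothetical, and it assumes the Tesseract and ImageMagick binaries are on the PATH and that OpenCV's Python bindings are importable as cv2.

```python
import importlib.util
import shutil


def check_environment():
    """Report which of the tools used in these notes can be found."""
    return {
        # Tesseract command-line binary (built in the first step).
        "tesseract": shutil.which("tesseract") is not None,
        # ImageMagick 6 installs `convert`; ImageMagick 7 installs `magick`.
        "imagemagick": any(
            shutil.which(name) is not None for name in ("magick", "convert")
        ),
        # OpenCV's Python bindings are imported as `cv2`.
        "opencv": importlib.util.find_spec("cv2") is not None,
    }


if __name__ == "__main__":
    for tool, found in check_environment().items():
        print(f"{tool}: {'found' if found else 'missing'}")
```

A missing entry only means the tool was not found on this machine's search paths; it does not say anything about which version is installed.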

The utility image_utils wraps image identification and manipulation capabilities from OpenCV, ImageMagick, and ghostview. Its functions are written so that they are easy to call without having to worry about which package does the work.

First, get the image. These operations can also be applied to multi-page TIFFs, but for simplicity's sake we are only going to operate on one page.

File Retrieval and Prep

The following function is a general method that retrieves image files for processing and prepares them by splitting and preprocessing. One thing to check at this stage is the resolution: guides for Tesseract recommend that images be at least 300 DPI.
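The retrieval function itself is not reproduced in these notes, so the sketch below only illustrates the first part of the idea: collecting candidate image files from a directory. The name find_image_files and the extension list are assumptions made for illustration; splitting multi-page files and the preprocessing itself are not covered.

```python
from pathlib import Path

# Assumed set of extensions; adjust to the document types actually in use.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def find_image_files(directory, extensions=IMAGE_EXTENSIONS):
    """Return the image files directly inside `directory`, sorted by path."""
    root = Path(directory)
    return sorted(
        path
        for path in root.iterdir()
        # Compare suffixes case-insensitively so "SCAN.TIF" is also found.
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Each returned path could then be handed to whatever splitting and preprocessing steps are needed for that file type.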

Image Orientation

Tesseract has an orientation and script detection (OSD) mode that discovers the image orientation. We may want to sharpen the image first, but for now, let's get the orientation of the image.
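One way to get the orientation is to run the Tesseract command line in OSD-only mode and parse its report. This is a sketch under assumptions, not necessarily how image_utils does it: it assumes the tesseract binary is on the PATH, that osd.traineddata is installed, and that the report is a series of "Key: value" lines including a "Rotate" entry. The function names are hypothetical.

```python
import subprocess


def run_osd(image_path):
    """Run Tesseract in OSD-only mode (--psm 0) and return its text report.

    Assumes `tesseract` is on the PATH and osd.traineddata is installed.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "--psm", "0"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_osd(report):
    """Turn 'Key: value' lines from an OSD report into a dictionary."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            fields[key.strip()] = value.strip()
    return fields


def rotation_needed(report):
    """Return the value of the 'Rotate' field in degrees, or 0 if absent."""
    return int(parse_osd(report).get("Rotate", 0))
```

With a report in hand, rotation_needed(run_osd(current_fname)) would give the number of degrees to rotate before running OCR. The orientation confidence field in the same report is worth checking before acting on the result.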

We can threshold the image to just black and white. However, I'm not sure we want to do this, because we may want to keep colors to pick up notary stamps. The other issue with thresholding is that for signatures we may lose information that is important in wavelet transformations (in case we do signature comparison using wavelets).
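To make the trade-off concrete, here is a small pure-Python sketch of thresholding using Otsu's method, which picks the cut-off that best separates dark and light pixels. It is an illustration only, operating on a grayscale image given as nested lists of 8-bit values. In practice a library routine would be used instead, for example OpenCV's cv2.threshold with the cv2.THRESH_OTSU flag. The function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return Otsu's threshold for an iterable of 8-bit gray values (0-255)."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        # Background class: all pixels with value <= t.
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(rows, threshold):
    """Map every pixel above the threshold to white (255), the rest to black (0)."""
    return [[255 if p > threshold else 0 for p in row] for row in rows]


if __name__ == "__main__":
    gray = [[10, 12, 200], [11, 210, 205]]
    t = otsu_threshold(p for row in gray for p in row)
    print(t, binarize(gray, t))
```

Note what is lost: after binarize every pixel is either 0 or 255, so stamp colors and the gray-level detail in signatures are gone, which is exactly the concern raised above.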

Get the shape now. We should see just one channel if we performed thresholding; otherwise we will have three channels, indicating a color document (channels are for RGB color values per pixel).
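As a sketch of how the shape can be read: an image loaded with OpenCV is a NumPy array whose shape is (height, width) for a single-channel image and (height, width, channels) for a color image. Note that OpenCV's imread stores the color channels in BGR order. The helper below is hypothetical and works on the shape tuple alone, so it does not need an image to try out.

```python
def describe_shape(shape):
    """Interpret an image array shape as height, width, and channel count."""
    if len(shape) == 2:
        # Single-channel (grayscale or thresholded) image.
        height, width = shape
        channels = 1
    elif len(shape) == 3:
        height, width, channels = shape
    else:
        raise ValueError(f"unexpected image shape: {shape!r}")
    return {
        "height": height,
        "width": width,
        "channels": channels,
        "is_color": channels >= 3,
    }
```

For an array img this would be called as describe_shape(img.shape).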

Now, make sure we have at least 300 DPI.
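If the image is below 300 DPI, one option is to upscale it. The arithmetic is a single scale factor, the target DPI divided by the current DPI, applied to both dimensions. The sketch below assumes the current DPI is already known (for example from the file metadata); the function name is hypothetical.

```python
def size_for_min_dpi(width_px, height_px, dpi, min_dpi=300):
    """Return (new_width, new_height, scale) needed to reach at least min_dpi."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    if dpi >= min_dpi:
        # Already at or above the target: leave the image as it is.
        return width_px, height_px, 1.0
    scale = min_dpi / dpi
    return round(width_px * scale), round(height_px * scale), scale


if __name__ == "__main__":
    # A US Letter page scanned at 100 DPI is 850 x 1100 pixels.
    new_w, new_h, scale = size_for_min_dpi(850, 1100, 100)
    print(new_w, new_h, scale)
    # The resize itself could then be done with, for example:
    #   cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
```

Upscaling interpolates between existing pixels and does not recover detail that was never scanned, so rescanning at a higher resolution is preferable when it is possible.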

Print the shape again.

The image to work on is always the one named by current_fname.

Display the image as a JPEG.

Does it make a difference?

Out of curiosity, let's compare the text obtained from OCR of the original document with the text from the enhanced 300 DPI document. Multiple documents would have to be run to make this a real test, but for now we do it on this document only.
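A simple way to compare the two OCR outputs is with Python's standard difflib module. This is a sketch of one possible comparison, not a validated accuracy metric: it reports a word-level similarity ratio and the lines that differ. The function names and the sample strings are made up for illustration.

```python
import difflib


def similarity(text_a, text_b):
    """Word-level similarity between two texts, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def changed_lines(text_a, text_b):
    """Return the lines removed from text_a ('-') and added in text_b ('+')."""
    diff = difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile="original",
        tofile="enhanced",
        lineterm="",
    )
    return [
        line
        for line in diff
        # Keep only changed lines, skipping the '---' and '+++' file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


if __name__ == "__main__":
    original = "Notary Pub1ic\nState of Ohio"
    enhanced = "Notary Public\nState of Ohio"
    print(similarity(original, enhanced))
    print(changed_lines(original, enhanced))
```

A ratio of 1.0 means the two outputs are identical word for word. A lower ratio shows that the outputs differ but not which one is more correct, so a ground-truth transcription is still needed to judge accuracy.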

There may not be any difference on the document being processed. A study over a sample of documents is needed to see whether correcting documents is helpful.