It is not clear to me whether DT-OCR pre-processes images. In my experience, however, this is a critical step: the clearer the image and the higher its resolution, the better the OCR output will be.
For pre-processing images we use two packages: OpenCV and ImageMagick.
The image_utils utility supports image identification and manipulation capabilities from OpenCV, ImageMagick, and ghostview. Its functions are written so that they are easy to call without having to worry about which underlying package does the work.
First, get the image. These options can also be applied to multi-page TIFFs, but for simplicity we operate on only one page here.
The following function is a general method for getting image files for processing; it prepares them by splitting and pre-processing. What is the resolution? Guides for Tesseract recommend that images be at least 300 DPI.
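One way to answer the resolution question is to read the DPI recorded in the file's metadata. The sketch below assumes Pillow is installed and that the file actually stores a DPI value (many scans do not). The names `read_dpi` and `meets_minimum_dpi` are hypothetical helpers for illustration, not part of image_utils.

```python
MIN_DPI = 300  # minimum suggested by Tesseract guides, as noted in the text


def read_dpi(path):
    """Return the (x, y) DPI stored in the file metadata, or None if absent."""
    # Pillow is third-party; importing it lazily keeps the pure helper
    # below usable even where Pillow is not installed.
    from PIL import Image

    with Image.open(path) as img:
        dpi = img.info.get("dpi")
    if dpi is None:
        return None
    return float(dpi[0]), float(dpi[1])


def meets_minimum_dpi(dpi, minimum=MIN_DPI):
    """True only when the DPI is known and both axes reach `minimum`."""
    if dpi is None:
        return False
    return min(dpi) >= minimum
```

A missing DPI is treated as failing the check, on the assumption that an unknown resolution should be inspected by hand rather than trusted.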
Tesseract has an option to discover the image orientation. We may want to sharpen the image first, but for now, let's get the orientation of the image.
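As a rough sketch of the orientation step, the code below asks Tesseract for its orientation and script detection (OSD) report through `pytesseract.image_to_osd` and pulls out the `Rotate` field, which we take to be the correction angle in degrees. It assumes pytesseract, Pillow, and the Tesseract OSD data are installed; `parse_osd`, `rotation_from_osd`, and `detect_rotation` are hypothetical names, not image_utils functions.

```python
def parse_osd(osd_text):
    """Turn Tesseract's 'Key: value' OSD report into a dict of strings."""
    info = {}
    for line in osd_text.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        key, value = line.split(":", 1)
        info[key.strip()] = value.strip()
    return info


def rotation_from_osd(osd_text):
    """Return the 'Rotate' value from an OSD report as an integer."""
    return int(parse_osd(osd_text)["Rotate"])


def detect_rotation(path):
    """Run Tesseract OSD on an image file and return the rotation angle."""
    # Third-party imports are kept local so the parsers above have no
    # dependencies beyond the standard library.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return rotation_from_osd(pytesseract.image_to_osd(img))
```

OSD can be unreliable on pages with little text, so it is probably worth checking the `Orientation confidence` entry in the parsed dict before rotating anything.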
We can threshold the image to just black and white. However, I'm not sure we want to do this, because we may want to keep colors in order to pick up notary stamps. The other issue is that thresholding may discard information in signatures that matters for wavelet transformations (in case we do signature comparison using wavelets).
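To make concrete what thresholding does (and what it throws away), here is a minimal plain-Python sketch of Otsu's method on a flat list of 0-255 gray values. This is an illustration only, not the routine image_utils uses; with OpenCV the same idea is available through `cv2.threshold` with the `cv2.THRESH_OTSU` flag. The function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of integer gray values 0-255."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is background; nothing left to split
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each gray value to 0 (at or below threshold) or 255 (above it)."""
    return [255 if p > threshold else 0 for p in pixels]
```

Every pixel collapses to one of two values, which is why colored stamps and the fine gradations in a signature would not survive this step.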
Get the shape now. We should just see one channel if we performed thresholding, else we will have three channels indicating a color document (channels are for RGB color values per pixel).
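The shape check can be sketched as a small helper that interprets a NumPy-style shape tuple. It assumes the image was loaded as an array, as OpenCV does, where a single-channel image has shape `(height, width)` and a color image has shape `(height, width, 3)`. Note that OpenCV itself orders color channels as BGR rather than RGB. `describe_shape` is a hypothetical name.

```python
def describe_shape(shape):
    """Interpret an image array shape as height, width, and channel count."""
    if len(shape) == 2:
        # Single-channel arrays usually carry no explicit channel axis.
        height, width = shape
        channels = 1
    elif len(shape) == 3:
        height, width, channels = shape
    else:
        raise ValueError(f"unexpected image shape: {shape!r}")
    kind = "grayscale or thresholded" if channels == 1 else "color"
    return {"height": height, "width": width, "channels": channels, "kind": kind}
```

For an array `img`, the call would be `describe_shape(img.shape)`.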
Now, make sure we have at least 300 DPI.
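A minimal sketch of bringing an image up to 300 DPI follows, assuming the current DPI is already known and that the image is an OpenCV array. The resampling uses `cv2.resize` with cubic interpolation, which is one reasonable choice and not necessarily what image_utils does. Upscaling cannot add detail that was never scanned; it only gives Tesseract larger glyphs to work with. All function names here are hypothetical.

```python
TARGET_DPI = 300


def upscale_factor(current_dpi, target_dpi=TARGET_DPI):
    """Scale factor needed to reach target_dpi; never shrinks the image."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def target_size(width, height, current_dpi, target_dpi=TARGET_DPI):
    """New (width, height) in pixels after scaling up to target_dpi."""
    factor = upscale_factor(current_dpi, target_dpi)
    return round(width * factor), round(height * factor)


def upscale_to_dpi(image, current_dpi, target_dpi=TARGET_DPI):
    """Resize an OpenCV image array so it corresponds to target_dpi."""
    import cv2  # third-party; imported lazily

    height, width = image.shape[:2]
    new_size = target_size(width, height, current_dpi, target_dpi)
    if new_size == (width, height):
        return image  # already at or above the target resolution
    # cv2.resize expects the size as (width, height).
    return cv2.resize(image, new_size, interpolation=cv2.INTER_CUBIC)
```

For example, a 1700 x 2200 pixel page scanned at 200 DPI would be resized to 2550 x 3300 pixels.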
Print the shape again.
The image to work on is always the one referenced by current_fname.
Display the image as a JPEG.
Out of curiosity, let's compare the text obtained from OCR of the original document with that of the enhanced 300 DPI document. A real test would require running multiple documents, but for now we do it on this document only.
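One simple way to compare the two OCR outputs is with the standard library's `difflib`: a similarity ratio gives a single number, and a unified diff shows which lines changed. This is a sketch under the assumption that both texts are already available as strings. The function names are hypothetical, and a similarity score says only how different the texts are, not which one is more correct.

```python
import difflib


def normalize(text):
    """Collapse runs of whitespace so layout differences do not dominate."""
    return " ".join(text.split())


def text_similarity(original_text, enhanced_text):
    """Similarity ratio in [0, 1] between the two OCR outputs."""
    matcher = difflib.SequenceMatcher(
        None, normalize(original_text), normalize(enhanced_text)
    )
    return matcher.ratio()


def changed_lines(original_text, enhanced_text):
    """Lines removed from ('-') or added to ('+') the original OCR text."""
    diff = list(
        difflib.unified_diff(
            original_text.splitlines(),
            enhanced_text.splitlines(),
            fromfile="original",
            tofile="enhanced",
            lineterm="",
        )
    )
    # The first two entries are the '---' and '+++' file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Judging which output is actually better still requires a trusted transcription of the page to compare against.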
There may be no difference for the document being processed. A study over a sample of documents is needed to see whether correcting documents is helpful.