OCR software in Java

The amount of data processed by machines today is immense. It is estimated that 2.5 quintillion bytes of data are created every day: movies, music, books, documents and more. While much of this data is generated directly on a computer, we still sometimes have to extract information from analogue media. Even once such media are scanned, editing the result is not always easy, because a scanned document is just an image. In this case OCR (optical character recognition) software can help us greatly, and with a little effort we can create such software ourselves.

 
Unfortunately, there is no good free library for text recognition written in Java. However, this is no reason to give up on our venture, because there are interesting solutions in other languages. One of the best may be the Tesseract library, written in C++.
 
Tesseract is a free library for text recognition. Its development started in 1985 and work on the library continues to this day. Since 2005 it has been available under an open source license and can be downloaded from GitHub at https://github.com/tesseract-ocr. Currently, the default version of Tesseract is trained on more than 400,000 lines of text and recognizes about 4,500 fonts, making it very effective at recognizing languages written in the Latin alphabet. In addition, if we work with non-standard texts, we can train Tesseract ourselves. Although it is a library written in C++, there is a free Java wrapper for it based on JNA, called Tess4J (available at http://tess4j.sourceforge.net). We will use this wrapper in our program.
 
To create simple OCR software, download and install the Tesseract library; a Windows installer is available at https://github.com/UB-Mannheim/tesseract/wiki. Then let's create a new project and add these two dependencies to Maven:
 

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.3.1</version>
</dependency>

<dependency>
    <groupId>org.im4java</groupId>
    <artifactId>im4java</artifactId>
    <version>1.4.0</version>
</dependency>  

The source code is as follows:

import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class App
{
    public static void main( String[] args )
    {
        // Scanned image we want to read
        File imageFile = new File("C:\\Users\\Sariel\\Desktop\\moby.png");

        ITesseract instance = new Tesseract();  // JNA Interface Mapping
        // Folder with the language data of the installed Tesseract
        instance.setDatapath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");

        try {
            // Recognize the text in the image and print it
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
            System.out.println("reading failed");
        }
    }
}

The code is relatively simple. We open the picture we are interested in, create a new instance of Tesseract, point it to the tessdata folder of the installed library, and then run text recognition using the doOCR() method.
 
For testing purposes I used this excerpt from “Moby Dick”:
[Image: scanned page from “Moby Dick”]

We should get this result:
 

men swung in the howlines; still wordless Ahab stood up to the blast.
Even when Wearied nature seemed demanding repose he would not
seek that repose in his hammock. Never could Starbuck forget the
old man's aspect, when one night going down into the cabin to mark
how the barometer stood, he saw him with closed eyes sitting straight
in his floor‘screwed chair; the rain and hnlflmelted sleet of the storm
from which he had some time before emerged, still slowly dripping
from the unremoved hat and coat. On the table beside him lay un'
rolled one of those charts of tides and currents which have previously
been spoken of. His lantern swung from his tightly clenched hand
Though the body was erect, the head was thrown back so that the
closed eyes were pointed towards the needle of the tell'tale that
swung from a beam in the ceiling.’

Terrible old man! thought Starbuckwith a. shudder, sleepingin this
gale, still thou steadfastly eyest thy purposes
CHAPTER 52 THE ALBATROSS

OUTH'EASTWARD from the Cape, ed" the distant Crozetts,

a goodcruising ground for Right Whalemen,asail loomedahead,

S the Goney (Albatross) by name. As she slowly drew nigh, from

my lofty perch at the fore'mast—head, I had a good view of that sight

so remarkable to a tyre in the far ocean fisheries—awhaler at sea, and
long absent from home.

As if the waves had been fullers, this craft was bleached like the
skeleton of a stranded walrus. All down her sides, this spectral ap»
pearance was traced with long channels of reddened rust, While all
her spars and her rigging were like the thick branches of trees furred
over with hoar'frost. Only her lower sails were set. A wild sight it
was to see her longfbearded lookouts at those three mast—heads. They
seemed clad in the skins of beasts,so torn and hepatched the raiment
that had survived nearly four years of cruising Standing in iron

‘The cabin—compass is called the tell'ule, because without going to the campus at the
helm,the Captain, while below, can inform himselfof the course ofthe ship.' 

If we want, we can also use the command line. We put our scanned document in the Tesseract folder, navigate to this folder in the command line, and run this command:
 

tesseract moby.png resultText 

where moby.png is the path of the file being read and resultText is the name of the output file, which is saved in txt format.
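
The same command can also be started from Java without any wrapper. The sketch below is only an illustration and not part of the program described here: the class name TesseractCli and its methods are made up for this example, and it assumes that the tesseract executable can be found on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch: running the tesseract command-line tool from Java (hypothetical helper). */
final class TesseractCli {

    /** Builds the command "tesseract <image> <outputBase>". */
    static List<String> buildCommand(String imagePath, String outputBase) {
        return Arrays.asList("tesseract", imagePath, outputBase);
    }

    /** Starts tesseract, waits for it to finish and returns its exit code. */
    static int run(String imagePath, String outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, outputBase))
                .inheritIO() // show tesseract's own messages in our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: TesseractCli <image> <outputBase>");
            return;
        }
        int exitCode = run(args[0], args[1]);
        System.out.println("tesseract finished with exit code " + exitCode);
    }
}
```

Keeping the command in a separate method makes it easy to add further arguments later without touching the code that starts the process.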
 
In fact, this is all we need for our simple OCR software. However, our scans or source images will often not be so easy for Tesseract to recognize, which reduces its accuracy. Fortunately, there are many things we can do to improve it.
 
To improve the text recognition accuracy of Tesseract we can:
 
– binarize the image
– increase the image resolution and text size
– deskew / rotate the image
– get rid of the border
– get rid of the noise
 
To make these changes to the document we can use any graphics library, or even Photoshop or GIMP, though that would probably require manual editing of each scan. Personally I would recommend ImageMagick, and I will use it to show how we can make some adjustments to the image.
 
ImageMagick is available for download at:
https://www.imagemagick.org/script/download.php
 
We install the software with all add-ons, such as “install legacy utilities” and “install development headers and libraries for C and C++”. Then we add ImageMagick to the environment variables:
 

Variable name: MAGICK_HOME
Variable value: C:\Program Files\ImageMagick-7.0.5-Q16

We can check whether the installation was successful by running this command in the command line:
 

convert -version 

We should get the ImageMagick version information.

 
With ImageMagick installed we can prepare our image for text recognition:
 
1) Binarization of the image, that is, transforming the image into black and white. This can be achieved with a simple command:
 

convert book.png -monochrome book2.png

book.png – source file
-monochrome – option that converts the image to black and white
book2.png – resulting image

[Image: alice, in color]

[Image: alice, in black and white]

The best result is clearly readable black text on a white background. However, the image often needs further processing.
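
To make the idea of binarization concrete, here is a minimal sketch in plain Java. It is not what -monochrome does internally: it simply compares the average brightness of each pixel with a fixed threshold. The class name Binarizer and the threshold of 128 are assumptions chosen for this example.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Sketch: fixed-threshold binarization (hypothetical helper). */
final class Binarizer {

    /** Returns a copy in which every pixel is either pure black or pure white. */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int w = source.getWidth();
        int h = source.getHeight();
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                int brightness = (red + green + blue) / 3; // simple average, 0..255
                result.setRGB(x, y, brightness < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: Binarizer <input image> <output.png>");
            return;
        }
        BufferedImage source = ImageIO.read(new File(args[0]));
        if (source == null) {
            System.out.println("unsupported image format: " + args[0]);
            return;
        }
        ImageIO.write(binarize(source, 128), "png", new File(args[1]));
    }
}
```

A fixed threshold works well for evenly lit scans; for pages with shadows a different threshold, or a different method altogether, may be needed.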
 
2) It is recommended that images given to Tesseract have a resolution of at least 300 dpi, so we can use the following command:
 

convert -units PixelsPerInch dpi.png -density 300 dpiresult.png

It is also worth remembering that if the font size is less than 10pt, the recognition accuracy of Tesseract can decrease significantly.
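
The relation between resolution and text size is simple arithmetic: one point is 1/72 of an inch, so 10pt text scanned at 300 dpi is about 42 pixels high. The sketch below shows this calculation and, as a separate step, enlarges a low-resolution scan so that it has the pixel dimensions it would have at the target resolution. The class and method names are made up for this example, and bicubic interpolation is just one reasonable choice.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Sketch: resolution arithmetic and upscaling (hypothetical helper). */
final class Upscaler {

    /** Height in pixels of text of the given point size at the given resolution. */
    static double pointsToPixels(double points, int dpi) {
        return points * dpi / 72.0; // 1 pt = 1/72 inch
    }

    /** Enlarges (or shrinks) the image by the ratio targetDpi / sourceDpi. */
    static BufferedImage resample(BufferedImage source, int sourceDpi, int targetDpi) {
        double factor = (double) targetDpi / sourceDpi;
        int w = (int) Math.round(source.getWidth() * factor);
        int h = (int) Math.round(source.getHeight() * factor);
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = result.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(source, 0, 0, w, h, null); // scales while drawing
        g.dispose();
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("10pt at 300 dpi = " + pointsToPixels(10, 300) + " px");
        if (args.length < 3) {
            System.out.println("usage: Upscaler <input image> <source dpi> <output.png>");
            return;
        }
        BufferedImage source = ImageIO.read(new File(args[0]));
        if (source == null) {
            System.out.println("unsupported image format: " + args[0]);
            return;
        }
        BufferedImage result = resample(source, Integer.parseInt(args[1]), 300);
        ImageIO.write(result, "png", new File(args[2]));
    }
}
```

Upscaling cannot restore detail that the scan never captured, so rescanning at a higher resolution is preferable whenever the original is still available.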
 
3) Deskewing / rotating the image. Our scanned image is often slightly rotated, making it difficult to recognize individual letters. The image can be rotated with the following command:
 

convert rotate.png -rotate -3 rotate2.png
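
A small rotation like this can also be done with the standard Java 2D API. The following is a rough sketch under a few assumptions: the class name Rotator is made up, the page is rotated around its center, the canvas keeps its original size (so the corners may be clipped), and uncovered areas are filled with white. In Java 2D a positive angle rotates clockwise on screen.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Sketch: rotating a scan by a small angle (hypothetical helper). */
final class Rotator {

    /** Rotates the image around its center; the canvas size stays the same. */
    static BufferedImage rotate(BufferedImage source, double degrees) {
        int w = source.getWidth();
        int h = source.getHeight();
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = result.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, w, h); // white paper behind the rotated page
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.rotate(Math.toRadians(degrees), w / 2.0, h / 2.0);
        g.drawImage(source, 0, 0, null);
        g.dispose();
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: Rotator <input image> <degrees> <output.png>");
            return;
        }
        BufferedImage source = ImageIO.read(new File(args[0]));
        if (source == null) {
            System.out.println("unsupported image format: " + args[0]);
            return;
        }
        BufferedImage result = rotate(source, Double.parseDouble(args[1]));
        ImageIO.write(result, "png", new File(args[2]));
    }
}
```

The angle still has to be found by eye or by trial and error here; detecting the skew automatically is a separate problem that this sketch does not address.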

4) Get rid of the border. Frequently, after scanning, we end up with a black border around the image that may interfere with reading the text. We can get rid of it in several ways:
 
Using the -trim option, which removes the border:
 

convert black_border.png -trim tes_out.png

Unfortunately, the black level is often not uniform across the whole sheet, and by default the program matches only exactly the selected color, so we can use the -fuzz option to also match similar colors:
 

convert black_border2.png -fuzz 40% -trim tes_out.png
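
The idea behind trimming with a tolerance can be illustrated in plain Java. This is a simplified sketch, not ImageMagick's actual algorithm. It assumes that the top-left pixel has the border color, and it measures the color distance as the largest per-channel difference on a 0..255 scale, so its tolerance is not the same unit as the percentage given to -fuzz. The class name BorderTrimmer is made up for this example.

```java
import java.awt.image.BufferedImage;

/** Sketch: trimming a border with a color tolerance (hypothetical helper). */
final class BorderTrimmer {

    /** Largest per-channel difference between two RGB colors (0..255). */
    static int distance(int rgb1, int rgb2) {
        int dr = Math.abs(((rgb1 >> 16) & 0xFF) - ((rgb2 >> 16) & 0xFF));
        int dg = Math.abs(((rgb1 >> 8) & 0xFF) - ((rgb2 >> 8) & 0xFF));
        int db = Math.abs((rgb1 & 0xFF) - (rgb2 & 0xFF));
        return Math.max(dr, Math.max(dg, db));
    }

    /** Crops the image to the bounding box of all pixels that are not border-colored. */
    static BufferedImage trim(BufferedImage source, int tolerance) {
        int w = source.getWidth();
        int h = source.getHeight();
        int border = source.getRGB(0, 0); // assumption: the corner belongs to the border
        int minX = w, minY = h, maxX = -1, maxY = -1;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (distance(source.getRGB(x, y), border) > tolerance) {
                    minX = Math.min(minX, x);
                    maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return source; // the whole image looks like border, nothing to trim
        }
        return source.getSubimage(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(6, 6, BufferedImage.TYPE_INT_RGB); // black
        page.setRGB(2, 2, 0xFFFFFF);
        page.setRGB(3, 3, 0xFFFFFF);
        BufferedImage trimmed = trim(page, 40);
        System.out.println(trimmed.getWidth() + "x" + trimmed.getHeight());
    }
}
```

With a tolerance of 0 a single slightly lighter pixel in the border already counts as content, which is exactly the situation a tolerance is meant to handle.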

If the border has a fixed size, it can simply be removed using the -crop or -shave option:
 

convert black_border2.png -crop 300x500+15+12 tes_out.png
convert black_border2.png -crop 300x500-15-12 tes_out.png

convert black_border2.png -shave 15x15 tes_out.png
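
For a border of known size, the same two operations are easy to express with BufferedImage.getSubimage. The sketch below uses made-up names (FixedBorder, crop, shave) and covers only the simple cases: a crop with a non-negative offset, and a shave that removes the same number of pixels from opposite edges.

```java
import java.awt.image.BufferedImage;

/** Sketch: removing a border of known size (hypothetical helper). */
final class FixedBorder {

    /** Keeps a width x height region whose top-left corner is at (x, y). */
    static BufferedImage crop(BufferedImage source, int width, int height, int x, int y) {
        return source.getSubimage(x, y, width, height);
    }

    /** Removes dx pixels from the left and right edges and dy pixels from the top and bottom. */
    static BufferedImage shave(BufferedImage source, int dx, int dy) {
        int w = source.getWidth() - 2 * dx;
        int h = source.getHeight() - 2 * dy;
        if (w <= 0 || h <= 0) {
            throw new IllegalArgumentException("border is larger than the image");
        }
        return source.getSubimage(dx, dy, w, h);
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(400, 600, BufferedImage.TYPE_INT_RGB);
        BufferedImage cropped = crop(page, 300, 500, 15, 12);
        BufferedImage shaved = shave(page, 15, 15);
        System.out.println(cropped.getWidth() + "x" + cropped.getHeight());
        System.out.println(shaved.getWidth() + "x" + shaved.getHeight());
    }
}
```

Note that getSubimage shares pixel data with the original image, so changes to the result also change the source.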

Often the border is only on one side or is not uniform, and we cannot remove it with -trim. In this case we can use this command:
 

convert myborder.png -bordercolor black -border 1 -fuzz 95% -fill white -draw "color 0,0 floodfill" noborder.png

5) The last thing left is getting rid of the noise in the picture. Unfortunately, this is the hardest part and depends largely on the kind of noise in the image. To remove it we can, for example, blur or sharpen the image, add contrast or use other filters. For instance, if the image has white and black dots, a median filter is a good choice:
 

convert ship.png -median 5 ship2.png
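
To show why a median filter removes isolated dots, here is a minimal sketch in plain Java. It is an illustration only and is not equivalent to the -median option: it works on a fixed 3x3 neighborhood, converts the image to gray levels, and repeats the edge pixels at the image border. The class name MedianFilter is made up for this example.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;

/** Sketch: 3x3 median filter on gray levels (hypothetical helper). */
final class MedianFilter {

    /** Average of the three color channels, 0..255. */
    static int gray(int rgb) {
        return (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
    }

    /** Replaces every pixel with the median gray level of its 3x3 neighborhood. */
    static BufferedImage apply(BufferedImage source) {
        int w = source.getWidth();
        int h = source.getHeight();
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // clamp the coordinates so that edge pixels are repeated
                        int nx = Math.min(Math.max(x + dx, 0), w - 1);
                        int ny = Math.min(Math.max(y + dy, 0), h - 1);
                        window[n++] = gray(source.getRGB(nx, ny));
                    }
                }
                Arrays.sort(window);
                int m = window[4]; // middle of the nine sorted values
                result.setRGB(x, y, (m << 16) | (m << 8) | m);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(5, 5, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 5; y++) {
            for (int x = 0; x < 5; x++) {
                img.setRGB(x, y, 0xFFFFFF);
            }
        }
        img.setRGB(2, 2, 0x000000); // a single black dot of noise
        BufferedImage clean = apply(img);
        System.out.println(Integer.toHexString(clean.getRGB(2, 2) & 0xFFFFFF));
    }
}
```

A single dark dot is outvoted by its eight light neighbors, while larger shapes such as letter strokes mostly survive. Thin strokes can still be damaged, so the filter size has to match the text size.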

Other useful commands:
 

convert image.png -adaptive-blur 5 image2.png // adds blur to the image
convert image.png -negate image2.png // negates (inverts) the colors
convert image.png -noise 2 image2.png // reduces noise within the given radius (+noise followed by a noise type adds noise instead)
convert image.png -white-threshold 20000 image2.png // sets all pixels above the given value to white, while the remaining pixels are unchanged

A list of all options and their descriptions is available at https://www.imagemagick.org/script/command-line-options.php

 
We can use the same functions in our Java software. We just need another wrapper, this time for ImageMagick, added through Maven:
 

<dependency>
    <groupId>org.im4java</groupId>
    <artifactId>im4java</artifactId>
    <version>1.4.0</version>
</dependency>

The sample code could look like the one below. All methods work the same way as the command-line options described above:
 

        

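// imports needed by this fragment
import org.im4java.core.ConvertCmd;
import org.im4java.core.IMOperation;
import org.im4java.process.ProcessStarter;

// Tell im4java where the ImageMagick executables are installed
ProcessStarter.setGlobalSearchPath("C:\\Program Files\\ImageMagick-7.0.5-Q16");
ConvertCmd cmd = new ConvertCmd();

// The operations are passed to convert in the order in which they are added
IMOperation op = new IMOperation();
op.addImage("C:\\Users\\Sariel\\Desktop\\image.png");
op.density(300);
op.whiteThreshold(90d*256);
op.monochrome();
op.rotate(-1d);
op.fuzz(40d);   // set the color tolerance first so that it applies to the trim
op.trim();
op.blur(6d);
op.addImage("C:\\Users\\Sariel\\Desktop\\testResult");
cmd.run(op);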

 
There is also a good ready-made ImageMagick script for preparing images for OCR, called textcleaner, but it only works under Linux, or under Windows with Cygwin installed:
http://www.fmwconcepts.com/imagemagick/textcleaner/
 
With just one command it can do everything described above and more, giving very good results.
 
There is nothing to prevent us from implementing some solutions ourselves. I wrote a simple function that reads all pixels of an image and replaces them according to our needs:
 

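import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public static void readPixels(String in, String out) throws IOException {

		// Load the source image
		File img = new File(in);
		BufferedImage image = ImageIO.read(img);

		int w = image.getWidth();
		int h = image.getHeight();

		// All pixels of the image as packed ARGB integers, row by row
		int[] dataBuffInt = image.getRGB(0, 0, w, h, null, 0, w);
		int[] pixels = new int[dataBuffInt.length];

		for (int a = 0; a < dataBuffInt.length; a++) {

			Color c = new Color(dataBuffInt[a]);

			// Dark pixels (red and green in 1..50, blue in 1..100) become white (-1 = 0xFFFFFFFF)
			if (c.getRed() > 0 && c.getRed() <= 50 && c.getGreen() > 0 && c.getGreen() <= 50 && c.getBlue() > 0
					&& c.getBlue() <= 100) {
				pixels[a] = -1;
			} else {
				// Every other pixel becomes almost black (0x010101)
				pixels[a] = 65536 * 1 + 256 * 1 + 1;
			}
		}

		// Copy the new pixel values into a fresh RGB image and save it as PNG
		BufferedImage img2 = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
		img2.getRaster().setDataElements(0, 0, w, h, pixels);

		File outputfile = new File(out);
		ImageIO.write(img2, "png", outputfile);
	}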

 
Finally, as a curiosity, Tesseract and ImageMagick can also be used in an alternative way, to break simple captchas:

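Captcha 1
Image before: [image: captcha1]
Image after: [image: captcha1]
Functions used: none

Captcha 2
Image before: [image: captcha2]
Image after: [image: captcha2 rendered]
Functions used:
op.density(300);
op.crop(278,70,31,67);
op.monochrome();
op.median(3.5d);

Captcha 3
Image before: [image: captcha3]
Image after: [image: captcha3 rendered]
Functions used:
op.density(300);
op.noise(2d);
op.whiteThreshold(100d*256);
op.adaptiveBlur(1d);

Captcha 4
Image before: [image: captcha4]
Image after: [image: captcha4 rendered]
Functions used:
op.density(300);
op.median(2d);
op.negate();
op.whiteThreshold(90d*256);
op.monochrome();
op.adaptiveBlur(6d);
op.noise(4d);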

Using our software we can easily decode the first captcha without any graphics modifications. The next three require some image processing, but after that Tesseract handles them well. Once this process is automated, such captchas no longer do their job. Therefore, when we create a website it is better to use more sophisticated captcha generators. Unfortunately, even with the best image corrections we will sometimes be unable to read a scanned document, or the result will contain errors, and we will need to make or acquire a better scan.
 

The source code for the whole program, created in Eclipse, can be downloaded below.
Source code
