About Scanning and OCR

Scanning and OCR:
Letting Documents Travel from the Paper-World to the Cyber-World

Rituraj Kalita
July 2012

Scanning is the process through which documents and pictures on paper (e.g., an old book or a hanging photograph), for which no digital version (nowadays also called 'the soft copy') is available, must pass so that they can be kept in your computer or be made available through the Internet (or, maybe, through compact disks). A scanner generally converts the paper document to digital image files, though direct one-step conversion to digital text by simultaneous use of OCR technology is also sometimes performed. However, the two-step conversion process (first the image files and then the text documents) is generally preferred for on-paper text matter, as this makes both the image files and the OCR-output text available to the operators - they may then compare the two to check the correctness of the text and manually introduce corrections to the text as found necessary. This is how things are done by the volunteers working for the great book-digitisation initiatives such as Project Gutenberg: one volunteer scans some pages of a book and deposits the image files in the online repository, another performs the OCR conversion to get the text files and deposits those in the online repository, while a third downloads both sets of files and compares them to manually correct any errors that have crept into the text during the OCR conversion process, and then deposits the corrected text. [If you want to join the said worldwide group of volunteers, click here.]
Note: OCR (Optical Character Recognition) is the modern-day computer-based technology with which the digital images of textual characters (i.e., of letters, digits and symbols such as +) can be recognised as those characters themselves. Being able to recognise the characters within an image is clearly a great advancement of computer technology, and obviously requires some degree of artificial intelligence. Free, effective and easily accessible software for performing OCR has started appearing only as recently as the new millennium (we're going to discuss some of it)! Yet a few errors may still occur here and there during automated OCR, so a final finishing touch of manual correction on such OCR output still remains necessary.

With a scanner, scanning may be done using the proprietary software of the scanner producer (e.g., HP, Inc.) distributed along with the scanner, but the operating system (e.g., Windows or Fedora Linux) generally provides its own common scanning program. Thus Windows has the Microsoft Scanner and Camera Wizard, whereas Fedora has Simple Scan. On opening that sort of program, you may choose between two basic types of scanning options: (a) colour (color) / greyscale (grayscale) / black & white scanning and (b) fine or low resolution (implying high or low quality, and thus also larger or smaller file size) - resolution is measured in dpi ('dots per inch'), with 300 dpi (also called 300 x 300 dpi) meaning fine, 200 dpi the usual and 100 dpi rough quality. For a coloured photograph, you obviously need to go for the colour scan option - otherwise you'll arrive at a monochrome image of the photograph! But for scanning a text document with the final aim of performing OCR on it, even if the document was printed in blue ink, you may safely go for the greyscale scan (thus greatly reducing the image file size) - knowing that a colour or a greyscale scan would hardly make any difference as far as the final OCR result is concerned. But, remember, the black & white option (found in the Windows wizard), in contrast to the greyscale option, is generally disastrous: I could see practically nothing in the image after scanning the first page (A4 sized) of a novel freshly printed on white paper in this way, and got nothing meaningful as its OCR output either, whereas a corresponding greyscale scan gave a rather good image and an OCR output with only seven errors (illustrated later)!
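
The link between dpi, scan mode and file size can be seen with a little arithmetic. The sketch below is only a rough illustration in Python: it assumes an A4 page of 210 x 297 mm and typical colour depths of 24, 8 and 1 bits per pixel for colour, greyscale and black & white scans (assumed values, not taken from any scanner's documentation), and it computes the uncompressed size only. The function names are made up for this example.

```python
# Rough arithmetic only: uncompressed size of an A4 page scanned at a given dpi.
A4_MM = (210, 297)        # A4 paper, width x height in millimetres
MM_PER_INCH = 25.4

# Assumed bits per pixel for each scan mode (typical values, an assumption here)
BITS_PER_PIXEL = {"colour": 24, "greyscale": 8, "black_and_white": 1}


def pixel_size(dpi):
    """Return (width, height) in pixels of an A4 page scanned at `dpi`."""
    return tuple(round(mm / MM_PER_INCH * dpi) for mm in A4_MM)


def raw_bytes(dpi, mode):
    """Return the uncompressed image size in bytes for the given dpi and mode."""
    width, height = pixel_size(dpi)
    return width * height * BITS_PER_PIXEL[mode] // 8


for dpi in (100, 200, 300):
    width, height = pixel_size(dpi)
    for mode in BITS_PER_PIXEL:
        megabytes = raw_bytes(dpi, mode) / 1_000_000
        print(f"{dpi} dpi, {mode}: {width} x {height} pixels, ~{megabytes:.1f} MB raw")
```

The files a scanning program actually saves are usually compressed, so they come out much smaller than these raw figures, but the arithmetic shows why a greyscale scan carries only a third of the data of a colour scan at the same dpi.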

Scanning Using the Simple Scan Utility within the Free Fedora 13 Linux OS

The Grayscale Option in Microsoft Scanner and Camera Wizard (Windows XP)

About scanning, I've observed something really curious and useful as well, which I should share with you now. Using my HP Scanjet 2300c scanner (a common one; besides, what I'm going to disclose is most probably true of other scanners also), I uniformly found that the scanned page images obtained in my Fedora are much clearer and more pleasantly readable than those obtained in my Windows (see below). It has nothing to do with the dpi resolution option (I've carefully tested with 300 dpi in both cases), nor has it anything to do with the colour/greyscale option (I used the colour option in both cases). Just see the unbelievable difference in the images below (they were reduced to 25%, so as to be viewable within this window), corresponding to a recipe offered by a local Guwahati bakery. The large-sized scanned images actually obtained are stored here and here for your reference (to save any of these images to your computer for practising OCR, click to open the desired webpage and then right-click on the large image, followed by clicking on the Save Image As... option). And did it take a significantly longer time to scan in Fedora? Though I haven't tested that with a stopwatch, I didn't feel anything like that - rather, I felt that Fedora took somewhat less time!

Images for Scanned Text in Fedora (above) and in Windows!

        However, as expected, the file sizes of the images scanned in Fedora are nearly twice as large (~2 MB per A4-sized page for Fedora), but unless you want to place the images directly in a website, rather than to perform OCR on them, this factor hardly matters. So it seems we should learn Fedora (say, Fedora 13) just for scanning pages! The present-day Linux versions are mostly rather user-friendly, with a Windows-like graphical user interface (GUI) as seen in the screenshot at the top. All Fedora versions are free of cost to download (though the download size is several hundred MB), but installing one side by side with Windows in your computer would require the service of someone with expertise in disk partitioning. To access Simple Scan in Fedora 13, you need to click on Applications at the top-left of the screen (see screenshot), then on Graphics and then on Simple Scan. Within Simple Scan, I found that I could do only a colour scan (no greyscale option), but the choices of dpi are there. The most pleasing thing about setting up a scanner here is that Fedora 13 seems to already know about the scanner models of the major manufacturers, so unlike in Windows there's generally no need to download and install scanner driver files etc. I just plugged in my scanner, opened Simple Scan, and wow! It knew what scanner (HP Scanjet 2300c) I'd plugged in!

        The only catch in using any form of Linux in the same computer in which you work with Windows (yes, that's possible and I'm regularly doing that) is that you can't expect the output files that Linux places in its own hard-disk folders (say, the scanned image files) to be accessible from Windows later - instead, you need to save (or copy) the output files to a flash drive (i.e., a USB drive or pen drive) and then finally 'safely remove' the flash drive (right-click on the drive icon on the desktop, then choose that option) within Linux. This strange problem arises because the hard-disk file systems used by Linux and Windows are different, but for flash drives there's no such incompatibility.
Note: It's no big deal to have a USB drive plugged in to your computer every time you're working on it, and to save a copy of your present session's work there before you shut down or restart your computer. This is because the life of the data in your hard disk is like, as they say in Assamese about human life, 'a drop of water on an arum leaf' - nobody can be sure on which day it'll fall! But a computer generally does not crash while you're still working on it, so before you shut it down you should always transfer your recent output files to the USB drive, no matter whether you're working in Windows or in Linux.

        Let's now learn to perform OCR on scanned images of text. Till now, OCR has been primarily successful on printed texts written in the script used for English (the Latin script) or its close variations (such as that used for German). To perform OCR on such texts, I've found two free programs to be greatly successful: SuperGeek Free Document OCR 2.3.1 and Advanced OCR Free 5.0.1. Both of them are quite user-friendly (the former is slightly more obvious to operate), and provide high (and practically the same) speed as well as accuracy on scanned images of printed English texts. Their recent download sizes are within 6 to 7 MB each - rather attractive. Earlier, I used to marvel at the accuracy of FreeOCR.net 2.3. However, I found that this old version of FreeOCR that I have has surely been surpassed by the aforesaid pair: on the same scanned image of an A4-sized page of a novel in English it took double the time (9 s) and produced 25% more (total 9) errors - however, I couldn't test its present (3.0) version, as that doesn't have a compact downloadable setup file and its 0.1 MB stub took too long (and too uncertain) a time to download the full package. For performing OCR on English as well as several other select scripts (even including Arabic, Hindi etc.), the free Tesseract-OCR is being developed as a Google 'open-source' project, but it seems to be mainly meant for software developers, and strangely the accuracy of even its current Windows version (v 3.01) on the said test page in English is at most half as good as that of the aforesaid pair of privately owned free programs! [All the above software is freely downloadable from CNET.]

        Let's now have a look at the slim, handsome SuperGeek performing OCR on the aforesaid bakery recipe. To operate it, we first need to click on the Open... button to select the image file, and then click on OCR to perform the OCR - so straightforward is our role here! We see the input image in the left window, and after a few seconds get the output text in the right window - we may then either select and copy the text, or save the text to a text file.

SuperGeek Free Document OCR 2.3.1 Performing OCR on the Scanned Image of a Recipe

SuperGeek and Advanced OCR both, of course, wholly messed up five of the six (:-) signs within the recipe, but recognised all the words correctly (including all the wrongly spelt words in the original, such as evently) - see the full text output here. Let's now go for a more real-life example of scanning documents such as books. For this, the first page (A4 sized) of a novel (Chingiz Aitmatov's The First Teacher), obtained as a Word document, was printed (with Fast Draft quality on a common HP DeskJet printer) on white paper, and was scanned in various ways, followed by OCR operations, to see how correctly the original text would get reproduced. I found that while using the said Windows wizard, the choice of colour or greyscale scan hardly mattered, and even the choice of 200 dpi (default for the Windows wizard) or 300 dpi didn't matter - the OCR output for the page, by both SuperGeek and Advanced OCR, would strangely have the same eight (8) errors - four wrongly spelt words, one error in a sign and three spaces lost between consecutive words: see the scanned 200 dpi greyscale image (size-reduced, its large original is stored here), the OCR-output text and the generated errors within that text red-underlined by Microsoft Word, one below the other within this page. However, on scanning the printed page with Simple Scan within Fedora 13 (colour scan, 300 dpi default - the large original is here), the errors in the OCR output (see the output text here) by SuperGeek (and also by Advanced OCR) drastically came down to one - only a period sign was found missing! [The time taken in each of these one-page OCR operations, whether in SuperGeek or Advanced OCR, on my Dual Core Pentium-IV computer was uniformly 4.8 (i.e., ~5) seconds.] So encouraging a state of affairs, isn't it? The initially unexpected moral of this story is obvious - we should use Fedora to scan text-based paper documents, at least if we intend to perform OCR.
Note: Do you realise that a good method to quickly check for errors in the OCR of English or other European-language texts is to paste the OCR text output into a well-developed word processor such as Microsoft Word, the free Apache OpenOffice, the free AbiWord 2.8.6 or Ginger, which would underline the spelling and grammatical errors in red or green?

Tesseract-OCR (even for Windows) has to be run at the command prompt (obtainable from the Start Menu as: Programs --> Accessories --> Command Prompt), as if in the days of old MS-DOS. Tesseract-OCR generally gets installed within the Tesseract-OCR folder in your C-drive. The command prompt (similar to the command-line interface, or CLI, such as the Konsole in Linux) is a black window with white text (see below) - there you're provided a line (the command line) to type in your present command, always to be followed by pressing Enter. Within the command prompt, you may open the C-drive (if not already open) by typing C: and then pressing Enter. [Do you realise that this is similar to double-clicking on the C-drive icon within your Windows Explorer or within a similar Linux GUI?] Within the C-drive (a C: will be shown at the left-most part of the last line - see below), open the C:\Tesseract-OCR folder by typing in cd \Tesseract-OCR (cd stands for change directory, where directory is the old-days synonym for folder - also note the backslash sign, not the slash used in Linux) and pressing Enter (isn't this similar to double-clicking on the Tesseract-OCR folder within the C-drive?). Next type the final command, say Tesseract Hindi001.jpg OutH001 -l hin (see below) - where Hindi001.jpg is the complete filename (note the mandatory absence of spaces within the filename) of the scanned image file, OutH001 is the name of the desired text-output file, to which the unspecified extension .txt gets added automatically to give OutH001.txt (note the absence of spaces and also the preferable eight-character limit on the first name part - e.g., Hindi001 - of the filenames specified here), -l (dash el, not dash one) introduces the language-script involved (l stands for language) and hin (lowercase, three characters) is the three-character abbreviation for the Hindi language (similarly, it is eng for English or ara for Arabic) within the Tesseract-OCR project.
Next, an output line (Tesseract Open Source ...) would appear to indicate that your OCR work has been undertaken, and finally the next blank command line would appear after the OCR work ends (see below). You may then enter another OCR command, or enter the Exit command to close the command prompt.
Note: The output files OutH001.txt, etc. may, obviously, be opened with Notepad to view the OCR output as Hindi Unicode text. However, if you're working with an old version of Windows, the Unicode text may appear as just garbage. This happened on my old desktop running Windows XP, but on the newer laptop running Windows 7, the Hindi text became recognisable!
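
For those who would rather not type the same command for every page, the Tesseract command described above can also be issued from a small script. What follows is a minimal Python sketch, not a part of Tesseract itself: the function names are hypothetical, and it assumes that the tesseract program can be found on the system PATH (otherwise it would have to be run from within the Tesseract-OCR folder).

```python
import shutil
import subprocess


def build_tesseract_command(image_file, output_base, language="eng"):
    """Build the argument list: tesseract <image file> <output base> -l <language>."""
    return ["tesseract", image_file, output_base, "-l", language]


def run_ocr(image_file, output_base, language="eng"):
    """Run Tesseract on one image and return the name of the text file it writes.

    Assumes the tesseract executable is on the PATH (an assumption of this sketch).
    """
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_file, output_base, language)
    subprocess.run(command, check=True)   # raises CalledProcessError if Tesseract fails
    return output_base + ".txt"           # Tesseract adds the .txt extension itself


# The command used in the text, shown here as an argument list:
print(build_tesseract_command("Hindi001.jpg", "OutH001", "hin"))
```

Calling run_ocr("Hindi001.jpg", "OutH001", "hin") would then be expected to leave the recognised text in OutH001.txt, just as the typed command does.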

Tesseract-OCR Working on the Scanned Image of Some Hindi Text

How was the observed accuracy here in Hindi OCR work? Well, not so enchanting at present! To test it, I similarly printed some typed newspaper text set in the free yet very good Kiran Hindi font, using the same HP DeskJet printer, on white paper, even taking care to have a large font size (I felt the complicated-looking conjunct consonants in Hindi might need a larger font size to be understood by a machine) and also good spacing (see here). Then I scanned that (the large original image is here) using the Windows wizard, with the greyscale, 200 dpi option (as was done with the aforesaid A4-sized page of a novel in English, which finally gave only eight minor OCR errors with SuperGeek). Next, its OCR was performed using Tesseract, as shown in the above screenshot of the command prompt window. Sadly, the Hindi OCR output resembles the original only in a majority of the places - in several places it is queerly different (you may compare them here). As some of you may not be able to read Hindi, let's read the short first sentence aloud: Wayan ki aage koi nahi tiyar paataa (in the OCR output) and Wakt ki aage koi nahi tik paataa (in the original). Well, it's a great improvement for Hindi OCR over being non-existent just a few years back, but it seems to need some more progress to become really meaningful for the users!
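
Such a comparison between the original and the OCR output can also be done mechanically. The following is a minimal Python sketch using the standard difflib module, applied to the romanised first sentence quoted above; the function name word_differences is invented for this illustration, and a real check would of course compare the Unicode Hindi text files themselves.

```python
import difflib


def word_differences(original, ocr_output):
    """Return a list of (original words, OCR words) pairs wherever the two texts differ."""
    original_words = original.split()
    ocr_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(a=original_words, b=ocr_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":   # 'replace', 'delete' or 'insert'
            differences.append((" ".join(original_words[i1:i2]),
                                " ".join(ocr_words[j1:j2])))
    return differences


original = "Wakt ki aage koi nahi tik paataa"
ocr_output = "Wayan ki aage koi nahi tiyar paataa"

print(word_differences(original, ocr_output))
# prints [('Wakt', 'Wayan'), ('tik', 'tiyar')]
```

Here two of the seven words come out wrong, which matches what one hears on reading the two versions aloud.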