Scanning and OCR:
Letting Documents Travel from the Paper-World to the Cyber-World
Rituraj Kalita
July 2012
Scanning is the process by which documents and pictures on paper (e.g., an old
book or a hanging photograph), for which no digital version (nowadays
also called 'the soft copy') is available, are converted into digital files so that they can be
kept in your computer or made available through the Internet (or,
maybe, through compact disks). A scanner generally converts the paper document
to digital image files, though direct one-step conversion to
digital text by simultaneous use of OCR technology is also sometimes
performed. However, the two-step conversion process (first the image
files and then the text documents) is generally preferred for on-paper text, as it
makes both the image files and the OCR-output text available to the
operators - they may then compare the two to check the correctness of
the text and to manually correct the text wherever
necessary. This is how things are done by the volunteers
working for the great book-digitisation initiatives such as Project Gutenberg:
one volunteer scans some pages of a book and deposits the image files in the
online repository, another performs the OCR conversion to get the
text files and deposits those in the online repository, while a
third downloads both sets of files and compares them to manually
correct any errors that have crept into the text during the OCR
conversion process, and then deposits the corrected text. [If you want
to join the said worldwide group of volunteers, click here.]
Note:
OCR (Optical Character Recognition) is the computer-based
technology by which the digital images of textual characters
(i.e., of letters, digits and symbols such as +) can be
recognized as those characters themselves. Being able to recognize the characters within an image
is clearly a great advancement of computer
technology, and obviously requires some degree of artificial
intelligence. Free, effective and easily accessible software for
performing OCR has started appearing only as recently as the new
millennium (we're going to discuss some of them)! Yet a few errors may still
occur here and there during automated OCR, so a
final finishing touch of manual correction on the OCR output still
remains necessary.
Scanning may be done using the
proprietary software of the scanner producer (e.g., HP, Inc.)
distributed along with the scanner, but the operating system (e.g., Windows or Fedora Linux) generally provides its own common scanning program. Thus Windows has the Microsoft Scanner and Camera Wizard, whereas Fedora has Simple Scan.
On opening such a program, you may choose between two basic types of
scanning options: (a) colour (color) / greyscale (grayscale) / black & white
scanning and (b) fine or coarse resolution (implying high or low quality,
and thus also larger or smaller file size). Resolution is measured in dpi
('dots per inch'), with 300 dpi (also called 300 x 300 dpi)
meaning fine, 200 dpi the usual and 100 dpi rough quality. For a
coloured photograph, you obviously need to go for the colour
scan option - otherwise you'll end up with a monochrome image!
But for scanning a text document with the final aim of performing OCR on it,
even if the document was printed in blue ink, you may safely go for the
greyscale scan (thus greatly reducing the image file size), knowing
that a colour or a greyscale scan would hardly make any difference as
far as the final OCR result is concerned. But remember, the black & white option (found in Windows), in contrast to the greyscale option, is generally disastrous: after scanning the first page (A4 sized) of a novel freshly printed on white paper with it,
I found practically nothing visible to the eye in the image and
also nothing meaningful in its OCR output, whereas a corresponding greyscale scan gave a rather good image and an OCR output with only
seven errors (illustrated later)!
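To get a feel for how the dpi and colour choices affect the image size, here is a small sketch in Python. It assumes an A4 page of 210 mm x 297 mm, one byte per pixel for greyscale and three bytes per pixel for colour; the function names are only illustrative. The figures are for raw, uncompressed pixel data, so the (usually compressed) files your scanning program actually saves would be much smaller.

```python
# Rough arithmetic for the pixel dimensions and raw (uncompressed) size
# of a scanned A4 page. Assumptions: A4 = 210 mm x 297 mm,
# greyscale = 1 byte per pixel, colour = 3 bytes per pixel.

MM_PER_INCH = 25.4
A4_MM = (210, 297)
BYTES_PER_PIXEL = {"greyscale": 1, "colour": 3}


def a4_pixels(dpi):
    """Return (width, height) in pixels of an A4 page scanned at the given dpi."""
    width_mm, height_mm = A4_MM
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def raw_size_bytes(dpi, mode):
    """Return the uncompressed size in bytes for the given dpi and scan mode."""
    width, height = a4_pixels(dpi)
    return width * height * BYTES_PER_PIXEL[mode]


if __name__ == "__main__":
    for dpi in (100, 200, 300):
        width, height = a4_pixels(dpi)
        for mode in ("greyscale", "colour"):
            size_mb = raw_size_bytes(dpi, mode) / 1_000_000
            print(f"{dpi} dpi, {mode}: {width} x {height} pixels, ~{size_mb:.1f} MB raw")
```

Going from 100 dpi to 300 dpi multiplies the number of pixels by about nine, and a colour scan holds three times the raw data of a greyscale scan at the same dpi.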
Scanning Using the Simple Scan Utility within the Free Fedora 13 Linux OS
The Grayscale Option in Microsoft Scanner and Camera Wizard (Windows XP)
About scanning, I've observed something really curious and useful as well, which I
should share with you now. Using my HP Scanjet 2300c scanner (a common one; besides,
what I'm going to disclose is most probably true of other scanners also),
I uniformly found that the scanned page images obtained in Fedora are much clearer and more pleasantly readable than those obtained in Windows (see below).
It has nothing to do with the dpi resolution option (I've carefully
tested with 300 dpi in both cases), nor has it anything to do with the
colour/greyscale option (I used the colour option in both cases). Just see the unbelievable difference in the images below (they were reduced to 25%, so as to be viewable within this
window), corresponding to a recipe offered by a local Guwahati
bakery. The large-sized scanned images actually obtained are
stored here and here
for your reference (to save any of these images to your computer for
practising OCR, click to open the desired webpage and then right-click
on the large image, followed by clicking on the Save Image As... option). And did it take a significantly longer time to scan in Fedora? Though I haven't tested that with a stopwatch, I didn't feel anything like that - rather, I felt that Fedora took somewhat less time!
Images for Scanned Text in Fedora (above) and in Windows!
However, as expected, the file sizes for the scanned images in Fedora are nearly twice as large (~2 MB per A4 sized page for Fedora),
but unless you want to place the images directly in a website rather than
perform OCR on them, this factor hardly matters. So it
seems we should learn Fedora (say, Fedora 13) just for scanning pages! The present-day Linux versions are mostly rather user-friendly, with a Windows-like graphical user interface (GUI) as seen in the screenshot at the top. All the Fedora versions are downloadable free of cost (though the download is several hundred MB in size), but installing one side by side with Windows in your computer would require the service of someone with expertise in disk partitioning.
To access Simple Scan in Fedora 13, you need to click on Applications at the top-left of the screen (see screenshot), then click on Graphics and then on Simple Scan. Within Simple Scan, I found that I could do only colour scans (there is no greyscale option), but choices of dpi are available. The most pleasing thing about setting up a scanner here is that Fedora 13 already knows about the scanner models of the major manufacturers, so unlike in Windows there's generally no need to download and install the scanner driver files etc. I just plugged in my scanner, opened Simple Scan, and wow, it knew what scanner (HP Scanjet 2300c) I had plugged in!
The only catch in using any form of Linux in the same computer in which you work with Windows (yes, that's possible and I'm regularly doing that) is that the files Linux saves in its own hard-disk folders (say, the scanned image files) are not accessible from Windows
later. Instead, you need to save (or copy) the output files to a flash drive
(i.e., a USB or pen drive) and then finally 'safely remove' the flash
drive (right-click on the drive icon on the desktop, then choose that option) within Linux. This problem arises because the hard-disk partitions used by Linux and by Windows have different file systems, and Windows cannot read the Linux ones on its own, whereas for flash drives there's no such incompatibility.
Note:
It's no big deal to have a USB drive plugged in to your computer every
time you're working in it, and to save a copy of your present session's
work there before you shut down or restart your computer. This is
because the life of the data in your hard disk is like, as they say in
Assamese about human life, 'a drop of water on an arum leaf' - on
which day it'd fall off nobody can be sure! But a computer generally
does not crash while you're still working on it, so before you shut it
down you should always transfer your recent output files to the USB
drive, no matter whether you're working in Windows or in Linux.
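If you'd like to automate this end-of-session copying, a minimal sketch in Python could look like the one below. The function name, the folder paths in the comment and the *.jpg file pattern are all hypothetical examples, to be adjusted to your own scan folder and flash drive.

```python
# Copy this session's output files (e.g. scanned images) to a flash drive.
# The function name, folder paths and file pattern are hypothetical examples.
import shutil
from pathlib import Path


def copy_outputs(source_dir, dest_dir, pattern="*.jpg"):
    """Copy files matching pattern from source_dir to dest_dir; return copied names."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if missing
    copied = []
    for path in sorted(source.glob(pattern)):
        if path.is_file():
            shutil.copy2(path, dest / path.name)  # copy2 also keeps the timestamps
            copied.append(path.name)
    return copied


# Example call (hypothetical locations):
# copy_outputs("/home/user/Scans", "/media/USBDRIVE/Scans")
```

After the copying is over, the flash drive still has to be 'safely removed' as described above.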
Let's now learn to perform OCR on scanned images of
text. Till now, OCR has been primarily successful on printed
texts written in the script used for English or its close variations (such as
that for German). To perform OCR on such texts, I've found two free
programs to be greatly successful: the SuperGeek Free Document OCR 2.3.1 and the Advanced OCR Free 5.0.1.
Both of them are quite user-friendly (the former is slightly more
obvious to operate), and provide high (and practically the same)
speed as well as accuracy on scanned images of printed English texts.
Their recent download sizes are within 6 to 7 MB each - rather attractive. Earlier, I used
to marvel at the accuracy of FreeOCR.net 2.3. However, I found that this old version of FreeOCR
that I have has surely been surpassed by the aforesaid pair: on the same
scanned image of an A4-sized page of a novel in English it took double
the time (9 s) and produced 25% more (total 9) errors. However, I
couldn't test its present (3.0) version, as that doesn't have a compact
downloadable setup file and its 0.1 MB sized stub took too long (and
too uncertain) a time to download the full package. For performing OCR on
English as well as several other select scripts (even including Arabic,
Hindi etc.), the free Tesseract-OCR is being developed as a Google 'open-source' project,
but it seems to be mainly meant for software developers, and
strangely the accuracy of even its current Windows
version (v 3.01) on the said
test page in English is at most half as good
as that of the aforesaid pair of privately owned free software! [All
the above software are freely downloadable from CNET.]
Let's now have a look at the slim, handsome SuperGeek performing OCR on the aforesaid bakery recipe. To operate it, we first need to click on the Open... button to select the image file, and then click on OCR
to perform the OCR - so straightforward is our role here! We see the
input image in the left window, and after a few seconds get the output
text in the right window - we may then either select and copy the text,
or save the text in a text file.
SuperGeek Free Document OCR 2.3.1 Performing OCR on the Scanned Image of a Recipe
Both SuperGeek and Advanced OCR, of course, wholly messed up five of the six (:-) signs within the
recipe, but recognised all the words correctly (including all the wrongly spelt words in the original, such as evently) - see the full text output here.
Let's now go for a more real-life example of scanning documents such as
books. For this, the first page (A4 sized) of a novel (Chingiz
Aitmatov's The First Teacher), obtained as a Word document, was printed (with Fast Draft quality in a common HP DeskJet
printer) on white paper, and was then scanned in various ways, each scan being followed by OCR
to see how correctly the original text would get reproduced.
I found that while using the said Windows wizard, the choice of colour or greyscale scan hardly mattered, and even the choice of 200 dpi (default for the Windows wizard) or 300 dpi didn't matter - the OCR output for the page, by both SuperGeek and Advanced OCR,
would strangely have the same eight (8) errors: four wrongly spelt
words, one error in a sign and three lost spaces between consecutive
words. See the scanned 200 dpi greyscale image (size-reduced, its large original is stored here), the OCR-output text and
the generated errors within that text red-underlined by Microsoft Word, one below the other within this page. However, on scanning the printed page with Simple Scan within Fedora 13
(colour scan, 300 dpi default - the large original is here), the errors in the OCR output (see the output text here) by SuperGeek (and also by Advanced OCR) drastically came
down to one - only a period sign was found missing! [The time taken in each of these one-page OCR operations, whether in SuperGeek or Advanced OCR, in my Dual Core Pentium-IV computer was uniformly 4.8 (i.e., ~5) seconds.] Such an encouraging state of affairs, isn't it? The initially unexpected moral
of this story is obvious - we should use Fedora to scan text-based paper documents, at least if we intend to perform OCR.
Note:
Do you recognise that a good method to quickly check the errors in
OCR for English or some other European-language texts is to paste
the OCR text output into a well-developed word processor such as Microsoft Word, the free Apache OpenOffice, the free AbiWord 2.8.6 or Ginger, which would underline the spelling and grammatical errors in red or green?
Tesseract-OCR (even for Windows) has to be run at the command prompt (obtainable from the Start Menu as: Programs --> Accessories --> Command Prompt), as if in the days of old MS-DOS. Tesseract-OCR generally gets installed within the Tesseract-OCR folder in your C-drive. The command prompt (similar to the command line interface, CLI, or the Konsole in Linux) is a black window with white text (see below) - there you're provided a line (the command line) to type in your present command, always to be followed by pressing Enter.
Within the command prompt, you may open the C-drive (if it is not already open) by typing C: and then pressing Enter. [Do you recognise that this is similar to double-clicking on the C-drive icon within your Windows Explorer or within a similar Linux GUI?] Within the C-drive (a C: will be shown at the left-most part of the last line - see below), open the C:\Tesseract-OCR folder by typing in cd \Tesseract-OCR and pressing Enter (cd stands for change directory, where directory is the old-days synonym for folder - also note the backslash sign, not the slash used in Linux). Isn't this similar to double-clicking on the Tesseract-OCR folder within the C-drive?
Next type the final command, say Tesseract Hindi001.jpg OutH001 -l hin (see below). Here Hindi001.jpg is the complete filename of the scanned image file (note that there must be no space within the filename); OutH001 is the desired text-output file, which gets saved as OutH001.txt (the extension .txt is left unspecified, as it is understood) - note again the absence of spaces, and also that the first name part of a filename (e.g., Hindi001) is preferably limited to eight characters; -l (dash el, not dash one) introduces the language-script involved (l stands for language); and hin (lowercase, three characters) is the 3-character abbreviation for the Hindi language within the Tesseract-OCR project (similarly, it is eng for English or ara for Arabic). Next, an output line (Tesseract Open Source
...) would appear to indicate that your OCR work has been taken up, and finally
the next blank command line would appear after the OCR work ends (see
below). You may then enter another OCR command, or enter the Exit command to close the command prompt.
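For those who prefer a script to typing at the command prompt, the same Tesseract command can also be assembled and run from Python. This is only a sketch: it assumes that Tesseract is installed and can be started by the name tesseract from the folder in question, and the two function names are my own inventions for illustration.

```python
# Build (and optionally run) the Tesseract command described above.
# Assumptions: Tesseract is installed and its program can be started as
# "tesseract"; the file names are the article's examples; the function
# names are hypothetical.
import subprocess


def build_tesseract_command(image_file, output_base, language="eng"):
    """Return the command as a list: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_file, output_base, "-l", language]


def run_ocr(image_file, output_base, language="eng", folder=None):
    """Run Tesseract inside folder; the text is written to <output_base>.txt."""
    command = build_tesseract_command(image_file, output_base, language)
    subprocess.run(command, cwd=folder, check=True)  # raises an error if OCR fails
    return output_base + ".txt"


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually perform the OCR.
    print(" ".join(build_tesseract_command("Hindi001.jpg", "OutH001", "hin")))
```

Keeping the command-building step separate from the running step lets you look at the exact command before any OCR work is started.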
Note: The output files OutH001.txt, etc. may, obviously, be opened with Notepad to view the OCR output as Hindi Unicode text. However, if you're working with an old version of Windows, the Unicode text may appear as just garbage. This happened with my old desktop running Windows XP, but in the newer laptop running Windows 7, the Hindi text became recognisable!
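If the text looks like garbage on the screen, it may help to check whether the file itself actually contains Hindi characters. The sketch below assumes that the output file is UTF-8 encoded (which is how Tesseract normally writes its text output) and that Hindi characters lie in the Unicode Devanagari block (U+0900 to U+097F); the function names are hypothetical.

```python
# Read an OCR output file as Unicode text and check how much of it is Hindi.
# Assumptions: the file is UTF-8 encoded; Hindi characters are those in the
# Unicode Devanagari block (U+0900 to U+097F); function names are hypothetical.


def read_ocr_text(path):
    """Return the contents of the text file at path, decoded as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def devanagari_share(text):
    """Return the fraction of non-space characters lying in the Devanagari block."""
    characters = [ch for ch in text if not ch.isspace()]
    if not characters:
        return 0.0
    hindi = [ch for ch in characters if "\u0900" <= ch <= "\u097f"]
    return len(hindi) / len(characters)


# Example call (the file name is the article's example):
# print(devanagari_share(read_ocr_text("OutH001.txt")))
```

A share close to 1 would suggest that the file does hold Hindi text, so any garbage seen on the screen comes from how that particular computer displays it.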
Tesseract-OCR Working on the Scanned Image of Some Hindi Text
How was the observed accuracy here in Hindi OCR work? Well, not so
enchanting at present! To test it, I similarly printed some typed newspaper text set in the
free yet very good Kiran Hindi font, using the same HP DeskJet
printer, on white paper, even taking care to have a large font size (I felt the complicated-looking conjunct consonants in
Hindi might need a larger font size to be understood by a machine) and also good spacing (see here). Then
I scanned that (the large original image is here) using the Windows wizard,
with the greyscale, 200 dpi option (as was done with the aforesaid A4-sized page of
a novel in English, finally getting only eight minor OCR errors there with SuperGeek). Next, its OCR was performed using Tesseract,
as shown in the above screenshot of the command prompt window. Sadly, the
Hindi OCR output resembles the original only in a majority of the places
- in several places it was queerly different (you may compare them
here). As some of you may not be able to read Hindi, let's read the
short first sentence aloud: Wayan ki aage koi nahi tiyar paataa (in the OCR output) and Wakt ki aage koi nahi tik paataa
(in the original). Well, it's a great improvement for Hindi OCR over
being non-existent just a few years back, but it seems to need some
more progress to become really meaningful for the users!
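A word-by-word comparison of this kind, between an original text and its OCR output, can also be done by a short program. The following sketch uses the difflib module from the Python standard library on the two romanised readings of the first sentence quoted above; the function name is hypothetical, and comparing whole words (rather than single characters) is just one simple choice.

```python
# Compare an OCR output with the original text, word by word.
# Uses difflib from the Python standard library; the two sentences are the
# romanised readings quoted above. The function name is hypothetical.
import difflib


def word_mismatches(original, ocr_output):
    """Return (original words, OCR words) pairs wherever the two texts differ."""
    original_words = original.split()
    ocr_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(a=original_words, b=ocr_words, autojunk=False)
    mismatches = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((" ".join(original_words[a_start:a_end]),
                               " ".join(ocr_words[b_start:b_end])))
    return mismatches


if __name__ == "__main__":
    original = "Wakt ki aage koi nahi tik paataa"
    ocr_output = "Wayan ki aage koi nahi tiyar paataa"
    for expected, found in word_mismatches(original, ocr_output):
        print(f"original: {expected!r}   OCR: {found!r}")
```

For this sentence, two of the seven words differ (Wakt read as Wayan, and tik read as tiyar).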