Anbumani Subramanian and Bhadri Kubendran
Department of Electrical and Computer Engineering
Virginia Tech, Blacksburg
December 10, 2000
Prof. A. Lynn Abbott
ECPE 5554 - Theory and Design of Computer Vision Systems - Fall 2000
Document analysis and optical character recognition (OCR) systems have been under research for a few decades. Still, it remains a highly challenging task to implement an OCR system that works under all possible conditions and gives highly accurate results. In this project, we have developed a simple optical character recognition application for Tamil characters.
The motivation behind our work was to develop a character recognition system that can further be developed to recognize printed and handwritten Tamil characters. Such a system would be extremely useful for digitizing the innumerable ancient documents, available on both paper and palm leaves, that date back a few hundred years. These invaluable ancient works are in the process of being lost to termites in libraries in India, due to the lack of proper OCR systems. Our attempt here is therefore a humble effort towards the noble goal of developing an efficient, public-domain OCR system for Tamil characters.
Tamil is a classical South Indian language spoken in India, Sri Lanka, Singapore, and Malaysia. The language has 31 basic letters (12 vowels, 18 consonants and a special consonant), and the written script comprises 247 characters.
Figure 1 shows the unique characters that can be printed using the InaimathiTSC font family at a 12-point size. This font follows the Tamil Standard Code for Information Interchange (TSCII) [4] encoding. For our analysis, we scanned the documents at 150 dpi (dots per inch). Given the nature of this project, we decided to keep the scope short and the analysis simple. We therefore concentrated on recognizing only 31 characters: the 12 vowels from the first row of fig. 1, the 18 consonants from its third row, and a full-stop character.
Any OCR implementation consists of a number of preprocessing steps followed by the actual recognition. The number and types of preprocessing algorithms employed on the scanned image depend on many factors, such as the age of the document, the paper quality, the resolution of the scanned image, the amount of skew in the image, the format and layout of the images and text, the kind of script used, and the type of characters (printed or handwritten). The recognition stage usually involves calculating a number of statistical parameters, from which the character is identified. Typical preprocessing stages include noise cleaning, binarization, skeletonization, skew detection and correction, and feature extraction steps such as line and word segmentation.
In our approach, we assumed fixed-size (12 pt.) printed Tamil characters from the InaimathiTSC family with no skew. The preprocessing stages therefore included noise cleaning, binarization, skeletonization and feature extraction. Figure 2 shows the flow of our approach.
Scanned documents often contain noise that arises from the printer, the scanner, the print quality, the age of the document, etc. It is therefore necessary to filter this noise before processing the image. The common approach is to low-pass filter the image and use the filtered result for later processing.
In our case, we assumed that the document does not have high noise content,
in which case the image can be binarized directly.
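For reference, here is a minimal sketch of such a low-pass filter in Python with NumPy. The 3x3 averaging kernel is our own illustrative choice; the report does not specify a particular filter.

    import numpy as np

    def mean_filter_3x3(gray):
        """Apply a simple 3x3 averaging (low-pass) filter to an 8-bit
        grayscale image, given as a 2-D NumPy uint8 array."""
        padded = np.pad(gray.astype(float), 1, mode='edge')
        h, w = gray.shape
        out = np.zeros((h, w), dtype=float)
        for dr in (-1, 0, 1):                # accumulate the nine shifted
            for dc in (-1, 0, 1):            # copies of the image
                out += padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
        return (out / 9.0).astype(np.uint8)  # average of the 3x3 neighborhood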
Binarization is a technique by which a grayscale image is converted to a binary image. The most common method is to select a proper threshold for the image: all intensity values below the threshold are converted to one intensity level, and all values above it are converted to the other, so that each pixel becomes either "black" or "white".
We used Otsu's thresholding algorithm to binarize our grayscale image. In our convention, black represented a character (or noise) and white represented the background. Otsu's method assumes that the histogram of the grayscale image has a bimodal distribution. Examples of thresholding applied to the grayscale images are shown in later figures.
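A minimal sketch of Otsu's method, which selects the threshold that maximizes the between-class variance of the histogram (Python/NumPy; assuming an 8-bit grayscale input):

    import numpy as np

    def otsu_threshold(gray):
        """Return the Otsu threshold for an 8-bit grayscale image,
        assuming its histogram is roughly bimodal."""
        hist = np.bincount(gray.ravel(), minlength=256).astype(float)
        prob = hist / hist.sum()
        levels = np.arange(256, dtype=float)
        best_t, best_var = 0, 0.0
        for t in range(1, 256):
            w0 = prob[:t].sum()           # weight of the dark class
            w1 = 1.0 - w0                 # weight of the bright class
            if w0 == 0.0 or w1 == 0.0:
                continue
            mu0 = (levels[:t] * prob[:t]).sum() / w0   # class means
            mu1 = (levels[t:] * prob[t:]).sum() / w1
            var_between = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    def binarize(gray):
        """Map pixels at or above the threshold to 1 (white background)
        and the rest to 0 (black character pixels), as in our convention."""
        return (gray >= otsu_threshold(gray)).astype(np.uint8)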
7. Skew Detection and Estimation
During the scanning operation, a certain amount of image skew is unavoidable, especially when a large number of documents must be scanned in a limited amount of time; we rarely see scanned documents that have no skew. Skew detection and correction is therefore an important preprocessing operation. The skew angle is the angle that the text lines of the scanned image make with the horizontal direction. Once this angle has been detected, it can be corrected using an affine transformation. Many research papers are dedicated to this problem alone, which indicates the complexity involved. Despite our initial efforts, we did not achieve highly successful results, due to lack of time and the complexity of the problem. We therefore did not explore this problem in detail, and our documents were scanned without introducing skew.
8. Image Thinning (skeletonization)
Skeletonization, or thinning, is a process that produces a one-pixel-wide representation (the skeleton) of an object while preserving the connectedness of the object and its end points. The process can thus be seen as a conditional deletion of boundary pixels. Thinning is of practical importance in applications that involve further structural analysis, because it can greatly simplify analyses that would otherwise become complex. A number of thinning algorithms have been proposed and are in use; among the most common are the classical Hilditch algorithm [2] and its variants.
We used a simple algorithm for thinning our images. The algorithm makes two passes: one pass over the actual image and another over an intermediate image (constructed from the original image during the first pass). The results of the two passes decide whether a pixel is removed to obtain the skeleton of the object of interest.
The thinning algorithm works as follows:
Figure 3(a) shows an input image on which the thinning algorithm is to be applied; each box represents a pixel. Figure 3(b) shows the 3x3 mask that is moved over the input image during the first pass and then over the intermediate image during the second pass. The center of the mask is positioned on the pixel of interest during the scanning process, and the numbered boxes represent the corresponding neighbor pixels. The pixel of interest, shown as "o", is marked for deletion when all of a set of neighborhood conditions are satisfied. These conditions are checked only for "dark" pixels.
This check is repeated for each pixel as the image is scanned. The intermediate result is an image consisting of pixels that are marked for deletion, with all other pixels left as in the original input image. The second pass is performed on this intermediate image with a slightly different set of deletion conditions.
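The exact deletion conditions of our algorithm are given in the figures. As an illustration of a two-pass thinning scheme of this kind, here is a sketch of the classical Zhang-Suen algorithm in Python/NumPy; it is a stand-in with its own conditions, not necessarily the exact ones used here.

    import numpy as np

    def zhang_suen_thin(img):
        """Thin a binary image (1 = dark/ink pixel, 0 = background) to a
        one-pixel-wide skeleton using the two-pass Zhang-Suen algorithm."""
        img = img.astype(np.uint8).copy()
        changed = True
        while changed:
            changed = False
            for step in (0, 1):                    # the two passes
                to_delete = []
                rows, cols = img.shape
                for r in range(1, rows - 1):
                    for c in range(1, cols - 1):
                        if img[r, c] == 0:
                            continue
                        # Neighbors p2..p9, clockwise from the pixel above.
                        p = [img[r-1, c], img[r-1, c+1], img[r, c+1],
                             img[r+1, c+1], img[r+1, c], img[r+1, c-1],
                             img[r, c-1], img[r-1, c-1]]
                        b = sum(p)                 # number of dark neighbors
                        # Number of 0 -> 1 transitions around the neighborhood.
                        a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                        if not (2 <= b <= 6 and a == 1):
                            continue
                        if step == 0 and p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0:
                            to_delete.append((r, c))
                        elif step == 1 and p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0:
                            to_delete.append((r, c))
                for r, c in to_delete:             # conditional deletion of the
                    img[r, c] = 0                  # marked boundary pixels
                    changed = True
        return img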
Some results of our thinning algorithm are shown in figs. 5 and 6. It is evident from these results that the input characters have been reduced to one-pixel width.
Before analyzing any character, it is important to identify the (pixel) boundaries of that character; that is, a bounding box has to be identified for each character. For this, we first calculate the row-wise horizontal projection of the whole document and identify the start and end positions of each line from the valleys in the projection. Having found the line boundaries, we then compute the vertical projection of each line, whose valleys give the boundaries of each character in the line [1]. Since all the characters in a line may not be of equal height, what we have at this point is only an approximate bounding rectangle for each character. (In our definition, a bounding box exactly encloses the character, while a bounding rectangle may contain blank lines above or below the character and hence can result in incorrect moment estimates.)
The bounding rectangle is then used to identify the exact bounding box. This is a crucial step in moment-based recognition methods; the box can be calculated from a row-wise projection of the bounding rectangle, which reveals the blank rows above and below the character.
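A minimal sketch of this projection-based segmentation (Python/NumPy; assuming a binary page with 1 = ink and 0 = background, so that zero-valued valleys in the projections separate lines and characters):

    import numpy as np

    def runs(projection):
        """Return (start, end) pairs of the non-zero runs in a projection,
        i.e. the segments between its zero-valued valleys."""
        segments, start = [], None
        for i, v in enumerate(projection):
            if v > 0 and start is None:
                start = i
            elif v == 0 and start is not None:
                segments.append((start, i))
                start = None
        if start is not None:
            segments.append((start, len(projection)))
        return segments

    def character_boxes(page):
        """Return exact bounding boxes (top, bottom, left, right) for each
        character on a binary page (1 = ink)."""
        boxes = []
        for top, bottom in runs(page.sum(axis=1)):        # row-wise projection -> lines
            line = page[top:bottom]
            for left, right in runs(line.sum(axis=0)):    # vertical projection -> characters
                rect = line[:, left:right]                # approximate bounding rectangle
                ink_rows = np.flatnonzero(rect.sum(axis=1))  # trim blank rows to get
                boxes.append((top + ink_rows[0],             # the exact bounding box
                              top + ink_rows[-1] + 1, left, right))
        return boxes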
Each character to be recognized is expressed as a point in an N-dimensional vector space, where N is the number of features considered. Here, we used a 6-dimensional feature space (N = 6), where the features are the centroid coordinates (x, y) of the character and four moment-based features (f1 through f4), as used in the error measure below.
By including more features for each character in our analysis, we can increase the dimensionality of the feature space, which will help improve the accuracy of the OCR. Let Q_i denote the characters we can recognize; here, we used 31 characters (so max(i) = 31). Table 1 shows the values of these features for each character, together with its corresponding character (TSCII) code.
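As an illustration, a feature-extraction sketch follows (Python/NumPy). The centroid features follow the description above; the four moment features are low-order central moments of our own choosing, since the exact moment definitions are not reproduced in this text.

    import numpy as np

    def extract_features(char):
        """Compute a 6-dimensional feature vector for one character, given
        its binary bounding-box image (1 = ink).

        The centroid (x, y) follows the description above; the four
        central moments below are only an assumed choice of f1..f4."""
        ys, xs = np.nonzero(char)
        n = float(len(xs))
        x_bar, y_bar = xs.mean(), ys.mean()

        def mu(p, q):
            # Central moment of order (p, q), averaged over the ink pixels.
            return (((xs - x_bar) ** p) * ((ys - y_bar) ** q)).sum() / n

        h, w = char.shape
        # Centroid normalized by the box size, plus four assumed moments.
        return np.array([x_bar / w, y_bar / h,
                         mu(2, 0), mu(0, 2), mu(1, 1), mu(3, 0)])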
Once we identify the bounding box for a character, we can calculate its coordinates in our feature space. Let us represent the character to be recognized as a point P in the N-dimensional feature space.
If we represent the properties of each character Q_i in our database with subscript i, then the error distance between the input character and Q_i is

e_i = \sqrt{(x_i - x)^2 + (y_i - y)^2 + (f_{i1} - f_1)^2 + (f_{i2} - f_2)^2 + (f_{i3} - f_3)^2 + (f_{i4} - f_4)^2}

This square root of the sum of squared differences, computed for every Q_i, gives a measure of resemblance to each of the characters we can recognize. The character with the least e_i (the least error distance) is taken as the recognized character. This method is essentially a nearest-neighbor classifier.
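A minimal sketch of this nearest-neighbor step (Python/NumPy; the names `database` and `codes` are hypothetical, standing for the feature vectors of Table 1 and their TSCII codes):

    import numpy as np

    def recognize(features, database, codes):
        """Classify a 6-dimensional feature vector by nearest neighbor.

        database: (31, 6) array of feature vectors Q_i (as in Table 1).
        codes:    the corresponding TSCII character codes.
        Returns the code of the character with the least error distance e_i."""
        errors = np.sqrt(((database - features) ** 2).sum(axis=1))  # e_i for each Q_i
        return codes[int(np.argmin(errors))]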
Once a character is recognized, the result is written to a text file using its corresponding character code. We used the encoding proposed by the TSCII group [4], so the output is easily readable in Tamil using a text editor.
As can be seen from fig. 7, our implementation recognized the characters with sufficient accuracy; the numerical results are summarized as confusion matrices in tables 2 and 3. In the ideal case, values would appear only along the diagonal of a confusion matrix. In our case, some characters were recognized incorrectly. Nevertheless, the accuracy rate is encouragingly within acceptable limits.
Driven by curiosity, we also tested our application with inputs containing characters that are not in our database. Not surprisingly, the characters that were in our database were still identified with sufficient accuracy. Figure 8 shows our test inputs and their corresponding recognized outputs. (Note that characters with a 'dot' above were not included in our database.)
An astute reader will have noticed that some characters in our chosen set closely resemble one another; fig. 9 shows some of these similar characters. The centroids and moments of such characters are similar, which makes them difficult to distinguish. From our results (tables 2 and 3), we can identify some of these cases by their correspondingly high error rates. This problem can only be solved by increasing the number of features, which is not addressed here and is left as future work.
Our efforts to correct skew demanded more time and effort than we anticipated, so we decided to defer skew correction to future work. Possible improvements toward a more robust OCR include a higher-dimensional feature space, character slant correction, and size and font invariance. An extensive statistical analysis of the distribution of letters in documents may also improve the accuracy rate. To recognize characters on a page from a magazine or newspaper, a thorough study of document type recognition would be required. Ultimately, the goal of any OCR is to recognize handwritten characters, which is a highly complex and demanding task.
A simple character recognition application was successfully developed for Tamil characters and was found to perform reasonably well, with sufficient accuracy (results are shown in tables 2 and 3). This is our preliminary study of character recognition systems, and we plan to continue this work in the future. We hope that a more complete public-domain OCR will evolve with this work as a starting point, so that our primary source of motivation will benefit from our efforts.
[1] H. Bunke, P.S.P. Wang, "Handbook of Character Recognition and Document Image Analysis", World Scientific, 1997.
[2] S. Mori, H. Nishida, H. Yamada, "Optical Character Recognition", John Wiley & Sons, 1999.
[3] S.X. Liao, Q. Liu, "A Study of Moment Functions and Its Use in Chinese Character Recognition", Proceedings of the International Conference on Document Analysis and Recognition, vol. 2, pp. 572-575, 1997.
[4] Tamil Standard Code for Information Interchange, http://www.tamil.net/tscii
The source code will be available soon (it needs some cleanup and comments). Until then, please ask the authors directly by email.
Update (Dec. 2, 2003) - The algorithm and references used are documented above. If you are interested in developing an OCR, take a look at: