The main aim of this page is to determine
Edition 1
This document deals with the following topics:
Problems in alphabetic sorting with UNICODE Tamil Encoding

The following table (Table 1) contains all the Tamil characters supported by The Unicode Standard, Version 4.0, in sorted order. The Tamil block in the Unicode chart starts at B80 and extends up to BFF, a total of 128 allocations.

If you look at the Unicode chart for Tamil, the letters are not arranged in the conventional, traditional order that modern Tamil uses. This can be considered the primary issue. Since Unicode is not in a position to rearrange the current Tamil Unicode scheme, it is wise for us at least to standardize the Unicode Tamil ordering/sorting scheme as best we can.

The main flaws in Unicode Tamil ordering are:

Flaw 1: 'ஃ' (aaytham - B83) comes before the vowels.
Soln: The sorting algorithm should place it after ஔ (au - B94).

Flaw 2: The vowel ஔ (au - B94) can also be represented as two separate code points, ஒ (o - B92) + 'ௗ' (`au - BD7). This flexibility creates a lot of problems, because a word that includes the letter ஔ (au) can have two different character sequences.
Soln: A standard has to be devised to eliminate the usage of 'ௗ' (`au - BD7).

Flaw 3: Similarly, the vowel modifiers for ஒ (o - B92), ஓ (O - B93) and ஔ (au - B94), namely 'ொ' (`o - BCA), 'ோ' (`O - BCB) and 'ௌ' (`au - BCC), have alternative representations:
- "consonant + ொ (`o - BCA)" can also be represented as "ெ (`e - BC6) + consonant + ா (`aa - BBE)"
- "consonant + ோ (`O - BCB)" can also be represented as "ே (`E - BC7) + consonant + ா (`aa - BBE)"
- "consonant + ௌ (`au - BCC)" can also be represented as "ெ (`e - BC6) + consonant + ௗ (`au - BD7)"
This flexibility creates a lot of problems, because a word that includes the letter ஒ, ஓ or ஔ can have two different character sequences.
Soln: A standard has to be devised to eliminate the usage of
- "ெ (`e - BC6) + consonant + ா (`aa - BBE)",
- "ே (`E - BC7) + consonant + ா (`aa - BBE)" and
- "ெ (`e - BC6) + consonant + ௗ (`au - BD7)"

Flaw 4: ஜ (ja - B9C) comes between ச (cha - B9A) and ஞ (nYa - B9E), as in the non-Tamil languages and the Grantha script.
Soln: The sorting algorithm should place it in a separate set of characters after the regular Tamil consonants.

Flaw 5: The positions of ன (na - BA9), ற (Ra - BB1), ள (La - BB3) and ழ (zha - BB4) are totally wrong.
Soln: The sorting algorithm should take care of it.

Flaw 6: The order of the Tamil Grantha letters ஜ (ja - B9C), ஶ (Ca - BB6) [not 'cha' - B9A], ஷ (sha - BB7), ஸ (Sa - BB8), ஹ (ha - BB9), க்ஷ (ksha - B95+BCD+BB7) and ஸ்ரீ (sree - BB8+BCD+BB0+BC0) is not even internationally standardized.
Soln: We need to find a unique sort order for these Tamil Grantha letters, and the sorting algorithm should take care of this ordering.

Flaw 7: Tamil digits are positioned after the letters and vowel modifier symbols. This is the most unconventional ordering. Tamil digits should be ordered before the vowels.
Soln: The sorting algorithm should take care of this ordering.

Flaw 8: Normally ௹ (Rs - BF9) is used as a currency symbol in front of numbers (e.g. ௹.5000). So this symbol should be ordered before the Tamil digits.
Soln: The sorting algorithm should take care of this ordering.

Flaw 9: Normally ௺ (`# - BFA) is used as a number symbol (same as the usage of '#') in front of numbers (e.g. ௺5). So this symbol should at least be ordered before the Tamil digits, or be considered to have equal priority with '#' (hash - 0x23).
Soln: We should decide upon this, and the sorting algorithm should take care of this ordering.

Flaw 10: The Tamil chart has no digit zero.
Soln: There are proposal documents on the internet that request the inclusion of zero (depicted as 'O') along with the other Tamil digits in Unicode Tamil. But the document supplied by unicode.org does not have it in the chart.
Flaw 11 (less important): The very special combinatory letter 'ஃப' (fa) is widely used in names (such as fizal, fathima). But it falls in the category of the 'most non-standard letters of Tamil'. Its position in the ordering can be considered to be just after the Tamil Grantha list.
Soln: We should decide upon this, and the sorting algorithm should take care of this ordering.
Soln: We need to initiate Unicode proposals regarding this.
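The duplicate spellings described in Flaws 2 and 3 can be folded into one form before any comparison. The following is only a minimal sketch, assuming the text is stored in Unicode's logical order (consonant first, then the vowel signs, e.g. consonant + BC6 + BBE) and that Unicode's standard NFC normalization is an acceptable way of choosing the single-code-point form. The function name canonical_tamil is my own and not part of any standard.

```python
import unicodedata


def canonical_tamil(text: str) -> str:
    """Return text in NFC, which composes the two-part Tamil vowel forms.

    Assumption: input is in logical order (consonant, then vowel signs).
    A visually ordered sequence (vowel sign typed before the consonant)
    is NOT repaired by this function.
    """
    return unicodedata.normalize("NFC", text)


# Flaw 2: o (B92) + au length mark (BD7) versus the single letter au (B94)
two_part_au = "\u0B92\u0BD7"
single_au = "\u0B94"

# Flaw 3: ka (B95) + e sign (BC6) + aa sign (BBE) versus ka + o sign (BCA)
two_part_ko = "\u0B95\u0BC6\u0BBE"
single_ko = "\u0B95\u0BCA"

print(canonical_tamil(two_part_au) == single_au)
print(canonical_tamil(two_part_ko) == single_ko)
```

With both spellings reduced to the same code points, a sorting routine only has to deal with one sequence per word.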
Please go through the above table. Each number in the first column of the table
represents one block of traditionally sorted letters. They are:
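The ordering rules from the flaw list (Flaws 1 and 4 to 9) can be sketched as a sort key that ranks each code point by its wanted position instead of its Unicode position. This is only a sketch under stated assumptions: the consonant order is the conventional one (ka, nga, cha, nya, Ta, Na, tha, na, pa, ma, ya, ra, la, va, zha, La, Ra, na), the Grantha letters are simply taken in the order they are listed in Flaw 6 (which, as said there, is not standardized), and ranking is per code point, so syllable-level rules are not handled. The names TAMIL_ORDER and tamil_sort_key are my own.

```python
import unicodedata

# Assumed target order, one rank per code point (not syllable-aware).
TAMIL_ORDER = (
    "\u0BF9\u0BFA"                                      # Rs sign, number sign (Flaws 8, 9)
    "\u0BE7\u0BE8\u0BE9\u0BEA\u0BEB\u0BEC\u0BED\u0BEE\u0BEF"  # digits 1 to 9 (Flaw 7)
    "\u0BF0\u0BF1\u0BF2"                                # 10, 100, 1000
    "\u0B85\u0B86\u0B87\u0B88\u0B89\u0B8A"              # vowels a aa i ii u uu
    "\u0B8E\u0B8F\u0B90\u0B92\u0B93\u0B94"              # vowels e E ai o O au
    "\u0B83"                                            # aaytham after au (Flaw 1)
    "\u0B95\u0B99\u0B9A\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8\u0BAA\u0BAE"  # ka .. ma
    "\u0BAF\u0BB0\u0BB2\u0BB5\u0BB4\u0BB3\u0BB1\u0BA9"  # ya ra la va zha La Ra na (Flaw 5)
    "\u0B9C\u0BB6\u0BB7\u0BB8\u0BB9"                    # Grantha ja Ca sha Sa ha (Flaws 4, 6)
    "\u0BCD\u0BBE\u0BBF\u0BC0\u0BC1\u0BC2"              # pulli, signs aa i ii u uu
    "\u0BC6\u0BC7\u0BC8\u0BCA\u0BCB\u0BCC"              # signs e E ai o O au
)

# Ranks start at 1 so they never tie with an unlisted character at B80.
RANK = {ch: i for i, ch in enumerate(TAMIL_ORDER, start=1)}
TAMIL_BASE = 0x0B80


def tamil_sort_key(text: str) -> list:
    """Build a comparison key: listed Tamil characters use the table rank,
    every other character keeps its plain code point order."""
    text = unicodedata.normalize("NFC", text)  # fold duplicate spellings first
    return [(TAMIL_BASE, RANK[ch]) if ch in RANK else (ord(ch), 0) for ch in text]


# Usage: aaytham (B83), au (B94), a (B85), digit one (BE7)
letters = ["\u0B83", "\u0B94", "\u0B85", "\u0BE7"]
print(sorted(letters, key=tamil_sort_key))
```

Keeping the whole order in one string makes it easy to move a group, for example if a different Grantha order is agreed upon later.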
Other major Flaws with UNICODE Tamil Encoding

Apart from the Tamil Unicode ordering problem specified above, there are a lot of grammatical and syntactical flaws in Tamil Unicode. They include
Notes

From various documents and newsgroups, I came to know that the current scheme of Tamil in Unicode can NEVER be changed. The only possibility, according to Unicode, is that we can add some Tamil symbols to fill the free gaps, and even that takes great pain.

The funny and shameful thing here is that the "Department of Information Technology, Government of Tamil Nadu and Tamil Virtual University" and the "Ministry of Information Technology, Government of India" are Full Corporate Members of the Unicode Consortium. I also came to know that these representatives of Indian/Tamil languages have raised very few issues regarding the matters discussed above in the many Unicode Consortium meetings and discussions. We Tamil people are truly the most unlucky to have such irresponsible governments, universities and research institutes, spread all over the world: India (having Tamil as one of the 18 state languages), Singapore (having Tamil as one of the national languages), Sri Lanka (having Tamil as one of the national languages), USA and other places.

Regarding the ordering scheme in Unicode Tamil and the most unnatural way of representing letters and combinatory letters, Unicode has the following points to make.
Some people are even trying to propose a new set of slots in Unicode for the entire set of Tamil letters and symbols, arranged in a proper manner. But I don't know how far they will succeed. If it is done, we are the luckiest.

Although the above topics have been discussed many times in newsgroups, meetings and conferences, there are very few resources on the internet that describe a standard sorting order scheme for Tamil. I heard that Microsoft has supplied the API CompareString, which is capable of comparing two Tamil strings. Is there any specification of which ordering scheme it uses?

References:
Author: R.Padmakumar
Date: 20-Dec-2004