
The main aim of this page is to determine
A GENERALLY ACCEPTED SORT ORDER
for
Unicode Tamil characters

Edition 1
(Throughout this document, the word 'Tamil' is to be pronounced as 'தமிழ்' / 'thamizh'.)

This document deals with the following topics: problems in alphabetic sorting with Unicode Tamil encoding, other major flaws with Unicode Tamil encoding, and notes on the position taken by Unicode.
I started to write this document when I tried to develop an algorithm to compare two Tamil Unicode strings. The corresponding C code is here. It basically uses an array of sorted Tamil Unicode positions and a comparison logic that uses this array to decide the order of the two Unicode Tamil strings being compared.
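The idea of an array of sorted positions plus a comparison routine can be sketched as follows. This is only a minimal Python sketch, not the C code mentioned above: the table `SORTED_TAMIL` and the functions `rank` and `compare_tamil` are hypothetical names, the table covers only the vowels, aaytham and the 18 primary consonants, and every other character is simply assumed to sort after them by code point.

```python
from functools import cmp_to_key

# Hypothetical sort table: code points listed in the traditional order.
SORTED_TAMIL = (
    # 12 vowels: a, aa, i, ii, u, uu, e, ee, ai, o, oo, au
    "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
    # aaytham, placed after au
    "\u0b83"
    # consonants ka, nga, cha, nYa, ta, Na, tha, n^a, pa, ma
    "\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0baa\u0bae"
    # consonants ya, ra, la, va, zha, La, Ra, na
    "\u0baf\u0bb0\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9"
)
RANK = {ch: i for i, ch in enumerate(SORTED_TAMIL)}


def rank(ch):
    # Assumption: characters missing from the table sort after it, by code point.
    return RANK.get(ch, len(RANK) + ord(ch))


def compare_tamil(a, b):
    """Return -1, 0 or 1, like C's strcmp, using the table above."""
    for x, y in zip(a, b):
        rx, ry = rank(x), rank(y)
        if rx != ry:
            return -1 if rx < ry else 1
    # All shared positions are equal: the shorter string comes first.
    return (len(a) > len(b)) - (len(a) < len(b))


letters = ["\u0ba9", "\u0b83", "\u0baa", "\u0b85"]  # na, aaytham, pa, a
in_order = sorted(letters, key=cmp_to_key(compare_tamil))
```

Here `in_order` holds a, aaytham, pa, na, whereas plain code point sorting would give aaytham, a, na, pa. A real implementation would also have to handle vowel modifiers, Grantha letters, digits and symbols.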



Problems in alphabetic sorting with UNICODE Tamil Encoding

The following table (Table 1) contains all the Tamil characters supported by The Unicode Standard, Version 4.0, in sorted order. The Tamil block in the Unicode chart starts at B80 and extends up to BFF, a total of 128 code positions.

If you look at the chart for Tamil in Unicode, the order in which the alphabets are arranged is not at all the conventional, natural, traditional and standard order that modern Tamil uses. This can be considered the primary issue.

Since Unicode is not in a position to rearrange the current Tamil Unicode scheme, it is wise for us at least to standardize the Unicode Tamil ordering/sorting scheme as best we can.

The main flaws in Unicode Tamil ordering are:

Flaw 1: 'ஃ' (aaytham - B83) comes before the vowels.
Soln: Sorting algorithm should arrange this after ஔ (au - B94).

Flaw 2: The vowel ஔ (au - B94) can also be represented as two separate characters, ஒ (o - B92) + 'ௗ' ( `au - BD7). This flexibility creates a lot of problems, because a word that includes the letter ஔ (au) can have two different character sequences.
Soln: A standard has to be devised to eliminate the usage of 'ௗ' ( `au - BD7).

Flaw 3: Similarly, the vowel modifiers for ஒ (o - B92), ஓ (O - B93) and ஔ (au - B94), namely 'ொ' ( `o - BCA), 'ோ' ( `O - BCB) and 'ௌ' ( `au - BCC), are also given this flexibility, as follows:
- "consonant + ொ ( `o - BCA)" can also be represented as "ெ ( `e - BC6) + consonant + ா ( `aa - BBE)"
- "consonant + ோ ( `O - BCB)" can also be represented as "ே ( `E - BC7) + consonant + ா ( `aa - BBE)"
- "consonant + ௌ ( `au - BCC)" can also be represented as "ெ ( `e - BC6) + consonant + ௗ ( `au - BD7)"
This flexibility creates a lot of problems, because a word that includes the letter ஒ, ஓ or ஔ can have two different character sequences.
Soln: A standard has to be devised to eliminate the usage of
- "ெ ( `e - BC6) + consonant + ா ( `aa - BBE)",
- "ே ( `E - BC7) + consonant + ா ( `aa - BBE)" and
- "ெ ( `e - BC6) + consonant + ௗ ( `au - BD7)"
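One possible way to enforce a single spelling for Flaw 2 and Flaw 3 is to normalize every string before comparing it. The following is a minimal Python sketch under that assumption, using the standard `unicodedata` module; the function name `normalize_tamil` is hypothetical. Note that in stored Unicode text both halves of a two-part modifier follow the consonant (for example B95 + BC6 + BBE), even though the first half is displayed to the left of it.

```python
import unicodedata


def normalize_tamil(text):
    """Fold two-part spellings into the single code point (Unicode NFC)."""
    return unicodedata.normalize("NFC", text)


# o (B92) + au length mark (BD7) becomes au (B94)
vowel_au = normalize_tamil("\u0b92\u0bd7")

# ka (B95) + `e (BC6) + `aa (BBE) becomes ka + `o (BCA)
syllable_ko = normalize_tamil("\u0b95\u0bc6\u0bbe")

# Two spellings of the same syllable compare equal after normalization
same = normalize_tamil("\u0b95\u0bc7\u0bbe") == normalize_tamil("\u0b95\u0bcb")
```

With this step in front of the comparison logic, the sort table itself only needs entries for the single-code-point forms.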

Flaw 4: ஜ (ja - B9C) comes in between ச (cha - B9A) and ஞ (nYa - B9E), as in other non-Tamil languages and the Grantha script.
Soln: Sorting algorithm should arrange this in a separate set of characters after the regular Tamil consonants.

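Flaw 5: The positions of ன (na - BA9), ற (Ra - BB1), ள (La - BB3) and ழ (zha - BB4) do not follow the traditional order.
Soln: Sorting algorithm should place them in the traditional sequence, in which the consonants end with ய, ர, ல, வ, ழ, ள, ற, ன.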

Flaw 6: The order of the Tamil Grantha letters ஜ (ja - B9C), ஶ (Ca - BB6) [Not 'cha' - B9A], ஷ (sha - BB7), ஸ (Sa - BB8), ஹ (ha - BB9), க்ஷ (ksha - B95+BCD+BB7) and ஸ்ரீ (sree - BB8+BCD+BB0+BC0) is not even internationally standardized.
Soln: We need to find a unique sort order for these Tamil Grantha letters, and the sorting algorithm should take care of this ordering.
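Because ksha and sree are each made of several code points, a sorting algorithm has to recognize them as single units before it can rank them. A minimal Python sketch of one way to do that (longest match first) is given below; the names `KSHA`, `SREE`, `CLUSTERS` and `tokenize` are hypothetical, and the code point sequences are the ones listed in Flaw 6.

```python
# Multi-code-point Grantha letters, as given in Flaw 6
KSHA = "\u0b95\u0bcd\u0bb7"        # B95 + BCD + BB7
SREE = "\u0bb8\u0bcd\u0bb0\u0bc0"  # BB8 + BCD + BB0 + BC0

# Try longer sequences first so that a shorter one never shadows them
CLUSTERS = sorted([KSHA, SREE], key=len, reverse=True)


def tokenize(text):
    """Split text into sort units: known clusters, else single code points."""
    tokens = []
    i = 0
    while i < len(text):
        for cluster in CLUSTERS:
            if text.startswith(cluster, i):
                tokens.append(cluster)
                i += len(cluster)
                break
        else:
            # No cluster starts here: take one code point
            tokens.append(text[i])
            i += 1
    return tokens


units = tokenize(SREE + "\u0b95" + KSHA)
```

Here `units` has three entries (sree, ka, ksha), so each of them can be given its own rank in whatever Grantha order is finally agreed upon.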

Flaw 7: Tamil digits are positioned after the alphabets and vowel modifier symbols. This is a most unconventional ordering. Tamil digits should be ordered before the vowels.
Soln: Sorting algorithm should take care of this ordering.

Flaw 8: Normally ௹ (Rs - BF9) is used as a currency symbol in front of numbers (e.g. ௹.5000). So, this symbol should be ordered before the Tamil digits.
Soln: Sorting algorithm should take care of this ordering.

Flaw 9: Normally ௺ (`# - BFA) is used as a number symbol (same as the usage of '#') in front of numbers (e.g. ௺5). So, this symbol should at least be ordered before the Tamil digits, or be considered to have equal priority with '#' (hash - 0x23).
Soln: We should decide upon this, and the sorting algorithm should take care of this ordering.

Flaw 10: Tamil digit O (zero - BE6), even though it is not part of the conventional way of representing Tamil numbers, has to be included in the Unicode ordering mechanism to cope with international number representations.
Soln: There are proposal documents on the internet that request the inclusion of zero (depicted as 'O') along with the other Tamil digits in Unicode Tamil. But the chart supplied by unicode.org does not have it.


Flaw 11 (less important): The very special combinatory letter 'ஃப' (fa) is widely used in names (such as fizal, fathima). But it falls in the category of 'most non-standard alphabets of Tamil'. This letter can be positioned in the ordering just after the Tamil Grantha list.
Soln: We should decide upon this, and the sorting algorithm should take care of this ordering.

Flaw 12 (less important): Similarly, ஶ (Ca - BB6) [Not 'cha' - B9A] has to be listed along with the Tamil Grantha alphabets, as it is very much used in spiritual Tamil scripts.
Soln: We need to initiate Unicode proposals regarding this.



(Table 1)

I request you to go through the above table. In it, each number in the first column represents one block of alphabets in the traditional sort order. The blocks are:
  1. Rupees Symbol
  2. Number Symbol
  3. Tamil Digits
  4. Tamil Vowels
  5. Primary Tamil Consonants
  6. Sa Series
  7. Sha Series
  8. Ja Series
  9. Ha Series
  10. Ksha Series
  11. Ca [Not 'cha'] Series
  12. Sree
  13. Tamil Vowel Modifiers
  14. Day Symbol
  15. Month Symbol
  16. Year Symbol
  17. Credit Symbol
  18. Debit Symbol
  19. Ditto Symbol

To the best of my knowledge, the above order is correct.

I request you to review this and suggest the corrections that are to be made to the above sort order. For example, if you feel the order of the Tamil Grantha series is not correct, and you want the new order to be
    Ja, Sa, Sha, Ha, Ksha, Sree and Ca [Not 'cha']
then please let me know the entire new order (with respect to the current block numbers specified above) as follows.
    1, 2, 3, 4, 5, 8, 6, 7, 9, 10, 12, 11, 13, 14, 15, 16, 17, 18, 19
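A suggestion in this numbered form can be applied to the block list mechanically. The following minimal Python sketch shows the idea; `BLOCKS` repeats the nineteen blocks listed above, and the function name `reorder` is hypothetical.

```python
# The nineteen blocks, in the order currently proposed on this page
BLOCKS = [
    "Rupees Symbol", "Number Symbol", "Tamil Digits", "Tamil Vowels",
    "Primary Tamil Consonants", "Sa Series", "Sha Series", "Ja Series",
    "Ha Series", "Ksha Series", "Ca Series", "Sree",
    "Tamil Vowel Modifiers", "Day Symbol", "Month Symbol", "Year Symbol",
    "Credit Symbol", "Debit Symbol", "Ditto Symbol",
]


def reorder(blocks, new_order):
    """Apply a suggestion given as 1-based block numbers."""
    if sorted(new_order) != list(range(1, len(blocks) + 1)):
        raise ValueError("every block number must appear exactly once")
    return [blocks[n - 1] for n in new_order]


# The example suggestion from the text above
suggested = [1, 2, 3, 4, 5, 8, 6, 7, 9, 10, 12, 11, 13, 14, 15, 16, 17, 18, 19]
new_blocks = reorder(BLOCKS, suggested)
grantha = new_blocks[5:12]
```

For the example suggestion, `grantha` comes out as Ja, Sa, Sha, Ha, Ksha, Sree, Ca, which matches the order requested in words. The check inside `reorder` rejects a suggestion that skips or repeats a block number.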

Enter Your Suggestions Here

The above link is a guest book where you can file your suggestions. I will be updating this page with all your valuable suggestions. See the previous suggestions in the guest book. You can also post through the Group or mail to tamilsortorder@yahoo.com



Other major Flaws with UNICODE Tamil Encoding

Apart from the above-specified Tamil Unicode ordering problem, there are a lot of grammatical and syntactical flaws in Tamil Unicode. They include:
  1. Consonants represented in Unicode should be perfect Mey. That is, they should be represented as (க், ங், ச், ...) and not as they are now (க, ங, ச, ...). The current representation violates the grammar. The actual conventional, correct and natural Tamil grammar is as follows.
      க் (k) + ` ( `a) becomes க (ka)
      ச் (ch) + ு ( `u) becomes சு (chu)

    But according to the Unicode coding scheme, the entire grammar is reversed as follows.
      க (ka) + ் (pulli) becomes க் (k)
      ச (cha) + ு ( `u) becomes சு (chu)
      ட (ta) + ோ ( `O) becomes டோ (tO)

    This is totally wrong and a BIG blunder. It is also the most unnatural way of representing the alphabets of the Tamil language. It will create a lot of problems in future in areas such as
    • Tamil Natural language processing
    • Tamil Database processing
    • Tamil Computer language development

  2. 'ஃ' (aaytham - B83) is represented as a vowel modifier in Unicode Tamil. It should be represented as an individual alphabet.
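Software that needs the grammatical view described in point 1 can derive it from the stored Unicode text. The following is a minimal Python sketch of such a conversion, assuming the input uses the single-code-point vowel modifiers; the names `SIGN_TO_VOWEL`, `is_consonant` and `to_mey_uyir` are hypothetical.

```python
PULLI = "\u0bcd"
VOWEL_A = "\u0b85"

# Vowel modifier -> independent vowel (aa, i, ii, u, uu, e, E, ai, o, O, au)
SIGN_TO_VOWEL = {
    "\u0bbe": "\u0b86", "\u0bbf": "\u0b87", "\u0bc0": "\u0b88",
    "\u0bc1": "\u0b89", "\u0bc2": "\u0b8a", "\u0bc6": "\u0b8e",
    "\u0bc7": "\u0b8f", "\u0bc8": "\u0b90", "\u0bca": "\u0b92",
    "\u0bcb": "\u0b93", "\u0bcc": "\u0b94",
}


def is_consonant(ch):
    # Consonants occupy B95 .. BB9 in the Tamil block
    return "\u0b95" <= ch <= "\u0bb9"


def to_mey_uyir(text):
    """Rewrite stored text as a list of Mey (consonant + pulli) and vowels."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if not is_consonant(ch):
            out.append(ch)                    # vowel, aaytham, digit, ...
            i += 1
        elif nxt == PULLI:
            out.append(ch + PULLI)            # already a pure consonant
            i += 2
        elif nxt in SIGN_TO_VOWEL:
            out.extend([ch + PULLI, SIGN_TO_VOWEL[nxt]])
            i += 2
        else:
            out.extend([ch + PULLI, VOWEL_A])  # bare consonant carries 'a'
            i += 1
    return out


# 'thamizh' (BA4 BAE BBF BB4 BCD) -> th, a, m, i, zh
parts = to_mey_uyir("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd")
```

The result lists each Mey followed by its vowel, which is the "k + a becomes ka" view of the grammar, without requiring any change to the stored encoding.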



Notes

From various documents and newsgroups, I came to know that the current scheme of Tamil in Unicode can NEVER be changed. The only possibility, according to Unicode, is that some Tamil symbols can be added to fill the free gaps, and that too with great difficulty.

The funny and shameful thing here is that the "Department of Information Technology, Government of Tamil Nadu and Tamil Virtual University" and the "Ministry of Information Technology, Government of India" are Full Corporate Members of the Unicode Consortium. I also came to know that these representatives of Indian/Tamil languages have raised very few issues regarding the above-discussed matters in the Unicode Consortium meetings and discussions. We Tamil people are most unlucky to have such irresponsible governments, universities and research institutes, spread all over the world: India (having Tamil as one among the 18 state languages), Singapore (having Tamil as one of the national languages), Sri Lanka (having Tamil as one of the national languages), the USA and other places.

Regarding the ordering scheme in Unicode Tamil and the most unnatural way of representing alphabets and combinatory alphabets, Unicode makes the following points.
  1. The Unicode ordering scheme has nothing to do with the natural and traditional ordering scheme of any language.
  2. It is the responsibility of the algorithm that uses Unicode to fix the sort order.
  3. It is well accepted that very few of the languages covered by the Unicode standard enjoy their natural and traditional ordering.
  4. A lot of software and documents have already implemented the current Unicode standard, so it is very hard to introduce any change to the order in Unicode. Organizations like Microsoft (a Full Corporate Member of Unicode) are expressing complete opposition to change in Unicode. (Organizations that promote Tamil computing have given very little support to changing Unicode.)
  5. Unicode has adopted the ISCII coding scheme developed and recommended by the Department of Electronics, Government of India. (Tamil scholars were not consulted during this recommendation.)
They themselves agree with this to an extent.

Some people are even trying to propose a new set of slots in Unicode for the entire set of Tamil alphabets and symbols, arranged in a perfect manner. But I don't know how far they will succeed. If it is done, we are the luckiest.

Although the above topics have been discussed many times in newsgroups, meetings and conferences, there are very few resources on the internet that describe a standard sorting order scheme for Tamil.

I heard that Microsoft has supplied the API CompareString, which is capable of comparing two Tamil strings. Is there any specification of which ordering scheme it uses?


Please post your comments and suggestions through any of the following means:

Suggestions


Author: R.Padmakumar
Date: 20-Dec-2004