
The main aim of this page is to determine
A GENERALLY ACCEPTED SORT ORDER
for
Tamil Alphabets

Edition 2
[Throughout this document, the word 'Tamil' is to be pronounced as 'தமிழ்' / 'thamizh']

This document covers the following topics:
  1. Introduction
  2. Problems in Tamil Collation Suggested By INFITT
  3. Problems in UNICODE Tamil Encoding
  4. Problems in UNICODE Tamil Collation
  5. Notes on Tamil in UNICODE
  6. UNICODE Tamil - NEW
  7. Conclusion
  8. Suggestions
  9. Previous Version Of This Page

1. Introduction

I started writing this document while developing an algorithm to compare two Tamil Unicode strings. The corresponding C Code is HERE. It uses an array of Tamil Unicode code points arranged in sorted order, together with comparison logic that uses this array to rank the two Tamil strings being compared.

A typical sorting algorithm has comparison logic at its core. Comparing two character strings means comparing, one by one and in sequence, the characters at the same index in both strings. Comparing individual characters normally means comparing their positional values, as given either by the character encoding order or by the standard collation order of the language concerned.
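As a minimal sketch of the table-driven comparison described above (this is not the C code linked from this page), the following assumes that strings are held as zero-terminated arrays of Unicode code points. The names tamil_order, tamil_rank and tamil_compare are hypothetical. The table covers only the vowels, aaytham, the consonants and the Grantha letters, in the order argued for in Section 3, and the treatment of every other code point is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed traditional order: vowels, aaytham, consonants, Grantha letters. */
static const uint32_t tamil_order[] = {
    0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,  /* a aa i ii u uu     */
    0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94,  /* e ee ai o oo au    */
    0x0B83,                                          /* aaytham            */
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,  /* first 6 consonants */
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,  /* next 6 consonants  */
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,  /* la va zha La Ra na */
    0x0B9C, 0x0BB6, 0x0BB7, 0x0BB8, 0x0BB9           /* ja Ca sha Sa ha    */
};
#define TAMIL_ORDER_LEN (sizeof tamil_order / sizeof tamil_order[0])

/* Listed letters get consecutive ranks starting at 0x0B80. Code points
   below the Tamil block keep their value; all others are shifted up so
   that they sort after the listed letters without colliding with them. */
static uint32_t tamil_rank(uint32_t cp)
{
    for (size_t i = 0; i < TAMIL_ORDER_LEN; i++)
        if (tamil_order[i] == cp)
            return 0x0B80u + (uint32_t)i;
    return cp < 0x0B80u ? cp : cp + 0x80u;
}

/* Returns -1, 0 or 1, comparing characters at the same index by rank. */
int tamil_compare(const uint32_t *a, const uint32_t *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0 && *b != 0) {
        uint32_t ra = tamil_rank(*a), rb = tamil_rank(*b);
        if (ra != rb)
            return ra < rb ? -1 : 1;
        a++;
        b++;
    }
    if (*a == *b)
        return 0;               /* both strings ended together */
    return *a == 0 ? -1 : 1;    /* the shorter string sorts first */
}
```

With this table, aaytham (B83) compares greater than the vowel a (B85) even though its code point is smaller, which is the behaviour a raw code point comparison cannot give.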

For Tamil there are four main encoding and collation standards: Unicode, TSCII, TAB and TAM. Of these, Unicode and TSCII are widely used on the internet and in Tamil software. TAB and TAM are used mainly in Tamil Nadu Government web pages and software.

As Unicode proposes a universal encoding and collation scheme for all internationally recognized written languages, this document deals mainly with Unicode's encoding and collation of Tamil. The Tamil block in the Unicode chart starts at B80 and extends up to BFF, a total of 128 allocations. The problems with the Unicode encoding standard for Tamil and the Unicode collation standard for Tamil are discussed in detail in the following sections.
Since Unicode is not in a position to rearrange the current Tamil Unicode encoding scheme, in which the alphabetic order is neither proper nor in the most accepted traditional manner, it is wise for us at least to standardize the Unicode Tamil collation scheme as best we can.

A collation standard for a language should closely follow the exact and most widely accepted order of that language's alphabets and symbols. For Tamil, the current international collation standard recommended by Unicode does not adhere to the accepted sequence of Tamil alphabets and symbols. Meanwhile, many standards-proposing organizations have recommended various Tamil collations, by releasing a font or software that uses a specific Tamil collation, or by publishing their proposed Tamil collation standard on the internet.

INFITT (International Forum for Information Technology in Tamil) is one such organization. It proposes a standard collation of Tamil alphabets for Unicode through a Government Order of the Tamil Nadu Government. The following section discusses this proposed collation standard.


2. Problems in Tamil Collation Suggested By INFITT

The following table (Table 1) contains all the Tamil alphabets and symbols supported by The Unicode Standard, Version 4.0, in the collation proposed by INFITT.


Table 1

The above table follows a collation standard proposed by INFITT (Resource-1, Resource-2). INFITT, along with the Tamil IT Task Force of the Tamil Nadu Government, has already made specific recommendations on Tamil collation, and the Tamil Nadu Government in turn is expected to issue a specific GO (Government Order) on this. It is also expected that Unicode will soon take this GO into account and make the necessary changes in the collation chart for Tamil.

Even though this proposed scheme looks good, it has some minor problems. They are:

Problem 1: The special combinatory alphabet 'ஃப' (fa) is widely used in names (such as fizal, fathima, fax). If it is not included in the Tamil collation standard, sorting may place 'ஃப' just after 'ஃ'.
Soln: The position of this combinatory alphabet in the sequence can be considered along with the Tamil-Grantha list (as shown in Table 2 or Table 3 below). We should decide on this, and the sorting algorithm should take care of this ordering.

Problem 2: Vowel modifier symbols are not included in this collation. But vowel modifiers are part of the Tamil symbol set, as they can appear in text on their own, without a preceding consonant. Moreover, they are already listed in the current Tamil Unicode collation.
Soln: The vowel modifiers can be listed after the Tamil-Grantha alphabets (as shown in Table 2 or Table 3 below).


Table 2



Table 3

Problem 3: Though the currency symbol ௹ (Rs - BF9) and '$' (dollar - 0x24) serve the same purpose, they are far apart in positional value and in every collation standard. The current collation orders these two symbols far away from each other.
Soln: The currency symbol ௹ (Rs - BF9) can be normalized with '$' (dollar - 0x24), that is, given equal or adjacent priority. The ordering can be '$' < '௹'. We should decide on this, and the sorting algorithm should take care of this ordering. A proposal can be initiated to include this in the Unicode Tamil Normalization Chart and the Unicode Currency Symbol Normalization Chart.

Problem 4: Though the number symbol ௺ (`# - BFA) and '#' (hash - 0x23) serve the same purpose, they are far apart in positional value and in every collation standard. The current collation orders these two symbols far away from each other.
Soln: The number symbol ௺ (`# - BFA) can be normalized with '#' (hash - 0x23), that is, given equal or adjacent priority. The ordering can be '#' < '௺'. We should decide on this, and the sorting algorithm should take care of this ordering. A proposal can be initiated to include this in the Unicode Tamil Normalization Chart.

Problem 5: Though the Tamil digits and 0, 1, 2 .. 9 have the same meaning, they are far apart in positional value and in every collation standard. The current collation orders these symbols far away from each other.
Soln: The Tamil digit symbols can be normalized with '0', '1', '2' .. '9', that is, given equal or adjacent priority. The ordering can be '0' < Tamil-0, '1' < Tamil-1, ... '9' < Tamil-9. We should decide on this, and the sorting algorithm should take care of this ordering. A proposal can be initiated to include this in the Unicode Tamil Normalization Chart.

Here Problem 1 and Problem 2 deal with including the missed-out Tamil alphabets and symbols, and Problem 3, Problem 4 and Problem 5 deal with proposing a standard for the normalization of certain symbols.
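One possible way to obtain the adjacent priorities suggested in Problems 3 to 5 is to derive the sort key of each Tamil symbol from its ASCII counterpart. The sketch below is only an illustration under stated assumptions: adjacent_key is a hypothetical name, the key layout (twice the base code point, plus one for the Tamil form) is one choice among many, and Tamil digit zero is taken to be BE6, a position that appears to have been assigned only after Version 4.0.

```c
#include <assert.h>
#include <stdint.h>

/* Sort key that places each Tamil digit or symbol immediately after its
   ASCII counterpart: '0' < Tamil-0 < '1' < Tamil-1 ..., '$' < Rs, '#' < No. */
uint32_t adjacent_key(uint32_t cp)
{
    uint32_t base = cp;   /* code point whose slot this character shares */
    uint32_t tamil = 0;   /* 1 when cp is the Tamil form of the base     */

    assert(cp <= 0x10FFFFu);              /* valid code point, no overflow */
    if (cp >= 0x0BE6u && cp <= 0x0BEFu) { /* Tamil digits 0..9 (assumed)   */
        base = (uint32_t)'0' + (cp - 0x0BE6u);
        tamil = 1;
    } else if (cp == 0x0BF9u) {           /* Tamil rupee sign              */
        base = (uint32_t)'$';
        tamil = 1;
    } else if (cp == 0x0BFAu) {           /* Tamil number sign             */
        base = (uint32_t)'#';
        tamil = 1;
    }
    return base * 2u + tamil;
}
```

Comparing two characters by adjacent_key keeps every other character in code point order and only moves the Tamil digits and the two symbols next to their ASCII equivalents. Treating the pairs as fully equal instead would simply mean dropping the added one.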

Please post your comments and suggestions through any of the following means:
Guest Book
Group
Email: tamilsortorder@yahoo.com


3. Problems in Alphabetic Sorting With UNICODE Tamil Encoding

The following table (Table 4) contains all the Tamil characters supported by The Unicode Standard, Version 4.0, sorted by their positional values. The Tamil block in the Unicode chart starts at B80 and extends up to BFF, a total of 128 allocations.


Table 4

As the above table shows, the order in which the alphabets are arranged in the Unicode Tamil chart is not the conventional, natural and traditional order that modern Tamil uses. Even though Unicode is not in a position to rearrange the Tamil encoding sequence, one should know the major problems and flaws in the current Unicode Tamil encoding sequence, and why it is not suitable for direct use as the basis of a comparison or sorting algorithm. The main problems in the Unicode Tamil encoding are:

Problem 1: 'ஃ' (aaytham - B83) comes before the vowels. It should come after the vowels.

Problem 2: ஜ (ja - B9C) comes between ச (cha - B9A) and ஞ (nYa - B9E), as in other non-Tamil languages and the Grantha script. It should be listed with the Tamil-Grantha alphabets, in the following order: ஜ (ja - B9C), ஶ (Ca - BB6) [not 'cha' - B9A], ஷ (sha - BB7), ஸ (Sa - BB8) and ஹ (ha - BB9).

Problem 3: The positions of ன (na - BA9), ற (Ra - BB1), ள (La - BB3) and ழ (zha - BB4) are wrong. The correct order is ழ (zha - BB4), ள (La - BB3), ற (Ra - BB1), ன (na - BA9).

Problem 4: Tamil digits are positioned after the alphabets and vowel modifier symbols. This is a most unconventional order. Tamil digits should be ordered before the vowels.

Problem 5: The symbols ௹ (Rs - BF9) and ௺ (`# - BFA) are normally used in front of numbers, as a currency symbol and a number indicator respectively (e.g. ௹.5000; ௺5). So these symbols should be ordered before the Tamil digits.

Problem 6: The vowel modifier (pulli) should come just before .

Problem 7: Apart from the problems specified above, there are some grammatical and syntactical flaws in Tamil Unicode.
Consonants represented in Unicode should be pure Mey (consonants without a vowel). That is, they should be encoded as (k), (ch), ... and not, as they are now, as (ka), (cha), ... The current representation violates the grammar. The conventional and natural Tamil grammar is as follows.
    (k) + (`a) becomes (ka)
    (ch) + (`u) becomes (chu)
But in the Unicode coding scheme the grammar is reversed, as follows.
    (ka) + (pulli) becomes (k)
    (cha) + (`u) becomes (chu)
    (ta) + (`O) becomes (tO)
This is wrong, and a big blunder. It is also a most unnatural way of representing the Tamil alphabets. It will create many problems in future in areas such as:
  • Tamil Natural language processing
  • Tamil Database processing
  • Tamil Computer language development
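To illustrate what the reversed composition described above means for text processing code, the sketch below recovers the vowel of a syllable from its Unicode sequence: no following sign means the inherent a, a pulli means no vowel at all, and a vowel sign maps to the matching independent vowel. The function names are hypothetical, and only single-part and precomposed vowel signs are handled.

```c
#include <assert.h>
#include <stdint.h>

#define TAMIL_PULLI   0x0BCDu
#define TAMIL_VOWEL_A 0x0B85u

/* Maps a dependent vowel sign to its independent vowel, or 0 if the
   code point is not a Tamil vowel sign. */
static uint32_t vowel_of_sign(uint32_t sign)
{
    switch (sign) {
    case 0x0BBE: return 0x0B86;  /* aa */
    case 0x0BBF: return 0x0B87;  /* i  */
    case 0x0BC0: return 0x0B88;  /* ii */
    case 0x0BC1: return 0x0B89;  /* u  */
    case 0x0BC2: return 0x0B8A;  /* uu */
    case 0x0BC6: return 0x0B8E;  /* e  */
    case 0x0BC7: return 0x0B8F;  /* ee */
    case 0x0BC8: return 0x0B90;  /* ai */
    case 0x0BCA: return 0x0B92;  /* o  */
    case 0x0BCB: return 0x0B93;  /* oo */
    case 0x0BCC: return 0x0B94;  /* au */
    default:     return 0;
    }
}

/* Vowel carried by a consonant and the code point that follows it:
   0 for a pure consonant (pulli), the inherent a when no sign follows. */
uint32_t syllable_vowel(uint32_t consonant, uint32_t next)
{
    uint32_t v;

    assert(consonant >= 0x0B95u && consonant <= 0x0BB9u);
    if (next == TAMIL_PULLI)
        return 0;
    v = vowel_of_sign(next);
    return v != 0 ? v : TAMIL_VOWEL_A;
}
```

Under an encoding with pure Mey as the base unit, the pulli case would be the default and the inherent a would need the explicit mark, which is the arrangement this section argues for.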
It seems that it is too late to discuss the ordering of the Tamil Unicode encoding scheme. So it is better for us at this point to do either of the following.
  1. Standardize the collation for Tamil in Unicode. This is discussed in Section 2 and Section 4.
  2. Create a new block in Unicode for Tamil that has the complete set of alphabets and symbols in exact sorted order. This is discussed in Section 6.


4. Problems in Alphabetic Sorting With UNICODE Tamil Collation

The following table (Table 5) contains all the Tamil characters supported by The Unicode Standard, Version 4.0, sorted according to the collation standard proposed by Unicode for Tamil.


Table 5


The main problems in the Unicode Tamil collation are:

Problem 1: The collation of 'ஃ' (aaytham - B83) is wrong.
Soln: The sorting algorithm should arrange it after ஔ (au - B94).

Problem 2: ஜ (ja - B9C) comes between ச (cha - B9A) and ஞ (nYa - B9E), as in other non-Tamil languages and the Grantha script.
Soln: The sorting algorithm should arrange it in a separate set of characters after the regular Tamil consonants.

Problem 3: The collation of ன (na - BA9), ற (Ra - BB1), ள (La - BB3) and ழ (zha - BB4) is wrong.
Soln: The sorting algorithm should take care of it, ordering them as ழ (zha), ள (La), ற (Ra), ன (na).

Problem 4: The collation of the Tamil Grantha alphabets ஜ (ja - B9C), ஶ (Ca - BB6) [not 'cha' - B9A], ஷ (sha - BB7), ஸ (Sa - BB8) and ஹ (ha - BB9) is not even internationally standardized.
Soln: The sort order given here for these Tamil Grantha alphabets can be followed, and the sorting algorithm should take care of this ordering.

Problem 5: The collation of the Tamil Grantha combinatory alphabets க்ஷ (ksha - B95+BCD+BB7) and ஸ்ரீ (sree - BB8+BCD+BB0+BC0) is not discussed at all.
Soln: The sort order given here for these Tamil Grantha combinatory alphabets can be followed, and the sorting algorithm should take care of this ordering.

Problem 6: As the special combinatory alphabet 'ஃப' (fa - B83+BAA) is widely used in names (such as fizal, fathima), it should be included in the collation standard.
Soln: This alphabet can be positioned just after the Tamil Grantha list. We should decide on this, and the sorting algorithm should take care of this ordering.
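For a sorting algorithm to give the combinatory alphabets of Problems 5 and 6 their own positions, it first has to recognise each multi-code-point sequence as a single unit. The sketch below shows only that recognition step, under the assumption that strings are zero-terminated arrays of code points. The names contractions and element_length are hypothetical, and the rank assigned to each unit is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequences to be treated as one unit, each zero-terminated. */
static const uint32_t contractions[][5] = {
    { 0x0BB8, 0x0BCD, 0x0BB0, 0x0BC0, 0 },  /* sree */
    { 0x0B95, 0x0BCD, 0x0BB7, 0 },          /* ksha */
    { 0x0B83, 0x0BAA, 0 }                   /* fa   */
};
#define N_CONTRACTIONS (sizeof contractions / sizeof contractions[0])

/* Number of code points forming the next unit of s: the length of a
   matching sequence from the table, 1 for any other character, and 0
   at the end of the string. */
size_t element_length(const uint32_t *s)
{
    assert(s != NULL);
    if (s[0] == 0)
        return 0;
    for (size_t c = 0; c < N_CONTRACTIONS; c++) {
        size_t i = 0;
        /* Stops at the first mismatch, including the end of s. */
        while (contractions[c][i] != 0 && s[i] == contractions[c][i])
            i++;
        if (contractions[c][i] == 0)
            return i;
    }
    return 1;
}
```

A comparison routine could call element_length at each step and look up a rank for the whole unit, for example one placed just after the Grantha list as the solution to Problem 6 suggests.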

Problem 7: Tamil digits are positioned after the alphabets and vowel modifier symbols. This is a most unconventional ordering. Tamil digits should be ordered before the vowels.
Soln: The sorting algorithm should take care of this ordering.

Problem 8: Normally ௹ (Rs - BF9) is used as a currency symbol (like '$') in front of numbers (e.g. ௹.5000). So this symbol should be ordered before the Tamil digits, or be given equal priority with '$' (dollar - 0x24).
Soln: We should decide on this, and the sorting algorithm should take care of this ordering.

Problem 9: Normally ௺ (`# - BFA) is used as a number symbol (like '#') in front of numbers (e.g. ௺5). So this symbol should at least be ordered before the Tamil digits, or be given equal priority with '#' (hash - 0x23).
Soln: We should decide on this, and the sorting algorithm should take care of this ordering.

Problem 10: The vowel ஔ (au - B94) can also be represented as two separate code points, ஒ (o - B92) + 'ௗ' (`au - BD7). This flexibility creates many problems, because a word that includes the letter ஔ (au) can have two different character sequences.
Soln: Though normalization specifications exist, a standard has to be devised to eliminate the use of 'ௗ' (`au - BD7).

Problem 11: Similarly, the vowel modifiers for ஒ (o - B92), ஓ (O - B93) and ஔ (au - B94), namely 'ொ' (`o - BCA), 'ோ' (`O - BCB) and 'ௌ' (`au - BCC), have alternative representations, as follows:
- "consonant + ொ (`o - BCA)" can also be represented as "ெ (`e - BC6) + consonant + ா (`aa - BBE)"
- "consonant + ோ (`O - BCB)" can also be represented as "ே (`E - BC7) + consonant + ா (`aa - BBE)"
- "consonant + ௌ (`au - BCC)" can also be represented as "ெ (`e - BC6) + consonant + ௗ (`au - BD7)"
(The sequences are written here in visual order. In stored text the consonant comes first.)
This flexibility creates many problems, because a word that includes the vowel modifier for ஒ, ஓ or ஔ can have two different character sequences.
Soln: Though normalization specifications exist, a standard has to be devised to eliminate the use of
- "ெ (`e - BC6) + consonant + ா (`aa - BBE)",
- "ே (`E - BC7) + consonant + ா (`aa - BBE)" and
- "ெ (`e - BC6) + consonant + ௗ (`au - BD7)"
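Until such a standard exists, a comparison routine can fold the two-part spellings of Problems 10 and 11 into their single-code-point forms before comparing. The sketch below is a hand-written illustration of that folding for the four pairs named above, not a full implementation of Unicode normalization. It assumes zero-terminated arrays of code points in stored order, where the consonant comes first and both vowel sign parts follow it, so the parts to merge are adjacent. The names compose_pair and compose_tamil are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single code point equivalent to the pair (first, second), or 0. */
static uint32_t compose_pair(uint32_t first, uint32_t second)
{
    if (first == 0x0BC6 && second == 0x0BBE) return 0x0BCA;  /* `e + `aa -> `o  */
    if (first == 0x0BC7 && second == 0x0BBE) return 0x0BCB;  /* `E + `aa -> `O  */
    if (first == 0x0BC6 && second == 0x0BD7) return 0x0BCC;  /* `e + `au -> `au */
    if (first == 0x0B92 && second == 0x0BD7) return 0x0B94;  /* o  + `au -> au  */
    return 0;
}

/* Folds the two-part forms in place and returns the new length. */
size_t compose_tamil(uint32_t *s)
{
    size_t out = 0;

    assert(s != NULL);
    for (size_t in = 0; s[in] != 0; in++) {
        uint32_t c = out > 0 ? compose_pair(s[out - 1], s[in]) : 0;
        if (c != 0)
            s[out - 1] = c;     /* merge with the previous code point */
        else
            s[out++] = s[in];   /* keep the code point as it is */
    }
    s[out] = 0;
    return out;
}
```

After this step, two spellings of the same word compare equal code point by code point. Production code would more likely rely on a tested normalization library than on a hand-written table like this one.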

Here Problem 1 to Problem 7 deal mainly with including the missed-out Tamil alphabets and symbols, or with relocating positions in the Unicode standard collation sequence. Problem 8 and Problem 9 deal with the normalization of certain symbols. Problem 10 and Problem 11 deal with eliminating the use of certain duplicate (combinatory) vowel modifiers.


5. Notes on Tamil in UNICODE

From various documents and newsgroups, I came to know that the current encoding scheme of Tamil in Unicode can NEVER be changed. The only possibility, according to Unicode, is to add Tamil symbols in the free gaps, and even that only with great difficulty.

The funny and shameful thing here is that the following organizations have been Full Corporate Members of the Unicode Consortium for a long time.
  • Department of Information Technology, Government of Tamil Nadu and Tamil Virtual University
  • Ministry of Information Technology, Government of India
I also came to know that these representatives of Indian/Tamil languages have raised very few issues regarding the matters discussed above in the Unicode Consortium's meetings and discussions.

Actually we Tamil people are most unlucky to have such irresponsible governments, universities and research institutes, spread across India (where Tamil is one of the 18 state languages), Singapore (where Tamil is one of the national languages), Sri Lanka (where Tamil is one of the national languages), the USA, Germany and the rest of the world.

Regarding the order of the Unicode Tamil encoding scheme and its unnatural way of representing alphabets, combinatory alphabets and Tamil symbols, Unicode makes the following points.
  1. The order in the Unicode encoding scheme has nothing to do with the natural and traditional order of any language.
  2. It is the responsibility of the algorithm that uses the Unicode collation standard to produce the sort order.
  3. It is well accepted that very few languages encoded in Unicode enjoy their natural and traditional order.
  4. A lot of software and documents already implement the current Unicode standard, so it is very hard to introduce any change in the order in Unicode. Organizations like Microsoft (a Full Corporate Member of Unicode) completely oppose changes to Unicode. (Organizations that promote Tamil computing have given very little support to changing the Unicode encoding scheme.)
  5. Unicode adopted the ISCII coding scheme developed and recommended by the Department of Electronics of the Indian Government. (Tamil scholars were not consulted during this recommendation.)

6. UNICODE Tamil - NEW

There are even proposals such as Unicode Tamil-New, proposed by the Working Group (WG08) of the INFITT EC, for an alternative 16-bit encoding scheme for Tamil in the Unicode PUA (Private Use Area, E000 - F8FF). It proposes a two-byte scheme for the entire set of Tamil alphabets and symbols, to be allocated in the PUA positions from E200 to E38F. INFITT (WG08) promises the following for this proposal.
  • Encoding scheme with the correct sorting order
  • The fonts are very easy to develop and easy to implement in all applications without waiting for the respective developers to provide for Tamil support.
  • 40% more efficient than the current Standard Unicode Tamil
INFITT (WG08) further says that:
"The creation of the 16 bits is done in a scientific way. Of the sixteen bits, the first 7 bits indicates the language. The next 5 bits gives the serial number of the consonant part of a Tamil letter. The next 4 bits gives the serial number of the vowel part of a Tamil letter. A zero here means the absence of the consonant or vowel part, that is, it is a pure vowel or pure consonant. Hence, it is extremely easy to see what a letter contains. This simplicity comes from the natural way in, which the coding is designed. Unicode Tamil-New is as efficient as the English in all applications tested so far. It is at least 40% more efficient than the current Unicode Standard ver 4.0. Unicode Consortium may not be concerned about efficiency of encoding. Can Tamil Diaspora afford to pass this avoidable inefficiency to the next generations?" For more on this see Annex 4 of INFITT Executive Director's Report (Sep 03 - Dec 04)

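The bit layout described in the quotation above (7 bits for the language, 5 for the consonant, 4 for the vowel) can be sketched as follows. The prefix 0x71 is not stated in the proposal text quoted here. It is inferred as the top 7 bits of E200, and the inference is consistent with the stated range, since consonant 24 with vowel 15 gives E38F. The function names are hypothetical, and the actual serial numbers assigned to each consonant and vowel are not given on this page, so none are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define TN_PREFIX 0x71u   /* top 7 bits of 0xE200 (inferred, see above) */

/* Packs a consonant serial number (0 = no consonant) and a vowel
   serial number (0 = no vowel) into one 16-bit code. */
uint16_t tn_pack(unsigned consonant, unsigned vowel)
{
    assert(consonant < 32u && vowel < 16u);
    return (uint16_t)((TN_PREFIX << 9) | (consonant << 4) | vowel);
}

/* Field extraction: the language prefix, consonant and vowel parts. */
unsigned tn_language(uint16_t code)  { return (unsigned)code >> 9; }
unsigned tn_consonant(uint16_t code) { return ((unsigned)code >> 4) & 0x1Fu; }
unsigned tn_vowel(uint16_t code)     { return (unsigned)code & 0x0Fu; }
```

Because the consonant field sits above the vowel field, comparing two such codes as plain integers orders letters first by consonant serial number and then by vowel serial number, which is how an encoding of this shape could sort correctly without a separate collation table.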

7. Conclusion

Although the above topics have been discussed many times in newsgroups, meetings and conferences, there are very few resources on the internet that describe a standard sort order for Tamil.

I have heard that Microsoft supplies the API CompareString, which is capable of comparing two Tamil strings. Is there any specification of which collation scheme it uses?



8. Suggestions


Author: R.Padmakumar
Date: 20-Dec-2004