Unicode for Indian Languages


7/28/2019 Unicode for Indian Languages


Existing standards for codes in respect of Indian Scripts

Internal representation of text in Indian languages may be viewed as the problem of assigning codes to the aksharas of the languages. The complexities of the syllabic writing systems in use have made it difficult to standardize internal representations. TeX was an inspiration in the late 1980s, but TeX was suited to typesetting rather than text processing per se. In the absence of appropriate fonts, interactive applications could not be attempted, and when fonts became available, applications simply used the glyph positions as the codes; the number of glyphs was restricted on account of the eight-bit fonts. The following representations still apply, as many applications have been written to use one or the other. It must be remembered that these representations primarily address the issue of internal representation for rendering text.

- Use of Roman letters with diacritic marks
- ISCII codes
- Unicode for Indian scripts
- ISFOC standard from CDAC

Of the above, the first has been discussed in the section on transliteration principles. The ISFOC standard applies more to the standardization of fonts for different scripts and cannot really be thought of as an encoding standard. We confine our discussion in this section to ISCII and Unicode. A brief note on ISFOC will be found in a separate page.

Indian Script Code for Information Interchange (ISCII)

ISCII was proposed in the eighties, and a suitable standard evolved by 1991. Here are the salient aspects of the ISCII representation.

- It is a single representation for all the Indian scripts.
- Codes have been assigned in the upper ASCII region (160-255) for the aksharas of the language.
- The scheme also assigns codes for the matras (vowel extensions).
- Special characters have been included to specify how a consonant in a syllable should be rendered. Rendering of Devanagari has been kept in mind.
- A special attribute character has been included to identify the script to be used in rendering specific sections of the text.

Shown below is the basic assignment in the form of a table. There is also a version of this table known as PC-ISCII, where there are no characters defined in the range 176-223. In PC-ISCII, the first three columns of the ISCII-91 table have been shifted to the starting location of 128. PC-ISCII has been used in many applications based on the GIST Card, a hardware adapter which supported Indian language applications on an IBM PC. In the table, some code values have not been assigned. Six columns of 16 assignments each start at the hexadecimal value A0, which is equivalent to decimal 160.
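The relocation described for PC-ISCII can be sketched in a few lines. This is an illustrative reading of the description above (a plain byte-for-byte shift of the first three 16-value columns); it is not a complete conversion table for the standard.

```python
def iscii91_to_pc_iscii(code: int) -> int:
    """Shift an ISCII-91 code into the PC-ISCII range.

    ISCII-91 places its six columns at 0xA0-0xFF. PC-ISCII moves the
    first three columns (0xA0-0xCF) down to start at 0x80, which is why
    the range 176-223 (0xB0-0xDF) is left empty there. Codes outside
    the shifted columns are returned unchanged in this sketch.
    """
    if 0xA0 <= code <= 0xCF:
        return code - 0x20      # 0xA0 -> 0x80, ..., 0xCF -> 0xAF
    return code

print(hex(iscii91_to_pc_iscii(0xA0)))   # 0x80
print(hex(iscii91_to_pc_iscii(0xCF)))   # 0xaf
```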


The following observations are made.

1. The ISCII code is reasonably well suited for representing the syllables of Indian languages, though one must remember that a multibyte representation is inevitable, which could vary from one byte to as many as 10 bytes for a syllable.

2. The ISCII code has effected a compromise in grouping the consonants of the languages into a common set that does not preserve the true sorting order of the aksharas across the languages. Specifically, some aksharas of Tamil, Malayalam and Telugu are out of place in the assignment of codes.

3. The ISCII code provides for some tricks to be used in representing some aksharas, specifically the Devanagari aksharas representing Persian letters. ISCII uses a concept known as the nukta character to indicate the required akshara.

4. When forming conjuncts, the ISCII specification requires that the halanth character be used once or twice, depending on whether the halanth form or the half form of the consonant is present. This results in more than one internal representation for the same syllable. Also, ISCII provides for the concept of the soft halanth as well as an invisible consonant to handle representations of special letters. Parsing a text string made up of ISCII codes is a fairly complex problem requiring a state machine which is also language dependent. This is a consequence of the observation that languages like Tamil do not support conjuncts made up of three or more differing consonants. In fact, it is stated that Tamil has no conjunct aksharas. What is probably implied here is that a syllable in Tamil is always split into its basic consonants and the matra. Several decades ago, Tamil writing on palm leaves did show geminated consonants in special forms. Though representation at the level of a syllable is possible in ISCII, processing a syllable can become quite complex, i.e., linguistic processing may pose specific difficulties due to the variable length codes for syllables.
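The consequence of allowing the halanth once or twice (one syllable, more than one byte sequence) can be made concrete with a small sketch. The byte values below, 0xB3 for the consonant KA and 0xE8 for the halanth, follow commonly published ISCII-91 tables but should be treated as assumptions here; the point is only that comparing ISCII strings requires normalization first.

```python
KA = 0xB3        # assumed ISCII-91 code for consonant KA
HALANTH = 0xE8   # assumed ISCII-91 code for the halanth

def normalize(codes):
    """Collapse a doubled halanth (explicit halanth form) into a single
    halanth so that both spellings of a conjunct compare equal."""
    out = []
    for c in codes:
        if c == HALANTH and out and out[-1] == HALANTH:
            continue            # drop the repeated halanth
        out.append(c)
    return out

half_form     = [KA, HALANTH, KA]            # conjunct via half form
explicit_form = [KA, HALANTH, HALANTH, KA]   # conjunct with visible halanth

# Different byte strings, the same syllable once normalized:
assert normalize(half_form) == normalize(explicit_form)
```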

5. The code assignments, though language independent, do not admit of clean and error-free transliteration across languages, especially into Tamil from Devanagari.

6. It is difficult to perform a check on an ISCII string to see if arbitrary syllables are present. Though theoretically many syllables are possible, in practice the set is limited to about 600-800 basic syllables, which can also combine with all the vowels. The standard provides for arbitrary syllables to handle cases where new words may be introduced in the language or syllables from other languages are to be handled.

It must be stated here that ISCII represents the very first attempt at syllable-level coding of Indian language aksharas. Unfortunately, outside of CDAC, which promoted ISCII through their GIST technology, very few seem to use ISCII.

ISCII codes have nothing to do with fonts, and a given text in ISCII may be displayed using many different fonts for the same script. This will require specific rendering software which can map the ISCII codes to the glyphs in a matching font for the script. Multibyte syllables will have to be mapped into multiple glyphs in a font-dependent and language-dependent manner. It is primarily this complexity that has rendered ISCII less popular. Details of ISCII are covered in the Bureau of Indian Standards document No. IS:13194-1991.
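The font- and language-dependent mapping from ISCII codes to glyphs can be pictured as a longest-match table lookup. Everything in this sketch (the byte values, the glyph numbers, the font) is hypothetical; it only shows why each font needs its own table and why the matching must prefer the longest code sequence.

```python
# Hypothetical glyph table for one font of one script.
# Keys are tuples of ISCII codes; values are glyph indices in that font.
GLYPHS = {
    (0xB3,): 17,              # lone consonant
    (0xB3, 0xE8): 42,         # half form (consonant + halanth)
    (0xB3, 0xE8, 0xB3): 73,   # a conjunct with its own ligature glyph
}

def render(codes):
    """Greedy longest-match mapping from ISCII codes to glyph indices."""
    glyphs, i = [], 0
    while i < len(codes):
        for length in range(len(codes) - i, 0, -1):
            key = tuple(codes[i:i + length])
            if key in GLYPHS:
                glyphs.append(GLYPHS[key])
                i += length
                break
        else:
            raise ValueError(f"no glyph for code {codes[i]:#x}")
    return glyphs

assert render([0xB3, 0xE8, 0xB3]) == [73]   # ligature wins over pieces
assert render([0xB3, 0xB3]) == [17, 17]
```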

    Shown below are some examples of strings in Devanagri and other scripts along with their ISCII representations.


Unicode for Indian Languages

Unicode was the first attempt at producing a standard for multilingual documents. Unicode owes its origin to the concept of the ASCII code extended to accommodate international languages and scripts.

Short character codes (7 bits or 8 bits) are adequate to represent the letters of the alphabets of many languages of the world. The fundamental idea behind Unicode is that a superset of characters from all the different languages/scripts of the world be formed, so that a single coding scheme could effectively handle almost all the alphabets of all the languages. What this implies is that the different scripts used in the writing systems followed by different languages be accommodated in the coding scheme. In Unicode, more than 65000 different characters can be referenced. This large set includes not only the letters of the alphabet from many different languages of the world but also punctuation, special shapes such as mathematical symbols, currency symbols, etc. The term Code Space is often used to refer to the full set of codes, and in Unicode the code space is divided into consecutive regions spanning typically 128 code values. Essentially this assignment retains the ordering of the characters within the assigned group and is therefore very similar to the ASCII assignments which were in vogue earlier.
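The 128-value regions mentioned above are easy to observe: since each region spans 128 consecutive code values, clearing the low seven bits of a code point gives the base of its region. A quick check in Python (the standard `unicodedata` module supplies the character names):

```python
import unicodedata

def region_base(ch: str) -> int:
    """Base of the 128-value region a character's code point falls in."""
    return ord(ch) & ~0x7F

for ch in ("A", "अ", "க"):   # Latin, Devanagari and Tamil letters
    print(f"U+{ord(ch):04X}  region base U+{region_base(ch):04X}  "
          f"{unicodedata.name(ch)}")
```

For instance, the Devanagari letter अ (U+0905) falls in the region based at U+0900, and the Tamil letter க (U+0B95) in the region based at U+0B80.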

Unicode assignments may be viewed geometrically as a stack of planes, each plane having one and possibly multiple chunks of 128 consecutive code values. Logically related characters or symbols have been grouped together in Unicode to span one or more regions of 128 code values. We may view these regions as different planes in the code space, as illustrated in the figure below. Data processing software using Unicode will be able to identify the language of the text for each character by identifying the plane the character is located in, and use an appropriate font to display the same or invoke some meaningful linguistic processing.

Technically, Unicode can handle many more languages than the supported scripts if these languages use the same script in their writing systems. By consolidating a complete set of symbols used in the writing systems across a family of languages, one can get a script that caters to all of them. The Latin script, with its supplementary characters and extended symbols, has about 550 different characters, and this is quite adequate to handle almost anything that has appeared in print in respect of the Latin script. Hence, in the geometrical view above, some planes may be larger (wider) than others, and more than one script could have characters from logically similar groups specified in a plane.

The fact that several languages/scripts of the world require many more than 128 codes has necessitated assignments of more than one basic plane (i.e., multiples of 128 code values) for them. Languages such as Greek, Arabic or Chinese have larger planes assigned to them. In particular, Unicode has allowed nearly 20000 characters of the Chinese, Japanese and Korean scripts to be included in a contiguous region of the code space. Currently, fewer than a hundred different groups of symbols or specific scripts are included in Unicode.

Even though it is a sixteen-bit code and can therefore handle more than 65000 code values, Unicode should not be viewed as a scheme which allows several thousand characters for each and every language. As a general observation, it has provision for fewer than 128 characters for many scripts, since many languages do not require more than 128 characters to display text.

In respect of Indian languages, which use syllabic writing systems, one might think that Unicode would have provided several thousands of codes for the syllables, similar to the nearly 11000 Hangul syllables already included. On the contrary, Unicode has pretty much accepted the concept behind ISCII and has provided only for the most basic units of the writing systems, which include the vowels, consonants and the vowel modifiers.

Unlike ISCII, which has a uniform coding scheme for all the languages, Unicode has provided individual planes for the nine major scripts of India. Within these planes of 128 code values each, assignments are language specific, though the ISCII base has been more or less retained. Consequently, Unicode suffers from the same limitations that ISCII runs into. There are some questionable assignments in Unicode in respect of matras. A matra is not a character by itself; it is a representation of a combination of a vowel and consonant, in other words the representation of a medial vowel. A vowel, and NOT its matra, is the basic linguistic unit. Consequently, linguistic processing will be difficult with Unicode for Indian languages, just as in ISCII.

Here is the Unicode assignment for Sanskrit (Devanagari). The script code for Devanagari is 09 (hex), and so the codes span the range 0901 to 097F (hexadecimal values). In this chart, the characters of Devanagari with a dot beneath are grouped in the range 0958 to 095F. These are the characters used in Hindi which are derived from Persian and seen in Urdu as well. Likewise, in locations 0929, 0931 and 0934 the letters are dotted. The codes are similar to ISCII in ordering, but Unicode includes characters not specified in ISCII. Also, the assignments for each language more or less adhere to the same relative locations for the basic vowels and consonants as in ISCII but include many language-dependent codes. The code positions in Unicode will not exactly match the corresponding ISCII assignments.
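This layout can be verified from any Python prompt. The syllable कि (ki), for instance, is stored as the consonant KA followed by the vowel sign (matra) I, both within the Devanagari range:

```python
import unicodedata

for ch in "कि":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0915  DEVANAGARI LETTER KA
# U+093F  DEVANAGARI VOWEL SIGN I
```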


Shown below are the Unicode representations for some strings in different scripts. These are the same strings shown earlier under ISCII.


From the discussion above, it will be seen that ISCII and Unicode provide multibyte representations for syllables. This is not unlike the case for English and other European languages, where syllables are shown only with the basic letters of the alphabet. However, in all the writing systems used in India, each syllable is individually identifiable through a unique shape, and one has to provide for thousands of shapes while rendering text.

While these thousands of shapes may be composed from a much smaller set of basic shapes for the vowels, consonants and vowel modifiers, one must admit that several hundreds of syllables have unique shapes which cannot be derived by putting together the basic shapes. It is estimated that in practice, more than 600 different glyphs would be required to adequately represent all the different syllables in most of the scripts. The main problem of dealing with Unicode for Indian languages/scripts has to do with the mapping between a multibyte code for a syllable and its displayed shape. This is a very complex issue requiring further understanding of rendering rules. As such, a full discussion of this would require that the viewer understand the intricacies of the writing systems of India. We cover this in a separate page.
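The difficulty of locating syllable boundaries in a multibyte stream can be illustrated with a deliberately naive splitter for Devanagari. This sketch handles only the matras, the three leading signs and the halanth, and ignores the many special cases a real shaping engine must cover:

```python
VIRAMA = "\u094D"   # Devanagari halanth (virama)

def aksharas(text):
    """Naively group a Devanagari string into orthographic syllables.

    A character joins the current cluster if it is a combining mark
    (matra, sign or virama: U+093E-U+094D, U+0901-U+0903) or if the
    cluster currently ends in a virama (a pending conjunct).
    """
    clusters = []
    for ch in text:
        cp = ord(ch)
        combining = 0x093E <= cp <= 0x094D or 0x0901 <= cp <= 0x0903
        if clusters and (combining or clusters[-1].endswith(VIRAMA)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "namaste": six code points, but only three syllables, of one, one
# and four code points respectively.
print(aksharas("नमस्ते"))   # ['न', 'म', 'स्ते']
```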

Specific technical problems with ISCII and Unicode

It must be observed, in the light of the above discussion, that displaying a Unicode string in an Indian language requires a complex piece of processing software to identify the syllables and get the corresponding glyphs from an appropriate font for the script. The multibyte nature of Unicode (for a syllable) makes a table-driven approach to this quite difficult. Even though it is possible to write such modules which can go from Unicode to the display of text using some font, one faces a formidable problem in respect of data entry, where the formation of syllables from multiple key sequences is truly overwhelming. With the limited number of keys available in standard keyboards, it is often not possible to accommodate all the symbols one would require to produce meaningful printouts in each script consistent with quality typesetting systems.

Unicode-based applications employ the concept of "Locales" to permit data entry of multilingual text. Each Locale is associated with its own keyboard mapping, and application software can switch Locales to permit data entry of multilingual text. It will be seen that for Indian scripts, the Locales themselves have limitations, since they do not permit a full complement of letters and special characters to be typed in, much less the standard punctuation that has become part of Indian scripts today.

While it is possible to write special keyboard driver programs which implement a state machine to handle key sequences to produce syllables, the approach is not universal enough to be included in operating systems, certainly not when a single driver should cater to all the Indian scripts. There is no meaning in having a Hindi version of an OS with its own data entry convention which differs substantially from a Tamil or Telugu version.
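A keyboard driver state machine of the kind described can be sketched as follows. The key-to-character table is entirely hypothetical (it is not any real layout); the point is that emitting even one syllable requires remembering the preceding keys:

```python
# Hypothetical phonetic key mapping -- not any real keyboard layout.
CONSONANTS = {"k": "क", "t": "त", "n": "न"}
MATRAS = {"a": "", "i": "ि", "e": "े"}   # "a" is the inherent vowel
VIRAMA = "\u094D"

def type_keys(keys):
    """Tiny state machine: a consonant key waits to see whether the next
    key is a vowel (emit a matra) or another consonant (emit a virama
    first, forming a conjunct)."""
    out, pending = "", None
    for k in keys:
        if k in CONSONANTS:
            if pending:
                out += pending + VIRAMA   # start a consonant cluster
            pending = CONSONANTS[k]
        elif k in MATRAS and pending:
            out += pending + MATRAS[k]
            pending = None
    if pending:
        out += pending
    return out

print(type_keys("ki"))    # कि
print(type_keys("kta"))   # क्त  (conjunct via virama)
```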

Here is a summary of the issues that confront us when dealing with Unicode for Indian scripts.

- Rendering text in a manner that is uniform across applications is quite difficult. Windowing applications with cut/copy/paste features suffer due to problems in correctly identifying the width of each syllable on the screen. Also, applications have to worry about specific rendering issues when modifier codes are present. How applications run into difficulties in rendering even simple strings is illustrated with examples in a separate page.

- Interpreting the syllabic content involves context-dependent processing, that too with a variable number of codes for each syllable.

- A complete set of symbols used in standard printed text has not been included in Unicode for almost all the Indian scripts.

- Displaying text in scripts other than those Unicode supports is not possible. For instance, many of the scripts used in the past, such as the Grantha script, Modi, Sharada, etc., cannot be used to display Sanskrit text. This will be a fairly serious limitation in practice when thousands of manuscripts written over the centuries have to be preserved and interpreted.

- Transliteration across Indian scripts will not be easy to implement, since appropriate symbols currently recommended for transliteration are not part of the Unicode set. In the Indian context, transliteration is very much a requirement.

- The Unicode assignments bear little resemblance to the linguistic base on which the aksharas of Indian scripts are founded. While this is not a critical issue, it is desirable to have codes whose values are based on some linguistic properties assigned to the vowels and consonants, as has been the practice in India.

In a separate web page, we discuss the problems associated with Unicode for linguistic processing of text in Indian languages. Details of Unicode for Indian scripts have been published in the standard available from the Unicode Consortium. The Unicode web site does have useful information, but one will have to resort to the printed text to get the real details. These are also available in PDF format from the above web site.

Is Unicode for Indian Languages meaningless then?

The answer is certainly No. The main purpose of Unicode is to transport information across computer systems. As of today, Unicode is reasonably adequate to do this job, since it does provide for representing text at the syllable level, though not in fixed-size units (bytes).

Applications dealing with Indian languages will have to include a special layer which transforms Unicode text into a more meaningful layer for linguistic or text processing purposes. The point to keep in mind is that the seven-bit ASCII based representation for most world languages serves both purposes well, i.e., not only are text strings transferable across systems, but linguistic processing is consistent with the seven-bit representation. It so happens that the phonetic nature of our Indian languages has necessitated a different representation for linguistic analysis.

With the majority of the languages of the world, which use a relatively small set of symbols to represent the letters of their alphabet, 8-bit (or even 7-bit) character codes are adequate to represent the letters.

    Unicode for Indian Languages: A discussion

Support for Unicode in applications catering to Indian languages is a highly debated issue. Though Unicode has emerged as a viable standard and is finding increasing use all over the world, there are some real difficulties in using it in practice for building applications supporting multilingual user interfaces in Indian languages. The conceptual basis for Unicode, though well accepted for the western languages (scripts), does not fully conform to the linguistic requirements seen in our languages.

At the Systems Development Laboratory, IIT Madras, where some meaningful multilingual solutions consistent with the linguistic requirements for all the Indian languages have been developed and distributed as well, there is a strong feeling that Unicode will not really help. It is true that Unicode is a world standard proposed and accepted by a large community of academics, professionals and users. Unfortunately, it does not really blend with the syllabic writing systems used in India, much less provide the means to express linguistic content without ambiguity and in a manner that ties in well with our own understanding of languages. What we have tried to say here reflects the above view.

Multilingual Computing: A view from SDL

- Introduction
- Viewpoint
- Idiosyncrasies of the writing systems
- Defining linguistic requirements
- Dealing with text consistent with linguistic requirements
- Multilingual computing requirements (for India)

Unicode for Indian Languages

- The conceptual basis for Unicode
- Unicode for Indian languages/scripts
- Data entry and associated problems
- Issues in rendering Unicode
- Using a shaping engine to render Unicode text
- Discussion on sorting or collation
- The conceptual basis of the OpenType font

Unicode support in Microsoft applications

- Uniscribe, the shaping engine
- Limitations of Uniscribe
- A review of some Microsoft applications in respect of handling linguistic content

Recommendations for Developers of Indian language Applications

- Use of TrueType fonts to render Unicode text
- Can we simplify handling Unicode text?
- Guidelines for development under Linux

    Examples of Unicode Rendering by different applications (Windows and Linux)



Summary of Observations

The experiences of the lab in working with Unicode are summarized in the linked page. As of this update (June 2006), one has not seen an application in any of the Indian languages that can be cited as a satisfactory implementation based on Unicode. Though a number of developers are counting on using Unicode, it is not going to be easy to effect localization of our languages consistent with the requirements of computing with Indian languages.

Unicode: A Brief Introduction

Introduction

In the context of internationalization and providing uniformity in the handling of text-based information across the languages of the world, Unicode has gained considerable importance. The fundamental concept behind Unicode is that the text representation retains the linguistic content that must be conveyed, while at the same time providing for this content to be displayed in human-readable form. By catering to both these requirements, Unicode has emerged as the best choice for representing text in a computer application, specifically one that deals with multilingual content. Developers across the world are committing themselves to providing Unicode support in all their applications.

Multilingual information processing is one of the essential requirements when it comes to computerization in India. Here, the development of applications requires that interactive user interfaces in different regional languages be part of each application. A specific regional language may be supported through one or more scripts, despite the fact that a given script may be used for more than one language. A very important issue, from a conceptual angle at least, is whether support for a script is equivalent to supporting a language. During the initial phases of development of applications in Indian languages, one was concerned more with the rendering aspects of text, a formidable problem in itself on account of the syllabic writing system followed for all the Indian languages. No one really felt compelled to take into consideration text processing issues. The majority of the early applications required text entry and display, with computation effected on numbers rather than text per se. It is not surprising, therefore, that whatever standardization was attempted emphasized mostly the aspects of the writing system without really catering to the linguistic requirements.

In essence, the standardization mentioned above (ISCII and Unicode) requires context-dependent processing of each character, as opposed to simple handling of a character by itself. In western scripts, the writing system employs a relatively small set of shapes and symbols, as this is sufficient to satisfy the requirement that linguistic content as well as rendering information be exactly specified through the same set of codes. Consequently, text processing could be comfortably achieved using a small set of codes.

In respect of our languages, the complexities of the writing systems demand that a large number of written shapes (typically in the thousands) be used, though the linguistic content may still be specified using a small set of codes for the vowels and consonants (typically fewer than a hundred). Hence it is not possible to use the same set of codes to satisfy both requirements. In their wisdom, the designers of ISCII and subsequently Unicode essentially struck a compromise where the smaller set of codes was recommended. Yet they yielded to the temptation of incorporating codes that convey rendering information as well. These codes took care of the Devanagari-derived writing systems but do not adequately address the writing systems of the South.

The problem that we face today, in respect of efficient representation of text in our languages, is precisely one of not being able to do either effective linguistic processing or meet the real requirements of the writing systems.

The Multilingual Systems Development Project at IIT Madras had taken the view that efficient text processing is absolutely essential and is perhaps more important than precise rendering of text, so long as ambiguities are avoided. The consequence of this decision was that the coding structure should preserve linguistic content as well as provide complete rendering information within the flexibilities offered by the writing system. Such a coding scheme would require syllables to be coded, since the linguistic content is expressed through syllables and the writing system displays syllables. The multilingual software applications developed at IIT Madras have successfully demonstrated that linguistic text processing at the syllable level is not only possible but can also be accomplished by using conventional algorithms which work with fixed-size codes. In contrast with this, application development with Unicode support has raised a number of issues which must be thoroughly discussed and understood before one accepts Unicode as a viable standard for computing with Indian languages.
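The fixed-size alternative can be sketched in the abstract. The packing below is purely illustrative and is not the actual IIT Madras encoding; it only shows that once every syllable is reduced to a pair of small indices, one 16-bit value per syllable suffices, and ordinary fixed-width algorithms (comparison, sorting) apply directly:

```python
def pack(cluster_index: int, vowel_index: int) -> int:
    """Pack a syllable as one 16-bit value: high byte holds a consonant
    cluster index, low byte a vowel index. Illustrative only; not the
    actual IIT Madras encoding."""
    assert 0 <= cluster_index < 256 and 0 <= vowel_index < 256
    return (cluster_index << 8) | vowel_index

def unpack(code: int):
    return code >> 8, code & 0xFF

# Fixed-size codes sort and compare like plain integers; no state
# machine is needed to find syllable boundaries.
word = [pack(5, 1), pack(2, 3), pack(5, 0)]
print(sorted(word) == [pack(2, 3), pack(5, 0), pack(5, 1)])   # True
```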


In the light of the above, the Systems Development Laboratory, IIT Madras is pleased to share with viewers the Lab's experiences in dealing with linguistic and rendering issues of text in all the important scripts of India.

Unicode: A Viewpoint from SDL

The Multilingual Systems project at IIT Madras was started around the time ISCII had evolved into a standard. It was clear to the development team that though ISCII was conceived as the basis for a syllabic representation of text in Indian languages, one had to reckon with the need to process a variable number of bytes to identify a proper syllable. The variable-length code makes text processing very complex, especially in the presence of codes which have no linguistic significance but are required for correctly rendering the syllable.

In recent years, software developers have indeed given serious thought to supporting Unicode for Indian languages. Unicode for Indian languages has basically evolved from ISCII and has retained the essence of the eight-bit coding scheme, though script-specific codes have been assigned for the different scripts. The world over, there has been a continuing debate about the real suitability of Unicode for applications in Indian languages, but the open commitment given by Microsoft has led many developers to toe the line towards Unicode.

From the very beginning, the Multilingual Systems project at IIT Madras had seen the futility of attempting to do text and linguistic processing with variable-length codes for syllables and had therefore evolved a uniform two-byte scheme to simplify text processing. The question of adhering to a meaningful standard where developers see distinct advantages is an important issue, but a standard becomes meaningful only if most of what we have successfully attempted earlier can be accommodated. In this respect, Unicode for Indian languages does pose fairly serious challenges, and to this date (March 2005) no satisfactory implementation of useful applications can be cited as an example.

The purpose of this article is not to present an argument against using Unicode but to bring out the real difficulties in coping with its implementation for Indian languages. Many of the complexities involved in rendering Unicode text through Uniscribe (Microsoft's shaping engine) or equivalent interfaces will be taken up one by one, and the difficulties faced in linguistic processing will be explained. Where required, test files have been included for viewers to download and verify the points made. The information provided here will probably convince the reader that it is quite difficult to work with Unicode for Indian languages; hence one should seriously consider alternatives for text processing. On the issue of using Unicode for transporting information across systems, however, there is enough consensus.

Idiosyncrasies of Writing Systems in India

Writing systems followed in India are considered complex on account of the rules which specify how a syllable should be written. The reader is advised to look at the page discussing the principles of writing systems before looking at the current page, which concentrates on the problem of rendering syllables on a computer. By and large, most languages of India follow a syllabic writing system, which represents syllables rather than pure consonants and vowels. Though there can be thousands of syllables, the writing systems generally follow some rules

by which the syllables are shaped. These rules allow a syllable to be built up from a smaller set of shapes, which includes the vowels, the consonants and the representations for the medial vowels. This smaller set is usually made available in a font, and on a computer a syllable is typically shaped by placing the glyphs in the required order. It will help if we specify the manner in which a syllable is shaped by examining the structure of the syllable. A syllable may be made up of:

1. A pure vowel.

This usually applies to a vowel appearing at the beginning of a word, though in some languages a pure vowel may be seen inside a word. A pure vowel has a unique shape and is written using this shape wherever it occurs.

2. A consonant with an implied "ah".

The consonants of our languages cannot be pronounced easily unless a vowel is attached to the consonant or other consonants follow. Unlike in Western scripts, where a consonant is always written in its generic form, consonants in

http://acharya.iitm.ac.in/linguistics/wrisys.php

India are almost always written with an implied "ah", so that one can pronounce an independent consonant directly without having to refer to it by a name (unlike in Western languages, where each letter has a name).

For example, "m" is normally referred to as "em", and only when an "a" comes with it, as in "ma", will one say "ma". In Sanskrit (and in other Indian languages), when you see the consonant "m", you will know that it is to be pronounced "ma". This subtle distinction has to be retained when a child is taught the writing system.

In Indian scripts, a generic consonant occurs only as part of a syllable and not by itself, except that a word may end in a generic consonant. Hence the writing convention includes a special form for the same, obtained by attaching a "halanth" ligature. This halanth form is the generic form of the consonant, but it is not easy to pronounce it by itself. (Try saying "hmm".) A pure consonant is written using the shape assigned to the consonant.

3. A consonant-vowel combination.

In India, one refers to the consonant as the body and the vowel as the one that gives the consonant its life; hence the vowel symbolically represents life. This simple syllable is almost always written by adding a ligature to the shape of the consonant, which ligature depends on the vowel. This medial-vowel representation has specific forms in specific scripts. There are exceptions to this rule as well in some of the scripts (Tamil and Malayalam).

In the above, we see three scripts where the syllables with "ta" have been formed with all the vowels. Notice that in Tamil, the matra (ligature) can have components on both sides of the consonant, while in Telugu the components may be written above and below the consonant as well as to one side.

    4. Two or more consonants and a vowel.

Very simply, we can say this conforms to the CCV, CCCV, CCCCV etc. format. It is useful to point out here that one cannot really have arbitrarily long syllables; they would become almost impossible to pronounce. By and large, two- and three-consonant syllables are common, and very few have four or five consonants. One sees long syllables even in English ("angstrom"!).

Across all the languages of India, approximately eight hundred to a thousand base syllables (with the implied vowel "ah") are known to be present in spoken and written form. Since a basic syllable can include any of the vowels, the number of actual syllables will be of the order of about eight thousand, for not all the vowels may be seen with a base syllable which has two or more consonants in it.

Rules for generating the display

1. A pure vowel or a basic consonant has an individual shape associated with it. This shape has evolved over a period of time, but one does find significant variations in older manuscripts. A pure vowel or a basic consonant is always displayed by drawing the associated shape.
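The V, CV, CCV, CCCV structure described above can be recognized mechanically. The sketch below segments a romanized consonant-vowel stream into syllables of the form C*V; the one-letter-per-sound transcription used here is a simplifying assumption, not a standard scheme.

```python
import re

# A minimal sketch: segmenting a romanized consonant/vowel stream into
# syllables of the form C*V (zero or more consonants followed by a vowel),
# as described in the text. The transcription (one ASCII letter per sound)
# is a simplifying assumption; word-final generic consonants are ignored.

VOWEL = "[aiueo]"
CONS = "[kgcjtdnpbmyrlvsh]"

def syllabify(word):
    """Return the list of C*V syllables in a romanized word."""
    return re.findall(f"{CONS}*{VOWEL}", word)

assert syllabify("namaste") == ["na", "ma", "ste"]  # CV, CV, CCV
assert syllabify("astra") == ["a", "stra"]          # V, CCCV
```

A real segmenter would work over vowel and consonant code points of a script rather than ASCII letters, but the greedy C*V rule is the same.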


The forms for all the vowels and pure consonants are defined uniquely in each script.

2. A consonant-vowel combination is written with a matra (ligature) attached to the basic consonant. The matra may be drawn on either side of the consonant, and in some cases it is written on both sides, or above and below, the consonant. This applies to Tamil, Telugu, Malayalam, Bengali and older scripts such as Grantha. Now, it is also true that in Tamil and Malayalam there is no specific matra in respect of the vowel "uh" and its long version. No matras are applicable here, and these will have to be remembered as exceptions. In most scripts there will be such exceptions for specific combinations, and these exceptions will have to be kept in

mind when rendering the syllable.

3. The shape for a consonant in a syllable may be roughly specified by applying the rules observed in practice for each script. These rules vary across scripts. Some of them are explained below.

The half form of a consonant is normally used in many cases, especially in scripts which are closer to Devanagari, e.g., Gujarati. The half form is also referred to as the joining form. Usually, the half form bears enough resemblance to the full form of the consonant. However, the half form is not defined for all the consonants, especially those which do not have a vertical stroke in them (in Devanagari). Several consonants which do not have a clearly defined half form are shown in the figure above. In these cases, a form diminished in size, but drawn in a manner where the consonants can be written one below the other, is considered useful. Again, examples are seen in the figure above.

The one-below-the-other form is actually the default for the South Indian scripts, except Tamil. In these, there is no half form for a consonant. The first consonant in the syllable is written first, the second is written below it in reduced size, and the third may also get written below this combination. Since one seldom finds arbitrarily long syllables, and most of the three- or four-consonant syllables end with "ra" or "ya", the actual need to write three consonants one below the other arises only rarely. The syllables with "ra" or "ya" as the last consonant have a special form.

Composing syllables with generic consonants

The shape of a syllable can always be built by using the generic forms of the consonants. This will be linguistically correct, though not conforming to convention. Using generic consonants to write syllables generally results in a smaller set of shapes for the writing system. Among the Indian languages, Tamil employs a simple script where a syllable is always shown in this manner.
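In Unicode, the choice between a fully ligated conjunct, a half-form rendering and the explicit generic-consonant (halant) form described above is controlled by the virama and the joiner characters. A small Devanagari sketch:

```python
# Devanagari conjunct encoding in Unicode: consonant + virama (halant,
# U+094D) + consonant. Joiner characters request alternate renderings of
# the very same syllable, per the Unicode standard's Devanagari rules.

KA, SSA = "\u0915", "\u0937"
VIRAMA, ZWJ, ZWNJ = "\u094D", "\u200D", "\u200C"

ligature = KA + VIRAMA + SSA          # usually rendered as the ligature (kSSa)
half_form = KA + VIRAMA + ZWJ + SSA   # requests the half form of KA
explicit = KA + VIRAMA + ZWNJ + SSA   # requests an explicit halant (generic KA)

# Three different code sequences for one linguistic syllable "kSSa":
assert len({ligature, half_form, explicit}) == 3
assert ligature[1] == VIRAMA          # the virama carries no sound of its own
```

This is exactly the tension the article discusses: the code string mixes linguistic content (the consonants) with rendering directives (the joiners).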


    Syllable Representation Examples

When we compare the rules across different scripts, the following seem to apply in general, though different rules may apply in different scripts for the same syllable. In other words, several displayed forms may refer to the same sound.

- Concatenate half forms, except for the last consonant.
- Write the consonants one below the other, retaining their basic shapes at diminished size.
- Use special ligatures for specific vowel combinations in some of the scripts.
- Use unique forms for a syllable.
- Simply decompose the syllable into its consonants and the vowel.
- Use special ligatures for "ra" in Devanagari-based scripts. The ligature depends on where "ra" occurs within the syllable.
- Use special ligatures for other consonants as well. This applies to Telugu.
- The medial-vowel representations may have ligatures on both sides of the consonant.

The following are illustrative of syllable formation in different scripts. The variations in the writing systems will be seen by examining these carefully. This is not an exhaustive set but is provided only as an example.


    Coding schemes: Linguistic requirements

1. Accommodate all basic sounds

All the basic vowels and consonants should find a place in the code space. All the symbols that convey related information about the text (Vedic symbols, accounting symbols etc.) should also be coded. Punctuation marks consistent with the use of the scripts today, and the ten numerals, should also be accommodated in the code space, irrespective of whether they have been accommodated with other scripts or not.

2. Lexical ordering

A meaningful ordering of the vowels and consonants will help in text processing. Over the years, online dictionaries have become very useful, and the arrangement of words within a dictionary should conform to some known lexical ordering. Lexical ordering of the aksharas may not really conform to any known arrangement for the different languages, since no standards have been recommended or proposed. The ordering currently in vogue is somewhat arbitrary and differs across languages.

3. Coding structure to reflect linguistic information

When codes are assigned to the basic vowels and consonants, it would be of immense help to relate the code value to some linguistic information. For instance, the consonants in our languages are grouped into classes based on the manner in which the sound is generated, such as the cerebrals, palatals etc. It would certainly help if, looking at a code,


one could immediately recognize the class. In fact, the system of using aksharas to refer to numerals is a well-known approach to specifying numbers; this system, familiar to many as the "katapayadi" system, has been followed in India for ages.

4. Ease of data entry

The scheme proposed for data entry must provide for typing in all the symbols without having to install additional software or use multiple keyboard schemes. It is also important that data-entry modules restrict input to only those strings that carry meaningful linguistic content. In the context of Unicode, data-entry schemes may permit typing in any valid Unicode character even though it may convey nothing linguistically. It would therefore help if the schemes allowed only linguistically valid text strings.

5. Transliteration across scripts

It is important that the coding structure allows codes corresponding to one script to be easily displayed using other scripts as well. In a country such as India, where a lot of common information has to be disseminated to the public, one should not be burdened with the task of generating the text independently for each script. The Unicode assignments for linguistically equivalent aksharas across languages are not sufficiently uniform to permit quick and effective transliteration; one requires independent tables for each pair of scripts. ISCII assignments were uniform across the scripts and made transliteration easier. Transliteration is quite complex with Unicode. The problem of finding equivalents requires that characters assigned in one script but not in the other be mapped based on some phonetic content. This may not always be possible with current Unicode assignments. The illustration below is typical of what one may prefer. Three consonants in Tamil have their Unicode equivalents specified only in Devanagari but not in the other scripts. This means that proper transliteration of Tamil text into, say, Bengali or Gujarati may not be feasible with the existing Unicode assignments, and only nearest equivalents may be shown. Transliteration based on nearest phonetic equivalents may not be appropriate from a linguistic angle.
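Because the Indic blocks in Unicode were laid out as parallel 128-code-point ranges derived from ISCII, a crude transliteration amounts to adding a fixed code-point offset. The sketch below (Devanagari to Tamil) shows both why this almost works and the gap problem just described: phonemes absent from the target script land on unassigned code points.

```python
import unicodedata

# The Indic blocks in Unicode occupy parallel 128-code-point ranges derived
# from ISCII, so a crude Devanagari -> Tamil transliteration is a fixed
# offset (0x0B80 - 0x0900 = 0x0280). The gaps discussed in the text show up
# as unassigned code points in the target block.

OFFSET = 0x0B80 - 0x0900  # Devanagari block start -> Tamil block start

def naive_translit(devanagari_char):
    """Map one Devanagari character to its parallel Tamil code point,
    or None if Tamil has no character at that position."""
    target = chr(ord(devanagari_char) + OFFSET)
    try:
        unicodedata.name(target)  # raises ValueError if unassigned
        return target
    except ValueError:
        return None

assert naive_translit("\u0915") == "\u0B95"  # DEVANAGARI KA -> TAMIL KA
assert naive_translit("\u0916") is None      # KHA has no Tamil counterpart
```

A production transliterator therefore needs per-pair mapping tables with phonetic fallbacks, which is exactly the burden the text describes.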

This brings up another important issue as well. In the Unicode assignment for Devanagari, equivalent codes for aksharas from Tamil have been specifically provided. But the Unicode book also allows the same aksharas to be rendered using two Unicode characters, the first corresponding to the basic phonetic equivalent and the second the Nukta character, which adds the dot to the preceding character. This creates problems in practice, since two different Unicode strings result in identical text displays, and tracing back to the correct internal representation will be difficult. This shows the bias exhibited by Unicode towards a coding structure which also specifies rendering information, as opposed to rigidly specifying syllables alone.

6. String matching issues

Archives of text in Indian languages may have to be indexed and stored for purposes of retrieval against specific queries. The query string may pertain to text in a given language, but the result may actually be text in another language. Here is a situation which illustrates this. A journalist might have filed a report in a language for publication in a magazine. At a later time, a similar event may have to be reported in another region, and information from the earlier report might prove useful. The journalist covering the latter event may query a database for keywords in the original language of the earlier report, but actually submit the query in a different script containing the same linguistic information. The question of correctly forming a query string is also something one must think about, for it is quite easy to make spelling errors while typing the query. How would one find a match? This is a typical scenario in India, where centralized information sources cater to the dissemination of information in the different regional languages.

7. Handling spelling errors

One of the major difficulties in preparing a query string is getting the spelling right.
With syllabic writing systems, it is entirely possible that conjuncts (i.e., syllables with multiple consonants) are typed with some error. Often the string is derived on the basis of its pronunciation. With errors in spelling, string matching on the basis of syllables can be very difficult. The problem indicated here assumes significance when central databases are queried in regional scripts. A person in Tamilnadu may desire to look up information about places in the Himalayas and submit a query in Tamil for a match against the name.


The characters in the Tamil string will have to be transliterated into the appropriate codes for the Devanagari text in which the information may be kept. The syllables in Tamil are always written in decomposed form, and this will result in differences between the Tamil and Devanagari strings, causing the string-matching program to report either a spelling error or the absence of a match. In respect of Indian scripts, it will be too much to expect users to know the correct spelling. Thus string matching will be required on the basis of close sounds rather than on the internal representation. This argument also applies to applications that might attempt to check spelling in a data-entry program.
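Matching "on the basis of close sounds" can be sketched as reducing both strings to a coarse phonetic skeleton before comparing, so that distinctions a script or a careless typist cannot express (vowel length, aspiration) do not defeat the match. The reduction rules below are illustrative only, not a standard phonetic algorithm.

```python
# A sketch of sound-based matching: reduce romanized strings to a coarse
# phonetic skeleton before comparing, so that vowel length and aspiration
# (which, e.g., Tamil spelling cannot express) do not defeat the match.
# The reduction table is illustrative, not a standard scheme.

REDUCE = [
    ("aa", "a"), ("ii", "i"), ("uu", "u"),  # drop vowel length
    ("kh", "k"), ("gh", "g"), ("th", "t"),  # drop aspiration
    ("dh", "d"), ("bh", "b"), ("ph", "p"),
    ("sh", "s"),                            # merge sibilants
]

def skeleton(word):
    """Collapse a romanized word to its coarse phonetic skeleton."""
    word = word.lower()
    for src, dst in REDUCE:
        word = word.replace(src, dst)
    return word

# A query typed without the long vowel or the aspirate still matches:
assert skeleton("himaalaya") == skeleton("himalaya")
assert skeleton("gandhi") == skeleton("gandi")
```

In the cross-script scenario above, both the Tamil-side query and the Devanagari-side index entry would be skeletonized before comparison.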

    Linguistic issues in text processing

Dealing with text consistent with linguistic requirements

Text processing with linguistic requirements in mind can be effected with a minimal set of characters and a few special symbols. By this we mean that a displayed text string can be interpreted with respect to the language it represents. When we are looking for the meaning of a word in a text string, the language does come into the picture, and a computer program may actually match the string with a set of words in order to arrive at a linguistically important feature of the word.

Interestingly, what associates a word with a language is not the script in which the word is written but the sounds associated with the word. For example, the bilingual text we see in railway stations in India conveys the same linguistic information even though written in different scripts. Unfortunately, computers have forced us to work with scripts rather than with sounds, constraining us to handle representations of the shapes of the written letters. The reader will agree with this readily once he/she reads the following text strings and relates them all to the same linguistic content.

An important consequence of the above observation is that in the case of two of the scripts (Roman with diacritics, and Greek), a minimal set of about 30-40 shapes is adequate to represent virtually any text one wishes to display. In the case of the other two (Devanagari and Tamil), hundreds of shapes may have to be used, since each shape is associated with a unique sound; this is in contrast with the other situation, where a sequence of shapes from a small set is placed one after the other. In other words, while in the Western scripts a syllable is always shown in decomposed form, in Indian scripts a syllable is usually shown in its individual form, though this individual form may conform to some convention in respect of how it is generated.

In the context of Indian scripts, one seldom runs into a problem of reading the text correctly, since the reader automatically associates the shapes with the sounds, whereas there is enough room for incorrect reading with the Roman script. Thus the shapes of the symbols used in Indian scripts relate more directly to linguistic content, without ambiguity, when one pronounces the sounds as inferred from the shapes. This brings us to an important problem of text representation. If we want to code the text in a way in which the linguistic content and the shape are mapped one to one, we will have to find a code for each syllable, and we will have to provide for thousands of these, even for a single language. The reader who is familiar with language primers in elementary schools will immediately remember the very basic set consisting of all the consonant-vowel combinations. Shown below is a portion of the table of syllable representations in their most basic form, with just one vowel with a consonant; this includes the case where the generic consonant is represented as well. Thus the total set equals the product of the number of consonants and the number of vowels, together with the set of vowels, and this may be what constitutes the bare minimum requirement for syllable representation. This set is linguistically adequate, though the writing conventions may require special ligatures when specific conjuncts are formed.

    This large set of displayed shapes has certainly posed problems for the computer scientists who had always workedwith a limited set of letters. The new requirement can be met only with schemes that allow more than eight bits per


code, since the required number is far in excess of 256. Till recently, the majority of computer applications had been written to work only with eight-bit codes for text representation, except perhaps those meant for use with Chinese, Japanese and Korean, where more than 20,000 shapes are required. Surprisingly, individual codes have been assigned to each of these (a very tedious process indeed, but one that has been handled meticulously). To circumvent the data-entry problem with that many symbols, a dictionary-based approach is used for these specific languages, where the name of the shape is typed in using a very small set of letters (called kana) and the application substitutes the shapes (called ideographs).

Handling Indian scripts

Computer applications written for the Western scripts can handle about 150-200 shapes (letters, accented letters and symbols). Designers have thought of clever approaches to dealing with Indian scripts by identifying a minimal set of primitive shapes from which the required shape for any syllable can be constructed. For Indian scripts, the basic set of consonant-vowel combinations can be easily accommodated through a minimal set of basic shapes involving only the vowels, the consonants and the matras. When we write text in our languages, we can in fact build the required shape of the syllable from these, but writing conventions are such that for almost all the scripts (except Tamil) many syllables have independent shapes. It is very likely that as writing systems evolved in India, the syllables which occurred more frequently got special shapes assigned to them. We observe that there are about a hundred and fifty of these special shapes which will have to be included in our set if we wish to generate displays conforming to most of the conventions. These basic shapes can be used as the glyphs in a font, so that one can generate meaningful displays conforming to the writing conventions.

If we look at the number of glyphs, we will find that about 230-240 may be adequate to build almost all the syllables in use. However, fonts used in computers cannot really support this many glyphs. Each system (Win9x, Unix or the Macintosh) has its own specifications for the correct handling of fonts, and the common denominator that all these platforms can truly cater to is only about 190 glyphs, though individually the Macintosh can support many more. For most scripts, multiple copies of the matras, each one magnified or reduced in size and located appropriately to blend with the consonant or conjunct, will be required. In some cases it may be difficult to add a matra by overlaying two glyphs, because the basic shape of the consonant may not permit an attachment that is not individually tailored to it. This happens, for example, with the "u" matra for the consonant "ha". In these cases, new glyphs are invariably added.

The observations made above may not hold for the case of text representation through Unicode, which provides a large code space of more than 64,000 codes. Yet within this large space, each language (identified through the script

associated with it) will be confined to a much smaller set of codes, but this set can truly exceed 256. Thus Unicode, used with an appropriate 16-bit font, can accommodate a fairly large number of characters for a script. The Western Latin set has more than 450 assigned codes, catering to most European requirements.

We will now make some specific observations about handling our scripts and assigning codes.

1. If we agree to represent text using codes assigned to the shapes used in building up the displayed symbols, we will certainly be able to store and display the text, and possibly handle data entry as well, using the same methods adopted for plain ASCII text. However, tracing the displayed text back to the linguistic content requires us to map the displayed shape into the consonants and vowels that make up the syllable. This makes linguistic processing quite complicated. Also, this approach will not work uniformly across fonts, since each font has its own selection of basic glyphs and ligatures.

2. We can agree to assign codes to the basic vowels and consonants of our languages, which run to about fifty-one symbols. However, these codes cannot be directly mapped to shapes in the displayed text. A string containing these codes will necessarily have to be parsed to identify syllable boundaries, and the result mapped to a shape. If we do what is done in the Western scripts, we will end up with a situation such as the one seen below. If we take the approach through ISCII and try to display text directly with the codes, we will run into similar difficulties.


In the use of ISCII, the situation similar to Roman is acceptable so long as the convention of attaching the vowel shape to only one side of the consonant is retained. The group of codes will indeed contribute to identifying the linguistic content properly, but the display may require swapping of glyphs if the matra addition follows a different rule. The main advantage of ISCII is that it provides codes that relate to the linguistic content (sounds), and thus these could be used uniformly across the Indian languages, which are based on a more or less common set of sounds. However, this simplistic view does not always hold, for ISCII also prescribed the means for interpreting specific codes to produce a specific display form. It achieved this through two special codes called INV and Nukta.

Going from an ISCII string to displayed shapes requires one to identify syllable boundaries and also to interpret the INV and Nukta characters properly. This approach will be script-dependent as well as font-dependent. Such a program will code into itself the rules of the writing system followed for a language when using the script. Clearly, writing such programs to handle multiple scripts in the same document will not be easy. Also, since the writing-system rules are coded into the program, handling a new script for a language will require the program to be modified and recompiled. It is, however, possible for the program to read in the rules if it is written in an appropriate manner, involving data structures that directly specify the rules and are read in at run time from appropriate files (tables or simple structures can help).
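The data-driven approach just recommended can be sketched as a per-script rule table, loaded at run time, that maps each syllable to a glyph sequence. The table below is an in-memory dict standing in for a file, and the glyph names are hypothetical; the point is that supporting a new script means supplying a new table, not recompiling the renderer.

```python
# A sketch of a table-driven renderer: the writing-system rules live in a
# per-script table (here an in-memory dict standing in for a file read at
# run time). Glyph names are hypothetical. Note how the left-side "i" matra
# is handled purely by data: its glyph is listed before the consonant glyph.

DEVANAGARI_RULES = {
    # (consonant, vowel) -> glyph sequence
    ("ka", "a"): ["KA"],
    ("ka", "i"): ["MATRA_I", "KA"],    # left-side matra: reordered in the table
    ("ka", "aa"): ["KA", "MATRA_AA"],  # right-side matra
}

def render(syllables, rules):
    """Map a list of (consonant, vowel) syllables to a flat glyph list
    using the supplied script table."""
    glyphs = []
    for syllable in syllables:
        glyphs.extend(rules[syllable])
    return glyphs

assert render([("ka", "i"), ("ka", "aa")], DEVANAGARI_RULES) == \
    ["MATRA_I", "KA", "KA", "MATRA_AA"]
```

Swapping in a Tamil or Telugu table changes the rendering behaviour without touching the program logic, which is the portability argument the text makes.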


Going from the displayed shape to the internal representation

How easy or difficult will it be to retrace the steps and go from a displayed shape to the ISCII codes which generated the shape? This problem is faced in practice when we perform copy-paste operations. The problem is quite difficult to handle, since the display is based on codes corresponding to the glyphs in the font, while the internal representation conforms to ISCII (or Unicode). What is recommended in practice is the approach through a backing store for the displayed string, typically implemented as a buffer in memory that retains the internal codes of the displayed text. This buffer will have to be maintained in addition to any other buffer maintained by the application for manipulating the text. When a block of text is selected on the screen, a copy of the display is generated again from the internal buffer, and this is compared with the codes corresponding to the display. In other words, one does not really go from the displayed codes to the internal codes but rather matches the displayed codes by generating a virtual display and comparing the two. We now appreciate the fact that if the displayed code and the internal code were the same, there would be no difficulty at all in doing this. The writing systems which are syllable-based do not permit this, however. Tracing back can be quite complicated when the same syllable gets displayed in alternate forms, as in the illustration below.

One has perfect freedom in choosing any of the above forms when displaying text, and no one would complain that the text is not readable, since all the forms are accepted as equivalent. The assignment of ISCII or Unicode values does not specify in which form a syllable should be rendered, so long as the result is acceptable. The rendering in practice will have to take into account the availability of the required basic shapes to build up the final form; hence the rendering process will depend on the font used for the script. Experience tells us that, at least in respect of Devanagari, the first and the fourth forms above are seen only in some commercially available fonts which are normally recommended for high-quality typesetting.

Summary and specific observations

1. The characters defined in any coding scheme should meet the basic linguistic requirements applicable to a language. It is also necessary to accommodate all the special symbols used in the writing system to add syntactic value to a string. For instance, the Vedic marks used in Sanskrit text or the accounting symbols used in Tamil provide additional information which may not be strictly linguistic in nature but is useful for interpreting the contents.

2. As far as possible, every text string must conform to the basic requirement that the displayed shape always carry specific linguistic information. That is, some amount of semantic detail must also be part of the information conveyed by the string. In the absence of this, an application will have great difficulty in interpreting a text string from a linguistic angle, though the string may contain only valid codes.

3. The same linguistic information may be conveyed by more than one displayed shape. The coding schemes must permit alternative representations to be traced back to specific linguistic content.
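The requirement that alternative representations be traceable to one linguistic content shows up concretely in the Nukta case discussed earlier: precomposed DEVANAGARI LETTER QA (U+0958) and KA + NUKTA (U+0915 U+093C) display identically. Unicode normalization recovers a single canonical form; since the precomposed nukta letters are composition exclusions, even NFC yields the decomposed sequence.

```python
import unicodedata

# Two different Unicode strings for the same letter: precomposed QA (U+0958)
# versus KA (U+0915) + NUKTA (U+093C). They render identically; canonical
# normalization exposes the equivalence. Because the precomposed nukta
# letters are composition exclusions, even NFC produces the decomposed
# sequence as the canonical form.

precomposed = "\u0958"
decomposed = "\u0915\u093C"

assert precomposed != decomposed  # raw code-point comparison fails
assert unicodedata.normalize("NFC", precomposed) == decomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Any application that compares, indexes or searches such text must therefore normalize strings before comparison, exactly the trace-back requirement stated in point 3.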

    Multilingual Computing Requirements(for India)

The multilingual, multicultural environment in India calls for new approaches to computing with Indian languages. There is a need for applications across the country which deal with information common to all the regions. Within each region, where there is homogeneity in terms of the language spoken, applications need to address local requirements. Government and public institutions in the country have traditionally resorted to bilingual documentation, with English as the base and the regional language as the means of communicating with the people. Today we need to cater to the following requirements.

    Dissemination of centralized information to different regions of the country, both in English and the regional language. Information flows back from the region to the center as well, and this is invariably done in English.


    Exchange of information across the states. This is accomplished through English, though the use of the National Language has been encouraged.

    Dissemination of information within a region. This is done primarily in the local language and to some extent in English. In the rural setup, the regional language is always used.

    Exchange of information between institutions (public and private) which have offices distributed throughout India.

    In almost all cases, information originates in the regional languages even if it is subsequently transmitted in English. Often bilingual documents are exchanged between Government institutions in different regions. Data shared across the country is usually independent of a specific language, since the nature of the shared information usually relates to demographic details, schedules of national events, prices of agricultural products, etc.

    Types of applications

    Applications catering to bilingual document preparation. Though only two languages may be involved at a time, the regional language can be any one of the national languages. These are covered by nine scripts. Urdu should also be supported since it is a national language.

    Applications catering to multilingual documents. Such applications are called for in the study of scriptures and old manuscripts preserved in India. Such documents are also required to be generated when data has to be displayed in public places attended by people from different regions of the country. Typically, signs and posters in railway stations display information in three or four different scripts.

    Applications used in the teaching of languages. Introducing one language through another is useful, and is also effective when the cultural background of the people speaking the different languages is common or similar. The language is easier to understand since many words would be common.

    Creation of centralized data bases where the stored information is common to all the regions. While English may be well suited for such applications, what is desirable is an approach to storing information in a language/script independent fashion. Postal information remains essentially the same across the country and could be an example. All the rules and regulations in effect throughout the country would also qualify here.

    Applications catering to linguistic analysis. Machine translation programs, user interfaces based on natural language queries, and generation of linguistic corpora are examples of these applications.

    Internet based applications such as email, chat and search engines also qualify as multilingual applications. It will be important to permit localized versions of these applications, where knowledge of English will not be required on the part of the user to run the applications.

    The different applications mentioned above may be grouped into

    Document preparation and data entry.

    Creating conventional data bases and accessing the data through a web interface or client applications running on standard systems.

    Creating and managing text data bases, which includes indexing.

    Command processors similar to a shell, where text based applications may be run with ease. The commands may include standard shell commands to manipulate files, invoke applications, manage files and directories, etc.

    While it is entirely conceivable that any application currently supported through English will qualify for localization, there are still many questions to be answered about items of information which need to be identified on a global basis. It is unlikely that in the near future totally localized applications will be available matching corresponding applications running in English. Basically, one is looking for applications that can be handled comfortably in the mother tongue, so that a large population in the country may use computers in a meaningful manner.

    Requirements to be met

    1. Ease of data entry in the regional script as well as in English. Keyboard mapping must be flexible enough to support all the aksharas, traditional symbols and punctuation. The use of the keyboard should be uniform across the scripts.


    2. Ease of transliteration across the scripts. This is important to disseminate common information, set up centralized data bases, etc. There is also the need to transliterate between English and the regional script to help people learn the language.

    3. User interfaces should be uniform across the languages, platforms or operating systems, so that training will be easy both for the person learning computers and for the trainer himself/herself. Training programs in different regions may be easily handled by experts who may not speak the language but can communicate through a common language other than English. The phonetic nature of India's languages is indeed very useful for this.

    4. Applications should cater to a large population of children and adults with serious disabilities, specifically visual impairments.
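    Requirement 2 is made easier by the fact that the Unicode blocks for the nine Indic scripts are laid out in parallel (they all derive from ISCII), so a crude transliteration can be sketched by shifting each code point from one 128-value block to another. This is only a sketch under that assumption: it presumes the target script actually assigns the corresponding position, which is not always true (Tamil, for instance, leaves many positions unassigned), and the function name is ours.

```python
def shift_script(text, src_base=0x0900, dst_base=0x0C00):
    """Crudely transliterate by moving code points from the source
    block (default Devanagari) to the target block (default Telugu).
    Characters outside the source block pass through unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        if src_base <= cp < src_base + 0x80:
            out.append(chr(cp - src_base + dst_base))
        else:
            out.append(ch)
    return "".join(out)

print(shift_script("नमस"))  # Devanagari -> Telugu: నమస
```

    A production transliterator would additionally validate that each shifted position is assigned in the target block.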

    Conceptual Basis for Unicode

    A coding scheme provides for representing text in a computer. The text presumably comes from some language. The writing system employed for a language utilizes shapes associated with the linguistic elements which are fundamental to the language. These are usually the vowels and consonants present in the language, and this set is normally known as the alphabet. Assigning codes to the letters of the alphabet has been the standard practice in respect of processing information with computers.

    Codes are generally assigned on the basis of linguistic requirements. A code is essentially a number that is associated with a letter of the alphabet. Working with numbers is easier, and a good deal of text processing can be effected by just manipulating the numbers. For example, an upper case letter in English can be changed to its lower case by applying a simple formula. The number of codes required in practice for any particular language will be decided by the totality of shapes associated with the writing system, such as upper case letters, punctuation symbols, numerals, etc. Typically this would be a set with fewer than a hundred codes for most Western languages.

    Traditionally, computer applications dealt with text corresponding to only one language. Subsequently the need to work with multilingual text was felt, and this brought in additional requirements in respect of codes. The letters from different languages cannot normally be distinguished on the basis of their codes, for across different languages the numerical values assigned for the codes fall in the same range. Thus one might find that the code assigned to the letter "a" in English is really the same as the code assigned to the Greek letter "alpha" or an equivalent letter in the Cyrillic alphabet. A multilingual document with text from different languages cannot really be identified as one unless a mechanism is available to specifically mark sections of the text as belonging to a specific language/script.

    The traditional way of solving this was to embed descriptors in the text in a default language/script and allow these descriptors to specify multilingual content. Typically one would use different fonts to identify different languages, and the application would use the specified font to display portions of the text in a particular language/script. This way, at least the display of multilingual information was possible, though it was still difficult to associate a code, i.e., a character in the text, with its language unless the application kept track of the context. Keeping track of the context requires that an application examine the text in the document from the beginning to the current letter, for only then can the language associated with the letter be ascertained without doubt.
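    The "simple formula" for case change mentioned above is just a fixed arithmetic offset within the ASCII range, as this small Python sketch illustrates:

```python
def to_lower(ch):
    """ASCII-only case change: 'A'..'Z' sit exactly 32 code
    positions below 'a'..'z', so lowering is a fixed offset."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return ch

print(to_lower("G"))  # g
```

    No such uniform formula exists for syllabic scripts, which is part of what makes their processing harder.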

    In the eight bit coding schemes, the codes are typically in the range 32-127, though values above 128 are also used. Since different characters from different languages are assigned codes in the same range, identification of the language for a given code is rather difficult unless the context is also specified. The concept of the "character set" was introduced precisely for this purpose, so that each language/script could be identified through the name given to the character set. The character set name would figure in the document (in a default language) and thus the context could be established. This is predominantly the method used in most word processor documents as well as web pages displayed through web browsers.

    Linguistic processing with codes can proceed only when the language associated with the codes is known. Keeping track of the context of the language is cumbersome, though not impossible. The idea behind Unicode is to present the language information associated with each character code in a manner that an application can readily associate the character with the particular language. Clearly, the need to identify the set of languages/scripts which would qualify for processing comes up first, and Unicode first examined the different scripts used in the writing systems of the world and provided a comprehensive set of codes to cover most of the languages of importance. The rationale for this is the following. Typically, the writing systems employ shapes or symbols which are directly related to the alphabet, and so by providing for the script, one would also provide for the language or languages which use the same script (though with minor variations). The majority of the languages of the world could be handled this way, including Japanese, Chinese and Korean, where literally twenty thousand or more shapes are required. Unicode indeed set aside a very large range of numbers to cater to these.


    The basic idea in Unicode was to assign codes over a much larger range of numbers, from 0 to nearly 65000. This large range would be apportioned to different languages/scripts by assigning chunks of 128 consecutive numbers to each script, which may also include a group of special symbols. The size of the alphabet in many languages is much less than 50, and so this minimal range of 128 is quite adequate even to cover additional symbols, punctuation, etc. The list of languages supported in the current version of Unicode (Version 3.2) is given at the Unicode web site.

    An important concept in Unicode is that codes are assigned to a language on the basis of linguistic requirements. Thus, for most languages of the world which use the letters of their alphabet in the writing system, the linguistic requirement is basically satisfied if all the letters are covered along with special symbols. Display of text would proceed by identifying the letters through their assigned Unicode values both in the input string and the displayed string, which for most languages/scripts would be identical. Thus a Unicode font for a language need incorporate only the glyphs corresponding to the letters of the alphabet, and the glyphs in the font would be identified with the same codes used for the letters they represent.

    As a concept, Unicode provides for a very effective way of dealing with multilingual information, both in respect of text display and linguistic processing. Unfortunately, we encounter special problems with languages which use syllabic writing systems, where the shapes of the displayed text may not bear a one to one relationship with the letters of the alphabet. In other words, for those languages of the world where the writing system employed displays syllables, the one to one relationship between the letters of the alphabet and the displayed shape does not apply. The languages of the South Asian region as well as the Semitic languages like Hebrew, Arabic, Persian, etc., typically employ the syllabic writing system. Unicode assignment for these languages does meet the basic linguistic requirements. However, the issue of display or text rendering has to be addressed separately for these languages.

    Unicode for Indian Languages: A Perspective

    A brief introduction

    The essential concept underlying Unicode relates to the assignment of codes for a superset of world languages (essentially scripts used in different writing systems) such that a single coding scheme would adequately handle multilingual text in any document. In Unicode, it is generally possible to identify the language/script and the letter of the alphabet or a language specific symbol from a unique code made up of sixteen bits. It is important to keep in mind the fact that the need for handling different languages of the world had been felt long before Unicode was thought of. The earlier solution was a simple one. Collect the set of letters to be displayed and give the set a name or an identification. A computer application could then be told to interpret a character code with respect to a character set. The idea of the character set was simply that a set of values, typically 128 or in some cases going up to 255, would relate to a set of displayed shapes or symbols for a specific language associated with the character set. The character set name would be given as a parameter to the application, which would then choose an appropriate font to display the text specified by the eight bit code values in a text string.

    The only issue that had to be taken care of with the earlier approach was that the application always had to work in the context of some language to be able to correctly interpret the code. Since the codes were common to all the character sets (being eight bit codes), it would not be easy for an application to interpret a given code unless the associated character set was also known. This would be a constraint to reckon with while handling multilingual text.

    For most western scripts, the number of distinct shapes to convey information through displayed text is usually small, typically of the order of about 70, and perhaps about 100 if the new symbols which have become meaningful in the

    context of electronic data processing get included. In some of the western scripts, accented characters are present which will have to be treated as independent linguistic entities. Otherwise, an accented letter may be viewed as a composite with a base letter and an accent mark. Viewed in this light, the normal ISO-Latin character set has about 94 displayable characters without accents and perhaps another 90 which include accented letters, the accents themselves and other special symbols. An eight bit code is entirely adequate to meet all linguistic requirements here.

    Computer applications render text by using the rendering support provided by the Operating System. Given that a code value is associated with a character set, the application will choose an appropriate font containing the letters and symbols for the script associated with the character set. Traditionally most of the fonts were eight bit fonts, providing a maximum of about 190-200 glyphs for each character set.

    Multilingual documents

    An application rendering multilingual text should know which portion of the document should be rendered in a particular script. Typically, the format of multilingual documents included the means to identify portions of the text as having certain attributes, which include the font, colour and the size of text. The Rich Text Format standardized by Microsoft or the HTML specification allows a document to describe itself using descriptors made up of symbols from the set of letters used in the script. Readers familiar with Word processors will readily appreciate the fact that the


    document contains a lot of formatting details, all of which are described using only the characters from the set. These are generally known as tags. HTML documents contain a lot of tags which tell the browser application how to present the text in a window.

    Formats for documents which allow the document to describe itself are usually known as Markup languages. RTF, HTML and XML all belong to this category of Markup languages. While this approach appears meaningful, there are practical difficulties in using the self describing tags where the tags themselves appear as text in the document, but the specifications for the document usually provide for handling such situations through the concept of Entities, where an entity may uniquely describe a specific character in the text through a unique name assigned to the character.
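    The entity mechanism can be seen with a short Python sketch: a numeric character reference such as &#x0915; names exactly one character, here DEVANAGARI LETTER KA, using nothing but ASCII, so a marked-up document can carry any script inside an ASCII stream.

```python
import html
import unicodedata

# A numeric character reference spells out one code point in pure ASCII.
ka = html.unescape("&#x0915;")
print(ka, hex(ord(ka)), unicodedata.name(ka))
# क 0x915 DEVANAGARI LETTER KA
```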

    Multilingual text is usually tagged in ASCII, but the tags can confuse Web Browsers if not handled properly. Unicode was introduced as the solution to the problem of handling multilingual text, where any character in the text could be individually and uniquely identified as belonging to a script/language. In Unicode for Indian languages, each character is identified through a field within the code which specifies the language and a field which specifies an individual letter within that language. Though sixteen bits are used to specify each code, the number of codes assigned to any language is small and is often just about 128, with very few exceptions.
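    Because each Indic script occupies 128 consecutive values, the "language field" of a code point can be recovered with simple range checks. A minimal Python sketch (the block bases are the Unicode assignments for the nine scripts; the function name is ours):

```python
# Base code point of each 128-value Indic block in Unicode.
BLOCKS = [
    (0x0900, "Devanagari"), (0x0980, "Bengali"),  (0x0A00, "Gurmukhi"),
    (0x0A80, "Gujarati"),   (0x0B00, "Oriya"),    (0x0B80, "Tamil"),
    (0x0C00, "Telugu"),     (0x0C80, "Kannada"),  (0x0D00, "Malayalam"),
]

def script_of(ch):
    """Return the Indic script owning this character, if any."""
    cp = ord(ch)
    for base, name in BLOCKS:
        if base <= cp < base + 0x80:
            return name
    return "Other"

print(script_of("த"))  # Tamil
print(script_of("क"))  # Devanagari
```

    This is exactly the property eight bit character sets lacked: the code value alone, with no external context, names the script.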

    The Unicode experts may actually describe Unicode as one single scheme for dealing with all the scripts and languages of the world, where the code space of 65536 has been apportioned to the different languages, one after another. So the idea of splitting the code into two fields does not really apply in general. However, when only 128 code values have been assigned for a language, it is very easy to see that the two fields can be uniquely discerned. Among the Indian languages, Unicode assignment has been effected for all the basic scripts: Devanagari, Bengali, Oriya, Gurmukhi, Gujarati, Tamil, Telugu, Kannada and Malayalam. For these, the language descriptor part of the code occupies nine bits and the remaining seven refer to the consonants, vowels and the matras, along with special symbols.

    Devanagari - 128 code values from 0900
    Bengali - 128 code values from 0980
    Oriya - 128 code values from 0B00
    Gurmukhi - 128 code values from 0A00
    Gujarati - 128 code values from 0A80
    Tamil - 128 code values from 0B80
    Telugu - 128 code values from 0C00
    Kannada - 128 code values from 0C80
    Malayalam - 128 code values from 0D00

    The Unicode book specifies a unique English name for each code. This is typically a combination of the language name and an individual name for each of the 128 characters in the range. For most of the Indian scripts, several code values in the set of 128 for each script may be reserved. The actual code assignments may be seen from the web pages at the Unicode Consortium web site.

    Unicode and conformity to linguistic requirements

    The Unicode Book is specific in respect of implementing schemes to render text in a manner which is consistent with the linguistic requirements of the language. Here the original intent of Unicode was to represent only the basic

    linguistic elements forming the alphabet and not a specific rendered form. For example, an accented character which may be used in German or French is identified as a single letter, though composed of a letter and an independent accent mark. Since such accented characters belong to the set of letters used in the writing system, they are assigned individual codes. An accented character could well be described by two codes, one for the letter and one for the accent, but in the wisdom of the designers of Unicode, almost all accented characters have been assigned individual codes to make text processing simpler.
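    Both descriptions of an accented character (one precomposed code, or letter plus accent) are legal Unicode, and the standard's normalization forms convert between them. A small Python sketch:

```python
import unicodedata

precomposed = "\u00e9"    # é as a single code point
combining   = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The strings render identically but are different code sequences.
assert precomposed != combining

# Normalization maps each description onto the other.
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

    Applications that compare text usually normalize first, so that the two spellings count as the same letter.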


    In normal Roman (standard English) text, one does not see such characters, and so the basic set for Roman excludes them. However, these are linguistically important, and so they are included as an extension to the normal Latin character set, called the Latin supplement, where each accented character is assigned a unique code. (Refer to the chart at the Unicode Web site.) The Unicode consortium did not, however, specify how they would be typed in along with English. This was the responsibility of the application. Even today, very few applications can actually permit direct data entry of accented characters from the standard keyboard without resorting to a keyboard switch.

    The generic concept of Unicode works well for the western languages, where there is only one shape associated with one and only one code value. That is, each code value can directly refer to a glyph index, and when the glyphs are placed side by side, the required display is achieved. In this case, a text string is rendered simply by horizontally concatenating the shapes (glyphs) of the letters. Thus a Unicode font for a western script need have only one glyph for each character code. The glyph index and the code value can therefore be exactly the same. When the glyph indices are given, the original text is also known exactly, due to the one to one mapping. Most languages whose writing system is based on the Latin alphabet come under this category.

    This simplistic view does not help when the displayed shape does not correspond to a single letter but relates to a group of consonants and a vowel which constitute a linguistic quantum. In the South East Asian region, writing systems are based on rendering syllables and not the consonants and vowels. The accented characters mentioned earlier may also be viewed in this light, as being made up of two or more shapes derived from two or more codes.

    The problem at hand in respect of Indian languages is one of finding a way to display thousands of such combinations of basic letters, where each combination is recognized as a proper syllable. This corresponds to a situation where a string of character codes maps to a single shape. In the context of Indian scripts, the code for a consonant followed by a code for the vowel will usually imply a simple syllable, often rendered by adding a matra (ligature) to the consonant, though there are enough exceptions to this rule.

    Those responsible for assigning Unicode values to Indian languages had known about the complexity of rendering syllables. But they felt that the assigned codes correctly reflected the linguistic information in the syllable, and so suggested that there was no need to assign codes to each syllable. It would be (and should be) possible to identify the same from a string of consonant and vowel codes (just as syllables are identified in English). What was specifically recommended was that an appropriate rendering engine or shaping engine should be used to actually generate the display from the multibyte representation of a syllable.

    Since Unicode evolved from ISCII, there was also the special provision of Unicode values to specify the context in which a consonant or vowel was being rendered as part of a syllable. In other words, Unicode also provided for explicit representations, achieved by forcing the shaping engine to build up a shape for a syllable different from what might be a default. The zero width modifier characters accomplish this, along with the Nukta character when dotted characters (the Persian or Urdu characters in Hindi) have to be handled. These do not directly belong to the basic set of vowels and consonants but are sort of derived shapes.
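    The effect of these modifier characters is visible at the code point level even without a shaping engine. A hedged Python sketch (the rendering outcomes described in the comments depend on the font and shaping engine in use):

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Default: KA + VIRAMA + SSA; a shaping engine would typically
# draw the conjunct form (ksha).
conjunct = "\u0915\u094d\u0937"

# Inserting ZWNJ after the virama asks the engine to keep the
# explicit halant form instead of forming the conjunct.
explicit = "\u0915\u094d" + ZWNJ + "\u0937"

# Same linguistic content, different code sequences.
assert conjunct != explicit
assert explicit.replace(ZWNJ, "") == conjunct

# The Nukta: the dotted letter QA canonically decomposes
# to KA followed by the NUKTA sign.
assert unicodedata.normalize("NFD", "\u0958") == "\u0915\u093c"
```

    It is precisely such display-control characters, carrying no linguistic information of their own, that complicate the syllable analysis discussed next.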


    The idea of assigning codes to displayed shapes may appear to contradict the original intent of Unicode, where codes would be assigned only to the linguistic elements. This is usually justified on the following grounds. You always require a font containing the basic letter shapes and ligatures to render text as per the rules of the writing system. It is not going to hurt to add a few characters in the input string which may influence the selection of specific glyphs for a given context, so long as the application does not interpret the string linguistically and performs only string matching. This is perfectly acceptable in situations where serious text processing is not attempted (e.g., parsing the input string to identify prefixes or suffixes in a verb). However, in the context of Indian languages, a word has to be interpreted properly to extract linguistic information, and this requires analyzing the syllable structure. It is here that the multibyte representation can cause serious headaches for a programmer, for the algorithms working with multibyte structures are usually quite complex. The presence of characters which do not carry linguistic information will only compound the problem, and there is also the possibility that the algorithm would fail when ambi