SiyaBasScript - Mozilla Firefox and Google Chrome Extension for converting web sites with non-Unicode Sinhala fonts to Unicode

Keheliya Bandara Gallaba, Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka

For download instructions, visit http://galpotha.wordpress.com


Abstract - Sinhala has been used in computer technology since the late 1980s, but no standard character representation system was put in place, which resulted in proprietary character representation systems and fonts. The Unicode standard, which has the explicit aim of transcending the limitations of traditional character encodings, was introduced for Sinhala in 1998. Still, some major Sinhala news sites have not adopted the standard and misuse styling hacks to display Sinhala text in various other font faces. This causes many compatibility issues when the sites are viewed in different browsers and operating systems. The SiyaBasScript extension solves the problem by converting that text to Sinhala Unicode. It will help to increase the amount of Sinhala Unicode content on the World Wide Web, allowing a far easier standard of representing, searching, sorting and processing knowledge.

Index Terms—Internet, Languages, Localization

I. INTRODUCTION

With the introduction of microcomputers in the early eighties, Sri Lanka too embarked on the use of computers with local language input and output. The University of Colombo developed a Sinhala screen output for television displays and, within a few years, went on to provide election result displays in three languages: Sinhala, Tamil and English. Software such as 'DOS Word Perfect', 'Super 77', 'Wadan Tharuwa' and 'Sarasavi' was introduced later to enable Sinhala word processing and printing. However, no standard character representation system was put in place, which resulted in proprietary character representation systems and fonts. Later, the requirement for a standard code was identified, and in 1985, soon after its inception, the Computer and Information Technology Council of Sri Lanka (CINTEC) took steps to establish a committee for the use of Sinhala & Tamil in Computer Technology.

This committee quite correctly took steps to meet the immediate need to agree on an acceptable Sinhala alphabet and an alphabetical order. It therefore joined with a committee appointed by the Natural Resources, Energy and Science Authority of Sri Lanka (NARESA) to form the Committee on Adaptation of National Languages in IT (CANLIT), which agreed on a unique Sinhala alphabet and alphabetical order.

Manuscript received February 21, 2010. This extension for Google Chrome and Mozilla Firefox was created as per the guidelines of the CS3200 – Programming Project module of the BSc (Computer Science and Engineering) degree course of the University of Moratuwa.

G.M. Keheliya Bandara Gallaba is an undergraduate student at the University of Moratuwa, Sri Lanka, currently on internship at WSO2 Inc. (Phone: +94-715518881; fax: +94-112412236; e-mail: keheliya.gallaba@gmail.com).

The CINTEC Internet Committee agreed that one of the major impediments to the development and use of the Internet in Sri Lanka, especially in rural areas, is the lack of local language content. The Committee agreed that the availability of a high-quality, free, and standards-conformant Sinhala font would enable content providers to create Sinhala language content. As a first measure, the Internet Committee decided that a Committee on Unicode Compatible Sinhala Fonts should be formed. This Committee would define the basic minimum requirements for Unicode-compatible Sinhala fonts; define the essential features that should be present in a Sinhala character set, character combinations and their input; and address the requirements for a standard Sinhala keyboard, keyboard stroke sequences, and issues relating to glyphs and keyboard drivers. In 1998, the SLS1134/Unicode standard for Sinhala was released by CINTEC for the first time.

With the establishment of ICTA in 2003, the responsibilities of the Fonts Committee were assigned to ICTA, which set up a Language Requirements Committee to take the Sinhala Unicode initiative forward.

The Unicode range for Sinhala is U+0D80–U+0DFF. In the code chart, grey areas indicate unassigned code points.
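As a minimal illustration of the range given above, the following JavaScript sketch tests whether a string contains any character from the Sinhala block. The function names are hypothetical and are not taken from SiyaBasScript, and the Latin sample string is only a stand-in for text set in a legacy font.

```javascript
// Sinhala block boundaries as stated in the text: U+0D80 to U+0DFF.
const SINHALA_START = 0x0d80;
const SINHALA_END = 0x0dff;

// Returns true if the code point lies inside the Sinhala block.
// The block also contains unassigned code points, so this is a
// range test, not a test for an assigned character.
function isInSinhalaBlock(codePoint) {
  return codePoint >= SINHALA_START && codePoint <= SINHALA_END;
}

// Returns true if at least one character of the text is in the Sinhala block.
function containsSinhala(text) {
  for (const ch of text) {
    if (isInSinhalaBlock(ch.codePointAt(0))) {
      return true;
    }
  }
  return false;
}

// The word සිංහල stored as Unicode code points.
console.log(containsSinhala("\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"));
// Latin letters that only look like Sinhala when a legacy font is applied.
console.log(containsSinhala("isxy,"));
```

A check of this kind could serve as a quick way to tell whether a page already carries Unicode Sinhala text or only Latin codes dressed up by a font.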

Currently, Sinhala Unicode support is not built into Windows XP, unlike support for Tamil and Hindi. However, all versions of Windows Vista include Sinhala Unicode support by default and do not require external fonts to be installed to read Sinhala script.

For OS X, Sinhala font and keyboard support can be found at http://web.nickshanks.com/typography/ and at http://www.xenotypetech.com/osxSinhala.html

For Linux, the SCIM input method selector allows Sinhala script to be used in applications such as terminals and web browsers. Fedora 7 and later already provide the required input methods, which can be installed using the Applications->Add/Remove Software menu item. For other GNU/Linux distributions such as Debian or Ubuntu, follow the instructions at the Sinhala GNU/Linux site for complete Sinhala Unicode support.

II. THE NEED FOR SIYABASSCRIPT

A. Initial Survey

Although Unicode is regarded as the standard for creating and viewing Sinhala language content, some web sites, including some well-known news sites, still create content in non-Unicode encodings and misuse methods intended for styling web pages to force the browser to render Sinhala text.

For example:

The online edition of Lankadeepa (http://www.lankadeepa.lk/) uses the following snippet of code to display Sinhala text using a non-Unicode font.

@font-face { font-family: Wijeya; font-style: normal; font-weight: 700; src: url(../../../02\WIJEYA1.prf); }

Mirisa.org (http://mirisa.org/) uses the following snippet of code to display Sinhala text using a non-Unicode font.

@font-face { font-family: IsiMalithi; font-style: normal; font-weight: normal; src: url(ISIWMAL0.eot); }

The Lanka-E-News site (http://lankaenews.com/) uses the following snippet of code to display Sinhala text using a non-Unicode font.

<style type="text/css"> @font-face { font-family: sandaru-n; font-style: normal; font-weight: normal; src: url(SANDARU0.eot); } </style>

Further research found that more news sites, including the online edition of Lakbima (http://www.lakbima.lk/), Rivira (http://www.rivira.lk/), LankaScreen (http://www.lankascreen.com/), GossipLanka (http://gossiplanka.blogspot.com/) and HotHotLanka (http://hothotlanka.blogspot.com/), use similar styling techniques to display Sinhala text without using the standard Unicode encoding.

B. Drawbacks of using Styling techniques to display localized content

B.1. Font Proliferation

The purpose of <font face>, as described in the W3C Recommendation, is to control the style of a web page by naming a set of one or more fonts, in one or more sizes, designed with stylistic unity, each comprising a coordinated set of glyphs. The Recommendation does not, however, address the problems this creates when it is misused in multilingual documents.

The point is that using <font face> to specify a font for a different script is in fact lying to the browser about the identity of the characters that are supposedly identified by the underlying codes in the computer.

There are a number of problems with this approach. The most evident is that bad things happen if the user looking at the page does not have exactly the font that has been specified: the text appears in the browser's default font, which will not be a Sinhala font and will not have glyphs to display Sinhala characters. This happens even when the user has a perfectly good standard Sinhala Unicode font on the system, which could have been used if the developer had coded the text properly.

The characters (actually glyphs) in a font are numbered, and the set of glyph-number associations forms what is known as the coding of the font. There are a large number of these codings, even for a given language or script. If simplistic font mapping (which is what <font face> does) is used to encode text, the text is at the mercy of the particular coding of the chosen font.
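The dependence on a font's coding can be made concrete with a small JavaScript sketch. Both tables below are hypothetical placeholders invented for illustration; they are not the real codings of any font named in this paper. The sketch only shows that the same stored codes yield different characters depending on which coding the reader's font happens to use.

```javascript
// Two HYPOTHETICAL font codings (placeholder entries, not real font tables).
const fontCodingA = new Map([
  ["a", "\u0d9a"], // stored 'a' is drawn as ක
  ["b", "\u0db8"], // stored 'b' is drawn as ම
]);
const fontCodingB = new Map([
  ["a", "\u0db8"], // stored 'a' is drawn as ම
  ["b", "\u0d9a"], // stored 'b' is drawn as ක
]);

// Interprets the stored codes through one font coding.
// Codes that the coding does not cover pass through unchanged.
function interpret(storedText, coding) {
  return Array.from(storedText, (ch) =>
    coding.has(ch) ? coding.get(ch) : ch
  ).join("");
}

const stored = "ab"; // what the page actually contains
console.log(interpret(stored, fontCodingA));
console.log(interpret(stored, fontCodingB));
```

With Unicode, by contrast, the stored code itself identifies the character, so no per-font table is needed to know what the text says.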

Since users have to install all the fonts specified by different sites, this proliferation creates useless redundancy without addressing the styling for which the mechanism was intended. The Web also becomes fragmented, with mutually incomprehensible parts.

B.2. Incompatibility Issues

Although users can get around the problem of missing font files by installing them on the computer or by using embedded online font files, this leads to many incompatibility issues across browsers and operating systems.

The ability to embed fonts on web pages was originally implemented by Microsoft in Internet Explorer 4.0 - the catch was that these font files needed to be in a custom form of OpenType format, with an EOT file extension. The other catch is that embedding EOT files only works in Internet Explorer.

Embedded OpenType is a proprietary format supported exclusively by Internet Explorer. It was submitted to the W3C in 2007 as part of CSS3; that submission was rejected, and it was resubmitted as a standalone submission on March 18, 2008. The W3C team comment on the submission states that the "W3C plans to submit a proposal to the W3C members for a working group whose goal is to try and develop EOT into a W3C Recommendation."

More recently, CSS 3 added a specification for embedding fonts on web pages in a more open, standardized way. Browsers that support this part of CSS 3 can render web pages that embed a TrueType font file.


Newer browsers such as Firefox 3.5 therefore support TrueType and OpenType fonts embedded on pages, whereas Internet Explorer supports only Embedded OpenType (EOT) fonts. Firefox 3.5 won't render EOT files, and Internet Explorer won't render TrueType files. To get around this problem, multiple font formats must be embedded on a page at the same time.

Browser            Lowest version   Support of
Internet Explorer  4.0              Embedded OpenType fonts only
Firefox (Gecko)    3.5 (1.9.1)      TrueType and OpenType fonts only
                   3.6 (1.9.2)
Opera              10.0             TrueType and OpenType fonts only
Safari (WebKit)    3.1 (525)        TrueType and OpenType fonts only

Browser compatibility of different font types with @font-face syntax.

Rather than embedding multiple font formats, however, web developers tend to embed only the EOT file and recommend that readers use Microsoft Internet Explorer to view Sinhala web pages correctly.

These constraints create many usability issues in real-world use. Web applications should be written primarily for the web, not for particular browsers. Developers should strive for device independence rather than targeting specific operating system and browser versions.

B.3. Knowledge Representation issues with searching, sorting and text processing

Since non-Unicode fonts do not carry any information about the language, search engines and other software cannot recognize text created with these fonts as Sinhala content. Such text will not be indexed meaningfully and will not be considered in the corresponding search queries or sorting operations.

In Sinhala script, the glyph sometimes changes according to the position of the character within a word (initial, medial, final or isolated). There are also compulsory ligatures, where two or more characters turn into a single glyph, and cases where one character is displayed as two glyphs that straddle the glyph of another character.

Simple non-Unicode fonts used for mapping complex scripts are often rather limited in terms of glyphs and ligatures, and sometimes use ugly tricks like building characters from pieces to render barely passable text. By contrast, an implementation based on a properly coded Unicode character set can make full use of a good font that is not subject to the constraints of font mapping, resulting in better quality rendering. Furthermore, the encoded text is independent of the font used, which means that improvements in fonts benefit old documents without recoding them: they simply display better.

It is therefore essential to emphasize Unicode Sinhala content creation on the World Wide Web and to take measures to convert previously published non-Unicode text to Unicode, so that it can be searched, sorted, indexed and properly represented.
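The search problem described in this subsection can be sketched in a few lines of JavaScript. The page URLs and the Latin stand-in text are hypothetical; the point is only that a Unicode query cannot match text whose stored codes are Latin letters.

```javascript
// The query a reader would type: the word සිංහල as Unicode code points.
const query = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd";

// Two hypothetical pages that would show the same word on screen.
const pages = [
  {
    url: "http://example.org/unicode-page",
    text: "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd Unicode",
  },
  {
    url: "http://example.org/legacy-page",
    // Placeholder Latin codes standing in for legacy-font text.
    text: "isxy, Unicode",
  },
];

// Returns the URLs of the pages whose stored text contains the term.
function search(pageList, term) {
  return pageList
    .filter((page) => page.text.includes(term))
    .map((page) => page.url);
}

console.log(search(pages, query));
```

Only the Unicode page is found, even though a reader with the right legacy font installed would see the same word on both pages.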

III. SOLUTION BY SIYABASSCRIPT

The SiyaBasScript extension for Mozilla Firefox and Google Chrome solves the above problem by recognizing the elements of the above-mentioned web sites that contain non-Unicode text and mapping their contents to the corresponding Unicode characters. The web sites can then be viewed in any Unicode-enabled browser on any operating system, without the hassle of installing fonts or expensive proprietary software.
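A minimal sketch of such a mapping step is given below in JavaScript. It is not the SiyaBasScript implementation: the table entries are hypothetical placeholders, and real font tables are much larger and differ from font to font. The sketch also assumes one detail worth illustrating: legacy text often stores a vowel sign that is drawn to the left of a consonant before that consonant, whereas Unicode stores the vowel sign after the consonant, so a converter has to reorder the pair.

```javascript
// HYPOTHETICAL legacy table: stored code -> Unicode character.
// These entries are placeholders, not the coding of a real font.
const legacyTable = new Map([
  ["l", "\u0d9a"], // consonant ක
  ["u", "\u0db8"], // consonant ම
  ["f", "\u0dd9"], // vowel sign ෙ, stored BEFORE its consonant in legacy text
]);

// Vowel signs that legacy text stores before the consonant.
const PRE_BASE_SIGNS = new Set(["\u0dd9"]);

// Sinhala consonants occupy U+0D9A to U+0DC6 in the Unicode chart.
function isSinhalaConsonant(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x0d9a && cp <= 0x0dc6;
}

// Maps each stored code through the table and moves a pre-base vowel
// sign behind the consonant that follows it.
function convertLegacyToUnicode(text, table) {
  let result = "";
  let pendingSign = "";
  for (const ch of text) {
    const mapped = table.has(ch) ? table.get(ch) : ch;
    if (PRE_BASE_SIGNS.has(mapped)) {
      // Hold the sign back until its consonant arrives.
      result += pendingSign;
      pendingSign = mapped;
    } else if (pendingSign && isSinhalaConsonant(mapped)) {
      // Unicode order: consonant first, then the vowel sign.
      result += mapped + pendingSign;
      pendingSign = "";
    } else {
      // No consonant to attach to: emit the held sign unchanged.
      result += pendingSign + mapped;
      pendingSign = "";
    }
  }
  return result + pendingSign;
}

console.log(convertLegacyToUnicode("fl", legacyTable)); // consonant, then vowel sign
console.log(convertLegacyToUnicode("uu", legacyTable));
```

Keeping the table separate from the conversion loop means that, under this design, supporting another legacy font would only require another table.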

The fundamental idea behind the architecture of this extension came from the Greasemonkey scripting engine, which allows users to run scripts written in JavaScript that manipulate the contents of a web page through the Document Object Model (DOM) interface. These scripts are site-specific and make on-the-fly changes to HTML page content on the DOMContentLoaded event, which fires immediately after the page is loaded in the browser (a practice also known as augmented browsing). As Greasemonkey scripts are persistent, the changes are applied every time the page is opened, making them effectively permanent for the user running the script.

Greasemonkey scripts can also poll external HTTP resources via a non-domain-restricted XMLHTTP request. They also contain optional metadata, which specifies the name of the script, a description, resources relevant to the script, a namespace URL used to differentiate identically named scripts, and URL patterns for which the script should or should not be invoked.
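The following JavaScript sketch shows what a user script of this kind might look like: a metadata block followed by code that walks the text nodes of the page. It is an illustration under stated assumptions, not the SiyaBasScript source. The script name, the namespace URL, the helper names and the fallback font are hypothetical; the font family names and site URLs are the ones quoted in Section II; and convertText is a placeholder for the real character mapping.

```javascript
// ==UserScript==
// @name        Legacy Sinhala converter (illustrative sketch)
// @namespace   http://example.org/hypothetical-namespace
// @description Converts text set in known non-Unicode Sinhala fonts
// @include     http://www.lankadeepa.lk/*
// @include     http://mirisa.org/*
// @include     http://lankaenews.com/*
// ==/UserScript==

// Font families from the @font-face rules quoted in Section II.
const LEGACY_FONTS = ["wijeya", "isimalithi", "sandaru-n"];

// Returns true if the first family in a CSS font-family value
// is one of the known legacy fonts.
function usesLegacyFont(fontFamilyValue) {
  const first = fontFamilyValue
    .split(",")[0]
    .trim()
    .replace(/^["']|["']$/g, "");
  return LEGACY_FONTS.includes(first.toLowerCase());
}

// Placeholder for the real mapping; returns the text unchanged.
function convertText(text) {
  return text;
}

// Finds every text node rendered in a legacy font, converts its text,
// then switches the affected elements to a generic font family.
function convertPage(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const targets = [];
  while (walker.nextNode()) {
    const node = walker.currentNode;
    const parent = node.parentElement;
    if (parent && usesLegacyFont(window.getComputedStyle(parent).fontFamily)) {
      targets.push(node);
    }
  }
  const parents = new Set();
  for (const node of targets) {
    node.nodeValue = convertText(node.nodeValue);
    parents.add(node.parentElement);
  }
  for (const element of parents) {
    // Assumption: any installed Unicode font can now display the text.
    element.style.fontFamily = "sans-serif";
  }
}

// Run only inside a browser, so the helpers can be tested elsewhere.
if (typeof document !== "undefined" && document.body) {
  convertPage(document.body);
}
```

The text nodes are collected before any font is changed so that resetting one element's font cannot affect the check for its siblings or descendants.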

However, Greasemonkey scripts are limited by the security restrictions imposed by Mozilla's XPCNativeWrappers. For example, Greasemonkey scripts do not have access to many of Firefox's components, such as the download manager, I/O processes or the main toolbars. Additionally, Greasemonkey scripts run per instance of a matching web page, which makes it difficult to manage lists of items globally. To overcome this, script writers have used cookies, and Greasemonkey also offers APIs such as GM_getValue and GM_setValue.
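As a sketch of how such persistent storage might be used, the JavaScript below keeps a counter that survives across page instances. The counter key and the helper names are hypothetical. The storage functions are passed in as parameters that follow the shape of GM_getValue(name, defaultValue) and GM_setValue(name, value), and an in-memory stand-in is used when the Greasemonkey API is not present.

```javascript
// Key under which the counter is stored; the name is hypothetical.
const COUNTER_KEY = "convertedPages";

// Reads the stored counter, adds one, writes it back and
// returns the new value.
function recordConversion(getValue, setValue) {
  const next = getValue(COUNTER_KEY, 0) + 1;
  setValue(COUNTER_KEY, next);
  return next;
}

// In-memory stand-in for the Greasemonkey store, used when the
// sketch runs where GM_getValue and GM_setValue do not exist.
function createMemoryStore() {
  const store = new Map();
  return {
    getValue: (name, fallback) =>
      store.has(name) ? store.get(name) : fallback,
    setValue: (name, value) => {
      store.set(name, value);
    },
  };
}

// Use the Greasemonkey API when available, otherwise the stand-in.
const hasGreasemonkeyStore =
  typeof GM_getValue === "function" && typeof GM_setValue === "function";
const memory = createMemoryStore();
const getValue = hasGreasemonkeyStore ? GM_getValue : memory.getValue;
const setValue = hasGreasemonkeyStore ? GM_setValue : memory.setValue;

console.log(recordConversion(getValue, setValue));
```

Passing the storage functions as arguments keeps the counting logic independent of the host, which may matter where these APIs are unavailable.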

To create the SiyaBasScript Firefox add-on, initial development was done in JavaScript, and Greasemonkey-multi-script-compiler was used to turn the script into a fully fledged extension that adheres to the XPToolkit architecture and Extension Component Interactions of Firefox extensions.


Detailed component architecture of Firefox

As of February 2010, Google Chrome has started providing "native support" for Greasemonkey scripts. They are internally converted to extensions and are managed as such. Chrome ignores @exclude metadata within the scripts, so the scripts are executed for all domains and pages. Chromium, on the other hand, honors the @include directives and executes the scripts only for the domains and pages specified.

Although scripts that use GM_setValue or GM_getValue will break in Chrome, and scripts that use the popular E4X standard will not run, SiyaBasScript was ported successfully to Google Chrome.

IV. CONCLUSION

SiyaBasScript therefore not only allows hassle-free viewing of the above web pages but also allows earlier non-Unicode content to be copied and pasted into Unicode-enabled sites using browsers such as Mozilla Firefox and Google Chrome. The content can thus be quoted, referenced and shared via Unicode-enabled sites, which was nearly impossible while it was available only as non-Unicode content.

Put simply, this approach removes the need to recommend, or restrict users to, certain browsers such as Internet Explorer for viewing, editing and sharing localized content.

This serves the ultimate goal of the World Wide Web as it was intended from the beginning: to give humanity easy, seamless access to information.

ACKNOWLEDGMENT

The author, Keheliya Bandara Gallaba, specially thanks the Sinhala Unicode Group; Dr. Shehan Perera, Senior Lecturer at the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka; and Mr Chulaka Gunasekara, Lecturer at the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka.
