Unicode Normalize Engine Submitted by: Jose Yallouz Shlomi Ben-Shabat Supervisor: Maxim Gurevich.


Unicode Normalize Engine

Submitted by: Jose Yallouz

Shlomi Ben-Shabat

Supervisor: Maxim Gurevich


Agenda

Project Goals

Background

Preliminary Examination

Unicode Normalize Design

Application Analyses

Summary and Conclusion


Project Goals

Recognize the encoding of web pages.

Translate web pages to UTF-8.

Normalize the web into a single encoding standard: UTF-8.


Background - Definitions

Character set – a collection of characters that can be represented.

Character encoding – a mapping from the characters of a character set to bit (byte) sequences.

Unicode – a character set that includes the characters of most of the world's writing systems.

UTF-8 – a character encoding of Unicode, widely used on the web.
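To make the distinction concrete, here is a minimal Java sketch (the class and method names are illustrative, not part of the project) that encodes the same Unicode character with three different character encodings and prints the resulting bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Renders bytes as space-separated uppercase hex, e.g. "C3 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text with the given character encoding and returns the bytes as hex. */
    static String encodeHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(encodeHex(eAcute, StandardCharsets.UTF_8));      // C3 A9
        System.out.println(encodeHex(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(encodeHex(eAcute, StandardCharsets.UTF_16BE));   // 00 E9
    }
}
```

The character is the same in all three cases; only its byte representation changes, which is why a page's bytes cannot be read correctly without knowing the encoding.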


Recognizing Encodings

HTML meta tag: <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

HTTP protocol: Content-Type: text/html; charset=windows-1255

BOM (byte order mark): the bytes EF BB BF at the start of the page (shown as "ï»¿" when misread as windows-1252)

Auto detection – based on Firefox.
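The three declarative methods above can be sketched in Java as follows. This is an illustrative sketch only: the class name, method names and regular expressions are assumptions, not the project's code, and the meta-tag matching is deliberately loose rather than a full HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetHints {
    // charset=VALUE, allowing whitespace around '=' and an optional opening quote.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Any <meta ...> tag that mentions "charset" somewhere in its attributes.
    private static final Pattern META = Pattern.compile(
            "<meta\\b[^>]*charset[^>]*>", Pattern.CASE_INSENSITIVE);

    /** True if the page starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] page) {
        return page.length >= 3
                && (page[0] & 0xFF) == 0xEF
                && (page[1] & 0xFF) == 0xBB
                && (page[2] & 0xFF) == 0xBF;
    }

    /** Extracts the charset parameter from a Content-Type value, if present. */
    static Optional<String> fromContentType(String value) {
        Matcher m = CHARSET.matcher(value);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Extracts the charset declared by the first charset-bearing meta tag, if any. */
    static Optional<String> fromMetaTag(String html) {
        Matcher tag = META.matcher(html);
        return tag.find() ? fromContentType(tag.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=windows-1255"));
        System.out.println(fromMetaTag(
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=Shift_JIS\">"));
        System.out.println(hasUtf8Bom(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'}));
    }
}
```

Each method reports only what the page or server declares; whether the declaration is correct is a separate question.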


Preliminary Examination System

Input: the first 100 Google search results for each language supported by Google.

Goals:

Success rate of each recognition method

Contradiction cases between methods

Encodings supported by Java


Examination Results

The BOM is very reliable.

When the HTTP header and the meta tag contradict each other, the HTTP header is usually correct.

Auto detection is very reliable when recognizing UTF-8.

For encodings other than UTF-8, auto detection is reliable only when a language indication is given.


[Diagram: system overview. Inputs: HTML, HTTP header, URL. Recognition methods: BOM tag, auto detection, Meta, HTTP. Stages: Decision, Translation. Result: Unicode output.]


Unicode Normalize Design

Recognition System

The four recognition methods mentioned above

A heuristic decision tree

Translation System

Translates a web page into UTF-8, using Java's charset translation mechanism.
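As an illustration of translation with Java's charset mechanism, the sketch below decodes a page's bytes with the recognized encoding and re-encodes the text as UTF-8. The class name `Utf8Transcoder` and the choice to reject malformed input are assumptions made for this example, not details taken from the project.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Transcoder {
    /**
     * Decodes the page with the given source encoding and returns it as UTF-8 bytes.
     * Throws IllegalArgumentException if the bytes are not valid in that encoding.
     */
    static byte[] toUtf8(byte[] page, String sourceCharset) {
        try {
            CharBuffer text = Charset.forName(sourceCharset).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(page));
            return text.toString().getBytes(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("page is not valid " + sourceCharset, e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};                         // "é" in ISO-8859-1
        System.out.println(toUtf8(latin1, "ISO-8859-1").length); // 2
    }
}
```

Reporting malformed input, rather than silently substituting a replacement character, makes a wrong encoding decision visible at translation time.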


Class Diagram

UnicodeNormalizer
  +NormalizeHtml(in html_file : string, in output_file : string, in ouput_type : Output) : string
  +NormalizeUrl(in url, in file_name : string, in output_file : string, in output_type : Output) : string
  +NormalizeHttp_html(in html_file : string, in http_file : string, in output_file : string, in output_type : Output) : string

HtmlPage
  -html_file : string
  -html_string : string
  +init(in html_file : string)
  +isHtmlPage() : bool
  +getPage() : string
  +getFileName() : string

HttpHeader
  -http_header : string
  +init(in http_header : string)
  +getContentType() : string

EncodingDetector
  +init(in htmlPage : HtmlPage, in httpHeader : HttpHeader)
  +getDecision() : Encoding

Encoding
  -encodingName : string
  -chanonicalName : string
  -charset

«interface» Recognizer
  +Recognize() : Encoding

AutoRecognizer
  -SupportedEncoding : hash_table
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding
  +SupportEncoding(in encoding : string) : bool
  +getEncodingLanguage(in encoding : string) : int
  +SetDetectiontype(in encoding : string) : void

HttpRecognizer
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding

MetaTagRecognizer
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding

BomRecognizer
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding

refreshTagRecognizer
  +recognize() : string

charsetAliasTable
  -hashCharset : hash_table
  +getFixedCharsetName(in encoding : string) : string

Downloader
  +DownloadPage(in url : string, in file_name : string) : string
  +RecursiveDownloadPage(in url : string, in file_name : string) : string

Filedownloader
  +downloadFile(in urlStr : string, in fileName : string)

Translator
  -charsetEncoder
  +init()
  +translate(in html_page : HtmlPage, in encoding : Encoding) : string

Output
  +init(in fileName : string, in output_type : OutputType)

OutputType
  -html_file
  -text_file

HtmlTextInverter
  +InvertText(in text : string) : string

MetaTagConvertor
  +ConvertMetaTag(in str : string) : string

«exception» InvalidEncodingDetection

«exception» InvalidHtmlException

«exception» InvalidURLException


Recognition System

EncodingDetector
  +init(in htmlPage : HtmlPage, in httpHeader : HttpHeader)
  +getDecision() : Encoding

«interface» Recognizer
  +Recognize() : Encoding

AutoRecognizer
  -SupportedEncoding : hash_table
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding
  +SupportEncoding(in encoding : string) : bool
  +getEncodingLanguage(in encoding : string) : int
  +SetDetectiontype(in encoding : string) : void

HttpRecognizer
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding

MetaTagRecognizer
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding

BomRecognizer
  +init(in html_page : HtmlPage)
  +Recognize() : Encoding

charsetAliasTable
  -hashCharset : hash_table
  +getFixedCharsetName(in encoding : string) : string
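Declared charset labels often differ from the names Java uses, which is presumably the role of charsetAliasTable and of the "supported encoding" check. The sketch below shows one way such a lookup could work; the class name, the method name and the two table entries are hypothetical examples, not the project's table.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class CharsetNames {
    // Hypothetical fix-ups for labels the Java runtime may not resolve by itself.
    private static final Map<String, String> FIXES = new HashMap<>();
    static {
        FIXES.put("iso-8859-8-i", "ISO-8859-8"); // logical Hebrew, same code table
        FIXES.put("iso-8859-8-e", "ISO-8859-8");
    }

    /**
     * Returns Java's canonical name for a declared charset label,
     * or empty if the label is illegal or unsupported by this runtime.
     */
    static Optional<String> canonicalName(String label) {
        if (label == null) return Optional.empty();
        String cleaned = label.trim().replace("\"", "").replace("'", "");
        String name = FIXES.getOrDefault(cleaned.toLowerCase(Locale.ROOT), cleaned);
        try {
            return Charset.isSupported(name)
                    ? Optional.of(Charset.forName(name).name())
                    : Optional.empty();
        } catch (IllegalArgumentException e) { // illegal charset name, e.g. contains spaces
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName(" utf8 "));          // Optional[UTF-8]
        System.out.println(canonicalName("latin1"));          // Optional[ISO-8859-1]
        System.out.println(canonicalName("no such charset")); // Optional.empty
    }
}
```

Java already resolves many aliases through `Charset.forName`, so a custom table only needs entries for labels the runtime does not know.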


Decision heuristic

[Flowchart. Decision nodes: BOM?; Http?; Meta?; Http==Meta?; Auto?; Auto include http?; Auto include meta?; Auto include Http or meta?; Auto==Meta?; Http==Auto?; (Http==Auto) or (meta==Auto)?; (Auto==Ascii) or (Auto==UTF-8)?. Outcomes: UTF-8, http, Meta, Auto, null.]
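The following Java sketch is a much-simplified stand-in for such a decision heuristic, not the slide's actual tree. It only encodes the examination results stated earlier (a BOM is trusted, HTTP is preferred over a conflicting meta tag, auto-detected UTF-8 is trusted); the class name, the method signature and the use of null for "not available" are assumptions.

```java
import java.util.Optional;

class EncodingDecision {
    /**
     * Picks an encoding from the available evidence.
     * http, meta and auto are charset names, or null when that method found nothing.
     */
    static Optional<String> decide(boolean bom, String http, String meta, String auto) {
        if (bom) return Optional.of("UTF-8");                   // the BOM is very reliable
        if (http != null && meta != null) {
            if (http.equalsIgnoreCase(meta)) return Optional.of(http);
            // Declarations disagree: follow meta only if auto detection backs it,
            // otherwise prefer HTTP, which was mostly correct in the examination.
            if (auto != null && auto.equalsIgnoreCase(meta)) return Optional.of(meta);
            return Optional.of(http);
        }
        if (http != null) return Optional.of(http);
        if (meta != null) return Optional.of(meta);
        if (auto == null) return Optional.empty();              // nothing to go on
        // ASCII is a subset of UTF-8, so either result can be treated as UTF-8.
        if (auto.equalsIgnoreCase("US-ASCII") || auto.equalsIgnoreCase("UTF-8")) {
            return Optional.of("UTF-8");
        }
        return Optional.of(auto);
    }

    public static void main(String[] args) {
        System.out.println(decide(false, "windows-1255", "ISO-8859-8", null)); // Optional[windows-1255]
        System.out.println(decide(false, null, null, "US-ASCII"));            // Optional[UTF-8]
    }
}
```

Unlike this sketch, the flowchart also has branches that return null when the methods disagree, and it treats auto detection as producing a set of candidates ("Auto include http?").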



Translation System

Translator
  -charsetEncoder
  +init()
  +translate(in html_page : HtmlPage, in encoding : Encoding) : string

Output
  +init(in fileName : string, in output_type : OutputType)

OutputType
  -html_file
  -text_file

HtmlTextInverter
  +InvertText(in text : string) : string

MetaTagConvertor
  +ConvertMetaTag(in str : string) : string
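Once a page has been re-encoded, a charset declaration left in its meta tag would no longer match the bytes; this is presumably what MetaTagConvertor addresses. A minimal Java sketch of such a rewrite follows, assuming a regular-expression approach; the class name and pattern are illustrative and this is not the project's implementation.

```java
import java.util.regex.Pattern;

class MetaCharsetRewriter {
    // Group 1 captures everything from "<meta" up to and including "charset=" (and an
    // optional opening quote); the charset value that follows it is replaced.
    private static final Pattern CHARSET_VALUE = Pattern.compile(
            "(<meta\\b[^>]*?charset\\s*=\\s*[\"']?)[A-Za-z0-9._:-]+",
            Pattern.CASE_INSENSITIVE);

    /** Rewrites every meta-tag charset declaration in the HTML to UTF-8. */
    static String declareUtf8(String html) {
        return CHARSET_VALUE.matcher(html).replaceAll("$1UTF-8");
    }

    public static void main(String[] args) {
        System.out.println(declareUtf8(
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1255\">"));
    }
}
```

Text outside meta tags is left untouched, so the rewrite can run on the translated page as a final step.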


Problems and solutions

Left to right: the ISO-8859-8 (visual Hebrew) encoding specifies that Hebrew characters are stored in inverted (visual, left-to-right) order.

Solution: the system checks for the ISO-8859-8 encoding, and when it is detected, it inverts the order of the Hebrew characters.
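A minimal Java sketch of the inversion step is shown below. It reverses each run of Hebrew characters within a line and leaves other text in place; the class name, the run-detection rule and the handling of spaces are assumptions for illustration, and a full visual-to-logical conversion would also have to deal with punctuation, digits and combining marks.

```java
class VisualHebrew {
    /** True for characters in the Unicode Hebrew block (U+0590 to U+05FF). */
    static boolean isHebrew(char c) {
        return c >= '\u0590' && c <= '\u05FF';
    }

    /** Reverses every run of Hebrew characters (including spaces between Hebrew words). */
    static String invertHebrewRuns(String line) {
        StringBuilder out = new StringBuilder(line.length());
        int i = 0;
        while (i < line.length()) {
            if (!isHebrew(line.charAt(i))) {
                out.append(line.charAt(i++));
                continue;
            }
            int end = i;
            // Extend the run over Hebrew characters and over single spaces
            // that are directly followed by another Hebrew character.
            while (end < line.length()
                    && (isHebrew(line.charAt(end))
                        || (line.charAt(end) == ' '
                            && end + 1 < line.length()
                            && isHebrew(line.charAt(end + 1))))) {
                end++;
            }
            out.append(new StringBuilder(line.substring(i, end)).reverse());
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "shalom" stored in visual order: final mem, vav, lamed, shin
        String visual = "abc \u05DD\u05D5\u05DC\u05E9 123";
        System.out.println(invertHebrewRuns(visual).equals("abc \u05E9\u05DC\u05D5\u05DD 123")); // true
    }
}
```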


Translation Example

[Screenshots: a page before and after translation]


Application Analyses

Two kinds of analyses were performed on our application:

Google analysis: this analysis checks the first 100 Google results in each language Google supports, about 10,000 web pages in total. The average detection rate across all languages is about 97 percent.


Application Analyses (cont.)

ODP analysis: the Open Directory Project (ODP) is a widely distributed database of Web content classified by humans.

This analysis checks about 150,000 random pages from the ODP database.

The average detection rate across all languages is about 92.6 percent.


Google Analysis

[Chart: detected encodings per language; vertical axis from 0 to 120.]

Languages: Portuguese, Japanese, Armenian, French, Swedish, Chinese_traditional, Norwegian, English, Hebrew, German, Russian, Persian, Ukrainian, Estonian, Serbian, Slovak, Polish, Vietnamese, Arabic, Belarusian, Filipino, Indonesian, Turkish, Slovenian, Hungarian, Icelandic, Romanian, Greek, Chinese_simple, Dutch, Korean, Finnish, Czech, Esperanto, Thai, Spanish, Italian, Lithuanian, Danish, Latvian, Bulgarian, Catalan.

Legend: ISO-8859-15, KOI8-R, EUC-JP, UTF-8, KOI8-U, GB2312, Shift_JIS, ISO-8859-2, windows-1251, ISO-8859-1, windows-1250, ISO-8859-4, ISO-8859-8, ISO-8859-7, GBK, ISO-8859-9, windows-1257, windows-1256, windows-1255, no-detect, windows-1254, EUC-KR, windows-1253, windows-1252.


ODP Analysis

[Chart: detected encodings per top-level domain; vertical axis from 0% to 100%.]

Domains: org, cz, nl, be, com, gov, it, fr, ru, biz, us, in, edu, za, ch, au, uk, ie, ca, gr, jp, mil, net, dk, pl, nz, info, de, se.

Legend: TIS-620, US-ASCII, x-windows-874, Big5, ISO-8859-15, KOI8-R, UTF-8, EUC-JP, KOI8-U, GB2312, windows-31j, Shift_JIS, ISO-8859-2, windows-1251, windows-1250, ISO-8859-1, UTF-16, ISO-8859-6, ISO-8859-5, ISO-8859-8, GBK, ISO-8859-7, ISO-8859-9, windows-1257, windows-1256, windows-1255, no-detect, windows-1254, EUC-KR, windows-1253, windows-1252.


Application Usage

Client usage – a client browser can use this system to show different web pages in one encoding format, UTF-8.

Server usage – a web server can use this system to translate its stored pages into UTF-8.

Processing usage – web page processing systems, such as search engines, can use our system to convert different pages into the standard Unicode encoding.


Future Project Proposals

Implementation of the application in the Firefox browser

Implementation of the application on the Apache server

Design of a new auto-detection method (based on an encoding dictionary)


Summary and Conclusion

We built an efficient system that translates a page to UTF-8 encoding.

Analyses show a success rate of 93 percent.

Implementing the application will improve the web surfing experience for millions of users all over the world.


Questions

Thank you!