Page 1: A Method for Enhancing Search Using Transliteration of Mandarin Chinese Vijay John vijayjohn@mail.utexas.edu.

A Method for Enhancing Search Using Transliteration of Mandarin Chinese

Vijay John

vijayjohn@mail.utexas.edu

Page 2:

Transliterated Mandarin Search

Google suggests spelling correction

Page 3:

Alternate Transliterations?

Want to say “Did you mean Peiching?”

Page 4:

Transliteration Problems

• “Beijing” provides many results

• Google doesn’t find “Peiching,” “Peking,” “Bukgyeong,” etc.

• Many pages using variety of transliterations

• Transliterations unorganized

• This paper organizes for Mandarin Chinese

Page 5:

The Problem (Cont’d)

• Why variety of transliterations?

• Web content: 82% Romanized

• Majority’s native languages: other scripts

• Standard keyboards

• Non-Romanized sources normally transliterated (esp. on Web)

• Transliteration variations

Page 6:

Example 1: Tibetan

• Four languages: transliteration problems

• Hello in Tibetan

• Wylie (bkra shis bde legs)

• Tibetan Pinyin

• Several unofficial systems based on pronunciation

• Spelled/transcribed in several ways (with some guidelines)

Page 7:

Example 2: Malayalam

• No official transliteration system

• Transliteration based on personal preference (many unorganized variations)

• Script conversion programs: more consistent systems

• /maleja:m/ usu. transcribed “Malayalam”

• malayaaLam (Maya), Malajal- (Slavic)

Page 8:

Example 3: Romani

• Vlax Romani standard

• Literacy → few adopt standard

• Different countries, different official languages → different spellings

• No official systems (government)

• Several transliteration systems exist (often inconsistent)—as in last 2 languages

Page 9:

Example 4: Mandarin

• Hànyŭ Pīnyīn

• Tōngyòng Pīnyīn

• Wade-Giles

• Gwoyeu Romatzyh

• (Yóuzhèngshì Pīnyīn) (etc.)

Page 10:

Prior Work

• In Mandarin: geared towards Chinese users searching for information from West

• Western names-Hànzĭ-Hànyŭ Pīnyīn-Hànzĭ

• Algorithms designed for Arabic & Japanese transliteration

• Google

• This method designed for Western users searching for Chinese information

Page 11:

Initial Effort on Mandarin

• Practical first step: increased trade with China

• Simple transliteration problem (relatively)

• Modifications for Tibetan, Romani, Hindustani, etc.

• Intact for some other languages? (e.g. Russian, Arabic, Japanese, Korean)

• Input = Hànyŭ Pīnyīn; output = other systems

Page 12:

Initial Program

• Combined many systems

• Ying – yink – yenk – yenk’ – yemk’ – yermk’ – yarmk’

• Instead of “victory,” searched for “Yarmuk” River in Middle East

• Transliteration systems organized by row but not by column

Page 13:

Organize into Transliteration Table

Entries for “beijing” in two systems

(Purpose is to go from one column to another)

     Hanyu Pinyin   Wade-Giles
1    b              p
2    ei             ei
3    j              ch
4    ing            ing
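
Going "from one column to another" can be sketched as a row lookup. This is a minimal illustration, not the author's program: the class name ColumnLookup, the method names, and the array layout are assumptions, and the table holds only the four rows shown above.

```java
import java.util.List;

class ColumnLookup {
    // Column indices of the two-system table (layout is an assumption).
    static final int HANYU_PINYIN = 0;
    static final int WADE_GILES = 1;

    // The four rows for "beijing", copied from the slide.
    static final String[][] TABLE = {
        {"b", "p"},
        {"ei", "ei"},
        {"j", "ch"},
        {"ing", "ing"},
    };

    // Find the row whose `from` column holds the unit and return its `to` column.
    static String lookup(String unit, int from, int to) {
        for (String[] row : TABLE) {
            if (row[from].equals(unit)) {
                return row[to];
            }
        }
        throw new IllegalArgumentException("No table row for: " + unit);
    }

    // Re-spell a sequence of units by moving each one to the target column.
    static String convert(List<String> units, int from, int to) {
        StringBuilder out = new StringBuilder();
        for (String unit : units) {
            out.append(lookup(unit, from, to));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "peiching", the alternate spelling mentioned on Page 3.
        System.out.println(convert(List.of("b", "ei", "j", "ing"), HANYU_PINYIN, WADE_GILES));
    }
}
```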

Page 14:

Part of Patterns Table: 8 systems

HP    TP1   TP2   MHP1  MHP2  MHP3  WG1    WG2
ci    cih   cih   ci    ci    ci    tz'u   tz'u
si    sih   sih   si    si    si    szu    ssu
zi    zih   zih   zi    zi    zi    tzu    tzu
ju    jyu   jyu   ju    ju    ju    chü    chü
qu    cyu   cyu   qu    qu    qu    ch'ü   ch'ü

Page 15:

Decomposition

• Search for “Beijing” in table

• Delete one letter; search for “Beijin”

• Beiji, Beij…B

• Search for “eijing” (beijing – b) similarly

• Ei found, search for “jing”

• “J” found, search for “ing”
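
The steps above amount to a longest-prefix match: drop trailing letters until the remainder is in the table, then repeat on what is left. A minimal sketch follows, assuming a small hypothetical set of table entries (UNITS); the class and method names are illustrative and not taken from the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class Decomposer {
    // Hypothetical subset of the first (Hanyu Pinyin) column of the table.
    static final Set<String> UNITS = Set.of("b", "ei", "j", "ing", "i", "in");

    // Split a word into table entries, always taking the longest matching prefix.
    static List<String> decompose(String word) {
        String rest = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        while (!rest.isEmpty()) {
            int end = rest.length();
            // Delete one trailing letter at a time until the prefix is found.
            while (end > 0 && !UNITS.contains(rest.substring(0, end))) {
                end--;
            }
            if (end == 0) {
                throw new IllegalArgumentException("No table entry matches: " + rest);
            }
            parts.add(rest.substring(0, end));
            // Continue with the letters that follow the matched prefix.
            rest = rest.substring(end);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints [b, ei, j, ing]
        System.out.println(decompose("Beijing"));
    }
}
```

Because the longest prefix is tried first, "ing" is matched as one unit even though the shorter entries "i" and "in" are also in this sample set.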

Page 16:

Composing new search terms

• Components: b, ei, j, ing

• B → b, p

• ei → ei

• j → j, ch

• ing → ing
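
One way to turn these per-component alternatives into search terms is a Cartesian product over the components. The sketch below assumes that approach; the slides do not say how the original program combines them, and the names Composer, ALTERNATIVES, and compose are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class Composer {
    // Alternatives for each component, copied from the slide.
    static final Map<String, List<String>> ALTERNATIVES = Map.of(
        "b", List.of("b", "p"),
        "ei", List.of("ei"),
        "j", List.of("j", "ch"),
        "ing", List.of("ing"));

    // Build every spelling obtainable by picking one alternative per component.
    static List<String> compose(List<String> components) {
        List<String> terms = new ArrayList<>();
        terms.add("");
        for (String component : components) {
            // A component with no listed alternatives is kept unchanged.
            List<String> options = ALTERNATIVES.getOrDefault(component, List.of(component));
            List<String> extended = new ArrayList<>();
            for (String prefix : terms) {
                for (String option : options) {
                    extended.add(prefix + option);
                }
            }
            terms = extended;
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [beijing, beiching, peijing, peiching]
        System.out.println(compose(List.of("b", "ei", "j", "ing")));
    }
}
```

The product also yields mixed spellings such as "beiching" that belong to no single system; whether the original program filters these out is not stated.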

Page 17:

Implementation

• Java program

• After composition, how does algorithm search?

• Connects to Google via Google API (Application Programming Interface)

• Google searches

• 1-2 second delay (due to Google)

Page 18:

Transliteration Patterns

• Transliterations organized into table

• {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"}

• lüe, lyue, lue, lve, lüeh

• 3 transliteration systems; at most 5 patterns

• First column Hànyŭ Pīnyīn like “ing” “b” “ei”
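
The eight-entry row above collapses to five distinct spellings once duplicate columns are removed, which is how "lüe" yields the five forms listed. A minimal sketch, assuming the initial is simply prepended to each column entry; PatternRow and distinctSpellings are hypothetical names, and ü is written as the escape \u00fc to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class PatternRow {
    // The eight-column row from the slide; \u00fc is u with a diaeresis.
    static final String[] UE_ROW = {
        "\u00fce", "yue", "yue", "ue", "ve", "\u00fceh", "\u00fceh", "\u00fceh"
    };

    // Prepend the initial to every column entry and drop repeated spellings,
    // keeping the first-seen column order.
    static List<String> distinctSpellings(String initial, String[] row) {
        Set<String> seen = new LinkedHashSet<>();
        for (String ending : row) {
            seen.add(initial + ending);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Five distinct spellings remain out of the eight columns.
        System.out.println(distinctSpellings("l", UE_ROW));
    }
}
```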

Page 19:

Transliteration Systems By Column

• Only 3 systems (in effect)

• Hànyŭ Pīnyīn (HP)

• Tōngyòng Pīnyīn #1 (TP1) & Tōngyòng Pīnyīn #2 (TP2)

• Modified Hànyŭ Pīnyīn #1 (MHP1) & Modified Hànyŭ Pīnyīn #2 (MHP2)

• Wade-Giles #1 (WG1), Wade-Giles #2 (WG2), & Wade-Giles #3 (WG3)

Page 20:

Differences Between Transliteration System Variants

• TP1- iu, ui, ‘

• TP2- iou, uei, -

• WG2- h’ung (not hung)

• WG3- ts’u (not tz’u)

• WG1- szu (not ssu)

Page 21:

Web version

http://www.translitsearch.com/demos/demos.htm

Page 22:

Web search

Page 23:

What is the effect?

• Search for 130 Pinyin cities/regions

• 16 – no other transliteration

• 60 – at least two others

• 6 – three or more

• How much did Xiaozhi find? (8% more)

• 5 min. 12 sec. – entire search

Page 24:

Further Work 1

• Include Yale, GR (Gwoyeu Romatzyh), &c.

• YZSPY (Yóuzhèngshì Pīnyīn)

• Accents

• Hanja- and Kanji-based transliterations

• Application to research archives

Page 25:

Further Work 2

• Improvements in accuracy of transliteration

• Search in other transliterations

• Japanese version of current paper

• Hindustani version

• Romani with Indic cognates

• Extension to translation (transliterated Mandarin-Cantonese characters)

Page 26:

Solutions for Tibetan

• Start with Wylie

• Xiaozhi with adjustments

• Dzongkha

• Dzongkha-based variations?

• Analysis of common transliteration patterns (usu. based on closest pronunciation)

Page 27:

Solutions for Malayalam

• Start with Maya (script conversion program)

• Include minor variations from other script conversion programs

• Analysis of transliterations used

Page 28:

Solutions for Romani

• Start with Vlax Romani Standard

• Regional variations

• Some transliterations easier to use on computers

• e.g. chh, sh to omit hacek

Page 29:

Conclusions

• Enhances search by finding alternate transliterations
  – Applied to Mandarin
  – Applicable to other languages

• Applicable to lesser-studied (& other) languages

• Language- (or script-) specific