A New Lexicon Mechanism for Chinese Word Segmentation

1Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

A New Lexicon Mechanism for Chinese Word Segmentation

Advisor : Dr. Hsu

Graduate : Kuo-min Wang

2006 PACIS

.

2

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

Outline Motivation Objective Introduction A New Lexicon Mechanism Experiments Conclusion Personal Opinions

3


N.Y.U.S.T.

I. M.

Motivate Under the development of global networking through

Internet, the amount of articles in Chinese or other oriental languages is increasing rapidly.

As the lack of explicit separator, word segmentation is a precondition for the processing of these character-based languages and thus affecting the whole system in performance.

4


N.Y.U.S.T.

I. M.

Objective This paper propose a new solution for Chinese word

segmentation problem based on lexicon named double-character-and-long-world-hash-indexing (DCLWHI).

This method can improve the speed and efficiency of word segmentation without extra memory spending, and gains the same accuracy.

5


N.Y.U.S.T.

I. M.

Introduction The current methods of Chinese word

segmentation are divided into two kinds Lexicon

Easily accomplished, high level arithmetic efficiency Out of vocabulary problem (OOV)

(new words, names of people, organizations and locations) Frequency statistic

Has the advantage on OOV problems But the arithmetic efficiency is much lower than the

lexicon based method.

6


N.Y.U.S.T.

I. M.

A New Lexicon Mechanism The Double-Character words hold large proportion in

Chinese words. 70% are double-character word [4] Make a hash indexing for the first two characters of the lexicon words,

then add the remaining string into a special long word table, which has

a hash indexing.

7


N.Y.U.S.T.

I. M.

A New Lexicon Mechanism First-Double-Character-Hash-Indexing

Flag Bit(2Bytes) If the two-character is a prefix of a word which length is N, the big N-1 of the 2

bytes will be set 1; Exaple “ 圖籍” , which is a double-character word, but can’t be the prefix of ot

her words, So the Flag Big of 圖籍 is set 0000000000000010(0x0002) 電老 is not a Chinese word, but it can be a prefix of a word 電老虎 . So the Fla

g Bit is 0000000000000100(0x0004) Similar examples : 春夏 (ox000A) ，君子 (x0006) 、敢作 (x0008)

Long Word Hash Indexing Similar to the First-Double-Character-Hash-Indexing.

0000 0000 0000 00102-character3-character4-character

0000 0000 0000 01000000 0000 0000 1000

8


N.Y.U.S.T.

I. M.

A New Lexicon Mechanism Example of Search －＞君子當圖籍是電老虎 Pick up first two characters “ 君子” , Flag Big is x0006 can be a 2-charac

ter or a prefix of a Treble-Character word. Then shift to the character “ 當” , compute the hash value of the substring

“ 君子當” , search in the long word Find the marching index, confirm the string , marching succeed. Shift to Character “ 圖籍” (0x002) Shift to Character “ 是電”

There is no value in hash-indexing, 2 situations may happen First, there is no value in hash-indexing, return one character “ 是” Second, there is a substring in the index, but value unequally; return one character “

是” Shift to Character 電老” (0x004) Shift to Character “ 虎”

君子當圖籍是電老虎


君子當圖籍是電老虎君子當圖籍是電老虎



9


N.Y.U.S.T.

I. M.

Experiments Comparison of Searching Cycles Comparison of Memory Space Cost Comparison of Speed

10


N.Y.U.S.T.

I. M.

Background Binary-Seek-by-Word Composed of three parts

Lexicon text, word-index-table, first-character-index-table

11


N.Y.U.S.T.

I. M.

Background (cont.) TRIE indexing tree is a multi-chain-table tree, the mechanism is

composed of two parts: First-character-index table

and TRIE index-tree node Didn’t need to predict the length

of the word , only need to match the word by chain-tree

12


N.Y.U.S.T.

I. M.

Background (cont.) Binary-Seek-by-Characters Absorbs the search-advantage in TRIE indexing tree,

using searching by characters not searching by words

13


N.Y.U.S.T.

I. M.

Background (cont.)Summary above methods’ drawbacks

Binary-seek-by-word is using full-words marching, the efficiency is evidently low.

The design and maintenance of the TRIE tree is very complex, wastes mass memory space

Binary-seek-by characters Improves some aspects, but it doesn’t change the

data structure of the binary-seek-by-word which restrict the efficiency.

14


N.Y.U.S.T.

I. M.

Some novel schemes Double-Character-Hash- indexing[4]

An new searching tree improved from the TRIE indexing tree.

Composed of two parts: Hashing index, remaining strings.

Can avoids the deep searching , increases the segment speed without complex increasing.

15


N.Y.U.S.T.

I. M.

Some novel schemes (cont.) A new lexicon mechanism based on PATRICIA[3]

Use of the ISN (internal statement number) of the words as the key words bit-string,

Constructs the PATRICIA tree by comparing the big-string. Advantage

The searching process only need some cycles of bit comparison and some cycles of string comparison.

Double-Array Trie[1] Even node in the tree stands for a status of an auto-machine, Which changes according to the difference of the variable. This new structure actually is an improved scheme of the

TRIE tree, using 2 linear arrays to express the TRIE tree

16


N.Y.U.S.T.

I. M.

Some novel schemes (cont.)用負的 base 值表示該位置為詞語。如果狀態 i 對應某一個詞，而且 Base[i]=0 ，那麼令 Base[i]= （ -1 ） *i ，如果 Base[i] 的值不是 0 ，那麼令 Base[i]= （ -1 ） *Base[i] 。得到雙陣列如下：

例如設“阿根”的下標為 i=8 ，那麼 check[i] 的內容是“阿”的下標，而 base[i] 是“阿根廷”的下標的基值。“ 廷”的序列碼為 x=8 ，那麼“阿根廷”的下標為 base[i]+x=base[8]+8=12 。

17


N.Y.U.S.T.

I. M.

Some novel schemes (cont.) Double-code scheme [1]

Basic idea is mapping the 6768 Chinese characters in GB-2312 into the sequence-code from 1 to 6768.

Every string written in Chinese can only maps to a number string,

Composed of two steps: Switch from number-sequence into even-coding Establish indexing mechanism

18


N.Y.U.S.T.

I. M.

Analysis of the novel schemes Double-Character-Hash-Indexing

Improvement of TRIE index tree, while it is easier structured and maintained than the former mechanism.

PATRICIA Is a super arithmetic in segment speed, but it waste on the memory spac

e and reduce the efficiency.

Double-Array Trie When decrease or increase the lexicon, the whole double-array should b

e adjusted.

Double-code scheme The extract rate of the arithmetic is not good enough, which result in a v

ery big array, restrict the performance of the search efficiency

19


N.Y.U.S.T.

I. M.

大白日大白日夢

大白日大白日夢

大白0x00E

大白

Experiment Detail

20


N.Y.U.S.T.

I. M.

Experiment Detail

21


N.Y.U.S.T.

I. M.

Conclusions Our mechanism DCLWHI farther improves

the speed and efficiency of segmentation. The scheme A has a very high process speed

but costs too much memory space, while scheme B costs less storage with a high efficiency. We think it a good eclectic mechanism for Chinese word segmentation.

22


N.Y.U.S.T.

I. M.

Opinions Experiments are not enough to evidence this m

ethod is very well. …..

A New Lexicon Mechanism for Chinese Word Segmentation

Documents

Transcript of A New Lexicon Mechanism for Chinese Word Segmentation