A New Lexicon Mechanism for Chinese Word Segmentation
Intelligent Database Systems Lab
National Yunlin University of Science and Technology (國立雲林科技大學)
Advisor : Dr. Hsu
Graduate : Kuo-min Wang
2006 PACIS
N.Y.U.S.T.
I. M.
Outline
- Motivation
- Objective
- Introduction
- A New Lexicon Mechanism
- Experiments
- Conclusion
- Personal Opinions
Motivation
- With the growth of global networking through the Internet, the number of articles in Chinese and other East Asian languages is increasing rapidly.
- Because these character-based languages lack an explicit word separator, word segmentation is a precondition for processing them, and its performance affects the whole system.
Objective
- This paper proposes a new lexicon-based solution to the Chinese word segmentation problem, named double-character-and-long-word-hash-indexing (DCLWHI).
- The method improves the speed and efficiency of word segmentation without extra memory cost, while achieving the same accuracy.
Introduction
- Current methods of Chinese word segmentation fall into two kinds:
  - Lexicon-based
    - Easy to implement, with high algorithmic efficiency
    - Suffers from the out-of-vocabulary (OOV) problem (new words, names of people, organizations and locations)
  - Frequency-statistics-based
    - Has the advantage on OOV problems
    - But its algorithmic efficiency is much lower than that of the lexicon-based method
A New Lexicon Mechanism
- Double-character words make up a large proportion of Chinese words: 70% are double-character words [4].
- Build a hash index on the first two characters of each lexicon word, then put the remaining string into a special long-word table, which has its own hash index.
A New Lexicon Mechanism
- First-Double-Character-Hash-Indexing
  - Flag bits (2 bytes): if the two-character string is a prefix of a word of length N, bit N-1 of the 2 bytes is set to 1 (counting from bit 0, the least significant bit; a two-character word counts as N = 2).
  - Example: "圖籍" is a double-character word but cannot be the prefix of other words, so the flag bits of 圖籍 are 0000000000000010 (0x0002).
  - 電老 is not a Chinese word, but it is a prefix of the word 電老虎, so its flag bits are 0000000000000100 (0x0004).
  - Similar examples: 春夏 (0x000A), 君子 (0x0006), 敢作 (0x0008)
- Long Word Hash Indexing
  - Similar to the First-Double-Character-Hash-Indexing.
- Flag bit patterns by word length:
  - 2-character: 0000 0000 0000 0010
  - 3-character: 0000 0000 0000 0100
  - 4-character: 0000 0000 0000 1000
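The flag-bit rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_flags` and the toy lexicon are hypothetical (the entries 君子蘭, 春夏秋冬 and 敢作敢為 were picked so that the flags come out as on the slide), and single-character words are simply skipped because the slides do not say how they are stored.

```python
from collections import defaultdict

def build_flags(lexicon):
    """Map each two-character prefix to a 16-bit flag word.

    Bit N-1 is set when some lexicon word of length N starts with the prefix.
    """
    flags = defaultdict(int)
    for word in lexicon:
        n = len(word)
        if 2 <= n <= 16:              # 16 flag bits cover word lengths up to 16
            flags[word[:2]] |= 1 << (n - 1)
    return dict(flags)

# Toy lexicon (an assumption, not the lexicon used in the paper).
LEXICON = ["圖籍", "電老虎", "君子", "君子蘭", "春夏", "春夏秋冬", "敢作敢為"]
flags = build_flags(LEXICON)

for prefix, value in flags.items():
    print(prefix, f"0x{value:04X}")
```

A prefix such as 電老 receives a flag even though it is not itself a word, which is what lets the search stage decide from two characters alone whether a longer probe is worth trying.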
A New Lexicon Mechanism
- Example of search -> 君子當圖籍是電老虎
  - Pick up the first two characters "君子". The flag bits are 0x0006, so it can be a 2-character word or a prefix of a 3-character word.
  - Then shift to the character "當", compute the hash value of the substring "君子當", and search the long-word table. Find the matching index and confirm the string; the match succeeds.
  - Shift to "圖籍" (0x0002).
  - Shift to "是電". No match is found through the hash index; two situations may happen:
    - First, there is no value in the hash index: return the single character "是".
    - Second, there is an entry in the index but its value is not equal: return the single character "是".
  - Shift to "電老" (0x0004).
  - Shift to the character "虎".
[Figure: the sentence 君子當圖籍是電老虎 shown at each search step]
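The search steps above can be approximated with a small forward-matching sketch. It is not the authors' implementation: `build_index` and `segment` are hypothetical names, a Python dict of sets stands in for the long-word hash index, the lexicon is a toy one, and trying the longest flagged length first is an assumption about how ties are resolved. With this toy lexicon the three-character probe 君子當 fails and the search falls back to 君子; the lexicon behind the slide may behave differently.

```python
def build_index(lexicon):
    """Return (flags, long_words): per-prefix flag bits and a long-word table."""
    flags, long_words = {}, {}
    for word in lexicon:
        n = len(word)
        if n < 2 or n > 16:
            continue
        prefix = word[:2]
        flags[prefix] = flags.get(prefix, 0) | (1 << (n - 1))
        if n > 2:
            # The dict of sets plays the role of the long-word hash index.
            long_words.setdefault(prefix, set()).add(word[2:])
    return flags, long_words

def segment(text, flags, long_words):
    """Split text into words, falling back to single characters."""
    out, i = [], 0
    while i < len(text):
        prefix = text[i:i + 2]
        flag = flags.get(prefix, 0)
        match = text[i]                      # fallback: one character
        # Only lengths whose flag bit is set are probed, longest first.
        for n in range(min(16, len(text) - i), 1, -1):
            if not flag & (1 << (n - 1)):
                continue
            if n == 2 or text[i + 2:i + n] in long_words.get(prefix, ()):
                match = text[i:i + n]
                break
        out.append(match)
        i += len(match)
    return out

# Toy lexicon (an assumption); 君子蘭 gives 君子 the flag 0x0006.
flags, long_words = build_index(["君子", "君子蘭", "圖籍", "電老虎"])
result = segment("君子當圖籍是電老虎", flags, long_words)
print(result)
```

Pairs such as 當圖 and 是電 have no entry in the first-double-character index, so they cost a single lookup before the single character is returned.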
Experiments
- Comparison of searching cycles
- Comparison of memory space cost
- Comparison of speed
Background
- Binary-Seek-by-Word
  - Composed of three parts: lexicon text, word-index table, first-character-index table
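As a rough illustration of how the three parts could fit together, the sketch below keeps a sorted word list as the word-index table and a dictionary from first character to a slice of that list as the first-character-index table. The names `build_tables` and `contains`, the toy lexicon, and sorting by code point are all assumptions; the slides do not describe the storage layout.

```python
from bisect import bisect_left

def build_tables(lexicon):
    """Return (words, first_char): sorted word list and per-character ranges."""
    words = sorted(set(lexicon))             # word-index table, sorted by code point
    first_char = {}                          # first-character-index table
    for pos, word in enumerate(words):
        start, _ = first_char.get(word[0], (pos, pos))
        first_char[word[0]] = (start, pos + 1)
    return words, first_char

def contains(word, words, first_char):
    """Whole-word binary search inside the block of the first character."""
    if not word or word[0] not in first_char:
        return False
    lo, hi = first_char[word[0]]
    k = bisect_left(words, word, lo, hi)
    return k < hi and words[k] == word

# Toy lexicon (an assumption).
words, first_char = build_tables(["電老虎", "電話", "君子", "圖籍"])
print(contains("電老虎", words, first_char), contains("電老", words, first_char))
```

Each lookup compares whole words, so a segmenter built on it has to guess a candidate length and repeat the search for every length it tries.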
Background (cont.)
- TRIE indexing tree
  - A multi-linked-list tree; the mechanism is composed of two parts: the first-character-index table and the TRIE index-tree nodes
  - There is no need to predict the length of the word; the word only needs to be matched along the chain of tree nodes
Background (cont.)
- Binary-Seek-by-Characters
  - Takes the search advantage of the TRIE indexing tree: searching character by character rather than word by word
Background (cont.)
- Summary of the drawbacks of the above methods
  - Binary-seek-by-word uses full-word matching, so its efficiency is evidently low.
  - The design and maintenance of the TRIE tree are very complex, and it wastes a large amount of memory.
  - Binary-seek-by-characters improves some aspects, but it does not change the data structure of binary-seek-by-word, which restricts its efficiency.
Some novel schemes
- Double-Character-Hash-Indexing [4]
  - A new search tree improved from the TRIE indexing tree.
  - Composed of two parts: hashing index and remaining strings.
  - Avoids deep searching and increases segmentation speed without added complexity.
Some novel schemes (cont.)
- A new lexicon mechanism based on PATRICIA [3]
  - Uses the ISN (internal statement number) of the words as the key bit-string,
  - and constructs the PATRICIA tree by comparing the bit-strings.
  - Advantage: the searching process needs only a few cycles of bit comparison and a few cycles of string comparison.
- Double-Array Trie [1]
  - Every node in the tree stands for a state of an automaton, which changes according to the input variable.
  - This structure is actually an improved TRIE tree that uses 2 linear arrays to represent the tree.
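Some novel schemes (cont.)
- A negative base value marks a position as a word. If state i corresponds to a word and Base[i] = 0, set Base[i] = (-1) * i; if Base[i] is not 0, set Base[i] = (-1) * Base[i]. This yields the double array shown below:
[Figure: the resulting double array]
- For example, suppose the index of "阿根" is i = 8. Then check[i] holds the index of "阿", and base[i] is the base value for the index of "阿根廷". The sequence code of "廷" is x = 8, so the index of "阿根廷" is base[i] + x = base[8] + 8 = 12.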
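The arithmetic in the 阿根廷 example can be checked with a tiny transition function. This is a sketch only: dictionaries stand in for the two linear arrays, `step` and `is_word` are hypothetical names, and only the cells mentioned on the slide are filled in. The value base[8] = 4 follows from base[8] + 8 = 12; base[12] = -12 assumes that 阿根廷 is a word with no children, following the negative-base rule.

```python
def step(base, check, state, code):
    """Follow one transition; return the next state or None if it is invalid."""
    nxt = abs(base.get(state, 0)) + code     # the sign only marks "is a word"
    return nxt if check.get(nxt) == state else None

def is_word(base, state):
    """A negative base value marks the state as the end of a word."""
    return base.get(state, 0) < 0

# Sparse fragment of the double array (remaining cells are unknown here).
base = {8: 4, 12: -12}     # state 8 = "阿根", state 12 = "阿根廷"
check = {12: 8}            # state 12 was reached from state 8

CODE_TING = 8              # sequence code of "廷" given on the slide
nxt = step(base, check, 8, CODE_TING)
print(nxt, is_word(base, nxt))
```

The check array is what rejects a transition whose target cell belongs to a different parent state, so a lookup costs one addition and one comparison per character.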
Some novel schemes (cont.)
- Double-code scheme [1]
  - The basic idea is to map the 6768 Chinese characters in GB-2312 to sequence codes from 1 to 6768.
  - Every string written in Chinese then maps to exactly one number string.
  - Composed of two steps:
    - Switch from the number sequence into even-coding
    - Establish the indexing mechanism
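One plausible way to obtain sequence codes from 1 to 6768 is to number the GB2312 hanzi code positions row by row. GB2312 places its hanzi in rows 16 to 87 (lead bytes 0xB0 to 0xF7) with 94 columns each, which gives 72 x 94 = 6768 positions. The row-major formula and the names `sequence_code` and `encode_string` are assumptions; the slides do not give the exact mapping. Note that traditional characters outside GB2312 yield `None` in this sketch.

```python
def sequence_code(ch):
    """Return a 1-based sequence code for a GB2312 hanzi, or None otherwise."""
    try:
        raw = ch.encode("gb2312")
    except UnicodeEncodeError:
        return None                           # character is not in GB2312
    if len(raw) != 2:
        return None                           # ASCII and other single bytes
    row, col = raw[0] - 0xB0, raw[1] - 0xA1   # hanzi rows start at lead byte 0xB0
    if not (0 <= row < 72 and 0 <= col < 94):
        return None                           # symbols, punctuation, etc.
    return row * 94 + col + 1

def encode_string(text):
    """Map a string to its list of sequence codes."""
    return [sequence_code(ch) for ch in text]

codes = encode_string("啊中文")
print(codes)
```

Because every character becomes a small integer, a word becomes a fixed number string that can be used directly as an array index or hash key in the indexing step.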
Analysis of the novel schemes
- Double-Character-Hash-Indexing
  - An improvement of the TRIE index tree that is easier to construct and maintain than the former mechanism.
- PATRICIA
  - An excellent algorithm in segmentation speed, but it wastes memory space, which reduces efficiency.
- Double-Array Trie
  - When entries are removed from or added to the lexicon, the whole double array has to be adjusted.
- Double-code scheme
  - The extraction rate of the algorithm is not good enough, which results in a very large array and restricts search efficiency.
Experiment Detail
[Figure: lookup of 大白日 and 大白日夢 through the first-double-character entry 大白 (flag 0x00E)]
Experiment Detail
Conclusions
- Our mechanism, DCLWHI, further improves the speed and efficiency of segmentation.
- Scheme A has a very high processing speed but costs too much memory, while scheme B costs less storage and keeps high efficiency.
- We think it is a good compromise mechanism for Chinese word segmentation.
Opinions
- The experiments are not sufficient to show that this method performs well.
- …