Deep Technologies in Kana Kanji Conversion
Yoh Okuno
Components of Converter
• Model, training, storage, interface, etc.
[Diagram: Corpora → (batch training) → Model → (lookup) → Converter ⇄ User; input: kana, output: kanji]
Deep Technologies
• Various Language Models
• Training LMs using the Web and Hadoop
• Automatic Pronunciation Inference
• Data Compression
• Predictive Conversion and Spelling Correction
Various Language Models
• Word N-gram
– Accurate but too large!
• Class N-gram
– Small but inaccurate
• Combination
– A good trade-off is needed
Language Models
• Word N-gram:
P(y) = \prod_i P(y_i \mid y_{i-N+1}^{i-1})
• Class bigram:
P(y) = \prod_i P(y_i \mid c_i)\, P(c_i \mid c_{i-1})
• Phrase-based model:
P(y) = \prod_{i \in I_C} P(y_i \mid c_i)\, P(c_i \mid c_{i-1}) \prod_{i \in I_W} P(y_i^{i+N-1}, c_{i+N-1} \mid c_i)\, P(c_i \mid c_{i-1})
(the first product is the class-based sub-model, the second the word-based sub-model)
Phrase-based Model
• Replace the partial class bigram with a word N-gram
• Intermediate classes are marginalized out
• Phrase probability: P(w1, w2, w3, c3 | c1); only the left-side class is conditioned on!
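To make the class bigram formula above concrete, here is a minimal scoring sketch. The words, classes, and probability values are invented toy numbers, not from the slides:

```python
import math

# Hypothetical toy tables: P(word | class) and P(class | previous class).
# "BOS" marks the beginning of the sentence.
p_word_given_class = {
    ("東京", "NOUN"): 0.2,
    ("に", "PARTICLE"): 0.5,
    ("行く", "VERB"): 0.1,
}
p_class_bigram = {
    ("BOS", "NOUN"): 0.6,
    ("NOUN", "PARTICLE"): 0.7,
    ("PARTICLE", "VERB"): 0.4,
}

def class_bigram_logprob(words, classes):
    """log P(y) = sum_i [ log P(y_i | c_i) + log P(c_i | c_{i-1}) ]."""
    logp = 0.0
    prev = "BOS"
    for w, c in zip(words, classes):
        logp += math.log(p_word_given_class[(w, c)])
        logp += math.log(p_class_bigram[(prev, c)])
        prev = c
    return logp

lp = class_bigram_logprob(["東京", "に", "行く"], ["NOUN", "PARTICLE", "VERB"])
print(lp)  # log of 0.2*0.6 * 0.5*0.7 * 0.1*0.4
```

Working in log space avoids underflow for long sentences; the real converter searches over candidate (word, class) sequences with scores of this form.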
Training Large-Scale Language Models
Issues in Training
• How to collect large corpora?
– Crawl, crawl, crawl!
– A morphological analyzer is needed
• How to store and process them?
– Hadoop MapReduce helps us
– How to speed up N-gram counting?
Crawling the Web
• Raw HTML can be collected from the Web
• Statistics have no copyright
• Required components:
– Web crawler
– Body text extraction
– (Spam filter)
– Morphological analyzer (make use of the cloud)
Japanese Morphological Analyzer
• Input: raw text
• Output: segmented words, part-of-speech tags, ...
MapReduce for Language Model
• Distributed computing of N-gram statistics
[Diagram: Corpora → Mappers → Reducers → N-grams]
Mapper: extract N-grams from corpora
Reducer: aggregate N-gram counts
MapReduce: Pseudo Code
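The pseudo code itself did not survive in this transcript; the standard mapper/reducer pair for N-gram counting looks roughly like the sketch below. This is a local Python simulation of the shuffle phase, not actual Hadoop code, and the tiny corpus is invented:

```python
from collections import defaultdict

N = 2  # count bigrams

def mapper(line):
    """Emit (ngram, 1) for every N-gram in a pre-segmented line."""
    words = line.split()
    for i in range(len(words) - N + 1):
        yield tuple(words[i:i + N]), 1

def reducer(ngram, counts):
    """Aggregate all counts emitted for one N-gram."""
    return ngram, sum(counts)

# Local stand-in for the shuffle: group mapper output by key.
corpus = ["私 は 学生 です", "私 は 元気 です"]
shuffled = defaultdict(list)
for line in corpus:
    for key, value in mapper(line):
        shuffled[key].append(value)

result = dict(reducer(k, v) for k, v in shuffled.items())
print(result[("私", "は")])  # 2
```

On Hadoop the framework performs the grouping between map and reduce; the mapper and reducer bodies stay this simple.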
Speeding up N-gram Counting
• Use a binary representation for N-grams
– Variable-length word IDs are efficient
• Use in-mapper combining, by Jimmy Lin
– Combining in memory is more efficient
• Use the stripes pattern, by Jimmy Lin
– Group N-grams by their first word
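The two patterns above can be sketched as follows. The corpus and function names are invented for illustration; real implementations would live inside Hadoop mapper classes:

```python
from collections import Counter, defaultdict

N = 2

def in_mapper_combine(lines):
    """In-mapper combining: aggregate counts in memory inside one mapper,
    emitting each distinct N-gram once instead of once per occurrence."""
    local = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - N + 1):
            local[tuple(words[i:i + N])] += 1
    for ngram, count in local.items():
        yield ngram, count  # far fewer intermediate pairs

def stripes_mapper(lines):
    """Stripes pattern: group bigrams by their first word, emitting
    (first_word, {second_word: count}) stripes."""
    stripes = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for a, b in zip(words, words[1:]):
            stripes[a][b] += 1
    for first, counter in stripes.items():
        yield first, dict(counter)

corpus = ["私 は 学生 です", "私 は 元気 です"]
combined = dict(in_mapper_combine(corpus))
stripes = dict(stripes_mapper(corpus))
print(combined[("私", "は")])   # 2
print(stripes["は"])            # {'学生': 1, '元気': 1}
```

Both tricks reduce the volume of intermediate data shuffled between mappers and reducers, which is usually the bottleneck in N-gram counting.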
Performance-Size Trade-off
[Figure: cross entropy (bits) vs. model size (bytes) as the threshold varies, with operating points for mobile, PC, and cloud]
[Okuno+ 2011]
Automatic Pronunciation Inference
Pronunciation Inference
• A Japanese word has 1-3 pronunciations
• How to pronounce sentences or phrases?
• Basic approaches:
– Word-based: combination of word pronunciations
– Character-based: combination of character pronunciations
Mining Pronunciations via Hadoop
• Corpora contain (phrase, pronunciation) pairs
• Expressions like: 四季多彩(しきたさい)
• In English: Phrase (Pronunciation)
• Distributed grep with the regular expression:
“\p{InCJKUnifiedIdeographs}+(\p{InHiragana}+)”
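The `\p{In...}` block syntax above is Java regex notation. As a sketch, the same extraction can be done in plain Python with explicit Unicode code-point ranges (the sample sentence around the slide's example phrase is invented):

```python
import re

# Python's stdlib `re` lacks \p{InHiragana}, so use explicit ranges:
# U+4E00-U+9FFF for CJK unified ideographs, U+3041-U+3096 for hiragana.
PAIR = re.compile(r"([\u4e00-\u9fff]+)\(([\u3041-\u3096]+)\)")

text = "この地は四季多彩(しきたさい)と呼ばれる"
pairs = PAIR.findall(text)
for phrase, pron in pairs:
    print(phrase, pron)  # 四季多彩 しきたさい
```

Run as a mapper over the crawled corpus, this yields the raw (phrase, pronunciation) candidates; the alignment step below then filters the noise.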
Character Alignment Task
• Character alignment for noise reduction
• Input: pairs of word and pronunciation
• Output: aligned pairs

四季多彩 しきたさい → 四|季|多|彩 し|き|た|さい
西都原 さいとばる → 西|都|原 さい|と|ばる
iPhone あいふぉん → i|Ph|o|n|e あい|ふ|ぉ|ん|_

• We can use an HMM and the EM algorithm
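A minimal sketch of the EM idea on a toy corpus: each character emits a short kana span, and EM estimates the emission probabilities from unaligned pairs. The extra short pairs below are invented here to give EM enough evidence to disambiguate; real data would come from the mined pairs:

```python
from collections import defaultdict
from math import prod

# Toy (word, pronunciation) pairs; the short pairs are hypothetical
# extra evidence so that EM can resolve the ambiguous segmentation.
pairs = [
    ("四季多彩", "しきたさい"),
    ("四季", "しき"),
    ("多彩", "たさい"),
    ("彩", "さい"),
]
MAX_SPAN = 3  # each character aligns to 1..3 kana

def alignments(word, kana):
    """Enumerate monotonic segmentations of kana into len(word) spans."""
    if not word:
        if not kana:
            yield []
        return
    for span in range(1, MAX_SPAN + 1):
        if span <= len(kana):
            for rest in alignments(word[1:], kana[span:]):
                yield [(word[0], kana[:span])] + rest

prob = defaultdict(lambda: 1.0)  # uniform start
for _ in range(20):  # EM iterations
    counts = defaultdict(float)
    for word, kana in pairs:
        aligns = list(alignments(word, kana))
        weights = [prod(prob[p] for p in a) for a in aligns]
        z = sum(weights)
        for a, w in zip(aligns, weights):  # E-step: posterior counts
            for p in a:
                counts[p] += w / z
    totals = defaultdict(float)  # M-step: renormalize per character
    for (ch, k), v in counts.items():
        totals[ch] += v
    prob = defaultdict(float, {(ch, k): v / totals[ch]
                               for (ch, k), v in counts.items()})

best = max(alignments("多彩", "たさい"),
           key=lambda a: prod(prob[p] for p in a))
print(best)  # [('多', 'た'), ('彩', 'さい')]
```

A full HMM would add transition structure and forward-backward instead of explicit enumeration, but the E-step/M-step loop is the same.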
Data Compression
Why Compression?
• IMs should save memory for other apps
• Typically 50 MB for PC and 1-2 MB for mobile
• Compress the data as much as possible!
• Solution: succinct data structures
LOUDS: Succinct Trie
• Use unary code to represent the tree compactly
[Figure: a 9-node tree (nodes a-i) with its LOUDS bit string
10 11110 0 110 0 10 0 0 10 0
size = #nodes × 2 + 1 = 19 bits; an auxiliary index (for rank/select) is required besides the bits]
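A minimal sketch of building a LOUDS bit string (the small tree here is invented, not the one in the figure): after an initial "10" for the virtual super-root, each node visited in breadth-first order contributes its degree in unary, one "1" per child followed by a "0":

```python
from collections import deque

# Hypothetical toy tree as an adjacency dict; "a" is the root.
children = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": [],
    "d": [],
    "e": [],
}

def louds_bits(children, root):
    """Encode the tree: '10' for the super-root, then one unary degree
    code per node in BFS order ('1' * #children + '0')."""
    bits = ["10"]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")
        queue.extend(kids)
    return "".join(bits)

bits = louds_bits(children, "a")
print(bits)       # 10110110000
print(len(bits))  # 2 * 5 nodes + 1 = 11 bits
```

Navigation (parent/child moves) is then done with rank/select queries over this bit vector, which is why the auxiliary index is needed.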
MARISA: Nested Patricia Trie
• Merge chains of non-branching nodes in the tree, and apply the trie recursively to the merged labels
[Yata+ 11]
[Figure: normal trie vs. Patricia trie]
Other Functions
Predictive Conversion
• Motivation: we want to save keystrokes
• Approach: show the most probable completion when users input their first few characters
Predictive Conversion
• Accuracy and length are a trade-off
• Phrase extraction is needed
– Eliminate candidates like とうございます ("you very much"): a sub-sequence of a phrase
• Examples: おはよう ("Good"), おはようございます ("Good morning")
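As a toy illustration of completion ranking (the candidate phrases and counts below are invented): given a prefix, return the most frequent full phrases that extend it. A real converter would store candidates in a trie rather than scanning a dict, and rank with language-model scores rather than raw counts:

```python
# Hypothetical candidate phrases with frequency counts.
candidates = {
    "おはよう": 120,
    "おはようございます": 300,
    "おはな": 40,
}

def predict(prefix, candidates, k=2):
    """Return the k most frequent candidates that complete the prefix."""
    matches = [(count, phrase) for phrase, count in candidates.items()
               if phrase.startswith(prefix) and phrase != prefix]
    return [phrase for count, phrase in sorted(matches, reverse=True)[:k]]

print(predict("おはよ", candidates))  # ['おはようございます', 'おはよう']
```

Note that the linear scan here is O(dictionary size); a trie makes prefix enumeration proportional to the number of matches instead.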
Phrase Extraction for Prediction
• A paper on phrase extraction is to appear
• Digest: fast and accurate phrase extraction
[Okuno+ 2011]
Spelling Correction
• Correct users' mistypes
• Search: trie for fuzzy matching
• Model: edit distance as the error model
• Edit operations: insert, delete, and replace
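A minimal sketch of fuzzy search over a trie with edit distance (the dictionary words and threshold are invented): one row of the Levenshtein DP matrix is computed per trie node, so words sharing a prefix share the work, and branches whose row minimum already exceeds the threshold are pruned:

```python
def build_trie(words):
    """Nested-dict trie; '$' marks end of word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def fuzzy_search(trie, query, max_dist):
    """Find dictionary words within max_dist edits
    (insert/delete/replace) of query."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row, prefix):
        # Next DP row of the edit-distance matrix for (prefix + ch) vs query.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # insert
                           prev_row[j] + 1,         # delete
                           prev_row[j - 1] + cost)) # replace / match
        if "$" in node and row[-1] <= max_dist:
            results.append((prefix + ch, row[-1]))
        if min(row) <= max_dist:  # prune hopeless branches
            for nxt, child in node.items():
                if nxt != "$":
                    walk(child, nxt, row, prefix + ch)

    for ch, child in trie.items():
        walk(child, ch, first_row, "")
    return sorted(results, key=lambda r: r[1])

trie = build_trie(["きょう", "きのう", "きょうと"])
print(fuzzy_search(trie, "きよう", 1))  # [('きょう', 1), ('きのう', 1)]
```

In a real corrector the uniform edit cost would be replaced by a learned error model, e.g. cheaper substitutions for keys that are adjacent on the keyboard.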
Conclusion
• Various technologies are needed:
– Statistical language models and their training
– Morphological analysis, pronunciation inference
– Data compression and retrieval
– Predictive conversion and spelling correction