Deep Technologies about Kana Kanji Conversion


Transcript of Deep Technologies about Kana Kanji Conversion

Page 1: Deep Technologies about Kana Kanji Conversion

Deep Technologies in Kana Kanji Conversion

Yoh Okuno

Page 2: Deep Technologies about Kana Kanji Conversion

Components of Converter

• Model, Training, Storage, Interface, etc.

(Diagram: corpora are used to train the model in batch; the converter looks up the model to turn the user's kana input into kanji output)

Page 3: Deep Technologies about Kana Kanji Conversion

Deep Technologies

• Various Language Models
• Training LMs using the Web and Hadoop
• Automatic Pronunciation Inference
• Data Compression
• Predictive Conversion and Spelling Correction

Page 4: Deep Technologies about Kana Kanji Conversion

Various Language Models

Page 5: Deep Technologies about Kana Kanji Conversion

Various Language Models

• Word N-gram
– Accurate but too large!
• Class N-gram
– Small but inaccurate
• Combination
– A good trade-off is needed

Page 6: Deep Technologies about Kana Kanji Conversion

Language Models

• Word N-gram: P(y) = \prod_i P(y_i \mid y_{i-N+1}^{i-1})
• Class Bigram: P(y) = \prod_i P(y_i \mid c_i)\, P(c_i \mid c_{i-1})  (scoring sketch below)
• Phrase-based Model:
  P(y) = \prod_{i \in I_C} P(y_i \mid c_i)\, P(c_i \mid c_{i-1}) \cdot \prod_{i \in I_W} P(y_i^{i+N-1}, c_{i+N-1} \mid c_i)\, P(c_i \mid c_{i-1})
  (class-based sub-model over positions I_C; word-based sub-model over positions I_W)
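As a concrete illustration of the class-bigram factorization above, here is a minimal scoring sketch (toy probability tables; smoothing, OOV handling, and the search over candidates are all omitted, and the function names are mine):

```python
import math

def class_bigram_logprob(words, classes, p_word_given_class, p_class_transition):
    """Score one candidate y under P(y) = prod_i P(y_i | c_i) P(c_i | c_{i-1})."""
    logp = 0.0
    prev = "<s>"                                          # sentence-start class
    for w, c in zip(words, classes):
        logp += math.log(p_class_transition[(prev, c)])   # P(c_i | c_{i-1})
        logp += math.log(p_word_given_class[(c, w)])      # P(y_i | c_i)
        prev = c
    return logp
```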

Page 7: Deep Technologies about Kana Kanji Conversion

Phrase-based Model

• Replace part of the class bigram with a word N-gram
• Intermediate classes are marginalized out

Phrase probability: P(w1, w2, w3, c3 | c1); only the left-side class is conditioned on (worked out below)

(Figure: a lattice showing which nodes are classes and which are words)
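For a three-word phrase, assuming the usual class-bigram HMM factorization (each word emitted from its class, classes forming a Markov chain), the phrase probability marginalizes the intermediate class c_2:

```latex
P(w_1, w_2, w_3, c_3 \mid c_1)
  = P(w_1 \mid c_1) \sum_{c_2} P(c_2 \mid c_1)\, P(w_2 \mid c_2)\, P(c_3 \mid c_2)\, P(w_3 \mid c_3)
```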

Page 8: Deep Technologies about Kana Kanji Conversion

Training Large-Scale Language Models

Page 9: Deep Technologies about Kana Kanji Conversion

Issues in Training

• How to collect large corpora?
– Crawl, crawl, crawl!
– A morphological analyzer is needed
• How to store and process them?
– Hadoop MapReduce helps us
– How do we speed up N-gram counting?

Page 10: Deep Technologies about Kana Kanji Conversion

Crawling the Web

• Raw HTML can be collected from the Web
• Statistics have no copyright
• Required components:
– Web crawler
– Body text extraction
– (Spam filter)
– Morphological analyzer (make use of the cloud)

Page 11: Deep Technologies about Kana Kanji Conversion

Japanese Morphological Analyzer

• Input: raw text
• Output: segmented words, part-of-speech tags, ... (see the sketch below)
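For illustration, a minimal segmentation sketch; the deck does not name a specific tool, so MeCab with its Python binding (mecab-python3) is my assumption here:

```python
import MeCab

# -Owakati prints the input as space-separated surface forms (segmentation only)
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse("今日は良い天気です").strip())
# e.g. 今日 は 良い 天気 です

# the default output format also includes part-of-speech information per token
print(MeCab.Tagger().parse("今日は良い天気です"))
```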

Page 12: Deep Technologies about Kana Kanji Conversion

MapReduce for Language Models

• Distributed computation of N-gram statistics

(Diagram: corpora are split across several mappers; their outputs are merged by reducers into N-gram counts)

Mapper: extract N-grams from corpora
Reducer: aggregate N-gram counts

Page 13: Deep Technologies about Kana Kanji Conversion

MapReduce: Pseudo Code
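The pseudo code itself was not captured in this transcript. Below is a minimal sketch in the Hadoop Streaming style (mapper and reducer reading lines from stdin), which is one common way to implement the counting described on the previous page; it is not necessarily the code that appeared on the slide:

```python
# mapper.py: emit every N-gram (here N = 3) found in each pre-segmented line
import sys

N = 3
for line in sys.stdin:
    words = line.split()
    for i in range(len(words) - N + 1):
        print("\t".join(words[i:i + N]) + "\t1")
```

```python
# reducer.py: sum the counts for each N-gram (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    *ngram, count = line.rstrip("\n").split("\t")
    key = "\t".join(ngram)
    if key != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = key, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))
```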

Page 14: Deep Technologies about Kana Kanji Conversion

Speeding up N-gram Counting

• Use a binary representation for N-grams
– Variable-length word IDs are space-efficient
• Use in-mapper combining, by Jimmy Lin
– Combining in memory is more efficient
• Use the stripes pattern, by Jimmy Lin
– Group N-grams by their first word (see the sketch below)
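A minimal sketch of the last two ideas together: counts are combined in memory inside the mapper and emitted once as a stripe keyed by the first word (bigrams for brevity; the stripe encoding and flushing policy are my choices, not the deck's):

```python
# mapper.py: in-mapper combining with stripes, for bigram counts
import sys
import json
from collections import defaultdict

stripes = defaultdict(lambda: defaultdict(int))   # first word -> {second word: count}

for line in sys.stdin:
    words = line.split()
    for w1, w2 in zip(words, words[1:]):
        stripes[w1][w2] += 1      # combine in memory instead of emitting (w1, w2, 1)

# emit one stripe per first word at the end of the map task
# (a real mapper would also flush early when memory runs low)
for w1, stripe in stripes.items():
    print(w1 + "\t" + json.dumps(stripe, ensure_ascii=False))
```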

Page 15: Deep Technologies about Kana Kanji Conversion

Performance-Size Trade-off

(Figure: cross entropy in bits versus model size in bytes as the count threshold is varied, with markers for typical Mobile, PC, and Cloud size budgets) [Okuno+ 2011]

Page 16: Deep Technologies about Kana Kanji Conversion

Automatic Pronunciation Inference

Page 17: Deep Technologies about Kana Kanji Conversion

Pronunciation Inference

• A Japanese word typically has 1-3 pronunciations
• How do we pronounce whole sentences or phrases?
• Basic approaches (see the sketch below):
– Word-based: combine the pronunciations of each word
– Character-based: combine the pronunciations of each character
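A minimal sketch of the word-based approach: enumerate the cross product of per-word readings for an already-segmented phrase. The dictionary entries are toy values of mine; a real converter would rank the candidates with a language model:

```python
from itertools import product

# toy pronunciation dictionary: word -> possible kana readings
readings = {
    "今日": ["きょう", "こんにち"],
    "は": ["は", "わ"],
    "晴れ": ["はれ"],
}

def phrase_readings(words):
    """Word-based inference: every combination of the per-word readings."""
    return ["".join(combo) for combo in product(*(readings[w] for w in words))]

print(phrase_readings(["今日", "は", "晴れ"]))
# ['きょうははれ', 'きょうわはれ', 'こんにちははれ', 'こんにちわはれ']
```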

Page 18: Deep Technologies about Kana Kanji Conversion

Mining Pronunciations via Hadoop

• Corpora contain (phrase, pronunciation) pairs
• Expressions like: 四季多彩(しきたさい)
• In English, the pattern is: Phrase (Pronunciation)
• Distributed grep with the regular expression (single-machine sketch below):
“\p{InCJKUnifiedIdeographs}+(\p{InHiragana}+)”
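A single-machine sketch of the same grep (the deck runs it as a distributed grep on Hadoop). It uses Python's third-party regex module with Unicode script properties in place of Java's \p{InCJKUnifiedIdeographs}/\p{InHiragana} blocks, and assumes full-width parentheses around the reading; both substitutions are my assumptions:

```python
import sys
import regex  # third-party module; supports \p{Han} and \p{Hiragana}

# a kanji phrase followed by its hiragana reading in full-width parentheses,
# e.g. 四季多彩（しきたさい）
pattern = regex.compile(r"(\p{Han}+)（(\p{Hiragana}+)）")

for line in sys.stdin:
    for phrase, reading in pattern.findall(line):
        print(phrase + "\t" + reading)
```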

Page 19: Deep Technologies about Kana Kanji Conversion

Character Alignment Task

• Character alignment for noise reduction
• Input: pairs of word and pronunciation
• Output: aligned pairs

四季多彩 しきたさい → 四|季|多|彩| し|き|た|さい|
西都原 さいとばる → 西|都|原| さい|と|ばる|
iPhone あいふぉん → i|Ph|o|n|e| あい|ふ|ぉ|ん|_|

We can use an HMM and the EM algorithm (a minimal sketch follows below)
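The deck only names "HMM and EM"; below is a minimal hard-EM (Viterbi EM) sketch of monotone character-to-kana alignment, where each character of the word takes 1-4 kana. The flat initialization, the probability floor, and the function names are my choices; a real implementation would use soft EM (forward-backward) and better smoothing:

```python
import math
from collections import defaultdict

def viterbi_align(word, kana, prob, max_span=4):
    """Best monotone alignment: each character of word takes 1..max_span kana."""
    n, m = len(word), len(kana)
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(1, min(max_span, j) + 1):
                prev = best[i - 1][j - k]
                if prev == NEG:
                    continue
                chunk = kana[j - k:j]
                score = prev + math.log(prob.get((word[i - 1], chunk), 1e-6))
                if score > best[i][j]:
                    best[i][j], back[i][j] = score, k
    if best[n][m] == NEG:
        return None
    chunks, j = [], m
    for i in range(n, 0, -1):          # backtrace the chosen kana spans
        k = back[i][j]
        chunks.append(kana[j - k:j])
        j -= k
    return list(reversed(chunks))

def train(pairs, iterations=5):
    """Hard EM: align with the current parameters, then re-estimate them."""
    prob = {}   # flat start: every (character, chunk) pair falls back to the floor
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for word, kana in pairs:
            chunks = viterbi_align(word, kana, prob)
            if chunks is None:
                continue
            for ch, chunk in zip(word, chunks):
                counts[(ch, chunk)] += 1.0
                totals[ch] += 1.0
        prob = {pair: c / totals[pair[0]] for pair, c in counts.items()}
    return prob
```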

Page 20: Deep Technologies about Kana Kanji Conversion

Data Compression

Page 21: Deep Technologies about Kana Kanji Conversion

Why Compression?

• Input methods should leave memory free for other apps
• Typically 50 MB for PC and 1-2 MB for mobile
• Compress the data as much as possible!
• Solution: succinct data structures

Page 22: Deep Technologies about Kana Kanji Conversion

LOUDS: Succinct Trie

• Use unary codes to represent the tree compactly (construction sketch below)

(Figure: an example trie with nine nodes labeled a-i)

Bit string: 10 11110 0 110 0 10 0 0 10 0
size = #nodes * 2 + 1 = 19 bits; an auxiliary index (for rank/select) is required in addition
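A minimal sketch of how the LOUDS bit string is built: traverse the tree in BFS order and write each node's number of children in unary, after a fixed "10" for the virtual super-root. The tree below is my reconstruction, chosen so that the output matches the 19-bit string on the slide; the slide's exact parent/child labeling may differ:

```python
from collections import deque

def louds_bits(root, children):
    """children: dict mapping each node to its ordered list of child nodes."""
    bits = "10"                        # virtual super-root with one child: the real root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits += "1" * len(kids) + "0"  # unary degree code, nodes visited in BFS order
        queue.extend(kids)
    return bits

# nine-node example; total length is 2 * 9 + 1 = 19 bits
tree = {"a": ["b", "c", "d", "e"], "c": ["f", "g"], "e": ["h"], "h": ["i"]}
print(louds_bits("a", tree))  # 1011110011001000100
```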

Page 23: Deep Technologies about Kana Kanji Conversion

MARISA: Nested Patricia Trie

• Merge non-branching nodes in the tree [Yata+ 11] (merge-step sketch below)

(Figure: a normal trie next to the resulting Patricia trie; the construction is applied recursively to the merged labels)
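A minimal sketch of the merge step only (path compression of non-branching chains). MARISA's distinguishing feature, storing the merged labels in another Patricia trie recursively, is not shown; the node representation and the assumption that only leaves terminate words are mine:

```python
def compress(node):
    """Merge a chain of single-child nodes into one multi-character edge.
    node: {"label": str, "children": [child, ...]}."""
    while len(node["children"]) == 1:
        child = node["children"][0]
        node["label"] += child["label"]        # concatenate labels along the chain
        node["children"] = child["children"]
    for child in node["children"]:
        compress(child)
    return node

# toy trie for {"car", "cat"}: c -> a -> {r, t} becomes "ca" -> {"r", "t"}
trie = {"label": "c", "children": [
    {"label": "a", "children": [
        {"label": "r", "children": []},
        {"label": "t", "children": []},
    ]},
]}
compress(trie)
print(trie["label"])  # ca
```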

Page 24: Deep Technologies about Kana Kanji Conversion

Other Functions

Page 25: Deep Technologies about Kana Kanji Conversion

Predictive Conversion

• Motivation: we want to save keystrokes
• Approach: show the most probable completions once the user has typed only the first few characters (see the sketch below)
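A minimal completion sketch over a sorted candidate list: binary search for the prefix range, then rank by probability. The data and structure are illustrative; a real IME would walk the compressed trie described earlier instead of a flat list:

```python
from bisect import bisect_left, bisect_right

# (reading, probability) pairs, kept sorted by reading; toy values
candidates = sorted([
    ("おはよう", 0.02),
    ("おはようございます", 0.05),
    ("おはな", 0.01),
])
readings = [r for r, _ in candidates]

def predict(prefix, k=3):
    """Return the k most probable candidates that start with prefix."""
    lo = bisect_left(readings, prefix)
    # upper bound of the prefix range (fine for kana, which sit below U+FFFF)
    hi = bisect_right(readings, prefix + "\uffff")
    return sorted(candidates[lo:hi], key=lambda x: x[1], reverse=True)[:k]

print(predict("おは"))
# [('おはようございます', 0.05), ('おはよう', 0.02), ('おはな', 0.01)]
```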

Page 26: Deep Technologies about Kana Kanji Conversion

Predictive Conversion

• Accuracy and completion length are in a trade-off
• Phrase extraction is needed
– Eliminate candidates like とうございます (you very much): a sub-sequence of a phrase, not a phrase itself

Example: おはよう (Good) → おはようございます (Good morning)

Page 27: Deep Technologies about Kana Kanji Conversion

Phrase Extraction for Prediction

• A paper on phrase extraction is to appear
• Digest: fast and accurate phrase extraction

[Okuno+ 2011]

Page 28: Deep Technologies about Kana Kanji Conversion

Spelling Correction

• Correct the user's mistyped input
• Search: trie traversal for fuzzy matching
• Model: edit distance as the error model
• Edit operations: insert, delete, and replace (see the sketch below)
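A minimal sketch of the error-model side: the classic edit-distance DP restricted to insert, delete, and replace, as listed above. Combining it with trie traversal (pruning branches whose distance already exceeds a threshold) is how the fuzzy search is usually done, but that part is only hinted at here:

```python
def edit_distance(a, b):
    """Levenshtein distance with insert, delete, and replace (each cost 1)."""
    dp = list(range(len(b) + 1))              # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = min(dp[j] + 1,              # delete ca
                      dp[j - 1] + 1,          # insert cb
                      prev + (ca != cb))      # replace (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

print(edit_distance("きょう", "きよう"))  # 1: replace ょ with よ
```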

Page 29: Deep Technologies about Kana Kanji Conversion

Conclusion

Page 30: Deep Technologies about Kana Kanji Conversion

Conclusion

• Various technologies are needed
– Statistical language models and their training
– Morphological analysis and pronunciation inference
– Data compression and retrieval
– Predictive conversion and spelling correction