Word segmentation: Part 2 Chinese, Japanese, Korean Project Leaders: Sun Maosong Isahara Hitoshi Choi Key-Sun


  • Word segmentation: Part 2 (Chinese, Japanese, Korean)

    Project Leaders: Sun Maosong, Isahara Hitoshi, Choi Key-Sun

  • Presentation

    • General Introduction to CJK Word Segmentation (Choi Key-Sun)

    • Chinese GB Standard on Word Segmentation (Song Min)

    • Japanese Specification (Isahara Hitoshi)

    • Korean Specification (Choi Key-Sun, Hwang Dosam)

    • Chinese Specification (Song Min)

  • Preparatory Works

    • One CJK Tri-lingual Parallel Corpus
      – Mainichi Shinbun in Japanese as the source
      – Translated into Chinese and Korean
      – Word segmented and POS-tagged

    • Word list
      – 1,000-word lists of CJK
        • Independently collected
        • Possibly sorted by frequency in each language's corpus

  • Tri-Parallel Corpus of CJK

    • # S-ID:950101004-002 KNP:96/10/27 MOD:96/11/26

    • J# ロシア南部チェチェン共和国の首都グロズヌイに進攻したロシア軍は三十一日、首都中心部を装甲車などで攻撃、大統領官邸など数カ所が炎上した。

    • C# 进攻俄罗斯南部车臣共和国首都格罗兹尼的俄罗斯军队,31日出动装甲车等进攻首都中心地带,总统官邸等数处着火。

    • K# 러시아남부체첸공화국수도그로즈니를공격하는러시아군은 31일장갑차등을동원해수도중심지대를진격하여대통령관저등많은곳에서불이났다.

  • Language and Morphology

    • English
      – Lemmatizer + POS tagger
      – A word has at most several word forms

    • Chinese
      – Segmentation system + POS tagger
      – A word has one word form

    • Korean (Japanese)
      – A morphological analyzer is more essential than in other languages
        • Complex segmentation
        • Frequent agglutination
        • Word formation
      – The Chinese characters in common use

  • Computational Morphology

    • Morpheme
      – A minimal meaningful element
      – "computational" = "comput+ation+al"

    • Morphological Analysis
      – Segmentation
        • To divide into morphemes or words
        • To include lemmatization (= to find the stem)
      – Categorization
        • To assign part-of-speech categories (POS tags)
        • To assign semantic features

  • Morphological Analysis: e.g.,

    • Original text
      – These probabilities are learned from the raw corpus in an unsupervised manner.

    • Lemmatized text
      – These probability+Plural be+Plural+Present learn+PP from the raw corpus in a+Vowel un_supervise+PP manner .

    • Part-of-speech categorization (POS tagging/annotation)
      – These/pron probability/noun+plural are/be_verb+plural+present learn/verb_ed/pp from/prep the/article raw/adjective corpus/noun in/prep an/article un/prefix_supervis/verb_ed/pp manner/noun ./period

  • Word Formation

    • Types of Languages
      – Inflectional
        • e.g., Latin
        • canta-bo
      – Analytic - derivation
        • e.g., English
        • I will sing
      – Agglutinative - concatenation
        • e.g., Korean, Japanese, Hindi, Turkish
        • na-neun norayha-keyssda
        • 나는 노래하-겠다

  • Complexity Comparison

    • Complexity?
      – In English, we only have to consider each word form.
      – In Chinese, the task is segmentation rather than morphological analysis.
      – In Korean, MA must process segmentation and agglutination simultaneously, so segmentation and the analysis of functional words are much more complex. (Japanese is similar to Korean.)

    Language | Spacing     | Order of verb forms per verb | Complexity of segmentation
    English  | Word form   | 5                            | Easy
    Korean   | Word phrase | More than 5000               | Very hard
    Chinese  | No          | 1                            | Very hard

  • Table of Contents

    • Introduction: Morphology and Characteristics

    • English, Korean, Chinese, Japanese
      – Complexity of Morphology
      – Types of Ambiguities

    • General Representation
      – Components of Morphological Analysis
      – Implementation Scheme
      – Unknown Words

  • General Morphological Analyzer: 2 phases

    • Candidate generation
      – Generates possible sequences of morphemes
        • Segmentation
        • Lemmatization - recovery of phonetic changes
        • Processing for unknown words

    • Candidate selection
      – "POS Tagging"
      – Methodologies
        • Statistical methods: e.g., given two or three consecutive words and their POS tags, we can predict the tag of the current word (Noun-Verb-Noun).
        • Rule-based methods
        • Hybrid methods

  • General Scheme of Implementation

    [Diagram: Morphological Analyzer and POS Tagger, with the following labelled components]

    • POS Tagger: Statistics, Linguistic Rules

    • Additional Components: Unknown words, Symbols/Numbers, Exceptional word phrase, Post-processing, Foreign words

    • Basic Components: Code System, Dictionary, Phonological Rules, Connection Rules, Algorithms and Data Structure, Parser

  • Implementation (2/3)

    • Morphological connection rules
      – idioms
      – patterns
      – weighted rules

    • Dictionaries (Lexicons)
      – general lexicon
        • level of segmentation depends on the lexicon
      – domain-specific dictionary
        • e.g., economics, law, patent (science & technology), ...
      – user-defined dictionary

  • Implementation (3/3): Diagram of word segmentation system for Chinese

    [Diagram, from (Keh-Jiann Chen, 2000). Box labels: Choose Lexicon or Lexicons (level of segmentation; General Lexicon; Domain Specific / User Defined Lexicon); Access Lexicon; Generate Substrings; Word Matching; Find word candidates; Morphological Rules, Idioms, Patterns; Unknown Words Identification (Unknown Word Models); Disambiguation (Heuristic Rules); POS-Tagging]

  • Corpora

    • A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language
      – David Crystal, A Dictionary of Linguistics and Phonetics

    • A collection of naturally occurring language text, chosen to characterize a state or variety of a language.
      – John Sinclair, Corpus, Concordance, Collocation

    • Text Corpora
      – Written Language
      – Spoken Language

    • Corpora
      – English: Brown, LOB, BNC and Penn TreeBank
      – Korean: KAIST Corpus and Sejong Corpus
      – Japanese: Nihon Keizai Shimbun, EDR Corpus
      – Chinese: Beijing University, LIVAC (HK City U.), Sinica corpus

    • Statistics can be obtained from a large Corpus.

  • Basic Components of Morphological Analyzer (1/2)

    • Dictionary
      – Lemma, Part-of-speech, Semantic Feature (optional)
      – Structure

    • Connection Rules
      – Connection Table (POS bigram)
      – Word Formation Graph (n-gram)

    • Phonological Rules
      – Declarative Rules
        • 2-level model
      – Procedural Rules

  • Basic Components of Morphological Analyzer (2/2)

    • Algorithms and Data Structures
      – Top-Down and Bottom-Up
        • Chart Parsing, Recursive Parsing
      – Data Structure
        • Chart, Table, Tree, Lattice, Graphs

    • Disambiguation (Tagging phase)
      – Dictionary Searching
        • All words in dictionary + rules for unknown words
        • Stem dictionary + rule = Derived dictionary
      – Stochastic Model
      – Rule-based Model
      – Hybrid Model, Integrated Model
      – Simple Heuristics
        • preferring the longest morpheme

  • Word Segmentation in Chinese - A Heuristic -

    • Maximal matching rule
      – The most plausible segmentation is the three-word sequence with the maximal length.
      – This heuristic rule achieves accuracy as high as 99.69% and a high applicability of 93.21%, i.e., 93.21% of the ambiguities were resolved by this rule. [Chen & Liu 1992]
        • 完 成 鑑定
        • 完成 鑑定 報
        • 完成 鑑定 報告 (finish judgment report)
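The maximal matching rule can be sketched as follows. This is a minimal illustration, not the implementation of [Chen & Liu 1992]: the lexicon is a toy one built from the slide's example, the names `LEXICON`, `words_at`, `chunks_from`, and `segment` are hypothetical, and ties between equally long chunks are broken arbitrarily here, whereas a real system would apply further rules.

```python
# Toy lexicon built from the slide's example; a real system would load a full dictionary.
LEXICON = {"完", "成", "完成", "鑑定", "報", "報告"}
MAX_WORD_LEN = 2  # longest entry in the toy lexicon


def words_at(text, pos):
    """Return all lexicon words that start at position pos."""
    return [text[pos:pos + n]
            for n in range(1, MAX_WORD_LEN + 1)
            if pos + n <= len(text) and text[pos:pos + n] in LEXICON]


def chunks_from(text, pos, chunk=()):
    """Enumerate chunks of up to three consecutive lexicon words starting at pos."""
    candidates = words_at(text, pos) if len(chunk) < 3 else []
    if not candidates:
        return [chunk] if chunk else []
    found = []
    for word in candidates:
        found.extend(chunks_from(text, pos + len(word), chunk + (word,)))
    return found


def segment(text):
    """Repeatedly emit the first word of the longest three-word chunk."""
    result, pos = [], 0
    while pos < len(text):
        chunks = chunks_from(text, pos)
        if chunks:
            best = max(chunks, key=lambda c: sum(len(w) for w in c))
            word = best[0]
        else:
            word = text[pos]  # unknown character: emit it alone
        result.append(word)
        pos += len(word)
    return result


print(segment("完成鑑定報告"))  # ['完成', '鑑定', '報告']
```

At the first position the candidate chunks are exactly the three listed on the slide, with lengths 4, 5, and 6 characters; the longest one wins, so 完成 is emitted and the process repeats from 鑑定.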

  • Additional Components of Analyzer

    • Unknown Word Analysis
      – Suffix, endings and preposition information

    • Symbol / Number expression
      – Templates
        Ex) “2005年 6月 20日” (2005/6/20), “3千 4百圓” ($3,400), “+82-42-869-5565” (telephone number)

    • Foreign Languages
      – TV, [컴퓨터 kom-pyu-teo] (computer)

    • Exceptional Word Phrase
      – Word Phrase Dictionary

    • Post-Processing
      – Spacing Problem
      – Processing Non-Standard language

  • Characteristics of different types of unknown words

    • Possibly infinite number of elements but with closed form representations, such as numeric type compounds:
      – 2005年 = 2005-year

    • Open-ended types without closed form representations, such as
      – proper names: "Microsoft"
      – derived words: "computer-ize"
      – compounds: "computer desk"
      – abbreviation (acronym): "LG" (Lucky-Goldstar)

    from ACL [Chen, 2000]

  • Proper Names with Transliteration

    • Personal
      – Bill Clinton (bil klintən, 克林頓)
      – Jiang Zemin (jiang z'əmin, 江澤民)

    • Geographical
      – Korea (koria, 韓國)
      – Cambodia (kambodia, カンボジア)

    • Brands/Organization
      – Panasonic (panasonik, 파나소닉, 樂聲)
      – Samsung (samsəŋ, 삼성, 三星, サムソン)

  • Numeric Expression Representation - Regular Expression -

    • To represent the type of unknown words with possibly infinite number of elements but with a closed form representation,
      – such as numbers, dates, times, determinant-measure compounds, etc.

    • e.g.,
      – Number → Digit Number | Digit
      – Digit → 0 | 1 | 2 | ... | 9
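The grammar above describes one or more digits, which as a regular expression is simply `[0-9]+`. The sketch below shows that pattern and, as an assumption of how a date template might be built on top of it, a pattern for expressions like "2005年 6月 20日". The names `NUMBER`, `DATE`, `is_number`, and `parse_date` are illustrative only.

```python
import re

# Number → Digit Number | Digit  is equivalent to one or more ASCII digits.
NUMBER = r"[0-9]+"

# Hypothetical date template: <number>年 <number>月 <number>日, spaces optional.
DATE = re.compile(rf"({NUMBER})年\s*({NUMBER})月\s*({NUMBER})日")


def is_number(token):
    """True if the whole token is derivable from the Number rule."""
    return re.fullmatch(NUMBER, token) is not None


def parse_date(text):
    """Return (year, month, day) for the first date template match, else None."""
    match = DATE.search(text)
    return tuple(int(g) for g in match.groups()) if match else None


print(is_number("2005"))              # True
print(is_number("20a5"))              # False
print(parse_date("2005年 6月 20日"))  # (2005, 6, 20)
```

`[0-9]` is used rather than `\d` because, for Python `str` patterns, `\d` also matches non-ASCII decimal digits, which the Digit rule on the slide does not list.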

  • Methodologies

    • Tabular Parsing
      – Segmentation by dictionary look-up

    • Syllable information-based model
      – Segmentation by syllable information

    • Multi-phase filtering
    • Brute-force
    • Etc.
      – Head-tail: simple model for segmentation

  • POS Tagging: Statistical approaches

    • Statistical Models
      – HMM-based
        • HMM on morpheme sequences (bi-gram, tri-gram)
        • HMM using word phrase structure (HMM using intra-/inter-word-phrase information)
      – Weighted Network
      – Maximum Entropy Model
        • More information can be integrated into the model

    • Pros
      – Easy to train, guarantees reasonable performance

    • Cons
      – Difficult to tune or modify
      – Requires more space

  • POS Tagging: other approaches

    • Rule-based approach
      – Transformation Rules (Eric Brill's style)
      – Pros and cons
        • Difficult to obtain rules and keep them consistent
        • Easy to lexicalize

    • Hybrid Approach
      – Statistical approach + rule-based approach
      – Pros and cons
        • Guarantees good performance
        • Difficult to integrate

  • Morphological Analysis of English

    • Affix Analysis: un+happy, be+er (X)

    • Additional Processing
      – Abbreviation
        Ex) I'd → I would
      – Symbols, Numerals and Units
      – Processing Idiomatic Expression
      – Unknown Word Processing

  • Part of speech Tagging (English)

    • Stochastic Part-of-Speech Tagging
      – Hidden Markov Model and Viterbi Algorithm
        • Using Markov assumption – efficient
      – A Categorization Problem: Machine Learning
        • Decision Tree, Neural Network, Markov Random Field, ...

    [Diagram: tag lattice for "Flies like a flower." Candidate tags per word:]
    • flies/N, flies/V
    • like/V, like/N, like/P
    • a/ART, a/N
    • flower/V, flower/N
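The lattice above can be searched with the Viterbi algorithm. The sketch below uses a bigram HMM over the slide's candidate tags; every number in `START`, `TRANS`, and `EMIT` is invented purely for illustration (the slides give no probabilities), so only the procedure, not the scores, should be taken from it.

```python
# All probabilities below are made-up illustrative values, not corpus estimates.
START = {"N": 0.6, "V": 0.2, "P": 0.05, "ART": 0.15}
TRANS = {  # TRANS[prev][cur]: score of tag cur following tag prev
    "N":   {"N": 0.2, "V": 0.5,  "P": 0.25, "ART": 0.05},
    "V":   {"N": 0.3, "V": 0.05, "P": 0.25, "ART": 0.4},
    "P":   {"N": 0.4, "V": 0.05, "P": 0.05, "ART": 0.5},
    "ART": {"N": 0.9, "V": 0.05, "P": 0.03, "ART": 0.02},
}
EMIT = {  # candidate tags per word, as in the slide's lattice, with toy scores
    "flies":  {"N": 0.5, "V": 0.5},
    "like":   {"V": 0.5, "N": 0.1, "P": 0.4},
    "a":      {"ART": 0.95, "N": 0.05},
    "flower": {"N": 0.9, "V": 0.1},
}


def viterbi(words):
    """Return the highest-scoring tag sequence through the lattice."""
    # best[tag] = (score of the best path ending in tag, that path)
    best = {tag: (START[tag] * p, [tag]) for tag, p in EMIT[words[0]].items()}
    for word in words[1:]:
        step = {}
        for tag, p_emit in EMIT[word].items():
            # Markov assumption: only the previous tag matters.
            prob, path = max(
                ((prev_prob * TRANS[prev][tag] * p_emit, prev_path)
                 for prev, (prev_prob, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (prob, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi(["flies", "like", "a", "flower"]))  # ['N', 'V', 'ART', 'N']
```

Because only the best path into each tag is kept at every step, the work grows linearly with sentence length rather than with the number of paths through the lattice.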

  • Part of speech Tagging (English)

    • Rule based Part-of-Speech Tagging
      – Transformation-Based [Brill95] [Brill92]
        • Using contextual information
        • if Noun X Noun, then X may be Verb

    • Hybrid Model
      – Integrating a stochastic tagger and rule-based system
      – stochastic: tri-gram
        • P(Noun Verb Noun) = 0.2
        • P(Noun Noun Noun) = 0.01

    • Integrated Model
      – Maximum Entropy Model
      – “Classifier Combination for Improved Lexical Disambiguation” [Brill 99]
        • Various models have complementary behaviors.

    [Diagram: transformation-based learning. Labels: Unannotated Text, Initial State, Annotated Text, Truth, Learner, Rules]
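A contextual transformation such as "if Noun X Noun, then X may be Verb" can be sketched as a rewrite over an initial tag sequence. This is a simplified illustration, not Brill's tagger: the function name `apply_rule` and its parameters are hypothetical, the rule is applied unconditionally rather than learned, and contexts are read from the input tags so that one rewrite does not affect the next.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag, next_tag):
    """Change from_tag to to_tag wherever it sits between prev_tag and next_tag."""
    out = list(tags)
    for i in range(1, len(tags) - 1):
        if tags[i] == from_tag and tags[i - 1] == prev_tag and tags[i + 1] == next_tag:
            out[i] = to_tag
    return out


# Hypothetical initial state: a baseline tagger labelled every word as Noun.
initial = ["Noun", "Noun", "Noun"]
print(apply_rule(initial, "Noun", "Verb", "Noun", "Noun"))  # ['Noun', 'Verb', 'Noun']
```

In transformation-based learning, the learner compares the annotated text with the truth and keeps the rules that remove the most errors; the sketch covers only the application step.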

  • Using word phrase Structure in Tagging

    • “An HMM POS tagger for Korean based on Wordphrase” (J. Shin 1994): simple model

    • “Two-ply HMM” (J. Kim 1997)
      – HMM variation using POS tags of head and tail of word phrase

    [Diagram: a chain of word phrases, each represented by a head (H) and tail (T) tag and delimited by $. Each word phrase has a conventional tagging HMM over morpheme / POS tags. Labels shown: Noun (N), Subjective (NS), Objective (NO), Noun Modifier (NM), Adverb (A), Verb connecting (PC), Verb final (PF); morpheme tags N, p, a, V, e]

  • HMM

    • Hidden Markov Model
      – What is the sequence of nodes?
        • transition probability
        • usually 1-, 2-, or 3-gram
      – What is the tag (or label) of the node? ("hidden")

  • Applications and Word Segmentation

    • Machine Translation

    • Spell checking and correction
      – Spell correction
      – Auto-spacing and spacing correction

    • Information Retrieval
      – Extracting index terms: nouns and noun phrases
      – Question Answering

    • Natural Language Interface
    • Text-To-Speech
    • Concordance

  • Application: TTS and IR - A Japanese Example -

    • Text to speech synthesis
      – Word segmentation of orthographic text
        • 試験|の|最中|に|映画|へ|行った
        • (I went to a movie during the examination)
      – Homograph disambiguation (grapheme to phoneme)
        • 最中 → saichu (during), monaka (Japanese sweets)
        • 行った → itta (went), okonatta (did)

    • Information retrieval (indexing)
      – Word segmentation of orthographic text
        • 試験|の|最中|に|映画|へ|行った
        • (I went to a movie during the examination)
      – Part of speech tagging, keyword extraction, stemming
        • 試験 (examination), 映画 (movie)

    from the slide (Nagata, ACL2000)

  • Evaluation of Morphological Analysis (Word Segmentation)

    • Correctness
      – Recall
      – Precision
        • In units of morphemes and word phrases

    • Processing Speed

    • Robustness
      – Processing erroneous input
        • Spacing
        • Spelling errors

    • Effectiveness of Result
      – Evaluation of Tag set

  • Recall and Precision

    • Recall = the fraction of the whole correct (reference) set that the system hit

    • Precision = the fraction of the system-generated set that is correct
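For word segmentation, these two measures are commonly computed over word spans: a system word counts as a hit only if both of its boundaries agree with the reference. The sketch below assumes that convention; the function names `spans` and `precision_recall` are illustrative.

```python
def spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def precision_recall(system, reference):
    """Return (precision, recall) of the system segmentation against the reference."""
    sys_spans, ref_spans = spans(system), spans(reference)
    hits = len(sys_spans & ref_spans)
    return hits / len(sys_spans), hits / len(ref_spans)


reference = ["完成", "鑑定", "報告"]
system = ["完", "成", "鑑定", "報告"]
precision, recall = precision_recall(system, reference)
print(precision)  # 0.5: 2 of the 4 system words are correct
print(recall)     # 2 of the 3 reference words were found
```

Here the system over-segments 完成, so it produces more words than the reference and precision drops further than recall.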

  • Question

    • A common tagset for word segmentation for CJK

  • Resources and Tools

    • Corpora and Tagsets for morphological analysis
      – Corpus
        • Contest for standardization of POS tagset.

    • Visualization Tool
      – To provide a visible process
      – Understandable to the user and easy to debug for the developer

  • Korean Specification for Word Segmentation

    Key-Sun CHOI, Dosam HWANG

  • Morphology: how to describe Korean language

    • Surface and spacing unit: (Korean)
      – na+neun norayha+keyssda
      – 나는 노래하겠다.

    • Grammar:
      – I+subjective sing+will (future & intention)
      – I+subjective postposition sing+future&intention ending

    • Meaning: (English)
      – I will sing

  • Grammatical morphemes (functional words)

    • Postposition
      – to express a case feature (or a functional role)
        • subjective case, objective case, etc.
        • e.g., I sing a song. (I = subjective, a song = objective)
      – to represent semantic roles of constituents
      – typically Noun + postposition

    • Ending
      – to represent features like tenses, aspects, moods and voices
      – to derive relative clauses
      – typically Verb or Adjective + endings for its auxiliary verbs or sentence connectives
      – E.g.: I “a song” sing+will (I will sing a song.)
        • 나는 노래를 부른다
        • 私は 歌を 歌う。
        • 我。。。

  • Code System for Implementation (Korean)

    • Standard (KSC5601, KSC5700, KSX001)
      – Syllable-based
      – 2-byte code
      – including common Chinese characters

    • Combinative Code (one Korean character = CV(C))
      – Grapheme-based
      – 2-byte code

    • Other internal code systems for MA
      – N-byte code
      – 3-byte code
      – 2-3 variable byte
      – Symbol code (Romanization style)

    [Diagram: 2-byte combinative code layout (1 byte + 1 byte): MSB (1 bit) | Initial consonant (5 bits) | Vowel (5 bits) | Final consonant (5 bits)]

    [Diagram: language units: word phrase, morpheme, syllable, grapheme]
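The combinative layout in the diagram (1 flag bit plus three 5-bit fields) fits exactly into 16 bits. The sketch below shows the bit packing only; the field index values passed in are hypothetical placeholders, not the code tables of any particular standard, and the names `pack` and `unpack` are illustrative.

```python
def pack(initial, vowel, final):
    """Pack three 5-bit grapheme indices into a 16-bit code with the MSB set."""
    for value in (initial, vowel, final):
        if not 0 <= value < 32:
            raise ValueError("each field must fit in 5 bits")
    return (1 << 15) | (initial << 10) | (vowel << 5) | final


def unpack(code):
    """Recover (initial, vowel, final) from a 16-bit combinative code."""
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


code = pack(2, 3, 1)   # hypothetical indices for one CV(C) syllable
print(hex(code))       # 0x8861
print(unpack(code))    # (2, 3, 1)
```

Because each grapheme keeps its own field, a morphological analyzer can test or replace a final consonant with a mask and shift, without a lookup table of whole syllables.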

  • Example of Grammatical Morpheme 1/2

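    • Grammatical morpheme in a predicative word phrase (word phrase = a segmentation unit)
      – "(someone) went but"
      – Surface syllables: ga syeoss ji man (가 셨 지 만)

    Morpheme | Romanization | Function
    가       | ga           | "go"
    시       | si           | honorific
    었       | eoss         | past tense
    지       | ji           | declarative mood
    만       | man          | concessive/limitation

    [Diagram labels: Verb, AUX Particle, Ending, Postposition. The verb is the predicate, the remaining morphemes are grammatical morphemes, and together they form a predicative word phrase.]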

  • Example of Grammatical Morpheme 2/2

    • Grammatical morpheme in a substantive word phrase
      – "(It should be prohibited) from home, even (others cannot do)"
      – Surface: jib eseo buteo man eun (집 에서 부터 만 은)

    Morpheme | Romanization | Function
    집       | jib          | "house"
    에서     | eseo         | place
    부터     | buteo        | origin
    만       | man          | concessive
    은       | eun          | topical marker

    [Diagram labels: Noun, Adverbial postposition, postposition, postposition. The noun and its grammatical morphemes together form a substantive word phrase.]

  • Language units

    • Grapheme
      – Alphabet, a minimal unit
        • consonant: (e.g.,) b, c, d, f, ...
        • vowel: (e.g.,) a, e, i, o, u, ...

    • Syllable
      – 2 or 3 graphemes form a syllable
      – e.g., ga, sun, wa, ... (가, 선, 와)

    • Word phrase (or word)
      – Segmentation unit (or spacing unit)
      – it consists of one or more morphemes

    [Diagram: language units: word phrase, morpheme, syllable, grapheme]

  • Types of Ambiguities - Homonymy -

    • English examples:
      – lead (metal), lead (past tense is "led")
      – swallow (bird), swallow (through mouth)
      – Bill (name), bill (for payment)

    • What is the part-of-speech of the word?

  • Types of Ambiguities in Word Phrase 1/2

    • Types of ambiguity in word phrase analysis
      – Ambiguity in segmentation + POS tag
        • 가면 (gamyeon)
          – 가면/N: "mask"
          – 가/V + 면/e: "go" + assumption ("if go")
      – Ambiguity in POS tag
        • 감을 (gameul)
          – 감/N + 을/p: "persimmon" + objective
          – 감/V + 을/e
        • goto (English simulation)
          – go (N or V)
      – Ambiguity in lemmatization
        • 도와 (dowa)
          – 도/N + 와/p: "way" + "and"
          – dob/V + a/e (surface 도 + 와): "help"

    Tag legend: N = Noun, V = Verb, p = postposition, e = Ending

    [Other glosses shown in the diagram: go (noun form), objective, with]

  • Types of Ambiguities in Word Phrase 2/2

    • Mixed type
      – 가시는 (ga-si-neun)
        • ga/V + -si-/f + -neun/e: "go" + honorific + ending
        • gal/V + -si-/f + -neun/e
        • gasi/V + -neun/e: "disappear" + ending
        • gasi/N + -neun/p: "thorn" + subjective

    Tag legend: N = Noun, V = Verb, e = Ending, f = Prefinal ending, p = Postposition

  • Types of Ambiguities in Word Phrase 1/2

    • Types of ambiguity in word phrase analysis
      – Ambiguity in segmentation + POS tag
        • gamyeon: gamyeon/N; ga/V + -myeon/e
      – Ambiguity in POS tag
        • gameul: gam/N + -eul/p; gam/V + -eul/e
        • goto (English simulation): go (N or V)
      – Ambiguity in lemmatization
        • dowa: do/N + -wa/p; dob/V + -a/e

    Tag legend: N = Noun, V = Verb, p = postposition, e = Ending

  • Types of Ambiguities in Word Phrase 2/2

    • Mixed type
      – ga-si-neun
        • ga/V + -si-/f + -neun/e
        • gal/V + -si-/f + -neun/e
        • gasi/V + -neun/e
        • gasi/N + -neun/p

    Tag legend: N = Noun, V = Verb, e = Ending, f = Prefinal ending, p = Postposition

  • Metrics of Ambiguity

    • Considering ambiguities
      – Average ambiguities per word phrase: 3.5
      – For 1-syllable morphemes: 2.8 ambiguities on average
      – For 2-syllable morphemes: 1.13 ambiguities on average
      * 15 morphemes randomly selected from the 10,000 most frequent morphemes

    • Considering phonetic transformation rules
      – About 20 rules are needed.
        • More than 50 sub-rules (considering contextual information)

    • Complexity of n-syllable word phrase segmentations
      – At least 2^(n-1)
        • 동서남북 [東西南北/dong-seo-nam-buk]: east-west-south-north
        • e.g., n=4 (Korean example) 東 西 南 北, giving 2^3 = 8 segmentations:
          동서남북, 동|서남북, 동서|남북, 동서남|북, 동|서|남북, 동|서남|북, 동서|남|북, 동|서|남|북
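The 2^(n-1) figure follows from the fact that each of the n-1 gaps between syllables is independently either a boundary or not. A short sketch that enumerates the segmentations this way (the function name `segmentations` is illustrative):

```python
from itertools import product


def segmentations(syllables):
    """Return every way to split the string at the gaps between its characters."""
    if not syllables:
        return [[]]
    results = []
    # One boolean per gap: True means "cut here".
    for cuts in product([False, True], repeat=len(syllables) - 1):
        parts, start = [], 0
        for gap, cut in enumerate(cuts, start=1):
            if cut:
                parts.append(syllables[start:gap])
                start = gap
        parts.append(syllables[start:])
        results.append(parts)
    return results


all_splits = segmentations("동서남북")
print(len(all_splits))  # 8
```

This counts segmentations only. The slide says "at least" because each segment can additionally carry several POS tags or underlying forms.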

  • Typical Ambiguities of Word Phrase Analysis

    • Categorial Disambiguation
      – e.g., gam+eun (Korean example)
        • Noun + Particle
        • Verb stem + Ending

    • Segmentation Disambiguation
      – e.g., gamgineun
        • Noun + Particle = gamgi+neun
        • Verb stem + Ending + Particle = gam+gi+neun

    • Stem Identification
      – e.g., naneun
        • Noun + particle = na+neun
        • Verb stem + ending = nal+neun (with verbal ending change)
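The segmentation ambiguity of gamgineun can be reproduced with a dictionary look-up plus a POS-bigram connection table. The sketch below is a toy candidate generator under stated assumptions: `LEXICON` and `ALLOWED` contain only the entries needed for the slide's example, the romanized forms stand in for Hangul, and phonological changes (as in nal+neun → naneun) are not handled.

```python
# Toy lexicon and connection table, limited to the slide's gamgineun example.
LEXICON = {
    "gamgi": ["Noun"],
    "gam": ["Verb"],
    "gi": ["Ending"],
    "neun": ["Particle"],
}
ALLOWED = {("Noun", "Particle"), ("Verb", "Ending"), ("Ending", "Particle")}


def analyses(phrase, prev_tag=None):
    """Return all morpheme/tag sequences that cover phrase and obey ALLOWED."""
    if not phrase:
        return [[]]
    results = []
    for end in range(1, len(phrase) + 1):
        form = phrase[:end]
        for tag in LEXICON.get(form, []):
            # Reject tag pairs that the connection table does not license.
            if prev_tag is not None and (prev_tag, tag) not in ALLOWED:
                continue
            for rest in analyses(phrase[end:], tag):
                results.append([(form, tag)] + rest)
    return results


for candidate in analyses("gamgineun"):
    print(candidate)
```

Both analyses from the slide come out as candidates; choosing between them is the job of the candidate selection (tagging) phase.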