Coreference Resolution
Intelligent Software Lab.
Cheon Eum Park, Kyoung Ho Choi
Index
• A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task (CoNLL 2011, first place)
• System Architecture
• Core Sieve
• Deterministic Coreference Resolution Based
• Example
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• An earlier version of the work presented here; it proposes the multi-pass sieve methodology.
• Corpora
• Evaluation Metrics
  • Pairwise F1 (Ghosh, 2003): computed by comparing mention pairs within the same entity cluster
  • MUC (Vilain et al., 1995): compares predicted clusters & gold clusters
  • B³ (Amit and Baldwin, 1998): 1. for a given mention, the overlap between its predicted and gold clusters is used to score that mention; 2. precision and recall are then computed over the predicted and gold clusters.
Corpora | Documents | Mentions | Notes
ACE2004-ROTH-DEV | 68 | 4,536 | used for development in this work
ACE2004-CULOTTA-TEST | 107 | 5,469 | used for testing
ACE2004-NWIRE | 128 | 11,413 | used for testing
MUC6-TEST | 30 | 2,068 | used for testing
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Mention
  • A real-world entity in the source text.
  • A single node in the parse tree (= an NP).
  • Mapped as r ∈ R : w_r, where the key is a property and the value is a word.
  • Comes in three basic forms:
    • proper (NAM): proper noun
    • nominal (NOM): common noun
    • pronominal (PRO): pronoun
  • Set of properties R: the set of head and modifiers.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Mention Processing
  • For each mention m_i:
    ① mention m_i is narrowed down to a single solution through the models.
    ② second: the best antecedent is selected (from m_1, ..., m_{i-1}).
    ③ candidate antecedents are ordered with the Stanford parser according to the following rules.
• Syntactic Information (Stanford parser ordering rules)
  • Same Sentence
  • Previous Sentence
  • Attribute Sharing
  • Mention Selection
  • Search Pruning
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Mention Processing• Same Sentence
  • Candidates in the same sentence are ordered by a left-to-right breadth-first search of the syntactic tree.
  • Advantages:
    • Candidates close to the beginning of the sentence come first.
    • Plausible antecedents are found.
• Previous Sentence• Attribute Sharing• Mention Selection• Search Pruning
Richard Levin, the Chancellor of this prestigious university will head the Globalization Studies Center.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Mention Processing• Same Sentence• Previous Sentence
  1) Nominal mentions: right-to-left BFS (= preserves the syntactic ordering while favoring proximity within the document)
  2) Pronominal mentions: left-to-right BFS (= searches for salient subjects)  Point!!!
In this framework, each model obtains clustering information for each mention from the earlier coreference models of the multi-pass system:
  C_j = {m_1^j, ..., m_k^j}, m_i ∈ C_j
The following methods use this cluster information: • Attribute Sharing • Mention Selection • Search Pruning
Because antecedents of pronouns found this way are highly plausible.
[pepsi] says it expects to double [quaker]'s snack food growth rate. after a month-long courtship, [they] agreed to buy quaker oats···
EX) pronominal mention
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Mention Processing• Same Sentence• Previous Sentence• Attribute Sharing
  • Pronominal coreference resolution is strongly affected by missing and incorrect attributes.
    • missing attribute: an incorrect antecedent is selected because of faulty information, causing a precision error.
    • incorrect attribute: the correct link is not created because the mention and the antecedent mismatch, causing a recall error.
  • Solution
    ① merge all mention attributes such as number, gender, and animacy within a given cluster.
    ② share the merged result across all mentions in the cluster.
    ③ if the attributes of different mentions are uncertain, keep them mutually underspecified (e.g. unknown).
  • ex) attribute merging through mention sharing
    a group of students → singular
    five students → plural
    → { Singular, Plural }
• Mention Selection• Search Pruning
∴ As above, the singular/plural values of the antecedent cluster for a pronoun can be merged.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Mention Processing• Same Sentence• Previous Sentence• Attribute Sharing• Mention Selection
  • Attempting to resolve every mention in the text increases the likelihood of errors.
  • In this work, mentions are additionally sorted in textual order within the clusters built by the previous stages.
  • Only the first mention in each cluster is resolved, using the cluster information obtained from the previous models.
  • ex)
    mentions = {m_1^1, m_2^2, m_3^2, m_4^3, m_5^1, m_6^2} ⇒ only m_2^2 and m_4^3 are attempted
  • This has a two-fold benefit.
    • first: the mentions that start a cluster are better defined than the later nominal or pronominal mentions.
    • second: the first mention of a cluster appears near the beginning of the document,
    so there are fewer candidate antecedents to choose from, and fewer mistakes are made.
  (A small sketch of this selection step follows below.)
• Search Pruning
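The first-mention-per-cluster idea can be made concrete with a small sketch. This is not the authors' code; cluster ids are simply given as a Python list in textual order, and the document-initial mention is skipped because it has no candidate antecedents.

    # Minimal sketch: attempt resolution only for the first mention of each cluster.
    def mentions_to_resolve(cluster_ids):
        seen = set()
        selected = []
        for pos, cid in enumerate(cluster_ids, start=1):
            first_of_cluster = cid not in seen
            seen.add(cid)
            # the very first mention of the document has nothing before it
            if first_of_cluster and pos > 1:
                selected.append(pos)
        return selected

    # Slide example m_1^1, m_2^2, m_3^2, m_4^3, m_5^1, m_6^2: only m_2 and m_4 are resolved.
    print(mentions_to_resolve([1, 2, 2, 3, 1, 2]))  # [2, 4]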
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• Mention Processing• Same Sentence• Previous Sentence• Attribute Sharing• Mention Selection• Search Pruning
• The search is pruned using discourse salience.
• Coreference resolution is disabled for cluster-initial mentions that:
  • start with an indefinite pronoun (some, other), or
  • start with an indefinite article (a, an).
• Exception
  • Such mentions are linked only to mentions whose extents match exactly.
  • ex)
thinks it is the perfect place for [a sports bar] and has put up ··· the property as a location for [a sports bar], using the Broncos' name and color ···
• Within the search-pruning step, this exception fires for all nominal mentions regardless of discourse salience, because such underspecified mentions can be repeated within a document.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 1. – Exact Match
  • Two mentions are linked when they contain exactly the same extent text, including determiners and articles.
  • ex) antecedent mention: the Shahab 3 ground-ground missile
        current mention: the Shahab 3 ground-ground missile
  (A minimal sketch of this pass follows below.)
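The sketch below assumes mentions are simple (id, text) pairs and clusters a dict from mention id to cluster id; these structures are illustrative, not the Stanford implementation.

    def exact_match_sieve(mentions, clusters):
        # link each mention to the closest earlier mention with an identical extent text
        for i, (mid, text) in enumerate(mentions):
            for ant_id, ant_text in reversed(mentions[:i]):
                if text.lower() == ant_text.lower():
                    clusters[mid] = clusters[ant_id]
                    break
        return clusters

    mentions = [(1, "the Shahab 3 ground-ground missile"),
                (2, "a test"),
                (3, "the Shahab 3 ground-ground missile")]
    print(exact_match_sieve(mentions, {1: 1, 2: 2, 3: 3}))  # {1: 1, 2: 2, 3: 1}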
• Pass 2. – Precise Constructs
  • Two mentions are linked according to the following rules:
  • Appositive – two nominal mentions in an appositive construction are linked.
    • ex) [Israel's Deputy Defense Minister], [Ephraim Sneh], said ···
  • Predicate nominative – two mentions (nominal or pronominal) connected through a copular subject–predicate relation are linked.
    • ex) [The New York-based College Board] is [a nonprofit organization that administers the SATs and promotes higher education]
  • Role appositive – triggered only when the mention's NER label is person, and subject to the following rules:
    • the mention is labeled as a person,
    • the antecedent is animate,
    • the antecedent's gender is not neutral.
• ex) [[actress] Rebecca Schaeffer].
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 2. – Precise Constructs
• Two mentions are linked according to the following rules (continued).
  • Relative pronoun – the mention is a relative pronoun that modifies the head of the antecedent NP.
    • ex) [the finance street [which] has already formed in the Waitan district.]
  • Acronym – a mention tagged NNP and its acronym are linked.
    • ex) [Agence France Presse] . . . [AFP] are linked.
  • Demonym – the mentions are linked when one is a demonym of the other.
    • ex) [Israel] . . . [Israeli].
  • The demonym rule improves pairwise recall by about 5%.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 3. – Strict Head Matching
• Mention linking based on naïve head matching alone is risky, because incompatible modifiers are ignored.
  • ex) Yale University and Harvard University: the head words are the same, but the two are different entities.
• Strict Head rules
  • Cluster head match – the mention head word must match some head word in the antecedent cluster.
    • ex) [U.S. President Barack Obama] has attacked the North. because the North provokes that [he] led the country.
      - antecedent cluster head words: [Barack, Obama] - mention head word: [he]
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 3. – Strict Head Matching
• Strict Head rules
  • Word inclusion – mentions of the same entity tend to lose information and become shorter as the discourse progresses.
    • ex) (a) intervene in the [Florida Supreme Court]'s move . . . does look like very dramatic change made by [the Florida court] (same entity, linked)
          (b) The pilot had confirmed ··· he had turned onto [the correct runway] but pilots behind him say he turned onto [the wrong runway]. (different entities, must not be linked)
  • Compatible modifiers only – all modifiers of the mention must be included among the modifiers of the antecedent candidate.
    • This feature models the same discourse property as the previous one.
    • Only nouns and adjectives are used as modifier features.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 3. – Strict Head Matching
• Strict Head rules
  • Not i-within-i – neither of the two mentions may be a child NP within the other's NP constituent.
    • ex) [her husband]: her husband != her
• Pass 4 and 5. – Variants of Strict Head
  • Pass 4: Pass 3 without the compatible-modifiers-only feature
  • Pass 5: Pass 3 without the word-inclusion rule
  • Pass 4, 5 pairwise F1: +1.7%
  (A sketch of these rules follows below.)
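The sketch below illustrates the strict-head rules of Pass 3 and the relaxed variants of Passes 4 and 5, assuming each mention and antecedent cluster is a small dict of precomputed heads, words, and modifiers; these helper structures are not part of the original system.

    def cluster_head_match(mention_head, antecedent_heads):
        # the mention head word must match some head word in the antecedent cluster
        return mention_head.lower() in {h.lower() for h in antecedent_heads}

    def word_inclusion(mention_words, antecedent_words,
                       stop_words=frozenset({"the", "a", "an", "this", "that"})):
        # all non-stop words of the mention must appear in the antecedent cluster
        content = {w.lower() for w in mention_words} - stop_words
        return content <= {w.lower() for w in antecedent_words}

    def compatible_modifiers_only(mention_modifiers, antecedent_modifiers):
        # every (noun/adjective) modifier of the mention must be in the antecedent's modifiers
        return {m.lower() for m in mention_modifiers} <= {m.lower() for m in antecedent_modifiers}

    def strict_head_pass(pass_id, mention, antecedent_cluster):
        """Pass 3 uses all constraints; Pass 4 drops compatible-modifiers; Pass 5 drops word-inclusion."""
        ok = cluster_head_match(mention["head"], antecedent_cluster["heads"])
        if pass_id != 5:
            ok = ok and word_inclusion(mention["words"], antecedent_cluster["words"])
        if pass_id != 4:
            ok = ok and compatible_modifiers_only(mention["modifiers"], antecedent_cluster["modifiers"])
        return ok

    mention = {"head": "court", "words": ["the", "Florida", "court"], "modifiers": ["Florida"]}
    antecedent = {"heads": ["Court"], "words": ["Florida", "Supreme", "Court"],
                  "modifiers": ["Florida", "Supreme"]}
    print(strict_head_pass(3, mention, antecedent))  # True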
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 6. – Relaxed Head Matching
• The cluster head match heuristic is relaxed: the mention head may match any word in the antecedent candidate's cluster, not just its head words.
• Conditions
  • both the mention and the antecedent must be labeled as named entities.
  • their types must match (ORG, LOC, PER, etc.).
• ex) heuristic matching
  • Sanders → {Sauls, the judge, Circuit Judge N. Sanders Sauls}
• The word-inclusion and not-i-within-i features are also enforced.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 7. – Pronouns
• Rules for pronominal coreference resolution
  • Number – assigned using the following:
    • a static list for pronouns (e.g. they, that, this, it, wh-pronouns, personal pronouns, etc.)
    • NER labels: a named-entity mention is marked singular (except organizations, which can be singular or plural)
    • POS tags: NN*S tags = plural, other NN* tags = singular
    • a static dictionary
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 7. – Pronouns
• Rules for pronominal coreference resolution
  • Gender
    • the gender attribute of the mention
  • Person
    • the person attribute is assigned to pronouns.
    • It is not enforced unconditionally, because one of the two pronominal mentions may appear inside a quotation.
• ex)“[I] voted my conscience” [she] said.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 7. – Pronouns
• Rules for pronominal coreference resolution
  • Animacy
    • a static list for pronouns
    • NER labels • ex) PERSON = animate, LOCATION = not animate
    • a dictionary bootstrapped from the web.
    • NER labels (from the Stanford NER)
    • if an attribute cannot be determined, it is set to unknown.
A Multi-Pass Sieve for Coreference Resolution (Raghunathan et al. 2010)
• The Modules of the Multi-Pass Sieve• Pass 7. – Pronouns
• Applying Pass 7 raises F1 by up to 20%. (A sketch of the attribute-agreement check follows below.)
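A rough sketch of the attribute-agreement check behind Pass 7. The 'unknown never blocks a link' behaviour follows the description above, but the dictionaries and field names are assumptions for illustration.

    PRONOUN_ATTRS = {
        "he":   {"number": "singular", "gender": "male",    "animacy": "animate"},
        "she":  {"number": "singular", "gender": "female",  "animacy": "animate"},
        "it":   {"number": "singular", "gender": "neutral", "animacy": "inanimate"},
        "they": {"number": "plural",   "gender": "unknown", "animacy": "unknown"},
    }

    def compatible(value, cluster_values):
        # an attribute agrees when either side is unknown/empty or the values intersect
        if value == "unknown" or not cluster_values:
            return True
        return value in cluster_values or "unknown" in cluster_values

    def pronoun_agrees(pronoun, cluster_attrs):
        attrs = PRONOUN_ATTRS[pronoun.lower()]
        return all(compatible(attrs[k], cluster_attrs.get(k, set()))
                   for k in ("number", "gender", "animacy"))

    # cluster built by the earlier sieves, e.g. {U.S. President Barack Obama, Barack Obama}
    obama_cluster = {"number": {"singular"}, "gender": {"male"}, "animacy": {"animate"}}
    print(pronoun_agrees("he", obama_cluster))   # True
    print(pronoun_agrees("it", obama_cluster))   # False (gender/animacy conflict)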
System Architecture
1. Mention Detection
  • Extracts mentions and related information, in preparation for the next stage.
  • ex) gender, number
2. Coreference Resolution
  • Performs coreference resolution over the identified mentions.
  • Sieves are ordered by decreasing precision:
    • first sieve (highest precision): requires an exact string match between a mention and its antecedent.
    • last sieve (lowest precision): performs pronominal coreference resolution.
3. Post-processing
  • Adjusts the output to satisfy the task constraints.
(A sketch of this pipeline follows below.)
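The overall control flow can be sketched as a pipeline of sieve functions applied in order of decreasing precision; the sieve bodies below are placeholders, not the real implementation.

    def exact_match_sieve(mentions, clusters):         # highest precision
        return clusters
    def precise_constructs_sieve(mentions, clusters):
        return clusters
    def strict_head_sieve(mentions, clusters):
        return clusters
    def pronoun_sieve(mentions, clusters):             # lowest precision
        return clusters

    SIEVES = [exact_match_sieve, precise_constructs_sieve,
              strict_head_sieve, pronoun_sieve]         # decreasing precision

    def resolve(mentions):
        clusters = {m["id"]: m["id"] for m in mentions}  # every mention starts as a singleton
        for sieve in SIEVES:
            clusters = sieve(mentions, clusters)         # each pass only adds links to the previous result
        return clusters

    print(resolve([{"id": 1}, {"id": 2}]))  # {1: 1, 2: 2}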
Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task
(CoNLL 2011, first place)
System Architecture
1. Mention Detection Sieves
2. Mention processing
3. Coreference Resolution Sieves
4. Post Processing
System Architecture
1. Mention Detection Sieves
  • The goal is to maximize recall.
    • Mention recall has a decisive effect on the final score.
    • However, spurious mentions end up as singletons and are removed in post-processing, so they do not hurt the overall score.
    • Experiments showed high mention precision but low mention recall, so the mention detection algorithms emphasize mention recall over mention precision.
  • Resources used by each sieve:
    • syntactic parse trees
    • identified named entities
    • a few manually written heuristic patterns
    • the OntoNotes specification.
System Architecture
1. Mention Detection Sieves
  • first sieve (= highest recall sieve)
    • marks all noun phrases (NPs)
    • marks possessive pronouns
    • marks the named entities in each sentence as candidate mentions
  • following sieves (remove candidates that satisfy the conditions below):
Following sieve | Example
If heads are identical, keep the larger mention | The five insurance companies approved to be established this time.
Discard percents, money, cardinals, quantities | 9%, $10,000, Tens of thousands, 100 miles
Remove partitive and quantifier expressions | a total of 177 projects
Remove pleonastic "It" and filler expressions | It is possible that
Remove adjectival forms of place names | American
Remove stop words | there, ltd., hmm
(A rough sketch of these filters follows below.)
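A rough sketch of a few of the filters in the table above; the regular expression and the word lists are assumptions, not the shared-task configuration.

    import re

    STOP_MENTIONS = {"there", "ltd.", "hmm", "etc."}
    QUANTITY_RE = re.compile(r"^\s*(\d+(\.\d+)?\s*%|\$\s*[\d,]+|[\d,]+\s+(miles|dollars))\s*$", re.I)

    def keep_mention(candidate, all_candidates):
        text = candidate["text"].lower()
        if text in STOP_MENTIONS:                     # stop-word mentions
            return False
        if QUANTITY_RE.match(candidate["text"]):      # percents, money, quantities
            return False
        if text.startswith("a total of "):            # partitive / quantifier expressions
            return False
        for other in all_candidates:                  # same head: keep only the larger span
            if (other is not candidate and other["head"] == candidate["head"]
                    and candidate["start"] >= other["start"] and candidate["end"] <= other["end"]
                    and (other["end"] - other["start"]) > (candidate["end"] - candidate["start"])):
                return False
        return True

    candidates = [
        {"text": "The five insurance companies approved to be established this time",
         "head": "companies", "start": 0, "end": 10},
        {"text": "The five insurance companies", "head": "companies", "start": 0, "end": 4},
        {"text": "9%", "head": "%", "start": 12, "end": 13},
    ]
    print([c["text"] for c in candidates if keep_mention(c, candidates)])  # only the larger span survives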
System Architecture
1. Mention Detection Sieves
  • Be careful with appositive and copular relations!
    • OntoNotes annotates only the larger mention, but here they are handled as follows:
    • [[Yongkang Zhou], the general manager] or [Mr. Savoca] had been [a consultant···].
    • The nested mentions are kept as candidate mentions and used as features.
    • They are removed later during post-processing.
System Architecture
2. Mention processing
• Mentions are ordered by sentence number.
• Left-to-right breadth-first search.
• Only the first mention of each cluster is selected for resolution.
  (attributes are shared among coreferent mentions)
• Why?
  • The first mention serves as the reference point.
System Architecture
2. Mention processing
  • ex)
    • Given the ordered mentions = {m_1^1, m_2^2, m_3^2, m_4^3, m_5^1, m_6^2} (subscript: textual order, superscript: cluster ID)
  • First mentions that start with an indefinite pronoun (some, other, anything, etc.) or an indefinite article (a, an) are discarded.
    • No antecedent search is needed for them,
    • because such a mention starts a new entity
    • and is not a continuation of anything from the preceding sentences···
  • For each m_i, the candidate antecedents are m_{i-1}, ..., m_1; every sieve searches for a coreferent antecedent until one meets its criteria or the end of the list is reached.
  • When comparing two mentions:
    • information from the entire cluster is used instead of the local mentions alone.
    • mentions in a cluster share their attributes (number, gender, animacy, etc.).
  • ex)
    • a group of students = singular
    • five students = plural → the cluster number is determined from both (see the sketch below).
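A small sketch of the attribute union across a cluster, using the number example above; the field names are assumed for illustration.

    def cluster_number(mentions):
        # every mention contributes its value, and the cluster exposes the union
        values = {m["number"] for m in mentions}
        values.discard("unknown")        # unknown never constrains the cluster
        return values or {"unknown"}

    cluster = [{"text": "a group of students", "number": "singular"},
               {"text": "five students", "number": "plural"}]
    print(cluster_number(cluster))       # {'singular', 'plural'}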
Core Sieve
3. Coreference Resolution Sieves
  • Two sieves are added beyond Raghunathan et al.'s work:
    • they target nominal mentions
    • they are inserted into the pipeline according to their precision on the corpus
    • the existing sieves rely only on constraints without semantic information, so they serve as the baseline model.
Added sieves
Core Sieve
3. Coreference Resolution Sieves – Mentions with the same head
ex) [Clinton] and [Clinton, whose term ends in January.]
Two mentions are marked coreferent when they share the same head word and satisfy the constraints below.

Constraint | Description
Not i-within-i | ex) [his friend] >> "his" (he) and "friend" are different entities
No location mismatches | [Lebanon] and [southern Lebanon] are different
No numeric mismatches | a number in the mention must not be missing from the antecedent: [people] and [around 200 people] are not coreferent.
Pronoun distance | the sentence distance between a pronoun and its antecedent must not exceed 3.
Bare plurals | bare plurals are generic and cannot have a coreferent antecedent.
Core Sieve
3. Coreference Resolution Sieves – Semantic-Similarity Sieves
The two sieves introduced in this work draw semantic information from WordNet, Wikipedia infoboxes, and Freebase records (three knowledge bases).
Since the input to a sieve is the collection of mention clusters built by the previous sieves, a mention cluster has to be linked to a record in one of these three knowledge bases.
Core Sieve
3. Coreference Resolution Sieves – Semantic-Similarity Sieves – linking method
1. Select the representative mention of the cluster:
  • the first proper-noun mention is preferred,
  • then the first common-noun mention,
  • and among nominal mentions the one with the longest string is chosen (pronouns are the last resort).
ex) cluster = {President George W. Bush, president, he}; selected mention = President George W. Bush
2. If the knowledge base returns nothing, the query is relaxed according to the algorithm developed here.
Core Sieve
3. Coreference Resolution Sieves – Semantic-Similarity Sieves – linking method
2. If the knowledge base returns nothing, the query is relaxed according to the following algorithm:
  a. remove the words following the mention head word.
  b. select the lowest NP in the parse tree that includes the mention head word.
  c. use the longest proper-noun sequence that ends with the head word.
ex) query = president Bill Clinton, whose term ends in January.
  first: president Bill Clinton – (c)
  second: Bill Clinton – (b)
  finally: Clinton – (a)
1. Select the representative mention of the cluster (as on the previous slide).
(A sketch of this relaxation follows below.)
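A rough sketch of the query-relaxation order; the parse-tree step is approximated here by a capitalization heuristic, so this only illustrates the idea, not the original algorithm.

    def relaxed_queries(representative, head_word):
        tokens = [t.strip(".,") for t in representative.split()]
        head_idx = max(i for i, t in enumerate(tokens) if t == head_word)
        yield " ".join(tokens[:head_idx + 1])            # drop everything after the head word
        proper = [t for t in tokens[:head_idx + 1] if t[0].isupper()]
        if proper:
            yield " ".join(proper)                        # rough stand-in for the "proper nouns ending in the head" step
        yield head_word                                   # last resort: the head word alone

    for q in relaxed_queries("president Bill Clinton, whose term ends in January.", "Clinton"):
        print(q)
    # president Bill Clinton
    # Bill Clinton
    # Clinton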
Core Sieve
3. Coreference Resolution Sieves – Semantic-Similarity Sieves
• Mentions containing more than one proper noun are matched through their aliases.
  ex) America Online : AOL
• Two nominal mentions are marked as coreferent if they are linked by a WordNet lexical chain of hypernymy or synonymy relations.
• All synsets are used for each mention, but mentions must be within 3 sentences of each other and lexical chains are limited to length 4.
  ex) Britain : country, plane : aircraft
Core Sieve
3. Coreference Resolution Sieves – Discourse Processing Sieve
• This sieve matches speakers to compatible pronouns.
• It uses shallow discourse analysis to handle quotations and conversation transcripts.
• For non-conversational text, speakers are identified with the heuristics proposed in the paper: find a reporting verb in the same or a neighboring sentence and take its subject as the speaker.
Core Sieve
3. Coreference Resolution Sieves – Discourse Processing Sieve – heuristics
• <I>s assigned to the same speaker are coreferent.
• <you>s with the same speaker are coreferent.
• The speaker and <I>s within his or her text are coreferent.

Symbol | Definition
<I> | I, my, me, mine
<we> | first person plural pronouns
<you> | second person pronouns

Additional constraints
• The speaker and a mention within the speaker's utterance cannot be coreferent unless the mention is <I>.
• Two <I>s assigned to different speakers cannot be coreferent.
• Two different person pronouns by the same speaker cannot be coreferent (e.g. <my> and <he>).
• Nominal mentions cannot be coreferent with <I>, <you>, or <we> in the same turn or quotation.
• In conversations, <you> can only corefer with the previous speaker.
(A small sketch of these checks follows below.)
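Two of these heuristics can be sketched as follows; the mention fields ("text", "speaker") are assumptions for illustration, not the system's data model.

    FIRST_PERSON = {"i", "my", "me", "mine"}

    def speaker_link(pronoun, candidate):
        # an <I> inside a quote corefers with the mention of its identified speaker
        return pronoun["text"].lower() in FIRST_PERSON and candidate["text"] == pronoun["speaker"]

    def incompatible_speakers(m1, m2):
        # two first-person pronouns attributed to different speakers can never corefer
        return (m1["text"].lower() in FIRST_PERSON and m2["text"].lower() in FIRST_PERSON
                and m1["speaker"] != m2["speaker"])

    quote_i = {"text": "I", "speaker": "she"}
    she     = {"text": "she", "speaker": "narrator"}
    print(speaker_link(quote_i, she))                                            # True
    print(incompatible_speakers({"text": "I", "speaker": "John"},
                                {"text": "I", "speaker": "Mary"}))               # True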
Post Processing
• Remove singleton clusters.
• Discard the mention that appears later in the text in appositive or copular constructions.
ex) In [[Yongkang Zhou], the general manager] or [Mr. Savoca] had been [a consultant], the mentions Yongkang Zhou and a consultant are removed.
The architecture of the coreference system.
Deterministic Coreference Resolution Based
• Input sentence :
• Mention Detection
• Speaker Sieve
• String Match
• Relaxed String Match
• Precise Constructs
• Strict Head Match A
• Strict Head Match B, C
• Proper Head Noun Match
• Relaxed head Match
• Pronoun Match
• Post Processing
John is a musician. He played a new song. A girl was listening to the song. “It is my favorite,” John said to her.
- Example
Example
• Input sentence:
John is a musician. He played a new song. A girl was listening to the song. “It is my favorite,” John said to her.
• Mention detection:
  • Mentions are ordered by breadth-first search.
  • If heads are identical, the larger mention is kept.
  • Percent, money, cardinal, and quantity expressions are removed.
  • Partitive expressions (quantity information) are removed.
  • Idiomatic "It" expressions are removed (ex. It is possible that, It seems that, It turns out···).
  • Adjectival forms of country names are removed.
  • Stop words are removed.
• Mention detection output (subscript: mention index in textual order, superscript: cluster ID):
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^9 favorite]_8^8,” [John]_10^10 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^9 favorite]_8^8,” [John]_10^10 said to [her]_11^11.
• Speaker Sieve:
  • “[I] voted for [Nader] because [he] was most aligned with [my] values,” [she] said
  • In “I voted for ~~~”, I refers to the speaker; if the speaker and the mention differ, coreference is blocked.
  • Two <I>s (or <you>s, or <we>s) assigned to different speakers cannot corefer.
  • Two different person pronouns by the same speaker cannot corefer (<my> and <he>).
  • Common-noun mentions cannot corefer with <I>, <you>, or <we> in the same turn or quotation.
  • In conversations, <you> can only corefer with the previous speaker.
• After this sieve ([my] is linked to the quoted speaker [John]_10):
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^10 favorite]_8^8,” [John]_10^10 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^10 favorite]_8^8,” [John]_10^10 said to [her]_11^11.
• String Match (Exact Match):
  • Mentions are linked when their strings match exactly.
  • [the Shahab 3 ground-ground missile] and blah blah [the Shahab 3 ground-ground missile].
• After this sieve ([John]_10 and its cluster merge with [John]_1):
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^8,” [John]_10^1 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^8,” [John]_10^1 said to [her]_11^11.
• Relaxed String Match:
  • Mentions with the same head.
  • Used when the difference is a relative clause, PP, or other post-modifier.
  • [Clinton] and [Clinton, whose term ends in January] (whose term ends in January = relative clause)
• After this sieve (no change in this example):
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^8,” [John]_10^1 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^2. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^8,” [John]_10^1 said to [her]_11^11.
• Precise Constructs:
  • Appositives set off by commas: [Israel's Deputy Defense Minister], [Ephraim Sneh], said . . .
  • Copular (predicate nominative) sentences: The C language is the programming language.
  • Role appositives: [[actress] Da Hey Lee]
  • Relative pronouns.
  • Acronyms.
  • Demonyms (terms for the residents of a country).
• After this sieve (the copular constructions link [John]_1–[a musician]_2 and [It]_7–[my favorite]_8):
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
• Strict Head Match A:
  • ex) Yale University != Harvard University (but similar head words)
  • Entity (cluster) head match
  • Word inclusion: linked when the mentions belong to the same entity, regardless of length
    • ex) true: [Florida Supreme Court] - [the Florida court]; false: [the correct runway] - [the wrong runway]
  • Compatible modifiers only
  • Not i-within-i
• After this sieve (no change in this example):
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^6. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
• Strict Head Match B, C:
  • B: drops the compatible-modifiers-only feature
  • C: drops the word-inclusion condition
  • Improves recall
• After this sieve ([the song]_6 is linked to [a new song]_4):
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
• Proper Head Noun Match:
  • The mentions have the same head, and
  • Not i-within-i
  • No location mismatches: [Lebanon] and [Southern Lebanon] are different; differing place names, proper nouns, and spatial modifiers block the link.
  • No numeric mismatches: [People] and [around 200 people] are different.
• After this sieve (no change in this example):
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
• Relaxed Head Match:
  • ex) heuristic matching: Sanders → {Sauls, the judge, Circuit Judge N. Sanders Sauls}
• After this sieve (no change in this example):
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^1. [He]_3^3 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^7 is [[my]_9^1 favorite]_8^7,” [John]_10^1 said to [her]_11^11.
• Pronominal Coreference Resolution:
  • A pronoun is linked to the mention whose tagged attributes (Number, Gender, Person, Animacy, Location) agree with it.
• After this sieve ([He] joins cluster 1, [It] joins cluster 4, [her] joins cluster 5):
[John]_1^1 is [a musician]_2^1. [He]_3^1 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^4 is [[my]_9^1 favorite]_8^4,” [John]_10^1 said to [her]_11^5.
Example
• Previous sieve output:
[John]_1^1 is [a musician]_2^1. [He]_3^1 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^4 is [[my]_9^1 favorite]_8^4,” [John]_10^1 said to [her]_11^5.
• Post Processing:
  • Follows the OntoNotes corpus conventions.
  • Removes singleton clusters.
  • Removes the mention that appears later in appositive (copular) patterns and conjunction constructions.
• After this step ([a musician]_2 and [my favorite]_8 are removed):
[John]_1^1 is a musician. [He]_3^1 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^4 is [my]_9^1 favorite,” [John]_10^1 said to [her]_11^5.
Example
• Final output sentence:
[John]_1^1 is a musician. [He]_3^1 played [a new song]_4^4. [A girl]_5^5 was listening to [the song]_6^4. “[It]_7^4 is [my]_9^1 favorite,” [John]_10^1 said to [her]_11^5.
Reference
• [book] B. Fox. Discourse Structure and Anaphora: Written and Conversational English. 1984.
• Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995. A model-theoretic Coreference scoring scheme. In MUC-6.
• D. Klein and C. Manning. 2003. Accurate unlexicalized parsing. In ACL.
• S. Bergsma and D. Lin. 2006. Bootstrapping Path-Based Pronoun Resolution. In ACL-COLING.
• Kehler, Andrew, et al. "Coherence and coreference revisited." Journal of Semantics 25.1 (2008): 1-44.
• conll2011_shared_task
• Haghighi, Aria, and Dan Klein. "Coreference resolution in a modular, entity-centered model." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.
• H. Ji and D. Lin. 2009. Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person mention detection. In PACLIC.
• De Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D. Manning. "Generating typed dependency parses from phrase structure parses."Proceedings of LREC. Vol. 6. 2006.
• L. Kertz, A. Kehler, and J. Elman. 2006. Grammatical and Coherence-Based Factors in Pronoun Interpretation. In Proceedings of the 28th Annual Conference of the Cognitive Science Society.
• i-within-i (not a paper)
• H. Poon and P. Domingos. 2008. Joint unsupervised coreference resolution with Markov Logic. In EMNLP.
• X. Luo. 2005. On coreference resolution performance metrics. In HTL-EMNLP.
• Hovy, Eduard, et al. "OntoNotes: the 90% solution." Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers. Association for Computational Linguistics, 2006.
• PronounResolution-____ (not a paper)
• A. Haghighi and D. Klein. 2009. Simple Coreference resolution with rich syntactic and semantic features. In EMNLP.
• J.R. Hobbs. 1977. Resolving pronoun references. Lingua.
• Haghighi, Aria, and Dan Klein. "Unsupervised coreference resolution in a nonparametric bayesian model." Annual meeting-Association for Computational Linguistics. Vol. 45. No. 1. 2007.
• Pradhan, Sameer S., et al. "Unrestricted coreference: Identifying entities and events in OntoNotes." Semantic Computing, 2007. ICSC 2007. International Conference on. IEEE, 2007.
• Deterministic Coreference Resolution Based
Coreference Resolution for Korean
• Init CR
  • Input
    • Dataset: wise QA, newswire [JSON]
    • Dictionaries: pronoun dic., acronym dic.
    • etc.: option file, domain selection
  • Processing
    • if: raw dataset (true/false)
  • Core
    • Mention detection
    • Sieve 1~10
    • Post processing
  • Output
    • N_DOC JSON
Output JSON Format
• Coreference resolution input
  • An N_Doc structure (or JSON file) that has gone through parsing and SRL
• Coreference resolution output
  • The N_Doc structure (or JSON file) with an Entity array added
  • Entity (mention cluster)
    • id: entity id
    • type = {NE type, ""}
    • Optional
      • number = {singular, plural, ""}, gender = {male, female, intersex, ""}
      • person = {1, 2, 3, ""}, animacy = {human, animate, inanimate, ""}
    • Mention array
      • Mention
        • id: mention id
        • text: mention string (content morphemes only, particles removed)
        • sent_id: sentence id
        • start_eid: first eojeol (word-unit) id
        • end_eid: last eojeol id
        • ne_id: NE id if the mention contains an NE, otherwise -1
(A hypothetical example follows below.)
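A hypothetical example of one entity in the output, following the fields listed above; the values and the "mention" key name for the mention array are invented for illustration.

    import json

    entity = {
        "id": 0,
        "type": "PS_NAME",                        # NE type, or "" when unavailable
        "number": "singular", "gender": "male",   # optional attributes
        "person": "3", "animacy": "human",
        "mention": [                              # mention array (key name assumed)
            {"id": 3, "text": "김대중", "sent_id": 0, "start_eid": 2, "end_eid": 2, "ne_id": 1},
            {"id": 7, "text": "김대통령", "sent_id": 2, "start_eid": 0, "end_eid": 0, "ne_id": -1},
        ],
    }
    print(json.dumps(entity, ensure_ascii=False, indent=2))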
Mention
• The nominals that are the targets of coreference resolution.
• Head
  • The word that carries the substantive meaning of the nominal.
  • A mention is the head together with its modifiers.
  • Extracted from the dependency tree.
• Problem
  • Korean has many Sino-Korean nouns that carry verbal meaning.
• Therefore
  • Mentions are restricted to noun phrases that are referred to more than once within the document.
  • Mentions are collected at the eojeol (word-unit) level.
Entity
• Entity-Centric Model
  • The deterministic rules are the conditions needed to link mentions within a single document, and each sieve has its own linking rules.
  • As the stages proceed, entities are expanded, and each sieve uses the entity information for coreference resolution.
  • Mentions that corefer form an entity set with the same entity id, and within that set they share the entity attributes.
  • These entity attributes are then used when resolving the reference between an antecedent mention and the current mention.
Core
• Speaker Match
• Exact String Match
• Relaxed String Match
• Precise Constructs
  • Role Appositive
  • Acronym
• Strict Head Match A
• Strict Head Match B
• Strict Head Match C
• Proper Head Word Match
• Relaxed Head Match
• Pronoun Match
• Post Processing
Exact String Match
• Two different mentions within a document are linked when their strings match exactly.
Precise Constructs: Role Appositive
• A person-name mention may be modified by a title mention, or conversely a mention denoting a position or title may follow the person-name mention.
• When the embedding mention has a person, occupation, or title NE attribute and the embedded mention also has a person, occupation, or title NE attribute, the two mentions are linked.
Precise Constructs: Acronym
• An acronym (abbreviation) is a shortened form of a compound noun.
• ex) '한국전력공사' >> '한전', '한국전력', etc.
• Algorithm
  • Noun omission: the compound is shortened by omitting some of its component nouns.
    (e.g. compound noun: '르노자동차', component nouns: '르노', '자동', '차', abbreviations: '르노자동', '르노차', '자동차')
  • Syllable combination: the compound is shortened to the first syllables of its component nouns.
    (e.g. compound noun: '한국전력공사', component nouns: '한국', '전력', '공사', abbreviations: '한전', '전공', '한전공')
  • Mixed form: a combination of noun omission and syllable combination; some component nouns are kept whole while only the first syllables of the remaining ones are used.
    (e.g. compound noun: '대우자동차판매', component nouns: '대우', '자동차', '판매', abbreviation: '대우자판')
  • Person abbreviation: when a person's name appears together with his or her position or occupation, the abbreviation combines the surname and the position. The NE information is inspected; when a "CV_POSITION" entity is found, up to two preceding indices are checked, and if a "PS_NAME" entity exists there, that word is abbreviated to build the acronym.
    (e.g. if the current NE is CV_POSITION: 대통령 and a preceding NE is PS_NAME: 김대중, the abbreviation is "김대통령".)
  • English acronyms: unlike Korean, English only needs the first letter of each word.
    (e.g. compound: 'Natural Language Processing', components: 'Natural', 'Language', 'Processing', abbreviation: 'NLP')
(A sketch of two of these patterns follows below.)
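Two of the patterns above (noun omission and first-syllable combination) can be sketched as follows, taking the already-segmented component nouns as input; this only illustrates the described rules and is not the system's code.

    def omission_forms(parts):
        # noun omission: drop exactly one component noun and keep the rest in order
        return {"".join(parts[:i] + parts[i + 1:]) for i in range(len(parts))}

    def syllable_forms(parts):
        # syllable combination: first syllable of each component, contiguous runs of length >= 2
        initials = [p[0] for p in parts]
        return {"".join(initials[i:j]) for i in range(len(initials))
                for j in range(i + 2, len(initials) + 1)}

    print(sorted(omission_forms(["르노", "자동", "차"])))    # ['르노자동', '르노차', '자동차']
    print(sorted(syllable_forms(["한국", "전력", "공사"])))  # ['전공', '한전', '한전공']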
Strict Head Match A
• Decides whether mentions with the same head corefer, subject to three constraints.
  • Word inclusion (the later one is shorter): if two mentions in a document corefer, the earlier mention is longer than the later one.
  • Compatible modifiers only: if two mentions in a document corefer, every modifier of the later mention must also be present in the earlier mention.
  • Not i-within-i: when the head of an embedded mention carries the genitive particle "~의" and modifies the following noun phrase, the mentions are in an i-within-i configuration. (e.g. [[남극의]_1^1 눈물]_0^0: the two mentions do not corefer.)
Strict Head Match B
• Decides whether mentions with the same head corefer, excluding the "Compatible modifiers only" constraint.
  • Word inclusion (the later one is shorter): if two mentions in a document corefer, the earlier mention is longer than the later one.
  • Compatible modifiers only: if two mentions in a document corefer, every modifier of the later mention must also be present in the earlier mention.
  • Not i-within-i: when the head of an embedded mention carries the genitive particle "~의" and modifies the following noun phrase, the mentions are in an i-within-i configuration. (e.g. [[남극의]_1^1 눈물]_0^0: the two mentions do not corefer.)
Strict Head Match C
• Decides whether mentions with the same head corefer, excluding the "Word inclusion" constraint.
  • Word inclusion (the later one is shorter): if two mentions in a document corefer, the earlier mention is longer than the later one.
  • Compatible modifiers only: if two mentions in a document corefer, every modifier of the later mention must also be present in the earlier mention.
  • Not i-within-i: when the head of an embedded mention carries the genitive particle "~의" and modifies the following noun phrase, the mentions are in an i-within-i configuration. (e.g. [[남극의]_1^1 눈물]_0^0: the two mentions do not corefer.)
Proper Head Word Match
• Two mentions corefer if they have the same head, are not in an i-within-i configuration, and their quantity and location information match.
Relaxed Head Match
• Unlike the other head-match sieves, this sieve uses more relaxed rules.
• Heuristically, the current mention is linked if its head matches some word in a preceding entity.
• The head comparison uses the head text, or only the nc (free noun) morphemes of the head.
• For a link, both the antecedent mention and the current mention must have NE labels, and the labels must match.
• e.g. [프랑스의 르노자동차], [르노사], [르노]
  • These three mentions all refer to '르노삼성자동차'.
  • They therefore end up coreferent and belong to a single entity.
  • Their NE labels are all OGG_BUSINESS.
Pronoun Match
• Characteristics of Korean pronouns
  • Korean personal pronouns take many forms depending on honorifics.
  • Combinations of demonstratives such as "그", "이", "저" with common nouns play the role of English pronouns such as "it" and "that".
• Procedure
  • Mentions acting as pronouns must be identified and assigned attributes.
  • Centering theory is applied when searching for the mention that an identified pronominal mention refers to.
• The pronoun dictionary from the Sejong corpus is used to find mentions that act as pronouns, and these are handled separately.
• Pronoun dictionary:
  • carries attributes such as animacy, gender, number, and honorific level.
  • attributes that cannot be determined: undefined
  • attributes that do not apply: none
Post Processing
• The stage that removes mentions irrelevant to coreference resolution from the entity information produced by the sieves above, and updates the Entity objects directly.
• Singleton
  • A mention that is never linked ends up as the only mention of its entity; this is called a singleton.
• Singleton cluster
  • When only one mention remains in an entity, that entity is deleted.
• Post-processing
  • Removes such singletons by applying the singleton-cluster (= entity) rule, as in the sketch below.
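A minimal sketch of the singleton removal, with entities represented as a dict from entity id to its mention ids (an assumed structure, not the system's Entity objects).

    def remove_singletons(entities):
        # drop every entity that ends up with a single mention after all sieves
        return {eid: mids for eid, mids in entities.items() if len(mids) > 1}

    entities = {1: [1, 3, 9, 10], 4: [4, 6, 7], 5: [5, 11], 8: [8]}
    print(remove_singletons(entities))   # entity 8 (a singleton) is removed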