Composite Named Entities in a Spanish Textual Database Sofía N. Galicia-Haro Facultad de Ciencias - UNAM Alexander F. Gelbukh and Igor A. Bolshakov Lab. Lenguaje Natural CIC – IPN México, D. F.
Date posted: 22-Dec-2015

Identification of Composite Named Entities

in a Spanish Textual Database

Sofía N. Galicia-Haro

Facultad de Ciencias - UNAM

Alexander F. Gelbukh and Igor A. Bolshakov

Lab. Lenguaje Natural CIC – IPN

México, D. F.

Contents

Introduction

Named Entities in Textual Databases

NE Analysis

Recognition Method

Conclusions

Textual Databases

Texts have been entered into computers and put on the Web

to save tons of paper

to allow people to have remote access

to provide much better access to texts in electronic format, etc.

Searching through this huge volume of material for information is a time-consuming task

Named Entities

NEs mentioned in textual databases constitute an important part of their semantic content

A collection of political electronic texts shows that almost 50% of all sentences contain at least one NE

This indicates the relevance of NE identification and its role in document indexing and retrieval

Composite Named Entities

NEs with coordinated constituents

Luz y Fuerza del Centro

NEs with prepositional phrases

Ejército Zapatista de Liberación Nacional

NEs in Mexican Textual DB

NEs appear in about half of the sentences (54.9% in Coll. 1, 48.3% in Coll. 2)

A selection from Collection 1 was taken for training

Collections of Mexican political texts

                                Coll. 1    Coll. 2
# Sentences                     442,719    208,298
# Sentences w/named entities    243,165    100,602
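The share of sentences containing a named entity follows directly from the two rows of the table. The short Python sketch below only redoes that arithmetic; the variable and function names are illustrative and do not come from the slides.

```python
# Sentence counts copied from the collections table.
collections = {
    "Coll. 1": {"sentences": 442_719, "with_ne": 243_165},
    "Coll. 2": {"sentences": 208_298, "with_ne": 100_602},
}


def ne_sentence_share(stats):
    """Percentage of sentences that contain at least one named entity."""
    return 100.0 * stats["with_ne"] / stats["sentences"]


for name, stats in collections.items():
    print(f"{name}: {ne_sentence_share(stats):.1f}% of sentences contain an NE")
```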

     

Initial NE Recognition Step

Identification of linguistic characteristics

Example: Prepositions

• link two different NEs

• are included in the NE

Identification of style characteristics

Ex.: Specific words introduce convention names

coordinadora del programa Mundo Maya ‘Mundo Maya program’s coordinator’

Training File

A Perl program extracts “compounds”

Los miembros del Ejercito Federal (1) lejos de aplicar la Ley sobre Armas de Fuego y Explosivos parecen (2) proteger a los participantes en el tiroteo.

Compounds contain no more than three non-capitalized words between capitalized words

Compounds are delimited on the left and on the right by a punctuation mark or a word
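The two rules above can be sketched in a few lines. This is a Python illustration written for this transcript, not the authors' Perl program: the function name, the punctuation set, and the decision to treat a sentence-initial capitalized word like any other are all assumptions.

```python
import re

MAX_GAP = 3  # at most three non-capitalized words between capitalized words


def extract_compounds(sentence, max_gap=MAX_GAP):
    """Return word sequences that start and end with a capitalized word and
    have at most `max_gap` non-capitalized words between capitalized ones."""
    compounds = []
    # Punctuation delimits compounds, so each punctuation-free chunk is scanned alone.
    for chunk in re.split(r'[.,;:!?()"“”¿¡]+', sentence):
        current = []  # compound under construction (always ends in a capitalized word)
        pending = []  # non-capitalized words seen since the last capitalized word
        for word in chunk.split():
            if word[0].isupper():
                if current and len(pending) <= max_gap:
                    # The gap is short enough: absorb it and keep growing.
                    current.extend(pending)
                    current.append(word)
                else:
                    # No open compound, or the gap is too long: start a new one.
                    if current:
                        compounds.append(" ".join(current))
                    current = [word]
                pending = []
            elif current:
                pending.append(word)
        if current:
            compounds.append(" ".join(current))
    return compounds


sentence = ("Los miembros del Ejercito Federal lejos de aplicar la Ley sobre "
            "Armas de Fuego y Explosivos parecen proteger a los participantes "
            "en el tiroteo.")
for compound in extract_compounds(sentence):
    print(compound)
```

Because this naive version keeps the sentence-initial "Los", the first compound it reports for the slide's sentence begins with that word; a real extractor would need some extra treatment of sentence-initial capitals.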

Sentences of Coll. 1

From 243,165 sentences, 472,087 compounds were extracted

500 randomly selected sentences were manually analyzed

Main result of the analysis: syntactic ambiguity is frequent

Syntactic Ambiguity

Coordination of coordinated names

Comisión Federal de Electricidad y Luz y Fuerza del Centro

Margarita Diéguez y Armas y Carlos Virgilio

Prepositional phrase attachment

Different names linked by prepositions

Comandancia General del Ejército Zapatista de Liberación Nacional

Knowledge Contributions

External lists

Linguistic knowledge

Heuristics

Statistics

External Lists

Hand-made list of similes (625 items)

paz y justicia ‘peace and justice’

Latinoamérica y el Caribe ‘Latin America and the Caribbean’

Hand-made list of words Lists from the WEB

personal names (697 items) main Mexican cities (910 items)

Linguistic Knowledge

Examples of linguistic restrictions

Lists of groups of capitalized words

Corea del Sur (1), Taiwan (2), Checoslovaquia (3) y Sudáfrica (4)
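One possible reading of this restriction is that, in a comma-separated enumeration, the final "y" separates two distinct entities instead of joining them. The Python sketch below illustrates that reading only; the function name and the exact rule are assumptions, not the authors' implementation.

```python
def split_enumeration(text):
    """Split 'A, B, C y D' into its items. Text without a comma is returned
    as a single item, so a name such as 'Luz y Fuerza del Centro' stays whole."""
    items = [part.strip() for part in text.split(",") if part.strip()]
    if len(items) < 2:
        return items
    # In an enumeration, the last "y" separates the final two items.
    head, sep, tail = items[-1].rpartition(" y ")
    if sep:
        items[-1:] = [head, tail]
    return items


print(split_enumeration("Corea del Sur, Taiwan, Checoslovaquia y Sudáfrica"))
print(split_enumeration("Luz y Fuerza del Centro"))
```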

The preposition por followed by an article cannot be the link within a personal name

Cuauhtémoc Cárdenas (1) por la Alianza por la Ciudad de México (2)
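A minimal sketch of how such a restriction could be applied, assuming a compound that begins with a known first name is a personal name and may be cut at the first "por" plus article. The tiny `FIRST_NAMES` set is a toy stand-in for a real list of personal names, and the article set and function name are likewise illustrative.

```python
ARTICLES = {"el", "la", "los", "las", "un", "una", "unos", "unas"}
FIRST_NAMES = {"Cuauhtémoc", "Margarita", "Carlos"}  # toy list, an assumption


def split_personal_name_at_por(compound, first_names=FIRST_NAMES):
    """If the compound starts with a first name, cut it at the first
    'por' + article, which cannot be a link inside a personal name."""
    words = compound.split()
    if not words or words[0] not in first_names:
        return [compound]
    # Stop early enough that at least one word follows the article.
    for i in range(1, len(words) - 2):
        if words[i] == "por" and words[i + 1] in ARTICLES:
            return [" ".join(words[:i]), " ".join(words[i + 2:])]
    return [compound]


print(split_personal_name_at_por(
    "Cuauhtémoc Cárdenas por la Alianza por la Ciudad de México"))
```

Only the first "por" plus article is used as a cut point, so the second one, which belongs to the organization name, is left untouched.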

Heuristics and Statistics

Heuristic example: a first name can be part of only one name sequence among those coordinated

Ex.: Margarita Diéguez y Armas y Carlos Virgilio

Carlos belongs to the list of first names. Thus there are two name sequences here: Margarita Diéguez y Armas vs. Carlos Virgilio
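The heuristic can be sketched as follows: a coordinated compound is cut at every "y" that is immediately followed by a known first name. This is an illustrative Python sketch; the function name and the two-item first-name set are assumptions standing in for the real list.

```python
def split_coordinated_names(compound, first_names):
    """Cut a coordinated compound at each 'y' that precedes a first name,
    since a first name can open only one name sequence."""
    words = compound.split()
    parts, start = [], 0
    for i in range(1, len(words) - 1):
        if words[i] == "y" and words[i + 1] in first_names:
            parts.append(" ".join(words[start:i]))
            start = i + 1
    parts.append(" ".join(words[start:]))
    return parts


first_names = {"Margarita", "Carlos"}  # toy list, an assumption
print(split_coordinated_names(
    "Margarita Diéguez y Armas y Carlos Virgilio", first_names))
```

The first "y" is kept inside the name because "Armas" is not a first name, while the second "y" precedes "Carlos" and therefore marks the boundary.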

Statistics from the training file

With a high score, Estados Unidos is a 2-word group

Thus Estados Unidos sobre México could be separated
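The slides do not say how the score is computed. The Python sketch below assumes the simplest possible stand-in, a raw count of how often a group occurs as a complete compound in the training file, and splits a compound at a preposition when the left-hand group passes a threshold. The training list, the threshold, and the preposition set are made-up toy values used only for illustration.

```python
from collections import Counter

PREPOSITIONS = {"de", "del", "sobre", "por", "para", "en", "con"}

# Toy training compounds (invented for illustration, not real corpus counts).
training_compounds = (
    ["Estados Unidos"] * 5 + ["México"] * 3 + ["Estados Unidos sobre México"]
)
counts = Counter(training_compounds)


def split_by_statistics(compound, counts, min_count=3):
    """Split at a preposition when the words to its left occur often enough
    as a complete compound on their own in the training data."""
    words = compound.split()
    for i in range(1, len(words) - 1):
        if words[i] in PREPOSITIONS:
            left = " ".join(words[:i])
            if counts[left] >= min_count:
                return [left, " ".join(words[i + 1:])]
    return [compound]


print(split_by_statistics("Estados Unidos sobre México", counts))
print(split_by_statistics("Ejército Zapatista de Liberación Nacional", counts))
```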

Application of the Method

Obtaining compounds with function words

Using the previous resources, the program decides on splitting, delimiting or leaving each compound as such

Extract

• coordinated groups

• prepositional phrases

• the rest of groups of capitalized words

Results - 1

Obtained from 500 sentences of Coll. 2

                 Coordinated groups    Prepositional phrase groups    Total
Precision (%)    54                    69                             89
Recall (%)       48                    67                             87

Results - 2

Total: 1496 NEs

• 63 names with coordination

• 167 prepositional groups

To compare with: Carreras, X., L. Màrquez and L. Padró. Named Entity Extraction using AdaBoost, CoNLL-2002

• 92% for precision and 91% for recall

• However, the test file includes only one coordinated name

• If an NE is embedded in another one, only the top-level entity was marked

Conclusions

• We present a method to identify and disambiguate groups of capitalized words

• Our work is focused on composite named entities

• Our method uses extremely short lists and a small POS-marked dictionary

• The method uses heterogeneous knowledge to decide on splitting or joining groups with capitalized words