Identification of Composite Named Entities
in a Spanish Textual Database
Sofía N. Galicia-Haro
Facultad de Ciencias - UNAM
Alexander F. Gelbukh and Igor A. Bolshakov
Lab. Lenguaje Natural CIC – IPN
México, D. F.
Contents
Introduction
Named Entities in Textual Databases
NE Analysis
Recognition Method
Conclusions
Textual Databases
Texts have been entered into computers and onto the Web
to save tons of paper
to allow people to have remote access
to provide much better access to texts in electronic format, etc.
Searching through this huge material for information is a time-consuming task
Named Entities
NE mentioned in textual databases constitute an important part of their semantic contents
A collection of political electronic texts shows that almost 50% of all sentences contain at least one NE
This indicates the relevance of NE identification and its role in document indexing and retrieval
Composite Named Entities
NE with coordinated constituents
Luz y Fuerza del Centro
NE with prepositional phrases
Ejército Zapatista de Liberación Nacional
NEs in Mexican Textual DB
NEs appear in about 50% of the sentences
Collection 1 was selected for training
Collections of Mexican political texts
                                Coll. 1    Coll. 2
# Sentences                     442,719    208,298
# Sentences w/named entities    243,165    100,602
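As a quick arithmetic check on the counts above, the share of sentences containing a named entity can be computed per collection. This is only a sketch: the function name `ne_share` is an illustrative assumption, and the figures are copied from the table.

```python
# Sentence counts copied from the table: (all sentences, sentences with NEs)
COLLECTIONS = {
    "Coll. 1": (442_719, 243_165),
    "Coll. 2": (208_298, 100_602),
}


def ne_share(total_sentences, sentences_with_ne):
    """Percentage of sentences that contain at least one named entity."""
    return 100.0 * sentences_with_ne / total_sentences


for name, (total, with_ne) in COLLECTIONS.items():
    print(f"{name}: {ne_share(total, with_ne):.1f}% of sentences contain an NE")
```

Under these counts the share is about 54.9% for Coll. 1 and about 48.3% for Coll. 2, which is where the "about 50%" figure comes from.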
Initial NE Recognition Step
Identification of linguistic characteristics
Example: Prepositions
• link two different NE
• are included in the NE
Identification of style characteristics
Ex: Specific words introduce convention names
coordinadora del programa Mundo Maya ‘Mundo Maya program’s coordinator’
Training File
A Perl program extracts “compounds”
Los miembros del Ejercito Federal (1) lejos de aplicar la Ley sobre Armas de Fuego y Explosivos parecen (2) proteger a los participantes en el tiroteo.
Compounds contain no more than three non-capitalized words between capitalized words
Compounds are left- and right-delimited by a punctuation mark or a word
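The compound rule can be illustrated with a minimal sketch. It is written in Python and is not the authors' Perl program. Several details are assumptions: the function name `extract_compounds`, the regular-expression tokenizer, skipping the sentence-initial word (which is capitalized only by orthography), and returning compounds without their delimiting word or punctuation mark.

```python
import re

MAX_GAP = 3  # at most three non-capitalized words between capitalized words


def extract_compounds(sentence, skip_first=True):
    """Return groups of capitalized words joined by short lowercase gaps."""
    # Words (letters/digits, Unicode-aware) or single punctuation marks.
    tokens = re.findall(r"[^\W_]+|[^\w\s]", sentence)
    compounds, current, gap = [], [], []
    for i, tok in enumerate(tokens):
        is_word = tok[0].isalpha()
        capitalized = is_word and tok[0].isupper() and not (skip_first and i == 0)
        if capitalized:
            # Lowercase words seen since the last capitalized word become
            # part of the compound only now that another capital follows.
            current.extend(gap)
            current.append(tok)
            gap = []
        elif is_word and current and len(gap) < MAX_GAP:
            gap.append(tok)
        else:
            # Punctuation, a number, or a fourth lowercase word ends the compound.
            if current:
                compounds.append(" ".join(current))
            current, gap = [], []
    if current:
        compounds.append(" ".join(current))
    return compounds


sentence = ("Los miembros del Ejercito Federal lejos de aplicar la Ley sobre "
            "Armas de Fuego y Explosivos parecen proteger a los participantes "
            "en el tiroteo.")
print(extract_compounds(sentence))
```

On the example sentence this sketch yields the two groups marked (1) and (2): `Ejercito Federal` and `Ley sobre Armas de Fuego y Explosivos`.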
Sentences of Coll. 1
From 243,165 sentences, 472,087 compounds were extracted
500 randomly selected sentences were manually analyzed
Main result from analysis: Syntactic ambiguity is frequent
Syntactic Ambiguity
Coordination of coordinated names
Comisión Federal de Electricidad y Luz y Fuerza del Centro
Margarita Diéguez y Armas y Carlos Virgilio
Prepositional phrase attachment
Different names linked by prepositions
Comandancia General del Ejército Zapatista de Liberación Nacional
Knowledge Contributions
External lists
Linguistic knowledge
Heuristics
Statistics
External Lists
Hand-made list of similes (625 items)
paz y justicia ‘peace and justice’
Latinoamérica y el Caribe
Hand-made list of words
Lists from the Web
• personal names (697 items)
• main Mexican cities (910 items)
Linguistic Knowledge
Examples of linguistic restrictions
Lists of groups of capitalized words
Corea del Sur (1), Taiwan (2), Checoslovaquia (3) y Sudáfrica (4)
Preposition por followed by indefinite article cannot be the link within a personal name
Cuauhtémoc Cárdenas (1) por la Alianza por la Ciudad de México (2)
Heuristics and Statistics
Heuristic example: a first name can be part of only one name sequence among those coordinated
Ex.: Margarita Diéguez y Armas y Carlos Virgilio
Carlos belongs to the list of first names. Thus there are two name sequences here: Margarita Diéguez y Armas vs. Carlos Virgilio
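The first-name heuristic might be sketched as follows. The set `FIRST_NAMES` is a tiny hypothetical stand-in for the 697-item list of personal names, and `split_coordinated_names` is an illustrative name, not part of the original program.

```python
# Hypothetical stand-in for the list of personal names (697 items in the talk).
FIRST_NAMES = {"Margarita", "Carlos"}


def split_coordinated_names(group, first_names=FIRST_NAMES):
    """Split a coordinated group where a known first name follows 'y'."""
    sequences, current = [], []
    for word in group.split():
        # A first name can belong to only one name sequence, so a first
        # name right after the conjunction starts a new sequence.
        if word in first_names and current and current[-1] == "y":
            current.pop()  # drop the linking conjunction
            sequences.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        sequences.append(" ".join(current))
    return sequences


print(split_coordinated_names("Margarita Diéguez y Armas y Carlos Virgilio"))
```

With these assumptions the example splits into `Margarita Diéguez y Armas` and `Carlos Virgilio`, while a group such as `Luz y Fuerza del Centro` is left whole because no listed first name follows the conjunction.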
Statistics from training file
With a high score, Estados Unidos is a 2-word group
Thus Estados Unidos sobre México could be separated
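One way to picture the statistical cue is sketched below. The slides do not say how the score is computed, so this sketch assumes a raw frequency count of each capitalized group in the training file. The counts, the threshold, the preposition list and the name `split_by_statistics` are all hypothetical.

```python
from collections import Counter

# Hypothetical frequencies of capitalized groups seen alone in the training file.
GROUP_COUNTS = Counter({"Estados Unidos": 120, "México": 300})

PREPOSITIONS = ("sobre",)  # illustrative; a real list would be longer


def split_by_statistics(compound, group_counts, threshold):
    """Split at a preposition when the left part is a frequent group by itself."""
    words = compound.split()
    for i, word in enumerate(words):
        if word in PREPOSITIONS and i > 0:
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            # A high count suggests the left part is a complete name.
            if group_counts.get(left, 0) >= threshold:
                return [left, right]
    return [compound]


print(split_by_statistics("Estados Unidos sobre México", GROUP_COUNTS, threshold=50))
```

Here `Estados Unidos sobre México` is separated into `Estados Unidos` and `México` because the left part clears the assumed threshold. A left part that was rarely seen alone would leave the compound intact.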
Application of the Method
Obtaining compounds with functional words
Using previous resources, the program decides on splitting, delimiting or leaving each compound as such
Extract
• coordinated groups
• prepositional phrases
• the rest of groups of capitalized words
Results - 1
Obtained from 500 sentences of Coll. 2
                 Coordinated groups    Prepositional phrase groups    Total
Precision (%)    54                    69                             89
Recall (%)       48                    67                             87
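For reference, precision and recall can be computed as in the sketch below. It assumes exact-match scoring of recognized groups against a manually marked gold set, which the slides do not specify. The toy sets are invented for illustration and are not the evaluation data.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall of predicted groups against gold groups."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {"Luz y Fuerza del Centro", "Carlos Virgilio",
        "Margarita Diéguez y Armas", "Comisión Federal de Electricidad"}
predicted = {"Luz y Fuerza del Centro", "Carlos Virgilio", "Margarita Diéguez"}

p, r = precision_recall(predicted, gold)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

In the toy case two of the three predicted groups are correct and two of the four gold groups are found, so precision is 2/3 and recall is 1/2.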
Results - 2
Total: 1496 NE
63 names with coordination
167 prepositional groups
To compare with: Carreras, X., L. Màrquez and L. Padró. Named Entity Extraction using AdaBoost, CoNLL-2002
92% for precision and 91% for recall
However, the test file only includes one coordinated name
If a NE is embedded in another one, only the top level entity was marked
Conclusions
We present a method to identify and disambiguate groups of capitalized words
Our work is focused on composite named entities
Our method uses extremely short lists and a small POS-marked dictionary
The method uses heterogeneous knowledge to decide on splitting or joining groups with capitalized words