Rich Set of Features for Proper Name Recognition in Polish Texts - presentation


Rich Set of Features for Proper Name Recognition in Polish Texts

Michał Marcińczuk, Michał Stanek, Maciej Piasecki and Adam Musiał

Wrocław University of Technology

11 June 2011

Project NEKST (Natively Enhanced Knowledge Sharing Technologies) co-financed by Innovative Economy Programme project POIG.01.01.02-14-013/09


Introduction

Scope:

recognition of proper names in Polish texts,

5 types of proper names: first names, surnames, and names of countries, cities and roads,

three corpora: Stock Exchange Reports (CSER), Police Reports (CPR) [1] and Economic News (CEN),

combination of a machine learning approach with manually created rules (filtering),

use of a rich set of features,

application of Conditional Random Fields [2].

The corpora are available at http://nlp.pwr.wroc.pl/inforex?page=download

M. Marcińczuk, M. Stanek, M. Piasecki and A. Musiał 11 June 2011 2 / 17


Problem Statement

Recognition of proper names in Polish texts is more difficult than in English because of:

1 weakly constrained word order,

2 rich inflection,

and also:

3 the evidence that a given expression is a proper name can appear in both its left and its right context.

To address these problems (to some extent) we propose a rich set of features and the application of Conditional Random Fields, which can make use of such features.


Corpora

CSER: corpus of stock exchange reports, used in 10-fold CV,

CPR: corpus of police reports, used for cross-domain validation,

CEN: corpus of economic news, used for cross-domain validation.

Annotation statistics

PN category     CSER    CPR    CEN
first name       688    333   1097
surname          691    411   1517
country name     484     27   1695
city name       1849    191    657
road name        395     42     31
Total           4107   1004   4997


Orthographic features

orth: the word form itself,

base: the morphological base form of the word,

n prefixes/n suffixes: the first/last n characters of the encountered word form, where n ∈ {1, 2, 3, 4}. The missing characters are replaced with ’ ’. Motivated by the observation that some groups of proper names have typical prefixes and/or endings.

pattern: encodes the pattern of the character sequence; one of: ALL UPPER, ALL LOWER, DIGITS, SYMBOLS, UPPER INIT, UPPER CAMEL CASE, LOWER CAMEL CASE, MIXED.
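The prefix/suffix and pattern features could be computed roughly as in the sketch below. It is an illustration only: the padding character (`_`), the underscore-joined label spellings and the exact rules that separate the pattern classes are assumptions, since the slides name the classes but do not define them.

```python
def affixes(word, max_n=4, pad="_"):
    """Prefix and suffix features for n = 1..max_n.

    Words shorter than n are padded; the pad character is an assumption.
    """
    feats = {}
    for n in range(1, max_n + 1):
        feats[f"prefix-{n}"] = word[:n].ljust(n, pad)
        feats[f"suffix-{n}"] = word[-n:].rjust(n, pad)
    return feats


def pattern(word):
    """Map a word form to one of the eight pattern classes (assumed rules)."""
    if not word:
        return "MIXED"
    if word.isdigit():
        return "DIGITS"
    if all(not c.isalnum() for c in word):
        return "SYMBOLS"
    if word.isalpha():
        if word.isupper():
            return "ALL_UPPER"
        if word.islower():
            return "ALL_LOWER"
        if word[0].isupper() and word[1:].islower():
            return "UPPER_INIT"
        if word[0].isupper():
            return "UPPER_CAMEL_CASE"
        return "LOWER_CAMEL_CASE"
    # Letters mixed with digits or symbols.
    return "MIXED"


if __name__ == "__main__":
    for w in ["Wrocław", "ul", "PKO", "2011", "A4"]:
        print(w, pattern(w), affixes(w))
```

Python's string predicates are Unicode-aware, so Polish letters such as ł are treated as ordinary lower-case letters here.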


Binary orthographic features

8 binary features; each feature is 1 if its condition is met and 0 otherwise:

1 (word) starts with an upper case letter,

2 starts with a lower case letter,

3 starts with a symbol,

4 starts with a digit,

5 contains an upper case letter,

6 contains a lower case letter,

7 contains a symbol,

8 contains a digit.

The features are based on the filtering rules described in [3], e.g. first names start with an upper case letter and do not contain symbols.

The binary features encode information at the level of single characters, while the aim of the pattern feature is to encode a repeatable sequence of characters.
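As a rough illustration, the eight binary features and the first-name rule quoted above might be coded as follows. The feature names and the definition of a "symbol" (any character that is neither alphanumeric nor whitespace) are assumptions made for this sketch, not taken from the slides.

```python
def is_symbol(ch):
    # Assumed definition: anything that is not a letter, digit or whitespace.
    return not ch.isalnum() and not ch.isspace()


def binary_orth_features(word):
    """Return the eight 0/1 orthographic features for one word form."""
    first = word[:1]  # empty string for an empty word, so no IndexError
    return {
        "starts_upper": int(first.isupper()),
        "starts_lower": int(first.islower()),
        "starts_symbol": int(bool(first) and is_symbol(first)),
        "starts_digit": int(first.isdigit()),
        "has_upper": int(any(c.isupper() for c in word)),
        "has_lower": int(any(c.islower() for c in word)),
        "has_symbol": int(any(is_symbol(c) for c in word)),
        "has_digit": int(any(c.isdigit() for c in word)),
    }


def plausible_first_name(word):
    """The example rule: starts with upper case and contains no symbols."""
    f = binary_orth_features(word)
    return f["starts_upper"] == 1 and f["has_symbol"] == 0


if __name__ == "__main__":
    for w in ["Jan", "jan", "J@n", "2011"]:
        print(w, binary_orth_features(w), plausible_first_name(w))
```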


Wordnet-based features

synonym: the word’s synonym, the first in alphabetical order among all synonyms of the word in Polish Wordnet. The sense of the word is not disambiguated,

hypernym n: a hypernym of the word at distance n.

Wordnet-based features are used to decrease the variety of observed words.

token    synonym                     hypernym 1           hypernym 2                                                            hypernym 3
Pan      mężczyzna (male)            dorosły (adult)      człowiek ze względu na wiek (person in specified age)                 człowiek (human)
Prezes   przewodniczący (chairman)   głowa (head)         człowiek ze względu na pełnioną funkcję (person holding a position)   człowiek (human)
Zarząd   centrala (head office)      władza (authority)   grupa ludzi (group of people)                                         zbiór (set)
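The lookup behind the synonym and hypernym n features can be sketched with a toy in-memory wordnet. Everything in this block is an assumption made for illustration: the dictionary layout, the single hypernym chain per word, and the extra synonym `szef`, which is not from the slides and is added only so that the alphabetical choice is visible. It is not the plWordNet API.

```python
# Toy data modelled on the "Prezes" row of the table above.
SYNONYMS = {
    "prezes": ["szef", "przewodniczący"],  # "szef" is an invented extra entry
}
HYPERNYM = {
    "prezes": "głowa",
    "głowa": "człowiek ze względu na pełnioną funkcję",
    "człowiek ze względu na pełnioną funkcję": "człowiek",
}


def synonym(word):
    """First synonym in alphabetical order; senses are not disambiguated.

    sorted() compares code points, which is only an approximation of
    Polish collation.
    """
    candidates = sorted(s for s in SYNONYMS.get(word, []) if s != word)
    return candidates[0] if candidates else None


def hypernym(word, n):
    """Follow the hypernym link n times; None if the chain is too short."""
    node = word
    for _ in range(n):
        node = HYPERNYM.get(node)
        if node is None:
            return None
    return node


def wordnet_features(token, max_n=3):
    word = token.lower()
    feats = {"synonym": synonym(word)}
    for n in range(1, max_n + 1):
        feats[f"hypernym-{n}"] = hypernym(word, n)
    return feats


if __name__ == "__main__":
    print(wordnet_features("Prezes"))
```

Because many different tokens share the same distant hypernym (here człowiek, 'human'), such features let the model generalize over words it has rarely or never seen.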


Morphological features

Morphological features are based on NER grammars that utilize morphological information [4]. The features are:

ctag: the complete tag with morphological information generated by TaKIPI,

part of speech, case, gender, number: enumeration types according to the tagset described in [5].

Example: a capitalized singular adjective after the word ulica (‘road’) might be a road name.
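A minimal sketch of how a complete tag might be split into the separate morphological features, and how the road-name example could be expressed as a check. It assumes colon-separated tags in the style of the IPI PAN tagset (for example `adj:sg:nom:f:pos`) with the value inventories listed in the code; the function names are hypothetical, and in the presented approach such cues are learned by the CRF from the features and are not hand-written rules.

```python
# Assumed value inventories for the IPI PAN-style tagset.
NUMBERS = {"sg", "pl"}
CASES = {"nom", "gen", "dat", "acc", "inst", "loc", "voc"}
GENDERS = {"m1", "m2", "m3", "f", "n"}


def morph_features(ctag):
    """Split a complete tag into part of speech, number, case and gender."""
    parts = ctag.split(":")
    feats = {"ctag": ctag, "pos": parts[0],
             "number": None, "case": None, "gender": None}
    # Identify attributes by value, so the attribute order does not matter.
    for value in parts[1:]:
        if value in NUMBERS:
            feats["number"] = value
        elif value in CASES:
            feats["case"] = value
        elif value in GENDERS:
            feats["gender"] = value
    return feats


def may_be_road_name(prev_base, orth, ctag):
    """The slide's example: capitalized singular adjective after 'ulica'."""
    f = morph_features(ctag)
    return (prev_base == "ulica" and orth[:1].isupper()
            and f["pos"] == "adj" and f["number"] == "sg")


if __name__ == "__main__":
    print(morph_features("adj:sg:nom:f:pos"))
    print(may_be_road_name("ulica", "Długa", "adj:sg:nom:f:pos"))
```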


Gazetteer-based features

One feature for every gazetteer. If a sequence of words is found in a gazetteer, the first word in the sequence is marked as B and the others as I.

5 features, one for each proper name category,

5 features, one for each list of key words, i.e.:

country prefix: a list of common words that can occur in a country name, e.g. republika (‘republic’) in ‘Czech Republic’,

person prefix: a list of positions and titles that can precede a person name (1774 words),

person suffix: a list of words that might appear directly after a person name (112 entries),

person noun: a list of expressions that can refer to people, e.g. profession names (6339 entries that were described as nouns denoting people in plWordNet),

road prefix: a list of words (full and short forms) that can precede a road name, e.g. ulica (‘street’), ul. (‘st.’). The list contains 14 entries.
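The B/I marking for a single gazetteer can be sketched as a greedy longest-match scan over the tokens. Several details here are assumptions not stated in the slides: matching is case-insensitive and done on word forms, the longest entry wins, tokens outside any match get the label `O`, and the gazetteer entries in the example are toy data.

```python
def gazetteer_feature(tokens, gazetteer, max_len=5):
    """Label each token B (first word of a match), I (continuation) or O."""
    entries = {tuple(entry.lower().split()) for entry in gazetteer}
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so multi-word entries win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


if __name__ == "__main__":
    countries = ["Republika Czeska", "Polska"]  # toy gazetteer
    tokens = ["Republika", "Czeska", "i", "Polska"]
    print(list(zip(tokens, gazetteer_feature(tokens, countries))))
```

Running the same function once per gazetteer yields one feature column per gazetteer, which matches the "one feature for every gazetteer" description.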


Baselines

The baselines were calculated for the best configuration from our previous experiments [3], i.e. Hidden Markov Model trained on words using character language model and rule-based filtering.

              10-fold CV   Cross-domain
              CSER         CPR       CEN
Precision     89.84%       67.49%    54.83%
Recall        89.66%       84.36%    76.95%
F1-measure    89.75%       74.99%    64.03%
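For reference, the F1-measure is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes the baseline F1 values from the reported precision and recall; the function and variable names are chosen for this illustration only.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline precision and recall in percent, as reported above.
BASELINES = {
    "CSER": (89.84, 89.66),
    "CPR": (67.49, 84.36),
    "CEN": (54.83, 76.95),
}

if __name__ == "__main__":
    for corpus, (p, r) in BASELINES.items():
        print(f"{corpus}: F1 = {f1(p, r):.2f}%")
```

Because the harmonic mean is dominated by the smaller value, the low cross-domain precision pulls the baseline F1 down even though its recall stays high.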


Single Domain Evaluation

10-fold Cross Validation on the Stock Exchange Corpus (CSER).

             road      surname   first name   country   city      Total

Orth feature for previous, current and next tokens + Filtering

Precision    97.76%    97.22%    98.61%       92.59%    95.46%    96.06%
Recall       71.58%    57.85%    61.03%       64.94%    79.57%    70.21%
F1           82.65%    72.54%    75.40%       76.34%    86.79%    81.13%

All features for previous, current and next tokens + Filtering

Precision    93.33%    95.78%    94.67%       81.22%    92.53%    92.07%
Recall       84.15%    87.60%    81.38%       86.15%    95.14%    89.47%
F1           88.51%    91.51%    87.52%       83.61%    93.82%    90.75%

All features for 3 preceding, current and 3 following tokens + Filtering

Precision    96.67%    97.85%    96.89%       89.67%    94.74%    95.20%
Recall       95.08%    87.88%    80.23%       82.68%    95.35%    90.00%
F1           95.87%    92.60%    87.77%       86.04%    95.04%    92.53%
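The difference between the narrow context (previous, current and next token) and the wide context (3 preceding, current and 3 following tokens) comes down to which token offsets each feature is read from. The sketch below generates a feature template in the CRF++ template syntax (`U<id>:%x[row,col]`) for a given window. The slides do not show the templates that were actually used, so the number of feature columns, the template IDs and the trailing bigram line `B` are assumptions.

```python
def crfpp_template(num_columns, window):
    """Build a CRF++ unigram template covering offsets -window..+window.

    num_columns is the number of feature columns in the training file
    (orth, base, pattern, ...); the value passed in is hypothetical.
    """
    lines = []
    idx = 0
    for col in range(num_columns):
        for offset in range(-window, window + 1):
            # %x[row,col]: row is relative to the current token.
            lines.append(f"U{idx:03d}:%x[{offset},{col}]")
            idx += 1
    lines.append("B")  # also condition on the previous output label
    return "\n".join(lines)


if __name__ == "__main__":
    print(crfpp_template(num_columns=2, window=1))  # narrow context
    print()
    print(crfpp_template(num_columns=2, window=3))  # wide context
```

With a window of 3 every feature column contributes seven templates where a window of 1 contributes three, so the wide context more than doubles the number of feature functions the model can fit.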


Cross Domain Evaluation (CEN)

The model was trained on CSER and tested on CEN.

Baseline: Precision 54.83%, Recall 76.95% and F1-measure 64.03%.

             road      surname   first name   country   city      Total

All features for previous, current and next token

Precision    71.43%    93.06%    96.57%       91.19%    79.91%    91.15%
Recall       16.13%    51.29%    58.98%       70.86%    55.71%    59.98%
F1           26.32%    66.13%    73.23%       79.75%    65.65%    72.35%

All features for wide context

Precision    62.50%    94.42%    97.05%       90.31%    80.87%    91.41%
Recall       16.13%    49.11%    57.06%       68.73%    50.84%    57.53%
F1           25.64%    64.61%    71.87%       78.06%    62.43%    70.62%


Cross Domain Evaluation (CPR)

The model was trained on CSER and tested on CPR.

Baseline: Precision 67.49%, Recall 84.36% and F1-measure 74.99%.

             road       surname   first name   country    city      Total

All features for previous, current and next token

Precision    100.00%    93.06%    93.89%       100.00%    89.05%    92.88%
Recall        50.00%    48.91%    50.75%        81.48%    63.87%    53.29%
F1            66.67%    64.11%    65.89%        89.80%    74.39%    67.72%

All features for 3 preceding, current and 3 following tokens + Filtering

Precision    100.00%    92.82%    94.08%       100.00%    95.69%    94.48%
Recall        54.76%    44.04%    47.75%        81.48%    58.12%    49.40%
F1            70.77%    59.74%    63.35%        89.80%    72.31%    64.88%


Conclusions

on a single domain the CRF with the extended set of features and wide context outperformed the baseline (F-measure: 92.53% vs 89.75%),

in the cross-domain evaluation on both corpora the near context performed better; the wide context tends to overfit,

in the cross-domain evaluation the final result for CEN improved from 64.03% to 72.35%, but for CPR it decreased by 7.27 percentage points (from 74.99% to 67.72%),

CRF did not obtain high recall in the cross-domain evaluation; a combination of multiple classifiers trained on different domains might be a solution.


On-line Demonstration

http://nlp.pwr.wroc.pl/inforex?page=ner


Ongoing and Future Works

Ongoing Works

developing a similarity function for proper names (Synat),

extending the dictionaries with the use of Guesser and corpora.

Future Works

to set up a web service for NER (Synat),

to annotate the corpora with an extended schema of proper names (56 categories) (Synat),

to extend the feature set with chunking information: maximum PN borders,

to develop a model for nested proper names.


References

[1] Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization, in M. A. Kłopotek, A. Przepiórkowski, A. T. Wierzchoń, and K. Trojanowski, editors, Recent Advances in Intelligent Information Systems, pp. 247–260, Academic Publishing House Exit (2009)

[2] CRF++: Yet Another CRF toolkit, http://crfpp.sourceforge.net/

[3] Marcińczuk, M., Piasecki, M.: Statistical Proper Name Recognition in Polish Economic Texts, to appear in Control and Cybernetics (2011)

[4] Piskorski, J.: Extraction of Polish named entities, in Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004 (ELR, 2004), pp. 313–316, Association for Computational Linguistics, Prague, Czech Republic (2004)

[5] Przepiórkowski, A.: The IPI PAN Corpus: Preliminary version, Institute of Computer Science, Polish Academy of Sciences, Warsaw (2004)
