Rich Set of Features for Proper Name Recognition in Polish Texts - presentation

Rich Set of Features for Proper Name Recognition in Polish Texts

Michał Marcińczuk, Michał Stanek, Maciej Piasecki and Adam Musiał

Wrocław University of Technology

11 June 2011

Project NEKST (Natively Enhanced Knowledge Sharing Technologies), co-financed by the Innovative Economy Programme, project POIG.01.01.02-14-013/09


Introduction

Scope:

recognition of Proper Names in Polish texts,

5 types of proper names: first names, surnames, and names of countries, cities and roads,

three corpora: Stock Exchange Reports (CSER), Police Reports (CPR) [1] and Economic News (CEN),

combination of a machine learning approach with manually created rules (filtering),

use of a rich set of features,

application of Conditional Random Fields [2].

The corpora are available at http://nlp.pwr.wroc.pl/inforex?page=download


Problem Statement

Recognition of proper names in Polish texts is more difficult than in English because of:

1 weakly constrained word order,

2 rich inflection,

and also:

3 evidence that a given expression is a proper name can appear in both its left and right context.

To solve the above problems (to some extent) we propose a rich set of features and the application of Conditional Random Fields, which can make use of these features.



Corpora

CSER: corpus of stock exchange reports, used in 10-fold cross-validation (CV),

CPR: corpus of police reports, used for cross-domain validation,

CEN: corpus of economic news, used for cross-domain validation.

Annotation statistics

PN category     CSER    CPR     CEN
first name       688    333    1097
surname          691    411    1517
country name     484     27    1695
city name       1849    191     657
road name        395     42      31
Total           4107   1004    4997



Orthographic features

orth: the word itself,

base: the morphological base form of the word,

n prefixes/n suffixes: the first/last n characters of the encountered word form, where n ∈ {1, 2, 3, 4}. Missing characters are replaced with ’ ’. Motivated by the observation that some groups of proper names have typical prefixes and/or endings,

pattern: encodes the pattern of the character sequence; one of: ALL UPPER, ALL LOWER, DIGITS, SYMBOLS, UPPER INIT, UPPER CAMEL CASE, LOWER CAMEL CASE, MIXED.
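The prefix/suffix and pattern features can be sketched roughly as follows. This is only an illustration: the padding character `_`, the function names `affixes` and `pattern`, and the exact rules that separate the pattern classes are assumptions, since the slides list only the feature names.

```python
PAD = "_"  # assumed padding character; the slides do not show which one was used


def affixes(word, max_n=4, pad=PAD):
    """Return the n-character prefixes and suffixes of a word for n = 1..max_n."""
    feats = {}
    for n in range(1, max_n + 1):
        # Words shorter than n are padded so every feature has exactly n characters.
        feats[f"prefix_{n}"] = word[:n].ljust(n, pad)
        feats[f"suffix_{n}"] = word[-n:].rjust(n, pad)
    return feats


def pattern(word):
    """Map a word to one of the eight pattern classes (assumed definitions)."""
    if word.isdigit():
        return "DIGITS"
    if not any(ch.isalnum() for ch in word):
        return "SYMBOLS"
    if word.isalpha():
        if word.isupper():
            return "ALL UPPER"
        if word.islower():
            return "ALL LOWER"
        if word[0].isupper() and word[1:].islower():
            return "UPPER INIT"
        if word[0].isupper():
            return "UPPER CAMEL CASE"
        return "LOWER CAMEL CASE"
    # Anything that mixes letters with digits or symbols.
    return "MIXED"


print(affixes("ul"))
print(pattern("Wrocław"), pattern("CSER"), pattern("2011"))
```

Fixed-length affixes keep the feature vocabulary small, which is presumably why short words are padded rather than skipped.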



Binary orthographic features

8 binary features, the feature is 1 if the condition is met, 0 otherwise:

1 (word) starts with an uppercase letter,

2 starts with a lower case letter,

3 starts with a symbol,

4 starts with a digit,

5 contains an upper case letter,

6 contains a lower case letter,

7 contains a symbol,

8 contains a digit.

The features are based on the filtering rules described in [3], e.g. first names start with an upper-case letter and do not contain symbols.

The binary features encode information at the level of single characters, while the pattern feature encodes a repeatable sequence of characters.
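The eight binary features can be computed with plain string predicates. The sketch below is illustrative only: the feature names, the definition of a "symbol" (any character that is neither alphanumeric nor whitespace) and the helper `looks_like_first_name` are assumptions, not taken from the slides.

```python
def is_symbol(ch):
    """Assumed definition: a symbol is neither alphanumeric nor whitespace."""
    return not ch.isalnum() and not ch.isspace()


def binary_features(word):
    """Return the eight binary orthographic features as 0/1 values."""
    first = word[:1]  # empty string for an empty word, so no IndexError
    return {
        "starts_upper": int(first.isupper()),
        "starts_lower": int(first.islower()),
        "starts_symbol": int(bool(first) and is_symbol(first)),
        "starts_digit": int(first.isdigit()),
        "has_upper": int(any(ch.isupper() for ch in word)),
        "has_lower": int(any(ch.islower() for ch in word)),
        "has_symbol": int(any(is_symbol(ch) for ch in word)),
        "has_digit": int(any(ch.isdigit() for ch in word)),
    }


def looks_like_first_name(word):
    """Filtering rule from the text: upper-case start and no symbols."""
    feats = binary_features(word)
    return feats["starts_upper"] == 1 and feats["has_symbol"] == 0


print(binary_features("Kowalski"))
print(looks_like_first_name("Jan"), looks_like_first_name("ul."))
```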



Wordnet-based features

synonym: the word’s synonym that comes first in alphabetical order among all of its synonyms in the Polish Wordnet. The sense of the word is not disambiguated,

hypernym n: a hypernym of the word at distance n.

Wordnet-based features are used to decrease the variety of observed words.

token    synonym          hypernym 1    hypernym 2                                hypernym 3

Pan      mężczyzna        dorosły       człowiek ze względu na wiek               człowiek
         (male)           (adult)       (person in specified age)                 (human)

Prezes   przewodniczący   głowa         człowiek ze względu na pełnioną funkcję   człowiek
         (chairman)       (head)        (person holding a position)               (human)

Zarząd   centrala         władza        grupa ludzi                               zbiór
         (head office)    (authority)   (group of people)                         (set)
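The lookup behind these features might look like the sketch below. It uses a tiny in-memory stand-in for the wordnet that is filled only with the "Pan" row of the table above. The dictionaries `SYNONYMS` and `HYPERNYM` and the function `wordnet_features` are hypothetical, and Python's default string ordering stands in for proper Polish alphabetical order.

```python
# Toy stand-in for the wordnet, populated from the "Pan" example only.
SYNONYMS = {"pan": ["mężczyzna"]}
HYPERNYM = {  # word -> its direct hypernym
    "pan": "dorosły",
    "dorosły": "człowiek ze względu na wiek",
    "człowiek ze względu na wiek": "człowiek",
}


def wordnet_features(base, max_distance=3):
    """Return the first synonym and the hypernyms at distance 1..max_distance."""
    feats = {}
    synonyms = SYNONYMS.get(base, [])
    # No sense disambiguation: take the alphabetically first synonym.
    feats["synonym"] = min(synonyms) if synonyms else None
    node = base
    for n in range(1, max_distance + 1):
        node = HYPERNYM.get(node) if node is not None else None
        feats[f"hypernym_{n}"] = node
    return feats


print(wordnet_features("pan"))
```

Words that are missing from the wordnet simply get `None` for every feature, so the CRF can still fall back on the orthographic features.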



Morphological features

Morphological features are based on NER grammars that utilize morphological information [4]. The features are:

ctag: the complete tag with morphological information generated by TaKIPI,

part of speech, case, gender, number: enumeration types according to the tagset described in [5].

Example: a capitalized singular adjective after the word ulica (‘road’) might be a road name.
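The example rule can be written as a check over tagged tokens. This is a hedged sketch: the token layout (dicts with `orth`, `base`, `pos`, `number`), the tag values `adj` and `sg`, and the function `road_name_candidates` are assumptions made for illustration. In the presented approach such regularities are meant to be learned by the CRF from the morphological features rather than hand-coded.

```python
def road_name_candidates(tokens):
    """Return capitalized singular adjectives that directly follow 'ulica'."""
    hits = []
    for prev, cur in zip(tokens, tokens[1:]):
        if (
            prev["base"] == "ulica"        # base form, so inflected 'ulicy' matches too
            and cur["pos"] == "adj"
            and cur["number"] == "sg"
            and cur["orth"][:1].isupper()
        ):
            hits.append(cur["orth"])
    return hits


# "na ulicy Długiej" ('on Długa street'), tagged by hand for the example.
sentence = [
    {"orth": "na", "base": "na", "pos": "prep", "number": None},
    {"orth": "ulicy", "base": "ulica", "pos": "subst", "number": "sg"},
    {"orth": "Długiej", "base": "długi", "pos": "adj", "number": "sg"},
]
print(road_name_candidates(sentence))
```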



Gazetteer-based features

One feature for every gazetteer. If a sequence of words is found in a gazetteer, the first word in the sequence is marked as B and the others as I.

5 features, one for every proper name category,

5 features, one for every list of key words, i.e.:

country prefix: a list of common words that can occur in a country name, e.g. republika (‘republic’) in ‘Czech Republic’,

person prefix: a list of positions and titles that can precede a person name (1774 words),

person suffix: a list of words that might appear directly after a person name (112 entries),

person noun: a list of expressions that can refer to people, e.g. profession names (6339 entries that were described as nouns denoting people in plWordNet),

road prefix: a list of words (full and short forms) that can precede a road name, e.g. ulica (‘street’), ul. (‘st.’). The list contains 14 entries.
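The B/I marking for a single gazetteer can be sketched as a greedy longest-match scan. The slides only state that the first matched word gets B and the rest get I; the label `O` for unmatched tokens, the longest-match strategy, exact case-sensitive matching on word forms, and the function name `gazetteer_feature` are all assumptions.

```python
def gazetteer_feature(tokens, gazetteer, outside="O"):
    """Label each token B, I or `outside` according to one gazetteer."""
    entries = {tuple(entry.split()) for entry in gazetteer}
    max_len = max((len(entry) for entry in entries), default=0)
    labels = [outside] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest possible match first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in entries:
                labels[i] = "B"
                labels[i + 1:i + length] = ["I"] * (length - 1)
                i += length
                break
        else:
            i += 1  # no entry starts at this token
    return labels


cities = {"Nowy Sącz", "Wrocław"}
tokens = ["Prezes", "Jan", "Kowalski", "odwiedził", "Nowy", "Sącz"]
print(gazetteer_feature(tokens, cities))
```

Running the same function once per gazetteer would yield the ten feature columns described above. Given Polish inflection, matching on base forms instead of word forms may be the better choice in practice.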


Baselines

The baselines were calculated for the best configuration from our previous experiments [3], i.e. a Hidden Markov Model trained on words, using a character language model and rule-based filtering.

             10-fold CV   Cross-domain
             CSER         CPR       CEN
Precision    89.84%       67.49%    54.83%
Recall       89.66%       84.36%    76.95%
F1-measure   89.75%       74.99%    64.03%
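As a sanity check, the F1-measure row can be reproduced from the precision and recall rows, since F1 is the harmonic mean of the two. The snippet below is a small illustration; the helper name `f1` and the `baseline` dictionary are not part of the original slides.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from the baseline table.
baseline = {
    "CSER": (89.84, 89.66),
    "CPR": (67.49, 84.36),
    "CEN": (54.83, 76.95),
}

for corpus, (p, r) in baseline.items():
    print(f"{corpus}: F1 = {f1(p, r):.2f}%")
```

The printed values agree with the reported 89.75%, 74.99% and 64.03%.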



Single Domain Evaluation

10-fold Cross Validation on the Stock Exchange Corpus (CSER).

            road     surname  first name  country  city     Total

Orth feature for previous, current and next tokens + Filtering

Precision   97.76%   97.22%   98.61%      92.59%   95.46%   96.06%
Recall      71.58%   57.85%   61.03%      64.94%   79.57%   70.21%
F1          82.65%   72.54%   75.40%      76.34%   86.79%   81.13%

All features for previous, current and next tokens + Filtering

F1          88.51%   91.51%   87.52%      83.61%   93.82%   90.75%

All features for 3 preceding, current and 3 following tokens + Filtering

F1          95.87%   92.60%   87.77%      86.04%   95.04%   92.53%
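The two context sizes compared above (one token on each side versus three) correspond to different feature windows. With CRF++ [2], such windows are written as template lines of the form `U<id>:%x[row,col]`, where `row` is the token offset and `col` is the feature column. The generator below is a sketch under the assumption that every feature column is used as a plain unigram feature at every offset; the actual templates used in the experiments are not shown in the slides, and the function name `crfpp_template` is hypothetical.

```python
def crfpp_template(n_columns, window):
    """Build a CRF++ feature template for a symmetric token window.

    n_columns: number of feature columns in the training file (label excluded)
    window:    number of tokens taken on each side of the current token
    """
    lines = []
    idx = 0
    for col in range(n_columns):
        for offset in range(-window, window + 1):
            lines.append(f"U{idx:03d}:%x[{offset},{col}]")
            idx += 1
    lines.append("B")  # bigram template: previous and current output label
    return "\n".join(lines)


# Near context for a single feature column (e.g. orth only).
print(crfpp_template(n_columns=1, window=1))
```

Calling it with `window=3` gives the wide-context variant, which has seven unigram features per column instead of three.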



Cross Domain Evaluation (CEN)

The model was trained on CSER and tested on CEN.

Baseline: Precision 54.83%, Recall 76.95% and F1-measure 64.03%.


All features for previous, current and next token


F1 26.32% 66.13% 73.23% 79.75% 65.65% 72.35%

All features for wide context


F1 25.64% 64.61% 71.87% 78.06% 62.43% 70.62%



Cross Domain Evaluation (CPR)

The model was trained on CSER and tested on CPR.

Baseline: Precision 67.49%, Recall 84.36% and F1-measure 74.99%.


All features for previous, current and next token


F1 66.67% 64.11% 65.89% 89.80% 74.39% 67.72%

All features for 3 preceding, current and 3 following tokens + Filtering


F1 70.77% 59.74% 63.35% 89.80% 72.31% 64.88%



Conclusions

on a single domain the CRF with the extended set of features and wide context outperformed the baseline (F-measure: 92.53% vs 89.75%),

in the cross-domain evaluation on both corpora the near context performed better; the wide context tends to overfit,

in the cross-domain evaluation the final result for CEN improved from 64.03% to 72.35%, but for CPR it decreased by 7.27 percentage points (from 74.99% to 67.72%),

CRF did not obtain high recall in the cross-domain evaluation; a combination of multiple classifiers trained on different domains might be a solution.



On-line Demonstration

http://nlp.pwr.wroc.pl/inforex?page=ner



Ongoing and Future Works

Ongoing Works:

developing a similarity function for proper names (Synat),

extending the dictionaries with the use of Guesser and corpora.

Future Works:

to set up a web service for NER (Synat),

to annotate the corpora with an extended schema of proper names (56 categories) (Synat),

to extend the feature set with chunking information (maximum PN borders),

to develop a model for nested proper names.

to develop a model for nested proper names.


References

[1] Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization, in M. A. Kłopotek, A. Przepiórkowski, A. T. Wierzchoń, and K. Trojanowski, editors, Recent Advances in Intelligent Information Systems, pp. 247–260, Academic Publishing House Exit (2009)

[2] CRF++: Yet Another CRF toolkit, http://crfpp.sourceforge.net/

[3] Marcińczuk, M., Piasecki, M.: Statistical Proper Name Recognition in Polish Economic Texts, to appear in Control and Cybernetics (2011)

[4] Piskorski, J.: Extraction of Polish named entities, in Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004 (ELR, 2004), pp. 313–316, Association for Computational Linguistics, Prague, Czech Republic (2004)

[5] Przepiórkowski, A.: The IPI PAN Corpus: Preliminary version, Institute of Computer Science, Polish Academy of Sciences, Warsaw (2004)

