Rich Set of Features for Proper Name Recognition in Polish Texts - presentation
Rich Set of Features for Proper Name Recognition in Polish Texts
Michał Marcińczuk, Michał Stanek, Maciej Piasecki and Adam Musiał
Wrocław University of Technology
11 June 2011
Project NEKST (Natively Enhanced Knowledge Sharing Technologies), co-financed by the Innovative Economy Programme, project POIG.01.01.02-14-013/09
Scope » Introduction
Introduction
Scope:
recognition of Proper Names in Polish texts,
5 types of proper names: first names, surnames, names of countries, cities and roads,
three corpora: Stock Exchange Reports (CSER), Police Reports (CPR) [1] and Economic News (CEN),
combination of a machine learning approach with manually created rules (filtering),
utilization of a rich set of features,
application of Conditional Random Fields [2].
The corpora are available at http://nlp.pwr.wroc.pl/inforex?page=download
M. Marcińczuk, M. Stanek, M. Piasecki and A. Musiał, 11 June 2011
Scope » Introduction
Problem Statement
Recognition of proper names in Polish texts (compared to English) is difficult because of:
1. weakly constrained word order,
2. rich inflection,
and also:
3. premises that a given language expression is a proper name can appear in both the left and the right context.
To solve the above problems (to some extent) we propose a rich set of features and the application of Conditional Random Fields, which can make use of those features.
Resources » Corpora
Corpora
CSER — corpus of stock exchange reports, used in 10-fold CV,
CPR — corpus of police reports, used in cross-domain validation,
CEN — corpus of economic news, used in cross-domain validation.
Annotation statistics
| PN category | CSER | CPR | CEN |
|---|---|---|---|
| first name | 688 | 333 | 1097 |
| surname | 691 | 411 | 1517 |
| country name | 484 | 27 | 1695 |
| city name | 1849 | 191 | 657 |
| road name | 395 | 42 | 31 |
| Total | 4107 | 1004 | 4997 |
Resources » Orthographic features
Orthographic features
orth — the word itself,
base — the morphological base form of the word,
n prefixes/n suffixes — the n first/last characters of the encountered word form, where n ∈ {1, 2, 3, 4}; missing characters are replaced with ' '. Motivated by the observation that some groups of proper names have typical prefixes and/or endings.
pattern — encodes the pattern of the character sequence; one of: ALL_UPPER, ALL_LOWER, DIGITS, SYMBOLS, UPPER_INIT, UPPER_CAMEL_CASE, LOWER_CAMEL_CASE, MIXED.
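The affix and pattern features above could be sketched as follows (the function names, the '_' padding character, and the exact order of the pattern checks are assumptions; the slide gives no implementation):

```python
def affix_features(word, n_values=(1, 2, 3, 4), pad="_"):
    """n first/last characters of the word form; missing characters
    are replaced with a padding character (assumed here to be '_')."""
    feats = {}
    for n in n_values:
        feats[f"prefix_{n}"] = (word + pad * max(0, n - len(word)))[:n]
        feats[f"suffix_{n}"] = (pad * max(0, n - len(word)) + word)[-n:]
    return feats

def pattern(word):
    """Encode the character-sequence pattern of a non-empty token."""
    if word.isdigit():
        return "DIGITS"
    if word.isupper():
        return "ALL_UPPER"
    if word.islower():
        return "ALL_LOWER"
    if not any(c.isalnum() for c in word):
        return "SYMBOLS"
    if word[0].isupper() and word[1:].islower():
        return "UPPER_INIT"
    if word[0].isupper() and any(c.isupper() for c in word[1:]):
        return "UPPER_CAMEL_CASE"
    if word[0].islower() and any(c.isupper() for c in word[1:]):
        return "LOWER_CAMEL_CASE"
    return "MIXED"
```

For example, `pattern("Wrocław")` yields `UPPER_INIT` and `pattern("plWordNet")` yields `LOWER_CAMEL_CASE`; note that Python's Unicode-aware `str` methods handle Polish diacritics correctly.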
Resources » Binary orthographic features
Binary orthographic features
8 binary features, the feature is 1 if the condition is met, 0 otherwise:
1 (word) starts with an uppercase letter,
2 starts with a lower case letter,
3 starts with a symbol,
4 starts with a digit,
5 contains an upper case letter,
6 contains a lower case letter,
7 contains a symbol,
8 contains a digit.
The features are based on the filtering rules described in [3], e.g. first names start with an upper case letter and do not contain symbols.
The binary features encode information on the level of single characters, while the aim of the pattern feature is to encode a repeatable sequence of characters.
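The eight conditions above could be computed as in this sketch (the function name is hypothetical; a "symbol" is assumed to mean any non-alphanumeric character, and tokens are assumed non-empty):

```python
def binary_orthographic_features(word):
    """8 binary features; each is 1 if the condition is met, 0 otherwise."""
    is_symbol = lambda c: not c.isalnum()  # assumption: symbol = non-alphanumeric
    return [
        int(word[0].isupper()),               # 1 starts with an upper case letter
        int(word[0].islower()),               # 2 starts with a lower case letter
        int(is_symbol(word[0])),              # 3 starts with a symbol
        int(word[0].isdigit()),               # 4 starts with a digit
        int(any(c.isupper() for c in word)),  # 5 contains an upper case letter
        int(any(c.islower() for c in word)),  # 6 contains a lower case letter
        int(any(is_symbol(c) for c in word)), # 7 contains a symbol
        int(any(c.isdigit() for c in word)),  # 8 contains a digit
    ]
```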
Resources » Wordnet-based features
Wordnet-based features
synonym — the word's synonym that is first in alphabetical order among all the word's synonyms in Polish WordNet; the sense of the word is not disambiguated,
hypernym n — a hypernym of the word at a distance of n.
Wordnet-based features are used to decrease the variety of observed words.
| token | synonym | hypernym 1 | hypernym 2 | hypernym 3 |
|---|---|---|---|---|
| Pan | mężczyzna ('male') | dorosły ('adult') | człowiek ze względu na wiek ('person in specified age') | człowiek ('human') |
| Prezes | przewodniczący ('chairman') | głowa ('head') | człowiek ze względu na pełnioną funkcję ('person holding a position') | człowiek ('human') |
| Zarząd | centrala ('head office') | władza ('authority') | grupa ludzi ('group of people') | zbiór ('set') |
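The two features could be sketched over a toy stand-in lexicon (the dictionaries and function names below are hypothetical illustrations, not the plWordNet API):

```python
# Toy stand-in for plWordNet; the real features query the wordnet itself.
SYNONYMS = {"prezes": ["przewodniczący", "szef"]}          # hypothetical synset members
HYPERNYMS = {"prezes": "głowa", "głowa": "człowiek"}        # hypothetical hypernym links

def synonym_feature(base):
    """Alphabetically first synonym; the word sense is not disambiguated."""
    candidates = SYNONYMS.get(base, [])
    return min(candidates) if candidates else base

def hypernym_feature(base, n):
    """Hypernym of the word at distance n (None if the chain is shorter)."""
    node = base
    for _ in range(n):
        node = HYPERNYMS.get(node)
        if node is None:
            return None
    return node
```

Mapping many tokens onto a shared synonym or hypernym is what decreases the variety of observed words, as stated above.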
Resources » Morphological features
Morphological features
Morphological features are based on NER grammars that utilize morphological information [4]. The features are:
ctag — the complete tag with morphological information generated by TaKIPI,
part of speech, case, gender, number — enumeration types according to the tagset described in [5].
Example: a capitalized singular adjective after the word ulica ('street') might be a road name.
Resources » Gazetteer-based features
Gazetteer-based features
One feature for every gazetteer. If a sequence of words is found in a gazetteer, the first word in the sequence is marked B and the others I.
5 for every proper name category,
5 for every list of key words, i.e.:
country prefix — a list of common words that can occur in a country name, e.g. republika ('republic') in 'Czech Republic',
person prefix — a list of positions and titles that can precede a person name (1774 words),
person suffix — a list of words that might appear directly after a person name (112 entries),
person noun — a list of expressions that can refer to people, e.g. profession names (6339 entries that were described as nouns denoting people in plWordNet),
road prefix — a list of words (full and short forms) that can precede a road name, e.g. ulica ('street'), ul. ('st.'). The list contains 14 entries.
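The B/I marking of gazetteer matches could be sketched as follows (the function name and the longest-match-first strategy are assumptions; the slide does not specify how overlapping matches are resolved):

```python
def gazetteer_feature(tokens, gazetteer):
    """Mark gazetteer matches over a token sequence: the first word of a
    matched sequence gets B, the following words get I, the rest get O."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Find the longest gazetteer entry starting at position i.
        match_len = 0
        for j in range(len(tokens), i, -1):
            if tuple(tokens[i:j]) in gazetteer:
                match_len = j - i
                break
        if match_len:
            labels[i] = "B"
            for k in range(i + 1, i + match_len):
                labels[k] = "I"
            i += match_len
        else:
            i += 1
    return labels
```

For instance, with a road-prefix gazetteer containing the entry ("ul", "."), the tokens ["ul", ".", "Legnicka"] would receive the labels ["B", "I", "O"] from that gazetteer.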
Evaluation » Baselines
Baselines
The baselines were calculated for the best configuration from our previous experiments [3], i.e. a Hidden Markov Model trained on words using a character language model, with rule-based filtering.
| | 10-fold CV (CSER) | Cross-domain (CPR) | Cross-domain (CEN) |
|---|---|---|---|
| Precision | 89.84% | 67.49% | 54.83% |
| Recall | 89.66% | 84.36% | 76.95% |
| F1-measure | 89.75% | 74.99% | 64.03% |
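The reported F1-measure is the harmonic mean of precision and recall, which can be checked directly against the baseline figures:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given as percentages)."""
    return 2 * precision * recall / (precision + recall)

# Baseline scores from the table above:
f1(89.84, 89.66)  # CSER, ~89.75
f1(67.49, 84.36)  # CPR,  ~74.99
f1(54.83, 76.95)  # CEN,  ~64.03
```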
Evaluation » Single Domain Evaluation
Single Domain Evaluation
10-fold Cross Validation on the Stock Exchange Corpus (CSER).
Orth feature for previous, current and next tokens + Filtering

| | road | surname | first name | country | city | Total |
|---|---|---|---|---|---|---|
| Precision | 97.76% | 97.22% | 98.61% | 92.59% | 95.46% | 96.06% |
| Recall | 71.58% | 57.85% | 61.03% | 64.94% | 79.57% | 70.21% |
| F1 | 82.65% | 72.54% | 75.40% | 76.34% | 86.79% | 81.13% |

All features for previous, current and next tokens + Filtering

| | road | surname | first name | country | city | Total |
|---|---|---|---|---|---|---|
| Precision | 93.33% | 95.78% | 94.67% | 81.22% | 92.53% | 92.07% |
| Recall | 84.15% | 87.60% | 81.38% | 86.15% | 95.14% | 89.47% |
| F1 | 88.51% | 91.51% | 87.52% | 83.61% | 93.82% | 90.75% |

All features for 3 preceding, current and 3 following tokens + Filtering

| | road | surname | first name | country | city | Total |
|---|---|---|---|---|---|---|
| Precision | 96.67% | 97.85% | 96.89% | 89.67% | 94.74% | 95.20% |
| Recall | 95.08% | 87.88% | 80.23% | 82.68% | 95.35% | 90.00% |
| F1 | 95.87% | 92.60% | 87.77% | 86.04% | 95.04% | 92.53% |
Evaluation » Cross Domain Evaluation (CEN)
Cross Domain Evaluation (CEN)
The model was trained on CSER and tested on CEN.
Baseline: Precision 54.83%, Recall 76.95%, F1-measure 64.03%.
All features for previous, current and next token

| | road | surname | first name | country | city | Total |
|---|---|---|---|---|---|---|
| Precision | 71.43% | 93.06% | 96.57% | 91.19% | 79.91% | 91.15% |
| Recall | 16.13% | 51.29% | 58.98% | 70.86% | 55.71% | 59.98% |
| F1 | 26.32% | 66.13% | 73.23% | 79.75% | 65.65% | 72.35% |

All features for wide context

| | road | surname | first name | country | city | Total |
|---|---|---|---|---|---|---|
| Precision | 62.50% | 94.42% | 97.05% | 90.31% | 80.87% | 91.41% |
| Recall | 16.13% | 49.11% | 57.06% | 68.73% | 50.84% | 57.53% |
| F1 | 25.64% | 64.61% | 71.87% | 78.06% | 62.43% | 70.62% |
Evaluation » Cross Domain Evaluation (CPR)
Cross Domain Evaluation (CPR)
The model was trained on CSER and tested on CPR.
Baseline: Precision 67.49%, Recall 84.36%, F1-measure 74.99%.
All features for previous, current and next token

| | road | surname | first name | country | city | Total |
|---|---|---|---|---|---|---|
| Precision | 100.00% | 93.06% | 93.89% | 100.00% | 89.05% | 92.88% |
| Recall | 50.00% | 48.91% | 50.75% | 81.48% | 63.87% | 53.29% |
| F1 | 66.67% | 64.11% | 65.89% | 89.80% | 74.39% | 67.72% |

All features for 3 preceding, current and 3 following tokens + Filtering

| | road | surname | first name | country | city | Total |
|---|---|---|---|---|---|---|
| Precision | 100.00% | 92.82% | 94.08% | 100.00% | 95.69% | 94.48% |
| Recall | 54.76% | 44.04% | 47.75% | 81.48% | 58.12% | 49.40% |
| F1 | 70.77% | 59.74% | 63.35% | 89.80% | 72.31% | 64.88% |
Summary » Conclusions
Conclusions
on a single domain, the CRF with the extended feature set and wide context outperformed the baseline (F1-measure: 92.53% vs 89.75%),
in the cross-domain evaluation on both corpora the near context performed better; the wide context tends to overtrain,
in the cross-domain evaluation on CEN the final result improved from 64.03% to 72.35%, but on CPR it decreased by 7.27%,
the CRF did not obtain high recall in the cross-domain evaluation; a combination of multiple classifiers trained on different domains might be a solution.
Summary » On-line Demonstration
On-line Demonstration
http://nlp.pwr.wroc.pl/inforex?page=ner
Summary » Ongoing and Future Works
Ongoing and Future Works
Ongoing Works:
developing a similarity function for proper names (Synat),
extending the dictionaries with the use of Guesser and corpora.
Future Works:
to set up a web service for NER (Synat),
to annotate the corpora with an extended schema of proper names (56 categories) (Synat),
to extend the feature set with chunking information — maximum PN borders,
to develop a model for nested proper names.
References » Main papers
[1] Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization, in M. A. Kłopotek, A. Przepiórkowski, A. T. Wierzchoń, and K. Trojanowski, editors, Recent Advances in Intelligent Information Systems, pp. 247–260, Academic Publishing House Exit (2009)
[2] CRF++: Yet Another CRF toolkit, http://crfpp.sourceforge.net/
[3] Marcińczuk, M., Piasecki, M.: Statistical Proper Name Recognition in Polish Economic Texts, to appear in Control and Cybernetics (2011)
[4] Piskorski, J.: Extraction of Polish named entities, in Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313–316 (2004)
[5] Przepiórkowski, A.: The IPI PAN Corpus: Preliminary version, Institute of Computer Science, Polish Academy of Sciences, Warsaw (2004)