Rich Set of Features for Proper Name Recognition in Polish Texts - extended abstract



Michał Marcińczuk, Michał Stanek, Maciej Piasecki, and Adam Musiał

Wrocław University of Technology, Wrocław, Poland

1 Introduction

Statistical recognition of named entities (NEs) in Polish is more difficult than in English, mainly because of the less constrained word order and richer morphology.

As a result, the number of word sequences corresponding to a multi-word NE is relatively high, and a complex statistical model is required to cover this diversity. The level of diversity can be reduced by combining raw observations with some kind of generalization, i.e. by introducing several levels of granularity into the text representation. Experiments with Hidden Markov Models (HMM) showed, cf. [1], that HMM often makes wrong decisions because the important premises appear in the close, but right, context, which is not available to HMM. This problem also seems to be more serious for languages with less constrained word order than for English. Sample errors are: “siedziba w Nowym Sadzie w Republice Serbii” (‘office in Nowy Sad in the Republic of Serbia’), where HMM recognised Republice Serbii as a country name, but not Nowym Sadzie as a city name, although a more sophisticated model could recognise that Nowym Sadzie is a city name from the left and right context taken together; and “Elektroproizvodnja -ZPUE D.O.O.” (‘D.O.O.’ is an abbreviation for a limited liability company in Serbian), where HMM recognised Elektroproizvodnja as a road name, although D.O.O. in the right context indicates a company name.

Our goal is to develop a general method for NE recognition (NER) for Polish. Because the resources for this task in Polish are limited and still under intensive development (e.g. [1, 5]), we have limited our scope to five categories of proper names: first names, surnames, and names of countries, cities and roads. We aim at processing non-literary texts (newspaper articles, reports, brochures, etc.).

2 Approach

We have defined a set of 34 features that are used to describe a word occurrence in the sequence. The features are:

1. Orthographic features: word form, morphological base form, n-character prefixes and suffixes, as well as selected patterns of characters (8 categories, e.g. all upper, digits, upper init, etc.).

2. Binary orthographic features: indicating the presence of given characters in the word; based on filtering rules from [1].




3. Wordnet-based features: reducing the observation diversity; based on synonyms from plWordNet and hypernyms of the word within the distance of n.

4. Morphological features: the complete tag (according to the IPI PAN tagset [4]) as disambiguated by TaKIPI, and selected attributes (part of speech, case, gender, number).

5. Gazetteer-based features: one feature for every gazetteer. If a sequence of words is found in a gazetteer, the first word in the sequence is marked as B and the others as I; O is assigned to words not covered by the entries.
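The orthographic and gazetteer-based features from the list above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the four pattern categories shown (the paper uses 8), the longest-match lookup strategy and the matching on surface forms rather than base forms are all assumptions made for illustration.

```python
def orthographic_features(word, n_max=3):
    """Return simple orthographic features for one word form (hypothetical helper)."""
    feats = {"form": word}
    # n-character prefixes and suffixes, for n = 1..n_max
    for n in range(1, n_max + 1):
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    # A few character patterns; the category names are assumptions.
    if word.isupper():
        feats["pattern"] = "all_upper"
    elif word.isdigit():
        feats["pattern"] = "digits"
    elif word[:1].isupper():
        feats["pattern"] = "upper_init"
    else:
        feats["pattern"] = "other"
    return feats


def gazetteer_feature(tokens, gazetteer):
    """Mark tokens covered by gazetteer entries with B/I and all others with O.

    `gazetteer` is a set of token tuples. Longest match first is an assumption.
    """
    tags = ["O"] * len(tokens)
    max_len = max((len(entry) for entry in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate sequence starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in gazetteer:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


tokens = "siedziba w Nowym Sadzie w Republice Serbii".split()
cities = {("Nowym", "Sadzie")}  # toy gazetteer, for illustration only
city_tags = gazetteer_feature(tokens, cities)
print(list(zip(tokens, city_tags)))
print(orthographic_features("Nowym"))
```

In this sketch each gazetteer yields its own B/I/O feature column, so a token may be marked B by the city gazetteer and O by the surname gazetteer at the same time.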

Conditional Random Fields (CRF) is a modern machine learning method applied successfully to labelling sequence data in many NLP tasks, e.g. shallow parsing or NER. CRF-based NER models outperform HMM models because CRF can utilise additional context features, which encode observations, in a non-linear manner. CRF is able to analyse a much broader context than HMM-based methods and to utilise features encoding both preceding and following observations. Our goal is to improve NER with respect to the problems identified earlier. We aim at examining a new set of features and their influence on NER. We do not address the problems of CRF learning algorithms and normalization factors; instead, we apply the parameters typically used for English [2] and the state-of-the-art stochastic gradient descent learning method [6].
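How features of preceding and following observations can be made available at each position may be sketched as follows. This is only an illustration of the idea of a context window; the function name, the offset-prefixed key format and the padding marker are assumptions and do not reflect the feature templates actually used in the experiments.

```python
def window_features(token_feats, i, width=1):
    """Collect the features of tokens i-width .. i+width, keyed by relative offset.

    `token_feats` is a list with one feature dict per token (hypothetical format).
    Positions outside the sentence are marked with a padding feature.
    """
    out = {}
    for offset in range(-width, width + 1):
        j = i + offset
        if 0 <= j < len(token_feats):
            for name, value in token_feats[j].items():
                out[f"{offset:+d}:{name}"] = value
        else:
            out[f"{offset:+d}:pad"] = True
    return out


tokens = "siedziba w Nowym Sadzie w Republice Serbii".split()
feats = [{"form": t} for t in tokens]

close = window_features(feats, 2, width=1)  # previous, current and next token
wide = window_features(feats, 3, width=3)   # 3 preceding, current, 3 following
print(close)
print(len(wide))
```

With `width=1` the description of "Nowym" includes the following token "Sadzie", i.e. the right context that a left-to-right HMM does not see when it labels "Nowym".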

3 Experiments

In the experiments we used three corpora: a corpus of stock exchange reports (CSER), a corpus of police reports (CPR) and a corpus of economic news (CEN), see [1]. In the single-domain evaluation we performed 10-fold cross-validation on the revised CSER. Due to the changes introduced in CSER we had to repeat the baseline experiments, see Table 1. The best configurations from [1] were applied. In 10-fold HMM all folds were used and HMM was combined with re-scoring based on heuristics and gazetteers. HMM + post is a cross-validation on folds 6–10 for HMM with re-scoring and rule-based post-processing. Table 1 shows that the correction of errors slightly improved the results. The baseline result for HMM is F1 = 89.75%. The single-domain evaluation of CRF was performed on the revised CSER (Table 2). CRF with the extended set of features and close context (previous, current and next token) obtained nearly the same level of recall as HMM but a higher precision of 92.07%, achieving F1 = 90.75%, which is better than that of HMM. The highest results were obtained for the wide context (3 preceding, current and 3 following tokens), i.e. F1 = 92.53% with a high precision of 95.20%. This confirms that the discriminative information appears in a wider context.
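As a worked check, and assuming that F1 denotes the standard balanced F-measure (the harmonic mean of precision P and recall R), the totals reported for CRF in Table 2 can be reproduced from the reported precision and recall:

```latex
F_1 = \frac{2PR}{P + R}

% close context (previous, current and next token)
F_1 = \frac{2 \cdot 92.07 \cdot 89.47}{92.07 + 89.47} \approx 90.75\%

% wide context (3 preceding, current and 3 following tokens)
F_1 = \frac{2 \cdot 95.20 \cdot 90.00}{95.20 + 90.00} \approx 92.53\%
```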

Table 1. Baseline evaluation on CSER.

            CSER                       Revised CSER
            10-fold HMM   HMM + post   10-fold HMM   HMM + post
Precision   83.55%        85.28%       88.69%        89.84%
Recall      89.70%        88.56%       90.68%        89.66%
F1          86.52%        86.88%       89.67%        89.75%

Table 2. Results of cross-validation of CRF on the CSER dataset.

            road     surname  first name  country  city     Total
All features for previous, current and next token + Filtering
Precision   93.33%   95.78%   94.67%      81.22%   92.53%   92.07%
Recall      84.15%   87.60%   81.38%      86.15%   95.14%   89.47%
F1          88.51%   91.51%   87.52%      83.61%   93.82%   90.75%
All features for wide context + Filtering
Precision   96.67%   97.85%   96.89%      89.67%   94.74%   95.20%
Recall      95.08%   87.88%   80.23%      82.68%   95.35%   90.00%
F1          95.87%   92.60%   87.77%      86.04%   95.04%   92.53%

To investigate the generality of the CRF model we evaluated it on cross-domain corpora. We trained the CRF model on CSER using the feature templates that achieved the best result in the cross-validation on CSER, and then applied it to CEN and CPR. Due to the changes in CSER the experiments from [1] were repeated using the same best HMM configuration and used as a baseline: for CPR, precision (P) of 67.49%, recall (R) of 84.36% and F1 of 74.99%; for CEN, P = 54.83%, R = 76.95% and F1 = 64.03%. Results of the cross-domain evaluation on CPR are presented in Table 3: F1 = 67.72% is lower than that of HMM by 7.27 percentage points. However, the CRF precision of 92.88% is significantly better than that of HMM. Applying the wider context improved precision (by ca. 2%) but reduced recall (by ca. 4%), i.e. the wider context tends to overtrain the model. Next, we tested the model on the other corpus, CEN. The results achieved on CEN are presented in Table 4: F1 increased by 8.32 percentage points. On both corpora CRF achieved very high overall precision. The worst results for CRF were achieved for the recognition of person names. The wider context improved precision at the cost of recall.
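The cross-domain comparison can be retraced with a short Python sketch. It assumes that F1 is the harmonic mean of precision and recall and uses only figures reported in the text and in Table 3; the variable names are illustrative. Because the harmonic mean is pulled towards the smaller of the two values, the high CRF precision on CPR cannot compensate for its low recall.

```python
def f1(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (both in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# HMM baselines (P and R as reported in the text).
hmm_cpr = round(f1(67.49, 84.36), 2)
hmm_cen = round(f1(54.83, 76.95), 2)

# CRF with close context: total P and R on CPR as reported in Table 3.
crf_cpr = round(f1(92.88, 53.29), 2)
# For CEN only the total F1 value is available, so it is used directly.
crf_cen = 72.35

# Differences in percentage points (CRF minus HMM).
delta_cpr = round(crf_cpr - hmm_cpr, 2)
delta_cen = round(crf_cen - hmm_cen, 2)

print(f"CPR: HMM {hmm_cpr}, CRF {crf_cpr}, difference {delta_cpr}")
print(f"CEN: HMM {hmm_cen}, CRF {crf_cen}, difference {delta_cen}")
```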

Table 3. Cross-domain evaluation on CPR.


All features for previous, current and next token

Precision   100.00%   93.06%   93.89%   100.00%   89.05%   92.88%
Recall       50.00%   48.91%   50.75%    81.48%   63.87%   53.29%
F1           66.67%   64.11%   65.89%    89.80%   74.39%   67.72%

All features for wide context


F1 70.77% 59.74% 63.35% 89.80% 72.31% 64.88%




Table 4. Cross-domain evaluation on CEN.


All features for current, next and previous token


F1 26.32% 66.13% 73.23% 79.75% 65.65% 72.35%

All features for wide context


F1 25.64% 64.61% 71.87% 78.06% 62.43% 70.62%

4 Summary

In the paper we presented some limitations of HMM in the task of NE recognition, i.e. the problems of encoding generalizations of linguistic information and of modelling contextual information from a two-sided context. To overcome these two limitations we applied CRF, a modern method for sequence labelling, to a rich set of features that are based on linguistic observations and used to reduce the observation diversity. In the single-domain cross-validation CRF outperformed HMM: CRF obtained an F-measure of 92.53%, while HMM reached only 89.75%. In the cross-domain evaluation we trained the model on CSER and evaluated it on CPR and CEN. On both corpora we observed the same effect: the precision increased but the recall decreased. In the case of CEN the final result improved from 64.03% to 72.35%, but for CPR it decreased by 7.27 percentage points. The cross-domain evaluation has shown that CRF models are capable of fitting the training data very well. Unfortunately, CRF did not obtain high recall.

References

1. Marcińczuk, M., Piasecki, M.: Statistical Proper Name Recognition in Polish Economic Texts, to appear in Control and Cybernetics (2011)

2. Peng, F., McCallum, A.: Accurate Information Extraction from Research Papers Using Conditional Random Fields, in HLT-NAACL, pp. 329–336 (2004)

3. Piskorski, J.: Extraction of Polish named entities, in Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004 (ELR, 2004), pp. 313–316, ACL, Prague, Czech Republic (2004)

4. Przepiórkowski, A.: The IPI PAN Corpus: Preliminary version, Institute of Computer Science, Polish Academy of Sciences, Warsaw (2004)

5. Savary, A., Waszczuk, J., Przepiórkowski, A.: Towards the Annotation of Named Entities in the National Corpus of Polish, in LREC 2010 proceedings (2010)

6. Vishwanathan, S., Schraudolph, N., Schmidt, M., Murphy, K.: Accelerated training of conditional random fields with stochastic gradient methods, in Proceedings of ICML ’06, pp. 969–976, ACM, New York, NY, USA (2006)
