
CLARIN-PL

Language Technology for Polish in Practice Systems supporting development of resources

Maciej Piasecki, Marek Maziarz, Michał Marcińczuk, Marcin Oleksy

Wrocław University of Science and Technology G4.19 Research Group

2017-01-17

Systems supporting development

§ Inforex
  § system for corpus construction, editing and verification
  § linguistic team management

§ Complex system of tools for corpus-based, semi-automatic wordnet development
  § Morpho-syntactic preprocessing
  § Extraction of Multiword Expressions with MeWeX (Maziarz et al., 2015), (Piasecki et al., 2015)
  § SuperMatrix – extraction of lemmas and statistics from corpora, and Measures of Semantic Relatedness (Broda & Piasecki, 2013)
  § LexCSD – identification and extraction of usage examples (Broda & Piasecki, 2011)
  § Corpus browsing, e.g. NoSketch https://nlp.fi.muni.cz/trac/noske
  § WordnetLoom 2.0 – wordnet editing, verification, group work (Piasecki et al., 2013b)
  § WordnetWeaver – semi-automatic wordnet expansion (Piasecki et al., 2013a)

Samsung R&D Institute

Invit. Lecture 2017-01-17

plWordNet development process

plWordNet Corpus 7.0: 2 billion tokens

plWordNet Corpus – a merged corpus:
• available Polish corpora:
  • Corpus IPI PAN
  • Rzeczpospolita Corpus
  • Wikipedia (2015)
• Texts on open licence
• Texts collected from the Internet
  • larger texts
  • Max. 20% tokens not recognised by Morfeusz
• Version 7.0: ~2 billion tokens
• Version 10.0: >4 billion tokens (for plWordNet 4.0)




Cf (Maziarz et al., 2013)



List of entries (most frequent lemmas)



Identification of meanings

software tools: corpus concordancer (NoSketch Engine, Inforex), automated extraction of usage examples

Usage examples for kąsać

n.a. - usage examples -> identification of meanings, typical examples, 10 meanings (Marek)

‘of animals: to bite with the teeth, causing wounds’
‘of weather phenomena (e.g. frost): to bite, to nip’
‘of insects: to bite’
‘of worries, pangs of conscience: to gnaw’
‘of people: to pester, to harm someone’



plWordNet Team guidelines

dictionaries, encyclopaedias, lexicons…

defining lexical units


relation assignment = linking to network

WordnetWeaver









Measure of Semantic Relatedness


Measure of Semantic Relatedness: results (generated by SuperMatrix)

Legend: antonym, hypernym, hyponym, co-hyponym, closely related, holonym

(Piasecki & Wendelberger, 2014)



+ stylistic register + gloss + usage examples

concordancer, extracted usage examples, WordnetWeaver, Measure of Semantic Relatedness



Identification of meanings

• Intuition: linguist, team
• But controlled by:
  • guidelines
  • and substitution tests



Definition of relations

Substitution Test for Hypernymy

Condition: The stylistic register of Y must not be lower in the register hierarchy than the register of X.

Testing expressions:
If she/it is X, then she/it must be Y
If she/it is Y, then she/it need not be X
If she/it is not Y, then she/it cannot be X

(Maziarz, Piasecki, Szpakowicz, Rabiega-Wiśniewska 2010)



Definition of relations

Applying Hypernymy Test to a Pair

Condition: Both ocean ‘ocean’ and zbiornik wodny ‘water basin’ are of the general stylistic register.

Testing expressions:
If she/it is oceanem ‘ocean’, then she/it must be zbiornikiem wodnym ‘water basin’
If she/it is zbiornikiem wodnym ‘water basin’, then she/it need not be oceanem ‘ocean’
If she/it is not zbiornikiem wodnym ‘water basin’, then she/it cannot be oceanem ‘ocean’
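The test frames above can be filled mechanically for any candidate pair. The following is a minimal sketch, not part of the plWordNet tooling: the function names and the numeric register ranks are hypothetical, introduced only to illustrate the register condition and the three testing expressions.

```python
def register_condition_holds(rank_x: int, rank_y: int) -> bool:
    """Condition of the test: the register of Y must not be lower than that of X.

    The integer ranks are a hypothetical encoding of the register hierarchy.
    """
    return rank_y >= rank_x


def hypernymy_test_expressions(x: str, y: str) -> list[str]:
    """Fill the three testing expressions for a candidate hyponym X and hypernym Y."""
    return [
        f"If she/it is {x}, then she/it must be {y}",
        f"If she/it is {y}, then she/it need not be {x}",
        f"If she/it is not {y}, then she/it cannot be {x}",
    ]


# Both words are assumed to share the same (general) register rank here.
if register_condition_holds(rank_x=1, rank_y=1):
    for sentence in hypernymy_test_expressions("ocean", "water basin"):
        print(sentence)
```

A linguist still has to judge whether each generated sentence is acceptable; the sketch only produces the sentences to be judged.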



WordnetLoom

Cf. (Piasecki et al., 2013b)

plWordNet ‘Big Brother’


WordnetWeaver

§ Semi-automated wordnet expansion method
§ For new lemmas not yet described in a wordnet:
  § possible attachment synsets are automatically identified
  § and visually presented on the screen as wordnet subgraphs
§ Wordnet editors are free to take any action
§ Implemented as an extension to WordnetLoom



Paintball: knowledge sources

§ Knowledge sources $K_1, \ldots, K_s$ extracted by different methods from the corpus
§ $K_i = \{\langle l_n, l_j, w \rangle\}$, where:
  § $l_n$ – a new word (not in the wordnet)
  § $l_j$ – a wordnet word
  § $w$ – local weight (for the pair)
§ $\mathrm{weight}(K_i) \in (0, 1]$ – global weight (for the knowledge source)

(Piasecki et al., 2013a)
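One possible in-memory representation of these knowledge sources is sketched below. The class and function names are hypothetical, and so are the global weights in the example. Combining the local and the global weight by multiplication before summing is an assumption made for illustration; the slides only state that support is a sum of weights.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    new_word: str      # l_n: a word not yet in the wordnet
    wordnet_word: str  # l_j: a word already in the wordnet
    weight: float      # w: local weight for the pair


@dataclass
class KnowledgeSource:
    name: str
    global_weight: float  # weight(K_i), expected to lie in (0, 1]
    triples: list


def support(sources, new_word):
    """Sum weighted evidence per wordnet word for one new word.

    Assumption: each triple contributes global_weight * local weight.
    """
    supp = defaultdict(float)
    for source in sources:
        for triple in source.triples:
            if triple.new_word == new_word:
                supp[triple.wordnet_word] += source.global_weight * triple.weight
    return dict(supp)


# Local weights are taken from the slide examples; global weights are invented.
sources = [
    KnowledgeSource("hypernymy classifier", 1.0,
                    [Triple("feminism", "movement", 1.0)]),
    KnowledgeSource("cousin classifier", 0.5,
                    [Triple("feminism", "socialism", 0.204)]),
]
feminism_support = support(sources, "feminism")
print(feminism_support)
```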



Knowledge sources

§ Methods
  § Measure of Semantic Relatedness
  § Lexico-syntactic Patterns
    § specific – manually constructed
    § generic – automatically extracted
  § Classifiers based on Machine Learning
§ Only some of them produce probability values
§ Results: heterogeneous, partial, and imperfect (substantial error level)



Knowledge sources: used in experiments

§ Hypernymy classifier (Snow et al. 2004)
  § trained on patterns in the corpus parsed by Minipar (Lin, 1993)
  § e.g. 〈feminism, movement, 1.0〉, 〈feminism, idea, 0.951〉, 〈feminism, study, 0.951〉, 〈feminism, theory, 0.948〉, 〈feminism, politics, 0.867〉, 〈feminism, relationship, 0.867〉
§ Cousin classifier
  § logistic regression applied to a Measure of Semantic Relatedness
  § e.g. 〈feminism, socialism, 0.204〉, 〈feminism, humanism, 0.207〉, 〈feminism, nationalism, 0.208〉, 〈feminism, liberalism, 0.207〉, 〈feminism, pacifism, 0.208〉, 〈feminism, anarchism, 0.205〉



Paintball algorithm

§ Input: a wordnet, a new word and a set of Knowledge Sources
§ Output: a set of subgraphs (attachment areas) with one synset marked in each
§ Idea
  § each knowledge source carries some error level
  § knowledge source triples are not precise in pointing to particular synsets
  § hits cover regions
  § spreading activation helps to analyse and combine the delivered information



Paintball Metaphor: initial state

new lemma

Paintball Metaphor: hits from the knowledge sources

new lemma

Paintball Metaphor: attachment area

new lemma



Paintball: algorithm

Step 0 Setting up the initial state
1. Converting the synset graph into a graph of lexical units –> table Q
2. $\forall j \in J:\ Q[j] = \mathrm{supp}(j, x)$
3. for each $j \in J$: if $Q[j] > \tau_0$ then T = append(T, j)
4. T = sort_descendingly(T)

§ where:
  § J – a set of lexical units (word+senses)
  § Q – graph nodes, supp() – sum of weights (support)
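Items 3 and 4 of Step 0 amount to a filter followed by a sort. A minimal sketch follows, assuming the support values are already available as a dictionary keyed by lexical unit; the function name is hypothetical, and sorting by support value is an assumption, since the slide does not name the sort key.

```python
def initial_queue(q: dict, tau0: float) -> list:
    """Return lexical units whose support exceeds tau0, strongest first."""
    selected = [unit for unit, value in q.items() if value > tau0]
    return sorted(selected, key=lambda unit: q[unit], reverse=True)


# Support values reuse local weights quoted on the knowledge source slide.
q = {"idea": 0.951, "socialism": 0.204, "movement": 1.0}
t = initial_queue(q, tau0=0.4)
print(t)
```

With the spreading start threshold of 0.4 reported in the parameter list, weakly supported units such as the cousin-classifier hits never start a spreading pass on their own.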



Paintball: algorithm

Step 1 Spreading support across the graph
1. k = head(T) and T = tail(T)
2. fitRep(k, x, supp(k, x)): spreading support for x from the node k to linked nodes
3. if not empty(T) then goto Step 1



Paintball: algorithm: Step 1

§ fitReplication(j, x, M, T)
1. if M < ε then return
2. for each p ∈ dsc(j): fitRepTrans(p, x, fT(p, µ ∗ M), [j])

§ fitRepTrans(p, x, M, T)
1. if M < ε then return
2. for each p’ ∈ dsc(p|1): if not (p’|1 ∈ T) then fitRepTrans(p’, x, fI(p, p’, fT(p’, µ ∗ M)), [p’|1|T])
3. Q[p|1] = Q[p|1] + M
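The recursion above can be illustrated with a strongly simplified sketch. It is not the published procedure: the link-dependent functions fT and fI are replaced by the identity, links are plain directed adjacency lists between nodes, and all names are hypothetical. What it keeps is the decay by µ at every step, the stop condition M < ε, the check against revisiting nodes on the current path, and the accumulation of support in Q.

```python
def spread_support(graph, q, start, start_support, mu=0.65, eps=0.14):
    """Spread support from `start` to linked nodes, accumulating it in `q`.

    graph: dict mapping a node to the list of nodes it links to.
    q: dict of accumulated support, modified in place.
    """
    def visit(node, m, path):
        if m < eps:  # support too weak: stop spreading along this path
            return
        for nxt in graph.get(node, ()):
            if nxt not in path:  # do not walk back onto the current path
                visit(nxt, mu * m, path | {nxt})
        q[node] = q.get(node, 0.0) + m

    for nxt in graph.get(start, ()):
        visit(nxt, mu * start_support, frozenset({start, nxt}))


# Toy chain a -> b -> c -> d with easy-to-follow parameter values.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
q = {"a": 1.0}
spread_support(graph, q, "a", 1.0, mu=0.5, eps=0.2)
print(q)
```

In the toy run the support halves at each link, so node d would receive 0.125, which is below ε = 0.2 and is therefore dropped.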



Paintball: algorithm

Step 2 Identifying attachment areas
1. Calculating synset support matrix F from Q
2. Identifying connected wordnet subgraphs (activation areas) $G_m$, such that $G_m = \{s \in \mathit{Synsets} : F[s] > \tau_3\}$
3. for each $G_m$: $\mathrm{score}(G_m) = F[j_m]$, where $j_m = \arg\max_{j \in G_m} F[j]$
4. Return $G_m$, such that $\mathrm{score}(G_m) > \tau_4$, as attachment areas
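Items 2 to 4 of Step 2 can be sketched as a connected-components search over the activated synsets. The sketch assumes that the synset support F is already computed and given as a dictionary, and that the adjacency lists are symmetric; the function name and the toy data are hypothetical.

```python
def attachment_areas(graph, f, tau3=0.4, tau4=0.8):
    """Return (synsets, best_synset, score) for every sufficiently strong area.

    graph: symmetric adjacency dict between synsets (assumption).
    f: dict mapping a synset to its support value.
    """
    active = {s for s, value in f.items() if value > tau3}
    seen = set()
    areas = []
    for seed in sorted(active):
        if seed in seen:
            continue
        component, stack = set(), [seed]
        while stack:  # depth-first walk restricted to active synsets
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in graph.get(node, ())
                         if n in active and n not in component)
        seen |= component
        best = max(component, key=lambda s: f[s])  # the synset marked in the area
        if f[best] > tau4:
            areas.append((component, best, f[best]))
    return areas


graph = {"s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2"], "s4": []}
f = {"s1": 0.9, "s2": 0.5, "s3": 0.3, "s4": 0.6}
areas = attachment_areas(graph, f)
print(areas)
```

In the toy data s3 is not activated, s4 forms an area whose score 0.6 stays below τ4, and only the area {s1, s2} with s1 as its marked synset is returned.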



Evaluation: method

§ Evaluation by reconstruction
  § a word sample is removed from the wordnet
  § Paintball is applied to reattach the words
§ Data collected
  § histogram of path lengths between suggested synsets and the original positions in a wordnet
  § paths of up to 5 links were considered, including hyper/hyponymy links with at most one final meronymic link

Evaluation: method

§ Criteria
  § closest path: the attachment proposition that is closest to the original location
  § strongest suggestion: the top-scored proposition
  § all suggestions
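The path lengths behind the histogram and the closest path criterion can be sketched with a bounded breadth-first search. This is an illustration under simplifying assumptions: the graph is a plain adjacency dict, all link types are treated alike (the restriction to hyper/hyponymy links with at most one final meronymic link is not modelled), and the function names are hypothetical.

```python
from collections import deque


def path_length(graph, source, target, max_links=5):
    """Shortest number of links from source to target, or None beyond max_links."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_links:
            continue  # do not expand past the link limit
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def closest_hit(graph, suggestions, original, max_links=5):
    """Closest path criterion: the smallest distance over all suggestions."""
    lengths = [d for s in suggestions
               if (d := path_length(graph, s, original, max_links)) is not None]
    return min(lengths, default=None)


chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(path_length(chain, "a", "d"), closest_hit(chain, ["a", "c"], "d"))
```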


Evaluation: experiment setup

§ Wikipedia corpus, including almost 1 billion words
§ Word sample
  § corpus frequency threshold for words: 200
  § words that have at least 3 hypernymy links to the top synset
  § 1064 test words selected
  § margin of error 3% and 95% confidence level
  § frequent words: corpus frequency ≥ 1000
  § infrequent words: corpus frequency ≤ 999
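The reported sample size can be related to the stated margin of error with the usual normal approximation for a proportion. The slides do not say how the sample size was derived, so the formula, the worst-case proportion p = 0.5 and the value z = 1.96 for the 95% confidence level are assumptions of this sketch.

```python
import math


def margin_of_error(n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(1064)
print(f"{moe:.4f}")
```

Under these assumptions a sample of 1064 words gives a margin of error of about 3%, in line with the figure on the slide.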



Evaluation: baseline

§ Baseline: Probabilistic Wordnet Expansion (Snow, Jurafsky, & Ng, 2006)
  § no procedure is given for setting the parameter values
  § values selected experimentally:
    § minimal probability of evidence: 0.1
    § inverse odds of the prior: k = 4
    § maximum size of the cousins neighbourhood: (m, n) ≤ (3, 3)
    § maximum links in hypernym graph: 10
    § penalization factor: 0.9



Evaluation: Paintball parameters

§ Spreading start ($\tau_0$): 0.4
§ Spreading stop ($\varepsilon$): 0.14
§ Threshold for synset activation ($\tau_3$): 0.4
§ Threshold for attachment areas ($\tau_4$): 0.8
§ Spreading decay factor ($\mu$): 0.65



Results: straight path strategy

Columns 0 to 6: hit distance. PWE: Probabilistic Wordnet Expansion, PB: Paintball. C: closest path, S: strongest suggestion, A: all suggestions.

Method  Words  Crit.     0     1     2     3     4     5     6  [0-2]     ∑
PWE     Rare   C       3.7  21.7  16.2   9.6   6.9   3.4   0.1   41.6  61.5
PWE     Rare   S       0.5   5.9   9.7  10.9   8.9   4.5   0.5   16.1  40.9
PWE     Rare   A       0.8   4.9   5.0   4.5   3.8   2.0   0.4   10.7  21.5
PWE     Freq.  C       0.8  14.8  24.2  21.0  15.1   5.5   0.2   39.8  81.6
PWE     Freq.  S       0.1   2.7   9.4  16.1  15.7  13.2   0.8   12.2  58.0
PWE     Freq.  A       0.2   3.2   7.0  10.0   9.8   7.3   0.5   10.4  38.0
PB      Rare   C       9.2  21.7  12.6   6.7   4.2   1.0   0.6   43.5  56.1
PB      Rare   S       4.8  13.1  10.0   6.5   3.4   1.2   0.4   27.9  39.4
PB      Rare   A       2.9   6.9   4.8   3.5   2.2   1.0   0.2   14.6  21.5
PB      Freq.  C       6.3  20.5  15.0  11.9   6.7   2.6   0.5   41.8  63.3
PB      Freq.  S       1.9   9.1   8.4   8.1   4.8   1.9   0.3   19.4  34.7
PB      Freq.  A       1.4   4.9   4.4   4.4   3.1   1.6   0.2   10.7  20.0


Results: folded path strategy

Columns 0 to 4: hit distance. PWE: Probabilistic Wordnet Expansion, PB: Paintball. C: closest path, S: strongest suggestion, A: all suggestions.

Method  Words  Crit.     0     1     2     3     4     ∑
PWE     Rare   C       3.7  21.7  18.4  11.8   2.5  58.2
PWE     Rare   S       0.5   5.9  10.7  12.6   2.3  32.0
PWE     Rare   A       0.8   4.9   6.6   6.9   1.5  20.7
PWE     Freq.  C       0.8  14.8  25.2  22.9   4.0  67.7
PWE     Freq.  S       0.1   2.7   9.6  17.0   3.4  32.8
PWE     Freq.  A       0.2   3.2   7.9  12.2   2.9  26.4
PB      Rare   C       9.2  21.7  21.9  10.7   1.9  65.5
PB      Rare   S       4.8  13.1  15.3  13.1   1.5  47.9
PB      Rare   A       2.9   6.9  14.7  13.2   1.7  39.4
PB      Freq.  C       6.3  20.5  20.7  18.6   2.8  68.8
PB      Freq.  S       1.9   9.1  11.5  13.5   3.1  39.2
PB      Freq.  A       1.4   4.9   8.4  11.6   2.3  28.5


Results: coverage

§ For the straight path strategy
§ Coverage for words
  § PWE: propositions for 100% of words (freq. 100%)
  § Paintball: 63.15% of words (freq. 91.93%)
§ Recall for senses
  § PWE: 44.79% (freq. 43.93%)
  § Paintball: 24.66% (freq. 26.62%)



Results: example

§ PWE suggestions for feminism: {abstraction, abstract entity}, {entity}, {communication}, {group, grouping}, {state}
§ Paintball suggestions: {causal agent, cause, causal agency}, {change}, {political orientation, ideology, political theory}, {discipline, subject, subject area, subject field, field, field of study, study, bailiwick}, {topic, subject, issue, matter}



Semi-automated Wordnet Expansion: WordnetWeaver in Use

climbing

speedway

recreation



Inforex History

Inforex – a system for construction, annotation and searching of text corpora (Marcińczuk et al., 2012) http://nlp.pwr.wroc.pl/inforex/

History:
§ Developed at WUST (G4.19) since 2010
§ Used:
  § In research projects: NEKST, SyNaT, CLARIN-PL
  § Individual research: M. Zaśko-Zielińska (linguistics: suicide notes), Ł. Damurski (urban planning: documents on EU territorial policy)
  § PhD theses: B. Broda (WSD), M. Marcińczuk (NER, semantic relations), A. Radziszewski (syntactic phrases), J. Kocoń (temporal expressions, situation indicators)
  § Other research tasks: E. Kaczmarz (Facebook conversations), T. Bernaś (texts in Hebrew)
§ Interface to several corpora:
  § KPWr - Korpus Politechniki Wrocławskiej (Wrocław University of Technology Corpus)
  § CEN - corpus of economic news from Wikinews
  § PCSN - Polski korpus listów pożegnalnych samobójców (Polish Corpus of Suicide Notes)



Inforex Main features

§ http://inforex.clarin-pl.eu/ – access for users with an account in DSpace
§ Accessible via web browser (Firefox is suggested): does not require installation by the user, but needs permanent Internet access
§ Integrated with DSpace (import/export of data)
§ Enables sharing data among users
§ Access to data on the basis of authorisation related to corpora and annotation layers
§ Supports work on documents that are tagged (assumed segmentation into tokens and sentences) and non-tagged
§ Provides visualisation of the document structure during annotation



Inforex Visualisation of the document structure (1/2)

KPWr

Facebook conversations (E. Kaczmarz)



Inforex Visualisation of the document structure (1/2)

PCSN (M. Zaśko-Zielińska)

Texts in Hebrew (T. Bernaś)



KPWr Controlled state of the work (1/2)

Inforex Metadata

Inforex Content editing history

Inforex Annotation, annotation schema

Inforex Adding annotation to text

Inforex Verification of annotation

Inforex Lemmatisation

Inforex Translation of phrases

Inforex Normalisation of temporal expressions

Inforex Adding relation links

Inforex Relations – co-reference

Inforex Word Sense Disambiguation

Inforex Statistics – word frequency

Inforex Browsing annotations

Inforex Browsing annotations (translations)

Inforex Browsing relation links



Bibliography

§ Maziarz, M.; Szpakowicz, S. & Piasecki, M. (2015) A Procedural Definition of Multi-word Lexical Units. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing -- RANLP'2015, INCOMA Ltd. Shoumen, BULGARIA, 2015, 427-435 http://aclweb.org/anthology/R15-1056

§ Piasecki, M.; Wendelberger, M. & Maziarz, M. (2015) Extraction of the Multi-word Lexical Units in the Perspective of the Wordnet Expansion. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing -- RANLP'2015, INCOMA Ltd. Shoumen, BULGARIA, 2015, 512-520 http://aclweb.org/anthology/R15-1067

§ Broda, B. & Piasecki, M. (2013) Parallel, Massive Processing in SuperMatrix -- a General Tool for Distributional Semantic Analysis of Corpora. International Journal of Data Mining, Modelling and Management, 2013, 5, 1-19

§ Maziarz, M.; Piasecki, M.; Rudnicka, E. & Szpakowicz, S. (2013) Beyond the Transfer-and-Merge Wordnet Construction: plWordNet and a Comparison with WordNet. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, INCOMA Ltd. Shoumen, BULGARIA, 2013, 443-452 http://aclweb.org/anthology/R13-1058

§ Piasecki, M. & Wendelberger, M. (2014) Partial Measure of Semantic Relatedness Based on the Local Feature Selection. In Sojka, P.; Horák, A.; Kopecek, I. & Pala, K. (Eds.) Text, Speech and Dialogue - 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings, Springer, 2014, 8655, 336-343

§ Piasecki, M.; Ramocki, R. & Kaliński, M. (2013a) Information Spreading in Expanding Wordnet Hypernymy Structure. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, INCOMA Ltd. Shoumen, BULGARIA, 2013, 553-561 http://aclweb.org/anthology/R13-1073

§ Piasecki, M.; Marcińczuk, M.; Ramocki, R. & Maziarz, M. (2013b) WordnetLoom: a Wordnet Development System Integrating Form-based and Graph-based Perspectives. International Journal of Data Mining, Modelling and Management, 2013, 5, 210-232

§ Broda, B. & Piasecki, M. (2011) Evaluating LexCSD in a Large Scale Experiment. Control and Cybernetics, Vol. 40, 419-436

§ Piasecki, M.; Burdka, Ł.; Maziarz, M. & Kaliński, M. (2016) In Vetulani, Z.; Uszkoreit, H. & Kubis, M. (Eds.) Human Language Technology. Challenges for Computer Science and Linguistics. Lecture Notes in Computer Science, Volume 9561, 255-273 http://link.springer.com/chapter/10.1007/978-3-319-43808-5_20

§ Marcińczuk, M.; Kocoń, J. & Broda, B. (2012) Inforex – a web-based tool for text corpus management and semantic annotation. In Calzolari, N. et al. (Eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), European Language Resources Association (ELRA), Istanbul, Turkey, 224-230 https://www.researchgate.net/publication/308886657_Inforex_-_a_web-based_tool_for_text_corpus_management_and_semantic_annotation




Thank you very much for your attention! www.clarin-pl.eu

Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]