
CLARIN-PL

Language Technology for Polish in Practice Systems supporting development of resources

Maciej Piasecki, Marek Maziarz, Michał Marcińczuk, Marcin Oleksy

Wrocław University of Science and Technology G4.19 Research Group

[email protected] 2017-01-17


Systems supporting development

§  Inforex
   §  system for corpus construction, editing and verification
   §  linguistic team management
§  Complex system of tools for corpus-based, semi-automatic wordnet development
   §  Morpho-syntactic preprocessing
   §  Extraction of Multiword Expressions with MeWeX (Maziarz et al., 2015), (Piasecki et al., 2015)
   §  SuperMatrix – extraction of lemmas and statistics from corpora, and Measures of Semantic Relatedness (Broda & Piasecki, 2013)
   §  LexCSD – identification and extraction of usage examples (Broda & Piasecki, 2011)
   §  Corpus browsing, e.g. NoSketch Engine https://nlp.fi.muni.cz/trac/noske
   §  WordnetLoom 2.0 – wordnet editing, verification, group working (Piasecki et al., 2013b)
   §  WordnetWeaver – semi-automatic wordnet expansion (Piasecki et al., 2013a)

Samsung R&D Institute

Invit. Lecture 2017-01-17

CLARIN-PL


plWordNet development process

plWordNet Corpus 7.0: 2 billion tokens

plWordNet Corpus – a merged corpus:
•  available Polish corpora:
   •  Corpus IPI PAN
   •  Rzeczpospolita Corpus
   •  Wikipedia (2015)
•  Texts on open licence
•  Texts collected from the Internet
   •  larger texts
   •  Max. 20% of tokens not recognised by Morfeusz
•  Version 7.0: ~2 billion tokens
•  Version 10.0: >4 billion tokens (for plWordNet 4.0)
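The "max. 20% of tokens not recognised" criterion can be illustrated with a minimal sketch. The `is_known` predicate is a hypothetical stand-in for a morphological analyser lookup (it is not Morfeusz's actual API), and applying the limit per text is an assumption about how the filter was used.

```python
from typing import Callable, Iterable


def unknown_ratio(tokens: Iterable[str], is_known: Callable[[str], bool]) -> float:
    """Fraction of tokens that the (hypothetical) analyser does not recognise."""
    tokens = list(tokens)
    if not tokens:
        return 1.0  # treat an empty text as fully unrecognised
    unknown = sum(1 for token in tokens if not is_known(token))
    return unknown / len(tokens)


def keep_text(tokens: Iterable[str], is_known: Callable[[str], bool],
              max_unknown: float = 0.20) -> bool:
    """Keep a text only if at most `max_unknown` of its tokens are unrecognised."""
    return unknown_ratio(tokens, is_known) <= max_unknown


# Toy lexicon standing in for the analyser's dictionary (illustrative only).
lexicon = {"ala", "ma", "kota"}
print(keep_text(["ala", "ma", "kota", "xyz", "ala"], lexicon.__contains__))
print(keep_text(["ala", "xyz"], lexicon.__contains__))
```

In a real pipeline the predicate would wrap whatever analyser is available; only the ratio check is taken from the slide.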


Cf (Maziarz et al., 2013)

plWordNet development process

plWordNet Corpus 7.0 (2 billion tokens)

List of entries (most frequent lemmas)

Identification of meanings

software tools:
•  corpus concordancer
•  automated extraction of usage examples
•  tools shown: NoSketch Engine, Inforex

plWordNet development process

Usage examples for kąsać

Note: usage examples -> identification of meanings, typical examples, 10 meanings (Marek)

(figure: usage examples numbered 1–10)

Meanings shown:
•  ‘of animals: to bite using teeth, causing wounds’
•  ‘of weather phenomena (e.g. frost): to bite, to nip’
•  ‘of insects: to bite’
•  ‘of worries, pangs of conscience: to gnaw’
•  ‘of people: to pester, to harm someone’


plWordNet development process

plWordNet Corpus 7.0 (2 billion tokens)

List of entries (most frequent lemmas)

Identification of meanings
•  software tools: corpus concordancer, automated extraction of usage examples
•  plWordNet Team guidelines
•  dictionaries, encyclopaedias, lexicons…

defining lexical units

plWordNet development process

plWordNet Corpus 7.0 (2 billion tokens)

List of entries (most frequent lemmas)

Identification of meanings
•  software tools
•  plWordNet Team guidelines
•  dictionaries, encyclopaedias, lexicons…

defining lexical units

relation assignment = linking to network
•  WordnetWeaver
•  Measure of Semantic Relatedness

plWordNet development process

Measure of Semantic Relatedness: results (generated by SuperMatrix)

(figure; relation types in the legend: antonym, hypernym, hyponym, co-hyponym, closely related, holonym)

(Piasecki & Wendelberger, 2014)


plWordNet development process

plWordNet Corpus 7.0 (2 billion tokens)

List of entries (most frequent lemmas)

Identification of meanings
•  plWordNet Team guidelines
•  dictionaries, encyclopaedias, lexicons…

defining lexical units
•  + stylistic register
•  + gloss
•  + usage examples

relation assignment = linking to network

software tools:
•  concordancer
•  extracted usage examples
•  WordnetWeaver
•  Measure of Semantic Relatedness

plWordNet development process

plWordNet Corpus 7.0 (2 billion tokens), list of entries (most frequent lemmas)

Identification of meanings, defining lexical units (+ stylistic register + gloss + usage examples), relation assignment = linking to network:

•  Intuition: linguist, team
•  But controlled by:
   •  guidelines
   •  and substitution tests

Definition of relations

Substitution Test for Hypernymy

Condition:
The stylistic register of Y must not be lower in the register hierarchy than the register of X.

Testing expressions:
•  If she/it is X, then she/it must be Y
•  If she/it is Y, then she/it need not be X
•  If she/it is not Y, then she/it cannot be X

(Maziarz, Piasecki, Szpakowicz, Rabiega-Wiśniewska 2010)


Definition of relations

Applying Hypernymy Test to a Pair

Condition:
Both ocean ‘ocean’ and zbiornik wodny ‘water basin’ are of the general stylistic register.

Testing expressions:
•  If she/it is oceanem ‘ocean’, then she/it must be zbiornikiem wodnym ‘water basin’
•  If she/it is zbiornikiem wodnym ‘water basin’, then she/it need not be oceanem ‘ocean’
•  If she/it is not zbiornikiem wodnym ‘water basin’, then she/it cannot be oceanem ‘ocean’
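The substitution test is a fixed template filled with a candidate pair, which a few lines of code can show. This is only a sketch: the function names are hypothetical, and the numeric register ranks are an assumed encoding of the register hierarchy, which the slides do not specify.

```python
def register_condition_holds(rank_x: int, rank_y: int) -> bool:
    """Condition: the register of Y must not be lower than the register of X.

    Ranks are an assumed numeric encoding (higher number = higher register).
    """
    return rank_y >= rank_x


def hypernymy_test_expressions(x: str, y: str) -> list[str]:
    """Fill the three testing expressions for the candidate pair (X, Y)."""
    return [
        f"If she/it is {x}, then she/it must be {y}",
        f"If she/it is {y}, then she/it need not be {x}",
        f"If she/it is not {y}, then she/it cannot be {x}",
    ]


# Both words are of the general register, so they get the same assumed rank.
if register_condition_holds(rank_x=1, rank_y=1):
    for sentence in hypernymy_test_expressions("ocean", "water basin"):
        print(sentence)
```

The generated sentences are meant for a linguist to judge; the code does not decide whether the relation holds.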


WordnetLoom

Cf (Piasecki et al., 2013b)

plWordNet `Big Brother’

WordnetWeaver

§  Semi-automated wordnet expansion method
§  For new lemmas – not yet described in a wordnet
   §  possible attachment synsets are automatically identified
   §  and visually presented on the screen as wordnet subgraphs
§  Wordnet editors are free to take any action
§  Implemented as an extension to WordnetLoom


Paintball: knowledge sources

§  Knowledge sources K_1, …, K_s extracted by different methods from the corpus
§  K_i = { <l_n, l_j, w> }, where:
   §  l_n – a new word (not in the wordnet)
   §  l_j – a wordnet word
   §  w – local weight (for the pair)
§  weight(K_i) ∈ (0,1] – global weight (for the knowledge source)

(Piasecki et al., 2013a)
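One way to picture a knowledge source is as a named list of weighted triples plus a single global weight. The sketch below is illustrative only: the class and field names are hypothetical and the global weight 0.8 is an invented value. The one example triple follows the 〈new word, wordnet word, weight〉 shape defined above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    new_word: str        # l_n: a word not yet in the wordnet
    wordnet_word: str    # l_j: a word already in the wordnet
    weight: float        # w: local weight for the pair


@dataclass
class KnowledgeSource:
    name: str
    weight: float                      # global weight, must lie in (0, 1]
    triples: list[Triple] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not (0.0 < self.weight <= 1.0):
            raise ValueError("global weight must be in (0, 1]")

    def for_word(self, new_word: str) -> list[Triple]:
        """All triples that this source delivers for one new word."""
        return [t for t in self.triples if t.new_word == new_word]


# Hypothetical source; the global weight is made up for the example.
source = KnowledgeSource("hypernymy classifier", 0.8,
                         [Triple("feminism", "movement", 1.0)])
print(source.for_word("feminism"))
```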


Knowledge sources

§  Methods
   §  Measure of Semantic Relatedness
   §  Lexico-syntactic Patterns
      §  specific – manually constructed
      §  generic – automatically extracted
   §  Classifiers based on Machine Learning
§  Only some of them produce probability values
§  Results: heterogeneous, partial, and imperfect – substantial error level


Knowledge sources: used in experiments

§  Hypernymy classifier (Snow et al. 2004)
   §  trained on patterns in the corpus parsed by Minipar (Lin, 1993)
   §  e.g. 〈feminism, movement, 1.0〉, 〈feminism, idea, 0.951〉, 〈feminism, study, 0.951〉, 〈feminism, theory, 0.948〉, 〈feminism, politics, 0.867〉, 〈feminism, relationship, 0.867〉
§  Cousin classifier
   §  logistic regression applied to a Measure of Semantic Relatedness
   §  e.g. 〈feminism, socialism, 0.204〉, 〈feminism, humanism, 0.207〉, 〈feminism, nationalism, 0.208〉, 〈feminism, liberalism, 0.207〉, 〈feminism, pacifism, 0.208〉, 〈feminism, anarchism, 0.205〉


Paintball algorithm

§  Input: a wordnet, a new word and a set of Knowledge Sources
§  Output: a set of subgraphs – attachment areas – with one synset marked in each
§  Idea
   §  each knowledge source expresses some error level
   §  knowledge source triples are not precise in pointing to particular synsets
   §  hits cover regions
   §  spreading activation helps to analyse and combine the delivered information


Paintball Metaphor (sequence of figures; node label in each: nowy lemat ‘new lemma’)

1.  initial state
2.  hits from the knowledge sources (three successive figures)
3.  attachment area

Paintball: algorithm

Step 0: Setting up the initial state
1.  Converting the synset graph into a graph of lexical units –> table Q
2.  ∀j∈J. Q[j] = supp(j, x)
3.  for each j∈J: if Q[j] > τ0 then T = append(T, j)
4.  T = sort_descendingly(T)
§  where:
   §  J – a set of lexical units (word + sense)
   §  Q – graph nodes, supp() – sum of weights (support)
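Step 0 can be sketched in a few lines, under the assumption that the support values supp(j, x) have already been computed and are passed in as a dictionary. The function name and the toy values are hypothetical; only the threshold and the descending sort come from the slide.

```python
def initial_state(lexical_units, support, tau0):
    """Build table Q and the work list T for one new word x.

    lexical_units: iterable of lexical-unit identifiers (the set J)
    support:       dict mapping a unit j to supp(j, x); missing units get 0.0
    tau0:          spreading-start threshold
    """
    Q = {j: support.get(j, 0.0) for j in lexical_units}
    # Only units whose support exceeds tau0 become starting points.
    T = [j for j in Q if Q[j] > tau0]
    # Process the most strongly supported units first.
    T.sort(key=lambda j: Q[j], reverse=True)
    return Q, T


# Toy values, chosen only to show the thresholding and ordering.
Q, T = initial_state(["a", "b", "c"], {"a": 0.5, "b": 0.9, "c": 0.3}, tau0=0.4)
print(T)
```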


Paintball: algorithm

Step 1: Spreading support across the graph
1.  k = head(T) and T = tail(T)
2.  fitRep(k, x, supp(k, x)) – spreading support for x from the node k to linked nodes
3.  if not empty(T) then goto Step 1


Paintball: algorithm: Step 1

§  fitReplication(j, x, M, T)
   1.  if M < ε then return
   2.  for each p ∈ dsc(j)
          fitRepTrans(p, x, fT(p, µ ∗ M), [j])

§  fitRepTrans(p, x, M, T)
   1.  if M < ε then return
   2.  for each p’ ∈ dsc(p|1)
          if not (p’|1 ∈ T)
             fitRepTrans(p’, x, fI(p, p’, fT(p’, µ ∗ M)), [p’|1|T])
   3.  Q[p|1] = Q[p|1] + M
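The recursion above can be approximated by a simplified sketch. It makes several assumptions that are not in the slides: links are treated as plain node-to-node edges (the p|1 notation is ignored), fI is left out, and fT defaults to an identity placeholder, so only the decay factor µ and the stop threshold ε shape the result. All names are hypothetical.

```python
def spread_support(graph, Q, start, support, mu, eps, f_t=None):
    """Spread `support` from `start` to the nodes reachable from it.

    graph: dict node -> list of linked nodes (stands in for dsc())
    Q:     dict node -> accumulated support, updated in place
    mu:    decay factor applied on every link
    eps:   spreading stops once the transmitted value drops below eps
    f_t:   optional transmittance function (node, value) -> value;
           identity by default, which is an assumption of this sketch
    """
    if f_t is None:
        f_t = lambda node, value: value

    def visit(node, m, visited):
        if m < eps:
            return
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visit(nxt, f_t(nxt, mu * m), visited | {nxt})
        Q[node] = Q.get(node, 0.0) + m

    if support < eps:
        return Q
    for p in graph.get(start, []):
        visit(p, f_t(p, mu * support), {start, p})
    return Q


# Toy chain a -> b -> c -> d with illustrative parameter values.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(spread_support(chain, {}, "a", support=1.0, mu=0.5, eps=0.2))
```

With these toy values the activation fades after two links, because the third transmitted value falls below the stop threshold.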


Paintball: algorithm

Step 2: Identifying attachment areas
1.  Calculating synset support matrix F from Q
2.  Identifying connected wordnet subgraphs (activation areas), such that Gm = {s ∈ Synsets : F[s] > τ3}
3.  for each Gm: score(Gm) = F[jm], where jm = argmax_{j∈Gm} F[j]
4.  Return Gm, such that score(Gm) > τ4, as attachment areas
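Step 2 amounts to finding connected components among sufficiently activated synsets and keeping the strongest ones. The sketch below assumes that F is already available as a dictionary and that the graph is an undirected adjacency map; the function name and the toy values are hypothetical.

```python
from collections import deque


def attachment_areas(graph, F, tau3, tau4):
    """Return (best_synset, members) for each activation area scoring above tau4.

    graph: dict synset -> list of neighbouring synsets (assumed symmetric)
    F:     dict synset -> support value
    """
    active = {s for s, value in F.items() if value > tau3}
    seen = set()
    areas = []
    for s in sorted(active):
        if s in seen:
            continue
        # Breadth-first search restricted to active synsets.
        component = []
        queue = deque([s])
        seen.add(s)
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbour in graph.get(current, ()):
                if neighbour in active and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(component, key=lambda node: F[node])
        if F[best] > tau4:  # score(Gm) is the support of its best synset
            areas.append((best, sorted(component)))
    return areas


toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
toy_F = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.6, "e": 0.45}
print(attachment_areas(toy_graph, toy_F, tau3=0.4, tau4=0.8))
```

The synset returned as `best` corresponds to the "one synset marked in each" attachment area.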


Evaluation: method

§  Evaluation by reconstruction
   §  a word sample is removed from the wordnet
   §  Paintball is applied to reattach the words
§  Data collected
   §  histogram of path lengths between suggested synsets and the original positions in a wordnet
   §  paths of up to 5 links, including hyper/hyponymy links with at most one final meronymic link, were considered
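The path-length histogram can be sketched with a bounded breadth-first search. This is a simplification: it treats all links alike and does not model the restriction on link types (hyper/hyponymy with at most one final meronymic link). The function names and the toy graph are hypothetical.

```python
from collections import Counter, deque


def path_length(graph, source, target, max_links=5):
    """Shortest number of links from source to target, or None if above max_links."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_links:
            continue  # do not extend paths beyond the limit
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def distance_histogram(graph, pairs, max_links=5):
    """Count (suggested, original) synset pairs by the distance between them."""
    histogram = Counter()
    for suggested, original in pairs:
        distance = path_length(graph, suggested, original, max_links)
        if distance is not None:
            histogram[distance] += 1
    return histogram


toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(distance_histogram(toy, [("a", "a"), ("a", "c"), ("a", "z")]))
```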


Evaluation: method

§  Criteria
   §  closest path: attachment proposition that is closest to the original location
   §  strongest suggestion: top scored
   §  all suggestions


Evaluation: experiment setup

§  Wikipedia corpus, including almost 1 billion words
§  Word sample
   §  corpus frequency threshold for words: 200
   §  words that have at least 3 hypernymy links to the top synset
   §  1064 test words selected
   §  margin of error 3% and 95% confidence level
   §  frequent words ≥ 1000
   §  infrequent words ≤ 999
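The slide does not say how the margin of error was derived. As a plausibility check only, the standard normal approximation for a proportion, assuming the worst case p = 0.5 and z = 1.96 for the 95% confidence level, gives a value close to 3% for a sample of 1064 words.

```python
import math


def margin_of_error(n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Normal-approximation margin of error for a proportion estimated from n items."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Sample size taken from the slide; z and p are assumptions of this sketch.
print(round(margin_of_error(1064), 4))
```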


Evaluation: baseline

§  Baseline: Probabilistic Wordnet Expansion (Snow, Jurafsky, & Ng, 2006)
   §  lack of procedure for setting the values of parameters
   §  selected experimentally:
      §  minimal probability of evidence: 0.1
      §  inverse odds of the prior: k = 4
      §  maximum size of the cousins neighbourhood: (m, n) ≤ (3,3)
      §  maximum links in hypernym graph: 10
      §  penalization factor: 0.9


Evaluation: Paintball parameters

§  Spreading start (τ0): 0.4
§  Spreading stop (ε): 0.14
§  Threshold for synset activation (τ3): 0.4
§  Threshold for attachment areas (τ4): 0.8
§  Spreading decay factor (µ): 0.65


Results: straight path strategy

PWE – Probabilistic Wordnet Expansion, PB – Paintball; Rare / Freq. – infrequent / frequent words; C – closest path, S – strongest suggestion, A – all suggestions

                        Hit distance
Method  Words  Crit.     0     1     2     3     4     5     6  [0-2]     ∑
PWE     Rare   C       3.7  21.7  16.2   9.6   6.9   3.4   0.1   41.6  61.5
PWE     Rare   S       0.5   5.9   9.7  10.9   8.9   4.5   0.5   16.1  40.9
PWE     Rare   A       0.8   4.9   5.0   4.5   3.8   2.0   0.4   10.7  21.5
PWE     Freq.  C       0.8  14.8  24.2  21.0  15.1   5.5   0.2   39.8  81.6
PWE     Freq.  S       0.1   2.7   9.4  16.1  15.7  13.2   0.8   12.2  58.0
PWE     Freq.  A       0.2   3.2   7.0  10.0   9.8   7.3   0.5   10.4  38.0
PB      Rare   C       9.2  21.7  12.6   6.7   4.2   1.0   0.6   43.5  56.1
PB      Rare   S       4.8  13.1  10.0   6.5   3.4   1.2   0.4   27.9  39.4
PB      Rare   A       2.9   6.9   4.8   3.5   2.2   1.0   0.2   14.6  21.5
PB      Freq.  C       6.3  20.5  15.0  11.9   6.7   2.6   0.5   41.8  63.3
PB      Freq.  S       1.9   9.1   8.4   8.1   4.8   1.9   0.3   19.4  34.7
PB      Freq.  A       1.4   4.9   4.4   4.4   3.1   1.6   0.2   10.7  20.0


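Results: folded path strategy

PWE – Probabilistic Wordnet Expansion, PB – Paintball; Rare / Freq. – infrequent / frequent words; C – closest path, S – strongest suggestion, A – all suggestions

                        Hit distance
Method  Words  Crit.     0     1     2     3     4     ∑
PWE     Rare   C       3.7  21.7  18.4  11.8   2.5  58.2
PWE     Rare   S       0.5   5.9  10.7  12.6   2.3  32.0
PWE     Rare   A       0.8   4.9   6.6   6.9   1.5  20.7
PWE     Freq.  C       0.8  14.8  25.2  22.9   4.0  67.7
PWE     Freq.  S       0.1   2.7   9.6  17.0   3.4  32.8
PWE     Freq.  A       0.2   3.2   7.9  12.2   2.9  26.4
PB      Rare   C       9.2  21.7  21.9  10.7   1.9  65.5
PB      Rare   S       4.8  13.1  15.3  13.1   1.5  47.9
PB      Rare   A       2.9   6.9  14.7  13.2   1.7  39.4
PB      Freq.  C       6.3  20.5  20.7  18.6   2.8  68.8
PB      Freq.  S       1.9   9.1  11.5  13.5   3.1  39.2
PB      Freq.  A       1.4   4.9   8.4  11.6   2.3  28.5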


Results: coverage

§  For the straight path strategy
§  Coverage for words
   §  PWE: propositions for 100% of words (freq. 100%)
   §  Paintball: 63.15% of words (freq. 91.93%)
§  Recall for senses
   §  PWE: 44.79% (freq. 43.93%)
   §  Paintball: 24.66% (freq. 26.62%)


Results: example

§  PWE suggestions for feminism: {abstraction, abstract entity}, {entity}, {communication}, {group, grouping}, {state}
§  Paintball suggestions: {causal agent, cause, causal agency}, {change}, {political orientation, ideology, political theory}, {discipline, subject, subject area, subject field, field, field of study, study, bailiwick}, {topic, subject, issue, matter}


Semi-automated Wordnet Expansion: WordnetWeaver in Use

(screenshot; labels visible: climbing, speedway, recreation)


Inforex History

Inforex – a system for construction, annotation and searching of text corpora (Marcińczuk et al., 2012) http://nlp.pwr.wroc.pl/inforex/

History:
§  Developed at WUST (G4.19) since 2010
§  Used:
   §  In research projects: NEKST, SyNaT, CLARIN-PL
   §  Individual research: M. Zaśko-Zielińska (linguistics: suicide notes), Ł. Damurski (urban planning: documents on EU territorial policy)
   §  PhD theses: B. Broda (WSD), M. Marcińczuk (NER, semantic relations), A. Radziszewski (syntactic phrases), J. Kocoń (temporal expressions, situation markers)
   §  Other research tasks: E. Kaczmarz (Facebook conversations), T. Bernaś (texts in Hebrew)
§  Interface to several corpora:
   §  KPWr – Korpus Politechniki Wrocławskiej (Wrocław University of Technology Corpus)
   §  CEN – corpus of economic news from Wikinews
   §  PCSN – Polski korpus listów pożegnalnych samobójców (Polish Corpus of Suicide Notes)


Inforex Main features

§  http://inforex.clarin-pl.eu/ – access for users with an account in DSpace
§  Accessible via web browser (Firefox is suggested) – does not require installation by the user, needs permanent access to the Internet
§  Integrated with DSpace (import/export of data)
§  Enables sharing data among users
§  Access to data on the basis of authorisation related to corpora and annotation layers
§  Supports work on documents that are tagged (assumed segmentation into tokens and sentences) and non-tagged
§  Provides visualisation of the document structure during annotation


Inforex Visualisation of the document structure (1/2)

KPWr Rozmowy z Facebooka (E. Kaczmarz)

Samsung R&D Institute

Invit. Lecture 2017-01-17

CLARIN-PL

Page 55: Language Technology for Polish in Practiceclarin-pl.eu/wp-content/uploads/2017/01/SM-Systems-for-LRs-part3.pdf · linking to network Identification of meanings plWordNet Corpus 7.0

Inforex Visualisation of the document structure (1/2)

PCSN (M. Zaśko-Zielińska) Teksty w j. hebrajskim (T. Bernaś)

Samsung R&D Institute

Invit. Lecture 2017-01-17

CLARIN-PL

Page 56: Language Technology for Polish in Practiceclarin-pl.eu/wp-content/uploads/2017/01/SM-Systems-for-LRs-part3.pdf · linking to network Identification of meanings plWordNet Corpus 7.0

KPWr Controlled state of the work (1/2)

Samsung R&D Institute

Invit. Lecture 2017-01-17

CLARIN-PL

Page 57: Language Technology for Polish in Practiceclarin-pl.eu/wp-content/uploads/2017/01/SM-Systems-for-LRs-part3.pdf · linking to network Identification of meanings plWordNet Corpus 7.0

KPWr Controlled state of the work (1/2)

Samsung R&D Institute

Invit. Lecture 2017-01-17

CLARIN-PL

Page 58: Language Technology for Polish in Practiceclarin-pl.eu/wp-content/uploads/2017/01/SM-Systems-for-LRs-part3.pdf · linking to network Identification of meanings plWordNet Corpus 7.0

Inforex Metadata

Samsung R&D Institute

Invit. Lecture 2017-01-17

CLARIN-PL

Page 59: Language Technology for Polish in Practiceclarin-pl.eu/wp-content/uploads/2017/01/SM-Systems-for-LRs-part3.pdf · linking to network Identification of meanings plWordNet Corpus 7.0

Inforex Content editing history

Samsung R&D Institute

Invit. Lecture 2017-01-17

CLARIN-PL

Page 60: Language Technology for Polish in Practiceclarin-pl.eu/wp-content/uploads/2017/01/SM-Systems-for-LRs-part3.pdf · linking to network Identification of meanings plWordNet Corpus 7.0

Inforex Annotation, annotation schema

Inforex Adding annotation to text

Inforex Verification of annotation

Inforex Lemmatisation

Inforex Translation of phrases

Inforex Normalisation of temporal expressions

Inforex Adding relation links

Inforex Relations – co-reference

Inforex Word Sense Disambiguation

Inforex Statistics – word frequency

Inforex Browsing annotations

Inforex Browsing annotations (translations)

Inforex Browsing relation links

Bibliography

§  Maziarz, M.; Szpakowicz, S. & Piasecki, M. (2015) A Procedural Definition of Multi-word Lexical Units. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing -- RANLP'2015, INCOMA Ltd. Shoumen, BULGARIA, 2015, 427-435 http://aclweb.org/anthology/R15-1056

§  Piasecki, M.; Wendelberger, M. & Maziarz, M. (2015) Extraction of the Multi-word Lexical Units in the Perspective of the Wordnet Expansion. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing -- RANLP'2015, INCOMA Ltd. Shoumen, BULGARIA, 2015, 512-520 http://aclweb.org/anthology/R15-1067

§  Broda, B. & Piasecki, M. (2013) Parallel, Massive Processing in SuperMatrix -- a General Tool for Distributional Semantic Analysis of Corpora. International Journal of Data Mining, Modelling and Management, 2013, 5, pp. 1-19

§  Maziarz, M.; Piasecki, M.; Rudnicka, E. & Szpakowicz, S. (2013) Beyond the Transfer-and-Merge Wordnet Construction: plWordNet and a Comparison with WordNet. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, INCOMA Ltd. Shoumen, BULGARIA, 2013, 443-452 http://aclweb.org/anthology/R13-1058

§  Piasecki, M. & Wendelberger, M. (2014) Partial Measure of Semantic Relatedness Based on the Local Feature Selection. In Sojka, P.; Horák, A.; Kopecek, I. & Pala, K. (Eds.) Text, Speech and Dialogue - 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings, Springer, 2014, 8655, 336-343

§  Piasecki, M.; Ramocki, R. & Kaliński, M. (2013a) Information Spreading in Expanding Wordnet Hypernymy Structure. In Mitkov, R.; Angelova, G. & Boncheva, K. (Eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, INCOMA Ltd. Shoumen, BULGARIA, 2013, 553-561, http://aclweb.org/anthology/R13-1073

§  Piasecki, M.; Marcińczuk, M.; Ramocki, R. & Maziarz, M. (2013b) WordnetLoom: a Wordnet Development System Integrating Form-based and Graph-based Perspectives. International Journal of Data Mining, Modelling and Management, 2013, 5, 210-232

§  Broda, B. & Piasecki, M. (2011) Evaluating LexCSD in a Large Scale Experiment. Control and Cybernetics, Vol. 40, 419-436.

§  Maciej Piasecki, Łukasz Burdka, Marek Maziarz, Michał Kaliński. (2016) In Zygmunt Vetulani, Hans Uszkoreit, Marek Kubis (Eds.) Human Language Technology. Challenges for Computer Science and Linguistics. Volume 9561 of the series Lecture Notes in Computer Science, pp. 255-273. http://link.springer.com/chapter/10.1007/978-3-319-43808-5_20

§  Marcińczuk, M.; Kocoń, J. & Broda, B. (2012) Inforex: a web-based tool for text corpus management and semantic annotation. In Calzolari, N. et al. (Eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, 224-230 https://www.researchgate.net/publication/308886657_Inforex_-_a_web-based_tool_for_text_corpus_management_and_semantic_annotation

CLARIN-PL

Thank you very much for your attention! www.clarin-pl.eu

Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]