"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek...

"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek Amebis, d.o.o., Kamnik, Slovenia Jožef Stefan Institute, Ljubljana, Slovenia


"Communication in Slovene"

with an emphasis on the Slovene lexical database and corpora

Simon Krek

Amebis, d.o.o., Kamnik, Slovenia
Jožef Stefan Institute, Ljubljana, Slovenia

European Union & Slovene Ministry of Education and Sport

• The operation is partly financed by the European Union, the European Social Fund, and the Ministry of Education and Sport of the Republic of Slovenia. The operation is being carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013.

Goals

Natural Language Processing Tools and Resources

Didactics

Language description (and standardization)

Language Data


Today


Slovene Lexical Database

Timeline

• Number of lexical units: minimum 2,500
• June–October 2008: preparation
• November 2008–June 2009: specifications
• June 2010
• June 2011
• June 2012

Legal aspects

• Creative Commons
– Attribution – Share Alike – Noncommercial
• Availability
– On-line (http://www.termania.net/)
– Dataset (http://www.slovenscina.eu/)
• Owner: Ministry of Education and Sport
• Future: Slovene HLT Agency?

Past experience

• International (early):
– GENELEX (1990-94)
– LE PAROLE (1993-98)
– SIMPLE (1998-2002)
-----------------------------------------------
– ACQUILEX I, II (- 1995)
– ILC- DELIS …

• Individual languages: elexico (DE), CLIPS (IT), CORNETTO (NL), ALFALEX (FR), STO (DK), ADESSE (SP), GRIAL (SP), CEGLEX (PL), SALDO (S), BLF (FR), PRALED (CZ), ...

• Important for us: FrameNet, Corpus Pattern Analysis, DANTE, COBUILD

Basics

• corpus data analysis
• lexicogrammatical approach
• semantics and syntax are not separated
• valency – colligation – collocation
• meaning = meaning potential – is not stable (norms & exploitations)
• lumpers vs. splitters = splitters
• lexicography first, NLP second

Lexicogrammatical continuum

semantic indicator

semantic frame

syntactic structure & pattern

syntactic combination

collocation

extended collocation

example

phraseology

I. LEXICAL UNIT
• headword: to squeeze
• part-of-speech: verb

II. SENSE
• indicator
– 1. grip firmly
– 2. press out liquid
• frame
– 1. If a PERSON squeezes an OBJECT, s|he presses it firmly, usually with his|her hands.
– 2. If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an OBJECT, s|he gets the liquid or substance out by pressing the object.
• multi-word unit (only nouns and adjectives)

III. SYNTAX
• structure
– 1. vb-obj
– 2. vb-out-obj
• pattern
– 1. sb squeezes sth
– 2. sb squeezes sth out
• combination (to squeeze your eyes shut)

IV. COLLOC'S
• collocation
– 1. to squeeze (sb's) [hand, arm]
– 2. to squeeze [the poison, the venom] out

V. EXAMPLES
• example
– 1. I squeezed her hand gratefully.
– 2. She immediately squeezed the poison out and that probably saved her life.

VI. PHRASEOLOGY
• phraseological unit: to squeeze a quart into a pint pot
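The six-level entry for "to squeeze" can be read as a nested record. The following is only a minimal sketch in Python: the class and field names (LexicalUnit, Sense, SyntacticStructure) and the choice to nest collocations and examples under a syntactic structure are assumptions made for illustration, not the project's custom DTD or actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyntacticStructure:
    structure: str                       # formal label, e.g. "vb-obj"
    pattern: str                         # verbalized pattern, e.g. "sb squeezes sth"
    collocations: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)


@dataclass
class Sense:
    indicator: str                       # short label shown in the sense menu
    frame: str                           # full-sentence definition, ARGUMENTS in capitals
    structures: List[SyntacticStructure] = field(default_factory=list)


@dataclass
class LexicalUnit:
    headword: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)
    phraseology: List[str] = field(default_factory=list)


# The slide's entry, typed in by hand (hypothetical layout, original wording).
squeeze = LexicalUnit(
    headword="to squeeze",
    part_of_speech="verb",
    senses=[
        Sense(
            indicator="grip firmly",
            frame="If a PERSON squeezes an OBJECT, s|he presses it firmly, "
                  "usually with his|her hands.",
            structures=[SyntacticStructure(
                structure="vb-obj",
                pattern="sb squeezes sth",
                collocations=["to squeeze (sb's) [hand, arm]"],
                examples=["I squeezed her hand gratefully."],
            )],
        ),
        Sense(
            indicator="press out liquid",
            frame="If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an "
                  "OBJECT, s|he gets the liquid or substance out by pressing the object.",
            structures=[SyntacticStructure(
                structure="vb-out-obj",
                pattern="sb squeezes sth out",
                collocations=["to squeeze [the poison, the venom] out"],
                examples=["She immediately squeezed the poison out and that "
                          "probably saved her life."],
            )],
        ),
    ],
    phraseology=["to squeeze a quart into a pint pot"],
)


def sense_menu(unit: LexicalUnit) -> List[str]:
    """Return the numbered semantic indicators, i.e. the sense menu."""
    return [f"{i}. {sense.indicator}" for i, sense in enumerate(unit.senses, start=1)]
```

Calling `sense_menu(squeeze)` yields the two indicators in order, which is the view a general user would see before opening the frames, patterns and collocations.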

I. Lexical Unit

• link to the lexicon
– morphosyntactic information
– corpus frequency
– pronunciation etc.
• additional grammatical information
– can be inferred (un/countability etc.)
– manual (part-of-speech subtypes etc.)

II. Semantic Level

• Semantic Indicators
– simple EFL-like explanations or synonyms forming a sense menu
– self-explanatory in relation to each other
• Semantic Frames
– COBUILD / FrameNet / Corpus Pattern Analysis
– combination of the systems

Semantic Indicators

pršet (to rain)   sloveso (verb)

1 padat déšť (to fall, of rain)
1.1 o věcech (of things)
2 objevovat se ve velkém množství (to appear in large numbers)

Semantic Frames

• identification of verb/semantic arguments
– prototypical pattern – “the norm” (Hanks)
– the headword in its syntactic environment
• identification of semantic types in particular syntactic positions
• the semantic scenario
– a full-sentence definition making a link between the arguments and the situation (FN) typical for a particular sense

Semantic Frame

1 padat déšť
– když prší, padají kapky z mraků na zem (when it rains, drops fall from the clouds to the ground)

1.1 o věcech
– když VĚCI nebo jejich SOUČÁSTI prší, padají jako kapky deště na zem (when THINGS or their PARTS rain down, they fall to the ground like raindrops)

2 objevovat se ve velkém množství
– když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISM or QUESTIONS rain down, it means there are many of them)

III. Syntactic Level

• semantic frame (between semantics and syntax)
  • semantic arguments in capital letters (ID-ed)
  • linked with collocates via syntax
• syntactic structures (formal)
  • clause and phrase level (all POS; only for NLP)
  • the number of syntactic structures is finite (SLB ~290)
  • source: word sketches (Sketch Engine)
• syntactic patterns (verbalized)
  • valency (only verbs; for lexicography and NLP)
• syntactic combinations
  • more than basic patterns: "to squeeze your eyes shut"

Syntactic Structures

2 objevovat se ve velkém množství (to appear in large numbers)
– když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISM or QUESTIONS rain down, it means there are many of them)

• NP/S+pršet
• ADV+pršet

Syntactic Patterns

2 objevovat se ve velkém množství (to appear in large numbers)
– když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISM or QUESTIONS rain down, it means there are many of them)

• NP/S+pršet
– co prší (what rains)
– co prší na co/koho (what rains on what/whom)

IV. Collocation Level

● SEMANTIC FRAME:
• 1 když prší, padají kapky z mraků na zem (when it rains, drops fall from the clouds to the ground)
• 2 když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISM or QUESTIONS rain down, it means there are many of them)

● SYNTACTIC STRUCTURES AND PATTERNS:
1
• NP/S+pršet: co prší
• LOKACE (LOCATION): pršet na co/koho; pršet na čem
2
• NP/S+pršet: co prší; co prší na co/koho

If parts of a syntactic pattern are collocational, they are shown on the collocation level.

● COLLOCATIONS
■ [kapky, déšť] pršet ([drops, rain] rain)
■ [kritika, dotazy] prší ([criticism, questions] rain down)
■ pršet na [zem] (to rain on [the ground])
■ pršet na [hlavu] (to rain on [the head])
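The bracket notation used for collocations, where a slot such as [kapky, déšť] lists the collocates that can fill one position, can be expanded mechanically into one string per collocate. The sketch below assumes that reading of the brackets; the function name `expand_collocation` is hypothetical and is not part of any tool named in these slides.

```python
import re
from typing import List

# A slot is a bracketed, comma-separated list of collocates, e.g. "[kapky, déšť]".
_SLOT = re.compile(r"\[([^\]]*)\]")


def expand_collocation(notation: str) -> List[str]:
    """Expand every [a, b, ...] slot into one plain string per collocate."""
    match = _SLOT.search(notation)
    if match is None:
        return [notation.strip()]
    collocates = [c.strip() for c in match.group(1).split(",") if c.strip()]
    results: List[str] = []
    for collocate in collocates:
        expanded = notation[:match.start()] + collocate + notation[match.end():]
        # Recurse so that any further slots in the same line are expanded too.
        results.extend(expand_collocation(expanded))
    return results
```

For example, `expand_collocation("[kapky, déšť] pršet")` gives the two combinations "kapky pršet" and "déšť pršet", and `expand_collocation("pršet na [zem]")` gives the single string "pršet na zem".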

V. Examples

● COLLOCATIONS
■ [kapky, déšť] pršet
■ [kritika, dotazy] prší
■ pršet na [zem]
■ pršet na [hlavu]

● EXAMPLES (TBL + GDEX)
• Dívám se z okna, jak prší déšť. (I watch from the window as the rain falls.)
• Tato klenba zadržuje vodu, která pak skrze průduchy prší na zem. (This vault holds back water, which then rains down to the ground through the vents.)
• Nevýhodou přilby s otvory je, že při dešti Vám prší na hlavu. (The drawback of a helmet with openings is that in the rain it rains on your head.)
• Na nakladatelství pršely dotazy, zda kniha vyjde i česky. (Questions rained down on the publisher as to whether the book would also come out in Czech.)
• Zdrcující kritika pršela na adresu vlády i na tiskové konferenci, kterou v úterý uspořádal Svaz obchodu a cestovního ruchu (SOCR). (Crushing criticism also rained down on the government at the press conference held on Tuesday by the Confederation of Commerce and Tourism (SOCR).)

FOR WHAT: reference
FOR WHOM: general user; school population; Slovene as foreign language
WHAT: semantic info menus + frames; collocations; corpus examples

FOR WHAT: natural language processing
FOR WHOM: computer; linguist
WHAT: semantic frames; syntactic structures; syntactic patterns; other grammatical info

Corpus Data & Authoring Tools

• FidaPLUS – Gigafida
• Sketch Engine: www.sketchengine.co.uk
– Slovene Sketch Grammar (LBS syn. structures)
– Tick-box Lexicography
– GDEX
• IDM Dictionary Production System
– http://www.idm.fr/products/dictionary_writing_system/27/
– custom DTD

FidaPLUS (Gigafida)

• precursor: FIDA (1997-2000 – 100 million)
• 621 million tokens
• tagged-lemmatized (85% accuracy – rule-based tagger)
• taxonomy
– text types
– medium
– linguistic proof-reading
• time span: 1990 – 2006
• concordancers
– http://www.fidaplus.net/
– http://www.sketchengine.co.uk/

SLB sketch grammar + TBL

• to love


TBL – examples by GDEX


TBL – Entry Editor


GDEX – Good Dictionary Examples

• system for evaluation (ranking) of sentences with respect to their suitability to serve as dictionary examples

• sorting sentences so that good examples do not have to be searched for in hundreds of unusable sentences

• initially trained on English, it did not give good results for other languages
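The ranking idea can be illustrated with a toy scorer. This is not GDEX itself: the three criteria used here (moderate sentence length, a complete sentence, presence of a headword form), the equal weights, and the names `example_score` and `rank_examples` are all assumptions chosen only to show how sentences might be sorted so that likely good examples come first.

```python
from typing import List


def example_score(sentence: str, headword_forms: List[str],
                  min_len: int = 5, max_len: int = 15) -> float:
    """Score a candidate sentence; higher means a more promising example."""
    tokens = sentence.split()
    score = 0.0
    # Assumed criterion 1: prefer sentences of moderate length.
    if min_len <= len(tokens) <= max_len:
        score += 1.0
    # Assumed criterion 2: prefer complete sentences
    # (capitalized start, sentence-final punctuation).
    if sentence[:1].isupper() and sentence.rstrip().endswith((".", "!", "?")):
        score += 1.0
    # Assumed criterion 3: the sentence must contain a form of the headword.
    lowered = [t.strip(".,;:!?()\"").lower() for t in tokens]
    if any(form.lower() in lowered for form in headword_forms):
        score += 1.0
    return score


def rank_examples(sentences: List[str], headword_forms: List[str]) -> List[str]:
    """Sort candidates so that the highest-scoring sentences come first."""
    return sorted(sentences,
                  key=lambda s: example_score(s, headword_forms),
                  reverse=True)
```

With the forms `["prší", "pršely", "pršela"]`, a short complete sentence such as "Na nakladatelství pršely dotazy, zda kniha vyjde i česky." outranks a longer one, and a fragment ranks last. A language-specific configuration would replace these assumed criteria and weights.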


Evaluation


Authoring & search tools

• IDM Dictionary Production System– currently used by lexicographers

• iLex (http://www.emp.dk/) – in the process of evaluation

• T-Lex (http://tshwanedje.com/) – evaluated, stand-by

• ABBYY (http://www.abbyy.com/lingvo_content/) – in the process of evaluation

• Termania (http://www.termania.net/) – online search and visualization tool


Corpora and web concordancers

Corpora

• Gigafida
– corpus of written texts
• KRES
– smaller and more carefully balanced corpus of written texts
• GOS (Govorjena slovenščina)
– corpus of spoken Slovene
• Šolar
– corpus of school essay transcriptions with teachers’ corrections

CONCORDANCERS

• Gigafida, KRES, Šolar
– written
– http://demo.gigafida.net/
• GOS
– spoken
– http://www.korpus-gos.net/

Gigafida

• new generation in the written corpus series
– FIDA (2000), FidaPLUS (2006), Gigafida (2011)
• 1,148,350,213 tokens (1.15 billion)
• simplified taxonomy
• changed copyright status
– 10% can be used freely (downloadable as a data set)
– no authentication for web access
• new annotation tools

Corpus annotation

• new statistical tagger: 92.17 %
• meta-tagger – a combination of the Amebis rule-based tagger and the new statistical tagger
• new lemmatizer: 98-99 %
• new parser under development: MSTParser
• training corpus:
– 500,000 words: manually verified POS tags
– 200,000 words (~11,300 sentences): manually verified dependency treebank with only 10 labels
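The slides do not say how the meta-tagger combines its two inputs. As one possible illustration only, the sketch below keeps the statistical tag when both taggers agree or when the statistical tagger is confident, and otherwise falls back to the rule-based tag. The threshold, the confidence scores, the tag labels and the function names are all hypothetical.

```python
from typing import List, Tuple


def meta_tag(rule_tags: List[str],
             stat_tags: List[Tuple[str, float]],
             threshold: float = 0.9) -> List[str]:
    """Combine two taggers' outputs token by token (assumed strategy)."""
    if len(rule_tags) != len(stat_tags):
        raise ValueError("both taggers must tag the same tokens")
    combined: List[str] = []
    for rule_tag, (stat_tag, confidence) in zip(rule_tags, stat_tags):
        if rule_tag == stat_tag or confidence >= threshold:
            # Agreement, or a confident statistical decision.
            combined.append(stat_tag)
        else:
            # Disagreement with low confidence: trust the rule-based tagger.
            combined.append(rule_tag)
    return combined


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of tokens whose predicted tag matches the manually verified tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Accuracy figures such as those quoted for the tagger and lemmatizer are ratios of this kind, computed against manually verified data such as the training corpus.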

Taxonomy (share of corpus, %)

PRINTED              87.1
  BOOKS               6.5
    FICTION           2.1
    NON-FICTION       4.4
  PERIODICALS        79.9
    NEWSPAPERS       57.7
    MAGAZINES        22.2
  OTHER               0.7
INTERNET             12.9
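The taxonomy figures can be checked for internal consistency: each category's subcategories should add up to the category's own share, and the top-level shares should add up to 100. The sketch below assumes the figures are percentages and that OTHER belongs under PRINTED (both readings are inferred from the sums, not stated on the slide); the nested-dictionary layout and the name `is_consistent` are illustrative choices.

```python
from decimal import Decimal

# Shares as listed on the slide, with decimal commas written as points.
# Each entry is (share, subcategories).
TAXONOMY = {
    "PRINTED": (Decimal("87.1"), {
        "BOOKS": (Decimal("6.5"), {
            "FICTION": (Decimal("2.1"), {}),
            "NON-FICTION": (Decimal("4.4"), {}),
        }),
        "PERIODICALS": (Decimal("79.9"), {
            "NEWSPAPERS": (Decimal("57.7"), {}),
            "MAGAZINES": (Decimal("22.2"), {}),
        }),
        "OTHER": (Decimal("0.7"), {}),
    }),
    "INTERNET": (Decimal("12.9"), {}),
}


def is_consistent(children: dict, expected_total: Decimal) -> bool:
    """Check that shares sum to the parent's share at every level."""
    if not children:
        return True  # a leaf category has nothing to add up
    total = sum(share for share, _ in children.values())
    if total != expected_total:
        return False
    return all(is_consistent(sub, share) for share, sub in children.values())
```

`Decimal` is used instead of floats so that sums such as 6.5 + 79.9 + 0.7 compare exactly with 87.1.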

KRES & free corpus

• KRES (in development)
– 100 million words
– online
– balanced
• Free corpus (in development)
– 100 million words
– 10% of each corpus document
– downloadable data set

Taxonomy KRES (share of corpus, %)

PRINTED              80
  BOOKS              35
    FICTION          17
    NON-FICTION      18
  PERIODICALS        40
    NEWSPAPERS       20
    MAGAZINES        20
  OTHER               5
INTERNET             20

GOS

• the first corpus of spoken Slovene
– 120 hours of speech
– one million words
• criteria
– demographic
– speech type/situation
– additional (language learning, 15%)
• transcription
– pronunciation-based
– standardized

Web concordancers

• Log analysis of FidaPLUS concordancer
• FidaPLUS web survey
• Analysis of existing corpus tools
• Analysis of popular web tools (Google etc.)
• Final goal
– use in classroom and by general public
– linguists can use existing tools (SkE, CWB, etc.)

Survey – findings

• Simple search – regularly used by 72% of users
• Advanced search – rarely used (only 8% use it regularly)
• Lack of intuitiveness
• The manual is nearly indispensable for learning how to use a corpus tool
• “…if you are not using the interface for a while, you forget what the search commands are, and you don’t (want to) bother with looking into the manual”
• “…the interface should have a modern design, it should be more user-friendly, and its use should be clear and transparent”

Main design principles

• similarity to the well-known non-linguistic tools (e.g. Google)
• No registration
• Minimum navigation
• No redundant functions (less is more)
• Simplicity of searches
• Help and tips in pop-up windows
• Simple descriptions of functionality (no terminology)

The result

• two concordancers
– written corpora: Gigafida, other w. corpora
– spoken corpus: GOS
• only one meta-character: quotation marks
• extensive use of filters
– multiple possible lemmas
– use of capital letters
– immediate access to meta-information
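A query language with quotation marks as its only meta-character is simple to parse. The sketch below assumes, without confirmation from the slides, that a quoted string requests an exact word form or phrase while an unquoted word is looked up as a lemma; the `Term` class and `parse_query` function are hypothetical names, not the concordancer's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Term:
    text: str
    exact: bool  # True if the user wrapped the term in quotation marks


def parse_query(query: str) -> List[Term]:
    """Split a query into terms, treating '"' as the only meta-character."""
    terms: List[Term] = []
    inside_quotes = False
    current = ""
    for ch in query:
        if ch == '"':
            # A quotation mark closes the current term and toggles the mode.
            if current.strip():
                terms.append(Term(current.strip(), exact=inside_quotes))
            current = ""
            inside_quotes = not inside_quotes
        elif ch.isspace() and not inside_quotes:
            # Outside quotes, whitespace separates terms.
            if current:
                terms.append(Term(current, exact=False))
            current = ""
        else:
            current += ch
    if current.strip():
        # An unclosed quotation mark is treated as if it were closed at the end.
        terms.append(Term(current.strip(), exact=inside_quotes))
    return terms
```

Under these assumptions, `parse_query('pršet "na hlavu"')` returns one lemma term for "pršet" and one exact term for "na hlavu"; everything else (lemma choice, capitalization, metadata) would be handled by filters rather than by query syntax.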