"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek...

"Communication in Slovene"

with an emphasis on the Slovene lexical database and corpora

Simon Krek

Amebis, d.o.o., Kamnik, Slovenia
Jožef Stefan Institute, Ljubljana, Slovenia

European Union & Slovene Ministry of Education and Sport

• The operation is partly financed by the European Union, the European Social Fund, and the Ministry of Education and Sport of the Republic of Slovenia. The operation is being carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013.

“Communication in Slovene”

• http://www.slovenscina.eu/
• Leading partner: Amebis, d. o. o., Kamnik
• Duration: June 2008 - December 2013
• Total value: 3.2 million Euro
• Project consortium:
  – Amebis, d. o. o., Kamnik (http://www.amebis.si/)
  – Jožef Stefan Institute (http://www.ijs.si/)
  – University of Ljubljana (http://www.uni-lj.si/)
  – Scientific Research Centre of the Slovenian Academy of Sciences and Arts (http://www.zrc-sazu.si/)
  – Trojina, Institute for Applied Slovene Studies (http://www.trojina.si/)

Goals

• Natural Language Processing Tools and Resources
• Didactics
• Language description (and standardization)
• Language Data

Slovene Lexical Database

Timeline

• Number of lexical units: minimum 2,500
• June-October 2008: preparation
• November 2008-June 2009: specifications

June 2010

June 2011

June 2012

Legal aspects

• Creative Commons
  – Attribution – Share Alike – Noncommercial
• Availability
  – On-line (http://www.termania.net/)
  – Dataset (http://www.slovenscina.eu/)
• Owner: Ministry of Education and Sport
• Future: Slovene HLT Agency?


Past experience

• International (early):
  – GENELEX (1990-94)
  – LE PAROLE (1993-98)
  – SIMPLE (1998-2002)
  -----------------------------------------------
  – ACQUILEX I, II (- 1995)
  – ILC
  – DELIS …
• Individual languages:
  – elexiko (DE): http://www.owid.de/elexiko_/index.html
  – CLIPS (IT): http://www.ilc.cnr.it/clips/quattro_livelli_web.ppt
  – CORNETTO (NL): http://www2.let.vu.nl/oz/cornetto/index.html
  – ALFALEX (FR): http://www.kuleuven.be/alfalex/index.php?id=&ng=0
  – STO (DK): http://www.cst.dk/cgi-bin/sto/defisto
  – ADESSE (SP): http://adesse.uvigo.es/
  – GRIAL (SP): http://grial.uab.es/
  – CEGLEX (PL): http://www.staff.amu.edu.pl/~zlisi/projects/ceglex/index.en.html
  – SALDO (S): http://spraakbanken.gu.se/
  – BLF (FR): https://ilt.kuleuven.be/blf/
  – PRALED (CZ), ...
• Important for us: FrameNet, Corpus Pattern Analysis, DANTE, COBUILD

Basics

• corpus data analysis
• lexicogrammatical approach
• semantics and syntax are not separated
• valency – colligation – collocation
• meaning = meaning potential – is not stable (norms & exploitations)
• lumpers vs. splitters = splitters
• lexicography first, NLP second

Lexicogrammatical continuum

• semantic indicator
• semantic frame
• syntactic structure & pattern
• syntactic combination
• collocation
• extended collocation
• example
• phraseology

I. LEXICAL UNIT
• headword: to squeeze
• part-of-speech: verb

II. SENSE
• indicator:
  – 1. grip firmly
  – 2. press out liquid
• frame:
  – 1. If a PERSON squeezes an OBJECT, s|he presses it firmly, usually with his|her hands.
  – 2. If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an OBJECT, s|he gets the liquid or substance out by pressing the object.
• multi-word unit (only nouns and adjectives)

III. SYNTAX
• structure:
  – 1. vb-obj
  – 2. vb-out-obj
• pattern:
  – 1. sb squeezes sth
  – 2. sb squeezes sth out
• combination: (to squeeze your eyes shut)

IV. COLLOC'S
• collocation:
  – 1. to squeeze (sb's) [hand, arm]
  – 2. to squeeze [the poison, the venom] out

V. EXAMPLES
• example:
  – 1. I squeezed her hand gratefully.
  – 2. She immediately squeezed the poison out and that probably saved her life.

VI. PHRASEOLOGY
• phraseological unit: to squeeze a quart into a pint pot
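The six levels of the "to squeeze" entry can be pictured as a nested record. The following is only a minimal sketch of such a structure: the class and field names (`LexicalUnit`, `Sense`, `sense_menu`) are hypothetical illustrations and do not reproduce the project's custom DTD or any actual database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    indicator: str  # II. short semantic indicator shown in the sense menu
    frame: str      # II. full-sentence semantic frame
    structures: list[str] = field(default_factory=list)    # III. formal structures
    patterns: list[str] = field(default_factory=list)      # III. verbalized patterns
    collocations: list[str] = field(default_factory=list)  # IV. collocations
    examples: list[str] = field(default_factory=list)      # V. corpus examples


@dataclass
class LexicalUnit:
    headword: str        # I. headword
    part_of_speech: str  # I. part of speech
    senses: list[Sense] = field(default_factory=list)
    combinations: list[str] = field(default_factory=list)  # III. syntactic combinations
    phraseology: list[str] = field(default_factory=list)   # VI. phraseological units

    def sense_menu(self) -> list[str]:
        """Numbered list of semantic indicators, one per sense."""
        return [f"{n}. {s.indicator}" for n, s in enumerate(self.senses, start=1)]


# The entry from the slide, filled in level by level.
squeeze = LexicalUnit(
    headword="to squeeze",
    part_of_speech="verb",
    senses=[
        Sense(
            indicator="grip firmly",
            frame="If a PERSON squeezes an OBJECT, s|he presses it firmly, "
                  "usually with his|her hands.",
            structures=["vb-obj"],
            patterns=["sb squeezes sth"],
            collocations=["to squeeze (sb's) [hand, arm]"],
            examples=["I squeezed her hand gratefully."],
        ),
        Sense(
            indicator="press out liquid",
            frame="If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an "
                  "OBJECT, s|he gets the liquid or substance out by pressing the object.",
            structures=["vb-out-obj"],
            patterns=["sb squeezes sth out"],
            collocations=["to squeeze [the poison, the venom] out"],
            examples=["She immediately squeezed the poison out and that "
                      "probably saved her life."],
        ),
    ],
    combinations=["to squeeze your eyes shut"],
    phraseology=["to squeeze a quart into a pint pot"],
)

print(squeeze.sense_menu())
```

Keeping structures, patterns, collocations and examples inside each sense mirrors the idea that semantics and syntax are not separated: every syntactic or collocational fact is attached to the sense it belongs to.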

I. Lexical Unit

• link to the lexicon
  – morphosyntactic information
  – corpus frequency
  – pronunciation etc.
• additional grammatical information
  – can be inferred (un/countability etc.)
  – manual (part-of-speech subtypes etc.)

II. Semantic Level

• Semantic Indicators
  – simple EFL-like explanations or synonyms forming a sense menu
  – self-explanatory in relation to each other
• Semantic Frames
  – COBUILD / FrameNet / Corpus Pattern Analysis
  – combination of the systems

Semantic Indicators

pršet (sloveso) [to rain (verb)]

1 padat déšť [of rain: to fall]
1.1 o věcech [of things]
2 objevovat se ve velkém množství [to appear in large quantities]

Semantic Frames

• identification of verb/semantic arguments
  – prototypical pattern – “the norm” (Hanks)
  – the headword in its syntactic environment
• identification of semantic types in particular syntactic positions
• the semantic scenario
  – a full-sentence definition making a link between the arguments and the situation (FN) typical for a particular sense

Semantic Frame

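1 padat déšť
když prší, padají kapky z mraků na zem
[when it rains, drops fall from the clouds to the ground]

1.1 o věcech
když VĚCI nebo jejich SOUČÁSTI prší, padají jako kapky deště na zem
[when THINGS or their PARTS rain down, they fall to the ground like raindrops]

2 objevovat se ve velkém množství
když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně
[when CRITICISM or QUESTIONS rain down, it means there are many of them]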


III. Syntactic Level

• semantic frame (between semantics and syntax)
  – semantic arguments in capital letters (ID-ed)
  – linked with collocates via syntax
• syntactic structures (formal)
  – clause and phrase level (all POS; only for NLP)
  – the number of syntactic structures is finite (SLB ~290)
  – source: word sketches (Sketch Engine)
• syntactic patterns (verbalized)
  – valency (only verbs; for lexicography and NLP)
• syntactic combinations
  – more than basic patterns: "to squeeze your eyes shut"

Syntactic Structures

• NP/S+pršet

• ADV+pršet



Syntactic Patterns

• NP/S+pršet
  – co prší [what rains]
  – co prší na co/koho [what rains on what/whom]



IV. Collocation Level

● SEMANTIC FRAME:
• 1 když prší, padají kapky z mraků na zem [when it rains, drops fall from the clouds to the ground]
• 2 když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně [when CRITICISM or QUESTIONS rain down, it means there are many of them]

● SYNTACTIC STRUCTURES AND PATTERNS:
• 1 NP/S+pršet: co prší
• 1 LOKACE [location]: pršet na co/koho; pršet na čem
• 2 NP/S+pršet: co prší; co prší na co/koho

If parts of a syntactic pattern are collocational, they are shown on the collocation level.

● COLLOCATIONS
■ [kapky, déšť] pršet [drops, rain]
■ [kritika, dotazy] prší [criticism, questions]
■ pršet na [zem] [on the ground]
■ pršet na [hlavu] [on the head]

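V. Examples

● COLLOCATIONS
■ [kapky, déšť] pršet
■ [kritika, dotazy] prší
■ pršet na [zem]
■ pršet na [hlavu]

● EXAMPLES (TBL + GDEX)
• Dívám se z okna, jak prší déšť. [I am watching from the window as the rain falls.]
• Tato klenba zadržuje vodu, která pak skrze průduchy prší na zem. [This vault retains water, which then rains down to the ground through the vents.]
• Nevýhodou přilby s otvory je, že při dešti Vám prší na hlavu. [The drawback of a helmet with holes is that when it rains, it rains on your head.]
• Na nakladatelství pršely dotazy, zda kniha vyjde i česky. [Questions rained down on the publisher as to whether the book would also come out in Czech.]
• Zdrcující kritika pršela na adresu vlády i na tiskové konferenci, kterou v úterý uspořádal Svaz obchodu a cestovního ruchu (SOCR). [Devastating criticism also rained down on the government at the press conference held on Tuesday by the Svaz obchodu a cestovního ruchu (SOCR).]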

FOR WHAT: reference
• FOR WHOM: general user, school population, Slovene as foreign language
• WHAT: semantic info menus + frames, collocations, corpus examples

FOR WHAT: natural language processing
• FOR WHOM: computer, linguist
• WHAT: semantic frames, syntactic structures, syntactic patterns, other grammatical info

Corpus Data & Authoring Tools

• FidaPLUS – Gigafida
• Sketch Engine: http://www.sketchengine.co.uk/
  – Slovene Sketch Grammar (LBS syn. structures)
  – Tick-box Lexicography
  – GDEX
• IDM Dictionary Production System
  – http://www.idm.fr/products/dictionary_writing_system/27/
  – custom DTD

FidaPLUS (Gigafida)

• precursor: FIDA (1997-2000 – 100 million)
• 621 million tokens
• tagged-lemmatized (85% accuracy – rule-based tagger)
• taxonomy
  – text types
  – medium
  – linguistic proof-reading
• time span: 1990 – 2006
• concordancers
  – http://www.fidaplus.net/
  – http://www.sketchengine.co.uk/

SLB sketch grammar + TBL

• to love

TBL – examples by GDEX

TBL – Entry Editor

GDEX – Good Dictionary Examples

• system for evaluation (ranking) of sentences with respect to their suitability to serve as dictionary examples

• sorting sentences so that good examples do not have to be searched for in hundreds of unusable sentences

• initially trained on English, it did not give good results for other languages
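The ranking idea can be illustrated with a toy scorer. This is only a sketch of the general principle (score each candidate sentence, then sort): the features, weights, and names below (`gdex_like_score`, `rank_examples`, `common_words`) are hypothetical and are not the actual GDEX configuration used in Sketch Engine or in the project.

```python
import re


def gdex_like_score(sentence, headword_forms, common_words, min_len=5, max_len=20):
    """Return a score in [0, 1]; higher means a more usable dictionary example."""
    tokens = re.findall(r"\w+", sentence.lower())
    # A sentence without the headword cannot illustrate it at all.
    if not tokens or not any(t in headword_forms for t in tokens):
        return 0.0
    score = 1.0
    # Penalize sentences that are too short or too long to read comfortably.
    if not (min_len <= len(tokens) <= max_len):
        score -= 0.4
    # Penalize fragments: no initial capital or no final punctuation.
    if not (sentence[0].isupper() and sentence.rstrip()[-1] in ".!?"):
        score -= 0.3
    # Penalize sentences in proportion to their share of uncommon words.
    rare = sum(1 for t in tokens if t not in common_words and t not in headword_forms)
    score -= 0.3 * rare / len(tokens)
    return max(score, 0.0)


def rank_examples(sentences, headword_forms, common_words):
    """Sort candidate sentences so that the best examples come first."""
    return sorted(
        sentences,
        key=lambda s: gdex_like_score(s, headword_forms, common_words),
        reverse=True,
    )


forms = {"squeeze", "squeezes", "squeezed", "squeezing"}
common = {"i", "her", "hand", "she", "the", "out", "and", "that", "life",
          "saved", "probably", "immediately", "gratefully"}
candidates = [
    "squeezed lemons tho lol",
    "She immediately squeezed the poison out and that probably saved her life.",
    "I squeezed her hand gratefully.",
]
for sentence in rank_examples(candidates, forms, common):
    print(sentence)
```

Because features such as the list of common words and the preferred sentence length depend on the language, a configuration tuned for one language cannot be expected to transfer unchanged to another.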

Evaluation

Authoring & search tools

• IDM Dictionary Production System – currently used by lexicographers
• iLex (http://www.emp.dk/) – in the process of evaluation
• T-Lex (http://tshwanedje.com/) – evaluated, stand-by
• ABBYY (http://www.abbyy.com/lingvo_content/) – in the process of evaluation
• Termania (http://www.termania.net/) – online search and visualization tool




Corpora and web concordancers

Corpora

• Gigafida
  – corpus of written texts
• KRES
  – smaller and more carefully balanced corpus of written texts
• GOS (Govorjena slovenščina, "Spoken Slovene")
  – corpus of spoken Slovene
• Šolar
  – corpus of school essay transcriptions with teachers’ corrections

CONCORDANCERS

• Gigafida, KRES, Šolar
  – written
  – http://demo.gigafida.net/
• GOS
  – spoken
  – http://www.korpus-gos.net/

Gigafida

• new generation in the written corpus series
  – FIDA (2000), FidaPLUS (2006), Gigafida (2011)
• 1,148,350,213 tokens (1.15 billion)
• simplified taxonomy
• changed copyright status
  – 10% can be used freely (downloadable as a data set)
  – no authentication for web access
• new annotation tools

Corpus annotation

• new statistical tagger: 92.17 % accuracy
• meta-tagger – a combination of the Amebis rule-based tagger and the new statistical tagger
• new lemmatizer: 98-99 % accuracy
• new parser under development: MSTParser
• training corpus:
  – 500,000 words: manually verified POS tags
  – 200,000 words (~11,300 sentences): manually verified dependency treebank with only 10 labels

Taxonomy

(shares in %)

PRINTED 87.1
  BOOKS 6.5
    FICTION 2.1
    NON-FICTION 4.4
  PERIODICALS 79.9
    NEWSPAPERS 57.7
    MAGAZINES 22.2
  OTHER 0.7
INTERNET 12.9
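The taxonomy figures are nested: each category's share is the sum of its subcategories, and the two top-level categories add up to 100. The sketch below checks this and converts a share into an approximate token count. It assumes the percentages are shares of tokens, which the slide does not state; the names `GIGAFIDA_TAXONOMY`, `share` and `approx_tokens` are hypothetical.

```python
# Total corpus size as given for Gigafida.
GIGAFIDA_TOKENS = 1_148_350_213

# Leaf values are percentages; inner dicts are categories with subcategories.
GIGAFIDA_TAXONOMY = {
    "PRINTED": {
        "BOOKS": {"FICTION": 2.1, "NON-FICTION": 4.4},
        "PERIODICALS": {"NEWSPAPERS": 57.7, "MAGAZINES": 22.2},
        "OTHER": 0.7,
    },
    "INTERNET": 12.9,
}


def share(node):
    """Percentage share of a category: a leaf value or the sum of its children."""
    if isinstance(node, dict):
        return sum(share(child) for child in node.values())
    return node


def approx_tokens(node, total=GIGAFIDA_TOKENS):
    """Approximate token count for a category (ASSUMES shares are by tokens)."""
    return round(total * share(node) / 100)


printed = GIGAFIDA_TAXONOMY["PRINTED"]
print("BOOKS:", round(share(printed["BOOKS"]), 1))
print("PERIODICALS:", round(share(printed["PERIODICALS"]), 1))
print("PRINTED:", round(share(printed), 1))
print("TOTAL:", round(share(GIGAFIDA_TAXONOMY), 1))
print("INTERNET tokens (approx.):", approx_tokens(GIGAFIDA_TAXONOMY["INTERNET"]))
```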

KRES & free corpus

• KRES (in development)
  – 100 million words
  – online
  – balanced
• Free corpus (in development)
  – 100 million words
  – 10% of each corpus document
  – downloadable data set

Taxonomy KRES

(shares in %)

PRINTED 80
  BOOKS 35
    FICTION 17
    NON-FICTION 18
  PERIODICALS 40
    NEWSPAPERS 20
    MAGAZINES 20
  OTHER 5
INTERNET 20

GOS

• the first corpus of spoken Slovene
  – 120 hours of speech
  – one million words
• criteria
  – demographic
  – speech type/situation
  – additional (language learning, 15%)
• transcription
  – pronunciation-based
  – standardized

Web concordancers

• Log analysis of FidaPLUS concordancer
• FidaPLUS web survey
• Analysis of existing corpus tools
• Analysis of popular web tools (Google etc.)
• Final goal
  – use in classroom and by general public
  – linguists can use existing tools (SkE, CWB, etc.)

Survey – findings

• Simple search – regularly used by 72% of users
• Advanced search – rarely used (only 8% use it regularly)
• Lack of intuitiveness
• The manual is almost essential for learning how to use a corpus tool
• “…if you are not using the interface for a while, you forget what the search commands are, and you don’t (want to) bother with looking into the manual”
• “…the interface should have a modern design, it should be more user-friendly, and its use should be clear and transparent”

Main design principles

• Similarity to well-known non-linguistic tools (e.g. Google)
• No registration
• Minimum navigation
• No redundant functions (less is more)
• Simplicity of searches
• Help and tips in pop-up windows
• Simple descriptions of functionality (no terminology)

The result

• two concordancers
  – written corpora: Gigafida, other written corpora
  – spoken corpus: GOS
• only one meta-character: quotation marks
• extensive use of filters
  – multiple possible lemmas
  – use of capital letters
  – immediate access to meta-information
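A query language with quotation marks as its only meta-character is simple to parse. The sketch below shows one possible reading, under the assumption (not stated on the slide) that a quoted term is matched as an exact form while an unquoted term is looked up more loosely, for example through its possible lemmas. The names `Term` and `parse_query` are hypothetical and do not describe the concordancers' actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Term:
    text: str
    exact: bool  # True if the term was enclosed in quotation marks


def parse_query(query: str) -> list[Term]:
    """Split a query into terms; quotation marks are the only meta-character."""
    terms = []
    # Each match is either a double-quoted span or a run of plain characters.
    for quoted, bare in re.findall(r'"([^"]*)"|([^\s"]+)', query):
        if quoted.strip():
            # ASSUMPTION: quoted text is searched exactly as typed.
            terms.append(Term(quoted.strip(), exact=True))
        elif bare:
            # ASSUMPTION: bare text is expanded later (lemmas, filters).
            terms.append(Term(bare, exact=False))
    return terms


for term in parse_query('"to squeeze" hand'):
    print(term)
```

Everything beyond this single distinction (choice of lemma, capitalization, meta-information) is then left to filters applied to the result list rather than to query syntax.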