"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek...
"Communication in Slovene"
with an emphasis on the Slovene lexical database and corpora
Simon Krek
Amebis, d.o.o., Kamnik, Slovenia
Jožef Stefan Institute, Ljubljana, Slovenia
European Union & Slovene Ministry of Education and Sport
• The operation is partly financed by the European Union, the European Social Fund, and the Ministry of Education and Sport of the Republic of Slovenia. The operation is being carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013.
“Communication in Slovene”
• http://www.slovenscina.eu
• Leading partner: Amebis, d. o. o., Kamnik
• Duration: June 2008 - December 2013
• Total value: 3.2 million euros
• Project consortium:
– Amebis, d. o. o., Kamnik
– Jožef Stefan Institute
– University of Ljubljana
– Scientific Research Centre of the Slovenian Academy of Sciences and Arts
– Trojina, Institute for Applied Slovene Studies
Goals
• Natural Language Processing Tools and Resources
• Didactics
• Language description (and standardization)
• Language Data
Today
Slovene Lexical Database
Timeline
• Number of lexical units: minimum 2,500
• June-October 2008: preparation
• November 2008-June 2009: specifications
• June 2010
• June 2011
• June 2012
Legal aspects
• Creative Commons
– Attribution – Share Alike – Noncommercial
• Availability
– On-line (http://www.termania.net/)
– Dataset (http://www.slovenscina.eu/)
• Owner: Ministry of Education and Sport
• Future: Slovene HLT Agency?
Past experience
• International (early):
– GENELEX (1990-94)
– LE PAROLE (1993-98)
– SIMPLE (1998-2002)
– ACQUILEX I, II (- 1995)
– ILC- DELIS …
• Individual languages: elexico (DE), CLIPS (IT), CORNETTO (NL), ALFALEX (FR), STO (DK), ADESSE (SP), GRIAL (SP), CEGLEX (PL), SALDO (S), BLF (FR), PRALED (CZ), ...
• Important for us: FrameNet, Corpus Pattern Analysis, DANTE, COBUILD
Basics
• corpus data analysis
• lexicogrammatical approach
• semantics and syntax are not separated
• valency – colligation – collocation
• meaning = meaning potential – is not stable (norms & exploitations)
• lumpers vs. splitters = splitters
• lexicography first, NLP second
Lexicogrammatical continuum
• semantic indicator
• semantic frame
• syntactic structure & pattern
• syntactic combination
• collocation
• extended collocation
• example
• phraseology
I. LEXICAL UNIT
• headword: to squeeze
• part-of-speech: verb
II. SENSE
• indicator: 1. grip firmly; 2. press out liquid
• frame (sense 1): If a PERSON squeezes an OBJECT, s|he presses it firmly, usually with his|her hands.
• frame (sense 2): If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an OBJECT, s|he gets the liquid or substance out by pressing the object.
• multi-word unit (only nouns and adjectives)
III. SYNTAX
• structure: vb-obj (sense 1); vb-out-obj (sense 2)
• pattern: sb squeezes sth (sense 1); sb squeezes sth out (sense 2)
• combination: (to squeeze your eyes shut)
IV. COLLOC'S
• collocation: to squeeze (sb's) [hand, arm] (sense 1); to squeeze [the poison, the venom] out (sense 2)
V. EXAMPLES
• example (sense 1): I squeezed her hand gratefully.
• example (sense 2): She immediately squeezed the poison out and that probably saved her life.
VI. PHRASEOLOGY
• phraseological unit: to squeeze a quart into a pint pot
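The six levels of the "to squeeze" entry can be pictured as a nested record. The following is only a minimal sketch in Python: the class and field names (LexicalUnit, Sense, sense_menu) are hypothetical and are not taken from the project's custom DTD or from any tool named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """One sense of a lexical unit (levels II-V of the entry)."""
    indicator: str                                          # short sense-menu label
    frame: str                                              # full-sentence definition, arguments in capitals
    structures: list[str] = field(default_factory=list)     # formal, e.g. "vb-obj"
    patterns: list[str] = field(default_factory=list)       # verbalized valency
    collocations: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)


@dataclass
class LexicalUnit:
    """Level I (headword data) plus level VI (phraseology)."""
    headword: str
    part_of_speech: str
    senses: list[Sense] = field(default_factory=list)
    phraseology: list[str] = field(default_factory=list)

    def sense_menu(self) -> list[str]:
        """Numbered indicators, i.e. the menu shown before a sense is opened."""
        return [f"{i}. {s.indicator}" for i, s in enumerate(self.senses, start=1)]


# The entry from the slide, typed in by hand.
squeeze = LexicalUnit(
    headword="to squeeze",
    part_of_speech="verb",
    senses=[
        Sense(
            indicator="grip firmly",
            frame="If a PERSON squeezes an OBJECT, s|he presses it firmly, "
                  "usually with his|her hands.",
            structures=["vb-obj"],
            patterns=["sb squeezes sth"],
            collocations=["to squeeze (sb's) [hand, arm]"],
            examples=["I squeezed her hand gratefully."],
        ),
        Sense(
            indicator="press out liquid",
            frame="If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an OBJECT, "
                  "s|he gets the liquid or substance out by pressing the object.",
            structures=["vb-out-obj"],
            patterns=["sb squeezes sth out"],
            collocations=["to squeeze [the poison, the venom] out"],
            examples=["She immediately squeezed the poison out and that probably "
                      "saved her life."],
        ),
    ],
    phraseology=["to squeeze a quart into a pint pot"],
)

print(squeeze.sense_menu())
```

Keeping structures, patterns, collocations and examples inside each sense mirrors the point that semantics and syntax are not separated; a real schema would likely need identifiers linking frame arguments to collocates.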
I. Lexical Unit
• link to the lexicon
– morphosyntactic information
– corpus frequency
– pronunciation etc.
• additional grammatical information
– can be inferred (un/countability etc.)
– manual (part-of-speech subtypes etc.)
II. Semantic Level
• Semantic Indicators
– simple EFL-like explanations or synonyms forming a sense menu
– self-explanatory in relation to each other
• Semantic Frames
– COBUILD / FrameNet / Corpus Pattern Analysis
– combination of the systems
Semantic Indicators
pršet, sloveso ("to rain", verb)
1 padat déšť ("to fall", of rain)
1.1 o věcech ("of things")
2 objevovat se ve velkém množství ("to appear in large quantities")
Semantic Frames
• identification of verb/semantic arguments
– prototypical pattern – “the norm” (Hanks)
– the headword in its syntactic environment
• identification of semantic types in particular syntactic positions
• the semantic scenario
– a full-sentence definition making a link between the arguments and the situation (FN) typical for a particular sense
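Semantic Frame
1 padat déšť
když prší, padají kapky z mraků na zem ("when it rains, drops fall from the clouds to the ground")
1.1 o věcech
když VĚCI nebo jejich SOUČÁSTI prší, padají jako kapky deště na zem ("when THINGS or their PARTS rain, they fall to the ground like raindrops")
2 objevovat se ve velkém množství
když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně ("when CRITICISMS or QUESTIONS rain, it means that there are many of them")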
III. Syntactic Level
• semantic frame (between semantics and syntax)
– semantic arguments in capital letters (ID-ed)
– linked with collocates via syntax
• syntactic structures (formal)
– clause and phrase level (all POS; only for NLP)
– the number of syntactic structures is finite (SLB ~290)
– source: word sketches (Sketch Engine)
• syntactic patterns (verbalized)
– valency (only verbs; for lexicography and NLP)
• syntactic combinations
– more than basic patterns: "to squeeze your eyes shut"
Syntactic Structures
2 objevovat se ve velkém množství
když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně
• NP/S+pršet
• ADV+pršet
Syntactic Patterns
2 objevovat se ve velkém množství
když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně
• NP/S+pršet
– co prší
– co prší na co/koho
IV. Collocation Level
● SEMANTIC FRAME:
• 1 když prší, padají kapky z mraků na zem
• 2 když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně
● SYNTACTIC STRUCTURES AND PATTERNS:
• 1 NP/S+pršet: co prší
• 1 LOKACE (location): pršet na co/koho; pršet na čem
• 2 NP/S+pršet: co prší; co prší na co/koho
If parts of a syntactic pattern are collocational, they are shown on the collocation level.
● COLLOCATIONS
■ [kapky, déšť] pršet
■ [kritika, dotazy] prší
■ pršet na [zem]
■ pršet na [hlavu]
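V. Examples
● COLLOCATIONS
■ [kapky, déšť] pršet
■ [kritika, dotazy] prší
■ pršet na [zem]
■ pršet na [hlavu]
● EXAMPLES (TBL + GDEX)
• Dívám se z okna, jak prší déšť. ("I am looking out of the window at the rain falling.")
• Tato klenba zadržuje vodu, která pak skrze průduchy prší na zem. ("This vault retains water, which then rains down to the ground through the vents.")
• Nevýhodou přilby s otvory je, že při dešti Vám prší na hlavu. ("The disadvantage of a helmet with holes is that when it rains, it rains on your head.")
• Na nakladatelství pršely dotazy, zda kniha vyjde i česky. ("Questions rained down on the publisher as to whether the book would also come out in Czech.")
• Zdrcující kritika pršela na adresu vlády i na tiskové konferenci, kterou v úterý uspořádal Svaz obchodu a cestovního ruchu (SOCR). ("Crushing criticism of the government also rained down at the press conference held on Tuesday by the Confederation of Commerce and Tourism (SOCR).")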
FOR WHAT: reference
• FOR WHOM: general user; school population; Slovene as foreign language
• WHAT: semantic info menus + frames; collocations; corpus examples
FOR WHAT: natural language processing
• FOR WHOM: computer; linguist
• WHAT: semantic frames; syntactic structures; syntactic patterns; other grammatical info
Corpus Data & Authoring Tools
• FidaPLUS – Gigafida
• Sketch Engine: www.sketchengine.co.uk
– Slovene Sketch Grammar (LBS syn. structures)
– Tick-box Lexicography
– GDEX
• IDM Dictionary Production System
– http://www.idm.fr/products/dictionary_writing_system/27/
– custom DTD
FidaPLUS (Gigafida)
• precursor: FIDA (1997-2000 – 100 million)
• 621 million tokens
• tagged-lemmatized (85% accuracy – rule-based tagger)
• taxonomy
– text types
– medium
– linguistic proof-reading
• time span: 1990 – 2006
• concordancers
– http://www.fidaplus.net/
– http://www.sketchengine.co.uk/
SLB sketch grammar + TBL
• to love
TBL – examples by GDEX
TBL – Entry Editor
GDEX – Good Dictionary Examples
• system for evaluation (ranking) of sentences with respect to their suitability to serve as dictionary examples
• sorting sentences so that good examples do not have to be searched for in hundreds of unusable sentences
• initially trained on English; it did not give good results for other languages
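To make the idea of ranking candidate sentences concrete, here is a small heuristic scorer in Python. It is not the GDEX implementation: the criteria (sentence length, capital letter and final punctuation, absence of markup debris), the weights and the function names are all assumptions chosen only to illustrate the sorting step.

```python
import re


def score_sentence(sentence: str, headword_forms: set[str],
                   min_len: int = 5, max_len: int = 20) -> float:
    """Heuristic score in [0, 1]; higher means a more usable dictionary example."""
    tokens = re.findall(r"\w+", sentence.lower())
    # A sentence that does not contain the headword cannot illustrate it.
    if not any(t in headword_forms for t in tokens):
        return 0.0
    score = 1.0
    # Penalize very short or very long sentences (assumed thresholds).
    if not (min_len <= len(tokens) <= max_len):
        score -= 0.4
    # Penalize fragments: no initial capital or no sentence-final punctuation.
    if not sentence[:1].isupper() or not sentence.rstrip().endswith((".", "!", "?")):
        score -= 0.3
    # Penalize markup debris and URLs.
    if re.search(r"[\[\]<>{}|]|https?://", sentence):
        score -= 0.2
    return max(score, 0.0)


def rank_examples(sentences: list[str], headword_forms: set[str],
                  top_n: int = 3) -> list[str]:
    """Return the best-scoring sentences first, dropping those scored 0."""
    scored = [(score_sentence(s, headword_forms), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # stable sort keeps input order on ties
    return [s for sc, s in scored if sc > 0][:top_n]


candidates = [
    "The weather was nice.",
    "squeezed out http://example.com/x |||",
    "I squeezed her hand gratefully.",
    "She immediately squeezed the poison out and that probably saved her life.",
]
forms = {"squeeze", "squeezes", "squeezed", "squeezing"}
for sentence in rank_examples(candidates, forms, top_n=2):
    print(sentence)
```

Thresholds such as these are language-dependent, which is consistent with the observation above that a configuration tuned on English does not transfer directly to other languages.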
Evaluation
Authoring & search tools
• IDM Dictionary Production System
– currently used by lexicographers
• iLex (http://www.emp.dk/)
– in the process of evaluation
• T-Lex (http://tshwanedje.com/)
– evaluated, stand-by
• ABBYY (http://www.abbyy.com/lingvo_content/)
– in the process of evaluation
• Termania (http://www.termania.net/)
– online search and visualization tool
Corpora and web concordancers
Corpora
• Gigafida
– corpus of written texts
• KRES
– smaller and more carefully balanced corpus of written texts
• GOS (Govorjena slovenščina)
– corpus of spoken Slovene
• Šolar
– corpus of school essay transcriptions with teachers’ corrections
Concordancers
• Gigafida, KRES, Šolar
– written
– http://demo.gigafida.net/
• GOS
– spoken
– http://www.korpus-gos.net/
Gigafida
• new generation in the written corpus series
– FIDA (2000), FidaPLUS (2006), Gigafida (2011)
• 1,148,350,213 tokens (1.15 billion)
• simplified taxonomy
• changed copyright status
– 10% can be used freely (downloadable as a data set)
– no authentication for web access
• new annotation tools
Corpus annotation
• new statistical tagger: 92.17%
• meta-tagger – a combination of the Amebis rule-based tagger and the new statistical tagger
• new lemmatizer: 98-99%
• new parser under development: MSTParser
• training corpus:
– 500,000 words: manually verified POS tags
– 200,000 words (~11,300 sentences): manually verified dependency treebank with only 10 labels
Taxonomy
PRINTED 87.1
  BOOKS 6.5
    FICTION 2.1
    NON-FICTION 4.4
  PERIODICALS 79.9
    NEWSPAPERS 57.7
    MAGAZINES 22.2
  OTHER 0.7
INTERNET 12.9
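The figures in the taxonomy can be read as percentages of the corpus that nest: each category's share should equal the sum of its subcategories. The sketch below checks that arithmetic in Python. The nesting (FICTION and NON-FICTION under BOOKS, NEWSPAPERS and MAGAZINES under PERIODICALS) is an assumption inferred from the numbers, and the names GIGAFIDA and check are hypothetical.

```python
# Shares as listed on the slide, with decimal commas written as points.
# Each node is (share, children); the nesting itself is an assumption.
GIGAFIDA = {
    "PRINTED": (87.1, {
        "BOOKS": (6.5, {"FICTION": (2.1, {}), "NON-FICTION": (4.4, {})}),
        "PERIODICALS": (79.9, {"NEWSPAPERS": (57.7, {}), "MAGAZINES": (22.2, {})}),
        "OTHER": (0.7, {}),
    }),
    "INTERNET": (12.9, {}),
}


def check(children: dict, expected: float, path: str = "TOTAL", tol: float = 0.05) -> list[str]:
    """Return a list of categories whose children do not add up to the parent share."""
    total = sum(share for share, _ in children.values())
    problems = []
    if abs(total - expected) > tol:
        problems.append(f"{path}: children sum to {total:.1f}, expected {expected:.1f}")
    for name, (share, sub) in children.items():
        if sub:  # only categories that are broken down further
            problems += check(sub, share, name, tol)
    return problems


print(check(GIGAFIDA, 100.0))
```

Under this reading the breakdown is internally consistent: 2.1 + 4.4 = 6.5, 57.7 + 22.2 = 79.9, 6.5 + 79.9 + 0.7 = 87.1, and 87.1 + 12.9 = 100.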
KRES & free corpus
• KRES (in development)
– 100 million words
– online
– balanced
• Free corpus (in development)
– 100 million words
– 10% of each corpus document
– downloadable data set
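One way to picture "10% of each corpus document" is a per-document sampler. The slides do not say how the 10% is selected, so the sketch below simply assumes a contiguous slice of tokens starting at a chosen offset; the function name free_sample and its parameters are hypothetical.

```python
def free_sample(tokens: list[str], percent: int = 10, start: int = 0) -> list[str]:
    """Return a contiguous slice covering `percent` % of a document's tokens.

    Assumption: the freely available part is one continuous stretch of text.
    Very short documents still contribute at least one token.
    """
    if not tokens:
        return []
    size = max(1, len(tokens) * percent // 100)   # integer arithmetic, no rounding surprises
    start = min(max(start, 0), len(tokens) - size)  # keep the slice inside the document
    return tokens[start:start + size]


# Toy corpus: document id -> list of tokens.
corpus = {
    "doc1": [f"w{i}" for i in range(50)],
    "doc2": [f"w{i}" for i in range(200)],
}
free_corpus = {doc_id: free_sample(toks) for doc_id, toks in corpus.items()}
print({doc_id: len(toks) for doc_id, toks in free_corpus.items()})
```

Sampling every document at the same rate keeps the composition of the free data set proportional to the source corpus, which is presumably why the excerpt is defined per document rather than as a selection of whole texts.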
Taxonomy KRES
PRINTED 80
  BOOKS 35
    FICTION 17
    NON-FICTION 18
  PERIODICALS 40
    NEWSPAPERS 20
    MAGAZINES 20
  OTHER 5
INTERNET 20
GOS
• the first corpus of spoken Slovene
– 120 hours of speech
– one million words
• criteria
– demographic
– speech type/situation
– additional (language learning, 15%)
• transcription
– pronunciation-based
– standardized
Web concordancers
• Log analysis of FidaPLUS concordancer
• FidaPLUS web survey
• Analysis of existing corpus tools
• Analysis of popular web tools (Google etc.)
• Final goal
– use in classroom and by general public
– linguists can use existing tools (SkE, CWB, etc.)
Survey – findings
• Simple search – regularly used by 72% of users
• Advanced search – rarely used (only 8% use it regularly)
• Lack of intuitiveness
• The manual is almost indispensable for learning how to use a corpus tool
• “…if you are not using the interface for a while, you forget what the search commands are, and you don’t (want to) bother with looking into the manual”
• “…the interface should have a modern design, it should be more user-friendly, and its use should be clear and transparent”
Main design principles
• Similarity to well-known non-linguistic tools (e.g. Google)
• No registration
• Minimum navigation
• No redundant functions (less is more)
• Simplicity of searches
• Help and tips in pop-up windows
• Simple descriptions of functionality (no terminology)
The result
• two concordancers
– written corpora: Gigafida, other written corpora
– spoken corpus: GOS
• only one meta-character: quotation marks
• extensive use of filters
– multiple possible lemmas
– use of capital letters
– immediate access to meta-information
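A search box with quotation marks as its only meta-character could work roughly as follows. The slides do not define what the quotation marks do, so this Python sketch assumes that a quoted term matches the exact word form while an unquoted term matches the lemma; parse_query, concordance and the toy token list are hypothetical.

```python
def parse_query(query: str) -> tuple[str, str]:
    """Split a query into (mode, term).

    Assumption: "term" in quotation marks means exact word form,
    anything else is treated as a lemma search.
    """
    q = query.strip()
    if len(q) >= 2 and q.startswith('"') and q.endswith('"'):
        return ("form", q[1:-1])
    return ("lemma", q.lower())


def concordance(tokens: list[tuple[str, str]], query: str,
                width: int = 3) -> list[tuple[str, str, str]]:
    """Return keyword-in-context hits as (left context, keyword, right context).

    `tokens` is a list of (word form, lemma) pairs from an annotated corpus.
    """
    mode, term = parse_query(query)
    hits = []
    for i, (form, lemma) in enumerate(tokens):
        matched = (form == term) if mode == "form" else (lemma == term)
        if matched:
            left = " ".join(f for f, _ in tokens[max(0, i - width):i])
            right = " ".join(f for f, _ in tokens[i + 1:i + 1 + width])
            hits.append((left, form, right))
    return hits


# Toy annotated text: (word form, lemma).
text = [("She", "she"), ("squeezed", "squeeze"), ("my", "my"), ("hand", "hand"),
        ("and", "and"), ("squeezes", "squeeze"), ("it", "it"), ("again", "again")]

print(concordance(text, "squeeze"))       # lemma search: all forms of the verb
print(concordance(text, '"squeezed"'))    # quoted: this exact form only
```

With lemma search as the default, a casual user gets all inflected forms without learning any query syntax, which fits the stated principle of keeping searches simple; further narrowing would then happen through the filters rather than through query operators.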