SKE modified last3 - Sketch Engine · 2016. 6. 25.
Sketch Engine
SRDANOVIĆ ERJAVEC Irena,
Sketch Engine
Sketch Engine
Web
1 “Word Sketch”
“Thesaurus” “Sketch Difference” Sketch Engine
JpWaC 4 Web
Sketch Engine
1.
1980
10
80
Kilgarriff & Rundell 2002
500 1,000 20,000
2000
Heid et al. 2000, Kilgarriff &
Tugwell 2001 Sketch Engine Kilgarriff et al. 2004
Srdanović et al. 2008
Sketch Engine
Web Word Sketch Thesaurus
Sketch Difference
Sketch Engine
2. Sketch Engine
Sketch Engine Kilgarriff et al. 2004
Erjavec et al. 2007
4 4 Web
Web
Sketch Engine
Sketch Engine
2.1. Sketch Engine
Web Sketch Engine (http://www.sketchengine.com)
4 JpWaC Web 1 Sharoff (2006)
Ueyama & Baroni (2005) Web 5 WAC Baroni &
Bernardini, eds. 2006 BootCat Baroni et al. 2006
HTML
boilerplate removal Web
ChaSen token
lemma tag Erjavec et
al. 2006
.jp .com Erjavec
et al. 2007 Srdanović et al. 2008
Sketch Engine
2 3
URL
Web
JpWaC
2007
3
1 Sketch Engine 2 Sketch Engine
3 Sketch Engine
2.2. Word Sketches
22
Word Sketch, Thesaurus Sketch Difference
ChaSen Gahl 1998
corpus query syntax ( ) 4
Word Sketch
4
4
4 2 1
2 salience 1
modifies_N
( )
4 2 dual
*DUAL
=modifier_Ana/modifies_N
2:"N.Ana" "Aux" "Pref.*"? 1:[tag="N.*" & tag!="N.Suff.*" & tag!="N.bnd.*"]
modifier_Ana modifies_N
2:"N.Ana" "Aux" "Pref.*"? N.Ana Aux
Pref.* 1:[tag="N.*" & tag!="N.Suff.*" & tag!="N.bnd.*"] N.*
N.Suff.* N.bnd.* - - -
- -
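The dual relation modifier_Ana/modifies_N can be illustrated with a minimal Python sketch. It assumes that the quoted strings in the definition are matched against the whole POS tag, that tokens are given as (lemma, tag) pairs, and that the relation yields one triple per direction. The function names and the example words are hypothetical and are not taken from the grammatical relations file.

```python
import re

# Tag patterns copied from the relation definition; whole-tag matching is assumed.
ADJ_NOUN = re.compile(r"N\.Ana")
AUX = re.compile(r"Aux")
PREFIX = re.compile(r"Pref\..*")
NOUN = re.compile(r"N\..*")
EXCLUDED_NOUNS = (re.compile(r"N\.Suff\..*"), re.compile(r"N\.bnd\..*"))


def is_head_noun(tag):
    """Implements [tag="N.*" & tag!="N.Suff.*" & tag!="N.bnd.*"]."""
    if not NOUN.fullmatch(tag):
        return False
    return not any(p.fullmatch(tag) for p in EXCLUDED_NOUNS)


def extract_dual_triples(tokens):
    """Return (headword, relation, collocate) triples for both directions.

    tokens is a list of (lemma, tag) pairs. Position 2 in the definition is
    the adjectival noun, position 1 is the noun it modifies.
    """
    triples = []
    for i in range(len(tokens) - 2):
        modifier, modifier_tag = tokens[i]
        if not ADJ_NOUN.fullmatch(modifier_tag):
            continue
        if not AUX.fullmatch(tokens[i + 1][1]):
            continue
        j = i + 2
        # "Pref.*"? is optional: skip one prefix token if present.
        if PREFIX.fullmatch(tokens[j][1]):
            j += 1
        if j < len(tokens) and is_head_noun(tokens[j][1]):
            noun = tokens[j][0]
            triples.append((noun, "modifier_Ana", modifier))
            triples.append((modifier, "modifies_N", noun))
    return triples


# Hypothetical example: an adjectival noun, an auxiliary, and a general noun.
example = [("きれい", "N.Ana"), ("な", "Aux"), ("花", "N.g")]
print(extract_dual_triples(example))
```

Because the relation is declared dual, a single match contributes to the Word Sketch of both the noun and the adjectival noun.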
* 0
N.* N.g N.Prop
0 1
Sketch Engine
Concordance CQL Corpus Query
Language
• [word=" " | word=" "]
ChaSen
[word=" "] [word=" "] [lemma=" "] 3.2
• [tag="N.*"] & [word=" "]
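How such CQL expressions select tokens can be sketched in Python. This is not the Sketch Engine implementation; it assumes that each bracket describes one token, that `|` separates alternatives inside a bracket, and that attribute values are regular expressions matched against the whole value. The attribute values and tags in the example are hypothetical, since the original query strings are not preserved here.

```python
import re


def token_matches(token, condition):
    """condition is a list of alternatives (OR); each alternative maps
    attribute names to regular expressions that must all match (AND)."""
    return any(
        all(re.fullmatch(rx, token.get(attr, "")) for attr, rx in alt.items())
        for alt in condition
    )


def find_sequences(tokens, query):
    """Return the start index of every position where the query matches
    a run of consecutive tokens."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        if all(token_matches(tokens[start + k], cond)
               for k, cond in enumerate(query)):
            hits.append(start)
    return hits


# Hypothetical corpus fragment with word, lemma and tag attributes.
tokens = [
    {"word": "本", "lemma": "本", "tag": "N.g"},
    {"word": "を", "lemma": "を", "tag": "P.c.g"},
    {"word": "読ん", "lemma": "読む", "tag": "V.free"},
    {"word": "だ", "lemma": "だ", "tag": "Aux"},
    {"word": "雑誌", "lemma": "雑誌", "tag": "N.g"},
    {"word": "を", "lemma": "を", "tag": "P.c.g"},
    {"word": "読む", "lemma": "読む", "tag": "V.free"},
]

# Corresponds to: [word="本" | word="雑誌"] [word="を"] [lemma="読む"]
query = [
    [{"word": "本"}, {"word": "雑誌"}],
    [{"word": "を"}],
    [{"lemma": "読む"}],
]
print(find_sequences(tokens, query))  # [0, 4]
```

Querying by lemma in the last position retrieves both the inflected and the dictionary form of the verb, which a query on the word attribute would miss.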
Word Sketch
Sketch Engine ChaSen (IPADIC)
IPADIC Sketch Engine
Web
http://tell.fll.purdue.edu/chakoshipub/index2.html ChaSen
5 ChaSen
ChaSen Sketch
Engine
token   kana   lemma   POS tag ( )   POS tag-eng ( )
- Adv.P
- N.Ana
Aux
- N.g
Aux
Aux
- Sym.p
ChaSen
ChaSen IPADIC ChaSen
ChaSen
Word Sketch ChaSen
Word Sketch
Word Sketch Concordance
100 Word Sketch
ChaSen
Web
2.3. Thesaurus Sketch Difference
Thesaurus Sketch Difference shared triples 3
triple
Srdanović et al. 2008
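The idea of ranking similar words by shared triples can be sketched as follows. This is a simplified illustration under stated assumptions: each word is described by the set of (relation, collocate) pairs it occurs in, and similarity is the proportion of pairs the two words share. The scoring actually used by the tool is not reproduced here, and all names and data are hypothetical.

```python
from collections import defaultdict


def build_profiles(triples):
    """Group (word, relation, collocate) triples by word."""
    profiles = defaultdict(set)
    for word, relation, collocate in triples:
        profiles[word].add((relation, collocate))
    return profiles


def shared_triple_similarity(profiles, a, b):
    """Share of (relation, collocate) pairs common to both words.
    An unweighted overlap score, used here only as a stand-in."""
    pa = profiles.get(a, set())
    pb = profiles.get(b, set())
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def thesaurus(profiles, word, top_n=5):
    """Rank other words by how many triples they share with `word`."""
    scored = [
        (other, shared_triple_similarity(profiles, word, other))
        for other in profiles
        if other != word
    ]
    scored = [item for item in scored if item[1] > 0]
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:top_n]


# Hypothetical triples.
triples = [
    ("tea", "object_of", "drink"),
    ("tea", "modifier", "hot"),
    ("coffee", "object_of", "drink"),
    ("coffee", "modifier", "hot"),
    ("coffee", "modifier", "black"),
    ("car", "object_of", "drive"),
]
profiles = build_profiles(triples)
print(thesaurus(profiles, "tea"))
```

Words that share no triple with the query word are dropped, so the list contains only words with some distributional overlap.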
Thesaurus
6
Sketch Difference 7
8
16,309 6,486 2.5
Web
Thesaurus
7 Sketch Difference only pattern
8 Sketch Difference only pattern
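The split into common patterns and "only" patterns can be sketched with plain set operations. The sketch assumes each word is represented by a set of (relation, collocate) pairs and ignores frequency and salience, which the tool also displays. The function name and the example data are hypothetical.

```python
def sketch_difference(profile_a, profile_b):
    """Split two collocation profiles into shared patterns and the
    patterns found with only one of the two words."""
    return {
        "common": sorted(profile_a & profile_b),
        "only_a": sorted(profile_a - profile_b),
        "only_b": sorted(profile_b - profile_a),
    }


# Hypothetical profiles of two near-synonyms.
word_a = {("modifier", "strong"), ("object_of", "drink"), ("modifier", "hot")}
word_b = {("modifier", "hot"), ("object_of", "drink"), ("modifier", "green")}

diff = sketch_difference(word_a, word_b)
print(diff["common"])
print(diff["only_a"])
print(diff["only_b"])
```

The "only" lists are the part a lexicographer would inspect first, since they show where two similar words diverge in usage.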
2.4. Web
Web
Web
Web
Web
Keller & Lapata 2003 Web
Web JpWaC
Web
Web Sharoff 2006 Ueyama & Baroni 2005
Web Web
Web
Sharoff 2006 Ueyama &
Baroni 2005
Web
narrative style Web
interactive style
Web
Web
Web
Ghani et al. 2001
Web
Web
Web
Web
Web
Crystal 2006
Web
• Web
• Web
Web
3. Sketch Engine
Sketch Engine
3.1. Sketch Engine
80 Cobuild
90 Church & Hanks 1989 (MI)
2000 Word Sketch
Sketch Engine BNC British National Corpus
Rundell, ed. 2002 Kilgarriff &
Rundell (2002)
Word Sketch Word Sketch
Word Sketch
Sketch Engine
Word
Sketch
Sketch Engine
Kilgarriff & Rundell 2002
‘challenge’
2004
Sketch Engine
3.1.1
9 Word Sketch
9 Word Sketch
9 modifier_Ana modifier_Ai
verb
9
‘initiation’ ‘trial’
-
Word Sketch ‘challenge to something/somebody’
Concordance 10 Concordance
CQL [word=" "] []{0,3} [word=" "]
{0,3} 0 3 token
11 ( 3
199
•
•
•
10 11
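The effect of the `[]{0,3}` element in the CQL query above can be sketched in Python. The sketch assumes that `[]` matches any single token and that `{0,3}` allows between zero and three such tokens between the two specified words. The function name and the placeholder tokens are hypothetical, since the original query words are not preserved here.

```python
def find_with_gap(words, first, second, min_gap=0, max_gap=3):
    """Return (i, j) index pairs where words[i] == first, words[j] == second,
    and between min_gap and max_gap tokens stand between them."""
    hits = []
    for i, word in enumerate(words):
        if word != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + 1 + gap
            if j < len(words) and words[j] == second:
                hits.append((i, j))
    return hits


# Placeholder tokens: "a" and "b" stand for the two query words,
# "x" for any intervening token.
words = ["a", "b", "a", "x", "x", "x", "b", "a", "x", "x", "x", "x", "b"]
print(find_with_gap(words, "a", "b"))  # [(0, 1), (2, 6)]
```

The last "a" is not matched because four tokens separate it from the following "b", one more than the query permits.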
Word Sketch
jaSlo Erjavec et al.
2006
3.1.2
2004
2004
Word Sketch
10 Word Sketch
1) 2) 3)
4)
1)
1,180 364
Sketch Engine
22 2
Sketch Engine
Sketch
Engine Sketch Engine
2)
Word Sketch
Word Sketch Sketch
Engine Web
Sketch Engine
3)
Word Sketch
Word Sketch 12
12 Word Sketch
4)
Word Sketch Sketch
Engine Thesaurus Sketch Difference
A B A
B A
Sketch Difference
Web Web
Word Sketch
Sketch Engine
3.2. Sketch Engine
Sketch Engine
Word Sketch Thesaurus Sketch Difference
Concordance
• suffix ( ) prefix
• suffix_base prefix_base
• bound_V
• V_bound
suffix bound_V
V_bound
Sketch Difference
/ /
Word
Sketch Word Sketch
lemma
2)
Concordance
Concordance 2.2 3.3.1 Concordance
CQL
•
Concordance CQL
[word=" "][word=" "][lemma=" "]
[word=" "][word=" "][lemma=" "]
lemma
432 2,975
Collocation candidates
•
•
•
•
•
•
•
Concordance CQL [tag="V.*"][word=" | "][word=" "][lemma=" "]
Web 1,170
CQL [word=" | "][word=" "][lemma=" "]
Collocation candidates 10
•
Concordance [word=" "] [word=" "] [lemma="
"] 10,845 Collocation candidates
4,000 13
(lexical sets)
13
[word=" "][word=" | "][word=" "][word=" "] [word=" "] [lemma=" "]
Srdanović 2007
Word Sketch
Word Sketch
3.3. Sketch Engine
Sketch Engine
Sketch Engine
1)
Sketch Engine
a b
Sketch Engine
Sketch Engine
Nishina &
Yoshihashi 2007
Smrž 2004 Sketch Engine
2)
Sketch Engine
3)
a ( )
b
c
d
3.1 3.2 Sketch Engine
Smrž 2004 Sketch Difference
Thesaurus
Sketch Engine
Smrž 2004
Sketch Engine
Sketch Engine
4)
a
b
c
Sketch Engine
Sketch Engine Smith et
al. 2007
3.4.
Sketch Engine
2.3
Web Web
Word Sketch
Thesaurus Joyce 2005 Sketch Engine
ChaSen
ChaSen
Corpus Builder Sketch Engine
WebBootCat Web
Baroni et al. 2006
4.
Sketch Engine
1) ChaSen 4 Web
2) ChaSen
Sketch Engine
Word Sketch Thesaurus Sketch Difference Concordance
1) Web
2)
3) ChaSen
ChaSen
Srdanović Erjavec, Irena 2007
19 , 83-89,
2007 Sketch Engine
18 , 109-112,
2004
Baroni, Marco, Adam Kilgarriff, Jan Pomikalek & Pavel Rychly (2006) WebBootCaT: a web
tool for instant corpora, Proceedings of the EuraLex Conference 2006, 123-132.
Baroni, Marco & Silvia Bernardini, eds. (2006) Wacky! Working papers on the Web as Corpus,
Bologna: GEDIT.
Church, Kenneth Ward & Patrick Hanks (1989) Word association norms, mutual information,
and lexicography, Proceedings of the 27th annual meeting on Association for
Computational Linguistics, 76-83.
Crystal, David (2006) Language and the Internet, Cambridge: Cambridge University Press.
Erjavec, Tomaž, Kristina Hmeljak Sangawa & Irena Srdanović Erjavec (2006) jaSlo, A
Japanese-Slovene Learners' Dictionary: Methods for Dictionary Enhancement,
Proceedings of the 12th EURALEX International Congress.
Erjavec, Tomaž, Adam Kilgarriff & Irena Srdanović Erjavec (2007) A large public-access
Japanese corpus and its query tool, CoJaS 2007, The Inaugural Workshop on
Computational Japanese Studies.
Gahl, Susanne (1998) Automatic Extraction of subcategorization frames for corpus-based
dictionary-building, Proc EURALEX 1998, 445-452.
Ghani, Rayid, Rosie Jones & Dunja Mladenic (2001) Using the Web to Create Minority
Language Corpora, Proceedings of the 2001 ACM CIKM: Tenth International
Conference on Information and Knowledge Management, 279-286.
Heid, Ulrich, Stefan Evert, Vincent Docherty, Wolfgang Worsch & Matthias Wermke (2000)
Computational tools for semi-automatic corpus-based updating of dictionaries,
EURALEX 2000 Proceedings, 183-196.
Joyce, Terry (2005) Constructing a large-scale database of Japanese word associations, In
Katsuo Tamaoka (ed.) Corpus Studies on Japanese Kanji (Glottometrics 10), 82-98,
Tokyo: Hituzi Syobo & Lüdenscheid: RAM-Verlag.
Keller, Frank & Maria Lapata (2003) Using the Web to Obtain Frequencies for Unseen
Bigrams, Computational Linguistics 29 (3), 459-484.
Kilgarriff, Adam & Michael Rundell (2002) Lexical Profiling Software and its Lexicographic
Applications - a Case Study, EURALEX 2002 Proceedings, 807-818.
Kilgarriff, Adam, Pavel Rychly, Pavel Smrž & David Tugwell (2004) The Sketch Engine, Proc.
Euralex, 105-116.
Kilgarriff, Adam & David Tugwell (2001) WORD SKETCH: Extraction and Display of
Significant Collocations for Lexicography, Proc. workshop "COLLOCATION:
Computational Extraction, Analysis and Exploitation", 39th ACL & 10th EACL, 32-38.
Nishina, Kikuko & Kenji Yoshihashi (2007) Japanese Composition Support System
Displaying Occurrences and Example Sentences, Symposium on Large-scale
Knowledge Resources (LKR2007), 119-122.
Rundell, Michael, ed. (2002) Macmillan English Dictionary for Advanced Learners, London:
Macmillan.
Sharoff, Serge (2006) Open-source corpora: using the net to fish for linguistic data,
International Journal of Corpus Linguistics 11(4), 435-462.
Smith, Simon, Alice Chen & Adam Kilgarriff (2007) A corpus query tool for SLA: learning
Mandarin with the help of Sketch Engine, Practical Applications in Language and
Computers - PALC 2007.
Smrž, Pavel (2004) Integrating Natural Language Processing into E-learning — A Case of
Czech, Proceedings of the Workshop on eLearning for Computational Linguistics and
Computational Linguistics for eLearning, COLING 2004. 106-111.
Srdanović Erjavec, Irena, Tomaž Erjavec & Adam Kilgarriff (2008) A web corpus and
word-sketches for Japanese, ,
Ueyama, Motoko & Marco Baroni (2005) Automated construction and evaluation of a
Japanese web-based reference corpus, Proceedings of Corpus Linguistics 2005.
Sketch Engine corpus query tool for Japanese and its possible applications
SRDANOVIĆ ERJAVEC Irena, NISHINA Kikuko
Tokyo Institute of Technology
Keywords
Sketch Engine, corpus linguistics, lexicography, second language learning, collocations
Abstract
Although corpus-based language research has developed rapidly in recent years,
there is still a shortage of corpus resources that are adequate in size, textual variety,
and time of creation, and of efficient, user-friendly corpus query tools. This also holds
for Japanese corpus linguistics and is one of the primary reasons for the recent rise in
projects constructing Japanese corpus resources.
In this paper, we present a method for extracting linguistic information from corpora using
the Sketch Engine corpus query tool, which has recently been extended to Japanese.
The Japanese version is based on a 400-million-word Japanese Web corpus, linguistically
annotated with the morphological analyzer ChaSen, and on a Japanese grammatical
relations file. The tool offers efficient and user-friendly ways of extracting concise
linguistic data about words: their grammatical and collocational behavior, thesaurus-like
information, and differences in usage between similar words. We explain, through
examples, how the tool can be used in corpus lexicography, linguistic research, and
computer-assisted language learning of Japanese. The investigative part of the
article concentrates mainly on how the tool can be applied within the dictionary
creation process, and the results illustrate how each of the tool's functions can
contribute substantially to that process.