Getting to know your corpus Adam Kilgarriff Lexical Computing Ltd.
Sketch engine for Chinese
Discussion notes
Wordsketch, subsequently Sketch Engine
• Was developed by Kilgarriff et al. at Brighton
• Gives automatic, corpus-based summaries of a word’s grammatical and collocational behaviour
• Captures information in a more accessible way than hundreds of KWIC lines
• Uses an MI-based salience algorithm
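The MI-based salience idea can be sketched roughly as below. The slides only say "MI based", so the exact weighting here (pointwise mutual information multiplied by the log of the pair frequency), the function names and the toy counts are all assumptions made for illustration, not the published algorithm.

```python
import math

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a word pair."""
    return math.log2((pair_count * total) / (w1_count * w2_count))

def salience(pair_count, w1_count, w2_count, total):
    """Assumed weighting: PMI times the log of the pair frequency.

    The log factor damps very rare pairs, which raw MI tends to overrate.
    """
    return pmi(pair_count, w1_count, w2_count, total) * math.log(pair_count + 1)

# Toy counts, invented for illustration only.
TOTAL = 1_000_000
frequent_pair = (50, 1000, 2000, TOTAL)   # seen 50 times together
rare_pair = (1, 10, 20, TOTAL)            # seen once together

print(pmi(*frequent_pair), salience(*frequent_pair))
print(pmi(*rare_pair), salience(*rare_pair))
```

With these counts the rare pair has the higher raw MI, but the frequent pair has the higher salience, which is the behaviour a lexicographer would usually want.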
Other corpus query tools do collocational salience too, but…
• Sketch Engine uses lemmata, not word-forms
– So that eat and eats are treated the same
• And it takes account of grammatical relations
– So that The plane banks and The investment banks are treated separately
– And (if the corpus is appropriately parsed) He robs banks and He robbed the bank would be accorded similar treatment
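A small sketch of why lemmata and grammatical relations matter when counting collocations. The tuple layout, the relation names and the five example rows are hypothetical; a real system would read them from a lemmatised, parsed corpus.

```python
from collections import Counter

# Hypothetical parser output: (relation, head lemma, head word-form, dependent lemma).
triples = [
    ("object", "eat", "eats", "apple"),
    ("object", "eat", "eat", "apple"),
    ("object", "rob", "robs", "bank"),
    ("object", "rob", "robbed", "bank"),
    ("subject", "bank", "banks", "plane"),   # "The plane banks": bank is the verb
]

# Counting by word-form splits the evidence across inflections.
by_form = Counter((rel, form, dep) for rel, _, form, dep in triples)

# Counting by lemma pools it, while the relation label keeps
# "bank" as the object of rob apart from "bank" as a verb with subject plane.
by_lemma = Counter((rel, lemma, dep) for rel, lemma, _, dep in triples)

print(by_form[("object", "robs", "bank")])    # 1
print(by_lemma[("object", "rob", "bank")])    # 2
print(by_lemma[("subject", "bank", "plane")]) # 1
```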
Grammatical relations example
Unary relations
Word2 and Prep are not specified
Binary relations
Prep not specified
Binary relations, Word2 not specified
Trinary relations
Sketch engine modules
• Concordance
– KWIC or sentence context
• Thesaurus
– A list of “similar” words
• Sketch differences, for distinguishing near-synonyms
– If both lemmata x and y have strong collocational salience with a, then they are near-synonyms
• Wordsketch
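The sketch-difference idea can be illustrated as a comparison of two collocate tables. This is a minimal sketch only: the function name, the threshold and the salience scores are invented for the example and do not come from the Sketch Engine itself.

```python
def sketch_difference(sketch_x, sketch_y, threshold=5.0):
    """Split strongly salient collocates into shared and word-specific sets.

    sketch_x, sketch_y: dicts mapping collocate -> salience score.
    """
    strong_x = {c for c, score in sketch_x.items() if score >= threshold}
    strong_y = {c for c, score in sketch_y.items() if score >= threshold}
    return {
        "shared": sorted(strong_x & strong_y),
        "only_x": sorted(strong_x - strong_y),
        "only_y": sorted(strong_y - strong_x),
    }

# Invented scores for two near-synonymous adjectives and the nouns they modify.
clever = {"idea": 8.1, "trick": 9.3, "child": 6.0, "design": 2.0}
intelligent = {"idea": 5.5, "child": 7.2, "life": 8.8, "trick": 1.0}

diff = sketch_difference(clever, intelligent)
print(diff)
```

Collocates in the shared set are evidence that the two words behave alike; the word-specific sets show where they part company.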
Sample of grammatical relation definitions script (m4 macro language)
• define(`wh_word',`[tag="AVQ"|tag="DTQ"|tag="PNQ"]')
• define(`whether_if',`[tag="PNQ" & word="if" |word="whether"]')
• define(`determiner',`[tag="AT."|tag="DT."|tag=poss_pro]')
• define(`conjunction',`"CJC"')
• define(`simple_neg',`"XX."')
• define(`rel_start',`[tag="DTQ"|tag="PNQ"|tag=that_comp]')
• define(`adv_neg',`[tag=any_adv|tag=simple_neg]')
• define(`number',`"[OC]RD"')
• define(`goal_adv',`[word="back"|word="over"|word="home"|word="away"|word="out"]')
• define(`long_np',`[tag="AT."|tag="DT."|tag=poss_pro|tag=number|tag=any_adv|tag=any_adj|tag=genitive]{0,3} any_noun{0,2} 2:any_noun [tag!=any_noun & tag != genitive]')
• define(`np_start',`[tag="AT."|tag="DT."|tag=poss_pro|tag=number|tag=any_adj|tag=any_noun]')
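The quoted strings in these definitions behave like regular expressions over part-of-speech tags: "AT." covers any tag beginning with AT, and "[OC]RD" covers both ORD and CRD. The Python sketch below mimics that matching for a few of the simpler macros. The dictionary, the function name and the sample tags are assumptions for illustration; the real definitions are expanded by the macro processor and evaluated by the corpus query engine, not by code like this.

```python
import re

# A few of the macros above, rewritten as plain regular expressions over tags.
TAG_PATTERNS = {
    "wh_word": r"AVQ|DTQ|PNQ",
    "conjunction": r"CJC",
    "simple_neg": r"XX.",
    "number": r"[OC]RD",
}

def classes_for(tag):
    """Return the names of all macro classes whose pattern matches the whole tag."""
    return sorted(
        name for name, pattern in TAG_PATTERNS.items()
        if re.fullmatch(pattern, tag)
    )

for tag in ["ORD", "CRD", "XX0", "DTQ", "NN1"]:
    print(tag, classes_for(tag))
```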
Applications
• Intended as an aid to lexicographers
• At least one paper on MT application
• Could be used in pedagogical applications
– Earlier NSF grant aimed at a complete Chinese learning platform, with Wordsketch as a module
– Comparison of similar lexemes cross-linguistically
• Yiching is publishing about express vs biaoshi, and this work may use Wordsketch
Chinese Wordsketch
• Kilgarriff et al. report that Wordsketch can be ported to any language
– Pavel Rychly in Czech Rep has implemented concordancing at Chinese character level only
• AS has acquired Chinese Gigaword, and POS-tagged it automatically
– No parsing has been attempted so far
• Grammatical relations ruleset for Chinese is needed
• I would plan to
– contribute to the writing of this ruleset
– collaborate on cross-linguistic lexical analyses, using Wordsketch where possible
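Character-level concordancing needs no word segmentation, tagging or parsing, which is why it is the easy first step for Chinese. A minimal sketch follows, assuming the corpus is a plain Python string; the function name and the example sentence are invented and this is not the Bonito implementation.

```python
def kwic_chars(text, keyword, width=5):
    """Return (left context, keyword, right context) for every hit.

    Context is measured in characters, so no word segmentation is required.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, keyword, right))
        start = text.find(keyword, start + 1)
    return hits

# Invented example sentence containing biaoshi twice.
text = "他表示同意，她也表示支持。"
for left, key, right in kwic_chars(text, "表示", width=2):
    print(left, "[" + key + "]", right)
```

A word sketch, by contrast, needs the grammatical relations ruleset on top of the POS tags, because collocates have to be grouped by relation rather than by mere proximity.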
Links
• http://nlp.fi.muni.cz/projects/bonito2/chinese/
– test chin
• http://www.sketchengine.co.uk/sampler/
– ssmith ssmith