Compiling and Analyzing Your Own Learner Corpus
Xiaofei Lu
CALPER 2012 Summer Workshop
July 16, 2012

Workshop outline
  Opening discussion and corpora overview
  Graphic Online Language Diagnostic (GOLD) overview
  Sample GOLD (and related) projects
  GOLD (or related tool) project lab
  GOLD (or related tool) project discussions
  Concluding discussion

Opening discussion
  Brief introduction of your professional/language background and teaching/research interests
  Prior experience with corpus linguistics
  Primary challenges you are dealing with
  Primary purposes and goals for taking this workshop and for learning about corpus linguistics in general
  Any other relevant information

Corpora overview
  What is a corpus
  Types of corpora
  Corpus design and compilation
  Corpus annotation
  Corpus querying and analysis
  Learner corpora and L2 development
  Resources

What is a corpus?
  Leech (1992):
    an unexciting phenomenon, a helluva lot of text, stored on a computer
  Sinclair (1991):
    a collection of naturally-occurring language text, chosen to characterize a state or a variety of language
  Sinclair (2004):
    a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

Types of corpora
  General-purpose vs. specialized corpora
    British National Corpus & Russian National Corpus
    Michigan Corpus of Academic Spoken English
  Native vs. learner corpora
    International Corpus of Learner English
    Spanish Learner Language Oral Corpora
  Monolingual vs. parallel & comparable corpora
    The JRC-Acquis Multilingual Parallel Corpus
    The English-Chinese Parallel Concordancer

Types of corpora (cont.)
  Corpora representing one or diverse varieties
    International Corpus of English
  Synchronic vs. diachronic corpora
    The Corpus of Historical American English
  Spoken vs. written corpora
    Michigan Corpus of Upper-Level Student Papers

Corpus design
  Purpose and type of corpus
    Spoken/written; cross-sectional/longitudinal
  External criteria for content selection
    Communicative function of a text
    Mode, medium, interaction, domain, topic
  Representativeness, balance, size, sampling
    Design of the BNC

Corpus design (cont.)
  Encoding meaningful metadata information
    Learner: L1, gender, program level, discipline …
    Sample: date, mode, task, genre, rating …
    Facilitates contrastive and longitudinal studies
  MICASE speaker and transcript attributes
  Corpus markup: The ICE example

Corpus annotation
  Why annotate
  Levels of corpus annotation
  Difficulties for corpus annotation
  Standards and encoding

Why annotate
  Raw text vs. annotated text: How do you…
    Count the number of words in a Chinese text?
    Calculate the lexical density of an English text?
    Count the frequency of can as a modal verb?
    Know how many T-units in a text are complex?
    Extract all imperative sentences from a text?
    Know whether a syntactic structure is used in a text?

Levels of corpus annotation
  Sentence and word segmentation
  Part-of-speech (POS) tagging and lemmatization
  Syntactic parsing
  Semantic, pragmatic, and discourse annotation
  Learner corpora: error annotation
  Project-specific annotation

Sentence and word segmentation
  Why is this non-trivial?
    I went to the shops in Jones St. Saturday afternoon with Mr. Smith.
    I can’t remember whether it’s a second- or third-grade book.
    克林顿在讲话中指出
      Clinton pointed out in his speech (that…)
    克林顿 在 讲话 中 指出
      Clinton at speech middle point-out
    克林顿 在 讲话 中指 出
      Clinton at speech middle-finger out
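
Both problems on this slide can be reproduced with a minimal sketch. It assumes a deliberately naive sentence splitter and a greedy forward-maximum-matching word segmenter over a toy lexicon. The function names and the lexicon are illustrative assumptions, not tools named in the workshop.

```python
import re

def naive_sentence_split(text):
    """Split after every period followed by whitespace (deliberately naive)."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

def forward_max_match(text, lexicon, max_len=3):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

sentence = "I went to the shops in Jones St. Saturday afternoon with Mr. Smith."
pieces = naive_sentence_split(sentence)

# Toy lexicon (an assumption for illustration, not a real dictionary).
toy_lexicon = {"克林顿", "在", "讲话", "中", "中指", "指出", "出"}
segments = forward_max_match("克林顿在讲话中指出", toy_lexicon)
```

The splitter cuts the single English sentence after "St." and "Mr.". The greedy segmenter prefers 中指 ("middle finger") to 中 + 指出, which is the incorrect segmentation shown on the slide.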

POS tagging
  The what and why
  What are the difficulties?
    Ambiguity: 48% of tokens in the Brown Corpus
    Unknown words: neologisms
  Tagsets: overspecification vs. underspecification
    Penn Treebank Tagset vs. CLAWS7 Tagset

Lemmatization
  Counting linguistic items
    Types – number of different words
    Tokens – number of words
  What constitutes a different word type?
    go, went, gone, goes, going?
    differ, difference, different, differently?
    can as a noun, verb, and modal verb?
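
The effect of lemmatization on type counts can be shown with a short sketch. It assumes a tiny hand-made lemma table for the forms of go. A real lemmatizer would use a full lexicon and POS information, and the names below are illustrative only.

```python
import re
from collections import Counter

# Tiny hand-made lemma table (an assumption; real lemmatizers use full lexicons
# and POS information).
TOY_LEMMAS = {"went": "go", "gone": "go", "goes": "go", "going": "go"}

def tokenize(text):
    """Lowercase the text and keep alphabetic strings (with internal apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def count_types_and_tokens(tokens, lemmatize=False):
    """Return (number of types, number of tokens), optionally counting by lemma."""
    if lemmatize:
        tokens = [TOY_LEMMAS.get(t, t) for t in tokens]
    counts = Counter(tokens)
    return len(counts), sum(counts.values())

tokens = tokenize("She goes where he went, and I go where they have gone.")
surface = count_types_and_tokens(tokens)                  # counts word forms
lemma = count_types_and_tokens(tokens, lemmatize=True)    # counts lemmas
```

Here the token count stays at 12 either way, while the type count drops from 11 to 8 once goes, went, go, and gone are treated as one word.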

Demos and tools: Part 1
  Xerox morphological analyzer (demo only)
  ICTCLAS for Chinese segmentation and POS tagging
  Querying POS-tagged corpora and Stanford POS tagger for English
  TreeTagger for multiple languages

Chunking and parsing
  Partial/full structural analysis of each sentence
    My dog likes eating sausage.
    (ROOT
      (S
        (NP (PRP$ My) (NN dog))
        (VP (VBZ likes)
          (S
            (VP (VBG eating)
              (NP (NN sausage)))))
        (. .)))
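
A bracketed parse like the one above is plain text, so it can be read back into a data structure and queried. The following is a minimal sketch that assumes Penn Treebank-style bracketing. The helper names are hypothetical, and this is not the Stanford parser's or Tregex's own API.

```python
import re

def read_tree(bracketed):
    """Parse a Penn Treebank-style bracketed string into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                       # skip "("
        node = [tokens[pos]]           # node label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])   # a word under a POS tag
                pos += 1
        pos += 1                       # skip ")"
        return node

    return parse_node()

def tagged_words(node):
    """Collect (word, tag) pairs from the leaves, left to right."""
    if len(node) == 2 and isinstance(node[1], str):
        return [(node[1], node[0])]
    pairs = []
    for child in node[1:]:
        if isinstance(child, list):
            pairs.extend(tagged_words(child))
    return pairs

def count_label(node, label):
    """Count nodes carrying the given label anywhere in the tree."""
    total = 1 if node[0] == label else 0
    for child in node[1:]:
        if isinstance(child, list):
            total += count_label(child, label)
    return total

bracketed_parse = ("(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes) "
                   "(S (VP (VBG eating) (NP (NN sausage))))) (. .)))")
tree = read_tree(bracketed_parse)
```

Calling `tagged_words(tree)` recovers the POS-tagged sentence, and `count_label(tree, "VP")` counts verb phrases. Counts of this kind are the starting point for syntactic complexity measures.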

Chunking and parsing (cont’d)
  What is it useful for?
    Retrieving examples of grammatical patterns
    Grammar checking, syntactic complexity analysis
    NLP applications that require syntactic analysis
  Difficulties
    Ungrammatical sentences
    Ambiguities, e.g., PP attachment
    Errors from preprocessing steps

Semantic and discourse analysis
  Semantic and discourse features
    Word sense disambiguation
    Propositional idea density
    Coherence and cohesion

Annotation standards and encoding
  Useful standards
    Separable, linguistically consensual
    Documentation, compatibility with existing standards
  Encoding
    Simple encoding: present_JJ
    XML-style: <w type="JJ">present</w>
    Format varies, depending on level of annotation
  Manual, computer-aided, and automatic annotation
    Efficiency, scale, reliability
    UAM CorpusTool
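
The two encodings on this slide carry the same information, so one can be converted into the other mechanically. The sketch below assumes whitespace-separated word_TAG items with Penn Treebank tags. The sample sentence was tagged by hand for illustration, and the function names are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml_style(tagged_text):
    """Convert 'word_TAG word_TAG ...' into <w type="TAG">word</w> elements."""
    elements = []
    for item in tagged_text.split():
        word, _, tag = item.rpartition("_")   # split on the last underscore
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return " ".join(elements)

def count_tagged(tagged_text, word, tag):
    """Count occurrences of a word (case-insensitive) carrying a specific tag."""
    count = 0
    for item in tagged_text.split():
        w, _, t = item.rpartition("_")
        if w.lower() == word.lower() and t == tag:
            count += 1
    return count

# Hand-tagged example sentence (an assumption, not tagger output).
tagged = "I_PRP can_MD open_VB the_DT can_NN ._."
xml_line = to_xml_style("present_JJ")
modal_can = count_tagged(tagged, "can", "MD")
```

Once a text is tagged, a question such as "how often is can used as a modal verb?" becomes a simple count over word and tag pairs, which raw text cannot answer.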

Demos and tools: Part 2
  Stanford parser for Arabic, Chinese and English
  Word sense disambiguation demo
  Computerized Propositional Idea Density Rater
  Coh-Metrix for text coherence analysis
  CHILDES and CLAN
  Computerized Profiling
  WMatrix

Corpus querying and analysis
  Manual analysis?
  Corpus-specific online interfaces
    Raw: MICASE and MICUSP
    POS-tagged: Corpora @ BYU
    Grammatically and semantically tagged: RNC
  General-purpose online interfaces: GOLD
  Windows-based querying/concordancing tools
    WordSmith Tools & AntConc

Corpus querying and analysis
  Natural language processing tools
    Good for processing annotated corpora
    Extracting occurrences of grammatical patterns
    Examples: Stanford parser and Tregex

Resources
  Books and journals
    Hunston (2002): Corpora in Applied Linguistics
    McEnery (2006): Corpus-Based Language Studies
    International Journal of Corpus Linguistics
    Corpus Linguistics and Linguistic Theory
    Corpora
  Websites and mailing lists
    Bookmarks for corpus-based linguists
    Linguistic Data Consortium
    The corpora list; corpus in delicious
    Stanford Natural Language Processing Group

Discussion
  What kind of corpus do you intend to compile and/or use? For what purpose?
  What are the design issues?
  How do you intend to format, organize and store your files?
  Do you intend to annotate your corpus in some way? How?
  How do you intend to search/query your corpus?

Learner corpora and L2 development
  Samples from the same students at different times
    Did (targeted) language development take place?
    Was a particular pedagogical intervention effective?
  Samples from different students
    In what areas do students show different levels of development?
    What factors affect students’ language development?

Graphic Online Language Diagnostic
  A free online tool for teachers to assess their students’ language development
  Developed at CALPER, Penn State, funded by DOE
  Project co-directors: Xiaofei Lu and Michael McCarthy
  Teachers can use GOLD to
    Compile, upload, and manage their own corpora
    Share corpora with each other
    Search and analyze corpora
  Demonstration

Corpus compilation
  A user can compile a corpus by
    Directly compiling and uploading an XML file
    Using the easy-to-use guided XML creation interface
  An uploaded corpus can be easily managed
    Documents can be added or deleted
    The whole corpus can be deleted
    Content and metadata of individual documents can be easily accessed
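
The slides do not show GOLD's actual XML schema. As a rough sketch of what an XML corpus file with per-document metadata could look like, the example below builds one with Python's standard library. All element names, attribute names, and learner sentences are invented for illustration.

```python
import xml.etree.ElementTree as ET

def build_corpus(documents):
    """Build a corpus element from (metadata dict, text) pairs.

    Element and attribute names are hypothetical, not GOLD's actual schema.
    """
    corpus = ET.Element("corpus")
    for metadata, text in documents:
        doc = ET.SubElement(corpus, "document", attrib=metadata)
        doc.text = text
    return corpus

def select(corpus, **conditions):
    """Return documents whose metadata attributes match all conditions."""
    return [doc for doc in corpus.iter("document")
            if all(doc.get(key) == value for key, value in conditions.items())]

# Invented sample documents with learner and sample metadata.
corpus = build_corpus([
    ({"l1": "Chinese", "level": "intermediate", "task": "essay"}, "I very like it."),
    ({"l1": "Spanish", "level": "advanced", "task": "essay"}, "I like it a lot."),
])
xml_string = ET.tostring(corpus, encoding="unicode")
chinese_docs = select(corpus, l1="Chinese")
```

Storing metadata as attributes on each document is one way to make subsets (for example, all essays by intermediate learners) easy to select later.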

Corpus sharing
  GOLD facilitates easy data sharing
  A corpus may be set to be
    Private, shared, or public
  Corpus owner may give other users the right to
    View, add, edit, or delete corpora
  Demonstration

Basic corpus information
  Word count
    Alphabetic or numeric order
    Can be downloaded as a text file
  Corpus and document statistics
    Mean sentence length
    Mean word length
    Type-token ratio
  Demonstration
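
The three statistics listed above are simple ratios. How GOLD computes them is not stated in the slides. The sketch below assumes sentences end at ., !, or ?, and that words are runs of letters and apostrophes.

```python
import re

def corpus_statistics(text):
    """Return mean sentence length (words), mean word length (characters), and TTR."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

stats = corpus_statistics("The cat sat. The cat ran away!")
```

For this two-sentence sample there are 7 tokens and 5 types, so the mean sentence length is 3.5 words and the type-token ratio is 5/7. Type-token ratio falls as texts get longer, so it is best compared across samples of similar length.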

Corpus search
  Select one or more corpora to search
  Specify key words or phrases
    May use the wildcard character, e.g. book*
  Specify contexts
    Size of context window
    Context words and their positions
  Specify metadata conditions
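
A wildcard search with a context window can be sketched as follows. This is an illustration of the idea, not GOLD's implementation. It assumes that * matches any run of letters and that the window is measured in words on each side.

```python
import re

def kwic(text, pattern, window=3):
    """Return (left context, keyword, right context) for each matching word.

    A * in the pattern matches any run of letters or apostrophes.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    regex = re.compile(
        "".join("[A-Za-z']*" if ch == "*" else re.escape(ch) for ch in pattern),
        re.IGNORECASE,
    )
    lines = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

hits = kwic("She booked a table and read two books about bookkeeping.",
            "book*", window=2)
```

Each hit is a key-word-in-context (KWIC) line. Sorting the hits by the left or right context groups similar uses of the search word together.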

Corpus search results
  Display of search results
    Sortable KWIC display of search results
    Sortable graphic display of search results
  Demonstration

Lexical bundle/collocation search
  Procedure
    Select one or more corpora to search
    Specify search word
    Specify contexts
    Specify metadata conditions
  Search results
    Sortable list of n-grams found in selected corpora
  Demonstration
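
The n-gram list behind a lexical bundle search can be approximated in a few lines. This sketch assumes simple lowercased word tokens and ignores sentence boundaries, which a real tool would probably respect. The function name and sample text are illustrative.

```python
import re
from collections import Counter

def ngrams_with(text, search_word, n=3):
    """Count n-grams that contain the search word, ignoring case.

    Simplification: n-grams may span sentence boundaries.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if search_word.lower() in gram:
            counts[" ".join(gram)] += 1
    return counts

sample = "On the other hand, it works. On the other side, it fails."
bundles = ngrams_with(sample, "other", n=3)
top = bundles.most_common(1)
```

Sorting the resulting counts by frequency surfaces recurrent sequences such as "on the other", which can then be compared across learner groups or against a reference corpus.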

Summary of features
  Difference from other online tools
    Can create, share, and search multiple corpora
    Can easily search subsets of data
    Can work with any language
  Summary of corpus analysis functions
    Word list
    Corpus and document statistics: mean sentence length, mean word length, type-token ratio
    Corpus search and collocation search

Sample questions to ask
  With data from an individual student, one can either describe or track development in
    Patterns of usage of words and phrases – frequency, underuse, overuse, etc.
    Lexical and syntactic complexity
    Appropriate usage of words and phrases in context
    Patterns of usage of lexical bundles

Sample questions to ask (cont.)
  With data from different (groups of) students, one can compare similarities or differences among different (groups of) students in terms of
    Patterns of usage of words and phrases – frequency, underuse, overuse, etc.
    Lexical and syntactic complexity
    Appropriate usage of words and phrases in context
    Patterns of usage of lexical bundles

Future enhancements
  Corpora for benchmarking
  Multilingual natural language processing
  Suggestions on desirable functions welcome