Colloquia Linguistica
Part II: The development of Automated Syntactic Taggers
Leif Grönqvist, Göteborg University
-
Overview
- Some basic things about corpora (quick)
  - What is a corpus
  - What can we do with it
- Part-of-speech tagging (slower)
  - What is the problem
  - Some common approaches
    - A rule based tagger
    - A statistical tagger
- Corpus tools
  - Different tools
  - Demonstration of Multitool
-
What is a corpus for a computational linguist?
- Various properties are important, but the word "corpus" is just Latin for "body"
- These properties should be considered:
  - Representativeness
  - Size
  - Form (annotation standard)
  - Standard reference
-
Representativeness
- A corpus used for analyzing spoken Swedish should ideally contain all utterances of Swedish ever spoken
- But this is impossible, so there are at least two strategies depending on purpose:
  - Try to collect various dialogue types, each in a size proportional to its share of the complete (ideal) corpus
  - Collect big enough portions of each type to make sure to find all wanted phenomena
- Regardless of which strategy you use, it is important to select the samples from each type carefully, preferably using random sampling
-
Corpus size: how big should it be?
- Depends on purpose!
- Some strategies:
  - Monitor corpus: as big as possible
    - Bank of English > 500 million tokens
    - Used for lexicography
  - Finite size, big enough for the current task
    - POS-tagging, ~100 tags: 1 million tokens
    - Language model for automatic speech recognition: 100 million tokens
-
Machine readable form
- Corpora have been used in linguistics for more than 100 years
- Now: a corpus => machine readable
- The annotations should be made in a way that makes extraction of wanted features as simple as possible
-
Standard reference (quick)
- Typical content of a research article: "We used the corpus XX, took 90% for training and 10% for testing with our new algorithm. We then got 97.2% correctness, which is a significant improvement over the old tagger at the 99% level."
- Exactly the corpus XX must be available to other research groups
-
What to do with a corpus
- Check our linguistic intuition
- Annotate interesting features manually
- Use it for training of taggers and parsers
- Annotate new data automatically
But be careful! A corpus is not the complete language
-
Text encoding
- Various encoding schemes around
- Text based
  - Human and machine readable
  - Could be difficult to check for validity
- Word processor based
  - Only human readable
  - Rarely used in computational linguistics
- XML/SGML based
  - Machine readable
  - May be transformed to human readable form using XSLT
  - Formalisms and tools for free (well, more or less free)
  - Limitations of XML may be annoying sometimes
-
Some important properties (skip)
- Important properties according to Geoffrey Leech
  - Possibility to extract original corpus
  - Possibility to separate annotations
  - Based on well defined guidelines
  - Make clear how the annotations were done
  - Make clear that there may be errors in the corpus
  - Widely agreed theory-neutral annotation scheme
  - No annotation scheme is the a priori standard scheme
-
Some annotation standards
- TEI (Text Encoding Initiative)
  - Huge standard for all types of texts and corpora, developed by the TEI Consortium since 1987
  - SGML based in the beginning but now XML
- (X)CES (XML Corpus Encoding Standard)
  - Highly inspired by the TEI
  - Not as complicated, but only in beta version
- ISLE (International Standards for Language Engineering)
  - Developed by three working groups (lexicon, multimodality and evaluation)
- CDIF (Corpus Document Interchange Format)
  - Used by the British National Corpus
  - A lot in common with the TEI
-
Some typical results directly extracted from a corpus
- Concordances (KWIC)
- Frequency lists
- N-gram statistics
- Probabilities
-
Concordances
rer, matematiker och dataloger i Göteborgsregionen
, bandavskrifter och dataloggar, skriver Feldt.|Si
, bandavskrifter och dataloggar.|Men den nya Palme
 Ahlberg, forskare i datalogi på Chalmers.|Av PER-
 Ahlberg, forskare i datalogi på Chalmers.|SIDAN 4
und blir professor i datalogi vid Umeå universitet
und blir professor i datalogi vid Umeå universitet
a fyra olika kurser: datalogi, pedagogik, teknik o
ybjer och Jan Smith, datalogi.|Sektionen för maski
atorer eller pluggar datalogi.|Så på fritiden leker
  det gäller trådlös datalogistik, nu kommer det
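A keyword-in-context listing like the one above can be produced with a few lines of code. The following is a minimal sketch, not the tool used for the slide: the function name `kwic`, the prefix matching, and the toy English sentence are all assumptions made for illustration.

```python
def kwic(tokens, keyword_prefix, width=3):
    """Return (left context, hit, right context) for every token
    that starts with keyword_prefix (hypothetical helper)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().startswith(keyword_prefix):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Toy data, invented for the example.
tokens = "you have to book a chair on deck before you book a cabin".split()
hits = kwic(tokens, "book", width=2)

# Right-align the left context so the keyword lines up in one column.
for left, hit, right in hits:
    print(f"{left:>12} | {hit} | {right}")
```

Matching on a prefix is what lets one query return dataloger, dataloggar, datalogi and datalogistik together in the listing above.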
-
Frequency lists
74556 de      77810 det     90304 .
48104 ja      36843 är      56075 ,
39947 e       35471 och     40438 och
34342 å       32404 ja      33978 i
25694 så      30439 att     26358 att
25639 att     28628 jag     25634 det
22378 va      26059 så      21830 en
19134 som     19205 som     21333 som
18679 vi      18681 inte    19743 på
18084 inte    18469 har     15754 är
17611 på      18421 vi      14333 med
17214 man     17719 på      13837 för
16870 i       17377 man     13683 av
16846 då      17343 då      13547 jag
-
N-gram statistics
3395 det är       42 i stället för att
2913 för att      36 för några år sedan
2451 det var      35 men det är inte
1560 att det      34 en stor del av
1351 är det       33 på samma sätt som
1278 i en         32 det var som om
1174 att han      31 att det är en
1003 i den        30 är en av de
 966 som en       30 men det var inte
 920 men det      28 vad är det för
 889 på en        28 det är svårt att
 884 att jag      27 det är som om
 882 är en        27 att det inte var
 882 med en       26 för ett år sedan
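Frequency lists and n-gram statistics both come down to counting. A minimal sketch using Python's standard `collections.Counter` follows; the helper `ngrams` and the toy sentence are assumptions for illustration, not the data behind the slides.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy data, invented for the example.
tokens = "it is a book and it is a chair".split()

unigram_freq = Counter(tokens)            # frequency list
bigram_freq = Counter(ngrams(tokens, 2))  # 2-gram statistics

for word, freq in unigram_freq.most_common(3):
    print(freq, word)
for gram, freq in bigram_freq.most_common(2):
    print(freq, " ".join(gram))
```

Dividing each count by the total number of tokens turns the same tables into the probability estimates mentioned on the results slide.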
-
Part-of-speech tagging
- We want to assign the right part-of-speech (just as an example) to each word in a corpus
  - Input is a tokenized corpus
  - The tagset is determined in advance
- The word types in the corpus have various properties in the training data
  - Some are unambiguous
  - Some are ambiguous (typically 2-7 POS each)
  - Some are unknown (not there)
-
An example
- Tagset: noun, verb, pron, art, infmrk, prep
- In: $A: you have to book a chair on deck
- Out: pron verb infmrk verb art noun prep noun
But "book" and "chair" may each be either verb or noun, so the tagger has to disambiguate!
There are several approaches to doing this, all based on patterns and regularities in the language
-
Terms used in tagging
- Tagging: put the right label (i.e. word class) on each token
- Tagset: all possible labels (word classes)
- Tokenizing: divide the corpus into tokens (words, sentence boundaries)
- Training: find the rules or probabilities needed by the tagger
-
Various approaches
- Rule based tagging
  - Constraint based tagging (SweTwol, EngTwol by Lingsoft)
  - Transformation-based tagging (Eric Brill)
- Stochastic tagging (HMM)
  - Calculate the most probable tag sequence
    - Using maximum likelihood estimation
    - Or some bootstrap based training
-
Constraint based tagging
- Basic idea:
  - Assign all possible tags to each word
  - Remove tags according to a set of rules of the type: "if word+1 is an adj, adv or quantifier and the following is a sentence boundary and word-1 is not a verb like consider, then eliminate non-adv, else eliminate adv"
  - Continue until no rule is applicable, but never remove the last tag on a word
- Typically more than 1000 hand written rules, but they may also be machine learned
-
The example: Constraint grammar
- Tagset: nn, vb, pron, art, infmrk, prep
- First: look up all possible classes for each word
- Rules will then remove unwanted tags
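The elimination loop can be sketched as follows. This is a toy illustration under stated assumptions, not SweTwol or EngTwol: the lexicon `LEXICON` and the three rules in `RULES` are invented for the example sentence, and each rule only looks at the word before.

```python
# Hypothetical lexicon: every reading each word may have.
LEXICON = {
    "you": {"pron"}, "have": {"vb"}, "to": {"infmrk"},
    "book": {"nn", "vb"}, "a": {"art"}, "chair": {"nn", "vb"},
    "on": {"prep"}, "deck": {"nn", "vb"},
}

# Hypothetical rules: (tag that word-1 must carry unambiguously,
#                      tag to eliminate from the current word).
RULES = [
    ("art", "vb"),     # no verb right after an article
    ("prep", "vb"),    # no verb right after a preposition
    ("infmrk", "nn"),  # no noun right after the infinitive marker
]

def constraint_tag(words):
    readings = [set(LEXICON[w]) for w in words]  # start with all readings
    changed = True
    while changed:                               # until no rule is applicable
        changed = False
        for i in range(1, len(words)):
            for prev_tag, banned in RULES:
                if (readings[i - 1] == {prev_tag}
                        and banned in readings[i]
                        and len(readings[i]) > 1):  # never remove the last tag
                    readings[i].discard(banned)
                    changed = True
    return readings

result = constraint_tag("you have to book a chair on deck".split())
print(result)
```

A word may keep several readings if no rule applies to it, which is one practical difference from the other two approaches.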
-
Transformation-based tagging
- Basic idea:
  - Set the most probable tag for each word as a start value
  - Change tags according to rules of the type: "if a word is tagged as a verb and the word before is an article, then change the tag to noun". Perform the rules in a specific order!
- Training is done using a tagged corpus:
  1. Write a set of rule templates of the type: "if word-1 or word+1 is an X then change the tag for word to Y"
  2. Among the set of possible rules, find the one with the highest score
  3. Continue from 2 until a lowest score threshold is passed
  4. Keep the ordered set of rules
- Rules will make errors that are corrected by later rules
-
The example: Transformation based learning
- Tagset: nn, vb, pron, art, infmrk, prep
- First: look up the most common tag for each word
- Rules will then change to the right tags
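The greedy training loop can be sketched like this. It is a simplified illustration, not Brill's implementation: it assumes a single rule template ("if word-1 is tagged X and the word is tagged Y, change Y to Z"), scores a rule as errors fixed minus errors introduced, and uses invented start tags for the example sentence.

```python
from itertools import product

TAGSET = ["nn", "vb", "pron", "art", "infmrk", "prep"]

def apply_rule(tags, rule):
    prev, old, new = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked on the tags as they were before this rule fired.
        if tags[i - 1] == prev and tags[i] == old:
            out[i] = new
    return out

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn_rules(initial, gold, tagset, min_gain=1):
    current, learned = list(initial), []
    while True:
        best_rule, best_gain = None, 0
        for rule in product(tagset, repeat=3):  # instantiate the template
            if rule[1] == rule[2]:
                continue
            gain = errors(current, gold) - errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < min_gain:  # score threshold passed
            return learned
        current = apply_rule(current, best_rule)
        learned.append(best_rule)                      # keep the rules in order

# you have to book a chair on deck (start tags are assumed "most common" tags)
initial = ["pron", "vb", "infmrk", "nn", "art", "vb", "prep", "nn"]
gold = ["pron", "vb", "infmrk", "vb", "art", "nn", "prep", "nn"]

learned = learn_rules(initial, gold, TAGSET)
tagged = list(initial)
for rule in learned:
    tagged = apply_rule(tagged, rule)
print(learned, tagged)
```

Tagging new text reuses only `apply_rule`, run over the learned rules in the order they were found.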
-
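An HMM tagger: uses statistics (brief)
The problem may be formulated as follows, where w_i is the i-th word and c_i its class (tag):
$$\hat{c}_{1 \ldots n} = \arg\max_{c_1 \ldots c_n} P(c_1 \ldots c_n \mid w_1 \ldots w_n)$$
Which may be reformulated (Bayes' rule) as:
$$\hat{c}_{1 \ldots n} = \arg\max_{c_1 \ldots c_n} \frac{P(w_1 \ldots w_n \mid c_1 \ldots c_n) \, P(c_1 \ldots c_n)}{P(w_1 \ldots w_n)}$$
But the denominator is constant and may be removed and we get:
$$\hat{c}_{1 \ldots n} = \arg\max_{c_1 \ldots c_n} P(w_1 \ldots w_n \mid c_1 \ldots c_n) \, P(c_1 \ldots c_n)$$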
-
HMM tagger, cont. (brief)
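The Markov assumption (for n=3) and the chain rule gives us:
$$\hat{c}_{1 \ldots n} \approx \arg\max_{c_1 \ldots c_n} \prod_{i=1}^{n} P(w_i \mid c_i) \, P(c_i \mid c_{i-2}, c_{i-1})$$
What we need now is:
- P(w_i | c_i): the probability of a word given its class
- P(c_i | c_{i-2}, c_{i-1}): the probability of a class given the two preceding classes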
-
The example: HMM
- Select the sequence with the highest probability!
-
Training of an HMM tagger
- The best way is Maximum Likelihood Estimation, but it requires a hand tagged corpus
- A fancy name for a simple principle: expect the new data to be like the training data. Count the things there:
  - P(c) = freq(c) / Ntok
  - P(w,c) = freq(w,c) / Ntok
  - P(w|c) = P(w,c) / P(c)
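The three estimates above translate directly into counting code. A minimal sketch, assuming the hand tagged corpus is a list of (word, class) pairs; the function name `train_mle` and the eight-token corpus are invented for illustration.

```python
from collections import Counter

def train_mle(tagged):
    """Relative-frequency estimates from a list of (word, class) pairs."""
    n_tok = len(tagged)
    freq_c = Counter(c for _, c in tagged)
    freq_wc = Counter(tagged)
    p_c = {c: f / n_tok for c, f in freq_c.items()}       # P(c)
    p_wc = {wc: f / n_tok for wc, f in freq_wc.items()}   # P(w,c)
    p_w_given_c = {(w, c): p_wc[(w, c)] / p_c[c]          # P(w|c)
                   for (w, c) in p_wc}
    return p_c, p_wc, p_w_given_c

# Toy hand tagged corpus, invented for the example.
tagged = [
    ("you", "pron"), ("book", "vb"), ("a", "art"), ("chair", "nn"),
    ("you", "pron"), ("chair", "vb"), ("a", "art"), ("book", "nn"),
]
p_c, p_wc, p_w_given_c = train_mle(tagged)
print(p_c["nn"], p_wc[("book", "nn")], p_w_given_c[("book", "nn")])
```

Any (word, class) pair that never occurs in the training data is simply missing from the tables, which amounts to a probability of zero.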
-
Evaluation (skip)
- The result is compared with the so-called Gold Standard (manually coded)
- Typically, accuracy reaches 96-97%
- This may be compared with the result for a baseline tagger, for example a tagger not using context at all
- Similarity between two gold standards may be verified with the kappa measure
- Important to note that 100% is impossible even for human annotators
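Both measures are short to compute. The sketch below assumes that "the kappa measure" means Cohen's kappa for two annotators, which the slide does not specify; the function names and the four-token codings are invented for illustration.

```python
from collections import Counter

def accuracy(predicted, gold):
    """Share of tokens where the tagger agrees with the gold standard."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def cohen_kappa(a, b):
    """Agreement between two codings, corrected for chance agreement.
    Undefined (division by zero) if chance agreement is exactly 1."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy codings of the same four tokens, invented for the example.
coder_a = ["nn", "vb", "nn", "nn"]
coder_b = ["nn", "vb", "vb", "nn"]

acc = accuracy(coder_a, coder_b)
kappa = cohen_kappa(coder_a, coder_b)
print(acc, kappa)
```

Here the two coders agree on 3 of 4 tokens, but half of that agreement is expected by chance given how often each coder uses each tag, so kappa comes out lower than the raw agreement.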
-
Problems (quick)
- Words and sequences are missing in the training data. This is cured using smoothing:
  - Additive: add one occurrence to each event frequency
  - Good-Turing estimation: try to calculate the number of unseen events to get a better estimation of their probabilities
  - Back-off and linear interpolation
- Morphology may help (-arity, -s)
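Additive smoothing is the simplest of these to show in code. A minimal sketch for P(w|c), assuming a fixed, known vocabulary; the function name `additive_prob`, the counts and the four-word vocabulary are invented for illustration.

```python
from collections import Counter

def additive_prob(word, tag, freq_wc, freq_c, vocab_size, k=1.0):
    """P(word | tag) with k added to every (word, tag) count (k=1: add-one)."""
    return (freq_wc[(word, tag)] + k) / (freq_c[tag] + k * vocab_size)

# Toy counts, invented for the example; "deck" and "cabin" are unseen as nouns.
vocab = ["book", "chair", "deck", "cabin"]
freq_wc = Counter({("book", "nn"): 3, ("chair", "nn"): 1})
freq_c = Counter({"nn": 4})

p_seen = additive_prob("book", "nn", freq_wc, freq_c, len(vocab))
p_unseen = additive_prob("deck", "nn", freq_wc, freq_c, len(vocab))
total = sum(additive_prob(w, "nn", freq_wc, freq_c, len(vocab)) for w in vocab)
print(p_seen, p_unseen, total)
```

The unseen words get a small non-zero probability, paid for by the seen ones: "book" drops from 3/4 to 4/8 while the distribution still sums to one.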
-
The Viterbi algorithm (quick)
- Calculating the probabilities for all possible sequences of tags would take too long
- The Viterbi algorithm uses dynamic programming to find the most probable path in time that is linear in the length of the text and quadratic in the number of states
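The dynamic programming idea can be sketched as follows. To keep the example short it assumes a bigram model (each tag depends only on the previous tag) rather than the n=3 model from the earlier slides, and all probabilities in the toy model are invented.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               the previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:                      # quadratic in the number of tags
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p].get(t, 0.0)
                 * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy model, invented for the example.
tags = ["infmrk", "vb", "art", "nn"]
start_p = {"infmrk": 0.4, "vb": 0.2, "art": 0.2, "nn": 0.2}
trans_p = {
    "infmrk": {"vb": 0.9, "nn": 0.1},
    "vb": {"art": 0.6, "nn": 0.3, "infmrk": 0.1},
    "art": {"nn": 0.9, "vb": 0.1},
    "nn": {"vb": 0.5, "nn": 0.2, "infmrk": 0.3},
}
emit_p = {
    "infmrk": {"to": 1.0},
    "vb": {"book": 0.3, "chair": 0.1, "have": 0.6},
    "art": {"a": 1.0},
    "nn": {"book": 0.5, "chair": 0.5},
}

path = viterbi("to book a chair".split(), tags, start_p, trans_p, emit_p)
print(path)
```

In this toy model "book" is more likely as a noun when seen in isolation, but the transition from the infinitive marker makes the verb reading win. Only one column of partial results per word is needed, which is where the linear dependence on text length comes from.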
-
Example of corpus tools at the linguistics department in Göteborg
- The Corpus Browser
  - A tool for searching (for words and expressions) and browsing in our transcriptions
- TraSA
  - A tool that counts things like number of words, utterances, overlaps, vocabulary richness, etc.
- Multitool
  - A tool for browsing and coding a transcription, with audio and video available at the same time
- Demonstration?
-
Thank you!
- Thank you for listening!
- Well, do we have any time left for questions?