TagHelper & SIDE Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction...
TagHelper & SIDE
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
TagHelper Tools and SIDE
TagHelper Tools uses text mining technology to automate annotation of conversational data
SIDE facilitates rapid prototyping of reporting interfaces for group learning facilitators
Define Summaries
Annotate Data
Visualize Annotated Data
Setting Up Your Data For TagHelper
Setting Up Your Data
How do you know when you have coded enough data?
What distinguishes Questions and Statements?
Not all questions end in a question mark.
Not all WH words occur in questions
I versus you is not a reliable predictor
You need to code enough to avoid learning rules that won’t work
Creating a Trained Model
Training and Testing
Start TagHelper Tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder
You will then see the following tool palette
The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data
Click on Train New Models
Loading a File
First click on Add a File
Then select a file
Simplest Usage
Click “GO!”
TagHelper will use its default settings to train a model on your coded examples
It will use that model to assign codes to the uncoded examples
More Advanced Usage
The second option is to modify the default settings
You get to the options you can set by clicking on >> Options
After you finish that, click “GO!”
Evaluating Performance
Performance report
The performance report tells you:
What dataset was used
What the customization settings were
At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
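To make the last point concrete, here is a minimal sketch of how a confusion matrix and one common reliability statistic can be computed from human codes and model predictions. Cohen's kappa is used as the reliability statistic; that TagHelper reports this particular statistic is an assumption, and the function names and sample data are hypothetical.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Rows are the human (gold) codes, columns are the predicted codes."""
    labels = sorted(set(gold) | set(predicted))
    counts = Counter(zip(gold, predicted))
    matrix = [[counts[(g, p)] for p in labels] for g in labels]
    return labels, matrix

def cohens_kappa(gold, predicted):
    """Agreement between two codings, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    expected = sum(gold_freq[label] * pred_freq[label] for label in gold_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both codings use a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical data: Q = question, S = statement
gold = ["Q", "Q", "S", "S", "S", "Q"]
predicted = ["Q", "S", "S", "S", "Q", "Q"]

labels, matrix = confusion_matrix(gold, predicted)
kappa = cohens_kappa(gold, predicted)
```

Off-diagonal cells of the matrix show which codes are being confused with which; a kappa near 0 means agreement is no better than chance.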
Output File
The output file contains the codes for each segment
Note that the segments that were already coded will retain their original code
The other segments will have their automatic predictions
The prediction column indicates the confidence of the prediction
Overview of Basic Feature Extraction from Text
Customizations
To customize the settings:
Select the file
Click on Options
Classifier Options
* The three main types of classifiers are Bayesian models (Naïve Bayes), functions (SMO), and trees (J48)
Rules of thumb:
SMO is state-of-the-art for text classification
J48 is best with small feature sets; it also handles contingencies between features well
Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
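The "accumulating evidence" idea can be illustrated with a small Naïve Bayes sketch: each word adds a little log-probability to every candidate code, and the code with the highest total wins. This is a toy multinomial model with add-one smoothing written for illustration, not the implementation TagHelper uses; the function names and training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, code) pairs taken from the coded data."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in examples:
        class_counts[code] += 1
        for word in text.lower().split():
            word_counts[code][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Sum log-probabilities per code; every known word contributes evidence."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for code in class_counts:
        score = math.log(class_counts[code] / total)
        denom = sum(word_counts[code].values()) + len(vocab)  # add-one smoothing
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[code][word] + 1) / denom)
        scores[code] = score
    return max(scores, key=scores.get)

# Hypothetical coded examples
coded = [
    ("what is the answer", "question"),
    ("how do you know", "question"),
    ("the answer is nine", "statement"),
    ("i think it is nine", "statement"),
]
model = train_naive_bayes(coded)
```

Calling `predict(model, "what do you think")` then assigns a code to an uncoded segment, mirroring the train-then-apply workflow described earlier.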
Basic Idea
Represent text as a vector where each position corresponds to a term
This is called the “bag of words” approach
Vector positions: Cheese, Cows, Eggs, Hens, Lay, Make
Cows make cheese: 110001
Hens lay eggs: 001110
But same representation for “Cheese makes cows.”!
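The example above can be reproduced with a short sketch. It assumes a crude normalization step (lowercasing, dropping punctuation, and stripping a trailing "s") so that "makes" and "make" map to the same position; the function names are hypothetical.

```python
import string

# Vocabulary in the same order as the vector positions in the example
VOCABULARY = ["cheese", "cows", "eggs", "hens", "lay", "make"]

def normalize(word):
    """Lowercase, drop surrounding punctuation, strip a trailing 's' (crude)."""
    word = word.lower().strip(string.punctuation)
    return word[:-1] if word.endswith("s") else word

def bag_of_words(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector: 1 if the vocabulary term occurs anywhere in the text."""
    present = {normalize(token) for token in text.split()}
    return [1 if normalize(term) in present else 0 for term in vocabulary]

cows = bag_of_words("Cows make cheese")
hens = bag_of_words("Hens lay eggs")
reversed_cows = bag_of_words("Cheese makes cows.")
```

Because word order is discarded, `cows` and `reversed_cows` are identical vectors even though the sentences mean different things.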
What can’t you conclude from “bag of words” representations?
Causality: “X caused Y” versus “Y caused X”
Roles and Mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.”
Who’s driving, who’s eating, and who’s preparing food?
Basic Anatomy: Layers of Linguistic Analysis
Phonology: The sound structure of language
Basic sounds, syllables, rhythm, intonation
Morphology: The building blocks of words
Inflection: tense, number, gender
Derivation: building words from other words, transforming part of speech
Syntax: Structural and functional relationships between spans of text within a sentence
Phrase and clause structure
Semantics: Literal meaning, propositional content
Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness)
Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis
Part of Speech Tagging
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
Part of Speech Tagging
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
TagHelper Customizations
Feature Space Design
Think like a computer!
Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful
Look for approximations
If you want to find questions, you don’t need to do a complete syntactic analysis
Look for question marks
Look for wh-terms that occur immediately before an auxiliary verb
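The two approximations above can be sketched as a small heuristic. The word lists are illustrative assumptions, not TagHelper's own lists, and the function name is hypothetical.

```python
import re

# Illustrative word lists (assumptions, not exhaustive)
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {
    "is", "are", "was", "were", "do", "does", "did",
    "can", "could", "will", "would", "should", "has", "have", "had",
}

def looks_like_question(text):
    """Approximate question detection without a syntactic analysis."""
    if "?" in text:
        return True
    tokens = re.findall(r"[a-z']+", text.lower())
    # A wh-term immediately followed by an auxiliary verb
    return any(a in WH_TERMS and b in AUXILIARIES
               for a, b in zip(tokens, tokens[1:]))
```

This is deliberately rough: it catches "what is the common denominator" without a question mark, but it also fires on relative clauses such as "the answer which is …", which is the kind of error a learned model with richer features can correct.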
TagHelper Customizations
Feature Space Design
Punctuation can be a “stand in” for mood
“you think the answer is 9?” versus “you think the answer is 9.”
Bigrams capture simple lexical patterns
“common denominator” versus “common multiple”
POS bigrams capture syntactic or stylistic information
“the answer which is …” vs “which is the answer”
Line length can be a proxy for explanation depth
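A minimal sketch of extracting three of these feature types (punctuation, word bigrams, and line length) from one contribution follows. POS bigrams are omitted because they need a part-of-speech tagger; the feature names and the function are hypothetical.

```python
def extract_features(text):
    """Return a dict of simple surface features for one contribution."""
    # Split on whitespace, then drop surrounding punctuation from each token
    words = [t.strip(".,?!\"'") for t in text.lower().split()]
    words = [w for w in words if w]
    features = {
        "has_question_mark": "?" in text,  # punctuation as a stand-in for mood
        "line_length": len(words),         # proxy for explanation depth
    }
    # Word bigrams: every pair of adjacent words
    for first, second in zip(words, words[1:]):
        features["bigram:" + first + "_" + second] = True
    return features

features = extract_features("the common denominator is 9?")
```

With bigrams, "common denominator" and "common multiple" become distinct features even though they share the unigram "common".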
TagHelper Customizations
Feature Space Design
Contains non-stop word can be a predictor of whether a conversational contribution is contentful
“ok sure” versus “the common denominator”
Remove stop words removes some distracting features
Stemming allows some generalization
Multiple, multiply, multiplication
Removing rare features is a cheap form of feature selection
Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
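These three options can be sketched together. The stop word list and the suffix-stripping stemmer below are toy assumptions chosen to fit the slide's examples; a real stemmer behaves differently, and all names are hypothetical.

```python
from collections import Counter

# Illustrative stop word list (an assumption, not TagHelper's list)
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "ok", "sure"}

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ication", "ing", "ed", "es", "s", "y", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def contains_non_stop_word(text):
    """True if the contribution has at least one word outside the stop list."""
    return any(token not in STOP_WORDS for token in text.lower().split())

def build_vocabulary(corpus, min_count=3):
    """Keep stems that occur in at least min_count segments of the corpus."""
    counts = Counter()
    for text in corpus:
        stems = {crude_stem(t) for t in text.lower().split() if t not in STOP_WORDS}
        counts.update(stems)
    return sorted(stem for stem, count in counts.items() if count >= min_count)

# Hypothetical corpus
corpus = [
    "multiply the fractions",
    "multiplication is easy",
    "find the common multiple",
    "ok sure",
]
vocabulary = build_vocabulary(corpus, min_count=2)
```

Here stemming merges "multiply", "multiplication", and "multiple" into one feature that occurs three times, while every other stem occurs once and is dropped by the frequency threshold.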
Created Features
Why create new features by hand?
Rules
For simple rules, it might be easier and faster to write the rules by hand instead of learning them from examples
Features
More likely to capture meaningful generalizations
Build in knowledge so you can get by with less training data
Rule Language
ANY() is used to create lists
COLOR = ANY(red,yellow,green,blue,purple)
FOOD = ANY(cake,pizza,hamburger,steak,bread)
ALL() is used to capture contingencies
ALL(cake,presents)
More complex rules
ALL(COLOR,FOOD)
* Note that you may wish to use part-of-speech tags in your rules!
What can you do with this rule language?
You may want to generalize across sets of related words
Color = {red,yellow,orange,green,blue}
Food = {cake,pizza,hamburger,steak,bread}
You may want to detect contingencies
The text must mention both cake and presents in order to count as a birthday party
You may want to combine these
The text must include a Color and a Food
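The semantics of ANY() and ALL() can be mimicked in a few lines of Python. This is a sketch of the idea only, assuming that ANY matches when at least one listed item appears in the text and ALL matches when every item does; it is not TagHelper's rule engine, and the helper names are hypothetical.

```python
def _match(item, tokens):
    """An item is either a plain word or a nested rule (a callable)."""
    return item(tokens) if callable(item) else item in tokens

def ANY(*items):
    """Rule that fires if at least one item matches."""
    return lambda tokens: any(_match(item, tokens) for item in items)

def ALL(*items):
    """Rule that fires only if every item matches."""
    return lambda tokens: all(_match(item, tokens) for item in items)

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

COLOR = ANY("red", "yellow", "green", "blue", "purple")
FOOD = ANY("cake", "pizza", "hamburger", "steak", "bread")
BIRTHDAY_PARTY = ALL("cake", "presents")
COLOR_AND_FOOD = ALL(COLOR, FOOD)
```

For example, `COLOR_AND_FOOD(tokenize("a green pizza"))` is true because one word from each list is present, while `BIRTHDAY_PARTY(tokenize("We had cake"))` is false because "presents" is missing.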
Advanced Feature Editing
* For small datasets, first deselect Remove rare features.
* Next, click on Adv Feature Editing
* Now you may begin creating your own features.
Types of Basic Features
Primitive features include unigrams, bigrams, and POS bigrams
The Options change which primitive features show up in the Unigram, Bigram, and POS bigram lists
You can choose to remove stopwords or not
You can choose whether or not to strip endings off words with stemming
You can choose how frequently a feature must appear in your data in order for it to show up in your lists
* Now let’s look at how to create new features.
Creating New Features
* You can use the feature editor to create new features.
* First click on ANY
* Then click ALL
* Now fill in ‘tell’ and ‘me’
* Now fill in the rest of the pattern from the POSBigram list
* Now change the name
* Click to add to feature list
Using the Display Option
Viewing Created Features
Any Questions?