TagHelper & SIDE Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

TagHelper & SIDE

Carolyn Penstein Rosé

Language Technologies Institute / Human-Computer Interaction Institute

TagHelper Tools and SIDE

TagHelper Tools uses text mining technology to automate annotation of conversational data

SIDE facilitates rapid prototyping of reporting interfaces for group learning facilitators

Define Summaries

Annotate Data

Visualize Annotated Data

Setting Up Your Data For TagHelper

Setting Up Your Data

How do you know when you have coded enough data?

What distinguishes Questions and Statements?

Not all questions end in a question mark.

Not all WH words occur in questions.

I versus you is not a reliable predictor.

You need to code enough data to avoid learning rules that won’t work.

Creating a Trained Model

Training and Testing

Start TagHelper Tools by double-clicking the portal.bat icon in your TagHelperTools2 folder

You will then see the following tool palette

The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data

Click on Train New Models

Loading a File

First click on Add a File

Then select a file

Simplest Usage

Click “GO!”

TagHelper will use its default settings to train a model on your coded examples

It will use that model to assign codes to the uncoded examples

More Advanced Usage

The second option is to modify the default settings

You get to the options you can set by clicking on >> Options

After you finish that, click “GO!”

Evaluating Performance

Performance report

The performance report tells you:

What dataset was used

What the customization settings were

At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
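
To make the two quantities at the bottom of the report concrete, here is a minimal Python sketch of a confusion matrix and one common reliability statistic, Cohen's kappa. Which statistics TagHelper actually reports is not stated on the slide, so kappa is an assumption here, and the labels and data are invented for illustration.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Rows are human (gold) codes, columns are predicted codes."""
    labels = sorted(set(gold) | set(predicted))
    counts = Counter(zip(gold, predicted))
    return labels, [[counts[(g, p)] for p in labels] for g in labels]


def cohens_kappa(gold, predicted):
    """Agreement corrected for chance; undefined if chance agreement is 1."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    labels = set(gold) | set(predicted)
    expected = sum(gold_freq[label] * pred_freq[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented toy data: Q = Question, S = Statement.
gold = ["Q", "Q", "S", "S", "S", "Q"]
pred = ["Q", "S", "S", "S", "Q", "Q"]

labels, matrix = confusion_matrix(gold, pred)
kappa = cohens_kappa(gold, pred)
print(labels)
print(matrix)
print(round(kappa, 3))
```

The off-diagonal cells of the matrix are the errors: each one says how often a segment with one human code received a different predicted code.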

Output File

The output file contains the codes for each segment

Note that the segments that were already coded will retain their original code

The other segments will have their automatic predictions

The prediction column indicates the confidence of the prediction
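
The way coded and uncoded segments end up side by side in the output can be pictured with a short sketch. The row layout, the field names (`text`, `code`, `confidence`), and the `toy_predict` function are hypothetical stand-ins, not TagHelper's actual file format or model, and the confidence values are invented.

```python
def fill_in_codes(segments, predict):
    """Keep human codes; fill uncoded segments with (prediction, confidence)."""
    output = []
    for seg in segments:
        if seg["code"] is not None:
            # Already coded by hand: keep the original code untouched.
            output.append({"text": seg["text"], "code": seg["code"], "confidence": None})
        else:
            code, confidence = predict(seg["text"])
            output.append({"text": seg["text"], "code": code, "confidence": confidence})
    return output


def toy_predict(text):
    # Hypothetical stand-in for a trained model; the confidences are made up.
    return ("Question", 0.9) if text.endswith("?") else ("Statement", 0.6)


segments = [
    {"text": "the answer is 9", "code": "Statement"},  # hand-coded
    {"text": "is it 9?", "code": None},                # uncoded
]
rows = fill_in_codes(segments, toy_predict)
for row in rows:
    print(row)
```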

Overview of Basic Feature Extraction from Text

Customizations

To customize the settings:

Select the file

Click on Options

Classifier Options

* The three main types of classifiers are Bayesian models (Naïve Bayes), functions (SMO), and trees (J48)

Rules of thumb:

SMO is state-of-the-art for text classification

J48 is best with small feature sets – also handles contingencies between features well

Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
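
The "accumulating evidence" idea behind Naïve Bayes can be sketched in a few lines of Python. This is a toy multinomial Naïve Bayes with add-one smoothing written for illustration, not the implementation TagHelper uses; the training sentences and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {l: math.log(c / len(examples)) for l, c in label_counts.items()}
    log_like = {}
    for label in label_counts:
        # Add-one smoothing so unseen words do not zero out a class.
        total = sum(word_counts[label].values()) + len(vocab)
        log_like[label] = {w: math.log((word_counts[label][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab


def classify(tokens, model):
    log_prior, log_like, vocab = model
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for w in tokens:
            if w in vocab:
                # Each word nudges the score; no single word decides the outcome.
                score += log_like[label][w]
        scores[label] = score
    return max(scores, key=scores.get), scores


# Invented toy training data.
examples = [
    ("what is the answer".split(), "Question"),
    ("why do you think so".split(), "Question"),
    ("the answer is nine".split(), "Statement"),
    ("i think it is nine".split(), "Statement"),
]
model = train_naive_bayes(examples)
print(classify("what do you think".split(), model)[0])
print(classify("it is nine".split(), model)[0])
```

A decision tree such as J48 would instead test one feature at a time and branch on it, which is why it suits hard and fast rules better than gradual evidence.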

Basic Idea

Represent text as a vector where each position corresponds to a term

This is called the “bag of words” approach

Vector positions: Cheese, Cows, Eggs, Hens, Lay, Make

“Cows make cheese” = 110001

“Hens lay eggs” = 001110

But same representation for “Cheese makes cows.”!
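
The vectors above can be reproduced with a minimal Python sketch. The vocabulary order follows the slide; the `NORMALIZE` lookup is an assumption standing in for real stemming, so that "makes" and "make" share a position.

```python
import re

# Vocabulary from the slide, in vector-position order.
VOCAB = ["cheese", "cows", "eggs", "hens", "lay", "make"]

# Assumption: a one-entry lookup standing in for real stemming.
NORMALIZE = {"makes": "make"}


def bag_of_words(text, vocab=VOCAB):
    """Return a binary string: 1 if the term occurs in the text, else 0."""
    tokens = {NORMALIZE.get(t, t) for t in re.findall(r"[a-z]+", text.lower())}
    return "".join("1" if term in tokens else "0" for term in vocab)


print(bag_of_words("Cows make cheese"))
print(bag_of_words("Hens lay eggs"))
print(bag_of_words("Cheese makes cows."))
```

Because the tokens are collected into a set, word order is discarded, which is exactly why the first and third sentences receive the same vector.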

What can’t you conclude from “bag of words” representations?

Causality: “X caused Y” versus “Y caused X”

Roles and Mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.”

Who’s driving, who’s eating, and who’s preparing food?

Basic Anatomy: Layers of Linguistic Analysis

Phonology: The sound structure of language

Basic sounds, syllables, rhythm, intonation

Morphology: The building blocks of words

Inflection: tense, number, gender

Derivation: building words from other words, transforming part of speech

Syntax: Structural and functional relationships between spans of text within a sentence

Phrase and clause structure

Semantics: Literal meaning, propositional content

Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness)

Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis

Part of Speech Tagging

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative

http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

Part of Speech Tagging

23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb

http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

TagHelper Customizations

Feature Space Design

Think like a computer!

Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful

Look for approximations

If you want to find questions, you don’t need to do a complete syntactic analysis

Look for question marks

Look for wh-terms that occur immediately before an auxiliary verb
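
The two approximations for finding questions can be sketched as simple surface checks. The word lists below are short, hand-picked assumptions rather than anything taken from TagHelper, and the function name is hypothetical.

```python
import re

# Assumed, deliberately incomplete word lists.
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should", "have", "has", "had"}


def question_features(text):
    """Two cheap cues for questions; no syntactic analysis involved."""
    tokens = re.findall(r"[a-z']+|\?", text.lower())
    wh_before_aux = any(
        first in WH_TERMS and second in AUXILIARIES
        for first, second in zip(tokens, tokens[1:])
    )
    return {"has_question_mark": "?" in tokens, "wh_before_aux": wh_before_aux}


print(question_features("you think the answer is 9?"))
print(question_features("what is the answer"))
print(question_features("i know what you mean."))
print(question_features("the answer which is nine"))
```

The last call shows that such a cue is only an approximation: "which is" inside a relative clause also triggers it, so the learner still needs other features to sort those cases out.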

TagHelper Customizations

Feature Space Design

Punctuation can be a “stand in” for mood

“you think the answer is 9?”

“you think the answer is 9.”

Bigrams capture simple lexical patterns

“common denominator” versus “common multiple”

POS bigrams capture syntactic or stylistic information

“the answer which is …” vs “which is the answer”

Line length can be a proxy for explanation depth
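
A rough Python sketch of these surface features follows. The tokenizer and feature names are assumptions, length is measured in words as one possible reading of "line length", and the POS tags are assigned by hand for illustration rather than produced by a tagger.

```python
import re

PUNCT = {"?", ".", "!"}


def tokenize(text):
    # Lowercased word tokens, with sentence punctuation kept as separate tokens.
    return re.findall(r"[a-z0-9']+|[?.!]", text.lower())


def bigrams(items):
    # Adjacent pairs joined with "_".
    return [a + "_" + b for a, b in zip(items, items[1:])]


def surface_features(text):
    tokens = tokenize(text)
    words = [t for t in tokens if t not in PUNCT]
    return {
        "final_punct": tokens[-1] if tokens and tokens[-1] in PUNCT else None,
        "word_bigrams": bigrams(words),
        "length_in_words": len(words),
    }


question = surface_features("you think the answer is 9?")
statement = surface_features("you think the answer is 9.")
print(question["final_punct"], statement["final_punct"])
print(surface_features("common denominator")["word_bigrams"])

# Hand-assigned POS tags; a real tagger may choose differently.
relative = bigrams(["DT", "NN", "WDT", "VBZ"])  # "the answer which is"
fronted = bigrams(["WDT", "VBZ", "DT", "NN"])   # "which is the answer"
print(relative)
print(fronted)
```

The two example sentences share every word bigram and differ only in the final punctuation feature, while the two POS sequences differ in which tag pairs occur even though they use the same four tags.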

TagHelper Customizations

Feature Space Design

“Contains non-stop word” can be a predictor of whether a conversational contribution is contentful

“ok sure” versus “the common denominator”

“Remove stop words” removes some distracting features

Stemming allows some generalization

Multiple, multiply, multiplication

Removing rare features is a cheap form of feature selection

Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
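
These three options can be sketched in Python as follows. The stop word list and the `crude_stem` function are toy assumptions built only to cover the slide's examples; they are not the stop list or stemmer that TagHelper ships with.

```python
from collections import Counter

# Assumed, tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "ok", "sure"}


def contains_non_stop_word(text):
    return any(t not in STOP_WORDS for t in text.lower().split())


def crude_stem(word):
    # Toy stemmer: strip a few endings, keeping at least four characters.
    for suffix in ("ication", "e", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def frequent_features(token_lists, min_count=3):
    """Keep features seen at least min_count times in the whole corpus."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {t for t, c in counts.items() if c >= min_count}


print(contains_non_stop_word("ok sure"))
print(contains_non_stop_word("the common denominator"))
print([crude_stem(w) for w in ["multiple", "multiply", "multiplication"]])

corpus = [["multipl", "by", "two"], ["multipl", "again"],
          ["multipl", "common"], ["common"]]
print(frequent_features(corpus))
```

With `min_count=3`, features that occur only once or twice are dropped, which matches the slide's rule of thumb; the threshold itself is an assumed default.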

Created Features

Why create new features by hand?

Rules

For simple rules, it might be easier and faster to write the rules by hand instead of learning them from examples

Features

More likely to capture meaningful generalizations

Build in knowledge so you can get by with less training data

Rule Language

ANY() is used to create lists

COLOR = ANY(red,yellow,green,blue,purple)

FOOD = ANY(cake,pizza,hamburger,steak,bread)

ALL() is used to capture contingencies

ALL(cake,presents)

More complex rules

ALL(COLOR,FOOD)

* Note that you may wish to use part-of-speech tags in your rules!
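
One way to picture how ANY() and ALL() rules behave is to model them as small Python functions over the set of tokens in a segment. The matching semantics here (a word matches if it occurs anywhere in the segment) are an assumption, and this is an illustration of the idea rather than TagHelper's rule engine.

```python
def matches(part, tokens):
    # A part is either a plain word or a previously built rule (a callable).
    return part(tokens) if callable(part) else part in tokens


def ANY(*parts):
    # True if at least one part matches.
    return lambda tokens: any(matches(p, tokens) for p in parts)


def ALL(*parts):
    # True only if every part matches.
    return lambda tokens: all(matches(p, tokens) for p in parts)


COLOR = ANY("red", "yellow", "green", "blue", "purple")
FOOD = ANY("cake", "pizza", "hamburger", "steak", "bread")
BIRTHDAY = ALL("cake", "presents")
COLOR_AND_FOOD = ALL(COLOR, FOOD)

print(COLOR_AND_FOOD(set("i ate green cake".split())))
print(COLOR_AND_FOOD(set("i ate cake".split())))
print(BIRTHDAY(set("we had cake and presents".split())))
```

Because rules are themselves callables, lists built with ANY() can be nested inside ALL(), which mirrors the ALL(COLOR,FOOD) example above.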

What can you do with this rule language?

You may want to generalize across sets of related words

Color = {red,yellow,orange,green,blue}

Food = {cake,pizza,hamburger,steak,bread}

You may want to detect contingencies

The text must mention both cake and presents in order to count as a birthday party

You may want to combine these

The text must include a Color and a Food

Advanced Feature Editing

* For small datasets, first deselect Remove rare features.

* Next, click on Adv Feature Editing

* Now you may begin creating your own features.

Types of Basic Features

Primitive features include unigrams, bigrams, and POS bigrams

The Options change which primitive features show up in the Unigram, Bigram, and POS bigram lists

You can choose to remove stopwords or not

You can choose whether or not to strip endings off words with stemming

You can choose how frequently a feature must appear in your data in order for it to show up in your lists

* Now let’s look at how to create new features.

Creating New Features

* You can use the feature editor to create new features.

* First click on ANY

* Then click ALL

* Now fill in ‘tell’ and ‘me’

* Now fill in the rest of the pattern from the POSBigram list

* Now change the name

* Click to add to feature list

Using the Display Option

Viewing Created Features

Any Questions?