TagHelper & SIDE Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction...

TagHelper & SIDE

Carolyn Penstein Rosé

Language Technologies Institute / Human-Computer Interaction Institute

TagHelper Tools and SIDE

TagHelper Tools uses text mining technology to automate annotation of conversational data

SIDE facilitates rapid prototyping of reporting interfaces for group learning facilitators

Define Summaries

Annotate Data

Visualize Annotated Data

Setting Up Your Data For TagHelper

Setting Up Your Data

How do you know when you have coded enough data?

What distinguishes Questions and Statements?

Not all questions end in a question mark.

Not all WH words occur in questions.

I versus you is not a reliable predictor.

You need to code enough to avoid learning rules that won’t work.

Creating a Trained Model

Training and Testing

Start TagHelper tools by double clicking on the portal.bat icon in your TagHelperTools2 folder

You will then see the following tool palette

The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data

Click on Train New Models

Loading a File

First click on Add a File

Then select a file

Simplest Usage

Click “GO!”

TagHelper will use its default settings to train a model on your coded examples

It will use that model to assign codes to the uncoded examples
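The train-then-apply workflow above can be sketched in a few lines. This is only an illustration of the idea, not TagHelper's actual implementation: the small Naive Bayes learner, the `rows` data, and the convention that `None` marks an uncoded segment are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace; a deliberately crude tokenizer.
    return text.lower().split()

def train(coded):
    """Build word counts per code from (text, code) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in coded:
        class_counts[code] += 1
        for tok in tokenize(text):
            word_counts[code][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the code with the highest summed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_code, best_score = None, float("-inf")
    for code in sorted(class_counts):
        score = math.log(class_counts[code] / total)
        # Add-one smoothing so unseen word/code pairs do not zero out the score.
        denom = sum(word_counts[code].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:
                score += math.log((word_counts[code][tok] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

# Hypothetical data: a code of None marks an uncoded segment.
rows = [
    ("what is the common denominator ?", "question"),
    ("which answer is right ?", "question"),
    ("the answer is 9 .", "statement"),
    ("i think it is 12 .", "statement"),
    ("what is the answer ?", None),
    ("i think the answer is 9 .", None),
]
model = train([(t, c) for t, c in rows if c is not None])
# Coded segments keep their code; uncoded segments receive a prediction.
labeled = [(t, c if c is not None else predict(model, t)) for t, c in rows]
```

With this toy data the two uncoded segments come out as "question" and "statement" respectively, while the four coded segments are left unchanged.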

More Advanced Usage

The second option is to modify the default settings

You get to the options you can set by clicking on >> Options

After you finish that, click “GO!”

Evaluating Performance

Performance report

The performance report tells you:

* What dataset was used
* What the customization settings were

At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
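As a rough illustration of what such a report contains, the sketch below counts a confusion matrix and computes Cohen's kappa from two lists of codes. The slides do not say which reliability statistics TagHelper reports, so kappa is an assumption here, and the `gold` and `pred` lists are invented.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each (gold code, predicted code) pair occurs."""
    return Counter(zip(gold, predicted))

def cohens_kappa(gold, predicted):
    """Agreement between two codings, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    expected = sum(gold_freq[c] * pred_freq[c] for c in gold_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # Degenerate case: only one code is ever used.
    return (observed - expected) / (1 - expected)

# Invented example: human codes versus automatic predictions.
gold = ["question", "question", "statement", "statement", "statement", "question"]
pred = ["question", "statement", "statement", "statement", "question", "question"]
matrix = confusion_matrix(gold, pred)
kappa = cohens_kappa(gold, pred)
```

Off-diagonal cells such as `matrix[("question", "statement")]` show which codes are being confused with which.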

Output File

The output file contains the codes for each segment

Note that the segments that were already coded will retain their original code

The other segments will have their automatic predictions

The prediction column indicates the confidence of the prediction

Overview of Basic Feature Extraction from Text

Customizations

To customize the settings:

* Select the file
* Click on Options

Classifier Options

* The three main types of classifiers are Bayesian models (Naïve Bayes), functions (SMO), and trees (J48)

Rules of thumb:

* SMO is state-of-the-art for text classification
* J48 is best with small feature sets; it also handles contingencies between features well
* Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules

Basic Idea

Represent text as a vector where each position corresponds to a term

This is called the “bag of words” approach

Vocabulary (one vector position per term): Cheese, Cows, Eggs, Hens, Lay, Make

Cows make cheese: 110001

Hens lay eggs: 001110

But the same representation for “Cheese makes cows.”!
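The example above can be reproduced with a short sketch. The `normalize` helper is hypothetical and very crude (it lowercases, strips punctuation, and drops a trailing "s"); it exists only so that "Cows" and "makes" map to the same terms as "cows" and "make".

```python
def normalize(word):
    # Hypothetical normalizer: lowercase, strip punctuation, drop a trailing "s".
    word = word.lower().strip(".,!?\"'")
    return word[:-1] if word.endswith("s") else word

def bag_of_words(text, vocabulary):
    """Return a 0/1 vector with one position per vocabulary term."""
    present = {normalize(w) for w in text.split()}
    return [1 if term in present else 0 for term in vocabulary]

texts = ["Cows make cheese", "Hens lay eggs", "Cheese makes cows."]
# Alphabetical vocabulary: cheese, cow, egg, hen, lay, make
vocabulary = sorted({normalize(w) for t in texts for w in t.split()})
vectors = {t: bag_of_words(t, vocabulary) for t in texts}
```

Because word order is discarded, `vectors["Cows make cheese"]` and `vectors["Cheese makes cows."]` are identical.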

What can’t you conclude from “bag of words” representations?

Causality: “X caused Y” versus “Y caused X”

Roles and Mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.” Who’s driving, who’s eating, and who’s preparing food?

Basic Anatomy: Layers of Linguistic Analysis

Phonology: The sound structure of language

* Basic sounds, syllables, rhythm, intonation

Morphology: The building blocks of words

* Inflection: tense, number, gender
* Derivation: building words from other words, transforming part of speech

Syntax: Structural and functional relationships between spans of text within a sentence

* Phrase and clause structure

Semantics: Literal meaning, propositional content

Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness)

Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis

Part of Speech Tagging

1. CC Coordinating conjunction

2. CD Cardinal number

3. DT Determiner

4. EX Existential there

5. FW Foreign word

6. IN Preposition/subordinating conjunction

7. JJ Adjective

8. JJR Adjective, comparative

9. JJS Adjective, superlative

10. LS List item marker

11. MD Modal

12. NN Noun, singular or mass

13. NNS Noun, plural

14. NNP Proper noun, singular

15. NNPS Proper noun, plural

16. PDT Predeterminer

17. POS Possessive ending

18. PRP Personal pronoun

19. PRP$ Possessive pronoun

20. RB Adverb

21. RBR Adverb, comparative

22. RBS Adverb, superlative

http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

Part of Speech Tagging

23.RP Particle

24.SYM Symbol

25.TO to

26.UH Interjection

27.VB Verb, base form

28.VBD Verb, past tense

29.VBG Verb, gerund/present participle

30.VBN Verb, past participle

31.VBP Verb, non-3rd ps. sing. present

32.VBZ Verb, 3rd ps. sing. present

33.WDT wh-determiner

34.WP wh-pronoun

35.WP$ Possessive wh-pronoun

36.WRB wh-adverb

http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

TagHelper Customizations

Feature Space Design

Think like a computer!

* Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful

Look for approximations

* If you want to find questions, you don’t need to do a complete syntactic analysis
* Look for question marks
* Look for wh-terms that occur immediately before an auxiliary verb
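The two approximations above can be sketched with a regular expression. The word lists are assumptions chosen for illustration, not TagHelper's own lists, and the pattern is deliberately shallow: it will also fire on relative clauses such as "the answer which is", which is the price of skipping full syntactic analysis.

```python
import re

# Assumed, incomplete word lists for illustration only.
WH_TERMS = r"(?:who|what|when|where|why|which|how)"
AUXILIARIES = r"(?:is|are|was|were|do|does|did|can|could|will|would|should|have|has|had)"
WH_BEFORE_AUX = re.compile(rf"\b{WH_TERMS}\s+{AUXILIARIES}\b", re.IGNORECASE)

def question_features(text):
    """Two cheap cues that a segment may be a question."""
    return {
        "has_question_mark": "?" in text,
        "wh_before_aux": bool(WH_BEFORE_AUX.search(text)),
    }

no_mark = question_features("what is the common denominator")
with_mark = question_features("you think the answer is 9?")
embedded = question_features("I know what you mean.")
```

The first segment is caught by the wh-term cue alone, the second by the question mark alone, and the third by neither, since "what" is not followed by an auxiliary.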


Feature Space Design

Punctuation can be a “stand in” for mood

* “you think the answer is 9?”
* “you think the answer is 9.”

Bigrams capture simple lexical patterns

* “common denominator” versus “common multiple”

POS bigrams capture syntactic or stylistic information

* “the answer which is …” vs “which is the answer”

Line length can be a proxy for explanation depth
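A minimal sketch of how the surface features on this slide might be extracted. The function names and the feature dictionary layout are invented for the example. POS bigrams are left out because they require a part-of-speech tagger, and the tokenizer is crude (it would split decimal numbers at the period).

```python
def tokenize(text):
    # Split "?" and "." into their own tokens so punctuation becomes a feature.
    for mark in "?.":
        text = text.replace(mark, f" {mark} ")
    return text.lower().split()

def bigrams(tokens):
    """Join each pair of adjacent tokens into one feature name."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def surface_features(text):
    tokens = tokenize(text)
    return {
        "unigrams": set(tokens),
        "bigrams": set(bigrams(tokens)),
        "ends_with_question_mark": text.rstrip().endswith("?"),
        "line_length": len(tokens),  # measured in tokens
    }

asked = surface_features("you think the answer is 9?")
stated = surface_features("you think the answer is 9.")
phrase = surface_features("the common denominator")
```

Here `asked` and `stated` differ only in their final punctuation token, which is exactly the cue the slide describes, and `phrase` contains the bigram `common_denominator` but not `common_multiple`.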


Feature Space Design

Contains non-stop word can be a predictor of whether a conversational contribution is contentful

* “ok sure” versus “the common denominator”

Remove stop words removes some distracting features

Stemming allows some generalization

* Multiple, multiply, multiplication

Removing rare features is a cheap form of feature selection

* Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
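The options on this slide can be illustrated as follows. Everything here is an assumption made for the sketch: the stop word list is tiny and invented, `crude_stem` is a hypothetical suffix stripper far simpler than a real stemmer, and the threshold of three occurrences simply encodes "more than once or twice".

```python
from collections import Counter

# Hypothetical, tiny stop word list.
STOP_WORDS = {"the", "a", "is", "ok", "sure", "of", "and", "you", "then"}

def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in ("ication", "ing", "ed", "es", "s", "y", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def content_tokens(text):
    """Drop stop words, then stem what remains."""
    return [crude_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

def contains_non_stop_word(text):
    return len(content_tokens(text)) > 0

def frequent_features(corpus, min_count=3):
    """Keep features that appear in at least min_count segments."""
    counts = Counter(tok for text in corpus for tok in set(content_tokens(text)))
    return {tok for tok, n in counts.items() if n >= min_count}

corpus = [
    "multiply the fractions",
    "find a common multiple",
    "multiplication is next",
    "ok sure",
]
kept = frequent_features(corpus)
```

In this toy corpus, stemming maps "multiply", "multiple", and "multiplication" to the same stem, so that stem is the only feature frequent enough to survive, and "ok sure" is flagged as having no content words.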

Created Features

Why create new features by hand?

Rules

* For simple rules, it might be easier and faster to write the rules by hand instead of learning them from examples

Features

* More likely to capture meaningful generalizations
* Build in knowledge so you can get by with less training data

Rule Language

ANY() is used to create lists

COLOR = ANY(red,yellow,green,blue,purple)

FOOD = ANY(cake,pizza,hamburger,steak,bread)

ALL() is used to capture contingencies

ALL(cake,presents)

More complex rules

ALL(COLOR,FOOD)

* Note that you may wish to use part-of-speech tags in your rules!

What can you do with this rule language?

You may want to generalize across sets of related words

* Color = {red,yellow,orange,green,blue}
* Food = {cake,pizza,hamburger,steak,bread}

You may want to detect contingencies

* The text must mention both cake and presents in order to count as a birthday party

You may want to combine these

* The text must include a Color and a Food
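One way to picture how ANY and ALL rules behave is to model each rule as a function over the set of words in a segment. This is a sketch of the semantics as described on the slides, not TagHelper's rule engine; the helper names `matches` and `apply_rule` are invented, and matching on whitespace-separated lowercase words is an assumption.

```python
def matches(item, tokens):
    # An item is either a plain word or another rule (a callable).
    return item(tokens) if callable(item) else item in tokens

def ANY(*items):
    """Rule that holds if at least one item matches."""
    return lambda tokens: any(matches(item, tokens) for item in items)

def ALL(*items):
    """Rule that holds only if every item matches."""
    return lambda tokens: all(matches(item, tokens) for item in items)

def apply_rule(rule, text):
    return rule(set(text.lower().split()))

COLOR = ANY("red", "yellow", "green", "blue", "purple")
FOOD = ANY("cake", "pizza", "hamburger", "steak", "bread")
BIRTHDAY = ALL("cake", "presents")
COLOR_AND_FOOD = ALL(COLOR, FOOD)

party = apply_rule(BIRTHDAY, "we had cake and presents")
no_party = apply_rule(BIRTHDAY, "we had cake")
green_cake = apply_rule(COLOR_AND_FOOD, "a green cake")
green_car = apply_rule(COLOR_AND_FOOD, "a green car")
```

Because rules can be nested, `ALL(COLOR, FOOD)` holds for "a green cake" but not for "a green car", and the value of each rule can be used as a new binary feature.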

Advanced Feature Editing


* For small datasets, first deselect Remove rare features.


* Next, Click on Adv Feature Editing


* Now you may begin creating your own features.

Types of Basic Features

Primitive features include unigrams, bigrams, and POS bigrams

Types of Basic Features

The Options change which primitive features show up in the Unigram, Bigram, and POS bigram lists

* You can choose to remove stopwords or not
* You can choose whether or not to strip endings off words with stemming
* You can choose how frequently a feature must appear in your data in order for it to show up in your lists

Types of Basic Features

* Now let’s look at how to create new features.

Creating New Features

* You can use the feature editor to create new features.


* First click on ANY


* Then click ALL


* Now fill in ‘tell’ and ‘me’


* Now fill in the rest of the pattern from the POSBigram list


* Now change the name


* Click to add to feature list

Using the Display Option

Viewing Created Features

Any Questions?
