Classification of Short Text Sequences
Transcript of pszaxc/DReSS/CRGS06.pdf (Nottingham), 28 slides

Page 1:

TextCat / Lancaster / October 2006

Classification of Short Text Sequences

Richard Forsyth, Shaaron Ainsworth, David Clarke,

Claire O’Malley & Pat Brundell

School of Psychology, 0115-951 5281, [email protected]

Page 2:

Outline

Background

Introduction
– Situation
– Aims

Method
– Algorithms
– Datasets

Results
– Pilot study: linear classifier
– Naïve Markov Classifier

Discussion

Page 3:

Background

Relating linguistic to extra-linguistic information

Advances in Learning Sciences

Exploiting digital records (legacy data) of verbalizations

e-Social Science
– text-mining as a means of quantifying qualitative information

Computational coding
– social scientists spend huge amounts of time hand-coding verbal data

Page 4:

Intro: disciplinary differences in approach

Several approaches to text categorization

Linguistics:
– Tagging (semantic / PoS)
– Mostly at word level

Computing:
– Classifying (authorship, content)
– Mostly at document level

Social Sciences:
– Categorical coding
– Mostly at “segment” level (phrase, utterance)

Page 5:

Intro: categorical coding of verbal data

Researchers in Social Sciences spend lots of effort & time on
– Recording
– Transcribing
– Segmenting
– Categorical coding
verbal data.

We’re looking at item four (mainly).
– Plethora of categorical coding schemes

Page 6:

Intro: talking and learning

Talking to yourself can help: e.g. self-explaining

Talking with others can help: e.g. collaborative learning, argumentation-based learning, Socratic dialogue

But not all talk is good... so learning scientists need to analyse what people say when they talk.
– This is both time-consuming and can be unreliable.
– Hence our interest in whether text mining can help: a) speed things up, b) increase reliability, and c) help us deal with the increasing numbers of digital records of learning.

Page 7:

Intro: 3 main categories

Textual Material:
“The septum divides the heart lengthwise into two sides”

3 codes:
– Paraphrase
  • “The septum is what goes down the middle of the heart”
– Self-explanation
  • “Septum is what separates the two ... some sort of control”
– Monitoring-statement
  • “I'm not sure why”

Page 8:

Intro: aims & context

Objective:
– (semi-)automatic classifier to assist categorical coding

Notes:
– Very few standard “tag sets”; most coding schemes are novel
– Therefore a trainable classifier is essential
– Important to economize human effort in the loop (see the sketch after this list):
  • Expert codes another block of text segments
  • Test learning system on cases so far (cross-validated)
  • Decide whether to continue:
    – Stop: accuracy good enough
    – Abandon: accuracy will never be good enough
    – Code more cases: accuracy will be OK with reasonable effort
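To make the decision loop concrete, here is a minimal Python sketch of the stop / abandon / continue rule. Everything in it (function and parameter names, block size, accuracy target, the flat-curve test) is invented for illustration; the slides specify only the three possible outcomes.

    # Hypothetical sketch of the expert-in-the-loop stopping rule; all names
    # and thresholds are illustrative, not from the slides.
    def coding_session(segments, expert_code, xval_accuracy,
                       block_size=50, target=0.80, min_gain=0.02):
        coded = []                        # (segment, code) pairs labelled so far
        prev_acc = 0.0
        while segments:
            block, segments = segments[:block_size], segments[block_size:]
            coded.extend((seg, expert_code(seg)) for seg in block)
            acc = xval_accuracy(coded)    # cross-validated accuracy so far
            if acc >= target:
                return "stop", coded      # accuracy good enough
            if acc - prev_acc < min_gain:
                return "abandon", coded   # curve has flattened below target
            prev_acc = acc                # otherwise: code more cases
        return "out of data", coded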

Page 9:

Method: algorithms

Pilot study: linear discriminant function
– With vocab & tag variables

Latest study: Naïve Markov Classifier
– Khmelev & Tweedie (2001); Peng et al. (2003)
– N-gram Markovian model at byte or word level
– Naïve Bayesian inference as “gluing” mechanism
– (“m-estimate” for attenuating probabilities)
– (a minimal sketch of the idea follows below)
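As a rough illustration of how those three ingredients combine, here is a byte-level sketch in Python. This is not the authors' TextCat code: the class name, the default gram size, and the uniform 1/256 prior inside the m-estimate are all assumptions.

    from collections import defaultdict
    import math

    class NaiveMarkovClassifier:
        # n-gram Markov model per category, glued together with naive-Bayes
        # log-probability sums; the m-estimate attenuates unseen transitions.
        def __init__(self, n=2, m=1.0):
            self.n = n                          # context length (gram size)
            self.m = m                          # m-estimate smoothing weight
            self.counts = {}                    # label -> context -> symbol -> count
            self.class_sizes = defaultdict(int)

        def train(self, text, label):
            table = self.counts.setdefault(
                label, defaultdict(lambda: defaultdict(int)))
            self.class_sizes[label] += 1
            padded = " " * self.n + text
            for i in range(self.n, len(padded)):
                table[padded[i - self.n:i]][padded[i]] += 1

        def _log_prob(self, text, label):
            table = self.counts[label]
            lp = math.log(self.class_sizes[label] / sum(self.class_sizes.values()))
            padded = " " * self.n + text
            for i in range(self.n, len(padded)):
                ctx = table.get(padded[i - self.n:i], {})
                # m-estimate: blend observed frequency with a uniform 1/256 prior
                p = (ctx.get(padded[i], 0) + self.m / 256) / (sum(ctx.values()) + self.m)
                lp += math.log(p)
            return lp

        def classify(self, text):
            return max(self.counts, key=lambda lab: self._log_prob(text, lab))

Usage would be: call clf.train(segment, code) for each hand-coded segment, then clf.classify(new_segment) returns the most probable code.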

Page 10:

Method: naïve Markov classifier

Why I like this algorithm:

It’s fast & not very memory-hungry
No pre-processing phase
No lexicons or external support s/w needed
No variable-selection phase
– (therefore less danger of overfitting)
Just uses all the data of a given type
Has a Bayesian underpinning
Can work in almost any language
– (e.g. could handle DNA sequences)

Page 11:

Method: operational framework

Obtain test-set size (N2) from user
For a user-specified number of repetitions:
    Set initial training-set size N1 [typically zero]
    While training-set size plus test-set size (N1 + N2) <= data-set size:
        Pick a random subsample of size N1 as training set
        Pick a disjoint random subsample of size N2 as testing set
        Train classifier on training data
        Test classifier on testing data
        Record results
        Increment N1 by user-specified increment (N3)
Append results to file for subsequent analyses
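A direct Python transcription of that loop might look like this. The function name is hypothetical, and it starts N1 at N3 rather than zero (an untrained classifier has nothing to predict with); it assumes a classifier object with the train/classify interface of the sketch on the previous page.

    import random

    def learning_curve(data, make_classifier, n2, n3, repetitions):
        # data: list of (text, label) pairs; returns (N1, accuracy) records
        results = []
        for _ in range(repetitions):
            n1 = n3
            while n1 + n2 <= len(data):
                sample = random.sample(data, n1 + n2)   # random subsample
                train, test = sample[:n1], sample[n1:]  # disjoint by construction
                clf = make_classifier()
                for text, label in train:
                    clf.train(text, label)
                hits = sum(clf.classify(text) == label for text, label in test)
                results.append((n1, hits / n2))
                n1 += n3
        return results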

Page 12:

Method: Dataset details (N.B. legacy data):

Both learning tasks on same topic = cardiovascular system (Plenty of data “cleansing” & reorganizing done.)

Dataset:       AB (Ainsworth & Burcham, 2004)      AR (Robertson, 2004)
Participants:  24 (13 female, 11 male)             24 (13 female, 11 male)
Words:         44,388                              23,330
Segments:      2071                                1784
Material:      text                                diagrams
Conditions:    high versus low coherence           abstract versus realistic
Assessment:    3 pre-test & 5 post-test measures   3 pre-test & 5 post-test measures

Page 13:

Method: samples of High & Low coherence text

Maximal: this oxygenated blood is then pumped through the bicuspid valve (the a-v valve on the left side of the heart) from the atria to the left ventricle

Minimal: it is pumped through the bicuspid valve into the left ventricle

Original: the oxygenated blood is then pumped through the bicuspid valve into the left ventricle

Page 14:

Method: abstract and concrete diagrams

[Diagram pair: abstract and concrete depictions of the heart. Shared labels: direction of blood flow; upper chamber (atrium) and lower chamber (ventricle) on each side; left side / right side; septum; to lungs; to rest of body.]

Page 15:

Pilot study: specimen of WMatrix tagging output (AR27; 35, 36)

CC And Z5 PPH1 it Z8 VBZ 's Z5 A3+ VVG showing A10+ S1.1.1 RR obviously Z4 A11.2+ AT the Z5 JJ main A11.1+ N5+++ NN1 pump O2 B5 VBZ is A3+ Z5 AT the Z5 NN1 heart B1 M6 A11.1+ E1 X5.2+ , , PPH1 it Z8 VBZ 's A3+ Z5 RG very A13.3 JJ big N3.2+ N5+ A11.1+ S1.2.5+ X5.2+ CC and Z5 DD1 that Z8 Z5 VM will T1.1.3 VVI pump M2 A1.1.1 A9- Q2.2 S8+ B3

(Slide annotation: each word is shown with its syntactic code and its semantic codes in decreasing order of predicted likelihood.)

Page 16:

Pilot study: Utterance classification: LDF 3-group plot, AR data:

[Scatter plot of canonical discriminant functions: Function 1 (x-axis) versus Function 2 (y-axis), with group centroids for segcode3 groups 1 = Paraphrase, 2 = Self-Explanation, 3 = Monitoring.]

Page 17:

Pilot study: Utterance classification: LDF 3-by-3, AR data (cross-validated)

[Bar chart: counts of predicted class (Paraphrase, Self-Explanation, Monitoring) grouped by true class, cross-validated; counts on a 0 to 1,000 scale.]

Page 18:

Pilot study: classification variables used (AR data)

8 variables maximum (counts); picked: 6 syntactic, 2 semantic

Step entered  Function 1  Function 2  WMatrix category (* = semantic)  Four commonest tokens coded thus in AR data texts
1               2.043        .291     *X2: mental acts & processes     know, suppose, think
2               2.582       -.129     XX: negation                     n't, not
3               -.139        .552     VV: lexical verb                 goes, got, go, get
4               -.065      -1.038     *H2: parts of buildings          atrium, portal, chambers, atriums
5               -.164        .587     CS: subordinating conjunction    that, so, because, when
6               -.209        .112     NN: noun                         blood, heart, body, lungs
7                .103       -.367     JJ: adjective                    left, right, pulmonary, other
8               -.699        .777     VM: modal verb                   can, will, could, ca [n't]

Page 19:

Results: NMC

Factors investigated:

Mode:
– Byte or Word as basic unit

N-gram size:
– 0 .. 3

(A sketch of such a sweep follows below.)
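Purely as an illustration, a sweep over gram size could reuse the hypothetical harness above (byte mode only here; a word-level run would tokenize each segment first, and the sample sizes are invented):

    # `data` is assumed to be the list of (segment, code) pairs
    for n in range(0, 4):                      # n-gram size 0 .. 3
        curve = learning_curve(data, lambda n=n: NaiveMarkovClassifier(n=n),
                               n2=200, n3=200, repetitions=5)
        peak = max(acc for _, acc in curve)
        print(f"n={n}: peak accuracy {peak:.3f}")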

Page 20:

Results: sample “learning curve” 1

Page 21:

Results: sample “learning curve” 2

Page 22:

Results: mode & gram-size

Page 23:

Results: this is the only slide you need to remember!

Page 24:

Fitting the “learning curve” / “experience curve”

3 formulae tried:
– Power, Exponential, Log-reciprocal

Power: Y = a + b * x ^ c
– Wright (1936), management science
– e.g. cost per unit declines as production continues

Exponential: Y = a + b * (1 - 10 ^ (c * x))
– Hull (1943), psychology
– e.g. time for rat to find food decreases with repeated trials

Log-reciprocal: Y = a + b * ln(x+1) + c * 1/(x+1)
– Forsyth (2006), machine learning (ad hoc curve-fitting?)
– e.g. error rate goes down as size of training data goes up
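To show what fitting these candidates involves, here is an illustrative sketch using scipy.optimize.curve_fit. The synthetic data points, starting parameters, and the SSE comparison are my own choices, not the authors' procedure.

    import numpy as np
    from scipy.optimize import curve_fit

    def power(x, a, b, c):           # Wright (1936)
        return a + b * np.power(x, c)

    def exponential(x, a, b, c):     # Hull (1943)
        # exponent clipped so the optimizer cannot overflow while exploring
        return a + b * (1.0 - np.power(10.0, np.clip(c * x, -300.0, 300.0)))

    def log_reciprocal(x, a, b, c):  # Forsyth (2006)
        return a + b * np.log(x + 1.0) + c / (x + 1.0)

    # synthetic (training-set size, accuracy) points, for illustration only
    x = np.arange(50.0, 1001.0, 50.0)
    y = 0.85 - 0.6 / np.sqrt(x) + np.random.default_rng(0).normal(0.0, 0.01, x.size)

    for f, p0 in [(power, (0.9, -0.5, -0.5)),
                  (exponential, (0.4, 0.45, -0.01)),
                  (log_reciprocal, (0.5, 0.05, -1.0))]:
        params, _ = curve_fit(f, x, y, p0=p0, maxfev=20000)
        sse = float(np.sum((y - f(x, *params)) ** 2))
        print(f"{f.__name__}: a,b,c = {params}, SSE = {sse:.5f}")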

Page 25:

Discussion: some points

LDF with variable-selection = a dead end?

NMC gives respectable performance
– Other algorithms to be tried

Log-reciprocal formula best for self-prediction (so far)
– Beats power law (Management Science tradition)
– Beats exponential law (Learning Theory tradition)
– Goldilocks!?

Key point:
– Iterative expert coding till the automatic system takes over
  • Therefore the system must self-predict
– Hardly anyone in machine learning has tried this
  • Therefore our simple model is probably the best around

Page 26:

Discussion: what we plan to do next:

Try other algorithms
– E.g. k-nearest neighbour

Try other data sets
– E.g. rainbow coding (& romance smallads ;)

Try other self-prediction formulae
– Why not use genetic programming?!

Something else we’ve overlooked??

Page 27:

That’s all folks

Thank you for your attention

Page 28:

Extra slide!