Transcript of slides: Classification of Short Text Sequences (DReSS/CRGS06.pdf)
TextCat / Lancaster / October 2006
Classification of Short Text Sequences
Richard Forsyth, Shaaron Ainsworth, David Clarke,
Claire O’Malley & Pat Brundell
School of Psychology, 0115-951 [email protected]
Outline
Background
Introduction
– Situation
– Aims
Method
– Algorithms
– Datasets
Results
– Pilot study: linear classifier
– Naïve Markov Classifier
Discussion
Background
Relating linguistic to extra-linguistic information
Advances in the Learning Sciences
Exploiting digital records (legacy data) of verbalizations
e-Social Science
– text-mining as a means of quantifying qualitative information
Computational coding
– social scientists spend huge amounts of time hand-coding verbal data
Intro: disciplinary differences in approach
Several approaches to text categorization
Linguistics:
– Tagging (semantic / PoS)
– Mostly at word level
Computing:
– Classifying (authorship, content)
– Mostly at document level
Social Sciences:
– Categorical coding
– Mostly at “segment” level (phrase, utterance)
Intro: categorical coding of verbal data
Researchers in the Social Sciences spend a great deal of effort & time on
– Recording
– Transcribing
– Segmenting
– Categorical coding
of verbal data.
We’re looking at item four (mainly).
– Plethora of categorical coding schemes
Intro: talking and learning
Talking to yourself can help:
– e.g. self-explaining
Talking with others can help:
– e.g. collaborative learning, argumentation-based learning, Socratic dialogue
But not all talk is good... so learning scientists need to analyse what people say when they talk.
– This is both time-consuming and can be unreliable.
– Hence our interest in whether text mining can help: a) speed things up, b) increase reliability, and c) help us deal with the increasing numbers of digital records of learning.
Intro: 3 main categories
Textual material: “The septum divides the heart lengthwise into two sides”
3 codes:
– Paraphrase
  • “The septum is what goes down the middle of the heart”
– Self-explanation
  • “Septum is what separates the two ... some sort of control”
– Monitoring statement
  • “I'm not sure why”
Intro: aims & context
Objective:
– (semi-)automatic classifier to assist categorical coding
Notes:
– Very few standard “tag sets”; most coding schemes are novel
– Therefore a trainable classifier is essential
– Important to economize human effort in the loop:
  • Code another block of text segments (by expert)
  • Test learning system on cases so far (cross-validated)
  • Decide whether to continue:
    – Stop: accuracy good enough
    – Abandon: accuracy will never be good enough
    – Code more cases: accuracy level will be OK with reasonable effort
Method: algorithms
Pilot study: linear discriminant function
– With vocabulary & tag variables
Latest study: Naïve Markov Classifier
– Khmelev & Tweedie (2001)
– Peng et al. (2003)
– N-gram Markovian model at byte or word level
– Naïve Bayesian inference as “gluing” mechanism
– (“m-estimate” for attenuating probabilities)
Method: naïve Markov classifier
Why I like this algorithm:
– It’s fast & not very memory-hungry
– No pre-processing phase
– No lexicons or external support software needed
– No variable-selection phase
  (therefore less danger of overfitting)
– Just uses all the data of a given type
– Has a Bayesian underpinning
– Can work in almost any language
  (e.g. could handle DNA sequences)
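The slide’s description can be sketched in Python. This is a minimal, illustrative implementation (not the authors’ actual code): a character-level n-gram Markov model per category, with per-transition log-probabilities summed naive-Bayes-style, and an m-estimate to attenuate the probability of unseen transitions. The uniform 1/256 prior assumes byte-level mode.

```python
from collections import defaultdict
import math

class NaiveMarkovClassifier:
    """Sketch of a naive Markov classifier: one character n-gram
    Markov model per category, glued together with naive-Bayesian
    (log-probability) inference and m-estimate smoothing."""

    def __init__(self, n=2, m=1.0):
        self.n = n                 # context length (n-gram order)
        self.m = m                 # m-estimate smoothing weight
        self.counts = {}           # label -> {(context, symbol): count}
        self.context_totals = {}   # label -> {context: count}

    def train(self, text, label):
        pairs = self.counts.setdefault(label, defaultdict(int))
        totals = self.context_totals.setdefault(label, defaultdict(int))
        for i in range(len(text) - self.n):
            ctx, sym = text[i:i + self.n], text[i + self.n]
            pairs[(ctx, sym)] += 1
            totals[ctx] += 1

    def _logprob(self, text, label):
        pairs = self.counts[label]
        totals = self.context_totals[label]
        prior = 1.0 / 256          # uniform prior over byte values
        lp = 0.0
        for i in range(len(text) - self.n):
            ctx, sym = text[i:i + self.n], text[i + self.n]
            # m-estimate: (count + m * prior) / (context_total + m)
            p = (pairs.get((ctx, sym), 0) + self.m * prior) / \
                (totals.get(ctx, 0) + self.m)
            lp += math.log(p)
        return lp

    def classify(self, text):
        return max(self.counts, key=lambda lab: self._logprob(text, lab))
```

Note the properties claimed on the slide: no lexicon, no pre-processing, no variable selection; every transition in the training data is used as-is.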
Method: operational framework
Obtain test-set size (N2) from user
For a user-specified number of repetitions:
    Set initial training-set size to N1 [typically zero]
    While training-set size plus test-set size (N1+N2) <= data-set size:
        Pick a random subsample of size N1 as training set
        Pick a disjoint random subsample of size N2 as testing set
        Train classifier on training data
        Test classifier on testing data
        Record results
        Increment N1 by user-specified increment (N3)
Append results to file for subsequent analyses
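As a sketch, the framework above maps onto a short loop. `train_fn` and `test_fn` are hypothetical stand-ins for whichever classifier is plugged in; sampling N1+N2 items at once and splitting guarantees the train and test subsamples are disjoint.

```python
import random

def learning_curve(data, train_fn, test_fn, n2, n3, repetitions=10, n1_start=0):
    """Grow the training set by n3 on each pass and record test
    performance on a disjoint random test set of size n2, repeated
    `repetitions` times.  Returns (repetition, n1, score) triples."""
    results = []
    for rep in range(repetitions):
        n1 = n1_start
        while n1 + n2 <= len(data):
            # One draw of n1 + n2 items, split into disjoint subsets
            sample = random.sample(data, n1 + n2)
            train_set, test_set = sample[:n1], sample[n1:]
            model = train_fn(train_set)
            score = test_fn(model, test_set)
            results.append((rep, n1, score))
            n1 += n3
    return results
```

Collecting a score at every (repetition, N1) pair is what yields the “learning curves” shown in the Results slides.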
Method: dataset details (N.B. legacy data)
Both learning tasks on the same topic: the cardiovascular system. (Plenty of data “cleansing” & reorganizing was done.)

Dataset:       AB (Ainsworth & Burcham, 2004)     AR (Robertson, 2004)
Participants:  24 (13 female, 11 male)            24 (13 female, 11 male)
Words:         44,388                             23,330
Segments:      2071                               1784
Material:      text                               diagrams
Conditions:    high versus low coherence          abstract versus realistic
Assessment:    3 pre-test & 5 post-test measures  3 pre-test & 5 post-test measures
Method: samples of high & low coherence text
Maximal: “this oxygenated blood is then pumped through the bicuspid valve (the a-v valve on the left side of the heart) from the atria to the left ventricle”
Minimal: “it is pumped through the bicuspid valve into the left ventricle”
Original: “the oxygenated blood is then pumped through the bicuspid valve into the left ventricle”
Method: abstract and concrete diagrams
[Diagrams of the heart, labelled: direction of blood flow; upper chamber (atrium); lower chamber (ventricle); left side / right side; septum; to lungs; to rest of body]
Pilot study: specimen of WMatrix tagging output (AR27; 35, 36)
CC And Z5 PPH1 it Z8 VBZ 's Z5 A3+ VVG showing A10+ S1.1.1 RR obviously Z4 A11.2+ AT the Z5 JJ main A11.1+ N5+++ NN1 pump O2 B5 VBZ is A3+ Z5 AT the Z5 NN1 heart B1 M6 A11.1+ E1 X5.2+ , , PPH1 it Z8 VBZ 's A3+ Z5 RG very A13.3 JJ big N3.2+ N5+ A11.1+ S1.2.5+ X5.2+ CC and Z5 DD1 that Z8 Z5 VM will T1.1.3 VVI pump M2 A1.1.1 A9- Q2.2 S8+ B3
Legend: each token appears as its syntactic code, the word itself, then its semantic codes in decreasing order of predicted likelihood.
Pilot study: utterance classification: LDF 3-group plot, AR data
[Scatter plot of canonical discriminant functions: Function 1 (0 to 15) versus Function 2 (-4 to 4), points coded by segcode3 (1 = Paraphrase, 2 = Self-Explanation, 3 = Monitoring), with group centroids marked]
Pilot study: utterance classification: LDF 3-by-3, AR data (cross-validated)
[Bar chart of predicted versus true category counts (0 to 1,000) for Paraphrase, Self-Explanation and Monitoring]
Pilot study: classification variables used (AR data)
8 variables maximum (counts); picked: 6 syntactic, 2 semantic

Step  Function 1  Function 2  WMatrix category (* = semantic)  Four commonest tokens coded thus in AR data texts
1       2.043       .291      *X2: mental acts & processes     know, suppose, think
2       2.582      -.129      XX: negation                     n't, not
3       -.139       .552      VV: lexical verb                 goes, got, go, get
4       -.065     -1.038      *H2: parts of buildings          atrium, portal, chambers, atriums
5       -.164       .587      CS: subordinating conjunction    that, so, because, when
6       -.209       .112      NN: noun                         blood, heart, body, lungs
7        .103      -.367      JJ: adjective                    left, right, pulmonary, other
8       -.699       .777      VM: modal verb                   can, will, could, ca [n't]
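For illustration, the table’s coefficients can be applied directly: a segment’s two discriminant scores are just weighted sums of its category counts. The constant (intercept) terms are not shown on the slide, so the scores below are relative only, and the example counts are made up.

```python
# Coefficients transcribed from the table above.
COEFFS = {
    "X2": (2.043,  0.291),   # mental acts & processes (semantic)
    "XX": (2.582, -0.129),   # negation
    "VV": (-0.139, 0.552),   # lexical verb
    "H2": (-0.065, -1.038),  # parts of buildings (semantic)
    "CS": (-0.164, 0.587),   # subordinating conjunction
    "NN": (-0.209, 0.112),   # noun
    "JJ": (0.103, -0.367),   # adjective
    "VM": (-0.699, 0.777),   # modal verb
}

def discriminant_scores(counts):
    """counts: dict mapping WMatrix category -> token count in a segment.
    Returns the (Function 1, Function 2) weighted sums, intercepts omitted."""
    f1 = sum(COEFFS[c][0] * n for c, n in counts.items() if c in COEFFS)
    f2 = sum(COEFFS[c][1] * n for c, n in counts.items() if c in COEFFS)
    return f1, f2
```

A segment like “I'm not sure why” would score one negation (XX) and one mental-process token (X2), pushing it far along Function 1, consistent with the plot’s separation of Monitoring from Paraphrase.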
Results: NMC
Factors investigated
– Mode: byte or word as basic unit
– N-gram size: 0 .. 3
Results: sample “learning curve” 1
Results: sample “learning curve” 2
Results: mode & gram-size
Results: this is the only slide you need to remember!
Fitting the “learning curve” / “experience curve”
3 formulae tried: power, exponential, log-reciprocal
– Power: Y = a + b * x ^ c
  Wright (1936), management science; e.g. cost per unit declines as production continues
– Exponential: Y = a + b * (1 – 10 ^ (c * x))
  Hull (1943), psychology; e.g. time for a rat to find food decreases with repeated trials
– Log-reciprocal: Y = a + b * ln(x+1) + c / (x+1)
  Forsyth (2006), machine learning (ad hoc curve-fitting?); e.g. error rate goes down as the size of the training data goes up
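One practical attraction of the log-reciprocal form: unlike the power and exponential forms, it is linear in a, b and c, so it can be fitted by ordinary least squares with no iterative optimisation. A self-contained sketch (illustrative, not the authors’ code), solving the 3x3 normal equations directly:

```python
import math

def fit_log_reciprocal(xs, ys):
    """Fit y = a + b*ln(x+1) + c/(x+1) by ordinary least squares.
    The model is linear in (a, b, c), so the normal equations
    (X^T X) p = X^T y can be solved directly."""
    rows = [(1.0, math.log(x + 1), 1.0 / (x + 1)) for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, 3):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, 3):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back-substitution
    p = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        p[r] = (xty[r] - sum(xtx[r][c] * p[c] for c in range(r + 1, 3))) / xtx[r][r]
    return tuple(p)  # (a, b, c)
```

With error rate as y and training-set size as x, a fitted curve of this shape is what lets the system “self-predict” whether more coding effort will pay off.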
Discussion: some points
LDF with variable selection = a dead end?
NMC gives respectable performance
– Other algorithms to be tried
Log-reciprocal formula best for self-prediction (so far)
– Beats power law (management-science tradition)
– Beats exponential law (learning-theory tradition)
– Goldilocks!?
Key point:
– Iterative expert coding until the automatic system takes over
  • Therefore the system must self-predict
– Hardly anyone in machine learning has tried this
  • Therefore our simple model is probably the best around
Discussion: what we plan to do next
Try other algorithms
– e.g. k-nearest neighbour
Try other data sets
– e.g. rainbow coding (& romance small-ads ;)
Try other self-prediction formulae
– Why not use genetic programming?!
Something else we’ve overlooked??
That’s all folks
Thank you for your attention