Transcript of slides: Classification of Short Text Sequences (DReSS/CRGS06.pdf)
TextCat / Lancaster / October 2006
Classification of Short Text Sequences
Richard Forsyth, Shaaron Ainsworth, David Clarke,
Claire O’Malley & Pat Brundell
School of Psychology, 0115-951 [email protected]
Outline
Background
Introduction
– Situation
– Aims
Method
– Algorithms
– Datasets
Results
– Pilot study: linear classifier
– Naïve Markov Classifier
Discussion
Background
Relating linguistic to extra-linguistic information
Advances in the Learning Sciences
Exploiting digital records (legacy data) of verbalizations
e-Social Science
– text-mining as a means of quantifying qualitative information
Computational coding
– social scientists spend huge amounts of time hand-coding verbal data
Intro: disciplinary differences in approach
Several approaches to text categorization
Linguistics:
– Tagging (semantic / PoS)
– Mostly at word level
Computing:
– Classifying (authorship, content)
– Mostly at document level
Social Sciences:
– Categorical coding
– Mostly at “segment” level (phrase, utterance)
Intro: categorical coding of verbal data
Researchers in the Social Sciences spend a great deal of effort & time on
– Recording
– Transcribing
– Segmenting
– Categorical coding
of verbal data.
We’re looking at item four (mainly).
– Plethora of categorical coding schemes
Intro: talking and learning
Talking to yourself can help:
– e.g. self-explaining
Talking with others can help:
– e.g. collaborative learning, argumentation-based learning, Socratic dialogue
But not all talk is good... so learning scientists need to analyse what people say when they talk.
– This is both time-consuming and can be unreliable.
– Hence our interest in whether text mining can help: a) speed things up, b) increase reliability, and c) help us deal with the increasing numbers of digital records of learning.
Intro: 3 main categories
Textual material: “The septum divides the heart lengthwise into two sides”
3 codes:
– Paraphrase
  • “The septum is what goes down the middle of the heart”
– Self-explanation
  • “Septum is what separates the two ... some sort of control”
– Monitoring statement
  • “I'm not sure why”
Intro: aims & context
Objective:
– (semi-)automatic classifier to assist categorical coding
Notes:
– Very few standard “tag sets”; most coding schemes are novel
– Therefore a trainable classifier is essential
– Important to economize human effort in the loop:
  • Code another block of text segments (by expert)
  • Test learning system on cases so far (cross-validated)
  • Decide whether to continue:
    – Stop: accuracy good enough
    – Abandon: accuracy will never be good enough
    – Code more cases: accuracy level will be OK with reasonable effort
Method: algorithms
Pilot study: linear discriminant function
– With vocabulary & tag variables
Latest study: Naïve Markov Classifier
– Khmelev & Tweedie (2001)
– Peng et al. (2003)
– N-gram Markovian model at byte or word level
– Naïve Bayesian inference as “gluing” mechanism
– (“m-estimate” for attenuating probabilities)
Method: naïve Markov classifier
Why I like this algorithm:
– It’s fast & not very memory-hungry
– No pre-processing phase
– No lexicons or external support software needed
– No variable-selection phase
  (therefore less danger of overfitting)
– Just uses all the data of a given type
– Has a Bayesian underpinning
– Can work in almost any language
  (e.g. could handle DNA sequences)
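The slide’s description can be sketched in Python. This is a minimal, illustrative implementation (not the authors’ actual code): a character-level n-gram Markov model per category, with per-transition log-probabilities summed naive-Bayes-style, and an m-estimate to attenuate the probability of unseen transitions. The uniform 1/256 prior assumes byte-level mode.

```python
from collections import defaultdict
import math

class NaiveMarkovClassifier:
    """Sketch of a naive Markov classifier: one character n-gram
    Markov model per category, glued together with naive-Bayesian
    (log-probability) inference and m-estimate smoothing."""

    def __init__(self, n=2, m=1.0):
        self.n = n                 # context length (n-gram order)
        self.m = m                 # m-estimate smoothing weight
        self.counts = {}           # label -> {(context, symbol): count}
        self.context_totals = {}   # label -> {context: count}

    def train(self, text, label):
        pairs = self.counts.setdefault(label, defaultdict(int))
        totals = self.context_totals.setdefault(label, defaultdict(int))
        for i in range(len(text) - self.n):
            ctx, sym = text[i:i + self.n], text[i + self.n]
            pairs[(ctx, sym)] += 1
            totals[ctx] += 1

    def _logprob(self, text, label):
        pairs = self.counts[label]
        totals = self.context_totals[label]
        prior = 1.0 / 256          # uniform prior over byte values
        lp = 0.0
        for i in range(len(text) - self.n):
            ctx, sym = text[i:i + self.n], text[i + self.n]
            # m-estimate: (count + m * prior) / (context_total + m)
            p = (pairs.get((ctx, sym), 0) + self.m * prior) / \
                (totals.get(ctx, 0) + self.m)
            lp += math.log(p)
        return lp

    def classify(self, text):
        return max(self.counts, key=lambda lab: self._logprob(text, lab))
```

Note the properties claimed on the slide: no lexicon, no pre-processing, no variable selection; every transition in the training data is used as-is.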
Method: operational framework
Obtain test-set size (N2) from user
For a user-specified number of repetitions:
    Set initial training-set size to N1 [typically zero]
    While training-set size plus test-set size (N1+N2) <= data-set size:
        Pick a random subsample of size N1 as training set
        Pick a disjoint random subsample of size N2 as testing set
        Train classifier on training data
        Test classifier on testing data
        Record results
        Increment N1 by user-specified increment (N3)
Append results to file for subsequent analyses
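As a sketch, the framework above maps onto a short loop. `train_fn` and `test_fn` are hypothetical stand-ins for whichever classifier is plugged in; sampling N1+N2 items at once and splitting guarantees the train and test subsamples are disjoint.

```python
import random

def learning_curve(data, train_fn, test_fn, n2, n3, repetitions=10, n1_start=0):
    """Grow the training set by n3 on each pass and record test
    performance on a disjoint random test set of size n2, repeated
    `repetitions` times.  Returns (repetition, n1, score) triples."""
    results = []
    for rep in range(repetitions):
        n1 = n1_start
        while n1 + n2 <= len(data):
            # One draw of n1 + n2 items, split into disjoint subsets
            sample = random.sample(data, n1 + n2)
            train_set, test_set = sample[:n1], sample[n1:]
            model = train_fn(train_set)
            score = test_fn(model, test_set)
            results.append((rep, n1, score))
            n1 += n3
    return results
```

Collecting a score at every (repetition, N1) pair is what yields the “learning curves” shown in the Results slides.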
Method: dataset details (N.B. legacy data)
Both learning tasks on the same topic: the cardiovascular system. (Plenty of data “cleansing” & reorganizing was done.)

Dataset:       AB (Ainsworth & Burcham, 2004)     AR (Robertson, 2004)
Participants:  24 (13 female, 11 male)            24 (13 female, 11 male)
Words:         44,388                             23,330
Segments:      2071                               1784
Material:      text                               diagrams
Conditions:    high versus low coherence          abstract versus realistic
Assessment:    3 pre-test & 5 post-test measures  3 pre-test & 5 post-test measures
Method: samples of high & low coherence text
Maximal: “this oxygenated blood is then pumped through the bicuspid valve (the a-v valve on the left side of the heart) from the atria to the left ventricle”
Minimal: “it is pumped through the bicuspid valve into the left ventricle”
Original: “the oxygenated blood is then pumped through the bicuspid valve into the left ventricle”
Method: abstract and concrete diagrams
[Diagrams of the heart, labelled: direction of blood flow; upper chamber (atrium); lower chamber (ventricle); left side / right side; septum; to lungs; to rest of body]
Pilot study: specimen of WMatrix tagging output (AR27; 35, 36)
CC And Z5 PPH1 it Z8 VBZ 's Z5 A3+ VVG showing A10+ S1.1.1 RR obviously Z4 A11.2+ AT the Z5 JJ main A11.1+ N5+++ NN1 pump O2 B5 VBZ is A3+ Z5 AT the Z5 NN1 heart B1 M6 A11.1+ E1 X5.2+ , , PPH1 it Z8 VBZ 's A3+ Z5 RG very A13.3 JJ big N3.2+ N5+ A11.1+ S1.2.5+ X5.2+ CC and Z5 DD1 that Z8 Z5 VM will T1.1.3 VVI pump M2 A1.1.1 A9- Q2.2 S8+ B3
Legend: each token appears as its syntactic code, the word itself, then its semantic codes in decreasing order of predicted likelihood.
Pilot study: utterance classification: LDF 3-group plot, AR data
[Scatter plot of canonical discriminant functions: Function 1 (0 to 15) versus Function 2 (-4 to 4), points coded by segcode3 (1 = Paraphrase, 2 = Self-Explanation, 3 = Monitoring), with group centroids marked]
Pilot study: utterance classification: LDF 3-by-3, AR data (cross-validated)
[Bar chart of predicted versus true category counts (0 to 1,000) for Paraphrase, Self-Explanation and Monitoring]
Pilot study: classification variables used (AR data)
8 variables maximum (counts); picked: 6 syntactic, 2 semantic

Step  Function 1  Function 2  WMatrix category (* = semantic)  Four commonest tokens coded thus in AR data texts
1       2.043       .291      *X2: mental acts & processes     know, suppose, think
2       2.582      -.129      XX: negation                     n't, not
3       -.139       .552      VV: lexical verb                 goes, got, go, get
4       -.065     -1.038      *H2: parts of buildings          atrium, portal, chambers, atriums
5       -.164       .587      CS: subordinating conjunction    that, so, because, when
6       -.209       .112      NN: noun                         blood, heart, body, lungs
7        .103      -.367      JJ: adjective                    left, right, pulmonary, other
8       -.699       .777      VM: modal verb                   can, will, could, ca [n't]
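For illustration, the table’s coefficients can be applied directly: a segment’s two discriminant scores are just weighted sums of its category counts. The constant (intercept) terms are not shown on the slide, so the scores below are relative only, and the example counts are made up.

```python
# Coefficients transcribed from the table above.
COEFFS = {
    "X2": (2.043,  0.291),   # mental acts & processes (semantic)
    "XX": (2.582, -0.129),   # negation
    "VV": (-0.139, 0.552),   # lexical verb
    "H2": (-0.065, -1.038),  # parts of buildings (semantic)
    "CS": (-0.164, 0.587),   # subordinating conjunction
    "NN": (-0.209, 0.112),   # noun
    "JJ": (0.103, -0.367),   # adjective
    "VM": (-0.699, 0.777),   # modal verb
}

def discriminant_scores(counts):
    """counts: dict mapping WMatrix category -> token count in a segment.
    Returns the (Function 1, Function 2) weighted sums, intercepts omitted."""
    f1 = sum(COEFFS[c][0] * n for c, n in counts.items() if c in COEFFS)
    f2 = sum(COEFFS[c][1] * n for c, n in counts.items() if c in COEFFS)
    return f1, f2
```

A segment like “I'm not sure why” would score one negation (XX) and one mental-process token (X2), pushing it far along Function 1, consistent with the plot’s separation of Monitoring from Paraphrase.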
Results: NMC
Factors investigated
– Mode: byte or word as basic unit
– N-gram size: 0 .. 3
Results: sample “learning curve” 1
Results: sample “learning curve” 2
Results: mode & gram-size
Results: this is the only slide you need to remember!
Fitting the “learning curve” / “experience curve”
3 formulae tried: power, exponential, log-reciprocal
– Power: Y = a + b * x ^ c
  Wright (1936), management science; e.g. cost per unit declines as production continues
– Exponential: Y = a + b * (1 – 10 ^ (c * x))
  Hull (1943), psychology; e.g. time for a rat to find food decreases with repeated trials
– Log-reciprocal: Y = a + b * ln(x+1) + c / (x+1)
  Forsyth (2006), machine learning (ad hoc curve-fitting?); e.g. error rate goes down as the size of the training data goes up
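One practical attraction of the log-reciprocal form: unlike the power and exponential forms, it is linear in a, b and c, so it can be fitted by ordinary least squares with no iterative optimisation. A self-contained sketch (illustrative, not the authors’ code), solving the 3x3 normal equations directly:

```python
import math

def fit_log_reciprocal(xs, ys):
    """Fit y = a + b*ln(x+1) + c/(x+1) by ordinary least squares.
    The model is linear in (a, b, c), so the normal equations
    (X^T X) p = X^T y can be solved directly."""
    rows = [(1.0, math.log(x + 1), 1.0 / (x + 1)) for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, 3):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, 3):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back-substitution
    p = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        p[r] = (xty[r] - sum(xtx[r][c] * p[c] for c in range(r + 1, 3))) / xtx[r][r]
    return tuple(p)  # (a, b, c)
```

With error rate as y and training-set size as x, a fitted curve of this shape is what lets the system “self-predict” whether more coding effort will pay off.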
Discussion: some points
LDF with variable selection = a dead end?
NMC gives respectable performance
– Other algorithms to be tried
Log-reciprocal formula best for self-prediction (so far)
– Beats power law (management-science tradition)
– Beats exponential law (learning-theory tradition)
– Goldilocks!?
Key point:
– Iterative expert coding until the automatic system takes over
  • Therefore the system must self-predict
– Hardly anyone in machine learning has tried this
  • Therefore our simple model is probably the best around
Discussion: what we plan to do next
Try other algorithms
– e.g. k-nearest neighbour
Try other data sets
– e.g. rainbow coding (& romance small-ads ;)
Try other self-prediction formulae
– Why not use genetic programming?!
Something else we’ve overlooked??
That’s all folks
Thank you for your attention