Transcript of "A Neural Network Approach to Topic Spotting", presented by Loulwah AlSumait, INFS 795 Spec. Topics in Data Mining, 4.14.2005 (56 slides)

Page 1: A Neural Network Approach to Topic Spotting Presented by: Loulwah AlSumait INFS 795 Spec. Topics in Data Mining 4.14.2005.

A Neural Network Approach to Topic Spotting

Presented by: Loulwah AlSumait

INFS 795 Spec. Topics in Data Mining

4.14.2005

Page 2:

Article Information

Published in: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995

Authors: Wiener, E., Pedersen, J.O., Weigend, A.S.

54 citations

Page 3:

Summary

- Introduction
- Related Work
- The Corpus
- Representation
  - Term Selection
  - Latent Semantic Indexing: Generic LSI; Local LSI (Cluster-Directed LSI, Topic-Directed LSI); Relevancy Weighting LSI

Page 4:

Summary

- Neural Network Classifier
- Neural Networks for Topic Spotting: Linear vs. Non-Linear Networks; Flat vs. Modular Architecture
- Experiment Results: Evaluating Performance; Results & Discussion

Page 5:

Introduction

Topic Spotting = Text Categorization = Text Classification

Problem of identifying which of a set of predefined topics are present in a natural language document.

[Diagram: a document mapped onto topics 1, 2, ..., n]

Page 6:

Introduction

Classification approaches:
- Expert system approach: manually construct a system of inference rules on top of a large body of linguistic and domain knowledge
  - can be extremely accurate
  - very time consuming
  - brittle to changes in the data environment
- Data-driven approach: induce a set of rules from a corpus of labeled training documents
  - practically better

Page 7:

Introduction – Related Work

Major remarks regarding the related work:
- a separate classifier was constructed for each topic
- a different set of terms was used to train each classifier

Page 8:

Introduction – The Corpus

The Reuters-22173 corpus of Reuters newswire stories from 1987:
- 21,450 stories: 9,610 for training, 3,662 for testing
- mean length 90.6 words (SD 91.6)
- 92 topics appear at least once in the training set
- mean of 1.24 topics per document (up to 14 topics for some documents)
- 11,161 unique terms after preprocessing: inflectional stemming, stop-word removal, conversion to lower case, and elimination of words appearing in fewer than three documents
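The preprocessing pipeline above can be sketched in Python; this minimal version covers lowercasing, stop-word removal, and the minimum-document-frequency filter (the stop-word list and documents are illustrative, and inflectional stemming is omitted for brevity):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "to", "and"}  # tiny illustrative list

def preprocess(docs, min_df=3):
    """Lowercase, drop stop words, then drop terms seen in < min_df docs."""
    tokenized = [[w.lower() for w in d.split() if w.lower() not in STOP_WORDS]
                 for d in docs]
    df = Counter()                      # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))
    vocab = {t for t, c in df.items() if c >= min_df}
    return [[t for t in toks if t in vocab] for toks in tokenized]

docs = ["Wheat prices rose", "Wheat and barley fell", "Gold and wheat steady"]
print(preprocess(docs, min_df=3))   # only "wheat" survives the min_df filter
```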

Page 9:

Representations

Starting point: the Document Profile, a term-by-document matrix containing word-frequency entries; each document k is represented by a vector d_k whose entries are its word frequencies f_1, ..., f_P.

Page 10:

Representation

[Figure: example document vector with normalized word-frequency entries 3/33, 1/33, 1/33, 2/33. Example from Thorsten Joachims, 1997, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", http://citeseer.ist.psu.edu/joachims97text.html]

Page 11:

Representation - Term Selection

Goal: select the subset of the original terms that is most useful for the classification task.
It is difficult to select terms that discriminate between 92 classes while keeping the set small enough to serve as the feature set for a neural network, so:
- divide the problem into 92 independent classification tasks
- for each topic, search for the terms that best discriminate between documents with the topic and those without

Page 12:

Representation - Term Selection: Relevancy Score

Measures how unbalanced a term is across documents with and without the topic. Highly positive and highly negative scores indicate terms useful for discrimination. Using about 20 terms per topic yielded the best classification performance.

r_k^t = log( ((w_k^t + 1/6) / d_t) / ((w_k^¬t + 1/6) / d_¬t) )

where w_k^t is the number of documents with topic t that contain term k, d_t is the total number of documents with topic t, and ¬t denotes the documents without topic t; the 1/6 terms smooth zero counts.
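The relevancy score, as reconstructed above, can be computed directly; the counts in this example are made up for illustration:

```python
import math

def relevancy_score(w_t, d_t, w_not_t, d_not_t):
    """r_k^t for one (term, topic) pair.

    w_t:     docs with the topic that contain the term
    d_t:     total docs with the topic
    w_not_t: docs without the topic that contain the term
    d_not_t: total docs without the topic
    The 1/6 terms smooth zero counts, as reconstructed from the slide.
    """
    return math.log(((w_t + 1/6) / d_t) / ((w_not_t + 1/6) / d_not_t))

# A term in 40 of 50 on-topic docs but only 5 of 9,560 off-topic docs:
print(relevancy_score(40, 50, 5, 9560))    # highly positive: useful term
print(relevancy_score(1, 50, 900, 9560))   # negative: anti-correlated term
```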

Page 13:

Representation - Term Selection

Page 14:

Advantages:
- little computation is required
- the resulting features have direct interpretability
Drawbacks:
- many of the best individual predictors contain redundant information
- a term that appears to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa (e.g. "Apple" vs. "Apple Computers")

Result: Selected Term Representation (TERMS) with 20 features

Page 15:

Representation – LSI

Transform the original documents into a lower-dimensional space by analyzing the correlational structure of terms in the document collection:
- Training set: apply a singular-value decomposition (SVD) to the original term-by-document matrix to obtain U, Σ, V
- Test set: transform document vectors by projecting them into the LSI space

Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss.
Found: performance continues to improve up to at least 250 dimensions, but the improvement slows down rapidly after about 100 dimensions.

Result: Generic LSI Representation (LSI) with 200 features
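A minimal sketch of the training-time SVD and test-time projection with NumPy, using a small random matrix as a stand-in for the real 11,161-term-by-document matrix and 20 dimensions instead of the slides' 200:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((111, 300))   # stand-in for the term-by-document matrix

# Training set: SVD of the term-by-document matrix, A ~ U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 20                       # the slides keep 200 dimensions; 20 for the demo
U_k, s_k = U[:, :k], s[:k]

# Test set: fold a new document's term-frequency vector into the LSI space
d_new = rng.random(111)
d_lsi = np.diag(1 / s_k) @ U_k.T @ d_new   # k-dimensional representation
print(d_lsi.shape)
```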

Page 16:

Representation – LSI

[Diagram: SVD applied to the full Reuters corpus (topics such as Wool, Barley, Wheat, Money-supply, Zinc, Gold), yielding the Generic LSI Representation with 200 features]

Page 17:

Representation – Local LSI

Global LSI performs worse as topic frequency decreases: infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out by LSI as mere noise. The authors therefore propose two task-directed methods that make use of prior knowledge of the classification task.

Page 18:

Representation – Local LSI

What is Local LSI?
- model only the local portion of the corpus related to the topics of interest; the local set includes documents that use terminology related to those topics (they need not have any of the topics assigned)
- perform the SVD over only the local set of documents
- the representation is more sensitive to small, localized effects of infrequent terms
- the representation is more effective for classifying topics related to that local structure

Page 19:

Representation – Local LSI

Types of Local LSI: Cluster-Directed representation
- 5 meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals
- How to construct the local regions? Break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic
- Perform an SVD for each meta-topic region

Result: Cluster-Directed LSI Representation (CD/LSI) with 200 features
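Cluster-directed LSI amounts to one truncated SVD per meta-topic region; a sketch with NumPy, using random data and a single cluster label per document for brevity (in the corpus a document belongs to every meta-topic it carries):

```python
import numpy as np

def cluster_directed_lsi(term_doc, doc_cluster, k=5):
    """One truncated SVD per meta-topic cluster.

    term_doc:    terms x documents matrix
    doc_cluster: one cluster label per document (simplified; see lead-in)
    Returns {cluster: (U_k, s_k)}, a local LSI basis per cluster.
    """
    doc_cluster = np.asarray(doc_cluster)
    bases = {}
    for c in np.unique(doc_cluster):
        local = term_doc[:, doc_cluster == c]      # the local region only
        U, s, _ = np.linalg.svd(local, full_matrices=False)
        r = min(k, len(s))
        bases[int(c)] = (U[:, :r], s[:r])
    return bases

rng = np.random.default_rng(1)
X = rng.random((50, 40))
labels = rng.integers(0, 5, size=40)     # 5 meta-topics, as on the slide
bases = cluster_directed_lsi(X, labels, k=5)
print(sorted(bases))                     # one local basis per cluster
```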

Page 20:

Representation – Local LSI

[Diagram: a single SVD over the full Reuters corpus (Wool, Barley, Wheat, Money-supply, Zinc, Gold)]

Page 21:

Representation – Local LSI

[Diagram: the corpus is broken into five meta-topic regions (Government, Agriculture, Foreign Exchange, Metal, Energy) and a separate SVD is performed on each, yielding the Cluster-Directed LSI Representation (CD/LSI) with 200 features]

Page 22:

Representation – Local LSI

Types of Local LSI: Topic-Directed representation
- a more fine-grained approach to local LSI: a separate representation for each topic
- How to construct the local region?
  - use the 100 most predictive terms for the topic
  - pick the N most similar documents, where N = 5 x (number of documents containing the topic), bounded so that 110 <= N <= 350
  - final documents in the topic region = the N documents + 150 random documents

Result: Topic-Directed LSI Representation (TD/LSI) with 200 features

Page 23:

Representation – Local LSI

[Diagram: a single SVD over the full Reuters corpus (Wool, Barley, Wheat, Money-supply, Zinc, Gold)]

Page 24:

Representation – Local LSI

[Diagram: a separate local region is formed for each topic (Wool, Barley, Wheat, Money-supply, Zinc, Gold) and its own SVD is performed, yielding the Topic-Directed LSI Representation (TD/LSI) with 200 features]

Page 25:

Representation – Local LSI

Drawbacks of Local LSI:
- the narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics
- high computational overhead

Page 26:

Representation - Relevancy Weighting LSI

Use term weights to emphasize the importance of particular terms before applying the SVD.
IDF weighting raises the importance of low-frequency terms and lowers the importance of high-frequency terms; it assumes low-frequency terms are better discriminators than high-frequency terms.

Page 27:

Relevancy Weighting tunes the IDF assumption: emphasize terms in proportion to their estimated topic-discrimination power.

The Global Relevancy Weighting of term k (GRW_k) is derived from the term's per-topic relevancy scores:

r_k^t = log( ((w_k^t + 1/6) / d_t) / ((w_k^¬t + 1/6) / d_¬t) )

Final weighting of term k = IDF² x GRW_k
- all low-frequency terms are pulled up by IDF
- poor predictors are pushed back down, leaving only relevant low-frequency terms with high weights

Result: Relevancy Weighted LSI Representation (REL/LSI) with 200 features
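A sketch of the final term weighting; how the slide aggregates the per-topic scores r_k^t into GRW_k is not shown, so taking the maximum magnitude across topics is an assumption made here:

```python
def relevancy_weight(idf, scores):
    """Final term weight = IDF^2 * GRW_k (as reconstructed from the slide).

    scores: the term's per-topic relevancy scores r_k^t.  Aggregating them
    by maximum magnitude is an assumption; the slide does not show it.
    """
    grw = max(abs(r) for r in scores)
    return idf ** 2 * grw

# A rare, topic-specific term keeps a high weight; a rare noise term,
# pulled up by IDF alone, is pushed back down by its low relevancy scores.
print(relevancy_weight(idf=4.0, scores=[7.2, -0.3, 0.1]))
print(relevancy_weight(idf=4.0, scores=[0.2, -0.1, 0.05]))
```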

Page 28:

Neural Network Classifier (NN)

An NN consists of processing units (neurons) and weighted links connecting the neurons.

Page 29:

Neural Network Classifier (NN)

Major components of the NN model:
- architecture: defines the functional form relating input to output, i.e. network topology, unit connectivity, and activation functions (e.g. the logistic regression function)

Page 30:

Neural Network Classifier (NN) – Logistic Regression Function

p = 1 / (1 + e^(-z))

where z = w_0 + w_1·x_1 + ... + w_n·x_n is a linear combination of the input features.

p lies in (0, 1), so the unit can be converted into a binary classification method by thresholding the output probability.

[Figure: sigmoid activation function f(A)]
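A single logistic output unit, as described above, in a few lines of Python (the weights and inputs are illustrative):

```python
import math

def logistic_unit(weights, bias, x):
    """p = 1 / (1 + e^(-z)), with z = w0 + w1*x1 + ... + wn*xn."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

p = logistic_unit([0.8, -1.2], bias=0.1, x=[2.0, 0.5])
print(p)               # a probability in (0, 1)
print(int(p >= 0.5))   # thresholded into a binary topic decision
```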

Page 31:

Neural Network Classifier (NN)

Major components of the NN model (cont.):
- search algorithm: the search in weight space for a set of weights that minimizes the error between the actual and expected outputs (the TRAINING PROCESS)
  - backpropagation method
  - error functions: mean squared error, or the cross-entropy performance function
    C = - sum over all cases and outputs of ( d·log(y) + (1 - d)·log(1 - y) ), where d is the desired output and y the actual output
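The cross-entropy performance function from the slide, computed directly (the desired/actual vectors are illustrative):

```python
import math

def cross_entropy(desired, actual):
    """C = -sum over cases/outputs of (d*log(y) + (1 - d)*log(1 - y))."""
    return -sum(d * math.log(y) + (1 - d) * math.log(1 - y)
                for d, y in zip(desired, actual))

# Confident correct outputs cost little; confident mistakes cost a lot:
print(cross_entropy([1, 0], [0.9, 0.1]))
print(cross_entropy([1, 0], [0.1, 0.9]))
```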

Page 32:

NN for Topic Spotting

Network outputs are estimates of the probability of topic presence given the feature vector of a document.
- Generic LSI representation: each network uses the same representation
- Local LSI representation: a different representation for each network

Page 33:

NN for Topic Spotting

Linear NN: output units with logistic activation and no hidden layer.
[Diagram: input features connected directly to the output units]

Page 34:

NN for Topic Spotting

Non-Linear NN: simple networks with a single hidden layer of 6 to 15 logistic sigmoid units.

Page 35:

NN for Topic Spotting – Flat Architecture
- a separate network for each topic; the entire training set is used to train each topic's network
- overfitting is avoided by adding a penalty term to the cross-entropy cost function to encourage the elimination of small weights
- early stopping based on cross-validation

Page 36:

NN for Topic Spotting – Modular Architecture
- decompose the learning problem into smaller problems
- Meta-Topic Network: trained on the full training set; estimates the probability that each of the five meta-topics is present in a document; uses 15 hidden units

Page 37:

NN for Topic Spotting – Modular Architecture
- five groups of local topic networks: one local topic network for each topic within a meta-topic
- each network is trained only on its meta-topic region

Page 38:

NN for Topic Spotting – Modular Architecture
- five groups of local topic networks (cont.). Example: the wheat network is trained on the Agriculture meta-topic region, so it focuses on finer distinctions (e.g. wheat vs. grain) and does not waste time on easier distinctions (e.g. wheat vs. gold)
- each local topic network uses 6 hidden units

Page 39:

NN for Topic Spotting – Modular Architecture

To compute topic predictions for a given document:
- present the document to the meta-topic network
- present the document to each of the local topic networks
- multiply the meta-topic network's outputs by the corresponding topic networks' estimates to obtain the final topic estimates
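Reading the combination on the slide as multiplication (meta-topic probability times local-network output, the operator having been lost in transcription), the final step can be sketched as follows; topic names and probabilities are illustrative:

```python
def final_topic_estimates(meta_probs, local_probs):
    """Combine the meta-topic network with the local topic networks.

    meta_probs:  {meta_topic: P(meta_topic | doc)}
    local_probs: {meta_topic: {topic: local network output for doc}}
    """
    return {topic: meta_probs[meta] * p
            for meta, topics in local_probs.items()
            for topic, p in topics.items()}

meta = {"Agriculture": 0.9, "Metals": 0.1}
local = {"Agriculture": {"wheat": 0.8, "barley": 0.3},
         "Metals": {"gold": 0.7}}
print(final_topic_estimates(meta, local))
# "wheat" scores highest because both networks agree it is present
```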

Page 40:

Experimental Results

Evaluating Performance
- mean squared error between actual and predicted values is insufficient on its own
- instead, compute precision and recall from contingency tables constructed over a range of decision thresholds
- How are the decision thresholds obtained?

Page 41:

Experimental Results

Evaluating Performance – How to get the decision thresholds?

Proportional assignment (example for the topic "wool"):
- Predict Topic = 'wool' iff the output probability >= the output probability of the (k·p)-th highest-ranked document, where k is an integer and p is the prior probability of the "wool" topic
- Predict Topic ≠ 'wool' iff the output probability is below that threshold
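A sketch of proportional assignment; how the slide's k·p rank maps to a document count is not fully recoverable, so scaling the prior by the collection size is an assumption made here:

```python
def proportional_threshold(probs, k, prior):
    """Threshold = output probability of the (k * p)-th ranked document.

    probs: classifier outputs for every test document; k: integer; prior:
    the topic's prior probability p.  Scaling k * p by the number of
    documents to get a rank is an assumption (see lead-in).
    """
    ranked = sorted(probs, reverse=True)
    rank = max(1, round(k * prior * len(ranked)))
    return ranked[min(rank, len(ranked)) - 1]

probs = [0.95, 0.90, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02, 0.01]
thr = proportional_threshold(probs, k=2, prior=0.2)  # rank 2 * 0.2 * 10 = 4
print([p >= thr for p in probs])   # the top 4 documents get the topic
```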

Page 42:

Experimental Results

Evaluating Performance – How to get the decision thresholds?

Fixed recall level approach:
- determine a set of recall levels
- analyze the ranked documents to determine which decision thresholds lead to the desired recall levels
- Example for the topic "wool": predict Topic = 'wool' iff the output probability >= that of the document at which the number of higher-ranked documents yields the target recall level; predict Topic ≠ 'wool' otherwise
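The fixed-recall-level threshold can be found by walking down the ranked documents until the target recall is reached; the scores below are illustrative:

```python
def threshold_for_recall(scored_docs, target_recall):
    """Lowest-ranked score at which cumulative recall reaches the target.

    scored_docs: (output_probability, has_topic) pairs for the test set.
    """
    ranked = sorted(scored_docs, reverse=True)
    n_pos = sum(1 for _, pos in ranked if pos)
    hits = 0
    for prob, pos in ranked:
        hits += pos
        if hits / n_pos >= target_recall:
            return prob        # predict the topic iff probability >= this
    return ranked[-1][0]

docs = [(0.9, True), (0.8, False), (0.7, True), (0.4, True), (0.2, False)]
print(threshold_for_recall(docs, 0.66))   # 2 of 3 positives recalled here
```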

Page 43:

Experimental Results – Performance by Microaveraging
- add all contingency tables together across topics at a given threshold, then compute precision and recall
- used proportional assignment for picking decision thresholds
- does not weight the topics evenly
- used for comparisons to previously reported results
- the breakeven point is used as a summary value

Page 44:

Experimental Results

Performance by Macroaveraging
- compute precision and recall for each topic, then take the average across topics
- used a fixed set of recall levels
- summary values for particular topics are obtained by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95
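Micro- versus macroaveraging over per-topic contingency tables, with two illustrative topics (a frequent one and a rare one) to show why microaveraging does not weight topics evenly:

```python
def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Per-topic contingency counts (tp, fp, fn) at one decision threshold:
tables = {"wheat": (80, 20, 20),   # frequent topic
          "wool": (2, 0, 8)}       # rare topic

# Microaveraging: sum the tables first, so frequent topics dominate
tp, fp, fn = (sum(t[i] for t in tables.values()) for i in range(3))
micro = precision_recall(tp, fp, fn)

# Macroaveraging: per-topic precision/recall, then an unweighted average
per_topic = [precision_recall(*t) for t in tables.values()]
macro = tuple(sum(v) / len(per_topic) for v in zip(*per_topic))

print("micro:", micro)
print("macro:", macro)
```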

Page 45:

Experimental Results

Microaveraged performance breakeven points: 0.82, 0.801, 0.795, 0.775
Compared to the best previous algorithm, a rule-induction method based on heuristic search, with a breakeven point of 0.789.

Page 46:

Experimental Results

Macroaveraged performance: TERMS appears much closer to the other three representations. The relative effectiveness of the representations at low recall levels is reversed at high recall levels.

Page 47:

- Slight improvement from nonlinear networks
- LSI performance degrades relative to TERMS as topic frequency f_t decreases
- Across the six techniques on the 54 most frequent topics: considerable variation in performance from topic to topic, with the relative ups and downs mirrored in both plots

Page 48:

Experimental Results – Performance of Combinations of Techniques and Their Improvement

Document representations: TERMS, LSI, CD-LSI, TD-LSI, REL-LSI, Hybrid (CD-LSI + TERMS)
NN architectures: Flat (Linear, Non-Linear) and Modular (Linear, Non-Linear), with the Modular meta-topic network trained using the LSI representation
Each experiment pairs one representation with one architecture.

Page 49:

Experimental Results

Flat Networks

Page 50:

Experimental Results

Modular Networks: only 4 of the clusters were used, and average precision was recomputed for the flat networks accordingly.

Page 51:

Nonlinear networks seem to perform better than the linear models, but the difference is very slight.

Page 52:

The LSI representation is able to equal or exceed TERMS performance for high-frequency topics, but performs poorly for low-frequency ones.

Page 53:

Task-directed LSI representations improve performance in the low-frequency domain. The trade-off for TD/LSI is computational cost; the trade-off for REL/LSI is lower performance on medium- and high-frequency topics.

Page 54:

Modular CD/LSI improves performance further for low-frequency topics, because the individual networks are trained only in the domain over which the LSI was performed.

Page 55:

TERMS proves competitive with the more sophisticated LSI techniques: most topics are predictable from a small set of terms.

Page 56:

Discussion
- Rich solution: many representations and many models
- A totally supervised approach
- Results are lower than expected: is the dataset responsible?
- High computational overhead
- Does the NN deserve a place in DM toolboxes?

Questions?