A Neural Network Approach to Topic Spotting
Presented by: Loulwah AlSumait
INFS 795 Spec. Topics in Data Mining
4.14.2005
Article Information
Published in: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995
Authors: Wiener, E., Pedersen, J.O., Weigend, A.S.
54 citations
Summary
- Introduction
- Related Work
- The Corpus
- Representation
  - Term Selection
  - Latent Semantic Indexing
    - Generic LSI
    - Local LSI: Cluster-Directed LSI, Topic-Directed LSI
    - Relevancy Weighting LSI
  - Summary
- Neural Network Classifier
  - Neural Networks for Topic Spotting
  - Linear vs. Non-Linear Networks
  - Flat Architecture vs. Modular Architecture
- Experiment Results
  - Evaluating Performance
  - Results & Discussion
Introduction
Topic Spotting = Text Categorization = Text Classification
The problem of identifying which of a set of predefined topics are present in a natural-language document.
[Diagram: a document is mapped to one or more of Topic 1, Topic 2, ..., Topic n]
Introduction
Classification Approaches
Expert-system approach:
- manually construct a system of inference rules on top of a large body of linguistic and domain knowledge
- can be extremely accurate, but is very time consuming and brittle to changes in the data environment
Data-driven approach:
- induce a set of rules from a corpus of labeled training documents
- practically better
Introduction – Related Work
The major remarks regarding the related work:
- A separate classifier was constructed for each topic.
- A different set of terms was used to train each classifier.
Introduction – The Corpus
Reuters 22173 corpus of Reuters newswire stories from 1987:
- 21,450 stories: 9,610 for training, 3,662 for testing
- mean length 90.6 words, SD 91.6
- 92 topics appeared at least once in the training set
- mean of 1.24 topics/doc (up to 14 topics for some documents)
- 11,161 unique terms after preprocessing: inflectional stemming, stop-word removal, conversion to lower case, and elimination of words appearing in fewer than three documents
Representations
Starting point: the Document Profile, a term-by-document matrix containing word-frequency entries; each document is represented by a vector d_i of term frequencies.
[Example: a document vector with normalized term frequencies 3/33, 1/33, 1/33, 2/33, ... from Thorsten Joachims, 1997, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. http://citeseer.ist.psu.edu/joachims97text.html]
Representation - Term Selection
Select the subset of the original terms that are most useful for the classification task.
It is difficult to select terms that discriminate between 92 classes while remaining small enough to serve as the feature set for a neural network, so:
- divide the problem into 92 independent classification tasks
- search for the best discriminator terms between documents with the topic and those without
Representation - Term Selection: Relevancy Score
- measures how unbalanced the term is across documents with or without the topic
- highly positive and highly negative scores indicate useful terms for discrimination
- using about 20 terms yielded the best classification performance
r_kt = log( (w_kt / d_t + 1/6) / (w_kt̄ / d_t̄ + 1/6) )

where w_kt is the number of documents with topic t that contain term k, d_t is the total number of documents with topic t, and w_kt̄, d_t̄ are the same counts over documents without topic t.
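A minimal Python sketch of the relevancy score as reconstructed above; the document counts are toy values, not from the Reuters corpus.

```python
import math

def relevancy_score(w_kt, d_t, w_knt, d_nt):
    """Relevancy score r_kt of term k for topic t.

    w_kt : number of documents with topic t that contain term k
    d_t  : total number of documents with topic t
    w_knt, d_nt : the same counts over documents without topic t
    The 1/6 terms keep the ratio finite when a count is zero.
    """
    return math.log((w_kt / d_t + 1 / 6) / (w_knt / d_nt + 1 / 6))

# A term concentrated in the topic documents gets a highly positive score:
high = relevancy_score(w_kt=9, d_t=10, w_knt=5, d_nt=100)
# A term spread evenly across both sets scores zero:
flat = relevancy_score(w_kt=5, d_t=10, w_knt=50, d_nt=100)
```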
Representation - Term Selection
Advantages:
- little computation is required
- the resulting features have direct interpretability
Drawbacks:
- many of the best individual predictors contain redundant information
- a term that appears to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa (e.g. Apple vs. Apple Computers)
Selected Term Representation (TERMS) with 20 features
Representation - Term Selection
[Diagram: the Selected Term Representation (TERMS)]
Representation – LSI
Transform the original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection:
- (training set) apply a singular-value decomposition (SVD) to the original term-by-document matrix to get U, Σ, V
- (test set) transform document vectors by projecting them into the LSI space
Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss. Found: performance continues to improve up to at least 250 dimensions, but the improvement rapidly slows down after about 100 dimensions.
Generic LSI Representation (LSI) with 200 features
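The two LSI steps (SVD on the training matrix, then projection of new documents) can be sketched with NumPy; the matrix and dimensionality here are illustrative, not the 200-dimension Reuters setup.

```python
import numpy as np

# Toy term-by-document matrix: rows = terms, columns = training documents.
A = np.array([
    [3., 0., 1., 0.],
    [1., 2., 0., 0.],
    [0., 1., 2., 3.],
    [0., 0., 1., 2.],
])

# Training set: singular-value decomposition gives U, S, Vt.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # keep only the top-k LSI dimensions
U_k, S_k = U[:, :k], S[:k]

# Test set: project a new document vector into the k-dimensional LSI space.
d_new = np.array([2., 1., 0., 0.])
d_lsi = (d_new @ U_k) / S_k
```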
Representation – LSI
[Diagram: a single SVD applied to the entire Reuters corpus (topics such as Wool, Barley, Wheat, Money-supply, Zinc, Gold) yields the Generic LSI Representation with 200 features]
Representation – Local LSI
Global LSI performs worse as topic frequency decreases: infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out of LSI and treated as mere noise.
The authors propose two task-directed methods that make use of prior knowledge of the classification task.
Representation – Local LSI
What is Local LSI?
- model only the local portion of the corpus related to the topics of interest
- include documents that use terminology related to the topics (they need not have any of the topics assigned)
- perform SVD over only the local set of documents
- the representation is more sensitive to small, localized effects of infrequent terms
- the representation is more effective for classifying topics related to that local structure
Representation – Local LSI
Types of Local LSI: Cluster-Directed representation
- 5 meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals
- How to construct the local regions? Break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic, and perform an SVD for each meta-topic region.
Cluster-Directed LSI Representation (CD/LSI) with 200 features
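Cluster-directed LSI amounts to running one SVD per meta-topic region; a sketch with hypothetical cluster labels and random toy data:

```python
import numpy as np

# Toy corpus: 6 documents (columns) over 4 terms, with a hypothetical
# meta-topic label per document (standing in for the 5 Reuters clusters).
A = np.random.RandomState(0).rand(4, 6)
meta = ["agriculture", "agriculture", "metals", "metals", "energy", "energy"]

# Cluster-directed LSI: a separate SVD over each meta-topic region.
local_bases = {}
for topic in set(meta):
    cols = [i for i, m in enumerate(meta) if m == topic]
    U, S, Vt = np.linalg.svd(A[:, cols], full_matrices=False)
    local_bases[topic] = (U, S)
```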
Representation – Local LSI
[Diagrams: the Reuters corpus (Wool, Barley, Wheat, Money-supply, Zinc, Gold) is partitioned into the five meta-topic clusters (Agriculture, Energy, Foreign Exchange, Government, Metals), and a separate SVD is run on each cluster, yielding the Cluster-Directed LSI Representation (CD/LSI) with 200 features]
Representation – Local LSI
Types of Local LSI: Topic-Directed representation
- a more fine-grained approach to local LSI: a separate representation for each topic
- How to construct the local region?
  - use the 100 most predictive terms for the topic
  - pick the N most similar documents, N = 5 × (number of documents containing the topic), with 110 ≤ N ≤ 350
  - final documents in the topic region = the N documents + 150 random documents
Topic-Directed LSI Representation (TD/LSI) with 200 features
Representation – Local LSI
[Diagrams: for the Topic-Directed representation, a separate SVD is run on the local region of each topic (Wool, Barley, Wheat, Money-supply, Zinc, Gold), yielding the Topic-Directed LSI Representation (TD/LSI) with 200 features]
Representation – Local LSI
Drawbacks of Local LSI:
- the narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics
- high computational overhead
Representation - Relevancy Weighting LSI
Use term weights to emphasize the importance of particular terms before applying SVD.
IDF weighting:
- increases the importance of low-frequency terms and decreases the importance of high-frequency terms
- assumes low-frequency terms are better discriminators than high-frequency terms
Relevancy Weighting:
- tunes the IDF assumption: emphasize terms in proportion to their estimated topic discrimination power
- Global Relevancy Weighting of term k: GRW_k
- final weighting of term k = IDF² × GRW_k
- all low-frequency terms are pulled up by IDF; poor predictors are pushed down, leaving only relevant low-frequency terms with high weights
Relevancy Weighted LSI Representation (REL/LSI) with 200 features
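The final term weight (IDF squared times the global relevancy weight, as on the slide) can be sketched as follows; the counts and GRW values are hypothetical toy numbers.

```python
import math

def idf(n_docs, df_k):
    """Inverse document frequency of term k (df_k = document frequency)."""
    return math.log(n_docs / df_k)

def final_weight(n_docs, df_k, grw_k):
    """Final weighting of term k before the SVD: IDF^2 * GRW_k."""
    return idf(n_docs, df_k) ** 2 * grw_k

# Both terms are rare, so IDF pulls both up; the relevancy weight keeps
# the good predictor high and pushes the poor predictor back down.
rare_relevant = final_weight(n_docs=1000, df_k=5, grw_k=2.0)
rare_irrelevant = final_weight(n_docs=1000, df_k=5, grw_k=0.1)
```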
Representation - Relevancy Weighting LSI
GRW_k is computed from the per-topic relevancy scores r_kt:

r_kt = log( (w_kt / d_t + 1/6) / (w_kt̄ / d_t̄ + 1/6) )

where w_kt is the number of documents with topic t that contain term k, d_t is the total number of documents with topic t, and w_kt̄, d_t̄ are the same counts over documents without topic t.
Neural Network Classifier (NN)
A NN consists of processing units (neurons) and weighted links connecting the neurons.
Major components of a NN model:
- architecture: defines the functional form relating input to output (network topology, unit connectivity, and activation functions, e.g. the logistic regression function)
Neural Network Classifier (NN)
Logistic regression function:

p = 1 / (1 + e^(-z)),  z = w_0 + w_1 x_1 + ... + w_n x_n

z is a linear combination of the input features; p ∈ (0, 1), and the unit can be converted to a binary classification method by thresholding the output probability.
[Figure: the sigmoid activation function f(A)]
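A single logistic output unit with thresholding, matching the formulas above; the weights are arbitrary example values:

```python
import math

def logistic_unit(x, w, w0=0.0, threshold=0.5):
    """p = 1 / (1 + e^(-z)) with z = w0 + w1*x1 + ... + wn*xn.
    Returns the probability and the thresholded binary decision."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return p, p >= threshold

p, present = logistic_unit(x=[1.0, 2.0], w=[0.8, -0.2], w0=0.1)
```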
Neural Network Classifier (NN)
Major components of a NN model (cont.):
- search algorithm: the search in weight space for a set of weights that minimizes the error between the actual output and the expected output (the training process)
  - backpropagation method
  - error functions: mean squared error, or the cross-entropy error performance function
    C = - Σ over all cases and outputs of ( d·log(y) + (1-d)·log(1-y) ),  d: desired output, y: actual output
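The cross-entropy error function from the slide, sketched directly; the desired/actual values are toy numbers.

```python
import math

def cross_entropy(desired, actual):
    """C = - sum over all cases and outputs of
    d * log(y) + (1 - d) * log(1 - y)."""
    return -sum(d * math.log(y) + (1 - d) * math.log(1 - y)
                for d, y in zip(desired, actual))

# Error is low when the outputs match the desired targets ...
good = cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
# ... and high when they do not.
bad = cross_entropy([1, 0, 1], [0.2, 0.9, 0.3])
```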
NN for Topic Spotting
Network outputs are estimates of the probability of topic presence given the feature vector of a document.
- Generic LSI representation: each network uses the same representation
- Local LSI representation: a different representation for each network
Linear NN: output units with logistic activation and no hidden layer.
NN for Topic Spotting
Non-Linear NN: simple networks with a single hidden layer of logistic sigmoid units (6 to 15).
NN for Topic Spotting: Flat Architecture
- a separate network for each topic; the entire training set is used to train each topic's network
- overfitting is avoided by adding a penalty term to the cross-entropy cost function to encourage the elimination of small weights
- early stopping based on cross-validation
NN for Topic Spotting: Modular Architecture
- decompose the learning problem into smaller problems
- Meta-Topic Network: trained on the full training set to estimate the probability that each of the five meta-topics is present in a document; uses 15 hidden units
NN for Topic Spotting: Modular Architecture
- five groups of local topic networks, one group per meta-topic, consisting of a local topic network for each topic in the meta-topic
- each network is trained only on its meta-topic region
- example: the wheat network is trained on the Agriculture meta-topic region, so it can focus on finer distinctions (e.g. wheat vs. grain) without wasting effort on easier distinctions (e.g. wheat vs. gold)
- each local topic network uses 6 hidden units
NN for Topic Spotting: Modular Architecture
To compute the topic predictions for a given document:
- present the document to the meta-topic network
- present the document to each of the topic networks
- (outputs of the meta-topic network) × (estimates of the topic networks) = final topic estimates
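A sketch of the modular prediction step, assuming (as the slide indicates) that the final estimate is the product of the meta-topic network's output and the local topic network's output; all names and probabilities below are hypothetical.

```python
def final_topic_estimates(meta_probs, local_probs, topic_to_meta):
    """Multiply each local topic probability by the probability of its
    meta-topic, as output by the meta-topic network."""
    return {topic: meta_probs[topic_to_meta[topic]] * p
            for topic, p in local_probs.items()}

meta_probs = {"agriculture": 0.9, "metals": 0.2}
local_probs = {"wheat": 0.8, "gold": 0.7}
topic_to_meta = {"wheat": "agriculture", "gold": "metals"}
estimates = final_topic_estimates(meta_probs, local_probs, topic_to_meta)
```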
Experimental Results
Evaluating Performance:
- the mean squared error between actual and predicted values is not an informative evaluation measure
- instead, compute precision and recall based on contingency tables constructed over a range of decision thresholds
- How are the decision thresholds chosen?
Experimental Results
Evaluating Performance: how are the decision thresholds chosen?
Proportional assignment:
- predict Topic = 'wool' iff the output probability ≥ the output probability of the kp'th highest-ranked document (k an integer, p the prior probability of the 'wool' topic)
- predict Topic ≠ 'wool' iff the output probability is below that threshold
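One reading of proportional assignment, sketched in Python: assign the topic to roughly k·p·N of the N ranked documents. The interpretation of the "kp'th highest rank" and the toy scores are assumptions, not from the slides.

```python
def proportional_threshold(scores, k, prior):
    """Assign the topic to the documents whose output probability is at
    least that of the (k * prior * N)-th highest-ranked document."""
    n_assign = max(1, round(k * prior * len(scores)))
    threshold = sorted(scores, reverse=True)[n_assign - 1]
    return [s >= threshold for s in scores]

# Ten documents, topic prior 0.2, k = 1: the top two documents get the topic.
decisions = proportional_threshold(
    [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.7, 0.15, 0.4, 0.25], k=1, prior=0.2)
```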
Experimental Results
Evaluating Performance: how are the decision thresholds chosen?
Fixed recall levels:
- determine a set of recall levels
- analyze the ranked documents to determine which decision thresholds lead to the desired recall levels
- predict Topic = 'wool' iff the output probability ≥ the output probability of the document at which the number of documents with higher output probability yields the target recall level; otherwise predict Topic ≠ 'wool'
Experimental Results: Performance by Microaveraging
- add all contingency tables together across topics at a certain threshold, then compute precision and recall
- proportional assignment is used for picking the decision thresholds
- does not weight the topics evenly
- used for comparisons to previously reported results
- the breakeven point is used as a summary value
Experimental Results: Performance by Macroaveraging
- compute precision and recall for each topic and take the average across topics
- uses the fixed set of recall levels
- summary values are obtained for particular topics by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95
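Micro- vs. macro-averaging over per-topic contingency tables can be sketched as follows; the per-topic counts are toy numbers.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from one contingency table."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Hypothetical (tp, fp, fn) counts for three topics at one threshold.
tables = {"wheat": (8, 2, 2), "gold": (1, 1, 3), "wool": (2, 0, 2)}

# Microaveraging: add the contingency tables, then compute P/R once.
tp, fp, fn = (sum(t[i] for t in tables.values()) for i in range(3))
micro_p, micro_r = precision_recall(tp, fp, fn)

# Macroaveraging: compute P/R per topic, then average across topics.
per_topic = [precision_recall(*t) for t in tables.values()]
macro_p = sum(p for p, _ in per_topic) / len(per_topic)
macro_r = sum(r for _, r in per_topic) / len(per_topic)
```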
Experimental Results
Microaveraged performance breakeven points: 0.82, 0.801, 0.795, and 0.775, compared to the best previously reported algorithm, a rule-induction method based on heuristic search, with a breakeven point of 0.789.
Experimental Results
Macroaveraged performance:
- TERMS appears much closer to the other three representations
- the relative effectiveness of the representations at low recall levels is reversed at high recall levels
- slight improvement from the non-linear networks
- LSI performance degrades compared to TERMS as topic frequency decreases
- across the six techniques on the 54 most frequent topics, there is considerable variation of performance across topics, and the relative ups and downs are mirrored in both plots
Experimental Results: Performance of Combinations of Techniques and Their Improvement

Document Representation | Flat (Linear) | Flat (Non-Linear) | Modular (Linear) | Modular (Non-Linear)
TERMS                   |               |                   |                  |
LSI                     |               |                   |                  |
CD-LSI                  |               |                   |                  |
TD-LSI                  |               |                   |                  |
REL-LSI                 |               |                   |                  |
Hybrid (CD-LSI + TERMS) |               |                   |                  |

(In the modular architecture, the meta-topic network is trained using the LSI representation. On the slide, color and shape are matched to identify each experiment; the cell values are not reproduced here.)
Experimental Results
Flat Networks
Experimental Results
Modular Networks: only 4 clusters were used, and the average precision for the flat networks was recomputed for comparison.
Non-linear networks seem to perform better than the linear models, but the difference is very slight.
The LSI representation is able to equal or exceed TERMS performance for high-frequency topics, but performs poorly for low-frequency ones.
The task-directed LSI representations improve performance in the low-frequency domain; TD/LSI trades off cost, and REL/LSI trades off lower performance on medium/high-frequency topics.
Modular CD/LSI improves performance further for low-frequency topics, because the individual networks are trained only in the domain where the LSI was performed.
TERMS proves to be competitive with the more sophisticated LSI techniques: most topics are predictable from a small set of terms.
Discussion
- a rich solution: many representations and many models
- a totally supervised approach
- results are lower than expected; is the dataset responsible?
- high computational overhead
- does NN deserve a place in DM toolboxes?
Questions?