A Neural Network Approach to Topic Spotting
Presented by: Loulwah AlSumait
INFS 795 Spec. Topics in Data Mining
4.14.2005
Article Information
Published in: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995
Authors: Wiener, E., Pedersen, J.O., Weigend, A.S.
54 citations
Summary
- Introduction
- Related Work
- The Corpus
- Representation
  - Term Selection
  - Latent Semantic Indexing
    - Generic LSI
    - Local LSI: Cluster-Directed LSI, Topic-Directed LSI
    - Relevancy Weighting LSI
  - Summary
- Neural Network Classifier
  - Neural Networks for Topic Spotting
  - Linear vs. Non-Linear Networks
  - Flat Architecture vs. Modular Architecture
- Experiment Results
  - Evaluating Performance
  - Results & Discussion
Introduction
Topic Spotting = Text Categorization = Text Classification
The problem of identifying which of a set of predefined topics are present in a natural-language document.
[Diagram: a document is mapped to one or more of Topic 1, Topic 2, ..., Topic n]
Introduction
Classification Approaches
Expert-system approach:
- manually construct a system of inference rules on top of a large body of linguistic and domain knowledge
- can be extremely accurate, but is very time consuming and brittle to changes in the data environment
Data-driven approach:
- induce a set of rules from a corpus of labeled training documents
- practically better
Introduction – Related Work
The major remarks regarding the related work:
- A separate classifier was constructed for each topic.
- A different set of terms was used to train each classifier.
Introduction – The Corpus
Reuters 22173 corpus of Reuters newswire stories from 1987:
- 21,450 stories: 9,610 for training, 3,662 for testing
- mean length 90.6 words, SD 91.6
- 92 topics appeared at least once in the training set
- mean of 1.24 topics/doc (up to 14 topics for some documents)
- 11,161 unique terms after preprocessing: inflectional stemming, stop-word removal, conversion to lower case, and elimination of words appearing in fewer than three documents
Representations
Starting point: the Document Profile, a term-by-document matrix containing word-frequency entries; each document is represented by a vector d_i of term frequencies.
[Example: a document vector with normalized term frequencies 3/33, 1/33, 1/33, 2/33, ... from Thorsten Joachims, 1997, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. http://citeseer.ist.psu.edu/joachims97text.html]
Representation - Term Selection
Select the subset of the original terms that are most useful for the classification task.
It is difficult to select terms that discriminate between 92 classes while remaining small enough to serve as the feature set for a neural network, so:
- divide the problem into 92 independent classification tasks
- search for the best discriminator terms between documents with the topic and those without
Representation - Term Selection: Relevancy Score
- measures how unbalanced the term is across documents with or without the topic
- highly positive and highly negative scores indicate useful terms for discrimination
- using about 20 terms yielded the best classification performance
r_kt = log( (w_kt / d_t + 1/6) / (w_kt̄ / d_t̄ + 1/6) )

where w_kt is the number of documents with topic t that contain term k, d_t is the total number of documents with topic t, and w_kt̄, d_t̄ are the same counts over documents without topic t.
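A minimal Python sketch of the relevancy score as reconstructed above; the document counts are toy values, not from the Reuters corpus.

```python
import math

def relevancy_score(w_kt, d_t, w_knt, d_nt):
    """Relevancy score r_kt of term k for topic t.

    w_kt : number of documents with topic t that contain term k
    d_t  : total number of documents with topic t
    w_knt, d_nt : the same counts over documents without topic t
    The 1/6 terms keep the ratio finite when a count is zero.
    """
    return math.log((w_kt / d_t + 1 / 6) / (w_knt / d_nt + 1 / 6))

# A term concentrated in the topic documents gets a highly positive score:
high = relevancy_score(w_kt=9, d_t=10, w_knt=5, d_nt=100)
# A term spread evenly across both sets scores zero:
flat = relevancy_score(w_kt=5, d_t=10, w_knt=50, d_nt=100)
```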
Representation - Term Selection
Advantages:
- little computation is required
- the resulting features have direct interpretability
Drawbacks:
- many of the best individual predictors contain redundant information
- a term that appears to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa (e.g. Apple vs. Apple Computers)
Selected Term Representation (TERMS) with 20 features
Representation - Term Selection
[Diagram: the Selected Term Representation (TERMS)]
Representation – LSI
Transform the original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection:
- (training set) apply a singular-value decomposition (SVD) to the original term-by-document matrix to get U, Σ, V
- (test set) transform document vectors by projecting them into the LSI space
Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss. Found: performance continues to improve up to at least 250 dimensions, but the improvement rapidly slows down after about 100 dimensions.
Generic LSI Representation (LSI) with 200 features
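The two LSI steps (SVD on the training matrix, then projection of new documents) can be sketched with NumPy; the matrix and dimensionality here are illustrative, not the 200-dimension Reuters setup.

```python
import numpy as np

# Toy term-by-document matrix: rows = terms, columns = training documents.
A = np.array([
    [3., 0., 1., 0.],
    [1., 2., 0., 0.],
    [0., 1., 2., 3.],
    [0., 0., 1., 2.],
])

# Training set: singular-value decomposition gives U, S, Vt.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # keep only the top-k LSI dimensions
U_k, S_k = U[:, :k], S[:k]

# Test set: project a new document vector into the k-dimensional LSI space.
d_new = np.array([2., 1., 0., 0.])
d_lsi = (d_new @ U_k) / S_k
```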
Representation – LSI
[Diagram: a single SVD applied to the entire Reuters corpus (topics such as Wool, Barley, Wheat, Money-supply, Zinc, Gold) yields the Generic LSI Representation with 200 features]
Representation – Local LSI
Global LSI performs worse as topic frequency decreases: infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out of LSI and treated as mere noise.
The authors propose two task-directed methods that make use of prior knowledge of the classification task.
Representation – Local LSI
What is Local LSI?
- model only the local portion of the corpus related to the topics of interest
- include documents that use terminology related to the topics (they need not have any of the topics assigned)
- perform SVD over only the local set of documents
- the representation is more sensitive to small, localized effects of infrequent terms
- the representation is more effective for classifying topics related to that local structure
Representation – Local LSI
Types of Local LSI: Cluster-Directed representation
- 5 meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals
- How to construct the local regions? Break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic, and perform an SVD for each meta-topic region.
Cluster-Directed LSI Representation (CD/LSI) with 200 features
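Cluster-directed LSI amounts to running one SVD per meta-topic region; a sketch with hypothetical cluster labels and random toy data:

```python
import numpy as np

# Toy corpus: 6 documents (columns) over 4 terms, with a hypothetical
# meta-topic label per document (standing in for the 5 Reuters clusters).
A = np.random.RandomState(0).rand(4, 6)
meta = ["agriculture", "agriculture", "metals", "metals", "energy", "energy"]

# Cluster-directed LSI: a separate SVD over each meta-topic region.
local_bases = {}
for topic in set(meta):
    cols = [i for i, m in enumerate(meta) if m == topic]
    U, S, Vt = np.linalg.svd(A[:, cols], full_matrices=False)
    local_bases[topic] = (U, S)
```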
Representation – Local LSI
[Diagrams: the Reuters corpus (Wool, Barley, Wheat, Money-supply, Zinc, Gold) is partitioned into the five meta-topic clusters (Agriculture, Energy, Foreign Exchange, Government, Metals), and a separate SVD is run on each cluster, yielding the Cluster-Directed LSI Representation (CD/LSI) with 200 features]
Representation – Local LSI
Types of Local LSI: Topic-Directed representation
- a more fine-grained approach to local LSI: a separate representation for each topic
- How to construct the local region?
  - use the 100 most predictive terms for the topic
  - pick the N most similar documents, N = 5 × (number of documents containing the topic), with 110 ≤ N ≤ 350
  - final documents in the topic region = the N documents + 150 random documents
Topic-Directed LSI Representation (TD/LSI) with 200 features
Representation – Local LSI
[Diagrams: for the Topic-Directed representation, a separate SVD is run on the local region of each topic (Wool, Barley, Wheat, Money-supply, Zinc, Gold), yielding the Topic-Directed LSI Representation (TD/LSI) with 200 features]
Representation – Local LSI
Drawbacks of Local LSI:
- the narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics
- high computational overhead
Representation - Relevancy Weighting LSI
Use term weights to emphasize the importance of particular terms before applying SVD.
IDF weighting:
- increases the importance of low-frequency terms and decreases the importance of high-frequency terms
- assumes low-frequency terms are better discriminators than high-frequency terms
Relevancy Weighting:
- tunes the IDF assumption: emphasize terms in proportion to their estimated topic discrimination power
- Global Relevancy Weighting of term k: GRW_k
- final weighting of term k = IDF² × GRW_k
- all low-frequency terms are pulled up by IDF; poor predictors are pushed down, leaving only relevant low-frequency terms with high weights
Relevancy Weighted LSI Representation (REL/LSI) with 200 features
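The final term weight (IDF squared times the global relevancy weight, as on the slide) can be sketched as follows; the counts and GRW values are hypothetical toy numbers.

```python
import math

def idf(n_docs, df_k):
    """Inverse document frequency of term k (df_k = document frequency)."""
    return math.log(n_docs / df_k)

def final_weight(n_docs, df_k, grw_k):
    """Final weighting of term k before the SVD: IDF^2 * GRW_k."""
    return idf(n_docs, df_k) ** 2 * grw_k

# Both terms are rare, so IDF pulls both up; the relevancy weight keeps
# the good predictor high and pushes the poor predictor back down.
rare_relevant = final_weight(n_docs=1000, df_k=5, grw_k=2.0)
rare_irrelevant = final_weight(n_docs=1000, df_k=5, grw_k=0.1)
```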
Representation - Relevancy Weighting LSI
GRW_k is computed from the per-topic relevancy scores r_kt:

r_kt = log( (w_kt / d_t + 1/6) / (w_kt̄ / d_t̄ + 1/6) )

where w_kt is the number of documents with topic t that contain term k, d_t is the total number of documents with topic t, and w_kt̄, d_t̄ are the same counts over documents without topic t.
Neural Network Classifier (NN)
A NN consists of processing units (neurons) and weighted links connecting the neurons.
Major components of a NN model:
- architecture: defines the functional form relating input to output (network topology, unit connectivity, and activation functions, e.g. the logistic regression function)
Neural Network Classifier (NN)
Logistic regression function:

p = 1 / (1 + e^(-z)),  z = w_0 + w_1 x_1 + ... + w_n x_n

z is a linear combination of the input features; p ∈ (0, 1), and the unit can be converted to a binary classification method by thresholding the output probability.
[Figure: the sigmoid activation function f(A)]
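A single logistic output unit with thresholding, matching the formulas above; the weights are arbitrary example values:

```python
import math

def logistic_unit(x, w, w0=0.0, threshold=0.5):
    """p = 1 / (1 + e^(-z)) with z = w0 + w1*x1 + ... + wn*xn.
    Returns the probability and the thresholded binary decision."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return p, p >= threshold

p, present = logistic_unit(x=[1.0, 2.0], w=[0.8, -0.2], w0=0.1)
```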
Neural Network Classifier (NN)
Major components of a NN model (cont.):
- search algorithm: the search in weight space for a set of weights that minimizes the error between the actual output and the expected output (the training process)
  - backpropagation method
  - error functions: mean squared error, or the cross-entropy error performance function
    C = - Σ over all cases and outputs of ( d·log(y) + (1-d)·log(1-y) ),  d: desired output, y: actual output
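The cross-entropy error function from the slide, sketched directly; the desired/actual values are toy numbers.

```python
import math

def cross_entropy(desired, actual):
    """C = - sum over all cases and outputs of
    d * log(y) + (1 - d) * log(1 - y)."""
    return -sum(d * math.log(y) + (1 - d) * math.log(1 - y)
                for d, y in zip(desired, actual))

# Error is low when the outputs match the desired targets ...
good = cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
# ... and high when they do not.
bad = cross_entropy([1, 0, 1], [0.2, 0.9, 0.3])
```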
NN for Topic Spotting
Network outputs are estimates of the probability of topic presence given the feature vector of a document.
- Generic LSI representation: each network uses the same representation
- Local LSI representation: a different representation for each network
Linear NN: output units with logistic activation and no hidden layer.
NN for Topic Spotting
Non-Linear NN: simple networks with a single hidden layer of logistic sigmoid units (6 to 15).
NN for Topic Spotting: Flat Architecture
- a separate network for each topic; the entire training set is used to train each topic's network
- overfitting is avoided by adding a penalty term to the cross-entropy cost function to encourage the elimination of small weights
- early stopping based on cross-validation
NN for Topic Spotting: Modular Architecture
- decompose the learning problem into smaller problems
- Meta-Topic Network: trained on the full training set to estimate the probability that each of the five meta-topics is present in a document; uses 15 hidden units
NN for Topic Spotting: Modular Architecture
- five groups of local topic networks, one group per meta-topic, consisting of a local topic network for each topic in the meta-topic
- each network is trained only on its meta-topic region
- example: the wheat network is trained on the Agriculture meta-topic region, so it can focus on finer distinctions (e.g. wheat vs. grain) without wasting effort on easier distinctions (e.g. wheat vs. gold)
- each local topic network uses 6 hidden units
NN for Topic Spotting: Modular Architecture
To compute the topic predictions for a given document:
- present the document to the meta-topic network
- present the document to each of the topic networks
- (outputs of the meta-topic network) × (estimates of the topic networks) = final topic estimates
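A sketch of the modular prediction step, assuming (as the slide indicates) that the final estimate is the product of the meta-topic network's output and the local topic network's output; all names and probabilities below are hypothetical.

```python
def final_topic_estimates(meta_probs, local_probs, topic_to_meta):
    """Multiply each local topic probability by the probability of its
    meta-topic, as output by the meta-topic network."""
    return {topic: meta_probs[topic_to_meta[topic]] * p
            for topic, p in local_probs.items()}

meta_probs = {"agriculture": 0.9, "metals": 0.2}
local_probs = {"wheat": 0.8, "gold": 0.7}
topic_to_meta = {"wheat": "agriculture", "gold": "metals"}
estimates = final_topic_estimates(meta_probs, local_probs, topic_to_meta)
```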
Experimental Results
Evaluating Performance:
- the mean squared error between actual and predicted values is not an informative evaluation measure
- instead, compute precision and recall based on contingency tables constructed over a range of decision thresholds
- How are the decision thresholds chosen?
Experimental Results
Evaluating Performance: how are the decision thresholds chosen?
Proportional assignment:
- predict Topic = 'wool' iff the output probability ≥ the output probability of the kp'th highest-ranked document (k an integer, p the prior probability of the 'wool' topic)
- predict Topic ≠ 'wool' iff the output probability is below that threshold
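One reading of proportional assignment, sketched in Python: assign the topic to roughly k·p·N of the N ranked documents. The interpretation of the "kp'th highest rank" and the toy scores are assumptions, not from the slides.

```python
def proportional_threshold(scores, k, prior):
    """Assign the topic to the documents whose output probability is at
    least that of the (k * prior * N)-th highest-ranked document."""
    n_assign = max(1, round(k * prior * len(scores)))
    threshold = sorted(scores, reverse=True)[n_assign - 1]
    return [s >= threshold for s in scores]

# Ten documents, topic prior 0.2, k = 1: the top two documents get the topic.
decisions = proportional_threshold(
    [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.7, 0.15, 0.4, 0.25], k=1, prior=0.2)
```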
Experimental Results
Evaluating Performance: how are the decision thresholds chosen?
Fixed recall levels:
- determine a set of recall levels
- analyze the ranked documents to determine which decision thresholds lead to the desired recall levels
- predict Topic = 'wool' iff the output probability ≥ the output probability of the document at which the number of documents with higher output probability yields the target recall level; otherwise predict Topic ≠ 'wool'
Experimental Results: Performance by Microaveraging
- add all contingency tables together across topics at a certain threshold, then compute precision and recall
- proportional assignment is used for picking the decision thresholds
- does not weight the topics evenly
- used for comparisons to previously reported results
- the breakeven point is used as a summary value
Experimental Results: Performance by Macroaveraging
- compute precision and recall for each topic and take the average across topics
- uses the fixed set of recall levels
- summary values are obtained for particular topics by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95
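Micro- vs. macro-averaging over per-topic contingency tables can be sketched as follows; the per-topic counts are toy numbers.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from one contingency table."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Hypothetical (tp, fp, fn) counts for three topics at one threshold.
tables = {"wheat": (8, 2, 2), "gold": (1, 1, 3), "wool": (2, 0, 2)}

# Microaveraging: add the contingency tables, then compute P/R once.
tp, fp, fn = (sum(t[i] for t in tables.values()) for i in range(3))
micro_p, micro_r = precision_recall(tp, fp, fn)

# Macroaveraging: compute P/R per topic, then average across topics.
per_topic = [precision_recall(*t) for t in tables.values()]
macro_p = sum(p for p, _ in per_topic) / len(per_topic)
macro_r = sum(r for _, r in per_topic) / len(per_topic)
```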
Experimental Results
Microaveraged performance breakeven points: 0.82, 0.801, 0.795, and 0.775, compared to the best previously reported algorithm, a rule-induction method based on heuristic search, with a breakeven point of 0.789.
Experimental Results
Macroaveraged performance:
- TERMS appears much closer to the other three representations
- the relative effectiveness of the representations at low recall levels is reversed at high recall levels
- slight improvement from the non-linear networks
- LSI performance degrades compared to TERMS as topic frequency decreases
- across the six techniques on the 54 most frequent topics, there is considerable variation of performance across topics, and the relative ups and downs are mirrored in both plots
Experimental Results: Performance of Combinations of Techniques and Their Improvement

Document Representation | Flat (Linear) | Flat (Non-Linear) | Modular (Linear) | Modular (Non-Linear)
TERMS                   |               |                   |                  |
LSI                     |               |                   |                  |
CD-LSI                  |               |                   |                  |
TD-LSI                  |               |                   |                  |
REL-LSI                 |               |                   |                  |
Hybrid (CD-LSI + TERMS) |               |                   |                  |

(In the modular architecture, the meta-topic network is trained using the LSI representation. On the slide, color and shape are matched to identify each experiment; the cell values are not reproduced here.)
Experimental Results
Flat Networks
Experimental Results
Modular Networks: only 4 clusters were used, and the average precision for the flat networks was recomputed for comparison.
Non-linear networks seem to perform better than the linear models, but the difference is very slight.
The LSI representation is able to equal or exceed TERMS performance for high-frequency topics, but performs poorly for low-frequency ones.
The task-directed LSI representations improve performance in the low-frequency domain; TD/LSI trades off cost, and REL/LSI trades off lower performance on medium/high-frequency topics.
Modular CD/LSI improves performance further for low-frequency topics, because the individual networks are trained only in the domain where the LSI was performed.
TERMS proves to be competitive with the more sophisticated LSI techniques: most topics are predictable from a small set of terms.
Discussion
- a rich solution: many representations and many models
- a totally supervised approach
- results are lower than expected; is the dataset responsible?
- high computational overhead
- does NN deserve a place in DM toolboxes?
Questions?