A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz.
A Clustering Method Based on Nonnegative Matrix Factorization
for Text Mining
Farial Shahnaz
Topics
• Introduction
• Algorithm
• Performance
• Observation
• Conclusion and Future Work
Introduction
Basic Concepts
• Text Mining : Detection of trends or patterns in text data
• Clustering : Grouping or classifying documents based on similarity of content
Clustering
• Manual vs. Automated
• Supervised vs. Unsupervised
• Hierarchical vs. Partitional
Clustering
• Objective: Automated Unsupervised Partitional Clustering of Text Data or Documents
• Method : Nonnegative Matrix Factorization or NMF
Vector Space Model of Text Data
• Documents represented as n-dimensional vectors
– n : terms in the dictionary
– vector component : importance of term
• Document collection represented as term-by-document matrix
Term-by-Document Matrix
• Terms in the dictionary, n : 9 (a, brown, dog, fox, jumped, lazy, over, quick, the)
• Document 1 : a quick brown fox
• Document 2 : jumped over the lazy dog
Term-by-Document Matrix
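The two example documents above can be turned into a term-by-document matrix with a few lines of code. This is an illustrative sketch (not from the presentation) that uses raw term counts as the importance weights; tf-idf weighting is another common choice.

```python
# Sketch: build the term-by-document matrix for the two example
# documents, with raw term counts as the vector components.
def term_by_document_matrix(documents):
    # Dictionary: all distinct terms, sorted alphabetically.
    terms = sorted({t for doc in documents for t in doc.split()})
    # One row per term, one column per document; entry = term count.
    matrix = [[doc.split().count(term) for doc in documents]
              for term in terms]
    return terms, matrix

docs = ["a quick brown fox", "jumped over the lazy dog"]
terms, V = term_by_document_matrix(docs)
print(terms)  # the 9 dictionary terms: a, brown, dog, fox, ...
for term, row in zip(terms, V):
    print(f"{term:>6}: {row}")
```

Every entry is a count, so the matrix is nonnegative by construction, which is exactly the property NMF preserves.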
Clustering Method : NMF
• Low rank approximation of large sparse matrices
• Preserves data nonnegativity
• Introduces the concept of parts-based representation (by Lee and Seung in Nature, 1999)
Other Methods
• Other rank reduction methods:
– Principal Component Analysis (PCA)
– Vector Quantization (VQ)
• Produce basis vectors with negative entries
• Additive and subtractive combinations of basis vectors yield the original document vectors
NMF
• Produces nonnegative basis vectors
• Additive combinations of basis vectors yield the original document vectors
Term-by-Document Matrix (all entries nonnegative)
NMF
• Basis vectors interpreted as semantic features or topics
• Documents clustered on the basis of shared features
NMF
• Demonstrated by Xu et al. (2003):
– Outperforms Singular Value Decomposition (SVD)
– Comparable to graph partitioning methods
Algorithm
NMF : Definition
Given
• S : Document collection
• Vmxn : term-by-document matrix
• m : terms in the dictionary
• n : Number of documents in S
NMF : Definition
NMF is defined as:
• Low rank approximation of Vmxn in terms of some metric
• Factor V into the product WH
– Wmxk : contains basis vectors
– Hkxn : contains linear combinations
– k : selected number of topics or basis vectors, k << min(m, n)
NMF : Common Approach
• Minimize the objective function:
f(W, H) = ½ || V − WH ||_F^2, subject to W ≥ 0, H ≥ 0
NMF : Existing Methods
Multiplicative Method (MM) [ by Lee and Seung ]
• Based on Multiplicative update rules
• || V − WH || is monotonically non-increasing, and constant iff W and H are at a stationary point
• Version of Gradient Descent (GD) optimization scheme
NMF : Existing Methods
Sparse Encoding [ by Hoyer ]
• Based on study of neural networks
• Enforces statistical sparsity of H
– Minimizes the sum of non-zeros in H
NMF : Existing Methods
Sparse Encoding [ by Mu, Plemmons and Santago ]
• Similar to Hoyer’s method
• Enforces statistical sparsity of H using a regularization parameter
– Minimizes the number of non-zeros in H
NMF : Proposed Algorithm
Hybrid Method:
• W approximated using the Multiplicative Method
• H calculated using a Constrained Least Squares (CLS) model as the metric
– Penalizes the number of non-zeros
– Similar to the method by Mu, Plemmons and Santago
• Called GD-CLS
GD-CLS
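A minimal sketch of the hybrid scheme, assuming the CLS step solves the ridge-regularized normal equations (WᵀW + λI)H = WᵀV and then clamps negative entries of H to zero, while W takes a multiplicative (gradient-descent) step. The parameter names and the clamping step are illustrative choices of this sketch, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch of the GD-CLS idea:
#   - H comes from the constrained least squares problem
#       min ||V - WH||_F^2 + lam * ||H||_F^2,  H >= 0,
#     solved via (W^T W + lam*I) H = W^T V, negatives clamped to zero;
#   - W is refined with the multiplicative update rule.
def gd_cls(V, k, lam=0.01, iters=100, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # CLS step for H: ridge-regularized least squares + projection
        G = W.T @ W + lam * np.eye(k)
        H = np.linalg.solve(G, W.T @ V)
        H[H < 0] = 0.0
        # Multiplicative step for W (stays nonnegative by construction)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Small random nonnegative "term-by-document" matrix for demonstration.
V = np.random.default_rng(1).random((6, 8))
W, H = gd_cls(V, k=3, lam=0.001)
```

The λ parameter trades reconstruction accuracy for sparsity of H: larger values shrink H's entries toward zero, which matches the observation later in the slides that AC declines as λ increases.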
Performance
Text Collections Used
• Two benchmark topic detection text collections:
– Reuters : collection of documents on assorted topics
– TDT2 : transcripts from news media
Text Collections Used
Accuracy Metric
• Defined by:
AC = ( ∑ δ(di) ) / n, summed over i = 1 … n
• di : document number i
• δ(di) = 1 if the topic labels match
• δ(di) = 0 otherwise
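The accuracy metric can be sketched in code. Mapping each cluster to the majority true label among its members is one common way to perform the label-matching step before counting δ(di); the paper's exact matching procedure may differ, so this is an illustrative sketch.

```python
from collections import Counter

# Sketch of AC = (1/n) * sum_i delta(d_i): delta(d_i) = 1 when
# document i's assigned topic label matches its true label.
# Each cluster id is first mapped to the most frequent true label
# among the documents assigned to it (an assumed matching step).
def accuracy(true_labels, cluster_ids):
    n = len(true_labels)
    majority = {}
    for c in set(cluster_ids):
        members = [t for t, ci in zip(true_labels, cluster_ids) if ci == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    matches = sum(1 for t, c in zip(true_labels, cluster_ids)
                  if majority[c] == t)
    return matches / n

# Toy example with topic labels like those in the Reuters collection.
true = ["earn", "earn", "cocoa", "cocoa", "interest"]
pred = [0, 0, 1, 1, 1]
print(accuracy(true, pred))  # 4 of 5 documents match -> 0.8
```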
• k = 2, 4, 6, 8, 10, 15, 20
• λ = 0.1, 0.01, 0.001
Results for Reuters
Results for TDT2
Observations
Observations : AC
• AC decreases as k increases
• Nature of the collection affects AC– Reuters : earn, interest, cocoa– TDT2 : Asian economic crisis, Oprah lawsuit
Observations : λ parameter
• AC declines as λ increases (most noticeable for homogeneous text collections)
• CPU time declines as λ increases
Observations : Cluster size
• Imbalance in cluster sizes has an adverse effect on AC
Conclusion & Future Work
GD-CLS can be used to cluster text data effectively. Further development involves:
• Smart updating
• Use in bioinformatics
• Developing a user interface
• Converting to C++