A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining


Page 1: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

A Clustering Method Based on Nonnegative Matrix Factorization

for Text Mining

Farial Shahnaz

Page 2: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Topics

• Introduction

• Algorithm

• Performance

• Observation

• Conclusion and Future Work

Page 3: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Introduction

Page 4: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Basic Concepts

• Text Mining : Detection of trends or patterns in text data

• Clustering : Grouping or classifying documents based on similarity of content

Page 5: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Clustering

• Manual vs. Automated

• Supervised vs. Unsupervised

• Hierarchical vs. Partitional

Page 6: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Clustering

• Objective: Automated Unsupervised Partitional Clustering of Text Data or Documents

• Method : Nonnegative Matrix Factorization or NMF

Page 7: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Vector Space Model of Text Data

• Documents represented as n-dimensional vectors
  – n : terms in the dictionary
  – vector component : importance of term

• Document collection represented as term-by-document matrix

Page 8: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Term-by-Document Matrix

• Terms in the dictionary, n : 9 (a, brown, dog, fox, jumped, lazy, over, quick, the)

• Document 1 : a quick brown fox

• Document 2 : jumped over the lazy dog

Page 9: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Term-by-Document Matrix
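
The matrix figure for this slide is not in the transcript. As a minimal sketch, the dictionary and the two example documents can be turned into a term-by-document matrix as follows. Raw term counts are assumed as the "importance" weight, and the function name `term_by_document` is hypothetical; the original presentation may have used a different weighting.

```python
import numpy as np

# Dictionary and documents taken from the example slide.
terms = ["a", "brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"]
documents = ["a quick brown fox", "jumped over the lazy dog"]

def term_by_document(terms, documents):
    """Return a (terms x documents) matrix of raw term counts (assumed weighting)."""
    row_of = {term: row for row, term in enumerate(terms)}
    matrix = np.zeros((len(terms), len(documents)))
    for col, text in enumerate(documents):
        for word in text.lower().split():
            if word in row_of:  # words outside the dictionary are ignored
                matrix[row_of[word], col] += 1
    return matrix

V = term_by_document(terms, documents)
print(V)
```

Each column is one document vector and each row is one dictionary term, so every entry is nonnegative by construction.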

Page 10: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Clustering Method : NMF

• Low rank approximation of large sparse matrices

• Preserves data nonnegativity

• Introduces the concept of parts-based representation (by Lee and Seung in Nature, 1999)

Page 11: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Other Methods

• Other rank reduction methods :
  – Principal Component Analysis (PCA)
  – Vector Quantization (VQ)

• Produce basis vectors with negative entries

• Additive and Subtractive combinations of basis vectors yield original document vectors

Page 12: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF

• Produces nonnegative basis vectors

• Additive combinations of basis vectors yield the original document vectors

Page 13: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Term-by-Document Matrix (all entries nonnegative)

Page 14: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF

• Basis vectors interpreted as semantic features or topics

• Documents clustered on the basis of shared features
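
One simple way to read "clustered on the basis of shared features" is to give each document the label of the feature that weighs most heavily in it. The sketch below assumes a hypothetical coefficient matrix `H` with one row per topic and one column per document; the numbers are illustrative only, and the presentation does not state that this exact rule was used.

```python
import numpy as np

def assign_clusters(H):
    """Assign each document (column of H) to the topic (row) with the largest weight."""
    return np.argmax(H, axis=0)

# Hypothetical weights: 2 topics (rows) by 4 documents (columns).
H = np.array([[0.9, 0.1, 0.7, 0.0],
              [0.2, 0.8, 0.1, 0.5]])

labels = assign_clusters(H)
print(labels)
```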

Page 15: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF

• Demonstrated by Xu et al. (2003):
  – Outperforms Singular Value Decomposition (SVD)
  – Comparable to Graph Partitioning methods

Page 16: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Algorithm

Page 17: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF : Definition

Given

• S : Document collection

• V (m × n) : term-by-document matrix

• m : Number of terms in the dictionary

• n : Number of documents in S

Page 18: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF : Definition

NMF is defined as:

• Low rank approximation of V (m × n) in terms of some metric

• Factor V into the product WH
  – W (m × k) : Contains basis vectors
  – H (k × n) : Contains linear combinations
  – k : Selected number of topics or basis vectors, k << min(m,n)

Page 19: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF : Common Approach

• Minimize objective function:
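
The formula on this slide did not survive in the transcript. A common choice, assumed here because it matches the || V - WH || quantity quoted for the Multiplicative Method, is the squared Frobenius norm of the residual under nonnegativity constraints:

```latex
\min_{W \ge 0,\; H \ge 0} \; \| V - WH \|_F^2
  \;=\; \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl( V_{ij} - (WH)_{ij} \bigr)^2
```

Here V is m × n, W is m × k and H is k × n, as in the definition slides.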

Page 20: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF : Existing Methods

Multiplicative Method (MM) [ by Lee and Seung ]

• Based on Multiplicative update rules

• || V - WH || is monotonically non-increasing, and constant if and only if W and H are at a stationary point

• Version of Gradient Descent (GD) optimization scheme
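
As an illustration, the multiplicative update rules of Lee and Seung for the Frobenius-norm objective can be sketched in NumPy as below. The function name `nmf_multiplicative`, the random initialization, the fixed iteration count and the small constant `eps` (added to avoid division by zero) are assumptions of this sketch, not details taken from the slides.

```python
import numpy as np

def nmf_multiplicative(V, k, iterations=200, eps=1e-9, seed=0):
    """Approximate a nonnegative V (m x n) by W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))  # nonnegative random start (assumption)
    H = rng.random((k, n))
    errors = []
    for _ in range(iterations):
        # Elementwise multiplicative updates keep W and H nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))  # Frobenius norm of the residual
    return W, H, errors

# Example on a small random nonnegative matrix.
V_demo = np.random.default_rng(1).random((20, 12))
W_demo, H_demo, errors_demo = nmf_multiplicative(V_demo, k=3, iterations=50)
```

The recorded `errors` list makes it easy to check the non-increasing behaviour of || V - WH || stated above.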

Page 21: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF : Existing Methods

Sparse Encoding [ by Hoyer ]

• Based on study of neural networks

• Enforces statistical sparsity of H
  – Minimizes sum of non-zeros in H

Page 22: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF : Existing Methods

Sparse Encoding [ by Mu, Plemmons and Santago ]

• Similar to Hoyer’s method

• Enforces statistical sparsity of H using a regularization parameter
  – Minimizes number of non-zeros in H

Page 23: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

NMF : Proposed Algorithm

Hybrid Method:

• W approximated using Multiplicative Method

• H calculated using a Constrained Least Square (CLS) model as the metric
  – Penalizes the number of non-zeros
  – Similar to the method by Mu, Plemmons and Santago

• Called GD-CLS

Page 24: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

GD-CLS
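
The algorithm listing for this slide is missing from the transcript. The sketch below is one plausible reading of the hybrid method described on the previous slide: a multiplicative update for W, followed by a regularized least squares solve for H with negative entries set to zero. The Tikhonov-style penalty λ‖H‖², the column normalization of W, the function name `gd_cls` and all default values are assumptions of this sketch, not confirmed details of the original algorithm.

```python
import numpy as np

def gd_cls(V, k, lam=0.1, iterations=100, eps=1e-9, seed=0):
    """Hybrid NMF sketch: multiplicative update for W, constrained least squares for H."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))  # nonnegative random start (assumption)
    H = rng.random((k, n))
    for _ in range(iterations):
        # Step 1: multiplicative update of W, which keeps W nonnegative.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Step 2 (assumption): rescale the columns of W to unit length.
        W /= np.linalg.norm(W, axis=0, keepdims=True) + eps
        # Step 3: solve (W^T W + lam * I) H = W^T V for H, with lam as the
        # regularization parameter.
        H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ V)
        # Step 4: enforce nonnegativity by zeroing negative entries of H.
        H = np.maximum(H, 0.0)
    return W, H

# Example on a small random nonnegative matrix.
V_demo = np.random.default_rng(2).random((15, 10))
W_demo, H_demo = gd_cls(V_demo, k=3, lam=0.1, iterations=30)
```

Because H comes from one linear solve per iteration rather than an elementwise update, the regularization parameter λ directly controls how strongly large coefficients are penalized.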

Page 25: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Performance

Page 26: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Text Collections Used

• Two benchmark topic detection text collections:
  – Reuters : Collection of documents on assorted topics
  – TDT2 : Transcripts from news media

Page 27: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Text Collections Used

Page 28: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Accuracy Metric

• Defined by:

• di : Document number i

• δ(di) = 1 if the topic labels match

• δ(di) = 0 otherwise

k = 2, 4, 6, 8, 10, 15, 20

λ = 0.1, 0.01, 0.001
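
The defining formula itself is missing from the transcript. Assuming the accuracy is the fraction of documents for which the indicator defined above equals 1, that is, the number of matching topic labels divided by the number of documents, it can be computed as in this sketch. The function name `clustering_accuracy` and the example labels are hypothetical.

```python
def clustering_accuracy(assigned_labels, reference_labels):
    """Fraction of documents whose assigned topic label matches the reference label."""
    if len(assigned_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    if not reference_labels:
        raise ValueError("at least one document is required")
    # The indicator is 1 for a match and 0 otherwise; sum it over all documents.
    matches = sum(1 for assigned, reference in zip(assigned_labels, reference_labels)
                  if assigned == reference)
    return matches / len(reference_labels)

# Hypothetical labels for four documents.
assigned = ["earn", "interest", "cocoa", "earn"]
reference = ["earn", "earn", "cocoa", "earn"]
accuracy = clustering_accuracy(assigned, reference)
```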

Page 29: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

[Figures not reproduced: Results for Reuters; Results for TDT2]

Page 30: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Observations

Page 31: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Observations : AC

• AC decreases as k increases

• Nature of the collection affects AC
  – Reuters : earn, interest, cocoa
  – TDT2 : Asian economic crisis, Oprah lawsuit

Page 32: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Observations : λ parameter

• AC declines as λ increases (mostly effective for homogeneous text collections)

• CPU time declines as λ increases

Page 33: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Observations : Cluster size

• Imbalance in cluster sizes has an adverse effect

Page 34: A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Conclusion & Future Work

GD-CLS can be used to effectively cluster text data. Further development involves:

• Smart updating

• Use in Bioinformatics

• Developing a user interface

• Converting to C++