A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz.
A Clustering Method Based on Nonnegative Matrix Factorization
for Text Mining
Farial Shahnaz
Topics
• Introduction
• Algorithm
• Performance
• Observation
• Conclusion and Future Work
Introduction
Basic Concepts
• Text Mining : Detection of trends or patterns in text data
• Clustering : Grouping or classifying documents based on similarity of content
Clustering
• Manual vs. Automated
• Supervised vs. Unsupervised
• Hierarchical vs. Partitional
Clustering
• Objective: Automated Unsupervised Partitional Clustering of Text Data or Documents
• Method : Nonnegative Matrix Factorization or NMF
Vector Space Model of Text Data
• Documents represented as n-dimensional vectors
– n : terms in the dictionary
– vector component : importance of term
• Document collection represented as term-by-document matrix
Term-by-Document Matrix
• Terms in the dictionary, n : 9 (a, brown, dog, fox, jumped, lazy, over, quick, the)
• Document 1 : a quick brown fox
• Document 2 : jumped over the lazy dog
Term-by-Document Matrix
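The two example documents above can be turned into a term-by-document matrix with a few lines of code. This is an illustrative sketch (not from the presentation) that uses raw term counts as the importance weights; tf-idf weighting is another common choice.

```python
# Sketch: build the term-by-document matrix for the two example
# documents, with raw term counts as the vector components.
def term_by_document_matrix(documents):
    # Dictionary: all distinct terms, sorted alphabetically.
    terms = sorted({t for doc in documents for t in doc.split()})
    # One row per term, one column per document; entry = term count.
    matrix = [[doc.split().count(term) for doc in documents]
              for term in terms]
    return terms, matrix

docs = ["a quick brown fox", "jumped over the lazy dog"]
terms, V = term_by_document_matrix(docs)
print(terms)  # the 9 dictionary terms: a, brown, dog, fox, ...
for term, row in zip(terms, V):
    print(f"{term:>6}: {row}")
```

Every entry is a count, so the matrix is nonnegative by construction, which is exactly the property NMF preserves.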
Clustering Method : NMF
• Low rank approximation of large sparse matrices
• Preserves data nonnegativity
• Introduces the concept of parts-based representation (by Lee and Seung in Nature, 1999)
Other Methods
• Other rank reduction methods:
– Principal Component Analysis (PCA)
– Vector Quantization (VQ)
• Produce basis vectors with negative entries
• Additive and subtractive combinations of basis vectors yield the original document vectors
NMF
• Produces nonnegative basis vectors
• Additive combinations of basis vectors yield the original document vectors
Term-by-Document Matrix (all entries nonnegative)
NMF
• Basis vectors interpreted as semantic features or topics
• Documents clustered on the basis of shared features
NMF
• Demonstrated by Xu et al. (2003):
– Outperforms Singular Value Decomposition (SVD)
– Comparable to graph partitioning methods
Algorithm
NMF : Definition
Given
• S : Document collection
• Vmxn : term-by-document matrix
• m : terms in the dictionary
• n : Number of documents in S
NMF : Definition
NMF is defined as:
• Low rank approximation of Vmxn in terms of some metric
• Factor V into the product WH
– Wmxk : contains basis vectors
– Hkxn : contains linear combinations
– k : selected number of topics or basis vectors, k << min(m, n)
NMF : Common Approach
• Minimize the objective function:
f(W, H) = ½ || V − WH ||_F^2, subject to W ≥ 0, H ≥ 0
NMF : Existing Methods
Multiplicative Method (MM) [ by Lee and Seung ]
• Based on Multiplicative update rules
• || V − WH || is monotonically non-increasing, and constant iff W and H are at a stationary point
• Version of Gradient Descent (GD) optimization scheme
NMF : Existing Methods
Sparse Encoding [ by Hoyer ]
• Based on study of neural networks
• Enforces statistical sparsity of H
– Minimizes the sum of non-zeros in H
NMF : Existing Methods
Sparse Encoding [ by Mu, Plemmons and Santago ]
• Similar to Hoyer’s method
• Enforces statistical sparsity of H using a regularization parameter
– Minimizes the number of non-zeros in H
NMF : Proposed Algorithm
Hybrid Method:
• W approximated using the Multiplicative Method
• H calculated using a Constrained Least Squares (CLS) model as the metric
– Penalizes the number of non-zeros
– Similar to the method by Mu, Plemmons and Santago
• Called GD-CLS
GD-CLS
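A minimal sketch of the hybrid scheme, assuming the CLS step solves the ridge-regularized normal equations (WᵀW + λI)H = WᵀV and then clamps negative entries of H to zero, while W takes a multiplicative (gradient-descent) step. The parameter names and the clamping step are illustrative choices of this sketch, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch of the GD-CLS idea:
#   - H comes from the constrained least squares problem
#       min ||V - WH||_F^2 + lam * ||H||_F^2,  H >= 0,
#     solved via (W^T W + lam*I) H = W^T V, negatives clamped to zero;
#   - W is refined with the multiplicative update rule.
def gd_cls(V, k, lam=0.01, iters=100, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # CLS step for H: ridge-regularized least squares + projection
        G = W.T @ W + lam * np.eye(k)
        H = np.linalg.solve(G, W.T @ V)
        H[H < 0] = 0.0
        # Multiplicative step for W (stays nonnegative by construction)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Small random nonnegative "term-by-document" matrix for demonstration.
V = np.random.default_rng(1).random((6, 8))
W, H = gd_cls(V, k=3, lam=0.001)
```

The λ parameter trades reconstruction accuracy for sparsity of H: larger values shrink H's entries toward zero, which matches the observation later in the slides that AC declines as λ increases.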
Performance
Text Collections Used
• Two benchmark topic detection text collections:
– Reuters : collection of documents on assorted topics
– TDT2 : transcripts from news media
Text Collections Used
Accuracy Metric
• Defined by:
AC = ( ∑ δ(di) ) / n, summed over i = 1 … n
• di : document number i
• δ(di) = 1 if the topic labels match
• δ(di) = 0 otherwise
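The accuracy metric can be sketched in code. Mapping each cluster to the majority true label among its members is one common way to perform the label-matching step before counting δ(di); the paper's exact matching procedure may differ, so this is an illustrative sketch.

```python
from collections import Counter

# Sketch of AC = (1/n) * sum_i delta(d_i): delta(d_i) = 1 when
# document i's assigned topic label matches its true label.
# Each cluster id is first mapped to the most frequent true label
# among the documents assigned to it (an assumed matching step).
def accuracy(true_labels, cluster_ids):
    n = len(true_labels)
    majority = {}
    for c in set(cluster_ids):
        members = [t for t, ci in zip(true_labels, cluster_ids) if ci == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    matches = sum(1 for t, c in zip(true_labels, cluster_ids)
                  if majority[c] == t)
    return matches / n

# Toy example with topic labels like those in the Reuters collection.
true = ["earn", "earn", "cocoa", "cocoa", "interest"]
pred = [0, 0, 1, 1, 1]
print(accuracy(true, pred))  # 4 of 5 documents match -> 0.8
```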
• k = 2, 4, 6, 8, 10, 15, 20
• λ = 0.1, 0.01, 0.001
Results for Reuters
Results for TDT2
Observations
Observations : AC
• AC decreases as k increases
• Nature of the collection affects AC– Reuters : earn, interest, cocoa– TDT2 : Asian economic crisis, Oprah lawsuit
Observations : λ parameter
• AC declines as λ increases (most noticeable for homogeneous text collections)
• CPU time declines as λ increases
Observations : Cluster size
• Imbalance in cluster sizes has an adverse effect on AC
Conclusion & Future Work
GD-CLS can be used to cluster text data effectively. Further development involves:
• Smart updating
• Use in bioinformatics
• Developing a user interface
• Converting to C++