Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang,...

25
Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University
  • date post

    15-Jan-2016
  • Category

    Documents

  • view

    220
  • download

    0

Transcript of Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang,...

Page 1: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Data Mining and Machine Learning Lab

Document Clustering via Matrix Representation

Xufei Wang,

Jiliang Tang and Huan LiuArizona State University

Page 2: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

News Article

• Lead Paragraph

• Explanations

• Additional Information

Page 3: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Research Paper

• Introduction• Related Work• Problem Statement• Solution• Experiment• Conclusion

Page 4: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Book

• Chapters

• References

• Appendix

Page 5: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Document Organization

• Not randomly organized

• Put relevant content together

• Logically independent segments

5

Page 6: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Matrix Space Model

• Represent a document as a matrix–Segment–Term

• Each segment is a vector of terms– Terms + frequency

6

Page 7: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Vector Space Model

• Oversimplify a document– Mixing topics– Word order is lost– Susceptible to noise

7

TdNddd wwwV ),,,( ,,2,1

Page 8: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

An Example

• The IEEE International Conference on Data Mining series (ICDM) has established itself as the world's premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications.

• Vancouver is a coastal seaport city on the mainland of British Columbia, Canada. It is the hub of Greater Vancouver, which, with over 2.3 million residents, is the third most populous metropolitan area in the country, and the most populous in Western Canada.

Page 9: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Matrix Space Model

• Segment 1

• Segment 2

conference data mining research international

3 3 3 2 2

vancouver populous canada metropolitan columbia

2 2 2 1 1

Page 10: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Vector Space Model

conference data mining research international canada vancouver

3 3 3 2 2 2 2

Page 11: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Pros of a Matrix Representation

• Interpretation: – Segments vs. topics

• Finer granularity for data management– Segments vs. document

• Multiple or Single class labels– Flexibility

11

Page 12: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Verify the Effectiveness via Clustering

• Information Retrieval– Indexing based on segments

• Classification– New approaches based on Matrix inputs

• Clustering– New approaches based on Matrix inputs

12

Page 13: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

A Graphical Interpretation

Page 14: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Step 1: Obtaining Segments

• Many approaches for segmentation– Terms– Sentences (Choi et al. 2000)– Paragraphs (Tagarelli et al. 2008)

• Determining the number of segments– Open research problem

14

Page 15: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Step 2: Latent Topic Extraction

• Non-negative Matrix Approximation (NMA)

• LMi represents the probability of a term belonging to a latent topic

• MiRT represents the probability of a segment belonging to a latent topic

15

n

iF

Tii

RRMMLL

RLMA

cii

r0

2

0:0:0:

2

21

1min

Page 16: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Step 3: Clustering

• Un-overlapping clustering

• Overlapping clustering

16

jiji

ii

k

dd

ccentroidd2)(min

ij

ijk

ccentroidd2)(min

Page 17: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Datasets

• 20newsgroup– 20 classes– 6,038 documents

• Reuters-21578– 26 clusters– 1,964 documents

• Classic– 3 clusters– 1,486 documents

17

Page 18: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Experimental Method

• Generate Datasets (by specifying k)

• Evaluate the accuracy

• Repeat

18

Page 19: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Number of Latent Topics

Page 20: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Number of Segments

Page 21: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Comparative Study

Page 22: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Conclusion

• Proposing a matrix representation for documents

• Significant improvements with MSM

• Information Retrieval, classification tasks

22

Page 23: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Questions

Page 24: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

24

Page 25: Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

25