
Transcript of: Adaptive Spectral Clustering of Text Information

Page 1:

Students: Amit Sharabi Irena Gorlik

Supervisors: Dr. Orly Yahalom Prof. Zeev Volkovich

Date: January 2012

SE Dept.

Adaptive Spectral Clustering of Text Information

Page 2:

In our project we implement clustering of text information using the spectral clustering approach.

The project is based on the NGW (Ng, Jordan, and Weiss) spectral clustering algorithm.

By using Brent’s method to modify the NGW algorithm, we obtain better clustering results.

Concise description

Page 3:

Project description
• Intro
• Short reminder
• Bag of words method
• Brent’s method
• Clustering quality function

NGW Algorithm – in detail

The Software Engineering (SE) design
• Activity diagram
• GUI

Results and conclusions
• Algorithm execution results
• Conclusions based on the results

In this presentation

Page 4:

As there is no exact definition of what a good clustering is, a clustering algorithm that yields good results for one dataset may not fit another one.

Our purpose is to calibrate the NGW algorithm by fine-tuning it, that is, by finding the optimal scaling factor σ.

Intro

Page 5:

Our dataset will be derived from the text using the Bag of words method.

In this method, a text is represented as an unordered collection of words, disregarding grammar and even word order.

Short Reminder - Bag of words method
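As a small illustration of this representation, here is a minimal Python sketch; the simple lowercase, letters-only tokenization and the use of raw term counts are assumptions made for the example, not the project's exact preprocessing.

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Represent each text as a vector of word counts over a shared vocabulary."""
    # Tokenize: lowercase and keep alphabetic tokens only (an assumed, simple scheme).
    token_lists = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocabulary = sorted({w for tokens in token_lists for w in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words(["the cat sat on the mat", "the dog sat"])
print(vocab)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vecs)   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```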

Page 6:

Brent’s method is a numerical optimization algorithm which combines inverse parabolic interpolation with golden-section search.

Brent’s method
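Brent's method for one-dimensional minimization is available, for example, in SciPy's minimize_scalar; the quadratic test function below is purely an illustration of how the optimizer is called, using a tolerance like the one reported later in the executions.

```python
from scipy.optimize import minimize_scalar

# Example objective: a smooth one-dimensional function with its minimum at x = 2.
f = lambda x: (x - 2.0) ** 2 + 1.0

# Brent's method needs a bracket (a, b, c) with f(b) < f(a) and f(b) < f(c).
result = minimize_scalar(f, bracket=(0.0, 1.0, 4.0), method='brent',
                         options={'xtol': 1e-6})
print(result.x, result.fun)  # approximately 2.0 and 1.0
```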

Page 7:

The clustering quality function is

f(\sigma) = \sum_{(s_i, s_j) \in S} \Bigl(1 - \exp\bigl(-\|s_i - s_j\|^2 / 2\sigma^2\bigr)\Bigr) - \log \sum_{(s_i, s_j) \in D} \Bigl(1 - \exp\bigl(-\|s_i - s_j\|^2 / 2\sigma^2\bigr)\Bigr)

where S is the set of pairs of points lying in the same cluster and D is the set of pairs of points lying in different clusters.

We attempt to find the minimum value of this function using Brent’s method.

Clustering quality function
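A minimal sketch of this objective and of its minimization over σ, assuming the data points are rows of a NumPy array and the cluster labels are an integer vector; the helper names (clustering_quality, optimal_sigma), the search bounds, and the use of SciPy's bounded Brent-based minimizer are illustrative assumptions, not the project's code.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import pdist, squareform

def clustering_quality(sigma, points, labels):
    """f(sigma): same-cluster term minus the log of the different-cluster term."""
    d2 = squareform(pdist(points, 'sqeuclidean'))   # squared pairwise distances
    w = 1.0 - np.exp(-d2 / (2.0 * sigma ** 2))      # 1 - exp(-||s_i - s_j||^2 / 2 sigma^2)
    same = labels[:, None] == labels[None, :]        # True for pairs in the same cluster
    iu = np.triu_indices(len(labels), k=1)           # count each unordered pair once
    s_term = w[iu][same[iu]].sum()                   # sum over S
    d_term = w[iu][~same[iu]].sum()                  # sum over D
    return s_term - np.log(d_term)

def optimal_sigma(points, labels, tol=1e-6):
    """Minimize f(sigma); 'bounded' is SciPy's Brent-based method on an interval (bounds assumed)."""
    res = minimize_scalar(clustering_quality, args=(points, labels),
                          bounds=(1e-3, 10.0), method='bounded',
                          options={'xatol': tol})
    return res.x
```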

Page 8:

The scaling factor σ controls how rapidly the affinity A_{ij} falls off with the distance between s_i and s_j.

The NGW algorithm does not specify a value for σ.

The scaling factor

Page 9:

This is a spectral clustering algorithm that clusters points using eigenvectors of matrices derived from the data.

The algorithm steps:

1. Given a set of points S = \{s_1, \ldots, s_n\}.
2. Form the affinity matrix A with A_{ij} = \exp(-\|s_i - s_j\|^2 / 2\sigma^2) for i \neq j and A_{ii} = 0.
3. Define the diagonal matrix D with D_{ii} = \sum_j A_{ij}, and form the matrix L = D^{-1/2} A D^{-1/2}.
4. Stack the k largest eigenvectors of L to form the columns of the new matrix X.
5. Form the matrix Y by renormalizing each of X’s rows to unit length.
6. Cluster the rows of Y, treated as points in \mathbb{R}^k, with k-means or PAM.
7. Assign s_i to cluster j iff row i of Y was assigned to cluster j.

NGW algorithm
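Putting the steps above together, a compact NumPy/scikit-learn sketch of the procedure could look as follows; the function name ngw_cluster and the choice of scikit-learn's KMeans for the final step are assumptions for illustration, not the project's implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def ngw_cluster(points, k, sigma):
    """Sketch of the NGW spectral clustering steps for an (n, d) array of points."""
    # Affinity matrix: A_ij = exp(-||s_i - s_j||^2 / 2 sigma^2), zero on the diagonal.
    A = np.exp(-squareform(pdist(points, 'sqeuclidean')) / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # L = D^{-1/2} A D^{-1/2} with D_ii = sum_j A_ij.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # X: the k eigenvectors of L with the largest eigenvalues (L is symmetric).
    eigvals, eigvecs = np.linalg.eigh(L)
    X = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Y: each row of X renormalized to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Cluster the rows of Y as points in R^k (k-means here; PAM is an alternative).
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```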

Page 10:

Flowchart: Data → Brent’s method for f(σ) → Affinity matrix → Spectrum (eigenvectors) → Clustering

Page 11:

Preliminary SE documents

Page 12:

Activity diagram

Page 13:

GUI – Main Menu Tab

Page 14:

GUI – Cluster Results Tab

Page 15:

Results and Conclusions

Page 16:

Execution 1:

Input Books:
• The New Testament
• Harry Potter and the Goblet of Fire
• Harry Potter and the Sorcerer’s Stone
• The Dead Zone – Stephen King

Each book divided into: 10 parts

Input parameters:
Clustering Algorithm: K-Means
Number of Clusters: 3
Number of Runs: 1
Brent’s Method Tolerance: 1e-6

Results:
Cramér’s V: 0.707
Scaling Factor: 0.1
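Cramér's V is used here to measure agreement between the produced clusters and the true book labels (1 means perfect agreement). A minimal sketch of how it can be computed from a contingency table is shown below; the helper name cramers_v and the use of SciPy's chi-square statistic are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(true_labels, cluster_labels):
    """Cramér's V between two label assignments, computed from their contingency table."""
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    table = np.array([[np.sum((true_labels == c) & (cluster_labels == g))
                       for g in clusters] for c in classes])
    chi2 = chi2_contingency(table, correction=False)[0]   # chi-square statistic
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))
```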

Page 17:

Execution 2:

Input Books:
• The New Testament
• Harry Potter and the Goblet of Fire
• Harry Potter and the Sorcerer’s Stone
• The Dead Zone – Stephen King

Each book divided into: 10 parts

Input parameters:
Clustering Algorithm: K-Means
Number of Clusters: 3
Number of Runs: 10
Brent’s Method Tolerance: 1e-6

Results:
Cramér’s V: 1
Scaling Factor: 0.182

Page 18:

Execution 3:

Input Books:
• Harry Potter and the Chamber of Secrets
• Harry Potter and the Deathly Hallows

Each book divided into: 10 parts

Input parameters:
Clustering Algorithm: PAM
Number of Clusters: 2
Number of Runs: 10
Brent’s Method Tolerance: 1e-6

Results:
Cramér’s V: 0.905
Scaling Factor: 0.113

Page 19:

Execution 4:

Input Books:
• Harry Potter and the Goblet of Fire
• Harry Potter and the Chamber of Secrets
• The Stars, Like Dust – Isaac Asimov
• The Dead Zone – Stephen King

Each book divided into: 10 parts

Input parameters:
Clustering Algorithm: PAM
Number of Clusters: 4
Number of Runs: 10
Brent’s Method Tolerance: 1e-6

Results:
Cramér’s V: 0.825
Scaling Factor: 0.136

Page 20:

Conclusion: Our experiments produced results that led us to the following conclusions:

• This algorithm shows quite accurate and reliable results.

• The proposed learning process significantly influences and improves the quality of results.

• During the experiments we found that dividing each book into ten parts works best.

Page 21:

Conclusion (Cont.):

• The algorithm can distinguish between different books written by the same author.

• To avoid numerical instabilities we need to normalize the dataset.

• The PAM algorithm is more stable, whereas the K-means algorithm sometimes converges to a local optimum.

Page 22:

Final Remark

In this project we present an algorithm capable of producing optimal clustering results for a given dataset by optimizing the algorithm’s scaling factor.

The uniqueness of our approach is the ability to adapt itself to the dataset.

Page 23:

Questions?