Searching the Web


Page 1: Searching the Web

Searching the Web

Topic-Based Models: Latent Semantic Index

Page 2: Searching the Web

Motivation

Q: For the query “car”, will a document containing the word “automobile” be returned as a result under the TF-IDF vector model?

Q: Is it desirable?

Q: What can we do?

Page 3: Searching the Web

Topic-Based Models

Index documents by “topics”, not by individual terms. Return a document if it shares a topic with the query: we can then return a document containing “automobile” for the query “car”. There are far fewer “topics” than “terms”, so a topic-based index can potentially be more compact than a term-based index.

Page 4: Searching the Web

Example (1)

Two topics: “Car”, “Movies”. Four terms: car, automobile, movie, theater.

Topic-term matrix:

Topic     car   automobile   movie   theater
“Car”     1     0.9          0       0
“Movie”   0     0            1       0.8

Document-topic matrix:

        “Car”   “Movie”
doc1    0       1
doc2    1       0
doc3    0.8     0.2

Page 5: Searching the Web

Example (2)

But what we actually have is the document-term matrix! How are the three matrices related?

        car   automobile   movie   theater
doc1    0     0            1       0.8
doc2    1     0.9          0       0
doc3    0.8   0.72         0.2     0.16

Page 6: Searching the Web

Linearity Assumption

A document is generated as a topic-weighted linear combination of topic-term vectors. This is a simplifying assumption on document generation.

doc1 = 0 · (1, 0.9, 0, 0) + 1 · (0, 0, 1, 0.8) = (0, 0, 1, 0.8)
doc3 = 0.8 · (1, 0.9, 0, 0) + 0.2 · (0, 0, 1, 0.8) = (0.8, 0.72, 0.2, 0.16)

Topic-term matrix:

Topic     car   automobile   movie   theater
“Car”     1     0.9          0       0
“Movie”   0     0            1       0.8

Document-term matrix:

        car   automobile   movie   theater
doc1    0     0            1       0.8
doc2    1     0.9          0       0
doc3    0.8   0.72         0.2     0.16

Document-topic matrix:

        “Car”   “Movie”
doc1    0       1
doc2    1       0
doc3    0.8     0.2
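A quick numpy check of the linearity assumption on the numbers above (a minimal sketch; variable names are mine):

```python
import numpy as np

# Topic-term matrix: one row per topic, one column per term.
topic_term = np.array([[1.0, 0.9, 0.0, 0.0],   # topic "Car"
                       [0.0, 0.0, 1.0, 0.8]])  # topic "Movie"

# Document-topic matrix: one row per document, one column per topic.
doc_topic = np.array([[0.0, 1.0],    # doc1
                      [1.0, 0.0],    # doc2
                      [0.8, 0.2]])   # doc3

# Under the linearity assumption, the doc-term matrix is their product.
print(doc_topic @ topic_term)
# [[0.   0.   1.   0.8 ]
#  [1.   0.9  0.   0.  ]
#  [0.8  0.72 0.2  0.16]]
```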

Page 7: Searching the Web

Topic-Based Index as Matrix Decomposition

$$
\underbrace{\begin{pmatrix} 0 & 0 & 1 & 0.8 \\ 1 & 0.9 & 0 & 0 \\ 0.8 & 0.72 & 0.2 & 0.16 \end{pmatrix}}_{\text{doc} \times \text{term}}
=
\underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0.8 & 0.2 \end{pmatrix}}_{\text{doc} \times \text{topic}}
\underbrace{\begin{pmatrix} 1 & 0.9 & 0 & 0 \\ 0 & 0 & 1 & 0.8 \end{pmatrix}}_{\text{topic} \times \text{term}}
$$

Page 8: Searching the Web

Topic-Based Index as Matrix Decomposition

Since #topics << #terms and #topics << #docs, decompose the (doc × term) matrix into two matrices of rank K (K: #topics). Of course, the decomposition will be approximate for real data.

(doc × term) = (doc × topic) × (topic × term)

Page 9: Searching the Web

Topic-Based Index as Rank-K Approximation

Q: How do we choose the two decomposed matrices? What is the “best” decomposition?

Latent Semantic Index (LSI): the decomposition that is “closest” to the original matrix is the one that we are looking for.

Singular Value Decomposition (SVD): a decomposition method from linear algebra that leads to the best rank-K approximation. We use SVD to find the two matrices that are “closest” to the original matrix. We will spend the next two hours learning about SVD and its meaning; a basic understanding of linear algebra will be very useful for both IR and data mining.

Page 10: Searching the Web

A Brief Review of Linear Algebra

Vector: a list of numbers. Addition, scalar multiplication, dot product; the dot product as a projection.

Q: (1, 0) vs (0, 1): are they the same vectors?
A: The choice of basis determines the “meaning” of the numbers.

Matrix: matrix multiplication; a matrix as a vector transformation.

Page 11: Searching the Web

Change of Coordinates (1)

Two coordinate systems: the standard basis $(1, 0), (0, 1)$ and the rotated basis $(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}), (-\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$.

Q: What are the coordinates of (2, 0) under the second coordinate system?
Q: What about (1, 1)?
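A minimal numpy sketch of these two questions (the matrix Q below has the new basis vectors as its rows, as the next slide explains):

```python
import numpy as np

s = 1 / np.sqrt(2)
# Rows of Q are the new basis vectors.
Q = np.array([[ s, s],
              [-s, s]])

print(Q @ np.array([2.0, 0.0]))  # [ 1.414 -1.414], i.e. (sqrt(2), -sqrt(2))
print(Q @ np.array([1.0, 1.0]))  # [ 1.414  0.   ], i.e. (sqrt(2), 0)
```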

Page 12: Searching the Web

Change of Coordinates (2)

In general, given new basis vectors $b_1, b_2, \ldots, b_n$, we get the new coordinates of a vector under the new basis by multiplying the original coordinates with the matrix whose rows are the new basis vectors:

$$Q = \begin{pmatrix} \text{--- } b_1 \text{ ---} \\ \text{--- } b_2 \text{ ---} \\ \vdots \\ \text{--- } b_n \text{ ---} \end{pmatrix}$$

Verify with the previous example.
Q: What does the above matrix look like? How can we identify a coordinate-change matrix?

Page 13: Searching the Web

Matrix and Change of Coordinates

The basis vectors $b_1, b_2, \ldots, b_n$ are orthonormal to each other.

Orthonormal matrix: $Q^T Q = Q Q^T = I$.

An orthonormal matrix can be interpreted as a change-of-coordinates transformation; the rows of the matrix Q are the new basis vectors.
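A quick numerical check of the orthonormality condition on the Q from the previous sketch:

```python
import numpy as np

s = 1 / np.sqrt(2)
Q = np.array([[ s, s],
              [-s, s]])

# Both products come out as the identity, so Q is orthonormal.
print(np.allclose(Q @ Q.T, np.eye(2)))  # True
print(np.allclose(Q.T @ Q, np.eye(2)))  # True
```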

Page 14: Searching the Web

Matrix and Linear Transformation

Linear transformation: $T(ax + by) = a\,T(x) + b\,T(y)$.

Every linear transformation can be represented as a matrix, by selecting appropriate basis vectors. The matrix form of a linear transformation can be obtained simply by learning how the basis vectors transform:

$$M = \begin{pmatrix} | & | & & | \\ T(b_1) & T(b_2) & \cdots & T(b_n) \\ | & | & & | \end{pmatrix}$$

Verify with a 120-degree rotation about (1, 1, 1).
Q: What transformations are possible for a linear transformation?
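As a concrete instance of this recipe (my example, not the slide's): building the matrix of a 90-degree rotation of the plane from the images of the standard basis vectors.

```python
import numpy as np

T_e1 = np.array([0.0, 1.0])   # a 90-degree rotation sends (1,0) to (0,1)
T_e2 = np.array([-1.0, 0.0])  # ... and (0,1) to (-1,0)

# M = [T(b1) T(b2)]: the images of the basis vectors become the columns.
M = np.column_stack([T_e1, T_e2])
print(M @ np.array([1.0, 1.0]))  # [-1.  1.]: (1,1) rotated by 90 degrees
```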

Page 15: Searching the Web

Linear Transformations that We Know

Rotation, stretching. Anything else?

Claim: any linear transformation is a stretching followed by a rotation. This is the “meaning” of singular value decomposition, an important result of linear algebra. Let us learn why this is the case.

Page 16: Searching the Web

Rotation

Q: What is the matrix form of a rotation? What property will it have?

Remember: the columns of the matrix are the images $T(b_1), \ldots, T(b_n)$ of the basis vectors.

Rotation matrix R <=> orthonormal matrix: $R^T R = I$, because the $T(b_i)$'s are unit basis vectors as well.

Orthonormal matrix, change of coordinates, rotation: all three are the same thing.

Page 17: Searching the Web

Stretching (1)

Q: What is the matrix form of stretching by 3 along the x, y, z coordinates in 3D?

Q: What is the matrix form of stretching by 3 along the x axis and by 2 along the y axis in 3D?

Q: Stretching matrix <=> diagonal matrix?

Page 18: Searching the Web

Stretching (2)

Q: What is the matrix form of stretching by 3 along $(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$ and by 2 along $(-\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$?

$$\underbrace{\begin{pmatrix} 5/2 & 1/2 \\ 1/2 & 5/2 \end{pmatrix}}_{T} = \underbrace{\begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q^T} \underbrace{\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}}_{T'} \underbrace{\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q}$$

Verify by transforming (1, 1) and (-1, 1). The decomposition $T = Q^T T' Q$ shows the matrix in a different coordinate system. Under the matrix form alone, the simplicity of the stretching transformation may not be obvious.

Q: What if we chose $(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$ and $(-\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$ as the basis?
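A numpy check of this decomposition and of the suggested verification with (1, 1) and (-1, 1):

```python
import numpy as np

s = 1 / np.sqrt(2)
Q = np.array([[ s, s],   # stretching axis with factor 3 (as a row)
              [-s, s]])  # stretching axis with factor 2
D = np.diag([3.0, 2.0])

T = Q.T @ D @ Q
print(T)                          # [[2.5 0.5]
                                  #  [0.5 2.5]]
print(T @ np.array([ 1.0, 1.0]))  # [3. 3.] : (1,1) stretched by 3
print(T @ np.array([-1.0, 1.0]))  # [-2. 2.]: (-1,1) stretched by 2
```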

Page 19: Searching the Web

Stretching (3)

Under a good choice of basis vectors, an orthogonal-stretching transformation can always be represented as a diagonal matrix.

Q: How can we tell whether a matrix corresponds to an orthogonal-stretching transformation?

Page 20: Searching the Web

Stretching – Orthogonal Stretching (1)

Remember that

$$\begin{pmatrix} 5/2 & 1/2 \\ 1/2 & 5/2 \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$

is orthogonal stretching along $(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$ and $(-\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$.

If a transformation is orthogonal stretching, we should always be able to represent it as $Q^T D Q$ for some Q, where Q shows the stretching axes.

Q: What is the matrix form of the transformation that stretches by 5 along (4/5, 3/5) and by 4 along (-3/5, 4/5)? (A sketch follows below.)
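One way to answer the question above, sketched in numpy: put the stretching axes into the rows of Q and form $Q^T D Q$.

```python
import numpy as np

Q = np.array([[ 4/5, 3/5],   # stretching axis with factor 5
              [-3/5, 4/5]])  # stretching axis with factor 4
D = np.diag([5.0, 4.0])

T = Q.T @ D @ Q
print(T)  # [[4.64 0.48]
          #  [0.48 4.36]] : symmetric, as the next slide predicts
```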

Page 21: Searching the Web

Stretching – Orthogonal Stretching (2)

Q: Given a matrix, how do we know whether it is orthogonal stretching?
A: When it can be decomposed to $T = Q^T D Q$.
A: Spectral Theorem: any symmetric matrix T can always be decomposed into $T = Q^T D Q$. Symmetric matrix <=> orthogonal stretching.

Q: How can we decompose T into $Q^T D Q$?
A: If T stretches along X, then $TX = \lambda X$ for some $\lambda$. X: eigenvector of T; $\lambda$: eigenvalue of T. Solve the equation for $\lambda$ and X.

Page 22: Searching the Web

Eigenvalues, Eigenvectors and Orthogonal Stretching

Eigenvector: stretching axis. Eigenvalue: stretching factor. All eigenvectors are orthogonal <=> orthogonal stretching <=> symmetric matrix (spectral theorem).

Example: $T = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$

Q: What transformation is this? (A sketch follows below.)
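A sketch of how to read the transformation off numerically, using numpy's symmetric eigensolver:

```python
import numpy as np

T = np.array([[3.0, 1.0],
              [1.0, 3.0]])

vals, vecs = np.linalg.eigh(T)  # eigh: eigendecomposition for symmetric matrices
print(vals)  # [2. 4.] : the stretching factors
print(vecs)  # columns are the orthogonal stretching axes,
             # here (+-1/sqrt(2), 1/sqrt(2)) up to sign
```

So this matrix stretches by 4 along (1, 1)/sqrt(2) and by 2 along (-1, 1)/sqrt(2).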

Page 23: Searching the Web

Singular Value Decomposition (SVD)

Any linear transformation T can be decomposed into T = R S (R: rotation, S: orthogonal stretching). This is one of the basic results of linear algebra.

In matrix form, any matrix T can be decomposed as

$$T = R\,S = (Q_2^T Q_1)(Q_1^T D\,Q_1) = Q_2^T D\,Q_1$$

The diagonal entries of D are the singular values.

Example:

$$T = \begin{pmatrix} 4/5 & -3/5 \\ 3/5 & 4/5 \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$

Q: What transformation is this? (A sketch follows below.)
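A numpy sketch of the example: compose the rotation and the stretching, then recover the singular values with the library SVD routine (signs of the recovered factors may differ from the slide's):

```python
import numpy as np

s = 1 / np.sqrt(2)
R  = np.array([[4/5, -3/5],
               [3/5,  4/5]])  # rotation
D  = np.diag([3.0, 2.0])      # stretching factors
Q1 = np.array([[ s, s],
               [-s, s]])      # stretching axes as rows

T = R @ D @ Q1
U, svals, Vt = np.linalg.svd(T)
print(svals)  # [3. 2.] : the singular values of T
```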

Page 24: Searching the Web

Singular Value Decomposition (2)

Q: For an (n × m) matrix T, what will be the dimensions of the three matrices after SVD?

Q: What is the meaning of a non-square diagonal matrix? The diagonal matrix is also responsible for projection (or dimension padding).

Page 25: Searching the Web

Singular Values vs Eigenvalues

$$T = Q_2^T D\,Q_1 \quad\Rightarrow\quad T^T T = (Q_2^T D\,Q_1)^T (Q_2^T D\,Q_1) = Q_1^T (D^T D)\,Q_1$$

Q: What is this transformation?
A: $Q_1$: eigenvectors of $T^T T$; D: square roots of the eigenvalues of $T^T T$.
Similarly, $Q_2$: eigenvectors of $T T^T$; D: square roots of the eigenvalues of $T T^T$.

SVD can be done by computing the eigenvalues and eigenvectors of $T^T T$ and $T T^T$.
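A small numerical illustration of this relationship (the matrix T here is an arbitrary example of mine, not from the slides):

```python
import numpy as np

T = np.array([[3.0, 1.0],
              [0.0, 2.0]])

svals = np.linalg.svd(T, compute_uv=False)  # descending singular values
eigvals = np.linalg.eigvalsh(T.T @ T)       # ascending eigenvalues of T^T T

print(svals)                   # e.g. [3.257... 1.842...]
print(np.sqrt(eigvals)[::-1])  # the same numbers: sqrt of eigenvalues of T^T T
```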

Page 26: Searching the Web

SVD as Matrix Approximation

Q: If we want to reduce the rank of

$$T = \begin{pmatrix} 3/5 & 4/5 & 0 \\ 4/5 & -3/5 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 100 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 0.1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 1/\sqrt{2} & 0 & -1/\sqrt{2} \\ 0 & 1 & 0 \end{pmatrix}$$

to 2, what will be a good choice?

The best rank-k approximation of any matrix T is to keep the first k entries of its SVD: here, keep the singular values 100 and 10 and drop 0.1. (A sketch follows below.)
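A minimal sketch of rank-k truncation with numpy:

```python
import numpy as np

def rank_k_approx(T, k):
    """Best rank-k approximation: keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# For the T above, rank_k_approx(T, 2) drops only the 0.1 singular value,
# so the approximation error is tiny compared to T itself.
```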

Page 27: Searching the Web

SVD Approximation

Example: a 1000 × 1000 matrix with values in 0…255 (upper-left corner shown):

62 60 58 57 58 57 55 53 55 54
61 60 58 57 57 57 55 53 55 54
61 59 58 57 57 56 55 54 55 55
59 59 58 57 57 56 55 54 56 55
58 58 58 57 56 55 55 55 56 55
57 58 58 57 56 55 55 56 56 55
56 57 58 57 55 54 55 56 56 56
56 57 58 57 55 54 55 56 56 56
59 58 57 56 55 56 56 57 59 57
58 58 57 57 56 56 56 56 58 57
57 57 57 57 57 57 56 56 57 56
56 57 57 58 58 57 56 55 56 56
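A minimal sketch of the experiment shown on the following slides, assuming numpy and Pillow and a hypothetical grayscale image file "photo.png":

```python
import numpy as np
from PIL import Image

# Load the image as a matrix of 0..255 grayscale values.
img = np.asarray(Image.open("photo.png").convert("L"), dtype=float)

U, s, Vt = np.linalg.svd(img, full_matrices=False)

# Write out the rank-1, rank-10 and rank-100 approximations.
for k in (1, 10, 100):
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    out = np.clip(approx, 0, 255).astype(np.uint8)
    Image.fromarray(out).save(f"rank{k}.png")
```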

Page 28: Searching the Web

Image of original matrix 1000x1000

Page 29: Searching the Web

SVD. Rank 1 approximation

Page 30: Searching the Web

SVD. Rank 10 approximation

Page 31: Searching the Web

SVD. Rank 100 approximation

Page 32: Searching the Web

Original vs Rank 100 approximation

Q: How many numbers do we keep for each?
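For reference, the counts the question asks about: the original matrix stores 1000 × 1000 = 1,000,000 numbers, while a rank-100 SVD keeps 100 × (1000 + 1000 + 1) = 200,100 numbers (two 1000 × 100 factors plus 100 singular values).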

Page 33: Searching the Web

Back to LSI

LSI: decompose the (doc × term) matrix into two matrices of rank K. Our goal is to find the “best” rank-K approximation. Apply SVD and keep the top-K singular values, meaning that we keep the first K columns of the first matrix and the first K rows of the third matrix after SVD.

(doc × term) = (doc × topic) × (topic × term)
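A minimal sketch of the whole pipeline on the toy doc-term matrix from Example (2), assuming numpy; folding the singular values into the doc-topic factor is one common convention:

```python
import numpy as np

doc_term = np.array([[0.0, 0.0,  1.0, 0.8 ],
                     [1.0, 0.9,  0.0, 0.0 ],
                     [0.8, 0.72, 0.2, 0.16]])

K = 2
U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
doc_topic  = U[:, :K] @ np.diag(s[:K])  # first K columns, scaled
topic_term = Vt[:K, :]                  # first K rows

# The toy matrix has rank 2, so the rank-2 LSI factors reproduce it exactly.
print(np.allclose(doc_topic @ topic_term, doc_term))  # True
```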

Page 34: Searching the Web

LSI and SVD

LSI: (doc × term) = (doc × topic) × (topic × term)

SVD: (doc × term) = three matrices; folding the diagonal matrix of singular values into either neighboring factor yields the two LSI matrices.

Page 35: Searching the Web

LSI and SVD

LSI summary: formulate the topic-based indexing problem as a rank-K matrix approximation problem, and use SVD to find the best rank-K approximation. When applied to real data, a 10-20% improvement has been reported. Using LSI was the road to fame for Excite in its early days.

Page 36: Searching the Web

Limitations of LSI

Q: Any problems with LSI?

Problems with LSI:
Scalability: SVD is known to be difficult to perform on large data.
Interpretability: the extracted document-topic matrix is impossible to interpret, and it is difficult to understand why we get good or bad results from LSI for some queries.

Q: Any way to develop more interpretable topic-based indexing? That is the topic of tomorrow’s lecture.

Page 37: Searching the Web

Summary

Topic-based indexing: the synonym and polysemy problem; index documents by topic, not by terms.

Latent Semantic Index (LSI): a document is a linear combination of its topic vector and the topic-term vectors; formulate the problem as a rank-K matrix approximation problem; use SVD to find the best approximation.

Basic linear algebra: linear transformations, matrices, stretching and rotation; orthogonal stretching, diagonal matrices, symmetric matrices, eigenvalues and eigenvectors; rotation, change of coordinates, and orthonormal matrices; SVD and its implication as a linear transformation.