
Latent Semantic Indexing


Term-document matrices are very large, though most cells are “zeros”

But the number of topics that people talk about is small (in some sense)

Clothes, movies, politics, … Each topic can be represented as a cluster of (semantically) related terms, e.g. clothes = golf, jacket, shoe, …

Can we represent the term-document space by a lower dimensional “latent” space (latent space=set of topics)?


Searching with latent topics


Given a collection of documents, LSI learns clusters of frequently co-occurring terms (e.g. information retrieval, ranking and web).

If you query with ranking, information retrieval, LSI "automatically" extends the search to documents that also (or even ONLY) include web.

[Figure: a toy document base of 20 documents, each shown with its content words (Golf, Car, Topgear, Petrol, GTI, Clarkson, Polo, Tiger, Woods, ...). Selection based on 'Golf': with the standard VSM, the 4 documents containing the word Golf are selected.]

Querying with 'Golf' over the same 20-document collection, the most relevant words associated with Golf in the selected documents are Car, Topgear and Petrol. Their wf·idf weights (co-occurrences with Golf × N/df, with N = 20) are:

Car: 2 × (20/3) ≈ 13
Topgear: 2 × (20/3) ≈ 13
Petrol: 3 × (20/16) ≈ 4
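For concreteness, here is a minimal Python sketch of the weighting used in this table (score = co-occurrences with the query word × N/df, with N = 20); the dictionary simply hard-codes the counts read off the slide:

    # Weighting sketch: co-occurrences with 'Golf' times N/df (N = 20 documents).
    N = 20
    candidates = {
        # word: (co-occurrences with 'Golf', document frequency in the collection)
        "Car":     (2, 3),
        "Topgear": (2, 3),
        "Petrol":  (3, 16),
    }
    for word, (cooc, df) in candidates.items():
        print(f"{word:8s} {cooc} * ({N}/{df}) = {cooc * (N / df):.0f}")
    # Car      2 * (20/3) = 13
    # Topgear  2 * (20/3) = 13
    # Petrol   3 * (20/16) = 4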

[Same figure: selection based on 'Golf', with the weights of the table above.] Since words are weighted also with respect to the idf, Car and Topgear turn out to be more related to Golf than Petrol (weight of the co-occurrences).

[Same figure: selection now based on the semantic domain of Golf, with the weights Car ≈ 13, Topgear ≈ 13, Petrol ≈ 4.] We now search with all the words in the "semantic domain" of Golf. The list of retrieved documents now includes one new document (a car document that does not contain the word Golf) and no longer includes a previously retrieved document.

[Same figure: selection based on the semantic domain of Golf (Car ≈ 13, Topgear ≈ 13, Petrol ≈ 4); the ranks of the selected documents are 26, 17, 17, 0 and 30.] The co-occurrence-based ranking improves the performance. Note that one of the most relevant documents does NOT include the word Golf, and a document with a "spurious" sense of Golf (the one about Tiger Woods) disappears.
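A rough Python sketch of this expanded search (the document contents and the weight assumed for Golf itself are illustrative, not the exact 20-document collection of the slides): each document is scored by summing the weights of the semantic-domain words it contains.

    # Hypothetical weights: Golf's own weight is an assumption; the others come from the table above.
    domain_weights = {"golf": 20.0, "car": 13.0, "topgear": 13.0, "petrol": 4.0}

    docs = {
        "d1": {"golf", "car", "topgear", "petrol", "gti"},
        "d2": {"car", "topgear", "petrol", "gti", "polo"},   # no 'golf', still ranked high
        "d3": {"golf", "tiger", "woods", "belfry", "tee"},   # 'golf' only in the sport sense
    }

    for name, words in docs.items():
        score = sum(w for term, w in domain_weights.items() if term in words)
        print(name, score)
    # d2, which does not contain 'golf', now outranks d3, the spurious golf-as-sport document.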

Linear Algebra Background


The LSI method: how to detect “topics”

Eigenvalues & Eigenvectors

Eigenvectors (for a square m×m matrix S): Sv = λv, where λ is the eigenvalue and v the (right) eigenvector.

How many eigenvalues are there at most? The equation (S − λI)v = 0 only has a non-zero solution if det(S − λI) = 0; this is an m-th order equation in λ, which can have at most m distinct solutions (the roots of the characteristic polynomial), and they can be complex even though S is real.


Example of eigenvector/eigenvalues

Def: a vector v ∈ R^m, v ≠ 0, is an eigenvector of an m×m matrix S with corresponding eigenvalue λ if: Sv = λv

Example of computation (remember: the eigenvalues are the roots of the characteristic polynomial det(S − λI) = 0): 2 and 4 are the eigenvalues of the example matrix S.

Notice that we computed only the DIRECTION of the eigenvectors.
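As a quick numerical illustration (the slide's own example matrix is not reproduced in this transcript, so the 2×2 symmetric matrix below is borrowed from the decomposition example further down), numpy confirms the definition Sv = λv and returns eigenvectors only up to scale, i.e. only their direction:

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    eigvals, eigvecs = np.linalg.eig(S)   # columns of eigvecs are eigenvectors
    print(np.round(eigvals, 3))           # [3. 1.] (the order may vary)
    for lam, v in zip(eigvals, eigvecs.T):
        print(np.allclose(S @ v, lam * v))          # True: S v = lambda v
        print(round(float(np.linalg.norm(v)), 3))   # 1.0: only the DIRECTION is determined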

Matrix-vector multiplication


Therefore, any generic vector x can be viewed as a combination of the eigenvectors; in the slide's example x = 2v1 + 4v2 + 6v3, and 2, 4 and 6 are the coordinates of x in the orthonormal space O.

The eigenvectors define an orthonormal space O; two eigenvectors associated with different eigenvalues of a symmetric matrix are orthogonal.

Notice that there is not "the" set of eigenvectors for a matrix, but rather an orthonormal basis (an eigenspace).

[Figure: example of two bidimensional orthonormal spaces.]

Difference between orthonormal and orthogonal?

Orthogonal means that the dot product is zero. Orthonormal means that the dot product is zero and each vector has norm equal to 1.

If two or more vectors are orthonormal they are also orthogonal, but the converse is not true.
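A two-line numpy check of this distinction, using the same assumed symmetric matrix as above: the eigenvectors returned by eigh are orthonormal, i.e. pairwise dot products are zero and each has norm 1.

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    _, V = np.linalg.eigh(S)      # columns are eigenvectors of the symmetric matrix
    print(np.round(V.T @ V, 6))   # identity matrix: orthogonal AND unit norm -> orthonormal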

Matrix vector multiplication

Thus a matrix-vector multiplication such as Sx (with S as above and x the generic vector of the previous slide) can be rewritten in terms of the eigenvalues/eigenvectors of S:

Sx = S(2v1 + 4v2 + 6v3) = 2Sv1 + 4Sv2 + 6Sv3 = 2λ1v1 + 4λ2v2 + 6λ3v3

Even though x is an arbitrary vector, the action of S on x (the transformation) is determined by the eigenvalues/eigenvectors. Why?
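A small numpy sketch of this point, with an assumed symmetric 3×3 matrix (not from the slides): expressing x in the eigenvector basis and scaling each coordinate by its eigenvalue reproduces Sx exactly.

    import numpy as np

    S = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0]])       # assumed example matrix
    x = np.array([2.0, 4.0, 6.0])

    lam, V = np.linalg.eigh(S)            # columns of V: orthonormal eigenvectors
    coords = V.T @ x                      # coordinates of x in the eigenvector basis
    via_eigen = V @ (lam * coords)        # sum_i lambda_i * coord_i * v_i

    print(np.allclose(S @ x, via_eigen))  # True: the action of S on x is fixed by (lambda_i, v_i)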

Matrix vector multiplication

Matrix multiplication by a vector = a linear transformation of the initial vector.

Geometric explanation

Multiplying a matrix by a vector has two effects on the vector: rotation (the coordinates of the vector change) and scaling (the length changes). The maximum compression and rotation depend on the matrix's eigenvalues λi (s1, s2 and s3 in the picture); x is a generic vector, the vi are the eigenvectors.

Geometric explanation

In the distortion, the maximum effect is produced by the biggest eigenvalues (s1 and s2 in the picture).

The eigenvalues describe the distortion operated by the matrix on the original vector.

Geometric explanation

In this linear transformation, the image of "La Gioconda" is modified, but the central vertical axis is the same. The blue vector has changed direction, the red one has not: hence the red vector is an eigenvector of the linear transformation. Furthermore, the red vector has been neither lengthened, shortened nor inverted, therefore its eigenvalue is 1 (the eigenvalue is the factor by which the transformation scales vectors along the eigenvector's direction):

Sv = v (λ = 1)

Eigen/diagonal Decomposition

Let S be a square m×m matrix with m orthogonal eigenvectors.

Theorem: there exists an eigen decomposition S = UΛU^{-1}, with Λ diagonal (cf. the matrix diagonalization theorem).

The columns of U are the eigenvectors of S; the diagonal elements of Λ are the eigenvalues of S. The decomposition is unique for distinct eigenvalues.

Diagonal decomposition: why/how

Let U have the eigenvectors as columns: U = [v1 ... vm].

Then SU can be written

SU = S[v1 ... vm] = [λ1v1 ... λmvm] = [v1 ... vm] diag(λ1, ..., λm) = UΛ

Thus SU = UΛ, or U^{-1}SU = Λ, and S = UΛU^{-1}.

Example

From (S − λ2 I)v2 = 0 we get v21 = 0 and v22 any real value; from (S − λ1 I)v1 = 0 we get v11 = −2·v12.

Diagonal decomposition – example 2

Recall S =
2 1
1 2
with λ1 = 1, λ2 = 3.

The eigenvectors (1, −1) and (1, 1) form

U =
 1  1
-1  1

Inverting, we have

U^{-1} =
1/2  -1/2
1/2   1/2

Then, S = UΛU^{-1} =

 1  1     1  0     1/2  -1/2
-1  1  x  0  3  x  1/2   1/2

Recall that UU^{-1} = I.
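The decomposition can be verified numerically; a short numpy check of exactly the matrices above:

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    U = np.array([[1.0, 1.0],
                  [-1.0, 1.0]])             # eigenvectors (1,-1) and (1,1) as columns
    Lam = np.diag([1.0, 3.0])               # eigenvalues 1 and 3

    U_inv = np.linalg.inv(U)
    print(U_inv)                            # [[ 0.5 -0.5]
                                            #  [ 0.5  0.5]]
    print(np.allclose(U @ Lam @ U_inv, S))  # True: S = U Lambda U^{-1}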

So what?

What do these matrices have to do with IR? Recall the M×N term-document matrices… But everything so far needs square matrices, so you need to be patient and learn one last notion.

Singular Value Decomposition for non-square matrices

For an M×N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

A = UΣV^T

where U is M×M, Σ is M×N and V is N×N.

The columns of U are orthogonal eigenvectors of AA^T (the left singular vectors); the columns of V are orthogonal eigenvectors of A^TA (the right singular vectors).

Σ = diag(σ1, ..., σr) holds the singular values. The eigenvalues λ1 ... λr of AA^T are the same as the eigenvalues of A^TA, and σi = √λi.
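A quick numpy check of this relation on an assumed small matrix (not from the slides): the squared singular values of A coincide with the non-zero eigenvalues of AA^T (and of A^TA).

    import numpy as np

    A = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0]])       # assumed 2x3 example

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]

    print(np.round(s ** 2, 6))            # [3. 1.]
    print(np.round(eig_AAt, 6))           # [3. 1.]  -> sigma_i = sqrt(lambda_i)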

An example


Singular Value Decomposition

[Figure: illustration of SVD dimensions and sparseness.]

Reduced SVD

If we retain only k singular values, and set the rest to 0, then we don't need the matrix parts shown in red in the figure.

Then Σ is k×k, U is M×k, V^T is k×N, and Ak is M×N. This is referred to as the reduced SVD.

It is the convenient (space-saving) and usual form for computational applications.

http://www.bluebit.gr/matrix-calculator/calculate.aspx
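A minimal numpy sketch of the reduced SVD, using the small 4×4 term-document matrix of the worked LSI example later in this transcript; only the top-k singular triplets are kept.

    import numpy as np

    A = np.array([[1.0, 1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(np.round(s, 3))                        # [2.    1.618 0.618 0.   ]

    k = 2
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]     # M x k, k, k x N
    Ak = Uk @ np.diag(sk) @ Vtk                  # rank-k approximation, still M x N
    print(np.round(Ak, 3))                       # the approximated term-doc matrix shown later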

Let's recap

Since the yellow part of Σ in the figure is zero, an exact representation of A is:

A = σ1·u1·v1^T + … + σr·ur·vr^T

But for some k < r, a good approximation is:

Ak = σ1·u1·v1^T + … + σk·uk·vk^T

(The figure illustrates both the M < N and the M > N cases.)

Example of rank approximation

A* = Uk Σk Vk^T =

0.981 0.000 0.000 0.000 1.985
0.000 0.000 3.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000
0.000 4.000 0.000 0.000 0.000

Approximation error

How good (bad) is this approximation? It's the best possible, measured by the Frobenius norm of the error:

min_{X: rank(X)=k} ‖A − X‖_F = ‖A − Ak‖_F = sqrt(σ_{k+1}^2 + … + σ_r^2)

where the σi are ordered such that σi ≥ σi+1. This suggests why the Frobenius error drops as k increases.
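This can be checked numerically; a small sketch with an assumed random matrix, comparing the Frobenius error of the rank-k truncation with the discarded singular values:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((6, 5))                       # assumed random example matrix

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    err = np.linalg.norm(A - Ak, "fro")
    print(round(float(err), 6))
    print(round(float(np.sqrt(np.sum(s[k:] ** 2))), 6))   # same value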

SVD Low-rank approximation

Whereas the term-doc matrix A may have M = 50,000 and N = 10 million (and rank close to 50,000), we can construct an approximation A100 with rank 100!!! Of all rank-100 matrices, it would have the lowest Frobenius error.

C. Eckart, G. Young, "The approximation of one matrix by another of lower rank", Psychometrika, 1, 211-218, 1936.

Images give a better intuition

[Figure: the same photograph reconstructed from its rank-K SVD approximation, for K = 10, K = 20, K = 30, K = 100 and K = 322 (the original image).]

We save space!! But this is only one issue.
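A sketch of this intuition in numpy (the image here is an assumed random array; in practice you would load a real grayscale picture): the rank-k truncation stores k·(M+N+1) numbers instead of M·N, and the reconstruction error shrinks as k grows.

    import numpy as np

    def rank_k_approximation(img, k):
        U, s, Vt = np.linalg.svd(img, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    img = np.random.rand(322, 322)               # stand-in for a real 322x322 grayscale image
    for k in (10, 20, 30, 100):
        approx = rank_k_approximation(img, k)
        stored = k * (img.shape[0] + img.shape[1] + 1)    # numbers kept, vs 322*322 originally
        print(k, stored, round(float(np.linalg.norm(img - approx)), 2))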

But what about texts and IR? (finally!!)

Latent Semantic Indexing via the SVD

If A is a term/document matrix, then AA^T and A^TA are the matrices of term and document co-occurrences, respectively.

Example 2

[Figure: the term-document matrix A.]

Co-occurrences of terms in documents

For L = AA^T: Lij is the number of documents in which wi and wj co-occur, i.e. Lij = Σk Aik·Ajk (word i in doc k, word j in doc k).

Similarly, (A^TA)ij is the number of words co-occurring in docs di and dj (or vice versa if A is a document-term matrix).

Term co-occurrences example

L_trees,graph = (0 0 0 0 0 1 1 1 0) · (0 0 0 0 0 0 1 1 1)^T = 2

A numerical example

A (terms w1, w2, w3 × documents d1, d2, d3):

     d1 d2 d3
w1    1  0  1
w2    1  0  1
w3    0  1  0

AA^T =

1 0 1     1 1 0     2 2 0
1 0 1  x  0 0 1  =  2 2 0
0 1 0     1 1 0     0 0 1

Since wij = wji, the result

w11 w12 w13
w12 w22 w23
w13 w23 w33

is symmetrical.
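The same computation in a couple of numpy lines, to confirm the co-occurrence counts:

    import numpy as np

    A = np.array([[1, 0, 1],    # w1 occurs in d1 and d3
                  [1, 0, 1],    # w2 occurs in d1 and d3
                  [0, 1, 0]])   # w3 occurs in d2

    print(A @ A.T)              # term-term co-occurrences:
    # [[2 2 0]
    #  [2 2 0]
    #  [0 0 1]]
    print(A.T @ A)              # document-document co-occurrences (number of shared words)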

LSI: the intuition

[Figure: document vectors plotted in the term space t1, t2, t3.]

Projecting A in the term space: the green, yellow and red vectors are documents.

The black vectors are the unit eigenvectors of A: they represent the main "directions" of the document vectors.

The blue segments give the intuition of the eigenvalues of A: the bigger values are those for which the projection of all the vectors on the direction of the corresponding eigenvector is larger.

LSI intuition

[Same figure.] We now project our document vectors on the reference orthonormal system represented by the 3 black vectors.

LSI intuition

…and we remove the dimension(s) with the lower eigenvalues.

Now the two axes represent a combination of words, i.e. a latent semantic space.

Example

A (terms t1..t4 × documents d1..d4):

      d1     d2     d3     d4
t1   1.000  1.000  0.000  0.000
t2   1.000  0.000  0.000  0.000
t3   0.000  0.000  1.000  1.000
t4   0.000  0.000  1.000  1.000

A = U Σ V^T, with

U =
 0.000 -0.851 -0.526  0.000
 0.000 -0.526  0.851  0.000
-0.707  0.000  0.000 -0.707
-0.707  0.000  0.000  0.707

Σ =
 2.000  0.000  0.000  0.000
 0.000  1.618  0.000  0.000
 0.000  0.000  0.618  0.000
 0.000  0.000  0.000  0.000

V^T =
 0.000  0.000 -0.707 -0.707
-0.851 -0.526  0.000  0.000
 0.526 -0.851  0.000  0.000
 0.000  0.000 -0.707  0.707

We project terms and docs on two dimensions, v1 and v2 (the principal eigenvectors). We obtain two latent semantic coordinates: s1: (t1, t2) and s2: (t3, t4).

Approximated new term-doc matrix:

 1.172  0.724  0.000  0.000
 0.724  0.448  0.000  0.000
 0.000  0.000  1.000  1.000
 0.000  0.000  1.000  1.000

Even if t2 does not occur in d2, now if we query with t2 the system will also return d2!!

Co-occurrence space

In principle, the space of document or term co-occurrences is much (much!) larger than the original space of terms!!

But with SVD we consider only the most relevant ones, through rank reduction.

LSI: what are the steps

From the term-doc matrix A, compute the approximation Ak with SVD.

Project docs, terms and queries into a space of k << r dimensions (the k "surviving" eigenvectors) and compute cosine similarity as usual. These dimensions are not the original axes, but those defined by the orthonormal space of the reduced matrix Ak.

Projecting terms, documents and queries in the LS space

If A = UΣV^T we also have that:

V = A^T U Σ^{-1}

so a document d (a column of A) is projected as d^T U Σ^{-1}, a query q as q^T U Σ^{-1}, and a term t (a row of A) as t V Σ^{-1}.

After the rank-k approximation A ≅ Ak = Uk Σk Vk^T:

dk ≅ d^T Uk Σk^{-1}
qk ≅ q^T Uk Σk^{-1}

where the components of qk (i = 1, 2, …, k << r) are the new coordinates of q in the orthonormal space of Ak. Then

sim(q, d) = sim(q^T Uk Σk^{-1}, d^T Uk Σk^{-1})
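Putting the formulas together, a minimal sketch of query projection and similarity in the LSI space, using the 4×4 term-document matrix of the earlier example (the query, containing only t2, is an assumption for illustration):

    import numpy as np

    A = np.array([[1.0, 1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, Sk_inv = U[:, :k], np.diag(1.0 / s[:k])

    docs_k = A.T @ Uk @ Sk_inv            # rows: documents projected in the k-dim LSI space
    q = np.array([0.0, 1.0, 0.0, 0.0])    # query containing only t2 (an assumed query)
    q_k = q @ Uk @ Sk_inv                 # q_k = q^T U_k Sigma_k^{-1}

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print([round(cos(q_k, d), 3) for d in docs_k])
    # d2 now gets a non-zero similarity even though it does not contain t2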

Consider a term-doc matrix M×N (M = 11, N = 3) and a query q [shown in the figure as the matrix A and the query vector].

1. Compute the SVD: A = UΣV^T

2. Obtain a low-rank approximation (k = 2): Ak = Uk Σk Vk^T

3a. Compute doc/query similarity. For N documents, Ak has N columns, each representing the coordinates of a document di projected in the k LSI dimensions. A query is treated like a document, and is projected into the LSI space.

3c. Compute the query vector: qk = q^T Uk Σk^{-1}. q is projected into the 2-dimensional LSI space!

[Figure: documents and queries projected in the LSI space; q/d similarity.]

[Figure: an overview of a semantic network of terms based on the top 100 most significant latent semantic dimensions (Zhu & Chen).]

Another LSI Example

[Figure: a document-term matrix A^T (documents d1-d3, terms t1-t4), the term co-occurrence matrix AA^T, and the rank-k reconstruction Ak = Uk Σk Vk^T ≈ A.]

Now it is as if t3 belonged to d1!