Transcript of "The Mathematics of Information Retrieval," presented 11/21/2005 by Jeremy Chapman, Grant Gelven, and Ben Lakin.

Page 1: The Mathematics of Information Retrieval

11/21/2005

Presented by Jeremy Chapman, Grant Gelven, and Ben Lakin

Page 2: Acknowledgments

This presentation is based on the following paper:

“Matrices, Vector Spaces, and Information Retrieval,” by Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup.

Page 3: Indexing of Scientific Works

Indexing is primarily done using the title, author list, abstract, key word list, and subject classification.

These are created in large part to allow the works to be found in a search of scientific documents.

The use of automated information retrieval (IR) has improved consistency and speed.

Page 4: Vector Space Model for IR

The basic mechanism for this model is the encoding of a document as a vector.

All documents’ vectors are stored in a single matrix.

Latent Semantic Indexing (LSI) replaces the original matrix with a matrix of smaller rank, while maintaining similar information, by use of rank reduction.

Page 5: Creating the Database Matrix

Each document is defined as a column of the matrix (d is the number of documents).

Each term is defined as a row (t is the number of terms).

This gives us a t x d matrix.

The document vectors span the content of the database.

Page 6: Simple Example

Let the six terms be defined as follows:

T1: bak(e, ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y, ies)
T6: pie

The following are the d = 5 documents:

D1: How to Bake Bread Without Recipes
D2: The Classical Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes

Thus the document matrix becomes:

A = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}

Page 7: The Matrix A After Normalization

Thus after the normalization of the columns of A we get the following:

A = \begin{pmatrix}
.5774 & 0 & 0 & .4082 & 0 \\
.5774 & 0 & 1 & .4082 & .7071 \\
.5774 & 0 & 0 & .4082 & 0 \\
0 & 0 & 0 & .4082 & 0 \\
0 & 1 & 0 & .4082 & .7071 \\
0 & 0 & 0 & .4082 & 0
\end{pmatrix}
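As an aside (not part of the original slides), here is a minimal NumPy sketch of how one might build this matrix and normalize its columns; the variable names are my own:

```python
import numpy as np

# Raw term-document incidence matrix from slide 6:
# rows = terms T1..T6, columns = documents D1..D5.
A = np.array([
    [1, 0, 0, 1, 0],   # T1: bak(e, ing)
    [1, 0, 1, 1, 1],   # T2: recipes
    [1, 0, 0, 1, 0],   # T3: bread
    [0, 0, 0, 1, 0],   # T4: cake
    [0, 1, 0, 1, 1],   # T5: pastr(y, ies)
    [0, 0, 0, 1, 0],   # T6: pie
], dtype=float)

# Normalize each column (document vector) to Euclidean length 1.
A_normalized = A / np.linalg.norm(A, axis=0)

print(np.round(A_normalized, 4))  # matches the matrix on this slide
```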

Page 8: Making a Query

Next we will use the document matrix to ease our search for related documents.

Referring to our example, we will make the following query: baking bread.

We will now format the query using the term definitions given before:

q = (1\ 0\ 1\ 0\ 0\ 0)^T

Page 9: Matching the Document to the Query

Matching the documents to a given query is typically done by using the cosine of the angle between the query and document vectors.

The cosine is given as follows:

\cos(\theta_j) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}

where a_j is the j-th document column of A.
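Continuing the earlier sketch (again, not from the original slides), the matching step might look like:

```python
# Query for "baking bread": terms T1 (bake) and T3 (bread).
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)

# cos(theta_j) = a_j^T q / (||a_j||_2 ||q||_2) for each document column a_j.
cosines = (A_normalized.T @ q) / (
    np.linalg.norm(A_normalized, axis=0) * np.linalg.norm(q)
)
print(np.round(cosines, 4))  # [0.8165, 0, 0, 0.5774, 0]

# Keep documents whose cosine exceeds the 0.5 cutoff: D1 and D4.
matches = np.where(cosines >= 0.5)[0] + 1
print(matches)  # documents 1 and 4
```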

Page 10: A Query

By using the cosine formula we would get:

\cos(\theta_1) = .8165,\quad \cos(\theta_2) = 0,\quad \cos(\theta_3) = 0,\quad \cos(\theta_4) = .5774,\quad \cos(\theta_5) = 0

We will set our lower limit on the cosine at .5. Thus by conducting the query "baking bread" we get the following two articles:

D1: How to Bake Bread Without Recipes
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes

Page 11: Singular Value Decomposition

The Singular Value Decomposition (SVD) is used to reduce the rank of the matrix, while also giving a good approximation of the information stored in it.

The decomposition is written in the following manner:

A = U \Sigma V^T

where U spans the column space of A, \Sigma is the matrix with the singular values of A along its main diagonal, and V spans the row space of A. U and V are also orthogonal.
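A minimal sketch of computing this decomposition, continuing with A_normalized from the earlier sketch; note that NumPy's np.linalg.svd returns the singular values as a vector and returns V already transposed:

```python
# Full SVD of the normalized matrix: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A_normalized)

print(np.round(s, 4))  # singular values: [1.6950, 1.1158, 0.8403, 0.4195, 0]
```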

Page 12: SVD Continued

• Unlike the QR factorization, the SVD provides us with a lower-rank representation of both the column and row spaces.

• We know A_k is the best rank-k approximation to A by the Eckart-Young theorem, which states:

\|A - A_k\|_F = \min_{\mathrm{rank}(X) \le k} \|A - X\|_F

• Thus the rank-k approximation of A is given as follows:

A_k = U_k \Sigma_k V_k^T

where U_k is the first k columns of U, \Sigma_k is the k x k matrix whose diagonal is the set of decreasing singular values \sigma_1, \dots, \sigma_k, and V_k^T is the k x d matrix whose rows are the first k rows of V.
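A sketch of building A_k from the truncated factors; rank_k_approximation is a hypothetical helper name of my own, not something from the paper:

```python
def rank_k_approximation(U, s, Vt, k):
    """Return A_k = U_k Sigma_k V_k^T, the best rank-k approximation."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A3 = rank_k_approximation(U, s, Vt, 3)
# By the Eckart-Young theorem, ||A - A_3||_F equals the 4th singular value.
print(np.linalg.norm(A_normalized - A3, ord='fro'), s[3])  # both ~0.4195
```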

Page 13: SVD Factorization

U = \begin{pmatrix}
.2670 & -.2567 & .5308 & -.2847 & -.7071 & 0 \\
.7479 & -.3981 & -.5249 & .0816 & 0 & 0 \\
.2670 & -.2567 & .5308 & -.2847 & .7071 & 0 \\
.1182 & -.0127 & .2774 & .6397 & 0 & -.7071 \\
.5198 & .8423 & .0838 & -.1158 & 0 & 0 \\
.1182 & -.0127 & .2774 & .6394 & 0 & .7071
\end{pmatrix}

\Sigma = \begin{pmatrix}
1.6950 & 0 & 0 & 0 & 0 \\
0 & 1.1158 & 0 & 0 & 0 \\
0 & 0 & 0.8403 & 0 & 0 \\
0 & 0 & 0 & 0.4195 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}

V = \begin{pmatrix}
.4366 & -.4717 & .3688 & -.6715 & 0 \\
.3067 & .7549 & .0998 & -.2760 & -.5000 \\
.4412 & -.3568 & -.6247 & .1945 & -.5000 \\
.4909 & -.0346 & .5711 & .6571 & 0 \\
.5288 & .2815 & -.3712 & -.0577 & .7071
\end{pmatrix}

Page 14: Interpretation

From the matrix \Sigma given on the slide before we notice that there are only four non-zero singular values, so A is a rank-4 matrix.

Also, the pattern of zeros in \Sigma tells us that the first four columns of U give us a basis for the column space of A.

Page 15: Analysis of the Rank-k Approximations

Using the following formula we can calculate the error from the original matrix to its rank-k approximation:

\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}

where r is the rank of A; dividing this quantity by \|A\|_F gives the relative error.

Thus only a 19% relative error is incurred in moving from the rank-4 matrix to its rank-3 approximation; however, a 42% relative error is incurred in moving from the rank-4 to the rank-2 approximation.

• As expected, these values are less than the corresponding errors for the rank-k approximations from the QR factorization.
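These percentages can be checked numerically; here I read "relative error" as the truncation error divided by \|A\|_F, which reproduces the 19% and 42% figures:

```python
# Relative error of the rank-k approximation:
# ||A - A_k||_F / ||A||_F = sqrt(s_{k+1}^2 + ... + s_r^2) / ||A||_F.
total = np.linalg.norm(A_normalized, ord='fro')
for k in (3, 2):
    rel_err = np.sqrt(np.sum(s[k:] ** 2)) / total
    print(k, round(rel_err, 2))  # roughly 0.19 for k=3, 0.42 for k=2
```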

Page 16: Using the SVD for Query Matching

• Using the following formula we can calculate the cosine of the angles between the query and the columns of our rank-k approximation of A:

\cos(\theta_j) = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \, \|U_k^T q\|_2}

where s_j denotes the j-th column of \Sigma_k V_k^T.

• Using the rank-3 approximation, we again return the first and fourth books with the cutoff of .5.
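A sketch of this matching step, assuming s_j is the j-th column of \Sigma_k V_k^T as defined above:

```python
k = 3
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

S = Sk @ Vtk      # column j of S is s_j = Sigma_k V_k^T e_j
Ukq = Uk.T @ q    # the query projected into the rank-k term space

cosines_k = (S.T @ Ukq) / (
    np.linalg.norm(S, axis=0) * np.linalg.norm(Ukq)
)
print(np.round(cosines_k, 4))  # per the slide, D1 and D4 clear the 0.5 cutoff
```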

Page 17: Term-Term Comparison

It is possible to modify the vector space model for comparing queries with documents in order to compare terms with terms. When this is added to a search engine, it can act as a tool to refine the results. First we run our search as before and retrieve a certain number of documents; in the following example five documents are retrieved. We will then create another document matrix with the remaining information, call it G.

Page 18: Another Example

T1: run(ning)
T2: bike
T3: endurance
T4: training
T5: band
T6: music
T7: fishes

D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
D2: Lake, River, and Sea-Run Fishes of Canada
D3: Middle Distance Running, Training and Competition
D4: Music Law: How to Run Your Band's Business
D5: Running: Learning, Training, Competing

The term-document matrix (rows = terms, columns = documents) becomes:

G = \begin{pmatrix}
.5000 & .7071 & .7071 & .5774 & .7071 \\
.5000 & 0 & 0 & 0 & 0 \\
.5000 & 0 & 0 & 0 & 0 \\
.5000 & 0 & .7071 & 0 & .7071 \\
0 & 0 & 0 & .5774 & 0 \\
0 & 0 & 0 & .5774 & 0 \\
0 & .7071 & 0 & 0 & 0
\end{pmatrix}

Page 19: Analysis of the Term-Term Comparison

For this we use the following formula:

\cos(\theta_{ij}) = \frac{(e_i^T G)(G^T e_j)}{\|G^T e_i\|_2 \, \|G^T e_j\|_2}

where e_i is the i-th standard basis vector, so G^T e_i is the i-th term row of G.
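A sketch of this term-term comparison over the matrix G from slide 18 (variable names mine):

```python
import numpy as np

# Term-document matrix G from slide 18: rows = T1..T7, columns = D1..D5.
G = np.array([
    [0.5000, 0.7071, 0.7071, 0.5774, 0.7071],  # T1: run(ning)
    [0.5000, 0,      0,      0,      0     ],  # T2: bike
    [0.5000, 0,      0,      0,      0     ],  # T3: endurance
    [0.5000, 0,      0.7071, 0,      0.7071],  # T4: training
    [0,      0,      0,      0.5774, 0     ],  # T5: band
    [0,      0,      0,      0.5774, 0     ],  # T6: music
    [0,      0.7071, 0,      0,      0     ],  # T7: fishes
])

# cos(theta_ij) = (e_i^T G)(G^T e_j) / (||G^T e_i|| ||G^T e_j||):
# the cosine between term rows i and j.
rows = G / np.linalg.norm(G, axis=1, keepdims=True)
term_cosines = rows @ rows.T
print(np.round(term_cosines, 4))
```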

Page 20: Clustering

• Clustering is the process by which terms are grouped if they are related, such as bike, endurance, and training.

• First the terms are split into groups which are related.

• The terms in each group are placed such that their vectors are almost parallel (see the sketch below).
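The slides do not specify a clustering algorithm, so the following is only an illustrative greedy grouping by pairwise cosine; the exact clusters depend on the threshold chosen and on whether the rank-k cosines are used:

```python
def cluster_terms(term_cosines, threshold=0.5):
    """Greedily group terms whose cosine with any member of an
    existing cluster exceeds the threshold (illustrative only)."""
    n = term_cosines.shape[0]
    clusters = []
    for i in range(n):
        for cluster in clusters:
            if any(term_cosines[i, j] > threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

print(cluster_terms(term_cosines))  # lists of term indices
```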

Page 21: Clusters

In this example the first cluster is running.

The second cluster is bike, endurance, and training.

The third is band and music.

And the fourth is fishes.

Page 22: Analyzing the Term-Term Comparison

We will again use the SVD rank-k approximation.

Thus the cosine of the angles becomes:

\cos(\theta_{ij}) = \frac{(e_i^T U_k \Sigma_k)(\Sigma_k U_k^T e_j)}{\|\Sigma_k U_k^T e_i\|_2 \, \|\Sigma_k U_k^T e_j\|_2}
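A sketch of this rank-k variant, with the rows of U_k \Sigma_k standing in for the term rows of G; the choice k = 4 here is arbitrary, for illustration:

```python
Ug, sg, Vtg = np.linalg.svd(G)
k = 4
B = Ug[:, :k] @ np.diag(sg[:k])  # row i of B plays the role of e_i^T U_k Sigma_k

rows_k = B / np.linalg.norm(B, axis=1, keepdims=True)
term_cosines_k = rows_k @ rows_k.T
print(np.round(term_cosines_k, 4))
```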

Page 23: Conclusion

Through the use of this model, many libraries and smaller collections can index their documents. However, as the next presentation will show, a different approach is used in large collections such as the internet.