Homework


Transcript of Homework

Page 1: Homework

Homework

• Define a loss function that compares two matrices (say mean square error)

• b = svd(bellcore)

• b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])

• b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3])

• More generally, for all possible r

– Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r])

• Compute the loss between bellcore and b.r as a function of r

• Plot the loss as a function of r
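One way the steps above might be approached is sketched below in R. The names mse and rank_r are illustrative, not part of the assignment, and a small random count matrix stands in for bellcore so that the block runs on its own. Note the guard for r = 1: with a single non-integer value, diag(b$d[1:1]) would build an identity matrix rather than a 1 x 1 diagonal.

```r
# Mean squared error between two matrices of the same shape.
mse <- function(a, b) {
  stopifnot(all(dim(a) == dim(b)))
  mean((a - b)^2)
}

# Rank-r approximation rebuilt from an svd() result.
# drop = FALSE keeps u and v as matrices when r = 1, and
# diag(d, nrow = r) keeps the middle factor r x r even for a single value.
rank_r <- function(s, r) {
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], nrow = r) %*% t(s$v[, 1:r, drop = FALSE])
}

# Toy stand-in for bellcore (hypothetical data, 6 terms x 4 documents).
set.seed(1)
x <- matrix(rpois(24, lambda = 1), nrow = 6, ncol = 4)

s <- svd(x)
ranks <- seq_along(s$d)
loss <- sapply(ranks, function(r) mse(x, rank_r(s, r)))

# Loss as a function of r.
plot(ranks, loss, type = "b", xlab = "r", ylab = "mean squared error")
```

Replacing x with bellcore should give the curve the homework asks for; the loss cannot increase with r, and it reaches zero (up to rounding) once all singular values are kept.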

Page 2: Homework

IR Models

• Keywords (and Boolean combinations thereof)

• Vector-Space ‘‘Model’’ (Salton, chap 10.1)

– Represent the query and the documents as V-dimensional vectors

– Sort vectors by sim(x, y) = cos(x, y) = \frac{\sum_i x_i \cdot y_i}{|x| \cdot |y|}

• Probabilistic Retrieval Model (Salton, chap 10.3)

– Sort documents by score(d) = \prod_{w \in d} \frac{Pr(w | rel)}{Pr(w | \overline{rel})}
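The cosine ranking of the vector-space model can be illustrated with a short R sketch. The function name cosine, the query, and the three documents over a four-term vocabulary are all made up for the example.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical query and documents over a 4-term vocabulary.
query <- c(1, 0, 1, 0)
docs <- rbind(d1 = c(1, 0, 1, 0),
              d2 = c(1, 1, 0, 0),
              d3 = c(0, 1, 0, 1))

# Score every document against the query and sort, best match first.
scores <- apply(docs, 1, cosine, y = query)
ranking <- sort(scores, decreasing = TRUE)
print(ranking)
```

Here d1 is identical to the query, d2 shares one of its two terms, and d3 shares none, so the ranking comes out as d1, d2, d3.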

Page 3: Homework

Information Retrieval and Web Search

Alternative IR models

Instructor: Rada Mihalcea

Some of the slides were adapted from a course taught at Cornell University by William Y. Arms

Page 4: Homework

Latent Semantic Indexing

Objective

Replace indexes that use sets of index terms by indexes that use concepts.

Approach

Map the term vector space into a lower dimensional space, using singular value decomposition.

Each dimension in the new space corresponds to a latent concept in the original data.

Page 5: Homework

Deficiencies with Conventional Automatic Indexing

Synonymy: Various words and phrases refer to the same concept (lowers recall).

Polysemy: Individual words have more than one meaning (lowers precision).

Independence: No significance is given to two terms that frequently appear together.

Latent semantic indexing addresses the first of these (synonymy) and the third (term dependence).

Page 6: Homework

Bellcore’s Example

http://en.wikipedia.org/wiki/Latent_semantic_analysis

 c1 Human machine interface for Lab ABC computer applications

 c2 A survey of user opinion of computer system response time

 c3 The EPS user interface management system

 c4 System and human system engineering testing of EPS

 c5 Relation of user-perceived response time to error measurement

m1 The generation of random, binary, unordered trees

m2 The intersection graph of paths in trees

m3 Graph minors IV: Widths of trees and well-quasi-ordering

m4 Graph minors: A survey

Page 7: Homework

Term by Document Matrix

Page 8: Homework

"bellcore" <- structure(.Data = c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),
  .Dim = c(12, 9),
  .Dimnames = list(
    c("human", "interface", "computer", "user", "system", "response",
      "time", "EPS", "survey", "trees", "graph", "minors"),
    c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")))

help(dump)
help(source)

Page 9: Homework

Query Expansion

Query:

Find documents relevant to human computer interaction

Simple Term Matching:

Matches c1, c2, and c4

Misses c3 and c5
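A small R sketch of simple term matching on this query follows. It copies only the "human" and "computer" rows of the bellcore matrix for documents c1 to c5 ("interaction" is not an index term); the variable names are illustrative.

```r
# Rows of the Bellcore term-by-document matrix for the two query terms that
# occur in the vocabulary, restricted to the titles c1..c5.
counts <- rbind(human    = c(c1 = 1, c2 = 0, c3 = 0, c4 = 1, c5 = 0),
                computer = c(c1 = 1, c2 = 1, c3 = 0, c4 = 0, c5 = 0))

# A document matches if it contains at least one query term.
hits <- colSums(counts) > 0
matched <- names(hits)[hits]
missed <- names(hits)[!hits]
print(matched)
print(missed)
```

Documents c3 and c5 share no term with the query even though they are about the same topic, which is the gap that query expansion and latent semantic indexing try to close.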

Page 10: Homework

Large Correlations

Page 11: Homework

Correlations: Too Large to Ignore

Page 12: Homework

How to compute correlations

round(100 * cor(bellcore))

     c1   c2   c3   c4   c5   m1   m2   m3   m4
c1  100  -19    0    0  -33  -17  -26  -33  -33
c2  -19  100    0    0   58  -30  -45  -58  -19
c3    0    0  100   47    0  -21  -32  -41  -41
c4    0    0   47  100  -31  -16  -24  -31  -31
c5  -33   58    0  -31  100  -17  -26  -33  -33
m1  -17  -30  -21  -16  -17  100   67   52  -17
m2  -26  -45  -32  -24  -26   67  100   77   26
m3  -33  -58  -41  -31  -33   52   77  100   56
m4  -33  -19  -41  -31  -33  -17   26   56  100

round(100 * cor(t(bellcore)))

              human interface  computer      user    system  response      time       EPS    survey     trees     graph    minors
human           100        36        36       -38        43       -29       -29        36       -29       -38       -38       -29
interface        36       100        36        19         4       -29       -29        36       -29       -38       -38       -29
computer         36        36       100        19         4        36        36       -29        36       -38       -38       -29
user            -38        19        19       100        23        76        76        19        19       -50       -50       -38
system           43         4         4        23       100         4         4        82         4       -46       -46       -35
response        -29       -29        36        76         4       100       100       -29        36       -38       -38       -29
time            -29       -29        36        76         4       100       100       -29        36       -38       -38       -29
EPS              36        36       -29        19        82       -29       -29       100       -29       -38       -38       -29
survey          -29       -29        36        19         4        36        36       -29       100       -38        19        36
trees           -38       -38       -38       -50       -46       -38       -38       -38       -38       100        50        19
graph           -38       -38       -38       -50       -46       -38       -38       -38        19        50       100        76
minors          -29       -29       -29       -38       -35       -29       -29       -29        36        19        76       100

Page 13: Homework

plot(hclust(as.dist(-cor(t(bellcore)))))

Page 14: Homework

plot(hclust(as.dist(-cor(bellcore))))

Page 15: Homework

Correcting for Large Correlations

Page 16: Homework

Thesaurus

Page 17: Homework

Term by Doc Matrix: Before & After Thesaurus

Page 18: Homework

Singular Value Decomposition (SVD)

X = U D V^T

Dimensions: X is t x d, U is t x m, D is m x m, V^T is m x d

• m is the rank of X, m ≤ min(t, d)

• D is diagonal

– D^2 holds the eigenvalues of X X^T and X^T X (sorted in descending order)

• U^T U = I and V^T V = I

– Columns of U are eigenvectors of X X^T

– Columns of V are eigenvectors of X^T X
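These properties can be checked numerically. The R sketch below uses a small hypothetical 5 x 3 matrix in place of a real term-by-document matrix. One caveat: R's svd() returns min(t, d) singular values and vectors rather than exactly rank-many, so for a rank-deficient X some entries of d would be (near) zero.

```r
# Hypothetical 5 x 3 matrix standing in for a term-by-document matrix X.
x <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 0,
              0, 1, 1,
              0, 0, 1), nrow = 5, byrow = TRUE)

s <- svd(x)          # s$u is 5 x 3, s$d has length 3, s$v is 3 x 3
k <- length(s$d)

# Columns of U and V are orthonormal: t(U) %*% U = I and t(V) %*% V = I.
utu <- crossprod(s$u)
vtv <- crossprod(s$v)

# The product U D t(V) rebuilds X.
x_rebuilt <- s$u %*% diag(s$d, nrow = k) %*% t(s$v)

# Squared singular values equal the eigenvalues of t(X) %*% X.
ev <- eigen(crossprod(x), symmetric = TRUE)$values

print(round(utu, 10))
print(all.equal(x_rebuilt, x))
print(all.equal(s$d^2, ev))
```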

Page 19: Homework


Page 20: Homework

Dimensionality Reduction

\hat{X} = U D V^T

Dimensions: \hat{X} is t x d, U is t x k, D is k x k, V^T is k x d

k is the number of latent concepts (typically 300 ~ 500)

Page 21: Homework

Dimension Reduction in R

b = svd(bellcore)
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
dimnames(b2) = dimnames(bellcore)
par(mfrow=c(2,2))
plot(hclust(as.dist(-cor(bellcore))))
plot(hclust(as.dist(-cor(t(bellcore)))))
plot(hclust(as.dist(-cor(b2))))
plot(hclust(as.dist(-cor(t(b2)))))

Page 22: Homework
Page 23: Homework

SVD

B B^T = U D^2 U^T

B^T B = V D^2 V^T

[Figure labels: Latent, Term, Doc]

Page 24: Homework

Dimension Reduction Block Structure

round(100*cor(bellcore))

     c1   c2   c3   c4   c5   m1   m2   m3   m4
c1  100  -19    0    0  -33  -17  -26  -33  -33
c2  -19  100    0    0   58  -30  -45  -58  -19
c3    0    0  100   47    0  -21  -32  -41  -41
c4    0    0   47  100  -31  -16  -24  -31  -31
c5  -33   58    0  -31  100  -17  -26  -33  -33
m1  -17  -30  -21  -16  -17  100   67   52  -17
m2  -26  -45  -32  -24  -26   67  100   77   26
m3  -33  -58  -41  -31  -33   52   77  100   56
m4  -33  -19  -41  -31  -33  -17   26   56  100

round(100*cor(b2))

     c1   c2   c3   c4   c5   m1   m2   m3   m4
c1  100   91  100  100   84  -86  -85  -85  -81
c2   91  100   91   88   99  -57  -56  -56  -50
c3  100   91  100  100   84  -86  -85  -85  -81
c4  100   88  100  100   81  -89  -88  -88  -84
c5   84   99   84   81  100  -44  -44  -43  -37
m1  -86  -57  -86  -89  -44  100  100  100  100
m2  -85  -56  -85  -88  -44  100  100  100  100
m3  -85  -56  -85  -88  -43  100  100  100  100
m4  -81  -50  -81  -84  -37  100  100  100  100

Page 25: Homework

Dimension Reduction Block Structure

round(100*cor(t(bellcore)))

              human interface  computer      user    system  response      time       EPS    survey     trees     graph    minors
human           100        36        36       -38        43       -29       -29        36       -29       -38       -38       -29
interface        36       100        36        19         4       -29       -29        36       -29       -38       -38       -29
computer         36        36       100        19         4        36        36       -29        36       -38       -38       -29
user            -38        19        19       100        23        76        76        19        19       -50       -50       -38
system           43         4         4        23       100         4         4        82         4       -46       -46       -35
response        -29       -29        36        76         4       100       100       -29        36       -38       -38       -29
time            -29       -29        36        76         4       100       100       -29        36       -38       -38       -29
EPS              36        36       -29        19        82       -29       -29       100       -29       -38       -38       -29
survey          -29       -29        36        19         4        36        36       -29       100       -38        19        36
trees           -38       -38       -38       -50       -46       -38       -38       -38       -38       100        50        19
graph           -38       -38       -38       -50       -46       -38       -38       -38        19        50       100        76
minors          -29       -29       -29       -38       -35       -29       -29       -29        36        19        76       100

round(100*cor(t(b2)))

              human interface  computer      user    system  response      time       EPS    survey     trees     graph    minors
human           100       100        93        94        99        82        82       100       -12       -85       -84       -83
interface       100       100        95        96       100        85        85       100        -7       -82       -80       -80
computer         93        95       100       100        96        98        98        93        26       -59       -57       -56
user             94        96       100       100        97        97        97        94        23       -62       -60       -59
system           99       100        96        97       100        88        88       100        -2       -79       -78       -77
response         82        85        98        97        88       100       100        83        46       -40       -38       -37
time             82        85        98        97        88       100       100        83        46       -40       -38       -37
EPS             100       100        93        94       100        83        83       100       -11       -84       -83       -82
survey          -12        -7        26        23        -2        46        46       -11       100        63        65        66
trees           -85       -82       -59       -62       -79       -40       -40       -84        63       100       100       100
graph           -84       -80       -57       -60       -78       -38       -38       -83        65       100       100       100
minors          -83       -80       -56       -59       -77       -37       -37       -82        66       100       100       100

Page 26: Homework

The term vector space

[Figure: axes t1, t2, t3; vectors d1, d2]

The space has as many dimensions as there are terms in the word list.

Page 27: Homework

Latent concept vector space

[Figure legend: • term, document, query; --- cosine > 0.9]
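How cosines in a latent concept space might be computed is sketched below in R. Everything here is an assumption made for illustration: the toy matrix (two topics, three documents each), the choice k = 2, the helper cosine_matrix, and the convention of taking document coordinates as the rows of V_k D_k.

```r
# Toy term-by-document matrix (hypothetical): two topics, three documents each.
x <- matrix(c(1, 1, 0, 0, 0, 0,
              1, 0, 1, 0, 0, 0,
              0, 1, 1, 0, 0, 0,
              0, 0, 0, 1, 1, 0,
              0, 0, 0, 1, 0, 1,
              0, 0, 0, 0, 1, 1), nrow = 6, byrow = TRUE,
            dimnames = list(paste0("t", 1:6), paste0("d", 1:6)))

k <- 2
s <- svd(x)

# Document coordinates in the k-dimensional latent space (rows of V_k D_k);
# scaling by the singular values is one common convention, assumed here.
doc_coords <- s$v[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
rownames(doc_coords) <- colnames(x)

# Cosine between every pair of rows of a matrix.
cosine_matrix <- function(m) {
  unit <- m / sqrt(rowSums(m^2))
  unit %*% t(unit)
}

latent_cos <- cosine_matrix(doc_coords)
raw_cos <- cosine_matrix(t(x))

# Pairs of documents with cosine above 0.9 in the latent space.
close_pairs <- which(latent_cos > 0.9 & upper.tri(latent_cos), arr.ind = TRUE)
print(round(raw_cos, 2))
print(round(latent_cos, 2))
```

In this toy case d1 and d2 share only one of their two terms, so their cosine in the term space is 0.5, while in the two-dimensional latent space documents on the same topic collapse onto the same direction.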