Vector Space Model


Transcript of Vector Space Model

Page 1: Vector Space Model


Vector Space Model

Rong Jin

Page 2: Vector Space Model

Basic Issues in a Retrieval Model

How to represent text objects?
What similarity function should be used?
How to refine the query according to users' feedback?

Page 3: Vector Space Model

Basic Issues in IR

How to represent queries?
How to represent documents?
How to compute the similarity between documents and queries?
How to use users' feedback to enhance retrieval performance?

Page 4: Vector Space Model

IR: Formal Formulation

Vocabulary V = {w1, w2, …, wn} of a language
Query q = q1, …, qm, where qi ∈ V
Collection C = {d1, …, dk}
Document di = (di,1, …, di,mi), where di,j ∈ V
Set of relevant documents R(q) ⊆ C
  Generally unknown and user-dependent
  The query is a "hint" on which documents are in R(q)
Task: compute R'(q), an approximation of R(q)

Page 5: Vector Space Model

Computing R(q)

Strategy 1: Document selection
  Classification function f(d,q) ∈ {0,1}
    Outputs 1 for relevant, 0 for irrelevant
  R(q) is determined as the set {d ∈ C | f(d,q) = 1}
  The system must decide if a document is relevant or not ("absolute relevance")
  Example: Boolean retrieval

Page 6: Vector Space Model

Document Selection Approach

(Figure: relevant (+) and non-relevant (-) documents scattered in the collection; the classifier's region C(q) only approximates the true R(q).)

Page 7: Vector Space Model

Computing R(q)

Strategy 2: Document ranking
  Similarity function f(d,q)
    Outputs a similarity score between document d and query q
  Cutoff θ
    The minimum similarity for a document to be considered relevant to the query
  R(q) is determined as the set {d ∈ C | f(d,q) > θ}
  The system must decide whether one document is more likely to be relevant than another ("relative relevance")
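This strategy amounts to scoring and sorting. A minimal sketch in Python (the function names and the plug-in score function are illustrative, not from the slides):

```python
def rank_documents(query, docs, score, cutoff=0.0):
    """Return R'(q): documents whose similarity to the query exceeds a cutoff.

    docs:  mapping doc_id -> document representation
    score: any similarity function f(d, q) -> float
    """
    scored = [(doc_id, score(d, query)) for doc_id, d in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best first
    return [(doc_id, s) for doc_id, s in scored if s > cutoff]
```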

Page 8: Vector Space Model

Document Selection vs. Ranking

(Figure: the + and - documents again. Ranking with f(d,q) produces the ordered list
  d1 0.98 (+), d2 0.95 (+), d3 0.83 (-), d4 0.80 (+), d5 0.76 (-),
  d6 0.56 (-), d7 0.34 (-), d8 0.21 (+), d9 0.21 (-)
from which R'(q) is taken as a top segment and compared against the true R(q).)

Page 9: Vector Space Model

Document Selection vs. Ranking

(Figure: side-by-side comparison on the same collection. Document selection with a binary f(d,q) splits the documents into an accepted set (1) and a rejected set (0); document ranking with a real-valued f(d,q) yields the ordered list
  d1 0.98 (+), d2 0.95 (+), d3 0.83 (-), d4 0.80 (+), d5 0.76 (-),
  d6 0.56 (-), d7 0.34 (-), d8 0.21 (+), d9 0.21 (-)
and R'(q) is read off the top of the ranking, compared against the true R(q).)

Page 10: Vector Space Model

Ranking is often preferred

A similarity function is more general than a classification function
A classifier is unlikely to be accurate
  Ambiguous information needs, short queries
  Relevance is a subjective concept
Absolute relevance vs. relative relevance

Page 11: Vector Space Model

Probability Ranking Principle

As stated by Cooper: ranking documents by probability of usefulness maximizes the utility of IR systems.

"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

Page 12: Vector Space Model

Vector Space Model

Any text object can be represented by a term vector
  Examples: documents, queries, sentences, …
  A query is viewed as a short document
Similarity is determined by the relationship between two vectors
  e.g., the cosine of the angle between the vectors, or the distance between vectors
The SMART system
  Developed at Cornell University, 1960-1999
  Still widely used

Page 13: Vector Space Model

Vector Space Model: Illustration

        Java   Starbucks   Microsoft
D1       1         1           0
D2       0         1           1
D3       1         0           1
D4       1         1           1
Query    1         0.1         1

Page 14: Vector Space Model

Vector Space Model: Illustration

(Figure: D1, D2, D3, D4, and the query plotted as vectors in the term space with axes Java, Microsoft, and Starbucks; how similar is each document to the query?)

Page 15: Vector Space Model

Vector Space Model: Similarity

Represent both documents and queries by word-histogram vectors
  n: the number of unique words
A query q = (q1, q2, …, qn)
  qi: occurrences of the i-th word in the query
A document dk = (dk,1, dk,2, …, dk,n)
  dk,i: occurrences of the i-th word in the document
Similarity of a query q to a document dk

(Figure: q and dk drawn as vectors with the angle between them.)

Page 16: Vector Space Model

Some Background in Linear Algebra

Dot product (scalar product):

  q · dk = q1*dk,1 + q2*dk,2 + … + qn*dk,n

Example: measure the similarity by the dot product

  q = [1, 2, 5], dk = [1, 3, 4]:  q · dk = 1*1 + 2*3 + 5*4 = 27
  q = [1, 2, 5], dk = [4, 1, 0]:  q · dk = 1*4 + 2*1 + 5*0 = 6
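A quick check of these two examples (a minimal NumPy sketch):

```python
import numpy as np

q = np.array([1, 2, 5])
d1 = np.array([1, 3, 4])
d2 = np.array([4, 1, 0])

print(q @ d1)  # 1*1 + 2*3 + 5*4 = 27
print(q @ d2)  # 1*4 + 2*1 + 5*0 = 6
```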

Page 17: Vector Space Model

Some Background in Linear Algebra

Length of a vector:

  |q|  = sqrt(q1^2 + q2^2 + … + qn^2)
  |dk| = sqrt(dk,1^2 + dk,2^2 + … + dk,n^2)

Angle between two vectors:

  cos(θ(q, dk)) = (q1*dk,1 + q2*dk,2 + … + qn*dk,n) / (|q| * |dk|)
                = (q · dk) / (|q| * |dk|)

(Figure: q and dk with the angle θ(q, dk) between them.)

Page 18: Vector Space Model

Some Background in Linear Algebra

Example: measure the similarity by the angle between vectors

  q = [1, 2, 5], dk = [1, 3, 4]:
    cos(θ(q, dk)) = (1*1 + 2*3 + 5*4) / (sqrt(1^2 + 2^2 + 5^2) * sqrt(1^2 + 3^2 + 4^2)) ≈ 0.97

  q = [1, 2, 5], dk = [4, 1, 0]:
    cos(θ(q, dk)) = (1*4 + 2*1 + 5*0) / (sqrt(1^2 + 2^2 + 5^2) * sqrt(4^2 + 1^2 + 0^2)) ≈ 0.27

(Figure: q and dk with the angle θ(q, dk) between them.)

Page 19: Vector Space Model

Vector Space Model: Similarity

Given
  A query q = (q1, q2, …, qn)
    qi: occurrences of the i-th word in the query
  A document dk = (dk,1, dk,2, …, dk,n)
    dk,i: occurrences of the i-th word in the document

Similarity of a query q to a document dk:

  sim(q, dk) = q · dk = q1*dk,1 + q2*dk,2 + … + qn*dk,n

  sim'(q, dk) = cos(θ(q, dk))
              = (q1*dk,1 + … + qn*dk,n) / (sqrt(q1^2 + … + qn^2) * sqrt(dk,1^2 + … + dk,n^2))
              = (q · dk) / (|q| * |dk|)

(Figure: q and dk with the angle θ(q, dk) between them.)

Page 20: Vector Space Model

Vector Space Model: Similarity

  q = [1, 2, 5], dk = [1, 3, 4]:  q · dk = 1*1 + 2*3 + 5*4 = 27
  q = [1, 2, 5], dk = [0, 0, 8]:  q · dk = 1*0 + 2*0 + 5*8 = 40

By the dot product, dk = [0, 0, 8] looks more similar to q than dk = [1, 3, 4] does.

Page 21: Vector Space Model

Vector Space Model: Similarity

  q = [1, 2, 5], dk = [1, 3, 4]:
    cos(θ(q, dk)) = (1*1 + 2*3 + 5*4) / (sqrt(1^2 + 2^2 + 5^2) * sqrt(1^2 + 3^2 + 4^2)) ≈ 0.97

  q = [1, 2, 5], dk = [0, 0, 8]:
    cos(θ(q, dk)) = (1*0 + 2*0 + 5*8) / (sqrt(1^2 + 2^2 + 5^2) * sqrt(0^2 + 0^2 + 8^2)) ≈ 0.913

Under the cosine similarity the order reverses: [1, 3, 4] is now the closer document, because the length of [0, 0, 8] no longer inflates its score.

Page 22: Vector Space Model

Term Weighting

wk,i: the importance of the i-th word for document dk

Why weighting?
  Some query terms carry more information than others

TF.IDF weighting
  TF (Term Frequency) = within-document frequency
  IDF (Inverse Document Frequency)
  TF normalization: avoid the bias toward long documents

Unweighted similarity:

  sim(q, dk) = q1*dk,1 + q2*dk,2 + … + qn*dk,n

Weighted similarity:

  sim(q, dk) = wk,1*q1*dk,1 + wk,2*q2*dk,2 + … + wk,n*qn*dk,n
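The weighted similarity is a one-line change over the plain dot product. A minimal sketch (names are mine):

```python
def weighted_sim(q, d, w):
    """sim(q, d) with a per-term importance weight w[i] for the document."""
    return sum(wi * qi * di for wi, qi, di in zip(w, q, d))
```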

Page 23: Vector Space Model

TF Weighting

A term is important if it occurs frequently in a document

Formulas:
  f(t,d): number of occurrences of term t in document d

Maximum frequency normalization:

  TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)

where MaxFreq(d) is the frequency of the most frequent term in d.
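A minimal sketch of this normalization, assuming raw counts are kept in a dict:

```python
def tf_maxfreq(counts):
    """Maximum frequency normalization: TF = 0.5 + 0.5 * f / MaxFreq."""
    max_freq = max(counts.values())
    return {t: 0.5 + 0.5 * f / max_freq for t, f in counts.items()}

print(tf_maxfreq({"java": 4, "starbucks": 1}))
# {'java': 1.0, 'starbucks': 0.625}
```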

Page 24: Vector Space Model

TF Weighting

A term is important if it occurs frequently in a document

Formulas:
  f(t,d): number of occurrences of term t in document d

"Okapi/BM25 TF":

  TF(t,d) = f(t,d) / (f(t,d) + k * (1 - b + b * doclen(d) / avg_doclen))

  doclen(d): the length of document d
  avg_doclen: average document length
  k, b: predefined constants
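The same formula in code, as a sketch (the k and b defaults below are common BM25 choices, not values given on the slide):

```python
def tf_bm25(f, doclen, avg_doclen, k=1.2, b=0.75):
    """Okapi/BM25-style TF: saturates with raw frequency and penalizes
    documents that are long relative to the average length."""
    return f / (f + k * (1 - b + b * doclen / avg_doclen))
```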

Page 25: Vector Space Model

TF Normalization

Why?
  Document length variation
  "Repeated occurrences" are less informative than the "first occurrence"

Two views of document length
  A doc is long because it uses more words
  A doc is long because it has more content

Generally penalize long docs, but avoid over-penalizing (pivoted normalization)

Page 26: Vector Space Model

TF Normalization

(Figure: normalized TF plotted against raw TF; the "pivoted normalization" curve rises quickly at first and then flattens.)

  TF(t,d) = f(t,d) / (f(t,d) + k * (1 - b + b * doclen(d) / avg_doclen))

Page 27: Vector Space Model

IDF Weighting

A term is discriminative if it occurs in only a few documents

Formula:

  IDF(t) = 1 + log(n/m)

  n: total number of documents
  m: number of documents containing term t (document frequency)

Can be interpreted as mutual information
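A direct transcription of the formula (the slide does not specify the log base; natural log is assumed here):

```python
import math

def idf(n_docs, doc_freq):
    """IDF(t) = 1 + log(n/m): terms in fewer documents get higher weight."""
    return 1 + math.log(n_docs / doc_freq)

print(idf(1000, 10))   # rare term, high IDF (~5.6)
print(idf(1000, 900))  # common term, low IDF (~1.1)
```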

Page 28: Vector Space Model

TF-IDF Weighting

The importance of a term t to a document d:

  weight(t,d) = TF(t,d) * IDF(t)

  Frequent in the document -> high TF -> high weight
  Rare in the collection -> high IDF -> high weight

  sim(q, dk) = wk,1*q1*dk,1 + wk,2*q2*dk,2 + … + wk,n*qn*dk,n
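Putting the pieces together, a toy end-to-end TF-IDF vectorizer (a sketch under the formulas above; all names are mine, and this is not the SMART system's exact weighting):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc,
    using max-frequency TF normalization and IDF(t) = 1 + log(n/m)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency m
    idf = {t: 1 + math.log(n / m) for t, m in df.items()}
    vectors = []
    for d in docs:
        counts = Counter(d)
        max_f = max(counts.values())
        vectors.append({t: (0.5 + 0.5 * f / max_f) * idf[t]
                        for t, f in counts.items()})
    return vectors
```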

Page 29: Vector Space Model

TF-IDF Weighting

The importance of a term t to a document d:

  weight(t,d) = TF(t,d) * IDF(t)

  Frequent in the document -> high TF -> high weight
  Rare in the collection -> high IDF -> high weight

  sim(q, dk) = wk,1*q1*dk,1 + wk,2*q2*dk,2 + … + wk,n*qn*dk,n

Here both qi and dk,i are binary values, i.e., the presence or absence of a word in the query and the document; the TF-IDF information is carried by the weights wk,i.

Page 30: Vector Space Model

Problems with the Vector Space Model

Still limited to word-based matching
  A document will never be retrieved if it does not contain any query word

How to modify the vector space model?

Page 31: Vector Space Model

Choice of Bases

(Figure: the query Q and documents D1 and D as vectors in the term space with axes Java, Microsoft, and Starbucks.)


Page 33: Vector Space Model

Choice of Bases

(Figure: the same space, now also showing a vector D'.)

Page 34: Vector Space Model

Choice of Bases

(Figure: the same space, now showing D' and Q' alongside D and Q.)

Page 35: Vector Space Model

Choice of Bases

(Figure: the same space with only D1, D', and Q' shown.)

Page 36: Vector Space Model

Choosing Bases for VSM

Modify the bases of the vector space
  Each basis is a concept: a group of words
  Every document is a vector in the concept space

      c1  c2  c3  c4  c5  m1  m2  m3  m4
A1     1   1   1   1   1   0   0   0   0
A2     0   0   0   0   0   1   1   1   1

Page 37: Vector Space Model

Choosing Bases for VSM

Modify the bases of the vector space
  Each basis is a concept: a group of words
  Every document is a mixture of concepts

      c1  c2  c3  c4  c5  m1  m2  m3  m4
A1     1   1   1   1   1   0   0   0   0
A2     0   0   0   0   0   1   1   1   1

Page 38: Vector Space Model

Choosing Bases for VSM

Modify the bases of the vector space
  Each basis is a concept: a group of words
  Every document is a mixture of concepts

How to define/select 'basic concepts'?
  In the VS model, each term is viewed as an independent concept

Page 39: Vector Space Model


Basic: Matrix Multiplication

Page 40: Vector Space Model


Basic: Matrix Multiplication

Page 41: Vector Space Model

Linear Algebra Basics: Eigen Analysis

Eigenvectors (for a square m×m matrix S):

  S v = λ v    (λ: eigenvalue, v: (right) eigenvector)

Example:

Page 42: Vector Space Model

Linear Algebra Basics: Eigen Analysis

  S = | 2  1 |
      | 1  2 |

  First eigenvalue:  λ1 = 3, eigenvector v1 = (1/√2, 1/√2)
  Second eigenvalue: λ2 = 1, eigenvector v2 = (1/√2, -1/√2)
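These values are easy to verify numerically (a NumPy sketch):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(S)
print(eigenvalues)   # [3. 1.] (order not guaranteed in general)
print(eigenvectors)  # columns are eigenvectors, ~(1/sqrt(2), +/-1/sqrt(2))
```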

Page 43: Vector Space Model

Linear Algebra Basics: Eigen Decomposition

  S = U * Λ * U^T

  | 2  1 |  =  | 1/√2   1/√2 |  | 3  0 |  | 1/√2   1/√2 |
  | 1  2 |     | 1/√2  -1/√2 |  | 0  1 |  | 1/√2  -1/√2 |

  with v1 = (1/√2, 1/√2), λ1 = 3 and v2 = (1/√2, -1/√2), λ2 = 1


Page 45: Vector Space Model

Linear Algebra Basics: Eigen Decomposition

  S = U * Λ * U^T

  | 2  1 |  =  | 1/√2   1/√2 |  | 3  0 |  | 1/√2   1/√2 |
  | 1  2 |     | 1/√2  -1/√2 |  | 0  1 |  | 1/√2  -1/√2 |

This is generally true for symmetric square matrices:
  Columns of U are eigenvectors of S
  Diagonal elements of Λ are eigenvalues of S

Page 46: Vector Space Model

Singular Value Decomposition

For an m×n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

  A = U Σ V^T

  U is m×m; its columns are the left singular vectors
  Σ is m×n, a diagonal matrix holding the singular values
  V is n×n; its columns are the right singular vectors
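NumPy exposes this factorization directly; a quick sketch on an arbitrary small matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(A)        # s holds the singular values
print(U.shape, s.shape, Vt.shape)  # (2, 2) (2,) (3, 3)

# Reconstruct A: pad s into the 2x3 diagonal matrix Sigma.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
print(np.allclose(A, U @ Sigma @ Vt))  # True
```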

Page 47: Vector Space Model

Singular Value Decomposition

Illustration of SVD dimensions and sparseness

(Figure: the shapes of U, Σ, and V^T.)


Page 50: Vector Space Model

Low Rank Approximation

Approximate the matrix using the largest singular values and the corresponding singular vectors

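Keeping only the m largest singular values gives the best rank-m approximation in the least-squares sense. A NumPy sketch:

```python
import numpy as np

def low_rank_approx(X, m):
    """Best rank-m approximation of X: keep the m largest singular
    values and the corresponding singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :m] * s[:m] @ Vt[:m, :]
```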

Page 53: Vector Space Model

Latent Semantic Indexing (LSI)

Computation: use singular value decomposition (SVD), keeping the first m largest singular values and singular vectors, where m is the number of concepts.

(Figure: in X = U Σ V^T, U gives the representation of the concepts in the term space and V gives the representation of the concepts in the document space.)
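A minimal LSI sketch along these lines (the function and variable names are mine):

```python
import numpy as np

def lsi(X, m):
    """Project a term-document matrix X onto m latent concepts."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_concepts = U[:, :m]                # concepts expressed over terms
    doc_concepts = s[:m, None] * Vt[:m, :]  # documents in concept space
    return term_concepts, doc_concepts
```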

Page 54: Vector Space Model


Finding “Good Concepts”

Page 55: Vector Space Model

SVD: Example, m = 2

  X ≈ U Σ V^T,  Σ = | 3.34   0   |
                    |  0    2.54 |

(The entries of U and V appear in the slide figure.)


Page 58: Vector Space Model

SVD: Example, m = 2

  X ≈ U Σ V^T,  Σ = | 3.34   0   |
                    |  0    2.54 |

(The slide figure highlights the two retained singular values, 3.34 and 2.54.)

Page 59: Vector Space Model

SVD: Orthogonality

  X ≈ U Σ V^T,  Σ = | 3.34   0   |
                    |  0    2.54 |

  The left singular vectors are orthogonal:  u1 · u2 = 0
  The right singular vectors are orthogonal: v1 · v2 = 0

Page 60: Vector Space Model

SVD: Properties

  X ≈ U Σ V^T,  Σ = | 3.34   0   |
                    |  0    2.54 |

rank(S): the maximum number of row or column vectors of a matrix S that are linearly independent

SVD produces the best low-rank approximation:
  X:  rank(X) = 9
  X': rank(X') = 2

Page 61: Vector Space Model

SVD: Visualization

(Figure: the term-document matrix X.)

Page 62: Vector Space Model

SVD: Visualization

SVD tries to preserve the Euclidean distances between document vectors.