Information retrieval, by Hadi Mohammadzadeh


Page 1: Information retrieval, by Hadi Mohammadzadeh

Hadi Mohammadzadeh Information Retrieval (IR) 50 Pages

By: Hadi Mohammadzadeh
Institute of Applied Information Processing, University of Ulm – 3 Nov. 2009

Seminar on Information Retrieval (IR)

Page 2: Information retrieval, by Hadi Mohammadzadeh

Information Retrieval Definition

• Information Retrieval (IR) is:

1. finding material (usually documents)
2. of an unstructured nature (usually text)
3. that satisfies an information need (query)
4. from within large collections (usually stored on computers).

Page 3: Information retrieval, by Hadi Mohammadzadeh

Basic assumptions of Information Retrieval

• Collection: Fixed set of documents

• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

Page 4: Information retrieval, by Hadi Mohammadzadeh

Search Methods for Finding Documents

Page 5: Information retrieval, by Hadi Mohammadzadeh

Searching Methods

• Grep method
• Term-document incidence matrix (binary retrieval)
• Inverted index
• Inverted index with skip pointers / skip lists
• Positional postings (for phrase queries)

Page 6: Information retrieval, by Hadi Mohammadzadeh

Term-document incidence

Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0

1 if play contains word, 0 otherwise
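A minimal sketch of how such an incidence matrix answers a Boolean query: each row is treated as a bit vector and the query is evaluated with bitwise operations. The query (Brutus AND Caesar AND NOT Calpurnia) and the helper name `plays_for` are illustrative assumptions, not part of the slides.

```python
# One bit per play, leftmost bit = first column of the matrix above.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "Antony":    0b110001,
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
    "Cleopatra": 0b100000,
    "mercy":     0b101111,
    "worser":    0b101110,
}

ALL_DOCS = (1 << len(PLAYS)) - 1  # mask that keeps NOT within 6 bits


def plays_for(bits):
    """Translate a result bit vector back into play titles (hypothetical helper)."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


# Illustrative query: Brutus AND Caesar AND NOT Calpurnia
not_calpurnia = ALL_DOCS & ~INCIDENCE["Calpurnia"]
result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia
print(plays_for(result))
```

The matrix approach is simple but does not scale: most entries are 0, which is the motivation for the inverted index on the following slides.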

Page 7: Information retrieval, by Hadi Mohammadzadeh

Inverted index

• For each term T, we must store a list of all documents that contain T.

• Do we use an array or a list for this?

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

What happens if the word Caesar is added to document 14?

Sec. 1.2

Page 8: Information retrieval, by Hadi Mohammadzadeh

Inverted index

• Linked lists generally preferred to arrays
– Dynamic space allocation
– Insertion of terms into documents easy
– Space overhead of pointers

Dictionary   Postings lists
Brutus     → 2 4 8 16 32 64 128
Calpurnia  → 1 2 3 5 8 13 21 34
Caesar     → 13 16

(Each docID entry in a postings list is a posting.)

Sec. 1.2
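A minimal sketch of a dictionary with sorted postings lists, assuming the postings in the slide figure read as Brutus → 2 4 8 …, Calpurnia → 1 2 3 5 …, Caesar → 13 16. Python lists stand in for the linked lists discussed above, and the function names are illustrative. It shows what "adding Caesar to document 14" means for a sorted list, plus the standard linear merge used for AND queries.

```python
import bisect

index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        # With an array this shifts every later entry; a linked list only relinks.
        postings.insert(pos, doc_id)


def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2)) for an AND query."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


add_posting(index, "Caesar", 14)
print(index["Caesar"])
print(intersect(index["Brutus"], index["Calpurnia"]))
```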

Page 9: Information retrieval, by Hadi Mohammadzadeh

Augment postings with skip pointers (at indexing time)

• Why?
• To skip postings that will not figure in the search results.
• Where do we place skip pointers?

2 4 8 41 48 64 128 (skip pointers to 41 and 128)
1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)

Page 10: Information retrieval, by Hadi Mohammadzadeh

Where do we place skips?

• Tradeoff:
– More skips → shorter skip spans ⇒ more likely to skip, but lots of comparisons to skip pointers.
– Fewer skips → fewer pointer comparisons, but then long skip spans ⇒ few successful skips.
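A minimal sketch of postings intersection with skip pointers. The placement rule used here (evenly spaced skips of about √L positions for a list of length L) is a common heuristic and an assumption of this sketch; the slides only describe the tradeoff. The names `build_skips` and `intersect_with_skips` are illustrative.

```python
import math


def build_skips(postings):
    """Map a position to the position its skip pointer leads to (about sqrt(L) apart)."""
    n = len(postings)
    step = int(math.sqrt(n))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips when they do not overshoot."""
    s1, s2 = build_skips(p1), build_skips(p2)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                # Everything between i and the skip target is smaller than p2[j].
                while i in s1 and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer


print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))
```

Skip pointers only help AND queries over fairly static indexes; every update would otherwise require rebuilding them.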

Page 11: Information retrieval, by Hadi Mohammadzadeh

Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
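A minimal sketch of how the positional postings above narrow down the candidates. In “to be or not to be” the word “be” is the 2nd and the 6th token, so a matching document must contain “be” at two positions exactly 4 apart. Only the positions listed on the slide are used (the list for doc 5 is truncated in the source), and the function name is illustrative.

```python
# Positional postings for "be" as listed on the slide: docID -> positions.
BE_POSTINGS = {
    1: [7, 18, 33, 72, 86, 231],
    2: [3, 149],
    4: [17, 191, 291, 430, 434],
    5: [363, 367],  # truncated in the source; only the listed positions are used
}


def docs_with_gap(postings, gap):
    """Return docIDs in which the term occurs at some positions p and p + gap."""
    hits = []
    for doc_id, positions in postings.items():
        seen = set(positions)
        if any(p + gap in seen for p in positions):
            hits.append(doc_id)
    return hits


# "be" is token 2 and token 6 of the phrase, hence a gap of 4.
print(docs_with_gap(BE_POSTINGS, gap=4))
```

These documents are only candidates: a full phrase match also needs the positional postings of “to”, “or” and “not” to line up.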

Page 12: Information retrieval, by Hadi Mohammadzadeh

Steps of Inverted index construction

1. Documents to be indexed: Friends, Romans, countrymen.
2. Tokenizer → token stream: Friends Romans Countrymen
3. Linguistic modules → modified tokens: friend roman countryman
4. Indexer → inverted index:
   friend     → 2 4
   roman      → 1 2
   countryman → 13 16

Sec. 1.2

Page 13: Information retrieval, by Hadi Mohammadzadeh

Parts of an Inverted Index

• Dictionary
– Commonly kept in memory
• Postings lists
– Commonly kept on disk

Page 14: Information retrieval, by Hadi Mohammadzadeh

Inverted index construction

Preprocessing to form the term vocabulary

• Tokenization (problems)
– Hyphens, apostrophes, compounds, Chinese, numbers
• Dropping stop words
– But you need them for: phrase queries, various song titles, relational queries
• Normalization (term equivalence classing)
– Numbers
– Case folding (reduce all letters to lower case)
– Stemming (Porter’s algorithm): reduce terms to their “roots”
– Lemmatization (reduce variant forms to base form)

Page 15: Information retrieval, by Hadi Mohammadzadeh

Inverted index construction

Index Construction

• Blocked sort-based indexing (BSBI)
– Algorithm: accumulate postings for each block, sort, write to disk
– Then merge (external sorting) the blocks into one long sorted order
• Distributed indexing using MapReduce
– Break up indexing into sets of 2 parallel tasks: parsers and inverters
– Break the input document corpus into splits
– Parsers
  · Master assigns a split to an idle parser machine
  · Parser reads a document at a time and emits (term, doc) pairs
  · Parser writes pairs into j partitions
  · Each partition is for a range of terms’ first letters
– Inverters
  · An inverter collects all (term, doc) pairs for one term partition
  · Sorts and writes to postings lists
• Dynamic indexing
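A minimal in-memory sketch of the BSBI steps listed above (accumulate a block, sort it, then merge the sorted blocks). Real BSBI writes each sorted block to disk and merges the files; here Python lists stand in for the on-disk blocks, and `block_size` and the sample pairs are illustrative assumptions.

```python
import heapq


def bsbi(pairs, block_size):
    """Build term -> sorted docID list from a stream of (term, doc_id) pairs."""
    blocks, block = [], []
    for pair in pairs:
        block.append(pair)
        if len(block) == block_size:
            blocks.append(sorted(set(block)))  # sort the block ("write to disk")
            block = []
    if block:
        blocks.append(sorted(set(block)))

    index = {}
    # heapq.merge performs the k-way merge of the already sorted blocks.
    for term, doc_id in heapq.merge(*blocks):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # drop cross-block duplicates
            postings.append(doc_id)
    return index


pairs = [("roman", 1), ("friend", 2), ("roman", 2), ("friend", 4),
         ("countryman", 13), ("countryman", 16), ("friend", 2)]
print(bsbi(pairs, block_size=3))
```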

Page 16: Information retrieval, by Hadi Mohammadzadeh

Inverted index construction: Index Construction

Data flow

[Figure: MapReduce data flow. The master assigns splits to parsers (map phase); each parser writes segment files partitioned by term range (a-f, g-p, q-z); the master assigns each term range to an inverter (reduce phase), which writes the postings for that range.]
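A single-process sketch of the parser/inverter data flow, assuming the three term ranges shown in the figure (a-f, g-p, q-z). In a real deployment parsers and inverters run on separate machines under a master; the sample documents, the crude tokenizer, and all function names here are illustrative assumptions.

```python
from collections import defaultdict

TERM_RANGES = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Index of the term range that the term's first letter falls into."""
    for k, (low, high) in enumerate(TERM_RANGES):
        if low <= term[0] <= high:
            return k
    raise ValueError(f"no partition for {term!r}")


def parse(split):
    """Map phase: emit (term, doc_id) pairs into one segment per term range."""
    segments = [[] for _ in TERM_RANGES]
    for doc_id, text in split:
        for token in text.lower().split():
            term = token.strip(".,;:!?")  # very crude tokenization
            if term and "a" <= term[0] <= "z":
                segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(segment_files):
    """Reduce phase: collect all pairs of one term range, sort, build postings."""
    pairs = sorted({pair for segment in segment_files for pair in segment})
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return dict(postings)


splits = [
    [(1, "Caesar was killed."), (2, "Brutus killed Caesar.")],
    [(3, "Romans mourn Caesar.")],
]
parsed = [parse(split) for split in splits]              # one parser per split
index = {}
for k in range(len(TERM_RANGES)):                         # one inverter per range
    index.update(invert([segments[k] for segments in parsed]))
print(index)
```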

Page 17: Information retrieval, by Hadi Mohammadzadeh

Search structures for Dictionary

• A naïve dictionary
• Hash tables
• Trees
– Binary tree
– B-tree

Page 18: Information retrieval, by Hadi Mohammadzadeh

Index compression

• Dictionary compression for Boolean indexes
– Array of fixed-width entries (wasteful)
– Dictionary as a string
– Blocking
– Front coding
• Postings compression
– Gap encoding using prefix-unique codes
– Variable-byte codes
– Gamma codes (seldom used in practice)
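A minimal sketch of the two postings codes named above, applied to gaps between docIDs rather than to the docIDs themselves. The byte layout follows the common textbook convention (7 payload bits per byte, high bit set on the last byte of each number); the sample docIDs and function names are illustrative assumptions.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace each docID (except the first) by its distance to the previous one."""
    return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def vb_encode_number(n):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return out


def vb_encode(numbers):
    return [byte for n in numbers for byte in vb_encode_number(n)]


def vb_decode(byte_stream):
    numbers, n = [], 0
    for byte in byte_stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


def gamma_encode(n):
    """Gamma code for n >= 1: unary length of the offset, then the offset bits."""
    offset = bin(n)[3:]  # binary representation without the leading 1
    return "1" * len(offset) + "0" + offset


doc_ids = [824, 829, 215406]
gaps = to_gaps(doc_ids)
encoded = vb_encode(gaps)
print(gaps, encoded, gamma_encode(13))
```

Decoding reverses both steps: `vb_decode` recovers the gaps, and a running sum (`itertools.accumulate`) recovers the docIDs.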

Page 19: Information retrieval, by Hadi Mohammadzadeh

Dictionary compression for Boolean indexes

Dictionary-as-a-String

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

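Table columns: Freq. | Postings ptr. | Term ptr. (frequencies shown: 33, 29, 44, 126; each term pointer points to the start of its term in the string)

Total string length = 400K × 8 B = 3.2 MB

Pointers resolve 3.2M positions: log2 3.2M ≈ 22 bits = 3 bytes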

Page 20: Information retrieval, by Hadi Mohammadzadeh

Dictionary compression for Boolean indexes

Blocking

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Table columns: Freq. | Postings ptr. | Term ptr. (frequencies shown: 33, 29, 44, 126, 7)

Save 9 bytes on 3 pointers.

Lose 4 bytes on term lengths.

Page 21: Information retrieval, by Hadi Mohammadzadeh

Dictionary compression for Boolean indexes

Front coding

– Sorted words commonly have a long common prefix, so store differences only
– (for the last k-1 terms in a block of k)

8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion

“8automat*” encodes automat; the numbers 1, 2, 3 give the extra length beyond automat.
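A minimal sketch of front coding for one block, using the notation of the example above: the first term carries its full length, `*` marks the end of the shared prefix, and each later term is stored as `<suffix length>◊<suffix>`. It assumes terms consist of letters only (no digits, `*` or `◊`); the function names are illustrative.

```python
import os
import re


def front_encode(block):
    """Front-code a sorted block of terms that share a common prefix."""
    prefix = os.path.commonprefix(block)
    first = block[0]
    parts = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for term in block[1:]:
        suffix = term[len(prefix):]
        parts.append(f"{len(suffix)}◊{suffix}")
    return "".join(parts)


def front_decode(encoded):
    """Recover the list of terms from a front-coded block."""
    head = re.match(r"(\d+)([^*]*)\*", encoded)
    first_len, prefix = int(head.group(1)), head.group(2)
    pos = head.end()
    first_suffix_len = first_len - len(prefix)
    terms = [prefix + encoded[pos:pos + first_suffix_len]]
    pos += first_suffix_len
    while pos < len(encoded):
        marker = re.match(r"(\d+)◊", encoded[pos:])
        length = int(marker.group(1))
        pos += marker.end()
        terms.append(prefix + encoded[pos:pos + length])
        pos += length
    return terms


block = ["automata", "automate", "automatic", "automation"]
coded = front_encode(block)
print(coded)
```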

Page 22: Information retrieval, by Hadi Mohammadzadeh

Information Retrieval

Ranked Retrieval

Page 23: Information retrieval, by Hadi Mohammadzadeh

Information Retrieval

Ranked retrieval

• Thus far, our queries have all been Boolean.
– Good for expert users
– Also good for applications: applications can easily consume 1000s of results.
– Not good for the majority of users.
– Most users are incapable of writing Boolean queries (or they are, but they think it’s too much work).
• Most users don’t want to wade through 1000s of results.
– This is particularly true of web search

Page 24: Information retrieval, by Hadi Mohammadzadeh

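Term Weighting

• Term frequency and inverse document frequency
– TF: log-frequency weight of term t in document d

$$ w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases} $$

– DF: df_t is the number of docs in the collection that contain term t
– IDF: the inverse document frequency of t, where N is the number of docs in the collection

$$ \mathrm{idf}_t = \log_{10} (N / \mathrm{df}_t) $$

• tf-idf weighting
– The tf-idf weight of a term is the product of its tf weight and its idf weight

$$ w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10} (N / \mathrm{df}_t) $$

• tf-idf is the best-known weighting scheme in information retrieval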
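A minimal sketch of the tf-idf weight defined above (log-frequency tf weight times log10(N/df)). The numbers in the usage example (a collection of one million documents, a term occurring in 1,000 of them) are assumptions chosen for illustration.

```python
import math


def tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


def tf_idf(tf, df, n_docs):
    """tf-idf weight of a term in a document."""
    return tf_weight(tf) * idf(n_docs, df)


# Assumed example: N = 1,000,000 docs, the term appears in 1,000 docs
# and 10 times in the document being scored.
print(tf_idf(tf=10, df=1_000, n_docs=1_000_000))
```

A term that occurs in every document gets idf 0, so it contributes nothing to the score no matter how often it appears.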

Page 25: Information retrieval, by Hadi Mohammadzadeh

Vector space model for scoring

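– Represent the query as a weighted tf-idf vector
– Represent each document as a weighted tf-idf vector
– Compute the cosine similarity score for the query vector and each document vector
– Rank documents with respect to the query by score
– The tf-idf weight increases with the number of occurrences within a document
– The tf-idf weight increases with the rarity of the term in the collection

$$ \cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}} $$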
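A minimal sketch of cosine scoring over sparse vectors stored as term → weight dictionaries. The document names and weights in the usage example are made up for illustration; in practice they would be tf-idf weights.

```python
import math


def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query, docs):
    """Return document names ordered by decreasing cosine score."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


# Hypothetical weights, for illustration only.
docs = {
    "d1": {"caesar": 2.0, "brutus": 1.0},
    "d2": {"caesar": 0.5, "mercy": 3.0},
    "d3": {"tempest": 1.5},
}
query = {"caesar": 1.0, "brutus": 1.0}
print(rank(query, docs))
```

Because both vectors are length-normalized, a long document does not win merely by containing more words.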

Page 26: Information retrieval, by Hadi Mohammadzadeh

Heuristic methods for speeding up vector space scoring & ranking

– Many of these heuristics achieve their speed at the risk of not finding quite the top K documents matching the query

• Efficient scoring & ranking
1. Inexact top K document retrieval
2. Index elimination
3. Champion lists
4. Static quality scores
• We want top-ranking documents to be both relevant and authoritative
• Relevance is being modeled by cosine scores
• Authority is typically a query-independent property of a document
• Assign a query-independent quality score in [0,1] to each document d; denote this by g(d)

Page 27: Information retrieval, by Hadi Mohammadzadeh

Heuristic methods for speeding up vector space scoring & ranking (cont.)

5. Cluster pruning: preprocessing
• Pick √N docs at random: call these leaders
• For every other doc, pre-compute its nearest leader
– Docs attached to a leader: its followers
– Likely: each leader has ~√N followers.
• Process a query as follows:
– Given query Q, find its nearest leader L.
– Seek the K nearest docs from among L’s followers

Net score for a document d
• net-score can be computed as a combination of cosine relevance and authority, e.g. net-score(q,d) = g(d) + cosine(q,d)
• Top K by net score – fast methods
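A minimal sketch of cluster pruning with an optional net score, assuming dense vectors, √N randomly chosen leaders, and that a leader counts as a member of its own cluster. The tiny 2-dimensional corpus, the `leaders` override (used to make the example deterministic) and the function names are illustrative assumptions.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_clusters(docs, leaders=None, seed=0):
    """Preprocessing: pick about sqrt(N) leaders, attach every other doc to its nearest leader."""
    if leaders is None:
        rng = random.Random(seed)
        n_leaders = max(1, round(math.sqrt(len(docs))))
        leaders = rng.sample(range(len(docs)), n_leaders)
    followers = {leader: [leader] for leader in leaders}
    for i, doc in enumerate(docs):
        if i in followers:
            continue
        nearest = max(leaders, key=lambda ld: cosine(doc, docs[ld]))
        followers[nearest].append(i)
    return followers


def search(query, docs, followers, k, g=None):
    """Query time: find the nearest leader, then rank only that leader's cluster."""
    leader = max(followers, key=lambda ld: cosine(query, docs[ld]))

    def net_score(i):
        # net-score(q, d) = g(d) + cosine(q, d); g defaults to 0 for every doc.
        return cosine(query, docs[i]) + (g[i] if g else 0.0)

    return sorted(followers[leader], key=net_score, reverse=True)[:k]


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = build_clusters(docs, leaders=[0, 2])
print(search([1.0, 0.2], docs, clusters, k=2))
```

The search is inexact by design: a relevant document attached to a different leader is never scored.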

Page 28: Information retrieval, by Hadi Mohammadzadeh

Cluster Pruning

[Figure: a query point shown among leaders, each leader surrounded by its followers]

Page 29: Information retrieval, by Hadi Mohammadzadeh

Parametric and zone indexes

• In fact documents have multiple parts, some with special semantics:
– Author, Title, Date of publication, Language, Format, etc.
• These constitute the metadata about a document
• We sometimes wish to search by these metadata
• Field or parametric index: postings for each field value
– Field query typically treated as conjunction
• A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References …
– Build inverted indexes on zones as well to permit querying

Page 30: Information retrieval, by Hadi Mohammadzadeh

Example zone indexes

Encode zones in dictionary vs. postings.

Page 31: Information retrieval, by Hadi Mohammadzadeh

Tiered indexes

• Break postings up into a hierarchy of lists
– Most important
– …
– Least important
• Can be done by g(d) or another measure
• Inverted index thus broken up into tiers of decreasing importance
• At query time use top tier unless it fails to yield K docs
– If so drop to lower tiers

Page 32: Information retrieval, by Hadi Mohammadzadeh

Example tiered index

Page 33: Information retrieval, by Hadi Mohammadzadeh

A Complete Search System

Page 34: Information retrieval, by Hadi Mohammadzadeh

Evaluating a Search Engine (Ranked Retrieval Method)

Page 35: Information retrieval, by Hadi Mohammadzadeh

Measures for a search engine

Which parameters are very important in a search engine (SE)?

– How fast does a search engine index?
– How fast does a search engine search?
– Expressiveness of the query language
– Uncluttered user interface (UI)
– Is it free?

Page 36: Information retrieval, by Hadi Mohammadzadeh

The key measure

User happiness

• Useless answers won’t make a user happy
• Need a way of quantifying user happiness
• Issue: who is the user we are trying to make happy?
– Web engine
– eCommerce site
– Enterprise (company/govt/academic)

• Happiness: elusive to measure

Page 37: Information retrieval, by Hadi Mohammadzadeh

Evaluation of unranked retrieval

– Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
– Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

• Precision P = tp/(tp + fp)
• Recall R = tp/(tp + fn)

                Relevant   Nonrelevant
Retrieved       tp         fp
Not retrieved   fn         tn
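A minimal sketch that derives the contingency counts from sets of docIDs and applies the two formulas above. The docIDs and the collection size in the usage example are made up for illustration, and the function names are assumptions.

```python
def contingency(retrieved, relevant, collection_size):
    """Return (tp, fp, fn, tn) for sets of retrieved and relevant docIDs."""
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    tn = collection_size - tp - fp - fn
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical example: 4 docs retrieved, 3 docs relevant, 10 docs in total.
tp, fp, fn, tn = contingency({1, 2, 3, 4}, {2, 4, 6}, collection_size=10)
print(precision(tp, fp), recall(tp, fn))
```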

Page 38: Information retrieval, by Hadi Mohammadzadeh

Evaluation of unranked retrieval (cont.)

• What about accuracy?
– The accuracy of an engine: the fraction of classifications that are correct
– Accuracy is a measure used in machine learning classification work
– Why is this not a very useful evaluation measure in IR?
– How to build a 99.9999% accurate search engine on a low budget…
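A small worked example of why accuracy misleads in IR: almost every document is non-relevant to any given query, so an engine that returns nothing is almost always "correct". The collection size (one million documents, one of them relevant) is an assumption chosen to reproduce the 99.9999% figure.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Assumed collection: 1,000,000 docs, exactly 1 relevant to the query.
# The "low budget" engine returns nothing for every query.
tp, fp, fn, tn = 0, 0, 1, 999_999

print(f"accuracy = {accuracy(tp, fp, fn, tn):.6%}")
print(f"recall   = {tp / (tp + fn):.0%}")
```

Precision and recall expose the problem immediately, because both ignore the huge number of true negatives.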

Page 39: Information retrieval, by Hadi Mohammadzadeh

Evaluation of unranked retrieval (cont.)

• F measure
– The combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

$$ F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} $$

– People usually use the balanced F1 measure, i.e., with β = 1 or α = ½
– For F1 the best value is 1 and the worst value is 0
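A minimal sketch of the F measure in its β form, (β² + 1)PR / (β²P + R). The precision and recall values in the usage example are illustrative.

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r; beta = 1 gives F1."""
    b2 = beta * beta
    denominator = b2 * p + r
    if denominator == 0:
        return 0.0
    return (b2 + 1) * p * r / denominator


# Illustrative values: P = 0.5, R = 2/3.
print(f_measure(0.5, 2 / 3))          # balanced F1
print(f_measure(0.5, 2 / 3, beta=2))  # beta > 1 weights recall more heavily
```

The harmonic mean stays close to the smaller of P and R, so a system cannot score well by maximizing one at the expense of the other.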

Page 40: Information retrieval, by Hadi Mohammadzadeh

Evaluation of Ranked Retrieval

• By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve
• We can determine a value between the points using interpolation
• 11-point interpolated average precision
• Other methods: Mean average precision (MAP) and R-precision
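A minimal sketch of these ranked-retrieval measures for a single query. It assumes the usual definition of interpolated precision at recall level r (the highest precision found at any recall ≥ r); the ranked list of relevance judgments is made up for illustration. MAP would be the mean of `average_precision` over a set of queries.

```python
def pr_points(ranked_relevance, total_relevant):
    """(recall, precision) after each rank position of a ranked result list."""
    points, hits = [], 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / k))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level (0 if none is reached)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)


def eleven_point(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    return [interpolated_precision(points, i / 10) for i in range(11)]


def average_precision(ranked_relevance, total_relevant):
    """Mean of the precision values at the ranks where relevant docs appear."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / total_relevant if total_relevant else 0.0


# Hypothetical ranking: relevant docs at ranks 1 and 3, 2 relevant docs overall.
ranking = [True, False, True, False, False]
points = pr_points(ranking, total_relevant=2)
print(eleven_point(points))
print(average_precision(ranking, total_relevant=2))
```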

Page 41: Information retrieval, by Hadi Mohammadzadeh

A precision-recall curve

[Figure: precision-recall curve; x-axis: Recall (0.0 to 1.0), y-axis: Precision (0.0 to 1.0)]

Page 42: Information retrieval, by Hadi Mohammadzadeh

Typical (good) 11 point precisions

• SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)

[Figure: 11-point interpolated precision plot; x-axis: Recall (0 to 1), y-axis: Precision (0 to 1)]

Page 43: Information retrieval, by Hadi Mohammadzadeh

Relevance Feedback (RF) for Query Refinement in Search Engines

Page 44: Information retrieval, by Hadi Mohammadzadeh

Relevance Feedback

• User feedback on relevance of docs in initial set of results
– User issues a (short, simple) query
– The user marks some results as relevant or non-relevant.
– The system computes a better representation of the information need based on feedback.
– Relevance feedback can go through one or more iterations.

• Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate

Page 45: Information retrieval, by Hadi Mohammadzadeh

Relevance Feedback: Example

• Image search engine

http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Page 46: Information retrieval, by Hadi Mohammadzadeh

Results for Initial Query

Page 47: Information retrieval, by Hadi Mohammadzadeh

Relevance Feedback

Page 48: Information retrieval, by Hadi Mohammadzadeh

Results after Relevance Feedback

Page 49: Information retrieval, by Hadi Mohammadzadeh

Key concept: Centroid

• The centroid is the center of mass of a set of points

• Recall that we represent documents as points in a high-dimensional space

• Definition: Centroid

$$ \vec{\mu}(C) = \frac{1}{|C|} \sum_{d \in C} \vec{d} $$

where C is a set of documents.

Page 50: Information retrieval, by Hadi Mohammadzadeh

Rocchio Algorithm

• The Rocchio algorithm uses the vector space model to pick a relevance fed-back query

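• Rocchio seeks the query q_opt that maximizes

$$ \vec{q}_{opt} = \arg\max_{\vec{q}} \, [\cos(\vec{q}, \vec{\mu}(C_r)) - \cos(\vec{q}, \vec{\mu}(C_{nr}))] $$

• Tries to separate docs marked relevant and non-relevant

$$ \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j $$

• Problem: we don’t know the truly relevant docs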

Page 51: Information retrieval, by Hadi Mohammadzadeh

Rocchio 1971 Algorithm (SMART)

• Used in practice:

$$ \vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j $$

• D_r = set of known relevant doc vectors
• D_nr = set of known irrelevant doc vectors
– Different from C_r and C_nr
• q_m = modified query vector; q_0 = original query vector; α, β, γ: weights (hand-chosen or set empirically)
• New query moves toward relevant documents and away from irrelevant documents
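A minimal sketch of the Rocchio update over dense vectors: the modified query is α·q0 plus β times the centroid of the known relevant docs minus γ times the centroid of the known non-relevant docs. The default weights (1.0, 0.75, 0.15) and the clipping of negative term weights to 0 are assumptions of this sketch, as are the toy vectors.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector q_m."""
    dims = len(q0)
    mu_r = centroid(relevant) if relevant else [0.0] * dims
    mu_nr = centroid(nonrelevant) if nonrelevant else [0.0] * dims
    qm = [alpha * q + beta * r - gamma * nr
          for q, r, nr in zip(q0, mu_r, mu_nr)]
    # Assumption: negative term weights are not meaningful, so clip them to 0.
    return [max(0.0, w) for w in qm]


# Toy 3-term vocabulary; the vectors are made up for illustration.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
nonrelevant = [[0.0, 0.0, 1.0]]
print(rocchio(q0, relevant, nonrelevant))
```

Setting γ = 0 gives positive-only feedback, which is a reasonable choice when few non-relevant judgments are available.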

Page 52: Information retrieval, by Hadi Mohammadzadeh

The Theoretically Best Query

[Figure: scatter plot of documents with the optimal query marked; x = non-relevant documents, o = relevant documents]

Page 53: Information retrieval, by Hadi Mohammadzadeh

Relevance feedback on initial query

[Figure: scatter plot of documents showing the initial query and the revised query; x = known non-relevant documents, o = known relevant documents]

Page 54: Information retrieval, by Hadi Mohammadzadeh

Relevance Feedback in vector spaces

• We can modify the query based on relevance feedback and apply standard vector space model.

• Use only the docs that were marked.

• Relevance feedback can improve recall and precision

• Relevance feedback is most useful for increasing recall in situations where recall is important

– Users can be expected to review results and to take time to iterate

Page 55: Information retrieval, by Hadi Mohammadzadeh

Relevance feedback revisited

• In relevance feedback, the user marks a number of documents as relevant/nonrelevant.
• We then try to use this information to return better search results.
• Suppose we just tried to learn a filter for nonrelevant documents
• This is an instance of a text classification problem:
– Two “classes”: relevant, nonrelevant
– For each document, decide whether it is relevant or nonrelevant

Page 56: Information retrieval, by Hadi Mohammadzadeh

Text Classification

Page 57: Information retrieval, by Hadi Mohammadzadeh

Classification Methods #1

Manual classification

• Used by Yahoo! (originally; now present but downplayed), Looksmart, about.com, ODP, PubMed
• Very accurate when job is done by experts
• Consistent when the problem size and team is small
• Difficult and expensive to scale

Page 58: Information retrieval, by Hadi Mohammadzadeh

Classification Methods #2

Automatic document classification

• Hand-coded rule-based systems
– One technique used by CS dept’s spam filter, Reuters, CIA, etc.
– Companies (Verity) provide “IDE” for writing such rules
– Accuracy is often very high if a rule has been carefully refined over time by a subject expert
– Building and maintaining these rules is expensive

Page 59: Information retrieval, by Hadi Mohammadzadeh

Classification Methods #3

Supervised learning

• Supervised learning of a document-label assignment function
– Many systems partly rely on machine learning
• k-Nearest Neighbors (simple, powerful)
• Naive Bayes (simple, common method)
• Support-vector machines (new, more powerful)
• No free lunch: requires hand-classified training data
• But data can be built up (and refined) by amateurs

Page 60: Information retrieval, by Hadi Mohammadzadeh

References

• Introduction to Information Retrieval, 2008
• Managing Gigabytes, 1999