Search A Basic Overview


Page 1: Search A Basic Overview

Search: A Basic Overview

Debapriyo Majumdar

Data Mining – Fall 2014

Indian Statistical Institute Kolkata

October 20, 2014

Page 2: Search A Basic Overview

Back in those days

Once upon a time, there were days without search engines:
– We had access to a much smaller amount of information
– We had to find information manually

Page 3: Search A Basic Overview

Search engine

User needs some information
Assumption: the required information is present somewhere
A search engine tries to bridge this gap
How: the user “expresses” the information need as a query; the engine returns a list of documents, or answers by some better means

Page 4: Search A Basic Overview

Search engine

User needs some information
Assumption: the required information is present somewhere
A search engine tries to bridge this gap

Simplest model:
– User submits a query: a set of words (terms)
– Search engine returns documents “matching” the query
– Assumption: matching the query would satisfy the information need
Modern search has come a long way from this simple model, but the fundamentals are still required

Page 5: Search A Basic Overview

Basic approach

Example documents:
– “This is in Indian Statistical Institute, Kolkata, India”
– “Statistically flying is the safest mode of journey”
– “Diwali is a huge festival in India”
– “India’s population is huge”
– “Thank god it is a holiday”
– “This is autumn”
– “There is no end of learning”

Documents contain terms
Documents are represented by the terms present in them
Match queries and documents by terms
For simplicity: ignore positions, consider documents as “bag-of-words”
There may be many matching documents – need to rank them

Query: india statistics

Page 6: Search A Basic Overview

Vector space model

Each term represents a dimension
Documents are vectors in the term-space
Term-document matrix: a very sparse matrix
Query is also a vector in the term-space

             d1  d2  d3  d4  d5   q
diwali        1   0   0   0   0   0
india         1   0   0   1   1   1
flying        0   1   0   0   0   0
population    0   0   0   1   0   0
autumn        0   0   1   0   0   0
statistical   0   1   0   0   1   1

Similarity of each document d with the query q is measured by the cosine similarity (dot product normalized by norms of the vectors)
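A minimal sketch of this scoring (illustrative, not from the slides), with the matrix above encoded as sparse Python dicts:

import math

docs = {  # binary term weights, one dict per document, as in the matrix
    "d1": {"diwali": 1, "india": 1},
    "d2": {"flying": 1, "statistical": 1},
    "d3": {"autumn": 1},
    "d4": {"india": 1, "population": 1},
    "d5": {"india": 1, "statistical": 1},
}
query = {"india": 1, "statistical": 1}

def cosine(u, v):
    # dot product normalized by the norms of the two vectors
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

for d, vec in sorted(docs.items()):
    print(d, round(cosine(query, vec), 2))
# d5 scores 1.0 (it matches both query terms); d1, d2, d4 score 0.5; d3 scores 0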

Page 7: Search A Basic Overview

Scoring function: TF.iDF

How important is a term t in a document d?
Approach: take two factors into account
– With what significance does t occur in d? [term frequency]
– Does t occur in many other documents also? [document frequency]
– Called TF.iDF: TF × iDF; has many variants for TF and iDF

Variants for TF(t, d):
1. Number of times t occurs in d: freq(t, d)
2. Logarithmically scaled frequency: 1 + log(freq(t, d)) for all t in d; 0 otherwise
3. Augmented frequency, to avoid bias towards longer documents:
   TF(t, d) = 0.5 + 0.5 × freq(t, d) / max { freq(t′, d) : t′ in d }
   (half the score for just being present, the rest a function of frequency)

Inverse document frequency of t: iDF(t) = log( N / DF(t) )
where N = total number of documents, DF(t) = number of documents in which t occurs
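A small sketch of the TF variants and iDF listed above, assuming the standard formulations:

import math

def tf_raw(freq):
    return freq                                     # variant 1: raw count

def tf_log(freq):
    return 1 + math.log(freq) if freq > 0 else 0.0  # variant 2: for all t in d; 0 otherwise

def tf_augmented(freq, max_freq_in_doc):
    # variant 3: half for presence, the rest a function of frequency
    return 0.5 + 0.5 * freq / max_freq_in_doc if freq > 0 else 0.0

def idf(df, n_docs):
    return math.log(n_docs / df)                    # rarer terms weigh more

# e.g. a term occurring 3 times in d (the most frequent term occurs 4 times),
# appearing in 10 of 1000 documents; TF.iDF = TF * iDF:
score = tf_augmented(3, 4) * idf(10, 1000)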

Page 8: Search A Basic Overview

BM25

Okapi IR system – Okapi BM25
If the query q = {q1, …, qn}, where the qi are the words in the query:

score(d, q) = Σi iDF(qi) · freq(qi, d) · (k1 + 1) / ( freq(qi, d) + k1 · (1 − b + b · |d| / avgdl) )

where
N = total number of documents (used inside iDF)
|d| = length of document d
avgdl = average length of documents
k1 and b are optimized parameters, usually b = 0.75 and 1.2 ≤ k1 ≤ 2.0

BM25 consistently exhibited better performance than TF.iDF in TREC evaluations
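A sketch of this scoring function; the iDF used here is the common smoothed variant (an assumption, since the slide’s exact iDF was not shown):

import math

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl, k1=1.5, b=0.75):
    # score one document (a list of terms) against a bag-of-words query;
    # df maps each term to the number of documents containing it
    score = 0.0
    dl = len(doc_terms)                   # |d|, the document length
    for q in query_terms:
        f = doc_terms.count(q)            # freq(q, d)
        if f == 0 or df.get(q, 0) == 0:
            continue
        idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score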

Page 9: Search A Basic Overview

Relevance

Simple IR model: query, documents, returned results
Relevant document: a document that satisfies the information need expressed by the query
– Merely matching query terms does not make a document relevant
– Relevance is human perception, not a mathematical statement
– The user may want some statistics on the population of India with the query “india statistics”
– The document “Indian Statistical Institute” matches the query terms, but is not relevant

To evaluate the effectiveness of a system, we need, for each query:
1. Given a result, an assessment of whether it is relevant
2. The set of all relevant results, assessed (pre-validated)
• If the second is available, it serves the purpose of the first as well

Measures: precision, recall, F-measure (harmonic mean of precision and recall)
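As a sketch, with the retrieved and relevant results for one query given as sets of document ids (the example values below are illustrative):

def evaluate(retrieved, relevant):
    hits = len(retrieved & relevant)   # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure

print(evaluate({"d2", "d3", "d7"}, {"d3", "d7"}))  # (0.667, 1.0, 0.8), roughly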

Page 10: Search A Basic Overview

Inverted index

Standard representation: document → terms
Inverted index: term → documents
For each term t, store the list of the documents in which t occurs

Example documents:
d1: “Statistically flying is the safest mode of journey”
d2: “This is in Indian Statistical Institute, Kolkata, India”
d3: “Diwali is a huge festival in India”
d4: “This is autumn”
d5: “Thank god it is a holiday”
d6: “There is no end of learning”
d7: “India’s population is huge”

diwali: d3
india: d2 d3 d7
flying: d1
population: d7
autumn: d4
statistical: d1 d2

Scores?
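One way to sketch this index in code, as a dict from term to its postings list (doc ids only for now; scores come on the next slide):

index = {
    "diwali":      ["d3"],
    "india":       ["d2", "d3", "d7"],
    "flying":      ["d1"],
    "population":  ["d7"],
    "autumn":      ["d4"],
    "statistical": ["d1", "d2"],
}

def matching_docs(query_terms):
    # union of postings lists: documents matching at least one query term
    result = set()
    for t in query_terms:
        result.update(index.get(t, []))
    return sorted(result)

print(matching_docs(["india", "statistical"]))  # ['d1', 'd2', 'd3', 'd7']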

Page 11: Search A Basic Overview

Inverted index

Standard representation: document → terms
Inverted index: term → documents
For each term t, store the list of the documents in which t occurs

[Same example documents d1–d7 as above]

diwali: d3(0.5)
india: d2(0.7) d3(0.3) d7(0.4)
flying: d1(0.3)
population: d7(0.6)
autumn: d4(0.8)
statistical: d1(0.2) d2(0.5)

Note: these scores are dummy values, not produced by any formula

Page 12: Search A Basic Overview

Positional index

Just documents and scores follows the bag-of-words model
Cannot perform proximity search or phrase query search
Positional inverted index: also store the position of each occurrence of term t in each document d where t occurs

diwali: d3(0.5):<1>
india: d2(0.7):<4,8> d3(0.3):<7> d7(0.4):<1>
flying: d1(0.3):<2>
population: d7(0.6):<2>
autumn: d4(0.8):<3>
statistical: d1(0.2):<1> d2(0.5):<5>

[Same example documents d1–d7 as above]
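A sketch of phrase matching on such an index (term → doc → sorted positions): two terms occur as a phrase in d when some position of the second term is exactly one past a position of the first. The “institute” entry below is hypothetical, added only so the phrase query has a match:

index = {
    "statistical": {"d1": [1], "d2": [5]},
    "institute":   {"d2": [6]},   # hypothetical posting, for illustration
}

def phrase_docs(t1, t2):
    # documents where t1 is immediately followed by t2
    docs = []
    for d, positions1 in index.get(t1, {}).items():
        positions2 = set(index.get(t2, {}).get(d, []))
        if any(p + 1 in positions2 for p in positions1):
            docs.append(d)
    return docs

print(phrase_docs("statistical", "institute"))  # ['d2']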

Page 13: Search A Basic Overview

Pre-processing

Removal of stopwords: of, the, and, …
– Modern search does not completely remove stopwords
– Such words add meaning to sentences as well as queries

Stemming: words → stem (root) of words
– Statistics, statistically, statistical → statistic (same root)
– Loses slight information (the form of the word also matters)
– But unifies differently expressed queries on the same topic
– Lemmatization: doing this properly, with morphological analysis of words

Normalization: unify equivalent words as much as possible
– U.S.A., USA
– Windows, windows

Stemming, lemmatization, normalization, synonym finding: all are important subfields in their own right!
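A sketch of such a chain, using NLTK’s Porter stemmer for the stemming step (assumes the nltk package is installed; the stopword list here is a tiny stand-in):

import re
from nltk.stem import PorterStemmer

STOPWORDS = {"of", "the", "and", "is", "in", "a"}   # toy list, for illustration
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text.lower())    # normalize: lowercase, strip punctuation
    return [stemmer.stem(t) for t in text.split() if t not in STOPWORDS]

print(preprocess("Statistics, statistically, statistical"))
# all three collapse to a common stem ('statist' under the Porter stemmer)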

Page 14: Search A Basic Overview

Creating an inverted index

For each document, write out pairs (term, docId)
Sort by term
Group, compute DF

[Example documents d1–d7 as above]

Pairs (term, docId):
statistic 1
fly 1
safe 1
… …
india 2
statistic 2
india 3
… …
india 7

Sorted by term:
india 2
india 3
india 7
… …
fly 1
safe 1
statistic 1
statistic 2
… …

Grouped, with document frequency:
india (df=3): 2 3 7
fly (df=1): 1
statistic (df=2): 1 2
… …
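A sketch of this sort-based construction (toy in-memory input; real indexers process the pairs in large sorted runs on disk):

from itertools import groupby

def build_index(docs):                       # docs: {doc_id: list of terms}
    # 1. emit (term, doc_id) pairs; 2. sort by term (then doc id)
    pairs = sorted({(t, d) for d, terms in docs.items() for t in terms})
    # 3. group by term, computing DF as the postings-list length
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        postings = [d for _, d in group]
        index[term] = {"df": len(postings), "postings": postings}
    return index

docs = {1: ["statistic", "fly", "safe"], 2: ["india", "statistic"], 3: ["india"], 7: ["india"]}
print(build_index(docs)["india"])            # {'df': 3, 'postings': [2, 3, 7]}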

Page 15: Search A Basic Overview

Traditional architecture

[Architecture diagram, roughly:

Indexing pipeline: different types of documents → basic format conversion, parsing → analysis (stemming, normalization, …) → indexing → index

Query pipeline: user submits query → query handler (query parsing) → core query processing (accessing the index, ranking) → results handler (displaying results) → results back to the user]

Page 16: Search A Basic Overview

Query processing

Lists sorted by doc id:

List 1: doc 17 (0.3), doc 21 (0.2), doc 25 (0.6), doc 78 (0.5), doc 83 (0.4), doc 91 (0.1)
List 2: doc 5 (0.6), doc 14 (0.6), doc 17 (0.6), doc 21 (0.3), doc 38 (0.6), doc 83 (0.5)
List 3: doc 10 (0.1), doc 17 (0.7), doc 44 (0.1), doc 61 (0.3), doc 65 (0.1), doc 81 (0.2), doc 83 (0.9)

One pointer in each list
Pick the smallest doc id

Page 17: Search A Basic Overview

Merge

[Same doc-id-sorted lists as above, one pointer in each list]

Pick the smallest doc id
Merged list so far: doc 5 (0.6)


Page 20: Search A Basic Overview

Merge

[Same lists, pointers advanced]

Pick the smallest doc id
Merged list so far: doc 5 (0.6), doc 10 (0.1)


Page 23: Search A Basic Overview

Merge

[Same lists, pointers advanced]

Pick the smallest doc id
Merged list so far: doc 5 (0.6), doc 10 (0.1), doc 14 (0.6)


Page 26: Search A Basic Overview

Merge

[Same lists, pointers advanced]

Pick the smallest doc id
Merged list so far: doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), doc 17 (1.6)
(doc 17 appears in all three lists; its scores are summed: 0.3 + 0.6 + 0.7 = 1.6)


Page 28: Search A Basic Overview

Merge

[Same lists, fully consumed]

Merged list, still sorted by doc id:
doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), doc 17 (1.6), doc 21 (0.5), doc 25 (0.6), doc 38 (0.6), doc 44 (0.1), doc 61 (0.3), doc 65 (0.1), doc 78 (0.5), doc 81 (0.2), doc 83 (1.8), doc 91 (0.1)

(Partial) sort → Top-2: doc 83 (1.8), doc 17 (1.6)
Complexity? k log n
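A sketch of this k-way merge; heapq.merge performs the “pick the smallest doc id” step across the sorted lists, and scores for the same doc id are summed (lists as in the figure):

import heapq

def merge_lists(lists):
    # each list is [(doc_id, score), ...] sorted by doc_id
    merged = {}
    for doc_id, score in heapq.merge(*lists):
        merged[doc_id] = merged.get(doc_id, 0.0) + score
    return sorted(merged.items())            # still sorted by doc id

list1 = [(17, 0.3), (21, 0.2), (25, 0.6), (78, 0.5), (83, 0.4), (91, 0.1)]
list2 = [(5, 0.6), (14, 0.6), (17, 0.6), (21, 0.3), (38, 0.6), (83, 0.5)]
list3 = [(10, 0.1), (17, 0.7), (44, 0.1), (61, 0.3), (65, 0.1), (81, 0.2), (83, 0.9)]

merged = merge_lists([list1, list2, list3])
top2 = heapq.nlargest(2, merged, key=lambda x: x[1])   # the (partial) sort step
print(top2)   # [(83, 1.8), (17, 1.6)], up to floating-point rounding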

Page 29: Search A Basic Overview

Merge

Simple and efficient, minimal overhead
Lists sorted by doc id → merge → merged list
But we have to scan the lists fully!

Page 30: Search A Basic Overview

Top-k algorithms

If there are millions of documents in the lists:
– Can the ranking be done without accessing the lists fully?

Exact top-k algorithms (used more in databases):
– Family of threshold algorithms (Ronald Fagin et al.)
– Threshold algorithm (TA)
– No random access algorithm (NRA) [we will discuss, as an example]
– Combined algorithm (CA)
– Other follow-up works

Inexact top-k algorithms:
– Exact top-k is not required; the scores are only a “crude” approximation of “relevance” (human perception)
– Several heuristics
– Further reading: IR book by Manning, Raghavan and Schuetze, Ch. 7

Page 31: Search A Basic Overview

NRA (No Random Access) Algorithm

Lists sorted by score:

List 1: doc 25 (0.6), doc 78 (0.5), doc 83 (0.4), doc 17 (0.3), doc 21 (0.2), doc 91 (0.1)
List 2: doc 17 (0.6), doc 38 (0.6), doc 14 (0.6), doc 5 (0.6), doc 83 (0.5), doc 21 (0.3)
List 3: doc 83 (0.9), doc 17 (0.7), doc 61 (0.3), doc 81 (0.2), doc 65 (0.1), doc 10 (0.1), doc 44 (0.1)

Fagin’s NRA Algorithm: read one doc from every list

Page 32: Search A Basic Overview

NRA (No Random Access) Algorithm

[Score-sorted lists as on page 31; read one doc from every list]

Fagin’s NRA Algorithm: round 1
Read: doc 25 (0.6) from List 1, doc 17 (0.6) from List 2, doc 83 (0.9) from List 3

Candidates [current score, best-score]:
doc 83 [0.9, 2.1]
doc 17 [0.6, 2.1]
doc 25 [0.6, 2.1]

min top-2 score: 0.6
maximum score for unseen docs: 0.6 + 0.6 + 0.9 = 2.1
min-top-2 < best-score of candidates → continue

Page 33: Search A Basic Overview

NRA (No Random Access) Algorithm

[Score-sorted lists as on page 31; read one doc from every list]

Fagin’s NRA Algorithm: round 2

Candidates [current score, best-score]:
doc 17 [1.3, 1.8]
doc 83 [0.9, 2.0]
doc 25 [0.6, 1.9]
doc 38 [0.6, 1.8]
doc 78 [0.5, 1.8]

min top-2 score: 0.9
maximum score for unseen docs: 0.5 + 0.6 + 0.7 = 1.8
min-top-2 < best-score of candidates → continue

Page 34: Search A Basic Overview

NRA (No Random Access) Algorithm

[Score-sorted lists as on page 31; read one doc from every list]

Fagin’s NRA Algorithm: round 3

Candidates [current score, best-score]:
doc 83 [1.3, 1.9]
doc 17 [1.3, 1.7]
doc 25 [0.6, 1.5]
doc 78 [0.5, 1.4]

min top-2 score: 1.3
maximum score for unseen docs: 0.4 + 0.6 + 0.3 = 1.3
No more new docs can get into the top-2, but extra candidates are left in the queue
min-top-2 < best-score of candidates → continue

Page 35: Search A Basic Overview

NRA (No Random Access) Algorithm

[Score-sorted lists as on page 31; read one doc from every list]

Fagin’s NRA Algorithm: round 4

Candidates [current score, best-score]:
doc 17 1.6 (fully seen)
doc 83 [1.3, 1.9]
doc 25 [0.6, 1.4]

min top-2 score: 1.3
maximum score for unseen docs: 0.3 + 0.6 + 0.2 = 1.1
No more new docs can get into the top-2, but extra candidates are left in the queue
min-top-2 < best-score of candidates → continue

Page 36: Search A Basic Overview

NRA (No Random Access) Algorithm

[Score-sorted lists as on page 31; read one doc from every list]

Fagin’s NRA Algorithm: round 5

Candidates:
doc 83 1.8 (fully seen)
doc 17 1.6 (fully seen)

min top-2 score: 1.6
maximum score for unseen docs: 0.2 + 0.5 + 0.1 = 0.8
No extra candidate in the queue. Done!

More approaches:
– Periodically also perform random accesses on documents to reduce uncertainty (CA)
– Sophisticated scheduling on the lists
– Crude approximation: NRA may take a lot of time to stop; just stop after a while with an approximate top-k – who cares if the results are perfect according to the scores?
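A compact sketch of NRA following the rounds above: each round reads one entry from every score-sorted list, maintains [worst, best] bounds per seen document, and stops when neither any unseen document nor any candidate outside the current top-k can still overtake it (the bookkeeping is simplified relative to full NRA):

def nra(lists, k=2):
    # lists: [(doc, score), ...], each sorted by descending score
    seen = {}                          # doc -> {list index: score}
    last = [0.0] * len(lists)          # last score read from each list
    for depth in range(max(len(l) for l in lists)):
        for i, lst in enumerate(lists):            # read one doc from every list
            if depth < len(lst):
                seen.setdefault(lst[depth][0], {})[i] = lst[depth][1]
            last[i] = lst[min(depth, len(lst) - 1)][1]
        # worst = sum of seen scores; best adds the last-read score of unread lists
        bounds = {
            doc: (sum(s.values()),
                  sum(s.values()) + sum(last[i] for i in range(len(lists)) if i not in s))
            for doc, s in seen.items()
        }
        top = sorted(bounds, key=lambda d: bounds[d][0], reverse=True)[:k]
        threshold = bounds[top[-1]][0]             # min top-k (worst) score
        if sum(last) <= threshold and all(         # unseen docs cannot get in, and
                bounds[d][1] <= threshold          # no candidate can overtake
                for d in bounds if d not in top):
            break
    return [(d, bounds[d][0]) for d in top]

list1 = [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1)]
list2 = [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)]
list3 = [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1), (44, 0.1)]
print(nra([list1, list2, list3]))  # roughly [(83, 1.8), (17, 1.6)] after five rounds, as above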

Page 37: Search A Basic Overview

References

Primarily: IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/