bo

IR

Paolo FerraginaDipartimento di Informatica

Università di Pisa

Reading Chapter 1

Nowadays IR is much more than building search engines !

What is IR today?

Paolo Ferragina

2009-122009

This is an IR tool!

Paolo Ferragina, Università di Pisa


CISCO foresee 50 mld devices connected by 2020


We have now «devices 2.0» that have their ID, Communication capacity, computing and, more and more, interaction ability.

Paradigm shift

We live in a «post human» society[Amber Case, TED Lecture 2010]

“You are a cyborg everytime you look at a PC screen or use a cellular phone.”

Man has «extended» his capacities (such as movement, memory, strength, communication,…) through the use of these devices 2.0. The new cellular phones has changed the original «communication network» into a «sensing network»


An Internet of bits and bodies

opportunistic Credit card transactions Tel calls, bills, web clicks, …

Purposely sensed pollution, temperature, wind, … movement, accelleration,… Health sensing,…

User generated Photo, tweet, post, email,… Query-log on search engines

Three main types of data…

•Trash track: trashtrack_release.mov

A universe of possibilities

… limited only by our immagination !


A universe of possibilities

… limited only by our immagination !

The Phd+ course:

how to build a start-up ?

Basics

Paolo Ferragina

Information Retrieval

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

25

IR vs. databases:Unstructured vs Structured data

Structured data tends to refer to “tables”

26

Employee Manager Salary

Smith Jones 50000

Chang Smith 60000

50000Ivy Smith

Typically allows numerical range and exact match(for text) queries, e.g.,

Salary < 60000 AND Manager = Smith.

Unstructured data

Typically refers to free text, and allows

Keyword queries including operators

More sophisticated “concept” queries e.g., find all web pages dealing with drug abuse

Classic model for searching text documents

27

Semi-structured data: XML

In fact almost no data is “unstructured” E.g., this slide has distinctly identified zones such

as the Title and Bullets

Facilitates “semi-structured” search such as Title contains data AND Bullets contain search

Issues: how do you process “about”? how do you rank results?

28

Boolean queries: Exact match

The Boolean retrieval model is being able to ask a query that is a Boolean expression: Boolean Queries are queries using AND, OR and

NOT to join query terms Views each document as a set of words Is precise: document matches condition or not.

Perhaps the simplest model to build an IR system on

Many search systems still use it: Email, library catalog, Mac OS X Spotlight

29

Implementing the Boolean model

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

1 if play contains word, 0 otherwise

Brutus AND Caesar BUT NOT Calpurnia

Matrix could be very

big

Inverted index

For each term t, we must store a list of all documents that contain t. Identify each by docID, a document serial number

Can we use fixed-size arrays for this?

31

Brutus

Calpurnia

Caesar 1 2 4 5 6 16 57 132

1 2 4 11 31 45 173

2 31

What happens if the word Caesar is added to document 14?

174

54 101

AND query

Cleopatra

Cesare 57 12 4 9 15 16 2 ….

9 3 45 11 1 46 31 ….

If n,m are the lengths of the lists, how many comparisons ? n * m

≈1012

cmp

≈103 sec

This is not an «engineering problem»,

You need efficient algorithms!

AND query

Cleopatra

Cesare 57 12 4 9 15 16 2 ….

9 3 45 11 1 46 31 ….

Cleopatra

Cesare 2 4 9 12 15 16 57 ….

1 3 9 11 31 45 46 ….

How many comparisons ?

Which are the top-10 results ?

n + m ≈106

≈1 msec

The Inverted index

Brutus

the

Calpurnia

1 2 3 5 8 13 21 34

2 4 6 10 32

13 16

Two advantages: Speed: query requires just a scan Space: store smaller integers (gap coding)

Compressed, they occupy 13% original text

Intersecting two postings lists

35

Query optimization

What is the best order for query processing? Consider a query that is an AND of n terms. For each of the n terms, get its postings, then

AND them together.

Brutus

Caesar

Calpurnia

1 2 3 5 8 16 21 34

2 4 8 16 32 64 128

13 16

Query: Brutus AND Calpurnia AND Caesar36

Query optimization

Can we improve scanning-based intersection? Skips (yet scan-based but with shortcuts)

Augment postings with skip pointers (at indexing time)

How do we deploy them ? Where do we place them ?

1282 4 8 41 48 64

311 2 3 8 11 17 21

3111

41 128

Sec. 2.3

Query optimization

Can we improve scanning-based intersection? Skips (yet scan-based but with shortcuts) Recursive merge (splitting by pivots)

Caesar

Calpurnia

1 2 3 5 8 16 21 34

13 16

39

34

Binary search

Which list you bisect at every recursive step ?

Boolean queries: More general merges

Exercise: Adapt the merge for :

Brutus AND NOT Caesar

Brutus OR NOT Caesar

Can we still run the merge in time O(x+y)?

40

Sec. 1.3

IR is much more…

What about phrases? “Stanford University”

Proximity: Find Gates NEAR Microsoft. Need index to capture term positions in docs.

Zones in documents: Find documents with (author = Ullman) AND (text contains automata).

41

Ranking search results

Boolean queries give inclusion or exclusion of docs.

But often results are too many and we need to rank

results Classification, clustering, summarization, text

mining, etc…

42

Web IR and its challenges

Unusual and diverse Documents Users Queries Information needs

Exploit ideas from social networks link analysis, click-streams, ...

How do search engines work? 43

Our topics, on an exampleW

eb

Crawler

Page archive

Which pagesto visit next?

Query

Queryresolver

?

Ranker

PageAnalizer

textStructure

auxiliary

Indexer

Hashing

Data Compression

DictionariesSorting

Linear AlgebraClusteringClassification

I data center

[Procs OSDI 2006]

Hbase, in Java, Apache license, runs on Hadoop

HyperTable, in C++, GNU license, runs on Hadoop or GlusterFS

Cassandra, in Java, Apache license 2, runs on Amazon’s Dynamo

No SQL

“Smart” algorithms2007

“This is rocket science but you don't have to be a rocket scientist

to use it”

bo

Documents

Transcript of bo