Transcript of INFO 4300 / CS4300 Information Retrieval slides (IR 19/25: Web Search Basics and Classification)

Page 1:

INFO 4300 / CS4300 Information Retrieval

slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 19/25: Web Search Basics and Classification

Paul Ginsparg

Cornell University, Ithaca, NY

9 Nov 2010

1 / 67

Page 2:

Discussion 5, Tue 16 Nov

For this class, read and be prepared to discuss the following:

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. Usenix OSDI ’04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf

2 / 67

Page 3:

Overview

1 Recap

2 Spam

3 Size of the web

4 Intro vector space classification

5 Rocchio

6 kNN

3 / 67

Page 4:

Outline

1 Recap

2 Spam

3 Size of the web

4 Intro vector space classification

5 Rocchio

6 kNN

4 / 67

Page 5:

Duplicate detection

The web is full of duplicated content.

More so than many other collections

Exact duplicates

Easy to eliminate
E.g., use hash/fingerprint

Near-duplicates

Abundant on the web
Difficult to eliminate

For the user, it’s annoying to get a search result with near-identical documents.

Recall marginal relevance

We need to eliminate near-duplicates.

5 / 67

Page 6:

Shingling: Summary

Input: N documents

Choose n-gram size for shingling, e.g., n = 5

Pick 200 random permutations, represented as hash functions

Compute N sketches: the 200 × N matrix shown on a previous slide, one row per permutation, one column per document

Compute N(N−1)/2 pairwise similarities

Transitive closure of documents with similarity > θ

Index only one document from each equivalence class
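A minimal sketch of this pipeline in Python. It is illustrative only: seeded built-in hashes stand in for the random permutations, the function names, toy documents, and n = 3 are my assumptions, and the transitive-closure step is omitted.

import random

def shingles(text, n=5):
    """The set of word n-grams (shingles) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_sketch(shingle_set, seeds):
    """One min-hash per seed; each seeded hash plays the role of a random permutation."""
    return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

def estimated_jaccard(sk1, sk2):
    """The fraction of agreeing sketch coordinates estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(200)]  # the 200 "permutations"
docs = ["the quick brown fox jumps over the lazy dog today",
        "the quick brown fox jumped over the lazy dog today"]
sk = [minhash_sketch(shingles(d, n=3), seeds) for d in docs]
print(estimated_jaccard(sk[0], sk[1]))  # close to the true Jaccard, 5/11

Two sketch coordinates agree exactly when the two shingle sets share their minimum under that permutation, which happens with probability equal to the Jaccard similarity; averaging over 200 permutations gives the estimate.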

6 / 67

Page 7:

Web IR: Differences from traditional IR

Links: The web is a hyperlinked document collection.

Queries: Web queries are different, more varied, and there are a lot of them. How many? ≈ 10^9

Users: Users are different, more varied, and there are a lot of them. How many? ≈ 10^9

Documents: Documents are different, more varied, and there are a lot of them. How many? ≈ 10^11

Context: Context is more important on the web than in many other IR applications.

Ads and spam

7 / 67

Page 8:

Types of queries / user needs in web search

Informational user needs: I need information on something. Example: “low hemoglobin”

We called this “information need” earlier in the class.

On the web, information needs proper are only a subclass of user needs.

Other user needs: Navigational and transactional

Navigational user needs: I want to go to this web site. Examples: “hotmail”, “myspace”, “United Airlines”

Transactional user needs: I want to make a transaction.

Buy something: “MacBook Air”
Download something: “Acrobat Reader”
Chat with someone: “live soccer chat”

Difficult problem: How can the search engine tell what the user’s need or intent behind a particular query is?

8 / 67

Page 9:

Bowtie structure of the web

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309–320, 2000.

Strongly connected component (SCC) in the center
Lots of pages that get linked to, but don’t link (OUT)
Lots of pages that link to other pages, but don’t get linked to (IN)
Tendrils, tubes, islands

# of in-links (in-degree) averages 8–15, not randomly distributed (Poissonian), but instead a power law: the # of pages with in-degree i is ∝ 1/i^α, α ≈ 2.1

9 / 67

Page 10:

Poisson Distribution

Bernoulli process with N trials, each with probability p of success:

p(m) = \binom{N}{m} p^m (1-p)^{N-m} .

Probability p(m) of m successes, in the limit of N very large and p small, is parametrized by just µ = Np (µ = mean number of successes). For N ≫ m, we have

\frac{N!}{(N-m)!} = N(N-1)\cdots(N-m+1) \approx N^m ,

so \binom{N}{m} \equiv \frac{N!}{m!\,(N-m)!} \approx \frac{N^m}{m!}, and

p(m) \approx \frac{N^m}{m!} \left(\frac{\mu}{N}\right)^m \left(1-\frac{\mu}{N}\right)^{N-m} \approx \frac{\mu^m}{m!} \lim_{N\to\infty} \left(1-\frac{\mu}{N}\right)^N = e^{-\mu}\,\frac{\mu^m}{m!}

(we can ignore the factor (1-\mu/N)^{-m} since by assumption N ≫ µm). The N dependence drops out for N → ∞, with the average µ held fixed (p → 0). The form p(m) = e^{-\mu}\,\frac{\mu^m}{m!} is known as a Poisson distribution (properly normalized: \sum_{m=0}^{\infty} p(m) = e^{-\mu} \sum_{m=0}^{\infty} \frac{\mu^m}{m!} = e^{-\mu} \cdot e^{\mu} = 1).
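A quick numerical check of this limit (the specific N values are just for illustration):

from math import comb, exp, factorial

def binomial_pmf(m, N, p):
    """Exact probability of m successes in N Bernoulli trials."""
    return comb(N, m) * p**m * (1 - p)**(N - m)

def poisson_pmf(m, mu):
    """The Poisson limit e^(-mu) * mu^m / m!."""
    return exp(-mu) * mu**m / factorial(m)

mu, m = 10, 10
for N in (100, 1000, 100000):  # hold mu = N*p fixed and grow N
    print(N, binomial_pmf(m, N, mu / N), poisson_pmf(m, mu))

The binomial values converge to the Poisson value as N grows, exactly as in the derivation above.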

10 / 67

Page 11:

Poisson Distribution for µ = 10

p(m) = e^{-10}\,\frac{10^m}{m!}

[Plot: p(m) versus m for 0 ≤ m ≤ 30; the distribution peaks near m = 10 with p(m) ≈ 0.125.]

Compare to power law p(m) ∝ 1/m^{2.1}

11 / 67

Page 12:

Power Law p(m) ∝ 1/m^{2.1} and Poisson p(m) = e^{−10} 10^m/m!

[Plot: both distributions on linear axes, m from 10 to 100; the Poisson falls off rapidly past its peak while the power law decays slowly.]

12 / 67

Page 13:

Power Law p(m) ∝ 1/m^{2.1} and Poisson p(m) = e^{−10} 10^m/m!

[Plot: the same two distributions on a log–log scale, m from 1 to 10000 and p(m) from 10^{-9} to 1; the power law appears as a straight line and dominates the Poisson in the tail.]

13 / 67

Page 14:

The spatial context: Geo-search

Three relevant locations

Server (nytimes.com → New York)
Web page (nytimes.com article about Albania)
User (located in Palo Alto)

Locating the user

IP address
Information provided by user (e.g., in user profile)
Mobile phone

Geo-tagging: Parse text and identify the coordinates of the geographic entities

Example: East Palo Alto CA → Latitude: 37.47 N, Longitude: 122.14 W
Important NLP problem
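At its simplest, geo-tagging is a gazetteer lookup. A toy sketch: the gazetteer entries below are my own illustrative assumptions (the East Palo Alto coordinates are from the slide, the Ithaca ones approximate), and a real geo-tagger must also disambiguate place names (e.g., Paris, TX vs. Paris, France):

GAZETTEER = {
    "east palo alto ca": (37.47, -122.14),  # W longitude as negative
    "ithaca ny": (42.44, -76.50),           # approximate
}

def geo_tag(text):
    """Return (place, (lat, lon)) pairs for gazetteer entries found in the text."""
    t = text.lower()
    return [(place, coords) for place, coords in GAZETTEER.items() if place in t]

print(geo_tag("Directions to East Palo Alto CA"))
# [('east palo alto ca', (37.47, -122.14))]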

14 / 67

Page 15:

Outline

1 Recap

2 Spam

3 Size of the web

4 Intro vector space classification

5 Rocchio

6 kNN

15 / 67

Page 16:

The goal of spamming on the web

You have a page that will generate lots of revenue for you if people visit it.

Therefore, you would like to direct visitors to this page.

One way of doing this: get your page ranked highly in search results.

How can I get my page ranked highly?

16 / 67

Page 17:

Spam technique: Keyword stuffing / Hidden text

Misleading meta-tags, excessive repetition

Hidden text with colors, style sheet tricks etc.

Used to be very effective; most search engines now catch these

17 / 67

Page 18:

Keyword stuffing

18 / 67

Page 19:

Spam technique: Doorway and lander pages

Doorway page: optimized for a single keyword, redirects to the real target page

Lander page: optimized for a single keyword or a misspelled domain name, designed to attract surfers who will then click on ads

19 / 67

Page 20:

Lander page

Number one hit on Google for the search “composita”

The only purpose of this page: get people to click on the ads and make money for the page owner

20 / 67

Page 21:

Spam technique: Duplication

Get good content from somewhere (steal it or produce it yourself)

Publish a large number of slight variations of it

For example, publish the answer to a tax question with the spelling variations of “tax deferred” on the previous slide

21 / 67

Page 22:

Spam technique: Cloaking

Serve fake content to search engine spider

So do we just penalize this always?

No: legitimate uses (e.g., different content to US vs. European users)

22 / 67

Page 23:

Spam technique: Link spam

Create lots of links pointing to the page you want to promote

Put these links on pages with high (or at least non-zero) PageRank

Newly registered domains (domain flooding)
A set of pages that all point to each other to boost each other’s PageRank (mutual admiration society)
Pay somebody to put your link on their highly ranked page (“schuetze horoskop” example)
Leave comments that include the link on blogs

23 / 67

Page 24:

SEO: Search engine optimization

Promoting a page in the search rankings is not necessarily spam.

It can also be a legitimate business – which is called SEO.

You can hire an SEO firm to get your page highly ranked.

There are many legitimate reasons for doing this.

For example, Google bombs like “Who is a failure?”

And there are many legitimate ways of achieving this:

Restructure your content in a way that makes it easy to index
Talk with influential bloggers and have them link to your site
Add more interesting and original content

24 / 67

Page 25:

The war against spam

Quality indicators

Links, statistically analyzed (PageRank etc)
Usage (users visiting a page)
No adult content (e.g., no pictures with flesh-tone)
Distribution and structure of text (e.g., no keyword stuffing)

Combine all of these indicators and use machine learning

Editorial intervention

Blacklists
Top queries audited
Complaints addressed
Suspect patterns detected

25 / 67

Page 26:

Webmaster guidelines

Major search engines have guidelines for webmasters.

These guidelines tell you what is legitimate SEO and what is spamming.

Ignore these guidelines at your own risk

Once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely).

There is often a fine line between spam and legitimate SEO.

Scientific study of fighting spam on the web: adversarial information retrieval

26 / 67

Page 27:

Outline

1 Recap

2 Spam

3 Size of the web

4 Intro vector space classification

5 Rocchio

6 kNN

27 / 67

Page 28:

Growth of the web

The web keeps growing.
But growth is no longer exponential?

28 / 67

Page 29:

Size of the web: Who cares?

Media

Users

They may switch to the search engine that has the best coverage of the web.
Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall.

Search engine designers (how many pages do I need to be able to handle?)

Crawler designers (which policy will crawl close to N pages?)

29 / 67

Page 30:

What is the size of the web? Any guesses?

30 / 67

Page 31:

Simple method for determining a lower bound

OR-query of frequent words in a number of languages

According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07

Big if: Page counts of Google search results are correct. (Generally, they are just rough estimates.)

But this is just a lower bound, based on one search engine.

How can we do better?

31 / 67

Page 32:

Size of the web: Issues

What is size? Number of web servers? Number of pages? Terabytes of data available?

The “dynamic” web is infinite.

Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”)
Many other dynamic sites generate an infinite number of pages

The static web contains duplicates – each “equivalence class” should only be counted once.

Some servers are seldom connected.

Example: Your laptop
Is it part of the web?

32 / 67

Page 33:

“Search engine index contains N pages”: Issues

Can I claim a page is in the index if I only index the first 4000 bytes?

Can I claim a page is in the index if I only index anchor text pointing to the page?

There used to be (and still are?) billions of pages that are only indexed by anchor text.

33 / 67

Page 34:

How can we estimate the size of the web?

34 / 67

Page 35:

Sampling methods

Random queries (picked from dictionary)

Random searches (picked from search logs)

Random IP addresses

Random walks

35 / 67

Page 36:

Variant: Estimate relative sizes of indexes

There are significant differences between indexes of different search engines.

Different engines have different preferences.

max URL depth, max count/host, anti-spam rules, priority rules, etc.

Different engines index different things under the same URL.

anchor text, frames, meta-keywords, size of prefix, etc.

36 / 67

Page 37:

Page 38:

Outline

1 Recap

2 Spam

3 Size of the web

4 Intro vector space classification

5 Rocchio

6 kNN

38 / 67

Page 39:

Digression: “naive” Bayes

Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S) and 1000 classified as non-spam (S̄).

180 of the S messages contain the word “offer”.
20 of the S̄ messages contain the word “offer”.

Suppose you receive a message containing the word “offer”. What is the probability it is S? Estimate:

\frac{180}{180 + 20} = \frac{9}{10} .

(Formally, assuming a “flat prior” p(S) = p(S̄):

p(S \mid \text{offer}) = \frac{p(\text{offer} \mid S)\, p(S)}{p(\text{offer} \mid S)\, p(S) + p(\text{offer} \mid \bar S)\, p(\bar S)} = \frac{180/1000}{180/1000 + 20/1000} = \frac{9}{10} .)
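The same estimate in a few lines of Python (the counts are the slide’s; the variable names are mine):

n_spam = n_ham = 1000                        # messages in S and in non-spam
offer_in_spam, offer_in_ham = 180, 20        # messages containing "offer"
p_offer_given_spam = offer_in_spam / n_spam  # 0.18
p_offer_given_ham = offer_in_ham / n_ham     # 0.02
p_spam = p_ham = 0.5                         # flat prior
p_spam_given_offer = (p_offer_given_spam * p_spam) / (
    p_offer_given_spam * p_spam + p_offer_given_ham * p_ham)
print(p_spam_given_offer)  # 0.9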

39 / 67

Page 40:

Classification

Naive Bayes is simple and a good baseline.

Use it if you want to get a text classifier up and running in a hurry.

But other classification methods are more accurate.

Perhaps the simplest well-performing alternative: kNN

kNN is a vector space classifier.

Today:
1 intro vector space classification
2 very simple vector space classification: Rocchio
3 kNN

Next time: general properties of classifiers

40 / 67

Page 41:

Recall vector space representation

Each document is a vector, one component for each term.

Terms are axes.

High dimensionality: 100,000s of dimensions

Normalize vectors (documents) to unit length

How can we do classification in this space?
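As a concrete reminder, a minimal sketch of the representation (raw term frequencies over a toy vocabulary, normalized to unit length; the vocabulary is my assumption, and a real system would use tf-idf weights):

import numpy as np

def doc_vector(doc, vocabulary):
    """Term-frequency vector over a fixed vocabulary, normalized to unit length."""
    words = doc.lower().split()
    v = np.array([words.count(t) for t in vocabulary], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

vocab = ["london", "beijing", "nairobi", "election"]
print(doc_vector("Beijing beijing election", vocab))  # [0. 0.894 0. 0.447]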

41 / 67

Page 42:

Vector space classification

As before, the training set is a set of documents, each labeled with its class.

In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.

Premise 1: Documents in the same class form a contiguous region.

Premise 2: Documents from different classes don’t overlap.

We define lines, surfaces, hypersurfaces to divide regions.

42 / 67

Page 43:

Classes in the vector space

[Figure: training documents from three classes (China, Kenya, UK) plotted as points in the plane, with an unlabeled test document ⋆.]

Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes
Based on these separators: ⋆ should be assigned to China

How do we find separators that do a good job at classifying new documents like ⋆?

43 / 67

Page 44:

Outline

1 Recap

2 Spam

3 Size of the web

4 Intro vector space classification

5 Rocchio

6 kNN

44 / 67

Page 45:

Recall Rocchio algorithm (lecture 12)

The optimal query vector is:

\vec{q}_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})]
= \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j + \left[ \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j \right]

We move the centroid of the relevant documents by the difference between the two centroids.

45 / 67

Page 46:

Exercise: Compute Rocchio vector (lecture 12)

[Figure: circles (relevant documents) and x’s (nonrelevant documents) scattered in the plane.]

46 / 67

Page 47:

Rocchio illustrated (lecture 12)

[Figure: the relevant and nonrelevant documents from the previous slide, with centroids µ_R and µ_NR, the difference vector µ_R − µ_NR, and the resulting q_opt.]

µ_R: centroid of relevant documents
µ_NR: centroid of nonrelevant documents
µ_R − µ_NR: difference vector
Add the difference vector to µ_R to get q_opt

q_opt separates relevant/nonrelevant perfectly.

47 / 67

Page 48:

Rocchio 1971 algorithm (SMART) (lecture 12)

Used in practice:

\vec{q}_m = \alpha \vec{q}_0 + \beta\, \mu(D_r) - \gamma\, \mu(D_{nr})
= \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j

q_m: modified query vector; q_0: original query vector; D_r and D_nr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights attached to each term

New query moves towards relevant documents and away from nonrelevant documents.

Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.

Set negative term weights to 0.

“Negative weight” for a term doesn’t make sense in the vector space model.
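A direct transcription into NumPy; the α, β, γ defaults below are common SMART-era choices, not values given in this lecture:

import numpy as np

def rocchio_update(q0, D_r, D_nr, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio 1971 update: move the query toward the relevant centroid
    and away from the nonrelevant centroid."""
    qm = alpha * q0
    if len(D_r) > 0:
        qm = qm + beta * D_r.mean(axis=0)    # beta * mu(D_r)
    if len(D_nr) > 0:
        qm = qm - gamma * D_nr.mean(axis=0)  # gamma * mu(D_nr)
    return np.maximum(qm, 0.0)               # set negative term weights to 0

q0 = np.array([1.0, 0.0, 0.0])
D_r = np.array([[0.8, 0.6, 0.0], [0.6, 0.8, 0.0]])  # toy judged-relevant vectors
D_nr = np.array([[0.0, 0.1, 0.9]])                  # toy judged-nonrelevant vector
print(rocchio_update(q0, D_r, D_nr))  # [1.525 0.51 0.]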

48 / 67

Page 49:

Using Rocchio for vector space classification

We can view relevance feedback as two-class classification.

The two classes: the relevant documents and the nonrelevantdocuments.

The training set is the set of documents the user has labeled so far.

The principal difference between relevance feedback and text classification:

The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.

49 / 67

Page 50:

Rocchio classification: Basic idea

Compute a centroid for each class

The centroid is the average of all documents in the class.

Assign each test document to the class of its closest centroid.

50 / 67

Page 51:

Recall definition of centroid

\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)

where D_c is the set of all documents that belong to class c and \vec{v}(d) is the vector space representation of d.

51 / 67

Page 52:

Rocchio algorithm

TrainRocchio(C, D)
1 for each c_j ∈ C
2 do D_j ← {d : ⟨d, c_j⟩ ∈ D}
3    µ_j ← (1/|D_j|) Σ_{d ∈ D_j} v(d)
4 return {µ_1, . . . , µ_J}

ApplyRocchio({µ_1, . . . , µ_J}, d)
1 return arg min_j |µ_j − v(d)|
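The same algorithm as a NumPy sketch (the two-term toy vectors and class names are illustrative assumptions):

import numpy as np

def train_rocchio(X, y):
    """One centroid per class from labeled document vectors X (n_docs x n_terms)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def apply_rocchio(centroids, d):
    """Assign d to the class of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - d))

X = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]])
y = np.array(["China", "China", "UK", "UK"])
centroids = train_rocchio(X, y)
print(apply_rocchio(centroids, np.array([0.7, 0.2])))  # China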

52 / 67

Page 53:

Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2

[Figure: Rocchio classification of China, Kenya, and UK documents. The boundary between two classes is the set of points equidistant from the two centroids (a1 = a2, b1 = b2, c1 = c2); the test document ⋆ falls in the China region.]

53 / 67

Page 54:

Rocchio properties

Rocchio forms a simple representation for each class: the centroid

We can interpret the centroid as the prototype of the class.

Classification is based on similarity to / distance from centroid/prototype.

Does not guarantee that classifications are consistent with the training data!

54 / 67

Page 55:

Time complexity of Rocchio

mode      time complexity
training  Θ(|D| L_ave + |C| |V|) ≈ Θ(|D| L_ave)
testing   Θ(L_a + |C| M_a) ≈ Θ(|C| M_a)

55 / 67

Page 56:

Rocchio vs. Naive Bayes

In many cases, Rocchio performs worse than Naive Bayes.

One reason: Rocchio does not handle nonconvex, multimodal classes correctly.

56 / 67

Page 57:

Rocchio cannot handle nonconvex, multimodal classes

[Figure: class a consists of two separate clusters with overall centroid A; class b is a single cluster lying between them with centroid B; the test point o sits inside the b cluster but closer to A.]

Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?

A is the centroid of the a’s, B is the centroid of the b’s.

The point o is closer to A than to B.

But it is a better fit for the b class.

a is a multimodal class with two prototypes.

But in Rocchio we only have one.

57 / 67

Page 58:

Outline

1 Recap

2 Spam

3 Size of the web

4 Intro vector space classification

5 Rocchio

6 kNN

58 / 67

Page 59:

kNN classification

kNN classification is another vector space classification method.

It also is very simple and easy to implement.

kNN is more accurate (in most cases) than Naive Bayes and Rocchio.

If you need to get a pretty accurate classifier up and running in a short time . . .

. . . and you don’t care about efficiency that much . . .

. . . use kNN.

59 / 67

Page 60:

kNN classification

kNN = k nearest neighbors

kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.

1NN is not very robust – one document can be mislabeled or atypical.

kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.

Rationale of kNN: contiguity hypothesis

We expect a test document d to have the same label as the training documents located in the local region surrounding d.

60 / 67

Page 61:

Probabilistic kNN

Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c

kNN classification rule for probabilistic kNN: Assign d to the class c with highest P(c|d)

61 / 67

Page 62:

kNN is based on Voronoi tessellation

[Figure: training points from two classes (x and ⋄) inducing a Voronoi tessellation of the plane, with a test point ⋆.]

1NN, 3NN classification decision for the star?

62 / 67

Page 63:

kNN algorithm

Train-kNN(C, D)
1 D′ ← Preprocess(D)
2 k ← Select-k(C, D′)
3 return D′, k

Apply-kNN(D′, k, d)
1 S_k ← ComputeNearestNeighbors(D′, k, d)
2 for each c_j ∈ C(D′)
3 do p_j ← |S_k ∩ c_j| / k
4 return arg max_j p_j
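A corresponding NumPy sketch of Apply-kNN (Euclidean distance on toy two-term vectors, which are my assumptions; with length-normalized vectors this gives the same ranking as cosine similarity):

import numpy as np
from collections import Counter

def apply_knn(X_train, y_train, d, k):
    """Majority vote among the k nearest training vectors; ties broken arbitrarily."""
    dists = np.linalg.norm(X_train - d, axis=1)  # distance to every training doc
    nearest = np.argsort(dists)[:k]              # indices of the k nearest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.3, 0.9]])
y_train = ["x", "x", "o", "o", "o"]
print(apply_knn(X_train, y_train, np.array([0.4, 0.7]), k=3))  # o

Note the cost structure: classifying one test document computes a distance to every training document, which is the Θ(|D|) test-time behavior discussed on a later slide.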

63 / 67

Page 64:

Exercise

[Figure: ten x’s, five o’s, and a test point (star) in the plane.]

How is star classified by:

(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?

64 / 67

Page 65:

Exercise

[Figure: the same configuration of x’s, o’s, and the star as on the previous slide.]

How is star classified by:

(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio

65 / 67

Page 66:

Time complexity of kNN

kNN with preprocessing of training set

training  Θ(|D| L_ave)
testing   Θ(L_a + |D| M_ave M_a) = Θ(|D| M_ave M_a)

kNN test time proportional to the size of the training set!

The larger the training set, the longer it takes to classify a test document.

kNN is inefficient for very large training sets.

66 / 67

Page 67:

kNN: Discussion

No training necessary

But linear preprocessing of documents is as expensive as training Naive Bayes.
You will always preprocess the training set, so in reality the training time of kNN is linear.

kNN is very accurate if training set is large.

Optimality result: asymptotically zero error if the Bayes rate is zero.

But kNN can be very inaccurate if training set is small.

67 / 67