EITF25 Internet - Web Search

Anders Ardö

EIT – Electrical and Information Technology, Lund University

November 20, 2012

A. Ardö, EIT EITF25 Internet - Web Search November 20, 2012 1 / 50

Agenda

1 Web search

2 Web search engines

3 Web robots, crawler

4 Focused Web crawling

5 Web search vs Browsing

6 Privacy, Filter bubble


Why Web search ...

Explosion of (digital) information within all types of information collections
Harder and harder to follow the information flow
Faster way to find relevant information when it's needed
Challenges
    Distributed, dynamic data
    Large volume
    Unstructured, heterogeneous data


Size of the Web

no one knows
estimates (text pages)
    2005: 'more than 11.5 billion'
    2007: 'more than 20 billion'
    2010: '20 - 55 billion'
Google claims to know of 10^12 unique URLs (text, images, ...)


Important questions

Digital Libraries

How do I find relevant information?
How do I navigate the digital information landscape?
How to structure and organize information to ease knowledge extraction?
How to create collections, properly organized, with relevant material?
How to keep collections updated?


Search Engine - Basic structure

[Figure: Search engine basic structure. Labels: Web robot, The Web, HTTP, Web pages, Database, Interface (CGI-script), Web browser, Query, Answer. Concerns noted: size, efficiency, response time.]

software crawling the web (much like a human clicking on links)
collect all found web-pages into a database (IR system)
offer a web-interface to that database
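The database step can be illustrated with a tiny inverted index. This is only a sketch under stated assumptions: the pages, the URLs and the helper names (`tokenize`, `build_index`, `search`) are hypothetical, and a production IR system would add ranking, stemming and persistent storage.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return, sorted, the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


# Hypothetical pages standing in for what a web robot has collected.
pages = {
    "http://example.org/a": "Carnivorous plants trap insects",
    "http://example.org/b": "Garden plants need water",
    "http://example.org/c": "Insects in the garden",
}
index = build_index(pages)
print(search(index, "plants"))
print(search(index, "garden plants"))
```

The web interface would then only need to pass the user's query string to `search` and render the returned URLs.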


Size of search engines

not published
guesses: 1 - 20 - 50 billion pages

Search Engine   Total URLs   Unique URLs   Overlap (%)
Google          182          166           8.79
Altavista       181          167           7.74
Hotbot          200          170           15
Scirus          174          164           5.75
Bioweb          200          200           0.0

From: Rather, Lone, Shah: “Overlap in Web Search Results: A Study of Five Search Engines”, Library Philosophy and Practice 2008, ISSN 1522-0222
http://www.webpages.uidaho.edu/~mbolin/rather-lone-shah.htm
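The Overlap column appears to be the share of returned URLs that are not unique, i.e. (total - unique) / total. That formula is inferred from the numbers, not stated on the slide; the short check below recomputes it from the table values.

```python
def overlap_percent(total_urls, unique_urls):
    """Share of returned URLs that are not unique, in percent."""
    return 100.0 * (total_urls - unique_urls) / total_urls


# (total, unique) pairs copied from the table above.
table = {
    "Google": (182, 166),
    "Altavista": (181, 167),
    "Hotbot": (200, 170),
    "Scirus": (174, 164),
    "Bioweb": (200, 200),
}
for engine, (total, unique) in table.items():
    print(f"{engine:10s} {overlap_percent(total, unique):6.2f}")
```

Under this assumed formula the Google, Hotbot, Scirus and Bioweb rows are reproduced exactly; Altavista comes out as 7.73 rather than the 7.74 in the table, which may simply be a rounding difference.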


Google

started late 1990s
estimated 450,000 low-cost commodity servers (2006)
1 trillion links to web pages (July 2008)
“over 8 billion web pages”
estimate 40 billion pages?
goal is to index all the world’s data
Google Flu Trends


Search engine examples

Google, Bing, Yahoo, (DuckDuckGo)


Search Engine - Application

[Figure: Search engine application. Labels: Web browser, HTTP, Web server, CGI-script, Database, Web pages. Interfaces shown: CGI/HTML and SRU/XML over HTTP, (Z39.50 ...), (ASN, ...).]


Meta Search Engine - Application


MetaSearch Engine

software that simultaneously searches several individual search engines,
collecting, reviewing and ranking their answers,
and gives them back in a merged/condensed form to the user
they are no better than the quality of the search engine databases they obtain results from


MetaSearch engines

Simultaneously search several individual search engines
Query translation
Result merging
    Simple merge
    Duplicate detection
    tf-idf/similarity ranking
    Position based
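Result merging with duplicate detection and position-based ranking could look roughly like the sketch below. The scoring rule (sum of 1/rank over all engines), the crude duplicate key and the example URLs are assumptions chosen for illustration, not the method of any particular metasearch engine.

```python
from collections import defaultdict


def normalize(url):
    """Crude duplicate key: ignore case and a trailing slash."""
    return url.lower().rstrip("/")


def merge_results(result_lists):
    """Merge ranked URL lists; score each URL by the sum of 1/rank."""
    scores = defaultdict(float)
    first_seen = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            key = normalize(url)
            scores[key] += 1.0 / rank
            first_seen.setdefault(key, url)
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda k: (-scores[k], k))
    return [first_seen[k] for k in ranked]


# Hypothetical answers from two underlying search engines.
engine_a = ["http://a.example/", "http://b.example/x", "http://c.example/"]
engine_b = ["http://b.example/x", "http://d.example/", "http://A.example"]
print(merge_results([engine_a, engine_b]))
```

A URL returned by several engines accumulates score from each list, so agreement between engines pushes a result upwards even if no single engine ranked it first.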


MetaSearch Engine examples

Dogpile, Yippy, DuckDuckGo


Special (Vertical) search engines

prices
    ex: prisjakt, PriceRunner, ...
    http://www.pricerunner.co.uk/
    http://www.prisjakt.nu/
jobs
    ex: freejobsearch, jobspider, ...
    http://freejobsearch.org/
    http://www.jobspider.com/
Housing
    ex: rightmove, hemnet, bovision, ...
    http://www.rightmove.co.uk/
    http://www.hemnet.se/
    http://bovision.se/
... and so on ...


Other Search Engines

Wolfram Alpha

Wolfram|Alpha introduces a fundamentally new way to get knowledge and answers - not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.
From http://www.wolframalpha.com/about.html


Wolfram Alpha example


Web Robot - Basic architecture

Spider, Crawler, Robot, agent, ...

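[Figure: Web robot architecture. Seed URLs feed the Frontier (list of unvisited pages). Loop: Get URL, Fetch Web page (from the Web), Analyze, Save pages to the Database (repository of visited pages). Links (URLs) found during analysis go back to the Frontier.]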


Web Robot - Ethics

Important - BE NICE
Do not overload network or server
Robot exclusion protocol
    check for http://www.foobar.com/robots.txt
    HTML meta-tag ROBOTS

robots.txt:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /DATA/
    Disallow: /Images/

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
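A minimal sketch of how a robot might honour the robots.txt rules above before fetching, using Python's standard `urllib.robotparser` module. The crawler name `ExampleBot` and the helper `allowed` are hypothetical; the rules are parsed from a string here so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the slide, as they would be served at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /DATA/
Disallow: /Images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://www.foobar.com/index.html"))
print(allowed("http://www.foobar.com/cgi-bin/search"))
```

A real robot would typically load the file with `set_url(...)` and `read()` once per host, and would also wait between requests to the same server so as not to overload it.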


Web Robot - Problems

Network failures
Erroneous URLs
Unreachable servers
Password protection
Spider traps
Recursive URLs
Character set encodings
Same page - different URLs
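The "same page - different URLs" problem is often reduced by mapping each URL to a canonical form before checking whether it has been seen. The sketch below is one simple set of rules (lowercase scheme and host, drop default ports and fragments, empty path becomes "/"); the rule set and the names `canonical_url` and `is_new` are assumptions, and real crawlers usually apply more rules than these.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url):
    """Return a normalized form of the URL for duplicate detection."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


seen = set()


def is_new(url):
    """Record the URL and report whether its canonical form was unseen."""
    key = canonical_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True


print(canonical_url("HTTP://Example.COM:80/a/b#sec"))
```

This does not detect identical content served under genuinely different paths; that would need a comparison of page content, for example by hashing the fetched text.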


Web Robot - More Problems

Hidden Web

Databases
Dynamic scripts
... ?


Web Robot - Traversal algorithms

Depth first (Stack, LIFO queue)
Breadth first (FIFO queue)
Best first (How?)
Relevance order (How?)
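The traversal order is decided entirely by how the frontier hands out the next page. The sketch below runs the same loop with a stack, a FIFO queue and a priority queue over a hypothetical link graph. The scores used for best-first are made-up numbers; how to obtain good scores is exactly the open "How?" on the slide.

```python
from collections import deque
import heapq

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def crawl_order(seed, strategy, score=None):
    """Return the visiting order for 'dfs', 'bfs' or 'best' traversal."""
    if strategy == "best":
        frontier = [(-score(seed), seed)]  # heap; negated for max-first
    else:
        frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        if strategy == "dfs":
            page = frontier.pop()              # LIFO: newest link first
        elif strategy == "bfs":
            page = frontier.popleft()          # FIFO: oldest link first
        else:
            page = heapq.heappop(frontier)[1]  # highest score first
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                if strategy == "best":
                    heapq.heappush(frontier, (-score(link), link))
                else:
                    frontier.append(link)
    return order


# Made-up relevance estimates, e.g. from a topic classifier.
SCORES = {"A": 0.5, "B": 0.9, "C": 0.4, "D": 0.1, "E": 0.8}
print(crawl_order("A", "dfs"))
print(crawl_order("A", "bfs"))
print(crawl_order("A", "best", SCORES.get))
```

On this graph the three strategies visit the pages in three different orders, although the set of pages reached is the same.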


Focused Crawling

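[Figure: Focused crawling. The Web robot loop (Seed URLs, Frontier with the list of unvisited pages, Get URL, Fetch Web page, Analyze, Save, Database with the repository of visited pages) extended with a URL focus filter and a focus filter on analyzed pages, which separates pages "within the focus" from pages "not in focus".]

Focus:
    Domain
    Project
    Country
    Region
    Topic
    Subject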


Topic-specific Web-crawling

Problem
    Construct a topic-specific search engine (ex. Carnivorous plants)
Solution
    Make a Web crawler walk through the Internet and collect all pages with topic 'Carnivorous plants'

easier said than done!


Conditions

Page is about Carnivorous plants
    => automated subject classification
There are many pages on the Internet
    => where to start?
    => look only at interesting links
    => take the most important pages first


Automated Classification technologies

Machine learning methods
    Statistical models (Bayes, SVM, ...)
    ANN
Information Retrieval methods
    Clustering (no predefined categories)
Library Science methods
    String matching + Thesaurus
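As an illustration of the "string matching + thesaurus" approach, the sketch below scores a text by weighted occurrences of thesaurus terms. The terms, the weights, the normalization per 100 words and the threshold are all made-up assumptions; a real thesaurus would be far larger and curated by subject experts.

```python
import re

# Hypothetical thesaurus: term -> weight for the topic 'Carnivorous plants'.
THESAURUS = {
    "carnivorous plant": 3.0,
    "venus flytrap": 3.0,
    "pitcher plant": 2.0,
    "sundew": 2.0,
    "insect": 0.5,
}


def topic_score(text, thesaurus=THESAURUS):
    """Sum weight * occurrences of each term, per 100 words of text."""
    lowered = text.lower()
    words = re.findall(r"[a-z]+", lowered)
    if not words:
        return 0.0
    total = sum(weight * lowered.count(term)
                for term, weight in thesaurus.items())
    return 100.0 * total / len(words)


def is_on_topic(text, threshold=10.0):
    """Classify the text as on-topic when its score reaches the threshold."""
    return topic_score(text) >= threshold


print(is_on_topic("The Venus flytrap is a carnivorous plant that catches insects."))
print(is_on_topic("Our garden centre sells tools and furniture."))
```

Plain substring counting is deliberately naive: it matches "insect" inside "insects" but knows nothing about synonyms or word sense beyond what the thesaurus lists.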


Topic Filter


Internet is Big

[Figure: crawl illustration. Labels: First page; OK, save; Links; Choose; Page OK?; New page; Page OK?; Save; New page.]


Basic Algorithm

Add good start pages (seeds) to frontier
LOOP:
    Choose a page among links
    Page OK?
        Save page
        Add all links to frontier
    Go to LOOP

Save (database(s)):
    All relevant pages (search engine database)
    All analyzed pages (seen pages)
    All new links (frontier)
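The loop above can be sketched in a few lines. To keep the example runnable without network access, the "web" is a hypothetical in-memory dictionary, and `page_ok` is a stand-in keyword test where a real crawler would call a subject classifier.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "seed": ("carnivorous plants overview", ["flytrap", "news"]),
    "flytrap": ("venus flytrap, a carnivorous plant", ["sundew"]),
    "news": ("football results", ["sundew-shop"]),
    "sundew": ("sundew is a carnivorous plant", []),
    "sundew-shop": ("buy carnivorous sundew here", []),
}


def page_ok(text):
    """Stand-in topic filter; a real one would be a subject classifier."""
    return "carnivorous" in text


def focused_crawl(seeds):
    """Follow links only from pages that pass the topic filter."""
    frontier = deque(seeds)   # all new links
    seen = set(seeds)         # all pages already queued or analyzed
    relevant = []             # search engine database
    while frontier:
        url = frontier.popleft()      # choose a page among links
        text, links = WEB[url]        # "fetch" the page
        if page_ok(text):
            relevant.append(url)      # save page
            for link in links:        # add all links to frontier
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
    return relevant


print(focused_crawl(["seed"]))
```

Note that "sundew-shop" is on topic but is never reached, because the only link to it sits on the off-topic "news" page whose links are not followed.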


Problems I

Which new page?


Problems II

Isolated pages


Problems III

Non-relevant pages “blocking”


Compromises

Precision/recall
completeness/speed
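Precision is the fraction of collected pages that are relevant; recall is the fraction of all relevant pages that were collected. A small worked example, with made-up page identifiers standing in for a crawl result and a known set of on-topic pages:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set vs. the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical crawl: 4 pages saved, 5 pages on the topic in total.
crawled = {"p1", "p2", "p3", "p4"}
on_topic = {"p2", "p3", "p5", "p6", "p7"}
print(precision_recall(crawled, on_topic))
```

Here 2 of the 4 crawled pages are on topic (precision 0.5) and 2 of the 5 on-topic pages were found (recall 0.4). A stricter topic filter tends to raise precision and lower recall, which is the compromise the slide points at.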


Browsing vs search

Search
    LOTS of data
    Unstructured
    Unrelated items clutter results
Browsing
    Small amounts of data
    Hierarchically structured
    Quality assessed


Browsing examples

Dmoz (ODP), Yahoo! Directory


Filter bubble

What do search engines or social sites know about me?
At least location, search history, click history, likes, and more ...
Personalize what's shown (search results, ...) using this info
Show us what we want/like to see - algorithmically
... and not what's relevant (who decides that?)

Problem?


Filter bubble example I

From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out


Filter bubble example II

From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out


ToS-DR

Terms-of-Service – Didn’t Read; http://tos-dr.info/

you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.
Facebook: you grant us a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook (IP License).


Privacy

Search history, clicks, photos, documents, comments, ...
leads to a profile
that can be used by ads or sold, or even stolen
which might lead to it ending up in unwanted places
and used against you

Beware!


Questions!

QUESTIONS?
