History of Search and Web Search Engines - Seminar on Web Search
Seminar on Web Search
History of Search and Web Search Engines
Prof. Beat Signer
Department of Computer Science
Vrije Universiteit Brussel
http://www.beatsigner.com
Beat Signer - Department of Computer Science - [email protected]
2 September 5, 2011
Seminar Organisation
Prof. Beat Signer
WISE Lab, Vrije Universiteit Brussel
[email protected]

Research topics
- cross-media information spaces and architectures
- interactive paper and augmented reality
- multimodal and multi-touch interaction

Content of the Seminar
- history of search and web search engines
- search engine optimisation (SEO) and search engine marketing (SEM)
- current and future trends in web search
Early "Documents"
Papyrus
- Greeks and Romans stored information on papyrus scrolls
- tags with a summary of the content facilitated the retrieval of information
- the table of contents was introduced around 100 BC
- parchment (vellum) came up as an alternative, bound in book form
Paper
- invented in China (105 AD)
- brought to Europe only in the twelfth century
- it took another 300 years before paper became the major writing material
- How long will we still use paper? electronic paper vs. augmented paper
Printing Press
- Johann Gutenberg invented the printing press in 1450
- Gutenberg Bible published in 1455
- growing libraries and the need to search for information
Reading Wheel (Bookwheel)
- described by Agostino Ramelli in 1588
- keeps several books open to read from them at the same time, comparable to modern tabbed browsing
- the reading wheel has never really been built
- could be seen as a predecessor of hypertext
Dewey Decimal Classification (DDC)
- library classification system developed by Melvil Dewey in 1876
- hierarchical classification: 10 main classes with 10 divisions each and 10 sections per division, a total of 1000 sections
- often a separate fiction section
- documents can appear in more than one class
Dewey Decimal Classification (DDC) ...
- after the three numbers, decimals can be used for further subclassification
- alternatives: Library of Congress Classification, Universal Decimal Classification (UDC)
Dewey Decimal Classification (DDC) ...
000-099 Computer Science, Information and General Works
  000 Computer Science, Knowledge and Systems
  ...
  005 Computer Programming, Programs and Data
  ...
  009 [Unassigned]
  010 Bibliographies
  ...
100-199 Philosophy and Psychology
200-299 Religion
300-399 Social Sciences
  340 Law
  341 International Law
400-499 Language
500-599 Science
600-699 Technology
700-799 Arts
800-899 Literature
900-999 History, Geography and Biography
"As We May Think" (1945)
"... When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path. The human mind does not work that way. It operates by association. ..."
Vannevar Bush
"As We May Think" (1945) …
"... It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing. ..."
Vannevar Bush, As We May Think, Atlantic Monthly, July 1945
"As We May Think" (1945) …
- Bush's article "As We May Think" (1945) is often seen as the "origin" of hypertext
- the article introduces the Memex, a prototypical hypertext machine
  - store and access information
  - follow cross-references in the form of associative trails between pieces of information (microfilms)
  - trail blazers are those who find delight in the task of establishing useful trails
Hypertext (1965)
- Ted Nelson coined the term hypertext
- Nelson started Project Xanadu in 1960, the first hypertext project
  - nonsequential writing
  - referencing/embedding parts of a document in another document (transclusion), transpointing windows
  - bidirectional (bivisible) links
  - version and rights management
- XanaduSpace 1.0 was released as part of Project Xanadu in 2007
World Wide Web (WWW)
- networked hypertext system (over the Internet) to share information at CERN; first draft in March 1989
- earlier name candidates: The Information Mine, Information Mesh, ...
- components by the end of 1990
  - HyperText Transfer Protocol (HTTP)
  - HyperText Markup Language (HTML)
  - HTTP server software
  - web browser (WorldWideWeb)
- first public "release" in August 1991
Tim Berners-Lee, Robert Cailliau
Search Engine History
Early "search engines" include various systems
starting with Bush's Memex
Archie (1990) first Internet search engine
indexing of files on FTP servers
W3Catalog (September 1993) first "web search engine"
mirroring and integration of manually maintained catalogues
JumpStation (December 1993) first web search engine combining crawling, indexing and
searching
Search Engine History ...
- in the following two years (1994/1995) many new search engines appeared: AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...
- two categories of early web search solutions
  - full-text search
    - based on an index that is automatically created by a web crawler in combination with an indexer
    - e.g. AltaVista or Infoseek
  - manually maintained classification (hierarchy) of webpages
    - significant human editing effort
    - e.g. Yahoo!
Information Retrieval
- precision and recall can be used to measure the performance of different information retrieval algorithms

  precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
  recall = |relevant documents ∩ retrieved documents| / |relevant documents|

- example: out of the collection {D1, ..., D10} the documents {D3, D5, D8, D9} are relevant and the query retrieves {D1, D3, D8, D9, D10}

  precision = 3/5 = 0.6
  recall = 3/4 = 0.75
Information Retrieval ...
- often a combination of precision and recall, the so-called F-score (harmonic mean), is used as a single measure

  F-score = 2 · precision · recall / (precision + recall)

- example 1: relevant {D3, D5, D8, D9}, retrieved {D1, D2, D3, D5, D8, D9, D10}
  precision = 4/7 ≈ 0.57, recall = 1, F-score ≈ 0.73
- example 2: relevant {D3, D5, D8, D9}, retrieved {D1, D3, D8, D9, D10}
  precision = 0.6, recall = 0.75, F-score ≈ 0.67
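The harmonic mean can be sketched in a few lines; both example values above fall out of it:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

round(f_score(4 / 7, 1.0), 2)  # 0.73 (example 1)
round(f_score(0.6, 0.75), 2)   # 0.67 (example 2)
```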
Boolean Model
- based on set theory and Boolean logic
- exact matching of documents to a user query
- uses the Boolean AND, OR and NOT operators
- term-document matrix over documents D1-D6 with terms Bank, Delhaize, Ghent, Metro, Shopping and Train; the rows used in the example query are

             D1 D2 D3 D4 D5 D6
  Delhaize    1  1  1  0  0  0
  Ghent       1  0  0  1  1  1
  Shopping    1  0  1  1  1  0

- query: Shopping AND Ghent AND NOT Delhaize
- computation: 101110 AND 100111 AND 000111 = 000110
- result: document set {D4, D5}
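The bit-string computation above maps directly onto integer bitmasks. A sketch of the query evaluation, restricted to the three term rows the example actually needs:

```python
# Each term row is a 6-bit mask over documents D1..D6 (leftmost bit = D1)
index = {
    "Shopping": 0b101110,
    "Ghent":    0b100111,
    "Delhaize": 0b111000,
}
ALL = 0b111111  # mask covering all six documents

# query: Shopping AND Ghent AND NOT Delhaize
hits = index["Shopping"] & index["Ghent"] & (ALL & ~index["Delhaize"])
docs = [f"D{j + 1}" for j in range(6) if hits & (1 << (5 - j))]
# hits == 0b000110, docs == ["D4", "D5"]
```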
Boolean Model ...
- advantages
  - relatively easy to implement and scalable
  - fast query processing based on parallel scanning of indexes
- disadvantages
  - does not pay attention to synonymy
  - does not pay attention to polysemy
  - no ranking of output
  - often the user has to learn a special syntax, such as the use of double quotes to search for phrases
- variants of the Boolean model form the basis of many search engines
Vector Space Model
- algebraic model representing text documents and queries as vectors based on the index terms (one dimension for each term)
- compute the similarity (angle) between the query vector and the document vectors
- advantages
  - simple model based on linear algebra
  - partial matching with relevance scoring for results
  - potential query reevaluation based on user relevance feedback
- disadvantages
  - computationally expensive (similarity measures for each query)
  - limited scalability
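The angle comparison is usually done via the cosine similarity. A minimal sketch over toy term-frequency vectors (the three-term vocabulary is made up for illustration):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between a query vector and a document vector."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

# Toy index with three terms; vectors hold term frequencies
query = [1, 1, 0]
doc1 = [2, 1, 0]  # contains both query terms
doc2 = [0, 1, 1]  # contains only one query term
cosine_similarity(query, doc1) > cosine_similarity(query, doc2)  # True
```

Note the partial matching: doc2 still receives a non-zero score, which a Boolean AND query would not give it.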
Web Search Engines
- most web search engines are based on traditional information retrieval techniques, but they have to be adapted to deal with the characteristics of the Web
  - immense amount of web resources (>50 billion webpages)
  - hyperlinked resources
  - dynamic content with frequent updates
  - self-organised web resources
- evaluation of performance
  - no standard collections
  - often based on user studies (satisfaction)
- not only precision and recall but also the query answer time is an important issue
What About Old Content?
The Internet Archive
Web Crawler
- a web crawler or spider is used to create an index of webpages to be used by a web search engine; any web search is then based on this index
- a web crawler has to deal with the following issues
  - freshness: the index should be updated regularly (based on webpage update frequency)
  - quality: since not all webpages can be indexed, the crawler should give priority to "high quality" pages
  - scalability: it should be possible to increase the crawl rate by just adding additional servers (modular architecture); e.g. the estimated number of Google servers in 2007 was 1,000,000 (including not only the crawler but the entire Google platform)
Web Crawler ...
  - distribution: the crawler should be able to run in a distributed manner (computer centres all over the world)
  - robustness: the Web contains a lot of pages with errors and a crawler has to deal with these problems; e.g. deal with a web server that creates an unlimited number of "virtual webpages" (crawler trap)
  - efficiency: resources (e.g. network bandwidth) should be used in the most efficient way
  - crawl rates: the crawler should pay attention to existing web server policies (e.g. the revisit-after HTML meta tag or a robots.txt file)

  Example robots.txt:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
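Python's standard library ships a parser for such policy files. The sketch below feeds it the robots.txt from the slide; the crawler name "MyCrawler" is made up, and the `modified()` call marks the rules as loaded (otherwise `can_fetch` conservatively refuses everything):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt shown on the slide
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.modified()                 # mark the rules as freshly retrieved
rp.parse(rules.splitlines())

rp.can_fetch("MyCrawler", "/cgi-bin/search")  # False
rp.can_fetch("MyCrawler", "/index.html")      # True
```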
Web Search Engine Architecture
[Architecture diagram: a crawler fetches pages from the WWW into a page repository via a storage manager (checking whether content was already added); a URL handler with filtering, normalisation and duplicate elimination feeds the URL pool and URL repository; indexers build the document index, inverted index and special indexes; a query handler with ranking serves the client]
Pre-1998 Web Search
- find all documents for a given query term using information retrieval (IR) solutions
  - Boolean model
  - vector space model
  - ...
- ranking based on "on-page factors"
- problem: poor quality of search results (order)
- Larry Page and Sergey Brin proposed to compute the absolute quality of a page, called PageRank
  - based on the number and quality of pages linking to a page (votes)
  - query-independent
Origins of PageRank
- developed as part of an academic project at Stanford University
  - research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies
- Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998), which finally led to a functional prototype called Google
PageRank
- a page Pi has a high PageRank Ri if there are many pages linking to it, or if there are some pages with a high PageRank linking to it
- total score = IR score × PageRank
[Figure: example link graph with pages P1-P8 and their PageRank values R1-R8]
Basic PageRank Algorithm
  PR(Pi) = Σ_{Pj ∈ Bi} PR(Pj) / Lj

- Bi is the set of pages that link to page Pi
- Lj is the number of outgoing links of page Pj
[Figure: three-page example with P1, P2 and P3; iterating the formula from initial ranks of 1 converges to PR(P1) = 1.5, PR(P2) = 1.5 and PR(P3) = 0.75]
Matrix Representation
- let us define a hyperlink matrix H with

  Hij = 1/Lj  if Pj ∈ Bi
  Hij = 0     otherwise

- example with three pages (P1 → P2; P2 → P1, P3; P3 → P1):

      | 0   1/2   1 |
  H = | 1    0    0 |
      | 0   1/2   0 |

- with Ri = PR(Pi) we get R = HR
- R is an eigenvector of H with eigenvalue 1
Matrix Representation ...
- we can use the power method to find R
  - sparse matrix H with 40 billion columns and rows, but only an average of 10 non-zero entries in each column

  R_{t+1} = H R_t

- for our example this results in R ∝ (2, 2, 1), i.e. R = (0.4, 0.4, 0.2)
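A minimal power-iteration sketch for the three-page example (P1 → P2; P2 → P1 and P3; P3 → P1); the column-stochastic matrix H and the limit (0.4, 0.4, 0.2) match the slide:

```python
# Hyperlink matrix H for P1 -> P2; P2 -> P1, P3; P3 -> P1
# (entry H[i][j] = 1/Lj if Pj links to Pi; every column sums to 1)
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

def power_method(H, iterations=100):
    """Iterate R_{t+1} = H R_t starting from the uniform vector."""
    n = len(H)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(H[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R

power_method(H)  # converges to (0.4, 0.4, 0.2)
```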
Dangling Pages (Rank Sink)
- problem with pages that have no outbound links (e.g. P2)
- stochastic adjustment: if page Pj has no outgoing links, replace every entry of column j with 1/n (n = number of pages)
- the new stochastic matrix S always has a stationary vector R
- can also be interpreted as a Markov chain
- example with P1 → P2 and no outgoing links for P2:

  H = | 0  0 |   gives R = (0, 0)
      | 1  0 |

  S = | 0  1/2 |   (the zero column of H replaced by the column (1/2, 1/2))
      | 1  1/2 |
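The column replacement can be written as a small transformation of H; the helper name is ours, and the matrix is the two-page example from this slide:

```python
def stochastic_adjustment(H):
    """Replace each all-zero column (dangling page) with 1/n entries."""
    n = len(H)
    S = [row[:] for row in H]
    for j in range(n):
        if all(S[i][j] == 0 for i in range(n)):
            for i in range(n):
                S[i][j] = 1.0 / n
    return S

H = [[0.0, 0.0],
     [1.0, 0.0]]  # P1 -> P2, P2 has no outgoing links
S = stochastic_adjustment(H)  # [[0.0, 0.5], [1.0, 0.5]]
```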
Strongly Connected Pages (Graph)
- add new transition probabilities between all pages
  - with probability d we follow the hyperlink structure S
  - with probability 1-d we choose a random page
- the resulting Google matrix G becomes irreducible:

  G = d·S + (1-d)·(1/n)·E    (E is the all-ones matrix)

  R = GR

- the Google matrix G reflects a random surfer (no modelling of the back button)
[Figure: five-page example P1-P5 with additional 1-d random-jump transitions]
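Putting the pieces together, PageRank can be sketched as power iteration on G. The damping factor d = 0.85 is an assumption (a commonly quoted value; the slides leave d unspecified), and the two-page matrix S is the dangling-page example:

```python
def google_matrix(S, d=0.85):
    """G = d*S + (1-d)/n * E, with E the all-ones matrix."""
    n = len(S)
    return [[d * S[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

def pagerank(S, d=0.85, iterations=100):
    """Power iteration R_{t+1} = G R_t from the uniform start vector."""
    G = google_matrix(S, d)
    n = len(S)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(G[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R

S = [[0.0, 0.5], [1.0, 0.5]]  # two-page example after stochastic adjustment
R = pagerank(S)               # P2 accumulates more rank than P1
```

Since G is column-stochastic, R stays a probability vector throughout the iteration.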
Examples
- PageRank values computed with G = d·S + (1-d)·(1/n)·E
[Figure: three-page example; A1 = 0.26, A2 = 0.37, A3 = 0.37]
Examples ...
[Figure: two unconnected three-page clusters A and B; A1 = 0.13, A2 = 0.185, A3 = 0.185, B1 = 0.13, B2 = 0.185, B3 = 0.185; P(A) = 0.5, P(B) = 0.5]
Examples
PageRank leakage
[Figure: a link from cluster A to cluster B shifts rank to B; A1 = 0.10, A2 = 0.14, A3 = 0.14, B1 = 0.22, B2 = 0.20, B3 = 0.20; P(A) = 0.38, P(B) = 0.62]
Examples ...
[Figure: two three-page clusters; A1 = 0.3, A2 = 0.23, A3 = 0.18, B1 = 0.10, B2 = 0.095, B3 = 0.095; P(A) = 0.71, P(B) = 0.29]
Examples
PageRank feedback
[Figure: A1 = 0.35, A2 = 0.24, A3 = 0.18, B1 = 0.09, B2 = 0.07, B3 = 0.07; P(A) = 0.77, P(B) = 0.23]
Examples ...
[Figure: cluster A extended with a page A4; A1 = 0.33, A2 = 0.17, A3 = 0.175, A4 = 0.125, B1 = 0.08, B2 = 0.06, B3 = 0.06; P(A) = 0.80, P(B) = 0.20]
Implications for Website Development
- first make sure that your page gets indexed (on-page factors)
- think about your site's internal link structure
  - create many internal links for important pages
  - be "careful" about where to put outgoing links
- increase the number of pages
- ensure that webpages are addressed consistently
  - e.g. http://www.vub.ac.be vs. http://www.vub.ac.be/index.php
- make sure that you get incoming links from good websites
Tools
- Google Toolbar
  - shows a logarithmic PageRank value (from 0 to 10)
  - information not frequently updated ("Google dance")
- Google Webmaster Tools
  - accepts a sitemap (XML document) with the structure of a website
  - variety of reports that help to improve the quality of a website
    - meta description issues
    - title tag issues
    - non-indexable content issues
    - number and URLs of indexed pages
    - number and URLs of inbound/outbound links
    - ...
Questions
- Is PageRank fair?
- What about Google's power and influence?
- What about Web 2.0 or Web 3.0 and web search?
  - "non-existent" webpages, such as those offered by Rich Internet Applications (e.g. Ajax), may bring problems for traditional search engines (hidden web)
  - new forms of social search (Wikia Search, Delicious, ...)
  - social marketing
HITS Algorithm
- Hypertext Induced Topic Search (Jon Kleinberg)
  - developed around the same time as Page and Brin invented PageRank
- uses the link structure, like PageRank, to compute a popularity score
- differences from PageRank
  - two popularity values for each page (hub and authority score)
  - note that the values are not query-independent
  - the user gets a ranked hub and authority list
HITS Algorithm ...
- good authorities are linked by good hubs and good hubs link to good authorities
- compute the impact of authorities and hubs similarly to PageRank (but only on a limited set of result pages!)

  initialise each page with an authority and hub score of 1
  repeat {
    compute new authority scores
    compute new hub scores
    normalise authority and hub scores
  }
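The loop above can be sketched in Python; the dict-of-outgoing-links encoding and the fixed iteration count are our choices (a real implementation would run on the query's result subgraph and iterate to convergence):

```python
import math

def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}  # authority scores
    hub = {p: 1.0 for p in pages}   # hub scores
    for _ in range(iterations):
        # a good authority is linked to by good hubs
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # a good hub links to good authorities
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalise both score vectors
        na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# P1 and P2 both link to P3: P3 becomes the authority, P1/P2 equal hubs
auth, hub = hits({"P1": ["P3"], "P2": ["P3"]})
```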
Meta Search Engines
- search tool that sends a query to multiple search engines
- aggregates the individual results on a single result page
- MetaCrawler is an example of a meta search engine that uses different search engines (Google, Bing, Yahoo!, ...)
Search Engine Market Share
Conclusions
- web information retrieval techniques have to deal with the specific characteristics of the Web
- PageRank algorithm
  - absolute quality of a page based on incoming links
  - based on the random surfer model
  - computed as an eigenvector of the Google matrix G
  - PageRank is just one (important) factor
- implications for website development and SEO
References
- Vannevar Bush, As We May Think, Atlantic Monthly, July 1945
  http://www.theatlantic.com/doc/194507/bush/
  http://sloan.stanford.edu/MouseSite/Secondary.html
- L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998
- S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998
References …
- Amy N. Langville and Carl D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, July 2006
- PageRank Calculator: http://www.webworkshop.net/pagerank_calculator.php
- Google Webmaster Tools: http://www.google.com/webmasters/
Next Lecture: Search Engine Optimisation (SEO) and Search Engine Marketing (SEM)