The Fragmented Web
Notes on Chapter 12
For In765
Judith Molka-Danielsen
1. Virtual robots
• Virtual robots read and index web pages.
• The Web would be hard to navigate without them.
• But some pages are never mapped.
• Simple search engines can return too much.
• Meta-search engines select hits across engines.
• www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
Steve Lawrence and C. Lee Giles attempted to measure the size of the Web in 1999: http://www.neci.nj.nec.com/homepages/lawrence/websize.html
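At its core, a virtual robot is a fetch-parse-enqueue loop. A minimal sketch follows; the three-page "web" is invented for illustration, and the `fetch` function is injected so the example stays offline (a real robot would use HTTP, politeness delays, and robots.txt):

```python
from collections import deque
import re

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, index its words, enqueue unseen links.

    Pages that fail to fetch, or that nothing links to, are simply never
    mapped -- mirroring the note above that some pages stay unindexed.
    """
    index = {}                           # url -> set of words on the page
    queue, seen = deque([seed]), {seed}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue                     # unreachable page: never mapped
        text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping
        index[url] = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Toy "web" of three pages; page C has no inbound links, so it is never found.
pages = {
    "A": '<a href="B">to B</a> apples',
    "B": '<a href="A">back</a> bananas',
    "C": "invisible island page",
}
idx = crawl("A", pages.__getitem__)
print(sorted(idx))        # ['A', 'B'] -- C stays unmapped
```

The injected `fetch` also makes the distinction concrete: the robot only ever sees what the link graph reaches from its seeds.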
2. Relevancy
• Finding the “best” page is more important than finding the “most” pages.
• Notes on Searching the Web: http://home.himolde.no/~molka/in350/week9y01.htm
Precision: the proportion of retrieved documents that are relevant.
P = W2 / N2, where W2 = number retrieved that are relevant, N2 = total number retrieved.
Recall: the proportion of relevant documents that are retrieved.
R = W1 / N1, where W1 = number relevant that are retrieved, N1 = total number relevant.
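With those definitions, precision and recall are one-liners over sets; the document IDs below are made up for illustration:

```python
def precision(retrieved, relevant):
    """W2 / N2: fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """W1 / N1: fraction of relevant documents that are retrieved."""
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}      # N2 = 4
relevant  = {"d1", "d2", "d5"}            # N1 = 3
print(precision(retrieved, relevant))     # 2/4 = 0.5
print(recall(retrieved, relevant))        # 2/3
```

Note the tension: retrieving everything drives recall to 1 while ruining precision, which is why "best" beats "most".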
Determining PageRank
http://www.whitelines.nl/html/google-page-rank.html#example
• According to Sergey Brin and Lawrence (Larry) Page, co-founders of Google, the PR of a web page is calculated using this formula:
• PR(A) = (1 - d) + d * SUM(PR(I->A) / C(I))
• Where:
– PR(A) is the PageRank of your page A.
– d is the damping factor, usually set to 0.85.
– PR(I->A) is the PageRank of a page I containing a link to page A.
– C(I) is the number of links off page I.
– PR(I->A) / C(I) is the PR value page A receives from page I.
– SUM(PR(I->A) / C(I)) is the sum of all PR values page A receives from pages with links to page A.
• In other words: the PR of a page is determined by the PR of every page I that has a link to page A. For every page I that points to page A, the PR of page I is divided by the number of links from page I. These values are summed and multiplied by 0.85 (d); finally 0.15 (1 - d) is added, and this result is the PR of page A.
• What is your PageRank? http://www.klid.dk/pagerank.php?url=
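The formula above is circular (each page's PR depends on other pages' PR), so in practice it is solved by iteration. A minimal sketch, using the exact formula from these notes; the four-page link graph is invented, and real implementations also handle dangling pages and other normalization details:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * SUM(PR(I)/C(I)) over pages I linking to A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # common starting guess
    for _ in range(iterations):
        new = {}
        for a in pages:
            incoming = (pr[i] / len(links[i]) for i in pages if a in links[i])
            new[a] = (1 - d) + d * sum(incoming)
        pr = new
    return pr

# Hypothetical graph: links[p] is the set of pages p links out to.
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Note how D, which nothing links to, bottoms out at exactly 1 - d = 0.15, while C, with three inbound links, ends up ranked highest.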
Search Engine Showdown, by Greg R. Notess:
http://www.searchengineshowdown.com/stats/sizeest.shtml
Search engine    Total size estimate (millions)    Claim (millions)
Google           3,033                             3,083
AlltheWeb        2,106                             2,112
AltaVista        1,689                             1,000
WiseNut          1,453                             1,500
HotBot           1,147                             3,000
MSN Search       1,018                             3,000
Teoma            1,015                             500
NLResearch       733                               125
Gigablast        275                               150
Data from: Dec. 31, 2002
Relative size: AlltheWeb reported 2,106,156,957 pages; the percentages in the relative-size showdown are taken from that reported figure. Total-size reports follow below.
Older Reports, with the Largest Three at that Time
March 2002: Google, WiseNut, AlltheWeb
August 2001: Google, Fast, WiseNut
April 2001: Google, Fast, MSN (Inktomi)
Oct. 2000: Fast, Google, Northern Light
July 2000: iWon, Google, AltaVista
April 2000: Fast, AltaVista, Northern Light
Feb. 2000: Fast, Northern Light, AltaVista
Jan. 2000: Fast, Northern Light, AltaVista
Nov. 1999: Northern Light, Fast, AltaVista
Sept. 1999: Fast, Northern Light, AltaVista
Aug. 1999: Fast, Northern Light, AltaVista
May 1999: Northern Light, AltaVista, Anzwers
March 1999: Northern Light, AltaVista, HotBot
January 1999: Northern Light, AltaVista, HotBot
August 1998: AltaVista, Northern Light, HotBot
May 1998: AltaVista, HotBot, Northern Light
February 1998: HotBot, AltaVista, Northern Light
October 1997: AltaVista, HotBot, Northern Light
September 1997: Northern Light, Excite, HotBot
June 1997: HotBot, AltaVista, Infoseek
October 1996: HotBot, Excite, AltaVista
Freshness
Search engine    Newest page found    Rough average    Oldest page found
MSN (Ink.)       1 day                4 weeks          51 days
HotBot (Ink.)    1 day                4 weeks          51 days
Google           2 days               1 month          165 days
AlltheWeb        1 day                1 month          599 days
AltaVista        0 days               3 months         108 days
Gigablast        45 days              7 months         381 days
Teoma            41 days              2.5 months       81 days
WiseNut          133 days             6 months         183 days
Billions of Textual Documents Indexed, December 1995 - September 2003:
http://searchenginewatch.com/reports/article.php/2156481
3. URLs are directed links.
• Andrei Broder (2000): http://www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf
• Pages are either static HTML or database-driven/generated on demand.
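Because links are directed, reachability on the Web is asymmetric: being able to follow links from A to C says nothing about getting from C back to A. This asymmetry is what underlies Broder's macrostructure of the Web. A tiny sketch on an invented three-page graph:

```python
def reachable(links, start):
    """Set of pages reachable from `start` by following directed links."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in links.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical graph: A links to B, B links to C, and nothing links back.
links = {"A": {"B"}, "B": {"C"}, "C": set()}
print(reachable(links, "A"))   # {'A', 'B', 'C'}
print(reachable(links, "C"))   # {'C'} -- direction matters
```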
4. Defining Web-based communities
• 15% of web pages have links to opposing views.
• 60% of web pages have links to like views.
• Social segmentation is self-reinforcing.
• Beliefs and affiliations have become public information, represented in links and visits.
• Web-based communities are hard to identify: no boundaries; different sizes; differently organized.
• Pages with more internal links than outside links may be identified as a community. But there is no efficient algorithm for finding such sets.
Other points…
• 5. Technology can allow more control over individuals: identifying them, tracking them.
• Web topology (an architecture built by self-selecting where to link) limits our actions (browsing; some pages are invisible) more than the code does (attempts at control, laws).
• 6. The Internet Archive has been maintained since 1996 by Brewster Kahle. Some data will never go away.
• http://www.archive.org/ (Try the WayBack Machine.)
• 7. The Web is complex and self-organized. The authors started by looking at the macrostructure; the last chapters will look at the smaller groupings.