The Fragmented Web
Notes on Chapter 12
For In765
Judith Molka-Danielsen
1. Virtual robots
• Virtual robots read and index web pages.
• The Web would be hard to navigate without them.
• But some pages are never mapped.
• Simple search engines can return too much.
• Meta-search engines select hits across engines.
• www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
Steve Lawrence and C. Lee Giles attempted to measure the size of the Web in 1999: http://www.neci.nj.nec.com/homepages/lawrence/websize.html
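At its core, a virtual robot is a fetch-parse-enqueue loop. A minimal sketch follows; the three-page "web" is invented for illustration, and the `fetch` function is injected so the example stays offline (a real robot would use HTTP, politeness delays, and robots.txt):

```python
from collections import deque
import re

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, index its words, enqueue unseen links.

    Pages that fail to fetch, or that nothing links to, are simply never
    mapped -- mirroring the note above that some pages stay unindexed.
    """
    index = {}                           # url -> set of words on the page
    queue, seen = deque([seed]), {seed}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue                     # unreachable page: never mapped
        text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping
        index[url] = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Toy "web" of three pages; page C has no inbound links, so it is never found.
pages = {
    "A": '<a href="B">to B</a> apples',
    "B": '<a href="A">back</a> bananas',
    "C": "invisible island page",
}
idx = crawl("A", pages.__getitem__)
print(sorted(idx))        # ['A', 'B'] -- C stays unmapped
```

The injected `fetch` also makes the distinction concrete: the robot only ever sees what the link graph reaches from its seeds.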
2. Relevancy
• Finding the “best” page is more important than finding the “most” pages.
• Notes on Searching the Web: http://home.himolde.no/~molka/in350/week9y01.htm
Precision: the proportion of retrieved documents that are relevant.
P = W2 / N2, where W2 = number retrieved that are relevant, N2 = total number retrieved.
Recall: the proportion of relevant documents that are retrieved.
R = W1 / N1, where W1 = number relevant that are retrieved, N1 = total number relevant.
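With those definitions, precision and recall are one-liners over sets; the document IDs below are made up for illustration:

```python
def precision(retrieved, relevant):
    """W2 / N2: fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """W1 / N1: fraction of relevant documents that are retrieved."""
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}      # N2 = 4
relevant  = {"d1", "d2", "d5"}            # N1 = 3
print(precision(retrieved, relevant))     # 2/4 = 0.5
print(recall(retrieved, relevant))        # 2/3
```

Note the tension: retrieving everything drives recall to 1 while ruining precision, which is why "best" beats "most".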
Determining PageRank
http://www.whitelines.nl/html/google-page-rank.html#example
• According to Sergey Brin and Lawrence (Larry) Page, co-founders of Google, the PR of a web page is calculated using this formula:
• PR(A) = (1 - d) + d * SUM(PR(I->A) / C(I))
• Where:
– PR(A) is the PageRank of your page A.
– d is the damping factor, usually set to 0.85.
– PR(I->A) is the PageRank of a page I containing a link to page A.
– C(I) is the number of links off page I.
– PR(I->A) / C(I) is the PR value page A receives from page I.
– SUM(PR(I->A) / C(I)) is the sum of all PR values page A receives from pages with links to page A.
• In other words: the PR of a page is determined by the PR of every page I that has a link to page A. For every page I that points to page A, the PR of page I is divided by the number of links from page I. These values are summed and multiplied by 0.85 (d); finally 0.15 (1 - d) is added, and this result is the PR of page A.
• What is your PageRank? http://www.klid.dk/pagerank.php?url=
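The formula above is circular (each page's PR depends on other pages' PR), so in practice it is solved by iteration. A minimal sketch, using the exact formula from these notes; the four-page link graph is invented, and real implementations also handle dangling pages and other normalization details:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * SUM(PR(I)/C(I)) over pages I linking to A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # common starting guess
    for _ in range(iterations):
        new = {}
        for a in pages:
            incoming = (pr[i] / len(links[i]) for i in pages if a in links[i])
            new[a] = (1 - d) + d * sum(incoming)
        pr = new
    return pr

# Hypothetical graph: links[p] is the set of pages p links out to.
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Note how D, which nothing links to, bottoms out at exactly 1 - d = 0.15, while C, with three inbound links, ends up ranked highest.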
Search Engine Showdown, by Greg R. Notess:
http://www.searchengineshowdown.com/stats/sizeest.shtml
Search engine    Total size estimate (millions)    Claim (millions)
Google           3,033                             3,083
AlltheWeb        2,106                             2,112
AltaVista        1,689                             1,000
WiseNut          1,453                             1,500
HotBot           1,147                             3,000
MSN Search       1,018                             3,000
Teoma            1,015                             500
NLResearch       733                               125
Gigablast        275                               150
Data from: Dec. 31, 2002
Relative size: AlltheWeb reported 2,106,156,957 pages; the percentages in the relative-size showdown are taken from that reported figure. Total-size reports follow below.
Older Reports, with the Largest Three at that Time
March 2002: Google, WiseNut, AlltheWeb
August 2001: Google, Fast, WiseNut
April 2001: Google, Fast, MSN (Inktomi)
Oct. 2000: Fast, Google, Northern Light
July 2000: iWon, Google, AltaVista
April 2000: Fast, AltaVista, Northern Light
Feb. 2000: Fast, Northern Light, AltaVista
Jan. 2000: Fast, Northern Light, AltaVista
Nov. 1999: Northern Light, Fast, AltaVista
Sept. 1999: Fast, Northern Light, AltaVista
Aug. 1999: Fast, Northern Light, AltaVista
May 1999: Northern Light, AltaVista, Anzwers
March 1999: Northern Light, AltaVista, HotBot
January 1999: Northern Light, AltaVista, HotBot
August 1998: AltaVista, Northern Light, HotBot
May 1998: AltaVista, HotBot, Northern Light
February 1998: HotBot, AltaVista, Northern Light
October 1997: AltaVista, HotBot, Northern Light
September 1997: Northern Light, Excite, HotBot
June 1997: HotBot, AltaVista, Infoseek
October 1996: HotBot, Excite, AltaVista
Freshness
Search engine    Newest page found    Rough average    Oldest page found
MSN (Ink.)       1 day                4 weeks          51 days
HotBot (Ink.)    1 day                4 weeks          51 days
Google           2 days               1 month          165 days
AlltheWeb        1 day                1 month          599 days
AltaVista        0 days               3 months         108 days
Gigablast        45 days              7 months         381 days
Teoma            41 days              2.5 months       81 days
WiseNut          133 days             6 months         183 days
Billions of Textual Documents Indexed, December 1995 - September 2003:
http://searchenginewatch.com/reports/article.php/2156481
3. URLs are directed links.
• Andrei Broder (2000): http://www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf
• Pages are either static HTML or database-driven/generated on demand.
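Because links are directed, reachability on the Web is asymmetric: being able to follow links from A to C says nothing about getting from C back to A. This asymmetry is what underlies Broder's macrostructure of the Web. A tiny sketch on an invented three-page graph:

```python
def reachable(links, start):
    """Set of pages reachable from `start` by following directed links."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in links.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical graph: A links to B, B links to C, and nothing links back.
links = {"A": {"B"}, "B": {"C"}, "C": set()}
print(reachable(links, "A"))   # {'A', 'B', 'C'}
print(reachable(links, "C"))   # {'C'} -- direction matters
```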
4. Defining Web-based communities
• 15% of web pages have links to opposing views.
• 60% of web pages have links to like views.
• Social segmentation is self-reinforcing.
• Beliefs and affiliations have become public information, represented in links and visits.
• Web-based communities are hard to identify: no boundaries; different sizes; differently organized.
• Pages with more internal links than outside links may be identified as a community. But there is no efficient algorithm for finding such sets.
Other points…
• 5. Technology can allow more control over individuals: identifying them, tracking them.
• Web topology (an architecture built by self-selecting where to link) limits our actions (browsing; some pages are invisible) more than the code does (attempts at control, laws).
• 6. The Internet Archive has been maintained since 1996 by Brewster Kahle. Some data will never go away.
• http://www.archive.org/ (Try the WayBack Machine.)
• 7. The Web is complex and self-organized. The authors started by looking at the macrostructure; the last chapters will look at the smaller groupings.