WWW

What is the Web?

• Not the internet
• Websites, pages on different computers, linked via hyperlinks. An enormous graph.
• No central planning: created by independent actions of millions
• Sometimes front-ends of databases served via web pages. E.g., car search
• Over 1 trillion unique indexed URLs
  http://news.softpedia.com/news/Google-Reached-1-Trillion-Indexed-Pages-90864.shtml

History

• 1980-1991
  – Tim Berners-Lee @ European Organization for Nuclear Research (CERN)
  – Info for physicists - no uniformity, access
  – by 1990 http
  – took long for anybody to pay attention
• 1992-1995
  – Universities get on board
  – Initially all text-based [gopher, Lynx]

more History

• 1993 Mosaic @ UIUC/NCSA
  – graphical content capabilities, fueled rapid growth
• 1994
  – Web organizations formed: W3C
  – “Hot lists” - pages of organized bookmarks
  – “Yet another hierarchical officious oracle” [Yahoo!]
• 1996-1998
  – rapid commercialization, dawn of e-commerce
  – browser wars: Netscape 80%; by 2001 Explorer 90%

even more History

• 1999-2001
  – dot.com boom
  – dot.com bust
• 2002-present
  – shakeup, rise of giants: Amazon, Yahoo, Google, eBay, PayPal
  – youth culture: MySpace, Facebook, Napster
  – democratization of web: blogging, Flickr, Wikipedia, YouTube, Twitter
  – some of major players only a few years old!

Things to know

• http(s)
• url anatomy
• hyperlinks
• cookies
• caches
• plugins
• applets
• deep web

READ: www.internettutorials.net

Things to know

• http(s)
  – protocol followed by computers communicating and transferring web pages (securely)
• url anatomy
  – example with file path: http://admin.illinois.edu/policy/code/article1_part1_1-101.html
  – with dbase query: http://illinois.edu/ricker/CampusMap?buildingID=43&target=displayHighlight
• hyperlinks
  – url at bottom of window or in address bar
  – what happens when you click?
• cookies
  – sites store text information about you on your computer
  – allows customized web sessions but privacy or security concern?
  – can delete, refuse cookies
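The URL anatomy above can be made concrete by splitting the database-query URL from the slide into its parts. This is a minimal sketch using Python's standard `urllib.parse` module; the variable names are illustrative only.

```python
from urllib.parse import urlparse, parse_qs

# The database-query URL from the slide.
url = "http://illinois.edu/ricker/CampusMap?buildingID=43&target=displayHighlight"

parts = urlparse(url)          # split the URL into its components
query = parse_qs(parts.query)  # decode the part after "?" into a dict

print("protocol:", parts.scheme)  # the http(s) part
print("host:    ", parts.netloc)  # which computer to contact
print("path:    ", parts.path)    # which resource on that computer
print("query:   ", query)         # parameters passed to the database front-end
```

The file-path URL on the slide parses the same way; it simply has a longer path and an empty query.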

Things to know

• caches
  – browser stores copies of what you’ve visited… images… text
  – privacy/security/performance concerns.
  – can delete at expense of performance
• plugins
  – software extending browser’s capabilities to view different types of content. Browsers come with some built in.
  – Security: somebody else’s program running on your computer
• applets
  – program transmitted to your computer that you run
  – security issues
• deep web
  – dynamic web pages, dbases, secure pages, multimedia
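The caching trade-off ("can delete at expense of performance") can be sketched as a toy model. This is not how any real browser implements its cache: `fetch_from_network` is a hypothetical stand-in for a slow download, and the cache is just a dictionary keyed by URL.

```python
fetch_count = 0  # how many times we had to go to the "network"

def fetch_from_network(url):
    """Hypothetical stand-in for a real, slow download."""
    global fetch_count
    fetch_count += 1
    return f"<html>content of {url}</html>"

cache = {}  # url -> stored copy of the page

def get_page(url):
    """Return the cached copy if there is one, otherwise fetch and store it."""
    if url not in cache:
        cache[url] = fetch_from_network(url)
    return cache[url]

get_page("http://example.com/")
get_page("http://example.com/")   # second visit is served from the cache
after_two_visits = fetch_count

cache.clear()                     # the user deletes the cache
get_page("http://example.com/")   # the page must be downloaded again
after_clear = fetch_count

print(after_two_visits, after_clear)
```

The stored copies are also the privacy concern: anyone who can read `cache` can see which pages were visited.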

Crawling and storing

• Web crawlers… how do they work?
• Google recorded 1 trillion unique URLs
• Back-of-envelope calculation for TEXT pages:
  1 trillion x 10 KB/page = 10 trillion KB
  = 10 quadrillion bytes
  = 10 petabytes
  = 10,000 terabytes
  = 10,000 disks (at 1 TB per disk) x $100/disk
  = $1,000,000
• Actual Google specs kept secret, estimates around 2006: 450,000 servers.
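The back-of-envelope estimate can be checked in a few lines. The page count, page size, and disk price come from the slide; the 1 TB disk capacity and the decimal units (1 KB = 1,000 bytes) are assumptions needed to make the slide's numbers line up.

```python
pages = 10**12                 # 1 trillion URLs (from the slide)
bytes_per_page = 10 * 1000     # 10 KB of text per page, decimal units (assumption)
disk_capacity_bytes = 10**12   # 1 TB per disk (assumption implied by the slide)
cost_per_disk = 100            # dollars (from the slide)

total_bytes = pages * bytes_per_page
petabytes = total_bytes // 10**15
terabytes = total_bytes // 10**12
disks = total_bytes // disk_capacity_bytes
total_cost = disks * cost_per_disk

print(f"{total_bytes:,} bytes = {petabytes} PB = {terabytes:,} TB")
print(f"{disks:,} disks x ${cost_per_disk} = ${total_cost:,}")
```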

Searching the web

• What is Lenny Pitt’s phone number?
  – simple dbase lookup, simple keyword search
• In database, find credit-worthy consumers
  – AI/learning problem
• Find web pages relevant to “computer music”
• Among cell phone conversations from country X, identify suspicious ones
• Search all religion and philosophy books for the meaning of life

thanks to Sanjeev Arora for this slide

Searching the web

• Find “computer music”
  – Search all pages for the phrase?
  – Sort according to number of occurrences?
  – Human staff answers questions?
• Pitfalls
  – Spamming by unscrupulous websites
  – Synonyms
  – Homographs [homonym]
  – Polysemes [bank = institution or building or verb?]

thanks to Sanjeev Arora for this slide
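The "count the occurrences" idea and two of its pitfalls can be seen in a small sketch. The three pages below are invented for illustration: one genuine page, one spam page that repeats the phrase, and one relevant page that uses different words for the same topic.

```python
# Hypothetical page texts, made up for this example.
pages = {
    "music-dept": "computer music courses and computer music research",
    "spam-shop": "computer music " * 50 + "buy cheap watches",
    "synth-blog": "notes on digital audio synthesis and algorithmic composition",
}

def rank_by_phrase_count(pages, phrase):
    """Naive search: keep pages containing the phrase, sort by number of occurrences."""
    phrase = phrase.lower()
    counts = {name: text.lower().count(phrase) for name, text in pages.items()}
    hits = [(name, count) for name, count in counts.items() if count > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)

ranking = rank_by_phrase_count(pages, "computer music")
print(ranking)
```

The spam page wins simply by repeating the phrase, and the synthesis blog is never found because it describes the topic with synonyms.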

Exploit Link Structure!

• Example: PageRank [Google] http://en.wikipedia.org/wiki/PageRank
• Ideas: PR = how many pages vote for you
• BUT!: some votes are more important (those from pages with higher PR)

[Figure] C has higher PageRank than E

Page Rank and Random Walks

• Random Walk Method 1
  – Choose equally from outgoing links.
  – Walk for a long time
  – PR(X) is probability you end up at page X
• Random Walk Method 2
  – Choose equally from all pages on the web.
• PR algorithm is an interpolation between both methods: models a random surfer who gets bored after several clicks and switches to a random page.
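The random-surfer picture can be simulated directly. This is a rough sketch, not Google's algorithm: the four-page graph matches the A, B, C, D example worked through in these slides, and the 0.85 probability of following a link (versus jumping to a random page) is an assumption borrowed from the weights used in that example.

```python
import random

# Four-page link graph: page -> pages it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

def random_surfer(links, steps=100_000, follow_prob=0.85, seed=0):
    """Estimate how often a bored random surfer ends up on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        if links[page] and rng.random() < follow_prob:
            page = rng.choice(links[page])  # Method 1: follow a random outgoing link
        else:
            page = rng.choice(pages)        # Method 2: get bored, jump to any page
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = random_surfer(links)
for page, share in sorted(freq.items(), key=lambda item: -item[1]):
    print(page, round(share, 3))
```

The fraction of time spent on a page plays the role of PR(X): pages that many others point to are visited most often.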

[Figure: link graph on four pages A, B, C, D]

• Assume that A links to all pages equally
• Assume initially that PR of all pages is 1/N = .25
• Compute new PR for each node:
  – .15(old PR) + .85(“votes” from other nodes)
  – B’s votes: ½ of its PR to A, ½ to C.

Round 1: New PRs:

PRA = .15(PRA) + .85(PRB/2 + PRC/1 + PRD/3) = .15(.25) + .85(.25/2 + .25 + .25/3) = .4271

PRB = .15(PRB) + .85(PRA/3 + 0 + PRD/3) = .15(.25) + .85(.25/3 + .25/3) = .1792

PRC = .15(.25) + .85(.25/3 + .25/2 + .25/3) = .2854

PRD = .15(.25) + .85(.25/3) = .10833..

Now substitute the new PRs into equations to get newer PRs

PRA = .15(PRA) + .85(PRB/2 + PRC/1 + PRD/3) = .15(.4271) + .85(.1792/2 + .2854 + .10833/3) = .4135

PRB = .15(.1792) + .85(.4271/3 + .10833/3) = .1786

Etc. Continue until values “settle down”.
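The rounds above can be automated. The sketch below applies the update rule exactly as written on the slide, .15(old PR) + .85(votes), to the four-page graph; the link lists are read off the equations (A links to B, C, D; B to A, C; C to A; D to A, B, C). Function and variable names are illustrative.

```python
# Link graph read off the slide's equations: page -> pages it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

def update_round(pr, links, keep=0.15, vote=0.85):
    """One round: new PR = keep * (old PR) + vote * (votes from linking pages)."""
    new_pr = {}
    for page in pr:
        # Each page splits its PR equally among the pages it links to.
        votes = sum(pr[src] / len(out) for src, out in links.items() if page in out)
        new_pr[page] = keep * pr[page] + vote * votes
    return new_pr

start = {page: 1 / len(links) for page in links}  # every page starts at 1/N
round1 = update_round(start, links)
round2 = update_round(round1, links)

# Keep going until the values "settle down".
settled = round2
for _ in range(1000):
    nxt = update_round(settled, links)
    done = max(abs(nxt[p] - settled[p]) for p in settled) < 1e-9
    settled = nxt
    if done:
        break

for name, values in [("round 1", round1), ("round 2", round2), ("settled", settled)]:
    print(name, {p: round(v, 4) for p, v in values.items()})
```

Because every page passes on all of its PR, the values always add up to 1, which is a quick sanity check on any hand calculation.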

• A page's PageRank = 0.15/N + 0.85 * (a "share" of the PageRank of every page that links to it)

• A link from a page with PR=.4 and 5 outbound links is worth more than a link from a page with PR=.8 and 100 outbound links.

• The PageRank of a page that links to yours is important but the number of links on that page is also important. The more links there are on a page, the less PageRank value your page will receive from it.

• Executed ahead of time (when indexing all documents).
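The "share" idea can be checked numerically, together with the 0.15/N form of the formula stated above. This is a simplified sketch: it assumes every page has at least one outgoing link, reuses the four-page A, B, C, D graph from the worked example, and the function names are illustrative.

```python
def link_share(pr, outbound_links):
    """PageRank passed along a single link: the page's PR split over its links."""
    return pr / outbound_links

focused = link_share(0.4, 5)    # PR = .4, 5 outbound links
crowded = link_share(0.8, 100)  # PR = .8, 100 outbound links
print(focused, crowded)         # the first link is worth ten times more

def pagerank(links, damping=0.85, rounds=100):
    """PR(page) = (1 - damping)/N + damping * (sum of shares from linking pages)."""
    n = len(links)
    pr = dict.fromkeys(links, 1 / n)
    for _ in range(rounds):
        pr = {
            page: (1 - damping) / n
            + damping * sum(link_share(pr[src], len(out))
                            for src, out in links.items() if page in out)
            for page in links
        }
    return pr

links = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
print({page: round(value, 3) for page, value in ranks.items()})
```

Since the ranks depend only on the link graph and not on the query, they can be computed once at indexing time and reused for every search.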

in praise of XML

• HTML:
  <h1> Introduction </h1>
  Tags specify formatting information, not content.
• XML:
  <chapter> Introduction </chapter>
  <author> Peg Babcock </author>
  Tags specify content. Separate file gives formatting information for various content areas.
• Advantages:
  – Helps search engines
  – Easy to make global format changes
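Why content tags help search engines can be seen with Python's standard `xml.etree.ElementTree` parser. The `<book>` wrapper element is an assumption added here so the slide's two tags form one well-formed document.

```python
import xml.etree.ElementTree as ET

# The slide's XML tags, wrapped in a hypothetical <book> root element.
document = """
<book>
  <chapter> Introduction </chapter>
  <author> Peg Babcock </author>
</book>
"""

root = ET.fromstring(document.strip())

# Because tags name the content, a program can ask for it directly.
author = root.findtext("author").strip()
chapters = [chapter.text.strip() for chapter in root.iter("chapter")]

print("author:  ", author)
print("chapters:", chapters)
```

With the HTML version, a program sees only that "Introduction" is a level-1 heading; nothing says it is a chapter title or who wrote it.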