WWW
What is the Web?
• Not the internet
• Websites, pages on different computers linked via hyperlinks. An enormous graph.
• No central planning: created by independent actions of millions
• Sometimes front-ends of databases served via web pages. E.g., car search
• Over 1 trillion unique indexed URLs
  http://news.softpedia.com/news/Google-Reached-1-Trillion-Indexed-Pages-90864.shtml
History
• 1980-1991
  – Tim Berners-Lee @ European Organization for Nuclear Research (CERN)
  – Info for physicists - no uniformity, access
  – by 1990 http
  – took long for anybody to pay attention
• 1992-1995
  – Universities get on board
  – Initially all text-based [gopher, Lynx]
more History
• 1993 Mosaic @ UIUC/NCSA
  – graphical content capabilities, fueled rapid growth
• 1994
  – Web organizations formed: W3C
  – “Hot lists” - pages of organized bookmarks
  – “Yet another hierarchical officious oracle”
• 1996-1998
  – rapid commercialization, dawn of e-commerce
  – browser wars: Netscape 80%; by 2001 Explorer 90%
even more History
• 1999-2001
  – dot.com boom
  – dot.com bust
• 2002-present
  – shakeup, rise of giants: Amazon, Yahoo, Google, eBay, Paypal
  – youth culture: myspace, facebook, napster
  – democratization of web: blogging, flickr, wikipedia, youtube, twitter
  – some of major players only a few years old!
Things to know
• http(s)
• url anatomy
• hyperlinks
• cookies
• caches
• plugins
• applets
• deep web
READ: www.internettutorials.net
Things to know
• http(s)
  – protocol followed by computers communicating and transferring web pages (securely)
• url anatomy
  – example with file path: http://admin.illinois.edu/policy/code/article1_part1_1-101.html
  – with dbase query: http://illinois.edu/ricker/CampusMap?buildingID=43&target=displayHighlight
• hyperlinks
  – url at bottom of window or in address bar
  – what happens when you click?
• cookies
  – sites store text information about you on your computer
  – allows customized web sessions but privacy or security concern?
  – can delete, refuse cookies
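The two example URLs in the url anatomy item can be taken apart with Python's standard `urllib.parse` module. This is only an illustrative sketch: the variable names are made up here, and the labels in the comments (protocol, host, path, query) are the usual informal names for the pieces.

```python
from urllib.parse import urlsplit, parse_qs

# The two URLs quoted on the slide.
file_url = "http://admin.illinois.edu/policy/code/article1_part1_1-101.html"
query_url = "http://illinois.edu/ricker/CampusMap?buildingID=43&target=displayHighlight"

file_parts = urlsplit(file_url)
query_parts = urlsplit(query_url)

# File-path style: everything after the host names a document on the server.
print(file_parts.scheme)   # protocol
print(file_parts.netloc)   # host
print(file_parts.path)     # file path

# Database-query style: the part after "?" is a list of name=value pairs.
params = parse_qs(query_parts.query)
print(query_parts.path)
print(params)
```

`parse_qs` returns each value as a list because a name may legally appear more than once in a query string.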
Things to know
• caches
  – browser stores copies of what you’ve visited… images… text
  – privacy/security/performance concerns.
  – can delete at expense of performance
• plugins
  – software extending browser’s capabilities to view different types of content. Browsers come with some built in.
  – Security: somebody else’s program running on your computer
• applets
  – program transmitted to your computer that you run
  – security issues
• deep web
  – dynamic web pages, dbases, secure pages, multimedia
Crawling and storing
• Web crawlers… how do they work?
• Google recorded 1 trillion unique URLs
• Back-of-envelope calculation for TEXT pages:
  1 trillion x 10 KB/page = 10 trillion KB
  = 10 quadrillion bytes
  = 10 petabytes
  = 10,000 terabytes
  = 10,000 disks (1 TB each) x $100/disk
  = $1,000,000
• Actual Google specs kept secret, estimates around 2006: 450,000 servers.
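The back-of-envelope storage estimate can be checked in a few lines. The sketch below assumes decimal units (1 KB = 1,000 bytes) and 1 TB disks, which is what the slide's figures imply. The constant names are illustrative only.

```python
# Inputs taken from the slide.
PAGES = 1_000_000_000_000        # 1 trillion unique URLs
BYTES_PER_PAGE = 10 * 1_000      # 10 KB of text per page (decimal KB, an assumption)
DOLLARS_PER_DISK = 100

# Assumption implied by "10,000 terabytes = 10,000 disks": each disk holds 1 TB.
BYTES_PER_DISK = 10**12

total_bytes = PAGES * BYTES_PER_PAGE
petabytes = total_bytes / 10**15
terabytes = total_bytes / 10**12
disks = total_bytes // BYTES_PER_DISK
cost = disks * DOLLARS_PER_DISK

print(f"{total_bytes:,} bytes = {petabytes:,.0f} PB = {terabytes:,.0f} TB")
print(f"{disks:,} disks x ${DOLLARS_PER_DISK} = ${cost:,}")
```

The estimate covers raw text only. It ignores images, indexes, and redundancy, so it is a lower bound rather than a description of any real deployment.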
Searching the web
• What is Lenny Pitt’s phone number?
  – simple dbase lookup, simple keyword search
• In database, find credit-worthy consumers
  – AI/learning problem
• Find web pages relevant to “computer music”
• Among cell phone conversations from country X, identify suspicious ones
• Search all religion and philosophy books for the meaning of life
thanks to sanjeev arora for this slide
Searching the web
Find “computer music”
• Search all pages for the phrase?
• Sort according to number of occurrences?
• Human staff answers questions?
Pitfalls
– Spamming by unscrupulous websites
– Synonyms
– Homographs [homonym]
– Polysemes [bank = institution or building or verb?]
thanks to sanjeev arora for this slide
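To see why "search for the phrase and sort by number of occurrences" runs into the pitfalls listed, here is a minimal sketch of that naive strategy. The three pages and their host names are invented for illustration, and the function names are hypothetical.

```python
def count_phrase(text, phrase):
    """Number of non-overlapping, case-insensitive occurrences of phrase in text."""
    return text.lower().count(phrase.lower())

def naive_search(pages, phrase):
    """Return URLs containing the phrase, most occurrences first (ties by URL)."""
    hits = []
    for url, text in pages.items():
        n = count_phrase(text, phrase)
        if n > 0:
            hits.append((n, url))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [url for _, url in hits]

# Hypothetical mini-web: one honest page, one keyword-stuffing page,
# and one relevant page that never uses the exact phrase.
pages = {
    "synth-history.example": "A history of computer music and early synthesizers.",
    "spam.example": "computer music computer music computer music buy cheap pills",
    "electronic.example": "Electronic composition with algorithms and digital audio.",
}

results = naive_search(pages, "computer music")
print(results)
```

The keyword-stuffing page ranks first (the spamming pitfall). The relevant page that only uses related wording is missed entirely (the synonym pitfall).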
Exploit Link Structure!
Example: PageRank [Google] http://en.wikipedia.org/wiki/PageRank
Ideas:
• PR = how many pages vote for you
• BUT!: some votes are more important (those from pages with higher PR)
PageRank and Random Walks
Random Walk Method 1
• Choose equally from outgoing links.
• Walk for a long time
• PR(X) is probability you end up at page X
Random Walk Method 2
• Choose equally from all pages on the web.
PR algorithm is an interpolation between both methods: models a random surfer who gets bored after several clicks and switches to a random page.
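The random-surfer picture can be simulated directly. This is a rough sketch with several assumptions: the four-page link graph is hypothetical, the 0.85 chance of following a link (rather than getting bored) is an assumed value, and the function name and fixed random seed are choices made here for repeatability.

```python
import random

def random_surfer(links, steps, follow_prob=0.85, seed=0):
    """Estimate how often each page is visited by a surfer who usually
    follows a random outgoing link (Method 1) and otherwise jumps to a
    page chosen uniformly from the whole web (Method 2)."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        if links[current] and rng.random() < follow_prob:
            current = rng.choice(links[current])  # Method 1: follow a link
        else:
            current = rng.choice(pages)           # Method 2: bored, jump anywhere
        visits[current] += 1
    return {page: visits[page] / steps for page in pages}

# Hypothetical web of four pages: page -> pages it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

freq = random_surfer(links, steps=200_000)
for page, share in freq.items():
    print(page, round(share, 3))
```

The visit frequencies are only estimates and change slightly with the seed and number of steps. Pages with more, and better-connected, incoming links are visited more often.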
[Figure: link graph on four pages A, B, C, D. Links: A → B, C, D; B → A, C; C → A; D → A, B, C.]
• Assume that A links to all pages equally
• Assume initially that PR of all pages is 1/N = .25
• Compute new PR for each node:
  – .15(old PR) + .85(“votes” from other nodes)
  – B’s votes: ½ of its PR to A, ½ to C.
Round 1: New PRs:
PRA = .15(PRA) + .85(PRB/2 + PRC/1 + PRD/3)
    = .15(.25) + .85(.25/2 + .25 + .25/3) = .427
PRB = .15(PRB) + .85(PRA/3 + 0 + PRD/3)
    = .15(.25) + .85(.25/3 + .25/3) = .179
PRC = .15(.25) + .85(.25/3 + .25/2 + .25/3) = .2854
PRD = .15(.25) + .85(.25/3) = .10833..
Now substitute the new PRs into equations to get newer PRs
PRA = .15(PRA) + .85(PRB/2 + PRC/1 + PRD/3)
    = .15(.427) + .85(.179/2 + .2854 + .10833/3) = .413
PRB = .15(.179) + .85(.427/3 + .10833/3) = .1785
Etc. Continue until values “settle down”.
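The round-by-round substitution can be automated. The sketch below applies the update rule as the worked example writes it, .15(old PR) + .85(votes), to the four-page graph read off the equations. The function names, stopping tolerance, and round limit are assumptions made here, and the code assumes every page has at least one outgoing link. A common variant replaces the first term with .15/N; the two agree in round 1, when every PR is 1/N.

```python
def pagerank_round(links, pr, keep=0.15):
    """One round: new PR = keep * old PR + (1 - keep) * incoming votes.
    Each page splits its current PR equally among the pages it links to."""
    votes = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        share = pr[page] / len(outgoing)  # assumes outgoing is non-empty
        for target in outgoing:
            votes[target] += share
    return {page: keep * pr[page] + (1 - keep) * votes[page] for page in links}

def settle(links, tol=1e-10, max_rounds=1000):
    """Repeat rounds from PR = 1/N until no value moves by more than tol."""
    pr = {page: 1 / len(links) for page in links}
    for _ in range(max_rounds):
        new = pagerank_round(links, pr)
        if max(abs(new[p] - pr[p]) for p in links) < tol:
            return new
        pr = new
    return pr

# Link graph implied by the equations: page -> pages it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

start = {page: 0.25 for page in links}
round1 = pagerank_round(links, start)
round2 = pagerank_round(links, round1)
settled = settle(links)

for name, values in [("round 1", round1), ("round 2", round2), ("settled", settled)]:
    print(name, {page: round(v, 4) for page, v in values.items()})
```

Because every page hands out all of its PR, the values sum to 1 after every round, which is a handy check on hand calculations.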
A page's PageRank = 0.15/N + 0.85 * (a "share" of the PageRank of every page that links to it)
• A link from a page with PR=.4 and 5 outbound links is worth more than a link from a page with PR=.8 and 100 outbound links.
• The PageRank of a page that links to yours is important but the number of links on that page is also important. The more links there are on a page, the less PageRank value your page will receive from it.
• Executed ahead of time (when indexing all documents).
in praise of XML
HTML:
<h1> Introduction </h1>
Tags specify formatting information, not content.
XML:
<chapter> Introduction </chapter>
<author> Peg Babcock </author>
Tags specify content. Separate file gives formatting information for various content areas.
Advantages:
• Helps search engines
• Easy to make global format changes
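A small sketch of why content tags help search: with the XML version, a program can ask for "the author" directly instead of guessing from formatting. The enclosing `<book>` element is an assumption added here so the fragment has a single root. Parsing uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# The slide's XML lines, wrapped in a hypothetical <book> root element.
xml_doc = """
<book>
  <chapter> Introduction </chapter>
  <author> Peg Babcock </author>
</book>
""".strip()

root = ET.fromstring(xml_doc)

# Content tags can be queried by meaning.
author = root.findtext("author").strip()
chapters = [chapter.text.strip() for chapter in root.findall("chapter")]

print("author:", author)
print("chapters:", chapters)
```

The HTML version only says that "Introduction" is a top-level heading. Nothing in the markup says it is a chapter title or who wrote it.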