Ch. 13 Structure of the Web
Padmini Srinivasan
Computer Science Department Department of Management Sciences
http://cs.uiowa.edu/~psriniva
padmini-srinivasan@uiowa.edu
Origins
• Origins of WWW (1989/1990: http)
  – Sir Tim Berners-Lee & Robert Cailliau
• First prototype of browser: WorldWideWeb
• 1st popular graphical browser: Mosaic (NCSA), Marc Andreessen and others
  – Mozilla -> Netscape -> Firefox
• Lynx
• 2000 Windows explorer
• WAIS, Gopher, Veronica
• 1994: W3C
• 1994: 1st World Wide Web conference
• 1995: Yahoo! 1998: Google 2006: Live Search -> Bing
Network Metaphor
• Information network:
  – Different from social network
• Notion of a logical document: different
  – Decentralized, over many computers
  – Annotation
• Network metaphor: “inspired and non-obvious”
• Origins in hypertext
  – Origins in citation nets
• Citation nets: distinctly temporal, web?
  – Citation maps (popular): co-citation; bibliographic coupling
    • H-index (Hirsch); g-index; f-index
  – Patents; legal cases (precedents); medical literature
• Indexes: cross-linkages; see also; Wikipedia
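The citation indexes named above can be made concrete with a small sketch. It assumes the usual definitions: the h-index is the largest h such that h papers have at least h citations each, and the g-index is the largest g such that the top g papers together have at least g² citations. The function names and the sample citation counts are illustrative only, and the f-index is left out.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers total at least g*g citations.

    This variant caps g at the number of papers in the list.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for one author's five papers.
papers = [10, 8, 5, 4, 3]
print(h_index(papers))  # 4
print(g_index(papers))  # 5
```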
Links/Associations
• Directed edges
  – Friendship nets, name-recognition, business colleagues, collaboration [Erdos number, Bacon number], IM nets, email graphs etc.
  – Paths, shortest paths…
• Associative memory
• Semantic nets aka conceptual networks (free-association studies)
• Vannevar Bush, “As We May Think” (1945), Atlantic Monthly. WW2. MEMEX (on web)
  – Associative connections between all of knowledge
  – Acknowledged by most
  – A way to rechannel human resources
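An Erdos or Bacon number is a shortest-path length in a collaboration graph. A minimal sketch, assuming an unweighted graph stored as an adjacency dictionary and using breadth-first search. The tiny graph and its node names are made up for illustration.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Number of edges on a shortest path from source to target, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr in seen:
                continue
            if nbr == target:
                return dist + 1
            seen.add(nbr)
            queue.append((nbr, dist + 1))
    return None  # target is not reachable from source


# Hypothetical co-authorship graph; each edge is listed in both directions.
coauthors = {
    "Erdos": ["A", "B"],
    "A": ["Erdos", "C"],
    "B": ["Erdos"],
    "C": ["A", "D"],
    "D": ["C"],
    "E": [],
}
print(shortest_path_length(coauthors, "Erdos", "D"))  # 3
print(shortest_path_length(coauthors, "Erdos", "E"))  # None
```

Breadth-first search visits nodes in order of distance from the source, so the first time the target is seen its distance is minimal.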
Paths and Connectivity
• Connected graphs
• Path: a sequence of nodes beginning at node X and ending at node Y, with each consecutive pair joined by an edge.
• A directed graph is strongly connected if there is a directed path from every node to every other node.
• If it is not strongly connected, need to examine its ‘reachability’ properties.
  – Easier in an undirected graph: find the connected components
  – Directed? Find strongly connected components
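One way to test strong connectivity, sketched under the assumption that the graph is an adjacency dictionary: pick any node, then check that it reaches every node along the edges and that every node reaches it (equivalently, that it reaches every node in the reversed graph). The function names are illustrative.

```python
def reachable(graph, start):
    """Set of nodes reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return seen


def reverse(graph):
    """Copy of the graph with every edge direction flipped."""
    rev = {node: [] for node in graph}
    for node, nbrs in graph.items():
        for nbr in nbrs:
            rev.setdefault(nbr, []).append(node)
    return rev


def is_strongly_connected(graph):
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    if not nodes:
        return True
    start = next(iter(nodes))
    # start must reach everyone, and everyone must reach start.
    return (reachable(graph, start) == nodes
            and reachable(reverse(graph), start) == nodes)


cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
chain = {"a": ["b"], "b": ["c"]}
print(is_strongly_connected(cycle))  # True
print(is_strongly_connected(chain))  # False
```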
Strongly Connected Component
• An SCC in a directed graph is a subset of nodes such that
  – (1) every node in it has a path to every other node in it
  – (2) the subset is not part of a larger set of nodes that has the same property. [So it is a maximal such set]
• Why is it interesting to know about such components in the Web?
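The two conditions in the definition can be computed directly. Below is a sketch of Kosaraju's two-pass algorithm, one standard method and not necessarily the one used in the studies cited in these slides. It assumes an adjacency-dictionary graph; the sample graph is made up.

```python
def strongly_connected_components(graph):
    """Return the SCCs of a directed graph as a list of sets (Kosaraju)."""
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}

    # Pass 1: iterative DFS, recording nodes in order of completion.
    visited = set()
    order = []
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nbr in neighbours:
                if nbr not in visited:
                    visited.add(nbr)
                    stack.append((nbr, iter(graph.get(nbr, ()))))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    rev = {n: [] for n in nodes}
    for node, nbrs in graph.items():
        for nbr in nbrs:
            rev[nbr].append(node)

    # Pass 2: sweep the reversed graph in decreasing finish time;
    # each sweep collects exactly one SCC.
    assigned = set()
    components = []
    for root in reversed(order):
        if root in assigned:
            continue
        component = {root}
        assigned.add(root)
        stack = [root]
        while stack:
            node = stack.pop()
            for nbr in rev[node]:
                if nbr not in assigned:
                    assigned.add(nbr)
                    component.add(nbr)
                    stack.append(nbr)
        components.append(component)
    return components


web = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": ["a"]}
sccs = strongly_connected_components(web)
largest = max(sccs, key=len)  # {"a", "b", "c"} for this toy graph
```

Each returned set satisfies condition (1) because its nodes reach one another, and condition (2) because no node outside the set can be added without breaking (1).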
Bow-Tie Structure of the Web
• 1999 Andrei Broder (now Yahoo!), then AltaVista
• SCC; IN; OUT; Tendrils; Tubes; Disconnected
• Macro-model
  – Properties of a reasonable model:
    • Should have a succinct and fairly natural description
    • Rooted in a plausible macro-level process for creation of Web content
    • Should not require some prior static set of topics
    • Should reflect many of the structural phenomena observed in the Web
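The bow-tie regions can be read off from reachability relative to the giant SCC. The sketch below is a simplification under stated assumptions: the graph is an adjacency dictionary, the largest SCC is found by intersecting forward and backward reachability (simple but slow on large graphs), and tendrils, tubes and disconnected pieces are lumped together as OTHER. All names and the sample graph are illustrative.

```python
def reach(adj, starts):
    """Set of nodes reachable from any node in starts."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        node = stack.pop()
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return seen


def bow_tie(graph):
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    rev = {}
    for node, nbrs in graph.items():
        for nbr in nbrs:
            rev.setdefault(nbr, []).append(node)

    # The SCC of v is (nodes v can reach) ∩ (nodes that can reach v).
    core = set()
    unassigned = set(nodes)
    while unassigned:
        v = next(iter(unassigned))
        component = reach(graph, [v]) & reach(rev, [v])
        unassigned -= component
        if len(component) > len(core):
            core = component

    in_set = reach(rev, core) - core     # can reach the core, not reachable from it
    out_set = reach(graph, core) - core  # reachable from the core, cannot return
    other = nodes - core - in_set - out_set
    return {"SCC": core, "IN": in_set, "OUT": out_set, "OTHER": other}


# Hypothetical graph: core a->b->c->a, "i" links in, "o" is linked out,
# "t" hangs off IN (tendril), "u" bypasses the core (tube), x->y is disconnected.
toy = {
    "a": ["b"], "b": ["c"], "c": ["a", "o"],
    "i": ["a", "t", "u"], "u": ["o"], "x": ["y"],
}
regions = bow_tie(toy)
```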
Similar Studies
• Donato et al. ACM TOIT, 2007. The Web as a Graph: How Far We Are
• WebBase, 200 million page Stanford crawl
  – 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million); next SCC: 10 thousand!
Similar Studies
• Buriol et al. (includes Donato): Temporal analysis of Wikigraph.
Bow-Tie
• Why a single SCC? Why not two large ones?
• Any other explanations?
  – Interlinked world?
  – Hard to be disconnected?
  – What about a new page?
• Is the SCC static/fixed? How does it change?
  – Are links permanent? (2004: 25% remain after 1 year and 50% of pages stay the same; Ntoulas et al., 2004)
• Many naturally occurring graphs have a giant SCC
  – IM (nodes: people; links: messages): almost all are in the SCC; median path length is 7, mean 6.6.
Bow-Tie: points to note
• Incomplete picture
  – Doesn’t tell you how this is generated, just that it is.
  – Macro model:
    • Thematic collections; differences?
    • Organization-specific collections
    • Regional: economic incentives/disincentives?
    • Community based: education levels?
    • Bipartite cliques (small sized, many in number)
      – Fans pointing to centers
  – Will it always be observed? How about now?
Web 2.0
• “an attitude not a technology”
  – Collaboration/collective maintenance
    • Annotation, tags, links, editing, revisions
  – Data generated by individuals for individual and group sharing; Flickr, Gmail.
  – Connections between entities beyond “documents”.
• Social feedback key; ‘wisdom of crowds’; long tail
Web Links
• Navigational – static pages – passive services
• Transactional – dynamic / computational services. Deep web
• Search engines – heuristics
  – What kinds of rules would you use?
  – Implications for crawlers
Summary
• Web: origins, network metaphor
  – Citations, MEMEX
• Paths
• Structures (macro)
  – SCC
  – Bow-Tie model
• Next
  – Ch 14: Hubs and Authorities; PageRank