Data Mining and Information Retrieval · CMPT 454: Database Systems II –Introduction to Web...
Data Mining and Information Retrieval
Introduction to Web Mining
CMPT 454: Database Systems II – Introduction to Web Mining
What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns.
Web Mining vs. Data Mining
- Structure (or lack of it)
  - Textual information and linkage structure
- Scale
  - Data generated per day is comparable to largest conventional “data warehouses”
- Speed
  - Often need to react to evolving usage patterns in real-time (e.g., merchandising)
Web Mining topics
- Web graph analysis
- Power Laws and The Long Tail
- Structured data extraction
- Web advertising
- Systems Issues
Size of the Web
- Number of pages
  - Technically, infinite
  - Much duplication (30-40%)
  - Best estimate of “unique” static HTML pages comes from search engine claims
    - Google = 8 billion(?), Yahoo = 20 billion
- Number of web sites
  - Netcraft survey says 206,675,938 sites (March 2010)
    (http://news.netcraft.com/archives/web_server_survey.html)
Netcraft Survey
http://news.netcraft.com/archives/web_server_survey.html
The Web as a Graph
- Pages = nodes, hyperlinks = edges
  - Ignore content
  - Directed graph
- High linkage
  - 8-10 links/page on average
  - Power-law degree distribution
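The graph view above can be sketched in a few lines. This is a minimal illustration, assuming the crawl is available as a list of (source, target) link pairs; the page names and helper functions are hypothetical and are not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy link list: (source page, target page).
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
    ("d.html", "c.html"),
]

def build_graph(edges):
    """Return adjacency lists: page -> list of pages it links to."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
        graph[dst]  # touch dst so pages with no out-links still appear
    return dict(graph)

def degrees(graph):
    """Return (in_degree, out_degree) dictionaries for every page."""
    out_deg = {page: len(targets) for page, targets in graph.items()}
    in_deg = {page: 0 for page in graph}
    for targets in graph.values():
        for dst in targets:
            in_deg[dst] += 1
    return in_deg, out_deg

graph = build_graph(links)
in_deg, out_deg = degrees(graph)
avg_links = sum(out_deg.values()) / len(graph)  # average out-links per page
```

Page content is ignored entirely; only the directed edges matter, which is what makes in-degree and out-degree the natural first statistics to compute.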
Structure of Web Graph
- Let’s take a closer look at structure
  - Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
- Bow-tie structure
  - Not a “small world”
Bow-tie Structure
Source: Broder et al, 2000
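The figure itself did not survive extraction. As commonly described, the bow-tie has a strongly connected core, an IN set of pages that can reach the core, and an OUT set reachable from the core, with the remaining pages in tendrils or disconnected pieces. A rough sketch of that split, assuming we already know one page inside the core (`core_seed`, a hypothetical parameter) and using a made-up edge list:

```python
from collections import defaultdict, deque

def reachable(adj, start):
    """Breadth-first search: set of nodes reachable from start (inclusive)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, core_seed):
    """Split nodes into CORE, IN, OUT and OTHER relative to core_seed."""
    forward, backward = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        nodes.update((src, dst))
    fwd = reachable(forward, core_seed)    # pages the seed can reach
    bwd = reachable(backward, core_seed)   # pages that can reach the seed
    core = fwd & bwd                       # mutually reachable with the seed
    return {
        "CORE": core,
        "IN": bwd - core,
        "OUT": fwd - core,
        "OTHER": nodes - fwd - bwd,        # tendrils and disconnected pages
    }

# Hypothetical toy graph, not data from the study.
edges = [("in1", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c1"),
         ("c2", "out1"), ("in1", "t1"), ("x", "y")]
parts = bow_tie(edges, "c1")
```

Two breadth-first searches are enough here because a page is in the seed's strongly connected component exactly when it is reachable from the seed and can also reach it.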
What can the graph tell us?
- Distinguish “important” pages from unimportant ones
  - Page rank
- Discover communities of related pages
  - Hubs and Authorities
- Detect web spam
  - Trust rank
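To make the first item concrete, here is a minimal sketch of page rank computed by power iteration. The slides only name the technique, so the damping factor of 0.85, the iteration count, and the toy graph are assumptions for illustration. Every link target must also appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration on a dict: page -> list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        # Each page passes an equal share of its rank along every out-link.
        for p in pages:
            for q in graph[p]:
                new_rank[q] += damping * rank[p] / len(graph[p])
        rank = new_rank
    return rank

# Hypothetical toy graph: "c" is linked to by three other pages.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
best = max(ranks, key=ranks.get)
```

In this toy graph the most linked-to page ends up with the highest score, and a page with no in-links keeps only the baseline share.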
Power-law degree distribution
[Figure: degree distribution plotted on a log scale, with the long tail labeled]
Source: Broder et al, 2000
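A power law means the number of pages with degree k falls off roughly as c · k^(−α), so on log-log axes the points lie near a straight line with slope −α. The sketch below recovers α from synthetic, noise-free counts with an ordinary least-squares line. The exponent 2.1 and constant 1e6 are illustrative values, not figures taken from the slides, and on real, noisy data this simple fit is only a rough estimate.

```python
import math

def fit_power_law(degree_counts):
    """Least-squares line through (log k, log count); returns (alpha, c)."""
    xs = [math.log(k) for k, n in degree_counts if k > 0 and n > 0]
    ys = [math.log(n) for k, n in degree_counts if k > 0 and n > 0]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # log count = log c - alpha * log k, so alpha is the negated slope.
    return -slope, math.exp(intercept)

# Synthetic data: count(k) = 1e6 * k^-2.1 (illustrative values only).
synthetic = [(k, 1e6 * k ** -2.1) for k in range(1, 200)]
alpha, c = fit_power_law(synthetic)
```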
Power-laws galore
- Structure
  - In-degrees
  - Out-degrees
  - Number of pages per site
- Usage patterns
  - Number of visitors
  - Popularity
- And much more…
Extracting Structured Data
http://www.simplyhired.com
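The slide shows only a job-search site as an example. As a rough sketch of what extraction means in practice, the code below turns repeated HTML fragments into records using Python's standard `html.parser`. The markup, class names, and job listings are invented for illustration and do not describe how any real site is structured.

```python
from html.parser import HTMLParser

class JobExtractor(HTMLParser):
    """Collect one dict per <div class="job"> with its labeled <span> fields."""

    FIELDS = {"title", "company", "location"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # field whose text we are currently reading

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "job":
            self.records.append({})          # start a new record
        elif tag == "span" and css_class in self.FIELDS and self.records:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

# Hypothetical page fragment.
SAMPLE_HTML = """
<div class="job"><span class="title">Data Engineer</span>
  <span class="company">Acme</span><span class="location">Vancouver</span></div>
<div class="job"><span class="title">Web Analyst</span>
  <span class="company">Initech</span><span class="location">Burnaby</span></div>
"""

parser = JobExtractor()
parser.feed(SAMPLE_HTML)
jobs = parser.records
```

Hand-written rules like these break whenever a site changes its layout, which is part of why extraction at web scale is treated as a mining problem.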
Searching the Web
[Figure labels: The Web, Content aggregators, Content consumers]
Ads vs. search results
- Search advertising is the revenue model
  - Multi-billion-dollar industry
  - Advertisers pay for clicks on their ads
- Interesting problems
  - What ads to show for a search?
  - If I’m an advertiser, which search terms should I bid on and how much to bid?
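One simple way to approach the first problem is to rank matching ads by expected revenue per impression, that is, bid times estimated click-through rate, since advertisers pay only when clicked. This is a common heuristic offered as a sketch, not a method given in the slides. The `Ad` fields, bids, and click-through rates below are made up.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    advertiser: str
    keyword: str
    bid: float   # amount paid per click (hypothetical currency units)
    ctr: float   # estimated click-through rate, between 0 and 1

def pick_ads(query, ads, slots=2):
    """Return up to `slots` ads for the query, best expected revenue first."""
    terms = set(query.lower().split())
    candidates = [ad for ad in ads if ad.keyword in terms]
    # Expected revenue per impression = bid per click * probability of click.
    candidates.sort(key=lambda ad: ad.bid * ad.ctr, reverse=True)
    return candidates[:slots]

ads = [
    Ad("A", "shoes", bid=1.00, ctr=0.02),
    Ad("B", "shoes", bid=0.60, ctr=0.05),
    Ad("C", "shoes", bid=2.00, ctr=0.005),
    Ad("D", "books", bid=3.00, ctr=0.10),
]
shown = pick_ads("running shoes", ads)
```

Note that the highest bidder on the keyword (advertiser C) is not shown: its low click-through rate makes it worth less per impression than the two cheaper ads.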
Systems architecture
[Figure: a single machine with CPU, Memory, and Disk. Labels: Machine Learning, Statistics; “Classical” Data Mining]
Very Large-Scale Data Mining
[Figure: several machines, each with its own CPU, Mem, and Disk, labeled “Cluster of commodity nodes”]
Systems Issues
- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server!
  - Need large farms of servers
- How to organize hardware/software to mine multi-terabyte data sets
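One possible answer to the last question is to partition the data across nodes, let each node compute on its own shard, and merge the partial results. The sketch below simulates that pattern in a single process. The record format, the hash-by-URL partitioning, and the word-count task are all assumptions chosen for illustration, not a design taken from the slides.

```python
import zlib
from collections import Counter

def shard_of(key, num_nodes):
    """Deterministically assign a key to one of num_nodes partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(records, num_nodes):
    """Split (url, text) records into per-node lists by hashing the URL."""
    shards = [[] for _ in range(num_nodes)]
    for url, text in records:
        shards[shard_of(url, num_nodes)].append((url, text))
    return shards

def local_count(shard):
    """Work one node would do on its own shard: count words locally."""
    counts = Counter()
    for _url, text in shard:
        counts.update(text.lower().split())
    return counts

def merge(partials):
    """Combine the per-node partial counts into a global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical records standing in for crawled pages.
records = [
    ("http://example.com/a", "web mining at scale"),
    ("http://example.com/b", "mining the web graph"),
    ("http://example.com/c", "scale out not up"),
]
shards = partition(records, num_nodes=3)
word_counts = merge(local_count(shard) for shard in shards)
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, and every node has to agree on which shard owns a given URL.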