Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan...
![Page 1: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/1.jpg)
Common Crawl: enabling machine-scale analysis of web data
Lisa Green, Kurt Bollacker, Jordan Mendelson
IIPC, 2014-05-19
![Page 2: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/2.jpg)
![Page 3: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/3.jpg)
Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
![Page 4: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/4.jpg)
Photo license: CC BY-SA http://commons.wikimedia.org/wiki/File:Img20050526_0007_at_tannheim_cumulus.jpg
![Page 5: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/5.jpg)
Photo license: CC-BY-NC https://www.flickr.com/photos/malloreigh/5580160943
![Page 6: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/6.jpg)
Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
![Page 7: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/7.jpg)
Enable machine-scale access to and analysis of web data for everyone
![Page 8: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/8.jpg)
Web Data Commons: “Extracting Structured Data from the Common Crawl”
![Page 9: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/9.jpg)
WikiEntities (Han Xiaogang): In What Context Is a Term Referenced?
![Page 10: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/10.jpg)
WikiEntities Example: Discography
Who are the most popular artists?
![Page 11: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/11.jpg)
How Easily Can Google Analytics Track Our Browsing? (S. Merity, C. Hornbaker)
![Page 12: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/12.jpg)
Data Publica: Finding French Open Data
![Page 13: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/13.jpg)
Commercial Applications: Improved Spell Checking
![Page 14: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/14.jpg)
may be too domain specific
Photo license: CC-BY-NC-ND https://www.flickr.com/photos/blueforce4116/1398245798
![Page 15: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/15.jpg)
Photo license: CC BY-SA http://commons.wikimedia.org/wiki/File:Img20050526_0007_at_tannheim_cumulus.jpg
![Page 16: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/16.jpg)
Photo license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.
![Page 17: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/17.jpg)
![Page 18: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/18.jpg)
![Page 19: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/19.jpg)
![Page 20: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/20.jpg)
![Page 21: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/21.jpg)
![Page 22: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/22.jpg)
Image license: CC BY-SA https://www.flickr.com/photos/xdxd_vs_xdxd/6829447421
![Page 23: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/23.jpg)
Photo license: CC-BY-SA https://www.flickr.com/photos/hackny/6202775045
![Page 24: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/24.jpg)
![Page 25: Common Crawl : enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19.](https://reader036.fdocuments.us/reader036/viewer/2022062309/56649ca65503460f94967e6e/html5/thumbnails/25.jpg)