Common Crawl: enabling machine-scale analysis of web data Lisa Green Kurt Bollacker Jordan Mendelson IIPC 2014-05-19


Common Crawl: enabling machine-scale analysis of web data

Lisa Green, Kurt Bollacker, Jordan Mendelson

IIPC, 2014-05-19

Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg

Photo license: CC BY-SA http://commons.wikimedia.org/wiki/File:Img20050526_0007_at_tannheim_cumulus.jpg

Photo license: CC-BY-NC https://www.flickr.com/photos/malloreigh/5580160943

Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg

Enable machine-scale access to and analysis of web data for everyone

Web Data Commons: “Extracting Structured Data from the Common Crawl”

WikiEntities (Han Xiaogang): In What Context Is a Term Referenced?

WikiEntities Example: Discography. Who are the most popular artists?

How Easily Can Google Analytics Track Our Browsing? (S. Merity, C. Hornbaker)

Data Publica: Finding French Open Data

Commercial Applications: Improved Spell Checking

May be too domain-specific

Photo license: CC-BY-NC-ND https://www.flickr.com/photos/blueforce4116/1398245798

Photo license: CC BY-SA http://commons.wikimedia.org/wiki/File:Img20050526_0007_at_tannheim_cumulus.jpg

Photo license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.

Image license: CC BY-SA https://www.flickr.com/photos/xdxd_vs_xdxd/6829447421

Photo license: CC-BY-SA https://www.flickr.com/photos/hackny/6202775045

Thank You

www.commoncrawl.org

[email protected]