Internet content as research data
Digital Humanities Australia March 2012, Canberra
Monica Omodei & Gordon Mohr
Research Examples
• Social networking • Lexicography • Linguistics • Network Science • Political Science • Media Studies • Contemporary history
Common Collection Strategies
• Crawl Scope & Focus
1) Thematic/Topical (elections, events, global warming…)
2) Resource-specific (video, PDF, etc.)
3) Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
4) Exhaustive (end of life, closure crawls, national domains)
5) Frequency-based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research, University Library archives
Internet Archive’s Web Archive
Positives
– Very broad: 175+ billion web instances
– Historic: started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation: covered by fair use and fast take-down response
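The time-based URL search can be illustrated with the Wayback Machine's replay address pattern, which combines a capture timestamp with the original URL. The sketch below only builds such an address and makes no network request. The helper name `wayback_url` and the example site are illustrative, and the address pattern is an assumption based on the public service rather than something stated on the slides.

```python
from datetime import datetime


def wayback_url(url: str, when: datetime) -> str:
    """Build a Wayback Machine replay address for `url` near `when`.

    Assumption: the replay pattern is
    https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>,
    and the service redirects to the capture closest to the timestamp.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit capture timestamp
    return f"https://web.archive.org/web/{timestamp}/{url}"


if __name__ == "__main__":
    # Hypothetical query: the NLA home page around the time of this talk.
    print(wayback_url("http://www.nla.gov.au/", datetime(2012, 3, 28)))
```

Opening the resulting address in a browser should show the capture nearest to the requested date, if one exists.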
Internet Archive’s Web Archive
Negatives
– Because of its size, keyword search is not available
– Because of its size, crawling is fully automated; QA is not possible
Common Use Cases for IA’s web archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis
Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon’s S3
• Accessible via MapReduce processing in Amazon’s EC2 compute cloud
• Makes wholesale extraction, transformation, and analysis of web data cheap and easy
• commoncrawl.org/data/accessing-the-data/
Common Crawl
Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing, not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API
Pandora Archive
• Positives
– Quality checked
– Targeted Australian content with selection policy
– Historical: started 1996
– Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion - pandora.nla.gov.au/registration_form.html
Pandora Archive
• Negatives
– Labour intensive, so small
– Significant content missed because permission to copy was refused
• The situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking and fixing crawls
Pandora Archive Stats
• Size: 6.32 TB
• Number of files > 140 million
• Number of ‘titles’ > 30.5K
• Number of title instances > 73.5K
.au Domain Annual Snapshots
• Annual crawls since 2005 commissioned from Internet Archive
• Includes sites on servers located in Australia as well as .au domain
• Robots.txt respected except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full text search, tailored to archive search
• Separate .gov crawl publicly accessible soon
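The robots.txt policy described above (rules respected, except for inline images and stylesheets) can be sketched with Python's standard `urllib.robotparser`. This is only an illustration of the rule, not the crawler configuration actually used: the robots.txt content, the example URLs, the helper `may_capture`, and the idea of recognising inline resources by file extension are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from any real crawl).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Assumption: "inline" resources are recognised by file extension here;
# a real crawler would decide this from how the resource is embedded in a page.
INLINE_EXTENSIONS = (".css", ".png", ".jpg", ".jpeg", ".gif")


def may_capture(parser: RobotFileParser, url: str, agent: str = "*") -> bool:
    """Return True if the policy sketched here would capture `url`."""
    if url.lower().endswith(INLINE_EXTENSIONS):
        return True  # inline images and stylesheets are exempt from robots.txt
    return parser.can_fetch(agent, url)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    for candidate in (
        "http://example.com.au/index.html",
        "http://example.com.au/private/report.html",
        "http://example.com.au/private/logo.png",
    ):
        print(candidate, may_capture(parser, candidate))
```

Exempting inline resources lets an archived page render with its images and styling even when the site blocks crawlers from those directories.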
Australian web domain crawls
Year   Files         Hosts crawled   Size (TB)
2005   185 million   811,523         6.69
2006   596 million   1,046,038       19.04
2007   516 million   1,247,614       18.47
2008   1 billion     3,038,658       34.55
2009   765 million   1,074,645       24.29
2011   660 million   1,346,549       30.71
Internet Memory Foundation Archive
• internetmemory.org/en/
• No keyword search yet, only URL
• Number of European partners
Other National Archives
• List of International Internet Preservation Consortium member archives - netpreserve.org/about/archiveList.php
• Some are whole domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC, an ISO standard)
Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas
…. and many more
• WebCITE - webcitation.org (citation service archive)
Bringing Archives Together
• Common standard and APIs • Memento project
Create your own Archive
• Use a subscription service
• Build your own archive using the open-source crawler Heritrix and the standard file format .warc
• Use web citation services that create archive copies as you bookmark pages
Subscription Services
• archive-it.org (service operated by non-profit Internet Archive since 2006)
• archivethe.net (service operated by non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service - cdlib.org/services/uc3/was.html
• OCLC Harvester Service - oclc.org/webharvester/overview/default.htm
Install web archiving system locally
• An easy-to-deploy web archiving toolkit that meets web archive standards is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers, though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
'Memento': adding time to the web
• Protocol and browser add-on (MementoFox)
• Aids discovery, aggregation of page histories
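Memento adds time to the web by letting a client ask for a page as it existed at a given moment. A minimal sketch of the request side follows, assuming the `Accept-Datetime` request header from the Memento protocol as later published in RFC 7089. The helper name `memento_headers` is illustrative, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when: datetime) -> dict:
    """Build request headers asking a Memento TimeGate for a page as of `when`.

    Assumption: the TimeGate understands the Accept-Datetime header,
    whose value is an HTTP date expressed in GMT.
    """
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


if __name__ == "__main__":
    print(memento_headers(datetime(2012, 3, 28, tzinfo=timezone.utc)))
```

These headers would accompany an ordinary HTTP GET for the original URL, sent to a server or aggregator that supports the protocol.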
Innovation is increasingly driven by large-scale data analysis
• Need fast iteration to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• Need to surface the information amongst all that data…
Web Data Mining & Analysis – What is it? Why Do It?
Platform & Toolkit: Overview
• Software
– Apache Hadoop
– Apache Pig
• Data/File format
– WARC
– CDX
– WAT (new!)
Apache Hadoop
• HDFS
– Distributed storage
– Durable, default 3x replication
– Scalable: Yahoo! 60+PB HDFS
• MapReduce
– Distributed computation
– You write Java functions
– Hadoop distributes work across cluster
– Tolerates failures
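The division of labour in MapReduce (you supply a map function and a reduce function, the framework groups and distributes the work) can be shown in a few lines. The sketch below is plain Python running in one process, not the Hadoop Java API; the function names and the sample URLs are made up for illustration. It counts captures per host.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Made-up capture URLs standing in for records read from an archive.
CAPTURES = [
    "http://www.nla.gov.au/",
    "http://www.nla.gov.au/about",
    "http://pandora.nla.gov.au/",
]


def map_capture(url):
    """Map step: emit one (host, 1) pair per captured URL."""
    yield urlsplit(url).hostname, 1


def reduce_host(host, counts):
    """Reduce step: total the counts emitted for one host."""
    return host, sum(counts)


def run_job(urls):
    """Stand-in for the framework: group mapped pairs by key, then reduce."""
    grouped = defaultdict(list)
    for url in urls:
        for host, count in map_capture(url):
            grouped[host].append(count)
    return dict(reduce_host(host, counts) for host, counts in grouped.items())


if __name__ == "__main__":
    print(run_job(CAPTURES))
```

In Hadoop the grouping step in `run_job` is what the cluster performs across machines, which is why only the two small functions need to be written by the analyst.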
File formats and data: WARC
File formats and data: CDX
• Index for Wayback Machine: used to browse WARC-based archive
• Space-delimited text file
• Only essential metadata needed by Wayback
– URL
– Content Digest
– Capture Timestamp
– Content-Type
– HTTP response code
– etc.
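Because a CDX file is space-delimited text, one index line can be read with a plain split. The sketch below assumes a hypothetical five-column layout covering only the fields named above; real CDX files may order the columns differently and carry more of them, so the order used here, the `CdxEntry` type, and the sample line are all illustrative.

```python
from typing import NamedTuple


class CdxEntry(NamedTuple):
    url: str
    timestamp: str      # capture time as YYYYMMDDhhmmss
    content_type: str
    status: int         # HTTP response code
    digest: str         # content digest


def parse_cdx_line(line: str) -> CdxEntry:
    """Split one space-delimited index line into named fields.

    Assumption: columns appear in the order url, timestamp, content type,
    response code, digest. Real CDX files may order or extend them differently.
    """
    url, timestamp, content_type, status, digest = line.split()[:5]
    return CdxEntry(url, timestamp, content_type, int(status), digest)


if __name__ == "__main__":
    # Illustrative line; the digest is a placeholder, not a real hash.
    sample = "http://www.nla.gov.au/ 20120328000000 text/html 200 EXAMPLEDIGEST"
    print(parse_cdx_line(sample))
```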
File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• Data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work-in-progress: we want your feedback
File formats and data: WAT
• WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies original WARC record
• WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
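Since a WAT record is a WARC metadata record with a JSON payload, reading one amounts to splitting the headers from the body and decoding the body as JSON. The sketch below works on a hypothetical, heavily abbreviated record: the JSON field names and the record ID are invented for illustration, and real records carry more headers than shown.

```python
import json

# Hypothetical, heavily abbreviated WAT-style record: WARC headers,
# a blank line, then a JSON payload. Field names inside the JSON are
# invented for illustration.
RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"url": "http://www.nla.gov.au/", "links": ["http://trove.nla.gov.au/"]}'
)


def parse_record(text):
    """Split a record into a header dict and a decoded JSON payload."""
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in lines[1:])  # skip version line
    return headers, json.loads(body)


if __name__ == "__main__":
    headers, payload = parse_record(RECORD)
    print(headers["WARC-Refers-To"], payload["links"])
```

The `WARC-Refers-To` value is what lets an analysis step go back from the extracted metadata to the full original capture when needed.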
File formats & data:
• CDX: 53 MB
• WAT: 443 MB
• WARC: 8,651 MB
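Assuming the three figures describe the same set of captures (the slide does not say so explicitly), the relative sizes can be worked out directly. The short calculation below expresses the CDX and WAT sizes as a percentage of the WARC size; the helper name is illustrative.

```python
# Sizes in MB as given on the slide.
SIZES_MB = {"CDX": 53, "WAT": 443, "WARC": 8651}


def percent_of_warc(fmt: str) -> float:
    """Size of `fmt` as a percentage of the WARC size, to two decimals."""
    return round(100 * SIZES_MB[fmt] / SIZES_MB["WARC"], 2)


if __name__ == "__main__":
    for fmt in ("CDX", "WAT"):
        print(fmt, percent_of_warc(fmt))
```

On these numbers WAT is roughly a twentieth of the WARC size and CDX well under one percent, which is the trade-off between detail and volume that the three formats represent.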
Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
Contacts
• Webarchive @ nla.gov.au
• Secretariat @ internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us
• momodei @ nla.gov.au
• gojomo @ xavvy.com