Web Archive Analysis
Vinay Goel, Senior Data Engineer
Internet Archive, vinay@archive.org
Access Web Archive Data
– Wayback Machine
– Search
Enable Research & Analysis
Analysis Workflow
Data: WARC
Data written by web crawlers
Web ARchive Container File - WARC (ISO standard)
Revision of the ARC file format
Each file contains a series of concatenated records:
– Full HTTP request/response records
– Metadata records (links, crawler path, encoding etc.)
– Records to store duplicate detection events
– Records to support segmentation and conversion
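As a rough illustration of the record layout, a WARC record begins with a version line and a block of named header fields, followed by a blank line and the payload. The sketch below parses just the header block in pure Python; it is not the Internet Archive's tooling (real readers such as the warcio library also handle gzip members, digests, and payload parsing), and `parse_warc_headers` is a hypothetical helper:

```python
# Minimal sketch: parse the header block of a single WARC record.
# Only splits the version line and 'Name: value' headers; payload
# handling, gzip, and digest verification are deliberately omitted.

def parse_warc_headers(record_text):
    """Hypothetical helper: return (version, header dict) for one record."""
    lines = record_text.split("\r\n")
    version = lines[0]                      # e.g. 'WARC/1.0'
    headers = {}
    for line in lines[1:]:
        if not line:                        # blank line ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers

record = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.org/\r\n"
    "WARC-Date: 2011-05-05T12:00:00Z\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)
version, headers = parse_warc_headers(record)
```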
Derived Data: CDX
Index for the Wayback Machine
Space-delimited text file
Contains only the essential fields needed by Wayback:
– URL, Timestamp, Content Digest
– MIME type, HTTP Status Code
– Redirect URL, meta tags, size
– WARC filename and file offset of record
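Since each CDX entry is one space-delimited line, splitting it into named fields is straightforward. The field order below is an assumption chosen to cover the fields listed above; real CDX files declare their actual layout in a leading ` CDX ...` header line:

```python
# Sketch: split one space-delimited CDX line into named fields.
# CDX_FIELDS is an assumed ordering, not the canonical one; consult
# the file's ' CDX ...' header line for the real layout.
CDX_FIELDS = ["urlkey", "timestamp", "original_url", "mime", "status",
              "digest", "redirect", "meta_tags", "size", "offset",
              "warc_filename"]

def parse_cdx_line(line):
    return dict(zip(CDX_FIELDS, line.split()))

line = ("org,example)/ 20110505120000 http://example.org/ text/html 200 "
        "AAAA1234 - - 2450 318 crawl-00001.warc.gz")
rec = parse_cdx_line(line)
```

With the filename and offset fields, a replay system can seek directly to the matching record inside a large WARC file without scanning it.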
Wayback Machine
Growth of content
Rate of duplication
Breakdown by Year-First-Crawled
Log Analysis (Hive/Pig/Giraph)
CDX Warehouse
Crawl Log Warehouse
Distribution of HTTP status codes, MIME types
Find timeout errors, duplicate content, crawler traps, robots exclusions
Trace the path of the crawler
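The kind of aggregation such a warehouse answers with a Hive or Pig query, e.g. the distribution of HTTP status codes, can be sketched in miniature with a counter over CDX lines. The field position of the status code here matches the assumed layout above; real layouts vary:

```python
from collections import Counter

# Sketch: distribution of HTTP status codes across CDX index lines.
# Assumes the status code is the 5th space-delimited field.
cdx_lines = [
    "org,example)/ 20110505120000 http://example.org/ text/html 200 -",
    "org,example)/a 20110505120001 http://example.org/a text/html 404 -",
    "org,example)/b 20110505120002 http://example.org/b text/html 200 -",
]
status_counts = Counter(line.split()[4] for line in cdx_lines)
```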
Derived Data: Parsed Text
Input to build text indexes for Search
Text is extracted from (W)ARC files
HTML boilerplate is stripped out
Also contains metadata for each record:
– URL, Timestamp, Content Digest, Record Length
– MIME type, HTTP status code
– Title, description and meta keywords
– Links with anchor text
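A toy version of the text-and-title extraction step can be written with the standard library's HTML parser; this is only a sketch of the idea, since real boilerplate stripping (navigation, ads, templates) requires far more than skipping script and style content:

```python
from html.parser import HTMLParser

# Sketch: pull the title and visible text out of an HTML page,
# skipping <script>/<style> content. Not production boilerplate
# removal, just the shape of the extraction.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._skip = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())

p = TextExtractor()
p.feed("<html><head><title>Example</title><script>x=1;</script></head>"
       "<body><p>Hello archive</p></body></html>")
```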
Derived Data: WAT
Extensible metadata format
Essential metadata for many types of analyses
Avoids barriers to data exchange: copyright, privacy
Less data than WARC, more than CDX
WAT records are WARC metadata records
Contains, for every HTML page in the WARC:
– Title, description and meta keywords
– Embeds and outgoing links with alt / anchor text
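WAT payloads carry this per-page metadata as JSON. The record below is a deliberately simplified, hypothetical shape (real WAT records nest these values under longer Envelope/Payload-Metadata paths), used to show how links and anchor text become directly queryable:

```python
import json

# Sketch: a simplified, hypothetical WAT-style JSON payload and
# extraction of outgoing links with their anchor text.
wat_json = json.dumps({
    "url": "http://example.org/",
    "title": "Example",
    "links": [
        {"url": "http://example.org/about", "text": "About"},
        {"url": "http://other.org/", "text": "Elsewhere"},
    ],
})
record = json.loads(wat_json)
anchors = {link["url"]: link["text"] for link in record["links"]}
```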
Text Analysis (Pig/Mahout)
Text extracted from WARC / Parsed Text / WAT files
Use curated collections to train classifiers
Cluster documents in collections
Topic Modeling
– Discover topics
– Study how topics evolve over time
Compare how a page describes itself (meta text) vs. how other pages linking to it describe it (anchor text)
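A building block behind several of these analyses is term weighting by TF-IDF, which Mahout computes at scale. A minimal sketch over a toy corpus, using the plain log-of-inverse-document-frequency variant:

```python
import math
from collections import Counter

# Sketch: TF-IDF over a toy corpus of tokenized documents.
docs = [
    ["web", "archive", "analysis"],
    ["web", "crawler", "logs"],
    ["archive", "metadata"],
]
n_docs = len(docs)
# document frequency: in how many docs does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / df[term])
    return tf * idf

# "analysis" is rarer than "web", so it scores higher in doc 0.
score_analysis = tfidf("analysis", docs[0])
score_web = tfidf("web", docs[0])
```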
Link Analysis (Pig/Giraph)
Links extracted from crawl logs / WARC metadata records / Parsed Text / WAT files
Indegree and outdegree information
Inter-host and intra-host link information
Study how linking behavior changes over time
Rank resources by PageRank
– Identify important resources
– Prioritize crawling of missing resources
Find possible spam pages by running biased PageRank algorithms
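The PageRank step itself can be sketched as power iteration over an adjacency list; production runs use Giraph or Pig over billions of edges, and the damping factor 0.85 below is the conventional choice rather than anything mandated by the workshop:

```python
# Sketch: power-iteration PageRank on a tiny adjacency list.
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node: spread its rank uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Node "c" ends up ranked above "b" because it receives links from both "a" and "b", which is exactly the "identify important resources" use above.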
Completeness
PageRank over Time
Archive Analysis Workshop
Generate derivatives: CDX, WAT, Parsed Text
Set up a CDX Warehouse using Hive
Extract links from WARCs / WAT / Parsed Text
Extract text from WARCs / WAT / Parsed Text
Generate archival web graphs
– Assign integer / fingerprint ID to URLs
– Represent graph as an adjacency list using these IDs and the timestamp info
Generate host and domain graphs
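The graph-building steps above can be sketched as follows: map each URL to an integer ID (first-seen order here; content fingerprints are the alternative mentioned above), keep the timestamp on each edge, and roll the edge list up to a host graph. The sample edges are invented for illustration:

```python
from urllib.parse import urlsplit

# Sketch: timestamped archival web graph plus a host-level rollup.
edges = [  # (timestamp, source URL, destination URL) - made-up sample
    ("20110505", "http://a.org/x", "http://b.org/"),
    ("20110505", "http://a.org/x", "http://a.org/y"),
    ("20110606", "http://b.org/", "http://a.org/x"),
]

ids = {}
def url_id(url):
    """Assign integer IDs to URLs in first-seen order."""
    return ids.setdefault(url, len(ids))

adjacency = {}   # source id -> list of (timestamp, destination id)
host_graph = set()
for ts, src, dst in edges:
    adjacency.setdefault(url_id(src), []).append((ts, url_id(dst)))
    host_graph.add((urlsplit(src).hostname, urlsplit(dst).hostname))
```

Keeping the timestamp on every edge is what lets later analyses (such as PageRank over time) slice the graph by crawl date.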
Archive Analysis Workshop
Text Analysis using Pig / Mahout
– Extract top terms using TF-IDF
– Prepare text for analysis with Mahout
Link Analysis with Pig / Giraph
– Degree Analysis
– PageRank
– Find common links between entities
Data extraction
– Repackage a subset of the data into new (W)ARCs
Archive Analysis Workshop
https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop