Web Archive Analysis

Vinay Goel, Senior Data Engineer

Internet Archive, vinay@archive.org


Access Web Archive Data

– Wayback Machine

– Search


Enable Research & Analysis

Analysis Workflow


Data: WARC

Data written by web crawlers

Web ARChive Container file: WARC (an ISO standard), a revision of the ARC file format

Each file contains a series of concatenated records (see the sketch after this list):

– Full HTTP request/response records

– Metadata records (links, crawler path, encoding etc.)

– Records to store duplicate detection events

– Records to support segmentation and conversion
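
A minimal sketch of iterating over these records, assuming the third-party warcio library (the filename is hypothetical):

    from warcio.archiveiterator import ArchiveIterator  # assumption: pip install warcio

    # Walk the concatenated records in a (gzipped) WARC file.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':    # full HTTP request/response capture
                url = record.rec_headers.get_header('WARC-Target-URI')
                status = record.http_headers.get_statuscode()
                print(url, status)
            elif record.rec_type == 'metadata':  # links, crawler path, encoding, etc.
                print('metadata for', record.rec_headers.get_header('WARC-Target-URI'))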


Derived Data: CDX

The index for the Wayback Machine

A space-delimited text file

Contains only the essential fields needed by Wayback (see the parsing sketch after this list):

– URL, Timestamp, Content Digest

– MIME type, HTTP Status Code

– Redirect URL, meta tags, size

– WARC filename and file offset of record
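
A minimal sketch of reading such a file, using the header row the CDX file itself declares to name the fields (the path, and the field letters shown in comments, follow the common "CDX N b a m s k r M S V g" layout and are assumptions):

    # Parse CDX lines into dicts keyed by the field letters in the header row.
    def read_cdx(path):
        with open(path, 'r', encoding='utf-8') as f:
            fields = f.readline().split()[1:]   # drop the leading 'CDX' marker
            for line in f:
                yield dict(zip(fields, line.split()))

    for rec in read_cdx('index.cdx'):           # hypothetical path
        # Common field letters: N = massaged URL, b = timestamp, s = HTTP status,
        # m = MIME type, g = WARC filename, V = file offset of the record
        print(rec.get('N'), rec.get('b'), rec.get('s'))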


Wayback Machine


Growth of content


Rate of duplication


Breakdown by Year-First-Crawled

Log Analysis (Hive/Pig/Giraph)

CDX warehouse and crawl log warehouse

Distribution of HTTP status codes and MIME types (sketched below)

Find timeout errors, duplicate content, crawler traps, robots exclusions

Trace the path of the crawler
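
The deck does this with Hive/Pig at scale; as a stand-in, a minimal Python sketch of one such report, the joint status-code / MIME-type distribution over a CDX file (reusing the read_cdx helper sketched above):

    from collections import Counter

    # Count how often each (HTTP status, MIME type) pair occurs in the index.
    dist = Counter((rec.get('s'), rec.get('m')) for rec in read_cdx('index.cdx'))
    for (status, mime), count in dist.most_common(20):
        print(status, mime, count)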


Derived Data: Parsed Text

The input used to build text indexes for search

Text is extracted from (W)ARC files, with HTML boilerplate stripped out

Also contains metadata for each record (see the extraction sketch after this list):

– URL, Timestamp, Content Digest, Record Length

– MIME type, HTTP status code

– Title, description and meta keywords

– Links with anchor text
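
A rough sketch of that extraction step using only Python's standard library (real pipelines use heavier parsers and boilerplate-removal heuristics):

    from html.parser import HTMLParser

    # Minimal sketch: pull the title and links-with-anchor-text out of an HTML page.
    class PageExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.title, self.links = '', []
            self._in_title, self._href = False, None

        def handle_starttag(self, tag, attrs):
            if tag == 'title':
                self._in_title = True
            elif tag == 'a':
                self._href = dict(attrs).get('href')
                if self._href:
                    self.links.append([self._href, ''])

        def handle_endtag(self, tag):
            if tag == 'title':
                self._in_title = False
            elif tag == 'a':
                self._href = None

        def handle_data(self, data):
            if self._in_title:
                self.title += data
            elif self._href and self.links:
                self.links[-1][1] += data   # accumulate anchor text

    p = PageExtractor()
    p.feed('<title>Example</title><a href="http://example.org">a link</a>')
    print(p.title, p.links)   # Example [['http://example.org', 'a link']]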


Derived Data: WAT

An extensible metadata format

Essential metadata for many types of analyses

Avoids barriers to data exchange: copyright, privacy

Less data than WARC, more than CDX

WAT records are WARC metadata records

Contains, for every HTML page in the WARC (see the sketch after this list):

– Title, description and meta keywords

– Embeds and outgoing links with alt / anchor text
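
WAT payloads are JSON; a minimal sketch of pulling the title and link count out of each record, again assuming warcio (the filename, and the JSON paths, follow the common WAT "Envelope" layout and should be treated as assumptions):

    import json
    from warcio.archiveiterator import ArchiveIterator  # assumption: pip install warcio

    with open('example.warc.wat.gz', 'rb') as stream:   # hypothetical filename
        for record in ArchiveIterator(stream):
            if record.rec_type != 'metadata':
                continue
            wat = json.loads(record.content_stream().read())
            # Common layout: Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata
            html = (wat.get('Envelope', {})
                       .get('Payload-Metadata', {})
                       .get('HTTP-Response-Metadata', {})
                       .get('HTML-Metadata', {}))
            title = html.get('Head', {}).get('Title')
            links = html.get('Links', [])
            if title:
                print(title, len(links), 'links')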


Text Analysis (Pig/Mahout)

Text extracted from WARC / Parsed Text / WAT files

Use curated collections to train classifiers

Cluster documents in collections

Topic modeling:

– Discover topics

– Study how topics evolve over time

Compare how a page describes itself (meta text) vs. how other pages linking to it describe it (anchor text)
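
As a toy illustration of that last comparison, a sketch scoring the overlap between a page's meta text and the anchor text of its inlinks with cosine similarity over term counts (all strings hypothetical):

    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two term-count vectors.
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    meta = Counter('web archive analysis tools'.split())          # page's own description
    anchors = Counter('wayback archive search internet'.split())  # how others link to it
    print(round(cosine(meta, anchors), 3))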


Link Analysis (Pig/Giraph)

Links extracted from crawl logs / WARC metadata records / Parsed Text / WAT files

Indegree and outdegree information

Inter-host and intra-host link information

Study how linking behavior changes over time

Rank resources by PageRank (sketched after this list):

– Identify important resources

– Prioritize crawling of missing resources

Find possible spam pages by running biased PageRank algorithms
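
The deck runs PageRank with Pig/Giraph; as a small stand-in, a power-iteration sketch over an in-memory adjacency list (toy graph; the damping factor and iteration count are conventional choices):

    # Minimal PageRank by power iteration over an adjacency list.
    graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}   # toy graph
    nodes = list(graph)
    n, d = len(nodes), 0.85
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(50):
        new = {u: (1 - d) / n for u in nodes}
        for u, outlinks in graph.items():
            if outlinks:
                share = d * rank[u] / len(outlinks)
                for v in outlinks:
                    new[v] += share
            else:                      # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new

    print(sorted(rank.items(), key=lambda kv: -kv[1]))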


Completeness


PageRank over Time


Archive Analysis Workshop

Generate derivatives: CDX, WAT, Parsed Text

Set up a CDX warehouse using Hive

Extract links from WARCs / WAT / Parsed Text

Extract text from WARCs / WAT / Parsed Text

Generate archival web graphs (see the sketch after this list):

– Assign an integer / fingerprint ID to each URL

– Represent the graph as an adjacency list using these IDs and the timestamp info

Generate host and domain graphs
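
A minimal sketch of that graph-building step: fingerprint each URL to a stable ID, keep timestamped edges in an adjacency list, and roll the page graph up to a host graph (the edge data is hypothetical):

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlsplit

    def fingerprint(url):
        # Stable 64-bit ID for a URL via a truncated SHA-1 digest.
        return int.from_bytes(hashlib.sha1(url.encode('utf-8')).digest()[:8], 'big')

    # (source URL, destination URL, capture timestamp) triples; hypothetical data.
    edges = [('http://a.example/p', 'http://b.example/q', '20130401000000')]

    adjacency = defaultdict(list)   # src ID -> [(dst ID, timestamp), ...]
    host_graph = defaultdict(set)   # src host -> {dst host, ...}
    for src, dst, ts in edges:
        adjacency[fingerprint(src)].append((fingerprint(dst), ts))
        host_graph[urlsplit(src).netloc].add(urlsplit(dst).netloc)

    print(dict(adjacency))
    print(dict(host_graph))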


Archive Analysis Workshop

Text analysis using Pig / Mahout (see the TF-IDF sketch after this list):

– Extract top terms using TF-IDF

– Prepare text for analysis with Mahout

Link analysis with Pig / Giraph:

– Degree analysis

– PageRank

– Find common links between entities

Data extraction:

– Repackage a subset of the data into new (W)ARCs
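
A toy sketch of the TF-IDF top-terms step (the workshop does this in Pig/Mahout at scale; the corpus here is hypothetical):

    import math
    from collections import Counter

    # Toy corpus; each document is a bag of terms.
    docs = [Counter(d.split()) for d in (
        'web archive analysis', 'wayback machine archive', 'pagerank link analysis')]
    n = len(docs)
    df = Counter(t for d in docs for t in d)    # document frequency per term

    def top_terms(doc, k=3):
        # tf-idf with raw term frequency and a plain inverse document frequency.
        scores = {t: tf * math.log(n / df[t]) for t, tf in doc.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    for doc in docs:
        print(top_terms(doc))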


Archive Analysis Workshop

https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop