Web Archive Analysis

Vinay Goel, Senior Data Engineer

Internet Archive, vinay@archive.org


Access Web Archive Data

– Wayback Machine

– Search


Enable Research & Analysis

Analysis Workflow


Data: WARC

Data written by web crawlers

Web ARChive Container file: WARC (an ISO standard), a revision of the ARC file format

Each file contains a series of concatenated records (see the sketch after this list):

– Full HTTP request/response records

– Metadata records (links, crawler path, encoding etc.)

– Records to store duplicate detection events

– Records to support segmentation and conversion
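
A minimal sketch of iterating over these records, assuming the third-party warcio library (the filename is hypothetical):

    from warcio.archiveiterator import ArchiveIterator  # assumption: pip install warcio

    # Walk the concatenated records in a (gzipped) WARC file.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':    # full HTTP request/response capture
                url = record.rec_headers.get_header('WARC-Target-URI')
                status = record.http_headers.get_statuscode()
                print(url, status)
            elif record.rec_type == 'metadata':  # links, crawler path, encoding, etc.
                print('metadata for', record.rec_headers.get_header('WARC-Target-URI'))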


Derived Data: CDX

The index for the Wayback Machine

A space-delimited text file

Contains only the essential fields needed by Wayback (see the parsing sketch after this list):

– URL, Timestamp, Content Digest

– MIME type, HTTP Status Code

– Redirect URL, meta tags, size

– WARC filename and file offset of record
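
A minimal sketch of reading such a file, using the header row the CDX file itself declares to name the fields (the path, and the field letters shown in comments, follow the common "CDX N b a m s k r M S V g" layout and are assumptions):

    # Parse CDX lines into dicts keyed by the field letters in the header row.
    def read_cdx(path):
        with open(path, 'r', encoding='utf-8') as f:
            fields = f.readline().split()[1:]   # drop the leading 'CDX' marker
            for line in f:
                yield dict(zip(fields, line.split()))

    for rec in read_cdx('index.cdx'):           # hypothetical path
        # Common field letters: N = massaged URL, b = timestamp, s = HTTP status,
        # m = MIME type, g = WARC filename, V = file offset of the record
        print(rec.get('N'), rec.get('b'), rec.get('s'))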


Wayback Machine


Growth of content


Rate of duplication


Breakdown by Year-First-Crawled

Log Analysis (Hive/Pig/Giraph)

CDX warehouse and crawl log warehouse

Distribution of HTTP status codes and MIME types (sketched below)

Find timeout errors, duplicate content, crawler traps, robots exclusions

Trace the path of the crawler
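
The deck does this with Hive/Pig at scale; as a stand-in, a minimal Python sketch of one such report, the joint status-code / MIME-type distribution over a CDX file (reusing the read_cdx helper sketched above):

    from collections import Counter

    # Count how often each (HTTP status, MIME type) pair occurs in the index.
    dist = Counter((rec.get('s'), rec.get('m')) for rec in read_cdx('index.cdx'))
    for (status, mime), count in dist.most_common(20):
        print(status, mime, count)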


Derived Data: Parsed Text

The input used to build text indexes for search

Text is extracted from (W)ARC files, with HTML boilerplate stripped out

Also contains metadata for each record (see the extraction sketch after this list):

– URL, Timestamp, Content Digest, Record Length

– MIME type, HTTP status code

– Title, description and meta keywords

– Links with anchor text
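
A rough sketch of that extraction step using only Python's standard library (real pipelines use heavier parsers and boilerplate-removal heuristics):

    from html.parser import HTMLParser

    # Minimal sketch: pull the title and links-with-anchor-text out of an HTML page.
    class PageExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.title, self.links = '', []
            self._in_title, self._href = False, None

        def handle_starttag(self, tag, attrs):
            if tag == 'title':
                self._in_title = True
            elif tag == 'a':
                self._href = dict(attrs).get('href')
                if self._href:
                    self.links.append([self._href, ''])

        def handle_endtag(self, tag):
            if tag == 'title':
                self._in_title = False
            elif tag == 'a':
                self._href = None

        def handle_data(self, data):
            if self._in_title:
                self.title += data
            elif self._href and self.links:
                self.links[-1][1] += data   # accumulate anchor text

    p = PageExtractor()
    p.feed('<title>Example</title><a href="http://example.org">a link</a>')
    print(p.title, p.links)   # Example [['http://example.org', 'a link']]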


Derived Data: WAT

An extensible metadata format

Essential metadata for many types of analyses

Avoids barriers to data exchange: copyright, privacy

Less data than WARC, more than CDX

WAT records are WARC metadata records

Contains, for every HTML page in the WARC (see the sketch after this list):

– Title, description and meta keywords

– Embeds and outgoing links with alt / anchor text
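
WAT payloads are JSON; a minimal sketch of pulling the title and link count out of each record, again assuming warcio (the filename, and the JSON paths, follow the common WAT "Envelope" layout and should be treated as assumptions):

    import json
    from warcio.archiveiterator import ArchiveIterator  # assumption: pip install warcio

    with open('example.warc.wat.gz', 'rb') as stream:   # hypothetical filename
        for record in ArchiveIterator(stream):
            if record.rec_type != 'metadata':
                continue
            wat = json.loads(record.content_stream().read())
            # Common layout: Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata
            html = (wat.get('Envelope', {})
                       .get('Payload-Metadata', {})
                       .get('HTTP-Response-Metadata', {})
                       .get('HTML-Metadata', {}))
            title = html.get('Head', {}).get('Title')
            links = html.get('Links', [])
            if title:
                print(title, len(links), 'links')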


Text Analysis (Pig/Mahout)

Text extracted from WARC / Parsed Text / WAT files

Use curated collections to train classifiers

Cluster documents in collections

Topic modeling:

– Discover topics

– Study how topics evolve over time

Compare how a page describes itself (meta text) vs. how other pages linking to it describe it (anchor text)
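
As a toy illustration of that last comparison, a sketch scoring the overlap between a page's meta text and the anchor text of its inlinks with cosine similarity over term counts (all strings hypothetical):

    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two term-count vectors.
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    meta = Counter('web archive analysis tools'.split())          # page's own description
    anchors = Counter('wayback archive search internet'.split())  # how others link to it
    print(round(cosine(meta, anchors), 3))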


Link Analysis (Pig/Giraph)

Links extracted from crawl logs / WARC metadata records / Parsed Text / WAT files

Indegree and outdegree information

Inter-host and intra-host link information

Study how linking behavior changes over time

Rank resources by PageRank (sketched after this list):

– Identify important resources

– Prioritize crawling of missing resources

Find possible spam pages by running biased PageRank algorithms
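
The deck runs PageRank with Pig/Giraph; as a small stand-in, a power-iteration sketch over an in-memory adjacency list (toy graph; the damping factor and iteration count are conventional choices):

    # Minimal PageRank by power iteration over an adjacency list.
    graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}   # toy graph
    nodes = list(graph)
    n, d = len(nodes), 0.85
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(50):
        new = {u: (1 - d) / n for u in nodes}
        for u, outlinks in graph.items():
            if outlinks:
                share = d * rank[u] / len(outlinks)
                for v in outlinks:
                    new[v] += share
            else:                      # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new

    print(sorted(rank.items(), key=lambda kv: -kv[1]))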


Completeness


PageRank over Time


Archive Analysis Workshop

Generate derivatives: CDX, WAT, Parsed Text

Set up a CDX warehouse using Hive

Extract links from WARCs / WAT / Parsed Text

Extract text from WARCs / WAT / Parsed Text

Generate archival web graphs (see the sketch after this list):

– Assign an integer / fingerprint ID to each URL

– Represent the graph as an adjacency list using these IDs and the timestamp info

Generate host and domain graphs
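
A minimal sketch of that graph-building step: fingerprint each URL to a stable ID, keep timestamped edges in an adjacency list, and roll the page graph up to a host graph (the edge data is hypothetical):

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlsplit

    def fingerprint(url):
        # Stable 64-bit ID for a URL via a truncated SHA-1 digest.
        return int.from_bytes(hashlib.sha1(url.encode('utf-8')).digest()[:8], 'big')

    # (source URL, destination URL, capture timestamp) triples; hypothetical data.
    edges = [('http://a.example/p', 'http://b.example/q', '20130401000000')]

    adjacency = defaultdict(list)   # src ID -> [(dst ID, timestamp), ...]
    host_graph = defaultdict(set)   # src host -> {dst host, ...}
    for src, dst, ts in edges:
        adjacency[fingerprint(src)].append((fingerprint(dst), ts))
        host_graph[urlsplit(src).netloc].add(urlsplit(dst).netloc)

    print(dict(adjacency))
    print(dict(host_graph))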


Archive Analysis Workshop

Text analysis using Pig / Mahout (see the TF-IDF sketch after this list):

– Extract top terms using TF-IDF

– Prepare text for analysis with Mahout

Link analysis with Pig / Giraph:

– Degree analysis

– PageRank

– Find common links between entities

Data extraction:

– Repackage a subset of the data into new (W)ARCs
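
A toy sketch of the TF-IDF top-terms step (the workshop does this in Pig/Mahout at scale; the corpus here is hypothetical):

    import math
    from collections import Counter

    # Toy corpus; each document is a bag of terms.
    docs = [Counter(d.split()) for d in (
        'web archive analysis', 'wayback machine archive', 'pagerank link analysis')]
    n = len(docs)
    df = Counter(t for d in docs for t in d)    # document frequency per term

    def top_terms(doc, k=3):
        # tf-idf with raw term frequency and a plain inverse document frequency.
        scores = {t: tf * math.log(n / df[t]) for t, tf in doc.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    for doc in docs:
        print(top_terms(doc))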


Archive Analysis Workshop

https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop