Web Archive Profiling Through Fulltext Search

Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529

Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM

David S. H. Rosenthal
Stanford University Libraries, Stanford, CA

Supported in part by the IIPC and NSF 1526700

Unorganized Collections

Organized Collections

Collection Understanding

Memento Aggregator

From: Michael Nelson [mailto:mln@cs.odu.edu]

Sent: Wednesday, December 02, 2015 12:33 PM

To: Jones, Gina

Cc: Rourke, Patrick; Grotke, Abigail

Subject: Re: WebSciDL

Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages.

regards,

Michael

On Wed, 2 Dec 2015, Jones, Gina wrote:

> Hi Michael, we have a slight configuration issue with the current OW

> set up for our webarchives. I think, from looking at the logs, that

> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.

> Do you know who is running this scraper? Itʼs not part of memento is it?

> Gina Jones

> Web Archiving Team

> Library of Congress

From: Ilya Kreymer <ikreymer@gmail.com>

Date: Wed, 2 Dec 2015 10:33:56 -0800

Subject: high traffic on oldweb!

To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com>

Hi Herbert, Sawood,

Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily..

I am thinking that ability to remove source archives quickly is an important aspect of an aggregator.

Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)

Broadcasting is Bad

Availability and Overlap

● Archives are sparse
● Broadcasting is wasteful; both clients and archives suffer

Memento Routing

Routing Pros & Cons

● Pros
    ○ Minimizes traffic and resource consumption
    ○ Improves throughput
● Cons
    ○ Upfront profile maintenance cost
    ○ May miss Mementos (false negatives)
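As a rough illustration of routing, the sketch below forwards a lookup only to archives whose profile lists a matching key, rather than broadcasting to every archive. The archive names, the profile contents, and the `route` helper are hypothetical and are not part of MemGator.

```python
# Hypothetical profiles: archive name -> set of domain-level keys it is believed to hold.
PROFILES = {
    "archive-a": {"uk,co,bbc,)/", "org,example,)/"},
    "archive-b": {"org,example,)/"},
    "archive-c": {"edu,odu,)/"},
}

def route(key, profiles):
    """Return the archives whose profile lists `key`, instead of querying all of them."""
    return sorted(name for name, keys in profiles.items() if key in keys)

# Only two of the three archives are queried for this key.
print(route("org,example,)/", PROFILES))
```

A key that appears in no profile yields an empty list, which is where a stale or incomplete profile would produce the false negatives listed under Cons.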

Why Do Small Archives Matter?

● 400B+ web pages at IA do not cover everything

● Top three archives after IA produce full TimeMap 52% of the time (AlSum, et al., TPDL 2013)

● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship

While the Internet Archive was Down...

$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
      2 2002
      1 2005
      1 2008
      6 2009
     67 2010
     17 2011
     64 2012
    108 2013
    108 2014
    186 2015
     51 2016
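The same per-year count can be approximated in Python. This is a minimal sketch that assumes, as the `cut -c-4` and `grep -v "^@"` steps imply, that each memento line starts with a datetime whose first four characters are the year and that metadata lines start with `@`. The sample lines are made up and do not reproduce real MemGator output.

```python
from collections import Counter

def mementos_per_year(cdxj_lines):
    """Count TimeMap entries per year, skipping blank lines and '@' metadata lines."""
    years = (
        line[:4]
        for line in cdxj_lines
        if line.strip() and not line.startswith("@")
    )
    return Counter(years)

# Made-up lines standing in for the output of `memgator -f cdxj example.org`.
sample = [
    '@meta {"note": "hypothetical metadata line"}',
    '20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}',
    '20021129012431 {"uri": "http://archive.example/20021129012431/http://example.org/"}',
    '20050401000000 {"uri": "http://archive.example/20050401000000/http://example.org/"}',
]

for year, count in sorted(mementos_per_year(sample).items()):
    print(f"{count:7d} {year}")
```

Unlike `uniq -c`, which counts adjacent runs, `Counter` counts globally; the two agree when the TimeMap is in chronological order.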

Archive Profile

● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things

Profiling Strategies

● Sample URI Profiling (AlSum, et al., TPDL 2013)

● CDX Profiling (Alam, et al., TPDL 2015)

● Response Cache Profiling (Bornand, et al., JCDL 2016)

● Fulltext Search Profiling

Methodology

Top Nouns

time, year, people, way, man, day, thing, child, mr, government

Random Dict

analogies, unbolt, consonant, coils, stolidly, cigar, decrepit, rhododendron, cannibal, honeydew

Dynamic Words Discovery

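[Word lists: English and Arabic terms discovered dynamically from search results, including: the, war, angry, arab, middle, news, east, service, on, arabica, politics, poetry, source, art, وكالة, أنباء, العربي, الغاضب]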

Random Searcher Model (RSM)

Seed Vocabulary

NextWord()

ExtractWords()

Search()

Select a random link from the search results

Vocabulary seeding needed?

Termination condition reached?

GenerateProfile()

Store search results

Fetch the contents of the selected document
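One plausible reading of the flowchart steps above can be sketched as a loop. The ordering of the steps, the in-memory `CORPUS`, and the `search` and `fetch` helpers are assumptions made for illustration; a real run would query an archive's fulltext search interface. The sketch rediscovers words only when the vocabulary runs out.

```python
import random

def extract_words(text):
    """ExtractWords(): lowercase alphabetic tokens of a document."""
    return [w.lower() for w in text.split() if w.isalpha()]

def random_searcher(search, fetch, seed_words, max_searches, rng):
    vocabulary = list(seed_words)              # Seed Vocabulary
    used, results, searches = set(), set(), 0
    while searches < max_searches:             # Termination condition reached?
        if not vocabulary:                     # Vocabulary seeding needed?
            # Select random links from stored results until one yields unseen words.
            for uri in rng.sample(sorted(results), len(results)):
                vocabulary = sorted(set(extract_words(fetch(uri))) - used)
                if vocabulary:
                    break
            else:
                break                          # No document offers new words; stop.
        word = vocabulary.pop(rng.randrange(len(vocabulary)))  # NextWord()
        used.add(word)
        results.update(search(word))           # Search() and store search results
        searches += 1
    return results                             # Input to GenerateProfile()

# Toy stand-in for an archive with fulltext search (hypothetical data).
CORPUS = {
    "http://a.example/": "time flies like an arrow",
    "http://b.example/": "fruit flies like a banana",
    "http://c.example/": "banana bread recipe",
}

def search(word):
    return {uri for uri, text in CORPUS.items() if word in text.split()}

def fetch(uri):
    return CORPUS[uri]

found = random_searcher(search, fetch, ["time"], max_searches=50, rng=random.Random(0))
print(sorted(found))
```

Starting from the single seed word "time", the loop reaches all three toy documents because each fetched document contributes words that match another one.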

RSM Illustration

Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op Education Green Technology You are here NC NET Teaching Resources Discipline Specific English English Self Paced Modules Writing Across the Curriculum NC NET Western Center Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College All self paced modules can be accessed through the NC NET Blackboard server Log in with the user name faculty and the password nc net Once connected you can view the courses by topic or alphabetically by title English Webliography North Carolina Community College System 2012

RSM Modes

● Static: Externally supplied static word list
● PopularityBiased: Refresh Vocabulary after every search attempt and consider term frequency for selecting next search keyword
● EqualOpportunity: Refresh Vocabulary after every search attempt and ignore term frequency for selecting next search keyword
● Conservative: Discover new words only when the Vocabulary is exhausted
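The difference between PopularityBiased and EqualOpportunity comes down to how the next keyword is drawn from the vocabulary. A minimal sketch, assuming the vocabulary is kept as a mapping from word to observed term frequency (the `next_word` helper and the counts are hypothetical):

```python
import random

def next_word(term_counts, mode, rng):
    """Pick the next search keyword from a word -> term frequency mapping."""
    words = sorted(term_counts)
    if mode == "PopularityBiased":
        # Frequent terms are proportionally more likely to be selected.
        weights = [term_counts[w] for w in words]
        return rng.choices(words, weights=weights, k=1)[0]
    if mode == "EqualOpportunity":
        # Term frequency is ignored; every known word is equally likely.
        return rng.choice(words)
    raise ValueError(f"unsupported mode: {mode}")

counts = {"news": 99, "poetry": 1}   # made-up term frequencies
rng = random.Random(0)
biased = sum(next_word(counts, "PopularityBiased", rng) == "news" for _ in range(1000))
equal = sum(next_word(counts, "EqualOpportunity", rng) == "news" for _ in range(1000))
print(biased, equal)
```

With these counts the biased draw returns "news" almost every time, while the uniform draw returns it about half the time.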

Profiling Policies & Archive-It Dataset

Policy   # Keys       Example
URIR     30,800,406   uk,co,bbc,news,)/Images/Logo.png?height=80&width=200
HxP1     1,724,284    uk,co,bbc,news,)/Images
DDom     91,629       uk,co,bbc,)/
H1P0     212          uk,)/

Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40

For a detailed list of profiling policies, please refer to: Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
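The example keys in the table can be reproduced by truncating the most detailed key. The sketch below starts from an already canonicalized key and keeps a limited number of host and path segments. The `policy_key` helper is hypothetical, and treating DDom as "three host segments" only holds for this `co.uk` example, since finding the registered domain in general needs a public suffix list.

```python
def policy_key(surt_key, max_host=None, max_path=None):
    """Truncate a SURT-style key to the given number of host and path segments.

    max_host=None keeps all host segments; max_path=None keeps the full path and query.
    """
    host_part, _, rest = surt_key.partition(")")
    hosts = [h for h in host_part.split(",") if h]
    if max_host is not None:
        hosts = hosts[:max_host]
    prefix = ",".join(hosts) + ",)"
    if max_path is None:
        return prefix + rest
    path = rest.split("?", 1)[0]                 # drop the query string
    segments = [s for s in path.split("/") if s]
    return prefix + "/" + "/".join(segments[:max_path])

urir = "uk,co,bbc,news,)/Images/Logo.png?height=80&width=200"
print(policy_key(urir))                          # URIR
print(policy_key(urir, max_path=1))              # HxP1
print(policy_key(urir, max_host=3, max_path=0))  # DDom (for this example only)
print(policy_key(urir, max_host=1, max_path=0))  # H1P0
```

Because every coarser key is a prefix-style truncation of the finer one, many URIs collapse onto the same key, which is why the key counts in the table shrink so sharply from URIR to H1P0.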

Searches vs Coverage

100% in 11K searches

100% in 27K searches

100% in 337K searches

100% in 1.9M searches

RSM Operation Mode Costs

Mode              Query Cost   HTTP Cost              Remarks
Static            C            C                      Suitable for specialized collections with known top keywords
PopularityBiased  C            2 * C                  Human-like model, but costly
EqualOpportunity  C            2 * C                  Human-like model, but costly
Conservative      C            C + ε (where ε << C)   Suitable for any collection; works without supplementary materials and with very little overhead

Routing Confusion Matrix

Predicted \ Actual          Present in the Archive   Not in the Archive
Routed to the Archive       True Positive (TP)       False Positive (FP)
Not Routed to the Archive   False Negative (FN)      True Negative (TN)

Routing Confusion Matrix: Recall and Accuracy
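Assuming the standard definitions, recall is TP / (TP + FN) and accuracy is (TP + TN) / (TP + FP + FN + TN). The sketch below computes both from the four cells of the confusion matrix; the `routing_metrics` helper and the counts are made up for illustration and are not results from the evaluation.

```python
def routing_metrics(tp, fp, fn, tn):
    """Return (recall, accuracy) for routing decisions against one archive."""
    recall = tp / (tp + fn)                      # share of held URIs that were routed
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all decisions that were right
    return recall, accuracy

# Made-up counts: 50 lookups the archive could answer, 50 it could not.
recall, accuracy = routing_metrics(tp=45, fp=20, fn=5, tn=30)
print(recall, accuracy)
```

In this toy case the false positives pull accuracy down to 0.75 even though recall stays at 0.9, so the cost lands on the archive and aggregator rather than on the user.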

Accuracy, Recall, & Coverage (10-100%)

Panels: DMOZ, IA Wayback, UK Wayback, Memento Proxy

Low Accuracy (high FP) => Archives & Aggregator suffer

Low Recall (high FN) => Users suffer

Profile Policy Recommendations

● IF complete CDX is available THEN
    ○ Generate HxP1 profile
● ELSE IF fulltext search is available THEN
    ○ Generate DDom profile
● ELSE
    ○ Generate H1P0 or other smaller profiles using Sample URIs

Note: It is possible to perform less detailed queries on more specific (higher order) profiles, but not the other way around.
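The note can be illustrated by collapsing a detailed profile into a coarser one. This sketch assumes a profile is a mapping from key to a count of holdings, which is an assumption about the profile layout; the `roll_up` helper and the numbers are hypothetical.

```python
from collections import Counter

def roll_up(profile, host_segments):
    """Collapse a detailed profile into host-only keys of at most `host_segments` labels."""
    coarse = Counter()
    for key, count in profile.items():
        hosts = [h for h in key.split(")")[0].split(",") if h]
        coarse[",".join(hosts[:host_segments]) + ",)/"] += count
    return coarse

# Made-up HxP1-style profile: key -> number of holdings under that key.
hxp1 = {
    "uk,co,bbc,news,)/Images": 3,
    "uk,co,bbc,)/sport": 2,
    "org,example,)/": 4,
}

print(roll_up(hxp1, 1))   # H1P0-style view of the same holdings
```

The reverse is not possible: a profile that only records `uk,)/` carries no information about which hosts or paths contributed to that count.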

RSM Mode Recommendations

● IF the collection is about a specific topic in a specific language AND a suitable top keywords list is available THEN
    ○ Use Static mode
● ELSE
    ○ Use Conservative mode

Who Knows Term Frequency for Estonian Nouns?

https://en.wiktionary.org/wiki/Category:Estonian_nouns

Future Work

● Evaluate combination profiles, such as URI-Key along with Datetime
● Use archive profiles to generate a rank-ordered list of archives
● Build profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)

Conclusions

● Evaluated the search cost as a function of archive holdings' coverage and profiling policy
● Developed the Random Searcher Model
● Correctly routed 80% of requests while maintaining 0.9 Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile
