Building Event Collections from Crawling Web...

Building Event Collections from Crawling Web Archives @mart1nkle1n IIPC WAC 2018, 11/13/2018, Wellington, NZL Building Event Collections from Crawling Web Archives Martin Klein 1 Lyudmila Balakireva 1 Herbert Van de Sompel 2 1 Research Library Los Alamos National Laboratory 2 Data Archiving and Networked Services The Netherlands

Transcript of Building Event Collections from Crawling Web...

Page 1: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Building Event Collections from Crawling Web Archives

@mart1nkle1n

IIPC WAC 2018, 11/13/2018, Wellington, NZL

Building Event Collections

from

Crawling Web Archives

Martin Klein1

Lyudmila Balakireva1

Herbert Van de Sompel2

1Research Library

Los Alamos National Laboratory

2Data Archiving and Networked Services

The Netherlands

Page 2: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Inspiration from Previous Work

https://doi.org/10.1007/978-3-319-67008-9_10

Page 3: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Published at WebSci 2018

https://doi.org/10.1145/3201064.3201085

Page 4: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Questions

1. Can we create event collections by focused crawling online-available web archives?
2. How do event collections created from the archived web compare to those created from the live web?
3. How does the amount of time passed since the event affect the collections built from the live and the archived web?
4. How do event collections built from the archived web compare to manually curated collections?

Page 5: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Background – Event Collection Building

• Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc.)
• Utilization of crawling services such as Archive-It, Social Feed Manager

Page 6: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Problems and our Approach

• Temporal: time passed since event is of concern
  Use of web archives via Memento infrastructure
• Selection: seeds often picked manually
  Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  Use of focused crawling with content and temporal relevance assessment

Page 7: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Page 8: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Page 9: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Problems and our Approach

• Temporal: time passed since event is of concern
  Use of web archives
• Selection: seeds often picked manually
  Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  Use of focused crawling with content and temporal relevance assessment

Page 10: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Page 11: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Page 12: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Problems and our Approach

• Temporal: time passed since event is of concern
  Use of web archives
• Selection: seeds often picked manually
  Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  Use of focused crawling with content and temporal relevance assessment

Page 13: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Focused Crawling

[Figure: crawl tree with a seed, its children (Child 1 to Child 3), and their children. Legend: not crawled; crawled and not relevant; crawled and relevant.]

Page 14: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Focused Crawling

[Figure: crawl tree with a seed, its children (Child 1 to Child 3), and their children. Legend: not crawled; crawled and not relevant; crawled and relevant.]

Page 15: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Focused Crawling

[Figure: crawl tree with a seed, its children (Child 1 to Child 3), and their children. Legend: not crawled; crawled and not relevant; crawled and relevant.]

Page 16: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Content Relevance

1. Content of Wikipedia page + random 60% of page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
2. Content of remaining 40% of Wikipedia page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
• Compute cosine similarity value between vectors 1 and 2
• Run 10 times
• Take average cosine similarity value as content threshold
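The threshold procedure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the tokenizer (whitespace split), the smoothed IDF formula, the choice to compute IDF over the page plus all references, and the names `ngrams`, `tfidf_vector`, `cosine`, and `content_threshold` are all hypothetical.

```python
import math
import random
from collections import Counter


def ngrams(text):
    """Return the lowercase 1-grams and 2-grams of a text (whitespace tokenizer)."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vector(group, all_docs):
    """TF-IDF vector for a group of documents; IDF is computed over all_docs."""
    tf = Counter(term for doc in group for term in ngrams(doc))
    doc_terms = [set(ngrams(doc)) for doc in all_docs]
    n = len(all_docs)
    vector = {}
    for term, freq in tf.items():
        df = sum(1 for terms in doc_terms if term in terms)
        # Smoothed IDF (an assumption; the slide does not name a variant).
        vector[term] = freq * (math.log((1 + n) / (1 + df)) + 1)
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(page_text, reference_texts, runs=10, share=0.6, seed=0):
    """Average cosine similarity over `runs` random 60/40 splits of the references."""
    rng = random.Random(seed)
    all_docs = [page_text] + list(reference_texts)
    split = int(round(share * len(reference_texts)))
    similarities = []
    for _ in range(runs):
        shuffled = list(reference_texts)
        rng.shuffle(shuffled)
        vector_1 = tfidf_vector([page_text] + shuffled[:split], all_docs)
        vector_2 = tfidf_vector(shuffled[split:], all_docs)
        similarities.append(cosine(vector_1, vector_2))
    return sum(similarities) / len(similarities)
```

A crawled page would then presumably count as content-relevant when its similarity to the topic vector reaches this averaged value; how the comparison is done in practice is not stated on the slide.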

Page 17: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Temporal Relevance

• Define temporal interval for which crawled pages are considered relevant
• Event date extracted from Wikipedia event page

[Figure: timeline with the points Event Date, Change Point, and Today, labeled with the values 1 and 0.]
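One possible reading of the temporal interval is a step function: a page scores 1 if its datetime falls between the event date and the change point, and 0 otherwise. The slide does not spell out the exact shape of the score, so the step function, the inclusive bounds, and the name `temporal_relevance` are assumptions made for this sketch.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point):
    """Return 1.0 if page_date lies in [event_date, change_point], else 0.0.

    Assumption: a binary score with inclusive bounds; the slide only shows
    the values 1 and 0 along a timeline.
    """
    return 1.0 if event_date <= page_date <= change_point else 0.0
```

For example, with an event date of 2016-06-12 and a hypothetical change point of 2016-07-01, a page dated 2016-06-20 would score 1.0 and a page dated 2016-08-01 would score 0.0.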

Page 18: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Change Point Detection

[Figure: plot with x-axis "Edit Dates" (2016-06-12 to 2017-08-24) and y-axis "Percentage" (0 to 100); one label in the plot reads 46.]

• Plot number of Wikipedia page edits per day
• Run R’s changepoint algorithm
• Detect significant change in curve

https://cran.r-project.org/web/packages/changepoint/index.html
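To illustrate the idea of finding a change in the edit curve, the sketch below locates a single change in the mean of a series by minimizing the summed squared error of the two resulting segments. It is a simplified stand-in and not the algorithm of the R changepoint package; the function name `single_change_point` and the sample edit counts are hypothetical.

```python
def single_change_point(values):
    """Return the index that best splits `values` into two segments with
    different means (the index of the first element of the second segment).

    Simplified least-squares search for one change point; returns None if
    the series has fewer than two values.
    """
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    best_index, best_cost = None, float("inf")
    for i in range(1, len(values)):
        cost = sse(values[:i]) + sse(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Hypothetical daily edit counts: busy right after the event, quiet later.
edits_per_day = [40, 35, 38, 42, 5, 3, 4, 2, 3]
change_index = single_change_point(edits_per_day)
```

Here `change_index` marks the first day of the quieter segment; mapping that index back to a calendar date would give the change point used for the temporal interval.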

Page 19: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Datetime Extraction

• Extract datetime from pages via:
  • URI: http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
  • Meta tags: <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
  • ODU’s Carbondate tool: http://carbondate.cs.odu.edu/
  • Memento datetime
  • X-Header
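The first two extraction sources, the URI path and the meta tag, can be sketched with regular expressions. The patterns below are assumptions tailored to the two examples on the slide (a `/YYYY/MM/DD/` path segment, and a `datePublished` meta tag whose `itemprop` attribute precedes `content`); real pages vary, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed patterns, matching the two examples shown on the slide.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")
META_DATE = re.compile(
    r'<meta[^>]*itemprop="datePublished"[^>]*content="([^"]+)"',
    re.IGNORECASE,
)


def date_from_uri(uri):
    """Return a datetime parsed from a /YYYY/MM/DD/ path segment, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. /2017/13/45/ is not a valid date
        return None


def date_from_meta(html):
    """Return a datetime parsed from a datePublished meta tag, or None."""
    match = META_DATE.search(html)
    if not match:
        return None
    try:
        return datetime.fromisoformat(match.group(1))
    except ValueError:
        return None
```

Trying the cheap sources first and falling back to the others (Carbondate, Memento datetime, headers) only when they fail would be one plausible ordering, though the slide does not state one.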

Page 20: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Experiment Details

• Topics limited to terror attacks and mass shootings in the U.S.
  • From different times in the past
• Take content and temporal relevance into account
  • Equally weighted
• Use events’ Wikipedia page as input for focused crawler
  • Version that was live at change point

Page 21: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Crawl Details

• Focused crawl of:
  • 22 archives, simultaneously, via Memento infrastructure
  • The live web
• Seeds
  • Memento of Wikipedia page references closest to and after event time
  • Subject to temporal and contextual relevance assessment
• Crawled outlinks
  • Memento of outlinks closest to and after event time
  • Subject to temporal and contextual relevance assessment
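Selecting the Memento "closest to and after event time" can be sketched as a filter over a list of captures. The sketch assumes the captures for a URI are already available as (datetime, URI) pairs, for instance parsed from a TimeMap; the function name and the example.org URIs are hypothetical.

```python
from datetime import datetime


def closest_memento_after(mementos, event_time):
    """Return the earliest (datetime, uri) pair at or after event_time.

    `mementos` is an iterable of (capture datetime, memento URI) pairs.
    Returns None if no capture exists at or after the event.
    """
    candidates = [m for m in mementos if m[0] >= event_time]
    return min(candidates, key=lambda m: m[0]) if candidates else None


# Hypothetical captures of one referenced page.
captures = [
    (datetime(2010, 12, 30), "https://archive.example.org/20101230/page"),
    (datetime(2011, 1, 9), "https://archive.example.org/20110109/page"),
    (datetime(2011, 3, 1), "https://archive.example.org/20110301/page"),
]
chosen = closest_memento_after(captures, datetime(2011, 1, 8))
```

With the Tucson event date of 2011-01-08, the capture from 2011-01-09 is chosen and the capture that predates the event is ignored.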

Page 22: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Crawl Details

• Crawl stop conditions:
  • No more relevant documents left
  • 5 levels deep
• Utilized crawl priority queue

[Figure: crawl tree with the seed at Level 0, its children (Child 1 to Child 3) at Level 1, and their children at Level 2.]
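The stop conditions and the priority queue can be combined in a short crawl loop. This is a sketch under assumptions rather than the crawler used in the experiments: the `fetch` and `relevance` callables are hypothetical placeholders, an outlink inherits its parent's score as its queue priority, and outlinks of non-relevant pages are not followed.

```python
import heapq
from itertools import count


def focused_crawl(seeds, fetch, relevance, threshold, max_depth=5):
    """Crawl from `seeds`, following outlinks of relevant pages only.

    fetch(uri) -> (text, outlinks) and relevance(text) -> float are supplied
    by the caller. Stops when the queue is empty (no more relevant documents
    left to expand) and never goes deeper than max_depth levels.
    """
    tie = count()  # tie-breaker keeps insertion order for equal priorities
    queue = [(-1.0, next(tie), uri, 0) for uri in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    relevant = []
    while queue:
        _, _, uri, depth = heapq.heappop(queue)
        text, outlinks = fetch(uri)
        score = relevance(text)
        if score < threshold:
            continue  # crawled and not relevant: its outlinks stay uncrawled
        relevant.append(uri)
        if depth == max_depth:
            continue  # depth limit reached: do not go any deeper
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # heapq is a min-heap, so negate the score for highest-first.
                heapq.heappush(queue, (-score, next(tie), link, depth + 1))
    return relevant


# Tiny in-memory "web" for illustration.
pages = {
    "seed": ("event", ["a", "b"]),
    "a": ("event", ["c"]),
    "b": ("other", ["d"]),
    "c": ("event", []),
    "d": ("event", []),
}
result = focused_crawl(
    ["seed"],
    fetch=lambda uri: pages[uri],
    relevance=lambda text: 1.0 if text == "event" else 0.0,
    threshold=0.5,
)
```

In this toy graph, page "d" is never fetched because its only parent "b" is not relevant, which mirrors the "not crawled" nodes in the focused crawling figure.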

Page 23: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Collections Crawled (in November 2017)

• New York City, October 31st 2017
• Las Vegas, October 1st 2017
• Orlando, June 12th 2016
• San Bernardino, December 2nd 2015
• Tucson, January 8th 2011
• Binghamton, April 3rd 2009

Page 24: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

NYC, 10/31/2017 – URIs per Level

[Figure: two panels, "Web Archive Crawl" and "Live Web Crawl". x-axis: crawl depth (0 to 5); y-axes: number of URIs (0 to 2000) and percent (0 to 100); series: All URIs and Relevant URIs.]

Page 25: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

TUC, 01/08/2011 – URIs per Level

[Figure: two panels, "Web Archive Crawl" and "Live Web Crawl". x-axis: crawl depth (0 to 5); y-axes: number of URIs (0 to 80000) and percent (0 to 100); series: All URIs and Relevant URIs.]

Page 26: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

NYC, 10/31/2017 – Relevance over…

[Figure: two panels, "Crawled Documents" and "Crawl Time".]

Page 27: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

TUC, 01/08/2011 – Relevance over…

[Figure: two panels, "Crawled Documents" and "Crawl Time".]

Page 28: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

TUC, 01/08/2011 – Comparison to Archive-It

[Figure: x-axis "Documents" (0 to 15000); y-axis "Accumulated Relevance" (0 to 15000); series: Web Archive Crawl and Archive-It Crawl.]

Page 29: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

TUC, 01/08/2011 – Web Archive Contributions

• web.archive.org: 75%
• wayback.archive-it.org: 14%
• webarchive.loc.gov: 7%
• web.archive.bibalex.org: 2%
• archive.is: 2%

Page 30: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

Takeaways

• Web archives are great resources to build event collections of web resources
• Crawling web archives is much slower than crawling the live web
• Collections about very recent events benefit more from the live web than the archived web
  but
• Collections about events from the distant past benefit more from the archived web than the live web
• Utilizing multiple web archives is beneficial for the collection
• Focused crawls have the potential to outperform manual collection building

Page 31: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives

https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384

Page 32: Building Event Collections from Crawling Web Archivesnetpreserve.org/ga2018/wp-content/uploads/2018/11/IIPC... · 2018-11-18 · Building Event Collections from Crawling Web Archives
