Workshop AU

16.01.2020 netlab.dk

Workshop on Web Archiving

MODULE 1:

WEB ARCHIVING: Theory — and a Bit of Practice

Niels Brügger

Asger Harlung

http://www.netlab.dk/

Module 1: Web Archiving

• Introducing ourselves and NetLab

• Why archive the web

• The research process and research examples

• Project presentation round

• Three kinds of digital content

• WWW as technology, and data mining example

• What is web archiving?

• Methods of web archiving and web crawling

• Challenges for the web crawler

• Crawling — advantages/disadvantages

• Characteristics of the archived web


Introducing Ourselves and NetLab

Niels Brügger – Professor in Media and Internet Studies, Head of NetLab, and of the Centre for Internet Studies, specialising in internet research since 1997.

Asger Harlung – MA in ICT and learning, has previously worked with research in digital rhetoric, and supporting creativity development in learning processes.


Advertisement!

NetLab Services

NetLab’s services are free for members of the DIGHUMLAB communities (KB, and the humanities faculties at AU, AAU, KU, SDU).

We offer different types of support, depending on the needs of the researcher.

Our focus is on the archived web, whether already archived or still to be archived.


Research project

The researcher can enter NetLab via several entry points, and can use one or more of them:

• Intro workshop: on demand, min. 6 participants, 3 modules

• PhD workshop: for PhD students, 1 ECTS, January and August

• Online course: own project, teacher, 6 assignments, 3 ECTS

• Ad hoc support: IT support (Ulrich), research support (Niels)

• Borrow an IT developer: applications May & Sep, 2-4 weeks

• NetLab Forum: open forum, researchers and web archive

• Tools & tutorials ... and much more

Why Archive the Web?

• To preserve the cultural heritage

• To preserve a stable research object

• To be able to document and illustrate a study

• Modern source references

• Documentation in general; legal claims


The Research Process

Scale of reading, with examples:

• Close reading: dr.dk

• Middle reading: FV11-15

• Distant reading: entire .dk

Steps in the process:

• Data collection

• Data cleaning

• Selection/corpus creation

• Analysis (computer supported)

• Analysis (human supported)

• Visualisation

• Long term preservation

Legal challenges

Consider making a Research Data Management plan at: https://dmponline.deic.dk/

NetLab projects with IT developer help

Let's have a look at the list of projects at:

http://www.netlab.dk/research/it-developer-projects/

Probing a Nation’s Web Domain — from Small Data to Big Data

The historical development of an entire national web: .dk 2005-2015

The project is a collaboration with Netarkivet.

[Figure: the .dk web domain in 2006, 2009, 2012 and 2015]


Gross list of ‘probes’:

• Size: e.g. bytes

• Space: e.g. geolocalisation

• Structure: e.g. network of hyperlinks

• Liveliness: e.g. domain names and updating

• Content: e.g. degrees of openness, files, software types, language, website textual elements, semantics


Project Presentation Round

• Time to present yourselves and your projects

• Notes go on a whiteboard, and may be drawn upon for the remainder of the day.

• We expect to return to some of these examples in the afternoon, during the final part of the workshop.


Three kinds of digital content

• Digitised: formerly analog media, transferred to a digital form.

• Born digital: has not previously existed in any other form than digital.

• Reborn digital: born digital content which has been gathered and preserved, and to some extent has been changed in the process.


WWW as Technology

How is a web page like this created?


WWW: one among other services on the internet, built on:

http: Hypertext Transfer Protocol

URL: Uniform Resource Locator (a kind of Uniform Resource Identifier, URI)

html: Hypertext Markup Language

Constructing a URL on WWW:

protocol://subdomain.domain.topdomain/path/page/

http://cc.au.dk/research/researchprograms/
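
The URL pattern above can be taken apart programmatically. The following is a minimal sketch using Python's standard library, not part of the original slides. The function name `split_url` is hypothetical, and it assumes that the top domain is the last label of the host name and the domain the second to last, which does not hold for suffixes such as .co.uk.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the parts named on the slide (naive host splitting)."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,
        # Everything before the last two labels is treated as subdomain(s).
        "subdomain": ".".join(labels[:-2]),
        "domain": labels[-2],
        "topdomain": labels[-1],
        "path": parts.path,
    }


example = split_url("http://cc.au.dk/research/researchprograms/")
for name, value in example.items():
    print(f"{name}: {value}")
```

For the example URL this gives the protocol http, the subdomain cc, the domain au, the top domain dk and the path /research/researchprograms/.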


[Diagram: a network of computers. The user's computer requests a page via http from a webserver (URL, e.g. dr.dk). The webserver acts as a database/CMS (Content Management System) holding images, headings and words. Some elements, such as the weather, come from another webserver (URL, e.g. dmi.dk).]

• Web pages = html-files

• Web pages = patched together in an ‘empty’ shell (stylesheet) of material from databases

• The browser (Safari, Firefox...) translates html into writing, pictures etc.

Small Exercise: Source Code

Small Exercise: Page Source

This allows you to access the underlying HTML code for the entire web page …

… and can be used for example to search for HTML tags, or file types, or to backtrack content from other pages …
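
The same kind of search can be done with a script once the page source has been saved. This is a minimal sketch, not the method used in the workshop: it uses Python's standard `html.parser`, and the class name `SourceInspector`, the helper `file_types` and the sample HTML are all made up for illustration.

```python
import posixpath
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class SourceInspector(HTMLParser):
    """Count HTML tags and collect the URLs found in href/src attributes."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.resources.append(value)


def file_types(urls):
    """Count file extensions (e.g. .png, .pdf) in a list of URLs."""
    counts = Counter()
    for url in urls:
        extension = posixpath.splitext(urlsplit(url).path)[1].lower()
        if extension:
            counts[extension] += 1
    return counts


# Hypothetical page source, standing in for what "view source" shows.
SAMPLE = """<html><body>
<h1>Heading</h1>
<img src="/images/logo.png">
<img src="http://example.org/photo.jpg">
<a href="/report.pdf">Report</a>
<a href="/about/">About</a>
</body></html>"""

inspector = SourceInspector()
inspector.feed(SAMPLE)
types = file_types(inspector.resources)
print(inspector.tag_counts)
print(types)
```

URLs that point to another host, like the second image in the sample, are the ones that help to backtrack content from other pages.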

Data Mining Example

• A researcher wanted to track how Danish enclaves in the U.S.A. presented themselves.

• Text and images were important.

• The example is authentic. What is needed is:

1) Knowledge of “web inspection”,

2) Taking a closer look at existing data, and

3) A bit of persistence :-)


What is Web Archiving?

The International Internet Preservation Consortium’s definition:

“… the process of gathering up data that has been published on the World Wide Web, storing it, ensuring the data is preserved in an archive, and making the collected data available for future research.”

(https://web.archive.org/web/20170606072544/http://netpreserve.org/about-us) (Removed from the live web over the summer of 2017, this definition itself can now only be retrieved from web archives.)

“Any form of deliberate and purposive collection and preservation of web material.”

Brügger, Niels (2018): The Archived Web: Doing History in the Digital Age. MIT Press, p. 79
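
The Internet Archive link cited above carries both the capture time and the original address in the URL itself. As a minimal sketch, assuming the plain form with a 14-digit timestamp (YYYYMMDDhhmmss, usually described as UTC) and no replay modifiers after it, the two parts can be separated like this. The function name `parse_wayback_url` is hypothetical.

```python
import re
from datetime import datetime

# Plain Wayback Machine URL: .../web/<14-digit timestamp>/<original URL>
WAYBACK_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(url):
    """Return (capture time, original URL) for a plain Wayback Machine URL."""
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        raise ValueError(f"Not a plain Wayback Machine URL: {url}")
    timestamp, original_url = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original_url


captured, original = parse_wayback_url(
    "https://web.archive.org/web/20170606072544/http://netpreserve.org/about-us"
)
print(captured.isoformat(), original)
```

Recording the capture time alongside the original address is useful when an archived page is used as a source reference.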


Macro archiving

• Cultural heritage institutions

• Preserve as much as possible

• Big and varied data

• IT expertise, advanced technology, computer power

Micro archiving

• Individual researcher/research group

• Stabilise a concrete research object, here-and-now

• No experience, no advanced technology or computer power


Methods of Web Archiving

• Web crawling (hyperlink crawling)

• Screen image

• Screen filming

• Harvesting via API

• (Delivery from producers)


Web Crawling

[Diagram: domain.com shown as a tree of linked pages. A crawler follows the hyperlinks from page to page, with link levels numbered 0, 1, 2 and 3.]
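
The idea of a crawler following hyperlinks level by level can be sketched as a breadth-first traversal with a depth limit. This is a simplified illustration under stated assumptions, not how any particular archive's crawler works: the link structure `LINKS` is invented, nothing is fetched from the web, and `crawl` is a hypothetical name.

```python
from collections import deque

# Hypothetical link structure: each page maps to the pages it links to.
LINKS = {
    "domain.com/": ["domain.com/a", "domain.com/b"],
    "domain.com/a": ["domain.com/a1", "domain.com/b"],
    "domain.com/b": ["domain.com/b1"],
    "domain.com/a1": ["domain.com/a1x"],
    "domain.com/b1": [],
    "domain.com/a1x": ["domain.com/too-deep"],
    "domain.com/too-deep": [],
}


def crawl(seed, links, max_depth):
    """Breadth-first crawl from seed; returns each reached page with its level."""
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # links on pages at the deepest level are not followed
        for target in links.get(page, []):
            if target not in depth_of:  # each page is harvested only once
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


harvest = crawl("domain.com/", LINKS, max_depth=3)
for page, level in harvest.items():
    print(level, page)
```

With a limit of three levels the page domain.com/too-deep is never reached, which is one way in which the crawl settings decide what ends up in the archive.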


[Diagram: crawlers harvest the pages of domain.dk and domain.com from a list of URLs (URL URL URL …). Each harvest is recorded under a JOB ID.]

[Diagram: By-Harvest. One crawler harvests domainX.com … under JOB ID 11, another harvests domainY.com … under JOB ID 12.]


Challenges for the crawler

• JavaScripts

• Content based on Flash

• Interactive pages

• Streamed content

• Websites with access limitations (password, captcha)

• Cookies, ads, plugins etc.

• Robots.txt

• Deep web (e.g. databases, FTP servers, password-protected content, hidden content, pages not linked to, dynamic content based on requests).

http://da.wikipedia.org/wiki/CAPTCHA
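
To illustrate the robots.txt item in the list above: a website can publish rules that ask crawlers to stay away from certain paths, and a crawler that respects them will skip those pages. The sketch below uses Python's standard `urllib.robotparser`. The rules, the host name domain.example, the crawler name and the function `allowed_urls` are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all crawlers are asked to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt rules allow this crawler to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://domain.example/",
    "http://domain.example/private/report.html",
]
allowed = allowed_urls(ROBOTS_TXT, "example-crawler", urls)
print(allowed)
```

Whether a crawler follows these rules is a choice made by whoever runs it, so the same site may be archived differently by different archives.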


Pages not being crawled

[Diagram: a domain shown as a tree of pages. Crawled pages are ticked (✔). Pages that are not ticked are marked with the reason:]

• Not crawled – too deep

• Not crawled – password protected

• Not crawled – robots.txt

• Not crawled – script

Elements not crawled: Netarkivet

Elements not crawled: Internet Archive

Characteristics of the Archived Web

What is archived is not a 1:1 copy of the material one attempted to archive

These are versions/reconstructions:

• Created in the process of archiving

• On the basis of a number of choices made by the archiver (harvesting strategy, settings, etc.)

• The choices made have consequences for what is archived

• The archived objects are re-assembled in the archive (‘replay’)


The archived version is deficient because of:

• Technical challenges

• The web’s specific characteristics: dynamic, unpredictable

• Potential asynchronicity between updating and archiving

→ archiving takes time

→ certain elements cannot be archived

It is an added challenge that we do not know what is missing:

• Not much documentation

• No baseline to compare with


As scholars using the archived web as an object of study, it is important that we are aware of the pitfalls and sources of error inherent in the material.


The archived objects are re-assembled in the archive (‘replay’): do not expect to find this... but rather this.

Thanks to Emily Maemura for these illustrations.


In contrast to digitised collections, the archived web is to a large extent already marked up: HTML, file names...

[Diagram: html + files; online web archiving; link list; named entities; ?]

