Workshop on Web Archiving€¦ · Ditigal Impact on the Humanities 8 (based on Alvin Toffler:...

Workshop KU

02.11.2016 netlab.dk

Workshop on Web Archiving

MODULE 1: WEB ARCHIVING

Niels Brügger

Asger Harlung

http://www.netlab.dk/

netlab.dk

Workshop KU

02.11.2016

Program 02.11.2016, KU

2

12.00-13.00 Workshop part 1: Web archiving

13.00-13.15 Coffee

13.15-14.00 Workshop part 2: Existing collections

14.00-14.10 Short break

14.10-15.00 Workshop part 3: Doing your own archiving (incl. hands-on)

15.00-15.05 Wrap-up; feedback

15.05-15.30 Open discussion: Thoughts and questions


netlab.dk

Workshop KU

02.11.2016

Module 1: Web Archiving

3

• Introducing ourselves and NetLab

• Why archive the web

• Two examples; dr.dk’s history and Probing a Nations Web

• Three kinds of digital content

• WWW as technology

• What is web archiving?

• Methods of web archiving

• Challenges for the web crawler

• Crawling — advantages/disadvantages

• Characteristics of the archived web


netlab.dk

Workshop KU

02.11.2016

Introducing Ourselves and NetLab

4

Niels Brügger – Professor (MSO, with special responsibilities) in Internet Studies and Digital Humanities, Head of NetLab, and of the Centre for Internet Studies, specialising in internet research since 1997.

Asger Harlung – MA in ICT and learning, has previously worked with research in digital rhetoric, and supporting creativity development in learning processes

… and the rest of NetLab is:

Janne Nielsen – Assistant Professor, PhD, Media Studies, AU; author of the book: Using Web Archives in Research – an Introduction (2016)

Ulrich Have – IT Architect, MA in Information Studies. At NetLab, he is developer of tools for research using archived web material and does data analytics.


netlab.dk

Workshop KU

02.11.2016

Introducing Ourselves and NetLab

5

A research infrastructure for internet research.

Part of the Danish research infrastructure Digital Humanities

Lab (DIGHUMLAB).

Established in 2012.

Research driven development of research infrastructures.


Digital Humanities Lab

Language Tools

(KU)

Media Tools

(AU)

Interaction & Design

(AAU & SDU)

Audio and visual

materials

NetLab

Online

Archived

Netarkivet

(the Danish

national web

archive)

NetLab

Forum

IT architect

Collecting

data for

specific

projects

2017-18, micro stipends

6

netlab.dk

Workshop KU

02.11.2016

Ditigal Impact on the Humanities

7

We are living in an

ever more digitised

world …

We know that …

why are they

stating the

obvious?


netlab.dk

Workshop KU

02.11.2016


8

(based on Alvin Toffler: TheThird Wave (1980))

• Neolithic revolution, ca. 12,500 years ago: Agriculture is the first

technological revolution, and replaces hunter-gatherer culture.

Hundreds or thousands of years between technogical advances.

• Industrial revolution (I and II), ca. 1760-1950’s. Decades

between technological advances, but faster advancements

since late in the 19th century (”second industrial revolution”)

• Post-industrial age, information age, digital age … 1950’s and

onwards. Years, months, sometimes weeks between

technological leaps. State of the art today may be obsolete

before the end of the year (a state of perpetual ”future shock”).


netlab.dk

Workshop KU

02.11.2016


9

• 2000: 75% of the world’s data was stored in analog form

(paper, film, photographic prints, vinyl, magnetic

casette tapes, etc.),

• 2007: 7% analog, 93% digital

• 2012: Only 2% of all stored data was stored in analog

form.

Mayer-Schönberger & K. Cukier (2013):

Big Data: A revolution that will transform how we live, work, and think. Houghton Mifflin

Harcourt Publishing Company, New York, 2013, pp. 8-9


netlab.dk

Workshop KU

02.11.2016

Why Archive the Web?

10

• To preserve the cultural heritage

• To preserve a stable research object

• To be able to document and illustrate a study


netlab.dk

Workshop KU

02.11.2016

dr.dk’s History

11


12

• Started in 2007

• Analyse the historical development of

dr.dk

• Partly based on archived web content

• www.drdk.dk

Dr.dk’s history 1996-2006

http://www.drdk.dk/

netlab.dk

Probing a Nation’s Web Domain — from Small Data to Big Data

13

Website

Web page

Web element

Web sphere

Web sphere


netlab.dk


14

The historical development of an entire national web:

.dk 2005-2015

The project is a collaboration with Netarkivet.

2006 2009 2012 2015


netlab.dk


15

Brutto list of 'probes’:

• Size — e.g. bytes

• Space — e.g. geolocalisation

• Structure — e.g. network of hyperlinks

• Liveliness — e.g. domain names and updating

• Content — e.g. degrees of openness, files, software types,

language, website textual elements, semantics


Digitised Formerly analogous media, transferred to a digital form.

Born Digital Has not previously existed in any other form than digital.

Reborn Digital Born digital content which has been gathered and

preserved, and to some extent has been changed in the

process.

16

netlab.dk

Web element

WWW as Technology

17


netlab.dk

Web page

WWW as Technology

18


netlab.dk

Website

Web page

Web element

Web site

WWW as Technology

19


WWW — one among other internet protocols:

http — Hyper Text Transfer Protocol

URL — Uniform Resource Identifier (Locator)

html — Hyper Text Markup Language

Constructing a URL on WWW:

protocol://subdomain.domain.topdomain/path/page/

http://cc.au.dk/research/researchprograms/

20

Web pages = patched together in an ‘empty’ shell (stylesheet) of material from databases

21

The browser (Safari, Firefox...) translates html into writing, pictures etc.

Network of computers

html html html html html html html html html html html html

Computer (webserver)URL, dr.dk

Computer (user)

http

http

Computer (webserver) as database, CMS (Content Management System), URL dr.dk

Web pages = html-files

Images

Heading

Words

Computer (webserver) as database, URL, e.g. dmi.dk

Weather

Comp. X

Comp. Y

22

Small Exercise: Source Code

netlab.dk

Workshop KU

02.11.2016

What is Web Archiving?

23

International Internet Preservation Consortium’s definition:

”… the process of gathering up data that has been published

on the World Wide Web, storing it, ensuring the data is

preserved in an archive, and making the collected data

available for future research.”

(http://netpreserve.org/about-us)

”Any form of deliberate and purposive preserving of web

material.” (Brügger, 2011:25)

Brügger, Niels (2011): ”Web Archiving – Between Past, Present, and Future”. IN: Mia

Consalvo and Charles Ess (eds.) The Handbook of Internet Studies. Wiley-Blackwell.


http://netpreserve.org/about-us



netlab.dk

Workshop KU

02.11.2016

What is Web Archiving?

24

Macro archiving

• Cultural heritage institutions

• Preserve as much as possible

• Big and varied data

• IT expertise, advanced technology, computer power

Micro archiving

• Individual researcher/research group

• Stablize a concrete research object, here-and-now

• No experience, no advanced technology or computer

power


netlab.dk

Workshop KU

02.11.2016

Methods of Web Archiving

25

• Web crawling (hyperlink crawling)

• Screen image

• Screen filming

• Harvesting via API

• (Delivery from producers)


netlab.dk

Workshop KU

02.11.2016

Web Crawling

26

domain.com

page

page page

page

page page page

page

page


netlab.dk

Workshop KU

02.11.2016

Web Crawling

27

domain.com

page

page page

page

page page page

crawler

page

page

1

0

2

3


28

domain.dk

page page page page

page page page

page page page

page page page

page page page page page page

URL URL URL URL URL …

domain.dk

page

page

page page

page

page

page

page

page

page

page

page

page

page

page

page

page

page

page

page

domain.dk

page page page page

page page page

page page page

page page

page page page page page page

crawler

crawler

domain.dk

domain.com

netlab.dk

Workshop KU

02.11.2016

Challenges for the crawler

29

• JavaScripts

• Content based on Flash

• Interactive pages

• Streamed content

• Websites with access limitations (password, captcha)

• Cookies, adds, plugins etc.

• Robots.txt

• Deep web (e.g. databaser, ftp-server, password-protected

content, hidden content, pages not linked to, dynamic

content based on requests).

http://da.wikipedia.org/wi

ki/CAPTCHA


netlab.dk

Pages not being crawled

30

✔

domain

✔

✔ ✔ ✔ ✔

✔ ✔ ✔

✔ ✔ page

✔ ✔

✔ ✔ ✔ ✔ ✔ ✔

page page page page

Not crawled

– too deep

page page

Not crawled

– password

protected

domain

page page Not

crawled –

robots.txt

page page

Not crawled – script


31

Elements not crawled _ Netarkivet

32

Elements not crawled _ Netarkivet

33

Elements not crawled _ Internet Archive

netlab.dk

Workshop KU

02.11.2016

Crawling, Advantages

34

• The entire page in full length

• Hyperlinks, link source as well as target

• Look and feel of live web (with limitations)

• Automatic (partly, evaluation and trouble shooting)

• Machine readable, enables search, sorting, analysis

• Access to metadata (crawl logs)

• Robust format (html)

• Big data-analysis (content analysis, network analysis, etc.)


netlab.dk

Workshop KU

02.11.2016

Crawling, Disadvantages

35

• Some objects not archived, e.g. videos and streamed

content, and applications based on Flash, JavaScript etc.

• Temporal inconsistencies

• Difficult to delimit in terms of spatial extent

• Risk of web crawler being caught in ’bot traps’ (some

monitoring is necessary)


netlab.dk

Workshop KU

02.11.2016

Characteristics of the Archived Web

36

What is archived is not a 1:1 copy of the material one attempted to archive

It is versions/reconstructions:

• Created in the process of archiving

• On the basis of a number of choices made by the archiver

(harvesting strategy, settings, etc.)

• The choices made have consequences for what is

archived

• The archived objects are re-assembled in the archive

’replay’


netlab.dk

Workshop KU

02.11.2016


37

The archived version is deficient because of:

• Technical challenges

• Web’s specific characteristics: dynamic, unpredictable

• Potential asynchronicity between updating and archiving

→ archiving takes time

→ certain elements cannot be archived

It is an added challenge that we do not know what is missing:

• Not much documentation

• No baseline to compare with


netlab.dk

Workshop KU

02.11.2016


38

As scholars using archived web as an object of study, it is important that we are aware of the pitfalls and sources of error inherent in the material.


”During the Olympics in Sydney in 2000, I wanted to save the

website of the Danish newspaper JyllandsPosten. I began at the

first level, the front page, on which I could read that the Danish

badminton player Camilla Martin would play in the finals a half hour

later.

My computer took about an hour to save this first level, after which

time I wanted to download the second level, ’Olympics 2000’. But on

the front page of this section, I could already read the result of the

badminton finals (she lost).

The website was — as a whole — not the same as when I had

started; it had changed in the time it took to archive it, and I could

now read the result on the front page, where the match was

previously only announced.”

N. Brügger: Archiving Websites, 2005, pp. 22-23

39

netlab.dk

Workshop KU

02.11.2016


40

Exactly what is preserved?

Content, user experience or something else entirely?

”…an archive also must be sure that the document is translated in an

authentic manner. In this case, authenticity means that the document

must both include the context and evoke the experience of the original.”

(Lyman, 2002 s. 41)


netlab.dk

Workshop KU

02.11.2016


41

It is versions/reconstructions:

• The archived objects are re-assembled in the archive

’replay’


42

IN CONTRAST TO DIGITIZED COLLECTIONS: TO A LARGE EXTENT ARCHIVED WEB IS ALREADY MARKED UP — HTML, FILE NAMES...

html + files

Online web archiving

Link list Named entities

Workshop on Web Archiving€¦ · Ditigal Impact on the Humanities 8 (based on Alvin Toffler:...

Documents

Transcript of Workshop on Web Archiving€¦ · Ditigal Impact on the Humanities 8 (based on Alvin Toffler:...