Workshop on Web Archiving

Workshop KU 02.11.2016 netlab.dk Workshop on Web Archiving MODULE 1: WEB ARCHIVING Niels Brügger Asger Harlung




Program 02.11.2016, KU


12.00-13.00 Workshop part 1: Web archiving

13.00-13.15 Coffee

13.15-14.00 Workshop part 2: Existing collections

14.00-14.10 Short break

14.10-15.00 Workshop part 3: Doing your own archiving (incl. hands-on)

15.00-15.05 Wrap-up; feedback

15.05-15.30 Open discussion: Thoughts and questions


Module 1: Web Archiving


• Introducing ourselves and NetLab

• Why archive the web

• Two examples: dr.dk’s history and Probing a Nation’s Web Domain

• Three kinds of digital content

• WWW as technology

• What is web archiving?

• Methods of web archiving

• Challenges for the web crawler

• Crawling — advantages/disadvantages

• Characteristics of the archived web


Introducing Ourselves and NetLab


Niels Brügger – Professor (MSO, with special responsibilities) in Internet Studies and Digital Humanities, Head of NetLab, and of the Centre for Internet Studies, specialising in internet research since 1997.

Asger Harlung – MA in ICT and Learning; has previously worked on research in digital rhetoric and on supporting creativity development in learning processes.

… and the rest of NetLab is:

Janne Nielsen – Assistant Professor, PhD, Media Studies, AU; author of the book: Using Web Archives in Research – an Introduction (2016)

Ulrich Have – IT Architect, MA in Information Studies. At NetLab, he develops tools for research using archived web material and does data analytics.


Introducing Ourselves and NetLab

A research infrastructure for internet research.

Part of the Danish research infrastructure Digital Humanities Lab (DIGHUMLAB).

Established in 2012.

Research-driven development of research infrastructures.

[Diagram: Digital Humanities Lab (DIGHUMLAB) and NetLab's place in it. Labels:]

• Language Tools (KU)

• Media Tools (AU)

• Interaction & Design (AAU & SDU)

• Audio and visual materials

• NetLab

• Online

• Archived

• Netarkivet (the Danish national web archive)

• NetLab Forum

• IT architect

• Collecting data for specific projects

• 2017-18, micro stipends

Digital Impact on the Humanities

We are living in an ever more digitised world …

We know that … why are they stating the obvious?

Digital Impact on the Humanities

(based on Alvin Toffler: The Third Wave (1980))

• Neolithic revolution, ca. 12,500 years ago: Agriculture is the first technological revolution and replaces hunter-gatherer culture. Hundreds or thousands of years between technological advances.

• Industrial revolution (I and II), ca. 1760 to the 1950s: Decades between technological advances, but faster advances from the late 19th century (the ”second industrial revolution”).

• Post-industrial age, information age, digital age … 1950s and onwards: Years, months, sometimes weeks between technological leaps. State of the art today may be obsolete before the end of the year (a state of perpetual ”future shock”).

Digital Impact on the Humanities

• 2000: 75% of the world’s data was stored in analog form (paper, film, photographic prints, vinyl, magnetic cassette tapes, etc.)

• 2007: 7% analog, 93% digital

• 2012: Only 2% of all stored data was stored in analog form.

V. Mayer-Schönberger & K. Cukier (2013): Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt Publishing Company, New York, pp. 8-9


Why Archive the Web?


• To preserve the cultural heritage

• To preserve a stable research object

• To be able to document and illustrate a study


dr.dk’s History


dr.dk’s history 1996-2006

• Started in 2007

• Analyses the historical development of dr.dk

• Partly based on archived web content

• www.drdk.dk


Probing a Nation’s Web Domain — from Small Data to Big Data

[Diagram labels: Web element, Web page, Website, Web sphere]


Probing a Nation’s Web Domain — from Small Data to Big Data

The historical development of an entire national web: .dk 2005-2015

The project is a collaboration with Netarkivet.

[Figure labels: 2006, 2009, 2012, 2015]


Probing a Nation’s Web Domain — from Small Data to Big Data

Gross list of ’probes’:

• Size: e.g. bytes

• Space: e.g. geolocalisation

• Structure: e.g. network of hyperlinks

• Liveliness: e.g. domain names and updating

• Content: e.g. degrees of openness, files, software types, language, website textual elements, semantics

Three Kinds of Digital Content

• Digitised: Formerly analog media, transferred to digital form.

• Born digital: Has never existed in any form other than digital.

• Reborn digital: Born-digital content that has been gathered and preserved, and that has to some extent been changed in the process.

WWW as Technology

[Diagram built up over three slides, with the labels: Web element, Web page, Website]

WWW is one of several services that run over the internet. Its building blocks:

http: Hypertext Transfer Protocol

URL: Uniform Resource Locator (a type of Uniform Resource Identifier, URI)

html: HyperText Markup Language

Constructing a URL on WWW:

protocol://subdomain.domain.topdomain/path/page/

http://cc.au.dk/research/researchprograms/
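The URL pattern can be tried out in a few lines of Python. This is a minimal sketch that splits the example URL from the slide into the named parts. The function name `url_components` is made up for illustration, and the split of the host name naively assumes that the last label is the top-level domain and the one before it the domain, which does not hold for addresses such as `example.co.uk`.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the parts named in the slide's pattern."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,
        # Naive assumption: last label = top-level domain, the one before it = domain.
        "subdomain": ".".join(labels[:-2]),
        "domain": labels[-2],
        "topdomain": labels[-1],
        "path": parts.path,
    }


components = url_components("http://cc.au.dk/research/researchprograms/")
for name, value in components.items():
    print(f"{name}: {value}")
```

For the example URL, `cc` is the subdomain, `au` the domain, and `dk` the top-level domain.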

[Diagram: a network of computers. Labels:]

• Computer (user): the browser (Safari, Firefox, ...) translates html into writing, pictures, etc.

• http: connects the user's computer and the web server

• Computer (web server), URL dr.dk: web pages = html files

• Computer (web server) as database, CMS (Content Management System), URL dr.dk: web pages = patched together in an ‘empty’ shell (stylesheet) from material in databases (images, heading, words)

• Computer (web server) as database, URL e.g. dmi.dk: weather

• Comp. X, Comp. Y
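The idea of a page being patched together in an ‘empty’ shell can be sketched as follows. This is an illustration only, not how dr.dk or any particular CMS works: the shell, the two dictionaries standing in for databases, and the function `build_page` are all hypothetical.

```python
from string import Template

# The 'empty' shell: markup with placeholders and no content of its own.
SHELL = Template(
    "<html><body>\n"
    "<h1>$heading</h1>\n"
    '<img src="$image">\n'
    "<p>$words</p>\n"
    "<p>Weather: $weather</p>\n"
    "</body></html>"
)

# Hypothetical stand-ins for the databases behind two different web servers.
site_database = {
    "heading": "News",
    "image": "/img/front.jpg",
    "words": "Text of the story.",
}
weather_database = {"weather": "Rain"}


def build_page(shell: Template, *sources: dict) -> str:
    """Fill the shell with material gathered from one or more sources."""
    content = {}
    for source in sources:
        content.update(source)
    return shell.substitute(content)


page = build_page(SHELL, site_database, weather_database)
print(page)
```

The html that reaches the browser exists only once the shell has been filled, which is one reason why the same URL can deliver different content at different times.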


Small Exercise: Source Code


What is Web Archiving?

International Internet Preservation Consortium’s definition:

”… the process of gathering up data that has been published on the World Wide Web, storing it, ensuring the data is preserved in an archive, and making the collected data available for future research.” (http://netpreserve.org/about-us)

”Any form of deliberate and purposive preserving of web material.” (Brügger, 2011:25)

Brügger, Niels (2011): ”Web Archiving – Between Past, Present, and Future”. In: Mia Consalvo and Charles Ess (eds.), The Handbook of Internet Studies. Wiley-Blackwell.


What is Web Archiving?


Macro archiving

• Cultural heritage institutions

• Preserve as much as possible

• Big and varied data

• IT expertise, advanced technology, computer power

Micro archiving

• Individual researcher/research group

• Stabilise a concrete research object, here-and-now

• No experience, no advanced technology or computer power


Methods of Web Archiving


• Web crawling (hyperlink crawling)

• Screen image

• Screen filming

• Harvesting via API

• (Delivery from producers)

Web Crawling

[Diagram over two slides: the pages of domain.com, connected by hyperlinks. A crawler starts at level 0 and follows the links to pages at levels 1, 2, and 3.]
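The level-by-level movement of a crawler can be sketched as a breadth-first traversal of hyperlinks. The sketch below runs on a small made-up link structure held in memory rather than on the live web; the page names, the `LINKS` dictionary, and the `crawl` function are hypothetical, and a real crawler would also fetch and store each page.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "domain.com/": ["domain.com/a", "domain.com/b"],
    "domain.com/a": ["domain.com/a1", "domain.com/"],
    "domain.com/b": ["domain.com/b1"],
    "domain.com/a1": ["domain.com/a2"],
    "domain.com/b1": [],
    "domain.com/a2": ["domain.com/a3"],
    "domain.com/a3": [],
}


def crawl(seed: str, links: dict, max_level: int) -> dict:
    """Breadth-first crawl; returns {page: level} for every page reached."""
    levels = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if levels[page] == max_level:
            continue  # do not follow links beyond the deepest level
        for target in links.get(page, []):
            if target not in levels:  # skip pages that were already reached
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels


harvest = crawl("domain.com/", LINKS, max_level=3)
for page, level in harvest.items():
    print(level, page)
```

With `max_level=3`, the page `domain.com/a3` is never reached because it sits at level 4. Depth limits of this kind are one reason why pages can be missing from an archive.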

[Diagram: crawlers working through a list of URLs (URL, URL, URL, …) and harvesting pages from more than one domain (domain.dk, domain.com)]


Challenges for the crawler

• JavaScript

• Content based on Flash

• Interactive pages

• Streamed content

• Websites with access limitations (password, captcha)

• Cookies, ads, plugins etc.

• robots.txt

• Deep web (e.g. databases, FTP servers, password-protected content, hidden content, pages not linked to, dynamic content based on requests).

[Illustration source: http://da.wikipedia.org/wiki/CAPTCHA]
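Of the challenges listed, robots.txt is the easiest to demonstrate. A robots.txt file tells crawlers which parts of a site they are asked to stay away from. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the rules, the crawler name `example-crawler`, and the URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all crawlers are asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("example-crawler", "http://domain.com/news/")
blocked = parser.can_fetch("example-crawler", "http://domain.com/private/report.html")

print("news page may be crawled:", allowed)
print("private page may be crawled:", blocked)
```

Whether a crawler obeys these rules is a setting chosen by whoever runs it, so the same site can end up archived differently in different archives.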


Pages not being crawled

[Diagram: a domain whose crawled pages are marked ✔. The remaining pages are labelled:]

• Not crawled: too deep

• Not crawled: password protected

• Not crawled: robots.txt

• Not crawled: script

Elements not crawled: Netarkivet

Elements not crawled: Internet Archive


Crawling, Advantages


• The entire page in full length

• Hyperlinks, link source as well as target

• Look and feel of live web (with limitations)

• Automatic (partly; evaluation and troubleshooting are still needed)

• Machine readable, enables search, sorting, analysis

• Access to metadata (crawl logs)

• Robust format (html)

• Big data-analysis (content analysis, network analysis, etc.)


Crawling, Disadvantages

• Some objects are not archived, e.g. videos and streamed content, and applications based on Flash, JavaScript etc.

• Temporal inconsistencies

• Difficult to delimit in terms of spatial extent

• Risk of the web crawler being caught in ’bot traps’ (some monitoring is necessary)


Characteristics of the Archived Web

What is archived is not a 1:1 copy of the material one attempted to archive.

It is a version/reconstruction:

• Created in the process of archiving

• On the basis of a number of choices made by the archiver (harvesting strategy, settings, etc.)

• The choices made have consequences for what is archived

• The archived objects are re-assembled in the archive (’replay’)


Characteristics of the Archived Web


The archived version is deficient because of:

• Technical challenges

• Web’s specific characteristics: dynamic, unpredictable

• Potential asynchronicity between updating and archiving

→ archiving takes time

→ certain elements cannot be archived

It is an added challenge that we do not know what is missing:

• Not much documentation

• No baseline to compare with
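One way to make the asynchronicity visible is to compare the times at which the parts of one archived page were captured. This is a minimal sketch: the capture times are invented for illustration and do not come from any archive, and the function name `capture_spread` is an assumption.

```python
from datetime import datetime

# Hypothetical capture times for the parts that make up one archived page.
captures = {
    "front page (html)": datetime(2016, 1, 1, 10, 0),
    "section page (html)": datetime(2016, 1, 1, 11, 0),
    "image": datetime(2016, 1, 1, 11, 5),
}


def capture_spread(times: dict):
    """Return the time between the first and the last capture."""
    return max(times.values()) - min(times.values())


spread = capture_spread(captures)
spread_minutes = spread.total_seconds() / 60
print(f"Parts of this page were captured {spread_minutes:.0f} minutes apart")
```

The larger the spread, the greater the risk that the replayed page combines states of the website that never existed side by side on the live web.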


Characteristics of the Archived Web

As scholars using the archived web as an object of study, it is important that we are aware of the pitfalls and sources of error inherent in the material.


”During the Olympics in Sydney in 2000, I wanted to save the website of the Danish newspaper JyllandsPosten. I began at the first level, the front page, on which I could read that the Danish badminton player Camilla Martin would play in the finals a half hour later.

My computer took about an hour to save this first level, after which time I wanted to download the second level, ’Olympics 2000’. But on the front page of this section, I could already read the result of the badminton finals (she lost).

The website was – as a whole – not the same as when I had started; it had changed in the time it took to archive it, and I could now read the result on the front page, where the match was previously only announced.”

N. Brügger: Archiving Websites, 2005, pp. 22-23


Characteristics of the Archived Web

Exactly what is preserved?

Content, user experience or something else entirely?

”…an archive also must be sure that the document is translated in an authentic manner. In this case, authenticity means that the document must both include the context and evoke the experience of the original.” (Lyman, 2002, p. 41)



In contrast to digitised collections, the archived web is to a large extent already marked up: html, file names, ...

[Diagram labels: html + files; Online web archiving; Link list; Named entities]
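Because archived web material is already marked up, a link list can be read directly from the html. This is a minimal sketch using only Python's standard library; the html snippet, the base URL, and the class name `LinkLister` are made up for illustration. Named entities are not covered here, since extracting them requires language-analysis tools beyond the markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkLister(HTMLParser):
    """Collect the targets of all <a href="..."> elements in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the address of the page.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical archived page and its original address.
ARCHIVED_HTML = (
    '<p><a href="/about/">About</a> and '
    '<a href="http://domain.com/">a partner</a></p>'
)

lister = LinkLister("http://domain.dk/index.html")
lister.feed(ARCHIVED_HTML)
link_list = lister.links
print(link_list)
```

Lists of this kind, collected for many pages, form the network of hyperlinks that can be studied as the structure of a website or of a whole national web domain.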