Learning from Linked Open Data Usage

Copyright 2010 Digital Enterprise Research Institute. All rights reserved. WebScience 2010, Raleigh, NC, USA 26/04/2010 Learning from Linked Open Data Usage: Patterns & Metrics Knud Möller, Michael Hausenblas, Richard Cyganiak, Gunnar Grimnes, Siegfried Handschuh Copyright 2010 Knud Möller Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-sa/3.0/ Monday 26 April 2010

description

"Although the cloud of Linked Open Data has been growing continuously for several years, little is known about the particular features of linked data usage. Motivating why it is important to understand the usage of Linked Data, we describe typical linked data usage scenarios and contrast the so derived requirement with conventional server access analysis. Then, we report on usage patterns found through an in-depth analysis of access logs of four popular LOD datasets. Eventually, based on the usage patterns we found in the analysis, we propose metrics for assessing Linked Data usage from the human and the machine perspective, taking into account different agent types and resource representations." Slides for a presentation at WebScience 2010. The paper is available for download at http://journal.webscience.org/302/.

Transcript of Learning from Linked Open Data Usage

Page 1: Learning from Linked Open Data Usage

Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

WebScience 2010, Raleigh, NC, USA, 26/04/2010

Learning from Linked Open Data Usage: Patterns & Metrics

Knud Möller, Michael Hausenblas, Richard Cyganiak, Gunnar Grimnes, Siegfried Handschuh

Copyright 2010 Knud Möller. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-sa/3.0/

Monday 26 April 2010

Page 2: Learning from Linked Open Data Usage

What is Linked (Open) Data? (in <1 minute)

• Conventional "Eye-ball" Web: interlinked documents; mainly people / Web browsers
• Web of Linked Data: interlinked items of data (URIs, RDF); mainly machine agents

Page 3: Learning from Linked Open Data Usage

What is Linked (Open) Data? (in <1 minute)

Linked Open Data cloud (the set of interlinked Semantic Web datasets)

[Figures: LOD cloud diagrams, February 2008 and July 2009]

http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Page 4: Learning from Linked Open Data Usage

Question: How is Linked Data being Used?

• plenty of research on conventional Web usage
• what about usage of linked data?

Why?
• how healthy is the Web of linked data?
• who is using the data and how? Is it useful? Are there trends?
• providers: improve hosting
• ... just curiosity!

Page 5: Learning from Linked Open Data Usage

Question: How is Linked Data being Used?


webometrics?


Page 6: Learning from Linked Open Data Usage

Approach

• particular sites:
  – a URI for each data item ➙ a request for each data item (resource)
  – content negotiation best practices
  – redirection (HTTP 303)

Page 7: Learning from Linked Open Data Usage

Approach

• plain resource URI: http://data.semanticweb.org/conference/www/2009
• RDF document URI: http://data.semanticweb.org/conference/www/2009/rdf
• HTML document URI: http://data.semanticweb.org/conference/www/2009/html
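The redirection from a plain resource URI to its RDF or HTML document URI can be sketched as a small function. This is only an illustration, not the actual server configuration of data.semanticweb.org: the function name `redirect_for`, the set of RDF media types, and the decision to ignore q-values are all assumptions made for brevity.

```python
from typing import Tuple

# Media types treated as "RDF" in this sketch (an assumption, not taken from the slides).
RDF_TYPES = {"application/rdf+xml", "text/turtle", "text/n3"}


def redirect_for(resource_uri: str, accept_header: str) -> Tuple[int, str]:
    """Return (status, Location) for a request to a plain resource URI."""
    # Keep only the media types; q-values are ignored in this simplified sketch.
    accepted = {part.split(";")[0].strip().lower()
                for part in accept_header.split(",")}
    # RDF-capable clients are sent to the RDF document, everyone else to HTML.
    suffix = "rdf" if accepted & RDF_TYPES else "html"
    return 303, resource_uri.rstrip("/") + "/" + suffix


if __name__ == "__main__":
    uri = "http://data.semanticweb.org/conference/www/2009"
    print(redirect_for(uri, "application/rdf+xml"))
    print(redirect_for(uri, "text/html,application/xhtml+xml"))
```

A production setup would normally honour q-values and more RDF syntaxes; the point here is only that one plain URI yields two different document URIs, each of which appears as a separate request in the server log.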

Page 8: Learning from Linked Open Data Usage

Approach (ctd.)

• server log files
  – common log format (CLF), combined log format

80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 "-" "ARC Reader (http://arc.semsol.org/)"

Fields, in order: Request IP, Request Date, Request String, Response Code, Response Size, Referrer, User Agent

• RDF requests vs. "semantic" requests
  – 90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; [email protected])"
  – 90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 45358 "-" "rdflib-2.4.0 (http://rdflib.net/; [email protected])"
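A minimal sketch of how a combined-log-format line could be split into the seven fields named on this slide. The regular expression and the names `COMBINED` and `parse_line` are illustrative assumptions; they are not the authors' tooling, and real logs may need a more forgiving pattern.

```python
import re
from typing import Dict, Optional

# One named group per field of the combined log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of one log line, or None if the line does not match."""
    match = COMBINED.match(line.strip())
    return match.groupdict() if match else None


# The example line from the slide.
EXAMPLE = (
    '80.219.211.147 - - [23/May/2009:09:52:03 +0100] '
    '"GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" '
    '200 64674 "-" "ARC Reader (http://arc.semsol.org/)"'
)

if __name__ == "__main__":
    for field, value in parse_line(EXAMPLE).items():
        print(f"{field:9} {value}")
```

All values are returned as strings; converting the status code, size and date is left to the caller.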

Page 9: Learning from Linked Open Data Usage

Source Data

80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 "-" "ARC Reader (http://arc.semsol.org/)"

Fields, in order: Request IP, Request Date, Request String, Response Code, Response Size, Referrer, User Agent

Figure 1: The combined log format

Dataset      # triples    # days  total # hits          # plain hits          # RDF hits          # HTML hits           SPARQL
Dog Food     79,175       597     8,427,967 (14,117)    1,923,945 (3,223)     259,031 (434)       1,647,205 (2,759)     879,932 (1,471)
DBpedia      109,750,000  118     87,203,310 (739,011)  22,821,475 (193,402)  7,008,310 (59,392)  22,999,237 (194,909)  20,972,630 (177,734)
DBTune       74,209,000   61      7,467,125 (122,412)   1,952,185 (32,003)    1,135,509 (18,615)  677,904 (11,113)      3,055,493 (50,090)
RKBExplorer  91,501,684   29      529,938 (18,274)      n/a                   n/a                 n/a                   9,327 (322)

Table 1: Overview of four LOD datasets
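The excerpt does not label the numbers in parentheses in Table 1, but dividing each total by the number of days reproduces them, so they appear to be per-day averages. The short check below uses only the "# days" and "total # hits" columns; the names `TABLE_1` and `hits_per_day` are made up for this sketch.

```python
# dataset: (days, total hits, value printed in parentheses in Table 1)
TABLE_1 = {
    "Dog Food": (597, 8_427_967, 14_117),
    "DBpedia": (118, 87_203_310, 739_011),
    "DBTune": (61, 7_467_125, 122_412),
    "RKBExplorer": (29, 529_938, 18_274),
}


def hits_per_day(days: int, total_hits: int) -> int:
    """Average number of hits per day, rounded to the nearest integer."""
    return round(total_hits / days)


if __name__ == "__main__":
    for name, (days, total, printed) in TABLE_1.items():
        print(name, hits_per_day(days, total), printed)
```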

[...] queries are served. For our evaluation, we had access to log files in two periods: 24/05/2009–21/06/2009 and 27/09/2009–29/10/2009, i.e., roughly two months.

3.2.4 RKBExplorer

RKBExplorer (footnote 6: http://www.rkbexplorer.com) [11] is another meta-dataset currently comprising 44 sub-datasets covering various topics and sources within the domain of academic research, as well as a Web application that allows users to access and browse its content in an integrated fashion. Both RDF and HTML documents about the resources in all datasets are available. Apart from serving linked data, the site also features a module that provides co-reference resolution functionality [10]. For our evaluation, we had access to log files in the period from 24/05/2009–21/06/2009, i.e., roughly one month. However, since the log files were partially broken (no referrer IPs were recorded), and because their structure was slightly modified in comparison to the conventional log file format, we were only able to make use of the dataset in some of our experiments.

3.3 A New Breed of Agents

Since we expect usage of linked data to be different from conventional Web usage, we can also expect to find new kinds of agents. In this section we define what we consider to be "semantically aware" agents, which are explicitly targeted at the Web of linked data.

3.3.1 Detecting Semanticity

By classifying an agent as "semantic", we imply that it is capable of processing structured, semantic data, i.e., RDF. Whether or not an agent has this capability can only be determined indirectly from the log files, based on some heuristics. Making the assumption that any agent which explicitly requests semantic data from a server also knows how to process it, we will classify such agents as "semantic". In detail, we use the following two heuristics:

• SPARQL requests: if an agent sends a request containing a SPARQL query, we assume that it is capable of handling the query result, i.e., either a set of bindings (in the case of a SELECT query), potentially containing URIs of RDF resources, or an RDF graph (in the case of a CONSTRUCT or DESCRIBE query).

• RDF requests: if an agent directly requests RDF from a server, we assume that it knows how to process data in this format. Directly here means that the agent specified an RDF syntax such as rdf/xml as an acceptable response in the header of its request. Merely requesting the URI of an RDF representation does not suffice to indicate semanticity, as this could simply mean that the agent followed a link to this representation.
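The two heuristics above can be sketched as simple predicates over a logged request. This is a hedged illustration rather than the authors' code: the function names, the list of RDF media types, and the assumption that the SPARQL query arrives in a `query` URL parameter (as in the example log line `GET /sparql?query=PREFIX [..]`) are all choices made for this example. The Accept header is only available when the server has been configured to log it.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Media types counted as an explicit request for RDF (an assumption of this sketch).
RDF_MEDIA_TYPES = ("application/rdf+xml", "text/turtle", "text/n3")


def is_sparql_request(request_line: str) -> bool:
    """Heuristic 1: the requested URI carries a SPARQL query parameter."""
    parts = request_line.split()
    if len(parts) < 2:
        return False
    return "query" in parse_qs(urlsplit(parts[1]).query)


def accepts_rdf(accept_header: Optional[str]) -> bool:
    """Heuristic 2: the Accept header names an RDF syntax."""
    if not accept_header:
        return False
    types = {p.split(";")[0].strip().lower() for p in accept_header.split(",")}
    return any(t in types for t in RDF_MEDIA_TYPES)


def is_semantic_request(request_line: str,
                        accept_header: Optional[str] = None) -> bool:
    """A request is "semantic" if either heuristic applies."""
    return is_sparql_request(request_line) or accepts_rdf(accept_header)
```

Note that a request for a `/rdf` document URI without an RDF Accept header is deliberately not counted, matching the remark that merely following a link to an RDF representation does not indicate semanticity.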

• plain resource URI: http://data.semanticweb.org/conference/www/2009
• RDF document URI: http://data.semanticweb.org/conference/www/2009/rdf
• HTML document URI: http://data.semanticweb.org/conference/www/2009/html

Figure 2: Plain resource, RDF and HTML representations

Detecting SPARQL requests is straightforward, since the requested URI will contain the actual SPARQL query. However, log files of Web servers do not normally record the header for each request (footnote 7), which makes it less straightforward to apply the second heuristic. Nevertheless, there is an indirect way to apply it in some cases, based on the [...]

Footnote 7: Web servers can be configured to also log information such as request headers. In fact, this has been done by the administrators of RKBExplorer, which makes it easy to detect semantic agents in this site's log files.
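The excerpt breaks off before the indirect method is described. One plausible reading, suggested by the two rdflib log lines on the "Approach (ctd.)" slide, is that a 303 response to a plain resource URI followed by a request from the same client for the corresponding RDF document reveals that RDF was negotiated. The sketch below implements only that reading and should be treated as an assumption; the record layout and the name `negotiated_rdf_requests` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (ip, agent, path, status), in the order the requests appear in the log
Record = Tuple[str, str, str, int]


def negotiated_rdf_requests(records: Iterable[Record]) -> List[Record]:
    """Return RDF document requests that directly follow a 303 on the plain URI."""
    pending: Dict[Tuple[str, str], str] = {}  # client -> RDF path it may fetch next
    found: List[Record] = []
    for ip, agent, path, status in records:
        client = (ip, agent)
        if status == 303:
            # Remember which RDF document a redirect could have pointed to.
            pending[client] = path.rstrip("/") + "/rdf"
        elif status == 200 and pending.get(client) == path:
            found.append((ip, agent, path, status))
            del pending[client]
        else:
            # Any other request from this client breaks the pattern.
            pending.pop(client, None)
    return found
```

A client that is redirected to the HTML document instead requests a different path next, so it is not counted.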

[Charts: shares of plain, HTML, RDF and semantic requests per dataset]

Dataset    Plain    HTML     RDF      Semantic
DBTune     45%      39.9%    14.9%    4.2%
DBpedia    47.7%    46.5%    5.8%     2.8%
Dog Food   41.0%    51.1%    7.8%     2.5%

Page 10: Learning from Linked Open Data Usage

Agents: Ordinary Traffic

[Figure: hits per agent for http://data.semanticweb.org (SW Dog Food), 21/07/2008 - 20/06/2009; x-axis: agents, y-axis: hits]

Labelled agents (hits):
• GoogleBot (497833)
• Yahoo! Slurp (159238 & 133766)
• msnbot (118928)
• SindiceFetcher (19211)
• multicrawler (12325)
• rdfbot/1.0 (7342)
• ARC Reader (6808)

ordinary traffic: the usual suspects

Page 11: Learning from Linked Open Data Usage

Agents: How “Semantic” are they?

[Figure: semantic hits / total hits per agent, for agents with >100 semantic hits; y-axis from 0 to 1]

Agents on the x-axis, in the order shown: attributor/1.13.2, triplr, sindicebot, rdflib-2.4.2, Ripple, OL_Virtuoso_RDF_crawler, Morph_Converter_Service, Falconsbot, SpeedySlug_SW_Crawler, yacybot, hclsreport-crawler, MJ12bot, PycURL, heritrix/1.14.3, SindiceFetcher, heritrix/pom.version, heritrix/2.0.2, multicrawler, SindiceBot, ia_archiver, Zitgist-APlusPlus-Agent, rdflib-2.4.1, Mp3Bot, curl, Zend_Http_Client, Speedy_Spider, nxcrawler, marbles-Java, rdflib-2.4.0, (unknown), ARC_Reader, MLBot, Mozilla, Jakarta_HttpClient, Wget, libwww-perl, MSIE, Firefox, Python-urllib, sindice_ontology_fetcher, Googlebot

semantic traffic: new kinds of agents
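The ratio plotted on this slide (semantic hits divided by total hits, restricted to agents with more than 100 semantic hits) can be computed along the following lines. The input format, a sequence of `(agent, is_semantic)` pairs, and the function name `semanticity` are assumptions of this sketch; how a hit is judged "semantic" is left to the caller.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def semanticity(hits: Iterable[Tuple[str, bool]],
                min_semantic: int = 100) -> Dict[str, float]:
    """Map each agent with more than min_semantic semantic hits to its ratio."""
    total: Counter = Counter()
    semantic: Counter = Counter()
    for agent, is_semantic in hits:
        total[agent] += 1
        if is_semantic:
            semantic[agent] += 1
    # Agents below the threshold are dropped, as in the chart's ">100" filter.
    return {agent: semantic[agent] / total[agent]
            for agent in total if semantic[agent] > min_semantic}
```

A value of 1 means every logged request from that agent was semantic; general-purpose crawlers with many plain or HTML requests end up near 0.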

Page 12: Learning from Linked Open Data Usage

Is Demand for LOD increasing?

[Figure: Dog Food Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; x-axis: 2008-07-01 to 2010-05-01; y-axis: 0 to 6000]

no increase for semantic requests
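The slides do not say which smoothing method lies behind "smoothing factor 0.05". One common reading is simple exponential smoothing with a factor (alpha) of 0.05, sketched below purely as an assumption; the function name `exp_smooth` is hypothetical.

```python
from typing import List, Sequence


def exp_smooth(values: Sequence[float], alpha: float = 0.05) -> List[float]:
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed: List[float] = []
    for value in values:
        # The first smoothed value is the first observation itself.
        previous = smoothed[-1] if smoothed else value
        smoothed.append(alpha * value + (1 - alpha) * previous)
    return smoothed
```

With a factor this small, each day's count moves the curve only slightly, so short spikes are damped and longer-term trends become easier to see.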

Page 13: Learning from Linked Open Data Usage

Is Demand for LOD increasing? (ctd.)

[Figure: DBpedia Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 300000]

no increase for semantic requests

Page 14: Learning from Linked Open Data Usage

Do Real-world Events have an Impact on LOD Usage?

[Figure: Demand for Events (smoothing factor 0.05); series: iswc2008, www2009, eswc2009, iswc2009; x-axis: 2008-07-01 to 2010-05-01; y-axis: 0 to 700]

possible impact
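A per-event demand curve like the one on this slide starts from daily hit counts for the URIs of each event. The sketch below counts requests per day and event by path prefix. Only the `www2009` prefix is taken from the example URIs in this deck; the other three prefixes, the input format, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

EVENT_PREFIXES = {
    "www2009": "/conference/www/2009",    # from the example URIs in this deck
    "iswc2008": "/conference/iswc/2008",  # assumed by analogy
    "eswc2009": "/conference/eswc/2009",  # assumed by analogy
    "iswc2009": "/conference/iswc/2009",  # assumed by analogy
}


def daily_event_hits(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, int]]:
    """records: (log date such as '23/May/2009:09:52:03 +0100', request path)."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for log_date, path in records:
        day = log_date.split(":", 1)[0]  # keep only '23/May/2009'
        for event, prefix in EVENT_PREFIXES.items():
            # Match the event resource itself and anything below it.
            if path == prefix or path.startswith(prefix + "/"):
                counts[event][day] += 1
    return {event: dict(days) for event, days in counts.items()}
```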

Page 15: Learning from Linked Open Data Usage

Do Real-world Events have an Impact on LOD Usage?

[Figure: Irish Lisbon Treaty Referendum (smoothing factor 0.05); x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 9]

Series:
• http://dbpedia.org/resource/Republic_of_Ireland
• http://dbpedia.org/resource/European_Union
• http://dbpedia.org/resource/Treaty_of_Lisbon

possible impact

Page 16: Learning from Linked Open Data Usage

Do Real-world Events have an Impact on LOD Usage?

[Figure: Michael Jackson Memorial Service (smoothing factor 0.05); x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 4.5]

Series:
• http://dbpedia.org/resource/Staples_Center
• http://dbpedia.org/resource/Michael_Jackson_memorial_service
• http://dbpedia.org/resource/Michael_Jackson

possible impact

Page 17: Learning from Linked Open Data Usage

Conclusion (of sorts)

• Generic approach for analysing usage of LOD sites (but see below), based on server log files
• Metric for semanticity of agents
• Did not notice a rising demand in LOD
• However: real-world events do seem to have an effect on LOD usage
• Restrictions:
  – does not work well with embedded metadata (e.g., RDFa-based sites)
  – does not take into account usage through meta sites (indexes, search engines, ...)