Characterizing Machine Agent Behavior through SPARQL Query Mining

Post on 20-Jan-2015


Mining SPARQL queries to understand the behavior of automated programs (or machine agents) is an important step in designing systems for the semantic web. We present techniques that differ from state-of-the-art SPARQL mining techniques in two ways: 1. Move away from a one-SPARQL-query-at-a-time view to a SPARQL user session view. 2. Look at the results of SPARQL queries in addition to the query itself. Due to these two approaches, we are able to find two new patterns in SPARQL queries that help us reason better about the underlying program that generated the SPARQL queries. Through a variety of experiments, we show that the patterns found have significant support in all four datasets provided by the USEWOD committee.

Transcript of Characterizing Machine Agent Behavior through SPARQL Query Mining

Characterizing Machine Agent Behavior through SPARQL Query Mining

Aravindan Raghuveer
Yahoo! Inc, Bangalore.

aravindr@yahoo-inc.com

Yahoo! Confidential

Introduction: LOD Users

The LOD cloud has two types of users:
- Humans (browsers).
- Programs / machine agents.


Introduction: LOD Access Methods


The data on the LOD cloud can be accessed in multiple ways.

For this work, we categorize them into two buckets:
- SPARQL: A powerful declarative graph query language.
- Non-SPARQL: Direct linked data requests.


Motivation: User Behavior Understanding

A deep understanding of client behavior can help build “better” serving systems.

Better:
- Secure
- Scalable
- Available

Prior Work:
- Moller et al., WebSci 2010
- Picalausa et al., SWIM 2011
- Kirchberg et al., USEWOD 2011
- Mario et al., USEWOD 2011


Summarizing. . .


              Human Users    Machine Agents
Non-SPARQL
SPARQL                       This paper’s focus


What is this paper about?

Mining of the USEWOD query log dataset to identify:

- Two Trends in Machine Agent Querying

- Two Patterns in Machine Agent Querying


The USEWOD dataset

Query logs of servers hosting a part of LOD cloud data.


Dataset   Type                   # records (million)   % SPARQL
bio2rdf   Life sciences          ~ 0.2                 100%
lgd       Geo                    ~ 1.9                 100%
SWDF      Conference             ~ 16.7                43.38%
dbpedia   Structured Wikipedia   ~ 36.2                46.9%


Part-1: Two Trends in Machine Agent Querying

The Theme

“What are the overarching trends for SPARQL queries?”


Trend-1: SPARQL is here to stay!


[Figure: SPARQL query volume, panels: SWDF and DBpedia; annotated range: 0.1 – 1 million]

Take-away: SPARQL query volume is significant.


Trend-2: SPARQL is heavily used by machine agents.


Took 17 million user agents from SPARQL queries from dbpedia and ...


Part-2: Two Patterns in Machine Agent Querying

The Theme

“Looking at SPARQL query logs, can we reason about the program that generated the queries?”


Salient aspects of proposed Query Mining Techniques

Move from per query analysis to query session analysis

Move from query analysis to query result analysis


Pattern -1 : Loops in Programs

Take-away

• Through per-user, temporal mining of logs, we discover patterns that are caused by loops in the program.

• Significant support in all 4 datasets


Per-user Temporal mining

[Figure: “Original Logs” show queries from User-1, User-2, User-3 and User-4 along a TIME axis; “User level Session Analysis” views them per user and highlights a “Loop”.]
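A minimal Python sketch of the per-user view, assuming each log record has already been parsed into a (user, timestamp, query) tuple. The record layout and the name group_by_user are assumptions made for illustration; they are not the USEWOD log format.

```python
from collections import defaultdict


def group_by_user(records):
    """Turn an interleaved log into one time-ordered query list per user.

    `records` is an iterable of (user, timestamp, query) tuples. This layout
    is an assumption for the example, not the actual log format.
    """
    per_user = defaultdict(list)
    for user, timestamp, query in records:
        per_user[user].append((timestamp, query))
    # Sort each user's entries by time so successive queries are adjacent.
    return {
        user: [query for _, query in sorted(entries, key=lambda e: e[0])]
        for user, entries in per_user.items()
    }


# Made-up log lines: two users whose queries are interleaved in time.
log = [
    ("user-1", 1, "Q-a"),
    ("user-2", 2, "Q-x"),
    ("user-1", 3, "Q-b"),
    ("user-2", 4, "Q-y"),
]
sessions = group_by_user(log)
print(sessions["user-1"])  # ['Q-a', 'Q-b']
```

Once the log is split this way, repetition inside one user's sequence can be examined without interference from other users' queries.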


Intra Pattern Loop

Successive queries from the same user use the same “template”.

Example: Two successive queries:


SELECT * WHERE { <http://bio2rdf.org/dr:D00332> <http://bio2rdf.org/ns/bio2rdf#xRef> <http://bio2rdf.org/cas:54-47-7> }

SELECT * WHERE { <http://bio2rdf.org/dr:D00333> <http://bio2rdf.org/ns/bio2rdf#xRef> <http://bio2rdf.org/cas:54-47-7> }

Only the subject (D00332,D00333) varies


Detecting Intra Pattern Loop

We convert a query to its canonical form by replacing variables, URIs and literals with “keywords”.


SELECT * WHERE { <http://bio2rdf.org/dr:D00332> <http://bio2rdf.org/ns/bio2rdf#xRef> <http://bio2rdf.org/cas:54-47-7> }

Canonical Form of the previous queries: SELECT * WHERE { _URI_ _URI_ _URI_ }

Queries generated by the same template will have the same canonical form.
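The canonicalization step could be sketched with a few regular expressions, as below. This is a rough approximation rather than a SPARQL parser: it only recognizes IRIs written in angle brackets, quoted strings, plain numbers and ?var / $var variables. The keyword _URI_ comes from the slide; _VAR_ and _LITERAL_ are names assumed for this example.

```python
import re

# Order matters: strings first (they may contain '<' or '?'), then IRIs,
# then variables, then bare numbers such as LIMIT / OFFSET values.
_RULES = [
    (re.compile(r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\''), "_LITERAL_"),
    (re.compile(r"<[^<>\s]*>"), "_URI_"),
    (re.compile(r"[?$][A-Za-z_]\w*"), "_VAR_"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "_LITERAL_"),
]


def canonical_form(query):
    """Replace literals, IRIs and variables by keywords and normalize spaces."""
    for pattern, keyword in _RULES:
        query = pattern.sub(keyword, query)
    return " ".join(query.split())


q1 = ("SELECT * WHERE { <http://bio2rdf.org/dr:D00332> "
      "<http://bio2rdf.org/ns/bio2rdf#xRef> <http://bio2rdf.org/cas:54-47-7> }")
q2 = ("SELECT * WHERE { <http://bio2rdf.org/dr:D00333> "
      "<http://bio2rdf.org/ns/bio2rdf#xRef> <http://bio2rdf.org/cas:54-47-7> }")

print(canonical_form(q1))  # SELECT * WHERE { _URI_ _URI_ _URI_ }
print(canonical_form(q1) == canonical_form(q2))  # True
```

Prefixed names (for example dr:D00332 without angle brackets) are left untouched by this sketch; handling them would need the query's PREFIX declarations.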


Salient Aspects of Intra Pattern loops

Iterate over a dictionary of values (categorical)

Iterate over a numerical range (example LIMIT, OFFSET parameters in SPARQL queries)

Multiple levels of nested loops with the same intra loop pattern.

4 parameters to quantify the above (in paper)
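Given a time-ordered list of canonical forms for one user, an intra pattern loop shows up as a run of identical consecutive entries. The sketch below collects such runs; the min_length threshold is an assumption for illustration and is not one of the four parameters defined in the paper.

```python
from itertools import groupby


def intra_pattern_loops(canonical_forms, min_length=2):
    """Return (template, start_index, run_length) for every maximal run of
    identical consecutive canonical forms with at least `min_length` queries."""
    loops = []
    index = 0
    for template, group in groupby(canonical_forms):
        run_length = sum(1 for _ in group)
        if run_length >= min_length:
            loops.append((template, index, run_length))
        index += run_length
    return loops


# "A", "B", "C" stand in for canonical forms of one user's queries.
session = ["A", "A", "A", "B", "A", "C", "C"]
print(intra_pattern_loops(session))  # [('A', 0, 3), ('C', 5, 2)]
```

Nested loops that reuse one template would appear here as a single long run; telling the nesting levels apart would require looking at which positions of the query vary.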


Inter Pattern Loops

Found loops that iterate over a set of patterns


P1, P2, P3, P1, P2, P3, P1, P2, P3

Typically used when the output of the first query goes as a parameter to the second query.

(examples in paper)
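One simple way to spot a repeating block such as P1, P2, P3 in a user's sequence of canonical forms is to test short periods at each position. The sketch below does this greedily; the max_period and min_repeats parameters are assumptions for illustration, and the paper's own detection method may differ.

```python
def inter_pattern_loops(forms, max_period=5, min_repeats=2):
    """Find blocks of 2..max_period patterns that repeat back to back.

    Returns a list of (block, start_index, repeats). A block must contain at
    least two distinct patterns; a single repeated pattern is an intra loop.
    """
    loops = []
    i = 0
    while i < len(forms):
        found = None
        for period in range(2, max_period + 1):  # prefer the shortest period
            block = forms[i:i + period]
            if len(block) < period or len(set(block)) < 2:
                continue
            repeats = 1
            while forms[i + repeats * period:i + (repeats + 1) * period] == block:
                repeats += 1
            if repeats >= min_repeats:
                found = (tuple(block), i, repeats)
                break
        if found:
            loops.append(found)
            i += len(found[0]) * found[2]  # skip past the whole loop
        else:
            i += 1
    return loops


sequence = ["P1", "P2", "P3"] * 3
print(inter_pattern_loops(sequence))  # [(('P1', 'P2', 'P3'), 0, 3)]
```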


Results


[Figure: loop support in the four datasets (bio2rdf, lgd, swdf, dbpedia); values shown: 86%, 32%, 40%, 16%]

Take-away: Significant support for loops!


Pattern-2: Querying for dbpedia Linkage

Take-away:
• By executing each query and analyzing the results, we find that a portion of queries “look” for dbpedia links.
• Results:
- Over 20 months of SWDF queries, an average of 8% look for dbpedia URLs.
- Over 2 days of lgd queries, 26.5% look for dbpedia URLs.
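A result-side check could look like the sketch below, assuming each query has been re-executed and its answer is available in the SPARQL JSON results layout (a "results" object holding a list of "bindings"). Treating "returns at least one URI on a dbpedia.org host" as "looks for dbpedia" is an assumption of this example; the paper's exact criterion may differ, and the sample data is made up.

```python
from urllib.parse import urlparse


def returns_dbpedia_uri(results):
    """True if any bound URI in the result set is on a dbpedia.org host."""
    for binding in results.get("results", {}).get("bindings", []):
        for term in binding.values():
            if term.get("type") != "uri":
                continue
            host = urlparse(term["value"]).hostname or ""
            if host == "dbpedia.org" or host.endswith(".dbpedia.org"):
                return True
    return False


# Made-up result sets for two hypothetical queries.
linked = {"head": {"vars": ["same"]}, "results": {"bindings": [
    {"same": {"type": "uri", "value": "http://dbpedia.org/resource/Berlin"}},
]}}
unlinked = {"head": {"vars": ["x"]}, "results": {"bindings": [
    {"x": {"type": "uri", "value": "http://example.org/thing"}},
    {"x": {"type": "literal", "value": "http://dbpedia.org/resource/Berlin"}},
]}}

all_results = [linked, unlinked]
share = sum(returns_dbpedia_uri(r) for r in all_results) / len(all_results)
print(share)  # 0.5
```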


Summary & Conclusions

Proposed 2 new ways of SPARQL query mining:
- Session view
- Analyze results in addition to query

Showed that machine agents look for dbpedia using the owl:sameAs annotation.

Influence on system design:
- Can we pre-fetch elements in a loop beforehand?
- Prioritize dbpedia attributes for caching

Influence on log collection & analysis:
- Stratified random sampling to remove the effect of loops.
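One possible reading of the sampling idea, sketched in Python: treat each (user, canonical form) pair as a stratum and draw a fixed number of queries from it, so that a loop issuing thousands of near-identical queries does not dominate the sample. The choice of strata, the per_stratum parameter and the entry layout are all assumptions of this example, not details given on the slide.

```python
import random
from collections import defaultdict


def stratified_sample(entries, per_stratum=1, seed=0):
    """Sample up to `per_stratum` queries from each (user, template) stratum.

    `entries` is an iterable of (user, canonical_form, query) tuples.
    Returns a list of ((user, canonical_form), query) pairs.
    """
    strata = defaultdict(list)
    for user, form, query in entries:
        strata[(user, form)].append(query)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = []
    for key in sorted(strata):
        queries = strata[key]
        k = min(per_stratum, len(queries))
        sample.extend((key, query) for query in rng.sample(queries, k))
    return sample


# Made-up log: one user loops 100 times over template T1, then issues T2 once.
entries = [("user-1", "T1", f"query-{i}") for i in range(100)]
entries.append(("user-1", "T2", "other-query"))
sample = stratified_sample(entries)
print(len(sample))  # 2
```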



For the great data!
For the great feedback & comments!
For listening!


The famous LOD Cloud . . .

7 billion triples and counting!!