Skolemising Blank Nodes while Preserving Isomorphism

Post on 07-Aug-2015

Aidan Hogan – DCC, Universidad de Chile

WHY? BLANK NODES ARE GREAT!

When life gives you blank nodes …

Blank Nodes are glue!

Blank node names aren’t important …

(Isomorphic)

Blank nodes are common in real-world data …

Aidan Hogan, Marcelo Arenas, Alejandro Mallea and Axel Polleres "Everything You Always Wanted to Know About Blank Nodes". Journal of Web Semantics 27: pp. 42–69, 2014

BLANK NODES ENABLE SYNTAX SHORTCUTS
• They represent implicit nodes in the graph
• They help specify order, higher-arity relations, reification, etc., succinctly
• They are common in real-world data

BLANK NODES: WHAT’S THE PROBLEM?

Are two RDF graphs isomorphic?

RDF ISOMORPHISM IS GI-COMPLETE
A general algorithm to see if two RDF graphs are the “same” will (probably) not be tractable
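
To make the cost concrete, here is a minimal brute-force check, not taken from the slides. It assumes a graph is a set of (subject, predicate, object) string tuples and that blank nodes are the terms starting with `_:`; the function names are illustrative. It tries every bijection between the two sets of blank nodes, so it may test up to n! mappings for n blank nodes.

```python
from itertools import permutations


def is_blank(term):
    # Assumption: blank nodes are written "_:label", everything else is ground.
    return term.startswith("_:")


def blank_nodes(graph):
    return sorted({t for triple in graph for t in triple if is_blank(t)})


def relabel(graph, mapping):
    # Rename blank nodes according to mapping; ground terms are left untouched.
    return {tuple(mapping.get(t, t) for t in triple) for triple in graph}


def isomorphic(g, h):
    g, h = set(g), set(h)
    bg, bh = blank_nodes(g), blank_nodes(h)
    if len(g) != len(h) or len(bg) != len(bh):
        return False
    # Try every bijection between the blank nodes of g and those of h.
    return any(relabel(g, dict(zip(bg, perm))) == h
               for perm in permutations(bh))
```

For example, `isomorphic({("_:a", "ex:p", "ex:c")}, {("_:x", "ex:p", "ex:c")})` holds because renaming `_:a` to `_:x` makes the two graphs identical.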

BLANK NODES ADD COMPLEXITY? WHAT TO DO?

RDF 1.1 proposes Skolemisation

But fresh IRIs every time is not ideal
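
A small sketch of why fresh IRIs are awkward, assuming triples are string tuples and blank nodes start with `_:`. The `example.org` authority and the function name are hypothetical; the `/.well-known/genid/` path follows the RDF 1.1 recommendation for Skolem IRIs. Each run mints new random identifiers, so Skolemising the same graph twice gives two graphs that no longer compare as equal.

```python
import uuid

# Hypothetical authority; only the /.well-known/genid/ path comes from RDF 1.1.
GENID = "http://example.org/.well-known/genid/"


def skolemise(graph):
    fresh = {}  # blank node label -> Skolem IRI minted in this run

    def swap(term):
        if not term.startswith("_:"):
            return term
        if term not in fresh:
            fresh[term] = GENID + uuid.uuid4().hex
        return fresh[term]

    return {tuple(swap(t) for t in triple) for triple in graph}
```

Within one call every occurrence of a blank node gets the same IRI, but two calls on the same input disagree, which is the motivation for a consistent labelling.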

Would prefer a “consistent” labelling

Compute isomorphically-unique graph hash

Finding duplicate documents from a crawler

CANONICAL LABELLING USEFUL FOR:
1. Mapping blank nodes to IRIs
2. Computing unique hashes for RDF graphs

OLD BUT RECURRING QUESTION

An old question that won’t go away …

Jeremy J. Carroll. “Signing RDF Graphs.” ISWC 2003.

Edzard Höfig, Ina Schieferdecker. “Hashing of RDF Graphs and a Solution to the Blank Node Problem.” URSW 2014.

NO EXISTING APPROACH IS GENERAL
• Hard cases seem unlikely in practice
• Let’s build a general (and thus worst-case exponential) algorithm that’s efficient for practical cases

NAÏVE CANONICAL LABELLING SCHEME

(Naïve) Canonical labels for blank nodes

But wait … what happens if ... ?

Or another case …

Fixpoint does not distinguish all blank nodes!

NAÏVE: COLOUR BLANK NODES RECURSIVELY UNTIL FIXPOINT
• Efficient
• Incomplete
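
The naïve scheme can be sketched as follows. This is an illustrative reading rather than the author's implementation: triples are assumed to be string tuples, blank nodes start with `_:`, MD5 is just one possible hash, and the helper names are made up. Every blank node starts with the same colour and is repeatedly rehashed together with the colours of its neighbours until no colour class splits any further.

```python
import hashlib


def h(*parts):
    # Combine strings into one hash; MD5 is an arbitrary choice for this sketch.
    return hashlib.md5("\x00".join(parts).encode("utf-8")).hexdigest()


def is_blank(term):
    return term.startswith("_:")


def naive_colouring(graph):
    graph = set(graph)
    colours = {t: h("blank") for s, p, o in graph
               for t in (s, o) if is_blank(t)}

    def c(term):
        # Ground terms have a fixed colour; blank nodes use their current one.
        return colours[term] if is_blank(term) else h("term", term)

    while True:
        new = {}
        for b in colours:
            edges = sorted(
                [h("out", p, c(o)) for s, p, o in graph if s == b] +
                [h("in", p, c(s)) for s, p, o in graph if o == b])
            new[b] = h(colours[b], *edges)
        # Classes can only split, so an unchanged count means a fixpoint.
        if len(set(new.values())) == len(set(colours.values())):
            return new
        colours = new
```

On a directed cycle of three blank nodes joined by the same predicate, every node ends up with the same colour, which illustrates the incompleteness: the fixpoint is reached without distinguishing the nodes.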

CANONICAL LABELLING SCHEME: ALWAYS DISTINGUISH ALL BLANK NODES

Brendan D. McKay. "Practical graph isomorphism". Congressus Numerantium 30: pp. 45–87, 1981.

Start with a (non-distinguished) colouring …

Let’s distinguish a node …

Colouring is no longer a fixpoint!

Rerun colouring to fixpoint

Fixpoint reached: still not finished!

So again let’s distinguish another …

… and rerun colouring to fixpoint

Now all blank nodes are distinguished!

Blank node labels computed from colour

Let’s go back: first, why pick _:a and _:c?

Okay so: why _:a …

Adapt ideas from the Nauty algorithm (for standard graph isomorphism)

Check all leaves for minimum graph

What happened?

Automorphisms cause repetitions

CORE ALGORITHM: FIND MINIMAL GRAPH FOLLOWING FIXED COLOURING RULES
• Complete
• Efficient for many cases?
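
The distinguish-and-recolour search can be sketched as below. This is a simplified illustration, not the author's algorithm: it has no automorphism pruning, the rule for choosing which colour class to split (smallest class, then lowest colour) is an assumption, MD5 is an arbitrary hash, and all names are hypothetical. Graphs are sets of string triples with blank nodes written `_:label`.

```python
import hashlib


def h(*parts):
    return hashlib.md5("\x00".join(parts).encode("utf-8")).hexdigest()


def is_blank(term):
    return term.startswith("_:")


def refine(graph, colours):
    # Rehash each blank node with its neighbours' colours until no class splits.
    def c(term):
        return colours[term] if is_blank(term) else h("term", term)

    while True:
        new = {}
        for b in colours:
            edges = sorted(
                [h("out", p, c(o)) for s, p, o in graph if s == b] +
                [h("in", p, c(s)) for s, p, o in graph if o == b])
            new[b] = h(colours[b], *edges)
        if len(set(new.values())) == len(set(colours.values())):
            return new
        colours = new


def canonical(graph, colours=None):
    graph = set(graph)
    if colours is None:
        colours = {t: h("blank") for s, p, o in graph
                   for t in (s, o) if is_blank(t)}
    colours = refine(graph, colours)
    classes = {}
    for b, col in colours.items():
        classes.setdefault(col, []).append(b)
    shared = [(len(ns), col) for col, ns in classes.items() if len(ns) > 1]
    if not shared:
        # Leaf: every blank node has its own colour, so label nodes by colour.
        label = {b: "_:" + col for b, col in colours.items()}
        return sorted(tuple(label.get(t, t) for t in triple)
                      for triple in graph)
    _, chosen = min(shared)  # assumed rule: smallest class, then lowest colour
    best = None
    for b in classes[chosen]:
        branch = dict(colours)
        branch[b] = h(colours[b], "distinguished")  # distinguish one node
        leaf = canonical(graph, branch)
        if best is None or leaf < best:
            best = leaf
    return best
```

Because every node in the chosen class is tried and the minimum leaf is kept, the result does not depend on the input labels. Two directed triangles and a directed six-cycle, which the naïve colouring cannot tell apart, should yield different canonical graphs here, at the price of revisiting symmetric branches.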

OKAY … SO WHAT HASHING TO USE?

What about hash collisions?

128-bit: MD5, Murmur3_128
160-bit: SHA1

HASHING MAY LEAD TO COLLISIONS
• The approach does not depend on which hash function you use
• A 128-bit hash is the shortest with an acceptable collision probability
• For cryptographic use-cases, SHA-256 or better might be needed
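
To get a feel for the collision probabilities involved, the standard birthday-bound approximation can be computed directly. This is a generic estimate that assumes hash values are uniformly random; it is not a figure from the slides, and the function name is illustrative.

```python
import math


def collision_probability(n_items, bits):
    # Birthday approximation: p is roughly 1 - exp(-n(n-1) / 2^(bits+1)).
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** bits)
```

Under this approximation, hashing a trillion items with a 128-bit hash gives a collision probability on the order of 10^-15, whereas a 32-bit hash is more likely than not to collide after only a hundred thousand items.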

EVALUATION

Evaluation: Real-world Graphs

Evaluation: Nasty Synthetic Graphs

CONCLUSIONS

In loving memory of

Linked Data

2007–2012

Survived by its research community

_:b

1999–2015

Conclusions

Aside: Why GI-Hard? (Can encode graph isomorphism as RDF isomorphism)
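
One plausible encoding is sketched below; the figure from the slide is not available in this transcript, so the details are an assumption. Each vertex becomes a blank node and each undirected edge becomes a pair of triples with a single fixed predicate. The predicate name `ex:edge` and the function name are hypothetical.

```python
def encode_undirected(edges, predicate="ex:edge"):
    # Assumption: vertices are hashable ids; vertex v becomes blank node "_:v<v>".
    triples = set()
    for u, v in edges:
        s, o = "_:v%s" % u, "_:v%s" % v
        # One triple per direction, so the edge has no orientation.
        triples.add((s, predicate, o))
        triples.add((o, predicate, s))
    return triples
```

Under this encoding, two undirected graphs without isolated vertices are isomorphic if and only if their RDF encodings are isomorphic. Isolated vertices are not represented in this sketch and would need separate handling.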

if and only if

Aside: Why GI-Complete? (Can we encode RDF isomorphism as graph isomorphism?)

if and only if

Aside: Why GI-Complete? (Yes: We can encode RDF isomorphism as graph isomorphism)

if and only if

COMPLETE CANONICAL LABELLING SCHEME

A complete canonical labelling?

Find a canonical labelling for H

Choose the lowest possible graph

COMPLETE: FIND MINIMUM POSSIBLE GRAPH USING FIXED BLANK NODE LABELS
• Complete
• Inefficient
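
A brute-force version of this idea can be sketched as follows, assuming graphs are sets of string triples with blank nodes written `_:label`. The fixed labels `_:b0`, `_:b1`, … and the function name are illustrative choices. Every assignment of the fixed labels to the blank nodes is tried, and the lexicographically lowest sorted triple list is kept, so the cost grows factorially with the number of blank nodes.

```python
from itertools import permutations


def canonical_bruteforce(graph):
    graph = set(graph)
    bnodes = sorted({t for s, p, o in graph
                     for t in (s, o) if t.startswith("_:")})
    labels = ["_:b%d" % i for i in range(len(bnodes))]  # fixed label set
    best = None
    for perm in permutations(labels):
        mapping = dict(zip(bnodes, perm))
        candidate = sorted(tuple(mapping.get(t, t) for t in triple)
                           for triple in graph)
        # Keep the lowest graph seen over all labellings.
        if best is None or candidate < best:
            best = candidate
    return best
```

Two graphs are isomorphic exactly when this function returns the same list for both, since each result is a relabelling of its input and isomorphic inputs share the same set of candidates.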

The need for a graph-level hash

OPTIMISATION: PRUNE THE TREE USING AUTOMORPHISMS

Trim the search tree using “found” automorphisms

Found Automorphisms …

PRUNING PER AUTOMORPHISMS AVOIDS SYMMETRIC REPETITIONS
• Automorphisms are found naturally
• Makes very “regular” structures (like cliques) a lot easier
• Need to be careful how to manage the automorphism group