NoSQL with Graphs: mining graphs for fun & profit
Claudio Martella, NoSQLDay 2011
Saturday, March 26, 2011
Outline
[Figure: outline diagram with the labels Graphs, Why, Tools, Apps, NoSQL, RDBMS, O(1), Semantic Web, Tinkerpop, Recommendation, Query, Table, Documents, GraphDBs]
Who am I?
• PhD in Distributed Graphs @ UniBZ
• Analyst @ TIS Innovation Park
• Topics: Data / Text Mining with Graphs
• Technology: Hadoop, NoSQL, GraphDBs
• Writing Graffiti
Surrounded by graphs
• the Web Graph
• Semantic Web
• Social Networks
• Natural Sciences
• GIS
Property Graph
• A Graph is composed of Vertices and Edges
• Vertices are connected by Edges
• An Edge has a Label and Direction
• Edges and Vertices have Properties
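The four points above can be sketched as a small in-memory data structure. This is only an illustration of the property graph model, not the API of any particular graph database; the class names `Vertex` and `Edge` and the helper `connect` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(eq=False)  # identity-based equality avoids recursing through cycles
class Vertex:
    id: int
    properties: dict[str, Any] = field(default_factory=dict)
    out_edges: list["Edge"] = field(default_factory=list)
    in_edges: list["Edge"] = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    label: str      # every edge has a label...
    tail: Vertex    # ...and a direction: it goes from tail...
    head: Vertex    # ...to head
    properties: dict[str, Any] = field(default_factory=dict)


def connect(tail: Vertex, label: str, head: Vertex, **properties: Any) -> Edge:
    """Create a directed, labeled edge and register it on both endpoints."""
    edge = Edge(label, tail, head, dict(properties))
    tail.out_edges.append(edge)
    head.in_edges.append(edge)
    return edge


# Hypothetical sample data, loosely following the "Who am I?" slide.
me = Vertex(1, {"name": "claudio"})
tis = Vertex(2, {"name": "TIS"})
works_at = connect(me, "works at", tis)
```

Each vertex keeps direct references to its incoming and outgoing edges, which is what later makes neighbour access independent of the total graph size.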
Who am I? (as a property graph)
[Figure: property graph centered on the vertex "Me"]
Vertices: Me, TIS, UniBZ, NoSQL, Hadoop, Graffiti, GraphDB
Edge labels: works at, studies at, likes, works with, author, belongs to (x3)
Properties of Me: name: claudio, surname: martella, email: [email protected]
A graph in RDBMS
Follower | Followee
1        | 2
1        | 3
1        | 4
2        | 5
...      | ...

ID  | Name
1   | Claudio
2   | Cirpo
3   | Okram
4   | Spinoza
... | ...
BTree Index 101
• A lookup costs O(log N)
• where N is the global size of the data structure
• Updating the index is not free either
[Figure: BTree index over the keys Cirpo, Claudio, Okram, Spinoza]
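To make the logarithmic cost concrete, here is a minimal sketch that uses binary search over a sorted list as a stand-in for a BTree. A real BTree has a larger branching factor and therefore fewer levels, but the cost still grows with the logarithm of the total number of keys. The function names are assumptions made for this example.

```python
import bisect
import math

# The four keys shown on the slide, kept in sorted order like an index.
names = sorted(["Cirpo", "Claudio", "Okram", "Spinoza"])


def index_lookup(sorted_keys, key):
    """Return the position of key in sorted_keys, or -1 if it is absent."""
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return -1


def max_probes(n):
    """Worst-case number of probes a binary search needs over n keys."""
    return math.floor(math.log2(n)) + 1 if n > 0 else 0
```

With a million keys, `max_probes(1_000_000)` is 20: the cost depends on the size of the whole index, not on how many neighbours a single vertex has.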
A lookup (RDBMS)
• Look up Claudio’s ID [ O(log N) ]
• Look up his K followees [ O(log N) ]
• Get their names [ O(K log N) ]
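The three steps above map onto a two-join query. The sketch below uses Python's built-in `sqlite3` module as a stand-in for an RDBMS; the table and column names (`users`, `follows`, `follower`, `followee`) and the index names are assumptions for illustration, and only the rows for user 1 from the slide are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (follower INTEGER, followee INTEGER);
    CREATE INDEX idx_users_name ON users(name);
    CREATE INDEX idx_follows_follower ON follows(follower);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Claudio"), (2, "Cirpo"), (3, "Okram"), (4, "Spinoza")],
)
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3), (1, 4)])


def followee_names(conn, name):
    """Name -> ID (index lookup), ID -> followee IDs, followee IDs -> names."""
    rows = conn.execute(
        """
        SELECT u2.name
        FROM users u1
        JOIN follows f ON f.follower = u1.id
        JOIN users u2 ON u2.id = f.followee
        WHERE u1.name = ?
        ORDER BY u2.name
        """,
        (name,),
    )
    return [row[0] for row in rows]


result = followee_names(conn, "Claudio")
```

The last join is where the K index lookups come from: each followee ID has to be resolved to a name through the index on `users`.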
[Figure: the Follower/Followee and ID/Name tables from "A graph in RDBMS"]
A graph in NoSQL
ID      | F1    | F2    | F3      | ...
Cirpo   | ...   | ...   | ...     | ...
Claudio | Cirpo | Okram | Spinoza | ...
Okram   | ...   | ...   | ...     | ...
Spinoza | ...   | ...   | ...     | ...
...     | ...   | ...   | ...     | ...
A lookup (NoSQL)
• Look up Claudio’s row by ID [ O(log N) ]
• Look for Followees [ O(K) ]
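A rough sketch of the wide-row layout, using a plain Python dict as a stand-in for a key-value or column store. The rows whose content is elided on the slide are left empty here as placeholders, and the function name `followees` is an assumption. Note that a Python dict is hash-based, so the key lookup is O(1) on average rather than O(log N); the point is only that the followees come back with the row, without further index lookups.

```python
# One row per user; the columns F1, F2, F3, ... hold the followees.
rows = {
    "Cirpo": [],       # elided on the slide, left empty as a placeholder
    "Claudio": ["Cirpo", "Okram", "Spinoza"],
    "Okram": [],       # placeholder
    "Spinoza": [],     # placeholder
}


def followees(rows, user_id):
    """One key lookup, then an O(K) scan of the row's K columns."""
    return list(rows.get(user_id, []))


claudio_followees = followees(rows, "Claudio")
```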
[Figure: the wide-row table from "A graph in NoSQL"]
A graph in GraphDB
[Figure: vertices 1 (name: Claudio), 2 (name: Cirpo), 3 (name: Okram) and 4 (name: Spinoza); vertex 1 has a "follows" edge to each of vertices 2, 3 and 4]
A lookup (Graph)
• Look up Claudio’s ID [ O(log N) ]
• Look up his followees [ O(K) ]
[Figure: the same graph as in "A graph in GraphDB"]
What about Friends (of Friends)*?
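Friends of friends, to any depth, is a breadth-first traversal. The sketch below is a generic BFS over an adjacency dict, not the traversal API of any graph database; the function name `within_depth` is an assumption, and the sample data reuses the follower rows shown earlier (1 follows 2, 3, 4; 2 follows 5).

```python
from collections import deque


def within_depth(adjacency, start, max_depth):
    """Return {vertex: hops} for every vertex reachable in 1..max_depth hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if depth[vertex] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in depth:  # first visit is the shortest distance
                depth[neighbour] = depth[vertex] + 1
                queue.append(neighbour)
    del depth[start]
    return depth


follows = {1: [2, 3, 4], 2: [5]}
friends = within_depth(follows, 1, 1)
friends_of_friends = within_depth(follows, 1, 2)
```

Each extra level only touches the neighbours of vertices already reached, whereas in the relational layout each level is another join through the index.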
A benchmark
• 1 Million Vertices
• 4 Million Edges
• Scale-Free Topology
• Postgres VS Neo4J
• Both Hash and BTree
Depth | RDBMS     | Graph
1     | 100 ms    | 30 ms
2     | 1000 ms   | 500 ms
3     | 10000 ms  | 3000 ms
4     | 100000 ms | 50000 ms
5     | N/A       | 100000 ms
Ref: http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/
A benchmark
• 50 friends on average
• Check whether there is a path connecting two people
DB    | # people | Time
RDBMS | 1K       | 2000 ms
Graph | 1K       | 2 ms
Graph | 1M       | 2 ms
RDBMS | 1M       | N/A
Ref: http://www.slideshare.net/thobe/nosqleu-graph-databases-and-neo4j
A Graph Database allows O(1) access to adjacent Vertices
Ref: The Graph Traversal Pattern, Marko A. Rodriguez and Peter Neubauer
Example: Queries
[Figure: movie graph. Brad Pitt has "actor" edges to Ocean 11, Ocean 12, Ocean 13 and Se7en, and a "producer" edge to The Departed; Steven Soderbergh has three "director" edges; the movies have "genre" edges to Action, Crime, Thriller and Drama]
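A query over such a graph can be sketched as set operations on labeled edges, for example "which movies starring Brad Pitt were directed by Steven Soderbergh, and what genres are they?". The actor and producer edges follow the slide; the director targets and all genre assignments below are assumptions made for illustration, since the figure does not survive in the text. The helper `out` is a hypothetical name.

```python
edges = [
    ("Brad Pitt", "actor", "Ocean 11"),
    ("Brad Pitt", "actor", "Ocean 12"),
    ("Brad Pitt", "actor", "Ocean 13"),
    ("Brad Pitt", "actor", "Se7en"),
    ("Brad Pitt", "producer", "The Departed"),
    # Assumed targets of the three "director" edges.
    ("Steven Soderbergh", "director", "Ocean 11"),
    ("Steven Soderbergh", "director", "Ocean 12"),
    ("Steven Soderbergh", "director", "Ocean 13"),
    # Assumed genre assignments, for illustration only.
    ("Ocean 11", "genre", "Crime"),
    ("Ocean 11", "genre", "Thriller"),
    ("Ocean 12", "genre", "Crime"),
    ("Ocean 13", "genre", "Crime"),
    ("Se7en", "genre", "Thriller"),
    ("Se7en", "genre", "Drama"),
    ("The Departed", "genre", "Drama"),
]


def out(vertex, label):
    """Follow all outgoing edges with the given label from a vertex."""
    return {head for tail, lbl, head in edges if tail == vertex and lbl == label}


acted_in = out("Brad Pitt", "actor")
directed = out("Steven Soderbergh", "director")
both = sorted(acted_in & directed)
genres = sorted({genre for movie in both for genre in out(movie, "genre")})
```

Scanning the whole edge list in `out` is only for brevity; a graph database would follow the edges stored on the vertex instead.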
Example: Recommendations
[Figure: recommendation graph. The users Claudio, Cirpo and Caprazzi have "likes" edges to the items Graph Runner, The Lord of the Graphs, PHP I love You and Javatar; the items have "tagged" edges to the tags Adventure, Sci-Fi, Trilogy, Geeky and Boring]
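One way to turn such a graph into recommendations is to walk likes -> tagged -> tagged (backwards) and score unseen items by the tags they share with the user's liked items. This is a minimal sketch of that idea, not necessarily the method shown on the slides. Claudio's two likes follow the slide; the tag assignments and the function name `recommend` are assumptions for illustration.

```python
from collections import Counter

likes = {
    "Claudio": {"Graph Runner", "The Lord of the Graphs"},
}

# Assumed "tagged" edges; the exact assignments are not recoverable from the slide.
tags = {
    "Graph Runner": {"Sci-Fi", "Adventure"},
    "The Lord of the Graphs": {"Adventure", "Trilogy"},
    "Javatar": {"Sci-Fi", "Geeky"},
    "PHP I love You": {"Geeky", "Boring"},
}


def recommend(user):
    """Rank items the user has not liked by overlap with the user's tag profile."""
    liked = likes[user]
    # Step 1: user -likes-> item -tagged-> tag, counting how often each tag is hit.
    profile = Counter(tag for item in liked for tag in tags[item])
    # Step 2: tag <-tagged- item, summing the profile weights of each item's tags.
    scores = {
        item: sum(profile[tag] for tag in item_tags)
        for item, item_tags in tags.items()
        if item not in liked
    }
    ranked = [(item, score) for item, score in scores.items() if score > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


suggestions = recommend("Claudio")
```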
Graph Mining
How are they connected?
Ref: Programming the Semantic Web - O’Reilly
Other Applications
• Community Analysis
• Fraud Detection
• Planning
• Text Processing
• Reasoning
as you can’t get rid of logicians
there’s an SQL also for Graphs
Triplestores
[Figure: graph around the vertex Tom Cruise, with the edges actor -> Top Gun, married -> Katie Holmes, advocate -> Scientology, lives -> Hollywood, born -> July 3, 1962]
Triplestores
Subject    | Predicate | Object
Tom Cruise | actor     | Top Gun
Tom Cruise | married   | Katie Holmes
Tom Cruise | advocate  | Scientology
Tom Cruise | lives     | Hollywood
Tom Cruise | born      | July 3, 1962
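The triple table can be queried by pattern matching, which is the basic operation behind triplestore query languages. The following is a minimal sketch over an in-memory list, not the API of any triplestore; the function name `match` and the use of `None` as a wildcard are assumptions for this example.

```python
triples = [
    ("Tom Cruise", "actor", "Top Gun"),
    ("Tom Cruise", "married", "Katie Holmes"),
    ("Tom Cruise", "advocate", "Scientology"),
    ("Tom Cruise", "lives", "Hollywood"),
    ("Tom Cruise", "born", "July 3, 1962"),
]


def match(pattern, triples):
    """Return the triples matching (subject, predicate, object); None matches anything."""
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


spouse = match(("Tom Cruise", "married", None), triples)
about_top_gun = match((None, None, "Top Gun"), triples)
```

A real store would index the triples by subject, predicate and object rather than scanning the whole list.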
SPARQL
PREFIX ged: <http://www.daml.org/2001/01/gedcom/gedcom#>
SELECT ?name ?marriedOn
FROM <http://www.daml.org/2001/01/gedcom/royal92.daml>
WHERE {
  ?royal ged:title "Princess" .
  ?royal ged:name ?name .
  ?royal ged:spouseIn ?family .
  ?family ged:marriage ?marriage .
  ?marriage ged:date ?marriedOn .
}
ORDER BY ASC(?name)
what if the Internet was your GraphDB?
what about a NoSPARQL?
Tinkerpop
Blueprints
• Blueprints is like the JDBC of the graph database community.
• Provides a Java-based interface API for the property graph data model: Graph, Vertex, Edge, Index.
• Provides implementations of the interfaces for TinkerGraph, Neo4j, OrientDB, Sails (e.g. AllegroSail, Neo4jSail), and soon (hopefully) others such as InfiniteGraph, InfoGrid, Sones, and HyperGraphDB.
Pipes
• A dataflow framework with support for Blueprints-based graph processing.
• Provides a collection of “pipes” (each implements Iterable and Iterator)
✴ Filters: ComparisonFilterPipe, RandomFilterPipe, etc.
✴ Traversal: VertexEdgePipe, EdgeVertexPipe, PropertyPipe, etc.
✴ Splitting/Merging: CopySplitPipe, RobinMergePipe, etc.
✴ Logic: OrPipe, AndPipe, etc.
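The dataflow idea can be illustrated with Python generators: each stage lazily consumes an iterator and yields into the next. This is an analogy only and does not reproduce the Java Pipes API; the function names, the adjacency layout and the sample data are assumptions for this example.

```python
def out_edges_pipe(vertices, graph, label):
    """Traversal stage: vertex -> its outgoing edges with the given label."""
    for vertex in vertices:
        for edge in graph.get(vertex, ()):
            if edge[0] == label:
                yield edge


def in_vertex_pipe(edges):
    """Traversal stage: edge -> the vertex it points to."""
    for _label, head in edges:
        yield head


def filter_pipe(items, predicate):
    """Filter stage: let through only the items satisfying the predicate."""
    for item in items:
        if predicate(item):
            yield item


# Hypothetical graph: vertex id -> list of (label, head vertex id).
graph = {1: [("follows", 2), ("follows", 3), ("follows", 4)], 2: [("follows", 5)]}
names = {1: "Claudio", 2: "Cirpo", 3: "Okram", 4: "Spinoza"}

# Compose the stages; nothing runs until the result is iterated.
pipeline = filter_pipe(
    in_vertex_pipe(out_edges_pipe([1], graph, "follows")),
    lambda vertex: names.get(vertex) != "Okram",
)
result = list(pipeline)
```

Because every stage is an iterator, elements flow through one at a time and no intermediate list of edges or vertices is built.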
Gremlin
• A Turing-complete, graph-based programming language that compiles Gremlin syntax down to Pipes (implements JSR 223).
• Builds on top of Groovy
• Supports various language constructs: :=, foreach, while, repeat, if/else, function and path definitions, etc.
An example of “Amazon’s” recommender:
  m = [:]
  g.v(1).outE('purchased').inV.inE('purchased').outV.groupCount(m)
  m.sort{ a,b -> a.value <=> b.value }
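For readers who do not know Gremlin, the traversal above can be read step by step as: start at vertex 1, go to the items it purchased, go back to everyone who purchased those items, and count how often each purchaser is reached. The sketch below spells that out in plain Python; the purchase data and the function name `co_purchasers` are hypothetical.

```python
from collections import Counter

# Hypothetical data: user id -> items purchased.
purchased = {
    1: ["book", "lamp"],
    2: ["book"],
    3: ["book", "lamp"],
    4: ["pen"],
}


def co_purchasers(purchased, user):
    """Count how many of the user's items each purchaser also bought."""
    counts = Counter()
    for item in purchased[user]:              # outE('purchased').inV
        for other, items in purchased.items():  # inE('purchased').outV
            if item in items:
                counts[other] += 1            # groupCount(m)
    return counts


m = co_purchasers(purchased, 1)
# Ascending by count, like m.sort{ a,b -> a.value <=> b.value }.
ranked = sorted(m.items(), key=lambda entry: entry[1])
```

As in the Gremlin version, the starting user is counted too, since the traversal walks back along the user's own purchase edges; a real recommender would filter it out and then continue to the items those users bought.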
Rexster
• Allows Blueprints graphs to be exposed through a RESTful API (HTTP)
• Supports stored traversals written in raw Pipes or Gremlin.
• Supports ad hoc traversals represented in Gremlin.
• Provides “helper classes” for performing search-, score-, and rank-based traversal algorithms which, in concert, support recommendation.
Sample Stack
• HTTP Request arrives
• Converts REST to Gremlin
• Gremlin “compiles” to Pipes
• Pipes makes Blueprints calls
• Store provides the data
Neo4J
• Engine: Graph
• License: AGPLv3
• Language: Java
• Transactions: ACID
• Distributed: HA, Master-Slave Cache Sharding, Domain-Specific
• Features: Embeddable, REST, many plugins
OrientDB
• Engine: Document-Graph
• License: Apache 2.0
• Language: Java
• Transactions: ACID
• Distributed: HA through Replication
• Features: Embeddable, REST, SQL-like
HypergraphDB
• Engine: HyperGraph
• License: LGPL
• Language: Java
• Transactions: ACID
• Distributed: P2P distribution and replication
• Features: Hyperedges, Java OODB, storage on BerkeleyDB
InfiniteGraph
• Engine: Graph
• License: Commercial
• Language: Java
• Transactions: ACID
• Distributed: Graph Partitioning, Federation on Objectivity
• Features: Distributed lock management, scales to Exabytes
Where do I go now?
Tinkerpop: http://www.tinkerpop.com
Neo4J: http://neo4j.org
OrientDB: http://www.orientechnologies.com/orient-db.htm
InfoGrid: http://infogrid.org
InfiniteGraph: http://www.infinitegraph.com
Sones: http://developers.sones.de
AllegroGraph: http://www.franz.com/agraph/allegrograph
HypergraphDB: http://www.kobrix.com/hgdb.jsp
http://blog.acaro.org
http://github.com/claudiomartella/
@claudiomartella
http://joind.in/2946