Intro to Graph Databases Using Tinkerpop, TitanDB, and Gremlin

description

A quick overview of the history, motivation, and uses of graph modeling and graph databases in various industries. Covers a brief introduction to graph databases with an emphasis on the Tinkerpop stack and Gremlin query language. These concepts are then solidified through a hands-on lab modeling a blog engine using Titan and Gremlin. See more at http://allthingsgraphed.com.

Transcript of Intro to Graph Databases Using Tinkerpop, TitanDB, and Gremlin

  • 1. INTRO TO GRAPH DATABASES: Using Tinkerpop, TitanDB, and Gremlin {email : [email protected], website : http://calebjones.info, twitter : @JonesWCaleb}

2. Overview
  • Why Graphs? Order to complexity, use cases, major players
  • Graphs & Adjacency Matrices
  • Tinkerpop Framework: Blueprints, Frames, Pipes, Furnace, Gremlin, Rexster
  • Titan using Cassandra
  • Blog Application (lab): traversals using Gremlin
3. WHY GRAPHS?
4. Warren Weaver, "Science and Complexity" (1948)
  • 17th - 19th century: problems of simplicity. How one element interacts with another.
  • First half of 20th century: problems of disorganized complexity. Many elements operating in a system without regard to how they interact with each other.
  • Predicted: problems of organized complexity. Many elements operating in a system, taking into account how they interact with each other. Would require computational power far beyond what was available at the time (ENIAC, 1946).
5. Organisms
6. Knowledge Classification
7. Organizational Hierarchy
8. Neurology
9. Order to Complexity
  • Trees describe order: linear (simple lineage), categorized, single dimensional, symmetrical, hierarchical, convergent modeling
  • Networks describe complexity: non-linear (multi-lineage), multi-categorical, multi-dimensional, asymmetrical, decentralized, divergent modeling
10-19. Types of Networks (image slides; only these captions survive)
  • Neuron network of a mouse
  • Millennium Simulation (2005): largest astronomical simulation ever on the structure and evolution of galaxies in the universe. 25 TB of data and 20 million galaxies.
20. Use Cases
  • Recommendation engines (avoid relational N-JOIN or self-JOIN)
  • Ranking/credibility (Google's PageRank)
  • Path finding (shortest, longest, mutual friends)
  • Social (friendship, following, key connectors)
21. Graphs
  • Node/Vertex: an entity that can have zero or more edges connected to it.
  • Edge: an entity which connects two nodes. May be directed or undirected.
22. Adjacency Matrix
  • If the graph is undirected, the adjacency matrix is symmetric
  • Thus, the transpose of the matrix describes the same graph
23.
Adjacency Matrix
  • Some graphs have different types or dimensions of edges
24. Property Graphs
  • Vertices
      id 1, name Ivan
      id 2, name Bob
      id 3, name Eve
      id 4, name Alice
  • Edges
      id E1, type cousin, separation 1
      id E2, type knows, since 2013-09-01
      id E3, type knows, since 2013-09-01
      id E4, type sibling, twins true
25. Traversals
  • Breadth-first: 3, 2, 4, 1
  • Depth-first: 3, 2, 1, 4
  • Breadth-first and depth-first search can be combined.
  • Filtering: ability to filter/sort paths in a traversal
  • Aggregating: ability to aggregate/count properties as the traversal occurs and to affect the traversal with the result of the aggregation (e.g. power-grid load distribution)
  • Backtracking: leave a marker in the traversal and come back to it when certain criteria are met in a lower step
26. TINKERPOP: Graph Framework
27. Tinkerpop
  • A comprehensive, open-source graph framework (http://www.tinkerpop.com/)
  • Blueprints: property graph model that is DB agnostic. A kind of JDBC for graphs.
  • Pipes: data flow API for processing graphs. Underlying component for graph traversals.
  • Gremlin: DSL for traversing property graphs. Implemented in JSR-223.
  • Frames: maps between domain objects and the graph's nodes and edges. Like ORM for graphs.
  • Furnace: collection of common graph analysis algorithms for property graphs.
  • Rexster: exposes any Blueprints graph via a uniform RESTful API.
28. Tinkerpop Stack
  • Different components all build on each other
  • Provides abstraction from HTTP layer, to object mapping layer, to traversal scripting, to pluggable graph API
  • Blueprints underpins the stack, making it all DB agnostic
  • Blueprints implementations: Neo4j, Sail, OrientDB, Dex
  • Implemented by 3rd parties (*): Accumulo, ArangoDB, Bitsy, FluxGraph, FoundationDB, InfiniteGraph, MongoDB, Oracle-NoSQL, TitanDB
29. Tinkerpop - Rexster
  • Provides REST and binary (RexPro - grizzly) protocols
  • Flexible extension model (e.g. ad-hoc Gremlin queries)
  • Server-side stored procedures (Gremlin)
  • Browser-based interface (Dog House)
  • Command-line tool for interacting with API
  • Pluggable security
  • SPARQL plugin to work against Sail graphs (OpenRDF)
  • More information: https://github.com/tinkerpop/rexster/wiki
30. Tinkerpop - Furnace
  • Collection of industry-standard algorithms for traversing or analyzing graphs.
  • Network generators (by clique or degree distribution)
  • Search: A*, breadth-first, depth-first
  • Shortest path: Bellman-Ford (like Dijkstra's but can handle negative edge weights)
  • PageRank
  • Degree distribution
  • More information: https://github.com/tinkerpop/furnace/wiki
31. Tinkerpop - Frames
  • More information: https://github.com/tinkerpop/frames/wiki
32. Tinkerpop - Pipes
  • Dataflow framework using process graphs.
  • Each computational step becomes a node, and an edge is a communication channel between steps.
  • Pipes are then chained and nested. Custom pipes can be created.
  • Pipe types:
      Transform: emit a transformation of the object (dozens of different types of transforms)
      Filter: decide whether to include/exclude the object in the traversal (~20 different types of filters)
      sideEffect: include the object but produce a side-effect from it (~15 different types of sideEffects, e.g. group, count, table, tree)
      Branch: decide which step to take next in the traversal (several different branching options)
33. Tinkerpop - Blueprints
  • Like JDBC but for graphs.
  • Common API for property graphs, which are very flexible
  • Foundational component for Pipes, Gremlin, Frames, Furnace, and Rexster
  • Supports transactions (if the underlying DB engine does)
  • Multi-threaded transactions supported
  • Format readers/writers (GML, GraphML, GraphSON)
  • More information: https://github.com/tinkerpop/blueprints/wiki
34. Tinkerpop - Gremlin
  • Graph traversal scripting language.
  • Works against the Blueprints API and is compiled into Pipes data-flows.
  • Both native Java and Groovy (JSR-223) supported.
  • Step library (https://github.com/tinkerpop/gremlin/wiki/Gremlin-Steps):
      Transform: emit a transformation of the object (dozens of different types of transforms)
      Filter: decide whether to include/exclude the object in the traversal (~20 different types of filters)
      sideEffect: include the object but produce a side-effect from it (~15 different types of sideEffects, e.g. group, count, table, tree)
      Branch: decide which step to take next in the traversal (several different branching options)
35. SQL to Gremlin (secret decoder ring)
  • Get all users
      SQL:     select * from users
      Gremlin: g.V('type','user').map()
  • Get user names
      SQL:     select name from users
      Gremlin: g.V('type','user').name
  • Get user names/ages
      SQL:     select name, age from users
      Gremlin: g.V('type','user').transform({[name: it.getProperty('name'), age: it.getProperty('age')]})
  • Get distinct user ages
      SQL:     select distinct(age) from users
      Gremlin: g.V('type','user').age.dedup()
  • Get oldest user
      SQL:     select max(age) from users
      Gremlin: g.V('type','user').age.max()
36. SQL to Gremlin (secret decoder ring)
  • Select by equality
      SQL:     select * from users where age = 35
      Gremlin: g.V('type','user').has('age',35).map()
  • Select by comparison
      SQL:     select * from users where age > 21
      Gremlin: g.V('type','user').has('age',T.gt,21).map()
  • Select by multiple criteria
      SQL:     select * from users where sex = 'M' and age > 25
      Gremlin: g.V('type','user').has('age',T.gt,25).has('sex','M').map()
  • Order by age (switch a and b to do asc)
      SQL:     select * from users order by age desc
      Gremlin: g.V('type','user').order({it.b.getProperty('age') <=> it.a.getProperty('age')}).map()
  • Paging
      SQL:     select * from users order by age desc limit 5 offset 5
      Gremlin: g.V('type','user').order({it.b.getProperty('age') <=> it.a.getProperty('age')})[5..<10].map()
37.
SQL to Gremlin (secret decoder ring)
  • Join
      SQL:     select users.* from users inner join groups on users.gId = groups.id where groups.name = 'devs'
      Gremlin: g.V('type','groups').has('name','devs').in('inGroup').map()
  • Join-on-join-on-join
      SQL:
        SELECT TOP (5) [t14].[ProductName]
        FROM (
            SELECT COUNT(*) AS [value], [t13].[ProductName]
            FROM [customers] AS [t0]
            CROSS APPLY (
                SELECT [t9].[ProductName]
                FROM [orders] AS [t1]
                CROSS JOIN [orderdetails] AS [t2]
                INNER JOIN [products] AS [t3] ON [t3].[ProductID] = [t2].[ProductID]
                CROSS JOIN [orderdetails] AS [t4]
                INNER JOIN [orders] AS [t5] ON [t5].[OrderID] = [t4].[OrderID]
                LEFT JOIN [customers] AS [t6] ON [t6].[CustomerID] = [t5].[CustomerID]
                CROSS JOIN ([orders] AS [t7]
                    CROSS JOIN [orderdetails] AS [t8]
                    INNER JOIN [products] AS [t9] ON [t9].[ProductID] = [t8].[ProductID])
                WHERE NOT EXISTS (
                    SELECT NULL AS [EMPTY]
                    FROM [orders] AS [t10]
                    CROSS JOIN [orderdetails] AS [t11]
                    INNER JOIN [products] AS [t12] ON [t12].[ProductID] = [t11].[ProductID]
                    WHERE [t9].[ProductID] = [t12].[ProductID]
                      AND [t10].[CustomerID] = [t0].[CustomerID]
                      AND [t11].[OrderID] = [t10].[OrderID])
                  AND [t6].[CustomerID] <> [t0].[CustomerID]
                  AND [t1].[CustomerID] = [t0].[CustomerID]
                  AND [t2].[OrderID] = [t1].[OrderID]
                  AND [t4].[ProductID] = [t3].[ProductID]
                  AND [t7].[CustomerID] = [t6].[CustomerID]
                  AND [t8].[OrderID] = [t7].[OrderID]
            ) AS [t13]
            WHERE [t0].[CustomerID] = N'ALFKI'
            GROUP BY [t13].[ProductName]
        ) AS [t14]
        ORDER BY [t14].[value] DESC
      Gremlin:
        g.V('customerId','ALFKI').as('customer')
         .out('ordered').out('contains').out('is').as('products')
         .in('is').in('contains').in('ordered').except('customer')
         .out('ordered').out('contains').out('is').except('products')
         .groupCount().cap().orderMap(T.decr)[0..<5].productName
38.
Gremlin Resources
  • Tinkerpop resources
      https://github.com/tinkerpop/gremlin/wiki/Basic-Graph-Traversals
      https://github.com/tinkerpop/gremlin/wiki/Gremlin-Steps
      https://github.com/tinkerpop/gremlin/wiki/Using-Gremlin-through-Java
      https://groups.google.com/forum/#!forum/gremlin-users
      https://github.com/tinkerpop/gremlin/wiki/SPARQL-vs.-Gremlin
      http://markorodriguez.com/2011/08/03/on-the-nature-of-pipes/
      http://sql2gremlin.com/
      http://gremlindocs.com/
  • Groovy
      http://groovy.codehaus.org/Beginners+Tutorial
      http://groovy.codehaus.org/Collections
  • Misc
      http://www.fromdev.com/2013/09/Gremlin-Example-Query-Snippets-Graph-DB.html
      http://markorodriguez.com/2011/06/15/graph-pattern-matching-with-gremlin-1-1/
39. GREMLIN: Demo Dataset Lab
40. Tinkerpop - Gremlin
    gremlin> g = TinkerGraphFactory.createTinkerGraph()
    ==>tinkergraph[vertices:6 edges:6]
    gremlin> g.V.count()
    ==>6
    gremlin> g.E.count()
    ==>6
    gremlin> g.v(1)
    ==>v[1]
    gremlin> g.v(1).map
    ==>{age=29, name=marko}
    gremlin> g.v(1).outE
    ==>e[7][1-knows-2]
    ==>e[8][1-knows-4]
    ==>e[9][1-created-3]
    gremlin> g.v(1).outE('knows')
    ==>e[7][1-knows-2]
    ==>e[8][1-knows-4]
    gremlin> g.v(1).outE('knows').map
    ==>{weight=0.5}
    ==>{weight=1.0}
41. Tinkerpop - Gremlin
    // get vertices known by marko
    gremlin> g.v(1).outE('knows').inV
    ==>v[2]
    ==>v[4]
    // get properties of vertices known by marko
    gremlin> g.v(1).outE('knows').inV.map
    ==>{age=27, name=vadas}
    ==>{age=32, name=josh}
    // filter by those older than 30
    gremlin> g.v(1).outE('knows').inV.filter{it.age > 30}.map
    ==>{age=32, name=josh}
    // just get name
    gremlin> g.v(1).outE('knows').inV.filter{it.age > 30}.name
    ==>josh
    // find nodes who know someone older than 30
    gremlin> g.V.as('x').outE('knows').inV.has('age', T.gt, 30).back('x').map
    ==>{age=29, name=marko}
42.
Tinkerpop - Gremlin
    // find edges with weight > .5
    gremlin> g.E.filter{it.weight > 0.5}
    ==>e[10][4-created-5]
    ==>e[8][1-knows-4]
    // find edges w/ weight > .5 from marko
    gremlin> g.E.filter{it.weight > 0.5}.as('x').outV.has('name', T.eq, 'marko').back('x')
    ==>e[8][1-knows-4]
    // find nodes created by other nodes
    gremlin> g.V.as('x').inE('created').back('x').map
    ==>{name=lop, lang=java}
    ==>{name=ripple, lang=java}
    gremlin> g.E.filter{it.label == 'created'}.inV.dedup().map
    ==>{name=lop, lang=java}
    ==>{name=ripple, lang=java}
    // find nodes created by more than 1 node
    gremlin> g.E.filter{it.label == 'created'}.inV.groupCount().cap()
    ==>{v[3]=3, v[5]=1}
    // find nodes created by marko's friends
    gremlin> g.v(1).outE('knows').inV.outE('created').inV.map
    ==>{name=ripple, lang=java}
    ==>{name=lop, lang=java}
43. Tinkerpop - Gremlin
    // add some new nodes
    gremlin> g.addVertex([name:'bob', age:'60'])
    ==>v[0]
    gremlin> g.addVertex([name:'eve', age:'40'])
    ==>v[7]
    gremlin> g.addVertex([name:'timmy', age:'5'])
    ==>v[8]
    // add some edges
    gremlin> g.addEdge(g.v(0), g.v(7), 'friend')
    ==>e[13][0-friend-7]
    gremlin> g.addEdge(g.v(0), g.v(8), 'child')
    ==>e[14][0-child-8]
    gremlin> g.V.filter{it.name == 'bob'}.outE('child').as('x').inV.filter{it.name == 'timmy'}.back('x')
    ==>e[14][0-child-8]
    gremlin> g.removeEdge(g.e(14))
    ==>null
    gremlin> g.V.filter{it.name == 'bob'}.outE('child').as('x').inV.filter{it.name == 'timmy'}.back('x')
    // no results
44. Tinkerpop - Gremlin
    // previously
    gremlin> g.addVertex([name:'bob', age:'60'])
    ==>v[0]
    gremlin> g.addVertex([name:'eve', age:'40'])
    ==>v[7]
    gremlin> g.addEdge(g.v(0), g.v(7), 'friend')
    ==>e[13][0-friend-7]
    // query for edge
    gremlin> g.v(0).outE
    ==>e[13][0-friend-7]
    // remove vertex (auto removes orphaned edge)
    gremlin> g.removeVertex(g.v(7))
    ==>null
    gremlin> g.v(0).outE
    // no results
    gremlin> g.e(13)
    ==>null
45. TITAN: A Distributed Graph Database
46.
Titan Graph Database
  • Optimized to work against billions of nodes and edges
      Theoretical limit of 2^60 edges and half as many (2^59) nodes
  • Works with several different distributed DBs including Cassandra and HBase
  • Supports many concurrent users doing complex graph traversals simultaneously
  • Native integration with the Tinkerpop stack
  • Supports integration with search technologies such as Lucene and Elasticsearch
  • Created by Thinkaurelius (http://thinkaurelius.com/)
47. Titan Distributed Architecture
  • TitanDB can integrate with distributed architectures in a few different ways
  • Native
      Connects remotely to cluster (or local)
      Can scale size as far as cluster can
      Native Titan API
      Possible processing bottleneck
  • Remote
      Put Rexster in front to allow RESTful access
      Connects remotely to cluster
      Can scale size as far as cluster can
      Possible processing bottleneck
  • Embedded
      TitanDB and Rexster run on each node in the cluster
      Can run on same JVM
      Considerable performance/scalability improvement
48. Titan Indexing
  • Standard index
      Internal to Titan
      Very fast but only supports exact matches
  • External index
      Uses an indexing engine external to Titan (Lucene or Elasticsearch)
      Supports range queries
      Lucene: limited to only one machine (small-sized datasets); also has a richer set of search features (than Elasticsearch)
      Elasticsearch: distributed; not as feature-filled as Lucene
49. Distributed Titan Limitations/Gotchas
  • Limitations which are present but which are scheduled to be remedied
      Property indexes must be created before the property is ever used
      Unable to drop indices
      Types cannot be changed once created
  • Gotchas
      Multiple graphs on the same backend require specific configurations per graph
      Ghost vertices: certain concurrency circumstances can leave traces of vertices. The recommendation is to allow this and periodically clean them up.
50. Titan Graph Database - Gremlin
  • G = (V, E, λ): graph, vertices, edges, properties
51. Titan Graph Database - Gremlin (same content as slide 50)
52.
Titan Graph Database - Gremlin
  • G = (V, E, λ): graph, vertices, edges, properties
  • Application
53. Titan Graph Database - Gremlin (same content as slide 52)
54. Titan Graph Database - Gremlin (same content as slide 52)
55. DATA MODELING EXAMPLE: A Blogging Application
56. Bloggie Blog Requirements
  • Create users, posts, and comments
  • Retrieve all posts for a user
  • Retrieve posts by time range
  • Retrieve all comments for a user
  • Retrieve all comments for a post, sorted by vote
  • Retrieve the top N posts, sorted by vote
  • User can only vote *once* on a post or comment
57. Get Cassandra, Titan
  • Titan: https://github.com/thinkaurelius/titan/wiki/Downloads (0.3.2 stable)
    $ $TITAN_LOCATION/bin/gremlin.sh

             \,,,/
             (o o)
    -----oOOo-(_)-oOOo-----
    gremlin> g = new TinkerGraph();
    ==>tinkergraph[vertices:0 edges:0]
    gremlin>
58. Modeling Entities (User, Post, Comment)
  • There's no one way to model this. General rules to follow:
      1-N relationships can be modeled as one node with N edges pointing to other nodes
      1-1 relationships can be modeled as a simple edge between two nodes
      M-N relationships are just more edges
  • It is important to categorize the different types of edges, since many different types of edges will connect to a single node
  • Don't shy away from attaching properties to edges. Remember that edges are just as queryable as nodes.
  • A common practice is to model actions as edges and actors/artifacts as nodes
  • Denormalize to minimize traversals
59. Users, Posts, Comments
60.
Retrieve User's Posts
  • Let's create a user and a post
  • Link them together
  • Retrieve the user and their posts
    gremlin> g.addVertex([type:'user', email:'[email protected]', name:'Robert', password:'asdf'])
    ==>v[0]
    gremlin> g.addVertex([type:'post', guid:'21EC2020-3AEA-1069-A2DD-08002B30309D', title:'Hello World', text:'My first post!', userDisplayName:'Bob'])
    ==>v[1]
    gremlin> g.addEdge(g.v(0), g.v(1), 'postAuthor')
    ==>e[3][0-postAuthor-1]
    gremlin> g.V.has('type','post').as('posts').inE('postAuthor').outV.has('email','[email protected]').back('posts').map()
    ==>{guid=21EC2020-3AEA-1069-A2DD-08002B30309D, text=My first post!, title=Hello World, userDisplayName=Bob, type=post}
61. Retrieve Posts by Time Range
  • Add timestamp property to post
  • Query by range
    gremlin> g.V.has('guid','21EC2020-3AEA-1069-A2DD-08002B30309D').has('type','post').sideEffect({it.createTimestamp = 1383726500});
    ==>v[1]
    gremlin> g.V.has('createTimestamp', T.gt, 1383726400).has('createTimestamp', T.lt, 1383726600).map()
    ==>{guid=21EC2020-3AEA-1069-A2DD-08002B30309D, createTimestamp=1383726500, text=My first post!, title=Hello World, userDisplayName=Bob, type=post}
62. Retrieve All User's Comments
  • Add comment
  • Link to author and to post
    gremlin> g.addVertex([type:'comment', guid:'3F2504E0-4F89-11D3-9A0C-0305E82C3301', text:'I like it!', userDisplayName:'Sally', createTimestamp:1383736500])
    ==>v[4]
    gremlin> g.addEdge(g.v(1), g.v(4), 'postComment')
    ==>e[5][1-postComment-4]
    gremlin> g.addVertex([type:'user', email:'[email protected]', name:'Sally', password:'qwerty'])
    ==>v[6]
    gremlin> g.addEdge(g.v(6), g.v(4), 'commentAuthor')
    ==>e[7][6-commentAuthor-4]
    gremlin> g.V.has('type','comment').as('comments').inE('commentAuthor').outV.has('email','[email protected]').back('comments').map()
    ==>{guid=3F2504E0-4F89-11D3-9A0C-0305E82C3301, createTimestamp=1383736500, text=I like it!, userDisplayName=Sally, type=comment}
63.
Retrieve Top N Posts by Vote
  • Create a postVote edge and an aggregated votes count in the post
  • Query and sort by votes
    gremlin> g.addEdge(g.v(6), g.v(1), 'postVote', [date:1383726600])
    ==>e[8][6-postVote-1]
    gremlin> g.V.has('type','post').has('guid','21EC2020-3AEA-1069-A2DD-08002B30309D').sideEffect({it.votes = 1})
    ==>v[1]
    gremlin> g.addVertex([type:'post', guid:'21EC2020-3AEA-1069-A2DD-08002B30309E', createTimestamp:1383726600, title:'Learning Gremlin', text:'Gremlin is neat.', userDisplayName:'Bob', votes:2])
    ==>v[9]
    gremlin> g.V('type','post').order({it.b.getProperty('votes') <=> it.a.getProperty('votes')}).transform({['title': it.getProperty('title'), 'votes': it.getProperty('votes')]})[0..5]
    ==>{title=Learning Gremlin, votes=2}
    ==>{title=Hello World, votes=1}
64. Retrieve Post Comments Sorted by Vote
  • Similar to post votes
    gremlin> g.addEdge(g.v(0), g.v(4), 'commentVote', [date:1383726700])
    ==>e[10][0-commentVote-4]
    gremlin> g.V.has('type','comment').has('guid','3F2504E0-4F89-11D3-9A0C-0305E82C3301').sideEffect({it.votes = 1})
    ==>v[4]
    gremlin> g.addVertex([type:'comment', guid:'3F2504E0-4F89-11D3-9A0C-0305E82C3302', text:'Thanks.', userDisplayName:'Bob', createTimestamp:1383736500])
    ==>v[11]
    gremlin> g.addEdge(g.v(1), g.v(11), 'postComment')
    gremlin> g.addEdge(g.v(0), g.v(11), 'commentAuthor')
    gremlin> g.v(1).outE('postComment').inV.order({it.b.getProperty('votes') <=> it.a.getProperty('votes')}).map()
    ==>{guid=3F2504E0-4F89-11D3-9A0C-0305E82C3301, createTimestamp=1383736500, text=I like it!, votes=1, userDisplayName=Sally, type=comment}
    ==>{guid=3F2504E0-4F89-11D3-9A0C-0305E82C3302, createTimestamp=1383736500, text=Thanks., userDisplayName=Bob, type=comment}
65. User Can Only Vote Once
  • Could enforce using external unique indexes
  • Or do 2-step incrementing in Gremlin (small chance of dups)
    gremlin> user = g.v(0); post = g.v(1); if (post.inE('postVote').outV.has('email', user.email).count() == 0) { g.addEdge(user, post, 'postVote', [date: new Date().getTime()]); if (post.getProperty('votes') != null) { post.votes++; } else { post.votes = 1; } }
    ==>1
    gremlin> // same command as above
    ==>null
66. Graph Visualization
67.
Areas Not Covered
  • Map/Reduce: Gremlin has its own built-in M/R API
  • Indexing: Titan currently has a limitation requiring that all indexes are created up-front
  • Integration with other backends: HBase, Oracle Berkeley DB, Hazelcast, Persistit
  • Detailed full-text search through external indexes
  • Graph analytics engine (Faunus)
  • Deep dive into the Gremlin query language and Groovy. Seriously, there's a TON there.
68. References
  http://sql2gremlin.com/
  http://www.tinkerpopbook.com/ - http://www.tinkerpop.com/
  https://github.com/thinkaurelius/titan/wiki/Getting-Started
  https://groups.google.com/forum/#!forum/gremlin-users
  https://groups.google.com/forum/#!forum/aureliusgraphs
  http://thinkaurelius.com/
69. THANK YOU {email : [email protected], website : http://calebjones.info, twitter : @JonesWCaleb}
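
Companion sketch for slide 25 (Traversals): the breadth-first order 3, 2, 4, 1 and the depth-first order 3, 2, 1, 4 quoted there can be reproduced in a few lines of Python. The slide's figure is not preserved in this transcript, so the edge list below (3-2, 3-4, 2-1) is an assumption chosen to be consistent with both orders, and the function names are illustrative rather than part of Tinkerpop.

```python
from collections import deque

# Assumed undirected edges; the original figure did not survive extraction.
EDGES = [(3, 2), (3, 4), (2, 1)]


def build_adjacency(edges):
    """Return {vertex: [neighbours]}, neighbours kept in edge-list order."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def breadth_first(adjacency, start):
    """Visit vertices level by level, starting from `start`."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for neighbour in adjacency[vertex]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


def depth_first(adjacency, start):
    """Follow each branch to its end before backing up."""
    order, seen = [], set()

    def visit(vertex):
        seen.add(vertex)
        order.append(vertex)
        for neighbour in adjacency[vertex]:
            if neighbour not in seen:
                visit(neighbour)

    visit(start)
    return order


GRAPH = build_adjacency(EDGES)
print(breadth_first(GRAPH, 3))  # [3, 2, 4, 1]
print(depth_first(GRAPH, 3))    # [3, 2, 1, 4]
```

The two functions differ only in the order in which pending vertices are taken up (a queue versus the call stack), which is why the slide notes that the two strategies can be combined within one traversal.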
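
Companion sketch for slide 65 (User Can Only Vote Once): the two-step approach there first looks for an existing postVote edge from the user and only then adds the edge and bumps the denormalized votes counter. The Python below mirrors that logic with an in-memory stand-in; it is not Titan or Gremlin code, and the `Post` class, the `vote` function, and the email address are hypothetical names used only for illustration. As on the slide, the check and the write are separate steps, so concurrent callers could still slip in a duplicate.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """Hypothetical stand-in for a post vertex with a denormalized counter."""
    title: str
    votes: int = 0
    voters: set = field(default_factory=set)  # users that have a postVote edge


def vote(post, user_email):
    """Record a vote unless this user already voted; return True if counted."""
    if user_email in post.voters:   # step 1: is there already a postVote edge?
        return False
    post.voters.add(user_email)     # step 2a: add the edge
    post.votes += 1                 # step 2b: bump the aggregated count
    return True


post = Post(title="Hello World")
first = vote(post, "robert@example.invalid")   # placeholder address
second = vote(post, "robert@example.invalid")  # same user votes again
print(first, second, post.votes)  # True False 1
```

Keeping the counter on the post follows the deck's advice to denormalize to minimize traversals: sorting posts by votes then never has to count edges.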