Faunus: Graph Analytics Engine

FAUNUS MARKO A. RODRIGUEZ http://THINKAURELIUS.COM GRAPH ANALYTICS ENGINE

description

Faunus is a graph analytics engine built atop the Hadoop distributed computing platform. The graph representation is a distributed adjacency list, whereby a vertex and its incident edges are co-located on the same machine. A Faunus graph is queried with a MapReduce variant of the Gremlin graph traversal language. A Gremlin expression compiles down to a series of MapReduce steps that are sequence-optimized and then executed by Hadoop. Results are stored as transformations to the input graph (graph derivations) or as computational side-effects such as aggregates (graph statistics). Beyond querying, a collection of input/output formats is supported, which enables Faunus to load and store graphs in the distributed graph database Titan, in various graph formats stored in HDFS, and via arbitrary user-defined functions. This presentation will focus primarily on Faunus, but will also review the satellite technologies that enable it.

Transcript of Faunus: Graph Analytics Engine

Page 1: Faunus: Graph Analytics Engine

FAUNUS

MARKO A. RODRIGUEZ

http://THINKAURELIUS.COM

GRAPH ANALYTICS ENGINE

Page 2: Faunus: Graph Analytics Engine


http://FAUNUS.THINKAURELIUS.COM

Page 3: Faunus: Graph Analytics Engine

SPONSORED BY

ECCO, the Evolution, Complexity and Cognition group, is a multidisciplinary research group directed by Francis Heylighen. It is based at the Vrije Universiteit Brussel (VUB), although its members are distributed across four continents. Researchers come from a wide variety of backgrounds, from physical science and technology to the social sciences and humanities. The philosophy is intrinsically transdisciplinary, transcending the traditional boundaries between "hard" and "soft" sciences, and between philosophical foundations and practical applications.

The Big-Data Interest Group (BIGDIG) is a focus group at LANL that meets monthly to explore big-data methods and architectures. One goal of the group is to identify early adopters and learn from their experiences. Furthermore, the group would like to involve scientists who are looking for big-data solutions and to foster collaboration with those who might provide the needed technology. The BIGDIG group includes members from all domains: science, security, sensing, computing, library, and more.

The EgoSystem project is creating an integrated social model of the Los Alamos National Laboratory and its surroundings using numerous online services such as Twitter, LinkedIn, MS Academic, Wikipedia, and more. The model is seeded with LANL postdocs and their created artifacts, and continuously grows to encompass their relations to other people and institutions. EgoSystem is a Director-sponsored project engineered by the Digital Library Research and Prototyping Team using Big Graph Data technology provided by Aurelius.

Pages 4-11: Faunus: Graph Analytics Engine

[Figure, built up over several slides: the property graph model. A VERTEX has an ID (0) and PROPERTIES (name:faunus, born:2012). An EDGE has an ID (5), a LABEL (dependsOn), and PROPERTIES (since:2012); it joins vertex 0 to vertex 1 (name:hadoop, born:2005).]

Pages 12-24: Faunus: Graph Analytics Engine

VERTICES + EDGES (ELEMENTS)

Pages 25-32: Faunus: Graph Analytics Engine

[Figure, built up over several slides: a graph annotated with VERTEX IDS (0 to 3), EDGE IDS (4 to 7), EDGE LABELS (A, B, A, C), and ELEMENT PROPERTIES (a:b, c:d, e:f, g:h, i:j). Vertex 1 and its incident edges are then laid out as a single row, with the edges grouped into incoming edges and outgoing edges:]

    element  id  props  label  vertex id
    vertex   1   e:f
    edge     4   c:d    A      2
    edge     5          B      0
    edge     6   g:h    A      3
    edge     7          C      3
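The row for vertex 1 (1 e:f, 4 c:d A 2, 5 B 0, 6 g:h A 3, 7 C 3) can be mirrored in a small data structure. What follows is a minimal Python sketch, not Faunus's actual classes: the names `Vertex`, `Edge`, and their fields are hypothetical, and the split of the four edges into incoming and outgoing is an assumption, since the slide text does not say which is which.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Edge:
    """One edge as stored inside a vertex row: id, label, other endpoint, properties."""
    id: int
    label: str
    other_vertex_id: int
    props: Dict[str, str] = field(default_factory=dict)


@dataclass
class Vertex:
    """One adjacency-list row: the vertex together with all of its incident edges."""
    id: int
    props: Dict[str, str] = field(default_factory=dict)
    in_edges: List[Edge] = field(default_factory=list)
    out_edges: List[Edge] = field(default_factory=list)


# Row for vertex 1 from the slide. The in/out split is assumed for illustration.
row = Vertex(
    id=1,
    props={"e": "f"},
    in_edges=[Edge(4, "A", 2, {"c": "d"}), Edge(5, "B", 0)],
    out_edges=[Edge(6, "A", 3, {"g": "h"}), Edge(7, "C", 3)],
)

# Everything needed to step off vertex 1 is contained in the row itself.
neighbors_out = [e.other_vertex_id for e in row.out_edges]
```

Because the row carries both edge directions, a single record is enough to answer "who points at me" and "whom do I point at" without consulting any other row.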

Page 33: Faunus: Graph Analytics Engine

AN ADJACENCY LIST

[Figure: one row per vertex, for vertices 0 through 11.]

Page 34: Faunus: Graph Analytics Engine

AN ADJACENCY LIST + CLUSTER

[Figure: the same rows shown next to a cluster of three machines: 127.0.0.2, 127.0.0.3, 127.0.0.4.]

Page 35: Faunus: Graph Analytics Engine

A DISTRIBUTED ADJACENCY LIST

[Figure: the rows for vertices 0 through 11 partitioned across 127.0.0.2, 127.0.0.3, and 127.0.0.4.]

Page 36: Faunus: Graph Analytics Engine

Hadoop is a distributed computing platform composed of two key components:

HDFS: A distributed file system that stores arbitrarily large files within a cluster.

MapReduce: A parallel functional computing model for key/value pair data.

HADOOP

http://hadoop.apache.org

Page 37: Faunus: Graph Analytics Engine

FAUNUS AND HADOOP

Faunus provides graph input/output formats (structure) and a traversal language for graphs (process).

[Figure: the adjacency-list rows for vertices 0 through 11 stored across 127.0.0.2, 127.0.0.3, and 127.0.0.4, labeled "Structure" and "Process".]

Page 38: Faunus: Graph Analytics Engine

PROCESSING GRAPHS WITH FAUNUS

Page 39: Faunus: Graph Analytics Engine

GRAPH OF THE GODS

* Toy graph distributed with Faunus.

[Figure: a property graph with 12 vertices. Edges are labeled lives, brother, pet, father, mother, and battled; the three battled edges carry the properties time:1, time:2, and time:12.]

    id  name      type
    0   saturn    titan
    1   jupiter   god
    2   neptune   god
    3   pluto     god
    4   sky       location
    5   sea       location
    6   tartarus  location
    7   hercules  demigod
    8   alcmene   human
    9   nemean    monster
    10  hydra     monster
    11  cerberus  monster

Pages 40-45: Faunus: Graph Analytics Engine

    faunus$ bin/gremlin.sh

             \,,,/
             (o o)
    -----oOOo-(_)-oOOo-----
    gremlin> hdfs.ls()
    gremlin> hdfs.copyFromLocal('graph-of-the-gods.json','graph-of-the-gods.json')
    ==>null
    gremlin> hdfs.ls()
    ==>rw-r--r-- marko supergroup 2028 graph-of-the-gods.json
    gremlin>

[Figure, built up over several slides: the Graph of the Gods next to the cluster 127.0.0.2, 127.0.0.3, 127.0.0.4. After copyFromLocal, the rows for vertices 0 through 11 appear on the cluster.]

http://gremlin.tinkerpop.com

Page 46: Faunus: Graph Analytics Engine

    gremlin> g = FaunusFactory.open('bin/faunus.properties')
    ==>faunusgraph[graphsoninputformat->graphsonoutputformat]
    gremlin> g.getConf('faunus')
    ==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
    ==>faunus.input.location=graph-of-the-gods.json
    ==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
    ==>faunus.output.location=output
    ==>faunus.output.location.overwrite=true
    ==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

[Figure: the Graph of the Gods and its rows for vertices 0 through 11 on 127.0.0.2, 127.0.0.3, 127.0.0.4.]

Page 47: Faunus: Graph Analytics Engine

    gremlin> g.V
    13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
    13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map]
    13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
    13/05/07 12:07:10 INFO input.FileInputFormat: Total input paths to process : 1
    13/05/07 12:07:10 INFO mapred.JobClient: Running job: job_201304251105_0004
    13/05/07 12:07:11 INFO mapred.JobClient:  map 0% reduce 0%
    ...

[Figure: the Graph of the Gods and its rows for vertices 0 through 11 on 127.0.0.2, 127.0.0.3, 127.0.0.4. Every vertex holds a traverser count of 1.]

Page 48: Faunus: Graph Analytics Engine

    gremlin> g.V.has('type','god')
    13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
    13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map]
    13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
    13/05/07 12:08:56 INFO input.FileInputFormat: Total input paths to process : 1
    13/05/07 12:08:57 INFO mapred.JobClient: Running job: job_201304251105_0005
    13/05/07 12:08:58 INFO mapred.JobClient:  map 0% reduce 0%
    ...

[Figure: the Graph of the Gods and its rows for vertices 0 through 11 on 127.0.0.2, 127.0.0.3, 127.0.0.4. Vertices 1, 2, and 3 hold a traverser count of 1; all other vertices hold 0.]

Page 49: Faunus: Graph Analytics Engine

    gremlin> g.V.has('type','god').in('father')
    13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
    13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
    13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
    13/05/07 12:13:03 INFO input.FileInputFormat: Total input paths to process : 1
    13/05/07 12:13:04 INFO mapred.JobClient: Running job: job_201304251105_0006
    13/05/07 12:13:05 INFO mapred.JobClient:  map 0% reduce 0%
    ...

[Figure: the Graph of the Gods and its rows for vertices 0 through 11 on 127.0.0.2, 127.0.0.3, 127.0.0.4. Vertex 7 holds a traverser count of 1; all other vertices hold 0.]

Page 50: Faunus: Graph Analytics Engine

    gremlin> g.V.has('type','god').in('father').out('mother').name
    13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Compiled to 3 MapReduce job(s)
    13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Executing job 1 out of 3: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
    13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
    13/05/07 12:25:18 INFO input.FileInputFormat: Total input paths to process : 1
    13/05/07 12:25:18 INFO mapred.JobClient: Running job: job_201305071220_0007
    ...
    ==>alcmene
    gremlin>

[Figure: the Graph of the Gods and its rows for vertices 0 through 11 on 127.0.0.2, 127.0.0.3, 127.0.0.4. Vertex 8 holds a traverser count of 1; all other vertices hold 0.]

Page 51: Faunus: Graph Analytics Engine

TRAVERSAL DATA

[Figure: an adjacency-list row made up of a vertex (id 1, properties k1:v1, k2:v2) followed by its incoming edges and outgoing edges, some of which carry the property k1:v1.]

1. A long counter denoting how many traversers exist at the element (counter = cheap).

-OR-

2. A list of lists denoting path history of individual traversers at the element (enumerative = expensive).

* Each element in a row maintains traversal data as well.
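The cost difference between the two kinds of traversal data can be seen in a small sketch. This is plain Python under stated assumptions, not Faunus code: the toy edge map `OUT` and the functions `step_counters` and `step_paths` are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Toy outgoing edges, assumed for illustration: 0 -> 1, 0 -> 2, 1 -> 2.
OUT: Dict[int, List[int]] = {0: [1, 2], 1: [2], 2: []}


def step_counters(counters: Dict[int, int]) -> Dict[int, int]:
    """Counter mode: a vertex only remembers how many traversers arrived."""
    nxt: Dict[int, int] = defaultdict(int)
    for vertex, count in counters.items():
        for target in OUT[vertex]:
            nxt[target] += count
    return dict(nxt)


def step_paths(paths: Dict[int, List[List[int]]]) -> Dict[int, List[List[int]]]:
    """Path mode: a vertex remembers the full history of every traverser."""
    nxt: Dict[int, List[List[int]]] = defaultdict(list)
    for vertex, histories in paths.items():
        for target in OUT[vertex]:
            for history in histories:
                nxt[target].append(history + [target])
    return dict(nxt)


# Two out-steps starting from a single traverser at vertex 0.
counters_after = step_counters(step_counters({0: 1}))
paths_after = step_paths(step_paths({0: [[0]]}))
```

The counter version stores one number per element no matter how many traversers pass through, while the path version stores one list per traverser that grows with every step.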

Page 52: Faunus: Graph Analytics Engine

    gremlin> g.V.has('type','god').in('father').out('mother').path
    13/05/07 14:37:59 WARN mapreduce.FaunusCompiler: Path calculations are enabled for this Faunus job (space and time expensive)
    13/05/07 14:37:59 INFO mapreduce.FaunusCompiler: Compiled to 3 MapReduce job(s)
    13/05/07 14:37:59 INFO mapreduce.FaunusCompiler: Executing job 1 out of 3: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
    13/05/07 14:38:00 INFO mapred.JobClient: Running job: job_201305071220_0005
    ...
    ==>[v[1], v[7], v[8]]
    gremlin>

[Figure: the Graph of the Gods and its rows for vertices 0 through 11 on 127.0.0.2, 127.0.0.3, 127.0.0.4. The path [1,7,8] is recorded as traversal data.]

Page 53: Faunus: Graph Analytics Engine

GREMLIN

GRAPH TRAVERSAL LANGUAGE

TRANSFORM: $t : (V \cup E) \rightarrow \mathcal{P}(V \cup E)$

    transform{}, V, id, label, out, in, outE, inE, inV, map, order, ...

FILTER: $f : (V \cup E) \rightarrow (V \cup E \cup \emptyset)$

    filter{}, has, hasNot, [0..10], random, simplePath, back, ...

SIDE-EFFECT: $s : (V \cup E) \nrightarrow (V \cup E)$

    sideEffect{}, groupCount, groupBy, aggregate, table, store, linkIn, linkOut, count, ...

BRANCH

    loop, copySplit, fairMerge, exhaustMerge, ...

$f_1 \circ f_2 \circ f_3 \circ \cdots \circ f_4$

Gremlin is a functional graph language in which traversals are defined using function composition. A set of useful predefined functions is provided with the language, and generic lambdas/closures are possible for arbitrary mappings.
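The idea of building a traversal by composing transform, filter, and side-effect functions can be sketched outside of Gremlin. The following is an illustrative Python analogy, not Gremlin's implementation: `compose`, `transform`, `filter_step`, `side_effect`, and the toy edge map `OUT` are all hypothetical names.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

Step = Callable[[Iterable], Iterable]


def compose(*steps: Step) -> Step:
    """Chain steps left to right: compose(f, g)(xs) == g(f(xs))."""
    return lambda xs: reduce(lambda acc, step: step(acc), steps, xs)


def transform(fn: Callable) -> Step:
    """Map each element to zero or more elements (element -> set of elements)."""
    return lambda xs: (y for x in xs for y in fn(x))


def filter_step(pred: Callable) -> Step:
    """Keep an element or drop it (element -> element or nothing)."""
    return lambda xs: (x for x in xs if pred(x))


def side_effect(fn: Callable) -> Step:
    """Pass each element through unchanged while doing something on the side."""
    def run(xs):
        for x in xs:
            fn(x)
            yield x
    return run


# Toy outgoing edges, assumed for illustration.
OUT: Dict[int, List[int]] = {0: [1, 2, 3]}
seen: List[int] = []

pipeline = compose(
    transform(lambda v: OUT[v]),   # step to the out-neighbors
    filter_step(lambda v: v != 3), # drop vertex 3
    side_effect(seen.append),      # record what passes through
)
result = list(pipeline([0]))
```

Each step takes a stream of elements and returns a stream, which is why steps of all three kinds can be chained in any order.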

http://gremlin.tinkerpop.com

Page 54: Faunus: Graph Analytics Engine

EXAMPLE TRAVERSALS

"How many vertices are in the graph?"

    g.V.count()

"How many people attend each academy?"

    g.V.has('type','person').out('attends')
       .has('type','academy').name.groupCount

"How many 3-step acyclic paths exist in the graph?"

    g.V.out.out.out.simplePath.count()

"What is the in-degree distribution of the friendship subgraph?"

    g.V.sideEffect{it.degree = it.inE('friend').count()}
       .degree.groupCount

* The only memory structure is the graph, thus all data must be in the graph.

"Derive all implicit grandfather relations in the graph."

    g.V.as('x').out('father').out('father')
       .linkIn('grandfather','x')

* Mutates the graph.

Page 55: Faunus: Graph Analytics Engine

FAUNUS DATA FLOW

    g.V.out
       .out
       .count()

Job output under hdfs://user/ubuntu/:

    output/job-0/
    output/job-1/
    output/job-2/
      { graph*
        sideeffect*

MAP ONLY STEPS (NO REDUCE NEEDED)

    map:    <NullWritable, FaunusVertex> to <NullWritable, FaunusVertex>

MAP/REDUCE STEPS

    map:    <NullWritable, FaunusVertex> to <LongWritable, Holder<FaunusElement>>
    reduce: <LongWritable, Iterable<Holder<FaunusElement>>> to <NullWritable, FaunusVertex>

Each pair is written as <key, value>.

Page 56: Faunus: Graph Analytics Engine

GREMLIN IN MAP/REDUCE

FILTER

$f : (V \cup E) \rightarrow (V \cup E \cup \emptyset)$

    g.V.has('type','god')

    map(null, vertex, context) {
      key = context.getConf().get('provided.key')
      value = context.getConf().get('provided.value')
      if(!vertex.getProperty(key).equals(value)) {
        vertex.clearPaths();
      }
      context.write(vertex);
    }

[Figure: f(v) applied to a vertex with the provided key 'type' and value 'god'.]

* Most filters are map-only steps. If the predicate returns false, then all the path metadata is cleared from the element.
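The filter above never drops a vertex from the graph; it only clears the traversers sitting on it. A minimal Python sketch of that behavior follows, assuming a dictionary-based vertex with a `paths` counter. The function `filter_map` and the record layout are hypothetical and are not the Faunus API.

```python
from typing import Dict, Iterable, Iterator


def filter_map(vertices: Iterable[Dict], key: str, value: str) -> Iterator[Dict]:
    """Map-only filter: emit every vertex, but clear traversers on non-matching ones."""
    for vertex in vertices:
        if vertex["props"].get(key) != value:
            # Analogous to clearPaths(): the vertex stays, its traversers do not.
            vertex = {**vertex, "paths": 0}
        yield vertex


# Three vertices from the Graph of the Gods, each starting with one traverser.
vertices = [
    {"id": 0, "props": {"name": "saturn", "type": "titan"}, "paths": 1},
    {"id": 1, "props": {"name": "jupiter", "type": "god"}, "paths": 1},
    {"id": 7, "props": {"name": "hercules", "type": "demigod"}, "paths": 1},
]
filtered = list(filter_map(vertices, "type", "god"))
```

Keeping the non-matching vertices in the output matters because later steps still need their edges in order to route traversers through the graph.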

Page 57: Faunus: Graph Analytics Engine

GREMLIN IN MAP/REDUCE

TRANSFORM

$t : (V \cup E) \rightarrow \mathcal{P}(V \cup E)$

    g.V.out

    map(null, vertex, context) {
      for(e : vertex.getEdges(OUT)) {
        context.write(e.getVertex(IN).id, holder('p',vertex.pathsOnly()))
      }
      context.write(vertex.id, holder('v',vertex))
    }

    reduce(long, iterable<holder> holders, context) {
      vertex = new FaunusVertex(long)
      for(h : holders) {
        if(h.getTag() == 'v')
          vertex.addAll(h.getVertex())
        else
          vertex.addPaths(h.getVertex())
      }
      context.write(null, vertex)
    }

[Figure: path data shuffled between 127.0.0.2, 127.0.0.3, and 127.0.0.4.]

* Traversals implement a reduce-side join.
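The map and reduce functions above can be traced by hand with a small in-memory sketch. This is plain Python under stated assumptions rather than Hadoop or Faunus code: `out_map`, `out_reduce`, and `run_out_step` are hypothetical names, traversal data is a simple counter, and the sketch assumes that traversers leave their source vertex when they move.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def out_map(vertex: Dict) -> Iterator[Tuple[int, Tuple[str, object]]]:
    # Send this vertex's traversers to every out-neighbor, keyed by neighbor id.
    for target in vertex["out"]:
        yield target, ("p", vertex["paths"])
    # Send the vertex structure to its own id so the reducer can rebuild the row.
    yield vertex["id"], ("v", vertex)


def out_reduce(vertex_id: int, holders: List[Tuple[str, object]]) -> Dict:
    # Join the vertex structure with the traversers that arrived at it.
    row = {"id": vertex_id, "out": [], "paths": 0}
    for tag, payload in holders:
        if tag == "v":
            row["out"] = list(payload["out"])
        else:
            row["paths"] += payload
    return row


def run_out_step(vertices: List[Dict]) -> List[Dict]:
    # Stand-in for the shuffle phase: group holders by key.
    shuffled: Dict[int, List] = defaultdict(list)
    for vertex in vertices:
        for key, holder in out_map(vertex):
            shuffled[key].append(holder)
    return [out_reduce(key, holders) for key, holders in sorted(shuffled.items())]


# Toy graph, assumed for illustration: 0 -> 1, 0 -> 2, 1 -> 2.
graph = [
    {"id": 0, "out": [1, 2], "paths": 1},
    {"id": 1, "out": [2], "paths": 1},
    {"id": 2, "out": [], "paths": 1},
]
after = run_out_step(graph)
```

The join happens in the reducer because the two things that must meet, a vertex's own row and the traversers sent to it by its neighbors, are produced by different mappers and only share the vertex id as a key.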

Page 58: Faunus: Graph Analytics Engine

GREMLIN IN MAP/REDUCE

SIDE-EFFECT

$s : (V \cup E) \nrightarrow (V \cup E)$

    g.V.type.groupCount()

    map(null, vertex, context) {
      key = context.getConf().get('provided.key')
      context.write('graph',null,vertex)
      context.write('sideeffect', vertex.getProperty(key),vertex.getPathCount())
    }

    reduce(object, iterable<long> longs, context) {
      sum = 0
      for(l : longs) {
        sum += l
      }
      context.write('sideeffect',object,sum)
    }

[Figure: s(v) applied to a vertex with the provided key 'type'.]

* Leverages MultipleInputs/Outputs
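To see what the side-effect output of this step looks like, the same map and reduce can be run in memory over the types of the Graph of the Gods. This is a hedged Python sketch, not Faunus code: `group_count_map`, `group_count_reduce`, and the vertex records are hypothetical, and each vertex is assumed to hold exactly one traverser, as it would directly after `g.V`.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def group_count_map(vertex: Dict, key: str) -> Iterator[Tuple[str, int]]:
    # Side-effect output: (property value, number of traversers at the vertex).
    yield vertex["props"].get(key), vertex["paths"]


def group_count_reduce(value: str, counts: List[int]) -> Tuple[str, int]:
    # Sum the traverser counts that share the same property value.
    return value, sum(counts)


types = ["titan", "god", "god", "god", "location", "location", "location",
         "demigod", "human", "monster", "monster", "monster"]
vertices = [{"id": i, "props": {"type": t}, "paths": 1} for i, t in enumerate(types)]

# Stand-in for the shuffle phase: group map output by key.
shuffled: Dict[str, List[int]] = defaultdict(list)
for vertex in vertices:
    for value, count in group_count_map(vertex, "type"):
        shuffled[value].append(count)

side_effect = dict(group_count_reduce(v, c) for v, c in shuffled.items())
```

Summing traverser counts instead of counting vertices means the same reducer gives the right answer later in a traversal, when several traversers may sit on one vertex.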

Page 59: Faunus: Graph Analytics Engine

STRUCTURING GRAPHS

WITH FAUNUS

Page 60: Faunus: Graph Analytics Engine

INPUT/OUTPUT FORMATS

SequenceFileInputFormat / SequenceFileOutputFormat

A list of serialized vertex objects in a compressed binary format: <NullWritable,FaunusVertex>.

The intermediate data format between MapReduce jobs within a Faunus pipeline.

Fastest available format for both reading and writing.

Compressed using variable-width and prefix encodings.

    gremlin> g
    ==>faunusgraph[graphsoninputformat->graphsonoutputformat]
    gremlin> g.setGraphOutputFormat(SequenceFileOutputFormat)
    ==>null
    gremlin> g
    ==>faunusgraph[graphsoninputformat->sequencefileoutputformat]
    gremlin>

Page 61: Faunus: Graph Analytics Engine

INPUT/OUTPUT FORMATS

GraphSONOutputFormat

A verbose JSON-based text-format. Each vertex is a single JSON document.

Easy for developers to generate. Useful for testing and examples.

Limited to JSON supported datatypes for element property values.

{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}

{"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}

{"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}

...

GraphSONInputFormat

* JSON specification is available at http://json.org
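Because each vertex is one JSON document, a single line can be turned into an adjacency-list row with a standard JSON parser. The sketch below uses Python's `json` module on the saturn line from the slide. The function `parse_vertex` and the output layout are hypothetical, and treating every key that starts with an underscore as reserved is an assumption drawn from the sample, not from a format specification.

```python
import json
from typing import Dict

line = ('{"name":"saturn","type":"titan","_id":0,'
        '"_inE":[{"_label":"father","_id":12,"_outV":1}]}')


def parse_vertex(text: str) -> Dict:
    """Turn one GraphSON-style line into a vertex row with in and out edges."""
    doc = json.loads(text)
    # Assumption: keys beginning with "_" are structural, all others are properties.
    props = {k: v for k, v in doc.items() if not k.startswith("_")}
    return {
        "id": doc["_id"],
        "props": props,
        # Each edge becomes (edge id, label, id of the vertex at the other end).
        "in_edges": [(e["_id"], e["_label"], e["_outV"]) for e in doc.get("_inE", [])],
        "out_edges": [(e["_id"], e["_label"], e["_inV"]) for e in doc.get("_outE", [])],
    }


vertex = parse_vertex(line)
```

Note that an edge appears twice in the file, once in the `_outE` list of its tail vertex and once in the `_inE` list of its head vertex, which is what lets each line be processed on its own.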

Page 62: Faunus: Graph Analytics Engine

INPUT/OUTPUT FORMATS

RDFInputFormat

Maps popular RDF text formats to a property graph.

Configurations allow for different mappings of RDF to the property graph model.

Utilizes a MapReduce step to convert an edge-list into an adjacency list.

    faunus.graph.input.format=com.thinkaurelius.faunus.formats.edgelist.rdf.RDFInputFormat
    faunus.input.location=graph-example-1.ntriple
    faunus.graph.input.rdf.format=n-triples
    faunus.graph.input.rdf.as-properties=http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    faunus.graph.input.rdf.use-localname=true
    faunus.graph.input.rdf.literal-as-property=true

[Figure: the triple (ex:marko, foaf:age, "33"^^xsd:int) is mapped to vertex 0 with the properties uri:ex:marko and age:33.]

* RDF parsers provided by http://openrdf.org

Page 63: Faunus: Graph Analytics Engine

INPUT/OUTPUT FORMATS

RexsterInputFormat

Rexster is a graph server that is accessed via REST and a Gremlin binary protocol (RexPro).

Rexster supports any Blueprints-enabled graph database.

[Figure: two ways of talking to Rexster. Over HTTP, a request to http://.../vertices/1 returns:]

    { "results": { "_type":"vertex", "_id":1, "name":"tiberius", "age":29 }, "queryTime":0.123 }

[Over RexPro, a Gremlin query is sent and its result returned:]

    g.v(1).out('mother').out('mother').name
    ==>aurelia

http://rexster.tinkerpop.com

Page 64: Faunus: Graph Analytics Engine

INPUT/OUTPUT FORMATS

A Gremlin script stored in HDFS (distributed cache) allows for an arbitrary parse.

ScriptInputFormat

    0:1,2,3,4
    1:2,3
    2:0,3,5,6
    3:1,2
    ...

    def boolean read(FaunusVertex v, String line) {
      parts = line.split(':');
      v.reuse(Long.valueOf(parts[0]))
      parts[1].split(',').each {
        v.addEdge(OUT, 'linkedTo', Long.valueOf(it));
      }
      return true;
    }

ScriptOutputFormat

    def void write(FaunusVertex vertex, DataOutput output) {
      output.writeUTF(vertex.getId().toString() + ':');
      Iterator<Edge> itty = vertex.getEdges(OUT).iterator()
      while (itty.hasNext()) {
        output.writeUTF(itty.next().getVertex(IN).getId() + ',');
      }
      output.writeUTF('\n');
    }

    0:1,2,3,4
    1:2,3
    2:0,3,5,6
    3:1,2
    ...
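The line format handled by the two scripts (a vertex id, a colon, then a comma-separated list of out-neighbor ids) is simple enough to check by hand. The following Python sketch parses and re-emits such lines. It is not the Groovy script from the slide: `read_line` and `write_line` are hypothetical helpers, and unlike the slide's write script this version does not emit a trailing comma.

```python
from typing import List, Tuple


def read_line(line: str) -> Tuple[int, List[int]]:
    """Parse 'id:n1,n2,...' into (vertex id, list of out-neighbor ids)."""
    vertex_id, _, targets = line.strip().partition(":")
    return int(vertex_id), [int(t) for t in targets.split(",") if t]


def write_line(vertex_id: int, targets: List[int]) -> str:
    """Emit a vertex and its out-neighbors in the same 'id:n1,n2,...' form."""
    return f"{vertex_id}:{','.join(str(t) for t in targets)}"


lines = ["0:1,2,3,4", "1:2,3", "2:0,3,5,6", "3:1,2"]
rows = [read_line(line) for line in lines]
round_trip = [write_line(vertex_id, targets) for vertex_id, targets in rows]
```

Since every line names one vertex together with all of its outgoing edges, the file is already an adjacency list and each line can be parsed independently by a mapper.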

Page 65: Faunus: Graph Analytics Engine

Adam Jacobs. 2009. The Pathologies of Big Data. Communications of the ACM 52, 8 (August 2009), 36-44. doi:10.1145/1536616.1536632 http://doi.acm.org/10.1145/1536616.1536632

Page 66: Faunus: Graph Analytics Engine

GLOBAL VS. LOCAL GRAPH ANALYSIS

[Figure: the adjacency list for vertices 0 through 11 shown twice, once as a Serial Key/Value Data Structure and once as an Indexed Key/Indexed Value Data Structure.]

Page 67: Faunus: Graph Analytics Engine

TITAN

DISTRIBUTED GRAPH DATABASE

Application Servers Reading/Writing Graph Data

Titan Cluster Processing Gremlin Traversals and Writes

The biggest known Titan/Cassandra cluster to date:

A graph of ~120 billion edges is stored on a cluster of 16 hi1.4xlarge machines. Ego-centric graph traversals are requested by 80 m1.large machines. The cluster serves ~10,000 transactions per second with ~200 ms return times.

http://titan.thinkaurelius.com

http://thinkaurelius.com/2013/05/13/educating-the-planet-with-pearson/

Page 68: Faunus: Graph Analytics Engine

FAUNUS AND TITAN

SUPPORTED TITAN INPUT/OUTPUT FORMATS

TitanCassandraInputFormat

TitanCassandraOutputFormat

TitanHBaseInputFormat

TitanHBaseOutputFormat

Page 69: Faunus: Graph Analytics Engine

FAUNUS AND TITAN

Faunus/Hadoop, Titan/Cassandra

INTRA-CLUSTER CONFIGURATION

Data is processed on the machine where it is located. Limited network communication.

Page 70: Faunus: Graph Analytics Engine

FAUNUS AND TITAN

INTER-CLUSTER CONFIGURATION

Graph data is offloaded to another cluster. Repeated analysis does not interfere with the production graph database.

Page 71: Faunus: Graph Analytics Engine

FAUNUS AND TITAN

VERTEX-CENTRIC COMPUTING WITH GREMLIN

    Graph g
    long counter = 0

    def setup(args) {
      g = TitanFactory.open('cassandra:localhost')
    }

    def map(vertex, args) {
      g.v(vertex.id).as('x').out('father')
        .out('father').linkIn('grandfather','x')
      if(counter++ % 1000 == 0)
        g.commit()
    }

A Gremlin script is stored in HDFS (distributed cache).

Vertex long ids are pulled out of Titan (FaunusVertex with id only).

The Gremlin script is evaluated concurrently for every vertex long id.

Guaranteed co-location of Gremlin script JVM and Titan vertex.

* Provided by the Gremlin script()-step

Page 72: Faunus: Graph Analytics Engine

CREDITS

PRESENTED BY

MARKO A. RODRIGUEZ

SUPPORTED BY

LOS ALAMOS NATIONAL LABORATORY

LANL RESEARCH LIBRARY

VRIJE UNIVERSITEIT BRUSSEL

MANY THANKS TO

MATTHIAS BRÖCHELER

STEPHEN MALLETTE

PAVEL YASKEVICH

DAN LAROCQUE

AURELIUS COMMUNITY

TINKERPOP COMMUNITY

KETRINA YIM