MongoDB & Spark
@blimpyacht

Level Setting

Trough of Disillusionment

The Hadoop stack, built up layer by layer:

HDFS: Distributed Data
YARN: Distributed Resources
MapReduce: Distributed Processing
Hive & Pig: Domain-Specific Languages on top of MapReduce

Interactive Shell
Easy(-er)
Caching

The same stack, rebuilt around Spark:

HDFS: Distributed Data
YARN: Distributed Resources
Spark (alongside Hadoop MapReduce): Distributed Processing
Cluster managers: Stand Alone, YARN, Mesos
On top: Hive, Pig, Spark SQL, the Spark Shell, Spark Streaming

And finally, Spark without Hadoop at all: Stand Alone / YARN / Mesos underneath; Spark SQL, the Spark Shell, and Spark Streaming on top.

The Driver coordinates Worker Nodes; each Worker Node hosts an executor.

Resilient Distributed Datasets


Parallelization

Parallelize = x
t(x) = x′
t(x′) = x″
f(x″) = x‴

Transformations
filter( func )
union( otherRDD )
intersection( otherRDD )
distinct( n )
map( func )
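Transformations are lazy: each one records a new step in the pipeline without running anything. As a rough plain-Java analogy (java.util.stream intermediate operations, not the Spark API itself), the names and data below are illustrative only:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TransformationsSketch {
    // filter, map and distinct here play the role of RDD transformations:
    // they describe work but do not execute it until a terminal operation
    // (the "action") asks for a result.
    static List<Integer> squaresOfEvens(List<Integer> xs) {
        return xs.stream()
                 .filter(x -> x % 2 == 0)        // ~ rdd.filter( func )
                 .map(x -> x * x)                // ~ rdd.map( func )
                 .distinct()                     // ~ rdd.distinct( n )
                 .collect(Collectors.toList());  // the action that triggers work
    }
}
```

For example, `squaresOfEvens(Arrays.asList(1, 2, 2, 3, 4))` keeps the evens, squares them, and drops the duplicate, yielding `[4, 16]`.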


Actions
collect()
count()
first()
take( n )
reduce( func )
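An action is what forces the recorded pipeline to actually run. A small plain-Java sketch of that behavior (again an analogy using streams, not the Spark API; the counter is a hypothetical instrument added for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyActionsSketch {
    // Counts how many elements the map step has actually processed.
    static final AtomicInteger touched = new AtomicInteger();

    static int sumDoubled() {
        Stream<Integer> pipeline = Stream.of(1, 2, 3)
                .map(x -> { touched.incrementAndGet(); return x * 2; });
        // At this point touched is still 0: the map was only recorded.
        // The terminal reduce is the "action" that executes it: 2 + 4 + 6.
        return pipeline.reduce(0, Integer::sum);  // ~ rdd.reduce( func )
    }
}
```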


Lineage

Parallelize → Transform → Transform → Action
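Because an RDD remembers its lineage, the chain of transformations from the parallelized source, rather than the materialized data, a lost partition can be rebuilt by replaying that chain. A minimal plain-Java sketch of the idea (a Supplier standing in for lineage; not the Spark implementation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LineageSketch {
    // The "parallelized" source data.
    static final List<Integer> source = Arrays.asList(3, 1, 2);

    // The lineage: a recipe that can re-derive the result from the
    // source on demand, instead of a stored copy of the result.
    static final Supplier<List<Integer>> lineage = () ->
            source.stream()
                  .map(x -> x + 1)                  // transform
                  .filter(x -> x > 2)               // transform
                  .collect(Collectors.toList());    // action

    // "Recovering a lost partition" is just replaying the recipe.
    static List<Integer> recompute() { return lineage.get(); }
}
```

Replaying the same lineage always reproduces the same result, which is what makes recomputation a valid recovery strategy.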

https://github.com/mongodb/mongo-hadoop

{
  "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
  "body" : "the scrimmage is still up in the air.",
  "subFolder" : "notes_inbox",
  "mailbox" : "bass-e",
  "filename" : "450.",
  "headers" : {
    "X-cc" : "",
    "From" : "[email protected]",
    "Subject" : "Re: Plays and other information",
    "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "To" : "[email protected]",
    "X-Origin" : "Bass-E",
    "X-FileName" : "ebass.nsf",
    "X-From" : "Michael Simmons",
    "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
    "X-To" : "Eric Bass",
    "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}

{ "_id" : "[email protected]|[email protected]", "value" : 2 }
{ "_id" : "[email protected]|[email protected]", "value" : 2 }
{ "_id" : "[email protected]|[email protected]", "value" : 2 }
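These documents look like the output of a job that counts messages per sender|recipient pair. The aggregation itself can be sketched in a few lines; `countPairs` and its String[] message shape are hypothetical helpers for illustration, not code from the talk:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairCountSketch {
    // Each message is modeled as { from, to }; the key "from|to" mirrors
    // the "_id" format in the sample output above.
    static Map<String, Integer> countPairs(List<String[]> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] m : messages) {
            String key = m[0] + "|" + m[1];
            counts.merge(key, 1, Integer::sum);  // increment, starting at 1
        }
        return counts;
    }
}
```

In the real job this grouping and counting is distributed across the cluster rather than done in one in-memory map.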

Eratosthenes

Democritus

Hypatia

Shemp

Euripides

Spark Configuration

Configuration conf = new Configuration();
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat");
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection");

Spark Context

JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD(
    conf,
    MongoInputFormat.class,
    Object.class,
    BSONObject.class
);

Data Services: mongos routers

Deployment Artifacts

Fat Jar = Hadoop Connector Jar + Java Driver Jar

Spark Submit

/usr/local/spark-1.5.1/bin/spark-submit \
    --class com.mongodb.spark.examples.DataframeExample \
    --master local \
    Examples-1.0-SNAPSHOT.jar


JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo( (String) header.get("To") );
            m.setX_From( (String) header.get("From") );
            m.setMessage_ID( (String) header.get("Message-ID") );
            m.setBody( (String) tuple._2.get("body") );
            return m;
        }
    });
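The Message class that the map above populates is not shown on the slides; one plausible minimal shape, inferred from the setters used (and entirely hypothetical beyond that), would be:

```java
// Hypothetical POJO matching the setters used in the map function above.
public class Message {
    private String to;
    private String x_From;
    private String message_ID;
    private String body;

    public void setTo(String to)             { this.to = to; }
    public void setX_From(String from)       { this.x_From = from; }
    public void setMessage_ID(String id)     { this.message_ID = id; }
    public void setBody(String body)         { this.body = body; }

    public String getTo()         { return to; }
    public String getX_From()     { return x_From; }
    public String getMessage_ID() { return message_ID; }
    public String getBody()       { return body; }
}
```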

MongoDB & Spark

DEMO

THE FUTURE
AND BEYOND THE INFINITE

Stand Alone / YARN / Mesos; Spark; Spark SQL, the Spark Shell, Spark Streaming

Data Services: mongos routers

THANKS!

{ Name: 'Bryan Reinero',
  Title: 'Developer Advocate',
  Twitter: '@blimpyacht',
  Email: '[email protected]' }