MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector


Transcript of MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Page 1: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoDB Hadoop Connector

Luke Lovett
Maintainer, mongo-hadoop

https://github.com/mongodb/mongo-hadoop

Page 2: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Overview

• Hadoop Overview
• Why MongoDB and Hadoop
• Connector Overview
• Technical look into new features
• What’s on the horizon?
• Wrap-up

Page 3: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Hadoop Overview

• Distributed data processing
• Fulfills analytical requirements
• Jobs are infrequent, batch processes

Example workloads: Churn Analysis, Recommendation, Warehouse/ETL, Risk Modeling, Trade Surveillance, Predictive Analysis, Ad Targeting, Sentiment Analysis

Page 4: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoDB + Hadoop

• MongoDB backs the application
• Satisfies queries in real time
• MongoDB + Hadoop = application data analytics

Page 5: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Connector Overview

• Brings operational data into the analytical lifecycle
• Supports an evolving Hadoop ecosystem
  – Apache Spark has made a huge entrance
• Makes MongoDB interaction seamless and natural

Page 6: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Connector Examples

MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat

Pig

data = LOAD 'mongodb://myhost/db.collection'
       USING com.mongodb.hadoop.pig.MongoLoader();

Page 7: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Connector Examples

MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat

Hive

CREATE EXTERNAL TABLE mongo (
  title STRING,
  address STRUCT<from:STRING, to:STRING>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler';

Page 8: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Connector Examples

MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat

Spark (Python)

import pymongo_spark
pymongo_spark.activate()
rdd = sc.mongoRDD('mongodb://host/db.coll')

Page 9: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

New Features

• Hive predicate pushdown
• Pig projection
• Compression support for BSON
• PySpark support
• MongoSplitter improvements

Page 10: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

PySpark

• Python shell
• Submit jobs written in Python
• Problem: How do we provide a natural Python syntax for accessing the connector inside the JVM?
• What we want:
  – Support for PyMongo’s objects
  – A natural API for working with MongoDB inside Spark’s Python shell

Page 11: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

PySpark

We need to understand:
• How do the JVM and Python work together in Spark?
• What does data look like between these processes?
• How does the MongoDB Hadoop Connector fit into this?

We need to take a look inside PySpark.

Page 12: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

What’s Inside PySpark?

• Uses py4j to connect to the JVM running Spark
• Communicates objects to/from the JVM using Python’s pickle protocol
• org.apache.spark.api.python.Converter converts Writables to Java objects and vice-versa
• A special PythonRDD type encapsulates the JVM gateway and the Converters, Picklers, and Constructors needed for un-pickling (see the sketch below)
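A minimal sketch of what that gateway looks like from the Python side. sc._jvm and rdd._jrdd are internal PySpark attributes rather than public API, so treat this as illustration only:

# Illustration only: underscore-prefixed attributes are PySpark internals
# and may change between Spark versions.
from pyspark import SparkContext

sc = SparkContext(appName="py4j-peek")

# The py4j gateway lets the Python process call into JVM classes directly.
print(sc._jvm.java.lang.System.getProperty("java.version"))

# Every Python RDD wraps a Java RDD on the far side of the gateway;
# pickled Python functions are shipped across to operate on its data.
rdd = sc.parallelize([{"hello": "world"}])
print(rdd._jrdd.getClass().getName())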

Page 13: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

What’s Inside PySpark? – JVM Gateway

[Slide shows the Python and Java sides of the py4j gateway as code screenshots]

Page 14: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

What’s Inside PySpark? – PythonRDD

Python: keeps a reference to the SparkContext and the JVM gateway
Java: simply wraps a JavaRDD and does some conversions

Page 15: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

What’s Inside PySpark? – Pickler/Unpickler: What is a Pickle, anyway?

• Pickle – a Python object serialized into a byte stream; can be saved to a file
• The protocol defines a set of opcodes that are executed as in a stack machine
• Pickling turns a Python object into a stream of opcodes
• Unpickling performs those operations, yielding a Python object back

Page 16: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Example (pickle version 2)

>>> pickletools.dis(pickletools.optimize(pickle.dumps(doc)))
    0: (    MARK
    1: d        DICT       (MARK at 0)
    2: S    STRING     '_id'
    9: c    GLOBAL     'copy_reg _reconstructor'
   34: (    MARK
   35: c        GLOBAL     'bson.objectid ObjectId'
   59: c        GLOBAL     '__builtin__ object'
   79: N        NONE
   80: t        TUPLE      (MARK at 34)
   81: R    REDUCE
   82: S    STRING     'VK\xc7ln2\xab`\x8fS\x14\xea'
  113: b    BUILD
  114: s    SETITEM
  115: S    STRING     'hello'
  124: S    STRING     'world'
  133: s    SETITEM
  134: .    STOP

{'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}
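For reference, a self-contained version of the snippet above, assuming PyMongo is installed; the exact opcodes you see depend on the Python and pickle protocol versions (the slide shows Python 2 output):

import pickle
import pickletools

from bson.objectid import ObjectId  # PyMongo's ObjectId type

doc = {'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}

# Pickling turns the dict into a stream of opcodes...
data = pickletools.optimize(pickle.dumps(doc))
pickletools.dis(data)

# ...and unpickling replays those opcodes to rebuild an equal object.
assert pickle.loads(data) == doc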

Page 21: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

What’s Inside PySpark? – Pickle, implemented by the Pyrolite library

Pyrolite – Python Remote Objects "light" and Pickle for Java/.NET
https://github.com/irmen/Pyrolite

• Pyrolite library allows Spark to use Python’s Pickle protocol to serialize/deserialize Python objects across the gateway.

• Hooks available for handling custom types in each direction
  – registerCustomPickler – define how to turn a Java object into a Python pickle byte stream
  – registerConstructor – define how to construct a Java object for a given Python type

Page 22: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

What’s Inside PySpark? – BSONPickler: translates Java -> PyMongo

PyMongo – MongoDB Python driver
https://github.com/mongodb/mongo-python-driver

Special handling for:
• Binary
• BSONTimestamp
• Code
• DBRef
• ObjectId
• Regex
• Min/MaxKey
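All of these come from PyMongo's bson package. As an illustration (a sketch, not connector code), a document like the following is what the pickling layer has to translate faithfully in both directions:

import re

from bson.binary import Binary
from bson.code import Code
from bson.dbref import DBRef
from bson.max_key import MaxKey
from bson.min_key import MinKey
from bson.objectid import ObjectId
from bson.regex import Regex
from bson.timestamp import Timestamp

# A document exercising the PyMongo types that get special handling.
doc = {
    '_id': ObjectId(),
    'payload': Binary(b'\x00\x01\x02'),
    'seen_at': Timestamp(1421872408, 1),
    'mapper': Code('function () { return this._id; }'),
    'parent': DBRef('other_collection', ObjectId()),
    'pattern': Regex.from_native(re.compile('^mongo', re.IGNORECASE)),
    'low': MinKey(),
    'high': MaxKey(),
}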

Page 23: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

"PySpark" – Before Picture

>>> config = {'mongo.input.uri': 'mongodb://host/db.input',
...           'mongo.output.uri': 'mongodb://host/db.output'}
>>> rdd = sc.newAPIHadoopRDD(
...     'com.mongodb.hadoop.MongoInputFormat',
...     'org.apache.hadoop.io.Text',
...     'org.apache.hadoop.io.MapWritable',
...     None, None, config)
>>> rdd.first()
({u'timeSecond': 1421872408, u'timestamp': 1421872408, u'__class__': u'org.bson.types.ObjectId', u'machine': 374500293, u'time': 1421872408000, u'date': datetime.datetime(2015, 1, 21, 12, 33, 28), u'new': False, u'inc': -1652246148}, {u'Hello': u'World'})
>>> # do some processing with the RDD
>>> processed_rdd = ...
>>> processed_rdd.saveAsNewAPIHadoopFile(
...     'file:///unused',
...     'com.mongodb.hadoop.MongoOutputFormat',
...     None, None, None, None, config)

Page 24: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

PySpark – After Picture

>>> import pymongo_spark
>>> pymongo_spark.activate()
>>> rdd = sc.mongoRDD('mongodb://host/db.input')
>>> rdd.first()
{u'_id': ObjectId('562e64ea6e32ab169586f9cc'), u'Hello': u'World'}
>>> processed_rdd = ...
>>> processed_rdd.saveToMongoDB(
...     'mongodb://host/db.output')

Page 25: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter

• Splitting – cutting up data to distribute among worker nodes
• Hadoop InputSplits / Spark Partitions
• Very important to get splitting right for optimum performance
• Improvements in splitting for mongo-hadoop

Page 26: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter – Splitting Algorithms
• Split per shard chunk
• Split per shard
• Split using the splitVector command

[Diagram: connector, mongos, config servers, shard 0, shard 1]

Page 27: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter – Split per Shard Chunk

shards:
  { "_id" : "shard01", "host" : "shard01/llp:27018,llp:27019,llp:27020" }
  { "_id" : "shard02", "host" : "shard02/llp:27021,llp:27022,llp:27023" }
  { "_id" : "shard03", "host" : "shard03/llp:27024,llp:27025,llp:27026" }
databases:
  { "_id" : "customer", "partitioned" : true, "primary" : "shard01" }
    customer.emails
      shard key: { "headers.From" : 1 }
      chunks:
        shard01  21
        shard02  21
        shard03  20
      { "headers.From" : { "$minKey" : 1 } } -->> { "headers.From" : "[email protected]" } on : shard01 Timestamp(42, 1)
      { "headers.From" : "[email protected]" } -->> { "headers.From" : "[email protected]" } on : shard02 Timestamp(42, 1)
      { "headers.From" : "[email protected]" } -->> { "headers.From" : { "$maxKey" : 1 } } on : shard01 Timestamp(41, 1)

Page 28: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter – Splitting Algorithms
• Split per shard chunk
• Split per shard
• Split using the splitVector command

[Diagram: connector, mongos, config server, shard 0, shard 1]

Page 29: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter – Splitting Algorithms
• Split per shard chunk
• Split per shard
• Split using the splitVector command

Index: _id_1

{ "splitVector": "db.collection", "keyPattern": { "_id": 1 }, "maxChunkSize": 42 }

[Diagram: split points along the _id index at _id: 0, 25, 50, 75, 100]
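Roughly what the connector does under the hood. A hedged pymongo sketch: the host, namespace, and chunk size are placeholders, and depending on the server version the command may need to be issued against the admin database instead:

from bson.son import SON
from pymongo import MongoClient

client = MongoClient('mongodb://myhost:27017')

# Ask mongod for split points along the _id index so that each split
# covers roughly maxChunkSize megabytes. Field order matters, hence SON.
result = client['db'].command(SON([
    ('splitVector', 'db.collection'),
    ('keyPattern', {'_id': 1}),
    ('maxChunkSize', 42),
]))

# splitKeys is a list of boundary documents, e.g. {'_id': 25}; the
# connector turns consecutive boundaries into InputSplits/partitions.
print(result['splitKeys'])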

Page 30: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter

Problem: empty/unbalanced splits

Query:
{ "createdOn": { "$lte": ISODate("2015-10-26T23:51:05.787Z") } }

• Can use the index on "createdOn"
• splitVector can't split on a subset of the index
• Some splits might be empty

Page 31: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter

Problem: empty/unbalanced splits

Query:
{ "createdOn": { "$lte": ISODate("2015-10-26T23:51:05.787Z") } }

Solutions
• Create a new collection with a subset of the data
• Create an index over the relevant documents only
• Learn to live with empty splits

Page 32: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter

Alternatives – filtering out empty splits:

mongo.input.split.filter_empty=true

• Create a cursor for each split, check whether it is empty
• Empty splits are thrown out of the final list
• Saves the resources a task would spend processing an empty split (configuration sketch below)
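A sketch of turning that on from PySpark; the URI is a placeholder and a SparkContext sc is assumed, as in the earlier examples. The flag is just another mongo-hadoop property in the job configuration:

config = {
    'mongo.input.uri': 'mongodb://host/db.input',
    # Drop splits whose query matches no documents before tasks are scheduled.
    'mongo.input.split.filter_empty': 'true',
}
rdd = sc.newAPIHadoopRDD(
    'com.mongodb.hadoop.MongoInputFormat',
    'org.apache.hadoop.io.Text',
    'org.apache.hadoop.io.MapWritable',
    None, None, config)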

Page 33: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter

Problem: empty/unbalanced splits

Query:
{ "published": true }

• No index on "published" means splits are more likely to be unbalanced
• The query selects documents scattered throughout the index used for splitting

Page 34: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter

Solution: MongoPaginatingSplitter

mongo.splitter.class=com.mongodb.hadoop.splitter.MongoPaginatingSplitter

• One-time collection scan, but splits have efficient queries
• No empty splits
• Splits of equal size (except for the last) – see the configuration sketch below
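A sketch of selecting the paginating splitter from PySpark; the URI and query are placeholders and a SparkContext sc is assumed, as in the earlier examples:

config = {
    'mongo.input.uri': 'mongodb://host/db.input',
    # The input query still filters documents; the paginating splitter
    # builds equal-sized splits over only the matching documents.
    'mongo.input.query': '{"published": true}',
    'mongo.splitter.class':
        'com.mongodb.hadoop.splitter.MongoPaginatingSplitter',
}
rdd = sc.newAPIHadoopRDD(
    'com.mongodb.hadoop.MongoInputFormat',
    'org.apache.hadoop.io.Text',
    'org.apache.hadoop.io.MapWritable',
    None, None, config)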

Page 35: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoSplitter

• Choose the right splitting algorithm
• More efficient splitting with an input query

Page 36: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Future Work – Data Locality

• Processing happens where the data lives
• Hadoop
  – namenode (NN) knows the locations of blocks
  – InputFormat can specify split locations
  – jobtracker collaborates with the NN to schedule tasks to take advantage of data locality
• Spark
  – RDD.getPreferredLocations

Page 37: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Future Work – Data Locality

https://jira.mongodb.org/browse/HADOOP-202

Idea:
• Data node/executor on the same machine as the shard
• Connector assigns work based on local chunks

Page 38: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Future Work – Data Locality

• Set up Spark executors or Hadoop data nodes on machines with shards running

• Mark each InputSplit or Partition with the shard host that contains it

Page 39: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Wrapping Up

• Investigating Python in Spark
• Understanding splitting algorithms
• Data locality with MongoDB

Page 40: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Thank You!

Questions?

GitHub: https://github.com/mongodb/mongo-hadoop

Issue Tracker: https://jira.mongodb.org/browse/HADOOP

Page 41: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

#MDBDays
mongodb.com

Get your technical questions answered

In the foyer, 10:00 – 5:00
By appointment only – register in person

Page 42: MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

Tell me how I did today on Guidebook and enter for a chance to win one of these

How to do it:
• Download the Guidebook App
• Search for MongoDB Silicon Valley
• Submit session feedback