MongoDB Days UK: MongoDB and Spark

Post on 16-Apr-2017



Spark in the Leaf

3

Ross Lawley, JVM software engineer on the drivers team

Twitter: @RossC0

4

Agenda

The data challenge
Spark
Use cases
Connectors
Demo

C 18,000 BCE

First recorded example of Humans saving data.

Tally sticks used to track trading activity and record inventory.

1663

First recorded statistical analysis of data.

John Graunt started the field of demographics in an attempt to predict the spread of the bubonic plague.

1928

First use of magnetic tape to store data.

Fritz Pfleumer's magnetic tape formed the basis of modern digital data storage.

1965

The start of Big Data?

The US Government plans the world's first data center to store 742 million tax returns and 175 million sets of fingerprints.

1970

The start of accessible data

Relational database model developed by Edgar F. Codd.

1991

The birth of the internet.

1997

Google

Michael Lesk estimates the digital universe is increasing tenfold in size every year.

2001

Big Data challenges defined

Doug Laney defined the Three “Vs” of Big Data

2005

Big Data taming by Elephants

Hadoop created!

2009

MongoDB released!

2010

Eric Schmidt

"Every two days now we create as much information as we did from the dawn of civilization up until 2003."

2014

Spark 1.0 released!

Big Data

Big Challenge

"Apache Spark is the Taylor Swift of big data software."

– Derrick Harris, Fortune

22

What is Spark?

Fast and general computing engine for clusters

• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• It's fundamentally different to what's come before

23

Why not just use Hadoop?

• Spark is FAST
– Faster to write.
– Faster to run.

• Up to 100x faster than Hadoop in memory
• 10x faster on disk

A visual comparison

(Diagram: Hadoop vs Spark job execution)

25

Spark Programming Model

Resilient Distributed Datasets

• An RDD is a collection of elements that is immutable, distributed and fault-tolerant.

• Transformations can be applied to an RDD, resulting in a new RDD.

• Actions can be applied to an RDD to obtain a value.

• RDDs are lazy: no work happens until an action is applied.
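The lazy/eager split is language-independent, so here is a minimal plain-Python sketch of it (generators stand in for transformations; this is a conceptual illustration, not the Spark API):

```python
# Transformations only describe work; generators evaluate nothing yet.
lines = ["Search\tuk\tmongodb", "Click\tuk\tads", "Search\tus\tspark"]

searches = (line for line in lines if "Search" in line)  # like rdd.filter(...)
terms = (line.split("\t")[2] for line in searches)       # like rdd.map(...)

# An action forces the whole pipeline to run.
result = list(terms)  # like rdd.collect()
print(result)         # ['mongodb', 'spark']
```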

26

RDD Operations

Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      save
sample             lookup(key)
union              take
join               foreach
groupByKey
reduceByKey

27

Example: Filtering text

val searches = spark.textFile("hdfs://...")
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB"))
  .count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB"))
  .collect()

(Diagram: the driver ships tasks to three workers; each worker reads one block of the file, caches its partition, and returns results to the driver)

28

Built in fault tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

val searches = spark.textFile("hdfs://...")
  .filter(_.contains("Search"))
  .map(_.split("\t")(2))
  .cache()

searches.filter(_.contains("MongoDB"))
  .count()

(Lineage: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → Count)
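The lineage idea can be sketched with a toy dependency chain in plain Python (purely illustrative; real RDDs are partitioned and far richer): each dataset records its parent and the function that produced it, so a lost result can be recomputed from the source.

```python
class ToyRDD:
    """A toy stand-in for an RDD: remembers its parent and transformation."""
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def compute(self):
        # Walk the lineage back to the source, then replay the transformations.
        if self.parent is None:
            return self.source
        return self.fn(self.parent.compute())

base = ToyRDD(source=["Search\tuk\tmongodb", "Click\tuk\tads", "Search\tus\tspark"])
searches = base.filter(lambda l: l.startswith("Search")).map(lambda l: l.split("\t")[2])

# Even if a computed result is lost, compute() rebuilds it from lineage.
print(searches.compute())  # ['mongodb', 'spark']
```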

29

Spark higher level libraries

(Stack diagram: Spark SQL, Spark Streaming, MLlib and GraphX built on the Spark core engine)

Spark + MongoDB

31

MongoDB and Spark

(Stack diagram: MongoDB integrated with Spark SQL, Spark Streaming, MLlib and GraphX on the Spark core)


Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection

35

Data Management

OLTP: applications, fine-grained operations

Offline processing: analytics, data warehousing

Fraud Detection

I'm so in love!
Me, too <3
Now send me your CC number
?
Ok, XXXX-123-zzz
$$$

Sharing Workloads

Chat App (MongoDB): login, user profile, contacts, messages, …

Spark + HDFS: archiving, data crunching, fraud detection, segmentation, recommendations

MongoDB + Spark Connectors

Choices, choices:

– Hadoop Connector
– Stratio Connector

MongoDB Hadoop Connector

(Diagram: Spark reading from HDFS via the MongoDB Hadoop Connector, backed by a MongoDB shard)

MongoDB Hadoop Connector

(Diagram: Spark running under YARN, reading from HDFS via the MongoDB Hadoop Connector, backed by a MongoDB shard)

44

MongoDB Hadoop Connector

Positive                                     Not So Good
Battle tested                                Not the fastest thing
Integrated with existing Hadoop components   Not dedicated to Spark
Supports Hive and Pig                        Dependent on HDFS

http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

45

Stratio Spark-MongoDB

http://spark-packages.org/?q=mongodb

(Diagram: Spark connected directly to a MongoDB shard via the Stratio Spark-MongoDB connector)

Stratio Spark-MongoDB

https://github.com/Stratio/spark-mongodb

47

                                                      MongoDB Hadoop Connector                 Stratio Spark-MongoDB Connector
Machine Learning                                      Yes                                      Yes
SQL                                                   No                                       Yes
DataFrames                                            No                                       Yes
Streaming                                             No                                       No
Python                                                Yes                                      Yes (Spark SQL syntax)
Use MongoDB secondary indexes to filter input data    Yes                                      Yes
Compatibility with MongoDB replica sets and sharding  Yes                                      Yes
HDFS support                                          Yes                                      Yes
Support for MongoDB BSON files                        Yes                                      Partial (write only)
Commercial support                                    Yes (with MongoDB Enterprise Advanced)   Yes (provided by Stratio)

Spark Streaming

49

Spark Streaming

(Diagram: a Twitter feed streaming into Spark)

50

Each tweet arrives from the feed as JSON:

{ "statuses": [ {
    "coordinates": null,
    "favorited": false,
    "truncated": false,
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "id_str": "250075927172759552",
    "entities": {
      "urls": [],
      "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ],
      "user_mentions": []
    }
} ] }

Spark rolls the hashtags up into per-minute counts:

{ "time": "Mon Sep 24 03:35", "freebandnames": 1 }

After three more tweets with the same hashtag arrive in the same window:

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
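The per-minute roll-up can be sketched in plain Python as a stand-in for the Spark Streaming job (field access follows the tweet JSON shown on the slide; `minute_counts` is an illustrative helper, not a connector API):

```python
from collections import Counter

def minute_counts(tweets):
    """Count hashtag occurrences per (minute, tag) across a batch of tweets."""
    counts = Counter()
    for tweet in tweets:
        for status in tweet["statuses"]:
            minute = status["created_at"][:16]  # "Mon Sep 24 03:35"
            for tag in status["entities"]["hashtags"]:
                counts[(minute, tag["text"])] += 1
    return counts

tweet = {"statuses": [{
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "entities": {"hashtags": [{"text": "freebandnames", "indices": [20, 34]}]},
}]}

# Four identical tweets in the window give a count of 4, as on the slide.
print(minute_counts([tweet] * 4))
```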

52

MongoDB and Spark Streaming future

Write results to a capped collection and follow them with a tailable cursor:

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
{ "time": "Mon Nov 5 09:40", "mongoDBLondon": 400 }
{ "time": "Mon Nov 5 11:50", "spark": 7556 }
{ "time": "Mon Nov 24 12:50", "itshappening": 100 }

Spark SQL

54

Demo

(Demo: Spark with the Stratio Spark-MongoDB connector)

55

Open High Low Close

Symbol, Timestamp, Day, Open, High, Low, Close, Volume
MSFT, 2009-08-24 09:30, 24, 24.41, 24.42, 24.31, 24.31, 683713

MongoDB + Spark performance

Document design matters

One document per minute:

db.ticks.find()
{
  _id: 'MSFT_12',
  type: 'Open',
  date: ISODate("2015-07-12 10:00"),
  volume: 12.9
}

Time series: one document per hour, with the minutes bucketed inside it:

db.ticks.find()
{
  _id: 'MSFT_12',
  type: 'Open',
  date: ISODate("2015-07-12 10:00"),
  volume: 1699342,
  minutes: {
    "0": 12.9,
    "1": 14.4,
    ...
    "59": 15.8
  }
}

With WiredTiger: very high speed.
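A rough plain-Python sketch of building the bucketed time-series document from per-minute ticks (the `bucket_ticks` helper and its parameters are illustrative, not a MongoDB or connector API):

```python
def bucket_ticks(symbol, hour, ticks):
    """Fold per-minute (minute, value) ticks into one hour-bucket document."""
    return {
        "_id": f"{symbol}_{hour}",
        "type": "Open",
        "minutes": {str(minute): value for minute, value in ticks},
    }

doc = bucket_ticks("MSFT", 12, [(0, 12.9), (1, 14.4), (59, 15.8)])
print(doc["_id"], doc["minutes"]["59"])  # MSFT_12 15.8
```

One read then fetches a whole hour of data, instead of up to 60 separate documents.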

61

Spark I/O Matters

val searches = spark.fromMongoDB(mongoDBConfig)
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB"))
  .count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB"))
  .collect()

(Diagram: the Spark driver distributes tasks to workers, which read their partitions from MongoDB)

62

Spark and MongoDB

• An extremely powerful combination

• Many possible use cases

• Some operations are actually faster if performed using MongoDB's Aggregation Framework

• Evolving all the time
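To illustrate the aggregation point, here is a hypothetical pipeline in Aggregation Framework syntax, paired with a pure-Python equivalent of its two stages (the collection, field names, and data are invented for this sketch): running it server-side means only the grouped results cross the wire to Spark.

```python
# Hypothetical pipeline: total volume per symbol for one day, computed
# inside MongoDB with standard $match and $group stages.
pipeline = [
    {"$match": {"day": "2015-07-12"}},
    {"$group": {"_id": "$symbol", "volume": {"$sum": "$volume"}}},
]

# Pure-Python equivalent of the two stages, for illustration:
ticks = [
    {"symbol": "MSFT", "day": "2015-07-12", "volume": 100},
    {"symbol": "MSFT", "day": "2015-07-12", "volume": 50},
    {"symbol": "AAPL", "day": "2015-07-11", "volume": 999},
]
matched = [t for t in ticks if t["day"] == "2015-07-12"]  # $match
totals = {}
for t in matched:                                         # $group with $sum
    totals[t["symbol"]] = totals.get(t["symbol"], 0) + t["volume"]
print(totals)  # {'MSFT': 150}
```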

Questions?

Ross Lawley, Senior Engineer
ross@mongodb.com
@RossC0

65

References

• Resources
– https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
– http://spark.apache.org/docs/latest/quick-start.html
– https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
– http://techanjs.org/
