MongoDB Days UK: MongoDB and Spark

Post on 16-Apr-2017



Spark in the Leaf

3

Ross Lawley, JVM software engineer on the drivers team

Twitter: @RossC0

4

Agenda

The data challenge
Spark
Use cases
Connectors
Demo

C 18,000 BCE

First recorded example of Humans saving data.

Tally sticks used to track trading activity and record inventory.

1663

First recorded statistical analysis of data.

John Graunt started the field of demographics in an attempt to predict the spread of the bubonic plague.

1928

First use of magnetic tape to store data.

Fritz Pfleumer's magnetic tape formed the basis of modern digital data storage.

1965

The start of Big Data?

The US Government plans the world's first data center to store 742 million tax returns and 175 million sets of fingerprints.

1970

The start of accessible data

Relational database model developed by Edgar F. Codd.

1991

The birth of the internet.

1997

Google

Michael Lesk estimates the digital universe is increasing tenfold in size every year.

2001

Big Data challenges defined

Doug Laney defined the Three “Vs” of Big Data

2005

Big Data taming by Elephants

Hadoop created!

2009

MongoDB released!

2010

Eric Schmidt

"Every two days now we create as much information as we did from the dawn of civilization up until 2003."

2014

Spark 1.0 released!

Big Data

Big Challenge

"Apache Spark is the Taylor Swift of big data software."

– Derrick Harris, Fortune

22

What is Spark?

Fast and general computing engine for clusters

• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• It's fundamentally different to what's come before

23

Why not just use Hadoop?

• Spark is FAST
– Faster to write.
– Faster to run.

• Up to 100x faster than Hadoop in memory
• 10x faster on disk

A visual comparison

(Diagram: Hadoop vs Spark job execution)

25

Spark Programming Model

Resilient Distributed Datasets

• An RDD is a collection of elements that is immutable, distributed and fault-tolerant.

• Transformations can be applied to an RDD, resulting in a new RDD.

• Actions can be applied to an RDD to obtain a value.

• RDDs are lazy: no work happens until an action is applied.
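The lazy/eager split is language-independent, so here is a minimal plain-Python sketch of it (generators stand in for transformations; this is a conceptual illustration, not the Spark API):

```python
# Transformations only describe work; generators evaluate nothing yet.
lines = ["Search\tuk\tmongodb", "Click\tuk\tads", "Search\tus\tspark"]

searches = (line for line in lines if "Search" in line)  # like rdd.filter(...)
terms = (line.split("\t")[2] for line in searches)       # like rdd.map(...)

# An action forces the whole pipeline to run.
result = list(terms)  # like rdd.collect()
print(result)         # ['mongodb', 'spark']
```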

26

RDD Operations

Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      save
sample             lookup(key)
union              take
join               foreach
groupByKey
reduceByKey

27

Example: Filtering text

val searches = spark.textFile("hdfs://...")
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB"))
  .count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB"))
  .collect()

(Diagram: the driver ships tasks to three workers; each worker reads one block of the file, caches its partition, and returns results to the driver)

28

Built in fault tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

val searches = spark.textFile("hdfs://...")
  .filter(_.contains("Search"))
  .map(_.split("\t")(2))
  .cache()

searches.filter(_.contains("MongoDB"))
  .count()

(Lineage: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → Count)
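The lineage idea can be sketched with a toy dependency chain in plain Python (purely illustrative; real RDDs are partitioned and far richer): each dataset records its parent and the function that produced it, so a lost result can be recomputed from the source.

```python
class ToyRDD:
    """A toy stand-in for an RDD: remembers its parent and transformation."""
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def compute(self):
        # Walk the lineage back to the source, then replay the transformations.
        if self.parent is None:
            return self.source
        return self.fn(self.parent.compute())

base = ToyRDD(source=["Search\tuk\tmongodb", "Click\tuk\tads", "Search\tus\tspark"])
searches = base.filter(lambda l: l.startswith("Search")).map(lambda l: l.split("\t")[2])

# Even if a computed result is lost, compute() rebuilds it from lineage.
print(searches.compute())  # ['mongodb', 'spark']
```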

29

Spark higher level libraries

(Stack diagram: Spark SQL, Spark Streaming, MLlib and GraphX built on the Spark core engine)

Spark + MongoDB

31

MongoDB and Spark

(Stack diagram: MongoDB integrated with Spark SQL, Spark Streaming, MLlib and GraphX on the Spark core)


Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection

35

Data Management

OLTP: applications, fine-grained operations

Offline processing: analytics, data warehousing

Fraud Detection

I'm so in love!
Me, too <3
Now send me your CC number
?
Ok, XXXX-123-zzz
$$$

Sharing Workloads

Chat App (MongoDB): login, user profile, contacts, messages, …

Spark + HDFS: archiving, data crunching, fraud detection, segmentation, recommendations

MongoDB + Spark Connectors

Choices, choices:

– Hadoop Connector
– Stratio Connector

MongoDB Hadoop Connector

(Diagram: Spark reading from HDFS via the MongoDB Hadoop Connector, backed by a MongoDB shard)

MongoDB Hadoop Connector

(Diagram: Spark running under YARN, reading from HDFS via the MongoDB Hadoop Connector, backed by a MongoDB shard)

44

MongoDB Hadoop Connector

Positive                                     Not So Good
Battle tested                                Not the fastest thing
Integrated with existing Hadoop components   Not dedicated to Spark
Supports Hive and Pig                        Dependent on HDFS

http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

45

Stratio Spark-MongoDB

http://spark-packages.org/?q=mongodb

(Diagram: Spark connected directly to a MongoDB shard via the Stratio Spark-MongoDB connector)

Stratio Spark-MongoDB

https://github.com/Stratio/spark-mongodb

47

                                                      MongoDB Hadoop Connector                 Stratio Spark-MongoDB Connector
Machine Learning                                      Yes                                      Yes
SQL                                                   No                                       Yes
DataFrames                                            No                                       Yes
Streaming                                             No                                       No
Python                                                Yes                                      Yes (Spark SQL syntax)
Use MongoDB secondary indexes to filter input data    Yes                                      Yes
Compatibility with MongoDB replica sets and sharding  Yes                                      Yes
HDFS support                                          Yes                                      Yes
Support for MongoDB BSON files                        Yes                                      Partial (write only)
Commercial support                                    Yes (with MongoDB Enterprise Advanced)   Yes (provided by Stratio)

Spark Streaming

49

Spark Streaming

(Diagram: a Twitter feed streaming into Spark)

50

Each tweet arrives from the feed as JSON:

{ "statuses": [ {
    "coordinates": null,
    "favorited": false,
    "truncated": false,
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "id_str": "250075927172759552",
    "entities": {
      "urls": [],
      "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ],
      "user_mentions": []
    }
} ] }

Spark rolls the hashtags up into per-minute counts:

{ "time": "Mon Sep 24 03:35", "freebandnames": 1 }

After three more tweets with the same hashtag arrive in the same window:

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
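The per-minute roll-up can be sketched in plain Python as a stand-in for the Spark Streaming job (field access follows the tweet JSON shown on the slide; `minute_counts` is an illustrative helper, not a connector API):

```python
from collections import Counter

def minute_counts(tweets):
    """Count hashtag occurrences per (minute, tag) across a batch of tweets."""
    counts = Counter()
    for tweet in tweets:
        for status in tweet["statuses"]:
            minute = status["created_at"][:16]  # "Mon Sep 24 03:35"
            for tag in status["entities"]["hashtags"]:
                counts[(minute, tag["text"])] += 1
    return counts

tweet = {"statuses": [{
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "entities": {"hashtags": [{"text": "freebandnames", "indices": [20, 34]}]},
}]}

# Four identical tweets in the window give a count of 4, as on the slide.
print(minute_counts([tweet] * 4))
```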

52

MongoDB and Spark Streaming future

Write results to a capped collection and follow them with a tailable cursor:

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
{ "time": "Mon Nov 5 09:40", "mongoDBLondon": 400 }
{ "time": "Mon Nov 5 11:50", "spark": 7556 }
{ "time": "Mon Nov 24 12:50", "itshappening": 100 }

Spark SQL

54

Demo

(Demo: Spark with the Stratio Spark-MongoDB connector)

55

Open High Low Close

Symbol, Timestamp, Day, Open, High, Low, Close, Volume
MSFT, 2009-08-24 09:30, 24, 24.41, 24.42, 24.31, 24.31, 683713

MongoDB + Spark performance

Document design matters

One document per minute:

db.ticks.find()
{
  _id: 'MSFT_12',
  type: 'Open',
  date: ISODate("2015-07-12 10:00"),
  volume: 12.9
}

Time series: one document per hour, with the minutes bucketed inside it:

db.ticks.find()
{
  _id: 'MSFT_12',
  type: 'Open',
  date: ISODate("2015-07-12 10:00"),
  volume: 1699342,
  minutes: {
    "0": 12.9,
    "1": 14.4,
    ...
    "59": 15.8
  }
}

With WiredTiger: very high speed.
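A rough plain-Python sketch of building the bucketed time-series document from per-minute ticks (the `bucket_ticks` helper and its parameters are illustrative, not a MongoDB or connector API):

```python
def bucket_ticks(symbol, hour, ticks):
    """Fold per-minute (minute, value) ticks into one hour-bucket document."""
    return {
        "_id": f"{symbol}_{hour}",
        "type": "Open",
        "minutes": {str(minute): value for minute, value in ticks},
    }

doc = bucket_ticks("MSFT", 12, [(0, 12.9), (1, 14.4), (59, 15.8)])
print(doc["_id"], doc["minutes"]["59"])  # MSFT_12 15.8
```

One read then fetches a whole hour of data, instead of up to 60 separate documents.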

61

Spark I/O Matters

val searches = spark.fromMongoDB(mongoDBConfig)
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB"))
  .count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB"))
  .collect()

(Diagram: the Spark driver distributes tasks to workers, which read their partitions from MongoDB)

62

Spark and MongoDB

• An extremely powerful combination

• Many possible use cases

• Some operations are actually faster if performed using MongoDB's Aggregation Framework

• Evolving all the time
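To illustrate the aggregation point, here is a hypothetical pipeline in Aggregation Framework syntax, paired with a pure-Python equivalent of its two stages (the collection, field names, and data are invented for this sketch): running it server-side means only the grouped results cross the wire to Spark.

```python
# Hypothetical pipeline: total volume per symbol for one day, computed
# inside MongoDB with standard $match and $group stages.
pipeline = [
    {"$match": {"day": "2015-07-12"}},
    {"$group": {"_id": "$symbol", "volume": {"$sum": "$volume"}}},
]

# Pure-Python equivalent of the two stages, for illustration:
ticks = [
    {"symbol": "MSFT", "day": "2015-07-12", "volume": 100},
    {"symbol": "MSFT", "day": "2015-07-12", "volume": 50},
    {"symbol": "AAPL", "day": "2015-07-11", "volume": 999},
]
matched = [t for t in ticks if t["day"] == "2015-07-12"]  # $match
totals = {}
for t in matched:                                         # $group with $sum
    totals[t["symbol"]] = totals.get(t["symbol"], 0) + t["volume"]
print(totals)  # {'MSFT': 150}
```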

Questions?

Ross Lawley, Senior Engineer
ross@mongodb.com
@RossC0

65

References

• Resources
– https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
– http://spark.apache.org/docs/latest/quick-start.html
– https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
– http://techanjs.org/
