Spark and MongoDB
Uploaded by: norberto-leite
Category: Software
Transcript of Spark and MongoDB
By now, you should have heard about MongoDB
Unless you've been living under a rock for the last few years!
Spark Stack
Apache Spark is the core engine, with four libraries on top:
• Spark SQL: seamless integration with SQL through the DataFrame API. Also supports Hive SQL.
• Spark Streaming: a fast API for processing data feeds. Designed for fault tolerance, it bridges streaming with batch processing.
• MLlib: Spark's bag of tricks for machine learning algorithms.
• GraphX: Spark's graph library.
Data Management
• Offline Processing: Analytics, Data Warehousing
• OLTP: Applications, Fine-grained operations
Delivering User Relevancy
• Integrate data from many sources
• Fast-cycle analytics
• Real-time
• Reliable
Workloads
• Chat App: Login, User Profile, Contacts, Messages, …
• Spark: Fraud Detection, Segmentation, Recommendations
• HDFS: Archiving, Data Crunching
• Access complete patient history
• Avoid conflicting prescriptions
• Clinical trials
Time Series
db.ticks.find()
{
  _id: 'MSFT_12',
  type: 'Open',
  date: ISODate("2015-07-12 10:00"),
  volume: 1699342,
  minutes: { "0": 12.9, "1": 14.4, ... "59": 15.8 }
}
Slide callouts: Resource, Type, When, Series
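The document above keeps one hour of per-minute values in a single "minutes" sub-document. As a rough illustration of that bucketing idea, here is a plain Scala sketch with no MongoDB driver involved. Tick, HourBucket and TickBuckets are hypothetical names, and the symbol_hour id format is an assumption; the slide does not say how 'MSFT_12' is built.

```scala
// Hypothetical types for illustration; they are not from the slides or any driver.
final case class Tick(symbol: String, hour: Int, minute: Int, price: Double)

final case class HourBucket(id: String, kind: String, minutes: Map[String, Double])

object TickBuckets {
  // Group ticks by (symbol, hour) and store each price under its minute key,
  // mirroring the "minutes" sub-document: one bucket per symbol per hour.
  def bucketize(ticks: Seq[Tick], kind: String): Seq[HourBucket] =
    ticks
      .groupBy(t => (t.symbol, t.hour))
      .toSeq
      .sortBy { case ((symbol, hour), _) => (symbol, hour) }
      .map { case ((symbol, hour), group) =>
        val minutes = group.map(t => t.minute.toString -> t.price).toMap
        // Assumed id scheme: symbol, underscore, hour.
        HourBucket(s"${symbol}_$hour", kind, minutes)
      }
}
```

With this layout, reading a full hour of prices for one symbol touches a single bucket instead of up to sixty separate tick records.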
MongoDB Storage Engines
• Workloads: Content Repo, IoT Sensor Backend, Ad Service, Customer Analytics, Archive
• MongoDB Query Language (MQL) + Native Drivers
• MongoDB Document Data Model
• Storage engines: MMAP V1, WT (WiredTiger), In-Memory, ?, ?
• Legend: Supported in MongoDB 3.0, Experimental, Future Possible Storage Engines
• Alongside all layers: Management, Security
Spark Streaming
Twitter Feed
{ "statuses": [ {
    "coordinates": null,
    "favorited": false,
    "truncated": false,
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "id_str": "250075927172759552",
    "entities": {
      "urls": [],
      "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ],
      "user_mentions": []
    }
} ] }
Spark turns each tweet from the feed into a per-minute hashtag count. One tweet yields:
{ "time": "Mon Sep 24 03:35", "freebandnames": 1}
Four tweets with the same hashtag in the same minute yield:
{ "time": "Mon Sep 24 03:35", "freebandnames": 4}
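The per-minute counting shown here can be sketched with plain Scala collections. This is not Spark Streaming code: Tweet and HashtagCounts are hypothetical names, and the tweet is reduced to the two fields the count needs. It also assumes the fixed-width created_at format from the sample, so that the first 16 characters identify the minute. In a Spark Streaming job the same shape would typically be a map to (key, 1) pairs followed by reduceByKey.

```scala
// Hypothetical, simplified tweet: only the fields the count needs.
final case class Tweet(createdAt: String, hashtags: Seq[String])

object HashtagCounts {
  // "Mon Sep 24 03:35:21 +0000 2012" -> "Mon Sep 24 03:35"
  // Assumes the fixed-width timestamp format shown in the sample tweet.
  def minuteOf(createdAt: String): String = createdAt.take(16)

  // Count how often each hashtag appears in each minute.
  def countPerMinute(tweets: Seq[Tweet]): Map[(String, String), Int] =
    tweets
      .flatMap(t => t.hashtags.map(tag => (minuteOf(t.createdAt), tag)))
      .groupBy(identity)
      .map { case (key, occurrences) => key -> occurrences.size }
}
```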
Capped Collection
Spark Streaming output, stored in a capped collection:
{ "time": "Mon Sep 24 03:35", "freebandnames": 4}
{ "time": "Mon Sep 24 03:40", "bigdataspain": 400}
{ "time": "Mon Sep 24 03:50", "bigdataspain": 7556}
{ "time": "Mon Sep 24 03:50", "itshappending": 100}
Tailable Cursor: reads new documents from the capped collection as they arrive
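To make the pairing of a capped collection with a tailable cursor more concrete, here is a toy in-memory model in plain Scala. It is not the MongoDB API: CappedLog and its methods are hypothetical and only mimic two behaviours, namely that a full capped collection discards its oldest documents, and that a tailing reader resumes from the last position it saw.

```scala
// Toy in-memory model (not the MongoDB API): a fixed-capacity, insertion-ordered log.
final class CappedLog[A](capacity: Int) {
  require(capacity > 0, "capacity must be positive")

  private var items = Vector.empty[(Long, A)] // (sequence number, document)
  private var nextSeq = 0L

  // Append a document; once the log is full, the oldest document is discarded.
  def insert(doc: A): Unit = {
    items = (items :+ (nextSeq -> doc)).takeRight(capacity)
    nextSeq += 1
  }

  // Return the documents inserted after `lastSeen`, plus the position to resume from.
  def readAfter(lastSeen: Long): (Seq[A], Long) = {
    val fresh = items.filter { case (seq, _) => seq > lastSeen }
    (fresh.map(_._2), fresh.lastOption.map(_._1).getOrElse(lastSeen))
  }
}
```

A consumer would call readAfter in a loop, passing back the position it received last time, which is roughly what tailing the collection amounts to.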
MongoDB Hadoop Connector
Positive:
• Battle tested
• Integrated with existing Hadoop components
• Supports Hive and Pig
Not so good:
• Not the fastest thing
• Not dedicated to Spark
• Dependent on HDFS
http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
Stratio Spark-MongoDB
https://github.com/Stratio/spark-mongodb
[Diagram labels: Spark, HDFS (x3), MongoDB Shard, Stratio Spark-MongoDB]
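// MongodbConfigBuilder, Host, Database and the other keys come from the Stratio connector;
// the import path depends on the connector version.
val mcInputBuilder = MongodbConfigBuilder(Map(
  Host -> List("localhost:27017"),
  Database -> "marketdata",
  Collection -> "minbars",
  SamplingRatio -> 1.0,
  WriteConcern -> MongodbWriteConcern.Normal))
val readConfig = mcInputBuilder.build()
Options called out on the slide: Database, Collection, SamplingRatio, WriteConcern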
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc is an existing SparkContext
val dfOneMin = sqlContext.fromMongoDB(readConfig)
val dfFiveMinForMonth = sqlContext.sql("""
  SELECT m.Symbol, m.OpenTime as Timestamp, m.Open, m.High, m.Low, m.Close
  FROM
  ...
  FROM minbars)
  as m
  WHERE unix_timestamp(m.CloseTime, 'yyyy-MM-dd HH:mm') - unix_timestamp(m.OpenTime, 'yyyy-MM-dd HH:mm') = 60*4""")
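The inner query is elided on the slide, so the following is not a reconstruction of it. It is a plain Scala sketch of the usual way one-minute bars are rolled up into five-minute bars: first open, highest high, lowest low, last close. Bar and BarRollup are hypothetical names, the fields follow the SELECT list, and representing time as minutes since midnight is an assumption made to keep the example small.

```scala
// Field names follow the SELECT list; time as minutes since midnight is an assumption.
final case class Bar(symbol: String, openMinute: Int, open: Double, high: Double, low: Double, close: Double)

object BarRollup {
  // Merge one-minute bars into five-minute bars, one per symbol and five-minute window.
  def toFiveMinute(bars: Seq[Bar]): Seq[Bar] =
    bars
      .groupBy(b => (b.symbol, b.openMinute / 5))
      .toSeq
      .sortBy { case (key, _) => key }
      .map { case ((symbol, window), group) =>
        val ordered = group.sortBy(_.openMinute)
        Bar(symbol, window * 5, ordered.head.open, ordered.map(_.high).max,
          ordered.map(_.low).min, ordered.last.close)
      }
}
```

The WHERE clause on the slide compares open and close times 240 seconds apart, which suggests it keeps only complete five-minute windows; this sketch does not apply that filter.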
Stratio Spark-MongoDB
https://github.com/Stratio/spark-mongodb
[Diagram labels: DC West (x3), Spark (x3), MongoDB Shard]
What to expect
• We are working on a dedicated Spark Connector for MongoDB
• Stratio Connector is great but:
  – Some operations are actually faster if performed using the Aggregation Framework
• Better integration with the upcoming 3.2 Async Java Driver
  – Especially for Apache Spark Streaming support
Join the Team
• Engineering
• Sales & Account Management
• Finance & People Operations
• Pre-Sales Engineering
• Marketing
View all jobs and apply: http://grnh.se/pj10su