Modeling the Smart and Connected City of the Future with Kafka and Spark
Eric Frenkiel, CEO & Co-Founder, MemSQL
@ericfrenkiel
MAKE DATA WORK
December 1-3, 2015, Singapore
MemSQL at a Glance
Our Mission: Make every company a real-time enterprise.
Enterprise Focused
• Real-time database for transactions and analytics
• Founded in 2011, based in San Francisco
• Founders are former Facebook and SQL Server database engineers
• $50 million in funding to date
What does a Smart City Look Like?
Our Conception
Our Reality
3.9 billion people live in cities today
By 2050, we’ll add another 2.5 billion people
We need to create sustainable cities
We need to use technology to help us
We don’t live in Tomorrowland
We live here
The good news: the Technology of Today can build smart cities.
A Smart City Should Have…
• City-wide WiFi
• City App to report issues
• Open-Data Initiatives to share data with the public
• Most importantly, an adaptive IT department
Let’s learn how.
A Model Application: MemCity
Capturing data from 1.4 million households
Total AWS hardware costs at $2.35 per hour
MemCity Reach
1.4 million households (approximately the size of Chicago)
Capturing data from 8 devices in each home, every minute
186,667 transactions per second from Kafka > Spark > MemSQL
#MemCity
1.4 Million Households
8 Devices per Household
186K Events per Second
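The headline rate follows from the household and device counts. A quick check, assuming (as the slide implies) that each of the 8 devices in a home reports exactly once per minute; the function name is illustrative only:

```python
# Back-of-the-envelope check of the MemCity event rate.
HOUSEHOLDS = 1_400_000        # from the slide
DEVICES_PER_HOUSEHOLD = 8     # from the slide
REPORT_INTERVAL_SECONDS = 60  # "every minute"


def events_per_second(households, devices, interval_seconds):
    """Average events per second if every device reports once per interval."""
    return households * devices / interval_seconds


rate = events_per_second(HOUSEHOLDS, DEVICES_PER_HOUSEHOLD, REPORT_INTERVAL_SECONDS)
print(round(rate))  # 186667
```

This is an average rate; a real deployment would likely see bursts above it if devices report at the same moment.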
The “Real-Time Trinity”
Designing the Ideal Real-Time Pipeline
Message Queue > Transformation > Speed/Serving Layer
End-to-End Data Pipeline Under One Second
Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization
Spark
• In-memory execution engine
• High level operators for procedural and programmatic analytics
• Faster than MapReduce
MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications
Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010111100001110101100000010010010111…
Publish to Kafka Topic
0111001010101111101111100000001010111100001110101100000010010010111…
1110010101000101010001010100010111111010100011110101100011010101000…
0101111000011100101010111110001111011010111100000000101110101100000…
Event added to message queue
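The publish step above can be sketched as turning one reading into bytes and appending it to a queue. This is a minimal illustration under stated assumptions: the wire format (JSON) is an assumption, since the slides only show a bit string; the field names are borrowed from the table columns shown later in the deck; and the `deque` is a stand-in for a Kafka topic, not a Kafka client.

```python
import json
from collections import deque

# Stand-in for a Kafka topic: a plain in-memory queue, NOT a Kafka client.
topic = deque()


def serialize(event):
    """Encode a (time, house_id, device_id, watts) reading as UTF-8 JSON bytes."""
    time, house_id, device_id, watts = event
    record = {"time": time, "house_id": house_id,
              "device_id": device_id, "watts": watts}
    return json.dumps(record).encode("utf-8")


def publish(topic, event):
    """Append the serialized event to the topic (hypothetical helper)."""
    topic.append(serialize(event))


publish(topic, ("2015-07-06T16:43:40.33Z", 329280, 23, 60))
print(len(topic), type(topic[0]).__name__)  # 1 bytes
```

With a real Kafka producer, only `publish` would change; the serialization step stays the same.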
Enrich and Transform the Data
Spark polling Kafka for new messages
0111001010101111101111100000001010111100001110101100000010010010111…
Deserialization
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
Enrichment
(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
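The deserialization and enrichment steps on this slide can be sketched in plain Python. The lookup tables `HOUSE_ZIP` and `DEVICE_TYPE` are hypothetical (the slides do not show where the zip code and device type come from), and the JSON message format is likewise an assumption made for illustration.

```python
import json

# Hypothetical reference data; the slides do not show the real source
# of zip codes and device types.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(message):
    """Decode UTF-8 JSON bytes into a (time, house_id, device_id, watts) tuple."""
    r = json.loads(message.decode("utf-8"))
    return (r["time"], r["house_id"], r["device_id"], r["watts"])


def enrich(event):
    """Add zip and device_type, matching the enriched tuple on the slide."""
    time, house_id, device_id, watts = event
    return (time, house_id, HOUSE_ZIP[house_id],
            device_id, DEVICE_TYPE[device_id], watts)


message = json.dumps(
    {"time": "2015-07-06T16:43:40.33Z", "house_id": 329280,
     "device_id": 23, "watts": 60}
).encode("utf-8")
enriched = enrich(deserialize(message))
print(enriched)
```

In a Spark job, functions like these would typically be applied to every message in a batch with a map operation.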
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                     house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z  329280    94110  23         ‘kitchen_appliance’  60
…                        …         …      …          …                    …
Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
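The insert-then-query pattern can be tried out locally. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so it runs without a cluster; it is not MemSQL, and the column types, the second sample row, and the aggregate query are all assumptions made for illustration (the slides elide the actual statements).

```python
import sqlite3

# sqlite3 is a stand-in so the sketch runs anywhere; it is NOT MemSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table ("
    "time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)

rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    # The row below is made-up sample data for illustration only.
    ("2015-07-06T16:43:40.33Z", 329281, 94110, 7, "kitchen_appliance", 40),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)

# Hypothetical dashboard query: total watts per zip code and device type.
query = (
    "SELECT zip, device_type, SUM(watts) AS total_watts "
    "FROM memcity_table GROUP BY zip, device_type"
)
result = conn.execute(query).fetchall()
print(result)  # [(94110, 'kitchen_appliance', 100)]
```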
We can use In-Memory technology to build interactive applications for Cities.
So We Can Optimize…
• Urban planning
• Efficient power consumption
• Efficient transportation
• Sustainable energy practices
Creating Real-Time Pipelines should be push-button easy.
MemSQL Streamliner for IoT Applications
One click deployment of integrated Apache Spark
Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation
Eliminates batch ETL
Open source on GitHub
Simple Deployment Process
[Diagram labels: Application, Cluster]
1. Deploy MemSQL
In-Memory | Distributed | Relational
2. Deploy Spark
Kafka Connects to Each Node
Streamliner Architecture
First of many integrated Apache Spark solutions
[Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, STREAMLINER, Future Solution, Future Machine Learning Solution]
Streamliner ETL Detail
[Diagram labels: Custom, Future Extractor, JSON, Custom, Future Transformer, STREAMLINER, Extract, Transform, Load]
Streamliner
Extract
Transform
Load
Extending Analytics with Lambda Architecture
Real-Time Analytics Streaming
Analytic Applications
Not Excel Reports
Financial Services, Adtech, eCommerce, IoT, Consumer Internet, Energy, Federal
Lambda Architecture
New Real-Time Processing
Existing Batch Processing
Msg Queue
In-Memory Databases Rise Up
• Multi-TB on commodity hardware
• Store the “state of the model”
• Easily build applications
• Avoid direct disk access at all costs
Comprehensive Architecture
Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
Historical: Batch Layer, Fast Appends, Columnstore
Transactions
Analytics
Execution engine that spans the data spectrum
Simplified Lambda Architectures with MemSQL
Layer    Traditional Lambda  MemSQL Lambda
Batch    Hadoop              MemSQL Column Store
Speed    Storm, Spark        Kafka > Spark > MemSQL
Serving  Cassandra, HBase    MemSQL
Lambda Applies to Real-Time Data Pipelines
[Diagram labels: Message Queue, Batch, Inputs, Database, Transformation, Application]
Kafka, Spark, and MemSQL Make it Simple
[Diagram labels: Batch, Inputs, Application]
Massive Ingest and Concurrent Analytics
• Instant accuracy to the latest repin
• Build real-time analytic applications
• 1 GB/sec totaling 72 TB/day
Real-time analytics
Using Real-Time for Personalization
[Diagram labels: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Star Schema, MicroStrategy]
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times
300k events/sec
Reduced Latency from 30 minutes to Sub-Second
Real-time Analytics
Sample Pipeline: Analyzing Twitter Data in Real Time
[Diagram labels: Public API “Garden Hose”, Python, Extract, Transform, Load, Apache Spark, STREAMLINER, Application]
Install MemSQL and Apache Spark in < 1 min with MemSQL Ops and Streamliner
Run Kafka in Docker Container and Create a New Topic: TWITTER
Fill Out Extract, Transform and Load Details to Set Up Pipeline
Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing
Connect to MemSQL Database and Run SQL Queries Instantly
Run Online Alter Table to Optimize Query Performance
Streamliner: Dynamic Resource Management
Without Streamliner
• Pipeline 1: Driver (P1 only), Spark Worker Executors (P1 only)
• Pipeline 2: Driver (P2 only), Spark Worker Executors (P2 only)
With Streamliner
• All Pipelines: Streamliner Driver, Spark Worker Executors (P1 or P2)
Building Real-Time Data Pipelines and Predictive Applications
Adding Real-Time Scoring to Predictive Applications
[Diagram labels: Industrial Equipment, Sensor Data, S1, S2, S3, Streamliner Input, User Jar, SAS Generated PMML, P1, P2, P3]
Scoring Real-Time Data with Predictive Models
[Diagram labels: Sensor 1, Predictive Model 1]
GET YOUR FREE COPY: memsql.com/oreilly