Getting It Right Exactly Once: Principles for Streaming Architectures
Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
September 2016 | Strata+Hadoop World, NY
2
Getting Started
I’m Darryl Smith
• Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
Agenda
• Real-Time and the Need for Streaming
• Adding Real-Time and Streaming to the Data Lake
• Results, Plans, Lessons Learned
• Demonstration
3
Trickle, Flood, or Torrent…
Streaming is about continuous data motion,
more than speed or volume
4
The Conversation Around Streaming
Website and Mobile Application Logs
Internet of Things Sensors
5
The Enterprise Reality
Batch > Real-Time > Streaming
Enterprise Opportunities
Immediate Business Advantage
Website and Mobile Application Logs
Internet of Things Sensors
6
The Enterprise Streaming Play
Moving from batch to real-time streams avoids surges, normalizes compute,
and drives value
7
Real time and the need for streaming
8
Analytics Vision
Drive DellEMC towards a Predictive Enterprise via intelligent data, driving agility and increasing revenue and productivity, resulting in a competitive advantage
9
Becoming An Analytical Enterprise
DRIVE COMPETITIVE ADVANTAGE
Need to use new data for competitive advantage
• Volume, Variety and Velocity
NEAR REAL-TIME ANALYTICS
Leverage near real-time and streaming data sets to optimize predictions
• Make faster, better decisions
COST-EFFECTIVELY SCALE
Cost-effectively scale to improve query and load performance
DATA ACCESS BY BUSINESS
Put the data in the hands of the business
10
Scoping The Business Objectives
Problem Statement
Teams do not have access to maintenance renewal quotes within the timeframes or at the level of quality they need for Tech Refresh and Renewal sales.
Desired Outcome
Implement a cost-effective, real-time solution that improves productivity and gives teams the confidence to produce desired outcomes efficiently.
11
Business Drivers
CURRENT REALITY | VISION FOR THE FUTURE
HIGH TOUCH TACTICAL EXECUTION | LOW TOUCH SELF SERVICE
DATE DRIVEN PROCESSES | BUSINESS VALUE DRIVEN PROCESSES
INEFFICIENCIES & LOST PRODUCTIVITY | INCREASED PRODUCTIVITY
SILOED DATA / LIMITED VIEWS | SINGLE VIEW OF DATA / DATA SCORING
VARIABLE DATA QUALITY | DATA QUALITY & CONFIDENCE
TO REALIZE THIS VISION: IMPLEMENT CALM SOLUTION PHASES AND OPTIMIZE BUSINESS PROCESSES
12
The Need for “CALM”
Customer Asset Lifecycle Management
For enterprise sales
Who need accurate and timely customer information
CALM is a real-time application
Providing up-to-the-moment customer 360° dashboards
Install Base
Pricing
Device Config
Contacts
Contracts
Analytics Contracts
Component Data
Offers
Scorecard
13
Data Lake Architecture
[Architecture diagram. Labels recovered from the slide:]
• Sources: CRM, PLM, ERP, Network, Web, Sensor, Supplier, Social Media, Market (structured and unstructured)
• Ingestion: Real-Time, Micro-Batch, Batch
• Data Platform: Execution, Process, Hadoop
• Components: Greenplum DB, Spring XD, Pivotal HD, Gemfire, Cassandra, PostgreSQL, MemSQL
• HDFS on Isilon, Hadoop on ScaleIO
• Data Governance: Apache Ranger, Attivio, Collibra
• Analytics Toolbox
• Applications
• VMware vCloud Suite
• VCE Vblock/VxRack | XtremIO | Data Domain
14
Business Data Lake Offerings
Data Ingestion
• Small to Big Data (high-throughput)
• Structured and unstructured data from any source
• Streams and batches
• Secure, multi-tenant, configurable framework
Real-Time Analytics
• Tap into streams for in-memory analytics
• Real-time data insights and decisions
Services
• Data Ingestion to Data Lake
• Data Lake APIs
• Data Alerting
Unstructured
Structured
15
Adding Real Time and Streaming to the Data Lake
16
Seeking A Fast Database
A complement to the business data lake
HammerDB Platform Benchmarks
HammerDB workload testing followed the standard practices of EMC’s Oracle and SQL Server DBA teams.
Definition of workload: a mix of 5 transactions, as follows:
• New order: receive a new order from a customer: 45%
• Payment: update the customer balance to record a payment: 43%
• Delivery: deliver orders asynchronously: 4%
• Order status: retrieve the status of customer’s most recent order: 4%
• Stock level: return the status of the warehouse’s inventory: 4%
Testing scenario:
• 100 warehouses, 8 vUsers. Database creation and initial data loading.
• Timed testing. 20 minutes per testing session.
• Scaled number of virtual users for each testing session from 1 to 44.
No changes were made to the system or database configuration while running the tests.
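To make the transaction mix concrete, here is a minimal Python sketch that draws transaction types with the percentages listed on the slide. It is not HammerDB and does not reproduce its driver; the names WORKLOAD_MIX and sample_transactions are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter

# Transaction mix from the slide, as percent of all transactions.
WORKLOAD_MIX = {
    "new_order": 45,
    "payment": 43,
    "delivery": 4,
    "order_status": 4,
    "stock_level": 4,
}


def sample_transactions(n, seed=0):
    """Draw n transaction types at random, weighted by WORKLOAD_MIX."""
    rng = random.Random(seed)
    names = list(WORKLOAD_MIX)
    weights = [WORKLOAD_MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    # The five percentages should account for the whole workload.
    assert sum(WORKLOAD_MIX.values()) == 100
    counts = Counter(sample_transactions(100_000))
    for name in WORKLOAD_MIX:
        # 100,000 draws, so count / 1000 is the observed percentage.
        print(f"{name:13s} {counts[name] / 1000:.1f}%")
```

The observed shares should land close to 45/43/4/4/4, which shows how heavily the test is weighted towards writes (new orders and payments).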
HammerDB Workload Testing
Each test system was 16 vCPU x 32 GB RAM
• RedHat 6.4, Oracle 11g R2
• Windows Core 2012 R2, SQL Server 2012 Ent. Ed.
• RedHat 6.4, PostgreSQL 9.3.3
HammerDB Workload - Results
PostgreSQL vs In-Memory DB
We picked the top 5 queries run by different business functions. Presented here are the 3 queries whose response times did not meet the SLA.
Results
Query | PostgreSQL | MemSQL
Opportunity (5K) | 5 seconds | 200 ms
Sales Order (170K) | 1-1.5 minutes | 6 seconds
Territory (60K) | 60 seconds | 5 seconds
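The response times reported on this slide can be turned into speedup factors with simple division. The short Python sketch below does that arithmetic; the numbers are taken from the slide, while the names RESULTS_MS and speedup_range are hypothetical. Times are held in milliseconds so the division stays exact.

```python
# Response times from the slide, in milliseconds:
# (PostgreSQL low, PostgreSQL high, MemSQL).
RESULTS_MS = {
    "Opportunity (5K)": (5_000, 5_000, 200),
    "Sales Order (170K)": (60_000, 90_000, 6_000),
    "Territory (60K)": (60_000, 60_000, 5_000),
}


def speedup_range(pg_low, pg_high, memsql):
    """Return (lowest, highest) speedup of MemSQL over PostgreSQL."""
    return pg_low / memsql, pg_high / memsql


if __name__ == "__main__":
    for query, times in RESULTS_MS.items():
        low, high = speedup_range(*times)
        if low == high:
            print(f"{query}: about {low:.0f}x faster")
        else:
            print(f"{query}: about {low:.0f}x to {high:.0f}x faster")
```

On these figures the three queries come out at roughly 25x, 10x to 15x, and 12x respectively.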
21
Business Data Lake – Ingestion to Fulfillment
[Flow diagram. Labels recovered from the slide:]
• Ingest Manager: Spring XD, Spark, Sqoop
• Hadoop: Raw Data
• Greenplum Database
• Data stages: Raw Data, Processed Data, Analytical Data, Summary Data
• Execution Tier: Cassandra, Gemfire, MemSQL, PostgreSQL
• Real-Time Tap
• Predictive/Prescriptive Analytics
• Data Governor
• Consumers
22
Here Are The Data Flows We Built
Low Velocity
Batch
Real-Time
23
Data Flow Patterns – Low Velocity
[Flow diagram. Labels recovered from the slide:]
• Flow shown: One-Time, Raw Data
• Ingestion
• Analytical [BATCH]: Greenplum Database, Pivotal HD
• Presentation [SPEED/SERVING]: PostgreSQL, MemSQL, Cassandra, Gemfire
• Data Service, JDBC, Application
24
Data Flow Patterns – Batch
[Flow diagram. Labels recovered from the slide:]
• Flow shown: Batch
• Ingestion
• Analytical [BATCH]: Greenplum Database, Pivotal HD
• Presentation [SPEED/SERVING]: PostgreSQL, MemSQL, Cassandra, Gemfire
• Data Service, JDBC, Application
25
Data Flow Patterns – Real Time
[Flow diagram. Labels recovered from the slide:]
• Flows shown: Real-time, Initial Load
• Ingestion
• Analytical [BATCH]: Greenplum Database, Pivotal HD
• Presentation [SPEED/SERVING]: PostgreSQL, MemSQL, Cassandra, Gemfire
• Data Service, JDBC, Application
26
Nothing Closer To Real Time Than Streaming
Let’s look at the leading edge
Apache Kafka Messaging Semantics
• At most once
• At least once
• Exactly once
27
At most once
[Diagram: message sequence 01 02 03 04, with one delivery marked "?"]
28
At least once
[Diagram: message sequence 01 02 03 04, with one delivery marked "?"]
29
Exactly Once
[Diagram: message sequence 01 02 03 04, with message 01 shown a second time]
30
Understanding Streaming Semantics
At most once | At least once | Exactly once
Message pulled once | Message pulled one or more times; processed each time | Message pulled one or more times; processed once
May or may not be received | Receipt guaranteed | Receipt guaranteed
No duplicates | Likely duplicates | No duplicates
Possible missing data | No missing data | No missing data
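The three semantics compared on this slide can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not Kafka's or MemSQL's actual implementation: the consumer "crashes" exactly once while handling one message, and the only difference between the modes is the order in which the write to the sink and the offset commit happen. The function name consume and its parameters are hypothetical.

```python
def consume(messages, mode, crash_at):
    """Deliver messages to a sink, crashing once while handling index crash_at.

    Returns the list of messages the sink ends up holding.
    """
    sink = []        # what the downstream store contains
    committed = 0    # offset the consumer resumes from after a restart
    crashed = False
    while committed < len(messages):
        offset = committed
        msg = messages[offset]
        crash_now = offset == crash_at and not crashed
        if mode == "at_most_once":
            committed = offset + 1           # commit the offset first
            if crash_now:
                crashed = True
                continue                     # crash before the write: message lost
            sink.append(msg)
        elif mode == "at_least_once":
            sink.append(msg)                 # write first
            if crash_now:
                crashed = True
                continue                     # crash before the commit: re-read later
            committed = offset + 1
        elif mode == "exactly_once":
            staged = sink + [msg]            # write is staged, not yet visible
            if crash_now:
                crashed = True
                continue                     # staged write and offset are both discarded
            sink, committed = staged, offset + 1   # applied together, as one atomic step
        else:
            raise ValueError(f"unknown mode: {mode}")
    return sink


if __name__ == "__main__":
    stream = ["01", "02", "03", "04"]
    for mode in ("at_most_once", "at_least_once", "exactly_once"):
        print(mode, consume(stream, mode, crash_at=1))
```

With a crash while handling message 02, the at-most-once run drops 02, the at-least-once run stores 02 twice, and the exactly-once run stores each message a single time, matching the "missing data" and "duplicates" rows of the comparison.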
31
Rendering In Real Time
Picking the right business intelligence layer
• Tableau
• Custom Application (CF, D3, Docker)
• Additional Third Party Solutions
32
Results, Plans, Lessons Learned
33
Business Benefits
DATA QUERYING
Down from 4 hours per quarter to less than 1 minute per year
SIMPLIFIED PROVISIONING
Reduced number of tables/reports required
DATA GOVERNANCE
Provides one version of the truth
TIME TO MARKET
Reduced number of tables/reports required
TOOL AGNOSTIC
Business logic in the DB, not the tool, provides increased flexibility
34
Use Case: Customer Account Profile
Streamlined analytics environment to gain a holistic customer view
Service Request
Contracts
Installed Base
Bookings
Billings
EMC DATA LAKE
BDL SERVICES
DATA WORKSPACES
DATA INGESTION
Prof Services
23 BUSINESS MANAGED WORKSPACES
35
Customer Asset Lifecycle Management
Platform Roadmap
Phase 1: Foundational Capabilities/Discovery
Phase 2: Scale Platform / Automate
Future Phases: Global Standard Tool Integrations, Advanced Analytics
BAaaS/Tableau
Scalable Platform
Integrated Platform
GBS Renewals
Inside Sales
Additional Business Groups
Timeline: Aug 2015, Oct 2015, 2016, TBD
BDL Platform
Enablement, Collaboration, Acceleration
In-Memory Capabilities (POC)
We are here
36
Business Data Lake Plans
Data Services Roadmap
Security
Planned integration into custom BDL security API for managing Role Based Access Control (RBAC) to the underlying data
37
Lessons Learned – Key Takeaways
EDUCATE
• Educate the business
• Use examples of business impact
ASSESS
• Assess in-house big data skills
• Ensure plan to support the organization for 3-5 years
INFRASTRUCTURE
• Choose the best possible infrastructure
• Make sure your Big Data technology platform can evolve
JOURNEY
• Remember it is a journey
• Look for small wins as well as big wins
38
Lessons Learned: Analytics and Data
Sourcing the right skills, working with a different philosophy, and some new tools will help you meet your analytical goals
TRANSFORM YOUR PEOPLE
• Data science in the organization, IT or both?
• Helping business units take initiative
CHANGE YOUR PROCESSES
• New philosophy to running analytics projects
• How and when to share data
ADAPT YOUR TECHNOLOGY
• Steadily refine toolsets based on needed analysis
• Identify to infrastructure layers
39
Demonstration
40
Demo Agenda
Showcase exactly-once semantics from Kafka
1: Data set of 200,000 transactions summing to zero
2: CREATE TABLE AND CREATE PIPELINE
3: Push to Kafka and confirm exactly-once
4: Validate Resiliency and confirm exactly-once
Step 1: Data Source
Start with a data set of 200,000 transactions representing money/goods that sum to zero
200,000 transactions
• Transaction number
• Increase / Decrease
• Amount
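A data set with this property is straightforward to generate: pair every increase with a decrease of the same size, then shuffle. The Python sketch below shows one way to do it. It is not the demo's actual generator; the function name make_transactions, the use of signed integer amounts, and the value range are assumptions made for illustration.

```python
import random


def make_transactions(n=200_000, seed=42):
    """Return n (transaction_number, direction, amount) rows whose amounts sum to zero.

    Amounts are signed integers (assumed here to be cents): positive for an
    increase, negative for a decrease.
    """
    if n % 2:
        raise ValueError("n must be even so every increase has a matching decrease")
    rng = random.Random(seed)
    amounts = []
    for _ in range(n // 2):
        cents = rng.randint(1, 100_000)
        amounts.extend((cents, -cents))   # matched pair cancels out
    rng.shuffle(amounts)                  # so pairs do not arrive together
    return [
        (number, "increase" if amount > 0 else "decrease", amount)
        for number, amount in enumerate(amounts, start=1)
    ]


if __name__ == "__main__":
    rows = make_transactions()
    print(len(rows), sum(amount for _, _, amount in rows))
```

Because the total is known in advance, any lost or duplicated message downstream is likely to show up as a non-zero sum or a wrong row count.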
Step 2: CREATE TABLE AND CREATE PIPELINE
Create a table and pipeline in MemSQL that subscribes to that Kafka topic
CREATE TABLE
CREATE PIPELINE
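The slide names the two statements but does not show them. As a rough sketch, the Python helper below assembles statements of that general shape as strings. The table name transactions and the amount column come from the queries in Step 3; everything else (the other column names and types, the pipeline name, the broker address, the topic name, and the comma-delimited message format) is an assumption for illustration. The statement forms follow MemSQL Pipelines syntax as commonly documented and should be checked against the documentation for the version in use; the code only builds the strings and does not connect to a database.

```python
def pipeline_statements(broker="localhost:9092", topic="transactions"):
    """Return DDL for a table plus a Kafka pipeline that loads into it.

    Column names other than amount, the pipeline name, and the default
    broker and topic are hypothetical.
    """
    for value in (broker, topic):
        # Keep the endpoint literal well-formed: 'broker/topic'.
        if "'" in value or "/" in value:
            raise ValueError(f"unexpected character in {value!r}")
    create_table = (
        "CREATE TABLE transactions ("
        "txn_id BIGINT PRIMARY KEY, "
        "direction VARCHAR(8), "
        "amount BIGINT)"
    )
    create_pipeline = (
        "CREATE PIPELINE transactions_pipeline AS "
        f"LOAD DATA KAFKA '{broker}/{topic}' "
        "INTO TABLE transactions "
        "FIELDS TERMINATED BY ','"
    )
    start_pipeline = "START PIPELINE transactions_pipeline"
    return [create_table, create_pipeline, start_pipeline]


if __name__ == "__main__":
    for statement in pipeline_statements():
        print(statement + ";")
```

The printed statements could then be run from any MySQL-compatible client connected to MemSQL.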
Step 3: Push to Kafka
Push that data set to Kafka
Validate exactly-once delivery by querying MemSQL
• show tables;
• show pipelines;
• select sum(amount) from transactions;
  Should be 0 in the demo
• select count(*) from transactions;
  Should be 200,000 in the demo
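The two validation queries can be wrapped in a small check. The Python sketch below runs them against an in-memory SQLite table as a stand-in for MemSQL, so it can be tried without a cluster; the function names validate and demo_connection, the txn_id column, and the synthetic rows are assumptions for illustration.

```python
import sqlite3


def validate(conn, expected_rows=200_000):
    """Run the two checks from the slide and report whether both pass."""
    total = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions"
    ).fetchone()[0]
    count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
    return {
        "sum": total,
        "count": count,
        "passed": total == 0 and count == expected_rows,
    }


def demo_connection(n=200_000):
    """Build an in-memory stand-in table with n rows (n even) that sum to zero."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, amount INTEGER)"
    )
    # Rows alternate +1, -1, +2, -2, ... so each pair cancels out.
    rows = ((i, (i + 1) // 2 if i % 2 else -(i // 2)) for i in range(1, n + 1))
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    print(validate(demo_connection()))
```

A zero sum together with the expected row count is strong evidence rather than formal proof of exactly-once delivery: a lost row shifts the count, and a duplicated row shifts both the count and the sum.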
46
Step 4: Resiliency
Induce failures to show resiliency during exactly-once workflows
a. randomly_fail_batches.py
b. Restart Kafka and show error count
c. Continue and validate exactly-once semantics
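The idea behind this step can be sketched in Python: load data in batches, make some batches fail at random, and rely on a transaction that covers both the rows and the stored offset so that a failed batch leaves no trace and is retried cleanly. This is not the demo's randomly_fail_batches.py, whose contents are not shown in the slides, and it uses in-memory SQLite as a stand-in for MemSQL. The names InjectedFailure, load_with_failures, and the progress table are hypothetical.

```python
import random
import sqlite3


class InjectedFailure(Exception):
    """Stands in for a crash or broker restart in the middle of a batch."""


def load_with_failures(amounts, batch_size=1000, fail_rate=0.3, seed=7):
    """Load amounts in batches, failing some at random, and return final totals."""
    if not 0 <= fail_rate < 1:
        raise ValueError("fail_rate must be in [0, 1) so the load can finish")
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.execute("CREATE TABLE progress (id INTEGER PRIMARY KEY, next_offset INTEGER)")
    conn.execute("INSERT INTO progress VALUES (1, 0)")
    conn.commit()
    errors = 0
    while True:
        # Always resume from the offset stored alongside the data.
        start = conn.execute("SELECT next_offset FROM progress").fetchone()[0]
        if start >= len(amounts):
            break
        batch = amounts[start:start + batch_size]
        try:
            with conn:  # commits on success, rolls back if an exception escapes
                conn.executemany(
                    "INSERT INTO transactions VALUES (?, ?)",
                    [(start + i, amount) for i, amount in enumerate(batch)],
                )
                if rng.random() < fail_rate:
                    raise InjectedFailure  # rows inserted above are rolled back
                conn.execute(
                    "UPDATE progress SET next_offset = ?", (start + len(batch),)
                )
        except InjectedFailure:
            errors += 1  # retry the same batch on the next loop iteration
    total, count = conn.execute(
        "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM transactions"
    ).fetchone()
    conn.close()
    return {"errors": errors, "count": count, "sum": total}


if __name__ == "__main__":
    data = [sign * k for k in range(1, 100_001) for sign in (1, -1)]
    print(load_with_failures(data))
```

However many batches fail, the final row count should equal the input size and the sum should stay at zero, because the rows and the offset are committed in the same transaction.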
48
Errors
Total Transactions
Sum
The mission is clear:
We’re moving from batch to real-time
with streaming
Thank You
Darryl Smith
Chief Data Platform Architect and Distinguished Engineer
Dell Technologies