Getting It Right Exactly Once: Principles for Streaming Architectures

Post on 16-Jan-2017



Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies

September 2016 | Strata+Hadoop World, NY

2

Getting Started

I’m Darryl Smith
• Chief Data Platform Architect and Distinguished Engineer, Dell Technologies

Agenda
• Real-Time And The Need For Streaming
• Adding Real-Time And Streaming To The Data Lake
• Results, Plans, Lessons Learned
• Demonstration

3

Trickle, Flood, or Torrent…

Streaming is about continuous data motion, more than speed or volume

4

The Conversation Around Streaming

Website and Mobile Application Logs

Internet of Things Sensors

5

The Enterprise Reality

Batch > Real-Time > Streaming

Enterprise Opportunities

Immediate Business Advantage

Website and Mobile Application Logs

Internet of Things Sensors

6

The Enterprise Streaming Play

Moving from batch to real-time streams avoids surges, normalizes compute, and drives value

7

Real time and the need for streaming

8

Analytics Vision

Drive DellEMC towards a Predictive Enterprise via intelligent data, driving agility and increasing revenue and productivity, resulting in a competitive advantage

9

Becoming An Analytical Enterprise

DRIVE COMPETITIVE ADVANTAGE
Need to use new data for competitive advantage
• Volume, Variety and Velocity

NEAR REAL-TIME ANALYTICS
Leverage near real time and streaming data sets to optimize predictions
• Make faster, better decisions

COST-EFFECTIVELY SCALE
Cost-effectively scale to improve query and load performance

DATA ACCESS BY BUSINESS
Put the data in the hands of the business

10

Scoping The Business Objectives

Problem Statement
Teams do not have access to maintenance renewal quotes in the timeframes or at the degree of quality they need for Tech Refresh and Renewal sales.

Desired Outcome
Implement a cost-effective, real-time solution that improves productivity and gives confidence to produce desired outcomes efficiently.

11

Business Drivers

CURRENT REALITY | VISION FOR THE FUTURE
HIGH TOUCH TACTICAL EXECUTION | LOW TOUCH SELF SERVICE
DATE DRIVEN PROCESSES | BUSINESS VALUE DRIVEN PROCESSES
INEFFICIENCIES & LOST PRODUCTIVITY | INCREASED PRODUCTIVITY
SILOED DATA / LIMITED VIEWS | SINGLE VIEW OF DATA / DATA SCORING
VARIABLE DATA QUALITY | DATA QUALITY & CONFIDENCE

TO REALIZE THIS VISION: IMPLEMENT CALM SOLUTION PHASES AND OPTIMIZE BUSINESS PROCESSES

12

The Need for “CALM”: Customer Asset Lifecycle Management

For enterprise sales
Who need accurate and timely customer information
CALM is a real-time application
Providing up-to-the-moment customer 360° dashboards

Install Base

Pricing

Device Config

Contacts

Contracts

Analytics Contracts

Component Data

Offers

Scorecard

13

Data Lake Architecture

[Architecture diagram. Labels recovered from the slide:]

Sources: Network, Web, Sensor, Supplier, Social Media, Market (STRUCTURED and UNSTRUCTURED); CRM, PLM, ERP

INGESTION: Real-Time, Micro-Batch, Batch

DATA PLATFORM
• PROCESS: GREENPLUM DB, SPRING XD, PIVOTAL HD
• EXECUTION: GemFire, Cassandra, PostgreSQL, MemSQL
• HADOOP: HDFS ON ISILON, HADOOP ON SCALEIO

DATA GOVERNANCE: Apache Ranger, Attivio, Collibra

ANALYTICS TOOLBOX

APPLICATIONS

VMWARE VCLOUD SUITE

VCE VBLOCK/VxRACK | XTREMIO | DATA DOMAIN

14

Business Data Lake Offerings

Data Ingestion
• Small to Big Data (high-throughput)
• Structured and unstructured Data from any Source
• Streams and Batches
• Secure, multi-tenant, configurable Framework

Real-Time Analytics
• Tap into streams for in-memory Analytics
• Real Time Data insights and decisions

Services
• Data Ingestion to Data Lake
• Data Lake APIs
• Data Alerting

Unstructured

Structured

15

Adding Real Time and Streaming to the Data Lake

16

Seeking A Fast Database

A complement to the business data lake


HammerDB Platform Benchmarks

HammerDB workload testing followed the standard practices of EMC’s Oracle and SQL Server DBA teams.

Definition of workload: a mix of 5 transactions, as follows:

• New order: receive a new order from a customer: 45%

• Payment: update the customer balance to record a payment: 43%

• Delivery: deliver orders asynchronously: 4%

• Order status: retrieve the status of customer’s most recent order: 4%

• Stock level: return the status of the warehouse’s inventory: 4%
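The five percentages above add up to 100, so the mix can be read as a weighted draw over transaction types. The following is a minimal sketch of that idea only; it is not HammerDB's own driver logic, and the names `TRANSACTION_MIX` and `pick_transactions` are hypothetical.

```python
import random
from collections import Counter

# Transaction mix from the slide, as percentages of all transactions.
TRANSACTION_MIX = {
    "new_order": 45,
    "payment": 43,
    "delivery": 4,
    "order_status": 4,
    "stock_level": 4,
}


def pick_transactions(count, seed=None):
    """Draw `count` transaction types according to the weighted mix."""
    rng = random.Random(seed)
    names = list(TRANSACTION_MIX)
    weights = list(TRANSACTION_MIX.values())
    return rng.choices(names, weights=weights, k=count)


if __name__ == "__main__":
    sample = pick_transactions(100_000, seed=1)
    # Observed shares should land close to the configured percentages.
    for name, seen in Counter(sample).most_common():
        print(f"{name:13s} {100 * seen / len(sample):5.1f}%")
```

With a large enough sample, the observed shares settle near 45/43/4/4/4, which is a quick sanity check that a workload generator honors the intended mix.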

Testing scenario:
• 100 warehouses, 8 vUsers for database creation and initial data loading.
• Timed testing: 20 minutes per testing session.
• Number of virtual users scaled from 1 to 44 across testing sessions.

No changes were made to the system or database configuration while the tests were running.

HammerDB Workload Testing

Each test ran on 16 vCPU x 32 GB RAM:
• RedHat 6.4, Oracle 11g R2
• Windows Core 2012 R2, SQL Server 2012 Ent Ed.
• RedHat 6.4, PostgreSQL 9.3.3

HammerDB Workload - Results

Results

PostgreSQL vs In-Memory DB

We picked the top 5 queries run by different business functions. Presented here are the 3 queries whose response times did not meet the SLA.

Query | PostgreSQL | MemSQL
Opportunity (5K) | 5 seconds | 200 ms
Sales Order (170K) | 1-1.5 minutes | 6 seconds
Territory (60K) | 60 seconds | 5 seconds
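To put the three response times above on one scale, the short sketch below converts them to seconds and computes the PostgreSQL-to-MemSQL ratio for each query. The figures come from the slide; the names `RESULTS`, `speedup_range`, and `format_speedups` are illustrative only.

```python
# Response times from the slide, in seconds. The Sales Order time on
# PostgreSQL is given as a range (1 to 1.5 minutes), so every PostgreSQL
# entry is stored as a (low, high) pair.
RESULTS = {
    "Opportunity": ((5.0, 5.0), 0.2),
    "Sales Order": ((60.0, 90.0), 6.0),
    "Territory": ((60.0, 60.0), 5.0),
}


def speedup_range(postgres_range, memsql_seconds):
    """Return (low, high) ratio of PostgreSQL time to MemSQL time."""
    low, high = postgres_range
    return low / memsql_seconds, high / memsql_seconds


def format_speedups():
    """Build one summary line per query."""
    lines = []
    for query, (postgres_range, memsql_seconds) in RESULTS.items():
        low, high = speedup_range(postgres_range, memsql_seconds)
        if round(low) == round(high):
            lines.append(f"{query}: {round(low)}x")
        else:
            lines.append(f"{query}: {round(low)}x to {round(high)}x")
    return lines


if __name__ == "__main__":
    print("\n".join(format_speedups()))
```

On these numbers the ratios work out to roughly 25x for Opportunity, 10x to 15x for Sales Order, and 12x for Territory.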

21

Business Data Lake – Ingestion to Fulfillment

[Diagram: data flow from ingestion to consumers. Labels recovered from the slide:]

INGEST MANAGER: SPRING XD, SPARK, SQOOP

HADOOP: RAW Data

GREENPLUM DATABASE

Data stages: Raw Data, Processed Data, Summary Data, Analytical Data

Predictive/Prescriptive Analytics

Execution Tier: CASSANDRA, GEMFIRE, MEMSQL, POSTGRESQL

Real-Time Tap

DATA GOVERNOR

Consumers

22

Here Are The Data Flows We Built

Low Velocity

Batch

Real-Time

23

Data Flow Patterns – Low Velocity

[Diagram. Labels recovered from the slide:]

Ingestion: Raw Data, One-Time

Analytical [BATCH]: GREENPLUM DATABASE, PIVOTAL HD

Presentation [SPEED/SERVING]: POSTGRESQL, MEMSQL, CASSANDRA, GEMFIRE

Data Service: JDBC

Application

24

Data Flow Patterns – Batch

[Diagram. Labels recovered from the slide:]

Ingestion: Batch

Analytical [BATCH]: GREENPLUM DATABASE, PIVOTAL HD

Presentation [SPEED/SERVING]: POSTGRESQL, MEMSQL, CASSANDRA, GEMFIRE

Data Service: JDBC

Application

25

Data Flow Patterns – Real Time

[Diagram. Labels recovered from the slide:]

Ingestion: Real-time, Initial Load

Analytical [BATCH]: GREENPLUM DATABASE, PIVOTAL HD

Presentation [SPEED/SERVING]: POSTGRESQL, MEMSQL, CASSANDRA, GEMFIRE

Data Service: JDBC

Application

26

Nothing Closer To Real Time Than Streaming

Let’s look at the leading edge: Apache Kafka Messaging Semantics
• At most once
• At least once
• Exactly once

27

At most once


28

At least once


29

Exactly Once


30

Understanding Streaming Semantics

At most once | At least once | Exactly once
Message pulled once | Message pulled one or more times; processed each time | Message pulled one or more times; processed once
May or may not be received | Receipt guaranteed | Receipt guaranteed
No duplicates | Likely duplicates | No duplicates
Possible missing data | No missing data | No missing data
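The three semantics compared above differ mainly in when the consumer records its position relative to when it processes a message. The sketch below is a toy in-memory simulation of that ordering, not Kafka client code; the `deliver` function, the `fail_at` parameter, and the single-crash-per-message model are all assumptions made for illustration.

```python
def deliver(messages, mode, fail_at=()):
    """Simulate a consumer that crashes once at each index in `fail_at`.

    Returns the list of messages that reached the downstream sink.
    """
    if mode not in ("at_most_once", "at_least_once", "exactly_once"):
        raise ValueError(f"unknown mode: {mode}")
    sink = []            # what processing wrote downstream
    committed = 0        # next offset to read after a restart
    already_crashed = set()
    while committed < len(messages):
        i = committed
        crash_now = i in fail_at and i not in already_crashed
        if crash_now:
            already_crashed.add(i)
        if mode == "at_most_once":
            committed = i + 1          # commit the offset first
            if crash_now:
                continue               # crash before processing: message lost
            sink.append(messages[i])
        elif mode == "at_least_once":
            sink.append(messages[i])   # process first
            if crash_now:
                continue               # crash before commit: message re-read
            committed = i + 1
        else:  # exactly_once
            if crash_now:
                continue               # write and offset roll back together
            sink.append(messages[i])   # write and offset commit as one unit
            committed = i + 1
    return sink


if __name__ == "__main__":
    msgs = ["01", "02", "03", "04"]
    for mode in ("at_most_once", "at_least_once", "exactly_once"):
        print(mode, deliver(msgs, mode, fail_at={1}))
```

With a crash injected at the second message, at-most-once drops "02", at-least-once delivers it twice, and exactly-once delivers each message a single time.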

31

Rendering In Real Time

Picking the right business intelligence layer
• Tableau
• Custom Application (CF, D3, Docker)
• Additional Third Party Solutions

32

Results, Plans, Lessons Learned

33

Business Benefits

DATA QUERYING
Down from 4 hours per quarter to less than 1 minute per year

SIMPLIFIED PROVISIONING
Reduced number of tables/reports required

DATA GOVERNANCE
Provides one version of the truth

TIME TO MARKET
Reduced number of tables/reports required

TOOL AGNOSTIC
Business logic in the DB, not the tool, provides increased flexibility

34

Use Case: Customer Account Profile

STREAMLINED ANALYTICS ENVIRONMENT TO GAIN A HOLISTIC CUSTOMER VIEW

Service Request

Contracts

Installed Base

Bookings

Billings

EMC DATA LAKE

BDL SERVICES

DATA WORKSPACES

DATA INGESTION

Prof Services

23 BUSINESS MANAGED WORKSPACES

35

Customer Asset Lifecycle Management Platform Roadmap

Phase 1: Foundational Capabilities/Discovery

Phase 2: Scale Platform / Automate

Future Phases: Global Standard Tool Integrations, Advanced Analytics

Timeline: Aug 2015, Oct 2015, 2016, TBD

BDL Platform

BAaaS/Tableau

In-Memory Capabilities (POC)

Scalable Platform

Integrated Platform

GBS Renewals

Inside Sales

Additional Business Groups

Enablement, Collaboration, Acceleration

We are here

36

Data Services Roadmap

Security
Planned integration into custom BDL security API for managing Role Based Access Control (RBAC) to the underlying data

Business Data Lake Plans

37

Lessons Learned – Key Takeaways

EDUCATE
• Educate the business
• Use examples of business impact

ASSESS
• Assess in-house big data skills
• Ensure plan to support the organization for 3-5 years

INFRASTRUCTURE
• Choose the best possible infrastructure
• Make sure your Big Data technology platform can evolve

JOURNEY
• Remember it is a journey
• Look for small wins as well as big wins.

38

Lessons Learned: Analytics and Data

Sourcing the right skills, working with a different philosophy, and some new tools will help you meet your analytical goals

TRANSFORM YOUR PEOPLE
• Data science in the organization, IT or both?
• Helping business units take initiative

CHANGE YOUR PROCESSES
• New philosophy to running analytics projects
• How and when to share data

ADAPT YOUR TECHNOLOGY
• Steadily refine toolsets based on needed analysis
• Identify to infrastructure layers

39

Demonstration

40

Demo Agenda

Showcase exactly-once semantics from Kafka

1: Data set of 200,000 transactions summing to zero

2: CREATE TABLE AND CREATE PIPELINE

3: Push to Kafka and confirm exactly-once

4: Validate Resiliency and confirm exactly-once

Step 1: Data Source

Start with a data set of 200,000 transactions representing money/goods that sum to zero

200,000 transactions
• Transaction number
• Increase / Decrease
• Amount
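A data set with the property described in Step 1 can be built by pairing every increase with a decrease of the same size. The sketch below is one hypothetical way to do that and is not the generator used in the demo; it assumes the amount is stored signed (so that a plain sum is zero) and uses whole units to avoid floating-point rounding.

```python
import random


def make_transactions(count=200_000, seed=0):
    """Return `count` rows of (transaction number, direction, signed amount).

    Assumption: each increase is paired with an equal decrease, so the
    signed amounts sum to exactly zero.
    """
    if count % 2:
        raise ValueError("count must be even so every increase has a matching decrease")
    rng = random.Random(seed)
    moves = []
    for _ in range(count // 2):
        amount = rng.randint(1, 10_000)      # whole units, no float rounding
        moves.append(("increase", amount))
        moves.append(("decrease", -amount))
    rng.shuffle(moves)                        # break up the adjacent pairs
    return [
        (number, direction, amount)
        for number, (direction, amount) in enumerate(moves, start=1)
    ]


if __name__ == "__main__":
    rows = make_transactions()
    print(len(rows), sum(amount for _, _, amount in rows))
```

Because the total is known in advance, any lost or duplicated row downstream is likely to show up as a non-zero sum or a wrong row count.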

Step 2: CREATE TABLE AND CREATE PIPELINE

Create a table and a pipeline in MemSQL that subscribes to a Kafka topic

CREATE TABLE

CREATE PIPELINE
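The slide names the two statements but the transcript does not include their text. The sketch below assembles plausible statements as strings so the shape of Step 2 is visible. The column names, pipeline name, and Kafka endpoint are hypothetical, and the `CREATE PIPELINE ... AS LOAD DATA KAFKA` form follows the general MemSQL Pipelines syntax, which should be checked against the documentation for the version in use.

```python
def build_demo_statements(table="transactions",
                          pipeline="transactions_pipeline",
                          kafka_endpoint="kafka-host:9092/transactions"):
    """Return SQL text for the table, the pipeline, and the start command.

    All identifiers and the endpoint are placeholders, not values from
    the original demo.
    """
    create_table = (
        f"CREATE TABLE {table} (\n"
        "    id BIGINT PRIMARY KEY,\n"
        "    direction VARCHAR(8) NOT NULL,\n"
        "    amount BIGINT NOT NULL\n"
        ");"
    )
    create_pipeline = (
        f"CREATE PIPELINE {pipeline} AS\n"
        f"LOAD DATA KAFKA '{kafka_endpoint}'\n"
        f"INTO TABLE {table}\n"
        "FIELDS TERMINATED BY ',';"
    )
    start_pipeline = f"START PIPELINE {pipeline};"
    return [create_table, create_pipeline, start_pipeline]


if __name__ == "__main__":
    print("\n\n".join(build_demo_statements()))
```

The printed statements could then be run from any SQL client connected to the cluster; the pipeline only begins consuming from the topic once it has been started.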

Step 3: Push to Kafka

Push that data set to Kafka
Validate exactly-once delivery by querying MemSQL
• show tables;
• show pipelines;
• select sum(amount) from transactions;
  Should be 0 in the demo
• select count(*) from transactions;
  Should be 200,000 in the demo
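The two validation queries in Step 3 complement each other: the count catches any lost or duplicated row, and the sum catches cases where a loss and a duplicate happen to cancel out in the count. The sketch below runs the same two checks against an in-memory SQLite table as a stand-in for MemSQL; the `load` and `validate` helpers and the four-row sample are assumptions for illustration.

```python
import sqlite3


def load(rows):
    """Create an in-memory table and insert (id, amount) rows as given."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def validate(conn, expected_count):
    """Run the sum and count checks from the demo."""
    total, count = conn.execute(
        "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM transactions"
    ).fetchone()
    return {
        "sum": total,
        "count": count,
        "ok": total == 0 and count == expected_count,
    }


if __name__ == "__main__":
    source = [(1, 50), (2, -50), (3, 20), (4, -20)]
    cases = {
        "exactly once": source,
        "duplicate row": source + [source[0]],
        "missing row": source[:-1],
    }
    for name, rows in cases.items():
        print(name, validate(load(rows), expected_count=len(source)))
```

Only the first case passes both checks; the duplicate shows up as an extra row and a non-zero sum, and the missing row as a short count and a non-zero sum.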

46

Step 4: Resiliency

Induce failures to show resiliency during exactly-once workflows
a. randomly_fail_batches.py
b. Restart Kafka and show error count
c. Continue and validate exactly-once semantics
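One common way to keep exactly-once behavior through failures like those in Step 4 is to commit each batch of rows and the consumer offset in the same database transaction, so a failed batch leaves neither behind. The sketch below simulates that with SQLite and randomly injected failures. It is not the demo's `randomly_fail_batches.py`, whose contents are not in the transcript, and the `offsets` table, `InjectedFailure`, and `load_with_failures` are hypothetical names.

```python
import random
import sqlite3


class InjectedFailure(Exception):
    """Raised to simulate a batch dying before it commits."""


def load_with_failures(rows, batch_size=100, failure_rate=0.3, seed=0):
    """Load (id, amount) rows in batches, failing some batches at random."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER, amount INTEGER)")
    conn.execute("CREATE TABLE offsets (next_offset INTEGER)")
    conn.execute("INSERT INTO offsets VALUES (0)")
    conn.commit()

    errors = 0
    while True:
        (start,) = conn.execute("SELECT next_offset FROM offsets").fetchone()
        if start >= len(rows):
            break
        batch = rows[start:start + batch_size]
        try:
            # One transaction: commits on success, rolls back on any error,
            # so the rows and the offset always move together.
            with conn:
                conn.executemany("INSERT INTO transactions VALUES (?, ?)", batch)
                if rng.random() < failure_rate:
                    raise InjectedFailure("batch failed before commit")
                conn.execute(
                    "UPDATE offsets SET next_offset = ?", (start + len(batch),)
                )
        except InjectedFailure:
            errors += 1   # the same batch is retried on the next loop

    total, count = conn.execute(
        "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM transactions"
    ).fetchone()
    conn.close()
    return {"errors": errors, "count": count, "sum": total}


if __name__ == "__main__":
    data = [(i, 10 if i % 2 == 0 else -10) for i in range(10_000)]
    print(load_with_failures(data))
```

However many batches fail, the final count matches the input and the sum stays at zero, because a failed batch rolls back its partial inserts along with the offset update.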

48

Errors

Total Transactions

Sum

The mission is clear:

We’re moving from batch to real-time with streaming

Thank You

Darryl Smith, Chief Data Platform Architect and Distinguished Engineer

Dell Technologies