SEIZE THE DATA. 2015

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

High performance, low latency streaming

Streaming data into Vertica from Kafka

Tom Wall and Jason Blais

August 2015

This is a rolling (up to 3 year) roadmap and is subject to change without notice

Forward-looking statements

This document contains forward-looking statements regarding future operations, product development, product capabilities and availability dates. This information is subject to substantial uncertainties and is subject to change at any time without prior notification. Statements contained in this document concerning these matters only reflect Hewlett-Packard's predictions and/or expectations as of the date of this document and actual results and future plans of Hewlett-Packard may differ significantly as a result of, among other things, changes in product strategy resulting from technological, internal corporate, market and other changes. This is not a commitment to deliver any material, code or functionality and should not be relied upon in making purchasing decisions.

HP confidential information

This Roadmap contains HP Confidential Information.

If you have a valid Confidential Disclosure Agreement with HP, disclosure of the Roadmap is subject to that CDA. If not, it is subject to the following terms: for a period of 3 years after the date of disclosure, you may use the Roadmap solely for the purpose of evaluating purchase decisions from HP and use a reasonable standard of care to prevent disclosures. You will not disclose the contents of the Roadmap to any third party unless it becomes publicly known, rightfully received by you from a third party without duty of confidentiality, or disclosed with HP’s prior written approval.

Yesterday, scalable bulk data loading was hard. Today, the demand to do it in real time makes it even harder. Doing it properly requires a lot of Vertica expertise and engineering effort. By leveraging the strengths of Apache Kafka, Vertica Excavator makes scalable, real-time loading simpler than ever.

The challenges of data loading

Loading data needs to be easier!

• Data continuously generated

• Unpredictable bursts

• Many sources

• Many processing systems

• Complex data formats

Loading complexity: clients

[Diagram: Source → JDBC INSERT VALUES (?) / COPY LOCAL → Vertica]

• Traditional ETL via clients

• Easy to set up

• Lots of tools available

• Reasonable latency

• Single stream; poor throughput

• Scales until you hit session concurrency limits

Loading complexity: parallel COPY

[Diagram: Source → Files (staging area) → vsql COPY → Vertica]

• Write files to staging area accessible by Vertica

• Programs & scripts to COPY files in parallel

• Lots of custom code

• Very good throughput

• Scales with number of files & nodes

• Staging adds latency

Loading complexity: comparison

Kafka can provide the best of both worlds

Method    Scale limits    Latency   Throughput   Complexity   Coupling   Ecosystem
Clients   # statements    Low       Low          Low          Tight      Good
COPY      # nodes         High      High         High         Tight      Poor
Kafka     # nodes         Low       High         Low          Loose      Good

http://kafka.apache.org/

Kafka to the rescue!

• A fast, scalable, fault-tolerant, publish-subscribe messaging system

• A standard data backbone for all of an enterprise’s systems & applications

• Generic enough to suit most use cases:

  • Fraud detection

  • Website activity tracking

  • Monitoring data metrics

  • Log aggregation

  • Stream processing

Kafka high-level architecture

[Diagram: Sensor Data, Cloud Data, Customer Data → Broker(s) holding Topics → Consumers]

Publish/subscribe message bus

Kafka decouples data sources & sinks

[Diagram: Apps, Logs, Social Media → Kafka → Analytics, Archive]

Kafka ecosystem

The developer community is large and growing!

Connect your applications

• CLI, HTTP

• Java, C++, .NET

• Clojure, Scala, Erlang

• Python, Ruby, Node.js

• Many more!

Connect your data systems

• OLTP Databases

• Hadoop

• Storm

• Spark

• Many more!

Integration with Vertica

• Vertica schedules loads to continuously consume from Kafka

• JSON, Avro data formats

• CLI for easy setup

• In-database monitoring

• Vertica can also produce: export query results to Kafka to close the processing loop

[Diagram: CLI, Load Scheduler, Kafka brokers, and Vertica nodes each running the Kafka Plugin, connected by Load and Export arrows]

Fits in your existing ecosystem

• No need to dedicate a Kafka cluster to Vertica; it connects to your existing Kafka

• Producers & consumers don’t need to know they are interacting with Vertica!

• No client drivers

• No complicated ETL

• No staging files

• Seamless, scalable interaction with producers & consumers for other systems

If you have Kafka and Vertica, you’re ready to go!

[Diagram: Kafka connected to Raw Data, Apps, Stream Engines, Vertica, and Hadoop]

Vertica’s solution

Technical details

Listening to you, our customers

Requirements

• Analytic reports are wanted more quickly

• Moving from files to stream-based data delivery

• Avro and JSON formats are a priority

• Support multiple topics

• Support topics with different priorities

• Don’t make us become experts in Vertica internals

Things we worry about too

Engineering requirements

• Low latency

• Parallelism

• Data loss

• Limited resources

• Scalability

• Monitoring / usability

Things we worry about too

Engineering requirements

Low latency

• Time to take action on data is short

• With lots of topics, the time spent polling each individual topic has to be limited

• Optimize topic loading for hot topics

Things we worry about too

Engineering requirements

Parallelism

Since each topic targets a single table, we need to be able to handle and limit the number of COPY statements running at the same time.

Things we worry about too

Engineering requirements

Data loss

With so many streams and components, what happens if there is a failure?

• The probability of data loss is limited by loading in atomic operations that we call microbatches.

Things we worry about too

Engineering requirements

Limited resources

• The Vertica cluster has limited resources

• Loading needs to share them with user queries

• Resource pools can be dedicated to the task

• Planned concurrency is important for parallelism

Things we worry about too

Engineering requirements

Scalability

• More data means more resources

• Adding Vertica nodes increases scalability

• Adding Kafka brokers increases scalability

Things we worry about too

Engineering requirements

Monitoring / usability

There is lots of metadata being gathered. It would be nice to be able to track it and report on it to ensure the system is working as planned.

Support ‘exactly once’ message loads

Preventing data loss

[Diagram: a microbatch (µB) takes Kafka data at offset X, inserts the data into Vertica, stores the new offsets in Vertica, and commits]
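
The microbatch idea on this slide can be sketched in a few lines. This is a minimal illustration, not Vertica's implementation: it uses Python's built-in sqlite3 as a stand-in database, and the table names (events, offsets), the function names, and the list used as a fake Kafka partition are all assumptions made for the example. The point it shows is that the loaded rows and the new offset are written in one transaction, so a failure before commit leaves neither behind and a retry resumes from the old offset without duplicating messages.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the target table and the offsets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (kafka_offset INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute(
        "CREATE TABLE offsets (topic TEXT, part INTEGER, next_offset INTEGER, "
        "PRIMARY KEY (topic, part))"
    )
    conn.commit()
    return conn


def next_offset(conn, topic, part):
    """Return the offset the next microbatch should start from (0 if none stored)."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?", (topic, part)
    ).fetchone()
    return row[0] if row else 0


def load_microbatch(conn, topic, part, source, batch_size, fail_before_commit=False):
    """Load one microbatch: data rows and the new offset commit together or not at all.

    `source` is a plain list standing in for one Kafka partition (hypothetical).
    """
    start = next_offset(conn, topic, part)
    batch = source[start:start + batch_size]
    if not batch:
        return 0
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO events (kafka_offset, payload) VALUES (?, ?)",
            [(start + i, msg) for i, msg in enumerate(batch)],
        )
        conn.execute(
            "INSERT OR REPLACE INTO offsets (topic, part, next_offset) VALUES (?, ?, ?)",
            (topic, part, start + len(batch)),
        )
        if fail_before_commit:
            raise RuntimeError("simulated failure before commit")
    return len(batch)
```

Calling load_microbatch with fail_before_commit=True raises, and a following normal call picks up from the same stored offset, which is the 'exactly once' behavior the slide describes.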

SQL Statements

Microbatch statements

COPY public.from_kafka
  SOURCE KafkaSource(
    stream='topic,0,-2|topic,0,-2',
    brokers='broker:port',
    duration=interval '10000 milliseconds',
    stop_on_eof=true,
    executionparallelism=1)
  PARSER KafkaJSONParser()
  REJECTED DATA AS TABLE public.rejections
  DIRECT NO COMMIT;  -- NO COMMIT leaves the transaction open so the offsets are stored before the commit

INSERT INTO kafkacfg.kafka_offsets(…,* from
  (SELECT KafkaOffsets() OVER ()) as b

Simple Example

Implementation

Example scheduling yields: [Diagram: frame divided into equal slots 1 2 3 4 5]

• 5 topics

• Concurrency of 1

• Frame split into 5 equal parts.

Hot topics become starved
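
The starvation effect can be seen with a small calculation. The sketch below is an illustration only: the function name, the 10,000 ms frame, and the per-topic "time needed" figures are hypothetical inputs, not measurements from the talk. It splits a frame into equal slots, one per topic, and reports how much time each topic used, how much work it still has, and how much of the frame sat idle.

```python
def static_schedule(frame_ms, needed_ms):
    """Split one frame into equal slots, one per topic.

    needed_ms maps topic -> time (ms) the topic would need to drain its backlog;
    these numbers are hypothetical inputs for illustration.
    Returns (time used per topic, unmet need per topic, idle time in the frame).
    """
    slot = frame_ms // len(needed_ms)
    used = {t: min(slot, need) for t, need in needed_ms.items()}
    unmet = {t: needed_ms[t] - used[t] for t in needed_ms}
    idle = frame_ms - sum(used.values())
    return used, unmet, idle


# Five topics, concurrency of 1; topic 2 is the hot one.
needed = {1: 500, 2: 9000, 3: 500, 4: 500, 5: 500}
used, unmet, idle = static_schedule(10_000, needed)
print(used[2], unmet[2], idle)  # prints: 2000 7000 6000
```

With these made-up numbers the hot topic is capped at its 2,000 ms slot and falls 7,000 ms behind, while 6,000 ms of the frame goes unused because the quiet topics finish early.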

Dynamic prioritization

A better way to implement

• In the simple example there is a lot of time wasted when microbatches finish early

• It becomes hard for ‘busy’ topics to ever catch up

• Can we implement something more intelligent?

Yes!

• Dynamic prioritizing of hot topics

• Flexible microbatch durations

Dynamic prioritization example

A better way to do it

[Diagram: successive rows showing microbatches 1; 1 2; 1 2 3; 1 2 3 4; 1 2 3 4 5 sharing the frame]

Where possible, the remaining duration is split evenly

Dynamic prioritization example

A better way to do it

Assuming the times are the same as before, the order now shifts.

[Diagram: schedule timelines for microbatches 1 to 5 in the shifted order]

µB2 gets lots of time now!
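
One way to read the dynamic approach is sketched below. It is an assumption-laden illustration rather than the scheduler's actual algorithm: the function name and the input numbers are hypothetical, and ordering topics from quietest to busiest is a guess at what "the order now shifts" means (a real scheduler would presumably estimate this from earlier frames rather than know it up front). Each microbatch is given an even share of whatever time is left in the frame, so time freed by early finishers flows to the topics that run later.

```python
def dynamic_schedule(frame_ms, needed_ms):
    """Run quiet topics first and split the remaining frame time evenly.

    needed_ms maps topic -> time (ms) needed to drain its backlog (hypothetical).
    Returns (run order, time used per topic, idle time left in the frame).
    """
    order = sorted(needed_ms, key=needed_ms.get)  # quietest first (assumption)
    remaining = frame_ms
    used = {}
    for i, topic in enumerate(order):
        budget = remaining // (len(order) - i)  # even share of what is left
        used[topic] = min(budget, needed_ms[topic])
        remaining -= used[topic]
    return order, used, remaining


needed = {1: 500, 2: 9000, 3: 500, 4: 500, 5: 500}
order, used, idle = dynamic_schedule(10_000, needed)
print(order, used[2], idle)  # prints: [1, 3, 4, 5, 2] 8000 0
```

With these made-up numbers, an equal five-way split of the 10,000 ms frame would cap topic 2 at 2,000 ms; here it runs last and receives 8,000 ms, and no frame time is left idle.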

Supporting multiple topics

• There may be tens or hundreds of topics

• Some are higher priority than others

• The way to handle this is to run separate scheduler instances, each with:

  • A dedicated resource pool with specific tuning

  • Its own schedule configuration (e.g. max poll duration)

Process monitoring

Monitoring

kafka_events: history of events logged by the scheduler

kafka_scheduler: current configuration for the scheduler (resource pool, duration, etc.)

kafka_offsets: long-term history of messages read / duration by node/topic/partition/offsets

kafka_tables: target tables by Kafka topic

k_scheduler_history: history of changes to the scheduler

Simple command line configuration

Simplicity

Example terminal command line:

[node001]$: /opt/vertica/packages/kafka/scripts/vkconfig.sh scheduler --add

[node001]$: /opt/vertica/packages/kafka/scripts/vkconfig.sh topic \
    --add --target public.kafka_tgt --rejection_table public.kafka_rej \
    --topic mud --npartitions 4

[node001]$: /opt/vertica/packages/kafka/scripts/vkconfig.sh launch --start

Terabytes per hour

Benchmarks

System specifications

Hardware: 6 servers, 24 cores each, 252 GB RAM

OS: CentOS 6.6

3-node Vertica cluster

1 Kafka broker

1 producer

1 ZooKeeper instance

Benchmarks

Data set size: 1.5M 5kb messages

System load: ~30% CPU / 14G RAM

Throughput (TB/hour):

1 partition: 1.2

3 partitions: 2.2
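
To put the chart in per-message terms, the short calculation below converts the two throughput figures into messages per second and into the time needed to load the stated data set. It rests on assumptions the slide does not spell out: "5kb" is read as 5,000 bytes per message, a terabyte is taken as 10^12 bytes, and every message is assumed to be exactly that size. The function names are made up for the example.

```python
MESSAGE_BYTES = 5_000        # assumption: "5kb" means 5,000 bytes
DATASET_MESSAGES = 1_500_000  # from the slide: 1.5M messages


def messages_per_second(tb_per_hour, message_bytes=MESSAGE_BYTES):
    """Convert TB/hour (1 TB taken as 10**12 bytes) into messages per second."""
    bytes_per_second = tb_per_hour * 10**12 / 3600
    return bytes_per_second / message_bytes


def dataset_load_seconds(tb_per_hour):
    """Time to load the whole data set at the given sustained rate."""
    return DATASET_MESSAGES / messages_per_second(tb_per_hour)


for partitions, tb_per_hour in [(1, 1.2), (3, 2.2)]:
    print(partitions,
          round(messages_per_second(tb_per_hour)),
          round(dataset_load_seconds(tb_per_hour), 1))
```

Under these assumptions the rates work out to roughly 67,000 messages per second with one partition and roughly 122,000 with three, so the 7.5 GB data set would load in well under half a minute in either case.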

Thank you!

Things to take away:

• Low latency

• Scalable and near real time.

• JSON and Avro support

• Make Vertica a component of processing pipeline

• Coming soon.

Next Steps:

• Try the Beta! http://bit.ly/1UqPzAG

• Learn more about Kafka at http://kafka.apache.org/

• Talk to us in the Dev Lounge

Questions?
