© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
SEIZE THE DATA. 2015
High performance, low latency streaming: Streaming data into Vertica from Kafka
Tom Wall and Jason Blais
August 2015
This is a rolling (up to three year) roadmap and is subject to change without notice.
Forward-looking statements
This document contains forward-looking statements regarding future operations, product development, product capabilities and availability dates. This information is subject to substantial uncertainties and is subject to change at any time without prior notification. Statements contained in this document concerning these matters only reflect Hewlett-Packard's predictions and/or expectations as of the date of this document, and actual results and future plans of Hewlett-Packard may differ significantly as a result of, among other things, changes in product strategy resulting from technological, internal corporate, market and other changes. This is not a commitment to deliver any material, code or functionality and should not be relied upon in making purchasing decisions.
HP confidential information
This Roadmap contains HP Confidential Information.
If you have a valid Confidential Disclosure Agreement with HP, disclosure of the Roadmap is subject to that CDA. If not, it is subject to the following terms: for a period of three years after the date of disclosure, you may use the Roadmap solely for the purpose of evaluating purchase decisions from HP and use a reasonable standard of care to prevent disclosures. You will not disclose the contents of the Roadmap to any third party unless it becomes publicly known, is rightfully received by you from a third party without duty of confidentiality, or is disclosed with HP's prior written approval.
Yesterday, scalable bulk data loading was hard. Today, the demand to do it in real time makes it even harder. Doing it properly requires deep Vertica expertise and engineering effort. By leveraging the strengths of Apache Kafka, Vertica Excavator makes scalable, real-time loading simpler than ever.
The challenges of data loading
Loading data needs to be easier!
• Data continuously generated
• Unpredictable bursts
• Many sources
• Many processing systems
• Complex data formats
Loading complexity: clients

Source → JDBC (INSERT VALUES (?), COPY LOCAL) → Vertica

• Traditional ETL via clients
• Easy to set up
• Lots of tools available
• Reasonable latency
• Single stream; poor throughput
• Scales until you hit session concurrency limits
Loading complexity: parallel COPY

Source → files in staging area → vsql COPY → Vertica

• Write files to a staging area accessible by Vertica
• Programs & scripts to COPY files in parallel
• Lots of custom code
• Very good throughput
• Scales with number of files & nodes
• Staging adds latency
Loading complexity: comparison. Kafka can provide the best of both worlds.

Method   | Scale Limits | Latency | Throughput | Complexity | Coupling | Ecosystem
Clients  | # Stmts      | Low     | Low        | Low        | Tight    | Good
COPY     | # Nodes      | High    | High       | High       | Tight    | Poor
Kafka    | # Nodes      | Low     | High       | Low        | Loose    | Good
http://kafka.apache.org/
Kafka to the rescue!
• A fast, scalable, fault-tolerant publish-subscribe messaging system
• A standard data backbone for all of an enterprise's systems & applications
• Generic enough to suit most use cases
• Fraud detection
• Website activity tracking
• Monitoring data metrics
• Log aggregation
• Stream processing
Kafka high-level architecture
[Diagram: producers of sensor data, cloud data, and customer data publish to topics on the broker(s); consumers read from the topics]
Publish/subscribe message bus: Kafka decouples data sources & sinks
[Diagram: apps, logs, and social media publish into Kafka; analytics and archive systems consume from it]
Kafka ecosystem
Connect your applications:
• CLI, HTTP
• Java, C++, .NET
• Clojure, Scala, Erlang
• Python, Ruby, Node.js
• Many more!

Connect your data systems:
• OLTP databases
• Hadoop
• Storm
• Spark
• Many more!

The developer community is large and growing!
Integration with Vertica
• Vertica schedules loads to continuously consume from Kafka
• JSON, Avro data formats
• CLI for easy setup
• In-database monitoring
• Vertica can also produce: export query results to Kafka to close the processing loop
[Diagram: a load scheduler, configured via the CLI, drives the Kafka plugin on each Vertica node; data loads from Kafka into Vertica, and Vertica exports back to Kafka]
Fits in your existing ecosystem
• No need to dedicate a Kafka cluster to Vertica; it connects to your existing Kafka
• Producers & consumers don’t need to know they are interacting with Vertica!
• No client drivers
• No complicated ETL
• No staging files
• Seamless, scalable interaction with producers & consumers for other systems
If you have Kafka and Vertica, you’re ready to go!
[Diagram: apps and stream engines exchange raw data through Kafka, which also connects Vertica and Hadoop]
Vertica's solution: technical details
Listening to you, our customers
Requirements
• Analytic reports are wanted more quickly
• Moving from file-based to stream-based data delivery
• Avro and JSON formats are the priority
• Support multiple topics
• Support topics with different priorities
• Don't require us to become Vertica internals experts
Things we worry about too.
Engineering requirements
• Low latency
• Parallelism
• Data loss
• Limited resources
• Scalability
• Monitoring / usability
Things we worry about too
Engineering requirements
• Time to take action on data is short
• With many topics, the time each individual topic can be polled must be limited
• Optimize loading for hot topics
Things we worry about too
Engineering requirements
Since each topic targets a single table, we need to handle and limit the number of COPY statements running at the same time.
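A common way to bound in-flight COPY statements is a counting semaphore. The sketch below is illustrative only: the `run_copy_for_topic` helper and the limit of 3 are hypothetical stand-ins, not part of the product.

```python
import threading

# Hypothetical limit: at most 3 COPY statements in flight at once.
MAX_CONCURRENT_COPIES = 3
copy_slots = threading.BoundedSemaphore(MAX_CONCURRENT_COPIES)
results = []

def run_copy_for_topic(topic):
    # Stand-in for issuing one topic's COPY ... SOURCE KafkaSource(...) statement.
    return f"copied {topic}"

def load_topic(topic):
    with copy_slots:  # blocks while 3 COPYs are already running
        results.append(run_copy_for_topic(topic))

threads = [threading.Thread(target=load_topic, args=(t,))
           for t in ["web", "iot", "logs", "billing", "audit"]]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Every topic still loads; the semaphore only serializes the excess, which keeps the cluster's load pools from being oversubscribed.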
Things we worry about too
Engineering requirements
With so many streams and components, what happens if there is a failure?
• The probability of data loss is limited through atomic operations we call microbatches.
Things we worry about too
Engineering requirements
• A Vertica cluster has limited resources and must share them with user queries
• Resource pools can be dedicated to the task
• Planned concurrency is important for parallelism
Things we worry about too
Engineering requirements
• More data means more resources
• Adding Vertica nodes increases scalability
• Adding Kafka brokers increases scalability
Things we worry about too
Engineering requirements
There is a lot of metadata being gathered. We want to track it and report on it to ensure the system is working as planned.
Support ‘exactly once’ message loads
Preventing data loss
Microbatch (µB):
1. Read Kafka data at offset X
2. Insert the data into Vertica
3. Store the new offsets in Vertica
4. Commit
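The microbatch flow can be sketched in miniature (hypothetical in-memory stand-ins, not the real plugin API): because the rows and the new offset are committed together, a restart after a failure resumes from the last stored offset, and each message lands exactly once.

```python
class FakeKafka:
    """Toy stand-in for a Kafka topic partition."""
    def __init__(self, msgs):
        self.msgs = msgs
    def read(self, start):
        return self.msgs[start:]

def run_microbatch(kafka, table, offsets, key):
    """One microbatch: read from the last stored offset, then commit
    rows and the new offset together (one transaction in the real system)."""
    start = offsets.get(key, 0)
    batch = kafka.read(start)          # Kafka data @ offset X
    if not batch:
        return 0
    new_offset = start + len(batch)
    # "Commit": rows and offset land together. A crash before this point
    # replays the same batch; a crash after it skips it -- exactly once.
    table.extend(batch)                # insert into Vertica
    offsets[key] = new_offset          # store new offsets in Vertica
    return len(batch)
```

Running the same microbatch twice loads nothing the second time, since the stored offset already covers the data.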
SQL Statements
Microbatch statements
COPY public.from_kafka
SOURCE KafkaSource(
    stream='topic,0,-2|topic,0,-2',
    brokers='broker:port',
    duration=interval '10000 milliseconds',
    stop_on_eof=true,
    executionparallelism=1 )
PARSER KafkaJSONParser( )
REJECTED DATA AS TABLE public.rejections
DIRECT NO COMMIT;

INSERT INTO kafkacfg.kafka_offsets (…)
SELECT * FROM (SELECT KafkaOffsets() OVER ()) AS b;
Simple Example
Implementation
Example scheduling:
• 5 topics
• Concurrency of 1
• Frame split into 5 equal parts (1, 2, 3, 4, 5)
Hot topics become starved.
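The simple scheme amounts to a fixed, even split of the frame. The sketch below is a toy model, not the shipped scheduler: every topic gets the same slot regardless of backlog, so a hot topic can never borrow the time an idle topic leaves unused.

```python
# Naive frame split: a 10 s frame over 5 topics with concurrency 1
# gives every topic a fixed 2 s slot, whether it needs it or not.
def equal_split(frame_ms, topics):
    slot = frame_ms // len(topics)
    return {t: slot for t in topics}

print(equal_split(10_000, ["t1", "t2", "t3", "t4", "t5"]))
# each topic gets 2000 ms
```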
Dynamic prioritization
A better way to implement
• In the simple example there is a lot of time wasted when microbatches finish early
• It becomes hard for ‘busy’ topics to ever catch up
• Can we implement something more intelligent?
Yes!
• Dynamic prioritizing of hot topics
• Flexible microbatch durations
Dynamic prioritization example
A better way to do it
[Diagram: microbatches 1 through 5 run in turn within the frame]
Where possible, the remaining duration is split evenly.
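"Split the remaining duration evenly" can be sketched as follows. This is a toy model with hypothetical per-topic run times, not the shipped scheduler: before each microbatch runs, the time left in the frame is divided evenly among the topics that have not run yet, so early finishers donate their unused time.

```python
# Dynamic prioritization sketch: topics run one at a time; each gets an
# even share of whatever time remains, and unused time rolls forward.
def dynamic_schedule(frame_ms, needed_ms):
    """needed_ms: hypothetical time each topic actually needs this frame."""
    remaining, allotted = frame_ms, {}
    pending = list(needed_ms)
    for i, topic in enumerate(pending):
        budget = remaining // (len(pending) - i)  # even split of what's left
        used = min(needed_ms[topic], budget)
        allotted[topic] = budget
        remaining -= used
    return allotted

# Topics 1-4 finish early, so the last topic ends up with far more
# than the fixed frame/5 it would get under the naive equal split:
print(dynamic_schedule(10_000,
                       {"t1": 500, "t2": 500, "t3": 500,
                        "t4": 500, "t5": 9000}))
```

Under the equal split, t5 would be capped at 2000 ms; here it inherits the 8000 ms the other four topics left on the table.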
Dynamic prioritization example
A better way to do it
Assuming the times are the same as before, the order now shifts.
[Diagram: microbatches 1, 3, 4, and 5 finish quickly across successive frames; µB2 gets lots of time now!]
Supporting multiple topics
• There may be tens or hundreds of topics
• Some are higher priority than others
• The way to handle this is with separate scheduler instances, each using:
  • Dedicated resource pools with specific tuning
  • Different schedule configurations (e.g. max poll duration)
Process monitoring
Monitoring
kafka_events: history of logged events from the scheduler
kafka_scheduler: current configuration for the scheduler (resource pool, duration, etc.)
kafka_offsets: long-term history of messages read / duration, by node/topic/partition/offsets
kafka_tables: target tables by Kafka topic
k_scheduler_history: history of changes to the scheduler
Simple command line configuration
Simplicity
[node001]$ /opt/vertica/packages/kafka/scripts/vkconfig.sh scheduler --add
[node001]$ /opt/vertica/packages/kafka/scripts/vkconfig.sh topic --add --target public.kafka_tgt --rejection_table public.kafka_rej --topic mud --npartitions 4
[node001]$ /opt/vertica/packages/kafka/scripts/vkconfig.sh launch --start

Example terminal command lines
Terabytes per hour
Benchmarks
System specifications
• Hardware: 6 servers, 24 cores each, 252 GB RAM
• OS: CentOS 6.6
• 3-node Vertica cluster
• 1 Kafka broker
• 1 producer
• 1 ZooKeeper instance

Benchmarks
• Data set size: 1.5M 5 KB messages
• System load: ~30% CPU / 14 GB RAM
• Throughput: 1.2 TB/hour with 1 partition; 2.2 TB/hour with 3 partitions
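As a back-of-envelope check on the numbers above (a sketch assuming 5 KB = 5,000 bytes; the per-run load time is not stated on the slide, so the 12.3 s below is simply the hypothetical duration that would yield the quoted 2.2 TB/hour):

```python
# Convert a measured load into a sustained TB/hour rate.
def tb_per_hour(num_messages, msg_bytes, seconds):
    return num_messages * msg_bytes / 1e12 * 3600 / seconds

# 1.5M messages of 5 KB each is 7.5 GB of data.
dataset_tb = 1.5e6 * 5_000 / 1e12
print(dataset_tb)  # 0.0075 TB

# Loading that in ~12.3 s would sustain the quoted 2.2 TB/hour.
print(round(tb_per_hour(1.5e6, 5_000, 12.3), 1))
```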
Thank you!
Things to take away:
• Low latency
• Scalable and near real time
• JSON and Avro support
• Makes Vertica a component of the processing pipeline
• Coming soon
Next Steps:
• Try the Beta! http://bit.ly/1UqPzAG
• Learn more about Kafka at http://kafka.apache.org/
• Talk to us in the Dev Lounge
Questions?