Ingest and Stream Processing - What will you choose?

Post on 15-Apr-2017

© Cloudera, Inc. All rights reserved.

13 April 2016

Ted Malaska | Principal Solutions Architect @ Cloudera
Pat Patterson | Community Champion @ StreamSets

Ingest and Stream Processing - What will you choose?

About Ted and Pat

Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact
  • ted.malaska@cloudera.com
  • @TedMalaska

Pat Patterson
• Community Champion @ StreamSets
• Formerly Developer Evangelist at Salesforce
• Contact
  • pat@streamsets.com
  • @metadaddy

Streaming Patterns

• Ingestion
• Low Millisecond Actions
• Near Real Time Complex Actions

Parts Of Streaming

Producer -> Kafka -> Engine -> Destination

[Diagram labels, in the order they appear along the pipeline: At Least Once; Ordered; Partitioned; At Least Once; Depends; Depends]

Destinations

• File Systems: example HDFS
  • Batch is good
  • Exactly once is only possible if a file is closed in a single ack
  • Good for scans

• Solr
  • Everything is document based, which makes exactly once possible
  • Batch is still good
  • Good for search queries
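The file-system point above can be sketched in plain Python on a local file system. This is a minimal illustration, not HDFS code: the function name `write_batch` and the offset-based file naming are assumptions made for the example. The idea is that a batch becomes visible in one step (the rename), and that a replayed batch lands on the same file name instead of adding duplicate data.

```python
import os
import tempfile


def write_batch(directory, first_offset, last_offset, records):
    """Write one batch to a file named after its offset range (hypothetical scheme).

    Records go to a temporary file first, which is then renamed into place
    in a single step, so readers never see a half-written file. Replaying
    the same batch rewrites the same file name rather than appending.
    """
    final_path = os.path.join(directory, f"batch-{first_offset}-{last_offset}.txt")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        for record in records:
            tmp.write(record + "\n")
    os.replace(tmp_path, final_path)  # single-step publish of the closed file
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_batch(d, 0, 2, ["a", "b", "c"])
        write_batch(d, 0, 2, ["a", "b", "c"])  # replay after a lost ack
        print(os.listdir(d))
```

A real pipeline would commit its Kafka offsets only after the rename succeeds; that step is left out here.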

Destinations

• NoSQL: example HBase
  • Everything has a row key, which makes writes exactly once
  • Increments can be applied twice, so be careful
  • Good for gets and puts

• Kudu
  • Everything has a row key, which makes writes exactly once
  • Good for gets, puts, and scans
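The difference between keyed puts and increments under at-least-once delivery can be shown with a toy in-memory store. `RowStore` is a hypothetical stand-in written for this example, not the HBase or Kudu client API.

```python
class RowStore:
    """Toy key-value store standing in for a row-keyed destination."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, value):
        # Overwrites by key: applying the same put twice changes nothing.
        self.rows[row_key] = value

    def increment(self, row_key, amount):
        # Adds to the current value: applying it twice double counts.
        self.rows[row_key] = self.rows.get(row_key, 0) + amount


messages = [("sensor-1", 5), ("sensor-2", 7)]
puts, incs = RowStore(), RowStore()

# At-least-once delivery: the same batch arrives twice.
for _ in range(2):
    for key, value in messages:
        puts.put(key, value)
        incs.increment(key, value)

print(puts.rows)  # {'sensor-1': 5, 'sensor-2': 7}
print(incs.rows)  # {'sensor-1': 10, 'sensor-2': 14}
```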

Ingestion Destinations

• File Systems: example HDFS
  • Flume
  • Kafka Connect

• Solr
  • Flume
  • Any Streaming Engine

• NoSQL: example HBase
  • Flume
  • Any Streaming Engine: Storm and Spark Streaming Tested

• Kudu
  • Flume
  • Kafka Connect
  • Any Streaming Engine: Spark Streaming Tested

Tricks With Producers

• Send Source ID (requires partitioning in Kafka)
  • Seq
  • UUID
  • UUID plus time
• Partition on SourceID
• Watch out for repartitions and partition failovers
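One way to read the source ID plus sequence trick is sketched below. The names `partition_for` and `SequenceDeduplicator` are hypothetical, and `zlib.crc32` is only a stand-in for a stable hash; it is not the hash Kafka's own partitioner uses. The sketch assumes each source emits increasing sequence numbers and that all of a source's messages reach the same consumer in order, which is why partitioning on the source ID matters.

```python
import zlib


def partition_for(source_id: str, num_partitions: int) -> int:
    """Map a source ID to a partition with a stable hash (stand-in for Kafka's)."""
    return zlib.crc32(source_id.encode("utf-8")) % num_partitions


class SequenceDeduplicator:
    """Drops messages whose sequence number was already seen for that source."""

    def __init__(self):
        self.last_seq = {}  # source_id -> highest sequence number accepted

    def accept(self, source_id: str, seq: int) -> bool:
        if seq <= self.last_seq.get(source_id, -1):
            return False  # duplicate from an at-least-once retry
        self.last_seq[source_id] = seq
        return True


dedup = SequenceDeduplicator()
events = [("src-a", 0), ("src-a", 1), ("src-a", 1), ("src-b", 0), ("src-a", 2)]
accepted = [event for event in events if dedup.accept(*event)]
print(accepted)  # [('src-a', 0), ('src-a', 1), ('src-b', 0), ('src-a', 2)]
```

The `last_seq` table lives with whichever consumer owns the partition, so a repartition or failover has to carry that state along, which is the caveat in the last bullet.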

Streaming Engines

• Consumer
• Flume, KafkaConnect
• Storm
• Spark Streaming
• Flink
• Kafka Streams

Consumer: Flume, KafkaConnect

• Simple and works
• Low latency
• High throughput
• Interceptors
• Transformations
• Alerting
• Ingestions

Storm

• Old gen
• Low latency
• Low throughput
• At least once
• Around forever
• Topology based

Spark Streaming

• The Juggernaut
• Higher latency
• High throughput
• Exactly once
• SQL
• MLlib
• Highly used
• Easy to debug/unit test
• Easy to transition from batch
• Flow language
• 600 commits in a month and about 100 meetups

Spark Streaming

[Diagram: a DStream shown as a sequence of batches (First Batch, Second Batch). In each batch: Source -> Receiver -> RDD, then Filter -> Count -> Print over the RDDs in a single pass.]

Spark Streaming

[Diagram: the same DStream flow with state, shown across Pre-first Batch, First Batch, and Second Batch. In each batch: Source -> Receiver -> RDD partitions, then Filter -> Count -> Print in a single pass. Stateful RDD 1 and Stateful RDD 2 appear alongside the batches.]
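The micro-batch flow in the diagrams can be approximated in plain Python. This is not the Spark API; `process_batch` and the `total_errors` state are hypothetical names for the example, and reading the stateful RDDs as state carried from one batch into the next is an interpretation of the diagram labels.

```python
def process_batch(batch, state):
    """Run filter -> count over one batch and fold the count into the state."""
    errors = [line for line in batch if "ERROR" in line]  # Filter
    count = len(errors)                                   # Count
    new_state = {"total_errors": state["total_errors"] + count}
    return count, new_state


# Each inner list plays the role of the RDD produced for one batch interval.
batches = [
    ["INFO a", "ERROR b"],
    ["ERROR c", "ERROR d", "INFO e"],
]

state = {"total_errors": 0}  # state that exists before the first batch
results = []
for number, batch in enumerate(batches, start=1):
    count, state = process_batch(batch, state)
    results.append((number, count, state["total_errors"]))
    print(f"batch {number}: {count} errors, {state['total_errors']} so far")  # Print
```

Each batch is processed in one pass, and only the small state object moves forward to the next batch.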

Flink

• "I'm better than Spark, why doesn't anyone use me?"
• Very much like Spark but not as feature rich
• Lower latency
• Micro Batch -> ABS
  • Asynchronous Barrier Snapshotting
• Flow language
• ~1/6th the commits and meetups
• But Slim loves it

Flink - ABS

[Diagram sequence over four slides: operators, each with an input buffer. Labels in order of appearance:]
• Barrier 1A hit; Barrier 1B still behind
• Both barriers hit; check point
• Barrier is combined and can move on
• Buffer can be flushed out
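The barrier sequence above can be sketched as a small simulation. `AligningOperator` is a hypothetical class for illustration, not Flink code, and it assumes only one barrier is in flight at a time. The operator keeps a running sum as its state; once a barrier arrives on one input, later records from that input are buffered until the matching barrier arrives on every other input.

```python
class AligningOperator:
    """Toy operator with several input channels and a running-sum state."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.blocked = set()    # channels whose barrier has already arrived
        self.buffer = []        # records held back from blocked channels
        self.state = 0          # running sum of processed values
        self.checkpoints = {}   # barrier id -> state snapshot
        self.output = []        # what is sent downstream, in order

    def _process(self, value):
        self.state += value
        self.output.append(("record", value))

    def on_record(self, channel, value):
        if channel in self.blocked:
            self.buffer.append(value)  # belongs after the checkpoint
        else:
            self._process(value)

    def on_barrier(self, channel, barrier_id):
        self.blocked.add(channel)
        if self.blocked == self.channels:                # both barriers hit
            self.checkpoints[barrier_id] = self.state    # check point
            self.output.append(("barrier", barrier_id))  # combined barrier moves on
            pending, self.buffer = self.buffer, []
            self.blocked.clear()
            for value in pending:                        # buffer is flushed out
                self._process(value)


op = AligningOperator(["A", "B"])
op.on_record("A", 1)
op.on_barrier("A", 1)   # barrier 1A hit, 1B still behind
op.on_record("A", 10)   # buffered: arrived after barrier 1A
op.on_record("B", 2)    # still before barrier 1B, processed normally
op.on_barrier("B", 1)   # both barriers hit
print(op.checkpoints)   # {1: 3}
print(op.output)        # [('record', 1), ('record', 2), ('barrier', 1), ('record', 10)]
```

The snapshot holds 3 rather than 13 because the value 10 arrived after barrier 1A and therefore belongs to the next checkpoint.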

Kafka Streams

• The new kid on the block
• When you only have Kafka
• Low latency
• High throughput
• Interesting snapshot approach
• Very young
• Flow language

Summary about Engines

• Ingestion
  • Flume and KafkaConnect
• Super Real Time and Special
  • Consumer
• Counting, MLlib, SQL
  • Spark
• Maybe future and cool
  • Flink and KafkaStreams
• Odd man out
  • Storm

StreamSets Data Collector

Building a Higher Level, Open Source Tool

StreamSets Company Background

Traditional and Big Data Founders

Top Tier Investors

Strategic Partners

Momentum to Date
• Founded 2014; exited stealth 9/15
• ~30 employees
• Double-digit enterprise customers
• 10,000 downloads

Thank you!