Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can...

19
Processing Data of Any Size with Apache Beam 1 / 19 Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Transcript of Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can...

Page 1: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Processing Data of Any Sizewith Apache Beam

1 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 2: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Chapter 1

Introducing Apache Beam

2 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 3: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

What Is Beam?Why Use Beam?Using Beam

Introducing Apache Beam

3 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 4: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Apache Beam is a unified model forprocessing data

Was originally created at GoogleLater donated to the Apache Foundation asApache BeamNow an Apache top level project

Beam code is written to its APICode is executed on different runnersNot directly tied to a framework or runner

All interactions are done throughpipelines

Apache Beam

4 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 5: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Pipeline

The DoFNs take in the inputprocesses it, and emit the results

Source

The Source reads the input onerecord or row at a time

DoFN DoFN Sink

The Source saves the output of theDoFN to the targeted path

All work is encapsulated in aPipeline

Beam Pipelines Diagram

5 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 6: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Juan

Fatima

Mark

14:00 14:30 15:00 15:30 16:00

Data is broken intosessions based on acriteria for a timeoutbetween actions.

Data can be calculatedin fixed windows wherethe time doesn't change.

Data can be calculatedin sliding windows wherethe time is fixed butadvances.

Beam Windowing

6 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 7: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

What Is Beam?Why Use Beam?Using Beam

Introducing Apache Beam

7 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 8: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Learning framework-specific APIs every time anew framework comes out

or completely changes theirexisting API doesn’t create

value

Too Many APIs

8 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 9: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Hadoop Cluster

Real-time data is published toKafka

Spark Streaming, Storm, orKafka Consumers process in

real-time

DataSource

DataSource

DataSource

RDBMS

Real-timeProcessingKafka Cluster

BI Analytics

Batch data is saved to HDFS

DataSource

DataSource

DataSource

MapReduce, Hive, Pig, Crunch,and Spark process data stored

in HDFS

Real-time data is archived toHDFS for analytics and ofine

processing

General Architecture Diagram

9 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 10: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

One API to rule them allOne API to learnMove between frameworks

The most unified batch and stream API I’ve used

Unified API to the ecosystem

Risk mitigation of frameworks

Multiple languages

Why I'm Excited About Beam

10 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 11: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Beam isn't tied to a specific framework

Apache Spark uses the spark-submit

Apache Flink can be submitted with the Maven runner

Google Cloud Dataflow can be submitted with the Maven runner

The DirectRunner can be started with the Maven runner

Running Beam

11 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 12: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Beam Contributions

12 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 13: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

What Is Beam?Why Use Beam?Using Beam

Introducing Apache Beam

13 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 14: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

IcannotteachhimTheboyhasnopatience

PCollection<String> etl = lines.apply(MapElements.into( TypeDescriptors.strings()) .via( (String line) -> line.toUpperCase() ));

ICANNOTTEACHHIMTHEBOYHASNOPATIENCE

MapElements

14 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 15: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

I cannot teach him. The boy has no patience.He will learn patience.

PCollection<String> linecount = lines.apply(Regex.matches("I.*\\."));

I cannot teach him. The boy has no patience.

Regular expressions can be used to parse KVs

I cannot teach him. The boy has no patience.He will learn patience.

PCollection<KV<String, String>> twoSentences = lines.apply(Regex.findKV("(.*)\\. (.*)", 1, 2));

<I cannot teach him, The boy has no patience>

Regex Transform

15 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 16: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

I cannot teach him. The boy has no patience.He will learn patience.

PCollection<String> pats = lines.apply(ParDo.of(new PatLinesFN()));

static class PatLinesFN extends DoFn<String, String> { @ProcessElement public void processElement(DoFn<String, String>.ProcessContext context) throws Exception { String[] pieces = context.element().split(" ");

for (String piece : pieces) { if (piece.startsWith("pat")) { context.output(piece); } } }}

patience.patience.

Example Custom DoFN

16 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 17: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

import org.apache.beam.sdk.Pipeline;import org.apache.beam.sdk.io.TextIO;import org.apache.beam.sdk.options.PipelineOptions;import org.apache.beam.sdk.options.PipelineOptionsFactory;import org.apache.beam.sdk.transforms.Count;import org.apache.beam.sdk.transforms.Regex;import org.apache.beam.sdk.transforms.ToString;

public class PicoWordCount { public static void main(String[] args) { PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options);

p .apply(TextIO.read().from("playing_cards.tsv")) .apply(Regex.split("\\W+")) .apply(Count.perElement()) .apply(ToString.elements()) .apply(TextIO.write().to("output/stringcounts"));

p.run(); }}

Playing Card Algorithm

17 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 18: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

What are other people doing with Beam?http://tiny.jesse-anderson.com/beaminterview

Where is some sample Beam code?http://tiny.jesse-anderson.com/beamtutorial

Main Beam sitehttps://beam.apache.org/

Convincing your bosshttp://tiny.jesse-anderson.com/beam1http://tiny.jesse-anderson.com/beam2

Next Steps

18 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b

Page 19: Processing Data of Any Size with Apache Beam...Apache Spark uses the spark-submit Apache Flink can be submitted with the Ma ven runner Google Cloud Dataflow can be submitted with

Current: Instructor, Thought Leader, Monkey Tamer

Previously:Curriculum Developer and Instructor @ ClouderaSenior Software Engineer @ Intuit

Covered, Conferences and Published In:GigaOM, ArsTecnica, Pragmatic Programmers, Strata, OSCON,Wall Street Journal, CNN, BBC, NPR

See Me On:http://www.jesse-anderson.com@jessetandersonhttp://tiny.bdi.io/linkedinhttp://tiny.bdi.io/youtube

About Me

19 / 19Copyright © 2018 Smoking Hand LLC. All rights Reserved. Version: f448583b