Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

157
Architectural Considerations for Hadoop Applications Strata/Hadoop World 2014 – October 15 th 2014 Mark Grover | @mark_grover Ted Malaska | @TedMalaska Jonathan Seidman | @jseidman Gwen Shapira | @gwenshap

description

Slides for Architectural Considerations for Hadoop Applications presentation at Strata Hadoop World 2015

Transcript of Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

Page 1: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

Architectural Considerations for Hadoop Applications Strata/Hadoop World 2014 – October 15th 2014

Mark Grover | @mark_grover Ted Malaska | @TedMalaska Jonathan Seidman | @jseidman Gwen Shapira | @gwenshap

Page 2: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

2

About the book •  @hadooparchbook •  hadooparchitecturebook.com •  github.com/hadooparchitecturebook •  slideshare.com/hadooparchbook

©2014 Cloudera, Inc. All Rights Reserved.

Page 3: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

3

About the presenters

•  Principal Solutions Architect at Cloudera

•  Previously, lead architect at FINRA

•  Contributor to Apache Flume, Avro, YARN, Pig and Spark

•  Senior Solutions Architect/Partner Enablement at Cloudera

•  Previously, Technical Lead on the big data team at Orbitz Worldwide

•  Co-founder of the Chicago Hadoop User Group and Chicago Big Data

©2014 Cloudera, Inc. All Rights Reserved.

Ted Malaska Jonathan Seidman

Page 4: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

4

About the presenters

•  Solutions Architect turned Software Engineer at Cloudera

•  Contributor to Sqoop, Flume and Kafka

•  Formerly a senior consultant at Pythian, Oracle ACE Director

•  Software Engineer at Cloudera

•  Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)

•  Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume

©2014 Cloudera, Inc. All Rights Reserved.

Gwen Shapira Mark Grover

Page 5: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

5

Logistics •  Coffee break around 10:30 •  Questions at the end of each section

©2014 Cloudera, Inc. All Rights Reserved.

Page 6: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

6

Case Study Clickstream Analysis

Page 7: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

7

Analytics

©2014 Cloudera, Inc. All Rights Reserved.

Page 8: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

8

Analytics

©2014 Cloudera, Inc. All Rights Reserved.

Page 9: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

9

Analytics

©2014 Cloudera, Inc. All Rights Reserved.

Page 10: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

10

Analytics

©2014 Cloudera, Inc. All Rights Reserved.

Page 11: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

11

Analytics

©2014 Cloudera, Inc. All Rights Reserved.

Page 12: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

12

Analytics

©2014 Cloudera, Inc. All Rights Reserved.

Page 13: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

13

Analytics

©2014 Cloudera, Inc. All Rights Reserved.

Page 14: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

14

Web Logs – Combined Log Format

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

Page 15: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

15

Clickstream Analytics

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”

Page 16: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

16

Similar use-cases •  Sensors – heart, agriculture, etc. •  Casinos – session of a person at a table

©2014 Cloudera, Inc. All Rights Reserved.

Page 17: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

17

Pre-Hadoop Architecture Clickstream Analysis

Page 18: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

18

Click Stream Analysis (Before Hadoop)

©2014 Cloudera, Inc. All Rights Reserved.

Web logs (full fidelity) (2 weeks)

Data Warehouse Transform/Aggregate Business

Intelligence

Tape Archive

Page 19: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

19

Problems with Pre-Hadoop Architecture •  Full fidelity data is stored for small amount of time (~weeks). •  Older data is sent to tape, or even worse, deleted! •  Inflexible workflow - think of all aggregations beforehand

©2014 Cloudera, Inc. All Rights Reserved.

Page 20: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

20

Effects of Pre-Hadoop Architecture •  Regenerating aggregates is expensive or worse, impossible •  Can’t correct bugs in the workflow/aggregation logic •  Can’t do experiments on existing data

©2014 Cloudera, Inc. All Rights Reserved.

Page 21: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

21

Why is Hadoop A Great Fit? Clickstream Analysis

Page 22: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

22

Why is Hadoop a great fit? •  Volume of clickstream data is huge •  Velocity at which it comes in is high •  Variety of data is diverse - semi-structured data •  Hadoop enables

–  active archival of data –  Aggregation jobs –  Querying the above aggregates or raw fidelity data

©2014 Cloudera, Inc. All Rights Reserved.

Page 23: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

23

Click Stream Analysis (with Hadoop)

©2014 Cloudera, Inc. All Rights Reserved.

Web logs Hadoop Business Intelligence

Active archive (no tape) Aggregation engine Querying engine

Page 24: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

24

Challenges of Hadoop Implementation

©2014 Cloudera, Inc. All Rights Reserved.

Page 25: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

25

Challenges of Hadoop Implementation

©2014 Cloudera, Inc. All Rights Reserved.

Page 26: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

26

Other challenges - Architectural Considerations •  Storage managers?

–  HDFS? HBase? •  Data storage and modeling:

–  File formats? Compression? Schema design? •  Data movement

–  How do we actually get the data into Hadoop? How do we get it out? •  Metadata

–  How do we manage data about the data? •  Data access and processing

–  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?

•  Orchestration –  How do we manage the workflow for all of this?

©2014 Cloudera, Inc. All Rights Reserved.

Page 27: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

27

Case Study Requirements Overview of Requirements

Page 28: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

28

Overview of Requirements •  Data ingestion – what requirements do we have for moving data into

our processing flow? •  Data storage – what requirements do we have for the storage of

data, both incoming raw data and processed data? •  Data processing – how do we need to process the data to meet our

functional requirements? •  Workflow orchestration – how do we manage and monitor all the

processing?

©2014 Cloudera, Inc. All Rights Reserved.

Page 29: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

29

Case Study Requirements Data Ingestion

Page 30: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

30

Data Ingestion Requirements

©2014 Cloudera, Inc. All Rights Reserved.

Web Servers Web Servers

Web Servers Web Servers

Web Servers Logs

244.157.45.12 - - [17/Oct/2014:21:08:30 ]

"GET /seatposts HTTP/1.0" 200 4463 …

CRM Data

ODS Web Servers

Web Servers Web Servers

Web Servers Web Servers

Logs

Hadoop

Page 31: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

31

Data Ingestion Requirements •  So we need to be able to support:

–  Reliable ingestion of large volumes of semi-structured event data arriving with high velocity (e.g. logs).

–  Timeliness of data availability – data needs to be available for processing to meet business service level agreements.

–  Periodic ingestion of data from relational data stores.

©2014 Cloudera, Inc. All Rights Reserved.

Page 32: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

32

Case Study Requirements Data Storage

Page 33: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

33

Data Storage Requirements

©2014 Cloudera, Inc. All Rights Reserved.

Store all the data Make the data accessible for

processing

Compress the data

Page 34: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

34

Case Study Requirements Data Processing

Page 35: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

35

Processing requirements Be able to answer questions like: •  What is my website’s bounce rate?

–  i.e. how many % of visitors don’t go past the landing page?

•  Which marketing channels are leading to most sessions? •  Do attribution analysis

–  Which channels are responsible for most conversions?

©2014 Cloudera, Inc. All Rights Reserved.

Page 36: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

36

Sessionization

©2014 Cloudera, Inc. All Rights Reserved.

Website visit

Visitor 1 Session 1

Visitor 1 Session 2

Visitor 2 Session 1

> 30 minutes

Page 37: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

37

Case Study Requirements Orchestration

Page 38: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

38

Orchestration is simple We just need to execute actions One after another

©2014 Cloudera, Inc. All Rights Reserved.

Page 39: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

39 ©2014 Cloudera, Inc. All Rights Reserved.

Actually, we also need to handle errors And user notifications ….

Page 40: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

40 ©2014 Cloudera, Inc. All Rights Reserved.

And… •  Re-start workflows after errors •  Reuse of actions in multiple workflows •  Complex workflows with decision points •  Trigger actions based on events •  Tracking metadata •  Integration with enterprise software •  Data lifecycle •  Data quality control •  Reports

Page 41: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

41 ©2014 Cloudera, Inc. All Rights Reserved.

OK, maybe we need a product To help us do all that

Page 42: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

42

Architectural Considerations Data Modeling

Page 43: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

43

Data Modeling Considerations •  We need to consider the following in our architecture:

–  Storage layer – HDFS? HBase? Etc. –  File system schemas – how will we lay out the data? –  File formats – what storage formats to use for our data, both raw and

processed data? –  Data compression formats?

©2014 Cloudera, Inc. All Rights Reserved.

Page 44: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

44

Architectural Considerations Data Modeling – Storage Layer

Page 45: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

45

Data Storage Layer Choices •  Two likely choices for raw data:

©2014 Cloudera, Inc. All Rights Reserved.

Page 46: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

46

Data Storage Layer Choices

•  Stores data directly as files •  Fast scans •  Poor random reads/writes

•  Stores data as Hfiles on HDFS

•  Slow scans •  Fast random reads/writes

©2014 Cloudera, Inc. All Rights Reserved.

Page 47: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

47

Data Storage – Storage Manager Considerations

•  Incoming raw data: –  Processing requirements call for batch transformations across multiple

records – for example sessionization.

•  Processed data: –  Access to processed data will be via things like analytical queries – again

requiring access to multiple records.

•  We choose HDFS –  Processing needs in this case served better by fast scans.

©2014 Cloudera, Inc. All Rights Reserved.

Page 48: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

48

Architectural Considerations Data Modeling – Raw Data Storage

Page 49: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

49

Storage Formats – Raw Data and Processed Data

©2014 Cloudera, Inc. All Rights Reserved.

Processed Data

Raw Data

Page 50: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

50

Raw Data Storage – Format Considerations •  Store as plain text?

–  Sure, well supported by Hadoop. –  Text can easily be processed by MapReduce, loaded into Hive for analysis,

and so on.

•  But… –  Will begin to consume lots of space in HDFS. –  May not be optimal for processing by tools in the Hadoop ecosystem.

©2014 Cloudera, Inc. All Rights Reserved.

Page 51: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

51

Raw Data Storage – Format Considerations •  But, we can compress the text files…

–  Gzip – supported by Hadoop, but not splittable. –  Bzip2 – hey, splittable! Great compression! But decompression is slooowww. –  LZO – splittable (with some work), good compress/de-compress performance.

Good choice for storing text files on Hadoop. –  Snappy – provides a good tradeoff between size and speed.

©2014 Cloudera, Inc. All Rights Reserved.

Page 52: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

52

Raw Data Storage – More About Snappy •  Designed at Google to provide high compression speeds with reasonable

compression. •  Not the highest compression, but provides very good performance for

processing on Hadoop. •  Snappy is not splittable though, which brings us to…

©2014 Cloudera, Inc. All Rights Reserved.

Page 53: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

53

SequenceFile

• Stores records as binary key/value pairs.

• SequenceFile “blocks” can be compressed.

•  This enables splittability with non-splittable compression.

©2014 Cloudera, Inc. All Rights Reserved.

Page 54: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

54

Avro

• Kinda SequenceFile on Steroids.

• Self-documenting – stores schema in header.

• Provides very efficient storage.

• Supports splittable compression. ©2014 Cloudera, Inc. All Rights Reserved.

Page 55: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

55

Our Format Recommendations for Raw Data…

•  Avro with Snappy –  Snappy provides optimized compression. –  Avro provides compact storage, self-documenting files, and supports schema

evolution. –  Avro also provides better failure handling than other choices.

•  SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem. –  But only supports Java.

©2014 Cloudera, Inc. All Rights Reserved.

Page 56: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

56

But Note… •  For simplicity, we’ll use plain text for raw data in our example.

©2014 Cloudera, Inc. All Rights Reserved.

Page 57: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

57

Architectural Considerations Data Modeling – Processed Data Storage

Page 58: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

58

Storage Formats – Raw Data and Processed Data

©2014 Cloudera, Inc. All Rights Reserved.

Processed Data

Raw Data

Page 59: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

59

Access to Processed Data

©2014 Cloudera, Inc. All Rights Reserved.

Column A Column B Column C Column D Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value

Analytical Queries

Page 60: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

60

Columnar Formats •  Eliminates I/O for columns that are not part of a query. •  Works well for queries that access a subset of columns. •  Often provide better compression. •  These add up to dramatically improved performance for many

queries.

©2014 Cloudera, Inc. All Rights Reserved.

1 2014-10-13

abc

2 2014-10-14

def

3 2014-10-15

ghi

1 2 3

2014-10-13

2014-10-14

2014-10-15

abc def ghi

Page 61: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

61

Columnar Choices – RCFile •  Provides efficient processing for MapReduce applications, but

generally only used with Hive. •  Only supports Java. •  No Avro support. •  Limited compression support. •  Sub-optimal performance compared to newer columnar formats.

©2014 Cloudera, Inc. All Rights Reserved.

Page 62: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

62

Columnar Choices – ORC •  A better RCFile. •  Supports Hive, but currently limited support for other interfaces. •  Only supports Java.

©2014 Cloudera, Inc. All Rights Reserved.

Page 63: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

63

Columnar Choices – Parquet •  Supports multiple programming interfaces – MapReduce, Hive,

Impala, Pig. •  Multiple language support. •  Broad vendor support. •  These features make Parquet a good choice for our processed data.

©2014 Cloudera, Inc. All Rights Reserved.

Page 64: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

64

Architectural Considerations Data Modeling – Schema Design

Page 65: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

65

HDFS Schema Design – One Recommendation

/user/<username> - User specific data, jars, conf files /etl – Data in various stages of ETL workflow /tmp – temp data from tools or shared between users /data – processed data to be shared data with the entire organization /app – Everything but data: UDF jars, HQL files, Oozie workflows

©2014 Cloudera, Inc. All Rights Reserved.

Page 66: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

66

Partitioning •  Split the dataset into smaller consumable chunks. •  Rudimentary form of “indexing”. Reduces I/O needed to process

queries.

©2014 Cloudera, Inc. All Rights Reserved.

dataset col=val1/file.txt col=val2/file.txt … col=valn/file.txt

dataset file1.txt file2.txt … filen.txt

Un-partitioned HDFS directory structure

Partitioned HDFS directory structure

Page 67: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

67

Partitioning considerations •  What column to partition by?

–  Don’t have too many partitions (<10,000) –  Don’t have too many small files in the partitions (more than block size

generally)

•  We’ll partition by timestamp. This applies to both our raw and processed data.

©2014 Cloudera, Inc. All Rights Reserved.

Page 68: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

68

Partitioning For Our Case Study •  Raw dataset:

–  /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10!

•  Processed dataset: –  /data/bikeshop/clickstream/year=2014/month=10/day=10!

©2014 Cloudera, Inc. All Rights Reserved.

Page 69: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

69

A Word About Partitioning Alternatives

year=2014/month=10/day=10 dt=2014-10-10

©2014 Cloudera, Inc. All Rights Reserved.

Page 70: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

70

Architectural Considerations Data Ingestion

Page 71: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

71 ©2014 Cloudera, Inc. All rights reserved.

•  Omniture data on FTP •  Apps •  App Logs •  RDBMS

Typical Clickstream data sources

Page 72: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

72 ©2014 Cloudera, Inc. All rights reserved.

Getting Files from FTP

Page 73: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

73 ©2014 Cloudera, Inc. All rights reserved.

curl ftp://myftpsite.com/sitecatalyst/myreport_2014-10-05.tar.gz --user name:password | hdfs -put - /etl/clickstream/raw/sitecatalyst/myreport_2014-10-05.tar.gz

Don’t over-complicate things

Page 74: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

74 ©2014 Cloudera, Inc. All rights reserved.

Reliable, distributed and highly available systems That allow streaming events to Hadoop

Event Streaming – Flume and Kafka

Page 75: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

75 ©2014 Cloudera, Inc. All rights reserved.

•  Many available data collection sources •  Well integrated into Hadoop •  Supports file transformations •  Can implement complex topologies •  Very low latency •  No programming required

Flume:

Page 76: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

76 ©2014 Cloudera, Inc. All rights reserved.

“We just want to grab data from this directory and write it to HDFS”

We use Flume when:

Page 77: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

77 ©2014 Cloudera, Inc. All rights reserved.

•  Very high-throughput publish-subscribe messaging •  Highly available •  Stores data and can replay •  Can support many consumers with no extra latency

Kafka is:

Page 78: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

78 ©2014 Cloudera, Inc. All rights reserved.

“Kafka is awesome. We heard it cures cancer”

Use Kafka When:

Page 79: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

79 ©2014 Cloudera, Inc. All rights reserved.

•  Use Flume with a Kafka Source •  Allows to get data from Kafka,

run some transformations write to HDFS, HBase or Solr

Actually, why choose?

Page 80: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

80 ©2014 Cloudera, Inc. All rights reserved.

•  We want to ingest events from log files •  Flume’s Spooling Directory source fits •  With HDFS Sink

•  We would have used Kafka if… –  We wanted the data in non-Hadoop systems too

In Our Example…

Page 81: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

81

Sources Interceptors Selectors Channels Sinks

Flume Agent

Short Intro to Flume Twitter, logs, JMS,

webserver Mask, re-format,

validate… DR, critical Memory, file HDFS, Hbase,

Solr

Page 82: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

82

Configuration

• Declarative – No coding required. – Configuration specifies

how components are wired together.

©2014 Cloudera, Inc. All Rights Reserved.

Page 83: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

83

Interceptors

• Mask fields • Validate information

against external source • Extract fields • Modify data format •  Filter or split events

©2014 Cloudera, Inc. All rights reserved.

Page 84: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

84 ©2014 Cloudera, Inc. All rights reserved.

Any sufficiently complex configuration Is indistinguishable from code

Page 85: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

85

A Brief Discussion of Flume Patterns – Fan-in •  Flume agent runs on

each of our servers. •  These client agents

send data to multiple agents to provide reliability.

•  Flume provides support for load balancing.

©2014 Cloudera, Inc. All Rights Reserved.

Page 86: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

86

A Brief Discussion of Flume Patterns – Splitting •  Common need is to split

data on ingest. •  For example:

– Sending data to multiple clusters for DR.

– To multiple destinations. •  Flume also supports

partitioning, which is key to our implementation.

©2014 Cloudera, Inc. All Rights Reserved.

Page 87: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

87 ©2014 Cloudera, Inc. All rights reserved.

Flume Demo

Page 88: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

88

Flume Demo – Architecture

Web Server Flume Agent Web Server Flume Agent

Web Server Flume Agent Web Server Flume Agent

Web Server Flume Agent Web Server Flume Agent

Web Server Flume Agent Web Server Flume Agent

Flume Agent

Flume Agent

Flume Agent

Flume Agent

Fan-in Pattern

Multi Agents for Failover and rolling restarts

HDFS

©2014 Cloudera, Inc. All Rights Reserved.

Page 89: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

89

Flume Agent

Web Logs Spooling

Dir Source

Timestamp Interceptor

File Channel

Avro Sink Avro Sink Avro Sink

Flume Demo – Client Tier Web Server Flume Agent

Page 90: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

90

Flume Demo – Collector Tier Flume Agent

Flume Agent

HDFS Avro

Source File

Channel HDFS Sink

HDFS Sink

Page 91: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

91 ©2014 Cloudera, Inc. All rights reserved.

•  Add Kafka producer to our webapp •  Send clicks and searches as messages •  Run an MR job once a day to grab all new events •  We can add a second consumer for real-time •  Another consumer for alerting…

What if…. We were to use Kafka?

Page 92: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

92 ©2014 Cloudera, Inc. All rights reserved.

Camus

Page 93: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

93

Architectural Considerations Data Processing – Engines tiny.cloudera.com/app-arch-slides

Page 94: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

94

Processing Engines •  MapReduce •  Abstractions •  Spark •  Spark Streaming •  Impala

Confidentiality Information Goes Here

Page 95: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

95

MapReduce •  Oldie but goody •  Restrictive Framework / Innovated Work Around •  Extreme Batch

Confidentiality Information Goes Here

Page 96: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

96

MapReduce Basic High Level

Confidentiality Information Goes Here

Mapper

HDFS (Replicated)

Native File System

Block of Data

Temp Spill Data

Partitioned Sorted Data

Reducer

Reducer Local Copy

Output File

Page 97: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

97

MapReduce Innovation •  Mapper Memory Joins •  Reducer Memory Joins •  Buckets Sorted Joins •  Cross Task Communication •  Windowing •  And Much More

Confidentiality Information Goes Here

Page 98: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

98

Abstractions •  SQL

–  Hive

•  Script/Code –  Pig: Pig Latin –  Crunch: Java/Scala –  Cascading: Java/Scala

Confidentiality Information Goes Here

Page 99: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

99

Spark •  The New Kid that isn’t that New Anymore •  Easily 10x less code •  Extremely Easy and Powerful API •  Very good for machine learning •  Scala, Java, and Python •  RDDs •  DAG Engine

Confidentiality Information Goes Here

Page 100: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

100

Spark - DAG

Confidentiality Information Goes Here

Page 101: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

101

Spark - DAG

Confidentiality Information Goes Here

Filter KeyBy

KeyBy

TextFile

TextFile

Join Filter Take

Page 102: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

102

Spark - DAG

Confidentiality Information Goes Here

Filter KeyBy

KeyBy

TextFile

TextFile

Join Filter Take

Good

Good

Good

Good

Good

Good

Good-Replay

Good-Replay

Good-Replay

Good

Good-Replay

Good

Good-Replay

Lost Block Replay

Good-Replay

Lost Block

Good

Future

Future

Future

Future

Page 103: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

103

Spark Streaming •  Calling Spark in a Loop •  Extends RDDs with DStream •  Very Little Code Changes from ETL to Streaming

Confidentiality Information Goes Here

Page 104: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

104

Spark Streaming

Confidentiality Information Goes Here

Single Pass

Source Receiver RDD

Source Receiver RDD

RDD

Filter Count Print

Source Receiver RDD

RDD

RDD

Single Pass

Filter Count Print

Pre-first Batch

First Batch

Second Batch

Page 105: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

105

Spark Streaming

Confidentiality Information Goes Here

Single Pass

Source Receiver RDD

Source Receiver RDD

RDD

Filter Count

Print

Source Receiver RDD

RDD

RDD

Single Pass

Filter Count

Pre-first Batch

First Batch

Second Batch

Stateful RDD 1

Print

Stateful RDD 2

Stateful RDD 1

Page 106: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

106

Impala

Confidentiality Information Goes Here

•  MPP Style SQL Engine on top of Hadoop •  Very Fast •  High Concurrency

Page 107: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

107

Impala – Broadcast Join

Confidentiality Information Goes Here

Impala Daemon

Smaller Table Data Block

100% Cached Smaller Table

Smaller Table Data Block

Impala Daemon

100% Cached Smaller Table

Impala Daemon

100% Cached Smaller Table

Impala Daemon

Hash Join Function

Bigger Table Data Block

100% Cached Smaller Table

Output

Impala Daemon

Hash Join Function

Bigger Table Data Block

100% Cached Smaller Table

Output

Impala Daemon

Hash Join Function

Bigger Table Data Block

100% Cached Smaller Table

Output

Page 108: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

108

Impala – Broadcast Join

Confidentiality Information Goes Here

Impala Daemon

Smaller Table Data Block

~33% Cached Smaller Table

Smaller Table Data Block

Impala Daemon

~33% Cached Smaller Table

Impala Daemon

~33% Cached Smaller Table

Hash Partitioner Hash Partitioner

Impala Daemon

BiggerTable Data Block

Impala Daemon Impala Daemon

Hash Partitioner Hash Join Function

33% Cached Smaller Table

Hash Join Function

33% Cached Smaller Table

Hash Join Function

33% Cached Smaller Table

Output Output Output

BiggerTable Data Block

Hash Partitioner

BiggerTable Data Block

Hash Partitioner

Page 109: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

109

Impala vs Hive

Confidentiality Information Goes Here

•  Very different approaches and •  We may see convergence at some point •  But for now

–  Impala for speed –  Hive for batch

Page 110: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

110

Architectural Considerations Data Processing – Patterns and Recommendations

Page 111: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

111

What processing needs to happen?

Confidentiality Information Goes Here

•  Sessionization •  Filtering •  Deduplication •  BI / Discovery

Page 112: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

112

Sessionization

Confidentiality Information Goes Here

Website visit

Visitor 1 Session 1

Visitor 1 Session 2

Visitor 2 Session 1

> 30 minutes

Page 113: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

113

Why sessionize?

Confidentiality Information Goes Here

Helps answers questions like: •  What is my website’s bounce rate?

–  i.e. how many % of visitors don’t go past the landing page?

•  Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions? –  Which ones of those lead to most conversions (e.g. people buying things,

signing up, etc.)

•  Do attribution analysis – which channels are responsible for most conversions?

Page 114: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

114

Sessionization

Confidentiality Information Goes Here

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12+1413580110 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 244.157.45.12+1413583199

Page 115: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

115

How to Sessionize?

Confidentiality Information Goes Here

1.  Given a list of clicks, determine which clicks came from the same user (Partitioning, ordering)

2.  Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session (Identifying session boundaries)

Page 116: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

116

#1 – Which clicks are from same user? •  We can use:

–  IP address (244.157.45.12) –  Cookies (A9A3BECE0563982D) –  IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")

©2014 Cloudera, Inc. All Rights Reserved.

Page 117: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

117

#1 – Which clicks are from same user? •  We can use:

–  IP address (244.157.45.12) –  Cookies (A9A3BECE0563982D) –  IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")

©2014 Cloudera, Inc. All Rights Reserved.

Page 118: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

118

#1 – Which clicks are from same user?

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

Page 119: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

119

#2 – Which clicks part of the same session?

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

> 30 mins apart = different sessions

Page 120: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

120 ©2014 Cloudera, Inc. All rights reserved.

Sessionization engine recommendation •  We have sessionization code in MR, Spark and Hive on github.

The complexity of the code varies, depends on the expertise in the organization.

•  We choose Hive, since it’s a fairly small query to maintain instead of large pieces of code.

Page 121: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

121

Filtering – filter out incomplete records

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…

Page 122: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

122

Filtering – filter out records from bots/spiders

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

Google spider IP address

Page 123: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

123 ©2014 Cloudera, Inc. All rights reserved.

Filtering recommendation •  Bot/Spider filtering can be done easily in any of the engines •  Incomplete records are harder to filter in schema systems like

Hive, Impala, Pig, etc. •  Pretty close choice between MR, Hive and Spark •  Can be done in Spark using rdd.filter() •  We can simply embed this in our sessionization job

Page 124: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

124

Deduplication – remove duplicate records

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”

Page 125: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

125 ©2014 Cloudera, Inc. All rights reserved.

Deduplication recommendation •  Can be done in all engines. •  We already have a Hive table with all the columns, a simple

DISTINCT query will perform deduplication •  reduce() in spark •  We use Pig

Page 126: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

126 ©2014 Cloudera, Inc. All rights reserved.

BI/Discovery engine recommendation •  Main requirements for this are:

–  Low latency –  SQL interface (e.g. JDBC/ODBC) –  Users don’t know how to code

•  We chose Impala –  It’s a SQL engine –  Much faster than other engines –  Provides standard JDBC/ODBC interfaces

Page 127: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

127 ©2014 Cloudera, Inc. All rights reserved.

Processing Demo

Page 128: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

128

Architectural Considerations Orchestration

Page 129: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

129

•  Data arrives through Flume •  Triggers a processing event:

–  Sessionize –  Enrich – Location, marketing channel… –  Store as Parquet

•  Each day we process events from the previous day

Orchestrating Clickstream

Page 130: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

130 ©2014 Cloudera, Inc. All rights reserved.

•  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting

Choosing Right

Page 131: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

131

Oozie or Azkaban?

©2014 Cloudera, Inc. All rights reserved.

Page 132: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

132 ©2014 Cloudera, Inc. All rights reserved.

Oozie Architecture

Page 133: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

133 ©2014 Cloudera, Inc. All rights reserved.

•  Part of all major Hadoop distributions •  Hue integration •  Built -in actions – Hive, Sqoop, MapReduce, SSH •  Complex workflows with decisions •  Event and time based scheduling •  Notifications •  SLA Monitoring •  REST API

Oozie features

Page 134: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

134 ©2014 Cloudera, Inc. All rights reserved.

•  Overhead in launching jobs •  Steep learning curve •  XML Workflows

Oozie Drawbacks

Page 135: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

135 ©2014 Cloudera, Inc. All rights reserved.

Azkaban Architecture

Azkaban Executor Server

Azkaban Web Server

HDFS viewer plugin

Job types plugin

MySQL

Client Hadoop

Page 136: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

136 ©2014 Cloudera, Inc. All rights reserved.

•  Simplicity •  Great UI – including pluggable visualizers •  Lots of plugins – Hive, Pig… •  Reporting plugin

Azkaban features

Page 137: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

137 ©2014 Cloudera, Inc. All rights reserved.

•  Doesn’t support workflow decisions •  Can’t represent data dependency

Azkaban Limitations

Page 138: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

138 ©2014 Cloudera, Inc. All rights reserved.

•  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting

Choosing…

Easier in Oozie

Page 139: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

139 ©2014 Cloudera, Inc. All rights reserved.

•  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting

Choosing the right Orchestration Tool

Better in Azkaban

Page 140: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

140 ©2014 Cloudera, Inc. All rights reserved.

The best orchestration tool is the one you are an expert on

Important Decision Consideration!

Page 141: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

141 ©2014 Cloudera, Inc. All rights reserved.

Orchestration Patterns – Fan Out

Page 142: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

142 ©2014 Cloudera, Inc. All rights reserved.

Capture & Decide Pattern

Page 143: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

143 ©2014 Cloudera, Inc. All rights reserved.

Oozie Demo

Page 144: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

144

Putting It All Together Final Architecture

Page 145: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

145

Final Architecture – High Level Overview

Data Sources Ingestion Data Storage/

Processing

Data Reporting/Analysis

©2014 Cloudera, Inc. All Rights Reserved.

Page 146: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

146

Final Architecture – High Level Overview

Data Sources Ingestion Data Storage/

Processing

Data Reporting/Analysis

©2014 Cloudera, Inc. All Rights Reserved.

Page 147: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

147

Final Architecture – Ingestion

Web Server Flume Agent Web Server Flume Agent

Web Server Flume Agent Web Server Flume Agent

Web Server Flume Agent Web Server Flume Agent

Web Server Flume Agent Web Server Flume Agent

Flume Agent

Flume Agent

Flume Agent

Flume Agent

Fan-in Pattern

Multi Agents for Failover and rolling restarts

HDFS

©2014 Cloudera, Inc. All Rights Reserved.

Page 148: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

148

Final Architecture – High Level Overview

Data Sources Ingestion Data Storage/

Processing

Data Reporting/Analysis

©2014 Cloudera, Inc. All Rights Reserved.

Page 149: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

149

Final Architecture – Storage and Processing

/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10 …

Data Processing Deduplication

Filtering Sessionization

Etc.

/data/bikeshop/clickstream/year=2014/month=10/day=10 …

©2014 Cloudera, Inc. All Rights Reserved.

Page 150: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

150

Final Architecture – High Level Overview

Data Sources Ingestion Data Storage/

Processing

Data Reporting/Analysis

©2014 Cloudera, Inc. All Rights Reserved.

Page 151: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

151

Final Architecture – Data Access

Hive/Impala

BI/Analytics

Tools

DWH Sqoop

Local Disk

R, etc.

DB import tool

JDBC/ODBC

©2014 Cloudera, Inc. All Rights Reserved.

Page 152: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

152

Stay in touch! @hadooparchbook hadooparchitecturebook.com slideshare.com/hadooparchbook

Page 153: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

153 Confidentiality Information Goes Here

Join the Discussion

Get community help or provide feedback cloudera.com/community

Page 154: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

154

Try Hadoop Now cloudera.com/live

Page 155: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

155

Visit us at Booth #305

Highlights: Hear what’s new with 5.2 including Impala 2.0 Learn how Cloudera is setting the standard for Hadoop in the Cloud

BOOK SIGNINGS THEATER SESSIONS

TECHNICAL DEMOS GIVEAWAYS

Page 156: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

156

Free books and office hours! •  Book signing #1

–  Tomorrow, 10/16, 3:45 – 4:15pm at Expo Hall - O'Reilly Authors Booth

•  Book signing #2 –  Tomorrow, 10/16, 6:35 – 7:00pm at Cloudera Booth #305

•  Office Hours, tomorrow, 10/16, 10:30 – 11:00 am –  Mark and Ted in Room: Table A –  Gwen and Jonathan in Room: Table C

©2014 Cloudera, Inc. All Rights Reserved.

Page 157: Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

Thank you