Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop
Big Data TechCon, San Francisco 2014
slideshare.com/hadooparchbook
Mark Grover | @mark_grover
Jonathan Seidman | @jseidman
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
©2014 Cloudera, Inc. All Rights Reserved.
About Us
• Mark
  – Software Engineer
  – Committer on Apache Bigtop; committer and PPMC member on Apache Sentry (incubating)
  – Contributor to Hadoop, Hive, Spark, Sqoop, Flume
  – @mark_grover
• Jonathan
  – Senior Solutions Architect, Partner Enablement
  – Co-founder of Chicago Hadoop User Group and Chicago Big Data
  – [email protected]
  – @jseidman
Case Study: Clickstream Analysis
Analytics
Web Logs – Combined Log Format
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
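For illustration, entries like these can be parsed with a regular expression. A minimal Python sketch (the field names are our own, and a production pipeline would want a more robust parser than this):

```python
import re

# One regex for the Combined Log Format: IP, identity, user, timestamp,
# request, status, size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
        '200 4463 "http://bestcyclingreviews.com/top_online_shops" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"')
record = parse_log_line(line)
```

Lines that fail to match can be routed aside for the filtering step discussed later in the deck.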
Clickstream Analytics
Challenges of Hadoop Implementation
Hadoop Architectural Considerations
• Storage managers?
  – HDFS? HBase?
• Data storage and modeling:
  – File formats? Compression? Schema design?
• Data movement:
  – How do we actually get the data into Hadoop? How do we get it out?
• Metadata:
  – How do we manage data about the data?
• Data access and processing:
  – How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• Orchestration:
  – How do we manage the workflow for all of this?
Architectural Considerations: Data Storage and Modeling
Data Modeling Considerations
• We need to consider the following in our architecture:
  – Storage layer – HDFS? HBase? Etc.
  – File system schemas – how will we lay out the data?
  – File formats – what storage formats to use for our data, both raw and processed?
  – Data compression formats?
Architectural Considerations: Data Modeling – Storage Layer
Data Storage Layer Choices
• Two likely choices for raw data: HDFS or HBase.
• HDFS:
  – Stores data directly as files
  – Fast scans
  – Poor random reads/writes
• HBase:
  – Stores data as HFiles on HDFS
  – Slow scans
  – Fast random reads/writes
Data Storage – Storage Manager Considerations
• Incoming raw data:
  – Processing requirements call for batch transformations across multiple records – for example, sessionization.
• Processed data:
  – Access to processed data will be via things like analytical queries – again requiring access to multiple records.
• We choose HDFS:
  – Processing needs in this case are served better by fast scans.
Architectural Considerations: Data Modeling – Raw Data Storage
Storage Formats – Raw Data and Processed Data
(Diagram: raw data and processed data call for different storage formats.)
Data Storage – Format Considerations
(Diagram: many plain-text log files arriving as the raw data to be stored.)
Data Storage – Compression
(Table: comparison of compression codecs – Snappy is fast but not splittable; other codecs are splittable but come with other trade-offs.)
Data Storage – More About Snappy
• Designed at Google to provide high compression speeds with reasonable compression.
• Not the highest compression, but provides very good performance for processing on Hadoop.
• Snappy is not splittable though, which brings us to…
Hadoop File Types
• Formats designed specifically to store and process data on Hadoop:
  – File-based – SequenceFile
  – Serialization formats – Thrift, Protocol Buffers, Avro
  – Columnar formats – RCFile, ORC, Parquet
SequenceFile
• Stores records as binary key/value pairs.
• SequenceFile “blocks” can be compressed.
• This enables splittability with non-splittable compression.
Avro
• Kinda SequenceFile on Steroids.
• Self-documenting – stores schema in header.
• Provides very efficient storage.
• Supports splittable compression.
Our Format Choices…
• Avro with Snappy:
  – Snappy provides optimized compression.
  – Avro provides compact storage, self-documenting files, and supports schema evolution.
  – Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
  – But they only support Java.
Architectural Considerations: Data Modeling – Processed Data Storage
Access to Processed Data
(Diagram: a table with Columns A–D; analytical queries read only a subset of the columns.)
Columnar Formats
• Eliminate I/O for columns that are not part of a query.
• Work well for queries that access a subset of columns.
• Often provide better compression.
• These add up to dramatically improved performance for many queries.
Row layout: (1, 2014-10-13, abc), (2, 2014-10-14, def), (3, 2014-10-15, ghi)
Columnar layout: [1, 2, 3], [2014-10-13, 2014-10-14, 2014-10-15], [abc, def, ghi]
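To make the benefit concrete, here is a toy Python sketch (not a real file format) contrasting the two layouts: selecting one column touches every record in the row layout, but only one contiguous array in the columnar layout.

```python
# Row-oriented: each record is stored together.
rows = [
    (1, "2014-10-13", "abc"),
    (2, "2014-10-14", "def"),
    (3, "2014-10-15", "ghi"),
]

# Column-oriented: each column is stored contiguously.
columns = {
    "id":    [1, 2, 3],
    "date":  ["2014-10-13", "2014-10-14", "2014-10-15"],
    "value": ["abc", "def", "ghi"],
}

# Query: SELECT date FROM t
# Row layout: must touch every full record to pick out one field.
dates_from_rows = [r[1] for r in rows]

# Columnar layout: read exactly one contiguous array, skipping the rest.
dates_from_columns = columns["date"]
```

Real columnar formats like Parquet add per-column encoding and compression on top of this basic layout, which is where the additional compression wins come from.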
Columnar Choices – RCFile
• Provides efficient processing for MapReduce applications, but generally only used with Hive.
• Only supports Java.
• No Avro support.
• Limited compression support.
• Sub-optimal performance compared to newer columnar formats.
Columnar Choices – ORC
• A better RCFile.
• Supports Hive, but currently limited support for other interfaces.
• Only supports Java.
Columnar Choices – Parquet
• Supports multiple programming interfaces – MapReduce, Hive, Impala, Pig.
• Multiple language support.
• Broad vendor support.
• These features make Parquet a good choice for our processed data.
Architectural Considerations: Data Modeling – HDFS Schema Design
Recommended HDFS Schema Design
• How do we lay out data on HDFS?
  – /user/<username> – user-specific data, JARs, conf files
  – /etl – data in various stages of an ETL workflow
  – /tmp – temp data from tools or shared between users
  – /data – shared data for the entire organization
  – /app – everything but data: UDF JARs, HQL files, Oozie workflows
Architectural Considerations: Data Modeling – Advanced HDFS Schema Design
What is Partitioning?

Un-partitioned HDFS directory structure:
  dataset/file1.txt
  dataset/file2.txt
  …
  dataset/filen.txt

Partitioned HDFS directory structure:
  dataset/col=val1/file.txt
  dataset/col=val2/file.txt
  …
  dataset/col=valn/file.txt
What is Partitioning? – clickstream example

Un-partitioned HDFS directory structure:
  clicks/clicks-2014-01-01.txt
  clicks/clicks-2014-01-02.txt
  …
  clicks/clicks-2014-03-31.txt

Partitioned HDFS directory structure:
  clicks/dt=2014-01-01/clicks.txt
  clicks/dt=2014-01-02/clicks.txt
  …
  clicks/dt=2014-03-31/clicks.txt
Partitioning
• Split the dataset into smaller consumable chunks.
• Rudimentary form of “indexing”.
• Layout: <data set name>/<partition_column_name=partition_column_value>/{files}
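A small Python sketch of this naming convention (the helper name is ours, not part of any Hadoop API):

```python
def partition_path(dataset, **partitions):
    """Build an HDFS-style partitioned path:
    <dataset>/<col1=val1>/<col2=val2>/..."""
    parts = [f"{col}={val}" for col, val in partitions.items()]
    return "/".join([dataset] + parts)

# Mirrors the raw dataset layout used in the case study.
path = partition_path("/etl/BI/casualcyclist/clicks/rawlogs",
                      year=2014, month=10, day=10)
# A query filtered on year/month/day only needs to read this one
# directory (partition pruning) instead of scanning the whole dataset.
```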
Partitioning Considerations
• What column to partition by?
  – HDFS is append-only.
  – Don’t have too many partitions (keep it under roughly 10,000).
  – Don’t have too many small files in the partitions (files should generally be larger than the block size).
• We decided to partition by timestamp.
Partitioning For Our Case Study
• Raw dataset:
  – /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
• Processed dataset:
  – /data/bikeshop/clickstream/year=2014/month=10/day=10
A Word About Partitioning Alternatives
• year=2014/month=10/day=10
• dt=2014-10-10
Architectural Considerations: Data Ingestion
File Transfers
• “hadoop fs -put <file>”
• Reliable, but not resilient to failure.
• Other options include mountable HDFS, for example via NFSv3.
Streaming Ingestion
• Flume
  – Reliable, distributed, and available system for efficient collection, aggregation, and movement of streaming data, e.g. logs.
• Kafka
  – Reliable and distributed publish-subscribe messaging system.
Flume vs. Kafka
• Flume:
  – Purpose-built for Hadoop data ingest.
  – Pre-built sinks for HDFS, HBase, etc.
  – Supports transformation of data in-flight.
• Kafka:
  – General pub-sub messaging framework.
  – Hadoop not supported out of the box; requires a 3rd-party component (Camus).
  – Just a message transport (a very fast one).
Flume vs. Kafka
• Bottom line:
  – Flume is very well integrated with the Hadoop ecosystem and well suited to ingestion of sources such as log files.
  – Kafka is a highly reliable and scalable enterprise messaging system, great for scaling out to multiple consumers.
Short Intro to Flume – components of a Flume Agent:
• Sources – Twitter, logs, JMS, webserver
• Interceptors – mask, re-format, validate…
• Selectors – DR, critical
• Channels – memory, file
• Sinks – HDFS, HBase, Solr
A Quick Introduction to Flume
• Reliable – events are stored in the channel until delivered to the next stage.
• Recoverable – events can be persisted to disk and recovered in the event of failure.

Flume Agent: Source → Channel → Sink → Destination
A Quick Introduction to Flume
• Declarative – no coding required.
  – Configuration specifies how components are wired together.
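As a sketch of what that declarative configuration looks like, here is a minimal Flume properties file wiring a spooling-directory source to an HDFS sink through a file channel (the agent name and paths are made up for illustration):

```properties
# Hypothetical agent "a1": spooldir source -> file channel -> HDFS sink
a1.sources = src1
a1.channels = ch1
a1.sinks = snk1

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /var/log/weblogs/spool
a1.sources.src1.channels = ch1

a1.channels.ch1.type = file

a1.sinks.snk1.type = hdfs
a1.sinks.snk1.hdfs.path = /etl/weblogs/%Y%m%d/
a1.sinks.snk1.hdfs.useLocalTimeStamp = true
a1.sinks.snk1.channel = ch1
```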
A Brief Discussion of Flume Patterns – Fan-in
• A Flume agent runs on each of our servers.
• These agents send data to multiple agents to provide reliability.
• Flume provides support for load balancing.
Sqoop Overview
• Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
• Great for doing bulk imports and exports of data between HDFS, Hive, and HBase and an external data store. Not suited for ingesting event-based data.
Ingestion Decisions
• Historical data:
  – Smaller files: file transfer.
  – Larger files: Flume with the spooling directory source.
• Incoming data:
  – Flume with the spooling directory source.
• Relational data sources – ODS, CRM, etc.:
  – Sqoop
Architectural Considerations: Data Processing – Engines
Data Flow
(Diagram: raw data becomes partitioned clickstream data, which is combined with other data – financial, CRM, etc. – to produce aggregated datasets #1 and #2.)
Processing Engines
• MapReduce
• Abstractions – Pig, Hive, Cascading, Crunch
• Spark
• Impala
MapReduce
• Oldie but goody
• Restrictive framework / innovative workarounds
• Extreme batch
MapReduce – Basic High Level
(Diagram: a Mapper reads a block of data from replicated HDFS, spills temporary data to the native file system as partitioned, sorted data; a Reducer copies that data locally and writes the output file back to HDFS.)
Abstractions
• SQL
  – Hive
• Script/Code
  – Pig: Pig Latin
  – Crunch: Java/Scala
  – Cascading: Java/Scala
Spark
• The new kid that isn’t that new anymore
• Easily 10x less code
• Extremely easy and powerful API
• Very good for machine learning
• Scala, Java, and Python
• RDDs
• DAG engine
Spark – DAG
(Diagram: Spark’s DAG execution model.)
Impala
• Real-time, open source, MPP-style engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses the Hive metastore for metadata
• Access available via JDBC/ODBC
What processing needs to happen?
• Sessionization
• Filtering
• Deduplication
• BI / Discovery
Sessionization
(Diagram: website visits for Visitor 1 split into Session 1 and Session 2 where clicks are more than 30 minutes apart; Visitor 2’s clicks form a single Session 1.)
Why sessionize?
Helps answer questions like:
• What is my website’s bounce rate?
  – i.e. what percentage of visitors don’t go past the landing page?
• Which marketing channels (e.g. organic search, display ads, etc.) are leading to the most sessions?
  – Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
• Attribution analysis – which channels are responsible for the most conversions?
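For example, once sessions exist, bounce rate is a one-liner over per-session click counts. A Python sketch (the input representation – a list of click counts, one per session – is our own simplification):

```python
def bounce_rate(session_click_counts):
    """Percentage of sessions with exactly one page view.

    session_click_counts: list of ints, one entry per session.
    """
    if not session_click_counts:
        return 0.0
    bounces = sum(1 for clicks in session_click_counts if clicks == 1)
    return 100.0 * bounces / len(session_click_counts)

# 2 of 4 sessions had a single page view.
rate = bounce_rate([1, 5, 1, 3])
```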
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user.
2. Given a particular user’s clicks, determine if a given click is part of a new session or a continuation of the previous session.
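These two steps can be sketched in plain Python (the 30-minute timeout comes from the slides; using the IP address alone as the user key is a simplification of the identification options discussed next):

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds

def sessionize(clicks):
    """clicks: list of (user_key, epoch_seconds) tuples.

    Returns {user_key: [[ts, ...], ...]} with one inner list per session.
    """
    # Step 1: group clicks by user.
    by_user = defaultdict(list)
    for user, ts in clicks:
        by_user[user].append(ts)

    # Step 2: within each user, start a new session whenever the gap
    # since the previous click exceeds the timeout.
    sessions = {}
    for user, times in by_user.items():
        times.sort()
        user_sessions = [[times[0]]]
        for prev, cur in zip(times, times[1:]):
            if cur - prev > SESSION_TIMEOUT:
                user_sessions.append([cur])    # new session
            else:
                user_sessions[-1].append(cur)  # continuation
        sessions[user] = user_sessions
    return sessions

clicks = [("244.157.45.12", 0), ("244.157.45.12", 100),
          ("244.157.45.12", 100 + 31 * 60)]
result = sessionize(clicks)
```

The same logic maps directly onto a MapReduce job (group by user key in the shuffle, split sessions in the reducer) or a Spark `groupByKey` followed by the per-user split.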
#1 – Which clicks are from the same user?
• We can use:
  – IP address (244.157.45.12)
  – Cookies (A9A3BECE0563982D)
  – IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
#1 – Which clicks are from the same user?
• In our sample logs, both entries come from IP address 244.157.45.12, so we attribute both clicks to the same user.
#2 – Which clicks are part of the same session?
• The same user’s two sample clicks are timestamped 21:08:30 and 21:59:59 – more than 30 minutes apart, so they belong to different sessions.
Sessionization engine recommendation
• We have sessionization code in MapReduce and Spark on GitHub. The complexity of the code varies, depending on the expertise in the organization.
• We chose MapReduce, since it’s fairly simple and maintainable code.
Filtering – filter out incomplete records
• Example: the second sample log entry is cut off mid-record (it ends at "Mozilla/5.0 (Linux; U…") and should be filtered out.
Filtering – filter out records from bots/spiders
• Example: 209.85.238.11 in the second sample entry is a Google spider IP address, so that record should be filtered out.
Filtering recommendation
• Bot/spider filtering can be done easily in any of the engines.
• Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc.
• Pretty close choice between MR, Hive, and Spark.
• Can be done in Flume interceptors as well.
• We can simply embed this in our sessionization job.
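As an illustration of how simple the bot-filtering logic is in any engine, a Python sketch (the bot IP set and user-agent markers here are illustrative examples, not a real blocklist; in practice these would come from published crawler IP ranges and user-agent databases):

```python
# Hypothetical blocklist for illustration only.
BOT_IPS = {"209.85.238.11"}
BOT_UA_MARKERS = ("bot", "spider", "crawler")

def is_bot(record):
    """record: dict with 'ip' and 'user_agent' keys."""
    if record["ip"] in BOT_IPS:
        return True
    ua = record["user_agent"].lower()
    return any(marker in ua for marker in BOT_UA_MARKERS)

records = [
    {"ip": "244.157.45.12", "user_agent": "Mozilla/5.0 (Macintosh ...)"},
    {"ip": "209.85.238.11", "user_agent": "Mozilla/5.0 (Linux ...)"},
]
# Keep only human traffic.
human_records = [r for r in records if not is_bot(r)]
```

The same predicate could sit in a Flume interceptor, a MapReduce mapper, or a Spark `filter`.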
Deduplication – remove duplicate records
• Example: the two sample log entries are byte-for-byte identical (same IP, timestamp, request, and user agent); one of them should be dropped.
Deduplication recommendation
• Can be done in all engines.
• We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication.
• We use Pig.
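We use Pig in practice; conceptually, though, deduplication is just DISTINCT over whole records. A language-neutral Python sketch of the same idea (order-preserving, over hashable records such as raw log lines):

```python
def deduplicate(records):
    """Order-preserving equivalent of SELECT DISTINCT over whole records.

    records must be hashable, e.g. raw log lines or tuples of fields.
    """
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

lines = ["click A", "click A", "click B"]
deduped = deduplicate(lines)
```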
BI/Discovery engine recommendation
• Main requirements:
  – Low latency
  – SQL interface (e.g. JDBC/ODBC)
  – Users don’t know how to code
• We chose Impala:
  – It’s a SQL engine
  – Much faster than other engines
  – Provides standard JDBC/ODBC interfaces
Architectural Considerations: Orchestration
Orchestrating Clickstream
• Data arrives through Flume.
• Triggers a processing event:
  – Sessionize
  – Enrich – location, marketing channel…
  – Store as Parquet
• Each day we process events from the previous day.
Choosing the Right Orchestration Tool
• Workflow is fairly simple
• Need to trigger workflow based on data
• Be able to recover from errors
• Perhaps notify on the status
• And collect metrics for reporting
Oozie or Azkaban?
Choosing…
• Some of these requirements are easier to meet in Oozie.
Choosing the right Orchestration Tool
• Others are better served by Azkaban.
Important Decision Consideration!
• The best orchestration tool is the one you are an expert on.
Putting It All Together: Final Architecture
Final Architecture – High Level Overview
(Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis)
Final Architecture – Ingestion
(Diagram: each web app sends events through a local Avro agent; these fan in to a tier of Flume agents that write to HDFS. Multiple agents are used for failover and rolling restarts.)
Final Architecture – Storage and Processing
(Diagram: raw logs in /etl/weblogs/20140331/, /etl/weblogs/20140401/, … feed data processing jobs that produce aggregates such as /data/marketing/clickstream/bouncerate/ and /data/marketing/clickstream/attribution/.)
Final Architecture – Data Access
(Diagram: Hive/Impala serve BI/analytics tools over JDBC/ODBC; Sqoop exports to the DWH; a DB import tool pulls data to local disk for R, etc.)
Thank you
Contact info
• Mark Grover
  – @mark_grover
  – www.linkedin.com/in/grovermark
• Jonathan Seidman
  – [email protected]
  – @jseidman
  – www.linkedin.com/in/jaseidman
  – www.slideshare.net/jseidman
• Slides at slideshare.net/hadooparchbook