Dataiku - Paris JUG 2013 - Hadoop is a batch
![Page 1: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/1.jpg)
Hadoop Is A Batch: Pig, Hive, Cascading …
Paris JUG, May 2013, Florian Douetteau
![Page 2: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/2.jpg)
About me
Florian Douetteau
◦ CEO at Dataiku
◦ Freelance at Criteo (Online Ads)
◦ CTO at IsCool Ent. (#1 French Social Gamer)
◦ VP R&D at Exalead (Search Engine Technology)
![Page 3: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/3.jpg)
Dataiku - Pig, Hive and Cascading
Agenda
Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)
![Page 4: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/4.jpg)
CHOOSE TECHNOLOGY
[Figure: map of the data technology landscape]
Regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House
Technologies shown: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, LibSVM, SAS, RapidMiner, SPSS, Pandas, QlikView, Tableau, Spotfire, HTML5/D3, InfiniDB, Vertica, Greenplum, Impala, Netezza, Elasticsearch, Solr, MongoDB, Riak, Membase, Pig, Cascading, Talend, R
![Page 5: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/5.jpg)
How do I (pre)process data?
[Figure: data flow from raw inputs to predictors]
Inputs
◦ Implicit User Data (Views, Searches, …)
◦ Content Data (Title, Categories, Price, …)
◦ Explicit User Data (Click, Buy, …)
◦ User Information (Location, Graph, …)
Data sizes shown: 500TB, 50TB, 1TB, 200GB
Intermediate and output datasets: Transformation Matrix, Transformation Predictor, Per User Stats, Per Content Stats, User Similarity, Rank Predictor, Content Similarity, A/B Test Data, Predictor Runtime, Online User Information
![Page 6: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/6.jpg)
Typical Use Case 1: Web Analytics Processing
Analyse raw logs (trackers, web logs)
Extract IP, page, …
Detect and remove robots
Build statistics
◦ Number of page views per product
◦ Best referrers
◦ Traffic analysis
◦ Funnel
◦ SEO analysis
◦ …
![Page 7: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/7.jpg)
Typical Use Case 2: Mining Search Logs for Synonyms
Extract query logs
Perform query normalization
Compute n-grams
Compute search "sessions"
Compute the log-likelihood ratio for n-grams across sessions
![Page 8: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/8.jpg)
Typical Use Case 3: Product Recommender
Compute the user-product association matrix
Compute different similarity ratios (Ochiai, cosine, …)
Filter out bad predictions
For each user, select the best recommendable products
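As an illustration of the similarity step, here is a minimal Python sketch of the Ochiai coefficient between two products, each represented by the set of users associated with it. The `buyers` data and the function name are hypothetical and not taken from the talk; on binary data the Ochiai coefficient coincides with cosine similarity.

```python
from math import sqrt

def ochiai(users_a, users_b):
    """Ochiai coefficient between two sets of user ids: |A & B| / sqrt(|A| * |B|)."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))

# Hypothetical user-product association: product -> set of users who bought it.
buyers = {
    "p1": {"u1", "u2", "u3", "u4"},
    "p2": {"u1", "u2"},
    "p3": {"u5"},
}

sim_p1_p2 = ochiai(buyers["p1"], buyers["p2"])  # two shared users
sim_p1_p3 = ochiai(buyers["p1"], buyers["p3"])  # no shared user
```

In a real recommender the pairs with a low score would then be filtered out before selecting products per user.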
![Page 9: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/9.jpg)
![Page 10: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/10.jpg)
Pig History
Yahoo Research in 2006
Inspired from Sawzall, a Google paper from 2003
2007 as an Apache Project
Initial motivation
◦ Search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …
words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t')
    AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
![Page 11: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/11.jpg)
Hive History
Developed by Facebook in January 2007
Open source in August 2008
Initial motivation
◦ Provide a SQL-like abstraction to perform statistics on status updates
create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';
![Page 12: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/12.jpg)
Cascading History
Authored by Chris Wensel, 2008
Associated projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter in 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading
![Page 13: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/13.jpg)
![Page 14: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/14.jpg)
MapReduce: Simplicity is a complexity
![Page 15: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/15.jpg)
Pig & Hive: Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
-- the filter predicate on type is not shown in full on the slide
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
-- after GROUP, the key is named "group" and the rows are in the bag "events_filtered"
price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
* VAT excluded
Job 1: Mapper (LOAD, FILTER) -> GROUP (shuffle and sort by user) -> Reducer (FOREACH, FILTER)
![Page 16: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/16.jpg)
Pig & Hive: Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
-- the filter predicate on type is not shown in full on the slide
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
-- after GROUP, the key is named "group" and the rows are in the bag "events_filtered"
price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';
Job 1: Mapper (LOAD, FILTER) -> GROUP (shuffle and sort by user) -> Reducer (FOREACH, FILTER)
Job 2: Mapper (LOAD from tmp) -> shuffle and sort by max_ts -> Reducer (STORE)
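To make the two-job split concrete, the following is a rough in-memory Python simulation of the same script. It is only a sketch: the sample events are invented, and because the slide does not show the full filter predicate, keeping only "buy" events is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical events: (type, user, price, timestamp).
events = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 400, 30),
    ("view", "alice", 0, 40),
    ("buy", "bob", 500, 20),
    ("buy", "carol", 1500, 5),
]

def job1(rows):
    # Map side: LOAD + FILTER (assumed predicate: keep "buy" events).
    mapped = [(user, (price, ts)) for (etype, user, price, ts) in rows if etype == "buy"]
    # Shuffle: group the map output by key (user).
    groups = defaultdict(list)
    for user, value in mapped:
        groups[user].append(value)
    # Reduce side: FOREACH (aggregate per user) + FILTER on the aggregate.
    out = []
    for user, values in groups.items():
        total = sum(price for price, _ in values)
        max_ts = max(ts for _, ts in values)
        if total > 1000:
            out.append((user, total, max_ts))
    return out

def job2(rows):
    # ORDER needs a second shuffle because its key (max_ts) differs from job 1's key (user).
    return sorted(rows, key=lambda row: row[2], reverse=True)

recent_high = job2(job1(events))
```

The point of the sketch is that FILTER and FOREACH ride along inside an existing map or reduce phase, while each change of grouping or sort key costs a new job.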
![Page 17: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/17.jpg)
Pig: How does it work
Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
![Page 18: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/18.jpg)
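Hive Joins: How to join with MapReduce?

Mappers output, table 1

| tbl_idx | uid | name |
|---|---|---|
| 1 | 1 | Dupont |
| 1 | 2 | Durand |

Mappers output, table 2

| tbl_idx | uid | type |
|---|---|---|
| 2 | 1 | Type1 |
| 2 | 1 | Type2 |
| 2 | 2 | Type1 |

Shuffle by uid, sort by (uid, tbl_idx)

Reducer 1 input

| Uid | Tbl_idx | Name | Type |
|---|---|---|---|
| 1 | 1 | Dupont | |
| 1 | 2 | | Type1 |
| 1 | 2 | | Type2 |

Reducer 2 input

| Uid | Tbl_idx | Name | Type |
|---|---|---|---|
| 2 | 1 | Durand | |
| 2 | 2 | | Type1 |

Reducer 1 output

| Uid | Name | Type |
|---|---|---|
| 1 | Dupont | Type1 |
| 1 | Dupont | Type2 |

Reducer 2 output

| Uid | Name | Type |
|---|---|---|
| 2 | Durand | Type1 |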
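The join above can be sketched in plain Python as a reduce-side join. This is a simplified single-process simulation, not Hive's actual implementation: the function names are made up for the example, and the shuffle is replaced by one sort on (uid, tbl_idx).

```python
from itertools import groupby

users = [(1, "Dupont"), (2, "Durand")]                # table 1: (uid, name)
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]    # table 2: (uid, type)

def map_phase():
    # Tag every record with its table index; the key is (uid, tbl_idx).
    for uid, name in users:
        yield (uid, 1), name
    for uid, typ in types:
        yield (uid, 2), typ

def reduce_side_join():
    # Sorting by (uid, tbl_idx) guarantees that, for a given uid,
    # rows from table 1 reach the reducer before rows from table 2.
    records = sorted(map_phase(), key=lambda kv: kv[0])
    joined = []
    for uid, group in groupby(records, key=lambda kv: kv[0][0]):
        names = []
        for (_, tbl_idx), value in group:
            if tbl_idx == 1:
                names.append(value)          # buffer the table 1 side
            else:
                joined.extend((uid, name, value) for name in names)
    return joined

result = reduce_side_join()
```

Only the first table has to be buffered per key, which is why the sort on tbl_idx matters.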
![Page 19: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/19.jpg)
![Page 20: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/20.jpg)
Comparing without Comparable
Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
![Page 21: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/21.jpg)
Procedural vs Declarative
Declarative (SQL): transformation as a set of formulas
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
Procedural (Pig): transformation as a sequence of operations
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
![Page 22: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/22.jpg)
Data Type and Model: Rationale
All three extend the basic data model with extended data types
◦ array-like [ event1, event2, event3 ]
◦ map-like { type1:value1, type2:value2, … }
Different approaches
◦ Resilient schema
◦ Static typing
◦ No static typing
![Page 23: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/23.jpg)
Hive: Data Type and Schema

| Simple type | Details |
|---|---|
| TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes |
| FLOAT, DOUBLE | 4 and 8 bytes |
| BOOLEAN | |
| STRING | Arbitrary-length, replaces VARCHAR |
| TIMESTAMP | |

| Complex type | Details |
|---|---|
| ARRAY | Array of typed items (0-indexed) |
| MAP | Associative map |
| STRUCT | Complex class-like objects |

CREATE TABLE visit (
    user_name STRING,
    user_id INT,
    user_details STRUCT<age:INT, zipcode:INT>
);
![Page 24: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/24.jpg)
Pig: Data Types and Schema
rel = LOAD '/folder/path/'
    USING PigStorage('\t')
    AS (col:type, col:type, col:type);

| Simple type | Details |
|---|---|
| int, long, float, double | 32 and 64 bits, signed |
| chararray | A string |
| bytearray | An array of … bytes |
| boolean | A boolean |

| Complex type | Details |
|---|---|
| tuple | A tuple is an ordered fieldname:value map |
| bag | A bag is a set of tuples |
![Page 25: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/25.jpg)
Cascading: Data Type and Schema
Support for any Java type, provided it can be serialized in Hadoop
No support for typing

| Simple type | Details |
|---|---|
| Int, Long, Float, Double | 32 and 64 bits, signed |
| String | A string |
| byte[] | An array of … bytes |
| Boolean | A boolean |

| Complex type | Details |
|---|---|
| Object | Object must be "Hadoop serializable" |
![Page 26: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/26.jpg)
Style Summary

| | Style | Typing | Data Model | Metadata store |
|---|---|---|---|---|
| Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog) |
| Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated |
| Cascading | Procedural | Weak | scalar + Java objects | No |
![Page 27: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/27.jpg)
![Page 28: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/28.jpg)
Headachability: Motivation
Does debugging the tool lead to bad headaches?
![Page 29: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/29.jpg)
Headaches: Pig
Out of memory error (reducer)
Exception in building / extended functions (handling of null)
Null vs ""
Nested FOREACH and scoping
Date management (Pig 0.10)
Field implicit ordering
![Page 30: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/30.jpg)
A Pig Error
![Page 31: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/31.jpg)
Headaches: Hive
Out of memory errors in reducers
Few debugging options
Null / ""
No built-in "first"
![Page 32: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/32.jpg)
Headaches: Cascading
Weak typing errors (comparing Int and String, …)
Illegal operation sequence (group after group, …)
Field implicit ordering
![Page 33: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/33.jpg)
Testing: Motivation
How to perform unit tests?
How to have different versions of the same script (parameter)?
![Page 34: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/34.jpg)
Testing: Pig
System variables
Comment to test
No metaprogramming
pig -x local to execute on local files
![Page 35: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/35.jpg)
Testing / Environment: Cascading
JUnit tests are possible
Ability to use code to actually comment out some variables
![Page 36: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/36.jpg)
Checkpointing: Motivation
Lots of iterations while developing on Hadoop
Sometimes jobs fail
Sometimes you need to restart from the start …
[Figure: pipeline steps Parse Logs, Filtering, Per Page Stats, Page User Correlation, Output; one step is marked FAIL, followed by "FIX and relaunch"]
![Page 37: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/37.jpg)
Pig: Manual Checkpointing
STORE command to manually store files
[Figure: pipeline steps Parse Logs, Filtering, Per Page Stats, Page User Correlation, Output]
To resume: comment out the beginning of the script and relaunch
![Page 38: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/38.jpg)
Cascading: Automated Checkpointing
Ability to re-run a flow automatically from the last saved checkpoint
addCheckpoint(…)
![Page 39: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/39.jpg)
Cascading: Topological Scheduler
Check the timestamp of each intermediate file
Execute a step only if its inputs are more recent
[Figure: pipeline steps Parse Logs, Filtering, Per Page Stats, Page User Correlation, Output]
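The timestamp rule can be illustrated with a small Python sketch in the spirit of make. It does not use the Cascading API; `needs_rebuild`, `run_pipeline` and the shape of a step are hypothetical names chosen for the example, and local file modification times stand in for HDFS timestamps.

```python
import os

def needs_rebuild(inputs, output):
    """Return True if the output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)

def run_pipeline(steps):
    """Run steps given in dependency order; return the names of the steps executed.

    Each step is a tuple (name, input_paths, output_path, action), where
    action is a zero-argument callable that writes output_path.
    """
    executed = []
    for name, inputs, output, action in steps:
        if needs_rebuild(inputs, output):
            action()
            executed.append(name)
    return executed
```

Because steps are visited in dependency order, rebuilding one intermediate file makes it newer than everything downstream, so the later steps are re-executed as well.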
![Page 40: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/40.jpg)
Productivity Summary

| | Headaches | Checkpointing / Replay | Testing / Metaprogramming |
|---|---|---|---|
| Pig | Lots | Manual save | Difficult metaprogramming, easy local testing |
| Hive | Few, but without debugging options | None (that's SQL) | None (that's SQL) |
| Cascading | Weak typing, complexity | Checkpointing, partial updates | Possible |
![Page 41: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/41.jpg)
![Page 42: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/42.jpg)
Formats Integration: Motivation
Ability to integrate different file formats
◦ Text delimited
◦ Sequence File (binary Hadoop format)
◦ Avro, Thrift, …
Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)

Format impact on size and performance

| Format | Size on disk (GB) | Hive processing time (24 cores) |
|---|---|---|
| Text file, uncompressed | 18.7 | 1m32s |
| 1 text file, gzipped | 3.89 | 6m23s (no parallelization) |
| JSON, compressed | 7.89 | 2m42s |
| Multiple text files, gzipped | 4.02 | 43s |
| Sequence File, block, gzip | 5.32 | 1m18s |
| Text file, LZO indexed | 7.03 | 1m22s |
![Page 43: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/43.jpg)
Format Integration
Hive: SerDe (Serializer-Deserializer)
Pig: Storage
Cascading: Tap
![Page 44: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/44.jpg)
Partitions: Motivation
No support for "UPDATE" patterns: any increment is performed by adding or deleting a partition
Common partition schemes on Hadoop
◦ By date: /apache_logs/dt=2013-01-23
◦ By data center: /apache_logs/dc=redbus01/…
◦ By country
◦ …
◦ Or any combination of the above
![Page 45: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/45.jpg)
Hive Partitioning: Partitioned tables
CREATE TABLE event (
    user_id INT,
    type STRING,
    message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1
INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
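The directory convention shown above (one `column=value` folder per partition column) can be sketched in Python as follows. The helper names are hypothetical and the code only manipulates path strings; it does not talk to Hive or HDFS.

```python
def partition_path(table_root, partition_cols, values):
    """Build the directory of a partition, e.g. <root>/day=.../server_id=..."""
    parts = [f"{col}={values[col]}" for col in partition_cols]
    return "/".join([table_root.rstrip("/")] + parts)

def parse_partition(path, table_root):
    """Recover the partition values from a file path under table_root."""
    relative = path[len(table_root.rstrip("/")) + 1:]
    return dict(seg.split("=", 1) for seg in relative.split("/") if "=" in seg)

path = partition_path(
    "/hive/event",
    ["day", "server_id"],
    {"day": "2013-01-27", "server_id": "s1"},
)
values = parse_partition(path + "/file0", "/hive/event")
```

Since the partition values live in the path rather than in the files, adding or dropping a day of data is a directory operation, which matches the "no UPDATE" pattern described earlier.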
![Page 46: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/46.jpg)
Cascading: Partition
No direct support for partitions
Support for a "Glob" Tap, to read from files using patterns
You can code your own custom or virtual partition schemes
![Page 47: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/47.jpg)
External Code Integration: Simple UDF
[Figure: code samples for Pig, Hive and Cascading]
![Page 48: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/48.jpg)
Hive: Complex UDF (Aggregators)
![Page 49: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/49.jpg)
Cascading: Direct Code Evaluation
Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
![Page 50: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/50.jpg)
Spring Batch Cascading Integration
Allows calling a Cascading flow from a Spring Batch job
No full integration with Spring MessageSource or MessageHandler yet (only for local flows)
![Page 51: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/51.jpg)
Integration Summary

| | Partition / Incremental updates | External code | Format integration |
|---|---|---|---|
| Pig | No direct support | Simple | Doable and rich community |
| Hive | Fully integrated, SQL-like | Very simple, but complex dev setup | Doable and existing community |
| Cascading | With coding | Complex UDFs but regular, and Java expressions embeddable | Doable and growing community |
![Page 52: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/52.jpg)
![Page 53: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/53.jpg)
Optimization
Several common MapReduce optimization patterns
◦ Combiners
◦ MapJoin
◦ Job fusion
◦ Job parallelism
◦ Reducer parallelism
Different support per framework
◦ Fully automatic
◦ Pragma / directives / options
◦ Coding style / code to write
![Page 54: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/54.jpg)
Combiner: Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Figure: without a combiner, every input row is sent from the map stage to the reduce stage]
Mapper 1 rows: 2012-02-14 4354, …, 2012-02-15 21we2
Mapper 2 rows: 2012-02-14 qa334, …, 2012-02-15 23aq2
Reduce output
2012-02-14 20
2012-02-15 35
2012-02-16 1
![Page 55: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/55.jpg)
Combiner: Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Figure: with a combiner, each mapper sends partial counts to the reduce stage]
Mapper 1 rows: 2012-02-14 4354, …, 2012-02-15 21we2
Mapper 1 partial counts: 2012-02-14 12, 2012-02-15 23, 2012-02-16 1
Mapper 2 rows: 2012-02-14 qa334, …, 2012-02-15 23aq2
Mapper 2 partial counts: 2012-02-14 8, 2012-02-15 12
Reduce output
2012-02-14 20
2012-02-15 35
2012-02-16 1
Reduced network bandwidth. Better parallelism.
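The effect of a combiner can be reproduced with a short Python simulation. The input rows are invented so that the partial counts match the figures on the slide (12 + 8 and 23 + 12), and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical mapper inputs: each mapper reads (date, product_id) rows.
mapper_inputs = [
    [("2012-02-14", "a")] * 12 + [("2012-02-15", "b")] * 23 + [("2012-02-16", "c")],
    [("2012-02-14", "d")] * 8 + [("2012-02-15", "e")] * 12,
]

def map_only(rows):
    # Without a combiner: one (date, 1) pair per input row crosses the network.
    return [(date, 1) for date, _ in rows]

def map_with_combiner(rows):
    # With a combiner: counts are pre-aggregated per date on the mapper side.
    return list(Counter(date for date, _ in rows).items())

def reduce_counts(pairs):
    # The reducer sums whatever it receives, partial counts or single ones.
    totals = Counter()
    for date, n in pairs:
        totals[date] += n
    return dict(totals)

plain = [kv for rows in mapper_inputs for kv in map_only(rows)]
combined = [kv for rows in mapper_inputs for kv in map_with_combiner(rows)]
```

Both variants reduce to the same totals; the difference is the number of pairs shuffled, here 56 without the combiner against 5 with it. This works because COUNT is associative and commutative, so summing partial counts is safe.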
![Page 56: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/56.jpg)
Join Optimization: Map Join
Hive
set hive.auto.convert.join = true;
Pig
Cascading (no aggregation support after HashJoin)
![Page 57: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/57.jpg)
Number of Reducers
Critical for performance
Estimated from the size of the input files
◦ Hive: input size divided by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig: input size divided by pig.exec.reducers.bytes.per.reducer (default 1GB)
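The estimate described above amounts to a ceiling division. A minimal Python sketch follows, assuming the 1GB default quoted on the slide; the upper bound `max_reducers` and its value of 999 are illustrative assumptions for this example, not something stated in the talk.

```python
import math

GB = 1024 ** 3

def estimate_reducers(input_bytes, bytes_per_reducer=1 * GB, max_reducers=999):
    """Estimate the reducer count: ceil(input size / bytes per reducer),
    kept between 1 and max_reducers (the cap is an assumption of this sketch)."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

reducers_for_50gb = estimate_reducers(50 * GB)
```

The estimate only looks at input size, so a job whose map stage filters or expands the data heavily can still end up with too many or too few reducers, which is where the "DIY" setting comes in.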
![Page 58: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/58.jpg)
Performance & Optimization Summary

| | Combiner optimization | Join optimization | Number of reducers optimization |
|---|---|---|---|
| Pig | Automatic | Option | Estimate or DIY |
| Cascading | DIY | HashJoin | DIY |
| Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY |
![Page 59: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/59.jpg)
![Page 60: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/60.jpg)
Follow the Flow
[Figure: data flow diagram with "Sync In" and "Sync Out" stages]
Systems shown: Tracker Log, MongoDB, MySQL, Syslog, Apache Logs, S3, Partner FTP, ElasticSearch
Datasets and jobs shown: Product Catalog, Order, Session, Product Transformation, Category Affinity, Category Targeting, Customer Profile, Product Recommender, Search Logs, (External) Search Engine Optimization, (Internal) Search Ranking
Tools shown: Pig, Hive
![Page 61: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/61.jpg)
E.g. Product Recommender
[Figure: data flow of the product recommender]
Inputs: Page Views, Orders, Catalog, Bots and Special Users
Intermediate datasets: Filtered Page Views, User Affinity, Product Popularity, Order Summary, User Similarity (Per Category), User Similarity (Per Brand), Recommendation Graph
Steps and outputs: Machine Learning, Recommendation
![Page 62: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/62.jpg)
Pain Points on Large Projects
Schema maintenance between tools
Proper incremental and efficient synchronization between tools, NoSQL stores and log systems
Proper partition "management" (daily jobs, …)
Job sequence and management
◦ How to properly handle a new field? Missing data? Recompute everything?
![Page 63: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/63.jpg)
Integration Option: HCatalog
HCatalog provides interoperability between Hive and Pig in terms of schema
[Figure: Hive and Pig both sitting on top of HCatalog]
![Page 64: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/64.jpg)
Similar to "Build"
Build tools: 1970 Shell script, 1977 Makefile, 1980 Makedeps, 1999 Cons/CMake, 2001 Maven, 2004 Ivy, 2008 Gradle
Hadoop workflows: Shell script, 2008 HaMake, 2009 Oozie, … ETL Hadoop Next … ?
![Page 65: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/65.jpg)
![Page 66: Dataiku - Paris JUG 2013 - Hadoop is a batch](https://reader036.fdocuments.us/reader036/viewer/2022062510/548d88acb47959115f8b469b/html5/thumbnails/66.jpg)
Want to keep close to SQL?
◦ Hive
Want to write large flows?
◦ Pig
Want to integrate in large-scale programming projects?
◦ Cascading (Cascalog / Scalding)
Presentation available on http://www.slideshare.net/Dataiku