Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas Neumann
Andreas Neumann
Oozie – Workflow for Hadoop
- 2 -
Who Am I?
Dr. Andreas Neumann
Software Architect, Yahoo!
anew <at> yahoo-inc <dot> com
At Yahoo! (2008-present)
- Grid architecture
- Content Platform
- Research
At IBM (2000-2008)
- Database (DB2) Development
- Enterprise Search
- 3 -
Oozie Overview
Main Features
– Execute and monitor workflows in Hadoop
– Periodic scheduling of workflows
– Trigger execution by data availability
– HTTP and command line interface + Web console
Adoption
– ~100 users on the mailing list since launch on GitHub
– In production at Yahoo!, running >200K jobs/day
- 4 -
Oozie Workflow Overview
Purpose:
Execution of workflows on the Grid
[Diagram: clients call Oozie's WS API; Oozie runs as a Tomcat web-app backed by a DB and submits jobs to Hadoop/Pig/HDFS]
- 5 -
Oozie Workflow
[Diagram: example workflow DAG — start, Java Main, M/R streaming job, decision (MORE/ENOUGH), fork, Pig job, M/R job, join, Java Main, FS job, end; transitions labeled OK]
Directed Acyclic Graph of Jobs
- 6 -
Oozie Workflow Example
<workflow-app name="wordcount-wf">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill"/>
  <end name="end"/>
</workflow-app>
[Diagram: Start → M-R wordcount → End on OK; M-R wordcount → Kill on Error]
- 7 -
Oozie Workflow Nodes
• Control Flow:
  – start/end/kill
  – decision
  – fork/join
• Actions:
  – map-reduce
  – pig
  – hdfs
  – sub-workflow
  – java – run custom Java code
- 8 -
Oozie Workflow Application
An HDFS directory containing:
– Definition file: workflow.xml
– Configuration file: config-default.xml
– App files: lib/ directory with JAR and SO files
– Pig Scripts
- 9 -
Running an Oozie Workflow Job
Application Deployment:
$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount
Workflow Job Parameters:
$ cat job.properties
oozie.wf.application.path = hdfs://bar.com:9000/usr/abc/wordcount
input = /usr/abc/input-data
output = /user/abc/output-data
Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-W
- 10 -
Monitoring an Oozie Workflow Job
Workflow Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-W
------------------------------------------------------------------------
Workflow Name : wordcount-wf
App Path : hdfs://bar.com:9000/usr/abc/wordcount
Status : RUNNING
…
Workflow Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-W
Workflow Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-W
- 11 -
Oozie Coordinator Overview
Purpose:
– Coordinated execution of workflows on the Grid
– Workflows are backwards compatible
[Diagram: Oozie Client → WS API → Oozie Coordinator, which checks data availability and triggers Oozie Workflow; Oozie runs in Tomcat on top of Hadoop]
- 12 -
Oozie Application Lifecycle
[Diagram: between start and end, the Coordinator Job materializes actions Action0, Action1, …, ActionN at nominal times 0*f, 1*f, …, N*f; the Oozie Coordinator Engine creates and starts each action, and the Oozie Workflow Engine runs the corresponding workflow (WF)]
- 13 -
Use Case 1: Time Triggers
• Execute your workflow every 15 minutes (CRON)
00:15 00:30 00:45 01:00
- 14 -
Example 1: Run Workflow every 15 mins
<coordinator-app name="coord1" start="2009-01-08T00:00Z"
                 end="2010-01-01T00:00Z" frequency="15"
                 xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>key1</name><value>value1</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
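To make the schedule concrete, here is a small Python sketch of the materialization times such a coordinator would produce. This models the semantics only; the helper function is invented, not an Oozie API:

```python
from datetime import datetime, timedelta

def materialization_times(start, end, freq_minutes, limit):
    """List the first `limit` nominal times at which a coordinator
    with the given start/end window and frequency creates actions."""
    times = []
    t = start
    while t < end and len(times) < limit:
        times.append(t)
        t += timedelta(minutes=freq_minutes)
    return times

first_four = materialization_times(datetime(2009, 1, 8, 0, 0),
                                   datetime(2010, 1, 1, 0, 0),
                                   freq_minutes=15, limit=4)
print([t.strftime("%Y-%m-%dT%H:%MZ") for t in first_four])
# ['2009-01-08T00:00Z', '2009-01-08T00:15Z', '2009-01-08T00:30Z', '2009-01-08T00:45Z']
```

Each nominal time corresponds to one materialized workflow action, whether or not the previous one has finished.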
- 15 -
Use Case 2: Time and Data Triggers
• Materialize your workflow every hour, but only run it when the input data is ready.
01:00 02:00 03:00 04:00
Hadoop
Input Data Exists?
- 16 -
Example 2: Data Triggers
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}"
             initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
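The `${current(n)}` resolution can be sketched in Python: instance n is counted in dataset-frequency steps relative to the latest instance at or before the action's nominal time. This is an illustrative model of the semantics (the function and constants are invented, not Oozie APIs):

```python
from datetime import datetime, timedelta

INITIAL = datetime(2009, 1, 1, 0, 0)   # initial-instance of the "logs" dataset
FREQ = timedelta(hours=1)              # dataset frequency

def current(n, nominal_time):
    """Resolve ${current(n)} to a concrete dataset URI for a given
    coordinator action nominal time."""
    elapsed = (nominal_time - INITIAL) // FREQ   # whole instances so far
    t = INITIAL + (elapsed + n) * FREQ
    return t.strftime("hdfs://bar:9000/app/logs/%Y/%m/%d/%H")

print(current(0, datetime(2009, 1, 2, 5, 0)))
# hdfs://bar:9000/app/logs/2009/01/02/05
```

With `<instance>${current(0)}</instance>`, the coordinator waits until exactly that one hourly directory exists before starting the workflow.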
- 17 -
Use Case 3: Rolling Windows
• Access 15 minute datasets and roll them up into hourly datasets
00:15 00:30 00:45 01:00
01:00
01:15 01:30 01:45 02:00
02:00
- 18 -
Example 3: Rolling Windows
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="15"
             initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-3)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
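The range `${current(-3)}` to `${current(0)}` resolves to the four most recent 15-minute instances at each hourly run. A Python sketch of that resolution, modeling the semantics only (the function is invented, not an Oozie API):

```python
from datetime import datetime, timedelta

INITIAL = datetime(2009, 1, 1, 0, 0)   # initial-instance of the dataset
FREQ = timedelta(minutes=15)           # dataset frequency

def instances(start_n, end_n, nominal_time):
    """Resolve ${current(start_n)}..${current(end_n)} to instance times."""
    elapsed = (nominal_time - INITIAL) // FREQ
    return [(INITIAL + (elapsed + n) * FREQ).strftime("%H:%M")
            for n in range(start_n, end_n + 1)]

# Hourly coordinator run at 01:00 rolls up the preceding four quarter-hours:
print(instances(-3, 0, datetime(2009, 1, 1, 1, 0)))
# ['00:15', '00:30', '00:45', '01:00']
```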
- 19 -
Use Case 4: Sliding Windows
• Access last 24 hours of data, and roll them up every hour.
[Timeline: each hourly run (at 24:00, +1 day 01:00, +1 day 02:00, …) consumes the preceding 24 hourly instances, so consecutive windows overlap by 23 hours]
- 20 -
Example 4: Sliding Windows
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}"
             initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-23)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
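With `${current(-23)}..${current(0)}` on an hourly dataset, each run sees a 24-instance window that slides forward by one hour per run. A Python sketch of that behavior, modeling the semantics only (the function is invented, not an Oozie API):

```python
from datetime import datetime, timedelta

INITIAL = datetime(2009, 1, 1, 0, 0)   # initial-instance of the dataset
FREQ = timedelta(hours=1)              # dataset frequency

def window(nominal_time):
    """Resolve ${current(-23)}..${current(0)} to the 24 instance times
    a sliding-window action consumes at the given nominal time."""
    elapsed = (nominal_time - INITIAL) // FREQ
    return [INITIAL + (elapsed + n) * FREQ for n in range(-23, 1)]

run1 = window(datetime(2009, 1, 2, 0, 0))
run2 = window(datetime(2009, 1, 2, 1, 0))   # one hour later
print(len(run1), len(set(run1) & set(run2)))
# 24 23
```

Consecutive runs share 23 of their 24 instances, which is what distinguishes a sliding window from the non-overlapping rolling windows of Example 3.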
- 21 -
Oozie Coordinator Application
An HDFS directory containing:
– Definition file: coordinator.xml
– Configuration file: coord-config-default.xml
- 22 -
Running an Oozie Coordinator Job
Application Deployment:
$ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job
Coordinator Job Parameters:
$ cat job.properties
oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job
Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-C
- 23 -
Monitoring an Oozie Coordinator Job
Coordinator Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-C
------------------------------------------------------------------------
Job Name : wordcount-coord
App Path : hdfs://bar.com:9000/usr/abc/coord_job
Status : RUNNING
…
Coordinator Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-C
Coordinator Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-C
- 24 -
Oozie Web Console: List Jobs
- 25 -
Oozie Web Console: Job Details
- 26 -
Oozie Web Console: Failed Action
- 27 -
Oozie Web Console: Error Messages
- 28 -
What’s Next For Oozie?
New Features
– More out-of-the-box actions: distcp, hive, …
– Authentication framework
  • Authenticate a client with Oozie
  • Authenticate an Oozie workflow with downstream services
– Bundles: Manage multiple coordinators together
– Asynchronous data sets and coordinators
Scalability
– Memory footprint
– Data notification instead of polling
Integration with Howl (http://github.com/yahoo/howl)
- 29 -
We Need You!
Oozie is Open Source
• Source: http://github.com/yahoo/oozie
• Docs: http://yahoo.github.com/oozie
• List: http://tech.groups.yahoo.com/group/Oozie-users/
To Contribute:
• https://github.com/yahoo/oozie/wiki/How-To-Contribute
Thank You!
github.com/yahoo/oozie/wiki/How-To-Contribute