October 2014 HUG : Oozie HA
Oozie High Availability
Hadoop User Group meetup 10/15/14
Robert Kanter (Cloudera)
Ryota Egashira (Yahoo!)
Agenda
● What is High Availability?
● Architectural Overview
● Security
● Authentication
● HCatalog Integration in HA
● Other Challenges in HA
● Future Work
What is High Availability?
● A system without unplanned downtime when partial failures occur
  o Typically achieved by having redundancies and removing single points of failure
● Our Goals
  o Don’t change the API or usage patterns
  o User doesn’t even have to know it’s HA
Architectural Overview: Database
● Oozie stores most of its state in a database
  o (submitted jobs, workflow definitions, etc.)
● Instead of a failover model, we want to run many Oozie servers against the same database
  o Active-Active HA
  o Also provides horizontal scalability
● Zookeeper for coordination
Architectural Overview: Access
● Users and client programs need a single address to connect to
  o Web UI, REST/Java API, JobTracker/ResourceManager callbacks, etc.
● Load balancer, Virtual IP, or DNS round-robin can be used to provide a single entry point to the Oozie servers
  o Technically, this entry point also needs to be HA
Architectural Overview: Log Streaming
● Oozie’s log files are not in the database
  o Each Oozie server only has access to its own logs
● Jobs are not assigned to a specific Oozie server
● What if the user asks Oozie Server 1 for logs for a job processed by Oozie Server 2?
  o Oozie Server 1 can ask Oozie Server 2 for its logs
● Caveat: If an Oozie Server goes down, any logs from it will be unavailable
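The log-streaming behaviour on this slide can be sketched as a small in-memory simulation. This is only an illustration, not Oozie's implementation: the `OozieServer` class, its `local_logs` and `peers` attributes, and the `get_logs` method are hypothetical names, and a direct method call stands in for the inter-server request.

```python
class OozieServer:
    """Hypothetical stand-in for one Oozie server and its local log files."""

    def __init__(self, name):
        self.name = name
        self.local_logs = {}  # job_id -> log lines written by this server only
        self.peers = []       # the other servers in the HA group
        self.up = True

    def log(self, job_id, line):
        self.local_logs.setdefault(job_id, []).append(f"[{self.name}] {line}")

    def get_logs(self, job_id, ask_peers=True):
        # Start with whatever this server logged itself.
        lines = list(self.local_logs.get(job_id, []))
        if ask_peers:
            # Ask every reachable peer for its share of the job's logs.
            for peer in self.peers:
                if peer.up:
                    lines.extend(peer.get_logs(job_id, ask_peers=False))
        return lines


server1, server2 = OozieServer("oozie-1"), OozieServer("oozie-2")
server1.peers, server2.peers = [server2], [server1]

# The job was processed by server 2, but the user asks server 1.
server2.log("job-42", "action started")
logs_while_up = server1.get_logs("job-42")

# The caveat: once server 2 is down, its logs cannot be fetched.
server2.up = False
logs_after_failure = server1.get_logs("job-42")
```

Here `logs_while_up` contains the line written by server 2, while `logs_after_failure` is empty because the only copy of that log lives on the server that went down.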
Security
● Existing security features continue to work
● Authentication tokens
  o Signed cookies for authenticating users to the Oozie server
  o Each Oozie server uses its own randomly generated secret
  o Problem: Won’t accept cookies signed by other Oozie servers
● Hadoop-auth in Hadoop 2.6.0 will add support for pluggable secret providers
  o Includes a Zookeeper-backed implementation that synchronizes a rolling, randomly generated secret across multiple servers
  o No locking required!
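The cookie problem on this slide can be illustrated with a minimal signing sketch. It assumes HMAC-SHA256 and a `value&s=signature` cookie layout purely for illustration; the helper names `sign`, `verify`, and `verify_rolling` are hypothetical and this is not the hadoop-auth code or its actual cookie format.

```python
import hashlib
import hmac
import secrets


def sign(value: str, secret: bytes) -> str:
    """Append an HMAC of the value (illustrative cookie layout)."""
    mac = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}&s={mac}"


def verify(cookie: str, secret: bytes) -> bool:
    """Check that the cookie was signed with the given secret."""
    value, sep, mac = cookie.rpartition("&s=")
    if not sep:
        return False
    expected = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, expected)


def verify_rolling(cookie: str, current: bytes, previous: bytes) -> bool:
    """With a rolling secret, accept cookies signed by the current or previous one."""
    return verify(cookie, current) or verify(cookie, previous)


# Per-server secrets: a cookie issued by server 1 is rejected by server 2.
secret_1 = secrets.token_bytes(32)
secret_2 = secrets.token_bytes(32)
cookie = sign("u=alice", secret_1)
accepted_by_own_server = verify(cookie, secret_1)
accepted_by_other_server = verify(cookie, secret_2)

# Shared secret: any server holding it accepts the cookie.
shared_old = secrets.token_bytes(32)
shared_new = secrets.token_bytes(32)
old_cookie = sign("u=alice", shared_old)
accepted_after_roll = verify_rolling(old_cookie, shared_new, shared_old)
```

The point of the sketch is only that verification depends on every server holding the same secret; how that secret is distributed and rolled is the job of the Zookeeper-backed provider mentioned above.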
Authentication
[Diagram] Components: Load Balancer, Oozie Server 1, Oozie Server 2, Oracle DB, Zookeeper (via Apache Curator), Hadoop Cluster
Connections and their security:
● User submit request: https + kerberos / cookie-based auth
● Load balanced request redirection: https + kerberos / cookie-based auth
● Inter-server communication for log streaming, sharelib, etc.: https + kerberos
● Zookeeper for lock and management: kerberos
● One further connection is labeled only "Security: kerberos"
HCatalog Integration in HA
• HCatalog: metadata management service for HDFS datasets
  – Oozie receives notifications from HCatalog through JMS (e.g., ActiveMQ)
  – Start a job immediately after data becomes ready
[Diagram: Oozie 1, Oozie 2, Job, JMS (e.g., ActiveMQ), and HCatalog, with numbered steps]
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
HCatalog Integration in HA
• To support HA
  – Keep the in-memory data structure consistent
    • It stores the list of jobs waiting for a data partition
  – Create and clean up the topic listener for JMS
[Diagram: Oozie 1, Oozie 2, Job, JMS (e.g., ActiveMQ), and HCatalog, with numbered steps]
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
Other Challenges in HA
• SLA support
  – Oozie has an in-memory data structure to track SLA status for each job (start/duration/end met/miss and notifications)
  – Add a check of SLA status against the database
  – Use a ZK lock to synchronize updates on the same job from multiple servers
• Distributed Locks
  – Reentrant distributed lock using Apache Curator + Zookeeper
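Why reentrancy matters for the SLA updates above can be shown with a local analogue. In this sketch Python's `threading.RLock` stands in for the distributed lock, which is an assumption made only so the example runs in one process; `job_lock`, `sla_status`, `record_event`, and `update_sla` are hypothetical names.

```python
import threading

job_lock = threading.RLock()  # local stand-in for a per-job distributed lock
sla_status = {}               # job_id -> {event: "MET" or "MISS"}


def record_event(job_id, event, met):
    # Takes the lock itself so it is also safe when called on its own.
    with job_lock:
        sla_status.setdefault(job_id, {})[event] = "MET" if met else "MISS"


def update_sla(job_id, start_met, end_met):
    # Holds the lock across both events, then calls a helper that takes the
    # same lock again. A reentrant lock allows this; a plain lock would deadlock.
    with job_lock:
        record_event(job_id, "start", start_met)
        record_event(job_id, "end", end_met)


update_sla("job-42", start_met=True, end_met=False)
```

A reentrant lock lets the outer update and the inner helper share one critical section, so both SLA events for a job are written as a unit.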
Other Challenges in HA
● Distributed Job ID
  o Maintain a distributed sequence number for Job ID using Apache Curator + Zookeeper
● Zookeeper Failure Handling
  o Oozie servers automatically shut down when Zookeeper is down
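The distributed Job ID idea can be sketched with a shared counter. This is a single-process simulation under stated assumptions: the `SharedSequence` class and `make_job_id` helper are hypothetical, a `threading.Lock` stands in for the Zookeeper-backed coordination, threads stand in for Oozie servers, and the ID format is illustrative only.

```python
import threading


class SharedSequence:
    """Hypothetical stand-in for a sequence number kept in Zookeeper."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self) -> int:
        # Increment and read as one atomic step so no two callers see the same number.
        with self._lock:
            self._value += 1
            return self._value


def make_job_id(sequence: SharedSequence) -> str:
    # Illustrative format: zero-padded sequence number plus a fixed suffix.
    return f"{sequence.next():07d}-W"


sequence = SharedSequence()
ids = []
ids_lock = threading.Lock()


def submit_jobs(count):
    for _ in range(count):
        job_id = make_job_id(sequence)
        with ids_lock:
            ids.append(job_id)


# Two "servers" hand out IDs concurrently from the same sequence.
servers = [threading.Thread(target=submit_jobs, args=(100,)) for _ in range(2)]
for server in servers:
    server.start()
for server in servers:
    server.join()
```

Because every ID comes from the one shared sequence, the two simulated servers never issue the same Job ID, which is the property the slide asks for.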
Future work
• Learn from experience for stability
  – At Y!, HA has been running on non-prod grids for more than a month, with prod deployment in Q4
• Faster job fail-over
  – Currently waits for a thread (Recovery Service) to pick up non-progressing jobs every few minutes
  – An Oozie server should immediately notice when another server is down and fail over its jobs (e.g., using a ZK watcher)
• Improve log streaming
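The current fail-over behaviour described above, where a periodic recovery thread picks up jobs that have stopped progressing, can be sketched roughly as follows. The five-minute threshold, the `last_progress` mapping, and the `find_stale_jobs` helper are assumptions for illustration, not Oozie's Recovery Service.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=5)  # assumed threshold, not an Oozie default


def find_stale_jobs(last_progress, now, stale_after=STALE_AFTER):
    """Return IDs of jobs whose last recorded progress is older than the threshold."""
    return sorted(
        job_id
        for job_id, timestamp in last_progress.items()
        if now - timestamp > stale_after
    )


now = datetime(2014, 10, 15, 12, 0, 0)
last_progress = {
    "job-1": now - timedelta(minutes=1),   # still progressing
    "job-2": now - timedelta(minutes=12),  # its server may have gone down
    "job-3": now - timedelta(minutes=7),
}

# One pass of the periodic scan; a recovery thread would repeat this on a timer.
stale = find_stale_jobs(last_progress, now)
```

With polling like this, a job orphaned by a failed server waits until the next scan, which is the delay the proposed ZK watcher approach is meant to remove.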