High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc....

32
High Availability HDFS Matt Foley Hortonworks, Inc. [email protected]

Transcript of High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc....

Page 1: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

High Availability HDFS

Matt Foley Hortonworks, Inc. [email protected]

Page 2: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Matt Foley - Background

Page 2 Architecting the Future of Big Data

• MTS at Hortonworks Inc. – HDFS contributor, part of original ~25 in Yahoo! spin-out of Hortonworks – Currently managing engineering infrastructure for Hortonworks – My team also provides Build Engineering infrastructure services to ASF,

for Hadoop core and several related projects within Apache – Formerly, led software development for back end of Yahoo Mail for three

years – 20,000 servers with 30 PB of data under management, 400M active users

– Did startups in Storage Management and Log Management

• Apache Hadoop, ASF – Committer and PMC member, Hadoop core – Release Manager – Hadoop-1.0

Page 3: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Company Background

Page 3 Architecting the Future of Big Data

• In 2006, Yahoo! was a very early adopter of Hadoop, and became the principle contributor to it.

• Over time, invested 40K+ servers and 170PB storage in Hadoop • Over 1000 active users run 5M+ Map/Reduce jobs per month • In 2011, Yahoo! spun off ~25 engineers into Hortonworks, a company

focused on advancing open source Apache Hadoop for the broader market ( http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop )

2006

2011

Page 4: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Agenda

Page 4 Architecting the Future of Big Data

• Overview of HDFS architecture • Hadoop “ecosystem” • Hadoop 2.0

• High Availability • What has been the HDFS record?

–reliability –availability

• HDFS-HA

Page 5: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

What is Hadoop?

Page 5 Architecting the Future of Big Data

• Hadoop - Open Source Apache Project – Framework for reliably storing & processing petabytes of data using

commodity hardware and storage • Scalable solution

– Computation capacity – Storage capacity – I/O bandwidth

• Core components – HDFS: Hadoop Distributed File System - distributes data – Map/Reduce - distributes application processing and control

• Move computation to data and not the other way • Written in Java • Runs on

– Linux, Windows, Solaris, and Mac OS/X

Page 6: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Commodity Hardware Cluster

Page 6 Architecting the Future of Big Data

• Typically in 2- or 3-level architecture – Nodes are commodity Linux servers – 20 - 40 nodes/rack – Uplink from rack is 10 or 2x10 gigabit – Rack-internal is 1 or 2x1 gigabit all-to-all

• “Flat fabric” 10Gbit network architectures being planned at growing number of sites

10

Page 7: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Hadoop Distributed File System (HDFS)

Page 7 Architecting the Future of Big Data

• One PB-scale file system for the entire cluster –Managed by a single Namenode –Files are written, read, renamed, deleted, but append-only –Optimized for streaming reads of large files

• Files are broken into uniform sized blocks –Blocks are typically 128 MB (nominal – no wasted space) –Replicated to several Datanodes, for reliability –Exposes block placement so that computation can be migrated to

data • Client library directly reads data from Data Nodes

–Bandwidth scales linearly with the number of nodes –System is topology-aware –Array of block locations is available to clients

Page 8: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

HDFS Diagram

Page 8 Architecting the Future of Big Data

b1

b2

b3 b1

b5

b3 b3

b5

b2

b4 b5

b6 b2

b3

b4

Namenode

Namespace Metadata & Journal

Namespace State

Block Map

Heartbeats & Block Reports

Block ID Block Locations

Datanodes

Block ID Data

Backup Namenode

Hierarchal Namespace File Name BlockIDs

Horizontally Scale IO and Storage

Page 9: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Block Placement

Page 9 Architecting the Future of Big Data

•Default is 3 replicas, but settable •Blocks are placed (writes are pipelined):

–First replica on the local node or a random node on local rack

–Second replica on a remote rack –Third replica on a node on same remote rack –Other replicas randomly placed

•Clients read from closest replica –System is topology-aware

•Block placement policy is pluggable

Page 10: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Block Correctness

Page 10 Architecting the Future of Big Data

•Data is checked with CRC32 •File Creation

–Client computes block checksums –DataNode stores the checksums

•File access –Client retrieves the data and checksum from DataNode –If Validation fails, Client tries other replicas

•Periodic validation by DataNode –Background DataBlockScanner task

Page 11: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

HDFS Data Reliability

Page 11 Architecting the Future of Big Data

b1

b2

b3 b1

b5

b3 b3

b5

b2

b4 b5

b6 b2

b3

b4

Namenode

2. copy

3. blockReceived 1. replicate

Datanodes

Bad/lost block replica

Namespace State

Block Map

Page 12: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Active Data Management

Page 12 Architecting the Future of Big Data

•Continuous replica maintenance

•End-to-end checksums

•Periodic checksum verification

•Decommissioning nodes for service

•Balancing storage utilization

Page 13: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Other Hadoop Ecosystem Components

Page 13 Architecting the Future of Big Data

Zook

eepe

r (C

oord

inat

ion)

Core Apache Hadoop Related Hadoop Projects

HDFS (Hadoop Distributed File System)

MapReduce (Distributed Programing Framework)

Hive (SQL)

Pig (Data Flow)

HBase (Columnar NoSQL

Store)

HC

atal

og

(Tab

le &

Sch

ema

Man

agem

ent)

Presenter
Presentation Notes
Many other projects in the Hadoop family Oozie: Workflow engine Mahout: Scalable machine learning and data mining library Monitoring: Ganglia, Nagios, Ambari The list is growing…
Page 14: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Agenda

Page 14 Architecting the Future of Big Data

• Overview of HDFS architecture • Hadoop “ecosystem” • Hadoop 2.0

• High Availability • What has been the HDFS record?

–reliability –availability

• HDFS-HA

Page 15: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Hadoop 2.0

Page 15 Architecting the Future of Big Data

• Developed on Hadoop branch 0.23 • Highlights:

–HDFS Namenode HA –HDFS Namenode Federation –Next-Generation MapReduce architecture (aka YARN) – Performance

Page 16: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

HDFS Federation in v2.0

Page 16 Architecting the Future of Big Data

• Improved scalability and isolation • Clear separation of Namespace and Block Storage

Presenter
Presentation Notes
Namenode with 64GB / 24 cores can handle 100M files, replication=3, with reasonable performance Federation allows the namespace to scale horizontally Partially alleviates the SPOF problem Leaves it up to users to partition the namespace Datanodes can participate in multiple block pools, not partitioned at DN level like “LUN”s
Page 17: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

MapReduce2 - YARN

Presenter
Presentation Notes
New architecture enables other application types to plug in, besides Map/Reduce Ex. Streaming, Graph, Bulk Sync Processing, MPI (Message Passing Interface) New Resource Manager enhances enterprise viability Scalability, HA, FT, and SPoF
Page 18: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Agenda

Page 18 Architecting the Future of Big Data

• Overview of HDFS architecture • Hadoop “ecosystem” • Hadoop 2.0

• High Availability • What has been the HDFS record?

–reliability –availability

• HDFS-HA

Page 19: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Current HDFS Reliability & Availability

Page 19 Architecting the Future of Big Data

•Block store – extremely high –Block replicas stored in native FS on multiple nodes

–Transparently ensure that blocks stay replicated –Serve from closest available replica –A lost node with 12 TB can be re-replicated in 7 minutes –A single lost disk of 1TB can be re-replicated in 30 seconds

–With standard 3x replication, probability of data loss due to normal rates of server and disk failure is infinitesimally small

–even assuming very casual approach to parts replacement

–In study of 2009 data, lost 19 blocks out of 329M on 20,000 nodes, due to software bugs that have since been fixed.

Presenter
Presentation Notes
19 : 329,000,000 is seven 9’s
Page 20: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Current HDFS Reliability & Availability

Page 20 Architecting the Future of Big Data

•Meta-data store –Single NameNode stores state –Journaling and snapshot management to assure data

persistence, to multiple local and NFS (HA) stores –But SPOF with manual switch-over on failure

•How well did it work? –18 month study of 25 clusters had 22 NN failures

–Only 8 of them would have been helped with HA –Impacted availability, but never durability.

Page 21: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

HA: Approach and Terminology

Page 21 Architecting the Future of Big Data

• Initial goal is Active-Standby – Active Namenode: actively serves read/write operations from clients – Standby Namenode: waits, becomes active when Active Namenode fails

– Could serve read operations

• Standby’s State may be cold, warm or hot

– Cold : Standby has zero state (e.g. started after the Active is declared dead.

– Warm: Standby has partial state: – has loaded fsImage & editLogs but has not received any block reports – has loaded fsImage and rolled logs and all block reports

– Hot Standby: Standby has all most of the Active’s state and start immediately

Page 22: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

High Level Use Cases

Page 22 Architecting the Future of Big Data

•Planned downtime –Upgrades –Config changes –Main reason for downtime

•Unplanned downtime –Hardware failure –Server unresponsive –Software failures –Occurs infrequently

•Supported failures –Single hardware failure

– Double hardware failure not supported

–Some software failures – Same software failure

affects both active and standby

Page 23: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Deployment Models

Page 23 Architecting the Future of Big Data

• Single Namenode configuration; no failover • Active and Standby with manual failover

–Standby could be cold/warm/hot

• Active and Standby with automatic failover –Hot standby

• See HDFS-1623 for detailed use cases

Page 24: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Design

Page 24 Architecting the Future of Big Data

•Failover control outside Namenode •Parallel Block reports to Active and Standby (Hot failover)

•Shared or non-shared Namenode state •Fencing of shared resources/data

–Datanodes –Shared Namenode state (if any)

•Client failover –IP Failover –Smart clients (e.g Zookeeper for coordination)

Page 25: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Failover Control Outside Namenode

Page 25 Architecting the Future of Big Data

• Failover Controller – outside Namenode

• Daemon manages resources – All resources modeled uniformly – Resources – OS, HW, Network etc. – Namenode is just another resource

• Heartbeat with other nodes • Quorum based leader election

– Zookeeper for co-ordination and Quorum

• Fencing during split brain – Prevents data corruption

Failover Controller

Resources Actions

start, stop, failover, monitor, …

Quorum Service

Resources Resources

Shared Resources

Page 26: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

HA Namenode with ZooKeeper

Page 26 Architecting the Future of Big Data

NN Active

NN Standby

DN

FailoverController Active

ZK

Cmds Monitor Health of NN. OS, HW

Monitor Health of NN. OS, HW

DN DN

FailoverController Standby

ZK ZK Heartbeat Heartbeat

Page 27: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Sharing the Namenode’s Persistent State medium term – 6 month timeframe

Page 27 Architecting the Future of Big Data

NN Active

NN Standby

Shared NN state with

single writer (fencing)

Shared Storage Approach

NN Active

NN Standby

Edit Logs

Direct stream to Standby NN

Page 28: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Sharing the Namenode’s Persistent State long term

Page 28 Architecting the Future of Big Data

NN Active

NN Standby

Store NN journal and checkpointed image on

Datanodes

DN DN DN

Page 29: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Hadoop 2.0 “Availability” (in the field)

Page 29 Architecting the Future of Big Data

• Requires LOTS of testing

• In small-scale test (500-800 nodes) 2Q2012

• Ramping up over rest of year, with full range of application testing

• Expected to be in production at multiple sites by end/2012

Page 30: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Credits

Page 30 Architecting the Future of Big Data

For major contributions to Hadoop technology, and help with this presentation:

• Sanjay Radia and Suresh Srinivas, Hortonworks – Architect and Team Lead, HDFS – HA and Federation

• Owen O’Malley, Hortonworks – Hadoop lead Architect – Security, Map/Reduce

• Arun Murthy, Hortonworks – Architect and Team Lead, Map/Reduce – M/R2, YARN, etc.

• Rob Chansler, Yahoo! – Team Lead, HDFS – Analysis of Data Availability and Durability

Page 31: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Help getting started

Page 31 Architecting the Future of Big Data

• Apache Hadoop Projects – http://hadoop.apache.org/ – http://wiki.apache.org/hadoop/

• Apache Hadoop Email lists: – [email protected][email protected][email protected]

• O’Reilly Books – Hadoop, The Definitive Guide – HBase, The Definitive Guide

• Hortonworks, Inc. – Installable Data Platform distribution (100% OSS, conforming to Apache releases)

– http://hortonworks.com/technology/techpreview/ – Training and Certification programs

– http://hortonworks.com/training/

• Hadoop Summit 2012 (June 13-14, San Jose) – http://hadoopsummit.org/

Page 32: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data ...

Thanks for Listening!

Page 32 Architecting the Future of Big Data

Questions?