HopsFS 10x HDFS performance


Transcript of HopsFS 10x HDFS performance

Page 1: Hopsfs 10x HDFS performance

HopsFS: 10X your HDFS with NDB

Jim Dowling Associate Prof @ KTH

Senior Researcher @ SICS

CEO @ Logical Clocks AB

Oracle, Stockholm, 6th September 2016

www.hops.io @hopshadoop

Page 2: Hopsfs 10x HDFS performance

Hops Team

Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Johan Svedlund Nordström, Ermias Gebremeskel, Antonios Kouzoupis.

Alumni: Vasileios Giannokostas, Misganu Dessalegn, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, K "Sri" Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D'Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Page 3: Hopsfs 10x HDFS performance

Marketing 101: Celebrity Endorsements

*Turing Award Winner 2014, Father of Distributed Systems

Hi! I'm Leslie Lamport* and even though you're not using Paxos, I approve this product.

Page 4: Hopsfs 10x HDFS performance

Bill Gates’ biggest product regret?*

Page 5: Hopsfs 10x HDFS performance

Windows Future Storage (WinFS*)

*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/

Page 6: Hopsfs 10x HDFS performance

Hadoop in Context

• Data Processing: Spark, MapReduce, Flink, Presto, TensorFlow
• Storage: HDFS, MapR, S3, Colossus, WAS
• Resource Management: YARN, Mesos, Borg
• Metadata: Hive, Parquet, Authorization, Search

Page 7: Hopsfs 10x HDFS performance

HDFS v2

[Diagram: the HDFS Client (ls, rm, mv, cp, stat, chown, chmod, copyFromLocal, copyFromRemote, etc.) talks to the ActiveNameNode, with a StandbyNameNode behind it; Journal Nodes and ZooKeeper coordinate them; DataNodes (up to ~5K) store the blocks.]

• Asynchronous replication of the EditLog
• Agreement on the Active NameNode
• Snapshots (fsimage) cut the EditLog

Page 8: Hopsfs 10x HDFS performance

The NameNode is the Bottleneck for Hadoop


Page 9: Hopsfs 10x HDFS performance

Max Pause Times for NameNode Heap Sizes*

[Chart: max pause times (ms, log scale 10-10000) vs. JVM heap size (50, 75, 100, 150 GB), for unoptimized and optimized GC settings.]

*OpenJDK or Oracle JVM

Page 10: Hopsfs 10x HDFS performance

NameNode and Decreasing Memory Costs

[Chart: size (GB, 0-1000) vs. year (2016-2020), comparing the projected max NameNode JVM heap size against the size of RAM in a COTS $7,000 rack server.]

Page 11: Hopsfs 10x HDFS performance

Externalizing the NameNode State

• Problem: the NameNode cannot scale up to exploit falling RAM prices.
• Solution: move the metadata off the JVM heap.
• Move it where? An in-memory storage system that can be efficiently queried and managed. Preferably open source.
• MySQL Cluster (NDB).

Page 12: Hopsfs 10x HDFS performance

HopsFS Architecture

[Diagram: the HDFS Client talks to multiple NameNodes (one elected Leader); the NameNodes store their metadata in NDB; DataNodes store the blocks.]

Page 13: Hopsfs 10x HDFS performance

Pluggable DBs: Data Abstraction Layer (DAL)

• NameNode (Apache v2) and DAL API (Apache v2): hops-2.5.0.jar
• NDB-DAL-Impl (GPL v2): dal-ndb-2.5.0-7.5.3.jar
• Other DBs can be plugged in under other licenses.
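A minimal sketch of what such a pluggable interface might look like (the names are invented for illustration; the real DAL API lives in the hops repository):

```java
import java.util.Collection;

// The NameNode codes against this API; each database backend ships its
// own implementation in a separately licensed jar.
interface InodeDataAccess<I> {
  // Read one inode row by its primary key (parentId, name).
  I findByParentAndName(long parentId, String name) throws Exception;

  // Persist the net effect of a transaction: rows added, modified, removed.
  void prepare(Collection<I> added, Collection<I> modified,
               Collection<I> removed) throws Exception;
}
```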

Page 14: Hopsfs 10x HDFS performance

The Global Lock in the NameNode


Page 15: Hopsfs 10x HDFS performance

HDFS NameNode Internals

Client: mkdir, getBlockLocations, createFile, …

[Diagram: the RPC path through the NameNode]
- The Listener (NIO thread) accepts client connections into the ConnectionList.
- Reader1 … ReaderN (ipc.server.read.threadpool.size, default 1) deserialize RPCs into the Call Queue.
- Handler1 … HandlerM (dfs.namenode.service.handler.count, default 10) execute operations on the Namespace & In-Memory EditLog under the global FSNameSystem lock, appending to the EditLog Buffer.
- The EditLog Buffer is flushed to the Journal Nodes (EditLog1, EditLog2, EditLog3), which acknowledge with ackIds.
- Done RPCs are returned to the Client by the Responder (NIO thread).
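Both thread pools are ordinary Hadoop configuration keys; they normally live in hdfs-site.xml / core-site.xml, but a minimal sketch of setting them programmatically looks like this:

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeRpcTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Handler threads that execute namespace operations
    // (under the FSNameSystem lock in stock HDFS).
    conf.setInt("dfs.namenode.service.handler.count", 10);
    // Reader threads that deserialize incoming RPCs into the call queue.
    conf.setInt("ipc.server.read.threadpool.size", 1);
    System.out.println(conf.get("dfs.namenode.service.handler.count"));
  }
}
```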

Page 16: Hopsfs 10x HDFS performance

HopsFS NameNode Internals

Client: mkdir, getBlockLocations, createFile, …

[Diagram: the same RPC path, with the FSNameSystem lock and the EditLog removed]
- The Listener (NIO thread), Reader1 … ReaderN (ipc.server.read.threadpool.size, default 1), ConnectionList, and Call Queue are unchanged.
- Handler1 … HandlerM (dfs.namenode.service.handler.count, default 10) execute operations through the DAL API and DAL-Impl against NDB tables: inodes, block_infos, replicas, leases, …
- Done RPCs (ackIds) are returned to the Client by the Responder (NIO thread).

HARD PART: mapping the namespace operations onto NDB tables and transactions.

Page 17: Hopsfs 10x HDFS performance

Concurrency Model: Implicit Locking

• Serializable FS ops using implicit locking of subtrees.

[Hakimzadeh, Peiro, Dowling, "Scaling HDFS with a Strongly Consistent Relational Model for Metadata", DAIS 2014]

Page 18: Hopsfs 10x HDFS performance

Preventing Deadlock and Starvation

• Acquire FS locks in an agreed order derived from the FS hierarchy (see the sketch below).
• Block-level operations follow the same agreed order.
• No cycles => freedom from deadlock.
• Pessimistic concurrency control ensures progress.

[Diagram: a client mv on /user/jim/myFile, a client read, and a DataNode block_report all take locks in the same hierarchical order.]
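A minimal sketch of the ordering idea, with invented names (HopsFS takes row locks inside NDB transactions rather than JVM locks):

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Every operation sorts the inodes it touches into one global order
// (here: by inode id) before locking, so no two operations can ever
// hold locks that form a cycle.
class OrderedLocking {
  record LockableInode(long id, ReentrantReadWriteLock lock) {}

  static void lockAllForWrite(List<LockableInode> inodes) {
    inodes.sort(Comparator.comparingLong(LockableInode::id));
    for (LockableInode inode : inodes) {
      inode.lock().writeLock().lock();
    }
  }
}
```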

Page 19: Hopsfs 10x HDFS performance

Per-Transaction Cache

• Reusing the HDFS codebase resulted in too many round trips to the database per transaction.
• We cache intermediate transaction results at the NameNodes (i.e., a snapshot); see the sketch below.
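A minimal sketch of the idea, with invented names: each transaction keeps a local snapshot, so any row is fetched from NDB at most once per transaction.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class TransactionSnapshot<K, V> {
  private final Map<K, V> snapshot = new HashMap<>();

  // First read of a key goes to the database; later reads within the
  // same transaction are served from the local snapshot.
  V read(K key, Function<K, V> dbLookup) {
    return snapshot.computeIfAbsent(key, dbLookup);
  }

  // Writes are buffered locally and flushed at commit time.
  void write(K key, V value) {
    snapshot.put(key, value);
  }
}
```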

Page 20: Hopsfs 10x HDFS performance

Sometimes, Transactions Just Ain't Enough

• Large subtree operations (delete, mv, set-quota) can't always be executed in a single transaction.
• 4-phase protocol (sketched below)
• Isolation and consistency
• Aggressive batching
• Transparent failure handling
  - Failed ops retried on a new NN.
  - Lease timeout for failed clients.
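The slide does not spell out the four phases; the sketch below (all names invented) follows one plausible reading of the HopsFS design for subtree operations:

```java
// Hedged sketch of how a large subtree operation can be split into
// phases when it cannot run as one database transaction.
class SubtreeOps {
  void delete(long subtreeRootId) throws Exception {
    setSubtreeLockFlag(subtreeRootId);      // 1. fence: new ops can't enter the subtree
    waitForActiveOperations(subtreeRootId); // 2. quiesce in-flight transactions
    deleteInBatches(subtreeRootId);         // 3. many small, aggressively batched transactions
    clearSubtreeLockFlag(subtreeRootId);    // 4. cleanup; a failed op is retried on a new NN
  }

  // Stubs standing in for real metadata operations against NDB.
  void setSubtreeLockFlag(long id) {}
  void waitForActiveOperations(long id) {}
  void deleteInBatches(long id) {}
  void clearSubtreeLockFlag(long id) {}
}
```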

Page 21: Hopsfs 10x HDFS performance

Leader Election using NDB

• A Leader is needed to coordinate replication and lease management.
• NDB serves as shared memory for leader election among the NameNodes.

[Niazi, Berthou, Ismail, Dowling, "Leader Election in a NewSQL Database", DAIS 2015]
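A toy sketch of the idea (a map stands in for the shared NDB table; names invented): each NameNode periodically bumps its own heartbeat row, and the live node with the smallest id becomes Leader.

```java
import java.util.HashMap;
import java.util.Map;

class DbLeaderElection {
  // nnId -> heartbeat counter; stands in for one row per NameNode in NDB.
  private final Map<Long, Long> heartbeats = new HashMap<>();

  // Called periodically by each NameNode: one row update per period.
  void heartbeat(long nnId) {
    heartbeats.merge(nnId, 1L, Long::sum);
  }

  // Nodes whose counter did not advance since the last round are
  // considered dead; among the live ones, the smallest id wins.
  long electLeader(Map<Long, Long> lastRound) {
    heartbeats.keySet().removeIf(id ->
        heartbeats.get(id).equals(lastRound.get(id)));
    return heartbeats.keySet().stream().min(Long::compare).orElse(-1L);
  }
}
```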

Page 22: Hopsfs 10x HDFS performance

Path Component Caching

• The most common operation in HDFS is resolving pathnames to inodes.
  - 67% of operations in Spotify's Hadoop workload.
• We cache recently resolved inodes at the NameNodes so that a path can be resolved with a single batched primary-key lookup.
  - Cache entries are validated as part of the transaction.
  - The cache turns O(N) round trips to the database into O(1) for a hit on all inodes in a path.

Page 23: Hopsfs 10x HDFS performance

Path Component Caching

• Resolving a path of length N takes O(N) round trips.
• With our cache, a hit costs O(1) round trips.

[Diagram, without the cache: resolving /user/jim/myFile issues three dependent lookups against NDB: getInode(0, "user"), getInode(1, "jim"), getInode(2, "myFile").]

[Diagram, with the cache: getInodes("/user/jim/myFile") hits the NameNode cache, and a single batched call validateInodes([(0, "user"), (1, "jim"), (2, "myFile")]) checks all entries against NDB.]
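A minimal sketch (invented names) of the hit path: collect the cached (parentId, name) key of every component, then validate them all in one batched primary-key read inside the transaction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathResolver {
  private final Map<String, Long> cache = new HashMap<>(); // "parent/name" -> inodeId

  // Returns the primary keys to validate in a single batch, or null on a miss.
  List<String> resolveFromCache(String path) {
    List<String> keys = new ArrayList<>();
    long parentId = 0; // root inode id (assumption, matching the slide)
    for (String name : path.substring(1).split("/")) {
      String key = parentId + "/" + name;
      Long inodeId = cache.get(key);
      if (inodeId == null) return null; // miss: fall back to per-component lookups
      keys.add(key);
      parentId = inodeId;
    }
    return keys; // validated together in the transaction, as in validateInodes(...)
  }
}
```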

Page 24: Hopsfs 10x HDFS performance

Hotspots

• Mikael saw 1-2 maxed-out LDM threads.
• Partitioning by parent inodeId meant fantastic performance for 'ls' (partition-pruned index scans), but at high load, hotspots appeared at the top of the directory hierarchy.
• Current solution (sketched below):
  - Cache the root inode at the NameNodes.
  - Use a pseudo-random partition key for top-level directories, but keep partitioning by parent inodeId at lower levels.
  - At least a 4x throughput increase!

[Diagram: a directory tree with / at the top, /Users and /Projects below it, then /NSA and /MyProj, then /Dataset1 and /Dataset2.]
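A minimal sketch of the partition-key rule (invented names; the bit-mixing constant is just an illustrative hash):

```java
class PartitionKeys {
  // Top of the hierarchy: spread inodes pseudo-randomly (but
  // deterministically) over all NDB partitions to avoid hotspots.
  // Lower levels: keep partition-by-parent so 'ls' on a directory
  // stays a partition-pruned index scan.
  static long partitionKey(int depth, long inodeId, long parentId) {
    if (depth <= 1) {
      return inodeId * 0x9E3779B97F4A7C15L; // cheap deterministic bit mix
    }
    return parentId;
  }
}
```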

Page 25: Hopsfs 10x HDFS performance

Scalable Block Reporting

• On 100PB+ clusters, internal maintenance protocol traffic makes up much of the network traffic.
• Block reporting: the Leader load-balances reports over the NameNodes, and NameNodes work-steal when exiting safe mode (see the sketch below).

[Diagram: DataNodes send Blocks/SafeBlocks reports to the NameNodes; the Leader assigns them; NameNodes share state through NDB and steal work from each other.]
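A toy sketch of the work-stealing idea (invented names; JVM queues stand in for the shared state, which in HopsFS lives in NDB):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class BlockReportBalancer {
  // Shared backlog of block reports assigned by the Leader.
  final Queue<Runnable> backlog = new ConcurrentLinkedQueue<>();

  // Each NameNode drains its own assignments first, then steals from
  // the shared backlog instead of idling.
  void process(Queue<Runnable> assigned) {
    for (Runnable r; (r = assigned.poll()) != null; ) r.run();
    for (Runnable r; (r = backlog.poll()) != null; ) r.run();
  }
}
```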

Page 26: Hopsfs 10x HDFS performance

HopsFS Performance


Page 27: Hopsfs 10x HDFS performance


HopsFS Metadata Scaleout

Assuming 256MB Block Size, 100 GB JVM Heap for Apache Hadoop
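For scale, a back-of-the-envelope estimate using the common rule of thumb (an assumption, not from the slide) of roughly one million blocks per GB of NameNode heap:

```latex
\text{blocks} \approx 100\ \text{GB} \times 10^{6}\ \tfrac{\text{blocks}}{\text{GB}} = 10^{8},
\qquad
\text{capacity} \approx 10^{8} \times 256\ \text{MB} \approx 25.6\ \text{PB}.
```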

Page 28: Hopsfs 10x HDFS performance


Spotify Workload

Page 29: Hopsfs 10x HDFS performance


HopsFS Throughput (Spotify Workload - PM)

Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances

Page 30: Hopsfs 10x HDFS performance


HopsFS Throughput (Spotify Workload - PM)

Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances

Page 31: Hopsfs 10x HDFS performance


HopsFS Throughput (Spotify Workload - AM)

NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: machines with Xeon E5-2620 2.40GHz processors and 10GbE.

Page 32: Hopsfs 10x HDFS performance


Per Operation HopsFS Throughput

Page 33: Hopsfs 10x HDFS performance

NDB Performance Lessons

• NDB is quite stable!
• ClusterJ is (nearly) good enough:
  - sun.misc.Cleaner has trouble keeping up at high throughput (OOM for ByteBuffers).
  - Transaction hint behavior is not respected.
  - DTO creation time is affected by Java reflection.
  - Nice-to-have features: projections, batched scan operation support, an Event API.
• The Event API and an asynchronous API are needed for performance in Hops-YARN.
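For context, a minimal ClusterJ session bootstrap (the connect string, database name, and the InodeDTO mapping are placeholders). DTOs are built via Java reflection and transactions carry partition-key hints, the two costs called out above:

```java
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import java.util.Properties;

public class ClusterJBootstrap {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("com.mysql.clusterj.connectstring", "localhost:1186");
    props.setProperty("com.mysql.clusterj.database", "hops");
    SessionFactory factory = ClusterJHelper.getSessionFactory(props);
    Session session = factory.getSession();
    // session.newInstance(InodeDTO.class) creates DTOs via reflection;
    // session.setPartitionKey(InodeDTO.class, key) is the transaction
    // hint whose behavior the slide says is not always respected.
    session.close();
  }
}
```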

Page 34: Hopsfs 10x HDFS performance

Heterogeneous Storage in HopsFS

• Storage types in HopsFS: Default, EC-RAID5, SSD.

- Default: 3X overhead - triple replication on spinning disks

- SSD: 3X overhead - triple replication on SSDs

- EC-RAID5: 1.4X overhead with low reconstruction overhead!

Page 35: Hopsfs 10x HDFS performance

Erasure Coding

HDFS File (Sealed)

RS(6,3):  d0 d1 d2 d3 d4 d5 | p0 p1 p2   (overhead: (6+3)/6 = 1.5X)
RS(12,4): d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 | p0 p1 p2 p3   (overhead: (12+4)/12 ≈ 1.33X)
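In general, the storage overhead of a Reed-Solomon code RS(d, p) with d data blocks and p parity blocks is:

```latex
\text{overhead} = \frac{d+p}{d}, \qquad
RS(6,3):\ \frac{6+3}{6} = 1.5\times, \qquad
RS(12,4):\ \frac{12+4}{12} \approx 1.33\times.
```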

Page 36: Hopsfs 10x HDFS performance

Global/Local Reconstruction with EC-RAID5

[Diagram: each stripe (d0 d1 d2 d3 d4 p0, e.g. Block0 … Block13) is spread across hosts host0 … host10; each host runs ZFS RAID-Z locally, so disk failures are reconstructed locally while Reed-Solomon handles host failures globally.]

LR(5,1).RS(10,2): (10+2+2)/10 = 1.4X overhead
LR(5,1).RS(10,4): (10+2+4)/10 = 1.6X overhead

Page 37: Hopsfs 10x HDFS performance

ePipe: Indexing HopsFS' Namespace

• Free-text search: metadata changes stream from NDB to Elasticsearch via the NDB Event API.
• Polyglot persistence: the distributed database is the single source of truth.
• Foreign keys ensure the integrity of extended metadata.

[Diagram: MetaData Designer and MetaData Entry write extended metadata to NDB; ePipe ships changes to Elasticsearch.]

Page 38: Hopsfs 10x HDFS performance

Hops-YARN


Page 39: Hopsfs 10x HDFS performance


YARN Architecture

NodeManagers

YARN Client

Zookeeper Nodes

ResourceMgr StandbyResourceMgr

1. Master-slave replication of RM state
2. Agreement on the Active ResourceMgr

Page 40: Hopsfs 10x HDFS performance

ResourceManager – Monolithic but Modular

[Diagram: inside the ResourceManager, the ClientService, AdminService, ApplicationMasterService, ResourceTrackerService, Scheduler, and Security modules share the Cluster State; YARN Clients, App Masters, and NodeManagers talk to their respective services. In Hops, the HopsScheduler and HopsResourceTracker persist the Cluster State to NDB via ClusterJ and its Event API (the slide quotes ~2k ops/s and ~10k ops/s for the two paths).]

Page 41: Hopsfs 10x HDFS performance

Hops-YARN Architecture

[Diagram: YARN Clients talk to the Scheduler ResourceMgr; NodeManagers heartbeat to the Resource Tracker ResourceMgrs; all state is shared through NDB. Leader election replaces a failed Scheduler.]

Page 42: Hopsfs 10x HDFS performance

Hopsworks


Page 43: Hopsfs 10x HDFS performance

Hopsworks – Project-Based Multi-Tenancy

• A project is a collection of:
  - Users with roles
  - HDFS DataSets
  - Kafka topics
  - Notebooks, jobs
• Per-project quotas:
  - Storage in HDFS
  - CPU in YARN
• Uber-style pricing
• Sharing across projects: DataSets/topics

[Diagram: a project groups dataset 1 … dataset N in HDFS and Topic 1 … Topic N in Kafka.]

Page 44: Hopsfs 10x HDFS performance

Hopsworks – Dynamic Roles

[Diagram: Alice authenticates to Glassfish as [email protected]; Hopsworks then uses per-project identities (NSA__Alice, Users__Alice) for secure impersonation against HopsFS, Hops-YARN, and Kafka, backed by X.509 certificates.]

Page 45: Hopsfs 10x HDFS performance

SICS ICE - www.hops.site

A 2 MW datacenter research and test environment.

Purpose: increase knowledge; strengthen universities, companies, and researchers.

R&D institute: 5 lab modules, 3,000-4,000 servers, 2,000-3,000 square meters.

Page 46: Hopsfs 10x HDFS performance

Karamel/Chef for Automated Installation

[Diagram: deployment targets include Google Compute Engine and bare metal.]

Page 47: Hopsfs 10x HDFS performance

Summary

• HopsFS is the world's fastest, most scalable HDFS implementation.
• Powered by NDB, the world's fastest database.
• Thanks to Mikael, Craig, Frazer, Bernt, and others.
• Still room for improvement…

www.hops.io

Page 48: Hopsfs 10x HDFS performance

Hops [Hadoop For Humans]

Join us! http://github.com/hopshadoop