
Transcript of HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Page 1: HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Salman Niazi1, Mahmoud Ismail1, Steffen Grohsschmiedt3, Mikael Ronström4

Seif Haridi1,2, Jim Dowling1,2

1 KTH - Royal Institute of Technology, 2 RISE SICS - Swedish Institute of Computer Science

3 Spotify 4 Oracle

www.hops.io

Introduction

The Hadoop Distributed File System (HDFS) is the most popular open-source platform for storing large volumes of data. However, HDFS’ design introduces two scalability bottlenecks. First, the Namenode architecture places practical limits on the size of the namespace (files/directories). HDFS’ second main bottleneck is a single global lock on the namespace that ensures the consistency of the file system by limiting concurrent access to the namespace to a single writer or multiple readers.
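As an illustration of that second bottleneck, here is a simplified Java sketch of the global-lock pattern; it is not HDFS’ actual FSNamesystem code, and the class and method names are hypothetical:

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of a single global namespace lock: every metadata
// operation takes one shared read/write lock, so a writer serialises
// all other access to the namespace.
class NamespaceLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    <T> T readOp(java.util.function.Supplier<T> op) {   // e.g. getFileInfo, listStatus
        lock.readLock().lock();
        try { return op.get(); } finally { lock.readLock().unlock(); }
    }

    <T> T writeOp(java.util.function.Supplier<T> op) {  // e.g. mkdir, create, delete
        lock.writeLock().lock();                         // blocks ALL readers and writers
        try { return op.get(); } finally { lock.writeLock().unlock(); }
    }
}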

HopsFS

HopsFS is an open-source, next-generation distribution of HDFS and a drop-in replacement that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a distributed, no-shared-state metadata service built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables the following:

• Significantly larger cluster sizes, storing 37 times more metadata.
• More than an order of magnitude higher throughput (16x to 37x, where 37x is the improvement at higher write rates).
• Significantly lower client latencies for large clusters.
• Multiple stateless Namenodes.
• Instant failover between the Namenodes.
• Tinker-friendly metadata.

[Figure: Metadata partitioning. The root (/) and directory inodes such as /user, together with file-inode-related metadata tables (Quota, PRB, Inv, RUC, CR, ER, BLU, Block, Lease, Replica, URB) for files such as /user/foo.txt, /user/bar.txt, and /user/foo.tar, are distributed over the NDB database partitions NDB-DN1 to NDB-DN4.]

HopsFS vs. HDFS Performance

HopsFS and HDFS Throughput for Spotify Workload

[Figure: Throughput (ops/sec, up to 1.6M) against the number of Namenodes (1 to 60) for HopsFS using 12, 8, 4, and 2 node NDB clusters, HopsFS using a 12 node NDB cluster with hotspots, and HDFS on the Spotify workload; an inset zooms in on HDFS throughput (25K to 100K ops/sec).]

HopsFS Architecture

HopsFS provides multiple stateless Namenodes. The Namenodes can serve requests from both HopsFS and HDFS clients; however, only HopsFS clients provide load balancing between the Namenodes using random, round-robin, and sticky policies.
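A minimal sketch of what such client-side policies could look like; this is illustrative Java, not the actual HopsFS client code, and the class and method names are hypothetical:

import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative client-side Namenode selection with random, round-robin,
// and sticky policies.
class NamenodeSelector {
    enum Policy { RANDOM, ROUND_ROBIN, STICKY }

    private final List<String> namenodes;   // e.g. "nn1:8020", "nn2:8020", ...
    private final Policy policy;
    private final Random random = new Random();
    private final AtomicInteger next = new AtomicInteger(0);
    private volatile String stickyChoice;   // reused until that Namenode fails

    NamenodeSelector(List<String> namenodes, Policy policy) {
        this.namenodes = namenodes;
        this.policy = policy;
    }

    String pickNamenode() {
        switch (policy) {
            case RANDOM:
                return namenodes.get(random.nextInt(namenodes.size()));
            case ROUND_ROBIN:
                return namenodes.get(Math.floorMod(next.getAndIncrement(), namenodes.size()));
            default: // STICKY: keep using the same Namenode until it fails
                if (stickyChoice == null) {
                    stickyChoice = namenodes.get(random.nextInt(namenodes.size()));
                }
                return stickyChoice;
        }
    }

    // Called when an RPC to the chosen Namenode fails, forcing a new pick.
    void reportFailure(String namenode) {
        if (namenode.equals(stickyChoice)) {
            stickyChoice = null;
        }
    }
}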

[Figure: HopsFS architecture. HopsFS/HDFS clients send metadata operations to multiple stateless Namenodes (NN 1 … NN N), which store the namespace in MySQL Cluster; file blocks are stored on Datanodes (DN 1 … DN N).]

Metadata Partitioning

All the inodes in a directory are partitioned using the parent inode ID; therefore, all the immediate children of the /user directory are stored on NDB-DN-3 for efficient directory listing, for example, ls /user. The file-inode-related metadata for /user/foo.txt is stored on NDB-DN-4 for efficient file read operations, for example, cat /user/foo.txt.
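The idea can be sketched as a composite key whose partition component is the parent inode ID, so the children of one directory land in one database partition; the real HopsFS schema differs, and the names below are hypothetical:

// Illustrative sketch of partitioning inodes by parent inode ID.
class InodeKey {
    final long parentId;   // partition key: children of one directory co-locate
    final String name;     // file or directory name within the parent

    InodeKey(long parentId, String name) {
        this.parentId = parentId;
        this.name = name;
    }

    // All rows with the same parentId hash to the same partition, so
    // "ls /user" reads the children of /user from a single partition,
    // while each file's own metadata keys on its inode ID elsewhere.
    int partition(int numPartitions) {
        return Math.floorMod(Long.hashCode(parentId), numPartitions);
    }
}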

Memory   HDFS (files)     HopsFS (files)
1 GB     2.3 million      0.69 million
50 GB    115 million      34.5 million
100 GB   230 million      69 million
200 GB   460 million      138 million
500 GB   Does Not Scale   346 million
1 TB     Does Not Scale   708 million
25 TB    Does Not Scale   17 billion

At their largest configurations, HopsFS stores 17 billion files (25 TB of metadata) against HDFS’ maximum of 460 million files (200 GB heap), roughly the 37-fold increase claimed above.

HopsFS and HDFS Metadata Scalability

HopsFS and HDFS End-to-end Latency

[Figure: End-to-end operation latency (ms) against the number of concurrent clients (1000 to 6000) for HopsFS and HDFS; an inset shows latencies of up to 10 ms for 100 to 500 clients.]

HopsFS and HDFS Namenode Failover

[Figure: Throughput (ops/sec) over time (sec) for HopsFS and HDFS during Namenode failures. Vertical lines represent Namenode failures.]

Page 2: Advancing Data Stream Analytics with Apache Flink®

Advancing Data Stream Analytics with Apache Flink®

Paris Carbone, Gyula Fóra, Seif Haridi, Vasiliki Kalavri, Marius Melzer, Theodore Vasiloudis
<[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]>

A NEW STATE OF THE ART IN DATA STREAMING

• Lightweight, consistent end-to-end processing
• Dynamic reconfiguration and application management

CONSISTENT CONTINUOUS PROCESSING WITH PIPELINED SNAPSHOTS

1) Divide computation into epochs.
2) Capture states after each epoch without stopping.

Pre-partition state in hash(K) space, into key-groups.

Reconfiguration Scenarios
• scale in/out
• failure recovery
• pushing bug fixes
• application A/B testing
• platform migration

[Figure: A stream processor divides its input streams into epochs (epoch n, n-1, n-2, n-3); per-key local states (e.g., alice, bob) are captured as snapshots (snap-1, snap-2, snap-3) after each epoch while processing continues. Epochs move from in-progress through pending to committed, and snapshots are used to roll back or update the application.]
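A minimal sketch of what this looks like at the Flink API level, assuming a recent Flink release; the input, key, and checkpoint interval below are illustrative and not taken from the poster:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Enabling Flink's pipelined, epoch-based snapshots: operator state is
// checkpointed periodically while processing continues, enabling rollback
// and reconfiguration.
public class PipelinedSnapshots {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject a checkpoint barrier into the stream every 10 seconds;
        // each completed checkpoint corresponds to one committed epoch.
        env.enableCheckpointing(10_000);

        env.fromElements("alice", "bob", "alice")   // hypothetical keyed input
           .keyBy(name -> name)                     // state is partitioned into key-groups by hash(key)
           .map(name -> name.toUpperCase())
           .print();

        env.execute("pipelined-snapshots-sketch");
    }
}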

Apache Flink™: Stream and Batch Processing in a Single Engine. P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, K. Tzoumas. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.

State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, K. Tzoumas. Proceedings of the VLDB Endowment 10 (12), 1718-1729.

[Figure: The Apache Flink stack. Libraries (SQL, Table, CEP, Graphs, ML) sit on top of the core APIs (DataStream, DataSet), which run on the Dataflow Runtime over the cluster backend; the runtime provides consistent state, event-time progress, a fluid API, and partitioned streams.]

FAST SLIDING WINDOW AGGREGATION

• Sliding window aggregation can be very expensive (see the sketch below)
• Existing optimisations apply to limited window types
• ‘Cutty’ redefines stream windows for optimal processing
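To illustrate that cost, here is a plain Flink sliding-window job (not Cutty): with a 60-second window sliding every second, each record belongs to 60 overlapping windows, and naive evaluation re-aggregates it in every one of them. The source and key are hypothetical:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Naive sliding-window aggregation: the sum is recomputed per window
// instead of being shared across overlapping windows.
public class SlidingWindowCost {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("sensor-1", 1), Tuple2.of("sensor-1", 2))  // hypothetical input
           .keyBy(value -> value.f0)
           // 60 s window, 1 s slide: every record is assigned to 60 windows
           .window(SlidingProcessingTimeWindows.of(Time.seconds(60), Time.seconds(1)))
           .sum(1)
           .print();

        env.execute("sliding-window-cost-sketch");
    }
}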

[Figure: Left, throughput (records/sec) against the number of queries (20 to 100) for Cutty and Pairs+RA. Right, total reduce calls (log scale, 10^4 to 10^11) against the number of queries (1 to 100) for Cutty (eager), Pairs+Cutty (lazy), Pairs, RA, and Naive.]

Cutty: Aggregate Sharing for User-Defined Windows. P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, V. Markl. ACM CIKM - 25th International Conference on Information and Knowledge Management.

SYSTEM SUPPORT FOR GRAPH STREAM MINING

• People process graphs inefficiently:
  1. Load: read the graph from disk and partition it in memory
  2. Compute: read and mutate the graph state
  3. Store: write the final graph state back to disk
• This is slow, expensive and redundant
• We propose a new way to process graphs continuously

1) Single-pass summaries

edgeStream.aggregate(newSummary(window, fold, combine, lower))

[Figure: Edge additions are folded into a graph summary, from which algorithms compute results R1 and R2.]

2) Neighbour Aggregation and Iterations on Stream Windows

graphstream.window(…)
    .iterateSyncFor(10, InputFunction(), StepFunction(), OutputFunction())

graphstream.window(…)
    .applyOnNeighbors(FindPairs())

[Figure: Dataflow with a windowed source (src, win), an iteration loop, and a sink.]
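As a rough illustration of the single-pass idea in plain Flink (this is not the gelly-streaming API shown above, and the job and its input are hypothetical), the following keeps a running per-vertex degree count over a stream of edge additions without ever materialising the whole graph:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

// Single-pass graph summary: per-vertex degree counts over edge additions.
public class DegreeSummary {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical edge stream: (source vertex, target vertex).
        env.fromElements(Tuple2.of(1L, 2L), Tuple2.of(2L, 3L), Tuple2.of(1L, 3L))
           .flatMap(new FlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Integer>>() {
               @Override
               public void flatMap(Tuple2<Long, Long> edge, Collector<Tuple2<Long, Integer>> out) {
                   // Each edge addition contributes one degree to both endpoints.
                   out.collect(Tuple2.of(edge.f0, 1));
                   out.collect(Tuple2.of(edge.f1, 1));
               }
           })
           .keyBy(value -> value.f0)   // partition the summary by vertex id
           .sum(1)                     // running degree count per vertex
           .print();

        env.execute("degree-summary-sketch");
    }
}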

Sponsors · Partners · github.com/vasia/gelly-streaming