
Transcript of HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Page 1: HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Salman Niazi1, Mahmoud Ismail1, Steffen Grohsschmiedt3, Mikael Ronström4

Seif Haridi1,2, Jim Dowling1,2

1 KTH - Royal Institute of Technology, 2 RISE SICS - Swedish Institute of Computer Science

3 Spotify 4 Oracle

www.hops.io

Introduction

The Hadoop Distributed File System (HDFS) is the most popular open-source platform for storing large volumes of data. However, HDFS’ design introduces two scalability bottlenecks. First, the Namenode architecture places practical limits on the size of the namespace (files/directories). HDFS’ second main bottleneck is a single global lock on the namespace that ensures the consistency of the file system by limiting concurrent access to the namespace to a single writer or multiple readers.
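As an illustration of that second bottleneck, here is a simplified Java sketch of the global-lock pattern; it is not HDFS’ actual FSNamesystem code, and the class and method names are hypothetical:

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of a single global namespace lock: every metadata
// operation takes one shared read/write lock, so a writer serialises
// all other access to the namespace.
class NamespaceLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    <T> T readOp(java.util.function.Supplier<T> op) {   // e.g. getFileInfo, listStatus
        lock.readLock().lock();
        try { return op.get(); } finally { lock.readLock().unlock(); }
    }

    <T> T writeOp(java.util.function.Supplier<T> op) {  // e.g. mkdir, create, delete
        lock.writeLock().lock();                         // blocks ALL readers and writers
        try { return op.get(); } finally { lock.writeLock().unlock(); }
    }
}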

HopsFS

HopsFS is an open-source, next-generation distribution of HDFS and a drop-in replacement that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a distributed, no-shared-state metadata service built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables the following:

• Significantly larger cluster sizes, storing 37 times more metadata.
• More than an order of magnitude higher throughput (16x to 37x, where 37x is the improvement at higher write rates).
• Significantly lower client latencies for large clusters.
• Multiple stateless Namenodes.
• Instant failover between the Namenodes.
• Tinker-friendly metadata.

[Figure: Metadata partitioning. The root (/) and directory inodes such as /user, together with file-inode-related metadata tables (Quota, PRB, Inv, RUC, CR, ER, BLU, Block, Lease, Replica, URB) for files such as /user/foo.txt, /user/bar.txt, and /user/foo.tar, are distributed over the NDB database partitions NDB-DN1 to NDB-DN4.]

HopsFS vs. HDFS Performance

HopsFS and HDFS Throughput for Spotify Workload

[Figure: Throughput (ops/sec, up to 1.6M) against the number of Namenodes (1 to 60) for HopsFS using 12, 8, 4, and 2 node NDB clusters, HopsFS using a 12 node NDB cluster with hotspots, and HDFS on the Spotify workload; an inset zooms in on HDFS throughput (25K to 100K ops/sec).]

HopsFS Architecture

HopsFS provides multiple stateless Namenodes. The Namenodes can serve requests from both HopsFS and HDFS clients; however, only HopsFS clients provide load balancing between the Namenodes using random, round-robin, and sticky policies.
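A minimal sketch of what such client-side policies could look like; this is illustrative Java, not the actual HopsFS client code, and the class and method names are hypothetical:

import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative client-side Namenode selection with random, round-robin,
// and sticky policies.
class NamenodeSelector {
    enum Policy { RANDOM, ROUND_ROBIN, STICKY }

    private final List<String> namenodes;   // e.g. "nn1:8020", "nn2:8020", ...
    private final Policy policy;
    private final Random random = new Random();
    private final AtomicInteger next = new AtomicInteger(0);
    private volatile String stickyChoice;   // reused until that Namenode fails

    NamenodeSelector(List<String> namenodes, Policy policy) {
        this.namenodes = namenodes;
        this.policy = policy;
    }

    String pickNamenode() {
        switch (policy) {
            case RANDOM:
                return namenodes.get(random.nextInt(namenodes.size()));
            case ROUND_ROBIN:
                return namenodes.get(Math.floorMod(next.getAndIncrement(), namenodes.size()));
            default: // STICKY: keep using the same Namenode until it fails
                if (stickyChoice == null) {
                    stickyChoice = namenodes.get(random.nextInt(namenodes.size()));
                }
                return stickyChoice;
        }
    }

    // Called when an RPC to the chosen Namenode fails, forcing a new pick.
    void reportFailure(String namenode) {
        if (namenode.equals(stickyChoice)) {
            stickyChoice = null;
        }
    }
}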

[Figure: HopsFS architecture. HopsFS/HDFS clients send metadata operations to multiple stateless Namenodes (NN 1 … NN N), which store the namespace in MySQL Cluster; file blocks are stored on Datanodes (DN 1 … DN N).]

Metadata Partitioning

All the inodes in a directory are partitioned using the parent inode ID; therefore, all the immediate children of the /user directory are stored on NDB-DN-3 for efficient directory listing, for example, ls /user. The file-inode-related metadata for /user/foo.txt is stored on NDB-DN-4 for efficient file read operations, for example, cat /user/foo.txt.
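The idea can be sketched as a composite key whose partition component is the parent inode ID, so the children of one directory land in one database partition; the real HopsFS schema differs, and the names below are hypothetical:

// Illustrative sketch of partitioning inodes by parent inode ID.
class InodeKey {
    final long parentId;   // partition key: children of one directory co-locate
    final String name;     // file or directory name within the parent

    InodeKey(long parentId, String name) {
        this.parentId = parentId;
        this.name = name;
    }

    // All rows with the same parentId hash to the same partition, so
    // "ls /user" reads the children of /user from a single partition,
    // while each file's own metadata keys on its inode ID elsewhere.
    int partition(int numPartitions) {
        return Math.floorMod(Long.hashCode(parentId), numPartitions);
    }
}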

Memory   HDFS (files)     HopsFS (files)
1 GB     2.3 million      0.69 million
50 GB    115 million      34.5 million
100 GB   230 million      69 million
200 GB   460 million      138 million
500 GB   Does Not Scale   346 million
1 TB     Does Not Scale   708 million
25 TB    Does Not Scale   17 billion

At their largest configurations, HopsFS stores 17 billion files (25 TB of metadata) against HDFS’ maximum of 460 million files (200 GB heap), roughly the 37-fold increase claimed above.

HopsFS and HDFS Metadata Scalability

HopsFS and HDFS End-to-end Latency

[Figure: End-to-end operation latency (ms) against the number of concurrent clients (1000 to 6000) for HopsFS and HDFS; an inset shows latencies of up to 10 ms for 100 to 500 clients.]

HopsFS and HDFS Namenode Failover

[Figure: Throughput (ops/sec) over time (sec) for HopsFS and HDFS during Namenode failures. Vertical lines represent Namenode failures.]

Page 2: Advancing Data Stream Analytics with Apache Flink®

Advancing Data Stream Analytics with Apache Flink®

Paris Carbone, Gyula Fóra, Seif Haridi, Vasiliki Kalavri, Marius Melzer, Theodore Vasiloudis
<[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]>

A NEW STATE OF THE ART IN DATA STREAMING

• Lightweight, consistent end-to-end processing
• Dynamic reconfiguration and application management

CONSISTENT CONTINUOUS PROCESSING WITH PIPELINED SNAPSHOTS

1) Divide computation into epochs.
2) Capture states after each epoch without stopping.

Pre-partition state in hash(K) space, into key-groups.

Reconfiguration Scenarios
• scale in/out
• failure recovery
• pushing bug fixes
• application A/B testing
• platform migration

[Figure: A stream processor divides its input streams into epochs (epoch n, n-1, n-2, n-3); per-key local states (e.g., alice, bob) are captured as snapshots (snap-1, snap-2, snap-3) after each epoch while processing continues. Epochs move from in-progress through pending to committed, and snapshots are used to roll back or update the application.]
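A minimal sketch of what this looks like at the Flink API level, assuming a recent Flink release; the input, key, and checkpoint interval below are illustrative and not taken from the poster:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Enabling Flink's pipelined, epoch-based snapshots: operator state is
// checkpointed periodically while processing continues, enabling rollback
// and reconfiguration.
public class PipelinedSnapshots {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject a checkpoint barrier into the stream every 10 seconds;
        // each completed checkpoint corresponds to one committed epoch.
        env.enableCheckpointing(10_000);

        env.fromElements("alice", "bob", "alice")   // hypothetical keyed input
           .keyBy(name -> name)                     // state is partitioned into key-groups by hash(key)
           .map(name -> name.toUpperCase())
           .print();

        env.execute("pipelined-snapshots-sketch");
    }
}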

Apache Flink™: Stream and Batch Processing in a Single Engine. P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, K. Tzoumas. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.

State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, K. Tzoumas. Proceedings of the VLDB Endowment 10 (12), 1718-1729.

[Figure: The Apache Flink stack. Libraries (SQL, Table, CEP, Graphs, ML) sit on top of the core APIs (DataStream, DataSet), which run on the Dataflow Runtime over the cluster backend; the runtime provides consistent state, event-time progress, a fluid API, and partitioned streams.]

FAST SLIDING WINDOW AGGREGATION

• Sliding window aggregation can be very expensive (see the sketch below)
• Existing optimisations apply to limited window types
• ‘Cutty’ redefines stream windows for optimal processing
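To illustrate that cost, here is a plain Flink sliding-window job (not Cutty): with a 60-second window sliding every second, each record belongs to 60 overlapping windows, and naive evaluation re-aggregates it in every one of them. The source and key are hypothetical:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Naive sliding-window aggregation: the sum is recomputed per window
// instead of being shared across overlapping windows.
public class SlidingWindowCost {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("sensor-1", 1), Tuple2.of("sensor-1", 2))  // hypothetical input
           .keyBy(value -> value.f0)
           // 60 s window, 1 s slide: every record is assigned to 60 windows
           .window(SlidingProcessingTimeWindows.of(Time.seconds(60), Time.seconds(1)))
           .sum(1)
           .print();

        env.execute("sliding-window-cost-sketch");
    }
}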

[Figure: Left, throughput (records/sec) against the number of queries (20 to 100) for Cutty and Pairs+RA. Right, total reduce calls (log scale, 10^4 to 10^11) against the number of queries (1 to 100) for Cutty (eager), Pairs+Cutty (lazy), Pairs, RA, and Naive.]

Cutty: Aggregate Sharing for User-Defined Windows. P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, V. Markl. ACM CIKM - 25th International Conference on Information and Knowledge Management.

SYSTEM SUPPORT FOR GRAPH STREAM MINING

• People process graphs inefficiently:
  1. Load: read the graph from disk and partition it in memory
  2. Compute: read and mutate the graph state
  3. Store: write the final graph state back to disk
• This is slow, expensive and redundant
• We propose a new way to process graphs continuously

1) Single-pass summaries

edgeStream.aggregate(newSummary(window, fold, combine, lower))

[Figure: Edge additions are folded into a graph summary, from which algorithms compute results R1 and R2.]

2) Neighbour Aggregation and Iterations on Stream Windows

graphstream.window(…)
    .iterateSyncFor(10, InputFunction(), StepFunction(), OutputFunction())

graphstream.window(…)
    .applyOnNeighbors(FindPairs())

[Figure: Dataflow with a windowed source (src, win), an iteration loop, and a sink.]
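As a rough illustration of the single-pass idea in plain Flink (this is not the gelly-streaming API shown above, and the job and its input are hypothetical), the following keeps a running per-vertex degree count over a stream of edge additions without ever materialising the whole graph:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

// Single-pass graph summary: per-vertex degree counts over edge additions.
public class DegreeSummary {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical edge stream: (source vertex, target vertex).
        env.fromElements(Tuple2.of(1L, 2L), Tuple2.of(2L, 3L), Tuple2.of(1L, 3L))
           .flatMap(new FlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Integer>>() {
               @Override
               public void flatMap(Tuple2<Long, Long> edge, Collector<Tuple2<Long, Integer>> out) {
                   // Each edge addition contributes one degree to both endpoints.
                   out.collect(Tuple2.of(edge.f0, 1));
                   out.collect(Tuple2.of(edge.f1, 1));
               }
           })
           .keyBy(value -> value.f0)   // partition the summary by vertex id
           .sum(1)                     // running degree count per vertex
           .print();

        env.execute("degree-summary-sketch");
    }
}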

Sponsors · Partners · github.com/vasia/gelly-streaming