HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
Salman Niazi1, Mahmoud Ismail1, Steffen Grohsschmiedt3, Mikael Ronström4
Seif Haridi1,2, Jim Dowling1,2
1 KTH - Royal Institute of Technology  2 RISE SICS - Swedish Institute of Computer Science
3 Spotify  4 Oracle
www.hops.io
Introduction
The Hadoop Distributed File System (HDFS) is the most popular open-source platform for storing large volumes of data. However, HDFS' design introduces two scalability bottlenecks. First, the Namenode architecture places practical limits on the size of the namespace (files/directories). HDFS' second main bottleneck is a single global lock on the namespace that ensures the consistency of the file system by limiting concurrent access to the namespace to a single writer or multiple readers.
HopsFS
HopsFS is an open-source, next-generation, drop-in replacement for HDFS. It replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a no-shared-state distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables the following:
• Significantly larger cluster sizes, storing 37 times more metadata.
• More than an order of magnitude higher throughput: 16x–37x, where 37x is the throughput gain for higher write rates.
• Significantly lower client latencies for large clusters.
• Multiple stateless Namenodes.
• Instant failover between the Namenodes.
• Tinker-friendly metadata.
[Figure: Metadata partitioning example for the path /user/foo.txt. The root inode (/), the /user inode, and the immediate children of /user (/user/foo.txt, /user/bar.txt, /user/foo.tar) are spread over the NDB datanodes NDB-DN1 to NDB-DN4, while the file-inode-related metadata tables (Quota, PRB, Inv, RUC, CR, ERB, LU, Block, Lease, Replica, URB) for /user/foo.txt are co-located on a single NDB datanode.]
HopsFS and HDFS Throughput for Spotify Workload
[Figure: Throughput (ops/sec, up to 1.6M) vs. number of Namenodes (1–60) for HopsFS using 2-, 4-, 8-, and 12-node NDB clusters, HopsFS using a 12-node NDB cluster with hotspots, and HDFS on the Spotify workload; an inset shows 25K–100K ops/sec for 1–5 Namenodes.]
HopsFS Architecture
HopsFS provides multiple stateless Namenodes. The Namenodes can serve requests from both HopsFS and HDFS clients; however, HopsFS clients additionally provide load balancing between the Namenodes using random, round-robin, and sticky policies.
[Figure: HopsFS architecture. HopsFS/HDFS clients send metadata operations to stateless Namenodes (NN 1 … NN N), which store the namespace in MySQL Cluster; Datanodes (DN 1 … DN N) store the file blocks.]
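The three client-side load-balancing policies named above (random, round-robin, sticky) can be sketched as follows. This is a minimal illustration; the class and method names are hypothetical, not the actual HopsFS client API:

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of client-side Namenode selection policies.
class NamenodeSelector {
    private final List<String> namenodes;      // addresses of live Namenodes
    private final Random random = new Random();
    private final AtomicInteger cursor = new AtomicInteger(0);
    private String sticky;                     // currently pinned Namenode, if any

    NamenodeSelector(List<String> namenodes) {
        this.namenodes = namenodes;
    }

    // Random: spread independent clients uniformly over the Namenodes.
    String pickRandom() {
        return namenodes.get(random.nextInt(namenodes.size()));
    }

    // Round-robin: rotate through the Namenodes on successive operations.
    String pickRoundRobin() {
        return namenodes.get(Math.floorMod(cursor.getAndIncrement(), namenodes.size()));
    }

    // Sticky: keep using one Namenode until it fails, then re-pick.
    String pickSticky() {
        if (sticky == null) sticky = pickRandom();
        return sticky;
    }

    // On failure, drop the pinned Namenode so the next call fails over.
    void reportFailure(String nn) {
        if (nn.equals(sticky)) sticky = null;
    }
}
```

Sticky selection keeps a client on one Namenode (good for cache locality), while random and round-robin spread load across all of them.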
Metadata Partitioning
All the inodes in a directory are partitioned using the parent inode ID; therefore, all the immediate children of the /user directory are stored on NDB-DN-3 for efficient directory listing, for example, ls /user. The file-inode-related metadata for /user/foo.txt is stored on NDB-DN-4 for efficient file-reading operations, for example, cat /user/foo.txt.
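The partitioning scheme above can be sketched as follows. The helper names and the simple modulo placement are illustrative assumptions, not HopsFS code; the point is only which column serves as the partition key:

```java
// Illustrative sketch (not HopsFS code) of why partitioning the inode table
// by parent inode ID makes directory listing a single-partition operation:
// every immediate child of a directory hashes to the same database node.
class InodePartitioning {
    static final int NUM_PARTITIONS = 4; // e.g., NDB-DN1 .. NDB-DN4

    // Inode rows are keyed by (parentId, name); the partition key is parentId,
    // so "ls" reads all children from one partition.
    static int partitionForInode(long parentId) {
        return (int) Math.floorMod(parentId, (long) NUM_PARTITIONS);
    }

    // File-inode-related metadata (blocks, replicas, leases, ...) is instead
    // partitioned by the file's own inode ID, so "cat file" touches one node.
    static int partitionForFileMetadata(long inodeId) {
        return (int) Math.floorMod(inodeId, (long) NUM_PARTITIONS);
    }
}
```

Because only the parent ID feeds the hash, siblings always co-locate, while each file's block and replica metadata co-locates on the partition chosen by its own ID.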
HopsFS and HDFS Metadata Scalability

Memory   HDFS             HopsFS
1 GB     2.3 million      0.69 million
50 GB    115 million      34.5 million
100 GB   230 million      69 million
200 GB   460 million      138 million
500 GB   Does Not Scale   346 million
1 TB     Does Not Scale   708 million
25 TB    Does Not Scale   17 billion
HopsFS and HDFS End-to-end Latency

[Figure: End-to-end latency, time (ms, 0–210) vs. number of clients (1000–6000) for HopsFS and HDFS; an inset shows 0–10 ms for 100–500 clients.]
HopsFS and HDFS Namenode Failover

[Figure: Throughput (ops/sec, 50K–600K) over time (20–230 s) for HDFS and HopsFS during failover; vertical lines represent Namenode failures.]
Paris Carbone, Gyula Fóra, Seif Haridi, Vasiliki Kalavri, Marius Melzer, Theodore Vasiloudis <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]>
Advancing Data Stream Analytics with Apache Flink®

A NEW STATE OF THE ART IN DATA STREAMING
• Lightweight, consistent end-to-end processing
• Dynamic reconfiguration and application management

CONSISTENT CONTINUOUS PROCESSING WITH PIPELINED SNAPSHOTS
[Figure: Pipelined snapshotting in the stream processor. Computation over partitioned local states (keys such as alice and bob from the input streams) is divided into epochs n, n-1, n-2, n-3; snapshots snap-1, snap-2, snap-3 move from in-progress through pending to committed, and committed snapshots support rollback and application updates.]

1) Divide the computation into epochs.
2) Capture states after each epoch, without stopping.

State is pre-partitioned in hash(K) space into key-groups.

Reconfiguration scenarios:
• scale in/out
• failure recovery
• pushing bug fixes
• application A/B testing
• platform migration
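The key-group idea above can be sketched in a few lines. The arithmetic follows the range-assignment scheme Flink documents (key-group = hash of the key modulo the maximum parallelism; each operator owns a contiguous range of key-groups), but the class and method names here are illustrative, and Flink additionally applies a murmur hash that this sketch omits:

```java
// Sketch of key-group based state partitioning: keys are hashed into a fixed
// key-group space, and each parallel operator instance owns a contiguous
// range of key-groups, so rescaling only reassigns ranges, never single keys.
class KeyGroups {
    // Which key-group a key belongs to; fixed for the life of the application.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Which operator instance owns a key-group under the current parallelism.
    static int operatorFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }
}
```

Because `keyGroupFor` depends only on the key and the (fixed) maximum parallelism, scaling in or out changes `operatorFor` ranges but never which key-group a key maps to, which is what makes reconfiguration from a snapshot cheap.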
P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, K. Tzoumas. Apache Flink™: Stream and Batch Processing in a Single Engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.

P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, K. Tzoumas. State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. Proceedings of the VLDB Endowment 10(12), 1718–1729.
[Figure: The Flink stack. Setup: cluster backend and metrics. Runner: dataflow runtime with consistent state and event-time progress. Core API: DataStream and DataSet, a fluid API over partitioned streams. Libraries: SQL, Table, CEP, Graphs, ML.]
FAST SLIDING WINDOW AGGREGATION
• Sliding window aggregation can be very expensive
• Existing optimisations apply to limited window types
• ‘Cutty’ redefines stream windows for optimal processing
[Figure: Left, throughput (records/sec, 0–4500K) vs. number of queries (20–100) for Cutty and Pairs+RA. Right, total reduce calls (10^4–10^11, log scale) vs. number of queries (1–100) for Cutty (eager), Pairs+Cutty (lazy), Pairs, RA, and Naive.]
P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, V. Markl. Cutty: Aggregate Sharing for User-Defined Windows. ACM CIKM, 25th International Conference on Information and Knowledge Management.
SYSTEM SUPPORT FOR GRAPH STREAM MINING
• People process graphs inefficiently:
  1. Load: read the graph from disk and partition it in memory
  2. Compute: read and mutate the graph state
  3. Store: write the final graph state back to disk
• This is slow, expensive and redundant
• We propose a new way to process graphs continuously
1) Single-pass summaries: algorithms (R1, R2) query a graph summary maintained over edge additions:

edgeStream.aggregate(new Summary(window, fold, combine, lower))

2) Neighbour aggregation and iterations on stream windows:

graphstream.window(…)
    .applyOnNeighbors(FindPairs())

graphstream.window(…)
    .iterateSyncFor(10, InputFunction(), StepFunction(), OutputFunction())

github.com/vasia/gelly-streaming
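A single-pass summary in the spirit of edgeStream.aggregate can be sketched as a fold/combine/lower triple. The degree summary below is an illustrative assumption, not the gelly-streaming API: fold incorporates one edge, combine merges partial summaries from different partitions, and lower extracts an answer from the summary:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative single-pass graph summary: per-vertex degree counts that can
// be folded edge-by-edge and merged across stream partitions.
class DegreeSummary {
    private final Map<Long, Long> degree = new HashMap<>();

    // fold: incorporate one edge addition into the summary.
    void fold(long src, long dst) {
        degree.merge(src, 1L, Long::sum);
        degree.merge(dst, 1L, Long::sum);
    }

    // combine: merge a partial summary computed on another partition.
    void combine(DegreeSummary other) {
        other.degree.forEach((v, d) -> degree.merge(v, d, Long::sum));
    }

    // lower: extract a value from the summary, e.g., one vertex's degree.
    long lower(long vertex) {
        return degree.getOrDefault(vertex, 0L);
    }
}
```

The same fold/combine/lower shape supports other summaries (connected components via union-find, triangle estimates, bipartiteness checks) without ever materialising the full graph.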