NAMENODE AND DATANODE COUPLING
FOR A POWER-PROPORTIONAL
HADOOP DISTRIBUTED FILE SYSTEM
Hieu Hanh Le, Satoshi Hikida and Haruo Yokota
Tokyo Institute of Technology
Appeared in DASFAA 2013
The 18th International Conference on Database Systems for Advanced Applications (Wuhan, China)
Agenda
Background
Research Motivation
Goal and Approach
Proposals
Experimental Evaluation
Conclusion
Background
The Hadoop Distributed File System (HDFS) is widely used as data storage for applications in the Cloud
- Built from commercial off-the-shelf hardware
- Supports the MapReduce framework
- Good scalability
HDFS utilizes a huge number of DataNodes to store the huge amounts of data requested by data-intensive applications
- This inflates the power consumption of the storage system
Power-aware file systems are moving towards power-proportional designs
[Background]
Power-proportional Storage System
A system should consume energy in proportion to the amount of work performed [Barroso and Hölzle, 2007]
- Set the system's operation to multiple gears, each containing a different number of DataNodes
- Made possible by data placement methods
[Figure: In high gear, all four nodes are active and hold blocks D1-D4; shifting to low gear migrates D1 and D4 onto the nodes that remain active, so the other nodes can be powered off.]
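To make the gear idea concrete, here is a minimal sketch of shifting down by data migration. The node and block names, the round-robin destination choice, and the placement map are all illustrative assumptions, not the paper's implementation.

```java
import java.util.*;

// Minimal sketch of gear shifting by data migration (hypothetical names,
// not the paper's implementation). Shifting down moves every block held
// by a soon-to-be-powered-off node onto a node that remains active.
public class GearShift {
    public static void main(String[] args) {
        List<String> nodes = List.of("Node1", "Node2", "Node3", "Node4");
        Map<String, List<String>> placement = new HashMap<>();
        placement.put("Node1", new ArrayList<>(List.of("D1")));
        placement.put("Node2", new ArrayList<>(List.of("D2")));
        placement.put("Node3", new ArrayList<>(List.of("D3")));
        placement.put("Node4", new ArrayList<>(List.of("D4")));

        int lowGearNodes = 2;                              // low gear keeps 2 of 4 nodes on
        List<String> active = nodes.subList(0, lowGearNodes);
        int rr = 0;                                        // round-robin over active nodes
        for (String off : nodes.subList(lowGearNodes, nodes.size())) {
            for (String block : placement.remove(off)) {   // 'off' can then be powered down
                String dest = active.get(rr++ % active.size());
                placement.get(dest).add(block);
                System.out.printf("migrate %s: %s -> %s%n", block, off, dest);
            }
        }
        System.out.println("low-gear placement: " + placement);
    }
}
```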
Research Motivation
Gear-shifting is vital in a power-proportional system
- The system needs to reflect data that was updated in a lower gear to guarantee the higher gear's performance
- The updated data is re-transferred according to the data placement
The gear-shifting process in current HDFS-based methods [Rabbit, Sierra] is inefficient
- Bottleneck in metadata access
- High communication cost among nodes
Rabbit: Robust and Flexible Power-proportional Storage, ACM SoCC 2010
Sierra: Practical Power-proportionality for Data Center Storage, ACM EuroSys 2011
Gear-shifting in current HDFS-based methods (e.g., Rabbit, Sierra)
[Figure sequence: a cluster of DataNode1-DataNode4 under a single NameNode writes dataset D = {D1, D2, D3, D4} in low gear; on gearing up, the updated blocks D1 and D4 are re-transferred to their intended DataNodes.]
1. Access metadata to identify the updated blocks
- Every lookup goes through the single NameNode: congestion
2. Transfer the updated blocks
- 2.1 Command issuance, again funneled through the NameNode: congestion
- 2.2 Block transfer, performed sequentially (1 block/connection): inefficiency
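A toy simulation of this flow makes the two congestion points visible; the types, names, and cost counting here are assumptions for illustration, and the actual Rabbit and Sierra protocols are more involved.

```java
import java.util.List;

// Toy simulation of the centralized gear-up flow (hypothetical names; the
// actual Rabbit/Sierra protocols are more involved). All metadata scanning
// and command issuance funnel through one NameNode, and each updated block
// is moved over its own connection.
public class CentralizedGearUp {
    record Block(String id, String tempNode, String intendedNode, boolean updated) {}

    public static void main(String[] args) {
        List<Block> namespace = List.of(          // held entirely by the single NameNode
                new Block("D1", "Node2", "Node1", true),
                new Block("D2", "Node2", "Node2", false),
                new Block("D3", "Node3", "Node3", false),
                new Block("D4", "Node3", "Node4", true));

        int commands = 0, connections = 0;
        for (Block b : namespace) {               // 1. scan metadata: load on one node
            if (!b.updated()) continue;
            commands++;                           // 2.1 one command per updated block
            connections++;                        // 2.2 one block per connection
            System.out.printf("move %s from %s to %s%n",
                    b.id(), b.tempNode(), b.intendedNode());
        }
        System.out.println(commands + " commands, " + connections + " connections");
    }
}
```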
Goal and Approach
Goal
- Propose a novel architecture for efficient gear-shifting in a power-proportional HDFS
Approach
- Utilize distributed metadata management (MDM): eliminates the bottleneck of centralized MDM
- Couple each NameNode with a DataNode (NDCouplingHDFS): localizes the range of updated blocks maintained by each metadata manager and reduces the communication cost among nodes
- Enable multiple-block transfers to improve efficiency in HDFS
[Proposals]
Distributed MDM
Distribute the MDM over multiple nodes to decentralize the load during gear-shifting
This requires a distributed MDM that is update-conscious
- The MDM itself is transferred when the system shifts gears
- Search/insert/delete operations must be cheap
Distributed-hash-table-based methods are inefficient here
- The hash function must be re-applied for every transferred file
Range-based methods are efficient
- For a range of files, all the metadata can be transferred with a limited number of structure traversals (see the sketch after this slide)
We apply two range-based methods
- Each node statically maintains a separate subnamespace (Static Directory Partitioning, SDP)
- A parallel index technique with good concurrency control (Fat-Btree) [*]
[*] A Concurrency Control Protocol for Parallel B-tree Structures without Latch-coupling for Explosively Growing Digital Content, EDBT 2008
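The sketch below illustrates the range-versus-hash argument with a TreeMap stand-in; this is not the paper's SDP or Fat-Btree, just a demonstration that a contiguous subnamespace can be split off a sorted structure in one pass, whereas a hash-based scheme would re-hash every transferred file individually.

```java
import java.util.*;

// Toy illustration of why range-based MDM suits gear-shifting (a TreeMap
// stand-in, not the paper's SDP or Fat-Btree). A contiguous subnamespace
// is split off in one traversal; a DHT would re-hash each file one by one.
public class RangeHandoff {
    public static void main(String[] args) {
        NavigableMap<Integer, String> metadata = new TreeMap<>();
        for (int id = 1; id <= 40; id++) metadata.put(id, "meta-" + id);

        // Hand subnamespace [21, 30] to a reactivated node in one traversal:
        NavigableMap<Integer, String> view = metadata.subMap(21, true, 30, true);
        Map<Integer, String> shipped = new TreeMap<>(view); // metadata to transfer
        view.clear();                                       // removed from the local node
        System.out.println("shipped " + shipped.size() + " entries; "
                + metadata.size() + " remain locally");
    }
}
```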
[Proposals]
NDCouplingHDFS with Distributed MDM
Each NDCouplingHDFS node couples a Distributed MDM with Data Management and maintains a subnamespace of the whole namespace of the system
The mapping information [Node, Range] is managed by the Distributed MDM, e.g. ND1: [1, 10], ND2: [11, 20], ND3: [21, 30], ND4: [31, ~]
Request flow:
1. A client sends a request for file 25 to any node
2. That node forwards the request to the responsible node (here ND3)
3. ND3 serves the request and returns the results
4. The results are returned to the client
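A minimal sketch of the routing step above (the API is hypothetical; in the system the mapping is kept by the Distributed MDM itself): keeping the slide's [Node, Range] table in a sorted map keyed by range start resolves the responsible node for any file id in O(log n).

```java
import java.util.*;

// Minimal sketch of [Node, Range] request routing (hypothetical API).
// The slide's mapping, kept in a sorted map keyed by range start, lets
// any node resolve the node responsible for a file id in O(log n).
public class RangeRouter {
    private final TreeMap<Integer, String> rangeStart = new TreeMap<>();

    public RangeRouter() {
        rangeStart.put(1, "ND1");    // ND1: [1, 10]
        rangeStart.put(11, "ND2");   // ND2: [11, 20]
        rangeStart.put(21, "ND3");   // ND3: [21, 30]
        rangeStart.put(31, "ND4");   // ND4: [31, ~]
    }

    /** Step 2 of the flow above: find the node to forward the request to. */
    public String responsibleNode(int fileId) {
        Map.Entry<Integer, String> e = rangeStart.floorEntry(fileId);
        if (e == null) throw new NoSuchElementException("no range covers " + fileId);
        return e.getValue();
    }

    public static void main(String[] args) {
        System.out.println(new RangeRouter().responsibleNode(25)); // -> ND3
    }
}
```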
[Proposals]
Efficient Gear-shifting
[Figure sequence: nodes A and D are reactivated on gear-up; blocks A1 and D1, written to the low-gear nodes B and C on their behalf, are reflected back. Each low-gear node keeps a WOL log of <File, Temp Node, Intended Node> entries.]
1. Transfer the updated metadata to the reactivated nodes
2. Command issuance, performed locally between the Distributed MDM and the Data Management on the same node
3. Transfer the updated blocks in batches (multiple blocks per connection)
4. Update the metadata
The process is distributed over multiple nodes, yielding parallelism, reduced network cost, and efficient block transfer; a batching sketch follows.
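The connection savings from batching can be sketched as follows. The 100-block cap matches the "maximum number of transferred blocks" used in Experiment 1, but the grouping logic, node names, and workload shape are assumptions for illustration.

```java
import java.util.*;

// Illustrative sketch of batch block transfer. Blocks are grouped by
// destination and sent up to `maxPerBatch` per connection (100 matches
// the cap used in Experiment 1); sequential transfer would need one
// connection per block. The grouping logic itself is an assumption.
public class BatchTransfer {
    public static void main(String[] args) {
        int maxPerBatch = 100;
        Map<String, List<String>> byDest = new HashMap<>();
        for (int i = 0; i < 16_000; i++)          // updated blocks found locally
            byDest.computeIfAbsent("Node" + (i % 8 + 1), k -> new ArrayList<>())
                  .add("blk-" + i);

        int batched = 0;
        for (List<String> blocks : byDest.values())
            batched += (blocks.size() + maxPerBatch - 1) / maxPerBatch; // ceil
        System.out.println("batched: " + batched + " connections vs sequential: 16000");
    }
}
```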
Experimental Evaluation
Experiment 1
- Verify the effectiveness of the proposals in the gear-shifting process by comparing against normal HDFS
- Reflecting updated blocks is the major cost
- Focuses on the coupling architecture and batch block transfer
Experiment 2
- Evaluate the effectiveness of the distributed index techniques in NDCouplingHDFS
- SDP vs. Fat-Btree while changing the number of nodes
[Experiment 1]
Validity of NDCouplingHDFS in Gear-shifting: Updated Data Reflection
Compare the execution time of updated-data reflection in NDCouplingHDFS against normal HDFS over five configurations
- Combinations of architecture, distributed MDM (SDP, Fat-Btree), command issuance, and block transfer
Environment
- # gears: 2
- # active nodes at low gear: 8
- # active nodes at high gear: 16
- # files: 16,000
- File size: 1 MB
- HDFS version: 0.20.2
- Maximum number of transferred blocks: 100
- Heartbeat interval: 1 s
[Figure: execution time and number of communication connections (command issuance) for NormalHDFS, SSS, SBS, SBB, and FBB.]
[Experiment 1]
Experimental Results

Configuration       Normal HDFS  SSS         SBS         SBB       FBB
Architecture        HDFS         Coupling    Coupling    Coupling  Coupling
MDM                 Central      SDP         SDP         SDP       Fat-Btree
Command issuance    Sequential   Sequential  Batch       Batch     Batch
Block transference  Sequential   Sequential  Sequential  Batch     Batch

[Figure: execution time [s] for the five configurations; the chart highlights reductions of 46% and 41% relative to normal HDFS.]
The coupling architecture and batch block transfer strongly affected the performance
[Experiment 2]
Scalability of Metadata Operations
Evaluate SDP vs. Fat-Btree while changing the number of files and the number of nodes
Environment
- Machines: 1, 2, 4, 8
- CPU: TM8600 1.0 GHz
- Memory: 4 GB DRAM
- NIC: 1000 Mb/s
- OS: Linux 3.0, 64-bit
- Java: JDK 1.7.0
- Fat-Btree fanout: 16
- Concurrency control: LCFB [Yoshihara, 2007]
- Workload: 3,000 files of 1 MB each
Fat-Btree gained better scalability as the number of nodes increased
- Read throughput scaled well owing to lower search cost and better concurrency control
- The gain in write throughput is limited by the synchronization cost of updating the tree structure
[Experiment 2]
Experimental Results
[Figure: read throughput [operations/s] and write throughput [operations/s] of SDP vs. Fat-Btree on 1, 2, 4, and 8 nodes.]
A transaction: open/create metadata and read/write files
Conclusion
Proposed NDCouplingHDFS for efficient gear-shifting in a power-proportional HDFS
- Reduced the execution time of reflecting updated data by up to 46% compared with normal HDFS, thanks to the coupling architecture and batch block transfer
- Improved I/O performance by applying distributed index techniques to NDCouplingHDFS
NDCouplingHDFS
- Maintains support for MapReduce
- Is expected to achieve true power-proportionality, including the power consumption of metadata management
NameNode and DataNode Coupling for a
Power-proportional Hadoop Distributed File System
Thank you for your attention!