Multi-file Queries Performance Improvement through Data Placement in Hadoop

Yu Tang 1, Elham Abdulhay 1, Aihua Fan 2, Sheng Su 1, Kidus Gebreselassie 1

1 University of Electronic Science and Technology of China, Chengdu, 611731, China
2 Xi’an Polytechnic University, Xi’an, 710048, China

2012 2nd International Conference on Computer Science and Network Technology (ICCSNT), Changchun, China

Abstract— Hadoop is enjoying popularity for processing data-intensive jobs because of its data locality feature. However, the performance gained from this feature is currently limited by Hadoop’s default block placement policy, which implicitly assumes that instances of MapReduce jobs access data from a single file. In contrast, multi-file queries such as indexing or aggregation queries need to process related data from more than one file, and these files may reside on different DataNodes of a cluster. In this paper we propose a Correlation-based Block Placement (CBP) algorithm that enhances the performance of such queries by placing related blocks on the same set of DataNodes. Furthermore, we developed a customized InputFormat that enables InputSplits to contain records from different files. Simulation results demonstrate that the number of migrating data blocks under CBP is insignificant, whereas under the default policy the number of migrating data blocks increases significantly with the input dataset size. As a result, for every input dataset size tested, the performance of CBP exceeded that of the default policy.

Index Terms— HDFS, Block Placement, Data locality, Correlation

I. INTRODUCTION

While the capability of computing systems has been increasing at the rate of Moore’s Law, the amount of digital data has been increasing even faster. Huge amounts of data are being generated by digital media, web authoring, physical and scientific simulations such as seismic simulation, natural language processing, and so on. Effectively storing, querying, analyzing and utilizing these ever-expanding datasets presents one of the grand challenges to the computing industry and research community [1][8]. Google publicized MapReduce, a popular programming model that has emerged as a scalable way to perform data-intensive computations on clusters of commodity computers [2][20].

Hadoop, developed by Doug Cutting, is a popular open-source implementation of Google’s MapReduce framework [3]-[5]. So far it has been applied in 150 institutions such as Yahoo!, Baidu, AOL, and Facebook for educational and production uses, and in 23 companies such as IBM and Amazon Web Services for service offerings [4]. Hadoop is a framework designed for scalable and distributed computing. It can scale from a single server to thousands of machines, each offering local computation and storage, simply by adding commodity servers. It allows distributed processing of large datasets by partitioning data and computation across thousands of DataNodes and executing application computations in parallel, close to their data [3].

A MapReduce job’s input file is first partitioned horizontally into blocks. For parallel computation and load balancing, the NameNode’s block placement policy distributes these blocks among the DataNodes. Then, when a job starts to execute, the NameNode uses its file system knowledge to assign the processing of each block to a DataNode on which that block is actually stored. That is, each instance of a MapReduce job moves to the DataNode where the required block is located. This strategy of moving the computation to the data is known as data locality optimization and results in high performance by avoiding unnecessary data transfers across the network [1][8]. However, the default placement policy and data locality are practical only for jobs that process data from a single file. In contrast, there are cases where user-level tasks or queries request multiple blocks from different files [7][9][11][12]. The blocks required by these queries can be distributed over different DataNodes, so the queries need to transfer data blocks from one DataNode to another. This transfer overhead becomes a performance bottleneck for MapReduce jobs.

Increasing evidence shows that careful data organization can boost the performance of Hadoop applications and query processing algorithms [6][17][18]. X. Jiong et al. [6] showed that ignoring data locality issues in heterogeneous environments can noticeably reduce MapReduce performance because unprocessed data migrate from fast nodes to slow nodes. The authors developed an algorithm that initially places block replicas based on node capacity. They also implemented a data redistribution algorithm to solve the skew problem that can occur when a cluster is expanded or new data are added to an existing file. In contrast to our work, X. Jiong et al. [6] focused on the placement of replicas of a single block; they did not consider relations among blocks.
A few other researchers have considered the relations among data blocks and developed algorithms that increase the availability and read performance of related blocks [9][11]. Specifically, the authors of [7][15][19] have tried to collocate related data blocks in Hadoop. However, A. Abouzied et al. [19] achieved this at the cost of heavyweight changes to Hadoop that break the programming model and simplicity of MapReduce. Although Hadoop++ [15] does not incur code maintenance cost, unlike our work it can only collocate two files; in addition, every time new data arrive, Hadoop++ needs to reorganize the whole dataset. M. Y. Eltabakh et al. [7] suggested an approach that allows applications to control the collocation of data blocks. They achieved collocation by modifying Hadoop APIs and injecting a new data structure into the NameNode. In contrast, our approach requires no API modification and no code maintenance cost. Furthermore, like the default policy, the approach in [7] considers a DataNode a good target if it has enough free space for one block. Consequently, if a block that is related to other blocks is placed on a DataNode with free space for only one or a few blocks, subsequent related blocks will be placed on other DataNodes, and the collocation of related blocks is overridden. Besides, since the MergeInputFormat proposed by these authors forms an InputSplit that contains all related blocks, some blocks are forced to migrate from one node to another. In our approach, by contrast, the required minimum free space is modified so that a considerable number of related blocks can be hosted on the same DataNode. In addition, our customized InputFormat constructs InputSplits containing records from different files found on the same DataNode, which significantly enhances data locality.

In this paper, we propose an approach that improves multi-file query performance by significantly reducing the amount of data moved among DataNodes. Our approach initially partitions large datasets in accordance with their usage pattern and then places related partitions on the same set of DataNodes. More specifically, we implemented a Correlation-based Block Placement (CBP) policy as a pluggable module that replaces the default placement policy. CBP does not disturb the fault tolerance feature of the default policy and requires no modification to Hadoop APIs. Hence, the cost of code maintenance is reduced as new releases of HDFS become available. In addition, we implemented an InputFormat that allows InputSplits to contain records from different files found on the same DataNode.

The rest of this paper is organized as follows. Section II gives background on Hadoop and explains the problem statement. The proposed solution and its implementation are described in Section III. In Section IV, we present our evaluation results. Finally, Section V concludes the paper with future research directions.

II. BACKGROUND AND PROBLEM STATEMENT

We first provide some background on Hadoop along with a description of the extensibility points that were used to implement our placement policy and InputFormat.

A. Hadoop Distributed File System (HDFS)

HDFS is the distributed file system component of Hadoop that provides high fault tolerance and high-throughput access to large application datasets. It has a master/slave architecture and stores file system metadata and application data separately. File system metadata are stored on a master server called the NameNode, and application data are stored on slave servers called DataNodes. Application data are stored and accessed in the form of files. Internally, however, these files are split into one or more blocks and stored on a set of DataNodes. The block size can be configured to a required value; its default value is 64 MB. The NameNode has a data structure called BlocksMap that keeps a mapping between data blocks and the DataNodes storing them. Every time new blocks are stored, or some blocks become inaccessible because of DataNode failures, the NameNode automatically updates the BlocksMap.
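
To make this block-to-DataNode mapping concrete, the short sketch below uses the standard HDFS client API to list, for a given file, which DataNodes host each of its blocks. The file path /data/example.csv is only an illustrative placeholder.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLister {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.csv");  // hypothetical file path
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode (backed by its BlocksMap) where each block of the file is stored.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}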

Like other distributed file systems, HDFS replicates the blocks of a file for fault tolerance, availability and maximum bandwidth utilization. By default, a block is replicated three times, and each replica is placed on a separate DataNode. The default block placement policy of HDFS distributes the replicas of a block across the DataNodes of a cluster so that the availability and reliability of that single block are maintained. For the default replication factor of three, the first replica of a block is placed on the local node where the writer is located; if the writer is not in the cluster, it is placed on a randomly chosen DataNode. The second replica is placed on a node in a remote rack. The third replica is placed on a different node in the same rack as the second replica. Besides, HDFS provides automatic replication factor management: if the number of replicas of a block is less than the required replication factor, HDFS automatically detects this and re-replicates the block up to the required number of replicas, placing the new replicas using the same policy with minor modifications.

The default policy chooses DataNodes randomly. The following piece of code shows how a DataNode is chosen from a given local rack.

Random r = new Random();
int numOfDataNodes = localRack.getNumOfLeaves();   // number of DataNodes in the local rack
int leaveIndex = r.nextInt(numOfDataNodes);        // pick one index uniformly at random

The DataNode at index ‘leaveIndex’ within the rack will be chosen.

Before this chosen DataNode is returned to the client as a good target, it is evaluated based on its available free space, its current load, and the number of DataNodes already chosen in the current rack. The default policy focuses only on the availability of a single block; it does not take the relations among blocks into consideration and simply chooses DataNodes randomly.


Fig.1. Horizontal File Partitioning

Besides, a randomly chosen DataNode is considered a good target if it can hold at least one block.

B. MapReduce

MapReduce is a software framework used for writing data-intensive applications that run on parallel computers such as commodity clusters. An application is implemented as a sequence of MapReduce operations, each consisting of a map stage and a reduce stage that process a large number of independent data items. The framework uses so-called InputFormats to fragment large input files into splits that are provided as input to the individual map tasks. Hadoop provides a number of InputFormat types, such as TextInputFormat and KeyValueTextInputFormat. However, the InputSplits generated by these InputFormats can contain records from only one file. Multi-file queries therefore face the problem of generating input splits that contain records from more than one input file. To overcome this problem, we developed an InputFormat that allows InputSplits to be constructed from different files, so that the related files of a multi-file query can be processed by a single mapper. A detailed explanation of this InputFormat is given in the following section.
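
As a reminder of where the InputFormat plugs into a job, a minimal driver might look as follows. The class name QueryDriver and the input/output paths are placeholders, and TextInputFormat merely stands in for the custom PartitionFileInputFormat described later.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueryDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "multi-file query");     // Job.getInstance(conf) on newer Hadoop releases
        job.setJarByClass(QueryDriver.class);

        // The InputFormat is the pluggable point that decides how input files become InputSplits.
        job.setInputFormatClass(TextInputFormat.class);  // later replaced by PartitionFileInputFormat

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}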

C. Problem Statement

As shown in Fig. 1, HDFS partitions datasets horizontally because it assumes data will be processed sequentially. However, multi-file queries, such as aggregation queries, need to process data (or records) with the same key that come from one or more files. For example, the records with key “1” in Fig. 1 need to be processed together.

Fig.2. Data Block Migration

Fig.3. Usage Pattern File Partitioning

A common MapReduce solution for these types of queries is to repartition the input files according to a given key [7][21]. However, repartitioning incurs a performance bottleneck because it involves expensive operations such as sorting and shuffling. Furthermore, as mentioned above, the default placement policy may distribute related blocks randomly across a cluster. Assume two files, File A and File B, where File A is divided into two blocks A1 and A2, and File B is divided into two blocks B1 and B2. Fig. 2 shows how these blocks could be distributed by the default policy. If mapper 1 needs to process A1 and B1, at least one of the blocks must migrate from one DataNode to another. To overcome the repartitioning and data migration problems, we propose a solution that is discussed in the following section.

Algorithm 1. CBP algorithm
1:  fileName = getBlockFileName()
2:  If (NotFollowConvention(fileName)) Goto step 15
3:  If (NoAlreadyStoredRelatedBlock(fileName)) Goto step 13
    // collect DataNodes hosting related blocks
4:  collectedDataNodes = CollectDataNodes()
    // sort by number of hosted related blocks, in descending order
5:  sortedDataNodes = Sort(collectedDataNodes)
6:  While (!(sortedDataNodes.hasNext())) Goto step 13
7:    candidateNode = sortedDataNodes.next()
8:    if (isNotGoodTarget(candidateNode)) Goto step 12
9:    Results.add(candidateNode)
10:   --numOfReplicas
11:   If (numOfReplicas == 0) Goto step 16
12: end while
    // the required minimum free space is changed
13: Call modified default policy
14: Goto step 16
15: Call default policy
16: Return Results
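
The selection logic of Algorithm 1 can be sketched in Java roughly as follows. This is only an illustrative sketch: the helper methods (followsConvention, collectDataNodes, isGoodTarget, etc.) and the DataNodeInfo type are hypothetical stand-ins, and the actual CBP class plugs into HDFS by extending the BlockPlacementPolicy abstract class, whose exact method signatures depend on the Hadoop version.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the CBP target-selection logic (Algorithm 1).
public class CbpSelectionSketch {

    List<DataNodeInfo> chooseTargets(String blockFileName, int numOfReplicas) {
        List<DataNodeInfo> results = new ArrayList<DataNodeInfo>();

        // Steps 2-3: fall back if the file does not follow the naming
        // convention, or if no related block has been stored yet.
        if (!followsConvention(blockFileName)) {
            return callDefaultPolicy(numOfReplicas, results);
        }
        if (!hasStoredRelatedBlock(blockFileName)) {
            return callModifiedDefaultPolicy(numOfReplicas, results);
        }

        // Step 4: DataNodes already hosting blocks related to this one.
        List<DataNodeInfo> candidates = collectDataNodes(blockFileName);
        // Step 5: prefer nodes hosting the most related blocks (descending order).
        Collections.sort(candidates, new Comparator<DataNodeInfo>() {
            public int compare(DataNodeInfo a, DataNodeInfo b) {
                return relatedBlockCount(b) - relatedBlockCount(a);
            }
        });

        // Steps 6-12: greedily pick good targets among those nodes.
        for (DataNodeInfo node : candidates) {
            if (!isGoodTarget(node)) {
                continue;
            }
            results.add(node);
            if (--numOfReplicas == 0) {
                return results;                          // step 16
            }
        }
        // Step 13: place any remaining replicas with the "modified default policy"
        // (default policy with a larger required minimum free space).
        return callModifiedDefaultPolicy(numOfReplicas, results);
    }

    // --- hypothetical helpers, not part of Hadoop ---
    private boolean followsConvention(String fileName) { return true; }
    private boolean hasStoredRelatedBlock(String fileName) { return true; }
    private List<DataNodeInfo> collectDataNodes(String fileName) { return new ArrayList<DataNodeInfo>(); }
    private int relatedBlockCount(DataNodeInfo node) { return 0; }
    private boolean isGoodTarget(DataNodeInfo node) { return true; }
    private List<DataNodeInfo> callDefaultPolicy(int n, List<DataNodeInfo> r) { return r; }
    private List<DataNodeInfo> callModifiedDefaultPolicy(int n, List<DataNodeInfo> r) { return r; }

    static class DataNodeInfo { }
}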

III. PROPOSED SOLUTION AND ITS IMPLEMENTATION

The proposed solution consists of three modules: a Partitioning and Loading Module, a Collocation Module, and an InputSplit Generation Module.

A. Partitioning and Loading Module

Unlike the default dataset partitioning, which partitions input datasets horizontally, this module partitions input datasets according to a given key or attribute value and loads each partition as a separate file. Records with the same key are included in the same partition file. We follow a naming convention for these files: partition files with the same key that come from different datasets are stored in the same folder. This helps the Collocation Module identify which partition files are related. Fig. 3 shows the operation of this module.
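
A minimal sketch of this module is shown below. It assumes a hypothetical naming convention of the form /warehouse/<key>/<dataset>.part (the exact convention is not spelled out in the paper), reads a local CSV file whose first field is the partition key, and writes each record into the per-key partition file on HDFS.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the Partitioning and Loading Module: split a local CSV dataset
// by its key column and load each partition as a separate HDFS file.
public class PartitionAndLoad {
    public static void main(String[] args) throws Exception {
        String localCsv = args[0];           // e.g. dataset_A.csv
        String datasetName = args[1];        // e.g. "A"
        FileSystem fs = FileSystem.get(new Configuration());

        Map<String, FSDataOutputStream> partitions = new HashMap<String, FSDataOutputStream>();
        BufferedReader in = new BufferedReader(new FileReader(localCsv));
        String line;
        while ((line = in.readLine()) != null) {
            String key = line.split(",", 2)[0];          // first CSV field is the key
            FSDataOutputStream out = partitions.get(key);
            if (out == null) {
                // Related partitions of different datasets share the key folder,
                // so the Collocation Module can recognize them as correlated.
                Path partFile = new Path("/warehouse/" + key + "/" + datasetName + ".part");
                out = fs.create(partFile);
                partitions.put(key, out);
            }
            out.writeBytes(line + "\n");
        }
        in.close();
        for (FSDataOutputStream out : partitions.values()) {
            out.close();
        }
        fs.close();
    }
}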

B. Collocation Module

This is the placement policy module that replaces the default policy baked into the NameNode. It is implemented as a Java class, the Correlation-based Block Placement (CBP) class, that extends the ‘BlockPlacementPolicy’ abstract class found in the NameNode. Data blocks whose file paths share at least the same parent directory are assumed to be correlated, and correlated data blocks are expected to be processed together. Thus, CBP places correlated data blocks on the same set of DataNodes. The algorithm of this class is described in Algorithm 1; there, we refer to the default policy with a modified required minimum free space as the “modified default policy”. The algorithm returns the list of DataNodes chosen for hosting the replicas of a given block. CBP is implemented as a pluggable module [13]; hence, it requires no recompilation of HDFS or Hadoop. Only configuration changes are required to replace the default placement policy with CBP: the jar file of the CBP class is placed in the library folder of the HDFS framework, and the full class name of the CBP class is assigned to the configuration parameter ‘dfs.block.replicator.classname’ in the site-specific configuration file, hdfs-site.xml. Then, when HDFS is started, CBP takes effect. This makes CBP easy to use.
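
As a sketch, the configuration change might look like the following entry in hdfs-site.xml. The property name is the one given above; the package and class name in the value are hypothetical.

<property>
  <name>dfs.block.replicator.classname</name>
  <!-- fully qualified name of the CBP class; the package name below is hypothetical -->
  <value>org.example.cbp.CorrelationBasedBlockPlacement</value>
</property>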

C. InputSplit Generation Module

This module is an InputFormat class called PartitionFileInputFormat that extends the CombineFileInputFormat abstract class. It generates combined InputSplits containing records from more than one file. The InputFormat assumes that the files in one folder form a pool. Since related partition files are collocated and stored in one folder by the Collocation Module, each pool created by this InputFormat holds related files. PartitionFileInputFormat overrides the getSplits(JobContext) method and defines it as follows: for each pool and a given MaxSplitSize, blocks on the same DataNode are combined to form a single combined InputSplit. This preserves data locality even when there are related blocks that belong to the same pool but are found on different DataNodes. Besides, this module provides a RecordReader that reads each record from the combined InputSplit and outputs a key and a value to a mapper.
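
The grouping rule behind getSplits can be illustrated with the simplified sketch below. It is not the actual PartitionFileInputFormat; it only shows, on plain data structures, how the blocks of one pool might be bucketed by hosting DataNode and cut into combined splits of at most MaxSplitSize.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of per-pool split construction.
public class CombineByHostSketch {

    static class Block {
        final String file;      // partition file the block belongs to
        final String host;      // DataNode hosting (a replica of) the block
        final long length;
        Block(String file, String host, long length) {
            this.file = file; this.host = host; this.length = length;
        }
    }

    static List<List<Block>> buildSplits(List<Block> poolBlocks, long maxSplitSize) {
        // Bucket blocks by hosting DataNode so that one split touches one node only.
        Map<String, List<Block>> byHost = new HashMap<String, List<Block>>();
        for (Block b : poolBlocks) {
            List<Block> bucket = byHost.get(b.host);
            if (bucket == null) {
                bucket = new ArrayList<Block>();
                byHost.put(b.host, bucket);
            }
            bucket.add(b);
        }

        // Within each host, pack blocks (possibly from different files)
        // into combined splits of at most maxSplitSize bytes.
        List<List<Block>> splits = new ArrayList<List<Block>>();
        for (List<Block> bucket : byHost.values()) {
            List<Block> current = new ArrayList<Block>();
            long size = 0;
            for (Block b : bucket) {
                if (size + b.length > maxSplitSize && !current.isEmpty()) {
                    splits.add(current);
                    current = new ArrayList<Block>();
                    size = 0;
                }
                current.add(b);
                size += b.length;
            }
            if (!current.isEmpty()) {
                splits.add(current);
            }
        }
        return splits;
    }
}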

IV. EVALUATION

In this section we present experimental results on the collocation, migration and distribution of related blocks, and on the response time of a typical aggregation query, for both the default placement policy and the CBP policy.

A. Cluster setup and Datasets

The experiments were run on a cluster of 5 nodes, each running Fedora 15 Linux with kernel release 2.6.42.12-1.fc15.x86_64. One of the nodes acted as both NameNode and DataNode; the rest acted as DataNodes. Java JDK 1.7.0_01 and Hadoop version 0.21.0 were used, and Hadoop was configured to run 4 mappers per node (4 map slots per node) and 2 reducers per node. We modified the required minimum free space to the size of 10 data blocks; thus, at least 10 related blocks can be collocated. We developed a simple program that generates synthetic datasets in CSV format of any size.
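
The generator itself is not described in detail in the paper; a minimal sketch of such a program, assuming simple "key,value" records and a configurable number of distinct keys, could look like this:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Random;

// Minimal synthetic CSV generator: writes roughly targetBytes of "key,value" rows.
public class CsvGenerator {
    public static void main(String[] args) throws Exception {
        String outFile = args[0];                        // e.g. dataset_A.csv
        long targetBytes = Long.parseLong(args[1]);      // e.g. 268435456 for ~256 MB
        int numKeys = Integer.parseInt(args[2]);         // e.g. 10 partition keys

        Random rnd = new Random();
        PrintWriter out = new PrintWriter(new FileWriter(outFile));
        long written = 0;
        while (written < targetBytes) {
            String row = (rnd.nextInt(numKeys) + 1) + "," + rnd.nextInt(1000000);
            out.println(row);
            written += row.length() + 1;                 // +1 for the newline
        }
        out.close();
    }
}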

B. Related Blocks Collocation and Migration

The first experiment examined the location and migration of related blocks when they were placed by the default policy and by the CBP policy. Ten datasets were generated, and each dataset was partitioned into 10 partition files, each partition file corresponding to one block. The partition files of each dataset were then loaded into HDFS one by one. For the CBP case, the experimental results showed that all related blocks were collocated and no block migration was required. For the default policy case, however, the number of migrating blocks increased significantly with the input dataset size. Fig. 4 shows the minimum and maximum number of migrating blocks under different input dataset sizes for the default policy case.

Fig.4. Number of Migrating Blocks

Fig. 4 shows that the number of migrating blocks increases with the input dataset size. When the input dataset size reached 1280 MB, the total number of blocks required by a typical MapReduce job (for example, an average query) was 20; at this point, 5%-35% of the total blocks (2 to 7 data blocks) were forced to migrate. When the input dataset size reached 6400 MB, 38%-81% of the total blocks were forced to migrate (here the total number of blocks required by the MapReduce job was 100).

C. Block Distribution

In this test scenario, the block distribution of both policies was evaluated. The default policy tries to distribute data blocks evenly across all DataNodes, while CBP tries to collocate related blocks on the same set of DataNodes. Ten datasets, each 250 MB in size, were partitioned and loaded one by one. The distribution of the blocks over the 5 DataNodes of the cluster for the default policy and for CBP are shown in Fig. 5 and Fig. 6, respectively. The figures show that the collocation feature introduced by CBP does not severely affect the block distribution. One reason for the similarity of the distributions is that CBP uses the default policy to place the first block of a given set of related blocks. Another reason is that we used the same number of blocks for each set of related blocks.

Fig.5. Block Distribution with Default Policy

Fig.6. Block Distribution with CBP

D. Average Query Response Time

Unlike [7], in this scenario we explicitly measured the impact of collocation on the response time of an average query implemented as a MapReduce job. The same average query implementation was executed under both the default and the CBP policy, with PartitionFileInputFormat used as the InputFormat in both cases; the numbers of mappers and reducers were therefore the same in both cases. The block size was set to 64 MB. For each case, 10 input datasets were partitioned and loaded into HDFS one by one, and the average query was executed under different input dataset sizes. Fig. 7 shows the response time of this query under different input dataset sizes. As the input dataset size increased, the probability of migrating more blocks increased for the default policy case, so more time was spent migrating these blocks. In the case of the CBP policy, almost all data blocks that formed each InputSplit were found on the same DataNode, and no data block migration was required. Hence the response time under CBP was always lower than under the default policy.
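
The paper does not list the query code; a minimal sketch of such an average query, assuming the record reader delivers (LongWritable offset, Text "key,value" line) pairs as TextInputFormat does, could be written as the following mapper and reducer pair:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of an average query over "key,value" CSV records.
public class AverageQuery {

    public static class AvgMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text outKey = new Text();
        private final DoubleWritable outVal = new DoubleWritable();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split(",", 2);
            outKey.set(fields[0]);                         // partition key
            outVal.set(Double.parseDouble(fields[1]));     // numeric value to average
            context.write(outKey, outVal);
        }
    }

    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(key, new DoubleWritable(sum / count));
        }
    }
}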

Fig.7. Average Query Response Time

V. CONCLUSION AND FUTURE WORK

We presented a three-module solution that improves multi-file query performance by avoiding remote access to data blocks. The first module partitions large files according to their usage pattern. The second module places related partitions on the same set of DataNodes; for collocating related partitions, we developed a Correlation-based Block Placement (CBP) policy that replaces the default policy. CBP is developed as a pluggable module, so no recompilation of HDFS is required; hence, the cost of code maintenance is reduced as new releases of HDFS become available. The last module implements a customized InputFormat that allows InputSplits to contain records from more than one related partition file found on the same DataNode, which improves data locality to a large extent. Experimental results demonstrated that CBP does not severely disturb the fault tolerance feature of the default policy, and that the performance of CBP was better than that of the default policy under different configuration parameters. While experimenting, we observed that data blocks could also migrate due to the heterogeneity of the DataNodes. Thus, integrating a heterogeneity parameter into CBP is a promising direction for future work.

Acknowledgement: This work is supported by the Research Start-up Funds for Returned Overseas Scholars, Ministry of Education, China.

REFERENCES

[1] G. Yunhong and R. Grossman, "Toward Efficient and Simplified Distributed Data Intensive Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 974-984, 2011.
[2] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107-113, 2008.
[3] K. Shvachko, et al., "The Hadoop Distributed File System," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1-10.
[4] Hadoop Wiki, "PoweredBy." http://wiki.apache.org/hadoop/PoweredBy
[5] Apache Hadoop. http://hadoop.apache.org/
[6] X. Jiong, et al., "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters," in Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, pp. 1-9.
[7] M. Y. Eltabakh, et al., "CoHadoop: flexible data placement and its exploitation in Hadoop," Proc. VLDB Endow., vol. 4, pp. 575-585, 2011.
[8] J. B. Buck, et al., "SciHadoop: array-based query processing in Hadoop," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, Washington, 2011.
[9] Z. Ming, et al., "Correlation-Aware Object Placement for Multi-Object Operations," in Distributed Computing Systems, 2008. ICDCS '08. The 28th International Conference on, 2008, pp. 512-521.
[10] D. Bo, et al., "Correlation Based File Prefetching Approach for Hadoop," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, 2010, pp. 41-48.
[11] H. Yu and P. B. Gibbons, "Optimal inter-object correlation when replicating for availability," in Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, Portland, Oregon, USA, 2007.
[12] A. Floratou, et al., "Column-oriented storage techniques for MapReduce," Proc. VLDB Endow., vol. 4, pp. 419-429, 2011.
[13] Apache JIRA issue HDFS-385. https://issues.apache.org/jira/browse/HDFS-385
[14] S. Blanas, et al., "A comparison of join algorithms for log processing in MapReduce," in Proceedings of the 2010 International Conference on Management of Data, Indianapolis, Indiana, USA, 2010.
[15] J. Dittrich, et al., "Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)," Proc. VLDB Endow., vol. 3, pp. 515-529, 2010.
[16] D. Bo, et al., "A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: A Case Study by PowerPoint Files," in Services Computing (SCC), 2010 IEEE International Conference on, 2010, pp. 65-72.
[17] W. Qingsong, et al., "CDRM: A Cost-Effective Dynamic Replication Management Scheme for Cloud Storage Cluster," in Cluster Computing (CLUSTER), 2010 IEEE International Conference on, 2010, pp. 188-196.
[18] M. Zhou and C. Xu, "Optimized data placement for column-oriented data store in the distributed environment," in Proceedings of the 16th International Conference on Database Systems for Advanced Applications, Hong Kong, China, 2011.
[19] A. Abouzied, et al., "HadoopDB in action: building real world applications," in Proceedings of the 2010 International Conference on Management of Data, Indianapolis, Indiana, USA, 2010.
[20] C. Lam, Hadoop in Action. Manning Publications, 2011.
[21] S. Blanas, et al., "A comparison of join algorithms for log processing in MapReduce," in Proceedings of the 2010 International Conference on Management of Data, Indianapolis, Indiana, USA, 2010.