Consideration of Parallel Data Processing over an Apache Spark Cluster

Kasumi Kato∗, Atsuko Takefusa†, Hidemoto Nakada‡ and Masato Oguchi∗
∗Ochanomizu University
†National Institute of Informatics

‡National Institute of Advanced Industrial Science and Technology (AIST)

I. INTRODUCTION

The spread of cameras, sensors, and cloud technologies enables us to obtain life logs at ordinary homes and transmit the captured data to a cloud for life log analysis. However, the amount of processing for video data analysis in a cloud drastically increases when a very large number of homes send data to the cloud.

In this research, we aim to improve the efficiency of distributed video data analysis processing by using the parallel deep learning framework Chainer [2] and the distributed processing platform Apache Spark [1] (Spark). In this paper, we construct a Spark cluster and investigate the performance of parallel data processing using Spark while varying parameter settings.

II. RELATED TECHNOLOGY

A. Apache Spark

Spark is a fast and versatile cluster computing platform designed for large-scale data processing. Spark was developed at the University of California, Berkeley and was donated to the Apache Software Foundation in 2014. Spark can perform batch processing in small units, called micro-batch processing. Because computation is performed in memory on the Spark platform, the performance of data processing is improved by avoiding expensive disk access. Processed data are stored in Resilient Distributed Datasets (RDDs) defined by the Spark core, and the RDDs are automatically distributed. The Spark project consists of multiple components tightly coupled together, such as the Spark core, Spark SQL, and MLlib [5]. The Spark core includes Spark's basic functions such as task scheduling, memory management, fault recovery, and interaction with the storage system. By operating on RDDs, it is possible to easily perform a wide range of data processing tasks including iterative machine learning, streaming, complicated queries, and batch jobs. Spark can be combined with other big data tools such as Hadoop [4], and has excellent versatility.
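As a minimal illustration of this programming model (our sketch, not code from the paper), transformations on an RDD such as map() only record lineage lazily, and an action such as reduce() triggers the actual distributed, in-memory computation:

from pyspark import SparkContext

# Hypothetical application name; any name works here.
sc = SparkContext(appName="rdd-demo")

# parallelize() distributes a local collection across the cluster as an RDD.
numbers = sc.parallelize(range(1, 101))

# map() is a lazy transformation; no computation happens yet.
squares = numbers.map(lambda x: x * x)

# reduce() is an action: it triggers the distributed, in-memory job.
print(squares.reduce(lambda a, b: a + b))  # prints 338350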

B. Chainer

Chainer is a deep learning framework developed by Preferred Networks. It is provided as a Python library and has three outstanding characteristics: "Flexible", "Intuitive", and "Powerful". Chainer enables users to install it easily, write code intuitively with a simple notation, and perform high-performance computation utilizing GPUs with CUDA support. It is used in a wide range of fields including image processing, natural language processing, and robot control.

Fig. 1. Master/Worker processing over a Spark cluster.


The most notable feature of Chainer is its so-called "define-by-run" nature, meaning that it constructs the computational graph while the code runs for the first time. In contrast, most other deep learning frameworks take a different approach called "define-and-run", where graph construction is performed in advance as a separate step. The define-by-run feature is quite useful for debugging complicated deep neural networks.
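The following tiny sketch (ours, not from the paper; Chainer v2-style API) shows the define-by-run behavior: the graph is recorded while the forward pass executes, so ordinary Python branching can reshape it on every run.

import numpy as np
import chainer.functions as F
import chainer.links as L
from chainer import Variable

layer = L.Linear(4, 2)  # a small fully connected layer; no graph exists yet

x = Variable(np.random.rand(1, 4).astype(np.float32))

# The graph nodes are created here, as the forward computation runs, so
# plain Python control flow can change the graph shape on each call.
h = layer(x)
if float(h.data.sum()) > 0:
    y = F.relu(h)
else:
    y = F.sigmoid(h)

# Backpropagation walks the graph that was just recorded.
y.grad = np.ones_like(y.data)
y.backward()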

III. EXPERIMENTS

A. Outline of Experiment

We investigate the performance of parallel data processing using Spark while varying parameter settings. We construct a Spark cluster for parallel machine learning processing using Chainer as shown in Fig. 1. The Spark cluster consists of a master node and worker nodes; the master node executes a Python program to read data and create RDDs of the data. We use MNIST [3], an image data set of 28×28-pixel handwritten digits from 0 to 9, as the data to be analyzed. The master node then passes the RDDs to the worker nodes, and the worker nodes evaluate them using Chainer. In the experiments, we use Spark v. 2.2.0 and Chainer v. 2.0.2. The Spark cluster consists of 1 master and 4 worker nodes connected in Spark standalone mode; the details of each node are shown in Table I. The Spark cluster configuration is shown in Fig. 2. Spark has the concept of partitions to split each RDD for processing, and we can specify the number of partitions by using the method called partitionBy() in the program as follows:

partitionBy(numPartitions, partitionFunc=portable_hash)
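A minimal sketch of this flow under our own assumptions (the application name and the choice of 9 partitions are illustrative): the master loads MNIST with Chainer's dataset helper, keys each sample, and splits the resulting RDD with partitionBy().

from pyspark import SparkContext
from pyspark.rdd import portable_hash
from chainer.datasets import get_mnist

sc = SparkContext(appName="mnist-partition-demo")  # hypothetical app name

# Load MNIST on the master via Chainer's dataset helper; each element is a
# (784-dimensional image array, label) pair.
train, test = get_mnist()

# Key each sample by its index so the RDD can be partitioned by key.
pairs = sc.parallelize(list(enumerate(test)))

# Split the RDD into 9 partitions using the default partition function.
partitioned = pairs.partitionBy(9, partitionFunc=portable_hash)

# glom() gathers each partition into a list so the split can be inspected.
print(partitioned.glom().map(len).collect())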

TABLE I
SPECIFICATIONS OF EACH MACHINE

OS      Ubuntu 16.04 LTS
CPU     Intel(R) Xeon(R) CPU W5590 @ 3.33 GHz (4 cores) × 2 sockets
Memory  8 GB

Fig. 2. The Spark cluster configuration.

In this paper, we perform parallel data processing while changing the following two parameters:

1) locality wait time
2) partitioner

Parameter 1), the locality wait time, specifies a standby time before task allocation, and parameter 2), the partitioner, decides how tasks are divided into groups. Spark implements a scheduling scheme called "delay scheduling" to achieve compatibility between fairness and locality [6]. If the next job to be scheduled cannot, for fairness, be allocated to a node that stores the data it processes, the scheduler makes the job wait and starts other jobs instead. In this paper, we call this standby time the "locality wait time" and specify it in the configuration file; by changing it, we experiment with the relative weight of fairness and locality. Since the partitioner is defined as portable_hash by default, we can adjust the partitioning of tasks by changing portable_hash.

B. Locality Wait Time

The locality wait time is set to 3 seconds by default. In this paper, the locality wait time is set to 3 seconds and to 0 seconds, the latter meaning a state that does not consider data locality. We divide 40 tasks into partitions and use 3 workers. Fig. 3 shows the behavior of task processing when the number of partitions is set to 9 and the locality wait time is set to 3 s. The horizontal axis shows the elapsed time, and the vertical axis represents the nodes where tasks are processed. Each arrow in the figure indicates the elapsed time of a task processed on a node, and a group of arrows continuing toward the upper right represents a partition. We can see that tasks are concentrated on a certain node, so processing cannot be performed efficiently. We assume that this phenomenon is caused by the locality wait time. Fig. 4 shows the behavior of task processing when the number of partitions is set to 9 and the locality wait time is set to 0. With this parameter adjustment, all nodes are used, but the total execution time was not improved much.
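For reference, spark.locality.wait is the standard Spark configuration key for this standby time (default 3 s); it can be set in conf/spark-defaults.conf as described above, or programmatically as in this sketch:

from pyspark import SparkConf, SparkContext

# Setting the locality wait to 0 s disables waiting for data locality.
conf = SparkConf().set("spark.locality.wait", "0s")
sc = SparkContext(conf=conf, appName="locality-wait-demo")  # name illustrative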

C. Partitioner

We divide 1,000 tasks into partitions and use 4 workers. Fig. 5 shows the behavior of task processing when the number of partitions is set to 32.

Fig. 3. The behavior of task processing with default settings (number of partitions is set to 9).

Fig. 4. The behavior of task processing with the locality wait set to 0 sec (number of partitions is set to 9).

We can see that the number of tasks in a partition is not uniform, so the total execution time becomes longer. Therefore, we modified portable_hash, which decides how many data items go into each partition, so that the partition sizes are adjusted to be equal. portable_hash is an argument of partitionBy(). Fig. 6 shows the behavior of task processing after this modification, with the number of partitions again set to 32. Fig. 6 shows that the number of tasks in each partition became almost uniform and that tasks are processed in parallel. However, the number of partitions processed on each node decreases, and partitions are executed consecutively in the time direction.
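The paper does not show the modified function itself; the following is one hypothetical replacement for portable_hash that yields equal-sized partitions when the keys are consecutive integers, as in a keyed-task setup like the one above.

from pyspark import SparkContext

sc = SparkContext(appName="uniform-partition-demo")  # hypothetical app name

# 1,000 keyed tasks, as in the experiment above.
pairs = sc.parallelize([(i, i) for i in range(1000)])

def uniform_partition(key):
    # partitionBy() takes this return value modulo numPartitions, so
    # returning the integer key itself assigns consecutive keys to
    # consecutive partitions and keeps the partition sizes nearly equal.
    return int(key)

partitioned = pairs.partitionBy(32, partitionFunc=uniform_partition)

# Every partition should now hold 31 or 32 of the 1,000 items.
print(partitioned.glom().map(len).collect())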

Fig. 7 shows the behavior of task processing when the number of partitions is set to 64. Each partition is finer-grained than when the number of partitions is set to 32. In addition, the number of tasks included in each partition varies. The start time of task processing is delayed compared with the 32-partition case, and the execution time does not become shorter as the number of partitions is increased.

Fig. 8 shows the behavior of task processing after modification when the number of partitions is set to 64. As with 32 partitions, the number of tasks in each partition became almost uniform and tasks are processed in parallel. However, only three out of the four nodes were used.

Fig. 5. The behavior of task processing with default settings (number of partitions is set to 32).

Fig. 6. The behavior of task processing with modified settings (number of partitions is set to 32).


IV. CONCLUSIONS

In the experiments, we investigated the performance of parallel data processing using Spark while varying parameter settings. Setting the locality wait time to 0 is effective in that processing can be performed in parallel among multiple nodes, but the total execution time was not improved much compared with the default setting, which considers data locality. Also, the modified portable_hash can make the number of tasks in each partition almost uniform so that tasks are processed in parallel, but the number of partitions processed on each node decreases and partitions are executed consecutively. In the future, we will consider a method to control the number of tasks allocated to each worker in order to improve the processing performance.

ACKNOWLEDGMENT

This paper is partially based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), JSPS KAKENHI Grant Number JP16K00177, and the open collaborative research at the National Institute of Informatics (NII), Japan (FY2017).

Fig. 7. The behavior of task processing with default settings (number of partitions is set to 64).

Fig. 8. The behavior of task processing with modified settings (number of partitions is set to 64).


REFERENCES

[1] Apache Spark, https://spark.apache.org/.
[2] Tokui, S., Oono, K., Hido, S. and Clayton, J.: Chainer: a Next-Generation Open Source Framework for Deep Learning, In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015).
[3] LeCun, Y., Cortes, C. and Burges, C. J.: The MNIST Database of Handwritten Digits, http://yann.lecun.com/exdb/mnist/.
[4] Apache Hadoop, https://hadoop.apache.org/.
[5] Karau, H., Konwinski, A., Wendell, P. and Zaharia, M.: Learning Spark, O'Reilly Media (2015).
[6] Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S. and Stoica, I.: Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, In EuroSys 2010 (2010).