
  • Neural Network Reliability Enhancement Approach Using Dropout Underutilization in GPU

    Dongsu Lee, Hyunyul Lim, Tae Hyun Kim, Sungho Kang Dept. of Electrical and Electronic Engineering

    Yonsei University Seoul, Korea

    {ehdtn4545, lim8801, incendio9}@soc.yonsei.ac.kr and [email protected]

Abstract— Recently, research on deep neural networks (DNNs) using GPUs has been actively conducted. GPUs are used for DNNs because their many computational cores reduce the learning time. However, GPUs have no mechanisms to support reliable computing operations, which can degrade the reliability of a deep neural network. To ensure the reliability of the deep neural network, the proposed approach utilizes the dropout technique used in MLP (multi-layer perceptron) learning. With the dropout method, some threads in the GPU do not participate in the calculation, which causes GPU underutilization. The proposed method uses this GPU underutilization to support a reliable deep neural network, by pairing the idle neurons with adjacent calculating neurons in the dropout process. The experimental results show that the proposed approach is able to address reliability issues in GPUs while executing deep neural network algorithms.

    Keywords—Reliability, GPU, Multi-layer Perceptron, Deep neural network, dropout

I. INTRODUCTION

    Deep neural network learning is a set of machine learning algorithms that builds high-level abstractions, which summarize the core content or features of large or complex data sets, through a combination of several non-linear transformation techniques [1]. The problems solved using machine learning are largely represented by two types: regression and classification. Classification refers to problems where the result of the input value is not continuous but falls into one of several classes. Regression refers to problems where the result value is continuous and a continuous mapping from variables to the result value is expected. In a broad sense, these problems form a field of machine learning that teaches computers how people think.

The most basic structure imitating biological neurons is the multi-layered neural network, in which a large number of artificial neurons are arranged in each layer [2]. Information is exchanged through the interconnections between the many layers of neurons. An artificial neural network composed of multiple layers imitates basic brain structure and can be used in various research fields.

Because of its potential for many uses, deep learning is expanding into the overall realm of industry. Simple examples include email spam filtering, postal code recognition on mail, recommendation systems for shopping malls and cable TV, natural language recognition, autonomous driving and so on [2], [3], [4]. The basic principle that enables these applications is the use of arithmetic units that mimic human neurons. The reliability of the artificial intelligence environment is also an important issue, as the application of artificial intelligence directly or indirectly affects human life: a wrong operation of artificial intelligence may cause damage to property or personal injury, as in an autonomous vehicle accident. Therefore, securing the reliability of artificial intelligence calculation results is one of the most important issues in the research and development of artificial intelligence.

In artificial intelligence learning, GPUs are among the most important devices because they greatly improve the learning time compared with CPUs [5]. However, there are no methods to support the reliability of the computing operations in GPUs; only high-end GPUs provide a means of ensuring reliability, at the level of ECC memory. Therefore, this study explains how to support reliability in GPUs in the artificial intelligence learning case.

In this study, we propose a method to improve the reliability of artificial intelligence by using the GPU underutilization which occurs in the dropout process. Dropout is one of the techniques used to mitigate overfitting in the learning process of a multi-layer perceptron. The rest of this paper is organized as follows. Section II introduces the background. In Section III, the details of the proposed approach are explained. In Section IV, the experimental results of the proposed approach are presented. Finally, Section V describes the conclusion and future works.

    Proceedings of TENCON 2018 - 2018 IEEE Region 10 Conference (Jeju, Korea, 28-31 October 2018)

978-1-5386-5457-6/18/$31.00 ©2018 IEEE

  • II. BACKGROUND

As a classifier built with an artificial neural network, the multi-layer perceptron (MLP) is a type of feedforward neural network in which many neurons are interconnected across many layers, as shown in Fig. 1a. The MLP consists of one input layer, a number of hidden layers and one output layer. Each layer consists of multiple artificial neurons acting as mathematical computation units. Most input layer neurons use a linear activation function, while the hidden layers and the output layer use nonlinear activation functions such as softmax, ReLU, etc. [6], [7]. For training an MLP, various learning algorithms are available; one of the most widely used is the backward propagation (BP) algorithm. The BP sequence first performs forward propagation, which generates output signal values through the artificial neuron layers. After forward propagation, error values can be obtained by various error calculation equations from the output signal values and the target signal values. Backward propagation then uses the error values to adjust the weights of the artificial neurons. However, a large number of artificial neurons per layer and a large number of layers in an MLP lead to overfitting problems when learning from excessive datasets in machine learning.
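The forward and backward propagation sequence described above can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical layer sizes and a squared-error loss for brevity, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward(x, W1, W2):
    h = relu(x @ W1)          # hidden layer with nonlinear activation
    y = h @ W2                # output layer (raw scores)
    return h, y

def backward(x, h, y, target, W1, W2, lr=0.01):
    # Error values from the output, propagated back to adjust the weights.
    dy = y - target
    dW2 = np.outer(h, dy)
    dh = (W2 @ dy) * (h > 0)  # backpropagate through ReLU
    dW1 = np.outer(x, dh)
    return W1 - lr * dW1, W2 - lr * dW2

x = rng.normal(size=4)
target = np.array([1.0, 0.0])
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))
h, y = forward(x, W1, W2)
W1, W2 = backward(x, h, y, target, W1, W2)
```

One forward/backward pass reduces the error on the training sample; repeating it over a dataset is the learning loop the paper times on the GPU.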

The term overfitting is common in machine learning. It means that the results from learning a deep neural network correspond too closely or exactly to a particular set of data, and may therefore fail to fit additional data [8]. Overfitting is a very important concept in artificial neural networks because it is closely related to other important statistical concepts. There are various reasons that cause overfitting in machine learning.

One reason is that there are too many variables to consider in a neuron layer. This is also known as the curse of dimensionality: the larger the number of variables to consider, the wider the space in which the data is represented, and the more sparsely the observations are distributed in that wider space. As the points become farther apart, the effect of each value on the model becomes larger. Just as extreme values have a large impact on a mean value, even accidentally placed points (which can be thought of as noise) can have a big impact on the model, increasing the likelihood that the model will explain the noise. Another reason is that the model is too complex for the problem; overfitting can also occur if the model is too complex. In fact, the complexity of the model is closely related to the concept of degrees of freedom. This is easy to understand from Fig. 1c: the green line represents an overfitted model and the black line represents a regularized model. The green line best follows the training data; however, it is too dependent on that data and is likely to have a higher error rate on new, unseen data compared to the black line. There are various methods to solve overfitting problems; in the multi-layer perceptron case, developers use the dropout method.

In general, increasing the number of hidden layers in an artificial neural network increases the learning ability; however, it also raises the possibility of overfitting and lengthens the learning time. Dropout [9] is a technique for preventing overfitting and provides an effective learning environment in the multi-layer perceptron by dropping out neurons (both hidden and visible) on training data, as shown in Fig. 1b. It is a very efficient way of performing model averaging with neural networks. With dropout, not all the neurons in each layer participate in the computation; the neurons to drop in each layer are chosen probabilistically, with the selection of which neurons participate in learning following a Bernoulli distribution. Using dropout, the activity of the hidden units is controlled and features can be learned independently, which is effective for learning. When the dropout model is applied on GPUs, a dropped unit is, due to the characteristics of the GPU, still scheduled but not involved in the operation, which causes underutilization of GPU resources.
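The Bernoulli selection above is what standard dropout implementations do. A minimal sketch (using the common "inverted dropout" scaling; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, keep_prob=0.5):
    # Each neuron is kept with probability keep_prob (Bernoulli trial).
    mask = rng.binomial(1, keep_prob, size=activations.shape)
    # Scale the kept activations so the expected sum is unchanged.
    return activations * mask / keep_prob, mask

h = np.ones(1000)                       # a layer of 1000 unit activations
h_dropped, mask = dropout(h, keep_prob=0.5)
# Roughly half the neurons are zeroed; the rest are scaled by 1/0.5 = 2.
```

The zeroed positions in `mask` are exactly the neurons whose GPU threads would sit idle, which is the underutilization the proposed approach exploits.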

The general case of underutilization in GPUs is the branch divergence problem [10], which is due to the operating characteristics of GPUs. When a branch instruction is encountered while the GPU is operating, the threads within a warp take the branch operation. Due to the single-PC lock-step constraint, threads on the not-taken path are also executed together in the warp operation. While the not-taken-path instructions are executed, the cores assigned to the taken-path threads are idle. This sequence wastes time through underutilization.
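A toy model makes the cost of lock-step execution concrete (this is an illustration of the serialization effect only, not real GPU timing):

```python
# Model one divergent branch in a warp: the warp executes each distinct
# path serially, and threads on the other path are idle for those cycles.

def warp_cycles(conditions, taken_len, not_taken_len):
    """Return (total cycles, idle thread-cycles) for one branch."""
    n = len(conditions)
    paths = []
    if any(conditions):
        paths.append((sum(conditions), taken_len))        # taken path
    if not all(conditions):
        paths.append((n - sum(conditions), not_taken_len))  # not-taken path
    cycles = sum(length for _, length in paths)   # paths run back to back
    useful = sum(active * length for active, length in paths)
    wasted = n * cycles - useful                  # idle thread-cycles
    return cycles, wasted

# 8-thread warp, half take a 10-cycle path, half a 6-cycle path:
cycles, wasted = warp_cycles([True] * 4 + [False] * 4, 10, 6)
```

With no divergence (all threads taking the same path) the wasted thread-cycles drop to zero, which is why divergence, like dropout-induced idling, is a pure utilization loss.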

Fig. 1. Common deep learning models (a) MLP with one hidden layer scheme (b) Dropout MLP with one hidden layer scheme (c) Overfitting model


However, in the case of GPU underutilization in a multi-layer perceptron using dropout, the cause is the occurrence of streaming processors that do not participate in the operation because of dropout. This underutilization is caused by the lock-step execution of the GPU's SIMT structure. Computation on the GPU is managed and executed by the warp scheduler. In the SIMT (single instruction, multiple threads) structure of the GPU, the threads in a streaming multiprocessor are all executed at the same time by the warp scheduler, and the operation is executed even when no computation value is assigned to an active thread. Due to this characteristic of the GPU, GPU underutilization occurs in the dropout process.

    III. DROPOUT BASED MULTI-LAYER PERCEPTRON RELIABILITY ENHANCEMENT APPROACH

This section proposes a dropout-based reliability enhancement approach for MLPs. The proposed approach uses the dropout neurons as shown in Fig. 2. Section A describes the modeling of the proposed dropout approach and Section B explains the application of the proposed modeling to GPUs.

This section describes the process of the proposed multi-layer perceptron model. The process begins by considering a general weighted sum.

$$Y = X \times W \qquad (1)$$

Equation (1) denotes the general weighted sum of an $n$-input multi-layer perceptron, where $X$ denotes the input vector, $W$ the weight matrix of the neurons, and $Y$ the weighted sum. For the sake of clarity, the following derivation assumes four neurons in one layer and a dropout rate of 0.5. The implementation of the proposed matrix is shown as an equation in Fig. 2.

$$\begin{bmatrix} Y_1 & 0 & Y_3 & 0 \end{bmatrix} = \begin{bmatrix} X_1 & X_2 & X_3 & X_4 \end{bmatrix} \times \begin{bmatrix} W_{11} & 0 & W_{13} & 0 \\ W_{21} & 0 & W_{23} & 0 \\ W_{31} & 0 & W_{33} & 0 \\ W_{41} & 0 & W_{43} & 0 \end{bmatrix} \qquad (2)$$

Equation (2) indicates that the second and fourth neurons in the layer are dropped out by the dropout process. It can be seen that their weight values are set to zero through the dropout process.

$$\begin{bmatrix} Y_1 & 0 & Y_3 & 0 \end{bmatrix} = \begin{bmatrix} \sum_i X_i W_{i1} & 0 & \sum_i X_i W_{i3} & 0 \end{bmatrix} \qquad (3)$$

Equation (3) is the weighted-sum result of (2); each $Y$ represents a weighted sum over the inputs. Equation (3) is the result of the general dropout process, and the proposed dropout technique follows in (4).

$$\begin{bmatrix} Y_1 & Y_1 & Y_3 & Y_3 \end{bmatrix} = \begin{bmatrix} X_1 & X_2 & X_3 & X_4 \end{bmatrix} \times \begin{bmatrix} W_{11} & W_{11} & W_{13} & W_{13} \\ W_{21} & W_{21} & W_{23} & W_{23} \\ W_{31} & W_{31} & W_{33} & W_{33} \\ W_{41} & W_{41} & W_{43} & W_{43} \end{bmatrix} \qquad (4)$$

To realize the proposed approach, each dropout neuron is given a copy of the values of its adjacent neuron, as in (4). The copied neuron values are used only for verification and do not affect the calculation of the next layer.

$$\begin{bmatrix} Y_1 & Y_2 & Y_3 & Y_4 \end{bmatrix} = \begin{bmatrix} \sum_i X_i W_{i1} & \sum_i X_i W_{i1} & \sum_i X_i W_{i3} & \sum_i X_i W_{i3} \end{bmatrix} \qquad (5)$$

Equation (5) is the result of the weighted sum with the proposed dropout process applied. Originally, $Y_2$ and $Y_4$ have a value of 0 due to dropout; however, through the proposed method, the $Y_2$ and $Y_4$ values are replaced by the $Y_1$ and $Y_3$ values, which come from the adjacent neurons.

From the weighted sum in the dropout process, we can see that idle neurons are inevitably generated, and the proposed approach finds a way to utilize these idle neurons. If the calculation device operates perfectly with no faults, the proposed approach makes no difference to the calculation result. However, if developers use GPUs as the calculation device, then due to the characteristics of GPUs the correct computation results may not be obtained. The application of the proposed equations to GPUs is described in the next section.

    Fig. 2. Proposed approach model
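The copy-and-compare idea can be sketched with NumPy (hypothetical inputs and weights; the dropped columns here are the second and fourth, matching the running four-neuron example):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])          # four inputs
W = np.arange(16, dtype=float).reshape(4, 4)  # 4x4 weight matrix
dropped = [1, 3]                              # dropped neuron columns

# Proposed dropout: each dropped column reuses its adjacent active
# neuron's weights, producing a duplicate output used only for checking.
W_prop = W.copy()
for j in dropped:
    W_prop[:, j] = W_prop[:, j - 1]

Y = X @ W_prop
# On fault-free hardware Y[1] == Y[0] and Y[3] == Y[2] by construction,
# so any mismatch between a pair signals a faulty computation unit.
fault_detected = not (Y[1] == Y[0] and Y[3] == Y[2])
```

Only the even-indexed outputs would be propagated to the next layer; the duplicates exist solely so the host can compare the pairs.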

As mentioned above, the technique is meaningless on a zero-defect computing device. However, GPUs are not zero-defect computing devices: during deep learning processing on GPUs, various kinds of faults can occur in the computation process. Fig. 3 shows the application of the conventional dropout method to GPU operation and the application of the proposed dropout method to GPU operation.

The proposed approach is possible because GPUs consist of sets of computation units called streaming processors. A group of streaming processors is called a streaming multiprocessor, as shown in Fig. 3b. An operational characteristic of GPUs is that the cores do not operate independently; the GPU operates on a streaming multiprocessor basis. From a developer's point of view, in the GPU development environment a so-called 'block' is assigned to each streaming multiprocessor. The developer can set the number of threads in the block, and the threads assigned to a block execute the same instruction on the streaming multiprocessor. This is the concept of SIMT, realized through the warp scheduler, which allocates threads to the streaming processors. Due to these operational characteristics, the neurons filtered out by dropout also participate in the weighted-sum calculation, with a value of zero. Since the developer can manage the memory area on the GPU and match the number of neurons in each layer with the number of threads assigned to a block, each streaming multiprocessor can represent a neural network layer and the proposed method can be applied. To make the sequence of the proposed approach easier to understand, Algorithm 1 explains the proposed dropout approach in the GPU environment in detail.

First, the computation data is allocated to the GPU memory area (Line 1). The computation threads start after allocation is finished (Line 2). The dropout process stochastically filters out the computation threads corresponding to dropped neurons (Line 3). The algorithm then determines whether idle threads are available (Line 4); in this sequence, idle threads are generated by the dropout process. If idle threads are available, the algorithm copies the work of adjacent computation threads and pastes it onto the idle threads (Lines 5-6). If the dropout rate is zero, so that there are no idle threads, the copy-and-paste process is skipped (Line 8). After the entire computation process is done, the compare process is started to ensure the reliable state of the GPU (Line 10). After the compare operation, the algorithm determines whether a fault has occurred: if a fault occurred during the computation process, the algorithm returns the fault-occurred signal; if not, it returns the fault-not-occurred signal (Lines 11-14).

As shown in Algorithm 1 and Fig. 3, the difference between the proposed approach and the general dropout process is the sequence of copying, pasting and comparing computation results. In the learning process, a time overhead is expected to occur during the copy-and-paste process.

    Fig. 3. Execution of dropout process in GPUs (a) Normal dropout execution in GPUs (b) Proposed approach execution in GPUs

Algorithm 1. Compare computation result sequence
    Input: idle threads, computation threads, computation data
    Output: fault occurred signal, fault not occurred signal
    1:  allocate (computation data);
    2:  start process (computation threads);
    3:  dropout (computation threads);
    4:  if (idle threads == 1) then
    5:      copy (computation threads);
    6:      paste (idle threads);
    7:  else
    8:      pass ();
    9:  end if
    10: compare (copy threads, computation threads);
    11: if (compare result == 0) then
    12:     return fault not occurred signal;
    13: else
    14:     return fault occurred signal;
    15: end if
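A host-side Python sketch of this sequence, with a simulated single-core fault injected to exercise the compare step (all helper names here are hypothetical, standing in for the GPU-side operations):

```python
import numpy as np

rng = np.random.default_rng(7)

def compute(x, w, fault_at=None):
    """Per-'thread' multiply; fault_at simulates one faulty core."""
    y = x * w
    if fault_at is not None:
        y[fault_at] += 1.0              # injected single-value error
    return y

def checked_step(x, w, idle_available, fault_at=None):
    primary = compute(x, w, fault_at=fault_at)
    if idle_available:                   # Lines 4-6: copy work onto idle threads
        shadow = compute(x, w)           # duplicate computation
        if np.array_equal(primary, shadow):  # Lines 10-14: compare results
            return "fault not occurred"
        return "fault occurred"
    return "unchecked"                   # Line 8: no idle threads, pass

x = rng.normal(size=8)
w = rng.normal(size=8)
```

For example, `checked_step(x, w, True)` reports no fault, while injecting an error with `fault_at=3` makes the duplicated computation disagree and flags a fault.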


IV. EXPERIMENTAL EVALUATION

    To evaluate the proposed approach, the hardware and software conditions are as follows: Intel i5 3.2 GHz CPU, 8 GB DDR4 RAM, NVIDIA GeForce GTX 1080 GPU with the CUDA 9.2 toolkit [11] and TensorFlow [12] running on Windows 10. In the evaluation, MNIST [13] is used. The number of neurons is 784 in the input layer, 256 in each hidden layer, and 10 in the output layer, which is suitable for MNIST. We validate that the proposed approach is an effective technique for improving the reliability of GPUs.

The MNIST dataset consists of 60,000 training and 10,000 test samples, where each sample is a 28×28-pixel handwritten digit image. The task is to classify each image into one of ten digit classes. To evaluate the proposed approach, we vary the number of layers, control the number of neurons in a layer and measure the fault coverage as the dropout rate changes.

This experiment shows that the proposed approach does not differ much from the general dropout method in deep learning training time and is a useful technique for securing GPU reliability. Section A shows the comparison of the proposed method with the conventional dropout method and Section B shows the actual fault coverage of the proposed scheme.

In this section, we discuss the efficiency of the proposed approach by comparing the learning time, error rate and accuracy in the conventional MLP and proposed MLP environments.

For the measurement of the entire learning time, the number of epochs is set to 25, and the per-epoch times are summed over all epochs to derive the total learning time. As shown in Fig. 4a, the average learning time is 51.56 seconds in the conventional MLP environment. The learning time in the proposed scheme is 71.30 seconds, as shown in Fig. 5a. Between the two environments, the difference in average learning time is 19.74 seconds and the time overhead is about 38.2%. The experimental results on learning time show that selecting an appropriate dropout rate has the effect of saving learning time.
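The reported overhead follows directly from the two averages; a quick arithmetic restatement (no new data):

```python
# Learning-time overhead of the proposed scheme over the conventional MLP.
conventional = 51.56   # average total learning time (s), conventional MLP
proposed = 71.30       # average total learning time (s), proposed MLP

difference = proposed - conventional            # absolute slowdown in seconds
overhead = difference / conventional * 100.0    # relative overhead in percent
```

The relative overhead of roughly 38% is the cost of the extra copy, paste and compare steps.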

Fig. 4b and Fig. 5b show that the softmax cross entropy [14] increases as the dropout rate increases, in both the conventional MLP and the proposed MLP dropout environment. Softmax cross entropy measures the difference between the results obtained through the learning process and the actual results; it is closely related to accuracy, and it is used here because the degree of overfitting cannot be confirmed by accuracy alone. It can be seen that entropy increases as the number of learning neurons decreases due to the dropout process. The average cross-entropy value is 0.158 in the conventional MLP and 0.106 in the proposed MLP. Since the difference in softmax cross entropy between the conventional MLP and the proposed MLP is not large, it can be confirmed that overfitting does not occur when the proposed approach is applied.

    Fig. 4. Conventional MLP learning data graph in MNIST (a) Learning time (b) Learning cost (c) Accuracy

    Fig. 5. Proposed MLP learning data graph in MNIST (a) Learning time (b) Learning cost (c) Accuracy

The measurement of accuracy in the conventional MLP and the proposed MLP is conducted by comparison of the training set and the validation set. As explained above, overfitting due to excessive learning leads to reduced accuracy. In Fig. 4c and Fig. 5c, the accuracy according to the dropout rate can be observed for each hidden layer. As the dropout rate increases, the accuracy decreases. The average accuracy is 95.35% in the conventional MLP environment and 95.19% in the proposed MLP; the difference in accuracy between the two environments is 0.16%.

B. Fault coverage

    In this section, we verify that the fault coverage is 100% when the dropout rate is set to 50%, and we check the fault coverage when the dropout rate is set to other values. When the dropout rate exceeds 50%, the number of idle neurons that do not participate in the computation is more than half, so all active neurons in the learning process can be verified through these idle neurons. When the dropout rate is below 50%, the number of idle neurons is less than half; due to the lack of verifying neurons, the fault coverage is reduced according to the ratio of idle neurons to active neurons. Fig. 6 is a graph of the fault coverage obtained with different dropout rates. In the proposed MLP environment, it can be observed that the fault coverage decreases as the dropout rate decreases, as described above.
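The described relationship can be sketched as a simple model. This is an assumed reading of the text (coverage as the ratio of idle, verifying neurons to active neurons, capped at 100%), not the paper's exact formula:

```python
# Assumed fault-coverage model: with dropout rate p, a fraction p of the
# neurons is idle and can verify the 1-p active neurons, so coverage
# saturates at 100% once p >= 0.5.

def fault_coverage(p):
    idle, active = p, 1.0 - p
    return min(1.0, idle / active) if active > 0 else 1.0

rates = [0.1, 0.3, 0.5, 0.7]
coverage = [round(fault_coverage(p), 3) for p in rates]
```

Under this model the coverage climbs from about 11% at a 0.1 dropout rate to 100% at 0.5 and above, matching the trend described for Fig. 6.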

V. CONCLUSION

    In this paper, we propose a reliability enhancement approach that supports reliable results in a multi-layer perceptron neural network by using dropout characteristics. The experimental results show a 0.16% difference in accuracy rate and validate that the proposed approach is effective in identifying GPU calculation errors in the deep neural network. Through the proposed approach, developers can identify the problems behind abnormal execution in the learning process of the deep neural network.

Several companies and research institutions are using not only GPUs but also ASICs with many-core architectures specialized for machine learning. The proposed method can also be applied to such many-core architectures. The remaining future works are to find further reliability enhancement methods for neural networks and to apply the proposed approach to deep neural networks in cluster computing environments.

ACKNOWLEDGMENT

    This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. 2016-0-00140, Development of Application Program Optimization Tools for High Performance Computing Systems).

    REFERENCES

    [1] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85–117, 2015

    [2] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-Column Deep Neural Networks for Image Classification,” Technical Report, arXiv:1202.2745, 2012.

    [3] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.

[4] T. S. Guzella and W. M. Caminhas, “A review of machine learning approaches to Spam filtering,” Expert Systems with Applications, vol. 36, pp. 10206–10222, 2009.

[5] NVIDIA, “cuDNN: GPU Accelerated Deep Learning,” 2018.

    [6] R. R. Salakhutdinov and G. E. Hinton, “Replicated Softmax: An Undirected Topic Model,” Proc. Advances in Neural Information Processing Systems Conf., vol. 22, 2010.

    [7] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” Proc. Conf. Artificial Intelligence and Statistics, 2011.

    [8] D. M. Hawkins, "The problem of overfitting," Journal of chemical information and computer sciences, vol. 44, pp. 1-12, 2004.

    [9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014

    [10] T. D. Han and T. S. Abdelrahman, “Reducing Branch Divergence in GPU Programs,” in GPGPU, 2011.

[11] NVIDIA, “NVIDIA CUDA Programming Guide,” 2018.

    [12] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.

[13] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 1998.

    [14] N. A. C. D. Campbell and R. A. Dunne, “On the pairing of the softmax activation and cross entropy penalty functions and the derivation of the softmax activation function,” in Proc. 8th Austral. Conf. Neural Netw., 1997, pp. 181–185.

    Fig. 6. Fault coverage in the proposed MLP environment
