
ANSCSE15, Bangkok University, Thailand
March 30-April 2, 2011

Improving Bayesian Computational Time and Scalability with GPGPU

T. Pechprasarn* and N. Khiripet

Knowledge Elicitation and Archiving Laboratory, National Electronics and Computer Technology Center, Pathumthani, 12120, Thailand

*Corresponding author. E-mail: [email protected]; Fax: 02-5646772; Tel: 02-5646900 ext. 2220

ABSTRACT

It is almost impossible to find the posterior probability in Bayesian inference analytically, because closed-form antiderivatives are rarely available. Instead, an approximation method such as Monte Carlo integration (MCI) is used to calculate the integral. MCI involves a random process that generates samples corresponding to the target distribution. In general, a larger number of samples yields a more accurate result; however, it also requires more computational time. To obtain higher performance, NVidia CUDA can accelerate the computation by leveraging a parallel programming pattern called parallel reduction. Although the speed-up we achieved previously is reasonable, it can still be improved. Beyond running time, scalability is another issue to address. In this paper, in order to improve performance, we further optimize our parallel programs by introducing several optimization techniques and also cope with the problem of scalability. Our optimization methods include loop unrolling and enhancing the compact kernel. To improve scalability, we utilize the multidimensional feature of CUDA by using 2D blocks instead of 1D blocks. The results show that the computation time is substantially decreased and that the program can handle much larger problem sizes even when a small block size is used. We conclude our work by identifying proper block sizes for certain problem sizes.

Keywords: Bayesian probability, Monte Carlo integration, Parallel reduction, GPU computing, CUDA.

1. INTRODUCTION

In Bayesian probability, one is often interested in finding the posterior distribution to test a hypothesis given observed training data. However, solving for the posterior is a challenging task, because the posterior typically takes the form of integrals, and closed-form solutions for such integrals are usually not available [3]. Instead, an approximation method such as Monte Carlo integration (MCI) is used to find the integrated value. MCI involves a random process that generates samples from the target distribution; the contribution of each sample to the final integrated value is then calculated. In general, using a larger number of samples yields more accurate results. Nevertheless, when the sample size is large, the computation becomes much slower. Therefore, we speed up the computation of MCI with GPUs. We implement a parallel program using the Compute Unified Device Architecture (CUDA), a leading framework for programming GPUs. Given a set of samples, our work focuses on the core integration part, which amounts to summing the contribution of each sample. We employ a parallel pattern called parallel reduction for finding this summation. Parallel reduction is suitable for our CUDA programs, as it allows many parts of the calculation to be done in parallel [5]. In our previous work [7], the experimental results indicated a substantial performance gain, as the running time decreased considerably.

Although our previous work was successful to some extent, it can still be improved in many aspects. For example, the computational time can be further reduced by introducing optimization methods such as loop unrolling. There is also an issue of scalability: we could not use smaller block sizes for larger problem sizes. This scalability issue is important because it prevents us from determining the effect and the performance of smaller block sizes; solving it therefore reveals additional search space for finding the optimal running time. To solve the scalability problem, we utilize the multidimensional feature of CUDA and divide the samples into 2D blocks instead of 1D blocks. This lets us use smaller block sizes, such as 128, for larger problem sizes. In this paper, we present our work on reducing the running time with chosen optimization techniques and on coping with the problem of scalability. In addition, a real-world example of a Bayesian application is provided. According to our experiments, our parallel programs perform much better than the sequential implementation; for example, the maximum speed-up obtained is 53.49 times the sequential code.

2. THEORY AND RELATED WORKS

2.1 Bayesian Probability

According to [1], Bayes' rule is defined as

\[ P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \tag{1} \]

where
D = observed data
\(\theta\) = the hypothesis, defined by the parameter \(\theta\)
\(P(\theta \mid D)\) = posterior of \(\theta\) given D
\(P(D \mid \theta)\) = likelihood of \(\theta\)
\(P(\theta)\) = prior probability of \(\theta\)
\(P(D)\) = probability of D

The posterior is often of interest because it is an inverse probability, as opposed to the direct probability of classical statistics. The posterior is used to infer the causes given observed data. Given the data, a model called the likelihood can be constructed, while the prior distribution expresses general knowledge about the data. Next, to decide whether the hypothesis is accepted or rejected, the expected value of the posterior has to be computed; the posterior expectation has to fall within the 95% region of the prior distribution.

According to [2], an expected value gives the average outcome of a function in the long run and is defined as

\[ E[g(x)] = \int g(x)\,P(x)\,dx \tag{2} \]

where
\(P(x)\) = probability density function

The expectation of the posterior is \(E[\theta \mid D]\). According to (2),

\[ E[\theta \mid D] = \int \theta\,P(\theta \mid D)\,d\theta \]

According to (1),

\[ E[\theta \mid D] = \int \theta\,\frac{P(D \mid \theta)\,P(\theta)}{P(D)}\,d\theta \]


Since \(P(D)\) does not depend on \(\theta\), this becomes

\[ E[\theta \mid D] = \frac{\int \theta\,P(D \mid \theta)\,P(\theta)\,d\theta}{P(D)} \tag{3} \]

where
\(P(D)\) = a constant value, \(P(D) = \int P(D \mid \theta)\,P(\theta)\,d\theta\)

2.2 Monte Carlo Integration (MCI)

Involving a random process, MCI is an integration method for finding the value of a definite integral [4]. A general form of such an integral is

\[ I = \int_a^b f(x)\,dx \tag{4} \]

We can divide and multiply f(x) by P(x), and (4) becomes

\[ I = \int_a^b \frac{f(x)}{P(x)}\,P(x)\,dx \tag{5} \]

where
\(P(x)\) = probability density function on the interval [a, b]

According to (2), we have

\[ I = E\!\left[\frac{f(x)}{P(x)}\right] \tag{6} \]

But the expected value can be estimated as

\[ E[g(x)] \approx \frac{1}{N}\sum_{i=1}^{N} g(x_i) \tag{7} \]

From (6) and (7), MCI is defined as

\[ \int_a^b f(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \frac{f(x_i)}{P(x_i)} \tag{8} \]

where
\(P(x)\) = a sampling distribution
N = the number of samples

There are two major steps in MCI. The first is to generate a set of samples from the sampling distribution. Then, the contributions from all samples are summed to find the integrated value. A small host-side sketch is given below.
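To make the two steps concrete, the following is a minimal host-side sketch of MCI as in (8), assuming a uniform sampling distribution P(x) = 1/(b - a); the integrand f(x) = x^2 on [0, 1] is chosen purely for illustration (the exact value is 1/3).

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int N = 1000000;              /* number of samples */
    const double a = 0.0, b = 1.0;      /* integration interval */
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        /* step 1: draw a sample from the uniform P(x) on [a, b] */
        double x = a + (b - a) * (rand() / (double)RAND_MAX);
        /* step 2: accumulate f(x) / P(x), with P(x) = 1 / (b - a) */
        sum += (x * x) * (b - a);
    }
    printf("MCI estimate: %f (exact: %f)\n", sum / N, 1.0 / 3.0);
    return 0;
}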

2.3 Parallel Reduction

Parallel reduction is a common pattern for reducing a set of numbers into a single value [5]. The structure of parallel reduction is shown in Figure 1. With its tree-based structure, there are log2(N) tree levels. All operations at the same level can be done in parallel, but the next level has to wait until its operands from the previous level are ready. We employ the parallel reduction pattern in the second step of MCI, which finds the summation of the contributions from the samples. A minimal CUDA sketch of this pattern is given after Figure 1.


    Figure 1. Structure of parallel reduction.
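As a concrete rendering of this pattern, here is a minimal CUDA sketch of a tree-based reduce kernel; this is a sketch under the assumption that the block size is a power of two, and the names kernel_reduce and s_data mirror the pseudocode of Figure 5.

__global__ void kernel_reduce(const float *in, float *out, int n) {
    extern __shared__ float s_data[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[tid] = (i < n) ? in[i] : 0.0f;   /* load one sample per thread */
    __syncthreads();
    /* one loop iteration per tree level: log2(blockDim.x) levels in total */
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            s_data[tid] += s_data[tid + s];
        __syncthreads();   /* wait until the whole level is finished */
    }
    if (tid == 0)
        out[blockIdx.x] = s_data[0];        /* one partial sum per block */
}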

2.4 Compute Unified Device Architecture (CUDA)

According to [6], CUDA is a general-purpose parallel computing architecture that comes with a programming model and a new instruction set architecture. The architecture consists of GPUs containing streaming multiprocessors, each of which contains CUDA cores. CUDA exploits parallelism via blocks of threads. Blocks are executed independently by the CUDA cores, so more than one block can be executed in parallel, depending on the available CUDA core resources. This allows CUDA programs to scale up automatically by simply running more blocks. A kernel is a function to be executed on the GPU; to launch a kernel, the CPU caller has to specify both the number of blocks and the number of threads per block.
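For example, a launch of the reduce kernel sketched in Section 2.3 might look as follows; d_in, d_out, and num_samples are assumed to be a device input array, a device output array, and the sample count.

int block_size = 256;                                  /* threads per block */
int num_blocks = (num_samples + block_size - 1) / block_size;
size_t shmem = block_size * sizeof(float);             /* shared memory per block */
kernel_reduce<<<num_blocks, block_size, shmem>>>(d_in, d_out, num_samples);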

3. IMPLEMENTATION DETAILS

1. Bayesian Application

We extend our previous work by introducing a Bayesian application that calculates the expectation of the posterior, i.e., the result of (3). Using MCI, the first step is to generate a set of samples according to the prior distribution; this random number generation is done on the CPU. Next, using the generated samples, the expected value can be calculated with parallel reduction according to (7), and this reduction is accelerated on the GPU. After obtaining the expectation of the posterior, hypothesis testing is performed by checking whether the probability falls within the 95% region of the prior. An overview of the implementation is shown in Figure 2.
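Concretely, since the samples \(\theta_i\) are drawn from the prior \(P(\theta)\), both the numerator and the denominator of (3) are expectations over the prior, and applying the estimator (7) to each gives the ratio computed in Figure 2:

\[ E[\theta \mid D] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} \theta_i\,P(D \mid \theta_i)}{\frac{1}{N}\sum_{i=1}^{N} P(D \mid \theta_i)} = \frac{\sum_i \theta_i\,\mathrm{lhd}(\theta_i)}{\sum_i \mathrm{lhd}(\theta_i)}, \qquad \theta_i \sim P(\theta) \]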

(* Calculate the expectation of the posterior using MCI *)
SET samples to Sampling(Normal(5, 0.5), N)
SET numerator to Reduce(f, samples, N) using f(x) = x * lhd(x)
SET denominator to Reduce(f, samples, N) using f(x) = lhd(x)
SET expected_value to numerator / denominator

(* Hypothesis testing *)
SET pH0 to Test(expected_value, Normal(5, 0.5))
RETURN pH0 < 0.95

Figure 2. Pseudocode of the Bayesian application.
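A hedged CUDA sketch of the weighting that feeds the two reductions in Figure 2 follows; kernel_weigh, d_data, num_w, and den_w are illustrative names, and the likelihood is the normal model N(x | theta, 0.04) of Section 4, whose normalizing constants cancel between the numerator and the denominator of (3).

__constant__ float d_data[23];   /* the 23 observations (see Section 4) */

/* likelihood of theta: product of N(x_i | theta, 0.04) over the data */
__device__ float lhd(float theta) {
    float p = 1.0f;
    for (int i = 0; i < 23; ++i) {
        float d = d_data[i] - theta;
        p *= expf(-d * d / (2.0f * 0.04f));
    }
    return p;
}

/* weigh each prior sample; num_w and den_w are then each summed with the
   reduce kernel, and the expectation is sum(num_w) / sum(den_w) */
__global__ void kernel_weigh(const float *samples, float *num_w,
                             float *den_w, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float w = lhd(samples[i]);
        num_w[i] = samples[i] * w;
        den_w[i] = w;
    }
}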

2. Solving the Scalability

In order to solve the scalability issue found in our previous work [7], we employ the multidimensional feature of CUDA. Our CUDA programs divide an array of samples into smaller blocks, which are later reduced to find the integrated value. Figure 3 illustrates our idea of transforming 1D blocks into 2D blocks. With 1D blocks, the maximum number of blocks we can use is 65535 x 1 x 1 = 65535. On the other hand, if 2D blocks are used, the maximum number of blocks becomes 65535 x 65535 x 1 = 4294836225. This number is large enough to utilize the CUDA core resources and also allows us to scale to larger problem sizes. Theoretically, the maximum problem size for a certain block size is the number of blocks multiplied by the block size. In practice, the physical limit of GPU memory blocks very large problem sizes first; for example, the 1.5 GB of memory on the GTX 580 used in Section 4 holds only roughly 400 million single-precision samples.

[Figure 3 diagram: the 1D sample array is divided into rows row0, row1, ..., each of width row_size.]

Figure 3. Transformation from 1D blocks to 2D blocks.

(* 1D block representation *)
SET num_blocks to num_samples / block_size
SET block.x to num_blocks
SET block.y to 1

(* 2D block representation *)
SET num_blocks to num_samples / block_size
SET num_rows to num_blocks / row_size
SET block.x to min(num_blocks, row_size)
SET block.y to num_rows
(* # CUDA blocks = min(block.x, 65535) x min(block.y, 65535) *)

Figure 4. Implementation of 2D blocks.

Figure 4 shows our implementation with one required parameter, the row size. This parameter can be tuned to fit certain problem sizes. If a larger row size is used, there may be a lot of wasted computation in the last row. On the other hand, with smaller rows there is much less opportunity for wasted computation; however, as the number of rows grows, it might hit the limit of 65535. Future work can provide a deeper analysis of this trade-off. A CUDA sketch of this mapping is given below.
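In CUDA, this mapping might be realized as in the following sketch (variable names are illustrative); the kernel flattens the 2D block index back to a linear block id before computing each thread's sample index.

/* device side: flatten blockIdx to a linear block id (row-major order) */
__global__ void kernel_reduce(const float *in, float *out, int n) {
    unsigned int block_id = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int i = block_id * blockDim.x + threadIdx.x;   /* sample index */
    /* tree reduction over shared memory, as in Section 2.3 */
}

/* host side: build the 2D grid of Figure 4; row_size is the tunable parameter */
int num_blocks = num_samples / block_size;
int num_rows = (num_blocks + row_size - 1) / row_size;      /* ceiling division */
int grid_x = (num_blocks < row_size) ? num_blocks : row_size;
dim3 grid(grid_x, num_rows);
kernel_reduce<<<grid, block_size, block_size * sizeof(float)>>>(d_in, d_out, num_samples);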

3. Performance

3.1) Loop unrolling

We employ loop unrolling in our parallel reduction code. The major advantage of the loop unrolling technique is that there is no need to check the loop condition on every iteration. We unroll the last six iterations, since at that point the number of active threads is guaranteed to fit within a warp. By doing


this, we gain an extra benefit: unnecessary, expensive synchronization instructions can be removed. Threads within the same warp do not require a synchronization point, as they always execute the same instruction. The idea is shown in Figure 5.

(* parallel reduction in the reduce kernel *)
FOR s from block_size/2 down to 64, halving s each iteration
    Sync threads (* make sure that all threads are working on the same level of the tree *)
    IF threadId is less than s THEN
        Add s_data[threadId + s] to s_data[threadId]
    END IF
END FOR

(* loop unrolling: only the last warp remains active *)
IF threadId is less than 32 THEN (* CUDA warp size is 32 *)
    Add s_data[threadId + 32] to s_data[threadId]
    Add s_data[threadId + 16] to s_data[threadId]
    Add s_data[threadId + 8] to s_data[threadId]
    Add s_data[threadId + 4] to s_data[threadId]
    Add s_data[threadId + 2] to s_data[threadId]
    Add s_data[threadId + 1] to s_data[threadId]
END IF

Figure 5. Loop unrolling in parallel reduction.
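In CUDA, the unrolled tail might read as the following sketch; the volatile pointer keeps the compiler from caching shared-memory values in registers once __syncthreads() is removed.

/* tree reduction down to the last warp */
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (threadIdx.x < s)
        s_data[threadIdx.x] += s_data[threadIdx.x + s];
    __syncthreads();   /* the whole level must finish before the next starts */
}
/* the last 32 threads form one warp and execute in lockstep: no sync needed */
if (threadIdx.x < 32) {
    volatile float *v = s_data;
    v[threadIdx.x] += v[threadIdx.x + 32];
    v[threadIdx.x] += v[threadIdx.x + 16];
    v[threadIdx.x] += v[threadIdx.x + 8];
    v[threadIdx.x] += v[threadIdx.x + 4];
    v[threadIdx.x] += v[threadIdx.x + 2];
    v[threadIdx.x] += v[threadIdx.x + 1];
}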

3.2) Enhancing the compact kernel

The compact kernel gathers the reduced values from all CUDA blocks and forms a new array, which is sent to the reduce kernel again until only one block is left. In our previous work, to keep the programming simple, we used only a single thread per block and let each block do the compact job, which may not utilize the CUDA resources well. Although this is not core code, tuning this part also yields a performance improvement. Figure 6 shows our modification.

    Figure 6. Enhancing the compact kernel.

Another parameter appears here: the number of threads for the compact kernel. We adjust this number according to the problem size. For example, we use 128 threads if the sample size is less than 8,388,480, 512 threads if the sample size is larger than 16,776,960, and 256 threads if the size is in between. A sketch is given below.
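A hedged sketch of the enhanced compact kernel and of the thread-count selection follows; kernel_compact and the variable names are illustrative.

/* gather per-block partial sums into a dense array for the next pass,
   using many threads per block instead of a single one */
__global__ void kernel_compact(const float *block_sums, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = block_sums[i];
}

/* host side: choose the thread count by problem size, as described above */
int threads = (num_samples < 8388480) ? 128
            : (num_samples > 16776960) ? 512 : 256;
int blocks = (num_blocks + threads - 1) / threads;
kernel_compact<<<blocks, threads>>>(d_block_sums, d_compacted, num_blocks);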

4. EXPERIMENTS AND RESULTS

1. Platforms

We use an NVidia GeForce GTX 580 as our GPU platform. On the CPU side, we have an Intel Core i7. The detailed specifications are shown in Table 1.



Table 1. Specification of CPUs and GPUs.

Description              CPU              GPU
Model                    Intel Core i7    NVidia GeForce GTX 580
Clock frequency (GHz)    2.8              1.56
# processors             2                16
# cores per processor    4                32
# total cores            8                512

2. Datasets

Cavendish's data [8] are used in our experiments. The data represent measurements of the specific density of the earth. The 23 experimental values are: 5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68 and 5.85. According to [9], the corresponding model (the likelihood) is also a normal distribution, N(x | theta, 0.04). The prior is chosen to be normal with mean = 5 and variance = 0.5.

3. Results

Our computed posterior expectation is 5.483, which matches the result from [9]. We find that the computed probability falls within the 95% region (0.75 < 0.95), so the hypothesis is accepted.
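As a quick check, the normal prior and normal likelihood are conjugate, so the posterior mean has a closed form; with prior mean \(\mu_0 = 5\), prior variance \(\tau_0^2 = 0.5\), model variance \(\sigma^2 = 0.04\), and the sample mean \(\bar{x} \approx 5.485\) of the n = 23 observations,

\[ E[\theta \mid D] = \frac{\mu_0/\tau_0^2 + n\bar{x}/\sigma^2}{1/\tau_0^2 + n/\sigma^2} = \frac{5/0.5 + 23 \times 5.485/0.04}{1/0.5 + 23/0.04} \approx 5.483 \]

which agrees with the computed value.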

    Figure 7. Results from our Bayesian application.

Figure 7 shows example results from our application. In addition to the answer, the running time is provided for both the CPU and GPU versions, with details for both the posterior expectation calculation and the hypothesis testing. In terms of performance, we expect an improvement in the posterior calculation, since this part is parallelized with CUDA. On the other hand, we should obtain similar running times for the hypothesis testing, since there is no GPU involvement in that part.

3.1) Running Time

According to the experiments with our Bayesian application, the results show that the GPU program takes less time than the CPU implementation. The logarithmic chart in Figure 8 illustrates the comparison.


Figure 8. Running time of CPU and GPU (the whole application).

Because there are two main parts in the application, 1) the posterior expectation calculation and 2) the hypothesis testing, we provide the details of the running time for each part. The posterior expectation part is shown in Figure 9.

Figure 9. Running time of CPU and GPU (the posterior expectation).

The two charts, Figures 8 and 9, reveal a similar trend: the GPU implementation is faster than the CPU one. Next, for the running time of the hypothesis testing, this portion of the code has no GPU involvement, so there is no timing difference between CPU and GPU. However, it is still useful to see how this part scales with the problem size. According to Figure 10, the testing part scales linearly.



    Figure 10. Running time of the portion of testing the hypothesis.

Next, we return to the posterior expectation calculation. It is interesting to see how each optimization strategy performs on the GPU side. Figure 11 therefore details the running time of the GPU programs under four configurations: 1) no extra optimization, 2) enhancing the compact kernel, 3) loop unrolling, and 4) optimizations (2) and (3) combined.

    Figure 11. Effect of optimization methods in GPU programs.

However, the chart shows little difference in running time among the methods. We attribute this to the evaluation of a complex function, the likelihood, on the GPU during the parallel reduction step of MCI. Although many threads evaluate the function in parallel, the elapsed time of this calculation within a single thread dominates the whole reduction. Because optimization techniques such as enhancing the compact kernel and loop unrolling focus on



the core reduction part, their improvement becomes very small compared to the time used by the function evaluation. Therefore, there is little difference between the optimization techniques.

3.2) Scalability

We show the results after solving the scalability problem in Table 2. Notice that every block size, even 128, can now be used with all problem sizes; this was not possible in our previous work.

Table 2. Running time (in seconds) of GPU programs with different block sizes (columns).

Problem Size     128      256      512      1024     2048     4096
65,535           0.011    0.011    0.011    0.011    0.011    0.011
131,070          0.021    0.021    0.021    0.021    0.021    0.021
262,140          0.041    0.041    0.040    0.047    0.040    0.040
524,280          0.080    0.080    0.081    0.080    0.080    0.080
1,048,560        0.159    0.159    0.165    0.166    0.159    0.158
2,097,120        0.317    0.316    0.316    0.316    0.316    0.316
4,194,240        0.631    0.638    0.652    0.631    0.638    0.631
8,388,480        1.261    1.261    1.264    1.261    1.261    1.262
16,776,960       2.523    2.529    2.522    2.522    2.525    2.524
33,553,920       5.076    5.042    5.117    5.042    5.041    5.045
67,107,840       10.368   10.087   10.082   10.082   10.085   10.084
134,215,680      20.516   20.502   20.786   20.161   20.155   20.516
268,431,360      40.332   40.311   41.300   40.313   40.316   40.329

Table 2 shows no difference in the running time of the GPU programs when varying the block size. Again, this is likely because most of the time is not spent in the core parallel reduction code, so the effect of different block sizes cannot be seen.

3.3) Speed-up

We calculate the speed-up of the GPU programs for both the whole application and the posterior calculation portion. The speed-ups are shown in Table 3.

Table 3. Speed-ups of GPU programs.

Problem Size     Whole Application    Posterior Expectation
65,535           49.97                84.25
131,070          50.17                88.58
262,140          52.27                91.59
524,280          52.83                93.48
1,048,560        53.22                94.61
2,097,120        53.29                94.95
4,194,240        53.37                95.21
8,388,480        53.45                95.41
16,776,960       53.44                95.52
33,553,920       53.48                95.54
67,107,840       53.49                95.60
134,215,680      53.49                95.56
268,431,360      52.21                95.58


The maximum speed-up obtained for the whole application is 53.49 times the sequential code. For the posterior expectation portion alone, the maximum speed-up is 95.60 times.

5. CONCLUSION

We illustrate a real-world application of Bayesian probability for hypothesis testing, which requires computing the posterior expectation. The implementation shows that our method finds this posterior expectation accurately. We also present an enhancement of our previous work by further optimizing our CUDA programs and handling the scalability issue. Our results show that the parallel programs perform better than the CPU program, taking much less time to execute. In our experiments, we show that small block sizes can now handle large problem sizes, which is essential since it opens up more of the solution space. The maximum speed-up identified in our experiments is 53.49 times the sequential code. Future work will focus on a full GPU implementation that also generates the random numbers on the GPU, and on the cost of evaluating the function in the parallel reduction step, so that the effects of optimization and block size can be seen.

REFERENCES

1. Bayes, T., and Price, R., "An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S.", Philosophical Transactions of the Royal Society of London, 53, 1763, 370-418.
2. Ross, S., "2.4 Expectation of a random variable", Introduction to Probability Models (9th ed.), Academic Press, 2007, p. 38.
3. Tierney, L., and Kadane, J., "Accurate Approximations for Posterior Moments and Marginal Densities", Journal of the American Statistical Association, 81, 1986, 82-86.
4. Caflisch, R., "Monte Carlo and quasi-Monte Carlo methods", Acta Numerica, vol. 7, Cambridge University Press, 1998, pp. 1-49.
5. Harris, M., "Mapping computational concepts to GPUs", in: M. Pharr (ed.), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, chap. 31, Addison-Wesley, 2005, pp. 493-508.
6. NVIDIA CUDA C Programming Guide, Version 3.2, 2010.
7. Pechprasarn, T., and Khiripet, N., "Accelerating Bayesian Computation with Parallel Reduction using CUDA", The 4th Mahasarakham International Workshop on Artificial Intelligence (MIWAI), 2010, pp. 40-45.
8. Cavendish, H., "Experiments to Determine the Density of the Earth", in: MacKenzie, A. S. (ed.), Scientific Memoirs Vol. 9: The Laws of Gravitation, American Book Co., 1900, pp. 59-105.
9. Piche, R., "Normal Data", note 2 of the Bayesian statistics course, Tampere University of Technology, 2009.