
A Study of Java Virtual Machine Scalability Issues on SMP Systems

Zhongbo Cao, Wei Huang, and J. Morris Chang
Department of Electrical and Computer Engineering

Iowa State University
Ames, Iowa 50011

{jzbcao, huangwei, morris}@iastate.edu

Abstract— This paper studies the scalability issues of the Java Virtual Machine (JVM) on Symmetrical Multiprocessing (SMP) systems. Using a cycle-accurate simulator, we evaluate the performance scaling of multithreaded Java benchmarks with the number of processors and application threads. By correlating low-level hardware performance data to two high-level software constructs, thread types and memory regions, we present a detailed performance analysis and study the potential impacts of memory system latencies and resource contentions on scalability.

Several key findings are revealed through this study. First, among the memory access latency components, the primary portion of memory stalls is produced by L2 cache misses and cache-to-cache transfers. Second, among the memory regions, the Java heap space produces the most memory stalls. Additionally, a large majority of memory stalls occur in application threads, as opposed to other JVM threads. Furthermore, we find that increasing the number of processors or application threads, independently of each other, leads to increases in the L2 cache miss ratio and the cache-to-cache transfer ratio. This problem can be alleviated by using a thread-local heap or allocation buffer, which can improve L2 cache performance. For certain benchmarks, such as RayTracer, cache-to-cache transfers, dominated mainly by false sharing, can be significantly reduced. Our experiments also show that a thread-local allocation buffer with a size between 16KB and 256KB often leads to optimal performance.

I. INTRODUCTION

In recent years, Symmetrical Multiprocessing (SMP) has become increasingly popular as a scalable parallel computing platform. This popularity is mainly attributed to two factors: SMP-capable processors and operating systems. Many modern processors, such as Intel's Xeon, Sun's UltraSparc, and IBM's PowerPC, have built-in SMP support. It is now feasible for end-users to build SMP systems at a low cost. Similarly, operating systems (OS), especially Microsoft Windows and Linux, also support SMP. These two factors drive the adoption of SMP in various computing environments.

Java is emerging as a competitive paradigm for software development. Designed with many advanced features, such as automatic memory management (i.e., garbage collection), enforced security checks, and cross-platform portability, Java has become a popular programming language used on various platforms. Because of its built-in multithreading support, Java has been widely used to develop multithreaded programs for server platforms (such as SMP systems).

The design of Java threads in the Java Virtual Machine (JVM) has been improved over the years to satisfy the requirements of higher performance and better scalability. The Java specifications offer the flexibility to implement Java threads in several alternative ways. These implementations can be generalized as an m-to-n threading model, in which m Java user-level threads are mapped to n native kernel threads. The original implementation, an m-to-1 threading model, is known as green threads. This threading model requires a user-level scheduler to control the execution of user-level Java threads. While the green threads approach is efficient on uniprocessor systems, it does not scale well on SMP systems because only one processor is used for actual computation. Nowadays, most state-of-the-art JVMs take advantage of native kernel thread support from the OS, since the OS thread scheduler scales better on SMP systems. For instance, Sun JDK 1.4.2 uses the 1-to-1 threading model. Though the latest design of Java threads improves JVM performance with operating system support, many other issues, such as memory system performance, can still potentially prevent Java threads from scaling well on SMP systems.
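Under the 1-to-1 model, each java.lang.Thread is backed by one native kernel thread, so the OS scheduler can run Java threads on different processors in parallel. The following sketch is illustrative only (the class name and summation workload are ours, not part of any benchmark):

```java
public class OneToOneDemo {
    // Each Java Thread created here is backed by one native kernel
    // thread under the 1-to-1 model (e.g. Sun JDK 1.4.2 with NPTL),
    // so the OS can schedule the workers on different processors.
    static long parallelSum(int nThreads) throws InterruptedException {
        final long[] partial = new long[nThreads];
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    long sum = 0;                       // independent, thread-local work
                    for (int j = 1; j <= 1000; j++) sum += j;
                    partial[id] = sum;
                }
            });
            workers[i].start();                         // spawns one kernel thread
        }
        long total = 0;
        for (Thread t : workers) t.join();              // wait for all workers
        for (long p : partial) total += p;
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parallelSum(4));             // prints 2002000 (4 * 500500)
    }
}
```

Because the per-thread work touches only thread-local state, this sketch scales with the processor count; the contention and sharing effects studied later arise when the threads instead touch shared objects.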

The goal of this paper is to study the scalability issues of multithreaded Java applications on SMP systems, to identify potential performance bottlenecks, and to develop recommendations for programmers and compiler writers for performance optimization. Specifically, this paper makes the following contributions:

• It presents a comprehensive performance characterization of several well-known multithreaded Java benchmarks on SMP systems. It evaluates the benchmark performance and examines the benchmark scalability by varying the number of processors and application threads.

• It presents a thorough analysis by breaking down the performance data based on the types of threads, the memory system latency components, and/or the memory regions. This allows us to identify the potential performance bottlenecks and correlate them to their sources.

• It provides insights into understanding the key sources of two major bottlenecks: memory system latencies and resource contentions. It also demonstrates and examines optimization techniques for minimizing such performance

119 0-7803-9461-5/05/$20.00 ©2005 IEEE


impacts on SMP systems. For example, the parallel garbage collector shows better scalability than the default garbage collector because of its higher processor utilization during garbage collection; a thread-local heap or allocation buffer improves the performance of multithreaded Java benchmarks on SMP systems because it can significantly improve the L2 cache locality of each thread.

The rest of this paper is organized as follows. In section II, we describe the experimental methodology, including the simulation environment, the multithreaded benchmarks, and the implementation details of our experiments. The experimental results together with the analysis are presented in section III. Section IV briefly reviews prior and related research work. Finally, we draw our conclusions and point out future research work in section V.

II. METHODOLOGY

This section describes our simulation environment, the multithreaded Java benchmarks, and the details of implementation and measurement.

A. Simulation Environment

Simics is a full system simulator that can simulate most popular hardware platforms and run various unmodified software components. Our simulation system includes four layers: Java benchmark, Java Virtual Machine, OS, and CPU. We use Simics 2.0.5 to simulate a Linux operating system running on a shared memory multiprocessor system [12]. Java benchmarks can then be run on top of the JVM on the simulated system for performance measurement.

The simulation environment is set up as follows:

• Each simulated CPU is an in-order issue processor with the Pentium IV instruction set and two levels of cache (split L1 cache: 16K, 4-way, 1-cycle latency; unified L2 cache: 512K, 8-way, 8-cycle latency). 2GB of main memory (100-cycle latency) is shared by all processors in the simulated system. The number of processors can be varied for different simulation configurations.

• The Linux operating system is based on the 2.6.8 kernel. This version of the kernel provides a constant-time (O(1)) thread scheduler and supports process preemption in both user space and kernel space. These new features provide a faster process response time and a higher throughput than previous versions of the kernel. In order to avoid skewed results in our experiments, single user mode is enabled so that only a few necessary system processes are running on the system.

• The JVM we used is the Sun HotSpot JVM 1.4.2. Unless otherwise stated, the default JVM options are used in the experiments. The reason is that the default configuration has been tuned by software vendors and usually produces the highest throughput. In all the experiments, a 512MB Java heap is used, reflecting the fact that an SMP system usually has a large main memory.
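With these latencies, the cost of cache misses can be estimated with the standard average memory access time (AMAT) formula; the miss ratios in the sketch below are hypothetical, chosen only to illustrate how quickly L2 misses come to dominate:

```java
// Back-of-the-envelope average memory access time (AMAT) for the
// simulated hierarchy: L1 hit = 1 cycle, L2 = 8 cycles, memory = 100
// cycles. The miss ratios used in main() are hypothetical.
public class Amat {
    static double amat(double l1Hit, double l2Penalty, double memPenalty,
                       double l1Miss, double l2Miss) {
        // AMAT = L1 hit time + L1 miss ratio * (L2 penalty + L2 miss ratio * memory penalty)
        return l1Hit + l1Miss * (l2Penalty + l2Miss * memPenalty);
    }

    public static void main(String[] args) {
        // e.g. 5% L1 misses, of which 20% also miss in L2:
        // 1 + 0.05 * (8 + 0.20 * 100) ≈ 2.4 cycles per access
        System.out.println(amat(1.0, 8.0, 100.0, 0.05, 0.20));
    }
}
```

Even with modest miss ratios, the 100-cycle memory penalty more than doubles the effective access time, which foreshadows the L2-dominated stall breakdown reported in section III.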

B. Multithreaded Benchmarks

Table I gives the descriptions of the four multithreaded Java benchmarks used in our experiments. Three benchmarks (MolDyn, MonteCarlo, and RayTracer) come from the multithreaded Java Grande Forum (JGF) Benchmark Suite [8].

We also use PseudoJBB, a variant of SPECjbb2000 [15], as a multithreaded benchmark. SPECjbb2000 was developed by the Standard Performance Evaluation Corporation (SPEC) for evaluating the performance of servers running typical Java business applications. As a variant of SPECjbb2000, PseudoJBB can run a fixed number of transactions in multiple warehouses. This benchmark has been widely used for performance evaluation in recent studies [1], [4], [16]. In our experiments, the data from the single-threaded initialization stage of PseudoJBB are excluded because our interest is in the multithreading properties.

C. Implementation and Measurement

Many factors, such as the instructions to be executed, memory access latency, I/O access latency, processor utilization, thread synchronization, etc., can affect the performance of multithreaded Java applications on SMP systems. In order to evaluate benchmark performance and identify the performance bottlenecks, it is necessary to record performance data during the execution of an application for further performance analysis.

As a full system simulator, Simics offers the capability to directly retrieve hardware performance data, such as cache misses and TLB misses, without interfering with the JVM. However, Simics does not directly know about software behaviors. In order to correlate hardware performance data with high-level software behaviors, we use the magic instruction in Simics to notify the simulator of the software events of interest. Simics uses a special architecture-dependent instruction (i.e., xchg %bx,%bx on Intel x86) as the magic instruction. We therefore instrument the source code of the JVM with this instruction; a real execution of the instrumented JVM shows that the instrumentation overhead is completely negligible.

The performance data is recorded on a per-thread basis. We recompile the JVM with support for Drepper's Native POSIX Thread Library (NPTL) model [3], which exhibits higher scalability than the classic pthread library on SMP systems. The NPTL threading model provides a 1-to-1 mapping from pthread to kernel process (thread). Similarly, Sun JDK 1.4.2 uses a 1-to-1 mapping from Java thread to pthread. Therefore, we can observe Java thread behaviors from the OS level by setting a breakpoint at the address of the OS process scheduler. Whenever the breakpoint is reached, Simics is informed of the context switch of kernel processes, and the performance data is recorded and correlated to the thread that was running right before the context switch.

We run the multithreaded Java benchmarks with different configurations to obtain performance data for our analysis. This allows us to tell how well the multithreaded Java benchmarks scale with the number of processors and application threads,


TABLE I

MULTITHREADED JAVA BENCHMARKS

Benchmark    Description                                                                   Input
MolDyn       N-body code modeling particles interacting under a Lennard-Jones potential    2,048 solutions
MonteCarlo   Monte Carlo techniques to price products derived from an underlying asset     10,000 solutions
RayTracer    A 3D raytracer, which renders 64 spheres with configurable resolutions        500 solutions
PseudoJBB    A variant version of SPECjbb2000                                              400,000 transactions

and where the potential performance bottlenecks are. The configurations can be divided into four categories:

• Fixed number of application threads
For each run of the benchmarks, we use 16 Java application threads but vary the number of processors of the SMP system (1, 2, 4, 6, and 8). We use no more than 8 processors because Linux 2.4 and higher uses the logical destination APIC mode of Intel processors, which limits the number of processors to 8. The uniprocessor system (1 processor), which is not an SMP configuration, is included only for comparison purposes.

• Fixed number of processors
For each run of the benchmarks, we fix the number of processors to 4 but vary the number of application threads from 1 to 12 (1, 2, 4, 6, 8, 10, and 12). Table I shows the input for each benchmark. Note that the input is constant for each run regardless of how many processors or application threads are used.

• Fixed number of application threads but with parallel garbage collector
This configuration is the same as the first one except that it uses a parallel garbage collector. The purpose is to show how the parallel garbage collector differs from the default stop-the-world garbage collector in terms of performance. We do not run the second configuration (fixed number of processors) with the parallel garbage collector because we find that this configuration is sufficient to show the patterns.

• Fixed number of processors and fixed number of application threads
In this configuration, we fix the number of processors and application threads to 4 and 8, respectively. For each run of the benchmarks, we vary the size of the thread-local allocation buffer to examine its performance impact on the memory system and determine the size for achieving optimal performance. Details will be discussed in section III-E.

III. EXPERIMENTAL RESULTS

The experimental results are analyzed and summarized in this section. The objective is to correlate hardware-level performance data back to software-level constructs. This will allow us to identify the performance bottlenecks at multiple levels.

Two software-level constructs, the thread types and memory regions, are introduced and justified in Section III-A. Then, the results of throughput scaling with the number of processors and application threads are presented. We then break down the performance data based on the types of threads, the memory access latency components, and/or the memory regions, and discuss in detail the analysis of performance and scalability. Lastly, optimization techniques such as the parallel garbage collector and thread-local heap or allocation buffer are examined.

A. Two software constructs: Thread Types and Memory Regions

On SMP systems, different threads in an application usually have different behaviors, and knowing these behaviors helps us identify the performance bottlenecks. When a multithreaded Java application runs on Sun JDK 1.4.2, a certain number of threads will be created. The number and the types of threads can vary depending on the application behavior, the execution platform, the garbage collection algorithm, etc. Typically, the types of threads include main thread, application thread, compiler thread, garbage collection thread, idle thread, signal dispatcher thread, reference handler thread, finalizer thread, suspend checker thread, and watcher thread. Because threads of the same type usually have similar behavior, we group threads based on their types and present the analysis for each type. In particular, we find that the signal dispatcher, reference handler, finalizer, suspend checker, and watcher threads have very small impacts on the executions of the benchmarks, since their contributions to the total execution time are always less than 2% in total. Thus, we do not distinguish those types of threads but classify them together under a new type, called other threads, in our analysis.
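These system threads can be observed from within a running program. The sketch below enumerates the live threads of the current JVM; on a HotSpot JVM the list typically includes system threads such as "Reference Handler", "Finalizer", and "Signal Dispatcher", though the exact names vary by JVM version:

```java
// Lists the live threads of the running JVM by walking the thread
// group hierarchy from the root group.
public class ListJvmThreads {
    static Thread[] liveThreads() {
        ThreadGroup root = Thread.currentThread().getThreadGroup();
        while (root.getParent() != null) root = root.getParent(); // climb to the root group
        Thread[] threads = new Thread[root.activeCount() * 2 + 1]; // head-room for races
        int n = root.enumerate(threads, true);                     // recurse into subgroups
        Thread[] result = new Thread[n];
        System.arraycopy(threads, 0, result, 0, n);
        return result;
    }

    public static void main(String[] args) {
        for (Thread t : liveThreads()) {
            System.out.println(t.getName() + (t.isDaemon() ? " (daemon)" : ""));
        }
    }
}
```

The daemon flag distinguishes most JVM housekeeping threads from application threads, which is essentially the grouping used in our per-thread-type breakdowns.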

In the experiments, we also study the behaviors of the L1 instruction cache, L1 data cache, and L2 unified cache. The results show that the memory system plays a very important role in the overall SMP system performance. As shown in Figure 1, the 4GB virtual address space of a JVM process running on a typical 32-bit Linux operating system can be divided into several memory regions. Note that the addresses shown in the figure may vary with the size of the Java heap. To further explore the details of memory system behaviors and identify the performance bottlenecks, we present the performance analysis of the memory system based on these memory regions when necessary.

B. Throughput

We use the speedup of throughput to examine the scalability of the benchmarks under study. Figure 2(a) shows the scaling of


[Fig. 1. An Example of JVM Virtual Memory Space Layout: the 4GB virtual address space (0x00000000-0xFFFFFFFF) is divided into kernel space, thread heap and stack space, the JVM heap space (comprising the mature, nursery, and permanent heap spaces), compiled Java method space, and JVM and shared library space. Region boundaries shown in the figure include 0xC0000000, 0x68490000, 0x64490000, 0x46BF0000, 0x44449000, and 0x42400000.]

throughput with the number of processors. All the benchmarks achieve increases in throughput with the number of processors. However, the speedups tend to be lower than linear. PseudoJBB has the lowest speedup in all cases. Additionally, no significant improvement is observed with more than 6 processors for PseudoJBB. This indicates that some potential performance bottlenecks limit the improvement. Further investigation of this issue will be presented in section III-C.

Figure 2(b) shows the scaling of throughput with the number of application threads. All the benchmarks achieve increases in performance with the number of application threads when the number of application threads is less than the number of processors, obviously due to higher CPU utilization with more threads. The peak throughputs are reached when the number of application threads is equal to the number of processors. Beyond that point, we observe degradations in performance for both PseudoJBB and MolDyn, while there are no significant changes for MonteCarlo and RayTracer.

The throughput scaling results reveal that using more processors often leads to higher performance, and that matching the number of application threads to the number of processors is important for achieving the maximum performance. However, because potential bottlenecks offset this effect, no benchmark is found to have a linear increase in performance.
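A common way to apply this observation in application code is to size the worker pool to the processor count. The sketch below uses the java.util.concurrent API of later JDKs (not available in JDK 1.4.2, so this is a modern restatement rather than the paper's setup; the class name and trivial workload are ours):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sizes the worker pool to the machine's processor count, following
// the observation that peak throughput is reached when the number of
// application threads equals the number of processors.
public class SizedPool {
    static long runTasks(int nThreads, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        final AtomicLong done = new AtomicLong();
        for (int i = 0; i < tasks; i++) {
            pool.execute(new Runnable() {
                public void run() { done.incrementAndGet(); } // trivial stand-in workload
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);          // wait for all tasks
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int nCpus = Runtime.getRuntime().availableProcessors(); // threads = processors
        System.out.println(runTasks(nCpus, nCpus * 4));
    }
}
```

A fixed-size pool avoids the oversubscription regime in which PseudoJBB and MolDyn degrade, while still queuing extra tasks for the existing workers.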

In the following sections, the behavior of throughput scaling will be further studied in detail by examining the impacts of different performance data on overall system performance. Through detailed analysis, useful findings about performance and scalability issues can be observed.

C. Breakdown of Execution Time

In this section, we present our analysis by breaking down the total execution cycles based on thread types. We also examine the contributions of various memory access latency components to the overall CPI (average cycles per instruction).

1) Breakdown of Total Execution Time by Thread Types: Figure 3(a) and Figure 3(b) show the scaling of total execution cycles broken down by thread types. The total execution

(a) Speedup of Throughput vs. Number of Processors

(b) Speedup of Throughput vs. Number of Application Threads

Fig. 2. Speedup of Throughput

cycles are normalized to 1, which allows us to compare the contributions of different threads. The results show that, for all benchmarks, the application thread and idle thread dominate the execution cycles, while the contributions of the compiler thread and other threads are negligible. This indicates that the JIT compiler is efficient in compiling Java methods into native machine code. The garbage collection thread contributes from 5% to 10% of the total execution cycles for PseudoJBB and up to 3% for MonteCarlo, while no significant GC cycles are observed for MolDyn and RayTracer. Our investigation reveals that both PseudoJBB and MonteCarlo have relatively larger working sets, resulting in more frequent and longer executions of garbage collection. MolDyn has a very small working set, which causes no garbage collection at all. RayTracer allocates a lot of small objects during execution. However, most of these small objects are thread-local and die very young, so garbage collections can be completed very quickly without copying a large number of objects from the nursery heap space to the mature heap space.

In our experiments, we find that the idle thread has a significant contribution to the total execution cycles, resulting in a severe degradation in performance. In all benchmarks, idle cycles keep increasing with the number of processors. For PseudoJBB running on an eight-processor SMP system, the idle cycles can constitute as much as 50% of the total CPU cycles. The behavior of the idle thread is different in the scaling with the number of application threads. We see the largest idle cycles for all benchmarks when the number of application threads is equal to 1. For instance, 74% of the total execution cycles are idle cycles for PseudoJBB and 52% for MolDyn.


(a) Total Execution Cycles by Thread Types vs. Number of Processors

(b) Total Execution Cycles by Thread Types vs. Number of Application Threads

Fig. 3. Total Execution Cycles

As the number of application threads increases, the idle cycles keep decreasing until the number of application threads is equal to the number of processors. Thereafter, we find that idle cycles keep increasing with the number of application threads for both PseudoJBB and MolDyn, while there are no significant changes for MonteCarlo and RayTracer.

Further investigation reveals that the major causes of idle cycles are lock contentions and long executions of garbage collection. MolDyn allocates a few highly shared objects. The strong contention of concurrent accesses to these objects produces a considerable number of idle cycles. For MonteCarlo and PseudoJBB, the idle cycles are largely produced during garbage collection because of the large working sets. RayTracer allocates a large number of thread-local objects, so the contention of concurrent accesses to these objects tends to be small. Meanwhile, these objects die very quickly, so garbage collection takes much less time to complete. As a consequence, only a relatively small number of idle cycles are produced during its execution.
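The two access patterns can be contrasted with a small sketch (hypothetical class and method names): a MolDyn-style counter shared by all threads, where every update contends for one lock, versus RayTracer-style thread-local state that needs no locking:

```java
public class ContentionDemo {
    static final Object lock = new Object();
    static long shared = 0;

    // MolDyn-style: all threads update one shared object, so every
    // increment contends for the same lock and threads sit idle.
    static long contended(int nThreads, final int perThread) throws InterruptedException {
        shared = 0;
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            ts[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < perThread; j++) {
                        synchronized (lock) { shared++; }   // serialized section
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return shared;
    }

    // RayTracer-style: each thread updates its own slot with no locking;
    // results are combined once at the end.
    static long threadLocal(int nThreads, final int perThread) throws InterruptedException {
        final long[] local = new long[nThreads];
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            ts[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < perThread; j++) local[id]++;
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        long sum = 0;
        for (long v : local) sum += v;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(contended(4, 1000));   // 4000, but serialized by the lock
        System.out.println(threadLocal(4, 1000)); // 4000, with no lock contention
    }
}
```

Both variants compute the same result; only the thread-local variant lets all processors make progress at once, which is the behavior that keeps RayTracer's idle cycles low.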

In order to exploit thread-level parallelism for high performance, idle cycles should therefore be minimized. Optimizing the JVM threading system and Java's synchronization mechanism is important for reducing the idle cycles produced by lock contentions. In addition, Java programmers have the responsibility to write scalable code at the application level. Also, as will be discussed in section III-F, the parallel garbage collector can significantly reduce the idle cycles during garbage collection.

2) Breakdown of CPI by Memory Access Latency Components: In this section, we perform the analysis by breaking down the overall CPI into five memory access latency components: processor time (the average access time to the L1 cache for the CPU to fetch one instruction and its data), L1 instruction cache miss stalls, L1 data cache miss stalls, L2 miss stalls, and cache-to-cache transfer stalls. We explicitly exclude the data of the idle thread because it rarely accesses the memory.

Figure 4 shows the scaling of overall CPI, broken down into its memory access latency components. We observe that memory stalls constitute a large percentage of CPU cycles, ranging from 20% to 60%. This indicates that memory access latency is a very critical performance bottleneck on SMP systems. The results also show that the contributions of the memory access latency components to the overall CPI vary considerably among the four benchmarks. PseudoJBB and RayTracer are more significantly affected by memory system stalls, while MolDyn and MonteCarlo are less affected. Both PseudoJBB and RayTracer allocate a large volume of objects during execution, resulting in a large number of cache misses in both the L1 and L2 caches. MolDyn has a very small working set, and accesses to its objects can usually hit in the caches. MonteCarlo behaves differently compared to the other three benchmarks. We observe that MonteCarlo allocates a large number of objects; however, its CPI tends to be relatively small. This is because its regular memory access pattern leads to good cache locality.

For all the benchmarks, L2 cache performance plays a very important role in the overall CPI. L2 cache miss stalls can account for as much as 50% of the total memory latency stalls. Cache-to-cache transfer stalls are the second largest contributor to memory system stalls. We find significant cache-to-cache transfer stalls for RayTracer and PseudoJBB, and relatively fewer for MolDyn and MonteCarlo. L2 cache misses and cache-to-cache transfers together contribute the majority of memory system stalls, while L1 instruction cache misses and L1 data cache misses have less impact on the overall memory system performance. To further understand memory system performance, more detailed explanations will be presented in section III-D.

The scaling of L2 cache performance can become worse. We observe that increasing the number of processors or the number of application threads often leads to increases in both L2 cache misses and L2 cache-to-cache transfers, and the impact on L2 cache-to-cache transfers is often slightly higher than that on L2 cache misses. Two factors contribute to this effect. First, the default memory allocator of Sun JDK allocates objects in the nursery heap space through a bump pointer. Objects belonging to different threads can be allocated close to each other. Accesses to one object may cause the other object to be loaded into the same cache line, and this obviously worsens the spatial cache locality for the current thread. This allocator could also lead to high cache-to-cache transfers: if two or more processors simultaneously access data in the same cache line but in different processor caches, writing the data in one cache may cause the data to be invalidated in another cache. Second, increasing the number of processors or application threads could make this situation even worse


Fig. 4. Overall CPI. (a) Overall CPI by memory access latency components vs. number of processors; (b) overall CPI by memory access latency components vs. number of threads.

due to the increasing contention for memory accesses. Our further investigation shows that using a thread-local heap or allocation buffer can potentially improve cache system performance on SMP systems. We detail these studies in section III-E.
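The shared bump-pointer behavior described above can be sketched in a few lines. This is a toy model for illustration only, not HotSpot's actual allocator; the class name, method names, and the 64-byte cache line size are our assumptions:

```java
// Toy model of a shared bump-pointer nursery (illustrative only -- this is
// not HotSpot's actual allocator; names and the 64-byte line size are ours).
import java.util.concurrent.atomic.AtomicInteger;

class SharedBumpAllocator {
    static final int CACHE_LINE = 64;            // assumed cache line size

    private final AtomicInteger top = new AtomicInteger(0);

    // All threads bump the same pointer, so allocations interleave.
    int alloc(int size) {
        return top.getAndAdd(size);
    }

    public static void main(String[] args) {
        SharedBumpAllocator nursery = new SharedBumpAllocator();
        int objA = nursery.alloc(16);            // allocated by "thread A"
        int objB = nursery.alloc(16);            // allocated by "thread B"
        // The two threads' objects land in the same 64-byte cache line,
        // so a write by one processor can invalidate the other's copy.
        System.out.println(objA / CACHE_LINE == objB / CACHE_LINE);
    }
}
```

With two threads allocating alternately, consecutive 16-byte objects from different threads fall into the same line, which is exactly the locality and sharing problem measured above.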

D. Memory System Performance

In this section, we further study memory system behaviors in terms of thread types and memory regions, with the goal of understanding in detail the causes of memory system bottlenecks. As above, we exclude the data of the idle thread from our analysis. Also, compared to the other caches, the performance impact of L1 instruction cache misses tends to be small and is therefore not reported.

1) L1 Data Cache Misses: Figure 5(a) and Figure 5(b) show the scaling of the L1 data cache miss ratio, broken down by thread type. As with the L1 instruction cache, we find that the application threads dominate cache misses for all benchmarks. The garbage collection thread contributes to the cache misses as well for PseudoJBB and MonteCarlo, but its impact is trivial compared to the application threads. This is the direct result of using a large Java heap in our experiments, which minimizes the impact of garbage collection.

Figure 5(c) and Figure 5(d) show the scaling of the L1 data cache miss ratio, broken down by memory region. We find that cache misses come from all the regions, with the three Java heap spaces together making the largest contribution. This indicates that improving the cache locality of the Java heap spaces is critical for improving overall memory system performance.

The behavior of MolDyn is slightly different: around 80% or more of its total cache misses come from the nursery heap space. As mentioned before, this is because of the small working set of MolDyn. All objects are allocated in the nursery space and are never promoted to the mature space. Consequently, most memory accesses to the Java heap go to the nursery heap space.

2) L2 Cache Misses: Figure 5(e) and Figure 5(f) illustrate the contributions of different types of threads to the L2 cache miss ratio. As with the L1 cache, we observe that L2 cache misses are dominated by the application threads. However, the L2 cache miss ratio does not stay constant; instead, it keeps increasing with either the number of processors or the number of application threads. The only exception is MolDyn, whose L2 cache miss ratio decreases with the number of processors.

Figure 5(g) and Figure 5(h) show the contributions of different memory address spaces to the L2 cache miss ratio. For all benchmarks, the kernel space and JVM code space cause only a small number of L2 cache misses. Instead, the Java heap space is the main contributor, since all object allocations and most data accesses occur in this space. Additionally, the garbage collector can cause a large number of L2 cache misses in the Java heap space, since the generational garbage collector must scan objects in the nursery heap space and copy live ones to the mature heap space. We also find that PseudoJBB has a much larger cache miss ratio in the Java heap space than the other benchmarks, which is attributed to PseudoJBB's large working set and Java code size.

3) Cache-to-Cache Transfers: To simplify the analysis, we attribute each cache-to-cache transfer to the thread responsible for the data modification that caused it. Figure 5(i) to Figure 5(l) show the contributions of different components to L2 cache-to-cache transfers. We observe that, among all thread types, the application threads dominate cache-to-cache transfers in all cases, while no significant contribution is observed for any other thread type. Among the memory regions, the nursery heap space contributes the most to cache-to-cache transfers for all benchmarks. We also find that increasing the number of processors or the number of application threads leads to an increase in cache-to-cache transfers. Thus L2 cache-to-cache transfers exhibit behavior similar to L2 cache misses. However, L2 cache misses are mainly caused by poor cache locality, while L2 cache-to-cache transfers are caused by true/false sharing of data among processors or threads.
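The false-sharing component of these transfers can be reproduced with a deliberately bad field layout. The sketch below is our own illustration, not code from the benchmarks: two threads update logically independent counters that sit in adjacent words, so every write forces the shared line to migrate between processor caches; padding the fields apart (assuming 64-byte lines; actual field layout is JVM-dependent) removes the false sharing.

```java
// Illustration of false sharing (our own sketch, not code from the paper).
class FalseSharingDemo {
    static class Counters {
        volatile long a;                 // written only by thread 1
        volatile long b;                 // written only by thread 2 -- likely same line as 'a'
    }

    static class PaddedCounters {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding between the fields
        volatile long b;                 // now likely on a different cache line
    }

    // Runs the two updates concurrently, 'iters' iterations each.
    static void hammer(Runnable incA, Runnable incB, int iters) {
        Thread t1 = new Thread(() -> { for (int i = 0; i < iters; i++) incA.run(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iters; i++) incB.run(); });
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Counters c = new Counters();
        // Counts are exact: each field has a single writer thread.
        hammer(() -> c.a++, () -> c.b++, 100000);
        System.out.println(c.a + " " + c.b);
    }
}
```

Timing the `Counters` variant against `PaddedCounters` on a multiprocessor typically shows the unpadded layout running measurably slower, for the reason described above, though the exact gap depends on the cache organization.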

E. Thread-local Heap/Allocation Buffer

A thread-local heap is a scheme in which each thread receives a partition of the heap for thread-local object allocation and thread-local garbage collection without synchronization with other threads [2]. Its original intention is to reduce heap contention. However, our study shows that this scheme also yields good cache performance on SMP systems, and this positive performance impact is greater than that of its original purpose.



Fig. 5. Cache Performance. (a) L1 data cache misses by thread types vs. number of processors; (b) L1 data cache misses by thread types vs. number of application threads; (c) L1 data cache misses by memory regions vs. number of processors; (d) L1 data cache misses by memory regions vs. number of application threads; (e) L2 cache misses by thread types vs. number of processors; (f) L2 cache misses by thread types vs. number of application threads; (g) L2 cache misses by memory regions vs. number of processors; (h) L2 cache misses by memory regions vs. number of application threads; (i) cache-to-cache transfers by thread types vs. number of processors; (j) cache-to-cache transfers by thread types vs. number of application threads; (k) cache-to-cache transfers by memory regions vs. number of processors; (l) cache-to-cache transfers by memory regions vs. number of application threads.



Fig. 6. Thread-local Allocation Buffer. (a) Throughput vs. size of thread-local allocation buffer; (b) memory stalls per instruction vs. size of thread-local allocation buffer.

Sun JDK 1.4.2 does not implement a thread-local heap directly, but offers a similar mechanism called the thread-local allocation buffer. There are some differences between the two. The thread-local heap approach requires that only non-shared objects be allocated locally in the heap belonging to the thread that creates them. It has the advantage that thread-local objects can be garbage collected independently without stopping other application threads, but it is complicated: it requires compiler support, and overheads such as write barriers must be introduced. In contrast, the thread-local allocation buffer approach allows any object to be allocated locally in the allocation buffer belonging to the thread that creates it. In this approach, objects in the allocation buffer cannot be collected independently because they may be shared with other application threads. However, the implementation is simpler and often leads to similar performance. Therefore, we study the performance impact of the thread-local allocation buffer instead of the thread-local heap in this section.
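The fast path of a thread-local allocation buffer can be sketched as below. This is our simplified model, not Sun's implementation: addresses are modeled as plain offsets, the refill size is a fixed constant of our choosing, and buffer-overflow waste and oversized objects are ignored. Each thread owns one allocator instance and only touches the shared heap pointer when its buffer runs out.

```java
// Simplified model of a thread-local allocation buffer (our sketch, not
// Sun's implementation).
import java.util.concurrent.atomic.AtomicLong;

class TlabAllocator {
    static final long TLAB_SIZE = 64 * 1024;           // 64 KB refill (assumed)

    private static final AtomicLong sharedTop = new AtomicLong(0);

    private long cur = 0, end = 0;                     // this thread's buffer bounds

    // Fast path: bump a private pointer, no synchronization. Slow path: one
    // contended bump of the shared pointer per TLAB_SIZE bytes, instead of
    // one per object as with the shared bump-pointer allocator.
    long alloc(long size) {
        if (cur + size > end) {                        // buffer exhausted: refill
            cur = sharedTop.getAndAdd(TLAB_SIZE);
            end = cur + TLAB_SIZE;
        }
        long obj = cur;
        cur += size;
        return obj;                                    // offset of the new object
    }
}
```

Because each thread's objects come from its own contiguous chunk, consecutive allocations by one thread stay adjacent while different threads' objects land in disjoint regions, which is exactly why locality improves and false sharing drops.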

Sun JDK 1.4.2 allows us to enable the thread-local allocation buffer and specify the size of the buffer for each thread via the command line. We run only two benchmarks, RayTracer and PseudoJBB, on a simulated SMP system with four processors. The other benchmarks appear insensitive to its performance impact, either because MolDyn has a small memory footprint or because MonteCarlo already has good cache performance (see Figure 4), so they are not reported. In the simulation, eight application threads are run for each benchmark, and we vary the size of the thread-local allocation buffer to examine its impact on performance.
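For concreteness, HotSpot-based JDKs expose this mechanism through the -XX:+UseTLAB and -XX:TLABSize options; the exact flag spelling and defaults vary across JDK versions, and the benchmark name below is a placeholder, so treat this as a sketch of the invocation rather than the exact command line from our runs.

```shell
# Enable per-thread allocation buffers of 64 KB (HotSpot flag names;
# verify against your JDK version). "MyBenchmark" is a placeholder.
java -XX:+UseTLAB -XX:TLABSize=64k MyBenchmark
```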

Figure 6(a) shows the throughput speedup of RayTracer and PseudoJBB as the size of the thread-local allocation buffer varies; a size of 0 means the thread-local allocation buffer is not used. Overall, using a thread-local allocation buffer leads to performance improvement. Figure 6(b) illustrates the memory stalls with the size of the thread-local allocation buffer. We observe that performance improves because L2 cache misses and cache-to-cache transfers are significantly reduced, and the allocation behavior explains this effect. The default memory allocator of Sun JDK allocates objects in the contiguous nursery space by advancing the bump pointer. If multiple threads are running simultaneously, objects belonging to different threads can be allocated close to each other, leading to poor spatial cache locality for each thread. This allocator can make cache performance even worse on SMP systems because of cache-to-cache transfers (true/false sharing misses): objects belonging to different threads can be placed in the same cache line but reside in different processor caches, so writing to this cache line on one processor invalidates the data in the other processor caches. The thread-local allocation buffer alleviates these problems by allocating each thread's objects together. Cache locality improves, and false sharing misses are significantly reduced because most cache lines are no longer shared with other threads. We observe only small performance gains for the L1 instruction cache and L1 data cache, mainly due to their small sizes.

Figure 6(a) also shows that the size of the thread-local allocation buffer can affect overall system performance. We find that a size between 16KB and 256KB often leads to optimal performance. If the size is too small, a thread may have to keep requesting a new buffer whenever the current allocation buffer is full. This causes contention on object allocation as well as memory fragmentation, since an object to be allocated may not fit in the free space of the current buffer. A small allocation buffer also results in poor locality, because a thread's objects can end up scattered across separate allocation buffers. With a large allocation buffer, however, the performance gain can be offset by penalties such as TLB misses.

The optimal size of the thread-local allocation buffer can vary depending on the cache organization, the application's execution behavior, and other factors. Dynamically choosing the size of the allocation buffer has the potential to further improve performance, and we plan to exploit this potential in future work.

F. Parallel Garbage Collector vs. Default Stop-the-World Garbage Collector

We use the default stop-the-world garbage collector in our experiments. This collector allows only one processor to actively execute garbage collection. A parallel garbage collector, on the other hand, can fully utilize all the processors of an SMP system to perform garbage collection in parallel. In this section, we compare the performance of these two collectors on SMP systems. We examine only PseudoJBB, since the JGF benchmarks do not produce sufficiently long garbage collections with the 512 MB heap space.



Fig. 7. Default GC vs. Parallel GC. (a) Throughput of PseudoJBB vs. number of processors; (b) total execution cycles of PseudoJBB by thread types vs. number of processors.

Figure 7(a) shows the throughput scaling of PseudoJBB with the number of processors. The parallel garbage collector shows a performance improvement much closer to linear as the number of processors increases, while the improvement of the default garbage collector falls far below linear. The performance gap between the two collectors keeps increasing with the number of processors; with more than six processors, only a very small performance gain is observed for the default garbage collector. This is caused by idle cycles during garbage collection. The parallel garbage collector does not waste as many CPU cycles as the default collector and therefore achieves a much higher performance improvement.

To further verify this conclusion, Figure 7(b) shows the scaling of total execution cycles, broken down by thread type. Due to contention and synchronization among the garbage collection threads, the parallel garbage collector uses slightly more CPU cycles on collection than the default garbage collector. However, since the decrease in idle cycles outweighs the increase in garbage collection cycles, more CPU cycles are available to the application threads and better performance is achieved.

IV. RELATED WORK

The behaviors of Java applications have been evaluated since Java was first introduced in late 1995 [6], [10], [14], [13], [7]. Most studies focused on single-threaded Java programs, especially the SPECjvm98 benchmarks; studies of multithreaded benchmarks are rare. In recent years, with the popularity of Java-based server applications, the performance of multithreaded Java programs has become of great interest.

Using the performance counters provided by the processor, Luo et al. studied the characteristics of Java server applications on the Pentium III [11]. They found that such programs have worse instruction streams (including I-cache miss rate, ITLB miss rate, etc.) than SPECint2000. By increasing the number of threads, they also studied the impact of Java threads on the micro-architecture. Instead of running benchmarks on a uniprocessor system, our work focuses on the performance characteristics of multithreaded Java programs in an SMP environment. Many metrics studied in this work, such as cache-to-cache transfers, are not available on single-processor systems.

Using a full system simulator (Simics) and a real machine, Karlsson et al. studied the memory system behavior of Java middleware running on SMP systems [9]. They mainly focused on the characterization of low-level (hardware) performance metrics, such as cache-to-cache transfers. Compared with that research, our work is more fine-grained: we attribute the detailed low-level performance metrics to high-level software components of the Java Virtual Machine. Specifically, we focus on the correlation between these low-level performance metrics and two high-level software constructs: thread types and memory regions. Such correlation can help identify potential performance and scalability bottlenecks at the application level for further optimization.

Sweeney et al. recently reported a performance monitoring system in Jikes RVM, implemented on top of hardware performance counters [16]. As a demonstration, two performance issues (general performance trends and memory latency issues) were investigated using this system. The results show that their tool is able to attribute the observed program behaviors to specific components of the JVM. However, the profiling system has some limitations: the performance metrics it can examine depend heavily on the capabilities of the processor's performance counters. Our simulation infrastructure is based on Simics, a full system simulator, which enables us to profile more aspects of the applications. For instance, we are able to categorize cache misses by memory region, which is infeasible with current implementations of performance counters.

Based on the infrastructure introduced in [16], Hauswirth et al. [5] examined further applications of this profiling system. Their work introduces a technique called vertical profiling to correlate performance characteristics across the layers of modern object-oriented systems (OS, virtual machine, and applications). Unlike our work, their research did not specifically focus on scalability issues on SMP systems, although their experiments were also based on a 4-way SMP system.

V. CONCLUSIONS

In this paper, we study the scalability issues of the JVM on SMP systems. The detailed simulator offers us a great environment



to evaluate the performance scaling of multithreaded benchmarks with the number of processors and application threads. Our unique analysis methodology of correlating low-level performance data to high-level software constructs (thread types and memory regions) allows us to identify performance and scalability bottlenecks at multiple levels.

Two potential bottlenecks, memory system latencies and lock contention, are studied in this work. Key observations emerge. First, in terms of memory access latency components, memory regions, and threads, the primary portion of memory stalls is produced by L2 cache misses and cache-to-cache transfers, by the Java heap space, and by the Java application threads, respectively. Second, increasing the number of processors or application threads, independently of each other, often leads to increases in the L2 cache miss ratio and the L2 cache-to-cache transfer ratio, which can prevent the system from scaling up linearly. Lastly, lock contention can cause a large number of idle cycles, indicating a significant lack of thread-level parallelism on SMP systems. In particular, idle cycles often scale up with the number of processors and application threads, resulting in non-linear performance improvement or even performance degradation.

Several optimization techniques are examined for their ability to reduce the impact of these performance bottlenecks. We observe that using a thread-local heap or allocation buffer can significantly reduce L2 cache misses and L2 cache-to-cache transfers for multithreaded Java benchmarks running on SMP systems, even though its original intention is to reduce heap contention. A thread-local allocation buffer with a size between 16KB and 256KB often leads to optimal performance. The parallel garbage collector is also shown to have better scalability on SMP systems, because its workload is balanced across processors for higher CPU utilization.

To our knowledge, this is the first work that investigates scalability issues by correlating low-level performance data to high-level software constructs. Our future work includes dynamically choosing the size of the allocation buffer, further exploring the behavior of lock contention, and validating the simulation results on real SMP systems.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant Nos. 0296131 (ITR), 0219870 (ITR), and 0098235. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] S. M. Blackburn, S. Singhai, M. Hertz, K. S. McKinley, and J. E. B. Moss. Pretenuring for Java. In Proceedings of the 2001 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), pages 342–352, Tampa Bay, FL, October 2001.

[2] T. Domani, G. Goldshtein, E. K. Kolodner, E. Lewis, E. Petrank, and D. Sheinwald. Thread-local heaps for Java. In ISMM '02: Proceedings of the 3rd International Symposium on Memory Management, pages 76–87, Berlin, Germany, 2002. ACM Press.

[3] U. Drepper and I. Molnar. The native POSIX thread library for Linux. http://people.redhat.com/drepper/nptl-design.pdf, 2003.

[4] S. Z. Guyer and K. S. McKinley. Finding your cronies: static analysis for dynamic object colocation. In Proceedings of the 2004 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), Vancouver, Canada, October 2004.

[5] M. Hauswirth, P. F. Sweeney, A. Diwan, and M. Hind. Vertical profiling: Understanding the behavior of object-oriented applications. In Proceedings of the 18th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Vancouver, British Columbia, Canada, October 2004.

[6] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhaal, and W.-M. W. Hwu. A study of the cache and branch performance issues with running Java on current hardware platforms. In Proceedings of the 42nd IEEE International Computer Conference (CompCon), San Jose, CA, February 1997.

[7] W. Huang, J. Lin, Z. Zhang, and J. M. Chang. Performance characterization of Java applications on SMT processors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 102–111, Austin, TX, March 2005.

[8] Java Grande Forum. The Java Grande Forum multi-threaded benchmarks. Available at http://www.epcc.ed.ac.uk/javagrande/threads.html.

[9] M. Karlsson, K. E. Moore, E. Hagersten, and D. A. Wood. Memory system behavior of Java-based middleware. In Proceedings of the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA), Anaheim, CA, February 2003.

[10] T. Li, L. K. John, V. Narayanan, A. Sivasubramaniam, J. Sabarinathan, and A. Murthy. Using complete system simulation to characterize SPECjvm98 benchmarks. In Proceedings of the International Conference on Supercomputing (ICS), Santa Fe, NM, May 2000.

[11] Y. Luo and L. K. John. Workload characterization of multithreaded Java servers. In Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Tucson, Arizona, November 2001.

[12] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: a full system simulation environment. IEEE Computer, pages 50–58, February 2002.

[13] R. Radhakrishnan, V. Narayanan, L. K. John, and A. Sivasubramaniam. Architectural issues in Java runtime systems. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA), Toulouse, France, January 2000.

[14] B. Rychlik and J. P. Shen. Characterization of value locality in Java programs. In Proceedings of the 3rd Workshop on Workload Characterization in Association with ICCD, Austin, TX, September 2000.

[15] Standard Performance Evaluation Corporation (SPEC). SPECjbb2000 benchmark. http://www.spec.org/osg/jbb2000/.

[16] P. F. Sweeney, M. Hauswirth, B. Cahoon, P. Cheng, A. Diwan, D. Grove, and M. Hind. Using hardware performance monitors to understand the behavior of Java applications. In Proceedings of the USENIX 3rd Virtual Machine Research and Technology Symposium (VM '04), San Jose, CA, May 2004.
