
2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, USA, April 21-23, 2013

Increasing the Transparent Page Sharing in Java

Kazunori Ogata and Tamiya Onodera
IBM Research – Tokyo
5-6-52 Toyosu, Koto-ku, Tokyo, Japan
[email protected]

Abstract— Improving memory utilization is important for the efficiency of a cloud datacenter because it increases the number of usable VMs. Memory over-commitment is a common technique for this purpose. Transparent Page Sharing (TPS) improves memory utilization by sharing identical memory pages to reduce the total memory consumption. For a cloud datacenter, we might expect TPS to reduce memory usage because VMs often execute the same OS and middleware and thus may have many identical pages. However, TPS is less effective for Java-based middleware because the Java VM finds it difficult to manage the layouts of internal data structures, which depend on the execution of the Java programs. This paper presents detailed breakdowns of the memory usage of KVM guest VMs executing a Java-based Web application server. We then propose increasing the amount of page sharing by utilizing a class sharing mechanism in the Java VM. Our approach reduced the measured physical memory for class metadata by up to 89.6% when running the Apache DayTrader benchmark on four guest VMs in a KVM host machine.

Keywords—Java; Memory usage analysis; Memory reduction; Class sharing; Transparent Page Sharing; Virtualization; KVM

I. INTRODUCTION

In a modern datacenter, server machines are virtualized as virtual machines (VMs) to improve the utilization of physical servers and to reduce server management costs. In this environment, increasing the number of VMs on each piece of hardware is a key factor for higher cost effectiveness. For example, a cloud computing datacenter often uses a separate guest VM for each user to ensure security and isolate that user's resources from the other users. This kind of configuration runs many small VMs. Even small reductions in the resources used by each VM are important for improving the efficiency of the entire datacenter, especially if more VMs can run on each real machine.

To increase the number of VMs on a machine, efficient use of memory is particularly important, because the cost of reassigning allocated memory from one VM to another is high: the data in memory must be copied. Memory over-commitment is a key feature of hypervisors that improves memory utilization by allocating virtualized physical memory (guest memory) to guest VMs so that the total appears larger than the physical memory of the host machine. However, performance is often degraded when the amount of actually used guest memory exceeds the amount of host physical memory, because the hypervisor starts paging out some of the guest memory to disk.

There are three approaches to reduce paging in a hypervisor. One is to dynamically reduce the amount of guest memory and let the guest OS page out memory as needed [43]. The second approach is to improve the speed of the page-out and page-in operations by paging to RAM instead of disk [18, 19]. The third approach is to reduce the need for host physical memory by sharing a single host physical page among multiple guest memory pages if their content is identical, even though they are pages for different guest VMs.

The third approach is called Transparent Page Sharing (TPS) [43]. The hypervisor detects the identical pages by scanning memory in the background and shares them in a copy-on-write manner. This operation is transparent to the guest VMs because the sharing takes place via the address translation in the hypervisor. This technique is implemented in many hypervisors, such as VMware [43], Xen [6, 10] with the Difference Engine [18], KVM [26] with KSM [4, 25], and PowerVM [9, 15].
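To make the mechanism concrete, the following minimal Python sketch models scanning-based page sharing: pages with identical contents are backed by a single canonical frame. It is only an illustration of the idea, not the KSM or VMware implementation; real scanners compare full page contents before merging and mark shared frames copy-on-write so that a later store splits the sharing again.

```python
import hashlib

PAGE_SIZE = 4096

def share_identical_pages(pages):
    """Toy model of scanning-based page sharing.

    `pages` maps (vm_id, guest_pfn) -> page contents (bytes of length
    PAGE_SIZE).  Pages with identical contents are mapped to a single
    canonical host frame.  Returns the mapping and the number of host
    frames saved.  A real hypervisor verifies the full contents (not a
    hash) and marks shared frames copy-on-write.
    """
    canonical = {}   # content digest -> host frame id
    mapping = {}     # (vm_id, guest_pfn) -> host frame id
    for key, content in pages.items():
        digest = hashlib.sha1(content).digest()
        canonical.setdefault(digest, len(canonical))
        mapping[key] = canonical[digest]
    return mapping, len(pages) - len(canonical)

# Two VMs whose first page is identical (zero-filled): one frame saved.
demo = {
    ("vm1", 0): bytes(PAGE_SIZE),
    ("vm2", 0): bytes(PAGE_SIZE),
    ("vm2", 1): b"\x01" * PAGE_SIZE,
}
print(share_identical_pages(demo)[1])   # -> 1
```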

For a cloud computing datacenter, VMs often execute the same OS and middleware and thus they may have many identical pages since many users will use standard disk images provided by the cloud service provider. Also, for a Platform as a Service (PaaS) datacenter, the service provider can select the OS and middleware to be used in the VMs, and they usually use the same software stack to reduce management costs. In these environments, we believe that TPS can effectively reduce the memory requirements.

However, certain characteristics of TPS generally render it ineffective for Java-based middleware. A program suitable for TPS should have many pages that contain the same data on every execution. The identical pages should also be read-only for long enough periods to compensate for the overhead of scanning them. In the Java heap, however, even the read-only objects may be moved during garbage collection (GC). For the Java VM’s native memory area, the Java VM cannot manage the layout of the internal data structures because most of them are created in response to API calls in Java programs.

Our work focuses on analysis and reduction of the host physical memory usage in each of the guest VMs that is executing a Java-based Web application server. Our experiment showed that most of the memory for the Java processes was not shared by TPS. We propose increasing the page sharing of Java class areas by utilizing the class sharing feature¹ implemented in production JVMs for Java 5 and later [11, 34-36, 39]. This feature allows Java VMs to share class areas by storing the classes in a shared memory region or a memory-mapped file. We can use this feature to arrange classes in the same order in all of the guest VMs and let TPS share the pages for class areas. To do this, we set up the class sharing feature to use a memory-mapped file and copy the memory-mapped file to all VMs. This approach increased the shared pages by an average of 100 MB per Java process running a Web application server.

The main contributions of this paper are:

• We break down the memory usage of guest VMs, showing which guest user processes are responsible for which parts of the physical memory consumption.

• We analyze in detail the physical memory usage in Java-based Web application servers executed in guest VMs and the amounts of memory shared by using TPS.

• We propose and evaluate increasing the page sharing of Java class areas across guest VMs by utilizing the Java VM’s class-sharing feature to make the layouts of the classes the same for the VMs.

The rest of the paper is organized as follows. Section 2 shows a breakdown of the memory usage of the guest virtual machines. Section 3 is a detailed breakdown of the memory usage for individual Java processes and discusses problems that prevent the Java processes from sharing pages. Section 4 discusses our approach to increase page sharing. Section 5 shows experimental results. Section 6 discusses related work. Finally, Section 7 offers conclusions.

II. BREAKDOWN OF MEMORY USAGE IN GUEST VIRTUAL MACHINES

This section describes how we analyzed the usage of host physical memory, and then covers our experiment for analyzing the physical memory usage in the guest VMs. We measured the memory usage of the guest VMs when executing a Java-based Web application server.

A. Methodology to Measure Physical Memory Usage

The key to our memory analysis is to fully identify the usage of each physical page frame based on the component that allocated the memory. The component might be a guest VM, a guest OS, or a user process in a guest OS. Each virtual address used by a component is translated to a host physical address to analyze the usage of that physical memory location.

A hypervisor virtualizes physical memory by using an extra address-translation step after the OS address translation. This defines a virtual address space for the guest memory.

¹ This feature is called Class Data Sharing (CDS) [34-36] in Oracle's HotSpot Java VM [33, 38], and it is enabled by default. The IBM J9 Java VM [17] calls this feature shared classes [11, 39]. The J9 Java VM requires a command-line option to enable the feature, but some middleware products, such as IBM WebSphere Application Server [22], add this option by default to enable and configure it.

Both the OS and the hypervisor manage tables that hold mappings from virtual addresses to physical addresses. We need to collect all of these tables to track the physical memory usage of the guest processes, because some of the memory pages in the guest processes may not be mapped to host physical memory.

The number of tables that must be collected depends on the hypervisor implementation, which generally follows one of two basic approaches. One approach uses a system VM [40] as shown in Fig. 1(a), where the address translation is handled by the hypervisor and the guest OS. The other approach uses a process VM [40] as shown in Fig. 1(b), where the address translation is handled in three layers: the host OS, the VM process, and the guest OS.

A single host physical page may be used by multiple guest processes. If so, we need to avoid counting the identical page multiple times. There are two approaches for calculating the memory usage of the shared pages for each process: a distribution-oriented approach and an owner-oriented approach.

The distribution-oriented approach calculates the share of each process in each shared page by dividing the size of the page by the number of sharing processes, and it accumulates these shares into the memory usage of each process. This is a commonly used way to account for shared memory pages. In Linux, the PSS values in the /proc/<pid>/smaps files are calculated with this approach.
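As a concrete example of the distribution-oriented accounting, the short Python sketch below sums the Pss fields of /proc/<pid>/smaps, which the Linux kernel computes by charging each shared page to its sharers in proportion (page size divided by the number of sharers). This is a sketch for Linux systems that expose smaps; the pid is whatever process is being analyzed.

```python
def pss_kb(pid):
    """Return the proportional set size (PSS) of a process in kB.

    PSS charges each shared page as size / number_of_sharers, i.e. the
    distribution-oriented accounting described above.  Reads the Pss
    fields that the Linux kernel reports in /proc/<pid>/smaps.
    """
    total_kb = 0
    with open(f"/proc/{pid}/smaps") as smaps:
        for line in smaps:
            if line.startswith("Pss:"):
                total_kb += int(line.split()[1])   # field is "Pss:  <n> kB"
    return total_kb
```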

The owner-oriented approach chooses an owner process for a shared memory page and accumulates the size of the entire page for the owner process. The other processes are called non-primary processes and they can use the page for free in terms of memory usage calculation. The benefit of this approach is that we can easily calculate the amount of additional memory needed to run another process sharing this page.

We used the owner-oriented approach, and so we must select one of the sharing processes as the owner of each shared page frame.

Fig. 1. Two approaches for implementing a hypervisor: (a) a hypervisor using a system VM, where address translation is handled by the guest OS and the hypervisor; (b) a hypervisor using a process VM, where address translation is handled by the guest OS, the VM process, and the host OS.


In this research, a Java process is always selected as the owner since we are analyzing Java memory usage. If there are multiple Java processes, the process that happened to be assigned the smallest process ID among them is selected as the owner, though there is no relationship between the process IDs in different VMs. The other processes are regarded as the non-primary processes sharing the page.

We calculated the host physical memory usage by totaling the sizes of the pages owned by each process. We also calculated the total size of the shared pages for each process. For the example shown in Fig. 1(b), if the rightmost guest process is the owner of the page being accessed, then the two guest processes in the middle are the non-primary processes sharing that page. The physical memory usage of the rightmost process is increased by the size of the page. For each of the other two processes, the total size of their shared pages is increased by the size of the page, though the physical memory usage is not changed.
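The sketch below (in Python, with hypothetical input data) shows this owner-oriented accounting: for each shared host frame it picks a Java process as the owner, preferring the smallest process ID, charges the whole page to the owner, and records the page size in the shared-page total of every non-primary sharer.

```python
from collections import defaultdict

PAGE_SIZE = 4096

def owner_oriented_usage(frame_users):
    """Owner-oriented accounting of host page frames.

    `frame_users` maps a host frame id to the list of processes mapping
    that frame, each given as a (vm_id, pid, is_java) tuple.  A Java
    process is preferred as the owner and, among Java processes, the one
    with the smallest pid is chosen (an arbitrary but stable choice, as
    in the text).  Returns the physical usage charged to each process and
    the total size of the pages each process shares as a non-primary user.
    """
    usage = defaultdict(int)    # (vm_id, pid) -> bytes charged as owner
    shared = defaultdict(int)   # (vm_id, pid) -> bytes shared "for free"
    for users in frame_users.values():
        owner = min(users, key=lambda u: (not u[2], u[1]))
        usage[owner[:2]] += PAGE_SIZE
        for user in users:
            if user[:2] != owner[:2]:
                shared[user[:2]] += PAGE_SIZE
    return usage, shared
```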

For a hypervisor using process VMs, the pages allocated to the guest VM process but not used for guest memory are totaled as the pages used by the guest VM itself.

B. Collecting Address Translation Information

We used KVM for our experiments, which is a hypervisor using the process-based approach described in Section 2.1. However, our research is also applicable to system-based hypervisors because the additional address translation below the hypervisor is transparent to the memory usage in the guest processes.

We need to know how each layer manages the address translation information in order to calculate the host physical address that corresponds to a given virtual address. Since we are using KVM as the hypervisor, we need to obtain address translation information from three layers: the guest OS, the guest VM process, and the host OS.
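A dictionary-based Python sketch of how the three layers compose is shown below. Real page tables are multi-level and KVM memory slots map contiguous ranges of guest frames onto the VM process's address space, but the chain of lookups (guest virtual → guest physical → host virtual → host physical) is the same; any layer may have no mapping, for example for a swapped-out page.

```python
PAGE_SIZE = 4096

def guest_virt_to_host_phys(gva, guest_pt, gpa_to_hva, host_pt):
    """Translate a guest virtual address to a host physical address.

    guest_pt   : guest page table, guest virtual page  -> guest physical page
    gpa_to_hva : VM-process layer, guest physical page -> host virtual page
                 (derived from the KVM memory slots)
    host_pt    : host page table,  host virtual page   -> host physical page
    All three tables are modelled as plain dicts of page numbers.
    Returns None if any layer has no mapping (e.g. a page that is not
    resident in host physical memory).
    """
    offset = gva % PAGE_SIZE
    gppn = guest_pt.get(gva // PAGE_SIZE)
    if gppn is None:
        return None
    hvpn = gpa_to_hva.get(gppn)
    if hvpn is None:
        return None
    hppn = host_pt.get(hvpn)
    if hppn is None:
        return None
    return hppn * PAGE_SIZE + offset
```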

1) Collecting Address Translation Information in the Host and the Guest OS

We used system dumps of the real and virtual machines: crash dumps for the host OS, and the virsh dump command to obtain KVM dumps of the guest VMs. Both the host and the guest kernels were debug builds, so that crash [1] could analyze the dumps.
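For reference, the guest dumps can be driven from a small script like the sketch below; the domain name and output path are placeholders for whatever a particular setup uses, and the host-side crash dump is assumed to be collected separately with the distribution's usual kdump/crash tooling.

```python
import subprocess

def dump_guest_memory(domain, corefile):
    """Dump the memory of a running KVM guest with `virsh dump`.

    `domain` is the libvirt domain name of the guest VM and `corefile`
    the path where the dump is written; the resulting file (and the host
    crash dump) can then be inspected with the crash utility [1] as long
    as matching debug kernels are available.
    """
    subprocess.run(["virsh", "dump", domain, corefile], check=True)

# e.g. dump_guest_memory("guest-vm-1", "/var/tmp/guest-vm-1.core")
```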

2) Collecting Address Translation Information in the Guest VM

We implemented a Linux kernel module for the host OS. In KVM, a guest VM is a process that uses a special device driver to handle the virtualization operations, such as modifying page tables for the address ranges assigned to guest memory. This device driver holds its internal data structures, including the address translation information, in the file structure of the kvm-vm device opened by the guest VM process. The kernel module retrieves the data structures from the private_data field of that file structure.

C. Measurement Environment

We measured the memory usage of a machine hosting four KVM guests, each with 1 GB of guest memory. Each guest VM executed IBM WebSphere Application Server (WAS) [22]. Each WAS instance ran Apache DayTrader [2], a benchmark that simulates an online stock trading system. We used the IBM J9 Java VM [5, 17] included in WAS. The second columns in Table 1 and Table 2 describe the environment of the real machine and the guest VMs for this measurement, which ran on the Intel platform. The second column in Table 3 describes the configurations of DayTrader and the Java VM. We ran IBM DB2 [23] on a separate machine.

We enabled the KSM scanner during our measurements. KSM periodically wakes up and scans a specified number of pages, and then sleeps for a specified period. The number of pages to scan in a single cycle and the length of the sleep period are tuning parameters. The number of pages to scan was set to 10,000 for the first three minutes after starting up WAS and while initializing DayTrader by accessing the scenario page, and then it was set to 1,000 during the measurements. The length of the sleep period was set to 100 msec throughout the measurements. The CPU usage for KSM scanning was about 25% when the number of pages to scan was set to 10,000 and about 2% when the number was set to 1,000.
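These settings correspond to the standard KSM sysfs knobs on Linux; a minimal sketch of the configuration described above (10,000 pages per scan during warm-up, 1,000 during the measurement, 100 msec sleep) is shown below and must be run as root.

```python
KSM_SYSFS = "/sys/kernel/mm/ksm"

def configure_ksm(pages_to_scan, sleep_millisecs, run=1):
    """Write the KSM scanner parameters to sysfs (requires root).

    The measurements described above correspond to
    configure_ksm(10000, 100) while WAS and DayTrader warm up, and
    configure_ksm(1000, 100) during the measurement phase.
    """
    for knob, value in (("pages_to_scan", pages_to_scan),
                        ("sleep_millisecs", sleep_millisecs),
                        ("run", run)):
        with open(f"{KSM_SYSFS}/{knob}", "w") as f:
            f.write(str(value))
```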

We measured the memory usage after 90 minutes of DayTrader execution. DayTrader was continuously accessed from 12 client threads for each VM, which means the real machine was accessed by a total of 48 threads.

D. Experimental Breakdown of Memory Usage in a Virtualized Server

Fig. 2 shows the breakdown of the host physical memory usage for the guest VMs. It also shows the amount of guest memory shared by TPS, which is also the amount of memory saved by sharing pages. The memory usage was broken down into four categories: the memory used for the Java processes, the memory used for the other user processes in the guest OS, the memory used for the guest kernel including buffers and caches, and the memory used by the guest VM itself. We show the Java processes separately from the other guest user processes.

This breakdown shows that the Java processes were the largest memory consumers in each of the guest VMs. Each Java process consumed 750 MB of physical memory on average in a guest VM with 1,024 MB of memory. The guest kernel used 219 MB in guest VM 1, but averaged 106 MB for the other guest VMs, because an average of 106 MB of their guest memory was shared with guest VM 1, which amounted to 50% of the guest kernel area. The memory usage of the other guest processes and of the guest VMs themselves was quite small and is difficult to see in the graph.

The amount of Java memory shared by TPS was quite limited. Most of the shared area was in the guest kernel. This result suggests that TPS was not effective for Java processes compared to native programs, which showed large savings [4].


III. DETAILED BREAKDOWN OF JAVA MEMORY USAGE

This section is a deeper analysis of the memory usage in the Java processes and discusses the causes of the low effectiveness of TPS for Java process memory.

A. Breakdown of Java memory usage

We further broke down the memory usage of the WAS processes measured in the experiment in Section 2.4 and of some other Java applications. We used the same technique as described in our earlier paper [31]: we broke down the Java memory usage into the amount of memory used by each of the components in a Java VM. We improved the technique to use the address translation information in the hypervisor and the host OS. We also used the debugging information of the Java VM and the memory management information collected from the OS. Table 4 shows the categories of Java memory used in our analysis. We divided the entire Java process memory into these seven categories.

Fig. 3(a) shows a detailed breakdown of each of the WAS processes from the measurements shown in Fig. 2. Each series of stacked bars shows the amount of guest memory used by one Java process. For example, the series labeled JVM1 shows the breakdown of the Java process executed in guest VM 1. Each box shows the size of one of the categories listed in Table 4, and its color indicates the category. The graded-color boxes show the sizes of the memory shared by TPS.

TABLE I. ENVIRONMENT AND CONFIGURATION OF THE PHYSICAL MACHINES
(columns: machine for the Intel platform | machine for the POWER platform)
Machine: IBM BladeCenter LS21 | IBM BladeCenter PS701
CPU: Dual-core Opteron (2.4 GHz), 2 sockets | POWER7 (3.0 GHz), 2 sockets, 4 cores per socket, 4-way SMT
RAM size: 6 GB | 128 GB
Host OS: RedHat Enterprise Linux 5.5 (2.6.18-238.5.1.el5debug) | N/A
Hypervisor: KVM (kvm-83-224.el5) | PowerVM 2.1

TABLE II. CONFIGURATION OF A GUEST VIRTUAL MACHINE FOR EXECUTING JAVA-BASED SERVER APPLICATIONS
(columns: guest VM in the Intel platform | guest VM in the POWER platform)
Virtual CPU: 2 virtual CPUs | 1 virtual CPU, 4-way SMT
Guest memory size: 1.00 GB for DayTrader, TPC-W, and the Tuscany bigbank demo; 1.25 GB for SPECjEnterprise 2010 | 3.5 GB
OS: RedHat Enterprise Linux 5.5 (2.6.18-194.3.1.el5debug) | AIX 6.1 TL6
KSM scanner setting: 1,000 pages per scan, 100 msec interval | N/A
WAS version: 7.0.0.15 | 7.0.0.15
Java VM version: IBM J9 Java VM (Java 6 SR9) | IBM J9 Java VM (Java 6 SR9)

TABLE III. CONFIGURATION PARAMETERS OF THE JAVA APPLICATIONS AND JAVA VMS
(columns: DayTrader in the Intel platform | SPECjEnterprise 2010 | TPC-W | Tuscany bigbank demo | DayTrader in the POWER platform)
Benchmark version: 2.0 | 1.02 | Java implementation based on 1.0.1 [8] | 1.6.2 | 2.0
Configuration of client driver for each guest VM: 12 client threads | injection rate of 15 | 10 client threads | 7 client threads | 25 client threads
Maximum and minimum sizes of the Java heap: 530 MB | 730 MB | 512 MB | 32 MB | 1.0 GB
Shared class cache size: 120 MB | 120 MB | 120 MB | 25 MB | 120 MB

Fig. 2. Breakdown of physical memory usage and savings with TPS by Java processes, other user processes, guest kernels including buffer caches, and guest VMs themselves. Most of the savings from TPS was achieved in the guest kernel areas and very little in the Java process areas.


This result shows that TPS effectively shared the code area, but it was not effective for the other areas. For the Java heap, 0.7% of the area was shared by TPS, but most of the shared pages were those filled with zeros, because the GC fills memory with zeros when it collects garbage. These shared pages are therefore soon modified and the sharing is broken again. The amount of sharing in the JVM and JIT work area was about 9.2% of that area. The three major sources of shared pages were the buffers of the NIO socket library in Java, the unused parts of the memory blocks for the malloc arenas, and internal data structures that were allocated in bulk but not yet used. Although the buffers of the NIO library accounted for about half of the sharing in this area, we cannot expect this much sharing with typical real-world application programs, because this measurement executes the same benchmark in all of the VMs and transfers the same data between the application servers and the driver and database servers through the network.

We also analyzed the memory usage of different sets of Java programs to verify that the limited effectiveness of TPS is not specific to DayTrader or WAS. We used SPECjEnterprise 2010 [41], TPC-W [42], and Apache Tuscany [3]. SPECjEnterprise 2010 is a transactional benchmark simulating processes for manufacturing and selling automobiles. TPC-W is a transactional Web benchmark simulating an online bookstore, and we used a Java implementation [8]. These two programs run in a Web application server. Tuscany is an open source implementation of a server that supports the Service Component Architecture (SCA) [30]. SCA is a set of specifications for a model to build systems and applications using the Service Oriented Architecture (SOA). Tuscany is middleware for SCA applications, and thus does not use WAS. The configurations of the measured programs are described in Table 3.

Fig. 3(b) shows a breakdown when DayTrader, SPECjEnterprise 2010, and TPC-W were executed in the same version of WAS. We used three guest VMs, and each VM executed a WAS process that ran one of these three applications. Fig. 3(c) shows a breakdown when each of the three guest VMs executed a Tuscany server. We used the bigbank demo program included in the Apache Tuscany package as the benchmark program. These results also showed that areas other than the code area were not shared effectively by TPS and indicate that the limited effectiveness of TPS is not specific to a particular Java workload.

B. Discussion on Using TPS for Java Programs

These results led us to believe that the main factor in the low effectiveness of TPS in reducing Java memory usage is that the Java VM does not have control of the layout of its internal data structures, and so the layouts vary between executions. This differs from previous research [4] that shared many data structures in the heap.

TABLE IV. CATEGORIES OF JAVA MEMORY AND THEIR TYPICAL DATA
Code area: code loaded from the executable files; shared libraries; data areas for shared libraries
Class metadata: Java classes
JIT-compiled code: native code generated by the JIT; runtime data for the generated code
JIT work area: work areas for the JIT compiler
Java heap: Java heap
JVM work area: work area for the JVM; areas allocated by Java class libraries; other private data of the JVM process
Stack: C stack; Java stack

Fig. 3. Detailed breakdowns of memory usage of Java processes. The amount of memory shared by TPS (graded color) is quite small. (a) Breakdown of each of the WAS processes from the measurements shown in Fig. 2. (b) Breakdown when DayTrader, SPECjEnterprise 2010, and TPC-W were executed in the same version of WAS. (c) Breakdown when each of the three guest VMs executed a Tuscany server.



The Java VM does not have much control over the layout of these data structures, including Java class metadata and Java objects, because most of them are created by service routines in the Java VM that are invoked as the Java program executes. Thus, the Java VM cannot manage the order in which the data structures are created. It is also difficult to make the layout of JVM data structures identical for all of the JVM processes by reserving memory areas for them, because we cannot predict how many data structures will be created in a particular execution of a Java program. Also, we cannot expect that one part of a data structure in a JVM process will happen to form an identical page in other JVM processes, because the sizes of those data structures are typically much smaller than the size of a memory page.

For the Java heap, there are two additional factors that interfere with TPS sharing: changes to the object headers and movement of objects during GC. Some operations on an object, such as acquiring a monitor, modify the object header even though the object itself is accessed in read-only mode. Also, GC moves objects around as it frees memory, even if a Java VM is modified to initially arrange objects in a specific order. Most objects are moved when compaction of the Java heap occurs, and generational GC [27, 29] moves survivor and tenured objects on every minor GC. Even if two large read-only objects start out on identical pages, either may move to a different address with a new offset from the page boundary, and then these objects can no longer be shared by TPS.

In contrast, native programs have higher probabilities of sharing pages when the same program is executed in guest VMs. First, the address of the memory area allocated to a native program does not change during each execution. This also means that the order of the data structures in a large part of memory is preserved throughout that execution. Second, a large memory area allocated with a standard API is often aligned at a page boundary. For example, a memory area allocated by mmap() is always aligned to a page boundary. The address of a memory area larger than 128 Kbytes also starts at a fixed offset from a page boundary if it is allocated by malloc() in the GNU libc library, and thus it can be shared by TPS if the values of the data are the same.
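The alignment behavior can be checked with a small Python sketch using ctypes and the mmap module; the exact offset produced by glibc malloc() for large blocks is implementation-dependent (the text relies only on it being a fixed offset), so the printed values are illustrative.

```python
import ctypes, ctypes.util, mmap

PAGE_SIZE = mmap.PAGESIZE

# An anonymous mmap()ed region always starts on a page boundary.
buf = mmap.mmap(-1, 4 * PAGE_SIZE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
print("mmap offset within page:", addr % PAGE_SIZE)        # 0

# glibc malloc() services requests above its mmap threshold (128 KB by
# default) with mmap(), so such blocks start at a small fixed offset
# from a page boundary (the allocator header precedes the user data).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.malloc.restype = ctypes.c_void_p
p = libc.malloc(256 * 1024)
print("large malloc offset within page:", p % PAGE_SIZE)   # small constant
libc.free(ctypes.c_void_p(p))
```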

Therefore, the only JVM area where we can expect TPS sharing to be effective is the code area of the Java VM because the code area maps identical executable files as long as the same version of the Java VM is in use.

IV. INCREASING OPPORTUNITIES FOR PAGE SHARING

This section describes our approach to increase the amount of sharing by using the class-sharing feature of the Java VM. We used this feature to preload a group of classes to manage their layout.

A. Potential Sharing Opportunities in Java Process Memory

To increase the sharing opportunities, we need to find memory areas that are expected to retain large and long-lived read-only data whose layout can be managed using techniques such as preloading the data structures before use or reserving specific memory blocks. The rest of this section considers which of the JVM components from Table 4 satisfy these conditions.

Code area. This area is always shareable with TPS.

Class metadata. This area could be made shareable if we can preload the classes that will be used later. A JVM can load class metadata into memory before it is used, though initialization of the class is still required when it is actually used. This means the JVM can manage the layout of class metadata without violating the Java language specification [16].

JIT-compiled code. This area is difficult to share because the JIT-compiled code can differ from one Java process to another. The generated code can vary because the JIT compiler uses runtime information for its optimizations, and the values of that runtime information can differ for each Java process.

JIT work area. This area is difficult to share because most of this area is accessed in read-write mode as a work area. Also, the work area data is short-lived since it is discarded when the JIT compilation of a method is finished.

Java heap. This area is difficult to share because the object headers can be modified even for read-only objects and the GC can move the objects, as described in Section 3.2.

JVM work area. This area is difficult to share because many of the data structures are accessed in read-write mode and it is difficult to predict which of them will be runtime constants. It is also difficult to reserve memory for the data structures that might be used in the execution of a Java program because the JVM functions used depend strongly on the Java programs.

Stack. This area is not shareable because most of this area is accessed in read-write mode and there are many pointers to internal data structures. Both pointers and data structures normally differ from one JVM process to another.

B. Managing the Layout of Java Class Metadata by Preloading

Our approach for managing the layout of the class metadata is to preload the class files in the same order in all of the VMs. We used the class sharing feature of the Java VM to do this. This feature allows Java VMs to share class areas by storing classes in a shared memory region or a memory-mapped file. We can use this feature to arrange classes in the same order in all of the guest VMs and let TPS share the pages for the class areas by setting up the class sharing feature to use a memory-mapped file that is copied to all of the VMs.

There are two benefits of using this feature. One is that it automatically chooses the classes to be preloaded, though that also means we must find the classes that are worth preloading to avoid wasting memory on unneeded classes.


Class sharing is a mechanism to store the contents of the class files in a shared memory area when one Java VM loads the classes. Other Java VM processes read the classes from that area to reduce memory usage and to improve performance by skipping the parsing of the class files. This means the classes in the shared area are those actually used by the Java programs.

To control the layout among multiple guest VMs, we configure the Java VM to use a memory-mapped file to allocate a memory region (the shared class cache) for storing the sharable classes, so that all of the preloaded classes can be saved as a file. Then we copy this file to all of the guest VMs and configure the Java VM to use the file for the shared class cache. For the IBM J9 Java VM, we need to use the command-line option -Xshareclasses to enable the class sharing feature, and we may also need to specify the sub-option persistent on some platforms to configure it with the memory-mapped file. The -Xshareclasses option is specified by default in some middleware products that use the J9 Java VM, such as WAS. The HotSpot Java VM [33, 38] enables class sharing and uses a memory-mapped file by default.
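The sketch below outlines this setup flow in Python. The -Xshareclasses option and its persistent sub-option come from the text; the name= and cacheDir= sub-options and the -Xscmx cache-size option follow the J9 class-sharing feature but their exact syntax is version-dependent, and all paths, the java executable, and the guest image mount points are hypothetical placeholders.

```python
import glob, shutil, subprocess

def populate_shared_cache(java, app_args, cache_dir, cache_name):
    """Run the middleware once so the J9 JVM fills a persistent
    (memory-mapped file) shared class cache under cache_dir."""
    cmd = [java,
           f"-Xshareclasses:name={cache_name},cacheDir={cache_dir},persistent",
           "-Xscmx120m"] + list(app_args)         # 120 MB cache, as in Table 3
    subprocess.run(cmd, check=True)
    # J9 stores the cache as a file in cache_dir; its exact file name is
    # platform- and version-dependent, so pick up whatever was created.
    return glob.glob(f"{cache_dir}/*{cache_name}*")

def distribute_cache(cache_files, guest_image_mounts):
    """Copy the pre-populated cache file(s) into every guest VM image so
    that all JVMs map identical files and TPS can share the class pages."""
    for mount in guest_image_mounts:
        for cache_file in cache_files:
            shutil.copy(cache_file, mount)
```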

In addition, the J9 Java VM can use a separate shared class cache for each Java program, specifying a different name for the shared area for each program [21]. For example, the default configuration of WAS specifies a predefined name, so that all of the WAS processes in a system will use the same shared class cache. Since the set of needed classes depends on the Java program, this functionality helps improve the efficiency of this approach because we can preload only the classes that are needed for each application.

The other benefit of using the class sharing feature is that we can automatically extract most of the read-only data in the class metadata, such as the bytecode and string literals. The writable data structures, such as the method table, are created in private memory areas.

This approach does not prevent the Java VM from unloading classes. The preloaded read-only part of an unloaded class will stay in memory as a part of the shared class cache even after it is unloaded, and so the pages will remain shared if they are TPS-shared. The page may be unmapped by the OS after all of the classes stored in the page are unloaded and the page is not accessed for a long time.

C. Configuring guest VMs to use a shared class cache

We can use a virtual disk image management system, such as OpenStack [32] or IBM Workload Deployer [24], to configure guest VMs to use our approach. A modern datacenter usually uses such a system to quickly set up guest VMs and deploy user applications. In a typical scenario with such a management system, the administrator of the datacenter creates several base images for typical usage patterns of the servers. The OS and middleware are already installed in these base images. Users of the datacenter (i.e., guest VM administrators) can create their own virtual disk images by selecting a base image suitable for their workload, starting a virtual machine with a copy of the selected base image, and installing extra applications on top of the middleware in the base image.

Java VMs running in those guest VMs can be automatically configured to use our approach if the datacenter administrator configures the OS and the middleware in the base image to do so. A pre-populated shared class cache file that will be stored into a base image can be created by running the middleware installed in the base image once. The middleware vendor could also deliver a pre-populated shared class cache with the middleware product. Although this base-image-oriented approach can prevent sharing the classes of user applications with TPS, it is sufficient to achieve most of the savings of this approach, since most of the loaded classes are for middleware.

V. EXPERIMENT WITH PRELOADED CLASSES

We evaluated the effectiveness of our approach by measuring the usage of the guest memory when a shared class cache is copied to all of the VMs. The environment and scenarios were the same as for the measurements in Section 2 and Section 3.

A. Increase of page sharing in KVM

Fig. 4 shows a breakdown of the physical memory use by the guest VMs. The savings by using TPS in the non-primary Java processes was an average of 120 MB, while it was only 20 MB in Fig. 2. The total memory used by the four guest VMs was reduced from 3,648 MB to 3,314 MB.

Fig. 5(a) shows a detailed breakdown of the Java memory use of the four VMs from the measurements shown in Fig. 4. It shows that 89.6% of the memory used for class metadata was eliminated by TPS for three of the Java processes. Because the entire memory area of the shared class cache was shared, almost all of the read-only parts of the classes were shared, since most of the loaded classes were stored in the shared class cache.

Fig. 4. Breakdown of physical memory usage and savings by using TPS when a shared class cache is copied to all of the guest VMs. Memory sharing among the Java processes was increased, resulting in larger savings with TPS. The savings by using TPS in the non-primary Java processes was an average of 120 MB, while it was only 20 MB in Fig. 2. The total physical memory used by the four guest VMs was reduced from 3,648 MB to 3,314 MB.


Fig. 5(b) and Fig. 5(c) show detailed breakdowns of the memory usage for the same scenarios as Fig. 3(b) and Fig. 3(c) in Section 3. These results show that the effectiveness of our approach does not depend on specific middleware, since most of the class areas were shared by TPS when our approach was used.

Fig. 5(b) also shows that the amount of the class areas shared by TPS in each of the Java processes was almost the same as the amount of the savings in Fig. 5(a). The reason for the similar results is that the majority of the classes accessed during the execution are those for WAS, and the total size of the application classes is much smaller. Since a Web application server is a large piece of middleware, we achieved most of the memory reduction by preloading the middleware classes. In fact, around 90% of the preloaded classes were those for WAS, including standard library components such as the OSGi framework and the Derby database. Only around 10% were Java system classes, i.e., classes in packages such as java.*, javax.*, sun.*, or org.apache.harmony.*. The classes of the Enterprise JavaBeans (EJB) applications were not preloaded because the current implementation of J9 requires the class loaders for the EJB applications to be aware of the shared class cache feature.

B. Increase of page sharing in PowerVM

Our approach is applicable to any hypervisor that supports TPS, and it does not require any specific OS or hardware features. We also measured how physical memory usage was reduced by our approach in AIX on PowerVM. PowerVM has a TPS feature and shares identical pages unless the guest VMs are configured to allocate dedicated physical memory for them [15].

We measured the effectiveness of our approach by comparing the total physical memory usage when the Java classes were and were not preloaded, because our tool cannot obtain a breakdown of the physical memory usage at the same level of detail in AIX as in Linux. The memory reduction by TPS was measured by comparing the total physical memory usage just after WAS started up, when TPS had not yet shared any memory, with the usage after PowerVM finished scanning and sharing pages. We measured the total physical memory usage of the guest VMs by using the monitoring feature of PowerVM.

Fig. 6 shows the total physical memory usage of three guest VMs when we ran WAS in each of them. Each pair of bars shows the memory usage before and after sharing the identical pages. The pair on the left shows the sizes without preloading classes, and the pair on the right shows the sizes when classes were preloaded. The environment of these measurements is described in the rightmost columns of Tables 1 through 3 in Section 2.

When classes were preloaded, the amount saved with TPS was 424.4 MB, while it was 243.4 MB without preloading. Thus, our approach increased the amount saved with TPS by 181.0 MB. Since we ran three VMs and one of them is the owner VM of the page frames for the TPS-shared pages, the saving for each of the two non-primary VMs is 90.5 MB on average. Since the size of the shared class cache was 100 MB, more than 90% of the shared class area became shareable by preloading.

C. Performance under low memory conditions

We also measured the memory savings from using TPS in a low-memory condition. We ran DayTrader and SPECjEnterprise 2010 while changing the number of guest VMs, where each guest VM ran one WAS process.

Fig. 5. Detailed breakdowns of memory usage of Java processes when a shared class cache is copied to all of the guest VMs. The amount of memory shared by TPS in the class areas was increased. (a) Breakdown of each of the WAS processes from the measurements shown in Fig. 4; 89.6% of the class metadata was shared by the three non-primary JVMs (the remaining JVM is the owner of the shared pages). (b) Breakdown when DayTrader, SPECjEnterprise 2010, and TPC-W were executed in the same version of WAS. (c) Breakdown when each of the three guest VMs executed a Tuscany server.



Fig. 7 shows the throughput of DayTrader when WAS was used in its default configuration and when our class preloading approach was used. We measured the throughput while increasing the number of guest VMs up to nine. The throughput is the average over the last 15 minutes of 90 minutes of execution. The bars show the average of three executions for each configuration, and the error bars show the minimum and the maximum of the three executions. The throughput was acceptable up to seven VMs. However, when executing eight VMs, the throughput remained high with our approach, while the standard WAS configuration degraded to 17.2 requests per second. This shows that our approach allowed an extra guest VM to run with acceptable performance.

Fig. 8 shows the score of the SPECjEnterprise 2010 benchmark while increasing the number of guest VMs from five to eight. Since we held the injection rate at 15, the scores do not show a performance peak; they simply indicate how many guest VMs can run before the performance degrades. We used a generational GC policy for the Java VM running WAS, since SPECjEnterprise measures the throughput that satisfies the required response time. We allocated 200 MB for the tenured area and 530 MB for the nursery area. For five and six guest VMs, the scores were around 24, which is the appropriate score for an injection rate of 15 on this machine. However, when seven VMs were running, the score remained at 24 with our approach, while the standard WAS configuration degraded to 15 and thus failed to satisfy the service level agreement (SLA) for response time.

These results show that memory usage reduction with our approach allowed Web applications to execute an extra guest VM with acceptable performance. The results also showed the effectiveness of our approach was not limited to a specific benchmark or a GC policy.

VI. RELATED WORK

Ballooning [43] is a technique to reduce paging in a hypervisor by dynamically reducing the amount of memory available to a guest OS. The guest OS may reduce its memory usage more efficiently than the hypervisor because it has more information about the usage of its memory pages. For example, it can reduce memory by shrinking its disk cache rather than by paging out pages. However, this approach requires a resource manager that can decide on the size of each guest VM, and the effectiveness of ballooning depends on the heuristics used by the manager to set the sizes. For a hypervisor that does not provide such a manager, such as KVM, we cannot use ballooning unless we install a separate manager.

Gupta et al. have studied many advanced techniques to improve memory utilization and implemented the Difference Engine [18] in Xen by combining three approaches: TPS, paging to RAM, and paging to disk. For the paging-to-RAM approach, they devised two techniques. One stores entire pages in RAM after compressing them. The other calculates and stores in RAM only a set of differences from a similar page in physical memory; the latter technique is also called sub-page sharing. The paging-to-RAM approach can improve the overall performance by reducing the overhead of paging in and out, because the overhead to compress pages or calculate differences is much smaller than that of accessing the disk, as long as the reduction of the paging overhead outweighs the increased paging. A weakness of this approach compared to TPS is that every access to a compressed or sub-page-shared page requires restoring the full page, while there is no overhead for reading TPS-shared pages.

Fig. 7. Throughput of DayTrader when the number of guest VMs was increased. Our approach reduced the performance degradation under high memory pressure and increased the number of guest VMs running with acceptable performance from seven to eight.

Fig. 8. Score of SPECjEnterprise 2010 when the injection rate was fixed at 15. Our approach increased the number of guest VMs running with acceptable performance from six to seven.

Fig. 6. Physical memory usage of three guest VMs on PowerVM, measured just after starting WAS and after page sharing finished, with and without preloading classes. When classes were preloaded, the amount of memory saved by TPS was 181 MB larger than without preloading (424.4 MB versus 243.4 MB), because preloading the classes increased the amount of identical memory pages. This shows that our technique is not limited to Linux/KVM, but is applicable to other hypervisors such as PowerVM.


TPS is quite good for reducing the memory usage of the read-only parts of the Java class areas, because there is no runtime overhead for reading the class metadata. Active Memory Expansion [19] in PowerVM [15] also implements memory page compression in a production hypervisor, though it does not support sub-page sharing.

Satori [28] reduces the memory usage by sharing memory pages, but it also reduces the overhead to find the identical pages by focusing on the page cache. It uses a sharing-aware block device, which shares the physical memory for the page cache. They observed that page caches in guest VMs executing the same OS are likely to have some identical pages. Their focus on the attributes of the data in potentially shareable pages is similar to our approach, but they focused on memory in the Linux kernel, while we are focusing on Java memory.

Memory Buddies [44] tries to increase page sharing by collocating guest VMs that execute similar workloads, expecting those VMs to have more similar pages than the guest VMs executing different workloads. Their approach improved the sharing for native programs, but the amount of shareable memory was small for their Java application (SPECjbb). They only noted that the sharing was limited because SPECjbb is a memory-intensive program and most of its memory is soon overwritten. This is true for the Java heap, but they did not analyze the JVM native memory area.

For a Software as a Service (SaaS) datacenter, multi-tenancy is another approach for reducing physical memory usage. This approach runs only a single instance of the middleware and uses a separate application instance for each tenant within the shared middleware. Java has the Application Isolation API (JSR 121) [37] to help implement a multi-tenant server. The advantage of this approach over the VM-based one is smaller memory consumption, because the middleware and the OS are shared among all of the applications. The disadvantage is that a misbehaving application may affect other applications running in the same middleware. For example, if one application uses up all of the memory available to the server process, other applications may crash because of memory shortages. As another example, if one application crashes by accessing an invalid memory region, the entire service process can crash.

These resource and fault isolation problems have been studied for many years. Multi-User Virtual Machine (MVM) [12, 13] is an efficient implementation of the Application Isolation API and it also has limited support for memory quota enforcement and fault isolation. It counts usage of the Java heap for each application to enforce memory quotas, and it isolates Java code from crashes in user-provided JNI methods by executing them in a separate process. J-RAF2 [7, 20] implemented enforcement of CPU quotas by using bytecode modifications.

MVM2 [14] achieved a higher level of access control and fault isolation by using a separate service process for each application to access the system resources and execute the user-provided JNI code. MVM2 utilizes the security control capabilities of the OS by running each service process as the user of the application corresponding to that service process. This also prevents the applications from crashing each other.

A problem with multi-tenant servers in a PaaS datacenter is that users of the datacenter may be restricted to certain configurations for their applications. For example, a new user of a datacenter may need to modify a port number of an application if the port number is already used by another user of the datacenter. Also, the user may need to modify the port number later if the application needs more computational resources and needs to be moved to another machine.

VII. CONCLUSION

In this paper, we broke down the memory usage of guest VMs and investigated the effectiveness of TPS for a Java-based Web application server. We found that TPS was not broadly effective for Java programs because the Java VM cannot manage the layouts of its internal data structures, and so the contents of memory pages are rarely identical between Java VM processes, even if they are running the same Java program. We considered whether or not each memory area in a Java process could be made shareable by TPS and found that the only highly consistent area is the class metadata.

We proposed increasing the page sharing of Java class areas across guest VMs by utilizing the Java VM's class sharing feature to make the layouts of the classes the same among the VMs. We obtained the same layouts for all of the guest VMs by copying the memory-mapped file that backs this feature to all of the VMs. This approach reduced the physical memory usage for the class area by up to 89.6% by using TPS, and the total memory usage of each of the four guest VMs was also reduced by 9%. We also showed that reducing the physical memory usage increased the number of guest VMs that could run with acceptable performance.

ACKNOWLEDGMENT

We would like to thank the anonymous reviewers for their valuable comments and suggestions, which were quite helpful in clarifying and improving this paper. We also thank Ed Prosser of IBM and Norman Bobroff of IBM Research for their help in measuring the physical memory use on the POWER platform.

REFERENCES

[1] David Anderson. Crash utility. http://people.redhat.com/anderson/

[2] The Apache Software Foundation. Apache DayTrader Benchmark Sample. 2009. https://cwiki.apache.org/GMOxDOC20/daytrader.html

[3] The Apache Software Foundation. Apache Tuscany. http://tuscany.apache.org/

[4] Andrea Arcangeli, Izik Eidus, and Chris Wright. Increasing memory density by using KSM. In Proceedings of the Linux Symposium 2009, pp. 19-28.

[5] Chris Bailey. Java technology, IBM style: Introduction to the IBM Developer Kit, 2006. http://www.ibm.com/developerworks/java/library/j-ibmjava1.html

[6] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP ’03). 2003.

[7] Walter Binder and Jarle Hulaas. A Portable CPU-Management Framework for Java. In IEEE Internet Computing, Vol. 8, No. 5, September/October 2004, pp. 74-83. 2004.

[8] Trey Cain, Milo Martin, Tim Heil, Eric Weglarz, and Todd Bezenek. TPC-W in Java distribution. http://pharm.ece.wisc.edu/tpcw.shtml

[9] Rodrigo Ceron, Rafael Folco, Breno Leitao, and Humberto Tsubamoto. Power Systems Memory Deduplication. REDP-4827-00.

[10] Citrix Systems, Inc. Xen. http://www.xen.org/

[11] Ben Corrie. Java technology, IBM style: Class sharing, 2006. http://www.ibm.com/developerworks/java/library/j-ibmjava4/

[12] Grzegorz Czajkowski and Laurent Daynès. Multitasking without compromise: a virtual machine evolution. In Proceedings of the 16th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (OOPSLA ’01), pp. 125-138. 2001.

[13] Grzegorz Czajkowski, Laurent Daynès and Nathaniel Nystrom. Code Sharing among Virtual Machines. In Proceedings of 16th European Conference (ECOOP ’02), pp. 155-177. 2002.

[14] Grzegorz Czajkowski, Laurent Daynès and Ben Titzer. A Multi-User Virtual Machine. In Proceedings of USENIX 2003 Annual Technical Conference (USENIX ’03), pp. 85-98. 2003.

[15] Stuart Devenish, Ingo Dimmer, Rafael Folco, Mark Roy, Stephane Saleur, Oliver Stadler, and Naoya Takizawa. IBM PowerVM Virtualization Introduction and Configuration. SG24-7940-04.

[16] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification, Third Edition. Addison Wesley, ISBN 0321246780, June 2005.

[17] Nikola Grcevski, Allan Kielstra, Kevin Stoodley, Mark Stoodley, and Vijay Sundaresan. Java Just-In-Time Compiler and Virtual Machine Improvements for Server and Middleware Applications. In Proceedings of the 3rd USENIX Virtual Machine Research and Technology Symposium (VM ’04), pp. 151-162. 2004.

[18] Diwaker Gupta, Sangmin Lee, Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, and Amin Vahdat. Difference Engine: Harnessing Memory Redundancy in Virtual Machines. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’08), pp. 309-322. 2008.

[19] David Hepkin. Active Memory Expansion: Overview and Usage Guide. February 9, 2010. POW03037-USEN-00. http://www.ibm.com/systems/power/hardware/whitepapers/am_exp.html

[20] Jarle Hulaas and Walter Binder. Program Transformations for Portable CPU Accounting and Control in Java. In Proceedings of 2004 ACM SIGPLAN Symposium on Partial Evaluation and Program Manipulation (PEPM’04), pp. 169-177. 2004.

[21] IBM Corporation. Diagnostics Guide. http://www.ibm.com/developerworks/java/jdk/diagnosis/

[22] IBM Corporation. WebSphere Application Server. http://www.ibm.com/software/webservers/appserv/was/

[23] IBM Corporation. IBM DB2 Software. http://www-01.ibm.com/software/data/db2/

[24] IBM Corporation. IBM Workload Deployer. http://www-01.ibm.com/software/webservers/workload-deployer/

[25] Kernel Samepage Merging. http://www.linux-kvm.org/page/KSM

[26] KVM. http://www.linux-kvm.org/page/Main_Page

[27] Henry Lieberman and Carl Hewitt. A real-time garbage collector based on the lifetimes of objects. In Communications of the ACM, Vol. 26(6), pp. 419–429. 1983.

[28] Grzegorz Miłoś, Derek G. Murray, Steven Hand, and Michael A. Fetterman. Satori: Enlightened page sharing. In Proceedings of the USENIX Annual Technical Conference (USENIX ’09), pp. 1-14. 2009.

[29] David A. Moon. Garbage collection in a large LISP system. In Proceedings of the 1984 ACM Symposium on LISP and Functional Programming (LFP ’84), pp. 235–246. 1984.

[30] OASIS. Service Component Architecture (SCA). http://www.oasis-opencsa.org/sca

[31] Kazunori Ogata, Dai Mikurube, Kiyokuni Kawachiya, Scott Trent, and Tamiya Onodera. A Study of Java’s Non-Java Memory. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’10), pp. 191-204. 2010.

[32] OpenStack Foundation. Open source software for building private and public clouds. http://www.openstack.org/

[33] Oracle Corporation and/or its affiliates. Java SE HotSpot at a Glance. http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html

[34] Oracle Corporation and/or its affiliates. Class Data Sharing. http://docs.oracle.com/javase/1.5.0/docs/guide/vm/class-data-sharing.html

[35] Oracle Corporation and/or its affiliates. Class Data Sharing. (Java 6) http://docs.oracle.com/javase/6/docs/technotes/guides/vm/class-data-sharing.html

[36] Oracle Corporation and/or its affiliates. Class Data Sharing. (Java 7) http://docs.oracle.com/javase/7/docs/technotes/guides/vm/class-data-sharing.html

[37] Oracle Corporation and/or its affiliates. JSR-000121 Application Isolation API Specification. http://jcp.org/aboutJava/communityprocess/final/jsr121/index.html

[38] Michael Paleczny, Christopher Vick and Cliff Click. The Java HotSpot Server Compiler. In Proceedings of the Java Virtual Machine Research and Technology Symposium (JVM ’01), pp. 1-12. 2001.

[39] Adam Pilkington and Graham Rawson. Enhance performance with class sharing. http://www.ibm.com/developerworks/library/j-sharedclasses/

[40] Jim Smith and Ravi Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann, ISBN 1558609105, June 2005.

[41] Standard Performance Evaluation Corporation. SPECjEnterprise2010. http://www.spec.org/jEnterprise2010/

[42] Transaction Processing Performance Council. TPC-W. http://www.tpc.org/tpcw/

[43] Carl A. Waldspurger. Memory Resource Management in VMware ESX Server. In Proceedings of 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’02). 2002.

[44] Timothy Wood, Gabriel Tarasuk-Levin, Prashant Shenoy, Peter Desnoyers, Emmanuel Cecchet, and Mark D. Corner. Memory Buddies: Exploiting Page Sharing for Smart Colocation in Virtualized Data Centers. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’09), pp. 31-40. 2009.
