Page 1: IBM Research -Tokyo

© 2014 IBM Corporation

String Deduplication for Java-based Middleware in Virtualized Environments

IBM Research - Tokyo

Michihiro Horie, Kazunori Ogata, Kiyokuni Kawachiya, Tamiya Onodera

IBM Research - Tokyo

Page 2: IBM Research -Tokyo

VEE'14, March 2, 2014

Duplicated strings found on Web Application Server

• Example of duplicated strings created by Apache DayTrader while handling 2,000 requests

Allocated objects | String value
14,483 | “META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration”
2,000 | “SELECT t0.HOLDINGID, t0.ACCOUNT_ACCOUNTID, t0.PURCHASEDATE, t0.PURCHASEPRICE, t0.QUANTITY, t2.SYMBOL, t2.CHANGE1, t2.COMPANYNAME, t2.HIGH, t2.LOW, t2.OPEN1, t2.PRICE, t2.VOLUME FROM holdingejb t0 INNER JOIN accountejb t1 ON t0.ACCOUNT_ACCOUNTID = t1.ACCOUNTID LEFT OUTER JOIN quoteejb t2 ON t0.QUOTE_SYMBOL = t2.SYMBOL WHERE (t1.PROFILE_USERID = ?)”
917 | “true”
2,126 | “false”
2,459 | “java.lang.String”
2,021 | “(<B><FONT color="#ff0000">-1.00%</FONT></B>)<IMG src="images/arrowdown.gif" width="10" height="10" border="0"></IMG>”

Page 3: IBM Research -Tokyo

Duplicated strings found on Web Application Server

• Example of duplicated strings created by Apache DayTrader while handling 2,000 requests

Allocated objects | Kind | String value
14,483 | File path to configuration files | “META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration”
2,000 | SQL statement | “SELECT t0.HOLDINGID, t0.ACCOUNT_ACCOUNTID, t0.PURCHASEDATE, t0.PURCHASEPRICE, t0.QUANTITY, t2.SYMBOL, t2.CHANGE1, t2.COMPANYNAME, t2.HIGH, t2.LOW, t2.OPEN1, t2.PRICE, t2.VOLUME FROM holdingejb t0 INNER JOIN accountejb t1 ON t0.ACCOUNT_ACCOUNTID = t1.ACCOUNTID LEFT OUTER JOIN quoteejb t2 ON t0.QUOTE_SYMBOL = t2.SYMBOL WHERE (t1.PROFILE_USERID = ?)”
917 | Tokens after parsing configuration files | “true”
2,126 | Tokens after parsing configuration files | “false”
2,459 | Tokens after parsing configuration files | “java.lang.String”
2,021 | Part of HTML | “(<B><FONT color="#ff0000">-1.00%</FONT></B>)<IMG src="images/arrowdown.gif" width="10" height="10" border="0"></IMG>”

Page 4: IBM Research -Tokyo

String duplications in a single JVM and across JVMs

• A server in a PaaS datacenter often runs multiple copies of the same middleware to host many Web applications

[Figure: Guest VM1, Guest VM2, and Guest VM3 on one hypervisor, each running an OS, WAS, and Java applications. Each JVM is shown with a table of string values and counts: “version=” 100, “java.lang” 10,420, “META-INF/..” 11,080; “version=” 100, “String” 10,420, “false” 11,080; “version=” 100, “1.0” 10,420, “true” 11,080. Labels: "Duplication in a single JVM" and "Duplication across JVMs".]

Page 5: IBM Research -Tokyo

Increasing the number of VMs runnable on one physical machine

• Datacenters want to increase the number of guest VMs per machine.
  – For example, we had a real customer who wanted to run more than 100 guest VMs.
• To reduce the memory footprint, we focus on reducing string data in Java workloads running on guest VMs
  – In a Java workload, String and char[] objects occupy more than 20% of the live Java heap

Page 6: IBM Research -Tokyo

Two approaches: String deduplication in a single JVM and across JVMs

• (Ⅰ) Deduplication in a single JVM
  – Selectively intern duplicated String and char[] objects in a single JVM
  – For short-lived objects
• (Ⅱ) Deduplication across JVMs
  – Preload strings using a DLL
  – Share strings across guest VMs using the hypervisor's page sharing feature, such as TPS (Transparent Page Sharing)
  – For long-lived objects

[Figure: (Ⅰ) a single JVM whose Java heap holds several str objects. (Ⅱ) Guest VM1 and Guest VM2, each with a guest OS and JVMs; each Java heap refers to str objects in a String DLL, and the String DLLs are deduplicated by TPS.]

Page 7: IBM Research -Tokyo

Deduplication in a single JVM

• Our approach: selective interning by checking calling contexts
  – Step 1: Offline profiling
    1. Record calling contexts and the strings they create
    2. Find calling contexts that each allocate a single heavily duplicated string
  – Step 2: Selective intern
    • Applications are modified to check the calling context with small overhead and to intern the allocated string if applicable
    • Without selective interning, performance degrades
      – Our experiment on a Web application showed more than 10% degradation when all strings were interned
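The offline profiling step can be pictured with a small Java sketch. It is only an illustration under assumptions: the class name ContextProfile, the method names, the string encoding of a calling context, and the minCount threshold are all hypothetical and do not come from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical offline profiler; names and thresholds are illustrative only. */
final class ContextProfile {
    // calling context (e.g. "getResources>handleDocumentRoot>findFilePath")
    //   -> string value -> number of allocations seen
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Step 1.1: record one allocation of `value` observed in `context`. */
    void record(String context, String value) {
        counts.computeIfAbsent(context, k -> new HashMap<>())
              .merge(value, 1, Integer::sum);
    }

    /** Step 1.2: contexts that created exactly one distinct value, at least minCount times. */
    List<String> internCandidates(int minCount) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> values = e.getValue();
            if (values.size() == 1 && values.values().iterator().next() >= minCount) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Requiring exactly one distinct value per context is one reading of "a single heavily duplicated string"; a looser profiler could accept contexts where one value merely dominates.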

Page 8: IBM Research -Tokyo

Find calling contexts allocating heavily duplicated strings

• Example: the most heavily duplicated string in the table on page 2

Allocated objects | String value
14,483 | “META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration”

• We found that some calling contexts always create strings with the same value
• For example, this calling context created more than 10,000 duplicates of the value above:

    handleDocumentRoot()
    findFilePath()
    getURL()
    getResource()

[Figure: pie chart of String objects allocated by findFilePath(), split into “META-INF/...” and other values; the calling context above is labeled as creating 10,000+ duplicates.]

Page 9: IBM Research -Tokyo

Low-overhead Calling Context Checking

• Only check the entry and the exit of a calling context
  – Selectively invoke toUnifiedString() at runtime
  – Use an aspect-oriented programming (AOP) language to change the code at post-compile time

First method of the profiled calling context

Original code:

    void getResources() {
        handleDocumentRoot();
    }

Modified code:

    void getResources() {
        // registers thread ID
        handleDocumentRoot();
    }

The method that calls intern

Original code:

    String findFilePath() {
        // do some work
    }

Modified code:

    String findFilePath() {
        if (/* same thread */)
            return toUnifiedString(...);
        else
            // do original work
    }
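A runnable Java sketch of this entry/exit check follows. It rests on several assumptions that the slides do not confirm: a ThreadLocal flag stands in for "registers thread ID", toUnifiedString() is taken to behave like String.intern(), the methods return String so the effect can be observed, and the class name SelectiveIntern and the path literal are hypothetical.

```java
/** Sketch of the entry/exit check; everything not named on the slide is hypothetical. */
final class SelectiveIntern {
    // True while the current thread is inside the profiled calling context.
    private static final ThreadLocal<Boolean> IN_CONTEXT =
            ThreadLocal.withInitial(() -> false);

    /** First method of the profiled calling context: mark entry and exit. */
    static String getResources() {
        IN_CONTEXT.set(true);           // entry: register this thread
        try {
            return handleDocumentRoot();
        } finally {
            IN_CONTEXT.set(false);      // exit: unregister this thread
        }
    }

    static String handleDocumentRoot() {
        return findFilePath();
    }

    /** The method that allocates the heavily duplicated string. */
    static String findFilePath() {
        // Stand-in for the original work; new String(...) forces a fresh object.
        String path = new String("META-INF/services/example");
        return IN_CONTEXT.get() ? toUnifiedString(path) : path;
    }

    /** Assumed to act like String.intern(); the slides do not show its body. */
    static String toUnifiedString(String s) {
        return s.intern();
    }
}
```

Calling getResources() twice returns the same String object, while calling findFilePath() directly, outside the registered context, still allocates a new object each time. Only one flag read is added on the allocation path, which is the sense in which the check is cheap.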

Page 10: IBM Research -Tokyo

Deduplication across JVMs

• Our approach: sharing common strings by using a DLL
  – Step 1: Offline profiling
    • Create heap dumps on multiple machines by running the same middleware and similar Web applications
    • Collect the strings common to the dumps, in order to:
      – Extract strings whose values are specified by the middleware or the Web applications
      – Exclude machine-specific strings, such as IP addresses and host names
  – Step 2: Sharing strings across JVMs by using a DLL
    • Store String and char[] objects in the DLL in the same format as those objects have in the Java heap
    • Each JVM loads the DLL at start-up time
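The "collect common strings" part of Step 1 amounts to a set intersection over the dumps. The Java sketch below assumes each heap dump has already been reduced to a set of string values; the class name CommonStrings and its method are hypothetical, and real heap-dump parsing is out of scope.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: intersect the string values found in several heap dumps. */
final class CommonStrings {
    /** Returns the values present in every dump; an empty input yields an empty set. */
    static Set<String> common(List<Set<String>> dumps) {
        if (dumps.isEmpty()) {
            return new HashSet<>();
        }
        Set<String> result = new HashSet<>(dumps.get(0));
        for (Set<String> dump : dumps.subList(1, dumps.size())) {
            result.retainAll(dump);   // keep only values also seen on this machine
        }
        return result;
    }
}
```

Machine-specific values such as IP addresses and host names differ from one machine to the next, so they drop out of the intersection without a separate filter, while values fixed by the middleware or the Web applications remain.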

Page 11: IBM Research -Tokyo

Common strings among JVMs

• In heap dumps taken after running Web application servers on two different machines, we found that 30% of the strings were common
  – They are long-lived objects

[Figure: "Size of common strings between two machines", WAS start-up. Bars for live objects, char[] objects, and sharable char[] on a 0 to 60 axis; sharable char[] is labeled 16.8 MB (30%).]

Page 12: IBM Research -Tokyo

Deduplication across JVMs in a single guest VM

• Store strings into a DLL
• Class pointers and object references are decided when the DLL is created
  – We modified the loading so that the DLL is always mapped at the same address
  – (cf.) prelink for Linux

[Figure: in Guest VM1, on one guest OS, two JVMs each have a Java heap with the String class, the char[] class, and other classes. Both map the same String DLL, which holds str objects and their characters (‘a’ ‘b’ ‘c’ ...).]

Page 13: IBM Research -Tokyo

Deduplication even across guest VMs

• We make use of KSM (Kernel Samepage Merging) on KVM (Kernel-based Virtual Machine)
• KSM merges identical mapped memory pages into a single page
  1. KSM detects that the DLLs mapped in the guest VMs are the same
  2. KSM deduplicates the mapped DLLs into one

[Figure: Guest VM1 and Guest VM2 on the KVM hypervisor, each with a guest OS and a JVM whose Java heap refers to str objects in the mapped DLL; the DLL pages are deduplicated by KSM.]

Page 14: IBM Research -Tokyo

Fast lookup mechanism

1. Separate the strings into small groups by length
2. Pick only a couple of character indices to calculate hash values
3. Choose the hash function that yields the fewest hash-value collisions

Example: for the strings below, indices 3 and 4 are better for hash code calculation

    str1: I S O L A T I O N
    str2: A S S E R T I O N
    str3: A S S O C I A T E

Hash functions in the function table:

    I_32 hash = ch1 * ch2;
    return (hash * hash);

    I_32 hash = ch1 * ch2;
    return (hash * hash);

    I_32 hash = ch1 * ch2 * ch3;
    return (hash * hash);

[Figure: lookup(“after”, 5) and lookup(“ad”, 2) go through the function table, whose entries lead to per-length string tables ("String table for length 2", "String table for length 3"). An entry can be NULL, in which case the lookup quickly returns null.]
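The three steps can be sketched in Java as follows. This is an illustration under assumptions, not the JVM-internal implementation: the class LengthIndexedTable, its nested Group, and the exhaustive search over index pairs are hypothetical; only the grouping by length and the (ch1 * ch2) squared hash shape come from the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a per-length lookup table; class and method names are hypothetical. */
final class LengthIndexedTable {
    /** Table for one string length: two sampled indices plus hash buckets. */
    private static final class Group {
        int i, j;
        final Map<Integer, List<String>> buckets = new HashMap<>();
    }

    private final Map<Integer, Group> byLength = new HashMap<>();

    LengthIndexedTable(List<String> strings) {
        // Step 1: split the strings into groups by length.
        Map<Integer, List<String>> groups = new HashMap<>();
        for (String s : strings) {
            groups.computeIfAbsent(s.length(), k -> new ArrayList<>()).add(s);
        }
        groups.forEach((len, group) -> byLength.put(len, build(len, group)));
    }

    /** Step 2: hash from two sampled characters, shaped like the slide's (ch1 * ch2)^2. */
    static int hash(String s, int i, int j) {
        int h = s.charAt(i) * s.charAt(j);
        return h * h;   // int overflow wraps, which is fine for a hash
    }

    /** Step 3: try every index pair and keep the one that fills the most buckets. */
    private static Group build(int len, List<String> group) {
        Group best = null;
        for (int i = 0; i < len; i++) {
            for (int j = i; j < len; j++) {
                Group g = new Group();
                g.i = i;
                g.j = j;
                for (String s : group) {
                    g.buckets.computeIfAbsent(hash(s, i, j), k -> new ArrayList<>()).add(s);
                }
                if (best == null || g.buckets.size() > best.buckets.size()) {
                    best = g;
                }
            }
        }
        return best;   // null only for the empty string, which this sketch does not store
    }

    /** Returns the stored string equal to s, or null; unknown lengths return null at once. */
    String lookup(String s) {
        Group g = byLength.get(s.length());
        if (g == null) {
            return null;
        }
        List<String> bucket = g.buckets.get(hash(s, g.i, g.j));
        if (bucket == null) {
            return null;
        }
        for (String t : bucket) {
            if (t.equals(s)) {
                return t;
            }
        }
        return null;
    }
}
```

A miss on length costs one map probe and no character comparison, which matches the "quickly returned null" path in the figure. Sampling two characters instead of hashing the whole string keeps the per-constructor cost roughly constant regardless of string length.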

Page 15: IBM Research -Tokyo

Experiment environment

Page 16: IBM Research -Tokyo

Reduced memory allocation by up to 30%

• On average, a 12% reduction in the total allocation size of objects

[Figure: bar chart for a single JVM; the y-axis runs from 0 to 1.2 and lower is better. Benchmarks: DayTrader, fop, luindex, sunflow, avrora, batik, jython, lusearch, pmd, tomcat, tradebeans, tradesoap, xalan, h2.]

Page 17: IBM Research -Tokyo

Reduced 70% of live char[] objects

[Figure: bar chart of live char[] objects (MB) for DayTrader on WAS8 with multiple JVMs, comparing "Original (no string DLL)" with "Our approach" on a 0 to 25 axis; the difference is labeled "14MB reduction per JVM".]

• The total amount of memory saved grows in proportion to the number of guest VMs
  – For example, we can save 1.4 GB with 100 virtual machines

Page 18: IBM Research -Tokyo

Low overhead for looking up the string DLL

• We need to look up a string in the DLL whenever a String constructor is invoked
• No performance degradation compared to the original
  – 20% speed-up compared to a naive approach

[Figure: bar chart of throughput (tps) on a 0 to 500 axis, higher is better, with multiple JVMs. Bars: "Original (no String DLL)", "Naive approach (one big hashtable)", and "Our approach", with a "20% speed-up" label.]

Page 19: IBM Research -Tokyo

Concluding remarks

• Found that many duplicated strings exist
  – In a single JVM and across JVMs
• Proposed two approaches for string deduplication
  – Selective interning
  – Preloading common string objects using a DLL
• Reduced the amount of char[] objects by:
  – 12% at runtime of Web applications
  – 70% in live char[] objects