Jvm goes big_data_sfjava

JVM goes BigData srisatish.ambati AT gmail.com DataStax/OpenJDK 4/12/2011 @srisatish

description

SF Java presentation: the JVM goes to Big Data. “Slowly yet surely the JVM is going to Big Data! In this fun-filled presentation we see what pieces of Java & JVM triumph or unravel in the battle for performance at high scale!” Concurrency is the currency of scale on multi-core, and the new generation of collections and non-blocking hashmaps is well worth a deep dive. We take a quick look at next-gen serialization techniques as well as implementation pitfalls around UUID. The Achilles' heel of the JVM remains garbage collection: a deep dive into the internals of the memory model, common GC algorithms and their tuning knobs is always a big draw. EC2 & cloud present us with virtualized & uncharted territory for scaling the JVM. We will leave some room for Q&A or fill it up with any asynchronous I/O that might queue up during the talk. A round of applause will be due to the various tools that are essential for Java performance debugging.

Transcript of Jvm goes big_data_sfjava

Page 1: Jvm goes big_data_sfjava

JVM goes BigData

srisatish.ambati AT gmail.com DataStax/OpenJDK 4/12/2011 @srisatish

Page 2: Jvm goes big_data_sfjava

Motivation

• A compendium of recent jvm scale issues while working with big data.

• This talk will not have details on big data.

• Thanks Sasa!

Page 3: Jvm goes big_data_sfjava

Trail Ahead

• synchronized
• Non-blocking Hashmap
    – A state transition view
• Collections
• Serialization
• UUID
• Garbage Collection
    – The free parameters!
    – Generations, Promotion, Fragmentation
    – Offheap
• Questions & asynchronous IO

Page 4: Jvm goes big_data_sfjava

tools of trade

• What the JVM is doing:
    – dtrace, hprof, introscope, jconsole, visualvm, yourkit, gchisto, zvision

• Invasive JVM observation tools:
    – bci, jvmti, jvmdi/pi agents, logging

• What the OS is doing:
    – dtrace, oprofile, vtune, perf

• What the network/disk is doing:
    – ganglia, iostat, lsof, nagios, netstat, tcpdump

Page 5: Jvm goes big_data_sfjava
Page 6: Jvm goes big_data_sfjava

synchronized

under the hood
    – Fast path for no-contention thin lock
    – Bias threads to lock or bulk revoke bias
    – Store free biasing

Page 7: Jvm goes big_data_sfjava

JMM: happens-before, causality
• Partial order
• volatile
• Piggybacking
• FutureTask
• BlockingQueue
• jsr133
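The "piggybacking" item can be illustrated with a small sketch: a plain field is published by writing a volatile flag after it, so the happens-before edge on the flag also covers the plain write. The class and method names here are hypothetical and are not taken from the talk.

```java
// Sketch of piggybacking on a volatile write (JSR-133 semantics).
// PiggybackSketch, publish and tryRead are illustrative names only.
final class PiggybackSketch {
    private int payload;             // plain field, deliberately not volatile
    private volatile boolean ready;  // volatile flag that publishes payload

    void publish(int value) {
        payload = value;  // (1) ordinary write
        ready = true;     // (2) volatile write; (1) happens-before it
    }

    /** Returns the payload if it has been published, otherwise -1. */
    int tryRead() {
        if (ready) {         // (3) volatile read that observes (2)
            return payload;  // (4) therefore sees the write from (1)
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        PiggybackSketch box = new PiggybackSketch();
        Thread writer = new Thread(() -> box.publish(42));
        writer.start();
        writer.join();
        System.out.println(box.tryRead());
    }
}
```

The technique is fragile: swapping lines (1) and (2) silently breaks the guarantee, which is one reason to prefer FutureTask or BlockingQueue, both of which provide the same ordering internally.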

Page 8: Jvm goes big_data_sfjava

* Java Concurrency in Practice, Brian Goetz

Page 9: Jvm goes big_data_sfjava

java.util.concurrent also holds locks!

Page 10: Jvm goes big_data_sfjava

Tomcat under concurrent load!

Page 11: Jvm goes big_data_sfjava

Non-blocking collections: Amdahl's > Moore's!

• State, Actions – key/value pairs!
    – get, put, delete, _resize
• ByteArray to hold data
• Concurrent writes: using CAS
    – No locks, no volatile
    – Much faster than locking under heavy load
    – Directly reach main data array in 1 step
• Resize as needed
    – Copy array to a larger array on demand. Post updates
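To make the CAS idea concrete, here is a deliberately small sketch of a fixed-size, open-addressed table that claims key slots with compare-and-set. It is not the high-scale-lib implementation: it has no resize or delete, and it uses AtomicReferenceArray (which has volatile semantics) for simplicity. All names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal lock-free key/value table: keys and values interleaved in one array.
// Sketch only: fixed capacity, no resize, no delete.
final class CasTableSketch {
    private final AtomicReferenceArray<Object> slots; // [key0, val0, key1, val1, ...]
    private final int capacity;                       // number of pairs, power of two

    CasTableSketch(int capacityPow2) {
        if (Integer.bitCount(capacityPow2) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.capacity = capacityPow2;
        this.slots = new AtomicReferenceArray<>(2 * capacityPow2);
    }

    /** Stores value under key and returns the previous value, or null if none. */
    Object put(Object key, Object value) {
        int idx = index(key);
        for (int probes = 0; probes < capacity; probes++) {
            Object k = slots.get(2 * idx);
            if (k == null) {
                // Try to claim the empty key slot; on a lost race re-read the winner.
                k = slots.compareAndSet(2 * idx, null, key) ? key : slots.get(2 * idx);
            }
            if (k.equals(key)) {
                return slots.getAndSet(2 * idx + 1, value); // atomic swap of the value
            }
            idx = (idx + 1) & (capacity - 1);               // linear probe
        }
        throw new IllegalStateException("table full");
    }

    /** Returns the value for key, or null if absent (or not yet written). */
    Object get(Object key) {
        int idx = index(key);
        for (int probes = 0; probes < capacity; probes++) {
            Object k = slots.get(2 * idx);
            if (k == null) return null;                     // hit an empty slot: absent
            if (k.equals(key)) return slots.get(2 * idx + 1);
            idx = (idx + 1) & (capacity - 1);
        }
        return null;
    }

    private int index(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);                                    // spread high bits
        return h & (capacity - 1);
    }

    public static void main(String[] args) {
        CasTableSketch table = new CasTableSketch(16);
        table.put("a", 1);
        table.put("a", 2);
        System.out.println(table.get("a"));
    }
}
```

Keys are never removed once claimed, which is what lets a reader stop at the first empty slot without taking a lock.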

Page 12: Jvm goes big_data_sfjava

Death & Taxes: Java Overheads!

• Cost of an 8-char String?
    String object: 8b hdr + 12b fields + 4b ptr + 4b pad
    char[] array:  8b hdr + 4b len + 16b data
    A: 56 bytes, or a 7x blowup

• Cost of 100-entry TreeMap<Double,Double>?
    TreeMap:   48b
    per entry: 40b TreeMap$Entry + 16b Double + 16b Double
    A: 7248 bytes or a ~5x blowup
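The two answers follow directly from the per-object figures on the slide. The sketch below just redoes that arithmetic; the byte counts are the slide's numbers, and the class and method names are made up for illustration.

```java
// Recomputes the slide's overhead figures. Sizes are the slide's, not measured here.
final class ObjectOverheadSketch {
    /** String object (hdr + fields + ptr + pad) plus its char[] (hdr + len + 2 bytes/char). */
    static int stringBytes(int chars) {
        int stringObject = 8 + 12 + 4 + 4;
        int charArray = 8 + 4 + 2 * chars;
        return stringObject + charArray;
    }

    /** TreeMap header plus, per entry, one TreeMap$Entry and two boxed Doubles. */
    static int treeMapBytes(int entries) {
        return 48 + entries * (40 + 16 + 16);
    }

    public static void main(String[] args) {
        System.out.println(stringBytes(8));     // 8-char String
        System.out.println(treeMapBytes(100));  // 100-entry TreeMap<Double,Double>
    }
}
```

The blowup factors compare these totals with the raw payload: 8 bytes for eight one-byte characters, and 100 × 2 × 8 = 1600 bytes for the unboxed doubles.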

Page 13: Jvm goes big_data_sfjava

yourkit: memory profile

Page 14: Jvm goes big_data_sfjava

Which collection: Mozart or Bach?

Concurrency:
    – Non-blocking HashMap
    – Google Collections

Overheads:
    – Watch out for per-element costs!
    – Primitives can be hard to manage!

Sparse collections:
    – Average collection size in enterprise is ~3

Page 15: Jvm goes big_data_sfjava

serializable

• java.io.Serializable is S.L.O.W
• True to platform
• Use “transient”
• ObjectStreamField[]
• Avro
• Google Protocol Buffers
• Externalizable + byte[]
• Roll your own
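As one illustration of the "Externalizable + byte[]" option, the sketch below writes its fields by hand instead of relying on reflective default serialization. The record type, its fields and the helper are hypothetical; only the java.io APIs are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hand-rolled wire format: one long id followed by a length-prefixed byte[].
final class BlobRecord implements Externalizable {
    long id;
    byte[] payload;

    public BlobRecord() { }  // Externalizable requires a public no-arg constructor

    BlobRecord(long id, byte[] payload) {
        this.id = id;
        this.payload = payload;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(id);
        out.writeInt(payload.length);
        out.write(payload);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        id = in.readLong();
        payload = new byte[in.readInt()];
        in.readFully(payload);
    }

    /** Serializes and deserializes a record; checked exceptions are wrapped. */
    static BlobRecord roundTrip(BlobRecord record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(record);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (BlobRecord) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        BlobRecord copy = roundTrip(new BlobRecord(7L, new byte[] {1, 2, 3}));
        System.out.println(copy.id + " " + copy.payload.length);
    }
}
```

The stream still carries a class descriptor, so this trims per-field overhead rather than removing the Java serialization framing; the schema-based formats on the following pages go further.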

Page 16: Jvm goes big_data_sfjava

ser+deser: smaller is better

https://github.com/eishay/jvm-serializers.git

Page 17: Jvm goes big_data_sfjava

avro

• Schema
    – No per datum overheads
    – Optional code gen
• Types are runtime
• Untagged data
• No manually-assigned field IDs

Cons:
• Schema mismatches
• Runtime only checks

Page 18: Jvm goes big_data_sfjava

google-proto-buffer

• Define message format in .proto file
• All data in key/value pairs
• Generate sources
• .builder for each class with getter/setter

Page 19: Jvm goes big_data_sfjava

thrift

• Type, Transport, Protocol, Version, Processors
• Separation of structure from protocol & transport
• TCompactProtocol, etc
    – tag/data, compression
• TSocket, TFileTransport, etc
• colocated clients & servers

Page 20: Jvm goes big_data_sfjava

UUID

java.util.UUID is slow
• dominated by sha_transform costs
• Leach-Salz (128-bit)

Turns out that the default PRNG (via SecureRandom) uses /dev/urandom for seed initialization
    -Djava.security.egd=file:/dev/urandom
• PRNG without file is at least 20%-40% better.

Use TimeUUIDs where possible – much faster

Alternatives: JUG – java.uuid.generator, com.eaio.uuid
    ~10x faster

http://github.com/cowtowncoder/java-uuid-generator

http://jug.safehaus.org/

http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.htm
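For readers who want to see what a time-based UUID looks like without pulling in JUG or com.eaio.uuid, here is a rough sketch of a version-1 style generator built on the public java.util.UUID(long, long) constructor. It is not either library's implementation: the node and clock sequence are random rather than derived from a MAC address, the clock has millisecond resolution, and all names are hypothetical.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Version-1 (time-based) UUID sketch; keeps SecureRandom off the hot path.
final class TimeUuidSketch {
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;
    private static final AtomicLong LAST_TIMESTAMP = new AtomicLong();
    private static final long CLOCK_SEQ_AND_NODE = makeClockSeqAndNode();

    private static long makeClockSeqAndNode() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long clockSeq = rnd.nextLong() & 0x3FFFL;                         // 14 bits
        long node = (rnd.nextLong() & 0xFFFFFFFFFFFFL) | 0x010000000000L; // random node, multicast bit set
        return 0x8000000000000000L | (clockSeq << 48) | node;             // Leach-Salz variant bits "10"
    }

    /** Strictly increasing 60-bit timestamp in 100-ns units. */
    private static long nextTimestamp() {
        while (true) {
            long now = System.currentTimeMillis() * 10_000L + UUID_EPOCH_OFFSET;
            long last = LAST_TIMESTAMP.get();
            long next = Math.max(now, last + 1);  // bump when called twice in one tick
            if (LAST_TIMESTAMP.compareAndSet(last, next)) {
                return next;
            }
        }
    }

    static UUID timeUuid() {
        long ts = nextTimestamp();
        long msb = (ts << 32)                     // time_low
                 | ((ts >>> 16) & 0xFFFF0000L)    // time_mid
                 | 0x1000L                        // version 1
                 | ((ts >>> 48) & 0x0FFFL);       // time_high
        return new UUID(msb, CLOCK_SEQ_AND_NODE);
    }

    public static void main(String[] args) {
        System.out.println(timeUuid());
        System.out.println(UUID.randomUUID());    // SecureRandom-backed, for comparison
    }
}
```

Uniqueness across JVMs here rests on the random node and clock sequence, which is weaker than a MAC-derived node; the libraries listed above handle that properly.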

Page 21: Jvm goes big_data_sfjava

Leach-Salz UUID

    /**
     * Returns a {@code String} object representing this {@code UUID}.
     *
     * <p> The UUID string representation is as described by this BNF:
     * <blockquote><pre>
     * {@code
     * UUID                   = <time_low> "-" <time_mid> "-"
     *                          <time_high_and_version> "-"
     *                          <variant_and_sequence> "-"
     *                          <node>
     * time_low               = 4*<hexOctet>
     * time_mid               = 2*<hexOctet>
     * time_high_and_version  = 2*<hexOctet>
     * variant_and_sequence   = 2*<hexOctet>
     * node                   = 6*<hexOctet>
     * hexOctet               = <hexDigit><hexDigit>
     * hexDigit               =
     *       "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
     *       | "a" | "b" | "c" | "d" | "e" | "f"
     *       | "A" | "B" | "C" | "D" | "E" | "F"
     * }</pre></blockquote>
     *
     * @return A string representation of this {@code UUID}
     */
    public String toString() {
        return (digits(mostSigBits >> 32, 8) + "-" +
                digits(mostSigBits >> 16, 4) + "-" +
                digits(mostSigBits, 4) + "-" +
                digits(leastSigBits >> 48, 4) + "-" +
                digits(leastSigBits, 12));
    }

Page 22: Jvm goes big_data_sfjava

PerfTop:    1485 irqs/sec  kernel:18.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)

    samples   pcnt  function                                                           DSO
    1882.00  26.3%  intel_idle                                                         [kernel.kallsyms]
    1678.00  23.5%  os::javaTimeMillis()                                               libjvm.so
     382.00   5.3%  SpinPause                                                          libjvm.so
     335.00   4.7%  Timer::ImplTimerCallbackProc()                                     libvcllx.so
     291.00   4.1%  gettimeofday                                                       /lib/libc-2.12.1.so
     268.00   3.7%  hpet_next_event                                                    [kernel.kallsyms]
     254.00   3.6%  ParallelTaskTerminator::offer_termination(TerminatorTerminator*)   libjvm.so

PerfTop:    1656 irqs/sec  kernel:59.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)

    samples   pcnt  function                                                           DSO
    6980.00  38.5%  sha_transform                                                      [kernel.kallsyms]
    2119.00  11.7%  intel_idle                                                         [kernel.kallsyms]
    1382.00   7.6%  mix_pool_bytes_extract                                             [kernel.kallsyms]
     437.00   2.4%  i8042_interrupt                                                    [kernel.kallsyms]
     416.00   2.3%  hpet_next_event                                                    [kernel.kallsyms]
     390.00   2.2%  extract_buf                                                        [kernel.kallsyms]
     376.00   2.1%  ThreadInVMfromNative::~ThreadInVMfromNative()                      libjvm.so
     321.00   1.8%  T.3542                                                             libjvm.so
     298.00   1.6%  __ticket_spin_lock                                                 [kernel.kallsyms]
     296.00   1.6%  Timer::ImplTimerCallbackProc()                                     libvcllx.so
     255.00   1.4%  Unsafe_GetInt                                                      libjvm.so

Page 23: Jvm goes big_data_sfjava

summary

Time-based UUIDs vs. UUIDs
• Use ~4 times less kernel time on creation!
• No SHA library calls!
• Optimized toString()
• Much faster than standard java.util.UUID
    – Better instructions per clock as well.
• If on EC2: watch out for non-cacheable file access to /dev/urandom!

Page 24: Jvm goes big_data_sfjava

String theory of Java!

• byte[] vs. char[]
    – If ver > jdk1.6u21 try -XX:+UseCompressedStrings
• Append performance (gc) differs: Strings vs. StringBuffers
• com.google.common.base.Joiner
    – Join text for cheap
    – skipNulls() or useForNull()
• com.google.common.base.Splitter
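The append point can be shown with plain JDK classes. The sketch below contrasts repeated String concatenation, which allocates a new String on each pass, with a single reusable buffer, and adds a null-skipping join in the spirit of Joiner. It uses StringBuilder, the unsynchronized sibling of StringBuffer, and the class and method names are hypothetical.

```java
// Append-heavy code: fewer intermediate objects means less work for the GC.
final class AppendSketch {
    /** Naive concatenation: each += creates a fresh String (and a hidden builder). */
    static String concatLoop(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    /** Same output with a single growable buffer. */
    static String builderLoop(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    /** Joins the non-null parts with a separator, skipping nulls. */
    static String joinSkippingNulls(String separator, String... parts) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (String part : parts) {
            if (part == null) continue;
            if (!first) sb.append(separator);
            sb.append(part);
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinSkippingNulls(",", "a", null, "b"));
    }
}
```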

Page 25: Jvm goes big_data_sfjava

“Null References: A billion dollar mistake”    - C.A.R. Hoare

“I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.” - QCon London, '09

Page 26: Jvm goes big_data_sfjava

Best Practices: Garbage Collection

Page 27: Jvm goes big_data_sfjava

verbose:gc

GC logs are cheap even in production
    -Xloggc:gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -XX:+PrintTenuringDistribution

A bit expensive/obscure ones:
    -XX:PrintFLSStatistics=2
    -XX:PrintCMSStatistics=1
    -XX:+PrintCMSInitiationStatistics
    -XX:PrintFLSCensus=1

Page 28: Jvm goes big_data_sfjava

Three free parameters

• Allocation Rate: your workload!
• Size: defines runway!
    – Live Set, memory
• Pause times:
    – Stoppages!

Page 29: Jvm goes big_data_sfjava

Four free parameters

• Allocation Rate: your application load!
• Size: defines runway!
    – Live Set, system memory
• Pause times:
    – Stoppages!
• (fourth: Overheads of GC – Space & CPU.)

Page 30: Jvm goes big_data_sfjava

Part I: Sizing

• To be -Xmx == -Xms or not?
• Young generation: use -Xmn for predictable performance

Diagram: new Object() -> jvm allocates -> eden -> survivor spaces (survivor ratio) -> TenuringThreshold -> promotion -> old gen
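As a worked example of how -Xmn and the survivor ratio carve up the young generation, the sketch below uses the standard HotSpot relationship: the young generation is eden plus two survivor spaces, and SurvivorRatio is eden divided by one survivor space. The 800 MB and ratio 8 values are hypothetical, and the real JVM additionally rounds sizes to its alignment, so treat the results as approximate.

```java
// Young generation split: young = eden + 2 * survivor, SurvivorRatio = eden / survivor.
final class YoungGenSizingSketch {
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long young = 800 * mb;  // hypothetical -Xmn800m
        int ratio = 8;          // hypothetical -XX:SurvivorRatio=8
        System.out.println("eden MB = " + edenBytes(young, ratio) / mb);
        System.out.println("survivor MB (each) = " + survivorBytes(young, ratio) / mb);
    }
}
```

Objects that survive a minor GC are copied between the two survivor spaces until they reach the tenuring threshold or the survivor space overflows, at which point they are promoted to the old generation.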

Page 31: Jvm goes big_data_sfjava

Part II: Pick a collector!

• Serial GC: Serial New + Serial Old
• Parallel GC (default): Parallel Scavenge + Serial Old
• -XX:+UseParallelOldGC: Parallel Scavenge + Parallel Old
• -XX:+UseConcMarkSweepGC: ParNew, CMS Old, Serial Old
• G1/Experimental

Page 32: Jvm goes big_data_sfjava

Reading GC logs – a topic/tool

• Full GC is STW
• Initial Mark, Rescan/WeakRef/Remark are STW
• Look for promotion failures
• Look for concurrent mode failures

Page 33: Jvm goes big_data_sfjava

...
995.330: [CMS-concurrent-mark: 0.952/1.102 secs] [Times: user=3.69 sys=0.54, real=1.10 secs]
995.330: [CMS-concurrent-preclean-start]
995.618: [CMS-concurrent-preclean: 0.279/0.287 secs] [Times: user=0.90 sys=0.20, real=0.29 secs]
995.618: [CMS-concurrent-abortable-preclean-start]
995.695: [GC 995.695: [ParNew (promotion failed)
Desired survivor size 41943040 bytes, new threshold 1 (max 1)
- age   1:   29826872 bytes,   29826872 total
: 720596K->703760K(737280K), 0.4710410 secs]996.166: [CMS996.317: [CMS-concurrent-abortable-preclean: 0.218/0.699 secs] [Times: user=1.39 sys=0.10, real=0.70 secs]
 (concurrent mode failure): 4100132K->784070K(5341184K), 4.7478300 secs] 4780154K->784070K(6078464K), [CMS Perm : 17033K->17014K(28400K)], 5.2191410 secs] [Times: user=5.70 sys=0.01, real=5.22 secs]
...
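A log like this is easy to scan mechanically for the two failure markers and for wall-clock times. The sketch below is a hypothetical helper rather than a real GC log parser: it only looks for the literal strings and the `real=... secs` field that appear in the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans CMS-style GC log text for failure markers and "real=" wall-clock times.
final class GcLogScanSketch {
    private static final Pattern REAL_TIME = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** True if the text mentions a promotion failure or a concurrent mode failure. */
    static boolean hasFailure(String log) {
        return log.contains("promotion failed") || log.contains("concurrent mode failure");
    }

    /** All "real=" values in order of appearance, in seconds. */
    static List<Double> realSeconds(String log) {
        List<Double> seconds = new ArrayList<>();
        Matcher m = REAL_TIME.matcher(log);
        while (m.find()) {
            seconds.add(Double.parseDouble(m.group(1)));
        }
        return seconds;
    }

    public static void main(String[] args) {
        String line = "(concurrent mode failure): 4100132K->784070K(5341184K), 4.7478300 secs] "
                    + "[Times: user=5.70 sys=0.01, real=5.22 secs]";
        System.out.println(hasFailure(line) + " " + realSeconds(line));
    }
}
```

Note that `real=` on the concurrent phases is elapsed time while the application keeps running; only the entries for stop-the-world work, such as the 5.22 s full collection here, are pauses.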

Page 34: Jvm goes big_data_sfjava

Tuning CMS

• Don't promote too often!
    – Frequent promotion causes fragmentation
    – TenuringThreshold (avoid never tenure)
• Size the generations
    – Min GC times are a function of Live Set
    – Old Gen should host steady state comfortably
• Avoid CMS initiating heuristic
    -XX:+UseCMSInitiatingOccupancyOnly
• Use Concurrent for System.gc()
    -XX:+ExplicitGCInvokesConcurrent

Page 35: Jvm goes big_data_sfjava

GC Threads

• Parallelize on multicores
    -XX:ParallelGCThreads=4
    (default: derived from # of cpus on system: n for n <= 8, otherwise 8 + (n-8)*5/8)
    -XX:ParallelCMSThreads=4
    (default: derived from # of ParallelGCThreads)
• Strategy A:
    Tune minor GCs & keep application data in eden

 

Page 36: Jvm goes big_data_sfjava

Did someone ask about defaults?

    if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
      assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
      // For very large machines, there are diminishing returns
      // for large numbers of worker threads.  Instead of
      // hogging the whole system, use a fraction of the workers for every
      // processor after the first 8.  For example, on a 72 cpu machine
      // and a chosen fraction of 5/8
      // use 8 + (72 - 8) * (5/8) == 48 worker threads.
      unsigned int ncpus = (unsigned int) os::active_processor_count();
      return (ncpus <= switch_pt) ?
             ncpus :
             (switch_pt + ((ncpus - switch_pt) * num) / den);
    } else {
      return ParallelGCThreads;
    }
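The same rule can be tried out in a few lines of Java. This is a direct port of the expression above, assuming the values given in the HotSpot comment (switch point 8, fraction 5/8); the class and method names are illustrative.

```java
// Port of the default ParallelGCThreads rule shown above.
// switchPt, num and den are taken from the comment in the HotSpot snippet.
final class GcThreadDefaultsSketch {
    static int defaultParallelGcThreads(int ncpus) {
        final int switchPt = 8;
        final int num = 5;
        final int den = 8;
        return (ncpus <= switchPt)
                ? ncpus
                : switchPt + ((ncpus - switchPt) * num) / den;
    }

    public static void main(String[] args) {
        for (int cpus : new int[] {4, 8, 16, 72}) {
            System.out.println(cpus + " cpus -> " + defaultParallelGcThreads(cpus) + " GC threads");
        }
    }
}
```

On virtualized hosts the processor count the JVM sees may not match the cores actually available, which is one reason to set -XX:ParallelGCThreads explicitly there.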

Page 37: Jvm goes big_data_sfjava

Fragmentation

• Performance degrades over time
• Inducing “Full GC” makes problem go away
• Free memory that cannot be used
    – Round off errors
• Reduce occurrence
    – Use a compacting collector
    – Promote less often
    – Use uniform sized objects

Page 38: Jvm goes big_data_sfjava

Not enough large contiguous space for promotion

• Small objects still can fit in the holes!
• Compaction – stop the world.
• Unsolved on Oracle/Sun HotSpot
• Azul Systems Pauseless JVM.

Page 39: Jvm goes big_data_sfjava

JRockit Mission Control

Page 40: Jvm goes big_data_sfjava

Example

Application suddenly transitions to back-to-back full gcs.

Cannot use free mem – too many holes!

Page 41: Jvm goes big_data_sfjava

Tools

• GCHisto
• jconsole
• VisualVM/VisualGC
• Logs
• Thread dumps
• yourkit memory profile, snapshots

Page 42: Jvm goes big_data_sfjava

GCSpy

Page 43: Jvm goes big_data_sfjava

Gone 0xff the heap !!

• ByteBuffer.allocateDirect(16 * 1024 * 1024)
• Also can be mapped memory of a file region
• Store long-lived objects outside jvm
• Managed by native i/o ops.
• JNA: dynamically load & call native libraries without compile time decl like JNI
• Works for limited use cases in the lab.
    Ex: Terracotta, Hbase, Cassandra
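A minimal sketch of the first bullet: allocate a direct buffer and read and write primitives at absolute offsets, so the data lives outside the Java heap and is not copied by the collector. The record layout (one long at offset 0) and the names are made up for the example.

```java
import java.nio.ByteBuffer;

// Off-heap storage via a direct ByteBuffer; the 16 MB size matches the slide.
final class OffHeapSketch {
    static ByteBuffer allocate() {
        return ByteBuffer.allocateDirect(16 * 1024 * 1024);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = allocate();
        buffer.putLong(0, 42L);  // absolute put: does not move the buffer position
        System.out.println(buffer.isDirect() + " " + buffer.getLong(0));
    }
}
```

Only the small ByteBuffer wrapper object lives on the heap; the 16 MB it points to is released when that wrapper is eventually collected, which is why direct memory can appear to leak when the heap is rarely collected.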

Page 44: Jvm goes big_data_sfjava

Gone 0xff the heap ?

Issues to consider:
• No clear api to de-allocate from this region
    – See jbellis patch to JNA-179 for FreeableBuffer
• Object cleanup relegated to finalization
    – Single finalizer thread, Bug ID: 4469299
    – Behind WeakReference processing in jdk16u21

Workaround:
• -XX:MaxDirectMemorySize=<size>
• Manually trigger System.gc() to avoid “leak”

Page 45: Jvm goes big_data_sfjava

Virtually there! 

• Ballooning driver for memory: disable it!
• Time (TSC) issue! It's relative!
• Scheduling when # of threads > # of vcpus..
    – Tickless _nohz kernel
• GC thread starvation = STW pauses
• Large ec2 instances are not all equal..
• DirectPathIO & vt-d, rvi – Watch out for Sockets!
• Tools: Performance counters still not virtualized!

Page 46: Jvm goes big_data_sfjava

summary

• JVM is still the most popular platform for deployment for the new languages!

• JVM heartburn around scale!
    – Serialization
    – UUID
    – Object overhead
    – Garbage Collection
    – Hypervisor

Page 47: Jvm goes big_data_sfjava

References

• Chris Wimmer, http://wikis.sun.com/display/HotSpotInternals/Synchronization
• Russel & Detlefs, http://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf
• Google Protocol Buffers, http://code.google.com/p/protobuf
• Thrift, http://incubator.apache.org/thrift/static/thrift-20070401.pdf
• Leach-Salz Variant of UUID, http://www.upnp.org/resources/draft-leach-uuids-guids-00.txt
• Hans Boehm, http://www.hpl.hp.com/personal/Hans_Boehm/gc/complexity.html
• Brian Goetz, JSR-133, http://www.ibm.com/developerworks/java/library/j-jtp03304/
• GCSpy, http://www.cs.kent.ac.uk/projects/gc/gcspy/
• Understanding GC logs, http://blogs.sun.com/poonam/entry/understanding_cms_gc_logs
• Cliff Click's http://sourceforge.net/projects/high-scale-lib/