JESSICA2: A Distributed Java Virtual Machine
with Transparent Thread Migration Support
Wenzhang Zhu, Cho-Li Wang, Francis Lau
The Systems Research Group
Department of Computer Science and Information Systems
The University of Hong Kong
JESSICA2, CSIS, HKU (HKJU, Dec. 18, 2002)
HKU JESSICA Project
JESSICA: "Java-Enabled Single-System-Image Computing Architecture". Project started in 1996; first version (JESSICA1) in 1999.
- A middleware that runs on top of the standard UNIX/Linux operating system to support parallel execution of multithreaded Java applications on a cluster of computers.
- JESSICA hides the physical boundaries between machines and makes the cluster appear to applications as a single computer -- a single system image (SSI).
- Special feature: preemptive thread migration, which allows a thread to move freely between machines.
- Part of the RGC's Area of Excellence project, 1999-2002.
JESSICA Team Members (The Systems Research Group)
- Supervisors: Dr. Francis C.M. Lau, Dr. Cho-Li Wang
- Research students:
  - Ph.D.: Wenzhang Zhu (thread migration)
  - Ph.D.: WeiJian Fang (global heap)
  - M.Phil.: Zoe Ching Han Yu (distributed garbage collection)
  - Ph.D.: Benny W. L. Cheung (software distributed shared memory)
- Graduated: Matchy Ma (JESSICA1)
Outline
- Introduction to cluster computing
- Motivations
- Related work
- JESSICA2 features
- Performance analysis
- Conclusion & future work
What's a cluster?
- "A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource." (IEEE TFCC)
- My definition: an HPC system that integrates mainstream commodity components to process large-scale problems -- low cost, self-made, yet powerful.
Cluster Computer Architecture
[Figure: a stack of nodes, each running its own OS, connected by a high-speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet). Layered above the nodes: the availability infrastructure, the single-system-image infrastructure, the programming environment (Java, C, MPI, HPF, DSM), management & monitoring & job scheduling, and the cluster applications (web, storage, computing, rendering, financing, ...).]
Single System Image (SSI)?
- JESSICA Project: Java-Enabled Single-System-Image Computing Architecture.
- A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
- Ultimate goal of SSI: make the cluster appear like a single machine to the user, to applications, and to the network.
- Single entry point, single file system, single virtual networking, single I/O and memory space, single process space, single management/programming view, ...
Top 500 computers by "classification" (June 2002) (source: http://www.top500.org/)

| Class          | Count | Share  | Rmax [GF/s] | Rpeak [GF/s] | Processors |
|----------------|-------|--------|-------------|--------------|------------|
| MPP            | 224   | 44.8 % | 104899.21   | 168829.00    | 111104     |
| Constellations | 187   | 37.4 % | 40246.50    | 59038.00     | 33828      |
| Cluster        | 80    | 16 %   | 37596.16    | 69774.00     | 50181      |
| SMP            | 9     | 1.8 %  | 39208.10    | 44875.00     | 6056       |
| Total          | 500   | 100 %  | 221949.97   | 342516.00    | 201169     |

Legend: MPP = massively parallel processor; Constellation = e.g., a cluster of HPCs; Cluster = cluster of PCs; SMP = symmetric multiprocessor.

About the TOP500 list:
1. The 500 most powerful computer systems installed in the world.
2. Compiled twice a year since June 1993.
3. Ranked by their performance on the LINPACK benchmark.
#1 Supercomputer: NEC's Earth Simulator
- Built by NEC: 640 processor nodes, each consisting of 8 vector processors; 5120 processors in total, 40 TFlop/s peak, and 10 TB memory.
- Linpack: 35.86 Tflop/s (Tera FLOPS = 10^12 floating-point operations per second = 450 x Pentium 4 PCs).
- Interconnect: single-stage crossbar (1800 miles of cable, 83,000 copper cables), 16 GB/s cross-section bandwidth.
- Area of the computer = 4 tennis courts, 3 floors.
(Source: NEC)
Other Supercomputers in TOP500
- #2/#3 Supercomputer: ASCI Q. 7.7 TF/s Linpack performance. Los Alamos National Laboratory, U.S. HP AlphaServer SC (375 x 32-way multiprocessors, 11,968 processors in total), 12 terabytes of memory and 600 terabytes of disk storage.
- #4: IBM ASCI White (U.S.). 8,192 copper microprocessors (IBM SP POWER3) and 160 trillion bytes (TB) of memory, with more than 160 TB of IBM disk storage capacity; Linpack: 7.22 Tflop/s. Located at Lawrence Livermore National Laboratory. 512-node, 16-way symmetric multiprocessor. Covers an area the size of two basketball courts, weighs 106 tons, 2,000 miles of copper wiring. Cost: US$110 million.
TOP500 Nov 2002 List
- 2 new PC clusters made the top 10:
  - #5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory.
  - #8 is an HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA.
- A total of 55 Intel-based and 8 AMD-based PC clusters are in the TOP500. The number of clusters in the TOP500 grew again, to a total of 93 systems.
Poor Man's Cluster
- HKU Ostrich Cluster: 32 x 733 MHz Pentium III PCs, 384 MB memory.
- Hierarchical Ethernet-based network: four 24-port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch.
Rich Man's Cluster: Computational Plant (C-Plant cluster)
- 1536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM).
- Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gb/s) connected to a 64-port Mesh64 switch.
- 48 cabinets, each of which contains 32 nodes (48 x 32 = 1536).
The HKU Gideon 300 Cluster (operating since mid-Oct. 2002)
- 300 PCs (2.0 GHz Pentium 4, 512 MB DDR memory, 40 GB disk, Linux OS) connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch.
- Linpack performance: 355 Gflops; #175 in the TOP500 (Nov. 2002 list).
Building Gideon 300
JESSICA2: Introduction
Research goal: high-performance Java computing using clusters.

Why Java?
- The dominant language for server-side programming: more than 2 million Java developers [CNETAsia: 06/2002].
- Platform independent: "compile once, run anywhere".
- Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization).
- Built-in multithreading support at the language level (parallel programming using MPI, PVM, RMI, RPC, HPF, or DSM is difficult).

Why a cluster?
- Large-scale server-side applications need high-performance multithreaded programming support.
- A cluster provides a scalable hardware platform for true parallel execution.
Java Virtual Machine
- Class loader: loads class files (application class files and Java API class files).
- Interpreter: executes bytecode.
- Runtime compiler: converts bytecode to native code.
[Figure: bytecode flows from the class loader into the execution engine, where it is either interpreted directly or translated by the runtime compiler into native code.]
Threads in JVM
[Figure: a JVM with a class loader feeding class files into the Java method area (code); several threads (Thread 1, 2, 3), each with its own PC and stack frames; the execution engine; and a shared heap (data) holding objects.]

A multithreaded Java program:

```java
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
```
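The example above references CubbyHole, Producer, and Consumer classes whose sources are not shown in the deck. A minimal sketch of how such classes are typically written with wait/notify follows; these bodies are my own illustration, not the project's actual sources:

```java
// Hypothetical completion of the slide's example (not the original code).
class CubbyHole {
    private int contents;
    private boolean available = false;

    public synchronized int get() throws InterruptedException {
        while (!available) wait();   // block until a value has been put
        available = false;
        notifyAll();                 // wake a waiting producer
        return contents;
    }

    public synchronized void put(int value) throws InterruptedException {
        while (available) wait();    // block until the slot is empty
        contents = value;
        available = true;
        notifyAll();                 // wake a waiting consumer
    }
}

class Producer extends Thread {
    private final CubbyHole hole;
    Producer(CubbyHole c, int id) { this.hole = c; }
    public void run() {
        try { for (int i = 0; i < 5; i++) hole.put(i); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

class Consumer extends Thread {
    private final CubbyHole hole;
    Consumer(CubbyHole c, int id) { this.hole = c; }
    public void run() {
        try { for (int i = 0; i < 5; i++) hole.get(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Each thread here is exactly the unit a distributed JVM can place on a different cluster node.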
Java Memory Model
(How memory consistency is maintained between threads)

Threads T1 and T2 each have a per-thread working memory; the master copy of every variable lives in main memory (the heap area holds the objects).
- A variable is loaded from main memory into working memory before use.
- The variable is modified in T1's working memory.
- When T1 performs an unlock, the variable is written back to main memory.
- When T2 performs a lock, the variable in T2's working memory is flushed; when T2 then uses the variable, it is re-loaded from main memory.
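The lock/unlock protocol above is exactly what a synchronized block buys the programmer. A minimal sketch (the class name is mine, not from the slides): acquiring the monitor forces a fresh load from main memory, and releasing it writes the working copy back.

```java
// Sketch of the slide's T1/T2 protocol: synchronized provides the
// lock (load/flush) and unlock (write-back) behavior described above.
class SharedCounter {
    private int value = 0;          // "master copy" in main memory

    public synchronized void increment() {
        // lock: the working copy is (re)loaded from main memory
        value++;                    // modified in the thread's working memory
        // unlock: the working copy is written back to main memory
    }

    public synchronized int get() {
        return value;               // the lock forces a fresh load before use
    }
}
```

In a distributed JVM the same rule lets the GOS delay propagating writes until a release, which is the basis of lazy release consistency.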
Distributed Java Virtual Machine (DJVM)
[Figure: Java threads created in a program are spread across multiple PCs, each running its own OS, connected by a high-speed network; a global object space spans all the nodes.]

JESSICA2: a distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a single-system-image illusion to the Java threads.
Problems in Existing DJVMs
- Mostly based on interpreters: simple but slow.
- Layered designs using a distributed shared memory system (DSM) can't be tightly coupled with the JVM:
  - JVM runtime information can't be channeled to the DSM.
  - False sharing if a page-based DSM is employed.
  - Page faults block the whole JVM.
- The programmer specifies the thread distribution: lack of transparency.
  - Need to rewrite multithreaded Java applications.
  - No dynamic thread distribution (preemptive thread migration) for load balancing.
Related Work
- Method shipping: IBM cJVM.
  - Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
  - Executed in interpreter mode.
  - Load balancing problem: affected by the object distribution.
- Page shipping: Rice U. Java/DSM, HKU JESSICA.
  - Simple. The GOS was supported by some page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia).
  - JVM runtime information can't be channeled to the DSM. Executed in interpreter mode.
- Object shipping: Hyperion, Jackal.
  - Leverage some object-based DSM.
  - Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code.
JESSICA2 Main Features
- Transparent Java thread migration:
  - Runtime capturing and restoring of the thread execution context.
  - No source code modification; no bytecode instrumenting (preprocessing); no new API introduced.
  - Enables dynamic load balancing on clusters.
- Operates in Just-In-Time (JIT) compilation mode.
- Global Object Space:
  - A shared global heap spanning all cluster nodes.
  - Adaptive object home migration protocol.
  - I/O redirection.
JESSICA2 Architecture
[Figure: Java bytecode or source code (e.g., the ProducerConsumerTest program) is submitted to a master JVM; worker JVMs on the other nodes run threads in the JIT execution engine (JITEE). Each node has its own OS and hardware plus a host manager; a load monitor on the master drives the migration of portable Java frames between nodes, and the global object space spans all JVMs over the communication network.]
Transparent Thread Migration in JIT Mode
- Simple for interpreters (e.g., JESSICA):
  - The interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag.
  - The full state of a thread is available in the interpreter's data structures.
  - No register allocation.
- JIT-mode execution makes things complex (JESSICA2):
  - Native code has no clear bytecode boundary.
  - How to deal with machine registers?
  - How to organize the stack frames (all are in native form now)?
  - How to make the extracted thread states portable and recognizable by the remote JVM?
  - How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
  - Need to modify the JIT compiler to instrument the native code.
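The interpreter-mode mechanism above, stopping at the decode loop when a migration flag is raised, can be modeled in plain Java. This is an illustration of the control flow only; in JESSICA2 the equivalent check is native code planted by the modified JIT compiler at safe points, and the class below is not part of the system:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative model of a migration check at a loop back-edge.
// State that must survive the move is captured into plain fields,
// standing in for the portable frame the real system ships.
class MigratableWorker {
    final AtomicBoolean migrationFlag = new AtomicBoolean(false);
    int partialSum = 0;     // captured operand state
    int resumeIndex = 0;    // captured PC-equivalent state

    /** Runs to completion, or stops at a loop head if migration is requested.
     *  Returns true when done, false when the state was captured for shipping. */
    boolean run(int[] data) {
        for (int i = resumeIndex; i < data.length; i++) {
            if (migrationFlag.get()) {   // check planted at the loop back-edge
                resumeIndex = i;         // capture where to resume
                return false;            // unwind so the frame can be migrated
            }
            partialSum += data[i];
        }
        return true;
    }
}
```

On the destination node, calling run() again on the restored state continues exactly where the source node stopped.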
An Overview of JESSICA2 Java Thread Migration
[Figure: on the source node, the thread scheduler, driven by the load monitor and migration manager, (1) alerts a running thread; (2) the thread's stack is analyzed and its frames captured. On the destination node, (3) the frames are parsed and execution is restored; the restored thread then (4a) accesses objects through the GOS (heap) and (4b) loads methods from NFS into the method area.]
What Are Those Functions?
- Migration point selection: delayed to the head of a loop basic block or a method.
- Register context handler: spill dirty registers at the migration point without invalidation, so that the native code can continue to use the registers; use a register-recovering stub in the restoring phase.
- Variable type deduction: spill types in the stacks using compression.
- Java frame linking: discover consecutive Java frames.
Dynamic Thread State Capturing and Restoring in JESSICA2
[Figure: the bytecode verifier feeds bytecode translation; migration points are selected on the intermediate code, and code generation (after register allocation) plants three kinds of instrumentation in the native code: (1) migration checking (cmp mflag,0; jz ...), (2) object checking (cmp obj[offset],0; jz ...), and (3) type & register spilling (mov 0x110182, slot; ...). At capture time, the native thread stack (Java frames and C frames) is scanned. At restore time, frames are rebuilt via linking & constant resolution, registers are recovered from the spilled slots (mov slot1->reg1; mov slot2->reg2; ...), and global object access goes through the GOS.]
How to Maintain Memory Consistency in a Distributed Environment?
[Figure: threads T1-T8 spread over four PCs connected by a high-speed network; each node's JVM has its own heap, so a consistency protocol is needed across the distributed heaps.]
Embedded Global Object Space (GOS)
Main features:
- Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads).
- Uses a threaded I/O interface inside the JVM for communication, to hide the latency: non-blocking GOS access.
- Object-based, to reduce false sharing.
- Home-based, compliant with the JVM memory model ("lazy release consistency").
- Master heap (home objects) and cache heap (local and cached objects): reduce object access latency.
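The master-heap/cache-heap split can be pictured as a lookup path: try the local cache first, and fetch from the object's home node only on a miss; a lock or unlock flushes the cached copies, per lazy release consistency. A toy model follows (all names are mine; the real GOS operates on raw heap objects inside the JVM, not on a Java map):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of the GOS access path: local cache heap first,
// then fetch from the object's home node on a miss.
class CacheHeap {
    private final Map<Long, Object> cached = new HashMap<>();
    private final Function<Long, Object> fetchFromHome; // stands in for the network fetch
    int misses = 0;

    CacheHeap(Function<Long, Object> fetchFromHome) {
        this.fetchFromHome = fetchFromHome;
    }

    Object access(long objectId) {
        Object o = cached.get(objectId);
        if (o == null) {                      // cache miss: go to the home node
            misses++;
            o = fetchFromHome.apply(objectId);
            cached.put(objectId, o);          // keep a cached copy locally
        }
        return o;
    }

    /** Lock/unlock invalidates cached copies (lazy release consistency). */
    void flushOnRelease() { cached.clear(); }
}
```

Repeated accesses between synchronization points hit the local cache, which is exactly the latency reduction the slide claims.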
Object Cache
[Figure: the global heap spans two JVMs; each JVM has a master heap area (home objects) and a cache heap area (cached copies), each indexed by a hash table. Java threads on either node access objects through these heaps.]
Adaptive Object Home Migration
- Definition: the "home" of an object = the JVM that holds the master copy of the object.
- Problem: cache objects need to be flushed and re-fetched from the home whenever synchronization happens.
- Adaptive object home migration: if the number of accesses from one thread dominates the total number of accesses to an object, the object's home is migrated to the node where that thread is running.
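The heuristic above can be sketched as a per-object access counter: once one node's share of the accesses passes a threshold, the home moves there. The threshold value and the minimum-sample guard below are illustrative choices of mine, not JESSICA2's actual parameters:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of adaptive object home migration: if one node dominates
// the accesses to an object, that node becomes the new home.
class ObjectHome {
    int home;                                  // node id holding the master copy
    private int total = 0;
    private final Map<Integer, Integer> perNode = new HashMap<>();
    private final double threshold;            // e.g. 0.9 -- illustrative value

    ObjectHome(int initialHome, double threshold) {
        this.home = initialHome;
        this.threshold = threshold;
    }

    /** Record one access from the given node; migrate the home if it dominates. */
    void recordAccess(int node) {
        total++;
        int n = perNode.merge(node, 1, Integer::sum);
        if (node != home && total >= 10 && n > threshold * total) {
            home = node;                       // migrate master copy to the dominator
            perNode.clear();                   // restart counting after migration
            total = 0;
        }
    }
}
```

After migration, the dominating thread's accesses become local home accesses and no longer pay the flush/re-fetch cost at synchronization points.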
I/O Redirection
- Timer:
  - Use the time on the master node as the standard time.
  - Calibrate the time on a worker node when it registers with the master node.
- File I/O:
  - Use half a word of the fd as the node number.
  - Open: for read, check locally first, then the master node; for write, go to the master node.
  - Read/write: go to the node specified by the node number in the fd.
- Network I/O:
  - Connectionless send: do it locally.
  - Others: go to the master.
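The fd scheme can be sketched with plain bit operations: one half-word of a 32-bit descriptor carries the node number and the other half the local fd. The 16/16 split below is my reading of "half word" on the slide, not a documented layout:

```java
// Sketch of the slide's fd scheme: the upper half-word of the 32-bit
// descriptor carries the node number, the lower half the local fd.
final class GlobalFd {
    static int encode(int node, int localFd) {
        return (node << 16) | (localFd & 0xFFFF);
    }
    static int nodeOf(int fd)    { return fd >>> 16; }
    static int localFdOf(int fd) { return fd & 0xFFFF; }
}
```

A read or write issued by a migrated thread extracts nodeOf(fd); if it differs from the current node, the operation is redirected to that node.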
Experimental Setting
- Modified Kaffe open-source JVM, version 1.0.6.
- Linux PC clusters:
  1. Pentium II PCs at 540 MHz (Linux 2.2.1 kernel), connected by Fast Ethernet.
  2. HKU Gideon 300 cluster (ray tracing).
Parallel Ray Tracing on JESSICA2 (running on a 64-node Gideon 300 cluster)
- Linux 2.4.18-3 kernel (Red Hat 7.3)
- 64 nodes: 108 seconds
- 1 node: 3430 seconds (~1 hour)
- Speedup = 4402/108 = 40.75
Conclusions
- Transparent Java thread migration in a JIT compiler enables high-performance execution of multithreaded Java applications on clusters while keeping the merits of Java:
  - The JVM approach => dynamic class loading.
  - Just-in-time compilation for speed.
- An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead.
Thanks
HKU SRG: http://www.srg.csis.hku.hk/
JESSICA2 webpage: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html