JESSICA2: A Distributed Java Virtual Machine
with Transparent Thread Migration Support
Wenzhang Zhu, Cho-Li Wang, Francis Lau
The Systems Research Group
Department of Computer Science and Information Systems
The University of Hong Kong
JESSICA2, CSIS, HKU (HKJU, Dec. 18, 2002)
HKU JESSICA Project
JESSICA: "Java-Enabled Single-System-Image Computing Architecture". Project started in 1996; first version (JESSICA1) in 1999.
- A middleware that runs on top of the standard UNIX/Linux operating system to support parallel execution of multithreaded Java applications on a cluster of computers.
- JESSICA hides the physical boundaries between machines and makes the cluster appear to applications as a single computer -- a single system image (SSI).
- Special feature: preemptive thread migration, which allows a thread to move freely between machines.
- Part of the RGC's Area of Excellence project, 1999-2002.
JESSICA Team Members (The Systems Research Group)
- Supervisors: Dr. Francis C.M. Lau, Dr. Cho-Li Wang
- Research students:
  - Ph.D.: Wenzhang Zhu (thread migration)
  - Ph.D.: WeiJian Fang (global heap)
  - M.Phil.: Zoe Ching Han Yu (distributed garbage collection)
  - Ph.D.: Benny W. L. Cheung (software distributed shared memory)
- Graduated: Matchy Ma (JESSICA1)
Outline
- Introduction to cluster computing
- Motivations
- Related work
- JESSICA2 features
- Performance analysis
- Conclusion & future work
What's a cluster?
- "A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource." (IEEE TFCC)
- My definition: an HPC system that integrates mainstream commodity components to process large-scale problems -- low cost, self-made, yet powerful.
Cluster Computer Architecture
[Figure: a stack of nodes, each running its own OS, connected by a high-speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet). Layered above the nodes: the availability infrastructure, the single-system-image infrastructure, the programming environment (Java, C, MPI, HPF, DSM), management & monitoring & job scheduling, and the cluster applications (web, storage, computing, rendering, financing, ...).]
Single System Image (SSI)?
- JESSICA Project: Java-Enabled Single-System-Image Computing Architecture.
- A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
- Ultimate goal of SSI: make the cluster appear like a single machine to the user, to applications, and to the network.
- Single entry point, single file system, single virtual networking, single I/O and memory space, single process space, single management/programming view, ...
Top 500 computers by "classification" (June 2002) (source: http://www.top500.org/)

| Class          | Count | Share  | Rmax [GF/s] | Rpeak [GF/s] | Processors |
|----------------|-------|--------|-------------|--------------|------------|
| MPP            | 224   | 44.8 % | 104899.21   | 168829.00    | 111104     |
| Constellations | 187   | 37.4 % | 40246.50    | 59038.00     | 33828      |
| Cluster        | 80    | 16 %   | 37596.16    | 69774.00     | 50181      |
| SMP            | 9     | 1.8 %  | 39208.10    | 44875.00     | 6056       |
| Total          | 500   | 100 %  | 221949.97   | 342516.00    | 201169     |

Legend: MPP = massively parallel processor; Constellation = e.g., a cluster of HPCs; Cluster = cluster of PCs; SMP = symmetric multiprocessor.

About the TOP500 list:
1. The 500 most powerful computer systems installed in the world.
2. Compiled twice a year since June 1993.
3. Ranked by their performance on the LINPACK benchmark.
#1 Supercomputer: NEC's Earth Simulator
- Built by NEC: 640 processor nodes, each consisting of 8 vector processors; 5120 processors in total, 40 TFlop/s peak, and 10 TB memory.
- Linpack: 35.86 Tflop/s (Tera FLOPS = 10^12 floating-point operations per second = 450 x Pentium 4 PCs).
- Interconnect: single-stage crossbar (1800 miles of cable, 83,000 copper cables), 16 GB/s cross-section bandwidth.
- Area of the computer = 4 tennis courts, 3 floors.
(Source: NEC)
Other Supercomputers in TOP500
- #2/#3 Supercomputer: ASCI Q. 7.7 TF/s Linpack performance. Los Alamos National Laboratory, U.S. HP AlphaServer SC (375 x 32-way multiprocessors, 11,968 processors in total), 12 terabytes of memory and 600 terabytes of disk storage.
- #4: IBM ASCI White (U.S.). 8,192 copper microprocessors (IBM SP POWER3) and 160 trillion bytes (TB) of memory, with more than 160 TB of IBM disk storage capacity; Linpack: 7.22 Tflop/s. Located at Lawrence Livermore National Laboratory. 512-node, 16-way symmetric multiprocessor. Covers an area the size of two basketball courts, weighs 106 tons, 2,000 miles of copper wiring. Cost: US$110 million.
TOP500 Nov 2002 List
- 2 new PC clusters made the top 10:
  - #5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory.
  - #8 is an HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA.
- A total of 55 Intel-based and 8 AMD-based PC clusters are in the TOP500. The number of clusters in the TOP500 grew again, to a total of 93 systems.
Poor Man's Cluster
- HKU Ostrich Cluster: 32 x 733 MHz Pentium III PCs, 384 MB memory.
- Hierarchical Ethernet-based network: four 24-port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch.
Rich Man's Cluster: Computational Plant (C-Plant cluster)
- 1536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM).
- Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gb/s) connected to a 64-port Mesh64 switch.
- 48 cabinets, each of which contains 32 nodes (48 x 32 = 1536).
The HKU Gideon 300 Cluster (operating since mid-Oct. 2002)
- 300 PCs (2.0 GHz Pentium 4, 512 MB DDR memory, 40 GB disk, Linux OS) connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch.
- Linpack performance: 355 Gflops; #175 in the TOP500 (Nov. 2002 list).
Building Gideon 300
JESSICA2: Introduction
Research goal: high-performance Java computing using clusters.

Why Java?
- The dominant language for server-side programming: more than 2 million Java developers [CNETAsia: 06/2002].
- Platform independent: "compile once, run anywhere".
- Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization).
- Built-in multithreading support at the language level (parallel programming using MPI, PVM, RMI, RPC, HPF, or DSM is difficult).

Why a cluster?
- Large-scale server-side applications need high-performance multithreaded programming support.
- A cluster provides a scalable hardware platform for true parallel execution.
Java Virtual Machine
- Class loader: loads class files (application class files and Java API class files).
- Interpreter: executes bytecode.
- Runtime compiler: converts bytecode to native code.
[Figure: bytecode flows from the class loader into the execution engine, where it is either interpreted directly or translated by the runtime compiler into native code.]
Threads in JVM
[Figure: a JVM with a class loader feeding class files into the Java method area (code); several threads (Thread 1, 2, 3), each with its own PC and stack frames; the execution engine; and a shared heap (data) holding objects.]

A multithreaded Java program:

```java
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
```
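The example above references CubbyHole, Producer, and Consumer classes whose sources are not shown in the deck. A minimal sketch of how such classes are typically written with wait/notify follows; these bodies are my own illustration, not the project's actual sources:

```java
// Hypothetical completion of the slide's example (not the original code).
class CubbyHole {
    private int contents;
    private boolean available = false;

    public synchronized int get() throws InterruptedException {
        while (!available) wait();   // block until a value has been put
        available = false;
        notifyAll();                 // wake a waiting producer
        return contents;
    }

    public synchronized void put(int value) throws InterruptedException {
        while (available) wait();    // block until the slot is empty
        contents = value;
        available = true;
        notifyAll();                 // wake a waiting consumer
    }
}

class Producer extends Thread {
    private final CubbyHole hole;
    Producer(CubbyHole c, int id) { this.hole = c; }
    public void run() {
        try { for (int i = 0; i < 5; i++) hole.put(i); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

class Consumer extends Thread {
    private final CubbyHole hole;
    Consumer(CubbyHole c, int id) { this.hole = c; }
    public void run() {
        try { for (int i = 0; i < 5; i++) hole.get(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Each thread here is exactly the unit a distributed JVM can place on a different cluster node.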
Java Memory Model
(How memory consistency is maintained between threads)

Threads T1 and T2 each have a per-thread working memory; the master copy of every variable lives in main memory (the heap area holds the objects).
- A variable is loaded from main memory into working memory before use.
- The variable is modified in T1's working memory.
- When T1 performs an unlock, the variable is written back to main memory.
- When T2 performs a lock, the variable in T2's working memory is flushed; when T2 then uses the variable, it is re-loaded from main memory.
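The lock/unlock protocol above is exactly what a synchronized block buys the programmer. A minimal sketch (the class name is mine, not from the slides): acquiring the monitor forces a fresh load from main memory, and releasing it writes the working copy back.

```java
// Sketch of the slide's T1/T2 protocol: synchronized provides the
// lock (load/flush) and unlock (write-back) behavior described above.
class SharedCounter {
    private int value = 0;          // "master copy" in main memory

    public synchronized void increment() {
        // lock: the working copy is (re)loaded from main memory
        value++;                    // modified in the thread's working memory
        // unlock: the working copy is written back to main memory
    }

    public synchronized int get() {
        return value;               // the lock forces a fresh load before use
    }
}
```

In a distributed JVM the same rule lets the GOS delay propagating writes until a release, which is the basis of lazy release consistency.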
Distributed Java Virtual Machine (DJVM)
[Figure: Java threads created in a program are spread across multiple PCs, each running its own OS, connected by a high-speed network; a global object space spans all the nodes.]

JESSICA2: a distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a single-system-image illusion to the Java threads.
Problems in Existing DJVMs
- Mostly based on interpreters: simple but slow.
- Layered designs using a distributed shared memory system (DSM) can't be tightly coupled with the JVM:
  - JVM runtime information can't be channeled to the DSM.
  - False sharing if a page-based DSM is employed.
  - Page faults block the whole JVM.
- The programmer specifies the thread distribution: lack of transparency.
  - Need to rewrite multithreaded Java applications.
  - No dynamic thread distribution (preemptive thread migration) for load balancing.
Related Work
- Method shipping: IBM cJVM.
  - Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
  - Executed in interpreter mode.
  - Load balancing problem: affected by the object distribution.
- Page shipping: Rice U. Java/DSM, HKU JESSICA.
  - Simple. The GOS was supported by some page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia).
  - JVM runtime information can't be channeled to the DSM. Executed in interpreter mode.
- Object shipping: Hyperion, Jackal.
  - Leverage some object-based DSM.
  - Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code.
JESSICA2 Main Features
- Transparent Java thread migration:
  - Runtime capturing and restoring of the thread execution context.
  - No source code modification; no bytecode instrumenting (preprocessing); no new API introduced.
  - Enables dynamic load balancing on clusters.
- Operates in Just-In-Time (JIT) compilation mode.
- Global Object Space:
  - A shared global heap spanning all cluster nodes.
  - Adaptive object home migration protocol.
  - I/O redirection.
JESSICA2 Architecture
[Figure: Java bytecode or source code (e.g., the ProducerConsumerTest program) is submitted to a master JVM; worker JVMs on the other nodes run threads in the JIT execution engine (JITEE). Each node has its own OS and hardware plus a host manager; a load monitor on the master drives the migration of portable Java frames between nodes, and the global object space spans all JVMs over the communication network.]
Transparent Thread Migration in JIT Mode
- Simple for interpreters (e.g., JESSICA):
  - The interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag.
  - The full state of a thread is available in the interpreter's data structures.
  - No register allocation.
- JIT-mode execution makes things complex (JESSICA2):
  - Native code has no clear bytecode boundary.
  - How to deal with machine registers?
  - How to organize the stack frames (all are in native form now)?
  - How to make the extracted thread states portable and recognizable by the remote JVM?
  - How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
  - Need to modify the JIT compiler to instrument the native code.
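The interpreter-mode mechanism above, stopping at the decode loop when a migration flag is raised, can be modeled in plain Java. This is an illustration of the control flow only; in JESSICA2 the equivalent check is native code planted by the modified JIT compiler at safe points, and the class below is not part of the system:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative model of a migration check at a loop back-edge.
// State that must survive the move is captured into plain fields,
// standing in for the portable frame the real system ships.
class MigratableWorker {
    final AtomicBoolean migrationFlag = new AtomicBoolean(false);
    int partialSum = 0;     // captured operand state
    int resumeIndex = 0;    // captured PC-equivalent state

    /** Runs to completion, or stops at a loop head if migration is requested.
     *  Returns true when done, false when the state was captured for shipping. */
    boolean run(int[] data) {
        for (int i = resumeIndex; i < data.length; i++) {
            if (migrationFlag.get()) {   // check planted at the loop back-edge
                resumeIndex = i;         // capture where to resume
                return false;            // unwind so the frame can be migrated
            }
            partialSum += data[i];
        }
        return true;
    }
}
```

On the destination node, calling run() again on the restored state continues exactly where the source node stopped.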
An Overview of JESSICA2 Java Thread Migration
[Figure: on the source node, the thread scheduler, driven by the load monitor and migration manager, (1) alerts a running thread; (2) the thread's stack is analyzed and its frames captured. On the destination node, (3) the frames are parsed and execution is restored; the restored thread then (4a) accesses objects through the GOS (heap) and (4b) loads methods from NFS into the method area.]
What Are Those Functions?
- Migration point selection: delayed to the head of a loop basic block or a method.
- Register context handler: spill dirty registers at the migration point without invalidation, so that the native code can continue to use the registers; use a register-recovering stub in the restoring phase.
- Variable type deduction: spill types in the stacks using compression.
- Java frame linking: discover consecutive Java frames.
Dynamic Thread State Capturing and Restoring in JESSICA2
[Figure: the bytecode verifier feeds bytecode translation; migration points are selected on the intermediate code, and code generation (after register allocation) plants three kinds of instrumentation in the native code: (1) migration checking (cmp mflag,0; jz ...), (2) object checking (cmp obj[offset],0; jz ...), and (3) type & register spilling (mov 0x110182, slot; ...). At capture time, the native thread stack (Java frames and C frames) is scanned. At restore time, frames are rebuilt via linking & constant resolution, registers are recovered from the spilled slots (mov slot1->reg1; mov slot2->reg2; ...), and global object access goes through the GOS.]
How to Maintain Memory Consistency in a Distributed Environment?
[Figure: threads T1-T8 spread over four PCs connected by a high-speed network; each node's JVM has its own heap, so a consistency protocol is needed across the distributed heaps.]
Embedded Global Object Space (GOS)
Main features:
- Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads).
- Uses a threaded I/O interface inside the JVM for communication, to hide the latency: non-blocking GOS access.
- Object-based, to reduce false sharing.
- Home-based, compliant with the JVM memory model ("lazy release consistency").
- Master heap (home objects) and cache heap (local and cached objects): reduce object access latency.
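The master-heap/cache-heap split can be pictured as a lookup path: try the local cache first, and fetch from the object's home node only on a miss; a lock or unlock flushes the cached copies, per lazy release consistency. A toy model follows (all names are mine; the real GOS operates on raw heap objects inside the JVM, not on a Java map):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of the GOS access path: local cache heap first,
// then fetch from the object's home node on a miss.
class CacheHeap {
    private final Map<Long, Object> cached = new HashMap<>();
    private final Function<Long, Object> fetchFromHome; // stands in for the network fetch
    int misses = 0;

    CacheHeap(Function<Long, Object> fetchFromHome) {
        this.fetchFromHome = fetchFromHome;
    }

    Object access(long objectId) {
        Object o = cached.get(objectId);
        if (o == null) {                      // cache miss: go to the home node
            misses++;
            o = fetchFromHome.apply(objectId);
            cached.put(objectId, o);          // keep a cached copy locally
        }
        return o;
    }

    /** Lock/unlock invalidates cached copies (lazy release consistency). */
    void flushOnRelease() { cached.clear(); }
}
```

Repeated accesses between synchronization points hit the local cache, which is exactly the latency reduction the slide claims.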
Object Cache
[Figure: the global heap spans two JVMs; each JVM has a master heap area (home objects) and a cache heap area (cached copies), each indexed by a hash table. Java threads on either node access objects through these heaps.]
Adaptive Object Home Migration
- Definition: the "home" of an object = the JVM that holds the master copy of the object.
- Problem: cache objects need to be flushed and re-fetched from the home whenever synchronization happens.
- Adaptive object home migration: if the number of accesses from one thread dominates the total number of accesses to an object, the object's home is migrated to the node where that thread is running.
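The heuristic above can be sketched as a per-object access counter: once one node's share of the accesses passes a threshold, the home moves there. The threshold value and the minimum-sample guard below are illustrative choices of mine, not JESSICA2's actual parameters:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of adaptive object home migration: if one node dominates
// the accesses to an object, that node becomes the new home.
class ObjectHome {
    int home;                                  // node id holding the master copy
    private int total = 0;
    private final Map<Integer, Integer> perNode = new HashMap<>();
    private final double threshold;            // e.g. 0.9 -- illustrative value

    ObjectHome(int initialHome, double threshold) {
        this.home = initialHome;
        this.threshold = threshold;
    }

    /** Record one access from the given node; migrate the home if it dominates. */
    void recordAccess(int node) {
        total++;
        int n = perNode.merge(node, 1, Integer::sum);
        if (node != home && total >= 10 && n > threshold * total) {
            home = node;                       // migrate master copy to the dominator
            perNode.clear();                   // restart counting after migration
            total = 0;
        }
    }
}
```

After migration, the dominating thread's accesses become local home accesses and no longer pay the flush/re-fetch cost at synchronization points.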
I/O Redirection
- Timer:
  - Use the time on the master node as the standard time.
  - Calibrate the time on a worker node when it registers with the master node.
- File I/O:
  - Use half a word of the fd as the node number.
  - Open: for read, check locally first, then the master node; for write, go to the master node.
  - Read/write: go to the node specified by the node number in the fd.
- Network I/O:
  - Connectionless send: do it locally.
  - Others: go to the master.
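The fd scheme can be sketched with plain bit operations: one half-word of a 32-bit descriptor carries the node number and the other half the local fd. The 16/16 split below is my reading of "half word" on the slide, not a documented layout:

```java
// Sketch of the slide's fd scheme: the upper half-word of the 32-bit
// descriptor carries the node number, the lower half the local fd.
final class GlobalFd {
    static int encode(int node, int localFd) {
        return (node << 16) | (localFd & 0xFFFF);
    }
    static int nodeOf(int fd)    { return fd >>> 16; }
    static int localFdOf(int fd) { return fd & 0xFFFF; }
}
```

A read or write issued by a migrated thread extracts nodeOf(fd); if it differs from the current node, the operation is redirected to that node.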
Experimental Setting
- Modified Kaffe open-source JVM, version 1.0.6.
- Linux PC clusters:
  1. Pentium II PCs at 540 MHz (Linux 2.2.1 kernel), connected by Fast Ethernet.
  2. HKU Gideon 300 cluster (ray tracing).
Parallel Ray Tracing on JESSICA2 (running on a 64-node Gideon 300 cluster)
- Linux 2.4.18-3 kernel (Red Hat 7.3)
- 64 nodes: 108 seconds
- 1 node: 3430 seconds (~1 hour)
- Speedup = 4402/108 = 40.75
Conclusions
- Transparent Java thread migration in a JIT compiler enables high-performance execution of multithreaded Java applications on clusters while keeping the merits of Java:
  - The JVM approach => dynamic class loading.
  - Just-in-time compilation for speed.
- An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead.
Thanks
HKU SRG: http://www.srg.csis.hku.hk/
JESSICA2 webpage: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html