KAIST Computer Architecture Lab. The Effect of Multi-core on HPC Applications in Virtualized Systems...

KAIST Computer Architecture Lab.

The Effect of Multi-core on HPC Applications in Virtualized Systems

Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin Kwon¹, Young-ri Choi², and Jaehyuk Huh¹

¹ KAIST (Korea Advanced Institute of Science and Technology)

² KISTI (Korea Institute of Science and Technology Information)

Outline

• Virtualization for HPC

• Virtualization on Multi-core

• Virtualization for HPC on Multi-core

• Methodology

• PARSEC – shared memory model

• NPB – MPI model

• Conclusion

Benefits of Virtualization

[Figure: VMs running different OSes (e.g., Windows, Solaris) on a Virtual Machine Monitor over hardware, extending to a cloud of machines]

• Improve system utilization by consolidation

• Support for multiple types of OSes on a system

• Fault isolation

• Flexible resource management

• Cloud computing

Virtualization for HPC

• Benefits of virtualization

– Improve system utilization by consolidation

– Support for multiple types of OSes on a system

– Fault isolation

– Flexible resource management

– Cloud computing

• HPC is performance-sensitive and resource-sensitive

• Virtualization can help HPC workloads

Virtualization on Multi-core

• More VMs on a physical machine

• More complex memory hierarchy (NUCA, NUMA)

[Figure: multi-core sockets, each with a shared cache and its own memory]

Challenges

• VM management cost

– Scheduling, memory, communication, I/O multiplexing…

• Semantic gaps

– vCPU scheduling, NUMA

Virtualization for HPC on Multi-core

• Virtualization may help HPC

• Virtualization on multi-core may incur some overhead

• For servers, improving system utilization is a key factor

• For HPC, performance is a key factor

How much overhead is there?

Where does it come from?

Machines

• Single Socket System

– 12-core AMD processor

– Uniform memory access latency

– Two 6MB L3 caches, each shared by 6 cores

• Dual Socket System

– 2x 4-core Intel processors

– Non-uniform memory access latency

– Two 8MB L3 caches, each shared by 4 cores

[Figure: single socket: 12-core CPU with memory; dual socket: 2x 4-core CPUs with memory]

Workloads

• PARSEC

– Shared memory model

– Input: native

– On one machine (single and dual socket)

– Fix: one VM

– Vary: 1, 4, 8 vCPUs

• NAS Parallel Benchmark (NPB)

– MPI model

– Input: class C

– On two machines (dual socket), with a 1Gb Ethernet switch

– Fix: 16 vCPUs

– Vary: 2 ~ 16 VMs

[Figure: the two setups, each with a Virtual Machine Monitor on hardware, labeled "Semantic gaps" and "VM management cost"]

PARSEC – Single Socket

• Single socket

• No NUMA effect

• Very low virtualization overheads

[Figure: execution times normalized to native runs for the PARSEC benchmarks (blackscholes, canneal, …, x264, AVG) with 1, 4, and 8 vCPUs]
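
Because the PARSEC results are reported as execution times normalized to native runs, a normalized value converts directly into an overhead percentage. The following is a minimal sketch of that arithmetic; the function name is hypothetical and the timings are made up for illustration, not measurements from this study.

```python
def overhead_percent(virtualized_seconds: float, native_seconds: float) -> float:
    """Overhead of a virtualized run relative to the native run, in percent."""
    if native_seconds <= 0:
        raise ValueError("native_seconds must be positive")
    # Normalized execution time: 1.0 means the same speed as the native run.
    normalized = virtualized_seconds / native_seconds
    return (normalized - 1.0) * 100.0


# Made-up timings for illustration only (not results from the slides).
print(round(overhead_percent(virtualized_seconds=118.0, native_seconds=100.0), 1))
```

A normalized time of 1.0 therefore means no overhead, and values above 1.0 read as the fraction of extra time spent under virtualization.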

PARSEC – Single Socket

• Single socket + pin each vCPU to a pCPU

• Reduce semantic gaps by preventing vCPU migration

• vCPU migration has negligible effect

[Figure: execution times for the PARSEC benchmarks with pinned vCPUs; results are similar to un-pinned]

PARSEC – Dual Socket

• Dual socket, unpinned vCPUs

• NUMA effect (a semantic gap)

• Significant increase of overheads

[Figure: execution times for the PARSEC benchmarks with 1, 4, and 8 vCPUs; annotated overhead: 16~37 %]

PARSEC – Dual Socket

• Dual socket, pinned vCPUs

• May reduce NUMA effect also

• Reduced overheads with 1 and 4 vCPUs

[Figure: execution times for the PARSEC benchmarks with pinned vCPUs on the dual socket system]

XEN and NUMA machine

• Memory allocation policy

– Allocate up to a 4GB chunk on one socket

• Scheduling policy

– Pinning to the allocated socket

– Nothing more

• Pinning 1 ~ 4 vCPUs on the socket where the memory is allocated is possible

• Impossible with 8 vCPUs
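
The pinning constraint above comes down to whether a VM's vCPUs fit on the socket that holds its memory. The following is a minimal sketch of that check, assuming the 4-core sockets of the dual socket system and one vCPU per core. The function name is hypothetical and this is not Xen code.

```python
def can_pin_to_memory_socket(n_vcpus: int, cores_per_socket: int = 4) -> bool:
    """True if every vCPU can get its own core on the socket holding the VM's memory."""
    # Assumption: one vCPU per physical core, no oversubscription within the socket.
    return 1 <= n_vcpus <= cores_per_socket


# The three vCPU counts used in the PARSEC experiments.
for n_vcpus in (1, 4, 8):
    print(n_vcpus, can_pin_to_memory_socket(n_vcpus))
```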

Mitigating NUMA Effects

• Range pinning

– Pin vCPUs of a VM on a socket

– Works only if # of vCPUs ≤ # of cores on a socket

– Range-pinned (best): memory of VM in the same socket

– Range-pinned (worst): memory of VM in the other socket

• NUMA-first scheduler

– If there is an idle core in the socket where the memory is allocated, pick it

– If not, pick any core in the machine

– Not all vCPUs are active all the time (sync. or I/O)
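
The NUMA-first rule can be sketched as a two-step core selection. The code below is an illustrative model only, not the scheduler patch used in the study: the `Core` type and `numa_first_pick` function are hypothetical names, and restricting the fallback to idle cores (returning nothing when every core is busy) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Core:
    core_id: int
    socket: int
    idle: bool = True


def numa_first_pick(cores: List[Core], memory_socket: int) -> Optional[Core]:
    """Pick a core for a runnable vCPU, preferring the socket holding the VM's memory."""
    # Rule 1: an idle core on the socket where the VM's memory is allocated.
    for core in cores:
        if core.idle and core.socket == memory_socket:
            return core
    # Rule 2: otherwise any idle core in the machine (remote memory access).
    for core in cores:
        if core.idle:
            return core
    # Assumption: with no idle core at all, the vCPU simply waits.
    return None


# Dual socket machine from the Machines slide: 2 sockets x 4 cores.
cores = [Core(core_id=i, socket=i // 4) for i in range(8)]
first = numa_first_pick(cores, memory_socket=1)      # local socket has idle cores
for core in cores[4:]:
    core.idle = False                                # socket 1 becomes fully busy
fallback = numa_first_pick(cores, memory_socket=1)   # must use the other socket
print(first.socket, fallback.socket)
```

The fallback step is what lets this approach handle the 8 vCPU case, where the vCPUs cannot all be pinned to one 4-core socket.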

Range Pinning

• For 4 vCPUs case

• Range-pinned (best) ≈ Pinned

[Figure: execution times for the PARSEC benchmarks under four configurations: Unpinned, Range-pinned (worst), Range-pinned (best), Pinned]

NUMA-first Scheduler

• For 8 vCPUs case

• Significant improvement by NUMA-first scheduler

[Figure: execution times for the PARSEC benchmarks under three configurations: Unpinned, Pinned, NUMA-first]

VM Granularity for MPI model

• Fine-grained VMs

– Few processes in a VM

– Small VM: vCPUs, memory

– Fault isolation among processes in different VMs

– Many VMs on a machine

– MPI communications mostly through the VMM

• Coarse-grained VMs

– Many processes in a VM

– Large VM: vCPUs, memory

– Single failure point for processes in a VM

– Few VMs on a machine

– MPI communications mostly within a VM

NPB - VM Granularity

• The work to do is the same for all granularities

• 2 VMs: each VM has 8 vCPUs, 8 MPI processes

• 16 VMs: each VM has 1 vCPU, 1 MPI process

[Figure: NPB results for BT, CG, EP, FT, IS, LU, MG, SP, and AVG with 2, 4, 8, and 16 VMs; annotated overhead: 11~54 %]
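
The granularity sweep keeps the total at 16 vCPUs across two machines and only changes how they are grouped into VMs. The following is a minimal sketch of that layout arithmetic. The function name is hypothetical, and it assumes VMs are spread evenly over the machines and that each vCPU runs one MPI process, as in the 2-VM and 16-VM cases listed above.

```python
def vm_layout(n_vms: int, total_vcpus: int = 16, n_machines: int = 2) -> dict:
    """Per-VM sizes for a fixed vCPU budget split over n_vms VMs."""
    if total_vcpus % n_vms != 0 or n_vms % n_machines != 0:
        raise ValueError("VMs must divide evenly into vCPUs and machines")
    vcpus_per_vm = total_vcpus // n_vms
    return {
        "vms_per_machine": n_vms // n_machines,
        "vcpus_per_vm": vcpus_per_vm,
        # Assumption: one MPI process per vCPU.
        "mpi_processes_per_vm": vcpus_per_vm,
    }


# The four granularities shown in the NPB chart.
for n_vms in (2, 4, 8, 16):
    print(n_vms, vm_layout(n_vms))
```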

NPB - VM Granularity

• Fine-grained VMs: significant overheads (avg. 54%)

– MPI communications mostly through the VMM (worst in CG, which has a high communication ratio)

– Small memory per VM

– VM management costs of the VMM

• Coarse-grained VMs: much less overhead (avg. 11%)

– Still dual socket, but less overhead than the shared memory model: the bottleneck has moved to communication

– MPI communication largely within a VM

– Large memory per VM

Conclusion

• Questions on virtualization for HPC on multi-core systems

– How much overhead is there?

– Where does it come from?

• For shared memory model

– Without NUMA: little overhead

– With NUMA: large overheads from semantic gaps

• For MPI model

– Less NUMA effect; communication is important

– Fine-grained VMs have large overheads: communication mostly through the VMM, small memory per VM, VM management cost

• Future Work

– NUMA-aware VMM scheduler

– Optimize communication among VMs in a machine

Thank you!

Backup slides

PARSEC CPU Usage

• Environment: native Linux with only 8 cores turned on (8-thread mode)

• Sample CPU usage every second, then average the samples

• For all workloads, CPU usage is below 800% (fully parallel), so NUMA-first can work

[Figure: average CPU usage (0% to 800%) for blackscholes, canneal, ferret, fluidanimate, freqmine, streamcluster, swaptions, x264, and Avg.]
