TiNy Threads on BlueGene /P: Exploring Many-Core Parallelisms Beyond The Traditional OS
description
Transcript of TiNy Threads on BlueGene /P: Exploring Many-Core Parallelisms Beyond The Traditional OS
MTAAP’2010 1Atlanta, Georgia
TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OSHandong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao Department of Electrical & Computer Engineering University of Delaware2010-04-23
MTAAP’2010 2
Introduction• Modern OS based upon a sequential
execution model (the von Neumann model).
• Rapid progress of multi-core/many-core chip technology.– Parallel Computer systems now
implemented on single chips.
MTAAP’2010 3
Introduction• Conventional OS model must adapt
to the underlying changes.• Further exploit the many levels of
parallelism.– Hardware as well as Software
• We introduce a study on how to do this adaptation for the IBM BlueGene/P multi-core system.
MTAAP’2010 4
Outline• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work
MTAAP’2010 5
Contributions• Isolate traditional OS functions to a
single core of the BG/P multi-core chip.
• Ported the TiNy Thread (TNT) execution model to allow for further utilization of BG/P compute cores.
• Expanded the design framework to a multi-chip system designed for scalability to a large number of chips.
MTAAP’2010 6
Outline• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work
MTAAP’2010 7
TiNy Threads on BG/P• TiNy Threads
– Lightweight, non-preemptive, threads– API similar to POSIX Threads.– Originally presented in “TiNy Threads: A Thread Virtual
Machine for the Cyclopse-64 Cellular Architecture”– Runs on IBM Cyclops64
• Kernel Modifications– Alterations to the thread scheduler to allow for non-
preemption
MTAAP’2010 8
Outline• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work
MTAAP’2010 9
Multinode Thread Scheduler
• Thread Scheduler allows TNT to run across multiple nodes.– Requests facilitated through IBM’s
Deep Computing Messaging Framework’s RPCs.
• Multiple Scheduling Algorithms– Workload Un-Aware
• Random• Round-Robin
– Workload Aware
MTAAP’2010 10
Multinode Thread Scheduling
Node A Node B
tnt_create()…
tnt_join()
tid
…tnt_exit()
…tnt_join()
MTAAP’2010 11
Outline• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work
MTAAP’2010 12
Synchronization• Three forms
– Mutex– Thread Joining– Barrier
• Similar to thread scheduling– Lock requests, Join requests, and
Barrier notifications sent to node responsible for said synchronization
MTAAP’2010 13
Multinode Thread Scheduling
Node A Node B
tid
…tnt_exit()
tnt_join()
A
tnt_exit()
MTAAP’2010 14
Outline• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work
MTAAP’2010 15
Characteristics of TDSM• Provides One-Sided access to memory
distributed among nodes through IBM’s DCMF.• Allows for virtual address manipulation
– Maps distributed memory to a single virtual address space.
– Allows for array indexing and memory offsets.• Scalable to a variety of applications
– Size of desired global shared memory set at runtime.
• Mutability– Memory allocation algorithm and memory
distribution algorithm can be easily altered and/or replaced.
16
Example of TDSM
0global
15 30 45
tdsm_read(global[15], local, 20*sizeof(int));
global[15] to global[34]
0x00040012
Node 6Node 5 Node 7
Local Buffer0x0004004E
to0x0004009A
Node 6: 0 to 14and
Node 7: 0 to 4
…
MTAAP’2010 17
Outline• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work
MTAAP’2010 18
Summary of Results• The performance of the TNT thread
system shows comparable speedup to that of Pthreads running on the same hardware.
• The distributed shared memory operates at 95% of the experimental peak performance of the network, with distance between nodes not being a sensitive factor.
• The cost of thread creation shows a linear relationship as the number of threads increase.
• The cost of waiting at a barrier is constant and independent of the number of threads involved.
MTAAP’2010 19
Single-Node Thread System Performance
• Based upon Radix-2 Cooley-Tukey algorithm with the Kiss FFT library for the underlying DFT.
• Underlying TNT thread model performs comparably to POSIX standard when the number of threads does not exceed the number of available processor cores.
1024 2048 4096 8192 16384 327680
0.5
1
1.5
2
2.5
TNT POSIX
Size of FFT
Abs
olut
e Sp
eed-
Up
for
Tw
o T
hrea
ds
MTAAP’2010 20
Memory System Performance
• Reads and writes of varying sizes between one and two nodes.
• For inter-node communications, data can be transferred at approximately 357 MB/s.
• Kumar et al determined experimental peak performance on BG/P to be 374 MB/s in their ICS’08 paper.
0 10485760
0.0005
0.001
0.0015
0.002
0.0025
0.003
Local Read
Local Write
Remote Read
Remote Write
bytes
late
ncy
(s)
MTAAP’2010 21
Memory System Performance
• Size of Read/Write is a function of the number of nodes across which the data is distributed.
• Latency linearly increases as the amount of data increases, regardless of distance between nodes.
0 256 512 768 10240
0.001
0.002
0.003
0.004
0.005
0.006
Read
Write
Nodes
late
ncy
(s)
MTAAP’2010 22
Multinode Thread Creation Cost
• Approximately 0.2 seconds per thread
• Remained effectively constant
0 100 200 300 400 500 6000
20
40
60
80
100
120
Number of Threads
Seco
nds
MTAAP’2010 23
Synchronization Costs• Performance of
barrier is effectively a constant 0.2 seconds.
110
020
030
040
050
060
070
080
090
010
000.1999
0.19992
0.19994
0.19996
0.19998
0.2
0.20002
0.20004
Number of Threads Included in Barrier
Tim
e (s
)
MTAAP’2010 24
Outline• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work
MTAAP’2010 25
Conclusions and Future Work
• Proven feasibility of system• Benefits of Execution Model-Driven
approach• Room for Improvement
– Improvements to kernel– More rigorous benchmarks– Improved allocation and scheduling
algorithms
MTAAP’2010 26Atlanta, Georgia
Thank You
MTAAP’2010 27
Bibliography• J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, “Tiny
threads: A thread virtual machine for the cyclops64 cellular architecture,” Parallel and Distributed Processing Symposium, International, vol. 15, p. 265b, 2005.
• S. Kumar, G. Dozsa, G. Almasi et al., “The deep computing messaging framework: generalized scalable message passing on the blue gene/p supercomputer,” in ICS ’08: Proceedings of the 22nd annual international conference on Supercomputing. New York, NY, USA: ACM, 2008, pp. 94–103.
• M. Borgerding, “Kiss FFT.” December 2009, http://sourceforge.net/projects/kissfft/.