The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL...
![Page 1: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/1.jpg)
The Fresh Breeze Memory Model
Status: Linear Algebra and Plans
Guang R. Gao, Jack Dennis
MIT CSAIL, University of Delaware
Funded in part by NSF HECURA Grant CCF-0937832
Joshua Slocum, Xiaoxuan Meng, Brian Lucas
![Page 2: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/2.jpg)
Problem: I/O Performance
Culprits:
• Large Units of Data Transfer
• Operating System Overhead/Noise
• Few Concurrent Transfers (OS Limits)
[Figure: memory hierarchy with labels P, $, L3, Main Memory, Disk, and Other File Memory, divided into "Managed by hardware" and "Managed by software (OS)".]
![Page 3: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/3.jpg)
Solution: Integrated Memory Hierarchy
Features:
• Many Concurrent Transactions – High Bandwidth
• Global Virtual Store
• Superior Basis for Security / Privacy
[Figure: memory hierarchy with labels P, $, L3, Main Memory, Disk, and Other File Memory, all under a single "Managed by hardware" label.]
![Page 4: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/4.jpg)
The Fresh Breeze Memory Model
Cycle-Free Heap; Arrays as Trees of Chunks
• Chunks: e.g. 128 Bytes
• Fan-out as large as 16
• Write-Once then Read Only
• Arrays: Three levels yield 16^3 = 4096 elements (longs)
[Figure: a tree with a Master Chunk at the root and Data Chunks at the leaves.]
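The array layout described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Fresh Breeze implementation: the names `FANOUT`, `build_tree`, and `lookup` are hypothetical, tuples stand in for write-once chunks, and the indexing scheme (one slot per level, most significant digit first) is an assumption.

```python
FANOUT = 16  # slots per chunk, from "fan-out as large as 16"


def build_tree(values, levels):
    """Build a tree of chunks holding `values`; capacity is FANOUT ** levels."""
    if len(values) > FANOUT ** levels:
        raise ValueError("too many elements for this many levels")
    if levels == 1:
        return tuple(values)  # a data chunk; tuples are immutable (write-once)
    span = FANOUT ** (levels - 1)  # elements covered by each child chunk
    return tuple(build_tree(values[i:i + span], levels - 1)
                 for i in range(0, len(values), span))


def lookup(root, index, levels):
    """Follow one slot per level from the master chunk down to the element."""
    node = root
    for level in range(levels - 1, -1, -1):
        slot = (index // FANOUT ** level) % FANOUT
        node = node[slot]
    return node


data = list(range(4096))          # three levels hold 16 ** 3 = 4096 elements
root = build_tree(data, levels=3)
print(lookup(root, 4095, 3))      # prints 4095
```

Because every chunk is immutable once built and children are created before their parents, the structure cannot contain cycles, which matches the "cycle-free heap" property named on the slide.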
![Page 5: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/5.jpg)
Fresh Breeze System Vision
[Figure: Many Core Processing Chips, each with processors (P) with L1 caches sharing an L2 Cache, connected through Switches to Main Memory (Associative Directories, AD, and DRAM Chips) and to Backing Store (Access Controllers, AC, and Flash Devices).]
![Page 6: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/6.jpg)
Unique Features of the Processor
• Cache Lines are 128-byte Chunks.
• Registers are tagged to indicate those holding handles of chunks.
• Hardware task scheduler: Active Task List and Pending Task Queue.
![Page 7: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/7.jpg)
Spawn and Join
[Figure: task diagram with labels Master Task, spawn (n), Spawned Worker Tasks (Worker 0 ... Worker n-1, each ending in join_update), Join, join_fetch, and Continuation Task.]
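The spawn/join pattern in the diagram can be imitated with ordinary threads. The slide does not define the exact semantics of `spawn`, `join_update`, or `join_fetch`, so the sketch below rests on assumptions: each worker deposits one result with `join_update`, and the last worker to arrive starts the continuation. The `Join` class and the `spawn` function signature are hypothetical, and `join_fetch` is not modeled.

```python
import threading


class Join:
    """Hypothetical join point: collects n results, then runs a continuation."""

    def __init__(self, n, continuation):
        self.results = [None] * n
        self.remaining = n
        self.continuation = continuation
        self.lock = threading.Lock()
        self.done = threading.Event()
        self.output = None

    def join_update(self, slot, value):
        with self.lock:
            self.results[slot] = value
            self.remaining -= 1
            last = self.remaining == 0
        if last:  # the last worker to arrive triggers the continuation task
            self.output = self.continuation(self.results)
            self.done.set()


def spawn(n, worker, continuation):
    """Master task: start n workers, return the continuation's result."""
    if n < 1:
        raise ValueError("spawn needs at least one worker")
    join = Join(n, continuation)
    threads = [threading.Thread(target=lambda i=i: join.join_update(i, worker(i)))
               for i in range(n)]
    for t in threads:
        t.start()
    join.done.wait()
    for t in threads:
        t.join()
    return join.output


total = spawn(4, worker=lambda i: i * i, continuation=sum)
print(total)  # 0 + 1 + 4 + 9 = 14
```

In the processor described earlier this bookkeeping would be done by the hardware task scheduler rather than by operating-system threads; the threads here only make the control flow visible.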
![Page 8: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/8.jpg)
The Dot Product
5 levels: Vector length = 16^5 = 1,048,576
Each Leaf Task: Dot Product of two 16-element vectors: 16 multiplies; 15 adds
[Figure: tree over vectors A and B; leaf tasks multiply (*) and sum (+), and partial sums are combined into a scalar result.]
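The tree-structured dot product can be sketched as a recursion with fan-out 16. This is an illustration under stated assumptions: vector lengths are powers of 16, plain Python lists stand in for chunk trees, and the recursion runs sequentially where a Fresh Breeze program would spawn worker tasks. The function names are hypothetical.

```python
FANOUT = 16


def leaf_dot(a, b):
    """Leaf task: for 16-element segments, 16 multiplies and 15 adds."""
    products = [x * y for x, y in zip(a, b)]
    total = products[0]
    for p in products[1:]:
        total += p
    return total


def tree_dot(a, b):
    """Split each vector into FANOUT sub-vectors per level, then sum the parts."""
    n = len(a)
    if n != len(b):
        raise ValueError("vectors must have equal length")
    if n <= FANOUT:
        return leaf_dot(a, b)
    span = n // FANOUT  # assumes n is a power of 16
    partials = [tree_dot(a[i:i + span], b[i:i + span])
                for i in range(0, n, span)]
    return sum(partials)


n = 16 ** 3  # three levels keep the demo fast; the slide uses 16 ** 5
a = list(range(n))
b = [2] * n
print(tree_dot(a, b) == 2 * sum(a))  # prints True
```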
![Page 9: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/9.jpg)
Linear Algebra: Three Algorithms
• Dot Product
• Matrix Multiply
• Fast Fourier Transform
Let’s consider the special characteristics of each.
![Page 10: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/10.jpg)
Dot Product
Leaf Task: Dot Product of 16-element segments A and B
16 Multiplies + 15 Adds = 31 Operations
• No data reuse
• No intermediate data
• No chunks written
• Large volume of input data
![Page 11: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/11.jpg)
Matrix Multiply
Leaf Task: Product of two 4-by-4 matrices
16 dot products of four-element vectors
64 Multiplies + 48 Adds = 112 Operations
• Each input chunk used many times
• Result chunks written to memory
• No chunks written
• Relatively small input data
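The operation counts for the matrix-multiply leaf task follow from treating it as 16 dot products of four-element vectors, each needing 4 multiplies and 3 adds. A small sketch that counts the scalar operations while computing the product (the function name and the list-of-rows representation are assumptions made for illustration):

```python
def matmul4_counted(a, b):
    """Multiply two 4x4 matrices (lists of rows), counting scalar operations."""
    multiplies = adds = 0
    c = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            # One four-element dot product: 4 multiplies, 3 adds.
            acc = a[i][0] * b[0][j]
            multiplies += 1
            for k in range(1, 4):
                acc += a[i][k] * b[k][j]
                multiplies += 1
                adds += 1
            c[i][j] = acc
    return c, multiplies, adds


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
m = [[4 * i + j for j in range(4)] for i in range(4)]
product, multiplies, adds = matmul4_counted(m, identity)
print(multiplies, adds, multiplies + adds)  # prints 64 48 112
```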
![Page 12: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/12.jpg)
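Fast Fourier Transform
Leaf Task: Group of Four Butterfly Computations
Inputs: Eight Data Samples and Four Twiddle Factors; Output: Eight Results
One Butterfly: 4 Multiplies + 6 Adds = 10 Operations
Four Butterflies: 16 Multiplies + 24 Adds = 40 Operations
• log2(n) stages
• Intermediate data
• Chunks written and read
[Figure: four butterfly (BFLY) units taking eight data samples and four twiddle factors and producing eight results.]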
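The per-butterfly counts (4 multiplies, 6 adds) match a radix-2 butterfly on complex values written out in real arithmetic: one complex multiply costs 4 multiplies and 2 adds, and the complex add and subtract cost 4 more adds. The slide does not say which FFT variant is used or how samples are paired, so the sketch below assumes a radix-2 butterfly with adjacent pairing purely for illustration; `butterfly` and `leaf_task` are hypothetical names.

```python
def butterfly(x, y, w):
    """Radix-2 butterfly on (re, im) pairs: returns (x + w*y, x - w*y)."""
    t_re = w[0] * y[0] - w[1] * y[1]      # 2 multiplies, 1 add
    t_im = w[0] * y[1] + w[1] * y[0]      # 2 multiplies, 1 add
    top = (x[0] + t_re, x[1] + t_im)      # 2 adds
    bottom = (x[0] - t_re, x[1] - t_im)   # 2 adds
    return top, bottom


def leaf_task(samples, twiddles):
    """Four butterflies: eight samples and four twiddle factors in, eight results out."""
    results = []
    for k in range(4):
        top, bottom = butterfly(samples[2 * k], samples[2 * k + 1], twiddles[k])
        results.extend([top, bottom])
    return results


samples = [(1, 0)] * 8
twiddles = [(1, 0)] * 4      # twiddle factor 1 + 0j for every butterfly
results = leaf_task(samples, twiddles)
print(results[0], results[1])  # prints (2, 0) (0, 0)
```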
![Page 13: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/13.jpg)
Simulated System Model
Processors – 16, 24, 32, 40
Up to 64 Independent Storage Units
[Figure: processors (P), each with an L1 cache, connected through a Switch to Memory.]
![Page 14: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/14.jpg)
Simulation Modes
Non-Blocking:
A task executing a read simply waits for the operation to complete.
Models an L2 cache with short access time.
Blocking:
A task executing a read suspends to permit other tasks to run.
Models a main memory with long access time.
![Page 15: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/15.jpg)
Estimated Performance: L2 Cache Model - NonBlocking
[Figure: bar charts of Total Performance (GFLOPS, axis 0 to 12) versus Number of Processors (16, 24, 32, 40), with bars labeled A, B, C in three panels:
Dot Product: A: 16^3, B: 16^4, C: 16^5
Matrix Multiply: A: 16 x 16, B: 32 x 32, C: 48 x 48
Fourier Transform: A: 1024, B: 2048, C: 4096]
![Page 16: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/16.jpg)
Estimated Performance: L2 Cache Model - NonBlocking
[Figure: bar charts of Total Performance (GFLOPS, axis 0 to 12) versus Number of Processors (16, 24, 32, 40) in three panels:
Dot Product: A: 16^3, B: 16^4, C: 16^5
Matrix Multiply: A: 16 x 16, B: 32 x 32, C: 48 x 48
Fourier Transform: A: 1024, B: 2048, C: 4096]
![Page 17: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/17.jpg)
Findings
• Data locality of chunks works well.
• L1 Cache is not very important; only a buffer for input and result chunks of tasks.
• Task switching for chunk reads is costly; percolation would make a big difference.
Percolation: Ensuring that input chunks of a task have been retrieved before the task is scheduled.
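The definition of percolation above can be made concrete with a toy scheduler that fetches every input chunk of a task before placing the task on the ready queue, so a running task never stalls on a read. This is only a sketch of the idea, not the Fresh Breeze scheduler; `schedule_with_percolation`, the `fetch` callback, and the chunk store are hypothetical, and each task simply sums its inputs.

```python
from collections import deque


def schedule_with_percolation(tasks, fetch):
    """tasks: list of (name, input_chunk_ids); fetch(chunk_id) returns the chunk.

    Every input chunk is retrieved before the task enters the ready queue.
    """
    cache = {}
    ready = deque()
    for name, chunk_ids in tasks:
        for cid in chunk_ids:
            if cid not in cache:
                cache[cid] = fetch(cid)  # retrieval happens before scheduling
        ready.append((name, chunk_ids))

    executed = []
    while ready:
        name, chunk_ids = ready.popleft()
        inputs = [cache[cid] for cid in chunk_ids]  # always already present
        executed.append((name, sum(inputs)))
    return executed


store = {1: 10, 2: 20, 3: 30}   # hypothetical backing store of chunk values
fetched = []


def fetch(cid):
    fetched.append(cid)          # record which chunks were actually retrieved
    return store[cid]


result = schedule_with_percolation([("t0", [1, 2]), ("t1", [2, 3])], fetch)
print(result)   # prints [('t0', 30), ('t1', 50)]
print(fetched)  # prints [1, 2, 3]
```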
![Page 18: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/18.jpg)
Further Work
• Model a Three-Level or Four-Level Storage System
• Develop Hierarchical Work Stealing to improve Load Distribution for Massively Parallel Systems.
• Compiler Development: Automatic Mapping of Objects to Trees of Chunks.
• Expand Test Programs to include Transaction Processing and Database Operations.
![Page 19: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/19.jpg)
Distributed Discrete-Event Simulation
• Accurate timing for interconnected communicating components.
• Packet Communication Architecture is a model for distributed systems especially amenable to efficient distributed simulation.
• UDel and MIT have begun a project to develop a PCA-based Simulation Sandbox for evaluating alternate PXMs for massively parallel computing.
![Page 20: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/20.jpg)
Relevance to FSIO
• The Fresh Breeze Memory Model extends to 2^64 chunks, about 10^21 chunks or 100,000 exabytes.
• The tree-structure is an excellent basis for advanced security and data object sharing.
• Can simplify check-point / restart.
• Seamless transition from “in-core” to “out-of-core” operation of arbitrary parallel programs.
• Provides ability to use any program as a component of new programs – including any parallel program.
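As a rough cross-check of the address-space size in the first bullet, the arithmetic can be worked out directly. The sketch assumes the 128-byte example chunk size from the memory-model slide and decimal exabytes (10^18 bytes); both are assumptions, and other chunk sizes or unit conventions give different totals, so the output need not match the rounded figures quoted above.

```python
CHUNK_BYTES = 128            # assumed: the "e.g. 128 Bytes" example chunk size
EXABYTE = 10 ** 18           # assumed: decimal exabyte

chunks = 2 ** 64
total_bytes = chunks * CHUNK_BYTES   # equals 2 ** 71

print(f"{chunks:.3e} chunks")
print(f"{total_bytes:.3e} bytes")
print(f"{total_bytes / EXABYTE:,.0f} exabytes")
```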
![Page 21: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.](https://reader035.fdocuments.us/reader035/viewer/2022062722/56649f2a5503460f94c44215/html5/thumbnails/21.jpg)
Conclusion
• Making the best of many-core technology requires study of new program execution models (PXMs).
• The FSIO challenge can be met by integrating the file system into the system memory hierarchy as a global virtual store.
• The Fresh Breeze PXM demonstrates some of the benefits of departing from conventional system organization.
• More exploration of new PXMs is needed!