Impulse Project DARPA Review – July 2000
![Page 1: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/1.jpg)
Impulse Adaptable Memory System
Impulse Project
DARPA Review – July 2000
University of Utah
and
University of Massachusetts at Amherst
![Page 2: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/2.jpg)
Technology Trends
Disturbing trends (for a memory architect):
– Memory gap widening (CPUs improving 60%/year, DRAM only 7%)
– Internal CPU parallelism is escalating
– Emerging applications with poor locality (multimedia, databases, …)
– Cache size growing much faster than TLB reach
– Ugly CPIs: Perl and Sites, OSDI 1996
Possible solutions:
– Bigger, deeper cache hierarchies
– Better latency-tolerating CPU features (non-blocking cache, OOO, …)
– Migrate computation to the DRAMs
– Let software control how data is managed (Impulse)
![Page 3: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/3.jpg)
Simple Example
Problem: Sum of diagonal elements of dense matrix
for (i = 0; i < n; i++)
    sum += A[i][i];
Problems
– Wasted bus bandwidth
– Low cache utilization
– Low cache hit ratio
[Figure: cache, memory bus, memory controller, and physical memory]
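To put a number on the wasted bus bandwidth, here is a rough sketch that counts how many cache lines the diagonal walk pulls in and what fraction of those bytes is used. The 128-byte line size, the line-aligned row-major matrix of doubles, and the function name are assumptions for illustration, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed parameter, not from the slides: bytes per cache line. */
#define LINE_BYTES 128

/*
 * Fraction of fetched bytes that are actually used when reading the n
 * diagonal elements of a row-major n x n matrix of doubles that starts
 * on a cache-line boundary. Consecutive diagonal elements lie (n + 1)
 * elements apart, so for large n every element costs a whole line.
 */
double diagonal_bus_utilization(size_t n)
{
    assert(n > 0);
    size_t stride_bytes = (n + 1) * sizeof(double);
    size_t lines_fetched = 0;
    size_t last_line = (size_t)-1;   /* sentinel: no line fetched yet */

    for (size_t i = 0; i < n; i++) {
        size_t line = (i * stride_bytes) / LINE_BYTES;
        if (line != last_line) {     /* offsets only grow, so one compare is enough */
            lines_fetched++;
            last_line = line;
        }
    }
    return (double)(n * sizeof(double)) / (double)(lines_fetched * LINE_BYTES);
}
```

With 8-byte doubles, any matrix wider than one cache line per row uses only 8 of every 128 bytes fetched under these assumptions.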
![Page 4: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/4.jpg)
The Impulse Idea
What if software could do the following?
Create diag[*] corresponding to A[*][*]
for (i = 0; i < n; i++)
    sum += diag[i];
Improvements
– No wasted bus bandwidth
– Better cache utilization
– Higher cache and TLB hit ratios
[Figure: cache, memory bus, memory controller, and physical memory]
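One way to picture diag[*] is as a base-plus-stride view of A. The sketch below emulates that view in software; the stride_remap type and its helpers are hypothetical names, and on Impulse the gather would happen in the memory controller rather than in a function call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software stand-in for a base-stride remapping. */
typedef struct {
    const double *base;   /* first source element */
    size_t stride;        /* distance between gathered elements, in elements */
    size_t count;         /* number of elements in the dense view */
} stride_remap;

/* Describe the diagonal of a row-major n x n matrix: stride is n + 1. */
stride_remap make_diag_remap(const double *a, size_t n)
{
    assert(a != NULL && n > 0);
    stride_remap r = { a, n + 1, n };
    return r;
}

/* What the dense view returns for index i. */
double remap_read(const stride_remap *r, size_t i)
{
    assert(i < r->count);
    return r->base[i * r->stride];
}

/* The loop from the slide, written against the dense view. */
double sum_diag(const double *a, size_t n)
{
    stride_remap diag = make_diag_remap(a, n);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += remap_read(&diag, i);
    return sum;
}
```

The loop body only ever touches consecutive indices of the view, which is what lets the hardware version pack the elements into dense cache lines.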
![Page 5: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/5.jpg)
How? Add Extra Level of Mapping
Shadow address: “unused” physical address
MC maps shadow address to physical address
Applications configure MC through OS
[Figure: the MMU/TLB maps virtual space to physical space, which is divided into real physical space and shadow address space; the Impulse MC maps both onto real physical memory]
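A minimal sketch of the first decision the controller has to make: is an incoming address backed directly by DRAM, or is it a shadow address that needs remapping? The layout struct, its field names, and the example sizes are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ADDR_REAL, ADDR_SHADOW, ADDR_INVALID } addr_kind;

/* Assumed layout: DRAM occupies [0, dram_bytes); a shadow window sits above it. */
typedef struct {
    uint64_t dram_bytes;    /* installed physical memory */
    uint64_t shadow_base;   /* start of the otherwise unused range */
    uint64_t shadow_bytes;  /* size of the shadow window */
} mc_layout;

/* Classify a bus address as real, shadow, or outside both ranges. */
addr_kind classify(const mc_layout *l, uint64_t paddr)
{
    assert(l->shadow_base >= l->dram_bytes);   /* ranges must not overlap */
    if (paddr < l->dram_bytes)
        return ADDR_REAL;
    if (paddr >= l->shadow_base && paddr - l->shadow_base < l->shadow_bytes)
        return ADDR_SHADOW;
    return ADDR_INVALID;
}
```

Real addresses go straight to DRAM; only shadow addresses pay for the extra level of mapping.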
![Page 6: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/6.jpg)
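Address Translations
[Figure: in a conventional system, the MMU/TLB maps virtual memory to physical memory. In an Impulse system, the MMU/TLB maps virtual memory to shadow memory, which is mapped word-grained to pseudo-virtual memory and then page-grained to physical memory; the example shown is the diagonal]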
![Page 7: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/7.jpg)
Impulse Features
Base-stride scatter/gather data
– Walk columns or diagonals efficiently
– Remap matrix tiles to contiguous memory without copying
Indirection vector accesses
– Static vectors (e.g., perform A[index[i]] efficiently)
– Dynamic cacheline assembly
Remap pages
– Create superpages from disjoint base pages
– No-copy page coloring
Aggressive controller-based prefetching
– Prefetch data from DRAMs (sequential and pointer-directed)
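As an illustration of the tile case, the sketch below computes which element of the full matrix a dense tile index refers to. The tile_remap descriptor and its fields are hypothetical; they only show the address arithmetic that a no-copy tile remapping would have to perform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one tile of a row-major matrix. */
typedef struct {
    size_t ncols;        /* columns in the full matrix */
    size_t row0, col0;   /* top-left corner of the tile */
    size_t th, tw;       /* tile height and width */
} tile_remap;

/* Map dense index k within the tile to an element offset in the full matrix. */
size_t tile_source_offset(const tile_remap *t, size_t k)
{
    assert(t->tw > 0 && k < t->th * t->tw);
    assert(t->col0 + t->tw <= t->ncols);
    size_t r = k / t->tw;   /* row inside the tile */
    size_t c = k % t->tw;   /* column inside the tile */
    return (t->row0 + r) * t->ncols + (t->col0 + c);
}
```

The application sees indices 0..th*tw-1 as contiguous, while the source elements stay where they are in the matrix.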
![Page 8: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/8.jpg)
Exploiting Impulse
Setup
1. Application asks OS to set up remapping
2. OS allocates free shadow configuration register
• sets up dense “page table” that points to target data
• downloads address of this page table to configuration register
3. OS allocates free shadow and virtual address space
• maps application virtual addresses to shadow physical addresses
• returns virtual address corresponding to remapped data to app
Use
1. TLB translation (VA to shadow)
2. Fine-grained remapping (if any)
3. Remapped addresses pass through MC-TLB
4. DRAM scheduler “collects” data
5. Application accesses (dense) remapped data
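The use path can be sketched in software as follows: a dense shadow line is expanded by a stride into pseudo-virtual word addresses, each of which is translated through the dense page table before the word is read. The structure names, the tiny 4-word page and line sizes, and the flat dram array are all assumptions chosen to keep the example readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 4   /* assumed page size in words (tiny, for readability) */
#define LINE_WORDS 4   /* assumed cache line size in words */

/* Hypothetical contents of one shadow configuration. */
typedef struct {
    size_t stride_words;        /* pseudo-virtual distance between gathered words */
    const size_t *page_table;   /* dense table: pseudo-virtual page -> physical page */
    size_t npages;
} remap_config;

/* Page-grained step: pseudo-virtual word address -> physical word address. */
static size_t mtlb_translate(const remap_config *cfg, size_t pv_word)
{
    size_t page = pv_word / PAGE_WORDS;
    assert(page < cfg->npages);
    return cfg->page_table[page] * PAGE_WORDS + pv_word % PAGE_WORDS;
}

/* Assemble the dense cache line with number `line` of the shadow region. */
void gather_line(const remap_config *cfg, const uint32_t *dram,
                 size_t line, uint32_t out[LINE_WORDS])
{
    for (size_t w = 0; w < LINE_WORDS; w++) {
        size_t dense_index = line * LINE_WORDS + w;         /* offset in shadow region */
        size_t pv_word = dense_index * cfg->stride_words;   /* fine-grained remapping */
        out[w] = dram[mtlb_translate(cfg, pv_word)];        /* collect from DRAM */
    }
}
```

The application only ever sees the assembled line, which is why its accesses look dense even though the source words are scattered across pages.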
![Page 9: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/9.jpg)
Architecture Overview
[Figure: block diagram of the memory controller with register file, shadow engines, MTLBs, DRAM bank controllers, writeback buffer, request queue, scoreboard, out buffer, prefetch unit, and shadow staging unit, connected by DATA, COH, ADDR, and I/O paths]
![Page 10: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/10.jpg)
Benchmarks
Fine-grained remapping benchmarks
– Conjugate gradient (core of DARPA vision benchmark)
– Ray tracing
Page-grained remapping benchmarks
– SPEC95 (dynamic superpage promotion)
– Compress (no-copy page coloring)
Prefetching benchmarks
– SPECint 95 suite (3-15% performance improvement)
– Synthetic tree microbenchmarks
![Page 11: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/11.jpg)
Conjugate Gradient
[Figure: sparse matrix-vector product A x P => B, with the Row, Column, and Data arrays that encode A]
Store logical sparse matrix A using Yale storage scheme
– Data stores non-zero elements (much larger than P)
– Row[i] indicates where the ith row begins in Data
– Column[i] is the column number of Data[i]
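The three arrays can be captured in a small C structure. This is a sketch of the Yale (compressed sparse row) layout as the bullets describe it; the type name, the accessor, and the convention that Row has one extra entry marking the end of the last row are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Yale storage for a sparse matrix; field names follow the slide. */
typedef struct {
    size_t nrows;
    const size_t *row;      /* nrows + 1 entries; row i spans row[i] .. row[i+1]-1 */
    const size_t *column;   /* column[k] is the column number of data[k] */
    const double *data;     /* non-zero elements, stored row by row */
} yale_matrix;

/* Return A[i][j], or 0.0 when that element is not stored. */
double yale_get(const yale_matrix *m, size_t i, size_t j)
{
    assert(i < m->nrows);
    for (size_t k = m->row[i]; k < m->row[i + 1]; k++)
        if (m->column[k] == j)
            return m->data[k];
    return 0.0;
}
```

For example, the 3 x 3 matrix with rows (10 0 0), (0 20 30), (0 0 40) would be stored as Row = {0, 1, 3, 4}, Column = {0, 1, 2, 2}, Data = {10, 20, 30, 40}.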
![Page 12: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/12.jpg)
Optimizing Conjugate Gradient
Original Code
for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * P[Col[j]];
    b[i] = sum;
Issues:
• Data and Col are large streams
• P reusable, but forced out of cache
• Poor L1 cache hit rates
• Interference in L2 cache
Optimized Code
Pi = remap_indirect(P, Col, n, …);
for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * Pi[j];
    b[i] = sum;
Issues:
• Indirect access to P[Col[j]] turned into sequential streaming access
• No reuse on P now
• Side effect: eliminate access to Col
• Significant improvement to hit rates (both L1 and TLB)
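A runnable C rendering of the two loops, for comparison. The gather_indirect function is a software stand-in that copies P[Col[j]] into Pi[j]; it is not the remap_indirect interface, whose arguments the slide elides, and on Impulse the gather would be performed by the memory controller without a copy. All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: indirect access to p through col. */
void spmv(size_t n, const size_t *row, const size_t *col,
          const double *data, const double *p, double *b)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = row[i]; j < row[i + 1]; j++)
            sum += data[j] * p[col[j]];
        b[i] = sum;
    }
}

/* Copy-based emulation of the remapped vector: pi[j] = p[col[j]]. */
void gather_indirect(const double *p, const size_t *col, size_t nnz, double *pi)
{
    assert(p != NULL && col != NULL && pi != NULL);
    for (size_t j = 0; j < nnz; j++)
        pi[j] = p[col[j]];
}

/* Optimized loop: data and pi are both read sequentially; col is not touched. */
void spmv_remapped(size_t n, const size_t *row,
                   const double *data, const double *pi, double *b)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = row[i]; j < row[i + 1]; j++)
            sum += data[j] * pi[j];
        b[i] = sum;
    }
}
```

Both versions compute the same b; the difference is purely in the access pattern the cache and TLB see.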
![Page 13: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/13.jpg)
Conjugate Gradient Results
| | Base | Impulse |
|---|---|---|
| Time (cycles) | 5.48B | 1.77B |
| L1 hit ratio | 63.4% | 77.8% |
| L2 hit ratio | 19.7% | 15.9% |
| TLB cycles | 10.1M | 0.5M |
| Speedup | --- | 3.1X |

Significant improvement in effective cache locality
![Page 14: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/14.jpg)
Volume Rendering: Ray Tracing
Problem: Ray traversals are “random” memory accesses
Solution: Calculate addresses of rays as “indirection vector”
Access rays via Impulse-remapped data structure
![Page 15: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/15.jpg)
Volume Rendering Results
| | Orig (A) | Impulse (A) | Orig (B) | Impulse (B) |
|---|---|---|---|---|
| Time | 264M | 185M | 1440M | 285M |
| L1 hit ratio | 96.8% | 96.6% | 86.3% | 91.7% |
| L2 hit ratio | 0.8% | 0.9% | 0.4% | 6.2% |
| TLB cycles | 0.30M | 0.31M | 259M | 0.13M |
| Speedup | -- | 1.4X | -- | 6.1X |

A: rays follow natural memory layout (X axis)
B: rays perpendicular to natural memory layout (Z axis)
![Page 16: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/16.jpg)
Coarse Grained Remappings
Page-grained remapping
Aggressive use of synthetic superpages
– modified kernel TLB miss handler to detect pages responsible for frequent TLB misses
– create superpage by page-grained remapping on memory controller
– no copying, therefore can be far more aggressive
No-copy page coloring
– Problem: conflicts in the physically-indexed L2 cache
– Normal solution: copy to non-conflicting pages
– Impulse solution: remap to non-conflicting pages
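For readers unfamiliar with page coloring, the sketch below shows the usual way a page's color is derived in a physically-indexed cache: two pages conflict when they share a color. The cache_geom type and the example geometry are assumptions; the slides do not give the L2 parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache geometry; not taken from the slides. */
typedef struct {
    uint64_t cache_bytes;   /* total cache capacity */
    uint64_t ways;          /* associativity (1 = direct-mapped) */
    uint64_t page_bytes;    /* page size */
} cache_geom;

/* Number of distinct page colors: pages that fit in one cache way. */
uint64_t num_colors(const cache_geom *g)
{
    assert(g->ways > 0 && g->page_bytes > 0);
    uint64_t way_bytes = g->cache_bytes / g->ways;
    assert(way_bytes >= g->page_bytes);
    return way_bytes / g->page_bytes;
}

/* Color of the page containing addr, as seen by the cache index bits. */
uint64_t page_color(const cache_geom *g, uint64_t addr)
{
    return (addr / g->page_bytes) % num_colors(g);
}
```

Under this model, recoloring by copying means moving data to a physical page with a different color, whereas remapping would give the same data a differently colored address without moving it.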
![Page 17: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/17.jpg)
Shadow-Backed Superpages
[Figure: virtual addresses 0x00004000, 0x00005000, 0x00006000, and 0x00007000 map to shadow addresses 0x80240000, 0x80241000, 0x80242000, and 0x80243000, which are backed by physical addresses 0x40138000, 0x06155000, 0x04012000, and 0x12011000]
SPECint95 improves 5-20%
MTLB increases effective reach of CPU TLB
Superpage large and multiple arrays at compile time
– at allocation time (cheapest) or dynamically
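A sketch of the page-grained lookup behind a shadow-backed superpage, using the addresses from the figure. The 4 KB page size follows from the address spacing; which physical page backs which shadow page is not recoverable from the transcript, so the order in the table below is an assumption, as are the names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES  0x1000u       /* 4 KB, implied by the address spacing */
#define SHADOW_BASE 0x80240000u   /* first shadow page of the superpage */

/* ASSUMED pairing: shadow page k is backed by backing_page[k]. */
static const uint32_t backing_page[] = {
    0x40138000u, 0x06155000u, 0x04012000u, 0x12011000u
};

/* Translate a shadow address inside the superpage to its physical address. */
uint32_t mtlb_lookup(uint32_t shadow_addr)
{
    assert(shadow_addr >= SHADOW_BASE);
    uint32_t index = (shadow_addr - SHADOW_BASE) / PAGE_BYTES;
    assert(index < sizeof backing_page / sizeof backing_page[0]);
    return backing_page[index] + (shadow_addr & (PAGE_BYTES - 1));
}
```

The CPU TLB needs only one entry for the contiguous shadow range, while the controller's table absorbs the four scattered base pages.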
![Page 18: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/18.jpg)
MMC-Based Prefetching
Idea: Prefetch data off of DRAMs into SRAM on MMC
Misprediction penalties significantly reduced
– conflict misses due to cache capacity limitations
– system bus bandwidth
Exploits “free” DRAM bandwidth at MMC level
– higher aggregate DRAM bandwidth than cache or bus bandwidth
Reduces latency of accesses that hit in prefetch cache
![Page 19: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/19.jpg)
Pointer-based Microbenchmarks
Random walk down tree w/ N children per node
– vary number of children from 1 (linked list) to 3 (trinary tree)
Baseline: compiler-directed prefetching
Impulse: MMC prefetches next nodes in tree (1-ahead)
– allocate nodes in shadow region
– tell MMC what offsets represent pointers
[Figure: tree with a Root node and Child1, Child2, ..., ChildN at each level]
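The microbenchmark can be approximated with the sketch below: a node with N child pointers, a helper that reports the byte offsets of those pointers (the kind of information the MMC is told), and a random walk from the root. The node layout, the function names, and the small linear congruential generator are assumptions; the slides do not show the benchmark source or the MMC interface.

```c
#include <assert.h>
#include <stddef.h>

#define N_CHILDREN 3   /* 1 gives a linked list, 3 a trinary tree */

typedef struct node {
    long payload;
    struct node *child[N_CHILDREN];
} node;

/* Byte offset of the k-th child pointer within a node. */
size_t pointer_offset(size_t k)
{
    assert(k < N_CHILDREN);
    return offsetof(node, child) + k * sizeof(struct node *);
}

/* Walk from root to a leaf, picking a child at random; return nodes visited. */
size_t random_walk(const node *root, unsigned seed)
{
    size_t visited = 0;
    const node *n = root;
    while (n != NULL) {
        visited++;
        seed = seed * 1103515245u + 12345u;        /* simple LCG step */
        n = n->child[(seed >> 16) % N_CHILDREN];   /* next node is data-dependent */
    }
    return visited;
}
```

Because the next address is only known after the current node arrives, a prefetcher that knows where the pointers sit inside a node can start fetching the children one step ahead.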
![Page 20: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/20.jpg)
Pointer Prefetching Results
| | P1 (N) | P1 (C) | P1 (I) | P3 (N) | P3 (C) | P3 (I) |
|---|---|---|---|---|---|---|
| Time | 100M | 99.7M | 84.7M | 124M | 197M | 109M |
| L1 hit ratio | 67.5% | 98.8% | 67.5% | 68.2% | 97.9% | 68.2% |
| L2 hit ratio | 0.4% | 0.1% | 0.4% | 0.4% | 0.3% | 0.5% |
| TLB cycles | 1.6M | 1.2M | 1.6M | 6.2M | 6.2M | 6.0M |
| Speedup | --- | 1.0X | 1.2X | --- | -0.3X | 1.14X |

P1(N): singly-linked list, no prefetching
P3(C): triply-linked list, compiler-directed prefetching
P#(I): Impulse MMC-directed prefetching
![Page 21: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/21.jpg)
Prototyping Status
Four-stage prototype strategy
I: Slow conventional MMC
II: Fast conventional MMC
III: Impulse on an FPGA
IV: Impulse in an ASIC
Current Status:
Stage I complete (pictured)
Stage II imminent (final testing)
Stage III underway (3/01)
Stage IV next year (12/01)
![Page 22: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/22.jpg)
Summary
Impulse Benefits
– Higher memory bus utilization
– Higher cache utilization
– Turns sparse memory operations into dense ones
Range of optimizations
– Fine-grained data remapping
– Page-grained data remapping
– Memory-based prefetching
Impact
– Performance increase for small increase in cost
– Does not require changes to CPUs, caches, or DRAMs
![Page 23: Impulse Project DARPA Review – July 2000](https://reader035.fdocuments.us/reader035/viewer/2022062322/568149fe550346895db72d2a/html5/thumbnails/23.jpg)
Questions?
http://www.cs.utah.edu/impulse