Spring 2009 Prof. Hyesoon Kim - College of Computing
![Page 1: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/1.jpg)
Spring 2009
Prof. Hyesoon Kim
![Page 2: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/2.jpg)
• Benchmarking is critical for making design decisions and measuring performance
– Performance evaluations:
• Design decisions
– Earlier: analytical evaluations
– From the 90’s: heavy reliance on simulations
• Processor evaluations
– Workload characterization: better understand the workloads
![Page 3: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/3.jpg)
• Benchmarks
– Real applications and application suites
• E.g., SPEC CPU2000, SPEC2006, TPC-C, TPC-H, EEMBC, MediaBench, PARSEC, SYSmark
– Kernels
• “Representative” parts of real applications
• Easier and quicker to set up and run
• Often not really representative of the entire app
– Toy programs, synthetic benchmarks, etc.
• Not very useful for reporting
• Sometimes used to test/stress specific functions/features
![Page 4: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/4.jpg)
The set of “representative” applications keeps growing with time!
![Page 5: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/5.jpg)
![Page 6: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/6.jpg)
• Test, train and ref
• Test: simple checkup
• Train: profile input, feedback compilation
• Ref: real measurement. Designed to run long enough to use for a real system
– -> Simulation?
• Reduced input set
• Statistical simulation
• Sampling
![Page 7: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/7.jpg)
• Measure transaction-processing throughput
• Benchmarks for different scenarios
– TPC-C: warehouses and sales transactions
– TPC-H: ad-hoc decision support
– TPC-W: web-based business transactions
• Difficult to set up and run on a simulator
– Requires full OS support, a working DBMS
– Long simulations to get stable results
![Page 8: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/8.jpg)
• SPLASH: Scientific computing kernels
– Who used parallel computers?
• PARSEC: More desktop oriented benchmarks
• NPB: NASA parallel computing benchmarks
• Not many
![Page 9: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/9.jpg)
• GFLOPS, TFLOPS
• MIPS (Million instructions per second)
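Both metrics are rates: a count of work divided by elapsed time. A minimal sketch of the arithmetic, assuming the instruction and floating-point operation counts are already known (the function names and the sample numbers are illustrative, not from the slides):

```python
def mips(instruction_count, seconds):
    """Million instructions per second."""
    return instruction_count / (seconds * 1e6)


def gflops(flop_count, seconds):
    """Billion floating-point operations per second (1 TFLOPS = 1000 GFLOPS)."""
    return flop_count / (seconds * 1e9)


# Hypothetical run: 2 billion instructions, 3 trillion FLOPs.
print(mips(2_000_000_000, 4.0))
print(gflops(3e12, 2.0))
```

Neither number says how much work one instruction or one operation does, so both are comparable only across machines running the same program with the same instruction set.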
![Page 10: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/10.jpg)
• Speedup of arithmetic means != arithmetic mean of speedups
• Use geometric mean:
$$\sqrt[n]{\prod_{i=1}^{n} \text{Normalized execution time on } i}$$
• Neat property of the geometric mean: consistent whatever the reference machine
• Do not use the arithmetic mean for normalized execution times
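The reference-machine property can be checked numerically. The sketch below uses made-up execution times for two programs on three machines (the machine names and numbers are assumptions for illustration) and compares machines B and C after normalizing to each possible reference:

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Hypothetical execution times in seconds: [program 1, program 2].
times = {
    "A": [1.0, 1000.0],
    "B": [10.0, 100.0],
    "C": [20.0, 20.0],
}


def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`."""
    return [t / r for t, r in zip(times[machine], times[reference])]


def mean_ratio(mean, x, y, reference):
    """Mean normalized time of x divided by that of y."""
    return mean(normalized(x, reference)) / mean(normalized(y, reference))


for ref in times:
    gm = mean_ratio(geometric_mean, "B", "C", ref)
    am = mean_ratio(arithmetic_mean, "B", "C", ref)
    print(f"reference {ref}: GM ratio B/C = {gm:.3f}, AM ratio B/C = {am:.3f}")
```

With these numbers the geometric-mean ratio is the same for every reference, while the arithmetic-mean ratio is below 1 with A as the reference and above 1 with C as the reference, so the arithmetic mean cannot even agree on which machine is faster.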
![Page 11: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/11.jpg)
• Often when making comparisons in comp-arch studies:
– Program (or set of) is the same for two CPUs
– The clock speed is the same for two CPUs
• So we can just directly compare CPI’s and often we use IPC’s
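The step from "same program, same clock" to "compare CPIs" follows from the standard CPU performance equation. The slide does not show it; the symbols $N_{instr}$ (instruction count) and $T_{clk}$ (clock period) are chosen here for illustration:

```latex
T_{exec} = N_{instr} \times CPI \times T_{clk}
\qquad\Longrightarrow\qquad
\frac{T_{exec,A}}{T_{exec,B}}
  = \frac{N_{instr} \times CPI_A \times T_{clk}}{N_{instr} \times CPI_B \times T_{clk}}
  = \frac{CPI_A}{CPI_B}
  = \frac{IPC_B}{IPC_A}
```

When the instruction count and clock period are equal for both CPUs they cancel, so the ratio of execution times equals the ratio of CPIs, or the inverse ratio of IPCs.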
![Page 12: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/12.jpg)
• Average CPI = (CPI1 + CPI2 + … + CPIn)/n
• A.M. of IPC = (IPC1 + IPC2 + … + IPCn)/n
– Not equal to 1 / (A.M. of CPI)!!!
• Must use Harmonic Mean of IPC to remain ∝ to runtime
![Page 13: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/13.jpg)
• H.M.(x1, x2, x3, …, xn) =
$$\frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dfrac{1}{x_3} + \dots + \dfrac{1}{x_n}}$$
• What in the world is this?
– Average of inverse relationships
![Page 14: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/14.jpg)
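• “Average” IPC = 1 / A.M.(CPI)
$$
\begin{aligned}
\frac{1}{\mathrm{A.M.}(CPI)}
 &= \frac{1}{\dfrac{CPI_1}{n} + \dfrac{CPI_2}{n} + \dfrac{CPI_3}{n} + \dots + \dfrac{CPI_n}{n}} \\
 &= \frac{n}{CPI_1 + CPI_2 + CPI_3 + \dots + CPI_n} \\
 &= \frac{n}{\dfrac{1}{IPC_1} + \dfrac{1}{IPC_2} + \dfrac{1}{IPC_3} + \dots + \dfrac{1}{IPC_n}} \\
 &= \mathrm{H.M.}(IPC)
\end{aligned}
$$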
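A quick numeric check of this identity, using hypothetical CPI values and assuming every program contributes equally to the average (as the unweighted means on the slide do):

```python
def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(values):
    """n divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-program CPIs and the matching IPCs.
cpis = [0.5, 1.0, 2.0, 4.0]
ipcs = [1.0 / c for c in cpis]

average_ipc = 1.0 / arithmetic_mean(cpis)  # the "average" IPC from the slide
hm_ipc = harmonic_mean(ipcs)               # agrees with average_ipc
am_ipc = arithmetic_mean(ipcs)             # overstates the average

print(average_ipc, hm_ipc, am_ipc)
```

The harmonic mean of the IPCs matches 1 / A.M.(CPI), while the arithmetic mean of the IPCs comes out larger because it is dominated by the high-IPC programs.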
![Page 15: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/15.jpg)
• Stanford graphics benchmarks
– Simple graphics workload. Academic
• Mostly game applications
– 3DMark:
– http://www.futuremark.com/benchmarks/3dmarkvantage
– Tom’s hardware
![Page 16: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/16.jpg)
• Graphics is still the major performance bottleneck
• Previous research: emphasis on graphics
![Page 17: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/17.jpg)
• Several genres of video games
– First Person Shooter
• Fast-paced, graphically enhanced
• Focus of this presentation
– Role-Playing Games
• Lower graphics and slower play
– Board Games
• Just plain boring
![Page 18: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/18.jpg)
[Figure: game computing stages: Physics, Particle, Event, Collision Detection, AI, Rendering, Display]
![Page 19: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/19.jpg)
• Current game design principles:
– higher frame rates imply better game quality
• Recent study on frame rates [Claypool et al. MMCN 2006]
– very high frame rates are not necessary, but very low frame rates impact the game quality severely
![Page 20: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/20.jpg)
Snapshots of animation [Davis et al. Eurographics 2003]
time
![Page 21: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/21.jpg)
[Figure: game workload breakdown, with boxes labeled Game workload, Computational workload, Rendering workload, Other workload, and Rasterization workload]
![Page 22: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/22.jpg)
• Case study
– Workload characterization of 3D games, Roca,
et al. IISWC 2006 [WOR]
– Use ATTILA
![Page 23: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/23.jpg)
![Page 24: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/24.jpg)
• Average primitives per frame
• Average vertex shader instructions
• Vertex cache hit ratio
• System bus bandwidths
• Percentage of clipped, culled, and traversed triangles
• Average triangle sizes
![Page 25: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/25.jpg)
• GPU execution-driven simulator
• https://attilaac.upc.edu/wiki/index.php/Architecture
• Can simulate OpenGL at this moment
![Page 26: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/26.jpg)
![Page 27: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/27.jpg)
• Attila architecture

[Figure: ATTILA pipeline block diagram. Blocks shown: Index Buffer, Vertex Cache, Vertex Request Buffer, Streamer, Primitive Assembly, Clipping, Triangle Setup, Fragment Generation, Hierarchical Z, HZ Cache, Hierarchical Z buffer, Z Cache, Z test, Interpolator, Color cache, Blend, Shader units, Register File, Texture Cache, Texture Address, Texture Filter, and memory controllers MC0, MC1, MC2, MC3]

| Unit | Size | Element width |
|---|---|---|
| Streamer | 48 | 16x4x32 bits |
| Primitive Assembly | 8 | 3x16x4x32 bits |
| Clipping | 4 | 3x4x32 bits |
| Triangle Setup | 12 | 3x4x32 bits |
| Fragment Generation | 16 | 3x4x32 bits |
| Hierarchical Z | 64 | (2x16+4x32)x4 bits |
| Z Tests | 64 | (2x16+4x32)x4 bits |
| Interpolator | --- | --- |
| Color Write | 64 | (2x16+4x32)x4 bits |
| Unified Shader (vertex) | 12+4 | 16x4x32 bits |
| Unified Shader (fragment) | 240+16 | 10x4x32 bits |

Table 2. Queue sizes and number of threads in the ATTILA reference architecture
![Page 28: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/28.jpg)
• Execution driven:
– Correctness, long development time
– Executes the binary
• Trace driven
– Easy to develop
– Simulation time can be shortened
– Large trace file size
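To make the trace-driven style concrete, here is a toy sketch: a pre-recorded list of memory addresses is replayed through a simple cache model, with no binary being executed. The direct-mapped organization, the parameter values, and the trace itself are assumptions for illustration, not part of any simulator named in these slides.

```python
def simulate_direct_mapped(trace, num_sets=4, line_bytes=16):
    """Replay a recorded address trace through a direct-mapped cache model.

    Returns (hits, misses). Only tags are tracked, since a trace carries
    addresses but no data values.
    """
    tags = [None] * num_sets
    hits = misses = 0
    for addr in trace:
        line = addr // line_bytes  # cache-line number of this address
        index = line % num_sets    # set the line maps to
        tag = line // num_sets     # identifies the line within the set
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag      # fill the line on a miss
    return hits, misses


# Hypothetical trace; a real one would be read from a (large) trace file.
trace = [0x00, 0x04, 0x10, 0x00, 0x40, 0x00]
print(simulate_direct_mapped(trace))
```

The model never computes program results, which is why such a simulator is easier to build; an execution-driven simulator would have to run the binary and produce correct values as well as timing.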
![Page 29: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/29.jpg)
• No simulation is required
• To provide insights
• Statistical Methods
• CPU
– First-order
• GPU
– Warp level parallelism
![Page 30: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/30.jpg)
[Figure: warp execution timelines, Case 1 and Case 2, panels (a) and (b). Labels: CWP=4, MWP=2, idle cycles, memory waiting period; 2 Computation + 4 Memory]
[Figure: warp execution timelines, Case 3 and Case 4, panels (a) and (b). Labels: CWP=4, MWP > 4; 8 Computation + 1 Memory]
![Page 31: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/31.jpg)
[Figure: warp execution timelines. (a) Case 6 (1 warp): 8 Computation + 8 Memory. (b) Case 7 (2 warps): 5 Computation + 4 Memory]
![Page 32: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/32.jpg)
• Hardware performance counters
– Built-in counters (instruction count, cache misses, branch mispredictions)
• Profiler
• Architecture simulator
• Characterized items
– Cache misses, branch mispredictions, row-buffer hit ratio
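The characterized items are usually reported as ratios derived from raw counts. A minimal sketch, assuming the counts have already been read from counters, a profiler, or a simulator (the counter names and values below are hypothetical):

```python
def ratio(part, whole):
    """Safe division: returns 0.0 when the denominator is zero."""
    return part / whole if whole else 0.0


# Hypothetical raw counts collected for one workload.
counters = {
    "instructions": 1_000_000,
    "cache_accesses": 200_000,
    "cache_misses": 10_000,
    "branches": 150_000,
    "branch_mispredictions": 7_500,
    "dram_accesses": 8_000,
    "row_buffer_hits": 6_000,
}

cache_miss_rate = ratio(counters["cache_misses"], counters["cache_accesses"])
mispredict_rate = ratio(counters["branch_mispredictions"], counters["branches"])
row_buffer_hit_ratio = ratio(counters["row_buffer_hits"], counters["dram_accesses"])
# Misses per kilo-instruction, a common way to normalize across workloads.
cache_mpki = 1000.0 * ratio(counters["cache_misses"], counters["instructions"])

print(cache_miss_rate, mispredict_rate, row_buffer_hit_ratio, cache_mpki)
```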
![Page 33: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/33.jpg)
• Top design
– (instruction, data flow from memory to CPU
and GPU), Data/control signals
• CPU design
– Pipeline stages, SMT support, fetch address calculation, branch misprediction, cache miss handling path
– Memory address calculation stage, vector processing units
– At least 5 MUXes, registers, ALU, latches
– Memory system: load/store buffers, queues
• GPU Design
– Show at least 10 ALUs
![Page 34: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/34.jpg)
• One of the following items
– Detailed CPU pipeline design (more muxes
and more adders)
– Detailed survey (more information from other sources)
– Detailed GPU pipeline design (more muxes
and more adders)
– Detailed memory system (more queues)
– Detailed memory controller
![Page 35: Spring 2009 Prof. Hyesoon Kim - College of Computing](https://reader031.fdocuments.us/reader031/viewer/2022012505/6180801e100757608c33a3d9/html5/thumbnails/35.jpg)
• I/O: just a box
• Cache just one box or (tag + data)
• Report: explanations are required.
• ECC: just a box
• Design review: 30 min
– Feedback for final report