Why Do We Care?

- Achieving good performance on today's platforms is really hard: you must be an expert in both the architecture and the application, and many cases still require exhaustive search of optimization parameters.
- Obtaining good performance is going to get increasingly difficult on manycore architectures: a greater diversity of platforms (cell phones, laptops, etc.), complex interactions between threads, and an unknown mix of applications space-sharing the machine.
- Current optimization techniques aren't going to be enough: the search space is too large for purely exhaustive search, and machine state varies from run to run.
- We are exploring performance counters as a way to gain insight into an application's performance and adapt during runtime, e.g. as hints to OS scheduling and to online autotuning (see the sketch after this list).
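As a rough sketch of what runtime counter access looks like, the snippet below reads one hardware counter around a program phase using the Linux perf_event_open syscall. This is illustrative only, not the mechanism the study used; the event choice and the idea of feeding the result to a scheduler or autotuner are assumptions.

```c
/* Minimal sketch: read CPU cycles around one application phase via
 * Linux perf_event_open. Illustrative only; error handling trimmed. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES; /* total cycle count */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0: this process; cpu = -1: any CPU */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... phase of interest: the measurement could feed a scheduling
     * hint or an online autotuner's search ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    printf("cycles in phase: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}
```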
Outline

- Architecture Overview
- Application Overview
- Interesting Results
- Future Work
Architecture Overview

Intel Nehalem
- Dual socket, quad x86 cores per socket, 2.5–3.5 GHz
- Hyperthreaded: 2 thread contexts per core
- Private L1 (32 KB) and L2 (256 KB) per core
- Inclusive shared L3 (8 MB)
- Up to 6 instructions issued per cycle
- 10 outstanding data cache misses at a time
- 635 events available for performance counters
[Chart: Parsec L2 cache behavior on SiCortex; CPU_L2REQ, CPU_L2MISS, and CPU_L2MISSALL counts for 1, 2, and 4 cores]
SiCortex
- Single socket, 6 MIPS cores, 500–700 MHz
- Single threaded
- Private L1 (32 KB) per core
- Inclusive shared partitioned L2 (256 KB/core)
- Up to 2 instructions issued per cycle
- 1 outstanding data cache miss at a time
- 3993 events available for performance counters
Parsec L2 Cache Behavior on SiCortex

- Cache requests greatly increase with more cores
- Cache misses go from 65% to 97%, yet misses going to DRAM remain constant: at 8 cores the data is just moving from L2 to L2 (a toy illustration follows)
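One common way lines end up shuttling between private caches is two cores writing to the same cache line. The toy program below is an assumption for illustration, not code from the benchmark; it just makes the ping-pong effect easy to reproduce.

```c
/* Toy illustration of cache-line ping-pong: two threads write two
 * counters that share one cache line, so every write by one core
 * invalidates the line in the other core's private cache. */
#include <pthread.h>
#include <stdio.h>

struct { volatile long a; volatile long b; } line; /* same cache line */

static void *bump_a(void *arg) {
    for (long i = 0; i < 100000000L; i++) line.a++;
    return arg;
}

static void *bump_b(void *arg) {
    for (long i = 0; i < 100000000L; i++) line.b++;
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", line.a, line.b);
    return 0;
}
```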
Application Overview

Parsec Fluidanimate (Intel)
- Fluid dynamics solver benchmark
- Simulates the underlying physics of fluid motion for real-time animation purposes with the SPH (Smoothed Particle Hydrodynamics) algorithm, similar to the one from the 'Parallelize Particle Simulation' assignment from class (a density-step sketch follows this list)
- Exhibits coarse-grained parallelism with static load balancing
- Contains large working sets and some communication
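For concreteness, here is a minimal O(n^2) sketch of the SPH density step using the common poly6 smoothing kernel. The real fluidanimate code partitions particles into a spatial cell grid, so the kernel choice, names, and uniform particle mass here are all simplifying assumptions.

```c
/* Sketch of the SPH density step: rho_i = m * sum_j W(|r_i - r_j|, h).
 * O(n^2) for clarity; the benchmark uses a spatial cell grid instead. */
#include <math.h>
#include <stddef.h>

typedef struct { float x, y, z; } vec3;

/* Poly6 kernel: W(r, h) = 315/(64*pi*h^9) * (h^2 - r^2)^3 for r < h. */
static float poly6(float r2, float h) {
    float h2 = h * h;
    if (r2 >= h2) return 0.0f;
    float d = h2 - r2;
    return 315.0f / (64.0f * (float)M_PI * powf(h, 9.0f)) * d * d * d;
}

void compute_density(const vec3 *p, float *rho, size_t n,
                     float mass, float h) {
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++) {
            float dx = p[i].x - p[j].x;
            float dy = p[i].y - p[j].y;
            float dz = p[i].z - p[j].z;
            acc += poly6(dx * dx + dy * dy + dz * dz, h);
        }
        rho[i] = mass * acc;
    }
}
```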
Chombo
- Finite elements solver used in the ParLab Health application to simulate blood flow in arteries as an incompressible fluid
- Uses finite differences to discretize partial differential equations on block-structured, adaptively refined grids, following published algorithms
- This specific application uses the Poisson solver for oct-tree adaptive meshes introduced by Martin & Cartwright (a simplified uniform-grid sweep is sketched below)
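As a rough picture of what a Poisson solve does at each level, the sketch below shows one Jacobi relaxation sweep for laplace(u) = f on a single uniform grid. Chombo's Martin & Cartwright solver instead operates on block-structured, adaptively refined oct-tree meshes, so the fixed grid size and names here are illustrative assumptions.

```c
/* One Jacobi sweep for the 2-D Poisson equation laplace(u) = f on a
 * uniform grid with spacing h (boundary values live in rows/cols 0
 * and N+1). AMR solvers apply relaxation like this on each level. */
#define N 64 /* interior points per dimension; illustrative */

void jacobi_sweep(double u_new[N + 2][N + 2],
                  const double u[N + 2][N + 2],
                  const double f[N + 2][N + 2], double h) {
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            u_new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                  u[i][j - 1] + u[i][j + 1] -
                                  h * h * f[i][j]);
}
```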
Conclusions

- Too many counters, not enough useful ones
- The semantics of counters on different machines are never quite the same; we need a movement towards standardization: keep counters useful for hardware debugging, but standardize on the ones most useful for predicting application performance and energy usage
- The most useful counters are total cycle counts, instructions committed, and last-level cache misses (read together in the sketch after this list)
- Missing counters: energy metering and cache sharing statistics (the latter present on SiCortex)
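The three counters singled out above map directly onto PAPI preset events. Below is a minimal sketch of reading them around a region of interest, assuming a PAPI installation on which these presets are available.

```c
/* Sketch: read the three counters the study found most useful
 * (total cycles, instructions committed, last-level cache misses)
 * via PAPI presets. Assumes PAPI is installed and these presets
 * map onto the machine's hardware counters. */
#include <stdio.h>
#include <papi.h>

int main(void) {
    int es = PAPI_NULL;
    long long counts[3];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_TOT_CYC); /* total cycle count */
    PAPI_add_event(es, PAPI_TOT_INS); /* instructions committed */
    PAPI_add_event(es, PAPI_L3_TCM);  /* last-level cache misses */

    PAPI_start(es);
    /* ... region of interest ... */
    PAPI_stop(es, counts);

    printf("cycles=%lld instructions=%lld L3 misses=%lld\n",
           counts[0], counts[1], counts[2]);
    return 0;
}
```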
Future Work

Short term:
- Gather the rest of the Chombo data
- Continue the scaling analysis and perform better pipeline analysis of the data
- Analyze the data to see whether combinations of counters provide more useful insight

Long term:
- Propose a standard set of useful counters for profiling performance and energy usage at runtime
- Apply machine learning algorithms to find trends
Scaling Behavior of Applications (Cycles)

- Max shared scales worse on Nehalem due to interference (TLB, cache misses, etc.)
- SiCortex doesn't scale because of cache lines moving around between cores
[Chart: L2_WRITE:LOCK_E_STATE, L2_WRITE:LOCK_HIT, L2_WRITE:LOCK_I_STATE, L2_WRITE:LOCK_M_STATE, L2_WRITE:LOCK_MESI, and L2_WRITE:LOCK_S_STATE counts for 1, 2, 4, and 8 cores]
Lock Behavior of Parsec on Nehalem

- Max shared configuration
- The number of writes to the lock greatly increases with the number of cores (see the spinlock sketch below)
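To see why lock writes can scale with core count, consider a bare test-and-set spinlock: every failed acquire is still a write to the lock's cache line, so each spinning core keeps pulling the line into Modified state in its own cache. The sketch below is an illustration of that pattern, not the locking code Parsec actually uses.

```c
/* Bare test-and-set spinlock (C11 atomics). Under contention every
 * spin iteration performs an atomic write to the lock's cache line,
 * bouncing it between cores; that is the kind of traffic the
 * L2_WRITE:LOCK_* events above count. Illustrative only.
 * Initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */
#include <stdatomic.h>

typedef struct { atomic_flag flag; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* Each failed test-and-set is a write; a test-and-test-and-set
     * lock would spin on a read and generate far fewer writes. */
    while (atomic_flag_test_and_set_explicit(&l->flag,
                                             memory_order_acquire))
        ;
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```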
[Chart: scaling behavior of applications (cycles) for 1, 2, 4, and 8 cores; series: Nehalem Parsec Max Shared, Nehalem Parsec Max Private, SiCortex Parsec, Nehalem Chombo Max Shared, Nehalem Chombo Max Private]