
Why Do We Care?

- Achieving good performance on today's platforms is really hard
  - You must be an expert in both the architecture and the application
  - Many cases still require exhaustive search of the optimization parameters
- Obtaining good performance is going to get increasingly difficult for manycore architectures
  - Greater diversity of platforms: cell phones, laptops, etc.
  - Complex interactions between threads
  - Unknown mix of applications space-sharing the machine
- Current optimization techniques aren't going to be enough
  - The search space is too large for a purely exhaustive search
  - Machine state varies from run to run
- We are exploring performance counters as an approach to gain insight into an application's performance and adapt during runtime (see the sketch after this list):
  - Hints to OS scheduling
  - Hints to online autotuning
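As a concrete illustration of reading a counter at runtime: the sketch below uses the Linux perf_event_open() interface to count last-level cache misses around a region of interest. This is a minimal illustration, not the measurement harness behind the results here; the event choice is arbitrary and error handling is abbreviated.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* No glibc wrapper exists for perf_event_open, so call it directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* last-level cache misses */
    attr.disabled = 1;                        /* start stopped */
    attr.exclude_kernel = 1;                  /* user-space activity only */

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: run one phase of the application ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
        printf("LLC misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```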

Architecture Overview

Nehalem (x86):
- Dual socket, quad x86 cores per socket, 2.5–3.5 GHz
- Hyperthreaded: 2 thread contexts per core
- Private L1 (32 KB) and L2 (256 KB) per core
- Inclusive shared L3 (8 MB)
- Up to 6 instructions issued per cycle
- Up to 10 outstanding data cache misses at a time
- 635 events available for performance counters

[Figure: Parsec L2 cache behavior on the SiCortex — CPU_L2REQ, CPU_L2MISS, and CPU_L2MISSALL event counts for 1, 2, and 4 cores]

SiCortex (MIPS):
- Single socket, 6 MIPS cores, 500–700 MHz, single-threaded
- Private L1 (32 KB) per core
- Inclusive shared, partitioned L2 (256 KB per core)
- Up to 2 instructions issued per cycle
- 1 outstanding data cache miss at a time
- 3993 events available for performance counters (a sketch of enumerating such events follows below)
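For reference, event counts like the 635 and 3993 quoted above can be enumerated programmatically. The sketch below uses PAPI's native-event iteration; PAPI itself is an assumption here — the poster does not say which counter library was used.

```c
/* Hypothetical sketch: counting the native counter events a machine exposes,
   using PAPI's event enumeration. PAPI as the tool is an assumption. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    int code = 0 | PAPI_NATIVE_MASK;  /* start of the native-event space */
    int count = 0;
    if (PAPI_enum_event(&code, PAPI_ENUM_FIRST) == PAPI_OK) {
        do {
            char name[PAPI_MAX_STR_LEN];
            if (PAPI_event_code_to_name(code, name) == PAPI_OK)
                count++;   /* could also print `name` here */
        } while (PAPI_enum_event(&code, PAPI_ENUM_EVENTS) == PAPI_OK);
    }
    printf("%d native events available\n", count);
    return 0;
}
```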

Interesting Results

Parsec L2 Cache Behavior on SiCortex:
- Cache requests greatly increase with more cores
- Cache misses go from 65% to 97% (derivation sketched below)
- Cache misses going to DRAM remain constant
- At 8 cores, data is just moving from L2 to L2
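Assuming the quoted percentages are the ratio of the two counters plotted above (an interpretation; the poster does not spell it out), the arithmetic is simply:

```c
/* The L2 miss rate is just the ratio of the two raw counters. The values
   below are illustrative placeholders, not the measured data. */
#include <stdio.h>

int main(void)
{
    unsigned long long l2_req  = 100000000ULL; /* CPU_L2REQ  (placeholder) */
    unsigned long long l2_miss =  65000000ULL; /* CPU_L2MISS (placeholder) */
    printf("L2 miss rate: %.1f%%\n", 100.0 * (double)l2_miss / (double)l2_req);
    return 0;
}
```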

Application Overview

Parsec Fluidanimate (Intel):
- Benchmark fluid dynamics solver
- Simulates the underlying physics of fluid motion for real-time animation purposes with the SPH (Smoothed Particle Hydrodynamics) algorithm
- The algorithm is similar to the one from the 'Parallelize Particle Simulation' assignment from class
- Exhibits coarse-grained parallelism and static load balancing
- Contains large working sets and some communication

Chombo Finite Element Solver:
- Used in the ParLAB Health Application to simulate blood flow in arteries as an incompressible fluid
- Uses finite differences to discretize partial differential equations on block-structured, adaptively refined grids using published algorithms (a single-grid sketch follows below)
- This specific application uses the Poisson solver for oct-tree adaptive meshes introduced by Martin & Cartwright
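To make the finite-difference discretization concrete, here is one Jacobi sweep of the 5-point Poisson stencil on a single uniform grid. This is a deliberately simplified, single-level sketch — Chombo applies such stencils across a hierarchy of adaptively refined grids — and the grid size and names are illustrative.

```c
/* One Jacobi sweep of the 5-point finite-difference stencil for the Poisson
   equation (laplacian of u = f) on a uniform N x N grid with spacing h.
   A single-level simplification of what Chombo does on AMR hierarchies. */
#define N 64

void jacobi_sweep(double u_new[N][N], const double u[N][N],
                  const double f[N][N], double h)
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                  u[i][j-1] + u[i][j+1] - h * h * f[i][j]);
}
```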

Conclusions

- Too many counters, not enough useful ones
- The semantics of counters are never quite the same between different machines
- We need a movement toward standardization:
  - Keep counters useful for hardware debugging
  - Standardize on the ones most useful for predicting application performance and energy usage
- The most useful counters are (see the sketch below):
  - Total cycle counts
  - Instructions committed
  - Last-level cache misses
- Missing counters:
  - Energy metering
  - Cache sharing statistics (present on the SiCortex)
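To illustrate why those three counters go so far, the standard first-order metrics derived from them are IPC (instructions per cycle) and MPKI (misses per thousand instructions). The raw values in this sketch are placeholders, not measured data:

```c
/* First-order derived metrics from the three counters named above.
   The raw values are placeholders, not measured data. */
#include <stdio.h>

int main(void)
{
    unsigned long long cycles       = 2000000000ULL; /* total cycle count       */
    unsigned long long instructions = 1500000000ULL; /* instructions committed  */
    unsigned long long llc_misses   =    3000000ULL; /* last-level cache misses */

    double ipc  = (double)instructions / (double)cycles;
    double mpki = 1000.0 * (double)llc_misses / (double)instructions;
    printf("IPC: %.2f   LLC MPKI: %.2f\n", ipc, mpki);
    return 0;
}
```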

Future Work

Short term:
- Gather the rest of the Chombo data
- Continue the scaling analysis and perform a better pipeline analysis of the data
- Analyze the data to see if combinations of counters provide more useful insight

Long term:
- Propose a standard set of useful counters for profiling performance and energy usage at runtime
- Apply machine learning algorithms to find trends

Scaling Behavior of Applications (Cycles)

- Max Shared scales worse on Nehalem due to interference (TLB, cache misses, etc.)
- SiCortex doesn't scale because cache lines are moving around between cores

[Figure: normalized cycles (0–1.4) for 1, 2, 4, and 8 cores — Nehalem-Parsec: Max Shared, Nehalem-Parsec: Max Private, SiCortex-Parsec, Nehalem-Chombo: Max Shared, Nehalem-Chombo: Max Private]

Lock Behavior of Parsec on Nehalem

- Max Shared configuration
- The number of writes to locks greatly increases with the number of cores (see the sketch below)

[Figure: L2_WRITE:LOCK_E_STATE, LOCK_HIT, LOCK_I_STATE, LOCK_M_STATE, LOCK_MESI, and LOCK_S_STATE event counts for 1, 2, 4, and 8 cores]
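That growth is the classic signature of threads contending for the same cache line. The sketch below shows the underlying pattern: a shared atomic update, which on x86 compiles to the lock-prefixed read-modify-writes that the L2_WRITE:LOCK_* events appear to count (the event mapping is our reading of the counter names). Thread and iteration counts are illustrative.

```c
/* Contended atomic updates: every thread's lock-prefixed read-modify-write
   bounces the shared cache line between cores, so locked-write events grow
   with the core count. Thread and iteration counts are illustrative. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 8
#define ITERS    1000000L

static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        __sync_fetch_and_add(&counter, 1); /* compiles to `lock xadd` on x86 */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}
```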
