
Why Do We Care?

- Achieving good performance on today's platforms is really hard
  - You must be an expert in both the architecture and the application
  - Many cases still require exhaustive search of the optimization parameters
- Obtaining good performance is going to get increasingly difficult for manycore architectures
  - Greater diversity of platforms: cell phones, laptops, etc.
  - Complex interactions between threads
  - Unknown mix of applications space-sharing the machine
- Current optimization techniques aren't going to be enough
  - The search space is too large for a purely exhaustive search
  - Machine state varies from run to run
- We are exploring performance counters as an approach to gain insight into an application's performance and adapt during runtime (see the sketch after this list):
  - Hints to OS scheduling
  - Hints to online autotuning
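As a concrete illustration of reading a counter at runtime: the sketch below uses the Linux perf_event_open() interface to count last-level cache misses around a region of interest. This is a minimal illustration, not the measurement harness behind the results here; the event choice is arbitrary and error handling is abbreviated.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* No glibc wrapper exists for perf_event_open, so call it directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* last-level cache misses */
    attr.disabled = 1;                        /* start stopped */
    attr.exclude_kernel = 1;                  /* user-space activity only */

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: run one phase of the application ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
        printf("LLC misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```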

Architecture Overview

Nehalem (x86):
- Dual socket, quad x86 cores per socket, 2.5–3.5 GHz
- Hyperthreaded: 2 thread contexts per core
- Private L1 (32 KB) and L2 (256 KB) per core
- Inclusive shared L3 (8 MB)
- Up to 6 instructions issued per cycle
- Up to 10 outstanding data cache misses at a time
- 635 events available for performance counters

[Figure: Parsec L2 cache behavior on the SiCortex — CPU_L2REQ, CPU_L2MISS, and CPU_L2MISSALL event counts for 1, 2, and 4 cores]

SiCortex (MIPS):
- Single socket, 6 MIPS cores, 500–700 MHz, single-threaded
- Private L1 (32 KB) per core
- Inclusive shared, partitioned L2 (256 KB per core)
- Up to 2 instructions issued per cycle
- 1 outstanding data cache miss at a time
- 3993 events available for performance counters (a sketch of enumerating such events follows below)
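For reference, event counts like the 635 and 3993 quoted above can be enumerated programmatically. The sketch below uses PAPI's native-event iteration; PAPI itself is an assumption here — the poster does not say which counter library was used.

```c
/* Hypothetical sketch: counting the native counter events a machine exposes,
   using PAPI's event enumeration. PAPI as the tool is an assumption. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    int code = 0 | PAPI_NATIVE_MASK;  /* start of the native-event space */
    int count = 0;
    if (PAPI_enum_event(&code, PAPI_ENUM_FIRST) == PAPI_OK) {
        do {
            char name[PAPI_MAX_STR_LEN];
            if (PAPI_event_code_to_name(code, name) == PAPI_OK)
                count++;   /* could also print `name` here */
        } while (PAPI_enum_event(&code, PAPI_ENUM_EVENTS) == PAPI_OK);
    }
    printf("%d native events available\n", count);
    return 0;
}
```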

Interesting Results

Parsec L2 Cache Behavior on SiCortex:
- Cache requests greatly increase with more cores
- Cache misses go from 65% to 97% (derivation sketched below)
- Cache misses going to DRAM remain constant
- At 8 cores, data is just moving from L2 to L2
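Assuming the quoted percentages are the ratio of the two counters plotted above (an interpretation; the poster does not spell it out), the arithmetic is simply:

```c
/* The L2 miss rate is just the ratio of the two raw counters. The values
   below are illustrative placeholders, not the measured data. */
#include <stdio.h>

int main(void)
{
    unsigned long long l2_req  = 100000000ULL; /* CPU_L2REQ  (placeholder) */
    unsigned long long l2_miss =  65000000ULL; /* CPU_L2MISS (placeholder) */
    printf("L2 miss rate: %.1f%%\n", 100.0 * (double)l2_miss / (double)l2_req);
    return 0;
}
```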

Application Overview

Parsec Fluidanimate (Intel):
- Benchmark fluid dynamics solver
- Simulates the underlying physics of fluid motion for real-time animation purposes with the SPH (Smoothed Particle Hydrodynamics) algorithm
- The algorithm is similar to the one from the 'Parallelize Particle Simulation' assignment from class
- Exhibits coarse-grained parallelism and static load balancing
- Contains large working sets and some communication

Chombo Finite Element Solver:
- Used in the ParLAB Health Application to simulate blood flow in arteries as an incompressible fluid
- Uses finite differences to discretize partial differential equations on block-structured, adaptively refined grids using published algorithms (a single-grid sketch follows below)
- This specific application uses the Poisson solver for oct-tree adaptive meshes introduced by Martin & Cartwright
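To make the finite-difference discretization concrete, here is one Jacobi sweep of the 5-point Poisson stencil on a single uniform grid. This is a deliberately simplified, single-level sketch — Chombo applies such stencils across a hierarchy of adaptively refined grids — and the grid size and names are illustrative.

```c
/* One Jacobi sweep of the 5-point finite-difference stencil for the Poisson
   equation (laplacian of u = f) on a uniform N x N grid with spacing h.
   A single-level simplification of what Chombo does on AMR hierarchies. */
#define N 64

void jacobi_sweep(double u_new[N][N], const double u[N][N],
                  const double f[N][N], double h)
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                  u[i][j-1] + u[i][j+1] - h * h * f[i][j]);
}
```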

Conclusions

- Too many counters, not enough useful ones
- The semantics of counters are never quite the same between different machines
- We need a movement toward standardization:
  - Keep counters useful for hardware debugging
  - Standardize on the ones most useful for predicting application performance and energy usage
- The most useful counters are (see the sketch below):
  - Total cycle counts
  - Instructions committed
  - Last-level cache misses
- Missing counters:
  - Energy metering
  - Cache sharing statistics (present on the SiCortex)
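To illustrate why those three counters go so far, the standard first-order metrics derived from them are IPC (instructions per cycle) and MPKI (misses per thousand instructions). The raw values in this sketch are placeholders, not measured data:

```c
/* First-order derived metrics from the three counters named above.
   The raw values are placeholders, not measured data. */
#include <stdio.h>

int main(void)
{
    unsigned long long cycles       = 2000000000ULL; /* total cycle count       */
    unsigned long long instructions = 1500000000ULL; /* instructions committed  */
    unsigned long long llc_misses   =    3000000ULL; /* last-level cache misses */

    double ipc  = (double)instructions / (double)cycles;
    double mpki = 1000.0 * (double)llc_misses / (double)instructions;
    printf("IPC: %.2f   LLC MPKI: %.2f\n", ipc, mpki);
    return 0;
}
```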

Future Work

Short term:
- Gather the rest of the Chombo data
- Continue the scaling analysis and perform a better pipeline analysis of the data
- Analyze the data to see if combinations of counters provide more useful insight

Long term:
- Propose a standard set of useful counters for profiling performance and energy usage at runtime
- Apply machine learning algorithms to find trends

Scaling Behavior of Applications (Cycles)

- Max Shared scales worse on Nehalem due to interference (TLB, cache misses, etc.)
- SiCortex doesn't scale because cache lines are moving around between cores

[Figure: normalized cycles (0–1.4) for 1, 2, 4, and 8 cores — Nehalem-Parsec: Max Shared, Nehalem-Parsec: Max Private, SiCortex-Parsec, Nehalem-Chombo: Max Shared, Nehalem-Chombo: Max Private]

Lock Behavior of Parsec on Nehalem

- Max Shared configuration
- The number of writes to locks greatly increases with the number of cores (see the sketch below)

[Figure: L2_WRITE:LOCK_E_STATE, LOCK_HIT, LOCK_I_STATE, LOCK_M_STATE, LOCK_MESI, and LOCK_S_STATE event counts for 1, 2, 4, and 8 cores]
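That growth is the classic signature of threads contending for the same cache line. The sketch below shows the underlying pattern: a shared atomic update, which on x86 compiles to the lock-prefixed read-modify-writes that the L2_WRITE:LOCK_* events appear to count (the event mapping is our reading of the counter names). Thread and iteration counts are illustrative.

```c
/* Contended atomic updates: every thread's lock-prefixed read-modify-write
   bounces the shared cache line between cores, so locked-write events grow
   with the core count. Thread and iteration counts are illustrative. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 8
#define ITERS    1000000L

static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        __sync_fetch_and_add(&counter, 1); /* compiles to `lock xadd` on x86 */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}
```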
