Software Data Prefetching
Mohammad Al-Shurman & Amit Seth
Instructor: Dr. Aleksandar Milenkovic
Advanced Computer Architecture
CPE631
Introduction
Processor-memory gap: memory speed is the bottleneck in computer systems.
At least 20% of stalls are D-cache stalls (Alpha).
Cache misses are expensive.
Reduce cache misses by ensuring data is in L1.
How?
Data Prefetching
First appeared with multimedia applications using MMX technology or the SSE processor extension.
Cache memory is designed for data with high temporal & spatial locality.
Multimedia data has high spatial locality but low temporal locality.
Data Prefetching (cont'd)
Idea: bring data closer to the processor before it is actually needed.
Advantages: no extra hardware is needed (implemented in software); mitigates the memory latency problem.
Disadvantages: increases code size.
Example

//Before prefetching
for (i=0; i<N; i++) {
    sum += A[i];
}

//After prefetching
for (i=0; i<N; i++) {
    _mm_prefetch(&A[i+1], _MM_HINT_NTA);
    sum += A[i];
}
Properties
The prefetch instruction loads one cache line from main memory into cache memory.
During prefetching the processor must continue execution.
The cache must support hits while a prefetch is in progress.
Decreases the miss ratio.
The prefetch is ignored if the data already exists in the cache.
Prefetching Instructions
Temporal instructions:
prefetcht0 fetches data into all cache levels, that is, to L1 and L2 on Pentium III processors.
prefetcht1 fetches data into all cache levels except the 0th level, that is, to L2 only on Pentium III processors.
prefetcht2 fetches data into all cache levels except the 0th and 1st levels, that is, to L2 only on Pentium III processors.
Non-temporal instruction:
prefetchnta fetches data into the location closest to the processor, minimizing cache pollution. On the Pentium III processor, this is the L1 cache.
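As an illustrative sketch (not from the original slides), the hint levels can also be expressed with GCC/Clang's portable __builtin_prefetch(addr, rw, locality), where locality 3 roughly corresponds to prefetcht0 and locality 0 to prefetchnta:

```c
/* Sum an array, prefetching one element ahead at a chosen hint level.
   locality 3 ~ prefetcht0 (keep in all levels), 2 ~ prefetcht1,
   1 ~ prefetcht2, 0 ~ prefetchnta (minimize cache pollution). */
static long sum_with_prefetch(const int *a, int n, int locality)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        /* the locality argument must be a compile-time constant,
           hence the switch rather than passing it directly */
        switch (locality) {
        case 0:  __builtin_prefetch(&a[i + 1], 0, 0); break;
        case 1:  __builtin_prefetch(&a[i + 1], 0, 1); break;
        case 2:  __builtin_prefetch(&a[i + 1], 0, 2); break;
        default: __builtin_prefetch(&a[i + 1], 0, 3); break;
        }
        sum += a[i];
    }
    return sum;
}
```

The result is identical for every hint level; only the cache placement of the prefetched lines differs.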
Prefetching Guidelines
Prefetch scheduling distance: what is the next data to prefetch?
Minimize the number of prefetches: optimize execution time!
Mix prefetch with computation instructions.
Minimize code size and cache stalls.
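The first two guidelines can be sketched together (the distance of D iterations and the one-prefetch-per-line stride are illustrative assumptions, not values from the slides): prefetch D iterations ahead so the line arrives in time, and issue only one prefetch per cache line to keep the instruction count low.

```c
#include <stddef.h>

/* Prefetch d iterations ahead; one prefetch per 16 ints (one 64-byte
   cache line) minimizes the number of prefetch instructions. */
static long sum_ahead(const int *a, size_t n, size_t d)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % 16 == 0)                      /* once per cache line */
            __builtin_prefetch(&a[i + d], 0, 0);
        sum += a[i];
    }
    return sum;
}
```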
Important Notice
Prefetching can be harmful if the loop is small.
Combined with loop unrolling, it may improve application execution time.
It cannot cause an exception: if we prefetch beyond the array bounds, the call is simply ignored.
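To illustrate the last point (a sketch using GCC/Clang's __builtin_prefetch, not code from the slides): the prefetch address needs no bounds check, because a prefetch is only a hint and an address past the end of the array is silently dropped rather than faulting.

```c
/* Prefetch 8 elements ahead with no guard on the prefetch address.
   Near the end of the loop &a[i + 8] points past a[n-1]; the prefetch
   is ignored and the loop still computes the correct sum. */
static long sum_no_guard(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8], 0, 0);  /* may point past the array */
        sum += a[i];
    }
    return sum;
}
```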
Support
Check whether the processor supports the SSE extension (using the CPUID instruction):

mov  eax, 1              ; request feature flags
cpuid                    ; cpuid instruction
test EDX, 002000000h     ; bit 25 of the feature flags equals 1
jnz  Found

We used the Intel compiler in our simulation:
it has a built-in macro for prefetching and supports loop unrolling.
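On GCC/Clang the same CPUID check can be written without inline assembly, using the x86-only __builtin_cpu_supports builtin (a modern alternative, not what the slides used):

```c
/* Run-time check for SSE support (x86-only builtin on GCC/Clang). */
static int has_sse(void)
{
    __builtin_cpu_init();   /* initialize CPU feature detection */
    return __builtin_cpu_supports("sse") != 0;
}
```

This inspects the same CPUID feature flag (bit 25 of EDX) that the assembly snippet above tests by hand.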
Loop Unrolling
Idea: test the performance of code that combines data prefetching with loop unrolling.
Advantages:
Unrolling reduces branch overhead, since it eliminates branches.
Unrolling allows you to aggressively schedule the loop to hide latencies.
Disadvantages:
Excessive unrolling, or unrolling of very large loops, can lead to increased code size.
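A sketch of what manual unrolling with prefetch looks like (the factor of 4 is an illustrative assumption; the slides rely on the Intel compiler's #pragma unroll instead): one branch and one prefetch now cover four elements.

```c
/* Manually unrolled by 4: one loop branch and one prefetch per four
   elements, with a remainder loop for n not divisible by 4. */
static long sum_unrolled4(const int *a, int n)
{
    long sum = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        __builtin_prefetch(&a[i + 4], 0, 0);  /* next group of four */
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
    for (; i < n; i++)                        /* remainder elements */
        sum += a[i];
    return sum;
}
```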
Implementation of Loop Unrolling

//Prefetch without unroll
for (i=0; i<N; i++) {
    _mm_prefetch(&A[i+1], _MM_HINT_NTA);
    sum += A[i];
}

//Prefetch with unroll
#pragma unroll (1)
for (i=0; i<N; i++) {
    _mm_prefetch(&A[i+1], _MM_HINT_NTA);
    sum += A[i];
}
Simulation
We simulate a simple addition loop:

for (i=0; i<size; i++) {
    prefetch(depth);
    sum += A[i];
}

We studied the effects of two factors: data size and prefetch depth.
We also studied the combination of loop unrolling and prefetching.
Simulation (cont'd)
Intel VTune performance analyzer, event-based sampling:
CPI, L1 miss rate, clock ticks.
Size vs CPI
[Chart: CPI vs data size (0.5M, 1M, 2M, 3M, 4M) for no optimization, loop unrolling, data prefetching, and loop unrolling + data prefetching]
Size vs L1 miss ratio
[Chart: L1 data miss ratio vs data size (0.5M, 1M, 2M, 3M, 4M) for no optimization, loop unrolling, data prefetching, and loop unrolling + data prefetching]
Size vs clock ticks
[Chart: clock ticks (instructions-retired samples) vs data size (0.5M, 1M, 2M) for no optimization, loop unrolling, data prefetching, and loop unrolling + data prefetching]
Depth vs CPI for prefetching with unrolling
[Chart: cycles per retired instruction (CPI) vs prefetch depth, 1 to 1024]
Depth vs L1 miss ratio for prefetching with unrolling
[Chart: L1 read misses ratio vs prefetch depth, 1 to 2048]
Depth vs clock ticks for prefetching with loop unrolling
[Chart: clockticks events vs prefetch depth, 1 to 2048]
Depth vs CPI for prefetching without loop unrolling
[Chart: cycles per retired instruction (CPI) vs prefetch depth, 1 to 2048]
Depth vs L1 miss ratio for prefetching without unrolling
[Chart: L1 read misses ratio vs prefetch depth, 1 to 2048]
Depth vs clock ticks for prefetching without loop unrolling
[Chart: clockticks events vs prefetch depth, 1 to 2048]
Questions!