Software Data Prefetching
Mohammad Al-Shurman & Amit Seth
Instructor: Dr. Aleksandar Milenkovic
Advanced Computer Architecture
CPE631
Introduction
Processor-memory gap: memory speed is the bottleneck in computer systems.
At least 20% of stalls are D-cache stalls (Alpha).
Cache misses are expensive.
Reduce cache misses by ensuring data is in L1.
How?
Data Prefetching
First appeared with multimedia applications using MMX technology or the SSE processor extension.
Cache memory is designed for data with high temporal & spatial locality.
Multimedia data has high spatial locality but low temporal locality.
Data Prefetching (cont'd)
Idea: bring data closer to the processor before it is actually needed.
Advantages: no extra hardware is needed (implemented in software); mitigates the memory latency problem.
Disadvantages: increases code size.
Example

//Before prefetching
for (i=0; i<N; i++) {
    sum += A[i];
}

//After prefetching
for (i=0; i<N; i++) {
    _mm_prefetch(&A[i+1], _MM_HINT_NTA);
    sum += A[i];
}
Properties
The prefetch instruction loads one cache line from main memory into cache memory.
During prefetching the processor must continue execution.
The cache must support hits while a prefetch is in progress.
Decreases the miss ratio.
The prefetch is ignored if the data already exists in the cache.
Prefetching Instructions
Temporal instructions:
prefetcht0 fetches data into all cache levels, that is, to L1 and L2 on Pentium III processors.
prefetcht1 fetches data into all cache levels except the 0th level, that is, to L2 only on Pentium III processors.
prefetcht2 fetches data into all cache levels except the 0th and 1st levels, that is, to L2 only on Pentium III processors.
Non-temporal instruction:
prefetchnta fetches data into the location closest to the processor, minimizing cache pollution. On the Pentium III processor, this is the L1 cache.
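As an illustrative sketch (not from the original slides), the hint levels can also be expressed with GCC/Clang's portable __builtin_prefetch(addr, rw, locality), where locality 3 roughly corresponds to prefetcht0 and locality 0 to prefetchnta:

```c
/* Sum an array, prefetching one element ahead at a chosen hint level.
   locality 3 ~ prefetcht0 (keep in all levels), 2 ~ prefetcht1,
   1 ~ prefetcht2, 0 ~ prefetchnta (minimize cache pollution). */
static long sum_with_prefetch(const int *a, int n, int locality)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        /* the locality argument must be a compile-time constant,
           hence the switch rather than passing it directly */
        switch (locality) {
        case 0:  __builtin_prefetch(&a[i + 1], 0, 0); break;
        case 1:  __builtin_prefetch(&a[i + 1], 0, 1); break;
        case 2:  __builtin_prefetch(&a[i + 1], 0, 2); break;
        default: __builtin_prefetch(&a[i + 1], 0, 3); break;
        }
        sum += a[i];
    }
    return sum;
}
```

The result is identical for every hint level; only the cache placement of the prefetched lines differs.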
Prefetching Guidelines
Prefetch scheduling distance: what is the next data to prefetch?
Minimize the number of prefetches: optimize execution time!
Mix prefetch with computation instructions.
Minimize code size and cache stalls.
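The first two guidelines can be sketched together (the distance of D iterations and the one-prefetch-per-line stride are illustrative assumptions, not values from the slides): prefetch D iterations ahead so the line arrives in time, and issue only one prefetch per cache line to keep the instruction count low.

```c
#include <stddef.h>

/* Prefetch d iterations ahead; one prefetch per 16 ints (one 64-byte
   cache line) minimizes the number of prefetch instructions. */
static long sum_ahead(const int *a, size_t n, size_t d)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % 16 == 0)                      /* once per cache line */
            __builtin_prefetch(&a[i + d], 0, 0);
        sum += a[i];
    }
    return sum;
}
```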
Important Notice
Prefetching can be harmful if the loop is small.
Combined with loop unrolling, it may improve application execution time.
It cannot cause an exception: if we prefetch beyond the array bounds, the call is simply ignored.
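To illustrate the last point (a sketch using GCC/Clang's __builtin_prefetch, not code from the slides): the prefetch address needs no bounds check, because a prefetch is only a hint and an address past the end of the array is silently dropped rather than faulting.

```c
/* Prefetch 8 elements ahead with no guard on the prefetch address.
   Near the end of the loop &a[i + 8] points past a[n-1]; the prefetch
   is ignored and the loop still computes the correct sum. */
static long sum_no_guard(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8], 0, 0);  /* may point past the array */
        sum += a[i];
    }
    return sum;
}
```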
Support
Check whether the processor supports the SSE extension (using the CPUID instruction):

mov  eax, 1              ; request feature flags
cpuid                    ; cpuid instruction
test EDX, 002000000h     ; bit 25 of the feature flags equals 1
jnz  Found

We used the Intel compiler in our simulation:
it has a built-in macro for prefetching and supports loop unrolling.
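On GCC/Clang the same CPUID check can be written without inline assembly, using the x86-only __builtin_cpu_supports builtin (a modern alternative, not what the slides used):

```c
/* Run-time check for SSE support (x86-only builtin on GCC/Clang). */
static int has_sse(void)
{
    __builtin_cpu_init();   /* initialize CPU feature detection */
    return __builtin_cpu_supports("sse") != 0;
}
```

This inspects the same CPUID feature flag (bit 25 of EDX) that the assembly snippet above tests by hand.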
Loop Unrolling
Idea: test the performance of code that combines data prefetching with loop unrolling.
Advantages:
Unrolling reduces branch overhead, since it eliminates branches.
Unrolling allows you to aggressively schedule the loop to hide latencies.
Disadvantages:
Excessive unrolling, or unrolling of very large loops, can lead to increased code size.
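A sketch of what manual unrolling with prefetch looks like (the factor of 4 is an illustrative assumption; the slides rely on the Intel compiler's #pragma unroll instead): one branch and one prefetch now cover four elements.

```c
/* Manually unrolled by 4: one loop branch and one prefetch per four
   elements, with a remainder loop for n not divisible by 4. */
static long sum_unrolled4(const int *a, int n)
{
    long sum = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        __builtin_prefetch(&a[i + 4], 0, 0);  /* next group of four */
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
    for (; i < n; i++)                        /* remainder elements */
        sum += a[i];
    return sum;
}
```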
Implementation of Loop Unrolling

//Prefetch without unroll
for (i=0; i<N; i++) {
    _mm_prefetch(&A[i+1], _MM_HINT_NTA);
    sum += A[i];
}

//Prefetch with unroll
#pragma unroll (1)
for (i=0; i<N; i++) {
    _mm_prefetch(&A[i+1], _MM_HINT_NTA);
    sum += A[i];
}
Simulation
We simulate a simple addition loop:

for (i=0; i<size; i++) {
    prefetch(depth);
    sum += A[i];
}

We studied the effects of two factors: data size and prefetch depth.
We also studied the combination of loop unrolling and prefetching.
Simulation (cont'd)
Intel VTune performance analyzer, event-based sampling:
CPI, L1 miss rate, clock ticks.
Size vs CPI
[Chart: CPI vs data size (0.5M, 1M, 2M, 3M, 4M) for no optimization, loop unrolling, data prefetching, and loop unrolling + data prefetching]
Size vs L1 miss ratio
[Chart: L1 data miss ratio vs data size (0.5M, 1M, 2M, 3M, 4M) for no optimization, loop unrolling, data prefetching, and loop unrolling + data prefetching]
Size vs clock ticks
[Chart: clock ticks (instructions-retired samples) vs data size (0.5M, 1M, 2M) for no optimization, loop unrolling, data prefetching, and loop unrolling + data prefetching]
Depth vs CPI for prefetching with unrolling
[Chart: cycles per retired instruction (CPI) vs prefetch depth, 1 to 1024]
Depth vs L1 miss ratio for prefetching with unrolling
[Chart: L1 read misses ratio vs prefetch depth, 1 to 2048]
Depth vs clock ticks for prefetching with loop unrolling
[Chart: clockticks events vs prefetch depth, 1 to 2048]
Depth vs CPI for prefetching without loop unrolling
[Chart: cycles per retired instruction (CPI) vs prefetch depth, 1 to 2048]
Depth vs L1 miss ratio for prefetching without unrolling
[Chart: L1 read misses ratio vs prefetch depth, 1 to 2048]
Depth vs clock ticks for prefetching without loop unrolling
[Chart: clockticks events vs prefetch depth, 1 to 2048]
Questions!