CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural...
![Page 1: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/1.jpg)
CPE 631 Project Presentation
Hussein Alzoubi and Rami Alnamneh
Reconfiguration of architectural parameters to maximize performance and using software techniques to reduce cache miss rate
![Page 2: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/2.jpg)
Topics to Be Covered
Part I: Using PAPI
Finding the best blocking factor to reduce cache miss rate
Getting a complete picture of system hardware
Part II: Using SimpleScalar to find the best size of branch predictor
Part III: Finding the best TLB configuration, also using SimpleScalar
![Page 3: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/3.jpg)
What is PAPI?
Performance Application Programming Interface
Developed at the University of Tennessee’s Innovative Computing Laboratory
Provides access to the hardware performance counters found on most modern microprocessors
Easy to use, well documented, and freely available
![Page 4: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/4.jpg)
Events
Occurrences of specific signals related to a processor’s function
Hardware performance counters are a small set of registers that count events while the program executes on the processor, such as:
Cache misses
Floating point operations
![Page 5: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/5.jpg)
C calling interface
Function calls are defined in the header file “papi.h”
Each call takes the following form:
return_type PAPI_function_name(arg1, arg2, …)
The return value can be a pointer to a structure or a plain value
![Page 6: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/6.jpg)
PAPI timers
Can be used to obtain both real and virtual time
The real-time clock runs all the time (like a wall clock), while the virtual-time clock runs only when the processor is running in user mode
Real time can be acquired in clock cycles and microseconds by calling the following low-level functions, respectively:
PAPI_get_real_cyc()
PAPI_get_real_usec()
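The two real-time readings are tied together by the processor clock rate. The following is a minimal sketch of that relationship, assuming the cycle counter ticks at the nominal CPU frequency (this may not hold on every platform). The helper names `cycles_to_usec` and `usec_to_cycles` are hypothetical and are not part of PAPI.

```c
#include <assert.h>

/* Convert a cycle count to microseconds for a clock rate given in MHz.
   1 MHz is one cycle per microsecond, so usec = cycles / MHz.
   Assumption: the counter advances at the nominal clock rate. */
double cycles_to_usec(long long cycles, double mhz)
{
    assert(mhz > 0.0);
    return (double)cycles / mhz;
}

/* Inverse conversion: microseconds back to clock cycles. */
double usec_to_cycles(double usec, double mhz)
{
    assert(mhz > 0.0);
    return usec * mhz;
}
```

In a measurement, the `cycles` argument would typically be the difference between two `PAPI_get_real_cyc()` readings taken before and after the code of interest. At 248 MHz, for instance, 248,000,000 cycles correspond to 1,000,000 microseconds.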
![Page 7: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/7.jpg)
System information
Executable information
PAPI_get_executable_info()
Information about the executable’s address space:
The beginning of the user program
The end of the user program
Hardware information
PAPI_get_hardware_info()
Information about the system hardware:
Cycle time of processor
Number of processors in the system
![Page 8: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/8.jpg)
Finding the best blocking factor on Bragg and getting system information
Use PAPI to find the best block size (using matrix multiplication)
Measure the number of clock cycles for each block size
Choose the best block size according to the minimum number of clock cycles
Report system hardware information such as processor clock rate and number of processors in the system
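The experiment above depends on a matrix multiplication whose loops are tiled by a blocking factor. The slides do not show the kernel that was used, so the following is only a sketch of one common tiling scheme, assuming square row-major matrices. The names `matmul_naive`, `matmul_blocked`, and `min_size` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes, used to clip the last (partial) tile. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference triple loop: c = a * b for n x n row-major matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked version: the j and k loops are split into tiles of width bf
   so that a bf x bf tile of b is reused across all rows i while it is
   still in the cache. Each tile adds its partial sum into c. */
void matmul_blocked(size_t n, size_t bf,
                    const double *a, const double *b, double *c)
{
    assert(bf > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t jj = 0; jj < n; jj += bf)
        for (size_t kk = 0; kk < n; kk += bf)
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < min_size(jj + bf, n); j++) {
                    double sum = 0.0;
                    for (size_t k = kk; k < min_size(kk + bf, n); k++)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] += sum;
                }
}
```

To reproduce the measurement, each call to `matmul_blocked` would be wrapped between two cycle readings (the slides use `PAPI_get_real_cyc()`), once per candidate blocking factor.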
![Page 9: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/9.jpg)
Results on Bragg system
Available hardware information.

-------------------------------------------------------------

Vendor string and code : SUN unknown (-1)
Model string and code : UltraSPARC I&II (1000)
CPU revision : 9.000000
CPU Megahertz : 248.000000
CPU's in an SMP node : 8
Nodes in the system : 1
Total CPU's in the system: 8

-------------------------------------------------------------

Best block size: 8
bfactor: 8 clock cycles 201801712
bfactor: 16 clock cycles 208085422
bfactor: 32 clock cycles 217125792
bfactor: 64 clock cycles 215792624
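The selection rule (minimum number of clock cycles) can be sketched as a small helper applied to the measurements listed above. The function names `index_of_min_cycles` and `slowdown_percent` are hypothetical, and this is not the code that produced the output.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the smallest cycle count among n measurements.
   Ties resolve to the earliest entry. */
size_t index_of_min_cycles(const long long *cycles, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (cycles[i] < cycles[best])
            best = i;
    return best;
}

/* Slowdown of one measurement relative to the best one, in percent. */
double slowdown_percent(long long cycles, long long best_cycles)
{
    assert(best_cycles > 0);
    return 100.0 * (double)(cycles - best_cycles) / (double)best_cycles;
}
```

Fed the four cycle counts from the Bragg run, `index_of_min_cycles` selects the first entry (blocking factor 8), and the slowest setting (blocking factor 32) comes out roughly 7.6 percent slower than the best.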
![Page 10: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/10.jpg)
Part II: branch predictor
Modify the SimpleScalar parameters for the L1-I cache, L1-D cache, branch predictor, and branch target buffer
Get 16 different configurations
Run four integer and four floating point SPEC2000 benchmarks with these configurations
Calculate the CPI for each benchmark and every configuration and plot the results
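The CPI bookkeeping behind these steps can be sketched as follows, assuming each simulation run reports a cycle count and an instruction count (SimpleScalar's `sim-outorder` prints such statistics). The function names and the row-major table layout are hypothetical choices for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* CPI = clock cycles / executed instructions. */
double cpi(double cycles, double instructions)
{
    assert(instructions > 0.0);
    return cycles / instructions;
}

/* Arithmetic mean of the CPI values of n benchmarks. */
double mean_cpi(const double *cpis, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += cpis[i];
    return sum / (double)n;
}

/* Index of the configuration with the lowest mean CPI.
   table is row-major: table[config * n_benchmarks + benchmark]. */
size_t best_config(const double *table, size_t n_configs, size_t n_benchmarks)
{
    assert(n_configs > 0);
    size_t best = 0;
    double best_mean = mean_cpi(table, n_benchmarks);
    for (size_t c = 1; c < n_configs; c++) {
        double m = mean_cpi(table + c * n_benchmarks, n_benchmarks);
        if (m < best_mean) {
            best_mean = m;
            best = c;
        }
    }
    return best;
}
```

The arithmetic mean is used here only for simplicity. The slides do not state how the per-benchmark CPI values were averaged.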
![Page 11: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/11.jpg)
CPI for integer benchmarks
[Figure: CPI for the integer benchmarks. x-axis: Configuration; y-axis: CPI (0 to 1.4); series: 176.gcc, 181.mcf, 254.gap, 256.bzip2]
![Page 12: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/12.jpg)
CPI for floating point benchmarks
[Figure: CPI for the floating point benchmarks. x-axis: Configuration; y-axis: CPI (0 to 4); series: 171.swim, 189.lucas, 183.equake, 191.fma3d]
![Page 13: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/13.jpg)
Average CPI for the integer and floating point benchmarks
[Figure: Average CPI for integer and floating point benchmarks. x-axis: Configuration; y-axis: CPI (0 to 1.5); series: integer, floating point; callout: Config. #14]
Config. #14: branch predictor 16 KB, branch target buffer 4 KB, L1 instruction cache 32 KB, and L1 data cache 8 KB
![Page 14: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/14.jpg)
Part III: TLB
Used an instruction TLB varying from 512 to 1024 entries and a data TLB varying from 512 to 1024 entries; L1I and L1D cache sizes were also varied
Get 16 different configurations
Run one integer and one floating point SPEC2000 benchmark for each of these configurations
Find the number of clock cycles for each benchmark and every configuration and plot the results
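SimpleScalar's `sim-outorder` takes TLB geometry as a string of the form `<name>:<nsets>:<bsize>:<assoc>:<repl>`, passed to `-tlb:itlb` and `-tlb:dtlb`. The sketch below builds such a string from a total entry count. It assumes that "entries" means sets times associativity, and that the page size (4096) and associativity (4) in the usage example are simulator defaults rather than values from these slides. The helper name `format_tlb_spec` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Write "<name>:<nsets>:<page_size>:<assoc>:l" into buf, where
   nsets = entries / assoc and 'l' selects LRU replacement.
   Returns the value of snprintf (length of the formatted string). */
int format_tlb_spec(char *buf, size_t len, const char *name,
                    int entries, int page_size, int assoc)
{
    assert(assoc > 0 && entries > 0 && entries % assoc == 0);
    return snprintf(buf, len, "%s:%d:%d:%d:l",
                    name, entries / assoc, page_size, assoc);
}
```

For example, `format_tlb_spec(buf, sizeof buf, "itlb", 1024, 4096, 4)` yields `itlb:256:4096:4:l`, which could then be passed as the argument of `-tlb:itlb`.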
![Page 15: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/15.jpg)
Number of clock cycles for the integer benchmark
[Figure: Number of clock cycles for the integer benchmark. x-axis: Configuration # (0 to 15); y-axis: number of cycles * 1E-9 (2.9 to 3.02)]
![Page 16: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/16.jpg)
Number of clock cycles for the floating point benchmark
[Figure: Number of clock cycles for the floating point benchmark. x-axis: Configuration #; y-axis: number of clock cycles * 1E-8 (4 to 4.4); series: 173.applu]
![Page 17: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/17.jpg)
Average number of clock cycles of the integer and floating point benchmarks
[Figure: Average number of clock cycles of the integer and floating point benchmarks. x-axis: Configuration # (0 to 15); y-axis: number of clock cycles * 1E-8 (0 to 4); series: Average]
16 KB L1 instruction cache, 16 KB L1 data cache, 1024-entry instruction TLB, and 512-entry data TLB
![Page 18: CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.](https://reader035.fdocuments.us/reader035/viewer/2022062518/5697bf811a28abf838c85819/html5/thumbnails/18.jpg)
Questions?
Thank you…