A first attempt at learning about optimizing the TigerSHARC code
Understanding the TigerSHARC ALU pipeline
Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of IIR filter – Part 3
Understanding the memory pipeline issues
Speed IIR -- stage 3, M. Smith, ECE, University of Calgary, Canada 2 / 3104/21/23
Understanding the TigerSHARC ALU pipeline
TigerSHARC has many pipelines
Review of how the COMPUTE pipeline works
Interaction of memory (data) operations with COMPUTE operations
    What do we want to be able to do?
    The problems we are expecting to have to solve
    Using the pipeline viewer to see what really happens
Changing code practices to get better performance
Specialized C++ compiler options and #pragmas (will be covered by individual student presentation)
Optimized assembly code and optimized C++
Processor Architecture
3 128-bit data busses
2 Integer ALUs
2 Computational Blocks
    ALU (float and integer)
    SHIFTER
    MULTIPLIER
    COMMUNICATIONS CLU
Simple Example
IIR -- Biquad
For (Stages = 0 to 3) Do
    S0 = Xin * H5 + S2 * H3 + S1 * H4
    Yout = S0 * H0 + S1 * H1 + S2 * H2
    S2 = S1
    S1 = S0
[Figure: diagram labelled with the state values S0, S1, S2]
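The biquad pseudo-code above can be sketched in C++ as a reference version to compare assembly results against. This is a minimal sketch, not the course's own code: the names `BiquadState` and `biquadStage` are hypothetical, and the coefficients H0 to H5 are assumed to be stored in that order in a six-element array.

```cpp
#include <array>

// Hypothetical holder for the two delayed state values of one stage.
struct BiquadState {
    float s1 = 0.0f;  // S1: state from the previous sample
    float s2 = 0.0f;  // S2: state from two samples ago
};

// One biquad stage, following the slide's equations line by line.
// h[0]..h[5] are assumed to hold H0..H5.
float biquadStage(float xin, const std::array<float, 6>& h, BiquadState& st) {
    float s0   = xin * h[5] + st.s2 * h[3] + st.s1 * h[4];
    float yout = s0 * h[0] + st.s1 * h[1] + st.s2 * h[2];
    st.s2 = st.s1;  // S2 = S1
    st.s1 = s0;     // S1 = S0
    return yout;
}
```

Calling `biquadStage` once per input sample with the same `BiquadState` object carries the state from one sample to the next.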
PIPELINE STAGES
See page 8-34 of Processor manual
10 pipeline stages, but may be completely desynchronized (happen semi-independently)
Instruction fetch -- F1, F2, F3 and F4
Integer ALU – PreDecode, Decode, Integer, Access
Compute Block – EX1 and EX2
Instruction 0x17e (XFR8 = R8 + R23) is STALLED (waiting) for instruction 0x17d (XFR23 = R8 * R4) to complete
Bubble B means that the pipeline is doing “nothing”
Meaning that the instruction shown is a “place holder” (garbage)
Code with stalls shown
8 code lines, 5 expected stalls
Expect 13 cycles to complete if theory is correct
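The counting rule behind this estimate (code lines plus expected stalls) can be sketched as a small C++ helper. This is a sketch under assumptions: the `Instr` type and the function names are hypothetical, each instruction is assumed to write a single register, and a stall is assumed to cost exactly one cycle when an instruction reads the result of the instruction immediately before it. The 8-line listing from the slide is not reproduced in this transcript, so the helper is generic.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one COMPUTE instruction.
struct Instr {
    std::string dest;               // register written
    std::vector<std::string> srcs;  // registers read
};

// One stall for each instruction that reads the register written
// by the instruction immediately before it.
int countStalls(const std::vector<Instr>& code) {
    int stalls = 0;
    for (std::size_t i = 1; i < code.size(); ++i) {
        for (const std::string& src : code[i].srcs) {
            if (src == code[i - 1].dest) {
                ++stalls;
                break;  // at most one stall per instruction in this model
            }
        }
    }
    return stalls;
}

// Predicted cycles = number of code lines + number of stalls.
int predictedCycles(const std::vector<Instr>& code) {
    return static_cast<int>(code.size()) + countStalls(code);
}
```

Applied to the pair seen in the pipeline viewer (XFR23 = R8 * R4 followed by XFR8 = R8 + R23), this model counts one stall, so two code lines are predicted to take three cycles.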
Analysis approach IS correct
Code takes the same time whether our “SHOW-STALL” instructions are there or not
Process for coding for improved speed – code re-organization
Make a copy of the code so that we can test both iirASM( ) and iirASM_Optimized( ) to make sure we get the correct result
Make a table of code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
Identify data dependencies
KEY –
    Make all “temp operations” use different registers
    Move instructions “forward” to fill delay slots, BUT don’t break data dependencies
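The two KEY steps above can be illustrated with a small dependency check: an instruction may move ahead of an earlier one only if no data dependency connects them. This is a hedged sketch; `Instr`, `reads` and `canMoveAhead` are hypothetical names, and every instruction is assumed to write exactly one register.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical description of one instruction.
struct Instr {
    std::string dest;               // register written
    std::vector<std::string> srcs;  // registers read
};

// True if the instruction reads the given register.
bool reads(const Instr& instr, const std::string& reg) {
    return std::find(instr.srcs.begin(), instr.srcs.end(), reg) != instr.srcs.end();
}

// True if 'later' can be moved ahead of 'earlier' without changing results.
bool canMoveAhead(const Instr& earlier, const Instr& later) {
    bool needsResult  = reads(later, earlier.dest);  // later uses earlier's result
    bool clobbersSrc  = reads(earlier, later.dest);  // later overwrites an input of earlier
    bool sameDestReg  = earlier.dest == later.dest;  // both write the same register
    return !(needsResult || clobbersSrc || sameDestReg);
}
```

With a shared temporary, `R2 = R1 * R4` followed by `R2 = R3 * R5` cannot be reordered because both write R2. Renaming the second temporary to an unused register (for example R6) removes that false dependency, which is why the renaming step comes before moving instructions forward.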
Show resource usage and data dependencies
Change all temporary registers to use different register names
Then check code produces correct answer
Move instructions forward, without breaking data dependencies
What appears possible!
DO one thing at a time and then check that code still works
CHECK THE PIPELINE AFTER TESTING
There are many more COMPUTE pipeline improvements possible.
However, let’s not spend too much time here as we are only looking at half of the problem
The coefficients are unlikely to be hard-coded, and the state variables can’t be if we are to call IIR( ) in a loop to be able to filter a series of values.
Expect to take 8 cycles to execute
This is not real life
We must use IIR( ) in a loop in order to be able to filter a series of values.
Also IIR( ) will involve multiple stages
This means bringing in (reading) filter coefficients from memory.
It also means bringing in (reading) state values from memory and storing (writing) the changed state values at the end of the function.
Rewrite tests so that the IIR( ) function can take parameters
Let’s make things real by passing in state variables through an “overloaded” C++ function
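What an overloaded IIR( ) taking coefficients and state from memory might look like is sketched below. The actual prototypes are on a slide image that is not in this transcript, so both signatures here are assumptions. The array layout (six coefficients and two state values per stage) and the cascading of stages, where each stage's output feeds the next stage's input, are also assumptions.

```cpp
// Assumed memory layout: H0..H5 per stage, then S1 and S2 per stage.
const int kCoeffsPerStage = 6;
const int kStatesPerStage = 2;

// Overload 1 (hypothetical): a single stage.
// state[0] holds S1 and state[1] holds S2; both are updated in place.
float IIR(float xin, const float* h, float* state) {
    float s0   = xin * h[5] + state[1] * h[3] + state[0] * h[4];
    float yout = s0 * h[0] + state[0] * h[1] + state[1] * h[2];
    state[1] = state[0];  // S2 = S1
    state[0] = s0;        // S1 = S0
    return yout;
}

// Overload 2 (hypothetical): several stages, assumed to be cascaded.
float IIR(float xin, const float* h, float* state, int numStages) {
    float value = xin;
    for (int stage = 0; stage < numStages; ++stage) {
        value = IIR(value,
                    h + stage * kCoeffsPerStage,
                    state + stage * kStatesPerStage);
    }
    return value;
}
```

Because the state lives in the caller's array, the function can be called once per sample in a loop and the filter history survives between calls.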
Rewrite the “C++ code”
I leave the old “fixed” values in until I can get the code to work.
Proved useful this time as the code failed
Why did it fail to return the correct value?
Explore design issues – 1
What do we expect to have to worry about?
XR0 = 0.0; // Set Fsum = 0;
XR1 = [J1 += 1]; // Fetch a coefficient from memory
XFR2 = R1 * R4; // Multiply by Xinput (XR4)
XFR0 = R0 + R2; // Add to sum
XR3 = [J1 += 1]; // Fetch a coefficient from memory
XR5 = [J2 += 1]; // Fetch a state value from memory
XFR5 = R3 * R5; // Multiply coeff and state
XFR0 = R0 + R5; // Perform a sum
XR5 = XR12; // Update a state variable (dummy)
XR12 = XR13; // Update a state variable (dummy)
[J3 += 1] = XR12; // Store state variable to memory
[J3 += 1] = XR5; // Store state variable to memory
Explore design issues – 2
COMPUTE stalls expected (possible)
XR0 = 0.0; // Set Fsum = 0;
XR1 = [J1 += 1]; // Fetch a coefficient from memory
XFR2 = R1 * R4; // Multiply by Xinput (XR4)
XFR0 = R0 + R2; // Add to sum
XR3 = [J1 += 1]; // Fetch a coefficient from memory
XR5 = [J2 += 1]; // Fetch a state value from memory
XFR5 = R3 * R5; // Multiply coeff and state
XFR0 = R0 + R5; // Perform a sum
XR5 = XR12; // Update a state variable (dummy)
XR12 = XR13; // Update a state variable (dummy)
[J3 += 1] = XR12; // Store state variable to memory
[J3 += 1] = XR5; // Store state variable to memory
Explore design issues – 3
Probable memory stalls expected
XR0 = 0.0; // Set Fsum = 0;
XR1 = [J1 += 1]; // Fetch a coefficient from memory
XFR2 = R1 * R4; // Multiply by Xinput (XR4)
XFR0 = R0 + R2; // Add to sum
XR3 = [J1 += 1]; // Fetch a coefficient from memory
XR5 = [J2 += 1]; // Fetch a state value from memory
XFR5 = R3 * R5; // Multiply coeff and state
XFR0 = R0 + R5; // Perform a sum
XR5 = XR12; // Update a state variable (dummy)
XR12 = XR13; // Update a state variable (dummy)
[J3 += 1] = XR12; // Store state variable to memory
[J3 += 1] = XR5; // Store state variable to memory
Memory pipeline issues expected from COMPUTE pipeline issues seen
When you start reading values from memory, how soon is the value fetched available for use within the COMPUTE?
When you have adjacent memory accesses (read or write) does the pipeline work better (higher speed) with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?
Write a quick test to explore the code example
Code stub – part copy from optimized IIR code
What is the MINIMUM number of ;; that must be added so that the code is valid TigerSHARC (multi-instruction) code?
Code assembles, but when it runs, it crashes the tests
Neat mid-term question
Explain why this code crashed the processor (corrupted the processor) so that the remaining tests never completed?
What is the minimum number of lines that must be deleted or added to make the tests run (I did not say work)?
Switched to simulator
#if 0 / #endif around unnecessary tests
Break point set at start of test code
Already seeing lots of pipeline issues
Bigger picture on next slide
Lots of instruction fetch issues
PROBABLY from jumping into new routine and the instruction pipeline not filled by the start of this code
Remove this problem from the analysis by placing 10 NOP;; at the beginning of this exploratory code
WHY 10 NOPS and not 4?
Looking much better.
Pipeline issues – are they as we expected?
COMPUTE operations – 1 cycle delay expected if next instruction needs the result of previous instruction
When you start reading values from memory, how soon is the value fetched available for use within the COMPUTE?
When you have adjacent memory accesses (read or write) does the pipeline work better with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?
Pipeline performance seen
Pipeline performance predicted
When you read a value from memory, there is a 1 cycle delay before the fetched value is available for use within the COMPUTE
COMPUTE operations – 1 cycle delay expected if next instruction needs the result of previous instruction
When you have adjacent memory accesses (read or write) does the pipeline work better with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?
[J1 += 1];; works just fine here (no delay).
Worry about [J1 += J4];; another day
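The two prediction rules above can be applied mechanically to the exploratory listing from the "Explore design issues" slides. The sketch below is a model built only from those rules, not a measurement from the pipeline viewer. The names `Op`, `Kind`, `exploratoryListing` and `predictStalls` are hypothetical, J-register updates are not tracked because back-to-back [J1 += 1] showed no delay, and whether a store waits for a value computed by the instruction just before it is left as a parameter because the slides do not state it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Compute, Load, Store };

// Hypothetical description of one line of the exploratory code.
struct Op {
    Kind kind;
    std::string dest;               // data register written ("" for a store)
    std::vector<std::string> srcs;  // data registers read
};

// The 12-line exploratory listing, with data registers only.
std::vector<Op> exploratoryListing() {
    return {
        Op{Kind::Compute, "R0",  {}},              // XR0 = 0.0
        Op{Kind::Load,    "R1",  {}},              // XR1 = [J1 += 1]
        Op{Kind::Compute, "R2",  {"R1", "R4"}},    // XFR2 = R1 * R4
        Op{Kind::Compute, "R0",  {"R0", "R2"}},    // XFR0 = R0 + R2
        Op{Kind::Load,    "R3",  {}},              // XR3 = [J1 += 1]
        Op{Kind::Load,    "R5",  {}},              // XR5 = [J2 += 1]
        Op{Kind::Compute, "R5",  {"R3", "R5"}},    // XFR5 = R3 * R5
        Op{Kind::Compute, "R0",  {"R0", "R5"}},    // XFR0 = R0 + R5
        Op{Kind::Compute, "R5",  {"R12"}},         // XR5 = XR12
        Op{Kind::Compute, "R12", {"R13"}},         // XR12 = XR13
        Op{Kind::Store,   "",    {"R12"}},         // [J3 += 1] = XR12
        Op{Kind::Store,   "",    {"R5"}}           // [J3 += 1] = XR5
    };
}

// One stall whenever an op reads the register written (by a COMPUTE
// or a memory read) in the op immediately before it.
int predictStalls(const std::vector<Op>& code, bool storeWaitsForResult) {
    int stalls = 0;
    for (std::size_t i = 1; i < code.size(); ++i) {
        const Op& prev = code[i - 1];
        const Op& cur = code[i];
        if (prev.dest.empty()) continue;  // previous op wrote no register
        if (cur.kind == Kind::Store && !storeWaitsForResult) continue;
        for (const std::string& src : cur.srcs) {
            if (src == prev.dest) {
                ++stalls;
                break;
            }
        }
    }
    return stalls;
}
```

Under this model the listing has four stalls from the load-to-use and compute-to-use rules, and a fifth if a store is assumed to wait for a result produced one line earlier. Comparing that count with the pipeline viewer is the check on the theory.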
Understanding the TigerSHARC ALU pipeline
TigerSHARC has many pipelines
Review of how the COMPUTE pipeline works
Interaction of memory (data) operations with COMPUTE operations
    What do we want to be able to do?
    The problems we are expecting to have to solve
    Using the pipeline viewer to see what really happens
Changing code practices to get better performance
Can predict compute and memory stalls
Have enough information to be able to predict performance for real IIR code involving memory fetches and stores
Almost enough information to tackle Lab. 2 with IIR