Octavo: An FPGA-Centric Processor Architecture
description
Transcript of Octavo: An FPGA-Centric Processor Architecture
![Page 1: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/1.jpg)
Octavo: An FPGA-Centric Processor Architecture
Charles Eric LaForestJ. Gregory Steffan
ECE, University of Toronto
FPGA 2012, February 24
![Page 2: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/2.jpg)
2
Easier FPGA Programming• We focus on overlay architectures
– Nios, MicroBlaze, Vector Processors• These inherited their architectures from ASICs
– Easy to use with existing software tools– Performance penalty– ASIC architectures poor fit to FPGA hardware!
• ASIC ≠ FPGA– ASIC: transistors, poly, vias, metal layers– FPGA: LUTs, BRAMs, DSP Blocks, routing
• Fixed widths, depths, other discretizations
FPGA-centric processor design?
![Page 3: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/3.jpg)
3
Hardware (Stratix IV) Width (bits) Fmax (MHz)DSP Blocks 36 480Block RAMs 36 550ALUTs 1 800Nios II/f 32 230
How do FPGAs Want to Compute?
What processor architecture best fits the underlying FPGA?
![Page 4: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/4.jpg)
4
Research Goals
1. Assume threaded data parallelism2. Run at maximum FPGA frequency3. Have high performance4. Never stall5. Aim for simple, minimal ISA6. Match architecture to underlying FPGA
![Page 5: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/5.jpg)
5
Result: Octavo
• 10 stages, 8 threads, 550 MHz• Family of designs
– Word width (8 to 72 bits)– Memory depth (2 to 32k words)– Pipeline depth (8 to 16 stages)
Snapshot of work-in-progress
![Page 6: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/6.jpg)
6
Designing Octavo
![Page 7: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/7.jpg)
7
High-Level View of Octavo
Unified registers and RAM
![Page 8: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/8.jpg)
8
Octavo vs. Classic RISC
• All memories unified (no loads/stores)• How to pipeline Octavo?
![Page 9: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/9.jpg)
9
Design For Speed:Self-Loop Characterization
![Page 10: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/10.jpg)
10
Self-Loop Characterization
• Connect module outputs to inputs– Accounts for the FPGA interconnect
• Pipeline loop paths to absorb delays• Pointed to other limits than raw delay
– Minimum clock pulse widths• DSP Blocks: 480 MHz• BRAMs: 550 MHz
We measured some surprising delays…
![Page 11: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/11.jpg)
11
BRAM Self-Loop Characterization
398 MHz(routing!)
656 MHz 531 MHz 710 MHz
Must connect BRAMs using registers
![Page 12: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/12.jpg)
12
Building Octavo: Memory
![Page 13: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/13.jpg)
13
Building Octavo: Memory
![Page 14: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/14.jpg)
14
Memory
Replicated “scratchpad” memories with I/Owhile still exceeding 550 MHz limit.
Instruction
ALUResult
![Page 15: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/15.jpg)
15
Building Octavo: ALU
![Page 16: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/16.jpg)
16
Building Octavo: ALU
• Fully pipelined (4 stages)– Never stalls
![Page 17: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/17.jpg)
17
Building Octavo: ALU
• Multiplication– Uses DSP Blocks– Must overcome their 480 MHz limit…
![Page 18: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/18.jpg)
18
Building Octavo: Multiplier
• One multiplier is wide enough but too slow
• Two multipliers working at half-speed– Send data to both multipliers in alternation
480 MHz
600 MHz
![Page 19: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/19.jpg)
19
Octavo: Putting It All Together
![Page 20: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/20.jpg)
20
Octavo
0 1 2 3 4 5 6 7 8 9
• Pipeline– 10 stages
• Actually 8 stages with one exception (more later)– No result forwarding or pipeline interlocks– Scalar, Single-Issue, In-Order, Multi-Threaded
![Page 21: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/21.jpg)
21
Octavo
• Instruction Memory– Indexed by current thread PC– Provides a 3-operand instruction– On-chip BRAMs only
0 1 2 3 4 5 6 7 8 9
I
![Page 22: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/22.jpg)
22
Octavo
• A and B Memories– Receive operand addresses from instruction– Provide data operands to ALU and Controller
• Some addresses map to I/O ports– On-chip BRAMs only
0 1 2 3 4 5 6 7 8 9
I
A/B A/B
![Page 23: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/23.jpg)
23
Octavo
• Pipeline Registers– Avoid an odd number of stages– Separate BRAMs for best speed
• Predicted by BRAM self-loop characterization• Unusual but essential design constraint
0 1 2 3 4 5 6 7 8 9
I
A/B A/B
![Page 24: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/24.jpg)
24
Octavo
• Controller– Receives opcode, source/destination operands– Decides branches– Provides current PC of next thread to I memory
0 1 2 3 4 5 6 7 8 9
CTL0 CTL1I
A/B A/B
![Page 25: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/25.jpg)
25
Octavo
• ALU– Receives opcode and data– Writes result to all memories
0 1 2 3 4 5 6 7 8 9
ALU0 ALU1 ALU2 ALU3
CTL0 CTL1I
A/B A/B
![Page 26: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/26.jpg)
26
Octavo
0 1 2 3 4 5 6 7 8 9
ALU0 ALU1 ALU2 ALU3
CTL0 CTL1I
A/B A/B
• Longest mandatory loop: 8 stages– Along A/B memories and ALU– Fill with 8 threads to avoid stalls
T6 T7 T2 T3 T4 T5T0 T1
![Page 27: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/27.jpg)
27
Octavo
• Special case longest loop: 10 stages– Along instruction memory and ALU– Does not affect most computations
• Adds a delay slot to subroutine and loop code
0 1 2 3 4 5 6 7 8 9
ALU0 ALU1 ALU2 ALU3
CTL0 CTL1I
A/B A/B
![Page 28: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/28.jpg)
28
Results: Speed and Area
![Page 29: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/29.jpg)
29
Experimental Framework
• Quartus 10.1 targeting Stratix IV (fastest)– Optimize and place for speed– Average speed over 10 placement runs
• Varied processor parameters:– Word width– Memory depth– Pipeline depth
• Measure Frequency, Area, and Density
![Page 30: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/30.jpg)
30
Maximum Operating Frequency
![Page 31: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/31.jpg)
31
Maximum Operating FrequencyFa
ster
Wider
BRAM hard limitTiming Slack!
![Page 32: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/32.jpg)
32
Maximum Operating Frequency
550+ MHz36 bits wide
230 MHz32 bits wide
2.39x faster, but not a fair comparison
![Page 33: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/33.jpg)
33
Maximum Operating Frequency
Multiplier CAD Anomaly!(38 to 54 bits width)
Enough pipeline stages bury the inefficiency
![Page 34: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/34.jpg)
34
Area Density
![Page 35: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/35.jpg)
35
Area Density
72 bits, 1024 words72 bits, 4096 words
“Sweet spot”
67% used(typical) 26% used
![Page 36: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/36.jpg)
36
Designing Octavo:Lessons & Future Work
![Page 37: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/37.jpg)
37
Lessons
• Soft-processors can hit BRAM Fmax– Octavo: 8 threads, 10 stages, 550 MHz
• Self-loop characterization for modules– Helps reason about their pipelining– Shows true operating envelopes on FPGA
• Octavo spans a large design space– Significant range of widths, depths, stages
Consider FPGA-centric architecture!
![Page 38: Octavo: An FPGA-Centric Processor Architecture](https://reader036.fdocuments.us/reader036/viewer/2022081513/5681611e550346895dd0778e/html5/thumbnails/38.jpg)
38
Future Work