CPRE 583 Reconfigurable Computing Lecture 13: Fri 10/8/2010 (System Architectures)
CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #22 – Multi-Context FPGAs
CprE 583 – Reconfigurable Computing, November 6, 2007, Lect-22.2
Recap – HW/SW Partitioning
[Figure: partitioning flow. A specification (process (a, b, c) in port a, b; out port c; { read(a); … write(c); }) goes through Capture, Model, Partition, Synthesize, and Interface steps and is mapped onto a Processor (Line () { a = … … detach }) and an FPGA]
• A good partitioning mechanism:
  1) Minimizes communication across the bus
  2) Allows parallelism, with both hardware (FPGA) and processor operating concurrently
  3) Keeps the processor near peak utilization at all times (performing useful work)
Recap – Communication and Control
• Need to signal between CPU and accelerator
  • Data ready
  • Complete
• Implementations:
  • Shared memory
  • FIFO
  • Handshake
• If computation time is very predictable, a simpler communication scheme may be possible
Recap – System-Level Methodology
[Figure: system-level design flowchart with stages Informal Specification/Constraints, System model, Architecture design, HW/SW implementation, Prototype, Test (Fail/Success), and Implementation, supported by Component profiling and Performance evaluation]
Outline
• Recap
• Multicontext
  • Motivation
  • Cost analysis
  • Hardware support
  • Examples
Single Context
• When we have
  • Cycles and no data parallelism
  • Low throughput, unstructured tasks
  • Dissimilar data dependent tasks
• Active resources sit idle most of the time
  • Waste of resources
• Why?
Single Context: Why?
• Cannot reuse resources to perform different functions, only the same functions
Multiple-Context LUT
• Configuration selects operation of computation unit
• Context identifier changes over time to allow change in functionality
• DPGA – Dynamically Programmable Gate Array
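The context-select idea can be sketched as a small behavioral model. This is only an illustration under assumptions: the class name `MultiContextLUT`, its methods, and the truth tables are hypothetical and do not come from the lecture or from any real DPGA toolchain.

```python
class MultiContextLUT:
    """Toy model of a k-input LUT that stores one truth table per context."""

    def __init__(self, num_inputs, truth_tables):
        self.num_inputs = num_inputs
        size = 2 ** num_inputs
        for table in truth_tables:
            if len(table) != size:
                raise ValueError("each truth table needs 2**num_inputs entries")
        # One stored configuration (truth table) per context.
        self.truth_tables = [list(t) for t in truth_tables]

    def evaluate(self, context, inputs):
        """Return the output bit for the given context id and input bits."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:  # first input is the most significant address bit
            index = index * 2 + (bit & 1)
        # The context id selects which stored configuration drives the LUT.
        return self.truth_tables[context][index]


# Hypothetical example: context 0 acts as AND, context 1 as XOR,
# both on the same physical 2-input LUT.
lut = MultiContextLUT(2, [[0, 0, 0, 1], [0, 1, 1, 0]])
outputs = [lut.evaluate(ctx, (1, 1)) for ctx in (0, 1)]
```

Switching the context id between evaluations is what lets one active element perform different functions over time.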
Computations that Benefit
[Figure: non-pipelined example (labels: A, B, F0, F1, F2)]
• Low throughput tasks
• Data dependent operations
• Effective if not all resources active simultaneously
• Possible to time-multiplex both logic and routing resources
Computations that Benefit (cont.)
Resource Reuse
• Resources must be directed to do different things at different times through instructions
• Different local configurations can be thought of as instructions
• Minimizing the number and size of instructions is key to achieving an efficient design
• What are the implications for the hardware?
Example: ASCII – Binary Conversion
• Input: ASCII Hex character
• Output: Binary value
signal input : std_logic_vector(7 downto 0);
signal output : std_logic_vector(3 downto 0);
process (input)
begin
end process;
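The intended behavior of the conversion can be sketched in Python. This is a behavioral model only, not the lecture's VHDL; the function name `ascii_hex_to_binary` and the bit-level trick are assumptions made for illustration.

```python
def ascii_hex_to_binary(code):
    """Map an 8-bit ASCII code for a hex digit to its 4-bit value (0..15)."""
    if chr(code) not in "0123456789abcdefABCDEF":
        raise ValueError("not an ASCII hex character")
    low_nibble = code & 0x0F
    # Bit 6 is set for 'A'-'F' and 'a'-'f' and clear for '0'-'9'.
    is_letter = (code >> 6) & 1
    # Letters start at value 10 but their low nibble starts at 1, so add 9.
    return low_nibble + 9 if is_letter else low_nibble


values = [ascii_hex_to_binary(ord(ch)) for ch in "09Af"]
```

Each output bit depends on only a few input bits, which is why the circuit maps onto a small network of LUTs.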
ASCII – Binary Conversion Circuit
Implementation Choices
[Figure: two schedules of the conversion circuit. Implementation #1: NA = 3; Implementation #2: NA = 4]
• Both require same amount of execution time
• Implementation #1 more resource efficient
Previous Study [Deh96B]
[Figure labels: Interconnect, Mux, Logic Reuse]
• Actxt ≈ 80Kλ²
  • dense encoding
• Abase ≈ 800Kλ²
* Each context not overly costly compared to base cost of wire, switches, IO circuitry
Question: How does this effect scale?
DPGA Prototype
Multicontext Tradeoff Curves
• Assume ideal packing: Nactive = Ntotal/L
• Reminder: c*Actxt = Abase
• Difficult to exactly balance resources/demands
• Needs for contexts may vary across applications
• Robust point where critical path length equals # contexts
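A rough sketch of how such a tradeoff curve might be computed follows. It assumes a simplified area model that is not stated on the slide: total area = active LUTs × (Abase + c × Actxt), with ideal packing limited by the critical path length L, and contexts beyond L costing area without reducing active LUTs. The function name `multicontext_area` and the example design size are hypothetical; the 800K and 80K constants are the per-LUT figures quoted from the previous study.

```python
import math

A_BASE = 800_000  # base area per LUT, in lambda^2 (previous-study figure)
A_CTXT = 80_000   # area per stored context, in lambda^2 (previous-study figure)


def multicontext_area(n_total, path_length, contexts):
    """Area under ideal packing (assumed model, ignores retiming overhead)."""
    # Only as many contexts as there are logic levels can share hardware.
    usable = min(contexts, path_length)
    n_active = math.ceil(n_total / usable)
    # Every active LUT pays for all of its stored contexts, used or not.
    return n_active * (A_BASE + contexts * A_CTXT)


# Hypothetical design: 100 LUTs with critical path length 10.
areas = {c: multicontext_area(100, 10, c) for c in range(1, 21)}
best = min(areas, key=areas.get)
```

Under this model the area falls as contexts are added up to c = L and rises afterwards, which matches the robust point named on the slide.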
In Practice
• Scheduling limitations
• Retiming limitations
Scheduling Limitations
• NA (active)
  • Size of largest stage
• Precedence:
  • Can evaluate a LUT only after predecessors have been evaluated
  • Cannot always completely equalize stage requirements
Scheduling
• Precedence limits packing freedom
• Freedom shows up as slack in network
Scheduling
• Computing Slack:
  • ASAP (As Soon As Possible) Schedule
    • propagate depth forward from primary inputs
    • depth = 1 + max input depth
  • ALAP (As Late As Possible) Schedule
    • propagate distance to outputs backward from primary outputs
    • level = 1 + max output consumption level
  • Slack
    • slack = L+1-(depth+level) [PI depth=0, PO level=0]
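The slack rules above can be sketched directly in Python. This is a minimal sketch, assuming the netlist is given as a dictionary from each LUT name to the signals it reads; the function name `compute_slack`, the netlist format, and the example circuit are hypothetical. Any signal that is not itself a LUT is treated as a primary input, and a LUT with no LUT consumers is treated as driving a primary output.

```python
from functools import lru_cache


def compute_slack(fanin):
    """Return ({lut: slack}, L) using the ASAP depth and ALAP level rules."""
    luts = set(fanin)
    fanout = {n: [] for n in luts}
    for n, srcs in fanin.items():
        for s in srcs:
            if s in luts:
                fanout[s].append(n)

    @lru_cache(maxsize=None)
    def depth(n):
        # ASAP: primary inputs have depth 0; a LUT is 1 + max input depth.
        if n not in luts:
            return 0
        return 1 + max((depth(s) for s in fanin[n]), default=0)

    @lru_cache(maxsize=None)
    def level(n):
        # ALAP: primary outputs have level 0; a LUT is 1 + max consumer level.
        return 1 + max((level(c) for c in fanout[n]), default=0)

    path_length = max(depth(n) for n in luts)  # critical path length L
    slack = {n: path_length + 1 - (depth(n) + level(n)) for n in luts}
    return slack, path_length


# Hypothetical netlist: chain a -> b -> c plus an independent LUT d.
netlist = {"a": ["x", "y"], "b": ["a", "z"], "c": ["b"], "d": ["x"]}
slack, L = compute_slack(netlist)
```

In this example the chain LUTs have zero slack, while `d` can be placed in any of the three time slots, so it has slack 2.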
Allowable Schedules
Active LUTs (NA) = 3
Sequentialization
• Adding time slots
  • More sequential (more latency)
  • Adds slack
    • Allows better balance
L=4, NA=2 (4 or 3 contexts)
Multicontext Scheduling
• “Retiming” for multicontext
  • goal: minimize peak resource requirements
  • NP-complete
  • List schedule, anneal
• How do we accommodate intermediate data?
  • Effects?
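A greedy list scheduler for this problem might look like the following. It is a heuristic sketch, not the scheduler used in the cited study: the function name `list_schedule`, the netlist format, and the priority rule (largest ALAP level first) are assumptions, and the sketch ignores the cost of holding intermediate values across contexts.

```python
import math


def list_schedule(fanin, contexts):
    """Return (peak active LUTs, {lut: context}) or None if infeasible."""
    luts = set(fanin)
    fanout = {n: [] for n in luts}
    for n, srcs in fanin.items():
        for s in srcs:
            if s in luts:
                fanout[s].append(n)

    level = {}

    def alap_level(n):
        # Distance to the outputs; larger means more urgent.
        if n not in level:
            level[n] = 1 + max((alap_level(c) for c in fanout[n]), default=0)
        return level[n]

    def try_cap(cap):
        slot = {}
        for step in range(1, contexts + 1):
            # Ready LUTs: every LUT predecessor sits in an earlier context.
            ready = [n for n in luts if n not in slot
                     and all(s not in luts or s in slot for s in fanin[n])]
            ready.sort(key=lambda n: (-alap_level(n), n))
            for n in ready[:cap]:
                slot[n] = step
        return slot if len(slot) == len(luts) else None

    # Try the smallest peak first and grow it until the schedule fits.
    for cap in range(math.ceil(len(luts) / contexts), len(luts) + 1):
        slot = try_cap(cap)
        if slot is not None:
            return cap, slot
    return None


# Hypothetical netlist: chain a -> b -> c plus an independent LUT d.
netlist = {"a": ["x", "y"], "b": ["a", "z"], "c": ["b"], "d": ["x"]}
peak3, schedule3 = list_schedule(netlist, 3)
peak4, schedule4 = list_schedule(netlist, 4)
```

With three contexts the example needs two active LUTs; allowing a fourth time slot lowers the peak to one, at the cost of extra latency.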
Signal Retiming
• Non-pipelined
  • hold value on LUT output (wire)
    • from production through consumption
  • Wastes wire and switches by occupying them
    • For entire critical path delay L
    • Not just for the 1/L'th of the cycle it takes to cross a wire segment
• How will it show up in multicontext?
Signal Retiming
• Multicontext equivalent
  • Need LUT to hold value for each intermediate context
Full ASCII Hex Circuit
• Logically three levels of dependence
• Single Context: 21 LUTs @ 880Kλ² = 18.5Mλ²
Multicontext Version
• Three contexts: 12 LUTs @ 1040Kλ² = 12.5Mλ²
• Pipelining needed for dependent paths
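The per-LUT areas quoted for the single-context and three-context versions appear to follow from the previous-study figures (about 800K base area plus 80K per stored context, in units of λ²). A worked check, assuming the area model A = N_A (A_base + c A_ctxt), which is inferred from the numbers and not stated on the slides:

```latex
\begin{align*}
A(N_A, c) &= N_A \,(A_{base} + c \cdot A_{ctxt}) \\
A_{single} &= 21 \times (800\mathrm{K} + 1 \times 80\mathrm{K})\,\lambda^2
            = 21 \times 880\mathrm{K}\,\lambda^2 \approx 18.5\mathrm{M}\,\lambda^2 \\
A_{3ctx}   &= 12 \times (800\mathrm{K} + 3 \times 80\mathrm{K})\,\lambda^2
            = 12 \times 1040\mathrm{K}\,\lambda^2 \approx 12.5\mathrm{M}\,\lambda^2 \\
A_{3ctx} / A_{single} &\approx 0.68
\end{align*}
```

Each LUT grows by about 18% when it stores three contexts, but the design needs 12 active LUTs where the single-context version needs 21, so the total area drops by roughly a third.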
ASCIIHex Example
• All retiming on wires (active outputs)
  • Saturation based on inputs to largest stage
• With enough contexts only one LUT needed
  • Increased LUT area due to additional stored configuration information
  • Eventually additional interconnect savings taken up by LUT configuration overhead
Ideal = Perfect scheduling spread + no retime overhead
ASCIIHex Example (cont.)
@ depth=4, c=6: 5.5Mλ² (compare 18.5Mλ²)
General Throughput Mapping
• If only want to achieve limited throughput
• Target: produce a new result every t cycles
• Spatially pipeline every t stages
  • cycle = t
• Retime to minimize register requirements
• Multicontext evaluation w/in a spatial stage
  • Retime (list schedule) to minimize resource usage
• Map for depth (i) and contexts (c)
Benchmark Set
• 23 MCNC circuits
• Area mapped with SIS and Chortle
Area v. Throughput
Area v. Throughput (cont.)
Reconfiguration for Fault Tolerance
• Embedded systems require high reliability in the presence of transient or permanent faults
• FPGAs contain substantial redundancy
  • Possible to dynamically “configure around” problem areas
• Numerous on-line and off-line solutions
Column Based Reconfiguration
• Huang and McCluskey
• Assume that each FPGA column is equivalent in terms of logic and routing
  • Preserve empty columns for future use
  • Somewhat wasteful
• Precompile and compress differences in bitstreams
Column Based Reconfiguration
• Create multiple copies of the same design with different unused columns
• Only requires different inter-block connections
• Can lead to unreasonable configuration count
Column Based Reconfiguration
• Determining differences and compressing the results leads to “reasonable” overhead
• Scalability and fault diagnosis are issues
Summary
• In many cases cannot profitably reuse logic at device cycle rate
  • Cycles, no data parallelism
  • Low throughput, unstructured
  • Dissimilar data dependent computations
• These cases benefit from having more than one instruction/operation per active element
• Economical retiming becomes important here to achieve active LUT reduction
• For c=[4,8], I=[4,6] automatically mapped designs are 1/2 to 1/3 single context size