Abstracting the Hardware / Software Boundary through a Standard
Transcript of Abstracting the Hardware / Software Boundary through a Standard
1
Abstracting the Hardware /Software Boundary through a
Standard System SupportLayer and Architecture
Erik Anderson4 May 2007
2
Agenda
• Publications and proposal review.• Problem background.• Abstracting HW/SW boundary.• Analysis, comparing HW and SW.• Results• Conclusions
3
Special Thanks to
• David Andrews• Perry Alexander• Douglass Niehaus• Ron Sass• Yang Zhang
• Jason Agron• Fabrice Baijot• Ed Komp• Andy Schmidt• Jim Stevens• Wesley Peck• Seth Warn
4
Academic Background
• Bachelor of Science in ComputerScience, University of Kentucky, 1993 -1997, Magna Cum Laude
• Doctoral Candidate in ElectricalEngineering, Kansas University, 2003 -2007
5
Publications• “Enabling a Uniform Programming Model Across the
Software/Hardware Boundary,” FCCM 2006• “Supporting High Level Language Semantics within Hardware Resident
Threads,” submitted to FPL 2007• “Memory Hierarchy for MCSoPC Multithreaded Systems,” ERSA 2007• “Achieving Programming Model Abstractions for Reconfigurable
Computing,” Transactions on VLSI, to appear in 2007• “Hthreads: A Computational Model for Reconfigurable Devices,” FPL
2006• “The Case for High Level Programming Models for Reconfigurable
Computing,” ERSA 2006• “Run-time Services for Hybrid CPU/FPGA Systems on Chip,” RTSS
2006
6
Proposal Review• Augment the HWTI
• Extend support for keysubset of Hthread API.
• Semantic andimplementation differences.
• Hthread test suite.
• Application suite.
• Augmented HWTI with:– User interface and protocol.– Globally distributed local
memory.– Function call stack.
• Extended support for keysubset of Hthread API.– Remote procedural calls.
• Chapter 5 in dissertation– Context similarities.
• Hthread test suite.– Abstractions held.
• Application suite.– Framework for HLL to HDL.
Proposed Completed
7
History of ReconfigurableComputing
• 1959: Gerald Estrin’sFixed plus VariableArchitecture.
• 1984: Xilinx is founded.• 1993: Athana proposed
PRISM-I• 2006: First 65nm FPGA
released.
IEEE Annals of History of Computing, Oct -Dec 2002, page 522FPGA 2006
30FPL 2006
25FCCM 2006
PapersConference
8
Reconfigurable ComputingTechnology
• Post-fabricationcircuit design.
• Embedded cores,memory, andmultipliers.
From xilinx.com
9
“Blessing and a Curse”
• FPGA’s can take onany computationalmodel post-fabrication.
• But which one touse?
MISDSISD
MIMDSIMD
Instruction StreamD
ata
Stre
am
Flynn’s Taxonomy
10
Hardware Acceleration Model• SISD or SIMD.• Advantages:
– Can be successful.– C to HDL tools.
• Disadvantages:– Custom interfaces between
HW and SW.• Write once, run once.
– Costly design-spaceexploration.
– Does not use today’s MIMDprogramming models.
11
Abstract Interfaces• Parallel programming
models to abstract CPU/ FPGA interface.
• CPU and FPGA bothtarget an equivalentabstract interface.
• OS / Middleware layerprovidescommunication andsynchronizationmechanism.
12
Thesis Statement
Programming model and high level languageconstructs can be used to abstract the
existing hardware/software boundary thatcurrently exists between CPU and FPGA
components.
13
Extending the Shared Memory Multi-Threaded Model to Hardware
Conceptual Reality
•Pthreads programming
14
Extending the Shared Memory Multi-Threaded Model to Hardware
• Key Challenges– HW access to API
library.– HW access to
application data.– Eliminate custom
interface to HW.
15
Extending the Shared Memory Multi-Threaded Model to Hardware
• Hthread’s Solutions– Access to the same
communicationmedium.
– Equal or equivalentsynchronizationservices migrated toHW.
– Standard systemsupport layer. Hybridthreads System
16
Hardware Thread Interface
• HWTI provides a standard register set forcommunication and synchronization services.
17
Creating a MeaningfulAbstraction
• Communication and synchronizationare solved.
• Problems persist:– “scratchpad” memory:
• How to instantiate?• How to maintain the shared memory model?
– System versus user function calls?– Creating threads from hardware?
18
Globally Distributed LocalMemory
• Dual ported BRAM.• “Globally Distributed” = All threads have access.• “Local” = User logic access is through LOAD and
STORE protocols.
19
Function Call Stack
• Abstract access to localmemory.
• Consistent function callmodel.– Recursion.
• Works analogously tosoftware based stack.– Only difference, user
logic pushes “returnstate” value instead of“return instruction.”
7RETURN
3CALL
1ADDRESSOF
1WRITE
3READ
1DECLARE
5POP
1PUSH
Clock CyclesOperation
20
Remote Procedural Calls
• Some functions tooexpensive toimplement in HWTI.
• Utilize existingsynchronizationprimitives to calloutto a special softwaresystem thread toperform function.
22
Hthread System CallImplementation Differences
• HW has dedicated resources allocatedat synthesis time.– HW explicitly blocks.
• SW has shared resources allocated atruntime.– SW context switches.
23
Hthread Size andPerformance Comparison
• Size– Definition– hthread_create /
hthread_join– hthread_yield
• Performance– Definition– HW outperforms SW
• Create/join notableexception
– Hardware’s bustransactions
24
Demonstrating an AbstractInterface
• POSIX Test-suite adapted for Hthreads.• Conformance tests
– Version for SW, HW, and mixed.• Stress tests
– Version for SW, HW, and mixed.• Abstractions held across SW/HW.
25
Demonstrating HLLConstructs
Sharing data between HW and SWHuffman
Task level parallelism, localvariables.
IDEA
Local array access, access toshared memory.
Haar DWT
Recursion, local variables.Factorial
Recursion, local variables, accessto shared memory.
QuicksortDemonstratesAlgorithm
28
Task Level Parallelism
• Haar DWT– Software’s pseudo-
concurrency.– Hardware’s true
concurrency.• Performance
– 2 SW = 31.1ms– 1HW/SW = 16.5ms– 2 HW = 16.6ms
29
Future Work
• Memory latency for hardware threads.• Leveraging reconfigurable computing.• High level language to hardware
descriptive language translation.
30
Conclusions
• Parallel programming models may beused to abstract CPU/FPGA boundary.– Threads communicate and synchronize
with other threads without regard tolocation.
• Abstract virtual machine can beimplemented in either HW or SW.– Created a framework for HLL to HDL.
36
Demonstration HLLConstructs: Quicksort
• HWTI maintainsO(nlogn) behavior.
• Cache-likeperformance.
37
Demonstration HLLConstructs: IDEA
• Benefits of task level parallelism.• Comparison with Vuletic’s hardware threads.
38
Demonstration HLLApplicability: Huffman
• Abstract datapassing betweenSW and HWthreads.
• Data cache on CPU.
39
Demonstration HLLApplicability: Haar DWT
• Abstract interface vsmeaningful abstractinterface.
• Performance.• Complexity.
40
Globally Distributed LocalMemory
• “Cache like” performance.• Maintains shared memory
model.• User access without bus
transactions.HWTILocalGlobalOperation
19128Store
19351Load
42
Remote Procedural Calls
• Advantages:– HW Access to
shared libraries.– Complete support for
hthread APIs.• Disadvantages:
– Interrupts the CPU.– Comparatively slow. 114µsstrcmp (string.h)
450µscos (math.h)1.66msprintf (stdio.h)120µsfree (stdlib.h)122µsmalloc (stdlib.h)130µshthread_join (hthread.h)160µshthread_create (hthread.h)
ExecutionLibrary Call