RA7_Vasantham_SudheerKumar
-
Upload
svasanth007 -
Category
Documents
-
view
212 -
download
0
Transcript of RA7_Vasantham_SudheerKumar
8/11/2019 RA7_Vasantham_SudheerKumar
http://slidepdf.com/reader/full/ra7vasanthamsudheerkumar 1/5
A Summary on “NetThreads: Programming NetFPGA with Threaded Software”
Focus : Stateful network applications with a custom multithreaded soft multiprocessor
system-on- chip
Complex applications are best described in a high-level software executing on aprocessor
Factors Contributing to applicability of soft processors—processors composed of programmable
logic on the FPGA
Improvements in logic density
Achievable clock frequencies
Easy to program(eg : C)
Portable to different FPGAs
Flexible(Can be customized)
Can be used to manage othercomponents/accelerators in the
design
Advantages of soft processor over creating custom logic in HDL
NetThreads, a NetFPGA-based soft multithreadedmultiprocessor architecture
Ease the implementation of high-performance,
irregular packet processing applications.
8/11/2019 RA7_Vasantham_SudheerKumar
http://slidepdf.com/reader/full/ra7vasanthamsudheerkumar 2/5
CPU design is multithreaded, allowing asimple and area-efficient Datapath to
avoid stalls
several different memories (instruction,input, output, shared), to tolerate the
limited number of ports on FPGA block-RAMs supporting a shared memory.
Third, our architecture supports multipleprocessors, allowing the hardware design
to scale up to the limits of parallelism
Design Features
Evaluate NetThreads using several parallel packet-processing applications that use shared memory andsynchronization, including UDHCP, packet classification andNAT.
Measure several packet processing workloads on a 4-waymultithreaded, 5-pipeline-stage, two-processor instantiationof our architecture implemented on NetFPGA
Hazard detection logic can often be on the critical path andcan require significant area
Using the block RAMs provided in an FPGA to implementmultiple logical register files is comparatively fast and area-efficient.
The shared cache performs 32-bit line-sized data transfers with the DDR2
SDRAM controller
Each processor has a dedicated connection to a synchronization unit thatimplements 16 mutexes.
Soft processors are configurable and can be extended with accelerators asrequired, and those accelerators can be clocked at a separate frequency
8/11/2019 RA7_Vasantham_SudheerKumar
http://slidepdf.com/reader/full/ra7vasanthamsudheerkumar 3/5
Sophisticated method for Scheduling threads
de-schedule any thread
that is awaiting a lock firstovercome two challenges.
issue instructions fromthe unblocked threads in
round-robin order
Challenges
multiple instructions from the same thread in
the pipeline at once—hence to replay aninstruction, all of the instructions for thatthread following the replayed instruction
must be squashed to preserve the programorder of instructions execution.
Two instructions from the same thread might issuewith an unsafe distance between them in thepipeline, potentially violating a data or control
hazard. We solve this problem by performing statichazard detection
Base processor is a single-issue, in-order, 5-stage, 4- way multithreaded processor,
shown to be the most area- efficient compared to a 3- and 7-stage pipeline
Eliminate the hardware multipliers
processor is big-endian which avoids the need to perform network-to-host byte
ordering transformations
8/11/2019 RA7_Vasantham_SudheerKumar
http://slidepdf.com/reader/full/ra7vasanthamsudheerkumar 4/5
NETFPGA PROGRAMMING ENVIRONMENT
CompilationNetFPGA
ConfigurationTiming Validation Measurement
Stateful applications—i.e., applications in which shared, persistent data structures
are modified during the processing of most packets
Dependences make it difficult to pipeline the code into balanced stages of
execution to extract parallelism
Focus is on control-flow intensive applications performing deep packet inspection
(i.e., deeper than the IP header). Network processing software is normally closely-
integrated with operating system networking constructs
Applications
UDHCP
Classifier
NAT
Custom
Accelerators
Transactiona
l execution
8/11/2019 RA7_Vasantham_SudheerKumar
http://slidepdf.com/reader/full/ra7vasanthamsudheerkumar 5/5
Question and Answer
1)What is the performance bottleneck of Netthreads
A) Synchronization will be the main performance bottleneck
2)What are the design implementation requirements
A) (i) synchronization between threads, resulting in synchronization latency(while waiting to acquire a lock)
(ii) critical sections (while holding a lock)
3)What are the drawbacks of round-robin thread-issue schemes
A) (i) Issuing instructions from a thread that is blocked on synchronization (e.g.,spin-loop instructions or a synchronization instruction that repeatedly fails)
wastes pipeline resources; and
(ii)A thread that currently owns a lock and is hence in a critical section only
issues once every N − 1 cycles