Transcript of Thesis Defense
Multi-Threaded End-to-End Applications on Network Processors
Michael Watts
January 26th, 2006
The Stage
• Demand to move applications from end nodes to network edge
• Increased processing power at edge makes this possible
[Diagram: end nodes connected through edge devices to the Internet]
Example
• All communication between corporate offices is secured at the Internet edge
[Diagram: Corporate Office West and Corporate Office East communicating across the Internet]
• End nodes responsible for establishing secure communication
Applications at Network Edge
• Provide service to end nodes
– Security
– Quality of Service
– Intrusion detection
– Load balancing
• Kernels carry out a single task
– Such as MD5, URL-based switching, and AES
• End-to-end applications combine multiple kernels
Intelligent Devices
• High-level applications at network edge
– Demand processing power
– Demand flexibility of general-purpose processors
• Application-Specific Integrated Circuit (ASIC)
– Speed without flexibility
– Customized for particular use
• Network Processing Unit (NPU)
– Programmable flexibility
– Performance through parallelization
Benchmarks
• Increasing complexity of next-generation applications
– More demand on NPUs
– Benchmark applications used to test performance of NPUs
• Current network benchmarks
– Single-threaded kernels
– Insufficient for NPU multi-processor architecture
Contributions
• Multi-threaded end-to-end application benchmark suite
• Generic NPU simulator
• Analysis shows kernel performance inaccurate indicator of end-to-end application performance
Overview
1. Network Processors and Simulators
2. The NPU Simulator
3. Benchmark Applications
4. Tests and Results
5. Conclusion
6. Future Work
Network Processors
• NPU
– Programmable packet processing device
– Over 30 self-identified NPUs
• NPU Architecture
– Dedicated co-processors
– High-speed network interfaces
– Multiple processing units: pipelined or symmetric
Pipelined vs. Symmetric
• Pipelined
• Symmetric
[Diagram: pipelined units pass a packet through successive processing stages; symmetric units each process a complete packet in parallel]
Intel IXP1200
• Symmetric architecture
• Processors (266MHz, 32-bit RISC)
– 1 x StrongARM controller
• L1 and L2 cache
– 6 x microengines (ME)
• 4 hardware-supported threads each
• No cache, lots of registers
• Shared Memory
– 8 MBytes SRAM
– 256 MBytes SDRAM
– StrongARM and MEs share memory bus
– No built-in memory management
Intel IXP1200 Architecture
NPU Simulators
• Purpose
– Execute programs on foreign platform
– Provide performance statistics
• SimpleScalar
– Cycle-accurate hardware simulation
– Architecture similar to MIPS
– Modified GNU GCC generates binaries
PacketBench
• Developed at University of Massachusetts
• Uses SimpleScalar
• Provides API for basic NPU functions
• NPU platform independence
• Drawback: no support for multiprocessor architectures
Benchmarks
• Applications designed to assess performance characteristics of a single platform or differences between platforms
– Synthetic
• Mimic a particular type of workload
– Application
• Real-world applications
• Our focus: application benchmarks for the domain of NPUs
Benchmark Suites
• MiBench
– Target: embedded microprocessors
– Including Rijndael encryption (AES)
• NetBench
– Target: NPUs
– Including Message-Digest 5 (MD5) and URL-based switching
• Source available in C
• Limitation: single-threaded
The Simulator
• Modified existing multiprocessor simulator
• Built on SimpleScalar
• Modeled after Intel IXP1200
– Modeled processing units, memory, and cache structure
– Processors share memory bus
– SRAM reserved for instruction stacks
Parameter        | StrongARM        | Microengines
-----------------|------------------|-----------------------------
Scheduling       | Out-of-order     | In-order
Width            | 1 (single-issue) | 1 (single-issue)
L1 I Cache Size  | 16 KByte         | SRAM (0 penalty)
L1 D Cache Size  | 8 KByte          | 1 KByte (replace registers)
Methods of Use
• Simulator compiles on Linux using GCC
• Takes SimpleScalar binary as input

sim3ixp1200 [-h] [sim-args] program [program-args]

• Threads argument controls number of microengine threads (0-24)
• 6 microengines are allotted threads using round-robin
Application Development
• Developed in C
• Compiled using GCC 2.7.2.3 cross-compiler
– Linux/x86 to SimpleScalar
• No POSIX thread support; same binary executed by each thread
• No memory management
• Multi-threading
– getcpu()
– barrier()
– ncpus
Example Code

// common initialization
...
barrier();
int thread_id = getcpu();
if (thread_id == 0) {
    // StrongARM
} else if (thread_id == 1) {
    // 1st microengine thread
} else {
    // 2nd through ncpus microengine threads
}
Benchmark Applications
• Modified 3 kernels from MiBench and NetBench
– Message-Digest 5 (MD5)
– URL-based switching (URL)
– Advanced Encryption Standard (AES) [Rijndael]
• Modified memory allocations
• Modified source of incoming packets
• Parallelized
MD5
• Creates a 128-bit signature of input
• Used extensively in public-key cryptography and verification of data integrity
• Packet processing offloaded to microengine (ME) threads
• Packets processed in parallel
MD5 Algorithm
• Every packet processed on separate ME thread
• StrongARM monitors for idle threads and assigns work
MD5 Parallelization
[Diagram: StrongARM distributes incoming packets across microengine threads]
URL
• Directs packets based on payload content
• Useful for load-balancing, fault detection and recovery
• Layer 7 switch, content-switch, web-switch
• Uses pattern matching algorithm
URL Algorithm
• Work for each packet split among ME threads
• StrongARM iterates over search tree, assigning work to idle ME threads
• ME threads report when match found
URL Parallelization
[Diagram: StrongARM splits each incoming packet's pattern-matching work across microengine threads]
AES
• Block cipher encryption algorithm
• Made US government standard in 2001
• 256-bit key
• Same parallelization technique as MD5
• Key loaded into each ME’s stack during initialization
• Packet encryption performed in parallel
Performance Tests
• Purpose
– Evaluate multi-threading kernels and end-to-end applications
• Tests
– Isolation
– Shared
– Static
– Dynamic
Isolation Tests
• Establish baseline
• Explore effects of multi-threading kernels
• Each kernel run in isolation
• Number of ME threads varied from 1 to 24
• Speedup graphed against serial version
MD5 Isolation Results
• 0: serial on StrongARM
• 1-24: parallel on MEs
• Decreased speedup on 1 ME
• Significant speedup overall
• Note decreasing slope at 7, 13, and 19 threads
URL Isolation Results
• When one thread finds a match, it must wait for the other threads to finish
– Polling version required polling of a global flag
– Performed slightly worse (1.64 compared to 1.75)
– Matching pattern found in 40% of packets
• When too many threads work at once, shared-resource bottlenecks limit speedup
AES Isolation Results
• Performs poorly on MEs
• Packets processed in 16-byte chunks
• State maintained in accumulator for packet lifetime
• Static lookup table of 8 KBytes
• L1 data cache: 8 KBytes for StrongARM, 1 KByte for MEs
• Consumes more cycles on ME by factor of 8.4
Shared Tests
• Reveal sensitivity of each kernel to concurrent execution of other kernels
• StrongARM serves as controller
• Baseline of 1 MD5, 4 URL, and 1 AES thread
• Separate packet streams for each kernel
• Number of threads increased for kernel under test
Shared Results
• MD5: not substantially affected
• URL: maximum of 1.17 (compared to 1.75)
• AES: order of magnitude higher
– Baseline uses ME, not StrongARM
Static Tests
• Characteristics of end-to-end application
• Location of bottlenecks
• Kernels work together to process single packet stream
• Find optimal thread configuration
End-to-End Application
• Distribution of sensitive information from trusted network over Internet to different hosts
1. Calculate MD5 signature
2. Determine destination host using URL
3. Encrypt packet using AES
4. Send packet and signature to host
Static Results
• Baseline of 1 MD5, 4 URL, and 1 AES thread
• Additional thread tried on each kernel
• Best configuration used as starting point for next
• Final result: 1 MD5, 11 URL, and 12 AES threads
Static Results (cont.)
• Although MD5 had the best speedup in Isolation, it was unable to improve speedup in Static
– Amdahl’s Law: speedup = 1 / ((1 – P) + (P / S)), where P is the parallelizable fraction and S the speedup of that portion
• More threads initially allocated to URL
– URL remained the bottleneck until 10 threads
Dynamic Tests
• MEs not dedicated to single kernel, instead assigned work by StrongARM based on demand
• StrongARM responsible for allocating threads and maintaining wait-queues
• Realistic configuration
• Increased development complexity
Dynamic Algorithm
[Diagram: StrongARM feeds microengines running MD5, URL, and AES from packet queues]
• StrongARM monitors MEs
• Assigns work to idle threads
– First from the queues (AES queue, URL queue), then from the incoming packet stream
• URL queue fills as MD5 outperforms URL
• Additional threads created for URL
• AES threads created each time URL finishes
Dynamic Results
• Baseline same as Static
• Substantial speedup over Static
Dynamic Results (cont.)
• 25% as many cycles as Static
• Some ME threads in Static waste idle cycles
• Less affected by URL bottleneck
• Able to adjust to varying packet sizes
Analysis
• Isolation– Established baseline
• Shared– Explored concurrent kernels
• Static– End-to-end application characteristics– Thread allocation optimization
• Dynamic– Contrast on-demand to static thread allocation
Conclusion
• NPU multi-processor simulator
• Multi-threaded end-to-end benchmark applications
• Analysis of benchmarks on NPU simulator
– Kernel performance is not indicative of end-to-end application performance
– MD5 scaled well in Isolation and Shared, little effect in end-to-end applications
Future Work
• NPU simulator
– Already used in two other M.S. thesis projects
– Larger cycle count capability
– Updated to model current NPU generation
• End-to-end applications
– Simulated on next-generation simulator
– Further investigation into bottlenecks
Future Work (cont.)
• Benchmark suite
– Include additional kernels
– Model more real-world end-to-end applications
Thank You, Questions