Post on 17-Jan-2016
NC STATE UNIVERSITY
Center for Embedded Systems Research (CESR)Department of Electrical & Computer Eng’g
North Carolina State University
Ali El-Haj-Mahmoud, Ahmed S. AL-Zawawi, Aravindh Anantaraman, and Eric Rotenberg
Virtual Multiprocessor: An Analyzable, High-Performance
Microarchitecture for Real-Time Computing
2El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Embedded Processor Trends
Inheriting desktop high-performance features Examples
• ARM11: 8-stage pipeline, caches, dynamic br. pred.
• Ubicom IP3023: 8 hardware threads
• PowerPC 750: 2-way superscalar, OOO execution
3El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Real-Time Systems and Analyzability
Schedulability of task-set determined a priori• Requires worst-case execution times (WCET) statically analyzable microarchitecture
Dynamic microarchitecture features complicate real-time design
A trade-off between performance and analyzability
A
B
4El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Multiple Simple Processors
+ Analyzable
+ Natural fit with real-time systems
− Rigid resource partitioning
− Higher cost/performance metric
proc 1 proc 2
A BC
5El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Simultaneous Multithreading (SMT)
+ Flexible resource sharing
+ Better cost/performance metric
− Unanalyzable
SMT
A B C
6El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Unanalyzability of SMT
− Violates single-task WCET assumption (tasks analyzed separately)
− Arbitrary periods arbitrary overlap of tasks
− Dynamic interference
Cannot derive WCETs
Cannot perform schedulability
7El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Real-Time Virtual Multiprocessor
Combine MP analyzability and SMT flexibility Key idea: Interference-free multithreading
• SMT performance• WCET of each task independent of task-set
RVMP substrate: two parts• Highly reconfigurable multithreaded superscalar
Space: multiple arbitrary interference-free partitionsTime: rapidly reconfigure partitions
• Static schedule orchestrates partitioning
8El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Big Picture
Co-design processor and real-time scheduling for analyzable high-performance
9El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
RVMP Architecture
Superscalar “ways” are natural partitioning granularity
Different-sized virtual processors carved out of single superscalar
10El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Processor Architecture
Starting point• Alpha 21164: 4-way in-order superscalar• Ubicom IP3023: 8 hardware threads
(4 in RVMP 4 VPs)
Simplifications for analyzability• In-order issue within VPs• Software-managed scratchpads • Static branch prediction
Not limitations of RVMP!
11El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Processor Architecture
FetchUnit
PC
InterleavedInstruction Scratchpad
DecodeSlotter and Scoreboard
(Issue Logic)
Int RF
ShadowBuffers
FP RF
Data Scratchpad
FU4: FPU
FU0: INT
FU1: INT/MUL/DIV
FU2: INT/AGEN
FU3: INT/AGEN
4 4
1
ShadowBuffers
RD
RDWR
HRT
Fetch Buffer
12El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Fetch Buffer
InstructionScratchpad
FetchUnit
Decode Issue
Backend
FV
PV
Shadow Buffers
Shadow Buffers
FV
PV
HRT
Instruction Fetch
13El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Real-Time Scheduling
Too complicated• Must schedule entire hyper-period• Overwhelming # of possible space/time schedules• High dedicated-storage cost for schedule
A
BC
D
14El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
WCET2WCET1
Task A
WCET3WCET4
period
1
2
3
4#
of
way
s
4 ways
3 ways
2 ways
1 way
…
…
…
…
…Round
15El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Task Bperiod
1
2
3
4#
of
way
s
4 ways
3 ways
2 ways
1 way
…
…
…
…
…
16El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Task Cperiod
1
2
3
4#
of
way
s
4 ways
3 ways
2 ways
1 way
…
…
…
…
…
17El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Task Dperiod
1
2
3
4#
of
way
s
4 ways
3 ways
2 ways
1 way
…
…
…
…
…
18El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
FAILEDSUCCEEDEDCyclic schedule
19El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Interaction: Scheduling and Architecture
HRT LTC
FU0FU1FU2FU3FU4
FU0FU1FU2FU3FU4
60
40
100 cycles
60 40
FV PV CVs
0
EOT
1
INVALID
INVALID
20El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Experiments Tasks from C-lab and MiBench benchmarks 100 task-sets
• 4 tasks per task-set (also 8 tasks in paper)• grouped according to scalar utilization (U_scalar)
Two experiments• Worst-case schedulability analysis • Run-time experiments
21El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Worst-Case Schedulability Tests
25 25 25 25
1614 15
5
16
97
2
7
1 1
25
911 10
20
25
9
1618
2325
18
24 24 2525
0%
25%
50%
75%
100%
Sca
lar
RV
MP
4x1
2x2
1x4
Sca
lar
RV
MP
4x1
2x2
1x4
Sca
lar
RV
MP
4x1
2x2
1x4
Sca
lar
RV
MP
4x1
2x2
1x4
0 < U_scalar <= 1 1 < U_scalar <= 2 2 < U_scalar <= 3 3 < U_scalar <= 4
Task-set bins
Su
cces
s r
ate
(%)
Failure
Success
22El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
RVMP Configurations
0
1
2
3
4
5
6
7
1-1-
1-1
2-2
1-3/
2-2
1-3/
2-1-
14/
1-1-
24/
4/2-
24/
4/1-
34/
4/4/
4
1-1-
1-1
2-2
1-3/
2-2
1-3/
2-1-
14/
1-1-
24/
4/2-
24/
4/1-
34/
4/4/
4
1-1-
1-1
2-2
1-3/
2-2
1-3/
2-1-
14/
1-1-
24/
4/2-
24/
4/1-
34/
4/4/
4
1 < U_scalar <= 2 2 < U_scalar <= 3 3 < U_scalar <= 4
Task-set bins
# o
f ta
sk-s
ets
23El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Run-Time Experiments
25 25 25 25 25
1715 15
5
18 17 16
97
3
16
7
1
96
810 10
20
7 8 9
1618
22
911
18
24 24 25
1619
14
25
10%
25%
50%
75%
100%
RV
MP
4x1
2x2
1x4
SM
T-E
DF
SM
T-I
CN
T
RV
MP
4x1
2x2
1x4
SM
T-E
DF
SM
T-I
CN
T
RV
MP
4x1
2x2
1x4
SM
T-E
DF
SM
T-I
CN
T
RV
MP
4x1
2x2
1x4
SM
T-E
DF
SM
T-I
CN
T
0 < U_scalar <= 1 1 < U_scalar <= 2 2 < U_scalar <= 3 3 < U_scalar <= 4
Task-set bins
Su
cces
s ra
te (
%)
Failure
Success
24El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Unsafe Behavior of SMT
16
12
6
Only SMT-EDF successOnly SMT-ICNT successBoth successBoth failure
Failure40%
Success60%
1 < U_scalar <= 2
25El-Haj-Mahmoud © 2005
NC STATE UNIVERSITY
CASES 2005
Summary Novel contributions
• Virtualize a single processor− Space: variable-size interference-free partitions
− Time: rapid reconfiguration
• Simple real-time scheduling approach
Analyzability of MP with flexibility of SMT Co-design processor and real-time scheduling
for analyzable high-performance