Single-core Equivalent Multi-core Technology for...
Acknowledgement
• Co-PIs: Marco Caccamo, Rodolfo Pellizzoni and Heechul Yun
– Single Core Equivalent architecture framework: Lui Sha
– Memory architecture: Zheng Wu and Rodolfo Pellizzoni
– Memory bandwidth allocation and analysis: Heechul Yun and Gang Yao
– Last level cache management: Renato Mancuso and Marco Caccamo
– I/O management and IMA (re)configuration: Jung-Eun Kim and Man-Ki Yoon
– Architecture Schedulability Analysis and Design Tool (ASIIST): Minyoung Nam
• This work is sponsored in part by ONR, NSF, LMC and RCI.
Single Core Equivalent Configuration for Multicore Avionic Systems
Shared Resources in Multicore Chips
• Q: Can we just migrate IMA software from single-core chips to multicore chips, core by core?
• A: No. In a multicore chip we need to worry about:
– DRAM rank and bank sharing: how ranks and banks are shared can make a 4x difference in memory access delay
– Memory controller bandwidth sharing: cores can interfere with each other in their use of DRAM
– Cache sharing: if DRAM pages that map to the same cache page are used by different cores, the cores will evict each other's cache pages
– I/O sharing: IMAs on different cores may have major cycles with different rates, so I/O activities on different cores could collide over shared channels
• Solution: Single Core Equivalency Configuration Architecture allows engineers to treat each core as if it were a single-core chip.
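The cache-page conflict described above can be illustrated with a small page-coloring calculation. This is only a sketch under assumed parameters: a physically indexed, 2 MB, 16-way last-level cache with 4 KB pages and no index hashing. None of these figures come from the slides, and the function names are hypothetical. Under those assumptions, two physical pages compete for the same cache sets exactly when they have the same color.

```python
PAGE_SIZE = 4096                 # bytes per page (assumed)
CACHE_SIZE = 2 * 1024 * 1024     # last-level cache size in bytes (assumed)
WAYS = 16                        # cache associativity (assumed)

# One "way" of the cache covers this many bytes of the physical address space
# before the set index wraps around.
WAY_SIZE = CACHE_SIZE // WAYS
# Number of distinct page colors: pages of the same color share cache sets.
NUM_COLORS = WAY_SIZE // PAGE_SIZE


def page_color(phys_addr: int) -> int:
    """Return the cache color of the page containing phys_addr."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS


def pages_conflict(addr_a: int, addr_b: int) -> bool:
    """True if the two pages map to the same cache sets (same color)."""
    return page_color(addr_a) == page_color(addr_b)


if __name__ == "__main__":
    # Pages NUM_COLORS apart share a color and can evict each other.
    print(NUM_COLORS, pages_conflict(0, NUM_COLORS * PAGE_SIZE))
```

Assigning disjoint color sets to different cores is one way to keep their pages from evicting each other, at the cost of shrinking the cache share available to each core.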
The Business Case of SCE Technology
• The avionics industry has a large installed base of certified avionics software developed for single-core chips. Certification is one of the most expensive and time-consuming tasks in the development of avionics software.
• Without SCE technology, avionics software integration and certification will incur a cost explosion. The IMA standard requires that a software failure in any IMA partition cannot take away computing resources allocated to other IMA partitions. If this rule were not enforced, failures in one partition could lead to cascaded timing failures across partitions.
• Since applications in different partitions may be assigned different DO-178B/C criticality levels, such cascaded failures are unacceptable from a safety perspective. Since applications in different partitions can be developed and certified by different teams or companies at different times, such cascaded failures are also unacceptable from a legal liability and business management perspective.
Single Core Equivalence for Multicore
• Single Core Equivalence (SCE) is a developing technology for resource guarantees on multicore processors
• Three primary targets for this technology:
– Hardware upgrades where multiple software applications running on single-core CPUs are being migrated to run together on one multicore CPU
– New programs that must limit the number of multicore CPUs involved due to weight/power concerns and yet must meet challenging real-time deadlines
– Mixed programs where some legacy applications must run alongside new ones with hard real-time deadlines
• The components of SCE are:
– Mechanisms to manage shared cache, memory bandwidth, and I/O bandwidth to ensure adequate resources to meet all hard real-time deadlines
– System modeling tools to allocate resources among applications and verify that the system will meet all deadlines
“Single-Core Equivalence”
Run-time software optimization for multi-core processors reduces code complexity/cost
[Diagram labels: Legacy or New Code; Single Core Equivalence; Core, Core, Core, Core]
BENEFITS:
• Cost/Schedule: Facilitates porting of existing software to multi-core processors
• Decreased Defects: Reduced verification & validation requirements
• Hardware Agility: Decouples software from hardware environment
• Problem 1: Legacy code performance degradation on multi-core processors
• Problem 2: Multi-core optimization drives cost/schedule:
– Increased software complexity
– Increased defects
– Increased verification & validation
– Increased maintenance costs
– Increased hardware switching costs
[Figure: execution time (ms) vs. number of processor cores, annotated BAD, GOOD, and "Performance Degraded 50%". Source: SSC Benchmarking Test (VxWorks on P4080)]
Proposed Solution:
Cache and Memory Bandwidth
Partition and Control
Sources of the Interference
[Diagram: Core1 to Core4 above a shared LLC, a Memory Controller (MC) with request queue(s), and a DRAM DIMM with Banks 1 to 4]
• Shared LLC: space sharing
• Memory Controller (MC) request queue(s): queuing and scheduling
• DRAM DIMM banks: DRAM states
(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
Effect of Memory Contention
• Y-axis: run-time increase ratio due to memory contention
• X-axis: sorted based on memory intensity (SPEC2006 bench)
– 401.bzip2(840MB/s) … 450.soplex(496MB/s) … 447.dealII (207MB/s)
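[Setup diagram: P4080 cores C0, C1, ..., C7 connected through CoreNet to DDR3 DRAM, running App and Membomb(*)]
(*) Membomb: bandwidth -m 16384 -a write
[Figure: co-run/solo run-time ratio per benchmark, y-axis 0 to 4.5; extracted bar labels: 2.4, 1.7, 2.5, 1.7, 1.4, 3.8, 1.7, 1.2]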
MemGuard
• Per-core memory bandwidth reservation to limit interference
• Uses performance counters; implemented in the OS (or hypervisor)
Performance Isolation of MemGuard
• w/o MemGuard
– Co-run is slowed by 60% compared to solo (i.e., no interfering memory hogs)
• MemGuard@1GB/s
– Each core's memory bandwidth is regulated: C0: 1.0GB/s; C1, C2, C3: 333MB/s each
– Co-run is slowed by only 8% compared to solo
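The per-core regulation described above can be sketched as a budget that is refilled every regulation period and consumed by memory accesses. The following is a toy simulation, not MemGuard's actual implementation: the 1 ms period, the 64-byte access size, the class and function names, and the demand figures are all assumptions for illustration. Only the 1.0GB/s and 333MB/s bandwidth settings are taken from the slide. A real regulator would rely on hardware performance counters inside the OS or hypervisor rather than on an explicit loop.

```python
LINE_BYTES = 64      # bytes moved per memory access (assumed cache-line size)
PERIOD_US = 1000     # regulation period in microseconds (assumed: 1 ms)


class CoreRegulator:
    """Tracks one core's memory-access budget within a regulation period."""

    def __init__(self, bandwidth_bytes_per_s: int):
        # Convert the reserved bandwidth into a number of accesses per period.
        self.budget = bandwidth_bytes_per_s * PERIOD_US // (1_000_000 * LINE_BYTES)
        self.used = 0

    def start_period(self) -> None:
        """Refill the budget at the start of each period."""
        self.used = 0

    def try_access(self) -> bool:
        """Consume one access from the budget; False means the core is throttled."""
        if self.used >= self.budget:
            return False
        self.used += 1
        return True


def simulate(regulators, demand, periods):
    """Return the number of accesses each core gets served over `periods` periods."""
    served = {core: 0 for core in regulators}
    for _ in range(periods):
        for core, reg in regulators.items():
            reg.start_period()
            for _ in range(demand[core]):
                if not reg.try_access():
                    break  # throttled until the next period
                served[core] += 1
    return served


# Bandwidth settings from the slide: C0 at 1.0GB/s, C1 to C3 at 333MB/s each.
regulators = {
    "C0": CoreRegulator(1_000_000_000),
    "C1": CoreRegulator(333_000_000),
    "C2": CoreRegulator(333_000_000),
    "C3": CoreRegulator(333_000_000),
}
# Hypothetical demand in accesses per period: C0 stays under budget, C1 to C3 are hogs.
demand = {"C0": 10_000, "C1": 50_000, "C2": 50_000, "C3": 50_000}
served = simulate(regulators, demand, periods=10)
```

In this sketch the hogs are capped at their per-period budget while C0's demand is served in full, which is the isolation property the reservation is meant to provide.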
[Figure: normalized performance of 450.soplex, solo vs. co-run, without MemGuard and with MemGuard@1GB/s; y-axis 0.0 to 1.2]
[Setup diagram: P4080 cores C0 to C3 sharing the CPC and memory; 450.soplex runs alongside interfering memory hogs]
(*) C4-7 are not used
Cache Interference: Cost of interference on last-level cache
• The observed task is heavily influenced by the interfering task
• Tasks running on different cores evict each other’s lines in the last-level cache
[Figure: EEMBC task without and with an interfering task (No Interf. vs. Interf.) on an ARM platform with cores C0 and C1 sharing the LLC and DRAM; 250% slowdown]
Colored Lockdown: Final goal
• Target model: suffer cache misses in hot regions only once
• During the startup phase, prefetch & lock the hot regions
• Sharp improvement in the schedulability of the system
[Diagram: timelines for T1 on CPU1 and T2 on CPU2, showing startup, memory access, execution, and hot region]
Progressive Colored Lockdown: Incrementally locking hot pages
• Benchmark: EEMBC angle-to-time conversion (a2time)
• An increasing number of pages is locked, following the ranking
• Baseline performance is reached when 4 pages are locked (81% of accesses caught)
[Figure panels: 0% pages locked; 6% pages locked; 20% pages locked]
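The incremental selection of hot pages can be sketched as follows: rank pages by profiled access count, then lock pages in rank order until a target share of all accesses is covered. This is a minimal illustration, not the authors' tool. The function name, the profile data, and the 80% target are hypothetical and are not a2time measurements.

```python
def pages_to_lock(access_counts, target_coverage):
    """Pick the hottest pages until they cover target_coverage of all accesses.

    access_counts: dict mapping page id -> profiled number of accesses.
    Returns (list of page ids in rank order, fraction of accesses covered).
    """
    total = sum(access_counts.values())
    if total == 0:
        return [], 0.0
    # Rank pages from most to least accessed.
    ranked = sorted(access_counts.items(), key=lambda kv: kv[1], reverse=True)
    locked, covered = [], 0
    for page, count in ranked:
        if covered / total >= target_coverage:
            break  # enough accesses are already caught by the locked pages
        locked.append(page)
        covered += count
    return locked, covered / total


# Hypothetical profile: page id -> access count (not measured data).
profile = {0x10: 500, 0x11: 200, 0x12: 150, 0x13: 100, 0x14: 30, 0x15: 20}
locked, coverage = pages_to_lock(profile, 0.8)
```

With a skewed profile like this one, a small number of pages catches most accesses, which is the effect the slide reports for a2time.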
Colored Lockdown Effect: Solving cache interference
• A 100% hit ratio is deterministically enforced on hot memory regions
• Performing Colored Lockdown (CL) on a limited set of pages is enough to enhance predictability
[Figure: three configurations compared: No Prot., No Interf.; No Prot., Interf.; Prot., Interf.]
IMA Partition Scheduling with Conflict-Free I/O
for Multicore Avionics Systems
Zero-Partition -> I/O Partition
• Zero-partition has been used as a special-purpose ‘I/O partition’.
– In IMA, all I/O is consolidated in one zero-partition.
• Migrating multiple single-core IMAs to a multi-core system:
– Supporting multiple rate groups is required (not harmonic in general).
– This results in shared I/O channel conflicts.
– Synchronizing zero-partitions is a challenge.
[Diagram: schedules of core k and core k+1, with zero-partitions marked Z]
A Solution - Serializing I/O partitions
• Generate a Multi-IMA Schedule
– I/O transactions run exclusively
• Dedicated I/O core
– Other non-I/O partitions run concurrently
• No application logic modification
– To avoid substantial recertification cost
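The idea of serializing I/O partitions can be sketched as an offset-assignment problem: each core's I/O partition repeats with its own period, and offsets are chosen so that no two I/O windows overlap anywhere in the hyperperiod. The code below is a simple greedy first-fit search in integer time units. It is an illustrative stand-in, not the Constraint Programming formulation or the heuristic used by the authors, and the partition names, periods, and window lengths are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def busy_slots(period, offset, length, hyperperiod):
    """Time slots occupied by a periodic I/O window within one hyperperiod."""
    slots = set()
    for start in range(offset, hyperperiod, period):
        slots.update(range(start, start + length))
    return slots


def assign_offsets(partitions):
    """Greedy first-fit offset search.

    partitions: list of (name, period, io_length) tuples.
    Returns {name: offset} with pairwise non-overlapping I/O windows,
    or None if this greedy search fails (a solution may still exist).
    """
    hyperperiod = reduce(lcm, (period for _, period, _ in partitions))
    occupied = set()
    offsets = {}
    for name, period, length in partitions:
        # Keep each window inside its own period: offset + length <= period.
        for offset in range(period - length + 1):
            slots = busy_slots(period, offset, length, hyperperiod)
            if not (slots & occupied):
                occupied |= slots
                offsets[name] = offset
                break
        else:
            return None
    return offsets


# Hypothetical I/O partitions with non-harmonic periods (4 and 6).
example = [("core0_io", 4, 1), ("core1_io", 6, 1), ("core2_io", 12, 2)]
offsets = assign_offsets(example)
```

Because only the offsets of the I/O windows change, a schedule of this kind leaves the application logic inside each partition untouched.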
Constraint Programming vs. Heuristic
Compared approaches: Constraint Programming (CP) and the heuristic Hierarchical Offset Selection (HOS)
• EVERY solution
• Exponential time complexity
• Most solutions take seconds ~ minutes
[Figures: number of solutions found within a 3-hour time limit; solving time for the cases in which solutions are found]
(Experiments on randomly generated synthetic instances with [3-5] cores and [4-7] partitions per core)
ASIIST: Application Specific I/O Integration Support Tool
Analysis
• Schedulability analysis
• Bus delay analysis
• End-to-end flow latency analysis
• Bus utilization
Model
• Software model
• Hardware model
• I/O model
ASIIST
Easier understanding of analysis results
Easier testing of alternative solutions
Understanding Multi-constraints
• Multi-core systems provide an opportunity to pack in more applications than ever.
• Implementing Single Core Equivalent (SCE) systems introduces new constraints that were otherwise disregarded or overlooked.
• When using a combination of multiple resource management tools and configurations, constraints can appear from many domain abstractions (bus utilization, no more cache for isolation, memory bandwidth, IMA partition sizes, etc.).
• Designers can make trade-off decisions early if they are more aware of constraint values and budget information, rather than settling on an under-utilized working system.
[Figure: Constraints A to D shown per core (Core 0, Core 1, ..., Core 7), with configurations labeled Non-schedulable and Schedulable]
Multi-core Modeling Support
• Multi-core architectures can be modeled with chip-level detail.
• Each application entails I/O traffic and memory bandwidth.
• Single Core Equivalent (SCE) systems require users to be more knowledgeable about coexisting I/O and memory access requirements.
• ASIIST enables users to model such data flows semi-automatically and provides a visual interface to understand and evaluate existing and alternative designs.
[Diagram labels: Core 0, Core 1, ..., Core 7; CoreNet; SDRAM A, SDRAM B; Peripheral (x2); Bridge, FPGA, ASIC]