Single-core Equivalent Multi-core Technology for...

Single-core Equivalent

Multi-core Technology for Avionics

PI: Lui Sha

[email protected]

Acknowledgement

• Co-PIs: Marco Caccamo, Rodolfo Pellizzoni and Heechul Yun

– Single Core Equivalent architecture framework: Lui Sha

– Memory architecture: Zheng Wu and Rodolfo Pellizzoni

– Memory bandwidth allocation and analysis: Heechul Yun and Gang Yao

– Last level cache management: Renato Mancuso and Marco Caccamo

– I/O management and IMA (re)configuration: Jung-Eun Kim and Man-Ki Yoon

– Architecture Schedulability Analysis and Design Tool (ASIIST): Minyoung Nam

• This work is sponsored in part by ONR, NSF, LMC and RCI.


Single Core Equivalent Configuration for Multicore Avionic Systems

Shared Resources in Multicore Chips

• Q: Can we just migrate IMA software from single core chips to multicore chips core by core?

• A: No, in a multicore chip we need to worry about

– How DRAM ranks and banks are shared, which can make a 4x difference in memory access delay

– Sharing of memory controller bandwidth, because cores can interfere with each other's use of DRAM

– If DRAM pages that map to the same cache page are used by different cores, the cores will evict each other's cache pages

– IMAs in different cores may have major cycles with different rates. The I/O activities in different cores could collide with each other over shared channels.

– Solution: Single Core Equivalency Configuration Architecture allows engineers to treat each core as if it were a single core chip.



The Business Case of SCE Technology

• The avionics industry has a large installed base of certified avionics software developed for single core chips. Certification is one of the most expensive and time-consuming tasks in the development of avionics software.

• Without SCE technology, avionics software integration and certification will incur a cost explosion. The IMA standard requires that software failures in any IMA partition cannot take away computing resources allocated to other IMA partitions. If this rule were not enforced, failures in one partition could lead to cascaded timing failures across partitions.

• Since applications in different partitions may be assigned different DO-178B/C criticality levels, such cascaded failure is unacceptable from a safety perspective. Since applications in different partitions can be developed and certified by different teams or companies at different times, such cascaded failure is also unacceptable from a legal liability and business management perspective.


Single Core Equivalence for Multicore


• Single Core Equivalence (SCE) is a developing technology for resource guarantees on multicore processors

• Three primary targets for this technology:

– Hardware upgrades where multiple software applications running on single core CPUs are being migrated to run together on one multicore CPU

– New programs that must limit the number of multicore CPUs involved due to weight/power concerns and yet must meet challenging real-time deadlines

– Mixed programs where some legacy applications must run alongside new ones with hard real-time deadlines

• The components of SCE are:

– Mechanisms to manage shared cache, memory bandwidth, and I/O bandwidth to ensure adequate resources to meet all hard real-time deadlines

– System modeling tools to allocate resources among applications and verify that the system will meet all deadlines

“Single-Core Equivalence”

Run-time software optimization for multi-core processors reduces code complexity/cost

[Diagram labels: Legacy or New Code; Single Core Equivalence; Core, Core, Core, Core]

BENEFITS:

• Cost/Schedule: Facilitates porting of existing software to multi-core processors

• Decreased Defects: Reduced verification & validation requirements

• Hardware Agility: Decouples software from hardware environment

• Problem 1: Legacy code performance degradation on multi-core processors

• Problem 2: Multi-core optimization drives cost/schedule:

• Increased software complexity

• Increased defects

• Increased verification & validation

• Increased maintenance costs

• Increases hardware switching costs

[Figure: execution time (ms) versus number of processor cores, with GOOD and BAD directions marked; performance degraded 50%. Source: SSC Benchmarking Test (VxWorks on p4080)]

Proposed Solution: Cache and Memory Bandwidth Partition and Control

Sources of the Interference

• Shared LLC: space sharing

• Memory Controller (MC) request queue(s): queuing and scheduling

• DRAM DIMM (Bank 1 to Bank 4): DRAM states

[Figure: Core1, Core2, Core3 and Core4 share the LLC, the memory controller and the DRAM DIMM]

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual

Effect of Memory Contention

• Y-axis: run-time increase ratio due to memory contention

• X-axis: sorted based on memory intensity (SPEC2006 bench)

– 401.bzip2(840MB/s) … 450.soplex(496MB/s) … 447.dealII (207MB/s)


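[Figure: co-run/solo run-time ratio per benchmark, y-axis from 0 to 4.5; bar labels in plot order: 2.4, 1.7, 2.5, 1.7, 1.4, 3.8, 1.7, 1.2. Setup: P4080 with cores C0, C1, ... C7 connected through CoreNet to DDR3 DRAM; App. and Membomb(*)]

(*) Membomb: bandwidth -m 16384 -a write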

MemGuard

• Per-core memory bandwidth reservation to limit interference

• Uses hardware performance counters; implemented in the OS (or hypervisor)
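The reservation idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the MemGuard implementation: it assumes a fixed regulation period and a per-period byte budget derived from the reserved bandwidth, and the names `CoreRegulator`, `budget_for` and `simulate` are hypothetical. In a real system the accounting would come from a hardware performance counter rather than from explicit calls.

```python
from dataclasses import dataclass


@dataclass
class CoreRegulator:
    """Hypothetical per-core regulator: budget is bytes allowed per period."""
    budget_bytes: int
    used_bytes: int = 0
    throttled: bool = False

    def on_period_start(self) -> None:
        # Replenish the budget and release the core at each period boundary.
        self.used_bytes = 0
        self.throttled = False

    def on_memory_traffic(self, nbytes: int) -> int:
        """Account for a request; return how many bytes were actually served."""
        if self.throttled:
            return 0
        served = min(nbytes, self.budget_bytes - self.used_bytes)
        self.used_bytes += served
        if self.used_bytes >= self.budget_bytes:
            # Stands in for the counter event that would stall the core.
            self.throttled = True
        return served


def budget_for(bandwidth_mb_s: float, period_ms: float) -> int:
    """Convert a reserved bandwidth into a per-period byte budget."""
    return int(bandwidth_mb_s * 1_000_000 * period_ms / 1000)


def simulate(reg: CoreRegulator, demand_per_period: list[int]) -> list[int]:
    """Serve one demand value per period and return bytes served each period."""
    served = []
    for demand in demand_per_period:
        reg.on_period_start()
        served.append(reg.on_memory_traffic(demand))
    return served
```

For example, with an assumed 1 ms period, `budget_for(333, 1)` gives a budget of 333000 bytes per period, and a core that asks for more than that in one period is cut off at the budget until the next period starts.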


Performance Isolation of MemGuard

• w/o MemGuard

– Co-run is slowed by 60% compared to solo (i.e., no interfering memory hogs)

• MemGuard@1GB/s

– Each core’s memory b/w is regulated: C0: 1.0GB/s; C1, C2, C3: 333MB/s each

– Co-run is slowed only by 8% compared to solo

[Figure: normalized performance of 450.soplex (y-axis from 0.0 to 1.2), solo vs. co-run, for w/o MemGuard and MemGuard@1GB/s. Setup: P4080 with C0, C1, C2, C3, CPC and shared memory; 450.soplex and interfering memory hogs]

(*) C4-7 are not used

Cache Interference: Cost of interference on last-level cache

• The observed task is heavily influenced by the interfering task

• Tasks running on different cores evict each other’s lines in the last-level cache

[Figure: ARM platform with C0 and C1 sharing the LLC and DRAM; EEMBC task and interfering task; No Interf. vs. Interf. shows a 250% slowdown]

Colored Lockdown: Final goal

• Target model: suffer cache misses in hot regions only once

• During the startup phase, prefetch & lock the hot regions

• Sharp improvement in the schedulability of the system

[Figure: timelines of T1 on CPU1 and T2 on CPU2; legend: startup, memory access, execution, hot region]

Progressive Colored Lockdown: Incrementally locking hot pages

• Angle to time conversion EEMBC benchmark (a2time)

• According to the ranking, an increasing number of pages is locked

• Baseline reached when 4 pages are locked / 81% accesses caught

[Figure panels: 0% Pages locked; 6% Pages locked; 20% Pages locked]
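The ranking step can be sketched as follows: given per-page access counts from a profiling run, order the pages by how often they are touched and see what fraction of accesses the top-k pages would catch if locked. This is a minimal sketch, not the authors' profiler; the function names and the sample counts used to exercise it are hypothetical.

```python
def rank_pages(access_counts: dict[int, int]) -> list[int]:
    """Order page IDs from most to least accessed (ties broken by page ID)."""
    return sorted(access_counts, key=lambda p: (-access_counts[p], p))


def coverage_by_locked_pages(access_counts: dict[int, int]) -> list[float]:
    """coverage[k] is the fraction of accesses hitting the top-k ranked pages."""
    total = sum(access_counts.values())
    coverage = [0.0]
    if total == 0:
        return coverage  # nothing was profiled, so nothing to rank
    caught = 0
    for page in rank_pages(access_counts):
        caught += access_counts[page]
        coverage.append(caught / total)
    return coverage


def pages_needed(access_counts: dict[int, int], target: float) -> int:
    """Smallest number of top-ranked pages whose coverage reaches the target."""
    for k, cov in enumerate(coverage_by_locked_pages(access_counts)):
        if cov >= target:
            return k
    return len(access_counts)
```

With made-up counts such as `{0x10: 50, 0x11: 30, 0x12: 15, 0x13: 5}`, locking the two hottest pages already covers well over half of all accesses, which is the kind of diminishing-returns curve the slide describes.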

Colored Lockdown Effect: Solving cache interference

• A 100% hit ratio is deterministically enforced on hot memory regions

• Performing CL on a limited set of pages is enough to enhance predictability

[Figure panels: No Prot., No Interf.; No Prot., Interf.; Prot., Interf.]


IMA Partition Scheduling with Conflict-Free I/O for Multicore Avionics Systems

Zero-Partition -> I/O Partition


– Zero-partition has been used as a special-purpose ‘I/O partition’.

• In IMA, all I/O is consolidated in one zero-partition.

– Migrating multiple single-core IMAs to a multi-core system.

• Supporting multiple rate groups is required (not harmonic in general).

• Resulting in shared I/O channel conflicts

• Synchronizing challenge among zero-partitions

[Figure: schedules of core k and core k+1; Z: zero-partition]

A Solution - Serializing I/O partitions

• Generate a Multi-IMA Schedule

– For exclusively running I/O transaction

• Dedicated I/O core

– Other non-I/O partitions run concurrently

• No application logic modification

– To avoid substantial recertification cost
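To make the conflict concrete, the sketch below checks whether periodic I/O partitions on different cores ever overlap within one hyperperiod. It is only an illustration: the `IoWindow` fields, the integer time units, and the assumption that each I/O partition fits inside its own period (offset + length <= period) are all hypothetical, not taken from the slides.

```python
from math import lcm
from typing import NamedTuple


class IoWindow(NamedTuple):
    """Hypothetical description of one core's periodic I/O partition."""
    core: int
    period: int   # major cycle length, in time units
    offset: int   # start of the I/O partition within each period
    length: int   # duration of the I/O partition


def occurrences(w: IoWindow, horizon: int) -> list[tuple[int, int]]:
    """All [start, end) intervals of the window within the horizon."""
    return [(t + w.offset, t + w.offset + w.length)
            for t in range(0, horizon, w.period)]


def find_conflicts(windows: list[IoWindow]) -> list[tuple[int, int, int]]:
    """Return (core_a, core_b, time) for each pair of overlapping I/O intervals."""
    # The hyperperiod is the least common multiple of all periods.
    horizon = lcm(*(w.period for w in windows))
    conflicts = []
    for i, a in enumerate(windows):
        for b in windows[i + 1:]:
            for a_start, a_end in occurrences(a, horizon):
                for b_start, b_end in occurrences(b, horizon):
                    if a_start < b_end and b_start < a_end:
                        conflicts.append((a.core, b.core, max(a_start, b_start)))
    return conflicts
```

Two cores with non-harmonic periods 4 and 6 that both start their I/O at offset 0 collide at time 0; shifting the second core's offset by one time unit removes every collision in the hyperperiod of 12.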


Constraint Programming vs. Heuristic


Constraint Programming (CP) / Heuristic: Hierarchical Offset Selection (HOS)

• EVERY solution

• Exponential time complexity

• Most solutions take seconds ~ minutes

[Figures: number of solutions found with a 3-hour time limit; solving time in the cases where solutions are found]

(Experiments on randomly generated synthetic instances with [3-5] cores and [4-7] partitions per core)
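The exponential cost of finding every solution can be seen in a brute-force sketch that enumerates all offset assignments and keeps the conflict-free ones. This is neither the CP model nor the HOS heuristic from the slides; it is a plain exhaustive search written for illustration, with hypothetical function names, integer time units, and one I/O partition per core given as a `(period, length)` pair.

```python
from itertools import product
from math import lcm


def overlaps(p1, o1, l1, p2, o2, l2):
    """True if two periodic windows (period, offset, length) ever overlap."""
    horizon = lcm(p1, p2)
    for s1 in range(o1, horizon, p1):
        for s2 in range(o2, horizon, p2):
            if s1 < s2 + l2 and s2 < s1 + l1:
                return True
    return False


def all_conflict_free_offsets(partitions):
    """Enumerate every offset assignment with no pairwise I/O overlap.

    partitions: list of (period, length) pairs, one I/O partition per core.
    The number of candidates is the product of the per-core offset ranges,
    so it grows exponentially with the number of partitions.
    """
    solutions = []
    # Each window must fit inside its own period, so offsets stop at period - length.
    ranges = [range(0, period - length + 1) for period, length in partitions]
    for offsets in product(*ranges):
        ok = all(
            not overlaps(partitions[i][0], offsets[i], partitions[i][1],
                         partitions[j][0], offsets[j], partitions[j][1])
            for i in range(len(partitions))
            for j in range(i + 1, len(partitions))
        )
        if ok:
            solutions.append(offsets)
    return solutions
```

For two cores that each need one time unit of I/O in a period of two, `all_conflict_free_offsets([(2, 1), (2, 1)])` yields the two assignments where the offsets differ; adding a third such core leaves no solution at all.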

ASIIST: Application Specific I/O Integration Support Tool

Analysis

• Schedulability Analysis

• Bus delay analysis

• End to End flow latency analysis

• Bus utilization

Model

• Software Model

• Hardware Model

• I/O Model

ASIIST

Easier understanding of analysis results

Easier testing of alternative solutions

Understanding Multi-constraints

• Multi-core systems provide the opportunity to pack more applications than ever.

• Implementing Single Core Equivalent (SCE) systems introduces new constraints that were otherwise disregarded or overlooked.

• When using a combination of multiple resource management tools and configurations, constraints can appear from many domain abstractions (bus utilization, no more cache for isolation, memory bandwidth, IMA partition sizes, etc.).

• Designers can make trade-off decisions early if they are more aware of constraint values and budget information, and need not settle for an under-utilized working system.
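One way to picture constraint values and budget information is a per-core table of demand versus budget with the remaining slack for each constraint. The sketch below is only an illustration of that bookkeeping, not how ASIIST works; the constraint names and the functions `constraint_slack` and `evaluate` are hypothetical, and a non-negative slack here is a simple budget check rather than a schedulability analysis.

```python
def constraint_slack(demand: dict[str, float],
                     budget: dict[str, float]) -> dict[str, float]:
    """Slack per constraint: budget minus demand (negative means violated)."""
    return {name: budget[name] - demand.get(name, 0.0) for name in budget}


def evaluate(cores: dict[str, dict[str, float]],
             budget: dict[str, float]) -> dict[str, bool]:
    """Mark each core as feasible only if every constraint has slack >= 0."""
    return {
        core: all(s >= 0 for s in constraint_slack(demand, budget).values())
        for core, demand in cores.items()
    }


# Hypothetical budgets and demands, chosen only to exercise the functions.
budget = {"cpu_utilization": 1.0, "mem_bw_mb_s": 333.0}
cores = {
    "core0": {"cpu_utilization": 0.5, "mem_bw_mb_s": 300.0},
    "core1": {"cpu_utilization": 0.75, "mem_bw_mb_s": 400.0},
}
result = evaluate(cores, budget)
```

Reporting the slack rather than a bare pass/fail is the point of the design: a core with large positive slack on every constraint is a sign of an under-utilized configuration, while a single negative slack pinpoints which budget blocks the design.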

[Figure: Core 0, Core 1, ... Core 7, each with Constraints A, B, C and D; regions labeled Non-schedulable and Schedulable]

Multi-core Modeling Support

• Multi-core architectures can be modeled with chip-level detail.

• Each application entails I/O traffic and memory bandwidth.

• Single Core Equivalent (SCE) systems require users to be more knowledgeable about coexisting I/O and memory access requirements.

• ASIIST enables users to model such data flows semi-automatically and provides a visual interface to understand and evaluate existing and alternative designs.

[Figure labels: Core 0, Core 1, Core 7; CoreNet; SDRAM A, SDRAM B; Bridge, FPGA, ASIC; Peripheral]
