Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf ·...

11
1 Runtime Mapping for Lifetime Reliability and Throughput Optimization in Many-Core Systems Liang Wang, Xiaohang Wang, Member, IEEE, Ho-fung Leung, Senior member, IEEE, Terrence Mak, Member, IEEE,Jie Han Member, IEEE and Leibo Liu Abstract—Due to technology scaling and increasing power density, lifetime reliability is becoming one of major design constraints in the design of future many-core systems. In this paper, we propose a runtime task mapping scheme which can dynamically map applications given a lifetime reliability constraint. A borrowing strategy is adopted to satisfy the lifetime reliability constraint in a long-term scale, while the lifetime constraint can be relaxed in local time scale when the commu- nication requirement is high. The applications are classified into communication-intensive applications and computation-intensive applications, which are mapped using different strategies. The threshold to classify the applications is determined dynamically according to current locations of all available cores besides communication to computation ratio of applications. The pro- posed runtime mapping scheme can improve the throughput because the communication distance of communication intensive applications is optimized and meanwhile the waiting time of computation intensive application is reduced. The experimental results show that compared to the state-of-the-art lifetime- constrained mapping, the proposed mapping schemes can have up to 50.6% throughput improvement with only a small penalty on the communication cost. Index Terms—IEEE, IEEEtran, journal, L A T E X, paper, tem- plate. I. I NTRODUCTION Aggressive technology scaling has enabled integration of tens or hundreds of cores on a single chip [1], [2], which is known as many-core processor. The many-core systems provide high performance computing services for servers, data-centers, clusters, etc. The workloads on such a system usually exhibit dynamic feature, i.e. applications with differ- ence characteristics arrive and leave the system dynamically. This necessitates an efficient runtime application mapping scheme, which can provide high throughput for the many- Liang Wang is with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong and the with the Institute of Microelectronics, Tsinghua University, Beijing, China. Email: [email protected] Xiaohang Wang is with the School of Software Engineering, South China University of Technology, and Guangzhou Institute of Advanced Technology, Guangzhou, China. Email: [email protected] Ho-fung Leung is with the Department of Computer Science and Engineer- ing, The Chinese University of Hong Kong, Shatin, NT, Hong Kong. Email: [email protected] Terrence Mak is with the School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, United Kingdom. Email: [email protected] Jie Han is with the Department of Electrical and Computer Engineering, University of Alberta, Canada. Email: [email protected] Leibo Liu is with the Institute of Microelectronics, Tsinghua University, Beijing, China. Email: [email protected] Application mapping area Runtime Mapping An incoming application Low-aged core High-aged core An incoming application (a) Mapping scheme 1 (b) Mapping scheme 2 Runtime Mapping Application mapping area Fig. 1. Motivation example. (a) The application is mapped to a square region. (b) The application is mapped to a non-square region. core system while satisfying system constraints such as limited resources, thermal design power (TDP), lifetime reliability, etc. Because of technology scaling, lifetime reliability is becom- ing one of major design challenges for many-core system [3], [4], [5], [6]. The transistors are affected by various aging mechanisms such as Electro-migration (EM), Time Dependent Dielectric Breakdown (TDDB), Negative Bias Temperature Instability (NBTI), Hot Carrier Injection (HCI), etc [7], [8], which can eventually result from permanent failures of transis- tors. Mean-time-to-failure (MTTF), a widely used metric for lifetime reliability, is used to describe the expected lifetime of transistors [7]. To address the aging issues of many-core systems, various techniques such as dynamic voltage and fre- quency scaling (DVFS) [9], [3], [6], task mapping [10], [11], [12], etc. have been proposed. In addition, some studies have also taken the lifetime reliability as a design constraint [13], [14], [12] such that the performance is optimized while meeting a target MTTF. Existing state-of-the-art lifetime- constrained runtime task mapping algorithm [12] tries to map the applications to near-square regions in order to reduce the communication overhead and external contentions [15]. However, we notice that it is possible to lead to low throughput if the applications are always mapped to near-square regions. Fig. 1 presents an example, in which an application with 4 tasks is mapped to a 3 × 3 many-core system 1 . In the many-core system, the core in the central region is a high- aged core while others are less-aged. If the high-aged core has violated the pre-defined lifetime constraint, the incoming application has to wait until the core recovers from high- aged status. In Fig. 1(a), the application is mapped to a square region for lower communication overhead with the high-aged core selected. This would possibly lead to low 1 It is assumed that the maximum number of tasks on each core is 1.

Transcript of Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf ·...

Page 1: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

1

Runtime Mapping for Lifetime Reliability andThroughput Optimization in Many-Core Systems

Liang Wang, Xiaohang Wang, Member, IEEE, Ho-fung Leung, Senior member, IEEE,Terrence Mak, Member, IEEE,Jie Han Member, IEEE and Leibo Liu

Abstract—Due to technology scaling and increasing powerdensity, lifetime reliability is becoming one of major designconstraints in the design of future many-core systems. In thispaper, we propose a runtime task mapping scheme whichcan dynamically map applications given a lifetime reliabilityconstraint. A borrowing strategy is adopted to satisfy the lifetimereliability constraint in a long-term scale, while the lifetimeconstraint can be relaxed in local time scale when the commu-nication requirement is high. The applications are classified intocommunication-intensive applications and computation-intensiveapplications, which are mapped using different strategies. Thethreshold to classify the applications is determined dynamicallyaccording to current locations of all available cores besidescommunication to computation ratio of applications. The pro-posed runtime mapping scheme can improve the throughputbecause the communication distance of communication intensiveapplications is optimized and meanwhile the waiting time ofcomputation intensive application is reduced. The experimentalresults show that compared to the state-of-the-art lifetime-constrained mapping, the proposed mapping schemes can haveup to 50.6% throughput improvement with only a small penaltyon the communication cost.

Index Terms—IEEE, IEEEtran, journal, LATEX, paper, tem-plate.

I. INTRODUCTION

Aggressive technology scaling has enabled integration oftens or hundreds of cores on a single chip [1], [2], whichis known as many-core processor. The many-core systemsprovide high performance computing services for servers,data-centers, clusters, etc. The workloads on such a systemusually exhibit dynamic feature, i.e. applications with differ-ence characteristics arrive and leave the system dynamically.This necessitates an efficient runtime application mappingscheme, which can provide high throughput for the many-

Liang Wang is with the Department of Computer Science and Engineering,The Chinese University of Hong Kong, Shatin, NT, Hong Kong and the withthe Institute of Microelectronics, Tsinghua University, Beijing, China. Email:[email protected]

Xiaohang Wang is with the School of Software Engineering, South ChinaUniversity of Technology, and Guangzhou Institute of Advanced Technology,Guangzhou, China. Email: [email protected]

Ho-fung Leung is with the Department of Computer Science and Engineer-ing, The Chinese University of Hong Kong, Shatin, NT, Hong Kong. Email:[email protected]

Terrence Mak is with the School of Electronics and Computer Science,University of Southampton, Southampton SO17 1BJ, United Kingdom. Email:[email protected]

Jie Han is with the Department of Electrical and Computer Engineering,University of Alberta, Canada. Email: [email protected]

Leibo Liu is with the Institute of Microelectronics, Tsinghua University,Beijing, China. Email: [email protected]

Application

mapping area

Runtime

Mapping

An incoming

application

Low-aged core

High-aged core

An incoming

application

(a) Mapping scheme 1 (b) Mapping scheme 2

Runtime

Mapping

Application

mapping area

Fig. 1. Motivation example. (a) The application is mapped to a square region.(b) The application is mapped to a non-square region.

core system while satisfying system constraints such as limitedresources, thermal design power (TDP), lifetime reliability, etc.

Because of technology scaling, lifetime reliability is becom-ing one of major design challenges for many-core system [3],[4], [5], [6]. The transistors are affected by various agingmechanisms such as Electro-migration (EM), Time DependentDielectric Breakdown (TDDB), Negative Bias TemperatureInstability (NBTI), Hot Carrier Injection (HCI), etc [7], [8],which can eventually result from permanent failures of transis-tors. Mean-time-to-failure (MTTF), a widely used metric forlifetime reliability, is used to describe the expected lifetimeof transistors [7]. To address the aging issues of many-coresystems, various techniques such as dynamic voltage and fre-quency scaling (DVFS) [9], [3], [6], task mapping [10], [11],[12], etc. have been proposed. In addition, some studies havealso taken the lifetime reliability as a design constraint [13],[14], [12] such that the performance is optimized whilemeeting a target MTTF. Existing state-of-the-art lifetime-constrained runtime task mapping algorithm [12] tries to mapthe applications to near-square regions in order to reducethe communication overhead and external contentions [15].However, we notice that it is possible to lead to low throughputif the applications are always mapped to near-square regions.Fig. 1 presents an example, in which an application with4 tasks is mapped to a 3 × 3 many-core system1. In themany-core system, the core in the central region is a high-aged core while others are less-aged. If the high-aged corehas violated the pre-defined lifetime constraint, the incomingapplication has to wait until the core recovers from high-aged status. In Fig. 1(a), the application is mapped to asquare region for lower communication overhead with thehigh-aged core selected. This would possibly lead to low

1It is assumed that the maximum number of tasks on each core is 1.

Page 2: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

2

throughput due to long application waiting time. However, theapplications can have various communication characteristics,i.e. some applications are more computation-intensive. Forthose applications, it is not necessary to map to near-squareregions as the communication distance is less important. Asshown in Fig. 1(b), the application is mapped to a non-squareregion without occupying the high-aged core. Despite of slightincrease of the average communication distance, the systemthroughput can be improved because of shorter waiting timefor the recovery of high-aged core.

Another important feature of dynamic lifetime reliability(DRM) management is borrowing strategy, which is widelyused in lifetime-constraint designs [9], [16], [17]. Borrowingstrategy indicates that the lifetime constraint can be locallyrelaxed for better performance in local time window while stillmeeting the constraint over the large time window. However,most prior studies exploit borrowing strategy for DRM byDVFS in single-core or multi/many-core processors. And bor-rowing strategy is less exploited to manage lifetime reliabilityby runtime task mapping.

In this paper, we propose a runtime task mapping schemewith lifetime reliability as constraint. The runtime task map-ping scheme aim to improve the global throughput of thesystem given a lifetime reliability constraint. The borrowingstrategy is adopted to meet the lifetime constraint in a long-term scale. In other words, if the incoming application is acommunication intensive applications, it is mapped to a near-square area with lifetime constraint locally relaxed because ofhigher communication performance requirement in local timewindow. Meanwhile, the computation intensive applicationsare mapped to an area with higher dispersion to free thehigh-aged cores. The global throughput is improved becausethe communication performance of communication intensiveapplications is optimized, and meanwhile the waiting time ofcomputation intensive application is reduced. A preliminarywork has been presented in [18]. This paper extends thepreliminary work by adding substantial analyses, a novelmethod to classify applications and more experimental resultsincluding evaluations on realistic applications. The main con-tributions of this paper include:

(1) Propose a lifetime-constraint runtime task mappingscheme which exploits borrowing strategy to improve thesystem throughput by mapping computation intensive andcommunication intensive application with different strategies.

(2) Propose an improved neighborhood allocation schemefor runtime mapping.

(3) Propose a novel method to classify the applications intocomputation intensive and communication intensive applica-tions.

The remainder of the paper is organized as follows. Sec-tion II briefly introduces the related work. Section III discussesthe models of lifetime reliability and applications. Section IVpresents the problem statement. Section V proposes the run-time mapping. Section VI analyzes the experimental resultsand Section VII concludes this paper.

II. RELATED WORK

Lifetime-aware task mapping problems, which incorporateMTTF as an optimization objective or a constraint, have beensolved with both design-time and runtime solutions.

A. Design-time Solutions

Huang et al. [10] define an aging effect for multi-core sys-tem, and propose a task mapping problem to maximize MTTFof multi-core system given performance constraints. In [13],Huang et al. propose another task mapping problem whichminimizes the energy consumption under a lifetime constrainton multi-core system. Both the problems in [10], [13] aresolved by simulate annealing (SA) at design-time. Meyer etal. [19] propose a slack allocation for performance and cost-constrained applications. By a design-time task mapping, thelifetime and manufacturing cost are jointly optimized. Zhuet al. [20] exploit task mapping, processor-level structuralredundancy, floorplan, etc. to improve the MTTF of multi-core system given functionality and timing constraints. Allthese approaches address the lifetime reliability issues throughdesign-time mapping.

B. Runtime Solutions

Hartman et al. [21] propose a lifetime-aware runtime map-ping heuristics for NoC-based chip multiprocessors. In theirheuristics, the cores with smaller accumulated aging effectsare favored during the mapping process. Thus the heuristicsminimizes the accumulated aging effects of the cores andextends the lifetime. Rathore et al. [22] consider processvariation and propose a design-time/runtime hybrid methodto optimize lifetime of the many-core system. At design-time, a simulated annealing algorithm is adopted to find apareto-optimal mapping. At runtime, the tasks are migratedto healthier cores pro-actively. Sun et al. [23] propose aworkload balancing scheme for multi-core systems to relaxNBTI stressed cores. First, a capacity rate is defined for eachcore according to NBTI as an indication to accept workload.Then, a dynamic zoning algorithm is proposed to group coresinto zone and a dynamic task scheduling algorithm is proposedto allocate tasks into each zone with balanced workload. Thework extends 30% MTTF for multicore systems. All thesemethods target at maximizing MTTF of system. Some otherresearchers take the lifetime as a constraint to meet targetMTTF at runtime. Gnad et al. [14] propose a run-time lifetimemanagement system which harnesses dark silicon to decelerateaging, and improves the overall system performance for agiven lifetime. However, their method targets at malleableapplication which has a varying degree of parallelism, thusnot applicable for task graph based applications. The methodin [12] extends the runtime mapping in [24] to take the aginginformation into consideration when selecting the first node.Based on the first node selection method, a runtime mappinggiven lifetime constraint is proposed. However, the methoddoes not exploit the borrowing strategy feature of DRM. Inother words, it is neglected that the lifetime constraint can betemporally locally relaxed for better performance requirement.

Page 3: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

3

In this paper, some applications are mapped in dispersedregions to improve throughput. Fattah et al. [25] propose asimilar approach, in which a parameter α named allowed levelof dispersion is defined. They show that when 0 < α < 1,higher throughput can be achieved because of less waitingtime. However, the allowed level of dispersion keeps constantfor all applications.

Different from the approach in [25], we propose a hybridmapping which classifies the applications into communicationintensive applications and computation intensive applications.The computation intensive applications are allowed to map indispersed regions while the communication intensive applica-tions can only be mapped in compact regions. Meanwhile,the lifetime-constrained runtime mapping incorporates theborrowing strategy [9] to optimize the communication distanceas the first priority in local time scale, and manages the lifetimein a long-term scale.

III. MODELS

A. Lifetime Reliability

The aging mechanisms for lifetime reliability includeElectro-migration (EM), time-dependent dielectric breakdown(TDDB), stress migration (SM), negative bias temperatureinstability (NBTI), hot carrier injection (HCI), etc. In thispaper, we only consider EM-induced aging mechanism. Otheraging mechanisms can be easily incorporated using Sum-of-Failure-Rate (SOFR) models [13].

The lifetime reliability can be modeled according to Weibulldistribution [10]. The MTTF is given as

MTTF = αΓ(1 +1

β) (1)

where α is related to operating conditions such as temperature,switching activity, frequency, etc. Γ is the gamma function. βis the slope parameter for Weibull distribution. Consideringthe runtime variation of operating conditions, an aging effectfor a processor is also defined in [10], shown in Eq. 2.

A =∑ ∆ti

αi(2)

where ∆ti is the duration of the i-th time interval. αi is thescale parameter in the i-th time interval. The aging effect inte-grates all operating conditions including temperature, voltage,frequency and utilization into a single value.

To satisfy the pre-defined lifetime constraint, the agingeffect should be less that the target aging effect. We definea lifetime budget in Eq. 3.

∆A(t) = Atarget(t)−A(t) > 0 (3)

A positive lifetime budget value means the lifetime con-straint is satisfied and vice versa. For lifetime reliability man-agement, the lifetime can be regarded as a resource consumedover time. Atarget(t) indicates the given resource, and A(t)indicates the consumed resource. As soon as the consumedresource is less than the given resource, the lifetime constraintis satisfied. Eq. 2 and Eq. 3 also indicate that if a healthy coreis over stressed for a period, the lifetime budget decreases andeven the lifetime constraint is violated. On the contrary, if a

high-aged core (∆A < 0) is less stressed for a period, it canrecover from high-aged status.

B. Application Model

Each application can be modeled as a directed task graphAG = (T,E), where T is the set of tasks, and E is the set ofdirected edges. Each task ti ∈ T has a worst-case executiontime when mapped to a core. Each edge ei,j ∈ E representsdata dependency between the source task ti and destinationtask tj . The weight wi,j represents the communication volumefrom task ti to task tj . The execution time of an applicationor the makespan of a task graph is determined by both taskexecution time and communication time.

C. Platform Description

The many-core system consists of a set of homogeneouscores which are connected by a 2D-mesh on-chip interconnectnetwork. The platform is represented as a directed graphG(P,L), where P represents the set of cores ni and li,j ∈ Lrepresents the physical channel between two cores ni andnj . Each core is associated with a ∆A value. The resourcemanagement manager maintains an aging map, which containsthe ∆A values of all cores. If the size of the many-core systemis M ×N , the aging map can be represented by a matrix inEq. 4.

AM =

∆A1,1 · · · ∆A1,N

.... . .

...

∆AM,1 · · · ∆AM,N

(4)

D. Overview of the Proposed Methodology

A task mapping defines how an application is mapped ona set of cores. It is assumed the maximum number of tasksrunning on a core is 1. The mapping function map(ti) defineswhich core that the task ti is mapped on. The application map-ping and resource management are controlled by a centralizedresource manager. Fig. 2 shows the diagram of the proposedlifetime-constrained mapping. Fig. 2(a) shows the applicationswhich are represented by task graphs. In Fig. 2(b), an agingmap is used to keep the lifetime budgets ∆A of the cores in themany-core system. Fig. 2(c) shows the many cores which areinterconnected by on-chip network. At runtime, the resourcemanager takes the application information and platform agingmap as the inputs of the mapping algorithm. The output of thealgorithm specifies how the tasks of the incoming applicationsare mapped to the many-core system.

IV. PROBLEM STATEMENT

At runtime, there are N applications arriving at the many-core system within a given time period. To optimize theglobal throughput, the objective is to minimize the totalmakespan of all applications. The decision variable is the task-to-core mapping of each application. The makespan of eachapplication is represented as

σi = tifinish − tiarrive (5)

Page 4: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

4

Mapping

algorithm

Runtime resource

manager

...

Arriving applications at runtime

(App1, t1) (Appi, ti)

Extracted Aging Maps

Many-core

ΔA11·····ΔA1M

··········ΔAMM

······

······

App 3

App 1

App 2

(a)

(b) (c)

Fig. 2. Task mapping overview

where tiarrive and tifinish are the arrival time and the finishingtime of the i-th application respectively. Waiting time occurswhen an application arrives at the system but there are nosufficient resources (cores, power, lifetime, etc.) to run it. Theexecution time of an application includes the execution andcommunication of the tasks.

In this paper, the throughput is defined as the averagenumber of applications finished within a time period, at theend of which the lifetime constraint of the system is satisfied.

Throughput =N

t− t0arrive(6)

where t0arrive is the arrival time of the first application. Toimprove the throughput, the objective is to

min t (7)

subject tot ≥ tN−1finish (8)

∆Aj(t) > 0,∀nj ∈ P (9)

Eq. 8 defines the constraint that the all N applicationsshould have finished at time t. Eq. 9 defines the lifetimeconstraint which should be satisfied at time t, otherwise thehigh-aged cores should be kept free until the lifetime constraintis met.

V. PROPOSED APPROACH FOR LIFETIME CONSTRAINTMAPPING

In this section, we first propose a lifetime-constraint map-ping scheme. Fig. 3 presents the flow for task mappingscheme. First the centralized manager updates the aging mapof the many-core system periodically. When a new applicationarrives, the mapping algorithm is called to map the applicationonto the many-core system. The mapping algorithm includestwo steps (1) first node selection (2) mapping other tasksaround the first node. To satisfy the lifetime constraint, theapplication is put into the waiting queue if there are notenough cores or lifetime budget to map. Then based on thecharacteristics of the application, the applications are mappedwith either computation-biased mapping or communication-biased mapping.

Runtime

Resource

Manager

Collect status of

cores (occupied flag,

lifetime budget)

An incom

ing

applic

ation

Enough

resources?

Update Aging

map

Calculate

reliability factor

for each core

Task graphSelect first node

CCR >

Threshold?

Computation

biased mapping

Communication

biased mappingMapping

process

Fig. 3. Flow of the proposed task mapping scheme

A. Runtime Aging Monitoring and Aging Map Update

For runtime lifetime management, a runtime aging mon-itoring framework is required to make decisions based onthe runtime aging status. In this paper, we adopt a similarmethod with [26], in which the aging effect is estimated basedon aggregate operating conditions, including the temperature,switching activity, voltage, frequency, etc.

Given a target aging effect, the lifetime budget ∆A iscalculated for each core. If ∆A > 0, it indicates that theaging effect is less than the target aging effect, i.e. the lifetimeconstraint of the core is satisfied, and vice versa. The agingmap is updated periodically.

B. First Node Selection

Previous studies [27], [12] have developed various methodsto select the first node to minimize external congestion andaging effect of many-core system. In this paper, we adopt asimilar method with [12]. For each core (m,n) in the many-core system, a reliability factor RFm,n is evaluated based onthe status of the neighborhood cores. The reliability factor isrepresented as follows,

RFm,n =

m+r∑i=m−r

n+r∑j=n−r

Wi,j ×∆Ai,j × (r − d+ 1) (10)

where Wi,j = 0 if (i, j) is occupied and Wi,j = 1 if core (i, j)is unoccupied. r is the radius of square and d is the distancefrom the center. RFm,n incorporates lifetime reliability metric,status of the cores and the range of the square into a singlevalue. And the closer core around (m,n) is given higherweight. A higher RFm,n indicates that there are more freecores and more lifetime budgets around the core (m,n). Thefirst node is the core with the highest reliability factor. Notethat the reliability factor values of all cores are calculatedat runtime proactively. Thus the first node selection does notaffect the running time of the mapping algorithm.

Page 5: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

5

Algorithm 1 Mapping or queuing the incoming applicationInput AGi: the arrival application;

AM : ∆A of all cores;j: First node.

Output Choose Queue() or Map().1: C ← the number of cores that are free and ∆A > 02: |T | ← the number of tasks in AGi

3: if C < |T | then4: Queue(AGi)5: else if ∆Aj < 0 then6: Queue(AGi)7: else8: Map(AGi)9: end if

C. Lifetime Constraint Satisfaction

To satisfy the lifetime constraint, the resource managerneeds to decide whether mapping the incoming applicationor putting the application into the waiting queue. If thereis not enough lifetime budget, it is possible to violate thelifetime constraint to map the incoming application. Therefore,we propose a method in Algorithm 1. In the algorithm, wefirst count the number of cores that are not occupied by anyapplications and the lifetime budget is positive (∆A > 0) (line1). If the C is less than the number of tasks of the incomingapplication, there are not enough lifetime budgets, thus theincoming application is put into the waiting queue until moreresources are available (line 3), i.e. when more cores are freeor there are enough lifetime budgets. Since the first nodeincorporates the lifetime metric of itself and neighbor nodes,the selected first node is the core with the least-aged areaaround it. If the ∆A < 0, the application is also put into thewaiting queue (line 5). Although when ∆A > 0 for the firstnode it is possible that the neighbor nodes around the first nodehave ∆A < 0, this can be solved in the following mappingalgorithm which adopts different strategies according to thecharacteristics of applications.

D. Neighborhood Nodes Allocation

After the first node is selected, the tasks of the applicationare allocated to the neighborhood cores around the first node.Because the applications have different characteristics, theyhave different communication performance requirements. Inthis paper, the applications are classified into communicationintensive applications and computation intensive applicationsaccording to their communication to computation ratio (CCR),which is defined as the average data volume to be sent inone application divided by the average computation workload.Therefore, a higher CCR value indicates that the applicationperformance is more dominated by the communication whilea lower CCR value indicates the application performance ismore dominated by the computation. The method to classifythe applications is presented in Section V-E.

The borrowing strategy is adopted to meet the lifetimeconstraint in a long-term scale. Shown in Fig. 4, the mappingalgorithm includes two sub-routines: communication-biased

FN

FNFN: first node

High-aged core

Selected core

for mapping

FNFN

Communication

biased mapping

Computation

biased mapping

Fig. 4. Diagram for neighborhood nodes allocation, which includes twosub-routines: (1) communication-biased mapping for communication inten-sive application; (2) computation-biased mapping for computation intensiveapplication.

mapping and computation-biased mapping. If the incomingapplication is a communication intensive application, thecommunication-biased mapping is called to map the appli-cation to a near-square area with lifetime constraint locallyrelaxed. This is because the communication performance re-quirement in local time window is higher, the performance isthe first priority to optimize. On the other hand, the compu-tation intensive applications are mapped by the computation-biased mapping sub-routine to an area with higher dispersionto free the high-aged cores. Therefore, the over-stressed corecan recover from high-aged status. Thus, the lifetime constraintis satisfied in a long-term scale.

1) Communication-Biased Mapping: If the application is acommunication intensive application, it has higher requirementon the communication performance. Hence the communicationdistance is more important. The widely adopted method forneighborhood allocation is CONA [27]. With the objective toselect a near-square region to mitigate the internal and externalcongestion [15], CONA selects the node that fits the smallestsquare when mapping next task. However, the communicationdistance with predecessors of current task is not considered,which possibly leads to high communication cost.

In this paper, we propose an improved method for neigh-borhood nodes allocation named lifetime-aware neighborhoodallocation (LNA), which takes both task communication over-head and aging information into consideration. The algorithmfor the method is presented in Algorithm 2. Given the firstnode and arrival application, the next task to map is taken fromAGi in breadth first order (line 3 and line 7). The first task ismapped on the first node (line 4). Sr is expanded if there is noavailable core in current square (line 8-11). If the applicationis communication intensive, the next core is selected as thecore with minimum weighted hops with all predecessors oftn (line 12-13). Finally, the task tn is mapped on nm (line17). Compared to CONA, the major improvement is that the

Page 6: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

6

Algorithm 2 Pseducode of Lifetime-aware NeighborhoodAllocation (LNA)Input nf : First node;

∆Ai, i ∈ [1, |N |]: Aging map of all cores.AG(T,E): the application to be mapped;

Output Mapping function T → PDefinition Ω: the tasks that are not mapped;

1: Ω← T . All tasks in AG(T,E).2: r ← 1 . Radius of square around the first node3: Take the first task tf from Ω in breadth-first order of AG4: Map tf on nf5: Sr ← The set of unoccupied cores that are within current

square, which is with radius r around nf6: while Ω 6= ∅ do7: Take the next task tn in Ω in breadth-first order8: if Sr = ∅ then9: r ← r + 1

10: Sr is updated11: end if12: if communication-biased mapping then:13: nm ← the core within Sr and with the minimum

number of weighted hops with all predecessors of task tn14: else if computation-biased mapping then:15: nm ← the core with ∆Am > 0 and within Sr

and with the minimum number of weighted hops with allpredecessors of task tn

16: end if17: Map tn on nm18: end while

proposed method maps the next task to the core that is with theminimum weighted communication with predecessors of thetask. The weighted hops with all predecessors of task tn canbe represented as

∑wi,nMD(map(ti),map(tn)),∀ei,n ∈ E.

Since the maximum possible number of cores in Sr is |P |and the maximum possible number of predecessors is |T |, thecomplexity of the algorithm is O(|T |2|P |), showing that thealgorithm has low complexity.

2) Computation-Biased Mapping: Compared to communi-cation intensive application, the performance of computationintensive application is less affected by the communicationdistance. This creates the opportunities to map the applicationto a more dispersed area to avoid high-aged cores. The over-stressed cores can also recover from high-aged status. We pro-pose a method which combines the high-aged cores (∆A < 0)with other allocated cores to form a near-square shape. Themapping algorithm for computation intensive application isalso presented in Algorithm 2 (line 14-16). The high-agedcores are avoided when mapping the application but the com-munication distance is still considered. Compared to mappingfor communication intensive applications, the communicationdistance increases because more cores are avoided and themapping area is more dispersed.

E. Method for Application Classification

As presented in previous section, the runtime mappingscheme includes different strategies for communication in-

MRD = 2.56

6 x 6

(a) 6×6 many-core system

MRD = 5.19

10 x 10

Area of

free cores

(b) 10× 10 many-core system

Fig. 5. Average distance of remaining nodes. MRD is the average distance.Grayed blocks represent unavailable cores. The MRD value of the remainingcores increases with the size of many-core.

tensive applications and computation intensive applications.It still remains unsolved to determine an appropriate thresholdto classify the applications. A sub-optimal threshold possiblyleads to the following two scenarios:

(1) If the threshold is too low, most applications would beclassified as communication intensive applications which aremapped in compact regions, which results from long waitingtime as the example shown in Fig. 1;

(2) If the threshold is too high, most applications are classi-fied as computation-intensive applications which are mappedin dispersed regions. The possible scenario is that someapplications with high CCR are mapped to dispersed regions,leading to long communication time and high communicationcost. The high communication time also results from lowthroughput.

Above all, an inappropriate threshold can cause lowthroughput of the system. It is essential to determine anappropriate threshold to effectively improve the throughput ofthe system. In this paper, we propose a method to determinethe threshold dynamically according to communication tocomputation ratio (CCR) of applications and current locationsof all available cores.

First, for each application we define CCR as commcomp , which

indicates the average communication time in one applicationdivided by the average computation time. A high CCR valuemeans the application is more communication intensive, anda low CCR value means the application is more computationintensive. The next step is how to quantitatively determine athreshold for the CCR to classify the communication intensiveapplications and computation intensive applications. To avoidthe aforementioned 2 scenarios, the threshold should makea balance between the communication time and computationtime. Since the threshold should be defined before the mappingprocess starts, the mapping area and the communication timeare unknown. Therefore, it is assumed that all available coresare candidate cores for the mapping when the applicationis mapped in dispersed region. The average distance of allavailable cores is calculated and represented as MRDfree inEq. 11

MRD =

∑ni,nj∈DMD(ni, nj)(|D|

2

) (11)

where D denotes the set of all remaining available cores.

Page 7: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

7

TABLE ICONFIGURATIONS OF SYNTHETIC APPLICATIONS AND PLATFORM

Synthetic graph parametersNumber of tasks [10, 20], [20, 30]

Task execution time [100, 400] (cycles)

Power of task [50, 100] (mW)

CCR [0, 0.5], [0, 2]

Network ParametersSize & Topology 8× 8, 12× 12 2D mesh

Routing algorithm XY routing

Base topology 8× 8

Latency 2 cycles per hop

MD(ni, nj) denotes the Manhattan distance between core niand nj . We define the threshold in Eq. 12.

Thsld =1

MRDfree(12)

Note that Thsld is not a fixed value since it depends oncurrent locations of all available cores. Thsld can approxi-mately make the critical CCR achieve a balance between thecommunication time and computation time as MRDfree ×comm = comp. The insight behind this is that the averagedistance of the tasks is approximately equal to MRDfree

when the tasks are randomly mapped. Thsld is also associatedwith topology size. Fig 5 presents a comparison in 6× 6 and8×8 topology respectively. It shows that with the same numberof occupied cores, MRDfree is much higher in a largertopology. In other words, the computation intensive applicationis more likely to be mapped to a high dispersed region ina large topology. In this case the threshold should be lowersuch that more applications are classified as communicationintensive applications and mapped to compact regions to avoidhigh communication distance.

VI. EXPERIMENTAL RESULTS

A. Experimental Setup

In this paper, the experiments are performed in an in-house many-core simulator. McPat [28] and CALIPER [29]are integrated to model the power and lifetime reliability re-spectively. Both synthetic and realistic task graphs are adoptedto evaluate the proposed mapping schemes in the experiments.The synthetic task graphs are generated by DAGGEN [30]with configurations shown in Table I. Other configurations togenerate the synthetic graphs are adopted from the default pa-rameters in DAGGEN. Besides we also evaluate the mappingscheme using realistic task graphs which are extracted fromvideo processing applications [31].

The proposed lifetime-constrained mapping schemes LBRM,which incorporate CONA and LNA neighborhood allocationmethods, are named LBRM-CONA and LBRM-ANA respec-tively. The mapping scheme is also compared with the state-of-the-art lifetime-constrained scheme named as DSRM [12].The lifetime-agnostic runtime mapping named MAPPRO [24]is also compared.

TABLE IICONFIGURATIONS FOR EVALUATION

8× 8 many-core A1 A2 A3 A4

Application size [10, 20] [30, 40]

CCR [0, 2] [0, 0.5] [0, 2] [0, 0.5]

B. Evaluations on Synthetic Applications

In this subsection, the proposed mapping algorithm is eval-uated in terms of throughput and communication cost undervarious lifetime constraint. There are in total 106 applicationsgenerated in the simulation, and the arrival rate of applicationis 0.01 applications per cycle.

1) Throughput Comparisons: We adopt aging effect as thelifetime reliability constraint. Since aging effect reflects theaccumulated aging status with the time (shown in Eq. 2),we set the lifetime constraint from 1M to 0.4M respectively,where M is a reference aging effect value without any lifetimeconstraints. In other words, the initial lifetime budget isset from 1M to 0.4M , among which 0.4M is the tightestconstraint imposed on the many-core system.

Fig. 6 and Fig. 7 show the comparisons of normalizedthroughput in 8×8 and 12×12 many-core system respectively.A1-A4 and B1-B4 correspond to configurations with variousapplication sizes and CCR value as shown in Table II. Thethroughput is normalized to DSRM for comparison. In Fig. 6,when the initial lifetime budget is 1M , the average throughputimprovement of LBRM over DSRM is 18.1%. The improve-ment comes from the proposed new neighborhood allocationscheme LNA. When the initial lifetime budget is 0.4M ,the average throughput improvement LBRM-LNA over DSRMreaches 30.1%. This is because of both the adopted borrowingstrategy and LNA. By using borrowing strategy, LBRM mapscomputation-intensive applications in dispersed regions with-out occupying high-aged cores. The throughput is improveddue to less waiting time of computation-intensive applications.LBRM-CONA, combining borrowing strategy with CONAshows at most 3.6% and 1.2% throughput improvement for A1and A3 respectively, and at most 9.5% and 13.2% throughputimprovement for A2 and A4 respectively. The reason of lowerimprovement for A1 and A3 is that when CCR is [0, 2] less ap-plications are classified as computation-intensive applicationsand the lifetime constraint is over relaxed, making borrowingstrategy less effective as LBRM.

Fig. 7 presents similar results in 12 × 12 topology size.When the lifetime constraint is 0.4M , the average throughputimprovement LBRM-LNA over DSRM reaches 22.1%, which isless than that of 8×8 topology size. This is because the relativesize of the applications is smaller when in a larger topology,thus creating less opportunities to avoid high-aged cores.Similar as 8 × 8 topology, LBRM-CONA has no significantthroughput improvement when the CCR value is [0, 2], buthas at most 11.2% and 15.7% throughput improvement forA1 and A4 respectively.

2) Comparisons of Communication Cost: AverageWeighted Manhattan Distance (AWMD) is a widely usedmetric to evaluate the communication cost. Defined in [27],

Page 8: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

8

0.4

0.6

0.8

1

1.2

1.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRM

LBRM-CONA LBRM-LNA

Norm

aliz

ed

th

rou

gh

pu

t

(a) A1

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(b) A2

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(c) A3

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(d) A4

Fig. 6. Many-core platform size: 8× 8. (a)-(d) are the results under various application sizes and CCR values. The results are normalized throughput undervarious lifetime constraints. 1M , 0.8M , 0.6M , 0.4M denote various lifetime constraints. M is reference aging effect value of the system when there is nolifetime constraint. The configurations of A1-A4 is referred in Table II

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(a) A1

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(b) A2

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(c) A3

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(d) A4

Fig. 7. Many-core platform size: 12×12. (a)-(d) are the results under various application sizes and CCR values. The results are normalized throughput undervarious lifetime constraints. 1M , 0.8M , 0.6M , 0.4M denote various lifetime constraints. M is reference aging effect value of the system when there is nolifetime constraint. The configurations of B1-B4 is referred in Table II

AWMD is the sum product of communication distance and alledges’ weight of the mapped application, averaged by the totalcommunication weights. To evaluation the communicationcost of all mapped applications, AWMD defined in thispaper considers all mapped applications instead of singleapplication [27]. The AWMD is defined as follows,

AWMD =

∑n

∑ei,j∈En

wi,j ×MD(ei,j)∑n

∑ei,j∈En

wi,j(13)

where n is the total number of mapped applications, En the setof edges of the n-th application, wi,j is the weight of the edgeei,j and MD(ei,j) is the Manhattan Distance of task ti andtj . A lower AWMD value indicates lower communication costin terms of communication time and communication energy.

Table III shows the partial comparisons of the AWMDvalues with respect to Fig. 6 and Fig. 7. Due to spacelimitations, only AWMD values of 1M and 0.6M are shown.It shows that in average LBRM-CONA leads to slight increaseon AWMD compared to DSRM. This is because for LBRM-CONA, the computation intensive applications are allowedto map to dispersed regions, which increases the communi-cation distance. However, because only low-communicationapplications are mapped in dispersed regions, the AWMD isless affected. It can also been concluded that LBRM-CONAcan effectively improve the throughput without significantimpact on the communication cost. LBRM-LNA has much lessAWMD value because an improved neighborhood allocationscheme is adopted, which reduces the communication distancesignificantly.

C. Throughput Improvement vs. CCR range

In this section, we characterize the throughput improvementover DSRM given different CCR ranges. The CCR range isbetween 0 and maximum CCR. The maximum CCR is chosen

from 0.05 to 5. Fig. 8 presents the results for different topologysizes and applications sizes.

When CCR is low, it can be observed that LBRM-CONAcan effectively improve the throughput for all configurations.This is because when CCR is low, nearly all applicationsare classified as computation-intensive applications which canonly be mapped to compact area for DSRM. However, LBRM-CONA relaxes the requirement of compact area and reducesthe waiting time of applications. Hence the throughput isimprovement. The improvement is even more than 50.6%when the topology size is 8 × 8 and the application sizeis [30, 40]. The reason is that the application sizes of mostapplications even exceed half of the topology size, making itmore difficult to find a compact area for DSRM. The waitingtime can be greatly reduced if they are mapped to dispersedregions. When CCR is high, the improvement LBRM-CONAis much less obvious because most applications are with highCCR values and most applications are classified as communi-cation intensive applications. LBRM-CONA also maps mostapplication to compact area as DSRM, thus showing lessadvantages.

LBRM-CONA improves the throughput mainly due to bor-rowing strategy. Besides borrowing strategy, the improvementof LBRM-LNA also comes from the improved neighborhoodallocation scheme LNA. When the CCR is high, it showsthat LNA brings additional 20%-40% throughput improvement.However, when CCR is low, LBRM-LNA is close to LBRM-CONA. This is because LNA mainly improves the task alloca-tion by considering the communication with all predecessors.If there is less communication, LNA is getting closer to LBRM-CONA. Fig. 8(b) also shows quite different curve with othersbecause LBRM outperforms DSRM a lot at low CCR rangeand relative large application size.

Page 9: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

9

TABLE IIIAVERAGE WEIGHTED MANHATTAN DISTANCE

Many-core size 8× 8 12× 12

Configurations A1 A2 A3 A4 A1 A2 A3 A4

Constraints 1M 0.6M 1M 0.6M 1M 0.6M 1M 0.6M 1M 0.6M 1M 0.6M 1M 0.6M 1M 0.6M

MAPPRO 2.90 2.90 2.93 2.89 4.34 4.33 4.34 4.33 2.95 2.94 2.97 2.92 4.38 4.43 4.41 4.40

DSRM 2.91 2.90 2.88 2.90 4.42 4.42 4.42 4.42 2.97 2.91 2.96 2.90 4.42 4.42 4.45 4.40

LBRM-CONA 2.91 2.91 2.96 3.17 4.42 4.41 4.45 4.35 2.94 2.88 3.05 3.16 4.42 4.29 4.45 4.43

LBRM-LNA 2.02 1.99 2.09 2.28 2.73 2.70 2.74 2.75 2.07 2.17 2.34 2.24 2.89 2.87 2.90 3.02

0%

10%

20%

30%

40%

50%

60%

0.05 0.1 0.3 0.5 0.7 1 2 5

LBRM-CONALBRM-LNA

Thro

ughp

ut im

prov

emen

t ove

r DSR

M

Maximum CCR

(a) 8× 8, app. size: [10, 20]

0%

10%

20%

30%

40%

50%

60%

0.05 0.1 0.3 0.5 0.7 1 2 5

LBRM-CONALBRM-LNA

Thro

ughp

ut im

prov

emen

t ove

r DSR

M

Maximum CCR

(b) 8× 8, app. size: [30, 40]

0%

10%

20%

30%

40%

50%

60%

0.05 0.1 0.3 0.5 0.7 1 2 5

LBRM-CONALBRM-LNA

Thro

ughp

ut im

prov

emen

t ove

r DSR

M

Maximum CCR

(c) 12× 12, app. size: [10, 20]

0%

10%

20%

30%

40%

50%

60%

0.05 0.1 0.3 0.5 0.7 1 2 5

LBRM-CONALBRM-LNA

Thro

ughp

ut im

prov

emen

t ove

r DSR

M

Maximum CCR

(d) 12× 12, app. size: [30, 40]

Fig. 8. Throughput improvement over DSRM vs. CCR range for various topology size and application size. The CCR range is between 0 and maximumCCR. The CCR values of the applications are uniformly distributed within the given range.

0.40.60.8

11.21.4

A1 A2 A3 A4

LBRM-AVG LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(a) Many-core size: 8× 8

0.4

0.6

0.8

1

1.2

1.4

A1 A2 A3 A4

LBRM-AVG LBRM-LNA

Norm

aliz

ed

th

rou

gh

pu

t

(b) Many-core size: 12× 12

Fig. 9. Comparisons of the Classification Methods

D. Advantages of Proposed Classification Method

To show the advantages of the proposed classificationmethod, we compare it with a naive method, which determinesthe threshold according to the average CCR of all applications.For example, the threshold of CCR range [0, 0.5] is 0.25. Asshown in Fig. 9, LBRM mapping schemes, which incorporatethe proposed method and the naive method are named LBRM-LNA and LBRM-AVG respectively. In Fig. 9, A1-A4 corre-spond to the configurations in Table II. It shows that whenCCR is [0, 0.5], LBRM-LNA has more than 10% throughputimprovement than LBRM-AVG. However, when CCR is [0, 2],the improvement is not significant. This indicate that LBRM-LNA outperforms LBRM-AVG in most cases. When CCR is[0, 0.5], the threshold determined by the proposed method ismore appropriate for high throughput.

E. Evaluations on Realistic Task Graphs

In the paper, the realistic task graphs include Video ObjectPlane Decoder (VOPD), Picture-In-Picture application (PIP),and Multi Window Display application (MWD), triple videoobject plane decoder (TVOPD), MPEG4, MP3 encoder, H.263encoder and H.263 decoder [31], [32], [33]. Since only com-munication profiles of these applications are provided, weassume that the computation time of the tasks is identical and

TABLE IVINFORMATION OF REALISTIC TASK GRAPHS

Name Size CCR Name Size CCRMP3 Encoder 13 0.008 H.263 Decoder 14 0.008

H.263 Encoder 12 0.093 VOPD 16 1

MWD 12 0.256 MPEG4 12 0.256

PIP 8 0.608 TVOPD 38 0.724

the CCR value of VOPD is 1. The CCR values of the otherapplications are normalized accordingly shown in Table IV.In the simulation, there are in total 106 applications in whichthe 8 task graphs are uniformly distributed.

1) Throughput Comparisons: Fig. 10 shows the compar-isons of throughput for realistic task graphs. It shows thatin 8 × 8 topology size, LBRM-CONA and LBRM-LNA has atmost 16.8% and 21.2% throughput improvement respectively.When the topology size is 12× 12, LBRM-CONA and LBRM-LNA has at most 11.9% and 14.4% throughput improvementrespectively. LBRM-LNA has higher throughput because bor-rowing strategy is also adopted besides LNA neighborhoodallocation. The improvement is less in 12 × 12 because therelative size of an application is smaller compared to theplatform, creating less chances for avoiding high-aged cores.The results also show that LBRM is also effective in terms ofthroughput improvement for realistic task graphs.

2) AWMD Comparisons: Fig. 11 shows the comparisonsof communication distance for realistic task graphs. The re-sults show that although LBRM-CONA and LBRM-LNA allowdispersed mapping of some applications, the AWMD doesnot increase significantly. The main reason is that most ofthe dispersed applications have low communication. We canconclude that LBRM can effectively increase the throughputwith only a little penalty on the communication cost.

Page 10: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

10

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(a) Many-core size: 8× 8

0.40.60.8

11.21.4

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

Nor

mal

ized

thro

ughp

ut

(b) Many-core size: 12× 12

Fig. 10. Throughput comparison for realistic benchmarks

0

1

2

3

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

AWM

D

(a) Many-Core size: 8× 8

0

1

2

3

1M 0.8M 0.6M 0.4M

MAPPRO DSRMLBRM-CONA LBRM-LNA

AWM

D

(b) Many-Core size: 12× 12

Fig. 11. AWMD Comparisons for realistic benchmarks

VII. CONCLUSIONS

In this paper, we propose a lifetime-constrained runtimemapping algorithm, which can dynamically map the incom-ing applications on the many-core system under a lifetimeconstraint. Compared to existing lifetime aware constrainedmapping, the proposed mapping algorithms adopt the borrow-ing strategy to manage the many-core resources in multi-scale.In other words, the lifetime is managed in long-term scale butis locally relaxed to satisfy performance requirement in short-term scale. The applications are classified into communicationintensive and computation intensive applications, and mappedwith different strategies. We also propose an effective methodto classify the application into communication intensive andcomputation intensive applications. The experiments showthat compared to state-of-the-art lifetime-constrained mappingscheme, the proposed scheme can effectively improve thethroughput on many-core system for synthetic task graphsand realistic task graphs while having less impact on thecommunication cost.

REFERENCES

[1] C. Ramey, “Tile-gx100 manycore processor: Acceleration interfaces andarchitecture,” in 2011 IEEE Hot Chips 23 Symposium (HCS), Aug 2011,pp. 1–21.

[2] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani,S. Hutsell, R. Agarwal, and Y. C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, no. 2, pp.34–46, 2016.

[3] W. Song, S. Mukhopadhyay, and S. Yalamanchili, “Architectural reliabil-ity: Lifetime reliability characterization and management of many-coreprocessors,” IEEE Computer Architecture Letters, vol. 14, no. 2, pp.103–106, 2015.

[4] L. Huang and Q. Xu, “Characterizing the lifetime reliability ofmanycore processors with core-level redundancy,” in Proceedings ofthe International Conference on Computer-Aided Design, ser. ICCAD’10. Piscataway, NJ, USA: IEEE Press, 2010, pp. 680–685. [Online].Available: http://dl.acm.org/citation.cfm?id=2133429.2133574

[5] A. M. Rahmani, M. H. Haghbayan, A. Miele, P. Liljeberg, A. Jantsch,and H. Tenhunen, “Reliability-aware runtime power management formany-core systems in the dark silicon era,” IEEE Transactions on VeryLarge Scale Integration (VLSI) Systems, vol. 25, no. 2, pp. 427–440,2017.

[6] Z. Yang, C. Serafy, T. Lu, and A. Srivastava, “Phase-driven learning-based dynamic reliability management for multi-core processors,” inProceedings of the 54th Annual Design Automation Conference (DAC),2017, pp. 46:1–46:6.

[7] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The case for lifetimereliability-aware microprocessors,” in Proceedings of the 31st AnnualInternational Symposium on Computer Architecture (ISCA), 2004, pp.276–285.

[8] D. Lorenz, G. Georgakos, and U. Schlichtmann, “Aging analysis ofcircuit timing considering nbti and hci,” in Proceedings of 15th IEEEInternational On-Line Testing Symposium, 2009, pp. 3–8.

[9] P. Mercati, A. Bartolini, F. Paterna, T. S. Rosing, and L. Benini,“Workload and user experience aware dynamic reliability managementin multicore processors,” in Proceedings of 50th ACM/EDAC/IEEEDesign Automation Conference (DAC), 2013, pp. 1–6.

[10] L. Huang, F. Yuan, and Q. Xu, “Lifetime reliability-aware task alloca-tion and scheduling for mpsoc platforms,” in Proceedings of Design,Automation Test in Europe Conference Exhibition (DATE), 2009, pp.51–56.

[11] C. Bolchini, M. Carminati, A. Miele, A. Das, A. Kumar, and B. Veer-avalli, “Run-time mapping for reliable many-cores based on en-ergy/performance trade-offs,” in Proceedings of International Sympo-sium on Defect and Fault Tolerance in VLSI and NanotechnologySystems (DFTS), 2013, pp. 58–64.

[12] M. H. Haghbayan, A. Miele, A. M. Rahmani, P. Liljeberg, and H. Ten-hunen, “A lifetime-aware runtime mapping approach for many-coresystems in the dark silicon era,” in Proceedings of Design, AutomationTest in Europe Conference Exhibition (DATE), 2016, pp. 854–857.

[13] L. Huang and Q. Xu, “Energy-efficient task allocation and scheduling formulti-mode mpsocs under lifetime reliability constraint,” in Proceedingsof Design, Automation Test in Europe Conference Exhibition (DATE),2010, pp. 1584–1589.

[14] D. Gnad, M. Shafique, F. Kriebel, S. Rehman, D. Sun, and J. Henkel,“Hayat: Harnessing dark silicon and variability for aging decelerationand balancing,” in Proceedings of Design, Automation Test in EuropeConference Exhibition (DATE), 2015, pp. 180:1–180:6.

[15] M. Fattah, M. Daneshtalab, P. Liljeberg, and J. Plosila, “Smart hill climb-ing for agile dynamic mapping in many-core systems,” in Proceedingsof 50th ACM/EDAC/IEEE Design Automation Conference (DAC), 2013,pp. 1–6.

[16] C. Zhuo, D. Sylvester, and D. Blaauw, “Process variation andtemperature-aware reliability management,” in Proceedings of Design,Automation and Test in Europe Conference Exhibition, 2010, pp. 580–585.

[17] E. Karl, D. Blaauw, D. Sylvester, and T. Mudge, “Reliability modelingand management in dynamic microprocessor-based systems,” pp. 1057–1060, 2006.

[18] L. Wang, X. Wang, H.-f. Leung, and T. Mak, “Throughput optimizationfor lifetime budgeting in many-core systems,” in Proceedings of the onGreat Lakes Symposium on VLSI, 2017, pp. 451–454.

[19] B. H. Meyer, A. S. Hartman, and D. E. Thomas, “Cost-effective slack al-location for lifetime improvement in noc-based mpsocs,” in Proceedingsof Design, Automation Test in Europe Conference Exhibition (DATE),2010, pp. 1596–1601.

[20] C. Zhu, Z. Gu, R. P. Dick, and L. Shang, “Reliable multiprocessorsystem-on-chip synthesis,” in Proceedings of 5th IEEE/ACM/IFIP In-ternational Conference on Hardware/Software Codesign and SystemSynthesis (CODES+ISSS), 2007, pp. 239–244.

[21] A. S. Hartman and D. E. Thomas, “Lifetime improvement throughruntime wear-based task mapping,” in Proceedings of the 10th Interna-tional Conference on Hardware/Software Codesign and System Synthesis(CODES+ISSS), 2012, pp. 13–22.

[22] V. Rathore, V. Chaturvedi, and T. Srikanthan, “Performance constraint-aware task mapping to optimize lifetime reliability of manycoresystems,” in Proceedings of the 26th Great Lakes Symposium onVLSI(GLVLSI), 2016, pp. 377–380.

[23] J. Sun, R. Lysecky, K. Shankar, A. Kodi, A. Louri, and J. Roveda,“Workload assignment considering nbti degradation in multicore sys-tems,” ACM Journal on Emerging Technologies in Computing Systems(JETC), vol. 10, no. 1, pp. 4:1–4:22, 2014.

[24] M.-H. Haghbayan, A. Kanduri, A.-M. Rahmani, P. Liljeberg, A. Jantsch,and H. Tenhunen, “Mappro: Proactive runtime mapping for dynamicworkloads by quantifying ripple effect of applications on networks-on-chip,” in Proceedings of the 9th International Symposium on Networks-on-Chip (NoCS), 2015, pp. 26:1–26:8.

[25] M. Fattah, P. Liljeberg, J. Plosila, and H. Tenhunen, “Adjustable conti-guity of run-time task allocation in networked many-core systems,” in

Page 11: Runtime Mapping for Lifetime Reliability and Throughput ...jhan8/publications/journal_paper.pdf · Liang Wang is with the Department of Computer Science and Engineering, The Chinese

11

2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 2014, pp. 349–354.

[26] P. Mercati, A. Bartolini, F. Paterna, L. Benini, and T. S. Rosing, “An on-line reliability emulation framework,” in Proceedings of the 12th IEEEInternational Conference on Embedded and Ubiquitous Computing(EUC), 2014, pp. 334–339.

[27] M. Fattah, M. Ramirez, M. Daneshtalab, P. Liljeberg, and J. Plosila,“Cona: Dynamic application mapping for congestion reduction in many-core systems,” in Proceedings of 30th International Conference onComputer Design (ICCD), 2012, pp. 364–370.

[28] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, andN. P. Jouppi, “Mcpat: An integrated power, area, and timing modelingframework for multicore and manycore architectures,” in Proceedingsof IEEE/ACM International Symposium on Microarchitecture (MICRO),2009, pp. 469–480.

[29] C. Bolchini and et al., “A lightweight and open-source frameworkfor the lifetime estimation of multicore systems,” in Proceedings ofInternational Conference on Computer Design (ICCD), 2014, pp. 166–172.

[30] F. Suter, “Daggen: A synthethic task graph generator,”https://github.com/frs69wq/daggen.

[31] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini,and G. D. Micheli, “Noc synthesis flow for customized domain specificmultiprocessor systems-on-chip,” IEEE Transactions on Parallel andDistributed Systems (TPDS), vol. 16, no. 2, pp. 113–129, 2005.

[32] Y. Xue and P. Bogdan, “Improving noc performance under spatio-temporal variability by runtime reconfiguration: a general mathematicalframework,” in Proceedings of Tenth IEEE/ACM International Sympo-sium on Networks-on-Chip (NOCS), 2016, pp. 1–8.

[33] S. Murali, C. Seiculescu, L. Benini, and G. D. Micheli, “Synthesis ofnetworks on chips for 3d systems on chips,” in Proceedings of Asiaand South Pacific Design Automation Conference (ASP-DAC), 2009, pp.242–247.

PLACEPHOTOHERE

Liang Wang

PLACEPHOTOHERE

Xiaohang Wang

PLACEPHOTOHERE

Ho-fung Leung

PLACEPHOTOHERE

Terrence Mak

PLACEPHOTOHERE

Jie Han

PLACEPHOTOHERE

Leibo Liu