
Research on task allocation strategy and scheduling algorithm of multi-core load balance

Chao Wu, School of Software, Dalian University of Technology

Dalian, China, 116620, [email protected]

Yifu Wang, Aoyang Zhao, Tie Qiu*, School of Software, Dalian University of Technology

Dalian, China, 116620, [email protected], [email protected]

Abstract—Based on research into task scheduling and allocation for multi-core load balancing, and aiming at the characteristics of multi-core processors, we propose the static task graphs stratification algorithm, the static task group scheduling algorithm, and the minimum dynamic link algorithm. When these algorithms allocate tasks, they are expected to achieve multi-core load balancing. Task allocation proceeds in three stages. In the first stage, dependencies among tasks are broken and relatively independent tasks are placed in the same group. In the second stage, static allocation is carried out under the load-balancing principle, and the initial tasks are distributed to the system hardware threads so that each thread receives almost the same total time. In the third stage, the tasks generated while the system is running are allocated to each hardware thread, using the processor's speed as the standard. Simulation experiments verify that the algorithms achieve better load balancing and minimum completion time.

Keywords—multi-core; scheduling algorithm; load balance; task allocation

I. INTRODUCTION

With the rapid growth of embedded processing, embedded system architectures are turning to multiprocessor designs in which processors cooperate and compute synchronously [1], because the systematic complexity of a uniprocessor is too high and its computing power is limited. In this way, for the multi-task [2] needs of the same system, the cooperating processors can each complete their different functional applications at the highest efficiency.

In recent years, research on multi-core processor architectures has gradually matured. A multi-core system can exploit thread-level parallelism effectively by increasing the number of physical processors or hardware threads (soft cores), and it supports parallel execution in the true sense [3, 4]. In practical applications, however, the tasks that the system schedules have dependencies [5-7]. While the system is running it therefore does not work very well: most of the time the system is waiting, so its parallelism is not used effectively. In his book Multi-core Computing and Programming, Zhou introduces a task group scheduling algorithm [8] that can substantially reduce the processor waiting caused by task dependencies. However, because it selects tasks strictly in descending order of execution time, error accumulation inevitably occurs. Moreover, when we change the selection direction of the last task to avoid error accumulation, the error of a single thread cannot be reduced; in fact, it may grow.

In this paper, we aim at the characteristics of multi-core processors and consider how to maximize embedded multi-core load balancing [9, 10]. By studying static scheduling algorithms for multi-core systems, we built an eight-core architecture based on MicroBlaze and improved the static task allocation algorithms that are designed to balance the load of the multi-core processor, finish as quickly as possible, and cause the fewest problems: the static task graphs stratification algorithm and the static task group scheduling algorithm. Meanwhile, on top of static task allocation, this paper also addresses the scheduling and allocation of tasks whose execution time cannot be predicted, for which we propose the minimum dynamic link algorithm.

The remaining sections of this paper are organized as follows. Section 2 presents related work and the multi-core hardware system architecture designed on FPGA. Section 3 builds the algorithm analysis model and discusses the task allocation strategy. Section 4 gives the specific design of the scheduling algorithms and the test solutions for assessing system performance. Section 5 describes the experiments and defines some parameters, from which we obtain the assessment of the algorithm model. Section 6 concludes with the results achieved in this study and discusses future directions.

II. RELATED WORK

In recent years, the study of task allocation strategies for multi-core load balancing has developed very fast. Cheng et al. [11] propose a new algorithm that optimizes the DAG graph by clustering. Geng et al. [12] present an algorithm that diminishes communication overhead and keeps the load balanced between cores, while improving the speedup ratio of the parallel program. Shen et al. [13] propose a greedy heuristic algorithm for thread scheduling based on the Intel multi-core architecture.

A. Introduction of the hardware test architecture

In this paper, we use the XPV5-LX110T experiment board to build an eight soft-core architecture for the tests. All of the cores use the Harvard architecture, with the ALU as the calculation unit, as shown in Fig. 1.

*Corresponding author: Tel: 0086-411-87571632

Email: [email protected]


The core numbered zero is designated as the master core; it is responsible for gathering task information, running the scheduling algorithm, and allocating tasks. The other seven cores are slave cores, responsible for task execution.

Fig. 1. The design of the eight-core architecture based on MicroBlaze in the EDK software

The following factors should be considered in task allocation for multi-core processors: the execution time of tasks, execution frequency, priority, communication, system resources, and so on. In multi-core processors, inter-core communication is accomplished through an internal high-speed bus [14, 15] or shared memory. Compared with a multi-core computer system, multi-core processors offer shorter communication distances and lower communication delays, so the exchange can even be regarded as pure inter-core communication. Without considering other factors, static scheduling on multi-core processors can be regarded as the allocation of independent tasks: it largely guarantees that the tasks on each hardware thread are mutually independent, with no dependency among them, and in parallel execution it tries to avoid delays caused by resource occupancy or logical dependencies.

B. Directed Acyclic Graph

A Directed Acyclic Graph (DAG) [16] has excellent properties for clearly describing the tasks the system requires and the relations among them. It also makes it convenient to schedule the jobs in the task graph. In a Directed Acyclic Graph:

In-degree: the number of directed edges that point to the vertex.

Out-degree: the number of directed edges that point from the vertex to other vertices.

From graph theory we know that a DAG must contain at least one vertex whose in-degree is zero; otherwise, if every vertex had a non-zero in-degree, the graph would contain a cycle. We can therefore stratify the DAG by this property. Within each layer, all the vertices are free of dependencies on one another, i.e. they form an independent (stable) set. Task allocation for multi-core load balancing is then based on this task graph stratification.
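As an illustration of these definitions (this is our own sketch, not code from the paper), a task DAG can be stored as an edge matrix from which the in-degree of every vertex is computed; the vertices whose in-degree is zero form the first independent layer:

#include <stdio.h>

#define MAX_TASKS 16

/* edge[u][v] = 1 means there is a directed edge u -> v, i.e. task v depends on task u */
static int edge[MAX_TASKS][MAX_TASKS];
static int in_degree[MAX_TASKS];

static void compute_in_degree(int n)
{
    for (int v = 0; v < n; v++) {
        in_degree[v] = 0;
        for (int u = 0; u < n; u++)
            in_degree[v] += edge[u][v];
    }
}

int main(void)
{
    int n = 4;
    edge[0][2] = edge[1][2] = edge[2][3] = 1;   /* task 2 needs 0 and 1; task 3 needs 2 */

    compute_in_degree(n);
    for (int v = 0; v < n; v++)
        if (in_degree[v] == 0)
            printf("task %d has in-degree 0 (first layer)\n", v);
    return 0;
}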

III. MODEL BUILDING

In order to control the scheduling of tasks among the hardware threads effectively, we choose one hardware thread as the master core, which is in charge of the dynamic task scheduling of the other cores. Suppose the processor has 8 cores: core 0 is the master core and the other 7 cores are slave cores. When the system starts, suppose it has N tasks. These tasks are modeled as a directed acyclic graph, and ti stands for the time cost of task i. The sum of these time costs for one core, namely Time_sum, is the time that core needs to finish the tasks distributed to it. To achieve load balance, we need to calculate the average time each slave core would spend under ideal conditions and the time each slave core actually spends, so we define the following variables.

Definition 1: Avgtime is the average time. Avgtime is not the average running time of all tasks but the average time that each core should spend when the system distributes the tasks over the 7 slave cores, as shown in formula (1).

Avgtime = (t1 + t2 + t3 + … + tn) / 7 (1)

Definition 2: total_delta is the sum of the cumulative errors [17]. After each core is assigned, total_delta is increased by the difference between that core's total running time and Avgtime, as shown in formula (2).

total_delta = total_delta + Time_sum - Avgtime (2)

For the algorithm used while the system is running, we need to know how much time each task costs, so we define some further variables.

Definition 3: t1 is the clock tick count at the beginning of a task, t2 is the clock tick count at its end, and t is the period from the start to a certain point in time. The processors have the same ability to handle tasks (the 7 cores start at the same time, so t is approximately equal for all of them).

Definition 4: Time is the execution time of a task, as shown in formula (3).

Time = t2 - t1 (3)

Definition 5: Within the period t, the total numbers of tasks distributed to the 7 cores form the array total_task[i] (i = 1, 2, …, 7).

Definition 6: Within the period t, the total numbers of tasks completed by the 7 cores form the array finish_task[j] (j = 1, 2, …, N).

Definition 7: time[k] is the total time that core k takes to complete the tasks counted in finish_task (k = 1, 2, …, 7), as shown in formula (4).

time[k] = Σ Time_j, summed over the tasks j that core k has completed (4)

When we allocate a task to a slave core, we need to consider both the capability of the slave core and the number of its remaining tasks.

Definition 8: V is the slave core's processing speed, as shown in formula (5).

V = finish_task[i] / time[i] (i = 1, 2, …, 7) (5)

Definition 9: NUM is the number of tasks that a slave core has not yet finished, as shown in formula (6).

NUM = total_task[i] - finish_task[i] (i = 1, 2, …, 7) (6)

Definition 10: T' is the time that a slave core needs to finish its remaining tasks, as shown in formula (7).

T’ = NUM / V (7)
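To make Definitions 5-10 concrete, the following is a small C sketch (our illustration, not code from the paper) of the per-core bookkeeping; total_task, finish_task and the per-core time follow the definitions above (the time array is renamed time_spent to avoid clashing with the C library), and seven slave cores are assumed:

#define SLAVES 7

/* per slave core (index 0..6 stands for cores 1..7) */
static unsigned int total_task[SLAVES];   /* tasks assigned so far (Definition 5)   */
static unsigned int finish_task[SLAVES];  /* tasks already completed (Definition 6) */
static unsigned int time_spent[SLAVES];   /* clock ticks spent on finished tasks,
                                             i.e. time[k] of formula (4)            */

/* V = finish_task[i] / time[i], formula (5): tasks finished per clock tick */
static double speed(int i)
{
    return (double)finish_task[i] / (double)time_spent[i];
}

/* NUM = total_task[i] - finish_task[i], formula (6): tasks still pending */
static unsigned int remaining(int i)
{
    return total_task[i] - finish_task[i];
}

/* T' = NUM / V, formula (7): estimated ticks needed to drain core i */
static double remaining_time(int i)
{
    return remaining(i) / speed(i);
}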

IV. ALGORITHM DESCRIPTIONS

For the original tasks produced when the system starts, we use the static task graphs stratification algorithm to partition the tasks so that, within each layer, all tasks are independent. After that, the static task group scheduling algorithm is used to allocate the tasks of each layer to the slave cores. While the system is running, the minimum dynamic link algorithm is used to allocate the tasks whose running time cannot be predicted to the slave cores.

A. The static task graphs stratification algorithm

Step 1: Put the vertices whose in-degree is zero into the first layer; the other vertices stay at their original positions.

Step 2: For every directed edge that starts from a vertex in this layer, decrease the in-degree of its target vertex by one.

Step 3: In the graph formed by the remaining vertices, put the vertices whose in-degree is zero into the same layer and increase the layer number by one; the other vertices stay at their original positions.

Step 4: Repeat Step 2 and Step 3 for the remaining vertices until every vertex has been assigned to a suitable task layer. Fig. 2 and Fig. 3 show the running process of the algorithm; a minimal C sketch of the layering procedure is given after Fig. 3.

Fig.2 Before task graphs stratification

Fig.3 After task graphs stratification
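The following C sketch (our reading of Steps 1-4, not the authors' code) layers a task DAG stored as an edge matrix; layer[v] receives the layer number of task v, and a return value of -1 signals a cycle:

#include <stdio.h>

#define MAX_TASKS 16

/* edge[u][v] = 1 means task v depends on task u (directed edge u -> v) */
static int edge[MAX_TASKS][MAX_TASKS];
static int in_degree[MAX_TASKS];
static int layer[MAX_TASKS];             /* layer number assigned to each task */

/* Steps 1-4: repeatedly peel off the vertices whose in-degree is zero */
static int stratify(int n)
{
    int assigned = 0, cur_layer = 0;

    for (int v = 0; v < n; v++) {        /* initial in-degrees                 */
        in_degree[v] = 0;
        for (int u = 0; u < n; u++)
            in_degree[v] += edge[u][v];
        layer[v] = -1;                   /* -1 = not yet stratified            */
    }

    while (assigned < n) {
        /* Step 1 / Step 3: collect the currently independent vertices */
        int found = 0;
        for (int v = 0; v < n; v++)
            if (layer[v] < 0 && in_degree[v] == 0) {
                layer[v] = cur_layer;
                found = 1;
                assigned++;
            }
        if (!found)
            return -1;                   /* a cycle: the graph is not a DAG    */

        /* Step 2: remove the outgoing edges of this layer's vertices */
        for (int u = 0; u < n; u++)
            if (layer[u] == cur_layer)
                for (int v = 0; v < n; v++)
                    if (edge[u][v])
                        in_degree[v]--;
        cur_layer++;                     /* Step 3: move on to the next layer  */
    }
    return cur_layer;                    /* number of layers produced          */
}

int main(void)
{
    int n = 4;
    edge[0][2] = edge[1][2] = edge[2][3] = 1;
    int layers = stratify(n);
    for (int v = 0; v < n; v++)
        printf("task %d -> layer %d\n", v, layer[v]);
    printf("%d layers\n", layers);
    return 0;
}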

B. The static task group scheduling algorithm

Step 1: Obtain each task's time cost (suppose there are N tasks), recorded as t1, t2, t3, …, tn. Then sort all tasks by time cost from largest to smallest.

Step 2: Calculate the average time Avgtime.

Avgtime = (t1 + t2 + t3 + … + tn) / P (8), where P is the number of slave cores.

Step 3: Randomly select a core m. We select the largest time-cost task for this slave core, then the second-largest, and so on; each time we check whether the total time allocated has reached or exceeded the average time. If it has not, we continue until we find the task tp for which it does, as in formulas (9) and (10).

Time_sum = t1 + t2 + … + t(p-1) (9)

Time_sum + tp > Avgtime (10)

Step 4: Then we continue to search for a task tk, from tp down to tn, to replace the task tp, such that

Time_sum + tk < Avgtime (11)

Time_sum + t(k-1) > Avgtime (12)

Step 5: Starting from the last task tn and moving forward, find the task tm. We select tm and check whether the total time is still less than the average time; if it is, we continue with the next task, until formulas (13) and (14) hold.

Time_sum + tk + tm < Avgtime (13)

Time_sum + tk + t(m-1) ≥ Avgtime (14)

Step 6: We compute three error values, taking into account the cumulative error of the previously assigned tasks.

δ1 = total_delta + Time_sum + t(k-1) - Avgtime (15)

δ2 = total_delta + Time_sum + tk + tm - Avgtime (16)

δ3 = total_delta + Time_sum + tk + t(m-1) - Avgtime (17)

We should assign tasks according to the minimum value of the error.

If δ1 is the minimum, the assigned tasks are: t1, t2, t3, …, t(p-1), t(k-1).

If δ2 is the minimum, the assigned tasks are: t1, t2, t3, …, t(p-1), tk, tm.

If δ3 is the minimum, the assigned tasks are: t1, t2, t3, …, t(p-1), tk, t(m-1).

Step 7: For the rest of the cores, repeat Step 3 to Step 6 until all tasks are assigned to an appropriate slave core.

The process from Step 3 to Step 5 is shown in Fig. 4; a simplified C sketch of the whole algorithm is given after Fig. 4.


Fig.4 The static task group scheduling algorithm
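The following C sketch is our simplified illustration of Steps 1-7, not the authors' implementation: tasks are sorted by descending time cost, each core is filled greedily while it stays below Avgtime, and the full tp/tk/tm replacement search of Steps 4-6 is reduced to trying each still-unassigned task as a closing task and keeping the one that minimizes the cumulative error |total_delta + Time_sum - Avgtime|:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N_TASKS 12
#define SLAVES  3          /* P in formula (8); the paper uses 7 */

static int cmp_desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

int main(void)
{
    int t[N_TASKS] = { 40, 35, 30, 28, 25, 22, 20, 18, 15, 12, 10, 8 };
    int assigned[N_TASKS] = { 0 };
    double total_delta = 0.0, sum = 0.0;

    qsort(t, N_TASKS, sizeof t[0], cmp_desc);        /* Step 1 */
    for (int i = 0; i < N_TASKS; i++) sum += t[i];
    double avgtime = sum / SLAVES;                   /* Step 2, formula (8) */

    for (int core = 0; core < SLAVES; core++) {
        double time_sum = 0.0;
        printf("core %d:", core + 1);

        /* Step 3: take the largest remaining tasks while we stay below Avgtime */
        for (int i = 0; i < N_TASKS; i++) {
            if (!assigned[i] && time_sum + t[i] <= avgtime) {
                assigned[i] = 1;
                time_sum += t[i];
                printf(" %d", t[i]);
            }
        }

        /* Steps 4-6 (simplified): try one more closing task and keep it only
           if it lowers the cumulative error |total_delta + time_sum - Avgtime| */
        int best = -1;
        double best_err = fabs(total_delta + time_sum - avgtime);
        for (int i = 0; i < N_TASKS; i++) {
            if (assigned[i]) continue;
            double err = fabs(total_delta + time_sum + t[i] - avgtime);
            if (err < best_err) { best_err = err; best = i; }
        }
        if (best >= 0) {
            assigned[best] = 1;
            time_sum += t[best];
            printf(" %d", t[best]);
        }

        total_delta += time_sum - avgtime;           /* formula (2) */
        printf("  (time_sum = %.0f, Avgtime = %.1f)\n", time_sum, avgtime);
    }
    /* any still-unassigned tasks would be handed out in a final pass,
       omitted here to keep the sketch short */
    return 0;
}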

C. The minimum dynamic link algorithm

Step 1: Figure out the total time (denoted time[k]) it takes to handle tasks on each of cores 1 to 7. The time for handling one task is denoted Time, with Time = t2 - t1.

t1 represents the clock tick count when the task begins.

t2 represents the clock tick count when the task ends.

time[i] is the time it takes to finish the i-th task.

Step 2: Figure out each core's handling speed (denoted V), that is, the number of tasks it finishes in one second.

Step 3: Figure out the number of remaining tasks that have not been completed on each of cores 1 to 7.

Step 4: Figure out the time each core needs to finish its remaining tasks, denoted T'.

Step 5: When the system first launches, use the static task graphs stratification algorithm and the static task group scheduling algorithm to allocate an initial set of tasks to each of cores 1 to 7.

Step 6: Use the flag bit to check whether a core is free. If a core is free, jump to Step 9; otherwise, go to Step 7.

Step 7: Use selection sort to order the cores, find the smallest T', and schedule the core with the smallest T'. If two cores have an equal T', compare as follows:

if ( (total_task[i] - finish_task[i]) * time[i] * finish_task[j] ==
     (total_task[j] - finish_task[j]) * time[j] * finish_task[i] )
{
    /* The remaining times T' of cores i and j are equal. Usually, the higher
       the handling speed V, the more tasks can be finished in the same time,
       so we break the tie by comparing V and prefer the faster core; this
       improves the scheduling success rate and the total number of finished
       tasks. */
}

Step 8: To avoid floating-point calculation, the fractional comparison is transformed into an integer multiplication comparison:

(total_task[i] - finish_task[i]) / (finish_task[i] / time[i]) < (total_task[j] - finish_task[j]) / (finish_task[j] / time[j]), (i != j)

can be transformed into

(total_task[i] - finish_task[i]) * time[i] * finish_task[j] < (total_task[j] - finish_task[j]) * time[j] * finish_task[i]

Step 9: Communicate with the selected core to finish the scheduling. A C sketch of the core-selection logic of Steps 7 and 8 follows.
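The following C sketch (our illustration with made-up bookkeeping values) shows the core-selection logic of Steps 7 and 8: the core with the smallest remaining time T' is found by comparing the cross-multiplied integer form, and a tie is broken in favour of the higher speed V, again without division:

#include <stdio.h>

#define SLAVES 7

static long total_task[SLAVES]  = { 20, 22, 18, 25, 21, 19, 23 };
static long finish_task[SLAVES] = { 15, 14, 12, 20, 13, 16, 17 };
static long time_spent[SLAVES]  = { 60, 70, 55, 90, 65, 58, 80 }; /* clock ticks */

/* returns 1 when T'(i) < T'(j), i.e.
   (total_task[i]-finish_task[i])*time[i]*finish_task[j]
       < (total_task[j]-finish_task[j])*time[j]*finish_task[i]   (Step 8) */
static int faster_to_drain(int i, int j)
{
    long lhs = (total_task[i] - finish_task[i]) * time_spent[i] * finish_task[j];
    long rhs = (total_task[j] - finish_task[j]) * time_spent[j] * finish_task[i];
    if (lhs != rhs)
        return lhs < rhs;
    /* tie on T': prefer the core with the larger speed V = finish/time,
       compared as finish_task[i]*time[j] > finish_task[j]*time[i]  (Step 7) */
    return finish_task[i] * time_spent[j] > finish_task[j] * time_spent[i];
}

int main(void)
{
    int best = 0;
    for (int i = 1; i < SLAVES; i++)      /* a simple scan replaces the
                                             selection sort of Step 7 */
        if (faster_to_drain(i, best))
            best = i;
    printf("assign the new task to slave core %d\n", best + 1);
    return 0;
}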

V. ALGORITHM TESTING

First, we evaluate the static task graphs stratification algorithm and the static task group scheduling algorithm. We import the hardware design into SDK, set up a C project in SDK based on the hardware design, and implement the algorithms in that project. When the system starts, we suppose there are 280 tasks, each represented by a function, test1() to test280(). We randomly generate 280 numbers between 10 and 100 that decide how many milliseconds each function delays; in a given evaluation, the improved algorithms use the same numbers as the original algorithms. All the tasks are divided into four groups: the second group depends on the first group, the third group depends on the second group, and so on. In this way each layer has 70 tasks, ten times the number of slave cores, so the effect of the algorithms can show a difference. We randomize the order of the tasks and register their relationships in a doubly linked list. Then all the tasks are layered by the static task graphs stratification algorithm, and the tasks of each layer are allocated to the slave cores by the static task group scheduling algorithm. The results are shown in Fig. 5 and Fig. 6; a sketch of how this task set is generated is given after Fig. 6.

Fig. 5. Comparison of the improved algorithms with the original algorithms on the total elapsed time of each slave core.


Fig. 6. Comparison of the improved algorithms with the original algorithms on the total elapsed time of the system when it accomplishes the same number of tasks.
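The static test workload described above can be reproduced with a short host-side C sketch (our illustration; the actual SDK project, MicroBlaze timers and delay calls are not shown): 280 delay values between 10 and 100 ms, split into four layers of 70 tasks, where every task of a layer depends on the previous layer:

#include <stdio.h>
#include <stdlib.h>

#define N_TASKS   280
#define LAYERS    4
#define PER_LAYER (N_TASKS / LAYERS)

int main(void)
{
    int delay_ms[N_TASKS];

    srand(42);                               /* fixed seed: the improved and the
                                                original algorithms see the same
                                                280 random delays                */
    for (int i = 0; i < N_TASKS; i++)
        delay_ms[i] = 10 + rand() % 91;      /* 10..100 ms, as in the test setup */

    for (int i = 0; i < N_TASKS; i++) {
        int layer = i / PER_LAYER;           /* layer k depends on layer k-1     */
        printf("test%d(): delay %3d ms, layer %d%s\n",
               i + 1, delay_ms[i], layer + 1,
               layer ? ", depends on previous layer" : "");
    }
    return 0;
}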

When testing the minimum dynamic link algorithm we still use the C project built in SDK [18], but we extend it. After the 280 original tasks are finished by the static task group scheduling algorithm, the master core reads data from RAM and makes every five data items into one task. These tasks are then sent to the slave cores, and the sum of the five data items decides the slave core's delay time; we use this to simulate the tasks that arrive while the system is running. We assume these data items are numbers ranging from 1 to 10. The master core then begins to read data from RAM, and when the number of data items reaches five, the interrupt program is triggered. At that moment each slave core calculates its running speed and load condition and sends them to the master core. Finally, the master core uses the minimum dynamic link algorithm to assign the task to a slave core based on these data. In each test the master core assigns 490 tasks in total, 70 times the number of slave cores. The result is shown in Fig. 7; a sketch of how this dynamic workload is generated follows the figure.

Fig. 7. The total elapsed time of each slave core.
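The dynamic workload can be mimicked in the same way (our host-side illustration; the RAM reads, the interrupt routine and the inter-core messages of the real system are omitted): every five data values in the range 1 to 10 form one task whose delay is their sum, and 490 such tasks are produced in total:

#include <stdio.h>
#include <stdlib.h>

#define GROUP      5
#define N_DYNAMIC  490          /* 70 tasks per slave core x 7 slave cores */

int main(void)
{
    srand(7);
    for (int task = 0; task < N_DYNAMIC; task++) {
        int delay = 0;
        for (int k = 0; k < GROUP; k++)      /* five data items per task      */
            delay += 1 + rand() % 10;        /* each item is a value in 1..10 */
        /* in the real system this sum decides the slave core's delay time;
           here we only print it */
        printf("dynamic task %3d: delay %2d\n", task + 1, delay);
    }
    return 0;
}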

VI. CONCLUSION

This paper introduces three task assignment and scheduling algorithms for multi-core systems: the static task graphs stratification algorithm, the static task group scheduling algorithm, and the minimum dynamic link algorithm. Together they satisfy the basic need to keep the load of every hardware thread balanced. The static task graphs stratification algorithm breaks the dependency relationships between system tasks and puts relatively independent tasks into the same task group. The static task group scheduling algorithm reduces both the error accumulation that occurs during task allocation and the error caused by a single core, making the time spent by each core close to the average time. The minimum dynamic link algorithm, in which a master core assigns tasks to the slave cores, adopts a centralized scheduling strategy. These algorithms achieve good load-balancing results and eliminate CPU starvation [19, 20].

ACKNOWLEDGMENT

This work is supported by National Natural Science Foundation of P.R. China (Grant No. 61202443).

REFERENCES

[1] Gharesifard, B.; Cortes, J. Distributed strategies for making a digraph weight-balanced; 2009, p.771-777

[2] Tao Li; Qin Xu; Zhao Ning. ARM9 Multi-task Data Acquisition System Intelligent Improvement; 2011, p. 587 -590

[3] Ramkumar, B.; Kale, L.V. Machine independent AND and OR parallel execution of logic programs. II. Compiled execution;1994; p.181-192

[4] Sitohang, B. Parallel execution of relational algebra operator under distributed database systems; 2002; p.207-211

[5] Biswas, S.; Mall, R.; Satpathy, M. Task Dependency Analysis for Regression Test Selection of Embedded Programs;2011; p. 117-120

[6] Niethammer, C.; Glass, C.W.; Gracia, J. Avoiding Serialization Effects in Data / Dependency Aware Task Parallel Algorithms for Spatial Decomposition; 2012; p.743-748

[7] Yue Lu; Nolte, T.; Bate, I.; Norstrom, C. Timing Analyzing for Systems with Task Execution Dependencies; 2010; p.515-524

[8] WeiMing Zhou. Multi-core Computing and Programming.

[9] Liu, Xi; Pan, Lei; Wang, Chong-Jun; Xie, Jun-Yuan. A Lock-Free Solution for Load Balancing in Multi-Core Environment; 2011; p.1-4

[10] Yu, Kun-Ming; Wu, Shu-Hao. An Efficient Load Balancing Multi-core Frequent Patterns Mining Algorithm; 2011; p.1408-1412

[11] Cheng, Hui. A High Efficient Task Scheduling Algorithm Based on Heterogeneous Multi-Core Processor; 2010; p.1-4

[12] Xiaozhong Geng; Gaochao Xu; Dan Wang; Ying Shi. A task scheduling algorithm based on multi-core processors; 2011; p.942-945

[13] Sheng, Yan; Sheng, Yang Quan; Wei, Wang Xiao; Feng, Zou. Research on Thread Scheduling Algorithm in Automatic Parallelization; 2009; p.1-4

[14] Hai Chen; Yi Zhang; Dongsheng Ma. A SIMO Parallel-String Driver IC for Dimmable LED Backlighting With Local Bus Voltage Optimization and Single Time-Shared Regulation Loop;2012; p.452-462

[15] Yang, H.; Kim, M. Estimation of traffic intensity in global bus computer communications networks; 1990; p.1079-1080

[16] Wang, Lisheng ; Wang, Kete ; Li, Xixi. A false-sharing-eliminable parallel tasks scheduling algorithm based on DAG;2010; p. V9-34 - V9-38

[17] Feihu Zhang ; Stahle, H. ; Guang Chen ; Chao Chen ; Simon, C. ; Knoll, A. A sensor fusion approach for localization with cumulative error elimination;2012; p.1-6

[18] Lie, Ioan ; Ionici, Cristian ; Gontean, Aurel-Stefan S. ; Cernăianu, Mihail. EDK implemented temperature controller;2010; p.344-349

[19] Chen, Yuansheng ; Zeng, Yu. Automatic Energy Status Controlling with Dynamic Voltage Scaling in Power-Aware High Performance Computing Cluster;2011; p.412-416

[20] Zhang, Xiaodong; Qu, Yanxia; Xiao, Li. Improving distributed workload performance by sharing both CPU and memory resources; 2000; p.233-241
