Efficient Exploitation of Fine-Grained Parallelism using a microHeterogeneous
Computing Environment
Computer Engineering Department
Presented by William Scheidel
September 19th, 2002
Agenda
Objective
Heterogeneous Computing Background
microHeterogeneous Computing Architecture
microHeterogeneous Framework Implementation
Simulation Results
Conclusion
Future Work
Acknowledgements
Objectives
The main objective of this thesis is to propose a new computing paradigm, called microHeterogeneous computing or mHC, which incorporates processing elements (vector processors, digital signal processors, etc.) into a general purpose machine using the high-performance I/O buses that are available.
This architecture is then used to exploit fine-grained parallelism efficiently.
Heterogeneous Computing
An architecture which provides an assortment of high performance machines for use by an application
Arose from the realization that no single machine is capable of performing all tasks in an optimal manner
Machines differ in both speed as well as in capabilities and are connected using high speed, high bandwidth intelligent interconnects that handle the intercommunication between each of the machines.
Motivation For Heterogeneous Computing
Hypothetical example of the advantage of using a heterogeneous suite of machines, where the heterogeneous suite time includes inter-machine communication overhead. Not drawn to scale.
Heterogeneous Computing Driving Application
Image Understanding
Well suited to a heterogeneous environment due to its complexity and involvement of different types of parallelism
Consists of three main levels of processing, each level containing a different type of parallelism
Three main levels can also be executed in parallel:
– Lowest Level
    Consists of pixel-based operators and pixel subset operators such as edge detection
    Highest amount of parallelism
    Best suited to mesh connected SIMD machines
– Intermediate Level
    Grouping and organization of features previously extracted
    Communication is irregular, parallelism decreases as features are grouped
    Best suited to medium-grained MIMD machines
– Highest Level
    Knowledge processing
    Uses the data from the previous levels in order to infer semantic attributes about an image
    Requires coarse-grained loosely coupled MIMD machines
The three levels of processing required for image understanding.
Each of the levels contains large amounts of varying kinds of parallelism that can be exploited by heterogeneous computing
Heterogeneous Computing Driving Application
Image Understanding
Heterogeneous Computing Broad Issues
Analytical benchmarking
Code-type or task profiling
Matching and scheduling
Programming environments
Interconnection requirements
Execution Steps of Heterogeneous Applications
Analytical Benchmarking - determines the optimal speedup that a particular machine can achieve on different types of tasks
Code Profiling - determines the modes of computation that exist in each program segment as well as the execution times
Task Scheduling – tasks are mapped to a particular machine using some form of scheduling heuristic
Task Execution – the tasks are executed on the assigned machine
HC Issues: Analytical Benchmarking
Measure of how well a given machine is able to perform on a certain type of code
Required in HSC to determine which types of code should be mapped to which machines
Benchmarking is an offline process
Example results:
– SIMD machines are well suited for matrix computations / low level image processing
– MIMD machines are best suited for tasks that have limited intercommunication
HC Issues: Code-type Profiling
Used to determine the types of parallelism in the code as well as execution time
Tasks are separated into segments which contain a homogeneous type of parallelism
These segments can then be matched to a particular machine that is best suited to execute them
Code-type profiling is an offline process
Code Profiling Example
Example results from the code-profiling of a task.
The task is broken into S segments, each of which contains embedded homogeneous parallelism.
HC Issues: Matching and Scheduling
Goal is to map code-types to the best suited machine
Costs must be carefully weighed
– Computation Costs – execution time of a code segment is dependent on the machine it is run on as well as the current workload of the machine
– Communication Costs – dependent on type of interconnection used and bandwidth
– Interference Costs – resource contention occurs when multiple tasks are assigned to a particular machine
Problem determined to be NP-hard even for homogeneous environments
Addition of heterogeneous processing elements adds to the complexity
A large number of heuristic algorithms have been designed to schedule tasks to machines on heterogeneous computing systems.
Most such heuristic algorithms are static and assume that the ETC (expected time to compute) for every task on every machine is known from code-type profiling and analytical benchmarking.
Example Static Scheduling Heuristics for HC
Opportunistic Load Balancing (OLB): assigns each task, in arbitrary order, to the next available machine.
User-Directed Assignment (UDA): assigns each task, in arbitrary order, to the machine with the best expected execution time for the task.
Fast Greedy : assigns each task, in arbitrary order, to the machine with the minimum completion time for that task.
Min-min: the minimum completion time for each task is computed with respect to all machines. The task with the overall minimum completion time is selected and assigned to the corresponding machine. The newly mapped task is removed, and the process repeats until all tasks are mapped.
Max-min: very similar to Min-min. The set of minimum completion times is calculated for every task. The task with the overall maximum completion time from the set is selected and assigned to the corresponding machine.
Greedy or Duplex: a combination of the Min-min and Max-min heuristics. Both are run and the better of the two solutions is used.
GA: the Genetic Algorithm (GA) is used for searching large solution spaces. It operates on a population of chromosomes for a given problem. The initial population is generated randomly, although a chromosome could also be generated by any other heuristic algorithm.
Simulated Annealing (SA): an iterative technique that considers only one possible solution for each meta-task at a time. SA uses a procedure that probabilistically allows poorer solutions to be accepted, based on a system temperature, in an attempt to obtain a better search of the solution space.
GSA: the Genetic Simulated Annealing (GSA) heuristic is a combination of the GA and SA techniques.
Tabu: Tabu search is a solution space search that keeps track of the regions of the solution space which have already been searched so as not to repeat a search near these areas.
A*: a tree search beginning at a root node that is usually a null solution. As the tree grows, intermediate nodes represent partial solutions and leaf nodes represent final solutions. Each node has a cost function, and the node with the minimum cost function is replaced by its children. Any time a node is added, the tree is pruned by deleting the node with the largest cost function. This process continues until a complete mapping (a leaf node) is reached.
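The simpler mapping heuristics in this list can be sketched in a few lines of Python. This is a minimal illustration rather than code from the thesis: the function names and the small ETC matrix (rows are tasks, columns are machines) are invented for the example.

```python
def fast_greedy(etc):
    """Fast Greedy: take tasks in arrival order and give each one to the
    machine on which it would complete earliest."""
    ready = [0.0] * len(etc[0])      # time at which each machine becomes free
    mapping = {}
    for task, row in enumerate(etc):
        machine = min(range(len(row)), key=lambda m: ready[m] + row[m])
        ready[machine] += row[machine]
        mapping[task] = machine
    return mapping, max(ready)       # (task -> machine, makespan)


def min_min(etc, pick_largest=False):
    """Min-min; with pick_largest=True the same loop implements Max-min."""
    ready = [0.0] * len(etc[0])
    unmapped = set(range(len(etc)))
    mapping = {}
    choose = max if pick_largest else min
    while unmapped:
        # Minimum completion time (and the machine giving it) for each task.
        best = {t: min((ready[m] + etc[t][m], m) for m in range(len(ready)))
                for t in unmapped}
        task = choose(sorted(unmapped), key=lambda t: best[t][0])
        finish, machine = best[task]
        ready[machine] = finish
        mapping[task] = machine
        unmapped.remove(task)
    return mapping, max(ready)


def duplex(etc):
    """Greedy/Duplex: run Min-min and Max-min, keep the shorter schedule."""
    return min(min_min(etc), min_min(etc, pick_largest=True),
               key=lambda result: result[1])


# Hypothetical ETC matrix: 4 tasks x 2 machines.
ETC = [[2, 6], [3, 5], [8, 2], [5, 4]]
for name, heuristic in [("fast greedy", fast_greedy),
                        ("min-min", min_min), ("duplex", duplex)]:
    mapping, makespan = heuristic(ETC)
    print(name, mapping, makespan)
```

All three heuristics share the same bookkeeping (a per-machine ready time); they differ only in the order in which tasks are committed to machines.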
Example Static Scheduling Heuristics for HC
Every task has an ETC (expected time to compute) on a specific machine. If there are t tasks and m machines, we can obtain a t × m ETC matrix. ETC(i, j) is the estimated execution time for task i on machine j.
The Segmented min-min algorithm sorts the tasks according to ETCs.
The tasks can be sorted into an ordered list by the average ETC, the minimum ETC, or the maximum ETC.
Then, the task list is partitioned into segments of equal size.
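The sort-then-partition idea can be sketched as follows. This is an illustrative sketch, not the published algorithm's code: the function names are invented, the tasks are assumed to be sorted in decreasing order of the chosen key so that larger tasks are mapped first, and each segment is then assumed to be scheduled with ordinary Min-min.

```python
import math


def min_min_segment(etc, tasks, ready, mapping):
    """Apply Min-min to one segment, updating the shared machine ready times."""
    remaining = list(tasks)
    while remaining:
        # Overall minimum completion time over all (task, machine) pairs.
        finish, machine, task = min(
            (ready[m] + etc[t][m], m, t)
            for t in remaining for m in range(len(ready)))
        ready[machine] = finish
        mapping[task] = machine
        remaining.remove(task)


def segmented_min_min(etc, segments, key=lambda row: sum(row) / len(row)):
    """Sort tasks by a key over their ETC row (average by default; pass
    key=min or key=max for the other orderings), split the sorted list
    into `segments` parts of equal size, and schedule them in order."""
    order = sorted(range(len(etc)), key=lambda t: key(etc[t]), reverse=True)
    size = math.ceil(len(order) / segments)
    ready = [0.0] * len(etc[0])
    mapping = {}
    for start in range(0, len(order), size):
        min_min_segment(etc, order[start:start + size], ready, mapping)
    return mapping, max(ready)


# Hypothetical ETC matrix: 4 tasks x 2 machines.
ETC = [[2, 6], [3, 5], [8, 2], [5, 4]]
print(segmented_min_min(ETC, segments=2))
print(segmented_min_min(ETC, segments=2, key=max))
```

With `segments=1` the procedure reduces to plain Min-min, which makes it easy to compare the two on the same ETC matrix.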
Example Static Scheduling Heuristics for HC:
The Segmented Min-Min Algorithm
Segmented Min-Min Scheduling Heuristic
Example Dynamic Scheduling Heuristics for HC:
HEFT Scheduling Heuristic
Heterogeneous Parallel Programming
Parallel Virtual Machine (PVM)
– Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource
– Units of parallelism are tasks, which are generally processes
– User also has the choice to view these resources as an attributeless collection of virtual processing elements or choose to exploit the capabilities of specific machines in the host pool
– Allows multifaceted virtual machines to be configured within the same framework and permits messages containing more than one data type to be exchanged between machines having different data representations
Message Passing Interface (MPI)
– Each processor is assigned a rank which then determines which parts of the application that processor will execute
– Division of tasks among processors is left solely up to the developer writing the application
– Includes routines to do point-to-point communication between two processing elements, collective operations to simultaneously communicate information between all processing elements, and implicit as well as explicit synchronization
HC Issues: Interconnection Requirements
Interconnection medium must support high bandwidths and low latency communications (LANs won’t cut it)
Complexity in a heterogeneous system increases since different machines use different protocols for communication
Must support both shared memory and message-based communication
Heterogeneous Computing Limitations
Task Granularity
– Heterogeneous computing only utilizes coarse-grained parallelism to increase performance
– Coarse-grained parallelism results in large task sizes and reduced coupling, which allows the processing elements to work more efficiently
– This requirement also carries over to most heterogeneous schedulers, since they are based on the scheduling of meta-tasks, i.e. tasks that have no dependencies
Communication Overhead
– Tasks and their working sets must be transmitted over some form of network in order to execute them, so latency and bandwidth become crucial factors
– Overhead is also incurred when encoding and decoding the data for different architectures
Cost
– Machines used in heterogeneous computing environments can be prohibitively expensive
– Expensive high speed, low latency networks are required to achieve the best performance
– Not cost effective for applications where only a small portion of the code would benefit from such an environment
microHeterogeneous Computing
A new computing paradigm that attempts to most efficiently exploit the fine-grained parallelism found in most scientific computing applications
Environment is contained within a workstation and consists of a host processor and a number of additional PCI (or other high performance I/O bus) based processing elements
– Elements might be DSP based, vector based, FPGA based, or even reconfigurable computing elements
– In combination with a host processor, these elements create a small scale heterogeneous computing environment
An mHC specific API was developed that greatly simplifies using these types of devices for parallel applications
microHeterogeneous Computing Environment
Comparison Between Heterogeneous Computing and microHeterogeneous
Computing
Task Granularity
– Heterogeneous environments only support coarse-grained parallelism, while the mHC environment instead focuses on fine-grained parallelism by using a tightly coupled shared memory environment
– Task size is reduced to a single function call in an mHC environment
– Drawbacks
    Processing elements used in mHC are not nearly as powerful as the machines used in a standard heterogeneous environment
    There is a small and finite number of processing elements that can be added to a single machine
Communication Overhead
– High performance I/O buses are twice as fast as the fastest network
– Less overhead is incurred when encoding and decoding data since all processing elements use the same base architecture
Cost Effectiveness
– Machines used in a heterogeneous environment can cost tens of thousands of dollars each, and require the extra expense of the high-speed, low latency interconnects to achieve acceptable performance
– mHC processing elements cost only hundreds of dollars
Comparison Between Heterogeneous Computing and microHeterogeneous
Computing
Analytical Benchmarking and Profiling
– Analytical benchmarking is used for the same purpose in both computing environments. The capabilities of each processing element or machine must be known before program execution begins so the scheduling algorithm is able to determine an efficient mapping of tasks.
– Profiling, while necessary in heterogeneous environments, is not required in microHeterogeneous environments
Scheduling Heuristics
– Scheduling in heterogeneous environments generally takes place during the compilation stage instead of during execution
– The scheduler for an mHC environment must be dynamic and map tasks in real-time in order to provide the best performance.
microHeterogeneous Computing API
An API was created for microHeterogeneous computing to provide a flexible and portable interface
– User applications only need to make simple API calls; mapping of tasks onto available devices is performed automatically
– There is no need for an application to be recompiled if the underlying implementation of microHeterogeneous Computing changes, as long as the API is adhered to
The API supports a subset of the GNU Scientific Library (GSL)
GSL is a freely distributable scientific API written in C
– Includes an extensive library of scientific functions and data types
– Directly supports the Basic Linear Algebra Subprograms (BLAS)
– Data structures are compatible with those used by the Vector, Signal, and Image Processing Library (VSIPL), which is becoming a standard on embedded devices
microHeterogeneous Computing API (cont)
The microHeterogeneous Computing API provides support for the following areas of scientific computing:
Vector Operations
Matrix Operations
Polynomial Solvers
Permutations
Combinations
Sorting
Linear Algebra
Eigenvectors and Eigenvalues
Fast Fourier Transforms
Numerical Integration
Statistics
Suitable mHC Devices
XP-15
– Developed by Texas Memory Systems
– DSP based accelerator card
– Performs 80 32-bit floating point operations per second
– Contains 256 MB of on board DDR RAM
– Supports over 500 different scientific functions
– Increases FFT performance by 20x - 40x over a 1.4 GHz P4
Pegasus-2
– Developed by Catalina Research
– Vector processor based
– Supports FFT, matrix and vector operations, convolutions, filters and more
– Supported functions operate between 5x and 15x faster than a 1.4 GHz P4
microHeterogeneous Framework Implementation
Implemented as a dynamically linked library written purely in C that user applications interact with by way of the mHC API
The framework creates tasks from the API function calls and schedules them to the available processing elements
Phases of the microHeterogeneous Framework
Initialization
– The framework must be initialized before it may be used
– Scheduler and scheduler parameters chosen
– Bus and Device Configuration Files read
– Log file specified
– Data structures created
– Helper threads are created that move tasks from a device’s task queue to the device. These threads are real-time threads that use a round-robin scheduling policy.
microHeterogeneous Computing Framework Overview
Device Configuration File
Determines what devices are available in the microHeterogeneous environment
File is XML based which makes it easy for other programs to generate and parse device configuration files
The following is configurable for each device:
– Unique ID
– Name
– Description
– Bus that the device uses
– A list of API calls that the device supports; each API call in the list contains:
    The ID and Name of the API call
    The speedup achieved as compared to the host processor
    The expected time to completion (ETC) of the API call, given in microseconds per byte of input
Example Device Configuration File
<mHCDeviceConfig>
  <Device>
    <ID>0</ID>
    <Name>Host</Name>
    <Description>A bad host.</Description>
    <BusName>Local</BusName>
    <BusID>0</BusID>
    <APISupport>
      <Function>
        <ID>26</ID>
        <Name>mhc_combination_next</Name>
        <Speedup>1</Speedup>
        <CompletionTime>.015</CompletionTime>
      </Function>
      <Function>
        <ID>9</ID>
        <Name>mhc_vector_sub</Name>
        <Speedup>1</Speedup>
        <CompletionTime>.001</CompletionTime>
      </Function>
    </APISupport>
  </Device>
  <Device>
    <ID>1</ID>
    <Name>Vector1</Name>
    <Description>A simple vector processor.</Description>
    <BusName>PCI</BusName>
    <BusID>1</BusID>
    <APISupport>
      <Function>
        <ID>9</ID>
        <Name>mhc_vector_sub</Name>
        <Speedup>10</Speedup>
        <CompletionTime>.0001</CompletionTime>
      </Function>
    </APISupport>
  </Device>
</mHCDeviceConfig>
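Because the file is plain XML, any standard parser can read it. The sketch below uses Python's `xml.etree.ElementTree` on the example file above; the `load_devices` function and the dictionary layout it returns are illustrative choices, not part of the mHC framework (which is written in C).

```python
import xml.etree.ElementTree as ET

DEVICE_CONFIG = """
<mHCDeviceConfig>
  <Device>
    <ID>0</ID><Name>Host</Name>
    <Description>A bad host.</Description>
    <BusName>Local</BusName><BusID>0</BusID>
    <APISupport>
      <Function><ID>26</ID><Name>mhc_combination_next</Name>
        <Speedup>1</Speedup><CompletionTime>.015</CompletionTime></Function>
      <Function><ID>9</ID><Name>mhc_vector_sub</Name>
        <Speedup>1</Speedup><CompletionTime>.001</CompletionTime></Function>
    </APISupport>
  </Device>
  <Device>
    <ID>1</ID><Name>Vector1</Name>
    <Description>A simple vector processor.</Description>
    <BusName>PCI</BusName><BusID>1</BusID>
    <APISupport>
      <Function><ID>9</ID><Name>mhc_vector_sub</Name>
        <Speedup>10</Speedup><CompletionTime>.0001</CompletionTime></Function>
    </APISupport>
  </Device>
</mHCDeviceConfig>
"""


def load_devices(xml_text):
    """Return one dict per <Device>, with its supported API calls keyed by name."""
    devices = []
    for dev in ET.fromstring(xml_text).findall("Device"):
        functions = {}
        for fn in dev.findall("APISupport/Function"):
            functions[fn.findtext("Name")] = {
                "id": int(fn.findtext("ID")),
                "speedup": float(fn.findtext("Speedup")),
                # ETC in microseconds per byte of input
                "etc_us_per_byte": float(fn.findtext("CompletionTime")),
            }
        devices.append({
            "id": int(dev.findtext("ID")),
            "name": dev.findtext("Name"),
            "bus": dev.findtext("BusName"),
            "functions": functions,
        })
    return devices


devices = load_devices(DEVICE_CONFIG)
# Which devices can execute a given API call?
capable = [d["name"] for d in devices if "mhc_vector_sub" in d["functions"]]
print(capable)
```

A scheduler only needs this lookup (which devices support a call, and at what cost per byte) to build its candidate list for each task.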
Bus Configuration File
Determines the characteristics of the buses being used by the devices
File is XML based, which makes it easy for other programs to generate and parse bus configuration files
The following is configurable for each bus:
– Unique ID
– Name
– Description
– Initialization Time
    Specified in microseconds
    Taken into account once during the framework initialization
– Overhead Time
    Specified in microseconds
    Taken into account once for every bus transaction
– Transfer Time
    Specified in microseconds per byte
    Taken into account once for every byte that is transmitted over the bus
Example Bus Configuration File
<mHCBusConfig>
  <Bus>
    <ID>0</ID>
    <Name>Local</Name>
    <Description>Used by the host</Description>
    <InitTime>0</InitTime>
    <Overhead>0</Overhead>
    <TransferTime>0</TransferTime>
  </Bus>
  <Bus>
    <ID>1</ID>
    <Name>PCI</Name>
    <Description>PCI bus</Description>
    <InitTime>50</InitTime>
    <Overhead>0.01</Overhead>
    <TransferTime>0.002</TransferTime>
  </Bus>
</mHCBusConfig>
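The three bus parameters suggest a simple linear cost model: a fixed overhead per transaction plus a per-byte transfer time. The sketch below parses the example file and applies that model. The function names are invented, and the way the real framework combines the parameters is an assumption based on the descriptions above.

```python
import xml.etree.ElementTree as ET

BUS_CONFIG = """
<mHCBusConfig>
  <Bus><ID>0</ID><Name>Local</Name><Description>Used by the host</Description>
    <InitTime>0</InitTime><Overhead>0</Overhead><TransferTime>0</TransferTime></Bus>
  <Bus><ID>1</ID><Name>PCI</Name><Description>PCI bus</Description>
    <InitTime>50</InitTime><Overhead>0.01</Overhead><TransferTime>0.002</TransferTime></Bus>
</mHCBusConfig>
"""


def load_buses(xml_text):
    """Return {bus name: parameters}, with all times in microseconds."""
    buses = {}
    for bus in ET.fromstring(xml_text).findall("Bus"):
        buses[bus.findtext("Name")] = {
            "init_us": float(bus.findtext("InitTime")),
            "overhead_us": float(bus.findtext("Overhead")),
            "transfer_us_per_byte": float(bus.findtext("TransferTime")),
        }
    return buses


def transaction_time_us(bus, n_bytes):
    """Assumed cost of one bus transaction: overhead once, transfer per byte."""
    return bus["overhead_us"] + bus["transfer_us_per_byte"] * n_bytes


buses = load_buses(BUS_CONFIG)
# Moving one 100 x 100 matrix of 8-byte doubles (80,000 bytes):
print(transaction_time_us(buses["PCI"], 80_000))    # about 160 microseconds
print(transaction_time_us(buses["Local"], 80_000))  # 0.0 for the host's local bus
```

The initialization time is left out of the per-transaction cost because, per the description above, it is charged only once at framework start-up.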
Phases of the microHeterogeneous Framework
Task Creation
– A new task is created for every API call that is made, except for initialization, finalization, and join calls
– Tasks encapsulate all of the information of a function call
    ID of function to execute
    List of pointers to all of the arguments
    List of pointers to all of the data blocks used as inputs and their sizes
    List of pointers to all of the data blocks used as outputs and their sizes
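A task record with these four fields might look like the following. This is a hypothetical Python rendering for illustration only: the framework itself is written in C and uses pointers, whereas here object references stand in for them, and the class and field names are invented.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class DataBlock:
    """A block of memory read or written by a task."""
    data: bytearray

    @property
    def size(self) -> int:
        return len(self.data)


@dataclass
class Task:
    function_id: int                                         # ID of the API function to execute
    args: Tuple[Any, ...] = ()                               # references to all arguments
    inputs: List[DataBlock] = field(default_factory=list)    # blocks read by the call
    outputs: List[DataBlock] = field(default_factory=list)   # blocks written by the call

    def input_bytes(self) -> int:
        return sum(block.size for block in self.inputs)

    def output_bytes(self) -> int:
        return sum(block.size for block in self.outputs)


# Example: a vector subtraction on two 100-element vectors of 8-byte doubles.
# Function ID 9 is the ID given to mhc_vector_sub in the example device
# configuration file; the argument layout is an assumption.
a, b, result = (DataBlock(bytearray(800)) for _ in range(3))
task = Task(function_id=9, args=(a, b, result), inputs=[a, b], outputs=[result])
print(task.input_bytes(), task.output_bytes())
```

Recording input and output blocks with their sizes gives a scheduler what it needs to estimate transfer costs, and gives the framework a way to detect when one task reads a block that an earlier task writes.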
Phases of the microHeterogeneous Framework (cont)
Task Scheduling
– After a task is created, it is passed to the scheduling algorithm that was selected during initialization
– The scheduler determines which device to assign the task to and places the task in that device’s task queue
    Done dynamically in real-time
    Profiling of applications is not required
– As soon as the scheduler has mapped the task to a device, the API call returns and the main user program is allowed to continue execution
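A dynamic, greedy version of this step can be sketched as below. The cost model is an assumption: the estimated completion time is taken to be the device's current ready time plus the bus overhead plus per-byte bus transfer and per-byte ETC. The class and function names are invented; the numbers come from the example device and bus configuration files.

```python
class Device:
    def __init__(self, name, etc_us_per_byte, bus_overhead_us, bus_us_per_byte):
        self.name = name
        self.etc_us_per_byte = etc_us_per_byte    # {function name: us per byte}
        self.bus_overhead_us = bus_overhead_us
        self.bus_us_per_byte = bus_us_per_byte
        self.queue = []                           # tasks waiting for this device
        self.ready_at_us = 0.0                    # when the queue is expected to drain

    def cost_us(self, function, n_bytes):
        per_byte = self.bus_us_per_byte + self.etc_us_per_byte[function]
        return self.bus_overhead_us + per_byte * n_bytes


def schedule(devices, function, n_bytes, now_us=0.0):
    """Queue the task on the device with the earliest estimated completion
    time and return immediately, without waiting for it to execute."""
    candidates = [d for d in devices if function in d.etc_us_per_byte]

    def completion(d):
        return max(d.ready_at_us, now_us) + d.cost_us(function, n_bytes)

    best = min(candidates, key=completion)
    best.ready_at_us = completion(best)
    best.queue.append((function, n_bytes))
    return best


host = Device("Host", {"mhc_vector_sub": 0.001, "mhc_combination_next": 0.015}, 0.0, 0.0)
vector1 = Device("Vector1", {"mhc_vector_sub": 0.0001}, 0.01, 0.002)

# Four back-to-back subtractions on 800-byte vectors.
for _ in range(4):
    print(schedule([host, vector1], "mhc_vector_sub", 800).name)
```

With these example numbers the bus transfer time outweighs the 10x speedup for a single small vector, so the first two calls stay on the host and the accelerator is only used once the host's queue has built up.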
Fast Greedy Scheduling Heuristic
Real-Time Min-Min Scheduling Heuristic
Weighted Real-Time Min-Min Scheduling Heuristic
Phases of the microHeterogeneous Framework (cont)
Task Execution
– If a task is available:
    The helper thread checks to see if there are any unresolved dependencies
    If there are no dependencies, the task is removed from the task queue and passed to the device driver for execution; otherwise the helper thread sleeps
– All tasks are executed on the host processor by simulated drivers
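The helper-thread loop can be sketched with ordinary Python threads. This is an illustration under stated assumptions, not the framework's code: the real helper threads are real-time threads with a round-robin policy, whereas these are plain threads; the `Task` class, the polling interval, and the use of a Python callable as the "simulated driver" are all invented for the example.

```python
import threading
import time
from collections import deque


class Task:
    def __init__(self, name, func, deps=()):
        self.name = name
        self.func = func               # stands in for the simulated device driver
        self.deps = list(deps)         # tasks whose outputs this task needs
        self.done = threading.Event()
        self.result = None


def helper(tasks, stop, log):
    """Move tasks from one device's queue to its (simulated) driver."""
    while not (stop.is_set() and not tasks):
        if tasks and all(dep.done.is_set() for dep in tasks[0].deps):
            task = tasks.popleft()     # no unresolved dependencies: execute
            task.result = task.func(*(dep.result for dep in task.deps))
            log.append(task.name)
            task.done.set()
        else:
            time.sleep(0.001)          # nothing runnable yet: sleep


def run_demo():
    a = Task("a", lambda: 2)
    b = Task("b", lambda x: x + 1, deps=[a])
    c = Task("c", lambda x, y: x * y, deps=[a, b])
    queues = [deque([b, c]), deque([a])]   # two devices, one queue each
    stop, log = threading.Event(), []
    threads = [threading.Thread(target=helper, args=(q, stop, log)) for q in queues]
    for t in threads:
        t.start()
    stop.set()                             # ask helpers to exit once queues drain
    for t in threads:
        t.join()
    return c.result, log


print(run_demo())
```

Task `b` sits at the head of the first queue but cannot run until `a`, queued on the other device, has finished, so its helper sleeps and retries, which is the behaviour described above.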
mHC Applications
Four mHC applications were written in order to test the performance of both the architecture and the different scheduling algorithms that were developed
– Matrix: Performs basic matrix operations on a set of fifty 100 x 100 matrices
    First twenty-five matrices are summed together
    Last twenty-five matrices are subtracted from one another
    Every fifth matrix is scaled by a constant
    Finally, the inverse of all fifty matrices is determined
– Stats: Performs basic statistics on a block of five million values
    Divides a block of five million values into 50 blocks of 100,000 values
    Calculates the standard deviation of each block of values
    Determines the blocks of data with the minimum and maximum deviations
– Linalg: Solves fifty sets of linear equations, each containing one hundred and seventy-five variables
– Random: Used to stress test the different scheduling algorithms
    Creates random task graphs consisting of 300 tasks
    Tasks created are matrix element multiplications between a group of 25 matrices
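The structure of the Stats application is simple enough to sketch sequentially. The version below is scaled down to 5,000 values so it runs instantly; the function names are invented, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import random
import statistics


def block_deviations(values, n_blocks):
    """Split `values` into n_blocks equal blocks; return each block's standard deviation."""
    size = len(values) // n_blocks
    return [statistics.stdev(values[i * size:(i + 1) * size])
            for i in range(n_blocks)]


def extreme_blocks(deviations):
    """Indices of the blocks with the minimum and maximum deviation."""
    indices = range(len(deviations))
    return (min(indices, key=deviations.__getitem__),
            max(indices, key=deviations.__getitem__))


random.seed(0)
values = [random.random() for _ in range(5_000)]   # the thesis uses five million
deviations = block_deviations(values, 50)
lowest, highest = extreme_blocks(deviations)
print(lowest, highest)
```

Each per-block standard deviation is independent of the others, so in an mHC setting each one would presumably be issued as its own API call and could be mapped to a different device.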
Simulation Methodology
Each simulation run used a three-step process:
1. The sequential version of the application was run
– Done by using the ‘-s -1’ parameter when initializing the framework
– Used to determine the estimated time to completion (ETC) for each of the API calls on the host processor
– Output recorded for comparison purposes
2. The parallel version of the application was run
– Done by using the appropriate scheduler parameter, bus configuration, and device configuration files
– Used to compare the parallel output to the sequential output to make sure that the parallelized results were correct
3. The parallel version was run using the timing mode
– Done by specifying the ‘-t’ parameter along with the parameters used in Step 2
– Used to get the final timing results for the simulation
Steps 1 and 2 were run five times; the median run was used for calculations
Matrix Simulation Results
Scheduler      2     3     4     5     6     7
Fast Greedy    2.33  2.33  2.31  2.34  2.31  2.31
RTmm           1.43  2.00  2.39  3.68  3.75  5.00
WRTmm          1.97  2.78  3.71  3.97  6.58  5.07
Stats Simulation Results
Scheduler      2     3     4     5     6     7
Fast Greedy    0.90  0.91  0.90  0.90  0.90  0.90
RTmm           0.97  1.00  1.07  1.08  1.15  1.10
WRTmm          1.03  1.11  1.14  1.14  1.16  1.17
Linalg Simulation Results
Scheduler      2     3     4     5     6     7
Fast Greedy    3.20  3.08  3.12  3.18  3.02  2.98
RTmm           1.42  3.40  2.29  4.34  3.72  6.40
WRTmm          1.64  3.40  4.49  6.00  6.44  8.28
Random Simulation – Similar Processing Elements
Simulation used between 1 and 6 additional processing elements, each having a speedup of 20x
Random Simulation – Different Processing Elements
Simulation used between 1 and 6 additional processing elements
Speedups of 20x, 10x, 5x, 2x, 1x, and 0.5x were used for devices 1 through 6 respectively.
Random Simulation – Different Bus Transfer Times
Simulation used three additional processing elements with various bus transfer times
Transfer times range from a 64-bit 33 MHz PCI bus (smallest), to a 100 Mb/s Ethernet connection (largest)
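The two endpoints of this range can be converted into the microseconds-per-byte unit used by the bus configuration file. The calculation below uses peak theoretical bandwidth only (no protocol or arbitration overhead), so the figures are lower bounds rather than the values used in the simulation.

```python
def us_per_byte(bits_per_second):
    """Peak transfer time in microseconds per byte for a given raw bit rate."""
    return 8 / bits_per_second * 1e6


pci_64_33 = us_per_byte(64 * 33e6)   # 64 bits per cycle at 33 MHz = 264 MB/s
ethernet_100 = us_per_byte(100e6)    # 100 Mb/s = 12.5 MB/s

print(f"64-bit 33 MHz PCI: {pci_64_33:.5f} us/byte")
print(f"100 Mb/s Ethernet: {ethernet_100:.5f} us/byte")
print(f"ratio:             {ethernet_100 / pci_64_33:.2f}x")
```

At peak rates the Ethernet link is roughly 21 times slower per byte than the PCI bus, which is the span of transfer times this simulation sweeps across.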
Random Simulation – Various Speedups
Simulation used four additional processing elements with various speedups
Conclusion
Accomplishments
– A new computer architecture, microHeterogeneous Computing, was presented that successfully exploits fine-grained parallelism in scientific applications using additional processing elements
– An API was created that allows developers to incorporate mHC into their applications without being required to address task scheduling, load balancing, or threading issues
– A highly configurable mHC framework was implemented as a standard library, which allows actual mHC compliant applications to be compiled and executed using standard techniques
Future Work
– Creation of mHC compliant device drivers so that an actual mHC environment can be created
– While the microHeterogeneous API currently contains the most common scientific functions, it needs to be expanded in order to become complementary to the GNU Scientific Library.
– The concept of mHC clusters needs to be fully explored in order to determine the applicability of mHC to this area of computing.
mHC Cluster Based Computing
Acknowledgements
I would like to thank the following people for making this thesis possible:
– My primary advisor, Dr. Shaaban, for allowing me to work on such an interesting and worthwhile project
– My committee members, Dr. Savakis and Dr. Heliotis, for working with very tight schedules
– My family, for supporting me through this whole process
– My sister, for putting a roof over my head for the last month and a half