Energy / Performance trade-offs for transactional memory applications with an adaptive thread...
School of Software Engineering
Under the guidance of Professor Xu Yanling
June 29, 2015
Thesis proposal defense for a professional Master's Degree in Computer Science
Energy / Performance trade-offs for transactional memory applications with an adaptive thread mapping method
Justin Brottes, M.S.
School of Software Engineering
Tongji University
Agenda
● Introduction
○ Motivation
○ Scientific question
○ Research background
● Rationale for the study
● Literature Review
● Methodology
○ Hypotheses
○ Research design
Introduction
Research motivation
I am experienced in low-level application development, elementary and object-oriented languages, 3D rendering techniques (raycasting, raytracing), and Linux/Unix systems.
I am eager to deepen my knowledge of parallel computing, low-level application behaviour, and several other topics relevant to the supercomputing field.
Justin Brottes
M.S. in Information Technologies at EPITECH (France)
M.S. in Software Engineering at Tongji University (China)
“ Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). ”
Parallel Computing
Parallel computations can be carried out at multiple levels, such as instructions, branch targets, loops, execution traces, and subroutines. Instruction-Level Parallelism (ILP) is a measure of how many instructions can run in parallel.
ILP can be exploited in software (by the compiler) or in hardware (pipelining)
Parallel Computations
What issues exist with implementing parallel computing?
"The way the processor industry is going, is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it."
- Steve Jobs, Co-Founder of Apple
"Redesigning your application to run multithreaded on a multicore machine is a little like learning to swim by jumping into the deep end."
- Herb Sutter, Chair of the ISO C++ standards committee, Microsoft
Scientific question
● Software often fails to benefit from the underlying computing resources, and much of it requires an immense effort and cost of rewriting and re-engineering
● Energy is increasingly becoming one of the most expensive resources and the largest cost item in running a large supercomputing installation
How can we optimize performance and reduce energy consumption for supercomputing applications?
Main methods to develop parallel computing solutions
Research background
A mutex is a lockable object designed to signal when critical sections of code need exclusive access. It prevents other threads with the same protection from executing concurrently and accessing the same memory locations.
What is a Mutual Exclusion (Mutex) ?
● A mutex has the property that only its owner can release it
● Mutexes can cause deadlocks and priority inversion if they are not handled properly
● When synchronization between threads is needed, mutexes are costly
Mutex drawbacks
TM is intended to simplify parallel programming, specifically access to shared data across multiple threads. A transaction is a sequence of memory operations that either executes completely (commit) or has no effect (abort).
An "all or nothing" sequence of operations:
● On commit, all memory operations appear to take effect as a unit (all at once)
● On abort, none of the stores appear to take effect
Transactions run in isolation :
● The effects of stores are not visible until the transaction commits
● No concurrent conflicting accesses by other transactions
What is Transactional Memory?
● Performance degradation can be experienced when applications run with a non-optimal concurrency level
● This also causes a huge increase in energy consumption
TM drawbacks
Rationale for the study
1. Understand how the performance of TM behaves on multicore platforms. I first take a deeper look at the impact of software TM systems on the performance of TM applications.
2. Apply an existing approach, and extend it, to improve the performance of TM applications by exploiting the memory hierarchy of modern multicore platforms.
3. Extend the aforementioned approach, introduced by several researchers, to predict and apply suitable thread mapping strategies for TM applications, in order to propose an evolution of it.
Step by step
Literature Review
Evolutions & periods:
● Beginning to the 2000s
● 2000s to 2007
● 2007 to 2011
● 2012 to nowadays
Beginning to 2000s
● The transactional memory programming paradigm, established by Herlihy and Moss, started to replace mutexes
● The transactional memory research field was largely restricted to a few pioneering researchers and supercomputing firms
● Work was highly focused on developing methods and on ways of virtualising TM
● The main theoretical concepts were implemented in several projects between 1995 and 1997, followed by a long period of standards definition
2000s to 2007 : STM evolution
● STM design has come a long way since the first STM algorithm by Shavit and Touitou appeared, which provided a non-blocking implementation of static transactions
● Many experiments were conducted to develop newer solutions and to compare them with mutexes
2007 to 2011 : Late use of HTM
● Much research concluded that TM had not yet gained the maturity necessary to present a compelling value proposition that would trigger its widespread adoption
● After several years of active research, there was a lack of mentions in the research literature of large-scale applications that make use of TM
● The appearance of real HTM implementations showed that hardware optimisation was giving more encouraging results
● It was only in 2007 that the first hardware implementation of transactional memory was developed by Sun Microsystems
2012 to Nowadays : Automatisation & Optimisation
● The research focus has shifted to performance instead of methodologies
● The main studies rely on automated thread mapping using different approaches
● Adaptive software basically uses available information about changes in its environment to improve its behaviour over time (machine learning)
● The issue of reducing energy consumption in high-performance multiprocessor systems is also quickly becoming urgent
The position of my study in the current research field
● My research takes place in the software part of TM
● Software-level optimisation is open to a wide range of developers. It gives them an alternative to hardware optimisation, which can be more restrictive due to system costs
● It is truly important to gain performance at the software level; in all cases it helps applications behave better
Methodology
Hypotheses
● The performance of TM applications can be improved if we match their characteristics to the underlying multicore platform
● The impact of applying thread mapping to TM applications has been explored many times and its efficiency has been shown. I want to confirm these intuitions and propose an approach capable of predicting suitable thread mappings
● I need to find a meaningful metric that can be added to account for energy consumption, because it is really complicated to derive viable results
Research design : Requirements
● The main approach for this research is to use a compiler which includes all STM functionalities (C/C++)
Research design : Difficulties
The increased complexity of developing parallel programs can be eased by a good understanding of the application's effective behaviour in its specific hardware and software execution contexts.
There are basically two main approaches to achieving this goal:
● Execution analysis : collects runtime information about the application behavior and uses such information to perform some action at runtime.
● Post-execution analysis : the collected runtime information is recorded in a detailed log (trace file) for later analysis.
Research design : Steps
There are three main steps in the research method:
1. Understand parallel application behaviour better: I need to run many tests and benchmarks to collect data and analyse it.
2. ML-based approach: the learning phase. The learning phase is subdivided into three major steps: application profiling, data pre-processing, and the learning process.
3. The last step will consist of implementing the system as an extension of the chosen system and running tests to confirm the efficiency of the method.
Thank you.
Do you have any questions?