Extending Open64 with Transactional Memory features Jiaqi Zhang Tsinghua University.
Contents
• Background
• Design
• Implementation
• Optimization
• Experiment
• Conclusion
Transactional Memory Background
• Trend to concurrent programming
• Current solution:
  – Lock
  – Flaws:
    • Association between locks and data
    • Deadlock
    • Not composable
Transactional Memory Background
class Account{
    int balance;
    lock mylock;
    bool credit(int amount);
    bool debit(int amount);
};

bool credit(int amount){
    acquire(mylock);
    balance+=amount;
    release(mylock);
    return true;
}
bool debit(int amount){
    acquire(mylock);
    balance-=amount;
    release(mylock);
    return true;
}

transfer(Account a, Account b, int amount){
    a.credit(amount);
    // inconsistent state
    b.debit(amount);
}

Lock-based fix:
transfer(Account a, Account b, int amount){
    acquire(a.mylock);
    acquire(b.mylock);
    a.credit(amount);
    b.debit(amount);
    release(a.mylock);
    release(b.mylock);
}
• Poor abstraction of class Account
• Deadlock
• Exposed implementation details

With transactional memory:
transfer(Account a, Account b, int amount){
    atomic{
        a.credit(amount);
        b.debit(amount);
    }
}
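The deadlock named on this slide arises when one thread runs transfer(a, b) while another runs transfer(b, a), so the two locks are taken in opposite orders. The usual lock-based workaround is sketched below in C with pthreads, to show how much the caller has to know about the locks. The names account_t, account_init and transfer_locked are hypothetical and do not come from the slides; the parameters follow the slide, where the first account is credited and the second is debited.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical C rendering of the Account class from the slide. */
typedef struct {
    int balance;
    pthread_mutex_t mylock;
} account_t;

void account_init(account_t *acc, int balance) {
    acc->balance = balance;
    pthread_mutex_init(&acc->mylock, NULL);
}

/* Credits 'a' and debits 'b' while holding both locks, so no other thread
 * observes the intermediate state. The locks are taken in address order so
 * that concurrent transfer_locked(a, b) and transfer_locked(b, a) cannot
 * deadlock. The caller-visible use of mylock is the abstraction leak. */
void transfer_locked(account_t *a, account_t *b, int amount) {
    if (a == b) return;  /* same account: nothing to move, avoid double lock */

    account_t *first  = ((uintptr_t)a < (uintptr_t)b) ? a : b;
    account_t *second = (first == a) ? b : a;

    pthread_mutex_lock(&first->mylock);
    pthread_mutex_lock(&second->mylock);
    a->balance += amount;   /* a.credit(amount) without re-taking a's lock */
    b->balance -= amount;   /* b.debit(amount) without re-taking b's lock */
    pthread_mutex_unlock(&second->mylock);
    pthread_mutex_unlock(&first->mylock);
}
```

Note that the sketch cannot reuse credit() and debit() as written on the slide, because they would try to take a lock that is already held. That is the composability problem the atomic block avoids.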
Transactional Memory Background
• Current Implementations
  – TM libraries
    • DSTM
    • DracoSTM
    • TL2
    • TinySTM
    • …
Explicit Transaction
Function calls:
TM_INIT()/TM_SHUTDOWN()
TM_ATOMIC_BEGIN()/TM_ATOMIC_END()
TM_SHARED_READ()/TM_SHARED_WRITE()
Transactional Memory Background
• Current Implementations
  – Compilers
    • Intel C++ STM Compiler
    • Tanger
    • OpenTM
    • GCC
Design
• Programming Interfaces
#pragma tm atomic [clause]
structured block
Clauses:
    readonly
    private(var list)
    shared(var list)
#pragma tm abort
#pragma tm function
function declaration
#pragma tm waiver
function declaration
Design
• TM runtime interfaces (TL2)
Interface | Description
Thread* TxNewThread() | Allocate a new Thread structure to keep logs
TxStart(Thread* Self, jmp_buf* buf, int flags) | Start a new transaction for current thread
TxCommit(Thread* Self) | Commit the current transaction
TxLoad(Thread* Self, void* addr) | Perform synchronized load from given memory address
TxStore(Thread* Self, void* addr, intptr_t val) | Perform synchronized store to given memory address
TxStoreLocal(Thread* Self, void* addr, intptr_t val) | Perform locally logged store to given memory address
TxAbort(Thread* Self) | Abort the current transaction and re-execute
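To make the calling pattern of this interface concrete, here is a single-threaded mock in C. It is not TL2: it only buffers stores in a log and writes them back at commit, with no locking, versioning or conflict detection. The struct layout, MOCK_LOG_MAX, and the intptr_t return type of TxLoad are assumptions, since the table does not give them. TxStoreLocal and TxAbort are left out of this mock.

```c
#include <setjmp.h>
#include <stdint.h>
#include <stdlib.h>

#define MOCK_LOG_MAX 64   /* hypothetical fixed log size for the mock */

typedef struct { intptr_t *addr; intptr_t val; } mock_write;

typedef struct Thread {
    mock_write wlog[MOCK_LOG_MAX];  /* buffered stores of the transaction */
    int nwrites;
    jmp_buf *env;                   /* restart point; stored but unused here */
} Thread;

Thread *TxNewThread(void) {
    return calloc(1, sizeof(Thread));
}

void TxStart(Thread *Self, jmp_buf *buf, int flags) {
    (void)flags;
    Self->env = buf;
    Self->nwrites = 0;
}

/* A load must see the transaction's own earlier stores, so search the log
 * from newest to oldest before falling back to memory. */
intptr_t TxLoad(Thread *Self, void *addr) {
    for (int i = Self->nwrites - 1; i >= 0; i--)
        if (Self->wlog[i].addr == (intptr_t *)addr)
            return Self->wlog[i].val;
    return *(intptr_t *)addr;
}

/* A store is only recorded; memory is untouched until commit. */
void TxStore(Thread *Self, void *addr, intptr_t val) {
    if (Self->nwrites == MOCK_LOG_MAX) abort();
    Self->wlog[Self->nwrites].addr = addr;
    Self->wlog[Self->nwrites].val = val;
    Self->nwrites++;
}

/* Commit replays the log in program order, then clears it. */
void TxCommit(Thread *Self) {
    for (int i = 0; i < Self->nwrites; i++)
        *Self->wlog[i].addr = Self->wlog[i].val;
    Self->nwrites = 0;
}
```

With this mock, a store made inside a transaction is visible to later TxLoad calls of the same transaction but reaches memory only when TxCommit runs.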
Design
• Wrapper functions
  – To ease the process of integrating new TM libraries
By programmers:
tm_init()/tm_finalize()
tm_thread_start()/tm_thread_end()
By compiler:
__tm_atomic_begin()/__tm_atomic_end()
__tm_shared_read()/__tm_shared_read_float()
__tm_shared_write()/__tm_shared_write_float()
__tm_local_write()/__tm_local_write_float()
More wrapper functions are needed for other data types and for additional TM semantics.
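One reason per-type wrappers are needed is that the runtime barriers move intptr_t words, while a float is usually narrower than a word. The C sketch below shows one possible way a float wrapper could sit on top of word-sized barriers: read the aligned word that contains the float, patch the float's bytes, and store the word back. This read-modify-write scheme, the stand-in TxLoad/TxStore bodies, the global tm_self descriptor, and the wrapper parameter types are all assumptions for illustration. The slides give only the wrapper names.

```c
#include <stdint.h>
#include <string.h>

typedef struct Thread { int unused; } Thread;
static Thread tm_self;   /* hypothetical: stands in for a per-thread descriptor */

/* Stand-in barriers: plain word-sized access, no logging or synchronization. */
static intptr_t TxLoad(Thread *Self, void *addr) {
    intptr_t v;
    (void)Self;
    memcpy(&v, addr, sizeof v);
    return v;
}
static void TxStore(Thread *Self, void *addr, intptr_t val) {
    (void)Self;
    memcpy(addr, &val, sizeof val);
}

/* Word-sized wrappers: a thin layer that hides the runtime's own API. */
intptr_t __tm_shared_read(void *addr) { return TxLoad(&tm_self, addr); }
void __tm_shared_write(void *addr, intptr_t val) { TxStore(&tm_self, addr, val); }

/* Start of the aligned word that contains address a. */
static uintptr_t word_base(uintptr_t a) {
    return a & ~(uintptr_t)(sizeof(intptr_t) - 1);
}

/* Float read: load the containing word, then copy out the float's bytes. */
float __tm_shared_read_float(float *addr) {
    uintptr_t a = (uintptr_t)addr, base = word_base(a);
    intptr_t word = TxLoad(&tm_self, (void *)base);
    float f;
    memcpy(&f, (char *)&word + (a - base), sizeof f);
    return f;
}

/* Float write: load the containing word, patch the float's bytes, store back. */
void __tm_shared_write_float(float *addr, float val) {
    uintptr_t a = (uintptr_t)addr, base = word_base(a);
    intptr_t word = TxLoad(&tm_self, (void *)base);
    memcpy((char *)&word + (a - base), &val, sizeof val);
    TxStore(&tm_self, (void *)base, word);
}
```

Swapping in another TM library would then only require changing the bodies of these wrappers, which matches the stated goal of easing integration. The sketch assumes a float never straddles a word boundary, which holds when floats are naturally aligned.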
Design
• Optimization– Eliminate redundant calls to runtime libraries
Implementation
• General Transformation
  – #pragma tm atomic
  – simple statements
  – control flow statements
    • IF
    • WHILE_DO

#pragma tm atomic is lowered to:
setjmp();
__tm_atomic_begin();

Simple statement:
a = b+c;
is lowered to:
PARM #address of c
CALL <__tm_shared_read>
LDID <return_offset>
STID #tm_preg_num_0
PARM #address of b
CALL <__tm_shared_read>
LDID <return_offset>
STID #tm_preg_num_1
LDID #tm_preg_num_0
LDID #tm_preg_num_1
ADD
PARM
PARM #address of a
CALL <__tm_shared_write>

Control flow statement:
for(;i<10;i++){}
is lowered to:
PARM #address of i
CALL <__tm_shared_read>
LDID <return_offset>
STID #tm_preg_num_0
WHILE_DO
  LDID #tm_preg_num_0
  INTCONST 9
  LE
BODY
  BLOCK
    …
    PARM #address of i
    CALL <__tm_shared_read>
    LDID <return_offset>
    STID #tm_preg_num_0
  END_BLOCK
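The WHILE_DO lowering above can be read as the following C sketch: the loop condition is evaluated on a temporary (the pseudo-register) that is loaded once before the loop and refreshed at the end of each body. The barrier bodies here are stand-ins that only count calls, and the increment inside the loop is an assumption about what the elided part of the body contains. Only the two reads that feed the condition appear on the slide.

```c
#include <stdint.h>

int read_calls;    /* number of read barriers executed */
int write_calls;   /* number of write barriers executed */

/* Stand-in barriers: direct access plus a call counter. */
intptr_t __tm_shared_read(intptr_t *addr) { read_calls++; return *addr; }
void __tm_shared_write(intptr_t *addr, intptr_t v) { write_calls++; *addr = v; }

/* Instrumented form of: for (; i < 10; i++) {} */
void instrumented_loop(intptr_t *i) {
    intptr_t t0 = __tm_shared_read(i);      /* STID #tm_preg_num_0 before the loop */
    while (t0 <= 9) {                       /* LDID #tm_preg_num_0, INTCONST 9, LE */
        /* the original loop body would be instrumented here */
        __tm_shared_write(i, __tm_shared_read(i) + 1);   /* assumed: i++ */
        t0 = __tm_shared_read(i);           /* refresh the temporary at end of BODY */
    }
}
```

Starting from i = 0, this form executes 21 read barriers and 10 write barriers for an empty ten-iteration loop, which gives a sense of why eliminating redundant barrier calls matters.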
Implementation
• General Transformation
Source:
int i = 0;
#pragma tm atomic
{
    int j = 0;
    for(i=0;i<20;i++)
    {
        for(j=0;j<10;j++)
        {
            result++;
        }
    }
}
Transformed:
int i = 0;
jmp_buf jbuf;
_setjmp(jbuf);
TxStart(Self, jbuf);
TxStore(Self, &j, 0);
for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20;
     TxStore(Self, &i, TxLoad(Self, &i)+1)){
    for (TxStore(Self, &j, 0); TxLoad(Self, &j)<10;
         TxStore(Self, &j, TxLoad(Self, &j)+1)){
        TxStore(Self, &result, TxLoad(Self, &result)+1);
    }
}
TxCommit(Self);
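The setjmp call placed before TxStart is what lets an aborted transaction re-execute: the runtime can longjmp back to that point and the transaction body runs again from the top. The C sketch below isolates that control flow. The Thread fields, the aborts counter and run_with_retries are hypothetical, and the conflict is simulated rather than detected.

```c
#include <setjmp.h>

typedef struct Thread {
    jmp_buf *env;   /* restart point registered by TxStart */
    int aborts;     /* how many times this thread has aborted */
} Thread;

void TxStart(Thread *Self, jmp_buf *buf, int flags) {
    (void)flags;
    Self->env = buf;
}

/* Abort jumps back to the setjmp call and never returns to its caller. */
void TxAbort(Thread *Self) {
    Self->aborts++;
    longjmp(*Self->env, 1);
}

void TxCommit(Thread *Self) {
    Self->env = 0;
}

/* Runs one transaction that is forced to abort fail_times times before it
 * commits. Returns the number of attempts that were needed. */
int run_with_retries(Thread *Self, int fail_times) {
    jmp_buf jbuf;
    volatile int attempts = 0;   /* volatile: changed between setjmp and longjmp */

    setjmp(jbuf);                /* every abort resumes execution here */
    TxStart(Self, &jbuf, 0);
    attempts++;
    if (Self->aborts < fail_times)
        TxAbort(Self);           /* simulated conflict */
    TxCommit(Self);
    return attempts;
}
```

The volatile qualifier matters: a non-volatile local that is modified after setjmp has an indeterminate value after longjmp. This is presumably one reason stores to such locals inside a transaction need runtime support instead of plain assignments.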
Implementation
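• Functions
  – clone and instrument
A function marked with the pragma:
#pragma tm function
void calculate(){}
is compiled into two versions:
void calculate()
__tm_cloned__calculate() //instrumented
Calls inside a transaction:
#pragma tm atomic
{ calculate(); }
are redirected to the clone:
#pragma tm atomic
{ __tm_cloned__calculate(); }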
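Written out by hand in C, the result of cloning might look like the sketch below: the original function keeps plain memory accesses for callers outside transactions, and the clone routes every shared access through a barrier. The function body, the shared variable total, and the counting barriers tm_read and tm_write are hypothetical. The slide shows only an empty calculate().

```c
#include <stdint.h>

intptr_t total;       /* hypothetical shared variable */
int barrier_calls;    /* counts barrier invocations, for illustration only */

/* Stand-in barriers: direct access plus a call counter. */
static intptr_t tm_read(intptr_t *p) { barrier_calls++; return *p; }
static void tm_write(intptr_t *p, intptr_t v) { barrier_calls++; *p = v; }

/* Original version: used by calls outside any transaction, no overhead. */
void calculate(void) {
    total += 2;
}

/* Instrumented clone: used by calls inside a tm atomic block. */
void __tm_cloned__calculate(void) {
    tm_write(&total, tm_read(&total) + 2);
}
```

Keeping both versions means non-transactional callers pay nothing for the instrumentation, at the cost of larger code size.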
Implementation
• Optimization
Source: unchanged from the General Transformation example.
Transformed, before optimization:
int i = 0;
jmp_buf jbuf;
_setjmp(jbuf);
TxStart(Self, jbuf);
TxStore(Self, &j, 0);
for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20;
     TxStore(Self, &i, TxLoad(Self, &i)+1)){
    for (TxStore(Self, &j, 0); TxLoad(Self, &j)<10;
         TxStore(Self, &j, TxLoad(Self, &j)+1)){
        TxStore(Self, &result, TxLoad(Self, &result)+1);
    }
}
TxCommit(Self);
Transaction local variables (here j): detected by the frontend
Implementation
• Optimization
Source: unchanged from the General Transformation example.
Transformed, with barriers removed for the transaction local variable j:
int i = 0;
jmp_buf jbuf;
_setjmp(jbuf);
TxStart(Self, jbuf);
j=0;
for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20;
     TxStore(Self, &i, TxLoad(Self, &i)+1)){
    for (j=0; j<10; j++){
        TxStore(Self, &result, TxLoad(Self, &result)+1);
    }
}
TxCommit(Self);
Barrier free variables (here i): detected according to their storage class
Implementation
• Optimization
Source: unchanged from the General Transformation example.
Transformed, with the barrier free variable i read directly and written through TxStoreLocal:
int i = 0;
jmp_buf jbuf;
_setjmp(jbuf);
TxStart(Self, jbuf);
j=0;
for (; i<20; TxStoreLocal(Self, &i, i+1)){
    for (j=0; j<10; j++){
        TxStore(Self, &result, TxLoad(Self, &result)+1);
    }
}
TxCommit(Self);
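The variable i is private to the thread, so it needs no synchronization, but it was declared outside the atomic block, so its pre-transaction value presumably has to be restored if the transaction aborts and re-executes. A "locally logged store" can be sketched in C as an undo log: record the old value, then write in place. The log layout, UNDO_MAX, and the helper names mock_rollback_locals and mock_commit_locals are hypothetical. Only the TxStoreLocal signature comes from the interface table.

```c
#include <stdint.h>
#include <stdlib.h>

#define UNDO_MAX 64   /* hypothetical fixed log size for the mock */

typedef struct { intptr_t *addr; intptr_t old; } undo_entry;

typedef struct Thread {
    undo_entry undo[UNDO_MAX];   /* old values of locally stored variables */
    int nundo;
} Thread;

/* Locally logged store: remember the old value, then write in place.
 * No locking or validation is involved, so it is cheaper than TxStore. */
void TxStoreLocal(Thread *Self, void *addr, intptr_t val) {
    intptr_t *p = addr;
    if (Self->nundo == UNDO_MAX) abort();
    Self->undo[Self->nundo].addr = p;
    Self->undo[Self->nundo].old = *p;
    Self->nundo++;
    *p = val;
}

/* On abort: restore old values, newest entry first, so that the value from
 * before the first store to each address wins. */
void mock_rollback_locals(Thread *Self) {
    while (Self->nundo > 0) {
        Self->nundo--;
        *Self->undo[Self->nundo].addr = Self->undo[Self->nundo].old;
    }
}

/* On commit: the in-place values are final, so the log is discarded. */
void mock_commit_locals(Thread *Self) {
    Self->nundo = 0;
}
```

Reads of such a variable need no barrier at all, because the current value is always in place. This matches the plain i<20 and i+1 in the transformed loop.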
Implementation
• Optimization
  – Optimization opportunities detection strategy
    • Pthread parallel task
      – transaction local: declared in tm atomic scope
      – barrier free: auto variables
    • Cloned transactional function
      – transaction local: declared in the function
    • OpenMP parallel task
      – transaction local: declared in tm atomic scope
      – barrier free: declared in micro task, marked in OpenMP private clause
    • Checking readonly transactions
  – Limitation
    • Conservative design for pointers
    • Needs programmers to participate in optimization
Preliminary Experiments
• Compare with fine-grained lock based application
Preliminary Experiments
• Compare with manually instrumented application
Preliminary Experiments
#pragma tm atomic
{
    int j;
    (*new_centers_len[index])++;
    for(j=0;j<nfeatures;j++){
        new_centers[index][j]+=feature[i][j];
    }
}
Clause added to the pragma: private(feature)
Conclusion & Future work
• An infrastructure for TM on Open64
  – Replaceable TM implementation
  – Optimization
• More experiments on non-trivial applications are desired
• Nested transaction
• Signal processing
• Event handler
• Indirect calls
• Dealing with legacy code
• …
FastDB: 8 out of 75 critical regions contain nested transactions
FastDB: 28 out of 75 critical regions contain signal processing
PARSEC: 20 out of 55 critical regions contain signal processing
Thanks