A Methodology for Implementing Highly Concurrent Data Objects Maurice Herlihy October 1991


Presented by Tina Swenson, April 15, 2010

Agenda
Introduction
Small Objects
◦Non-Blocking Transformation
◦Wait-free Transformation
Large Objects
◦Non-Blocking Transformation
Conclusion

Introduction

Key Words
Critical Section – In the author’s context, CS refers to blocking code.

Non-blocking (NB) – some process will complete its operation after a finite number of steps.

Wait-free (a.k.a. starvation-free) (WF) – all processes will complete their operations after a finite number of steps.

Motivation
Conventional Techniques – The use of a critical section (by the author’s definition) means only one process at a time has access to the data.

Implementing NB/WF – We cannot use a critical section, since it could cause a process to block forever (violating the definitions of NB and WF).

Practical issues addressed:
◦Reasoning is hard.
◦Fault tolerance is costly.

Automatic Transformations
Allow the programmer to reason and program sequentially.

The sequential code is converted into concurrent objects.
◦The author doesn’t specify what performs this transformation!

Access to the concurrent object is protected via atomic instructions.

Atomics Used
Load_linked
◦Copies the value of the shared variable to a local.
◦Watches the memory for any other processor accessing it.

Store_conditional
◦Uploads the new version to the shared variable, returning success or failure.
◦If LL tells SC that some other process accessed the memory, SC will fail.
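Portable C has no LL/SC instruction, so as an illustration only, here is a single-threaded model of their contract. The version counter stands in for the reservation a real implementation gets from the cache-coherence hardware; all names here are this sketch's, not the paper's.

```c
#include <stdint.h>

/* Illustrative single-threaded model of LL/SC. Each successful store bumps
 * a version counter; store_conditional fails if the counter has moved since
 * the matching load_linked. Real LL/SC is a hardware primitive. */
typedef struct {
    intptr_t value;
    uint64_t version;   /* bumped on every successful store */
} ll_sc_cell;

static uint64_t linked_version;  /* the "reservation" taken by load_linked */

intptr_t load_linked(ll_sc_cell *c) {
    linked_version = c->version;  /* remember when we looked */
    return c->value;
}

/* Returns 1 on success, 0 if another store intervened. */
int store_conditional(ll_sc_cell *c, intptr_t v) {
    if (c->version != linked_version)
        return 0;                 /* someone else stored: fail */
    c->value = v;
    c->version++;
    return 1;
}
```

The retry loops later in these slides are exactly this pattern: load_linked, build a new value, store_conditional, and loop on failure.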

Atomics Used
3 Reasons for LL and SC:
1. Efficient implementation in cache-coherent architectures.
2. CAS instruction is inadequate: less efficient & more complex.
3. LL and SC are easy to use (compared to CAS code).

Correctness
Linearizability.
◦Used as the basic correctness condition for the concurrent objects created by the automatic transformation.

Is this claim really strong enough? What about this quote from p. 18?
◦"...as long as the store_conditional has no spurious failures, each operation will complete after at most 2 loop iterations."

Priority Queues
The author implements a priority queue to test his new coding paradigm.

Dequeue Sequential Code

int pqueue_deq(pqueue_type *p){
    int best;
    if (!p->size) return PQUEUE_EMPTY;
    best = p->element[0];
    p->element[0] = p->element[--p->size];
    pqueue_heapify(p, 0);
    return best;
}

Notice: No code to protect the shared data!
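The slides omit the helpers this snippet depends on. A minimal self-contained sketch follows; the array size, pqueue_enq, and pqueue_heapify are illustrative reconstructions of a standard array-based max-heap, not the paper's code.

```c
#define PQUEUE_SIZE  64
#define PQUEUE_EMPTY (-1)

/* Minimal array-based max-heap; field names follow the slides,
 * sizes are illustrative. */
typedef struct {
    int size;
    int element[PQUEUE_SIZE];
} pqueue_type;

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Restore the heap property downward from index i. */
void pqueue_heapify(pqueue_type *p, int i) {
    for (;;) {
        int l = 2 * i + 1, r = 2 * i + 2, best = i;
        if (l < p->size && p->element[l] > p->element[best]) best = l;
        if (r < p->size && p->element[r] > p->element[best]) best = r;
        if (best == i) return;
        swap(&p->element[i], &p->element[best]);
        i = best;
    }
}

/* Sift a new element up from the bottom. */
int pqueue_enq(pqueue_type *p, int v) {
    int i = p->size++;
    p->element[i] = v;
    while (i > 0 && p->element[(i - 1) / 2] < p->element[i]) {
        swap(&p->element[i], &p->element[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
    return 0;
}

int pqueue_deq(pqueue_type *p) {
    int best;
    if (!p->size) return PQUEUE_EMPTY;
    best = p->element[0];
    p->element[0] = p->element[--p->size];
    pqueue_heapify(p, 0);
    return best;
}
```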

Hardware & Software Used
18 processors
◦Encore Multimax with National Semiconductor NS32532 processors

Code implemented in C

Small Objects

Key Words
Small Object – An object that is small enough to be copied in one instruction.

Sequential Object – A data structure that occupies a fixed-size, contiguous region of memory.
◦The Heap.

Concurrent Object – A shared variable that holds a pointer to a structure with 2 fields:
◦Version – the Heap
◦Check[2]

Small Objects

Non-Blocking Transformations

Non-Blocking Transformation
Transforming a sequential object into a non-blocking concurrent object.

Our sequential program code must:
◦have no side-effects.
◦be total.

Race Condition
1. Processes X and Y read the pointer to block b.
2. Y replaces b with b’.
3. X copies b while Y is copying b’ to b.
4. X’s copy may not be a valid state of the sequential object.

Solution – code example coming!
◦Consistency check after copying the old version and before applying the sequential operation.

The Code: Non-BlockingThe Code: Non-Blocking

typedef struct {
    pqueue_type version;
    unsigned check[2];
} Pqueue_type;

...

We’ve converted our sequential object (the heap) into a concurrent object!

• version is our original heap.
• check is our flag to help with race conditions.

The Code: Non-Blocking

static Pqueue_type *new_pqueue;

int Pqueue_deq(Pqueue_type **Q){
    Pqueue_type *old_pqueue;
    pqueue_type *old_version, *new_version;
    int result;
    unsigned first, last;
    ...

Local copies of pointers:
• old_pqueue = the concurrent object
• old_version, new_version = the old and new heaps

result is our priority queue value removed from this Pqueue_deq operation.

first, last help us with detecting a race condition. More later.

The Code: Non-Blocking

int Pqueue_deq(Pqueue_type **Q){
    ...
    while(1){
        old_pqueue = load_linked(Q);
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        first = old_pqueue->check[1];
        copy(old_version, new_version);
        last = old_pqueue->check[0];
        if (first == last){
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version)) break;
        }
    }
    new_pqueue = old_pqueue;
    return result;
}

Use our atomic primitive load_linked to load the pointer to the concurrent object into a register and start watching that memory for any other processor trying to access it.

Dereference our old and new objects, saving pointers to their version fields.


Preventing the race condition!

Copy the old version into the new.

If the check values do not match, loop again. We failed.


If the check values DO match, now we can perform our dequeue operation!

Try to publicize the new heap via store_conditional, which could fail and we loop back.

Lastly, recycle: new_pqueue = old_pqueue makes the displaced old block the scratch copy for the next operation.

Return our priority queue result.
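The role of check[2] can be exercised in isolation: the writer bumps check[0] before touching version and copies it into check[1] when done, so a reader that sees check[1] == check[0] on either side of its copy knows no write overlapped it. A single-threaded model of just that discipline (type and function names are this sketch's, not the paper's):

```c
#include <string.h>

/* Single-threaded model of the check[2] consistency check used by the
 * non-blocking transformation (seqlock-style; names illustrative). */
typedef struct {
    int      data[4];
    unsigned check[2];   /* check[0] bumped before a write, check[1] after */
} versioned;

void writer_update(versioned *v, const int src[4]) {
    v->check[0]++;                       /* announce: write in progress */
    memcpy(v->data, src, sizeof v->data);
    v->check[1] = v->check[0];           /* announce: write complete */
}

/* Returns 1 if the copy is known consistent, 0 if a write overlapped it. */
int reader_copy(const versioned *v, int dst[4]) {
    unsigned first = v->check[1];        /* read check[1] BEFORE copying */
    memcpy(dst, v->data, sizeof v->data);
    unsigned last = v->check[0];         /* read check[0] AFTER copying */
    return first == last;
}
```

The read order matters: because the reader samples check[1] first and check[0] last, any write that starts or finishes during the copy makes the two values differ, and the copy is retried.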

Experimental Results – Small Object, Non-Blocking (naive)

Ugh! That’s terrible!
• Bus contention
• Starvation

Wasted Parallelism!

Exponential Backoff

    ...
        if (first == last){
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version)) break;
        }
        if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
        delay = random() % max_delay;
        for (i = 0; i < delay; i++); /* spin */
    } /* end while */
    new_pqueue = old_pqueue;
    return result;
}

When the consistency check or the store_conditional fails, introduce back-off for a random amount of time!
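The backoff policy can be factored into a tiny helper. DELAY_LIMIT here is an arbitrary cap and rand() stands in for the slides' random(); both are this sketch's choices.

```c
#include <stdlib.h>

#define DELAY_LIMIT 1024  /* illustrative cap on the backoff bound */

/* Double the delay bound up to DELAY_LIMIT, then busy-wait for a random
 * number of iterations below the bound. Returns the bound used. */
int backoff(int *max_delay) {
    volatile int i;       /* volatile so the spin loop is not optimized away */
    if (*max_delay < DELAY_LIMIT)
        *max_delay *= 2;
    int delay = rand() % *max_delay;
    for (i = 0; i < delay; i++)
        ;                 /* busy-wait */
    return *max_delay;
}
```

Randomizing below a doubling bound spreads retries out so that contending processes stop failing each other's store_conditional in lockstep.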

Experimental Results – Small Object, Non-Blocking (back-off)

Better, but NB is still not as fast as spin-locks (w/ backoff).

Wasted Parallelism!

Small Objects

Wait-Free Transformations

Key Words
Operational Combining –
◦Process starts an operation.
◦Record the call in Invocation.
◦Upon completion of the operation, record the result in Result.

Wait-Free Protocol
Based on non-blocking plus operational combining.

Record an operation in Invocation.
◦Invocation structure: operation name, argument value, toggle bit.

Wait-Free Protocol
Concurrent object:
◦Version
◦check[2]
◦responses[n] – new to our concurrent object!

All the processes share an array to announce invocations.

The pth element of responses is the result of process p’s last completed operation.

Wait-Free Protocol
When an operation starts, record the operation name and argument in announce[p].

When a process records a new invocation, flip the toggle bit inside the invocation struct!
◦Flipping the bit distinguishes old invocations from new invocations.

Wait-Free Protocol
New Function: apply()
◦Does the work of any waiting processes before it does its own work.

void apply(inv_type announce[MAX_PROCS], Pqueue_type *object){
    int i;
    for (i = 0; i < MAX_PROCS; i++){
        if (announce[i].toggle != object->responses[i].toggle){
            switch (announce[i].op_name){
            case ENQ_CODE:
                object->responses[i].value =
                    pqueue_enq(&object->version, announce[i].arg);
                break;
            case DEQ_CODE:
                object->responses[i].value =
                    pqueue_deq(&object->version);
                break;
            default:
                fprintf(stderr, "Unknown operation code\n");
                exit(1);
            }
            object->responses[i].toggle = announce[i].toggle;
        }
    }
}

For ALL Processes, do ALL the outstanding work!
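The announce/apply mechanism can be exercised on its own, apart from the copying machinery. In this toy model a shared counter stands in for the priority queue; the type names, field names, and INC_CODE opcode are simplified stand-ins for the slides', not the paper's.

```c
#define MAX_PROCS 4
#define INC_CODE  1   /* illustrative op code; the slides use ENQ/DEQ */

typedef struct { int op_name; int arg; int toggle; } inv_type;
typedef struct { int value; int toggle; } resp_type;

typedef struct {
    int       counter;               /* stand-in for the heap */
    resp_type responses[MAX_PROCS];  /* one result slot per process */
} obj_type;

/* Apply every announced-but-unapplied invocation, recording each result
 * and echoing the toggle bit to mark that invocation complete. */
void apply(inv_type announce[MAX_PROCS], obj_type *object) {
    int i;
    for (i = 0; i < MAX_PROCS; i++) {
        if (announce[i].toggle != object->responses[i].toggle) {
            if (announce[i].op_name == INC_CODE)
                object->responses[i].value =
                    (object->counter += announce[i].arg);
            object->responses[i].toggle = announce[i].toggle;
        }
    }
}
```

Note that a second call to apply with the same announce array is a no-op: the echoed toggle bits are what make "do all outstanding work" idempotent, so many processes can race to apply the same invocations.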

The Code: Wait-Free

typedef struct {
    pqueue_type version;
    unsigned check[2];
    response_type responses[MAX_PROCS];  /* value + toggle per process */
} Pqueue_type;

static Pqueue_type *new_pqueue;
static int max_delay;
static invocation announce[MAX_PROCS];
static int P; /* current process ID */

...

responses is new to the concurrent object; the pth element holds the result of process p’s last completed operation.

announce[P]: tracks all processes’ invocations!

The Code: Wait-Free

int Pqueue_deq(Pqueue_type **Q){
    Pqueue_type *old_pqueue;
    pqueue_type *old_version, *new_version;
    int i, delay, result, new_toggle;
    unsigned first, last;
    announce[P].op_name = DEQ_CODE;
    new_toggle = announce[P].toggle = !announce[P].toggle;
    if (max_delay > 1) max_delay = max_delay >> 1;

Record the operation name.

Flip the toggle bit.

...

while(((*Q)->responses[P].toggle != new_toggle)
    || ((*Q)->responses[P].toggle != new_toggle)){ /* deliberately read twice */
    old_pqueue = load_linked(Q);
    old_version = &old_pqueue->version;
    new_version = &new_pqueue->version;
    first = old_pqueue->check[1];
    memcopy(old_version, new_version, sizeof(pqueue_type));
    last = old_pqueue->check[0];
    if (first == last){
        apply(announce, new_pqueue); /* performs our own deq too, via announce[P] */
        if (store_conditional(Q, new_pqueue)) break;
    }
    if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
    delay = random() % max_delay;
    for (i = 0; i < delay; i++);
}
new_pqueue = old_pqueue;
return (*Q)->responses[P].value; /* our result, even if another process applied it */
}

The toggle bit is deliberately read twice: *Q can change between the two reads, and the author claims the double check is what closes the race condition explained on the Race Condition slide below.




apply() performs the pending operations on the NEW version.


Race Condition
1. P reads a pointer to version v (our heap).
2. Q replaces v with v’.
3. Q starts another operation.
4. Q checks the announce array, applies P’s operation to its new copy, and stores the result in that copy’s response array!
5. P sees the toggle bits match and returns.
6. Q’s store_conditional fails to install that copy as the next version, so P has returned a result from a version that never took effect.

Solution:
◦Check the value of the toggle bit twice, re-reading *Q each time.
◦The second check can only pass if the response P reads belongs to a version that was actually installed.

Experimental Results

Wasted Parallelism!

Large Objects

Key Words
Large Objects –
◦Objects that are too large to be copied at once.
◦Represented by a set of blocks linked by pointers.

Logically Distinct –
◦An operation creates and returns a new object based on the old one. The old and new versions may share a lot of memory.

Memory Management
Per-process pool of memory
◦3 states: committed, allocated, and freed

Operations:
◦set_alloc moves a block from committed to allocated and returns its address
◦set_free moves a block to freed
◦set_prepare marks blocks in allocated as consistent
◦set_commit sets committed to the union of freed and committed
◦set_abort sets freed and allocated to the empty set
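The set operations can be modeled with three bitmask sets over a 32-block pool. The bitmask representation, pool_type name, and -1 sentinel are this sketch's inventions; set_prepare is omitted since it only marks consistency. The abort behavior shown (returning allocated blocks to committed so they are not leaked) is an assumption about intent beyond the slides' literal wording.

```c
/* Model of the per-process pool: each bit of a mask names one block. */
typedef struct {
    unsigned committed;   /* blocks available for allocation */
    unsigned allocated;   /* blocks handed out by the current operation */
    unsigned freed;       /* blocks released by the current operation */
} pool_type;

/* Move one block from committed to allocated; return its index, -1 if none. */
int set_alloc(pool_type *p) {
    int i;
    for (i = 0; i < 32; i++)
        if (p->committed & (1u << i)) {
            p->committed &= ~(1u << i);
            p->allocated |= 1u << i;
            return i;
        }
    return -1;
}

void set_free(pool_type *p, int i) { p->freed |= 1u << i; }

/* Commit: freed blocks become reusable; allocated blocks now belong to the
 * installed version, so they simply leave the pool. */
void set_commit(pool_type *p) {
    p->committed |= p->freed;
    p->allocated = p->freed = 0;
}

/* Abort: the new version was not installed, so allocated blocks return to
 * committed and the freed set is discarded (the old version survives). */
void set_abort(pool_type *p) {
    p->committed |= p->allocated;
    p->allocated = p->freed = 0;
}
```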

Performance Improvements
Skew Heap
◦Approximately-balanced binary tree.
◦Easier to maintain, thus better performance.
◦The update process doesn’t touch most of the tree.
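The reason updates touch so little of the tree is that every skew-heap operation reduces to one merge that recurses down a single path, swapping children as it goes. A minimal max-ordered version (illustrative, not the paper's code):

```c
#include <stdlib.h>

/* Minimal max-ordered skew heap; illustrative sketch. */
typedef struct node {
    int key;
    struct node *left, *right;
} node;

/* Merge two heaps: keep the larger root, merge the other heap into its
 * right child, then swap children (the swap is what keeps paths short). */
node *meld(node *a, node *b) {
    node *t;
    if (!a) return b;
    if (!b) return a;
    if (a->key < b->key) { t = a; a = b; b = t; }
    a->right = meld(a->right, b);
    t = a->left; a->left = a->right; a->right = t;
    return a;
}

/* Insert is just a merge with a one-node heap. */
node *skew_insert(node *h, int key) {
    node *n = malloc(sizeof *n);
    n->key = key;
    n->left = n->right = NULL;
    return meld(h, n);
}

/* Remove and return the maximum; the root pointer is updated through *h. */
int skew_delete_max(node **h) {
    node *root = *h;
    int key = root->key;
    *h = meld(root->left, root->right);
    free(root);
    return key;
}
```

Because only the nodes on that one merge path change, a large-object update can copy just those blocks and share the rest of the tree between the old and new versions.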

Experimental Results

Conclusion

Transforming Data
Transforming data from sequential to concurrent:
◦Let the programmer write sequentially, without thought to concurrent memory access.
◦Let some mechanism (e.g. a compiler) do the transformation to concurrent automatically.

Key Instructions:
◦Load_Linked
◦Store_Conditional

General Observation
Is it really worth all the extra work and wasted parallelism just to avoid starvation? Just to ensure fault tolerance?

"We propose extremely simple and efficient memory management techniques..." Is this true? It doesn’t seem simple to me!

Going Forward
Resulting Research?
Are we in the wrong paradigm?

Thank You