Tolerating Faults in Counting Networks Marc D. Riedel Jehoshua Bruck California Institute of...


Tolerating Faults in Counting Networks

http://www.paradise.caltech.edu

Marc D. Riedel, Jehoshua Bruck

California Institute of Technology

Parallel and Distributed Computing Group

Multiprocessor Coordination

• scheduling
• load balancing
• resource allocation

Shared Counting

Processes cooperate to assign successive values

[Figure: processes P0 to P4 obtaining the successive values 601 to 610.]

Multiprocessor Coordination: Centralized Solution

serialized access

[Figure: processes P0 to P4 taking values 601, 602, 603, 604, 608 from a single shared counter.]

Disadvantages:

• high contention
• low throughput
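The centralized solution can be sketched as one counter guarded by one lock. This is a minimal illustration, not code from the talk: the names `CentralCounter` and `run_centralized` are hypothetical, and Python threads stand in for the processes P0 to P4.

```python
import threading


class CentralCounter:
    """Hypothetical centralized counter: every request goes through one lock."""

    def __init__(self, start=601):
        self._next = start
        self._lock = threading.Lock()

    def get_and_increment(self):
        # All processes serialize here, which is the source of contention.
        with self._lock:
            value = self._next
            self._next += 1
            return value


def run_centralized(num_processes=5, requests_each=2, start=601):
    """Let several threads draw values and return everything that was assigned."""
    counter = CentralCounter(start)
    assigned = []
    assigned_lock = threading.Lock()

    def worker():
        for _ in range(requests_each):
            value = counter.get_and_increment()
            with assigned_lock:
                assigned.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_processes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(assigned)
```

Every value is handed out exactly once, but only one process at a time can make progress inside `get_and_increment`.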

Counting Networks

Data structure for multiprocessor coordination. Aspnes, Herlihy & Shavit (1991)

• concurrent data structure
• width $n$
• depth $O(\log^2 n)$
• concurrent access by up to $n$ processes
• each process accesses 1/n-th of the bits

Advantages:

• low contention
• high throughput

[Figure: processes P0 and P1 traversing a network of one-bit cells to obtain counter values.]

Balancer

Asynchronous token routing device

• inputs, outputs
• 1 bit of memory
• balanced token counts on the outputs
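A balancer can be modelled as a toggle bit. The sketch below assumes two inputs, two outputs, and that the first token leaves on the top output; the class name and the `out_counts` bookkeeping are illustrative only.

```python
class Balancer:
    """Toggle-bit model of a balancer (assumption: first token goes to the top output)."""

    TOP, BOTTOM = 0, 1

    def __init__(self):
        self.state = self.TOP      # the single bit of memory
        self.out_counts = [0, 0]   # tokens emitted on [top, bottom]

    def route(self, input_wire):
        """Route one token; the input wire it arrived on does not matter."""
        output_wire = self.state
        self.state ^= 1            # flip the bit for the next token
        self.out_counts[output_wire] += 1
        return output_wire
```

Whatever the arrival pattern, the two output counts never differ by more than one, which is the "balanced token counts" property.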

Shared Memory Architectures

Balancer: shared boolean variable.

    type balancer
    begin
        state: boolean;
        top: ptr to balancer;
        bottom: ptr to balancer;
    end

Processes shepherd tokens through the network.
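The record above maps naturally onto objects linked by pointers. The following sketch is an assumption-laden illustration: a width-4 network wired as the bitonic construction, a lock standing in for an atomic toggle, and output wires that hand out values `i, i+4, i+8, ...`; all names are hypothetical.

```python
import threading


class OutputWire:
    """End of the network: wire i hands out i, i + width, i + 2*width, ..."""

    def __init__(self, index, width):
        self.next_value = index
        self.width = width
        self.lock = threading.Lock()

    def take(self):
        with self.lock:
            value = self.next_value
            self.next_value += self.width
            return value


class Balancer:
    """state/top/bottom as in the record; the lock models an atomic toggle."""

    def __init__(self):
        self.state = False         # False: next token leaves on top
        self.top = None
        self.bottom = None
        self.lock = threading.Lock()

    def traverse(self):
        with self.lock:
            go_bottom = self.state
            self.state = not self.state
        return self.bottom if go_bottom else self.top


def build_bitonic4():
    """Width-4 network with layers (0,1),(2,3) / (0,3),(1,2) / (0,1),(2,3)."""
    outs = [OutputWire(i, 4) for i in range(4)]
    l3a, l3b = Balancer(), Balancer()
    l3a.top, l3a.bottom = outs[0], outs[1]
    l3b.top, l3b.bottom = outs[2], outs[3]
    l2a, l2b = Balancer(), Balancer()      # l2a on wires (0,3), l2b on (1,2)
    l2a.top, l2a.bottom = l3a, l3b
    l2b.top, l2b.bottom = l3a, l3b
    l1a, l1b = Balancer(), Balancer()      # l1a on wires (0,1), l1b on (2,3)
    l1a.top, l1a.bottom = l2a, l2b
    l1b.top, l1b.bottom = l2b, l2a
    return [l1a, l1a, l1b, l1b]            # entry balancer for input wires 0..3


def next_value(entry):
    """A process shepherds one token from an entry balancer to an output wire."""
    node = entry
    while isinstance(node, Balancer):
        node = node.traverse()
    return node.take()


def count_concurrently(num_threads=4, requests_each=25):
    entries = build_bitonic4()
    values, values_lock = [], threading.Lock()

    def worker(entry):
        for _ in range(requests_each):
            value = next_value(entry)
            with values_lock:
                values.append(value)

    threads = [threading.Thread(target=worker, args=(entries[i % 4],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(values)
```

Once all threads have finished, the values handed out are exactly 0, 1, 2, ... with no gaps or duplicates, although no single lock was shared by all processes.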

Counting Network

Data structure for multiprocessor coordination. Aspnes, Herlihy & Shavit (1991)

• $n$ inputs, $n$ outputs
• depth $O(\log^2 n)$
• output token counts form a step sequence
• isomorphic to Batcher’s bitonic sorting network

[Figure: tokens a to g routed through the balancers of a counting network.]
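The bitonic construction can be checked on token counts alone, because in a quiescent state a balancer's output counts depend only on how many tokens it received. The sketch below follows the recursive Bitonic/Merger construction of Aspnes, Herlihy & Shavit as I understand it; function names are illustrative, and the width is assumed to be a power of two.

```python
from itertools import product


def balancer(a, b):
    """Quiescent output counts (top, bottom) for a and b tokens on the inputs."""
    total = a + b
    return (total + 1) // 2, total // 2


def merger(x, x2):
    """Merge two step sequences of equal length into one step sequence."""
    if len(x) == 1:
        return list(balancer(x[0], x2[0]))
    z = merger(x[0::2], x2[1::2])    # even entries of x with odd entries of x2
    z2 = merger(x[1::2], x2[0::2])   # odd entries of x with even entries of x2
    out = []
    for a, b in zip(z, z2):          # final layer of balancers
        out.extend(balancer(a, b))
    return out


def bitonic(x):
    """Output token counts of the bitonic counting network for input counts x."""
    if len(x) == 1:
        return list(x)
    half = len(x) // 2
    return merger(bitonic(x[:half]), bitonic(x[half:]))


def step_sequence(total, width):
    """The unique step sequence of the given width holding `total` tokens."""
    return [(total + width - 1 - i) // width for i in range(width)]


def check_all(width, max_per_wire):
    """True if every input with at most max_per_wire tokens per wire counts correctly."""
    return all(bitonic(list(x)) == step_sequence(sum(x), width)
               for x in product(range(max_per_wire + 1), repeat=width))
```

For example, `bitonic([3, 1, 3, 0])` returns `[2, 2, 2, 1]`: seven tokens spread as evenly as possible, with the extra tokens on the upper wires.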

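Snapshot

Balancer (1 bit of memory): after $x$ and $y$ tokens have arrived on its two inputs, its outputs have emitted $\lceil (x+y)/2 \rceil$ (top) and $\lfloor (x+y)/2 \rfloor$ (bottom) tokens.

Execution trace: token counts on all wires. Inputs 3, 1, 3, 0; outputs 2, 2, 2, 1.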


Fault Tolerance

Dynamic faults in the data structure:

• Corrupted data
• Inaccessible data

No errors in control:

• No lost tokens
• No errors in network wiring

Fault Model

• fault: the balancer's state is inaccessible
• tokens bypass the balancer
• result: imbalance in token counts

With $x, y$ the tokens received on the two inputs prior to the fault and $x', y'$ those received after the fault, the outputs carry

$$\left\lceil \frac{x+y}{2} \right\rceil + x' \quad \text{and} \quad \left\lfloor \frac{x+y}{2} \right\rfloor + y'.$$
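The bypass fault can be added to a toggle-bit balancer model in a few lines. This is a sketch under assumptions: the first token goes to the top output, and after the fault a token simply leaves on the wire it arrived on. The class and method names are hypothetical.

```python
class FaultyBalancer:
    """Balancer whose state may become inaccessible; tokens then bypass it."""

    def __init__(self):
        self.state = 0             # 0: next token to top, 1: to bottom
        self.faulty = False
        self.out_counts = [0, 0]   # tokens emitted on [top, bottom]

    def fail(self):
        self.faulty = True         # the state bit can no longer be read or written

    def route(self, input_wire):
        if self.faulty:
            output_wire = input_wire    # bypass: top stays top, bottom stays bottom
        else:
            output_wire = self.state
            self.state ^= 1
        self.out_counts[output_wire] += 1
        return output_wire
```

With 2 and 1 tokens on the inputs before the fault and 0 and 3 after it, the outputs end up carrying 2 and 4 tokens, so the counts are no longer balanced.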

Fault Tolerance

Naïve approach: replicate every balancer.

Doesn’t work! A fault in one of the copies still leaves an imbalance in token counts.
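One way replication can fail is sketched below. The scenario is an assumption chosen for illustration and is not necessarily the one drawn on the slides: two copies of a balancer in series, a token delayed on the wire between them, and the second copy failing while that token is in flight.

```python
class Balancer:
    """Toggle-bit balancer with the bypass fault model (first token goes to top)."""

    def __init__(self):
        self.state = 0
        self.faulty = False

    def route(self, input_wire):
        if self.faulty:
            return input_wire      # tokens bypass a faulty balancer
        output_wire = self.state
        self.state ^= 1
        return output_wire


def replicated_with_delay():
    """Return final [top, bottom] counts for one hypothetical interleaving."""
    first, second = Balancer(), Balancer()
    counts = [0, 0]

    # Three tokens pass the first copy: they leave on top, bottom, top.
    t1 = first.route(0)
    t2 = first.route(0)
    t3 = first.route(0)

    # Tokens 1 and 3 reach the second copy; token 2 is delayed in between.
    counts[second.route(t1)] += 1
    counts[second.route(t3)] += 1

    # The second copy fails, then the delayed token bypasses it.
    second.faulty = True
    counts[second.route(t2)] += 1
    return counts
```

The run ends with one token on the top output and two on the bottom, which a correct balancer never produces for three tokens.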

Fault-Tolerant Balancer

• k+1 “pseudo-balancers” (L F F)
• two bits of memory each
• tolerates k faults

Pseudo-Balancer

Two bits of memory:

• state: up or down
• status: leader (L) or follower (F)

Fault Tolerance

1st Solution: Counting Network constructed with FT balancers.

• Counting Network: depth $O(\log^2 n)$
• FT Counting Network: depth $O(k \log^2 n)$, tolerates $k$ faults

Fault Tolerance

2nd Solution: Rectify errors with a correction network.

• Counting Network with remapped faulty balancers: inputs $x_1, x_2, \dots, x_n$, depth $O(\log^2 n)$
• Correction Network built from FT balancers: outputs $y_1, y_2, \dots, y_n$, depth $O(k^2 \log n)$

(better provided that $k < \log n$)

Remapping Faulty Balancers

• A fault leaves a balancer inaccessible.
• Redirect pointers to a spare balancer, which starts in a random initial state.

Fault Model

Remapped balancer: the fault appears as a spurious state transition, which causes an imbalance in token counts.

With $x$ and $y$ tokens received on the two inputs, the outputs carry

$$\left\lceil \frac{x+y}{2} \right\rceil + f \quad \text{and} \quad \left\lfloor \frac{x+y}{2} \right\rfloor - f, \qquad \text{for some } f \in \{-1, 0, 1\}.$$
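The claim that a remapped balancer is off by at most one token can be checked by brute force. The sketch assumes the original balancer handles some tokens before the fault, and a spare with an arbitrary initial state handles the rest; the function names are illustrative.

```python
def balanced(total):
    """(top, bottom) counts of a fault-free balancer that has seen `total` tokens."""
    return (total + 1) // 2, total // 2


def remapped_outputs(before, after, spare_state):
    """Counts when `before` tokens use the original and `after` use the spare.

    spare_state 0 means the spare sends its first token to the top output,
    1 means it sends its first token to the bottom output.
    """
    top_before, bottom_before = balanced(before)
    more, fewer = balanced(after)
    if spare_state == 0:
        top_after, bottom_after = more, fewer
    else:
        top_after, bottom_after = fewer, more
    return top_before + top_after, bottom_before + bottom_after


def imbalance(before, after, spare_state):
    """The offset f of the top output relative to a fault-free balancer."""
    top, bottom = remapped_outputs(before, after, spare_state)
    ideal_top, ideal_bottom = balanced(before + after)
    assert bottom - ideal_bottom == -(top - ideal_top)   # tokens are conserved
    return top - ideal_top
```

For instance, one token before the fault and one after it, with the spare starting at the top, puts both tokens on the top output, so the offset is +1.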

Error Bound

Error bound for the output sequence of a balancing network with remapped balancers:

[Figure: balancing network with $k$ faults, inputs $x_1, x_2, \dots, x_n$, outputs $y_1, y_2, \dots, y_n$.]

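Distance Measure

Definition: The distance between two sequences $\mathbf{y} = y_1, y_2, \dots, y_n$ and $\mathbf{y}' = y'_1, y'_2, \dots, y'_n$ is:

$$D(\mathbf{y}, \mathbf{y}') = \frac{1}{2} \sum_{i=1}^{n} \left| y_i - y'_i \right|$$

$D(\mathbf{y}, \mathbf{y}')$ gives the number of “misplaced tokens”.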
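The distance measure is a one-liner in code. The sketch assumes both sequences hold the same total number of tokens, so the sum of absolute differences is even; the function name is illustrative.

```python
def distance(y, y_prime):
    """Half the sum of per-wire differences: the number of misplaced tokens."""
    if len(y) != len(y_prime):
        raise ValueError("sequences must have the same length")
    if sum(y) != sum(y_prime):
        raise ValueError("sequences must hold the same number of tokens")
    return sum(abs(a - b) for a, b in zip(y, y_prime)) // 2
```

Moving a single token from one wire to another changes two entries by one each, which this measure counts as distance 1.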

Error Bound

Two identical balancing networks, given same inputs $x_1, x_2, \dots, x_n$: one with no faults (outputs $\mathbf{y}$) and one with $k$ faults (outputs $\mathbf{y}'$). Then

$$D(\mathbf{y}, \mathbf{y}') \le k.$$

Error Bound

Execution without faults: input token counts 3, 1, 3, 0; output token counts 2, 2, 2, 1.

Execution with a fault: same inputs; output token counts 3, 2, 1, 1.

Distance: $\tfrac{1}{2}(1 + 0 + 1 + 0) = 1$
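The bound can be exercised on a small network by simulating token counts. The sketch assumes a width-4 bitonic network, and models a remapped balancer as an original that handles the first `before` tokens followed by a spare with an arbitrary initial state. The layer list, function names and the fault encoding are all illustrative.

```python
from itertools import product

# Layers of a width-4 bitonic counting network as (top wire, bottom wire) pairs.
BITONIC4 = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]


def split(total, first_to_top=True):
    """Output counts (top, bottom) of a balancer that receives `total` tokens."""
    more, fewer = (total + 1) // 2, total // 2
    return (more, fewer) if first_to_top else (fewer, more)


def run(inputs, faults=None):
    """Quiescent output counts; faults maps (layer, index) -> (before, spare_to_top)."""
    faults = faults or {}
    counts = list(inputs)
    for li, layer in enumerate(BITONIC4):
        for bi, (top, bottom) in enumerate(layer):
            total = counts[top] + counts[bottom]
            if (li, bi) in faults:
                before, spare_to_top = faults[(li, bi)]
                before = min(before, total)
                t1, b1 = split(before)                        # original balancer
                t2, b2 = split(total - before, spare_to_top)  # spare balancer
                counts[top], counts[bottom] = t1 + t2, b1 + b2
            else:
                counts[top], counts[bottom] = split(total)
    return counts


def distance(y, y_prime):
    return sum(abs(a - b) for a, b in zip(y, y_prime)) // 2


def worst_single_fault(max_per_wire=3):
    """Largest distance over all small inputs and all single remapped balancers."""
    worst = 0
    for inputs in product(range(max_per_wire + 1), repeat=4):
        good = run(inputs)
        for li, layer in enumerate(BITONIC4):
            for bi in range(len(layer)):
                for before in range(sum(inputs) + 1):
                    for spare_to_top in (True, False):
                        bad = run(inputs, {(li, bi): (before, spare_to_top)})
                        worst = max(worst, distance(good, bad))
    return worst
```

With inputs 3, 1, 3, 0 and a fault at the second balancer of the middle layer (`run([3, 1, 3, 0], {(1, 1): (1, True)})`), the simulation gives outputs 3, 2, 1, 1 against 2, 2, 2, 1 without faults, at distance 1.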

Correction Network

Strategy: Construct a block which reduces error by one.

[Figure: CORRECT[n] takes a step sequence with $k$ errors on $y_1, y_2, \dots, y_n$ and outputs a step sequence with $k-1$ errors.]

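Correction Network

To reduce error by one: balance smallest and largest entries.

[Figure: inside CORRECT[n], BUTTERFLY[n] maps the step sequence with $k$ errors $y_1, y_2, \dots, y_n$ to $z_1, z_2, \dots, z_n$, with the largest value and the smallest value on the outermost wires; balancing those two yields a step sequence with $k-1$ errors.]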

Butterfly Network

Network which separates out smallest and largest entries (token counts per wire, top to bottom):

    input:           0   1  10   1   0   1  34   0
    after stage 1:   1   0   6   5   1   0  17  17
    after stage 2:   4   3   3   2   9   9   9   8
    after stage 3:   7   6   6   5   6   6   6   5

The largest value ends up on the top wire and the smallest value on the bottom wire.

Balance smallest and largest entries:

    output:          6   6   6   5   6   6   6   6

error reduced
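The numbers in this example are consistent with a butterfly whose stage s balances wire i against wire i + 2^(s-1), followed by one balancer between the top and bottom wires. The sketch below encodes that reading as an assumption; the function names are illustrative, and the width must be a power of two.

```python
def balance(a, b):
    """Quiescent output counts (top, bottom) of a balancer fed a and b tokens."""
    total = a + b
    return (total + 1) // 2, total // 2


def butterfly(counts):
    """Apply log2(n) balancer stages with strides 1, 2, 4, ... to the wire counts."""
    n = len(counts)
    out = list(counts)
    stride = 1
    while stride < n:
        for i in range(n):
            if (i & stride) == 0:          # i is the upper wire of its pair
                j = i + stride
                out[i], out[j] = balance(out[i], out[j])
        stride *= 2
    return out


def correct(counts):
    """One CORRECT block: butterfly, then balance the top and bottom wires."""
    z = butterfly(counts)
    z[0], z[-1] = balance(z[0], z[-1])
    return z
```

Running `butterfly([0, 1, 10, 1, 0, 1, 34, 0])` gives `[7, 6, 6, 5, 6, 6, 6, 5]`, and `correct` on the same input gives `[6, 6, 6, 5, 6, 6, 6, 6]`, matching the example above.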

Correction Network

Strategy: to correct $k$ faults, append $k$ copies.

[Figure: CORRECT[n]#1 through CORRECT[n]#k applied in sequence to $y_1, y_2, \dots, y_n$; labels: step sequence with $k$ errors, smooth sequence, step sequence.]

• depth of each copy: $(k+1)(\log n + 1)$
• total depth: $k(k+1)(\log n + 1)$
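One way to read the depth expression, offered as a plausible derivation rather than the authors' own: assume each CORRECT[n] block consists of a butterfly of $\log n$ stages plus one final balancer layer, that every balancer is a fault-tolerant balancer made of $k+1$ pseudo-balancers in series, and that $k$ such blocks are appended.

```latex
\begin{align*}
\text{layers in one CORRECT}[n] &= \underbrace{\log n}_{\text{BUTTERFLY}[n]}
                                  + \underbrace{1}_{\text{final balancer}} \\
\text{depth of one CORRECT}[n]  &= (k+1)(\log n + 1)
                                  && \text{($k+1$ pseudo-balancers per layer)} \\
\text{depth of $k$ copies}      &= k\,(k+1)(\log n + 1) \;=\; O(k^2 \log n) \\[4pt]
O(\log^2 n) + O(k^2 \log n) \;&\text{vs.}\; O(k \log^2 n):
  \qquad k^2 \log n < k \log^2 n \iff k < \log n
\end{align*}
```

The last line is the comparison between appending a correction network and building the whole counting network from fault-tolerant balancers.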

Fault Tolerance

Correction network, constructed with FT balancers, is appended to counting network.

• Counting Network with remapped faulty balancers: inputs $x_1, x_2, \dots, x_n$, depth $O(\log^2 n)$
• Correction Network with FT balancers: outputs $y_1, y_2, \dots, y_n$, depth $O(k^2 \log n)$

Conclusions

• Upper bound on error resulting from faults.
• Practical method for tolerating faults with $O(k^2 \log n)$ extra stages.

Future Work

• Extend concepts to Diffracting Trees (Shavit et al., 1996) and other constructs.
• General framework for fault-tolerant concurrent data structures.

Leader

• Accepts tokens on either wire (incoming tokens colored green).
• Colors outgoing tokens red.
• Two bits of memory.

Follower

• Accepts red tokens in order.
• Becomes a leader if it receives a green token.
• Two bits of memory.

Fault-Tolerant Balancer

k+1 pseudo-balancers (L F F)

[Animation: tokens pass through the chain L F F; when the leader becomes faulty (? F F), a follower receives a green token and becomes the new leader (L).]
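The leader and follower rules can be put together in a small sequential simulation. This is a sketch under several assumptions that the slides do not spell out: tokens traverse the whole chain one at a time (so the "in order" rule is never stressed), a faulty pseudo-balancer is bypassed without changing the token's wire or color, and every token enters the chain green. All class and method names are hypothetical.

```python
GREEN, RED = "green", "red"


class PseudoBalancer:
    """Two bits of memory: state (up/down) and status (leader/follower)."""

    def __init__(self, leader):
        self.state = 0          # 0: up (top output), 1: down (bottom output)
        self.leader = leader
        self.faulty = False

    def route(self, wire, color):
        if self.faulty:
            return wire, color             # bypassed: wire and color unchanged
        if color == GREEN:
            self.leader = True             # a follower seeing green takes over
            wire = self.state              # the leader decides the output wire
            color = RED                    # and colors the outgoing token red
        # A follower keeps the leader's choice and only tracks the state bit.
        self.state ^= 1
        return wire, color


class FTBalancer:
    """Chain of k+1 pseudo-balancers: one leader followed by k followers."""

    def __init__(self, k):
        self.chain = [PseudoBalancer(leader=(i == 0)) for i in range(k + 1)]
        self.out_counts = [0, 0]

    def route(self, input_wire):
        wire, color = input_wire, GREEN    # assumption: tokens arrive green
        for pseudo in self.chain:
            wire, color = pseudo.route(wire, color)
        self.out_counts[wire] += 1
        return wire
```

In this model a chain built with k = 2 keeps its output counts balanced after the leader and then the first follower fail, because the last follower has tracked the state all along and takes over as leader when the first green token reaches it.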