1April 2008
Brahms
Byzantine-Resilient Random Membership SamplingBortnikov, Gurevich, Keidar, Kliot, and Shraer
April 2008 2
Edward (Eddie) Bortnikov Maxim (Max) Gurevich Idit Keidar
Gabriel (Gabi) Kliot Alexander (Alex) Shraer
April 2008 3
Why Random Node Sampling
Gossip partners Random choices make gossip protocols work
Unstructured overlay networks E.g., among super-peers Random links provide robustness, expansion
Gathering statistics Probe random nodes
Choosing cache locations
April 2008 4
The Setting
Many nodes – n 10,000s, 100,000s, 1,000,000s, …
Come and go Churn
Every joining node knows some others Connectivity
Full network Like the Internet
Byzantine failures
April 2008 5
Byzantine Fault Tolerance (BFT)
Faulty nodes (portion f) Arbitrary behavior: bugs, intrusions, selfishness Choose f ids arbitrarily
No CA, but no panacea for Cybil attacks
May want to bias samples Isolate nodes, DoS nodes Promote themselves, bias statistics
April 2008 6
Previous Work
Benign gossip membership Small (logarithmic) views Robust to churn and benign failures Empirical study [Lpbcast,Scamp,Cyclon,PSS] Analytical study [Allavena et al.] Never proven uniform samples Spatial correlation among neighbors’ views [PSS]
Byzantine-resilient gossip Full views [MMR,MS,Fireflies,Drum,BAR] Small views, some resilience [SPSS] We are not aware of any analytical work
April 2008 7
Our Contributions
1. Gossip-based BFT membership Linear portion f of Byzantine failures O(n1/3)-size partial views Correct nodes remain connected Mathematically analyzed, validated in
simulations
2. Random sampling Novel memory-efficient approach Converges to proven independent uniform
samples
The view is not all bad
Better than benign gossip
8April 2008
Brahms
1. Sampling - local component
2. Gossip - distributed componentsample
Sampler
Gossip
view
April 2008 9
Sampler Building Block
Input: data stream, one element at a time Bias: some values appear more than others Used with stream of gossiped ids
Output: uniform random sample of unique elements seen thus far Independent of other Samplers One element at a time (converging)
next
sample
Sampler
April 2008 10
Sampler Implementation
Memory: stores one element at a time Use random hash function h
From min-wise independent family [Broder et al.] For each set X, and all , Xx
||
1))()}(Pr(min{
XxhXh
Sampler
next
sample
init
Keep id with smallest hash so far
Choose random hash function
April 2008 11
Component S: Sampling and Validation
SamplerSampler
sample
Sampler Sampler
nextinit
using pings
id streamfrom gossip
ValidatorValidator Validator Validator
S
April 2008 12
Gossip Process
Provides the stream of ids for S Needs to ensure connectivity Use a bag of tricks to overcome attacks
April 2008 13
Gossip-Based Membership Primer
Small (sub-linear) local view V V constantly changes - essential due to churn
Typically, evolves in (unsynchronized) rounds Push: send my id to some node in V
Reinforce underrepresented nodes Pull: retrieve view from some node in V
Spread knowledge within the network [Allavena et al. ‘05]: both are essential
Low probability for partitions and star topologies
April 2008 14
Brahms Gossip Rounds
Each round: Send pushes, pulls to random nodes from V Wait to receive pulls, pushes Update S with all received ids (Sometimes) re-compute V
Tricky! Beware of adversary attacks
April 2008 15
Problem 1: Push Drowning
Push Alice
Push Bob
Push Dana
Push Caro
lP
ush
Ed
Push Mallory
Push M
&M
Push Malfoy
A
D
B
EM
M
M
M
M
April 2008 16
Trick 1: Rate-Limit Pushes
Use limited messages to bound faulty pushes system-wide
E.g., computational puzzles/virtual currency Faulty nodes can send portion p of them Views won’t be all bad
April 2008 17
Problem 2: Quick Isolation
Push Alice
Push Bob
Push Dana
Push Caro
l
Pu
sh E
d Push MalloryPush M
&M
Pu
sh M
alfo
y
A
C
E
M
Ha! She’s out! Now let’s move on
to the next guy!
D
April 2008 18
Trick 2: Detection & Recovery
Do not re-compute V in rounds when too many pushes are received
Slows down isolation; does not prevent it
Push Mallory
Push M&M
Pu
sh M
alfo
y
Hey! I’m swamped!I better ignore all of ‘em pushes…
Push Bob
April 2008 19
Trick 3: Balance Pulls & Pushes
Control contribution of push - α|V| ids versus contribution of pull - β|V| ids Parameters α, β
Pull-only eventually all faulty ids Pull from faulty nodes – all faulty ids,
from correct nodes – some faulty ids Push-only quick isolation of attacked node Push ensures: system-wide not all bad ids Pull slows down (does not prevent) isolation
April 2008 20
Trick 4: History Samples
Attacker influences both push and pull
Feedback γ|V| random ids from S Parameters α + β + γ = 1
Attacker loses control - samples are eventually perfectly uniform
Yoo-hoo, is there any good process out there?
April 2008 21
View and Sample Maintenance
Pushed ids
Pulled ids
S |V| |V| |V|
View V Sample
April 2008 22
Key Property
Samples take time to help Assume attack starts when samples are empty
With appropriate parameters E.g.,
Time to isolation > time to convergence
Prove lower boundusing tricks 1,2,3
(not using samples yet)
Prove upper bound until some good sample
persists forever
3| | | | ( )V S n
Self-healing from partitions
April 2008 23
History Samples: Rationale
Judicious use essential Bootstrap, avoid slow convergence Deal with churn
With a little bit of history samples (10%) we can cope with any adversary Amplification!
24April 2008
Analysis
1. Sampling - mathematical analysis
2. Connectivity - analysis and simulation
3. Full system simulation
April 2008 25
Connectivity Sampling
Theorem: If overlay remains connected indefinitely, samples are eventually uniform
April 2008 26
Sampling Connectivity Ever After
Perfect sample of a sampler with hash h: the id with the lowest h(id) system-wide
If correct, sticks once the sampler sees it Correct perfect sample self-healing from
partitions ever after
We analyze PSP(t) – probability of perfect sample at time t
April 2008 27
Convergence to 1st Perfect Sample
n = 1000
f = 0.2
40% unique ids in stream
April 2008 28
Scalability
Analysis says:
For scalability, want small and constant convergence time independent of system size, e.g., when
2| | | |
( ) (1 )V S
nPSP t e
3| | | | ( )V S n
2| | (log ),| | ( )
log
nV n S
n
April 2008 29
Connectivity Analysis 1: Balanced Attacks Attack all nodes the same Maximizes faulty ids in views system-wide
in any single round If repeated, system converges to fixed point
ratio of faulty ids in views, which is < 1 if γ=0 (no history) and p < 1/3 or History samples are used, any p
There are always good ids in views!
April 2008 30
Fixed Point Analysis: Push
i
Local view node 1
Local view node i
Time t:
push
1 Time t+1:
push from faulty node
lost push
x(t) – portion of faulty nodes in views at round t;portion of faulty pushes to correct nodes :
p / ( p + ( 1 − p )( 1 − x(t) ) )
April 2008 31
Fixed Point Analysis: Pull
i
Local view node 1
Local view node i
Time t:
pull from i: faulty with probability x(t)
Time t+1:
E[x(t+1)] = p / (p + (1 − p)(1 − x(t))) + ( x(t) + (1-x(t))x(t) ) + γf
pull from faulty
April 2008 32
Faulty Ids in Fixed Point
With a few history samples, any
portion of bad nodes can be toleratedPerfectly validated
fixed pointsand convergence
Assumed perfect in analysis, real history
in simulations
April 2008 33
Convergence to Fixed Point
n = 1000
p = 0.2
α=β=0.5
γ=0
April 2008 34
Connectivity Analysis 2:Targeted Attack – Roadmap Step 1: analysis without history samples
Isolation in logarithmic time … but not too fast, thanks to tricks 1,2,3
Step 2: analysis of history sample convergence Time-to-perfect-sample < Time-to-Isolation
Step 3: putting it all together Empirical evaluation No isolation happens
April 2008 35
Targeted Attack – Step 1
Q: How fast (lower bound) can an attacker isolate one node from the rest?
Worst-case assumptions No use of history samples ( = 0) Unrealistically strong adversary
Observes the exact number of correct pushes and complements it to α|V|
Attacked node not represented initially Balanced attack on the rest of the system
April 2008 36
Isolation w/out History Samples
n = 1000
p = 0.2
α=β=0.5
γ=0
Depend on α,β,p
Isolation time for |V|=60
2x2 i, ji
E(indegree(t 1)) E(indegree(t))A , A 1
E(outdegree(t 1)) E(outdegree(t)
April 2008 37
Step 2: Sample Convergence
Perfect sample in 2-3 rounds
n = 1000
p = 0.2
α=β=0.5, γ=0
40% unique ids
Empirically verified
April 2008 38
Step 3: Putting It All TogetherNo Isolation with History Samples
Works well despite small PSP
n = 1000
p = 0.2
α=β=0.45
γ=0.1
April 2008 39
p = 0.2
α=β=0.45
γ=0.1
Sample Convergence (Balanced)
32|||| NSV
32|||| nSV
Convergence twice as fast with 33|||| nSV
40April 2008
Summary
O(n1/3)-size views Resist Byzantine failures of linear portion Converge to proven uniform samples Precise analysis of impact of failures