LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to...

Post on 24-May-2020

4 views 0 download

Transcript of LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to...

LECTURE 6 Consistency and Replication

Replication and why is it hard

¨  Data needs to be replicated for reliability or to improve performance ¤  If a replica crashes, possible to continue to have access. ¤  Performance improves by accessing nearer data stores ¤  Aids scaling

¨  However, a key challenge is ensuring the consistency of data across replicas. ¤ Hard to realize in large distributed systems. `

Handling Failures

¨  Logical clocks enable all replicas to execute updates in same order ¤ But, cannot handle failure of any replica!

¨  To deal with crash failures we saw: ¤ Chandy-Lamport snapshot for coordinated checkpoint ¤ Log sources of non-determinism between checkpoints

3

Types of failures

¨  Crash failures ¤ Can resume with previously saved state

¨  Fail stop ¤ All state is lost upon failure ¤ Need to replicate state

4

Primary Backup Replication 5

Client Primary Backup

Backup

Backup

Primary Backup Replication

¨  How to handle primary failure? ¤ Promote one of the backups as the primary

¨  How to handle backup failure? ¤ Add another machine as a backup

6

Primary Backup Sync

¨  When should primary sync up with backups?

¨  What should be transferred when syncing?

7

Example: MapReduce Master

RegisterServer() { while (1) {

receive msg and parse addr

pick task to assign

mark task as assigned

respond with task assignment }

}

8

Primary & Backups in sync

When to sync

again?

Takeaways

¨  Cannot tolerate primary failure after update to its state is externally visible

¨  Corollary: Okay for primary to be out of sync with backup until change is externally visible ¤ External consistency

¨  Implications for when primary should sync with backups?

9

Primary-Backup Sync 10

C1

C2

P

B

Reads vs. Updates

¨  For operations that do not update state, primary need not consult backups ¤ When is this true?

¨  If backup is externally consistent with primary ¤ à If backup takes over as primary, it will generate

identical response as primary may have

11

What to transfer in sync?

¨  Snapshot of primary’s state ¤ Slow ¤ When is this necessary? ¤ Necessary when bootstrapping new backup

¨  Every operation ¤ Why is this okay? ¤ Leverage determinism of state machine

12

Service Development

¨  Getting coordination right between primary and backups is tricky ¤ Easy to mess up

¨  Must make replication transparent to developer

13

RSM with Logical Clocks 14

Replicated State Machine

Application

Updates

Ordered Updates

Replicated State Machine

Application

Updates

Server1 Server2

Transparent Primary Backup

¨  Application relies on library to keep primary and backups in sync ¤ Receive message from client ¤ Sync with backups before sending response to client

¨  Will this solution work? ¨  Does not capture non-determinism in execution

15

Virtual Machines

Operating System

Hardware

Applications

CPU Disk RAM

Process File system Virtual memory

Virtual Machine

Virtual Machine Monitor

16

Primary VM

Backup VM

Logging channel

Shared Disk

�!�,)� �� �*!� �� �&%��,)�+!&%�

@8AG4G<BA B9 94H?GGB?8E4AG .%F 9BE G;8 ( (�*!+� C?4G9BE@� 'HE 4CCEB46; <F F<@<?4E 5HG J8 ;4I8 @478 FB@89HA74@8AG4? 6;4A:8F 9BE C8E9BE@4A68 E84FBAF 4A7 <AI8FG<:4G87 4 AH@58E B9 78F<:A 4?G8EA4G<I8F� !A 477<G<BA J8 ;4I8;47 GB 78F<:A 4A7 <@C?8@8AG @4AL 477<G<BA4? 6B@CBA8AGF<A G;8 FLFG8@ 4A7 784? J<G; 4 AH@58E B9 CE46G<64? <FFH8FGB 5H<?7 4 6B@C?8G8 FLFG8@ G;4G <F 8�6<8AG 4A7 HF45?8 5L6HFGB@8EF EHAA<A: 8AG8ECE<F8 4CC?<64G<BAF� +<@<?4E GB @BFGBG;8E CE46G<64? FLFG8@F 7<F6HFF87 J8 BA?L 4GG8@CG GB 784?J<G; 94<?FGBC 94<?HE8F 1��3 J;<6; 4E8 F8EI8E 94<?HE8F G;4G 64A58 78G86G87 589BE8 G;8 94<?<A: F8EI8E 64HF8F 4A <A6BEE86G 8KG8EA4??L I<F<5?8 46G<BA�,;8 E8FG B9 G;8 C4C8E <F BE:4A<M87 4F 9B??BJF� �<EFG J8

78F6E<58 BHE 54F<6 78F<:A 4A7 78G4<? BHE 9HA74@8AG4? CEBGB6B?F G;4G 8AFHE8 G;4G AB 74G4 <F ?BFG <9 4 546>HC .% G4>8FBI8E 49G8E 4 CE<@4EL .% 94<?F� ,;8A J8 78F6E<58 <A 78G4<? @4AL B9 G;8 CE46G<64? <FFH8F G;4G @HFG 58 477E8FF87 GB5H<?7 4 EB5HFG 6B@C?8G8 4A7 4HGB@4G87 FLFG8@� /8 4?FB78F6E<58 F8I8E4? 78F<:A 6;B<68F G;4G 4E<F8 9BE <@C?8@8AG<A:94H?GGB?8E4AG .%F 4A7 7<F6HFF G;8 GE478B�F <A G;8F8 6;B<68F�&8KG J8 :<I8 C8E9BE@4A68 E8FH?GF 9BE BHE <@C?8@8AG4G<BA9BE FB@8 58A6;@4E>F 4A7 FB@8 E84? 8AG8ECE<F8 4CC?<64G<BAF��<A4??L J8 78F6E<58 E8?4G87 JBE> 4A7 6BA6?H78�

2. BASIC FT DESIGN�<:HE8 � F;BJF G;8 54F<6 F8GHC B9 BHE FLFG8@ 9BE 94H?G

GB?8E4AG .%F� �BE 4 :<I8A .% 9BE J;<6; J8 78F<E8 GB CEBI<7894H?G GB?8E4A68 �G;8 -.'*�.4 .%� J8 EHA 4 �!(1- .% BA4 7<�8E8AG C;LF<64? F8EI8E G;4G <F >8CG <A FLA6 4A7 8K86HG8F<78AG<64??L GB G;8 CE<@4EL I<EGH4? @46;<A8 G;BH:; J<G; 4F@4?? G<@8 ?4:� /8 F4L G;4G G;8 GJB .%F 4E8 <A 2'.01�) ),!(�/0#-� ,;8 I<EGH4? 7<F>F 9BE G;8 .%F 4E8 BA F;4E87 FGBE4:8�FH6; 4F 4 �<5E8 �;4AA8? BE <+�+! 7<F> 4EE4L� 4A7 G;8E89BE8 4668FF<5?8 GB G;8 CE<@4EL 4A7 546>HC .% 9BE <ACHG 4A7BHGCHG� �/8 J<?? 7<F6HFF 4 78F<:A <A J;<6; G;8 CE<@4EL 4A7546>HC .% ;4I8 F8C4E4G8 ABAF;4E87 I<EGH4? 7<F>F <A +86G<BA ����� 'A?L G;8 CE<@4EL .% 47I8EG<F8F <GF CE8F8A68 BAG;8 A8GJBE> FB 4?? A8GJBE> <ACHGF 6B@8 GB G;8 CE<@4EL .%�+<@<?4E?L 4?? BG;8E <ACHGF �FH6; 4F >8L5B4E7 4A7 @BHF8� :BBA?L GB G;8 CE<@4EL .%��?? <ACHG G;4G G;8 CE<@4EL .% E868<I8F <F F8AG GB G;8

546>HC .% I<4 4 A8GJBE> 6BAA86G<BA >ABJA 4F G;8 ),%%'+%!&�++#)� �BE F8EI8E JBE>?B47F G;8 7B@<A4AG <ACHG GE4�6<F A8GJBE> 4A7 7<F>� �77<G<BA4? <A9BE@4G<BA 4F 7<F6HFF8758?BJ <A +86G<BA ��� <F GE4AF@<GG87 4F A868FF4EL GB 8AFHE8G;4G G;8 546>HC .% 8K86HG8F ABA78G8E@<A<FG<6 BC8E4G<BAF<A G;8 F4@8 J4L 4F G;8 CE<@4EL .%� ,;8 E8FH?G <F G;4G G;8546>HC .% 4?J4LF 8K86HG8F <78AG<64??L GB G;8 CE<@4EL .%� BJ8I8E G;8 BHGCHGF B9 G;8 546>HC .% 4E8 7EBCC87 5LG;8 ;LC8EI<FBE FB BA?L G;8 CE<@4EL CEB7H68F 46GH4? BHGCHGFG;4G 4E8 E8GHEA87 GB 6?<8AGF� �F 78F6E<587 <A +86G<BA ��� G;8CE<@4EL 4A7 546>HC .% 9B??BJ 4 FC86<�6 CEBGB6B? <A6?H7<A:8KC?<6<G 46>ABJ?87:@8AGF 5L G;8 546>HC .% <A BE78E GB8AFHE8 G;4G AB 74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F�,B 78G86G <9 4 CE<@4EL BE 546>HC .% ;4F 94<?87 BHE FLF

G8@ HF8F 4 6B@5<A4G<BA B9 ;84EG584G<A: 58GJ88A G;8 E8?8I4AGF8EI8EF 4A7 @BA<GBE<A: B9 G;8 GE4�6 BA G;8 ?B::<A: 6;4AA8?�!A 477<G<BA J8 @HFG 8AFHE8 G;4G BA?L BA8 B9 G;8 CE<@4ELBE 546>HC .% G4>8F BI8E 8K86HG<BA 8I8A <9 G;8E8 <F 4 FC?<G5E4<A F<GH4G<BA J;8E8 G;8 CE<@4EL 4A7 546>HC F8EI8EF ;4I8?BFG 6B@@HA<64G<BA J<G; 846; BG;8E�!A G;8 9B??BJ<A: F86G<BAF J8 CEBI<78 @BE8 78G4<?F BA F8I

8E4? <@CBEG4AG 4E84F� !A +86G<BA ��� J8 :<I8 FB@8 78G4<?FBA G;8 78G8E@<A<FG<6 E8C?4L G86;AB?B:L G;4G 8AFHE8F G;4G CE<@4EL 4A7 546>HC .%F 4E8 >8CG <A FLA6 I<4 G;8 <A9BE@4G<BAF8AG BI8E G;8 ?B::<A: 6;4AA8?� !A +86G<BA ��� J8 78F6E<584 9HA74@8AG4? EH?8 B9 BHE �, CEBGB6B? G;4G 8AFHE8F G;4G AB74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F� !A +86G<BA ��� J8 78F6E<58BHE @8G;B7F 9BE 78G86G<A: 4A7 E8FCBA7<A: GB 4 94<?HE8 <A 46BEE86G 94F;<BA�

2.1 Deterministic Replay Implementation�F J8 ;4I8 @8AG<BA87 E8C?<64G<A: F8EI8E �BE .%� 8K8

6HG<BA 64A 58 @B78?87 4F G;8 E8C?<64G<BA B9 4 78G8E@<A<FG<6 FG4G8 @46;<A8� !9 GJB 78G8E@<A<FG<6 FG4G8 @46;<A8F 4E8FG4EG87 <A G;8 F4@8 <A<G<4? FG4G8 4A7 CEBI<787 G;8 8K46G F4@8<ACHGF <A G;8 F4@8 BE78E G;8A G;8L J<?? :B G;EBH:; G;8 F4@8F8DH8A68F B9 FG4G8F 4A7 CEB7H68 G;8 F4@8 BHGCHGF� � I<EGH4? @46;<A8 ;4F 4 5EB47 F8G B9 <ACHGF <A6?H7<A: <A6B@<A:A8GJBE> C46>8GF 7<F> E847F 4A7 <ACHG 9EB@ G;8 >8L5B4E74A7 @BHF8� &BA78G8E@<A<FG<6 8I8AGF �FH6; 4F I<EGH4? <AG8EEHCGF� 4A7 ABA78G8E@<A<FG<6 BC8E4G<BAF �FH6; 4F E847<A:G;8 6?B6> 6L6?8 6BHAG8E B9 G;8 CEB68FFBE� 4?FB 4�86G G;8 .%�FFG4G8� ,;<F CE8F8AGF G;E88 6;4??8A:8F 9BE E8C?<64G<A: 8K86HG<BA B9 4AL .% EHAA<A: 4AL BC8E4G<A: FLFG8@ 4A7 JBE>?B47���� 6BEE86G?L 64CGHE<A: 4?? G;8 <ACHG 4A7 ABA78G8E@<A<F@A868FF4EL GB 8AFHE8 78G8E@<A<FG<6 8K86HG<BA B9 4 546>HC I<EGH4? @46;<A8 ��� 6BEE86G?L 4CC?L<A: G;8 <ACHGF 4A7 ABA78G8E@<A<F@ GB G;8 546>HC I<EGH4? @46;<A8 4A7 ��� 7B<A:FB <A 4 @4AA8E G;4G 7B8FA�G 78:E478 C8E9BE@4A68� !A 477<G<BA @4AL 6B@C?8K BC8E4G<BAF <A K�� @<6EBCEB68FFBEF ;4I8HA78�A87 ;8A68 ABA78G8E@<A<FG<6 F<78 8�86GF� �4CGHE<A:G;8F8 HA78�A87 F<78 8�86GF 4A7 E8C?4L<A: G;8@ GB CEB7H68G;8 F4@8 FG4G8 CE8F8AGF 4A 477<G<BA4? 6;4??8A:8�.%J4E8 78G8E@<A<FG<6 E8C?4L 1��3 CEBI<78F 8K46G?L G;<F

9HA6G<BA4?<GL 9BE K�� I<EGH4? @46;<A8F BA G;8 .%J4E8 I+C;8E8C?4G9BE@� �8G8E@<A<FG<6 E8C?4L E86BE7F G;8 <ACHGF B9 4 .%4A7 4?? CBFF<5?8 ABA78G8E@<A<F@ 4FFB6<4G87 J<G; G;8 .%8K86HG<BA <A 4 FGE84@ B9 ?B: 8AGE<8F JE<GG8A GB 4 ?B: �?8� ,;8.% 8K86HG<BA @4L 58 8K46G?L E8C?4L87 ?4G8E 5L E847<A: G;8?B: 8AGE<8F 9EB@ G;8 �?8� �BE ABA78G8E@<A<FG<6 BC8E4G<BAFFH�6<8AG <A9BE@4G<BA <F ?B::87 GB 4??BJ G;8 BC8E4G<BA GB 58E8CEB7H687 J<G; G;8 F4@8 FG4G8 6;4A:8 4A7 BHGCHG� �BEABA78G8E@<A<FG<6 8I8AGF FH6; 4F G<@8E BE !' 6B@C?8G<BA <A

31

Primary VM

Backup VM

Logging channel

Shared Disk

�!�,)� �� �*!� �� �&%��,)�+!&%�

@8AG4G<BA B9 94H?GGB?8E4AG .%F 9BE G;8 ( (�*!+� C?4G9BE@� 'HE 4CCEB46; <F F<@<?4E 5HG J8 ;4I8 @478 FB@89HA74@8AG4? 6;4A:8F 9BE C8E9BE@4A68 E84FBAF 4A7 <AI8FG<:4G87 4 AH@58E B9 78F<:A 4?G8EA4G<I8F� !A 477<G<BA J8 ;4I8;47 GB 78F<:A 4A7 <@C?8@8AG @4AL 477<G<BA4? 6B@CBA8AGF<A G;8 FLFG8@ 4A7 784? J<G; 4 AH@58E B9 CE46G<64? <FFH8FGB 5H<?7 4 6B@C?8G8 FLFG8@ G;4G <F 8�6<8AG 4A7 HF45?8 5L6HFGB@8EF EHAA<A: 8AG8ECE<F8 4CC?<64G<BAF� +<@<?4E GB @BFGBG;8E CE46G<64? FLFG8@F 7<F6HFF87 J8 BA?L 4GG8@CG GB 784?J<G; 94<?FGBC 94<?HE8F 1��3 J;<6; 4E8 F8EI8E 94<?HE8F G;4G 64A58 78G86G87 589BE8 G;8 94<?<A: F8EI8E 64HF8F 4A <A6BEE86G 8KG8EA4??L I<F<5?8 46G<BA�,;8 E8FG B9 G;8 C4C8E <F BE:4A<M87 4F 9B??BJF� �<EFG J8

78F6E<58 BHE 54F<6 78F<:A 4A7 78G4<? BHE 9HA74@8AG4? CEBGB6B?F G;4G 8AFHE8 G;4G AB 74G4 <F ?BFG <9 4 546>HC .% G4>8FBI8E 49G8E 4 CE<@4EL .% 94<?F� ,;8A J8 78F6E<58 <A 78G4<? @4AL B9 G;8 CE46G<64? <FFH8F G;4G @HFG 58 477E8FF87 GB5H<?7 4 EB5HFG 6B@C?8G8 4A7 4HGB@4G87 FLFG8@� /8 4?FB78F6E<58 F8I8E4? 78F<:A 6;B<68F G;4G 4E<F8 9BE <@C?8@8AG<A:94H?GGB?8E4AG .%F 4A7 7<F6HFF G;8 GE478B�F <A G;8F8 6;B<68F�&8KG J8 :<I8 C8E9BE@4A68 E8FH?GF 9BE BHE <@C?8@8AG4G<BA9BE FB@8 58A6;@4E>F 4A7 FB@8 E84? 8AG8ECE<F8 4CC?<64G<BAF��<A4??L J8 78F6E<58 E8?4G87 JBE> 4A7 6BA6?H78�

2. BASIC FT DESIGN�<:HE8 � F;BJF G;8 54F<6 F8GHC B9 BHE FLFG8@ 9BE 94H?G

GB?8E4AG .%F� �BE 4 :<I8A .% 9BE J;<6; J8 78F<E8 GB CEBI<7894H?G GB?8E4A68 �G;8 -.'*�.4 .%� J8 EHA 4 �!(1- .% BA4 7<�8E8AG C;LF<64? F8EI8E G;4G <F >8CG <A FLA6 4A7 8K86HG8F<78AG<64??L GB G;8 CE<@4EL I<EGH4? @46;<A8 G;BH:; J<G; 4F@4?? G<@8 ?4:� /8 F4L G;4G G;8 GJB .%F 4E8 <A 2'.01�) ),!(�/0#-� ,;8 I<EGH4? 7<F>F 9BE G;8 .%F 4E8 BA F;4E87 FGBE4:8�FH6; 4F 4 �<5E8 �;4AA8? BE <+�+! 7<F> 4EE4L� 4A7 G;8E89BE8 4668FF<5?8 GB G;8 CE<@4EL 4A7 546>HC .% 9BE <ACHG 4A7BHGCHG� �/8 J<?? 7<F6HFF 4 78F<:A <A J;<6; G;8 CE<@4EL 4A7546>HC .% ;4I8 F8C4E4G8 ABAF;4E87 I<EGH4? 7<F>F <A +86G<BA ����� 'A?L G;8 CE<@4EL .% 47I8EG<F8F <GF CE8F8A68 BAG;8 A8GJBE> FB 4?? A8GJBE> <ACHGF 6B@8 GB G;8 CE<@4EL .%�+<@<?4E?L 4?? BG;8E <ACHGF �FH6; 4F >8L5B4E7 4A7 @BHF8� :BBA?L GB G;8 CE<@4EL .%��?? <ACHG G;4G G;8 CE<@4EL .% E868<I8F <F F8AG GB G;8

546>HC .% I<4 4 A8GJBE> 6BAA86G<BA >ABJA 4F G;8 ),%%'+%!&�++#)� �BE F8EI8E JBE>?B47F G;8 7B@<A4AG <ACHG GE4�6<F A8GJBE> 4A7 7<F>� �77<G<BA4? <A9BE@4G<BA 4F 7<F6HFF8758?BJ <A +86G<BA ��� <F GE4AF@<GG87 4F A868FF4EL GB 8AFHE8G;4G G;8 546>HC .% 8K86HG8F ABA78G8E@<A<FG<6 BC8E4G<BAF<A G;8 F4@8 J4L 4F G;8 CE<@4EL .%� ,;8 E8FH?G <F G;4G G;8546>HC .% 4?J4LF 8K86HG8F <78AG<64??L GB G;8 CE<@4EL .%� BJ8I8E G;8 BHGCHGF B9 G;8 546>HC .% 4E8 7EBCC87 5LG;8 ;LC8EI<FBE FB BA?L G;8 CE<@4EL CEB7H68F 46GH4? BHGCHGFG;4G 4E8 E8GHEA87 GB 6?<8AGF� �F 78F6E<587 <A +86G<BA ��� G;8CE<@4EL 4A7 546>HC .% 9B??BJ 4 FC86<�6 CEBGB6B? <A6?H7<A:8KC?<6<G 46>ABJ?87:@8AGF 5L G;8 546>HC .% <A BE78E GB8AFHE8 G;4G AB 74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F�,B 78G86G <9 4 CE<@4EL BE 546>HC .% ;4F 94<?87 BHE FLF

G8@ HF8F 4 6B@5<A4G<BA B9 ;84EG584G<A: 58GJ88A G;8 E8?8I4AGF8EI8EF 4A7 @BA<GBE<A: B9 G;8 GE4�6 BA G;8 ?B::<A: 6;4AA8?�!A 477<G<BA J8 @HFG 8AFHE8 G;4G BA?L BA8 B9 G;8 CE<@4ELBE 546>HC .% G4>8F BI8E 8K86HG<BA 8I8A <9 G;8E8 <F 4 FC?<G5E4<A F<GH4G<BA J;8E8 G;8 CE<@4EL 4A7 546>HC F8EI8EF ;4I8?BFG 6B@@HA<64G<BA J<G; 846; BG;8E�!A G;8 9B??BJ<A: F86G<BAF J8 CEBI<78 @BE8 78G4<?F BA F8I

8E4? <@CBEG4AG 4E84F� !A +86G<BA ��� J8 :<I8 FB@8 78G4<?FBA G;8 78G8E@<A<FG<6 E8C?4L G86;AB?B:L G;4G 8AFHE8F G;4G CE<@4EL 4A7 546>HC .%F 4E8 >8CG <A FLA6 I<4 G;8 <A9BE@4G<BAF8AG BI8E G;8 ?B::<A: 6;4AA8?� !A +86G<BA ��� J8 78F6E<584 9HA74@8AG4? EH?8 B9 BHE �, CEBGB6B? G;4G 8AFHE8F G;4G AB74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F� !A +86G<BA ��� J8 78F6E<58BHE @8G;B7F 9BE 78G86G<A: 4A7 E8FCBA7<A: GB 4 94<?HE8 <A 46BEE86G 94F;<BA�

2.1 Deterministic Replay Implementation�F J8 ;4I8 @8AG<BA87 E8C?<64G<A: F8EI8E �BE .%� 8K8

6HG<BA 64A 58 @B78?87 4F G;8 E8C?<64G<BA B9 4 78G8E@<A<FG<6 FG4G8 @46;<A8� !9 GJB 78G8E@<A<FG<6 FG4G8 @46;<A8F 4E8FG4EG87 <A G;8 F4@8 <A<G<4? FG4G8 4A7 CEBI<787 G;8 8K46G F4@8<ACHGF <A G;8 F4@8 BE78E G;8A G;8L J<?? :B G;EBH:; G;8 F4@8F8DH8A68F B9 FG4G8F 4A7 CEB7H68 G;8 F4@8 BHGCHGF� � I<EGH4? @46;<A8 ;4F 4 5EB47 F8G B9 <ACHGF <A6?H7<A: <A6B@<A:A8GJBE> C46>8GF 7<F> E847F 4A7 <ACHG 9EB@ G;8 >8L5B4E74A7 @BHF8� &BA78G8E@<A<FG<6 8I8AGF �FH6; 4F I<EGH4? <AG8EEHCGF� 4A7 ABA78G8E@<A<FG<6 BC8E4G<BAF �FH6; 4F E847<A:G;8 6?B6> 6L6?8 6BHAG8E B9 G;8 CEB68FFBE� 4?FB 4�86G G;8 .%�FFG4G8� ,;<F CE8F8AGF G;E88 6;4??8A:8F 9BE E8C?<64G<A: 8K86HG<BA B9 4AL .% EHAA<A: 4AL BC8E4G<A: FLFG8@ 4A7 JBE>?B47���� 6BEE86G?L 64CGHE<A: 4?? G;8 <ACHG 4A7 ABA78G8E@<A<F@A868FF4EL GB 8AFHE8 78G8E@<A<FG<6 8K86HG<BA B9 4 546>HC I<EGH4? @46;<A8 ��� 6BEE86G?L 4CC?L<A: G;8 <ACHGF 4A7 ABA78G8E@<A<F@ GB G;8 546>HC I<EGH4? @46;<A8 4A7 ��� 7B<A:FB <A 4 @4AA8E G;4G 7B8FA�G 78:E478 C8E9BE@4A68� !A 477<G<BA @4AL 6B@C?8K BC8E4G<BAF <A K�� @<6EBCEB68FFBEF ;4I8HA78�A87 ;8A68 ABA78G8E@<A<FG<6 F<78 8�86GF� �4CGHE<A:G;8F8 HA78�A87 F<78 8�86GF 4A7 E8C?4L<A: G;8@ GB CEB7H68G;8 F4@8 FG4G8 CE8F8AGF 4A 477<G<BA4? 6;4??8A:8�.%J4E8 78G8E@<A<FG<6 E8C?4L 1��3 CEBI<78F 8K46G?L G;<F

9HA6G<BA4?<GL 9BE K�� I<EGH4? @46;<A8F BA G;8 .%J4E8 I+C;8E8C?4G9BE@� �8G8E@<A<FG<6 E8C?4L E86BE7F G;8 <ACHGF B9 4 .%4A7 4?? CBFF<5?8 ABA78G8E@<A<F@ 4FFB6<4G87 J<G; G;8 .%8K86HG<BA <A 4 FGE84@ B9 ?B: 8AGE<8F JE<GG8A GB 4 ?B: �?8� ,;8.% 8K86HG<BA @4L 58 8K46G?L E8C?4L87 ?4G8E 5L E847<A: G;8?B: 8AGE<8F 9EB@ G;8 �?8� �BE ABA78G8E@<A<FG<6 BC8E4G<BAFFH�6<8AG <A9BE@4G<BA <F ?B::87 GB 4??BJ G;8 BC8E4G<BA GB 58E8CEB7H687 J<G; G;8 F4@8 FG4G8 6;4A:8 4A7 BHGCHG� �BEABA78G8E@<A<FG<6 8I8AGF FH6; 4F G<@8E BE !' 6B@C?8G<BA <A

31

VMM-based Primary Backup

¨  Primary and backup execute on two virtual machines

¨  Primary logs inputs and outputs ¨  Backup applies inputs from log ¨  Primary waits for backup output

¨  Primary-backup monitor each other ¤ If primary fails, backup takes over

¨  VMM at primary logs external inputs and causes of non-determinism

¨  Example: Log results of non-deterministic instructions ¤ e.g., timestamp counter read (RDTSC)

¨  Random number generation?

Log-based VM replication 18

¨  VMM at primary sends log entries to VMM at backup over the logging channel

¨  Backup hypervisor replays log entries ¤  Stops backup VM at next input event or non-deterministic instruction

¤ Delivers same input as primary ¤ Delivers same result to non-deterministic instruction as primary

Log-based VM replication 19

¨  VM inputs ¤  Incoming network packets ¤ Disk reads ¤ Keyboard and mouse events ¤ Clock timer interrupt events ¤ Results of non-deterministic instructions

¨  VM outputs ¤ Outgoing network packets ¤ Disk writes

¨  Why log outputs?

Virtual Machine I/O 20

Example Scenario

Primary

¨  Write input to log ¨  Apply input ¨  Produce output ¨  Fail!

Backup

¨  Take over as primary ¨  Read from log and

apply input ¨  Timer interrupt fires ¨  Produce output

Backup does not know timing of output relative to timer interrupt

Execution of interrupt handler may affect output

Optimizing Performance

¨  Primary must buffer output until ACK from backup ¤ Why? ¤ Slow from client perspective!

n  If replicas are not in close proximity.

¨  How to optimize performance? ¨  Pipeline sync with backup and execution at primary

22

VMM-based Replication 23

C

P

B

Primary Backup Replication

¨  Promote one of the backups if primary fails ¨  Replace any failed backup

¨  When should primary sync with backups? ¤ Before making state change externally visible ¤ Primary and backups must be externally consistent

¨  What to sync? ¤ Entire state when bootstrapping new backup ¤ Thereafter, all sources of non-determinism

24

RSM with Primary Backup 25

Operating System

Application

Operating System

Application

Server1 Server2

Virtual Machine Monitor

Virtual Machine Monitor

Hardware Hardware

Client perspective

¨  What does a worker need to know in order to register itself with replicated master?

¨  Needs to know which machine is primary

26

Primary Backup Replication

Client Primary Backup

Backup

Backup

27

Client perspective

¨  What does a worker need to know in order to register itself with replicated master?

¨  Needs to know which machine is primary

¨  Can primary be hard coded into client code? ¨  No, primary gets replaced when it fails

¨  How does client discover current primary?

28

Primary Backup Replication

Client Primary Backup

Backup

Backup

View service

29

View service

¨  Maintains current membership of primary-backup service (called view) ¤ View number, primary, backup

¨  When does view service change view? ¨  When primary or any backup fails ¨  Periodically exchange heartbeat messages to

detect failures

¨  What if view service is down or not reachable?

30

Transitioning between views

¨  How to ensure new primary is up-to-date? ¤ Only promote a previous backup ¤ This is why view service needs to pick backups

¨  How does view service know if a backup is up-to-date?

¨  Two scenarios for ill-timed primary failure: ¤ Primary applies operation but fails before syncing with

backup ¤ Primary fails before new backup is initialized

31

Transitioning between views

¨  View service broadcasts view change to all

¨  Primary must ACK new view once backup is up-to-date

¨  Two implications: ¤ Liveness detection timeout > State transfer time ¤ Cannot change view if primary fails during sync

32

View service

¨  View change has three steps: ¤ View service announces new view ¤ Primary syncs with new backup if there is one ¤ Primary acknowledges new view

¨  Stuck if primary fails in the midst of this process

33

Scalability of View service

¨  Does every client need to contact view service before any operation?

¨  Clients can cache view across operations

¨  When to invalidate cached view? ¨  Client invalidates cache when no response or

negative response from primary

34

Split Brain

Client

S1

S2

View service (1,S1, _) (2,S1,S2) (3,S2, _)

(1,S1, _) (2,S1,S2)

(2,S1,S2) (3,S2, _)

(2,S1,S2)

35

Avoiding Split Brain

¨  Primary must forward all operations to backups ¤ Goal: Get ACKs from backups that they too recognize

primary

¨  Why can’t backups be mistaken about who is primary? ¤ Only a backup can be promoted as primary

36

View service

¨  Valid sequence of views: ¤  (1, S1, _) à (2, S1, S2) à (3, S1, S3) à (4, S3, S4) à (5, S4, _)

¨  Examples of invalid transitions between views? ¤  (1, S1, S2) à (2, S3, S4) ¤  (1, S1, S2) à (2, _, S2) ¤  (1, S1, _) à (2, S2, S1)

37

Summary of view service

¨  Monitors primary and backups to detect when to change view ¤ Can change only after primary has ACKed view ¤ Primary ACKs only after syncing with backups

¨  Clients cache view for scalability ¨  To address split brain, primary must check with

backup before serving client

38

¨  One copy in SF (primary), one in NY (backup)

Replicating Bank Database

“Deposit $100”

“Pay 1% interest”

$1,000 $1,000

39

Primary-Backup Sync 40

C1

C2

P

B

“Deposit $100”

“Pay 1% interest”

$1,000

$1,000 $1,110

$1,111

Ordering of Updates

¨  All updates must be applied in the same order at all replicas

¨  External view: Total ordering of writes

¨  Primary effectively serializes all writes ¤  Order of events made known to replicas

41

Serving Reads

¨  Can backups serve reads? ¤ Assume no split brain

¨  What if primary’s state is ahead of backup? ¤ Updates to primary not yet externally visible ¤  Effect of read equivalent to if primary fails at this point

¨  What if backup’s state is ahead of primary? ¤ Different backups may not be in sync ¤  Primary may get replaced before it applies update

42

Reads: Primary vs. Backup 43

C1

P

C2

B1

“Deposit $100”

$1100 $1000

B2

Desired Properties

¨  All writes are totally ordered

¨  Once read returns particular value, all later reads should return that value or value of later write

¨  Once a write completes, all later reads should return value of that write or value of later write

44

Reads relative to Writes 45

C1

P

C2

B

“Deposit $100”

“Pay 1% interest”

$1100 $1111

Consistency Spectrum

¨  Consistency: What are the properties of externally visible effects?

46

Linearizability Sequential Causal Eventual

Consistency

Read-after- write

Ease of programming

Linearizability

¨  Total ordering of writes ¨  Read returns last completed write

¨  Single copy semantics ¤ Externally visible effects of writes and reads are

equivalent to if there existed a single copy

¨  Users oblivious to replication

47

Why weaken consistency?

¨  Shouldn’t we always strive for single copy semantics? ¤ Comes at the expense of lower performance

¨  Latency vs. consistency tradeoff

48

Consistency Spectrum 49

Linearizability Sequential Causal Eventual

Consistency

Read-after- write

Ease of programming

Latency

Sequential Consistency

¨  Results of any execution same as that when the operations by all processes on the data store were executed in some sequential order.

¨  And … the operations of each individual process appear in this sequence in the order specified by its program.

Example

¨  (a) is sequentially consistent. All processes see the same interleaving of the operations (x takes value b before a)

¨  (b) is not sequentially consistent. P3 sees x taking on value b before a while P4 sees the opposite.

Causal Consistency

¨  Order of causally related writes must be preserved in values returned to reads ¤  If W1 à W2, then if a read sees effect of W2, it must

see effect of W1

¨  Example: Facebook News Feed ¤ Okay to not see all completed posts ¤ But, if you see a comment, you must see the post on

which the comment is made ¨  Main utility: Lazy sync between replicas

52

Example

¨  The operations above do not result in sequential consistency (see reads of P3 and P4)

¨  But there is causal consistency. This is because the writes W1(x)c and W2(x)b are concurrent (there are no dependencies and thus no causal relationship).

¨  But note that W1(x)a and W2(x)b are causally related since R2(x)a may be the reason for W2(x)b.

Example

¨  (a) is not causally consistent. R3(x)b cannot follow R3(x)a since the the write of b to x is dependent on P2 reading x’s value to be a.

¨  (b) however is causally consistent since W1(x) a and W2(x) b are concurrent.

Linearizability example

¨  Every operation should appear to take effect instantaneously at some moment before its start and completion (sometime in the shaded area)

¨  Thus, the result is consistent regardless of ordering of operations

Linearizability with Locks

Client Replica 2

Replica 1

Replica 3

Lock service

56

Problems?

Client failures!

Lease

¨  Lock with timeout ¨  If lease holder fails, not a problem because lease

will expire

¨  How to pick lease timeout value? ¤ Short timeout à Client needs to renew lease ¤ Long timeout à Unnecessarily block operations

57

Discrepancy in Lease Validity

Client Replica 2

Replica 1

Replica 3

Lease service

58

Scenario in which lease server and client

differ about lease validity?

Discrepancy in Lease Validity

¨  Message that grants lease may have high delay

¨  Clock at lease holder and lease service may have different skew

¨  How to account for potential discrepancy?

59

Discrepancy in Lease Validity

Client Replica 2

Replica 1

Replica 3

Lease service

60

Replica must check with lease service to confirm

lease validity