Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015

Fault Tolerance and Job Recovery in Apache Flink™

Till Rohrmann @stsffap

Better be safe than sorry
§  Failures will happen
§  EMC estimated $1.7 billion costs due to data loss and system downtime
§  Recovery will save you time and costs
§  Switch between algorithms
§  Live upgrade of your system

Fault Tolerance

Fault tolerance guarantees
§  At most once
•  No guarantees at all
§  At least once
•  Sufficient for many applications
§  Exactly once
§  Flink provides all guarantees

Checkpoints
§  Consistent snapshots of distributed data stream and operator state
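To make "consistent snapshot" concrete, here is a minimal sketch in plain Java with no Flink dependency. The class `CountingJob` and all of its methods are hypothetical and only illustrate the idea: a snapshot pairs the stream position with the operator state, so restoring both together and replaying from that position counts every record exactly once.

```java
import java.util.List;

// Hypothetical toy job, not Flink API: a replayable source feeding a summing operator.
class CountingJob {
    // A snapshot pairs the source read position with the operator state at that position.
    static final class Snapshot {
        final int offset;
        final long sum;
        Snapshot(int offset, long sum) { this.offset = offset; this.sum = sum; }
    }

    private final List<Long> source;
    private int offset = 0; // index of the next record to read
    private long sum = 0;   // operator state

    CountingJob(List<Long> source) { this.source = source; }

    // Process up to n further records from the source.
    void process(int n) {
        for (int i = 0; i < n && offset < source.size(); i++) {
            sum += source.get(offset++);
        }
    }

    // Capture stream position and state at the same point in the stream.
    Snapshot checkpoint() { return new Snapshot(offset, sum); }

    // Recovery: reset stream position and state together, then replay.
    void restore(Snapshot s) { offset = s.offset; sum = s.sum; }

    long sum() { return sum; }
}
```

If only the source position were reset on recovery while the state was kept, replayed records would be added a second time, which corresponds to at-least-once behaviour.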

Barriers
§  Markers for checkpoints
§  Injected in the data flow
§  Alignment for multi-input operators
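Alignment can be sketched as follows. This is a simplified illustration in plain Java, not Flink's actual implementation; the class `AligningSumOperator` and its methods are hypothetical, and the sketch assumes at most one checkpoint is in flight at a time. Once a barrier arrives on a channel, later records from that channel are buffered until the barrier has arrived on every channel, so the snapshot contains only pre-barrier records.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of barrier alignment for an operator with several input channels.
class AligningSumOperator {
    private final boolean[] blocked;                          // channels whose barrier has arrived
    private int barriersSeen = 0;
    private final Queue<Long> buffered = new ArrayDeque<>();  // records held back during alignment
    private long sum = 0;                                     // operator state
    private final List<Long> snapshots = new ArrayList<>();   // state captured per checkpoint

    AligningSumOperator(int numChannels) { blocked = new boolean[numChannels]; }

    void onRecord(int channel, long value) {
        if (blocked[channel]) {
            buffered.add(value); // arrived after the barrier: belongs to the next checkpoint
        } else {
            sum += value;
        }
    }

    void onBarrier(int channel) {
        if (blocked[channel]) {
            // Simplification: a second barrier on a blocked channel is not supported here.
            throw new IllegalStateException("overlapping checkpoints are not modelled");
        }
        blocked[channel] = true;
        barriersSeen++;
        if (barriersSeen == blocked.length) {
            snapshots.add(sum);            // all barriers aligned: snapshot the state
            // A real operator would forward the barrier downstream at this point.
            Arrays.fill(blocked, false);   // unblock all channels
            barriersSeen = 0;
            while (!buffered.isEmpty()) {  // now process the records that were held back
                sum += buffered.poll();
            }
        }
    }

    long sum() { return sum; }
    List<Long> snapshots() { return snapshots; }
}
```

Buffering is what makes the snapshot consistent: without it, a record that followed the barrier on a fast channel would leak into the state captured for the current checkpoint.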

Operator State
§  Stateless operators
ds.filter(_ != 0)
§  System state
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
§  User defined state

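import org.apache.flink.api.common.functions.RichReduceFunction;
import org.apache.flink.api.common.state.OperatorState;
import org.apache.flink.configuration.Configuration;

public class CounterSum extends RichReduceFunction<Long> {
    // User-defined state: counts how often reduce() has been called
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override
    public void open(Configuration config) throws Exception {
        // Register the state under the name "counter" with default value 0
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}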

Advantages
§  Separation of app logic from recovery
•  Checkpointing interval is just a config parameter
§  High throughput
•  Controllable checkpointing overhead
§  Low impact on latency

Cluster High Availability

Without high availability
[Diagram: JobManager, TaskManager]

With high availability
[Diagram: JobManager, TaskManager, stand-by JobManager, Apache ZooKeeper™; caption "KEEP GOING"]

Persisting jobs
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™, Job]
1.  Submit job
2.  Persist execution graph
3.  Write handle to ZooKeeper
4.  Deploy tasks
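The persistence steps above can be sketched in plain Java. This is an illustration under stated assumptions, not Flink code: the class `JobPersistenceSketch`, the in-memory maps standing in for durable storage and ZooKeeper, and the `storage://` handle format are all hypothetical. The point is that ZooKeeper holds only a small handle, while the execution graph itself lives in separate storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: in-memory maps stand in for the external systems.
class JobPersistenceSketch {
    final Map<String, String> fileStorage = new HashMap<>(); // handle -> serialized execution graph
    final Map<String, String> zooKeeper = new TreeMap<>();   // job id -> handle

    // Steps 2 and 3: persist the execution graph, then publish its handle.
    void persistJob(String jobId, String executionGraph) {
        String handle = "storage://jobs/" + jobId; // assumed handle format
        fileStorage.put(handle, executionGraph);
        zooKeeper.put(jobId, handle);
    }

    // What a newly elected JobManager could do: follow each handle back to its graph.
    Map<String, String> recoverJobs() {
        Map<String, String> recovered = new TreeMap<>();
        for (Map.Entry<String, String> entry : zooKeeper.entrySet()) {
            recovered.put(entry.getKey(), fileStorage.get(entry.getValue()));
        }
        return recovered;
    }
}
```

Writing the handle only after the graph has been persisted means that any handle visible in ZooKeeper points at data that already exists.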

Handling checkpoints
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1.  Take snapshots
2.  Persist snapshots
3.  Send handles to JM
4.  Create global checkpoint
5.  Persist global checkpoint
6.  Write handle to ZooKeeper
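The JobManager side of these steps (3 to 6) might look roughly like the sketch below. It is plain Java with hypothetical names throughout (`CheckpointCoordinatorSketch`, `acknowledge`, the `storage://` handle format), uses in-memory maps in place of durable storage and ZooKeeper, and assumes only one checkpoint is pending at a time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of how a JobManager could assemble a global checkpoint.
class CheckpointCoordinatorSketch {
    private final int numTasks;
    private final Map<Integer, String> pendingHandles = new TreeMap<>(); // task index -> snapshot handle

    // Stand-ins for the external systems (assumptions for illustration).
    final Map<String, Map<Integer, String>> fileStorage = new HashMap<>(); // handle -> global checkpoint
    String zooKeeperLatest = null; // handle of the latest completed global checkpoint

    CheckpointCoordinatorSketch(int numTasks) { this.numTasks = numTasks; }

    // Step 3: a task reports the handle of its persisted snapshot.
    // Returns true when this acknowledgement completed the global checkpoint.
    boolean acknowledge(long checkpointId, int task, String snapshotHandle) {
        pendingHandles.put(task, snapshotHandle);
        if (pendingHandles.size() < numTasks) {
            return false; // still waiting for other tasks
        }
        // Step 4: the global checkpoint is the set of all task snapshot handles.
        Map<Integer, String> global = new TreeMap<>(pendingHandles);
        // Step 5: persist the global checkpoint.
        String handle = "storage://checkpoints/" + checkpointId; // assumed handle format
        fileStorage.put(handle, global);
        // Step 6: publish the handle so that a stand-by JobManager can find it.
        zooKeeperLatest = handle;
        pendingHandles.clear();
        return true;
    }
}
```

Until the last task has acknowledged, nothing is published, so a recovering JobManager only ever sees complete global checkpoints.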

Conclusion

TL;DL
§  Job recovery mechanism with low latency and high throughput
§  Exactly-once processing semantics
§  No single point of failure
→ Flink will always keep processing your data

flink.apache.org @ApacheFlink
