Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015
Uploaded by till-rohrmann
Better be safe than sorry
§ Failures will happen
§ EMC estimated $1.7 billion costs due to data loss and system downtime
§ Recovery will save you time and costs
§ Switch between algorithms
§ Live upgrade of your system
Fault tolerance guarantees
§ At most once
  • No guarantees at all
§ At least once
  • Sufficient for many applications
§ Exactly once
§ Flink provides all guarantees
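The difference between the three guarantees can be sketched with a toy model: a running sum over records 1..10 that crashes once. Everything here (the `GuaranteeDemo` class, the `run` method, the fixed checkpoint position) is a hypothetical illustration under simplified assumptions, not Flink's recovery code.

```java
// Toy model of the three guarantees; an illustration, not Flink's recovery code.
final class GuaranteeDemo {
    enum Mode { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    /**
     * Sums all records with one simulated crash.
     * checkpointAt: number of records covered by the last checkpoint (must be <= failAfter).
     * failAfter:    number of records already processed when the crash happens.
     */
    static long run(Mode mode, int[] records, int checkpointAt, int failAfter) {
        long sum = 0;
        long checkpointedSum = 0;
        for (int i = 0; i < failAfter; i++) {
            sum += records[i];
            if (i + 1 == checkpointAt) {
                checkpointedSum = sum; // snapshot of the operator state
            }
        }

        // Crash and recovery.
        int resumeFrom;
        switch (mode) {
            case AT_MOST_ONCE:
                sum = 0;                   // state is lost
                resumeFrom = failAfter;    // nothing is replayed
                break;
            case AT_LEAST_ONCE:
                resumeFrom = checkpointAt; // records are replayed, state is not rolled back
                break;
            default: // EXACTLY_ONCE
                sum = checkpointedSum;     // state is rolled back to the checkpoint
                resumeFrom = checkpointAt; // and the source replays from the same point
                break;
        }
        for (int i = resumeFrom; i < records.length; i++) {
            sum += records[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] records = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // true sum: 55
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + run(mode, records, 5, 7));
        }
        // AT_MOST_ONCE -> 27, AT_LEAST_ONCE -> 68, EXACTLY_ONCE -> 55
    }
}
```

In this model, only the combination of state rollback and source replay from the same position yields the true sum.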
Operator State
§ Stateless operators
§ System state
§ User defined state
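// Stateless operator
ds.filter(_ != 0)

// System state (window contents are managed by Flink)
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))

// User defined state
public class CounterSum extends RichReduceFunction<Long> {
    // Counts how often reduce() has been called
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override
    public void open(Configuration config) throws Exception {
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}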
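What a checkpoint does with such user-defined state can be sketched in plain Java. `SimpleState`, `CountingSum` and their `snapshot`/`restore` methods are hypothetical stand-ins written for this example; they only mimic the `value()`/`update()` calls used above and are not Flink's `OperatorState` interface.

```java
// Hypothetical stand-in for a state handle; NOT Flink's OperatorState interface.
final class SimpleState<T> {
    private T value;

    SimpleState(T defaultValue) { this.value = defaultValue; }

    T value() { return value; }

    void update(T newValue) { this.value = newValue; }

    // Value handed to the checkpoint (safe to share because T is assumed immutable).
    T snapshot() { return value; }

    // Resets the state to a previously taken snapshot.
    void restore(T snapshot) { this.value = snapshot; }
}

// Same logic as the reduce function above, without any Flink dependency.
final class CountingSum {
    final SimpleState<Long> counter = new SimpleState<>(0L);

    long reduce(long v1, long v2) {
        counter.update(counter.value() + 1); // user-defined state: number of calls
        return v1 + v2;
    }
}

final class UserStateDemo {
    public static void main(String[] args) {
        CountingSum sum = new CountingSum();
        long acc = sum.reduce(1, 2);
        acc = sum.reduce(acc, 3);
        Long checkpoint = sum.counter.snapshot();   // counter is 2 here

        acc = sum.reduce(acc, 4);                   // counter is 3, then the job "fails"
        sum.counter.restore(checkpoint);            // recovery rolls the counter back to 2

        System.out.println("sum=" + acc + ", counter after recovery=" + sum.counter.value());
    }
}
```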
Advantages
§ Separation of app logic from recovery
  • Checkpointing interval is just a config parameter
§ High throughput
  • Controllable checkpointing overhead
§ Low impact on latency
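The trade-off behind "just a config parameter" can be sketched as a small simulation: the application logic (a sum) never changes, while the interval only shifts cost between the number of snapshots taken and the number of records replayed after a failure. `IntervalDemo`, `Stats` and `run` are hypothetical names, and the model assumes a replayable source and a single crash.

```java
// Toy simulation; the interval is the only recovery-related knob.
final class IntervalDemo {
    static final class Stats {
        final long sum;       // final result of the application logic
        final int snapshots;  // checkpoints taken (overhead)
        final int replayed;   // records reprocessed after the crash

        Stats(long sum, int snapshots, int replayed) {
            this.sum = sum;
            this.snapshots = snapshots;
            this.replayed = replayed;
        }
    }

    /** Sums records 1..n, checkpointing every `interval` records, crashing once after `failAfter` records. */
    static Stats run(int n, int interval, int failAfter) {
        long sum = 0;
        long checkpointedSum = 0;
        int checkpointedOffset = 0;
        int snapshots = 0;
        int replayed = 0;
        boolean failed = false;

        int i = 0;
        while (i < n) {
            if (!failed && i == failAfter) {
                failed = true;
                replayed = i - checkpointedOffset;
                sum = checkpointedSum;   // roll state back
                i = checkpointedOffset;  // rewind the source
                continue;
            }
            sum += i + 1;                // application logic, unaware of checkpoints
            i++;
            if (i % interval == 0) {     // recovery logic, driven only by the interval
                checkpointedSum = sum;
                checkpointedOffset = i;
                snapshots++;
            }
        }
        return new Stats(sum, snapshots, replayed);
    }

    public static void main(String[] args) {
        for (int interval : new int[] {10, 25}) {
            Stats s = run(100, interval, 49);
            System.out.println("interval=" + interval + " sum=" + s.sum
                    + " snapshots=" + s.snapshots + " replayed=" + s.replayed);
        }
    }
}
```

With these inputs, the shorter interval takes more snapshots but replays fewer records; the result is the same in both runs.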
Persisting jobs
Components: Client, JobManager, TaskManagers, Apache ZooKeeper™
1. Submit job
2. Persist execution graph
3. Write handle to ZooKeeper
4. Deploy tasks
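The four steps above can be sketched with in-memory stand-ins. `ToyStorage`, `ToyZooKeeper` and `ToyJobManager` are hypothetical classes, not Flink or ZooKeeper APIs, and the `recoverAll` path is an assumption about how a standby JobManager would use the persisted data; the slides only show the write path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for durable storage: holds the large objects and returns small handles.
final class ToyStorage {
    private final Map<String, String> blobs = new HashMap<>();
    private int next = 0;

    String put(String data) {
        String handle = "handle-" + (next++);
        blobs.put(handle, data);
        return handle;
    }

    String get(String handle) { return blobs.get(handle); }
}

// Stand-in for ZooKeeper: only stores small handles, keyed by job id.
final class ToyZooKeeper {
    final Map<String, String> jobHandles = new LinkedHashMap<>();
}

final class ToyJobManager {
    private final ToyStorage storage;
    private final ToyZooKeeper zk;
    final List<String> deployed = new ArrayList<>();

    ToyJobManager(ToyStorage storage, ToyZooKeeper zk) {
        this.storage = storage;
        this.zk = zk;
    }

    void submit(String jobId, String executionGraph) { // 1. Submit job
        String handle = storage.put(executionGraph);   // 2. Persist execution graph
        zk.jobHandles.put(jobId, handle);              // 3. Write handle to ZooKeeper
        deployed.add(executionGraph);                  // 4. Deploy tasks
    }

    // Assumed recovery path: a new JobManager reloads every job via its handle.
    void recoverAll() {
        for (String handle : zk.jobHandles.values()) {
            deployed.add(storage.get(handle));
        }
    }
}

final class PersistJobsDemo {
    public static void main(String[] args) {
        ToyStorage storage = new ToyStorage();
        ToyZooKeeper zk = new ToyZooKeeper();
        new ToyJobManager(storage, zk).submit("job-1", "source -> map -> sink");

        ToyJobManager standby = new ToyJobManager(storage, zk); // first JobManager "fails"
        standby.recoverAll();
        System.out.println("recovered: " + standby.deployed);
    }
}
```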
Handling checkpoints
Components: Client, JobManager, TaskManagers, Apache ZooKeeper™
1. Take snapshots
2. Persist snapshots
3. Send handles to JM
4. Create global checkpoint
5. Persist global checkpoint
6. Write handle to ZooKeeper
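The six steps above can be sketched as a minimal coordinator. All names (`CheckpointDemo`, `Store`, `Coordinator`, `snapshotTask`, `zkLatestCheckpoint`) are hypothetical, and the string-based "global checkpoint" is a simplification: here it is just the collection of all task snapshot handles, completed once every expected task has reported.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class CheckpointDemo {
    // Stand-in for durable storage; returns a small handle per stored object.
    static final class Store {
        private final Map<String, String> blobs = new HashMap<>();
        private int next = 0;

        String put(String data) {
            String handle = "h" + (next++);
            blobs.put(handle, data);
            return handle;
        }

        String get(String handle) { return blobs.get(handle); }
    }

    // TaskManager side: 1. take snapshot, 2. persist snapshot; the handle is returned.
    static String snapshotTask(Store store, String task, long state) {
        return store.put(task + "=" + state);
    }

    // JobManager side.
    static final class Coordinator {
        private final Store store;
        private final Set<String> expectedTasks;
        private final Map<String, String> acks = new TreeMap<>();
        String zkLatestCheckpoint; // stand-in for a ZooKeeper node, null until completed

        Coordinator(Store store, String... tasks) {
            this.store = store;
            this.expectedTasks = new HashSet<>(Arrays.asList(tasks));
        }

        /** 3. A task sends its snapshot handle. Returns true once the checkpoint is complete. */
        boolean acknowledge(String task, String snapshotHandle) {
            acks.put(task, snapshotHandle);
            if (!acks.keySet().containsAll(expectedTasks)) {
                return false;                     // still waiting for other tasks
            }
            String global = acks.toString();      // 4. Create global checkpoint
            String handle = store.put(global);    // 5. Persist global checkpoint
            zkLatestCheckpoint = handle;          // 6. Write handle to ZooKeeper
            return true;
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        Coordinator jm = new Coordinator(store, "task-a", "task-b");
        jm.acknowledge("task-a", snapshotTask(store, "task-a", 3L));
        jm.acknowledge("task-b", snapshotTask(store, "task-b", 7L));
        System.out.println("global checkpoint: " + store.get(jm.zkLatestCheckpoint));
    }
}
```

A checkpoint that is missing even one acknowledgement is never written to the ZooKeeper stand-in, so recovery in this sketch would only ever see complete checkpoints.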
TL;DL
§ Job recovery mechanism with low latency and high throughput
§ Exactly-once processing semantics
§ No single point of failure
⇒ Flink will always keep processing your data