Apache Flink Training - Metrics & Monitoring

Post on 17-Mar-2018

Apache Flink® Training

Flink v1.3 – 14.09.2017

Metrics and Monitoring

Metrics

<identifier, measurement>

Types

• Counter

• Meter (rate)

• Histogram

• Gauge (arbitrary value)

Exposed via MetricReporters

Also a REST API

Visualized in the WebUI

Example

public static class MyMap extends RichMapFunction<String, String> {
    private transient Counter count;

    @Override
    public void open(Configuration config) {
        // Register the counter on this operator's metric group
        count = getRuntimeContext().getMetricGroup().counter("numRecordsIn");
    }

    @Override
    public String map(String input) {
        count.inc();
        return input; // placeholder: return the transformed record
    }
}

Other metric types

Gauge

• Value can be any object that implements toString()

Histogram

• No default implementation is provided, but Flink offers a wrapper for Codahale/Dropwizard histograms

Meter

• Your code calls meter.markEvent() or meter.markEvent(n)

• Flink counts events, and also reports the average rate
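
To make the meter semantics concrete, here is a minimal, self-contained sketch of a meter that counts events and derives an average rate. It is a toy model, not Flink's implementation: the class SimpleMeter and its ratePerSecond(elapsedMillis) method are hypothetical names, and the elapsed time is passed in explicitly so the arithmetic is easy to follow.

```java
// Toy model of a meter: counts events and derives an average rate.
// SimpleMeter is a hypothetical class for illustration, not part of Flink.
final class SimpleMeter {
    private long count;

    // Same shape as meter.markEvent(): record a single event
    void markEvent() {
        markEvent(1);
    }

    // Same shape as meter.markEvent(n): record n events at once
    void markEvent(long n) {
        count += n;
    }

    long getCount() {
        return count;
    }

    // Average number of events per second over the given elapsed time
    double ratePerSecond(long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return count * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        SimpleMeter meter = new SimpleMeter();
        meter.markEvent();
        meter.markEvent(9);
        // 10 events over an assumed 2 seconds
        System.out.println(meter.getCount() + " events, "
                + meter.ratePerSecond(2000) + " events/s");
    }
}
```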

Metric Groups

Metrics are attached to MetricGroups, which provide

context about what is being measured

Adding your own MetricGroups

Useful for categorizing your measurements

counter = getRuntimeContext().getMetricGroup().addGroup("MyMetrics").counter("myCounter");

127.0.0.1.taskmanager.ABCDE.MyJob.MyOperator.1.numRecordsIn

host taskmanager job operator metric

Scope Formats

Dot-separated list of variables and constants

Variables are replaced at runtime

Configured in flink-conf.yaml

<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
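
The substitution of variables at runtime can be sketched as follows. This is a simplified illustration, not Flink's code: ScopeFormatSketch and its identifier method are hypothetical, and the variable values are taken from the example identifier 127.0.0.1.taskmanager.ABCDE.MyJob.MyOperator.1.numRecordsIn.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy expansion of a scope format.
// ScopeFormatSketch is a hypothetical helper for illustration, not Flink code.
final class ScopeFormatSketch {

    // Replaces each <variable> in the format, then appends the metric name
    static String identifier(String format, Map<String, String> variables, String metricName) {
        String scope = format;
        for (Map.Entry<String, String> e : variables.entrySet()) {
            scope = scope.replace("<" + e.getKey() + ">", e.getValue());
        }
        return scope + "." + metricName;
    }

    public static void main(String[] args) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("host", "127.0.0.1");
        vars.put("tm_id", "ABCDE");
        vars.put("job_name", "MyJob");
        vars.put("operator_name", "MyOperator");
        vars.put("subtask_index", "1");

        String format =
                "<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>";
        System.out.println(identifier(format, vars, "numRecordsIn"));
    }
}
```

Constants such as "taskmanager" pass through unchanged; only the parts in angle brackets are replaced.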

Scope Formats

Each metric is associated with one of 6 formats:

• metrics.scope.jm

• metrics.scope.jm.job

• metrics.scope.tm

• metrics.scope.tm.job

• metrics.scope.task

• metrics.scope.operator

Metric Reporter

Exposes metrics to the outside world

• Ganglia

• Graphite

• JMX

• StatsD

• or roll your own …

Example

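public static class Log4JReporter implements MetricReporter {
    private static final Logger LOG = LoggerFactory.getLogger(Log4JReporter.class);

    @Override
    public void open(MetricConfig config) {}

    @Override
    public void close() {}

    // notifyOfAddedMetric() and notifyOfRemovedMetric() follow on the next slides
}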

Example, cont.

public static class Log4JReporter implements MetricReporter {
    private static final Logger LOG = LoggerFactory.getLogger(Log4JReporter.class);

    // Registered counters, mapped to their fully scoped identifiers
    private final Map<Counter, String> counters = new ConcurrentHashMap<>();

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.put((Counter) metric, group.getMetricIdentifier(metricName));
        }
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.remove(metric);
        }
    }

    // open() and close() as on the previous slide
}

Example, cont.

public static class Log4JReporter implements MetricReporter, Scheduled {
    private static final Logger LOG = LoggerFactory.getLogger(Log4JReporter.class);

    // Registered counters, mapped to their fully scoped identifiers
    private final Map<Counter, String> counters = new ConcurrentHashMap<>();

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.put((Counter) metric, group.getMetricIdentifier(metricName));
        }
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.remove(metric);
        }
    }

    // Called periodically because the reporter implements Scheduled
    @Override
    public void report() {
        for (Map.Entry<Counter, String> metric : counters.entrySet()) {
            // Log the counter's current value, not the Counter object itself
            LOG.info(metric.getValue() + ": " + metric.getKey().getCount());
        }
    }

    // open() and close() as on the first slide of this example
}

Configuration

metrics.reporters: log

metrics.reporter.log.class: org.apache.flink.metrics.log4j.Log4JReporter

metrics.reporter.log.interval: 5 SECONDS

https://github.com/zentol/log4jreporter/blob/master/src/main/java/org/apache/flink/metrics/log4j/Log4JReporter.java

Monitoring REST API

Some available requests

/config

/overview

/jobmanager/metrics

/jobs

/jobs/<id>/metrics

/jobs/<id>/checkpoints

/jobs/<id>/vertices/<id>/metrics?get=0.numRecordsOutPerSecond

/taskmanagers

/taskmanagers/<id>/metrics?get=<metric>

...
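
A request path like the ones above is appended to the address of the JobManager's web endpoint and fetched with an HTTP GET. The sketch below builds such a URL and includes a small fetch helper. It rests on assumptions: the base address http://localhost:8081 is assumed for a local setup, JOB_ID and VERTEX_ID are placeholders for real ids, and RestUrlSketch is a hypothetical helper class. The fetch method needs a running cluster, so main only prints the URL.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for querying a single vertex metric via the REST API.
final class RestUrlSketch {

    // Builds /jobs/<id>/vertices/<id>/metrics?get=<metric> on top of the base URL
    static URI vertexMetric(String baseUrl, String jobId, String vertexId, String metric) {
        return URI.create(baseUrl + "/jobs/" + jobId
                + "/vertices/" + vertexId + "/metrics?get=" + metric);
    }

    // Performs an HTTP GET and returns the response body as text.
    // Only works against a reachable endpoint.
    static String fetch(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Assumed local address; JOB_ID and VERTEX_ID are placeholders
        URI uri = vertexMetric("http://localhost:8081", "JOB_ID", "VERTEX_ID",
                "0.numRecordsOutPerSecond");
        System.out.println(uri);
    }
}
```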

Available metrics

Many system metrics are built into Flink, including

• CPU, memory, threads, GC

• Classloading, network, cluster, IO

• Checkpointing, throughput, latency

Latency Monitoring

Latency tracking

env.getConfig().setLatencyTrackingInterval(msec)

Latency markers are injected by the sources, and flow

through the execution graph

• If records are queued in front of an operator, a marker will wait,

but it will otherwise bypass operators

Sinks track latency for each parallel source instance
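
The per-source bookkeeping at a sink can be pictured with the toy model below. It is a sketch under assumptions, not Flink's implementation: LatencySketch and its Marker class are hypothetical, a marker is assumed to carry its emission time and the index of the source subtask that emitted it, and only the most recent latency per source is kept.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of sink-side latency tracking.
// LatencySketch and Marker are hypothetical names for illustration.
final class LatencySketch {

    // A marker as emitted by a source: emission time plus source subtask index
    static final class Marker {
        final long emittedAtMillis;
        final int sourceSubtask;

        Marker(long emittedAtMillis, int sourceSubtask) {
            this.emittedAtMillis = emittedAtMillis;
            this.sourceSubtask = sourceSubtask;
        }
    }

    // Most recent latency observed for each parallel source instance
    private final Map<Integer, Long> latestLatency = new HashMap<>();

    // Called when a marker arrives at the sink
    void onMarker(Marker marker, long nowMillis) {
        latestLatency.put(marker.sourceSubtask, nowMillis - marker.emittedAtMillis);
    }

    // Returns the latest latency for a source subtask, or -1 if none was seen
    long latencyFor(int sourceSubtask) {
        return latestLatency.getOrDefault(sourceSubtask, -1L);
    }

    public static void main(String[] args) {
        LatencySketch sink = new LatencySketch();
        sink.onMarker(new Marker(1000L, 0), 1040L);
        sink.onMarker(new Marker(1000L, 1), 1100L);
        System.out.println("source 0: " + sink.latencyFor(0) + " ms");
        System.out.println("source 1: " + sink.latencyFor(1) + " ms");
    }
}
```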

Back Pressure Monitoring

What is Back Pressure?

Records in your job flow downstream, from

sources to sinks

When a downstream operator can’t keep

up, it exerts back pressure that propagates

upstream

Detecting Back Pressure

OK: < 10%

LOW: 10 – 50%

HIGH: > 50%

Configuration

jobmanager.web.backpressure.refresh-interval (60000 msec)

jobmanager.web.backpressure.num-samples (100)

jobmanager.web.backpressure.delay-between-samples (50 msec)
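
The OK / LOW / HIGH thresholds can be expressed as a small classification function. The sketch below assumes the percentage is the fraction of samples in which a task was found blocked; BackPressureSketch and its methods are hypothetical names. The handling of the exact boundary values (10% and 50%) is also an assumption, since the thresholds as listed leave it open.

```java
// Toy classification of a back pressure ratio.
// BackPressureSketch is a hypothetical helper for illustration, not Flink code.
final class BackPressureSketch {

    enum Level { OK, LOW, HIGH }

    // Fraction of samples in which the task was blocked (assumed definition)
    static double ratio(int blockedSamples, int numSamples) {
        if (numSamples <= 0) {
            throw new IllegalArgumentException("numSamples must be positive");
        }
        return (double) blockedSamples / numSamples;
    }

    // Thresholds: OK up to 10%, LOW up to 50%, HIGH above that.
    // Treating exactly 10% as OK and exactly 50% as LOW is an assumption.
    static Level classify(double ratio) {
        if (ratio <= 0.10) {
            return Level.OK;
        }
        if (ratio <= 0.50) {
            return Level.LOW;
        }
        return Level.HIGH;
    }

    public static void main(String[] args) {
        // 100 samples, matching the num-samples default shown above
        System.out.println(classify(ratio(5, 100)));
        System.out.println(classify(ratio(30, 100)));
        System.out.println(classify(ratio(80, 100)));
    }
}
```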

Slow sink?

E.g., slow database indexing may be

causing backpressure

Try a discarding sink to rule out the sink
