Mikio Braun – Data flow vs. procedural programming

Flink Forward 2015, Berlin, October 13, 2015. Data flow vs. procedural programming: How to put your algorithms into Flink. Mikio L. Braun, Zalando SE, @mikiobraun

Transcript of Mikio Braun – Data flow vs. procedural programming

Page 1: Mikio Braun – Data flow vs. procedural programming

Flink Forward 2015

Data flow vs. procedural programming: How to put your algorithms into Flink

October 13, 2015

Mikio L. Braun, Zalando SE

@mikiobraun

Page 2: Mikio Braun – Data flow vs. procedural programming

Python vs Flink

● Coming from Python, what are the differences in programming style I have to know to get started in Flink?

Page 3: Mikio Braun – Data flow vs. procedural programming

Programming how we're used to

● Computing a sum

● Tools at our disposal:

– variables

– control flow (loops, if)

– function calls as basic piece of abstraction

    def computeSum(a):
        sum = 0
        for i in range(len(a)):
            sum += a[i]
        return sum

Page 4: Mikio Braun – Data flow vs. procedural programming

Data Analysis Algorithms

Let's consider centering (subtracting the mean of all points from every point)

becomes

    def centerPoints(xs):
        sum = xs[0].copy()
        for i in range(1, len(xs)):
            sum += xs[i]
        mean = sum / len(xs)
        for i in range(len(xs)):
            xs[i] -= mean
        return xs

or even just

    xs - xs.mean(axis=0)

Page 5: Mikio Braun – Data flow vs. procedural programming

Don't use for-loops

● Put your data into a matrix

● Don't use for loops

Page 6: Mikio Braun – Data flow vs. procedural programming

Least Squares Regression

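● Compute w = (X'X + lam * I)^(-1) X'y

● Becomes

    import numpy as np

    def lsr(X, y, lam):
        d = X.shape[1]
        # Regularized covariance-like matrix X'X + lam * I
        C = X.T.dot(X) + lam * np.eye(d)
        # Solve C w = X'y instead of inverting C explicitly
        w = np.linalg.solve(C, X.T.dot(y))
        return w

What you learn is thinking in matrices, breaking down computations in terms of matrix algebra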

Page 7: Mikio Braun – Data flow vs. procedural programming

Basic tools

● Basic procedural programming paradigm

● Variables

● Ordered arrays and efficient functions on those

Advantage

– very familiar

– close to math

Disadvantage

– hard to scale

Page 8: Mikio Braun – Data flow vs. procedural programming

Parallel Data Flow

Often you have stuff like

    for i in someSet:
        map x[i] to y[i]

Which is inherently easy to scale
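A minimal Python sketch of why such elementwise maps scale: each y[i] depends only on x[i], so the work can be split across workers without any coordination. The function `square` and the use of a thread pool are illustrative assumptions, not something taken from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Hypothetical per-element function: y[i] depends only on x[i].
    return x * x

def parallel_map(fn, items, workers=4):
    # Elements are processed independently, so workers never coordinate.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

xs = [1, 2, 3, 4, 5]
ys = parallel_map(square, xs)
```

The same function could be handed to more workers or to other machines without changing `square`, which is the property a data flow system exploits.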

Page 9: Mikio Braun – Data flow vs. procedural programming

New Paradigm

● Basic building block is an (unordered) set.

● Basic operations inherently parallel

Page 10: Mikio Braun – Data flow vs. procedural programming

Computing, Data Flow Style

Computing a sum

    sum(xs) = xs.reduce((x, y) => x + y)

Computing a mean

    mean(xs) = xs.map(x => (x, 1))
                 .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
                 .map(xc => xc._1 / xc._2)
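For readers coming from Python, the same two computations can be sketched with `map` and `functools.reduce`. The names `df_sum` and `df_mean` are made up for this illustration; the structure mirrors the map/reduce/map chain above.

```python
from functools import reduce

def df_sum(xs):
    # Fold the elements pairwise with an associative, commutative operation.
    return reduce(lambda x, y: x + y, xs)

def df_mean(xs):
    # Pair each value with a count of 1, add pairs componentwise, then divide.
    pairs = map(lambda x: (x, 1), xs)
    total, count = reduce(lambda xc, yc: (xc[0] + yc[0], xc[1] + yc[1]), pairs)
    return total / count

values = [2.0, 4.0, 6.0, 8.0]
```

Because addition is associative and commutative, partial results can be combined in any order, which is what allows the reduce to run over an unordered, partitioned set.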

Page 11: Mikio Braun – Data flow vs. procedural programming

Apache Flink

● Data Flow system

● Basic building block is a DataSet[X]

● For execution, sets up all computing nodes, streams through data

Page 12: Mikio Braun – Data flow vs. procedural programming

Apache Flink: Getting Started

● Use Scala API

● Minimal project with Maven (build tool) or Gradle

● Use an IDE like IntelliJ

● Always import org.apache.flink.api.scala._

Page 13: Mikio Braun – Data flow vs. procedural programming

Centering (First Try)

    def computeMean(xs: DataSet[DenseVector]) =
      xs.map(x => (x, 1))
        .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
        .map(xc => xc._1 / xc._2)

    def centerPoints(xs: DataSet[DenseVector]) = {
      val mean = computeMean(xs)
      xs.map(x => x - mean)  // fails: mean is itself a DataSet
    }

You cannot nest DataSet operations!

Page 14: Mikio Braun – Data flow vs. procedural programming

Sorry, restrictions apply.

● Variables hold (lazy) computations

● You can't work with sets within the operations

● Even if result is just a single element, it's a DataSet[Elem].

● So what to do?

– cross joins

– broadcast variables
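The first two restrictions can be illustrated with a toy Python class. This is only a sketch under the assumption that a data set is a recipe evaluated on demand; `LazyDataSet` and its methods are invented for the illustration and are not Flink's API.

```python
from functools import reduce

class LazyDataSet:
    """Toy stand-in for a DataSet: holds a recipe, not the data."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing a list

    @staticmethod
    def from_list(items):
        return LazyDataSet(lambda: list(items))

    def map(self, fn):
        return LazyDataSet(lambda: [fn(x) for x in self._compute()])

    def reduce(self, fn):
        # Even a single result stays wrapped in a data set.
        return LazyDataSet(lambda: [reduce(fn, self._compute())])

    def collect(self):
        # Only here is anything actually computed.
        return self._compute()

calls = []
xs = LazyDataSet.from_list([1, 2, 3])
doubled = xs.map(lambda x: calls.append(x) or 2 * x)
total = doubled.reduce(lambda a, b: a + b)
before = list(calls)      # nothing has run yet
result = total.collect()  # triggers the whole pipeline
```

Here `total` is not a number but another data set with one element, so it cannot simply be used as a value inside a later `map`.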

Page 15: Mikio Braun – Data flow vs. procedural programming

Centering (Second Try)

    def computeMean(xs: DataSet[DenseVector]) =
      xs.map(x => (x, 1))
        .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
        .map(xc => xc._1 / xc._2)

    def centerPoints(xs: DataSet[DenseVector]) = {
      val mean = computeMean(xs)
      xs.crossWithTiny(mean).map(xm => xm._1 - xm._2)
    }

Works, but seems excessive because the mean is copied to each data element.
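The cross-join idea can be mimicked in plain Python with `itertools.product`. This is a sketch only: vectors are tuples, data sets are lists, and the helper names are assumptions made for the example.

```python
from functools import reduce
from itertools import product

def vec_add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def vec_sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def compute_mean(xs):
    # Returns a one-element list, mirroring a DataSet holding a single value.
    total, count = reduce(
        lambda xc, yc: (vec_add(xc[0], yc[0]), xc[1] + yc[1]),
        [(x, 1) for x in xs],
    )
    return [tuple(v / count for v in total)]

def center_points(xs):
    mean = compute_mean(xs)
    # The cross product pairs every point with the single mean element.
    return [vec_sub(x, m) for x, m in product(xs, mean)]

points = [(1.0, 2.0), (3.0, 6.0)]
centered = center_points(points)
```

The intermediate pairs make the cost visible: every point carries its own copy of the mean before the subtraction happens.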

Page 16: Mikio Braun – Data flow vs. procedural programming

Broadcast Variables

● Side information sent to all worker nodes

● Can be a DataSet

● Gets accessed as a Java collection

Page 17: Mikio Braun – Data flow vs. procedural programming

Broadcast Variables

    class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
        extends RichMapFunction[T, O] {
      var broadcastVariable: B = _

      @throws(classOf[Exception])
      override def open(configuration: Configuration): Unit = {
        broadcastVariable = getRuntimeContext
          .getBroadcastVariable[B]("broadcastVariable")
          .get(0)
      }

      override def map(value: T): O = {
        fun(value, broadcastVariable)
      }
    }
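The life cycle of the mapper on this slide (open once per task, then map every element) can be sketched in Python. The runtime context is modeled as a plain dict and `run_on_worker` is a hypothetical worker loop; neither corresponds to an actual Flink interface.

```python
class BroadcastSingleElementMapper:
    """Toy analogue of the Scala mapper; the runtime context is a plain dict."""

    def __init__(self, fun):
        self.fun = fun
        self.broadcast_variable = None

    def open(self, runtime_context):
        # Broadcast variables arrive as collections; take the single element.
        self.broadcast_variable = runtime_context["broadcastVariable"][0]

    def map(self, value):
        return self.fun(value, self.broadcast_variable)

def run_on_worker(mapper, partition, runtime_context):
    # Hypothetical worker loop: open once per task, then map each element.
    mapper.open(runtime_context)
    return [mapper.map(v) for v in partition]

context = {"broadcastVariable": [10]}
mapper = BroadcastSingleElementMapper(lambda x, m: x - m)
out = run_on_worker(mapper, [12, 15], context)
```

Unlike the cross join, the side value is delivered once per worker rather than once per element.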

Page 18: Mikio Braun – Data flow vs. procedural programming

Centering (Third Try)

    def computeMean(xs: DataSet[DenseVector]) =
      xs.map(x => (x, 1))
        .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
        .map(xc => xc._1 / xc._2)

    def centerPoints(xs: DataSet[DenseVector]) = {
      val mean = computeMean(xs)
      xs.mapWithBcVar(mean)((x, m) => x - m)
    }

Page 19: Mikio Braun – Data flow vs. procedural programming

Intermediate Results pattern

Procedural style:

    x = someComputation()
    y = someOtherComputation()
    z = someComputationOn(dataSet, x)
    result = moreComputationOn(y, z)

Data flow style:

    val x = someDataSetComputation()
    val y = someOtherDataSetComputation()

    val z = dataSet.mapWithBcVar(x)((d, x) => …)

    val result = anotherDataSet.mapWithBcVar((y, z)) { (d, yz) =>
      val (y, z) = yz
      …
    }

Page 20: Mikio Braun – Data flow vs. procedural programming

Matrix Algebra

● No ordered sets per se in Data Flow context.

Page 21: Mikio Braun – Data flow vs. procedural programming

Vector operations by explicit joins

● Encode vector (a1, a2, …, an) with

{(1, a1), (2, a2), … (n, an)}

● Addition:

    a.join(b).where(0).equalTo(0)
     .map(ab => (ab._1._1, ab._1._2 + ab._2._2))

after join: {((1, a1), (1, b1)), ((2, a2), (2, b2)), … }
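The encoding and the join-based addition can be tried out in plain Python with sets of (index, value) pairs. The helper names below are invented for this sketch, and the join is a simple hash join on the first field.

```python
def encode(vec):
    # Represent (a1, ..., an) as an unordered set of (index, value) pairs.
    return {(i + 1, v) for i, v in enumerate(vec)}

def join_on_index(a, b):
    # Equi-join on field 0, analogous to where(0).equalTo(0).
    b_by_index = {}
    for pair in b:
        b_by_index.setdefault(pair[0], []).append(pair)
    return {(pa, pb) for pa in a for pb in b_by_index.get(pa[0], [])}

def add(a, b):
    # Keep the shared index and add the two values.
    return {(pa[0], pa[1] + pb[1]) for pa, pb in join_on_index(a, b)}

a = encode([1.0, 2.0, 3.0])
b = encode([10.0, 20.0, 30.0])
c = add(a, b)
```

Since the index travels with each value, the result is correct regardless of the order in which the pairs are stored or processed.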

Page 22: Mikio Braun – Data flow vs. procedural programming

Back to Least Squares Regression

Two operations: computing X'X and X'Y

    def lsr(xys: DataSet[(DenseVector, Double)]) = {
      val XTX = xys.map(xy => xy._1.outer(xy._1)).reduce(_ + _)
      val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)

      XTX.mapWithBcVar(XTY) { (xtx, xty) =>
        xtx \ xty  // solve for the weight vector
      }
    }
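The same decomposition can be checked on a tiny example in pure Python: X'X is a sum of per-point outer products and X'Y a sum of per-point scaled vectors, so both are map/reduce jobs. The helpers and the small `solve` routine are assumptions for this sketch, and there is no regularization term, as on the slide.

```python
from functools import reduce

def outer(x):
    return [[xi * xj for xj in x] for xi in x]

def mat_add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def vec_add(a, b):
    return [u + v for u, v in zip(a, b)]

def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting on a small dense system.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def lsr(xys):
    # Two independent map/reduce passes, then one small local solve.
    xtx = reduce(mat_add, map(lambda xy: outer(xy[0]), xys))
    xty = reduce(vec_add, map(lambda xy: [xi * xy[1] for xi in xy[0]], xys))
    return solve(xtx, xty)

# Points on the line y = 1 + 2 * t, with a constant feature for the intercept.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = lsr(data)
```

Only the two reductions touch the full data; the final solve works on a d-by-d matrix and a d-vector, which is why it can run on a single node.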

Page 23: Mikio Braun – Data flow vs. procedural programming

Summary and Outlook

● Procedural vs. Data Flow

– basic building blocks are elementwise operations on unordered sets

– can't be nested

– combine intermediate results via broadcast vars

● Iterations

● Beware of TypeInformation implicits.