Those 10000 classes - Lambda Days · 2018-02-26 · Those 10000 classes I never wrote The power of...

Those 10000 classes I never wrote

The power of recursion schemes applied to data engineering


greetings

• The name’s Valentin (@ValentinKasas)

• Organizer of the Paris Scala User Group

• Member of the ScalaIO conference crew

• Cook

• Wannabe marathon runner


disclaimer

• Too much to show in a short time

=> no time for cat pictures or funny gifs

• A lot of code that is

• potentially scary

• probably doesn’t compile


the story of a data engineering problem


Problem statement

• Building a « meta-Datalake »

• ~100 data sources w/ ~100 tables each

• From None to production in 6 months

• 5 developers


More fun

• Data coming in batches and streams

• Privacy by design

• Data sources enrolled on a rolling basis


requirements

• Pipeline must be configured at runtime

• Needs a specific schema (<- privacy)

• Must work with different input & output formats

• JSON, CSV, Parquet, Avro, …


Solution


one schema to rule them all

• The only way to have a generic Pipeline is to build it around a schema

• That schema must contain enough information to allow the required « privacy-by-design »


schemas are cool

With a schema we can generically

• validate data

• generate random test data

• translate data between formats


Schemas are recursive

• a schema is composed of smaller schemas

• that are composed of smaller schemas

• that are composed of smaller schemas

• … etc

• … and so is the data they represent


explicit recursion is bad

• The compiler doesn’t help much

• StackOverflowError is around the corner

• Mixed concerns

• Not reusable code


Recursion schemes FTW !

• Originate in « Functional programming with Bananas, Lenses, Envelopes and Barbed Wire »

• Decouple the how and the what

• Scala implementation :

github.com/slamdata/matryoshka


families of schemes

• folds: destroy a structure bottom-up

ex: cata

• unfolds: build up a structure top-down

ex: ana

• refolds: combine an unfold w/ a fold

ex: hylo


recursion scheme recipe

To use any recursion scheme, you need:

• a Functor (called « pattern-functor »)

• a Fix-point type (most of the time)

• an Algebra and/or a Coalgebra


pattern functors

• idea : replace the recursive « slot » by a type parameter

sealed trait Tree
final case class Node(left: Tree, right: Tree) extends Tree
final case class Leaf(label: Int) extends Tree


sealed trait TreeF[A]
final case class Node[A](left: A, right: A) extends TreeF[A]
final case class Leaf[A](label: Int) extends TreeF[A]

object TreeF {
  // Functor is the scalaz type class (matryoshka builds on scalaz)
  implicit val treeFunctor: Functor[TreeF] = new Functor[TreeF] {
    def map[A, B](fa: TreeF[A])(f: A => B): TreeF[B] = fa match {
      case Node(l, r) => Node(f(l), f(r))
      case Leaf(i)    => Leaf[B](i)
    }
  }
}


• Problem : different shapes of trees have different types

val tree1 = Node(Node(Leaf(1), Leaf(2)), Node(Leaf(3), Leaf(4)))
// tree1: Node[Node[Leaf[Nothing]]] = ...

val tree2 = Node(Leaf(0), Leaf(-1))
// tree2: Node[Leaf[Nothing]] = ...

val tree3 = Node(Leaf(42), Node(Leaf(12), Leaf(84)))
// tree3: Node[TreeF[_ <: Leaf[Nothing]]] = ...

Fix-point types

• Solution : use a fix-point type to « swallow » recursion

final case class Fix[F[_]](unFix: F[Fix[F]])

val tree2 = Fix[TreeF](Node(
  Fix[TreeF](Leaf(0)),
  Fix[TreeF](Leaf(-1))))
// tree2: Fix[TreeF] = ...

val tree3 = Fix[TreeF](Node(
  Fix[TreeF](Leaf(42)),
  Fix[TreeF](Node(
    Fix[TreeF](Leaf(12)),
    Fix[TreeF](Leaf(84))))))
// tree3: Fix[TreeF] = ...
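The effect of Fix can be checked with a minimal, self-contained sketch that does not depend on matryoshka. The `node` and `leaf` helpers are hypothetical smart constructors added here for readability; they are not part of the slides.

```scala
object FixDemo {
  sealed trait TreeF[A]
  final case class Node[A](left: A, right: A) extends TreeF[A]
  final case class Leaf[A](label: Int) extends TreeF[A]

  // The fix-point type from the slide.
  final case class Fix[F[_]](unFix: F[Fix[F]])

  // Hypothetical smart constructors (assumption, not in the slides):
  // they only hide the repeated Fix[TreeF](...) wrapping.
  def node(l: Fix[TreeF], r: Fix[TreeF]): Fix[TreeF] = Fix[TreeF](Node(l, r))
  def leaf(i: Int): Fix[TreeF] = Fix[TreeF](Leaf(i))

  val tree2: Fix[TreeF] = node(leaf(0), leaf(-1))
  val tree3: Fix[TreeF] = node(leaf(42), node(leaf(12), leaf(84)))

  // Different shapes, one type: both trees fit in the same list.
  val trees: List[Fix[TreeF]] = List(tree2, tree3)

  def main(args: Array[String]): Unit = trees.foreach(println)
}
```

Whatever the depth of the tree, its static type stays Fix[TreeF], which is what lets a single function accept every shape.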

(co)Algebras

• Define what to do with a single layer of your recursive structure

• Algebras: collapse, 1 layer at a time

F[A] => A

• Coalgebras: build up, 1 layer at a time

A => F[A]

• A is referred to as the (co)algebra’s carrier
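As a small illustration of these two shapes, here is a sketch on the TreeF functor. The type aliases mirror the signatures above; `sumLabels` and `splitRange` are made-up examples, not taken from the talk.

```scala
object AlgebraDemo {
  sealed trait TreeF[A]
  final case class Node[A](left: A, right: A) extends TreeF[A]
  final case class Leaf[A](label: Int) extends TreeF[A]

  type Algebra[F[_], A]   = F[A] => A
  type Coalgebra[F[_], A] = A => F[A]

  // Algebra with carrier Int: the children of a Node have already been
  // collapsed to the sum of their labels, so only one layer is handled.
  val sumLabels: Algebra[TreeF, Int] = {
    case Node(l, r) => l + r
    case Leaf(i)    => i
  }

  // Coalgebra with carrier (Int, Int): split an inclusive range [lo, hi]
  // (lo <= hi is assumed) into one layer, leaving two smaller ranges as seeds.
  val splitRange: Coalgebra[TreeF, (Int, Int)] = {
    case (lo, hi) if lo == hi => Leaf(lo)
    case (lo, hi) =>
      val mid = (lo + hi) / 2
      Node((lo, mid), (mid + 1, hi))
  }

  def main(args: Array[String]): Unit = {
    println(sumLabels(Node(3, 4)))
    println(splitRange((1, 4)))
  }
}
```

Neither function is recursive: each one only describes a single layer, and the recursion scheme supplies the traversal.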


hylomorphisms

(slide annotations: F is the Pattern-functor, alg the Algebra, coalg the Coalgebra)

// map requires a Functor[F]
def hylo[F[_]: Functor, A, B](a: A)(alg: Algebra[F, B], coalg: Coalgebra[F, A]): B =
  alg(coalg(a) map (hylo(_)(alg, coalg)))
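The definition can be run end to end with a few local stand-ins. In the sketch below, `Functor` is a hand-written replacement for the scalaz type class, the Functor is passed as an explicit implicit parameter instead of a context bound, and `splitRange` / `sumLabels` are made-up examples rather than code from the talk.

```scala
object HyloDemo {
  // Local stand-in for scalaz.Functor (assumption).
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }

  type Algebra[F[_], A]   = F[A] => A
  type Coalgebra[F[_], A] = A => F[A]

  sealed trait TreeF[A]
  final case class Node[A](left: A, right: A) extends TreeF[A]
  final case class Leaf[A](label: Int) extends TreeF[A]

  implicit val treeFunctor: Functor[TreeF] = new Functor[TreeF] {
    def map[A, B](fa: TreeF[A])(f: A => B): TreeF[B] = fa match {
      case Node(l, r) => Node(f(l), f(r))
      case Leaf(i)    => Leaf[B](i)
    }
  }

  // Unfold one layer with coalg, recurse into its children, fold it with alg.
  def hylo[F[_], A, B](a: A)(alg: Algebra[F, B], coalg: Coalgebra[F, A])(
      implicit F: Functor[F]): B =
    alg(F.map(coalg(a))(x => hylo(x)(alg, coalg)))

  // Unfold an inclusive range [lo, hi] (lo <= hi assumed) into a balanced tree.
  val splitRange: Coalgebra[TreeF, (Int, Int)] = {
    case (lo, hi) if lo == hi => Leaf(lo)
    case (lo, hi) =>
      val mid = (lo + hi) / 2
      Node((lo, mid), (mid + 1, hi))
  }

  // Fold a tree by summing its labels.
  val sumLabels: Algebra[TreeF, Int] = {
    case Node(l, r) => l + r
    case Leaf(i)    => i
  }

  // 1 + 2 + 3 + 4, computed without ever building the whole tree.
  val total: Int = hylo[TreeF, (Int, Int), Int]((1, 4))(sumLabels, splitRange)

  def main(args: Array[String]): Unit = println(total)
}
```

Note that this naive version recurses on the JVM stack, so very deep structures can still overflow it; the point here is only the separation between the traversal and the per-layer logic.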

[Animation: the following frames repeat this definition and highlight, in turn, the calls to coalg( ), hylo( ) and alg( ) as the structure is unfolded and then folded.]

Getting real


one schema to rule them all…

[Diagram labels: SchemaF[A], Input Schema, Parquet, Avro]

Step 1 : pattern functor

sealed trait SchemaF[A]

final case class StructF[A](fields: Map[String, A]) extends SchemaF[A]
final case class ArrayF[A](elementType: A) extends SchemaF[A]
final case class IntF[A]() extends SchemaF[A]
final case class StringF[A]() extends SchemaF[A]
// etc ...

object SchemaF {
  implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      // strict map over the values (mapValues only builds a lazy view)
      case StructF(fields) => StructF(fields.map { case (k, v) => k -> f(v) })
      case ArrayF(elem)    => ArrayF(f(elem))
      case IntF()          => IntF[B]()
      case StringF()       => StringF[B]()
    }
  }
}

Step 2 : Algebras

// DataType, StructType, StructField, ... come from org.apache.spark.sql.types
val schemaToParquet: Algebra[SchemaF, DataType] = {
  case StructF(fields) =>
    StructType(fields.map { case (name, tpe) => StructField(name, tpe) }.toSeq)
  case ArrayF(elem)    => ArrayType(elem)
  case IntF()          => IntegerType
  case StringF()       => StringType
}

fixSchema.cata(schemaToParquet)
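To try the same idea without Spark on the classpath, here is a self-contained sketch. `Functor`, `Fix` and `cata` are minimal local stand-ins for the scalaz and matryoshka versions, and the `describe` algebra is hypothetical: it renders a textual type description instead of building a Spark DataType.

```scala
object CataDemo {
  // Local stand-ins (assumptions) for scalaz.Functor and matryoshka's Fix.
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])
  type Algebra[F[_], A] = F[A] => A

  sealed trait SchemaF[A]
  final case class StructF[A](fields: Map[String, A]) extends SchemaF[A]
  final case class ArrayF[A](elementType: A) extends SchemaF[A]
  final case class IntF[A]() extends SchemaF[A]
  final case class StringF[A]() extends SchemaF[A]

  implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      case StructF(fields) => StructF(fields.map { case (k, v) => k -> f(v) })
      case ArrayF(elem)    => ArrayF(f(elem))
      case IntF()          => IntF[B]()
      case StringF()       => StringF[B]()
    }
  }

  // cata: fold the children first, then apply the algebra to the current layer.
  def cata[F[_], A](t: Fix[F])(alg: Algebra[F, A])(implicit F: Functor[F]): A =
    alg(F.map(t.unFix)(child => cata(child)(alg)))

  // Hypothetical target format: a type description as a String.
  // Fields are sorted by name to keep the output deterministic.
  val describe: Algebra[SchemaF, String] = {
    case StructF(fields) =>
      fields.toList.sortBy(_._1)
        .map { case (name, tpe) => s"$name: $tpe" }
        .mkString("struct<", ", ", ">")
    case ArrayF(elem) => s"array<$elem>"
    case IntF()       => "int"
    case StringF()    => "string"
  }

  val person: Fix[SchemaF] = Fix[SchemaF](StructF(Map(
    "name" -> Fix[SchemaF](StringF()),
    "tags" -> Fix[SchemaF](ArrayF(Fix[SchemaF](StringF()))),
    "age"  -> Fix[SchemaF](IntF())
  )))

  val described: String = cata(person)(describe)

  def main(args: Array[String]): Unit = println(described)
}
```

Adding another output format then means writing one more algebra of this size; the traversal itself is never rewritten.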


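Step 2.1 : coalgebras

// Schema is org.apache.avro.Schema; RECORD, ARRAY, ... are Schema.Type constants
// getFields returns a java.util.List, hence asScala (scala.collection.JavaConverters._)
val avroToSchema: Coalgebra[SchemaF, Schema] = { avro =>
  avro.getType match {
    case RECORD => StructF(avro.getFields.asScala.map(f => f.name -> f.schema).toMap)
    case ARRAY  => ArrayF(avro.getElementType())
    case INT    => IntF()
    case STRING => StringF()
  }
}

avroSchema.ana[Fix](avroToSchema)

avroSchema.hylo(schemaToParquet, avroToSchema)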
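The unfold direction can be sketched the same way without Avro. Below, `Src` is a hypothetical source format standing in for an Avro Schema, and `Functor`, `Fix` and `ana` are minimal local stand-ins rather than the matryoshka implementations.

```scala
object AnaDemo {
  // Local stand-ins (assumptions) for scalaz.Functor and matryoshka's Fix.
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])
  type Coalgebra[F[_], A] = A => F[A]

  sealed trait SchemaF[A]
  final case class StructF[A](fields: Map[String, A]) extends SchemaF[A]
  final case class ArrayF[A](elementType: A) extends SchemaF[A]
  final case class IntF[A]() extends SchemaF[A]
  final case class StringF[A]() extends SchemaF[A]

  implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      case StructF(fields) => StructF(fields.map { case (k, v) => k -> f(v) })
      case ArrayF(elem)    => ArrayF(f(elem))
      case IntF()          => IntF[B]()
      case StringF()       => StringF[B]()
    }
  }

  // Hypothetical source format (stand-in for an Avro Schema).
  sealed trait Src
  final case class SrcRecord(fields: List[(String, Src)]) extends Src
  final case class SrcList(element: Src) extends Src
  case object SrcInt extends Src
  case object SrcText extends Src

  // ana: build one layer from the seed, then recurse into the seeds it contains.
  def ana[F[_], A](a: A)(coalg: Coalgebra[F, A])(implicit F: Functor[F]): Fix[F] =
    Fix[F](F.map(coalg(a))(seed => ana[F, A](seed)(coalg)))

  // One layer of translation: the children stay untranslated Src values.
  val srcToSchema: Coalgebra[SchemaF, Src] = {
    case SrcRecord(fields) => StructF(fields.toMap)
    case SrcList(element)  => ArrayF(element)
    case SrcInt            => IntF()
    case SrcText           => StringF()
  }

  val source: Src = SrcList(SrcRecord(List("id" -> SrcInt)))
  val schema: Fix[SchemaF] = ana[SchemaF, Src](source)(srcToSchema)

  def main(args: Array[String]): Unit = println(schema)
}
```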

validating data

• we used github.com/jto/validation

• defines the Rule[I, O] type

• we need a single ADT to represent data

• otherwise we couldn’t write algebras

sealed trait DataF[A]
final case class GStruct[A](fields: Map[String, A]) extends DataF[A]
final case class GArray[A](elements: List[A]) extends DataF[A]
final case class GInt[A](value: Int) extends DataF[A]
final case class GString[A](value: String) extends DataF[A]

producing validators

val validator: Algebra[SchemaF, Rule[JsValue, Fix[DataF]]] = {
  case StructF(fields) =>
    fields.toList.traverse { case (name, validation) =>
      name -> (JsPath \ name).read(validation)
    }.map(fs => Fix(GStruct(fs.toMap)))

  case ArrayF(element) =>
    JsPath.pickList(element).map(es => Fix(GArray(es)))

  case IntF() =>
    JsPath.read[Int].map(v => Fix(GInt(v)))

  case StringF() =>
    JsPath.read[String].map(v => Fix(GString(v)))
}
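The key idea is that the carrier of the algebra is itself a function: folding a schema yields a validator. The sketch below reproduces that idea without the validation library. `Json` is a toy stand-in for JsValue, `Validator` (a plain function returning Either) is a stand-in for Rule[JsValue, Fix[DataF]], and `sequence` is a hand-written helper; none of these names come from the talk.

```scala
object ValidatorDemo {
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])
  type Algebra[F[_], A] = F[A] => A

  sealed trait SchemaF[A]
  final case class StructF[A](fields: Map[String, A]) extends SchemaF[A]
  final case class ArrayF[A](elementType: A) extends SchemaF[A]
  final case class IntF[A]() extends SchemaF[A]
  final case class StringF[A]() extends SchemaF[A]

  implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      case StructF(fields) => StructF(fields.map { case (k, v) => k -> f(v) })
      case ArrayF(elem)    => ArrayF(f(elem))
      case IntF()          => IntF[B]()
      case StringF()       => StringF[B]()
    }
  }

  def cata[F[_], A](t: Fix[F])(alg: Algebra[F, A])(implicit F: Functor[F]): A =
    alg(F.map(t.unFix)(child => cata(child)(alg)))

  sealed trait DataF[A]
  final case class GStruct[A](fields: Map[String, A]) extends DataF[A]
  final case class GArray[A](elements: List[A]) extends DataF[A]
  final case class GInt[A](value: Int) extends DataF[A]
  final case class GString[A](value: String) extends DataF[A]

  // Toy JSON tree: hypothetical stand-in for JsValue.
  sealed trait Json
  final case class JObj(fields: Map[String, Json]) extends Json
  final case class JArr(items: List[Json]) extends Json
  final case class JNum(value: Int) extends Json
  final case class JStr(value: String) extends Json

  // Hypothetical stand-in for Rule[JsValue, Fix[DataF]].
  type Validator = Json => Either[String, Fix[DataF]]

  // Combine results left to right; the leftmost error is the one reported.
  def sequence[A](xs: List[Either[String, A]]): Either[String, List[A]] =
    xs.foldRight[Either[String, List[A]]](Right(Nil)) { (x, acc) =>
      for { a <- x; rest <- acc } yield a :: rest
    }

  // Each child of a schema layer has already been folded into a Validator.
  val validator: Algebra[SchemaF, Validator] = {
    case StructF(fields) => {
      case JObj(values) =>
        sequence(fields.toList.map { case (name, validate) =>
          values.get(name).toRight(s"missing field $name")
            .flatMap(validate).map(v => name -> v)
        }).map(fs => Fix[DataF](GStruct(fs.toMap)))
      case other => Left(s"expected an object, got $other")
    }
    case ArrayF(element) => {
      case JArr(items) => sequence(items.map(element)).map(es => Fix[DataF](GArray(es)))
      case other       => Left(s"expected an array, got $other")
    }
    case IntF() => {
      case JNum(v) => Right(Fix[DataF](GInt(v)))
      case other   => Left(s"expected an int, got $other")
    }
    case StringF() => {
      case JStr(v) => Right(Fix[DataF](GString(v)))
      case other   => Left(s"expected a string, got $other")
    }
  }

  val schema: Fix[SchemaF] = Fix[SchemaF](StructF(Map(
    "name"   -> Fix[SchemaF](StringF()),
    "scores" -> Fix[SchemaF](ArrayF(Fix[SchemaF](IntF())))
  )))

  val validate: Validator = cata(schema)(validator)

  val ok = validate(JObj(Map("name" -> JStr("Ada"), "scores" -> JArr(List(JNum(1), JNum(2))))))
  val ko = validate(JObj(Map("name" -> JStr("Ada"))))
}
```

Unlike a Rule, this simplified validator stops at the first error instead of accumulating all of them.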

less easy stuff


Schema to Avro

• Impossible to directly write an Algebra[SchemaF, Schema]

• In an Avro schema, each type must have a unique name

• Solution:

• use the path of each node to name types

• store previously built schemas in a registry


Solution 1 : keep track of the path

final case class EnvT[E, F[_], A](run: (E, F[A]))

type WithPath[A] = EnvT[String, SchemaF, A]

val schemaWithPath: Coalgebra[WithPath, (String, Fix[SchemaF])] = {
  case (path, Fix(StructF(fields))) =>
    // each child keeps its field name and is paired with its own path
    EnvT((path, StructF(fields.map { case (k, v) => (k, (s"$path.$k", v)) })))
  case (path, Fix(ArrayF(e))) =>
    EnvT((path, ArrayF((path, e))))
  // ...
}

keeping track of the path (2)

def withPathToAvro(namespace: String): Algebra[EnvT[String, SchemaF, ?], Schema] = {
  case EnvT((path, StructF(fields))) =>
    SchemaBuilder
      .namespace(namespace)
      .record(path)
      ....
}

("", fixSchema).hylo(withPathToAvro(namespace), schemaWithPath)
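The path-tracking trick can be exercised without Avro. In the sketch below the algebra simply collects the path of every struct, that is, the names the records would receive. `Functor`, `Fix` and `hylo` are local stand-ins, and `env`, `Seed`, `recordNames` and the "root" starting path are hypothetical names introduced for the example.

```scala
object PathDemo {
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])
  type Algebra[F[_], A]   = F[A] => A
  type Coalgebra[F[_], A] = A => F[A]

  sealed trait SchemaF[A]
  final case class StructF[A](fields: Map[String, A]) extends SchemaF[A]
  final case class ArrayF[A](elementType: A) extends SchemaF[A]
  final case class IntF[A]() extends SchemaF[A]
  final case class StringF[A]() extends SchemaF[A]

  val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      case StructF(fields) => StructF(fields.map { case (k, v) => k -> f(v) })
      case ArrayF(elem)    => ArrayF(f(elem))
      case IntF()          => IntF[B]()
      case StringF()       => StringF[B]()
    }
  }

  def hylo[F[_], A, B](a: A)(alg: Algebra[F, B], coalg: Coalgebra[F, A])(
      implicit F: Functor[F]): B =
    alg(F.map(coalg(a))(x => hylo(x)(alg, coalg)))

  // A schema layer annotated with the path that leads to it.
  final case class EnvT[E, F[_], A](run: (E, F[A]))
  type WithPath[A] = EnvT[String, SchemaF, A]
  type Seed = (String, Fix[SchemaF])

  // Hypothetical helper: builds an annotated layer with explicit type arguments.
  def env[A](path: String, layer: SchemaF[A]): WithPath[A] =
    EnvT[String, SchemaF, A]((path, layer))

  // Mapping an annotated layer keeps the path and maps the layer underneath.
  implicit val withPathFunctor: Functor[WithPath] = new Functor[WithPath] {
    def map[A, B](fa: WithPath[A])(f: A => B): WithPath[B] =
      env(fa.run._1, schemaFunctor.map(fa.run._2)(f))
  }

  // Unfold: annotate the layer with its path and hand each child its own path.
  val schemaWithPath: Coalgebra[WithPath, Seed] = { case (path, schema) =>
    schema.unFix match {
      case StructF(fields) =>
        env[Seed](path, StructF(fields.map { case (k, v) => (k, (s"$path.$k", v)) }))
      case ArrayF(e)  => env[Seed](path, ArrayF((path, e)))
      case IntF()     => env[Seed](path, IntF())
      case StringF()  => env[Seed](path, StringF())
    }
  }

  // Fold: collect the path of every struct (fields visited in name order).
  val recordNames: Algebra[WithPath, List[String]] = { layer =>
    layer.run match {
      case (path, StructF(fields)) => path :: fields.toList.sortBy(_._1).flatMap(_._2)
      case (_, ArrayF(elem))       => elem
      case _                       => Nil
    }
  }

  val schema: Fix[SchemaF] = Fix[SchemaF](StructF(Map(
    "id"      -> Fix[SchemaF](IntF()),
    "address" -> Fix[SchemaF](StructF(Map("city" -> Fix[SchemaF](StringF())))),
    "orders"  -> Fix[SchemaF](ArrayF(Fix[SchemaF](StructF(Map("total" -> Fix[SchemaF](IntF()))))))
  )))

  val names: List[String] =
    hylo[WithPath, Seed, List[String]](("root", schema))(recordNames, schemaWithPath)
}
```

The path is introduced by the coalgebra and consumed by the algebra, so neither the schema ADT nor its Functor needs to know about it.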

solution 2 : use a registry

• Schemes come in different flavours : « classic », monadic, generalized

• Here we need the monadic flavour :

AlgebraM[M[_], F[_], A] == F[A] => M[A]

type FingerPrinted = (Long, Schema)

type Registry[A] = State[Map[Long, Schema], A]

solution 2 : using a registry

val reuseTypes: AlgebraM[Registry, SchemaF, FingerPrinted] = {
  case StructF(fields) =>
    val fp = fingerPrint(fields)
    State.get.flatMap { knownTypes =>
      if (knownTypes.contains(fp)) State.state(fp -> knownTypes(fp))
      else {
        val schema = SchemaBuilder
          .newRecord(fp)
          ...
        // register the new schema, then return it with its fingerprint
        State.put(knownTypes + (fp -> schema))
          .map(_ => fp -> schema)
      }
    }
}

fixSchema.cataM(reuseTypes)
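A reduced version of the registry idea fits in a self-contained sketch. Everything here is a simplification made for illustration: `State` is a minimal hand-written monad standing in for scalaz.State, `traverse` and `cataM` are specialised to SchemaF and to this State instead of being generic, the fingerprint is a plain String, and the registry maps fingerprints to generated record names rather than to Avro schemas.

```scala
object RegistryDemo {
  final case class Fix[F[_]](unFix: F[Fix[F]])

  sealed trait SchemaF[A]
  final case class StructF[A](fields: Map[String, A]) extends SchemaF[A]
  final case class ArrayF[A](elementType: A) extends SchemaF[A]
  final case class IntF[A]() extends SchemaF[A]
  final case class StringF[A]() extends SchemaF[A]

  // Minimal State monad (assumption: stand-in for scalaz.State).
  final case class State[S, A](run: S => (S, A)) {
    def map[B](f: A => B): State[S, B] =
      State { s => val (s1, a) = run(s); (s1, f(a)) }
    def flatMap[B](f: A => State[S, B]): State[S, B] =
      State { s => val (s1, a) = run(s); f(a).run(s1) }
  }

  // Registry state: fingerprint of a record -> name given to that record.
  type Registry[A] = State[Map[String, String], A]
  def pure[A](a: A): Registry[A] = State(s => (s, a))

  // Effectful map over one layer: children are processed in field-name order.
  def traverse[A, B](fa: SchemaF[A])(f: A => Registry[B]): Registry[SchemaF[B]] =
    fa match {
      case StructF(fields) =>
        fields.toList.sortBy(_._1)
          .foldLeft(pure(Map.empty[String, B])) { case (acc, (k, a)) =>
            for { m <- acc; b <- f(a) } yield m + (k -> b)
          }
          .map[SchemaF[B]](m => StructF(m))
      case ArrayF(elem) => f(elem).map[SchemaF[B]](b => ArrayF(b))
      case IntF()       => pure[SchemaF[B]](IntF())
      case StringF()    => pure[SchemaF[B]](StringF())
    }

  // Monadic fold: fold the children inside the monad, then run the algebra.
  def cataM[A](t: Fix[SchemaF])(alg: SchemaF[A] => Registry[A]): Registry[A] =
    traverse(t.unFix)(child => cataM(child)(alg)).flatMap(alg)

  // Carrier: the name of the type. Structs with the same fingerprint share a name.
  val reuseTypes: SchemaF[String] => Registry[String] = {
    case StructF(fields) =>
      val fp = fields.toList.sortBy(_._1)
        .map { case (k, t) => s"$k:$t" }.mkString("{", ",", "}")
      State { known =>
        known.get(fp) match {
          case Some(name) => (known, name) // already registered: reuse it
          case None =>
            val name = s"Record${known.size}"
            (known + (fp -> name), name)
        }
      }
    case ArrayF(elem) => pure(s"array<$elem>")
    case IntF()       => pure("int")
    case StringF()    => pure("string")
  }

  val city: Fix[SchemaF] = Fix[SchemaF](StructF(Map("city" -> Fix[SchemaF](StringF()))))
  val schema: Fix[SchemaF] = Fix[SchemaF](StructF(Map("home" -> city, "work" -> city)))

  val (registry, rootName) = cataM(schema)(reuseTypes).run(Map.empty)
}
```

The two identical inner structs end up registered once, which is the behaviour the slide is after.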

writing transformed data

• Once we get a Fix[DataF], we need to serialize it

• to Parquet, using spark’s Row

• to Avro using GenericRecord


serializing to parquet

• We can nest Rows, so we can use an algebra with Row as the carrier

• But, we don’t want our « simple » columns to be wrapped

• If we have a two-column table we want to output Row(col1, col2), not Row(Row(col1), Row(col2))


paramorphism

• We need to know when we’ve reached the « bottom » of our tree

• para is a scheme whose algebra receives, next to each child’s result, the « tree » that result was computed from

• It uses a GAlgebra[(T, ?), F, A]

GAlgebra[W[_], F[_], A] == F[W[A]] => A


building rows

val toRow: GAlgebra[(Fix[DataF], ?), DataF, Row] = {
  case GStruct(fields) =>
    val values = fields.map { case (_, (fix, value)) =>
      if (isSimpleType(fix)) value.get(0) // unwrap the row
      else value
    }
    Row(values.toSeq: _*)

  case GInt(value) =>
    Row(value)

  // etc…
}

fixData.para[Row](toRow)
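The unwrapping logic can be tested without Spark. In this sketch a Vector[Any] plays the role of Row, `para` is a minimal local implementation with the generalized algebra written out as DataF[(Fix[DataF], A)] => A, and `isSimpleType` is a hypothetical helper whose definition the slides do not show.

```scala
object ParaDemo {
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])

  sealed trait DataF[A]
  final case class GStruct[A](fields: Map[String, A]) extends DataF[A]
  final case class GArray[A](elements: List[A]) extends DataF[A]
  final case class GInt[A](value: Int) extends DataF[A]
  final case class GString[A](value: String) extends DataF[A]

  implicit val dataFunctor: Functor[DataF] = new Functor[DataF] {
    def map[A, B](fa: DataF[A])(f: A => B): DataF[B] = fa match {
      case GStruct(fields) => GStruct(fields.map { case (k, v) => k -> f(v) })
      case GArray(elems)   => GArray(elems.map(f))
      case GInt(v)         => GInt[B](v)
      case GString(v)      => GString[B](v)
    }
  }

  // para: like cata, but each child result is paired with its original subtree.
  def para[F[_], A](t: Fix[F])(alg: F[(Fix[F], A)] => A)(implicit F: Functor[F]): A =
    alg(F.map(t.unFix)(sub => (sub, para(sub)(alg))))

  // Hypothetical helper: a "simple" value is a leaf of the data tree.
  def isSimpleType(data: Fix[DataF]): Boolean = data.unFix match {
    case GInt(_) | GString(_) => true
    case _                    => false
  }

  // Vector[Any] stands in for Spark's Row; columns are ordered by field name.
  val toRow: DataF[(Fix[DataF], Vector[Any])] => Vector[Any] = {
    case GStruct(fields) =>
      fields.toVector.sortBy(_._1).map { case (_, (sub, row)) =>
        if (isSimpleType(sub)) row(0) // unwrap the single-cell row
        else row
      }
    case GArray(elems) => Vector(elems.map(_._2))
    case GInt(v)       => Vector(v)
    case GString(v)    => Vector(v)
  }

  val flatData: Fix[DataF] = Fix[DataF](GStruct(Map(
    "name" -> Fix[DataF](GString("Ada")),
    "age"  -> Fix[DataF](GInt(36))
  )))

  val nestedData: Fix[DataF] = Fix[DataF](GStruct(Map(
    "id"      -> Fix[DataF](GInt(1)),
    "address" -> Fix[DataF](GStruct(Map("city" -> Fix[DataF](GString("Paris")))))
  )))

  val flat: Vector[Any]   = para(flatData)(toRow)
  val nested: Vector[Any] = para(nestedData)(toRow)
}
```

Simple columns come out unwrapped while the nested struct stays a nested row, which a plain cata could not decide because it never sees the original subtree.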

to be continued …


well, actually …


inspired by real facts… but inspired only


It might seem hard at first …


… but it gets better
