Quark: A Purely-Functional Scala DSL for Data Processing & Analytics
![Page 1: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/1.jpg)
Quark: A Purely-Functional Scala DSL for Data Processing & Analytics
John A. De Goes
@jdegoes - http://degoes.net
![Page 2: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/2.jpg)
Apache Spark
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
![Page 3: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/3.jpg)
Spark Sucks
— Functional-ish
— Exceptions, typecasts
— SparkContext
— Serializable
— Unsafe type-safe programs
— Second-class support for databases
— Dependency hell (>100)
— Painful debugging
— Implementation-dependent performance
![Page 4: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/4.jpg)
Why Does Spark Have to Suck?
Computation
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))  // <---- Where Spark goes wrong
                     .map(word => (word, 1))            // <---- Where Spark goes wrong
                     .reduceByKey(_ + _)                // <---- Where Spark goes wrong
![Page 5: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/5.jpg)
WWFPD?
— Purely functional
— No exceptions, no casts, no nulls
— No global variables
— No serialization
— Safe type-safe programs
— First-class support for databases
— Few dependencies
— Better debugging
— Implementation-independent performance
![Page 6: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/6.jpg)
Rule #1 in Functional Programming
Don't solve the problem, describe the solution.
AKA the "Do Nothing" rule
=> Don't compute, embed a compiled language into Scala
![Page 7: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/7.jpg)
Quark
Compilation
Quark is a Scala DSL built on Quasar Analytics, a general-purpose compiler for translating data processing over semi-structured data into efficient plans that execute 100% inside the target infrastructure.
val textFile = Dataset.load("...")
val counts = textFile.flatMap(line => line.typed[Str].split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_.sum)
![Page 8: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/8.jpg)
More Quark
Compilation
val dataset = Dataset.load("/prod/profiles")
val averageAge = dataset.groupBy(_.country[Str]).map(_.age[Int]).reduceBy(_.average)
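To make the intent of the aggregation above concrete, here is a minimal sketch of the same "average age per country" computed eagerly over an in-memory list with the standard Scala collections. It is not Quark code: `Profile` and the sample rows are hypothetical, and Quark's point is precisely that it describes this computation for a backend instead of running it in the JVM.

```scala
object InMemoryAverage {
  // Hypothetical record shape, assumed only for this illustration.
  final case class Profile(country: String, age: Int)

  // Hypothetical sample rows.
  val profiles: List[Profile] =
    List(Profile("US", 20), Profile("US", 40), Profile("FR", 40))

  // Group by country, then average the ages within each group.
  val averageAge: Map[String, Double] =
    profiles.groupBy(_.country).map { case (country, ps) =>
      country -> (ps.map(_.age).sum.toDouble / ps.size)
    }
}
```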
![Page 9: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/9.jpg)
Quark Targets
One DSL to Rule Them All
— MongoDB
— Couchbase
— MarkLogic
— Hadoop / HDFS
— Add your connector here!
![Page 10: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/10.jpg)
Both Quark and Quasar Analytics are purely-functional, open source projects written in 100% Scala.
https://github.com/quasar-analytics/
![Page 11: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/11.jpg)
How To DSL
Adding Integers
sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr

def int(v: Int): Expr = Integer(v)
def add(l: Expr, r: Expr): Expr = Addition(l, r)

add(add(int(1), int(2)), int(3)) : Expr

def interpret(e: Expr): Int = e match {
  case Integer(v) => v
  case Addition(l, r) => interpret(l) + interpret(r)
}
def serialize(v: Expr): Json = ???
def deserialize(v: Json): Expr = ???
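A self-contained sketch of the integer DSL above, wrapped in an object so it compiles as a file. Because the program is just data, a second interpreter can be added without touching the first. The `render` function is an assumption added here as a stand-in for `serialize`, producing a plain string rather than `Json`.

```scala
object IntDsl {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr

  def int(v: Int): Expr = Integer(v)
  def add(l: Expr, r: Expr): Expr = Addition(l, r)

  // Interpreter 1: evaluate the description to a number.
  def interpret(e: Expr): Int = e match {
    case Integer(v)     => v
    case Addition(l, r) => interpret(l) + interpret(r)
  }

  // Interpreter 2 (hypothetical): render the same description as text.
  def render(e: Expr): String = e match {
    case Integer(v)     => v.toString
    case Addition(l, r) => s"(${render(l)} + ${render(r)})"
  }

  // Building the program computes nothing; it only describes the sum.
  val program: Expr = add(add(int(1), int(2)), int(3))
}
```

Here `IntDsl.interpret(IntDsl.program)` evaluates the tree, while `IntDsl.render(IntDsl.program)` prints it.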
![Page 12: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/12.jpg)
How To DSL
Adding Strings
sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr // Uh, oh!
final case class Str(v: String) extends Expr
final case class StringConcat(l: Expr, r: Expr) extends Expr // Uh, oh!
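One way to read the "Uh, oh!" is that the untyped tree lets ill-typed programs compile. The sketch below illustrates this under that reading: an interpreter over this encoding has to detect mismatches at runtime. The `Either[String, Any]` result type and the error messages are assumptions made for the illustration.

```scala
object UntypedExpr {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr
  final case class Str(v: String) extends Expr
  final case class StringConcat(l: Expr, r: Expr) extends Expr

  // Nothing stops Addition from holding strings, so evaluation can fail.
  def interpret(e: Expr): Either[String, Any] = e match {
    case Integer(v) => Right(v)
    case Str(v)     => Right(v)
    case Addition(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(a: Int), Right(b: Int)) => Right(a + b)
        case _                              => Left("Addition expects two integers")
      }
    case StringConcat(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(a: String), Right(b: String)) => Right(a + b)
        case _                                    => Left("StringConcat expects two strings")
      }
  }

  // Compiles without complaint, yet makes no sense.
  val bad: Expr = Addition(Integer(1), Str("two"))
}
```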
![Page 13: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/13.jpg)
How To DSL
Phantom Type
sealed trait Expr[A]
final case class Integer(v: Int) extends Expr[Int]
final case class Addition(l: Expr[Int], r: Expr[Int]) extends Expr[Int]
final case class Str(v: String) extends Expr[String]
final case class StringConcat(l: Expr[String], r: Expr[String]) extends Expr[String]

def interpret[A](e: Expr[A]): A = e match {
  case Integer(v) => v
  case Addition(l, r) => interpret(l) + interpret(r)
  case Str(v) => v
  case StringConcat(l, r) => interpret(l) ++ interpret(r)
}
def serialize[A](v: Expr[A]): Json = ???
def deserialize[Z](v: Json): Expr[A] forSome { type A } = ???
![Page 14: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/14.jpg)
How To DSL
GADTs in Scala still have bugs
SI-8563, SI-9345, SI-6680
FRIENDS DON'T LET FRIENDS USE GADTS IN SCALA.
![Page 15: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/15.jpg)
How To DSL
Finally Tagless
trait Expr[F[_]] {
  def int(v: Int): F[Int]
  def str(v: String): F[String]
  def add(l: F[Int], r: F[Int]): F[Int]
  def concat(l: F[String], r: F[String]): F[String]
}

trait Dsl[A] {
  def apply[F[_]](implicit F: Expr[F]): F[A]
}

def int(v: Int): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
}

def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
}
// ...
![Page 16: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/16.jpg)
How To DSL
Finally Tagless
type Id[A] = A

def interpret: Expr[Id] = new Expr[Id] {
  def int(v: Int): Id[Int] = v
  def str(v: String): Id[String] = v
  def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
  def concat(l: Id[String], r: Id[String]): Id[String] = l + r
}

add(int(1), int(2)).apply(interpret) // Id(3)

final case class Const[A, B](a: A)

def serialize: Expr[Const[Json, ?]] = ???
def deserialize[F[_]: Expr](json: Json): F[A] forSome { type A } = ???
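The two finally tagless slides can be put together into one runnable sketch: one `Dsl` value, two interpreters. The evaluating interpreter follows the slide. The `Rendered` alias and the `render` interpreter are assumptions added here. They play the role the slide gives to `Const[Json, ?]`, but use a plain `String` so that no JSON library or compiler plugin is needed.

```scala
object TaglessSketch {
  trait Expr[F[_]] {
    def int(v: Int): F[Int]
    def str(v: String): F[String]
    def add(l: F[Int], r: F[Int]): F[Int]
    def concat(l: F[String], r: F[String]): F[String]
  }

  // A program is anything that can be run against every interpreter F.
  trait Dsl[A] {
    def apply[F[_]](implicit F: Expr[F]): F[A]
  }

  def int(v: Int): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
  }
  def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
  }

  // Interpreter 1: direct evaluation.
  type Id[A] = A
  val evaluate: Expr[Id] = new Expr[Id] {
    def int(v: Int): Id[Int] = v
    def str(v: String): Id[String] = v
    def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
    def concat(l: Id[String], r: Id[String]): Id[String] = l + r
  }

  // Interpreter 2 (hypothetical): ignore the phantom type and build text.
  type Rendered[A] = String
  val render: Expr[Rendered] = new Expr[Rendered] {
    def int(v: Int): Rendered[Int] = v.toString
    def str(v: String): Rendered[String] = "\"" + v + "\""
    def add(l: Rendered[Int], r: Rendered[Int]): Rendered[Int] = s"($l + $r)"
    def concat(l: Rendered[String], r: Rendered[String]): Rendered[String] = s"($l ++ $r)"
  }

  val program: Dsl[Int] = add(add(int(1), int(2)), int(3))
}
```

The type arguments are passed explicitly (`program.apply[Id](evaluate)`, `program.apply[Rendered](render)`) so the sketch does not depend on how the compiler infers a type alias as `F`.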
![Page 17: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/17.jpg)
Quark 101
The Building Blocks
— Type. Represents a reified type of an element in a dataset.
— **Dataset[A]**. Represents a dataset, produced by successive application of set-level operations (SetOps). Describes a directed-acyclic graph.
— **MappingFunc[A, B]**. Represents a function from A to B that is produced by successive application of mapping-level operations (MapOps) to the input.
— **ReduceFunc[A, B]**. Represents a reduction from A to B, produced by application of reduction-level operations (ReduceOps) to the input.
![Page 18: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/18.jpg)
Let's Build Us a Mini-Quark!
![Page 19: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/19.jpg)
Mini-Quark
Type System
sealed trait Type
object Type {
  final case class Unknown() extends Type
  final case class Timestamp() extends Type
  final case class Date() extends Type
  final case class Time() extends Type
  final case class Interval() extends Type
  final case class Int() extends Type
  final case class Dec() extends Type
  final case class Str() extends Type
  final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
  final case class Arr[A <: Type](element: A) extends Type
  final case class Tuple2[A <: Type, B <: Type](_1: A, _2: B) extends Type
  final case class Bool() extends Type
  final case class Null() extends Type

  type UnknownMap = Map[Unknown, Unknown]
  val UnknownMap : UnknownMap = Map(Unknown(), Unknown())

  type UnknownArr = Arr[Unknown]
  val UnknownArr : UnknownArr = Arr(Unknown())

  type Record[A <: Type] = Map[Str, A]
  type UnknownRecord = Record[Unknown]
}
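A small sketch of how such reified types might be used, restricted to a subset of the constructors above. Each `Type` exists both as a Scala type (tracked by the compiler) and as a value (inspectable at runtime). The `describe` function and the `scores` value are hypothetical and added only for illustration.

```scala
object TypeSketch {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Int() extends Type
    final case class Str() extends Type
    final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
    final case class Arr[A <: Type](element: A) extends Type

    type Record[A <: Type] = Map[Str, A]
  }

  // Hypothetical helper: walk the value-level representation.
  def describe(t: Type): String = t match {
    case Type.Unknown() => "Unknown"
    case Type.Int()     => "Int"
    case Type.Str()     => "Str"
    case Type.Map(k, v) => s"Map[${describe(k)}, ${describe(v)}]"
    case Type.Arr(e)    => s"Arr[${describe(e)}]"
  }

  // A record whose fields are arrays of integers: the static type and
  // the runtime value carry the same shape.
  val scores: Type.Record[Type.Arr[Type.Int]] =
    Type.Map(Type.Str(), Type.Arr(Type.Int()))
}
```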
![Page 20: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/20.jpg)
Mini-Quark
Set-Level Operations
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
}
![Page 21: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/21.jpg)
Mini-Quark
Dataset
sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
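A `Dataset` built this way does nothing until it is handed a `SetOps` instance. The sketch below shows two possible interpreters for the same value. `Unknown` is declared locally as a stand-in for `Type.Unknown`, the traits are left unsealed so instances can be written anywhere, and `Plan`, `Inputs`, `asPlan` and `asInputs` are hypothetical names introduced here.

```scala
object ReadSketch {
  sealed trait Unknown // stand-in for Type.Unknown from the type-system slide

  trait SetOps[F[_]] {
    def read(path: String): F[Unknown]
  }

  trait Dataset[A] {
    def apply[F[_]](implicit F: SetOps[F]): F[A]
  }
  object Dataset {
    def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
    }
  }

  // Interpreter 1 (hypothetical): render the dataset as a textual plan.
  type Plan[A] = String
  val asPlan: SetOps[Plan] = new SetOps[Plan] {
    def read(path: String): Plan[Unknown] = s"read($path)"
  }

  // Interpreter 2 (hypothetical): collect the paths the dataset reads from.
  type Inputs[A] = List[String]
  val asInputs: SetOps[Inputs] = new SetOps[Inputs] {
    def read(path: String): Inputs[Unknown] = List(path)
  }

  val profiles: Dataset[Unknown] = Dataset.read("/prod/profiles")
}
```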
![Page 22: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/22.jpg)
Mini-Quark
Mapping
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: ???) // What goes here?
}
![Page 23: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/23.jpg)
Mini-Quark
Mapping: Attempt #1
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: F[A] => F[B]) // Doesn't really work...
}
![Page 24: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/24.jpg)
Mini-Quark
Mapping: Attempt #2
sealed trait MappingFunc[A, B] {
  def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
}
trait MappingOps[F[_]] {
  def str(v: String): F[Type.Str]

  def project[K <: Type, V <: Type](v: F[Type.Map[K, V]], k: F[K]): F[V]

  def add(l: F[Type.Int], r: F[Type.Int]): F[Type.Int]

  def length[A <: Type](v: F[Type.Arr[A]]): F[Type.Int]

  ...
}
object MappingFunc {
  // The identity mapping: returns its input unchanged.
  def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
  }
}
![Page 25: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/25.jpg)
Mini-Quark
Mapping: Attempt #2
trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B] // Yay!!!
}
![Page 26: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/26.jpg)
Mini-Quark
Dataset: Mapping
sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: ???): Dataset[B] = ??? // What goes here???
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
![Page 27: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/27.jpg)
Mini-Quark
Dataset: Mapping Attempt #1
sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f)
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

// dataset.map(_.length)                            // Cannot ever work!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Cannot ever work!
![Page 28: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/28.jpg)
Mini-Quark
Dataset: Mapping Attempt #2
sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f(MappingFunc.id[A]))
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

// dataset.map(_.length)                            // Works with right methods on MappingFunc!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Works with right methods on MappingFunc!
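The pieces so far can be assembled into a compact end-to-end sketch showing why Attempt #2 makes `dataset.map(_.length)` possible: the lambda receives `MappingFunc.id[A]`, and an extension method supplies `length` on it. Several things here are assumptions and not Quark's API. `Dataset.readAs` is a hypothetical typed reader standing in for the `.typed[...]` cast seen earlier, `ArrSyntax` is one possible way to provide the "right methods on MappingFunc", and `Plan` is a hypothetical interpreter that renders a string in place of a real backend plan.

```scala
object MiniQuark {
  sealed trait Type
  object Type {
    final case class Int() extends Type
    final case class Arr[A <: Type](element: A) extends Type
  }

  // Mapping-level operations; only `length` is kept in this sketch.
  trait MappingOps[F[_]] {
    def length[A <: Type](v: F[Type.Arr[A]]): F[Type.Int]
  }

  trait MappingFunc[A, B] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
  }
  object MappingFunc {
    def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
    }
    // Hypothetical syntax: `length` exists only when the output is an array.
    implicit class ArrSyntax[A, E <: Type](val mf: MappingFunc[A, Type.Arr[E]]) {
      def length: MappingFunc[A, Type.Int] = new MappingFunc[A, Type.Int] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Int] =
          F.length[E](mf.apply[F](v))
      }
    }
  }

  trait SetOps[F[_]] {
    def read[A](path: String): F[A]
    def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B]
  }

  trait Dataset[A] { self =>
    def apply[F[_]](implicit F: SetOps[F]): F[A]
    // The caller's lambda extends the identity mapping it is handed.
    def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
      def apply[F[_]](implicit F: SetOps[F]): F[B] =
        F.map(self.apply[F], f(MappingFunc.id[A]))
    }
  }
  object Dataset {
    // Hypothetical typed reader, assumed for this sketch only.
    def readAs[A](path: String): Dataset[A] = new Dataset[A] {
      def apply[F[_]](implicit F: SetOps[F]): F[A] = F.read[A](path)
    }
  }

  // Hypothetical interpreter pair: render both levels as text.
  type Plan[A] = String
  val planMapping: MappingOps[Plan] = new MappingOps[Plan] {
    def length[A <: Type](v: Plan[Type.Arr[A]]): Plan[Type.Int] = s"length($v)"
  }
  val planSet: SetOps[Plan] = new SetOps[Plan] {
    def read[A](path: String): Plan[A] = s"read($path)"
    def map[A, B](v: Plan[A], f: MappingFunc[A, B]): Plan[B] = {
      val body = f.apply[Plan]("row")(planMapping)
      s"map($v, row => $body)"
    }
  }

  val lengths: Dataset[Type.Int] =
    Dataset.readAs[Type.Arr[Type.Int]]("/data/arrays").map(_.length)
  val plan: String = lengths.apply[Plan](planSet)
}
```

Calling `length` on a dataset whose element type is not `Type.Arr[...]` would fail to compile, because no matching `ArrSyntax` conversion exists.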
![Page 29: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/29.jpg)
Mini-Quark
Dataset: Mapping Binary Operators
val netProfit = dataset.map(v => v.netRevenue[Dec] - v.netCosts[Dec])
![Page 30: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/30.jpg)
Mini-Quark
MappingFuncs Are Arrows!
trait MappingFunc[A <: Type, B <: Type] extends Dynamic { self =>
  import MappingFunc.Case

  def apply[F[_]: MappingOps](v: F[A]): F[B]

  def >>> [C <: Type](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
    def apply[F[_]: MappingOps](v: F[A]): F[C] = that.apply[F](self.apply[F](v))
  }

  def + (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].add(self(v), that(v))
  }

  def - (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].subtract(self(v), that(v))
  }
  ...
}
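The `>>>` combinator can be tried in isolation with a stripped-down sketch. It uses plain Scala types in place of `Type`, and two toy operations, `inc` and `show`, which are hypothetical and not part of Quark's `MappingOps`. The same composed pipeline is run directly and rendered as text.

```scala
object ArrowSketch {
  // Toy operation set, assumed for this illustration.
  trait Ops[F[_]] {
    def inc(v: F[Int]): F[Int]
    def show(v: F[Int]): F[String]
  }

  trait Func[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: Ops[F]): F[B]

    // Sequential composition: feed this function's output into `that`.
    def >>>[C](that: Func[B, C]): Func[A, C] = new Func[A, C] {
      def apply[F[_]](v: F[A])(implicit F: Ops[F]): F[C] =
        that.apply[F](self.apply[F](v))
    }
  }

  val inc: Func[Int, Int] = new Func[Int, Int] {
    def apply[F[_]](v: F[Int])(implicit F: Ops[F]): F[Int] = F.inc(v)
  }
  val show: Func[Int, String] = new Func[Int, String] {
    def apply[F[_]](v: F[Int])(implicit F: Ops[F]): F[String] = F.show(v)
  }

  // Interpreter 1: run the pipeline on real values.
  type Id[A] = A
  val direct: Ops[Id] = new Ops[Id] {
    def inc(v: Id[Int]): Id[Int] = v + 1
    def show(v: Id[Int]): Id[String] = v.toString
  }

  // Interpreter 2: print the pipeline without running it.
  type Plan[A] = String
  val printed: Ops[Plan] = new Ops[Plan] {
    def inc(v: Plan[Int]): Plan[Int] = s"inc($v)"
    def show(v: Plan[Int]): Plan[String] = s"show($v)"
  }

  val pipeline: Func[Int, String] = inc >>> inc >>> show
}
```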
![Page 31: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/31.jpg)
Mini-Quark
Applicative Composition
                  MappingFunc[A, B]
           A -----------------------------B
            \                            /
             \                          /
              \                        /
               \                      /
                 MappingFunc[A, B ⊕ C]
                 \                  /
MappingFunc[A, C] \                /
                   \              /
                          C
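One possible reading of the diagram is a fan-out combinator: two functions from the same input `A` are merged into one function that yields both results. The sketch below takes that reading and models `B ⊕ C` as a Scala tuple. The `&&&` name, the `pair` operation, and the toy `double` and `negate` operations are all assumptions made here and are not taken from Quark.

```scala
object FanOutSketch {
  // Toy operation set, assumed for this illustration.
  trait Ops[F[_]] {
    def double(v: F[Int]): F[Int]
    def negate(v: F[Int]): F[Int]
    def pair[A, B](l: F[A], r: F[B]): F[(A, B)]
  }

  trait Func[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: Ops[F]): F[B]

    // Apply both functions to the same input and combine the outputs.
    def &&&[C](that: Func[A, C]): Func[A, (B, C)] = new Func[A, (B, C)] {
      def apply[F[_]](v: F[A])(implicit F: Ops[F]): F[(B, C)] =
        F.pair[B, C](self.apply[F](v), that.apply[F](v))
    }
  }

  val double: Func[Int, Int] = new Func[Int, Int] {
    def apply[F[_]](v: F[Int])(implicit F: Ops[F]): F[Int] = F.double(v)
  }
  val negate: Func[Int, Int] = new Func[Int, Int] {
    def apply[F[_]](v: F[Int])(implicit F: Ops[F]): F[Int] = F.negate(v)
  }

  type Id[A] = A
  val direct: Ops[Id] = new Ops[Id] {
    def double(v: Id[Int]): Id[Int] = v * 2
    def negate(v: Id[Int]): Id[Int] = -v
    def pair[A, B](l: Id[A], r: Id[B]): Id[(A, B)] = (l, r)
  }

  val both: Func[Int, (Int, Int)] = double &&& negate
}
```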
![Page 32: Quark: A Purely-Functional Scala DSL for Data Processing & Analytics](https://reader030.fdocuments.us/reader030/viewer/2022021420/58f2d68c1a28ab77078b457d/html5/thumbnails/32.jpg)
Learn More
— Finally Tagless: http://okmij.org/ftp/tagless-final/
— Quark: https://github.com/quasar-analytics/quark
— Quasar: https://github.com/quasar-analytics/quasar
THANK YOU
@jdegoes - http://degoes.net