Working with the Scalding Type-Safe API

Copyright © 2014 Criteo


Without tearing your hair out (too much)

Justin Coffey, Sr Staff Devlead, Criteo

Sofian Djamaa, Sr Engineer, Criteo

2016-04-14


The Scalding Gods hate you


Don’t worry, they hate us too

Hopefully, this presentation will help you understand their whims a bit better


A Few Words on Scalding

• Scalding is a framework to write Map Reduce jobs in a more functional way

• It’s written in Scala

• And uses the Java Map Reduce framework, Cascading

• It has an untyped “fields” API and a newer typed API, which is what we’ll be talking about



The Promise


TypedPipe.from(TextLine(args("input")))

.flatMap { line => line.split(",") }

.groupBy { word => word }

.size

.write(TypedTsv[(String, Long)](args("output")))


Our Reality


TypedPipe.from(SomeSource(args("some-place")))

.map { TargetType.fromClassWith23Fields }

.groupBy { _.lotsOfDimensions }

.sumByKey

.write(TypedSink[TargetType](args("another-place")))


Our Reality, continued


java.lang.AssertionError: assertion failed: Arity of (class

com.twitter.scalding.LowPriorityConversions$SingleSetter$) is 1,

which doesn't match: {your class with 23 fields}


3 hours later, you ask yourself…

Why, oh why, is there a default TupleSetter that doesn’t actually work for anything other than Products?



What we’re going to talk about

We will humbly* present our work on dealing with the pitfalls of using the Type-Safe Scalding API in an existing Hadoop environment.

*and I mean it. I expect we’ve done a fair number of things poorly!



Your humble hosts for the next 40 minutes

• Justin Coffey: Devlead for the Scalability Analytics team at Criteo, responsible for all analytic systems, data and products.

• Sofian Djamaa: Senior engineer in Scalability Analytics at Criteo, works (hard) on batch and stream processing.



Criteo Scale

• 2 Hadoop clusters ~2000 nodes total

• 28K cores, 300TB RAM, Many Many PBs

• 50 billion events logged per day

• 25TB data ingested per day

• 7 weeks of vacation per year ;)



Some Context

Before we dig into the nitty-gritty, I think some context is in order.



A Use Case

• Build a 20TB time-series dataset for fast operational reporting

• Source data are application logs in Hadoop

• Store data in Vertica (a scale-out columnar DB)



The Scalding Part

• Scalding handles all data transformations

• Including simple aggregations of data

• Keeps logic in one place, scales out easily

• Reduces load on Vertica



A Note on Scheduling

• We use our very own NIH scheduler, Langoustine

• It uses a Scala DSL for code-as-config to describe a job DAG

• It is an opinionated scheduler and expects your jobs to be idempotent

• Runs inside a Finatra app (http://twitter.github.io/finatra/)



Langoustine Quick UI Tour



Red is bad.


Langoustine DSL

object HelloWorldWorkflow {

val hello = Job(

name = "hello",

calculationPeriod = 1 minute,

primaryTask = { (c: TaskContext) => EchoTask(c, "hello, ") }

).toGraph

val world = Job(

name = "world",

calculationPeriod = 1 minute,

primaryTask = { (c: TaskContext) => EchoTask(c, "world!") }

).toGraph

val jobs = world dependsOn hello

}


Langoustine App

object HelloWorldApp extends DefaultLangoustineLauncher {

override lazy val run = execute(HelloWorldWorkflow.jobs)

}


File Formats at Criteo

• Our workflow consumes Parquet, Pail/JSON and Text-Delimited data



Finding the Data

• Consuming data requires knowing where it is

• While we do have Hive, we don’t use HCatalog outside of it

• Each job has to roll its own data location service



The Type-Safe API

It compiles, it works! (hardy har har)



23+ Field Logs

With 23+ fields on Scala 2.10, where case classes are capped at 22 fields, you have to write your own TupleSetters and TupleConverters



JobTest and Arity

JobTest with non-Products means Arity exceptions.

JobTest just tests your logic, not the nasty stuff at the edges.



Fine, then

Where’s the beef?



For a given type, we need

• A TupleSetter and TupleConverter

• A Source and a Sink

• A partition finder/data locality service



ScaldingType[T, K]


trait ScaldingType[T, K] {

implicit def converter: TupleConverter[T]

implicit def setter: TupleSetter[T]

def fields: Fields

def partitions(key: K): Seq[String]

def source(partitionKey: K): Source

def sink(partitionKey: K): Source

}


TupleConverter and TupleSetter


new TupleConverter[SomeType] {

def apply(te: TupleEntry): SomeType =

new SomeType(te.getInt(0), …)

def arity: Int = ???

}

new TupleSetter[SomeType] {

def apply(arg: SomeType): Tuple = {

val t = Tuple.size(arity)

t.set(0, arg.someInt)

…

t

}

def arity: Int = ???

}
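
To make the arity business concrete, here is a minimal sketch in plain Scala. It does not use the Cascading or Scalding classes: a "tuple" is modelled as a Vector[Any], and Setter, Converter, Event and AritySketch are hypothetical names invented for this illustration. It only shows the idea that the setter and converter must agree on the number of fields.

```scala
// Plain-Scala stand-ins (assumption): NOT the Cascading Tuple / Scalding TupleSetter API.
object AritySketch {
  // A non-Product class, like the log types discussed in the talk.
  final class Event(val timestamp: Long, val name: String, val events: Long)

  trait Setter[T]    { def arity: Int; def apply(t: T): Vector[Any] }
  trait Converter[T] { def arity: Int; def apply(fields: Vector[Any]): T }

  // Writes one Event out as three positional fields.
  val eventSetter: Setter[Event] = new Setter[Event] {
    val arity = 3
    def apply(e: Event): Vector[Any] = Vector(e.timestamp, e.name, e.events)
  }

  // Reads three positional fields back into an Event, failing on an arity mismatch.
  val eventConverter: Converter[Event] = new Converter[Event] {
    val arity = 3
    def apply(fields: Vector[Any]): Event = {
      require(fields.size == arity,
        s"Arity of converter is $arity, which doesn't match: ${fields.size}")
      new Event(
        fields(0).asInstanceOf[Long],
        fields(1).asInstanceOf[String],
        fields(2).asInstanceOf[Long])
    }
  }

  def roundTrip(e: Event): Event = eventConverter(eventSetter(e))
}
```

Feeding the converter a one-field vector fails the require check, which is roughly the situation the AssertionError earlier in the deck is complaining about.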


Fun, right?

Remind anyone of working with JDBC?



Macros to the rescue!


import com.criteo.scalding.utils._

val setter = SchemaUtils.scaldingTupleSetterFor[SomeType]

val converter = SchemaUtils.scaldingTupleConverterFor[SomeType]

val fields = SchemaUtils.scaldingFieldsFor[SomeType]


And what about Sources?


trait TsvSources[T, K] { self: ScaldingType[T, K] =>

override def sink(partitionKey: K): Source =

Tsv(partitions(partitionKey).head)

override def source(partitionKey: K): Source =

Tsv(partitions(partitionKey).head, fields)

}


Ty(p)ing it all together


class Events(

val timestamp: DateTime = new DateTime(0),

val name: String = "",

val events: Long = 0

)

class TimeKey(root: String, time: DateTime)

object Events extends ScaldingType[Events, TimeKey]

with TsvSources[Events, TimeKey] {

implicit val setter = SchemaUtils.scaldingTupleSetterFor[Events]

implicit val converter = SchemaUtils.scaldingTupleConverterFor[Events]

val fields = SchemaUtils.scaldingFieldsFor[Events]

// now, we just have to implement our partition building function!

def partitions(key: TimeKey): Seq[String] = ???

}


Let’s write a Job!

We’ll take our Events type and turn it into an EventsSum type, counting the number of events per timestamp as we go.

case class EventsSum(time: DateTime, events: Long)



EventsSumJob


class EventsSumJob(args: Args) extends Job(args)

with TimeJobArgs {

TypedPipe.from[Events](

Events.source(typedArgs).read,

Events.fields

).map(EventsSum.fromEvents)

.sumByKey

.values

.write(

TypedSink[EventsSum](

EventsSum.sink(typedArgs)

)

)

}




Wait, what?

• the TimeJobArgs trait maps Scalding’s Args object to our own type, available via typedArgs

• fromEvents is just a simple mapping function, Events => EventsSum

• sumByKey is a Scalding function that performs an aggregation and requires a Semigroup and an Ordering



A Semi-what???

A semigroup is a (mathematical) set with an associative binary operation.

Put another way, it is a Monoid without a zero value.



Don’t Freak Out

In Scalding land, a Semigroup[T] is just a thing that describes how two T’s should be added together.
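
A minimal sketch of that idea in plain Scala, assuming a hand-rolled Semigroup trait as a stand-in for the real type class and epoch millis instead of a Joda DateTime. SemigroupSketch, timeMillis and sumByTime are hypothetical names for this illustration. It also shows what a sum-by-key boils down to: group rows by key, then fold each group with plus.

```scala
object SemigroupSketch {
  // Stand-in for a Semigroup type class (assumption): only `plus` is required.
  trait Semigroup[T] { def plus(l: T, r: T): T }

  // Simplified EventsSum: epoch millis instead of a Joda DateTime.
  final case class EventsSum(timeMillis: Long, events: Long)

  implicit val eventsSumSemigroup: Semigroup[EventsSum] = new Semigroup[EventsSum] {
    def plus(l: EventsSum, r: EventsSum): EventsSum = {
      require(l.timeMillis == r.timeMillis, "l and r times must match!")
      l.copy(events = l.events + r.events)
    }
  }

  // Group rows by their time key, then combine each group pairwise with `plus`.
  def sumByTime(rows: Seq[EventsSum])(implicit sg: Semigroup[EventsSum]): Map[Long, EventsSum] =
    rows.groupBy(_.timeMillis).map { case (k, vs) => k -> vs.reduce(sg.plus) }
}
```

Because plus is associative, the grouping order does not matter, which is what lets a framework combine partial results on mappers and reducers in any order.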



Semigroup[EventsSum]


case class EventsSum(

val time: DateTime = new DateTime(0),

val events: Long = 0

)

object EventsSum extends ScaldingType[…]

with TsvSources[…] {

…

implicit val ordering: Ordering[EventsSum] =

Ordering.by(_.time.getMillis)

implicit val semiGroup = new Semigroup[EventsSum] {

override def plus(

l: EventsSum,

r: EventsSum) = {

require(l.time == r.time, "l and r times must match!")

l.copy(events = l.events + r.events)

}

}

}


And now add that map function


case class EventsSum(

val time: DateTime = new DateTime(0),

val events: Long = 0

)

object EventsSum extends ScaldingType[…]

with TsvSources[…] {

…

implicit val ordering: Ordering[EventsSum] =

Ordering.by(_.time.getMillis)

implicit val semiGroup = new Semigroup[EventsSum] {

override def plus(

l: EventsSum,

r: EventsSum) = {

require(l.time == r.time, "l and r times must match!")

l.copy(events = l.events + r.events)

}

}

def fromEvents(src: Events): EventsSum =

EventsSum(src.timestamp, src.events)

}


Progress so far…

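class EventsSumJob(args: Args) extends Job(args)
  with TimeJobArgs {
  TypedPipe.from[Events](
    Events.source(typedArgs).read,
    Events.fields
  ).map(EventsSum.fromEvents) // <= done!
    .sumByKey // <= done!
    .values
    .write(
      TypedSink[EventsSum](
        EventsSum.sink(typedArgs)
      )
    )
}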


Let’s parse the Args!


trait TypedArgsParser[T] {

def args2TypedArgs(args: Args): T

def args2TypedArgs(args: Array[String]): T =

args2TypedArgs(Args(args))

}

case class TimeArgs(root: String, time: DateTime)

trait TimeArgsParser extends TypedArgsParser[TimeArgs] {

override def args2TypedArgs(args: Args): TimeArgs =

TimeArgs(

root = args.required("root"),

time = new DateTime(args.required("time"), DateTimeZone.UTC)

)

}
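
For readers without Scalding at hand, the same pattern can be sketched in plain Scala. This is only an illustration under assumptions: a Map[String, String] stands in for Scalding's Args, java.time.Instant stands in for Joda's DateTime, and TypedArgsSketch, parse, required and toTimeArgs are hypothetical names.

```scala
import java.time.Instant

object TypedArgsSketch {
  // Stand-in for Scalding's Args (assumption): a Map built from "--key value" pairs.
  def parse(argv: Array[String]): Map[String, String] =
    argv.grouped(2).collect {
      case Array(k, v) if k.startsWith("--") => k.drop(2) -> v
    }.toMap

  final case class TimeArgs(root: String, time: Instant)

  // Fails loudly when a mandatory argument is missing.
  def required(m: Map[String, String], key: String): String =
    m.getOrElse(key, throw new IllegalArgumentException(s"Please provide a value for --$key"))

  // Raw command-line arguments in, one typed value out.
  def toTimeArgs(argv: Array[String]): TimeArgs = {
    val m = parse(argv)
    TimeArgs(root = required(m, "root"), time = Instant.parse(required(m, "time")))
  }
}
```

The point of the design is that every job sharing the same argument shape goes through one parser, so a bad or missing argument fails in the same place for all of them.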


Now, expose typedArgs to the Job


trait TypedJobArgs[T] extends Job with TypedArgsParser[T] {

def typedArgs: T = args2TypedArgs(args)

}

trait TimeJobArgs extends TypedJobArgs[TimeArgs]

with TimeArgsParser


Progress so far…

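class EventsSumJob(args: Args) extends Job(args)
  with TimeJobArgs { // <= done!
  TypedPipe.from[Events](
    Events.source(typedArgs).read, // <= done!
    Events.fields
  ).map(EventsSum.fromEvents) // <= done!
    .sumByKey // <= done!
    .values
    .write(
      TypedSink[EventsSum](
        EventsSum.sink(typedArgs) // <= done!
      )
    )
}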


Partitions

Partitions are just folders that contain a batch of data to process.

The partition function signature for our two types is the same: TimeKey => Seq[String]

Note also that TimeArgs is equivalent to the TimeKey we defined earlier.



TimeArgs as Partition Key


object DateUtils {

val f = DateTimeFormat.forPattern(

"yyyy-MM-dd-HH"

).withZoneUTC()

def dayHour(time: DateTime) = f.print(time)

}

object EventsSum extends ScaldingType[EventsSum, TimeArgs]

with TsvSources[EventsSum, TimeArgs] {

…

def partitions(k: TimeArgs) = Seq(

s"${k.root}/events_sum/${DateUtils.dayHour(k.time)}"

)

}
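
The partition function above can be tried without Joda-Time or Scalding. The sketch below is a stand-alone version that assumes java.time in place of Joda-Time; PartitionSketch and its TimeArgs are hypothetical stand-ins for the types on the slide.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object PartitionSketch {
  // Stand-in for the TimeArgs partition key (assumption: Instant instead of DateTime).
  final case class TimeArgs(root: String, time: Instant)

  // Same "yyyy-MM-dd-HH" layout as DateUtils, always rendered in UTC.
  private val dayHour =
    DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC)

  // One folder per hour of data under <root>/events_sum/.
  def partitions(k: TimeArgs): Seq[String] =
    Seq(s"${k.root}/events_sum/${dayHour.format(k.time)}")
}
```

Returning a Seq leaves room for keys that cover several folders, even though this key maps to exactly one.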


Looking back at our Job…

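class EventsSumJob(args: Args) extends Job(args)
  with TimeJobArgs {
  TypedPipe.from[Events]( // <= implicit converter
    Events.source(typedArgs).read,
    Events.fields
  ).map(EventsSum.fromEvents)
    .sumByKey // <= semigroup and ordering
    .values
    .write(
      TypedSink[EventsSum]( // <= implicit setter
        EventsSum.sink(typedArgs)
      )
    )
}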


Idempotence


In computer science, the term idempotent is used more comprehensively to describe an operation that will produce the same results if executed once or multiple times.

ref: https://en.wikipedia.org/wiki/Idempotence#Computer_science_meaning
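
As a rough illustration of why this matters for replays, here is a sketch in plain Scala where an in-memory map stands in for the file system. IdempotenceSketch, appendRun and cleanThenWriteRun are hypothetical names, and this is not how Langoustine or IdempotentJob is implemented; it only contrasts an appending run with a clean-then-write run.

```scala
import scala.collection.mutable

object IdempotenceSketch {
  // In-memory stand-in for a file system (assumption): partition path -> rows.
  val store: mutable.Map[String, Vector[String]] = mutable.Map.empty

  // Not idempotent: every run appends, so a replay duplicates the data.
  def appendRun(partition: String, rows: Vector[String]): Unit =
    store(partition) = store.getOrElse(partition, Vector.empty) ++ rows

  // Idempotent: clean the target partition first, then write the full result.
  def cleanThenWriteRun(partition: String, rows: Vector[String]): Unit = {
    store.remove(partition)
    store(partition) = rows
  }
}
```

Running cleanThenWriteRun twice leaves the partition in the same state as running it once, which is the property a scheduler needs before it can safely retry or replay a job.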


Making the Job Idempotent


object EventsSumJob extends IdempotentJob[EventsSum]

with IntervalArgsParser {

override def jobClass = classOf[EventsSumJob]

override def partitionsToClean(args: Array[String]) =

EventsSum.partitions(TimeArgs(args2TypedArgs(args)))

}


The job with all of that boilerplate



class EventsSumJob(args: Args) extends Job(args)

with TimeJobArgs {

implicit val eventsSumTupleSetter: TupleSetter[EventsSum] = new TupleSetter[EventsSum] {

def apply(arg: EventsSum): Tuple = {

val t = Tuple.size(arity)

t.set(…)

…

t

}

def arity: Int = 2

}

implicit val eventsTupleConverter: TupleConverter[Events] = new TupleConverter[Events] {

def apply(te: TupleEntry): Events =

new Events(…)

def arity: Int = 3

}

val eventsFields = new Fields("time", "name", "events")

val timeArgs = TimeArgs(

root = args.required("root"),

time = new DateTime(args.required("time"), DateTimeZone.UTC)

)

implicit val eventsSumSemiGroup: Semigroup[EventsSum] = new Semigroup[EventsSum] {

override def plus(

l: EventsSum,

r: EventsSum) = {



}

}

implicit val eventsSumOrdering: Ordering[EventsSum] = Ordering.by(_.time.getMillis)

val events2EventsSum: Events => EventsSum = { e =>

EventsSum(…)

}

TypedPipe.from[Events]( // <= implicit converter
  Events.source(typedArgs).read,
  Events.fields
).map(EventsSum.fromEvents)
  .sumByKey // <= semigroup and ordering
  .values
  .write(
    TypedSink[EventsSum]( // <= implicit setter
      EventsSum.sink(typedArgs)
    )
  )

}


Taking it all in

• The type-safe API pushes the unsafe parts to the edges (converters and setters)

• ScaldingType[T, K] formalizes this along with data location, allowing for less boilerplate in your jobs

• TypedArgs permit stable interfaces across many jobs, with copious code sharing

• IdempotentJob[T] makes scheduling and replay of your job safe

• In the end your jobs are just logic, with very little boilerplate!



Execution Optimization

If you’re still with us, here’s a bit more detail on run-time optimizations



A more complicated Job

TypedPipe.from[Events](
  Events.source(typedArgs).read,
  Events.fields
).map(EventsSum.fromEvents)
  .groupBy( events => events.type )
  .sum(EventsSum.aggregator)
  .values
  // … dimensions (enrich events)
  .groupBy( e => e.time )
  .join(WeatherForecastSource.groupBy( w => w.time ))
  .values
  .map(WeatherEvents.fromEventsWithWeather)


Generated execution

How many M/R jobs will be generated?

Reminder: Scalding code is translated to (at runtime) Cascading pipes. Those pipes are evaluated through an execution plan and produce MapReduce jobs (still at runtime).



4 jobs?

Events.source(typedArgs).read,
Events.fields
…
.sum(EventsSum.aggregator)
.values
…
.values

Slide callouts: "triggers a reduce", "other source to join with", "merge of 2 sources"


3 jobs!

Events.source(typedArgs).read,
Events.fields
).map(EventsSum.fromEvents)
…
.sum(EventsSum.aggregator)
.values
…
.values

Slide callouts: "in the same job as", "scalding reorders operations"



Down to 2 jobs!

Events.source(typedArgs).read,
Events.fields
…
.values
.map(WeatherEvents.fromEventsWithWeather)
.sumByLocalKeys(EventsSum.aggregator)
.values

Slide callouts: "join done on the second job", "first job"

sumByLocalKeys: map-side computation


Only one job?

Events.source(typedArgs).read,
Events.fields
).map(EventsSum.fromEvents)
.hashJoin(WeatherForecastSource.source()) // source retrieved in memory and joined in each mapper
.values
…
.sumByLocalKeys(EventsSum.aggregator)
.values

This generates one job with mappers only, so the result in HDFS is wrong (until another job aggregates all the files).

hashJoin implicitly indicates that the data is fully copied to all mappers, so the planner sees no need for a consolidation phase and therefore schedules no reduce.
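
A small sketch in plain Scala of why the mappers-only output is "wrong". Each inner Seq plays the role of one mapper's input split; MapOnlySketch and its functions are hypothetical names, and no Scalding code is involved. Map-side sums only see local data, so the same key can appear in several output files until something merges them.

```scala
object MapOnlySketch {
  type Row = (String, Long) // (key, count)

  // Map-side aggregation: each mapper only sums the keys it has seen locally.
  def sumLocally(split: Seq[Row]): Map[String, Long] =
    split.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2).sum }

  // Output of a mappers-only job: one partial result per mapper, keys may repeat.
  def mapOnlyOutput(splits: Seq[Seq[Row]]): Seq[Row] =
    splits.flatMap(s => sumLocally(s).toSeq)

  // The missing consolidation step: merge the partial sums per key.
  def consolidate(partials: Seq[Row]): Map[String, Long] = sumLocally(partials)
}
```

With two splits that both contain key "a", mapOnlyOutput yields two rows for "a"; only consolidate, the equivalent of the skipped reduce, brings them back to one.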


Performance hints

Fewer jobs mean fewer scheduling issues and better latency for end-to-end workflows.

Extreme optimization might lead to data inconsistency.



Performance hints: mappers


Mappers do not all take the same amount of time when the data is not distributed evenly across them.

Reminder: data is stored in blocks in HDFS, and each mapper runs on one block of data. If a file fits in less than a block, its mapper will not pick up other data (unless you use a file combiner…) and will finish earlier than the others.

shard(xx)

Forces a reduce phase that generates xx intermediate files instead of a few big ones, so the mappers of the next step each handle less data. Use it to redistribute data over mappers in intermediate steps.
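
The arithmetic behind this can be sketched in a few lines of plain Scala. The sketch assumes one mapper per block and a simple hash partitioning; SplitSketch, mappersFor and shard here are hypothetical helpers for illustration, not Hadoop's or Scalding's actual split and shard logic.

```scala
object SplitSketch {
  // Number of mappers for one file, assuming one mapper per (possibly partial) block.
  def mappersFor(fileBytes: Long, blockBytes: Long): Long =
    math.max(1L, (fileBytes + blockBytes - 1) / blockBytes)

  // Hash-partition records into n buckets, roughly what a forced reduce with n reducers does.
  def shard[T](records: Seq[T], n: Int): Map[Int, Seq[T]] =
    records.groupBy(r => java.lang.Math.floorMod(r.hashCode, n))
}
```

With an assumed 128 MB block size, a 300 MB file gets three mappers while a 1 MB file still gets a whole mapper to itself, which is where the uneven run times come from. Sharding trades an extra reduce for evenly sized inputs in the next step.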



Performance hints: reducers

By default Scalding uses only one reducer. If your data can be partitioned and doesn’t fit in memory (lots of spilling), it’s better to have several reducers.

Warning: the more reducers, the more files, the more mappers for the next step (pay attention to the balance between block size and number of mappers)



Performance hints: check the counters

JVM use (especially GC counts), spilled records more than 100% of the input size (output data doesn’t fit in memory), distribution of data between mappers, data format (e.g. Parquet)…



A final word

We’re hiring.



And we have a nice deck



Thank You!

Justin Coffey, @jqcoffey

Sofian Djamaa, @sdjamaa

http://labs.criteo.com/blog

Also, come see us in Paris on May 26th for a Criteo Sponsored Day with Criteo, Google and Spotify Engineers!
