Map, Flatmap and Reduce are Your New Best Friends: Simpler Collections, Concurrency, and Big Data (#oscon)

Map(), flatMap() and reduce() are your new best friends:

simpler collections, concurrency, and big data

Chris Richardson

Author of POJOs in Action

Founder of the original CloudFoundry.com

@crichardson [email protected]

http://plainoldobjects.com

@crichardson

Presentation goal

How functional programming simplifies your code

Show that map(), flatMap() and reduce()

are remarkably versatile functions

About Chris

Founder of a buzzword compliant (stealthy, social, mobile, big data, machine learning, ...) startup

Consultant helping organizations improve how they architect and deploy applications using cloud, micro services, polyglot applications, NoSQL, ...


Agenda

Why functional programming?

Simplifying collection processing

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming


Functional programming is a programming paradigm

Functions are the building blocks of the application

Best done in a functional programming language


Functions as first class citizens

Assign functions to variables

Store functions in fields

Use and write higher-order functions:

Pass functions as arguments

Return functions as values
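A minimal sketch of these ideas in Java 8, assuming only the JDK's java.util.function package. The class and method names (HigherOrderFunctions, adder, applyTwice) are made up for illustration.

```java
import java.util.function.Function;

final class HigherOrderFunctions {

    // Returns a function as a value: the result remembers the amount to add.
    static Function<Integer, Integer> adder(int amount) {
        return x -> x + amount;
    }

    // Takes a function as an argument and applies it twice.
    static <T> T applyTwice(Function<T, T> f, T value) {
        return f.apply(f.apply(value));
    }

    public static void main(String[] args) {
        // Assign a function to a variable
        Function<Integer, Integer> addFive = adder(5);
        System.out.println(applyTwice(addFive, 1)); // prints 11
    }
}
```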


Avoids mutable state

Use:

Immutable data structures

Single assignment variables

Some functional languages such as Haskell don’t allow side-effects



"the highest goal of programming-language design to enable good ideas to be elegantly expressed"

http://en.wikipedia.org/wiki/Tony_Hoare


Why functional programming?

More expressive

More intuitive - declarative code matches problem definition

Functional code is usually much more composable

Immutable state:

Less error-prone

Easy parallelization and concurrency

But be pragmatic


An ancient idea that has recently become popular


Mathematical foundation:

λ-calculus

Introduced by Alonzo Church in the 1930s

Lisp = an early functional language invented in 1958

http://en.wikipedia.org/wiki/Lisp_(programming_language)

[Timeline: 1940 to 2010]

garbage collection

dynamic typing

self-hosting compiler

tree data structures

(defun factorial (n)
  (if (<= n 1)
      1
      (* n (factorial (- n 1)))))

My final year project in 1985: Implementing SASL

sieve (p:xs) = p : sieve [x | x <- xs, rem x p > 0];

primes = sieve [2..]

A list of integers starting with 2

Filter out multiples of p

Mostly an Ivory Tower technology

Lisp was used for AI

FP languages: Miranda, ML, Haskell, ...

“Side-effects kills kittens and puppies”


http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html


But today FP is mainstream

Clojure - a dialect of Lisp

Scala - a hybrid OO/functional language

F# - a hybrid OO/FP language for .NET

Java 8 has lambda expressions

Java 8 lambda expressions are functions

x -> x * x

x -> {
    // Test divisors up to and including sqrt(x), otherwise perfect squares such as 9 pass as prime
    for (int i = 2; i <= Math.sqrt(x); i = i + 1) {
        if (x % i == 0) return false;
    }
    return true;
};

(x, y) -> x * x + y * y

An instance of an anonymous inner class that implements a functional interface (kinda)
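To make the "functional interface" point concrete, here is a sketch that assigns the three lambdas above to JDK functional interfaces, next to the roughly equivalent anonymous inner class. The choice of IntUnaryOperator, IntPredicate and IntBinaryOperator and all constant names are assumptions for illustration. The primality check also adds a guard for x < 2, which the slide does not have.

```java
import java.util.function.IntBinaryOperator;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

final class LambdaExamples {

    static final IntUnaryOperator SQUARE = x -> x * x;

    static final IntPredicate IS_PRIME = x -> {
        if (x < 2) return false; // guard added for this sketch
        for (int i = 2; i <= Math.sqrt(x); i = i + 1) {
            if (x % i == 0) return false;
        }
        return true;
    };

    static final IntBinaryOperator SUM_OF_SQUARES = (x, y) -> x * x + y * y;

    // Roughly what SQUARE looks like without lambda syntax
    static final IntUnaryOperator SQUARE_OLD_STYLE = new IntUnaryOperator() {
        @Override
        public int applyAsInt(int x) {
            return x * x;
        }
    };

    public static void main(String[] args) {
        System.out.println(SQUARE.applyAsInt(7));            // prints 49
        System.out.println(IS_PRIME.test(13));               // prints true
        System.out.println(SUM_OF_SQUARES.applyAsInt(3, 4)); // prints 25
    }
}
```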


Agenda

Simplifying collection processing

Lots of application code = collection processing:

Mapping, filtering, and reducing

Social network example

public class Person {

    enum Gender { MALE, FEMALE }

    private Name name;
    private LocalDate birthday;
    private Gender gender;
    private Hometown hometown;

    private Set<Friend> friends = new HashSet<Friend>();
    ....

public class Friend {

    private Person friend;
    private LocalDate becameFriends;
    ...
}

public class SocialNetwork {
    private Set<Person> people;
    ...

Typical iterative code - e.g. filtering

public class SocialNetwork {

    private Set<Person> people;

    ...

    public Set<Person> lonelyPeople() {
        Set<Person> result = new HashSet<Person>();   // Declare result variable
        for (Person p : people) {                     // Iterate
            if (p.getFriends().isEmpty())
                result.add(p);                        // Modify result
        }
        return result;                                // Return result
    }

Problems with this style of programming

Low level

Imperative (how to do it) NOT declarative (what to do)

Verbose

Mutable variables are potentially error prone

Difficult to parallelize


Java 8 streams to the rescue

A sequence of elements

“Wrapper” around a collection (and other types: e.g. JarFile.stream(), Files.lines())

Streams can also be infinite

Provides a functional/lambda-based API for transforming, filtering and aggregating elements

Much simpler, cleaner and more declarative code
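A minimal sketch of an infinite stream, assuming only the JDK's java.util.stream API. The class and method names are made up for illustration. Nothing is computed until collect() runs, and limit() is what makes the pipeline terminate.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class InfiniteStreamExample {

    // Returns the first n even squares: 4, 16, 36, ...
    static List<Integer> firstEvenSquares(int n) {
        return Stream.iterate(1, x -> x + 1)   // infinite: 1, 2, 3, ...
                .map(x -> x * x)               // transform
                .filter(x -> x % 2 == 0)       // filter
                .limit(n)                      // cut the infinite stream short
                .collect(Collectors.toList()); // aggregate
    }

    public static void main(String[] args) {
        System.out.println(firstEvenSquares(3)); // prints [4, 16, 36]
    }
}
```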


Using Java 8 streams - filtering

Before:

public class SocialNetwork {
    ...
    public Set<Person> peopleWithNoFriends() {
        Set<Person> result = new HashSet<Person>();
        for (Person p : people) {
            if (p.getFriends().isEmpty())
                result.add(p);
        }
        return result;
    }

After:

public class SocialNetwork {
    ...
    public Set<Person> lonelyPeople() {
        return people.stream()
            .filter(p -> p.getFriends().isEmpty())   // predicate lambda expression
            .collect(Collectors.toSet());
    }

The filter() function

s1 a b c d e ...

s2 a c d ...

s2 = s1.filter(f)

Elements that satisfy predicate f


Using Java 8 streams - mapping

class Person ..

private Set<Friend> friends = ...;

public Set<Hometown> hometownsOfFriends() {
    return friends.stream()
        .map(f -> f.getPerson().getHometown())
        .collect(Collectors.toSet());
}

The map() function

s1 a b c d e ...

s2 f(a) f(b) f(c) f(d) f(e) ...

s2 = s1.map(f)


Using Java 8 streams - friend of friends using flatMap

class Person ..

public Set<Person> friendOfFriends() {
    return friends.stream()
        .flatMap(friend -> friend.getPerson().friends.stream())   // maps and flattens
        .map(Friend::getPerson)
        .filter(f -> f != this)
        .collect(Collectors.toSet());
}

The flatMap() function

s1 a b ...

s2 f(a)0 f(a)1 f(b)0 f(b)1 f(b)2 ...

s2 = s1.flatMap(f)
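A small sketch contrasting map() and flatMap() on plain strings, assuming only the JDK's stream API. The class and method names are made up for illustration. map() yields exactly one result per element, whereas flatMap() lets f return a whole stream per element and splices those streams together.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class FlatMapExample {

    // map(): one output per input line
    static List<Integer> wordsPerLine(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(" ").length)
                .collect(Collectors.toList());
    }

    // flatMap(): each line becomes a stream of words, flattened into one stream
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "c d e");
        System.out.println(wordsPerLine(lines)); // prints [2, 3]
        System.out.println(words(lines));        // prints [a, b, c, d, e]
    }
}
```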


Using Java 8 streams - reducing

public class SocialNetwork {
    ...
    public long averageNumberOfFriends() {
        return people.stream()
            .map(p -> p.getFriends().size())
            .reduce(0, (x, y) -> x + y)
            / people.size();
    }

The reduce(0, (x, y) -> x + y) call is equivalent to:

    int x = 0;
    for (int y : inputStream)
        x = x + y;
    return x;

The reduce() function

s1 a b c d e ...

x = s1.reduce(initial, f)

f(f(f(f(f(f(initial, a), b), c), d), e), ...)
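A sketch of the nesting shown above, assuming only the JDK. The foldLeft helper is a hypothetical hand-written loop, not a JDK method; it is used for the trace because Stream.reduce() expects an associative function and a true identity value, which the string-building f below is not. For addition both conditions hold, so sum() can use Stream.reduce() directly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

final class ReduceExample {

    // Stream.reduce with an identity value and an associative function
    static int sum(List<Integer> numbers) {
        return numbers.stream().reduce(0, (x, y) -> x + y);
    }

    // Hypothetical helper: an explicit left fold, f(f(f(initial, a), b), c)
    static <A, T> A foldLeft(List<T> items, A initial, BiFunction<A, T, A> f) {
        A acc = initial;
        for (T item : items) {
            acc = f.apply(acc, item);
        }
        return acc;
    }

    // Builds a string that shows how the calls to f nest
    static String trace(List<String> items) {
        return foldLeft(items, "initial", (acc, item) -> "f(" + acc + ", " + item + ")");
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3, 4)));    // prints 10
        System.out.println(trace(Arrays.asList("a", "b", "c"))); // prints f(f(f(initial, a), b), c)
    }
}
```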


Adopting FP with Java 8 is straightforward

Simply start using streams and lambdas

Eclipse can refactor anonymous inner classes to lambdas

Agenda

Simplifying concurrency with Futures and Rx Observables

Let’s imagine that you are writing code to display the products in a user’s wish list

The need for concurrency

Step #1

Web service request to get the user profile including wish list (list of product Ids)

Step #2

For each productId: web service request to get product info

But

Getting products sequentially ⇒ terrible response time

Need to fetch productInfo concurrently

Composing sequential + scatter/gather-style operations is very common

Futures are a great abstraction for composing concurrent operations

http://en.wikipedia.org/wiki/Futures_and_promises


Composition with futures

[Diagram: the client on the main thread initiates asynchronous operations 1 and 2, which run on a worker thread or as event-driven code. Each operation sets its outcome on a future (Future 1, Future 2), and the client gets the outcome from that future.]

But composition with basic futures is difficult

Java 7 future.get([timeout]):

Blocking API ⇒ client blocks thread

Difficult to compose multiple concurrent operations

Futures with callbacks:

e.g. Guava ListenableFutures, Spring 4 ListenableFuture

Attach callbacks to all futures and asynchronously consume outcomes

But callback-based code = messy code

See http://techblog.netflix.com/2013/02/rxjava-netflix-api.html

We need functional futures!


Functional futures - Scala, Java 8 CompletableFuture

def asyncPlus(x : Int, y : Int) : Future[Int] = ... x + y ...

val future2 = asyncPlus(4, 5).map{ _ * 3 }

assertEquals(27, Await.result(future2, 1 second))

Asynchronously transforms future

def asyncSquare(x : Int) : Future[Int] = ... x * x ...

val f2 = asyncPlus(5, 8).flatMap { x => asyncSquare(x) }

assertEquals(169, Await.result(f2, 1 second))

Calls asyncSquare() with the eventual outcome of asyncPlus()
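The same two compositions can be sketched with Java 8's CompletableFuture, where thenApply() plays the role of map() and thenCompose() the role of flatMap(). The asyncPlus and asyncSquare bodies here are assumptions: they just run the arithmetic on the common pool via supplyAsync().

```java
import java.util.concurrent.CompletableFuture;

final class FunctionalFutures {

    static CompletableFuture<Integer> asyncPlus(int x, int y) {
        return CompletableFuture.supplyAsync(() -> x + y);
    }

    static CompletableFuture<Integer> asyncSquare(int x) {
        return CompletableFuture.supplyAsync(() -> x * x);
    }

    public static void main(String[] args) {
        // thenApply ~ map: asynchronously transforms the eventual outcome
        CompletableFuture<Integer> future2 = asyncPlus(4, 5).thenApply(sum -> sum * 3);

        // thenCompose ~ flatMap: chains a second asynchronous operation
        CompletableFuture<Integer> f2 = asyncPlus(5, 8).thenCompose(x -> asyncSquare(x));

        System.out.println(future2.join()); // prints 27
        System.out.println(f2.join());      // prints 169
    }
}
```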


Functions like map() are asynchronous

f2 = f1 map (someFn)

[Diagram: when f1 completes with Outcome1, someFn(outcome1) is evaluated and its result completes f2.]

Implemented using callbacks

Scala wish list service

class WishListService(...) {
  def getWishList(userId : Long) : Future[WishList] = {
    userService.getUserProfile(userId).                        // Future[UserProfile]
      map { userProfile => userProfile.wishListProductIds }.   // Future[List[Long]]
      flatMap { productIds =>
        val listOfProductFutures =
          productIds map productInfoService.getProductInfo     // List[Future[ProductInfo]]
        Future.sequence(listOfProductFutures)                  // Future[List[ProductInfo]]
      }.
      map { products => WishList(products) }                   // Future[WishList]
  }
}

Java 8 Completable Futures let you write similar code
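A rough Java 8 counterpart of the wish list service, as a sketch only. The two service methods are hypothetical stand-ins that return canned data, product info is simplified to a String, and sequence() is a hand-written helper because CompletableFuture has no direct equivalent of Scala's Future.sequence.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class WishListSketch {

    // Hypothetical stand-in for the user profile web service
    static CompletableFuture<List<Long>> getWishListProductIds(long userId) {
        return CompletableFuture.supplyAsync(() -> Arrays.asList(1L, 2L, 3L));
    }

    // Hypothetical stand-in for the product info web service
    static CompletableFuture<String> getProductInfo(long productId) {
        return CompletableFuture.supplyAsync(() -> "product-" + productId);
    }

    // List<CompletableFuture<T>> -> CompletableFuture<List<T>>
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture<?>[0]))
                // every future is complete here, so join() does not block
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }

    static CompletableFuture<List<String>> getWishList(long userId) {
        return getWishListProductIds(userId)
                .thenCompose(productIds -> {
                    // scatter: one request per product id
                    List<CompletableFuture<String>> productFutures = productIds.stream()
                            .map(WishListSketch::getProductInfo)
                            .collect(Collectors.toList());
                    // gather: a single future for the whole list
                    return sequence(productFutures);
                });
    }

    public static void main(String[] args) {
        System.out.println(getWishList(42L).join()); // prints [product-1, product-2, product-3]
    }
}
```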


Your mouse is your database

Erik Meijer

http://queue.acm.org/detail.cfm?id=2169076


Introducing Reactive Extensions (Rx)

The Reactive Extensions (Rx) is a library for composing asynchronous and event-based programs using observable sequences and LINQ-style query operators. Using Rx, developers represent asynchronous data streams with Observables, query asynchronous data streams using LINQ operators, and ...

https://rx.codeplex.com/


About RxJava

Reactive Extensions (Rx) for the JVM

Original motivation for Netflix was to provide rich Futures

Implemented in Java

Adaptors for Scala, Groovy and Clojure

Embraced by Akka and Spring Reactor: http://www.reactive-streams.org/

https://github.com/Netflix/RxJava


RxJava core concepts

// An asynchronous stream of items
trait Observable[T] {
  def subscribe(observer : Observer[T]) : Subscription   // Subscription is used to unsubscribe
  ...
}

// Notified by the Observable
trait Observer[T] {
  def onNext(value : T)
  def onCompleted()
  def onError(e : Throwable)
}

Comparing Observable to...

Observer pattern - similar but adds

Observer.onCompleted()

Observer.onError()

Iterator pattern - mirror image

Push rather than pull

Futures - similar

Can be used as Futures

But Observables = a stream of multiple values

Collections and Streams - similar

Functional API supporting map(), flatMap(), ...

But Observables are asynchronous
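To illustrate "push rather than pull", here is a deliberately tiny sketch in plain Java. TinyObservable is a hypothetical type written for this example and is NOT the RxJava API; it is also synchronous and has no error handling or unsubscription, so it only shows the direction of control: with an Iterator the consumer asks for each item, with an observable the source calls the consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical minimal push-based stream (not RxJava)
interface TinyObservable<T> {

    void subscribe(Consumer<T> onNext, Runnable onCompleted);

    static <T> TinyObservable<T> from(List<T> items) {
        return (onNext, onCompleted) -> {
            items.forEach(onNext);   // the source pushes each item
            onCompleted.run();
        };
    }

    default <R> TinyObservable<R> map(Function<T, R> f) {
        return (onNext, onCompleted) ->
                subscribe(item -> onNext.accept(f.apply(item)), onCompleted);
    }
}

final class PushVersusPull {

    // Pull: the consumer drives the iteration
    static List<Integer> pull(List<Integer> source) {
        List<Integer> out = new ArrayList<>();
        Iterator<Integer> it = source.iterator();
        while (it.hasNext()) {
            out.add(it.next() * 10);
        }
        return out;
    }

    // Push: the source calls back into the consumer
    static List<Integer> push(List<Integer> source) {
        List<Integer> out = new ArrayList<>();
        TinyObservable.from(source)
                .map(x -> x * 10)
                .subscribe(out::add, () -> { });
        return out;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3);
        System.out.println(pull(source)); // prints [10, 20, 30]
        System.out.println(push(source)); // prints [10, 20, 30]
    }
}
```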


Fun with observables

val every10Seconds = Observable.interval(10 seconds)

val oneItem = Observable.items(-1L)

val ticker = oneItem ++ every10Seconds

ticker emits:

-1    0     1     ...
t=0   t=10  t=20  ...

val subscription = ticker.subscribe { (value: Long) => println("value=" + value) }
...
subscription.unsubscribe()

Connecting observables to the outside world

def getTableStatus(tableName: String) : Observable[DynamoDbStatus] =
  // The block passed to Observable is called once per subscriber
  Observable { subscriber: Subscriber[DynamoDbStatus] =>

    // Asynchronously gets information about DynamoDB table
    amazonDynamoDBAsyncClient.describeTableAsync(
      new DescribeTableRequest(tableName),
      new AsyncHandler[DescribeTableRequest, DescribeTableResult] {

        override def onSuccess(request: DescribeTableRequest, result: DescribeTableResult) = {
          subscriber.onNext(DynamoDbStatus(result.getTable.getTableStatus))
          subscriber.onCompleted()
        }

        override def onError(exception: Exception) = exception match {
          case t: ResourceNotFoundException =>
            subscriber.onNext(DynamoDbStatus("NOT_FOUND"))
            subscriber.onCompleted()
          case _ =>
            subscriber.onError(exception)
        }
      })
  }

Transforming observables

val tableStatus : Observable[DynamoDbMessage] = ticker.flatMap { i =>
  logger.info("{}th describe table", i + 1)
  getTableStatus(name)
}

tableStatus emits:

Status1  Status2  Status3  ...
t=0      t=10     t=20     ...

+ Usual collection methods: map(), filter(), take(), drop(), ...

Calculating rolling average

class AverageTradePriceCalculator {

  def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = {
    ...
  }

case class Trade(
  symbol : String,
  price : Double,
  quantity : Int
  ...)

case class AveragePrice(symbol : String, price : Double, ...)

Calculating average prices

def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = {

  // Create an Observable of per-symbol Observables
  trades.groupBy(_.symbol).map { case (symbol, tradesForSymbol) =>

    val openingEverySecond = Observable.items(-1L) ++ Observable.interval(1 seconds)

    def closingAfterSixSeconds(opening: Any) = Observable.interval(6 seconds).take(1)

    tradesForSymbol.window(openingEverySecond, closingAfterSixSeconds _).map { windowOfTradesForSymbol =>
      windowOfTradesForSymbol.fold((0.0, 0, List[Double]())) { (soFar, trade) =>
        val (sum, count, prices) = soFar
        (sum + trade.price, count + trade.quantity, trade.price +: prices)
      } map { case (sum, length, prices) =>
        AveragePrice(symbol, sum / length, prices)
      }
    }.flatten
  }.flatten
}

Agenda

Tackling big data problems with functional programming

Let’s imagine that you want to count word frequencies


Scala Word Count

val frequency : Map[String, Int] =
  Source.fromFile("gettysburgaddress.txt").getLines()
    .flatMap { _.split(" ") }.toList    // Map
    .groupBy(identity)
    .mapValues(_.length)                // Reduce

frequency("THE") should be(11)
frequency("LIBERTY") should be(1)
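For comparison, a sketch of the same word count with Java 8 streams, assuming only the JDK. It takes the lines as a List rather than reading a file, and the class name is made up. flatMap() is the map step; groupingBy() with counting() does the grouping and the reduce in one collector.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCount {

    static Map<String, Long> frequency(List<String> lines) {
        return lines.stream()
                // Map: each line becomes a stream of words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // Reduce: group identical words and count each group
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = frequency(Arrays.asList("a b a", "b a"));
        System.out.println(counts.get("a")); // prints 3
        System.out.println(counts.get("b")); // prints 2
    }
}
```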


But how to scale to a cluster of machines?


Apache Hadoop

Open-source software for reliable, scalable, distributed computing

Hadoop Distributed File System (HDFS)

Efficiently stores very large amounts of data

Files are partitioned and replicated across multiple machines

Hadoop MapReduce

Batch processing system

Provides plumbing for writing distributed jobs

Handles failures

...


Overview of MapReduce

[Diagram: InputData is split into (K,V) pairs and fed to three Mappers. Each Mapper emits (K,V)* pairs. The Shuffle groups the values by key, delivering (K1,V, ....)*, (K2,V, ....)* and (K3,V, ....)* to three Reducers. Each Reducer emits (K,V) pairs to OutputData.]
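The three phases can be imitated in memory with a few lines of plain Java. This is only a single-process sketch of the data flow, not Hadoop: the class and method names are made up, and the real framework distributes the mappers and reducers across machines and handles failures.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {

    // Mapper: one input line -> zero or more (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all values that share a key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reducer: (word, (1, 1, ...)) -> total for that word
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> output = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> output.put(key, reduce(key, values)));
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the cat", "the dog"))); // prints {cat=1, dog=1, the=2}
    }
}
```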


MapReduce Word count - mapper

class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}

Four score and seven years ⇒ (“Four”, 1), (“score”, 1), (“and”, 1), (“seven”, 1), ...

http://wiki.apache.org/hadoop/WordCount

Hadoop then shuffles the key-value pairs...


MapReduce Word count - reducer

class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}

(“the”, (1, 1, 1, 1, 1, 1, ...)) ⇒ (“the”, 11)

http://wiki.apache.org/hadoop/WordCount

About MapReduce

Very simple programming abstraction yet incredibly powerful

By chaining together multiple map/reduce jobs you can process very large amounts of data in interesting ways

e.g. Apache Mahout for machine learning

But

Mappers and Reducers = verbose code

Development is challenging, e.g. unit testing is difficult

It’s disk-based, batch processing ⇒ slow


Scalding: Scala DSL for MapReduce

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}

https://github.com/twitter/scalding

Expressive and unit testable

Each row is a map of named fields

Apache Spark

Part of the Hadoop ecosystem

Key abstraction = Resilient Distributed Datasets (RDD)

Collection that is partitioned across cluster members

Operations are parallelized

Created from either a Scala collection or a Hadoop supported datasource - HDFS, S3 etc

Can be cached in-memory for super-fast performance

Can be replicated for fault-tolerance

REPL for executing ad hoc queries

http://spark.apache.org


Spark Word Count

val sc = new SparkContext(...)

sc.textFile("s3n://mybucket/...")
  .flatMap { _.split(" ") }
  .groupBy(identity)
  .mapValues(_.length)
  .toArray.toMap

Expressive, unit testable and very fast

Summary

Functional programming enables the elegant expression of good ideas in a wide variety of domains

map(), flatMap() and reduce() are remarkably versatile higher-order functions

Use FP and OOP together

Java 8 has taken a good first step towards supporting FP


Questions?

@crichardson [email protected]

http://plainoldobjects.com