Implementing BigPetStore with Apache Flink

Post on 09-Feb-2017


Implementing BigPetStore: A blueprint for Flink users

Márton Balassi
mbalassi@apache.org / @MartonBalassi

Hungarian Academy of Sciences

Outline

• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet & Table API
• Matrix factorization with FlinkML
• Recommendation with the DataStream API
• Summary

BigPetStore

Blueprints for Big Data applications

Consists of:
• Data generators
• Examples using tools in the Big Data ecosystem to process data
• Build system and tests for integrating tools and multiple JVM languages

Part of the Apache BigTop project

BigPetStore model

• Customers visit pet stores and generate transactions; the model is location based

Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014

Data generation

val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()

val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  .map{t => t.setDateTime(t.getDateTime + startTime); t}

transactions.writeAsText(output)

• Use RJ Nowling’s Java generator classes
• Write transactions to JSON

ETL with the DataSet API

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json)
  .map(new FlinkTransaction(_))

val productsWithIndex = transactions
  .flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId

val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct

customerAndProductPairs.writeAsCsv(output)

• Read the dirty JSON
• Output (customer, product) pairs for the recommender
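The indexing and pairing steps above can be sketched without Flink. This is only an illustration in plain Java collections, not the deck's code: the class and method names are hypothetical, and consecutive indices stand in for the ids that zipWithUniqueId would assign (those are unique but not necessarily consecutive).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EtlSketch {
    /** Assigns each distinct product an index, in order of first appearance. */
    static Map<String, Long> indexProducts(List<Map.Entry<Long, String>> purchases) {
        Map<String, Long> index = new LinkedHashMap<>();
        for (Map.Entry<Long, String> p : purchases) {
            index.putIfAbsent(p.getValue(), (long) index.size());
        }
        return index;
    }

    /** Replaces each product by its index and keeps only distinct (customer, productIndex) pairs. */
    static Set<List<Long>> customerProductPairs(List<Map.Entry<Long, String>> purchases) {
        Map<String, Long> index = indexProducts(purchases);
        Set<List<Long>> pairs = new LinkedHashSet<>();
        for (Map.Entry<Long, String> p : purchases) {
            pairs.add(Arrays.asList(p.getKey(), index.get(p.getValue())));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up (customerId, product) purchases, including one duplicate.
        List<Map.Entry<Long, String>> purchases = new ArrayList<>();
        purchases.add(new AbstractMap.SimpleEntry<>(1L, "dog food"));
        purchases.add(new AbstractMap.SimpleEntry<>(1L, "dog food"));
        purchases.add(new AbstractMap.SimpleEntry<>(2L, "cat litter"));
        purchases.add(new AbstractMap.SimpleEntry<>(2L, "dog food"));
        System.out.println(customerProductPairs(purchases)); // prints [[1, 0], [2, 1], [2, 0]]
    }
}
```

In the Flink job the same lookup is expressed as a join between the pairs and the indexed products, so it scales beyond a single in-memory map.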

ETL with the Table API

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json)
  .map(new FlinkTransaction(_))

val table = transactions.map(toCaseClass(_)).toTable

val storeTransactionCount = table.groupBy('storeId).select('storeId, 'storeName, 'storeId.count as 'count)

val bestStores = table
  .groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]

• Read the dirty JSON
• SQL style queries
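The intent of the query, finding the stores with the most transactions, can be sketched in plain Java. This is an illustration under assumptions, not the Table API: the class and method names are hypothetical and the store ids are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BestStoresSketch {
    /** Counts transactions per store id (the groupBy and count step). */
    static Map<Integer, Integer> countPerStore(int[] storeIds) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int id : storeIds) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the store ids whose count equals the maximum count (the count = max filter). */
    static List<Integer> bestStores(Map<Integer, Integer> counts) {
        List<Integer> best = new ArrayList<>();
        if (counts.isEmpty()) {
            return best;
        }
        int max = Collections.max(counts.values());
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() == max) {
                best.add(e.getKey());
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One store id per transaction; store 3 appears three times.
        int[] storeIds = {3, 1, 3, 2, 1, 3};
        System.out.println(bestStores(countPerStore(storeIds))); // prints [3]
    }
}
```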

A little recommender theory

[Figure: the user-item matrix R (users U × items I) approximated by the product of the user factors P and the item factors Q, with user side information and item side information alongside.]

• R is potentially huge, so approximate it with PQ
• Prediction is TopK(user’s row of P × Q)
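A small worked example of the prediction step, as a sketch in plain Java. It assumes a user's factor row from P and a Q laid out as factors × items. The numbers and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopKSketch {
    /** Scores every item for one user: userRow (length f) times Q (f x numItems). */
    static double[] scores(double[] userRow, double[][] q) {
        int numItems = q[0].length;
        double[] result = new double[numItems];
        for (int f = 0; f < userRow.length; f++) {
            for (int i = 0; i < numItems; i++) {
                result[i] += userRow[f] * q[f][i];
            }
        }
        return result;
    }

    /** Returns the indices of the k highest-scoring items, best first. */
    static List<Integer> topK(double[] scores, int k) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            items.add(i);
        }
        items.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return new ArrayList<>(items.subList(0, Math.min(k, items.size())));
    }

    public static void main(String[] args) {
        double[] userRow = {1.0, 0.5};   // one row of P (2 factors)
        double[][] q = {                 // Q: 2 factors x 3 items
            {0.2, 0.9, 0.4},
            {0.8, 0.1, 0.6}
        };
        // Item scores are roughly 0.6, 0.95 and 0.7, so items 1 and 2 win.
        System.out.println(topK(scores(userRow, q), 2)); // prints [1, 2]
    }
}
```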

Matrix factorization with FlinkML

val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.readCsvFile[(Int,Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(input)

val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)

• Read the (customer, product) pairs
• Write P and Q to file

Recommendation with the DataStream API

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK())
    .print();

• Get the user’s row for a userID
• Compute the distributed TopK of the user’s row × Q
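The partial/global split can be sketched in plain Java, independent of Flink. Each partition is assumed to hold a slice of the item factors (here one row per item), score its own items against the user's vector and keep its local top K; a final step merges the partial results. The class, method and field names are hypothetical and are not the deck's PartialTopK or GlobalTopK implementations.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DistributedTopKSketch {
    /** An (item id, score) pair. */
    static final class Scored {
        final int item;
        final double score;
        Scored(int item, double score) { this.item = item; this.score = score; }
    }

    /** Keeps the k best entries, best first; shared by the partial and the global step. */
    static List<Scored> topK(List<Scored> candidates, int k) {
        List<Scored> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    /** Partial step: score the items held by one partition against the user's vector. */
    static List<Scored> partialTopK(double[] userVector, int[] itemIds, double[][] itemFactors, int k) {
        List<Scored> scored = new ArrayList<>();
        for (int i = 0; i < itemIds.length; i++) {
            double dot = 0.0;
            for (int f = 0; f < userVector.length; f++) {
                dot += userVector[f] * itemFactors[i][f];
            }
            scored.add(new Scored(itemIds[i], dot));
        }
        return topK(scored, k);
    }

    /** Global step: merge all partial results and keep the overall k best item ids. */
    static List<Integer> globalTopK(List<List<Scored>> partials, int k) {
        List<Scored> all = new ArrayList<>();
        for (List<Scored> p : partials) {
            all.addAll(p);
        }
        List<Integer> ids = new ArrayList<>();
        for (Scored s : topK(all, k)) {
            ids.add(s.item);
        }
        return ids;
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.5};
        List<List<Scored>> partials = new ArrayList<>();
        // Partition A holds items 10 and 11, partition B holds items 12 and 13.
        partials.add(partialTopK(user, new int[]{10, 11}, new double[][]{{0.2, 0.8}, {0.9, 0.1}}, 2));
        partials.add(partialTopK(user, new int[]{12, 13}, new double[][]{{0.4, 0.6}, {0.1, 0.1}}, 2));
        System.out.println(globalTopK(partials, 2)); // prints [11, 12]
    }
}
```

Merging is safe because any item in the global top K must also be in the top K of its own partition, so each partition only needs to forward K candidates.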

Summary

• Go beyond WordCount with BigPetStore
• Feel free to mix the DataSet, DataStream, FlinkML, Table APIs in your Flink workflows
• Data generation, cleaning, ETL, Machine learning, streaming prediction on top of one engine with under 500 lines of code
• Java and Scala APIs work well together
• A Flink pet project is always fun. No pun intended.

Big thanks to

• The BigPetStore folks: Suneel Marthi, Ronald J. Nowling, Jay Vyas
• Squirrels helping with the code: Gyula Fóra, Gábor Gevay, Gábor Hermann, Fabian Hueske, Aljoscha Krettek

• And to the whole Flink community

Check out the code

https://github.com/mbalassi/bigpetstore-flink

Márton Balassi
mbalassi@apache.org / @MartonBalassi

Hungarian Academy of Sciences