Apache Spark Model Deployment
Transcript of Apache Spark Model Deployment
Apache Spark™ Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solutions Architect focused on Advanced Analytics
About Me
Richard L Garris • [email protected] • @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from startups to the Global 2000
Prior work experience: PwC, Google, Skytree
Ohio State Buckeye and CMU alumnus
About Apache Spark MLlib
Started at Berkeley AMPLab (Apache Spark 0.8)
Now (Apache Spark 2.0):
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of PRs
• Growing coverage of distributed algorithms
The Apache Spark stack: Spark SQL, Streaming, MLlib and GraphFrames on top of Spark core.
MLlib Goals
General machine learning library for big data:
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
Apache Spark MLlib (spark.mllib)
• The original API, predating Spark 1.4
• A lower-level library built on Spark RDDs
• Uses LabeledPoint, Vectors and Tuples
• In maintenance mode only as of Spark 2.x
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
Apache Spark – ML Pipelines (spark.ml)
• Spark 1.4 and later
• spark.ml pipelines make it possible to create more complex models
• Integrated with DataFrames
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

// Let's initialize our linear regression learner
val lr = new LinearRegression()

// Now we set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE")
  .setMaxIter(100)
  .setRegParam(0.1)

// We will use the new spark.ml pipeline API. If you have worked with
// scikit-learn this will be very familiar.
// (vectorizer and trainingSet are defined earlier in the demo.)
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// Let's first train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
The Agile Modeling Process

Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results → (back to Set Business Goals)
Focus of this talk: the deployment end of the cycle – the Deploy Model and Measure / Evaluate Results steps.
What is a Model?

But what really is a model? A model is a complex pipeline of components:
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
ML Pipelines
A very simple pipeline:
Load data → Extract features → Train model → Evaluate
ML Pipelines
A real pipeline: multiple data sources feed separate feature-extraction and feature-transform stages, which train two models whose results are combined in an ensemble and then evaluated.
• Datasource 1, Datasource 2 → Extract features → Feature transforms 1 and 2 → Train model 1
• Datasource 3 → Extract features → Feature transform 3 → Train model 2
• Train model 1 + Train model 2 → Ensemble → Evaluate
Why ML persistence?
Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model
Why ML persistence?

Data Science: Prototype (Python/R) → Create Pipeline:
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction

Software Engineering: Re-implement Pipeline for production (Java) → Deploy Pipeline. Re-implementation means:
• Extra implementation work
• Different code paths
• Synchronization overhead
With ML persistence...
Data Science: Prototype (Python/R) → Create Pipeline → Persist model or Pipeline: model.save("s3n://...")
Software Engineering: Load Pipeline (Scala/Java): Model.load("s3n://…") → Deploy in production
Demo
Model Serialization in Apache Spark 2.0 using Parquet
What are the Requirements for a Robust Model Deployment System?
Customer SLAs:
• Response time
• Throughput (predictions per second)
• Uptime / Reliability

Tech Stack:
• C / C++
• Legacy (mainframe)
• Java
• Docker
Model Scoring: Offline vs Online

Your model scoring environment:

Offline:
• Internal use (batch)
• Emails, notifications (batch)
• Schedule-based or event-trigger based

Online:
• Customer waiting on the response (human real-time)
• Super low-latency with a fixed response window (transactional fraud, ad bidding)
Model Scoring Considerations

Not all models return a yes / no.

Example: Login Bot Detector – different behavior depending on the probability score:
• 0.0–0.4: Allow login
• 0.4–0.6: Challenge question
• 0.6–0.75: Send SMS
• 0.75–0.9: Refer to agent
• 0.9–1.0: Block
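The threshold table above can be sketched as a plain Scala decision function. This is an illustrative sketch only; the object and action names are hypothetical, not from the talk.

```scala
// Hypothetical sketch of the login-bot thresholds; names are illustrative.
object LoginBotDetector {
  sealed trait Action
  case object AllowLogin        extends Action
  case object ChallengeQuestion extends Action
  case object SendSms           extends Action
  case object ReferToAgent      extends Action
  case object Block             extends Action

  // Map a bot-probability score in [0, 1] to the action tiers on the slide.
  def actionFor(score: Double): Action = score match {
    case s if s < 0.4  => AllowLogin
    case s if s < 0.6  => ChallengeQuestion
    case s if s < 0.75 => SendSms
    case s if s < 0.9  => ReferToAgent
    case _             => Block
  }
}
```

The point is that a "score" only becomes a business outcome through a policy layer like this, which can be tuned independently of the model itself.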
Example: Item Recommendations – output is a ranking of the top n items.
• API: send user ID + number of items
• Return a sorted set of items to recommend
• Optional: pass context-sensitive information to tailor results
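The API shape described above can be sketched in plain Scala: given a user ID and an item count, return the top-n items by score. The object name and the in-memory score map are illustrative assumptions; in practice the scores would come from a trained model or a precomputed store.

```scala
// Hypothetical sketch of the recommendation API shape; data is illustrative.
object RecommendationService {
  // Stand-in for model output or a precomputed score store.
  private val scores: Map[String, Map[String, Double]] = Map(
    "user-1" -> Map("itemA" -> 0.9, "itemB" -> 0.4, "itemC" -> 0.7)
  )

  // Return the top-n item IDs for a user, sorted by descending score.
  def recommend(userId: String, n: Int): Seq[String] =
    scores.getOrElse(userId, Map.empty)
      .toSeq
      .sortBy { case (_, score) => -score } // highest score first
      .take(n)
      .map { case (item, _) => item }
}
```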
Model Updates and Versioning
• Model update frequency (nightly, weekly, monthly, quarterly)
• Model version tracking
• Model release process: Dev ‣ Test ‣ Staging ‣ Production
• Model update process: Benchmark (or shadow models), Phase-in (20% traffic), Big Bang
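A "phase-in" rollout like the 20%-traffic option above is often implemented by hashing a stable request key into buckets, so each user consistently hits the same model version. This is a minimal sketch under that assumption; the object name is hypothetical.

```scala
// Hypothetical sketch of phase-in routing: send a fixed percentage of
// traffic to the candidate model, deterministically per request key.
object ModelRouter {
  // True when this request should be scored by the candidate model.
  def useCandidate(requestKey: String, candidateSharePct: Int): Boolean = {
    val bucket = math.abs(requestKey.hashCode % 100) // stable bucket 0..99
    bucket < candidateSharePct
  }
}
```

Deterministic hashing (rather than random sampling per request) keeps a given user's experience consistent during the rollout.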
Model Governance

• Models can carry both reward and risk for the business:
– Well-designed models prevent fraud, reduce churn, and increase sales
– Poorly designed models increase fraud, can damage the company's brand, and cause compliance violations or other risks
• Models should be governed by the company's policies and procedures, laws and regulations, and the organization's management goals

Considerations:
• Models have to be transparent, explainable, traceable and interpretable for auditors / regulators
• Models may need reason codes for rejections (e.g., if I decline someone credit, why?)
• Models should have an approval and release process
• Models also cannot violate any discrimination laws or use features that could be traced to religion, gender, ethnicity, etc.
Model A/B Testing
• A/B testing: comparing two versions to see which performs better
• Historical data works for evaluating models in testing, but production experiments are required to validate the model hypothesis
• The A/B framework should support the steps of the model update process: Benchmark (or shadow models), Phase-in (20% traffic), Big Bang
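At its simplest, the comparison step of an A/B test reduces to tracking a success metric per variant and picking the better observed rate. This is an illustrative sketch only (names and the "winner by rate" rule are assumptions; a real framework would also test statistical significance).

```scala
// Hypothetical sketch of comparing two model variants on an observed rate.
object AbTest {
  case class VariantStats(successes: Long, trials: Long) {
    // Observed success rate, e.g. conversion or precision in production.
    def rate: Double = if (trials == 0) 0.0 else successes.toDouble / trials
  }

  // Name of the variant with the higher observed rate (ties go to the first).
  def winner(a: (String, VariantStats), b: (String, VariantStats)): String =
    if (a._2.rate >= b._2.rate) a._1 else b._1
}
```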
Model Monitoring
• Monitoring is the process of observing the model's performance, logging its behavior, and alerting when the model degrades
• Logging should capture exactly the data fed into the model at the time of scoring
• Model alerting is critical for detecting unusual or unexpected behaviors
Open Loop vs Closed Loop
• Open loop: a human being is involved
• Closed loop: no human involved

Model scoring is almost always closed loop; some models alert agents or customer service. Model training is usually open loop, with a data scientist in the loop to update the model.
Online Learning
• Closed-loop, entirely machine-driven modeling is risky
• Requires proper model monitoring and safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-means and logistic regression support online updates)
• Alternative: use a more complex model to better fit new data rather than using online learning
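To make the online-learning idea concrete, here is the core update rule behind a streaming k-means-style model: each new point nudges a centroid toward it by a decay factor. This is plain Scala illustrating the arithmetic, not MLlib's streaming API; the object name and 1-D simplification are assumptions.

```scala
// Hypothetical sketch of an online (incremental) centroid update.
object OnlineCentroid {
  // Blend the old centroid with the new point; alpha in (0, 1] controls
  // how quickly the model forgets old data (alpha = 1 ignores history).
  def update(centroid: Double, point: Double, alpha: Double): Double =
    (1 - alpha) * centroid + alpha * point
}
```

The sensitivity-to-noise risk mentioned above is visible here: a large alpha lets a single bad point drag the centroid far from where it belongs, which is why monitoring and safeguards matter.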
Model Deployment Architectures
Architecture #1: Offline Recommendations (nightly batch)
Train ALS Model → Save Offers to NoSQL → Display Ranked Offers in Web / Mobile → Send Offers to Customers
Architecture #2: Precomputed Features with Streaming
Web Logs → Spark Streaming → Pre-compute Features → Features store → Kill User's Login Session
Architecture #3: Local Apache Spark™
Train Model in Spark → Save Model to S3 / HDFS → Copy Model to Production → Run Spark Local → score New Data to produce Predictions
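Once a model's parameters are exported (Architecture #3), scoring locally no longer needs a cluster: for a linear model it is a dot product plus an intercept. This is a minimal sketch under that assumption; the object name and weights are illustrative, not from the talk.

```scala
// Hypothetical sketch of local scoring with exported linear-model weights.
object LocalScorer {
  // prediction = intercept + w · x
  def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
    intercept + weights.zip(features).map { case (w, x) => w * x }.sum
}
```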
Demo
• Example of Offline Recommendations using ALS and Redis as a NoSQL Cache
Try Databricks Community Edition
2016 Apache Spark Survey
Spark Summit EU Brussels
October 25-27
The CFP closes at 11:59pm on July 1st
For more information and to submit:
https://spark-summit.org/eu-2016/