Apache Spark Model Deployment
Transcript of Apache Spark Model Deployment
Apache Spark™ Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solutions Architect focused on Advanced Analytics
About Me
Richard L Garris • [email protected] • @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from startups to the Global 2000
Prior work experience: PwC, Google, Skytree
Ohio State Buckeye and CMU alumnus
About Apache Spark MLlib
Started at Berkeley AMPLab (Apache Spark 0.8)
Now (Apache Spark 2.0):
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of PRs
• Growing coverage of distributed algorithms
The Apache Spark stack: Spark SQL, Streaming, MLlib and GraphFrames on top of Spark core.
MLlib Goals
General machine learning library for big data:
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
Apache Spark MLlib (spark.mllib)
• The original API, predating Spark 1.4
• A lower-level library built on Spark RDDs
• Uses LabeledPoint, Vectors and Tuples
• In maintenance mode only as of Spark 2.x
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
Apache Spark – ML Pipelines (spark.ml)
• Spark 1.4 and later
• spark.ml pipelines make it possible to create more complex models
• Integrated with DataFrames
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

// Let's initialize our linear regression learner
val lr = new LinearRegression()

// Now we set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE")
  .setMaxIter(100)
  .setRegParam(0.1)

// We will use the new spark.ml pipeline API. If you have worked with
// scikit-learn this will be very familiar.
// (vectorizer and trainingSet are defined earlier in the demo.)
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// Let's first train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
The Agile Modeling Process

Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results → (back to Set Business Goals)
Focus of this talk: the deployment end of the cycle – the Deploy Model and Measure / Evaluate Results steps.
What is a Model?

But what really is a model? A model is a complex pipeline of components:
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
ML Pipelines
A very simple pipeline:
Load data → Extract features → Train model → Evaluate
ML Pipelines
A real pipeline: multiple data sources feed separate feature-extraction and feature-transform stages, which train two models whose results are combined in an ensemble and then evaluated.
• Datasource 1, Datasource 2 → Extract features → Feature transforms 1 and 2 → Train model 1
• Datasource 3 → Extract features → Feature transform 3 → Train model 2
• Train model 1 + Train model 2 → Ensemble → Evaluate
Why ML persistence?
Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model
Why ML persistence?

Data Science: Prototype (Python/R) → Create Pipeline:
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction

Software Engineering: Re-implement Pipeline for production (Java) → Deploy Pipeline. Re-implementation means:
• Extra implementation work
• Different code paths
• Synchronization overhead
With ML persistence...
Data Science: Prototype (Python/R) → Create Pipeline → Persist model or Pipeline: model.save("s3n://...")
Software Engineering: Load Pipeline (Scala/Java): Model.load("s3n://…") → Deploy in production
Demo
Model Serialization in Apache Spark 2.0 using Parquet
What are the Requirements for a Robust Model Deployment System?
Customer SLAs:
• Response time
• Throughput (predictions per second)
• Uptime / Reliability

Tech Stack:
• C / C++
• Legacy (mainframe)
• Java
• Docker
Model Scoring: Offline vs Online

Your model scoring environment:

Offline:
• Internal use (batch)
• Emails, notifications (batch)
• Schedule-based or event-trigger based

Online:
• Customer waiting on the response (human real-time)
• Super low-latency with a fixed response window (transactional fraud, ad bidding)
Model Scoring Considerations

Not all models return a yes / no.

Example: Login Bot Detector – different behavior depending on the probability score:
• 0.0–0.4: Allow login
• 0.4–0.6: Challenge question
• 0.6–0.75: Send SMS
• 0.75–0.9: Refer to agent
• 0.9–1.0: Block
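The threshold table above can be sketched as a plain Scala decision function. This is an illustrative sketch only; the object and action names are hypothetical, not from the talk.

```scala
// Hypothetical sketch of the login-bot thresholds; names are illustrative.
object LoginBotDetector {
  sealed trait Action
  case object AllowLogin        extends Action
  case object ChallengeQuestion extends Action
  case object SendSms           extends Action
  case object ReferToAgent      extends Action
  case object Block             extends Action

  // Map a bot-probability score in [0, 1] to the action tiers on the slide.
  def actionFor(score: Double): Action = score match {
    case s if s < 0.4  => AllowLogin
    case s if s < 0.6  => ChallengeQuestion
    case s if s < 0.75 => SendSms
    case s if s < 0.9  => ReferToAgent
    case _             => Block
  }
}
```

The point is that a "score" only becomes a business outcome through a policy layer like this, which can be tuned independently of the model itself.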
Example: Item Recommendations – output is a ranking of the top n items.
• API: send user ID + number of items
• Return a sorted set of items to recommend
• Optional: pass context-sensitive information to tailor results
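The API shape described above can be sketched in plain Scala: given a user ID and an item count, return the top-n items by score. The object name and the in-memory score map are illustrative assumptions; in practice the scores would come from a trained model or a precomputed store.

```scala
// Hypothetical sketch of the recommendation API shape; data is illustrative.
object RecommendationService {
  // Stand-in for model output or a precomputed score store.
  private val scores: Map[String, Map[String, Double]] = Map(
    "user-1" -> Map("itemA" -> 0.9, "itemB" -> 0.4, "itemC" -> 0.7)
  )

  // Return the top-n item IDs for a user, sorted by descending score.
  def recommend(userId: String, n: Int): Seq[String] =
    scores.getOrElse(userId, Map.empty)
      .toSeq
      .sortBy { case (_, score) => -score } // highest score first
      .take(n)
      .map { case (item, _) => item }
}
```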
Model Updates and Versioning
• Model update frequency (nightly, weekly, monthly, quarterly)
• Model version tracking
• Model release process: Dev ‣ Test ‣ Staging ‣ Production
• Model update process: Benchmark (or shadow models), Phase-in (20% traffic), Big Bang
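A "phase-in" rollout like the 20%-traffic option above is often implemented by hashing a stable request key into buckets, so each user consistently hits the same model version. This is a minimal sketch under that assumption; the object name is hypothetical.

```scala
// Hypothetical sketch of phase-in routing: send a fixed percentage of
// traffic to the candidate model, deterministically per request key.
object ModelRouter {
  // True when this request should be scored by the candidate model.
  def useCandidate(requestKey: String, candidateSharePct: Int): Boolean = {
    val bucket = math.abs(requestKey.hashCode % 100) // stable bucket 0..99
    bucket < candidateSharePct
  }
}
```

Deterministic hashing (rather than random sampling per request) keeps a given user's experience consistent during the rollout.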
Model Governance

• Models can carry both reward and risk for the business:
– Well-designed models prevent fraud, reduce churn, and increase sales
– Poorly designed models increase fraud, can damage the company's brand, and cause compliance violations or other risks
• Models should be governed by the company's policies and procedures, laws and regulations, and the organization's management goals

Considerations:
• Models have to be transparent, explainable, traceable and interpretable for auditors / regulators
• Models may need reason codes for rejections (e.g., if I decline someone credit, why?)
• Models should have an approval and release process
• Models also cannot violate any discrimination laws or use features that could be traced to religion, gender, ethnicity, etc.
Model A/B Testing
• A/B testing: comparing two versions to see which performs better
• Historical data works for evaluating models in testing, but production experiments are required to validate the model hypothesis
• The A/B framework should support the steps of the model update process: Benchmark (or shadow models), Phase-in (20% traffic), Big Bang
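At its simplest, the comparison step of an A/B test reduces to tracking a success metric per variant and picking the better observed rate. This is an illustrative sketch only (names and the "winner by rate" rule are assumptions; a real framework would also test statistical significance).

```scala
// Hypothetical sketch of comparing two model variants on an observed rate.
object AbTest {
  case class VariantStats(successes: Long, trials: Long) {
    // Observed success rate, e.g. conversion or precision in production.
    def rate: Double = if (trials == 0) 0.0 else successes.toDouble / trials
  }

  // Name of the variant with the higher observed rate (ties go to the first).
  def winner(a: (String, VariantStats), b: (String, VariantStats)): String =
    if (a._2.rate >= b._2.rate) a._1 else b._1
}
```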
Model Monitoring
• Monitoring is the process of observing the model's performance, logging its behavior, and alerting when the model degrades
• Logging should capture exactly the data fed into the model at the time of scoring
• Model alerting is critical for detecting unusual or unexpected behaviors
Open Loop vs Closed Loop
• Open loop: a human being is involved
• Closed loop: no human involved

Model scoring is almost always closed loop; some models alert agents or customer service. Model training is usually open loop, with a data scientist in the loop to update the model.
Online Learning
• Closed-loop, entirely machine-driven modeling is risky
• Requires proper model monitoring and safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-means and logistic regression support online updates)
• Alternative: use a more complex model to better fit new data rather than using online learning
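To make the online-learning idea concrete, here is the core update rule behind a streaming k-means-style model: each new point nudges a centroid toward it by a decay factor. This is plain Scala illustrating the arithmetic, not MLlib's streaming API; the object name and 1-D simplification are assumptions.

```scala
// Hypothetical sketch of an online (incremental) centroid update.
object OnlineCentroid {
  // Blend the old centroid with the new point; alpha in (0, 1] controls
  // how quickly the model forgets old data (alpha = 1 ignores history).
  def update(centroid: Double, point: Double, alpha: Double): Double =
    (1 - alpha) * centroid + alpha * point
}
```

The sensitivity-to-noise risk mentioned above is visible here: a large alpha lets a single bad point drag the centroid far from where it belongs, which is why monitoring and safeguards matter.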
Model Deployment Architectures
Architecture #1: Offline Recommendations (nightly batch)
Train ALS Model → Save Offers to NoSQL → Display Ranked Offers in Web / Mobile → Send Offers to Customers
Architecture #2: Precomputed Features with Streaming
Web Logs → Spark Streaming → Pre-compute Features → Features store → Kill User's Login Session
Architecture #3: Local Apache Spark™
Train Model in Spark → Save Model to S3 / HDFS → Copy Model to Production → Run Spark Local → score New Data to produce Predictions
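Once a model's parameters are exported (Architecture #3), scoring locally no longer needs a cluster: for a linear model it is a dot product plus an intercept. This is a minimal sketch under that assumption; the object name and weights are illustrative, not from the talk.

```scala
// Hypothetical sketch of local scoring with exported linear-model weights.
object LocalScorer {
  // prediction = intercept + w · x
  def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
    intercept + weights.zip(features).map { case (w, x) => w * x }.sum
}
```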
Demo
• Example of Offline Recommendations using ALS and Redis as a NoSQL Cache
Try Databricks Community Edition
2016 Apache Spark Survey
Spark Summit EU Brussels
October 25-27
The CFP closes at 11:59pm on July 1st
For more information and to submit:
https://spark-summit.org/eu-2016/