AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendation Engines with Amazon...


Page 1: AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark (MAC303)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jonathan Fritz, Sr. Product Manager, Amazon EMR

Jasjeet Thind, Sr. Director, Data Science & Engineering, Zillow Group

November 29, 2016

MAC303

Zillow Group: Developing Classification and Recommendation Engines With Amazon EMR and Apache Spark

Page 2:

What to Expect from the Session

• Apache Spark and Spark ML overview

• Running Spark ML on Amazon EMR

• Interactive notebook options

• Building recommendation engines at Zillow Group

Page 3:

Spark for fast processing

[Figure: DAG of RDDs A through F grouped into Stages 1, 2, and 3 and linked by map, filter, groupBy, and join; legend: RDD, cached partition]

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in DataFrames in memory

• Partitioning-aware to avoid network-intensive shuffle

Page 4:

Spark components to match your use case

Page 5:

Spark ML addresses the full ML pipeline

- Built on top of DataFrame API

- Extract, transform, and select features

- Distributed algorithms

    - Classification and Regression

    - Clustering

    - Collaborative Filtering

- Model selection tools

- Pipelines

[Diagram: Process Data -> Feature Extraction -> Model Training -> Model Testing -> Model Validation]

Page 6:

Extracting features in DataFrames

- Feature Extractors

    - CountVectorizer

- Feature Transformers

    - Tokenizer

    - Binarizer

    - StandardScaler

- Feature Selectors

    - VectorSlicer
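To make these categories concrete, here is a minimal pure-Python sketch of what a tokenizer, a count vectorizer, a binarizer, and a standard scaler compute. It does not use Spark: the function names are hypothetical stand-ins that only mirror the idea behind the Spark ML classes named above, which operate on DataFrame columns and differ in details (for example, this sketch orders the vocabulary alphabetically and always centers before scaling).

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace (the idea behind a tokenizer).
    return text.lower().split()

def count_vectorize(docs):
    # Build a sorted vocabulary, then one term-count vector per document.
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def binarize(values, threshold):
    # Map each value to 1.0 if it is strictly above the threshold, else 0.0.
    return [1.0 if v > threshold else 0.0 for v in values]

def standard_scale(values):
    # Rescale to zero mean and unit standard deviation (sample std, n - 1).
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

# Hypothetical input documents.
docs = [tokenize("Spark ML on EMR"), tokenize("Spark on YARN")]
vocab, vectors = count_vectorize(docs)
```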

Page 7:

Many storage layers to choose from

Amazon DynamoDB

Amazon RDS

Amazon Kinesis

Amazon Redshift

Amazon S3

Amazon EMR

Page 8:

Training data

Bank loan write-off predictions

Page 9:

Classification algorithms in Spark ML

- Logistic regression

- Decision tree classifier

- Random forest classifier

- Gradient-boosted tree classifier

- Multilayer perceptron classifier

- One-vs-Rest classifier

- Naive Bayes

Page 10:

What is logistic regression?
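The figure for this slide did not survive extraction. As a rough illustration of the idea, logistic regression passes a weighted sum of the features through the sigmoid function to obtain a probability, then thresholds it to pick a class. The sketch below is plain Python, not Spark ML, and the weights, bias, and feature values are hypothetical, hand-picked numbers rather than a fitted model.

```python
import math

def sigmoid(z):
    # Squash any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Probability of the positive class: sigmoid of the weighted sum.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(weights, bias, features, threshold=0.5):
    # Classify as 1 when the probability reaches the threshold.
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical coefficients for two features.
weights = [1.5, -2.0]
bias = 0.25
```

Training consists of choosing the weights and bias that best fit the labeled data; the sketch only shows how a fitted model turns features into a prediction.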

Page 11:

What are decision trees?

Weather predictors for Golf

Page 12:

Decision trees: tree induction

Page 13:

Decision trees: partition data with hyperplanes
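One step of tree induction can be sketched as a search for the threshold on a single feature that best separates the classes; each such split is a hyperplane perpendicular to one feature axis. The plain-Python sketch below uses Gini impurity as the split criterion, which is one common choice, and the humidity readings and labels are hypothetical data in the spirit of the golf example, not values from the slides.

```python
def gini(labels):
    # Gini impurity of a list of 0/1 labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    # Try a threshold midway between each pair of adjacent distinct values
    # and keep the one with the lowest weighted Gini impurity.
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: humidity readings and whether golf was played (1) or not (0).
humidity = [65, 70, 75, 80, 90, 95]
played = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_split(humidity, played)
```

A full tree repeats this search over every feature and then recurses into the left and right partitions until a stopping rule is met.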

Page 14:

Spark ML pipelines - training

Page 15:

Spark ML pipelines - testing

Page 16:

Creating a Spark ML pipeline

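import org.apache.spark.ml.Pipeline

// assembler, indexer, and dt are pipeline stages assumed to be defined elsewhere
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))

val model = pipeline.fit(df)  // fit every stage on the DataFrame df

val predictions = model.transform(df)  // append prediction columns to df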

Save and load machine learning models and full Pipelines

Page 17:

Tools to pick the right model

- CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters

- Split the input data into separate training and test datasets

- For each (training, test) pair, iterate through the set of ParamMaps

- Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model’s performance using the Evaluator
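The loop described above can be sketched in plain Python. This is not the Spark ML API: `k_fold_pairs`, `param_grid`, `fit`, and `accuracy` are hypothetical helpers standing in for the data splitting, the ParamMaps, the Estimator, and the Evaluator, and the toy estimator ignores its training rows to keep the example short.

```python
from itertools import product

def k_fold_pairs(rows, k):
    # Yield (training, test) pairs; fold i holds every k-th row starting at i.
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, test

def param_grid(**options):
    # Expand {"name": [values]} into a list of parameter dicts.
    names = sorted(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]

def fit(train, params):
    # Toy estimator: predicts 1 when x exceeds the configured threshold.
    return lambda x: 1 if x > params["threshold"] else 0

def accuracy(model, rows):
    # Evaluator: fraction of rows whose prediction matches the label.
    return sum(1 for x, y in rows if model(x) == y) / len(rows)

def cross_validate(rows, grid, k):
    # For each (training, test) pair, try every parameter set, then pick
    # the parameters with the best mean score across the k folds.
    totals = [0.0] * len(grid)
    for train, test in k_fold_pairs(rows, k):
        for i, params in enumerate(grid):
            totals[i] += accuracy(fit(train, params), test)
    means = [t / k for t in totals]
    best = max(range(len(grid)), key=lambda i: means[i])
    return grid[best], means[best]

# Hypothetical data: label is 1 when x is greater than 5.
rows = [(x, 1 if x > 5 else 0) for x in range(10)]
grid = param_grid(threshold=[2, 5, 8])
best_params, best_score = cross_validate(rows, grid, k=3)
```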

Page 18:

Why Amazon EMR?

Easy to Use: Launch a cluster in minutes

Low Cost: Pay an hourly rate

Open-Source Variety: Latest versions of software

Managed: Spend less time monitoring

Secure: Easy to manage options

Flexible: Customize the cluster

Page 19:

Develop fast using notebooks and IDEs

Page 20:

Spark on YARN

• Run Spark Driver in Client or Cluster mode

• Spark application runs as a YARN application

• SparkContext runs as a library in your program, one instance per Spark application.

• Spark Executors run in YARN Containers on NodeManagers in your cluster

• Access Spark UI through the Resource Manager or Spark History Server

[Figure: Spark UI]

Page 21:

Monitor your Spark jobs

Page 22:

Auto Scaling for data science on-demand

YARN metrics

Page 23:

Coming soon: advanced Spot provisioning

[Diagram: Master Node | Core Instance Fleet | Task Instance Fleet]

• Provision from a list of instance types with Spot and On-Demand

• Launch in the most optimal AZ based on capacity/price

• Spot Block support

Page 24:

Productionizing your pipeline

Amazon EMR Step API: Submit a Spark application

AWS Data Pipeline; Airflow, Luigi, or other schedulers on EC2: Create a pipeline to schedule job submission or create complex workflows

AWS Lambda: Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
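As a rough sketch of the Step API option, the function below builds a step definition that runs `spark-submit` through `command-runner.jar`. The helper name `build_spark_step`, the step name, the S3 path, and the application arguments are all hypothetical; only the dictionary shape follows the EMR step structure.

```python
def build_spark_step(name, app_location, app_args=(), deploy_mode="cluster"):
    # Build one step definition in the shape accepted by the EMR Step API.
    # command-runner.jar runs spark-submit on the cluster.
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode, app_location, *app_args],
        },
    }

# Hypothetical application location and arguments.
step = build_spark_step(
    name="train-recommender",
    app_location="s3://example-bucket/jobs/train.py",
    app_args=["--date", "2016-11-29"],
)

# With boto3 installed and credentials configured, the step could be submitted with:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster id>", Steps=[step])
```

The same dictionary could be built inside an AWS Lambda handler or a scheduler task, so the submission logic stays identical whichever trigger is chosen.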

Page 25:

Recommendation Systems @ Zillow Group

Jasjeet Thind

Sr Director, Data Science & Engineering

Page 26:

Agenda

Intro to Zillow Group

Recommendation Use Cases

Architecture

Algorithms

Training & Scoring Pipeline

Metrics

Page 27:

Zillow Group

Build the world's largest, most trusted, and vibrant home-related marketplace.

Page 28:

Recommendation use cases

Email - homes for sale / for rent

Home Details - homes for sale / homes like this

Personalized Search

Mobile - smart SMS and push notifications

Home owner / pre-seller predictions

Lender selection algorithm

Similar photos / video

Page 29:

Architecture

RECOMMENDATION API (Python, R, Flask)

Zillow Group

Data Lake (S3 / Kinesis)

Property Featurization (Spark EMR)

User Profiles (Spark EMR)

Ranking (Spark EMR)

Wedge Counting

Collaborative Filtering (Spark EMR)

Property Aggregate Features (Spark EMR)

Data Collection Systems (Java/Python/SQL)

Page 30:

Like vs. dislike

Predict homes per user using behavior of similar users

Like = user actively engaged with property

Dislike = user viewed property but weak engagement

[Figure: graph linking users Spencer and Stan to homes priced $22M, $19M, and $664K; edges are marked + or -, and one edge is marked ?]

Feature        Description
uid            unique id of user
pid            Property id
first_visit    timestamp or 0
num_views      sigmoid(#views)
time_spent     time on page
num_contacts   # leads sent
num_saves      # saves on zpid
num_shares     # shares on zpid
num_photos     # photos viewed

Page 31:

Wedge count

To form a prediction for each user and property pair, perform a wedge count

- http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf

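Does Stan like $19M?

Wedge #   Feature
3         wedge03_cnt
5         wedge05_cnt

[Figure: two wedge diagrams, one per table row. Each shows Spencer, Stan, and the $19M home with one + edge and one ? edge; the wedge 3 diagram passes through the $22M home and the wedge 5 diagram through the $664k home, each with one + and one - edge.]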

Page 32:

Classifier

Gradient Boosting Classifier (sklearn)

Popular users / properties:

- Divide wedge counts by degree product ju * ki

Prediction for all user / property pairs, limit candidate set by

- Top 10 zip codes

- 300 properties per user

Features:

wedge00_cnt, wedge01_cnt, wedge02_cnt, wedge03_cnt, wedge04_cnt, wedge05_cnt, wedge06_cnt, wedge07_cnt

wedge00_norm_cnt, wedge01_norm_cnt, wedge02_norm_cnt, wedge03_norm_cnt, wedge04_norm_cnt, wedge05_norm_cnt, wedge06_norm_cnt, wedge07_norm_cnt

Does Stan like the $19M home?   features
(uid: Stan, pid: $19M)          (the 16 wedge features listed above)
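The slides do not define the eight wedge types, so the following is only a sketch under stated assumptions. It assumes a wedge is a path from the target user, through a property they share with another user, to the target property, and that the wedge type is a 3-bit code with one bit per edge sign (like or dislike). It also assumes the degree product is the target user's number of rated properties times the target property's number of raters. The function name `wedge_features` and the like/dislike signs in the sample data are hypothetical, chosen so that the sketch produces wedge types 3 and 5.

```python
from collections import defaultdict

def wedge_features(ratings, user, prop):
    # ratings maps (uid, pid) -> +1 (like) or -1 (dislike).
    by_user = defaultdict(dict)
    by_prop = defaultdict(dict)
    for (u, p), sign in ratings.items():
        by_user[u][p] = sign
        by_prop[p][u] = sign

    counts = [0] * 8
    # Walk user -> shared property -> other user -> target property.
    for shared, s1 in by_user[user].items():
        if shared == prop:
            continue
        for other, s2 in by_prop[shared].items():
            if other == user or prop not in by_user[other]:
                continue
            s3 = by_user[other][prop]
            # Assumed encoding: one bit per edge sign (1 for like, 0 for dislike).
            index = (s1 > 0) * 4 + (s2 > 0) * 2 + (s3 > 0)
            counts[index] += 1

    # Normalize by the product of the user's and the property's degrees.
    degree_product = len(by_user[user]) * len(by_prop[prop])
    norm = [c / degree_product if degree_product else 0.0 for c in counts]
    return counts, norm

# Hypothetical signs for the users and homes named on the slides.
ratings = {
    ("Spencer", "$22M"): +1,
    ("Stan", "$22M"): -1,
    ("Spencer", "$664K"): -1,
    ("Stan", "$664K"): +1,
    ("Spencer", "$19M"): +1,
}
counts, norm = wedge_features(ratings, "Stan", "$19M")
```

The two lists together give 16 values per (uid, pid) pair, matching the shape of the feature list fed to the classifier.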

Page 33:

User profile

Signals - website, mobile app, and search queries

Binary classification

- labels (like/dislike) same as collab filtering model

User profile model determines preference scores

Features (categorical variables)

Feature    Values
Bath       0_bath, 0.5_bath, 1_bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
Bed        0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price      100_125_price, 125_150_price, 150_175_price
Use Code   condo, single_family, farm_land
Zipcode    zip_98109

Training row layout:

pid   uid   features                                   label
            0 or 1 for each categorical value above    0 or 1

Example preference scores: 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6

Page 34:

Ranking

Property matrix - feature space same as user profile

Dot product of property matrix with user profile vector

Age decay for older listings

(uid, pid)                                 score
{"uId":"10307499", "pId":"1044183744"}     0.3364

Property matrix x user profile vector (uid_0) = score per property:

         0_bed  1_bed  2_bed  3_bed       uid_0       score
pid_0      1      0      0      0          0           0
pid_1      0      0      1      0     x    0.01    =   0.8
pid_2      1      0      0      0          0.8         0
pid_3      0      0      0      1          0.6         0.6
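The scoring step can be sketched in a few lines of plain Python using the numbers from this slide. The slides do not say how the age decay is computed, so the exponential half-life used here (`half_life_days`) is an assumption, and `rank` is a hypothetical helper name.

```python
def rank(property_matrix, user_profile, ages_in_days=None, half_life_days=30.0):
    # Score each property as the dot product of its one-hot feature row
    # with the user's preference vector.
    scores = {}
    for pid, row in property_matrix.items():
        score = sum(f * w for f, w in zip(row, user_profile))
        if ages_in_days is not None:
            # Assumed decay: halve the score every half_life_days of listing age.
            score *= 0.5 ** (ages_in_days[pid] / half_life_days)
        scores[pid] = score
    # Highest score first; ties keep their original order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Values from the slide: columns are 0_bed, 1_bed, 2_bed, 3_bed.
property_matrix = {
    "pid_0": [1, 0, 0, 0],
    "pid_1": [0, 0, 1, 0],
    "pid_2": [1, 0, 0, 0],
    "pid_3": [0, 0, 0, 1],
}
uid_0 = [0, 0.01, 0.8, 0.6]
ranking = rank(property_matrix, uid_0)
```

Without decay, the 2-bedroom listing pid_1 ranks first; passing `ages_in_days` lets an older listing fall behind a fresher one with a slightly lower raw score.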

Page 35:

Training & scoring

Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.

[Diagram: training and scoring pipeline. Component labels, in the order they appear:]

- User Behavior (Kinesis / S3)
- Public Record (Kinesis / S3)
- Event API (Java)
- Producer (Python)
- Filter (Spark)
- User Store (Hive / S3)
- Note: Spark job creates Hive table with user events (uid, pid) partitioned by date
- Active Listings (Kinesis / S3)
- Producer (Python)
- Training Data (Spark)
- Training Set (Hive / S3)
- Note: pid -> uid reverse index
- Note: Past and current user events
- Models (Python)
- Train Models (Spark)
- Score (Spark)
- Recommendations
- Property Data
- Collaborative Filtering / User Profile Models
- Hashmap (Redis)
- Wedge features or property features (user profile)
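The "pid -> uid reverse index" label suggests a lookup from each property to the users who interacted with it. A minimal sketch of building such an index from (uid, pid) events is shown below; the helper name `build_reverse_index` and the sample events are hypothetical, and the production version described on the slide runs as a Spark job over Hive tables rather than in local memory.

```python
from collections import defaultdict

def build_reverse_index(events):
    # events is an iterable of (uid, pid) pairs; the result maps each pid
    # to the set of users who interacted with that property.
    index = defaultdict(set)
    for uid, pid in events:
        index[pid].add(uid)
    return dict(index)

# Hypothetical user events, including one repeated pair.
events = [("u1", "p1"), ("u2", "p1"), ("u1", "p2"), ("u2", "p1")]
reverse_index = build_reverse_index(events)
```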

Page 36:

Offline evaluation

Hyperparameter tuning with validation set

Training/test data sets for model evaluation

Offline Metrics   Description
Precision         rk = # recommended properties in test set in top k
Recall            n = total properties in the test set
Freshness         # listings recommended w/ modified date < y days old in top k
Coverage          # unique listings recommended across all users / total # unique listings
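The formulas for precision and recall did not survive extraction. The sketch below assumes the standard top-k definitions, precision = rk / k and recall = rk / n, and implements freshness and coverage as described in the table. All function names and sample values are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    # rk = number of the top-k recommendations that appear in the test set.
    top_k = recommended[:k]
    rk = sum(1 for pid in top_k if pid in relevant)
    precision = rk / k
    # n = total properties in the test set.
    recall = rk / len(relevant) if relevant else 0.0
    return precision, recall

def freshness(recommended, age_in_days, k, y):
    # Number of top-k recommendations modified less than y days ago.
    return sum(1 for pid in recommended[:k] if age_in_days[pid] < y)

def coverage(recommendations_by_user, all_listings):
    # Unique listings recommended across all users / total unique listings.
    listings = set(all_listings)
    recommended = set()
    for pids in recommendations_by_user.values():
        recommended.update(pids)
    return len(recommended & listings) / len(listings)

# Hypothetical recommendations and held-out test set for one user.
recommended = ["p1", "p2", "p3", "p4"]
relevant = {"p2", "p4", "p9"}
precision, recall = precision_recall_at_k(recommended, relevant, k=3)
```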

Page 37:

Future work

Classifiers for listing descriptions

Deep learning on listing images

Structured streaming on Spark 2.0

Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy

Real-time scoring

Page 38:

Thank you!

aws.amazon.com/emr/

aws.amazon.com/blogs/big-data/

http://www.zillow.com/data-science/

Come join us @ Zillow Group!

Hiring:

- SDE, ML, Data Scientist

- Big Data Engineer

- Analytic Engineer

- Product Management

Page 39:

Remember to complete your evaluations!

Page 40:

Related Sessions