AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jonathan Fritz, Sr. Product Manager, Amazon EMR

Jasjeet Thind, Sr. Director, Data Science & Engineering, Zillow Group

November 29, 2016

MAC303

Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark

What to Expect from the Session

• Apache Spark and Spark ML overview

• Running Spark ML on Amazon EMR

• Interactive notebook options

• Building recommendation engines at Zillow Group

Spark for fast processing

[Diagram: DAG of stages 1-3 over RDDs A-F, with map, join, filter, and groupBy transformations; cached partitions marked]

• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle

Spark components to match your use case

Spark ML addresses the full ML pipeline

- Built on top of DataFrame API

- Extract, transform, and select features

- Distributed algorithms

- Classification and Regression

- Clustering

- Collaborative Filtering

- Model selection tools

- Pipelines

Process Data -> Feature Extraction -> Model Training -> Model Testing -> Model Validation

Extracting features in DataFrames

- Feature Extractors: CountVectorizer
- Feature Transformers: Tokenizer, Binarizer, StandardScaler
- Feature Selectors: VectorSlicer
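As an illustration of what these stages do (a stdlib-only Python sketch, not the Spark ML API; behavior loosely mirrors Tokenizer, CountVectorizer, and Binarizer):

```python
from collections import Counter

def tokenize(text):
    # Tokenizer: lowercase and split on whitespace
    return text.lower().split()

def count_vectorize(docs):
    # CountVectorizer: build a vocabulary, then map each doc to term counts
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

def binarize(vector, threshold=0.0):
    # Binarizer: 1 if the value exceeds the threshold, else 0
    return [1 if v > threshold else 0 for v in vector]

docs = ["charming craftsman home", "modern condo modern views"]
vocab, vectors = count_vectorize(docs)
```

In Spark ML the same steps run distributed over a DataFrame; here they are plain lists, but the shape of the output is the same.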

Many storage layers to choose from

[Diagram: Amazon EMR at the center, connecting to Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, and Amazon S3]

[Example: training data for bank loan write-off predictions]

Classification algorithms in Spark ML

- Logistic regression

- Decision tree classifier

- Random forest classifier

- Gradient-boosted tree classifier

- Multilayer perceptron classifier

- One-vs-Rest classifier

- Naive Bayes

What is logistic regression?
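A minimal refresher in plain Python (the weights and features below are made up for illustration): logistic regression passes a weighted sum of features through the sigmoid function to produce a probability, which is thresholded at 0.5 for classification.

```python
import math

def sigmoid(z):
    # Squashes any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Weighted sum of features, then sigmoid
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical loan write-off model: weights and feature values are invented
weights = [1.5, -2.0]   # e.g. [debt_ratio, years_employed]
bias = -0.5
p = predict_proba(weights, bias, [0.8, 2.0])
label = 1 if p >= 0.5 else 0
```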

What are decision trees?

Weather predictors for Golf

Decision trees: tree induction

Decision trees: partition data with hyperplanes
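The tree-induction idea on these slides can be sketched in a few lines: pick the axis-aligned split (hyperplane) that minimizes impurity. A stdlib Python sketch with made-up golf-weather data:

```python
def gini(labels):
    # Gini impurity of a binary label set
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    # Try each midpoint between sorted x values as a split x <= t,
    # keeping the threshold with the lowest weighted impurity
    best_t, best_score = None, float("inf")
    pts = sorted(zip(xs, ys))
    for i in range(1, len(pts)):
        t = (pts[i - 1][0] + pts[i][0]) / 2
        left = [y for x, y in pts if x <= t]
        right = [y for x, y in pts if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pts)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up data: temperature vs. "play golf" label; a clean split exists at 75
temps = [60, 65, 70, 80, 85, 90]
play = [1, 1, 1, 0, 0, 0]
```

Real tree induction repeats this greedily on each partition; Spark ML's decision tree classifier does the same search distributed across features and bins.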

Spark ML pipelines - training

Spark ML pipelines - testing

Creating a Spark ML pipeline

// Chain feature assembly, label indexing, and a decision tree into one pipeline
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))

// Fit the whole pipeline on the training DataFrame
val model = pipeline.fit(df)

// Apply the fitted pipeline to generate predictions
val predictions = model.transform(df)

Save and load machine learning models and full Pipelines

Tools to pick the right model

- CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters
- Split the input data into separate training and test datasets
- For each (training, test) pair, iterate through the set of ParamMaps
- Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model's performance using the Evaluator
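The selection loop described above can be sketched in plain Python (this mimics TrainValidationSplit's logic, not Spark's API; the estimator and evaluator here are toys):

```python
def train_validation_select(data, labels, param_maps, fit, evaluate, train_frac=0.75):
    # Mirror TrainValidationSplit: one train/validation split,
    # fit one model per ParamMap, keep the best-scoring one
    n = int(len(data) * train_frac)
    train_x, val_x = data[:n], data[n:]
    train_y, val_y = labels[:n], labels[n:]
    best = None
    for params in param_maps:
        model = fit(train_x, train_y, params)    # Estimator.fit
        score = evaluate(model, val_x, val_y)    # Evaluator (higher is better)
        if best is None or score > best[0]:
            best = (score, params, model)
    return best

# Toy estimator: the "model" is just a threshold taken from the ParamMap;
# the evaluator measures accuracy of (x > t) against the labels
fit = lambda x, y, p: p["t"]
evaluate = lambda t, x, y: sum((xi > t) == yi for xi, yi in zip(x, y)) / len(x)

data = [5, 1, 6, 2, 7, 3, 8, 4]
labels = [True, False, True, False, True, False, True, False]
best_score, best_params, _ = train_validation_select(
    data, labels, [{"t": 2}, {"t": 4}, {"t": 6}], fit, evaluate, train_frac=0.5)
```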

Why Amazon EMR?

Easy to use: launch a cluster in minutes

Low cost: pay an hourly rate

Open-source variety: latest versions of software

Managed: spend less time monitoring

Secure: easy-to-manage options

Flexible: customize the cluster

Develop fast using notebooks and IDEs

Spark on YARN

• Run the Spark driver in client or cluster mode
• A Spark application runs as a YARN application
• SparkContext runs as a library in your program, one instance per Spark application
• Spark executors run in YARN containers on NodeManagers in your cluster
• Access the Spark UI through the Resource Manager or Spark History Server

Spark UI

Monitor your Spark jobs

Auto Scaling for data science on-demand

YARN metrics

Coming soon: advanced Spot provisioning

[Diagram: master node, core instance fleet, and task instance fleet]

• Provision from a list of instance types with Spot and On-Demand

• Launch in the most optimal AZ based on capacity/price

• Spot Block support

Productionizing your pipeline

• Amazon EMR Step API: submit a Spark application to Amazon EMR

• AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows

• AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster

Recommendation Systems @ Zillow Group

Jasjeet Thind, Sr. Director, Data Science & Engineering

Agenda

Intro to Zillow Group

Recommendation Use Cases

Architecture

Algorithms

Training & Scoring Pipeline

Metrics

Zillow Group

Build the world's largest, most trusted, and vibrant home-related marketplace.

Recommendation use cases

Email - homes for sale / for rent

Home Details - homes for sale / homes like this

Personalized Search

Mobile - smart SMS and push notifications

Home owner / pre-seller predictions

Lender selection algorithm

Similar photos / video

Architecture

Recommendation API (Python, R, Flask)

Zillow Group Data Lake (S3 / Kinesis)

Property Featurization (Spark EMR)

User Profiles (Spark EMR)

Ranking (Spark EMR)

Wedge Counting / Collaborative Filtering (Spark EMR)

Property Aggregate Features (Spark EMR)

Data Collection Systems (Java/Python/SQL)

Like vs. dislike

Predict homes per user using behavior of similar users

Like = user actively engaged with property

Dislike = user viewed property but weak engagement

[Diagram: Spencer's and Stan's likes (+) and dislikes (-) of the $22M, $19M, and $664K homes; the model must predict Stan's unknown (?) reaction to the $19M home]

Feature       Description
uid           unique id of user
pid           property id
first_visit   timestamp or 0
num_views     sigmoid(# views)
time_spent    time on page
num_contacts  # leads sent
num_saves     # saves on zpid
num_shares    # shares on zpid
num_photos    # photos viewed

Wedge count

For each user & property pair that needs a prediction, perform a wedge count

- http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf

Does Stan like the $19M home? Wedge #3 (wedge03_cnt) and wedge #5 (wedge05_cnt)
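A plain-Python sketch of the wedge count. Here a wedge for the pair (u, i) is read as a length-3 path u - p - v - i: u engaged shared property p, a second user v engaged p, and v engaged the target property i; the three +/- signs form a 3-bit wedge type 0-7. The sign encoding and the example engagements are assumptions for illustration, not Zillow's exact scheme.

```python
from collections import defaultdict

def wedge_counts(ratings, u, i):
    # ratings: dict (user, property) -> +1 (like) or -1 (dislike)
    by_user = defaultdict(dict)
    for (user, prop), sign in ratings.items():
        by_user[user][prop] = sign

    def bit(sign):
        # Assumed encoding: like -> 1, dislike -> 0
        return 1 if sign > 0 else 0

    counts = [0] * 8
    for v, props in by_user.items():
        if v == u or i not in props:
            continue          # v must have rated the target property i
        s3 = props[i]
        for p, s2 in props.items():
            if p == i or p not in by_user[u]:
                continue      # p must be a property u and v share
            s1 = by_user[u][p]
            counts[(bit(s1) << 2) | (bit(s2) << 1) | bit(s3)] += 1
    return counts

# Hypothetical engagements for the Spencer / Stan example
ratings = {("Stan", "$22M"): 1, ("Stan", "$664K"): -1,
           ("Spencer", "$22M"): 1, ("Spencer", "$664K"): -1,
           ("Spencer", "$19M"): 1}
```

The 8 counts (plus their normalized variants) become the feature vector for the pair, as listed on the next slide.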

[Diagrams: two example wedges ending at Stan's unknown (?) rating of the $19M home, one through the $22M home and one through the $664k home, each combining Spencer's and Stan's +/- engagements with Spencer's + on the $19M home]

Classifier

Gradient Boosting Classifier (sklearn)

For popular users / properties:

- Divide wedge counts by the degree product j_u * k_i

To predict for all user / property pairs, limit the candidate set to:

- Top 10 zip codes
- 300 properties per user

features: wedge00_cnt through wedge07_cnt, plus wedge00_norm_cnt through wedge07_norm_cnt

Does Stan like the $19M home? The feature vector for the pair (uid: Stan, pid: $19M) is the set of wedge counts above.

User profile

Signals - website, mobile app, and search queries

Binary classification

- labels (like/dislike) same as collab filtering model

User profile model determines preference scores

Features (categorical variables):

Bath:      0_bath, 0.5_bath, 1_bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
Bed:       0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price:     100_125_price, 125_150_price, 150_175_price
Use Code:  condo, single_family, farm_land
Zipcode:   zip_98109

Each training row is (pid, uid, features, label), e.g. features {0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6} with label 0 or 1.
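A sketch of how a listing might be bucketed into these categorical features (the helper name and the $25K price-bucket boundaries are assumptions; the feature names follow the slide):

```python
def featurize_listing(listing):
    # Map a listing's raw attributes onto the slide's categorical
    # feature names (price buckets in $1,000s; $25K width assumed)
    features = set()
    features.add(f"{listing['bath']}_bath")
    features.add(f"{listing['bed']}_bed")
    lo = (listing["price"] // 25000) * 25
    features.add(f"{lo}_{lo + 25}_price")
    features.add(listing["use_code"])
    features.add(f"zip_{listing['zip']}")
    return features

listing = {"bath": 2, "bed": 3, "price": 130000,
           "use_code": "condo", "zip": "98109"}
feats = featurize_listing(listing)
```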

Ranking

Property matrix - feature space same as user profile

Dot product of property matrix with user profile vector

Age decay for older listings

Example (uid, pid) score: {"uId": "10307499", "pId": "1044183744"} -> 0.3364

             0_bed  1_bed  2_bed  3_bed           uid_0
    pid_0  [   1      0      0      0  ]        [ 0    ]      [ 0   ]
    pid_1  [   0      0      1      0  ]   x    [ 0.01 ]  =   [ 0.8 ]
    pid_2  [   1      0      0      0  ]        [ 0.8  ]      [ 0   ]
    pid_3  [   0      0      0      1  ]        [ 0.6  ]      [ 0.6 ]
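The ranking step as code (stdlib Python; the exponential half-life form of the age decay and its parameter are assumptions, since the slide only says older listings are decayed):

```python
def rank(property_matrix, user_profile, ages_days=None, half_life=7.0):
    # Score each property: dot product of its one-hot feature row with
    # the user's preference vector, optionally decayed by listing age
    scores = []
    for row_idx, row in enumerate(property_matrix):
        s = sum(p * u for p, u in zip(row, user_profile))
        if ages_days is not None:
            s *= 0.5 ** (ages_days[row_idx] / half_life)
        scores.append(s)
    return scores

# Rows pid_0..pid_3 over the features [0_bed, 1_bed, 2_bed, 3_bed]
P = [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1]]
user = [0, 0.01, 0.8, 0.6]
```

Without decay this reproduces the matrix-vector product above; with ages supplied, a 14-day-old listing at half_life 7 is scored at a quarter of its raw dot product.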

Training & scoring

Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.

• User Behavior (Kinesis / S3) and Public Record (Kinesis / S3) flow in through an Event API (Java) and a Producer (Python)
• A Filter job (Spark) builds the User Store (Hive / S3): a Spark job creates a Hive table with user events (uid, pid) partitioned by date
• Active Listings (Kinesis / S3) arrive through a Producer (Python)
• A Training Data job (Spark) builds the Training Set (Hive / S3): a pid -> uid reverse index over past and current user events
• Train Models (Spark) fits the collaborative filtering / user profile models (Python) on wedge features or property features (user profile)
• Score (Spark) joins in Property Data and writes Recommendations to a Hashmap (Redis)

Offline evaluation

Hyperparameter tuning with validation set

Training/test data sets for model evaluation

Offline Metrics:

Precision:  r_k / k, where r_k = # recommended properties in the test set in the top k
Recall:     r_k / n, where n = total # properties in the test set
Freshness:  # listings recommended with modified date < y days old in the top k
Coverage:   # unique listings recommended across all users / total # unique listings
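These metrics as code (stdlib Python; the precision and recall formulas are the standard top-k definitions, which the slide's r_k and n appear to denote):

```python
def precision_at_k(recommended, test_set, k):
    # r_k / k: fraction of the top-k recommendations found in the test set
    r_k = sum(1 for pid in recommended[:k] if pid in test_set)
    return r_k / k

def recall_at_k(recommended, test_set, k):
    # r_k / n: fraction of the n test-set properties recovered in the top k
    r_k = sum(1 for pid in recommended[:k] if pid in test_set)
    return r_k / len(test_set)

def coverage(all_recommendations, all_listings):
    # Unique listings recommended across all users / total unique listings
    recommended = set().union(*all_recommendations)
    return len(recommended) / len(all_listings)
```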

Future work

Classifiers for listing descriptions

Deep learning on listing images

Structured streaming on Spark 2.0

Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy

Real-time scoring

Thank you!

jonfritz@amazon.com

aws.amazon.com/emr/

aws.amazon.com/blogs/big-data/

http://www.zillow.com/data-science/

Come join us @ Zillow Group!

Hiring:

- SDE, ML, Data Scientist

- Big Data Engineer

- Analytic Engineer

- Product Management

Remember to complete your evaluations!
