Spark Summit East NYC Meetup 02-16-2016

Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc

Spark and Recommendations

Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP

Spark-NYC Meetup @ Spark Summit Thanks, Bloomberg!

Feb 16th, 2016

Chris Fregly Principal Data Solutions Engineer

We’re Hiring! (Only Nice People) advancedspark.com!


Who Am I?


Streaming Data Engineer Netflix OSS Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Meetup Organizer Advanced Apache Meetup

Book Author Advanced .

Due 2016


Advanced Apache Spark Meetup http://advancedspark.com

Meetup Metrics
Top 5 Most-active Spark Meetup!
2600 Members in just 6 mos!!
2600 Docker downloads (demos)

Meetup Mission
Deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance


Live, Interactive Demo!! Audience Participation Required

(cell phone or laptop)



demo.advancedspark.com

[Architecture diagram]
End User -> ElasticSearch -> Spark ML -> Data Scientist ->
<- Kafka <- Spark Streaming <- Cassandra, Redis <- Zeppelin, iPython


Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools
Netflix Recommendations and Data Pipeline


Scaling with Parallelism

[Figure labels: Peter, O(log n)]


Scaling with Composability

Max (a max b max c max d) == (a max b) max (c max d)

Set Union (a U b U c U d) == (a U b) U (c U d)

Addition (a + b + c + d) == (a + b) + (c + d)

Multiply (a * b * c * d) == (a * b) * (c * d)

Division??



What about Division?
Division (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
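As a rough illustration of the composability slides, the sketch below reduces a list in independent chunks, the way parallel workers might, and then combines the partial results. The helper name `reduce_in_chunks` and the chunk size are made up for this example; they are not from the talk.

```python
from functools import reduce
import operator


def reduce_in_chunks(op, values, chunk_size=2):
    """Reduce each chunk on its own, then combine the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


values = [3, 4, 7, 8]

# Associative operations: chunked and sequential reduction agree.
max_agrees = reduce_in_chunks(max, values) == reduce(max, values)
add_agrees = reduce_in_chunks(operator.add, values) == reduce(operator.add, values)
mul_agrees = reduce_in_chunks(operator.mul, values) == reduce(operator.mul, values)

# Division is not associative: the grouping changes the answer.
sequential_division = reduce(operator.truediv, values)         # ((3 / 4) / 7) / 8
chunked_division = reduce_in_chunks(operator.truediv, values)  # (3 / 4) / (7 / 8)
```

Only operations that pass this kind of check can be split across partitions safely.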


What were the Egyptians thinking?! Not Composable

“Divide like an Egyptian”


What about Average?

Overall AVG, over [value, count] pairs [3, 1], [5, 1], [5, 1], [7, 1]:
((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5

Pairwise AVG:
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5

Divide, Add, Divide? Not Composable

Single Divide at the End? Doesn’t need to be Composable!

AVG (3, 5, 5, 7) == 5

Add, Add, Add? Composable!
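A minimal sketch of the "add, add, add, then a single divide at the end" idea: each value is carried as a (sum, count) pair, pairs are combined with plain addition, and the division happens once. The function names `to_pair`, `combine`, and `finish` are illustrative only.

```python
from functools import reduce


def to_pair(value):
    """Lift a value into a (sum, count) pair."""
    return (value, 1)


def combine(left, right):
    """Associative: add the sums and add the counts."""
    return (left[0] + right[0], left[1] + right[1])


def finish(pair):
    """The single divide at the end."""
    total, count = pair
    return total / count


pairs = [to_pair(v) for v in [3, 5, 5, 7]]

left_to_right = reduce(combine, pairs)
pairwise = combine(combine(pairs[0], pairs[1]), combine(pairs[2], pairs[3]))
average = finish(pairwise)
```

Because `combine` is associative, the grouping of the partial results does not matter.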


Similarity



Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude


Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalizes to unit vectors
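To make the magnitude bias concrete, here is a small sketch comparing the two measures on hypothetical viewing vectors (the vectors and variable names are invented for illustration). The two users have the same taste profile but very different volumes.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, 10x the magnitude (e.g. a casual viewer vs. a binge viewer).
casual_viewer = [1.0, 2.0, 3.0]
binge_viewer = [10.0, 20.0, 30.0]

distance = euclidean_distance(casual_viewer, binge_viewer)  # large: magnitude dominates
similarity = cosine_similarity(casual_viewer, binge_viewer) # ~1.0: angle is identical
```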


Jaccard Similarity
Set similarity measurement
Set intersection / set union
Based on Jaccard distance
Bias towards popularity
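A short sketch of Jaccard similarity on tag sets. The tags below are hypothetical, and treating two empty sets as identical is an assumption of this example.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


top_gun_tags = {"action", "military", "aviation"}
a_few_good_men_tags = {"drama", "military", "courtroom"}

similarity = jaccard_similarity(top_gun_tags, a_few_good_men_tags)  # 1 shared / 5 total
```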


Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem


Word Similarity
Edit Distance
Calculate char differences between words
Deletes, transposes, replaces, inserts
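The slide does not say which edit-distance variant is meant. Since it lists transposes alongside deletes, replaces, and inserts, the sketch below assumes the "optimal string alignment" variant, where each of those four operations costs 1.

```python
def edit_distance(source, target):
    """Edit distance counting deletes, inserts, replaces, and adjacent transposes."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every char of source
    for j in range(cols):
        dist[0][j] = j  # insert every char of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # replace (or match)
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transpose
    return dist[rows - 1][cols - 1]
```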


Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines

Word2Vec
Words embedded in vector space, near similar words
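A bare-bones TF/IDF sketch, assuming whitespace tokenization, relative term frequency, and a natural-log inverse document frequency. Real search engines use smoothed variants; the documents below are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["top gun fighter pilot", "a few good men courtroom", "top pilot school"]
weights = tf_idf(docs)
```

Terms that appear in fewer documents ("gun") score higher than terms shared across documents ("top"), and a term present in every document scores zero.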


Similarity Pathway
i.e. closest recommendations between 2 people


Calculating Similarity
Exact Brute-Force
“All-pairs similarity”, aka “Pair-wise similarity”, “Similarity join”
Cartesian O(n^2) shuffle and comparison

Approximate
Sampling
Bucketing (aka “Partitioning”, “Clustering”)
Remove data with low probability of similarity
Reduce shuffle and comparisons


Bonus: Document Summary
Text Rank, aka “Sentence Rank”
TF/IDF + Similarity Graph + PageRank

Intuition
Surface summary sentences (abstract)
Most similar to all others (TF/IDF + Similarity Graph)
Most influential sentences (PageRank)


Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights


Topic-Sensitive PageRank
Graph diffusion algorithm
Pre-process graph, add vector of probabilities to each vertex
Probability of landing at this vertex from every other vertex
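The slide does not spell out the algorithm, so the following is only a sketch of one common formulation: PageRank by power iteration where the random jump is biased towards a chosen "topic" set of vertices instead of being uniform. The graph, the `teleport` parameter name, and the damping value are assumptions for illustration.

```python
def personalized_pagerank(graph, teleport, damping=0.85, iterations=50):
    """graph: {vertex: [neighbor, ...]}; every neighbor must also be a key.
    teleport: {vertex: probability}; keys must be in graph and sum to 1."""
    rank = {vertex: teleport.get(vertex, 0.0) for vertex in graph}
    for _ in range(iterations):
        # Random jumps land only on the topic vertices.
        next_rank = {v: (1 - damping) * teleport.get(v, 0.0) for v in graph}
        for vertex, neighbors in graph.items():
            if neighbors:
                share = damping * rank[vertex] / len(neighbors)
                for neighbor in neighbors:
                    next_rank[neighbor] += share
            else:
                # Dangling vertex: hand its rank back to the topic vertices.
                for target, probability in teleport.items():
                    next_rank[target] += damping * rank[vertex] * probability
        rank = next_rank
    return rank


# Hypothetical similarity graph of movies, actors, and tags.
graph = {
    "top_gun": ["tom_cruise", "action"],
    "tom_cruise": ["top_gun", "a_few_good_men"],
    "a_few_good_men": ["tom_cruise", "drama"],
    "action": ["top_gun"],
    "drama": ["a_few_good_men"],
}
rank = personalized_pagerank(graph, teleport={"top_gun": 1.0})
```

Vertices close to the topic vertex (here "action") end up ranked above distant ones ("drama").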


Recommendations



Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like, rating, view movie, read profile, search terms
Implicit User Feedback: click, hover, scroll, navigation
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features


Features Binary Features: True or False

Numeric Discrete Features: Integers

Numeric Features: Real values

Ordinal Features: Maintain order (S -> M -> L -> XL -> XXL)

Temporal Features: Time-based (Time of Day, Binge View)

Categorical Features: Finite, unique categories (sports teams)

Latent Features: Hidden features that arise from within data



Feature Engineering
Dimension Reduction
Reduce number of features (aka “feature space”)

Principal Component Analysis (PCA)
Find principal features that describe the data in terms of variance
Peel the dimensional layers back until you describe the data

Example: One-Hot Encoding
Convert categorical feature values to 0’s, 1’s
Remove any hint of a relationship between the categories
1 binary column per category

Category   Integer   One-Hot
Bears      1         [1,0,0]
49’ers     2         [0,1,0]
Steelers   3         [0,0,1]
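The one-hot example can be sketched in a few lines. The function name `one_hot_encode` is illustrative, and assigning columns in first-seen order is an assumption chosen so the output matches the slide.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per category."""
    categories = list(dict.fromkeys(values))  # unique values, first-seen order
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return categories, rows


teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, encoded = one_hot_encode(teams)
```

Unlike the integer encoding (1, 2, 3), the binary columns imply no ordering or distance between teams.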


Non-Personalized Recommendations



Cold Start Problem
New user, don’t know their pref, must show them something!

Movies with highest-rated actors
Top K Aggregations

Most desirable singles
PageRank of like activity

Facebook social graph
Recommend friend activity


Personalized Recommendations



Clustering (aka Nearest Neighbors)
User-to-User Clustering
Similar items viewed or rated
Similar viewing pattern (i.e. binge or casual)

Item-to-Item Clustering
Similar item tags/metadata (Jaccard Similarity, Locality Sensitive Hash)
Similar profile text and categories (TF/IDF, Word2Vec, NLP, One-Hot)
Similar images/facial structures (Convolutional Neural Nets, Eigenfaces)

http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
[Images: My OKCupid Profile, My Hinge Profile, Dating Site]


Bonus: NLP Conversation Bot

“If your responses to my generic opening lines are positive, I may read your profile.”
Spark ML and Stanford CoreNLP: TF/IDF, DecisionTrees, Sentiment Analysis


User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Smaller matrices, when multiplied, approximate original
③ Fill in the missing values in the large matrix
④ Surface latent features from within user-item interaction
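The four steps can be sketched with a tiny pure-Python factorization. This uses plain stochastic gradient descent rather than the ALS solver that Spark ML ships, and the ratings, rank, learning rate, and regularization values are all made-up assumptions for illustration.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, epochs=1000,
              learning_rate=0.01, regularization=0.02, seed=42):
    """ratings: {(user, item): rating}. Returns (user_factors, item_factors)."""
    rng = random.Random(seed)
    user_factors = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    item_factors = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for (user, item), rating in ratings.items():
            u, v = user_factors[user], item_factors[item]
            error = rating - sum(a * b for a, b in zip(u, v))
            for f in range(rank):
                u_f, v_f = u[f], v[f]
                u[f] += learning_rate * (error * v_f - regularization * u_f)
                v[f] += learning_rate * (error * u_f - regularization * v_f)
    return user_factors, item_factors


def predict(user_factors, item_factors, user, item):
    """Multiply the two small matrices back together for one cell."""
    return sum(a * b for a, b in zip(user_factors[user], item_factors[item]))


def training_rmse(ratings, user_factors, item_factors):
    squared = [(r - predict(user_factors, item_factors, u, i)) ** 2
               for (u, i), r in ratings.items()]
    return (sum(squared) / len(squared)) ** 0.5


# Sparse 5 users x 4 items matrix; missing cells are simply absent.
ratings = {(0, 0): 5, (0, 1): 3, (0, 3): 1, (1, 0): 4, (1, 3): 1,
           (2, 0): 1, (2, 1): 1, (2, 3): 5, (3, 0): 1, (3, 3): 4,
           (4, 1): 1, (4, 2): 5, (4, 3): 4}
user_factors, item_factors = factorize(ratings, n_users=5, n_items=4)
filled_in = predict(user_factors, item_factors, user=1, item=1)  # a missing cell
```

The columns of the two factor matrices play the role of the latent features mentioned in step ④.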


Item-to-Item Collaborative Filtering
Made famous by Amazon Paper ~2003

Problem
As # of users grew, Matrix Factorization couldn’t scale

Solution
Offline/Batch: Generate itemId -> List[userId] vectors
Online/Real-time: For each item in cart, recommend similar items from vector space
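A minimal sketch of the offline/online split, assuming cosine similarity over binary user vectors (the slide does not name the similarity measure). The view log and function names are hypothetical.

```python
from collections import defaultdict


def build_item_vectors(views):
    """Offline/batch step: itemId -> set of userIds who viewed it."""
    item_users = defaultdict(set)
    for user_id, item_id in views:
        item_users[item_id].add(user_id)
    return dict(item_users)


def similar_items(item_users, item_id, top_k=2):
    """Online step: rank other items by cosine similarity of their user sets."""
    target = item_users[item_id]
    scored = []
    for other, users in item_users.items():
        if other == item_id:
            continue
        overlap = len(target & users)
        if overlap:
            # Cosine similarity for binary vectors.
            scored.append((overlap / (len(target) * len(users)) ** 0.5, other))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_k]]


views = [
    (1, "Top Gun"), (1, "A Few Good Men"),
    (2, "Top Gun"), (2, "A Few Good Men"),
    (3, "Top Gun"), (3, "Taps"),
    (4, "Harold and Kumar"),
]
item_users = build_item_vectors(views)
recommendations = similar_items(item_users, "Top Gun")
```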


When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (# errors between then and now)

Using machine learning or graph algos
Inherently probabilistic and approximate
Finding topics in documents (LDA)
Finding similar pairs of users, items, words at scale (LSH)
Finding top influencers (PageRank)

Streaming aggregations (distinct count or top k)
Inherently sloppy means of collecting (at least once delivery)

Approximate as much as you can get away with!
Ask for forgiveness later!!


When NOT to Approximate? If you’ve ever heard the term…

“Sarbanes-Oxley”

…in-that-order, at the office, after 2002.



A Few Good Algorithms


You can’t handle the approximate!


Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error


Bloom Filter Set.contains(key): Boolean

“Hash Multiple Times and Flip the Bits Wherever You Land”



Bloom Filter
Approximate set membership for key
False positive: expect contains(), actual !contains()
True negative: expect !contains(), actual !contains()
Elements are only added, never removed


Bloom Filter in Action

set(key)
contains(key): Boolean
TRUE -> maybe contains
FALSE -> definitely does not contain

Images by @avibryant
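"Hash multiple times and flip the bits wherever you land" can be sketched as below. The bit-array size, the number of hashes, and the use of seeded SHA-256 as the hash family are assumptions for this example, not details from the talk.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        """One bit position per hash function (seeded SHA-256 here)."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def contains(self, key):
        """True -> maybe contains; False -> definitely does not contain."""
        return all(self.bits[position] for position in self._positions(key))


bloom = BloomFilter()
for user in ["user1001", "user2009", "user3005"]:
    bloom.add(user)
```

There is no `remove`: clearing a bit could erase evidence of other keys that share it.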


CountMin Sketch Frequency Count and TopK

“Hash Multiple Times and Add 1 Wherever You Land”



CountMin Sketch (CMS)
Approximate frequency count and TopK for key
i.e. “Heavy Hitters” on Twitter: Matei Zaharia, Martin Odersky, Donald Trump


CountMin Sketch In Action (TopK, Count)

Use Case: TopK movies using total views

[Figure: CountMin Sketch grid. Images derived from @avibryant]
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
add(Top Gun, 2) (2 occurrences of “Top Gun” for slightly additional complexity)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Find minimum of all rows
Can overestimate, but never underestimate
Overlap: Top Gun, A Few Good Men
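"Hash multiple times and add 1 wherever you land" might look like the sketch below. The table dimensions and the seeded SHA-256 hash family are assumptions; the method names loosely follow the add/getCount calls on the slide.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth  # number of rows, 1 hash function per row
        self.width = width  # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        """Minimum over all rows: may overestimate, never underestimates."""
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


sketch = CountMinSketch()
sketch.add("Top Gun", 2)
sketch.add("A Few Good Men", 1)
sketch.add("Taps", 1)
top_gun_views = sketch.get_count("Top Gun")
```

Collisions only ever add to a counter, which is why taking the minimum across rows gives the tightest available estimate.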


HyperLogLog Count Distinct

“Hash Multiple Times and Uniformly Distribute Where You Land”



HyperLogLog (HLL)
Approximate count distinct
Slight twist: special hash function creates uniform distribution

Error estimate
14 bits for size of range
m = 2^14 = 16,384 hash slots
error = 1.04 / sqrt(16,384) = 0.81%

[Figure label: Not many of these]
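The error formula and the basic mechanics can be sketched as follows. This is a simplified HyperLogLog (raw estimate plus the small-range linear-counting correction only), using SHA-256 as a stand-in hash and 10 index bits by default; none of those choices come from the talk, and production implementations such as HyperLogLogPlus add further corrections.

```python
import hashlib
import math


def standard_error(index_bits):
    """Relative error for m = 2^index_bits registers."""
    return 1.04 / math.sqrt(2 ** index_bits)


class HyperLogLog:
    def __init__(self, index_bits=10):
        self.index_bits = index_bits
        self.num_registers = 1 << index_bits
        self.registers = [0] * self.num_registers

    def add(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big")  # 64-bit hash
        remaining_bits = 64 - self.index_bits
        index = hashed >> remaining_bits            # leading bits pick a register
        remainder = hashed & ((1 << remaining_bits) - 1)
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = remaining_bits - remainder.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union with another sketch of the same size: register-wise max."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.num_registers
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for m >= 128
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            estimate = m * math.log(m / zeros)      # small-range correction
        return round(estimate)


hour_1, hour_2 = HyperLogLog(), HyperLogLog()
for user in ["user1001", "user2009", "user3005", "user3003", "user1001"]:
    hour_1.add(user)
for user in ["user3003", "user7009"]:
    hour_2.add(user)
hour_1.merge(hour_2)  # hour 1 + 2: about 5 distinct users
```

Adding a duplicate never changes a register, and merging is a register-wise max, which is what makes the structure composable across hours or partitions.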


HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie

[Figure: hashed user IDs placed on number lines (0 to 16 and 0 to 32) in three panels: “Top Gun: Hour 1”, “Top Gun: Hour 2”, and “Top Gun: Hour 1 + 2”. Users shown: user1001, user1005, user2001, user2009, user3001, user3002, user3003, user3005, user4009, user6001, user7002, user7009, user8001, user8002]

Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales


Locality Sensitive Hashing Set Similarity

“Pre-process Items into Buckets, Compare Within Buckets”



Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets, b << m

Hash items multiple times
Similar items hash to overlapping buckets
Compare just contents of buckets
Much smaller cartesian … and parallel!!
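The talk does not name a specific LSH family. The sketch below assumes MinHash signatures with banding, a common choice for set (Jaccard) similarity: each set is hashed multiple times, the signature is cut into bands, and only sets that share a bucket in some band become candidate pairs. All names and parameters are illustrative.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def min_hash_signature(items, num_hashes=20):
    """One minimum hash value per seeded hash function; items must be non-empty."""
    return [
        min(int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items)
        for seed in range(num_hashes)
    ]


def candidate_pairs(item_sets, num_hashes=20, bands=10):
    """Bucket each set once per band; compare only within buckets."""
    rows_per_band = num_hashes // bands
    buckets = defaultdict(set)
    for name, items in item_sets.items():
        signature = min_hash_signature(items, num_hashes)
        for band in range(bands):
            start = band * rows_per_band
            key = (band, tuple(signature[start:start + rows_per_band]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


tag_sets = {
    "movie_a": {"action", "military", "aviation"},
    "movie_b": {"action", "military", "aviation"},
    "movie_c": {"comedy", "road-trip", "burgers"},
}
pairs = candidate_pairs(tag_sets)
```

Only the candidate pairs need an exact similarity check, which is where the saving over the full cartesian comparison comes from.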


DIMSUM Set Similarity

“Pre-process and ignore data that is unlikely to be similar.”



DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)

Twitter DIMSUM Case Study
40% efficiency gain over brute-force Cosine Sim


Common Tools to Approximate

Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing


Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable

Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)


Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)

Add user views for given movie (duplicates are ignored)

PFADD TopGun_HLL user1001 user2009 user3005
PFADD TopGun_HLL user3003 user1001

Get distinct count (cardinality) of set

PFCOUNT TopGun_HLL
Returns: 4 (distinct users viewed this movie)

Union 2 HyperLogLog Data Structures

PFMERGE TopGun_HLL Taps_HLL


Spark Approximations

Spark Core
RDD.count*Approx()
PartialResult

Spark SQL
approxCountDistinct(column), HyperLogLogPlus

Spark ML
Stratified sampling: PairRDD.sampleByKey(fractions: Double[ ])
DIMSUM sampling: RowMatrix.columnSimilarities(threshold)
Probabilistic sampling reduces amount of comparison shuffle

Spark Streaming
A/B testing: StreamingTest.setTestMethod(“welch”).registerStream(dstream)


Demos!



Counting Exact Count vs. Approx HyperLogLog, CountMin Sketch



HashSet vs. HyperLogLog (Memory)



HashSet vs. CountMin Sketch (Memory)



Set Similarity
Brute Force vs. Locality Sensitive Hashing Similarity


Brute Force Cartesian All Pair Similarity


47 seconds


Locality Sensitive Hash All Pair Similarity


6 seconds


Many More Demos!

or Download Docker Clone Github


http://advancedspark.com


Netflix Recommendation & Data Pipeline From 5 Stars to Trending Now



Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.

My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle. Renamed my favorite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!

Summary: Buy NFLX Stock!


$1 Million Netflix Prize (2006-2009)
Goal
Improve movie predictions by 10% (RMSE)

Dataset
(userId, movieId, rating, timestamp)
Test data withheld to calculate RMSE upon submission

Winning algorithm
10.06% improvement (RMSE)
Ensemble of 500+ ML models combined with GBDTs
Computationally impractical
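For reference, RMSE and a relative improvement over a baseline can be computed as in this sketch. The sample ratings and the baseline figures used in the checks are invented; they are not the actual Netflix Prize numbers.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def improvement(baseline_rmse, new_rmse):
    """Relative reduction in RMSE, e.g. 0.10 for a 10% improvement."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical held-out ratings and two sets of predictions.
actual = [5, 3, 4, 1]
baseline_predictions = [4, 4, 4, 2]
better_predictions = [5, 3, 4, 2]

gain = improvement(rmse(baseline_predictions, actual),
                   rmse(better_predictions, actual))
```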


Secrets to the Winning Algorithms
Adjust for the following human bias…
① Alice Effect: rates lower than the average user
② Inception Effect: rated higher than the average movie
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Mood, time of day, day of week, season, weather
⑥ Number of days since user’s first rating
⑦ Number of days since movie’s first rating
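The first two adjustments can be sketched as a simple baseline predictor: global mean plus a per-movie offset plus a per-user offset. This is a common textbook formulation assumed here for illustration, not the winning team's actual model, and the ratings are invented.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, movie, rating). Returns (mean, user_bias, movie_bias)."""
    global_mean = sum(rating for _, _, rating in ratings) / len(ratings)

    # Movie bias: how far a movie sits above or below the global mean.
    movie_residuals = defaultdict(list)
    for _, movie, rating in ratings:
        movie_residuals[movie].append(rating - global_mean)
    movie_bias = {m: sum(r) / len(r) for m, r in movie_residuals.items()}

    # User bias: how far a user rates above or below, after the movie effect.
    user_residuals = defaultdict(list)
    for user, movie, rating in ratings:
        user_residuals[user].append(rating - global_mean - movie_bias[movie])
    user_bias = {u: sum(r) / len(r) for u, r in user_residuals.items()}
    return global_mean, user_bias, movie_bias


def predict(global_mean, user_bias, movie_bias, user, movie):
    """Unknown users or movies fall back to a bias of 0."""
    return global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)


ratings = [
    ("alice", "Inception", 4), ("alice", "Taps", 2),
    ("bob", "Inception", 5), ("bob", "Taps", 3),
]
global_mean, user_bias, movie_bias = fit_baseline(ratings)
```

Here "alice" rates half a star below average and "Inception" sits a full star above, so the two effects can be separated before any collaborative model is applied.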


Netflix Data Pipeline - Then


v1.0!

v2.0!


Netflix Data Pipeline - Now


v3.0!

8 million events per second


Netflix Recommendation Pipeline


Throw away batch-generated user factors (U)


Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
Ensembles


Netflix Trending Now
Time of day
Personalized to user (viewing history, past ratings)
Personalized to events (Valentine’s Day)

[Figure: “VHS”]
Number of Plays
Number of Impressions
Calculate Take Rate


Bonus: Pandora Time of Day Recs
Work Days
Play familiar music
User is less likely to accept new music

Evenings and Weekends
Play new music
More likely to accept new music


Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend without needing viewing history
Helps with Cold Start problem


Netflix Search
No results? No problem… Show similar results!

Empty searches are good!
Explicit feedback for future recommendations
Content to buy and produce!


Bonus: Netflix in 2004
Netflix noticed people started to rate movies higher!? Why?

Significant UI improvements made around that time
Recommendation improvements (Cinematch)


Thank You!!
Chris Fregly @cfregly
IBM Spark Tech Center http://spark.tc
San Francisco, California, USA

http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker

Find me: LinkedIn, Twitter, Github, Email, Fax

Image derived from http://www.duchess-france.org/
