Preliminaries

Data Mining

The art of extracting knowledge from large bodies of structured data.
Let's put it to use!
Recommendations

Basic Recommendations with Collaborative Filtering
Making Recommendations
The Netflix Prize (2006-2009)
What was the Netflix Prize?

• In October 2006, Netflix released a dataset containing 100 million anonymous movie ratings and challenged the data mining, machine learning, and computer science communities to develop systems that could beat the accuracy of its recommendation system, Cinematch.
• Thus began the Netflix Prize, an open competition for the best collaborative filtering algorithm to predict user ratings for films, based solely on previous ratings and without any other information about the users or films.
The Netflix Prize Datasets

• Netflix provided a training dataset of 100,480,507 ratings that 480,189 users gave to 17,770 movies.
– Each training rating (or instance) is of the form ⟨user, movie, date of rating, rating⟩.
– The user and movie fields are integer IDs, while ratings are from 1 to 5 (integral) stars.
The Netflix Prize Datasets

• The qualifying dataset contained 2,817,131 instances of the form ⟨user, movie, date of rating⟩, with the ratings known only to the jury.
• A participating team's algorithm had to predict ratings for the entire qualifying set, which consisted of a validation (quiz) set and a test set.
– During the competition, teams were only informed of their score on the quiz set of 1,408,342 ratings.
– The jury used a test set of 1,408,789 ratings to determine potential prize winners.
The Netflix Prize Data

Movie Ratings

        1    2    …    𝑚
  1     5    2    5    4
  2     2    5         3
  ⋮
        2    2    4    2
  𝑛     5    1    5    ?

Rows are users; each row is an instance (sample, example, observation).
Columns are movies; each column is a feature (attribute, dimension).
The Netflix Prize Goal

Movie Ratings

           Star Wars   Hoop Dreams   Contact   Titanic
Joe            5            2           5         4
John           2            5                     3
Al             2            2           4         2
Everaldo       5            1           5         ?

Goal: Predict ? (a movie rating) for a user.
The Netflix Prize Methods

Bennett, James, and Stan Lanning. "The Netflix Prize." Proceedings of KDD Cup and Workshop, 2007.
Some of these methods we will discuss now; the rest we will cover by the end of the course.
Raw Averages

User average: Simply assign the average rating given by the user:

    \bar{r}_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{u,i}, \quad u \in U

where U is the set of all users and I_u is the set of items rated by user u.

Item average: Simply assign the average rating given to the item:

    \bar{r}_i = \frac{1}{|U_i|} \sum_{u \in U_i} r_{u,i}, \quad i \in I

where I is the set of all items, U_i is the set of users who have rated item i, and r_{u,i} is the rating given to item i by user u.
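The two baselines above can be sketched in Python as follows. This is a minimal sketch assuming a dict-of-dicts rating store; the example data reuses the Joe/John/Al/Everaldo table from earlier, with blank cells simply absent from the dicts.

```python
# Illustrative rating store: user -> {item: rating}. Missing keys mean
# the user did not rate that item.
ratings = {
    "Joe":      {"Star Wars": 5, "Hoop Dreams": 2, "Contact": 5, "Titanic": 4},
    "John":     {"Star Wars": 2, "Hoop Dreams": 5, "Titanic": 3},
    "Al":       {"Star Wars": 2, "Hoop Dreams": 2, "Contact": 4, "Titanic": 2},
    "Everaldo": {"Star Wars": 5, "Hoop Dreams": 1, "Contact": 5},
}

def user_average(user):
    """Mean rating given by `user` over the items they rated."""
    vals = list(ratings[user].values())
    return sum(vals) / len(vals)

def item_average(item):
    """Mean rating given to `item` over the users who rated it."""
    vals = [r[item] for r in ratings.values() if item in r]
    return sum(vals) / len(vals)

print(user_average("Joe"))       # → 4.0 (mean of 5, 2, 5, 4)
print(item_average("Star Wars")) # → 3.5 (mean of 5, 2, 2, 5)
```

Either average can serve as the prediction for a missing cell; the slide's follow-up question hints at why neither is enough on its own.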
What about universally good or bad movies? Or skewed rating systems?
Bayesian Method

Apply Bayes' Theorem: Of the ratings r ∈ R a user could give for a movie, assign the one with the highest value of

    P(r \mid i) = \frac{P(i \mid r)\, P(r)}{P(i)}

where P(r | i) is the (conditional) probability of rating r given item i, P(i | r) is the (conditional) probability of item i given rating r, P(r) is the (prior) probability of rating r, and P(i) is the (prior) probability of item i.
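A sketch of this rule, assuming each probability is estimated by simple counting over a toy dataset (the triples below are illustrative, not the Netflix data). Since P(i) is constant for a fixed item, the argmax over r reduces to the item's most frequent rating.

```python
from collections import Counter

# Illustrative (user, item, rating) triples.
data = [
    ("u1", "Contact", 5), ("u2", "Contact", 5), ("u3", "Contact", 4),
    ("u1", "Titanic", 3), ("u2", "Titanic", 4), ("u3", "Titanic", 4),
]

n = len(data)
count_ir = Counter((i, r) for _, i, r in data)  # joint counts for P(i, r)

def predict(item):
    """Return the rating r maximizing P(r | item).

    P(i | r) P(r) / P(i) = [count(i,r)/count(r)] * [count(r)/n] / P(i),
    which is proportional to count(i, r), so we maximize that count.
    """
    return max(range(1, 6), key=lambda r: count_ir[(item, r)])

print(predict("Contact"))  # → 5 (two 5s vs. one 4)
```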
But this method still doesn't account for the similarity between users.
Cute Kitten Picture Intermission
Key to Collaborative Filtering

Common insight: personal tastes are correlated.

If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y, especially (perhaps) if Bob knows Alice.
Collaborative Filtering

Collaborative filtering (CF) systems work by collecting user feedback in the form of ratings for items in a given domain, and by exploiting similarities in rating behavior among users to determine how to recommend an item.
Collaborative Filtering Dataset

Items

        1    2    …    𝑚
  1     5    2    5    4
  2     2    5         3
  ⋮
        2    2    4    2
  𝑛     5    1    5    ?

Rows are users.

Goal: Predict ? (an item rating) for 𝑛 (a user).
Types of Collaborative Filtering

1. Neighborhood- or Memory-based
2. Model-based
3. Hybrid
We'll talk about the first type, neighborhood- or memory-based CF, now.
Neighborhood-based CF

A subset of users is chosen based on their similarity to the active user, and a weighted combination of their ratings is used to produce predictions for this user.
Neighborhood-based CF

It has three steps:

1. Assign a weight to all users with respect to their similarity to the active user.
2. Select the k users that have the highest similarity with the active user, commonly called the neighborhood.
3. Compute a prediction from a weighted combination of the selected neighbors' ratings.
Neighborhood-based CF

Step 1

In step 1, the weight w_{a,u} is a measure of similarity between the user u and the active user a. The most commonly used measure of similarity is the Pearson correlation coefficient between the ratings of the two users:

    w_{a,u} = \frac{\sum_{i \in I} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i \in I} (r_{a,i} - \bar{r}_a)^2} \sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}}

where I is the set of items rated by both users, r_{u,i} is the rating given to item i by user u, and \bar{r}_u is the mean rating given by user u.
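A minimal Python sketch of this Step 1 weight; the dict-based rating representation and the example users are illustrative. Note one implementation choice: the means here are computed over the co-rated items I only, whereas the slide defines r̄_u as the user's overall mean; both variants appear in practice.

```python
from math import sqrt

def pearson(ra, ru):
    """Pearson correlation between two users over their co-rated items.

    `ra` and `ru` map item -> rating for users a and u.
    """
    common = set(ra) & set(ru)          # I: items rated by both users
    if len(common) < 2:
        return 0.0                      # too little overlap to correlate
    mean_a = sum(ra[i] for i in common) / len(common)
    mean_u = sum(ru[i] for i in common) / len(common)
    num = sum((ra[i] - mean_a) * (ru[i] - mean_u) for i in common)
    den = sqrt(sum((ra[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((ru[i] - mean_u) ** 2 for i in common))
    return num / den if den else 0.0

a = {"Star Wars": 5, "Contact": 5, "Titanic": 4}
u = {"Star Wars": 2, "Contact": 4, "Titanic": 2}
print(pearson(a, u))  # → 0.5
```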
Neighborhood-based CF

Step 2

In step 2, some sort of threshold is used on the similarity score to determine the "neighborhood."
Neighborhood-based CF

Step 3

In step 3, predictions are generally computed as the weighted average of deviations from the neighbors' means, as in:

    p_{a,i} = \bar{r}_a + \frac{\sum_{u \in K} (r_{u,i} - \bar{r}_u) \, w_{a,u}}{\sum_{u \in K} w_{a,u}}

where p_{a,i} is the prediction for the active user a for item i, w_{a,u} is the similarity between users a and u, and K is the neighborhood or set of most similar users.
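A sketch of this Step 3 prediction. The neighbor representation is an illustrative assumption, and the denominator here sums |w| rather than w, a common guard when Pearson weights can be negative.

```python
def predict(mean_a, neighbors):
    """p_{a,i}: active user's mean plus weighted mean of neighbor deviations.

    `mean_a` is the active user's mean rating; `neighbors` is a list of
    (mean_u, r_ui, w_au) triples for the neighborhood K: each neighbor's
    mean rating, their rating of item i, and their similarity to a.
    """
    num = sum((r_ui - mean_u) * w for mean_u, r_ui, w in neighbors)
    den = sum(abs(w) for _, _, w in neighbors)
    return mean_a + num / den if den else mean_a

# Active user averages 3.5 stars; both neighbors rated item i above
# their own means, so the prediction lands above 3.5.
print(predict(3.5, [(3.0, 4.0, 0.8), (2.5, 4.0, 0.4)]))
```

Working through the example: the deviations are 1.0 and 1.5, so the prediction is 3.5 + (1.0·0.8 + 1.5·0.4)/(0.8 + 0.4) ≈ 4.67.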
Neighborhood-based CF

• Common Problems:
– The search for similar users has high computational complexity, causing conventional neighborhood-based CF algorithms to not scale well.
– It is common for the active user to have highly correlated neighbors that are based on very few co-rated (overlapping) items, which often results in bad predictors.
– When measuring the similarity between users, items that have been rated by all (and universally liked or disliked) are not as useful as less common items.
Item-to-Item Matching

• An extension to neighborhood-based CF.
• Addresses the high computational complexity of searching for similar users.
• The idea: rather than matching similar users, match a user's rated items to similar items.
Item-to-Item Matching

In this approach, similarities between pairs of items i and j are computed off-line using Pearson correlation, given by:

    w_{i,j} = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2} \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}}

where U is the set of all users who have rated both items i and j, r_{u,i} is the rating of user u on item i, and \bar{r}_i is the average rating of the i-th item across users.
Item-to-Item Matching

Now, the rating of item i for user a can be predicted using a simple weighted average, as in:

    p_{a,i} = \frac{\sum_{j \in K} r_{a,j} \, w_{i,j}}{\sum_{j \in K} w_{i,j}}

where K is the neighborhood set of the k items rated by a that are most similar to i, and r_{a,j} is the rating of user a on item j.
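A sketch of this item-based prediction, assuming the item-item weights w_{i,j} have already been computed off-line; the item names and similarity values are illustrative. As in the user-based sketch, the denominator uses |w| to guard against negative weights.

```python
def predict_item(rated, sims):
    """Weighted average of the user's own ratings on items similar to i.

    `rated` maps item -> the active user's rating r_{a,j};
    `sims` maps item j in the neighborhood K -> its precomputed
    similarity w_{i,j} to the target item i.
    """
    num = sum(rated[j] * w for j, w in sims.items() if j in rated)
    den = sum(abs(w) for j, w in sims.items() if j in rated)
    return num / den if den else None

# Illustrative neighborhood K of items the user has rated:
rated = {"Contact": 5, "Titanic": 4}
sims = {"Contact": 0.9, "Titanic": 0.3}  # similarity of each item to i
print(predict_item(rated, sims))  # → 4.75
```

The prediction leans toward the Contact rating because that item is far more similar to the target.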
Significance Weighting

• Another extension to neighborhood-based CF.
• Addresses the problem of bad predictors generated when the active user has highly correlated neighbors that are based on very few co-rated (overlapping) items.
• The idea: multiply the similarity weight by a significance weighting factor, which devalues correlations based on few co-rated items.
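One common form of the factor, a sketch with an assumed threshold of 50 co-rated items (the slide does not fix a specific factor; this n/50 form is due to Herlocker et al.):

```python
def significance_weight(w_au, n_common, threshold=50):
    """Scale the similarity w_au down when few items were co-rated.

    The factor min(n_common, threshold) / threshold is 1 once the users
    share at least `threshold` co-rated items, and shrinks linearly below.
    """
    return w_au * min(n_common, threshold) / threshold

print(significance_weight(0.9, 5))   # 5 co-rated items: heavily devalued
print(significance_weight(0.9, 80))  # ample overlap: weight unchanged
```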
Inverse User Frequency

• Yet another extension to neighborhood-based CF.
• Addresses the problem of the dominance of items that have been rated by all (and universally liked or disliked), yet are not as useful as less common items.
• The idea: weight an item's rating by the inverse of the frequency with which the item is rated.
Inverse User Frequency

When measuring the similarity between users, items that have been rated by all (and universally liked or disliked) are not as useful as less common items. To account for this, compute

    f_i = \log \frac{n}{n_i}

where n_i is the number of users who have rated item i out of the total number of n users. To apply inverse user frequency while using similarity-based CF, the original rating for item i is transformed by multiplying it by the factor f_i.
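The factor is easy to compute; this sketch shows its key property: f_i is 0 for an item everyone rated (so it contributes nothing to similarity) and grows for rarer items. The user counts are illustrative.

```python
from math import log

def inverse_user_frequency(n_users, n_rated_item):
    """f_i = log(n / n_i) for an item rated by n_i of n users."""
    return log(n_users / n_rated_item)

n = 1000
print(inverse_user_frequency(n, 1000))  # rated by everyone → 0.0
print(inverse_user_frequency(n, 10))    # rare item: large weight
```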
And Now…

Let's run "the data mining" on some data!
References

• Prem Melville and Vikas Sindhwani. Recommender Systems. In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey Webb (Eds.), Springer, 2010.