Web Mining and Recommender Systems


Transcript of Web Mining and Recommender Systems

Supervised learning – Regression
Learning Goals
• Introduce the general idea of supervised learning problems

What is supervised learning?
Supervised learning is the process of trying to infer, from labeled data, the underlying function that produced the labels associated with the data.
Given labeled training data of the form
{(data_1, label_1), ..., (data_N, label_N)}
infer the function
f(data) → labels
Example: recommender systems
Q: What data might we use to predict whether I will like a particular movie?
• A: ratings that others have given to each movie, and that I have given to other movies
• A: features about the movie and the users who evaluated it (e.g. age, gender, location, etc.)
• A: prior knowledge, e.g.
if movie['mpaa_rating'] == "PG": ...

Approach 1 (prior knowledge): identify words that I frequently mention in my social media posts, and recommend movies whose plot synopses use similar types of language.
Approach 2 (supervised learning): learn from labeled ratings. Recommend movies that similar users have rated highly.
Approach 1 (prior knowledge)
Disadvantages:
• Relies on our assumptions about how users relate to items
• Cannot adapt to new data/information

Approach 2 (supervised learning)
Advantages:
• Doesn't require explicit assumptions about how users relate to items
• Learns directly from data with labeled ratings (predicting ratings)
Disadvantages:
• May not be adaptable to new settings
Supervised versus unsupervised learning
Learning approaches attempt to model data in order to solve a problem.
Unsupervised learning approaches find patterns and relationships in the data, but are not optimized to solve a particular predictive task.
Supervised learning aims to directly model the relationship between input and output variables, so that the output variables can be predicted accurately given the input.
Regression
Regression is one of the simplest supervised learning approaches: we learn relationships between input variables (features) and output variables (predictions).
Linear regression assumes a predictor of the form
X theta ≈ y
i.e., a matrix of features (data), times a vector of unknown parameters, approximates a vector of outputs (labels).
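As a minimal sketch of this setup (assuming numpy; the data values below are made up for illustration), the predictor X theta ≈ y can be fit directly by least squares:

```python
import numpy as np

# Toy data: each row of X is a feature vector [1, height_in_cm];
# y holds the corresponding weights in kg (illustrative values only).
X = np.array([[1, 150.0], [1, 160.0], [1, 170.0], [1, 180.0], [1, 190.0]])
y = np.array([52.0, 58.0, 67.0, 75.0, 84.0])

# Solve X @ theta ~= y in the least-squares sense
theta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(theta)       # [intercept, slope]
print(X @ theta)   # fitted weights
```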
Motivation: height vs. weight
[Scatter plot of weight against height, with heights roughly 130cm to 200cm]
Q: Can we find a line that (approximately) fits the data?
• If we can find such a line, we can use it to make predictions (i.e., estimate a person's weight given their height)
• How do we formulate the problem of finding a line?
• If no line will fit the data exactly, how to approximate?
• What is the "best" line?
Recap: equation for a line
What is the formula describing the line?
[Plot: weight (40kg to 120kg) against height, with a line of best fit]
weight ≈ theta_0 + theta_1 * height   (theta_0 = intercept, theta_1 = slope)
What about in more dimensions?
[Plot: weight (40kg to 120kg) against height and a second feature]
With two features we fit a plane rather than a line, e.g.
weight ≈ theta_0 + theta_1 * height + theta_2 * (second feature)
and with more features, a hyperplane.
In general, the predictor is of the form
X theta ≈ y
with one parameter per feature, i.e., linear regression expresses the supervised learning setup in terms of lines (or hyperplanes) of best fit.
Worked Example – Regression
Learning goal: set up and solve a simple regression problem.
Recall that linear regression assumes a predictor of the form X theta ≈ y.
Motivating question: how do preferences toward certain beers vary with age?
Data: http://cseweb.ucsd.edu/classes/fa19/cse258-a/data/beer_50000.json
Example 1: real-valued features
How do ratings vary with the reviewer's age? Both the feature and the output are real-valued, so we can fit a line directly (see the sketch below).
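A sketch of how this might look in Python, assuming the file contains one Python dict literal per line and uses field names such as 'review/overall' and 'user/ageInSeconds' (these field names, and the loading approach, are assumptions about the dataset rather than guarantees):

```python
import numpy as np

# Sketch: load the beer reviews and fit  rating ~= theta_0 + theta_1 * age.
data = []
with open("beer_50000.json") as f:
    for line in f:
        data.append(eval(line))   # assumes one Python dict literal per line

# Keep only reviews where the reviewer's age is recorded (assumed field name).
subset = [d for d in data if 'user/ageInSeconds' in d]
age = np.array([d['user/ageInSeconds'] / (365.25 * 24 * 3600) for d in subset])  # seconds -> years
y = np.array([float(d['review/overall']) for d in subset])

X = np.column_stack([np.ones_like(age), age])
theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(theta)   # theta[0] = intercept, theta[1] = change in rating per year of age
```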
Example 2: binary features
How do ratings vary as a function of gender?
Encode the feature as gender = 0 if "male", 1 if "female", and fit
rating ≈ theta_0 + theta_1 * gender
[Plot: rating against gender]
With gender as the only feature, theta_0 is the average rating among males, and theta_1 is how much higher females rate than males (in this case a negative number).
We’re really still fitting a line though!
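A sketch of the same fit in code, reusing `data` from the earlier loading sketch; the 'user/gender' field name and the "Male"/"Female" values are assumptions about the dataset:

```python
import numpy as np

# Sketch: binary gender feature, fit  rating ~= theta_0 + theta_1 * gender.
subset = [d for d in data if d.get('user/gender') in ("Male", "Female")]   # assumed field/values
X = np.array([[1.0, 1.0 if d['user/gender'] == "Female" else 0.0] for d in subset])
y = np.array([float(d['review/overall']) for d in subset])

theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
# theta[0]: average rating among males (with gender as the only feature)
# theta[1]: how much higher females rate than males
print(theta)
```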
Exercise
How would we model the date/time of a review and the impact it has on people’s rating behavior?
How would we set this up as a regression problem?
Regression – Feature Transforms & Worked Example
Learning goal: set up a regression problem and cover feature engineering in more depth.
Recap: regression is a supervised learning approach to learn relationships between input variables (features) and output variables; linear regression assumes a predictor of the form X theta ≈ y.
Motivating question: how do preferences toward certain beers vary with age (or with other attributes, such as alcohol content)?
Example: Polynomial functions
What if the quantity we’re predicting is not a linear function of the feature, e.g. if the rating is (roughly) a cubic function of ABV?
• We just need to use the feature vector
x = [1, ABV, ABV^2, ABV^3]
and the model is still linear in its parameters (a sketch follows).
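A sketch of this transform, assuming `data` was loaded as in the earlier sketch and that entries carry a 'beer/ABV' field (an assumption about the dataset):

```python
import numpy as np

# Sketch: polynomial feature transform x = [1, ABV, ABV^2, ABV^3];
# the predictor is cubic in ABV but still linear in theta.
subset = [d for d in data if 'beer/ABV' in d]
abv = np.array([float(d['beer/ABV']) for d in subset])
y = np.array([float(d['review/overall']) for d in subset])

X = np.column_stack([np.ones_like(abv), abv, abv**2, abv**3])
theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(theta)   # [theta_0, theta_1, theta_2, theta_3]
```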
Fitting complex functions
• We can use the same approach to fit arbitrary functions of the features! E.g.:
rating ≈ theta_0 + theta_1 * ABV + theta_2 * log(ABV) + theta_3 * ABV * age
• We can perform arbitrary combinations of the features and the model will still be linear in the parameters (theta)
• The same approach would not work if we wanted to transform the parameters, e.g.:
rating ≈ theta_0 + exp(theta_1) * ABV
• The linear models we’ve seen so far do not support these types of transformations (i.e., they need to be linear in their parameters)
• Other models do support transformations of parameters, e.g. neural networks
Learning Outcomes
• Set up a simple regression problem via a worked example
• Used feature transforms to fit (e.g.) polynomials with regression
Regression – Categorical Features
Learning goal: incorporate categorical features within regression algorithms.
Recall the earlier example: how do ratings vary as a function of gender?
With gender = 0 if "male", 1 if "female":
rating ≈ theta_0 + theta_1 * gender
[Plot: rating against gender]
theta_0 is the average rating among males, and theta_1 is how much higher females rate than males (in this case a negative number).
We’re really still fitting a line though!
Motivating examples
What if we had more than two values? (e.g. {“male”, “female”, “other”, “not specified”})
Could we apply the same approach?
gender = 0 if “male”, 1 if “female”, 2 if “other”, 3 if “not specified”
rating ≈ theta_0 + theta_1 * gender, i.e.
theta_0 if male
theta_0 + theta_1 if female
theta_0 + 2*theta_1 if other
theta_0 + 3*theta_1 if not specified
[Plot: rating against the integer-coded gender values, with a single fitted line]
Motivating examples
• This model is valid, but won’t be very effective
• It assumes that the difference between “male” and “female” must be equivalent to the difference between “female” and “other”
• But there’s no reason this should be the case!
Motivating examples
Instead, we can give each category its own dimension:
feature = [0, 0, 0] for “male”
feature = [1, 0, 0] for “female”
feature = [0, 1, 0] for “other”
feature = [0, 0, 1] for “not specified”
Concept: One-hot encodings
• This type of encoding is called a one-hot encoding (because we have a feature vector with only a single “1” entry)
• Note that to capture 4 possible categories, we only need three dimensions (a dimension for “male” would be redundant)
• This approach can be used to capture a variety of categorical feature types, as well as objects that belong to multiple categories
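A minimal sketch of this encoding in Python (the ordering of the categories is an arbitrary choice made here for illustration):

```python
# One-hot encoding with "male" as the implicit baseline category,
# so 4 categories are captured in only 3 dimensions.
categories = ["female", "other", "not specified"]

def one_hot(gender):
    """Return a 3-dimensional vector that is all zeros for the baseline ("male")."""
    return [1 if gender == c else 0 for c in categories]

print(one_hot("male"))           # [0, 0, 0]
print(one_hot("female"))         # [1, 0, 0]
print(one_hot("other"))          # [0, 1, 0]
print(one_hot("not specified"))  # [0, 0, 1]
```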
Learning Outcomes
• Motivated the use of categorical features within regression algorithms
• Introduced the concept of a "one-hot" encoding
Regression – Temporal Features
Learning goal: incorporate temporal features within regression algorithms.
Motivating question: how would we model the date/time of a review and the impact it has on people’s rating behavior?
[Plot: rating against time]
• In principle this picture looks okay (compared to our previous example on categorical features) – we’re predicting a real-valued quantity from real-valued data (assuming we convert the date string to a number)
• So, what would happen if (e.g.) we tried to train a predictor based on the month of the year?
Motivating examples
• Let’s start with a simple feature representation, e.g. map the month name to a month number:
Jan = [0]
Feb = [1]
Mar = [2]
...
• The model we’d learn might look something like:
[Plot: predicted rating (1 star to 5 stars) increasing linearly with month number, Jan (0) through Dec (11)]
Motivating examples
Q: What would happen if we looked at multiple years?
[Plot: the same linear-in-month model repeated over two years, Jan (0) through Dec (11) twice]
• The model has to jump from its December 31 value back to its January 1st value
• This type of “sawtooth” pattern probably isn’t very realistic
Modeling temporal data
[Plot: two years of data, with a smooth periodic curve sketched as a possible fit]
• We could instead try to fit a smooth periodic function to the data; this would be a valid solution, but is difficult to get right, and fairly inflexible
• Also, it’s not a linear model
• Q: What’s a class of functions that we can use to capture a more flexible variety of shapes?
• A: Piecewise functions!
Concept: Fitting piecewise functions
We’d like to fit a function that takes a separate value for each month, i.e. something like the following:
[Plot: a piecewise-constant rating level per month, Jan (0) through Dec (11)]
In fact this is very easy, even for a linear model! This function looks like:
rating ≈ theta_0 + theta_1 * delta(Feb) + theta_2 * delta(Mar) + ... + theta_11 * delta(Dec)
where delta(Feb) = 1 if it’s Feb, 0 otherwise (and similarly for the other months)
• Note that we don’t need a feature for January
• i.e., theta_0 captures the January value, theta_1 captures the difference between February and January, etc.
We can write the feature vector as follows:
x = [1, delta(Feb), delta(Mar), ..., delta(Dec)]
where the monthly indicators form a one-hot encoding, just like we saw in the “categorical features” example
• This type of feature is very flexible, as it can handle complex shapes, periodicity, etc.
• We could easily increase (or decrease) the resolution to a week, or an entire season, rather than a month, depending on how fine-grained our data was
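A sketch of this month-level encoding and fit (the month and rating values below are toy examples; in practice the month would come from parsing each review’s date):

```python
import numpy as np

def month_feature(month):
    """month in 1..12 -> [1, delta(Feb), delta(Mar), ..., delta(Dec)]; January is the baseline."""
    return [1.0] + [1.0 if month == m else 0.0 for m in range(2, 13)]

# Toy data standing in for (month of review, rating) pairs.
months = [1, 2, 2, 7, 12]
ratings = [3.5, 4.0, 4.5, 3.0, 4.0]

X = np.array([month_feature(m) for m in months])
y = np.array(ratings)
theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
# theta[0] is the January level; theta[k] is the offset of month k+1 relative to January.
print(theta)
```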
Concept: Combining one-hot encodings
We can also concatenate several one-hot encodings together (e.g. encodings for the season, the month, or the day of the week).
[Plot: season vs. rating (overall)]
Learning Outcomes
• Incorporated temporal features within regression algorithms
• Used one-hot encodings to handle periodic data
Regression Diagnostics
Learning Goals
• Evaluate regression algorithms (using the Mean Squared Error or something else)
Q: How low does the MSE have to be before it’s “low enough”?
A: It depends! The MSE is proportional to the variance of the data
Regression diagnostics
Mean Squared Error (MSE): MSE = (1/N) * sum_i (y_i - x_i . theta)^2
Coefficient of determination: R^2 = 1 - MSE / Var(y)
R^2 = 1 means a perfect predictor; R^2 = 0 means the model does no better than always predicting the mean rating
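A sketch of these two diagnostics in code (reusing X, y, and theta from any of the fits above):

```python
import numpy as np

def mse(y, y_pred):
    """Mean Squared Error."""
    return np.mean((y - y_pred) ** 2)

def r2(y, y_pred):
    """Coefficient of determination: 1 - MSE / Var(y)."""
    return 1.0 - mse(y, y_pred) / np.var(y)

# e.g.: print(mse(y, X @ theta), r2(y, X @ theta))
```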
Learning Outcomes
• Discussed how to evaluate regression algorithms
• Introduced the MSE and R^2 coefficient
• Explained the relationship between them

Overfitting and Regularization
Overfitting
Q: But can’t we get an R^2 of 1 (MSE of 0) just by throwing in enough random features?
A: Yes, on the training data! This is why models should always be evaluated on data that wasn’t used to train the model.
A good model is one that generalizes to new data.
When a model performs well on the training data but doesn’t generalize, we are said to be overfitting.
Overfitting
When a model performs well on the training data but doesn’t generalize, we are said to be overfitting.
Q: What can be done to avoid overfitting?
Occam’s razor
Given two hypotheses that explain the data equally well, we should prefer the simpler one. But what counts as a “simple” hypothesis?
A1: A “simple” model is one where theta has few non-zero parameters (only a few features are relevant)
A2: A “simple” model is one where theta is almost uniform (few features are significantly more relevant than others)
Occam’s razor
A1: A “simple” model is one where theta has few non-zero parameters, i.e. its l1 norm ||theta||_1 is small
A2: A “simple” model is one where theta is almost uniform, i.e. its l2 norm ||theta||_2 is small
Regularization is the process of penalizing model complexity during training, e.g. (with the l2 norm):
arg min_theta (1/N) * ||y - X theta||^2 + lambda * ||theta||_2^2
Optimizing the (regularized) model
• We could write down a closed-form solution as we did before
• Or, we can try to solve using gradient descent:
theta := theta - alpha * f'(theta)   (repeat until convergence)
• How to initialize theta?
• How to set the step size alpha?
• These aren’t really the point of this class though
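A sketch of both options for the l2-regularized objective above (the step size and iteration count here are arbitrary illustrative choices):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form minimizer of (1/N)||y - X theta||^2 + lam * ||theta||^2."""
    N, d = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ y / N)

def ridge_gradient_descent(X, y, lam, alpha=0.01, iters=1000):
    """Gradient descent on the same objective, with theta initialized to zero."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / N) * X.T @ (X @ theta - y) + 2.0 * lam * theta
        theta -= alpha * grad
    return theta
```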
Learning Outcomes
• Introduced the idea of overfitting and regularization
• Showed how to regularize models using the l1 and l2 norms
• (Very briefly) touched on gradient descent
Model Selection & Summary
Each value of lambda generates a different model. Q: How do we select which one is the best?
Model selection
• The model with the lowest training error?
• The model with the lowest test error?
• Neither of these works: we need a third partition of the data, as described next.
Model selection
A validation set is constructed to “tune” the model’s parameters:
• Training set: used to optimize the model’s parameters
• Test set: used to report how well we expect the model to perform on unseen data
• Validation set: used to tune any model parameters that are not directly optimized (e.g. lambda)
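A sketch of selecting lambda on a validation split (the split fractions and the lambda grid are arbitrary choices; `ridge_closed_form` is the sketch defined earlier):

```python
import numpy as np

# Sketch: shuffle the data, then split into train / validation / test.
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
n_train, n_val = int(0.5 * len(X)), int(0.25 * len(X))
train, val, test = perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

best_lam, best_err = None, float('inf')
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    theta = ridge_closed_form(X[train], y[train], lam)
    err = np.mean((y[val] - X[val] @ theta) ** 2)      # validation MSE
    if err < best_err:
        best_lam, best_err = lam, err

theta = ridge_closed_form(X[train], y[train], best_lam)
test_mse = np.mean((y[test] - X[test] @ theta) ** 2)   # reported (test) performance
print(best_lam, test_mse)
```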
Model selection
[Plot: training, validation, and test error as a function of lambda]
• The training error increases as lambda increases
• The validation and test error are at least as large as the training error (assuming infinitely large random partitions)
• The validation and test error usually have a “sweet spot” between under- and over-fitting
Summary: Regression
• Regression as a supervised learning problem; lines and hyperplanes of best fit
• Feature transforms; categorical and temporal features (one-hot encodings)
• Regression diagnostics (MSE and R^2)
• Overfitting and regularization
• Model selection (training, validation, and test sets)