Using Per Item Features

Building on the same dataset with “four users having rated some but not all of the five movies,” we now add movie features.

We introduce two features, X₁ and X₂, that indicate:

  • X₁: “How much each of these is a romance movie”
  • X₂: “How much each of these is an action movie”

Sample feature values:

  • Love at Last: [0.9, 0] - “very romantic movie” but “not at all an action movie”
  • Nonstop Car chases: [0.1, 1.0] - “has just a little bit of romance” but “has a ton of action”
  • n: “Number of features we have” (n=2 for romance and action features)
  • X^(i): Feature vector for movie i (see the sketch after this list)
    • X^(1) = [0.9, 0] for Love at Last
    • X^(3) = [0.99, 0] for Cute Puppies of Love
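
As a rough sketch, the feature vectors quoted above can be written out directly; the variable names below are illustrative, and the movie index for Nonstop Car chases is not given in these notes.

```python
import numpy as np

# Feature vectors (romance, action) quoted above; n = 2 features per movie.
x1 = np.array([0.90, 0.0])     # X^(1): Love at Last
x3 = np.array([0.99, 0.0])     # X^(3): Cute Puppies of Love
x_car = np.array([0.10, 1.0])  # Nonstop Car chases (movie index not stated here)

n = x1.shape[0]  # number of features, n = 2
```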

Linear Regression Approach for User Predictions

For user j, predict rating for movie i as:

prediction = w^(j) · X^(i) + b^(j)

“This is just a lot like linear regression” where each user has individual parameters w^(j) and b^(j).

For Alice (user 1) with parameters w^(1) = [5, 0] and b^(1) = 0:

Prediction for movie 3 (Cute Puppies of Love):

w^(1) · X^(3) + b^(1) = [5,0] · [0.99,0] + 0 = 4.95
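
The 4.95 prediction above can be reproduced with a single dot product; the variable names are illustrative.

```python
import numpy as np

w1 = np.array([5.0, 0.0])   # Alice's parameters w^(1)
b1 = 0.0                    # Alice's bias b^(1)
x3 = np.array([0.99, 0.0])  # features X^(3) of Cute Puppies of Love

prediction = np.dot(w1, x3) + b1
print(prediction)  # 4.95
```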

For learning parameters w^(j) and b^(j) for user j:

J(w^(j), b^(j)) = (1/(2m^(j))) * Σ[i: r(i,j)=1] (w^(j)·X^(i) + b^(j) - y^(i,j))²

Key components:

  • m^(j): “Number of movies rated by user j”
  • Sum only over movies where r(i,j) = 1 (user j has rated movie i)
  • Uses “mean squared error criteria”

Complete cost function includes regularization to “prevent overfitting”:

J(w^(j), b^(j)) = (1/(2m^(j))) * Σ[i: r(i,j)=1] (w^(j)·X^(i) + b^(j) - y^(i,j))² + (λ/(2m^(j))) * Σ[k=1 to n] (w_k^(j))²
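
A minimal sketch of this per-user cost, assuming user j's ratings are stored in a vector y_j with a boolean mask r_j marking which movies were rated (the function and variable names are assumptions for illustration):

```python
import numpy as np

def cost_for_user(w_j, b_j, X, y_j, r_j, lam):
    """Regularized cost J(w^(j), b^(j)) for a single user j.

    X   : (num_movies, n) matrix, one feature vector X^(i) per row
    y_j : (num_movies,) ratings by user j (only valid where r_j is True)
    r_j : (num_movies,) boolean mask, True where r(i, j) = 1
    lam : regularization strength lambda
    """
    m_j = np.sum(r_j)                  # number of movies rated by user j
    preds = X @ w_j + b_j              # w^(j) . X^(i) + b^(j) for every movie i
    errs = (preds - y_j)[r_j]          # keep only movies user j actually rated
    squared_error = np.sum(errs ** 2)
    reg = lam * np.sum(w_j ** 2)
    return (squared_error + reg) / (2 * m_j)
```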

Simplification Note

For recommender systems, “it would be convenient to actually eliminate this division by m^(j) term” since “m^(j) is just a constant in this expression.”

To learn parameters for all users simultaneously:

J_total = Σ[j=1 to nu] J(w^(j), b^(j))
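
Following the simplification note (dropping the 1/m^(j) terms), the summed cost over all users can be evaluated in a few vectorized lines on a ratings matrix Y and mask R; this is a sketch with assumed variable names, not reference code from the course.

```python
import numpy as np

def total_cost(W, b, X, Y, R, lam):
    """Cost summed over all nu users, with the 1/m^(j) divisions dropped.

    W : (nu, n)  one parameter vector w^(j) per user
    b : (nu,)    one bias b^(j) per user
    X : (nm, n)  one feature vector X^(i) per movie
    Y : (nm, nu) Y[i, j] = rating of movie i by user j; use a finite
                 placeholder (e.g. 0) where the movie is unrated
    R : (nm, nu) R[i, j] = 1 if user j rated movie i, else 0
    """
    preds = X @ W.T + b        # (nm, nu) predictions w^(j) . X^(i) + b^(j)
    err = (preds - Y) * R      # zero out movies a user has not rated
    return 0.5 * np.sum(err ** 2) + (lam / 2) * np.sum(W ** 2)
```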

This approach trains “a different linear regression model for each of the nu users.”

Using “gradient descent or any other optimization algorithm to minimize this as a function of w^(1), b^(1) all the way through w^(nu), b^(nu)” yields “a pretty good set of parameters for predicting movie ratings for all the users.”

The algorithm is “a lot like linear regression” where the prediction “plays a role similar to the output f(x) of linear regression” but now “we’re training a different linear regression model for each of the nu users.”
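
As one possible way to carry out that minimization, a plain batch gradient descent loop over all the w^(j), b^(j) looks like the sketch below; the gradients follow from the summed cost above, and the hyperparameters (alpha, number of iterations) are illustrative.

```python
import numpy as np

def fit_all_users(X, Y, R, lam=1.0, alpha=0.01, iters=1000):
    """Minimize the summed cost for all users with batch gradient descent.

    Y must hold a finite placeholder (e.g. 0) wherever R[i, j] = 0.
    """
    nm, n = X.shape
    nu = Y.shape[1]
    W = np.zeros((nu, n))              # one w^(j) per user
    b = np.zeros(nu)                   # one b^(j) per user
    for _ in range(iters):
        err = (X @ W.T + b - Y) * R    # (nm, nu); zero where r(i, j) = 0
        grad_W = err.T @ X + lam * W   # gradient of the cost w.r.t. each w^(j)
        grad_b = err.sum(axis=0)       # gradient of the cost w.r.t. each b^(j)
        W -= alpha * grad_W
        b -= alpha * grad_b
    return W, b
```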

The remaining question: “Where do these features come from? And what if you don’t have access to such features that give you enough detail about the movies with which to make these predictions?”

This leads to the next development: learning features automatically from the data itself.