Using Per Item Features

Building on the same dataset with “four users having rated some but not all of the five movies,” we now add movie features.

We introduce two features, X₁ and X₂, that indicate:

  • X₁: “How much each of these is a romance movie”
  • X₂: “How much each of these is an action movie”

Sample feature values:

  • Love at Last: [0.9, 0] - “very romantic movie” but “not at all an action movie”
  • Nonstop Car chases: [0.1, 1.0] - “has just a little bit of romance” but “has a ton of action”
  • n: “Number of features we have” (n=2 for romance and action features)
  • X^(i): Feature vector for movie i (see the sketch after this list)
    • X^(1) = [0.9, 0] for Love at Last
    • X^(3) = [0.99, 0] for Cute Puppies of Love
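
As a rough sketch, the feature vectors quoted above can be written out directly; the variable names below are illustrative, and the movie index for Nonstop Car chases is not given in these notes.

```python
import numpy as np

# Feature vectors (romance, action) quoted above; n = 2 features per movie.
x1 = np.array([0.90, 0.0])     # X^(1): Love at Last
x3 = np.array([0.99, 0.0])     # X^(3): Cute Puppies of Love
x_car = np.array([0.10, 1.0])  # Nonstop Car chases (movie index not stated here)

n = x1.shape[0]  # number of features, n = 2
```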

Linear Regression Approach for User Predictions

For user j, predict rating for movie i as:

prediction = w^(j) · X^(i) + b^(j)

“This is just a lot like linear regression” where each user has individual parameters w^(j) and b^(j).

For Alice (user 1) with parameters w^(1) = [5, 0] and b^(1) = 0:

Prediction for movie 3 (Cute Puppies of Love):

w^(1) · X^(3) + b^(1) = [5,0] · [0.99,0] + 0 = 4.95
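
The 4.95 prediction above can be reproduced with a single dot product; the variable names are illustrative.

```python
import numpy as np

w1 = np.array([5.0, 0.0])   # Alice's parameters w^(1)
b1 = 0.0                    # Alice's bias b^(1)
x3 = np.array([0.99, 0.0])  # features X^(3) of Cute Puppies of Love

prediction = np.dot(w1, x3) + b1
print(prediction)  # 4.95
```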

For learning parameters w^(j) and b^(j) for user j:

J(w^(j), b^(j)) = (1/(2m^(j))) * Σ[i: r(i,j)=1] (w^(j)·X^(i) + b^(j) - y^(i,j))²

Key components:

  • m^(j): “Number of movies rated by user j”
  • Sum only over movies where r(i,j) = 1 (user j has rated movie i)
  • Uses “mean squared error criteria”

Complete cost function includes regularization to “prevent overfitting”:

J(w^(j), b^(j)) = (1/(2m^(j))) * Σ[i: r(i,j)=1] (w^(j)·X^(i) + b^(j) - y^(i,j))² + (λ/(2m^(j))) * Σ[k=1 to n] (w_k^(j))²
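
A minimal sketch of this per-user cost, assuming user j's ratings are stored in a vector y_j with a boolean mask r_j marking which movies were rated (the function and variable names are assumptions for illustration):

```python
import numpy as np

def cost_for_user(w_j, b_j, X, y_j, r_j, lam):
    """Regularized cost J(w^(j), b^(j)) for a single user j.

    X   : (num_movies, n) matrix, one feature vector X^(i) per row
    y_j : (num_movies,) ratings by user j (only valid where r_j is True)
    r_j : (num_movies,) boolean mask, True where r(i, j) = 1
    lam : regularization strength lambda
    """
    m_j = np.sum(r_j)                  # number of movies rated by user j
    preds = X @ w_j + b_j              # w^(j) . X^(i) + b^(j) for every movie i
    errs = (preds - y_j)[r_j]          # keep only movies user j actually rated
    squared_error = np.sum(errs ** 2)
    reg = lam * np.sum(w_j ** 2)
    return (squared_error + reg) / (2 * m_j)
```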

Simplification Note

For recommender systems, “it would be convenient to actually eliminate this division by m^(j) term” since “m^(j) is just a constant in this expression.”

To learn parameters for all users simultaneously:

J_total = Σ[j=1 to nu] J(w^(j), b^(j))
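
Following the simplification note (dropping the 1/m^(j) terms), the summed cost over all users can be evaluated in a few vectorized lines on a ratings matrix Y and mask R; this is a sketch with assumed variable names, not reference code from the course.

```python
import numpy as np

def total_cost(W, b, X, Y, R, lam):
    """Cost summed over all nu users, with the 1/m^(j) divisions dropped.

    W : (nu, n)  one parameter vector w^(j) per user
    b : (nu,)    one bias b^(j) per user
    X : (nm, n)  one feature vector X^(i) per movie
    Y : (nm, nu) Y[i, j] = rating of movie i by user j; use a finite
                 placeholder (e.g. 0) where the movie is unrated
    R : (nm, nu) R[i, j] = 1 if user j rated movie i, else 0
    """
    preds = X @ W.T + b        # (nm, nu) predictions w^(j) . X^(i) + b^(j)
    err = (preds - Y) * R      # zero out movies a user has not rated
    return 0.5 * np.sum(err ** 2) + (lam / 2) * np.sum(W ** 2)
```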

This approach trains “a different linear regression model for each of the nu users.”

Using “gradient descent or any other optimization algorithm to minimize this as a function of w^(1), b^(1) all the way through w^(nu), b^(nu)” yields “a pretty good set of parameters for predicting movie ratings for all the users.”

The algorithm is “a lot like linear regression” where the prediction “plays a role similar to the output f(x) of linear regression” but now “we’re training a different linear regression model for each of the nu users.”
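
As one possible way to carry out that minimization, a plain batch gradient descent loop over all the w^(j), b^(j) looks like the sketch below; the gradients follow from the summed cost above, and the hyperparameters (alpha, number of iterations) are illustrative.

```python
import numpy as np

def fit_all_users(X, Y, R, lam=1.0, alpha=0.01, iters=1000):
    """Minimize the summed cost for all users with batch gradient descent.

    Y must hold a finite placeholder (e.g. 0) wherever R[i, j] = 0.
    """
    nm, n = X.shape
    nu = Y.shape[1]
    W = np.zeros((nu, n))              # one w^(j) per user
    b = np.zeros(nu)                   # one b^(j) per user
    for _ in range(iters):
        err = (X @ W.T + b - Y) * R    # (nm, nu); zero where r(i, j) = 0
        grad_W = err.T @ X + lam * W   # gradient of the cost w.r.t. each w^(j)
        grad_b = err.sum(axis=0)       # gradient of the cost w.r.t. each b^(j)
        W -= alpha * grad_W
        b -= alpha * grad_b
    return W, b
```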

The remaining question: “Where do these features come from? And what if you don’t have access to such features that give you enough detail about the movies with which to make these predictions?”

This leads to the next development: learning features automatically from the data itself.