Building on the same dataset with “four users having rated some but not all of the five movies,” we now add movie features.
Two features X₁ and X₂ that indicate:
Sample feature values:
For user j, predict the rating for movie i as:
prediction = w^(j) · X^(i) + b^(j)
“This is just a lot like linear regression,” except that each user j has their own parameters w^(j) and b^(j).
For Alice (user 1) with parameters w^(1) = [5, 0] and b^(1) = 0:
Prediction for movie 3 (Cute Puppies of Love):
w^(1) · X^(3) + b^(1) = [5,0] · [0.99,0] + 0 = 4.95
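The arithmetic above can be checked with a short NumPy sketch. The function name `predict_rating` is a hypothetical helper introduced for illustration; the numbers are the ones given for Alice and movie 3.

```python
import numpy as np

def predict_rating(w, x, b):
    """Predicted rating of one movie by one user: w . x + b."""
    return float(np.dot(w, x) + b)

# Values taken from the Alice example in the text.
w_alice = np.array([5.0, 0.0])   # w^(1)
b_alice = 0.0                    # b^(1)
x_movie3 = np.array([0.99, 0.0]) # X^(3), Cute Puppies of Love

prediction = predict_rating(w_alice, x_movie3, b_alice)
print(round(prediction, 2))  # prints 4.95
```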
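To learn the parameters w^(j) and b^(j) for user j, minimize the squared error over the movies that user has rated (those with r(i,j) = 1), where m^(j) is the number of such movies:
J(w^(j), b^(j)) = (1/(2m^(j))) * Σ[i: r(i,j)=1] (w^(j)·X^(i) + b^(j) - y^(i,j))²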
Key components:
The complete cost function adds a regularization term to “prevent overfitting”:
J(w^(j), b^(j)) = (1/(2m^(j))) * Σ[i: r(i,j)=1] (w^(j)·X^(i) + b^(j) - y^(i,j))² + (λ/(2m^(j))) * Σ[k=1 to n] (w_k^(j))²
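As a minimal sketch of the regularized cost for a single user, the function below evaluates the formula with NumPy. The name `user_cost`, the boolean mask `rated`, and all toy values are assumptions made up for illustration; they are not the dataset from the lesson.

```python
import numpy as np

def user_cost(w, b, X, y, rated, lam):
    """Regularized squared-error cost for one user.

    X: (num_movies, n) movie features
    y: (num_movies,) ratings (entries for unrated movies are ignored)
    rated: (num_movies,) boolean mask, True where r(i,j) = 1
    """
    m_j = int(rated.sum())                    # number of movies this user rated
    errors = (X @ w + b - y)[rated]           # keep rated movies only
    squared_error = np.sum(errors ** 2) / (2 * m_j)
    regularization = lam / (2 * m_j) * np.sum(w ** 2)  # bias b is not regularized
    return squared_error + regularization

# Hypothetical toy data: three movies, two features, third movie unrated.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([4.0, 1.0, 0.0])             # last entry is a placeholder
rated_toy = np.array([True, True, False])
w_toy = np.array([3.0, 1.0])

cost = user_cost(w_toy, 0.0, X_toy, y_toy, rated_toy, lam=2.0)
print(cost)
```

With these toy values the squared-error part is 1/4 and the regularization part is 5, so the two terms can be checked by hand.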
Simplification Note
For recommender systems, “it would be convenient to actually eliminate this division by m^(j) term” since “m^(j) is just a constant in this expression.”
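Written out, the simplified per-user cost would look like the following. This is a sketch of the formula with the 1/m^(j) factor dropped from both terms; because m^(j) is a positive constant, multiplying the cost by it does not change which w^(j) and b^(j) minimize it.

```latex
J\left(w^{(j)}, b^{(j)}\right)
  = \frac{1}{2} \sum_{i:\, r(i,j)=1}
      \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \right)^{2}
  + \frac{\lambda}{2} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^{2}
```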
To learn the parameters for all users simultaneously, sum the per-user costs over the nu users (nu, usually written n_u, is the number of users):
J_total = Σ[j=1 to nu] J(w^(j), b^(j))
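One way to evaluate this sum for every user at once is with matrices. The sketch below assumes the simplified per-user cost without the 1/m^(j) factor; the names `total_cost`, `W`, `Y`, `R` and the toy values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def total_cost(W, b, X, Y, R, lam):
    """Sum of per-user costs, without the 1/m^(j) factor.

    W: (n_u, n) one weight vector per user
    b: (n_u,)   one bias per user
    X: (n_m, n) movie features
    Y: (n_m, n_u) ratings
    R: (n_m, n_u) 1 where user j rated movie i, else 0
    """
    errors = (X @ W.T + b - Y) * R            # zero out unrated entries
    return 0.5 * np.sum(errors ** 2) + 0.5 * lam * np.sum(W ** 2)

# Hypothetical toy data: two movies, two users, each user rated one movie.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
W_demo = np.array([[2.0, 0.0], [0.0, 3.0]])
b_demo = np.array([0.0, 1.0])
Y_demo = np.array([[3.0, 0.0], [0.0, 5.0]])
R_demo = np.array([[1.0, 0.0], [0.0, 1.0]])

j_total = total_cost(W_demo, b_demo, X_demo, Y_demo, R_demo, lam=1.0)
print(j_total)
```

Masking with `R` plays the role of the condition r(i,j) = 1 in the per-user sum.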
This approach trains “a different linear regression model for each of the nu users.”
Using “gradient descent or any other optimization algorithm to minimize this as a function of w^(1), b^(1) all the way through w^(nu), b^(nu)” yields “a pretty good set of parameters for predicting movie ratings for all the users.”
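A minimal batch gradient descent sketch for this minimization follows. It assumes the simplified cost without the 1/m^(j) factor, a fixed learning rate, and zero initialization; `fit_all_users` and the toy data are hypothetical, and the ratings are generated from made-up "true" parameters only so the result can be checked.

```python
import numpy as np

def fit_all_users(X, Y, R, lam=0.0, lr=0.1, steps=2000):
    """Gradient descent on the summed cost over all users."""
    n_m, n = X.shape
    n_u = Y.shape[1]
    W = np.zeros((n_u, n))
    b = np.zeros(n_u)
    for _ in range(steps):
        errors = (X @ W.T + b - Y) * R        # (n_m, n_u), unrated entries are 0
        grad_W = errors.T @ X + lam * W       # gradient w.r.t. each w^(j)
        grad_b = errors.sum(axis=0)           # gradient w.r.t. each b^(j)
        W -= lr * grad_W
        b -= lr * grad_b
    return W, b

# Hypothetical toy problem: four movies, two features, two users.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
W_true = np.array([[4.0, 0.0], [0.0, 4.0]])   # made-up parameters
b_true = np.array([0.5, 0.5])
R_train = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Y_train = (X_train @ W_true.T + b_true) * R_train  # unrated entries set to 0

W_fit, b_fit = fit_all_users(X_train, Y_train, R_train)
print(np.round(W_fit, 3), np.round(b_fit, 3))
```

Because each user's parameters appear only in that user's cost, the update for one user never depends on another user's ratings; the matrix form simply runs the per-user regressions side by side.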
The algorithm is “a lot like linear regression” where the prediction “plays a role similar to the output f(x) of linear regression” but now “we’re training a different linear regression model for each of the nu users.”
The remaining question: “Where do these features come from? And what if you don’t have access to such features that give you enough detail about the movies with which to make these predictions?”
This leads to the next development: learning features automatically from the data itself.