Mean Normalization

Feature Normalization for Recommender Systems

“Back in the first course, you have seen how for linear regression, feature normalization can help the algorithm run faster.” For a recommender system that predicts numbers “such as movie ratings from one to five or zero to five stars, it turns out your algorithm will run more efficiently and also perform a bit better if you first carry out mean normalization.”

Mean normalization means that you “normalize the movie ratings to have a consistent average value.”

Problem with New Users

Adding User Eve

To illustrate the issue, consider adding “a fifth user Eve who has not yet rated any movies.”

For a user with no ratings:

Parameters: w^(5) = [0, 0] and b^(5) = 0
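“Because Eve hasn’t rated any movies yet, the parameters w and b don’t affect this first term in the cost function” (the squared-error term, which sums only over ratings that exist)
For the regularization term that remains, “minimizing this means making the parameters w as small as possible”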
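For reference, the argument above can be read off the regularized collaborative filtering cost as it is usually written. The symbols r(i,j) (1 if user j rated movie i), n_u, n_m, n and λ are assumed from that standard formulation and are not defined in these notes.

```latex
J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \right)^2
  + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2
  + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2
```

For Eve, r(i,5) = 0 for every movie i, so w^(5) appears only in the second term, and that term is smallest when w^(5) = [0, 0].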

Prediction Issue

With zero parameters, the algorithm predicts a rating of 0 for every movie i:

rating = w^(5) · x^(i) + b^(5) = 0
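A minimal sketch of this prediction in plain Python. The movie feature values and the names `movie_features`, `w_eve`, `b_eve` and `predict` are hypothetical and chosen only for illustration. Whatever the features are, zero parameters give a prediction of zero.

```python
# Hypothetical feature vectors x^(i) for two movies (illustrative values only).
movie_features = [[0.9, 0.1], [0.2, 0.8]]

# Eve's parameters after training without mean normalization.
w_eve = [0.0, 0.0]
b_eve = 0.0


def predict(w, b, x):
    """Return the predicted rating w . x + b for one user and one movie."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b


predictions = [predict(w_eve, b_eve, x) for x in movie_features]
print(predictions)  # [0.0, 0.0]
```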

Mean Normalization Process

Step 1: Calculate Movie Averages

For each movie i, compute μᵢ - “the average rating that was given”:

Movie 1: “had two 5s and two 0s and so the average rating is 2.5”
Movie 2: “had a 5 and a 0, so that averages out to 2.5”
Movie 3: “4 and 0 averages out to 2”
Movie 4: “averages out to 2.25 rating”
Movie 5: “not that popular, has an average 1.25 rating”

The average is taken “over just the users that did rate that particular movie.”
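Step 1 could be sketched as follows in plain Python. The rating values are chosen to reproduce the averages quoted above, but which user gave which rating is an assumption made for this sketch, as are the names `ratings` and `movie_means`. `None` marks a rating that does not exist.

```python
# Rows are movies, columns are users (the last column is Eve).
# None marks a missing rating. The placement of ratings is assumed.
ratings = [
    [5, 5, 0, 0, None],        # Movie 1
    [5, None, None, 0, None],  # Movie 2
    [None, 4, 0, None, None],  # Movie 3
    [0, 0, 5, 4, None],        # Movie 4
    [0, 0, 5, 0, None],        # Movie 5
]


def movie_means(ratings):
    """Average each row over the observed (non-None) ratings only."""
    means = []
    for row in ratings:
        rated = [r for r in row if r is not None]
        # Assumption: a movie nobody has rated gets a mean of 0.0.
        means.append(sum(rated) / len(rated) if rated else 0.0)
    return means


mu = movie_means(ratings)
print(mu)  # [2.5, 2.5, 2.0, 2.25, 1.25]
```

Treating the missing entries as zeros instead would pull every average down, which is why only the observed ratings are counted.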

Step 2: Subtract Mean from Ratings

Transform original ratings Y(i,j) by subtracting movie means:

Y_norm(i,j) = Y(i,j) - μᵢ

Example transformations:

A 5-star rating of a movie whose mean is 2.5 becomes: 5 - 2.5 = 2.5
A 0-star rating of movie 4, whose mean is 2.25, becomes: 0 - 2.25 = -2.25
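A short sketch of the subtraction step, assuming missing ratings are stored as `None` and left untouched. The two rows below use the same assumed placement of ratings as in the earlier averages, and `normalize_ratings` is a hypothetical helper name.

```python
def normalize_ratings(ratings, means):
    """Subtract each movie's mean from its observed ratings; keep None as None."""
    return [
        [None if r is None else r - mu_i for r in row]
        for row, mu_i in zip(ratings, means)
    ]


ratings = [
    [5, 5, 0, 0, None],  # Movie 1, mean 2.5
    [0, 0, 5, 4, None],  # Movie 4, mean 2.25
]
means = [2.5, 2.25]

y_norm = normalize_ratings(ratings, means)
print(y_norm[0])  # [2.5, 2.5, -2.5, -2.5, None]
print(y_norm[1])  # [-2.25, -2.25, 2.75, 1.75, None]
```

After this step the observed ratings in each row average to zero, which is what the model is then trained on.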

Step 3: Adjust Predictions

Because the model is now trained on the normalized ratings, add the movie's mean back when predicting the rating of user j for movie i:

prediction = w^(j) · x^(i) + b^(j) + μᵢ

Improved New User Predictions

Example with User Eve

For the new user with w^(5) = [0, 0] and b^(5) = 0:

prediction for movie 1 = w^(5) · x^(1) + b^(5) + μ₁ = 0 + 2.5 = 2.5
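The same calculation for all five movies, as a minimal sketch. The feature values and the names `predict_with_mean`, `features`, `w_eve` and `b_eve` are hypothetical. Because Eve's parameters are zero, the features have no effect on the result.

```python
def predict_with_mean(w, b, x, mu_i):
    """Predict on the normalized scale, then add the movie mean back."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b + mu_i


# Per-movie means from Step 1.
mu = [2.5, 2.5, 2.0, 2.25, 1.25]

# Hypothetical movie features (illustrative values only).
features = [[0.9, 0.0], [1.0, 0.1], [0.8, 0.2], [0.1, 1.0], [0.0, 0.9]]

# Eve has rated nothing, so her learned parameters are all zero.
w_eve, b_eve = [0.0, 0.0], 0.0

eve_predictions = [
    predict_with_mean(w_eve, b_eve, x, mu_i) for x, mu_i in zip(features, mu)
]
print(eve_predictions)  # [2.5, 2.5, 2.0, 2.25, 1.25]
```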

Improvement

“This seems more reasonable to think Eve is likely to rate this movie 2.5 rather than think Eve will rate all movies zero stars just because she hasn’t rated any movies yet.”

“The effect of this algorithm is it will cause the initial guesses for the new user Eve to be just equal to the mean of whatever other users have rated these five movies.”

Additional Benefits

Performance Improvements

“By normalizing the mean of the different movies ratings to be zero, the optimization algorithm for the recommender system will also run just a little bit faster.”

Better Behavior

“It does make the algorithm behave much better for users who have rated no movies or very small numbers of movies. And the predictions will become more reasonable.”

Alternative: Column Normalization

Row vs Column Normalization

Row normalization (recommended): “Normalize each of the rows of this matrix to have zero mean.” Each row holds one movie's ratings, so this helps with new users.
Column normalization: “Normalize the columns of this matrix to have zero mean.” Each column holds one user's ratings, so this would help with new movies.
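To make the two options concrete, here is a small sketch that computes both kinds of mean on a toy matrix. The ratings are hypothetical, with rows as movies and columns as users, and `row_means` and `column_means` are illustrative helper names. An empty row or column gets `None`, since there is nothing to average.

```python
def row_means(matrix):
    """Mean of the observed entries in each row (None if the row is empty)."""
    out = []
    for row in matrix:
        vals = [v for v in row if v is not None]
        out.append(sum(vals) / len(vals) if vals else None)
    return out


def column_means(matrix):
    """Mean of the observed entries in each column, via a transpose."""
    return row_means([list(col) for col in zip(*matrix)])


# Hypothetical ratings: the last column is a new user, the last row a new movie.
ratings = [
    [5, 5, 2, None],
    [1, 3, 5, None],
    [None, None, None, None],
]

print(row_means(ratings))     # [4.0, 3.0, None]
print(column_means(ratings))  # [3.0, 4.0, 3.5, None]
```

The row means give the new user a default guess for every movie that already has ratings, while the column means give the new movie a default guess for every user who has already rated something.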

Why Row Normalization is Preferred

“Normalizing the rows so that you can give reasonable ratings for a new user seems more important than normalizing the columns.”

For new movies: “If there’s a brand new movie that no one has rated yet, you probably shouldn’t show that movie to too many users initially because you don’t know that much about that movie.”

Summary

Mean normalization “makes the algorithm run a little bit faster” but “even more important, it makes the algorithm give much better, much more reasonable predictions when there are users that rated very few movies or even no movies at all.”

This implementation detail “will make your recommender system work much better” by providing sensible default predictions based on average movie ratings rather than zero ratings.