Pablo Rodriguez

Collaborative Vs Content Based

Collaborative filtering recommends items to you based on the ratings of users who rated items similarly to you. Given a set of users who have rated some items, the algorithm learns how to use those ratings to recommend new items to you.

Content-based filtering takes a different approach to deciding what to recommend. It recommends items based on features of the users and features of the items, looking for a good match between the two.

Key Distinction

Content-based filtering requires features of each user as well as features of each item, and it uses those features to decide which items and users might be a good match for each other.

Both approaches still use the same core rating data:

  • r(i,j): 1 if user j has rated item i, 0 otherwise
  • y(i,j): The rating that user j gave item i (if defined)
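
As a minimal sketch of how this rating data might be held in memory, the following stores only the ratings that exist and derives r(i,j) from them. The dictionary layout and the toy values are assumptions made for illustration, not something the notes prescribe.

```python
# Hypothetical toy data: ratings[(i, j)] is the rating user j gave item i.
# Pairs that are absent from the dict are unrated.
ratings = {
    (0, 0): 5.0,
    (0, 1): 4.0,
    (1, 0): 1.0,
    (2, 1): 3.5,
}


def r(i, j):
    """Return 1 if user j has rated item i, else 0."""
    return 1 if (i, j) in ratings else 0


def y(i, j):
    """Return the rating user j gave item i, or None if it is undefined."""
    return ratings.get((i, j))
```

Here `y(1, 1)` returns `None` because user 1 never rated item 1, which mirrors the "(if defined)" caveat above.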

The key difference is that content-based filtering can use features of the users and of the items to find matches that may be better than those a pure collaborative filtering approach can find.

Examples of user features:

  • Age of the user
  • Gender: A one-hot feature whose values indicate whether the user’s self-identified gender is male, female, or unknown
  • Country: A one-hot feature with about 200 possible values, one per country
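
The one-hot features above could be sketched roughly as follows. The helper names and the three-country list are hypothetical and stand in for the roughly 200 countries a real system would use.

```python
def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical category lists, shortened for illustration.
GENDERS = ["male", "female", "unknown"]
COUNTRIES = ["AR", "JP", "US"]


def demographic_features(age, gender, country):
    """Concatenate the age with the one-hot gender and country features."""
    return [float(age)] + one_hot(gender, GENDERS) + one_hot(country, COUNTRIES)


demo = demographic_features(34, "unknown", "JP")
# demo == [34.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```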

Content-based systems can look at past behaviors of the user to construct feature vectors:

For example, for the thousand most popular movies in your catalog, you might construct a thousand features, each indicating whether the user has watched that movie.
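
A small sketch of these "has watched" features, assuming movies are identified by string IDs (the IDs below are made up, and the list is four movies long rather than a thousand):

```python
def watched_features(top_movies, watched):
    """Return one 0/1 feature per movie in `top_movies`, in the same order."""
    watched = set(watched)  # set membership keeps each lookup cheap
    return [1.0 if movie in watched else 0.0 for movie in top_movies]


# Hypothetical movie IDs.
top_movies = ["m1", "m2", "m3", "m4"]
user_watched = ["m3", "m1", "m9"]  # "m9" is not a top movie, so it is ignored

watched_vec = watched_features(top_movies, user_watched)
# watched_vec == [1.0, 0.0, 1.0, 0.0]
```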

You can also use the ratings a user has already given to construct new features:

  • Start from a set of movies whose genres you know
  • Calculate the average rating the user has given per genre
  • For example: of all the romance movies the user has rated, what was the average rating?
  • Likewise: of all the action movies the user has rated, what was the average rating?
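
The per-genre averages described above might be computed along these lines. This is only a sketch: the function name, the dictionary inputs, and the choice to fall back to a `default` of 0.0 for genres the user has never rated are all assumptions.

```python
def average_rating_per_genre(user_ratings, movie_genres, genres, default=0.0):
    """Average one user's ratings within each genre.

    user_ratings: dict mapping movie ID -> rating given by this user
    movie_genres: dict mapping movie ID -> list of genres for that movie
    genres: ordered list of genres; the result has one entry per genre
    default: value used when the user has rated no movie in a genre (assumption)
    """
    totals = {g: 0.0 for g in genres}
    counts = {g: 0 for g in genres}
    for movie, rating in user_ratings.items():
        for g in movie_genres.get(movie, []):
            if g in totals:
                totals[g] += rating
                counts[g] += 1
    return [totals[g] / counts[g] if counts[g] else default for g in genres]


# Hypothetical data: one movie belongs to two genres.
movie_genres = {"m1": ["romance"], "m2": ["action"], "m3": ["romance", "action"]}
user_ratings = {"m1": 5.0, "m2": 1.0, "m3": 4.0}

genre_avgs = average_rating_per_genre(
    user_ratings, movie_genres, ["romance", "action", "comedy"]
)
# romance: (5.0 + 4.0) / 2 = 4.5, action: (1.0 + 4.0) / 2 = 2.5, comedy: default
```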

These features combine to create a user feature vector: x_u^(j) for user j.

Examples of movie features:

  • Year of the movie
  • Genre or genres of the movie if known
  • Critic reviews: Construct one or multiple features to capture something about what the critics are saying about the movie
  • Average rating: Take user ratings of the movie to construct a feature such as the average rating of this movie
  • Average rating per country
  • Average rating per user demographic
  • Other types of features based on user feedback patterns

These create a movie feature vector: x_m^(i) for movie i.
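
As a rough illustration, a few of these movie features (year, genres, and average rating) could be assembled into one vector like this. The function, the genre list, and the use of 0.0 for movies without ratings are assumptions for the sketch.

```python
def movie_features(year, genres, ratings, all_genres):
    """Build a small feature vector: year, one 0/1 flag per genre, average rating."""
    genre_part = [1.0 if g in genres else 0.0 for g in all_genres]
    # Assumption: a movie with no ratings yet gets an average of 0.0.
    avg = sum(ratings) / len(ratings) if ratings else 0.0
    return [float(year)] + genre_part + [avg]


# Hypothetical genre list and movie.
ALL_GENRES = ["romance", "action", "comedy"]
x_m = movie_features(1999, ["action", "comedy"], [4.0, 5.0, 3.0], ALL_GENRES)
# x_m == [1999.0, 0.0, 1.0, 1.0, 4.0]
```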

User features and movie features can be very different in size:

  • User features could be 1500 numbers
  • Movie features could be just 50 numbers
  • This asymmetry is perfectly acceptable

Previously, in collaborative filtering, the predicted rating of user j for movie i was w^(j) · x^(i) + b^(j).

In content-based filtering, we eliminate b^(j) and replace the notation:

  • w^(j) becomes v_u^(j), a list of numbers computed from the features of user j (u stands for user)
  • x^(i) becomes v_m^(i), a list of numbers computed from the features of movie i (m stands for movie)
  • Both vectors must be the same size to compute the dot product (e.g., both are 32 numbers), even though the feature vectors x_u and x_m they are computed from can differ in size

The prediction becomes: v_u^(j) · v_m^(i)
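
A minimal sketch of this prediction in plain Python follows. The `predict` name and the three-number vectors are illustrative, and the length check reflects the requirement that both vectors have the same size.

```python
def predict(v_u, v_m):
    """Return the dot product of a user vector and a movie vector."""
    if len(v_u) != len(v_m):
        raise ValueError("v_u and v_m must have the same length")
    return sum(a * b for a, b in zip(v_u, v_m))


# Hypothetical 3-number vectors; the notes suggest sizes such as 32.
v_u = [1.0, 0.5, 2.0]
v_m = [2.0, 4.0, 0.5]
score = predict(v_u, v_m)  # 1.0*2.0 + 0.5*4.0 + 2.0*0.5 = 5.0
```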

If the user vector v_u captures user preferences as [4.9, 0.1, …]:

  • First number: How much they like romance movies
  • Second number: How much they like action movies

And the movie vector v_m is [4.5, 0.2, …]:

  • First number: How much this is a romance movie
  • Second number: How much this is an action movie

Then the dot product hopefully gives a sense of how much this particular user will like this particular movie.
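
As a worked illustration using only the two entries shown in the example, with the remaining entries left elided:

```latex
v_u \cdot v_m = (4.9)(4.5) + (0.1)(0.2) + \cdots = 22.05 + 0.02 + \cdots
```

The romance term dominates because both the user's romance preference and the movie's romance score are large, while the action term contributes almost nothing.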

Collaborative Filtering: A number of users rate different items, and the algorithm learns patterns from similarities in those ratings.

Content-Based Filtering: Features of users and features of items are used to find good matches between users and items by computing vectors v_u for users and v_m for items, then taking dot products to find good matches.

The challenge is learning how to compute v_u and v_m from the available feature information.