Skip to content
Pablo Rodriguez

Deep Learning

A good way to develop a content-based filtering algorithm is to use deep learning. The approach you see in this video is the way that many important commercial state-of-the-art content-based filtering algorithms are built today.

The first neural network is the user network that takes as input the list of features of the user (x_u) such as age, gender, country of the user, and so on. Using a few layers of dense neural network layers, it outputs the vector v_u that describes the user.

In the example network:

  • Input: User features (age, gender, country, etc.)
  • Hidden layers: Dense neural network layers
  • Output layer: 32 units producing v_u as a list of 32 numbers

Similarly, to compute v_m for a movie, we have a movie network that takes as input features of the movie and through a few layers of a neural network outputs v_m, the vector that describes the movie.

  • Input: Movie features (year, genre, stars, etc.)
  • Hidden layers: Dense neural network layers
  • Output layer: 32 units producing v_m as a list of 32 numbers

Architecture Flexibility

The user network and the movie network can hypothetically have different numbers of hidden layers and different numbers of units per hidden layer. All that matters is the output layer needs to have the same size dimension.

The rating prediction is computed as: v_u · v_m (dot product)

Rather than drawing two separate networks, we can represent this as a single neural network:

  • Upper portion: User network (x_u → v_u)
  • Lower portion: Movie network (x_m → v_m)
  • Dot product layer: Combines v_u and v_m to produce prediction

For binary labels (like/dislike rather than star ratings):

  • Apply sigmoid function to v_u · v_m
  • Predict probability that y^(i,j) = 1
  • Use this to predict the probability that user j likes item i

The cost function is very similar to collaborative filtering:

J = Σ[(i,j): r(i,j)=1] (v_u^(j) · v_m^(i) - y^(i,j))²

Where we sum over all pairs i and j where we have labels (r(i,j) = 1).

  • There’s no separate training procedure for the user and movie networks
  • This cost function is used to train all parameters of both networks simultaneously
  • The networks are judged according to how well v_u and v_m predict y^(i,j)
  • Use gradient descent or other optimization algorithms to minimize the cost function

For regularization, add the usual neural network regularization term to encourage the networks to keep the values of their parameters small.

After training, the model can find similar items using the learned feature vectors.

  • v_m^(i): 32-dimensional vector describing movie i
  • To find movies similar to movie i, look for other movies k where the distance between vectors is small
  • Use squared distance: ||v_m^(k) - v_m^(i)||²

This is similar to collaborative filtering’s approach of finding movies with similar feature vectors.

This approach demonstrates one of the benefits of neural networks mentioned when discussing decision trees versus neural networks: it’s easier to take multiple neural networks and put them together to make them work in concert to build a larger system.

The content-based filtering system shows this by:

  • Taking a user network and movie network
  • Putting them together
  • Taking the inner product of their outputs
  • Creating a more complex architecture that’s quite powerful

In practice, developers often end up spending significant time carefully designing the features needed to feed into these content-based filtering algorithms. If building these systems commercially, it may be worth spending time engineering good features for the application.

One limitation of the algorithm as described is that it can be computationally very expensive to run if you have a large catalog with many different movies you may want to recommend. This challenge leads to the need for scaling solutions addressed in subsequent approaches.

The deep learning approach to content-based filtering provides a flexible, powerful framework that can incorporate rich user and item features to make sophisticated recommendations while maintaining the ability to find similar items and scale to commercial applications.