
Learning State Value Function

“The key idea is that we’re going to train a neural network to compute or to approximate the state action value function Q(s,a), and that in turn will let us pick good actions.”

High-Level Process: “The heart of the learning algorithm is we’re going to train a neural network that inputs the current state and the current action and computes or approximates Q(s,a).”

The lunar lander state includes:

  • x, y: Position coordinates
  • ẋ, ẏ: Velocity components
  • θ, θ̇: Angle and angular velocity
  • l, r: Left and right leg ground contact (binary)

One-Hot Encoding for four possible actions:

  • Nothing: [1, 0, 0, 0]
  • Left thruster: [0, 1, 0, 0]
  • Main engine: [0, 0, 1, 0]
  • Right thruster: [0, 0, 0, 1]

Total input x: 12 numbers (8 for state + 4 for one-hot action encoding)
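
A minimal sketch of how this 12-number input could be assembled in Python (the action ordering follows the list above; the file and helper names are illustrative, not from the lecture):

input_encoding.py
import numpy as np

# One-hot encodings for the four actions, in the order listed above.
ACTION_NAMES = ["nothing", "left", "main", "right"]
ACTION_ONE_HOT = {name: np.eye(4, dtype=np.float32)[i] for i, name in enumerate(ACTION_NAMES)}

def build_input(state, action_name):
    """Concatenate the 8-number state with the 4-number one-hot action (12 numbers total)."""
    state = np.asarray(state, dtype=np.float32)   # [x, y, x_dot, y_dot, theta, theta_dot, l, r]
    return np.concatenate([state, ACTION_ONE_HOT[action_name]])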

network_architecture.txt
Input Layer: 12 numbers (state + action)
Hidden Layer 1: 64 units
Hidden Layer 2: 64 units
Output Layer: 1 unit (Q value)

Output: Single Q(s,a) value that the neural network approximates

“The job of the neural network is to output Q(s,a), the state action value function for the lunar lander given the input s and a.”
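
A sketch of this architecture in TensorFlow/Keras, with the layer sizes from the block above; the ReLU activations and Adam optimizer are assumptions based on common practice rather than something stated here:

q_network.py
import tensorflow as tf

# 12 inputs (8 state numbers + 4 one-hot action numbers) -> 64 -> 64 -> 1 Q value.
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(12,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),   # single Q(s, a) estimate
])
# MSE loss matches the supervised-learning step described later in these notes.
q_network.compile(optimizer="adam", loss="mse")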

Target Value: “I’m also going to refer to this value Q(s,a) as the target value y that we’re training the neural network to approximate.”

When the lunar lander is in state s:

  1. Compute Q(s, nothing)
  2. Compute Q(s, left)
  3. Compute Q(s, main)
  4. Compute Q(s, right)

“Whichever of these has the highest value, you would pick the corresponding action a. So for example, if out of these four values, Q(s, main) is largest, then you would decide to fire the main engine of the lunar lander.”
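
A sketch of that action-selection step, reusing build_input, ACTION_NAMES, and q_network from the sketches above:

action_selection.py
import numpy as np

def pick_action(state, q_network):
    """Evaluate Q(s, a) for all four actions and return the name of the best one."""
    # Four 12-number inputs, one per action, stacked into a batch of shape (4, 12).
    batch = np.stack([build_input(state, name) for name in ACTION_NAMES])
    q_values = q_network.predict(batch, verbose=0).ravel()   # four Q estimates
    return ACTION_NAMES[int(np.argmax(q_values))]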

Key Question: “How do you train a neural network to output Q(s,a)?”

Solution: “The approach will be to use Bellman’s equations to create a training set with lots of examples x and y, and then we’ll use supervised learning exactly as you learned in the second course when we talked about neural networks.”

From the Bellman Equation: Q(s,a) = R(s) + γ max_{a’} Q(s’, a’)

  • Input (x): State-action pair (s, a)
  • Target (y): Right-hand side of Bellman equation

Random Exploration: “We’re going to use the lunar lander, and just try taking different actions in it. If we don’t have a good policy yet, we’ll take actions randomly, fire the left thruster, fire the right thruster, fire the main engine, do nothing.”

Experience Tuples: By taking actions, “we’ll observe a lot of examples of when we’re in some state, and we took some action, maybe a good action, maybe a terrible action, either way. Then we got some rewards R(s) for being in that state, and as a result of our action, we got to some new state s’.”

Each experience: (S_t, A_t, R_t, S_{t+1})

Multiple Examples:

  • (S₁, A₁, R₁, S’₁)
  • (S₂, A₂, R₂, S’₂)
  • …
  • (S₁₀,₀₀₀, A₁₀,₀₀₀, R₁₀,₀₀₀, S’₁₀,₀₀₀)
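
One way these tuples might be collected, assuming a Gymnasium-style LunarLander environment; the environment id, buffer size, and step-API details are assumptions rather than something stated in the lecture:

collect_experience.py
import gymnasium as gym
from collections import deque

env = gym.make("LunarLander-v2")          # assumed environment id (requires the box2d extra)
replay_buffer = deque(maxlen=10_000)      # keep the most recent (S_t, A_t, R_t, S_{t+1}) tuples

state, _ = env.reset()
for _ in range(10_000):
    action = env.action_space.sample()    # random exploration: no good policy yet
    next_state, reward, terminated, truncated, _ = env.step(action)
    replay_buffer.append((state, action, reward, next_state))
    state = next_state
    if terminated or truncated:
        state, _ = env.reset()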

Each tuple (S_i, A_i, R_i, S’_i) creates one training example:

Input: X_i = [S_i, A_i] (12 numbers total)

Target: Y_i = R_i + γ max_{a’} Q(S’_i, a’)

“Y₁ would be computed using the right-hand side of the Bellman equation. In particular, the Bellman equation says, when you input S₁, A₁, you want Q(S₁, A₁) to be this right-hand side, to be equal to R(S₁) plus Gamma max over a’ of Q(S₁’, a’).”
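
A sketch of how these training examples could be built, reusing build_input, ACTION_NAMES, q_network, and replay_buffer from the sketches above; the discount factor value and the per-tuple predict calls are illustrative, not from the lecture:

build_training_set.py
import numpy as np

GAMMA = 0.995   # discount factor; the exact value is an assumption

def make_training_set(replay_buffer, q_network):
    """Turn stored (s, a, r, s') tuples into supervised examples (X, Y)."""
    X, Y = [], []
    for s, a, r, s_next in replay_buffer:
        # In Gymnasium's LunarLander, action index 0/1/2/3 matches nothing/left/main/right.
        X.append(build_input(s, ACTION_NAMES[a]))
        # Target y = R + gamma * max_a' Q(s', a'), using the current (initially random) Q guess.
        next_batch = np.stack([build_input(s_next, name) for name in ACTION_NAMES])
        max_q_next = q_network.predict(next_batch, verbose=0).max()
        Y.append(r + GAMMA * max_q_next)
    return np.array(X), np.array(Y)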

Initial Challenge: “You may be wondering, wait, where does Q(S’, a’), or Q(S’₁, a’) come from? Well, initially, we don’t know what is the Q function.”

Solution: “When you don’t know what is the Q function, you can start off with taking a totally random guess for what is the Q function, and we’ll see on the next slide that the algorithm will work nonetheless.”

  1. Initialize: Start with random Q function estimate
  2. Generate: Create training examples using current Q estimate
  3. Train: Use supervised learning (MSE loss) to improve Q function
  4. Update: Replace old Q with new improved estimate
  5. Repeat: Use better Q estimate for next round of training

Key Insight: “In every step, Q here is just going to be some guess that will get better over time” through this iterative refinement process.
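
Putting the numbered steps together, a high-level sketch of the loop, reusing make_training_set, replay_buffer, and q_network from the sketches above (iteration counts are illustrative):

training_loop.py
# Step 1: q_network starts with randomly initialized weights, i.e. a random guess for Q.
for iteration in range(100):                  # iteration count is illustrative
    # Step 2: create training examples from stored experience using the current Q guess.
    X, Y = make_training_set(replay_buffer, q_network)
    # Step 3: supervised learning with MSE loss moves Q(s, a) toward the Bellman targets.
    q_network.fit(X, Y, epochs=1, verbose=0)
    # Steps 4-5: the updated q_network is the new, slightly better Q estimate; it is reused
    # to compute the next round of targets (and, in practice, to collect better experience).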

The approach combines reinforcement learning exploration with supervised learning optimization, using the Bellman equation to create training targets that gradually improve the Q function approximation.