
Learning State Value Function

“The key idea is that we’re going to train a neural network to compute or to approximate the state action value function Q(s,a), and that in turn will let us pick good actions.”

High-Level Process: “The heart of the learning algorithm is we’re going to train a neural network that inputs the current state and the current action and computes or approximates Q(s,a).”

The lunar lander state includes:

  • x, y: Position coordinates
  • ẋ, ẏ: Velocity components
  • θ, θ̇: Angle and angular velocity
  • l, r: Left and right leg ground contact (binary)

One-Hot Encoding for four possible actions:

  • Nothing: [1, 0, 0, 0]
  • Left thruster: [0, 1, 0, 0]
  • Main engine: [0, 0, 1, 0]
  • Right thruster: [0, 0, 0, 1]

Total input x: 12 numbers (8 for state + 4 for one-hot action encoding)
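
A minimal sketch of how this 12-number input could be assembled in Python (the action ordering follows the list above; the file and helper names are illustrative, not from the lecture):

input_encoding.py
import numpy as np

# One-hot encodings for the four actions, in the order listed above.
ACTION_NAMES = ["nothing", "left", "main", "right"]
ACTION_ONE_HOT = {name: np.eye(4, dtype=np.float32)[i] for i, name in enumerate(ACTION_NAMES)}

def build_input(state, action_name):
    """Concatenate the 8-number state with the 4-number one-hot action (12 numbers total)."""
    state = np.asarray(state, dtype=np.float32)   # [x, y, x_dot, y_dot, theta, theta_dot, l, r]
    return np.concatenate([state, ACTION_ONE_HOT[action_name]])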

network_architecture.txt
Input Layer: 12 numbers (state + action)
Hidden Layer 1: 64 units
Hidden Layer 2: 64 units
Output Layer: 1 unit (Q value)

Output: Single Q(s,a) value that the neural network approximates

“The job of the neural network is to output Q(s,a), the state action value function for the lunar lander given the input s and a.”
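
A sketch of this architecture in TensorFlow/Keras, with the layer sizes from the block above; the ReLU activations and Adam optimizer are assumptions based on common practice rather than something stated here:

q_network.py
import tensorflow as tf

# 12 inputs (8 state numbers + 4 one-hot action numbers) -> 64 -> 64 -> 1 Q value.
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(12,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),   # single Q(s, a) estimate
])
# MSE loss matches the supervised-learning step described later in these notes.
q_network.compile(optimizer="adam", loss="mse")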

Target Value: “I’m also going to refer to this value Q(s,a) as the target value y that we’re training the neural network to approximate.”

When the lunar lander is in state s:

  1. Compute Q(s, nothing)
  2. Compute Q(s, left)
  3. Compute Q(s, main)
  4. Compute Q(s, right)

“Whichever of these has the highest value, you would pick the corresponding action a. So for example, if out of these four values, Q(s, main) is largest, then you would decide to fire the main engine of the lunar lander.”
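
A sketch of that action-selection step, reusing build_input, ACTION_NAMES, and q_network from the sketches above:

action_selection.py
import numpy as np

def pick_action(state, q_network):
    """Evaluate Q(s, a) for all four actions and return the name of the best one."""
    # Four 12-number inputs, one per action, stacked into a batch of shape (4, 12).
    batch = np.stack([build_input(state, name) for name in ACTION_NAMES])
    q_values = q_network.predict(batch, verbose=0).ravel()   # four Q estimates
    return ACTION_NAMES[int(np.argmax(q_values))]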

Key Question: “How do you train a neural network to output Q(s,a)?”

Solution: “The approach will be to use Bellman’s equations to create a training set with lots of examples x and y, and then we’ll use supervised learning exactly as you learned in the second course when we talked about neural networks.”

From the Bellman Equation: Q(s,a) = R(s) + γ max_{a’} Q(s’, a’)

  • Input (x): State-action pair (s, a)
  • Target (y): Right-hand side of Bellman equation

Random Exploration: “We’re going to use the lunar lander, and just try taking different actions in it. If we don’t have a good policy yet, we’ll take actions randomly, fire the left thruster, fire the right thruster, fire the main engine, do nothing.”

Experience Tuples: By taking actions, “we’ll observe a lot of examples of when we’re in some state, and we took some action, maybe a good action, maybe a terrible action, either way. Then we got some rewards R(s) for being in that state, and as a result of our action, we got to some new state s’.”

Each experience: (S_t, A_t, R_t, S_{t+1})

Multiple Examples:

  • (S₁, A₁, R₁, S’₁)
  • (S₂, A₂, R₂, S’₂)
  • …
  • (S₁₀,₀₀₀, A₁₀,₀₀₀, R₁₀,₀₀₀, S’₁₀,₀₀₀)
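
One way these tuples might be collected, assuming a Gymnasium-style LunarLander environment; the environment id, buffer size, and step-API details are assumptions rather than something stated in the lecture:

collect_experience.py
import gymnasium as gym
from collections import deque

env = gym.make("LunarLander-v2")          # assumed environment id (requires the box2d extra)
replay_buffer = deque(maxlen=10_000)      # keep the most recent (S_t, A_t, R_t, S_{t+1}) tuples

state, _ = env.reset()
for _ in range(10_000):
    action = env.action_space.sample()    # random exploration: no good policy yet
    next_state, reward, terminated, truncated, _ = env.step(action)
    replay_buffer.append((state, action, reward, next_state))
    state = next_state
    if terminated or truncated:
        state, _ = env.reset()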

Each tuple (S_i, A_i, R_i, S’_i) creates one training example:

Input: X_i = [S_i, A_i] (12 numbers total)

Target: Y_i = R_i + γ max_{a’} Q(S’_i, a’)

“Y₁ would be computed using the right-hand side of the Bellman equation. In particular, the Bellman equation says, when you input S₁, A₁, you want Q(S₁, A₁) to be this right-hand side, to be equal to R(S₁) plus Gamma max over a’ of Q(S₁’, a’).”
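
A sketch of how these training examples could be built, reusing build_input, ACTION_NAMES, q_network, and replay_buffer from the sketches above; the discount factor value and the per-tuple predict calls are illustrative, not from the lecture:

build_training_set.py
import numpy as np

GAMMA = 0.995   # discount factor; the exact value is an assumption

def make_training_set(replay_buffer, q_network):
    """Turn stored (s, a, r, s') tuples into supervised examples (X, Y)."""
    X, Y = [], []
    for s, a, r, s_next in replay_buffer:
        # In Gymnasium's LunarLander, action index 0/1/2/3 matches nothing/left/main/right.
        X.append(build_input(s, ACTION_NAMES[a]))
        # Target y = R + gamma * max_a' Q(s', a'), using the current (initially random) Q guess.
        next_batch = np.stack([build_input(s_next, name) for name in ACTION_NAMES])
        max_q_next = q_network.predict(next_batch, verbose=0).max()
        Y.append(r + GAMMA * max_q_next)
    return np.array(X), np.array(Y)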

Initial Challenge: “You may be wondering, wait, where does Q(S’, a’), or Q(S’₁, a’) come from? Well, initially, we don’t know what is the Q function.”

Solution: “When you don’t know what is the Q function, you can start off with taking a totally random guess for what is the Q function, and we’ll see on the next slide that the algorithm will work nonetheless.”

  1. Initialize: Start with random Q function estimate
  2. Generate: Create training examples using current Q estimate
  3. Train: Use supervised learning (MSE loss) to improve Q function
  4. Update: Replace old Q with new improved estimate
  5. Repeat: Use better Q estimate for next round of training

Key Insight: “In every step, Q here is just going to be some guess that will get better over time” through this iterative refinement process.
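
Putting the numbered steps together, a high-level sketch of the loop, reusing make_training_set, replay_buffer, and q_network from the sketches above (iteration counts are illustrative):

training_loop.py
# Step 1: q_network starts with randomly initialized weights, i.e. a random guess for Q.
for iteration in range(100):                  # iteration count is illustrative
    # Step 2: create training examples from stored experience using the current Q guess.
    X, Y = make_training_set(replay_buffer, q_network)
    # Step 3: supervised learning with MSE loss moves Q(s, a) toward the Bellman targets.
    q_network.fit(X, Y, epochs=1, verbose=0)
    # Steps 4-5: the updated q_network is the new, slightly better Q estimate; it is reused
    # to compute the next round of targets (and, in practice, to collect better experience).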

The approach combines reinforcement learning exploration with supervised learning optimization, using the Bellman equation to create training targets that gradually improve the Q function approximation.