Learning the State-Value Function
Core Learning Approach
Neural Network for Q Function
“The key idea is that we’re going to train a neural network to compute or to approximate the state action value function Q(s,a), and that in turn will let us pick good actions.”
High-Level Process: “The heart of the learning algorithm is we’re going to train a neural network that inputs the current state and the current action and computes or approximates Q(s,a).”
Input Representation
State Vector (8 numbers)
The lunar lander state includes:
- x, y: Position coordinates
- ẋ, ẏ: Velocity components
- θ, θ̇: Angle and angular velocity
- l, r: Left and right leg ground contact (binary)
Action Encoding (4 numbers)
One-Hot Encoding for the four possible actions:
- Nothing: [1, 0, 0, 0]
- Left thruster: [0, 1, 0, 0]
- Main engine: [0, 0, 1, 0]
- Right thruster: [0, 0, 0, 1]
Combined Input Vector
Total input x: 12 numbers (8 for state + 4 for one-hot action encoding)
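As a concrete illustration, here is a minimal NumPy sketch of assembling that 12-number input. The helper name encode_input and the example state values are hypothetical, not from the lecture:

```python
import numpy as np

# Hypothetical state: x, y, x_dot, y_dot, theta, theta_dot, l, r (8 numbers)
state = np.array([0.2, 1.4, -0.1, -0.5, 0.05, 0.01, 0.0, 0.0])

# One-hot positions for the four actions: nothing, left, main, right
ACTIONS = {"nothing": 0, "left": 1, "main": 2, "right": 3}

def encode_input(state, action_name):
    """Concatenate the 8-number state with the 4-number one-hot action."""
    one_hot = np.zeros(4)
    one_hot[ACTIONS[action_name]] = 1.0
    return np.concatenate([state, one_hot])  # 12 numbers total

x = encode_input(state, "main")  # x has shape (12,)
```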
Neural Network Architecture
Network Structure
- Input Layer: 12 numbers (state + action)
- Hidden Layer 1: 64 units
- Hidden Layer 2: 64 units
- Output Layer: 1 unit (Q value)
Output: Single Q(s,a) value that the neural network approximates
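A minimal Keras sketch of this architecture, assuming TensorFlow. The layer sizes follow the notes above; the ReLU activations and Adam optimizer are typical choices, not taken from this section:

```python
import tensorflow as tf

# Q-network: 12 inputs (state + one-hot action) -> 64 -> 64 -> 1 Q value
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(12,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),  # single Q(s, a) estimate
])

# MSE loss, matching the supervised-learning step described below
q_network.compile(optimizer="adam", loss="mse")
```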
Training Target
“The job of the neural network is to output Q(s,a), the state action value function for the lunar lander given the input s and a.”
Target Value: “I’m also going to refer to this value Q(s,a) as the target value y that we’re training the neural network to approximate.”
Action Selection Process
Computing All Q Values
When the lunar lander is in state s:
- Compute Q(s, nothing)
- Compute Q(s, left)
- Compute Q(s, main)
- Compute Q(s, right)
Optimal Action Selection
“Whichever of these has the highest value, you would pick the corresponding action a. So for example, if out of these four values, Q(s, main) is largest, then you would decide to fire the main engine of the lunar lander.”
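In code, action selection could look like the following sketch, which reuses the hypothetical q_network from the architecture sketch above; the batch construction mirrors the 12-number encoding described earlier:

```python
import numpy as np

ACTION_NAMES = ["nothing", "left", "main", "right"]

def select_action(q_network, state):
    """Evaluate Q(s, a) for all four actions and return the best action name."""
    one_hots = np.eye(4)                                     # 4 one-hot encodings
    inputs = np.hstack([np.tile(state, (4, 1)), one_hots])   # 4 x 12 input batch
    q_values = q_network.predict(inputs, verbose=0).ravel()  # Q(s, nothing..right)
    return ACTION_NAMES[int(np.argmax(q_values))]
```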
Training Data Generation
Bellman Equation Foundation
Key Question: “How do you train a neural network to output Q(s,a)?”
Solution: “The approach will be to use Bellman’s equations to create a training set with lots of examples x and y, and then we’ll use supervised learning exactly as you learned in the second course when we talked about neural networks.”
Training Pair Creation
From Bellman Equation: Q(s,a) = R(s) + γ max_{a’} Q(s’, a’)
- Input (x): State-action pair (s, a)
- Target (y): Right-hand side of Bellman equation
Experience Collection
Random Exploration: “We’re going to use the lunar lander, and just try taking different actions in it. If we don’t have a good policy yet, we’ll take actions randomly, fire the left thruster, fire the right thruster, fire the main engine, do nothing.”
Experience Tuples: By taking actions, “we’ll observe a lot of examples of when we’re in some state, and we took some action, maybe a good action, maybe a terrible action, either way. Then we got some rewards R(s) for being in that state, and as a result of our action, we got to some new state s’.”
Data Collection Format
Each experience: (S_t, A_t, R_t, S_{t+1})
Multiple Examples:
- (S₁, A₁, R₁, S’₁)
- (S₂, A₂, R₂, S’₂)
- …
- (S₁₀₀₀₀, A₁₀₀₀₀, R₁₀₀₀₀, S’₁₀₀₀₀)
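These tuples are commonly kept in a replay buffer. Here is a minimal sketch using Python’s standard library, with the 10,000-tuple cap matching the count in the list above; the function names are illustrative:

```python
import random
from collections import deque, namedtuple

# One experience tuple: (S_t, A_t, R_t, S_{t+1})
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

# Keep only the most recent 10,000 experience tuples
replay_buffer = deque(maxlen=10_000)

def store(state, action, reward, next_state):
    replay_buffer.append(Experience(state, action, reward, next_state))

def sample(batch_size):
    """Draw a random subset of stored experiences to build training examples from."""
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
```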
Training Example Construction
From Experience to Training Data
Each tuple (S_i, A_i, R_i, S’_i) creates one training example:
Input: X_i = [S_i, A_i] (12 numbers total)
Target: Y_i = R_i + γ max_{a’} Q(S’_i, a’)
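A sketch of turning a batch of such tuples into (X, Y) training pairs, assuming the hypothetical encode_input and q_network from the earlier sketches; the discount factor value here is illustrative:

```python
import numpy as np

GAMMA = 0.995  # illustrative discount factor
ACTION_NAMES = ["nothing", "left", "main", "right"]

def build_training_batch(q_network, experiences):
    """Build (X, Y) pairs where Y is the Bellman target R + gamma * max_a' Q(s', a')."""
    X, Y = [], []
    for s, a, r, s_next in experiences:
        X.append(encode_input(s, a))
        # Evaluate the current Q-function guess at s' for all four actions
        next_inputs = np.array([encode_input(s_next, a2) for a2 in ACTION_NAMES])
        max_q_next = q_network.predict(next_inputs, verbose=0).max()
        Y.append(r + GAMMA * max_q_next)
    return np.array(X), np.array(Y)
```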
Bellman Target Calculation
“Y₁ would be computed using the right-hand side of the Bellman equation. In particular, the Bellman equation says, when you input S₁, A₁, you want Q(S₁, A₁) to be this right-hand side, to be equal to R(S₁) plus Gamma max over a’ of Q(S₁’, a’).”
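For example, with purely illustrative numbers: if R(S₁) = 100, γ = 0.985, and the current guess of the Q function gives max over a’ of Q(S₁’, a’) = 50, then y₁ = 100 + 0.985 × 50 = 149.25.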
Bootstrap Q Function
Initial Challenge: “You may be wondering, wait, where does Q(S’, a’), or Q(S’₁, a’) come from? Well, initially, we don’t know what is the Q function.”
Solution: “When you don’t know what is the Q function, you can start off with taking a totally random guess for what is the Q function, and we’ll see on the next slide that the algorithm will work nonetheless.”
Iterative Improvement
Training Process
- Initialize: Start with random Q function estimate
- Generate: Create training examples using current Q estimate
- Train: Use supervised learning (MSE loss) to improve Q function
- Update: Replace old Q with new improved estimate
- Repeat: Use better Q estimate for next round of training
Key Insight: “In every step, Q here is just going to be some guess that will get better over time” through this iterative refinement process.
The approach combines reinforcement learning exploration with supervised learning optimization, using the Bellman equation to create training targets that gradually improve the Q function approximation.
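Putting these steps together, the overall loop could look like the following sketch. The env object with reset()/step(), the iteration and episode counts, and the helpers store, build_training_batch, and replay_buffer all refer to the hypothetical sketches above rather than to any specific library:

```python
import random

ACTION_NAMES = ["nothing", "left", "main", "right"]

def train_q_network(env, q_network, num_iterations=100, episodes_per_iter=20):
    """Iteratively improve the Q-network, roughly following the steps listed above."""
    for _ in range(num_iterations):
        # 1. Explore: take (initially random) actions and store experience tuples
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                action = random.choice(ACTION_NAMES)
                next_state, reward, done = env.step(action)
                store(state, action, reward, next_state)  # replay-buffer sketch above
                state = next_state

        # 2. Build a training set using the current (possibly poor) Q guess
        X, Y = build_training_batch(q_network, list(replay_buffer))

        # 3. Supervised learning step with MSE loss. The notes describe training a
        #    new Q estimate and then replacing the old one; fitting in place is a
        #    simplification of that update step.
        q_network.fit(X, Y, epochs=1, verbose=0)
    return q_network
```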