
Continuous State Spaces Quiz

The Lunar Lander is a continuous state Markov Decision Process (MDP) because:

  • The state contains numbers, such as position and velocity, that are continuous valued.
  • The reward contains numbers that are continuous valued.
  • The state-action value function Q(s,a) outputs continuous valued numbers.
  • The state has multiple numbers rather than only a single number (such as position in the x-direction).

In the learning algorithm described in the videos, we repeatedly create an artificial training set to which we apply supervised learning, where the input is x=(s,a) and the target, constructed using the Bellman equation, is y = _____?

  • y=R(s)
  • y=R(s)+γmax_{a’}Q(s’,a’) where s’ is the state you get to after taking action a in state s
  • y=R(s’) where s’ is the state you get to after taking action a in state s
  • y=max_{a’}Q(s’,a’) where s’ is the state you get to after taking action a in state s
Key concepts:

  • Discrete state space: a finite set of possible states (e.g., Mars rover positions 1-6)
  • Continuous state space: infinitely many possible state values (e.g., real-valued position coordinates)

Example state vectors:

  • Car/Truck: 6 numbers (x, y, θ, ẋ, ẏ, θ̇)
  • Helicopter: 12 numbers (position, orientation, velocities, angular velocities)
  • Lunar Lander: 8 numbers (x, y, ẋ, ẏ, θ, θ̇, l, r)
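
To make the state representation concrete, here is a minimal Python sketch of the lunar lander's 8-number state as a NumPy array. The numeric values are invented for illustration, and l and r are interpreted as binary ground-contact flags for the left and right legs:

```python
import numpy as np

# One lunar lander state as an 8-number vector, ordered as in the list above.
state = np.array([
    0.6,    # x: horizontal position
    1.4,    # y: vertical position
    -0.1,   # ẋ: horizontal velocity
    -0.5,   # ẏ: vertical velocity
    0.05,   # θ: tilt angle
    0.01,   # θ̇: angular velocity
    0.0,    # l: left leg touching the ground? (binary flag)
    0.0,    # r: right leg touching the ground? (binary flag)
])
x, y, x_dot, y_dot, theta, theta_dot, l, r = state
```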
Learning algorithm:

  1. Collect experiences: (s, a, r, s’) tuples from environment interaction
  2. Create training examples: x = (s, a), y = r + γ max_{a’} Q(s’, a’) (sketched below)
  3. Train network: use supervised learning to approximate Q(s,a)
  4. Iterate: repeat with the improved Q function estimate
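
A minimal Python sketch of steps 1-3, treating Q as a plain function of (s, a). The names GAMMA, build_training_set, and num_actions are illustrative assumptions, not code from the videos:

```python
GAMMA = 0.995  # discount factor γ (value chosen for illustration)

def build_training_set(experiences, Q, num_actions):
    """Turn stored (s, a, r, s') tuples into a supervised dataset.

    Each input is x = (s, a); each target is the Bellman estimate
    y = r + γ · max over a' of Q(s', a').
    """
    xs, ys = [], []
    for s, a, r, s_next in experiences:
        best_next = max(Q(s_next, a_next) for a_next in range(num_actions))
        xs.append((s, a))
        ys.append(r + GAMMA * best_next)
    return xs, ys

# Usage with a placeholder Q (an untrained network that outputs 0):
experiences = [((0.6, 1.4), 2, -0.3, (0.55, 1.3))]
xs, ys = build_training_set(experiences, lambda s, a: 0.0, num_actions=4)
# ys == [-0.3] here, since max_{a'} Q(s', a') is 0 for the placeholder
```

Step 3 would then fit a neural network to these (x, y) pairs with ordinary supervised learning, and step 4 repeats the loop using the improved Q estimate.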

The target y represents the expected total return: immediate reward plus discounted optimal future value from the next state.
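
For instance, with illustrative numbers r = 2, γ = 0.9, and max_{a’} Q(s’,a’) = 10, the target would be y = 2 + 0.9 × 10 = 11.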