
Continuous State Spaces Quiz

The Lunar Lander is a continuous state Markov Decision Process (MDP) because:

  • The state contains numbers, such as position and velocity, that are continuous valued.
  • The reward contains numbers that are continuous valued.
  • The state-action value function Q(s,a) outputs continuous valued numbers.
  • The state has multiple numbers rather than only a single number (such as position in the x-direction).

In the learning algorithm described in the videos, we repeatedly create an artificial training set to which we apply supervised learning, where the input is x=(s,a) and the target, constructed using the Bellman equation, is y = _____?

  • y=R(s)
  • y=R(s)+γmax_{a’}Q(s’,a’) where s’ is the state you get to after taking action a in state s
  • y=R(s’) where s’ is the state you get to after taking action a in state s
  • y=max_{a’}Q(s’,a’) where s’ is the state you get to after taking action a in state s
Key concepts:

  • Discrete state space: a finite set of possible states (e.g., Mars rover positions 1-6)
  • Continuous state space: infinitely many possible state values (e.g., real-valued position coordinates)

Example state vectors:

  • Car/Truck: 6 numbers (x, y, θ, ẋ, ẏ, θ̇)
  • Helicopter: 12 numbers (position, orientation, velocities, angular velocities)
  • Lunar Lander: 8 numbers (x, y, ẋ, ẏ, θ, θ̇, l, r)
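
To make the state representation concrete, here is a minimal Python sketch of the lunar lander's 8-number state as a NumPy array. The numeric values are invented for illustration, and l and r are interpreted as binary ground-contact flags for the left and right legs:

```python
import numpy as np

# One lunar lander state as an 8-number vector, ordered as in the list above.
state = np.array([
    0.6,    # x: horizontal position
    1.4,    # y: vertical position
    -0.1,   # ẋ: horizontal velocity
    -0.5,   # ẏ: vertical velocity
    0.05,   # θ: tilt angle
    0.01,   # θ̇: angular velocity
    0.0,    # l: left leg touching the ground? (binary flag)
    0.0,    # r: right leg touching the ground? (binary flag)
])
x, y, x_dot, y_dot, theta, theta_dot, l, r = state
```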
Learning algorithm:

  1. Collect experiences: (s, a, r, s’) tuples from environment interaction
  2. Create training examples: x = (s, a), y = r + γ max_{a’} Q(s’, a’) (sketched below)
  3. Train network: use supervised learning to approximate Q(s,a)
  4. Iterate: repeat with the improved Q function estimate
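
A minimal Python sketch of steps 1-3, treating Q as a plain function of (s, a). The names GAMMA, build_training_set, and num_actions are illustrative assumptions, not code from the videos:

```python
GAMMA = 0.995  # discount factor γ (value chosen for illustration)

def build_training_set(experiences, Q, num_actions):
    """Turn stored (s, a, r, s') tuples into a supervised dataset.

    Each input is x = (s, a); each target is the Bellman estimate
    y = r + γ · max over a' of Q(s', a').
    """
    xs, ys = [], []
    for s, a, r, s_next in experiences:
        best_next = max(Q(s_next, a_next) for a_next in range(num_actions))
        xs.append((s, a))
        ys.append(r + GAMMA * best_next)
    return xs, ys

# Usage with a placeholder Q (an untrained network that outputs 0):
experiences = [((0.6, 1.4), 2, -0.3, (0.55, 1.3))]
xs, ys = build_training_set(experiences, lambda s, a: 0.0, num_actions=4)
# ys == [-0.3] here, since max_{a'} Q(s', a') is 0 for the placeholder
```

Step 3 would then fit a neural network to these (x, y) pairs with ordinary supervised learning, and step 4 repeats the loop using the improved Q estimate.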

The target y represents the expected total return: immediate reward plus discounted optimal future value from the next state.
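
For instance, with illustrative numbers r = 2, γ = 0.9, and max_{a’} Q(s’,a’) = 10, the target would be y = 2 + 0.9 × 10 = 11.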