Continuous State Spaces Quiz
Question 1
The Lunar Lander is a continuous state Markov Decision Process (MDP) because:
- The state contains numbers such as position and velocity that are continuous valued. ✓
- The reward contains numbers that are continuous valued
- The state-action value Q(s,a) function outputs continuous valued numbers
- The state has multiple numbers rather than only a single number (such as position in the x-direction)
Question 2
In the learning algorithm described in the videos, we repeatedly create an artificial training set to which we apply supervised learning, where the input x=(s,a) and the target, constructed using Bellman’s equations, is y = _____?
- y=R(s)
- y=R(s)+γmax_{a’}Q(s’,a’) where s’ is the state you get to after taking action a in state s ✓
- y=R(s’) where s’ is the state you get to after taking action a in state s
- y=max_{a’}Q(s’,a’) where s’ is the state you get to after taking action a in state s
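The correct target above can be written directly in code. Below is a minimal sketch, assuming a placeholder `q_estimate` function stands in for the current Q-network; the reward, discount factor, and state dimensions are illustrative:

```python
import numpy as np

# Placeholder for the current Q-network: maps a (state, action) pair to a scalar.
# In the actual algorithm this would be the neural network being trained.
def q_estimate(state, action):
    return 0.0  # untrained initial guess

def build_target(reward, next_state, actions, gamma=0.99):
    """y = R(s) + gamma * max_a' Q(s', a'), where s' follows taking a in s."""
    future = max(q_estimate(next_state, a) for a in actions)
    return reward + gamma * future

# One illustrative transition: observed reward and 8-dimensional next state
s_prime = np.zeros(8)
y = build_target(reward=1.0, next_state=s_prime, actions=range(4))
print(y)  # 1.0 with the untrained q_estimate above
```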
Quick Reference
Continuous vs Discrete State Spaces
- Discrete: Finite set of possible states (e.g., Mars rover positions 1-6)
- Continuous: Infinite possible state values (e.g., real-valued position coordinates); see the sketch below
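A quick illustration of the distinction, with made-up numbers (the variable names are only for exposition):

```python
import numpy as np

# Discrete state: one of a small, finite set (e.g., Mars rover grid positions 1-6)
rover_state = 4

# Continuous state: a vector of real-valued quantities that can take any value
# in a range (e.g., a truck's x, y, θ, ẋ, ẏ, θ̇)
truck_state = np.array([8.3, -1.2, 0.05, 12.0, 0.4, -0.01])
```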
Key Continuous State Examples
- Car/Truck: 6 numbers (x, y, θ, ẋ, ẏ, θ̇)
- Helicopter: 12 numbers (position, orientation, velocities, angular velocities)
- Lunar Lander: 8 numbers (x, y, ẋ, ẏ, θ, θ̇, l, r); see the state vector sketch below
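For concreteness, the Lunar Lander state can be stored as an 8-element vector like the one below; the values are illustrative, not taken from the actual simulator:

```python
import numpy as np

state = np.array([
    0.10,   # x: horizontal position
    1.40,   # y: vertical position (altitude)
    0.05,   # ẋ: horizontal velocity
   -0.30,   # ẏ: vertical velocity
    0.02,   # θ: tilt angle
    0.01,   # θ̇: angular velocity
    0.0,    # l: left leg touching the ground (0 or 1)
    0.0,    # r: right leg touching the ground (0 or 1)
])
```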
Neural Network Training Process
- Collect experiences: (s, a, r, s’) tuples from environment interaction
- Create training examples: x = (s,a), y = r + γ max_{a’} Q(s’, a’)
- Train network: Use supervised learning to approximate Q(s,a)
- Iterate: Repeat with the improved Q function estimate (a code sketch of this loop follows)
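A minimal sketch of this loop, assuming TensorFlow/Keras for the Q-network and random placeholder tuples in place of real environment interaction (layer sizes, γ, and the other numbers are illustrative):

```python
import numpy as np
import tensorflow as tf

STATE_DIM, NUM_ACTIONS, GAMMA = 8, 4, 0.99  # Lunar Lander sizes; gamma is illustrative

# Q-network as in the quiz: input x = (s, a) with the action one-hot encoded,
# output a single estimate of Q(s, a).
q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(STATE_DIM + NUM_ACTIONS,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
q_net.compile(optimizer="adam", loss="mse")

def encode(state, action):
    """Concatenate the state vector with a one-hot encoding of the action."""
    return np.concatenate([state, np.eye(NUM_ACTIONS)[action]])

def max_next_q(next_state):
    """max over a' of Q(s', a') under the current network."""
    xs = np.stack([encode(next_state, a) for a in range(NUM_ACTIONS)])
    return float(q_net.predict(xs, verbose=0).max())

# 1. Collect experiences: here, random placeholder (s, a, r, s') tuples
experiences = [
    (np.random.randn(STATE_DIM), np.random.randint(NUM_ACTIONS),
     float(np.random.randn()), np.random.randn(STATE_DIM))
    for _ in range(64)
]

# 2. Create training examples: x = (s, a), y = r + gamma * max_a' Q(s', a')
X = np.stack([encode(s, a) for s, a, r, s_next in experiences])
y = np.array([r + GAMMA * max_next_q(s_next) for s, a, r, s_next in experiences])

# 3. Train the network with supervised learning on (X, y)
q_net.fit(X, y, epochs=5, verbose=0)

# 4. Iterate: with the improved Q estimate, collect more experiences and repeat
```

A common refinement is a network that takes only the state as input and outputs one value per action, so the max over actions needs just a single forward pass instead of one per action.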
Bellman Equation Application
The target y represents the expected total return: immediate reward plus discounted optimal future value from the next state.
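As a quick sanity check with made-up values, suppose R(s) = 4, γ = 0.5, and the best achievable next-state value is max_{a’} Q(s’, a’) = 10; then y = R(s) + γ max_{a’} Q(s’, a’) = 4 + 0.5 × 10 = 9.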