
State Action Value Function Lab

This programming lab provides hands-on experience with the state-action value function: you modify the parameters of the Mars rover example and observe how the Q(s,a) values change.

imports.py
import numpy as np
from utils import *
fixed_params.py
# Do not modify
num_states = 6
num_actions = 2
modifiable_params.py
terminal_left_reward = 100
terminal_right_reward = 40
each_step_reward = 0
# Discount factor
gamma = 0.5
# Probability of going in the wrong direction
misstep_prob = 0
visualization.py
generate_visualization(terminal_left_reward, terminal_right_reward, each_step_reward, gamma, misstep_prob)

Steps for working through the lab:
  1. Run initial code: Execute all cells to see baseline Q(s,a) values
  2. Modify parameters: Change the modifiable parameters one at a time
  3. Observe changes: Note how Q values and optimal policy change
  4. Experiment systematically: Try different combinations of parameters

Suggested experiments to try one at a time (a parameter-sweep sketch follows this list):
  • Change terminal_right_reward to 10
  • Change terminal_left_reward to 50
  • Try negative rewards for intermediate states
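
If you want to step through several of these suggestions in one pass, a minimal sketch is shown below. It assumes the cells above (the imports and the parameter definitions) have already been run, and that generate_visualization can simply be called again with new arguments; the file name and the particular reward values in the loop are illustrative, not part of the lab.

parameter_sweep.py
# Illustrative only: re-render the Q values for a few right-terminal rewards
# while holding every other parameter at its current value.
for right_reward in [40, 20, 10]:
    generate_visualization(terminal_left_reward, right_reward,
                           each_step_reward, gamma, misstep_prob)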

Questions to consider as you experiment:
  • How do Q values change when rewards change?
  • Which states are most affected by discount factor changes?
  • How does the optimal policy shift with different parameters?
  • When does the policy switch between “always go left” and going right from some states?
  • How does uncertainty (misstep_prob > 0) affect decision making?
  • What happens to Q values as the environment becomes more stochastic?

Intuition for the discount factor γ (see the worked example after this list):
  • High γ (near 1.0): More patient, willing to wait for better rewards
  • Low γ (near 0.0): Impatient, prefers immediate rewards
  • Medium γ (around 0.5): Balanced approach
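
As a concrete check of these effects, consider state 5 with the default rewards (100 at the far left, 40 at the far right) and misstep_prob = 0. Going right reaches the 40 reward after one step, so Q(5, right) = γ·40; going left reaches the 100 reward after four steps, so Q(5, left) = γ^4·100. At γ = 0.5 that is 20 versus 6.25 and the rover should go right, but at γ = 0.9 it is 36 versus about 65.6, so the more patient rover heads left instead.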

With the initial parameters, you should see Q values matching the lecture examples where the optimal policy is to go left from states 2, 3, 4 and right from state 5.
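
These baseline numbers follow from the Bellman equation used in the lectures, Q(s, a) = R(s) + γ·max over a' of Q(s', a'), where s' is the state the action leads to and a terminal state is worth exactly its reward. For example, Q(2, left) = 0 + 0.5·100 = 50; Q(4, left) = 0.5^3·100 = 12.5 while Q(4, right) = 0 + 0.5·20 = 10, so state 4 prefers to go left; and Q(3, right) = 0 + 0.5·12.5 = 6.25, because the best continuation after stepping right into state 4 is to turn around and head left.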

Typical effects of parameter changes:
  • Lower right reward: May cause “always go left” policy
  • Higher discount factor: Increases patience for distant rewards
  • Lower discount factor: Increases preference for immediate rewards
  • Positive misstep probability: Reduces overall Q values due to uncertainty
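
A common way to model the misstep probability, consistent with how misstep_prob is described above, is to replace the single next state in the Bellman equation with an expectation over the two directions the rover can actually move: Q(s, a) = R(s) + γ·[(1 - misstep_prob)·V*(s_intended) + misstep_prob·V*(s_opposite)], where s_intended is the neighbouring state the action aims for, s_opposite is one step the other way, and V*(s) is the value of acting optimally from s. Intuitively, the rover can no longer guarantee reaching the better terminal, so the expected returns of its best actions fall as misstep_prob grows.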

The generate_visualization function creates a display showing:

  • Current Q(s,a) values for each state-action pair
  • Optimal policy (best action for each state)
  • Expected returns from each state under the optimal policy

Changes to parameters affect the entire Q function simultaneously. The visualization helps you understand these relationships without needing to implement the underlying value iteration algorithm yourself.
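
For reference only, the computation behind that display could be sketched along the following lines. This is an illustrative re-implementation, not the lab's actual utils code: it assumes states 0 and num_states - 1 are the terminal cells, that a misstep moves the rover one cell in the opposite direction, and that a terminal state is worth exactly its reward.

value_iteration_sketch.py
import numpy as np

def compute_q_values(terminal_left_reward, terminal_right_reward,
                     each_step_reward, gamma, misstep_prob,
                     num_states=6, num_iterations=100):
    # Reward collected in each state: terminals at the two ends, each_step_reward in between.
    rewards = np.full(num_states, each_step_reward, dtype=float)
    rewards[0], rewards[-1] = terminal_left_reward, terminal_right_reward

    v = rewards.copy()                       # state values; terminals are worth their reward
    q = np.zeros((num_states, 2))            # q[s, 0] = try to move left, q[s, 1] = try to move right
    for _ in range(num_iterations):
        for s in range(1, num_states - 1):   # terminal states are never updated
            for a, step in enumerate((-1, +1)):
                intended, opposite = s + step, s - step
                # With probability misstep_prob the rover slips and moves the wrong way.
                expected_next = (1 - misstep_prob) * v[intended] + misstep_prob * v[opposite]
                q[s, a] = rewards[s] + gamma * expected_next
        v[1:-1] = q[1:-1].max(axis=1)        # each non-terminal state follows its best action
    return q

With the baseline parameters, compute_q_values(100, 40, 0, 0.5, 0) converges to the same numbers quoted earlier (for example 12.5 and 10 for the two actions in state 4, which is row 3 of the 0-indexed result), so it can serve as a sanity check when you change the parameters by hand.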

Learning Goals

By completing this lab, you should gain intuition about:

  • How reward structure affects optimal behavior
  • The role of discount factor in balancing immediate vs future rewards
  • How uncertainty (stochastic environments) impacts decision making
  • The relationship between Q values and optimal policies

This hands-on experience prepares you for understanding more complex reinforcement learning algorithms while building intuition about the core concepts through interactive exploration.