State-Action Value Function Lab
Lab Overview
This programming lab provides hands-on experience with the state-action value function: you modify the Mars rover's parameters and observe how the Q(s, a) values change.
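As a reminder of the definition used in lecture: Q(s, a) is the return you get if you start in state s, take action a once, and then behave optimally afterward. In the deterministic case this satisfies the Bellman equation

$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$

where s′ is the state reached by taking action a from s.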
Code Structure
Import and Setup
```python
import numpy as np
from utils import *
```
Fixed Parameters
```python
# Do not modify
num_states = 6
num_actions = 2
```
Modifiable Parameters
```python
terminal_left_reward = 100
terminal_right_reward = 40
each_step_reward = 0

# Discount factor
gamma = 0.5

# Probability of going in the wrong direction
misstep_prob = 0
```
Generate Visualization
```python
generate_visualization(terminal_left_reward, terminal_right_reward,
                       each_step_reward, gamma, misstep_prob)
```
Exercise Instructions
Basic Exploration
Section titled “Basic Exploration”- Run initial code: Execute all cells to see baseline Q(s,a) values
- Modify parameters: Change the modifiable parameters one at a time
- Observe changes: Note how Q values and optimal policy change
- Experiment systematically: Try different combinations of parameters (a sweep sketch follows this list)
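For example, a systematic experiment can re-run the visualization over several values of one parameter while holding the rest fixed. A minimal sweep sketch, assuming the parameter variables and `generate_visualization` from the cells above are already defined:

```python
# Sweep the discount factor while holding the other parameters fixed.
for gamma in [0.3, 0.5, 0.9, 0.99]:
    print(f"--- gamma = {gamma} ---")
    generate_visualization(terminal_left_reward, terminal_right_reward,
                           each_step_reward, gamma, misstep_prob)
```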
Specific Experiments to Try
- Change `terminal_right_reward` to 10
- Change `terminal_left_reward` to 50
- Try negative rewards for intermediate states
- Set `gamma = 0.9` (more patient)
- Set `gamma = 0.3` (very impatient)
- Set `gamma = 0.99` (almost no discounting)
- Set `misstep_prob = 0.1` (10% chance of wrong direction)
- Set `misstep_prob = 0.4` (40% chance of wrong direction)
- Observe how uncertainty affects Q values (a loop that runs all of these experiments follows this list)
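The loop below runs each of these experiments in sequence. It is a convenience sketch rather than part of the lab's starter code, and it reuses the positional signature of `generate_visualization` shown earlier:

```python
# Each entry overrides one parameter relative to the baseline.
experiments = [
    {"terminal_right_reward": 10},
    {"terminal_left_reward": 50},
    {"each_step_reward": -1},   # negative reward for intermediate states
    {"gamma": 0.9},
    {"gamma": 0.3},
    {"gamma": 0.99},
    {"misstep_prob": 0.1},
    {"misstep_prob": 0.4},
]
baseline = dict(terminal_left_reward=100, terminal_right_reward=40,
                each_step_reward=0, gamma=0.5, misstep_prob=0)

for override in experiments:
    p = {**baseline, **override}
    print(f"--- {override} ---")
    generate_visualization(p["terminal_left_reward"], p["terminal_right_reward"],
                           p["each_step_reward"], p["gamma"], p["misstep_prob"])
```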
Key Observations to Make
Q Value Patterns
Section titled “Q Value Patterns”- How do Q values change when rewards change?
- Which states are most affected by discount factor changes?
- How does the optimal policy shift with different parameters?
Policy Changes
- When does the policy change from “always go left” to mixed strategies?
- How does uncertainty (misstep_prob > 0) affect decision making?
- What happens to Q values as the environment becomes more stochastic? (See the expected-return form after this list.)
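To reason about the last question: with misstep probability p, each Q value becomes an expected return that averages over the intended and the unintended next state. Writing s_int and s_wrong for the two possible successors, the standard stochastic form is

$$Q(s, a) = R(s) + \gamma \left[ (1 - p) \max_{a'} Q(s_{\text{int}}, a') + p \max_{a'} Q(s_{\text{wrong}}, a') \right]$$

Mixing in the unintended successor drags the best actions' values down, and with enough uncertainty a state's optimal action can flip.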
Discount Factor Effects
- High γ (near 1.0): More patient, willing to wait for better rewards
- Low γ (near 0.0): Impatient, prefers immediate rewards
- Medium γ (around 0.5): Balanced approach
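A worked example of this trade-off from state 5 (four moves from the left terminal's reward of 100, one move from the right terminal's reward of 40, zero step rewards, no missteps), using the lecture's return convention:

$$
\begin{aligned}
\gamma = 0.5: &\quad Q(5, \leftarrow) = 0.5^4 \times 100 = 6.25 \;<\; Q(5, \rightarrow) = 0.5 \times 40 = 20 \\
\gamma = 0.9: &\quad Q(5, \leftarrow) = 0.9^4 \times 100 \approx 65.6 \;>\; Q(5, \rightarrow) = 0.9 \times 40 = 36
\end{aligned}
$$

Raising γ alone flips state 5's optimal action from right to left.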
Expected Results
Baseline Results (γ = 0.5)
With the initial parameters, you should see Q values matching the lecture examples, where the optimal policy is to go left from states 2, 3, and 4 and to go right from state 5.
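Concretely, with misstep_prob = 0 these returns can be computed by hand (same return convention as the worked example above), so the visualization should show:

- Q(2, ←) = 50, Q(2, →) = 12.5 → go left
- Q(3, ←) = 25, Q(3, →) = 6.25 → go left
- Q(4, ←) = 12.5, Q(4, →) = 10 → go left
- Q(5, ←) = 6.25, Q(5, →) = 20 → go right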
Modified Results Examples
- Lower right reward: May cause an “always go left” policy
- Higher discount factor: Increases patience for distant rewards
- Lower discount factor: Increases preference for immediate rewards
- Positive misstep probability: Reduces overall Q values due to uncertainty
Programming Notes
Understanding the Visualization
The `generate_visualization` function creates a display showing:
- Current Q(s,a) values for each state-action pair
- Optimal policy (best action for each state)
- Expected returns from each state under optimal policy
Parameter Relationship
Changes to parameters affect the entire Q function simultaneously. The visualization helps you understand these relationships without needing to implement the underlying value iteration algorithm yourself.
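If you are curious what that computation looks like, the sketch below is a minimal value-iteration loop for the 6-state rover. It is an illustration under the parameter semantics described above, not the lab's actual utils implementation:

```python
import numpy as np

def compute_q_values(terminal_left_reward=100, terminal_right_reward=40,
                     each_step_reward=0, gamma=0.5, misstep_prob=0.0,
                     num_states=6, num_iterations=100):
    # rewards[s] is the reward received in state s (0-indexed:
    # state 0 is the left terminal, state num_states - 1 the right).
    rewards = np.full(num_states, each_step_reward, dtype=float)
    rewards[0] = terminal_left_reward
    rewards[-1] = terminal_right_reward

    q = np.zeros((num_states, 2))  # actions: 0 = left, 1 = right
    for _ in range(num_iterations):
        new_q = np.zeros_like(q)
        for s in range(1, num_states - 1):   # terminal states are absorbing
            for a, intended in ((0, s - 1), (1, s + 1)):
                wrong = s + 1 if a == 0 else s - 1
                # Expected return: reward now, plus the discounted value of
                # whichever neighbor the rover actually reaches.
                new_q[s, a] = rewards[s] + gamma * (
                    (1 - misstep_prob) * np.max(q[intended])
                    + misstep_prob * np.max(q[wrong])
                )
        new_q[0] = rewards[0]    # terminal returns are just the terminal reward
        new_q[-1] = rewards[-1]
        q = new_q
    return q

print(np.round(compute_q_values(), 2))
```

With the baseline parameters this reproduces the hand-computed values above, e.g. Q(2, ←) = 50 and Q(5, →) = 20.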
Lab Objectives
By completing this lab, you should gain intuition about:
- How reward structure affects optimal behavior
- The role of discount factor in balancing immediate vs future rewards
- How uncertainty (stochastic environments) impacts decision making
- The relationship between Q values and optimal policies
This hands-on exploration builds intuition about the core concepts and prepares you to understand the more complex reinforcement learning algorithms that follow.