
State Action Value Function Lab

This programming lab provides hands-on experience with the state-action value function: you modify the parameters of the Mars rover example and observe how the Q(s,a) values change.

imports.py
import numpy as np
from utils import *
fixed_params.py
# Do not modify
num_states = 6
num_actions = 2
modifiable_params.py
terminal_left_reward = 100
terminal_right_reward = 40
each_step_reward = 0
# Discount factor
gamma = 0.5
# Probability of going in the wrong direction
misstep_prob = 0
visualization.py
generate_visualization(terminal_left_reward, terminal_right_reward, each_step_reward, gamma, misstep_prob)

Steps for working through the lab:
  1. Run initial code: Execute all cells to see baseline Q(s,a) values
  2. Modify parameters: Change the modifiable parameters one at a time
  3. Observe changes: Note how Q values and optimal policy change
  4. Experiment systematically: Try different combinations of parameters

Suggested experiments to try one at a time (a parameter-sweep sketch follows this list):
  • Change terminal_right_reward to 10
  • Change terminal_left_reward to 50
  • Try negative rewards for intermediate states
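
If you want to step through several of these suggestions in one pass, a minimal sketch is shown below. It assumes the cells above (the imports and the parameter definitions) have already been run, and that generate_visualization can simply be called again with new arguments; the file name and the particular reward values in the loop are illustrative, not part of the lab.

parameter_sweep.py
# Illustrative only: re-render the Q values for a few right-terminal rewards
# while holding every other parameter at its current value.
for right_reward in [40, 20, 10]:
    generate_visualization(terminal_left_reward, right_reward,
                           each_step_reward, gamma, misstep_prob)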

Questions to consider as you experiment:
  • How do Q values change when rewards change?
  • Which states are most affected by discount factor changes?
  • How does the optimal policy shift with different parameters?
  • When does the policy switch between “always go left” and going right from some states?
  • How does uncertainty (misstep_prob > 0) affect decision making?
  • What happens to Q values as the environment becomes more stochastic?

Intuition for the discount factor γ (see the worked example after this list):
  • High γ (near 1.0): More patient, willing to wait for better rewards
  • Low γ (near 0.0): Impatient, prefers immediate rewards
  • Medium γ (around 0.5): Balanced approach
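
As a concrete check of these effects, consider state 5 with the default rewards (100 at the far left, 40 at the far right) and misstep_prob = 0. Going right reaches the 40 reward after one step, so Q(5, right) = γ·40; going left reaches the 100 reward after four steps, so Q(5, left) = γ^4·100. At γ = 0.5 that is 20 versus 6.25 and the rover should go right, but at γ = 0.9 it is 36 versus about 65.6, so the more patient rover heads left instead.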

With the initial parameters, you should see Q values matching the lecture examples where the optimal policy is to go left from states 2, 3, 4 and right from state 5.
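
These baseline numbers follow from the Bellman equation used in the lectures, Q(s, a) = R(s) + γ·max over a' of Q(s', a'), where s' is the state the action leads to and a terminal state is worth exactly its reward. For example, Q(2, left) = 0 + 0.5·100 = 50; Q(4, left) = 0.5^3·100 = 12.5 while Q(4, right) = 0 + 0.5·20 = 10, so state 4 prefers to go left; and Q(3, right) = 0 + 0.5·12.5 = 6.25, because the best continuation after stepping right into state 4 is to turn around and head left.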

Typical effects of parameter changes:
  • Lower right reward: May cause “always go left” policy
  • Higher discount factor: Increases patience for distant rewards
  • Lower discount factor: Increases preference for immediate rewards
  • Positive misstep probability: Reduces overall Q values due to uncertainty
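
A common way to model the misstep probability, consistent with how misstep_prob is described above, is to replace the single next state in the Bellman equation with an expectation over the two directions the rover can actually move: Q(s, a) = R(s) + γ·[(1 - misstep_prob)·V*(s_intended) + misstep_prob·V*(s_opposite)], where s_intended is the neighbouring state the action aims for, s_opposite is one step the other way, and V*(s) is the value of acting optimally from s. Intuitively, the rover can no longer guarantee reaching the better terminal, so the expected returns of its best actions fall as misstep_prob grows.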

The generate_visualization function creates a display showing:

  • Current Q(s,a) values for each state-action pair
  • Optimal policy (best action for each state)
  • Expected returns from each state under the optimal policy

Changes to parameters affect the entire Q function simultaneously. The visualization helps you understand these relationships without needing to implement the underlying value iteration algorithm yourself.
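
For reference only, the computation behind that display could be sketched along the following lines. This is an illustrative re-implementation, not the lab's actual utils code: it assumes states 0 and num_states - 1 are the terminal cells, that a misstep moves the rover one cell in the opposite direction, and that a terminal state is worth exactly its reward.

value_iteration_sketch.py
import numpy as np

def compute_q_values(terminal_left_reward, terminal_right_reward,
                     each_step_reward, gamma, misstep_prob,
                     num_states=6, num_iterations=100):
    # Reward collected in each state: terminals at the two ends, each_step_reward in between.
    rewards = np.full(num_states, each_step_reward, dtype=float)
    rewards[0], rewards[-1] = terminal_left_reward, terminal_right_reward

    v = rewards.copy()                       # state values; terminals are worth their reward
    q = np.zeros((num_states, 2))            # q[s, 0] = try to move left, q[s, 1] = try to move right
    for _ in range(num_iterations):
        for s in range(1, num_states - 1):   # terminal states are never updated
            for a, step in enumerate((-1, +1)):
                intended, opposite = s + step, s - step
                # With probability misstep_prob the rover slips and moves the wrong way.
                expected_next = (1 - misstep_prob) * v[intended] + misstep_prob * v[opposite]
                q[s, a] = rewards[s] + gamma * expected_next
        v[1:-1] = q[1:-1].max(axis=1)        # each non-terminal state follows its best action
    return q

With the baseline parameters, compute_q_values(100, 40, 0, 0.5, 0) converges to the same numbers quoted earlier (for example 12.5 and 10 for the two actions in state 4, which is row 3 of the 0-indexed result), so it can serve as a sanity check when you change the parameters by hand.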

Learning Goals

By completing this lab, you should gain intuition about:

  • How reward structure affects optimal behavior
  • The role of discount factor in balancing immediate vs future rewards
  • How uncertainty (stochastic environments) impacts decision making
  • The relationship between Q values and optimal policies

This hands-on experience prepares you for understanding more complex reinforcement learning algorithms while building intuition about the core concepts through interactive exploration.