Value Function Example
State-Action Value Function Example
Interactive Exploration Purpose
This section demonstrates “how the values of Q(s,a) change depending on the problem” through an interactive Jupyter notebook that allows modification of the Mars rover parameters.
Notebook Structure
Fixed Parameters
Section titled “Fixed Parameters”- Number of states: 6 (do not change)
- Number of actions: 2 (do not change)
Modifiable Parameters
- Terminal left reward: 100 (initially)
- Terminal right reward: 40 (initially)
- Each step reward: 0 (intermediate states)
- Discount factor (γ): 0.5 (initially)
- Misstep probability: 0 (initially, for later discussion)
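As a reference for this parameter cell, here is a minimal sketch; the variable names mirror the list above, while the commented-out visualization call is only an assumption about the lab's helper, not a confirmed API:

```python
# Fixed for this lab -- do not change
num_states  = 6
num_actions = 2

# Free to modify and re-run
terminal_left_reward  = 100
terminal_right_reward = 40
each_step_reward      = 0     # reward in the intermediate states
gamma        = 0.5            # discount factor
misstep_prob = 0              # probability of moving in the unintended direction

# After changing any value above, re-run the notebook's plotting cell.
# (Helper name assumed for illustration; the actual notebook imports its own utility.)
# generate_visualization(terminal_left_reward, terminal_right_reward,
#                        each_step_reward, gamma, misstep_prob)
```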
Visualization Output
The code computes and visualizes “the optimal policy as well as the Q function Q(s,a),” showing the values from the lecture examples.
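For readers without the notebook open, the small self-contained sketch below reproduces the same Q(s,a) table for the deterministic case (misstep probability 0); `compute_q_values` is an illustrative name, not the lab's own function:

```python
import numpy as np

def compute_q_values(terminal_left_reward=100, terminal_right_reward=40,
                     each_step_reward=0, gamma=0.5, num_states=6):
    """Q(s, a) for the deterministic 6-state Mars rover, found by simple value iteration."""
    rewards = np.full(num_states, float(each_step_reward))
    rewards[0]  = terminal_left_reward    # state 1 (index 0) is a terminal state
    rewards[-1] = terminal_right_reward   # state 6 (index 5) is a terminal state

    q = np.zeros((num_states, 2))         # columns: 0 = go left, 1 = go right
    q[0, :]  = rewards[0]                 # the return from a terminal state is just its reward
    q[-1, :] = rewards[-1]

    for _ in range(1000):                 # iterate until the table stops changing
        v = q.max(axis=1)                 # V(s) = max over actions of Q(s, a)
        new_q = q.copy()
        for s in range(1, num_states - 1):                # intermediate states only
            new_q[s, 0] = rewards[s] + gamma * v[s - 1]   # step left
            new_q[s, 1] = rewards[s] + gamma * v[s + 1]   # step right
        if np.allclose(new_q, q):
            break
        q = new_q
    return q

q = compute_q_values()
print(np.round(q, 2))       # row s-1 holds [Q(s, left), Q(s, right)]; state 5 shows [6.25, 20.0]
print(q.argmax(axis=1))     # 0 = left, 1 = right (entries for the terminal states are not meaningful)
```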
Parameter Modification Examples
Reducing Right Terminal Reward
Section titled “Reducing Right Terminal Reward”Change: Terminal right reward from 40 to 10
Effect on Q values:
- State 5: Q(left) = 6.25, Q(right) = 5
- Policy change: “Now that the reward at the right is so small, only 10, even when you’re this close to it you’d rather go left all the way”
- New optimal policy: “Go left from every single state”
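Reusing the `compute_q_values` sketch from above, this experiment is a one-argument change (illustrative code, not the lab's own cells):

```python
q = compute_q_values(terminal_right_reward=10)   # everything else left at its initial value
print(np.round(q, 2))       # the row for state 5 is now [6.25, 5.0]
print(q.argmax(axis=1))     # every intermediate state picks action 0, i.e. "go left"
```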
Increasing Discount Factor
Change: γ from 0.5 to 0.9
Effect: “Makes the Mars Rover less impatient: it is willing to take longer to hold out for a higher reward”
Reasoning:
- “Rewards in the future are not multiplied by 0.5 to some high power; they’re multiplied by 0.9 to some high power”
- “It’s willing to be more patient, because rewards in the future are not discounted, or multiplied, by as small a number”
Results:
- State 5: Q(left) = 65.61, Q(right) = 36
- Note: “36 is 0.9 times this terminal reward of 40”
- Policy: “When it’s more patient, it’s willing to go to the left, even when you’re in state 5”
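The state-5 numbers can be checked directly from the discounted-return definition; the two lines below are a hand check, not notebook output:

```python
gamma = 0.9
print(round(gamma**4 * 100, 2))   # Q(5, left)  = 65.61 -- four discounted steps back to the 100 reward
print(round(gamma**1 * 40, 2))    # Q(5, right) = 36.0  -- one discounted step to the 40 reward
```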
Decreasing Discount Factor
Change: γ to 0.3 (very small)
Effect: “Very heavily discounts rewards in the future. This makes it incredibly impatient.”
Policy Impact:
- State 4 behavior changes: “Not going to have the patience to go for the larger 100 reward, because the discount factor gamma is now so small”
- “It would rather go for the reward of 40; even though it’s a much smaller reward, it is closer”
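The lecture excerpt doesn’t quote the state-4 numbers, but the same discounted-return arithmetic (a hand check under the rewards of 100 on the left and 40 on the right) explains the flip:

```python
gamma = 0.3
print(round(gamma**3 * 100, 2))   # Q(4, left)  = 2.7 -- three discounted steps to the 100 reward
print(round(gamma**2 * 40, 2))    # Q(4, right) = 3.6 -- two discounted steps to the 40 reward
# Q(4, right) > Q(4, left), so the impatient rover heads right from state 4
```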
Key Learning Outcomes
Understanding Q Value Changes
By experimenting with different parameters, you can observe:
- How Q(s,a) values change with different reward structures
- How discount factor affects patience/impatience in decision making
- How optimal policy adapts to parameter changes
Return vs Policy Relationship
- Optimal return: for any state, the optimal return is the larger of the two values Q(s, left) and Q(s, right)
- Policy changes directly follow from Q value comparisons
- Lower discount factors favor immediate rewards over future gains
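Continuing the earlier sketch, both the optimal return and the optimal policy fall straight out of the Q table (illustrative code, not the lab's helper):

```python
import numpy as np

# q is the (num_states, 2) table returned by compute_q_values above
optimal_return = q.max(axis=1)                                   # best achievable return from each state
optimal_policy = np.where(q[:, 0] >= q[:, 1], "left", "right")   # action with the larger Q(s, a)
print(optimal_return)
print(optimal_policy)   # the first and last entries are terminal states with no meaningful action
```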
Recommended Exploration
- Change the reward function: Modify terminal rewards to see policy shifts
- Adjust the discount factor γ: Try different values to understand patience effects
- Observe Q(s,a) changes: Notice how values shift with parameters
- Analyze the optimal policy: See how action choices change based on Q values
Lab Benefits
“I hope that will sharpen your intuition about how these different quantities are affected depending on the rewards and so on in a reinforcement learning application.”
The interactive exploration helps build intuition for how reinforcement learning parameters affect both the computed Q values and the resulting optimal policies, preparing for understanding the Bellman equation in the next section.