Value Function Example
State-Action Value Function Example
Interactive Exploration Purpose
This section demonstrates “how the values of Q(s,a) change depending on the problem” through an interactive Jupyter notebook that allows modification of the Mars rover parameters.
Notebook Structure
Fixed Parameters
Section titled “Fixed Parameters”- Number of states: 6 (do not change)
- Number of actions: 2 (do not change)
Modifiable Parameters
- Terminal left reward: 100 (initially)
- Terminal right reward: 40 (initially)
- Each step reward: 0 (intermediate states)
- Discount factor (γ): 0.5 (initially)
- Misstep probability: 0 (initially, for later discussion)
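As a reference for this parameter cell, here is a minimal sketch; the variable names mirror the list above, while the commented-out visualization call is only an assumption about the lab's helper, not a confirmed API:

```python
# Fixed for this lab -- do not change
num_states  = 6
num_actions = 2

# Free to modify and re-run
terminal_left_reward  = 100
terminal_right_reward = 40
each_step_reward      = 0     # reward in the intermediate states
gamma        = 0.5            # discount factor
misstep_prob = 0              # probability of moving in the unintended direction

# After changing any value above, re-run the notebook's plotting cell.
# (Helper name assumed for illustration; the actual notebook imports its own utility.)
# generate_visualization(terminal_left_reward, terminal_right_reward,
#                        each_step_reward, gamma, misstep_prob)
```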
Visualization Output
The code computes and visualizes “the optimal policy as well as the Q function Q(s,a),” showing the values from the lecture examples.
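For readers without the notebook open, the small self-contained sketch below reproduces the same Q(s,a) table for the deterministic case (misstep probability 0); `compute_q_values` is an illustrative name, not the lab's own function:

```python
import numpy as np

def compute_q_values(terminal_left_reward=100, terminal_right_reward=40,
                     each_step_reward=0, gamma=0.5, num_states=6):
    """Q(s, a) for the deterministic 6-state Mars rover, found by simple value iteration."""
    rewards = np.full(num_states, float(each_step_reward))
    rewards[0]  = terminal_left_reward    # state 1 (index 0) is a terminal state
    rewards[-1] = terminal_right_reward   # state 6 (index 5) is a terminal state

    q = np.zeros((num_states, 2))         # columns: 0 = go left, 1 = go right
    q[0, :]  = rewards[0]                 # the return from a terminal state is just its reward
    q[-1, :] = rewards[-1]

    for _ in range(1000):                 # iterate until the table stops changing
        v = q.max(axis=1)                 # V(s) = max over actions of Q(s, a)
        new_q = q.copy()
        for s in range(1, num_states - 1):                # intermediate states only
            new_q[s, 0] = rewards[s] + gamma * v[s - 1]   # step left
            new_q[s, 1] = rewards[s] + gamma * v[s + 1]   # step right
        if np.allclose(new_q, q):
            break
        q = new_q
    return q

q = compute_q_values()
print(np.round(q, 2))       # row s-1 holds [Q(s, left), Q(s, right)]; state 5 shows [6.25, 20.0]
print(q.argmax(axis=1))     # 0 = left, 1 = right (entries for the terminal states are not meaningful)
```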
Parameter Modification Examples
Reducing Right Terminal Reward
Section titled “Reducing Right Terminal Reward”Change: Terminal right reward from 40 to 10
Effect on Q values:
- State 5: Q(left) = 6.25, Q(right) = 5
- Policy change: “Now that the reward at the right is so small, only 10, even when you’re this close to it you’d rather go left all the way”
- New optimal policy: “Go left from every single state”
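Reusing the `compute_q_values` sketch from above, this experiment is a one-argument change (illustrative code, not the lab's own cells):

```python
q = compute_q_values(terminal_right_reward=10)   # everything else left at its initial value
print(np.round(q, 2))       # the row for state 5 is now [6.25, 5.0]
print(q.argmax(axis=1))     # every intermediate state picks action 0, i.e. "go left"
```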
Increasing Discount Factor
Change: γ from 0.5 to 0.9
Effect: “Makes the Mars Rover less impatient: it is willing to take longer to hold out for a higher reward”
Reasoning:
- “Rewards in the future are not multiplied by 0.5 to some high power; they’re multiplied by 0.9 to some high power”
- “It’s willing to be more patient, because rewards in the future are not discounted, or multiplied, by as small a number”
Results:
- State 5: Q(left) = 65.61, Q(right) = 36
- Note: “36 is 0.9 times this terminal reward of 40”
- Policy: “When it’s more patient, it’s willing to go to the left, even when you’re in state 5”
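The state-5 numbers can be checked directly from the discounted-return definition; the two lines below are a hand check, not notebook output:

```python
gamma = 0.9
print(round(gamma**4 * 100, 2))   # Q(5, left)  = 65.61 -- four discounted steps back to the 100 reward
print(round(gamma**1 * 40, 2))    # Q(5, right) = 36.0  -- one discounted step to the 40 reward
```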
Decreasing Discount Factor
Change: γ to 0.3 (very small)
Effect: “Very heavily discounts rewards in the future. This makes it incredibly impatient.”
Policy Impact:
- State 4 behavior changes: “Not going to have the patience to go for the larger 100 reward, because the discount factor gamma is now so small”
- “It would rather go for the reward of 40; even though it’s a much smaller reward, it is closer”
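The lecture excerpt doesn’t quote the state-4 numbers, but the same discounted-return arithmetic (a hand check under the rewards of 100 on the left and 40 on the right) explains the flip:

```python
gamma = 0.3
print(round(gamma**3 * 100, 2))   # Q(4, left)  = 2.7 -- three discounted steps to the 100 reward
print(round(gamma**2 * 40, 2))    # Q(4, right) = 3.6 -- two discounted steps to the 40 reward
# Q(4, right) > Q(4, left), so the impatient rover heads right from state 4
```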
Key Learning Outcomes
Understanding Q Value Changes
By experimenting with different parameters, you can observe:
- How Q(s,a) values change with different reward structures
- How discount factor affects patience/impatience in decision making
- How optimal policy adapts to parameter changes
Return vs Policy Relationship
- Optimal return: for any state, the optimal return is the larger of the two values Q(s, left) and Q(s, right)
- Policy changes directly follow from Q value comparisons
- Lower discount factors favor immediate rewards over future gains
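Continuing the earlier sketch, both the optimal return and the optimal policy fall straight out of the Q table (illustrative code, not the lab's helper):

```python
import numpy as np

# q is the (num_states, 2) table returned by compute_q_values above
optimal_return = q.max(axis=1)                                   # best achievable return from each state
optimal_policy = np.where(q[:, 0] >= q[:, 1], "left", "right")   # action with the larger Q(s, a)
print(optimal_return)
print(optimal_policy)   # the first and last entries are terminal states with no meaningful action
```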
Recommended Exploration
- Change the reward function: Modify terminal rewards to see policy shifts
- Adjust the discount factor γ: Try different values to understand patience effects
- Observe Q(s,a) changes: Notice how values shift with parameters
- Analyze the optimal policy: See how action choices change based on Q values
Lab Benefits
“I hope that will sharpen your intuition about how these different quantities are affected depending on the rewards and so on in a reinforcement learning application.”
The interactive exploration helps build intuition for how reinforcement learning parameters affect both the computed Q values and the resulting optimal policies, preparing for understanding the Bellman equation in the next section.