The Return in Reinforcement Learning
Motivation: Comparing Reward Sequences
Practical Analogy:
- Five-dollar bill at your feet (immediate)
- Ten-dollar bill half an hour across town (delayed)
- “Which one would you rather go after?”
The concept of return captures that “rewards you can get quicker are maybe more attractive than rewards that take you a long time to get to.”
Return Definition
Basic Formula
The return is “the sum of these rewards but weighted by one additional factor, which is called the discount factor.”
General Return Formula: Return = R₁ + γR₂ + γ²R₃ + γ³R₄ + … (continuing until the terminal state is reached)
Where:
- R₁, R₂, R₃…: Rewards at each time step
- γ (Gamma): Discount factor (a number between 0 and 1, usually close to 1)
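To make the formula concrete, here is a minimal Python sketch (not from the original lecture; the name `discounted_return` is an illustrative choice) that computes the return for any reward sequence and discount factor:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by increasing powers of the discount factor.

    rewards[0] is R1 (weight 1), rewards[1] is R2 (weight gamma), and so on.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# R1 + gamma*R2 + gamma^2*R3 for a toy sequence:
print(discounted_return([1, 2, 3], gamma=0.9))  # 1 + 1.8 + 2.43 = 5.23
```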
Mars Rover Example with γ = 0.9
Left Path from State 4: 0 → 0 → 0 → 100
Return Calculation:
- Step 1: 0 (no discount)
- Step 2: 0.9 × 0 = 0
- Step 3: 0.9² × 0 = 0
- Step 4: 0.9³ × 100 = 0.729 × 100 = 72.9
Total Return: 72.9
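The same arithmetic, checked in a short sketch (assuming the reward sequence 0, 0, 0, 100 and γ = 0.9 from the example above):

```python
gamma = 0.9
rewards = [0, 0, 0, 100]                       # left path from state 4
terms = [(gamma ** t) * r for t, r in enumerate(rewards)]
print(terms)       # [0.0, 0.0, 0.0, 72.9...] -- only the final reward contributes
print(sum(terms))  # ≈ 72.9, matching 0.9³ × 100
```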
Discount Factor Effects
Impatience Mechanism
The discount factor “has the effect of making the reinforcement learning algorithm a little bit impatient.”
Credit Assignment:
- First reward: Full credit (1 × R₁)
- Second reward: Reduced credit (0.9 × R₂)
- Third reward: Further reduced credit (0.9² × R₃)
- “Getting rewards sooner results in a higher value for the total return”
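The impatience is simply the geometric decay of the credit weights; a small illustration (assuming γ = 0.9 as above):

```python
gamma = 0.9
for k in range(1, 5):
    print(f"reward #{k} is weighted by {gamma ** (k - 1):.3f}")
# reward #1 is weighted by 1.000
# reward #2 is weighted by 0.900
# reward #3 is weighted by 0.810
# reward #4 is weighted by 0.729
```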
Common Discount Factor Values
- Typical range: 0.9, 0.99, or 0.999 (close to 1)
- Example value: 0.5 (for illustration - “very heavily discounts rewards in the future”)
Practical Examples
Left Path with γ = 0.5
From State 4: 0 + 0.5×0 + 0.5²×0 + 0.5³×100 = 0.125×100 = 12.5
Returns for Different Starting States (Always Go Left)
- State 1: 100 (immediate reward, no discounting)
- State 2: 50 (one step to reward)
- State 3: 25 (two steps to reward)
- State 4: 12.5 (three steps to reward)
- State 5: 6.25 (four steps to reward)
- State 6: 40 (terminal state reward)
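These numbers can be reproduced with a self-contained sketch of the six-state rover (reward 100 at state 1, 40 at state 6, 0 elsewhere, as in the example; the function name is an illustrative choice):

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}   # rewards at terminal states 1 and 6
GAMMA = 0.5

def return_going_left(start):
    """Discounted return when the rover keeps moving left until it reaches state 1."""
    state, total, weight = start, 0.0, 1.0
    while True:
        total += weight * REWARDS[state]
        if state in (1, 6):          # terminal states: the episode ends here
            return total
        state -= 1                   # move one state to the left
        weight *= GAMMA              # each later reward is discounted once more

print([return_going_left(s) for s in range(1, 7)])
# [100.0, 50.0, 25.0, 12.5, 6.25, 40.0]
```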
Alternative Strategy: Always Go Right
From State 4: 0 + 0.5×0 + 0.5²×40 = 0.25×40 = 10
Returns for Different Starting States:
- State 1: 100 (terminal)
- State 2: 2.5
- State 3: 5
- State 4: 10
- State 5: 20
- State 6: 40 (terminal)
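Since the only reward on this path is the 40 at state 6, reached after (6 − s) steps from a non-terminal state s, a closed-form check is enough (a sketch assuming γ = 0.5):

```python
GAMMA = 0.5
# Heading right from non-terminal state s, the single reward of 40 arrives
# after (6 - s) steps, so the return is GAMMA ** (6 - s) * 40.
for s in range(2, 6):
    print(s, GAMMA ** (6 - s) * 40)
# 2 2.5
# 3 5.0
# 4 10.0
# 5 20.0
```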
Optimal Mixed Strategy
- States 2, 3, 4: Go left
- State 5: Go right (close to right reward)
Resulting Returns (states 1 through 6): 100, 50, 25, 12.5, 20, 40
State 5 Calculation: 0 + 0.5×40 = 20
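A short sketch of the mixed policy's returns (states 2–4 head left toward the 100 reward, state 5 heads right toward the 40 reward; γ = 0.5):

```python
GAMMA = 0.5
left = {s: GAMMA ** (s - 1) * 100 for s in (2, 3, 4)}   # (s - 1) steps left to state 1
right = {5: GAMMA * 40}                                  # one step right to state 6
print({**left, **right})
# {2: 50.0, 3: 25.0, 4: 12.5, 5: 20.0}
```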
Financial Applications
Time Value of Money
“In financial applications, the discount factor also has a very natural interpretation as the interest rate or the time value of money.”
Reasoning:
- Dollar today can be invested to earn interest
- Dollar today worth more than dollar in future
- Discount factor represents “how much less is a dollar in the future worth compared to a dollar today”
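One common way to make the connection precise (an assumption about the mapping, not something stated explicitly above) is to set γ = 1 / (1 + r) for an interest rate r:

```python
r = 0.10                      # a 10% annual interest rate (illustrative value)
gamma = 1 / (1 + r)           # discount factor implied by that rate
print(round(gamma, 3))        # 0.909: a dollar next year is worth about 91 cents today
print(round(gamma * 100, 1))  # 90.9: present value of $100 received one year from now
```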
Negative Rewards Handling
For systems with negative rewards:
- Discount factor “incentivizes the system to push out the negative rewards as far into the future as possible”
- Example: If you must pay $10 (negative reward -10), better to postpone payment
- “$10 a few years from now, because of the interest rate, is actually worth less than $10 that you had to pay today”
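A small numeric illustration (γ = 0.5 and a hypothetical −10 reward): the further the penalty is pushed into the future, the less it subtracts from the return.

```python
gamma = 0.5
for delay in range(4):
    print(delay, gamma ** delay * -10)
# 0 -10.0   pay now: the full penalty counts against the return
# 1 -5.0    pay one step later: only half counts
# 2 -2.5
# 3 -1.25
```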
Summary
The return provides a mathematical framework for comparing different sequences of rewards by weighing immediate rewards more heavily than future rewards. This creates natural impatience in reinforcement learning algorithms and aligns with financial principles of time value. The return depends directly on the actions taken, making it the key metric for evaluating and comparing different policies.