The Return in Reinforcement Learning
Motivation: Comparing Reward Sequences
Practical Analogy:
- Five-dollar bill at your feet (immediate)
- Ten-dollar bill half an hour across town (delayed)
- “Which one would you rather go after?”
The concept of return captures that “rewards you can get quicker are maybe more attractive than rewards that take you a long time to get to.”
Return Definition
Basic Formula
The return is “the sum of these rewards but weighted by one additional factor, which is called the discount factor.”
General Return Formula: Return = R₁ + γR₂ + γ²R₃ + γ³R₄ + … (continuing until the terminal state is reached)
Where:
- R₁, R₂, R₃…: Rewards at each time step
- γ (Gamma): Discount factor (a number between 0 and 1, usually close to 1)
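To make the formula concrete, here is a minimal Python sketch (not from the original lecture; the name `discounted_return` is an illustrative choice) that computes the return for any reward sequence and discount factor:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by increasing powers of the discount factor.

    rewards[0] is R1 (weight 1), rewards[1] is R2 (weight gamma), and so on.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# R1 + gamma*R2 + gamma^2*R3 for a toy sequence:
print(discounted_return([1, 2, 3], gamma=0.9))  # 1 + 1.8 + 2.43 = 5.23
```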
Mars Rover Example with γ = 0.9
Left Path from State 4: 0 → 0 → 0 → 100
Return Calculation:
- Step 1: 0 (no discount)
- Step 2: 0.9 × 0 = 0
- Step 3: 0.9² × 0 = 0
- Step 4: 0.9³ × 100 = 0.729 × 100 = 72.9
Total Return: 72.9
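The same arithmetic, checked in a short sketch (assuming the reward sequence 0, 0, 0, 100 and γ = 0.9 from the example above):

```python
gamma = 0.9
rewards = [0, 0, 0, 100]                       # left path from state 4
terms = [(gamma ** t) * r for t, r in enumerate(rewards)]
print(terms)       # [0.0, 0.0, 0.0, 72.9...] -- only the final reward contributes
print(sum(terms))  # ≈ 72.9, matching 0.9³ × 100
```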
Discount Factor Effects
Impatience Mechanism
The discount factor “has the effect of making the reinforcement learning algorithm a little bit impatient.”
Credit Assignment:
- First reward: Full credit (1 × R₁)
- Second reward: Reduced credit (0.9 × R₂)
- Third reward: Further reduced credit (0.9² × R₃)
- “Getting rewards sooner results in a higher value for the total return”
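The impatience is simply the geometric decay of the credit weights; a small illustration (assuming γ = 0.9 as above):

```python
gamma = 0.9
for k in range(1, 5):
    print(f"reward #{k} is weighted by {gamma ** (k - 1):.3f}")
# reward #1 is weighted by 1.000
# reward #2 is weighted by 0.900
# reward #3 is weighted by 0.810
# reward #4 is weighted by 0.729
```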
Common Discount Factor Values
- Typical range: 0.9, 0.99, or 0.999 (close to 1)
- Example value: 0.5 (for illustration - “very heavily discounts rewards in the future”)
Practical Examples
Left Path with γ = 0.5
From State 4: 0 + 0.5×0 + 0.5²×0 + 0.5³×100 = 0.125×100 = 12.5
Returns for Different Starting States (Always Go Left)
- State 1: 100 (immediate reward, no discounting)
- State 2: 50 (one step to reward)
- State 3: 25 (two steps to reward)
- State 4: 12.5 (three steps to reward)
- State 5: 6.25 (four steps to reward)
- State 6: 40 (terminal state reward)
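These numbers can be reproduced with a self-contained sketch of the six-state rover (reward 100 at state 1, 40 at state 6, 0 elsewhere, as in the example; the function name is an illustrative choice):

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}   # rewards at terminal states 1 and 6
GAMMA = 0.5

def return_going_left(start):
    """Discounted return when the rover keeps moving left until it reaches state 1."""
    state, total, weight = start, 0.0, 1.0
    while True:
        total += weight * REWARDS[state]
        if state in (1, 6):          # terminal states: the episode ends here
            return total
        state -= 1                   # move one state to the left
        weight *= GAMMA              # each later reward is discounted once more

print([return_going_left(s) for s in range(1, 7)])
# [100.0, 50.0, 25.0, 12.5, 6.25, 40.0]
```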
Alternative Strategy: Always Go Right
From State 4: 0 + 0.5×0 + 0.5²×40 = 0.25×40 = 10
Returns for Different Starting States:
- State 1: 100 (terminal)
- State 2: 2.5
- State 3: 5
- State 4: 10
- State 5: 20
- State 6: 40 (terminal)
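Since the only reward on this path is the 40 at state 6, reached after (6 − s) steps from a non-terminal state s, a closed-form check is enough (a sketch assuming γ = 0.5):

```python
GAMMA = 0.5
# Heading right from non-terminal state s, the single reward of 40 arrives
# after (6 - s) steps, so the return is GAMMA ** (6 - s) * 40.
for s in range(2, 6):
    print(s, GAMMA ** (6 - s) * 40)
# 2 2.5
# 3 5.0
# 4 10.0
# 5 20.0
```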
Optimal Mixed Strategy
- States 2, 3, 4: Go left
- State 5: Go right (close to right reward)
Resulting Returns (states 1 through 6): 100, 50, 25, 12.5, 20, 40
State 5 Calculation: 0 + 0.5×40 = 20
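A short sketch of the mixed policy's returns (states 2–4 head left toward the 100 reward, state 5 heads right toward the 40 reward; γ = 0.5):

```python
GAMMA = 0.5
left = {s: GAMMA ** (s - 1) * 100 for s in (2, 3, 4)}   # (s - 1) steps left to state 1
right = {5: GAMMA * 40}                                  # one step right to state 6
print({**left, **right})
# {2: 50.0, 3: 25.0, 4: 12.5, 5: 20.0}
```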
Financial Applications
Time Value of Money
“In financial applications, the discount factor also has a very natural interpretation as the interest rate or the time value of money.”
Reasoning:
- Dollar today can be invested to earn interest
- Dollar today worth more than dollar in future
- Discount factor represents “how much less is a dollar in the future worth compared to a dollar today”
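One common way to make the connection precise (an assumption about the mapping, not something stated explicitly above) is to set γ = 1 / (1 + r) for an interest rate r:

```python
r = 0.10                      # a 10% annual interest rate (illustrative value)
gamma = 1 / (1 + r)           # discount factor implied by that rate
print(round(gamma, 3))        # 0.909: a dollar next year is worth about 91 cents today
print(round(gamma * 100, 1))  # 90.9: present value of $100 received one year from now
```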
Negative Rewards Handling
For systems with negative rewards:
- Discount factor “incentivizes the system to push out the negative rewards as far into the future as possible”
- Example: If you must pay $10 (negative reward -10), better to postpone payment
- “$10 a few years from now, because of the interest rate, is actually worth less than $10 that you had to pay today”
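A small numeric illustration (γ = 0.5 and a hypothetical −10 reward): the further the penalty is pushed into the future, the less it subtracts from the return.

```python
gamma = 0.5
for delay in range(4):
    print(delay, gamma ** delay * -10)
# 0 -10.0   pay now: the full penalty counts against the return
# 1 -5.0    pay one step later: only half counts
# 2 -2.5
# 3 -1.25
```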
Summary
The return provides a mathematical framework for comparing different sequences of rewards by weighing immediate rewards more heavily than future rewards. This creates natural impatience in reinforcement learning algorithms and aligns with financial principles of time value. The return depends directly on the actions taken, making it the key metric for evaluating and comparing different policies.