Pablo Rodriguez

Return in Reinforcement Learning

Practical Analogy:

  • Five-dollar bill at your feet (immediate)
  • Ten-dollar bill a half-hour trip across town (delayed)
  • “Which one would you rather go after?”

The concept of the return captures the idea that “rewards you can get quicker are maybe more attractive than rewards that take you a long time to get to.”

The return is “the sum of these rewards but weighted by one additional factor, which is called the discount factor.”

General Return Formula: Return = R₁ + γR₂ + γ²R₃ + γ³R₄ + …

Where:

  • R₁, R₂, R₃…: Rewards at each time step
  • γ (Gamma): Discount factor (number less than 1)
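As a sanity check, the formula translates directly into a few lines of Python (a minimal sketch; the `discounted_return` name and the list-of-rewards representation are illustrative assumptions, not from the lecture):

```python
def discounted_return(rewards, gamma):
    # R1 + gamma*R2 + gamma**2 * R3 + ...; enumerate supplies t = 0, 1, 2, ...
    return sum(gamma**t * r for t, r in enumerate(rewards))
```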

Return Calculation for the Left Path from State 4 (rewards 0 → 0 → 0 → 100, γ = 0.9):

  • Step 1: 0 (no discount)
  • Step 2: 0.9 × 0 = 0
  • Step 3: 0.9² × 0 = 0
  • Step 4: 0.9³ × 100 = 0.729 × 100 = 72.9

Total Return: 72.9
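Plugging the same path into the `discounted_return` sketch above reproduces this number (the usage below is illustrative):

```python
rewards_left = [0, 0, 0, 100]                # rewards along the left path from state 4
print(discounted_return(rewards_left, 0.9))  # ≈ 72.9 (only the 0.9**3 * 100 term is nonzero)
```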

The discount factor “has the effect of making the reinforcement learning algorithm a little bit impatient.”

Credit Assignment (with γ = 0.9):

  • First reward: Full credit (1 × R₁)
  • Second reward: Reduced credit (0.9 × R₂)
  • Third reward: Further reduced credit (0.9² × R₃)
  • “Getting rewards sooner results in a higher value for the total return”

Common Discount Factor Values:

  • Typical range: 0.9, 0.99, or 0.999 (close to 1)
  • Example value: 0.5 (used here for illustration; it “very heavily discounts rewards in the future”)

From State 4 (going left, γ = 0.5): 0 + 0.5×0 + 0.5²×0 + 0.5³×100 = 0.125×100 = 12.5
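Comparing the two discount factors side by side with the same sketch makes the “impatience” concrete:

```python
for gamma in (0.9, 0.5):
    print(gamma, discounted_return([0, 0, 0, 100], gamma))  # ≈ 72.9, then 12.5
```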

Returns for Different Starting States (Always Go Left)

  • State 1: 100 (immediate reward, no discounting)
  • State 2: 50 (one step to reward)
  • State 3: 25 (two steps to reward)
  • State 4: 12.5 (three steps to reward)
  • State 5: 6.25 (four steps to reward)
  • State 6: 40 (terminal state reward)
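These always-go-left values can be reproduced by simulating the six-state rover (a sketch assuming γ = 0.5 and the terminal rewards 100 and 40 from the example; the helper names are illustrative):

```python
GAMMA = 0.5
TERMINAL_REWARD = {1: 100, 6: 40}    # terminal states; every other state pays 0

def return_always_left(start):
    if start in TERMINAL_REWARD:     # starting on a terminal state just collects its reward
        return TERMINAL_REWARD[start]
    steps = start - 1                # steps needed to walk left to state 1
    return GAMMA**steps * 100

for s in range(1, 7):
    print(s, return_always_left(s))  # 100, 50.0, 25.0, 12.5, 6.25, 40
```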

Returns for Different Starting States (Always Go Right)

From State 4 (going right): 0 + 0.5×0 + 0.5²×40 = 0.25×40 = 10

  • State 1: 100 (terminal state reward)
  • State 2: 2.5
  • State 3: 5
  • State 4: 10
  • State 5: 20
  • State 6: 40 (terminal state reward)

Mixed Policy:

  • States 2, 3, 4: Go left
  • State 5: Go right (close to the right-side reward)

Resulting Returns: 100, 50, 25, 12.5, 20, 40

State 5 Calculation (going right): 0 + 0.5×40 = 20
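The mixed policy’s returns can be checked the same way with a small rollout (a sketch; the POLICY dictionary and the reward-timing convention follow the examples above, but the code itself is an illustrative assumption):

```python
GAMMA = 0.5
TERMINAL_REWARD = {1: 100, 6: 40}
POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}

def rollout_return(state):
    # Reward of the current state enters undiscounted; each move adds a factor of GAMMA.
    ret, discount = 0.0, 1.0
    while True:
        ret += discount * TERMINAL_REWARD.get(state, 0)  # non-terminal states pay 0
        if state in TERMINAL_REWARD:
            return ret
        state += -1 if POLICY[state] == "left" else 1
        discount *= GAMMA

for s in range(1, 7):
    print(s, rollout_return(s))  # 100.0, 50.0, 25.0, 12.5, 20.0, 40.0
```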

“In financial applications, the discount factor also has a very natural interpretation as the interest rate or the time value of money.”

Reasoning:

  • Dollar today can be invested to earn interest
  • Dollar today worth more than dollar in future
  • Discount factor represents “how much less is a dollar in the future worth compared to a dollar today”
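One way to make the analogy concrete: with a per-period interest rate r, a standard finance identity puts the discount factor at roughly γ = 1/(1 + r). That mapping is an assumption for illustration; the lecture only draws the qualitative parallel:

```python
r = 0.05             # hypothetical 5% interest rate per period
gamma = 1 / (1 + r)  # ≈ 0.952: value today of one dollar one period out
for t in range(4):
    print(t, round(gamma**t, 4))  # 1.0, 0.9524, 0.907, 0.8638
```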

For systems with negative rewards:

  • Discount factor “incentivizes the system to push out the negative rewards as far into the future as possible”
  • Example: If you must pay $10 (negative reward -10), better to postpone payment
  • “$10 a few years from now, because of the interest rate, is actually worth less than $10 that you had to pay today”
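A quick numeric check of the postponement effect (illustrative γ = 0.9; the -10 payment is from the example above):

```python
gamma = 0.9
pay_now = -10.0                # the -10 enters the return undiscounted
pay_later = gamma**3 * -10.0   # the same payment pushed three steps into the future
print(pay_now, round(pay_later, 2))  # -10.0 vs -7.29: postponing yields a larger return
```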

The return provides a mathematical framework for comparing different sequences of rewards by weighing immediate rewards more heavily than future rewards. This creates natural impatience in reinforcement learning algorithms and aligns with financial principles of time value. The return depends directly on the actions taken, making it the key metric for evaluating and comparing different policies.