
Bellman Equation

Challenge: “How do you compute these values Q(s,a)?”

Solution: “In reinforcement learning, there’s a key equation called the Bellman equation that will help us to compute the state action value function.”

  • S: Current state
  • R(S): Reward of current state (e.g., R(1) = 100, R(2) = 0, R(6) = 40)
  • A: Current action taken in state S
  • S’: State reached after taking action A from state S
  • A’: Action that might be taken in new state S’

Q(S,A) = R(S) + γ × max_{A’} Q(S’, A’)

Where:

  • R(S): Immediate reward from current state
  • γ: Discount factor
  • max_{A’} Q(S’, A’): Maximum Q value over all possible actions in next state S’
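As a concrete illustration, here is a minimal Python sketch of this update for a single (s, a) pair. The names (rewards, q_values, next_state) are illustrative placeholders, not from any particular library, and they assume the Q values for the next state are already known:

```python
# Minimal sketch of the Bellman equation for one (s, a) pair.
# rewards: dict mapping state -> R(s)
# q_values: dict mapping (state, action) -> Q(state, action) for the next state
# next_state: function giving the state reached by taking action a in state s

def bellman_q(s, a, rewards, q_values, next_state, gamma=0.5):
    """Return Q(s, a) = R(s) + gamma * max_{a'} Q(s', a')."""
    s_prime = next_state(s, a)
    best_future = max(q_values[(s_prime, a_prime)] for a_prime in ("left", "right"))
    return rewards[s] + gamma * best_future
```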

Setup:

  • Current state: S = 2
  • Action: A = right
  • Next state: S’ = 3

Calculation: Q(2, right) = R(2) + γ × max{Q(3, left), Q(3, right)} = 0 + 0.5 × max{25, 6.25} = 0 + 0.5 × 25 = 12.5

Setup:

  • Current state: S = 4
  • Action: A = left
  • Next state: S’ = 3

Calculation: Q(4, left) = R(4) + γ × max{Q(3, left), Q(3, right)} = 0 + 0.5 × max{25, 6.25} = 0 + 0.5 × 25 = 12.5
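Reusing the bellman_q sketch above, both worked examples can be checked numerically. The values Q(3, left) = 25 and Q(3, right) = 6.25 are taken as given from the example, and the left/right transition rule is assumed to move one state down or up:

```python
# Reproducing the two worked examples with the bellman_q sketch above.
rewards = {2: 0, 3: 0, 4: 0}                      # non-terminal states have reward 0
q_values = {(3, "left"): 25, (3, "right"): 6.25}  # Q values for state 3, as given

def next_state(s, a):
    # In this 6-state example, "left" moves to s - 1 and "right" to s + 1.
    return s - 1 if a == "left" else s + 1

print(bellman_q(2, "right", rewards, q_values, next_state))  # 0 + 0.5 * 25 = 12.5
print(bellman_q(4, "left",  rewards, q_values, next_state))  # 0 + 0.5 * 25 = 12.5
```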

“If you’re in a terminal state, then Bellman Equation simplifies to Q(s,a) equals to R(s) because there’s no state S’ and so that second term would go away.”

Result: Q(terminal states) = immediate reward (100 or 40)
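A tiny sketch of the terminal case, assuming (as in this example) that states 1 and 6 are the terminal states:

```python
# Terminal states: the future term drops out, so Q(s, a) = R(s) for any action a.
terminal_rewards = {1: 100, 6: 40}
q_terminal = {(s, a): r for s, r in terminal_rewards.items() for a in ("left", "right")}
# q_terminal[(1, "left")] == q_terminal[(1, "right")] == 100
# q_terminal[(6, "left")] == q_terminal[(6, "right")] == 40
```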

The Bellman equation breaks down total return into two components:

  1. Immediate reward R(S): “the reward you get right away”
  2. Future return: γ × (best possible return from next state S’)

Total Return Breakdown:

  • Original: R₁ + γR₂ + γ²R₃ + γ³R₄ + …
  • Factored: R₁ + γ(R₂ + γR₃ + γ²R₄ + …)

The second part (R₂ + γR₃ + γ²R₄ + …) represents “the total return if you were to start from state S’”, which, when you act optimally from S’, equals max_{A’} Q(S’, A’).

Direct calculation for Q(4, left): 0 + 0.5×0 + 0.5²×0 + 0.5³×100 = 12.5

Bellman decomposition:

  • Immediate: R(4) = 0
  • Future: 0.5 × (0 + 0.5×0 + 0.5²×100) = 0.5 × 25 = 12.5
  • Total: 0 + 12.5 = 12.5
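A quick numerical check (a sketch, assuming the trajectory starts in state 4 and always moves left toward the 100 reward) confirms that the direct sum and the Bellman decomposition agree:

```python
# Direct return vs. Bellman decomposition for the path 4 -> 3 -> 2 -> 1 (always left).
gamma = 0.5
path_rewards = [0, 0, 0, 100]                # R(4), R(3), R(2), R(1)

# Direct: R1 + gamma*R2 + gamma^2*R3 + gamma^3*R4
direct = sum(gamma**t * r for t, r in enumerate(path_rewards))

# Decomposed: R(S) + gamma * (return starting from the next state S')
return_from_s_prime = sum(gamma**t * r for t, r in enumerate(path_rewards[1:]))
decomposed = path_rewards[0] + gamma * return_from_s_prime

print(direct, decomposed)                    # both print 12.5
```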

“The total return you get in the reinforcement learning problem has two parts. The first part is this reward that you get right away, and then the second part is Gamma times the return you get starting from the next state S’.”

“These two components together, R of S plus Gamma times the return from the next state, that is equal to the total return from the current state S. That is the essence of the Bellman equation.”

“In case you think this is quite complicated and you aren’t following all the details, don’t worry about it. So long as you apply this equation, you will manage to get the right results.”

The Bellman equation provides a recursive relationship that allows computing Q(s,a) values systematically by breaking down the total expected return into immediate rewards plus discounted future optimal returns.

The equation forms the foundation for many reinforcement learning algorithms by providing a way to compute optimal action values even before knowing the complete optimal policy.
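To make the recursion concrete, here is a short sketch that repeatedly applies the Bellman equation over all state-action pairs until the values stop changing (a form of value iteration). The 6-state layout, rewards, terminal states, and γ = 0.5 are taken from the example above; the iteration scheme itself is a standard way to use the equation, not necessarily the exact procedure from the lecture:

```python
# Repeatedly apply the Bellman equation until the Q values converge.
rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
terminal = {1, 6}
actions = ("left", "right")
gamma = 0.5

Q = {(s, a): 0.0 for s in rewards for a in actions}

for _ in range(100):                                   # plenty of sweeps to converge
    for s in rewards:
        for a in actions:
            if s in terminal:
                Q[(s, a)] = rewards[s]                 # terminal: Q(s, a) = R(s)
            else:
                s_prime = s - 1 if a == "left" else s + 1
                Q[(s, a)] = rewards[s] + gamma * max(Q[(s_prime, ap)] for ap in actions)

print(Q[(3, "left")], Q[(3, "right")])                 # 25.0, 6.25
print(Q[(2, "right")], Q[(4, "left")])                 # 12.5, 12.5
```

Running this sketch reproduces the values used throughout the section: Q(3, left) = 25, Q(3, right) = 6.25, Q(2, right) = 12.5, and Q(4, left) = 12.5.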