Bellman Equation
Problem Setup
Challenge: “How do you compute these values Q(s,a)?”
Solution: “In reinforcement learning, there’s a key equation called the Bellman equation that will help us to compute the state action value function.”
Notation System
Current State Variables
- S: Current state
- R(S): Reward of current state (e.g., R(1) = 100, R(2) = 0, R(6) = 40)
- A: Current action taken in state S
Next State Variables
- S’: State reached after taking action A from state S
- A’: Action that might be taken in new state S’
Bellman Equation Formula
Mathematical Expression
Q(S,A) = R(S) + γ × max_{A’} Q(S’, A’)
Where:
- R(S): Immediate reward from current state
- γ: Discount factor
- max_{A’} Q(S’, A’): Maximum Q value over all possible actions in next state S’
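This can be written as a short function. The sketch below assumes the six-state, deterministic left/right environment implied by the rewards above (R(1) = 100, R(6) = 40, zero elsewhere) and γ = 0.5; the state layout and transition rule are assumptions for illustration, not given in this section.

```python
# Minimal sketch of the Bellman right-hand side for an assumed six-state,
# deterministic left/right environment with discount factor gamma = 0.5.
GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # R(s), assumed layout
ACTIONS = ("left", "right")

def next_state(s, a):
    """Deterministic transition: 'left' decrements the state, 'right' increments it."""
    return s - 1 if a == "left" else s + 1

def bellman_q(s, a, q):
    """Evaluate R(s) + gamma * max over a' of Q(s', a') for a non-terminal state s."""
    s_next = next_state(s, a)
    return REWARDS[s] + GAMMA * max(q[(s_next, a_next)] for a_next in ACTIONS)
```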
Detailed Examples
Example 1: Q(State 2, Right)
Setup:
- Current state: S = 2
- Action: A = right
- Next state: S’ = 3
Calculation: Q(2, right) = R(2) + γ × max{Q(3, left), Q(3, right)} = 0 + 0.5 × max{25, 6.25} = 0 + 0.5 × 25 = 12.5
Example 2: Q(State 4, Left)
Setup:
- Current state: S = 4
- Action: A = left
- Next state: S’ = 3
Calculation: Q(4, left) = R(4) + γ × max{Q(3, left), Q(3, right)} = 0 + 0.5 × max{25, 6.25} = 0 + 0.5 × 25 = 12.5
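Plugging the state-3 values quoted above (Q(3, left) = 25, Q(3, right) = 6.25) into the bellman_q sketch reproduces both hand calculations:

```python
# Q values for state 3 are taken from the examples above; both transitions land in state 3.
q = {(3, "left"): 25.0, (3, "right"): 6.25}

print(bellman_q(2, "right", q))  # 0 + 0.5 * max(25, 6.25) = 12.5
print(bellman_q(4, "left", q))   # 0 + 0.5 * max(25, 6.25) = 12.5
```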
Terminal States
“If you’re in a terminal state, then Bellman Equation simplifies to Q(s,a) equals to R(s) because there’s no state S’ and so that second term would go away.”
Result: Q(terminal states) = immediate reward (100 or 40)
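One way to fold this case into the earlier sketch, assuming states 1 and 6 are the terminal states (as the rewards 100 and 40 suggest):

```python
TERMINAL_STATES = {1, 6}  # assumed terminal states

def bellman_q_full(s, a, q):
    """Q(s,a) = R(s) at a terminal state; otherwise the usual Bellman right-hand side."""
    if s in TERMINAL_STATES:
        return REWARDS[s]  # no next state S', so the discounted-future term drops out
    return bellman_q(s, a, q)
```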
Intuitive Understanding
Two-Part Decomposition
The Bellman equation breaks the total return into two components:
- Immediate reward R(S): “the reward you get right away”
- Future return: γ × (best possible return from the next state S’)
Mathematical Intuition
Total Return Breakdown:
- Original: R₁ + γR₂ + γ²R₃ + γ³R₄ + …
- Factored: R₁ + γ(R₂ + γR₃ + γ²R₄ + …)
The second part (R₂ + γR₃ + γ²R₄ + …) is “the total return if you were to start from state S’”; if the agent behaves optimally from S’ onward, that return equals max_{A’} Q(S’, A’).
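The factoring itself is easy to check numerically; any reward sequence works, since pulling γ out of every term after the first leaves the total unchanged:

```python
def total_return(rewards, gamma=0.5):
    """Discounted return R1 + gamma*R2 + gamma^2*R3 + ... for a list of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0, 4.0]  # arbitrary illustrative sequence
lhs = total_return(rewards)                          # R1 + gamma*R2 + gamma^2*R3 + ...
rhs = rewards[0] + 0.5 * total_return(rewards[1:])   # R1 + gamma*(R2 + gamma*R3 + ...)
print(lhs == rhs)  # True: both forms give the same total return
```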
Concrete Example: Q(4, left)
Direct calculation: 0 + 0.5×0 + 0.5²×0 + 0.5³×100 = 12.5
Bellman decomposition:
- Immediate: R(4) = 0
- Future: 0.5 × (0 + 0.5×0 + 0.5²×100) = 0.5 × 25 = 12.5
- Total: 0 + 12.5 = 12.5
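The same arithmetic, mirroring the two parts above:

```python
gamma = 0.5

# Direct calculation: rewards 0, 0, 0, 100 seen from state 4 when moving left
direct = 0 + gamma * 0 + gamma**2 * 0 + gamma**3 * 100

# Bellman decomposition: immediate reward plus gamma times the best return from state 3
return_from_state_3 = 0 + gamma * 0 + gamma**2 * 100   # = 25, i.e. max_{A'} Q(3, A')
decomposed = 0 + gamma * return_from_state_3

print(direct, decomposed)  # 12.5 12.5
```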
High-Level Insight
Core Principle
“The total return you get in the reinforcement learning problem has two parts. The first part is this reward that you get right away, and then the second part is Gamma times the return you get starting from the next state S’.”
Practical Application
“These two components together, R of S plus Gamma times the return from the next state, that is equal to the total return from the current state S. That is the essence of the Bellman equation.”
Usage Notes
Complexity Acknowledgment
“In case you think this is quite complicated and you aren’t following all the details, don’t worry about it. So long as you apply this equation, you will manage to get the right results.”
Key Takeaway
The Bellman equation provides a recursive relationship that allows computing Q(s,a) values systematically by breaking down the total expected return into immediate rewards plus discounted future optimal returns.
The equation forms the foundation for many reinforcement learning algorithms by providing a way to compute optimal action values even before knowing the complete optimal policy.
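As one illustration of that point, the sketch below repeatedly applies the Bellman update from the earlier snippets (a value-iteration-style loop, added here for illustration rather than something covered in this section) until the Q table stops changing; it recovers the values used in the examples above.

```python
# Reuses REWARDS, ACTIONS, and bellman_q_full from the sketches above.
q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}

for _ in range(50):  # far more sweeps than this tiny problem needs
    q = {(s, a): bellman_q_full(s, a, q) for (s, a) in q}

print(q[(3, "left")], q[(3, "right")], q[(4, "left")])  # 25.0 6.25 12.5
```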