Bellman Equation
Problem Setup
Challenge: “How do you compute these values Q(s,a)?”
Solution: “In reinforcement learning, there’s a key equation called the Bellman equation that will help us to compute the state action value function.”
Notation System
Current State Variables
- S: Current state
- R(S): Reward of current state (e.g., R(1) = 100, R(2) = 0, R(6) = 40)
- A: Current action taken in state S
Next State Variables
- S’: State reached after taking action A from state S
- A’: Action that might be taken in new state S’
Bellman Equation Formula
Mathematical Expression
Q(S,A) = R(S) + γ × max_{A’} Q(S’, A’)
Where:
- R(S): Immediate reward from current state
- γ: Discount factor
- max_{A’} Q(S’, A’): Maximum Q value over all possible actions in next state S’
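This can be written as a short function. The sketch below assumes the six-state, deterministic left/right environment implied by the rewards above (R(1) = 100, R(6) = 40, zero elsewhere) and γ = 0.5; the state layout and transition rule are assumptions for illustration, not given in this section.

```python
# Minimal sketch of the Bellman right-hand side for an assumed six-state,
# deterministic left/right environment with discount factor gamma = 0.5.
GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # R(s), assumed layout
ACTIONS = ("left", "right")

def next_state(s, a):
    """Deterministic transition: 'left' decrements the state, 'right' increments it."""
    return s - 1 if a == "left" else s + 1

def bellman_q(s, a, q):
    """Evaluate R(s) + gamma * max over a' of Q(s', a') for a non-terminal state s."""
    s_next = next_state(s, a)
    return REWARDS[s] + GAMMA * max(q[(s_next, a_next)] for a_next in ACTIONS)
```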
Detailed Examples
Example 1: Q(State 2, Right)
Setup:
- Current state: S = 2
- Action: A = right
- Next state: S’ = 3
Calculation: Q(2, right) = R(2) + γ × max{Q(3, left), Q(3, right)} = 0 + 0.5 × max{25, 6.25} = 0 + 0.5 × 25 = 12.5
Example 2: Q(State 4, Left)
Setup:
- Current state: S = 4
- Action: A = left
- Next state: S’ = 3
Calculation: Q(4, left) = R(4) + γ × max{Q(3, left), Q(3, right)} = 0 + 0.5 × max{25, 6.25} = 0 + 0.5 × 25 = 12.5
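Plugging the state-3 values quoted above (Q(3, left) = 25, Q(3, right) = 6.25) into the bellman_q sketch reproduces both hand calculations:

```python
# Q values for state 3 are taken from the examples above; both transitions land in state 3.
q = {(3, "left"): 25.0, (3, "right"): 6.25}

print(bellman_q(2, "right", q))  # 0 + 0.5 * max(25, 6.25) = 12.5
print(bellman_q(4, "left", q))   # 0 + 0.5 * max(25, 6.25) = 12.5
```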
Terminal States
“If you’re in a terminal state, then Bellman Equation simplifies to Q(s,a) equals to R(s) because there’s no state S’ and so that second term would go away.”
Result: Q(terminal states) = immediate reward (100 or 40)
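One way to fold this case into the earlier sketch, assuming states 1 and 6 are the terminal states (as the rewards 100 and 40 suggest):

```python
TERMINAL_STATES = {1, 6}  # assumed terminal states

def bellman_q_full(s, a, q):
    """Q(s,a) = R(s) at a terminal state; otherwise the usual Bellman right-hand side."""
    if s in TERMINAL_STATES:
        return REWARDS[s]  # no next state S', so the discounted-future term drops out
    return bellman_q(s, a, q)
```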
Intuitive Understanding
Two-Part Decomposition
The Bellman equation breaks the total return into two components:
- Immediate reward R(S): “the reward you get right away”
- Future return: γ × (best possible return from the next state S’)
Mathematical Intuition
Total Return Breakdown:
- Original: R₁ + γR₂ + γ²R₃ + γ³R₄ + …
- Factored: R₁ + γ(R₂ + γR₃ + γ²R₄ + …)
The second part (R₂ + γR₃ + γ²R₄ + …) is “the total return if you were to start from state S’”; if the agent behaves optimally from S’ onward, that return equals max_{A’} Q(S’, A’).
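The factoring itself is easy to check numerically; any reward sequence works, since pulling γ out of every term after the first leaves the total unchanged:

```python
def total_return(rewards, gamma=0.5):
    """Discounted return R1 + gamma*R2 + gamma^2*R3 + ... for a list of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0, 4.0]  # arbitrary illustrative sequence
lhs = total_return(rewards)                          # R1 + gamma*R2 + gamma^2*R3 + ...
rhs = rewards[0] + 0.5 * total_return(rewards[1:])   # R1 + gamma*(R2 + gamma*R3 + ...)
print(lhs == rhs)  # True: both forms give the same total return
```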
Concrete Example: Q(4, left)
Direct calculation: 0 + 0.5×0 + 0.5²×0 + 0.5³×100 = 12.5
Bellman decomposition:
- Immediate: R(4) = 0
- Future: 0.5 × (0 + 0.5×0 + 0.5²×100) = 0.5 × 25 = 12.5
- Total: 0 + 12.5 = 12.5
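The same arithmetic, mirroring the two parts above:

```python
gamma = 0.5

# Direct calculation: rewards 0, 0, 0, 100 seen from state 4 when moving left
direct = 0 + gamma * 0 + gamma**2 * 0 + gamma**3 * 100

# Bellman decomposition: immediate reward plus gamma times the best return from state 3
return_from_state_3 = 0 + gamma * 0 + gamma**2 * 100   # = 25, i.e. max_{A'} Q(3, A')
decomposed = 0 + gamma * return_from_state_3

print(direct, decomposed)  # 12.5 12.5
```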
High-Level Insight
Core Principle
“The total return you get in the reinforcement learning problem has two parts. The first part is this reward that you get right away, and then the second part is Gamma times the return you get starting from the next state S’.”
Practical Application
“These two components together, R of S plus Gamma times the return from the next state, that is equal to the total return from the current state S. That is the essence of the Bellman equation.”
Usage Notes
Complexity Acknowledgment
“In case you think this is quite complicated and you aren’t following all the details, don’t worry about it. So long as you apply this equation, you will manage to get the right results.”
Key Takeaway
The Bellman equation provides a recursive relationship that allows computing Q(s,a) values systematically by breaking down the total expected return into immediate rewards plus discounted future optimal returns.
The equation forms the foundation for many reinforcement learning algorithms by providing a way to compute optimal action values even before knowing the complete optimal policy.
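As one illustration of that point, the sketch below repeatedly applies the Bellman update from the earlier snippets (a value-iteration-style loop, added here for illustration rather than something covered in this section) until the Q table stops changing; it recovers the values used in the examples above.

```python
# Reuses REWARDS, ACTIONS, and bellman_q_full from the sketches above.
q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}

for _ in range(50):  # far more sweeps than this tiny problem needs
    q = {(s, a): bellman_q_full(s, a, q) for (s, a) in q}

print(q[(3, "left")], q[(3, "right")], q[(4, "left")])  # 25.0 6.25 12.5
```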