Random Environment
Random (Stochastic) Environment (Optional)
Motivation for Stochastic Environments
Real-World Challenges
“In some applications, when you take an action, the outcome is not always completely reliable.”
Mars Rover Examples:
- Rock slide affecting movement
- Slippery floor causing rover to slip
- Wind blowing robot off course
- Wheel slipping or mechanical issues
General Principle: “Many robots don’t always manage to do exactly what you tell them because of wind blowing them off course, a wheel slipping, or something else.”
Stochastic Mars Rover Model
Action Reliability
When commanding the rover to go left:
- 90% success rate (0.9): Correctly goes left
- 10% failure rate (0.1): Accidentally goes right (opposite direction)
Example from State 4:
- Command: Go left
- 0.9 probability: End up in State 3
- 0.1 probability: End up in State 5
Symmetric Failure Model
When commanding right:
- 0.9 probability: Correctly end up in State 5
- 0.1 probability: Accidentally end up in State 3
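
To make this failure model concrete, here is a minimal sketch of the stochastic transition model in Python. The 90/10 split and the six-state layout (states 1 and 6 terminal) come from the rover example above; the function name next_state_distribution is just an illustrative choice, not code from the lab.

```python
NUM_STATES = 6          # rover states 1..6; states 1 and 6 are terminal
MISSTEP_PROB = 0.1      # probability the rover moves opposite to the command

def next_state_distribution(state, action):
    """Map a commanded action to {next_state: probability}.

    action: -1 for left, +1 for right. Terminal states absorb.
    """
    if state in (1, NUM_STATES):
        return {state: 1.0}
    intended = state + action    # where the rover was told to go
    opposite = state - action    # where it slips to instead
    return {intended: 1.0 - MISSTEP_PROB, opposite: MISSTEP_PROB}

# Commanding "left" from state 4: state 3 with probability 0.9, state 5 with 0.1
print(next_state_distribution(4, -1))   # {3: 0.9, 5: 0.1}
```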
Random Reward Sequences
Policy Execution Variability
Using the same policy (left from states 2, 3, 4 and right from state 5), different executions produce different outcomes:
Execution 1 (Lucky):
- Path: 4 → 3 → 2 → 1
- Rewards: 0, 0, 0, 100
Execution 2 (Less Lucky):
- Path: 4 → 3 → 4 → 3 → 2 → 1
- Rewards: 0, 0, 0, 0, 0, 100
- “Robot slips and ends up heading back to state four instead”
Execution 3 (Unlucky Start):
- Path: 4 → 5 → 6
- Rewards: 0, 0, 40
- “You may get unlucky even on the first step and you end up going to state five because it slipped”
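
This variability across executions can be reproduced with a short simulation. The sketch below assumes the terminal rewards of 100 (state 1) and 40 (state 6) and zero rewards elsewhere, as in the reward sequences above; the POLICY dictionary and the rollout helper are illustrative names, not the lab's code.

```python
import random

random.seed(0)
MISSTEP_PROB = 0.1
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}   # terminal rewards at states 1 and 6
POLICY = {2: -1, 3: -1, 4: -1, 5: +1}                # left from 2, 3, 4; right from 5

def rollout(start_state):
    """Follow the policy from start_state; return visited states and rewards."""
    state = start_state
    states, rewards = [state], [REWARDS[state]]
    while state not in (1, 6):
        action = POLICY[state]
        # With probability 1 - MISSTEP_PROB the rover moves as commanded,
        # otherwise it slips and moves the opposite way.
        step = action if random.random() < 1 - MISSTEP_PROB else -action
        state += step
        states.append(state)
        rewards.append(REWARDS[state])
    return states, rewards

# Repeated runs from state 4 give different paths and reward sequences,
# e.g. ([4, 3, 2, 1], [0, 0, 0, 100]) on a lucky run.
for _ in range(3):
    print(rollout(4))
```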
Expected Return Concept
Problem with Deterministic Return
“When the reinforcement learning problem is stochastic, there isn’t one sequence of rewards that you see for sure; instead, you see this sequence of different rewards.”
Solution: Expected (Average) Return
Goal: “Maximizing the average value of the sum of discounted rewards”
Definition: “If you were to take your policy and try it out a thousand times or 100,000 times or a million times, you get lots of different reward sequences like that, and if you were to take the average over all of these different sequences of the sum of discounted rewards, then that’s what we call the expected return.”
Mathematical Notation
Expected Return: E[R₁ + γR₂ + γ²R₃ + …]
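
A rough way to see what this expectation means: run the policy many times and average the discounted sums, as in the sketch below. The discount factor of 0.5 is an assumption for illustration (this section does not fix γ); the rewards and policy follow the rover example above.

```python
import random
from statistics import mean

random.seed(0)
GAMMA = 0.5             # assumed discount factor for illustration
MISSTEP_PROB = 0.1
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
POLICY = {2: -1, 3: -1, 4: -1, 5: +1}

def discounted_return(start_state):
    """One rollout's R1 + gamma*R2 + gamma^2*R3 + ... under the policy."""
    state = start_state
    total, discount = REWARDS[state], GAMMA
    while state not in (1, 6):
        action = POLICY[state]
        step = action if random.random() < 1 - MISSTEP_PROB else -action
        state += step
        total += discount * REWARDS[state]
        discount *= GAMMA
    return total

# Averaging many sampled returns approximates the expected return from state 4.
samples = [discounted_return(4) for _ in range(100_000)]
print(round(mean(samples), 2))
```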
Stochastic Reinforcement Learning Goal
Objective
“The job of the reinforcement learning algorithm is to choose a policy π to maximize the average or the expected sum of discounted rewards.”
Formal Goal: Choose policy π(s) to maximize expected return over all possible random outcomes.
Modified Bellman Equation
Stochastic Version
Q(S,A) = R(S) + γ × E[max_{A’} Q(S’, A’)]
Key Difference
“The difference now is that when you take the action a in state s, the next state s’ you get to is random.”
Example: “When you’re in state 3 and you tell it to go left the next state s’ it could be the state 2, or it could be the state 4.”
Explanation
“We say that the total return from state s, taking action a, and then behaving optimally, is equal to the reward you get right away, also called the immediate reward, plus the discount factor Gamma times what you expect to get on average of the future returns.”
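Concretely, for the state-3 example above (commanding left, with a 0.1 misstep probability), the expectation expands to:
Q(3, left) = R(3) + γ × [0.9 × max_{A’} Q(2, A’) + 0.1 × max_{A’} Q(4, A’)]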
Laboratory Exploration
Misstep Probability Parameter
In the optional lab, misstep_prob represents “the probability of your Mars Rover going in the opposite direction than you had commanded it to.”
Effect of Increasing Uncertainty
10% Misstep (misstep_prob = 0.1):
- Q values and optimal returns decrease slightly
- “These values are now a little bit lower because you can’t control the robot as well as before”
40% Misstep (misstep_prob = 0.4):
- Values decrease further
- “These values end up even lower because your degree of control over the robot has decreased”
General Pattern
Higher uncertainty → Lower Q values → Reduced expected performance
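
This trend can be reproduced with a small value-iteration sketch over the six-state rover. The terminal rewards of 100 and 40 follow the example above, but the discount factor of 0.5 is an assumption for illustration, and optimal_q is an illustrative helper rather than the lab's code; the printed Q values shrink as misstep_prob grows.

```python
import numpy as np

GAMMA = 0.5                                    # assumed discount factor for illustration
REWARDS = np.array([100, 0, 0, 0, 0, 40.0])    # index 0 = state 1, ..., index 5 = state 6

def optimal_q(misstep_prob, iters=100):
    """Value iteration on Q(s,a) = R(s) + gamma * E[max_a' Q(s',a')]."""
    n = len(REWARDS)
    Q = np.zeros((n, 2))                       # columns: action left, action right
    for _ in range(iters):
        V = Q.max(axis=1)                      # max_a' Q(s', a') for each state
        new_Q = np.zeros_like(Q)
        new_Q[0, :], new_Q[-1, :] = REWARDS[0], REWARDS[-1]   # terminal states
        for s in range(1, n - 1):              # non-terminal states (0-indexed)
            for col, a in enumerate((-1, +1)):
                # Intended next state with prob 1 - misstep_prob, opposite otherwise
                expected_future = (1 - misstep_prob) * V[s + a] + misstep_prob * V[s - a]
                new_Q[s, col] = REWARDS[s] + GAMMA * expected_future
        Q = new_Q
    return Q

# Q values decrease as the misstep probability increases.
for p in (0.0, 0.1, 0.4):
    print(f"misstep_prob={p}:\n{np.round(optimal_q(p), 2)}\n")
```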
Practical Implications
Control vs Uncertainty Trade-off
Stochastic environments require balancing:
- Desired actions vs actual outcomes
- Planning complexity vs execution reliability
- Expected performance vs worst-case scenarios
Real-World Applications
This framework applies to:
- Robotics with mechanical uncertainty
- Financial trading with market volatility
- Autonomous vehicles with sensor noise
- Any decision-making under uncertainty
The stochastic framework extends reinforcement learning to handle realistic situations where actions don’t always produce intended outcomes, requiring algorithms to optimize expected rather than deterministic returns.