Random Environment

Random (Stochastic) Environment (Optional)


“In some applications, when you take an action, the outcome is not always completely reliable.”

Mars Rover Examples:

  • Rock slide affecting movement
  • Slippery floor causing rover to slip
  • Wind blowing robot off course
  • Wheel slipping or mechanical issues

General Principle: “Many robots don’t always manage to do exactly what you tell them, because of wind blowing them off course, the wheels slipping, or something else.”

When commanding the rover to go left:

  • 90% success rate (0.9): Correctly goes left
  • 10% failure rate (0.1): Accidentally goes right (opposite direction)

Example from State 4:

  • Command: Go left
  • 0.9 probability: End up in State 3
  • 0.1 probability: End up in State 5

When commanding right:

  • 0.9 probability: Correctly end up in State 5
  • 0.1 probability: Accidentally end up in State 3
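
Taken together, the left and right transition models are easy to simulate. Below is a minimal sketch in Python, assuming the six numbered states, a 0.1 misstep probability, and illustrative names (step, MISSTEP_PROB) that are not taken from the optional lab:

```python
import numpy as np

# Minimal sketch of one stochastic rover step: six states (1..6), and with
# probability MISSTEP_PROB the rover slips and moves opposite to the
# commanded direction. Names and values are illustrative assumptions.
MISSTEP_PROB = 0.1

def step(state, action, rng):
    """Take one step from `state` given action 'left' or 'right'."""
    intended = -1 if action == "left" else +1
    move = intended if rng.random() >= MISSTEP_PROB else -intended
    return min(max(state + move, 1), 6)

rng = np.random.default_rng(0)
print(step(4, "left", rng))   # usually 3 (0.9 chance), occasionally 5 (0.1)
```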

Using the same policy (left from states 2, 3, and 4; right from state 5), different executions produce different outcomes:

Execution 1 (Lucky):

  • Path: 4 → 3 → 2 → 1
  • Rewards: 0, 0, 0, 100

Execution 2 (Less Lucky):

  • Path: 4 → 3 → 4 → 3 → 2 → 1
  • Rewards: 0, 0, 0, 0, 0, 100
  • “Robot slips and ends up heading back to state four instead”

Execution 3 (Unlucky Start):

  • Path: 4 → 5 → 6
  • Rewards: 0, 0, 40
  • “You may get unlucky even on the first step and you end up going to state five because it slipped”
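
The three executions above can be reproduced in spirit by rolling out the fixed policy a few times. The sketch below reuses the step function and MISSTEP_PROB from earlier and assumes rewards of 100 in state 1, 40 in state 6, and 0 elsewhere:

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}  # fixed policy above

def rollout(start_state, rng):
    """Follow POLICY until a terminal state (1 or 6), recording the path
    and the reward of every state visited."""
    state = start_state
    path, rewards = [state], [REWARDS[state]]
    while state not in (1, 6):
        state = step(state, POLICY[state], rng)
        path.append(state)
        rewards.append(REWARDS[state])
    return path, rewards

rng = np.random.default_rng(0)
for _ in range(3):
    print(rollout(4, rng))
# e.g. ([4, 3, 2, 1], [0, 0, 0, 100]) on a lucky run,
#      ([4, 5, 6],    [0, 0, 40])     on an unlucky one
```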

“When the reinforcement learning problem is stochastic, there isn’t one sequence of rewards that you see for sure; instead, you see a sequence of different rewards.”

Goal: “Maximizing the average value of the sum of discounted rewards”

Definition: “If you were to take your policy and try it out a thousand times, or a hundred thousand times, or a million times, you would get lots of different reward sequences like that. If you were to take the average over all of these different sequences of the sum of discounted rewards, then that’s what we call the expected return.”

Expected Return: E[R₁ + γR₂ + γ²R₃ + …]
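
One way to read this expectation is as a Monte Carlo average: run the policy many times and average the discounted returns. The sketch below reuses the rollout helper from above and assumes a discount factor of 0.5, as in the Mars rover examples:

```python
GAMMA = 0.5   # assumed discount factor; adjust to your setting

def discounted_return(rewards, gamma=GAMMA):
    """R1 + gamma*R2 + gamma^2*R3 + ... for one reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def expected_return(start_state, n=100_000, seed=1):
    """Monte Carlo estimate of E[R1 + gamma*R2 + ...] under POLICY."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n):
        _, rewards = rollout(start_state, rng)
        total += discounted_return(rewards)
    return total / n

print(expected_return(4))   # average over many different reward sequences
```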

“The job of the reinforcement learning algorithm is to choose a policy π to maximize the average or the expected sum of discounted rewards.”

Formal Goal: Choose policy π(s) to maximize expected return over all possible random outcomes.

Q(s, a) = R(s) + γ × E[max_{a’} Q(s’, a’)]

“The difference now is that when you take the action a in state s, the next state s’ you get to is random.”

Example: “When you’re in state 3 and you tell it to go left the next state s’ it could be the state 2, or it could be the state 4.”

“We say that the total return from state s, taking action a, and then behaving optimally, is equal to the reward you get right away, also called the immediate reward, plus the discount factor Gamma times what you expect to get on average of the future returns.”
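
For the six-state rover, the expectation over the random next state s’ is just a weighted sum of the “intended” and “slipped” outcomes, so Q can be computed by iterating the Bellman update. This is an illustrative value-iteration sketch (reusing REWARDS and GAMMA from above), not the optional lab’s implementation:

```python
def value_iteration(misstep_prob=0.1, gamma=GAMMA, n_iters=100):
    """Iterate Q(s,a) = R(s) + gamma * E[max_a' Q(s',a')] until it settles."""
    states, actions = range(1, 7), ("left", "right")
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        new_Q = {}
        for s in states:
            for a in actions:
                if s in (1, 6):                 # terminal: no further moves
                    new_Q[(s, a)] = REWARDS[s]
                    continue
                intended = s - 1 if a == "left" else s + 1
                slipped = s + 1 if a == "left" else s - 1
                # Expectation over the random next state s'
                expected_future = (
                    (1 - misstep_prob) * max(Q[(intended, b)] for b in actions)
                    + misstep_prob * max(Q[(slipped, b)] for b in actions)
                )
                new_Q[(s, a)] = REWARDS[s] + gamma * expected_future
        Q = new_Q
    return Q
```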

In the optional lab, misstep_prob represents “the probability of your Mars Rover going in the opposite direction than you had commanded it to.”

10% Misstep (misstep_prob = 0.1):

  • Q values and optimal returns decrease slightly
  • “These values are now a little bit lower because you can’t control the robot as well as before”

40% Misstep (misstep_prob = 0.4):

  • Values decrease further
  • “These values end up even lower because your degree of control over the robot has decreased”

Higher uncertainty → Lower Q values → Reduced expected performance
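
The same trend can be reproduced with the value-iteration sketch above by sweeping the misstep probability:

```python
for p in (0.0, 0.1, 0.4):
    Q = value_iteration(misstep_prob=p)
    print(f"misstep_prob={p}:",
          round(Q[(4, "left")], 2), round(Q[(4, "right")], 2))
# The printed Q values shrink as p grows: less control over the rover
# means a lower expected return.
```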

Stochastic environments require balancing:

  • Desired actions vs actual outcomes
  • Planning complexity vs execution reliability
  • Expected performance vs worst-case scenarios

This framework applies to:

  • Robotics with mechanical uncertainty
  • Financial trading with market volatility
  • Autonomous vehicles with sensor noise
  • Any decision-making under uncertainty

The stochastic framework extends reinforcement learning to handle realistic situations where actions don’t always produce intended outcomes, requiring algorithms to optimize expected rather than deterministic returns.