Random Environment

Random (Stochastic) Environment (Optional)


“In some applications, when you take an action, the outcome is not always completely reliable.”

Mars Rover Examples:

  • Rock slide affecting movement
  • Slippery floor causing rover to slip
  • Wind blowing robot off course
  • Wheel slipping or mechanical issues

General Principle: “Many robots don’t always manage to do exactly what you tell them, because of wind blowing them off course, the wheels slipping, or something else.”

When commanding the rover to go left:

  • 90% success rate (0.9): Correctly goes left
  • 10% failure rate (0.1): Accidentally goes right (opposite direction)

Example from State 4:

  • Command: Go left
  • 0.9 probability: End up in State 3
  • 0.1 probability: End up in State 5

When commanding right:

  • 0.9 probability: Correctly end up in State 5
  • 0.1 probability: Accidentally end up in State 3
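
Taken together, the left and right transition models are easy to simulate. Below is a minimal sketch in Python, assuming the six numbered states, a 0.1 misstep probability, and illustrative names (step, MISSTEP_PROB) that are not taken from the optional lab:

```python
import numpy as np

# Minimal sketch of one stochastic rover step: six states (1..6), and with
# probability MISSTEP_PROB the rover slips and moves opposite to the
# commanded direction. Names and values are illustrative assumptions.
MISSTEP_PROB = 0.1

def step(state, action, rng):
    """Take one step from `state` given action 'left' or 'right'."""
    intended = -1 if action == "left" else +1
    move = intended if rng.random() >= MISSTEP_PROB else -intended
    return min(max(state + move, 1), 6)

rng = np.random.default_rng(0)
print(step(4, "left", rng))   # usually 3 (0.9 chance), occasionally 5 (0.1)
```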

Using the same policy (left from states 2, 3, and 4; right from state 5), different executions produce different outcomes:

Execution 1 (Lucky):

  • Path: 4 → 3 → 2 → 1
  • Rewards: 0, 0, 0, 100

Execution 2 (Less Lucky):

  • Path: 4 → 3 → 4 → 3 → 2 → 1
  • Rewards: 0, 0, 0, 0, 0, 100
  • “Robot slips and ends up heading back to state four instead”

Execution 3 (Unlucky Start):

  • Path: 4 → 5 → 6
  • Rewards: 0, 0, 40
  • “You may get unlucky even on the first step and you end up going to state five because it slipped”
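
The three executions above can be reproduced in spirit by rolling out the fixed policy a few times. The sketch below reuses the step function and MISSTEP_PROB from earlier and assumes rewards of 100 in state 1, 40 in state 6, and 0 elsewhere:

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}  # fixed policy above

def rollout(start_state, rng):
    """Follow POLICY until a terminal state (1 or 6), recording the path
    and the reward of every state visited."""
    state = start_state
    path, rewards = [state], [REWARDS[state]]
    while state not in (1, 6):
        state = step(state, POLICY[state], rng)
        path.append(state)
        rewards.append(REWARDS[state])
    return path, rewards

rng = np.random.default_rng(0)
for _ in range(3):
    print(rollout(4, rng))
# e.g. ([4, 3, 2, 1], [0, 0, 0, 100]) on a lucky run,
#      ([4, 5, 6],    [0, 0, 40])     on an unlucky one
```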

“When the reinforcement learning problem is stochastic, there isn’t one sequence of rewards that you see for sure; instead, you see a sequence of different rewards.”

Goal: “Maximizing the average value of the sum of discounted rewards”

Definition: “If you were to take your policy and try it out a thousand times, or a hundred thousand times, or a million times, you would get lots of different reward sequences like that. If you were to take the average over all of these different sequences of the sum of discounted rewards, then that’s what we call the expected return.”

Expected Return: E[R₁ + γR₂ + γ²R₃ + …]
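
One way to read this expectation is as a Monte Carlo average: run the policy many times and average the discounted returns. The sketch below reuses the rollout helper from above and assumes a discount factor of 0.5, as in the Mars rover examples:

```python
GAMMA = 0.5   # assumed discount factor; adjust to your setting

def discounted_return(rewards, gamma=GAMMA):
    """R1 + gamma*R2 + gamma^2*R3 + ... for one reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def expected_return(start_state, n=100_000, seed=1):
    """Monte Carlo estimate of E[R1 + gamma*R2 + ...] under POLICY."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n):
        _, rewards = rollout(start_state, rng)
        total += discounted_return(rewards)
    return total / n

print(expected_return(4))   # average over many different reward sequences
```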

“The job of the reinforcement learning algorithm is to choose a policy π to maximize the average or the expected sum of discounted rewards.”

Formal Goal: Choose policy π(s) to maximize expected return over all possible random outcomes.

Q(s, a) = R(s) + γ × E[max_{a’} Q(s’, a’)]

“The difference now is that when you take the action a in state s, the next state s’ you get to is random.”

Example: “When you’re in state 3 and you tell it to go left the next state s’ it could be the state 2, or it could be the state 4.”

“We say that the total return from state s, taking action a, and then behaving optimally, is equal to the reward you get right away, also called the immediate reward, plus the discount factor Gamma times what you expect to get on average of the future returns.”
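
For the six-state rover, the expectation over the random next state s’ is just a weighted sum of the “intended” and “slipped” outcomes, so Q can be computed by iterating the Bellman update. This is an illustrative value-iteration sketch (reusing REWARDS and GAMMA from above), not the optional lab’s implementation:

```python
def value_iteration(misstep_prob=0.1, gamma=GAMMA, n_iters=100):
    """Iterate Q(s,a) = R(s) + gamma * E[max_a' Q(s',a')] until it settles."""
    states, actions = range(1, 7), ("left", "right")
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        new_Q = {}
        for s in states:
            for a in actions:
                if s in (1, 6):                 # terminal: no further moves
                    new_Q[(s, a)] = REWARDS[s]
                    continue
                intended = s - 1 if a == "left" else s + 1
                slipped = s + 1 if a == "left" else s - 1
                # Expectation over the random next state s'
                expected_future = (
                    (1 - misstep_prob) * max(Q[(intended, b)] for b in actions)
                    + misstep_prob * max(Q[(slipped, b)] for b in actions)
                )
                new_Q[(s, a)] = REWARDS[s] + gamma * expected_future
        Q = new_Q
    return Q
```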

In the optional lab, misstep_prob represents “the probability of your Mars Rover going in the opposite direction than you had commanded it to.”

10% Misstep (misstep_prob = 0.1):

  • Q values and optimal returns decrease slightly
  • “These values are now a little bit lower because you can’t control the robot as well as before”

40% Misstep (misstep_prob = 0.4):

  • Values decrease further
  • “These values end up even lower because your degree of control over the robot has decreased”

Higher uncertainty → Lower Q values → Reduced expected performance
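
The same trend can be reproduced with the value-iteration sketch above by sweeping the misstep probability:

```python
for p in (0.0, 0.1, 0.4):
    Q = value_iteration(misstep_prob=p)
    print(f"misstep_prob={p}:",
          round(Q[(4, "left")], 2), round(Q[(4, "right")], 2))
# The printed Q values shrink as p grows: less control over the rover
# means a lower expected return.
```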

Stochastic environments require balancing:

  • Desired actions vs actual outcomes
  • Planning complexity vs execution reliability
  • Expected performance vs worst-case scenarios

This framework applies to:

  • Robotics with mechanical uncertainty
  • Financial trading with market volatility
  • Autonomous vehicles with sensor noise
  • Any decision-making under uncertainty

The stochastic framework extends reinforcement learning to handle realistic situations where actions don’t always produce intended outcomes, requiring algorithms to optimize expected rather than deterministic returns.