Review
Review of Key Concepts
Core Reinforcement Learning Components
Universal Framework Elements
The reinforcement learning formalism consists of:
- States: Possible situations or positions
- Actions: Available choices from each state
- Rewards: Feedback indicating performance quality
- Discount Factor: Weight for future vs immediate rewards
- Return: Discounted sum of rewards over time
- Policy: Function π mapping each state to an action; the goal is to find a policy that maximizes the return
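To make these pieces concrete, here is a minimal Python sketch (not from the course) that bundles the elements into one structure and computes the discounted return from a sequence of rewards. The names MDPSpec and discounted_return are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDPSpec:
    """Illustrative container for the elements of the RL formalism."""
    states: List[int]                 # possible situations or positions
    actions: List[str]                # available choices from each state
    rewards: Dict[int, float]        # feedback received in each state
    gamma: float                      # discount factor: future vs. immediate rewards
    policy: Callable[[int], str]      # maps a state to the action to take

def discounted_return(reward_sequence: List[float], gamma: float) -> float:
    """Return = R1 + gamma*R2 + gamma^2*R3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(reward_sequence))

# Example: rewards 1, 1, 1 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], 0.9))
```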
Mars Rover Example Review
Specific Implementation
- States: Six numbered positions (1-6)
- Actions: Go left or go right
- Rewards:
- State 1: 100 (leftmost)
- State 6: 40 (rightmost)
- States 2-5: 0 (intermediate)
- Discount Factor: 0.5 (for illustration)
- Return Formula: R₁ + γR₂ + γ²R₃ + γ³R₄ + …
- Policy: A mapping π from each state to an action (left or right); different policies yield different returns (see the sketch below)
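With these specific values, the return can be checked by hand. The sketch below assumes the “always go left” policy discussed in the lectures; the rollout_return helper is hypothetical, not lecture code.

```python
# Mars rover rewards: state 1 -> 100, state 6 -> 40, states 2-5 -> 0
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
GAMMA = 0.5   # discount factor used for illustration

def always_left(state: int) -> str:
    return "left"

def rollout_return(start: int, policy, gamma: float) -> float:
    """Follow the policy until a terminal state (1 or 6), summing discounted rewards."""
    state, total, discount = start, 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in (1, 6):                  # terminal states end the episode
            return total
        state = state - 1 if policy(state) == "left" else state + 1
        discount *= gamma

# Starting at state 4 and going left: 0 + 0.5*0 + 0.25*0 + 0.125*100 = 12.5
print(rollout_return(4, always_left, GAMMA))
```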
Applications to Other Domains
Autonomous Helicopter
- State: “Set of possible positions and orientations and speeds and so on of the helicopter”
- Actions: “Set of possible ways to move the control sticks of a helicopter”
- Rewards:
- +1 if flying well
- -1,000 if it crashes (a large negative reward)
- Discount Factor: “Number slightly less than one, maybe say 0.99”
- Policy Goal: “Given as input the position of the helicopter s, it tells you what action to take. That is, it tells you how to move the control sticks”
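As a rough illustration (not lecture code), the sketch below assumes a +1 reward per time step of good flight and a -1,000 reward on a crash, with γ = 0.99, and shows how a crash penalty dominates the return; helicopter_return is a hypothetical helper.

```python
GAMMA = 0.99   # discount factor slightly less than one

def helicopter_return(steps_flying_well: int, crashes_at_end: bool) -> float:
    """Hypothetical return: +1 per step of good flight, -1,000 if the episode ends in a crash."""
    ret = sum(GAMMA ** t for t in range(steps_flying_well))    # +1 each step, discounted
    if crashes_at_end:
        ret += (GAMMA ** steps_flying_well) * (-1000.0)        # heavily penalize crashing
    return ret

print(helicopter_return(50, crashes_at_end=False))   # ~39.5: steady reward for flying well
print(helicopter_return(50, crashes_at_end=True))    # ~-565: the crash penalty dominates
```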
Chess Playing
- State: “Position of all the pieces on the board” (simplified; “there’s a little bit more information than just the position of the pieces that is important for chess”)
- Actions: “Possible legal moves in the game”
- Rewards:
- +1 if wins game
- -1 if loses game
- 0 if ties game
- Discount Factor: “Very close to one, so maybe 0.99 or even 0.995 or 0.999”
- Policy Goal: “Given a board position, pick a good action using a policy π”
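The sketch below (hypothetical, not from the course) illustrates why a discount factor very close to one is used for games like chess: the only nonzero reward arrives at the end of the game, and a much smaller γ would shrink that signal to almost nothing.

```python
def game_return(outcome: str, moves_until_end: int, gamma: float = 0.995) -> float:
    """Hypothetical chess return: reward arrives only when the game ends."""
    terminal = {"win": 1.0, "loss": -1.0, "tie": 0.0}[outcome]
    return (gamma ** moves_until_end) * terminal     # all intermediate rewards are 0

print(game_return("win", 60))              # ~0.74: with gamma near 1, a distant win still counts
print(game_return("win", 60, gamma=0.5))   # ~8.7e-19: too small a gamma wipes out the signal
```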
Markov Decision Process (MDP)
Formal Name
“This formalism of a reinforcement learning application actually has a name. It’s called a Markov decision process, and I know that sounds like a big technical complicated term.”
Key Point: “If you ever hear this term Markov decision process or MDP for short, that’s just the formalism that we’ve been talking about in the last few videos.”
Markov Property
“The term Markov in the MDP or Markov decision process refers to that the future only depends on the current state and not on anything that might have occurred prior to getting to the current state.”
Simplified: “In a Markov decision process, the future depends only on where you are now, not on how you got here.”
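A minimal way to see the Markov property in code, using the Mars rover states as an example: the transition function takes only the current state and action, so any history that leads to the same state behaves identically from then on. The function name mars_rover_step is an illustrative assumption.

```python
def mars_rover_step(state: int, action: str) -> int:
    """Markov transition: the next state depends only on the current state and action,
    not on how the rover arrived at this state."""
    if action == "left":
        return max(state - 1, 1)   # can't move past the leftmost state
    return min(state + 1, 6)       # can't move past the rightmost state

# Any path that ends in state 3 continues identically from here on:
print(mars_rover_step(3, "left"))   # 2, regardless of the history before state 3
```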
Agent-Environment Interaction
Conceptual Model
The MDP formalism represents:
- Agent: Robot or other entity we wish to control
- Environment: World that responds to actions
- Interaction Cycle:
- Agent chooses action (a) using policy (π)
- Environment responds with new state and reward
- Agent observes new state (s) and reward (r)
- Process repeats
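The cycle above can be written as a short, generic loop. This is only a sketch: the env.reset() / env.step() interface is an assumed placeholder resembling common RL environments, not an API defined in these notes.

```python
def run_episode(env, policy, max_steps: int = 100) -> float:
    """Generic agent-environment loop: the agent picks a = pi(s), the environment
    responds with a new state s' and reward r, and the cycle repeats."""
    state = env.reset()                           # initial state s
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                    # agent chooses action a using policy pi
        state, reward, done = env.step(action)    # environment returns new state and reward
        total_reward += reward                    # agent observes r (undiscounted sum here)
        if done:                                  # stop at a terminal state
            break
    return total_reward
```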
The Markov decision process provides the mathematical foundation for reinforcement learning algorithms, offering a standardized way to represent decision-making problems in which an agent must learn good behavior through interaction with its environment.