Making Decisions: Policies in Reinforcement Learning

Action Selection Strategies

Possible Approaches

There are many different ways to choose actions in reinforcement learning:

Nearest reward strategy: “Always go for the nearer reward.” Go left if the leftmost reward is nearer, right if the rightmost reward is nearer.
Largest reward strategy: Always pursue the larger reward, regardless of distance.
Smallest reward strategy: Always go for the smaller reward (this doesn’t seem like a good idea, but it is another option).
Mixed strategy: “Go left unless you’re just one step away from the lesser reward, in which case, you go for that one.”

Policy Definition

Core Concept

Policy (π): A function that “takes as input any state s and maps it to some action a that it wants us to take”

Mathematical Notation: π(s) = a

Input: State s
Output: Action a

Example Policy

Strategy: Go left unless you are one step away from the lesser reward, in which case go for that one

Policy Mapping:

π(State 2) = Left
π(State 3) = Left
π(State 4) = Left
π(State 5) = Right

Reinforcement Learning Goal

Objective

“The goal of reinforcement learning is to find a policy π or π(s) that tells you what action to take in every state so as to maximize the return.”

Policy vs Controller Terminology

“I don’t know if policy is the most descriptive term of what π is, but it’s one of those terms that’s become standard in reinforcement learning. Maybe calling π a controller rather than a policy would be more natural terminology but policy is what everyone in reinforcement learning now calls this.”

Complete Framework

The policy represents the final piece needed for a complete reinforcement learning system:

States: Possible positions/situations
Actions: Available choices at each state
Rewards: Feedback for being in each state
Return: Discounted sum of future rewards
Policy: Decision-making function that maps states to actions

Integration

Policy determines which actions to take
Actions determine which states are visited
States determine which rewards are received
Returns provide the metric for evaluating policy quality
Goal is finding the policy that maximizes expected return
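
The chain above (policy, then actions, then states, then rewards, then return) can be sketched as a short rollout. Everything specific to the environment here is an assumption made for illustration and is not stated in these notes: six states in a line, states 1 and 6 terminal, a reward of 100 in state 1 and 40 in state 6, zero reward elsewhere, a discount factor of 0.5, and the reward for the starting state counted undiscounted. The function names are likewise hypothetical.

```python
# Assumed toy environment (not given in the notes): states 1..6 in a line.
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}   # assumed terminal states
GAMMA = 0.5         # assumed discount factor


def mixed_policy(state):
    """Go left unless one step away from the lesser reward (state 5)."""
    return "Right" if state == 5 else "Left"


def always_left(state):
    """Always pursue the reward on the left."""
    return "Left"


def step(state, action):
    """Actions determine which state is visited next."""
    return state - 1 if action == "Left" else state + 1


def rollout_return(policy, start_state, max_steps=100):
    """Follow `policy` from `start_state` and sum the discounted rewards."""
    state, discount, total = start_state, 1.0, 0.0
    for _ in range(max_steps):
        total += discount * REWARDS[state]     # states determine rewards
        if state in TERMINAL:
            return total
        state = step(state, policy(state))     # policy determines the action
        discount *= GAMMA
    raise RuntimeError("policy did not reach a terminal state")


for s in (2, 3, 4, 5):
    print(s, rollout_return(mixed_policy, s), rollout_return(always_left, s))
```

Under these assumptions the two policies give the same return from states 2 to 4 and differ only at state 5, which is how the return acts as the metric for comparing policies.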

The policy serves as the “brain” of the reinforcement learning agent: it encodes what the agent has learned about how to act in each state, and the better the policy, the higher the return the agent achieves in the environment.