Algorithm Refinement: ε-Greedy Policy
Action Selection During Learning
The Learning Dilemma
“In the learning algorithm that we developed, even while you’re still learning how to approximate Q(s,a), you need to take some actions in the lunar lander. How do you pick those actions while you’re still learning?”
Problem: “When the learning algorithm is still running, we don’t really know what’s the best action to take in every state. If we did, we’d already be done learning.”
Action Selection Options
Option 1: Pure Greedy Selection
Approach: “Whenever you’re in state s, pick an action a that maximizes Q(s,a).”
Reasoning: “Even if Q(s,a) is not a great estimate of the Q function, let’s just do our best and use our current guess of Q(s,a) and pick the action a that maximizes it.”
Limitation: “It turns out this may work okay, but isn’t the best option.”
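As a point of reference, here is a minimal Python sketch of this greedy rule; the `q_network` callable and the use of NumPy are illustrative assumptions, not code from the lecture:

```python
import numpy as np

def greedy_action(q_network, state):
    """Option 1: always pick the action with the highest current Q estimate.

    q_network is assumed to be a callable returning one estimated Q(s, a)
    value per action for the given state (illustrative, not lecture code).
    """
    q_values = np.asarray(q_network(state))  # current (possibly poor) estimates of Q(s, a)
    return int(np.argmax(q_values))          # exploit the current guess on every step
```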
Option 2: ε-Greedy Policy (Recommended)
Approach: “Most of the time, let’s say with probability of 0.95, pick the action that maximizes Q(s,a). But a small fraction of the time, let’s say five percent of the time, we’ll pick an action a randomly.”
Why Random Exploration Matters
Preconception Problem
Scenario: “Suppose for some strange reason Q(s,a) was initialized randomly so that the learning algorithm thinks that firing the main thruster is never a good idea. Maybe the neural network parameters were initialized so that Q(s, main) is always very low.”
Vicious Cycle
Problem: “If that’s the case, then because the neural network is trying to pick the action a that maximizes Q(s,a), it will never ever try firing the main thruster.”
Consequence: “Because it never ever tries firing the main thruster, it will never learn that firing the main thruster is actually sometimes a good idea.”
Solution Through Exploration
Benefit: “Under option 2, on every step we have some small probability of trying out different actions, so that the neural network can learn to overcome its own possible preconceptions about what might be a bad idea that turns out not to be the case.”
Exploration vs Exploitation
Section titled “Exploration vs Exploitation”Terminology
- Exploration: “Picking actions randomly… sometimes called an exploration step. Because we’re going to try out something that may not be the best idea, but we’re going to just try out some action in some circumstances, explore and learn more about an action in the circumstance where we may not have had as much experience before.”
- Exploitation: “Taking an action that maximizes Q(s,a), sometimes this is called a greedy action because we’re trying to actually maximize our return by picking this. Or in the reinforcement learning literature, sometimes you’ll also hear this as an exploitation step.”
Trade-off Concept
“In the reinforcement learning literature, sometimes you hear people talk about the exploration versus exploitation trade-off, which refers to how often do you take actions randomly or take actions that may not be the best in order to learn more, versus trying to maximize your return by say, taking the action that maximizes Q(s,a).”
ε-Greedy Policy Definition
Mathematical Formulation
- With probability 1 − ε: take the greedy action (argmax over a of Q(s,a))
- With probability ε: take a random action
Standard Parameters: ε = 0.05 (5% exploration, 95% exploitation)
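A minimal Python sketch of this rule, again assuming an illustrative `q_network` callable and a discrete action space of `n_actions` actions (four for the lunar lander):

```python
import numpy as np

def epsilon_greedy_action(q_network, state, epsilon=0.05, n_actions=4):
    """ε-greedy selection: explore with probability ε, exploit with probability 1 − ε."""
    if np.random.rand() < epsilon:
        # Exploration step: pick one of the n_actions uniformly at random.
        return int(np.random.randint(n_actions))
    # Exploitation (greedy) step: pick the action that maximizes the current Q(s, a) estimate.
    q_values = np.asarray(q_network(state))
    return int(np.argmax(q_values))
```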
Naming Convention
Confusing Terminology: “A lot of people have commented that the name ε-greedy policy is confusing because you’re actually being greedy 95 percent of the time, not five percent of the time. So maybe 1 minus ε-greedy policy, because it’s 95 percent greedy, five percent exploring, that’s actually a more accurate description of the algorithm.”
Historical Usage: “But for historical reasons, the name ε-greedy policy is what has stuck. This is the name that people use to refer to the policy that actually explores ε fraction of the time rather than being greedy ε fraction of the time.”
Dynamic ε Scheduling
Section titled “Dynamic ε Scheduling”Adaptive Exploration Strategy
Progressive Reduction: “One of the tricks that’s sometimes used in reinforcement learning is to start off with ε high. Initially, you are taking random actions a lot of the time, and then gradually decrease it, so that over time you are less likely to take actions randomly and more likely to use your improving estimates of the Q-function to pick good actions.”
Lunar Lander Example
Initial Phase: “You might start off with ε very, very high, maybe even ε equals 1.0. You’re just picking actions completely at random initially.”
Final Phase: “Then gradually decrease it all the way down to say 0.01, so that eventually you’re taking greedy actions 99 percent of the time and acting randomly only a very small one percent of the time.”
- Start: ε = 1.0 (100% random exploration)
- Gradually decrease: ε reduces over time during training
- End: ε = 0.01 (99% exploitation, 1% exploration)
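A sketch of one way such a schedule could be implemented; only the 1.0 start and 0.01 floor come from the example above, while the per-episode decay factor of 0.995 and the episode count are assumed for illustration:

```python
# Illustrative ε schedule: start fully random (ε = 1.0) and decay toward a 0.01 floor.
# The multiplicative per-episode decay factor of 0.995 is an assumed choice.
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995

for episode in range(1000):
    # ... run one episode, choosing each action with epsilon_greedy_action(..., epsilon=epsilon) ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```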
Implementation and Practical Considerations
Hyperparameter Sensitivity
Reinforcement Learning Challenge: “One of the things that I’ve noticed about reinforcement learning algorithms is that, compared to supervised learning, they’re more finicky in terms of the choice of hyperparameters.”
Comparison with Supervised Learning:
- Supervised learning: “If you set the learning rate a little bit too small, then maybe the algorithm will take longer to learn. Maybe it takes three times as long to train, which is annoying, but maybe not that bad.”
- Reinforcement learning: “If you set the value of ε not quite as well, or set other parameters not quite as well, it doesn’t take three times as long to learn. It may take 10 times or a 100 times as long to learn.”
Algorithm Maturity
“Reinforcement learning algorithms, I think because they’re less mature than supervised learning algorithms, are much more finicky to little choices of parameters like that, and it actually sometimes is frankly more frustrating to tune these parameters with a reinforcement learning algorithm compared to a supervised learning algorithm.”
The ε-greedy policy provides a principled way to balance exploration and exploitation during learning, ensuring the agent discovers good actions while gradually shifting toward optimal behavior as learning progresses.