Algorithm Refinement: ε-Greedy Policy
Action Selection During Learning
The Learning Dilemma
“In the learning algorithm that we developed, even while you’re still learning how to approximate Q(s,a), you need to take some actions in the lunar lander. How do you pick those actions while you’re still learning?”
Problem: “When the learning algorithm is still running, we don’t really know what’s the best action to take in every state. If we did, we’d already be done learning.”
Action Selection Options
Option 1: Pure Greedy Selection
Approach: “Whenever you’re in state s, pick an action a that maximizes Q(s,a).”
Reasoning: “Even if Q(s,a) is not a great estimate of the Q function, let’s just do our best and use our current guess of Q(s,a) and pick the action a that maximizes it.”
Limitation: “It turns out this may work okay, but isn’t the best option.”
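As a point of reference, here is a minimal Python sketch of this greedy rule; the `q_network` callable and the use of NumPy are illustrative assumptions, not code from the lecture:

```python
import numpy as np

def greedy_action(q_network, state):
    """Option 1: always pick the action with the highest current Q estimate.

    q_network is assumed to be a callable returning one estimated Q(s, a)
    value per action for the given state (illustrative, not lecture code).
    """
    q_values = np.asarray(q_network(state))  # current (possibly poor) estimates of Q(s, a)
    return int(np.argmax(q_values))          # exploit the current guess on every step
```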
Option 2: ε-Greedy Policy (Recommended)
Approach: “Most of the time, let’s say with probability of 0.95, pick the action that maximizes Q(s,a). But a small fraction of the time, let’s say five percent of the time, we’ll pick an action a randomly.”
Why Random Exploration Matters
Preconception Problem
Scenario: “Suppose for some strange reason Q(s,a) was initialized randomly so that the learning algorithm thinks that firing the main thruster is never a good idea. Maybe the neural network parameters were initialized so that Q(s, main) is always very low.”
Vicious Cycle
Problem: “If that’s the case, then because the neural network is trying to pick the action a that maximizes Q(s,a), it will never ever try firing the main thruster.”
Consequence: “Because it never ever tries firing the main thruster, it will never learn that firing the main thruster is actually sometimes a good idea.”
Solution Through Exploration
Benefit: “Under option 2, on every step we have some small probability of trying out different actions, so that the neural network can learn to overcome its own possible preconceptions about what might be a bad idea that turns out not to be the case.”
Exploration vs Exploitation
Section titled “Exploration vs Exploitation”Terminology
- Exploration: “Picking actions randomly… sometimes called an exploration step. Because we’re going to try out something that may not be the best idea, but we’re going to just try out some action in some circumstances, explore and learn more about an action in the circumstance where we may not have had as much experience before.”
- Exploitation: “Taking an action that maximizes Q(s,a), sometimes this is called a greedy action because we’re trying to actually maximize our return by picking this. Or in the reinforcement learning literature, sometimes you’ll also hear this as an exploitation step.”
Trade-off Concept
“In the reinforcement learning literature, sometimes you hear people talk about the exploration versus exploitation trade-off, which refers to how often do you take actions randomly or take actions that may not be the best in order to learn more, versus trying to maximize your return by say, taking the action that maximizes Q(s,a).”
ε-Greedy Policy Definition
Mathematical Formulation
- With probability 1 − ε: take the greedy action (argmax over a of Q(s,a))
- With probability ε: take a random action
Standard Parameters: ε = 0.05 (5% exploration, 95% exploitation)
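A minimal Python sketch of this rule, again assuming an illustrative `q_network` callable and a discrete action space of `n_actions` actions (four for the lunar lander):

```python
import numpy as np

def epsilon_greedy_action(q_network, state, epsilon=0.05, n_actions=4):
    """ε-greedy selection: explore with probability ε, exploit with probability 1 − ε."""
    if np.random.rand() < epsilon:
        # Exploration step: pick one of the n_actions uniformly at random.
        return int(np.random.randint(n_actions))
    # Exploitation (greedy) step: pick the action that maximizes the current Q(s, a) estimate.
    q_values = np.asarray(q_network(state))
    return int(np.argmax(q_values))
```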
Naming Convention
Confusing Terminology: “A lot of people have commented that the name ε-greedy policy is confusing because you’re actually being greedy 95 percent of the time, not five percent of the time. So maybe 1 minus ε-greedy policy, because it’s 95 percent greedy, five percent exploring, that’s actually a more accurate description of the algorithm.”
Historical Usage: “But for historical reasons, the name ε-greedy policy is what has stuck. This is the name that people use to refer to the policy that actually explores ε fraction of the time rather than being greedy ε fraction of the time.”
Dynamic ε Scheduling
Section titled “Dynamic ε Scheduling”Adaptive Exploration Strategy
Progressive Reduction: “One of the tricks that’s sometimes used in reinforcement learning is to start off with ε high. Initially, you are taking random actions a lot of the time, and then gradually decrease it, so that over time you are less likely to take actions randomly and more likely to use your improving estimates of the Q-function to pick good actions.”
Lunar Lander Example
Initial Phase: “You might start off with ε very, very high, maybe even ε equals 1.0. You’re just picking actions completely at random initially.”
Final Phase: “Then gradually decrease it all the way down to say 0.01, so that eventually you’re taking greedy actions 99 percent of the time and acting randomly only a very small one percent of the time.”
- Start: ε = 1.0 (100% random exploration)
- Gradually decrease: ε reduces over time during training
- End: ε = 0.01 (99% exploitation, 1% exploration)
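A sketch of one way such a schedule could be implemented; only the 1.0 start and 0.01 floor come from the example above, while the per-episode decay factor of 0.995 and the episode count are assumed for illustration:

```python
# Illustrative ε schedule: start fully random (ε = 1.0) and decay toward a 0.01 floor.
# The multiplicative per-episode decay factor of 0.995 is an assumed choice.
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995

for episode in range(1000):
    # ... run one episode, choosing each action with epsilon_greedy_action(..., epsilon=epsilon) ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```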
Implementation and Practical Considerations
Hyperparameter Sensitivity
Reinforcement Learning Challenge: “One of the things that I’ve noticed about reinforcement learning algorithms is that, compared to supervised learning, they’re more finicky in terms of the choice of hyperparameters.”
Comparison with Supervised Learning:
- Supervised learning: “If you set the learning rate a little bit too small, then maybe the algorithm will take longer to learn. Maybe it takes three times as long to train, which is annoying, but maybe not that bad.”
- Reinforcement learning: “If you set the value of ε not quite as well, or set other parameters not quite as well, it doesn’t take three times as long to learn. It may take 10 times or a 100 times as long to learn.”
Algorithm Maturity
“Reinforcement learning algorithms, I think because they’re less mature than supervised learning algorithms, are much more finicky to little choices of parameters like that, and it actually sometimes is frankly more frustrating to tune these parameters with a reinforcement learning algorithm compared to a supervised learning algorithm.”
The ε-greedy policy provides a principled way to balance exploration and exploitation during learning, ensuring the agent discovers good actions while gradually shifting toward optimal behavior as learning progresses.