Mini-Batch Benefits
- Faster training iterations
- Reduced memory requirements
- Applies to both supervised and RL
- Trade noise for speed
Problem with Large Datasets: When training set size m is very large (e.g., 100 million housing examples), traditional gradient descent becomes inefficient.
Computational Issue: “Every single step of gradient descent requires computing this average over 100 million examples, and this turns out to be very slow.”
Inefficient Process: Each gradient step computes predictions for all m examples, averages the derivatives over the full training set, updates the parameters, and then repeats.
Result: “When the training set size is very large, this gradient descent algorithm turns out to be quite slow.”
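To make the cost concrete, here is a minimal NumPy sketch (illustrative, not code from the lecture) of one batch gradient-descent step for linear regression; notice that the derivative is averaged over all m examples before the parameters move at all.

```python
import numpy as np

def batch_gradient_step(X, y, w, b, alpha=0.01):
    """One step of batch gradient descent for linear regression.

    The gradient is averaged over ALL m examples, which is why each
    step is slow when m is, say, 100 million.
    """
    m = X.shape[0]
    errors = X @ w + b - y           # touches every training example
    dj_dw = (X.T @ errors) / m       # average over all m examples
    dj_db = errors.mean()
    return w - alpha * dj_dw, b - alpha * dj_db
```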
Core Idea: “Not use all 100 million training examples on every single iteration through this loop. Instead, we may pick a smaller number, let me call it m’ equals say, 1,000.”
Process: “On every step, instead of using all 100 million examples, we would pick some subset of 1,000 or m’ examples.”
Efficiency Gain: “Now each iteration through gradient descent requires looking only at the 1,000 rather than 100 million examples, and every step takes much less time and just leads to a more efficient algorithm.”
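A minimal sketch of the mini-batch variant (again illustrative): each step samples m' = 1,000 examples at random and averages the derivative over only that subset.

```python
import numpy as np

def minibatch_gradient_step(X, y, w, b, alpha=0.01, m_prime=1000):
    """One step of mini-batch gradient descent.

    Only m' randomly chosen examples (here 1,000) are used per step,
    so each iteration is far cheaper than averaging over all m.
    """
    m = X.shape[0]
    idx = np.random.choice(m, size=min(m_prime, m), replace=False)
    errors = X[idx] @ w + b - y[idx]
    dj_dw = (X[idx].T @ errors) / len(idx)   # average over m' examples only
    dj_db = errors.mean()
    return w - alpha * dj_dw, b - alpha * dj_db
```

Each step looks at a different random subset, which is where the extra noise in the descent direction comes from.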
Batch Gradient Descent: “Every step of gradient descent causes the parameters to reliably get closer to the global minimum of the cost function”
Mini-Batch Gradient Descent: “Every iteration is much more computationally inexpensive and so mini-batch learning or mini-batch gradient descent turns out to be a much faster algorithm when you have a very large training set.”
Trade-off: Steps are noisier but much faster, leading to overall speed improvement.
Replay Buffer Usage: “Even if you have stored the 10,000 most recent tuples in the replay buffer, what you might choose to do is not use all 10,000 every time you train a model.”
Practical Implementation: “Instead, what you might do is just take the subset. You might choose just 1,000 examples of these s, a, R(s), s’ tuples and use it to create just 1,000 training examples to train the neural network.”
Speed vs Accuracy Trade-off: “It turns out that this will make each iteration of training a model a little bit more noisy but much faster and this will overall tend to speed up this reinforcement learning algorithm.”
Common Practice: Use a mini-batch of 1,000 examples even when storing 10,000 tuples in the replay buffer.
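One plausible way to implement this, assuming the replay buffer is a simple deque of (s, a, R(s), s') tuples; the names and structure here are illustrative, not the lab's exact code.

```python
import random
from collections import deque

# Keep only the 10,000 most recent (s, a, R(s), s') experience tuples.
replay_buffer = deque(maxlen=10_000)

def sample_minibatch(buffer, batch_size=1_000):
    """Draw a random subset of experience tuples for one training step.

    Only batch_size tuples (not all 10,000) are used, trading a bit of
    noise per update for much faster training iterations.
    """
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    states, actions, rewards, next_states = zip(*batch)
    return states, actions, rewards, next_states
```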
Abrupt Changes: “I’ve written out this step here of Set Q equals Q_new. But it turns out that this can make a very abrupt change to Q.”
Risk of Degradation: “If you train a new neural network Q_new, maybe just by chance it is not a very good neural network. Maybe it’s even a little bit worse than the old one; then you just overwrote your Q function with a potentially worse, noisy neural network.”
Gradual Parameter Updates: Instead of completely replacing parameters, blend old and new values.
Mathematical Formulation: Set W = 0.01 * W_new + 0.99 * W and, likewise, B = 0.01 * B_new + 0.99 * B (see the code sketch below).
Interpretation: “We’re going to make W to be 99 percent the old version of W plus one percent of the new version W_new.”
Stability: “This is called a soft update because whenever we train a new neural network W_new, we’re only going to accept a little bit of the new value.”
Risk Mitigation: “Soft update allows you to make a more gradual change to Q or to the neural network parameters W and B that affect your current guess for the Q function Q(s,a).”
Convergence: “Using the soft update method causes the reinforcement learning algorithm to converge more reliably. It makes it less likely that the reinforcement learning algorithm will oscillate or diverge or have other undesirable properties.”
Update Rate: The coefficients 0.01 and 0.99 are hyperparameters that control update aggressiveness.
Constraint: “These two numbers are expected to add up to one.”
Extremes: Setting the new-value coefficient to 1.0 recovers the original abrupt update (W is completely replaced by W_new); choosing an even smaller value makes each change to Q more gradual.
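A minimal sketch of the soft update, assuming q_network (the current Q) and q_new_network (the freshly trained copy) are tf.keras models whose weights are tf.Variable objects; the names and the choice of tau = 0.01 are illustrative.

```python
def soft_update(q_network, q_new_network, tau=0.01):
    """Blend the newly trained parameters into the current Q-network.

    Each weight becomes (1 - tau) * old + tau * new, so only a small
    fraction of the new values is accepted on every update.
    Assumes both networks are tf.keras models with matching layers.
    """
    for w, w_new in zip(q_network.weights, q_new_network.weights):
        w.assign((1.0 - tau) * w + tau * w_new)
```

Setting tau = 1.0 recovers the abrupt “Set Q = Q_new” step, while smaller values of tau make the change to Q more gradual.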
Universal Application: “Mini-batching, which actually applies very well to supervised learning as well, not just reinforcement learning, as well as the idea of soft updates”
Performance Impact: “With these two final refinements to the algorithm… you should be able to get your learning algorithm to work really well on the Lunar Lander.”
Replay Buffer: Store the 10,000 most recent experience tuples
Mini-Batch Size: Use a subset of 1,000 examples for each training step
Soft Update Rate: Typical values like 0.01/0.99 or 0.05/0.95
Combined Effect: Faster, more stable learning
“The Lunar Lander is actually a decently complex, decently challenging application, and so if you can get it to work and land safely on the moon, I think that’s actually really cool, and I hope you enjoy playing with the practice lab.”
These refinements transform the basic DQN algorithm from a theoretically sound but practically challenging approach into a robust, efficient learning system capable of solving complex control problems.
Final Algorithm: DQN with experience replay, improved architecture, ε-greedy exploration, mini-batch training, and soft updates is a practical, widely used approach for discrete-action reinforcement learning problems.
The combination of these techniques addresses the major challenges in deep reinforcement learning: sample efficiency, stability, computational cost, and convergence reliability.
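To show how the pieces fit together, here is a hedged sketch of one training iteration in the common target-network formulation, written with TensorFlow; every name here (train_step, q_network, target_q_network, replay_buffer) is illustrative rather than the lab’s exact code.

```python
import random
import numpy as np
import tensorflow as tf

def train_step(q_network, target_q_network, optimizer, replay_buffer,
               batch_size=1_000, gamma=0.995, tau=0.01):
    """One DQN training iteration: sample a mini-batch from the replay
    buffer, take a gradient step on the Q-network, then softly update
    the target network.

    Assumes both networks are tf.keras models mapping states to one
    Q-value per action, and replay_buffer holds (s, a, reward, s', done)
    tuples.
    """
    # Mini-batch: use only batch_size of the stored tuples this step.
    batch = random.sample(list(replay_buffer),
                          min(batch_size, len(replay_buffer)))
    states, actions, rewards, next_states, dones = zip(*batch)
    states      = np.asarray(states, dtype=np.float32)
    actions     = np.asarray(actions, dtype=np.int32)
    rewards     = np.asarray(rewards, dtype=np.float32)
    next_states = np.asarray(next_states, dtype=np.float32)
    dones       = np.asarray(dones, dtype=np.float32)

    # Targets: y = R(s) + gamma * max_a' Q(s', a'), no bootstrap at terminal states.
    max_next_q = tf.reduce_max(target_q_network(next_states), axis=1)
    y_targets = rewards + gamma * max_next_q * (1.0 - dones)

    with tf.GradientTape() as tape:
        q_values = q_network(states)
        # Pick out Q(s, a) for the action actually taken in each tuple.
        q_of_action = tf.reduce_sum(
            q_values * tf.one_hot(actions, q_values.shape[1]), axis=1)
        loss = tf.reduce_mean(tf.square(y_targets - q_of_action))

    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))

    # Soft update: accept only a small fraction (tau) of the new weights.
    for t_w, w in zip(target_q_network.weights, q_network.weights):
        t_w.assign((1.0 - tau) * t_w + tau * w)
```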