Mini-Batch Benefits
- Faster training iterations
- Reduced memory requirements
- Applies to both supervised and RL
- Trade noise for speed
Problem with Large Datasets: When training set size m is very large (e.g., 100 million housing examples), traditional gradient descent becomes inefficient.
Computational Issue: “Every single step of gradient descent requires computing this average over 100 million examples, and this turns out to be very slow.”
Inefficient Process: Each gradient step computes predictions for all m examples, averages the derivatives over the full training set, updates the parameters, and then repeats.
Result: “When the training set size is very large, this gradient descent algorithm turns out to be quite slow.”
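To make the cost concrete, here is a minimal NumPy sketch (illustrative, not code from the lecture) of one batch gradient-descent step for linear regression; notice that the derivative is averaged over all m examples before the parameters move at all.

```python
import numpy as np

def batch_gradient_step(X, y, w, b, alpha=0.01):
    """One step of batch gradient descent for linear regression.

    The gradient is averaged over ALL m examples, which is why each
    step is slow when m is, say, 100 million.
    """
    m = X.shape[0]
    errors = X @ w + b - y           # touches every training example
    dj_dw = (X.T @ errors) / m       # average over all m examples
    dj_db = errors.mean()
    return w - alpha * dj_dw, b - alpha * dj_db
```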
Core Idea: “Not use all 100 million training examples on every single iteration through this loop. Instead, we may pick a smaller number, let me call it m’ equals say, 1,000.”
Process: “On every step, instead of using all 100 million examples, we would pick some subset of 1,000 or m’ examples.”
Efficiency Gain: “Now each iteration through gradient descent requires looking only at the 1,000 rather than 100 million examples, and every step takes much less time and just leads to a more efficient algorithm.”
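A minimal sketch of the mini-batch variant (again illustrative): each step samples m' = 1,000 examples at random and averages the derivative over only that subset.

```python
import numpy as np

def minibatch_gradient_step(X, y, w, b, alpha=0.01, m_prime=1000):
    """One step of mini-batch gradient descent.

    Only m' randomly chosen examples (here 1,000) are used per step,
    so each iteration is far cheaper than averaging over all m.
    """
    m = X.shape[0]
    idx = np.random.choice(m, size=min(m_prime, m), replace=False)
    errors = X[idx] @ w + b - y[idx]
    dj_dw = (X[idx].T @ errors) / len(idx)   # average over m' examples only
    dj_db = errors.mean()
    return w - alpha * dj_dw, b - alpha * dj_db
```

Each step looks at a different random subset, which is where the extra noise in the descent direction comes from.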
Batch Gradient Descent: “Every step of gradient descent causes the parameters to reliably get closer to the global minimum of the cost function”
Mini-Batch Gradient Descent: “Every iteration is much more computationally inexpensive and so mini-batch learning or mini-batch gradient descent turns out to be a much faster algorithm when you have a very large training set.”
Trade-off: Steps are noisier but much faster, leading to overall speed improvement.
Replay Buffer Usage: “Even if you have stored the 10,000 most recent tuples in the replay buffer, what you might choose to do is not use all 10,000 every time you train a model.”
Practical Implementation: “Instead, what you might do is just take the subset. You might choose just 1,000 examples of these s, a, R(s), s’ tuples and use it to create just 1,000 training examples to train the neural network.”
Speed vs Accuracy Trade-off: “It turns out that this will make each iteration of training a model a little bit more noisy but much faster and this will overall tend to speed up this reinforcement learning algorithm.”
Common Practice: Use a mini-batch of 1,000 examples even when storing 10,000 tuples in the replay buffer.
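One plausible way to implement this, assuming the replay buffer is a simple deque of (s, a, R(s), s') tuples; the names and structure here are illustrative, not the lab's exact code.

```python
import random
from collections import deque

# Keep only the 10,000 most recent (s, a, R(s), s') experience tuples.
replay_buffer = deque(maxlen=10_000)

def sample_minibatch(buffer, batch_size=1_000):
    """Draw a random subset of experience tuples for one training step.

    Only batch_size tuples (not all 10,000) are used, trading a bit of
    noise per update for much faster training iterations.
    """
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    states, actions, rewards, next_states = zip(*batch)
    return states, actions, rewards, next_states
```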
Abrupt Changes: “I’ve written out this step here of Set Q equals Q_new. But it turns out that this can make a very abrupt change to Q.”
Risk of Degradation: “If you train a new neural network Q_new, maybe just by chance it is not a very good neural network. Maybe it’s even a little bit worse than the old one; then you just overwrote your Q function with a potentially worse, noisy neural network.”
Gradual Parameter Updates: Instead of completely replacing parameters, blend old and new values.
Mathematical Formulation: Set W = 0.01 * W_new + 0.99 * W and, likewise, B = 0.01 * B_new + 0.99 * B (see the code sketch below).
Interpretation: “We’re going to make W to be 99 percent the old version of W plus one percent of the new version W_new.”
Stability: “This is called a soft update because whenever we train a new neural network W_new, we’re only going to accept a little bit of the new value.”
Risk Mitigation: “Soft update allows you to make a more gradual change to Q or to the neural network parameters W and B that affect your current guess for the Q function Q(s,a).”
Convergence: “Using the soft update method causes the reinforcement learning algorithm to converge more reliably. It makes it less likely that the reinforcement learning algorithm will oscillate or diverge or have other undesirable properties.”
Update Rate: The coefficients 0.01 and 0.99 are hyperparameters that control update aggressiveness.
Constraint: “These two numbers are expected to add up to one.”
Extremes: Setting the new-value coefficient to 1.0 recovers the original abrupt update (W is completely replaced by W_new); choosing an even smaller value makes each change to Q more gradual.
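A minimal sketch of the soft update, assuming q_network (the current Q) and q_new_network (the freshly trained copy) are tf.keras models whose weights are tf.Variable objects; the names and the choice of tau = 0.01 are illustrative.

```python
def soft_update(q_network, q_new_network, tau=0.01):
    """Blend the newly trained parameters into the current Q-network.

    Each weight becomes (1 - tau) * old + tau * new, so only a small
    fraction of the new values is accepted on every update.
    Assumes both networks are tf.keras models with matching layers.
    """
    for w, w_new in zip(q_network.weights, q_new_network.weights):
        w.assign((1.0 - tau) * w + tau * w_new)
```

Setting tau = 1.0 recovers the abrupt “Set Q = Q_new” step, while smaller values of tau make the change to Q more gradual.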
Universal Application: “Mini-batching, which actually applies very well to supervised learning as well, not just reinforcement learning, as well as the idea of soft updates”
Performance Impact: “With these two final refinements to the algorithm… you should be able to get your learning algorithm to work really well on the Lunar Lander.”
Replay Buffer: Store the 10,000 most recent experience tuples
Mini-Batch Size: Use a subset of 1,000 examples for each training step
Soft Update Rate: Typical values like 0.01/0.99 or 0.05/0.95
Combined Effect: Faster, more stable learning
“The Lunar Lander is actually a decently complex, decently challenging application, and so if you can get it to work and land safely on the moon, I think that’s actually really cool, and I hope you enjoy playing with the practice lab.”
These refinements transform the basic DQN algorithm from a theoretically sound but practically challenging approach into a robust, efficient learning system capable of solving complex control problems.
Final Algorithm: DQN with experience replay, improved architecture, ε-greedy exploration, mini-batch training, and soft updates is a practical, widely used approach for discrete-action reinforcement learning problems.
The combination of these techniques addresses the major challenges in deep reinforcement learning: sample efficiency, stability, computational cost, and convergence reliability.
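To show how the pieces fit together, here is a hedged sketch of one training iteration in the common target-network formulation, written with TensorFlow; every name here (train_step, q_network, target_q_network, replay_buffer) is illustrative rather than the lab’s exact code.

```python
import random
import numpy as np
import tensorflow as tf

def train_step(q_network, target_q_network, optimizer, replay_buffer,
               batch_size=1_000, gamma=0.995, tau=0.01):
    """One DQN training iteration: sample a mini-batch from the replay
    buffer, take a gradient step on the Q-network, then softly update
    the target network.

    Assumes both networks are tf.keras models mapping states to one
    Q-value per action, and replay_buffer holds (s, a, reward, s', done)
    tuples.
    """
    # Mini-batch: use only batch_size of the stored tuples this step.
    batch = random.sample(list(replay_buffer),
                          min(batch_size, len(replay_buffer)))
    states, actions, rewards, next_states, dones = zip(*batch)
    states      = np.asarray(states, dtype=np.float32)
    actions     = np.asarray(actions, dtype=np.int32)
    rewards     = np.asarray(rewards, dtype=np.float32)
    next_states = np.asarray(next_states, dtype=np.float32)
    dones       = np.asarray(dones, dtype=np.float32)

    # Targets: y = R(s) + gamma * max_a' Q(s', a'), no bootstrap at terminal states.
    max_next_q = tf.reduce_max(target_q_network(next_states), axis=1)
    y_targets = rewards + gamma * max_next_q * (1.0 - dones)

    with tf.GradientTape() as tape:
        q_values = q_network(states)
        # Pick out Q(s, a) for the action actually taken in each tuple.
        q_of_action = tf.reduce_sum(
            q_values * tf.one_hot(actions, q_values.shape[1]), axis=1)
        loss = tf.reduce_mean(tf.square(y_targets - q_of_action))

    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))

    # Soft update: accept only a small fraction (tau) of the new weights.
    for t_w, w in zip(target_q_network.weights, q_network.weights):
        t_w.assign((1.0 - tau) * t_w + tau * w)
```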