Reinforcement Learning Programming
Programming Assignment: Reinforcement Learning
Assignment Overview
Objective
Goal: “Train an agent to land a lunar lander safely on a landing pad on the surface of the moon.”
Method: Implement Deep Q-Learning with Experience Replay using TensorFlow/Keras.
Environment Setup
Required Packages
```python
import numpy as np
from collections import deque, namedtuple
import gym
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.losses import MSE
from tensorflow.keras.optimizers import Adam
```
Hyperparameters
```python
MEMORY_SIZE = 100_000        # size of memory buffer
GAMMA = 0.995                # discount factor
ALPHA = 1e-3                 # learning rate
NUM_STEPS_FOR_UPDATE = 4     # perform a learning update every C time steps
```
Lunar Lander Environment Details
State Space (8 variables)
- (x,y) coordinates relative to landing pad at (0,0)
- Linear velocities (ẋ,ẏ)
- Angle θ and angular velocity θ̇
- Boolean variables l,r for leg ground contact
Action Space (4 discrete actions)
- Do nothing = 0
- Fire right engine = 1
- Fire main engine = 2
- Fire left engine = 3
Reward Structure
- Distance: Closer to landing pad = higher reward
- Speed: Slower movement = higher reward
- Orientation: Less tilt = higher reward
- Ground contact: +10 points per leg touching ground
- Fuel penalties: -0.03 for side engines, -0.3 for main engine
- Landing outcome: +100 for safe landing, -100 for crash
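To make the state, action, and reward description above concrete, here is a minimal inspection sketch. It assumes the Gym environment id 'LunarLander-v2' and the classic 4-tuple step API used in the training loop later in this assignment.

```python
import gym

# Create the environment (id assumed to be 'LunarLander-v2')
env = gym.make('LunarLander-v2')

print(env.observation_space)   # 8-dimensional box: x, y, ẋ, ẏ, θ, θ̇, l, r
print(env.action_space)        # Discrete(4): do nothing, right, main, left engine

state = env.reset()                          # initial 8-element state vector
next_state, reward, done, _ = env.step(2)    # fire the main engine once
print(reward)                                # includes the -0.3 main-engine fuel penalty
```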
Exercise 1: Network Architecture
Create Q-Network and Target Q-Network with identical architectures:
```python
# Required network architecture:
#   Input layer:  state_size (8 numbers)
#   Dense layer:  64 units, relu activation
#   Dense layer:  64 units, relu activation
#   Output layer: num_actions units (4), linear activation

q_network = Sequential([
    Input(shape=state_size),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])

optimizer = Adam(learning_rate=ALPHA)
```
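The exercise also calls for a Target Q-Network with the same architecture. One way to set it up, sketched below, is to mirror the layers above and copy the Q-network's initial weights (the weight copy is a common Deep Q-Learning convention, not something shown in the snippet above):

```python
# Target Q-network: identical architecture to q_network
target_q_network = Sequential([
    Input(shape=state_size),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])

# Start both networks from the same parameters (common DQN initialization)
target_q_network.set_weights(q_network.get_weights())
```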
Exercise 2: Loss Function Implementation
Implement the compute_loss function using the Bellman equation (y = R if the episode terminates, otherwise y = R + γ max Q̂(s′, a′)):
```python
def compute_loss(experiences, gamma, q_network, target_q_network):
    # Unpack the mini-batch of experience tuples
    states, actions, rewards, next_states, done_vals = experiences

    # Compute max Q^(s,a) from the target network
    max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)

    # Set y targets using the Bellman equation
    # y = R if episode terminates, otherwise y = R + γ max Q^(s,a)
    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))

    # Get the Q values for the actions actually taken
    q_values = q_network(states)
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))

    # Compute the mean squared error loss
    loss = MSE(y_targets, q_values)
    return loss
```
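For context, compute_loss is consumed by the agent_learn step referenced in the training loop below. A sketch of that step is shown here; the helper utils.update_target_network is an assumption about the provided scaffolding, standing in for whatever soft-update routine the notebook supplies.

```python
@tf.function
def agent_learn(experiences, gamma):
    """One learning update: gradient step on the Q-network, then a soft target update."""
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)

    # Backpropagate the loss through the Q-network only
    gradients = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))

    # Blend the target network toward the Q-network (assumed helper name)
    utils.update_target_network(q_network, target_q_network)
```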
Key Implementation Details
- Terminal states: Use (1 - done_vals) to handle episode termination
- Target calculation: Immediate reward + discounted future return
- Action indexing: Extract Q values for the specific actions taken (a small worked example follows this list)
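The action-indexing step is easiest to see on a tiny made-up batch; the values below are illustrative only.

```python
import tensorflow as tf

q_values = tf.constant([[0.1, 0.5, -0.2, 0.0],    # Q values for the first state
                        [0.3, 0.1,  0.9, 0.4]])   # Q values for the second state
actions = tf.constant([1, 2])                     # actions actually taken

# Pair each row index with the action taken in that row, then gather Q(s, a)
indices = tf.stack([tf.range(q_values.shape[0]), actions], axis=1)
print(tf.gather_nd(q_values, indices))            # -> [0.5, 0.9]
```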
Training Algorithm Structure
Main Training Loop
```python
# Simplified training structure:
for episode in range(num_episodes):
    state = env.reset()

    for timestep in range(max_timesteps):
        # ε-greedy action selection (Q values for the current state are needed first)
        q_values = q_network(np.expand_dims(state, axis=0))
        action = utils.get_action(q_values, epsilon)

        # Environment interaction
        next_state, reward, done, _ = env.step(action)

        # Store experience in the replay buffer
        memory_buffer.append(experience(state, action, reward, next_state, done))

        # Learning update every C steps (NUM_STEPS_FOR_UPDATE), once the buffer has enough samples
        if update_conditions_met:
            experiences = utils.get_experiences(memory_buffer)
            agent_learn(experiences, GAMMA)

        state = next_state
        if done:
            break
```
Key Algorithm Components
Experience Replay
- Buffer size: 100,000 most recent experiences
- Mini-batch: Sample random subset for training
- Experience tuple: (state, action, reward, next_state, done), as sketched below
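A minimal sketch of these two data structures, using the deque and namedtuple imports from the setup section (the field names simply follow the experience tuple listed above):

```python
# Named tuple for one transition and a bounded FIFO buffer for storage
experience = namedtuple("Experience",
                        field_names=["state", "action", "reward", "next_state", "done"])
memory_buffer = deque(maxlen=MEMORY_SIZE)   # oldest experiences are dropped when full

# Inside the training loop, one transition is stored as:
#   memory_buffer.append(experience(state, action, reward, next_state, done))
```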
ε-Greedy Exploration
- Initial ε: High exploration (e.g., ε = 1.0)
- Decay schedule: Gradually reduce toward a minimum (e.g., 0.01), as sketched after this list
- Balance: Exploration vs exploitation trade-off
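One schedule consistent with these bullets is sketched below, together with a possible get_action helper; the decay factor 0.995 and the function bodies are assumptions, not code given by the assignment.

```python
import random
import numpy as np

E_MIN = 0.01      # minimum exploration rate (matches the bullet above)
E_DECAY = 0.995   # multiplicative decay per episode (assumed value)

def get_new_eps(epsilon):
    """Decay ε toward E_MIN after each episode."""
    return max(E_MIN, E_DECAY * epsilon)

def get_action(q_values, epsilon):
    """ε-greedy: explore with probability ε, otherwise pick the greedy action."""
    if random.random() > epsilon:
        return int(np.argmax(q_values.numpy()[0]))   # exploit: argmax over Q values
    return random.choice(np.arange(4))               # explore: random one of 4 actions

epsilon = 1.0                    # start fully exploratory
epsilon = get_new_eps(epsilon)   # called once per episode
```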
Target Network Updates
- Soft updates: Gradual parameter blending (see the sketch after this list)
- Stability: Prevents moving target problem
- Update frequency: Every C time steps
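A sketch of the soft update itself, assuming a small interpolation factor TAU (the value 1e-3 is a typical choice, not stated above):

```python
TAU = 1e-3   # interpolation factor: how quickly the target tracks the Q-network

def update_target_network(q_network, target_q_network):
    """Soft update: w_target ← τ·w_q + (1 − τ)·w_target for every weight tensor."""
    for target_w, q_w in zip(target_q_network.weights, q_network.weights):
        target_w.assign(TAU * q_w + (1.0 - TAU) * target_w)
```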
Expected Performance
Training Duration
Estimated time: 10-15 minutes with default parameters
Success Criteria
Environment solved: Average of 200 points over the last 100 episodes
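A sketch of how that check might look, assuming episode totals are accumulated in a list named total_point_history (the name is illustrative):

```python
import numpy as np

av_latest_points = np.mean(total_point_history[-100:])
if av_latest_points >= 200.0:
    print(f"Environment solved: {av_latest_points:.2f} average points over the last 100 episodes")
```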
Convergence Indicators
- Increasing average scores over time
- Successful landings in test videos
- Stable learning curve progression
Implementation Tips
Section titled “Implementation Tips”Network Architecture Benefits
- Efficiency: Single forward pass produces all Q values
- Action selection: Direct argmax over outputs
- Bellman updates: Efficient max operation
Hyperparameter Sensitivity
Debugging Approaches
- Monitor average episode scores
- Check for stable Q value convergence
- Verify successful landings in test videos
- Validate experience replay buffer functionality
This programming assignment provides hands-on experience with deep reinforcement learning, combining neural networks with RL-specific techniques like experience replay and ε-greedy exploration to solve a challenging control problem with a continuous state space.