
Reinforcement Learning Programming

Programming Assignment: Reinforcement Learning


Goal: “Train an agent to land a lunar lander safely on a landing pad on the surface of the moon.”

Method: Implement Deep Q-Learning with Experience Replay using TensorFlow/Keras.

imports.py
import numpy as np
from collections import deque, namedtuple
import gym
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.losses import MSE
from tensorflow.keras.optimizers import Adam
hyperparameters.py
MEMORY_SIZE = 100_000 # size of memory buffer
GAMMA = 0.995 # discount factor
ALPHA = 1e-3 # learning rate
NUM_STEPS_FOR_UPDATE = 4 # perform learning update every C time steps
State vector (8 values):
  • (x, y) coordinates relative to the landing pad at (0, 0)
  • Linear velocities (ẋ, ẏ)
  • Angle θ and angular velocity θ̇
  • Boolean flags l, r indicating left/right leg ground contact
Action space (4 discrete actions):
  • Do nothing = 0
  • Fire right engine = 1
  • Fire main engine = 2
  • Fire left engine = 3
Reward structure (an environment setup sketch follows this list):
  • Distance: closer to the landing pad = higher reward
  • Speed: slower movement = higher reward
  • Orientation: less tilt = higher reward
  • Ground contact: +10 points per leg touching the ground
  • Fuel penalties: -0.03 points per frame for a side engine, -0.3 points per frame for the main engine
  • Landing outcome: +100 for a safe landing, -100 for a crash
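The lists above describe the LunarLander-v2 environment from OpenAI Gym. A minimal setup sketch that also defines the state_size and num_actions used in the network code below (the file label is illustrative):

environment_setup.py
# Create the environment and read off the state/action dimensions
env = gym.make('LunarLander-v2')
state_size = env.observation_space.shape  # (8,): the 8 state variables listed above
num_actions = env.action_space.n          # 4 discrete actions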

Create a Q-network and a target Q-network with identical architectures (a sketch of the target network follows the Q-network code):

network_creation.py
# Network Architecture Required:
# Input layer: state_size (8 numbers)
# Dense layer: 64 units, relu activation
# Dense layer: 64 units, relu activation
# Output layer: num_actions units (4), linear activation
q_network = Sequential([
    Input(shape=state_size),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])
optimizer = Adam(learning_rate=ALPHA)
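The target Q-network is not shown above. A minimal sketch, assuming the two networks simply start from the same weights (the file label and the initialization step are assumptions, not part of the provided starter code):

target_network.py
# Target Q-network: same architecture as q_network
target_q_network = Sequential([
    Input(shape=state_size),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])
# Assumed initialization: copy the Q-network weights into the target network
target_q_network.set_weights(q_network.get_weights())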

Implement the compute_loss function using the Bellman equation:

loss_function.py
def compute_loss(experiences, gamma, q_network, target_q_network):
    states, actions, rewards, next_states, done_vals = experiences
    # Compute max_a' Q^(s',a') from the target network
    max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)
    # Set the y targets using the Bellman equation:
    # y = R if the episode terminates, otherwise y = R + γ max_a' Q^(s',a')
    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))
    # Get the Q values for the actions actually taken and compute the loss
    q_values = q_network(states)
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))
    loss = MSE(y_targets, q_values)
    return loss
  • Terminal states: Use (1 - done_vals) to handle episode termination
  • Target calculation: Immediate reward + discounted future return
  • Action indexing: Extract Q values for specific actions taken
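The training loop below appends experience tuples to a memory_buffer. A minimal setup consistent with the imports and MEMORY_SIZE above (the exact definitions may differ from the provided starter code; the file label is illustrative):

replay_buffer.py
# Experience tuple and bounded replay buffer (names assumed to match the loop below)
experience = namedtuple("Experience",
                        field_names=["state", "action", "reward", "next_state", "done"])
memory_buffer = deque(maxlen=MEMORY_SIZE)  # keeps only the most recent 100,000 experiences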
training_loop.py
# Simplified training structure:
for episode in range(num_episodes):
    state = env.reset()
    for timestep in range(max_timesteps):
        # ε-greedy action selection based on the current Q-network outputs
        q_values = q_network(np.expand_dims(state, axis=0))
        action = utils.get_action(q_values, epsilon)
        # Environment interaction
        next_state, reward, done, _ = env.step(action)
        # Store experience in the replay buffer
        memory_buffer.append(experience(state, action, reward, next_state, done))
        # Learning update every C time steps (when enough experiences are available)
        if update_conditions_met:
            experiences = utils.get_experiences(memory_buffer)
            agent_learn(experiences, GAMMA)
        state = next_state
        if done:
            break
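agent_learn is called above but not defined in these notes. A minimal sketch of one learning update, assuming a soft-update rate TAU (an illustrative name and value) and the compute_loss function from earlier:

agent_learn.py
TAU = 1e-3  # soft-update rate (illustrative value)

def agent_learn(experiences, gamma):
    # Compute the loss and its gradients with respect to the Q-network weights
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)
    gradients = tape.gradient(loss, q_network.trainable_variables)
    # Apply one optimizer step to the Q-network
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))
    # Soft-update the target network toward the Q-network
    for target_w, w in zip(target_q_network.weights, q_network.weights):
        target_w.assign(TAU * w + (1.0 - TAU) * target_w)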
Experience replay:
  • Buffer size: 100,000 most recent experiences
  • Mini-batch: sample a random subset of experiences for each training update
  • Experience tuple: (state, action, reward, next_state, done)
ε-greedy exploration (see the sketch after this list):
  • Initial ε: high exploration (e.g., ε = 1.0)
  • Decay schedule: gradually reduce toward a minimum value (e.g., 0.01)
  • Balance: exploration vs. exploitation trade-off
Target network:
  • Soft updates: gradual blending of Q-network parameters into the target network
  • Stability: prevents the moving-target problem
  • Update frequency: every C time steps
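utils.get_action and the decay schedule are provided by the course. A sketch of the kind of ε-greedy rule and decay they presumably implement (E_MIN and E_DECAY are illustrative names and values):

epsilon_greedy.py
E_MIN = 0.01     # minimum exploration rate (illustrative value)
E_DECAY = 0.995  # per-episode decay factor (illustrative value)

def get_action(q_values, epsilon):
    # With probability ε pick a random action; otherwise act greedily on the Q values
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(q_values.numpy()[0]))

def decay_epsilon(epsilon):
    # Reduce ε toward its minimum after each episode
    return max(E_MIN, E_DECAY * epsilon)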

Estimated training time: 10-15 minutes with the default parameters

Environment solved: an average score of 200 points over the last 100 episodes
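One way to check this criterion inside the training loop is to keep a history of episode scores and compare their 100-episode running average against 200. A sketch (total_point_history and total_points are assumed names, not from the starter code):

solved_check.py
# At the end of each episode: record the score and test the solved criterion
# total_point_history: a plain Python list of episode scores (assumed name)
total_point_history.append(total_points)
avg_latest_points = np.mean(total_point_history[-100:])
if avg_latest_points >= 200.0:
    print(f"Environment solved in {episode + 1} episodes!")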

Signs of successful training:
  • Increasing average scores over time
  • Successful landings in test videos
  • Stable progression of the learning curve
Q-network output design (one output unit per action):
  • Efficiency: a single forward pass produces Q values for all actions
  • Action selection: a direct argmax over the outputs
  • Bellman updates: an efficient max operation over the next-state Q values
Monitoring checklist (a greedy evaluation rollout is sketched after this list):
  • Monitor average episode scores
  • Check for stable Q-value convergence
  • Verify successful landings in test videos
  • Validate experience replay buffer functionality
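To verify landings after training, one can roll out a few purely greedy episodes with the trained network. A minimal sketch (rendering and video recording omitted; the file label is illustrative):

evaluation.py
# Greedy evaluation rollouts with the trained Q-network
for episode in range(5):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        q_values = q_network(np.expand_dims(state, axis=0))
        action = int(np.argmax(q_values.numpy()[0]))
        state, reward, done, _ = env.step(action)
        total_reward += reward
    print(f"Episode {episode + 1}: score = {total_reward:.1f}")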

This programming assignment provides hands-on experience with state-of-the-art deep reinforcement learning, combining neural networks with RL-specific techniques like experience replay and ε-greedy exploration to solve a challenging continuous control problem.