
Reinforcement Learning Programming

Programming Assignment: Reinforcement Learning


Goal: “Train an agent to land a lunar lander safely on a landing pad on the surface of the moon.”

Method: Implement Deep Q-Learning with Experience Replay using TensorFlow/Keras.

imports.py
import numpy as np
from collections import deque, namedtuple
import gym
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.losses import MSE
from tensorflow.keras.optimizers import Adam
hyperparameters.py
MEMORY_SIZE = 100_000 # size of memory buffer
GAMMA = 0.995 # discount factor
ALPHA = 1e-3 # learning rate
NUM_STEPS_FOR_UPDATE = 4 # perform learning update every C time steps
State vector (8 values):
  • (x, y) coordinates relative to the landing pad at (0, 0)
  • Linear velocities (ẋ, ẏ)
  • Angle θ and angular velocity θ̇
  • Boolean flags l, r indicating left/right leg ground contact
Action space (4 discrete actions):
  • Do nothing = 0
  • Fire right engine = 1
  • Fire main engine = 2
  • Fire left engine = 3
Reward structure (an environment setup sketch follows this list):
  • Distance: closer to the landing pad = higher reward
  • Speed: slower movement = higher reward
  • Orientation: less tilt = higher reward
  • Ground contact: +10 points per leg touching the ground
  • Fuel penalties: -0.03 points per frame for a side engine, -0.3 points per frame for the main engine
  • Landing outcome: +100 for a safe landing, -100 for a crash
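The lists above describe the LunarLander-v2 environment from OpenAI Gym. A minimal setup sketch that also defines the state_size and num_actions used in the network code below (the file label is illustrative):

environment_setup.py
# Create the environment and read off the state/action dimensions
env = gym.make('LunarLander-v2')
state_size = env.observation_space.shape  # (8,): the 8 state variables listed above
num_actions = env.action_space.n          # 4 discrete actions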

Create a Q-network and a target Q-network with identical architectures (a sketch of the target network follows the Q-network code):

network_creation.py
# Network Architecture Required:
# Input layer: state_size (8 numbers)
# Dense layer: 64 units, relu activation
# Dense layer: 64 units, relu activation
# Output layer: num_actions units (4), linear activation
q_network = Sequential([
    Input(shape=state_size),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])
optimizer = Adam(learning_rate=ALPHA)
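The target Q-network is not shown above. A minimal sketch, assuming the two networks simply start from the same weights (the file label and the initialization step are assumptions, not part of the provided starter code):

target_network.py
# Target Q-network: same architecture as q_network
target_q_network = Sequential([
    Input(shape=state_size),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])
# Assumed initialization: copy the Q-network weights into the target network
target_q_network.set_weights(q_network.get_weights())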

Implement the compute_loss function using the Bellman equation:

loss_function.py
def compute_loss(experiences, gamma, q_network, target_q_network):
    states, actions, rewards, next_states, done_vals = experiences
    # Compute max_a' Q^(s',a') from the target network
    max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)
    # Set the y targets using the Bellman equation:
    # y = R if the episode terminates, otherwise y = R + γ max_a' Q^(s',a')
    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))
    # Get the Q values for the actions actually taken and compute the loss
    q_values = q_network(states)
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))
    loss = MSE(y_targets, q_values)
    return loss
  • Terminal states: Use (1 - done_vals) to handle episode termination
  • Target calculation: Immediate reward + discounted future return
  • Action indexing: Extract Q values for specific actions taken
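The training loop below appends experience tuples to a memory_buffer. A minimal setup consistent with the imports and MEMORY_SIZE above (the exact definitions may differ from the provided starter code; the file label is illustrative):

replay_buffer.py
# Experience tuple and bounded replay buffer (names assumed to match the loop below)
experience = namedtuple("Experience",
                        field_names=["state", "action", "reward", "next_state", "done"])
memory_buffer = deque(maxlen=MEMORY_SIZE)  # keeps only the most recent 100,000 experiences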
training_loop.py
# Simplified training structure:
for episode in range(num_episodes):
    state = env.reset()
    for timestep in range(max_timesteps):
        # ε-greedy action selection based on the current Q-network outputs
        q_values = q_network(np.expand_dims(state, axis=0))
        action = utils.get_action(q_values, epsilon)
        # Environment interaction
        next_state, reward, done, _ = env.step(action)
        # Store experience in the replay buffer
        memory_buffer.append(experience(state, action, reward, next_state, done))
        # Learning update every C time steps (when enough experiences are available)
        if update_conditions_met:
            experiences = utils.get_experiences(memory_buffer)
            agent_learn(experiences, GAMMA)
        state = next_state
        if done:
            break
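agent_learn is called above but not defined in these notes. A minimal sketch of one learning update, assuming a soft-update rate TAU (an illustrative name and value) and the compute_loss function from earlier:

agent_learn.py
TAU = 1e-3  # soft-update rate (illustrative value)

def agent_learn(experiences, gamma):
    # Compute the loss and its gradients with respect to the Q-network weights
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)
    gradients = tape.gradient(loss, q_network.trainable_variables)
    # Apply one optimizer step to the Q-network
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))
    # Soft-update the target network toward the Q-network
    for target_w, w in zip(target_q_network.weights, q_network.weights):
        target_w.assign(TAU * w + (1.0 - TAU) * target_w)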
Experience replay:
  • Buffer size: 100,000 most recent experiences
  • Mini-batch: sample a random subset of experiences for each training update
  • Experience tuple: (state, action, reward, next_state, done)
ε-greedy exploration (see the sketch after this list):
  • Initial ε: high exploration (e.g., ε = 1.0)
  • Decay schedule: gradually reduce toward a minimum value (e.g., 0.01)
  • Balance: exploration vs. exploitation trade-off
Target network:
  • Soft updates: gradual blending of Q-network parameters into the target network
  • Stability: prevents the moving-target problem
  • Update frequency: every C time steps
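utils.get_action and the decay schedule are provided by the course. A sketch of the kind of ε-greedy rule and decay they presumably implement (E_MIN and E_DECAY are illustrative names and values):

epsilon_greedy.py
E_MIN = 0.01     # minimum exploration rate (illustrative value)
E_DECAY = 0.995  # per-episode decay factor (illustrative value)

def get_action(q_values, epsilon):
    # With probability ε pick a random action; otherwise act greedily on the Q values
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(q_values.numpy()[0]))

def decay_epsilon(epsilon):
    # Reduce ε toward its minimum after each episode
    return max(E_MIN, E_DECAY * epsilon)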

Estimated training time: 10-15 minutes with the default parameters

Environment solved: an average score of 200 points over the last 100 episodes
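One way to check this criterion inside the training loop is to keep a history of episode scores and compare their 100-episode running average against 200. A sketch (total_point_history and total_points are assumed names, not from the starter code):

solved_check.py
# At the end of each episode: record the score and test the solved criterion
# total_point_history: a plain Python list of episode scores (assumed name)
total_point_history.append(total_points)
avg_latest_points = np.mean(total_point_history[-100:])
if avg_latest_points >= 200.0:
    print(f"Environment solved in {episode + 1} episodes!")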

Signs of successful training:
  • Increasing average scores over time
  • Successful landings in test videos
  • Stable progression of the learning curve
Q-network output design (one output unit per action):
  • Efficiency: a single forward pass produces Q values for all actions
  • Action selection: a direct argmax over the outputs
  • Bellman updates: an efficient max operation over the next-state Q values
Monitoring checklist (a greedy evaluation rollout is sketched after this list):
  • Monitor average episode scores
  • Check for stable Q-value convergence
  • Verify successful landings in test videos
  • Validate experience replay buffer functionality
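To verify landings after training, one can roll out a few purely greedy episodes with the trained network. A minimal sketch (rendering and video recording omitted; the file label is illustrative):

evaluation.py
# Greedy evaluation rollouts with the trained Q-network
for episode in range(5):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        q_values = q_network(np.expand_dims(state, axis=0))
        action = int(np.argmax(q_values.numpy()[0]))
        state, reward, done, _ = env.step(action)
        total_reward += reward
    print(f"Episode {episode + 1}: score = {total_reward:.1f}")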

This programming assignment provides hands-on experience with state-of-the-art deep reinforcement learning, combining neural networks with RL-specific techniques like experience replay and ε-greedy exploration to solve a challenging continuous control problem.