Original Architecture
- Input: 12 numbers (state + action)
- Output: 1 Q value
- Requires: 4 forward passes per decision
- Use case: One Q value at a time
The original architecture required computing Q(s,a) separately for each action:
Input: 12 numbers (state + action)
↓
Hidden Layer 1: 64 units
↓
Hidden Layer 2: 64 units
↓
Output: 1 Q value
Required: 4 separate forward passes per state
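A minimal sketch of this original setup, assuming TensorFlow/Keras and a 4-dimensional one-hot action encoding (consistent with 12 = 8 state values + 4 action values). The names `q_net` and `pick_action` are illustrative, not taken from the course code:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# Original architecture: estimates Q(s, a) for one action at a time.
q_net = Sequential([
    Input(shape=(12,)),           # 8 state values + 4-dim one-hot action (assumed encoding)
    Dense(64, activation="relu"),
    Dense(64, activation="relu"),
    Dense(1),                     # single Q value
])

def pick_action(state):
    """Requires 4 separate forward passes: one per candidate action."""
    q_values = []
    for a in range(4):
        one_hot = np.zeros(4, dtype=np.float32)
        one_hot[a] = 1.0
        x = np.concatenate([state, one_hot])[None, :]   # shape (1, 12)
        q_values.append(float(q_net(x)))                # one forward pass per action
    return int(np.argmax(q_values))
```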
Improved Architecture
Key Improvement: “It turns out to be more efficient to train a single neural network to output all four of these values simultaneously.”
Input: 8 numbers (state only)
↓
Hidden Layer 1: 64 units
↓
Hidden Layer 2: 64 units
↓
Output: 4 units (all Q values)
Output Units:
- Q(s, nothing)
- Q(s, left)
- Q(s, main)
- Q(s, right)
Efficiency Gain: “This turns out to be more efficient because given the state s we can run inference just once and get all four of these values, and then very quickly pick the action a that maximizes Q(s,a).”
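A corresponding sketch of the improved architecture, under the same assumptions (TensorFlow/Keras; illustrative names): one forward pass produces all four Q values, and action selection reduces to a single argmax:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# Improved architecture: all four Q values from a single forward pass.
q_net = Sequential([
    Input(shape=(8,)),            # state only
    Dense(64, activation="relu"),
    Dense(64, activation="relu"),
    Dense(4),                     # Q(s, nothing), Q(s, left), Q(s, main), Q(s, right)
])

def pick_action(state):
    """Single inference, then argmax over the 4 output units."""
    q_values = q_net(state[None, :])          # shape (1, 4)
    return int(tf.argmax(q_values, axis=1)[0])
```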
Additional Benefit: “You notice also in Bellman’s equations, there’s a step in which we have to compute max over a’ Q(s’, a’), this multiplied by gamma and then there was plus R(s) up here.”
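Written out, the right-hand side the quote is describing is the standard Bellman target:

$$y = R(s) + \gamma \max_{a'} Q(s', a')$$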
Computational Advantage: “This neural network also makes it much more efficient to compute this because we’re getting Q(s’, a’) for all actions a’ at the same time. You can then just pick the max to compute this value for the right-hand side of Bellman’s equations.”
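A short sketch of that target computation for a minibatch, again assuming the 4-output Keras network above; the terminal-state masking via `done_flags` and the discount value are standard implementation details not spelled out in the quotes:

```python
import numpy as np

GAMMA = 0.995  # discount factor (illustrative value)

def compute_targets(rewards, next_states, done_flags, q_net):
    """rewards: (N,), next_states: (N, 8), done_flags: (N,) booleans."""
    q_next = q_net(next_states).numpy()      # shape (N, 4): Q(s', a') for all a' at once
    max_q_next = q_next.max(axis=1)          # max over a' for each example
    # No future reward is added beyond a terminal state.
    return rewards + GAMMA * max_q_next * (1.0 - done_flags.astype(np.float32))
```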
The improved architecture eliminates the need for action encoding in the input and instead produces all action values as outputs, making both action selection and Bellman equation computation much more efficient.
“Most implementations of DQN actually use this more efficient architecture that we’ll see in this video” rather than the conceptually simpler but computationally inefficient original approach.
This architectural improvement represents a practical optimization that maintains the same learning objectives while significantly reducing computational overhead, making the algorithm much more viable for real-world applications.