DQN Techniques Reference

Experience Replay (Core DQN)

Store past experiences and sample random batches for training

How It Works

The Problem

Without replay, the network trains on sequential, correlated experiences. This causes:

Catastrophic forgetting of earlier experiences
Overfitting to recent states
Unstable, oscillating training

The Solution

Store transitions (s, a, r, s') in a buffer and sample random mini-batches for training. This breaks correlation and enables learning from each experience multiple times.

Buffer: {(s₁,a₁,r₁,s'₁), (s₂,a₂,r₂,s'₂), ...} → Sample batch of 32

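import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        # A deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling without replacement; needs len(buffer) >= batch_size
        return random.sample(self.buffer, batch_size)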
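A usage sketch for the buffer above. Sampled batches arrive as a list of tuples, and most training loops regroup them into per-field columns. The class is repeated so the snippet runs on its own; the `__len__` method, the `unpack` helper, and the dummy integer transitions are illustrative assumptions, not part of the original.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def unpack(batch):
    """Regroup a list of transitions into five per-field tuples."""
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones

# Push 8 dummy transitions into a buffer that only holds 5:
# the 3 oldest (states 0, 1, 2) are evicted.
buffer = ReplayBuffer(capacity=5)
for t in range(8):
    buffer.push(state=t, action=t % 2, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=4)
states, actions, rewards, next_states, dones = unpack(batch)
```

In practice, wait until the buffer holds at least `batch_size` transitions before the first call to `sample`, since sampling without replacement fails on a smaller buffer.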
                        

Stability: +80%
Sample Efficiency: +300%
Complexity: Low

When to Use

Use When

Training any DQN (essentially required)
Environment has diverse states
You have memory available

Caution When

Non-stationary environments (old data becomes invalid)
Extremely memory-constrained systems
On-policy algorithms (replayed data comes from older policies, so train on fresh rollouts instead)

Training Impact

Without Experience Replay

Oscillating, unstable training from correlated samples

With Experience Replay

Smooth, stable convergence from decorrelated batches

Target Network (Core DQN)

Use a separate, slowly-updated network for computing Q-targets

How It Works

The Problem

In standard Q-learning, the same network both predicts Q-values AND provides targets. This creates a "moving target" problem:

Each update changes the target we're chasing
Leads to oscillations and divergence
Like a dog chasing its own tail

The Solution

Maintain two networks: Online (updated every step) and Target (frozen, synced periodically). Use the target network to compute TD targets.

Target: r + γ · max_a Q_target(s', a)
Update Q_target ← Q_online every N steps

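import torch

# Initialize both networks with identical weights
q_network = QNetwork()
target_network = QNetwork()
target_network.load_state_dict(q_network.state_dict())

# Compute target using frozen network (no gradient flows into it)
with torch.no_grad():
    next_q = target_network(next_state).max(dim=-1).values
    # done is 1.0 at terminal states, which removes the bootstrap term
    target = reward + gamma * (1 - done) * next_q

# Periodically sync target to online
if step % 1000 == 0:
    target_network.load_state_dict(q_network.state_dict())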
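A framework-free sketch of the two ideas above: the TD target formula and the periodic sync. The `td_target` helper, the dict standing in for network weights, and the sync interval of 3 are assumptions chosen to keep the example small.

```python
import copy

def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step TD target from the target network's Q-values for s'."""
    if done:
        # No future reward after a terminal state, so nothing to bootstrap.
        return reward
    return reward + gamma * max(next_q_values)

# Stand-in for network weights: the online copy changes every step,
# the target copy only changes when it is synced.
online = {"w": 0.0}
target = copy.deepcopy(online)
sync_every = 3
sync_steps = []

for step in range(1, 8):
    online["w"] += 1.0               # stand-in for one gradient update
    if step % sync_every == 0:
        target = copy.deepcopy(online)
        sync_steps.append(step)
```

After seven steps the online weights have moved seven times, while the target still holds the snapshot taken at step 6.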
                        

Stability: +90%
Training Speed: Neutral
Complexity: Low

Pro Tip: The update frequency matters! Syncing every step gives no benefit, because the target is never actually frozen. Syncing too rarely (every 50K steps) leaves the targets stale. Start with 1000 steps and tune.

Training Impact

Without Target Network

Wild oscillations from chasing moving targets

With Target Network

Stable learning with fixed target values

Double DQN (Core DQN)

Decouple action selection from action evaluation to reduce overestimation

Q-Value Comparison

The Problem

Standard DQN uses the same max to both SELECT and EVALUATE the next action:

Standard: y = r + γ · max_a Q_target(s', a)

This causes systematic overestimation: the max operator picks whichever action happens to have the largest positive noise, so the expected maximum of noisy estimates exceeds the true maximum.

The Solution

Use the online network to SELECT the best action, but the target network to EVALUATE it:

Double: a* = argmax_a Q_online(s', a)
y = r + γ · Q_target(s', a*)

# Shown for a single, non-terminal transition

# Standard DQN (tends to overestimate)
target = reward + gamma * target_net(next_state).max()

# Double DQN (reduces overestimation)
best_action = q_net(next_state).argmax()  # Select with online
target = reward + gamma * target_net(next_state)[best_action]  # Evaluate with target
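The overestimation effect can be checked numerically. The sketch below assumes every action has a true value of 0 and that the two estimators have fully independent Gaussian noise; `estimate_bias` and its parameters are illustrative names. In a real agent the online and target networks are correlated, so Double DQN reduces the bias rather than removing it.

```python
import random

def estimate_bias(num_actions=10, noise_std=1.0, trials=20000, seed=0):
    """Average estimated value of the 'best' action when all true Q-values are 0."""
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same (all-zero) Q-values.
        q_a = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q_b = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Standard: select and evaluate with the same estimate.
        single_total += max(q_a)
        # Double: select with estimate A, evaluate with estimate B.
        best = max(range(num_actions), key=lambda a: q_a[a])
        double_total += q_b[best]
    return single_total / trials, double_total / trials

single_bias, double_bias = estimate_bias()
```

Since the true value is 0, `single_bias` is pure overestimation, while `double_bias` stays close to 0 because the noise that drove the selection is independent of the noise in the evaluation.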
                        

Q-Accuracy: +40%
Final Score: +15%
Complexity: Minimal

Training Impact

Standard DQN (Overestimation)

Q-values inflate, leading to suboptimal policies

Double DQN (Accurate Q)

Less inflated Q-values, often better final performance

Prioritized Experience Replay (Advanced)

Sample important transitions more frequently based on TD error

Priority Distribution

The Problem

Uniform random sampling treats all experiences equally, but some are more valuable for learning:

Surprising transitions (high TD error) teach more
Easy transitions waste training time
Rare but important events get undersampled

The Solution

Assign priority based on TD error magnitude. Higher error = more likely to be sampled:

P(i) = (|δᵢ| + ε)^α / Σₖ (|δₖ| + ε)^α

Use importance sampling weights wᵢ = (N · P(i))^(-β), where N is the number of stored transitions, to correct for the bias introduced by non-uniform sampling.

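# Compute TD error for priority (.item() detaches it: priorities carry no gradient)
td_error = abs(target - q_network(state)[action]).item()
priority = (td_error + 0.01) ** 0.6  # epsilon=0.01, alpha=0.6

# Importance sampling weight
# prob_i = priority_i / sum of all priorities, N = number of stored transitions
weight = (N * prob_i) ** (-beta)
loss = weight * (target - prediction) ** 2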
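A worked example of the priority and weight formulas on four transitions, in plain Python. The function names and the TD errors are made up for illustration. Dividing the weights by their maximum is a common convention that keeps them at or below 1, so they only ever scale updates down.

```python
import random

def per_probabilities(td_errors, alpha=0.6, eps=0.01):
    """P(i) = (|delta_i| + eps)^alpha / sum_k (|delta_k| + eps)^alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), rescaled so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]

td_errors = [0.1, 0.1, 2.0, 0.5]          # illustrative values
probs = per_probabilities(td_errors)
weights = importance_weights(probs)

# Draw a batch of indices in proportion to priority (O(n) per draw).
rng = random.Random(0)
batch_indices = rng.choices(range(len(td_errors)), weights=probs, k=2)
```

The transition with TD error 2.0 gets the highest sampling probability and the smallest weight: it is seen more often, but each of its updates counts for less.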
                        

Learning Speed: +40%
Sample Efficiency: +200%
Complexity: Medium

Implementation Note: Use a Sum Tree data structure for O(log n) sampling. The alpha parameter (0.5-0.7) controls how much prioritization matters. Beta should anneal from 0.4 to 1.0 during training.
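A minimal sum tree sketch for the note above. Leaves hold priorities, internal nodes hold the sum of their children, and both updating and sampling walk one root-to-leaf path. The class layout, the method names, and the power-of-two capacity restriction are assumptions made to keep the example short.

```python
import random

class SumTree:
    """Array-backed binary tree: node 1 is the root, leaves start at `capacity`."""

    def __init__(self, capacity):
        # The top-down search below assumes a complete tree.
        assert capacity > 0 and capacity & (capacity - 1) == 0, "use a power of two"
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)

    def update(self, index, priority):
        """Set one leaf and refresh the sums on the path to the root: O(log n)."""
        node = index + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def total(self):
        return self.tree[1]

    def find(self, mass):
        """Return the leaf whose cumulative-priority interval contains `mass`."""
        node = 1
        while node < self.capacity:
            left = 2 * node
            if mass < self.tree[left]:
                node = left
            else:
                mass -= self.tree[left]
                node = left + 1
        return node - self.capacity

tree = SumTree(4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)

# Sample proportionally to priority by drawing a point in [0, total).
rng = random.Random(0)
counts = [0, 0, 0, 0]
for _ in range(10000):
    counts[tree.find(rng.random() * tree.total())] += 1
```

With priorities 1, 2, 3, 4 the leaves own the intervals [0, 1), [1, 3), [3, 6), and [6, 10), so the last leaf is drawn about four times as often as the first.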

Training Impact

Uniform Replay

Slower learning, wastes time on easy samples

Prioritized Replay

Faster convergence, focuses on hard samples

Dueling DQN Architecture (Advanced)

Separate state value and action advantage into two streams

Network Architecture

The Insight

Q-value can be decomposed into two parts:

Value V(s): How good is this state overall?
Advantage A(s,a): How much better is this action than average?

Q(s,a) = V(s) + A(s,a) - mean(A(s,·))

Why It Helps

In many states, the action doesn't matter much (e.g., pole nearly balanced). Dueling can learn "this state is good" without needing to figure out which action is best.

import torch.nn as nn
import torch.nn.functional as F

class DuelingQNetwork(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.features = nn.Linear(4, 128)
        # Value stream
        self.value = nn.Linear(128, 1)
        # Advantage stream
        self.advantage = nn.Linear(128, 2)

    def forward(self, x):
        x = F.relu(self.features(x))
        value = self.value(x)
        advantage = self.advantage(x)
        # Combine: Q = V + (A - mean(A)), averaging over actions only (not the batch)
        return value + advantage - advantage.mean(dim=-1, keepdim=True)
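A small numeric check of the aggregation Q = V + A - mean(A), in plain Python. The `dueling_q` helper and the numbers are illustrative. Subtracting the mean is what pins the decomposition down: shifting every advantage by the same constant leaves Q unchanged, and the mean of Q over actions equals V.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value with per-action advantages."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

q = dueling_q(1.0, [0.5, -0.5])
# Same advantages shifted by +10: the mean absorbs the shift.
q_shifted = dueling_q(1.0, [10.5, 9.5])
mean_q = sum(q) / len(q)
```

Without the mean subtraction, the network could move any constant between V and A without changing Q, so neither stream would have a well-defined meaning.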
                        

Generalization: +25%
Similar Actions: Better
Complexity: Low

Training Impact

Standard Q-Network

Struggles when actions have similar values

Dueling Architecture

Better generalization across states

Noisy Networks (Advanced)

Replace ε-greedy with learnable exploration through noisy weights

Exploration Comparison

The Problem with ε-Greedy

ε-greedy exploration is "dumb" - it explores randomly regardless of state. But intelligent exploration should be:

State-dependent (explore more in uncertain states)
Consistent within episodes (not random each step)
Self-annealing (less as we learn)

The Solution

Add learnable noise to network weights. The network learns WHEN and HOW MUCH to explore:

W = μ_w + σ_w ⊙ ε_w, where ε ~ N(0,1)

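import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()  # required before registering parameters
        # Learnable mean weights (random init; all-zero means are a poor starting point)
        bound = math.sqrt(3 / in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        # Learnable noise scale
        self.sigma_w = nn.Parameter(torch.ones(out_features, in_features) * 0.017)

    def forward(self, x):
        # Sample noise each forward pass
        epsilon = torch.randn_like(self.sigma_w)
        weight = self.mu_w + self.sigma_w * epsilon
        return F.linear(x, weight)  # bias omitted for brevity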
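To see what the noise scale does, the sketch below applies W = μ + σ ⊙ ε to a fixed input many times, using plain lists instead of tensors. The `noisy_linear` function and the specific μ and σ values are assumptions for illustration.

```python
import random

def noisy_linear(x, mu, sigma, rng):
    """y_j = sum_i (mu[j][i] + sigma[j][i] * eps) * x[i], with fresh eps ~ N(0, 1)."""
    out = []
    for mu_row, sigma_row in zip(mu, sigma):
        y = 0.0
        for xi, m, s in zip(x, mu_row, sigma_row):
            y += (m + s * rng.gauss(0.0, 1.0)) * xi
        out.append(y)
    return out

rng = random.Random(0)
x = [1.0, 2.0]
mu = [[0.5, -0.25]]          # one output unit, two inputs
quiet = [[0.0, 0.0]]         # sigma = 0: no exploration noise
loud = [[0.5, 0.5]]          # larger sigma: outputs vary between passes

y_quiet = [noisy_linear(x, mu, quiet, rng)[0] for _ in range(1000)]
y_loud = [noisy_linear(x, mu, loud, rng)[0] for _ in range(1000)]
spread_quiet = max(y_quiet) - min(y_quiet)
spread_loud = max(y_loud) - min(y_loud)
```

With σ = 0 the layer is an ordinary deterministic linear layer, so the amount of exploration is entirely controlled by how large the learned σ values are.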
                        

Exploration: Intelligent
Hyperparameters: -1 (no ε)
Complexity: Medium

Key Benefit: Noisy networks can reduce exploration on their own as they become more confident. The σ parameters are trained by the same loss as the weights, so they can shrink where noise hurts returns, moving from exploration to exploitation without a hand-tuned schedule.

Training Impact

ε-Greedy Exploration

Inconsistent, hyperparameter-sensitive exploration

Noisy Networks

State-dependent, self-annealing exploration

N-Step Returns (Advanced)

Use multiple steps of real rewards before bootstrapping

Reward Propagation

The Trade-off

There's a spectrum between TD (1-step) and Monte Carlo (full episode):

1-step TD: Low variance, high bias (depends on Q accuracy)
Monte Carlo: High variance, zero bias (uses real returns)
N-step: Sweet spot in between

G_t^(n) = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γⁿ⁻¹ r_{t+n-1} + γⁿ V(s_{t+n}), where V(s) = max_a Q_target(s, a) in DQN

Why It Helps

Rewards propagate faster. Instead of backing up one step at a time, N-step connects actions to rewards N steps in the future directly.

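from collections import deque

# Sliding window holding the last N transitions (here N=3)
n_step_buffer = deque(maxlen=3)

def compute_n_step_return(buffer, V, gamma=0.99):
    """V is a callable that returns the bootstrap value of a state."""
    R = 0.0
    for i, (s, a, r, s_, d) in enumerate(buffer):
        R += (gamma ** i) * r
        if d:
            return R  # episode ended inside the window: no bootstrap
    # Add bootstrapped value from the final next state
    R += (gamma ** len(buffer)) * V(s_)
    return R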
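One way to feed N-step returns into a replay buffer is to convert each episode's 1-step transitions into N-step ones before storing them. The sketch below is an assumption about how that could look, not the original's method: `n_step_transitions` emits `(s_t, a_t, G, s_next, done, k)`, where `k` is the number of rewards summed, so the learner can bootstrap with γ^k.

```python
from collections import deque

def n_step_transitions(episode, n=3, gamma=0.99):
    """Turn 1-step (s, a, r, s_next, done) tuples into n-step transitions."""
    window = deque()
    out = []

    def flush_one():
        # Emit the transition that starts at the oldest entry in the window.
        s, a = window[0][0], window[0][1]
        G = sum((gamma ** i) * tr[2] for i, tr in enumerate(window))
        out.append((s, a, G, window[-1][3], window[-1][4], len(window)))
        window.popleft()

    for transition in episode:
        window.append(transition)
        if len(window) == n:
            flush_one()
        if transition[4]:
            # Episode ended: the remaining starts get shorter returns.
            while window:
                flush_one()
    return out

# Four steps with reward 1 each; the last step is terminal.
episode = [
    (0, 0, 1.0, 1, False),
    (1, 0, 1.0, 2, False),
    (2, 0, 1.0, 3, False),
    (3, 0, 1.0, 4, True),
]
result = n_step_transitions(episode, n=3, gamma=0.5)
```

With γ = 0.5 the first transition has G = 1 + 0.5 + 0.25 = 1.75 and is not terminal, so its target would add γ³ times the bootstrap value of state 3.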
                        

Reward Speed: +N×
Typical N: 3-5
Complexity: Medium

Training Impact

1-Step TD (Slow Credit)

Slow reward propagation, especially with sparse rewards

N-Step Returns (Fast Credit)

Faster reward signal, quicker early learning

Gradient Clipping (Stability)

Prevent exploding gradients by capping their magnitude

Gradient Magnitude

The Problem

Large TD errors can cause massive gradients, leading to:

Catastrophic weight updates
Numerical instability (NaN)
Forgetting previously learned knowledge

The Solution

Clip gradient norm before applying the optimizer:

if ||∇|| > max_norm: ∇ = ∇ × (max_norm / ||∇||)

# After loss.backward(), before optimizer.step()
torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=10.0
)
optimizer.step()
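The clipping rule itself is a few lines of arithmetic. This plain-Python sketch applies it to a flat list of gradient values; `clip_by_global_norm` is an illustrative name, and it returns the pre-clip norm alongside the clipped values so the norm can be logged.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; direction is preserved."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Norm 5.0 exceeds max_norm 1.0, so the gradient is scaled by 1/5.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# Norm 0.5 is under the limit, so the gradient is returned unchanged.
untouched, small_norm = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Logging the pre-clip norm during training is a cheap way to choose `max_norm`: if almost every step is clipped, the threshold is acting as a learning-rate reduction rather than a safety net.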
                        

Stability: +50%
Typical max_norm: 1-10
Complexity: 1 line

Training Impact

No Gradient Clipping

Occasional catastrophic drops from exploding gradients

With Gradient Clipping

Protected from sudden performance collapse

Reward Normalization (Stability)

Scale rewards to a consistent range for stable learning

Reward Distribution

The Problem

Different environments have very different reward scales (episode returns shown):

CartPole: 0-500
Atari: 0-10000+

This makes hyperparameter transfer difficult.

Approaches

1. Reward Clipping:

reward = np.clip(reward, -1, 1)

2. Reward Scaling:

reward = reward / max_possible_reward

3. Running Normalization:

# Update running stats
running_mean = 0.99 * running_mean + 0.01 * reward
running_std = ...  # Similar update
reward = (reward - running_mean) / running_std
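A runnable sketch of running normalization. Note the swap: instead of the exponential moving average shown above, it uses Welford's cumulative mean and variance update, which weights all past rewards equally. The `RunningNormalizer` class, its `eps` guard against division by zero, and the sample rewards are assumptions.

```python
import math

class RunningNormalizer:
    """Tracks mean and (population) std of a stream with Welford's algorithm."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0     # not enough data yet: leave rewards unscaled
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        return (x - self.mean) / (self.std + self.eps)

normalizer = RunningNormalizer()
for reward in [1.0, 2.0, 3.0, 4.0, 5.0]:
    normalizer.update(reward)

centered = normalizer.normalize(3.0)
scaled = normalizer.normalize(5.0)
```

An exponential moving average tracks recent rewards more closely, which suits reward distributions that drift as the policy improves; the cumulative version is more stable early on.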
                        

Transferability: High
Stability: +30%
Complexity: Low

Training Impact

Raw Rewards (Varied Scale)

Hyperparameters don't transfer between environments

Normalized Rewards

Consistent learning dynamics across environments

Weight Initialization (Startup)

Smart initialization for faster, more stable early training

Activation Distribution

Why It Matters

Bad initialization causes:

Too small: Vanishing activations/gradients
Too large: Exploding activations, dead ReLUs
Slow or failed training from the start

Initialization Methods

Method	Best For	Formula
Xavier/Glorot	Tanh, Sigmoid	W ~ U(-√(6/(n_in+n_out)), +√(6/(n_in+n_out)))
He/Kaiming	ReLU (default)	W ~ N(0, σ²) with σ = √(2/n_in)
Orthogonal	RNNs, deep nets	W = Q from the QR decomposition of a random Gaussian matrix

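import torch.nn as nn

# PyTorch's nn.Linear default is kaiming_uniform_ with a=sqrt(5), which works out to
# U(-1/sqrt(fan_in), 1/sqrt(fan_in)) rather than the ReLU-gain He init, so be explicit:
for layer in model.modules():
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
        nn.init.zeros_(layer.bias)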
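The table's formulas can be evaluated directly to see the scales involved. This sketch computes the Xavier bound and He standard deviation for the layer sizes used elsewhere in this reference (4 inputs, 128 hidden units) and checks a sampled He-initialized weight vector against the formula. The helper names are illustrative.

```python
import math
import random

def xavier_uniform_bound(fan_in, fan_out):
    """Half-width of the Xavier/Glorot uniform range."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Standard deviation of the He/Kaiming normal distribution for ReLU."""
    return math.sqrt(2.0 / fan_in)

xavier_bound = xavier_uniform_bound(4, 128)   # first layer: 4 -> 128
he_std = he_normal_std(128)                   # second layer: fan_in = 128

# Draw weights with the He std and measure their empirical spread.
rng = random.Random(0)
weights = [rng.gauss(0.0, he_std) for _ in range(20000)]
mean_w = sum(weights) / len(weights)
empirical_std = math.sqrt(sum((w - mean_w) ** 2 for w in weights) / len(weights))
```

For fan_in = 128 the He standard deviation is exactly 0.125, so typical initial weights are on the order of ±0.1; wider layers get proportionally smaller weights.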
                        

Training Impact

Poor Initialization

Slow start, may fail to learn entirely

Proper Initialization

Quick start, stable gradient flow

Learning Rate Scheduling (Optimization)

Adjust learning rate during training for better convergence

LR Over Time

The Idea

Different training phases need different learning rates:

Early: High LR for fast initial progress
Middle: Medium LR for steady learning
Late: Low LR for fine-tuning

Common Schedules

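from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

# Linear decay
lr = lr_start * (1 - episode / total_episodes)

# Step decay: multiply LR by 0.9 after every 1000 scheduler.step() calls
# (this gamma is the LR decay factor, not the RL discount)
scheduler = StepLR(optimizer, step_size=1000, gamma=0.9)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=total_episodes)

# Warmup + decay (decay_factor in (0, 1], e.g. taken from one of the schedules above)
if episode < warmup:
    lr = lr_max * (episode / warmup)
else:
    lr = lr_max * decay_factor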
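A self-contained sketch of the warmup plus decay pattern, with cosine decay filling in the decay phase. The function name `lr_at` and the default values (`lr_max`, `lr_min`, `warmup_steps`) are assumptions; unlike the snippet above, warmup here starts at `lr_max / warmup_steps` rather than 0, so the first update is not wasted.

```python
import math

def lr_at(step, total_steps, lr_max=1e-3, lr_min=1e-5, warmup_steps=100):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(step, total_steps=1000) for step in range(1001)]
```

To apply it, set the value on the optimizer before each update, for example by assigning to `param_group["lr"]` for every group in a PyTorch optimizer's `param_groups`.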
                        

Final Performance: +10-20%
Training Stability: Better
Complexity: Low

Training Impact

Constant Learning Rate

May overshoot late in training, suboptimal convergence

Scheduled Learning Rate

Fine-tunes to better final performance

Exploration Strategies (Exploration)

Different approaches to balance exploration vs exploitation

Action Selection

Strategy Comparison

Strategy	How It Works	Pros	Cons
ε-Greedy	Random action with prob ε	Simple, effective	Explores blindly
Boltzmann	Softmax over Q-values	Q-aware exploration	Needs temperature tuning
UCB	Bonus for uncertainty	Targets uncertain actions	Needs visit counts
Noisy Nets	Learned noise	State-dependent	More complex

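import math
import random
import torch
import torch.nn.functional as F

# ε-Greedy
if random.random() < epsilon:
    action = env.action_space.sample()
else:
    action = q_values.argmax().item()

# Boltzmann (softmax)
probs = F.softmax(q_values / temperature, dim=-1)
action = torch.multinomial(probs, 1).item()

# UCB (visit_counts: per-action tensor; clamp avoids division by zero)
ucb_values = q_values + c * torch.sqrt(math.log(t) / visit_counts.clamp(min=1))
action = ucb_values.argmax().item()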
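In practice ε-greedy is usually paired with a decay schedule, and Boltzmann sampling needs a numerically stable softmax. The plain-Python sketch below shows both; the function names and the schedule defaults (1.0 down to 0.05 over 10000 steps) are assumptions, not recommendations from the original.

```python
import math
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10000):
    """Linearly anneal epsilon, then hold it at eps_end."""
    frac = min(1.0, step / decay_steps)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; subtracting the max avoids overflow in exp."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
probs = boltzmann_probs([1000.0, 1001.0])      # large values, no overflow
cold_probs = boltzmann_probs([1.0, 2.0], temperature=0.1)
```

Lowering the temperature pushes the Boltzmann distribution toward the greedy action, which is the knob the comparison table refers to as temperature tuning.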
                        

Recommendation: Start with ε-greedy (simple and works). Move to Noisy Networks if you want automatic exploration without hyperparameter tuning. Use Boltzmann if Q-values are well-calibrated.

Training Impact

Greedy (No Exploration)

Gets stuck in local optima, never discovers better strategies

With Exploration

Finds better policies by trying actions the greedy policy would never pick

Quick Reference Summary

All techniques at a glance

Technique	Problem Solved	Impact	Complexity	When to Add
Experience Replay	Correlated samples	Essential	Low	Always (Day 1)
Target Network	Moving targets	Essential	Low	Always (Day 1)
Double DQN	Q overestimation	High	Minimal	After basics work
Prioritized Replay	Sample efficiency	High	Medium	Large buffers
Dueling DQN	State vs action value	Medium	Low	Similar action values
Noisy Networks	Dumb exploration	Medium	Medium	Tired of tuning ε
N-Step Returns	Slow credit assignment	Medium	Medium	Sparse rewards
Gradient Clipping	Exploding gradients	High stability	Minimal	Always (1 line)
Reward Normalization	Scale variance	Medium	Low	Multi-environment
Weight Initialization	Bad starting point	Medium	Low	Training fails to start
LR Scheduling	Suboptimal convergence	Medium	Low	After basic tuning
