Regularization prevents overfitting by constraining the model. Dropout randomly zeros neurons during training. Weight decay penalizes large weights. Early stopping halts training when validation loss stops improving. Use all of these when fine-tuning on small datasets.
The Overfitting Problem
The training loss keeps decreasing nicely, while the validation loss decreases at first and then starts increasing. That growing gap means the model has memorized the training data and doesn't generalize to new data.
When overfitting happens:
Small dataset, large model
Training too long
Model has too much capacity for the task
No regularization
Dropout
Randomly “drops” (zeros out) neurons during training. Forces the network to not rely on any single neuron.
During training with dropout = 0.1, 10% of activations are randomly zeroed on each forward pass:

Before: [0.5, 0.3, 0.8, 0.2, 0.6]
Mask:   [1,   1,   0,   1,   1  ]   ← 0.8 dropped
After:  [0.5, 0.3, 0.0, 0.2, 0.6]

At inference, no dropout is applied and all neurons are used. In the classic formulation, activations are scaled by (1 - dropout_rate) at inference to compensate; modern frameworks use "inverted" dropout instead, scaling the surviving activations by 1 / (1 - dropout_rate) during training so inference needs no adjustment.

Why it works: without dropout, the network can rely on specific neurons ("neuron 47 always detects cats"), and if neuron 47 is wrong the whole prediction fails. With dropout, any neuron might be missing, so the network must build redundant representations: multiple neurons learn to detect cats, and predictions become more robust.
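A minimal sketch, assuming PyTorch: nn.Dropout zeroes activations in training mode and becomes a no-op in eval mode (PyTorch uses inverted dropout, scaling the survivors by 1/(1-p) during training).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.1)      # zero out 10% of activations during training
x = torch.tensor([0.5, 0.3, 0.8, 0.2, 0.6])

dropout.train()                  # training mode: random zeroing + 1/(1-p) scaling
print(dropout(x))                # some entries zeroed, survivors scaled by 1/0.9

dropout.eval()                   # eval mode: dropout is a no-op
print(dropout(x))                # identical to x
```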
Typical dropout values:
| Model type | Dropout rate |
| --- | --- |
| Transformers | 0.1 (10%) |
| Older MLPs | 0.5 (50%) |
| CNNs | 0.25-0.5 |
| Fine-tuning | 0.1 or lower |
Where to apply:
After attention layers
After FFN layers
Before final classification layer
NOT inside attention computation itself
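An illustrative sketch, assuming PyTorch: a simplified pre-norm transformer block with dropout after the attention output and after the FFN, and none inside the attention computation itself. The dimensions and layer names are made up for the example.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified pre-norm transformer block with dropout in the usual places."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.drop1 = nn.Dropout(dropout)   # after the attention output
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.drop2 = nn.Dropout(dropout)   # after the FFN output

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)              # no dropout inside the attention math here
        x = x + self.drop1(attn_out)                  # residual + dropout
        x = x + self.drop2(self.ffn(self.norm2(x)))   # residual + dropout
        return x

block = TransformerBlock()
out = block(torch.randn(2, 16, 512))                  # (batch, seq_len, d_model)
```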
Weight Decay (L2 Regularization)
Penalizes large weights by adding their squared sum to the loss.
Standard loss:

L = task_loss

With weight decay (L2 regularization):

L = task_loss + λ × Σ(w²)

where λ is the weight decay coefficient (typically 0.01) and the sum runs over all model weights.

Why it works: large weights mean the model is very confident about specific features, which usually means it is memorizing the training data. Penalizing large weights keeps them small, so the model can't overfit to any single feature and instead learns a smoother, more generalizable function with softer decision boundaries.
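A minimal sketch, assuming PyTorch, of adding the L2 penalty to the loss by hand; the stand-in model and data are made up for the example. In practice you let the optimizer handle the decay (see AdamW below) and usually exclude biases and normalization parameters from the penalty.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in model for the example
criterion = nn.CrossEntropyLoss()
weight_decay = 0.01                            # λ

x, y = torch.randn(4, 10), torch.tensor([0, 1, 0, 1])

task_loss = criterion(model(x), y)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = task_loss + weight_decay * l2_penalty   # L = task_loss + λ × Σ(w²)
loss.backward()
```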
AdamW vs Adam with Weight Decay
AdamW: The Right Way
Adam with L2 regularization (the wrong way) folds the decay term into the gradient before the adaptive update:

gradient = task_gradient + λ × w
m, v = update_momentum(gradient)
w = w - lr × m / sqrt(v)

The problem: the weight decay is entangled with the adaptive learning rate, so high-variance parameters get less regularization.

AdamW (the correct way) keeps the gradient clean and applies the decay as a separate step:

gradient = task_gradient                   ← no λ here
m, v = update_momentum(gradient)
w = w - lr × m / sqrt(v) - lr × λ × w      ← decay applied separately

Weight decay is truly decoupled from the adaptive update. This is what you should use.
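A sketch, assuming PyTorch, of setting up AdamW with decoupled weight decay. Splitting parameters so that biases and LayerNorm scales receive no decay is a common convention rather than a requirement, and the stand-in model is made up for the example.

```python
import torch
import torch.nn as nn

# Stand-in model for the example; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(10, 10), nn.LayerNorm(10), nn.Linear(10, 2))

# Common convention: decay 2D weight matrices, skip 1D params (biases, LayerNorm scales).
decay_params = [p for p in model.parameters() if p.ndim >= 2]
no_decay_params = [p for p in model.parameters() if p.ndim < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.01},     # decoupled weight decay
        {"params": no_decay_params, "weight_decay": 0.0},   # no decay on bias / LayerNorm
    ],
    lr=3e-4,
)
```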
Typical values:
Language models: 0.01 - 0.1
Vision models: 0.0001 - 0.01
Fine-tuning: 0.01 (usually the same as pre-training)
Early Stopping
Stop training when validation loss stops improving.
Monitor the validation loss (or another validation metric) and set a patience: the number of epochs to wait for improvement. The training loss keeps decreasing, while the validation loss flattens and then rises; stop at the point where the validation loss stops improving and keep the checkpoint from that point.
Implementation:
```python
best_val_loss = float('inf')
patience_counter = 0
patience = 5  # epochs to wait for improvement

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_checkpoint()  # save best model
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping!")
            break

# Load best checkpoint for final model
load_checkpoint()
```
Typical patience values:
Fine-tuning: 2-3 epochs
Training from scratch: 5-10 epochs
Large models: 3-5 epochs
Label Smoothing
Don’t use hard labels (0 or 1). Spread some probability to other classes.
Hard labels (no smoothing), true class = 2: [0, 0, 1, 0]
Soft labels (smoothing = 0.1), true class = 2: [0.033, 0.033, 0.9, 0.033]

10% of the probability mass is spread across the other classes.
Why it works:
Prevents model from being overconfident
Encourages model to keep some probability for alternatives
Acts as regularization on the output distribution
Typical value: 0.1 (10% smoothing)
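A minimal sketch, assuming PyTorch, whose CrossEntropyLoss accepts a label_smoothing argument. Note that PyTorch mixes the hard label with a uniform distribution over all classes, a slightly different convention from the 0.9 / 0.033 example above.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 4)             # (batch, num_classes)
targets = torch.tensor([2, 0, 1, 3])   # hard class indices

hard_loss = nn.CrossEntropyLoss()(logits, targets)
smooth_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)

# With smoothing = 0.1 and 4 classes, the target distribution gives
# 1 - 0.1 + 0.1/4 to the true class and 0.1/4 to each other class.
print(hard_loss.item(), smooth_loss.item())
```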
Combining Techniques
Regularization techniques stack. Use multiple together.
Typical Transformer Regularization
1. Dropout: 0.1 after attention and FFN layers
2. Weight decay: 0.01 with AdamW
3. Early stopping: patience = 3 on validation loss
4. Label smoothing: 0.1 (for classification)

For fine-tuning, dropout is often reduced, since the pre-trained model is already well regularized.
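A sketch of wiring these together for a small fine-tuning-style run, assuming PyTorch; the model, data, and hyperparameters are toy stand-ins for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier and random data, stand-ins for a real model and dataset.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 4))
x_train, y_train = torch.randn(256, 20), torch.randint(0, 4, (256,))
x_val, y_val = torch.randn(64, 20), torch.randint(0, 4, (64,))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                           # label smoothing
best_val, patience, patience_counter = float("inf"), 3, 0                      # early stopping

for epoch in range(50):
    model.train()                                  # dropout active
    optimizer.zero_grad()
    criterion(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()                                   # dropout off for evaluation
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()

    if val_loss < best_val:
        best_val, patience_counter = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}  # best checkpoint
    else:
        patience_counter += 1
        if patience_counter >= patience:           # stop after 3 epochs without improvement
            break

model.load_state_dict(best_state)                  # restore the best checkpoint
```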
Common combinations:
| Scenario | Regularization |
| --- | --- |
| Pre-training large LLM | Weight decay 0.1, dropout 0.1 |
| Fine-tuning | Weight decay 0.01, dropout 0.1, early stopping |
| Small dataset | Dropout 0.3, weight decay 0.1, data augmentation |
| Large dataset | Minimal: dropout 0.1, weight decay 0.01 |
Debugging Regularization
Still overfitting despite regularization

Symptoms: you added dropout and weight decay, but the train/val gap is still large.

Causes: the regularization is too weak, the model is still too large for the data, or data augmentation is missing.

Debug steps:
1. Increase dropout (0.1 → 0.3)
2. Increase weight decay (0.01 → 0.1)
3. Add data augmentation
4. Use a smaller model
5. Get more data

Underfitting (training loss high)

Symptoms: the training loss is not decreasing enough; the model can't fit the training data.

Causes: too much regularization, dropout too high, weight decay too strong, or the model is too small.

Debug steps:
1. Reduce dropout (0.3 → 0.1)
2. Reduce weight decay (0.1 → 0.01)
3. Remove early stopping temporarily
4. Use a larger model
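A small sketch in plain Python that turns the two checklists above into a quick diagnostic; the thresholds are arbitrary illustrations, not standard values.

```python
def diagnose(train_loss: float, val_loss: float,
             gap_threshold: float = 0.5, high_train_loss: float = 1.0) -> str:
    """Rough heuristic: compare train/val losses and suggest which way to adjust.

    The thresholds are arbitrary and task-dependent; tune them for your setup.
    """
    if train_loss > high_train_loss:
        return "Underfitting: reduce dropout/weight decay or use a larger model."
    if val_loss - train_loss > gap_threshold:
        return "Overfitting: increase dropout/weight decay, add augmentation or more data."
    return "Looks reasonable: train and validation losses are close."

print(diagnose(train_loss=0.2, val_loss=1.1))   # prints the overfitting suggestion
```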
When This Matters
| Situation | What to apply |
| --- | --- |
| Fine-tuning on small dataset | All: dropout, weight decay, early stopping |
| Model overfitting | Increase dropout, weight decay |
| Model underfitting | Decrease regularization |
| Classification task | Add label smoothing |
| Training from scratch | Moderate regularization, data augmentation |
| Using AdamW | Set the weight_decay parameter (not in the loss) |
| Evaluating model | Ensure dropout is off (model.eval()) |
Interview Notes
Interview relevance: 60% of ML interviews.
Production impact: every fine-tuning job.
Performance: preventing overfitting on small datasets.