Regularization prevents overfitting by constraining the model. Dropout randomly zeros neurons during training. Weight decay penalizes large weights. Early stopping halts training when validation loss stops improving. Use all of these when fine-tuning on small datasets.
The Overfitting Problem
The training loss keeps decreasing nicely, while the validation loss decreases at first and then starts increasing. That growing gap means the model has memorized the training data and doesn't generalize to new data.
When overfitting happens:
Small dataset, large model
Training too long
Model has too much capacity for the task
No regularization
Dropout
Randomly “drops” (zeros out) neurons during training. Forces the network to not rely on any single neuron.
During training with dropout = 0.1, 10% of activations are randomly zeroed on each forward pass:

Before: [0.5, 0.3, 0.8, 0.2, 0.6]
Mask:   [1,   1,   0,   1,   1  ]   ← 0.8 dropped
After:  [0.5, 0.3, 0.0, 0.2, 0.6]

At inference, no dropout is applied and all neurons are used. In the classic formulation, activations are scaled by (1 - dropout_rate) at inference to compensate; modern frameworks use "inverted" dropout instead, scaling the surviving activations by 1 / (1 - dropout_rate) during training so inference needs no adjustment.

Why it works: without dropout, the network can rely on specific neurons ("neuron 47 always detects cats"), and if neuron 47 is wrong the whole prediction fails. With dropout, any neuron might be missing, so the network must build redundant representations: multiple neurons learn to detect cats, and predictions become more robust.
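A minimal sketch, assuming PyTorch: nn.Dropout zeroes activations in training mode and becomes a no-op in eval mode (PyTorch uses inverted dropout, scaling the survivors by 1/(1-p) during training).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.1)      # zero out 10% of activations during training
x = torch.tensor([0.5, 0.3, 0.8, 0.2, 0.6])

dropout.train()                  # training mode: random zeroing + 1/(1-p) scaling
print(dropout(x))                # some entries zeroed, survivors scaled by 1/0.9

dropout.eval()                   # eval mode: dropout is a no-op
print(dropout(x))                # identical to x
```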
Typical dropout values:
| Model type | Dropout rate |
| --- | --- |
| Transformers | 0.1 (10%) |
| Older MLPs | 0.5 (50%) |
| CNNs | 0.25-0.5 |
| Fine-tuning | 0.1 or lower |
Where to apply:
After attention layers
After FFN layers
Before final classification layer
NOT inside attention computation itself
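An illustrative sketch, assuming PyTorch: a simplified pre-norm transformer block with dropout after the attention output and after the FFN, and none inside the attention computation itself. The dimensions and layer names are made up for the example.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified pre-norm transformer block with dropout in the usual places."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.drop1 = nn.Dropout(dropout)   # after the attention output
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.drop2 = nn.Dropout(dropout)   # after the FFN output

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)              # no dropout inside the attention math here
        x = x + self.drop1(attn_out)                  # residual + dropout
        x = x + self.drop2(self.ffn(self.norm2(x)))   # residual + dropout
        return x

block = TransformerBlock()
out = block(torch.randn(2, 16, 512))                  # (batch, seq_len, d_model)
```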
Weight Decay (L2 Regularization)
Penalizes large weights by adding their squared sum to the loss.
Standard loss:

L = task_loss

With weight decay (L2 regularization):

L = task_loss + λ × Σ(w²)

where λ is the weight decay coefficient (typically 0.01) and the sum runs over all model weights.

Why it works: large weights mean the model is very confident about specific features, which usually means it is memorizing the training data. Penalizing large weights keeps them small, so the model can't overfit to any single feature and instead learns a smoother, more generalizable function with softer decision boundaries.
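A minimal sketch, assuming PyTorch, of adding the L2 penalty to the loss by hand; the stand-in model and data are made up for the example. In practice you let the optimizer handle the decay (see AdamW below) and usually exclude biases and normalization parameters from the penalty.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in model for the example
criterion = nn.CrossEntropyLoss()
weight_decay = 0.01                            # λ

x, y = torch.randn(4, 10), torch.tensor([0, 1, 0, 1])

task_loss = criterion(model(x), y)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = task_loss + weight_decay * l2_penalty   # L = task_loss + λ × Σ(w²)
loss.backward()
```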
AdamW vs Adam with Weight Decay
AdamW: The Right Way
Adam with L2 regularization (the wrong way) folds the decay term into the gradient before the adaptive update:

gradient = task_gradient + λ × w
m, v = update_momentum(gradient)
w = w - lr × m / sqrt(v)

The problem: the weight decay is entangled with the adaptive learning rate, so high-variance parameters get less regularization.

AdamW (the correct way) keeps the gradient clean and applies the decay as a separate step:

gradient = task_gradient                   ← no λ here
m, v = update_momentum(gradient)
w = w - lr × m / sqrt(v) - lr × λ × w      ← decay applied separately

Weight decay is truly decoupled from the adaptive update. This is what you should use.
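A sketch, assuming PyTorch, of setting up AdamW with decoupled weight decay. Splitting parameters so that biases and LayerNorm scales receive no decay is a common convention rather than a requirement, and the stand-in model is made up for the example.

```python
import torch
import torch.nn as nn

# Stand-in model for the example; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(10, 10), nn.LayerNorm(10), nn.Linear(10, 2))

# Common convention: decay 2D weight matrices, skip 1D params (biases, LayerNorm scales).
decay_params = [p for p in model.parameters() if p.ndim >= 2]
no_decay_params = [p for p in model.parameters() if p.ndim < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.01},     # decoupled weight decay
        {"params": no_decay_params, "weight_decay": 0.0},   # no decay on bias / LayerNorm
    ],
    lr=3e-4,
)
```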
Typical values:
Language models: 0.01 - 0.1
Vision models: 0.0001 - 0.01
Fine-tuning: 0.01 (usually the same as pre-training)
Early Stopping
Stop training when validation loss stops improving.
Monitor the validation loss (or another validation metric) and set a patience: the number of epochs to wait for improvement. The training loss keeps decreasing, while the validation loss flattens and then rises; stop at the point where the validation loss stops improving and keep the checkpoint from that point.
Implementation:
```python
best_val_loss = float('inf')
patience_counter = 0
patience = 5  # epochs to wait for improvement

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_checkpoint()  # save best model
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping!")
            break

# Load best checkpoint for final model
load_checkpoint()
```
Typical patience values:
Fine-tuning: 2-3 epochs
Training from scratch: 5-10 epochs
Large models: 3-5 epochs
Label Smoothing
Don’t use hard labels (0 or 1). Spread some probability to other classes.
Hard labels (no smoothing), true class = 2: [0, 0, 1, 0]
Soft labels (smoothing = 0.1), true class = 2: [0.033, 0.033, 0.9, 0.033]

10% of the probability mass is spread across the other classes.
Why it works:
Prevents model from being overconfident
Encourages model to keep some probability for alternatives
Acts as regularization on the output distribution
Typical value: 0.1 (10% smoothing)
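A minimal sketch, assuming PyTorch, whose CrossEntropyLoss accepts a label_smoothing argument. Note that PyTorch mixes the hard label with a uniform distribution over all classes, a slightly different convention from the 0.9 / 0.033 example above.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 4)             # (batch, num_classes)
targets = torch.tensor([2, 0, 1, 3])   # hard class indices

hard_loss = nn.CrossEntropyLoss()(logits, targets)
smooth_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)

# With smoothing = 0.1 and 4 classes, the target distribution gives
# 1 - 0.1 + 0.1/4 to the true class and 0.1/4 to each other class.
print(hard_loss.item(), smooth_loss.item())
```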
Combining Techniques
Regularization techniques stack. Use multiple together.
Typical Transformer Regularization
1. Dropout: 0.1 after attention and FFN layers
2. Weight decay: 0.01 with AdamW
3. Early stopping: patience = 3 on validation loss
4. Label smoothing: 0.1 (for classification)

For fine-tuning, dropout is often reduced, since the pre-trained model is already well regularized.
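A sketch of wiring these together for a small fine-tuning-style run, assuming PyTorch; the model, data, and hyperparameters are toy stand-ins for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier and random data, stand-ins for a real model and dataset.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 4))
x_train, y_train = torch.randn(256, 20), torch.randint(0, 4, (256,))
x_val, y_val = torch.randn(64, 20), torch.randint(0, 4, (64,))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                           # label smoothing
best_val, patience, patience_counter = float("inf"), 3, 0                      # early stopping

for epoch in range(50):
    model.train()                                  # dropout active
    optimizer.zero_grad()
    criterion(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()                                   # dropout off for evaluation
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()

    if val_loss < best_val:
        best_val, patience_counter = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}  # best checkpoint
    else:
        patience_counter += 1
        if patience_counter >= patience:           # stop after 3 epochs without improvement
            break

model.load_state_dict(best_state)                  # restore the best checkpoint
```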
Common combinations:
| Scenario | Regularization |
| --- | --- |
| Pre-training large LLM | Weight decay 0.1, dropout 0.1 |
| Fine-tuning | Weight decay 0.01, dropout 0.1, early stopping |
| Small dataset | Dropout 0.3, weight decay 0.1, data augmentation |
| Large dataset | Minimal: dropout 0.1, weight decay 0.01 |
Debugging Regularization
Still overfitting despite regularization

Symptoms: you added dropout and weight decay, but the train/val gap is still large.

Causes: the regularization is too weak, the model is still too large for the data, or data augmentation is missing.

Debug steps:
1. Increase dropout (0.1 → 0.3)
2. Increase weight decay (0.01 → 0.1)
3. Add data augmentation
4. Use a smaller model
5. Get more data

Underfitting (training loss high)

Symptoms: the training loss is not decreasing enough; the model can't fit the training data.

Causes: too much regularization, dropout too high, weight decay too strong, or the model is too small.

Debug steps:
1. Reduce dropout (0.3 → 0.1)
2. Reduce weight decay (0.1 → 0.01)
3. Remove early stopping temporarily
4. Use a larger model
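A small sketch in plain Python that turns the two checklists above into a quick diagnostic; the thresholds are arbitrary illustrations, not standard values.

```python
def diagnose(train_loss: float, val_loss: float,
             gap_threshold: float = 0.5, high_train_loss: float = 1.0) -> str:
    """Rough heuristic: compare train/val losses and suggest which way to adjust.

    The thresholds are arbitrary and task-dependent; tune them for your setup.
    """
    if train_loss > high_train_loss:
        return "Underfitting: reduce dropout/weight decay or use a larger model."
    if val_loss - train_loss > gap_threshold:
        return "Overfitting: increase dropout/weight decay, add augmentation or more data."
    return "Looks reasonable: train and validation losses are close."

print(diagnose(train_loss=0.2, val_loss=1.1))   # prints the overfitting suggestion
```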
When This Matters
| Situation | What to apply |
| --- | --- |
| Fine-tuning on small dataset | All: dropout, weight decay, early stopping |
| Model overfitting | Increase dropout, weight decay |
| Model underfitting | Decrease regularization |
| Classification task | Add label smoothing |
| Training from scratch | Moderate regularization, data augmentation |
| Using AdamW | Set the weight_decay parameter (not in the loss) |
| Evaluating model | Ensure dropout is off (model.eval()) |
Interview Notes
Interview relevance: 60% of ML interviews.
Production impact: every fine-tuning job.
Performance: preventing overfitting on small datasets.