Why it matters:
- Foundation for softmax, cross-entropy, temperature scaling, and sampling in AI systems
- Comes up in 70% of ML interviews
- Powers every LLM application
- Key to understanding sampling and temperature settings
TL;DR
Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.
Visual Overview
MODEL OUTPUT TO PROBABILITIES
+-----------------------------------------------------------+
| |
| Raw model output (logits): |
| [2.1, 0.5, -0.3] <-- Just numbers, not probs |
| |
| After softmax: |
| [0.77, 0.16, 0.07] <-- Probability distribution |
| |
| Properties: |
| - Each value in [0, 1] |
| - Sum = 1.0 (certainty is distributed) |
| |
+-----------------------------------------------------------+
The Softmax Function
Converts raw scores into a probability distribution:
SOFTMAX FORMULA
+-----------------------------------------------------------+
| |
| softmax(x_i) = exp(x_i) / SUM(exp(x_j)) |
| |
| Why exp()? |
| - Makes all values positive |
| - Preserves relative ordering |
| - Amplifies differences |
| |
+-----------------------------------------------------------+
When a model outputs [0.77, 0.16, 0.07], it’s saying: “I’m 77% confident this is a cat, 16% dog, 7% bird.”
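A minimal sketch of softmax in code (NumPy is an assumption here, not something the section specifies). Subtracting the max logit before exponentiating is a standard trick to keep exp() from overflowing:

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    shifted = logits - np.max(logits)   # stability: exp() of large logits overflows
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.1, 0.5, -0.3])
probs = softmax(logits)
print(probs.round(2))   # [0.77 0.16 0.07]
print(probs.sum())      # 1.0 (up to float precision)
```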
Expected Value
The expected value is the probability-weighted average of outcomes.
EXPECTED VALUE
+-----------------------------------------------------------+
| |
| E[X] = SUM(outcome x probability) |
| |
| Example: Rolling a fair die |
| E[X] = (1x1/6) + (2x1/6) + ... + (6x1/6) = 3.5 |
| |
| For model outputs: |
| If rewards = [10, 5, 1] and P = [0.77, 0.16, 0.07] |
| E[reward] = (10x0.77) + (5x0.16) + (1x0.07) = 8.57 |
| |
+-----------------------------------------------------------+
Why it matters: Loss functions compute expected loss. Training minimizes expected error across the dataset.
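A quick sketch of the calculations above (NumPy assumed; the reward vector is the same hypothetical one from the box):

```python
import numpy as np

probs = np.array([0.77, 0.16, 0.07])        # softmax output from earlier
rewards = np.array([10, 5, 1])              # hypothetical reward per outcome
print(np.dot(rewards, probs))               # ~8.57

# Fair die: each face has probability 1/6
die_faces = np.arange(1, 7)
print(np.dot(die_faces, np.full(6, 1/6)))   # ~3.5
```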
Entropy: Average Surprise
Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.
ENTROPY INTUITION
+-----------------------------------------------------------+
| |
| "Surprise" of an event = -log(P) |
| |
| Low probability --> high surprise P=0.01 -> 4.6 |
| High probability --> low surprise P=0.99 -> 0.01 |
| |
| Entropy = Expected surprise = average across outcomes |
| |
| H(P) = -SUM(P(x) x log(P(x))) |
| |
+-----------------------------------------------------------+
ENTROPY EXAMPLES
+-----------------------------------------------------------+
| |
| Uniform distribution (maximum uncertainty): |
| P = [0.25, 0.25, 0.25, 0.25] |
| H = 1.39 nats |
| "Model has no idea, all options equally likely" |
| |
| Peaked distribution (confident): |
| P = [0.97, 0.01, 0.01, 0.01] |
| H = 0.17 nats |
| "Model is pretty sure it's the first option" |
| |
| One-hot distribution (certain): |
| P = [1.0, 0.0, 0.0, 0.0] |
| H = 0 nats |
| "Model is certain" |
| |
+-----------------------------------------------------------+
Why it matters:
- Entropy of model output tells you confidence
- Temperature scaling manipulates entropy (higher temp = more uniform)
- Perplexity = exp(entropy) — “how many choices is the model confused between?”
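A short sketch reproducing the entropy numbers above and the perplexity relationship (NumPy assumed; natural log, so entropy is in nats):

```python
import numpy as np

def entropy(p):
    """H(P) = -sum(p * log(p)) in nats; 0 * log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # drop zero entries to avoid log(0)
    return -np.sum(p * np.log(p))

distributions = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "peaked":  [0.97, 0.01, 0.01, 0.01],
    "one-hot": [1.0, 0.0, 0.0, 0.0],
}
for name, p in distributions.items():
    h = entropy(p)
    print(f"{name}: H = {h:.2f} nats, perplexity = {np.exp(h):.2f}")
# uniform: H = 1.39 nats, perplexity = 4.00   ("confused between 4 options")
# peaked:  H = 0.17 nats, perplexity = 1.18
# one-hot: H = 0.00 nats, perplexity = 1.00
```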
Cross-Entropy: Comparing Distributions
Cross-entropy measures how well a predicted distribution Q matches the true distribution P. It is the standard classification loss.
CROSS-ENTROPY
+-----------------------------------------------------------+
| |
| H(P, Q) = -SUM(P(x) x log(Q(x))) |
| |
| Where: |
| P = true distribution (ground truth) |
| Q = predicted distribution (model output) |
| |
+-----------------------------------------------------------+
CROSS-ENTROPY AS LOSS
+-----------------------------------------------------------+
| |
| True label: "cat" --> P = [1, 0, 0] (one-hot) |
| Model prediction: Q = [0.7, 0.2, 0.1] |
| |
| H(P, Q) = -[1xlog(0.7) + 0xlog(0.2) + 0xlog(0.1)] |
| = -log(0.7) |
| = 0.36 |
| |
| Only the true class matters! Simplifies to: |
| Loss = -log(Q_correct) |
| |
| Punishes confident wrong predictions severely: |
| If Q = [0.01, 0.98, 0.01] for true class cat: |
| Loss = -log(0.01) = 4.6 <-- Much higher! |
| |
+-----------------------------------------------------------+
Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
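A minimal sketch of this loss (NumPy assumed). In real training you would use your framework's built-in version, e.g. PyTorch's F.cross_entropy, which takes raw logits and handles the numerics, but the core is just -log of the probability the model put on the true class:

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-10):
    """H(P, Q) = -sum(P * log(Q)); eps guards against log(0)."""
    p_true = np.asarray(p_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    return -np.sum(p_true * np.log(q_pred + eps))

p = [1, 0, 0]                                # true label "cat", one-hot

print(cross_entropy(p, [0.7, 0.2, 0.1]))     # ~0.36  (decent confidence on the right class)
print(cross_entropy(p, [0.01, 0.98, 0.01]))  # ~4.6   (confidently wrong -> huge penalty)
```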
KL Divergence: Distance Between Distributions
KL divergence measures how different two distributions are. It’s not symmetric: DKL(P || Q) and DKL(Q || P) generally differ, so it is not a true distance.
KL DIVERGENCE
+-----------------------------------------------------------+
| |
| DKL(P || Q) = SUM(P(x) x log(P(x)/Q(x))) |
| |
| Also written as: |
| DKL(P || Q) = H(P, Q) - H(P) |
| = Cross-entropy - Entropy |
| |
| "Extra bits needed to encode P using Q's distribution" |
| |
+-----------------------------------------------------------+
Where you’ll see it:
| Context | What it measures |
|---|---|
| Fine-tuning with KL penalty | How far fine-tuned model drifted from base |
| Knowledge distillation | How well student matches teacher |
| VAEs, diffusion models | Difference from prior distribution |
Practical note: KL divergence of 0 means distributions are identical. Larger values mean more different.
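A small sketch of the formula and the cross-entropy minus entropy identity (NumPy assumed; the two distributions are made-up stand-ins for, say, a base model and a fine-tuned model):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """D_KL(P || Q) = sum(P * log(P / Q)), with eps to avoid log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([0.77, 0.16, 0.07])   # e.g. base model's next-token distribution
q = np.array([0.70, 0.20, 0.10])   # e.g. fine-tuned model's distribution

print(kl_divergence(p, q))         # ~0.013: small, the distributions are close
print(kl_divergence(p, p))         # ~0.0:   identical distributions

# Identity check: D_KL(P || Q) = H(P, Q) - H(P)
cross_h = -np.sum(p * np.log(q))   # cross-entropy H(P, Q)
h = -np.sum(p * np.log(p))         # entropy H(P)
print(cross_h - h)                 # ~0.013, matches the direct formula
```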
Temperature Scaling
TEMPERATURE EFFECT ON SOFTMAX
+-----------------------------------------------------------+
| |
| softmax(x / T) where T = temperature |
| |
| T = 1.0 --> standard softmax |
| T > 1.0 --> softer distribution (more random sampling) |
| T < 1.0 --> sharper distribution (more deterministic) |
| T --> 0 --> argmax (always pick highest) |
| |
| +-------------------------------------+ |
| | T=0.5: [0.95, 0.04, 0.01] sharp | |
| | T=1.0: [0.77, 0.16, 0.07] normal | |
| | T=2.0: [0.57, 0.26, 0.17] soft | |
| +-------------------------------------+ |
| |
| Higher temperature = higher entropy = more "creative" |
| Lower temperature = lower entropy = more "focused" |
| |
+-----------------------------------------------------------+
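A sketch reproducing the numbers in the box, using the same logits [2.1, 0.5, -0.3] from the softmax example (NumPy assumed):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T before softmax: T > 1 flattens, T < 1 sharpens."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()             # numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [2.1, 0.5, -0.3]
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {softmax_with_temperature(logits, t).round(2)}")
# T=0.5: [0.95 0.04 0.01]   <- sharper, lower entropy
# T=1.0: [0.77 0.16 0.07]   <- standard softmax
# T=2.0: [0.57 0.26 0.17]   <- flatter, higher entropy
```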
Numerical Stability
COMMON GOTCHA
+-----------------------------------------------------------+
| |
| Problem: log(0) = negative infinity |
| |
| Solution: Add small epsilon |
| log(P + 1e-10) or max(P, 1e-10) |
| |
| In practice: Use framework's built-in cross_entropy |
| It handles numerical stability for you |
| |
+-----------------------------------------------------------+
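A tiny sketch of the epsilon guard (NumPy assumed). As the box says, framework losses such as PyTorch's F.cross_entropy do this kind of stabilization internally, so prefer those in real code:

```python
import numpy as np

p_correct = 0.0   # model assigned zero probability to the true class

# Naive: log(0) is -inf, so the loss (and its gradients) blow up
print(-np.log(p_correct))            # inf (NumPy also emits a divide-by-zero warning)

# Guarded: clamp or add a small epsilon before taking the log
eps = 1e-10
print(-np.log(max(p_correct, eps)))  # ~23.0: large but finite
print(-np.log(p_correct + eps))      # same idea
```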
When This Matters
| Situation | Concept to apply |
|---|---|
| Understanding model confidence | Softmax outputs as probabilities |
| Tuning temperature for generation | Higher temp = higher entropy = more random |
| Understanding perplexity scores | Perplexity = exp(cross-entropy) |
| Debugging “model too confident” | Look at entropy of outputs |
| Fine-tuning with KL penalty | Constrains drift from base model |
| Understanding why cross-entropy works | It heavily penalizes confident mistakes |