Why it matters:
- Foundation for softmax, cross-entropy, temperature scaling, and sampling in AI systems
- Comes up in 70% of ML interviews
- Powers every LLM application
- Key to understanding sampling and temperature settings
TL;DR
Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.
Visual Overview
MODEL OUTPUT TO PROBABILITIES
+-----------------------------------------------------------+
| |
| Raw model output (logits): |
| [2.1, 0.5, -0.3] <-- Just numbers, not probs |
| |
| After softmax: |
| [0.77, 0.16, 0.07] <-- Probability distribution |
| |
| Properties: |
| - Each value in [0, 1] |
| - Sum = 1.0 (certainty is distributed) |
| |
+-----------------------------------------------------------+
The Softmax Function
Converts raw scores into a probability distribution:
SOFTMAX FORMULA
+-----------------------------------------------------------+
| |
| softmax(x_i) = exp(x_i) / SUM(exp(x_j)) |
| |
| Why exp()? |
| - Makes all values positive |
| - Preserves relative ordering |
| - Amplifies differences |
| |
+-----------------------------------------------------------+
When a model outputs [0.77, 0.16, 0.07], it’s saying: “I’m 77% confident this is a cat, 16% dog, 7% bird.”
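A minimal sketch of softmax in code (NumPy is an assumption here, not something the section specifies). Subtracting the max logit before exponentiating is a standard trick to keep exp() from overflowing:

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    shifted = logits - np.max(logits)   # stability: exp() of large logits overflows
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.1, 0.5, -0.3])
probs = softmax(logits)
print(probs.round(2))   # [0.77 0.16 0.07]
print(probs.sum())      # 1.0 (up to float precision)
```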
Expected Value
The expected value is the probability-weighted average of outcomes.
EXPECTED VALUE
+-----------------------------------------------------------+
| |
| E[X] = SUM(outcome x probability) |
| |
| Example: Rolling a fair die |
| E[X] = (1x1/6) + (2x1/6) + ... + (6x1/6) = 3.5 |
| |
| For model outputs: |
| If rewards = [10, 5, 1] and P = [0.77, 0.16, 0.07] |
| E[reward] = (10x0.77) + (5x0.16) + (1x0.07) = 8.57 |
| |
+-----------------------------------------------------------+
Why it matters: Loss functions compute expected loss. Training minimizes expected error across the dataset.
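A quick sketch of the calculations above (NumPy assumed; the reward vector is the same hypothetical one from the box):

```python
import numpy as np

probs = np.array([0.77, 0.16, 0.07])        # softmax output from earlier
rewards = np.array([10, 5, 1])              # hypothetical reward per outcome
print(np.dot(rewards, probs))               # ~8.57

# Fair die: each face has probability 1/6
die_faces = np.arange(1, 7)
print(np.dot(die_faces, np.full(6, 1/6)))   # ~3.5
```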
Entropy: Average Surprise
Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.
ENTROPY INTUITION
+-----------------------------------------------------------+
| |
| "Surprise" of an event = -log(P) |
| |
| Low probability --> high surprise P=0.01 -> 4.6 |
| High probability --> low surprise P=0.99 -> 0.01 |
| |
| Entropy = Expected surprise = average across outcomes |
| |
| H(P) = -SUM(P(x) x log(P(x))) |
| |
+-----------------------------------------------------------+
ENTROPY EXAMPLES
+-----------------------------------------------------------+
| |
| Uniform distribution (maximum uncertainty): |
| P = [0.25, 0.25, 0.25, 0.25] |
| H = 1.39 nats |
| "Model has no idea, all options equally likely" |
| |
| Peaked distribution (confident): |
| P = [0.97, 0.01, 0.01, 0.01] |
| H = 0.17 nats |
| "Model is pretty sure it's the first option" |
| |
| One-hot distribution (certain): |
| P = [1.0, 0.0, 0.0, 0.0] |
| H = 0 nats |
| "Model is certain" |
| |
+-----------------------------------------------------------+
Why it matters:
- Entropy of model output tells you confidence
- Temperature scaling manipulates entropy (higher temp = more uniform)
- Perplexity = exp(entropy) — “how many choices is the model confused between?”
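A short sketch reproducing the entropy numbers above and the perplexity relationship (NumPy assumed; natural log, so entropy is in nats):

```python
import numpy as np

def entropy(p):
    """H(P) = -sum(p * log(p)) in nats; 0 * log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # drop zero entries to avoid log(0)
    return -np.sum(p * np.log(p))

distributions = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "peaked":  [0.97, 0.01, 0.01, 0.01],
    "one-hot": [1.0, 0.0, 0.0, 0.0],
}
for name, p in distributions.items():
    h = entropy(p)
    print(f"{name}: H = {h:.2f} nats, perplexity = {np.exp(h):.2f}")
# uniform: H = 1.39 nats, perplexity = 4.00   ("confused between 4 options")
# peaked:  H = 0.17 nats, perplexity = 1.18
# one-hot: H = 0.00 nats, perplexity = 1.00
```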
Cross-Entropy: Comparing Distributions
Cross-entropy measures how well a predicted distribution Q matches the true distribution P. It is the standard classification loss.
CROSS-ENTROPY
+-----------------------------------------------------------+
| |
| H(P, Q) = -SUM(P(x) x log(Q(x))) |
| |
| Where: |
| P = true distribution (ground truth) |
| Q = predicted distribution (model output) |
| |
+-----------------------------------------------------------+
CROSS-ENTROPY AS LOSS
+-----------------------------------------------------------+
| |
| True label: "cat" --> P = [1, 0, 0] (one-hot) |
| Model prediction: Q = [0.7, 0.2, 0.1] |
| |
| H(P, Q) = -[1xlog(0.7) + 0xlog(0.2) + 0xlog(0.1)] |
| = -log(0.7) |
| = 0.36 |
| |
| Only the true class matters! Simplifies to: |
| Loss = -log(Q_correct) |
| |
| Punishes confident wrong predictions severely: |
| If Q = [0.01, 0.98, 0.01] for true class cat: |
| Loss = -log(0.01) = 4.6 <-- Much higher! |
| |
+-----------------------------------------------------------+
Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
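A minimal sketch of this loss (NumPy assumed). In real training you would use your framework's built-in version, e.g. PyTorch's F.cross_entropy, which takes raw logits and handles the numerics, but the core is just -log of the probability the model put on the true class:

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-10):
    """H(P, Q) = -sum(P * log(Q)); eps guards against log(0)."""
    p_true = np.asarray(p_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    return -np.sum(p_true * np.log(q_pred + eps))

p = [1, 0, 0]                                # true label "cat", one-hot

print(cross_entropy(p, [0.7, 0.2, 0.1]))     # ~0.36  (decent confidence on the right class)
print(cross_entropy(p, [0.01, 0.98, 0.01]))  # ~4.6   (confidently wrong -> huge penalty)
```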
KL Divergence: Distance Between Distributions
KL divergence measures how different two distributions are. It’s not symmetric: DKL(P || Q) and DKL(Q || P) generally differ, so it is not a true distance.
KL DIVERGENCE
+-----------------------------------------------------------+
| |
| DKL(P || Q) = SUM(P(x) x log(P(x)/Q(x))) |
| |
| Also written as: |
| DKL(P || Q) = H(P, Q) - H(P) |
| = Cross-entropy - Entropy |
| |
| "Extra bits needed to encode P using Q's distribution" |
| |
+-----------------------------------------------------------+
Where you’ll see it:
| Context | What it measures |
|---|---|
| Fine-tuning with KL penalty | How far fine-tuned model drifted from base |
| Knowledge distillation | How well student matches teacher |
| VAEs, diffusion models | Difference from prior distribution |
Practical note: KL divergence of 0 means distributions are identical. Larger values mean more different.
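A small sketch of the formula and the cross-entropy minus entropy identity (NumPy assumed; the two distributions are made-up stand-ins for, say, a base model and a fine-tuned model):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """D_KL(P || Q) = sum(P * log(P / Q)), with eps to avoid log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([0.77, 0.16, 0.07])   # e.g. base model's next-token distribution
q = np.array([0.70, 0.20, 0.10])   # e.g. fine-tuned model's distribution

print(kl_divergence(p, q))         # ~0.013: small, the distributions are close
print(kl_divergence(p, p))         # ~0.0:   identical distributions

# Identity check: D_KL(P || Q) = H(P, Q) - H(P)
cross_h = -np.sum(p * np.log(q))   # cross-entropy H(P, Q)
h = -np.sum(p * np.log(p))         # entropy H(P)
print(cross_h - h)                 # ~0.013, matches the direct formula
```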
Temperature Scaling
TEMPERATURE EFFECT ON SOFTMAX
+-----------------------------------------------------------+
| |
| softmax(x / T) where T = temperature |
| |
| T = 1.0 --> standard softmax |
| T > 1.0 --> softer distribution (more random sampling) |
| T < 1.0 --> sharper distribution (more deterministic) |
| T --> 0 --> argmax (always pick highest) |
| |
| +-------------------------------------+ |
| | T=0.5: [0.95, 0.04, 0.01] sharp | |
| | T=1.0: [0.77, 0.16, 0.07] normal | |
| | T=2.0: [0.57, 0.26, 0.17] soft | |
| +-------------------------------------+ |
| |
| Higher temperature = higher entropy = more "creative" |
| Lower temperature = lower entropy = more "focused" |
| |
+-----------------------------------------------------------+
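A sketch reproducing the numbers in the box, using the same logits [2.1, 0.5, -0.3] from the softmax example (NumPy assumed):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T before softmax: T > 1 flattens, T < 1 sharpens."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()             # numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [2.1, 0.5, -0.3]
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {softmax_with_temperature(logits, t).round(2)}")
# T=0.5: [0.95 0.04 0.01]   <- sharper, lower entropy
# T=1.0: [0.77 0.16 0.07]   <- standard softmax
# T=2.0: [0.57 0.26 0.17]   <- flatter, higher entropy
```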
Numerical Stability
COMMON GOTCHA
+-----------------------------------------------------------+
| |
| Problem: log(0) = negative infinity |
| |
| Solution: Add small epsilon |
| log(P + 1e-10) or max(P, 1e-10) |
| |
| In practice: Use framework's built-in cross_entropy |
| It handles numerical stability for you |
| |
+-----------------------------------------------------------+
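A tiny sketch of the epsilon guard (NumPy assumed). As the box says, framework losses such as PyTorch's F.cross_entropy do this kind of stabilization internally, so prefer those in real code:

```python
import numpy as np

p_correct = 0.0   # model assigned zero probability to the true class

# Naive: log(0) is -inf, so the loss (and its gradients) blow up
print(-np.log(p_correct))            # inf (NumPy also emits a divide-by-zero warning)

# Guarded: clamp or add a small epsilon before taking the log
eps = 1e-10
print(-np.log(max(p_correct, eps)))  # ~23.0: large but finite
print(-np.log(p_correct + eps))      # same idea
```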
When This Matters
| Situation | Concept to apply |
|---|---|
| Understanding model confidence | Softmax outputs as probabilities |
| Tuning temperature for generation | Higher temp = higher entropy = more random |
| Understanding perplexity scores | Perplexity = exp(cross-entropy) |
| Debugging “model too confident” | Look at entropy of outputs |
| Fine-tuning with KL penalty | Constrains drift from base model |
| Understanding why cross-entropy works | It heavily penalizes confident mistakes |