Classification and retrieval metrics: precision, recall, F1, perplexity, MRR, and NDCG
Interview Relevance: 75% of ML interviews
Production Impact: Every model evaluation
Performance: Choosing the right metric for the task
TL;DR
Accuracy can be misleading with imbalanced data. Use precision when false positives are costly, recall when false negatives are costly, and F1 when you need balance. For retrieval, MRR measures first-result quality, NDCG measures ranking quality.
Visual Overview
CONFUSION MATRIX
+-----------------------------------------------------------+
|                                                           |
|                        Predicted                          |
|                      Pos      Neg                         |
|                   +--------+--------+                     |
|   Actual   Pos    |   TP   |   FN   |                     |
|                   +--------+--------+                     |
|            Neg    |   FP   |   TN   |                     |
|                   +--------+--------+                     |
|                                                           |
|   TP = True Positive  (correct positive)                  |
|   FP = False Positive (false alarm)                       |
|   FN = False Negative (missed)                            |
|   TN = True Negative  (correct negative)                  |
|                                                           |
+-----------------------------------------------------------+
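A minimal sketch of how these four counts fall out in code, using scikit-learn (assumed available); the toy labels are illustrative:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# scikit-learn sorts labels, so row 0 / col 0 is the negative class
# and ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)   # 2 1 1 4
```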
Classification Metrics
Precision
PRECISION
+-----------------------------------------------------------+
| |
| Precision = TP / (TP + FP) |
| |
| "Of everything I predicted positive, how many were |
| actually positive?" |
| |
| High precision = few false alarms |
| Optimize when: false positives are costly |
| (spam filter, fraud detection) |
| |
+-----------------------------------------------------------+
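A small sketch with scikit-learn; the spam-filter labels below are made up for illustration:

```python
from sklearn.metrics import precision_score

# Toy spam-filter labels: 1 = spam. A false positive buries a real email.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# TP = 2, FP = 1  ->  precision = 2 / (2 + 1) ≈ 0.67
print(precision_score(y_true, y_pred))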
Recall
RECALL
+-----------------------------------------------------------+
| |
| Recall = TP / (TP + FN) |
| |
| "Of everything that was actually positive, how many |
| did I find?" |
| |
| High recall = few misses |
| Optimize when: false negatives are costly |
| (disease detection, security threats) |
| |
+-----------------------------------------------------------+
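The same toy labels, scored for recall:

```python
from sklearn.metrics import recall_score

# 3 actual positives, the model found 2 of them.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# TP = 2, FN = 1  ->  recall = 2 / (2 + 1) ≈ 0.67
print(recall_score(y_true, y_pred))
```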
F1 Score
F1 SCORE
+-----------------------------------------------------------+
| |
| F1 = 2 x (Precision x Recall) / (Precision + Recall) |
| |
| Harmonic mean of precision and recall. |
| Use when: you need a single number, classes are imbalanced |
| |
| Note: Harmonic mean punishes extreme imbalance harder |
| than arithmetic mean. |
| |
+-----------------------------------------------------------+
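A quick check of the note above: the harmonic mean is dominated by the smaller value, where an arithmetic mean would not be.

```python
def f1(precision, recall):
    # Harmonic mean: collapses toward the smaller of the two values.
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))   # 0.90 -- balanced scores stay high
print(f1(0.9, 0.1))   # 0.18 -- the arithmetic mean would report 0.50
```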
Precision-Recall Tradeoff
TRADEOFF
+-----------------------------------------------------------+
|                                                           |
|        High Threshold           Low Threshold             |
|        (conservative)           (aggressive)              |
|              |                        |                   |
|              v                        v                   |
|        +--------------+      +--------------+             |
|   P    |     HIGH     |      |     LOW      |             |
|        +--------------+      +--------------+             |
|        +--------------+      +--------------+             |
|   R    |     LOW      |      |     HIGH     |             |
|        +--------------+      +--------------+             |
|                                                           |
|   Moving the threshold trades one for the other.          |
|                                                           |
+-----------------------------------------------------------+
Threshold Tuning
The default threshold is 0.5. That choice is arbitrary.
CHOOSING A THRESHOLD
+-----------------------------------------------------------+
| |
| The right threshold depends on your cost function: |
| |
| FN 10x worse than FP? -> Lower threshold |
| FP 10x worse than FN? -> Higher threshold |
| Equal cost? -> Optimize F1 |
| |
+-----------------------------------------------------------+
Common threshold strategies:
| Strategy | When to use |
|---|---|
| Maximize F1 | Balanced importance |
| Fixed recall (e.g., 95%) | Can’t miss positives (medical) |
| Fixed precision (e.g., 95%) | Can’t have false alarms (spam) |
| Cost-weighted | Know exact cost of FP vs FN |
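A sketch of two of these strategies using scikit-learn's `precision_recall_curve`; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative labels and raw scores from some classifier.
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Strategy "maximize F1": precision/recall carry one extra trailing point,
# so drop it to align them with the thresholds array.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_f1_threshold = thresholds[np.argmax(f1)]

# Strategy "fixed recall": the highest threshold that keeps recall >= 0.95.
meets_recall = recall[:-1] >= 0.95
fixed_recall_threshold = thresholds[meets_recall][-1] if meets_recall.any() else thresholds[0]

print(best_f1_threshold, fixed_recall_threshold)
```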
Class Imbalance
THE PROBLEM
+-----------------------------------------------------------+
| |
| Dataset: 99% negative, 1% positive (fraud detection) |
| |
| Model predicts "negative" for everything: |
| Accuracy = 99% <-- Looks great! |
| Recall = 0% <-- Completely useless |
| |
| Accuracy is misleading with imbalanced classes. |
| |
+-----------------------------------------------------------+
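A quick demonstration of the problem on synthetic labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 1% positives; a "model" that always predicts the negative class.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))              # 0.99 -- looks great
print(recall_score(y_true, y_pred))                # 0.0  -- finds nothing
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0  -- reveals the problem
```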
Solutions:
1. USE DIFFERENT METRICS
+-----------------------------------------------------------+
| |
| Bad: Accuracy |
| Good: F1, Precision, Recall, PR-AUC |
| |
| F1 on "always negative" model = 0 (reveals the problem) |
| |
+-----------------------------------------------------------+
2. PR-AUC OVER ROC-AUC
+-----------------------------------------------------------+
| |
| ROC-AUC can look good with imbalanced data (high TNR) |
| PR-AUC focuses on positive class performance |
| |
| Severe imbalance? PR-AUC is the metric. |
| |
+-----------------------------------------------------------+
3. MACRO VS MICRO AVERAGING
+-----------------------------------------------------------+
| |
| Multi-class with imbalance: |
| |
| Micro F1: Aggregate TP, FP, FN across all classes |
| -> Dominated by majority class |
| |
| Macro F1: Compute F1 per class, then average |
| -> Each class weighted equally |
| |
| Imbalanced? Use Macro F1 to ensure minority classes |
| matter. |
| |
+-----------------------------------------------------------+
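A sketch of solutions 2 and 3 on synthetic data; the random scorer and class counts below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)

# --- PR-AUC vs ROC-AUC on a 1%-positive problem ---------------------------
y_true  = np.array([1] * 10 + [0] * 990)
y_score = rng.normal(size=1000) + y_true          # a mediocre scorer
print(roc_auc_score(y_true, y_score))             # roughly 0.76: looks respectable
print(average_precision_score(y_true, y_score))   # PR-AUC: far lower, exposes the weakness

# --- Micro vs macro F1 on imbalanced multi-class ---------------------------
y_true_mc = [0] * 90 + [1] * 8 + [2] * 2
y_pred_mc = [0] * 100                             # always predicts the majority class
print(f1_score(y_true_mc, y_pred_mc, average="micro"))                   # 0.90
print(f1_score(y_true_mc, y_pred_mc, average="macro", zero_division=0))  # ~0.32
```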
Retrieval Metrics
Precision@K and Recall@K
PRECISION AND RECALL AT K
+-----------------------------------------------------------+
| |
| P@K = (relevant docs in top K) / K |
| "Of the K results I returned, how many were relevant?" |
| |
| R@K = (relevant docs in top K) / (total relevant docs) |
| "Of all relevant docs, how many appear in my top K?" |
| |
+-----------------------------------------------------------+
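Minimal reference implementations, assuming you have ranked document IDs and a set of known-relevant IDs:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-K results that are relevant.
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of all relevant docs that made it into the top K.
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / len(relevant_ids)

ranked   = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranked, relevant, k=3))   # 2/3
print(recall_at_k(ranked, relevant, k=3))      # 2/3
```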
Mean Reciprocal Rank (MRR)
MRR
+-----------------------------------------------------------+
| |
| MRR = (1/|Q|) x SUM(1 / rank_i) |
| |
| Where rank_i = position of first relevant result |
| |
| Example: |
| Query 1: first relevant at position 3 -> 1/3 |
| Query 2: first relevant at position 1 -> 1/1 |
| Query 3: first relevant at position 2 -> 1/2 |
| |
| MRR = (1/3 + 1 + 1/2) / 3 = 0.61 |
| |
| Use when: you care most about the first relevant result |
| |
+-----------------------------------------------------------+
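A tiny helper that reproduces the worked example above (ranks are 1-based; None means no relevant result was returned):

```python
def mrr(first_relevant_ranks):
    # 1-based rank of the first relevant result for each query;
    # None means no relevant result was returned (contributes 0).
    return sum(1.0 / r for r in first_relevant_ranks if r) / len(first_relevant_ranks)

print(mrr([3, 1, 2]))   # (1/3 + 1 + 1/2) / 3 ≈ 0.61
```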
NDCG (Normalized Discounted Cumulative Gain)
NDCG
+-----------------------------------------------------------+
| |
| DCG@K = SUM(relevance_i / log2(i + 1)) |
| |
| NDCG@K = DCG@K / IDCG@K (normalized by ideal ranking) |
| |
| Accounts for: |
| - Graded relevance (not just binary) |
| - Position discount (top results matter more) |
| |
| Use when: relevance has degrees, ranking order matters |
| |
+-----------------------------------------------------------+
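A minimal sketch built from the formula above. Note the simplification: the ideal ranking here is derived from the returned documents' relevances only; a full implementation would rank against all known relevant documents for the query.

```python
import math

def dcg_at_k(relevances, k):
    # relevances: graded relevance of results, in the order they were ranked.
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Top 4 results with graded relevance (3 = best match, 0 = irrelevant).
print(ndcg_at_k([3, 2, 0, 1], k=4))   # ≈ 0.985
```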
When to Use What
| Scenario | Metric | Why |
|---|---|---|
| Binary classification, balanced | Accuracy | Simple, interpretable |
| Binary classification, imbalanced | F1 or PR-AUC | Accuracy misleading |
| Multi-class, balanced | Accuracy or Micro F1 | Simple aggregate |
| Multi-class, imbalanced | Macro F1 | Each class matters equally |
| Retrieval, one right answer | MRR | First result matters |
| Retrieval, multiple relevant | NDCG | Ranking quality |
| RAG retrieval component | Recall@K | Did we retrieve the context? |
| Medical/safety-critical | Recall | Can’t miss positives |
| Spam/fraud filtering | Precision | Can’t have false alarms |
Debugging with Metrics
HIGH ACCURACY BUT POOR REAL-WORLD PERFORMANCE
+-----------------------------------------------------------+
| |
| Symptoms: |
| - Model accuracy is 95% |
| - Users complain it doesn't work |
| |
| Causes: |
| - Class imbalance (predicting majority class) |
| - Test set doesn't match production distribution |
| - Wrong metric for actual goal |
| |
| Debug steps: |
| 1. Check class balance in test set |
| 2. Compute per-class metrics (confusion matrix) |
| 3. Compute F1, not just accuracy |
| 4. Sample production data and evaluate |
| |
+-----------------------------------------------------------+
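A compact version of steps 1-3, with toy stand-ins for the test set and model predictions:

```python
from collections import Counter
from sklearn.metrics import classification_report

# Toy stand-ins for a held-out test set and a model's predictions.
y_test = [0] * 95 + [1] * 5
y_pred = [0] * 100              # the "always predict the majority class" failure mode

print(Counter(y_test))          # step 1: class balance -> Counter({0: 95, 1: 5})
print(classification_report(y_test, y_pred, zero_division=0))  # steps 2-3: per-class P/R/F1
```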
PRECISION AND RECALL BOTH LOW
+-----------------------------------------------------------+
| |
| Symptoms: |
| - P = 0.4, R = 0.3 (both bad) |
| - Model seems random |
| |
| Causes: |
| - Model not learning (training issue) |
| - Features not predictive |
| - Data quality issues |
| |
| Debug steps: |
| 1. Check training loss -- is it decreasing? |
| 2. Overfit to small dataset first |
| 3. Check feature importance |
| 4. Examine misclassified examples manually |
| |
+-----------------------------------------------------------+
When This Matters
| Situation | What to know |
|---|---|
| Evaluating any classifier | Check class balance first |
| Imbalanced data | Use F1 or PR-AUC, not accuracy |
| Multi-class imbalance | Use Macro F1 |
| Setting classification threshold | Tune based on cost of FP vs FN |
| Evaluating retrieval | MRR for single answer, NDCG for ranking |
| RAG system evaluation | Recall@K for retriever |
| Model seems good but users complain | Metric doesn’t match user goal |