~110s Visual Explainer

Attention Mechanism

How self-attention enables transformers to understand context by letting each token attend to all others.

[Visual: animated walkthrough of "The cat sat on the mat" — context-free tokens; Q/K/V projections (W_Q, W_K, W_V); dot-product scores Q · Kᵀ / √d; softmax attention weights (e.g. cat 0.35, mat 0.28, Σ = 1.00); weighted sum of Values; parallel attention matrix with 8 heads.]

A Sentence as Tokens

The sentence "The cat sat on the mat" is split into tokens, each converted to a vector (embedding). Initially, each token's embedding knows nothing about its neighbors.

"Sat" doesn't know it relates to "cat" or "mat".

  • Tokenization: text → tokens
  • Embedding: token → vector
  • Initially context-independent
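The steps above can be sketched in a few lines of NumPy. The vocabulary, embedding dimension, and random embedding table below are illustrative stand-ins, not taken from any real model:

```python
import numpy as np

# Toy sketch: map each token to a context-independent embedding vector.
rng = np.random.default_rng(0)
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}
d_model = 8                                       # embedding dimension
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = "The cat sat on the mat".split()
X = embedding_table[[vocab[t] for t in tokens]]   # shape (6, d_model)

# Every row is still independent: "sat" carries no information about
# "cat" or "mat" until attention mixes the rows together.
print(X.shape)   # (6, 8)
```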

Query, Key, Value: Three Roles

Each token is projected into three vectors:

  • Query (Q): "What information am I seeking?"
  • Key (K): "What information do I have to offer?"
  • Value (V): "If selected, here's my content."

These are learned linear transformations (W_Q, W_K, W_V).
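As a minimal sketch, the three projections are just matrix multiplications. Here W_Q, W_K, W_V are random placeholders; in a trained transformer they are learned parameters:

```python
import numpy as np

# Three learned projections of the same input (random stand-ins here).
rng = np.random.default_rng(1)
n_tokens, d_model = 6, 8
X = rng.normal(size=(n_tokens, d_model))     # token embeddings

W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = X @ W_Q   # "what am I looking for?"
K = X @ W_K   # "what do I contain?"
V = X @ W_V   # "here's my content"

print(Q.shape, K.shape, V.shape)   # all (6, 8)
```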

Which Tokens Should I Attend To?

To compute attention for "sat", we take its Query and compare to every Key via dot product. High score = similar directions = relevant.

Each score is divided by √d (the key dimension), so that large dot products don't push softmax into a saturated region where gradients vanish.

  • Score = Q · Kᵀ / √d
  • Higher score = more relevant
  • Scale factor √d for stability
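Computing every query against every key at once is a single matrix product. A sketch with random stand-in matrices:

```python
import numpy as np

# Scaled dot-product scores: each row compares one token's Query
# against every token's Key.
rng = np.random.default_rng(2)
n_tokens, d = 6, 8
Q = rng.normal(size=(n_tokens, d))
K = rng.normal(size=(n_tokens, d))

scores = Q @ K.T / np.sqrt(d)   # shape (6, 6); scores[i, j] says how well
                                # token i's query matches token j's key
print(scores.shape)   # (6, 6)
```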

Softmax: Normalize to Probabilities

Softmax converts scores to probabilities that sum to 1. "Sat" might attend to "cat" with weight 0.35, "mat" with 0.28, and spread the rest.

This is the attention pattern — what the model "looks at".

  • Softmax normalizes to [0, 1]
  • Weights sum to 1.0
  • Differentiable and learnable
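A numerically stable softmax over the raw scores from the visual (the six values 0.5, 2.8, 1.2, 0.3, 0.4, 2.5 are the explainer's illustrative numbers):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is shift-invariant.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Raw scores for "sat" against each token (illustrative values).
scores = np.array([0.5, 2.8, 1.2, 0.3, 0.4, 2.5])
weights = softmax(scores)
print(weights.round(3), weights.sum())   # weights sum to 1.0
```

The highest-scoring token ("cat", score 2.8) gets the largest weight; softmax exaggerates differences because of the exponential.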

Combine Values by Attention

The output for "sat" is a weighted sum of all Value vectors. Highly-attended tokens contribute more to the result.

Now "sat" has a representation that incorporates context from "cat" and "mat".

  • Output = Σ (weight × V)
  • Context aggregated from relevant tokens
  • Permutation-invariant on its own (positional encodings supply word order)
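Putting the pieces together, one full single-head pass is three lines of linear algebra. Shapes and data below are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# One full pass of (single-head) self-attention on random stand-in data.
rng = np.random.default_rng(3)
n, d = 6, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

weights = softmax(Q @ K.T / np.sqrt(d))   # (n, n), each row sums to 1
output = weights @ V                      # (n, d): weighted sum of Values

# Row 2 ("sat") is now a mixture of every token's Value vector,
# weighted by how relevant each Key looked to "sat"'s Query.
print(output.shape)   # (6, 8)
```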

Self-Attention in Parallel

This happens for all tokens simultaneously — a matrix operation. Multi-head attention runs multiple attention patterns in parallel, letting the model learn different relationships.

This is the heart of transformers.

  • Fully parallel computation
  • Multi-head: different relationship types
  • Stacked layers = deeper understanding
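A minimal multi-head sketch, assuming each head uses independent projections into a smaller dimension and the results are concatenated (real models also apply a final output projection, omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Single-head scaled dot-product self-attention.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(4)
n_tokens, d_model, n_heads = 6, 16, 8
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))

# Each head gets its own random stand-in projections (learned in practice),
# so each can attend to a different kind of relationship.
heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(self_attention(X, W_Q, W_K, W_V))

out = np.concatenate(heads, axis=-1)   # (6, 16): heads concatenated
print(out.shape)
```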

What's Next?

Self-attention is the foundation of transformers powering GPT, BERT, and all modern LLMs. Next, explore positional encoding (how transformers know word order), multi-head attention (parallel attention patterns), and transformer architecture (putting it all together).