KN’s Reviews > Super Study Guide: Transformers & Large Language Models > Status Update

KN
KN is on page 86 of 247
Attention: Q, K, V

Transformer input: one unified vector per token encoding token + position (token embedding + positional embedding added together)
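A minimal sketch of that unified input vector, assuming learned (lookup-table) positional embeddings; the sizes here are illustrative, not from the book:

```python
import numpy as np

# Hypothetical sizes for illustration
vocab_size, max_len, d_model = 100, 16, 8
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(vocab_size, d_model))  # token embedding table
pos_emb = rng.normal(size=(max_len, d_model))     # positional embedding table

token_ids = np.array([5, 17, 42])
positions = np.arange(len(token_ids))

# One unified vector per token: token embedding + positional embedding
x = tok_emb[token_ids] + pos_emb[positions]
print(x.shape)  # (3, 8)
```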

Encoder: Self-attention over the input sequence
Decoder: Masked self-attention + cross-attention to all encoder outputs
Feb 08, 2026 01:22AM


KN’s Previous Updates

KN
KN is on page 94 of 247
Logit: z = Wx + b
p_i = softmax(z)_i = e^{z_i} / \sum_j e^{z_j}
has the property
log(p_i / p_j) = z_i - z_j
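A quick numerical check of the log-ratio property (the shared normalizer cancels, so pairwise log-odds depend only on logit differences); the logit values are arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])  # arbitrary logits for illustration
p = softmax(z)

# log(p_i / p_j) == z_i - z_j: the sum in the denominator cancels
i, j = 0, 2
print(np.isclose(np.log(p[i] / p[j]), z[i] - z[j]))  # True
```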

Can generalize to softmax(z / T) with temperature T
Default: T = 1
T increases -> flatter distribution -> more “creative” sampling
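The temperature effect in a few lines (logit values are arbitrary; the spread of the output shrinks as T grows):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: softmax(z / T)."""
    z = np.asarray(z) / T
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

z = [2.0, 1.0, 0.0]  # arbitrary logits for illustration
print(softmax_T(z, T=1.0))  # default: standard softmax
print(softmax_T(z, T=5.0))  # higher T -> flatter, closer to uniform
print(softmax_T(z, T=0.1))  # lower T -> peaked, near argmax
```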
Feb 09, 2026 10:11PM
