Status Updates from Super Study Guide: Transformers & Large Language Models
KN is on page 94 of 247
Logit: z = Wx + b
Softmax: p_i = e^{z_i} / \sum_j e^{z_j}
has the property
log(p_i / p_j) = z_i - z_j
Can generalize with a temperature T: softmax(z / T)
Default: T = 1
T increases -> flattened distribution -> more "creative" sampling
— Feb 09, 2026 10:11PM
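A minimal sketch (not from the book) of the temperature-scaled softmax described in the note above, checking both the log-ratio property and the flattening effect of a higher T:

```python
import math

def softmax(z, T=1.0):
    # Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)
    m = max(v / T for v in z)  # subtract max for numerical stability
    exps = [math.exp(v / T - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]
p = softmax(z)  # default T = 1

# Log-ratio property: log(p_i / p_j) = z_i - z_j
assert abs(math.log(p[0] / p[1]) - (z[0] - z[1])) < 1e-9

flat = softmax(z, T=10.0)   # high T -> flatter distribution
sharp = softmax(z, T=0.1)   # low T -> peakier distribution
assert max(flat) < max(p) < max(sharp)
```

At T = 1 this is the standard softmax; as T grows the scaled logits z / T shrink toward each other, so the output probabilities approach uniform.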
KN is on page 86 of 247
Attention: Q, K, V
Transformer: unified vector encodes token + position (shared embedding)
Encoder: self-attention over the input sequence
Decoder: cross-attention to all encoder outputs
— Feb 08, 2026 01:22AM
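A minimal sketch (an assumption-laden illustration, not the book's code) of the Q, K, V attention mentioned above, using plain scaled dot-product attention; passing the same sequence as Q, K, and V gives encoder-style self-attention, while decoder cross-attention would take Q from the decoder and K, V from the encoder outputs:

```python
import math

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, row by row
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output = attention-weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Self-attention: Q, K, V all come from the same (toy) sequence X
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = scaled_dot_product_attention(X, X, X)
# Cross-attention would instead be scaled_dot_product_attention(dec_states, enc_out, enc_out)
```

Each output row is a convex combination of the value vectors, so every component stays within the range of the corresponding value column.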