Finally, a book that teaches you EVERYTHING you need to know about LLMs from first principles (if you already have a grasp on the math). This was a very complex book for me, as I tried to reimplement each of the concepts by myself, but it finally made me understand how LLMs work at a conceptual level. Amazing book for anyone who wants to learn how ChatGPT or any of the so-called AIs (which are just probabilistic models) work.
A bit of my overview of the attention mechanism from the book: Transformers receive tokens, which are on average about 4 characters each.
Based on these tokens, the role of the transformer is to output what the next token in the sentence is going to be.
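To see what "tokens" actually look like, here is a tiny example of my own using the tiktoken library's GPT-2 encoding (not something the book uses, just a convenient way to inspect real tokens; other models use different tokenizers):

```python
# A quick illustration of tokenization, using tiktoken's GPT-2 encoding.
# My own example, not from the book; the exact split depends on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Transformers predict the next token."
token_ids = enc.encode(text)
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # e.g. ['Transform', 'ers', ' predict', ' the', ' next', ' token', '.']
```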
First, each token gains a positional embedding and a contextual embedding.

# The Self Attention Mechanism

This is the mechanism that teaches the model what each word means and how it relates to all the other ones. Each word gets a Query, a Key, and a Value:
- The Query is what the word looks for
- The Key is the answer to the query
- The Value is the information of the word itself

For every word, we take its query and compare it to every other word's key using the dot product.
Then we divide that dot product by the square root of the key dimension so the numbers don't get too huge.
We pass those scaled dot products through a softmax over all the other words. That gives us a vector describing how the word relates to every other word in the phrase.
Based on this vector, we compute a new representation for the word: the weighted sum of all the other values, each multiplied by its "focus" weight. That is the new value of that word.
Then we do that for all the words.
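Put together, here is a minimal sketch of that single-head attention step in PyTorch (my own toy code, not the book's implementation; the tensor shapes and names are just for illustration):

```python
# A minimal sketch of single-head scaled dot-product attention.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(queries, keys, values):
    # queries, keys, values: (seq_len, d_k) tensors, one row per token
    d_k = keys.size(-1)
    # Compare each query against every key via dot products,
    # dividing by sqrt(d_k) so the scores don't get too huge
    scores = queries @ keys.T / (d_k ** 0.5)
    # Softmax turns the scores into "focus" weights that sum to 1 per token
    weights = F.softmax(scores, dim=-1)        # (seq_len, seq_len)
    # Each token's new value is the weighted sum of all the values
    return weights @ values                    # (seq_len, d_k)

# Toy example: 5 tokens with 8-dimensional embeddings
x = torch.randn(5, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([5, 8])
```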
# Multi Head

There are many heads that do this in parallel. Because of that, each of them can learn something different about the words. In the end we concatenate the results.
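A rough sketch of how the heads split the embedding, attend independently, and get concatenated back together (again my own illustrative code; the model width, head count, and projection layers are assumptions, not the book's exact setup):

```python
# A sketch of multi-head self-attention: split into heads, attend, concatenate.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate learned projections for queries, keys, and values
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # mixes the concatenated heads

    def forward(self, x):  # x: (seq_len, d_model)
        seq_len, d_model = x.shape
        # Project, then split each token's vector into num_heads smaller heads
        q = self.q_proj(x).view(seq_len, self.num_heads, self.d_head).transpose(0, 1)
        k = self.k_proj(x).view(seq_len, self.num_heads, self.d_head).transpose(0, 1)
        v = self.v_proj(x).view(seq_len, self.num_heads, self.d_head).transpose(0, 1)
        # Each head runs attention independently, so each can learn a different relation
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        heads = scores.softmax(dim=-1) @ v               # (num_heads, seq_len, d_head)
        # Concatenate the heads back together and project
        concat = heads.transpose(0, 1).reshape(seq_len, d_model)
        return self.out_proj(concat)

mha = MultiHeadAttention()
print(mha(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```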
# Feed Forward

After that, there are some feed-forward neural networks to discover other information about the words that we may have missed.
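The feed-forward block itself is small; something like the sketch below, where the 4x hidden expansion and GELU activation are common choices I am assuming, not necessarily the book's exact spec:

```python
# A minimal sketch of the position-wise feed-forward block applied to each token.
import torch
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(64, 256),  # expand each token's vector (assumed 4x expansion)
    nn.GELU(),           # non-linearity lets it learn extra per-token features
    nn.Linear(256, 64),  # project back down to the model width
)
print(feed_forward(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```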
# The Decoder

Basically, the decoder is the part that does all of this. It makes each word have:
- Context Awareness
- Position Awareness
- Token Awareness
When we use encoders and decoders together, we are trying to translate from one language to another, but if we are doing language modelling (like LLMs do) we only use decoders.
But LLMs are causal, which means each token can only attend to PREVIOUS tokens. The name of this is [[Masked Self Attention]].

# Probabilities

When we get the probabilities for the next word we can either:
- Use a deterministic choice, picking the token that is most likely
- Or use sampling, where we draw the next token from the probability distribution
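To make both ideas concrete, here is a small sketch combining a causal mask with greedy vs. sampled decoding (my own toy code; the sequence length, vocabulary size, and random logits are made up for illustration):

```python
# A sketch of masked (causal) self-attention plus the two decoding choices:
# greedy (deterministic) vs. sampling from the probability distribution.
import torch
import torch.nn.functional as F

seq_len, d_k, vocab = 5, 8, 10
q = k = v = torch.randn(seq_len, d_k)

scores = q @ k.T / (d_k ** 0.5)
# Causal mask: token i may only attend to tokens 0..i (PREVIOUS tokens)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
weights = F.softmax(scores, dim=-1)   # masked positions end up with weight 0
context = weights @ v

# Pretend logits over the vocabulary for the next token
logits = torch.randn(vocab)
probs = F.softmax(logits, dim=-1)

greedy_token = probs.argmax()                # deterministic: pick the most likely token
sampled_token = torch.multinomial(probs, 1)  # sampling: draw a token from the distribution
```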
I recommend this book if you have already studied these topics in-depth before and are looking for a quick summary and overview to have a knowledge recap. If you're learning from scratch, I would suggest taking a look at other resources that have more depth in theory, like Understanding Deep Learning (book), Attention is All You Need (paper), Language Modeling from Scratch from Stanford (course), and Transformers, The Tech Behind LLMs (DL course). If you're looking for a more practical guide, I suggest the Transformers from Scratch notebook from Kaggle and How Transformer LLMs Work course by DeepLearning.AI, so you can implement it from scratch with Python and PyTorch.
As a follow-up guide, I would give a 4-star review. As a starting book, 3.