Finally, a book that teaches you EVERYTHING you need to know about LLMs from first principles (if you already have a grasp on the math). This was a very complex book for me, as I tried to reimplement each of the concepts by myself, but it finally made me understand how LLMs work at a conceptual level. Amazing book for anyone who wants to learn how ChatGPT or any of the so-called AIs (which are just probabilistic models) work.
A bit of my overview of the attention mechanism from the book: Transformers receive tokens, which are on average about 4 characters each.
Based on these tokens, the role of the transformer is to output what the next token in the sentence is going to be.
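To see what "tokens" actually look like, here is a tiny example of my own using the tiktoken library's GPT-2 encoding (not something the book uses, just a convenient way to inspect real tokens; other models use different tokenizers):

```python
# A quick illustration of tokenization, using tiktoken's GPT-2 encoding.
# My own example, not from the book; the exact split depends on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Transformers predict the next token."
token_ids = enc.encode(text)
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # e.g. ['Transform', 'ers', ' predict', ' the', ' next', ' token', '.']
```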
First, each token gains a positional embedding and a contextual embedding.

# The Self Attention Mechanism

This is the mechanism that teaches the model what each word means and how it relates to all the other ones. Each word gets a Query, a Key, and a Value:
- The Query is what the word looks for
- The Key is the answer to the query
- The Value is the information of the word itself

For every word, we take its query and compare it to every other word's key using the dot product.
Then we divide that dot product by the square root of the key dimension so the numbers don't get too huge.
We pass those scaled dot products through a softmax over all the other words. That gives us a vector describing how the word relates to every other word in the phrase.
Based on this vector, we compute a new representation for the word: the weighted sum of all the other values, each multiplied by its "focus" weight. That is the new value of that word.
Then we do that for all the words.
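Put together, here is a minimal sketch of that single-head attention step in PyTorch (my own toy code, not the book's implementation; the tensor shapes and names are just for illustration):

```python
# A minimal sketch of single-head scaled dot-product attention.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(queries, keys, values):
    # queries, keys, values: (seq_len, d_k) tensors, one row per token
    d_k = keys.size(-1)
    # Compare each query against every key via dot products,
    # dividing by sqrt(d_k) so the scores don't get too huge
    scores = queries @ keys.T / (d_k ** 0.5)
    # Softmax turns the scores into "focus" weights that sum to 1 per token
    weights = F.softmax(scores, dim=-1)        # (seq_len, seq_len)
    # Each token's new value is the weighted sum of all the values
    return weights @ values                    # (seq_len, d_k)

# Toy example: 5 tokens with 8-dimensional embeddings
x = torch.randn(5, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([5, 8])
```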
# Multi Head

There are many heads that do this in parallel. Because of that, each of them can learn something different about the words. In the end we concatenate the results.
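A rough sketch of how the heads split the embedding, attend independently, and get concatenated back together (again my own illustrative code; the model width, head count, and projection layers are assumptions, not the book's exact setup):

```python
# A sketch of multi-head self-attention: split into heads, attend, concatenate.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate learned projections for queries, keys, and values
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # mixes the concatenated heads

    def forward(self, x):  # x: (seq_len, d_model)
        seq_len, d_model = x.shape
        # Project, then split each token's vector into num_heads smaller heads
        q = self.q_proj(x).view(seq_len, self.num_heads, self.d_head).transpose(0, 1)
        k = self.k_proj(x).view(seq_len, self.num_heads, self.d_head).transpose(0, 1)
        v = self.v_proj(x).view(seq_len, self.num_heads, self.d_head).transpose(0, 1)
        # Each head runs attention independently, so each can learn a different relation
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        heads = scores.softmax(dim=-1) @ v               # (num_heads, seq_len, d_head)
        # Concatenate the heads back together and project
        concat = heads.transpose(0, 1).reshape(seq_len, d_model)
        return self.out_proj(concat)

mha = MultiHeadAttention()
print(mha(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```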
# Feed Forward

After that, there are some feed-forward neural networks to discover other information about the words that we may have missed.
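The feed-forward block itself is small; something like the sketch below, where the 4x hidden expansion and GELU activation are common choices I am assuming, not necessarily the book's exact spec:

```python
# A minimal sketch of the position-wise feed-forward block applied to each token.
import torch
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(64, 256),  # expand each token's vector (assumed 4x expansion)
    nn.GELU(),           # non-linearity lets it learn extra per-token features
    nn.Linear(256, 64),  # project back down to the model width
)
print(feed_forward(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```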
# The Decoder

Basically, the decoder is the part that does all of this. It makes each word have:
- Context Awareness
- Position Awareness
- Token Awareness
When we use encoders and decoders together, we are trying to translate from one language to another, but if we are doing language modelling (like LLMs do) we only use decoders.
But LLMs are causal, which means each token can only attend to PREVIOUS tokens. The name of this is [[Masked Self Attention]].

# Probabilities

When we get the probabilities for the next word we can either:
- Use a deterministic choice, picking the token that is most likely
- Or use sampling, where we draw the next token from the probability distribution
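To make both ideas concrete, here is a small sketch combining a causal mask with greedy vs. sampled decoding (my own toy code; the sequence length, vocabulary size, and random logits are made up for illustration):

```python
# A sketch of masked (causal) self-attention plus the two decoding choices:
# greedy (deterministic) vs. sampling from the probability distribution.
import torch
import torch.nn.functional as F

seq_len, d_k, vocab = 5, 8, 10
q = k = v = torch.randn(seq_len, d_k)

scores = q @ k.T / (d_k ** 0.5)
# Causal mask: token i may only attend to tokens 0..i (PREVIOUS tokens)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
weights = F.softmax(scores, dim=-1)   # masked positions end up with weight 0
context = weights @ v

# Pretend logits over the vocabulary for the next token
logits = torch.randn(vocab)
probs = F.softmax(logits, dim=-1)

greedy_token = probs.argmax()                # deterministic: pick the most likely token
sampled_token = torch.multinomial(probs, 1)  # sampling: draw a token from the distribution
```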
I recommend this book if you have already studied these topics in-depth before and are looking for a quick summary and overview to have a knowledge recap. If you're learning from scratch, I would suggest taking a look at other resources that have more depth in theory, like Understanding Deep Learning (book), Attention is All You Need (paper), Language Modeling from Scratch from Stanford (course), and Transformers, The Tech Behind LLMs (DL course). If you're looking for a more practical guide, I suggest the Transformers from Scratch notebook from Kaggle and How Transformer LLMs Work course by DeepLearning.AI, so you can implement it from scratch with Python and PyTorch.
As a follow-up guide, I would give a 4-star review. As a starting book, 3.