I have yet to find another resource that captures deep learning in such a fundamental way without leaning on frameworks. That said, I did have some experience with DL paradigms before reading this work, so I’m not sure whether it delivers everything it is meant to for a complete beginner.
1st Chapter - Easy to skim through. I’m really not confident that it is necessary. Could’ve just given a bulleted list of necessary tools in the preface.
2nd Chapter - He does a decent job introducing the concepts, but I’m not sure he captures nonparametric models quite right for a reader without any experience. He at least acknowledges this by informing the reader that “there is a gray area between parametric and nonparametric algorithms.” My bigger problem is that he calls these counting-based methods. While technically correct, I wish he had used the clearer “grouping-based.” Counting leaves it unclear why and how nonparametric models find what to count, whereas grouping is a bit more intuitive. This is a minor quibble.
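To show what I mean by grouping, here’s a tiny sketch of a counting-based nonparametric model, written by me rather than taken from the book (the feature values and labels are made up): it groups training rows by a discrete feature and then counts the labels inside each group.

```python
from collections import defaultdict, Counter

# Hypothetical data: one discrete feature (say, a color) and a class label.
features = ["red", "red", "blue", "blue", "blue", "green"]
labels   = [  1,     1,     0,      0,      1,      0   ]

# "Grouping": bucket the training rows by feature value.
# "Counting": tally the labels inside each bucket.
counts = defaultdict(Counter)
for f, y in zip(features, labels):
    counts[f][y] += 1

def predict(feature_value):
    # Predict the most common label seen in that group; the model's "size"
    # grows with the data, which is what makes it nonparametric.
    bucket = counts[feature_value]
    return bucket.most_common(1)[0][0] if bucket else None

print(predict("red"))   # 1
print(predict("blue"))  # 0
```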
3rd Chapter - This is one of the best abstract-to-specific introductions I’ve seen to a feedforward neural network, but he never explains weight initialization to the reader, and that omission was a huge knock on this book for me. Especially in an intro book, the reader isn’t going to know how or why weights converge, so throwing in manually assigned weights (rather than calling NumPy’s random functions to do it) leaves the reasoning a bit unclear. I’d be surprised if a few careful readers weren’t confused. It isn’t complicated to explain, but it is pretty vital.
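For what it’s worth, the explanation I wanted fits in a few lines. This is my own sketch, not the book’s code, and the layer sizes and input values are made up:

```python
import numpy as np

np.random.seed(1)                          # only so the example is reproducible

n_inputs, n_hidden, n_outputs = 3, 4, 1    # made-up layer sizes

# Small random weights break symmetry: if every weight started at the same
# value, every hidden node would compute (and be corrected by) exactly the
# same thing, so the network could never learn distinct features.
weights_0_1 = np.random.random((n_inputs, n_hidden)) * 0.1
weights_1_2 = np.random.random((n_hidden, n_outputs)) * 0.1

x = np.array([[1.0, 0.5, -1.0]])           # one hypothetical input example
hidden = x.dot(weights_0_1)                # forward pass, layer by layer
output = hidden.dot(weights_1_2)
print(output)
```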
4th Chapter - I was very pleased to see a simple discussion of why we square the error. A lot of texts assume the reader is familiar with the common MSE practice in statistics, when many people are not. I’ve seen this paragraph-length discussion in a few places now, but I was happy to see it here, because otherwise people get a little too hung up on the whole squaring business. On the negative side: “hot and cold” learning is ridiculous. It only serves to confuse what is otherwise a straightforward process. I would do away with it entirely and jump straight into gradient descent. I admit that might just be because I am used to calculus; perhaps it makes more sense to someone who isn’t familiar with anything past college algebra. There are also some minor typos in the er_dn, er_up comparisons (depending on the version of the book), so be wary of that. As for his discussion of derivatives, he used a lot of words to describe slope. The definition is fine, but it could be cleaned up.
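To show why I’d skip straight to gradient descent, here is the same kind of single-weight problem done with a slope instead of hot-and-cold guessing. This is my own sketch with arbitrary numbers, not an excerpt:

```python
# One input, one weight, one target: minimize error = (pred - goal) ** 2.
x, goal = 0.5, 0.8
weight, alpha = 0.1, 0.5     # arbitrary starting weight and learning rate

for i in range(20):
    pred = x * weight
    error = (pred - goal) ** 2   # squaring keeps the error positive and
                                 # punishes big misses more than small ones
    # The true derivative is 2 * (pred - goal) * x; the constant 2 is
    # typically folded into the learning rate, which is what happens here.
    slope = (pred - goal) * x
    weight -= alpha * slope      # step downhill; no guessing required
    print(f"error={error:.5f}  weight={weight:.5f}")
```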
5th Chapter - There isn’t much to say about this one. It just goes into weighted averages and matrix operations. Important, but one of the easier parts of NNs to understand if you fully understood the prior chapters.
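If you want the one-line refresher, the whole chapter reduces to something like this (my own toy numbers, not the book’s): a layer of predictions is a vector-matrix dot product, i.e., one weighted sum per output node.

```python
import numpy as np

# Three inputs, two output nodes; the values are made up for illustration.
inputs  = np.array([8.5, 0.65, 1.2])
weights = np.array([[0.1, 0.2],    # each COLUMN holds one output node's weights
                    [0.2, 0.0],
                    [0.0, 1.3]])

# Each output is a weighted sum of the inputs -- the "weighted average"
# framing the chapter uses, computed for both nodes at once by the dot product.
prediction = inputs.dot(weights)
print(prediction)   # [0.98, 3.26]
```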
6th Chapter - Ahh, backprop. Once you know it, you wonder how it was ever hard to understand in the first place. There is a lot of good work done here, but I think this is the first time you really need to work through the code sample in order to understand what is going on mathematically. Once you get to the section on “up and down pressure,” I think there is a lack of clarity about its directionality. It would’ve been much clearer if he had skipped the abstraction entirely and moved straight to actual weight changes. It’s worth noting that his overuse of abstraction is probably the throughline behind many of my criticisms: it isn’t that the abstractions are wrong, rather that they are unhelpful at best and confusing at worst. This is one of those cases where he overcomplicates a very simple movement with a poorly formatted table. However, he makes up for it by smoothly transitioning into ReLU functions. Unfortunately, another problem comes up in the initialization of the weight layers: he never explains the -1 in the initialization. I briefly mentioned it earlier; he fails to say why this is done, and he also fails to mention why you’re randomly initializing in the first place. It is a small thing, but for a book about grokking something, not insignificant.
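Since I’ve now complained about it twice, here is the explanation of the -1 (and of random initialization) that I think belongs in the text, as my own sketch; the shapes, seed, and learning rate are mine rather than pulled from the book:

```python
import numpy as np

np.random.seed(1)

# np.random.random() draws from [0, 1). Multiplying by 2 stretches that to
# [0, 2), and subtracting 1 shifts it to [-1, 1): small weights centered on
# zero, so every node can push in either direction from the start.
weights_0_1 = 2 * np.random.random((3, 4)) - 1
weights_1_2 = 2 * np.random.random((4, 1)) - 1

relu = lambda x: (x > 0) * x     # ReLU: pass positives, zero out negatives
relu_deriv = lambda x: x > 0     # its slope: 1 where the node was "on"

x = np.array([[1.0, 0.0, 1.0]])  # made-up input and target
goal = np.array([[1.0]])
alpha = 0.1

# One forward/backward pass -- the "up and down pressure" as actual numbers.
layer_1 = relu(x.dot(weights_0_1))
layer_2 = layer_1.dot(weights_1_2)

layer_2_delta = layer_2 - goal                                    # output pressure
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu_deriv(layer_1)

weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 -= alpha * x.T.dot(layer_1_delta)
```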
7th Chapter - Some visualizations for intuition. May be useful for some people. I found that I already understood them from the prior chapters.
8th Chapter - I was very happy to see that print statements were a big deal in this chapter. They help with understanding what is actually going on in batch gradient descent and overfitting. Dropout was useful, although a bit simplified. I’m not entirely sure how often that simple version of dropout is used in industry, but even if it isn’t, the overall messaging is sound.
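For reference, the simple version I have in mind is roughly this, reconstructed from memory rather than quoted: a random binary mask over the hidden layer, with the survivors scaled up so the expected signal stays the same.

```python
import numpy as np

layer_1 = np.random.random((1, 8))     # stand-in hidden activations

# Keep each node with probability 0.5. Multiplying the survivors by 2
# (1 / keep probability) keeps the layer's expected magnitude unchanged,
# so nothing needs to be rescaled at test time.
dropout_mask = np.random.randint(2, size=layer_1.shape)
layer_1_dropped = layer_1 * dropout_mask * 2

print(dropout_mask)
print(layer_1_dropped)
```

Frameworks generally implement the same idea as inverted dropout with an arbitrary keep probability, so even if this exact version isn’t what you’d see in industry, the messaging carries over.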
9th Chapter - Much ado about activations. This is my favorite part of the book, solely because he showcases neural networks in a correlation-map format. I’d seen these intermediate visualizations of what a NN is doing before, but this is the first time it has really stuck that you can apply correlation maps to the results in order to “see”. Very pleased. On softmax, though, I was a little disappointed by how verbose he ended up being. One diagram with a short paragraph (the sigmoid treatment) essentially explains it.
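Since I’m claiming it only takes a short paragraph: softmax exponentiates each raw score and divides by the sum, so the outputs are positive, sum to 1, and the largest score takes the largest share. A minimal sketch of my own (the max-subtraction is a standard stability trick, not necessarily the book’s):

```python
import numpy as np

def softmax(x):
    # Subtracting the max doesn't change the result but avoids overflow
    # when exponentiating large scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # made-up raw outputs for three classes
print(softmax(scores))               # ~[0.659, 0.242, 0.099], sums to 1
```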
10th Chapter - Despite my previous complaints about verbosity, I don’t feel like there was anything deep about this chapter on ConvNets. For such a popular processing model, this piece felt very unfinished. Luckily, the intuition is easy to grasp, but the deeper mathematical concepts are skated over. I should note that the idea is actually quite easy to grasp if you are familiar with concepts like Hamming codes, which bear an uncanny resemblance here.
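To make the intuition concrete, here is a bare-bones 2D convolution; it’s my own sketch, not the book’s implementation, and the image and kernel are toys. The idea is just sliding a small kernel across the image and taking a dot product at every position, so the same little feature detector gets reused everywhere.

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return out

image  = np.random.random((6, 6))                # toy "image"
kernel = np.array([[1.0, -1.0],                  # toy vertical-edge detector
                   [1.0, -1.0]])
print(conv2d(image, kernel).shape)               # (5, 5) feature map
```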
11th Chapter - Like a lot of these final chapters, I just feel like the author was giving the reader a taste of what is out there. Many of these subjects are books and specializations on their own, so I can’t blame him for not going into too much detail. That being said, if I hadn’t had experience with NLP before reading this, I would’ve been deeply disappointed. His writing on embedding layers is probably some of the least clear writing in the book, largely because he never gives one-hot encoding vs. label encoding its own section.
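The missing section could have been this short (my own sketch; the vocabulary and embedding size are made up): label encoding gives each word an integer, one-hot encoding turns that integer into an indicator vector, and multiplying a one-hot vector by a weight matrix just selects one row of it, which is all an embedding layer does.

```python
import numpy as np

vocab = ["the", "cat", "sat"]                  # hypothetical 3-word vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

label = word_to_index["cat"]                   # label encoding: "cat" -> 1

one_hot = np.zeros(len(vocab))                 # one-hot encoding: [0, 1, 0]
one_hot[label] = 1

embedding = np.random.random((len(vocab), 4))  # 4-dimensional embedding table

# Multiplying by a one-hot vector just picks out one row, so in practice
# you skip the matrix multiply and index the row directly.
print(one_hot.dot(embedding))
print(embedding[label])                        # identical result, no matmul
```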
12th Chapter - Similar to the prior chapter, I think my view here is skewed. I found it quite intelligible due to my knowledge of certain NLP concepts, but I really don’t think it would read well without that knowledge. I was able to relate these ideas to things like n-back algorithms and the like. He just didn’t spend enough time on what was going on here.
13th Chapter - I’m unconvinced that the best way to teach a framework is in this ground-up manner. I recognize that this is the ethos of the book, but this is one of those cases where I think a top-down approach works much better than a bottom-up one. A lot of this chapter is just rehashing what you already know, which is fine, but it maybe belongs in a different book altogether.
14th Chapter - Easily my least favorite chapter. Way too short and the intuition building here is almost non-existent.
15th Chapter - This is almost just a “hey, here is a fun-ish new thing that I don’t really want to explain” final piece. I’m not sure you should end a book on deep learning with a section called “Homomorphically encrypted federated learning” when encryption, which is an entirely different field, hasn’t been discussed anywhere else in the book.
I was actually pretty pleased with this book. My biggest problem is that I’m not sure it knew exactly what it wanted to be. If I had to give advice, I would grab the intuitions (both mathematical and abstract) and leave most of the coding samples alone; there are better ways to work through the code than copying so-so snippets. I’m excited to read through a couple more and see how they stack up against this one.