http://www.deeplearningbook.org/ This book can be useful for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in many software disciplines including computer vision, speech and audio processing, natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance.
This book has been organized into three parts in order to best accommodate a variety of readers. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning.
We do assume that all readers come from a computer science background. We assume familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory level calculus and some of the terminology of graph theory.
Ian J. Goodfellow is a researcher working in machine learning, currently employed at Apple Inc. as its director of machine learning in the Special Projects Group. He was previously employed as a research scientist at Google Brain.
I decided to read this book because I wanted to learn about Deep Learning, and everywhere I looked on the Internet seemed to point in this direction as the book you need to read to learn about DL. I gave up around page 220 (this is at the end of chapter 6), when I realized that I was not learning anything, but not only that, I was getting confused about topics I already knew.
Before I go into detail, just a bit about my background. I am a Comp Sci researcher in my early 30s, working in a university. My main topics of research are robotics and computer vision, but I use machine learning a lot, mostly as an end user, where I just feed the data to some ML tool. For many years I have been mostly using SVMs (which work very well for what I need), but I decided it was time to move forward and try DL, and that is when I decided to get this book.
If you are like me, someone who knows about ML or AI and wants to know about DL, or even if you are someone who doesn't know much about ML or AI, please run away from this now. Don't make the mistake I made of devoting your time to a pointless process. I am not going to recommend any other book, because I don't want people to think I am here to shill, but there are way better options out there.
If you want to learn DL: run away. In my opinion this book is for people who are already experts in DL and want to go deep into its theoretical background. If that is your case, then probably this is perfect for you. It is probably also very good for PhD students or researchers who are researching DL. If you want to use DL as a final end-user: run away.
What is my problem with this book? It is written in a very obfuscated way. Think about the readability of a Wikipedia article. If you go to the Wikipedia page on SVM and you are able to perfectly understand it, then go for this book. But if you need something more prosaic (as I do), run away. The book also drifts a lot in what it explains. Every chapter is about one topic, but the subsections drift a lot, and there are often theorems or math explanations about stuff which in reality shouldn't be there, and are just confusing. I really think it needs a strong second edit, where some of the stuff should be removed (easily 50% of the content) and the other 50% made more readable. Very often when reading some stuff in the book I had to stop and search on Google for somewhere else where the same concept was being explained, in order to understand it.
The first 5 chapters are theoretical and they are supposed to give you a foundation in topics you will need later (I actually doubt they checked whether everything explained there was used at any point in the book; it is more like a knowledge dump). Most of the concepts explained here are actually very basic (let's say, undergrad level) but the way they are explained is extremely obfuscated. It happened to me a few times that I was reading some explanation, and I had to google it: "wait wait, they are just explaining this?" As an example, chapter 2 finishes with an example (PCA). I have no idea what the point of having this here is, because it doesn't add anything to the book (apart from something like 20 cool looking formulas). PCA is a very simple technique, but the way they explained it is so obfuscated that even after reading it twice, I have no idea what they mean, or even worse, how I can actually implement this, or use it.
The moment I decided to give up on this book was during chapter 6. Chapters 1-5 were very bad, just a theoretical dump, but I thought, maybe later the good stuff will start, because theoretical chapters are generally difficult to read. Chapter 6 is about deep feedforward networks. I decided I had had enough in the section where they explained backpropagation. Backprop is something I learned as an undergrad more than 10 years ago. It is something I also learned during my master's. I have even coded it twice during my programming life. It is not a simple algorithm, but it is not difficult to understand. Because NNs and backprop are very common, I have seen them in many books, and I have to say the explanation here is by far the worst I have seen in my life. I have no idea what they were explaining. I had to read it twice, and then fill the gaps with the stuff I already knew about backprop. I had to write down the algorithm the way I remember it, and then reverse engineer their description to match it. At this point, after 220 pages, I realized I was not learning anything. If you ask me what I have learned about DL after 220 pages, I can tell you easily: nothing. If I cannot even understand something I already know, how am I going to understand new stuff?
The problem with this book (as happens with many technical books) is that it is written by very well respected researchers. The fact that they are amazing researchers doesn't mean that they are good writers. But you see a DL book written by Goodfellow and Bengio, and who is going to say that it is bad? I wonder how many people actually read it.
As I said before, if you are new to DL and want to learn about it: run away. Do yourself a favour and get some other book.
This is apparently THE book to read on deep learning. Written by luminaries in the field - if you've read any papers on deep learning, you'll have encountered Goodfellow and Bengio before - and cutting through much of the BS surrounding the topic: like 'big data' before it, 'deep learning' is not something new and is not deserving of a special name. Networks with more hidden layers to detect higher-order features, networks of different types chained together in order to play to their strengths, graphs of networks to represent a probabilistic model.
The book is 150 pages of background (stats, linear algebra), 300 pages of applications, and 200 pages of research topics. The applications section is the meat of the book: the background you might already know, and the research topics are mostly of interest only to fellow machine learning researchers.
Take note that this is a theoretical book. I read it in tandem with Hands-On Machine Learning with Scikit-Learn and TensorFlow, almost chapter-for-chapter. The Scikit-Learn and Tensorflow example code, while only moderately interesting on its own, helped to clarify the purpose of many of the topics in the Goodfellow book.
Part 1: basic math and machine learning, no problem. Part 2: the part I like the most. It includes almost everything we need to know to adapt deep learning algorithms to practical matters. Part 3: still feeling meh. It's too difficult for me to understand at this moment. Maybe I will come back after finishing PRML book.
Have I finished READING the book? No. Have I skimmed through and known where to go back and read in detail? Yes. In fact, I read actively (i.e. taking notes and read all the proofs) till the end of Chapter 6, and read passively the rest. The book has 3 parts. Part I, the math background; part II, the most used ML techniques; and part III, current active research topics. Imo, this book would be suitable for grad level Math/Stat/CS students who are comfortable working with proofs AND know what they are looking for. So if you are one who just wants to learn abt DL in general and don't have a specific research question, it would be tough to follow through. I think this book suits my needs really well, and I enjoy its structure. Part I reviews absolutely crucial math concepts for ML and DL. I like some of the proofs in this section; they have different approaches from what I have done/read before. Part II and III are technical heavy, but important. I skimmed through them because I'm looking for some specific information and don't have time to process all the information. At some point, I did feel quite overwhelmed, mostly because I prefer bullet point presentation over huge text explanation, especially with complex concepts. Did I find the answers to my questions? Not quite. But this book provides me some clues, which I deeply appreciate. Can we also talk about the fact that the authors provide free access to the book? Seriously, I can't thank the authors enough!
The "I'm finished" should more or less be interpreted as "I've had it". This book is both awesome and horrible. It's awesome because it gives an extremely up-to-date view of what is currently state of the art. At this moment the book isn't even published, and it will be a landmark once it hits the shelves. It is horrible because it disguises all insights in maths, and practical use/application has to be sought out (instead of being plainly obvious). The latter means that this book is really written as an academic text (and I'm not part of the target audience ;-) ). I've found that the first part of this book reads well alongside, for example, the deep learning course offered by Udacity; both follow the same logical order, but one is more 'popular' than the other.
Reading this book was tiresome. Imagine extracting the most technical pieces of hundreds of publications and piling them all together into a single book. This really is a prescription for an unreadable manual, and that's unfortunately what has happened to the "Deep Learning" book. I definitely prefer reading articles (including brilliant articles by the authors of this book).
* Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.
* Regularization of an estimator works by trading increased bias for reduced variance.
* In neural networks, typically only the weights and not the biases are used in norm penalties
* Effect of weight decay: weight components along small-eigenvalue directions of the Hessian are shrunk more than those along large-eigenvalue directions.
* Linear hidden units can be useful: if rather than g(Wx+b) we have g(UVx+b) then we have effectively factorised W, which can save parameters (at the cost of constraining W to lower rank).
* "The softplus demonstrates that the performance of hidden unit types can be very counterintuitive—one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not."
* Hard tanh
* L2 regularisation comes from a Gaussian prior over weights + MAP
* L1 regularisation comes from a Laplace prior
* Norm penalties are equivalent to constrained optimisation problems: constraining to an L-n ball whose radius depends on the form of the loss
* With early stopping, after you've finished on the training set, you can now also train on the validation data
  * You can either train again from scratch with the val data added in
    * You can do the same number of parameter updates
    * Or the same number of passes through the data
  * Or also train on the validation data after the first round of training
    * Perhaps until the objective function on the validation set reaches the same level as the training set
* Early stopping is in a sense equivalent to L2 regularization in that it limits the length of the optimisation trajectory. It is superior in that it automatically fine tunes the eta hyperparameter
* Bagging = bootstrap aggregating
* Dropout: Typically, an input unit is included with probability 0.8, and a hidden unit is included with probability 0.5.
* Although the extra cost of dropout per step is negligible, it does require longer training and a larger model. If the dataset is large then this probably isn't worthwhile.
* Wang and Manning (2013) showed that deterministic dropout can converge faster
* Regularising noise has to be multiplicative rather than additive because otherwise the outputs could just be made very large
* Virtual adversarial training: generate example x which is far from any real examples and make sure that the model is smooth around x
* Smaller batch sizes are better for regularisation, but often underutilise hardware
* Second order optimisation techniques like Newton's method appear to have not taken off due to them getting stuck in saddle points
* Cliffs are common in objective landscapes because of "a multiplication of many factors"
* Exploding/vanishing gradient illustration: if you are multiplying by M repeatedly, then components along eigenvalues with magnitude > 1 will explode and those with magnitude < 1 will vanish. This can cause cliffs.
* A sufficient condition for convergence of SGD is $\sum_{k=1}^\infty \epsilon_k = \infty$ and $\sum_{k=1}^\infty \epsilon_k^2 < \infty$
* Often a learning rate is decayed linearly until iteration $\tau$, such that $\epsilon_k = (1-\alpha) \epsilon_0 + \alpha \epsilon_\tau$ with $\alpha = k/\tau$
* Newton's method isn't commonly used because calculating the inverse Hessian is O(k^3) in the number of parameters k
* Coordinate descent: optimise one parameter at a time
* Block coordinate descent: optimise a subset of parameters at a time
* Three ways to decide the length of an output sequence of an RNN:
  * Have an end-of-sequence token
  * Have a binary output saying whether the sequence is done or not
  * Have a countdown to the end as one of the outputs
* Reservoir computing: hard code the weights for the recurrent and input connections, only train output
* You can only accept predictions above a given confidence. Then the metric you might use is coverage.
* If performance on the training set is low, try to get this up by increasing model size, more data, better hyperparameters, etc.
* If performance on the test set is low, try adding regularizers, more data, etc.
* Hyperparameter vs loss tends to be U-shaped
* One common learning rate regime is to wait for plateaus then reduce the learning rate 2-10x
* Learning rate is probably the most important hyperparameter
* Learning rate vs training error: apparently exponential decay then a sharp jump upwards (p. 430)
* In conv nets, there are three padding schemes (TODO: look up names):
  * no padding
  * pad enough to preserve image size
  * pad enough so that all pixels get equal convolutions (increase image size)
* Grid hyperparameter search is normally iterative: if you search [-1, 0, 1] and find 1 is the best, then you should expand the window to the right
* To debug: look at most confident mistakes
* Test for bprop bug: manually estimate the gradient (see p. 439)
* Monitor outputs/entropy, activations (for dead units), updates to weight magnitude ratio (should be ~1%)
* Your metric can be coverage: hold accuracy constant and try to improve coverage
* Groups of GPU threads are called warps
* Mixture of experts: one model predicts weighting of expert predictions.
* Hard mixture of experts: one model chooses a single expert predictor
* Combinatorial gaters: choose a subset of experts to vote
* Switch: model receives subset of inputs (similar to attention)
* To save on compute use cascades: use a cheap model for most instances, and an expensive model when some tricky feature is present. Use a cascade of models to detect the tricky feature: the first has high recall, the last has high precision.
* Common preprocessing step is to make each pixel have mean zero and std one. But for low-information pixels this may just amplify noise or compression artefacts. So you want to add a regularizer (p. 455)
* Image preprocessing:
  * sphering = whitening
  * GCN = global contrast normalisation (whole image has mean zero and std 1)
  * LCN = local contrast normalisation (each window / kernel has mean zero and std 1)
* Rather than a binary tree for hierarchical softmax, you can just have a breadth-sqrt(n) and depth 2 tree
* ICA is used in EEG to separate the signal from the brain from the signal from the heart and the breath
* Recirculation: autoencoder to match layer-1 activations of original input with reconstructed input
* Under-complete autoencoders have lower representational power in hidden space than the original space
* Over-complete autoencoders have at least as much representational power in hidden space as original space
* Denoising autoencoder: L(x, g(f(x + epsilon)))
* CAE = contractive autoencoder: penalises derivatives (so it doesn't change much with small changes in x)
* Score matching: try to get the same \nabla_x \log p(x) for all x
* Rifai 2011a is where the iterative autoencoder comes from
* Autoencoder failure mode: f can simply multiply by epsilon, and g divide by epsilon, and thereby achieve perfect reconstruction and low contractive penalty
* Semantic hashing: use an autoencoder to create a hash of instances to help with information retrieval. If the last layer of the encoder is a sigmoid, you can force it to saturate at 0 and 1 by adding noise just before the sigmoid, so it will have to push further towards 0 and 1 to let the signal get through
* Denoising autoencoders learn a vector field towards the instance manifold
* Predictive sparse decomposition is a thing
* Autoencoders can perform better than PCA at reconstruction
* I didn't understand much of linear factor models
  * Probabilistic PCA
  * Slow feature analysis
  * Independent component analysis
* You can coerce a representation that suits your task better
  * E.g., for a density estimation task you can encourage independence between components of hidden layer h
* Greedy layerwise unsupervised pretraining goes back to the neocognitron, and it was the discovery that you can use this pretraining to get a fully connected network to work properly that sparked the 2006 DL renaissance
* Pretraining makes use of two ideas:
  * parameter initial conditions can have a regularisation effect
  * learning about input distribution can help with prediction
* Unsupervised pretraining has basically been abandoned except for word embedding.
  * For moderate and large datasets simply supervised learning works better
  * For small datasets Bayesian learning works better
* What makes good features? Perhaps if you can capture the underlying (uncorrelated) causes, these would be good features.
* Distributed representations are much more expressive than non-distributed ones
* Radford 2015 does vector arithmetic with images
* While NFL theorems mean that there's no universal prior or regularisation advantage, we can choose some which provide an advantage in a range of tasks which we are interested in.
Perhaps priors similar to those humans or animals have.
* To calculate probabilities in undirected graphs (e.g., modelling sickness between you, your coworker, your roommate) take the product of the "clique potentials" for each clique (and normalise). The distribution over clique products is a "Gibbs distribution"
* The normalising Z is known as the partition function (statistical physics terminology)
* "d-separation": in graphical models context, this stands for "dependence separation" and means that there's no flow of information from node set A to node set B
* Any relationship structure can be modelled with directed or undirected graphs. The value of these is that they eliminate dependencies.
* "immorality": a collider structure in a directed graph. To convert this to an undirected graph, it needs to be "moralised" by connecting the two incoming nodes (creating a triangle). The terminology comes from a joke about unmarried parents.
* In undirected graphs a loop of length greater than 3 without chords needs to be cut up into triangles before it can be represented as a directed graph
* To sample from a directed graph, sample from each node in topological order
* Restricted Boltzmann Machine = Harmonium
  * It consists of one hidden layer and one visible (non-hidden) layer, fully connected to each other
  * The "restricted" means that there are no connections between units within the same layer
* Continuous Markov chain = Harris chain
* Perron-Frobenius Theorem: for a transition matrix, if there are no zero-probability transitions, then the largest eigenvalue is real and equal to one.
* Running a Markov chain to reach equilibrium distribution is called "burning in" the Markov chain.
* Sampling methods like Gibbs sampling can get stuck in one mode. Tempered transitions means temporarily raising the temperature of the transition function between samples to make it easier to jump between modes
* There are two kinds of sampling: Las Vegas sampling, which always either returns the correct answer or gives up, and Monte Carlo sampling, which will always return an answer, but with a random amount of error
* You can decompose the gradient of the log-likelihood into a positive phase and a negative phase:
  * $\nabla_\theta \log p(x;\theta) = \nabla_\theta \log \tilde{p}(x;\theta) - \nabla_\theta \log Z(\theta)$
* I'm not understanding a ton of the chapter on the partition function.
* Biology vs back prop: If brains are implementing back prop, then there needs to be a secondary mechanism to the usual axon activation.
  * Hinton 2007a, Bengio 2015 have proposed biologically plausible mechanisms.
* Dreams may be sampling from the brain's model during negative phase learning (Crick and Mitchison, 1983)
  * I.e., "approximate the negative gradient of the log partition function of undirected models."
  * It could also be about sampling p(v,h) for inference (see p. 651)
  * Or may be about reinforcement learning
* Most important generative architectures:
  * RBM: restricted Boltzmann machine, bipartite & undirected
  * DBN: deep belief network, RBM plus a feedforward layer to the sensors
  * DBM: deep Boltzmann machine, stack of RBMs
* RBMs are bad for computer vision because it's hard to express max pooling in energy functions.
  * Also, the partition function changes with different sizes of input
  * Also doesn't work well with boundaries
* There are lots of problems with evaluating generative models.
  * One approach is blind taste testing.
    * To prevent the model from memorising, also display the nearest neighbour in the dataset
  * Some models are good at maximising likelihood of good examples, others good at minimising likelihood of bad examples
    * Depends on the direction of the KL divergence
  * Theis et al. (2015) reviews issues with generative models
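One item in the notes above, testing for a backprop bug by manually estimating the gradient, may be easier to picture with a small example. The following is only a sketch under assumptions: the toy `loss`, the hand-derived `analytic_grad` (standing in for whatever a backprop implementation returns) and the step size `eps` are all hypothetical choices for illustration, not taken from the book.

```python
def loss(w):
    """Toy scalar loss f(w) = w0^2 + 3*w0*w1 (hypothetical example)."""
    return w[0] ** 2 + 3.0 * w[0] * w[1]


def analytic_grad(w):
    """Hand-derived gradient of `loss`; stands in for a backprop result."""
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0]]


def numeric_grad(f, w, eps=1e-5):
    """Centered finite-difference estimate of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps      # nudge one coordinate up
        down[i] -= eps    # and the same coordinate down
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def max_abs_diff(a, b):
    """Largest coordinate-wise disagreement between two gradients."""
    return max(abs(x - y) for x, y in zip(a, b))


w = [0.5, -1.5]
gap = max_abs_diff(analytic_grad(w), numeric_grad(loss, w))
print("largest gradient mismatch:", gap)
```

If the two gradients disagree by much more than rounding error, the analytic (backprop) side is the usual suspect. A real check would loop over every parameter tensor of the network rather than a two-element list.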
This is a dense and challenging read, but currently “the Bible” of machine learning. Walk through any Machine Learning teams offices in Silicon Valley and you’ll find this book leaning against a monitor somewhere. I invested a lot of time early on with the book, getting mentors for different sections. Some of it was over my head, others deeper than I needed to go. I appreciated the breadth it gave of the topic but this definitely isn’t a book for someone new to the field of ML. It presumes a lot of background especially in math. Even the first section math refreshers chapters are pretty intense. I both appreciated this book and was frustrated by it. If you want an academic survey paper with cited sources in the form of a book this is the book for you. If you want an approachable introduction this probably isn’t it. The authors claim to be targeting two audiences: students and software engineers in industry. I think they understand the first audience much more than the second.
This book is great for readers to gain intuition behind many of the concepts underpinning deep learning techniques taken for granted, with a focus on probabilistic graphical models towards the end. It teaches how to approximate approximations of approximations due to life's intractability.
I found this book to be an excellent introduction and overview of deep neural networks for someone who already understands other types of statistical and machine learning models. It can be a challenging book, but it's clear and well written; the challenge is commensurate with the inherent complexity of the material, and not because the authors capriciously skip steps. In fact, rather the opposite is the case - the authors are quite explicit and put in more intermediate steps in their derivations than most books, let alone papers, which I quite like.
This is not the "here's some code, off you go" book. This is the book that explains what is going on and why, so that you will be able to make principled decisions and not just be an "appliance operator" when you then go read the book with the practical details and code samples. If your preferred style of learning is to understand the concepts before applying them, read this book first; if not, come back and read it afterwards.
A broad overview of the current state of deep learning. Given its introduction to machine learning in general, it can also serve as a starting point for learning "machine learning". Yet this is not a step-by-step tutorial, rather a place where one can start reading and be redirected elsewhere for details. For me it was a great way to organize all the bits I had about deep learning. Part III was too hard for a practitioner like me so I just skimmed through.
THE most rigorous and up-to-date reference on deep learning algorithms that is almost self-contained. Though if you intend to learn deep learning from scratch, this book will not suffice: some important concepts are described at too high a level, so complementary material is needed to fully understand the algorithms in detail.
It’s very difficult to review this book within the format of Goodreads. It provides a tremendous amount of detail on neural networks and especially the deep versions of them. The writers succeeded in finding an appropriate way to categorize the topics so that the ideas are conveyed smoothly.
If you ask me about only one book about Deep Learning I would suggest this one. It covers everything. Starting with fundamentals like linear algebra, probability, statistics, optimizations and finishing with deep neural nets. Just an amazing book for studying the field.
I volunteered to present some of the material central to Modern Neural Networks in a bunch of class presentations, lectures to undergrads at my undergrad institution and reading groups from December 2015 to March 2016, and used that as an excuse to read this book page by page, and used it to make my presentation slides. I am glad it exists, as it summarizes much of the history and the recent work in Neural Networks. The earlier book on the subject (A Foundations and Trends volume by Bengio - Learning Deep Architectures for AI) centers mostly on generative models (part III in this book) and in a sense is not directly relevant to much of the recent progress in Deep Learning (although in the long run, it will perhaps again be more important). Thus in a way, this book fills the space by presenting a more holistic picture by including feedforward networks, recurrent networks and optimization strategies and tricks. To summarize: Part I of the book is mostly a whirlwind review of the basics of Machine Learning and optimization, Part II is about deep feedforward networks, regularization strategies and sequence modeling. Part III is more about generative Neural Models.
I only have one complaint - that is, it would have been much better had the book been more concise and concrete. This preference for crispness is personal and I do believe that its verbosity would be useful to people. Another "complaint" is that it would have been great had there been a section on deep reinforcement learning. Also, given the speed of research in deep learning, much of this book would need to add sections later, but at least it does a decent job in presenting the backbone of modern neural networks research, which will remain a constant.
I will make my slides from my presentations from parts II and III (which were not that great I presume), available in a link that I will post over here, just in case it might be useful for others.
I used it to review basic statistics knowledge needed for ML, such as linear algebra, probability, and statistical learning theory.
Here are some philosophical thoughts:
1) Deeper understanding of basic concepts in linear algebra
a) The concept of norm: the L^p norm is defined by (sum(|each_dim_of_x|^p))^{1/p}. Intuitive examples are L1 = Manhattan distance (i.e., driving distance in a city), L2 = the length from 0 to x in Euclidean space (i.e., the Pythagorean theorem)
The essence: A vector (x1, x2, x3, ..., xn) has n values that represent it; however, we only want one value to represent its norm. So we look for some type of equivalence: a_representative_value^p = |x1|^p + |x2|^p + ... + |xn|^p. In this case, we approximate
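As a concrete illustration of the norm discussed in (a), here is a minimal sketch using the standard definition (sum of absolute values raised to p, then the p-th root). The function name `lp_norm` and the sample vector are assumptions made for this example only.

```python
def lp_norm(x, p):
    """L^p norm of a sequence of numbers, for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for a true norm")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3.0, -4.0]
l1 = lp_norm(x, 1)  # Manhattan distance: |3| + |-4|
l2 = lp_norm(x, 2)  # Euclidean length: sqrt(3^2 + 4^2)
print(l1, l2)
```

Replacing the sum with an average would only rescale the result by a constant factor of n^(-1/p), so the "one representative value" intuition is the same either way.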
b) The concept of matrix: a matrix is a set of vectors, and this set of vectors forms a new coordinate system, so a "transformation" can be done by changing from the normal coordinates to the new coordinate system defined by this matrix. *Reference: see a wonderful intro series on matrices in Chinese: [1] [2] [3]
c) The concept of x^n in the n-dim geometric space We know that x^1 is the "size" from 0 to x; x^2 is the "size" of an object whose diagonal is from 0 to (x, x); x^3 is the "size" of an object whose diagonal is from 0 to (x, x, x) (i.e., the size of the cube); so I guess x^n is the "size" of an object whose diagonal is from 0 to the n-dim point (x, ..., x).
Another understanding: the x^n that we first get to know is its curve y=x^n on the x-y coordinate.
2) dot product vs. cross product
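In physics:
- Dot product (x y cos(alpha)): Work = force * displacement. When a force is applied along a displacement, only the component of the force vector in the same direction as the displacement vector is effective force [in essence, this is the same basic concept as multiplying one number by another].
- Cross product (x y sin(alpha)): angular velocity points along from_o_to_the_point × linear_velocity (scaled by 1/|r|^2); torque M = r × Force; magnetic field. The reason for the direction of the cross product vector is that we need a representation of "a rotation on a 2D surface" (as with angular velocity), so we use the normal vector of the surface, taking the rotation as counterclockwise. For example, the torque on a lever is lever arm × force; the key is that only the mutually perpendicular parts of the two vectors span an area [in essence, this is basically the same as computing an area].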
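The two physics examples above can be checked numerically. This is a small sketch with made-up vectors (the force, displacement and lever-arm values are assumptions chosen for easy arithmetic), using plain tuples for 3D vectors.

```python
def dot(a, b):
    """Dot product: only the parallel components contribute."""
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3D vectors: only perpendicular components contribute."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


# Work: a force with a sideways component, displacement along x only.
force = (3.0, 4.0, 0.0)
displacement = (2.0, 0.0, 0.0)
work = dot(force, displacement)  # only the x-component of the force counts

# Torque: lever arm along x, force along y, so they are fully perpendicular.
lever_arm = (2.0, 0.0, 0.0)
push = (0.0, 5.0, 0.0)
torque = cross(lever_arm, push)  # points along z, the rotation axis

print(work, torque)
```

The torque's magnitude equals the area of the rectangle spanned by the lever arm and the force, which is the "area" reading of the cross product described above.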
As an algorithm of life:
- "set a life goal, and decide what work to do" is the dot product of (objective, efforts), although here objective is more of a unit vector that mainly encodes the direction. A better case can be (efficiency, type of work). Note that this dot product should be distinguished from addition of two vectors, e.g., past 5 years' effort + upcoming 5 years' planned work.
- "learn a (transferable) skill set, and broaden the application areas" is the cross product of (skills, applications). Note that the more orthogonal the skill set is to the applications (i.e., transferable/general skills), the larger the cross product (more similar to a rectangle); the more specific the skill set is (i.e., highly paired with a specific application), the smaller the cross product (a thin parallelogram with sharp angles). A team with collaborative efforts on different (orthogonal) dimensions (e.g., some developing products, and others increasing sales) can also be seen as a cross product.
- Dot product is for "efficiency", and cross product is for "impact size"
3) Finding the right representation of a concept, not just its surface form. I really like the introductory sentence of 2.7 Eigendecomposition: "Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them."
Examples: - Discovering something about the true nature of an integer by decomposing it into prime factors, e.g., 12 = 2×2×3. - Eigendecomposition: an eigenvector of a square matrix A is a nonzero vector v such that multiplication by A alters only the scale of v. - The polar coordinate system. - Using cos and sin to represent waves and all periodic things. - Rewriting a^x as e^(x·ln a), so that its derivative becomes a^x · ln(a), where ln(a) is how that number converts to e. - In general, parametric functions!
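The first two examples can be checked by hand in a few lines. This is only a sketch: `prime_factors` and `matvec` are hypothetical helper names, and the 2×2 matrix is an example I picked because its eigenvectors are easy to guess, not one taken from the book.

```python
def prime_factors(n):
    """Decompose a positive integer into its prime factors by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever is left over is itself prime
    return factors

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Example symmetric matrix (an assumption for illustration).
A = [[2.0, 1.0],
     [1.0, 2.0]]

v = [1.0, 1.0]    # candidate eigenvector, expected eigenvalue 3
w = [1.0, -1.0]   # candidate eigenvector, expected eigenvalue 1

Av = matvec(A, v)  # only the scale of v changes
Aw = matvec(A, w)  # w is left unchanged

factors_of_12 = prime_factors(12)
```

Multiplying by A stretches v by a factor of 3 and leaves w alone, which is exactly the "alters only the scale" property in the definition.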
What things in life are similar to finding the right parametric functions: - When we build an analogy, "A to B is like A' to B'", we are actually just checking the mapping f: A->A' and B->B', and then converting our knowledge into "A to B" plus "how to convert other frameworks such as A' and B' to the A-to-B framework". - When we try to learn things fast, lots of people prefer to find shortcuts (e.g., in language learning, we tend to do word-to-word translation from a language we know to the language we are trying to learn; the same goes for learning other kinds of knowledge). These shortcuts are very likely to be just a "parametric function" that maps the vocabulary of the new thing we are learning to the old thing we are familiar with. - The amazing thing about math is finding the "invariant rules" + "for everything else, learn a conversion from the concrete subject to elements in those rules". - As Prof Joachim mentioned, "changing perspectives" is a good way to revolutionize a field. The essence of "changing perspectives" is to find a better parametric function. - One caveat is that analogies/parametric functions have their limitations: do not try to forcefully apply them to things that do not share similarities at any level of abstraction.
In general, "finding the right representation" of data also relates to finding a good data structure (paired with the algorithm) in computer science, or finding the storage-efficient database design in database optimization.
---- Other interesting points: 1) Etymology of "marginal distribution": Quote: The name “marginal probability” comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row.
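The "sum across a row and write it in the margin" picture translates almost literally into code. A small sketch, assuming a made-up joint table `joint` (the probabilities are illustrative values of mine, chosen only so that they sum to 1):

```python
import math

# Joint distribution P(x, y) as a grid: rows are values of x, columns are values of y.
joint = [
    [0.10, 0.20, 0.10],  # x = 0
    [0.25, 0.05, 0.30],  # x = 1
]

# P(x): sum across each row, as if writing the total in the right margin.
p_x = [sum(row) for row in joint]

# P(y): sum down each column, as if writing the total in the bottom margin.
p_y = [sum(col) for col in zip(*joint)]

# Sanity check: each marginal is itself a probability distribution.
assert math.isclose(sum(p_x), 1.0)
assert math.isclose(sum(p_y), 1.0)
```

Here `p_x` comes out as roughly 0.4 and 0.6, and `p_y` as roughly 0.35, 0.25 and 0.4, up to floating-point rounding.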
This book is for grad students, advanced practitioners and theoreticians, so I (a hobbyist engineer) was only able to read about the first half. It works well to gauge your level of understanding of how deep learning is implemented.
My big takeaway from the book is that for my purposes I don’t need to understand these implementation details, even if I find them very interesting, because anything worth doing is implemented in libraries and by cloud providers. A hobbyist engineer like me can run code and get results using existing implementations.
I am prone to go down rabbit holes on projects reading more background and implementation details than necessary so I stopped grinding through this book. I was able to train an AI that plays the SNES game Kirby’s Avalanche and make a YouTube video about it anyway!
I rated this book a bit higher than I might have otherwise as it is operating at the edge of what research was at the time it was written. It's a pretty strong rundown in that regard.
The negative side is that it obfuscates its information by its presentation. It's not motivated well -- if I weren't already familiar with most of it, it might have been harder to grasp, but I can't test that hypothesis. Some people complain about the math in the reviews -- I don't, as math can be self-explanatory. But the logical progressions weren't always framed well, and sometimes they leaped into acronyms too quickly, or dropped in a symbol they only used in that one place in the book, and that isn't always standard.
A very good high-level overview of the most popular deep learning techniques at the time. I will keep it around as a reference for sure.
It requires some prior maths, statistics and machine learning knowledge, but is not a mathematical book with proofs and detailed abstract theory. The focus is on practically applicable theory at a high level, which it provides in a good way. Look elsewhere both for practical instructions on how to use various tools and frameworks (e.g. Tensorflow) and also for a rigorous mathematical treatment of machine learning.
This tries to be the CLR (Cormen, Leiserson and Rivest's Introduction to Algorithms) of deep learning. But it might be too early for that, so the last part is more experimental. Also, statistics is different from real math, so all the proofs don't make much sense.
Very theoretical, with a steep learning curve. It would be much better if it had code and practical examples as well as exercises. Perhaps it is better to get Deep Learning with Python by Chollet or the O’Reilly book by Géron, which has Jupyter Notebook examples and exercises.
Finally made my way through the bulk of this, has all the fundamentals of the core DL concepts in really good depth along with some of the more specific and recent innovations. One to continually refer back to.
I only read a few parts of the book, but I also don't think the book was written in a way that lends itself to being read from start to end. Too chaotic for that. There are some good intuitions, but a lot of the material is also too outdated these days. It doesn't seem to focus enough on general principles, so there are too many specific ideas and architectures.
"Deep Learning" was first published in 2016. It introduces the broad topic of deep learning, covering mathematical and conceptual background, deep learning techniques used in industry, and research perspectives. Deep learning is a form of machine learning that enables computers to learn from experience and understand the world based on a hierarchy of concepts. Because computers gather knowledge from experience, human operators do not need to formally specify all the knowledge the computer needs. Hierarchies of concepts allow computers to learn complex concepts by building them from simpler concepts. A graph of these hierarchies would be deep, with many layers.
It provides a mathematical and conceptual background covering relevant concepts in linear algebra, probability and information theory, numerical computing, and machine learning. It describes deep learning techniques used by industrial practitioners, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodologies; it surveys applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games. Finally, the book provides a research perspective covering theoretical topics such as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, partition functions, approximate inference, and deep generative models.
Ian Goodfellow was born in America in 1987. He studied at Stanford University and Université de Montréal. He is an American computer scientist, engineer, and executive known for his work on artificial neural networks and deep learning. He previously served as a research scientist at Google Brain and director of machine learning at Apple and has made several important contributions to the field of deep learning, including the invention of generative adversarial networks (GANs).
Table of Contents
1 Introduction
I Applied Math and Machine Learning Basics
  2 Linear Algebra
  3 Probability and Information Theory
  4 Numerical Computation
  5 Machine Learning Basics
II Deep Networks: Modern Practices
  6 Deep Feedforward Networks
  7 Regularization for Deep Learning
  8 Optimization for Training Deep Models
  9 Convolutional Networks
  10 Sequence Modeling: Recurrent and Recursive Nets
  11 Practical Methodology
  12 Applications
III Deep Learning Research
  13 Linear Factor Models
  14 Autoencoders
  15 Representation Learning
  16 Structured Probabilistic Models for Deep Learning
  17 Monte Carlo Methods
  18 Confronting the Partition Function
  19 Approximate Inference
  20 Deep Generative Models
When studying the myths of various countries, we can find that almost all myths describe God’s creation of humans as the origin of humans, and usually gods create humans in their own image. God has given mankind the wisdom of God, but not the lifespan of God. And when humans began to try to play the role of creator, they seemed to naturally expect to create creations similar to themselves. For computer scientists and engineers, such goals might include designing software that simulates the human brain and designing hardware that resembles humans in appearance and capabilities. If we trace history vertically, we can find that people had this desire thousands of years ago. In modern times, if we look horizontally, we will find that even in the early stages of life, children's scribblings often focus on people as important objects.
Machines excel at tasks that can be clearly defined and formulated, such as chess and Go. Since the solutions to these problems are theoretically clear, it is almost inevitable that machines will solve them as long as the hardware is sufficient, with enough storage space and fast enough computing power. So tasks that humans find difficult are actually easier for machines. However, tasks that are simple for humans, especially cognitive tasks, appear complex for machines. These so-called simple tasks are often based on a large amount of life experience, which forms a subjective judgment. For example, when you see a face, you can immediately recall the name and related memories, and when you hear words, you can quickly convert them into corresponding concepts and understand them. Although humans have made major breakthroughs in scientific research in this area since 2000, there is still a long way to go to achieve strong artificial intelligence or general artificial intelligence. In everyday life, handling simple common-sense events is generally not considered a sign of intelligence, whereas the ability to solve complex mathematical and physical problems is considered a sign of genius. In contrast, in the world of machines, software that can understand images and language and perform reading comprehension is considered more intelligent. If one day humans have a deeper understanding of how the human brain processes language, images, and text, and can describe these abilities in a formulaic way, then there will probably be breakthroughs in the capabilities of machines in these intuitive fields.
Whether it is a machine or a human, recognizing an object requires capturing its characteristics. The human process of capturing features often happens unconsciously and automatically, so we often don’t think deeply about the process. However, how to identify these features is a very challenging problem for machines. I am reminded of a story about Plato and Diogenes. Plato defined man as a "featherless biped." Diogenes then plucked all the feathers out of a chicken and brought it to Plato. Regardless of whether this story is true or not, it reflects how difficult it is to describe something as accurately as possible through hand-specified characteristics. When the extracted features are incomplete or inaccurate, the machine is prone to make mistakes when making judgments based on them. For simple objects, we may be able to manually set specific characteristics for the machine to make judgments, but for more complex problems, this method obviously does not work. Therefore, we must hope that the machine can independently find a suitable way to learn feature extraction by itself.
The basic idea of neural networks is to simplify complex problems by decomposing a complex problem into multiple simple sub-problems. The simple factors, that is, the variables or features we choose, form the input layer, which is visible to the user. The output layer displays the final result, while the intermediate (hidden) layers are usually invisible to the end user. In a neural network, the output of each layer becomes the input of the next layer. In addition, we can measure the depth of a neural network in two ways: one is the length of the longest path from the input to the output in the computational graph, and the other is the depth of the graph describing how concepts are related to each other. In practice, greater depth generally provides better results.
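The "output of each layer becomes the input of the next layer" idea is easy to see in a toy forward pass. This is a minimal sketch under my own assumptions: the weights are arbitrary example numbers, `dense` and `forward` are hypothetical helper names, and for simplicity ReLU is applied to every layer including the output.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clip negatives to zero."""
    return max(0.0, x)

def dense(weights, biases, inputs):
    """One fully connected layer: weighted sum plus bias, then ReLU."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(layers, inputs):
    """Feed the inputs through the layers in order."""
    activations = inputs
    for weights, biases in layers:
        # The output of this layer is the input of the next one.
        activations = dense(weights, biases, activations)
    return activations

# Example network (illustrative weights): 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 2.0]], [-1.0]),                   # output layer
]

output = forward(layers, [2.0, 1.0])

# Counting layers along the single input-to-output path gives the depth here.
depth = len(layers)
```

Only `inputs` and `output` are visible from the outside; the intermediate `activations` correspond to the hidden layer.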
The authors introduce the basics of linear algebra at the beginning. I am now more and more aware of the importance of mathematics in various subjects. Looking back on the education I received in the past, I now feel that any emphasis on mathematics is justified. However, in the process of learning mathematical knowledge, I found that I lacked an intuitive feeling for the specific application of this knowledge, which to a certain extent weakened my appreciation of both the importance and the beauty of mathematics. On the one hand, mathematics is important and is the result of rational thinking; on the other hand, mathematics also has natural beauty, which is often ignored in mathematics education. In the field of computer science, linear algebra and discrete mathematics are important mathematical foundations. I took these courses as an undergraduate. Linear algebra mainly studies continuous mathematical problems, while discrete mathematics, as the name implies, focuses on dealing with discrete elements. Just as a pyramid is built step by step from its foundation stones, complex mathematical theories and applications are derived step by step from these basic concepts.
Probability theory is an important branch of mathematics in computer science that plays a central role in programming. In essence, programming is a modeling activity where we build models to simulate and analyze real situations. In most cases, models are simplifications of real situations. Due to this simplification, some features of the real situation are bound to be lost in the model, and these missing features lead to a certain degree of distortion, thus creating uncertainty. When dealing with this uncertainty, we often need to design some simple and effective rules so that the computer can make judgments. Often, a simple but slightly flawed rule is more adaptable in practice than a rigorous but complex one. When an algorithm makes a choice, it usually selects the explanation or decision with the highest probability from many possible options. This is essentially a process of finding a locally optimal solution. Large language models, for example, predict output based on input. Such a model infers possible outputs from its inputs, so the quality of the input directly affects the quality of the output.
Numerical computing is an important field, not least because in practical applications we face a challenge: there are infinitely many numbers, but computer memory is limited. Therefore, we need to use limited memory to represent them as closely as possible. Because of this limitation, the loss of information can lead to problems such as underflow and overflow. Underflow occurs when values close to zero are rounded to zero, and some functions (for example, division or logarithms) may then not work as expected. In this case, we may need to replace the input with a very small value close to zero, which is a problem I have encountered in my previous programming experience. On the other hand, overflow occurs when a value grows so large in magnitude that it is beyond the range the program can represent and gets treated as positive or negative infinity. Therefore, it is very important when writing programs to pay attention to handling these edge cases to avoid potential errors.
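Both failure modes are easy to reproduce with the softmax function. The sketch below uses only Python's standard `math` module; the function names and the `eps` default are my own choices, and subtracting the maximum is one common stabilization trick rather than the only possible fix.

```python
import math

def softmax_naive(xs):
    """Textbook softmax; breaks for inputs of very large magnitude."""
    exps = [math.exp(x) for x in xs]   # math.exp raises OverflowError for large x
    total = sum(exps)                  # becomes 0.0 if every term underflowed
    return [e / total for e in exps]

def softmax_stable(xs):
    """Softmax after subtracting the maximum, so the largest exponent is 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # every term is in (0, 1], at least one is 1
    total = sum(exps)                     # therefore total >= 1, never zero
    return [e / total for e in exps]

def safe_log(p, eps=1e-12):
    """Logarithm with the input clamped away from zero (eps is an assumed tolerance)."""
    return math.log(max(p, eps))

# Overflow: exp(1000) is too large for a double-precision float.
try:
    softmax_naive([1000.0, 1000.0])
except OverflowError:
    overflowed = True

# Underflow: exp(-1000) rounds to 0.0, so the naive version divides by zero.
try:
    softmax_naive([-1000.0, -1000.0])
except ZeroDivisionError:
    underflowed = True

stable_large = softmax_stable([1000.0, 1000.0])
stable_small = softmax_stable([-1000.0, -1000.0])
```

The stable version returns the same answer for both extreme inputs, because softmax is unchanged when the same constant is subtracted from every input.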
A comprehensive overview of the Deep Learning paradigm, written by several leading researchers in the field. The authors cover many topics, and did a great job providing references to the current literature in the field. For this reason, I see this book more as a reference book than a book to read straight through. I read it straight through, but there were definitely some sections I skimmed over, especially when the authors introduced technical details of several related methods in the field.
Coming from the statistics world, one thing I've been impressed with in AI / machine learning / deep learning is how much energy they have. And it shows. The methods used by these communities get grants, start companies, are used in state-of-the-art applications (e.g. Alexa, self-driving cars), are fodder for books and podcasts, and more. Why is this?
One theory I have is that the overall goal of AI and deep learning, which is to *solve intelligence*, is much more aspirational than that of a discipline like statistics. By definition, the AI fields are trying to solve a problem that could easily be unsolvable, so they have unlimited research options, and anything in the world could inspire a new methodology. The goal of statistics would be something like *smarter, rigorous decisions and knowledge creation while dealing with uncertainty, because ultimately, the world is an uncertain place*. This is nice, but doesn't capture the imagination like *solving intelligence*.
In any case, after a great introduction, the book is broken into three parts:
1. Applied Math and Machine Learning Basics
2. Deep Networks: Modern Practices
3. Deep Learning Research
Part 1 provides background and synthesis of many topics, and I was impressed by the authors' command of so much material. They were able to summarize measure theory in a page, and when I took measure theory over an entire semester, I still didn't understand it.
Part 2 covers *what-you-need-to-know* in order to implement *state-of-the-art* neural networks. There were two things that really stood out to me here. First, I think that the idea of automatic differentiation, and how they use it to compute the gradients for training neural networks, is genius. Whoever figured that out should get the Nobel Prize. Second, I really appreciated chapters 11 and 12, where chapter 11 covered practical considerations, and chapter 12 covered applications where deep learning provides state-of-the-art results. These chapters did a good job rounding out the technical material covered in the previous chapters.
Part 3 is a bit more difficult to read, since it covers research topics like autoencoders, generative models, graphical models, and so on. A lot of this part was on fitting full probabilistic models with deep neural networks, and the authors use a lot of previously developed tools in unsupervised learning, density estimation, graphical models, Bayesian inference, etc. to develop their tools.
Overall, I see this book as a reference that captures the current state of deep learning. The prose isn't excellent, but it gets the information across well enough. It will certainly be a landmark book for our generation. After reading it, I feel like I have a better bird's eye view of what's going on in Deep Learning.
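For anyone curious what automatic differentiation looks like stripped to its core, here is a toy reverse-mode sketch. It is my own simplified illustration, not the book's algorithm or any library's API: the `Value` class is a hypothetical name, and it only supports addition and multiplication of scalars.

```python
class Value:
    """A scalar that remembers how it was computed, for reverse-mode autodiff."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0                  # d(output)/d(this), filled in by backward()
        self._parents = parents          # the Values this one was computed from
        self._local_grads = local_grads  # d(this)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        """Apply the chain rule from this node back to every input."""
        order, seen = [], set()

        def visit(node):
            # Depth-first search yields a topological order of the graph.
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output towards the inputs, accumulating gradients.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

# A one-neuron "network" with a squared output: loss = (w * x + b)^2
w, x, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b
loss = y * y
loss.backward()
```

After `loss.backward()`, `w.grad` holds d(loss)/dw = 2 · y · x, and one backward sweep has produced the gradient for every input at once, which is what makes the approach practical for training networks with many parameters.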