The key idea behind active learning is that a machine learning algorithm can perform better with less training if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator) that already understands the nature of the problem. This sort of approach is well-motivated in many modern machine learning and data mining applications, where unlabeled data may be abundant or easy to come by, but training labels are difficult, time-consuming, or expensive to obtain. This book is a general introduction to active learning. It outlines several scenarios in which queries might be formulated, and details many query selection algorithms which have been organized into four broad categories, or query selection frameworks. We also touch on some of the theoretical foundations of active learning, and conclude with an overview of the strengths and weaknesses of these approaches in practice, including a summary of ongoing work to address these open challenges and opportunities. Table of Contents: Automating Inquiry / Uncertainty Sampling / Searching Through the Hypothesis Space / Minimizing Expected Error and Variance / Exploiting Structure in Data / Theory / Practical Considerations
I bought the book after reading Dr. Settles's tutorial paper "Active Learning Literature Survey" with the hope that there might be something extra. I have to say I was fairly disappointed. Most of the book's content has already been covered by the paper.
That said, the book itself is comprehensive, with some noticeable improvements, such as using an alien fruit example as a running example and organizing active learning techniques in a more logical fashion. Also, there are some quite interesting quotes, e.g., "In theory, there is no difference between theory and practice. But in practice, there is."
(I was hoping to read it this week, but the library has recalled it for another user already, so I'll try again later. These are just notes from a quick skim...)
As another reviewer noted, the book is basically an extended version of the same author's article "Active Learning Literature Survey". I haven't read either of them in detail yet, but it seems like a good high-level overview: the goals of active learning, common use-cases, several sensible approaches, and open questions.
My overall question: How is Active Learning different from the tradition of Experimental Design (in Statistics)? Here's my current, horribly-oversimplified understanding:
In AL, you tend to have cheap input data X, but labels/outcomes Y are expensive. Think of the millions of photos online: collecting them can be automated and is almost free, but tagging them (what does the photo show?) can take some human effort. So the focus is: Which input cases should we pick to label, to improve our understanding the most? For instance: Which photos should we tag manually, in the hopes that they'll improve our auto-tagging algorithm the most?
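To make the "which photos should we tag next?" idea concrete, here is a minimal sketch of pool-based uncertainty sampling for a binary labeling task. The probabilities are hypothetical model outputs (not from the book); in practice they would come from the current classifier's predictions on the unlabeled pool.

```python
def least_confident(pool_probs):
    """Return the index of the pool instance whose predicted probability
    is closest to 0.5, i.e. the photo the model is least sure about."""
    return min(range(len(pool_probs)),
               key=lambda i: abs(pool_probs[i] - 0.5))

# Hypothetical P(label = "cat") for five unlabeled photos.
probs = [0.98, 0.10, 0.55, 0.91, 0.03]

query_index = least_confident(probs)  # -> 2 (p = 0.55, the most uncertain)
```

This is only one of the query strategies the book covers (it falls under the "Uncertainty Sampling" chapter); margin- and entropy-based variants follow the same pattern with a different scoring rule.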
ED can also cover similar cases, but (my sense is that) it's traditionally focused more on evidence than on optimization. There is an emphasis on random variation in outcomes, from the very start. In the AL photo example, if a specific X-value is some particular photo of a cat, then that X will *always* have the same correct Y label "cat." But in traditional ED setups (e.g. trying different combinations of baking temperature, humidity, and material to see how they affect the strength of ceramics), there will always be sample-to-sample variation at the same X values. How does the whole *distribution* of outcomes change as we adjust the different features in X?
Say that we're choosing a tuning parameter from a grid of possible values, such as lambda in tuning a Lasso regression model. ED asks: How much data overall should we collect, if we want *solid evidence* that the optimum we chose was indeed the best? How many times should we validate, re-training the algorithm on new data several times per grid value, to be confident that our choice of optimum will generalize to future data? AL asks: Which grid-value(s) should I evaluate next, to be most informative, to minimize training error on this one dataset? I *don't* have enough time/resources to be really confident in my answer, so the best I can do is try to reach the *apparent* optimum *quickly*.
Another way to phrase the caricature above: AL is about efficient *search* through the hypothesis space, seeking an optimum, finding a single best *case* or decision boundary. ED is about thorough *description* of the hypothesis space, making inferences about different sources of variation, characterizing how each *feature* affects the outcome. (However, that's just big-picture, intro-level ED. There are subfields including "optimal experimental design" and "sequential design" which are much more closely linked to AL.)
Hence, there can be considerable overlap between the fields. But even an AL expert might sometimes need an ED expert to help design and analyze an experiment. And even an ED expert might sometimes need an AL expert to help optimize a model fit.
Finally, also contrast AL with Semi-Supervised Learning. In AL, you might have a bit of labeled data and a lot of unlabeled, then select unlabeled cases you'd like to label. In SSL, you have the same setup, but you just train a model on the labeled data; use it to extrapolate labels for the unlabeled data; and refit the model (perhaps only adding the most confident self-labeled cases at first).
A survey book on the relatively sparse field of Active Learning. As such, it's a very short book, but I was extremely impressed by its clarity and readability, and would definitely recommend it to anyone looking to learn about the field. Most researchers cite this book and zero other works when writing anything about Active Learning, and considering it's readable in a single focused afternoon, it's probably worth reading if you have any interest in the field at all.