The key idea behind active learning is that a machine learning algorithm can perform better with less training if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator) that already understands the nature of the problem. This sort of approach is well-motivated in many modern machine learning and data mining applications, where unlabeled data may be abundant or easy to come by, but training labels are difficult, time-consuming, or expensive to obtain. This book is a general introduction to active learning. It outlines several scenarios in which queries might be formulated, and details many query selection algorithms which have been organized into four broad categories, or query selection frameworks. We also touch on some of the theoretical foundations of active learning, and conclude with an overview of the strengths and weaknesses of these approaches in practice, including a summary of ongoing work to address these open challenges and opportunities. Table of Contents: Automating Inquiry / Uncertainty Sampling / Searching Through the Hypothesis Space / Minimizing Expected Error and Variance / Exploiting Structure in Data / Theory / Practical Considerations
I bought the book after reading Dr. Settles's tutorial paper "Active Learning Literature Survey" with the hope that there might be something extra. I have to say I was fairly disappointed. Most of the book's content has already been covered by the paper.
That said, the book itself is comprehensive, with some noticeable improvements, such as using an alien fruit example as a running example and organizing active learning techniques in a more logical fashion. Also, there are some quite interesting quotes, e.g., "In theory, there is no difference between theory and practice. But in practice, there is."
(I was hoping to read it this week, but the library has recalled it for another user already, so I'll try again later. These are just notes from a quick skim...)
As another reviewer noted, the book is basically an extended version of the same author's article "Active Learning Literature Survey". I haven't read either of them in detail yet, but it seems like a good high-level overview: the goals of active learning, common use-cases, several sensible approaches, and open questions.
My overall question: How is Active Learning different from the tradition of Experimental Design (in Statistics)? Here's my current, horribly-oversimplified understanding:
In AL, you tend to have cheap input data X, but labels/outcomes Y are expensive. Think of the millions of photos online: collecting them can be automated and is almost free, but tagging them (what does the photo show?) can take some human effort. So the focus is: Which input cases should we pick to label, to improve our understanding the most? For instance: Which photos should we tag manually, in the hopes that they'll improve our auto-tagging algorithm the most?
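To make the "which photos should we tag next?" idea concrete, here is a minimal sketch of pool-based uncertainty sampling for a binary labeling task. The probabilities are hypothetical model outputs (not from the book); in practice they would come from the current classifier's predictions on the unlabeled pool.

```python
def least_confident(pool_probs):
    """Return the index of the pool instance whose predicted probability
    is closest to 0.5, i.e. the photo the model is least sure about."""
    return min(range(len(pool_probs)),
               key=lambda i: abs(pool_probs[i] - 0.5))

# Hypothetical P(label = "cat") for five unlabeled photos.
probs = [0.98, 0.10, 0.55, 0.91, 0.03]

query_index = least_confident(probs)  # -> 2 (p = 0.55, the most uncertain)
```

This is only one of the query strategies the book covers (it falls under the "Uncertainty Sampling" chapter); margin- and entropy-based variants follow the same pattern with a different scoring rule.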
ED can also cover similar cases, but (my sense is that) it's traditionally focused more on evidence than on optimization. There is an emphasis on random variation in outcomes, from the very start. In the AL photo example, if a specific X-value is some particular photo of a cat, then that X will *always* have the same correct Y label "cat." But in traditional ED setups (e.g. trying different combinations of baking temperature, humidity, and material to see how they affect the strength of ceramics), there will always be sample-to-sample variation at the same X values. How does the whole *distribution* of outcomes change as we adjust the different features in X?
Say that we're choosing a tuning parameter from a grid of possible values, such as lambda in tuning a Lasso regression model. ED asks: How much data overall should we collect, if we want *solid evidence* that the optimum we chose was indeed the best? How many times should we validate, re-training the algorithm on new data several times per grid value, to be confident that our choice of optimum will generalize to future data? AL asks: Which grid-value(s) should I evaluate next, to be most informative, to minimize training error on this one dataset? I *don't* have enough time/resources to be really confident in my answer, so the best I can do is try to reach the *apparent* optimum *quickly*.
Another way to phrase the caricature above: AL is about efficient *search* through the hypothesis space, seeking an optimum, finding a single best *case* or decision boundary. ED is about thorough *description* of the hypothesis space, making inferences about different sources of variation, characterizing how each *feature* affects the outcome. (However, that's just big-picture, intro-level ED. There are subfields including "optimal experimental design" and "sequential design" which are much more closely linked to AL.)
Hence, there can be considerable overlap between the fields. But even an AL expert might sometimes need an ED expert to help design and analyze an experiment. And even an ED expert might sometimes need an AL expert to help optimize a model fit.
Finally, also contrast AL with Semi-Supervised Learning. In AL, you might have a bit of labeled data and a lot of unlabeled, then select unlabeled cases you'd like to label. In SSL, you have the same setup, but you just train a model on the labeled data; use it to extrapolate labels for the unlabeled data; and refit the model (perhaps only adding the most confident self-labeled cases at first).
A survey book on the relatively sparse field of Active Learning. As such, it's a very short book, but I was extremely impressed by its clarity and readability, and would definitely recommend it to anyone looking to learn about the field. Most researchers cite this book and zero other works when writing anything about Active Learning, and considering it's readable in a single focused afternoon, it's probably worth reading if you have any interest in the field at all.