Get a hands-on introduction to building and using decision trees and random forests. Tree-based machine learning algorithms categorize data based on known outcomes so that outcomes can be predicted in new situations.
You will learn not only how to use decision trees and random forests for classification and regression (and where each falls short), but also how the algorithms that build them work. Each chapter introduces a new data concern and then walks you through modifying the code, building the engine just in time. Along the way you will gain experience making decision trees and random forests work for you.
This book uses Python, an easy-to-read programming language, as a medium for teaching you how these algorithms work. It isn't about teaching you Python, or about using pre-built machine learning libraries specific to Python. It is about teaching you how some of the algorithms inside those libraries work and why you might use them, and it gives you hands-on experience that you can take back to your favorite programming environment.
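To make that concrete, here is a minimal sketch (not the book's actual code; the attribute names and outcomes are invented for illustration) of what a decision tree looks like once built: a structure of attribute tests that you walk until you reach a known outcome.

```python
# A hand-built decision tree, represented as nested dictionaries.
# Internal nodes name the attribute to test; leaves hold the outcome.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "stay in", "normal": "go out"}},
        "rainy": "stay in",
        "overcast": "go out",
    },
}

def predict(node, example):
    """Walk the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # prints: go out
```

The interesting part, of course, is not walking a tree by hand but having an algorithm build one from data, which is what the chapters below develop.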
Table of Contents
A brief introduction to decision trees
Chapter 1: Branching - uses a greedy algorithm to build a decision tree from data that can be split on a single attribute.
Chapter 2: Multiple Branches - examines several ways to split data in order to generate multi-level decision trees.
Chapter 3: Continuous Attributes - adds the ability to split numeric attributes using greater-than comparisons.
Chapter 4: Pruning - explores ways of reducing the amount of error encoded in the tree.
Chapter 5: Random Forests - introduces ensemble learning and feature engineering.
Chapter 6: Regression Trees - investigates numeric predictions, like age, price, and miles per gallon.
Chapter 7: Boosting - adjusts the voting power of the randomly selected decision trees in the random forest in order to improve its ability to predict outcomes.
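As a rough illustration of the greedy splitting that Chapter 1 describes, here is a sketch of choosing the best attribute to split on. The impurity measure (Gini) and the toy data are my assumptions for illustration; the book may use a different criterion.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 0 means pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Greedily pick the attribute whose split yields the lowest
    weighted impurity across the resulting groups."""
    def weighted_gini(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        n = len(labels)
        return sum(len(g) / n * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)

rows = [{"outlook": "sunny", "windy": "yes"},
        {"outlook": "sunny", "windy": "no"},
        {"outlook": "rainy", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["stay in", "stay in", "go out", "go out"]
print(best_attribute(rows, labels, ["outlook", "windy"]))  # prints: outlook
```

Splitting on "outlook" produces two pure groups (impurity 0), while "windy" leaves both groups mixed, so the greedy step picks "outlook". Recursing on each group, with the refinements the later chapters add, is what grows a full tree.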
I am a polyglot programmer with more than 15 years of professional programming experience. My personal projects include photography, playing board games, genealogy and genetic programming.
The title and frontispiece of the book are misleading because, based on page count, very little of it concerns tree-based algorithms, random forests, or boosting. In fact, it is essentially a tutorial on Python, developing a single program: a one-off, non-general-purpose regression tree and random forest learner, with a limited boosting capability added at the end.
I'm not suggesting anyone should be a theory snob, but the fact is that in scikit-learn, and in the worlds of R and Spark, there are dozens of different algorithms available for doing these things, developed by experts and, most importantly, used and "shaken down" by hundreds if not thousands of practitioners. It is audacious to think that anyone can readily exceed the benefits of this combined experience. Modern programming, whether in Python or R or Spark or Julia or Scala or MATLAB, is ALL about learning to lash together libraries to get things done, not about writing code ab initio.
But the thing is, one needs some theory and experience to understand why one algorithm is to be preferred over another in a particular case. These aren't sports teams one cheers on.
So, I was disappointed: partly because the book did not live up to its billing, partly because it was primarily a Python tutorial, and partly because it was a BAD Python tutorial.