Imbalanced classification refers to those classification tasks where the distribution of examples across the classes is not equal. Using clear explanations, standard Python libraries, and step-by-step tutorial lessons, you will discover how to confidently develop robust models for your own imbalanced classification projects.
Jason Brownlee, Ph.D. trained and worked as a research scientist and software engineer for many years (e.g. enterprise, R&D, and scientific computing), and is known online for his work on Computational Intelligence (e.g. Clever Algorithms), Machine Learning and Deep Learning (e.g. Machine Learning Mastery, sold in 2021) and Python Concurrency (e.g. Super Fast Python).
A fantastic guide for dealing with imbalanced classification problems in applied machine learning settings. For data practitioners, or those of you wondering what the hell "imbalanced classification" means, think of it like this: imagine we have a matrix (a grid of data points) holding the independent features of, let's say, breast cancer cases. These would be our feature values, and our target would be the classifier column, our y column, which is often a binary numerical column such as 0 (does not have cancer) and 1 (does have cancer).
(Stick with me here for a sec)
We would like to predict the classifier column (our y value). In this case, we want to create a model that can accurately predict whether a patient has breast cancer using our feature values (our x values), so we can treat the cancer as quickly as possible. However, because this is an imbalanced dataset, we could have 10,000 data points and only 100 cases where the patient has breast cancer, creating a heavy skew between patients who have breast cancer and those who don't. If we train models on datasets like this without accounting for the imbalance, the model will be biased toward the majority class. Such a model does not generalize well to new data, and even worse, it can produce what is called a type II error: a false negative, where we tell a patient they don't have breast cancer when in fact they do. This is a costly error because we're giving the cancer more time to spread through the patient's body and thus potentially limiting the patient's chances of beating it.
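To make that concrete, here is a minimal sketch (assuming scikit-learn; the 100-in-10,000 split mirrors the example above) of why plain accuracy hides this failure: a "model" that never predicts cancer still scores about 99% accuracy while missing every positive case.

```python
# Minimal sketch: why accuracy is misleading on imbalanced data.
# Assumes scikit-learn; the ~1% positive rate mirrors the 100-in-10,000 example.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 10,000 samples, roughly 1% positive (the breast cancer cases)
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=42)

# A "model" that always predicts the majority class (no cancer)
model = DummyClassifier(strategy="most_frequent")
model.fit(X, y)
y_pred = model.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")  # ~99% -- looks great
print(f"Recall:   {recall_score(y, y_pred):.2%}")    # 0% -- every cancer case missed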
Unfortunately, there are many instances in the real world where we have imbalanced data and the class we actually care about is the minority class; sometimes this imbalance can be as extreme as 1000:1. So how do we deal with these problems?
The book has numerous examples of dealing with these scenarios. We can undersample, meaning remove examples from the majority class using a variety of undersampling methods, while weighing the trade-off that undersampling might throw away important examples with predictive power. Conversely, we can oversample the dataset, meaning create synthetic examples of the minority class, balance the dataset, and then test a range of models to see what gives us the best performance across the trade-offs we must live with.
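As a rough illustration of what those two strategies look like in code, here is a minimal sketch using the open-source imbalanced-learn library (the dataset and random seeds are illustrative, not from the book):

```python
# Minimal sketch of over- and undersampling with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=42)
print("Original:    ", Counter(y))

# Oversample: synthesize new minority-class examples (SMOTE)
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_over))

# Undersample: randomly drop majority-class examples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("After under: ", Counter(y_under))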
We can also combine over- and undersampling methods to get the best of both worlds and test whether this improves model performance. We can predict probabilities and evaluate which model performs best using metrics such as the Brier score or the G-mean. We can also use cost-sensitive learning, where we weight errors differently depending on the nature of the error and the cost associated with misclassifying a case.
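A hedged sketch of that evaluation-and-cost side: the Brier score scores predicted probabilities, the G-mean balances performance on both classes, and class weights are one simple route to cost-sensitive learning. The 100:1 weight below is an illustrative assumption, not a tuned or recommended value.

```python
# Minimal sketch: probability scoring and cost-sensitive learning.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Cost-sensitive learning: penalize minority-class errors 100x more
# via class weights (the 100:1 ratio is an illustrative assumption).
model = LogisticRegression(class_weight={0: 1, 1: 100}, max_iter=1000)
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
y_pred = model.predict(X_test)

print(f"Brier score: {brier_score_loss(y_test, y_prob):.4f}")     # lower is better
print(f"G-mean:      {geometric_mean_score(y_test, y_pred):.4f}")  # higher is better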
There are numerous methods and tools in this book that, surprisingly, you will not find anywhere else. I had not seen many of the methods used here before, and I felt like I had walked into a treasure trove of tools I can use for projects at work and in my own research. The book is designed so you can skip ahead to any part, but I would recommend reading it from start to finish. I had VS Code up, coded all the exercises along with the author, and actively applied everything he showed in the book. Overall, a fantastic book on applied machine learning. I am excited to read the author's other books, as I see he has written numerous titles on a range of subjects I am interested in learning about.