This textbook is like the Swiss Army knife of machine learning books—it's packed with tools and techniques to help you tackle a wide range of real-world problems.
It takes you on a journey through the exciting landscape of machine learning, equipped with powerful libraries like Scikit-Learn, Keras, and TensorFlow.
It explains each library in depth: I was more interested in the first one, as Keras and TensorFlow are too advanced for my interests and knowledge.
It is actually funny to read the Natural Language Processing (NLP) and LLMs section, written prior to ChatGPT.
NOTES:
Supervised Learning: the algorithm is trained on a labeled dataset, meaning the input data is paired with the correct output. The model learns to map the input to the output, making predictions or classifications when new data is introduced. Common algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.
Unsupervised Learning: deals with unlabeled data, where the algorithm explores the data's structure or patterns without any explicit supervision. Clustering and association are two primary tasks in this type. Clustering algorithms, like K-means or hierarchical clustering, group similar data points together. Association algorithms, like the Apriori algorithm, find relationships or associations among data points.
Reinforcement Learning: involves an agent learning to make decisions by interacting with an environment. It is often used in robotics: the agent learns by receiving feedback in the form of rewards or penalties as it navigates a problem space. The goal is to learn the optimal actions that maximize the cumulative reward. Algorithms like Q-learning and Deep Q Networks (DQN) are used in reinforcement learning scenarios.
Additionally, there are subfields and specialized forms within these categories, such as semi-supervised learning, where algorithms learn from a combination of labeled and unlabeled data, and transfer learning, which involves leveraging knowledge from one domain to another. These types and their variations offer diverse approaches to solving different types of problems in machine learning.
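A minimal sketch contrasting the supervised and unsupervised workflows described above, using scikit-learn with made-up toy data (the points and labels are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical toy data: six 2-D points, with labels for the supervised case.
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels available -> supervised

# Supervised: learn a mapping from X to y, then predict on new data.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0, 0], [9, 9]]))  # e.g. [0 1]

# Unsupervised: no labels, just look for structure (here, 2 clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)  # cluster assignment per point
```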
Gradient descent is a fundamental optimization algorithm widely used in machine learning for minimizing the error of a model by adjusting its parameters. It's especially crucial in training models like neural networks, linear regression, and other algorithms where the goal is to find the optimal parameters that minimize a cost or loss function.
- Objective: In machine learning, the objective is to minimize a cost or loss function that measures the difference between predicted values and actual values.
- Optimization Process: Gradient descent is an iterative optimization algorithm. It works by adjusting the model parameters iteratively to minimize the given cost function.
- Gradient Calculation: At each iteration, the algorithm calculates the gradient of the cost function with respect to the model parameters. The gradient essentially points in the direction of the steepest increase of the function.
- Parameter Update: The algorithm updates the parameters in the direction opposite to the gradient (i.e., descending along the gradient). This step size is determined by the learning rate, which controls how big a step the algorithm takes in the direction of the gradient.
- Convergence: This process continues iteratively, gradually reducing the error or loss. The algorithm terminates when it reaches a point where further iterations don't significantly decrease the loss or when it reaches a predefined number of iterations.
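A minimal NumPy sketch of these steps, applied to linear regression with batch gradient descent (the synthetic data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Illustrative data: y ≈ 4 + 3x plus Gaussian noise.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]   # add a bias column of 1s
theta = rng.normal(size=(2, 1))     # random initialization of the parameters
eta, n_iterations, m = 0.1, 1000, len(X_b)

for _ in range(n_iterations):
    # Gradient of the MSE cost function with respect to theta
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    # Step in the direction opposite to the gradient, scaled by the learning rate
    theta -= eta * gradients

print(theta)  # should end up close to [[4], [3]]
```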
There are variations of gradient descent, such as:
Batch Gradient Descent: Calculates the gradient over the entire dataset.
Stochastic Gradient Descent (SGD): Computes the gradient using a single random example from the dataset at each iteration, which can be faster but noisier. The randomness helps escape local optima.
Mini-batch Gradient Descent: Computes the gradient using a small subset of the dataset, balancing between the efficiency of SGD and the stability of batch gradient descent.
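For comparison, scikit-learn ships the stochastic variant as SGDRegressor; a short usage sketch on the same kind of synthetic data (the hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = (4 + 3 * X + rng.normal(size=(100, 1))).ravel()  # 1-D targets

# Stochastic GD: one instance at a time, with a decaying learning rate.
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1, penalty=None,
                       random_state=42)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)  # roughly 4 and 3
```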
Gradient descent plays a vital role in training machine learning models by iteratively adjusting parameters to find the optimal values that minimize the error or loss function, leading to better model predictions and performance.
It is commonly used in conjunction with various machine learning algorithms, including regression models. It serves as an optimization technique to train these models by minimizing a cost or loss function associated with the model's predictions.
Support Vector Machines SVM
It can perform linear or nonlinear classification, regression and even outlier detection.
Well suited for classification of complex small to medium sized datasets.
They tend to work effectively and efficiently when there are many features compared to the observations, but SVM is not as scalable to larger data sets and it’s hard to tune its hyperparameters.
SVM is a family of model classes that operate in high-dimensional space to find an optimal hyperplane that separates the classes with a maximum margin between them. Support vectors are the points closest to the decision boundary that would change it if they were removed.
It tries to fit the widest possible space between the classes, staying as far as possible from the closest training instances: large margin classification.
Adding more training instances far away from the boundary does not affect SVM, which is fully determined/supported by the instances located at the edge of the street, called support vectors.
N.B. SVMs are sensitive to the feature scales.
Soft margin classification is generally preferred to the hard version, because it is tolerant of outliers: it's a compromise between perfectly separating the two classes and having the widest possible street.
Unlike logistic regression, SVM classifiers do not output probabilities.
Nonlinear SVM classification adds polynomial features; thanks to the kernel trick we get the same result as if we had added many high-degree polynomial features, without actually adding them, so there is no combinatorial explosion of the number of features.
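A short sketch of the kernel trick in practice with scikit-learn's SVC (the moons dataset and the hyperparameter values are illustrative; note the scaling step, since SVMs are sensitive to feature scales):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# A polynomial kernel behaves as if many polynomial features had been added,
# without actually creating them.
poly_kernel_svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5),
)
poly_kernel_svm.fit(X, y)
print(poly_kernel_svm.score(X, y))
```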
SVM Regression reverses the objective: it tries to fit as many instances as possible on the street while limiting margin violations, that is, training instances that fall off the street.
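A corresponding regression sketch, where epsilon controls the width of the street (the synthetic data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = (4 + 3 * X + rng.normal(size=(100, 1))).ravel()

# epsilon sets the width of the street; instances inside it do not
# contribute to the loss.
svm_reg = LinearSVR(epsilon=0.5, random_state=42)
svm_reg.fit(X, y)
print(svm_reg.intercept_, svm_reg.coef_)
```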
Decision Trees
They have been used for the longest time, even before they were turned into algorithms.
It searches for the pair (feature, threshold) that produces the purest subsets (weighted by their size) and does so recursively; however, it does not check whether the split will lead to the lowest possible impurity several levels down.
Hence it does not guarantee a globally optimal solution.
The computational complexity of predictions does not explode, since traversing each node only requires checking the value of one feature; training is costlier, because the algorithm compares all features on all samples at each node.
Node purity is measured by the Gini impurity or entropy: a node's impurity is generally lower than its parent's.
Decision trees make very few assumptions about the training data, as opposed to linear models, which assume that the data is linear. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, indeed, most likely overfitting it.
Such a model is often called a non-parametric model: it has parameters, but their number is not determined prior to training.
To avoid overfitting, we need regularization hyperparameters that restrict the decision tree's freedom during training: pruning (deleting unnecessary nodes), setting a maximum depth, a maximum number of leaves, etc.
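A quick illustration of such regularization in scikit-learn (the dataset and hyperparameter values are arbitrary examples):

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

# Unconstrained tree: free to grow until every leaf is pure (likely overfits).
deep_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Regularized tree: limit depth and require several samples per leaf.
regularized_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X, y)

print(deep_tree.get_depth(), regularized_tree.get_depth())
```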
We can also have decision tree regression, which, instead of predicting a class in each leaf, predicts a value.
They are simple to understand and interpret, easy to use, versatile and powerful.
They don’t care whether the training data is scaled or centered: no need to scale features.
However, they produce orthogonal decision boundaries, which makes them sensitive to training set rotation; more generally, they are very sensitive to small variations in the training data, so the model may not generalize well. Random Forests can limit this instability by averaging predictions over many trees.
Random Forests
It is an ensemble of Decision Trees, generally trained via bagging or sometimes pasting, typically with the max_samples set to the size of the training set.
Instead of using a BaggingClassifier, you can use the RandomForestClassifier class, which is optimized for Decision Trees and has all their hyperparameters, plus those controlling the ensemble itself.
Instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of features, which results in a greater tree diversity.
It also makes it easy to measure feature importance, by looking at how much each feature reduces impurity on average across the trees.
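For example (using the iris dataset purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(iris.data, iris.target)

# Impurity-based feature importances, averaged over all trees in the forest.
for name, score in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {score:.3f}")
```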
Boosting
Adaptive Boosting
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances, trains a second classifier using the updated weights, again makes predictions on the training set, updates the instance weights, and so on. Once all predictors are trained, the ensemble makes predictions much like bagging, except that the predictors have weights depending on their overall accuracy on the weighted training set.
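A minimal AdaBoost sketch with scikit-learn, using shallow trees ("decision stumps") as base estimators (the dataset and hyperparameters are illustrative; the parameter is named base_estimator in older scikit-learn versions):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=42)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X, y)
print(ada.score(X, y))
```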
Gradient Boosting
It works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
Instead of tweaking the instance weights at every iteration like AdaBoost does, it tries to fit the new predictor to the residual errors made by the previous predictor.
[XGBoost Python Library is an optimised implementation]
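The residual-fitting idea can be sketched by hand with two small regression trees before reaching for GradientBoostingRegressor or XGBoost (the synthetic quadratic data is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# The first tree fits the raw targets.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# The second tree fits the residual errors left by the first one.
residuals = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals)

# The ensemble's prediction is the sum of the trees' predictions.
X_new = np.array([[0.2]])
print(tree1.predict(X_new) + tree2.predict(X_new))
```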
Stacking
Stacked generalization involves training multiple diverse models and combining their predictions using a meta-model (or blender).
Unlike bagging, where the predictors' outputs are simply aggregated, stacking proceeds in stages: the base models are trained first, and the meta-model is then trained on their predictions.
The idea is to let the base models specialize in different aspects of the data, and the meta-model learns how to weigh their contributions effectively.
Stacking can involve multiple layers of models, with each layer's output serving as input to the next layer.
It requires a hold-out (validation) set to generate the predictions the meta-model is trained on, to prevent overfitting on the training data.
Stacking is a more complex ensemble method compared to boosting and bagging.
[Not supported by Scikit-Learn at the time the book was written; recent versions provide StackingClassifier and StackingRegressor]
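With a recent scikit-learn version, a stacking sketch looks roughly like this (the choice of base models and blender is arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model / blender
    cv=5,  # out-of-fold predictions are used to train the blender
)
stack.fit(X, y)
print(stack.score(X, y))
```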
Unsupervised Learning
Dimensionality Reduction
Reducing dimensionality does cause some information loss and makes pipelines more complex and thus harder to maintain, but it speeds up training.
A major benefit is that it becomes much easier to rely on data visualization once we have fewer dimensions.
[the operation can be reversed, we can reconstruct a data set relatively similar to the original]
Intuitively dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the data set without losing too much information.
The Curse of Dimensionality
As the number of features or dimensions in a dataset increases, certain phenomena occur that can lead to difficulties in model training, performance, and generalization.
- Increased Sparsity: In high-dimensional spaces, data points become more sparse. As the number of dimensions increases, the available data tends to be spread out thinly across the feature space. This sparsity can lead to difficulties in estimating reliable statistical quantities and relationships.
- Increased Computational Complexity: The computational requirements grow exponentially with the number of dimensions. Algorithms that work efficiently in low-dimensional spaces may become computationally expensive or impractical in high-dimensional settings. This can affect the training and inference times of machine learning models.
- Overfitting: In high-dimensional spaces, models have more freedom to fit the training data closely. This can lead to overfitting, where a model performs well on the training data but fails to generalize to new, unseen data. Regularization techniques become crucial to mitigate overfitting in high-dimensional settings.
- Decreased Intuition and Visualization: It becomes increasingly difficult for humans to visualize and understand high-dimensional spaces. While we can easily visualize and interpret data in two or three dimensions, the ability to comprehend relationships among variables diminishes as the number of dimensions increases.
- Increased Data Requirements: As the dimensionality increases, the amount of data needed to maintain the same level of statistical significance also increases. This implies that more data is required to obtain reliable estimates and make accurate predictions in high-dimensional spaces.
- Distance Measures and Density Estimation: The concept of distance becomes less meaningful in high-dimensional spaces, and traditional distance metrics may lose their discriminative power. Similarly, density estimation becomes challenging as the data becomes more spread out.
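A tiny NumPy experiment hints at the distance issue above: as dimensionality grows, pairwise distances between random points concentrate and become nearly indistinguishable (the point counts and dimensions are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

for d in (2, 10, 100, 1000):
    X = rng.random((200, d))   # 200 random points in the d-dimensional unit hypercube
    dists = pdist(X)           # all pairwise Euclidean distances
    # The relative spread of distances shrinks as d grows:
    # every point starts to look roughly equally far from every other point.
    print(d, round(dists.std() / dists.mean(), 3))
```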
Projection
In most real-world problems, training instances are not spread out uniformly across all dimensions: many features are almost constant whereas others are highly correlated.
As a result, all training instances lie within a much lower dimensional subspace of the high-dimensional space.
If we project every instance perpendicularly onto this subspace, we get a new lower-dimensional dataset.
Manifold Learning focuses on capturing and representing the intrinsic structure or geometry of high-dimensional data in lower-dimensional spaces, often referred to as manifolds.
The assumption is that the task will be simpler if expressed in the lower dimensional space of the manifold, which is not always true: the decision boundary may not always be simpler with lower dimensions.
PCA Principal Component Analysis
It identifies the hyperplane that lies closest to the data and then projects the data onto it, while retaining as much of the original variance as possible.
PCA achieves this by identifying the principal components of the data, which are linear combinations of the original features; the first principal component is the axis that accounts for the largest amount of variance in the training set.
[It's essential to note that PCA assumes that the principal components capture the most important features of the data, and it works well when the variance in the data is aligned with the directions of maximum variance. However, PCA is a linear technique and may not perform optimally when the underlying structure of the data is nonlinear. In such cases, non-linear dimensionality reduction techniques like t-Distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP) might be more appropriate.]
It identifies the principal components via a standard matrix factorization technique, Singular Value Decomposition.
Before applying PCA, it's common to standardize the data by centering it (subtracting the mean) and scaling it (dividing by the standard deviation). This ensures that each feature contributes equally to the analysis.
PCA involves the computation of the covariance matrix of the standardized data. The covariance matrix represents the relationships between different features, indicating how they vary together.
It is useful to compute the explained variance ratio of each principal component which indicates the proportion of the dataset’s variance that lies along each PC.
Rather than choosing the number of dimensions arbitrarily, it is common to reduce down to the number of dimensions that accounts for a large portion of the variance, e.g., 95%.
After dimensionality reduction the training set takes up much less space.
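In scikit-learn this takes a few lines; passing a float to n_components asks PCA to keep enough components to preserve that fraction of the variance (the digits dataset is used purely for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)        # 64-dimensional images
X_std = StandardScaler().fit_transform(X)  # center and scale first

pca = PCA(n_components=0.95)               # keep 95% of the variance
X_reduced = pca.fit_transform(X_std)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_[:5])   # variance explained by the first PCs
```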
- Dimensionality Reduction: The primary use of PCA is to reduce the number of features in a dataset while retaining most of the information. This is beneficial for visualization, computational efficiency, and avoiding the curse of dimensionality.
- Data Compression: PCA can be used for data compression by representing the data in a lower-dimensional space, reducing storage requirements.
- Noise Reduction: By focusing on the principal components with the highest variance, PCA can help filter out noise in the data.
- Visualization: PCA is often employed for visualizing high-dimensional data in two or three dimensions, making it easier to interpret and understand.
Kernel PCA, Unsupervised Algorithm
The basic idea behind Kernel PCA is to use a kernel function to implicitly map the original data into a higher-dimensional space where linear relationships may become more apparent. The kernel trick avoids the explicit computation of the high-dimensional feature space but relies on the computation of pairwise similarities (kernels) between data points.
Commonly used kernel functions include the radial basis function (RBF) or Gaussian kernel, polynomial kernel, and sigmoid kernel. The choice of the kernel function depends on the characteristics of the data and the desired transformation.
After applying the kernel trick, the eigenvalue decomposition is performed in the feature space induced by the kernel. This results in eigenvalues and eigenvectors, which are analogous to those obtained in traditional PCA.
The final step involves projecting the original data onto the principal components in the higher-dimensional feature space. The projection allows for non-linear dimensionality reduction.
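A brief KernelPCA sketch with an RBF kernel (the dataset and the gamma value are illustrative; in practice gamma usually needs tuning):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# Implicitly map the data to a high-dimensional feature space via the RBF kernel,
# then project onto the leading principal components found there.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```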
Kernel PCA is particularly useful in scenarios where the relationships in the data are not well captured by linear techniques. It has applications in various fields, including computer vision, pattern recognition, and bioinformatics, where the underlying structure of the data might be highly non-linear.
However, it's important to note that Kernel PCA can be computationally expensive, especially when dealing with large datasets, as it involves the computation of pairwise kernel values. The choice of the kernel and its parameters can also impact the performance of Kernel PCA, and tuning these parameters may be necessary for optimal results.
Clustering: K-Means
It is the task of identifying similar instances and assigning them to clusters or groups of similar instances.
It is an example of using data science not to predict but to group and make sense of the existing data.
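A minimal K-Means example (the blob data and k=3 are illustrative; in practice k is usually chosen with inertia or silhouette analysis):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # one centroid per cluster
print(labels[:10])               # cluster assigned to each instance
```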
Use cases:
- Customer segmentation: You can cluster your customers based on their purchases and their activity on your website. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment.