A very technical but still accessible and well-written textbook on ML systems.
It goes very deep on the infra, almost DevOps, side of things, but that was expected (the author is an ML engineer).
It is a great complement to conventional data science books, which focus primarily on algorithms and data manipulation.
NOTES
MLOps comes from DevOps, short for development and operations. To operationalize something means to bring it into production, which includes deploying, monitoring, and maintaining it.
Machine learning is an approach to learning complex patterns from existing data and using these patterns to make predictions on unseen data. It is most useful when the task is repetitive, the cost of a wrong prediction is low, the task is at scale, and the patterns are constantly changing.
A relational database isn't an ML system because it doesn't have the capacity to learn the relationship between two columns by itself.
ML systems are part code, part data, and part artifacts created from the two.
Machine learning in research VS in production
Requirements: state-of-the-art performance on benchmark datasets VS different stakeholders with different requirements. For instance, while it can give your ML system a small performance improvement, ensembling tends to make a system too complex to be useful in production.
Computational priorities: fast training, high throughput VS fast inference, low latency. When designing ML systems, people who haven't deployed an ML system often make the mistake of focusing too much on the model development part and not enough on the deployment and maintenance part.
During the model development process, we train different models, and each model does multiple passes over the training data. Each trained model then generates predictions on the validation data once to report the scores; the validation dataset is usually much smaller than the training data. During model development, training is the bottleneck. Once the model is deployed, inference is the bottleneck. Research usually prioritises fast training, whereas production usually prioritises fast inference.
To reduce latency in production, you might have to reduce the number of queries you can process on the same hardware at a time. Latency is not an individual number but a distribution, so it's better to think in percentiles. Higher percentiles are important to look at: even though they account for a small percentage of your users, sometimes they are the most important users.
Data: in research you mostly work with historical, well-formatted data, whereas in production the data is constantly being generated by users, systems, and third parties.
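A minimal sketch (mine, not from the book) of summarizing request latency by percentiles with NumPy instead of the mean; latencies_ms is assumed toy data pulled from logs:

import numpy as np

latencies_ms = [23, 25, 24, 30, 28, 26, 250, 27, 29, 31]  # toy per-request latencies

p50, p90, p95, p99 = np.percentile(latencies_ms, [50, 90, 95, 99])
print(f"p50={p50:.0f}ms  p90={p90:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# The mean (~49ms here) hides the slow tail that the higher percentiles expose.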
Fairness: ML algorithms don’t predict the future, but encode the past, thus perpetuating biases in the data and more.
Interpretability and Discussion
Requirements
- At the base of every ML project there must be a business objective: creating value for the company.
- Reliability: the system should continue to perform the correct function at the desired level of performance even in the face of adversity; ML systems can fail silently, which is hard to detect if the ground truth is not available.
- Scalability: handling resource scaling, but also artifact management.
- Maintainability and adaptability
Data Engineering Fundamentals
Data Source: user input data; system-generated data (logs); internal databases; 3rd party data.
Data Formats: Data serialization is the process of converting a data structure or object state into a format that can be stored or transmitted and restructured later.
- JSON is human readable; key value pair paradigm; text file
- CSV is row-major, consecutive elements in a row are stored next to each other in memory; good for accessing samples; it’s much faster to write; text file
- Parquet is column-major and non-human-readable; consecutive elements in a column are stored next to each other; good for accessing features (columns); it is better when you have to do a lot of column-based reads; binary file (aka non-text file).
For instance, pandas DataFrames are column-major whereas NumPy arrays are row-major by default, so accessing a DataFrame by row is much slower than by column.
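A minimal sketch (mine, not from the book) showing why iterating a pandas DataFrame column by column is much faster than row by row; the sizes are arbitrary:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 100))

start = time.perf_counter()
for col in df.columns:          # column access: contiguous in memory
    _ = df[col].to_numpy().sum()
print("by column:", time.perf_counter() - start)

start = time.perf_counter()
for _, row in df.iterrows():    # row access: each row must be re-assembled
    _ = row.to_numpy().sum()
print("by row:   ", time.perf_counter() - start)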
Data Models:
- Relational model and normalization. SQL is a declarative language: you tell it the data you want, not how to retrieve it; SQL can be Turing complete (Python, by contrast, is imperative).
- NoSQL document model: data is stored as a single continuous string (document = row); the document model doesn't enforce a schema, shifting the responsibility of assuming a structure from the application that writes the data to the application that reads it. Compared to the relational model, it is harder and less efficient to execute joins across documents than across tables.
- NoSQL graph model: a graph consists of nodes and edges, which represent the relationships between the nodes. It is faster to retrieve data based on relationships; schemaless.
Data Storage Engines and Processing: Transactional and Analytical Processing
- Online transaction processing (OLTP): transactions need to be processed fast (low latency) with high availability, so that they don't keep users waiting. They usually need to be ACID: atomicity, consistency, isolation, durability. Because each transaction is often processed as a unit, separately from other transactions, transactional databases are often row-major.
- Online analytical processing (OLAP): designed for queries that aggregate data over many rows, which is why analytical databases tend to be column-major.
This distinction is outdated: the separation of transactional and analytical databases was due to limitations of technology, since it was hard to have databases that could handle both kinds of queries efficiently. Storage and processing are tightly coupled: how data is stored is also how data is processed. The term "online" has become overloaded; it might refer to the speed at which your data is processed, or it can mean "in production".
ETL (extract, transform, load) vs ELT (extract, load, transform; fast arrival of data since there is little processing needed before data is stored)
Modes of Dataflow:
How do we pass data between different processes that don’t share memory?
- Data passing through databases: both processes must be able to access the same database and read/write to it.
- Data passing through services (A to B): send data directly through a network that connects the two processes. A first sends a request to process B specifying the data needed, and B returns the requested data through the same network; this is called request-driven (it works well for systems that rely more on logic than on data). REST (representational state transfer) vs RPC (remote procedure call); HTTP is an implementation of REST.
- Data passing through real-time transport, called event-driven; it works better for systems that are data-heavy. Incoming events are stored in in-memory storage before being discarded or moved to more permanent storage. Instead of using databases to broker data, we use in-memory storage: real-time transports can be thought of as in-memory storage for data passing among services, because databases are too slow for applications with strict latency requirements. Publish-subscribe VS message queue (such as Apache Kafka and RabbitMQ).
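A minimal sketch (mine, not from the book) of the pub-sub model using the kafka-python client; it assumes a Kafka broker running at localhost:9092, and the topic name "user_events" is just an example:

from kafka import KafkaProducer, KafkaConsumer

# A service publishes events to a topic without knowing who will consume them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user_events", b'{"user_id": 1, "action": "click"}')
producer.flush()

# Another service subscribes to the topic and processes events as they arrive.
consumer = KafkaConsumer("user_events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)   # the raw bytes of each event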
Batch processing produces static features; it leverages MapReduce and Spark and runs on historical data.
Stream processing produces dynamic features, using the streaming computation capacity of real-time transports like Apache Kafka; it is more difficult because the amount of data is unbounded and the data comes in at variable rates and speeds.
Training Data
Sampling
Nonprobability Sampling can cause selection bias: convenience, snowball, judgement, quota sampling
Probability sampling: simple random sampling; stratified sampling divides your population into the groups you care about and samples from each group separately; weighted sampling gives each sample a weight that determines the probability of it being selected, which lets you leverage domain expertise and helps when the data comes from a different distribution than the true data, by adjusting the weights; reservoir sampling is useful with streaming data; importance sampling allows you to sample from a distribution when you only have access to another, similar distribution.
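A minimal sketch (mine, not from the book) of reservoir sampling (Algorithm R): keep a uniform random sample of k items from a stream of unknown length:

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)        # inclusive on both ends
            if j < k:
                reservoir[j] = item         # keep item with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))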
Labeling
Hand Labels
Expensive, data privacy concerns, slow, non-adaptive; label ambiguity becomes an issue when the data comes from multiple sources and relies on multiple annotators with different levels of expertise.
Natural Labels
When the task has natural ground truth labels (e.g. Google Maps arrival times or stock price prediction). Feedback loop length: the time it takes from when a prediction is served until when feedback on it is provided.
Handling the lack of labels
Weak supervision relies on the concept of a labelling function: a function that encodes heuristics to generate labels. It needs only a small amount of labelled data, but the output can be noisy.
Semi-supervision leverages structural assumptions to generate new labels based on a small set of initial labels. Unlike weak supervision, it requires an initial set of labels. A classic method is self-training: you start by training a model on your existing set of labelled data and use this model to make predictions for unlabelled samples; the perturbation method applies small changes to the training instances to obtain new ones, under the assumption that a small perturbation to a sample shouldn't change its label.
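A minimal sketch (mine, not tied to any specific library) of weak supervision: heuristic labelling functions vote on each unlabelled example; the keywords and label values are illustrative assumptions:

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_offer(text):
    return SPAM if "limited offer" in text.lower() else ABSTAIN

def lf_has_unsubscribe(text):
    return SPAM if "unsubscribe" in text.lower() else ABSTAIN

def lf_short_personal(text):
    return HAM if len(text.split()) < 10 else ABSTAIN

def weak_label(text, lfs=(lf_contains_offer, lf_has_unsubscribe, lf_short_personal)):
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)   # simple majority vote; noisy by design

print(weak_label("Limited offer!!! Click to unsubscribe"))   # -> 1 (spam)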
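A minimal sketch (mine, not from the book) of self-training with scikit-learn: train on the labelled set, then repeatedly pseudo-label the unlabelled samples the model is most confident about; the confidence threshold and number of rounds are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
    X_lab, y_lab, X_unlab = map(np.asarray, (X_lab, y_lab, X_unlab))
    model = LogisticRegression()
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # add the confident pseudo-labels to the labelled pool
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[confident].argmax(axis=1)]])
        X_unlab = X_unlab[~confident]
    return model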
Transfer learning: a model developed for one task is reused as the starting point for a model on a second task.
Active learning improves the efficiency of data labelling: you label the samples that are most helpful to your model, i.e. the ones the model is least certain about, or those chosen based on disagreement among multiple candidate models.
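A minimal sketch (mine, not from the book) of active learning via uncertainty sampling: ask annotators to label the examples the model is least sure about:

import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_labelling(model, X_pool, n=10):
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)      # low top-class probability = unsure
    return np.argsort(uncertainty)[-n:]        # indices of the n most uncertain samples

# usage sketch: model = LogisticRegression().fit(X_lab, y_lab)
#               idx = select_for_labelling(model, X_pool); send X_pool[idx] to annotators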
Class Imbalance
It is a problem in classification tasks where there is a substantial difference in the number of samples in each class of the training data (fraud detection, rare diseases, churn prediction).
ML models work best with balanced data.
It often means there is insufficient signal for your model to learn to detect the minority classes, and it makes it easier for your model to get stuck in a non-optimal solution by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data: if the model learns to always output the majority class, its accuracy is already very high.
Class imbalance leads to asymmetric costs of error: the cost of a wrong prediction on a sample of the rare class might be much higher than a wrong prediction on a sample of the majority class.
In the real world, class imbalance is the norm: rare events are often more interesting and/or dangerous than regular events, and many tasks focus on detecting those rare events.
Using the right evaluation metrics
Overall accuracy and error rate are insufficient; you need to look at F1, recall, and ROC curves too
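A minimal sketch (mine, not from the book): on imbalanced data, accuracy can look great while recall on the rare class is terrible:

from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0]*95 + [1]*5          # 5% positive (rare) class
y_pred = [0]*100                 # a model that always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.95, looks great
print("recall:  ", recall_score(y_true, y_pred))     # 0.0, catches no rare cases
print("F1:      ", f1_score(y_true, y_pred))         # 0.0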
Data – level methods: resampling
Resampling includes oversampling (adding more instances of the minority classes) and undersampling (removing instances of the majority classes). When you resample your training data, never evaluate your model on the resampled data, since that will cause the model to overfit to the resampled distribution.
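A minimal sketch (mine, not from the book) of naive oversampling with scikit-learn: duplicate minority-class rows until both classes have the same count; evaluation must still happen on the original distribution:

import numpy as np
from sklearn.utils import resample

X = np.random.rand(1000, 4)
y = np.array([0]*950 + [1]*50)                  # imbalanced toy labels

X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0]*len(X_maj) + [1]*len(X_min_up))
print(X_bal.shape, np.bincount(y_bal))          # (1900, 4) [950 950]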
Algorithm-level Methods
It keeps the training data distribution intact but alters the algorithm to make it more robust to class imbalance, mainly by adjusting the loss function.
Cost Sensitive Learning
The individual loss function is modified to take into account the different costs of the classes.
Class balanced loss; Focal loss
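A minimal sketch (mine, not from the book) of binary focal loss in NumPy: it down-weights easy, well-classified examples so training focuses on the hard ones; the gamma and alpha values are the commonly used defaults, given here as assumptions:

import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

print(focal_loss(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.3])))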
Data Augmentation
It is a family of techniques used to increase the amount of training data. It is mainly used in computer vision, e.g. medical imaging (changing pixels), and in NLP (replacing a word).
Simple label-preserving transformations are the most basic technique: randomly modify an image while preserving its label.
Perturbation is similar, but it is also used to trick models into making wrong predictions. Adding noisy samples to the training data can help models recognize the weak spots in their learned decision boundary and improve their performance.
Data Synthesis tries to train our models with synthesized data.
Feature Engineering
Learned VS Engineered Features
The promise of deep learning is that we won't have to handcraft features, since they can potentially be learned and extracted by the algorithm; for this reason DL is sometimes called feature learning.
However, we are not there yet.
Handling Missing Values
Three types of missing values: missing not at random (MNAR), when the reason a value is missing is related to the value itself; missing at random (MAR), when the reason is related to another observed variable; missing completely at random (MCAR), when there is no pattern to which values are missing.
Deletion by column or by row, risk of losing important info
Imputation fills missing values with defaults or with the mean, median, or mode
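A minimal sketch (mine, not from the book) of median imputation with scikit-learn; the imputer is fit on training data only, to avoid leaking test statistics:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X_train))   # NaNs replaced by each column's median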
Feature Scaling
ML models tend to struggle with features that follow a skewed distribution.
Apply normalization, standardization, or a log transform
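A minimal sketch (mine, not from the book) of the three common rescalings on a skewed toy feature:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 100.0])             # skewed toy feature

x_minmax = (x - x.min()) / (x.max() - x.min())   # normalization to [0, 1]
x_std    = (x - x.mean()) / x.std()              # standardization: mean 0, std 1
x_log    = np.log1p(x)                           # log transform tames the skew

print(x_minmax, x_std, x_log, sep="\n")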
Discretization
Turning continuous features into discrete ones by quantization or binning; risk of losing info.
Encoding Categorical Features
In production, categories can change, and the model needs to handle that.
One solution is the hashing trick: use a hash function to generate a hashed value of each category, which becomes its index.
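A minimal sketch (mine, not from the book) of the hashing trick: any category, including ones never seen in training, maps into a fixed number of buckets; md5 is used only because Python's built-in hash() varies across runs:

import hashlib

N_BUCKETS = 1_000

def hash_bucket(category: str, n_buckets: int = N_BUCKETS) -> int:
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(hash_bucket("acme_corp"), hash_bucket("brand_new_vendor_2024"))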
Feature crossing combines two or more features to generate new features. It is useful to model the nonlinear relationships between features.
Data Leakage
When some form of the label leaks into the set of features used for making predictions, and this same information is not available during inference.
Splitting time-correlated data randomly instead of by time: in many cases, data is time-correlated, which means the time the data is generated affects its label distribution.
To prevent future information from leaking into the training process and allowing models to cheat during evaluation, split your data by time instead of randomly, whenever possible.
Scaling before splitting: do not use the entire dataset to generate global statistics before splitting it, because that leaks the mean and variance of the test samples into the training process, allowing the model to adjust its predictions for the test samples. This information is not available in production, so the model's performance will likely degrade.
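A minimal sketch (mine, not from the book): fit the scaler on the training split only, then apply it to the test split, so no test statistics leak into training:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 3)
y = np.random.randint(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)      # statistics come from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)         # the leaky version would fit on all of X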
Filling in missing data with the statistics from the test split
Poor handling of data duplication before splitting
Leakage from data generation process
To detect data leakage, measure the predictive power of each feature, or set of features, with respect to the target variable (label). If a feature has unusually high correlation, investigate how the feature is generated and whether the correlation makes sense.
Engineering good features: more features is not always good
- More features mean more opportunities for data leakage, can cause overfitting, can increase the memory required to serve a model, can increase inference latency when doing online prediction, and useless features become technical debt.
- Often a small number of features accounts for a large portion of the model's feature importance
- Need to assess how well a feature generalizes
Model Development and off-line evaluation
Evaluating ML models
When considering what model to use, it's important to consider not only its performance, but also its other properties, such as how much data, compute, and time it needs to train, its inference latency, and its interpretability. For example, a simple logistic regression model might have lower accuracy than a complex neural network, but it requires less labelled data, is faster to train, and is easier to deploy.
- Avoid the state-of-the-art trap just to follow the latest trend
- Start with the simplest model and use it as a baseline
- Avoid human biases in selecting models
- Evaluate good performance now vs good performance later, think of potential/future situations
- Evaluate Trade-offs, such as false positive VS false negatives or compute power VS accuracy
- Understand model’s assumptions
Ensembles
They are less favored in production because they are more complex to deploy and harder to maintain.
Bagging (bootstrap aggregating) reduces variance and helps avoid overfitting: instead of training one classifier on the entire dataset, it samples with replacement to create different datasets, called bootstraps, and trains a model on each of them; e.g. random forest.
Boosting reinforces weak learners: each learner is trained on the same set of samples, but the samples are weighted differently across iterations; e.g. gradient boosting machines or XGBoost.
Stacking trains base learners on the training data, then creates a meta-learner that combines the outputs of the base learners to produce final predictions; the meta-learner can be as simple as a heuristic: take the majority or average vote from all the base learners.
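A minimal sketch (mine, not from the book) of stacking with scikit-learn: two base learners whose predictions are combined by a logistic-regression meta-learner; the dataset is synthetic:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),   # the meta-learner
)
print(stack.fit(X, y).score(X, y))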
Experiment tracking and versioning
Must track pivotal results: loss curve; model performance; predictions/labels; speed; params and hyperparams
Distributed Training
In some cases a data sample is so large it can't even fit into memory, and you will have to use something like gradient checkpointing.
Data parallelism: it's now the norm to train ML models on multiple machines; the data is split across machines, each worker has its own copy of the whole model and does all the computation needed on its share of the data; the problem is how to accurately and effectively accumulate gradients from the different machines (synchronous VS asynchronous).
Model parallelism: different components of the model are trained on different machines. It doesn't mean that the different parts of the model on different machines are executed in parallel; that is what pipeline parallelism adds.
Auto ML
It’s the process of finding ML algorithms to solve real problems
Soft AutoML: hyperparameter tuning, they are the parameters supplied by users, whose value is used to control the learning process. With different values, the same model can give drastically different performances on the same deficit. The goal of the hyperparameter tuning is to find the optimal set for a given mode within the search space – the performance of each set are evaluated on a validation set.
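A minimal sketch (mine, not from the book) of hyperparameter tuning with grid search in scikit-learn; here each candidate set is scored by cross-validation rather than a single held-out validation set, and the grid values are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [3, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)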
Model off-line evaluation
For certain tasks, it's possible to infer approximate labels in production based on user feedback (natural labels); for others, you might not be able to evaluate the model's performance in production directly, and might have to rely on extensive monitoring to detect changes and failures in the ML system's performance.
Baselines: random baseline; simple heuristic; zero rule baseline; human baseline; existing solutions.
The model should be good, but also useful.
Evaluation Methods
Perturbation tests make small changes to the test set to see how these changes affect the model's performance
Invariance tests change the sensitive information to see if the outputs change
Directional expectation tests, model calibration, confidence measurement; slice-based evaluation.