A very technical but still accessible and well-written textbook on ML systems.
It goes very deep on the infra, almost DevOps, side of things, but that was expected (the author is an ML engineer).
It is a great complement to conventional data science books, which focus primarily on algorithms and data manipulation.
NOTES
MLOps comes from DevOps, short for development and operations. To operationalize something means to bring it into production, which includes deploying, monitoring, and maintaining it.
Machine learning is an approach to learning complex patterns from existing data and using these patterns to make predictions on unseen data. It is most useful when the task is repetitive, the cost of a wrong prediction is low, the task is at scale, and the patterns are constantly changing.
A relational database isn't an ML system because it doesn't have the capacity to learn the relationship between two columns by itself.
ML systems are part code, part data, and part artifacts created from the two.
Machine learning in research VS in production
Requirements: state-of-the-art performance on benchmark datasets VS different stakeholders with different requirements. For instance, while it can give your ML system a small performance improvement, ensembling tends to make a system too complex to be useful in production.
Computational priorities: fast training, high throughput VS fast inference, low latency. When designing ML systems, people who haven't deployed an ML system often make the mistake of focusing too much on the model development part and not enough on the deployment and maintenance part.
During the model development process, we train different models, and each model does multiple passes over the training data. Each trained model then generates predictions on the validation data once to report the scores; the validation dataset is usually much smaller than the training data. During model development, training is the bottleneck. Once the model is deployed, inference is the bottleneck. Research usually prioritises fast training, whereas production usually prioritises fast inference.
To reduce latency in production, you might have to reduce the number of queries you can process on the same hardware at a time. Latency is not an individual number but a distribution, so it's better to think in percentiles. Higher percentiles are important to look at: even though they account for a small percentage of your users, sometimes they are the most important users.
Data: in research you mostly work with historical, well-formatted data, whereas in production the data is constantly being generated by users, systems, and third parties.
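A minimal sketch (mine, not from the book) of summarizing request latency by percentiles with NumPy instead of the mean; latencies_ms is assumed toy data pulled from logs:

import numpy as np

latencies_ms = [23, 25, 24, 30, 28, 26, 250, 27, 29, 31]  # toy per-request latencies

p50, p90, p95, p99 = np.percentile(latencies_ms, [50, 90, 95, 99])
print(f"p50={p50:.0f}ms  p90={p90:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# The mean (~49ms here) hides the slow tail that the higher percentiles expose.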
Fairness: ML algorithms don’t predict the future, but encode the past, thus perpetuating biases in the data and more.
Interpretability and Discussion
Requirements
- At the base of every ML project there must be a business objective: creating value for the company.
- Reliability: the system should continue to perform the correct function at the desired level of performance even in the face of adversity; ML systems can fail silently, which is hard to detect if the ground truth is not available.
- Scalability: handling resource scaling, but also artifact management.
- Maintainability and adaptability
Data Engineering Fundamentals
Data Source: user input data; system-generated data (logs); internal databases; 3rd party data.
Data Formats: Data serialization is the process of converting a data structure or object state into a format that can be stored or transmitted and restructured later.
- JSON is human readable; key value pair paradigm; text file
- CSV is row-major, consecutive elements in a row are stored next to each other in memory; good for accessing samples; it’s much faster to write; text file
- Parquet is column-major and non-human-readable; consecutive elements in a column are stored next to each other; good for accessing features (columns); it is better when you have to do a lot of column-based reads; binary file (aka non-text file).
For instance, pandas DataFrames are column-major whereas NumPy arrays are row-major by default, so accessing a DataFrame by row is much slower than by column.
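A minimal sketch (mine, not from the book) showing why iterating a pandas DataFrame column by column is much faster than row by row; the sizes are arbitrary:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 100))

start = time.perf_counter()
for col in df.columns:          # column access: contiguous in memory
    _ = df[col].to_numpy().sum()
print("by column:", time.perf_counter() - start)

start = time.perf_counter()
for _, row in df.iterrows():    # row access: each row must be re-assembled
    _ = row.to_numpy().sum()
print("by row:   ", time.perf_counter() - start)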
Data Models:
- Relational model and normalization. SQL is a declarative language: you tell it the data you want, not how to retrieve it; SQL can be Turing complete (Python, by contrast, is imperative).
- NoSQL document model: data is stored as a single continuous string (document = row); the document model doesn't enforce a schema, shifting the responsibility of assuming a structure from the application that writes the data to the application that reads it. Compared to the relational model, it is harder and less efficient to execute joins across documents than across tables.
- NoSQL graph model: a graph consists of nodes and edges, which represent the relationships between the nodes. It is faster to retrieve data based on relationships; schemaless.
Data Storage Engines and Processing: Transactional and Analytical Processing
- Online transaction processing (OLTP): transactions need to be processed fast (low latency) with high availability, so that they don't keep users waiting. They usually need to be ACID: atomicity, consistency, isolation, durability. Because each transaction is often processed as a unit, separately from other transactions, transactional databases are often row-major.
- Online analytical processing (OLAP): designed for queries that aggregate data over many rows, which is why analytical databases tend to be column-major.
This distinction is outdated: the separation of transactional and analytical databases was due to limitations of technology, since it was hard to have databases that could handle both kinds of queries efficiently. Storage and processing are tightly coupled: how data is stored is also how data is processed. The term "online" has become overloaded; it might refer to the speed at which your data is processed, or it can mean "in production".
ETL (extract, transform, load) vs ELT (extract, load, transform; fast arrival of data since there is little processing needed before data is stored)
Modes of Dataflow:
How do we pass data between different processes that don’t share memory?
- Data passing through databases: both processes must be able to access the same database and read/write to it.
- Data passing through services (A to B): send data directly through a network that connects the two processes. A first sends a request to process B specifying the data needed, and B returns the requested data through the same network; this is called request-driven (it works well for systems that rely more on logic than on data). REST (representational state transfer) vs RPC (remote procedure call); HTTP is an implementation of REST.
- Data passing through real-time transport, called event-driven; it works better for systems that are data-heavy. Incoming events are stored in in-memory storage before being discarded or moved to more permanent storage. Instead of using databases to broker data, we use in-memory storage: real-time transports can be thought of as in-memory storage for data passing among services, because databases are too slow for applications with strict latency requirements. Publish-subscribe VS message queue (such as Apache Kafka and RabbitMQ).
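A minimal sketch (mine, not from the book) of the pub-sub model using the kafka-python client; it assumes a Kafka broker running at localhost:9092, and the topic name "user_events" is just an example:

from kafka import KafkaProducer, KafkaConsumer

# A service publishes events to a topic without knowing who will consume them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user_events", b'{"user_id": 1, "action": "click"}')
producer.flush()

# Another service subscribes to the topic and processes events as they arrive.
consumer = KafkaConsumer("user_events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)   # the raw bytes of each event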
Batch processing produces static features; it leverages MapReduce and Spark and runs on historical data.
Stream processing produces dynamic features, using the streaming computation capacity of real-time transports like Apache Kafka; it is more difficult because the amount of data is unbounded and the data comes in at variable rates and speeds.
Training Data
Sampling
Nonprobability Sampling can cause selection bias: convenience, snowball, judgement, quota sampling
Probability sampling: simple random sampling; stratified sampling divides your population into the groups you care about and samples from each group separately; weighted sampling gives each sample a weight that determines the probability of it being selected, which lets you leverage domain expertise and helps when the data comes from a different distribution than the true data, by adjusting the weights; reservoir sampling is useful with streaming data; importance sampling allows you to sample from a distribution when you only have access to another, similar distribution.
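A minimal sketch (mine, not from the book) of reservoir sampling (Algorithm R): keep a uniform random sample of k items from a stream of unknown length:

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)        # inclusive on both ends
            if j < k:
                reservoir[j] = item         # keep item with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))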
Labeling
Hand Labels
Expensive, data privacy concerns, slow, non-adaptive; label ambiguity becomes an issue when the data comes from multiple sources and relies on multiple annotators with different levels of expertise.
Natural Labels
When the task has natural ground truth labels (e.g. Google Maps arrival times or stock price prediction). Feedback loop length: the time it takes from when a prediction is served until when feedback on it is provided.
Handling the lack of labels
Weak supervision relies on the concept of a labelling function: a function that encodes heuristics to generate labels. It needs only a small amount of labelled data, but the output can be noisy.
Semi-supervision leverages structural assumptions to generate new labels based on a small set of initial labels. Unlike weak supervision, it requires an initial set of labels. A classic method is self-training: you start by training a model on your existing set of labelled data and use this model to make predictions for unlabelled samples; the perturbation method applies small changes to the training instances to obtain new ones, under the assumption that a small perturbation to a sample shouldn't change its label.
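A minimal sketch (mine, not tied to any specific library) of weak supervision: heuristic labelling functions vote on each unlabelled example; the keywords and label values are illustrative assumptions:

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_offer(text):
    return SPAM if "limited offer" in text.lower() else ABSTAIN

def lf_has_unsubscribe(text):
    return SPAM if "unsubscribe" in text.lower() else ABSTAIN

def lf_short_personal(text):
    return HAM if len(text.split()) < 10 else ABSTAIN

def weak_label(text, lfs=(lf_contains_offer, lf_has_unsubscribe, lf_short_personal)):
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)   # simple majority vote; noisy by design

print(weak_label("Limited offer!!! Click to unsubscribe"))   # -> 1 (spam)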
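A minimal sketch (mine, not from the book) of self-training with scikit-learn: train on the labelled set, then repeatedly pseudo-label the unlabelled samples the model is most confident about; the confidence threshold and number of rounds are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
    X_lab, y_lab, X_unlab = map(np.asarray, (X_lab, y_lab, X_unlab))
    model = LogisticRegression()
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # add the confident pseudo-labels to the labelled pool
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[confident].argmax(axis=1)]])
        X_unlab = X_unlab[~confident]
    return model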
Transfer learning: a model developed for one task is reused as the starting point for a model on a second task.
Active learning improves the efficiency of data labelling: you label the samples that are most helpful to your model, i.e. the ones the model is least certain about, or those chosen based on disagreement among multiple candidate models.
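A minimal sketch (mine, not from the book) of active learning via uncertainty sampling: ask annotators to label the examples the model is least sure about:

import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_labelling(model, X_pool, n=10):
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)      # low top-class probability = unsure
    return np.argsort(uncertainty)[-n:]        # indices of the n most uncertain samples

# usage sketch: model = LogisticRegression().fit(X_lab, y_lab)
#               idx = select_for_labelling(model, X_pool); send X_pool[idx] to annotators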
Class Imbalance
It is a problem in classification tasks where there is a substantial difference in the number of samples in each class of the training data (fraud detection, rare diseases, churn prediction).
ML models work best with balanced data.
It often means there is insufficient signal for your model to learn to detect the minority classes, and it makes it easier for your model to get stuck in a non-optimal solution by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data: if the model learns to always output the majority class, its accuracy is already very high.
Class imbalance leads to asymmetric costs of error: the cost of a wrong prediction on a sample of the rare class might be much higher than a wrong prediction on a sample of the majority class.
In the real world, class imbalance is the norm: rare events are often more interesting and/or dangerous than regular events, and many tasks focus on detecting those rare events.
Using the right evaluation metrics
Overall accuracy and error rate are insufficient; you need to look at F1, recall, and ROC curves too
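A minimal sketch (mine, not from the book): on imbalanced data, accuracy can look great while recall on the rare class is terrible:

from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0]*95 + [1]*5          # 5% positive (rare) class
y_pred = [0]*100                 # a model that always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.95, looks great
print("recall:  ", recall_score(y_true, y_pred))     # 0.0, catches no rare cases
print("F1:      ", f1_score(y_true, y_pred))         # 0.0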
Data – level methods: resampling
Resampling includes oversampling (adding more instances of the minority classes) and undersampling (removing instances of the majority classes). When you resample your training data, never evaluate your model on the resampled data, since that will cause the model to overfit to the resampled distribution.
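A minimal sketch (mine, not from the book) of naive oversampling with scikit-learn: duplicate minority-class rows until both classes have the same count; evaluation must still happen on the original distribution:

import numpy as np
from sklearn.utils import resample

X = np.random.rand(1000, 4)
y = np.array([0]*950 + [1]*50)                  # imbalanced toy labels

X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0]*len(X_maj) + [1]*len(X_min_up))
print(X_bal.shape, np.bincount(y_bal))          # (1900, 4) [950 950]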
Algorithm-level Methods
It keeps the training data distribution intact but alters the algorithm to make it more robust to class imbalance, mainly by adjusting the loss function.
Cost Sensitive Learning
The individual loss function is modified to take into account the different costs of the classes.
Class balanced loss; Focal loss
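A minimal sketch (mine, not from the book) of binary focal loss in NumPy: it down-weights easy, well-classified examples so training focuses on the hard ones; the gamma and alpha values are the commonly used defaults, given here as assumptions:

import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

print(focal_loss(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.3])))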
Data Augmentation
It is a family of techniques used to increase the amount of training data. It is mainly used in computer vision, e.g. medical imaging (changing pixels), and in NLP (replacing a word).
Simple label-preserving transformations are the most basic technique: randomly modify an image while preserving its label.
Perturbation is similar, but it is also used to trick models into making wrong predictions. Adding noisy samples to the training data can help models recognize the weak spots in their learned decision boundary and improve their performance.
Data Synthesis tries to train our models with synthesized data.
Feature Engineering
Learned VS Engineered Features
The promise of deep learning is that we won't have to handcraft features, since they can potentially be learned and extracted by the algorithm; for this reason DL is sometimes called feature learning.
However, we are not there yet.
Handling Missing Values
Three types of missing values: missing not at random (MNAR), when the reason a value is missing is related to the value itself; missing at random (MAR), when the reason is related to another observed variable; missing completely at random (MCAR), when there is no pattern to which values are missing.
Deletion by column or by row, risk of losing important info
Imputation fills missing values with defaults or with the mean, median, or mode
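A minimal sketch (mine, not from the book) of median imputation with scikit-learn; the imputer is fit on training data only, to avoid leaking test statistics:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X_train))   # NaNs replaced by each column's median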
Feature Scaling
ML models tend to struggle with features that follow a skewed distribution.
Apply normalization, standardization, or a log transform
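A minimal sketch (mine, not from the book) of the three common rescalings on a skewed toy feature:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 100.0])             # skewed toy feature

x_minmax = (x - x.min()) / (x.max() - x.min())   # normalization to [0, 1]
x_std    = (x - x.mean()) / x.std()              # standardization: mean 0, std 1
x_log    = np.log1p(x)                           # log transform tames the skew

print(x_minmax, x_std, x_log, sep="\n")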
Discretization
Turning continuous features into discrete ones by quantization or binning; risk of losing info.
Encoding Categorical Features
In production, categories can change, and the model needs to handle that.
One solution is the hashing trick: use a hash function to generate a hashed value of each category, which becomes its index.
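A minimal sketch (mine, not from the book) of the hashing trick: any category, including ones never seen in training, maps into a fixed number of buckets; md5 is used only because Python's built-in hash() varies across runs:

import hashlib

N_BUCKETS = 1_000

def hash_bucket(category: str, n_buckets: int = N_BUCKETS) -> int:
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(hash_bucket("acme_corp"), hash_bucket("brand_new_vendor_2024"))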
Feature crossing combines two or more features to generate new features. It is useful to model the nonlinear relationships between features.
Data Leakage
When some form of the label leaks into the set of features used for making predictions, and this same information is not available during inference.
Splitting time-correlated data randomly instead of by time: in many cases, data is time-correlated, which means the time the data is generated affects its label distribution.
To prevent future information from leaking into the training process and allowing models to cheat during evaluation, split your data by time instead of randomly, whenever possible.
Scaling before splitting: do not use the entire dataset to generate global statistics before splitting it, because that leaks the mean and variance of the test samples into the training process, allowing the model to adjust its predictions for the test samples. This information is not available in production, so the model's performance will likely degrade.
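A minimal sketch (mine, not from the book): fit the scaler on the training split only, then apply it to the test split, so no test statistics leak into training:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 3)
y = np.random.randint(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)      # statistics come from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)         # the leaky version would fit on all of X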
Filling in missing data with the statistics from the test split
Poor handling of data duplication before splitting
Leakage from data generation process
To detect data leakage, measure the predictive power of each feature, or set of features, with respect to the target variable (label). If a feature has unusually high correlation, investigate how the feature is generated and whether the correlation makes sense.
Engineering good features: more features is not always good
- More features mean more opportunities for data leakage, can cause overfitting, can increase the memory required to serve a model, can increase inference latency when doing online prediction, and useless features become technical debt.
- Often a small number of features accounts for a large portion of the model's feature importance
- Need to assess how well a feature generalizes
Model Development and off-line evaluation
Evaluating ML models
When considering what model to use, it's important to consider not only its performance, but also its other properties, such as how much data, compute, and time it needs to train, its inference latency, and its interpretability. For example, a simple logistic regression model might have lower accuracy than a complex neural network, but it requires less labelled data, is faster to train, and is easier to deploy.
- Avoid the state-of-the-art trap just to follow the latest trend
- Start with the simplest model and use it as a baseline
- Avoid human biases in selecting models
- Evaluate good performance now vs good performance later, think of potential/future situations
- Evaluate Trade-offs, such as false positive VS false negatives or compute power VS accuracy
- Understand model’s assumptions
Ensembles
They are less favored in production because they are more complex to deploy and harder to maintain.
Bagging (bootstrap aggregating) reduces variance and helps avoid overfitting: instead of training one classifier on the entire dataset, it samples with replacement to create different datasets, called bootstraps, and trains a model on each of them; e.g. random forest.
Boosting reinforces weak learners: each learner is trained on the same set of samples, but the samples are weighted differently across iterations; e.g. gradient boosting machines or XGBoost.
Stacking trains base learners on the training data, then creates a meta-learner that combines the outputs of the base learners to produce final predictions; the meta-learner can be as simple as a heuristic: take the majority or average vote from all the base learners.
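A minimal sketch (mine, not from the book) of stacking with scikit-learn: two base learners whose predictions are combined by a logistic-regression meta-learner; the dataset is synthetic:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),   # the meta-learner
)
print(stack.fit(X, y).score(X, y))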
Experiment tracking and versioning
Must track pivotal results: loss curve; model performance; predictions/labels; speed; params and hyperparams
Distributed Training
In some cases a data sample is so large it can't even fit into memory, and you will have to use something like gradient checkpointing.
Data parallelism: it's now the norm to train ML models on multiple machines; the data is split across machines, each worker has its own copy of the whole model and does all the computation needed on its share of the data; the problem is how to accurately and effectively accumulate gradients from the different machines (synchronous VS asynchronous).
Model parallelism: different components of the model are trained on different machines. It doesn't mean that the different parts of the model on different machines are executed in parallel; that is what pipeline parallelism adds.
Auto ML
It’s the process of finding ML algorithms to solve real problems
Soft AutoML: hyperparameter tuning, they are the parameters supplied by users, whose value is used to control the learning process. With different values, the same model can give drastically different performances on the same deficit. The goal of the hyperparameter tuning is to find the optimal set for a given mode within the search space – the performance of each set are evaluated on a validation set.
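A minimal sketch (mine, not from the book) of hyperparameter tuning with grid search in scikit-learn; here each candidate set is scored by cross-validation rather than a single held-out validation set, and the grid values are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [3, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)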
Model off-line evaluation
For certain tasks, it's possible to infer approximate labels in production based on user feedback (natural labels); for others, you might not be able to evaluate the model's performance in production directly, and might have to rely on extensive monitoring to detect changes and failures in the ML system's performance.
Baselines: random baseline; simple heuristic; zero rule baseline; human baseline; existing solutions.
The model should be good, but also useful.
Evaluation Methods
Perturbation tests make small changes to the test set to see how these changes affect the model's performance
Invariance tests change the sensitive information to see if the outputs change
Directional expectation tests, model calibration, confidence measurement; slice-based evaluation.