Data Pipelines for Machine Learning: A Practical Guide to MLOps and Production Data Systems

Brief Hook

You've trained a model with 95% accuracy in a Jupyter Notebook, but now what? A brilliant machine learning model without a data pipeline is a world-class engine with no fuel line—it’s not going anywhere.

Book Summary
This guide closes the gap between a model in a Jupyter Notebook and a resilient, production-grade ML application. It dives into the 90% of MLOps work that is so often overlooked: the critical data infrastructure. You'll move beyond theory and learn to design and build the data systems that are the true backbone of any successful AI product. From ingestion and transformation to validation and orchestration, this book provides a complete roadmap for the modern data professional.

Following a single, end-to-end case study, you will construct a complete data platform from the ground up. You'll learn to structure transformations with dbt, orchestrate workflows with Dagster, solve training-serving skew with a feature store like Feast, and automate your quality gates with GitHub Actions. This book is your hands-on guide to building the robust, observable, and testable data pipelines that production machine learning demands.
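As a flavor of how those stages fit together, here is a minimal sketch of pipeline steps expressed as Dagster assets. The asset, file, and column names (raw_orders, orders_cleaned, order_features) are illustrative assumptions, not code from the book.

```python
# A minimal sketch of pipeline stages as Dagster assets.
# Asset, file, and column names are illustrative assumptions, not the book's code.
import pandas as pd
from dagster import Definitions, asset


@asset
def raw_orders() -> pd.DataFrame:
    # Ingestion: in a real project this would pull from an operational source.
    return pd.read_csv("data/orders.csv")


@asset
def orders_cleaned(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Transformation: drop malformed rows before anything downstream runs.
    return raw_orders.dropna(subset=["order_id", "product_id"])


@asset
def order_features(orders_cleaned: pd.DataFrame) -> pd.DataFrame:
    # Feature computation: one row per product with a simple popularity count.
    return (
        orders_cleaned.groupby("product_id", as_index=False)
        .agg(order_count=("order_id", "nunique"))
    )


# Registering the assets gives Dagster the dependency graph to orchestrate.
defs = Definitions(assets=[raw_orders, orders_cleaned, order_features])
```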

Why Choose this Book?
Go from Zero to Production with a Complete Case Study
This isn't a collection of disconnected tutorials. You will build one cohesive project, a "Frequently Bought Together" recommendation engine, from start to finish. Every chapter adds a new, functional component to this system, ensuring you understand not just how each tool works in isolation, but how they all fit together in a real-world architecture.

Master the Tools the Pros Actually Use
Gain deep, practical skills with the technologies that define the modern data and MLOps stack. You'll get hands-on experience with dbt for building modular SQL transformations, Dagster for data-aware orchestration, Feast for feature store implementation, Great Expectations for data validation, DuckDB for local data warehousing, and GitHub Actions for professional CI/CD workflows.
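To make the local-warehouse idea concrete, here is a small sketch of a DuckDB-backed SQL transformation. The file, table, and column names are placeholder assumptions, and in the book the SQL would live in dbt models rather than inline strings.

```python
# Sketch: DuckDB as a local warehouse, with the kind of SQL transformation
# a dbt model would normally own. Names are assumptions for illustration.
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Load raw data straight from a CSV into the warehouse.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('data/orders.csv')
""")

# A modular transformation: one clean, well-named table per step.
con.execute("""
    CREATE OR REPLACE TABLE stg_orders AS
    SELECT
        order_id,
        product_id,
        CAST(ordered_at AS TIMESTAMP) AS ordered_at
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")

print(con.execute("SELECT COUNT(*) FROM stg_orders").fetchone())
con.close()
```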

Think Like a Data Systems Architect, Not Just a Scripter
This book focuses on the timeless principles that separate fragile scripts from robust platforms. You will learn to design for idempotency (safe re-runs), modularity (reusable components), observability (monitoring and logging), and testability. These are the foundational concepts that senior engineers use to build systems that are reliable, scalable, and a pleasure to maintain.
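As an illustration of the idempotency principle, here is a minimal sketch of a partition-overwrite load in DuckDB: re-running it for the same day replaces that day's rows instead of duplicating them. Table and column names are assumptions for illustration.

```python
# Sketch of an idempotent load: delete-then-insert for one date partition
# inside a transaction, so re-running the job for the same day is safe.
# Table and column names are illustrative assumptions.
from datetime import date

import duckdb


def load_orders_for_day(con: duckdb.DuckDBPyConnection, day: date) -> None:
    con.execute("BEGIN TRANSACTION")
    try:
        # Remove any rows left by a previous (possibly partial) run of this day.
        con.execute("DELETE FROM orders WHERE order_date = ?", [day])
        # Re-insert that day's partition from the staging table.
        con.execute(
            "INSERT INTO orders SELECT * FROM stg_orders WHERE order_date = ?",
            [day],
        )
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise


con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id INTEGER, order_date DATE)")
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, order_date DATE)")

# Running the load twice leaves the warehouse in exactly the same state.
load_orders_for_day(con, date(2025, 6, 1))
load_orders_for_day(con, date(2025, 6, 1))
con.close()
```

Because the delete and insert run inside one transaction, a failed re-run never leaves a half-written partition behind.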

Solve the Toughest Problems in MLOps Data
Go beyond basic ETL and tackle the complex, ML-specific challenges that cause most projects to fail in production. This book provides clear, step-by-step solutions for eliminating training-serving skew using a feature store, and shows you how to generate point-in-time correct training datasets to prevent data leakage and build models you can trust.
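As a rough sketch of how a feature store addresses both problems, the example below uses Feast: get_historical_features assembles a point-in-time correct training frame, while get_online_features serves the same feature definitions at prediction time. The repository path, feature view, and entity names are assumptions, not the book's actual project.

```python
# Sketch of using one Feast feature store for both training and serving.
# The repo path, feature view name (product_stats), and entity/column names
# are assumptions for illustration.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")

# Training: features are joined as they were at each event_timestamp, which
# is what keeps future information from leaking into the training set.
entity_df = pd.DataFrame(
    {
        "product_id": [101, 102],
        "event_timestamp": pd.to_datetime(["2025-05-01", "2025-05-15"]),
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["product_stats:order_count_30d", "product_stats:return_rate"],
).to_df()

# Serving: the same feature definitions, read from the online store, so the
# model sees values computed by the same logic it was trained on.
online_features = store.get_online_features(
    features=["product_stats:order_count_30d", "product_stats:return_rate"],
    entity_rows=[{"product_id": 101}],
).to_dict()
```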

Adopt Professional Software Engineering Practices for Data
Learn how to treat "Data as Code." You'll implement a complete, Git-based development workflow using feature branches and pull requests. You will build an automated CI/CD pipeline that lints, tests, and validates every single code change, ensuring that no bad code or faulty logic ever reaches your production environment.
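As one hedged example of such a quality gate, the sketch below shows a pytest data test that a GitHub Actions job could run on every pull request; the warehouse path and table names are assumptions, and the book may express the same checks as dbt tests or Great Expectations suites instead.

```python
# Sketch of a data-quality test that a CI job (e.g. GitHub Actions running
# pytest) could execute on every pull request. Warehouse path and table
# names are illustrative assumptions.
import duckdb
import pytest


@pytest.fixture()
def con():
    connection = duckdb.connect("warehouse.duckdb", read_only=True)
    yield connection
    connection.close()


def test_orders_have_no_null_keys(con):
    nulls = con.execute(
        "SELECT COUNT(*) FROM stg_orders WHERE order_id IS NULL"
    ).fetchone()[0]
    assert nulls == 0, "stg_orders contains rows without an order_id"


def test_order_ids_are_unique(con):
    total, distinct = con.execute(
        "SELECT COUNT(order_id), COUNT(DISTINCT order_id) FROM stg_orders"
    ).fetchone()
    assert total == distinct, "duplicate order_id values in stg_orders"
```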

285 pages, Kindle Edition

Published June 28, 2025

About the author

Marco Lusi

27 books

Community Reviews

No one has reviewed this book yet.
