Effective Data Science Infrastructure: How to make data scientists productive

Simplify data science infrastructure to give data scientists an efficient path from prototype to production.

In Effective Data Science Infrastructure you will learn how to:

Design data science infrastructure that boosts productivity
Handle compute and orchestration in the cloud
Deploy machine learning to production
Monitor and manage performance and results
Combine cloud-based tools into a cohesive data science environment
Develop reproducible data science projects using Metaflow, Conda, and Docker
Architect complex applications for multiple teams and large datasets
Customize and grow data science infrastructure

Effective Data Science Infrastructure: How to make data scientists productive is a hands-on guide to assembling infrastructure for data science and machine learning applications. It reveals the processes used at Netflix and other data-driven companies to manage their cutting-edge data infrastructure. In it, you’ll master scalable techniques for data storage, computation, experiment tracking, and orchestration that are relevant to companies of all shapes and sizes. You’ll learn how you can make data scientists more productive with your existing cloud infrastructure, a stack of open source software, and idiomatic Python.

The author is donating proceeds from this book to charities that support women and underrepresented groups in data science.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Growing data science projects from prototype to production requires reliable infrastructure. Using the powerful new techniques and tooling in this book, you can stand up an infrastructure stack that will scale with any organization, from startups to the largest enterprises.

About the book
Effective Data Science Infrastructure teaches you to build data pipelines and project workflows that will supercharge data scientists and their projects. Based on the state-of-the-art tools and concepts that power data operations at Netflix, this book introduces a customizable cloud-based approach to model development and MLOps that you can easily adapt to your company’s specific needs. As you roll out these practical processes, your teams will produce better and faster results when applying data science and machine learning to a wide array of business problems.

What's inside

Handle compute and orchestration in the cloud
Combine cloud-based tools into a cohesive data science environment
Develop reproducible data science projects using Metaflow, AWS, and the Python data ecosystem
Architect complex applications that require large datasets and models, and a team of data scientists

About the reader
For infrastructure engineers and engineering-minded data scientists who are familiar with Python.

About the author
At Netflix, Ville Tuulos designed and built Metaflow, a full-stack framework for data science. Currently, he is the CEO of a startup focusing on data science infrastructure.

Table of Contents
1 Introducing data science infrastructure
2 The toolchain of data science
3 Introducing Metaflow
4 Scaling with the compute layer
5 Practicing scalability and performance
6 Going to production
7 Processing data
8 Using and operating models
9 Machine learning with the full stack

352 pages, Paperback

Published August 16, 2022

Ratings & Reviews

Community Reviews

5 stars: 20 (55%)
4 stars: 10 (27%)
3 stars: 3 (8%)
2 stars: 2 (5%)
1 star: 1 (2%)
Displaying 1 - 7 of 7 reviews
Marcello
303 reviews, 10 followers
January 24, 2023
I read this book as an ML developer, and I found it a great way to better understand the infrastructure required to run the models I was working on.
If you want to learn about what goes on 'behind the scenes', this book shows all the hard work that keeps machine learning pipelines running without making the headlines.
And it's of course even more useful if you work on infrastructure: it's perfect for keeping you up to date and showing you the tricks that will make your job easier.

Each chapter clearly explains one aspect of building and maintaining a solid ML pipeline, what you should look out for, the problems you might face, and how to solve them.
As a developer, I feel that it gives me a better understanding of what's needed to build the models in such a way that it's going to be easier to support, scale and maintain them.
1 review
October 11, 2022
This is an incredible book, detailing how you can leverage Metaflow to scale up your organisation's Machine Learning Infrastructure. I loved the storytelling that put you in the shoes of an actual Data Scientist in a company that is scaling up and leveraging Machine Learning. Although this book walks you through how Metaflow can help you, much of the information shown can be applied to other tools. I learnt a tremendous amount about how I can improve the way I structure my Machine Learning projects and how I can handle big data in my everyday workflows. I also appreciate the drawings used to illustrate all the points. I definitely recommend this book to anyone building Machine Learning Infrastructure in their organisation.
1 review
October 15, 2022
There are many things to say about this book. The concepts, even the more complex, are so simply explained. It tackles the main challenge: building the bridge between data science and data engineering.

Regardless of your background or role, this book helps you clarify your technical responsibilities, facilitates the implementation of a robust infrastructure, and helps you dramatically increase your knowledge.

In short, this book is a “must have” for any data engineer.
Zhuzi_20
26 reviews, 4 followers
November 15, 2024
This book feels like an introduction to Metaflow, but the tools can be built on your own. Metaflow is not necessary, and the thinking should not be constrained. Spark and Git can fully achieve the same results.
Ninoslav Cerkez
6 reviews
October 15, 2022
This title is a brilliant book about Effective Data Science Infrastructure. It is a must for every data scientist. The author is a master of the subject, and I enjoyed reading it.
Erin
65 reviews
January 5, 2025
More helpful to those who haven’t used Metaflow before and do not come from an ML background. I found it to be a very fast read given the high-level subject coverage.
50 reviews
June 2, 2024
Brilliant book that focuses on pragmatic infrastructure that enables data scientist productivity with engineering rigour.