Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, and clearly explains how to apply the best practices from software engineering to data science.
Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to:
Understand data structures and object-oriented programming
Clearly and skillfully document your code
Package and share your code
Integrate data science code with a larger code base
Learn how to write APIs
Create secure code
Apply best practices to common tasks such as testing, error handling, and logging
Work more effectively with software engineers
Write more efficient, maintainable, and robust code in Python
Put your data science projects into production
And more
Read this as part of an Innovation Sprint at work and I feel like I was the exact target audience. I know I tend to write ad-hoc code that is not easily scalable, especially by other people, so it was great to learn some software engineering best practices, especially around design/refactoring, formatting, and documentation. Lots of good packages mentioned throughout. And the tone was conversational enough that it was really easy to read. Overall, it was validating to me that I’m not alone in my coding weaknesses as a data scientist, but also gave me good paths forward to get better. Yay.
I just loved this book for its simplicity and clarity. I found it extremely useful and discovered a lot of techniques I didn’t know. I would recommend it to everyone wanting to pursue a career in data science.
The book positions itself as being suitable for both junior and senior data scientists, but in practice, it only serves the former. In the book's opening, Catherine claims it can address more advanced audiences. It does not.
By attempting to cover nearly every aspect of software engineering within a data science context, the author sacrifices depth. Many topics are treated superficially, and in some cases, poor practices are suggested while core fundamentals are overlooked entirely.
The first nine chapters offer little value to anyone with even minimal CS education or more than a month of industry experience. In contrast, Chapters 10 through 12 are worth reading, as they provide clear introductions to version control, APIs, automation, and deployment. In my experience, these are areas where junior data scientists and researchers lack knowledge.
Bottom line: The book may work as a primer for beginners, but it fails to provide meaningful insights for intermediate or senior-level data scientists.
This book has a great list of topics for data folks who are asked to build software that has to run reliably more than once: it's time to bring in some software engineering. I think the overall scope is perfect, from technical details like modular code and using appropriate data structures for data pipelines, to fuzzier, high-level people topics like automated deploys, operations, and working in teams.
Although most of the chapters covered material I already knew, I quite liked the overall refresher, and as a seasoned data scientist (5 years) I found out that I was missing out on some newer tools like mypy, ruff, and poetry. So learning about those was quite good.
I recommend it as a light read, which is how I read it during a break between projects.