A comprehensive overview of data science covering the analytics, programming, and business skills necessary to master the discipline
Finding a good data scientist has been likened to hunting for a unicorn: the required combination of technical skills is simply very hard to find in one person. In addition, good data science is not just rote application of trainable skill sets; it requires the ability to think flexibly about all these areas and understand the connections between them. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
Unlike many analytics books, computer science and software engineering are given extensive coverage since they play such a central role in the daily work of a data scientist. The author also describes classic machine learning algorithms, from their mathematical foundations to real-world applications. Visualization tools are reviewed, and their central importance in data science is highlighted. Classical statistics is addressed to help readers think critically about the interpretation of data and its common pitfalls. The clear communication of technical results, which is perhaps the most undertrained of data science skills, is given its own chapter, and all topics are explained in the context of solving real-world data problems. The book also features:
• Extensive sample code and tutorials using Python™ along with its technical libraries
• Core technologies of “Big Data,” including their strengths and limitations and how they can be used to solve real-world problems
• Coverage of the practical realities of the tools, keeping theory to a minimum; however, when theory is presented, it is done in an intuitive way to encourage critical thinking and creativity
• A wide variety of case studies from industry
• Practical advice on the realities of being a data scientist today, including the overall workflow, where time is spent, the types of datasets worked on, and the skill sets needed
The Data Science Handbook is an ideal resource for data analysis methodology and big data software tools. The book is appropriate for people who want to practice data science, but lack the required skill sets. This includes software professionals who need to better understand analytics and statisticians who need to understand software. Modern data science is a unified discipline, and it is presented as such. This book is also an appropriate reference for researchers and entry-level graduate students who need to learn real-world analytics and expand their skill set.
FIELD CADY is the data scientist at the Allen Institute for Artificial Intelligence, where he develops tools that use machine learning to mine scientific literature. He has also worked at Google and several Big Data startups. He has a BS in physics and math from Stanford University, and an MS in computer science from Carnegie Mellon.
Was looking for an intro text for my academic mates who aren't techie mates: this turned out to be it.
Covers all the important boring stuff (file formats, coding practices) and a bit of the flashy stuff (CNNs, Keras) and was written specifically to drag maths PhDs into basic competence.
TL;DR: good to leaf through to appreciate just how diverse the field of data science is. But it is neither reference material nor an introductory textbook. The blame, however, lies more with Wiley’s poor editorial choices and the failure to clean up major formatting and layout issues than with the author, who writes well and covers a lot of ground.
Long Review
A pretty good book, overall, with plenty of example code to build real-life DS skills. The discussion is engaging and the organisation is very modular (with three major sections, in decreasing order of importance for real-life DS). However, the book suffers from three major defects (covered below). Although, of course, a lot of these issues have to do with it not being a Jupyter notebook.
First, the code is pretty darn annoying. The absence of colour coding, along with dependency mismatches and deprecations, makes the code from 2017 unreadable, if not quite unusable. The code snippets are also not at a sufficient level of abstraction to allow a user to generalise the learnings. Besides, as it’s not always properly indented, nested loops can be tough to parse.
Second, this handbook is extremely sparing with figures or diagrams. For instance, this book manages to talk about JOINs in SQL without ever using a single Venn diagram or flowchart. Sure, there are (often, very unfinished) images of the output of code, but no conceptual illustrations. Cady is pretty good at explaining concepts in prose and employs good analogies (e.g., his description of typical Git commands in Chp-15 is an excellent introduction), but this approach simply doesn’t lend itself to many areas.
Third, this book covers a lot of ground, and the modular organisation doesn't always help, as many important concepts have in fact been pushed to the end. As a result, the first half often assumes prior knowledge of quite a lot of Python and related libraries; the text comes laced with terms like “scripting languages” and “classes”, and Cady often name-drops a concept just to add, “it’s too difficult so I won’t explain it here, hehe”. (Though, to be fair, classes and OOP do get a discussion, even if only in the final chapters of the book.)
Overall, it makes for good reading, if only to see what all lies out there in the world of DS. And there are nice nuggets of wisdom from a field-practitioner. However, this is not necessarily reference material (you aren’t likely to re-read it), nor is it a textbook (there are no exercises to consolidate learning). And it comes with several poor editorial choices and sometimes unworkable code from 2017. The author also yields to the so-called curse of knowledge, though this may be due to the major concepts getting pushed to the final third of the book.
A pretty good overview and a good place to start. There are a lot of typos, which could lead you astray if you are not already familiar with the material. If there is a second edition fixing this, it would be worth seeking out.
The only real complaint I have is from near the end. In one of the sections during the discussion of pointers, and occasionally elsewhere in the text, there are references to the stack and heap, but they are never defined or discussed. This would be very confusing to someone not already familiar with the topic.
Too much generalization. It is good for those who don't know anything about data science and data analytics. The author just added a few pages in each domain (like an introduction section) to cover it up as a handbook.
I have shown this book to a few of my colleagues and they all said one thing: "It is just an introduction, it has not even covered the full basics."
It's a simple intro to Data Science, nothing too fancy. If you are a beginner interested in understanding what data science is, it's great. If you're experienced, I don't think you will find much in this book.
Also, it's waaaaay better than the other Data Science Handbook from Carl Shan, seriously, don't even try that one if you want to be a better data scientist.