Learning Spark from O'Reilly is a fun-Spark-tastic book! It has helped me pull together all the loose strings of my knowledge about Spark. The official documentation, articles, blog posts, the source code, and StackOverflow gave me a fine start, but it was the book that made it all flow well. I'm now much better equipped to understand the concepts of Apache Spark - RDDs, DataFrames, DStreams, driver vs executors, clusters, do's and don'ts, monitoring, a little Machine Learning using MLlib, and much, much more. And all that in only about 250 pages! That is far too few to cover Apache Spark in depth, for sure, but the book does a great job of not being too lengthy (so that people don't get scared and run away when they spot it on a shelf in a bookstore) while still covering enough detail and pointing out areas for self-study when needed. The book is exactly what I would recommend to anyone wishing to learn Spark and apply it with confidence to the problems it was really meant to solve.
From a higher-level learning perspective, the book follows a proven teaching path - start with the theory, show a few examples, and explain the more advanced topics just a little (enough to whet my appetite enormously, though). It met my expectations fully, but, as usually happens when I’m guided by very skilled teachers, my appetite grew so much that I hated it (and got mad) when the book ended.
The book is packed with examples, explanations, motivations, recommendations, and tips that, together with the writing flow, the layout, the fonts, and such, made it a pleasure to read. I work with Apache Spark as a technology advocate at deepsense.io, and the book has made my wish to get deeper into Spark even stronger. The book targets developers (Scala, Java, Python), data scientists, and administrators (with a little on Spark's clustering, monitoring and tuning).
The book is a fine example of the sort of book that resonates well with me.
I read the book using Kindle on a Nexus 7 tablet and it read very well. The examples were well laid out, the fonts were appropriate, and in general the quality was excellent. It helped convince me to read more books in electronic format.
I’d really love to have a series of follow-up books, each devoted to a single topic - clustering (standalone, YARN, and Mesos), streaming and integration with external sources (Kafka and Flume come to mind), DataFrames and SQL, Machine Learning (Pipelines and algorithms in MLlib), and graph processing (GraphX). I think each of these could easily fill 200+ pages.
On the flip side, I would advocate for improving Chapter 9, Spark SQL, as it was too shallow (it does not even have a separate section on DataFrames) and there were very few examples. With the features of the upcoming Spark 1.5 included, the book could easily remain the Spark book for years to come. There’s no chapter about SparkR, either.
I think it’s going to be a tough exercise to beat Learning Spark content-wise! As a reader, however, I’d like to be proven wrong, and I encourage publishers to take up the challenge!