Learning Spark: Lightning-Fast Data Analytics

Data is bigger, arrives faster, and comes in a variety of formats--and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

Learn Python, SQL, Scala, or Java high-level Structured APIs
Understand Spark operations and SQL Engine
Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
Perform analytics on batch and streaming data using Structured Streaming
Build reliable data pipelines with open source Delta Lake and Spark
Develop machine learning pipelines with MLlib and productionize models using MLflow

300 pages, Paperback

Published August 11, 2020

102 people are currently reading
205 people want to read

Ratings & Reviews

Community Reviews

5 stars: 63 (42%)
4 stars: 68 (46%)
3 stars: 15 (10%)
2 stars: 1 (<1%)
1 star: 0 (0%)
Displaying 1 - 12 of 12 reviews
Giulio Ciacchini
382 reviews · 14 followers
February 6, 2023
The first part is majestic: straight to the point and very clear.
I always find it very useful to understand when and why a piece of software was first invented, to grasp its genesis and its core functions.
The second part is too hands-on for my purposes; nonetheless it is very valid.
The ML section is almost too much for a book about Spark, but in this case the more the better.

NOTES
Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud.
Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce.
Speed: thanks to today's commodity servers with multiple cores, the underlying Unix-based operating system takes advantage of efficient multithreading and parallel processing; Spark builds its query computations as a directed acyclic graph (DAG).
Ease of use: it provides a fundamental abstraction of a simple, logical data structure called a Resilient Distributed Dataset (RDD), upon which all other higher-level structured data abstractions, such as DataFrames and Datasets, are constructed.
Modularity: Spark operations can be applied across many types of workloads and expressed in any of the supported programming languages.
Extensibility: unlike Apache Hadoop, which couples storage and compute, Spark decouples the two. This means it can read data stored in myriad sources and process it all in memory.

Apache Spark components as a unified stack
Each of these components is separate from Spark's core fault-tolerant engine: you use their APIs to write your Spark application, and Spark converts it into a DAG that is executed by the core engine.
Spark SQL: this module works well with structured data; you can read data stored in a database or from file formats and then construct permanent or temporary tables in Spark.
Spark MLlib: it provides many popular machine learning algorithms built atop high-level DataFrame-based APIs to build models, extract or transform features, build pipelines, and persist models during deployment.
Spark Structured Streaming: necessary for big data developers to combine and react in real time to both static data and streaming data from engines like Apache Kafka.
GraphX: performs graph-parallel computations for manipulating graphs.

Actual physical data is distributed across storage as partitions residing in either HDFS or cloud storage. While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction: a DataFrame in memory.
Data engineers use Spark because it provides a simple way to parallelize computations and hides all the complexity of distribution and fault tolerance.

- Processing in parallel, large data sets, distributed across a cluster
- Performing interactive queries to explore and visualize the data set
- Implementing machine learning models using MLlib

Spark computations are expressed as operations, then converted into low-level RDD-based bytecode as tasks, which are distributed to Spark's executors for execution.
Application: a user program built using Spark APIs, with the driver program and executors on the cluster.
SparkSession: an object that provides a point of entry to interact with underlying functionality and allows programming with Spark APIs.
Job: a parallel computation, consisting of multiple tasks, that gets spawned in response to a Spark action.
Stage: each job gets divided into smaller sets of tasks called stages that depend on each other. Because not all operations can happen in a single stage of the DAG, a job may be divided into multiple stages.
Task: a single unit of work or execution that will be sent to a Spark executor.
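
A minimal PySpark sketch (app name and numbers are illustrative, not from the book) of how these pieces show up in code: the SparkSession is the entry point, and each action spawns a job that the driver splits into stages and tasks for the executors.

    from pyspark.sql import SparkSession

    # Create (or reuse) the single entry point of the application.
    spark = (SparkSession.builder
             .appName("learning-spark-notes")
             .getOrCreate())

    df = spark.range(0, 1000)   # a DataFrame with a single "id" column

    # The action below spawns a job, which Spark divides into stages and tasks
    # that are scheduled onto the executors.
    print(df.count())

    spark.stop()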

Spark evaluates all transformations lazily: their results are not computed immediately; instead they are recorded, or remembered, as a lineage.
An action triggers the evaluation of all the recorded transformations.
Narrow transformations: a single input partition contributes to a single output partition.
Wide transformations: data from other partitions is read in, combined, and written to disk (a shuffle).
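
A small sketch of lazy evaluation under these definitions (names are illustrative): the filter is a narrow transformation, the groupBy forces a shuffle (wide), and nothing is computed until the action at the end.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

    df = spark.range(0, 10_000)

    # Transformations: only recorded as lineage, nothing runs yet.
    evens = df.filter(F.col("id") % 2 == 0)            # narrow: partition-local
    by_mod = evens.groupBy(F.col("id") % 10).count()   # wide: requires a shuffle

    # Action: triggers evaluation of the whole recorded lineage.
    by_mod.show()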

Apache Spark’s Structured API
The Resilient Distributed Dataset (RDD) is the most basic and fundamental data structure in Spark. RDDs are immutable, distributed collections of objects of any type. As the name suggests, an RDD is a resilient (fault-tolerant) collection of data records that resides on multiple nodes.
Dependencies: a list that instructs Spark how an RDD is constructed from its inputs.
Partitions: provide the ability to split the work to parallelize computation across the executors.
Compute function: produces an iterator for the data that will be stored in the RDD.

Problems: the compute function is opaque; that is, Spark does not know what you are doing inside it, so it cannot inspect the computation or expression in the function.

Structuring Spark: high-level patterns of operations that provide clarity and simplicity, available as APIs; data abstraction and a domain-specific language (DSL) applicable to structured and semi-structured data. The DataFrame API is a distributed collection of data in the form of named columns and rows.

The DataFrame API, a collection of generic objects, was inspired by pandas DataFrames: distributed in-memory tables with named columns and schemas; to the human eye they look like a table.
Spark can optionally infer the schema.

The Dataset API is a collection of strongly typed JVM objects. Datasets don't make sense in Python and R because those languages are not compile-time type-safe; types are dynamically inferred or assigned during execution, not at compile time.

We can seamlessly move between DataFrames or Datasets and RDDs: after all, they are built on top of RDDs, and they get decomposed to compact RDD code during whole-stage code generation.
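
A side-by-side sketch of the same aggregation (toy data, illustrative names): with RDD lambdas the intent is opaque to Spark, while the DataFrame version expresses the group-and-average so the engine can optimize it.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
    sc = spark.sparkContext

    data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("Brooke", 25)]

    # RDD version: the lambdas are opaque, so Spark cannot inspect or optimize them.
    ages_rdd = (sc.parallelize(data)
                  .map(lambda x: (x[0], (x[1], 1)))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .map(lambda x: (x[0], x[1][0] / x[1][1])))
    print(ages_rdd.collect())

    # DataFrame version: the intent (group by name, average the age) is visible to Spark.
    df = spark.createDataFrame(data, ["name", "age"])
    df.groupBy("name").agg(F.avg("age")).show()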

Spark SQL: unifies Spark components and permits abstraction to DataFrames/Datasets in every language; reads and writes structured data with a specific schema from structured file formats; offers an interactive shell for quick data exploration; generates optimized query plans and compact code.
No matter which of the supported languages you use, a Spark query undergoes the same optimization journey, from logical and physical plan construction to final compact code generation.
At its core are the Catalyst optimizer and Project Tungsten: analysis; logical optimization; physical planning; code generation.
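
One way to watch that journey is DataFrame.explain(True), which prints the parsed, analyzed, and optimized logical plans plus the physical plan; a small illustrative sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("catalyst-sketch").getOrCreate()

    df = spark.range(0, 1000).withColumn("bucket", F.col("id") % 4)
    result = df.filter(F.col("bucket") == 1).groupBy("bucket").count()

    # Prints the logical plans produced by Catalyst and the physical plan
    # that whole-stage code generation will turn into compact code.
    result.explain(True)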

Spark SQL and DataFrames
Managed Tables: Spark manages both the metadata and the data in the file store.
Unmanaged Tables: Spark only manages the metadata, while you manage the data yourself in an external data source. For instance, a DROP TABLE statement only deletes the metadata, not the actual data.
By default, Spark creates tables under the default database.
Views can be global or session-scoped. A view doesn't actually hold the data: tables persist after the Spark application terminates, but both types of views disappear once the Spark application ends.
DataFrameReader (Parquet, JSON, CSV, images, binary files) and DataFrameWriter are the core constructs for reading and writing DataFrames.
Parquet is the default data source.
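
A sketch of the managed vs. unmanaged distinction and the writer's Parquet default (table names and paths are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tables-sketch").getOrCreate()

    # A tiny illustrative DataFrame.
    df = spark.createDataFrame([("2020-01-01", 5), ("2020-01-02", -3)], ["date", "delay"])

    # Managed table: Spark owns both the metadata and the files;
    # dropping the table removes both.
    df.write.mode("overwrite").saveAsTable("managed_flights")

    # Unmanaged (external) table: Spark tracks only the metadata and the files
    # stay where they are; dropping the table removes just the catalog entry.
    df.write.mode("overwrite").format("csv").save("/tmp/flights_csv")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS unmanaged_flights (date STRING, delay INT)
        USING csv OPTIONS (path '/tmp/flights_csv')
    """)

    # DataFrameWriter falls back to Parquet when no format is specified.
    df.write.mode("overwrite").save("/tmp/flights_parquet")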

User-defined functions (UDFs): allow you to create ad hoc functions.
Spark SQL can connect to external data sources using JDBC: use partitions to parallelize table reads and writes (the numPartitions and partitionColumn parameters).
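
A sketch of both ideas; the cubed() UDF is a made-up example, and the JDBC connection details are placeholders (running that part also needs the matching JDBC driver on the classpath):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("udf-jdbc-sketch").getOrCreate()

    # Ad hoc user-defined function.
    cubed = F.udf(lambda x: x * x * x, LongType())
    spark.range(1, 6).select("id", cubed(F.col("id")).alias("id_cubed")).show()

    # JDBC read with partitioned parallelism: Spark issues one query per partition,
    # splitting partitionColumn between lowerBound and upperBound.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://dbserver:5432/mydb")
              .option("dbtable", "public.orders")
              .option("user", "username")
              .option("password", "password")
              .option("numPartitions", "8")
              .option("partitionColumn", "order_id")
              .option("lowerBound", "1")
              .option("upperBound", "1000000")
              .load())
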
Optimizing and Tuning Spark Applications
We can set Spark configurations directly in the Spark application.
Scaling Spark: static vs. dynamic resource allocation; configuring Spark executors' memory; maximizing parallelism by having at least as many partitions as there are cores on the executors.
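
A sketch of how such configurations might be set (the values are illustrative, not recommendations; dynamic allocation also needs shuffle tracking or an external shuffle service):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("tuning-sketch")
             .config("spark.executor.memory", "4g")                # static executor memory
             .config("spark.dynamicAllocation.enabled", "true")    # dynamic resource allocation
             .config("spark.dynamicAllocation.minExecutors", "2")
             .config("spark.dynamicAllocation.maxExecutors", "20")
             .getOrCreate())

    # SQL-related configs can also be changed at runtime, e.g. the number
    # of shuffle partitions used by wide transformations.
    spark.conf.set("spark.sql.shuffle.partitions", "64")
    print(spark.conf.get("spark.sql.shuffle.partitions"))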

Spark will store as many partitions across the executors as memory allows; a DataFrame can be fractionally cached, but its partitions cannot.
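
For example (sizes are illustrative): caching is itself lazy, so the first action materializes the cached partitions and later actions reuse them.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

    df = spark.range(0, 10_000_000).withColumn("square", F.col("id") * F.col("id"))

    df.cache()      # marks the DataFrame for caching (lazy)
    df.count()      # first action materializes the cache, one whole partition at a time
    df.count()      # subsequent actions read the cached partitions
    df.unpersist()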

Spark has five distinct join strategies to maximize efficiency given the size of the tables to join: broadcast hash join; shuffle hash join; shuffle sort merge join; broadcast nested loop join; and shuffle-and-replicate nested loop join (Cartesian product join).
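
A small sketch (toy data) of nudging Spark toward a broadcast hash join by hinting that one side is small enough to ship to every executor:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("join-sketch").getOrCreate()

    big = spark.range(0, 1_000_000).withColumn("key", F.col("id") % 100)
    small = spark.createDataFrame([(i, f"label_{i}") for i in range(100)], ["key", "label"])

    # The broadcast() hint steers the planner toward a broadcast hash join.
    joined = big.join(F.broadcast(small), "key")
    joined.explain()   # the physical plan should show BroadcastHashJoin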

Structured Streaming
Micro-batch streaming replaced record-at-a-time processing: the streaming computation is modeled as a continuous series of small batch jobs over chunks of the stream; it is fault tolerant and deterministic, at the cost of somewhat higher latency.
Structured Streaming principles: a single unified programming model and interface for both batch and stream processing; a broader definition of stream processing.
Every new record received is treated like a new row being appended to an unbounded input table.
There are different modes for writing the updates to an external system: append mode; update mode; complete mode.
- Define input sources
- Transform data
- Define output sink and output mode
- Specify processing details
Stateless transformations: all projection and selection operations process each input record individually, without needing info from previous rows.
Stateful transformations: for example, df.groupBy().count().
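
A self-contained sketch of those four steps using the built-in rate source (paths, rates, and the trigger interval are illustrative); the groupBy/count at step 2 is a stateful transformation:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # 1. Define the input source.
    stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    # 2. Transform the data (stateful aggregation).
    counts = stream.withColumn("bucket", F.col("value") % 5).groupBy("bucket").count()

    # 3. Define the output sink and output mode; 4. specify processing details.
    query = (counts.writeStream
             .format("console")
             .outputMode("complete")
             .option("checkpointLocation", "/tmp/streaming-checkpoint")
             .trigger(processingTime="10 seconds")
             .start())

    query.awaitTermination(30)   # let the sketch run for ~30 seconds
    query.stop()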

Databases and Data Lakes with Spark
Spark is primarily designed for OLAP (online analytical processing) workloads, not OLTP (online transaction processing).
Data lakes, unlike databases, decouple the distributed storage system from the distributed compute system: this allows each to scale out as needed by the workload.
The data is saved as files in open formats, popularized by Hadoop HDFS.
Limitations: atomicity and isolation (partly addressed through partitions); consistency.

The lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data (https://www.databricks.com/blog/2020/...). Examples: Hudi, Iceberg, Delta Lake.
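
A hedged Delta Lake sketch, assuming the Delta connector (the delta-core / delta-spark package) is on the classpath; the path and data are made up:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("delta-sketch")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    path = "/tmp/loans_delta"   # illustrative location

    # Writes are ACID: a table of plain Parquet files plus a transaction log.
    spark.range(0, 1000).withColumnRenamed("id", "loan_id") \
         .write.format("delta").mode("overwrite").save(path)

    # Appends are atomic, and readers always see a consistent snapshot.
    spark.range(1000, 1100).withColumnRenamed("id", "loan_id") \
         .write.format("delta").mode("append").save(path)

    print(spark.read.format("delta").load(path).count())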

Machine Learning with MLlib
Transformer (accepts a DataFrame as input), Estimator (fits a model), Pipeline (organizes a series of transformers and estimators into a single model).
A hyperparameter is an attribute of the model that we define prior to training; it is not learned during the training process, as is the case for parameters.
Model deployment options: batch processing, which makes predictions on a regular schedule and writes the results out to persistent storage; streaming, which makes predictions on micro-batches of data and returns them in seconds to minutes; and real-time deployment, which prioritizes latency over throughput and generates predictions in a few milliseconds (however, Spark cannot meet that latency requirement yet).
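
A minimal MLlib pipeline sketch tying the three pieces together (toy data and column names are made up; regParam is the hyperparameter fixed before training):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(2, 850.0, 300000.0), (3, 1200.0, 450000.0), (4, 1800.0, 620000.0)],
        ["bedrooms", "sqft", "price"])

    assembler = VectorAssembler(inputCols=["bedrooms", "sqft"], outputCol="features")  # Transformer
    lr = LinearRegression(featuresCol="features", labelCol="price", regParam=0.1)      # Estimator

    pipeline = Pipeline(stages=[assembler, lr])   # organizes transformers and estimator
    model = pipeline.fit(df)                      # fitting yields a PipelineModel (a Transformer)
    model.transform(df).select("price", "prediction").show()
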
53 reviews · 2 followers
November 6, 2020
I switched to this book after “Spark: The Definitive Guide” got too heavy for me midway through.

This one is a much easier read. However, it skimmed over a few topics like aggregations & windowing functions - I would have liked some more elaboration there. Loved the optimization/tuning tips chapter.

I breezed through the later chapters as I wasn't interested in MLlib at the moment. Not a definite 5/5, but a good companion book when you are ready to wrestle with Spark.
Arun
212 reviews · 67 followers
January 13, 2025
The quality of this book is outstanding, as one comes to expect of all O'Reilly books! One point deducted for poorer explanations of concepts, especially in later chapters when the topics get harder and harder to understand, which is precisely when one needs better explanations!

Likely re-read soon!
Peter Aronson
400 reviews · 19 followers
September 30, 2023
A pretty decent introduction to the Apache Spark analytics engine. Good examples, nice code samples, and an absolute minimum of screenshots (which belong in blogs, not books, in my opinion). I'm not sure it really needed code samples in both Python and Scala, since the samples were generally very similar, but I assume some people find one or the other more readable (the Java code samples, however, were often different enough to warrant being included).

The end of the book didn't feel as necessary as the front three-quarters, with the data lake chapter coming across more like an ad, and the chapters on machine learning seemed rather specialized for a general introduction -- if you're going to include material like that, why not GeoSpark instead, which I at least would have found a lot more interesting?

One misfeature of the book is likely the fault of the publisher: almost all of the references in the printed text are shortened URLs pointing to the publisher's website. No doubt the theory is that this protects against link rot; alternatively, it replaces one possible source of link rot with two. There are standard formats for citing hyperlinks, which would have been considerably more appropriate in a print edition. (Although it is interesting to see a case where the print, not digital, edition is the poor cousin.)
Jonathan Garvey
17 reviews
May 6, 2024
A broad overview of Apache Spark. Well worth a read. It covers a lot of introductory topics in reasonable detail. I read it to help prepare for the Databricks Spark Developer Exam, and it has helped tremendously. The examples are provided in both Python and Scala, although some functionality is only available in Scala or Java. Taking some of the learnings and building my own demo apps helped drive home the concepts.
151 reviews · 5 followers
July 17, 2021
This is a good soup-to-nuts book on Spark features. I didn't read many of the GitHub examples since I was on an e-reader, but I'll do so now that I'm done. I think this is a lot better than the Coursera courses. The treatment, when detailed, was just detailed enough (for example, with optimization for shuffling). As a textbook, I'm not sure how to improve the book actually. Hence the five stars.
Harry
15 reviews
December 4, 2022
Crystal clear in explaining the architecture of Spark and its capabilities to the beginner/intermediate reader.
18 reviews · 1 follower
January 6, 2025
Great book for getting basic knowledge of what Apache Spark is. I am only missing a chapter about different deployment scenarios. Recommend it.
Evan Oman
31 reviews · 3 followers
January 18, 2025
A good introduction, but a bit surface-level in the second half. It has definitely improved my mental model of how Spark operates under the hood.
23 reviews · 5 followers
January 19, 2022
Excellent book, especially the chapter on Streaming