The amount of data being generated today is staggering and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming. Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing. If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.
This text is a brief theoretical introduction to Spark and PySpark followed by ML use cases. That may seem like a good balance, but there are too many unrelated concepts: the book should be about Spark, not machine learning, and there are better books for ML. It ends up being neither fish nor fowl.
Some examples require domain-specific knowledge, which defeats the purpose of the book. The use cases range from geospatial to finance to genomics. While these are interesting topics, the learning curve is too steep, and you spend half of each chapter just understanding the data and the end goal.
I understand that these are some of PySpark's applications, but they miss the general point: I'd rather have one main example dissected throughout the book, where you see the different PySpark features applied to it.
NOTES
Many of our favorite small-data tools hit a wall when working with big data. Libraries like pandas are not equipped to deal with data that cannot fit in RAM. What, then, should an equivalent process look like, one that can leverage clusters of computers to achieve the same outcomes on large datasets? The challenges of distributed computing require us to rethink many of the basic assumptions we rely on in single-node systems. For example, because data must be partitioned across many nodes of a cluster, algorithms with wide data dependencies suffer from the fact that network transfer rates are orders of magnitude slower than memory accesses. And as the number of machines working on a problem increases, the probability of a failure increases. These facts call for a programming paradigm that is sensitive to the characteristics of the underlying system: one that discourages poor choices and makes it easy to write code that will execute in a highly parallel manner.
Apache Spark, apart from the computation engine Spark Core, is composed of: Spark SQL with DataFrames and Datasets; MLlib; Structured Streaming; and GraphX.
One illuminating way to understand Spark is in terms of its advances over its predecessor, Apache Hadoop's MapReduce. MapReduce revolutionized computation over huge datasets by offering a simple and resilient model for writing programs that could execute in parallel across hundreds to thousands of machines. It broke work up into small tasks and could gracefully accommodate task failures without compromising the job to which they belonged. Spark maintains MapReduce's linear scalability and fault tolerance, but extends it in three important ways:
- Rather than relying on a rigid map-then-reduce format, its engine can execute a more general directed acyclic graph (DAG) of operators. This means that in situations where MapReduce must write out intermediate results to the distributed filesystem, Spark can pass them directly to the next step in the pipeline.
- It complements its computational capability with a rich set of transformations that enable users to express computation more naturally. Out-of-the-box functions are provided for tasks such as numerical computation, date-time processing, and string manipulation.
- Spark extends its predecessor with in-memory processing, so future steps that want to work with the same dataset need not recompute it or reload it from disk. This makes Spark well suited for highly iterative algorithms as well as ad hoc queries.
PySpark is Spark's Python API. In simpler words, PySpark is a Python-based wrapper over the core Spark framework, which is written primarily in Scala. PySpark provides an intuitive programming environment for data science practitioners and offers the flexibility of Python with the distributed processing capabilities of Spark. PySpark allows us to work across programming models. For example, a common pattern is to perform large-scale extract, transform, and load (ETL) workloads with Spark and then collect the results to a local machine followed by manipulation using pandas.
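As a rough sketch of that pattern (the events.parquet path and its columns are assumptions, not from the book), the heavy aggregation runs on the cluster and only the small result is collected into pandas:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("EtlToPandasSketch").getOrCreate()

# Cluster-side ETL on a hypothetical large dataset
events = spark.read.parquet("path/to/events.parquet")
daily_counts = events.groupBy("event_date").agg(F.count("*").alias("n_events"))

# Only the small aggregated result is pulled down to the driver as a pandas DataFrame
daily_counts_pd = daily_counts.toPandas()

# From here on, ordinary local pandas tooling applies
print(daily_counts_pd.head())

spark.stop()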
- Spark: a distributed processing framework written primarily in the Scala programming language. The framework offers different language APIs on top of the core Scala-based framework, and, unlike Hadoop, it decouples storage and compute.
- PySpark: Spark's Python API. Think of it as a Python-based wrapper on top of core Spark.
- SparkSQL: a Spark module for structured data processing. It is part of the core Spark framework and accessible through all of its language APIs, including PySpark.
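As a rough sketch of how SparkSQL surfaces in PySpark (the people.csv path and its name/age columns are assumptions), the same query can be written in SQL against a temporary view or with the DataFrame API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

# Assumed CSV with name and age columns
people = spark.read.csv("path/to/people.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL
people.createOrReplaceTempView("people")

# Same query, two ways: SparkSQL and the DataFrame API
spark.sql("SELECT name, age FROM people WHERE age > 25").show()
people.select("name", "age").where(people["age"] > 25).show()

spark.stop()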
PySpark follows a distributed computing model, where data is partitioned across a cluster of machines, and computations are performed in parallel. It leverages in-memory processing and optimizes data locality to achieve high performance. To start using PySpark, you need to have Apache Spark installed on your system and set up the necessary configurations. PySpark can be run in local mode for development and testing purposes or on distributed clusters for large-scale data processing. Here's a basic example to demonstrate the usage of PySpark:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("PySparkExample").getOrCreate()

# Read data from a CSV file into a DataFrame
data = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Perform transformations on the DataFrame
filtered_data = data.filter(data["age"] > 25)
result = filtered_data.groupBy("gender").count()

# Show the result (this action triggers the computation)
result.show()

# Stop the SparkSession
spark.stop()
In this example, we create a SparkSession object, which is the entry point for working with Spark functionality in PySpark. We read data from a CSV file into a DataFrame, apply transformations (filtering, then grouping and counting), and display the result with show(), the action that actually triggers the computation. Finally, we stop the SparkSession.
In PySpark, the DataFrame is an abstraction for datasets that have a regular structure, in which each record is a row made up of a set of columns and each column has a well-defined data type. You can think of a DataFrame as the Spark analogue of a table in a relational database. Although the naming convention might make you think of a pandas.DataFrame object, Spark's DataFrames are a different beast: they represent distributed datasets on a cluster, not local data where every row is stored on the same machine. That said, there are similarities in how you use them and in the role they play inside the Spark ecosystem.
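As a minimal sketch of the "well-defined data type" point (the column names and values here are invented for illustration), a DataFrame can be built from local rows with an explicit schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("SchemaSketch").getOrCreate()

# Every column carries a declared type
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], schema=schema)

df.printSchema()  # column names and types
df.show()

spark.stop()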
The act of creating a DataFrame does not cause any distributed computation to take place on the cluster. Rather, DataFrames define logical datasets that are intermediate steps in a computation. Spark operations on distributed data fall into two types: transformations and actions. All transformations are evaluated lazily: their results are not computed immediately, but they are recorded as a lineage, which allows Spark to optimize the query plan. Distributed computation occurs only when an action is invoked on a DataFrame.
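To make the transformation/action split concrete, here is a minimal sketch; it assumes the data DataFrame (with age and gender columns) from the CSV example above is still in scope:

# Transformations: nothing is computed yet, Spark only records the lineage
adults = data.filter(data["age"] > 25)
by_gender = adults.groupBy("gender").count()

# Inspecting the optimized query plan still triggers no cluster work
by_gender.explain()

# Actions: these kick off the actual distributed computation
by_gender.show()
n_groups = by_gender.count()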
Spark applications run as independent sets of processes on a cluster or locally. At a high level, a Spark application comprises a driver process, a cluster manager, and a set of executor processes. The driver program is the central component and is responsible for distributing tasks across the executor processes. There is always exactly one driver process; when we talk about scaling, we mean increasing the number of executors. The cluster manager simply manages resources.
Spark is a distributed, data-parallel compute engine. In the data-parallel model, more data partitions means more parallelism. Partitioning is what enables efficient parallelism: breaking data into chunks, or partitions, lets Spark executors process only the data that is close to them, minimizing network traffic. That is, each executor core is assigned its own data partition to work on. Remember this whenever a choice related to partitioning comes up. Spark programming starts with a dataset, usually residing in some form of distributed, persistent storage such as the Hadoop Distributed File System (HDFS) or AWS S3, in a format like Parquet.
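As a rough sketch of how partitioning shows up in practice (assuming an existing SparkSession named spark and a hypothetical S3 path), you can inspect and adjust the number of partitions directly:

# Read a dataset from distributed storage (the path is hypothetical)
df = spark.read.parquet("s3a://my-bucket/events/")

# How many partitions Spark chose for this dataset
print(df.rdd.getNumPartitions())

# Repartition to spread work across more executor cores
df = df.repartition(200)
print(df.rdd.getNumPartitions())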