Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.
In Data Analysis with Python and PySpark you will learn how to:
Manage your data as it scales across multiple machines
Scale up your data programs with full confidence
Read and write data to and from a variety of sources and formats
Deal with messy data with PySpark’s data manipulation functionality
Discover new data sets and perform exploratory data analysis
Build automated data pipelines that transform, summarize, and get insights from data
Troubleshoot common PySpark errors
Create reliable long-running jobs
Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.
About the technology The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.
About the book Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.
What's inside
Organizing your PySpark code
Managing your data, no matter the size
Scaling up your data programs with full confidence
Troubleshooting common data pipeline problems
Creating reliable long-running jobs
About the reader Written for data scientists and data engineers comfortable with Python.
About the author As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.
Table of Contents
1 Introduction
PART 1 GET ACQUAINTED: FIRST STEPS IN PYSPARK
2 Your first data program in PySpark
3 Submitting and scaling your first PySpark program
4 Analyzing tabular data with pyspark.sql
5 Data frame gymnastics: Joining and grouping
PART 2 GET PROFICIENT: TRANSLATE YOUR IDEAS INTO CODE
6 Multidimensional data frames: Using PySpark with JSON data
7 Bilingual PySpark: Blending Python and SQL code
8 Extending PySpark with Python: RDD and UDFs
9 Big data is just a lot of small data: Using pandas UDFs
10 Your data under a different lens: Window functions
I was looking for a book dedicated specifically to PySpark (instead of Spark in general):
* Chapters 1-9 are quite informative and focused on developers' interests, so you don't get too much info on the overall Spark architecture; you're learning how to solve (simple) problems with Spark.
* Unfortunately, the further you go, the less detailed the chapters get (quite counter-intuitively). I kept having the feeling that the author was just about to get into the meaty, interesting stuff, and then, bam, the chapter is over.
* Chapters 10 and 11 were supposed (IMHO) to be the opus magnum of the book :) Unfortunately, they are too shallow and too rushed, and in the end, I don't think any of the questions I initially had got answered :(
* TBH I skimmed through 12-14 quickly: I wasn't too interested in that way of mixing Spark processing with ML.
My general opinion of this book: it's enough to get you started, but only at a tutorial level. Don't expect the level of readiness required to do any actual, real-life work.
A very good introduction to Spark and its components. It does not take anything for granted: the author explains how the APIs work and what Python and SparkSQL are; for instance, he explains how JOINs work regardless of whether you use PySpark or SQL. So if you already know these languages it might contain redundant, but still valuable, info.
NOTES Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed to provide fast and general-purpose data processing capabilities. Spark extends the MapReduce model to support more types of computations, including interactive queries and stream processing, making it a powerful engine for large-scale data analytics.
Key Features of Spark:
* Speed: Spark's in-memory processing capabilities allow it to be up to 100 times faster than Hadoop MapReduce for certain applications.
* Ease of Use: It provides simple APIs in Java, Scala, Python, and R, which makes it accessible to a wide range of users.
* Versatility: Spark supports various workloads, including batch processing, interactive querying, real-time analytics, machine learning, and graph processing.
* Advanced Analytics: It has built-in modules for SQL, streaming, machine learning (MLlib), and graph processing (GraphX).
PySpark is the Python API for Apache Spark, which allows Python developers to write Spark applications using Python. PySpark integrates the simplicity and flexibility of Python with the powerful distributed computing capabilities of Spark.
Key Features of PySpark:
* Python-Friendly: It enables Python developers to leverage Spark's power using familiar Python syntax.
* DataFrames: Provides a high-level DataFrame API, which is similar to pandas DataFrames, but distributed.
* Integration with the Python Ecosystem: Allows seamless integration with Python libraries such as NumPy, pandas, and scikit-learn.
* Machine Learning: Through MLlib, PySpark supports a wide range of machine learning algorithms.
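As a rough illustration of the DataFrame API mentioned above, here is a minimal, self-contained sketch; the column names and sample values are invented for this example.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# A tiny DataFrame with made-up data; in practice this would come from a file or table.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    schema=["name", "age"],
)

# Transformations look much like pandas, but execute distributed across the cluster.
(df.filter(F.col("age") > 30)
   .withColumn("age_next_year", F.col("age") + 1)
   .show())
```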
SparkSQL is a module for structured data processing in Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Key Features of SparkSQL:
* DataFrames: A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R/pandas.
* SQL Queries: SparkSQL allows users to execute SQL queries on Spark data. It supports SQL and Hive Query Language (HQL) out of the box.
* Unified Data Access: It provides a unified interface for working with structured data from various sources, including Hive tables, Parquet files, JSON files, and JDBC databases.
* Optimizations: Uses the Catalyst optimizer for query optimization, ensuring efficient execution of queries.
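A small sketch of the SQL side described above: registering a DataFrame as a temporary view and querying it with spark.sql. The Parquet file and column names here are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Parquet source; any supported format (JSON, JDBC, Hive, ...) would work.
sales = spark.read.parquet("sales.parquet")
sales.createOrReplaceTempView("sales")

# The same query could be expressed with the DataFrame API; Catalyst optimizes both.
top_products = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")
top_products.show()
```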
Key Components and Concepts
Spark Core
* RDD (Resilient Distributed Dataset): The fundamental data structure of Spark, an immutable distributed collection of objects that can be processed in parallel.
* Transformations and Actions: Transformations create new RDDs from existing ones (e.g., map, filter), while actions trigger computations and return results (e.g., collect, count).
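A brief sketch of the transformation/action distinction; the numbers are arbitrary and nothing here is specific to the book.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

# Transformations are lazy: no computation happens yet.
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger the computation and return results to the driver.
print(squares_of_evens.collect())   # [0, 4, 16, 36, 64]
print(squares_of_evens.count())     # 5
```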
PySpark
* RDDs and DataFrames: Similar to Spark Core but accessed using Python syntax.
* SparkContext: The entry point to any Spark functionality, responsible for coordinating Spark applications.
* SparkSession: An entry point for interacting with DataFrames and the Spark SQL API.
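A short sketch of those two entry points: building (or reusing) a SparkSession and reaching the SparkContext it wraps. The local master setting is an assumption for running on a single machine.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("entry-points")
         .master("local[*]")   # assumption: local mode, using all available cores
         .getOrCreate())

sc = spark.sparkContext        # the underlying SparkContext, used for RDD work
print(spark.version, sc.defaultParallelism)
```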
SparkSQL
* DataFrame API: Provides a high-level abstraction for structured data.
* SparkSession: Central to SparkSQL, used to create DataFrames, execute SQL queries, and manage Spark configurations.
* SQL Queries: Enables running SQL queries using the sql method on a SparkSession.
* Catalog: Metadata repository that stores information about the structure of the data.
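A minimal sketch of inspecting the Catalog mentioned above; the "people" temp view and its columns are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register a tiny hypothetical view so the catalog has something to show.
spark.createDataFrame([("a", 1)], ["letter", "n"]).createOrReplaceTempView("people")

# The Catalog exposes metadata about databases, tables, and views known to the session.
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())
```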
This book is really excellent. A great intro to Spark for those already familiar with Python and SQL. Some of the later exercises provide some good challenges too.
The only possible criticism is that this book occupies a reasonably niche area of Spark use: data analysis and ML (in section 3). I didn't read the third section, as I won't be using much ML professionally in the future. Plenty of folks use Spark for ML, especially in Databricks, but the most popular use of Spark is still data engineering and ETL pipelines. This is fine, the book is about something else, but some things one might expect for data engineering use cases, such as orchestration and data quality control with pydeequ, are not present. Regardless of the use case, I would expect some more discussion of shuffling, its effects, and strategies to avoid shuffling when joining large dataframes (a sketch of one such strategy follows below).
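One common strategy for the reviewer's last point, not taken from the book, is a broadcast join: when one side of the join is small enough to fit in executor memory, broadcasting it avoids shuffling the large side. The file paths and column names below are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large = spark.read.parquet("events.parquet")      # hypothetical large fact table
small = spark.read.parquet("countries.parquet")   # hypothetical small dimension table

# Broadcasting the small side ships it to every executor, so the large side
# is joined in place instead of being shuffled across the cluster.
joined = large.join(broadcast(small), on="country_code", how="left")
joined.explain()   # the physical plan should show a BroadcastHashJoin
```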
Since I gave the book Data Pipelines with Apache Airflow 4/5 stars, I couldn't help but give this 3/5, as I didn't enjoy it as much.
Let's be frank... Spark really seems like a big system (libraries, setup, configuration...) with a lot of nuances. All in all, this is a good book.
At times I felt like the author goes too quickly through some content or doesn't explain it for dummies. I'm also not sure whether it would have been better to introduce the reader to Spark internals before starting with the first application, although I can understand the benefits of this approach.
I also found, after consulting the exercise solutions, that I had misunderstood some of the exercises.
This is a good book to get started with writing PySpark code and gaining some understanding of what is going on. Some of the explanations were a bit confusing for me and I needed to consult other sources.
This book won't make you a PySpark intermediate, but it's a good starting point for those interested in programming with PySpark.