A very good introduction to Spark and its components.
Does not take anything for granted: the author explains how APIs work and what Python and SparkSQL are; for instance, he explains how JOINs work regardless of whether you use PySpark or SQL.
So if you already know these languages the book may feel redundant, but the information is still valuable.
NOTES
Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed to provide fast and general-purpose data processing capabilities. Spark extends the MapReduce model to support more types of computations, including interactive queries and stream processing, making it a powerful engine for large-scale data analytics.
Key Features of Spark:
Speed: Spark's in-memory processing capabilities allow it to be up to 100 times faster than Hadoop MapReduce for certain applications.
Ease of Use: It provides simple APIs in Java, Scala, Python, and R, which makes it accessible for a wide range of users.
Versatility: Spark supports various workloads, including batch processing, interactive querying, real-time analytics, machine learning, and graph processing.
Advanced Analytics: It has built-in modules for SQL, streaming, machine learning (MLlib), and graph processing (GraphX).
PySpark is the Python API for Apache Spark: it lets developers write Spark applications in Python, combining the language's simplicity and flexibility with Spark's powerful distributed computing capabilities.
Key Features of PySpark:
Python-Friendly: It enables Python developers to leverage Spark’s power using familiar Python syntax.
DataFrames: Provides a high-level DataFrame API, which is similar to pandas DataFrames, but distributed.
Integration with Python Ecosystem: Allows seamless integration with Python libraries such as NumPy, pandas, and scikit-learn.
Machine Learning: Through MLlib, PySpark supports a wide range of machine learning algorithms.
SparkSQL is a module for structured data processing in Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Key Features of SparkSQL:
DataFrames: A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R/pandas.
SQL Queries: SparkSQL allows users to execute SQL queries on Spark data. It supports SQL and Hive Query Language (HQL) out of the box.
Unified Data Access: It provides a unified interface for working with structured data from various sources, including Hive tables, Parquet files, JSON files, and JDBC databases.
Optimizations: Uses the Catalyst optimizer for query optimization, ensuring efficient execution of queries.
Key Components and Concepts
Spark Core
RDD (Resilient Distributed Dataset): The fundamental data structure of Spark, an immutable distributed collection of objects that can be processed in parallel.
Transformations and Actions: Transformations create new RDDs from existing ones (e.g., map, filter), while actions trigger computations and return results (e.g., collect, count).
PySpark
RDDs and DataFrames: Similar to Spark Core but accessed using Python syntax.
SparkContext: The entry point to any Spark functionality, responsible for coordinating Spark applications.
SparkSession: An entry point to interact with DataFrames and the Spark SQL API.
SparkSQL
DataFrame API: Provides a high-level abstraction for structured data.
SparkSession: Central to SparkSQL, used to create DataFrames, execute SQL queries, and manage Spark configurations.
SQL Queries: Enables running SQL queries using the sql method on a SparkSession.
Catalog: Metadata repository that stores information about the structure of the data.