Apache Iceberg: The Definitive Guide: Data Lakehouse Functionality, Performance, and Scalability on the Data Lake

Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool—a cost-prohibitive process for making warehouse features available to all of your data. The lack of flexibility with these patterns requires you to lock into a set of priority tools and formats, which creates data silos and data drift. This practical book shows you a better way.

Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this high-performance open source format. Authors Tomer Shiran, Jason Hughes, and Alex Merced from Dremio show you how to get started with Iceberg.

With this book, you'll learn:

- The architecture of Apache Iceberg tables
- What happens under the hood when you perform operations on Iceberg tables
- How to further optimize Iceberg tables for maximum performance
- How to use Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio

Discover why Apache Iceberg is a foundational technology for implementing an open data lakehouse.

571 pages, Kindle Edition

Published May 2, 2024

42 people are currently reading
45 people want to read

About the author

Tomer Shiran

4 books

Ratings & Reviews

Community Reviews

5 stars: 20 (45%)
4 stars: 17 (38%)
3 stars: 6 (13%)
2 stars: 1 (2%)
1 star: 0 (0%)
Displaying 1 - 9 of 9 reviews
Emre Sevinç
177 reviews, 434 followers
June 14, 2024
A few days after I started reading this excellent introduction to Apache Iceberg, Databricks announced their acquisition of Tabular, the company founded by the original creators of Apache Iceberg. "Why is this important and relevant?" one might ask, and I think a short answer can be: it shows the competition is getting stronger and data lakehouse will be more and more integrated into whatever kind of data management you do at large scale on the cloud.

I really enjoyed this book because it puts concepts and techniques into historical, as well as technological context. If you spent time with traditional database systems to build data warehouses, or if you jumped on Hadoop, object storage and Apache Spark or proprietary Databricks to store a lot of large data files and query them, wondering if things could be more all-in-one-good-old-database-like, without losing all the niceties of object storage, HDFS, and scaling computing, then data lakehouse technologies powered by software systems such as Apache Iceberg (or Delta Lake, see this DuckDB post for a good introduction) provide the next step in large scale data management and processing.

This book, after introducing the main concepts and components of Apache Iceberg, goes on to describe in detail how the various pieces of this technology relate to each other, letting data engineers and data architects enjoy features such as "time travel" to query historical data, creating branches of data sets and then merging them into the "main" branch (treating your "database" like a Git repository), and using different query engines such as Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables at the same time. These and many other features are the things you wouldn't have if you simply dumped your CSV, Parquet, etc. files on scalable S3 or HDFS, and tried to work with them directly using a query engine such as Apache Spark.

Another good aspect of the book is the presentation of various use cases, demonstrating how Apache Iceberg can be used in those scenarios, making it easy for the reader to evaluate Apache Iceberg for different data architecture needs. Moreover, there's also ample source code provided by the authors in GitHub and GitLab repositories, making it easy to try various examples.

Conclusion: If you want a good introduction to Apache Iceberg or data lakehouse architecture, this book is the perfect source to familiarize yourself with them.
Giorgi
26 reviews
October 16, 2024
An essential read for understanding Iceberg. The first 5 chapters are especially invaluable.
Peter Aronson
399 reviews, 19 followers
July 18, 2024
Three-and-three-quarter stars, rounded up. There's a lot to like about this book, but there are also some places where it is somewhat problematic.

Part I, Fundamentals of Apache Iceberg, is well written and informative, although sometimes sounding a bit like marketing material ("Reading data from Apache Iceberg tables follows a well-defined sequence of actions, seamlessly allowing queries to be transformed into actionable insights."). It was my favorite part of the book, delving down into the file level and building up to Iceberg's architecture.

Part II, Hands-on with Apache Iceberg, on the other hand, had a bunch of very similar chapters with the same material covered with different query engines. If you were looking to understand how Iceberg worked and was used, this redundancy wasn't all that useful. Admittedly, if you just wanted to look up how to use one particular query engine with Iceberg, this approach is probably what you want.

Part III, Apache Iceberg in Practice, was OK, discussing Iceberg metadata, streaming, security and governance, migration and some meh use-cases, but was missing something important: a discussion of when you should and should not use Iceberg. This is large scale software for huge amounts of data. You really don't want to adopt it too early -- most companies will probably never have enough data to need Iceberg (but those that do, will really need it or something like it).

The authors all work for Dremio, but they do seem to be making an effort to be even-handed and cover all of the Apache Iceberg ecosystem. But they do seem to be a bit too used to selling, and I found some of it a bit off-putting (like describing the book in the introduction as "meticulously crafted"). There's a reason why most technical writing strives for a more neutral tone; many technical readers are put off by that sort of thing.

But despite these nits, it was definitely worth reading.
Thang
101 reviews, 13 followers
May 10, 2025
Apache Iceberg and open table formats have been trending in data engineering in recent years, so I picked up this book to understand more about the lower level. It met my expectations.
- Explains the Apache Iceberg architecture (data files, metadata files, manifest files), with examples of how they are laid out in an S3 bucket
- Some examples of integration with Spark/Flink
- Some real-world use cases such as CDC

Technology is changing quickly, so I hope the authors will update the book in a second edition with more cloud integrations (Snowflake, BigQuery, etc.) and coverage of how to use Iceberg in machine learning.
8 reviews
December 26, 2024
Essential Reading for Understanding Iceberg

Iceberg table formats are quickly becoming the standard in the data world. This book is excellent in helping you quickly ramp up not only on the theory but also on how to practically implement and use Iceberg. Could not recommend more strongly.
Bjarne
82 reviews
January 30, 2025
I only read the conceptual part about the technology and how the interaction with it works under the hood, so roughly the first half. It was very well written and gives a very good overview of the technology, why it was developed, and how it handles interaction requests. Very good.
Cristian Orellana
12 reviews, 1 follower
February 2, 2025
Good introductory book to the inner workings of Iceberg.
It has good tips on optimization and also provides good guidance on the different options and settings you can modify to get the most out of this table format for the different access patterns you might have for your datasets.