Building the Data Lakehouse

The data lakehouse is the next generation of the data warehouse and data lake, designed to meet today’s complex and ever-changing analytics, machine learning, and data science requirements. Learn about the features and architecture of the data lakehouse, along with its powerful analytical infrastructure. Appreciate how the universal common connector blends structured, textual, analog, and IoT data. Maintain the lakehouse for future generations through Data Lakehouse Housekeeping and Data Future-proofing. Know how to incorporate the lakehouse into an existing data governance strategy. Incorporate data catalogs, data lineage tools, and open source software into your architecture to ensure your data scientists, analysts, and end users live happily ever after.

256 pages, Kindle Edition

Published September 20, 2021

8 people are currently reading
35 people want to read

About the author

Bill Inmon

28 books, 6 followers

Ratings & Reviews

Community Reviews

5 stars: 4 (10%)
4 stars: 4 (10%)
3 stars: 14 (37%)
2 stars: 13 (35%)
1 star: 2 (5%)
Displaying 1 - 5 of 5 reviews
1 review
January 31, 2022
Given that it's called "Building the data lakehouse", there is very little content that would help you build and architect an actual data lakehouse.

There are way too many abstract and vague sections that are not wrong per se but are of very little practical use.
Some sections are downright trivial. To prove my point, it takes three pages to show what a pie chart (with and without legend), gauge chart (and a reverse gauge chart), bar chart, etc. look like.

All in all, I have no idea what audience would appreciate this book. In my opinion, it is not useful to professionals already working in the analytics field, and irrelevant to someone who is just looking to enter the industry.

Do yourselves a favor and just read the Databricks blog post called "Evolution to the Data Lakehouse" -- authors are promoting their book there by summarizing the first few chapters of the book that are arguably of the highest quality anyway. You will learn pretty much everything there is to it in less than 10 minutes.
Sam
44 reviews
Read
July 15, 2025
a data swamp implies there is a data shrek
1 review
May 2, 2022
Great on theory, but very light on details and implementation techniques.

I expected a thought leader to go beyond the “basic” ideas. Good as a primer. An advanced data modeler and architect like myself will consider this more of a point of view that lacks the “how” details needed to support its arguments.
WiseB
226 reviews
March 2, 2022
The book should be named Introduction to Data Lakehouse ... any IT professional intending to read about building one will be disappointed to find only a high-level description instead of practical knowledge.
Giulio Ciacchini
379 reviews, 14 followers
August 19, 2024
There is a lot of theory, and too much of it covers basic concepts.
The core idea is intriguing, but the authors never explain in detail how to achieve it.
The definitions already say a lot about each architecture and would actually save you the time of reading the book:

**Data Warehouse**
A data warehouse is a centralized repository that stores structured data from various sources, typically optimized for query performance and reporting. It is designed to support business intelligence (BI) and analytics, enabling users to generate reports and insights. Data in a warehouse is usually organized in a schema-based format (e.g., star schema, snowflake schema) and is subject to strict quality control and transformation processes (ETL: Extract, Transform, Load) before being loaded into the warehouse.

**Data Lake**
A data lake is a large-scale storage repository that can hold vast amounts of raw, unstructured, semi-structured, and structured data in its native format. Unlike a data warehouse, a data lake does not require data to be processed or transformed before being stored. It is designed to accommodate a wide variety of data types, such as log files, videos, images, and sensor data, making it suitable for big data analytics, machine learning, and data exploration. Data lakes are highly scalable and can be deployed on cloud platforms or on-premises.

**Data Lakehouse**
A data lakehouse is a modern data architecture that combines the scalable storage and flexibility of a data lake with the performance and management features of a data warehouse. It allows organizations to store all types of data (structured, semi-structured, and unstructured) in a single repository while enabling high-performance analytics and query processing. The data lakehouse supports ACID transactions, schema enforcement, and data governance, providing a unified platform for diverse workloads, including BI, data science, and real-time analytics.

It is a hybrid architecture that leverages the scalability and flexibility of data lakes with the reliability and performance of data warehouses. It allows organizations to store all types of data (structured, semi-structured, and unstructured) in a single repository while also supporting high-performance queries and analytics.
The book discusses the (very well known) limitations of traditional data warehouses, such as their inability to handle unstructured data efficiently and their cost-prohibitive scalability. It also covers the challenges associated with data lakes, like data governance, data quality, and the complexity of managing diverse data formats. The data lakehouse addresses these issues by integrating the best features of both architectures.
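
To make the "schema enforcement" part of the definition a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the book and does not use any real lakehouse engine: the `SCHEMA` columns, the `Table` class, and `append_batch` are all hypothetical names chosen for illustration. The all-or-nothing append is only a toy stand-in for an atomic commit, not real ACID semantics.

```python
from dataclasses import dataclass, field

# Hypothetical table schema: column name -> required Python type.
SCHEMA = {"order_id": int, "customer": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a record does not match the table schema."""


def validate(record: dict, schema: dict = SCHEMA) -> None:
    """Reject records with missing, extra, or wrongly typed columns."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for column, expected in schema.items():
        if not isinstance(record[column], expected):
            raise SchemaError(f"{column}: expected {expected.__name__}")


@dataclass
class Table:
    """In-memory stand-in for a governed lakehouse table."""

    rows: list = field(default_factory=list)

    def append_batch(self, batch: list) -> None:
        """Validate every record first, then write; a bad record rejects the whole batch."""
        for record in batch:
            validate(record)
        self.rows.extend(batch)
```

A plain data lake would accept any file dropped into storage. The point of the sketch is that a lakehouse table checks each write against a declared schema, so a batch containing one malformed record leaves the table unchanged.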

A common criticism is that the book tends to be repetitive, rehashing similar ideas and concepts multiple times without providing new insights.
While it covers the basics of the data lakehouse, it may not delve deeply enough into technical details or advanced concepts for readers who are already familiar with data architecture.
In order to move to a Lakehouse architecture:

**Assessment and Planning**
Evaluate Current Infrastructure: Assess the existing data warehouse architecture, including storage, ETL processes, and BI tools. Identify limitations and areas for improvement.
Define Use Cases: Determine the use cases that the data lakehouse will support, such as real-time analytics, machine learning, and unstructured data analysis.
Identify Stakeholders: Engage business stakeholders, data engineers, data scientists, and IT teams to gather requirements and expectations.

**Design the Lakehouse Architecture**
Storage Layer: Plan for a scalable storage solution (e.g., object storage like AWS S3, Azure Blob Storage) that can handle diverse data types (structured, semi-structured, and unstructured).
Management Layer: Implement data governance, security, and metadata management practices that ensure data quality and compliance.
Processing Layer: Incorporate processing engines capable of supporting SQL queries, machine learning, and streaming data processing (e.g., Apache Spark, Flink).
Consumption Layer: Ensure compatibility with existing BI tools and provide user-friendly access for data analysts and data scientists.
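
The four layers above can be written down as a simple planning checklist. The sketch below is only an illustration under assumptions: `LakehousePlan`, its field names, and the bucket path are hypothetical, and nothing here connects to real storage or processing engines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakehousePlan:
    """One entry per layer named in the design step (all names are illustrative)."""

    storage: str        # e.g. an object-storage location
    management: list    # governance, security, metadata tooling
    processing: list    # engines for SQL, ML, streaming
    consumption: list   # BI tools and analyst-facing access

    def gaps(self) -> list:
        """Return the names of layers that have nothing assigned yet."""
        return [name for name, value in vars(self).items() if not value]


plan = LakehousePlan(
    storage="s3://example-bucket/lake",  # hypothetical bucket
    management=["data catalog", "access policies"],
    processing=["Apache Spark"],
    consumption=[],  # no BI tool chosen yet
)
print(plan.gaps())  # ['consumption']
```

Writing the plan down this way makes it easy to see which layer still has no decision behind it before any migration work starts.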