Name: Building the Data Lakehouse
Rating: 2.86 (5 reviews)
ISBN: 9781634629676

379 reviews, 14 followers

August 19, 2024

There is a lot of theory, and too much of it covers basic concepts.
The core idea is intriguing, but the authors never explain in detail how to achieve it.
The definitions alone already say a lot about each architecture and could save you the time of reading the book:

**Data Warehouse**
A data warehouse is a centralized repository that stores structured data from various sources, typically optimized for query performance and reporting. It is designed to support business intelligence (BI) and analytics, enabling users to generate reports and insights. Data in a warehouse is usually organized in a schema-based format (e.g., star schema, snowflake schema) and is subject to strict quality control and transformation processes (ETL: Extract, Transform, Load) before being loaded into the warehouse.
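To make the warehouse definition concrete, here is a minimal sketch of an ETL run into a tiny star schema, using Python's built-in `sqlite3`. The sample rows, the table names (`dim_customer`, `fact_order`), and the cleaning rules are my own assumptions for illustration, not something taken from the book.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": "1", "customer": " Alice ", "amount": "19.90"},
    {"order_id": "2", "customer": "bob", "amount": "5.00"},
    {"order_id": "3", "customer": "ALICE", "amount": "not-a-number"},  # fails quality control
]


def transform(row):
    """Clean one raw row; return None if it fails quality control."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return int(row["order_id"]), row["customer"].strip().title(), amount


def load(conn, rows):
    """Load cleaned rows into a tiny star schema: one dimension, one fact table."""
    conn.execute(
        "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES dim_customer(customer_id), amount REAL)"
    )
    for order_id, name, amount in rows:
        # Insert the customer only if it is not already in the dimension table.
        conn.execute("INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (name,))
        (customer_id,) = conn.execute(
            "SELECT customer_id FROM dim_customer WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_order VALUES (?, ?, ?)", (order_id, customer_id, amount)
        )
    conn.commit()


def run_etl():
    """Extract the raw rows, transform them, load them, then run a report query."""
    conn = sqlite3.connect(":memory:")
    cleaned = [r for r in map(transform, RAW_ORDERS) if r is not None]
    load(conn, cleaned)
    # A typical reporting query: revenue per customer.
    return conn.execute(
        "SELECT c.name, SUM(f.amount) FROM fact_order f "
        "JOIN dim_customer c USING (customer_id) GROUP BY c.name ORDER BY c.name"
    ).fetchall()
```

The point of the sketch is the order of operations: rows are validated and reshaped before they are allowed into the schema, which is exactly what a data lake does not require.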

**Data Lake**
A data lake is a large-scale storage repository that can hold vast amounts of raw, unstructured, semi-structured, and structured data in its native format. Unlike a data warehouse, a data lake does not require data to be processed or transformed before being stored. It is designed to accommodate a wide variety of data types, such as log files, videos, images, and sensor data, making it suitable for big data analytics, machine learning, and data exploration. Data lakes are highly scalable and can be deployed on cloud platforms or on-premises.
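By contrast, a data lake can be sketched as "write the bytes as they come, interpret them later". The example below is only an illustration under my own assumptions: a local temporary directory stands in for object storage, and the `source/date=.../file` folder layout and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root, source, day, filename, payload: bytes):
    """Store bytes exactly as received, under a source/date folder layout."""
    target = Path(lake_root) / source / f"date={day}" / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_json_events(lake_root, source):
    """Schema-on-read: structure is interpreted only when the data is consumed."""
    events = []
    for path in sorted(Path(lake_root).glob(f"{source}/date=*/*.jsonl")):
        for line in path.read_text().splitlines():
            if line.strip():
                events.append(json.loads(line))
    return events


def demo():
    with tempfile.TemporaryDirectory() as lake:
        # Semi-structured data lands untouched, even when records differ in shape.
        land_raw(lake, "sensors", "2024-08-19", "a.jsonl",
                 b'{"id": 1, "temp": 21.5}\n{"id": 2}\n')
        # Opaque binary data is accepted just the same.
        land_raw(lake, "images", "2024-08-19", "cam.png", b"\x89PNG...")
        return read_json_events(lake, "sensors")
```

Note that nothing stops the second sensor record from missing its `temp` field; that flexibility is also where the governance and quality problems of data lakes come from.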

**Data Lakehouse**
A data lakehouse is a modern data architecture that combines the scalable storage and flexibility of a data lake with the performance and management features of a data warehouse. It allows organizations to store all types of data (structured, semi-structured, and unstructured) in a single repository while enabling high-performance analytics and query processing. The data lakehouse supports ACID transactions, schema enforcement, and data governance, providing a unified platform for diverse workloads, including BI, data science, and real-time analytics.

In short, it is a hybrid architecture that combines the scalability and flexibility of data lakes with the reliability and performance of data warehouses.
The book discusses the (very well-known) limitations of traditional data warehouses, such as their inability to handle unstructured data efficiently and the prohibitive cost of scaling them. It also covers the challenges associated with data lakes, such as data governance, data quality, and the complexity of managing diverse data formats. The data lakehouse addresses these issues by integrating the best features of both architectures.
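Since the book never shows how this integration might work, here is a toy sketch of the general idea: plain files in lake-style storage, plus a commit log and a schema check layered on top. Everything here (`ToyTable`, `SCHEMA`, the `_log` folder) is hypothetical and assumes a single writer; it only illustrates schema enforcement and all-or-nothing visibility of a batch, and real lakehouse table formats do far more.

```python
import json
import os
import tempfile
from pathlib import Path

SCHEMA = {"id": int, "temp": float}  # hypothetical table schema


class ToyTable:
    """Data files plus an append-only commit log; readers trust only the log."""

    def __init__(self, root):
        self.root = Path(root)
        (self.root / "_log").mkdir(parents=True, exist_ok=True)

    def _version(self):
        # Number of committed log entries so far (temporary files are ignored).
        return len(list((self.root / "_log").glob("*.json")))

    def append(self, rows):
        # Schema enforcement: reject the whole batch if any row does not match.
        for row in rows:
            if set(row) != set(SCHEMA) or any(
                not isinstance(row[k], t) for k, t in SCHEMA.items()
            ):
                raise ValueError(f"row violates schema: {row}")
        version = self._version()
        data_file = self.root / f"part-{version:05d}.json"
        data_file.write_text(json.dumps(rows))
        # Commit: the batch becomes visible only once the log entry exists.
        tmp = self.root / "_log" / f"{version:05d}.tmp"
        tmp.write_text(json.dumps({"add": data_file.name}))
        os.replace(tmp, self.root / "_log" / f"{version:05d}.json")

    def read(self):
        # Readers list data files via the log, never by scanning the folder.
        rows = []
        for entry in sorted((self.root / "_log").glob("*.json")):
            name = json.loads(entry.read_text())["add"]
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


def demo():
    with tempfile.TemporaryDirectory() as root:
        table = ToyTable(root)
        table.append([{"id": 1, "temp": 21.5}])
        try:
            table.append([{"id": 2, "temp": "hot"}])  # wrong type, rejected
        except ValueError:
            pass
        return table.read()
```

The design choice worth noticing is that the storage stays as cheap files, while the warehouse-like guarantees come entirely from the metadata layer on top of them.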

A common criticism is that the book tends to be repetitive, rehashing similar ideas and concepts multiple times without providing new insights.
While it covers the basics of the data lakehouse, it may not delve deeply enough into technical details or advanced concepts for readers who are already familiar with data architecture.
To move to a lakehouse architecture, the main steps are:

**Assessment and Planning**
Evaluate Current Infrastructure: Assess the existing data warehouse architecture, including storage, ETL processes, and BI tools. Identify limitations and areas for improvement.
Define Use Cases: Determine the use cases that the data lakehouse will support, such as real-time analytics, machine learning, and unstructured data analysis.
Identify Stakeholders: Engage business stakeholders, data engineers, data scientists, and IT teams to gather requirements and expectations.

**Design the Lakehouse Architecture**
Storage Layer: Plan for a scalable storage solution (e.g., object storage like AWS S3, Azure Blob Storage) that can handle diverse data types (structured, semi-structured, and unstructured).
Management Layer: Implement data governance, security, and metadata management practices that ensure data quality and compliance.
Processing Layer: Incorporate processing engines capable of supporting SQL queries, machine learning, and streaming data processing (e.g., Apache Spark, Flink).
Consumption Layer: Ensure compatibility with existing BI tools and provide user-friendly access for data analysts and data scientists.
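The four layers above can be written down as a simple planning checklist. This is just a sketch of how one might track the design, not a method from the book; the `LakehousePlan` class and the component strings are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class LakehousePlan:
    """One entry per layer listed above; the values are illustrative choices."""

    storage: list = field(default_factory=list)
    management: list = field(default_factory=list)
    processing: list = field(default_factory=list)
    consumption: list = field(default_factory=list)

    def missing_layers(self):
        """Return the layers that have no component assigned yet."""
        return [name for name, parts in vars(self).items() if not parts]


# Hypothetical draft plan in which the management layer is still undecided.
plan = LakehousePlan(
    storage=["object storage (e.g., AWS S3)"],
    processing=["Apache Spark"],
    consumption=["existing BI tools"],
)
```

Calling `plan.missing_layers()` on this draft reports `management` as the open item, which matches my impression that governance and metadata are the part most easily left for later.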

coding non-fiction


Authors: Bill Inmon, Mary Levins, Ranjeet Srivastava
