Name: Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump
Rating: 2.71 (11 reviews)

85 reviews

April 5, 2020

You can tell that the author's specialty is not prose, but I found this book to be a great 101 on Data Lake Architecture despite this. Approachable and high level. Does not include many scenarios for usage but walks through what the setup of his proposal would look like at an organization. I think it would have been much better organized if the author discussed one specific point in its entirety before moving on. He seems to jump around a bit. It's very possible he is repeating himself on purpose, to drive the point home for the less IT-minded audience that he may be reaching for. Agreeing with other posts that there could have been more editing work and that there are run on sentences and information that could be... de-duped ;)

Definitely not a primer for professionals but might recommend to someone with a passing interest in how it might be set up, a new student interested in data science, architecture or application development, or for management to understand some of the key concepts and be able to understand what they mean in meetings. I would recommend to that specific audience alone.

------------------------------------

Chapter 2, Transforming the Data Lake

What are the ingredients needed to turn the data lake into a bottomless well of actionable insights? There are 4 basic ingredients that are needed: Metadata, integration mapping, context, and metaprocess.

Metadata is the description of the data in the data lake itself, as opposed to raw data. Metadata is the basic structural information that every collection of data has associated with it. For example, if tracking visits, clicks, and engagement on a website, metadata would include the IP Address and geographical location of the visiting computer. Typical forms of metadata include descriptions of the record, the attributes, the keys, the indexes, and the relationships among the different attributes of data. Metadata is used by the analyst to decipher the raw data found in the data lake. In other words, Metadata is the basic roadmap of the data that resides in the Data Lake. Imagine trying to search Wikipedia when none of the articles have titles? Now, when raw data is properly tagged with metadata and stored in the Data Lake together, you now have an incredibly useful service.

The Integration Map describes how data from one application relates to data in another application and how that data can be meaningfully combined.

Consider that most of the Data Lake's input is generated by an application in some way or another. What happens when you put data from different applications in the Data Lake? You create unintegrated silos of data in the Data Lake. Each application, usually written in a different coding language, sends its input to a separate silo – which cannot communicate with the other silo. While the information is all stored in the same Data Lake, each silo is unable to integrate its data with the others even if properly tagged with metadata.

In order to make sense of the data in the Data Lake, it is necessary to create an Integration Map. The Intgration Map is a detailed specification that shows how the Lake's data can be integrated – and it is the best method to overcome the isolation of data in the silos.

Context:

Another complicating factor in the Data Lake is textual data that has been placed without context of the text being identified. Suppose the term "court" appears: Does court refer to Tennis, to a legal proceding, the act of wooing a lover? When you look at the word "court" by itself, it might mean any of these things or more. Text without context is meaningless data. In fact, in some cases it is dangerous to store text without an understanding of its context. If you were going to put text in the Data Lake, you must also insert context as well – or at least a way to find that context

Metaprocess information is information about how the Data Lake was processed or how the information in the Lake will be processed.

1. Where was the data generated?

2. When was the data generated?

3. How much data was generated?

4. Who generated the data?

5. How was the data selected?

6. Once inside the Data Lake, was the data further processed?

All of these forms of Metaprocessing are useful to the analyst as they go about extracting and analyzing the Lake's data. The most important point is that these features need to be included in the outset.

Once these ingredients have been added, the Data Lake is a potential information goldmine. Another important effect of turning the Data Lake into a useful corporate resource is that an entirely different and expanded community of users can access and report on this data across the organization.

Chapter 3, Inside the Data Lake

While it is true that any kind of data can be found inside the Data Lake, it is nevertheless possible to categorize the data into three categorize:

Analogue Data

Application Data

Textual Data

Analogue Data is typically generated by a machine or some other automated device even if not connected to the internet. These measuring tools include diagnostic programs logging performance on anything from nuclear reactors to the CPU useage of your mobile phone. In general Analogue Data is very voluminous and repetitive. Most Analogue Data consists of a long list of numbers that have been generated. Most records, created by an analogue device, are measurements and most of the time those measurements only vary slightly from all other measurements. Typically, these small outliers are of the most interest. Analogue Data typically consists of a simple measurement of some physical value: heat, weight, size, chemical composition, etc. When a measurement seems out of line, it is an indication to look elsewhere for the cause of the measurement. For example, the odd measurement may have been caused by the fact that a machine has lost its calibration or a part needs an adjustment. The Analogue Data is merely a signal to the analyst to look elsewhere as to the cause of the variation in measurement, which is why the Metaprocess information associated with Analogue Data is often times more important than the Analogue Data itself. Metaprocess details typically include information such as time of measurement, location of measurement, speed of measurement. Typically Analogue Information is triggered by or associated with some trigger, such as a manufacturing event, a part created, a shipment sent, a box moved – All common events causing the creation of an Analogue Record.

The Analogue measurement is almost always made mechanically without any user input or extra processing. The data points accompanying the raw data captured in the Analogue Measurement Process is called Metaprocess Data. While there are different kinds of Metaprocess Models, suited to different objectives, this raw output is the most relevant to Data Lakes. The Metaprocess information provides a different perspective of the Analogue Data than just looking at the raw data itself.

Oftentimes, the Analogue Measurements are stored in log tapes or journal tapes. The log tape is a sequential measurement of one or more variables detected during the events that creates an Analogue Measurement. A log tape is very detailed – numbers are generated in very small intervals and the format is typically complex. Oftentimes, system utilities are used to read and interpret the log tape because of their complexity. In most cases, the log tape captures all events that occur – not just events that are of interest or exception. As a consequence, its normal for a log tape to contain much extraneous info.

Application Data is generated by the execution of an application or transaction and sent to the Data Lake. As important as transaction data is, it is not the only kind of data found in the application component of the Data Lake. This can include sales data, payment data, banking checking data, shipment data, manufacturing process control data, contract completion data, inventory mgmt data, bill payment data, etc. When any business relevant event occurs, the event is measured by an application and the data is created. The physical manifestation of data in the Data Lake can take many forms, but the most typical is recording activity in an application.

The records may or may not have been shaped by a Database Management System (DBMS). It is typical of the application records to have a common and repeating uniform structure. The common uniform structure of the application data is usually in the form of a record, which is more than an Analogue Data Point. The record may have attributes, which are designated as a key. One or more of the attributes can have an independent index.

Textual Data is usually associated with an application but takes a very different form from Application Data. Whereas App Data is shaped into uniform records, data found in a textual format is decidedly not shaped into any uniformity. Textual Data is called Unstructured Data because the text can take any form. For example: Someone speaking, typing. They could speak in riddles or parables. They might use a different language. Their speech may contain slang, vulgarities, or inside jokes. Naturally, such text is extremely context dependent and not easily searched by automated means. Typically this could include: Call Center conversations, corporate contracts, email, insurance claims, sales pitches, court orders, jokes, tweets, invitations, etc. There is no limit to what kind of text and how much text is stored in a Data Lake – however, in order for text to be used analytically – it must be transformed.

In order for text to be subjected to useful analytical processing, unstructured text must pass through a process known as Textual Disambiguation. Because of the uniformity with which Analogue and App Data are captured, those kinds of data points are expected to be analyzed by a computer. But if there is to be exhaustive analysis of text, it must be passed from its unstructured form of data through textual disambiguation, at which point it passes into a state and form that can be analyzed by a computer.

1. Text goes from an unstructured state to a structured, uniform state that can be analyzed by a computer

2. Text has context recognized and associated with the text itself

Data Classifications

Analogue and Application Data is considered repetitive data, while Textual Data is considered non-repetitive. Generally, there is great business value in non-repetitive data while significantly less in repetitive.

Chapter 4, Data Ponds

Once raw data flows from the Raw Data Pond into the Data Pond, should the raw data remain in the Raw Data Pond? Once raw data passes from the Raw Data Pond to the Analogue, Application, or Textual Data Ponds, it is best to remove the data from the Raw Data Pond. The raw data has already served its purpose and it would be extremely rare for analytical processing to be performed in the Raw Data Pond. The Raw Data Pond then becomes the holding cell for a jumble of data. The data in the RDP should be passed to the supporting DPs as quickly as possible.

The Analogue Data Pond is the place where Analogue Data is stored. The conditioning process for Analogue Data consists primarily of data reduction – reducing the volume to a workable, manageable, meaningful amount of restructured data.

The Application Data Pond is populated with information that comes from executing one or more applications. This application data is probably the cleanest in the whole Lake because it has been generated by an application. The App Pond is uniformly structured and contains values that are relevant to the execution of some business activity. The data in this pond is notoriously un-integrated. If all the information in this pond comes from a single app, the data in this pond may actually be integrated – however, for most large corporations, there is a very good chance that data in this pond comes from different applications.

The Textual Data Pond is where unstructured textual data is placed. Text here can come from anywhere and is notoriously difficult to analyze in a profound fashion. In order to do a deep analysis of the data, disambiguation is required.

The Archival Data Pond is where data is held when it is not actively needed for analysis, but may at some time in the future.

Chapter 5, Generic Structure of the Data Pond

The Pond Descriptor contains a description of the external contents and manifestation of the pond, and where the data originated from. Frequency of update or refreshment cycle of data outside the pond.

The Pond Target is a description of the relationship between the business of the corporation and the data inside the pond

The Pond Data is merely the physical data that resides inside the pond

The Pond Metadata describes the physical characteristics of the data inside the pond

Pond Metaprocess Information is information about the transformation conditioning of the data inside the pond. In order to be useful, data in the pond must undergo this process.

Pond Transformation Criteria is documentation on how the transformation conditioning of data inside the pond should occur.

Volume of Data is a general description of how much data is in the pond. Data is measured in terms of number of records and bytes, and influences the type and depth of analysis that can be done.

Selection Criteria is the description of the criteria that was used to select the data for inclusion in the specific pond.

Summarization Criteria – Most of the time, data is summarized or otherwise processed as it passes into the data pond. This summarization is a description of the algorithms deployed.

Transformation Criteria – The criteria used in the transformation process for the conditioning of data within the pond. Each pond will have unique criteria.

Ex:

"If length is greater than 45cm, capture record. If else, do not capture."

"Capture all the activity on a specific PC for this specific month."

Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump

Bill Inmon

About the author

Bill Inmon

Ratings & Reviews

Friends & Following

Community Reviews

Join the discussion

Can't find what you're looking for?