A very good journey into Data Catalogs.
The approach is very theoretical and agnostic in terms of tools or technologies.
The concepts are well explained and the whole structure is easy to follow.
The core idea is also not hard to grasp, even for someone new to the world of data: having a place to look for data assets.
I much appreciated that the book is short and not redundant: the topic is smaller than other data topics, so a short but dense book makes sense.
NOTES
A Data Catalog is an organized inventory of a company's data: it provides an overview at the metadata level only, so no actual data values are exposed and everyone can see it.
It is a DB with metadata that has been pushed or pulled from the data sources.
It is organized in domains that contain assets, which are metadata representations of data in the source systems.
It allows Data Discovery and Data Governance.
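To make the structure concrete for myself, here is a minimal sketch of a catalog as domains that contain metadata-only assets. The class names (`Asset`, `Domain`, `DataCatalog`), the fields, and the sample systems are my own assumptions for illustration, not something taken from the book or from any catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Metadata representation of data in a source system (no data values)."""
    name: str
    source_system: str
    columns: list[str] = field(default_factory=list)


@dataclass
class Domain:
    """A business domain that groups related assets."""
    name: str
    assets: list[Asset] = field(default_factory=list)


@dataclass
class DataCatalog:
    """Hypothetical in-memory catalog: domain name -> Domain."""
    domains: dict[str, Domain] = field(default_factory=dict)

    def register(self, domain_name: str, asset: Asset) -> None:
        # Create the domain on first use, then attach the asset to it.
        domain = self.domains.setdefault(domain_name, Domain(domain_name))
        domain.assets.append(asset)

    def find(self, term: str) -> list[tuple[str, str]]:
        """Search for data: return (domain, asset) pairs whose metadata mentions term."""
        term = term.lower()
        hits = []
        for domain in self.domains.values():
            for asset in domain.assets:
                searchable = [asset.name, *asset.columns]
                if any(term in text.lower() for text in searchable):
                    hits.append((domain.name, asset.name))
        return hits


catalog = DataCatalog()
catalog.register("Sales", Asset("orders", "erp_db", ["order_id", "customer_id", "amount"]))
catalog.register("Marketing", Asset("campaigns", "crm_db", ["campaign_id", "customer_id"]))
catalog.register("Finance", Asset("invoices", "erp_db", ["invoice_id", "amount"]))

print(catalog.find("customer"))
```

The search only touches names and column names, which mirrors the point that the catalog exposes metadata and never the underlying values.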
Organize Data
Domain-Driven Design (DDD) is an approach to software development that focuses on creating software systems that closely align with the business domain they are designed to serve. It provides a set of principles and practices for managing complexity and organizing code in a way that reflects the domain's concepts, logic, and relationships.
DDD emphasizes collaboration between domain experts, software developers, and other stakeholders to ensure that the software accurately models the problem domain and captures the essential business knowledge. The goal is to create a shared understanding of the domain and use that understanding to drive the design and implementation of the software.
The domain represents the core concepts, rules, and processes that define the problem space and drive the behavior of the software. It includes the entities, relationships, workflows, and business logic that are relevant to the specific problem or industry being addressed.
Processes describe how a company performs a task.
Capabilities describe what tasks a company performs.
The first step to organising your domains is to choose between creating the domains as processes or capabilities.
A process domain is put together based on how things are done; a capability domain is based on what things are done.
Getting Assets into the Data Catalog
- Pull: using standard, built-in connectors (crawlers); API; RDS (read-only data store)
- Push: mainly streaming; the catalog just listens and receives metadata, without influencing what is sent
Assets are classified by content, confidentiality (secrecy), and sensitivity (PII).
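A rough sketch of how I picture pull versus push ingestion, with a classification step applied on the way in. Everything here is an assumption made for illustration: the `Catalog` class, the `Confidentiality` levels, the `PII_HINTS` column-name heuristic, and the sample crawler are hypothetical and not the book's design.

```python
from dataclasses import dataclass
from enum import Enum


class Confidentiality(Enum):
    # Hypothetical levels; a real company defines its own scale.
    PUBLIC = "public"
    INTERNAL = "internal"
    SECRET = "secret"


@dataclass
class AssetMetadata:
    name: str
    columns: list[str]
    confidentiality: Confidentiality = Confidentiality.INTERNAL
    contains_pii: bool = False


# Assumed naive heuristic: column names that hint at personal data.
PII_HINTS = ("email", "phone", "ssn", "birth")


def classify(asset: AssetMetadata) -> AssetMetadata:
    """Flag sensitivity based on column names only (metadata, not values)."""
    asset.contains_pii = any(
        hint in column.lower() for column in asset.columns for hint in PII_HINTS
    )
    return asset


class Catalog:
    def __init__(self) -> None:
        self.assets: dict[str, AssetMetadata] = {}

    def pull(self, crawler) -> None:
        """Pull: the catalog actively asks a source for its metadata."""
        for asset in crawler():
            self.assets[asset.name] = classify(asset)

    def on_event(self, asset: AssetMetadata) -> None:
        """Push: the source emits metadata; the catalog only listens and receives."""
        self.assets[asset.name] = classify(asset)


def crm_crawler():
    """Stand-in for a built-in connector that scans a source system."""
    yield AssetMetadata("customers", ["id", "email", "phone_number"])
    yield AssetMetadata("products", ["sku", "price"])


catalog = Catalog()
catalog.pull(crm_crawler)
catalog.on_event(AssetMetadata("clickstream", ["session_id", "url"]))

for name, asset in catalog.assets.items():
    print(name, asset.confidentiality.value, asset.contains_pii)
```

The difference between the two paths is only who initiates: `pull` calls the source, while `on_event` is called by it.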
Understand Search
There are different users: everyday end users; governance end users; data analytics end users.
They can search in data (actual data, specific answers) or for data (data sources).
Searching in data leverages a DQL (Database Query Language), the most popular being SQL.
In a Data Catalog, where we search for data, we can use an IRQL (Information Retrieval Query Language).
- Why do you search a data catalog? Because it enables data discovery. Data discovery starts with finding the best data sources.
- What do you search in a data catalog? In a data catalog, you are not searching in data, but for data.
- How do you search a data catalog? You use a combination of query language commands, operators, and clickable filters. You can use simple search, browsing, and complex search:
- Simple search is simple for you, but complex behind the scenes. It provides search results based on how you have previously searched. It also corrects your queries and makes suggestions.
- Browse search can be vertical, based on domains; horizontal, based on data lineage and display how data travels across systems; relational, based on graph technology.
- When searching for data, you need to apply the mindset of a librarian, not a data scientist. Searching for data is a discipline that relies on search mechanics, but it also takes experience and understanding your company's data and language.
- Basic simple search is the way of searching that most end users will apply. A well-structured data catalog will deliver precise simple search, especially if it's based on a knowledge graph. But expect a lot of mess deeper down in the search results also.
- Detailed simple search requires you to know the syntax of the IRQL in your data catalog. So it takes a little time to write, or just experience, but you get super-precise hits in return.
- Flexible simple search also depends on understanding IRQL, but it opens up the search to give more results, increasing your recall and decreasing your precision, while at the same time still being a better way to target a well-defined topic than basic simple search.
- Range search is searching in intervals, e.g., a time span. This kind of search will result in high precision and low recall.
- Block search is a structured way to search for a complex topic using IRQL. It works best if your glossaries are exhaustive and used with great specificity.
- Statement search is a way to search for a complex topic; it simply puts a lot of things together in a search. It's not unstructured, but it's haphazard.
- Glossary browsing is searching in which you go exploring to get informed and enlightened about business terminology.
- Domain browsing, lineage browsing, and graph browsing are ways of searching vertically, horizontally, and relationally, respectively, by clicking through the data landscape.
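Since several of the search types above are described in terms of precision and recall, a small worked example helps. Precision is the share of returned assets that are relevant; recall is the share of relevant assets that were returned. The asset names and the two result lists below are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result list."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: the assets that actually answer the question.
relevant = {"sales_2023", "sales_2024", "sales_forecast", "sales_returns"}

# A narrow query (for example a range search) returns few, accurate hits.
narrow = ["sales_2023", "sales_2024"]

# A broader query returns more relevant assets, but also more noise.
broad = [
    "sales_2023", "sales_2024", "sales_forecast",
    "hr_salaries", "sales_team_photos", "salesforce_logs",
]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 0.75)
```

Opening up the query raises recall from 0.5 to 0.75 while precision drops from 1.0 to 0.5, which is the trade-off the notes describe.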
Discover Data
- Data catalogs must activate metadata so that a data catalog is not just a dead repository of data but a machine that pushes relevant data in contexts where it can provide value and increased results.
- Data governance leaders are engaged by the fact that they can apply sensitivity and confidentiality classification directly on data. They are furthermore engaged by the fact that they can join forces in mapping the IT landscape and concentrate on more strategic priorities.
- Data analytics leaders are engaged naturally, but an extra selling point is data lineage, which allows these leaders to understand changes upstream or the causes of broken reporting.
- Domain leaders are engaged by the potential of seeing data from other business units that they are in need of in their daily tasks.
- All leaders are in fact connected. They work with data for different purposes, either on the operational backbone or on the data platform. But the data they work with is the same, and they need to align on how to describe it and manage it in a data catalog.
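The lineage selling point can be sketched as a graph walk: given a broken report, list everything upstream of it. The lineage map and asset names below are hypothetical, and the breadth-first traversal is just one simple way to do it, not how any particular catalog implements lineage.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAM = {
    "revenue_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["erp.orders"],
    "customers_clean": ["crm.customers"],
}


def upstream_of(asset, lineage):
    """Breadth-first walk over lineage edges; returns upstream assets, nearest first."""
    seen = []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


print(upstream_of("revenue_dashboard", UPSTREAM))
```

If `revenue_dashboard` breaks, the result is the list of candidates to inspect, ordered from the closest dependency to the original source systems.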
Access Data
- There will be cases where the implementation of a data catalog is in fact a catalog of catalogs. Even though this is a difficult way to implement a data catalog, it can be both a necessary and relevant approach.
- A centralized approach, which uses one global solution to make data accessible across the company
- A decentralized approach, where each domain is capable of choosing their own solutions to make data accessible
- A combined approach, where some data is accessible via a central solution, while certain domains in the company act more freely and have made data accessible themselves
- Questionnaires are a way to unlock the descriptions of domains and the assets in them, complete with glossary terminology.
The data asset lifecycle is very similar to the data lifecycle: plan; obtain; store and share; maintain; apply; dispose.
- All data in IT systems has a lifecycle. This lifecycle can be short, long, or eternal, depending on the nature of the organization it pertains to. The data catalog enables companies to gain a complete overview of their data earlier in the data lifecycle.
- The data catalog enables you to mirror all the data in the IT landscape of a firm, giving global control of the data lifecycle, which solves issues such as how and with whom data must be shared and when it must be deleted, or if it must not be deleted.
- The data assets inside the data catalog also have their own lifecycle, and to keep the data catalog well curated and searchable, the lifecycle of the asset must be taken into consideration when managing the assets, for example when data sources are sunsetted.
- All lifecycles, inside and outside the data catalog, are connected. Data source lifecycles and data lifecycles influence the data asset lifecycles and terminology lifecycles, whereas the two latter support the first two. You can manage your data source and your data lifecycles via the data catalog.
- Lifecycles enable applied search that takes the dimension of time into account: via lifecycles, searches can be carried out that go back in time as long as the organizational memory allows. This is the key element to be in compliance with privacy regulations and other, industry-specific regulations that require organizations to store data for a certain period of time.
- Lifecycles can be treated as a maintenance framework that can be enacted by using a data catalog.
- Data observability proposes to manage the data lifecycle in the Obtain phase, before it is stored in solutions and shared with the rest of the company.
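To tie the lifecycle notes together, here is a minimal sketch that models the six phases listed above and a retention check of the kind a catalog could support. The `Phase` enum follows the phases in these notes; the function names, the strictly linear phase order, and the retention periods are my own assumptions, not rules from the book or from any regulation.

```python
from datetime import date
from enum import Enum


class Phase(Enum):
    PLAN = 1
    OBTAIN = 2
    STORE_AND_SHARE = 3
    MAINTAIN = 4
    APPLY = 5
    DISPOSE = 6


def next_phase(phase):
    """Return the following phase, or None once data has been disposed of."""
    if phase is Phase.DISPOSE:
        return None
    return Phase(phase.value + 1)


def due_for_disposal(created, retention_years, today):
    """True once the (assumed) retention period has elapsed."""
    try:
        expiry = created.replace(year=created.year + retention_years)
    except ValueError:
        # created was Feb 29 and the expiry year is not a leap year
        expiry = created.replace(year=created.year + retention_years, day=28)
    return today >= expiry


today = date(2021, 1, 1)
print(next_phase(Phase.OBTAIN))
print(due_for_disposal(date(2015, 3, 1), 5, today))
print(due_for_disposal(date(2020, 3, 1), 5, today))
```

With lifecycle dates stored as metadata, a check like this can run against the catalog alone, without touching the data itself.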