Author: René van Kooten
The Datavault Adventure – Myths Busted and more
Dan Linstedt published the principles of the Data Vault methodology in five articles in July 2002. In 2005, the Data Vault Data Warehouse Methodology arrived in The Netherlands, when the first articles about Data Vault by Dutch adopters were published. Interest in the methodology in The Netherlands grew from 2006 on, and in 2013 – 2014 more articles and blogs about Data Vault appeared on the Internet. The Data Vault Data Warehouse Methodology is particularly popular in The Netherlands, where it has been implemented, or will be implemented, by large financial organizations.
In 2008 an introduction to Data Vault was given, presenting the benefits of this – at that time – latest Data Warehouse methodology. The enthusiasm with which the message was brought suggested that this new methodology was the Holy Grail. Since that introduction, more articles and blogs about Data Vault have been published, some of which claimed new areas for which the Data Vault was the solution. But is that so? This article / blog will bust the myths of the Data Vault Methodology and more!
1. Myth: Data Vault has an Audit Function
A Data Warehouse contains a copy of the source data and builds up a history of changes from the moment the solution is taken into production. Is this history of changes the same as an audit trail of the source data? The answer is no.
The Data Warehouse only records a change in the source data when that data is loaded into the Data Warehouse. This is done by executing an ETL (Extract, Transform & Load) process, and this process runs at most a couple of times a day, often only once a day. There is therefore a timeframe between executions of the ETL process during which changes go unrecorded. The following example clarifies the difference with a real audit function:
- Day 1, 08:00 – a record is created in a table in a database
○ first audit
- Day 1, 08:30 – the record is changed
○ second audit
- Day 1, 19:00 – execution of the ETL process
- Day 2, 10:15 – the record is changed
○ third audit
- Day 2, 10:15 – the record is changed
○ fourth audit
- Day 2, 10:15 – the record is changed
○ fifth audit
- Day 2, 19:00 – execution of the ETL process
The Data Warehouse will contain the source data as of the second and the fifth audit. The original value of the record and the values after the third and fourth change will never be known to the Data Warehouse. A true audit function has to take place at the data source itself. In another article / blog the audit function will be described in more detail.
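The timeline above can be sketched in a few lines of Python. This is an illustration only, with made-up record values (`v1` … `v5`): the ETL run simply takes the last value visible at run time, so intermediate states are lost.

```python
from datetime import datetime

# Each event is (timestamp, value of the record) -- values v1..v5 are illustrative.
source_events = [
    (datetime(2024, 1, 1, 8, 0), "v1"),    # record created (first audit)
    (datetime(2024, 1, 1, 8, 30), "v2"),   # record changed (second audit)
    (datetime(2024, 1, 2, 10, 15), "v3"),  # third audit
    (datetime(2024, 1, 2, 10, 15), "v4"),  # fourth audit
    (datetime(2024, 1, 2, 10, 15), "v5"),  # fifth audit
]
etl_runs = [datetime(2024, 1, 1, 19, 0), datetime(2024, 1, 2, 19, 0)]

def snapshot_at(run_time):
    """Return the last value of the record visible at the moment the ETL runs."""
    current = None
    for ts, value in source_events:
        if ts <= run_time:
            current = value
    return current

warehouse_history = [snapshot_at(run) for run in etl_runs]
print(warehouse_history)  # ['v2', 'v5'] -- v1, v3 and v4 are never captured
```

Whatever happens between two ETL runs is invisible to the warehouse, which is exactly why this history is not an audit trail.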
2. Myth: Data Vault Does Not Acquire Knowledge Of Data Sources
The main objective of the Data Vault is to store source data, with its full history, in an organized, easy to understand structure. Changed data is stacked on top of the existing data. But how can one build a Data Vault without knowing the characteristics of the source data? The answer is: it is not possible.
- Before one builds a Data Vault, one has to design a datamodel for it. To be able to do this, one has to analyze the structure of the data in the data source(s).
- The way Master Data Management has been implemented in the organization (procedures) and in the applications determines the strategy for loading the source data into the Data Vault.
3. Myth: Data Quality is not a subject for Data Vault
One can divide Data Quality issues into two categories:
- Source data does not meet the Business Rules
- Source data does not meet the Database Integrity Rules
Whether or not to solve Data Quality issues of the first kind is a choice made by the methodology, the designer and / or the organization. Not solving issues of the second kind is not a choice: these have to be dealt with in every Data Warehouse project / solution.
The two most important Database Integrity Rules are the Uniqueness Constraint (Primary Key) and the Referential Integrity Constraint (Primary Key / Foreign Key). In my experience, in roughly 90% of the data sources I have unlocked, these constraints were not honoured:
- When a violation of the Uniqueness Constraint occurs in the source data, the ETL process that loads the source data into the Data Vault will not stop; technically, the execution of the ETL process is successful. But what if the violation occurs in a transaction worth €1.000.000? How will this transaction appear in the Satellite(s) of the LINK, and what influence will that have on reports built on the Data Vault?
○ Of course this issue should be solved in the (data) source, but what to do in the meantime? There is no other option than to rebuild the LINK at every execution of the ETL process. Because every record in the data source needs a UniqueID, a new UniqueID has to be assigned to every record in the data source by a separate step in the Extraction process.
- Whether one has to deal with a violation of the Referential Integrity Constraint depends on which kind of Data Vault is to be built (with or without Surrogate Keys). This issue has been discussed in a previous article / blog about Data Vault: The Datavault Adventure – Building a Data Vault with DDM Studio 3 SK.
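The extraction step described above can be sketched as follows. This is a minimal illustration, not a DDM Studio feature: the field names (`TransactionNo`, `UniqueID`, `KeyViolation`) and the helper function are hypothetical.

```python
from collections import Counter

def extract_with_unique_ids(rows, natural_key):
    """Flag Uniqueness Constraint violations and stamp each extracted row
    with its own extraction-assigned UniqueID."""
    key_counts = Counter(row[natural_key] for row in rows)
    duplicates = {k for k, n in key_counts.items() if n > 1}
    extracted = []
    for unique_id, row in enumerate(rows, start=1):
        extracted.append({
            **row,
            "UniqueID": unique_id,                           # assigned in extraction
            "KeyViolation": row[natural_key] in duplicates,  # flag for follow-up
        })
    return extracted

rows = [
    {"TransactionNo": "T1", "Amount": 1_000_000},
    {"TransactionNo": "T1", "Amount": 1_000_000},  # duplicate Primary Key!
    {"TransactionNo": "T2", "Amount": 500},
]
for r in extract_with_unique_ids(rows, "TransactionNo"):
    print(r["UniqueID"], r["TransactionNo"], r["KeyViolation"])
```

The flag makes the violation visible for reporting while the freshly assigned UniqueID lets the LINK be rebuilt deterministically at every run.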
4. Myth: Design of Satellites can be done automatically
There are tools available that generate a Data Vault datamodel and the ETL scripts needed to load the source data into the Data Vault. Part of that datamodel is the design of the surrounding Satellites, which is based on the mutation frequency of each attribute defined for a HUB or a LINK. Does an OLTP system contain information about the mutation frequency of every attribute in a table, so that Satellites could be designed automatically? The answer is no.
5. Point In Time (PIT) Table: a Point of Attention
A record in a Satellite has a validity period marked by a StartDate and an EndDate. It is therefore possible to retrieve the valid state of a HUB or a LINK at any given DateTime: exactly one record in the Satellite is valid at that moment.
A trap, and therefore a point of attention, is that the valid state of a Satellite record says nothing about the context in which its content has to be placed. For transactions, the date on which they occurred is extremely important when fulfilling the information needs of the Business. When you define First Level Facts in the Dimensional Modeling Module of DDM Studio, you are required to assign a Main Fact Date. This date – together with the Natural Key of a Dimension – determines which version of the entity in the Dimension (identified by the Dimension Surrogate Key) is retrieved for the transaction in the Fact. That is a point of attention when unlocking the Data Vault.
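The validity-period lookup can be illustrated with a small sketch. The satellite rows and the `Status` payload are made up; an open EndDate is represented here by `date.max`, which is one possible convention, not a Data Vault rule.

```python
from datetime import date

# Hypothetical satellite rows for one HUB key: (StartDate, EndDate, payload).
satellite = [
    (date(2023, 1, 1), date(2023, 6, 30), {"Status": "new"}),
    (date(2023, 7, 1), date.max,          {"Status": "active"}),
]

def valid_record(sat_rows, as_of):
    """Return the single satellite record valid on the given date."""
    matches = [p for start, end, p in sat_rows if start <= as_of <= end]
    assert len(matches) == 1, "validity periods must not overlap"
    return matches[0]

# The Main Fact Date decides WHICH version of the entity the fact refers to:
print(valid_record(satellite, date(2023, 3, 15)))  # {'Status': 'new'}
print(valid_record(satellite, date(2023, 8, 1)))   # {'Status': 'active'}
```

A fact dated 15 March must be joined to the "new" version, even though "active" is the current state: that is exactly the trap described above.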
Implementation of the PIT table
A question to Data Vault designers and builders: why – in some implementations – does every HUB and LINK have its own PIT table? Is one PIT table for the whole Data Vault not enough? In my opinion the PIT table contains all the dates on which the Data Vault has been loaded, occasionally extended with dates on which the Data Vault has not been loaded, in order to fulfill the information needs of the Business.
6. LINK Foreign Key Changes
Articles / blogs on the Internet frequently mention the issue of Foreign Key values in LINKs changing over time. The Data Vault Methodology gives no guidelines on how to solve this problem. On the Internet one will find a solution called Tracking Satellites.
Four scenarios have been identified to deal with changes of the Foreign Keys in LINKs:
- The Data Vault does not keep track of value changes of the Foreign Keys. The Foreign Keys in the LINK keep their initial values. A prerequisite is that the LINK contains a stable Natural Key.
- The Foreign Keys are updated with the most recent values, so the Foreign Keys in the LINK contain the current values. A prerequisite is that the LINK contains a stable Natural Key.
- A dedicated Tracking Satellite is created for the Foreign Keys of the LINK, so that value changes of the Foreign Keys can also be tracked. This is the same solution found on the Internet, with the difference that this Tracking Satellite contains all the Foreign Keys instead of just one (so only one Tracking Satellite is needed). A prerequisite is that the LINK contains a stable Natural Key.
○ The relationships between the LINK and the HUBs and / or other LINKs start from this Tracking Satellite.
○ It is then also possible to move the Foreign Keys from the LINK to this Tracking Satellite, which makes the structure of the LINK identical to that of the HUB. Will this generic approach make the design of the Data Vault easier?
- When the LINK does not contain a stable Natural Key, the only possibility is to always load the LINK with a full initial load. The consequence for a Data Vault with Surrogate Keys is that all other LINKs that depend on this LINK also have to be loaded full initial!
7. Known Extensions to the Data Vault
The power of the Data Vault Methodology is to load source data, at high speed, into an organized but also easy to understand structure. Huge volumes of source data can be processed within a limited timeframe by the Data Vault ETL processes because of the simplicity of these processes. Since Dan Linstedt's publications about Data Vault in July 2002, the methodology has been extended with a couple of features, which will be discussed one by one.
- Surrogate Keys
○ Surrogate Keys have been introduced to the Data Vault to improve performance when unlocking the Data Vault to fulfill the information needs of the Business. Experience with unlocking the Data Vault has led to the conclusion that the Data Vault structure is too difficult for reporting purposes and that datamarts have to be built, with the Data Vault as their data source. In that scenario, are the Surrogate Keys still useful?
○ The implementation of Surrogate Keys in the Data Vault comes at great cost. This has been discussed in a previous article / blog about Data Vault: The Datavault Adventure – Building a Data Vault with DDM Studio 3 SK.
- Date Last Seen
○ One of the most interesting characteristics of source data is that it can disappear / be deleted.
○ One benefit of the Data Vault is that it contains all source data from the moment the Data Vault solution went into production.
○ By comparing the content of the Data Vault with the source data to be loaded, it is possible to detect deleted records. Of course the prerequisite is that the source data always contains the full dataset. The comparison is done in an extra step of the ETL process. Because this step can negatively affect the performance of the ETL processes, it is advisable to apply this extension only to source datasets in which deletion of data can occur.
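The Date Last Seen mechanism can be sketched like this. It is an illustrative model only (keys `A`, `B`, `C` and the field names are made up): every load stamps each key it still sees, so keys whose stamp stops advancing must have been deleted from the source.

```python
from datetime import date

def update_date_last_seen(vault, source_keys, load_date):
    """Stamp every key present in the full source dataset with the load date."""
    for key in source_keys:
        vault.setdefault(key, {"FirstSeen": load_date})
        vault[key]["DateLastSeen"] = load_date

vault = {}
update_date_last_seen(vault, {"A", "B", "C"}, date(2024, 1, 1))
update_date_last_seen(vault, {"A", "C"}, date(2024, 1, 2))  # "B" has been deleted

# Keys not seen in the latest (full) load have disappeared from the source:
deleted = [k for k, v in vault.items() if v["DateLastSeen"] < date(2024, 1, 2)]
print(deleted)  # ['B']
```

Note that this only works if each delivery really is the full dataset, as the prerequisite above states; with an incremental delivery every unseen key would be falsely flagged as deleted.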
- Tracking Satellites
○ To deal with changes of the Foreign Keys in LINKs, the solution has been found in creating ‘Tracking Satellites’. Some Data Vault designers create one Tracking Satellite for each Foreign Key whose value can change over time. As discussed in the previous section of this article / blog, it is proposed to build only one Tracking Satellite, containing all Foreign Keys of the LINK.
- Reference Tables
○ The latest extension to the Data Vault that has become public is the implementation of Reference Tables. What is the reason for this implementation? I think that when you model a Data Vault based on the datamodel of a transactional data source, the number of LINKs turns out to be higher than the number of HUBs, while up-front one expects the opposite (see a previous article / blog about Data Vault: The Datavault Adventure – Modeling the Datavault).
○ When one looks at a datamodel for a Data Vault, one will notice a large number of objects, due to the number of Satellites in the datamodel. The Reference Tables form a layer above these Satellites, which does not improve the readability of the Data Vault's datamodel.
○ But why not apply denormalization to the data sources? Is that forbidden? If one applies a certain degree of denormalization to the data sources, then – in my opinion – the Data Vault's datamodel will contain fewer objects and more HUBs, which will improve its readability and simplify the loading of the source data into the Data Vault if Surrogate Keys are implemented.
8. The Data Vault in Retrospect
Is the Data Vault Methodology the Holy Grail? Is it the egg of Columbus? Can everything be automated, so that it is no longer necessary to think? The answer to these questions is of course no.
- As a Data Modeler you still have to think
- Objectives have been assigned to the Data Vault that it cannot fulfill. The idea that it can solve everything is of course not true. That was the trigger to bust the myths in this article / blog.
- The Data Vault Methodology has contained one design flaw from the start, relating to value changes of the Foreign Keys in LINKs.
- New extensions to the Data Vault Methodology, such as Surrogate Keys and Date Last Seen, go against the principles Dan Linstedt published in his five articles in July 2002: processing huge volumes of source data within a limited timeframe by executing simple – and therefore easy to understand and to implement – ETL processes.
Are there also good things about the Data Vault method? The answer is: YES, of course there are. Because it keeps track of changes in the source data, the Data Vault can be an excellent data source for a Dimensionally Modeled Data Warehouse. That Data Warehouse can always be rebuilt from scratch while retaining its historical function, because by using the PIT table, the state of the source data at a certain moment in time is passed to the ETL process of the Dimensionally Modeled Data Warehouse.
There are also alternatives to the Data Vault. The choice is determined by the situation at the organization that wants a Data Warehouse implemented and / or by the characteristics of the data sources:
- Anchor Modeling
- Historical Data Layer
- Stacked Tables / Datasets
- Audit Tables