Author: René van Kooten
Modeling the Adventureworks Data Vault
Modeling the Data Vault
Building a Data Vault starts with modeling the Data Vault based on an existing datamodel of a transactional source system. The result is an Entity Relationship Diagram (ERD), which consists of HUBs, LINKs and Satellites. But how to model the Data Vault?
The design of the HUBs and LINKs
The datamodel of the Adventureworks transactional source system which Microsoft delivers with SQL Server 2005 is used. Without making any agreements and defining principles to convert the ERD of Adventureworks to a Data Vault me and my colleague Mesa start individually. After completion of the datamodel we compare our results. Surprisingly each one of us comes up with more LINKs than HUBs. Upfront we expected to model more HUBs than LINKs. The reason that we result with more LINKs than HUBs is the fact that the Adventureworks transactional source system contains more tables, which are the ‘n’ tables in the “1:n” relationships and are the ‘m’ tables in the “n:m” relationships. This is how we convert the ERD of Adventureworks into a Data Vault: an entity in the Adventureworks ERD which did not contain foreign keys was turned into a Hub, otherwise the entity was turned into a Link.
The design of the Satellites
The design of the surrounding Satellites is based on a physical aspect of attributes: the change frequency. Attributes with a similar frequency rate are grouped together and form a satellite. A Hub or a Link can have more than one Satellite and a Satellite can consist of only one descriptive attribute.
The consequence of taking the Satellites into the datamodel design is that the result datamodel ends up with far more entities than the original ERD design of the transactional source system. This will not make the datamodel clearer, just the opposite. For that reason the decision is made not to include the Satellites in the Data Vault ERD.
Automated design of the Data Vault
There are tools available on the market which generate a Data Vault ERD – including Satellites – based on a physical database design of a transactional source system. That is very convenient. A data modeler doesn’t need to do anything but rather let the computer do the job.
But if one thinks logically:
- It is possible to analyze the Primary / Foreign Key constraints in the physical database design and /or to compare two datasets to detect relationships between tables. Based on the outcome, an automation tool can offer a proposal for the HUBs and LINKs in the Data Vault model
- But how can an automation tool offer a proposal for the surrounding Satellites? Based on which criteria to be found in the physical database design and in the data itself?
Recently the Data Vault methodology has been extended with a new feature: the Reference Tables. Quote: “Reference tables are a normal part of a healthy Data Vault model.”
When a Data Vault ERD contains a lot of LINKs, it will not make the datamodel easy to understand. This was an outcome, which had not been foreseen when the principles of the Data Vault methodology were defined. It explains the introduction of the Reference Tables. Reference Tables will increase the number of HUBs and therefore reduce the number of LINKs and in the Data Vault datamodel, making the datamodel easier to understand. The downside is an extra entity layer in the datamodel.
One lesson learneded is that it is preferable to have more HUBs in the Data Vault ERD. More HUBs mean more clearance.
By applying a form of de-normalization the content (and attributes) of the reference tables can be part of a Satellite of a Hub, which would normally be a Link because of the “1:n” relationship between a Reference table and an underlying table.