Author: René van Kooten
Building a Data Vault with DDM Studio (Surrogate Keys)
After a Data Vault has been modeled it is time to build it with DDM Studio, as also shown in this previous article. DDM Studio is based on the principles of the Data Vault methodology described in the five articles Dan Linstedt published in 2002. At the time the TL processes were built in dynamic SQL for DDM Studio (Q2 2009), an extension to the Data Vault methodology was introduced: Surrogate Keys. From a commercial perspective, the decision was made to also support this new extension. The result is that two DDM Studio Data Vault modules exist:
- Data Vault Classic
- Data Vault Surrogate Keys
A choice between these two modules has to be made: build a Data Vault with or without Surrogate Keys. A mixture of the two flavours cannot be implemented. This previous article covers the Classic module. This blog describes the journey of building a Data Vault with Surrogate Keys using DDM Studio 3, based on the AdventureWorks transactional source database that Microsoft delivers with SQL Server 2005 and above.
Data Vault Surrogate Keys
When loading source data into the Data Vault, each record in a HUB or LINK is assigned a unique key of datatype integer. The HUBs and LINKs in the Data Vault therefore have two unique keys: the Natural Key (NK), the unique ID of an entity in the data source (flagged as the Business Key when its value is meaningful to the business), and the Surrogate Key (SK), a meaningless unique integer value. The SK is a synonym of the NK and has been introduced to the Data Vault to get better performance when unlocking the Data Vault: the objective of having a Data Warehouse is, after all, to fulfill the information needs of Business Users.
In both DDM Studio Data Vault modules the HUBs and LINKs contain Surrogate Keys. The difference between the two modules is where those key values are used. The Data Vault module with the Surrogate Key option uses the Surrogate Key values (as sketched after this list):
- in the PK/FK relationships between LINKs and HUBs and / or other LINKs
- in the PK/FK relationships between HUBs / LINKs and their surrounding Satellites
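Expressed in plain T-SQL, the pattern looks roughly like the sketch below; this is a minimal illustration, and the table and column names are my own, not the names DDM Studio generates:

-- Minimal sketch of the Surrogate Key pattern; all names are illustrative.
CREATE TABLE dbo.HUB_Employee
(
    Employee_SK int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- Surrogate Key
    Employee_NK varchar(30)       NOT NULL UNIQUE          -- Natural Key from the source
);

CREATE TABLE dbo.HUB_Address
(
    Address_SK int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Address_NK varchar(30)       NOT NULL UNIQUE
);

-- The LINK carries the Surrogate Keys of the related HUBs as Foreign Keys;
-- the surrounding Satellites (see further on) reference the same Surrogate Keys.
CREATE TABLE dbo.LNK_EmployeeAddress
(
    EmployeeAddress_SK int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Employee_SK        int NOT NULL REFERENCES dbo.HUB_Employee (Employee_SK),
    Address_SK         int NOT NULL REFERENCES dbo.HUB_Address  (Address_SK)
);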
Extracting the Source Data
The first step in loading source data into the Datawarehouse is extracting the source data to a separate (copy) environment, which is detached from the production environment of the source OLTP (On Line Transactional Processing) systems.
The reasons for this step are:
- Performing operations on the source data costs a lot of CPU time, which causes performance loss in the production environment. To avoid this problem, making a copy of the source data to a separate environment is inevitable.
- Usually a Data Warehouse contains data coming from multiple data sources and platforms, arriving in several formats. Having an unequivocal starting point for loading source data into the Data Warehouse simplifies the ETL process. Let’s keep things simple!
The SourceDB contains all the source data to be loaded into the Data Warehouse: tables (physical data) and views (references to tables in other databases on the same instance as the DDM Studio Repository).
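For example, such a view in the SourceDB might look like the sketch below; the view name and the selected columns are my own illustration:

USE SourceDB;
GO
-- Illustrative only: a SourceDB view exposing an AdventureWorks table on the
-- same instance, without physically copying the data.
CREATE VIEW dbo.vw_HumanResources_Employee
AS
SELECT EmployeeID,
       ManagerID,
       Title,
       HireDate
FROM   AdventureWorks.HumanResources.Employee;
GO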
With the introduction of version 3 of DDM Studio, the definition of OceanBI’s ODS has broadened. All databases residing on the same instance as the DDM Studio Repository – except tempdb, model and DDM_LicenceKey – are considered part of the ODS. This was a logical evolution of the DDM Studio solution, because building a Data Warehouse with DDM Studio starts with importing the metadata of the data sources into the DDM Studio Repository.
Because the AdventureWorks database resides on the same instance as the DDM Studio Repository, the metadata of the tables and views in the AdventureWorks database is imported into the DDM Studio Repository.
After importing the metadata of the tables and views in the AdventureWorks database into the DDM Studio Repository, the result is shown in the details panel of the Source Objects. There, developer and business comments can be added at header level and at attribute level.
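Under the hood, such a metadata import boils down to reading the catalog of the source database, along the lines of the sketch below (an illustration of the idea, not DDM Studio's actual implementation):

-- Sketch of the kind of catalog query a metadata import relies on.
SELECT TABLE_SCHEMA,
       TABLE_NAME,
       COLUMN_NAME,
       DATA_TYPE,
       CHARACTER_MAXIMUM_LENGTH,
       IS_NULLABLE
FROM   AdventureWorks.INFORMATION_SCHEMA.COLUMNS
ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;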
Defining the HUBs
Although this differs from the order in which the source data is eventually loaded into the Data Warehouse, the next step is defining the HUBs in the Data Vault datamodel. In DDM Studio one chooses the ‘Add’ button at the right corner of the screen and then fills in the properties of the HUB. Because the required attributes of the HUB are assigned by DDM Studio after confirming the property values, one also has to specify the properties of the Surrogate Key and the Natural Key of the HUB.
The attributes which are assigned automatically to the HUB are:
- Surrogate Key
○ A unique non-descriptive key of the HUB in the Data Vault
- Natural Key
○ A unique (descriptive) key in the data source
- Datasource
○ The name of the data source from which the HUB record originates
In the Field Detail panel the properties of these attributes can be adjusted. Note: after having source data loaded into the HUB it is still possible to change the properties of these attributes!
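Put together, the resulting HUB table has roughly the shape sketched below; the table and column names are illustrative, and LoadDTS stands for the configurable DateTime stamp described in the note below:

-- Rough shape of a generated HUB table; names are illustrative, not the
-- names DDM Studio actually generates.
CREATE TABLE dbo.HUB_Product
(
    Product_SK int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- Surrogate Key
    Product_NK varchar(30)       NOT NULL UNIQUE,        -- Natural Key
    DataSource varchar(50)       NOT NULL,               -- originating data source
    LoadDTS    datetime          NOT NULL                -- DateTime stamp, see the note below *)
);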
*) The DateTime stamp of the HUB (and also of the LINK and the Satellite) is added automatically by DDM Studio during the execution of the ETL process. The name of this attribute is configurable for each Data Vault ObjectType, but only once, before loading data into the Data Vault; the chosen name is then applied to all the HUBs. Before confirmation:
After confirmation:
Defining the LINKs
Defining the LINKs follows the same procedure as the HUBs. The same attributes as for the HUBs are assigned automatically to the LINK. The difference is that in the Field Detail panel one adds the Foreign Keys to the LINK; the related HUB or LINK is defined in this panel.
In a Data Vault with Surrogate Keys, the Foreign Keys are the Surrogate Keys of the HUB or the LINK with which the LINK has a relationship in the datamodel. Be aware of the properties of the related Surrogate Key when defining the Foreign Key.
*) In the next version of DDM Studio the concept of “Scenario Based Attributes” will be implemented. This will simplify the usage of this functionality.
Defining the Satellites
When a HUB or a LINK is added to the DDM Studio Repository, a Satellite table is automatically added as well; in the case of Data Vault Surrogate Keys it has four attributes. The reason for this is that a HUB or a LINK has at least one Satellite. The attributes added to the Satellite table are:
- Surrogate Key
○ The Surrogate key of the HUB or LINK
- Natural Key
○ The Natural key of the HUB or LINK
- Data Source
○ The name of the data source from which the Satellite record originates
- ExampleSatelliteAttribute
○ A Satellite has at least one attribute :-)!
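Put together, such a Satellite in the Surrogate Key variant roughly looks like the sketch below (again with illustrative names; LoadDTS stands for the DateTime stamp added by the ETL process):

-- Rough shape of the automatically created Satellite; all names illustrative.
CREATE TABLE dbo.SAT_Employee
(
    Employee_SK int          NOT NULL REFERENCES dbo.HUB_Employee (Employee_SK),
    Employee_NK varchar(30)  NOT NULL,                 -- Natural Key of the HUB
    DataSource  varchar(50)  NOT NULL,                 -- originating data source
    Title       nvarchar(50) NULL,                     -- example Satellite attribute
    LoadDTS     datetime     NOT NULL,                 -- DateTime stamp added by the ETL
    CONSTRAINT PK_SAT_Employee PRIMARY KEY (Employee_SK, LoadDTS)
);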
With the Satellite menu option one defines the wanted Satellites and for each Satellite one defines the attributes. Here one defines the physical design of the Data Vault. For each HUB and LINK one creates the Satellites. This can be a time-consuming process.
Storage properties of the HUBs, LINKs and Satellites
The data of the HUBs, LINKs and Satellites are physically stored in tables of the DataVaultDB. DDM Studio offers the possibility to assign these Data Vault Objects to database schemas and to database filegroups.
The HUBs, LINKs and Satellites are automatically assigned to the default schema and filegroup, but you can overwrite this and adjust these settings to the business needs.
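At database level this assignment corresponds to ordinary schema and filegroup DDL; the sketch below illustrates the idea with invented names and an invented file path (DDM Studio itself handles this through its UI settings):

USE DataVaultDB;
GO
-- Sketch of what assigning a Data Vault object to a schema and a filegroup
-- amounts to; all names and the file path are illustrative.
ALTER DATABASE DataVaultDB ADD FILEGROUP FG_Satellites;
GO
ALTER DATABASE DataVaultDB
ADD FILE (NAME = N'DataVaultDB_Sats',
          FILENAME = N'D:\Data\DataVaultDB_Sats.ndf')   -- path is an assumption
TO FILEGROUP FG_Satellites;
GO
CREATE SCHEMA rawvault;                                 -- illustrative schema name
GO
CREATE TABLE rawvault.SAT_Product
(
    Product_SK int      NOT NULL,
    LoadDTS    datetime NOT NULL,
    ListPrice  money    NULL,
    CONSTRAINT PK_rawvault_SAT_Product PRIMARY KEY (Product_SK, LoadDTS)
) ON FG_Satellites;                                     -- data lands in the assigned filegroup
GO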
DDM Studio Workmethod: Staging Area Deltas and Transformationviews
The first ETL process supported by DDM Studio was the one that loads a Dimensionally Modeled Data Warehouse. Data operations such as data cleansing, data enrichment and data transformation are part of this ETL process and can technically occur in several parts of it. When these data operations are spread over several parts of the ETL process (and a TL process has to be built for each HUB and LINK in the Data Vault, or for each Dimension and Fact in the Dimensionally Modeled Data Warehouse), that has a negative effect on the maintainability of the whole Data Warehouse solution.
To be in control of the whole Data Warehouse solution, the steps to load source data into the Data Warehouse with DDM Studio have been formalized into the DDM Studio Workmethod. The goal of the DDM Studio Workmethod is to build Data Warehouses that meet several quality criteria. These criteria have been determined on the basis of a set of questions:
Controllability
Are we “in control” of the solution?
Maintainability
Is the solution maintainable? Transparency and simplicity of the built components have a great influence here and are of great importance.
Learnability
Is it possible for others to learn the necessary skills?
Transferability
Is it possible for another person to take over the maintenance of the solution in a relatively easy way?
Extensibility
Can the solution be extended with other data sources in a relatively easy way?
Traceability
Can the data in the Data Warehouse be traced back 1:1 to the data sources? This criterion relates to Data Lineage.
Scalability
Does the solution also function properly on a Technical Infrastructure that offers far more resources than the Technical Infrastructure for which / on which the solution was originally built? This quality criterion is determined by the choice of the RDBMS for the Data Warehouse, for instance Microsoft SQL Server.
The DDM Studio Workmethod gives guidelines for the steps performed during the execution of the ETL process. Because DDM Studio loads the source data into the Data Warehouse by combining metadata of the Data Warehouse destination objects stored in the DDM Studio Repository with dynamic SQL, the focus of the DDM Studio Workmethod is on the Transformation part of the ETL process. This, too, is metadata stored in the DDM Studio Repository.
The transformation consists of two steps:
- The Staging Area Deltas (SADs)
○ The SADs transport the source data from the data sources to the StagingArea. Though this is considered to be one step, it can consist of more than one step. Dependencies between these steps occur, and to have them executed in the logical sequence, these dependencies have to be stored in the DDM Studio Repository.
○ The SADs solve possible problems related to the way Master Data Management has been implemented in the applications that support the company processes. The end result of the SAD step is a homogeneous dataset.
- The Transformation Views (TVs)
○ The TV maps data from the SAD objects to the Data Warehouse destination objects; in the case of a Data Vault, to a HUB or a LINK with its surrounding Satellites.
○ In the TV, the homogeneous datasets produced by the SADs are cleansed, enriched and / or transformed. These data operations might occur in the SADs only when there is no other possibility; these exceptions can be argued and documented!
○ The TV may contain Business Rules (stored as metadata in the DDM Studio Repository) to be applied against the source data.
○ The outcome of the TV may not result in more or fewer records than the originating source.
For the Data Vault modeler pur sang these steps are unnecessary. They are, however, part of the DDM Studio solution, so they are mandatory. Keep in mind that DDM Studio is a standard Data Warehouse Automation solution whose goal is to meet the wishes of as many customers as possible. Also, the SADs and TVs can be stored as views in the StagingAreaDB, so the loss of performance and the extra storage needs can be kept to an absolute minimum.
Staging Area Deltas
Creating SADs with DDM Studio has evolved with each version, and with each version it becomes easier to build SADs. The idea when building a Data Warehouse is that one also keeps Data Lineage in mind: it must be possible to track the data flow over each step in the ETL process. To achieve this, one maps attributes from data sources to attributes of destination objects, which in a following step can themselves be a data source for other destination objects. In some situations the characteristics of the source data do not allow this goal to be achieved by plain attribute mapping. For these situations it is also possible to execute a SQL script ending in the creation of a SAD object in the StagingAreaDB. The picture below is an example of a SAD object built on Data Lineage principles.
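As a textual counterpart to such a picture, a SAD for Employee might look like the sketch below; it builds on the SourceDB view sketched earlier, and all object and column names are illustrative:

USE StagingAreaDB;
GO
-- Illustrative SAD view producing a homogeneous employee dataset.
CREATE VIEW dbo.SAD_Employee
AS
SELECT 'AdventureWorks' AS DataSource,
       RIGHT(REPLICATE('0', 14) + CAST(e.EmployeeID AS varchar(14)), 14) AS Employee_NK,
       e.ManagerID,
       e.Title,
       e.HireDate
FROM   SourceDB.dbo.vw_HumanResources_Employee AS e;
GO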
Transformationviews
A transformationview transfers data from one or more SADs to a Data Warehouse destination object. In this case these destination objects are Data Vault HUBs and LINKs together with their surrounding Satellites. Building a transformationview works the same way as building a SAD object by Data Lineage.
Knowledge of T-SQL is necessary, just as with building SADs, because while choosing the SADs and the attributes, a SQL statement for the transformationview is composed. You can copy this SQL statement from the UI and run it in SSMS (SQL Server Management Studio) to check the result.
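Such a composed statement might look roughly like the sketch below; the names are illustrative and the statement DDM Studio composes will differ in detail:

USE StagingAreaDB;
GO
-- Sketch of a transformationview feeding the Employee HUB and its Satellite.
CREATE VIEW dbo.TV_HUB_Employee
AS
SELECT s.Employee_NK,
       s.DataSource,
       LTRIM(RTRIM(s.Title)) AS Title,   -- light cleansing belongs in the TV
       s.HireDate
FROM   dbo.SAD_Employee AS s;
GO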
Loading the HUBs and LINKs in a Data Vault with Surrogate Keys
When you have completed the building of a HUB or a LINK, you can process it via the load option of the UI. A prerequisite is that the SAD objects on which the HUB or LINK depends have also been created physically; this too can be done via an option of the UI. With SSMS you check the results. Processing HUBs and / or LINKs with the UI is only done in the building phase.
Normally the Data Vault is loaded in a batch. When a Data Vault without Surrogate Keys is built, the order in which the HUBs and LINKs are processed is not important. This situation meets one of the first principles of the Data Vault methodology: HUBs and LINKs can be loaded independently. But in a Data Vault with Surrogate Keys you have to deal with the PK/FK constraints of the RDBMS, so the order in which the HUBs and LINKs are processed is extremely important; otherwise the Data Vault may contain less data than the data source!
To get information about the dependencies between LINKs and HUBs and / or other LINKs, I have built a SQL script. I have used it to place the LINKs in the right order of execution in the ProcessQueue panel of the LINK. The LINKs without a dependency have been divided over the other process queues: see the picture below. Oh yeah, I forgot to mention that DDM Studio loads the HUBs and LINKs asynchronously. Eight process queues are available, so when there are eight or more CPUs available in the Data Warehouse server, a maximum of eight LINKs are processed simultaneously.
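A query along the lines of the sketch below can provide that dependency information (this illustrates the idea; the actual script used in the project may differ):

USE DataVaultDB;
GO
-- Sketch: list PK/FK dependencies between Data Vault tables, as input for
-- ordering the LINKs over the process queues.
SELECT OBJECT_SCHEMA_NAME(fk.parent_object_id)     AS dependent_schema,
       OBJECT_NAME(fk.parent_object_id)            AS dependent_table,    -- the LINK
       OBJECT_SCHEMA_NAME(fk.referenced_object_id) AS referenced_schema,
       OBJECT_NAME(fk.referenced_object_id)        AS referenced_table    -- the HUB / other LINK
FROM   sys.foreign_keys AS fk
ORDER BY dependent_schema, dependent_table;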
Lessons Learned
When building a Data Warehouse – and thus also a Data Vault – one always has to take the characteristics of the source data into account!
1. PK/FK constraints in an RDBMS
When one builds a Data Vault with the Surrogate Key option, one has to deal with the PK/FK constraints of the RDBMS, so the order in which the HUBs and LINKs are processed is extremely important; otherwise the Data Vault may contain less data than the data source! When a Data Vault is built with the Surrogate Key option, a forced execution order arises due to the dependencies between LINKs and HUBs / other LINKs.
DDM Studio offers the possibility to mark HUBs and LINKs as “Late Arrived Possible” (like Late Arriving Dimensions or Early Arriving Facts in a Dimensionally Modeled Data Warehouse), which inserts not yet existing Natural Keys into the HUB or LINK. Be aware that the use of this option has a negative side effect on the performance of the ETL process and that this option goes against the principles of the Data Vault methodology described in the five articles Dan Linstedt published in 2002. Also, on the Data Vault community blogs one cannot find information about the PK/FK constraint issues in a Data Vault with the Surrogate Key option.
Some data sources contain Self-Referencing / Parent-Child relationships. In a Data Vault built with the Surrogate Key module you have to deal with this self-referencing; the situation occurs at LINKs. The AdventureWorks database contains a Parent-Child relationship at Employee: except for the Director of the company, every employee has a manager. To solve this Self-Referencing / Parent-Child relationship in the Data Vault with the Surrogate Key option, one can create an extra HUB: Manager. The Employee LINK then contains a FK to the Manager HUB. Both the Employee LINK and the Manager HUB originate from the same SAD object.
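A sketch of how the Manager HUB can be fed from that shared SAD object (illustrative names, building on the earlier SAD_Employee sketch):

USE StagingAreaDB;
GO
-- Sketch: the Manager HUB is fed from the same Employee SAD as the Employee
-- LINK, simply by taking the distinct managers.
CREATE VIEW dbo.TV_HUB_Manager
AS
SELECT DISTINCT
       s.DataSource,
       RIGHT(REPLICATE('0', 14) + CAST(s.ManagerID AS varchar(14)), 14) AS Manager_NK
FROM   dbo.SAD_Employee AS s
WHERE  s.ManagerID IS NOT NULL;   -- the Director has no manager
GO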
In the Data Vault with the Surrogate Key module one also has to deal with optional / non-mandatory relationships. In that situation the Foreign Key in a table of a data source does not contain a value (or contains the NULL value). To be able to retrieve the Surrogate Key of the HUB or LINK with which the LINK containing the Foreign Key has a relationship, the Foreign Key needs to have a dummy value that can be found as a Natural Key in that HUB or LINK. The domain values of the Natural Key of the HUB or LINK are shared with the Foreign Key of the LINK, but the NULL value is not part of that domain. When an INNER JOIN between two tables is performed and the Foreign Key of the JOINed table contains a NULL value, that record will not become part of the resultset. The risk is that not all the records in a data source are loaded into the Data Vault. To overcome this situation I have:
- Added a dummy value to a couple of HUBs and LINKs; I have done that at the SADs by adding an extra dataset to the data source.
- Anticipated the NULL value of the Foreign Key by applying the ISNULL function to the Foreign Key and setting the replacement value to the dummy value of the HUB / LINK. I have done that at the Transformationview (see the sketch after the note below).
*) In a Dimensionally Modeled Data Warehouse the Dimensions always – it is standard practice – contain a dummy record to secure the PK/FK relationships with the Facts. The ISNULL function is always applied to the FKs in the Fact, also at the Transformationview. This practice has shown that it is convenient if the ETL process automatically assigns a dummy record to the HUBs and LINKs; then it is not necessary to add an extra dataset to the data source at the SADs.
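In the transformationview of the Employee LINK that anticipation looks roughly like the sketch below; the dummy Natural Key value and all names are illustrative choices:

-- Sketch: replace a missing ManagerID with the dummy Natural Key that was
-- added to the Manager HUB, so the INNER JOIN to the HUB keeps the record.
USE StagingAreaDB;

SELECT s.Employee_NK,
       ISNULL(RIGHT(REPLICATE('0', 14) + CAST(s.ManagerID AS varchar(14)), 14),
              REPLICATE('0', 14)) AS Manager_NK,   -- falls back to the dummy NK
       s.DataSource
FROM   dbo.SAD_Employee AS s;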
2. Unique Composed Keys
Unfortunately, a great number of data sources contain composed keys as the unique ID. No matter the datatype of each individual attribute, the composed key is of character datatype and preferably of fixed length. An example of a composed key with fixed length is:
RIGHT(REPLICATE('0', 14) + CAST(EmployeeID AS varchar(14)), 14) + '_' + RIGHT(REPLICATE('0', 14) + CAST(AddressID AS varchar(14)), 14)
With DDM Studio one can define these composed keys in the Transformationview, but that is not the right place. One defines the composed keys in the SAD, because the Data Vault can contain data from more than one data source and the objective of the SAD is to get a homogeneous dataset. The SAD contains the result of a Data Integration process, and the (not) implemented Master Data Management rules have to be taken into account:
- Master Data Management has not been implemented
○ The source data is stacked and the Natural Key is prefixed with the name of the data source (see the sketch after this list)
- Master Data Management has been implemented (1), so for instance EmployeeID refers to the same employee in every application used by the organization
○ The source data is not stacked and one application has been assigned as the originating data source
○ Separate SADs are built for the other data sources. These data sources may contain attributes / properties of the Master Data Object which do not occur in the other data sources. These SADs are used in the Transformationview to enrich the dataset of the originating data source
- Master Data Management has been implemented (2), but the EmployeeID domain has been divided into ranges and each range has been assigned to one application
○ The source data is stacked, but the Natural Key does not have to be prefixed with the name of the data source (because it is unique – or should be! – in the whole domain)
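For the first scenario, a stacked SAD might look like the sketch below; 'HRSystemX' and its table are purely hypothetical, only AdventureWorks actually exists in this example:

-- Sketch for the first scenario (no Master Data Management): stack the sources
-- and prefix the Natural Key with the name of the data source.
SELECT 'AdventureWorks' AS DataSource,
       'AdventureWorks_' + CAST(e.EmployeeID AS varchar(14)) AS Employee_NK,
       e.Title
FROM   AdventureWorks.HumanResources.Employee AS e
UNION ALL
SELECT 'HRSystemX' AS DataSource,
       'HRSystemX_' + CAST(h.EmpNo AS varchar(14)) AS Employee_NK,
       h.JobTitle
FROM   HRSystemX.dbo.Employee AS h;   -- hypothetical second data source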
3. Loading the Data Vault Sequence Surrogate Number
In the five articles of Dan Linstedt published in 2002, the Sequence Surrogate Number is mentioned as an optional attribute of the Satellite’s unique key. The AdventureWorks database contains a table called WorkOrderRouting in the Production schema. That table contains an attribute called OperationSequence. Logically, this looks like an example of a Sequence Surrogate Number, so this attribute has been defined in the Satellite as SSN.
But how to load a Data Vault Sequence Surrogate Number, and what are my experiences with it? The answer to that question is quite simple: the SSN attribute is part of the unique composed NK of the LINK. If I omit this attribute from the composed NK, I get a problem with the uniqueness constraint in the RDBMS. Omitting this attribute from the composed NK is therefore not an option. And because this attribute has to be part of the composed NK, it is not necessary for the SSN to become part of the Satellite’s unique key (= NK + Load DateTime Stamp). My conclusion is that the SSN becomes just a flag to identify attributes like OperationSequence in the WorkOrderRouting table in the Production schema of the AdventureWorks database.
4. Labour Intensive
Building a Data Vault is labour intensive. Because of the higher normalization level / Normal Form, the number of objects in the Data Vault is larger than in the (transactional) data source. In its extreme form the Data Vault can be in 6NF, which means that every attribute of a HUB or a LINK is placed in its own Satellite (somewhat like Anchor Modeling). With DDM Studio the identified HUBs and LINKs are defined, and DDM Studio automatically assigns the known attributes to these entity types. Also, one Satellite is automatically created. And now the real labour-intensive activities take place: designing the Satellite architecture of each HUB and LINK based on the change rate of each attribute. Food for thought for a future release of DDM Studio.