markerMaatwerkoplossingen

markerBetrokken & direct

markerResultaatgericht

markerSamen voor de beste oplossing

The Data Vault Adventure
An introduction to Data Vault modeling

Author: Mesa Jajan

An introduction to Data Vault modeling

What is Data Vault modeling?

Data Vault is a method to model a database which is designed to provide long-term historical storage of data extracted from multiple operational systems. Apart from the modeling aspect, this method deals with issues such as auditing and tracing data from a historical point of view. Note that the Data Vault can’t track audit changes that occurred in the source system.

There are several aspects that point out in Data Vault modeling.

  • It enables till a certain degree for an auditor to trace values back to the source. It offers to ability to trace the origin of the data in the database.
  • Structural information is separated from descriptive attributes. The thought behind this was to be resilient to change in the operational systems of the data. Practice proves that the Data Vault modeling faces some difficulties when changes in the operation systems occur. This topic will be discussed extensively in our next article.
  • No distinction is being made between good and bad data. Expressed by Linstedt as “all the data, all of the time”. The Data Vault stores a single version of the facts. This is the complete opposite to the dimensional modeling method of storing “a single version of the truth”
  • Data Vault is designed to enable parallel loading as much as possible. This makes it possible to scale out large implementations.

History and philosophy of Data Modeling for Data Warehousing

There are two leading approaches to storing data in a Data Warehouse — the most important approaches are the normalized and the dimensional approach.

The normalized approach
The concept of data modeling dates back to the early 1906’s. Codd, while working for IBM founded the relational model for database management. This was called the 3rd Normal Form (3NF) for On-Line Transaction Processing (OLTP) or the normalized approach. In the early 1980’s it was adapted to meet the growing needs of data by Inmon. Inmon, has defined a Data Warehouse as a centralized repository for the entire enterprise. The top-down approach is designed using a normalized enterprise data model. “Atomic”data, that is, data at the lowest level of detail, are stored in the Data Warehouse

The dimensional approach
In the late 1980’s the concept of Business Date Warehousing was invented by Devlin and Murphy. The definition of the Data Warehouse focuses on data storage. The main source of the data is cleaned, transformed, cataloged and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a Data Warehousing system. This was executed through star schema modeling. Star schema modeling was architected to solve subject-oriented problems including aggregations, data model structural change, query performance, reusable or shared information, ease of use, and the ability to support On-Line Analytical Processing (OLAP). This single subject centric architecture became known as a data mart. Kimball, designed an approach to Data Warehouse design known as bottom-up. In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. Kimball’s creation of the Enterprise Bus Matrix is a top-down approach. This model is an output of the Business requirements phase. The Kimball methodology covers the full scope, only you work bottom-up.

Data Vault modeling
Data Vault Modeling was conceived by Linstedt in the 1990’s. According to Linstedt a new modeling technique was necessary due to performance and weaknesses of 3NF and star schema as the volume of data increased. He claims that Data Vault is architected to overcome these shortcomings while retaining the strengths of 3NF and star schema architectures. This modeling style is a hybrid design, consisting of the best practices from both 3NF and star schema. The Data Vault model is not a true 3NF and breaks some of the rules that 3NF dictates be followed. It is however, a top-down architecture with a bottom up design. The Data Vault model is geared to be strictly a Data Warehouse. It is not geared to be end-user accessible, which when built, still requires the use of a data mart or star schema based release area for business purposes.

The Data Vault Components
There are three main components; the HUB, LINK and Satellite Entities. The HUB is representing the primary key. The LINK Entities provide transaction integration between the HUBs. The Satellite Entities provide the context of the HUB primary key.

HUB
HUBs contain a list of unique business keys with low propensity to change. These are the keys that the businesses utilize in every day operations. For example, invoice numbers, employee numbers and customer numbers.Hubs also contain a surrogate key for each HUB item and metadata describing the origin of the business key. The HUB is established the first time a new instance of that business key is introduced to the Enterprise Data Warehouse. It contains no descriptive information and contains no Foreign Keys.

The HUB contains at least the following fields:

  • Surrogate Key – A unique identifier
  • Business Key – It may consists out of multiple fields
  • Load Date Time Stamp – Recording when the key arrived in the data warehouse
  • Record Source – Recording of the source system to trace the data

LINK
Associations or transactions between business keys. These tables are basically a physical representation of a many-to-many 3NF relationship. It is established the first time this new unique association is presented to the EDW.  It can represent an association between several HUBs and other LINKs.  It does maintain a 1:1 relationship with the business defined association between that set of keys.  Just like the HUB, it contains no descriptive information.

The LINK contains the following attributes:

  • Surrogate Key – Used if there are more than two HUBs through this LINK.
  • HUB 1 Key to HUB N Key – Hub Keys migrated into the Link or Links into other LINKs.
  • Load Date Time Stamp – Recording when the key arrived in the data warehouse
  • Record Source – Recording of the source system to trace the data

Satellite
The Satellite contains the descriptive information (context) for a business key.  These consist of metadata linking them to their parent HUB or LINK, metadata describing the origin of the association and attributes, as well as a timeline with start and end dates for the attribute. There can be several Satellites used to describe a single business key (or association of keys) however a Satellite can only describe one key (HUB or a LINK).  All of its information is subject to change over time; therefore the structure must be capable of storing new or altered data at the granular level. There is a good amount of flexibility afforded the modelers in how they design and build Satellites.  These attributes are stored both with regards to the details of the matter as well as the timeline and can range from quite complex to quite simple. The Satellite is comprised of the following attributes:

  • Satellite Primary Key – HUB or LINK Primary Key
  • Satellite Primary Key – Load Date Time Stamp
  • Satellite Optional Primary Key – Sequence Surrogate Number
  • Record Source – Recording of the source system to trace the data

Reference tables
Reference tables are a normal part of a healthy Data Vault model. They are there to prevent redundant storage of simple reference data that is referenced a lot. Reference tables are referenced from Satellites, but never bound with physical foreign keys.

Point-in-time-tables
The point-in-time table contains a copy of the source data, with an additional system column containing the timestamp of approximately when the particular row was inserted or updated at the source server.

Further
The Data Vault is a method that is used widely within Business Intelligence. Together with my colleague René we decided to model a Data Vault from the Adventureworks database.  Afterwards we build this model with our Dynamic Data Modeling tool; DDM Studio.

Data Vault model