Data Vault model of data warehose proposed by Dan E. Linstedt is an alternative approach to the well known Ralph Kimball’s Dimensional Modeling and Bill Inmon’s 3NF data warehouse. It has not gained much recognition primarily because the author is not willing to share his ideas for free. You are welcome to support Dan by buying his book or attending Data Vault certification program 🙂
The main benefit of Data Vault is the ease of modification: you do not have to alter anything in your current model to add a new dimension or attributes of existing dimension.
The largest drawback of this modeling technique is the larger number of joins comparing to other modeling techniques, this hurts the performance. Also such complex data model makes it not relevant for the data access layer.
So let’s consider this modeling technique on an example starting from 3NF and dimensional model.
3NF Model
Consider the following 3NF data model. Usually it is close to the original model in the data source. There are three master tables (Customer, Order and Product) and one transactional table (Order Line).
SK fields are surrogate keys, BK fields are business keys aka natural keys. Surrogate key is a generated key used instead of combination of business key plus start date of validity (Valid From).
Dimensional Model
In our dimensional model, we have a fact table in the middle and dimension tables around the fact table. We added order date as a separate dimension. Customer is directly joined to the fact table.
Nothing new by now, right?
Data Vault
There are 3 types of tables in Data Vault model: Hub, Link and Satellite. Let’s consider on an example of transformation of the 3NF model to Data Vault model.
1) Instead of each master table in 3NF, we add a hub and a satellite.
2) Instead of the transactional table, we add Link table and Satellite.
3) Instead of the joins between master tables, we add Link tables.
Hub table – contains business key and retained key. The retained key is a key which is mapped to business key one-to-one. In comparison, the surrogate key includes time and there can be many surrogate keys corresponding to one business key. This explains the name of the key, it is retained with insertion of a new version of a row while surrogate key is increasing.
Why we need the retained key? There can be many sources, so we cannot use business key as it may have different type or different number of attributes in different source systems.
Link table – contains retained keys of the hubs. E.g. Customer Order Link links Customer Hub and Order Hub by the corresponding keys – Customer RK and Order RK. Note that one-to-many relationship was replaced with many-to-many relationship.
Satellite table – contains attributes of the original tables.
Resilience to Change
We can add new attributes to Customer (e.g. Customer Demographics) or even a new parent table (e.g. Delivery) without any change to the existing tables. We just need to add and populate new tables.
References
Note that this was simplified consideration, just to give an overview of the methodology. You can find detailed information on the web. I would recommend to start from wikipedia.
many thanks . crystal clear with example.
LikeLike