How to model Many-to-Many relationship in Data Vault OR What is a Record Tracking Satellite

At times you need to model Many-to-Many relationship that changes over time in Data Vault. For e.g., you may have HUB_CUSTOMER and HUB_ADDRESS. Several Customer may be associated with the same Address, and also multiple Addresses can be associated with a single Customer. And this can change over time. For e.g. 

Day 1:

HUB_CUSTOMER_HASHKEYHUB_ADDRESS_HASHKEY
111aaa
111bbb
111ccc

Day 2:

HUB_CUSTOMER_HASHKEYHUB_ADDRESS_HASHKEY
111aaa
111bbb
222ccc

Day 3:

HUB_CUSTOMER_HASHKEYHUB_ADDRESS_HASHKEY
111aaa
111bbb
222ccc
111ccc

 

LINK_CUSTOMER_ADDRESS will only capture the first occurrence of the relationship, but will not capture the changing nature of the relations. LINK_CUSTOMER_ADDRESS will look as follows:

LINK_CUSTOMER_ADDRESS_HKHUB_CUSTOMER_HKHUB_ADDRESS_HKLOAD_DATE
1qaz111aaaDay 1
2wsx111bbbDay 1
3edc111cccDay 1
4rfv222cccDay 2

Note that the Link is only capturing the first appearance of the relationship, but nothing about the effectivity of the relationship. This is where you need the Record Tracking Satellite. A Record Tracking Satellite is a narrow table that keeps tracks of LINK Hashkeys or SAT HASHKEY for each load. It basically captures all the HASHKEYs that are valid for that load. The Record Tracking Satellite for the above Many-to-Many relationship that changes over time will look as following

LINK_CUSTOMER_ADDRESS_HKLOAD_DATE
1qazDay 1
2wsxDay 1
3edcDay 1
1qazDay 2
2wsxDay 2
4rfvDay 2
1qazDay 3
2wsxDay 3
4rfvDay 3
3edcDay 3

This way you can capture the effectivity of Many-to-Many relationships that change over time. 

 

Data Vault Record Tracking Satellite hanging from a Link

 

Data Vault Anti-pattern: Using Effectivity Satellites as SCD2

 

In Data Vault 2.0 Effectivity Satellites are artifacts that are exclusively used to Track the temporal relevance of a relationship based on a Driving Key. As such, they hang from a Link Table.

 

Effectivity Satellites are not same as the SCD2.

So what is an Effectivity Satellite?

For an Effectivity Satellite, a Driving Key needs to be defined. For e.g. let's say we have a Link for Opportunities and relations to Accounts. The Account on Opportunity can change overtime. A common scenario is that the Opportunity is Assigned to a Global Parent account (e.g. Seagate Technology), and may get re-assigned to the Account Subsidiary (e.g. Lyve Labs) and get re-assigned to the Global Parent (Seagate Technology) at a later date

This will be tracked in the LINK as following:

LINK_HKOPPORTUNITY_HKOPPORTUNITYACCOUNT_HKACCOUNTLOAD_DATE
8913b4fac89470111e184de02Lyve Labs2023-02-21 14:43:05
8cee5f97c8947011192b3503Seagate Technology2023-02-21 14:53:57

Now the 3rd step, i.e. 111 switching back to Seagate Technology will not be captured in this LINK as a new row in this LINK

An Effectivity SAT can be used to Track effectivity (temporal relevance) of the aforementioned Opportunity to Account relationships.

LINK_HKOPPORTUNITY_HKOPPORTUNITYACCOUNT_HKACCOUNTSTART_DATEEND_DATELOAD_DATE
8913b4fac89470111e184de02Lyve Labs2023-02-21 14:43:059999-12-31 0:00:002023-02-21 14:43:05
8cee5f97c8947011192b3503Seagate Technology2023-02-21 14:53:579999-12-31 0:00:002023-02-21 14:53:57
8913b4fac89470111e184de02Lyve Labs2023-02-21 14:43:052023-02-21 14:53:572023-02-21 14:53:57
8cee5f97c8947011192b3503Seagate Technology2023-02-21 14:53:572023-02-21 15:00:312023-02-21 15:00:31
8913b4fac89470111e184de02Lyve Labs2023-02-21 15:38:309999-12-31 0:00:002023-02-21 15:38:30
8cee5f97c8947011192b3503Seagate Technology2023-02-21 15:44:429999-12-31 0:00:002023-02-21 15:44:42
8913b4fac89470111e184de02Lyve Labs2023-02-21 15:38:302023-02-21 15:44:422023-02-21 15:44:42
  1. When the relationship is first recorded only 1 record is inserted (highlighted in green)
  2. When there is a change in a relationship it must be based on one of the participants of the relationship– the driver, we track its change vs the other keys in the relationship. To do that we end_date one record and insert a new "active record"; i.e. 2 records inserted. (highlighted in red)

 


Data Vault Anti-pattern: Load Dates that are anything other than time of loading the Staging

 

By using the Load Date we should be able to identify all the Data that was loaded into DV in that particular batch. If the Load Date is some else, for e.g. the Load Date from the ETL tool, the entire batch in DV can not identified using the Load Date.

Data Vault Anti-pattern: Using varchar to store the HashKeys and HashDiffs

 

Why would you do that? HashKeys and HashDiffs are binary generated using a hashing algorithm like MD5 or SHA-1. Just store them as binary and effectively halve your storage and double your I/O! No need to convert them to the Char to store them as VARCHAR.

Data Vault Anti-pattern: Creating Hubs for Dependent Children

 

Dependent Children should not have their own Hubs. They are not Business Concepts and as such should not be a Business Key by themselves. They only make sense when associated with a Business concept. 

One example is Line Items in a Order. The Line Item in Order by itself does not make sense. It needs to be associated with an Order. This makes a Line Item a Dependent Child i.e. it is not a Business Concept on its own.

One way to model the Dependent Child is to add them in the Link as following:

 

 

 

Data Vault Anti-pattern: Implementing Business Rules at the Infomart Level

 

While it is tempting to implement Business Rules at the Infomart Level, that is not where the Business Rules should reside. They should reside in Business Vault. This enables historisation of the Business Rules and introduces auditability. When the Business Rule changes, with historisation it is possible to go back in time and analyze the impact of the Business Rule change. Now if this Business Rule was implemented at the Infomart Level, there would be no history thus no auditability. As a best-practice, the Business Vault tables should be materialized physical TABLES instead of VIEWs.

Data Vault Anti-pattern: Using Historized Links to store Transactional data that does not change

 

Transactional Data that does not change e.g. sensor data, stock trades, call center call data log, medical test results, event logs etc. should reside in a Non-historized Link (NHL) aka. Transaction Link. There is no point in using a Historized Link to store data that can not change.  All of the attributes of the Transaction can be stored within the NHL. Here is an example:

 

Note that the PATIENT_HK, LOCATION_HK, COVID_TEST_ID, and LOAD_DATE form the Unique Key for the Non-historized Link. The descriptive attributes are stored in the Non-historized Link instead of a SAT hanging from the Link.

"Non-historised links are used when the data in the source should not be modified at any time"