The records can be removed from a table in a operational system, but we never remove records from dimension tables. This post describes how the removal of records can be handled in dimension tables.
Let’s consider an example. There is a list of contact persons in a source system. The list is loaded into a dimension table in a data warehouse. Contact persons can be removed from the operational system, but not from dimension tables. The task is to list the current contact persons based on the data from the data warehouse. What can help us to identify the current records?
Closing out deleted records
The first idea could be closing out the deleted records. Closing out rows means setting its DATE_TO equal to the current date without inserting a new row. For example, when “Tony” is removed from the operational system, DATE_TO of its last version is set to the date of the deletion.
In this case the following query will return only current contact persons.
SELECT first_name, last_name FROM d_contact_person WHERE GetDate() < date_to
Some tools (e.g. SAS Data Integration Studio) provide possibility to close out the row when the corresponding row is removed from the source system. In Kettle, this can be implemented combining “Dimension lookup/update” with other steps. (The implementation will not be considered)
The closing out deleted rows solves the task, but is such approach good enough in general?
Let’s assume we altered ETL to close out the removed rows, and created the following filter in the semantic layer:
GetDate() < date_to
The dimension table D_CONTACT_PERSON can be joined with a fact table, for instance, F_OPPORTUNITIY. If we want to see a list of current contact persons together with the number of the won opportunities, the generated query can likely be the following:
SELECT d_contact_person.first_name, d_contact_person.last_name, Sum(CASE WHEN f_opportunity.status='won' THEN 1 END) won_cnt FROM d_contact_person INNER JOIN f_opportunity ON d_contact_person.contact_person_sk = f_opportunity.contact_person_sk WHERE GetDate() < d_contact_person.date_to GROUP BY d_contact_person.first_name, d_contact_person.last_name
This query can produce a wrong result (incorrect number of won opportunities), because all non-current revisions of the contact persons will be removed from result together with the linked fact rows. For instance, for the contact person Robin, the row with surrogate key 2 will be excluded from the result as well as all opportunities registered before 2012-02-16.
It is possible to alter the SQL query and condition to make this approach work. But this is not the easiest approach. In most cases: The validity dates of the dimension table (DATE_FROM, DATE_TO) should only be used for surrogate key lookup. It is never a good idea to alter the validity dates outside SCD mechanism. The dimension validity dates should never be used for report logic.
Removal Flag and Removal Date
Another approach is to add two fields to the dimension table – a flag whether the contact person is removed and a date field indicating when it was removed. In this case there is no need to alter the dimension validity dates.
The correct list of the current contact persons can be retrieved with:
SELECT first_name, last_name FROM d_contact_person WHERE removed=0
The list of current contact persons together with the number of the won opportunities:
SELECT d_contact_person.first_name, d_contact_person.last_name, Sum(CASE WHEN f_opportunity.status='won' THEN 1 END) won_cnt FROM d_contact_person INNER JOIN f_opportunity ON d_contact_person.contact_person_sk = f_opportunity.contact_person_sk WHERE removed=0 GROUP BY d_contact_person.first_name, d_contact_person.last_name
The removal date is not necessary for these queries, but this can be needed for building the reports for a historical date.
Implementation (1)
Initialize – the step marks all dimension rows as removed.
UPDATE d_contact_person SET removed=1 WHERE contact_person_sk > 0
Table input, Add constant and Dimension lookup/update – the steps read the current rows from the operational system, and mark corresponding rows in the dimension as not-removed.
The following two steps handles REMOVAL_DATE
Update removal date – the step sets REMOVAL_DATE for the removed rows.
UPDATE d_contact_person SET removal_date = Convert(DATE, GetDate()) WHERE removed = 1 AND removal_date = '22000101' AND contact_person_sk > 0
Consistency – the step resets REMOVAL_DATE to maximum date for existing rows.
UPDATE d_contact_person SET removal_date = '22000101' WHERE removed = 0 AND contact_person_sk > 0