This post demonstrates, with an example, how to create aggregation tables in Kettle.
Let’s assume the original transformation that loads the delta is the following:
It reads data from the source database, applies some transformations, and loads the data into the data warehouse.
The source table JobEntry has two grouping fields, project_id and employee_id, and one additive measure, registered_hrs. The task is to update the aggregate tables (one by project_id, one by employee_id) without completely rebuilding them.
Before updating the fact table in the data warehouse, we need to retain the current value from the fact table (step Get previous fact value). After the fact table is updated, we update the aggregate tables: we calculate the difference between the new and old values (step Calculate change), summarize the change at the necessary granularity (steps Sort rows, Aggregate), and add the change to the value in the aggregate table (steps Get old agg.value, Calculate new value, Insert/update agg table). The transformation may look like this:
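For readers who prefer SQL to step diagrams, here is a minimal set-based sketch of the same logic for the by-employee aggregate. It assumes a hypothetical staging table stg_jobentry holding the incoming delta rows; the Kettle transformation performs the equivalent steps row by row.

-- Sketch only: stg_jobentry is a hypothetical staging table with the delta.
;with change as (
    -- Calculate change: new value minus the previous fact value
    select s.employee_id,
           s.registered_hrs - coalesce(f.registered_hrs, 0) as diff
    from stg_jobentry s
    left join f_jobentry f
           on f.project_id = s.project_id
          and f.employee_id = s.employee_id
),
agg as (
    -- Summarize the change at employee granularity
    select employee_id, sum(diff) as diff
    from change
    group by employee_id
)
-- Add the change to the current aggregate value (insert if missing)
merge f_jobentry_employee as t
using agg as s
   on t.employee_id = s.employee_id
when matched then
    update set registered_hrs = t.registered_hrs + s.diff
when not matched then
    insert (employee_id, registered_hrs)
    values (s.employee_id, s.diff);

Note that the change has to be computed against the fact table before the fact table itself is updated, which is exactly why the transformation retains the previous fact value first.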
Demo Data (SQL Server)
The source database table:
create table JobEntry (
    project_id int,
    employee_id int,
    registered_hrs numeric(22,2)
)
The script that changes the table (updating existing rows or inserting new ones):
declare @project_id int = RAND() * 3,
        @employee_id int = RAND() * 3,
        @registered_hrs numeric(22,2) = RAND() * 10

declare @cnt int = (
    select COUNT(*)
    from JobEntry
    where employee_id = @employee_id
      and project_id = @project_id
)

if @cnt = 0
begin
    insert JobEntry values ( @project_id, @employee_id, @registered_hrs )
end
else
begin
    update JobEntry
    set registered_hrs = @registered_hrs
    where employee_id = @employee_id
      and project_id = @project_id
end

select * from JobEntry
The transformation
The data warehouse tables:
create table f_jobentry (
    project_id int,
    employee_id int,
    registered_hrs decimal(22,2)
);

create table f_jobentry_employee (
    employee_id int,
    registered_hrs decimal(22,2)
);

create table f_jobentry_project (
    project_id int,
    registered_hrs decimal(22,2)
);
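For reference, the Get previous fact value and Get old agg.value steps amount to simple lookups against these tables, roughly like the queries below (a sketch; in the transformation they would presumably be configured as lookup steps with the key fields mapped to stream fields):

-- Get previous fact value (per incoming row)
select registered_hrs
from f_jobentry
where project_id = ? and employee_id = ?

-- Get old agg.value, e.g. for the employee aggregate
select registered_hrs
from f_jobentry_employee
where employee_id = ?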
Test
- Create the necessary tables.
- Run the script that generates data.
- Run the transformation updating the fact table and the aggregate tables.
- Check the output tables.
- Run the script again.
- Run the transformation again.
- Check the output tables again (the consistency query below can help).
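One easy way to check the output tables is to compare each aggregate against a fresh aggregation of the fact table; any mismatching rows indicate a bug. This query is my addition, not part of the original transformation:

-- Returns rows only if the employee aggregate disagrees with the fact table
select coalesce(a.employee_id, f.employee_id) as employee_id,
       a.registered_hrs as agg_value,
       f.registered_hrs as fact_value
from f_jobentry_employee a
full join (
    select employee_id, sum(registered_hrs) as registered_hrs
    from f_jobentry
    group by employee_id
) f on f.employee_id = a.employee_id
where coalesce(a.registered_hrs, 0) <> coalesce(f.registered_hrs, 0)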
Summary
This approach seems rather complex. Maybe completely rebuilding the aggregate tables is not that bad…