Category Archives: Data Warehousing

Creating Aggregate Tables

This post shows an example of how to create and incrementally update aggregate tables in Kettle.

Let’s assume the original transformation loading the delta is the following:

It reads data from the source database, applies some transformations, and loads the data into the data warehouse.

The source table JobEntry has two grouping fields, project_id and employee_id, and one additive measure, registered_hrs. The task is to update the aggregate tables (by project_id and by employee_id) without completely rebuilding them.

Before updating the fact table in the data warehouse, we need to retain the current value from the fact table (step Get previous fact value). After the fact table is updated, we update the aggregate tables: we calculate the difference between the new and the old value (step Calculate change), summarize the change at the required granularity (steps Sort rows, Aggregate), and add the change to the value already stored in the aggregate table (steps Get old agg.value, Calculate new value, Insert/update agg table). The transformation may look like this:
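In SQL terms, the aggregate branch implements roughly the following logic (a sketch for the per-project aggregate; the data warehouse tables are created in the sections below, and #changes is a hypothetical staging table standing in for the rows flowing through the transformation):

-- #changes stands in for the rows flowing through the transformation: the new
-- value coming from the source and the previous value read from the fact table
-- (steps Get previous fact value, Calculate change).
create table #changes (
  project_id   int,
  employee_id  int,
  old_hrs      numeric(22,2),
  new_hrs      numeric(22,2)
);

-- Sum the change per project (steps Sort rows, Aggregate) and add it to the value
-- already stored in the aggregate (steps Get old agg.value, Calculate new value,
-- Insert/update agg table). A brand-new project would additionally need an insert,
-- which the Insert/update step handles.
update a
set a.registered_hrs = a.registered_hrs + d.delta_hrs
from f_jobentry_project a
join (
   select project_id, sum(new_hrs - isnull(old_hrs, 0)) as delta_hrs
   from #changes
   group by project_id
) d
on d.project_id = a.project_id;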

Demo Data (SQL Server)

The source database table:

create table JobEntry (
  project_id int,
  employee_id int,
  registered_hrs numeric(22,2)
)

The script changing the table (updating existing or inserting new rows):

declare
   @project_id int = RAND() * 3,
   @employee_id int = RAND() * 3,
   @registered_hrs numeric(22,2) = RAND() * 10

declare @cnt int = (
      select COUNT(*) from JobEntry
      where employee_id = @employee_id
      and project_id = @project_id
   )

if @cnt = 0
begin
   insert JobEntry values (
      @project_id,
      @employee_id,
      @registered_hrs
   )
end
else begin
   update JobEntry set registered_hrs = @registered_hrs
   where employee_id = @employee_id
   and project_id = @project_id
end

select * from JobEntry

The transformation

The data warehouse tables:

create table f_jobentry(
  project_id int,
  employee_id int,
  registered_hrs decimal(22,2)
);
create table f_jobentry_employee (
  employee_id int,
  registered_hrs decimal(22,2)
);
create table f_jobentry_project (
  project_id int,
  registered_hrs decimal(22,2)
);

Kettle transformation

Test

  • Create the necessary tables.
  • Run the script generating data.
  • Run the transformation updating the fact table and aggregate tables.
  • Check the output tables.
  • Run the script.
  • Run the transformation.
  • Check the output tables (a verification query is sketched below).
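
To check that the aggregate tables stay consistent with the fact table, queries along these lines can be used (a sketch using the tables defined above; both statements should return no rows):

-- Projects where the aggregate differs from the sum in the fact table
select f.project_id, sum(f.registered_hrs) as fact_hrs, a.registered_hrs as agg_hrs
from f_jobentry f
join f_jobentry_project a on a.project_id = f.project_id
group by f.project_id, a.registered_hrs
having sum(f.registered_hrs) <> a.registered_hrs;

-- The same check for the per-employee aggregate
select f.employee_id, sum(f.registered_hrs) as fact_hrs, a.registered_hrs as agg_hrs
from f_jobentry f
join f_jobentry_employee a on a.employee_id = f.employee_id
group by f.employee_id, a.registered_hrs
having sum(f.registered_hrs) <> a.registered_hrs;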

Summary

This approach seems too complex. Maybe completely rebuilding the aggregate tables is not such a bad option after all…
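
For comparison, a complete rebuild of the two aggregates is just a truncate and a GROUP BY over the fact table, for example:

-- Rebuild the per-project aggregate from scratch
truncate table f_jobentry_project;
insert into f_jobentry_project (project_id, registered_hrs)
select project_id, sum(registered_hrs)
from f_jobentry
group by project_id;

-- Rebuild the per-employee aggregate from scratch
truncate table f_jobentry_employee;
insert into f_jobentry_employee (employee_id, registered_hrs)
select employee_id, sum(registered_hrs)
from f_jobentry
group by employee_id;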

Use of Index and Character Set

This post demonstrates the influence of the database character set on the use of indexes. See also Language and Use of Indexes in Oracle.

/*
DROP TABLE t;
ALTER SESSION SET NLS_LANGUAGE='AMERICAN';
*/

SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS;

CREATE TABLE t (x VARCHAR2(255));

INSERT INTO t VALUES ('a');
INSERT INTO t VALUES ('b');
INSERT INTO t VALUES ('c');
INSERT INTO t VALUES ('d');

CREATE INDEX t_idx ON t (x);

-- the index is used
UPDATE t SET x='a' WHERE x='a';

ALTER SESSION SET NLS_LANGUAGE='CZECH';

-- the index is still used if NLS_CHARACTERSET=WE8MSWIN1252
-- the index is not used if NLS_CHARACTERSET=WE8ISO8859P1
UPDATE t SET x='a' WHERE x='a';
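
Which of the two cases applies depends on the database character set; it can be checked, for example, with:

SELECT VALUE FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER = 'NLS_CHARACTERSET';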

Language and Use of Indexes in Oracle

Symptoms

It takes a very long time to load a data warehouse on an Oracle database with Kettle when the language in the Windows Region and Language settings is different from English (United States).

Cause

  • The Oracle JDBC driver used by Kettle issues the command ALTER SESSION SET NLS_LANGUAGE and changes the default session language.
  • Changing NLS_LANGUAGE also changes the Oracle parameter NLS_SORT from BINARY to the specified language.
  • When NLS_COMP is equal to ANSI and the value of NLS_SORT is not BINARY, linguistic comparison is used in WHERE clauses.
  • When linguistic comparison is used, a full table scan is performed on VARCHAR2 columns instead of an index range scan.

Resolution

Add “-Duser.country=en -Duser.language=en” to the Java runtime options in the Kettle batch files.

Experiment

When NLS_SORT is BINARY, the index is used.

SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_LANGUAGE', 'NLS_SORT', 'NLS_COMP');

UPDATE F_JOB_BUDGET_LINE SET ACTUAL_COST=0
WHERE INSTANCE_KEY='JobBudgetLine00000000000004835665';
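
Whether the index is actually used can be checked from the execution plan, for example:

EXPLAIN PLAN FOR
UPDATE F_JOB_BUDGET_LINE SET ACTUAL_COST=0
WHERE INSTANCE_KEY='JobBudgetLine00000000000004835665';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);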


When the parameter NLS_LANGUAGE is changed, NLS_SORT is also changed. The index is not used when NLS_SORT=CZECH and NLS_COMP=ANSI.

ALTER SESSION SET NLS_LANGUAGE='CZECH';
SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_LANGUAGE', 'NLS_SORT');

UPDATE F_JOB_BUDGET_LINE SET ACTUAL_COST=0
WHERE INSTANCE_KEY='JobBudgetLine00000000000004835665';


We can change NLS_SORT back to BINARY to make the index usable again. Alternatively, we can set NLS_COMP to BINARY, which has the same effect (shown below).

ALTER SESSION SET NLS_SORT='BINARY';
SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_LANGUAGE', 'NLS_SORT');

UPDATE F_JOB_BUDGET_LINE SET ACTUAL_COST=0
WHERE INSTANCE_KEY='JobBudgetLine00000000000004835665';
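
The alternative mentioned above, forcing binary comparison via NLS_COMP, is simply:

ALTER SESSION SET NLS_COMP='BINARY';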

Io exception: Socket read timed out

If you get the following message, check the firewall settings.

ERROR 30-09 09:13:36,485 - YOUR_LOG_CONNECTION - Error disconnecting from database:
Error comitting connection
 Io exception: Socket read timed out

When a transformation or job is started, Kettle opens a connection for logging. The transformation could run for hours; the logging connection stays idle all this time and might be dropped by the firewall. When the transformation finishes, is stopped by the user, or fails with an error and Kettle tries to update the log, it notices that the connection has been dropped and displays the error messages above.

Influence of Nr of rows in rowset on Merge Join

Let’s consider a transformation that merges two flows:

Here are some experiments with different Nr of rows in rowset:

So the speed of the Merge join depends on the parameter Nr of rows in rowset. It should be reasonably high (3000K).

Note that if the parameter is set too high, the transformation might fail with the exception java.lang.OutOfMemoryError: Java heap space.

java.lang.OutOfMemoryError: Java heap space

All steps in a transformation run in parallel. The hops between steps act as buffers of rows, and the maximum number of rows they hold is controlled by the transformation parameter Nr of rows in rowset. However, it is important to understand that if this parameter is set to too high a value, Kettle might not be able to allocate the required amount of memory and will fail with the exception java.lang.OutOfMemoryError: Java heap space.

Experiment:

Heap memory for Kettle is restricted to 256m with the -Xmx parameter. We set Nr of rows in rowset to 10M. The following transformation fails after some time, when the amount of required memory reaches the limit.

The error:

UnexpectedError: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
	at org.pentaho.di.core.row.RowDataUtil.allocateRowData(RowDataUtil.java:34)
	at org.pentaho.di.core.row.RowMeta.cloneRow(RowMeta.java:311)
	at org.pentaho.di.trans.steps.rowgenerator.RowGenerator.processRow(RowGenerator.java:151)
	at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2889)
	at org.pentaho.di.trans.steps.rowgenerator.RowGenerator.run(RowGenerator.java:215)

Generate Rows adds a string field containing a 100-byte string. The other two steps do not do anything significant; the Dummy step could be used instead of them.

We can calculate the number of rows in the hops from the difference between rows read and rows written. The hop between Generate Rows and Add sequence holds 6445K - 3650K = 2795K rows, and the hop between Add sequence and Calculator holds 3650K - 3435K = 215K rows. So there are 3010K rows in total, each of which requires at least 100 bytes, which means the required memory is more than 301 MB.

Dimension lookup / update properties for all transformations

Sometimes it might be useful to get various information about Kettle transformations in a list format. Here is an example that selects the update type of the fields in Dimension lookup/update steps by querying the Kettle repository tables:
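-- R_STEP_ATTRIBUTE holds one row per step attribute: CODE identifies the attribute
-- ('field_lookup', 'field_update', 'table', 'update') and NR keeps each lookup
-- field paired with its update type.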

select
   t.NAME as TRANSFORMATION_NAME
,  s.NAME as STEP_NAME
,  fn.FIELD_NAME
,  tab.TABLE_NAME
,  fu.FIELD_UPDATE
from
(
   select
      ID_STEP
   ,  NR
   ,  CAST(VALUE_STR as varchar) as FIELD_NAME
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'field_lookup'
) fn
inner join
(
   select
      ID_STEP,
      NR,
      CAST(VALUE_STR as varchar) as FIELD_UPDATE
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'field_update'
) fu
on
   fn.ID_STEP = fu.ID_STEP
and
   fn.NR = fu.NR
inner join
   R_STEP s
on
   s.ID_STEP = fn.ID_STEP
inner join
(
   select
      ID_STEP
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'update'
   and
      CAST(VALUE_STR as varchar)='Y'
) u
on u.ID_STEP = s.ID_STEP
inner join
(
   select
      ID_STEP
   ,  CAST(VALUE_STR as varchar) as TABLE_NAME
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'table'
) tab
on
   tab.ID_STEP = s.ID_STEP
inner join
   R_TRANSFORMATION t
on
   t.ID_TRANSFORMATION = s.ID_TRANSFORMATION
order by
   TRANSFORMATION_NAME
,  STEP_NAME
,  FIELD_NAME

SAS Methodology

  • SAS® Project Management Methodology
  • SAS® Intelligence Platform Implementation Methodology
  • SAS® Intelligence Solution Implementation Methodology

Project Management Methodology

  • Project Qualification
  • Project Definition
  • Project Planning
  • Project Execution
  • Project Summation

Intelligence Platform Implementation Methodology

Platform:

  • Assess and Define
  • Analyze and Evaluate
  • Design
  • Construct
  • Load
  • Final Test
  • Deploy Platform
  • Review
  • Ongoing Operations and Maintenance

Data Quality:

  • Analyze Data Quality
  • Resolve Data Quality Issues

Data Mining:

  • Define Data Mining Target
  • Create Data Mining Data Mart
  • SEMMA
  • Implement Model

Intelligence Solution Implementation Methodology

  • Solution Assessment
  • Detailed Analysis
  • Customize Design
  • Construct, Customize and Configure
  • Calibration and Integration Test
  • Deploy Solution
  • Hand Over to Client

Kimball data warehouse lifecycle

There are three tracks (besides Project Management) that run in parallel: technology, data, and business intelligence. Yes, theoretically it is possible to run them in parallel, but it seems that such a practice is quite rare.

It would be interesting to hear an example of a True Kimball Project.