Creating Aggregate Tables

Posted by & filed under Kettle, SQL Server.

This post demonstrates an example how to create aggregation tables in Kettle.

Let's assume the original transformation loading delta is the following:

It reads data from source database, makes some transformations and loads data into data warehouse.

The source table JobEntry has two grouping fields project_id and employee_id and one additive measure registered_hrs. The task is to update aggregate tables (by project_id, and by employee_id) without complete rebuilding them.

Before updating the fact table in data warehouse, we need to retain the current value from the fact table (step Get previous fact value). After the fact table is updated, we are updating the aggregate tables. We calculate difference between the new and old value (step Calculate change), summarize the change of value to necessary granularity (steps Sort rows, Aggregate), and add the change to the value in the aggregate table (steps Get old agg.value, Calculate new value, Insert/update agg table). The transformation may look like this:

Demo Data (SQL Server)

The source database table:

create table JobEntry (
  project_id int,
  employee_id int,
  registered_hrs numeric(22,2)
)

The script changing the table (updating existing or inserting new rows):

declare
   @project_id int = RAND() * 3,
   @employee_id int = RAND() * 3,
   @registered_hrs numeric(22,2) = RAND() * 10

declare @cnt int = (
      select COUNT(*) from JobEntry
      where employee_id = @employee_id
      and project_id = @project_id
   )

if @cnt = 0
begin
   insert JobEntry values (
      @project_id,
      @employee_id,
      @registered_hrs
   )
end
else begin
   update JobEntry set registered_hrs = @registered_hrs
   where employee_id = @employee_id
   and project_id = @project_id
end

select * from JobEntry

The transformation

The data warehouse tables:

create table f_jobentry(
  project_id int,
  employee_id int,
  registered_hrs decimal(22,2)
);
create table f_jobentry_employee (
  employee_id int,
  registered_hrs decimal(22,2)
);
create table f_jobentry_project (
  project_id int,
  registered_hrs decimal(22,2)
);

Kettle transformation

Test

  • Create the necessary tables.
  • Run the script generating data.
  • Run the transformation updating the fact table and aggregate tables.
  • Check the output tables.
  • Run the script.
  • Run the transformation.
  • Check the output tables.

Summary

This approach seems too complex. Maybe the complete rebuilding the aggregate tables is not bad…

Activity Tracker

Posted by & filed under C#.

Do you have to fill timesheet weekly? I have to.. This small tool can help you to remember what you were doing last week :)

You start the tool and minimize it to tray. It tracks activity on your computer and logs it to a file, so you can later see what you were doing. The information, the program tracks every minute, is:

  • was there any activity (mouse or keyboard input),
  • current foreground program,
  • the title of the current foreground program.

Click on the program icon in the tray, opens it. You can pick a date, and you will see the 24 hours activity chart in the bottom, and time statistics in the tree.

The color meaning on the activity chart are the following:

  • White – no information, the computer was turned off or the program was not running.
  • Gray – idle time.
  • Blue – there was some activity.
  • Red – highlights the time when the program selected in the tree was active.

Source Code

http://bukhantsov.org/tools/ActivityTracker.zip

The 64 bit executable is in

ActivityTracker.zip\ActivityTracker\ActivityTracker\bin\x64\Release

Security

The data file is stored in "My Documents" and has name activitytracker.txt.

If other people has access to the file, please think twice before using the tool. The program will log everything you do and every minute. :)

TO-DO: Password protection. The data should be stored in password protected zip file. To start the program, the user will have to enter the password for the zip file. If there is no data file, user will have to enter "new" password which will be used for the zip file.

A bit of code

The code builds a histogram and then orders the array by frequency in descending order.

Histogram:

string[] list = { "x", "b", "b", "x", "x", "z", "c", "x", "b", "c" };
Dictionary<string, int> histogram = new Dictionary<string, int>();
foreach (string s in list)
{
  if (!histogram.ContainsKey(s))
    histogram.Add(s, 1);
  else
    histogram[s] += 1;
}

The following code demonstrates use of lambda expressions for sorting (it was new to me).

List<KeyValuePair<string, int>> sortedlist = new List<KeyValuePair<string, int>>(histogram);
sortedlist.Sort((firstPair, nextPair) => -firstPair.Value.CompareTo(nextPair.Value));

foreach (KeyValuePair<string, int> p in sortedlist)
{
  Console.WriteLine(p.Key + ": " + p.Value);
}

This can also be done using LINQ:

var sortedlist = from s in histogram orderby s.Value descending select s;

foreach (var p in sortedlist)
{
  Console.WriteLine(p.Key + ": " + p.Value);
}

Update, 2011-11-19

Some bugs were corrected
Some enhancements
The new source code uploaded

Alternatives :)

Of course, the program was written mostly for fun. Consider ManicTime if you really need activity tracking software.

Use of Index and Characterset

Posted by & filed under Kettle, Oracle.

This post demonstrates influence of characterset on use of indexes. See also Language and Use of Indexes in Oracle

/*
DROP TABLE t;
ALTER SESSION SET NLS_LANGUAGE='AMERICAN';
*/

SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS;

CREATE TABLE t (x VARCHAR2(255));

INSERT INTO t VALUES ('a');
INSERT INTO t VALUES ('b');
INSERT INTO t VALUES ('c');
INSERT INTO t VALUES ('d');

CREATE INDEX t_idx ON t (x);

-- the index is used
UPDATE t SET x='a' WHERE x='a';

ALTER SESSION SET NLS_LANGUAGE='CZECH';

-- the index is still used if NLS_CHARACTERSET=WE8MSWIN1252
-- the index is not used if NLS_CHARACTERSET=WE8ISO8859P1
UPDATE t SET x='a' WHERE x='a';

Language and Use of Indexes in Oracle

Posted by & filed under Kettle, Oracle.

Symptoms

It takes a very long time to load a data warehouse on Oracle database with Kettle when the language in the Windows Region and Language settings is different from English (United States).

Cause

  • Oracle JDBC driver used by Kettle issues command ALTER SESSION SET NLS_LANGUAGE and changes default database language.
  • Changing NLS_LANGUAGE also changes oracle parameter NLS_SORT from BINARY to the specified language.
  • When NLS_COMP is equal to ANSI and the value of NLS_SORT is not BINARY, the linguistic comparison is used for WHERE statements.
  • When the value of NLS_COMP is different from BINARY a full scan is used for VARCHAR2 fields instead of index range scan.

Resolution

Add "-Duser.country=en -Duser.language=en" to java runtime options in Kettle batch files.

Experiment

When NLS_SORT is BINARY, the index is used.

SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_LANGUAGE', 'NLS_SORT', 'NLS_COMP');
UPDATE F_JOB_BUDGET_LINE SET ACTUAL_COST=0
WHERE INSTANCE_KEY='JobBudgetLine00000000000004835665';


When the  parameter NLS_LANGUAGE is changed, NLS_SORT is also changed. The index is not used when NLS_SORT=CZECH and NLS_COMP=ANSI.

ALTER SESSION SET NLS_LANGUAGE='CZECH';
SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_LANGUAGE', 'NLS_SORT');
UPDATE F_JOB_BUDGET_LINE SET ACTUAL_COST=0
WHERE INSTANCE_KEY='JobBudgetLine00000000000004835665';


We can change NLS_SORT to BINARY to make the index used again. Alternatively we can set NLS_COMP to BINARY, this will have the same effect.

ALTER SESSION SET NLS_SORT='BINARY';
SELECT PARAMETER, VALUE FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_LANGUAGE', 'NLS_SORT');
UPDATE F_JOB_BUDGET_LINE SET ACTUAL_COST=0
WHERE INSTANCE_KEY='JobBudgetLine00000000000004835665';

How to Find Running Queries – SQL Server

Posted by & filed under SQL Server.

SELECT
  CN.session_id  AS SPID,
  ST.text        AS SqlStatementText
FROM
  sys.dm_exec_connections AS CN
CROSS APPLY
  sys.dm_exec_sql_text(CN.most_recent_sql_handle) AS ST
ORDER BY
  CN.session_id

If a BusinessObjects document is refreshing for a long time, you might be interested in what query is currently running (to optimize it). The above code can help to figure out this.

Io exception: Socket read timed out

Posted by & filed under Kettle.

If you get the following message, check the firewall settings.

ERROR 30-09 09:13:36,485 - YOUR_LOG_CONNECTION - Error disconnecting from database:
Error comitting connection
 Io exception: Socket read timed out

When the transformation/job is started, Kettle opens a connection for logging. The transformation could run for hours, the connection for logging will stay idle all this time and it might be dropped by the firewall. When the transformation finishes execution, stopped by the user or fails with an error and Kettle tries to update the log, it notices that the connection is dropped and displays the error messages.
Read more »

Influence of Nr of rows in rowset on Merge Join

Posted by & filed under Kettle.

Let's consider a transformation that merges two flows:

Here are some experiments with different Nr of rows in rowset:

So the speed of the Merge join depends on the parameter Nr of rows in rowset. It should be reasonably high (3000K).

Note that if the parameter is too high, the transformation might fail with exception: java.lang.OutOfMemoryError: Java heap space.

java.lang.OutOfMemoryError: Java heap space

Posted by & filed under Kettle.

All steps in a transformation runs in parallel. The hops between steps are kind of buffers of rows. The maximum number of rows is controlled with transformation parameter Nr of rows in rowset. However it is important to understand that if this parameter is set to too high value, Kettle might not be able to allocate the required amount of memory and fail with exception java.lang.OutOfMemoryError: Java heap space.

Experiment:

Heap memory for Kettle is restricted with parameter -Xmx to 256m. We will set Nr of rows in rowset to 10M. The following transformation fails after some time when the number of required memory reaches the limit.

The error:

UnexpectedError: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
	at org.pentaho.di.core.row.RowDataUtil.allocateRowData(RowDataUtil.java:34)
	at org.pentaho.di.core.row.RowMeta.cloneRow(RowMeta.java:311)
	at org.pentaho.di.trans.steps.rowgenerator.RowGenerator.processRow(RowGenerator.java:151)
	at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2889)
	at org.pentaho.di.trans.steps.rowgenerator.RowGenerator.run(RowGenerator.java:215)

Generate rows adds a string field with the 100 bytes string. The other two steps do not do anything significant – step Dummy can be used instead of them.

We can calculate the number of rows in hops calculating the difference between read and write. Hop between Generate Rows and Add sequence holds 6445K-3650K=2795K of rows, hop between Add sequence and Calculator holds 3650K-3435K=215K of rows. So the total is 3010K rows each of which requires at least 100 bytes. The total required memory is more than 301M bytes.

Where universes and documents are stored

Posted by & filed under Administration.

BusinessObjects documents and universes are not stored in the CMS database. The CMS database contains only metadata – miscellaneous information about the objects. The files corresponding to the objects are stored in the BO File Repository, these documents can be seen in PDF format using software as sodapdf for this purpose.

For example, there is a webi document Balance Sheet in the BO. For the document, you can find its ID, CUID, and file name in Central Management Console. (FRS in the file name stands for File Repository Server)

In the example, the file name for the document is:

frs://Input/a_128/035/000/9088/adzbyinunwldinunwldindleqsaoze.wid

The BusinessObjects software is installed in C:\Business Objects, the path to the file repository server is

C:\Business Objects\BusinessObjects Enterprise 12.0\FileStore\

If you join these two, you will find the document on the server.

Internet Explorer 9 support in BusinessObjects Enterprise

Posted by & filed under Administration.

If you get error message "HTTP Status 400 – Invalid Path /Analytical Reporting/WebiView was requested. The request sent by the client was syntactically incorrect.", it is very likely that you are using IE 9.

IE 9 is not supported yet by BO. It is planned to be supported in BO BI 4.1 and BOE XI 3.1 SP5 that will be released somewhere in Q1 2012.