Posts Categorized: Data Warehousing

Format date in Batch yyyyMMdd

Posted by & filed under Java, Kettle.

The usual approach to build a timestamp for a log files in a batch is to use standard Windows utilities date and time. However the format of the output depends on locale and it is almost not possible to make the script which runs on any machine. A solution might be to create a small java […]

Archiving Pentaho log table in SQL Server

Posted by & filed under Kettle, Maconomy.

Pentaho Data Integration displays only the last 50 log entries but anytime you open a transformation it has to read the whole log table. When the log table gets large, opening a transformation may take long time and PDI freezes for this time. A solution is to archive old entries from the log table. Create […]

Removing time part of a date fails in Kettle

Posted by & filed under Kettle.

Might be an interesting info. I encountered an error running a Pentaho Data Integration transformation. After simplification I got this: Generate Row creates one row with the date field DATETIME=1981-01-01. Calculator calculates field DATE using operation "Remove time from a date". And this trivial transformation failed with a weird error: Unexpected error : java.lang.IllegalArgumentException: MINUTE […]

Slowly Changing Fact Tables

Posted by & filed under Design.

Let's assume the organization has a huge number of customers and the task is to track very detailed customer profiles that include customer preferences, market segmentation, customer geography etc. The customer profiles are changing very often. The table would be very wide, huge and growing very fast. We can split the dimension table in a […]

What is Data Vault?

Posted by & filed under Design.

Data Vault model of data warehose proposed by Dan E. Linstedt is an alternative approach to the well known Ralph Kimball's Dimensional Modeling and Bill Inmon's 3NF data warehouse. It has not gained much recognition primarily because the author is not willing to share his ideas for free. You are welcome to support Dan by […]

How to Obfuscate Password for Command Line Kettle

Posted by & filed under Kettle.

Pentaho Kettle does not provide possibility to obfuscate password in batch files (though the connection passwords are obfuscated in XML files). You can write: kitchen.bat /rep repos /job test /user admin /pass admin but this will not work: kitchen.bat /rep repos /job test /user admin /pass "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4" Here "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4" is encrypted version of "admin". You can get […]

Loading Fact Tables with PyGramETL

Posted by & filed under Data Warehousing.

The previous post described how to load dimension tables with PyGramETL. This one is about fact tables. The source CVS file has the following structure: cust_id;res_date;future_guests;future_revenue 105;2008/04/01 00:00:00.000;2;300 106;2007/04/17 00:00:00.000;4;380 106;2007/04/17 00:00:00.000;4;768 … Fact Table CREATE TABLE reservation_fact( customer_sk INT, reservation_date DATETIME, future_guests FLOAT, future_revenue FLOAT ) Kettle PyGramETL

Loading Dimension Tables with PyGramETL

Posted by & filed under Data Warehousing.

There is a promising Python framework for ETL – PyGramETL (http://pygrametl.org/) This post demonstrates an example how to import data from a CSV file and load it into a dimension table using PyGramETL in comparison with the same functionality implemented in a Kettle transformation. The input file customer.txt has the following structure: customer_id;first_name;last_name;age;phone_number;address 101;Paul;Brendt;19;(212) 555 […]

Handling Records Removal in Dimension Tables

Posted by & filed under Kettle.

The records can be removed from a table in a operational system, but we never remove records from dimension tables. This post describes how the removal of records can be handled in dimension tables. Let's consider an example. There is a list of contact persons in a source system. The list is loaded into a […]