The usual approach to build a timestamp for a log files in a batch is to use standard Windows utilities date and time. However the format of the output depends on locale and it is almost not possible to make the script which runs on any machine. A solution might be to create a small java […]
Category Archives: Data Warehousing
Archiving Pentaho log table in SQL Server
Pentaho Data Integration displays only the last 50 log entries but anytime you open a transformation it has to read the whole log table. When the log table gets large, opening a transformation may take long time and PDI freezes for this time. A solution is to archive old entries from the log table. Create […]
Removing time part of a date fails in Kettle
Might be an interesting info. I encountered an error running a Pentaho Data Integration transformation. After simplification I got this: Generate Row creates one row with the date field DATETIME=1981-01-01. Calculator calculates field DATE using operation "Remove time from a date". And this trivial transformation failed with a weird error: Unexpected error : java.lang.IllegalArgumentException: MINUTE […]
Slowly Changing Fact Tables
Let's assume the organization has a huge number of customers and the task is to track very detailed customer profiles that include customer preferences, market segmentation, customer geography etc. The customer profiles are changing very often. The table would be very wide, huge and growing very fast. We can split the dimension table in a […]
What is Data Vault?
Data Vault model of data warehose proposed by Dan E. Linstedt is an alternative approach to the well known Ralph Kimball's Dimensional Modeling and Bill Inmon's 3NF data warehouse. It has not gained much recognition primarily because the author is not willing to share his ideas for free. You are welcome to support Dan by […]
Top 144 Kimball Design Tips by the Number of References in Google Search
Rank # Design Tip Refs 1 ) 51 Latest Thinking on Time Dimension Tables 1310 2 ) 5 Surrogate Keys for the Time Dimension 1140 3 ) 122 Call to Action for ETL Tool Providers 616 4 ) 115 Kimball Lifecycle in a Nutshell 519 5 ) 46 Another Look at Degenerate Dimensions 443 6 […]
How to Obfuscate Password for Command Line Kettle
Pentaho Kettle does not provide possibility to obfuscate password in batch files (though the connection passwords are obfuscated in XML files). You can write: kitchen.bat /rep repos /job test /user admin /pass admin but this will not work: kitchen.bat /rep repos /job test /user admin /pass "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4" Here "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4" is encrypted version of "admin". You can get […]
Loading Fact Tables with PyGramETL
The previous post described how to load dimension tables with PyGramETL. This one is about fact tables. The source CVS file has the following structure: cust_id;res_date;future_guests;future_revenue 105;2008/04/01 00:00:00.000;2;300 106;2007/04/17 00:00:00.000;4;380 106;2007/04/17 00:00:00.000;4;768 … Fact Table CREATE TABLE reservation_fact( customer_sk INT, reservation_date DATETIME, future_guests FLOAT, future_revenue FLOAT ) Kettle PyGramETL
Loading Dimension Tables with PyGramETL
There is a promising Python framework for ETL – PyGramETL (http://pygrametl.org/) This post demonstrates an example how to import data from a CSV file and load it into a dimension table using PyGramETL in comparison with the same functionality implemented in a Kettle transformation. The input file customer.txt has the following structure: customer_id;first_name;last_name;age;phone_number;address 101;Paul;Brendt;19;(212) 555 […]
Handling Records Removal in Dimension Tables
The records can be removed from a table in a operational system, but we never remove records from dimension tables. This post describes how the removal of records can be handled in dimension tables. Let's consider an example. There is a list of contact persons in a source system. The list is loaded into a […]