Category Archives: Data Warehousing


Format date in Batch yyyyMMdd

The usual approach to building a timestamp for log files in a batch script is to use the standard Windows utilities date and time. However, the format of their output depends on the locale, so it is almost impossible to write a script that runs on any machine. A solution is to create a small Java tool.

Source file (datetime.java):

import java.text.SimpleDateFormat;
import java.util.Calendar;
public class datetime {
  public static void main(String[] args) {
    System.out.println(new SimpleDateFormat("yyyyMMdd_HHmmss")
          .format(Calendar.getInstance().getTime()));
  }
}

Compilation (you need a JDK for this):

javac datetime.java

The code will be compiled into datetime.class.

Use in a batch file:

java datetime>logdate.txt
set /p STAMP=<logdate.txt
set LOG=C:\logs\%STAMP%.log
call Main.bat > %LOG%

Archiving Pentaho log table in SQL Server

Pentaho Data Integration displays only the last 50 log entries, but whenever you open a transformation it has to read the whole log table. When the log table gets large, opening a transformation may take a long time, during which PDI freezes.

A solution is to archive old entries from the log table.

Create an archive table ETL_LOG_ARCHIVE with a structure identical to that of the log table ETL_LOG:

SELECT * INTO ETL_LOG_ARCHIVE FROM ETL_LOG WHERE 0=1

Move old rows from ETL_LOG to ETL_LOG_ARCHIVE:

DELETE FROM ETL_LOG
OUTPUT DELETED.* INTO ETL_LOG_ARCHIVE
WHERE ID_BATCH IN (
   SELECT ID_BATCH FROM 
   (
      SELECT 
         ID_BATCH, 
         ROW_NUMBER() OVER (
            PARTITION BY TRANSNAME 
            ORDER BY ID_BATCH DESC) RN 
         FROM ETL_LOG
   ) A
   WHERE RN > 50
) 
AND ID_BATCH NOT IN (
   SELECT MAX(ID_BATCH) 
   FROM ETL_LOG
   WHERE STATUS='end' 
   GROUP BY TRANSNAME
)

We want to keep the last 50 rows for each transformation, but we also need to keep the last successful execution of each transformation. Otherwise there could be a problem with an incremental load if a transformation has failed more than 50 times in a row.

Removing time part of a date fails in Kettle

This might be interesting to know.

I encountered an error running a Pentaho Data Integration transformation. After simplification I got this:

Generate Rows creates one row with the date field DATETIME=1982-01-01. Calculator computes a field DATE using the operation “Remove time from a date”. This trivial transformation failed with a weird error:

Unexpected error : 
java.lang.IllegalArgumentException: MINUTE
	at java.util.GregorianCalendar.computeTime(Unknown Source)
	at java.util.Calendar.updateTime(Unknown Source)
	at java.util.Calendar.getTimeInMillis(Unknown Source)
	at java.util.Calendar.getTime(Unknown Source)
	at org.pentaho.di.core.Const.removeTimeFromDate(Const.java:1958)
	at org.pentaho.di.core.row.ValueDataUtil.removeTimeFromDate(ValueDataUtil.java:628)
	at org.pentaho.di.trans.steps.calculator.Calculator.calcFields(Calculator.java:476)
	at org.pentaho.di.trans.steps.calculator.Calculator.processRow(Calculator.java:165)
	at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2889)
	at org.pentaho.di.trans.steps.calculator.Calculator.run(Calculator.java:626)

After some digging into the Pentaho code, I replicated the error in plain Java:

import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class Prog {

    public static void main(String[] args) throws ParseException {
        //String dateStr = "1982-01-01", timeZoneID = "Europe/Berlin";
        //String dateStr = "1981-01-01", timeZoneID = "Asia/Singapore";
        //String dateStr = "1982-01-01", timeZoneID = "Asia/Seoul";
        String dateStr = "1982-01-01", timeZoneID = "Asia/Singapore";

        DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
        Date date = dateFormat.parse(dateStr);        

        Calendar calendar = Calendar.getInstance();
        calendar.setTimeZone(TimeZone.getTimeZone(timeZoneID));
        calendar.setLenient(false);
        calendar.setTime(date);
        calendar.set(Calendar.HOUR_OF_DAY, 0);
        calendar.set(Calendar.MINUTE, 0);
        calendar.set(Calendar.SECOND, 0);
        calendar.set(Calendar.MILLISECOND, 0);

        System.out.println(calendar.getTime());
    }
}

Here Asia/Singapore is the time zone setting of the Data Integration server.

The code failed with the same error:

Exception in thread "main" java.lang.IllegalArgumentException: MINUTE
	at java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2482)
	at java.util.Calendar.updateTime(Calendar.java:2265)
	at java.util.Calendar.getTimeInMillis(Calendar.java:1049)
	at java.util.Calendar.getTime(Calendar.java:1022)
	at Prog.main(Prog.java:28)

Interestingly, the code works fine for very similar parameters:

  • dateStr = “1982-01-01”, timeZoneID = “Europe/Berlin”
  • dateStr = “1981-01-01”, timeZoneID = “Asia/Singapore”
  • dateStr = “1982-01-01”, timeZoneID = “Asia/Seoul”

Asia/Seoul and Asia/Singapore are neighbouring time zones, currently only one hour apart.

The cause of the error has been explained here:

“When you enter a date with no time, the time is assumed to be 12:00:00 AM.

But there was no 12:00:00 AM on January 1, 1982, in Singapore. After 11:59:59 PM on December 31, 1981, Singapore jumped ahead by half an hour to 12:30 AM. It had previously been at UTC+7:30, but moved to the whole-hour zone of UTC+8.”
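
As an illustration of the underlying behaviour (a standalone sketch, not a fix applied in Pentaho): the exception is triggered because the calendar is set to non-lenient mode. A lenient calendar does not throw; it should simply normalize the nonexistent midnight forward by the 30-minute gap:

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class LenientDemo {
    public static void main(String[] args) throws Exception {
        TimeZone singapore = TimeZone.getTimeZone("Asia/Singapore");

        SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd");
        parser.setTimeZone(singapore);

        Calendar calendar = Calendar.getInstance(singapore);
        // calendar.setLenient(false);  // uncommenting this line reproduces IllegalArgumentException: MINUTE
        calendar.setTime(parser.parse("1982-01-01"));
        calendar.set(Calendar.HOUR_OF_DAY, 0);
        calendar.set(Calendar.MINUTE, 0);
        calendar.set(Calendar.SECOND, 0);
        calendar.set(Calendar.MILLISECOND, 0);

        SimpleDateFormat printer = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        printer.setTimeZone(singapore);
        // midnight did not exist on this date in Singapore, so the lenient
        // calendar rolls the time forward to 00:30
        System.out.println(printer.format(calendar.getTime()));
    }
}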

Slowly Changing Fact Tables

Let’s assume the organization has a huge number of customers, and the task is to track very detailed customer profiles that include customer preferences, market segmentation, customer geography, etc. The customer profiles change very often. Such a dimension table would be very wide, very large, and growing very fast.

We can split the dimension table into a number of smaller tables (mini-dimensions). The question is how to join them together so that we can get the complete customer profile as of any point in time.

Snapshot Fact Table

The first design could be to add a snapshot factless fact table that joins these customer dimensions.

This is not the best solution, as the current snapshot of the customers is added to the fact table on every load, so the table grows extremely fast.
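
For illustration, such a daily snapshot factless fact table might look like this (the mini-dimension names are hypothetical):

CREATE TABLE customer_profile_snapshot_fact (
    snapshot_date    DATETIME, -- load date of the snapshot
    customer_rk      INT,      -- retained key of the Customer dimension
    preferences_sk   INT,      -- mini-dimension: customer preferences
    segmentation_sk  INT,      -- mini-dimension: market segmentation
    geography_sk     INT       -- mini-dimension: customer geography
)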

Slowly Changing Fact Tables

Most often the rows inserted into the snapshot fact table will be the same as the previous ones, just with a new date. Therefore we can apply the slowly changing dimension technique to the fact table to avoid duplicating rows when the attributes are not changing. Let’s call this a Slowly Changing Fact Table.

Customer RK is the retained key of the Customer dimension: a surrogate key that does not change from one version of the row to another.

If any attribute of a customer profile changes, a new row is inserted into the fact table. The number of rows in this fact table will be the same as the number of rows in the original dimension table Customer Profiles; however, the data will take up less disk space.
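
A sketch of the corresponding slowly changing fact table (column names are illustrative): instead of one row per customer per load, each row covers the period during which the profile did not change:

CREATE TABLE customer_profile_fact (
    customer_rk      INT,      -- retained key of the Customer dimension
    preferences_sk   INT,
    segmentation_sk  INT,
    geography_sk     INT,
    valid_from       DATETIME, -- first load date with this combination of keys
    valid_to         DATETIME  -- closed when any of the mini-dimension keys changes
)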

It is often more reasonable to apply the Slowly Changing Fact technique to prevent the snapshot fact table from growing too fast.

References

The ideas here were taken from the Kimball Design Tip #50 Factless Fact Tables and diluted with my interpretations 🙂

Kimball’s Design Tip #50: Factless Fact Tables

What is Data Vault?

The Data Vault model of a data warehouse, proposed by Dan E. Linstedt, is an alternative approach to the well-known Ralph Kimball’s Dimensional Modeling and Bill Inmon’s 3NF data warehouse. It has not gained much recognition, primarily because the author is not willing to share his ideas for free. You are welcome to support Dan by buying his book or attending the Data Vault certification program 🙂

The main benefit of Data Vault is the ease of modification: you do not have to alter anything in your current model to add a new dimension or new attributes to an existing dimension.

The largest drawback of this modeling technique is the larger number of joins compared to other modeling techniques, which hurts performance. Such a complex data model is also not well suited for the data access layer.

So let’s consider this modeling technique using an example, starting from a 3NF model and a dimensional model.

3NF Model

Consider the following 3NF data model. Usually it is close to the original model in the data source. There are three master tables (Customer, Order and Product) and one transactional table (Order Line).

SK fields are surrogate keys; BK fields are business keys, aka natural keys. A surrogate key is a generated key used instead of the combination of the business key and the start date of validity (Valid From).

Dimensional Model

In our dimensional model, we have a fact table in the middle and dimension tables around the fact table. We added order date as a separate dimension. Customer is directly joined to the fact table.

Nothing new by now, right?

Data Vault

There are 3 types of tables in a Data Vault model: Hub, Link and Satellite. Let’s walk through the transformation of the 3NF model into a Data Vault model.

1) Instead of each master table in 3NF, we add a Hub and a Satellite.

2) Instead of the transactional table, we add a Link table and a Satellite.

3) Instead of the joins between master tables, we add Link tables.

Hub table – contains the business key and the retained key. The retained key is mapped to the business key one-to-one. In comparison, a surrogate key includes time, so there can be many surrogate keys corresponding to one business key. This explains the name of the key: it is retained when a new version of a row is inserted, while the surrogate key keeps increasing.

Why do we need the retained key? There can be many sources, so we cannot use the business key directly: it may have a different type or a different number of attributes in different source systems.

Link table – contains the retained keys of the hubs. E.g. Customer Order Link links Customer Hub and Order Hub by the corresponding keys – Customer RK and Order RK. Note that the one-to-many relationship is replaced with a many-to-many relationship.

Satellite table – contains the attributes of the original tables.
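
A minimal sketch of the three table types for the Customer and Order part of the example (all table and column names here are illustrative):

-- Hub: business key and retained key
CREATE TABLE customer_hub (
    customer_rk  INT,          -- retained key
    customer_bk  VARCHAR(20),  -- business key
    load_date    DATETIME
)

-- Satellite: descriptive attributes, versioned over time
CREATE TABLE customer_satellite (
    customer_rk   INT,
    valid_from    DATETIME,
    valid_to      DATETIME,
    customer_name VARCHAR(50),
    address       VARCHAR(50)
)

-- Link: many-to-many relationship between hubs
CREATE TABLE customer_order_link (
    customer_rk  INT,
    order_rk     INT,
    load_date    DATETIME
)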

Resilience to Change

We can add new attributes to Customer (e.g. Customer Demographics) or even a new parent table (e.g. Delivery) without any change to the existing tables. We just need to add and populate new tables.
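
For example (again with illustrative names), adding customer demographics only requires creating and populating a new satellite keyed by the existing retained key; no existing table is altered:

CREATE TABLE customer_demographics_satellite (
    customer_rk     INT,
    valid_from      DATETIME,
    valid_to        DATETIME,
    marital_status  VARCHAR(20),
    income_band     VARCHAR(20)
)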

References

Note that this was a simplified overview, just to give an idea of the methodology. You can find detailed information on the web; I would recommend starting with Wikipedia.

Top 144 Kimball Design Tips by the Number of References in Google Search

Rank ) Tip# Title Refs
1 ) 51 Latest Thinking on Time Dimension Tables 1310
2 ) 5 Surrogate Keys for the Time Dimension 1140
3 ) 122 Call to Action for ETL Tool Providers 616
4 ) 115 Kimball Lifecycle in a Nutshell 519
5 ) 46 Another Look at Degenerate Dimensions 443
6 ) 48 De-clutter with Junk Dimensions 321
7 ) 137 Creating and Managing Shrunken Dimensions 286
8 ) 8 Perfectly Partitioning History with Type 2 SCD 262
9 ) 25 Dimensional Models for Parent-Child Applications 240
10 ) 111 Is Agile Enterprise Data Warehousing an Oxymoron? 226
11 ) 41 Drill Down into a Detailed Bus Matrix 166
12 ) 50 Factless Fact Tables 163
13 ) 1 Guidelines for an Expressive Clickstream Data Mart 152
14 ) 34 You Don’t Need an EDW 151
15 ) 91 Marketing the DW/BI System 145
16 ) 59 Surprising Value of Data Profiling 144
17 ) 106 Can the Data Warehouse Benefit from SOA? 137
18 ) 107 Using the SQL MERGE Statement for Slowly Changing Dimensions 134
19 ) 21 Declaring the Grain 124
20 ) 56 Dimensional Modeling for Microsoft Analysis Services 121
21 ) 2 Multiple Time Stamps 111
22 ) 99 Staging Areas and ETL Tools 110
23 ) 90 Slowly Changing Entities 103
24 ) 89 The Real Time Triage 101
25 ) 13 When Fact Tables can be used as Dimensions 96
26 ) 20 Sparse Facts and Facts with Short Lifetimes 96
27 ) 57 Early Arriving Facts 93
28 ) 9 Processing Slowly Changing Dimensions during Initial Load 90
29 ) 17 Populating Hierarchy Helper Tables 88
30 ) 43 Dealing With Nulls in a Dimensional Model 85
31 ) 81 Fact Table Surrogate Keys 83
32 ) 37 Modeling a Pipeline with Accumulating Snapshots 78
33 ) 15 Combining SCD Techniques 75
34 ) 85 Smart Date Keys to Partition Fact Tables 69
35 ) 136 Adding a Mini-Dimension to a Bridge Table 68
36 ) 73 Relating to Agile Methodologies 67
37 ) 49 Off the Bench about the Bottoms Up Misnomer 65
38 ) 3 Focus on Business Process, not Business Departments 63
39 ) 42 Combining Periodic and Accumulating Snapshots 62
40 ) 39 Bus Architecture Foundation for Analytic Applications 60
41 ) 28 Avoiding Catastrophic Failure of the Data Warehouse 58
42 ) 139 Much Ado About Nothing 56
43 ) 58 BI Portal 55
44 ) 92 Dimension Manager and Fact Provider 55
45 ) 95 Patterns to Avoid when Modeling Header/Line Item Transactions 54
46 ) 35 Modeling Time Spans 52
47 ) 121 Columnar Databases: Game Changers for DW/BI Deployment? 52
48 ) 124 Alternatives for Multi-valued Dimensions 51
49 ) 33 Using CRM Measures as Behavior Tags 48
50 ) 16 Hot Swappable Dimensions 45
51 ) 135 Conformed Dimensions as the Foundation for Agile Data Warehousing 45
52 ) 97 Modeling Data as Both a Fact and Dimension Attribute 44
53 ) 19 Replicating Dimensions Correctly 41
54 ) 113 Creating, Using, and Maintaining Junk Dimensions 41
55 ) 127 Creating and Managing Mini-Dimensions 41
56 ) 134 Data Warehouse Testing Recommendations 40
57 ) 87 Combining SCD Techniques Having It Both Ways 39
58 ) 128 Selecting Default Values for Nulls 38
59 ) 7 Getting your Data Warehouse back on Track 36
60 ) 22 Variable Depth Customer Dimensions 36
61 ) 30 Put your Fact Tables on a Diet 36
62 ) 53 Dimension Embellishments 36
63 ) 110 Business Requirements Gathering Dos and Don’ts 36
64 ) 133 Factless Fact Tables for Simplification 36
65 ) 12 Accurate Counting with a Dimensional Supplement 35
66 ) 119 Updating the Date Dimension 35
67 ) 6 Showing the Correlation between Dimensions 33
68 ) 84 Readers’ Suggestions on Fact Table Surrogate Keys 33
69 ) 141 Expanding Boundaries of the Data Warehouse 33
70 ) 126 Disruptive ETL Changes 32
71 ) 140 Is it a Dimension, a Fact, or Both? 32
72 ) 4 Fast Changing Complex Customer Dimensions 31
73 ) 102 Server Configuration Considerations 31
74 ) 123 Using the Dimensional Model to Validate Business Requirements 31
75 ) 100 Keep Your Keys Simple 30
76 ) 129 Are IT Procedures Beneficial to DW/BI Projects? 30
77 ) 61 Handling all the Dates 29
78 ) 88 Dashboards Done Right 28
79 ) 11 Accurate Counts within a Dimension 27
80 ) 78 Late Arriving Dimension Rows 27
81 ) 24 Multinational Dimensional Data Warehouse Considerations 26
82 ) 14 Arbitrary Balance Reporting with Transaction Facts 24
83 ) 26 Audit Dimensions to Track Lineage and Confidence 24
84 ) 27 Being Off-line as Little as Possible 24
85 ) 29 Graceful Modifications to Existing Fact and Dimension Tables 24
86 ) 103 Staffing the Dimensional Modeling Team 24
87 ) 105 Snowflakes, Outriggers, and Bridges 24
88 ) 130 Accumulating Snapshots for Complex Workflows 24
89 ) 18 Taking the Publishing Metaphor Seriously 23
90 ) 32 Doing the Work at Extract Time 23
91 ) 75 Creating the Metadata Strategy 23
92 ) 45 Techniques for Modeling Intellectual Capital 22
93 ) 104 Upgrading your BI Architecture 22
94 ) 76 Advantages of a 64-bit Server 21
95 ) 86 Reference Dimensions for Infrequently-Accessed Degenerates 21
96 ) 93 Transactions Create Time Spans 20
97 ) 96 Think Like A Software Development Manager 20
98 ) 79 Dangerously Large Dimension Tables 19
99 ) 82 Pivoting the Fact Table with a Fact Dimension 17
100 ) 114 Avoiding Alternate Organization Hierarchies 17
101 ) 109 Dos and Don’ts on the Kimball Forum 15
102 ) 31 Designing a Real Time Partition 9
103 ) 60 Big Shifts in Business Intelligence 9
104 ) 63 Building a Change Data Capture System 9
105 ) 65 Document the ETL System 9
106 ) 112 Creating Historical Dimension Rows 9
107 ) 10 Is your Data Correct 8
108 ) 36 To Be or Not To Be Centralized 8
109 ) 38 Analytic Application—What’s That? 8
110 ) 40 Structure of an Analytic Application 8
111 ) 44 Reliance on the BI Tool’s Metadata 8
112 ) 47 Business Initiatives versus Business Processes 8
113 ) 52 Improving Operating Procedures 8
114 ) 62 Alternate Hierarchies 8
115 ) 69 Identifying Business Processes 8
116 ) 70 Architecting Data for MS SQL Server 2005 8
117 ) 101 Slowly Changing Vocabulary 8
118 ) 108 When is the Dimensional Model Design Done? 8
119 ) 125 Balancing Requirements and Realities 8
120 ) 138 Use a Design Charter to Keep Dimensional Design Activities on Track 8
121 ) 23 Rolling Prediction of the Future 7
122 ) 54 Delivering Historical and Current Perspectives 7
123 ) 68 Simple Drill-Across in SQL 7
124 ) 80 Dimension Row Change Reason Attributes 7
125 ) 83 Resist Abstract Generic Dimensions 7
126 ) 94 Building Custom Tools for the DW/BI System 7
127 ) 98 Focus on Data Stewardship 7
128 ) 116 Add Uncertainty to Your Fact Table 7
129 ) 118 Managing Backlogs Dimensionally 7
130 ) 120 Design Review Dos and Don’ts 7
131 ) 131 Easier Approaches For Harder Problems 7
132 ) 55 Exploring Text Facts 6
133 ) 64 Avoid Isolating the DW and BI Teams 6
134 ) 67 Maintaining Back Pointers to Operational Sources 6
135 ) 72 Business Process Decoder Ring 6
136 ) 74 Compliance-Enabled Data Warehouses 6
137 ) 77 Warning: Summary Data may be Hazardous 6
138 ) 132 Kimball Forum Update 6
139 ) 66 Implementation Analysis Paralysis 5
140 ) 71 Naming Conventions 5
141 ) 117 Dealing with Data Quality: Don’t Just Sit There, Do Something! 5
142 ) 142 Building Bridges 5
143 ) 143 Enjoy the Sunset 5
144 ) 144 History Lesson on Ralph Kimball and Xerox PARC 4


How to Obfuscate Password for Command Line Kettle

Pentaho Kettle does not provide a way to obfuscate passwords in batch files (though connection passwords are obfuscated in XML files).

You can write:

kitchen.bat /rep repos /job test /user admin /pass admin

but this will not work:

kitchen.bat /rep repos /job test /user admin /pass "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4"

Here “Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4” is the encrypted version of “admin”. You can get such a string using the Kettle tool encr.bat.
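
For example, the -kettle option of encr.bat encrypts a password for use with Kettle and prints the “Encrypted …” string:

encr.bat -kettle admin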

This post describes a workaround.

Solution

A solution is to create a Java wrapper for Kitchen.

1. Create a Java file Kitchen1.java in the Kettle root with the following code:

import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.kitchen.Kitchen;
import org.pentaho.di.core.encryption.Encr;

public class Kitchen1 {
   public static void main(String[] args) throws KettleException {
      // find argument /pass
      int i = 0;
      for (; i < args.length; ++i)
         if (args[i].equals("/pass") || args[i].equals("/password"))
            break;
      // decrypt password if necessary
      if (++i < args.length)
         args[i] = Encr.decryptPasswordOptionallyEncrypted(args[i]);
      // run Kitchen
      Kitchen.main(args);
   }
}

2. Open a command prompt and run kitchen.bat without any parameters. This will initialize the necessary CLASSPATH environment variable.

3. Compile the Java file with the following command:

javac Kitchen1.java

4. Copy Kitchen.bat to Kitchen1.bat and correct the last line. Replace

java %OPT% org.pentaho.di.kitchen.Kitchen %_cmdline%

with

java %OPT% Kitchen1 %_cmdline%

Now you can use Kitchen1.bat instead of Kitchen.bat. Both commands will work:

kitchen1.bat /rep repos /job test /user admin /pass admin
kitchen1.bat /rep repos /job test /user admin /pass "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4"

5. Remove the source file Kitchen1.java and close the command prompt.

Have fun 🙂

Loading Fact Tables with PyGramETL

The previous post described how to load dimension tables with PyGramETL. This one is about fact tables.

The source CSV file has the following structure:

cust_id;res_date;future_guests;future_revenue
105;2008/04/01 00:00:00.000;2;300
106;2007/04/17 00:00:00.000;4;380
106;2007/04/17 00:00:00.000;4;768
...

Fact Table

CREATE TABLE reservation_fact(
    customer_sk INT,
    reservation_date DATETIME,
    future_guests FLOAT,
    future_revenue FLOAT
)

Kettle

PyGramETL
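
The full example continues in the original post; below is only a rough sketch of how the fact load might look with PyGramETL. It assumes the reservations are exported to a file reservation.csv (the file name is an assumption) and that the customer dimension from the previous post has already been loaded:

import MySQLdb
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import SlowlyChangingDimension, FactTable

myconn = MySQLdb.connect(user='test', passwd='test', db='test')

connection = pygrametl.ConnectionWrapper(myconn)
connection.setasdefault()

# minimal handle on the customer dimension from the previous post,
# used here only to look up the newest surrogate key for a customer
customer_dim = SlowlyChangingDimension(
    name='customer_dim',
    key='customer_sk',
    attributes=['customer_id', 'version'],
    lookupatts=['customer_id'],
    versionatt='version')

reservation_fact = FactTable(
    name='reservation_fact',
    keyrefs=['customer_sk', 'reservation_date'],
    measures=['future_guests', 'future_revenue'])

def load_reservation_fact():
    reservations = CSVSource(file('./reservation.csv', 'r', 16384), delimiter=';')
    for row in reservations:
        # map cust_id from the file to customer_id in the dimension
        # (lookup returns None if the customer is unknown)
        row['customer_sk'] = customer_dim.lookup(row, {'customer_id': 'cust_id'})
        row['reservation_date'] = row['res_date']
        reservation_fact.insert(row)
    connection.commit()

def main():
    load_reservation_fact()

if __name__ == '__main__':
    main()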

Continue reading

Loading Dimension Tables with PyGramETL

There is a promising Python framework for ETL – PyGramETL (http://pygrametl.org/).

This post demonstrates how to import data from a CSV file and load it into a dimension table using PyGramETL, in comparison with the same functionality implemented in a Kettle transformation.

The input file customer.txt has the following structure:

customer_id;first_name;last_name;age;phone_number;address
101;Paul;Brendt;19;(212) 555 2146;10 Jasper Blvd.
102;Robin;McCarthy;29;(214) 555 3075;27 Pasadena Drive
103;Peter;Travis;34;(510) 555 4448;7835 Hartford Drive
...

Dimension Table

MySQL is used for the data warehouse. The following code creates the dimension table.

CREATE TABLE customer_dim
(
    customer_sk INT,
    customer_id INT,
    valid_from DATETIME,
    valid_to DATETIME,
    version INT,
    phone_number VARCHAR(20),
    first_name VARCHAR(20),
    last_name VARCHAR(20),
    age INT,
    address VARCHAR(50)
)

INSERT customer_dim (customer_sk, version) VALUES (0, 1)

Kettle

Let’s consider how this can be done in Kettle.

The transformation has two steps:

The first step “Text file input” reads the CSV file. The second step “Dimension lookup/update” generates the surrogate key customer_sk as well as the technical fields valid_from, valid_to, and version. The update type for first_name and last_name is set to Punch through.

PyGramETL

The same can be done with the following PyGramETL code:

import MySQLdb
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import SlowlyChangingDimension

myconn = MySQLdb.connect(user='test', passwd='test', db='test')

connection = pygrametl.ConnectionWrapper(myconn)
connection.setasdefault()

customer_dim = SlowlyChangingDimension(
    name='customer_dim',
    key='customer_sk',
    attributes=['customer_id', 'valid_from', 'valid_to', 'version',
                'first_name', 'last_name', 'age', 'phone_number', 'address'],
    lookupatts=['customer_id'],
    versionatt='version',
    fromatt='valid_from',
    toatt='valid_to',
    type1atts=('first_name', 'last_name'),
    cachesize=-1)

customer_dim.defaultidvalue = 0

def load_customer_dim():
    customer = CSVSource(file('./customer.txt', 'r', 16384), delimiter=';')
    for row in customer:
        print(row)
        customer_dim.scdensure(row)
    connection.commit()

def main():
    load_customer_dim()

if __name__ == '__main__':
    main()

On the one hand, it is much easier to get an overview of a transformation in Kettle. On the other hand, when the ETL transformation is small, it is much easier to see the details in the Python code, because you do not have to open each step to see its settings.

Handling Records Removal in Dimension Tables

Records can be removed from a table in an operational system, but we never remove records from dimension tables. This post describes how the removal of records can be handled in dimension tables.

Let’s consider an example. There is a list of contact persons in a source system. The list is loaded into a dimension table in a data warehouse. Contact persons can be removed from the operational system, but not from the dimension table. The task is to list the current contact persons based on the data from the data warehouse. What can help us to identify the current records?
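
As a sketch of one common approach (the table and column names are illustrative, and this is not necessarily the solution described further in the post): if every load receives the full list of contact persons into a staging table, a deletion flag can be maintained in the dimension by comparing the two, and the current contact persons are simply the non-deleted rows:

-- flag dimension rows whose business key no longer exists in the source extract
UPDATE contact_person_dim
SET deleted_flag = 1
WHERE deleted_flag = 0
  AND contact_person_id NOT IN (SELECT contact_person_id FROM contact_person_stage)

-- list the current contact persons
SELECT *
FROM contact_person_dim
WHERE deleted_flag = 0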