Category Archives: Data Warehousing


Format date in Batch yyyyMMdd

The usual approach to building a timestamp for log files in a batch script is to use the standard Windows utilities date and time. However, the format of their output depends on the locale, so it is almost impossible to write a script that runs on any machine. A solution is to create a small Java tool.

Source file (datetime.java):

import java.text.SimpleDateFormat;
import java.util.Calendar;
public class datetime {
  public static void main(String[] args) {
    System.out.println(new SimpleDateFormat("yyyyMMdd_HHmmss")
          .format(Calendar.getInstance().getTime()));
  }
}

Compilation (you need a JDK for this):

javac datetime.java

The code will be compiled into datetime.class.

Use in a batch file:

java datetime>logdate.txt
set /p STAMP=<logdate.txt
set LOG=C:\logs\%STAMP%.log
call Main.bat > %LOG%

Archiving Pentaho log table in SQL Server

Pentaho Data Integration displays only the last 50 log entries, but whenever you open a transformation it has to read the whole log table. When the log table gets large, opening a transformation may take a long time, during which PDI freezes.

A solution is to archive old entries from the log table.

Create an archive table ETL_LOG_ARCHIVE with a structure identical to that of the log table ETL_LOG:

SELECT * INTO ETL_LOG_ARCHIVE FROM ETL_LOG WHERE 0=1

Move old rows from ETL_LOG to ETL_LOG_ARCHIVE:

DELETE FROM ETL_LOG
OUTPUT DELETED.* INTO ETL_LOG_ARCHIVE
WHERE ID_BATCH IN (
   SELECT ID_BATCH FROM 
   (
      SELECT 
         ID_BATCH, 
         ROW_NUMBER() OVER (
            PARTITION BY TRANSNAME 
            ORDER BY ID_BATCH DESC) RN 
         FROM ETL_LOG
   ) A
   WHERE RN > 50
) 
AND ID_BATCH NOT IN (
   SELECT MAX(ID_BATCH) 
   FROM ETL_LOG
   WHERE STATUS='end' 
   GROUP BY TRANSNAME
)

We want to keep the last 50 rows for each transformation, but we also need to keep the last successful execution of each transformation. Otherwise there could be a problem with an incremental load if a transformation has failed more than 50 times in a row.

Removing time part of a date fails in Kettle

This might be interesting to know.

I encountered an error running a Pentaho Data Integration transformation. After simplification I got this:

Generate Rows creates one row with the date field DATETIME=1982-01-01. Calculator computes a field DATE using the operation “Remove time from a date”. This trivial transformation failed with a weird error:

Unexpected error : 
java.lang.IllegalArgumentException: MINUTE
	at java.util.GregorianCalendar.computeTime(Unknown Source)
	at java.util.Calendar.updateTime(Unknown Source)
	at java.util.Calendar.getTimeInMillis(Unknown Source)
	at java.util.Calendar.getTime(Unknown Source)
	at org.pentaho.di.core.Const.removeTimeFromDate(Const.java:1958)
	at org.pentaho.di.core.row.ValueDataUtil.removeTimeFromDate(ValueDataUtil.java:628)
	at org.pentaho.di.trans.steps.calculator.Calculator.calcFields(Calculator.java:476)
	at org.pentaho.di.trans.steps.calculator.Calculator.processRow(Calculator.java:165)
	at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2889)
	at org.pentaho.di.trans.steps.calculator.Calculator.run(Calculator.java:626)

After some digging into the Pentaho code, I replicated the error in plain Java:

import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class Prog {

    public static void main(String[] args) throws ParseException {
        //String dateStr = "1982-01-01", timeZoneID = "Europe/Berlin";
        //String dateStr = "1981-01-01", timeZoneID = "Asia/Singapore";
        //String dateStr = "1982-01-01", timeZoneID = "Asia/Seoul";
        String dateStr = "1982-01-01", timeZoneID = "Asia/Singapore";

        DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
        Date date = dateFormat.parse(dateStr);        

        Calendar calendar = Calendar.getInstance();
        calendar.setTimeZone(TimeZone.getTimeZone(timeZoneID));
        calendar.setLenient(false);
        calendar.setTime(date);
        calendar.set(Calendar.HOUR_OF_DAY, 0);
        calendar.set(Calendar.MINUTE, 0);
        calendar.set(Calendar.SECOND, 0);
        calendar.set(Calendar.MILLISECOND, 0);

        System.out.println(calendar.getTime());
    }
}

Here Asia/Singapore is the time zone setting of the Data Integration server.

The code failed with the same error:

Exception in thread "main" java.lang.IllegalArgumentException: MINUTE
	at java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2482)
	at java.util.Calendar.updateTime(Calendar.java:2265)
	at java.util.Calendar.getTimeInMillis(Calendar.java:1049)
	at java.util.Calendar.getTime(Calendar.java:1022)
	at Prog.main(Prog.java:28)

Interestingly, the code works fine for very similar parameters:

  • dateStr = “1982-01-01”, timeZoneID = “Europe/Berlin”
  • dateStr = “1981-01-01”, timeZoneID = “Asia/Singapore”
  • dateStr = “1982-01-01”, timeZoneID = “Asia/Seoul”

Asia/Seoul and Asia/Singapore are neighbouring time zones, currently only one hour apart.

The cause of the error has been explained here:

“When you enter a date with no time, the time is assumed to be 12:00:00 AM.

But there was no 12:00:00 AM on January 1, 1982, in Singapore. After 11:59:59 PM on December 31, 1981, Singapore jumped ahead by half an hour to 12:30 AM. It had previously been at UTC+7:30, but moved to the whole-hour zone of UTC+8.”
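
As an illustration of the underlying behaviour (a standalone sketch, not a fix applied in Pentaho): the exception is triggered because the calendar is set to non-lenient mode. A lenient calendar does not throw; it should simply normalize the nonexistent midnight forward by the 30-minute gap:

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class LenientDemo {
    public static void main(String[] args) throws Exception {
        TimeZone singapore = TimeZone.getTimeZone("Asia/Singapore");

        SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd");
        parser.setTimeZone(singapore);

        Calendar calendar = Calendar.getInstance(singapore);
        // calendar.setLenient(false);  // uncommenting this line reproduces IllegalArgumentException: MINUTE
        calendar.setTime(parser.parse("1982-01-01"));
        calendar.set(Calendar.HOUR_OF_DAY, 0);
        calendar.set(Calendar.MINUTE, 0);
        calendar.set(Calendar.SECOND, 0);
        calendar.set(Calendar.MILLISECOND, 0);

        SimpleDateFormat printer = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        printer.setTimeZone(singapore);
        // midnight did not exist on this date in Singapore, so the lenient
        // calendar rolls the time forward to 00:30
        System.out.println(printer.format(calendar.getTime()));
    }
}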

Slowly Changing Fact Tables

Let’s assume the organization has a huge number of customers, and the task is to track very detailed customer profiles that include customer preferences, market segmentation, customer geography, etc. The customer profiles change very often. Such a dimension table would be very wide, very large, and growing very fast.

We can split the dimension table into a number of smaller tables (mini-dimensions). The question is how to join them together so that we can get the complete customer profile as of any point in time.

Snapshot Fact Table

The first design could be to add a snapshot factless fact table that joins these customer dimensions.

This is not the best solution, as the current snapshot of the customers is added to the fact table on every load, so the table grows extremely fast.
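
For illustration, such a daily snapshot factless fact table might look like this (the mini-dimension names are hypothetical):

CREATE TABLE customer_profile_snapshot_fact (
    snapshot_date    DATETIME, -- load date of the snapshot
    customer_rk      INT,      -- retained key of the Customer dimension
    preferences_sk   INT,      -- mini-dimension: customer preferences
    segmentation_sk  INT,      -- mini-dimension: market segmentation
    geography_sk     INT       -- mini-dimension: customer geography
)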

Slowly Changing Fact Tables

Most often the rows inserted into the snapshot fact table will be the same as the previous ones, just with a new date. Therefore we can apply the slowly changing dimension technique to the fact table to avoid duplicating rows when the attributes are not changing. Let’s call this a Slowly Changing Fact Table.

Customer RK is the retained key of the Customer dimension: a surrogate key that does not change from one version of the row to another.

If any attribute of a customer profile changes, a new row is inserted into the fact table. The number of rows in this fact table will be the same as the number of rows in the original dimension table Customer Profiles; however, the data will take up less disk space.
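
A sketch of the corresponding slowly changing fact table (column names are illustrative): instead of one row per customer per load, each row covers the period during which the profile did not change:

CREATE TABLE customer_profile_fact (
    customer_rk      INT,      -- retained key of the Customer dimension
    preferences_sk   INT,
    segmentation_sk  INT,
    geography_sk     INT,
    valid_from       DATETIME, -- first load date with this combination of keys
    valid_to         DATETIME  -- closed when any of the mini-dimension keys changes
)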

It is often more reasonable to apply the Slowly Changing Fact technique to prevent the snapshot fact table from growing too fast.

References

The ideas here were taken from the Kimball Design Tip #50 Factless Fact Tables and diluted with my interpretations 🙂

Kimball’s Design Tip #50: Factless Fact Tables

What is Data Vault?

The Data Vault model of a data warehouse, proposed by Dan E. Linstedt, is an alternative approach to the well-known Ralph Kimball’s Dimensional Modeling and Bill Inmon’s 3NF data warehouse. It has not gained much recognition, primarily because the author is not willing to share his ideas for free. You are welcome to support Dan by buying his book or attending the Data Vault certification program 🙂

The main benefit of Data Vault is the ease of modification: you do not have to alter anything in your current model to add a new dimension or new attributes to an existing dimension.

The largest drawback of this modeling technique is the larger number of joins compared to other modeling techniques, which hurts performance. Such a complex data model is also not well suited for the data access layer.

So let’s consider this modeling technique using an example, starting from a 3NF model and a dimensional model.

3NF Model

Consider the following 3NF data model. Usually it is close to the original model in the data source. There are three master tables (Customer, Order and Product) and one transactional table (Order Line).

SK fields are surrogate keys; BK fields are business keys, aka natural keys. A surrogate key is a generated key used instead of the combination of the business key and the start date of validity (Valid From).

Dimensional Model

In our dimensional model, we have a fact table in the middle and dimension tables around the fact table. We added order date as a separate dimension. Customer is directly joined to the fact table.

Nothing new by now, right?

Data Vault

There are 3 types of tables in a Data Vault model: Hub, Link and Satellite. Let’s walk through the transformation of the 3NF model into a Data Vault model.

1) Instead of each master table in 3NF, we add a Hub and a Satellite.

2) Instead of the transactional table, we add a Link table and a Satellite.

3) Instead of the joins between master tables, we add Link tables.

Hub table – contains the business key and the retained key. The retained key is mapped to the business key one-to-one. In comparison, a surrogate key includes time, so there can be many surrogate keys corresponding to one business key. This explains the name of the key: it is retained when a new version of a row is inserted, while the surrogate key keeps increasing.

Why do we need the retained key? There can be many sources, so we cannot use the business key directly: it may have a different type or a different number of attributes in different source systems.

Link table – contains the retained keys of the hubs. E.g. Customer Order Link links Customer Hub and Order Hub by the corresponding keys – Customer RK and Order RK. Note that the one-to-many relationship is replaced with a many-to-many relationship.

Satellite table – contains the attributes of the original tables.
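
A minimal sketch of the three table types for the Customer and Order part of the example (all table and column names here are illustrative):

-- Hub: business key and retained key
CREATE TABLE customer_hub (
    customer_rk  INT,          -- retained key
    customer_bk  VARCHAR(20),  -- business key
    load_date    DATETIME
)

-- Satellite: descriptive attributes, versioned over time
CREATE TABLE customer_satellite (
    customer_rk   INT,
    valid_from    DATETIME,
    valid_to      DATETIME,
    customer_name VARCHAR(50),
    address       VARCHAR(50)
)

-- Link: many-to-many relationship between hubs
CREATE TABLE customer_order_link (
    customer_rk  INT,
    order_rk     INT,
    load_date    DATETIME
)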

Resilience to Change

We can add new attributes to Customer (e.g. Customer Demographics) or even a new parent table (e.g. Delivery) without any change to the existing tables. We just need to add and populate new tables.
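
For example (again with illustrative names), adding customer demographics only requires creating and populating a new satellite keyed by the existing retained key; no existing table is altered:

CREATE TABLE customer_demographics_satellite (
    customer_rk     INT,
    valid_from      DATETIME,
    valid_to        DATETIME,
    marital_status  VARCHAR(20),
    income_band     VARCHAR(20)
)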

References

Note that this was a simplified overview, just to give an idea of the methodology. You can find detailed information on the web; I would recommend starting with Wikipedia.

Top 144 Kimball Design Tips by the Number of References in Google Search

Rank ) Tip# Title Refs
1 ) 51 Latest Thinking on Time Dimension Tables 1310
2 ) 5 Surrogate Keys for the Time Dimension 1140
3 ) 122 Call to Action for ETL Tool Providers 616
4 ) 115 Kimball Lifecycle in a Nutshell 519
5 ) 46 Another Look at Degenerate Dimensions 443
6 ) 48 De-clutter with Junk Dimensions 321
7 ) 137 Creating and Managing Shrunken Dimensions 286
8 ) 8 Perfectly Partitioning History with Type 2 SCD 262
9 ) 25 Dimensional Models for Parent-Child Applications 240
10 ) 111 Is Agile Enterprise Data Warehousing an Oxymoron? 226
11 ) 41 Drill Down into a Detailed Bus Matrix 166
12 ) 50 Factless Fact Tables 163
13 ) 1 Guidelines for an Expressive Clickstream Data Mart 152
14 ) 34 You Don’t Need an EDW 151
15 ) 91 Marketing the DW/BI System 145
16 ) 59 Surprising Value of Data Profiling 144
17 ) 106 Can the Data Warehouse Benefit from SOA? 137
18 ) 107 Using the SQL MERGE Statement for Slowly Changing Dimensions 134
19 ) 21 Declaring the Grain 124
20 ) 56 Dimensional Modeling for Microsoft Analysis Services 121
21 ) 2 Multiple Time Stamps 111
22 ) 99 Staging Areas and ETL Tools 110
23 ) 90 Slowly Changing Entities 103
24 ) 89 The Real Time Triage 101
25 ) 13 When Fact Tables can be used as Dimensions 96
26 ) 20 Sparse Facts and Facts with Short Lifetimes 96
27 ) 57 Early Arriving Facts 93
28 ) 9 Processing Slowly Changing Dimensions during Initial Load 90
29 ) 17 Populating Hierarchy Helper Tables 88
30 ) 43 Dealing With Nulls in a Dimensional Model 85
31 ) 81 Fact Table Surrogate Keys 83
32 ) 37 Modeling a Pipeline with Accumulating Snapshots 78
33 ) 15 Combining SCD Techniques 75
34 ) 85 Smart Date Keys to Partition Fact Tables 69
35 ) 136 Adding a Mini-Dimension to a Bridge Table 68
36 ) 73 Relating to Agile Methodologies 67
37 ) 49 Off the Bench about the Bottoms Up Misnomer 65
38 ) 3 Focus on Business Process, not Business Departments 63
39 ) 42 Combining Periodic and Accumulating Snapshots 62
40 ) 39 Bus Architecture Foundation for Analytic Applications 60
41 ) 28 Avoiding Catastrophic Failure of the Data Warehouse 58
42 ) 139 Much Ado About Nothing 56
43 ) 58 BI Portal 55
44 ) 92 Dimension Manager and Fact Provider 55
45 ) 95 Patterns to Avoid when Modeling Header/Line Item Transactions 54
46 ) 35 Modeling Time Spans 52
47 ) 121 Columnar Databases: Game Changers for DW/BI Deployment? 52
48 ) 124 Alternatives for Multi-valued Dimensions 51
49 ) 33 Using CRM Measures as Behavior Tags 48
50 ) 16 Hot Swappable Dimensions 45
51 ) 135 Conformed Dimensions as the Foundation for Agile Data Warehousing 45
52 ) 97 Modeling Data as Both a Fact and Dimension Attribute 44
53 ) 19 Replicating Dimensions Correctly 41
54 ) 113 Creating, Using, and Maintaining Junk Dimensions 41
55 ) 127 Creating and Managing Mini-Dimensions 41
56 ) 134 Data Warehouse Testing Recommendations 40
57 ) 87 Combining SCD Techniques Having It Both Ways 39
58 ) 128 Selecting Default Values for Nulls 38
59 ) 7 Getting your Data Warehouse back on Track 36
60 ) 22 Variable Depth Customer Dimensions 36
61 ) 30 Put your Fact Tables on a Diet 36
62 ) 53 Dimension Embellishments 36
63 ) 110 Business Requirements Gathering Dos and Don’ts 36
64 ) 133 Factless Fact Tables for Simplification 36
65 ) 12 Accurate Counting with a Dimensional Supplement 35
66 ) 119 Updating the Date Dimension 35
67 ) 6 Showing the Correlation between Dimensions 33
68 ) 84 Readers’ Suggestions on Fact Table Surrogate Keys 33
69 ) 141 Expanding Boundaries of the Data Warehouse 33
70 ) 126 Disruptive ETL Changes 32
71 ) 140 Is it a Dimension, a Fact, or Both? 32
72 ) 4 Fast Changing Complex Customer Dimensions 31
73 ) 102 Server Configuration Considerations 31
74 ) 123 Using the Dimensional Model to Validate Business Requirements 31
75 ) 100 Keep Your Keys Simple 30
76 ) 129 Are IT Procedures Beneficial to DW/BI Projects? 30
77 ) 61 Handling all the Dates 29
78 ) 88 Dashboards Done Right 28
79 ) 11 Accurate Counts within a Dimension 27
80 ) 78 Late Arriving Dimension Rows 27
81 ) 24 Multinational Dimensional Data Warehouse Considerations 26
82 ) 14 Arbitrary Balance Reporting with Transaction Facts 24
83 ) 26 Audit Dimensions to Track Lineage and Confidence 24
84 ) 27 Being Off-line as Little as Possible 24
85 ) 29 Graceful Modifications to Existing Fact and Dimension Tables 24
86 ) 103 Staffing the Dimensional Modeling Team 24
87 ) 105 Snowflakes, Outriggers, and Bridges 24
88 ) 130 Accumulating Snapshots for Complex Workflows 24
89 ) 18 Taking the Publishing Metaphor Seriously 23
90 ) 32 Doing the Work at Extract Time 23
91 ) 75 Creating the Metadata Strategy 23
92 ) 45 Techniques for Modeling Intellectual Capital 22
93 ) 104 Upgrading your BI Architecture 22
94 ) 76 Advantages of a 64-bit Server 21
95 ) 86 Reference Dimensions for Infrequently-Accessed Degenerates 21
96 ) 93 Transactions Create Time Spans 20
97 ) 96 Think Like A Software Development Manager 20
98 ) 79 Dangerously Large Dimension Tables 19
99 ) 82 Pivoting the Fact Table with a Fact Dimension 17
100 ) 114 Avoiding Alternate Organization Hierarchies 17
101 ) 109 Dos and Don’ts on the Kimball Forum 15
102 ) 31 Designing a Real Time Partition 9
103 ) 60 Big Shifts in Business Intelligence 9
104 ) 63 Building a Change Data Capture System 9
105 ) 65 Document the ETL System 9
106 ) 112 Creating Historical Dimension Rows 9
107 ) 10 Is your Data Correct 8
108 ) 36 To Be or Not To Be Centralized 8
109 ) 38 Analytic Application—What’s That? 8
110 ) 40 Structure of an Analytic Application 8
111 ) 44 Reliance on the BI Tool’s Metadata 8
112 ) 47 Business Initiatives versus Business Processes 8
113 ) 52 Improving Operating Procedures 8
114 ) 62 Alternate Hierarchies 8
115 ) 69 Identifying Business Processes 8
116 ) 70 Architecting Data for MS SQL Server 2005 8
117 ) 101 Slowly Changing Vocabulary 8
118 ) 108 When is the Dimensional Model Design Done? 8
119 ) 125 Balancing Requirements and Realities 8
120 ) 138 Use a Design Charter to Keep Dimensional Design Activities on Track 8
121 ) 23 Rolling Prediction of the Future 7
122 ) 54 Delivering Historical and Current Perspectives 7
123 ) 68 Simple Drill-Across in SQL 7
124 ) 80 Dimension Row Change Reason Attributes 7
125 ) 83 Resist Abstract Generic Dimensions 7
126 ) 94 Building Custom Tools for the DW/BI System 7
127 ) 98 Focus on Data Stewardship 7
128 ) 116 Add Uncertainty to Your Fact Table 7
129 ) 118 Managing Backlogs Dimensionally 7
130 ) 120 Design Review Dos and Don’ts 7
131 ) 131 Easier Approaches For Harder Problems 7
132 ) 55 Exploring Text Facts 6
133 ) 64 Avoid Isolating the DW and BI Teams 6
134 ) 67 Maintaining Back Pointers to Operational Sources 6
135 ) 72 Business Process Decoder Ring 6
136 ) 74 Compliance-Enabled Data Warehouses 6
137 ) 77 Warning: Summary Data may be Hazardous 6
138 ) 132 Kimball Forum Update 6
139 ) 66 Implementation Analysis Paralysis 5
140 ) 71 Naming Conventions 5
141 ) 117 Dealing with Data Quality: Don’t Just Sit There, Do Something! 5
142 ) 142 Building Bridges 5
143 ) 143 Enjoy the Sunset 5
144 ) 144 History Lesson on Ralph Kimball and Xerox PARC 4


How to Obfuscate Password for Command Line Kettle

Pentaho Kettle does not provide a way to obfuscate passwords in batch files (though connection passwords are obfuscated in XML files).

You can write:

kitchen.bat /rep repos /job test /user admin /pass admin

but this will not work:

kitchen.bat /rep repos /job test /user admin /pass "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4"

Here “Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4” is the encrypted version of “admin”. You can get such a string using the Kettle tool encr.bat.
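
For example, the -kettle option of encr.bat encrypts a password for use with Kettle and prints the “Encrypted …” string:

encr.bat -kettle admin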

This post describes a workaround.

Solution

A solution is to create a Java wrapper for Kitchen.

1. Create a Java file Kitchen1.java in the Kettle root with the following code:

import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.kitchen.Kitchen;
import org.pentaho.di.core.encryption.Encr;

public class Kitchen1 {
   public static void main(String[] args) throws KettleException {
      // find argument /pass
      int i = 0;
      for (; i < args.length; ++i)
         if (args[i].equals("/pass") || args[i].equals("/password"))
            break;
      // decrypt password if necessary
      if (++i < args.length)
         args[i] = Encr.decryptPasswordOptionallyEncrypted(args[i]);
      // run Kitchen
      Kitchen.main(args);
   }
}

2. Open a command prompt and run kitchen.bat without any parameters. This will initialize the necessary CLASSPATH environment variable.

3. Compile the Java file with the following command:

javac Kitchen1.java

4. Copy Kitchen.bat to Kitchen1.bat and correct the last line. Replace

java %OPT% org.pentaho.di.kitchen.Kitchen %_cmdline%

with

java %OPT% Kitchen1 %_cmdline%

Now you can use Kitchen1.bat instead of Kitchen.bat. Both commands will work:

kitchen1.bat /rep repos /job test /user admin /pass admin
kitchen1.bat /rep repos /job test /user admin /pass "Encrypted 2be98afc86aa7f2e4cb79ce71da9fa6d4"

5. Remove the source file Kitchen1.java and close the command prompt.

Have fun 🙂

Loading Fact Tables with PyGramETL

The previous post described how to load dimension tables with PyGramETL. This one is about fact tables.

The source CSV file has the following structure:

cust_id;res_date;future_guests;future_revenue
105;2008/04/01 00:00:00.000;2;300
106;2007/04/17 00:00:00.000;4;380
106;2007/04/17 00:00:00.000;4;768
...

Fact Table

CREATE TABLE reservation_fact(
    customer_sk INT,
    reservation_date DATETIME,
    future_guests FLOAT,
    future_revenue FLOAT
)

Kettle

PyGramETL
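
The full example continues in the original post; below is only a rough sketch of how the fact load might look with PyGramETL. It assumes the reservations are exported to a file reservation.csv (the file name is an assumption) and that the customer dimension from the previous post has already been loaded:

import MySQLdb
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import SlowlyChangingDimension, FactTable

myconn = MySQLdb.connect(user='test', passwd='test', db='test')

connection = pygrametl.ConnectionWrapper(myconn)
connection.setasdefault()

# minimal handle on the customer dimension from the previous post,
# used here only to look up the newest surrogate key for a customer
customer_dim = SlowlyChangingDimension(
    name='customer_dim',
    key='customer_sk',
    attributes=['customer_id', 'version'],
    lookupatts=['customer_id'],
    versionatt='version')

reservation_fact = FactTable(
    name='reservation_fact',
    keyrefs=['customer_sk', 'reservation_date'],
    measures=['future_guests', 'future_revenue'])

def load_reservation_fact():
    reservations = CSVSource(file('./reservation.csv', 'r', 16384), delimiter=';')
    for row in reservations:
        # map cust_id from the file to customer_id in the dimension
        # (lookup returns None if the customer is unknown)
        row['customer_sk'] = customer_dim.lookup(row, {'customer_id': 'cust_id'})
        row['reservation_date'] = row['res_date']
        reservation_fact.insert(row)
    connection.commit()

def main():
    load_reservation_fact()

if __name__ == '__main__':
    main()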

Continue reading

Loading Dimension Tables with PyGramETL

There is a promising Python framework for ETL – PyGramETL (http://pygrametl.org/).

This post demonstrates how to import data from a CSV file and load it into a dimension table using PyGramETL, in comparison with the same functionality implemented in a Kettle transformation.

The input file customer.txt has the following structure:

customer_id;first_name;last_name;age;phone_number;address
101;Paul;Brendt;19;(212) 555 2146;10 Jasper Blvd.
102;Robin;McCarthy;29;(214) 555 3075;27 Pasadena Drive
103;Peter;Travis;34;(510) 555 4448;7835 Hartford Drive
...

Dimension Table

MySQL is used for the data warehouse. The following code creates the dimension table.

CREATE TABLE customer_dim
(
    customer_sk INT,
    customer_id INT,
    valid_from DATETIME,
    valid_to DATETIME,
    version INT,
    phone_number VARCHAR(20),
    first_name VARCHAR(20),
    last_name VARCHAR(20),
    age INT,
    address VARCHAR(50)
)

INSERT customer_dim (customer_sk, version) VALUES (0, 1)

Kettle

Let’s consider how this can be done in Kettle.

The transformation has two steps:

The first step “Text file input” reads the CSV file. The second step “Dimension lookup/update” generates the surrogate key customer_sk as well as the technical fields valid_from, valid_to, and version. The update type for first_name and last_name is set to Punch through.

PyGramETL

The same can be done with the following PyGramETL code:

import MySQLdb
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import SlowlyChangingDimension

myconn = MySQLdb.connect(user='test', passwd='test', db='test')

connection = pygrametl.ConnectionWrapper(myconn)
connection.setasdefault()

customer_dim = SlowlyChangingDimension(
    name='customer_dim',
    key='customer_sk',
    attributes=['customer_id', 'valid_from', 'valid_to', 'version',
                'first_name', 'last_name', 'age', 'phone_number', 'address'],
    lookupatts=['customer_id'],
    versionatt='version',
    fromatt='valid_from',
    toatt='valid_to',
    type1atts=('first_name', 'last_name'),
    cachesize=-1)

customer_dim.defaultidvalue = 0

def load_customer_dim():
    customer = CSVSource(file('./customer.txt', 'r', 16384), delimiter=';')
    for row in customer:
        print(row)
        customer_dim.scdensure(row)
    connection.commit()

def main():
    load_customer_dim()

if __name__ == '__main__':
    main()

On the one hand, it is much easier to get an overview of a transformation in Kettle. On the other hand, when the ETL transformation is small, it is much easier to see the details in the Python code, because you do not have to open each step to see its settings.

Handling Records Removal in Dimension Tables

Records can be removed from a table in an operational system, but we never remove records from dimension tables. This post describes how the removal of records can be handled in dimension tables.

Let’s consider an example. There is a list of contact persons in a source system. The list is loaded into a dimension table in a data warehouse. Contact persons can be removed from the operational system, but not from the dimension table. The task is to list the current contact persons based on the data from the data warehouse. What can help us to identify the current records?
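
As a sketch of one common approach (the table and column names are illustrative, and this is not necessarily the solution described further in the post): if every load receives the full list of contact persons into a staging table, a deletion flag can be maintained in the dimension by comparing the two, and the current contact persons are simply the non-deleted rows:

-- flag dimension rows whose business key no longer exists in the source extract
UPDATE contact_person_dim
SET deleted_flag = 1
WHERE deleted_flag = 0
  AND contact_person_id NOT IN (SELECT contact_person_id FROM contact_person_stage)

-- list the current contact persons
SELECT *
FROM contact_person_dim
WHERE deleted_flag = 0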