Loading Dimension Tables with PyGramETL

There is a promising Python framework for ETL – PyGramETL (http://pygrametl.org/)

This post demonstrates an example how to import data from a CSV file and load it into a dimension table using PyGramETL in comparison with the same functionality implemented in a Kettle transformation.

The input file customer.txt has the following structure:

customer_id;first_name;last_name;age;phone_number;address
101;Paul;Brendt;19;(212) 555 2146;10 Jasper Blvd.
102;Robin;McCarthy;29;(214) 555 3075;27 Pasadena Drive
103;Peter;Travis;34;(510) 555 4448;7835 Hartford Drive
...

Dimention Table

MySQL is used for data warehouse. The following code create the dimension table.

CREATE TABLE customer_dim
(
    customer_sk INT,
    customer_id INT,
    valid_from DATETIME,
    valid_to DATETIME,
    version INT,
    phone_number VARCHAR(20),
    first_name VARCHAR(20),
    last_name VARCHAR(20),
    age INT,
    address VARCHAR(50)
)

INSERT customer_dim (customer_sk, version) VALUES (0, 1)

Kettle

Let's consider how this can be done in Kettle

The transformation has two steps:

The first step "Text file input" reads the CSV file. The second step "Dimensional lookup/update" generates surrogate key customer_sk as well as technical fields: valid_from, valid_to, version. The update type for first_name and last_name is set to Punch through.

PyGramETL

The same can be done with the following code using PyGramETL

import MySQLdb
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import SlowlyChangingDimension

myconn = MySQLdb.connect(user='test', passwd='test', db='test')

connection = pygrametl.ConnectionWrapper(myconn)
connection.setasdefault()

customer_dim = SlowlyChangingDimension(
    name='customer_dim',
    key='customer_sk',
    attributes=['customer_id', 'valid_from', 'valid_to', 'version',
                'first_name', 'last_name', 'age', 'phone_number', 'address'],
    lookupatts=['customer_id'],
    versionatt='version',
    fromatt='valid_from',
    toatt='valid_to',
    type1atts=('first_name', 'last_name'),
    cachesize=-1)

customer_dim.defaultidvalue = 0

def load_customer_dim():
    customer = CSVSource(file('./customer.txt', 'r', 16384), delimiter=';')
    for row in customer:
        printrow
        customer_dim.scdensure(row)
    connection.commit()

def main():
    load_customer_dim()

if __name__ == '__main__':
    main()

On the one hand, it is much easier to get overview of a transformation in Kettle. On the other hand, when the ETL transformation is small, it is much easier to see details in the Python code because you do not have to open each step to see its settings.