There is a promising Python framework for ETL – PyGramETL (http://pygrametl.org/)
This post demonstrates an example how to import data from a CSV file and load it into a dimension table using PyGramETL in comparison with the same functionality implemented in a Kettle transformation.
The input file customer.txt has the following structure:
customer_id;first_name;last_name;age;phone_number;address 101;Paul;Brendt;19;(212) 555 2146;10 Jasper Blvd. 102;Robin;McCarthy;29;(214) 555 3075;27 Pasadena Drive 103;Peter;Travis;34;(510) 555 4448;7835 Hartford Drive ...
Dimention Table
MySQL is used for data warehouse. The following code create the dimension table.
CREATE TABLE customer_dim ( customer_sk INT, customer_id INT, valid_from DATETIME, valid_to DATETIME, version INT, phone_number VARCHAR(20), first_name VARCHAR(20), last_name VARCHAR(20), age INT, address VARCHAR(50) ) INSERT customer_dim (customer_sk, version) VALUES (0, 1)
Kettle
Let’s consider how this can be done in Kettle
The transformation has two steps:
The first step “Text file input” reads the CSV file. The second step “Dimensional lookup/update” generates surrogate key customer_sk as well as technical fields: valid_from, valid_to, version. The update type for first_name and last_name is set to Punch through.
PyGramETL
The same can be done with the following code using PyGramETL
import MySQLdb import pygrametl from pygrametl.datasources import CSVSource from pygrametl.tables import SlowlyChangingDimension myconn = MySQLdb.connect(user='test', passwd='test', db='test') connection = pygrametl.ConnectionWrapper(myconn) connection.setasdefault() customer_dim = SlowlyChangingDimension( name='customer_dim', key='customer_sk', attributes=['customer_id', 'valid_from', 'valid_to', 'version', 'first_name', 'last_name', 'age', 'phone_number', 'address'], lookupatts=['customer_id'], versionatt='version', fromatt='valid_from', toatt='valid_to', type1atts=('first_name', 'last_name'), cachesize=-1) customer_dim.defaultidvalue = 0 def load_customer_dim(): customer = CSVSource(file('./customer.txt', 'r', 16384), delimiter=';') for row in customer: printrow customer_dim.scdensure(row) connection.commit() def main(): load_customer_dim() if __name__ == '__main__': main()
On the one hand, it is much easier to get overview of a transformation in Kettle. On the other hand, when the ETL transformation is small, it is much easier to see details in the Python code because you do not have to open each step to see its settings.