Category Archives: Kettle

Pentaho Data Integration

java.lang.OutOfMemoryError: Java heap space

All steps in a transformation runs in parallel. The hops between steps are kind of buffers of rows. The maximum number of rows is controlled with transformation parameter Nr of rows in rowset. However it is important to understand that if this parameter is set to too high value, Kettle might not be able to allocate the required amount of memory and fail with exception java.lang.OutOfMemoryError: Java heap space.

Experiment:

Heap memory for Kettle is restricted with parameter -Xmx to 256m. We will set Nr of rows in rowset to 10M. The following transformation fails after some time when the number of required memory reaches the limit.

The error:

UnexpectedError: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
	at org.pentaho.di.core.row.RowDataUtil.allocateRowData(RowDataUtil.java:34)
	at org.pentaho.di.core.row.RowMeta.cloneRow(RowMeta.java:311)
	at org.pentaho.di.trans.steps.rowgenerator.RowGenerator.processRow(RowGenerator.java:151)
	at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2889)
	at org.pentaho.di.trans.steps.rowgenerator.RowGenerator.run(RowGenerator.java:215)

Generate rows adds a string field with the 100 bytes string. The other two steps do not do anything significant – step Dummy can be used instead of them.

We can calculate the number of rows in hops calculating the difference between read and write. Hop between Generate Rows and Add sequence holds 6445K-3650K=2795K of rows, hop between Add sequence and Calculator holds 3650K-3435K=215K of rows. So the total is 3010K rows each of which requires at least 100 bytes. The total required memory is more than 301M bytes.

Dimension lookup / update properties for all transformations

Sometimes it might be useful to get different information about kettle transformations in a list format. Here is an example selecting update type of step Dimension lookup/update:

select
   t.NAME as TRANSFORMATION_NAME
,  s.NAME as STEP_NAME
,  fn.FIELD_NAME
,  tab.TABLE_NAME
,  fu.FIELD_UPDATE
from
(
   select
      ID_STEP
   ,  NR
   ,  CAST(VALUE_STR as varchar) as FIELD_NAME
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'field_lookup'
) fn
inner join
(
   select
      ID_STEP,
      NR,
      CAST(VALUE_STR as varchar) as FIELD_UPDATE
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'field_update'
) fu
on
   fn.ID_STEP = fu.ID_STEP
and
   fn.NR = fu.NR
inner join
   R_STEP s
on
   s.ID_STEP = fn.ID_STEP
inner join
(
   select
      ID_STEP
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'update'
   and
      CAST(VALUE_STR as varchar)='Y'
) u
on u.ID_STEP = s.ID_STEP
inner join
(
   select
      ID_STEP
   ,  CAST(VALUE_STR as varchar) as TABLE_NAME
   from
      R_STEP_ATTRIBUTE
   where
      CODE = 'table'
) tab
on
   tab.ID_STEP = s.ID_STEP
inner join
   R_TRANSFORMATION t
on
   t.ID_TRANSFORMATION = s.ID_TRANSFORMATION
order by
   TRANSFORMATION_NAME
,  STEP_NAME
,  FIELD_NAME