Wednesday, May 25, 2005

Building ETL for Homegrown vs. Packaged Applications

Building ETL has always been dependent on source system. Things get more intriguing when the source systems are changing on and off. The dependencies on the source systems could be broadly divided into two categories:

1. Structural dependencies
This refers to the dependencies on the underlying data model. ETL expects the source systems object to follow some structure. For example if customer name is pulled stored in fields FirstName and LastName. If this columns in the source system are modified to a single column CustomerName, the ETL goes hey way. The structural dependencies holds for any source system ranging from Relational database, XML based sources, LDAP, files, Application interfaces like BAPI etc.

2. Application and Data Dependencies
In nutshell this refers to the mechanism with which data gets updated in the underlying source system. All the data updates have to follow some application flow. The complete data flow and data changes are typically part of source system documentation. Good example of Application or data dependencies would be LastUpdateDate getting modified for any changes made to a Customer record. The ETL would usually capture the change using this field. In case if the field is not updated by the source applications properly whole extraction and in turn ETL goes hey-way.

So dependencies on source systems are inevitable. Its not just one time phenomena when you first time build the ETL’s but its something on which you have to keep tap on on-going basis. The structural and application changes in the source systems keep happening on the regular basis.

So coming to the subject of this discussion, we can divide source systems in to major classes:
1. Homegrown applications: This category primarily refers to the custom built applications, which are designed by the organizations for day-to-day operations. A typical organization has many of its operational systems, which are either developed in-house or out sourced. This are typically custom applications build for the supporting specific function or operations of an organization.
2. Packaged Applications: This refers to the packaged applications, which are implemented by organizations for there operations. ERP and back office applications are implemented by many organizations for their operational needs.

Buying and building is always a common decision CIO’s and IT managers have to make these days. Packaged applications has its own advantages like comprehensive documentation, bug support, integration and extension point, customizable and easy and quick to roll-out, upgrade etc. Similarly homegrown application comes with its own advantages like flexibility, customization and extensive capabilities for integration with other systems, ability to enhance on modify the functionalities on need-to basis. Anyways the debate between and buy and build is beyond the scope of this posting.

Creating an ETL application for homegrown operational systems have been challenging any day. This is primarily due to
· Lack of proper documentation of the systems
· Bugs and support
· Patching of source systems (both underlying data model and applications)
· Lack of documentation for application flow, underlying data model and data flow.
· Upgrades of the source systems, which are home grown, could be difficult to track if there’s no established documentation for the changes happening.
· Typically it has been seen that source systems, which are homegrown, have integrity issues and no mechanism in place for preventing data updates and in consistencies.

On the other had homegrown systems comes with its own advantage of easy access of the development team of the applications, better understanding of the business functionalities, etc.

Writing ETL for packaged applications has its own pros. This is mainly because of:
· Availability of comprehensive documentation of the underlying data model, data flow and application flow.
· Bugs are well documented with all there implications on the application
· Upgrades and patches are well documented.
· Data integrity is typically maintained and there is usually a mechanism in place to prevent the updates of the underlying data

However, building ETL for packaged applications as the source system has its own challenges. This could be because of lack of documentation or understanding on particular area of the application.

The above discussion was just meant to delineate the intricacies of developing the ETL for homegrown and package applications. It was not driving to any conclusion of which is better over other. I had exposure both this world and with my experiences I believe that building ETL for package applications has always been easy and more efficient then the homegrown source systems. At the end of the day, being an ETL developer, one does not have much chalice of what source systems should be. Perhaps, I would follow up this discussion with my experiences and the considerations I had in my mind while approaching both type of applications.

No comments: