Saturday, May 28, 2005

Good article series from Dave Aldridge

While browsing through some of my favorite blogs, I came across Dave Aldridge. His blog is one more place worth following: much of what Dave writes will be of interest to BI and DW followers, especially those following the DW offerings from Oracle. His article series, A Practical Guide to Data Warehousing in Oracle: Series Introduction, though dating back to late 2003, has some subtle insights on the various considerations for a typical data warehouse project.

Wednesday, May 25, 2005

Building ETL for Homegrown vs. Packaged Applications

Building ETL has always been dependent on the source system, and things get more intriguing when the source systems keep changing. The dependencies on the source systems can be broadly divided into two categories:

1. Structural dependencies
This refers to dependencies on the underlying data model: the ETL expects the source system objects to follow some structure. For example, suppose the customer name is stored in the fields FirstName and LastName. If these columns in the source system are modified into a single CustomerName column, the ETL goes haywire. Structural dependencies hold for any source system, ranging from relational databases, XML-based sources, LDAP, and files to application interfaces like BAPI.

2. Application and Data Dependencies
In a nutshell, this refers to the mechanism by which data gets updated in the underlying source system. All data updates have to follow some application flow, and the complete data flow and data changes are typically part of the source system documentation. A good example of an application or data dependency is LastUpdateDate being modified for any change made to a Customer record: the ETL would usually capture changes using this field. If the field is not updated properly by the source applications, the whole extraction, and in turn the ETL, goes haywire.
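To make the LastUpdateDate dependency concrete, here is a hypothetical shell/awk sketch (the file layout, field positions, and dates are all made up for illustration): an incremental extract that keeps only the rows changed since the last ETL run.

```shell
# Hypothetical incremental-extract filter: field 3 is LastUpdateDate.
# If the source application ever stops maintaining that field, changed
# rows silently drop out of this extract -- exactly the dependency above.
LAST_RUN='2005-05-20'
printf 'C001|Smith|2005-05-18\nC002|Jones|2005-05-23\n' \
  | awk -F'|' -v last="$LAST_RUN" '$3 > last'
# prints: C002|Jones|2005-05-23
```

The plain string comparison only works because the dates are in ISO format; any other format would need parsing first.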

So dependencies on source systems are inevitable. It is not a one-time phenomenon when you first build the ETL; it is something you have to keep tabs on on an ongoing basis, since structural and application changes in the source systems keep happening on a regular basis.

So, coming to the subject of this discussion, we can divide source systems into two major classes:
1. Homegrown applications: custom-built applications designed by organizations for their day-to-day operations. A typical organization has many operational systems that are either developed in-house or outsourced, typically built to support a specific function or operation of the organization.
2. Packaged applications: off-the-shelf applications implemented by organizations for their operations. ERP and back-office applications are implemented by many organizations for their operational needs.

Buy versus build is a common decision CIOs and IT managers have to make these days. Packaged applications have their own advantages: comprehensive documentation, bug support, integration and extension points, customizability, and quick roll-outs and upgrades. Similarly, homegrown applications come with their own advantages: flexibility, extensive capabilities for integration with other systems, and the ability to enhance or modify functionality on a need-to basis. Anyway, the buy-versus-build debate is beyond the scope of this posting.

Creating an ETL application for homegrown operational systems has always been challenging. This is primarily due to:
· Lack of proper documentation of the systems: the application flow, the underlying data model, and the data flow
· Bugs and support
· Patching of the source systems (both the underlying data model and the applications)
· Upgrades of homegrown source systems, which can be difficult to track if there is no established documentation of the changes
· Integrity issues: homegrown source systems often have no mechanism in place for preventing inconsistent data updates

On the other hand, homegrown systems come with their own advantages: easy access to the application development team, a better understanding of the business functionality, etc.

Writing ETL for packaged applications has its own pros, mainly:
· Availability of comprehensive documentation of the underlying data model, data flow, and application flow
· Bugs are well documented, along with all their implications for the application
· Upgrades and patches are well documented
· Data integrity is typically maintained, and there is usually a mechanism in place to prevent uncontrolled updates of the underlying data

However, building ETL with a packaged application as the source system has its own challenges, typically because of a lack of documentation or understanding in a particular area of the application.

The above discussion was meant only to delineate the intricacies of developing ETL for homegrown and packaged applications; it was not driving toward a conclusion that one is better than the other. I have had exposure to both of these worlds, and in my experience, building ETL for packaged applications has always been easier and more efficient than for homegrown source systems. At the end of the day, being an ETL developer, one does not have much choice about what the source systems should be. Perhaps I will follow up this discussion with my experiences and the considerations I had in mind while approaching both types of applications.

Sunday, May 22, 2005

Learning awk

The other day, I had a task to kill some UNIX processes that were running javac (the Java compiler). One way I could have approached this was to identify all the processes running javac by running

root@localhost# ps -ef | grep javac

and then picking up each of these process ids and killing them individually.

root@localhost# kill -9 pid

But I have seen people write one-liner shell commands using a combination of awk, ps, etc. to kill all the processes running a particular program/binary. So I thought of building this one-liner myself. I had no idea what awk was, except that it is named after the initials of its authors (Aho, Weinberger and Kernighan). I didn't even know whether it was a comprehensive language for text processing.

I launched the awk manual at the shell prompt and started going through it. It took me a while to get hold of where things were heading.

Somehow I ran through the manual but didn't pick up much except {print} and a few simple things: awk can take inline programs, or the program can be stored in a file; it is data-driven, not procedural; $0 refers to the current line of the data; and there are commands like print and printf. But somehow I was not able to get hold of the syntax; I did not understand the grammar and how a program needs to be written. I just managed to execute
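For the record, the data-driven model the manual describes boils down to this: awk reads the input line by line, exposes each line as $0 and its whitespace-separated fields as $1, $2, and so on, and runs the action for every line that matches the (optional) pattern:

```shell
# {print ...} with no pattern runs for every input line;
# $0 is the whole current line, $2 its second whitespace-separated field.
echo 'alpha beta gamma' | awk '{print $0}'   # prints the whole line
echo 'alpha beta gamma' | awk '{print $2}'   # prints: beta
```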

awk {print}

After this I tried a few more:

root@localhost# awk {print Hello}
awk: cmd. line:2: (END OF FILE)
awk: cmd. line:2: parse error
root@localhost# awk {print 'Hello'}
awk: cmd. line:2: (END OF FILE)
awk: cmd. line:2: parse error
root@localhost# awk "{print 'Hello'}"
awk: cmd. line:1: {print 'Hello'}
awk: cmd. line:1: ^ invalid char ''' in expression
root@localhost# awk '{print "Hello"}'
Finally, I got hold of the tutorial GAWK: Effective AWK Programming, which gave me a head-start in understanding the language.

root@localhost# awk 'BEGIN {print "Hello"}' was easy to comprehend.
Things started clicking.

root@localhost# awk 'BEGIN {print substr("Hello World",2,2)}' and so on..

So, once I started understanding awk and how it works, I got back to the problem that had started the whole zeal: killing all the processes running javac.

ps -ef | grep javac | awk '{print}'
then it was
ps -ef | grep javac | awk '{print substr($0,10,6)}'
ps -ef | grep javac | awk '{printf substr($0,10,6)}'
to get all the process ids onto one line,
and finally
kill -9 `ps -ef | grep javac | awk '{printf substr($0,10,6)}'`

This tries to kill all the processes that have the pattern "javac" in the executing command. That means it even tries to kill the "grep javac" process, which no longer exists by the time the output is passed to the awk command. Thus you will get a
bash: kill: (pid) - No such pid
error message, but the above command will kill all the processes running javac.

The one-liner got further modified to get rid of grep:
kill -9 `ps -ef | awk '/javac/{printf substr($0,10,6)}'`

Somehow substr sounded a bit fishy: it could be the case that the pid does not start at the 10th position, or is more than 6 characters long. The one-liner was further modified to

kill -9 `ps -ef | awk '/javac/{printf $2}'`
where $2 refers to the second column of the ps output. So finally I was able to get through.
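To try the field-2 extraction without actually killing anything, the same awk program can be run against canned ps-style output (the sample lines below are made up); the extra !/awk/ guard is a common addition so the line for the awk process itself can never match:

```shell
# The /javac/{print $2} idea, demonstrated on fake `ps -ef` output.
printf 'root   101     1  0 10:00 ?  00:00:01 javac Foo.java\nroot   102     1  0 10:01 ?  00:00:00 bash\n' \
  | awk '/javac/ && !/awk/ {print $2}'
# prints: 101
```

On systems that have them, pkill javac or kill -9 $(pgrep -f javac) gets to the same end without any awk at all.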

Saturday, May 21, 2005

Summary Management in Oracle database: World before and after Materialized View

Summary management has been a prime consideration for data warehouses implemented using relational database technology (ROLAP). The typical set of steps one follows for managing summaries in a data warehouse project is:

  1. Identifying the summaries required. This depends on the nature of the analytics and on report performance: if on-the-fly summarization is taking long, the summary is a candidate for pre-aggregation.
  2. Refreshing the summaries. The job does not end after creating a summary; there should be a mechanism in place to keep the summaries fresh.
  3. Querying the pre-aggregated summary for reporting purposes. Reporting/analysis tools should have the capability to leverage the summaries.

The characteristics of a good summary management mechanism are:

  1. Transparent. This applies to both refresh and query: the summary should automatically refresh when the base fact table changes, and a summary query posed against the base table should be automatically redirected to a summary table if one is available.
  2. Efficient. This applies to refreshing: the refresh mechanism should apply incremental changes to the summary rather than rebuild it from scratch.
  3. Fast. Both querying and refreshing should be fast.
  4. Able to manage the degree of aggregation. It may be that not every possible summary of a base fact is required; the summary management tool should let you tune this.

So what summary management options were available in Oracle before the introduction of materialized views in 8i? Primarily, the two ways to maintain summaries before MVs were:

  1. Snapshots. A snapshot captures the state of data at a given point in time and stores it in a separate database object. Snapshots are not transparent to querying and need to be referred to explicitly in the query. Their refresh can be scheduled periodically.
  2. Summary tables. A data warehouse can have a set of tables that store summary data, for example a separate table keeping pre-aggregated sales data at the year and country level. These tables are again not transparent to the query tools and need to be referred to explicitly by the query. Their refresh also needs to be handled in the application (the ETL logic).

So neither of the above approaches achieved all the characteristics of an ideal summary management system.

Materialized views (MVs) came along with Oracle 8i. They replaced snapshots, whose core functionality became part of MVs. The radical new characteristic of MVs was query rewrite, which enables the optimizer to rewrite a query to access the materialized view instead of the base table when the query is seeking summary information. This feature sets MVs apart from all the previous approaches to summary management.

Apart from query rewrite, the other capability of MVs is the wide range of options available for refresh. The refresh options available with MVs can be classified under two categories:

  1. How to refresh:
    1. Fast refresh (apply only the incremental changes that happened in the base fact to the MV)
    2. Complete refresh (rebuild the MV completely)
    3. Force (attempt a fast refresh and, if that is not possible, do a complete refresh)
  2. When to refresh:
    1. On commit: any change happening on the base table is immediately propagated to the MV. The refresh is driven by the changes in the base fact table.
    2. On demand: do the refresh when a query accesses the MV; if the MV is stale, refresh it at that point. Driven by the querying of the MVs.
    3. Automatically: schedule the refresh on a periodic basis. This is independent of both changes in the base table and querying of the MV.

Many enhancements have been made to MVs since they were first introduced with Oracle 8i: one can now have indexes on MVs, the Summary Advisor came along in Oracle 9i, the MV refresh mechanism has been further enhanced, and so on.

Hand-coded vs. Tool-based ETL:

There have been rounds of debate over which paradigm to opt for when developing ETL for a typical data warehouse project.

I will share my experience both hand-coding ETL and using Oracle Warehouse Builder (tool-based ETL).

The ETL of the old days has not remained the same. The challenges ETL has faced in recent times include:

  1. Complex business systems and analytics requirements
  2. Volume of data
  3. Small refresh windows
  4. Orchestration and automation of loads
  5. Monitoring capabilities and exception handling
  6. Complex data quality and integrity requirements
  7. Complex transformations
  8. Match-merge and cleansing
  9. Flexibility to change the ETL logic on the fly as the business changes
  10. The varied nature of source systems

And the list goes on. So our very old hand-coded ETL cannot cope with all this.

Let me delineate my understanding of ETL. ETL is not just a piece of code or an application that enables you to do extraction, transformation, and loading. E, T, and L are the three primary pillars and the core of a typical ETL, but a few more aspects fall under its purview. Today, ETL stands for extraction, transformation, and loading, plus the infrastructure elements around them.

insert into sales (customerid, …, …, revenue)

( select orderheader.customerid, …, orderline.qty * orderline.price from orderheader@src, orderline@src where orderheader.order_id = orderline.order_id )

The above could be a simple ETL doing extraction (the select), transformation (qty * price), and loading (the insert into), but it still lacks what it takes to meet the ten challenges above:

  1. Infrastructure for scheduling and automating the loads
  2. A mechanism to handle errors and exceptions and to throw out the necessary alerts
  3. Resilience and error recovery
  4. The ability to switch from set-based loading (insert into … select) to row-based loading (picking up one row at a time and loading it)
  5. A workflow to orchestrate the various sub-loads (individual units)
  6. Metadata management
  7. Versioning and management of the design metadata
  8. Capturing runtime audit logs
  9. Capabilities to leverage the specific features of the target data warehouse platform
  10. Flexibility to change the configuration at any level, from the whole ETL project down to an individual object

It's unrealistic for a hand-coded ETL project to cover all of this.
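To give a feel for how quickly the scaffolding grows, here is a hypothetical shell sketch (every name in it is made up) of just the first three items — a scheduling hook, logging, and error handling — wrapped around a single load step:

```shell
#!/bin/sh
# Minimal hand-coded ETL wrapper sketch: one load step with logging and
# failure detection. A real project repeats this per load -- and still has
# no metadata management, workflow orchestration, or versioning.
LOG="$(mktemp)"

run_load() {
  # stand-in for the real load step, e.g. a sqlplus call running an
  # insert..select against the warehouse
  echo "loading sales..."
}

if run_load >>"$LOG" 2>&1; then
  echo "load ok"            # a scheduler (cron/at) would invoke this script
else
  echo "load FAILED"        # alerting/notification hooks would go here
  exit 1
fi
```

Multiply this by dozens of interdependent loads and the case for an integrated tool makes itself.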

ETL tool suites these days are typically bundled with all these functionalities, and Oracle Warehouse Builder is one such tool. Apart from enabling one to model the ETL flows (rather than code them), it also gives you all of the above in one integrated suite. With the new release of OWB (Paris), things will be more streamlined, and all the above functionality will be accessible through one single interface. I will share my experiences with OWB Paris in subsequent blogs.

As I keep hearing from the research analysts, as time goes by, applications will be modeled, not coded. This is driven by the inherent complexities of business applications.

So, coming back to our prime question of which approach is better, hand-coding or tool-based: as with everything, the answer depends on the context. If the ETL requirement is simple and straightforward, without any great need for infrastructure elements like monitoring and metadata management, hand-coding the ETL would be the wise choice. But if the ETL is complex and there is a requirement for monitoring, metadata management, and the like, the tool-based approach is the better way to go.

Stray thought.. 100% CPU Utilization

How does a typical microprocessor in a personal computer work? When you boot up the PC, the CPU picks up instructions from a particular place in memory and starts executing them sequentially; a sequence of instructions/steps is undertaken by the CPU, and finally it brings up your OS. When you have your OS up and you run top (the Unix command) or Task Manager, you see the CPU 2%, or maybe x%, utilized. So does that mean your CPU is sleeping 98% of the time? Is it that for 98% of the time there is nothing for the CPU to execute, and it sits idle? In reality, the CPU never sleeps; it is 100% utilized all the time. It is just that the CPU is executing dummy instructions (NOPs) for 98% of the time and doing something useful for only 2% of its time. If you launch some more programs, perhaps more useful tasks will get executed on the CPU, and the utilization will climb higher.

Kimball and Rittman

Like any other DW follower, I am a big fan of Ralph Kimball. Kimball is, no doubt, one of the most respected authorities in the DW space.

One more guy whom I follow is Mark Rittman. Rittman is like a one-stop shop for anything and everything pertaining to Oracle BI & DW.

Among a few others, I also follow Nigel Pendse.

All these guys, to me, are pioneering contributors to the DW and BI field.

One of the aims of this blog is to share some of my experiences during the implementation of various BI and DW projects.

This blog is very much inspired by the works of Kimball, Rittman, and Pendse, and I will attempt to do something similar to what they have been doing for the BI and DW community.

Tuesday, May 03, 2005

About Me

I am a software professional working with an IT services firm in Singapore. I am originally from India and did my engineering in Computer Science. My core technical interests are data warehousing and business intelligence technologies. I specialize in the various BI and DW offerings from Oracle.