Saturday, August 26, 2006

Duplicating the data in a data warehouse - Does it always hold good?

Duplicating the data in an applications. Does it hold good always? I won't be talking about the whole range of application but focusing on data warehouse and reporting applications. If we go by some of the definition and design principles of data warehouse, they mention about redundancy in the data model. One of the golden rule for designing a typical data warehouse is to have the data model as a simple star schema. Star schema with denormalized or redundant dimension tables. The technique is also referred as Dimensional modeling. There has been a long running debate on what is the better approach for designing a data warehouse: Dimensional modeling or ER (traditional 3NF) modeling. Let us keep ourselves away for this for now.

So what I want to cover here are:
1. Duplicating fact tables or dimensional in the same data model with some calculation or some minor transformation.
I have seen this quite a times in my previous experiences. Duplication has all inherent disadvantage like duplicating in the efforts to write the loading program to populate this extra tables. This new program comes with its on effort to debug and test them. Extra space for these tables is one more thing. However the storage has negligible impact since the ever storage cost is ever decreasing and improvements in the RDBMS technologies. The most important thing, which I hate about the duplication, is the maintenance. Every load, if you encounter an issue in particular report, you have to trace down from the report to the intermediate fact table to the base fact table and source. Having these extra tables would always keep you giving and giving in terms of troubleshooting and debugging them.

2. Summary tables
There is always tendency to convert a long running query to a summary table which can then be directly used by report instead of querying on the base tables. What you end up doing is writing a PL/SQL routine or ETL mapping to populate this. All right. It does solve the problem. The performance would better up. After all summarizing the data and giving the aggregate picture is one of the important principles under pinning data warehouse philosophy. However the problem it brings along with it are same as point 1. You need to develop, test, and maintain this extra logic.

So how to remediate this? Possible ways:

  • Make the reporting tool use the base fact with the necessary transformation and calculation be part of report query.
  • If the reporting tool does not support some transformation functions whic are available in the database or you would like to keep report query simple, encapsulate the transformation or calculation logic in database view. Database views are most handy feature I have used in my last project.
  • If the transformation or calculation logic is very complex and simple SQL is not sufficient to achieve it. Use features like table functions. Table functions allow you to encapsulate a complex PL/SQL logic in a function, which returns a table (or collection of rows). On top of this table function you can create relational views, which in turn could be used for reporting.
  • Materialize view is another way to tackle this kind of situation. All the above 3 approaches are the on-the-fly approach and can lead to performance issues. Materialize views could be another option to consider if there are such issues.



As mentioned reporting aggregating and summarizing the data to have high level picture and then able to drill down to the transactions is the one of the core principle of data warehouse philosophy. Most of the design principles and technologies are how to make this more efficient. There has been range of new technologies catering this particular directly or indirectly. The new products are being churned out from both traditional data warehouse vendors like Oracle (Oracle 10g OLAP option, Materialized view), IBM and MS (SQL Server 2005). Also there are new offerings from pure play BI software vendors like Hyperion, BO, Cognos. Above all there are firms like Hyperoll etc. are just catering things particular thing. The details of this offering are outside the scope of this post.

All I covered in here is about relation data warehouse. I would cover similar caveats in MOLAP applications in one of a subsequent post.

No comments: