Monday, August 28, 2006

Duplicating the data in a MOLAP application - Does it always hold good?

Continuing with the discussion on whether duplicating the data in data warehouse applications is always good or not, today I will be talking about MOLAP applications. By MOLAP application, I refer to a reporting application that uses multidimensional storage (not relational tables) to organize and store the data. These are also the tools that typically use a multidimensional query language such as MDX or SPL (for Oracle Express), and not SQL, for querying. Hyperion Essbase, Oracle Express (now Analytic Workspaces, part of the Oracle RDBMS), Microsoft Analysis Services, etc. are some of the well-known MOLAP platforms. Nigel Pendse's OLAP Report is the best place to understand the various OLAP products in the market. Like relational data warehouse and reporting applications, there are instances of data being duplicated in MOLAP applications too. In my previous experience with Essbase and Oracle Express, I have seen several occurrences of data being duplicated, with slight modification or transformation, across multiple variables or cubes.

MOLAP applications are meant to be lightning fast, so a lot of summarization and aggregation is done in the application. Most of the calculations are part of the data load and aggregation batch process; very little gets calculated while querying the data. However, there are quite a number of calculations that are made part of a formula (in Oracle Express) or a dynamic calc (in Hyperion Essbase). Often these calculations are moving totals, rankings, year-to-date figures, rolling quarters, variances, etc. Ideally these calculations should be part of a formula or dynamic calc, because:
  • These calculations are very simple and do not involve many data values to operate on

  • Sometimes these calculations involve applying a formula with different weights:
Example: ( 20 * Advertising Expenditure + 80 * Other Expenditure ) / 100
These weights could change over time as the organization's tactics change, and the formula itself could evolve, so it is better to keep such calculations on the fly (see the sketch after this list).
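To make the idea concrete, here is a minimal Python sketch of such an on-the-fly weighted calculation, outside any particular OLAP engine. The cube is simulated with a plain dictionary, and the measure names, weights and figures are purely hypothetical:

# Sketch: an on-the-fly weighted calculation over a simulated cube.
# Only the base measures are stored; the blended figure is derived at
# query time, so changing the weights never requires a reload.
cells = {
    (2005, "Advertising Expenditure"): 120.0,
    (2005, "Other Expenditure"): 480.0,
    (2006, "Advertising Expenditure"): 150.0,
    (2006, "Other Expenditure"): 500.0,
}

def blended_expenditure(year, adv_weight=20, other_weight=80):
    # Derived measure, evaluated on the fly from the stored base measures.
    adv = cells[(year, "Advertising Expenditure")]
    other = cells[(year, "Other Expenditure")]
    return (adv_weight * adv + other_weight * other) / 100

print(blended_expenditure(2006))                                   # current weighting
print(blended_expenditure(2006, adv_weight=30, other_weight=70))   # new tactics, no reload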

Even though there are things that fit best as on-the-fly calculations, I bet there are many implementations out there calculating them in the load scripts and storing the results. So all the disadvantages I mentioned for duplicating relational tables hold for MOLAP too. The extra logic needs to be written, tested and maintained. Any debugging has to check both the base data and the intermediate data. If there is a bug in the loading script, the data needs to be cleaned up, reloaded and re-aggregated. Had it been a dynamic calculation, only the formula would need to be changed; no reloading of data would be required.

At times there are scenarios where you want to expose a cube with respect to fewer dimensions and not all of them. In OFA (Oracle Financial Analyzer, a financial reporting application using Express as the backend) there are times when an FDI (Financial Data Item, a.k.a. a variable or cube in Express) needs to be exposed with fewer dimensions. For this scenario too, one should avoid creating a stored FDI and replicating the data into it. A better approach is to have the data rolled up in the base FDI using hierarchies and to create a formula FDI on top of the base FDI, limiting the dimensions that are not to be exposed for analysis to their "Total" member.
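The same idea can be sketched outside Express. The "formula" view simply reads the already-rolled-up cell, with the dimensions that are not exposed pinned to their total, instead of copying the data into a new stored structure. A hypothetical Python illustration:

# Sketch: exposing a cube over fewer dimensions without duplicating data.
# The base cube is keyed by (market, year, product); the reduced view pins
# the Product dimension to its rolled-up "Total Product" member.
base_fdi = {
    ("India", 2006, "P1"): 40.0,
    ("India", 2006, "P2"): 60.0,
    ("India", 2006, "Total Product"): 100.0,   # rolled up via the hierarchy
}

def reduced_fdi(market, year):
    # A "formula" view over <market, year>: no stored copy, just a lookup
    # against the base cube at the Total Product level.
    return base_fdi[(market, year, "Total Product")]

print(reduced_fdi("India", 2006))   # 100.0, served straight from the base FDI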


Let's take an example and see whether a dynamic calc can give you the aggregated data you want to report. You have the following data model:

Dimensions:
Market (with hierarchy All Market -> Region -> Country)
Years (with members like 2006, 2005 etc. No hierarchy)
Product (with members P1, P2, etc. No hierarchy)

Cubes:
Gross Sales <Year, Product and Market>
Tax Rate <Year, Market>

How could we dynamically calculate Net Sales <Year, Product and Market> from the above?

Net Sales = Gross Sales * (100 / (100 + Tax Rate))

Let's see what's wrong with this formula. Tax Rate is something you cannot aggregate along Market: the tax rates of India, Japan, China, Singapore, etc. added together would not give you the tax rate for Asia, and neither would their average. So you don't have a tax rate for "Asia", and hence the above formula would not return Net Sales for "Asia". What you might need is another stored cube called "Net Sales", which you can use to store "Gross Sales" less tax. That cube can then be aggregated to return the numbers at the Asia or All Market level.
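A small Python sketch (with made-up figures) shows the difference between applying the formula to a rolled-up Tax Rate and computing Net Sales at the country level first and then aggregating:

# Sketch: why Net Sales has to be derived at the leaf (country) level and
# then aggregated, rather than computed from an aggregated Tax Rate.
# All figures are invented for illustration.
gross_sales = {"India": 3000.0, "Japan": 500.0, "Singapore": 1500.0}
tax_rate = {"India": 30.0, "Japan": 5.0, "Singapore": 7.0}   # percent

def net(gross, rate):
    return gross * (100.0 / (100.0 + rate))

# Correct: net sales per country, then rolled up to "Asia".
asia_net = sum(net(gross_sales[c], tax_rate[c]) for c in gross_sales)

# Wrong: roll the tax rate up (sum or average) and apply the formula once.
asia_gross = sum(gross_sales.values())
net_from_summed_rate = net(asia_gross, sum(tax_rate.values()))
net_from_average_rate = net(asia_gross, sum(tax_rate.values()) / len(tax_rate))

print(round(asia_net, 2))               # leaf-level figures aggregated up
print(round(net_from_summed_rate, 2))   # noticeably too low
print(round(net_from_average_rate, 2))  # still off from the correct figure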

I am not trying to contradict what I said earlier. For this scenario, and many others like it, there is no direct way to achieve the result using a dynamic calc. In Oracle Express you could end up writing SPL code that does it dynamically, but in a nutshell it would be adding up the net figures for India, Japan and so on and returning the result on the fly. This could slow performance down considerably depending on the number of members you have: with 220 markets, the All Market level calculation would take quite a while to come through.
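Roughly speaking, that dynamic roll-up amounts to the following Python sketch: every query walks all the leaf markets, derives the net figure for each, and adds them up, so the cost grows with the member count, whereas a stored, pre-aggregated cube answers with a single lookup. Market names and figures here are made up:

# Sketch: a dynamic roll-up touches every leaf market on every query,
# so its cost grows linearly with the number of markets.
markets = ["Market_%d" % i for i in range(220)]            # e.g. 220 leaf markets
gross_sales = {m: 1000.0 + i for i, m in enumerate(markets)}
tax_rate = {m: 5.0 + (i % 20) for i, m in enumerate(markets)}

def all_market_net_sales():
    # Evaluated on the fly: one full pass over the leaf markets per query.
    return sum(g * (100.0 / (100.0 + tax_rate[m])) for m, g in gross_sales.items())

print(round(all_market_net_sales(), 2))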

To recap what I said in this post and the last one: there are different scenarios where one is inclined to replicate data, either with a minor transformation or as an aggregate. However, one should consider doing so only when an on-the-fly calculation cannot achieve the same result with acceptable performance.


