Monday, August 28, 2006

Duplicating the data in a MOLAP application - Does it always hold good?

Continuing with the discussion on whether duplicating the data in data warehouse applications is always good or not. Today I would be talking about MOLAP application. By MOLAP application, I refer to the reporting application, which use multi dimensional storage (not relational table) to organize and store the data. Also these are the tools, which typical use multi dimensional query language like MDX, SPL (for Oracle Express) and not SQL for querying. Hyperion Essbase, Oracle Express (now Analytical Workspace and part of Oracle RDBMS), Microsoft Analytical Services etc. are some of the well-known  MOLAP platforms. Niegle Pendse’s OLAP Report is best place to understand various OLAP products in the market. Like relational data warehouse and reporting applications, there are instances of duplicating the data in the MOLAP applications too. In my previous experiences with Essbase and Oralce Express, I have seen some occurrence of data being duplicated with slight modification or transformation in multiple variables or cubes.

MOLAP applications are meant to be lighting fast. So there is lot of summarization and aggregation done in the application. Most of the calculations are part of data load and aggregation batch process. Very little gets calculated while querying the data. However there are quite a number of calculations, which are made part of formula (in Oracle Express) or dynamic calc (in case of Hyperion Essbase). Often these calculations are moving total, ranking, calculating Year to date, rolling quarter, variance etc. And ideally these calculations should be part of formula and or dynamic calc, because:
  • Because this calculations are very simple and does not involve many data values to operate on

  • Sometime this calculations involve applying some formula with different weight:
Example: ( 20 * Advertising Expenditure + 80 * Other Expenditure ) /100
Now this weight could change over the period as the organization tactics change. And this formula could evolve. So it’s better to keep them on the fly.

Even though there are things which best fit in as on-the-fly calculation, I bet there would be many implementations out there calculating them in the load scripts and storing. So all the disadvantage I mentioned for duplicating relational tables holds for MOLAP too. The extra logic needs to be written, tested and maintained. Any debugging would have to check the base data, and the intermediate data. If there are any bugs in the loading script, the data needs to be cleaned up, reloaded and re-aggregated. Would it be dynamic calculation and if there is any bug, only the formula needs to be changed no reloading of data is required.

At times there are scenarios were you want to expose a cube with the respect to fewer dimensions and not all. In OFA (Oracle Financial Analyzer, a Financial reporting application using Express as backend) there are times when FDI (Financial data item a.k.a. variable or cube in Express) are required to be exposed with fewer dimension and not all. So for this scenario also, one should avoid creating a stored FDI and replicating the data in it. Better approach would be to have the data rolled up in the base FDI using hierarchies and creating a formula FDI on top of base FDI with limiting the dimensions to “Total” for those which are not to be exposed for analysis for that FDI.

Lets take an example where dynamic calc could enable you to achive reporting the aggregated data. You have a following data model:

Market (with hierarchy All Market -> Region -> Country)
Years (with members like 2006, 2005 etc. No hierarchy)
Product (with members P1, P2, etc. No hierarchy)

Gross Sales <Year, Product and Market>
Tax Rate <Year, Market>

How could we dynamically calculate Net Sales <Year, Product and Market> from the above?

Net Sales = Gross Sales * (100/ (100 + Tax Rate))

Lets see what’s wrong with this formula? Tax Rate is something you can’t aggregate on Market. Tax of India + Japan + China + Singapore etc. would not give you the tax rate for Asia. Neither would the average of this give you. So you don’t have tax rate for “Asia” and hence above formula would not return you the Net Sales for “Asia”. All you might need is another stored cube called “Net Sales” which you can use to store the “Gross Sales” less “Tax”. The cube can then be aggregated to return the numbers at Asia or All Market.  

Not trying to contradict what I mentioned earlier. For this scenario and similarly many others, there is no direct way to achieve using the dynamic calc. In Oracle Express you can end up writing a SPL code which can do so dynamically, however in nutshell it would be adding the net figure of India, Japan and so on and returning the result dynamically. This could slow down the performance depending on the number of members you have. If you have 220 markets, the All Market level calculation would take quite long to come through.

To recap what I said in this and last post, is that there are different scenarios where one gets inclined to replicate the data either with minor transformation or aggregate. However one should consider doing this only if the on-the-fly calculation can not achieve the said results with permissible performance.

Sunday, August 27, 2006

Some Web 2.0 stuff..

Few Web 2.0 things, which I started following in last few weeks. Techcrunch is one of the best tech news site I have found in the recent times. It is sort of one stop shop for all the tech and web 2.0 related news. The content is very concise and complete. It’s always good to browse through happenings around in the tech industry. Digg is another news site. Though I heard of it few years back, but newer used to follow things over there. In the last few weeks I have started reading through the stories at Digg. Like Tech crunch, even this is sort of one stop shop for tech news. Most of the times, the stories in Techcrunch and Digg do overlap. But Digg gives a different taste. The commentary going around is good to read.

Few more handy Web 2.0 utilities which I started using in the last few weeks. Guess what? All are from Google. Calendar , Reader and Writely. The altruistic Google has been generous as like ever before. As I always write, they are showering best of the collaboration and personal management utilities, all as SaaS and for free. The usability, the UI, the features and the functionalities, all unmatched compare to any commercial desktop tools. Of course there are quite a number of limitations compare to desktop tool some because this tools run from browser. But the best all this tools have is the Web 2.0 community related features. You can share the calendar, you can view others calendar across the globe, you can create the documents in writely, share them, co-author the a document by multiple people, maintain versions, blog them, publish them as HTML and PDF and what not. The Google Reader is an excellent RSS feed aggregator. Initially I was big fan of My Yahoo. It was very easy to subscribe the feeds and collate them at one place. Easy one page snapshot of all the feeds. Easy to customize the interface. However, you cannot read the original stories for this feeds in the same page. Got to open it in the new window or so. With Google Reader its damn easy to read through the whole feed. At least for those who publish the whole story or article in the feeds. TechCrunch is one and most of the blogs at Blogger do publish the whole article in the feed. So its quick to scan through the whole story without switching the tabs. However what I miss in Google Reader is to see all the feeds in one page, in one view something like My Yahoo. Probably there is a way you can do it. Let me explore in the coming days. For time being I am using both My Yahoo and Reader. Of Course one can look for the elaborate reviews of these tools on Internet and blogsphere.

Saturday, August 26, 2006

Duplicating the data in a data warehouse - Does it always hold good?

Duplicating the data in an applications. Does it hold good always? I won't be talking about the whole range of application but focusing on data warehouse and reporting applications. If we go by some of the definition and design principles of data warehouse, they mention about redundancy in the data model. One of the golden rule for designing a typical data warehouse is to have the data model as a simple star schema. Star schema with denormalized or redundant dimension tables. The technique is also referred as Dimensional modeling. There has been a long running debate on what is the better approach for designing a data warehouse: Dimensional modeling or ER (traditional 3NF) modeling. Let us keep ourselves away for this for now.

So what I want to cover here are:
1. Duplicating fact tables or dimensional in the same data model with some calculation or some minor transformation.
I have seen this quite a times in my previous experiences. Duplication has all inherent disadvantage like duplicating in the efforts to write the loading program to populate this extra tables. This new program comes with its on effort to debug and test them. Extra space for these tables is one more thing. However the storage has negligible impact since the ever storage cost is ever decreasing and improvements in the RDBMS technologies. The most important thing, which I hate about the duplication, is the maintenance. Every load, if you encounter an issue in particular report, you have to trace down from the report to the intermediate fact table to the base fact table and source. Having these extra tables would always keep you giving and giving in terms of troubleshooting and debugging them.

2. Summary tables
There is always tendency to convert a long running query to a summary table which can then be directly used by report instead of querying on the base tables. What you end up doing is writing a PL/SQL routine or ETL mapping to populate this. All right. It does solve the problem. The performance would better up. After all summarizing the data and giving the aggregate picture is one of the important principles under pinning data warehouse philosophy. However the problem it brings along with it are same as point 1. You need to develop, test, and maintain this extra logic.

So how to remediate this? Possible ways:

  • Make the reporting tool use the base fact with the necessary transformation and calculation be part of report query.
  • If the reporting tool does not support some transformation functions whic are available in the database or you would like to keep report query simple, encapsulate the transformation or calculation logic in database view. Database views are most handy feature I have used in my last project.
  • If the transformation or calculation logic is very complex and simple SQL is not sufficient to achieve it. Use features like table functions. Table functions allow you to encapsulate a complex PL/SQL logic in a function, which returns a table (or collection of rows). On top of this table function you can create relational views, which in turn could be used for reporting.
  • Materialize view is another way to tackle this kind of situation. All the above 3 approaches are the on-the-fly approach and can lead to performance issues. Materialize views could be another option to consider if there are such issues.

As mentioned reporting aggregating and summarizing the data to have high level picture and then able to drill down to the transactions is the one of the core principle of data warehouse philosophy. Most of the design principles and technologies are how to make this more efficient. There has been range of new technologies catering this particular directly or indirectly. The new products are being churned out from both traditional data warehouse vendors like Oracle (Oracle 10g OLAP option, Materialized view), IBM and MS (SQL Server 2005). Also there are new offerings from pure play BI software vendors like Hyperion, BO, Cognos. Above all there are firms like Hyperoll etc. are just catering things particular thing. The details of this offering are outside the scope of this post.

All I covered in here is about relation data warehouse. I would cover similar caveats in MOLAP applications in one of a subsequent post.

eLitecore and me

Today, I stumble upon this news story is about eLitecore's expansion in middle-east. Ofcz thats a good news for eLitecore. They are a great company with the great bunch of products and great people.

eLitecore is a mid-size Software development house based in Ahmedabad, India. The company has a long and vibrant history. I know fairly good about them since I worked with them for couple of months and reading the news story took me back to good olds days at eLite. Couple of months, huh.. was it a consulting assignment? No. Actually I joined them on permanent rolls but just 2 months down the line, got an offer from Oracle India Ptv. Ltd. Since it was early days of my carrier and probably thought that working for a bigbrand would add lot of weight to my resume. And thats it, I quit. That was the end of my spree of working with startup. In 2003, I worked for Rightway Solution and eLitecore. And the learnings were one of the best, a rookie in IT field can get. In Dec,2003 I joined Oracle and since then have been with big corporates. So kindda have a flavour of both startup and biggies and how different its working with them. Ofcz it goes for another post. Probably in the comings weeks would write on this, since there are lot of things to share over here.

All that aside, lets come back to eLitecore. eLitecore has its genesis from IceNet one of the first ISPs in India during the early days of Internet in mid nineties. eLitecore started with developing network management and billingsoftware for Icenet and latter went to become a leading product development and services company. There are quite a few successful product based company in India (most of them are into services) and eLitecore is one of them. Part of eLitecore's business was services but product development has been their focus all along. Their primary product lines deals with network management, network security tools and billing software for small and large scale ISP. Their flagship product Cyberoam now branded as UTM has large install base in both domestic and international market. The product comes with tonnes of features like configuring Internet access policies, bandwidth sharing among the application, security features against attacks like Phising and Pharming, loads of reporting to get better insight in the Internet access patterns, VPN and the list goes on and on.

Coming to my stint at eLitecore, I would say those were one of the best days in my IT carrier. I got to learn lot of the insights about developing a product. eLite follows Agile methodologies for product development. The team size is usually small so there is lot of things on your plate to do. There were earnings on how to

  • architect products
  • creating the library and reusable code base
  • how to leverage on open source
  • how to make best use existing libraries out there
  • integration
  • designing data models
  • Usability design of the systems
  • developing a tool typically for network access management

Of course I didn't gain mastery in all the above. The learnings were not that intense for some of the above items, but I did get good starting point on the considerations for developing a software product.

Though after eLitecore, I never worked in network management tools, or Linux or even in Product development of late. All I am doing is technical consulting and implementation of BI and DW products. However the learnings from eLitecore are instrumental in my day to day working.

Tuesday, August 22, 2006

Hitec Ecosystems and India

Yesterday was listening through a keynote address by Dr. C. K. Prahalad titled “Emerging Hi-Tech Ecosystems” from the Software 2006 conference. Speaker starts with defining what is a ecosystem and what are the major ecosystems which have drawn considerable attention in recent past. The talk is packed with lot of stats and some are really surprising like the way software exports grew in India from mere few millions in the start of 90s to 10 billion by 2005. The talk revolves around economics at India and US. The speaker mentions that India is no more an outsourcing partner but innovation partner for US.

One great point which the speaker brings up is how India can help reduce down the cost of innovation and diffusing the innovation by its large pool of skilled labor and booming manufacturing and other sectors like pharma, services etc. He mentions on how ecosystems can interact with each other by complementing and contributing to each other. An example he gave was that IT and Automotive industry in India are concentrated in 3 parts of India that is Delhi-Noida, Mumbai-Pune and Chennai. The point he makes is how should the automotive industry leverage on the software sector for innovating new technology know-how of doing the things in better and efficient way.

The talk also brings up a great thought of making IT and software affordable for the 80% of underprivileged population of the world. He gives an example of a village from central Madhya Pradesh were the farmers have started using computers and Internet to keep them updated about the weather or using the latest technology and novel farming practices or the latest prices of commodity goods at Chicago Trade Exchange which in turn would help them get the better prices for there goods. This particular thought is close to what Vinod Khosla said in a conversation at Web 2.0 2005 conference held in Sept,2005 where he mentioned of using Internet to deliver the high end education to all the underprivileged people in the remote corners of the world. An striking example he gave was to have a sort of remote Harvard university where 400,000 are listening to a lecture from a eminent scholar from remote corners of the world.

Overall the keynote bundles great thoughts. Must listen.

Tuesday, August 08, 2006

SAP NetWeaver BI and SAP BI Accelerator

If you see the BI market, and to that matter any market to say, it is getting crowded like never before. Today we have large number of players catering both to both BI Applications and BI platform market place. Apart from this, there are niche players like whose focus is only performance boosting, data quality and so forth.

For me, I just follow Oracle BI tools and these days Siebel Analytics and some of the related offering. Also had spend some time with Hyperion product line in my previous assignment. But it always good to know some bits and pieces about offerings form other vendors.

Today I ran into this whitepaper describing SAP BI Accelerator. The whitepaper focuses on how does BIA fits in the typical SAP BI implementation. I had hardly followed SAP BI offerings in past, except that I read some news group discussion on SAP BW vs other data warehouse technologies and stuff. However this article did give me some insight on what is SAP NetWeaver BI and how a typical implementation of SAP BI works. Sort of high-level view of what could be the architecture of SAP BI. To me, SAP BI closely resembles to Siebel Analytics. Setting up SAP BI would consist of defining the data model, which could consist of either relation tables (ODS) or dimension data model (Infocubes).

Like any other relational OLAP implementation, SAP BI allows you to create Aggregates and define the metadata for the same. This will allow the query to be redirected to them instead of base tables. However, this aggregates like any in Siebel Analytics needs to be maintained outside the SAP BI purview. Also these aggregates are pre-defined and can cater to limited set of queries. So here comes BI Accelerator for the rescue. With BIA setup, you no more need to set up the aggregates. The BIA would in turn use the inbuilt proprietary technology to aggregate the data and cache it for subsequent hits. The concept behind BI Accelerator is close to HOLAP as author Naeem Hashmi from Information Frameworks says
For Data Warehouse pros, the concept of BI accelerator is similar to good old HOLAP, although the technology and approach is radically different. Meaning, the content is transformed into proprietary structures in another layer on top of Relational-OLAP implementation. User access layer sends incoming queries to HOLAP for quick access/navigation instead of Relational-OLAP. The only difference here is that BI accelerator uses powerful search engine technology, transparent to traditional data warehouse end users.”

The white paper is a good read and gives some fundamental insights of BI Accelerator.