Friday, December 23, 2005

Great essay: How to be a Programmer

A few days back I stumbled upon this very good essay titled "How to Be a Programmer: A Short, Comprehensive, and Personal Summary" by Robert L. Read. The essay is a must-read for everyone who is part of the "software programmer" tribe. It is a comprehensive piece detailing the various facets of a software engineer's professional life.

The essay covers a range of topics, from basic debugging skills to managing projects and conceiving good design, all the way to managing people, team dynamics, and personal aspects like "when to go home".

The author is concise and focused while explaining all these topics. The best part of the article is that it is very generic: the recommendations hold for any role in software development, regardless of the technology involved.

Another good part of the essay is the way it is structured. All the skills required to be a successful programmer are grouped into three sections: "Beginner", "Intermediate" and "Advanced". The "Beginner" section talks about debugging, performance tuning, fundamental concepts like memory and I/O management, testing, experimenting, team skills, working with poor code, source code control, and so on. All of it is interesting and worth reading.

The "Intermediate" section covers some of the soft skills, which are of paramount importance: personal skills, how to stay motivated, how to grow professionally, how to deal with non-engineers, and so on, along with technical topics like managing development time, evaluating and managing third-party software, and when to apply fancy computer science. Finally, the "Advanced" section talks about making technology judgments, using embedded languages, dealing with schedule pressure, growing a system, dealing with organizational chaos, and more.

All in all, a very interesting read.

Sunday, December 18, 2005

Outsourcing to small vendors: What to consider?

A few days back I was having an email discussion with a friend of mine on the prime considerations when outsourcing a project to a small vendor.
I have personally worked with both a small outsourcing firm in India named Rightway Solution (www.rigthwaysolution.com) and a significantly larger outsourcing and system integration company in Singapore.

Below are some of the snippets from our discussion:

Just wanted one favor from you. I have a project which I need to outsource to India. Can you tell me the name of the site where I can post my requirement, so that bidders can bid on it? Also, can you tell me the name of the company where you did your first job and worked on outsourced projects? And any hints or pitfalls to take care of while outsourcing projects?

Regarding the outsourcing stuff: you can go to an e-lancing site.
www.elance.com is the best in the league, very sophisticated. There are some smaller ones too; I don't have their URLs off the top of my head, but you can Google them.

Pitfalls: make sure you choose a good vendor who offers both quality and timely delivery. As you know, in outsourcing the delivery model is very critical, so be on the lookout for a vendor with a very sound delivery model. It need not be very sophisticated, but it should be robust and of good quality.

Rightway, the company I worked for, has a very good track record. They are not a big player on elance.com, but they are an established vendor on some of the other e-lancing portals (I am not able to recall the names).

The second aspect of outsourcing is communication:
- The way you communicate your requirements.
- The way you monitor the progress of the project.
- The way you gauge the honesty and quality of their work.
- The way you get things delivered.
- The terms and conditions, payment, etc.

For all of this, communication is the cornerstone, and it is of paramount importance when you never see the other party physically.

1) For a J2EE project, what delivery model can be thought of? I perceive it as a WAR file that can be deployed on my server. The DB setup and other things would just be scripts which I need to run. All the necessary documents (like TD, FD, test plans, etc.) should be delivered as they occur in the phases of the SDLC. Please suggest any other model you can conceive of.

The final software (the piece of code) and the documents are as important as visibility into the architecture, the whole codebase, and the details of enhancing/maintaining that code. Typically, the client assigning the project to an outsourcing partner is mostly bothered about the final software delivery and pays less attention to the documentation or the overall design of the software.
So I think the focus should be on all the different aspects of the software and not just the final piece of code.

- Key to the delivery is the project plan. Depending on the size of the project, an accurate project plan (both high level and detailed) should be agreed upon by both client and vendor. Appropriate milestones should be laid out, and one should make sure there are no slippages.
- Regular status update meetings where the client and vendor sit together and take stock of the situation. Any issues should be raised. A proper status report template should be followed, and a close watch should be kept on whether the milestones are being met.
- Regular technical discussions/walkthroughs, especially during the design of the architecture and the overall solution. Mind that you will have very little time, and it will be too late, to rectify design goof-ups at a later stage of the project. The piece of code could be a mere speck and of no use if the core is a mess.
- I would advise having regular sign-offs. This works in tandem with your project plan: every milestone should be signed off by the client. Vendors are typically reluctant about this since it affects their flexibility, but I would advise you to insist on it if the size of the project is significant.

Overall, one should keep a close watch on the developments happening, even without physical proximity, through the various communication channels.

I won't recommend following the exact SDLC cookbook if you are dealing with small vendors. Design documents are critical, so get them done; they need not be formal or sophisticated in terms of templates/formats, but they should be readable, understandable and comprehensive. It does not matter whether one calls them LLD, HLD, etc. Two levels of design document are advisable if the size of the project is significant.


2) I am sure it would be financially beneficial to contact Rightway directly, rather than going through eLance. Agreed? Also, does that approach have any cons in terms of accountability or fraud?

Yes, approaching Rightway directly would be beneficial. The only advantage of going via eLance is that the payment is handled through eLance: you pay eLance and they pay the vendor. However, it involves some commission on the part of both client and vendor.
It's your call. Rightway has pulled out of eLance since most of their business comes from other channels like references, existing clients, etc.

If you are thinking of Java/J2EE, Rightway does not have much expertise in it. Their forte is PHP and .NET/ASP. Just a thought here: PHP is no doubt a proven scripting language for web applications. It is robust, supports a wide variety of backends, is open source, and I would say it is being used, grown and nurtured actively by a large community. Even small to medium-complexity applications can be designed and developed perfectly well and efficiently using PHP.

Rightway no doubt has tons of experience in this technology. They have best practices, very good processes, and design/development methodologies around PHP. PHP is also easy to maintain and very portable across a wide range of web containers. So think it over; the rest is your call.

3) Code copyright can be enforced, right? So a buyer of the service would own the copyright for the code and not the service provider, right? (This should be obvious, as is the case with TCS, Wipro, Infy, etc.)


IP (intellectual property) is something you can enforce on the vendor. To be honest, no one can completely stop anyone from using the code, but you can have a legal contract covering how the IP is to be managed. Usually the vendor does not sell the same code/product as-is to another client. They obviously reuse some generic components/best practices in other projects, and no one can stop that.

I'm back after a long time

It has been almost six months now. A lot happened during these six months, with plenty to do on both the personal and professional fronts.

In July, I visited my native place, Kodinar (a small town in India). Come August, I was back in Singapore. During this time we were embarking on a new project.

This time it involved Hyperion Essbase, a new beast I had never worked with before. The scope was to implement a sales and marketing analytics platform: 60+ KPIs to model, with 5 core dimensions anchoring these 60-odd facts.

Most of the KPIs were semi-additive. There were even some KPIs which were not additive along any of the dimensions.

The client I have been working with for a year now has a very different set of metrics they want to implement in their ongoing effort to create an enterprise-wide data warehouse.

So the work involved both learning and experimenting with new things. I will spend some time in the coming weeks blogging about some of my experiences over these last six months.

Tuesday, June 28, 2005

No Silver Bullet... by Frederick P. Brooks, Jr.

Just came across a very good paper: "No Silver Bullet: Essence and Accidents of Software Engineering" by Frederick P. Brooks, Jr.

If I am not mistaken, he is the same person from IBM who wrote The Mythical Man-Month. The paper is a great read for anyone in the software industry. Though it dates back to April 1987, the principles it delineates still very much hold.

The following snippets from the paper talk about the invisibility aspect of software.

Invisibility. Software is invisible and unvisualizable. Geometric abstractions are powerful tools. The floor plan of a building helps both architect and client evaluate spaces, traffic flows, views. Contradictions and omissions become obvious. Scale drawings of mechanical parts and stick-figure models of molecules, although abstractions, serve the same purpose. A geometric reality is captured in a geometric abstraction.

The reality of software is not inherently embedded in space. Hence, it has no ready geometric representation in the way that land has maps, silicon chips have diagrams, computers have connectivity schematics. As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs superimposed one upon another. The several graphs may represent the flow of control, the flow of data, patterns of dependency, time sequence, name-space relationships. These graphs are usually not even planar, much less hierarchical. Indeed, one of the ways of establishing conceptual control over such structure is to enforce link cutting until one or more of the graphs becomes hierarchical.1

In spite of progress in restricting and simplifying the structures of software, they remain inherently unvisualizable, and thus do not permit the mind to use some of its most powerful conceptual tools. This lack not only impedes the process of design within one mind, it severely hinders communication among minds.

Frederick P. Brooks is always a great read. The way he writes about the intricacies of the software industry compared to manufacturing is quite intriguing.

Saturday, June 18, 2005

Trying out OWB Java API

I have been working with OWB for a year, and the learning has yet to take a back seat. During the initial days with OWB the main attraction was exploring the various operators (pivot, match-merge, sort, etc.), trying out all the possible things one can do from the mapping editor, configuring various objects from the OWB client, implementing things like SCDs, exporting OLAP metadata for use with BI Beans, process flows, and so on.

Later on, the focus shifted to the Runtime and Design Time browsers, both significant features of OWB. The RAB (Runtime Audit Browser) is a great place to see audit logs, error messages, and so on. The Design Time browser has two great pieces (the impact diagram and the lineage diagram), which can be of great help while analyzing the impact of any change to the OWB mappings.

Then came the exposure to the various scripts under /owb/rtp/sql. The scripts were a great help in understanding the runtime platform and how to manage the RTP service. Apart from that, scripts like sql_exec_template.sql, abort_exec_request.sql and list_requests.sql were really handy in managing individual runs of mappings.

Then came the time to explore the various tables/views in the design-time and runtime repositories. Later on I got a chance to try my hands at OMBPlus, and man, it was just too cool to work with. The OWB client (GUI) is extremely unusable if there is a repetitive task to be done. For example, suppose you want to substring all the attributes to 30 characters, and there are 50 such attributes. What does one do? Drop an Expression operator, get all these attributes into the input group, then add the 50 attributes one by one to the output group, then change the expression property of each of these 50 attributes to do a substr() of the corresponding input attribute. That is a heck of a lot of work. On the OMB side it is just one small script that does the whole thing. So if there is anything repetitive and tedious, OMBPlus is the answer. Even taking backups of the repositories into MDL files is repetitive; one can write a small batch job using OMBPlus to take care of it. Apart from this, there are many more things OMBPlus can accomplish for you, like creating template mappings, deploying objects, synchronizing objects, importing metadata, etc.

So over this period I kept learning the different pieces of OWB, which I feel is an extensive suite providing comprehensive capabilities for any ETL and data integration task. Still, I had a few pieces left, one of which was the Java API for manipulating metadata. Oracle exposes a public Java API for manipulating OWB design and runtime metadata, deploying and running mappings, etc. Apart from all the capabilities of OMBPlus, the Java API has some additional ones; in fact, OMBPlus itself uses the Java API underneath to do its metadata manipulation.

So today was the day to get my hands dirty with the Java API, which ships with OWB 10g Release 1 onwards. The first thing I did was look for some documentation, and all I managed to find was http://download-west.oracle.com/docs/html/B12155_01/index.html, the Javadoc for the API. As is typical, the Javadoc just describes the classes, interfaces, methods, etc.; it is not a step-by-step tutorial on how to write a simple Java program using the OWB Java API.

The API is very big and so is the doc. The above link leads you to a list of 25-odd Java packages for doing various things, but there is no hint on where to start.

I brought up my Eclipse (www.eclipse.org) workbench and created a simple Java project named OWBApi. There was a bit of a struggle locating the jar file for the public API; I managed to find it under /owb/lib/int/publicapi.jar. I added this jar to the build path of the OWBApi project via Project -> Properties -> Java Build Path -> Libraries -> Add External JARs.

Back to the Javadoc for the OWB Java API: oracle.owb.connection was the first package I hit, since everything has to start with a connection. After a bit of scrambling here and there I managed to put together the following piece of code in ConnectOWB.java:


import oracle.owb.connection.OWBConnection;
import oracle.owb.connection.RepositoryManager;

public class ConnectOWB {
    public static void main(String[] args) {
        try {
            // Get the repository manager and open a connection to the design
            // repository: repository user, password, host:port:SID connect string.
            RepositoryManager rm = RepositoryManager.getInstance();
            OWBConnection owbconn = rm.openConnection("dtrep",
                    "dtrep", "localhost:1521:orcl", rm.MULTIPLE_USER_MODE);

            if (owbconn != null) {
                System.out.println("Connection established.");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


When running the above, I got the following error message:

java.lang.NoClassDefFoundError: oracle/wh/repos/impl/foundation/CMPException
at ConnectOWB.main(ConnectOWB.java:23)
Exception in thread "main"

This means the build path is missing some jar containing oracle/wh/repos/impl/foundation/CMPException. Now how to find it? I tried searching for this error on the OWB forum on OTN, etc., but to no avail. Finally it clicked to add all the jars under /owb/lib/int. Doing so got rid of the NoClassDefFoundError, but I ended up with one more error message saying it was
unable to locate the Compatibility.properties file.

Where to go now? I searched for this file under the OWB home and located it under /owb/bin/admin. In order to make it available to my Java program I added one more build path entry for this directory. Adding a folder to the build path is as good as adding a jar, except that instead of Add External JARs one has to click Add Class Folder and specify the directory /owb/bin/admin.

That's it. I tried compiling and running the program again, and it went through: I was able to establish a connection to the design-time repository.

After oracle.owb.connection, the next stop was oracle.owb.project. I managed to do a few more things like getting the list of projects, creating a project, setting the active project, etc. The following program displays the list of projects in the design-time repository:

import oracle.owb.connection.OWBConnection;
import oracle.owb.connection.RepositoryManager;
import oracle.owb.project.*;

public class ConnectOWB {
    public static void main(String[] args) {
        try {
            // Connect to the design repository, as before.
            RepositoryManager rm = RepositoryManager.getInstance();
            OWBConnection owbconn = rm.openConnection("dtrep",
                    "dtrep", "localhost:1521:orcl", rm.MULTIPLE_USER_MODE);

            if (owbconn != null) {
                System.out.println("Connection established.");
            }

            // List the names of all projects in the design-time repository.
            ProjectManager pmgr = ProjectManager.getInstance();
            String[] projlist = pmgr.getProjectNames();
            for (int i = 0; i < projlist.length; ++i) {
                System.out.println(projlist[i]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


And the list goes on. The Java API seems a good option for writing a stripped-down OWB-like interface for a special set of users who really don't need the whole OWB client.

Java is a proven language for writing GUI applications. It has a rich set of libraries for developing user interfaces, network programming, and database programming.

One could use this API to write a browser-based interface to manipulate OWB metadata, or maybe even to create new mappings from specific templates, or at least an interface to run a job, deploy a mapping, change some metadata, etc. This really reminds me of a new feature called Expert, which is coming along with OWB Paris.


So, getting back to my OWB learning list, I have some more items left. Next would be exploring Appendix C of the user guide, which talks about extracting data from XML data sources. After that comes using AQ (Advanced Queues) and building up an understanding of pulling data from applications (SAP). Still a long way to go.


Thursday, June 16, 2005

How to get started with a new technology/tool

Learning new tools and technologies has become part of the daily chores of any IT professional. There is no way out, and no good reason why one should not learn new things. I am personally a tech-savvy guy, always on the lookout to learn new things. The interest is not just in things pertaining to data warehousing and BI, but in everything that comes my way. The only requirement is that the new tool/technology should have some fundamentals or concepts to take home.

During these last 6-7 years in IT, I have learned numerous theories, technologies, programming languages, tools, etc., most of them through self-learning. But this self-learning depended on everything I had learned before, without which it would not have been possible. Today I picked up one more tool/technology to build some understanding of it. I don't have access to the software, just the documentation; this is one of the few tools I am trying to learn without access to the software itself, though I do have extensive hands-on experience with a similar kind of technology from another vendor.

This whole thing led me to think about how one can approach taking up a new tool/technology. Three possible ways came to mind:

1. First hit the documentation and get some background, then come to the tool for hands-on work, then go back to the manuals/references, then back to hands-on, perhaps over time doing both simultaneously.
2. First hit the tool and let your intuition take the wheel: play around, stretch your understanding, then come back to the references/manuals/docs, and then back to the tool, over time doing both simultaneously.
3. First attend some seminar, talk, or discussion (much like option 1, but instead of text you get into more live things), then hit the tool, maybe go back to the manuals, come back to the tool for hands-on work, then go back to discussions, and so forth. Maybe I'd call it the spaghetti approach; you could equally start with books first, then tools, then talks, or any combination.

Which to choose? Time and availability of resources make the call here. I keep trying all of these approaches; most of the time approach 2 is a good deal for me. Approach 1 is what we have been doing since college days: first read about the C language, listen to some lectures, and then get to the labs for hands-on work. That was fine back then, since one had not built up so many fundamentals/concepts yet, nor had much exposure to tools/languages of a similar kind. Again, as with all my postings, there is no need to reach a conclusion on which is better; it depends, like everything else. My idea here is just to bring out some points.

Monday, June 06, 2005

Funky Business

This weekend I got hold of the book Funky Business, and man, I could not keep myself from finishing it. Now I don't want to end up writing one more review of this book, but I thought of sharing some of the intriguing things I liked about it. The authors have really brought in a lot of wisdom on how business should be run in the 21st century, and all of it in a different style of writing. Everything about the book is funky: the examples, the style of writing, the wisdom, the content, the authors. It's just a cool piece.

There are a lot of striking concepts which hit your nerves like a sharp tool, and a lot of striking examples (the one defining a niche market was a group of lawyers interested in pigeon racing). No doubt the book also has tons of facts (GM tried producing car stereos and it did not work out for them; a dental supplies company with 50% of the world market share is run by just 85 folks). All of that is interesting, and above all there is a lot of learning to draw from it. The one good thing about this book (or maybe a bad thing, for some) is that it is very concise and says ten things in five sentences, so one has to keep reading it again and again to appreciate all it has to offer.

Quick Linux Recipe

The other day I had the task of installing some avatar of Linux on a WinXP machine. One of my friends wanted to get started with Linux: run various commands hands-on, get hold of the basics of how Linux works, and gradually go into further details like the file system, Linux daemons, networking, and so forth.

The need was to install Linux on top of the host WinXP OS, so that he could keep working on XP and switch to Linux for hands-on practice. The desktop he was running was a bit dated: 128 MB of RAM and a 500 MHz CPU, with a mammoth WinXP already running on top of it.

VMware was there to create a virtual machine on top of which I would install some distribution of Linux. Red Hat was big bloat (four discs for Fedora), plus a lot of space to set the whole thing up, plus it would be a killer for that CPU. So the idea was to get hold of some mini Linux distribution which does not take a whole lot of space and can be installed quickly.

I came across a roster of mini Linux distributions, but it was BeatrIX that clicked for me. BeatrIX is a cool piece of Linux bundling (<200 MB) with no need for setup, since it boots directly from the CD. It has all the pieces one would need to get started with Linux: GNOME, a text editor, a browser, a terminal, and so on.

I downloaded the ISO image of BeatrIX. Instead of burning it to a CD, I set the VMware CD-ROM to read the ISO image file directly. And that's it; the whole thing took less than 30 minutes. Just to recap the whole recipe:

1. Download VMware Workstation 5 for Windows and install it. Register and get the evaluation license key from the VMware site (mind that it's only for 30 days).
2. Download BeatrIX to some location on your file system.
3. Launch VMware and create a virtual machine of type "Other Linux distribution, kernel 2.6".
4. Modify the CD-ROM device in VMware to read from an ISO image and point it to the downloaded copy of BeatrIX.
5. That's it. Click Start and you have your Linux set up. BeatrIX does not need any installation since it boots directly from the CD.

By the way, BeatrIX seems to have been built and inspired by an interesting set of people (and cats). Check out their site.

Sunday, June 05, 2005

What is XML?

This three-letter word has been taking the IT world by storm since its inception in the late '90s. Now what is it all about? I hear a lot of folks talking, defining, trying to understand, trying to explain to others what XML is; even I myself have indulged in such discussions. I kept hearing lots of definitions floating in the air: "It's the standard to encode data", "It's an enhanced version of HTML", "It's extensible HTML, you can create your own tags". But why on earth would I need to create these tags? What for?

The understanding I built up in the course of these discussions and reading was this: XML is a standard way to encode data pertaining to anything, ranging from transaction details, lists of entities, a message for some application, configuration, metadata, etc., and the way it differs from a simple text file is that in XML the data is stored in a hierarchical fashion, and an XML document can be bound to a schema or DTD which specifies the structure and content of this hierarchy. XML is a way to package data, and this packaged data can sit in a file, a network packet, a message, a database table, or anything else.

I have not worked with XML per se; as such, there is no such thing as "working with XML". XML is not a programming language one can use to create an application, nor is it meant for presentation like HTML. As someone has said, one will encounter XML everywhere; even when your car breaks down, it will send a message in XML to the nearest service center for help.

XML is meant for nothing in particular but for everything. I have kept seeing XML everywhere over the last couple of years:
- Configuration of various applications/servers
- Web services sending requests and responses in XML
- The reports I create using some tool get stored in XML files; not just reports, but any metadata generated by wizard-driven tools gets stored in XML
- I write my EJB deployment descriptors in XML; my Struts config is in XML
- WML is again an XML vocabulary
- Process flows are stored in XML
- Presentation information is stored in XML and transformed for a particular rendering device using some transformation
- I export data from a database in XML and import it into any other database.

These and many more. So why use XML everywhere if bare simple text files can do the same? Okay, what could a bare text file look like that stores a list of books and their details?

Option 1 (attribute value):

Book: Abc
Author: Xyz1
Price: 100
Pages: 252
Book: Abc1
Author: Xyz2
Price: 150
Pages: 531

Option 2 (comma separated):

Abc, Xyz1, 100, 252
Abc1, Xyz2, 150,531



And maybe there could be some more.
For both of the above options, the application has to make assumptions when consuming or generating this text format: in the first, all the attributes for one book must be placed together vertically, and in the second, all the attributes for a book must be placed horizontally in a particular order. And what if this file needs to be extended to store some more attributes for a book, for example publisher information? What all needs to change? The application? The file? We understand it: it will be a heck of a lot of work.

On the other hand, XML is also just bare text with some structure and some syntax to follow. That's it. What I have found so far that is convincing to me and that makes XML stand out over simple text encoding:
1. The structure of an XML document is extensible without affecting the application much. You can extend the XML document to store more information without affecting the application which is using it.
2. The data is stored in a hierarchical fashion, something like:
<?xml version="1.0" encoding="utf-8"?>
<Books xmlns="http://tempuri.org/XMLFile1.xsd">
  <Book>
    <Name>Abc</Name>
    <Author>Xyz1</Author>
    <Price>100</Price>
    <Pages>252</Pages>
  </Book>
  <Book>
    <Name>Abc1</Name>
    <Author>Xyz2</Author>
    <Price>150</Price>
    <Pages>531</Pages>
  </Book>
</Books>

This can very well be extended to store new attributes without really bothering the application.

3. The availability of plenty of parsers and DOM (Document Object Model, an API for accessing and processing XML documents) implementations for various programming languages, so generating and consuming XML documents is easy.
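
As a small illustration of the generation side (the "export data from the database in XML" point above), here is a hedged sketch using Oracle's SQL/XML functions; the BOOKS table and its columns are hypothetical and only meant to mirror the document shown earlier:

-- XMLELEMENT/XMLAGG build roughly the same <Books><Book>...</Book></Books>
-- document as above, straight from the hypothetical BOOKS table.
SELECT XMLELEMENT("Books",
         XMLAGG(
           XMLELEMENT("Book",
             XMLELEMENT("Name",   b.name),
             XMLELEMENT("Author", b.author),
             XMLELEMENT("Price",  b.price),
             XMLELEMENT("Pages",  b.pages)))) AS books_xml
FROM   books b;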

There are two guidelines which every XML document has to follow:
1. The XML document should be well-formed. This means every opening tag has a matching closing tag, the structure is correct, and the tags are case sensitive.
2. The XML document should be valid. This means the arrangement of tags, their attributes and their values has to follow a certain scheme. This scheme is specified in the DTD or XML Schema associated with the XML document; in the above example it is http://tempuri.org/XMLFile1.xsd which identifies the schema of the XML document.

One can Google and find tons of commentary on XML: tutorials, applications of XML, current happenings, etc. www.w3c.org is the place to get the latest on what's happening in the XML world.

Saturday, June 04, 2005

Handling testing and QA artifacts in the source system while building ETL

Now this is interesting. Your source system (the production database) has a lot of test data spread across the various entities; a typical online portal may have a routine set of test cases run every day on the production system to check its consistency. So how does one handle all these test artifacts while building the ETL? Should they be treated as part of data cleansing? Maybe, or maybe they need more serious attention than just being cleaned away.

Handling test data in the ETL involves two aspects: first, identifying and segregating the test data, and second, tracking the test data in a prescribed location. This location could be a separate set of tables in the data warehouse itself (if that is a requirement), some log/audit files, or even the same tables which store the regular data, with a tag saying that a row is a test supplier/buyer/product and not an actual one.

Segregating test data from production data depends on flagging done on the source side or some convention followed while generating the IDs for the test data; for example, IDs starting with 65xxxxxx are always test data. Another way of segregating test data would be a lookup table residing in the source system or staging area which contains the list of suppliers/buyers/products, etc., that are test data, along with the corresponding test transactions.

If the need is just to identify the test data and filter it out before bringing it into staging or the DW, life is perhaps easy. However, if the need is to bring the data into the data warehouse with some identification to separate it out, there are two possible ways to do it: populate the data into a separate set of tables, or into the same set of tables with some tagging. The latter has the advantage of saving an extra ETL at the cost of one extra flag; it also aligns the test data with the regular data, so the same constraint checking, data capture and ETL can be used. But the second approach can give you a tough time if the test data does not follow the prescribed application/business logic, perhaps because of data patched in from behind to run some test case, or some special provision in the application logic. The first approach stands out as better in that case. Enough balance should be maintained so that the main ETLs populating the regular data do not get complicated just because they handle lots of exceptions for test data. The best deal would be to segregate the test data in the first place and put it in a separate table, which in turn can be used as a lookup by the regular ETL (a sketch follows).
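
A minimal sketch of that lookup-and-tag idea, assuming hypothetical STG_ORDERS, TEST_ENTITIES and SALES_FACT tables (names and columns are purely illustrative):

-- Rows whose buyer appears in the test-entity lookup get tagged as test data;
-- everything else loads as regular data.
INSERT INTO sales_fact (order_id, buyer_id, revenue, is_test_data)
SELECT o.order_id,
       o.buyer_id,
       o.qty * o.price,
       CASE WHEN t.entity_id IS NOT NULL THEN 'Y' ELSE 'N' END
FROM   stg_orders o
       LEFT OUTER JOIN test_entities t ON t.entity_id = o.buyer_id;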

There is no definite rule of thumb (as such, there are no rules of thumb) for handling test data in the source system. It all depends on the nature of the test data, how it can be identified, and the way it needs to be tracked.

Saturday, May 28, 2005

Good article series from Dave Aldridge

While browsing through some of my favorite blogs, I came across Dave Aldridge. His blog, http://oraclesponge.blogspot.com/, seems to be one more place to be. A lot of what Dave Aldridge writes will be of interest to BI and DW followers, specifically those following the DW offerings from Oracle. His article series "A Practical Guide to Data Warehousing in Oracle", though dating back to the end of 2003, has some subtle insights on the various considerations for a typical data warehouse project.

Wednesday, May 25, 2005

Building ETL for Homegrown vs. Packaged Applications

Building ETL has always been dependent on the source systems, and things get more intriguing when the source systems keep changing on and off. The dependencies on the source systems can be broadly divided into two categories:

1. Structural dependencies
This refers to dependencies on the underlying data model: the ETL expects the source system objects to follow some structure. For example, suppose the customer name is stored in the fields FirstName and LastName; if these columns in the source system are merged into a single column CustomerName, the ETL goes haywire. Structural dependencies hold for any source system: relational databases, XML-based sources, LDAP, files, application interfaces like BAPI, etc.

2. Application and Data Dependencies
In a nutshell, this refers to the mechanism by which data gets updated in the underlying source system. All data updates have to follow some application flow, and the complete data flow and data changes are typically part of the source system documentation. A good example of an application/data dependency is LastUpdateDate getting modified for any change made to a Customer record: the ETL usually captures changes using this field (as in the sketch below), and if the field is not updated properly by the source applications, the whole extraction, and in turn the ETL, goes haywire.
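
A minimal sketch of that LastUpdateDate-based incremental extraction, with hypothetical CUSTOMERS_SRC, STG_CUSTOMERS and ETL_RUN_CONTROL tables (purely illustrative):

-- Pull only the rows changed since the last successful extraction.
INSERT INTO stg_customers (customer_id, customer_name, last_update_date)
SELECT c.customer_id,
       c.first_name || ' ' || c.last_name,
       c.last_update_date
FROM   customers_src c
WHERE  c.last_update_date > (SELECT r.last_extract_date
                             FROM   etl_run_control r
                             WHERE  r.source_name = 'CUSTOMERS_SRC');
-- If the source application fails to touch LastUpdateDate, these rows are
-- silently missed, which is exactly the dependency described above.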

So dependencies on source systems are inevitable. They are not a one-time phenomenon when you first build the ETL; they are something you have to keep tabs on, on an ongoing basis, since structural and application changes in the source systems keep happening regularly.

So, coming to the subject of this discussion, we can divide source systems into two major classes:
1. Homegrown applications: custom-built applications designed by an organization for its day-to-day operations. A typical organization has many operational systems which are either developed in-house or outsourced, typically custom applications built to support a specific function or operation of the organization.
2. Packaged applications: packaged software implemented by organizations for their operations, such as the ERP and back-office applications many organizations implement for their operational needs.

Buy versus build is a common decision CIOs and IT managers have to make these days. Packaged applications have their own advantages, like comprehensive documentation, bug support, integration and extension points, customizability, and quick roll-outs and upgrades. Similarly, homegrown applications come with their own advantages, like flexibility, customization, extensive capabilities for integration with other systems, and the ability to enhance or modify functionality on a need-to basis. Anyway, the buy-versus-build debate is beyond the scope of this posting.

Creating an ETL application for homegrown operational systems has always been challenging. This is primarily due to:
· Lack of proper documentation of the systems
· Bugs and the lack of formal support
· Patching of source systems (both underlying data model and applications)
· Lack of documentation for application flow, underlying data model and data flow.
· Upgrades of homegrown source systems can be difficult to track if there is no established documentation of the changes being made.
· Homegrown source systems have typically been seen to have integrity issues and no mechanism in place for preventing improper data updates and inconsistencies.

On the other hand, homegrown systems come with their own advantages: easy access to the application's development team, a better understanding of the business functionality, etc.

Writing ETL for packaged applications has its own pros. This is mainly because of:
· Availability of comprehensive documentation of the underlying data model, data flow and application flow.
· Bugs are well documented, along with all their implications for the application.
· Upgrades and patches are well documented.
· Data integrity is typically maintained, and there is usually a mechanism in place to prevent improper updates of the underlying data.

However, building ETL with packaged applications as the source system has its own challenges, often because of a lack of documentation or understanding of a particular area of the application.

The above discussion was just meant to delineate the intricacies of developing ETL for homegrown and packaged applications; it was not driving at any conclusion about which is better. I have had exposure to both worlds, and in my experience building ETL for packaged applications has always been easier and more efficient than for homegrown source systems. At the end of the day, as an ETL developer, one does not have much choice about what the source systems are. Perhaps I will follow up this discussion with my experiences and the considerations I had in mind while approaching both types of applications.

Sunday, May 22, 2005

Learning awk

The other day I had a task to kill some UNIX processes which were running javac (the Java compiler). One way I could have approached this was to identify all the processes running javac by running:

root@localhost# ps -ef | grep javac

and then picking each of these process IDs and killing them individually:

root@localhost# kill -9 pid

But I have seen people write one-liner shell commands using a combination of awk, ps, etc. to kill all the processes running a particular program/binary, so I thought of building this one-liner myself. I had no idea what awk was, except that it is named after the initials of the authors who wrote it (Aho, Weinberger and Kernighan). I didn't even know it was a comprehensive language for text processing.

I launched the awk manual at the shell prompt and started going through it. It took me a while to get hold of where things were heading.

Somehow I ran through the manual but didn't pick up much except {print} and a few simple things: awk can take inline programs or programs stored in a file; it is data-driven rather than procedural; $0 refers to the current line of the data; there are commands like print and printf. But somehow I was not able to get hold of the syntax. I was not understanding the grammar and how a program needs to be written. I just managed to execute:

awk {print}

After this I tried a few more:

root@localhost# awk {print Hello}
awk: cmd. line:2: (END OF FILE)
awk: cmd. line:2: parse error
root@localhost# awk {print 'Hello'}
awk: cmd. line:2: (END OF FILE)
awk: cmd. line:2: parse error
root@localhost# awk "{print 'Hello'}"
awk: cmd. line:1: {print 'Hello'}
awk: cmd. line:1: ^ invalid char ''' in expression
root@localhost# awk '{print "Hello"}'
asdf
Hello
asfd
Hello
Finally, I got hold of the tutorial GAWK: Effective AWK Programming, which gave me a head start in understanding the language.

root@localhost# awk 'BEGIN {print "Hello"}' was easy to comprehend.
Things started clicking.

root@localhost# awk 'BEGIN {print substr("Hello World",2,2)}' and so on..


Now that I was starting to understand awk and how it works, I got back to the problem which started the whole zeal: killing all the processes running javac.

ps -ef | grep javac | awk {print}
then it was
ps -ef | grep javac | awk '{print substr($0,10,6)}'
ps -ef | grep javac | awk '{printf substr($0,10,6)}'
to throw all the process IDs onto one line,
and finally
kill -9 `ps -ef | grep javac | awk '{printf substr($0,10,6)}'`

This tries to kill all the processes which have the pattern "javac" in their command line. This means it even tries to kill the "grep javac" process, which no longer exists by the time the output has passed through the awk command. Thus you will get a
bash: kill: (pid) - No such pid
error message, but the above command will still kill all the processes running javac.

The one-liner got further modified to get rid of grep:
kill -9 `ps -ef | awk '/javac/{printf substr($0,10,6)}'`

Somehow substr was sounding a bit fishy: it could be the case that the PID does not start at the 10th position or is more than 6 characters long. The one-liner was further modified to:

kill -9 `ps -ef | awk '/javac/{printf $2}'`
where $2 refers to the second field (column) of the ps output, which holds the process ID. So finally I was able to get through.

Saturday, May 21, 2005

Summary Management in Oracle database: World before and after Materialized View

Summary management has been a prime consideration for data warehouses implemented using relational database technology (ROLAP). The typical set of steps one follows for managing summaries in a data warehouse project is:

  1. Identifying the summaries required. This depends on the nature of the analytics and on report performance: if on-the-fly summarization is taking long, the summary is a candidate for pre-aggregation.
  2. Refreshing the summaries. The job does not end after creating a summary; there should be a mechanism in place to keep the summaries fresh.
  3. Querying the pre-aggregated summaries for reporting purposes. Reporting/analysis tools should have the capability to leverage the summaries.

The characteristics of a good summary management mechanism are:

  1. Transparency. This applies to both refresh and query: the summary should refresh automatically when the base fact table changes, and a summary query posed against the base table should be automatically redirected to the summary table if one is available.
  2. Efficiency. This applies to refreshing: the refresh mechanism should be able to incrementally refresh a summary rather than rebuild it from scratch.
  3. Speed. Both querying and refreshing should be fast.
  4. The capability to manage the degree of aggregation. It may be that not every possible summary of a base fact is required; the summary management tool should let you tune this.

So what summary management artifacts were available in Oracle before the introduction of materialized views in 8i? Primarily, the two ways to maintain summaries before MVs were:

  1. Snapshots. A snapshot captures the state of the data at a given point in time and stores it in a separate database object. Snapshots are not transparent to querying and need to be referred to explicitly in the query; their refresh can be scheduled periodically.
  2. Summary tables. A data warehouse can have a set of tables which store summary data, for example a separate table keeping pre-aggregated sales data at the year/country level. These tables are again not transparent to the query tools and need to be referred to explicitly by the query, and their refresh needs to be handled in the application (ETL logic), as in the sketch below.
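
For example, a pre-MV summary table and its hand-rolled refresh might look like the following sketch (table and column names are hypothetical):

-- Hand-maintained summary table: the nightly ETL must rebuild it, and every
-- report has to reference it by name; nothing is transparent or incremental.
CREATE TABLE sales_by_year_country (
  calendar_year NUMBER,
  country       VARCHAR2(50),
  amount_sold   NUMBER
);

TRUNCATE TABLE sales_by_year_country;

INSERT INTO sales_by_year_country (calendar_year, country, amount_sold)
SELECT t.calendar_year, c.country, SUM(s.amount_sold)
FROM   sales s, times t, customers c
WHERE  s.time_id = t.time_id
AND    s.cust_id = c.cust_id
GROUP  BY t.calendar_year, c.country;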

Neither of the above approaches achieves all the characteristics of an ideal summary management system.

Materialized views (MVs) came along with Oracle 8i and replaced snapshots, whose core functionality became part of MVs. The radical characteristic of MVs is query rewrite: it enables the optimizer to rewrite a query to access the materialized view instead of the base table when the query is seeking summary information. This feature sets MVs apart from all the previous approaches to summary management.
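
A minimal sketch of such an MV against the same hypothetical SALES and TIMES tables (fast refresh on commit additionally requires materialized view logs on the base tables, omitted here):

-- Pre-aggregated sales by year; ENABLE QUERY REWRITE lets the optimizer
-- transparently redirect matching queries on SALES to this MV.
CREATE MATERIALIZED VIEW sales_by_year_mv
  BUILD IMMEDIATE
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
AS
SELECT t.calendar_year,
       SUM(s.amount_sold) AS amount_sold,
       COUNT(*)           AS row_count
FROM   sales s, times t
WHERE  s.time_id = t.time_id
GROUP  BY t.calendar_year;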

Apart from query rewrite, the other big capability of MVs is the wide range of refresh options available. The refresh options for MVs can be classified under two categories:

  1. How to refresh:
    1. Fast refresh (apply only the incremental changes from the base table to the MV)
    2. Complete refresh (rebuild the MV completely)
    3. Force (attempt a fast refresh and, if that is not possible, do a complete refresh)
  2. When to refresh:
    1. On commit: any change committed on the base table is immediately propagated to the MV. Refresh is driven by changes in the base fact table.
    2. On demand: refresh only when it is explicitly requested, typically by calling the DBMS_MVIEW refresh procedures (see the sketch after this list). Refresh is driven by whoever manages the MV rather than by changes to the base table.
    3. Automatically: schedule the refresh on a periodic basis, independent of both changes to the base table and querying of the MV.
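
For instance, an on-demand refresh can be requested through the DBMS_MVIEW package (the MV name is taken from the sketch above):

-- 'F' asks for a fast (incremental) refresh; 'C' forces a complete rebuild,
-- and '?' lets Oracle fall back from fast to complete if needed.
BEGIN
  DBMS_MVIEW.REFRESH('SALES_BY_YEAR_MV', 'F');
END;
/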

A lot of enhancements have been made to MVs since they were first introduced in Oracle 8i: one can now have indexes on an MV, the Summary Advisor came along in Oracle 9i, the MV refresh mechanism has been further enhanced, and so on.

http://download-west.oracle.com/docs/cd/B14117_01/server.101/b10759/statements_6002.htm#sthref4967 is a quick link on MVs in Oracle 10g.

Hand-coded vs. tool-based ETL

There have been rounds of debate over which paradigm to opt for when developing the ETL for a typical data warehouse project.

I will share my experience of both hand-coding ETL and using Oracle Warehouse Builder (tool-based ETL).

The ETL of the old days has not remained the same. The challenges ETL has faced in recent times include:

  1. Complex business systems and analytics requirements
  2. Volume of data
  3. Small refresh window
  4. Orchestration and automation of loads
  5. Monitoring capabilities and exception handling
  6. Complex data quality and integrity requirements
  7. Complex transformation
  8. Match-merge and cleansing
  9. Flexibility to change ETL logic on the fly as the business changes
  10. Varied nature of source systems

And the list goes on. Our very old hand-coded ETL does not cope with all of this.

Let me delineate my understanding of ETL. ETL is not just a piece of code or an application which enables you to do extraction, transformation and loading. E, T and L are the three primary pillars and the core of a typical ETL, but there are a few more aspects which fall under its purview. Today, ETL stands for extraction, transformation and loading, plus the infrastructure elements around them.

insert into sales (customerid, …, …, revenue)
(select orderheader.customerid, …, orderline.qty * orderline.price
 from orderheader@src, orderline@src
 where orderheader.order_id = orderline.order_id)

The above could be a simple ETL doing extraction (select), transformation (qty * price) and loading (insert into), but it still lacks what is needed to meet the ten challenges above:

  1. Infrastructure for scheduling and automating the loads
  2. A mechanism to handle errors and exceptions and to raise the necessary alerts
  3. Resilience and error recovery
  4. The ability to switch between set-based (insert into … select *) and row-based (picking up one row at a time and loading it) loading
  5. Workflow to orchestrate the various sub-loads (individual units)
  6. Metadata management
  7. Versioning and management of design metadata.
  8. Capturing Runtime audit logs
  9. Capabilities to leverage the specific features of target data warehouse platform.
  10. Flexibility to change the configuration at any level, from the whole ETL project down to an individual object.

It’s unrealistic for a hand-coded ETL project to cover all this.
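
To give a feel for how much plumbing even a couple of these items imply when done by hand, here is a minimal sketch of items 2 and 8 (error handling and runtime audit logging) wrapped around the insert above; the ETL_AUDIT_LOG table and its columns are hypothetical:

DECLARE
  v_rows NUMBER;
BEGIN
  INSERT INTO sales (customerid, revenue)
  SELECT oh.customerid, ol.qty * ol.price
  FROM   orderheader@src oh, orderline@src ol
  WHERE  oh.order_id = ol.order_id;
  v_rows := SQL%ROWCOUNT;

  -- Record a successful run in the (hypothetical) audit table.
  INSERT INTO etl_audit_log (load_name, status, rows_loaded, run_date)
  VALUES ('SALES_LOAD', 'SUCCESS', v_rows, SYSDATE);
  COMMIT;
EXCEPTION
  WHEN OTHERS THEN
    -- Undo the partial load and record the failure with its error message.
    ROLLBACK;
    INSERT INTO etl_audit_log (load_name, status, error_msg, run_date)
    VALUES ('SALES_LOAD', 'FAILED', SUBSTR(SQLERRM, 1, 4000), SYSDATE);
    COMMIT;
END;
/

And this still covers only a sliver of the list; scheduling, restartability, metadata management and the rest are additional work on top.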

ETL tool suites these days are typically bundled with all this functionality. Oracle Warehouse Builder is one such tool: apart from letting one model the ETL flows (rather than code them), it gives you all of the above in one integrated suite. With the new release of OWB (Paris), things will be more streamlined and all of the above functionality will be accessible through one single interface. I will share my experiences with OWB Paris in subsequent posts.

As I keep hearing from the research analysts, applications will be modeled and not coded as time goes by, driven by the inherent complexity of business applications.

So, coming back to our prime question of which approach is better, hand-coding or tool-based: as with everything, the answer depends on the context. If the ETL requirement is simple and straightforward, without any great need for infrastructure elements like monitoring and metadata management, hand-coding the ETL is the wise choice. But if the ETL is complex and there is a requirement for monitoring, metadata management, and so on, the tool-based approach is the better way to go.

Stray thought: 100% CPU utilization

How does a typical microprocessor in a personal computer work? When you boot the PC, the CPU picks up instructions from a particular place in memory and starts executing them sequentially; a sequence of instructions/steps undertaken by the CPU finally brings up your OS. When you finally have your OS up and you run top (the Unix command) or Task Manager, you see the CPU 2%, or some x%, utilized. So does that mean your CPU is sleeping 98% of the time? Is there nothing for the CPU to execute for 98% of the time, so it sits idle? In reality, the CPU never sleeps; in that sense it is busy all the time. It is just that the CPU is executing dummy instructions (the OS idle loop) 98% of the time and doing something useful only 2% of the time. If you launch some more programs, more useful tasks will get executed on the CPU and the utilization will go higher.

Kimball and Rittman

Like any other DW follower, I am a big fan of Ralph Kimball, who is no doubt one of the most respected authorities in the DW space.

One more guy I follow is Mark Rittman (www.rittman.net). Rittman is like a one-stop shop for anything and everything pertaining to Oracle BI and DW.

Among a few others, I also follow Nigel Pendse (www.olapreport.com).

All these guys, to me, are pioneering contributors to the DW and BI field.

One of the aims of this blog is to share some of my experiences from implementing various BI and DW projects.

This blog is very much inspired by the works of Kimball, Rittman and Pendse, and I will attempt to do something similar to what they have been doing for the BI and DW community.

Tuesday, May 03, 2005

About Me

I am a software professional working with an IT services firm in Singapore. I am originally from India and did my engineering in computer science. My core technical interests are data warehousing and business intelligence technologies. I specialize in the various BI and DW offerings from Oracle.