Tuesday, June 28, 2005

No Silver Bullet... by Frederick P. Brooks, Jr.

Just came across a very good paper: No Silver Bullet: Essence and Accidents of Software Engineering, by Frederick P. Brooks, Jr.

If I am not mistaken, he is the same person from IBM who wrote The Mythical Man-Month.
The paper is a great read for anyone in the software industry. Though it dates back to April 1987, the principles it delineates still very much hold.

The following snippets from the paper talk about the invisibility aspect of software.

Invisibility. Software is invisible and unvisualizable. Geometric abstractions are powerful tools. The floor plan of a building helps both architect and client evaluate spaces, traffic flows, views. Contradictions and omissions become obvious. Scale drawings of mechanical parts and stick-figure models of molecules, although abstractions, serve the same purpose. A geometric reality is captured in a geometric abstraction.

The reality of software is not inherently embedded in space. Hence, it has no ready geometric representation in the way that land has maps, silicon chips have diagrams, computers have connectivity schematics. As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs superimposed one upon another. The several graphs may represent the flow of control, the flow of data, patterns of dependency, time sequence, name-space relationships. These graphs are usually not even planar, much less hierarchical. Indeed, one of the ways of establishing conceptual control over such structure is to enforce link cutting until one or more of the graphs becomes hierarchical.

In spite of progress in restricting and simplifying the structures of software, they remain inherently unvisualizable, and thus do not permit the mind to use some of its most powerful conceptual tools. This lack not only impedes the process of design within one mind, it severely hinders communication among minds.

Frederick P. Brooks is always a great read. The way he writes about the intricacies of the software industry compared to manufacturing is quite intriguing.

Saturday, June 18, 2005

Trying out OWB Java API

I have been working with OWB for about a year now, and the learning is yet to take a back seat. During the initial days with OWB, the main attraction was exploring the various operators (pivot, match merge, sort, etc.), trying out all the possible things one can do from the mapping editor, configuring various objects from the OWB Client, implementing things like SCDs, exporting OLAP metadata for use with BI Beans, process flows, etc.

Later on, the focus shifted to the Runtime and Design Time browsers. Both browsers are significant features of OWB. The RAB (Runtime Audit Browser) is a great place to see audit logs, error messages, and the like. The Design Time browser has two great pieces (the Impact diagram and the Lineage diagram), which can be of great help while analyzing the impact of any change in OWB mappings.

Then came the exposure to the various scripts under /owb/rtp/sql. The scripts were of great help in understanding the runtime platform and how to manage the RTP service. Apart from that, scripts like sql_exec_template.sql, abort_exec_request.sql and list_requests.sql were really handy in managing individual runs of mappings.

Then came the time to explore the various tables/views in the design time and runtime repositories. Later on I got a chance to try my hands at OMBPlus. And man, it was just too cool working with OMBPlus. The OWB Client (GUI) is extremely tedious when there is a repetitive task to be done. Say you want to substr all the attributes of an operator down to 30 characters, and assume there are 50 such attributes to be sub-stringed. What does one do? Drop an expression operator, get all these attributes into the input group, add the 50 attributes to the output group one by one, then change the expression property of each of these 50 output attributes to do a substr() of the corresponding input attribute. That is a heck of a lot of work. On the OMBPlus side, it is just one small script that does the whole thing. So if there is anything repetitive and tedious, OMBPlus is the answer. Even taking a backup of the repositories into MDL is repetitive; one can write a small batch job using OMBPlus to take care of it. Apart from this, there are many more things OMBPlus can accomplish for you, like creating template mappings, deploying objects, synchronizing objects, importing metadata, etc.

So, over this period I kept learning the different pieces of OWB, which I feel is an extensive suite that provides comprehensive capabilities for any ETL and data integration task. And still I had a few pieces left out. One of them was the Java API for manipulating the metadata. Oracle exposes a public Java API for manipulating OWB design and runtime metadata, deploying and running mappings, etc. Apart from all the capabilities of OMBPlus, the Java API has some additional capabilities. In reality, OMBPlus in turn uses the Java API to do its various manipulations of the metadata.

So today was the day to get my hands dirty with the Java API, which comes along with OWB 10g1 and onwards. The first thing I did was look for some documentation, and all I managed to find was http://download-west.oracle.com/docs/html/B12155_01/index.html, the Javadoc for the API. Typically the Javadoc just has descriptions of all the classes, interfaces, methods, etc. It is not a step-by-step tutorial on how to write a simple Java program using the OWB Java API.

The API is very big and so is the doc. The above link leads you to a list of some 25-odd Java packages for doing various things. However, there is no hint on where to start.

I brought up my Eclipse (www.eclipse.org) workbench and created a simple Java project named OWBApi. There was a bit of a struggle in locating where the jar file for the public API resides; I managed to locate it under /owb/lib/int/publicapi.jar. I added this jar to the build path of the OWBApi project via Project -> Properties -> Java Build Path -> Libraries -> Add External JARs.

Back to the Javadoc for the OWB Java API:
oracle.owb.connection was the first package I hit, since everything has to start with a connection. After a bit of scrambling here and there, I managed to put together the following piece of code in ConnectOWB.java:


import oracle.owb.connection.OWBConnection;
import oracle.owb.connection.RepositoryManager;

public class ConnectOWB {
    public static void main(String[] args) {
        try {
            // Get the singleton repository manager and open a connection to
            // the design time repository (user, password, host:port:SID).
            RepositoryManager rm = RepositoryManager.getInstance();
            OWBConnection owbconn = rm.openConnection("dtrep", "dtrep",
                    "localhost:1521:orcl", RepositoryManager.MULTIPLE_USER_MODE);

            if (owbconn != null) {
                System.out.println("Connection established.");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


When running the above, I got the following error message:

java.lang.NoClassDefFoundError: oracle/wh/repos/impl/foundation/CMPException
at ConnectOWB.main(ConnectOWB.java:23)
Exception in thread "main"

This means the build path is missing some jar containing oracle/wh/repos/impl/foundation/CMPException. Now how to find it? I tried searching for this error on the OWB forum on OTN, etc., but to no avail. Finally an idea clicked: add all the jars under /owb/lib/int. Doing so, I got rid of the NoClassDefFoundError but ended up with one more error message, saying it was
unable to locate the Compatibility.properties file.

Where to go now? I searched for this file under the OWB home and located it under /owb/bin/admin. In order to make this file available to my Java program, I added one more entry to the build path for this directory. Adding a folder to the build path is as good as adding a jar, but instead of selecting Add External JARs one has to click Add Class Folder and specify the directory /owb/bin/admin.

That's it. I tried compiling and running the program again, and it went through. I was able to establish a connection to the design time repository.

After oracle.owb.connection, the next hit was oracle.owb.project. I managed to do some more things, like getting the list of projects, creating a project, setting the active project, etc. The following program displays the list of projects in the design time repository:

import oracle.owb.connection.OWBConnection;
import oracle.owb.connection.RepositoryManager;
import oracle.owb.project.ProjectManager;

public class ConnectOWB {
    public static void main(String[] args) {
        try {
            // Open a connection to the design time repository, as before.
            RepositoryManager rm = RepositoryManager.getInstance();
            OWBConnection owbconn = rm.openConnection("dtrep", "dtrep",
                    "localhost:1521:orcl", RepositoryManager.MULTIPLE_USER_MODE);

            if (owbconn != null) {
                System.out.println("Connection established.");
            }

            // List the names of all projects in the repository.
            ProjectManager pmgr = ProjectManager.getInstance();
            String[] projlist = pmgr.getProjectNames();
            for (int i = 0; i < projlist.length; ++i) {
                System.out.println(projlist[i]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


And the list goes on. The Java API seems to be a good option for writing a stripped-down OWB-like interface for some special set of users who really don't need the whole OWB client.

Java is a proven language for writing GUI applications, with a rich set of libraries for developing user interfaces, network programming, and database programming.

One can use this API and end up writing a browser-based interface to manipulate the OWB metadata, or maybe even create new mappings from some specific templates and stuff, or at least write an interface to run a job, deploy a mapping, change some metadata, etc. This really reminds me of a new feature called Expert, which is coming along with OWB Paris.
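Just to sketch what such a stripped-down tool might start from, here is a minimal wrapper (my own illustration, not a prescribed pattern) around the exact calls used in the two listings above. The credentials and connect string are the same placeholders as before:

import oracle.owb.connection.OWBConnection;
import oracle.owb.connection.RepositoryManager;
import oracle.owb.project.ProjectManager;

// A thin wrapper over the OWB public API calls from the listings above --
// the kind of building block a stripped-down metadata tool could start from.
public class OwbSession {
    private OWBConnection connection; // kept around for further API calls

    // Connect to the design time repository. Declared to throw Exception
    // since, as in the earlier listings, the API calls may fail at runtime.
    public void connect(String user, String password, String connectString)
            throws Exception {
        RepositoryManager rm = RepositoryManager.getInstance();
        connection = rm.openConnection(user, password, connectString,
                RepositoryManager.MULTIPLE_USER_MODE);
    }

    // Return the project names exactly as getProjectNames() provides them.
    public String[] listProjects() {
        return ProjectManager.getInstance().getProjectNames();
    }

    public static void main(String[] args) {
        try {
            OwbSession session = new OwbSession();
            session.connect("dtrep", "dtrep", "localhost:1521:orcl");
            String[] projects = session.listProjects();
            for (int i = 0; i < projects.length; i++) {
                System.out.println(projects[i]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}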


Yeah, so getting back to my learnings in OWB, I have some more on the list. Next would be exploring Appendix C of the user guide, which talks about extracting data from XML data sources. After that comes using AQ (Advanced Queues) and building up an understanding of pulling data from applications (SAP). So still a long way to go.


Thursday, June 16, 2005

How to get started with a new technology/tool

Learning new tools and technologies has become part of the daily chores of any IT professional. There is no way out, and no good reason why one should not learn new things. I personally am a tech-savvy guy, always on the lookout to learn new things. The interest is not just in things pertaining to data warehousing and BI but in everything that comes my way. The only requirement is that any new tool/technology I learn should have some fundamentals or concepts to take home.

During these last 6-7 years in IT, I have learned numerous theories, technologies, programming languages, tools, etc. Most of them were through self-learning. But this self-learning depended on all my previous learnings, inculcated over the years, without which it would not have been possible. Today I picked up one more tool/technology to build some understanding of. I don't have access to the software, just the documentation. This is one of the few tools/technologies I have tried to learn without access to the software itself, though I have extensive hands-on experience with a similar kind of technology from another vendor.

This whole thing led me to think about how one can approach taking up a new tool/technology. Three possible ways came to mind:

1. Hit the documentation first, get some background, then come to the tool for hands-on work, then go back to the manuals/references, then back to hands-on. Maybe over time do both things simultaneously.
2. Hit the tool first and let your intuition take the wheel. Play around, stretch your understanding/intuition, then come back to the references/manuals/docs, and then back to the tool. Over time, do both things simultaneously.
3. First attend some seminar, talk, or discussion (much like 1, but instead of text you get into more live things), then hit the tool, then go back to the manuals, come back to the tool and hands-on, then back to discussions, and so forth. Maybe I would call it the spaghetti approach: you might just as well start with books first, then tools, then talks, or any other combination.

Which to choose? Time and availability of resources make the right call here. I keep trying all these approaches; most of the time approach 2 is a good deal for me. Approach 1 is what we have been doing since college days: first read about the C language, listen to some lectures, and then get to the labs for hands-on work. And that was good, since back then one didn't have so many fundamentals/concepts built up, or so much exposure to tools/languages of a similar kind. Again, like all my postings, there is no need to reach a conclusion about which is better and which is not. It depends, like everything else. My idea here is just to bring out some points.

Monday, June 06, 2005

Funky Business

During this weekend I got hold of the book Funky Business, and man, I couldn't keep myself from getting it done. Now I don't want to end up writing one more review of the book, but I just thought of sharing some of the intriguing things I liked about it. The authors have really brought in a lot of wisdom about how business should be run in the 21st century, and all that in a different style of writing. Everything is funky about the book: the examples, the style of writing, the wisdom, the content, the authors. It's just a cool piece.

There were a lot of striking things, a lot of striking concepts that hit your nerves like a sharp tool, and a lot of striking examples (the one defining a niche market was: a group of lawyers who are interested in pigeon races). And no doubt the book has tons of facts (GM tried producing car stereos and that didn't work out for them; a dentist slur company with 50% of the world market is run by just 85 folks). Now all that is interesting, and above all there is a lot of learning one can draw from it. The one good thing about this book (or maybe bad, for some) is that it is very concise and says 10 things in 5 sentences. So one has to keep reading it again and again to appreciate all it has to offer.

Quick Linux Recipe

The other day I had the task of installing some avatar of Linux on a WinXP machine. One of my friends wanted to get started with Linux. He wanted to do some hands-on work: running various commands, getting hold of the basics of how Linux works, and gradually some further details like the file system, Linux daemons, networking in Linux, and so forth.

The need was to install Linux on top of the host OS WinXP, so that he could keep working on XP and switch to Linux for hands-on practice. The desktop he was running was a bit out of date: 128 MB of RAM and a 500 MHz CPU. And on top of this we have a mammoth WinXP running.

VMware was there to create a virtual machine on top of which I was to install some distribution of Linux. Red Hat was a big bloat (4 discs for Fedora), plus a lot of space to set the whole thing up, plus it would be a killer for the CPU. So the idea was to get hold of some mini Linux distribution that does not take a whole lot of space and can be installed quickly.

I came across a roster of mini Linux distributions, but it was BeatrIX that clicked for me. BeatrIX is a cool piece of Linux bundling (<200 MB) with no setup needed, since it boots directly from the CD. It has all the pieces one would need to get started with Linux: Gnome, a text editor, a browser, a terminal, etc.

I downloaded the ISO image of BeatrIX. Instead of burning it to CD, I set my VMware CD-ROM to read the ISO image file. And that's it. The whole thing took less than 30 minutes. To recap, the whole recipe:

1. Download VMware Workstation 5 for Windows and install it. Register and get the evaluation license key from the VMware site (mind that it's just for 30 days).
2. Download BeatrIX to some location in your file system.
3. Launch VMware. Create a virtual machine as "Other Linux Distribution Kernel 2.6...".
4. Modify the CD-ROM device in VMware to read from an ISO image, and point the file location to the downloaded copy of BeatrIX.
5. That's it. Click Start, and you have your Linux set up. BeatrIX does not need any installation, since it boots directly from the CD.

By the way, BeatrIX seems to have been built by (and inspired by) an interesting set of people and cats. Check out their site.

Sunday, June 05, 2005

What is XML?

So this three-letter word has been taking the IT world by storm since its inception in the late 90's. Now what is it all about? I hear lots of folks talking around, defining, trying to understand, trying to explain to others what XML is. Even I myself have indulged in such discussions. I kept hearing lots of definitions floating in the air: "It's the standard to encode data", "It's an enhanced version of HTML", "It's extensible HTML; you can create your own tags". But why on earth would I need to create these tags? What for?

The understanding I built up in the due course of discussion and reading is this: XML is a standard way to encode data pertaining to anything, ranging from transaction details, lists of entities, a message for some application, configurations, metadata, etc., and the way it differs from a simple text file is that in XML the data is stored in hierarchical fashion, and an XML document is bound to some schema or DTD which specifies the structure and content of this hierarchy. XML is a way to package data. This packaged data could sit in a file, a network packet, a message, a database table, or anywhere else.

I haven't worked with XML per se. As such, there is nothing like "working with XML": XML is not a programming language one can use to create applications, nor is it meant for presentation like HTML. As someone has said, one will encounter XML everywhere. Even when your car breaks down, it will send a message in XML to the nearest service center for necessary help.

XML is meant for nothing in specific but for everything. I kept seeing XML everywhere in the last couple of years:
- Configuration of various applications/servers
- Web services sending requests and responses in XML
- The report I create using some tool gets stored in an XML file; and not just reports, but metadata generated by any wizard-driven tool gets stored in XML
- I write my EJB deployment descriptor in XML; my Struts config is in XML
- WML is again XML
- Process flows are getting stored in XML
- Presentation information is stored in XML and transformed for a particular rendering device using some translation
- I export data from one database in XML and import it into another

These and many more. I wonder: why use XML everywhere if bare simple text files can do the same? Okay, what would a bare text file storing a list of books and their details look like?

Option 1 (attribute value):

Book: Abc
Author: Xyz1
Price: 100
Pages: 252
Book: Abc1
Author: Xyz2
Price: 150
Pages: 531

Option 2 (comma separated):

Abc, Xyz1, 100, 252
Abc1, Xyz2, 150, 531



And maybe there could be some more.
For both of the above options, the application has to make certain assumptions when consuming or generating the text format. In the first, all the attributes of one book must be placed together vertically; in the second, all the attributes of a book must be placed together horizontally, in a particular order. And what if the file needs to be extended to store some more attributes for a book, for example publisher information? What all needs to change? The application? The file? We understand it: it will be a heck of a lot of work.

On the other hand, XML is also bare text, with some structure and some syntax to follow. That's it. What I have found so far that is convincing to me, and that makes XML stand out over simple text encoding:
1. The structure of an XML document is extensible without affecting much of the application. You can extend the XML document to store more information without affecting the applications using it.
2. The data is stored in hierarchical fashion, something like:
<?xml version="1.0" encoding="utf-8"?>
<Books xmlns="http://tempuri.org/XMLFile1.xsd">
  <Book>
    <Name>Abc</Name>
    <Author>Xyz1</Author>
    <Price>100</Price>
    <Pages>252</Pages>
  </Book>
  <Book>
    <Name>Abc1</Name>
    <Author>Xyz2</Author>
    <Price>150</Price>
    <Pages>531</Pages>
  </Book>
</Books>

This could very well be extended to store new attributes without really bothering the application.

3. Availability of lots of parsers and the DOM (Document Object Model, an API for accessing and processing XML documents) for various programming languages, so generating and consuming XML documents is easy (see the sketch below).
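To make point 3 concrete, here is a minimal sketch using the DOM parser that ships with Java (JAXP), reading the Books document above; the file name books.xml is an assumption of mine:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ListBooks {
    public static void main(String[] args) throws Exception {
        // Parse the Books document (assumed saved locally as books.xml).
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse("books.xml");

        // Walk every <Book> element and pull out a couple of child values.
        NodeList books = doc.getElementsByTagName("Book");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            String name = book.getElementsByTagName("Name")
                              .item(0).getTextContent();
            String price = book.getElementsByTagName("Price")
                               .item(0).getTextContent();
            System.out.println(name + " costs " + price);
        }
    }
}

Note that the application does not care if new child elements are added to each Book later; that is the extensibility point 1 was making.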

There are two guidelines which every XML document has to follow:
1. The XML document should be well-formed. This means every opening tag has a matching closing tag, the structure nests correctly, and the tags are case sensitive.
2. The XML document should be valid. This means the arrangement of tags, their attributes and their values has to follow a certain scheme. This scheme is specified in the DTD or XML schema associated with the XML document. In the example above it is http://tempuri.org/XMLFile1.xsd which specifies the schema of the XML document. (A small validation sketch follows below.)
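Checking validity programmatically looks roughly like this with Java 5's javax.xml.validation API; a minimal sketch assuming the schema has been saved locally as XMLFile1.xsd next to books.xml:

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateBooks {
    public static void main(String[] args) throws Exception {
        // Load the XML Schema (assumed saved locally as XMLFile1.xsd).
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource("XMLFile1.xsd"));

        // validate() returns quietly for a valid document and throws
        // org.xml.sax.SAXException when the document breaks the scheme.
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource("books.xml"));
        System.out.println("books.xml is valid against XMLFile1.xsd");
    }
}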

One can Google and find tons of commentary on XML: XML tutorials, applications of XML, current happenings, etc. www.w3c.org is the place to get the latest on what's happening in the XML world.

Saturday, June 04, 2005

Handling testing and QA artifacts in the source system while building ETL

Now this is interesting. Your source system (the production database) has a lot of test data spread across the various entities. A typical online portal can have a routine set of test cases run every day on the production system to check its consistency. So how does one handle all these test artifacts while building the ETL? Should this be treated as part of data cleansing? Maybe it should, or maybe it needs more serious attention than just cleaning it away.

Handling test data in the ETL involves two aspects: one, identifying and segregating the test data; and two, tracking the test data at a prescribed location. This location could be a separate set of tables in the data warehouse itself (if that is a requirement), some log/audit files, or even the tables which store the regular data, with a tag saying the rows are test suppliers/buyers/products and not actual ones.

Segregating test data from production data depends on flagging done at the source side, or on some convention followed while generating the IDs for test data; for example, an ID starting with 65xxxxxx is always test data. Another way of segregating test data would be a lookup table, residing in the source system or in staging, which contains the list of suppliers/buyers/products etc. that are test data, so that the corresponding transactions are test transactions. (A small sketch of the ID convention follows below.)
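As a rough illustration of the ID-convention approach, here is a minimal JDBC sketch. The SUPPLIERS table, its columns, the connection details, and the 65-prefix rule are all hypothetical stand-ins for whatever convention the source system actually follows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TagTestData {
    public static void main(String[] args) throws Exception {
        // Hypothetical source connection; adjust driver, URL and credentials.
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:orcl", "src", "src");
        Statement stmt = conn.createStatement();

        // Convention assumed here: supplier IDs starting with 65 are test data.
        ResultSet rs = stmt.executeQuery(
                "SELECT supplier_id, supplier_name FROM suppliers");
        while (rs.next()) {
            String id = rs.getString("supplier_id");
            boolean isTest = id.startsWith("65");
            // A real ETL would route test rows to a separate table or set a
            // flag column; here we just report the classification.
            System.out.println(id + " -> " + (isTest ? "TEST" : "REGULAR"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}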

If the requirement is just to identify the test data and filter it out before bringing it into staging or the DW, life is relatively easy. However, if the need is to bring the data into the data warehouse with some identification separating it out, there are two possible ways to do it: populate the test data into a separate set of tables, or into the same set of tables with some tagging. The latter has the advantage of saving an extra ETL at the cost of one extra flag. It also aligns the test data with the regular data, so the same constraint checking, data capturing, and ETL can be used. But the second approach can mean tough times if the test data does not follow the prescribed application/business logic, for example due to data patched in from behind to run some test cases, or some special provision in the application logic. The first approach stands out as better for that case. Enough balance should be maintained that the main ETLs populating the regular data do not get complicated just because they handle lots of exceptions for test data. The best deal would be to segregate the test data in the first place and put it in a separate table, which in turn can be used as a lookup by the regular ETL.

There is no definite thumb rule (as such, there are no thumb rules) for handling the test data in the source system. It all depends on the nature of the test data, on identifying it, and on the way it needs to be tracked.