Saturday, June 04, 2005

Handling testing and QA artifacts from the source system while building the ETL

Now this is interesting. Your source system (the production database) has a lot of test data spread across its various entities. A typical online portal may have a routine set of test cases run every day against the production system to check its consistency. So how does one handle all these test artifacts while building the ETL? Should this be treated as part of data cleansing? Maybe it should be, or maybe it needs more serious attention than just cleaning it away.

Handling test data in the ETL involves two aspects: first, identifying and segregating the test data, and second, tracking it at a prescribed location. This location could be a separate set of tables in the data warehouse itself (if that is a requirement), some log/audit files, or even the tables that store the regular data, with a tag saying that a row is a test supplier/buyer/product and not an actual one.

Segregating test data from production data depends on flagging done at the source side, or on some convention followed while generating the ids for test data. For example, an id starting with 65xxxxxx is always test data. Another way of segregating test data is a lookup table residing in the source system or in staging, which contains the list of suppliers/buyers/products etc. that are test data; their corresponding transactions are then test transactions.
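Both checks can be combined in the extraction step. Here is a minimal Python sketch, assuming a plain DB-API connection; the names (supplier_id, test_entity_lookup, the 65 prefix) are illustrative assumptions, not an actual schema.

TEST_ID_PREFIX = "65"  # assumed convention: ids starting with 65xxxxxx are test data

def load_test_lookup(conn):
    """Fetch the ids registered as test suppliers/buyers/products
    from a lookup table kept in the source system or staging area."""
    cur = conn.cursor()
    cur.execute("SELECT entity_id FROM test_entity_lookup")
    return {row[0] for row in cur.fetchall()}

def is_test_record(record, test_ids):
    """A record is test data if its id follows the test-id convention
    or appears in the lookup table."""
    entity_id = str(record["supplier_id"])
    return entity_id.startswith(TEST_ID_PREFIX) or entity_id in test_ids

def split_extract(records, test_ids):
    """Route each extracted record to the regular or the test stream."""
    regular, test = [], []
    for rec in records:
        (test if is_test_record(rec, test_ids) else regular).append(rec)
    return regular, test
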

If the requirement is just to identify the test data and filter it out before bringing it into staging or the DW, life is relatively easy. However, if the test data needs to be brought into the data warehouse with some identification to separate it out, there are two possible ways to do it: populate it in a separate set of tables, or in the same set of tables with some tagging. The latter has an advantage because it saves the extra ETL at the cost of one extra flag. It also keeps the test data aligned with the regular data, so the same constraint checking, data capturing and ETL can be reused (see the sketch below).

There can be tough times with the second approach, though, if the test data does not follow the prescribed application/business logic. This could be due to some data patching done from behind to run through some test cases, or a special provision in the application logic. The first approach stands out as better in that case. Enough balance should be maintained so that the main ETLs populating the regular data do not get complicated just because they handle a lot of exceptions for test data. The best deal here would be to segregate the test data in the first place and put it in a separate table, which in turn can be used as a lookup for the regular ETL.
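As a sketch of the second (same tables, extra flag) approach: the load step tags every row instead of maintaining a parallel set of tables, reusing the is_test_record check from the earlier sketch. The table and column names (fact_orders, is_test_data) and the %s placeholder style are assumptions for illustration.

def load_rows(conn, rows, test_ids):
    """Insert regular and test rows into the same warehouse table,
    carrying an is_test_data flag on each row."""
    sql = (
        "INSERT INTO fact_orders (order_id, supplier_id, amount, is_test_data) "
        "VALUES (%s, %s, %s, %s)"
    )
    cur = conn.cursor()
    for row in rows:
        flag = is_test_record(row, test_ids)  # same check used at extraction time
        cur.execute(sql, (row["order_id"], row["supplier_id"], row["amount"], flag))
    conn.commit()

Downstream reports then simply filter on the flag (WHERE is_test_data = FALSE) instead of joining or unioning a second set of tables.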

There is no definite rule of thumb (as such there are no rules of thumb) for handling test data in the source system. It all depends on the nature of the test data, how it can be identified, and the way it needs to be tracked.
