Friday, August 21, 2009

ETL Testing Process Overview

This is a compilation of my friend Rekha Ram Burra's experiences while testing ETL applications. I borrowed this information from him to share publicly on my blog. Thanks a lot to Rekha Ram for being kind enough to allow me to reuse this content on my blog.

This is a compilation of the test activities performed while testing an ETL application. Though ETL stands for Extract, Transform and Load, the final goal of the test team is to validate the data in the destination database, so the test team only needs to verify the data between source and destination. It is not compulsory for the test team to verify how the data is extracted and transformed; we can consider this black-box testing.

Assumptions

The following databases (on the same server or on different servers) are in place:

  1. At least one Source database system, which is on a different server than the databases below
  2. A Delta database that holds source data in tables whose schema is similar to the source table schema
  3. A Staging database where the data needed for the transformations resides, along with any intermediate tables needed
  4. A Factory database whose tables hold the data after the required transformations
  5. A Warehouse database, which is the final database and the actual data warehouse

There can be changes to the above architecture, e.g. the delta and staging databases can be combined into one single database, still with separate tables for delta and staging data. The general approach is to maintain the Staging, Factory and Warehouse databases separately (if not on different servers) to reduce load on the warehouse.

Test approach

Verify the data type of the columns between source and destination

Between Source and Delta

The test team should make sure that the data type of each column of each table is as per the Functional Specification. If no specific details are mentioned in the Functional Specification (FS) about the table schema, then the test team should verify the following (a sample query is given after the list):

  1. The data type of the source column and the destination column are the same
  2. The destination column length is equal to or greater than the source column length
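One way to automate this check, assuming the servers are already linked, is to compare column metadata through INFORMATION_SCHEMA.COLUMNS. The sketch below is only an illustration: the server, database and table names are placeholders, and CHARACTER_MAXIMUM_LENGTH applies only to character columns, so numeric precision would need a similar comparison of its own.

/* A sample sketch: list the columns whose data type differs or whose Delta
   length is shorter than the Source length. All names in angle brackets are
   placeholders to be substituted for your environment. */
SELECT  src.COLUMN_NAME,
        src.DATA_TYPE                 AS SourceDataType,
        dlt.DATA_TYPE                 AS DeltaDataType,
        src.CHARACTER_MAXIMUM_LENGTH  AS SourceLength,
        dlt.CHARACTER_MAXIMUM_LENGTH  AS DeltaLength
FROM    <Source Server>.<Source Database>.INFORMATION_SCHEMA.COLUMNS src
JOIN    <Delta Server>.<Delta Database>.INFORMATION_SCHEMA.COLUMNS dlt
        ON src.COLUMN_NAME = dlt.COLUMN_NAME
WHERE   src.TABLE_NAME = '<Source Table name>'
  AND   dlt.TABLE_NAME = '<Delta Table name>'
  AND   (src.DATA_TYPE <> dlt.DATA_TYPE
         OR dlt.CHARACTER_MAXIMUM_LENGTH < src.CHARACTER_MAXIMUM_LENGTH)

If the query returns no rows, the two checks in the list above hold for every matching column; the same pattern can be reused for the Source and Factory comparison in the next section.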

Between Source and Factory

  1. The data type of the source column and the destination column are the same
  2. The destination column length is equal to or greater than the source column length
  3. If the data type differs between source and destination, the test team should make sure that the difference is documented and that no data is lost

Between Factory and Warehouse

The schema of the tables will be exactly the same between Factory and Warehouse. The query below can be used to check the table schema. Please note that these are linked-server queries, written assuming the Factory and Warehouse databases are on different servers. You may have to add the servers as linked servers using sp_addlinkedserver, or remove the linked-server prefixes if the databases are on the same server.

DECLARE @FactColCount int, @RepColCount int, @MatchCount int

SELECT @FactColCount = COUNT(*)
FROM <FactoryServerName>.<FactoryDatabaseName>.dbo.sysobjects so
INNER JOIN <FactoryServerName>.<FactoryDatabaseName>.dbo.syscolumns sc ON so.id = sc.id
WHERE so.type = 'U' AND so.name = <Factory Table name to check>

SELECT @RepColCount = COUNT(*)
FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.sysobjects so
INNER JOIN <WarehouseServername>.<WarehouseDatabaseName>.dbo.syscolumns sc ON so.id = sc.id
WHERE so.type = 'U' AND so.name = <Warehouse Table name to check>

SELECT @MatchCount = COUNT(*)
FROM <FactoryServerName>.<FactoryDatabaseName>.dbo.sysobjects so1,
     <FactoryServerName>.<FactoryDatabaseName>.dbo.syscolumns sc1,
     <WarehouseServername>.<WarehouseDatabaseName>.dbo.sysobjects so2,
     <WarehouseServername>.<WarehouseDatabaseName>.dbo.syscolumns sc2
WHERE so1.type = 'U' AND so1.id = sc1.id AND so1.name = <Factory Table name to check>
  AND so2.type = 'U' AND so2.id = sc2.id AND so2.name = <Warehouse Table name to check>
  AND (sc1.name = sc2.name
       AND sc1.colorder = sc2.colorder
       AND sc1.length = sc2.length
       AND sc1.isnullable = sc2.isnullable)

IF (@FactColCount <> @RepColCount)
    SELECT 'ERROR: The number of columns (' + CONVERT(varchar(20), @FactColCount) + ') in Factory does not match the number of columns (' + CONVERT(varchar(20), @RepColCount) + ') in Warehouse'
ELSE IF (@FactColCount <> @MatchCount)
    SELECT 'ERROR: The number of columns (' + CONVERT(varchar(20), @RepColCount) + ') in Factory and Warehouse matches, but some of the column attributes do not match'
ELSE
    SELECT 'SUCCESS: The tables are identical in structure'

GO

Note: The above query uses sysobjects and syscolumns, which are deprecated as of SQL Server 2008. You may consider changing it to use the sys.objects and sys.columns catalog views instead, as per your needs.
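If you do want to move off the compatibility views, a minimal sketch of the column-count part of the comparison against sys.objects and sys.columns might look like the following. The placeholders are the same as above, and the attribute-by-attribute match can be rewritten the same way (column_id, max_length and is_nullable take the place of colorder, length and isnullable).

/* A minimal sketch of the column-count comparison using the newer catalog
   views. Placeholders must be substituted; the full attribute match from the
   original query can be rebuilt against the same views. */
DECLARE @FactColCount int, @RepColCount int

SELECT @FactColCount = COUNT(*)
FROM <FactoryServerName>.<FactoryDatabaseName>.sys.objects o
JOIN <FactoryServerName>.<FactoryDatabaseName>.sys.columns c ON o.object_id = c.object_id
WHERE o.type = 'U' AND o.name = '<Factory Table name to check>'

SELECT @RepColCount = COUNT(*)
FROM <WarehouseServername>.<WarehouseDatabaseName>.sys.objects o
JOIN <WarehouseServername>.<WarehouseDatabaseName>.sys.columns c ON o.object_id = c.object_id
WHERE o.type = 'U' AND o.name = '<Warehouse Table name to check>'

IF (@FactColCount <> @RepColCount)
    PRINT 'ERROR: Column counts differ between Factory and Warehouse'
ELSE
    PRINT 'Column counts match; continue with the attribute comparison'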

Verifying the data

Check the number of records in source and destination

A simple SELECT COUNT(*) FROM <Table Name>, with the needed WHERE condition, should be enough to check whether there is the same number of records in source and destination. But we cannot rely on this query alone: assume one record is duplicated into the destination in place of another record. In this case, the record counts will match, but one source record will be missing from the destination.

We may encounter two kinds of scenarios:

a.    When the data in source and destination tables can be related using a Key

DECLARE @SrcRowCount int, @DestRowCount int
DECLARE @SrcExtraRows int, @DestExtraRows int

SELECT @SrcRowCount = COUNT(gcd.<ID Column>)
FROM <Source Server>.<Source Database>.dbo.<SourceTable> gcd
/* Any other JOIN conditions and WHERE conditions needed */

SELECT @DestRowCount = COUNT(*)
FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>

SELECT @SrcExtraRows = COUNT(*)
FROM <Source Server>.<Source Database>.dbo.<SourceTable> gcd
/* Any other JOIN conditions that are needed */
WHERE gcd.<ID Column> NOT IN
(
    SELECT <ID Column> FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
    /* Do not forget to filter the Zero Key row if the table has one */
)
/* Plus any other WHERE conditions needed */

SELECT @DestExtraRows = COUNT(*)
FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
WHERE <ID Column> NOT IN
(
    SELECT gcd.<ID Column>
    FROM <Source Server>.<Source Database>.dbo.<SourceTable> gcd
    /* Any other JOIN conditions and WHERE conditions needed */
)

IF (@SrcRowCount <> @DestRowCount)
    PRINT 'ERROR: The number of rows in the destination table does not match the number of rows in the source.'
ELSE IF (@SrcExtraRows > 0)
    PRINT 'ERROR: Not all records of the source table have been migrated to the destination table.'
ELSE IF (@DestExtraRows > 0)
    PRINT 'ERROR: There are extra Ids in the destination table that are not present in the source.'
ELSE
    PRINT 'SUCCESS: The tables have the same row counts and the same Ids'

b.    When the data in the source and destination tables cannot be related using a key (there may be cases where the destination is simply truncated and repopulated from the source every time). In this case,

  1. Get the count of records from the source, grouping by all columns
  2. Get the count of records from the destination, grouping by all columns
  3. Compare the counts from (1) and (2); they should be the same (see the sketch after this list)
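A rough sketch of that comparison, assuming hypothetical columns Col1 and Col2 stand in for the full column list, is to run a grouped EXCEPT in both directions; if both queries return no rows, the two tables hold the same rows the same number of times.

/* A rough sketch: compare grouped row counts in both directions. Col1, Col2
   and the four-part table names are placeholders; list every column of the
   table in both the SELECT and the GROUP BY. */
SELECT Col1, Col2, /* ...all columns... */ COUNT(*) AS RowCnt
FROM <Source Server>.<Source Database>.dbo.<SourceTable>
GROUP BY Col1, Col2 /* ...all columns... */
EXCEPT
SELECT Col1, Col2, COUNT(*) AS RowCnt
FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
GROUP BY Col1, Col2

/* And the reverse direction, to catch rows that exist only in the destination */
SELECT Col1, Col2, COUNT(*) AS RowCnt
FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
GROUP BY Col1, Col2
EXCEPT
SELECT Col1, Col2, COUNT(*) AS RowCnt
FROM <Source Server>.<Source Database>.dbo.<SourceTable>
GROUP BY Col1, Col2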

Verifying Straight Map columns

If there is only one source table for a destination Warehouse table, identify all the columns that are just copied from source and stored in destination. Use the below query to verify the column values

SELECT TOP 10
       'Exp StraightMapColumn1' = Source.StraightMapColumn1,
       'Act StraightMapColumn1' = Destination.StraightMapColumn1
FROM <Source Server>.<Source Database>.dbo.<Source Table> Source
JOIN <WarehouseServername>.<WarehouseDatabaseName>.dbo.<Destination Table> Destination
     ON Source.<Id Column> = Destination.<Id Column>
WHERE Source.StraightMapColumn1 <> Destination.StraightMapColumn1

If there is more than one source table from which columns are copied directly into the destination table, we can group the columns by source table and use the above query for each group.

Verifying Derived Columns

Simple derivations like a coalesce can be verified directly using the same straight-map query with a few changes. If there are any complex calculations for a column, then it is always suggested to create a temporary table with only two columns: an Id column that can be used to map to the source and destination tables, and the actual column that needs to be tested. Following the logic given in the Functional Specification (FS), we have to populate the temporary table (this is similar to what the developers do, but the test team should do it independently to validate the column values). Once the temporary table is populated, we can use the same straight-map query above to compare the data between the temporary table and the destination table.
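As a purely hypothetical sketch, suppose the FS derives a warehouse column FullName as COALESCE(PreferredName, LegalName). All column and table names below are invented for illustration, but the pattern of populating a temp table from the source and reusing the straight-map comparison stays the same.

/* A hypothetical sketch: FullName is assumed (for illustration only) to be
   derived as COALESCE(PreferredName, LegalName). Substitute the real logic
   and names from your FS. */
CREATE TABLE #ExpectedDerived
(
    CustomerId int          NOT NULL,
    FullName   varchar(200) NULL
)

INSERT INTO #ExpectedDerived (CustomerId, FullName)
SELECT CustomerId, COALESCE(PreferredName, LegalName)
FROM <Source Server>.<Source Database>.dbo.<SourceTable>

/* Reuse the straight-map comparison against the warehouse table */
SELECT TOP 10
       'Exp FullName' = e.FullName,
       'Act FullName' = w.FullName
FROM #ExpectedDerived e
JOIN <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName> w
     ON e.CustomerId = w.CustomerId
WHERE ISNULL(e.FullName, '') <> ISNULL(w.FullName, '')

DROP TABLE #ExpectedDerived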

Verifying Zero Key row

There will usually be a record (though not necessarily in every warehouse) with the surrogate key value 0. This record contains the default values for each column. It must be validated against the values given in the Functional Specification (FS).
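A simple way to start is to pull the Zero Key row and compare each column value against the defaults listed in the FS; the surrogate key column name below is a placeholder.

/* A minimal sketch: fetch the Zero Key row so its column values can be
   compared against the defaults documented in the FS. */
SELECT *
FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
WHERE <SurrogateKeyColumn> = 0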

Conclusion

This document gives a basic idea of the approach that can be followed to test the finally loaded data. There may be many other approaches for testing similar kinds of applications. Once you get used to this approach, you can refine it further by automating the test cases to pick up the column names automatically and by using dynamic SQL. Once you are accustomed to the queries, you can start looking into the code for any discrepancies that might be caused by the data (e.g. looking into queries that might generate duplicate records).
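As a rough illustration of that idea, the sketch below builds a straight-map comparison dynamically from the warehouse table's column list. The table and key column names are placeholders, it assumes it runs in the warehouse database, and the generated statement is only printed so it can be reviewed before execution.

/* A rough sketch of generating the straight-map WHERE clause dynamically
   from the warehouse table's columns. Run in the warehouse database; the
   names in angle brackets are placeholders. */
DECLARE @sql nvarchar(max), @cols nvarchar(max)

SELECT @cols = STUFF((
    SELECT ' OR Source.' + QUOTENAME(c.name) + ' <> Destination.' + QUOTENAME(c.name)
    FROM sys.columns c
    WHERE c.object_id = OBJECT_ID('<Warehouse Table name>')
      AND c.name <> '<Id Column>'
    FOR XML PATH('')), 1, 4, '')

SET @sql = 'SELECT TOP 10 * FROM <Source Table> Source ' +
           'JOIN <Warehouse Table> Destination ON Source.<Id Column> = Destination.<Id Column> ' +
           'WHERE ' + @cols

PRINT @sql   /* Review the generated statement, then run it with EXEC sp_executesql @sql */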