Introduction to Data Warehouse Testing
1 What is a Warehouse?
2 What is a Data Warehouse?
3 Data Warehouse Architecture
4 Importance of Data Warehouse Testing
5 Overview of Data Warehouse Validation
6 Types of Testing carried out in Data Warehouse
7 Testing Strategy followed in Data Warehouse Testing
8 Challenges in Data Warehouse Testing
What is a Warehouse?
A warehouse is a place where goods are physically stocked, to facilitate the smooth flow of business without production downtime or crises.
What is a Data Warehouse?
A data warehouse is a structured environment designed for the analysis of data: logically and physically transformed from different source applications to align with business requirements, updated and maintained over a period of time, and presented for analysis. Data warehouse data must be organized to meet the purpose of the warehouse, which is rapid access to information for analysis and reporting.
Data warehouse characteristics
Subject oriented: Data warehouses are intended for examining data. The data in the warehouse is structured so that all the data elements describing the same real-world event or object are connected together.
Integrated: Data warehouses must bring data from distinct sources into a consistent and reliable format. The database holds data from most or all of a given company’s operational applications.
Non-volatile: Data in the warehouse is not over-written or removed; once a change is committed, the data is fixed, read-only, and held for future reporting purposes. Once data has entered the warehouse, it should not be altered. This is reasonable because the purpose of a data warehouse is to enable one to examine what has happened.
Time variant: Changes to the data in the warehouse are tracked and stored so that reports can show how data changed over a period of time. In order to find patterns in the business, analysts require large volumes of data. Time variance here refers to the ability of a data warehouse to focus on change over time.
Data Warehouse Architecture
Data warehouse architecture is mainly based on the business processes of an enterprise. Data warehouses and their architectures vary with the specifics of an organization’s situation: its data modeling, security, reference data management, staging area, and so on.
The various types of architecture mainly used in data warehouse are:
- Architecture without staging area: In this case, data from different sources is transformed directly in the data warehouse. Once processed, this data is used for reporting and analysis.
Fig: Data Warehouse Architecture without staging area.
- Architecture with staging area: Data from different sources is processed in a place called the staging area before it reaches the warehouse.
Fig: Data Warehouse Architecture with staging area.
- Architecture with data mart: In this case, a data mart is created on top of the data warehouse to cater to different business functional areas or geographies. The data accessed by end users in front-end applications is fed by the data mart.
Fig: Data Warehouse Architecture with staging area and data mart
The data staging process is a key constituent of the technical architecture. It covers five main areas:
- Extract – Extraction refers to pulling the expected data from different sources into a common area. The sources can be any of the following:
*Flat files / delimited files, e.g. .txt files, CSV, Excel
*RDBMS systems, e.g. DB2, SQL Server, Oracle
*Mainframe files, e.g. VSAM files, DB2 databases
- Transform – Transformation refers to the process of altering data based on a set of business rules. The different types of transformations are:
Basic conversion: Transforms data from one form to another, e.g. currency conversion, data type conversion.
Integration: Joins different tables using surrogate keys.
Filtering: Filters the data based on business rules.
Calculations: Uses functions like COUNT, MAX, etc.
Sorting: Sorts the data (e.g. ascending/descending) based on business rules.
Error handling: E.g. identifying duplicate records and null values.
- Load – It is a stage where transformed data from various sources is loaded into the target database.
- Security – This encompasses various activities performed by the database administrator, e.g. granting read/write/execute access to various users at various levels.
- Job control – This includes job definition and scheduling, monitoring, error logging, exception handling, and notifying the users.
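The extract, transform and load stages above can be sketched end to end. This is a minimal illustration only; the table names (`sales_dw`) and field names are hypothetical, and real ETL tools handle these stages at far larger scale:

```python
import sqlite3

def extract(rows):
    """Extract: pull only the fields we need from the source records."""
    return [(r["id"], r["amount"], r["currency"]) for r in rows]

def transform(records, usd_rate=1.1):
    """Transform: basic conversion, filtering and error handling per business rules."""
    out = []
    for rid, amount, currency in records:
        if amount is None:          # error handling: drop null values
            continue
        if currency == "EUR":       # basic conversion: EUR -> USD (assumed rate)
            amount = round(amount * usd_rate, 2)
        out.append((rid, amount))
    return out

def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_dw (id INTEGER, amount_usd REAL)")
    conn.executemany("INSERT INTO sales_dw VALUES (?, ?)", records)
    conn.commit()

source = [
    {"id": 1, "amount": 100.0, "currency": "USD"},
    {"id": 2, "amount": 50.0,  "currency": "EUR"},
    {"id": 3, "amount": None,  "currency": "USD"},   # dropped by error handling
]
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(source)))
print(conn.execute("SELECT id, amount_usd FROM sales_dw ORDER BY id").fetchall())
# -> [(1, 100.0), (2, 55.0)]
```

Note how the record with a null amount never reaches the target: that is exactly the kind of behaviour the tester must verify against the business rules.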
Terminologies used in Architecture
- Data Mining: Data mining is the process of extracting useful information or knowledge from large and small data sources. Data mining can be performed on relational databases, internal legacy systems, the World Wide Web, etc.
- Summary Data: Summary data refers to data that is consolidated and processed from the atomic-level data. Summary data may exist in all of the database objects of the warehouse.
- Metadata: Data that describes data and other structures such as objects, business rules and processes.
Metadata usually refers to the following:
-An explanation of tables and columns in the warehouse, including data types and the range of acceptable values.
-A similar description of tables and fields in the source databases, with a mapping of columns from the source to the warehouse.
-An account of how the data has been processed, including format changes and currency conversion.
-Any other details required to maintain and operate the data warehouse.
- Staging Area: An area where data is processed before proceeding to the data warehouse.
Different Types of Tables in Data Warehouse
Dimension table: Dimension tables describe the business entities of a company, represented as information such as time, departments, geographies, and products. Dimension tables are also called reference tables or lookup tables.
Fact table: Any table in a schema that contains facts. A fact table usually has two types of fields: those that hold facts and those that are foreign keys to dimension tables. The fact table’s primary key is typically a composite key built from all of its foreign keys.
Aggregation table: Tables that store aggregated data are called aggregation tables or summary tables. Facts are aggregated over given dimensions from the fact table. The resulting aggregate table has fewer rows, thus improving query performance.
For example: market data is collected every day and aggregated every week; the weekly data can then be rolled up to the month level.
Level value table: A database table that holds the data for the levels formed as part of hierarchies or dimensions.
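The relationship between fact, dimension and aggregation tables can be sketched with a tiny star schema. The table and column names here (`dim_date`, `fact_sales`, `agg_weekly_sales`) are hypothetical, chosen only to illustrate the roll-up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table describing time; fact table keyed to it by a foreign key.
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, week INTEGER)")
cur.execute("CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date, amount REAL)")

cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "Mon", 1), (2, "Tue", 1), (3, "Mon", 2)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 20.0), (3, 7.0)])

# Aggregation (summary) table: daily facts rolled up to the week level.
# The result has fewer rows than the fact table, which speeds up queries.
cur.execute("""
    CREATE TABLE agg_weekly_sales AS
    SELECT d.week, SUM(f.amount) AS total
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.week
""")
print(cur.execute("SELECT week, total FROM agg_weekly_sales ORDER BY week").fetchall())
# -> [(1, 35.0), (2, 7.0)]
```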
Importance of Data Warehouse Testing
-With the growth of information technology, data warehouses have gained importance.
-Ensures the data movement across all the tables and schemas.
-Ensures that complex business rules have been implemented correctly.
-A minute change in the value of one field can affect thousands of records.
-Ensures that data in the DWH is correct, complete and reliable for the decision-making process.
-A single data warehouse holds many tables and schemas in which millions of records are transformed per the business requirements.
Overview of Data Warehouse Validation
Data warehouse validations are divided into two parts:
- Functional validations
- Standard validations
Data validation: Evaluates and ensures the correctness of the data in the warehouse.
Count validation: Running database queries on a given table or related tables to get the total count for a given transaction or business scenario. If the record counts match, it indicates that records were not dropped by mistake during the ETL or loading process.
End-to-end testing: Validating the data as it passes through the various stages of the system (e.g. source flat file -> staging area -> warehouse -> data mart). This generally ensures the completeness of the business functions for the system under test.
Straight/direct move: Data is moved directly from the source to the target system without applying any transformation rule.
Data transformation: Validates whether the transformation is in line with the business rules applied to the source database.
Look-up validation: Positive and negative test cases are created to verify the ETL process and lookup validation rules with valid and invalid data respectively.
Defaulting: Fields in the source having values that are invalid with respect to the target requirements are defaulted to an appropriate value during the ETL run.
Signage: Various general ledger amounts are assigned an appropriate sign (based on certain rules) before being populated into target tables. Target fields are populated according to the signage rules; specific rules/formulas are applied to the tables to transform and process the data.
Translation: This process uses the general ledger translator table to translate GL account numbers and/or Corporation (Corp) and Responsibility Center (RC) codes to the same Corp, Account and RC that post to the general ledger. Data is transformed according to the rules/formulas applicable to the data/tables.
Filtering: Some records have filtering criteria attached to them; the filtered-out records are left behind in the source. Filtering helps maximize query performance in the database by retrieving only the required data.
Average balance calculation: Average balances are usually computed at every week end or month end.
Data integrity validation: Testing to validate the exactness of the stored data, conducted on a periodic basis as the data in the database changes over time.
External field validation: Validates whether the database performs the same function/operation for newly added fields as for existing fields in a table.
- Load Testing –Incremental load, bulk load and full load
- Performance testing
- Volume testing
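Of the validations above, count validation is the simplest to automate: run the same count on source and target and compare. A minimal sketch, assuming hypothetical table names `src_orders` and `dwh_orders`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE dwh_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])
conn.execute("INSERT INTO dwh_orders SELECT * FROM src_orders")  # simulate the load

def count_validation(conn, source_table, target_table):
    """Compare row counts between a source and a target table."""
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return src, tgt, src == tgt

print(count_validation(conn, "src_orders", "dwh_orders"))
# -> (3, 3, True)
```

Matching counts alone do not prove the load is correct (values may still be wrong), which is why count validation is paired with data validation and end-to-end testing.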
Types of Testing carried out in Data Warehouse
The points mentioned below under each category refer to the types of testing done in that category.
Extraction Testing:
Data can be pulled from the given columns.
The extraction mechanism for every source system works as expected.
Extraction scripts are granted security rights to the source systems.
Updating of the extract audit log and time stamping happen as expected.
Source-to-extraction-destination movement works as expected for correctness and completeness.
Extraction is done in a given time window.
Transformation Testing:
Transformation scripts manipulate the data as specified by the transformation logic.
The one-off transformation for historical loading works.
Exhaustive and aggregated data sets are produced and are comparable.
Audit log and time stamp work as expected.
Pilferage of data does not happen during the transformation process.
Transformation is done in a given time window.
Load Testing:
Verifies that pilferage has not occurred during the loading process.
The transformation logic during the loading process works as expected.
Movement of data sets from the staging area to the loading area works as expected.
One-off historical loading works as expected.
Incremental and total refresh work as expected.
Loading is done in a given time window.
End User Browsing and OLAP Testing
The business views and dashboards display the data as expected.
The scheduled reports are precise and comprehensive.
The reports and other batch operations run in the expected time window.
Pilferage does not happen between the views and the source systems.
Down Stream Flow Testing:
Data from the data warehouse updates the downstream systems/data marts.
Check that no pilferage occurs.
Initial Load Testing:
The one-off ETL for the data take-on works as expected.
Data is uploaded into the data warehouse in the opening run.
The time taken for the initial loading is acceptable within the conversion window.
Front To Back Integration Testing:
Data flow from the source to the downstream systems is correct.
Stress and volume Testing:
This involves loading the maximum volume of data to check the strength and capacity of the system. The intensity of stress testing depends upon the type of test environment and the magnitude of the capacity planning in place.
Parallel Testing:
Parallel testing compares the data in production to an existing set of reports on a given test environment to ensure they are in sync. Any mismatch in the comparison is analysed.
There are two ways to perform regression testing in DWH testing projects:
Regression testing can be done by comparing the data in two environments, provided one environment reflects the old requirements and the other the new requirements.
If two test environments are not available, select the high-priority test cases that may be affected by the changes.
Use comparison tools to do the source-to-target validations.
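The core of what such a comparison tool automates is a two-way set difference between environments. A minimal sketch, with hypothetical table names `env_old` and `env_new` standing in for the two environments:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE env_old (id INTEGER, total REAL)")
conn.execute("CREATE TABLE env_new (id INTEGER, total REAL)")
conn.executemany("INSERT INTO env_old VALUES (?, ?)", [(1, 100.0), (2, 200.0)])
conn.executemany("INSERT INTO env_new VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

def diff_tables(conn, a, b):
    """Rows present in one environment but not the other (set difference both ways)."""
    rows_a = set(conn.execute(f"SELECT * FROM {a}").fetchall())
    rows_b = set(conn.execute(f"SELECT * FROM {b}").fetchall())
    return sorted(rows_a - rows_b), sorted(rows_b - rows_a)

print(diff_tables(conn, "env_old", "env_new"))
# -> ([(2, 200.0)], [(2, 250.0)])
```

Any non-empty difference flags a record to analyse: here record 2 changed between the old and new requirements.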
Testing Strategy followed in Data Warehouse Testing
| Sl.No | Testing Strategy | Applicable for | Pros | Cons |
|---|---|---|---|---|
| 1 | Data sampling | Straight loads and homogeneous data | Time effective | Precision cannot be high |
| 2 | Automating data warehouse testing | Straight loads that require exhaustive testing | High confidence level; more coverage of scenarios | Heavily dependent on source-to-target mapping; higher elapsed time and cost, poor customer buy-in; not all scenarios can be automated |
| 3 | Customized test cases | Complex, heterogeneous data loads | Mitigation of business risks | Good business knowledge required |
Challenges in Data Warehouse Testing
Defining the scope: Defining the scope of testing is very important in every project. In data warehouse testing projects, the common items in the scope of testing are:
- Data movement from source to target
- Error tables
- E-mail and status tables
- Different validation of data at field level
Estimation: Estimation is considered one of the big challenges. The following areas should be the focus while estimating DWH testing projects.
Define the different scenarios/validations and modules and estimate.
Define the complexity of each scenario/validation and estimate.
Define the complexity of every test case for each scenario/validation and estimate.
How much time is required to write test cases at the field level and execute them.
Number of Source Systems should be considered during estimation.
Test database size
How much to test/validate?
Whether to test a 20% sample or an 80% sample of the data?
How many test cases should be created?
Example: 500 test cases – 20k records – 0.4% of the data.
How to ensure data consistency during transfer from flat-file format to the dimensional model?
How many tables, and to what extent can validations be automated?
Identification of Negative Test Scenarios.
Environmental Setup Challenges
Setting up of Source Systems, Staging Areas and DWH.
Setting up ETL mappings for the DWH.
Is the right data populated in the database?
Is the data good enough for the customer to make a strategic business decision?
Critical Business Scenarios
Current business scenarios have to be tested.
Future scenarios based on the business goals of the company need to be envisaged and tested.
Quantity/Size of the data migrated
One of the major issues in testing a migration project is the size of the data involved, which runs to millions of records, especially if a historical data load is present.
Huge error logs
Test result verification
Knowledge of SQL (see the queries in another post).
Test result reporting is a major challenge: reporting the exact steps to reproduce a defect amid a huge amount of data is difficult.
Flat files to tables – Data in flat files should be extracted, transformed (if necessary) and inserted into the target tables. Identifying data in flat files consumes a lot of time, since no queries work on flat files. It is tedious, time-consuming and hectic, with every possibility of human error.
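One common way to reduce this manual effort is to load the flat file into a scratch table first, so SQL queries replace eyeballing the file. A minimal sketch; the file content and the table name `target` are hypothetical:

```python
import csv
import io
import sqlite3

# Stand-in for a delimited flat file; a real run would open a file on disk.
flat_file = io.StringIO("id,name,amount\n1,alpha,10.5\n2,beta,\n3,gamma,7.0\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT, amount REAL)")

for row in csv.DictReader(flat_file):
    # Transform while loading: blank fields become NULL in the target.
    amount = float(row["amount"]) if row["amount"] else None
    conn.execute("INSERT INTO target VALUES (?, ?, ?)",
                 (int(row["id"]), row["name"], amount))

# Once loaded, validations are just queries: total rows vs. non-null amounts.
print(conn.execute("SELECT COUNT(*), COUNT(amount) FROM target").fetchone())
# -> (3, 2)
```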
Hierarchy: A logical structure that makes use of ordered levels as a means of sorting data.
Star Schema: The simplest form of data warehouse schema, with one or more dimension tables and fact tables.
Snowflake Schema: A star schema in which larger dimension tables are broken into simpler tables.
Extraction, Transformation and Loading (ETL): The processes involved in reading source data and loading it into a target system.
Extract: Reads data from a given source and extracts the expected data.
Transformation: A set of business functions or rules is applied to the data from the source system before it is stored in the target system.
Loading: Loads the data into the target system.
Some of the popular ETL tools available in the market are listed below:

| Tool Name | Company Name |
|---|---|
| Data Stage | IBM Corp. |
| Data Junction | Pervasive Software |
| Transformation Manager | ETL Solutions |
Example: Assume there are 100 fields, of which we require only 75. We need to make sure that all 75 required fields are present.
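The 75-of-100-fields check above reduces to a set comparison of column names. A sketch with generated, purely illustrative field names:

```python
# Hypothetical field inventories: 100 source fields, of which the first 75
# are required; the extracted target carries only that required subset.
source_fields = {f"field_{i}" for i in range(100)}
required_fields = {f"field_{i}" for i in range(75)}
target_fields = source_fields - {f"field_{i}" for i in range(75, 100)}

# Any required field absent from the target is a defect.
missing = required_fields - target_fields
print("all required fields present" if not missing else f"missing: {sorted(missing)}")
# -> all required fields present
```

In practice the field lists would come from the source-to-target mapping document and the target table's schema rather than being generated.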
Data Profiling: Examining the data and collecting statistics about the data.
OLAP: On-Line Analytical Processing. It makes use of static, i.e. historical, data for analysis. Historical data is stored to enable trend analysis and future predictions; the data is more summarized and stored at a higher level.
Pilferage: Data is prone to theft as it moves from one stage to another; fields like credit card information and bank details are especially at risk. The tester should therefore ensure that the data is properly masked before migrating it from one database to another.
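A simple masking sketch for such sensitive fields before data moves between environments. The rule used here (keep only the last four digits) is one common choice, not a mandated standard, and the card numbers are made-up test values:

```python
def mask_card(card_number: str) -> str:
    """Replace all but the last four digits with '*'."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

# Illustrative records; a real masking job runs over the staging tables.
records = [("alice", "4111111111111111"), ("bob", "5500005555555559")]
masked = [(name, mask_card(card)) for name, card in records]
print(masked)
# -> [('alice', '************1111'), ('bob', '************5559')]
```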