

Results

DQ Testing: We executed the generated DQ GBQ commands on Health Data Compass on GBQ. The result is three tables: analysis, analysis_results (results of the analysis queries), and heel_results (results of the DQ rules).

Portion of analysis table

analysis_result table row for analysis 114

heel_result table row for analysis 114

Balancing Testing – Instantiate source model: We generated a source model with 147 tables and 666 foreign key constraints and filled it with mock data using Mockaroo.

Conclusion

In this work, we performed functional testing of the ETL process for the Health Data Compass Data Warehouse on BigQuery using a black-box testing method. Data Quality rules were translated to GBQ commands and executed on Health Data Compass, yielding 25 errors/warnings in 7,807,010 records. We also instantiated the source data model for the balancing test of the ETL process.


● DQ tests are a set of DQ rules that are checked against the data in a DWH to verify that the data meets the quality metrics defined by an organization.

● We used a set of DQ rules defined in SQL by the OHDSI organization, translated them to GBQ commands, and executed them on the DWH.

● We extended SqlRender by adding GBQ types, functions, queries, and syntactic structures to its Replacement Pattern table.

● Challenges:
  ➢ String concatenation: Translating + to the CONCAT() function requires knowing the operands, because + also denotes numeric addition. Solution: a preprocessing parser algorithm that marks the operands of +.
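The preprocessing step for + can be illustrated with a minimal Python sketch. It is not the authors' parser: the functions `split_top_level_plus` and `translate_plus` and the `string_columns` parameter are hypothetical, and the heuristic (treat + as concatenation when any operand is a string literal or a column known to hold strings) is an assumption made for illustration.

```python
def split_top_level_plus(expr: str) -> list:
    """Split expr on '+' signs that lie outside quotes and parentheses."""
    parts, current = [], []
    depth, in_quote = 0, False
    for ch in expr:
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "+" and depth == 0:
                parts.append("".join(current).strip())
                current = []
                continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


def translate_plus(expr: str, string_columns=frozenset()) -> str:
    """Rewrite 'a + b' as CONCAT(a, b) when an operand is known to be a string.

    Assumption: an operand is a string if it is a quoted literal or is listed
    in string_columns; otherwise '+' is left alone as numeric addition.
    """
    operands = split_top_level_plus(expr)
    if len(operands) < 2:
        return expr
    if any(op.startswith("'") or op in string_columns for op in operands):
        return "CONCAT(" + ", ".join(operands) + ")"
    return expr


concatenated = translate_plus("first_name + ' ' + last_name")
numeric = translate_plus("count_value + 1")
```

Under this heuristic, `first_name + ' ' + last_name` becomes a CONCAT() call while `count_value + 1` is left unchanged. A real translator would need column type information for cases where no literal is present.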

Replacement Pattern is a table that SqlRender uses to translate from a source dialect to a destination dialect. It describes structures in source and destination dialects.

Replacement Pattern Table for SQL to GBQ translation

[Figure: Functional Testing of ETL. Black-box (Data Quality, Data Balancing) and White-box.]

[Figure (Data Quality Test). Labels: DQ Rules, SQL, R, Checklist, Translator, Database Connector, GBQ.]

[Figure: Balancing testing phases. Instantiate Source Model, Generate Target Model, Compare Source and Target data.]

[Figure (ETL process). Labels: Data sources, Extract, Transform, DSA, Load, DWH.]

● The Data Balancing test compares the data in the source with the data in the destination model in a DWH and checks it for consistency and completeness.

● It has three main phases: instantiating the source model, generating the target model, and comparing the source and target data.

● We perform the first phase by generating the source model and filling it with mock data.
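The first phase can be sketched in Python as follows. This is only an illustration of filling a model with mock data while respecting foreign keys, not the Mockaroo workflow used in this work. The three-table schema in `FOREIGN_KEYS`, the `id` column, and the function `generate_mock_rows` are all hypothetical.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical miniature schema: table -> {foreign_key_column: parent_table}
FOREIGN_KEYS = {
    "patient": {},
    "encounter": {"patient_id": "patient"},
    "diagnosis": {"encounter_id": "encounter", "patient_id": "patient"},
}


def generate_mock_rows(foreign_keys, rows_per_table=5, seed=0):
    """Generate rows table by table so that every foreign key points to an existing row."""
    rng = random.Random(seed)
    # Parents must be filled before children, so sort tables by dependency.
    order = TopologicalSorter(
        {table: set(fks.values()) for table, fks in foreign_keys.items()}
    ).static_order()
    data = {}
    for table in order:
        rows = []
        for row_id in range(1, rows_per_table + 1):
            row = {"id": row_id}
            for column, parent in foreign_keys[table].items():
                # Pick the key of a row that was already generated for the parent table.
                row[column] = rng.choice(data[parent])["id"]
            rows.append(row)
        data[table] = rows
    return data


mock = generate_mock_rows(FOREIGN_KEYS)
```

Sorting the tables topologically keeps the constraints satisfiable during loading. Self-referencing or cyclic foreign keys would need separate handling, since `TopologicalSorter` raises `CycleError` for them.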

Translator: Replacement Pattern Table

Pattern (SQL) | Replacement (GBQ Command)
DATEDIFF(d,@start, @end) | DATE_DIFF(DATE(@end), DATE(@start), DAY)
DATEADD(d,@days,@date) | DATE_ADD(DATE(@date), @days, DAY)
GETDATE() | CURRENT_DATE()
YEAR(@date) | EXTRACT(YEAR from @date)
STDEV(@a) | STDDEV(@a)
VAR(@a) | VARIANCE(@a)
HASHBYTES('MD5',@a) | SHA1(@a)
LEN(@a) | LENGTH(@a)
COUNT_BIG(@a) | COUNT(@a)
SELECT @values FROM @table; | bq query --use_legacy_sql=false "SELECT @values FROM @table"
insert into @table (@column) values (@value); | bq query --use_legacy_sql=false --append=true --destination_table=@table "select @value as @column"
with @table1 (@columns) as (select @values from @table2) | bq rm -f -t @table1; bq query --use_legacy_sql=false --append=true --destination_table=@table1 "select @values from @table2"
drop table @table; | bq rm -f -t @table
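A minimal Python sketch of how such pattern/replacement pairs could be applied is shown below. It is not SqlRender's implementation: `RULES`, `compile_rule`, and `translate` are hypothetical names, only four rows of the table are included, and the variable matching is deliberately simplified (a variable may not contain commas or parentheses).

```python
import re

# A few (pattern, replacement) pairs from the table; @name marks a variable.
RULES = [
    ("DATEDIFF(d,@start,@end)", "DATE_DIFF(DATE(@end), DATE(@start), DAY)"),
    ("GETDATE()", "CURRENT_DATE()"),
    ("LEN(@a)", "LENGTH(@a)"),
    ("COUNT_BIG(@a)", "COUNT(@a)"),
]


def compile_rule(pattern: str) -> re.Pattern:
    """Turn a pattern with @variables into a regex with one named group per variable."""
    regex = r"\b"
    for token in re.split(r"(@\w+)", pattern):
        if token.startswith("@"):
            # Simplification: a variable is any text without commas or parentheses.
            regex += f"(?P<{token[1:]}>[^(),]+?)"
        else:
            # Literal text; tolerate optional whitespace after commas.
            regex += re.escape(token).replace(",", r",\s*")
    return re.compile(regex, re.IGNORECASE)


def translate(sql: str) -> str:
    """Apply every rule in order, substituting captured variables into the replacement."""
    for pattern, replacement in RULES:
        def substitute(match, replacement=replacement):
            return re.sub(
                r"@(\w+)",
                lambda var: match.group(var.group(1)).strip(),
                replacement,
            )
        sql = compile_rule(pattern).sub(substitute, sql)
    return sql


translated = translate(
    "SELECT LEN(name), DATEDIFF(d, start_date, end_date) FROM person WHERE d < GETDATE()"
)
```

Nested function calls inside a variable (for example `LEN(UPPER(name))`) are not handled by this sketch and would require a proper parser.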

analysis_id | analysis_name
101 | Number of persons by age with age at first observation period
3 | Number of persons by year of birth
1020 | Number of condition era records by condition era start month
820 | Number of observation records by observation start month
114 | Number of persons with observation period before year-of-birth

analysis_id | count_value
114 | 69

analysis_id | Error/warning message
114 | ERROR: 114-Number of persons with observation period before year-of-birth. count (n=69) should not be > 0
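The relationship between the three tables for analysis 114 can be reproduced with a small Python sketch. It uses an in-memory SQLite database as a stand-in for GBQ, and the column name `message` and the exact rule query are assumptions for illustration rather than the rule text that was executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for GBQ, for illustration only
conn.executescript("""
CREATE TABLE analysis (analysis_id INTEGER, analysis_name TEXT);
CREATE TABLE analysis_results (analysis_id INTEGER, count_value INTEGER);
CREATE TABLE heel_results (analysis_id INTEGER, message TEXT);
INSERT INTO analysis VALUES
  (114, 'Number of persons with observation period before year-of-birth');
INSERT INTO analysis_results VALUES (114, 69);
""")

# Assumed DQ rule: analysis 114 must not report any persons (count must be 0).
conn.execute("""
INSERT INTO heel_results
SELECT a.analysis_id,
       'ERROR: ' || a.analysis_id || '-' || a.analysis_name ||
       '. count (n=' || r.count_value || ') should not be > 0'
FROM analysis a
JOIN analysis_results r ON a.analysis_id = r.analysis_id
WHERE a.analysis_id = 114 AND r.count_value > 0
""")

heel_rows = conn.execute("SELECT analysis_id, message FROM heel_results").fetchall()
```

With the values shown in the tables, the rule produces one heel_results row for analysis 114. If count_value were 0, the query would insert nothing.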

● ETL (Extract, Transform, Load): the process that brings data from source systems into a single destination.

● Functional testing of ETL is the most critical testing phase in a Data Warehouse (DWH) because ETL directly affects the quality of the data.

● Functional testing of ETL can be performed using black-box techniques (Data Quality (DQ) testing and Data Balancing testing) or white-box techniques.

● Objectives:
1. DQ testing: we apply a set of defined DQ rules, expressed as a set of SQL queries, to the Health Data Compass DWH on Google BigQuery (GBQ).
2. Data balancing testing: we use the idea of model transformation testing and perform the first phase of the test, i.e., instantiating the source data model by filling it with mock data.

Introduction

Data Quality Test

Data Balancing Testing

Health Data Compass

Health Data Compass is an enterprise health Data Warehouse on Google BigQuery that integrates patient clinical data from the electronic health records at both UCHealth System and Children's Hospital of Colorado. Compass is a vital source of multi-institutional integrated data and analytic services designed to transform data-driven processes in clinical research, operational excellence, molecular discovery, and precision medicine.

[Figure: Health Data Compass Architecture. Labels: UCHealth System, Children's Hospital of Colorado, Postgres/Oracle/SQL Server, ETL, Data Warehouse (OMOP CDM), Google BigQuery, Analysis and Reports.]

Functional Testing of ETL

Balancing testing phases