compare 2

4
Proc Compare to Validate Datasets Angelina Cecilia Casas M.S., PPD Development, Research Triangle Park, NC ABSTRACT Comparison of two datasets is a technique used to know that two datasets are equal or if they have discrepancies. This method can be used to validate that a dataset has been created correctly or that the changes in a dataset are only those that are expected. I.e. observations were added or removed or corrections were done. INTRODUCTION Proc Compare is a procedure that allows two datasets to be compared for properties, number of observations, number of variables, and properties of the datasets. For a variable, you can get output about differences in: Type, length, formats, informats and label(s). For a dataset, you can find differences in: date of creation, last modification of the datasets was modified, number of variables and observations of the datasets. You can also see the labels of the datasets, but differences are not reported for that. For observations, you can compare the value of the record for each variable. Also, you can decide how different the values of the observations can be. THE DATASETS THAT WILL BE COMPARED. proc print data=demog noobs; Unique WEIGHTKG dob 20120 101.6 08/19/1949 20130 94.3 06/02/1934 20110 85.7 09/17/1949 20202 99.3 10/17/1931 proc contents data=demog; Data Set Name: WORK.DEMOG Observations: 4 Member Type: DATA Variables: 3 --Alphabetic List of Variables and Attributes-- Variable Type Len Pos Format Label ----------------------------------------------- UNIQUE Char 200 32 Unique Record WEIGHTKG Num 8 8 5.1 Weight in kg dob Num 8 24 MMDDYY10. Date of Birth proc sql; create table Compare as select * from demog; quit; AN EXAMPLE OF A CLEAN COMPARE OUTPUT proc compare base=here.demog compare=compare; run; The COMPARE Procedure Comparison of WORK.DEMOG with WORK.COMPARE (Method=EXACT) Data Set Summary Dataset Created Modified Nvar Nobs WORK.DEMOG 14JAN03:16:03 14JAN03:16:06 3 4 WORK.COMPARE 14JAN03:16:06 14JAN03:16:06 3 4 Variables Summary Number of Variables in Common: 3. Observation Summary Observation Base Compare First Obs 1 1 Last Obs 4 4 Number of Observations in Common: 4. Total Number of Observations Read from WORK.DEMOG: 4. Total Number of Observations Read from WORK.COMPARE: 4. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 4. NOTE: No unequal values were found. All values compared are exactly equal. The records in DEMOG and the records in compare have the same order and an ID statement is not needed. However if the order of the two datasets is not the same, records might be compared incorrectly. WHEN THE DATASETS HAVE A DIFFERENT ORDER; proc sort data=compare; by DOB; run; proc compare base=demog (keep=unique weightkg) compare=compare (keep=unique weightkg); run; As in other SAS procedures and data step, it is possible to select observations and variables in the datasets that are going to be used. Dataset Created Modified NVar Nobs WORK.DEMOG 20JAN03:13:17 20JAN03:13:17 2 4 WORK.COMPARE 20JAN03:13:17 20JAN03:13:17 2 4 Vars Summary # of Vars in Common: 2. Observation Summary Observation Base Compare # of Obs in Common: 4. Total # of Obs Read from WORK.DEMOG: 4. Total # of Obs Read from WORK.COMPARE: 4. # of Obs with Some Compared Vars Unequal: 3. # of Obs with All Compared Vars Equal: 1. Values Comparison Summary # of Vars Compared with All Obs Equal: 0. # of Vars Compared with Some Obs Unequal: 2. Total # of Values which Compare Unequal: 6. Maximum Difference: 15.9. All Vars Compared have Unequal Values Variable Type Len Label Ndif MaxDif Unique CHAR 200 Unique Record 3 WEIGHTKG NUM 8 Weight in kg 3 15.900 Value Comparison Results for Vars

Upload: naraine-kanth

Post on 06-Sep-2015

212 views

Category:

Documents


0 download

DESCRIPTION

sas

TRANSCRIPT

  • Proc Compare to Validate DatasetsAngelina Cecilia Casas M.S., PPD Development, Research Triangle Park, NC

    ABSTRACTComparison of two datasets is a technique used to know that twodatasets are equal or if they have discrepancies. This methodcan be used to validate that a dataset has been created correctlyor that the changes in a dataset are only those that are expected.I.e. observations were added or removed or corrections weredone.

    INTRODUCTIONProc Compare is a procedure that allows two datasets to becompared for properties, number of observations, number ofvariables, and properties of the datasets.For a variable, you can get output about differences in:Type, length, formats, informats and label(s).For a dataset, you can find differences in:date of creation, last modification of the datasets was modified,number of variables and observations of the datasets. You canalso see the labels of the datasets, but differences are notreported for that.For observations, you can compare the value of the record foreach variable. Also, you can decide how different the values ofthe observations can be.

    THE DATASETS THAT WILL BE COMPARED.

    proc print data=demog noobs;

    Unique WEIGHTKG dob

    20120 101.6 08/19/194920130 94.3 06/02/193420110 85.7 09/17/194920202 99.3 10/17/1931

    proc contents data=demog;

    Data Set Name: WORK.DEMOG Observations: 4Member Type: DATA Variables: 3

    --Alphabetic List of Variables and Attributes--

    Variable Type Len Pos Format Label-----------------------------------------------

    UNIQUE Char 200 32 Unique RecordWEIGHTKG Num 8 8 5.1 Weight in kgdob Num 8 24 MMDDYY10. Date of Birth

    proc sql;create table Compare as select * from demog;

    quit;

    AN EXAMPLE OF A CLEAN COMPARE OUTPUT

    proc compare base=here.demog compare=compare;run;

    The COMPARE ProcedureComparison of WORK.DEMOG with WORK.COMPARE(Method=EXACT)

    Data Set Summary

    Dataset Created Modified Nvar Nobs

    WORK.DEMOG 14JAN03:16:03 14JAN03:16:06 3 4WORK.COMPARE 14JAN03:16:06 14JAN03:16:06 3 4

    Variables SummaryNumber of Variables in Common: 3.

    Observation Summary

    Observation Base Compare

    First Obs 1 1Last Obs 4 4

    Number of Observations in Common: 4.Total Number of Observations Read fromWORK.DEMOG: 4.Total Number of Observations Read fromWORK.COMPARE: 4.

    Number of Observations with Some Compared VariablesUnequal: 0.Number of Observations with All Compared VariablesEqual: 4.

    NOTE: No unequal values were found. All valuescompared are exactly equal.

    The records in DEMOG and the records in compare have thesame order and an ID statement is not needed. However if theorder of the two datasets is not the same, records might becompared incorrectly.

    WHEN THE DATASETS HAVE A DIFFERENTORDER;

    proc sort data=compare;by DOB;run;

    proc compare base=demog (keep=unique weightkg)compare=compare (keep=unique weightkg);

    run;

    As in other SAS procedures and data step, it is possible to selectobservations and variables in the datasets that are going to beused.Dataset Created Modified NVar Nobs

    WORK.DEMOG 20JAN03:13:17 20JAN03:13:17 2 4WORK.COMPARE 20JAN03:13:17 20JAN03:13:17 2 4

    Vars Summary

    # of Vars in Common: 2.

    Observation Summary

    Observation Base Compare

    # of Obs in Common: 4.Total # of Obs Read from WORK.DEMOG: 4.Total # of Obs Read from WORK.COMPARE: 4.

    # of Obs with Some Compared Vars Unequal: 3.# of Obs with All Compared Vars Equal: 1.

    Values Comparison Summary

    # of Vars Compared with All Obs Equal: 0.# of Vars Compared with Some Obs Unequal: 2.Total # of Values which Compare Unequal: 6.Maximum Difference: 15.9.

    All Vars Compared have Unequal Values

    Variable Type Len Label Ndif MaxDifUnique CHAR 200 Unique Record 3WEIGHTKG NUM 8 Weight in kg 3 15.900

    Value Comparison Results for Vars

  • ______________________________________________________

    || Unique Record|| Base Value Compare Value

    Obs || Unique Unique_____ || ___________________+ ___________________+

    ||1 || 20120 202023 || 20110 201204 || 20202 20110

    _______________________________________________________

    _______________________________________________________

    || Weight in kg|| Base Compare

    Obs || WEIGHTKG WEIGHTKG Diff. % Diff_____ || _________ _________ _________ _________

    ||1 || 101.6 99.3 -2.3000 -2.26383 || 85.7 101.6 15.9000 18.55314 || 99.3 85.7 -13.6000 -13.6959

    _______________________________________________________

    The values of the variable that was to the right of the BASE orDATA statement is in the BASE column. The values of datasetlisted to right of the COMPARE dataset is in the COMPAREcolumn.

    It is necessary that the two datasets that are going to becompared have the same order UNLESS the order of thedatasets is something that is being tested.

    What happens when the order of the observations in each of thedatasets is different?

    proc sql;update compare set unique='34343' where

    unique='20202';quit;

    proc sort data=compare;by unique;

    run;

    proc compare base=demog(keep = unique weightkg)compare=compare(keep = unique weightkg);

    run;

    _______________________________________________________

    || Unique Record|| Base Value Compare Value

    Obs || Unique Unique_______ || ___________________+ ___________________+

    ||2 || 20120 201303 || 20130 202024 || 20202 34343

    _______________________________________________________

    Etcetera

    Even though, there is only one discrepancy (UNIQUE) in thedatasets, several discrepancies are reported. It is necessary tospecify which observations should be compared .The option that tells proc compare, which variables should betogether, is ID. After it, you should list the number of variablesthat define which observation in each of the datasets should becompared. This is similar to the way datasets are merged usinga by statement.

    proc compare base=demog compare=compare;id unique;

    run;

    # of Obs in Common: 3.

    # of Obs in WORK.DEMOG but not in WORK.COMPARE: 1.# of Obs in WORK.COMPARE but not in WORK.DEMOG: 1.

    Total # of Obs Read from WORK.DEMOG: 4.Total # of Obs Read from WORK.COMPARE: 4.

    # of Obs with Some Compared Variables Unequal: 0.# of Obs with All Compared Variables Equal: 3.

    NOTE: No unequal values were found. All values comparedare exactly equal.

    INFORMATION ABOUT VARIABLES OROBSERVATIONS THAT ARE IN DIFFERENT INTHE DATASETS BEING COMPARED.

    The output there are no differences, but it might be easy to missthe fact that only 3 of the 4 observations were compared. Also,you dont know which observations were not compared unlessyou use the LIST or LISTALL option.

    proc compare data=basecompare=comparelist;

    id unique;run;

    Observation 2 in WORK.DEMOG not found in WORK.COMPARE:Unique=20120.

    Observation 4 in WORK.COMPARE not found in WORK.DEMOG:Unique=34343.

    The LIST option will list the observations and variables that existin one dataset and are missing in the other. There are otheroptions, LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR,LISTCOMP, LISTCMPOBS, LISTCOMPVAR, LISTOBS, ANDLISTVAR, to limit this report to one or the other dataset and onlyvariables and / or observations. LISTEQUALVAR will also listthe variables that are not used in the ID variable list for which allvalues are equal.

    LIMITING THE PRINTED OUTPUT

    Sometimes it is not necessary to know all the observations thathave an error, once it is known that an error exists, it might not beneeded to know all the discrepancies.

    It might be desirable to limit the size of the output; for that, theoption MAXPRINT is very useful.

    First it is necessary to use the original dataset and modify thevariable WEIGHTKG to have something to report.

    For example, lets change the value of one of the variables in theID statement from 34343 to 20120.

    proc sql;update compare set unique='20120'where unique='34343';

    update compare set weightkg=int(weightkg);

    quit;

    proc sort data=compare;by unique;

    run;

    proc compare data=demogcompare=comparemaxprint=(4,2);

    id unique ;run;

    At the most, 4 differences will be printed, 2 for every variable that

  • has discrepancies.

    Variable Type Len Label Ndif MaxDif

    WEIGHTKG NUM 8 Weight in kg 4 0.700

    _______________________________________________________

    || Weight in kg|| Base Compare

    Unique || WEIGHTKG WEIGHTKG Diff. % Diff_______ || _________ _________ _________ _________

    ||20110 || 85.7 85.0 -0.7000 -0.816820120 || 101.6 101.0 -0.6000 -0.5906NOTE: The MAXPRINT=2 printing limit has been reached

    for the variable WEIGHTKG.

    No more values will be printed for this comparison.

    PRINTING RELATED INFORMATIONTOGETHER.

    Sometimes it is desired to see the information with discrepanciesgrouped together by the variables in the ID statement.

    proc compare data=demogcompare=comparetranspose

    ;id unique ;

    run;

    Unique=20110:Variable Base Value Compare Diff. % DiffWEIGHTKG 85.7 85.0 -0.700000 -0.816803

    dob 09/17/1949 09/14/1949 -3.000000 0.079830

    Unique=20120:Variable Base Value Compare Diff. % DiffWEIGHTKG 101.6 101.0 -0.600000 -0.590551

    dob 08/19/1949 08/16/1949 -3.000000 0.079218

    Unique=20130:Variable Base Value Compare Diff. % DiffWEIGHTKG 94.3 94.0 -0.300000 -0.318134

    dob 06/02/1934 05/30/1934 -3.000000 0.032106

    Unique=20202:Variable Base Value Compare Diff. % DiffWEIGHTKG 99.3 99.0 -0.300000 -0.302115

    dob 10/17/1931 10/14/1931 -3.000000 0.029118

    COMPARING VARIABLES WITH DIFFERENTNAMES.Sometimes, it is necessary to compare datasets when it is knownthat the variable names are different yet they should have thesame values.

    Proc sql;

    alter table compareadd nage num label "Numeric age";

    alter table demogadd age num label "New Numeric age";

    update demog setage=int((date()-dob)/365.25);

    update compare setnage=int((date()-dob)/365.25);

    quit;

    proc compare data=demogcompare=compare;

    id unique;

    var age;with nage;

    run;

    # of Vars in Common: 3.

    # of Vars in WORK.DEMOG but not in WORK.COMPARE:1.

    # of Vars in WORK.COMPARE but not in WORK.DEMOG:1.

    # of ID Vars: 1.# of VAR Statement Vars: 1.# of WITH Statement Vars: 1.

    Sometimes, there are differences in the datasets that are notimportant for the purpose of the comparison. For this scenario, itis possible to give a value for which only observations that arebigger than this number will be marked as a difference.

    proc sql;update compare set nage=nage+0.001;

    quit;

    proc compare data=demogcompare=comparecriterion=.01;

    var age;with nage;

    run;

    The COMPARE ProcedureComparison of WORK.DEMOG with WORK.COMPARE(Method=RELATIVE(2.21E-12), Criterion=0.01)

    Variables Summary

    Number of Variables in Common: 3.Number of Variables in WORK.DEMOG but not in

    WORK.COMPARE: 1.Number of Variables in WORK.COMPARE but not in

    WORK.DEMOG: 1.Number of VAR Statement Variables: 1.Number of WITH Statement Variables: 1.

    Number of Observations in Common: 4.Total Number of Observations Read from WORK.DEMOG: 4.Total Number of Observations Read from WORK.COMPARE:

    4.

    Number of Observations with Some Compared VariablesUnequal: 0.Number of Observations with All Compared Variables

    Equal: 4.

    Values Comparison Summary

    Number of Variables Compared with All ObservationsEqual: 1.Number of Variables Compared with Some Observations

    Unequal: 0.Total Number of Values which Compare Unequal: 0.Total Number of Values not EXACTLY Equal: 4.Maximum Difference Criterion Value: 0.000018868.

    CREATING AN OUTPUT DATASETSometimes, it is desirable to create an output dataset thatcontains only the equalities or discrepancies. Here, there is anexample to create a dataset that has the differences.The outnoequal option, specifies that only the observations withdiscrepancies will exist in the dataset toprint.

    proc compare data=demogcompare=compareoutnoequalout=toprint

    ;

  • id unique ;run;

    Proc Print data=toprint;run;

    Variables with Unequal Values

    Variable Type Len Label Ndif MaxDif

    WEIGHTKG NUM 8 Weight in kg 4 0.700

    Value Comparison Results for Variables

    _______________________________________________________

    ||Weight in kg|| Base Compare

    Unique || WEIGHTKG WEIGHTKG Diff. % Diff_________ || ________ ________ ________ _________

    ||20110 || 85.7 85.0 -0.7000 -0.816820120 || 101.6 101.0 -0.6000 -0.590620130 || 94.3 94.0 -0.3000 -0.318120202 || 99.3 99.0 -0.3000 -0.3021______________________________________________________

    OUTPUT OF PROC PRINT

    Obs _TYPE_ _OBS_ Unique WEIGHTKG dob

    1 DIF 1 20110 -0.7 E2 DIF 2 20120 -0.6 E3 DIF 3 20130 -0.3 E4 DIF 4 20202 -0.3 E

    It is possible to use the noprint option, but then it is stronglyrecommended to use the options OUTCOMP AND OUTBASE todifferentiate the variables.

    proc compare data=demogcompare=compareoutnoequalout=toprintnoprintoutcompoutbase;

    id unique ;run;

    proc print data=toprint;run;

    Obs _TYPE_ _OBS_ Unique WEIGHTKG dob1 BASE 1 20110 85.7 09/17/19492 COMPARE 1 20110 85.0 09/17/19493 BASE 2 20120 101.6 08/19/19494 COMPARE 2 20120 101.0 08/19/19495 BASE 3 20130 94.3 06/02/19346 COMPARE 3 20130 94.0 06/02/19347 BASE 4 20202 99.3 10/17/19318 COMPARE 4 20202 99.0 10/17/1931

    A USEFUL DATASET TO REPORT

    To get only the observations and variables that havediscrepancies, proc transpose can be a big help in reporting thefindings.

    proc transpose data=toprint out=transp;by unique _obs_;id _type_;

    run;

    proc print data=transp(where=(base^=compare));run;

    Obs Unique _OBS_ _NAME_ _LABEL_ BASE COMPARE1 20110 1 WEIGHTKG Weight in kg 85.7 853 20120 2 WEIGHTKG Weight in kg 101.6 1015 20130 3 WEIGHTKG Weight in kg 94.3 947 20202 4 WEIGHTKG Weight in kg 99.3 99

    CONCLUSIONYou are the person who decides what is important to validate,values of the observations and how different can they be. Is itimportant to have the same labels, formats and informats in avariable. Proc Compare is a tool that is flexible to allow you findthe differences that are important to you.

    ACKNOWLEDGMENTSI want to thank, Craig Mauldin, Andy Barnett, George Clark,Bonnie Duncan and Kim Sturgen for their comments and support.

    CONTACT INFORMATIONYour comments and questions are valued and encouraged.Contact the author at:

    A. Cecilia CasasPPD Development3900 Paramount ParkwayMorrisville, NC 27560Phone: (919) 462-4199Fax: (919) 379-6151Email: [email protected]

    SAS and all other SAS Institute Inc. product or service names areregistered trademarks or trademarks of SAS Institute Inc. in theUSA and other countries. indicates USA registration.

    Other brand and product names are trademarks of theirrespective companies.