data transformation data cleaning. importing data reading data from external formats...

31
Data Transformation Data cleaning

Upload: suzan-wright

Post on 23-Dec-2015

232 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Data Transformation

Data cleaning

Page 2: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Importing Data

• Reading data from external formats

• Libname/Infile/Input for text form data

• Proc Import for Excel/Access data

• ODBC for external database data

Page 3: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Importing an Excel Spreadsheet

PROC IMPORT OUT= WORK.Fall2007 DATAFILE= "L:\DataWarehousing07f\

CourseDatabase\Fall2007.xls" DBMS=EXCEL REPLACE; SHEET="'Fall 07$'"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES;RUN;

Page 4: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Import an Access Table

PROC IMPORT OUT= WORK.OrderLine

DATATABLE= "OrderLin"

DBMS=ACCESS REPLACE;

DATABASE="I:\DataWarehousing07f\WholesaleProducts.mdb";

SCANMEMO=YES;

USEDATE=NO;

SCANTIME=YES;

RUN;

Page 5: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Good Practice

• Check the metadata for a dataset

PROC CONTENTS DATA= OrderLine;

RUN; • Print a few records

PROC PRINT DATA= OrderLine (OBS= 10);

RUN;

Page 6: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Saving SAS Datasets

LIBNAME course "L:\DataWarehousing07f\CourseDatabase";

Data course.Spring2008;

set spring2008;

run;

Note: the name associated with the libname command (“course”) must be 8 characters or less.

Page 7: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

LIBNAME / INFILE / INPUT for character data

• LIBNAME identifies the location or folder where the data file is stored

• INFILE specifies the libname to use for reading external data.

• INPUT reads text format data

• SET reads SAS data

Page 8: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

INFILE with INPUT for character data files

DATA Fitness;

INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt";

INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS;

run;

Page 9: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Creating Derived Attributes

Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code.

• LENGTH statements• FORMAT statements• FORMATS and INFORMATS• PUT• INPUT

Page 10: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

PUT and INPUT Functions

TextOutput = PUT(variable, format)

Note: the result of a put function is always character

Note: there is also a PUT statement that writes the contents of a variable to the SAS log

Output = INPUT(CharacterInput, informat)

Note: the variable for an input function is always character

Page 11: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Formats

• Formats always contain a period• Formats for character variables always

start with a $• The most used format categories are

Character, Date and Time, and Numeric

Note: use the SAS “search” tab to look for “Formats.” For a list of SAS formats look under: “Formats: Formats by Category”

Page 12: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Good Practice

The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library

Data _Null_;InputVal= 123;OutputVal= PUT(InputVal, Roman30.);PUT InputVal OutputVal;run;

Page 13: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Generating Dates

• Generating a Date dimension

• Usually done offline in something like Excel

• SAS has extensive date and datetime functions and formats

• SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.

Page 14: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Creating a text variable for Date

Data Orders2;

Length Date $10.;

Set Orders;

Date= PUT( Datepart(OrderDate), MDDYY8.);• The Length statement assures that the variable will

have enough space. It must come before the SET.• OrderDate has DateTime format. The DATEPART

function produces a date format output. MMDDYYx. is a date format type.

Page 15: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

SAS Functions

We are especially interested in “Character” and “Date and Time” functions

Note: use the SAS “search” tab to look for “Functions.” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”

Page 16: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Useful Data Cleaning Functions

• Text Manipulation:– COMPRESS, STRIP, TRIM, LEFT, RIGHT,

UPCASE, LOWCASE

• Text Extraction– INDEX, SCAN, SUBSTR, TRANSLATE,

TRANWRD

Page 17: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Parsing

• The process of splitting a text field into multiple fields

• Uses SAS functions to extract parts of a character string.– Fixed position in a string: SUBSTR– Known delimiter: SCAN

Note: it is a good idea to strip blanks before you try to parse a string.

Page 18: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Example of Parsing

Data Customer2;LENGTH street cust_addr $20.;FORMAT street cust_addr $20.;

SET Customer;Cust_Addr= TRIM(Cust_Addr);Number= Scan(Cust_Addr,1,' ');Street= Scan(Cust_Addr,2,' ');run;

Note: The LENGTH and FORMAT statements clear trailing blanks for further display.

Page 19: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Parsing Results

Obs cust_addr Number street

1 481 OAK 481 OAK 2 215 PETE 215 PETE 3 48 COLLEGE 48 COLLEGE 4 914 CHERRY 914 CHERRY 5 519 WATSON 519 WATSON 6 16 ELM 16 ELM 7 108 PINE 108 PINE

Page 20: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Good Practice

Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘110b Elm’ and ‘110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.

Page 21: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

=SUBSTR( string, position<, length>)

Use this when you have a known position for characters.• String: character expression• Position: start position (starts with 1)• Length: number of characters to take (missing takes all

to the end)VAR= ‘ABCDEFG’

NEWVAR= SUBSTR(VAR,2,2)NEWVAR2= SUBSTR(VAR,4)

NEWVAR= ‘BC’NEWVAR2= ‘DEFG’

Page 22: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

SUBSTR(variable, position<,length>) = new-characters

Replaces character value contents. Use this when you know where the replacement starts.

a='KIDNAP';

substr(a,1,3)='CAT';

a: CATNAP

substr(a,4)='TY' ;

a: KIDTY

Page 23: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

INDEX(source, excerpt)

• Searches a character expression for a string of characters. Returns the location (number) where the string begins.

a='ABC.DEF (X=Y)';

b='X=Y';

x=index(a,b);

x: 10

x= index(a,’DEF’);

x: 5

Page 24: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Alternative INDEX functions

• INDEXC searches for a single character

• INDEXW searches for a word:

Syntax

INDEXW(source, excerpt<,delimiter>)

Page 25: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Length

Returns the length of a character variable• The LENGTH and LENGTHN functions return

the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0.

• The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.

Page 26: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Standardizing

• Adjusting terms to standard format.

• Based off of frequency prints.

• Use functions or IF statements– TRANWRD is easy but can produce

unexpected results– IF statements are safer, but less general

Page 27: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Standardization Code

Supplier= Tranwrd(supplier, " Incorporated", "");

If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things";

More complex logic is often needed. See the course examples.

Page 28: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Good Practice

It is a good idea to produce a change log for standardized changes:

Data Products2 Changed;Set Products;SupplierOld= Supplier;

* * * * Output Products2;If Trim(supplier) ^= Trim(SupplierOld) then output Changed;

Proc Print Data= Changed;Var SupplierOld Supplier;

Page 29: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Locating Anomalies

• Frequency counts are a good way to identify anomalies.

• It is also helpful to identify standard changes that you do not have to review.

• Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)

Page 30: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

De Duplicating

• Reconcile different representations of the same entity

• Done after standardizing. Usually requires multi-field testing.

• May use probabilistic logic, depending on the application.

• Should produce a change log.

Page 31: Data Transformation Data cleaning. Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access

Correcting

• Identifying and correcting values that are wrong

• Very difficult to do. Usually based off of exception reports or range checks.