data transformation data cleaning. importing data reading data from external formats...
TRANSCRIPT
Data Transformation
Data cleaning
Importing Data
• Reading data from external formats
• Libname/Infile/Input for text form data
• Proc Import for Excel/Access data
• ODBC for external database data
Importing an Excel Spreadsheet
PROC IMPORT OUT= WORK.Fall2007 DATAFILE= "L:\DataWarehousing07f\
CourseDatabase\Fall2007.xls" DBMS=EXCEL REPLACE; SHEET="'Fall 07$'"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES;RUN;
Import an Access Table
PROC IMPORT OUT= WORK.OrderLine
DATATABLE= "OrderLin"
DBMS=ACCESS REPLACE;
DATABASE="I:\DataWarehousing07f\WholesaleProducts.mdb";
SCANMEMO=YES;
USEDATE=NO;
SCANTIME=YES;
RUN;
Good Practice
• Check the metadata for a dataset
PROC CONTENTS DATA= OrderLine;
RUN; • Print a few records
PROC PRINT DATA= OrderLine (OBS= 10);
RUN;
Saving SAS Datasets
LIBNAME course "L:\DataWarehousing07f\CourseDatabase";
Data course.Spring2008;
set spring2008;
run;
Note: the name associated with the libname command (“course”) must be 8 characters or less.
LIBNAME / INFILE / INPUT for character data
• LIBNAME identifies the location or folder where the data file is stored
• INFILE specifies the libname to use for reading external data.
• INPUT reads text format data
• SET reads SAS data
INFILE with INPUT for character data files
DATA Fitness;
INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt";
INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS;
run;
Creating Derived Attributes
Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code.
• LENGTH statements• FORMAT statements• FORMATS and INFORMATS• PUT• INPUT
PUT and INPUT Functions
TextOutput = PUT(variable, format)
Note: the result of a put function is always character
Note: there is also a PUT statement that writes the contents of a variable to the SAS log
Output = INPUT(CharacterInput, informat)
Note: the variable for an input function is always character
Formats
• Formats always contain a period• Formats for character variables always
start with a $• The most used format categories are
Character, Date and Time, and Numeric
Note: use the SAS “search” tab to look for “Formats.” For a list of SAS formats look under: “Formats: Formats by Category”
Good Practice
The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library
Data _Null_;InputVal= 123;OutputVal= PUT(InputVal, Roman30.);PUT InputVal OutputVal;run;
Generating Dates
• Generating a Date dimension
• Usually done offline in something like Excel
• SAS has extensive date and datetime functions and formats
• SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.
Creating a text variable for Date
Data Orders2;
Length Date $10.;
Set Orders;
Date= PUT( Datepart(OrderDate), MDDYY8.);• The Length statement assures that the variable will
have enough space. It must come before the SET.• OrderDate has DateTime format. The DATEPART
function produces a date format output. MMDDYYx. is a date format type.
SAS Functions
We are especially interested in “Character” and “Date and Time” functions
Note: use the SAS “search” tab to look for “Functions.” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”
Useful Data Cleaning Functions
• Text Manipulation:– COMPRESS, STRIP, TRIM, LEFT, RIGHT,
UPCASE, LOWCASE
• Text Extraction– INDEX, SCAN, SUBSTR, TRANSLATE,
TRANWRD
Parsing
• The process of splitting a text field into multiple fields
• Uses SAS functions to extract parts of a character string.– Fixed position in a string: SUBSTR– Known delimiter: SCAN
Note: it is a good idea to strip blanks before you try to parse a string.
Example of Parsing
Data Customer2;LENGTH street cust_addr $20.;FORMAT street cust_addr $20.;
SET Customer;Cust_Addr= TRIM(Cust_Addr);Number= Scan(Cust_Addr,1,' ');Street= Scan(Cust_Addr,2,' ');run;
Note: The LENGTH and FORMAT statements clear trailing blanks for further display.
Parsing Results
Obs cust_addr Number street
1 481 OAK 481 OAK 2 215 PETE 215 PETE 3 48 COLLEGE 48 COLLEGE 4 914 CHERRY 914 CHERRY 5 519 WATSON 519 WATSON 6 16 ELM 16 ELM 7 108 PINE 108 PINE
Good Practice
Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘110b Elm’ and ‘110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.
=SUBSTR( string, position<, length>)
Use this when you have a known position for characters.• String: character expression• Position: start position (starts with 1)• Length: number of characters to take (missing takes all
to the end)VAR= ‘ABCDEFG’
NEWVAR= SUBSTR(VAR,2,2)NEWVAR2= SUBSTR(VAR,4)
NEWVAR= ‘BC’NEWVAR2= ‘DEFG’
SUBSTR(variable, position<,length>) = new-characters
Replaces character value contents. Use this when you know where the replacement starts.
a='KIDNAP';
substr(a,1,3)='CAT';
a: CATNAP
substr(a,4)='TY' ;
a: KIDTY
INDEX(source, excerpt)
• Searches a character expression for a string of characters. Returns the location (number) where the string begins.
a='ABC.DEF (X=Y)';
b='X=Y';
x=index(a,b);
x: 10
x= index(a,’DEF’);
x: 5
Alternative INDEX functions
• INDEXC searches for a single character
• INDEXW searches for a word:
Syntax
INDEXW(source, excerpt<,delimiter>)
Length
Returns the length of a character variable• The LENGTH and LENGTHN functions return
the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0.
• The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.
Standardizing
• Adjusting terms to standard format.
• Based off of frequency prints.
• Use functions or IF statements– TRANWRD is easy but can produce
unexpected results– IF statements are safer, but less general
Standardization Code
Supplier= Tranwrd(supplier, " Incorporated", "");
If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things";
More complex logic is often needed. See the course examples.
Good Practice
It is a good idea to produce a change log for standardized changes:
Data Products2 Changed;Set Products;SupplierOld= Supplier;
* * * * Output Products2;If Trim(supplier) ^= Trim(SupplierOld) then output Changed;
Proc Print Data= Changed;Var SupplierOld Supplier;
Locating Anomalies
• Frequency counts are a good way to identify anomalies.
• It is also helpful to identify standard changes that you do not have to review.
• Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)
De Duplicating
• Reconcile different representations of the same entity
• Done after standardizing. Usually requires multi-field testing.
• May use probabilistic logic, depending on the application.
• Should produce a change log.
Correcting
• Identifying and correcting values that are wrong
• Very difficult to do. Usually based off of exception reports or range checks.