![Page 1: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida](https://reader034.vdocuments.net/reader034/viewer/2022052603/56649f495503460f94c6abe8/html5/thumbnails/1.jpg)
Chapter 9 Cleansing and Augmenting the Data
Xiaogang Su
Department of Statistics
University of Central Florida
Section 9.1 Cleansing and Augmenting the Data
Objectives
• Modify the data using the functions
— UPCASE
— COMPBL
— TRANWRD.
Scenario
The secretary of the Statistics Department maintains faculty information. The data set contains information about all faculty members in Fall 2001.
UPCASE Function
The UPCASE function converts all of the letters in a data value to uppercase.
var=UPCASE(var);
Example:
country=upcase(country);
BEFORE: COUNTRY = france
AFTER: COUNTRY = FRANCE
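A minimal sketch of the UPCASE call in a complete DATA step. The sample value and the variable length are assumptions for illustration, not part of the course data.

```sas
data _null_;
   length country $ 12;          /* length chosen for the sample value */
   country = 'france';
   country = upcase(country);    /* letters become uppercase, other characters are unchanged */
   put country=;                 /* writes country=FRANCE to the log */
run;
```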
COMPBL Function
The COMPBL function compresses multiple consecutive blanks in a data value into one blank. Because the length of a variable is set at compilation, the resulting data value is padded with trailing blanks to that length.
var=COMPBL(var);
Example:
name=compbl(name);
BEFORE: NAME = DE   PABLOS (several consecutive blanks)
AFTER: NAME = DE PABLOS (one blank)
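A short sketch of COMPBL on a value with several embedded blanks. The number of blanks in the sample value and the variable length are assumptions.

```sas
data _null_;
   length name $ 20;
   name = 'DE   PABLOS';     /* three blanks between the two words */
   name = compbl(name);      /* runs of blanks collapse to a single blank */
   put name=;                /* writes name=DE PABLOS to the log */
run;
```

The stored value is still 20 characters long; the blanks that were removed from the middle reappear as trailing padding.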
TRANWRD Function
The TRANWRD function replaces all occurrences of a pattern of characters in a data value with another pattern of characters.
var=TRANWRD(var, target, replacement);
Example:
name=tranwrd(name,'Miss','Ms');
BEFORE: NAME = Miss. Joy Ho
AFTER: NAME = Ms. Joy Ho
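A minimal sketch of the TRANWRD example as a runnable step. The variable length is an assumption. Note that TRANWRD matches character patterns rather than whole words, so a target such as 'Miss' would also be replaced inside a longer word.

```sas
data _null_;
   length name $ 20;
   name = 'Miss. Joy Ho';
   name = tranwrd(name, 'Miss', 'Ms');   /* every occurrence of Miss becomes Ms */
   put name=;                            /* writes name=Ms. Joy Ho to the log */
run;
```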
Cleansing Existing Data
This demonstration illustrates
• changing Country to all uppercase for USA and to an initial capital letter only for other countries
• removing multiple blanks from a last name
Program: pg2-ch9-ex01.sas
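The course program itself is not reproduced here. The following is only a sketch of one way to get the two effects described, under stated assumptions: the data set name `faculty`, its two sample rows, and the use of PROPCASE (a SAS 9 function that is not introduced in these slides) are all illustrative choices.

```sas
data faculty;                          /* hypothetical sample data */
   infile datalines dsd;               /* comma-delimited, so embedded blanks are kept */
   length lastname $ 20 country $ 12;
   input lastname $ country $;
   datalines;
DE   PABLOS,usa
MILLS,france
;
run;

data faculty_clean;
   set faculty;
   lastname = compbl(lastname);        /* remove multiple blanks from the last name */
   if upcase(country) = 'USA' then country = 'USA';
   else country = propcase(country);   /* first letter of each word uppercase, rest lowercase */
run;

proc print data=faculty_clean;
run;
```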
Section 9.2 Creating New Character Variables
Objectives
• Modify character data values.
• Specifically, use the
SCAN function
TRIM function
concatenation operator
SUBSTR function
[email protected]@[email protected]@pegasus.cc.ucf.edu
Scenario
The university has two e-mail systems, “pegasus” and “mail”. E-mail addresses take one of two forms:
<Firstinitial>[email protected]
Or <Firstinitial>[email protected]
Refresher Quiz!
1. What are the two phases of the DATA step?
Compilation and Execution
2. What are the three required attributes of all variables?
Name, Type, and Length
3. In what phase are these attributes assigned?
Compilation
SUBSTR Function
The SUBSTR function extracts a portion of a character data value, beginning at a start position and continuing for the designated number of characters.
var1=SUBSTR(var, start, <number of chars>);
Example:
name1 = substr(name,1,3);
NAME = Dorothy E
NAME1 = Dor
Retrieve First Initial
The first step is to retrieve the initial letter of an employee’s first name.
first_initial = substr(emp_firstname,1,1);
EMP_FIRSTNAME = Dorothy E
FIRST_INITIAL = D
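A minimal sketch that runs both SUBSTR calls on the same sample value. The variable lengths are assumptions for illustration.

```sas
data _null_;
   length emp_firstname $ 15;
   emp_firstname = 'Dorothy E';
   name1         = substr(emp_firstname, 1, 3);   /* start at position 1, take 3 characters */
   first_initial = substr(emp_firstname, 1, 1);   /* start at position 1, take 1 character */
   put name1= first_initial=;                     /* writes name1=Dor first_initial=D */
run;
```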
Retrieve Middle Initial
The second step is to retrieve the middle initial.
Problems:
• Middle initials do not start at the same position in every value, so SUBSTR with a fixed start position cannot retrieve them.
• Not all employees have a middle initial.
lastname   firstname     email
MILLS      DOROTHY E     [email protected]
BOWER      EILEEN A.     [email protected]
READING    TONY R.       [email protected]
JUDD       CAROL A.      [email protected]
ANDERSON   CHRISTOPHER   [email protected]
SCAN Function
The SCAN function extracts a word from a character data value, identified by its word number.
var1=SCAN(var, word-number, <delimiter(s)>);
Example:
name1=scan(name,2,' ');
NAME = Dorothy E
NAME1 = E
Retrieve Middle Initial
middle_initial=substr(scan(emp_firstname,2,' '),1,1);
Examples:
emp_firstname = Dorothy E
Second word = 'E'
1 character = E
emp_firstname = Eileen Alma
Second word = 'Alma'
1 character = A
emp_firstname = Hanna
Second word = missing
1 character = missing
When you put the e-mail address together, separate assignments must be made for those with middle initials and for those without.
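A sketch that applies the nested SCAN and SUBSTR expression to the three sample first names. The variable lengths and the DO loop over constants are illustrative choices, not taken from the course program.

```sas
data _null_;
   length emp_firstname $ 15 middle_initial $ 1;
   do emp_firstname = 'Dorothy E', 'Eileen Alma', 'Hanna';
      /* SCAN returns the second blank-delimited word, or a blank value if there is none */
      /* SUBSTR then keeps the first character of that word                              */
      middle_initial = substr(scan(emp_firstname, 2, ' '), 1, 1);
      put emp_firstname= middle_initial=;
   end;
run;
```

The log should show E for Dorothy E, A for Eileen Alma, and a blank middle initial for Hanna.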
Concatenation Operator
The third step is to put the first initial, middle initial, and last name together.
The concatenation operator joins character data values together.
var2 = var !! var1;
Example: newname=name !! name1;
Compilation:
NAME ($ 9) + NAME1 ($ 6) gives NEWNAME ($ 15)
Execution:
NAME ($ 9) = Dorothy
NAME1 ($ 6) = E
NEWNAME ($ 15) = Dorothy  E (2 spaces between the values, carried over from the trailing blanks of NAME)
TRIM Function
The TRIM function removes trailing blanks from a character data value during execution.
var2=TRIM(var) !! var1;
Example: newname=trim(name) !! name1;
Compilation:
NAME ($ 9) + NAME1 ($ 6) gives NEWNAME ($ 15)
Execution:
NAME ($ 9) = Dorothy
NAME1 ($ 6) = E
NEWNAME ($ 15) = DorothyE (0 spaces between the values, 7 trailing spaces)
Example: newname=trim(name) !! ' ' !! name1;
Execution:
NAME ($ 9) = Dorothy
NAME1 ($ 6) = E
NEWNAME ($ 16) = Dorothy E (1 space between the values, 7 trailing spaces)
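A sketch that compares the three concatenation forms side by side. The names newname1 through newname3 and len1 through len3 are hypothetical; VLENGTH is used only to report the length each variable was given at compilation.

```sas
data _null_;
   length name $ 9 name1 $ 6;
   name  = 'Dorothy';                       /* stored with 2 trailing blanks */
   name1 = 'E';                             /* stored with 5 trailing blanks */
   newname1 = name !! name1;                /* trailing blanks of NAME are kept */
   newname2 = trim(name) !! name1;          /* trailing blanks of NAME are removed */
   newname3 = trim(name) !! ' ' !! name1;   /* one blank is put back explicitly */
   len1 = vlength(newname1);                /* 9 + 6 = 15 */
   len2 = vlength(newname2);                /* 9 + 6 = 15, TRIM does not change the compiled length */
   len3 = vlength(newname3);                /* 9 + 1 + 6 = 16 */
   put newname1= / newname2= / newname3= / len1= len2= len3=;
run;
```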
Pegasus, Mail, or Other email system
This demonstration illustrates putting all the pieces together.
Program: pg2-ch9-ex02.sas
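The course program is not reproduced here. The following sketch only illustrates why separate assignments are needed for names with and without a middle initial. The data set name `emails`, the two sample rows, and the domain `@example.edu` are hypothetical stand-ins, since the real address formats are not legible in the source.

```sas
data emails;                                       /* hypothetical sample data */
   infile datalines dsd;
   length lastname firstname $ 15
          first_initial middle_initial $ 1
          email $ 40;
   input lastname $ firstname $;
   first_initial  = substr(firstname, 1, 1);
   middle_initial = substr(scan(firstname, 2, ' '), 1, 1);
   if middle_initial = ' ' then
      /* no middle initial: concatenating it would leave a blank inside the address */
      email = first_initial !! trim(lastname) !! '@example.edu';
   else
      email = first_initial !! middle_initial !! trim(lastname) !! '@example.edu';
   datalines;
MILLS,DOROTHY E
ANDERSON,CHRISTOPHER
;
run;

proc print data=emails;
run;
```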
Section 9.3 Creating New Numeric Variables
Objectives
• Modify numeric data values.
• Specifically, use the
SUM Function
ROUND Function
INT Function
MEAN Function
Scenario
International Airlines will provide a retroactive, prorated bonus to employees for their years of service, up to a maximum of ten years of past company service.
EMP_ID   sum_bonus   avg_bonus
E00002   18956.98     2708
E00003   60211.06    12042
E00004    4218.99     4219
E00005   17142.96     1905
E00007   20361.21     2909
The variables that contain the bonuses are named BONUS1, BONUS2, and so on.
Implied Variable Lists
Because the variable names have the same prefix followed by a numbered suffix, you can use an implied variable list to refer to a group of variables.
The syntax may vary slightly based on the statement in which the implied variable list is used, but the basic syntax is
varA-varN
where A is a starting number and N is an ending number. The numbers must be consecutive.
The SUM Function
The SUM function
• adds variable values together
• ignores missing values.
var=SUM(var1, var2, …, varN);
Example:
var=sum(var1, var2, var3);
VAR1 = 12
VAR2 = .
VAR3 = 6
VAR = 18
When you use an implied variable list, use the keyword OF in front of the first variable name. Without OF, the hyphen is treated as a minus sign and subtraction occurs.
var=SUM(OF var1-varN);
Example:
var=sum(of var1-var3);
VAR1 = 12
VAR2 = .
VAR3 = 6
VAR = 18
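A small sketch contrasting the OF list with the plain hyphen and with the + operator, using the sample values from the slide. The result variable names are hypothetical.

```sas
data _null_;
   var1 = 12;
   var2 = .;
   var3 = 6;
   with_of    = sum(of var1-var3);    /* list of var1, var2, var3: 12 + 6 = 18 */
   difference = var1-var3;            /* without OF the same text is a subtraction: 6 */
   with_plus  = var1 + var2 + var3;   /* the + operator does not ignore missing values */
   put with_of= difference= with_plus=;
run;
```

The last assignment yields a missing value, and SAS writes a note to the log that missing values were generated.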
The MEAN Function
The MEAN function returns the arithmetic mean (average) and ignores missing values.
var=MEAN(var1, var2, …, varN);
Example:
var=mean(of var1-var3);
VAR1 = 12
VAR2 = .
VAR3 = 6
VAR = 9
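A sketch of what "ignores missing values" means for the divisor, using the slide's sample values. The N function and the result variable names are additions for illustration.

```sas
data _null_;
   var1 = 12;
   var2 = .;
   var3 = 6;
   average  = mean(of var1-var3);      /* (12 + 6) / 2 = 9 */
   count    = n(of var1-var3);         /* 2 nonmissing values */
   per_slot = sum(of var1-var3) / 3;   /* dividing by all 3 variables would give 6 instead */
   put average= count= per_slot=;
run;
```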
The ROUND Function
The ROUND function returns a value rounded to the nearest round-off unit. If the round-off unit is not provided, the value is rounded to the nearest integer.
Any number or fractional value can be used as a round-off unit.
var=ROUND(var1,<round-off unit>);
Examples:
var1a = round(var1);
var2a = round(var2);
var3a = round(var3,.1);
var4a = round(var4,.1);
VAR1 = 12.12, VAR1A = 12
VAR2 = 42.54, VAR2A = 43
VAR3 = 6.472, VAR3A = 6.5
VAR4 = -6.47, VAR4A = -6.5
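A sketch of ROUND with several round-off units applied to one sample value. The units 10 and .25 and the result variable names are illustrative additions.

```sas
data _null_;
   value = 42.54;
   to_integer = round(value);         /* no unit: nearest integer, 43 */
   to_tenth   = round(value, .1);     /* nearest tenth, 42.5 */
   to_ten     = round(value, 10);     /* nearest multiple of 10, 40 */
   to_quarter = round(value, .25);    /* nearest multiple of 0.25, 42.5 */
   put to_integer= to_tenth= to_ten= to_quarter=;
run;
```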
The INT Function
The INT function returns the integer portion of an argument.
var=INT(var1);
Examples:
var1a = int(var1);
var2a = int(var2);
var3a = int(var3);
var4a = int(var4);
VAR1 = 12.12, VAR1A = 12
VAR2 = 42.54, VAR2A = 42
VAR3 = 6.472, VAR3A = 6
VAR4 = -6.47, VAR4A = -6
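A sketch contrasting INT with ROUND and FLOOR on a positive and a negative value. FLOOR is not covered in these slides and is included only to show that INT truncates toward zero.

```sas
data _null_;
   do value = 42.54, -6.47;
      truncated = int(value);      /* drops the fractional part: 42, then -6 */
      rounded   = round(value);    /* nearest integer: 43, then -6 */
      floored   = floor(value);    /* largest integer not above the value: 42, then -7 */
      put value= truncated= rounded= floored=;
   end;
run;
```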
Section 9.4 Creating New Numerical Variables Based on Dates
Objectives
• Modify numeric date values.
• Specifically, use
SAS time and datetime constants
the INTNX function
the TODAY function
the YEAR function
Scenario
New employees are eligible for benefits on the first of the month following their start dates.
Obs   benefit_date   emp_hire_date   LASTNAME   FIRSTNAME   benefit_days
1     05/01/1999     01APR1999       ANDERSON   SANDRA      334
2     06/01/1999     10MAY1999       BARBOUR    DAVID       303
3     02/01/2000     01JAN2000       BELL       MICHAEL     58
4     08/01/1999     18JUL1999       BENTZ      ALICE       242
5     02/01/2000     03JAN2000       BOLCH      AMY         58
Refresher Quiz!
1. What date is the starting point from which all SAS dates are based?
January 1, 1960
2. What option takes effect if only a two-digit year is specified?
YEARCUTOFF=
3. How do you specify a SAS date constant for September 12, 2008?
'12sep2008'd
SAS Times
Just as SAS has a starting point for dates, it also has a starting point for times.
Time is measured as the number of seconds since midnight.
'hh:mm<:ss.s>'T
SAS Datetimes
SAS datetimes are a combination of dates and times, and are measured as the number of seconds since midnight, January 1, 1960.
'ddmmmyyyy:hh:mm<:ss.s>'DT
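A short sketch showing the numeric values behind a time, a datetime, and a date constant. The particular time and datetime values are arbitrary samples chosen so the arithmetic is easy to check.

```sas
data _null_;
   t  = '09:30't;                 /* seconds since midnight: 9*3600 + 30*60 = 34200 */
   dt = '01jan1960:00:01:00'dt;   /* seconds since midnight, January 1, 1960: 60 */
   d  = '12sep2008'd;             /* days since January 1, 1960: 17787 */
   put t= dt= d=;
run;
```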
The INTNX Function
The INTNX function advances a date, time, or datetime value by a given interval, and returns a date, time, or datetime value.
var=INTNX('interval',start-from,increment);
Example:
var1=intnx('year',var,1);
VAR = 0 (SAS date for 01JAN1960)
VAR1 = 366 (SAS date for 01JAN1961)
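A sketch of INTNX with year and month intervals. By default the function returns the beginning of the resulting interval, which is what makes it suitable for "first of the following month" calculations. The sample dates and variable names are illustrative.

```sas
data _null_;
   next_year  = intnx('year',  '15jun1960'd, 1);   /* start of the next year: 01JAN1961 = 366 */
   next_month = intnx('month', '10may1999'd, 1);   /* start of the next month: 01JUN1999 */
   this_month = intnx('month', '10may1999'd, 0);   /* start of the same month: 01MAY1999 */
   put next_year= next_month= date9. this_month= date9.;
run;
```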
The TODAY Function
The TODAY function returns the current date as a SAS date.
var=TODAY();
Example:
Assume today’s date is September 12, 2008.
var1=today();
VAR1 = 17787 (SAS date for 12SEP2008)
Other DATE Functions
There are other functions that return values from SAS dates.
var=YEAR(SAS-date);
var=MONTH(SAS-date);
var=DAY(SAS-date);
Example:
Assume today’s date is September 12, 2008.
THISYEAR=YEAR(TODAY());
THISMONTH=MONTH(TODAY());
THISDAY=DAY(TODAY());
THISYEAR = 2008
THISMONTH = 9
THISDAY = 12
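A sketch of the three extraction functions. A fixed date constant stands in for TODAY() so that the results are reproducible; the variable name sample_date is hypothetical.

```sas
data _null_;
   sample_date = '12sep2008'd;        /* stand-in for today() */
   thisyear  = year(sample_date);     /* 2008 */
   thismonth = month(sample_date);    /* 9 */
   thisday   = day(sample_date);      /* 12 */
   current   = today();               /* the actual run date, as a SAS date value */
   put thisyear= thismonth= thisday= current= date9.;
run;
```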
Calculating Eligibility Date
This demonstration illustrates putting all the pieces together.
Program: pg2-ch9-ex03.sas
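The course program is not reproduced here. The following is a sketch of the eligibility calculation under stated assumptions: the data set name `benefits` and the variable as_of_date are hypothetical, and the fixed as-of date of 30MAR2000 is an assumption chosen because it reproduces the day counts in the scenario table. In practice TODAY() would be the natural choice.

```sas
data benefits;                                        /* hypothetical sample data */
   length lastname $ 10;
   input lastname $ emp_hire_date : date9.;
   benefit_date = intnx('month', emp_hire_date, 1);   /* first of the month after the hire date */
   as_of_date   = '30mar2000'd;                       /* assumption: stands in for today() */
   benefit_days = as_of_date - benefit_date;          /* days since eligibility began */
   format emp_hire_date date9. benefit_date mmddyy10.;
   drop as_of_date;
   datalines;
ANDERSON 01APR1999
BARBOUR 10MAY1999
BELL 01JAN2000
;
run;

proc print data=benefits;
run;
```

With these inputs the benefit dates should be 05/01/1999, 06/01/1999, and 02/01/2000, with day counts of 334, 303, and 58.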