chapter 9 cleansing and augmenting the data xiaogang su department of statistics university of...

48
Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Upload: juliet-stafford

Post on 14-Jan-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Chapter 9 Cleansing and Augmenting

the Data

Xiaogang Su

Department of Statistics

University of Central Florida

Page 2: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Section 9.1 Cleansing and Augmenting the Data

Objectives

• Modify the data using the functions

— UPCASE

— COMPBL

— TRANWRD.

Page 3: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Scenario

The secretary at Statistics Department maintains some faculty information. The data set contains information about all faculty in Fall 2001.

...

Page 4: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

UPCASE Function

The UPCASE function converts all of the letters in the data value into uppercase.

var=UPCASE(var);

country=upcase(country);COUNTRY

france

BEFORE

COUNTRY

FRANCE

AFTER

Example:

...

Page 5: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

COMPBL Function

The COMPBL function compresses multiple consecutive blanks in a data value into one blank. Since the length of a variable is set at compilation, the resulting data value is padded with blanks.

Example:

var=COMPBL(var);

name=compbl(name);NAME

DE PABLOS

BEFORE

NAME

DE PABLOS

AFTER

...

Page 6: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

TRANWRD Function

The TRANWRD function replaces all occurrences of a pattern of characters in a data value with another pattern of characters.

Example:

var=TRANWRD(var, target, replacement);

name=tranwrd(name,’Miss’, ’Ms’);NAME

Miss. Joy Ho

BEFORE

Ms. Joy Ho

AFTER

NAME

...

Page 7: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Cleansing Existing Data

This demonstration illustrates

• changing Country to be always uppercase for USA and only the first letter be upper case for other counties

• removing multiple blanks from a last name

Program: pg2-ch9-ex01.sas

Page 8: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Section 9.2 Creating New Character Variables

Objectives

• Modify character data values.

• Specifically, use the

SCAN function

TRIM function

concatenation

operator

SUBSTR function

...

Page 9: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

[email protected]@[email protected]@pegasus.cc.ucf.edu

ScenarioUniversity has two email systems “pegasus” and “mail”. The e-mail address structures are in the form of

<Firstinitial>[email protected]

Or <Firstinitial>[email protected]

...

Page 10: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Refresher Quiz!

1. What are the two phases of the DATA step?

Compilation and Execution

2. What are the three required attributes of all variables?

Name, Type, and Length

3. In what phase are these attributes assigned?

Compilation

...

Page 11: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

SUBSTR Function

The SUBSTR function extracts a portion of the character data value based on how many characters are designated for retrieval.

var1=SUBSTR(var, start ,<number of chars>);

NAME

Dorothy Ename1 = substr(name,1,3); NAME1

Dor

...

Page 12: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Retrieve First Initial

The first step is to retrieve the initial letter of an employee’s first name.

first_initial = substr(emp_firstname,1,1);

EMP_FIRSTNAME

Dorothy E

FIRST_INITIAL

D

...

Page 13: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Retrieve Middle Initial

The second step is to retrieve the middle initial.

Problems:

• Not all middle initials are in the same location, so you cannot use the SUBSTR function.

• Not all employees have a middle initial.

lastname firstname email MILLS DOROTHY E [email protected] BOWER EILEEN A. [email protected] READING TONY R. [email protected] JUDD CAROL A. [email protected] ANDERSON CHRISTOPHER [email protected]

...

Page 14: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

SCAN Function

The SCAN function extracts a portion of the character data value based on what word-number to retrieve.

Example:

var1=SCAN(var, word-number, <delimiter(s)>);

NAME

Dorothy Ename1=scan(name,2,' ');

NAME1

E

...

Page 15: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Dorothy E

Second word = 'E'

...

Page 16: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Dorothy E

1 character = E

Second word = 'E'

Page 17: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Eileen Alma

Second word = 'Alma'

...

Page 18: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Eileen Alma

1 character = A

Second word = 'Alma'

...

Page 19: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Hanna

Second word = missing

...

Page 20: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

When you put the e-mail address together, separate assignments must be made for those with middle initials and for those without.

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Hanna

1 character = missing

...

Page 21: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Concatenation Operator

The third step is to put the first initial, middle initial, and last name together.

The concatenation operator joins character data values together.

var2 = var !! var1;

Page 22: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Concatenation Operator

Example: newname=name !! name1;

Compilation

NAME1

$

6

NEWNAME

$

15+

NAME

$

9

...

Page 23: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Concatenation Operator

Example: newname=name !! name1;

Dorothy E

2 spaces

NEWNAME

$

15

Dorothy E

Execution

NAME

$

9

NAME1

$

6

+

...

Page 24: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

TRIM Function

The TRIM function removes trailing blanks from a character data value during execution.

var2=TRIM(var) !! var1;

Page 25: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

TRIM Function

Example: newname=trim(name) !! name1;

NEWNAME

$

15

NAME1

$

6

NAME

$

9

Compilation

+

...

Page 26: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

TRIM Function

Example: newname=trim(name) !! name1;

0 spaces

7 spaces

NEWNAME

$

15

DorothyEDorothy E

Execution

NAME

$

9

NAME1

$

6

+

...

Page 27: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

TRIM Function

Example: newname=trim(name) !! ‘ ‘ !! name1;

1 space

7 spaces

NEWNAME

$

16

Dorothy EDorothy E

Execution

NAME

$

9

NAME1

$

6

+ +space

...

Page 28: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Pegasus, Mail, or Other email system

This demonstration illustrates putting all the pieces together.

Program: pg2-ch9-ex02.sas

Page 29: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Section 9.3 Creating New Numeric Variables

Objectives

• Modify numeric data values.

• Specifically, use the

SUM Function

ROUND Function

INT Function

MEAN Function

Page 30: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Scenario

International Airlines will provide a retroactive, prorated bonus to employees for their years of service up to and including a maximum of ten years of past company service.

sum_ avg_

EMP_ID bonus bonus

E00002 18956.98 2708

E00003 60211.06 12042

E00004 4218.99 4219

E00005 17142.96 1905

E00007 20361.21 2909

The variable names that containthe bonuses areBONUS1, BONUS2,and so on.

Page 31: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Implied Variable Lists

Because the variable names have the same prefix word followed by a numbered suffix, you can use an implied implied variable listvariable list to refer to a group of variables.

The syntax may vary slightly based on the statement in which the implied variable list is used, but the basic syntax is

varA-varN

where A is a starting number, and N is an ending number. The numbers must be consecutive.

Page 32: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The SUM Function

The SUM function

• adds variable values together

• ignores missing values.

var=SUM(var1, var2, …, varN);

var=sum(var1, var2, var3);

Example:

VAR

18

VAR1

12

VAR2

.

VAR3

6

...

Page 33: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The SUM Function

When you use an implied variable list, use the keyword OF in front of the first variable name to prevent subtraction from occurring.

Example:

var=SUM(OF var1-varN);

var=sum(of var1-var3);VAR1

12

VAR2

.

VAR3

6

VAR

18

Page 34: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The MEAN Function

The MEAN function returns the arithmetic mean (average) and ignores missing values.

Example:

var=MEAN(var1, var2,…,varN);

var=mean(of var1-var3);VAR

9

VAR1

12

VAR2

.

VAR3

6

...

Page 35: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The ROUND Function

The ROUND function returns a value rounded to the nearest round-off unit. If round-off unit is not provided, the variable is rounded to the nearest integer.

Any number or fractional value can be used as a round-off unit.

var=ROUND(var1,<round-off unit>);

Page 36: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The ROUND Function

Examples:

VAR1A

12

VAR2A

43

VAR3A

6.5

VAR4A

-6.5

VAR1

12.12

VAR2

42.54

VAR3

6.472

VAR4

-6.47

var1a = round(var1);var2a = round(var2);var3a = round(var3,.1);var4a = round(var4,.1);

Page 37: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The INT Function

The INT function returns the integer portion of an argument.

var=INT(var1);

Page 38: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The INT Function

var1a = int(var1);var2a = int(var2);var3a = int(var3);var4a = int(var4);

Examples:

VAR1

12.12

VAR2

42.54

VAR3

6.472

VAR4

-6.47

VAR1A

12

VAR2A

42

VAR3A

6

VAR4A

-6

Page 39: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Section 9.4 Creating New Numerical Variables Based on Dates

Objectives

• Modify numeric date values.

• Specifically, use

SAS time and datetime constants

the INTNX function

the TODAY function

the YEAR function

...

Page 40: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Scenario

New employees are eligible for benefits on the first of the month following their start dates.

benefit_ emp_hire_ benefit_Obs date date LASTNAME FIRSTNAME days 1 05/01/1999 01APR1999 ANDERSON SANDRA 3342 06/01/1999 10MAY1999 BARBOUR DAVID 3033 02/01/2000 01JAN2000 BELL MICHAEL 584 08/01/1999 18JUL1999 BENTZ ALICE 2425 02/01/2000 03JAN2000 BOLCH AMY 58

Page 41: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Refresher Quiz!

1. What date is the starting point from which all SAS dates are based?

2. What option takes effect if only a two-digit year is specified?

January 1, 1960

YEARCUTOFF=

3. How do you specify a SAS date constant for September 12, 2008? '12sep2008'd

...

Page 42: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

SAS Times

Just as SAS has a starting point for dates, it also has a starting point for times.

Time is measured as the number of seconds since midnight.

'hh:mm<:ss.s>'T

Page 43: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

SAS Datetimes

SAS datetimes are a combination of dates and times, and are measured as the number of seconds since January 1, 1960.

'ddmmmyyyy:hh:mm<:ss.s>'DT

Page 44: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The INTNX Function

The INTNX function advances a date, time, or datetime value by a given interval, and returns a date, time, or datetime value.

Example:

var1=intnx('year',var,1);VAR

0

SAS date for 01JAN1960

VAR1

366

SAS date for 01JAN1961

var=INTNX('interval',start-from,increment);

...

Page 45: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

The TODAY Function

The TODAY function returns the current date as a SAS date.

Example:

Assume today’s date is September 12, 2008.

var1=today(); VAR1

17787

SAS date for 12SEP2008

var=TODAY();

...

Page 46: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Other DATE Functions

There are other functions that return values from SAS dates.

var=YEAR(SAS-date);

var=MONTH(SAS-date);

var=DAY(SAS-date);

Page 47: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Other DATE Functions

Example:

Assume today’s date is September 12, 2008.

THISYEAR=YEAR(TODAY());

THISDAY

12

THISMONTH=MONTH(TODAY());

THISDAY=DAY(TODAY());

THISMONTH

9

THISYEAR

2008

...

Page 48: Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida

Calculating Eligibility Date

This demonstration illustrates putting all the pieces together.

Program: pg2-ch9-ex03.sas