The Essence of DATA Step Programming

Arthur Li, City of Hope Comprehensive Cancer Center, Department of Information Science


Post on 30-Nov-2014

DESCRIPTION

The foundation of SAS programming is DATA step programming, and the essence of DATA step programming is understanding how SAS processes data during the compilation and execution phases. This paper shows what happens “behind the scenes” while a SAS dataset is created: how a new dataset is built, one observation at a time, from either a raw text file or an existing SAS dataset to the program data vector (PDV), and from the PDV to the newly created SAS dataset. Once you understand DATA step processing, the SUM and RETAIN statements become easier to grasp. The paper also covers BY-group processing.

TRANSCRIPT

Page 1: The essence of data step programming

The Essence of DATA Step Programming

Arthur Li

City of Hope Comprehensive Cancer Center

Department of Information Science

Page 2: The essence of data step programming

INTRODUCTION

Fundamental of SAS programming: DATA step programming

Essence of DATA step programming: understanding how SAS processes the data during the compilation and execution phases

Page 3: The essence of data step programming

A COMMON BEFUDDLEMENT

The newly created SAS dataset is not what we intended:
  there are more or fewer observations than expected
  the value of a variable was not retained correctly

Reason:
  Learning only SAS language syntax
  Not understanding the fundamental SAS programming concepts

Page 4: The essence of data step programming

INTRODUCTION

We will cover…
  What happens “behind the scenes” while creating a SAS dataset
  How a new dataset is created, one observation at a time: raw text file / SAS dataset → PDV → SAS dataset
  The SUM and RETAIN statements
  BY-group processing
  Transposing dataset examples

Page 5: The essence of data step programming

DATA STEP PROCESSING OVERVIEW

A DATA step is processed in a two-phase sequence:

Compilation phase: Each statement is scanned for syntax errors.

Execution phase: If there are no syntax errors, the DATA step reads and processes the input data.

Page 6: The essence of data step programming

DATA STEP PROCESSING OVERVIEW

Program1:

data ex1;
    infile 'C:\Arthur\example1.txt';
    input name $ 1-7 height 9-10 weight 12-14;
    BMI = 700*weight/(height*height);
    output;
run;

The column input method:
  Each variable occupies a fixed field
  The values are standard character or numeric values

Variable name   Columns
Name            1-7
Height          9-10
Weight          12-14

Creating a new variable: BMI

Example1.txt:
12345678901234567890
Barbara 61 12D
John    62 175

Data entry error: the Weight field of the first record contains 12D
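To try Program1 without creating C:\Arthur\example1.txt, the two records can be supplied in-stream. This is a minimal sketch, not part of the original slides: it assumes the record layout implied by the column specifications (Name in columns 1-7, Height in 9-10, Weight in 12-14) and replaces the INFILE statement with DATALINES.

```sas
/* Same logic as Program1, with the raw records supplied in-stream.   */
/* Assumption: column alignment is reconstructed from the INPUT spec. */
data ex1;
    input name $ 1-7 height 9-10 weight 12-14;
    BMI = 700*weight/(height*height);
    output;
    datalines;
Barbara 61 12D
John    62 175
;
run;

proc print data=ex1;
run;
```

The first record keeps the data entry error (12D), so the log should report invalid data for Weight on that line.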

Page 7: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

Used to hold raw data
Not created when reading a SAS dataset

Page 8: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

PDV is created

Memory area where SAS builds its new data set, 1 observation at a time.

_N_ D _ERROR_D

Page 9: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

PDV is created

Automatic variables:
  _N_ = 1: 1st observation is being processed
  _N_ = 2: 2nd observation is being processed

_N_ D _ERROR_D

Page 10: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

PDV is created

Automatic variables:
  _ERROR_ = 1: signals a data error in the observation currently being processed

_N_ D _ERROR_D

Page 11: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

A space is added to the PDV for each variable

_N_ D   _ERROR_ D   Name K   Height K   Weight K

Page 12: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

BMI is added to the PDV

_N_ D   _ERROR_ D   Name K   Height K   Weight K   BMI K

Page 13: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV: _N_ D   _ERROR_ D   Name K   Height K   Weight K   BMI K

D = dropped

K = kept

Page 14: The essence of data step programming

COMPILATION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV: _N_ D   _ERROR_ D   Name K   Height K   Weight K   BMI K

Checks for syntax errors:
  invalid variable names
  invalid options
  incorrect punctuation
  misspelled keywords

Page 15: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

The DATA step works like a loop. It repeatedly:
  executes statements
  reads data values
  creates observations, one at a time
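One way to watch this loop is to write the PDV to the log at the top and bottom of each iteration. The sketch below assumes the same two records supplied in-stream instead of read from Example1.txt; the PUT statements and their labels are illustrative additions, not part of Program1.

```sas
/* Trace the PDV with PUT _ALL_ (labels are hypothetical) */
data ex1;
    put 'Top of iteration:    ' _all_;
    input name $ 1-7 height 9-10 weight 12-14;
    BMI = 700*weight/(height*height);
    put 'Bottom of iteration: ' _all_;
    datalines;
Barbara 61 12D
John    62 175
;
run;
```

With two records, the "Top" line should appear three times: the third iteration starts, the INPUT statement finds no more data, and the step stops before reaching the bottom.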

Page 16: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

1st Iteration: at the beginning

_N_ = 1, _ERROR_ = 0
The remaining variables are set to missing

1 0 . . .

Page 17: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

The INFILE statement identifies the location of Example1.txt

The 1st data line is read into the input buffer
The input pointer is at the beginning of the input buffer

1 0 . . .

Page 18: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

The INPUT statement reads data values: input buffer → PDV

. . .

Page 19: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

Input buffer (columns 1-7) → “Name” in the PDV

Barbara . . .

Page 20: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

The input pointer @ column 8

Barbara . . .

Page 21: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0 . .

Input buffer (columns 9-10) → “Height” in the PDV

Barbara 61

Page 22: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

The input pointer @ column 11

Barbara 61 . .

Page 23: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

Tries to read Weight – invalid value

Barbara 61 . .

Page 24: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

Tries to read Weight: “12D” is not a valid numeric value, so Weight stays missing and _ERROR_ is set to 1

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 1 Barbara 61 . .
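A data error like this can also be flagged from inside the step by testing _ERROR_ after the INPUT statement. This is a minimal sketch, assuming the records are supplied in-stream; _INFILE_ is the automatic variable that holds the current input record, and the message text is made up for illustration.

```sas
data ex1;
    input name $ 1-7 height 9-10 weight 12-14;
    /* _ERROR_ is 1 only while the offending record is being processed */
    if _error_ = 1 then put 'Check this record: ' _infile_;
    BMI = 700*weight/(height*height);
    datalines;
Barbara 61 12D
John    62 175
;
run;
```

SAS writes its own invalid-data note to the log as well; the extra PUT line only shows that _ERROR_ can be tested like any other variable.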

Page 25: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

The input pointer @ column 15

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 1 Barbara 61 . .

Page 26: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

BMI remains missing: an arithmetic operation on a missing value yields a missing value.

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 1 Barbara 61 . .
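The propagation of missing values can be checked in isolation. The following sketch is an illustrative addition (the variable names bmi and total are arbitrary); it contrasts an arithmetic expression, which returns missing when an operand is missing, with the SUM function, which ignores missing arguments.

```sas
data _null_;
    height = 61;
    weight = .;                          /* what the PDV holds after the failed read */
    bmi    = 700*weight/(height*height); /* arithmetic on a missing value -> missing */
    total  = sum(weight, 10);            /* SUM function skips the missing argument  */
    put bmi= total=;                     /* writes bmi=. total=10 to the log         */
run;
```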

Page 27: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

The OUTPUT statement is executed

Only values marked with (K) are copied as a single observation to the SAS dataset ex1

Name Height Weight BMI

1 Barbara 61 . .

Ex1:

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 1 Barbara 61 . .

Page 28: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

At the end of the DATA step, two things occur automatically:

Ex1:

Name Height Weight BMI

1 Barbara 61 . .

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 1 Barbara 61 . .

Page 29: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

1. The SAS system returns to the beginning of the DATA step

Ex1:

Name Height Weight BMI

1 Barbara 61 . .

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 1 Barbara 61 . .

Page 30: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

2. The values of the variables in the PDV are reset to missing; _N_ is incremented to 2 and _ERROR_ is reset to 0

Ex1:

Name Height Weight BMI

1 Barbara 61 . .

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 . . .

Page 31: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

The 2nd data line is read into the input buffer
The input pointer is at the beginning of the input buffer

Ex1:

Name Height Weight BMI

1 Barbara 61 . .

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 . . .

Page 32: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

The INPUT statement is executed

Ex1:

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 John 62 175 .

Name Height Weight BMI

1 Barbara 61 . .

Page 33: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

BMI is calculated

Ex1:

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 John 62 175 31.8678

Name Height Weight BMI

1 Barbara 61 . .

Page 34: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

The OUTPUT statement is executed

Ex1:

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 John 62 175 31.8678

Name Height Weight BMI

1 Barbara 61 . .

2 John 62 175 31.8678

Page 35: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

At the end of the DATA step, two things occur automatically:

Ex1:

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 John 62 175 31.8678

Name Height Weight BMI

1 Barbara 61 . .

2 John 62 175 31.8678

Page 36: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

Ex1:

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 John 62 175 31.8678

1. The SAS system returns to the beginning of the DATA step

Name Height Weight BMI

1 Barbara 61 . .

2 John 62 175 31.8678

Page 37: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

Ex1:

2. The values of the variables in the PDV are reset to missing; _N_ is incremented to 3

Name Height Weight BMI

1 Barbara 61 . .

2 John 62 175 31.8678

_N_ D _ERROR_D Name K Height K Weight K BMI K

3 0 . . .

Page 38: The essence of data step programming

EXECUTION PHASE

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

proc print data=ex1;run;

There are no more records to read
The SAS system goes on to the next DATA/PROC step

Page 39: The essence of data step programming

THE OUTPUT STATEMENT

data ex1;
    set example1;
    BMI = 700*weight/(height*height);
    output;
run;

The explicit OUTPUT statement:
  writes the current observation from the PDV to a SAS dataset immediately,
  not at the end of the DATA step

Page 40: The essence of data step programming

THE OUTPUT STATEMENT

data ex1;
    set example1;
    BMI = 700*weight/(height*height);
run;

The implicit OUTPUT statement:
  When the step contains no explicit OUTPUT statement, SAS writes the current observation to the dataset at the end of each iteration of the DATA step

Page 41: The essence of data step programming

THE OUTPUT STATEMENT

Using explicit OUTPUT will override the implicit OUTPUT

We can use more than one OUTPUT statement in the DATA step
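As an illustration of both points, the sketch below uses two explicit OUTPUT statements so that each input observation is written twice. It is not from the original slides: the dataset name ex1_long and the variables unit and value are hypothetical, and example1 is rebuilt in-stream with the values shown elsewhere in this deck.

```sas
data example1;
    input name $ height weight;
    datalines;
Barbara 61 170
John 62 175
;
run;

data ex1_long;
    set example1;
    unit  = 'lb';
    value = weight;
    output;              /* first observation for this person  */
    unit  = 'in';
    value = height;
    output;              /* second observation for this person */
    /* no implicit OUTPUT happens at the end of the step,      */
    /* because explicit OUTPUT statements are present          */
run;
```

Under these assumptions ex1_long should contain four observations, two for each observation in example1.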

Page 42: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

Raw data:
Barbara 61 12D
John    62 175

SAS dataset

When Reading a raw dataset …

Name Height Weight BMI

1 Barbara 61 . .

2 John 62 175 31.8678

Page 43: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

SAS dataset

When Reading a SAS dataset …

SAS dataset

Input dataset:Example1(after “set”)

Output dataset:Ex1(after “data”)

Name Height Weight

1 Barbara 61 .

2 John 62 175

Name Height Weight BMI

1 Barbara 61 . .

2 John 62 175 31.8678

Page 44: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

When reading a raw dataset, SAS sets each variable value in the PDV to missing at the beginning of each iteration of execution, except for …

  the automatic variables
  variables that are named in the RETAIN or SUM statement
  data elements in a _TEMPORARY_ array
  variables created in the options of the FILE/INFILE statement

Page 45: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV

1st Iteration: At the beginning of the execution phase, SAS sets each variable to missing in the PDV

When Reading a SAS dataset …

Example1: Name Height Weight

1 Barbara 61 170

2 John 62 175

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 0 . . .

Page 46: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV

1st Iteration:The SET statement is

executed

When Reading a SAS dataset …

Example1: Name Height Weight

1 Barbara 61 170

2 John 62 175

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 0 Barbara 61 170 .

Page 47: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV

1st Iteration:BMI is calculated

When Reading a SAS dataset …

Example1: Name Height Weight

1 Barbara 61 170

2 John 62 175

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 0 Barbara 61 170 31.9807

Page 48: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV

1st Iteration:Output statement is

executed

When Reading a SAS dataset …

Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175

Name Height Weight BMI

1 Barbara 61 170 31.9807

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 0 Barbara 61 170 31.9807

Page 49: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV

2nd Iteration:When Reading a SAS dataset …

Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175

Name Height Weight BMI

1 Barbara 61 170 31.9807

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 Barbara 61 170 .

Variables that exist in the input dataset:

SAS sets each variable to missing in the PDV only before the 1st iteration of the execution

These variables retain their values in the PDV until they are replaced by new values

Page 50: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV

2nd Iteration:When Reading a SAS dataset …

Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175

Name Height Weight BMI

1 Barbara 61 170 31.9807

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 Barbara 61 170 .

Variables created in the DATA step:

SAS sets each variable to missing in the PDV at the beginning of every iteration of the execution
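This difference can be observed by writing the PDV to the log before the SET statement runs. The sketch below is an illustrative addition: example1 is rebuilt in-stream with the values shown on these slides, and the label text is arbitrary.

```sas
data example1;
    input name $ height weight;
    datalines;
Barbara 61 170
John 62 175
;
run;

data ex1;
    /* Runs at the top of every iteration, before a new observation is read */
    put 'Before SET: ' _all_;
    set example1;
    BMI = 700*weight/(height*height);
run;
```

At _N_=2 the log line should still show Barbara's Name, Height and Weight (read from the input dataset, so retained) while BMI is already back to missing (created in the DATA step, so reset).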

Page 51: The essence of data step programming

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

data ex1; set example1; BMI = 700*weight/(height*height); output;run;

PDV

2nd Iteration:SET statement is executed

When Reading a SAS dataset …

Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175

Name Height Weight BMI

1 Barbara 61 170 31.9807

_N_ D _ERROR_D Name K Height K Weight K BMI K

2 0 John 62 175 .

Page 52: The essence of data step programming

THE RETAIN STATEMENT

Consider the following dataset:

   ID   SCORE
1  A01  3
2  A02  .
3  A03  4

We would like to create a new variable, TOTAL, that accumulates the values of SCORE:

   ID   SCORE  TOTAL
1  A01  3      3
2  A02  .      3
3  A03  4      7

Page 53: The essence of data step programming

THE RETAIN STATEMENT

ID SCORE

1 A01 3

2 A02 .

3 A03 4

Consider the following dataset:

How to do it?
  Set TOTAL to 0 at the first iteration of the execution
  Then, at each iteration of the execution, add the value of SCORE to TOTAL

Desired TOTAL: 3, 3, 7

Problem: TOTAL is a new variable that you want to create, so TOTAL will be set to missing in the PDV at the beginning of every iteration of the execution.
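The problem can be reproduced with a short sketch. It is an illustrative addition: ex2 is built in-stream from the three observations shown above, and the output dataset name ex2_wrong is hypothetical.

```sas
data ex2;
    input id $ score;
    datalines;
A01 3
A02 .
A03 4
;
run;

data ex2_wrong;
    set ex2;
    /* TOTAL is reset to missing at the top of every iteration, */
    /* so nothing carries over from the previous observation    */
    total = sum(total, score);
run;
```

Here TOTAL should come out as 3, ., 4 rather than the desired 3, 3, 7, because each iteration starts from a missing TOTAL.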

Page 54: The essence of data step programming

THE RETAIN STATEMENT

To fix this problem, we can use the RETAIN statement:

RETAIN VARIABLE <VALUE>;

Prevents the VARIABLE from being initialized each time the DATA step executes

Page 55: The essence of data step programming

THE RETAIN STATEMENT

To fix this problem, we can use the RETAIN statement:

RETAIN VARIABLE <VALUE>;

VARIABLE: name of the variable that we want to retain

VALUE: a numeric value used to initialize VARIABLE only at the first iteration of the DATA step execution

If no initial value is specified, VARIABLE is initialized as missing

Page 56: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2;
    set ex2;
    retain total 0;
    total = sum(total, score);
run;

PDV _N_ D _ERROR_D ID K Total K

The execution phase begins immediately after the completion of the compilation phase

ID SCORE

1 A01 3

2 A02 .

3 A03 4

Score K

Page 57: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

_N_ = 1, _ERROR_ = 0
ID and SCORE are missing
TOTAL = 0 because of the RETAIN statement

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:

_N_ D _ERROR_D ID K Total KScore K

1 0 . 0

Page 58: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

1st observation from ex2 → PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:

_N_ D _ERROR_D ID K Total KScore K

1 0 3 0 A01

Page 59: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

The RETAIN statement is a compile-time only statement

It does not execute during the execution phase

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:

_N_ D _ERROR_D ID K Total KScore K

1 0 3 0 A01

Page 60: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

TOTAL is calculated

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:

_N_ D _ERROR_D ID K Total KScore K

1 0 3 3 A01

Page 61: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

The implicit OUTPUT statement tells the SAS system to write observations to the dataset

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:ID SCORE TOTAL

1 A01 3 3

Ex2_2:

_N_ D _ERROR_D ID K Total KScore K

1 0 3 3 A01

Page 62: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

_N_ is incremented to 2

ID and SCORE are retained from the previous iteration because data are read from an existing SAS dataset

TOTAL is also retained because the RETAIN statement is used

ID SCORE

1 A01 3

2 A02 .

3 A03 4

2nd Iteration:ID SCORE TOTAL

1 A01 3 3

Ex2_2:

_N_ D _ERROR_D ID K Total KScore K

2 0 3 3 A01

Page 63: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

2nd observation from ex2 → PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

2nd Iteration:ID SCORE TOTAL

1 A01 3 3

Ex2_2:

_N_ D _ERROR_D ID K Total KScore K

2 0 . 3 A02

Page 64: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

TOTAL is calculated

ID SCORE

1 A01 3

2 A02 .

3 A03 4

2nd Iteration:ID SCORE TOTAL

1 A01 3 3

Ex2_2:

_N_ D _ERROR_D ID K Total KScore K

2 0 . 3 A02

Page 65: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

2nd Iteration:ID SCORE TOTAL

1 A01 3 3

2 A02 . 3

Ex2_2:

The implicit OUTPUT: the contents of the PDV → Ex2_2

_N_ D _ERROR_D ID K Total KScore K

2 0 . 3 A02

Page 66: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

3rd Iteration:ID SCORE TOTAL

1 A01 3 3

2 A02 . 3

Ex2_2:

_N_ is incremented to 3. ID and SCORE are retained from the previous iteration. TOTAL is also retained.

_N_ D _ERROR_D ID K Total KScore K

3 0 . 3 A02

Page 67: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

3rd Iteration:ID SCORE TOTAL

1 A01 3 3

2 A02 . 3

Ex2_2:

3rd observation from ex2 → PDV

_N_ D _ERROR_D ID K Total KScore K

3 0 4 3 A03

Page 68: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

3rd Iteration:ID SCORE TOTAL

1 A01 3 3

2 A02 . 3

Ex2_2:

TOTAL is calculated

_N_ D _ERROR_D ID K Total KScore K

3 0 4 7 A03

Page 69: The essence of data step programming

THE RETAIN STATEMENT

data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

3rd Iteration:ID SCORE TOTAL

1 A01 3 3

2 A02 . 3

3 A03 4 7

Ex2_2:

The implicit OUTPUT: the contents of the PDV → Ex2_2

_N_ D _ERROR_D ID K Total KScore K

3 0 4 7 A03
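The walkthrough relies on the SUM function ignoring the missing SCORE in the 2nd observation. The sketch below is an illustrative variation, not part of the original slides, that swaps the SUM function for the + operator to show what would happen otherwise; ex2 is rebuilt in-stream and the dataset name ex2_plus is hypothetical.

```sas
data ex2;
    input id $ score;
    datalines;
A01 3
A02 .
A03 4
;
run;

data ex2_plus;
    set ex2;
    retain total 0;
    /* With +, a missing SCORE makes TOTAL missing,    */
    /* and RETAIN then carries that missing value on   */
    total = total + score;
run;
```

In this variation TOTAL should be 3, ., . instead of 3, 3, 7, which is why the original program uses sum(total, score).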

Page 70: The essence of data step programming

THE SUM STATEMENT

The SUM statement has the following form:

VARIABLE + EXPRESSION;

VARIABLE: the numeric accumulator variable that is to be created
  It is automatically set to 0 at the beginning of the first iteration of the DATA step execution
  It is retained in the following iterations

EXPRESSION: any SAS expression
  If EXPRESSION evaluates to a missing value, it is treated as 0

Page 71: The essence of data step programming

THE SUM STATEMENT

The previous program:

data ex2_2;
    set ex2;
    retain total 0;
    total = sum(total, score);
run;

can be re-written as…

Page 72: The essence of data step programming

THE SUM STATEMENT

The previous program can be re-written as…

data ex2_2;
    set ex2;
    total + score;
run;
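For readers who want to run the SUM statement version end to end, here is a self-contained sketch. It assumes ex2 holds the three observations shown on the earlier slides, supplied in-stream; the PROC PRINT step is an added convenience.

```sas
data ex2;
    input id $ score;
    datalines;
A01 3
A02 .
A03 4
;
run;

data ex2_2;
    set ex2;
    /* SUM statement: TOTAL starts at 0, is retained,   */
    /* and a missing SCORE is treated as 0              */
    total + score;
run;

proc print data=ex2_2;
run;
```

TOTAL should print as 3, 3, 7, the same result as the RETAIN version.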

Page 73: The essence of data step programming

THE SUBSETTING IF STATEMENT

We use the subsetting IF statement to continue processing only the observations that meet the condition of the specified expression

IF EXPRESSION;

If EXPRESSION is true for the observation, SAS continues to execute statements in the DATA step and includes the current observation in the data set

Page 74: The essence of data step programming

THE SUBSETTING IF STATEMENT

Use the IF statement to continue processing only the observations that meet the condition of the specified expression

IF EXPRESSION;

If EXPRESSION is false for the observation, no further statements are processed for that observation. SAS immediately returns to the beginning of the DATA step: the remaining program statements in the DATA step are not executed, and the current observation is not written to the output data set
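A small sketch makes both cases visible. It is an illustrative addition: ex2 is rebuilt in-stream, and the dataset name ex2_kept and the counter kept are hypothetical.

```sas
data ex2;
    input id $ score;
    datalines;
A01 3
A02 .
A03 4
;
run;

data ex2_kept;
    set ex2;
    if score ne .;   /* subsetting IF: false for A02, so SAS returns to the top */
    kept + 1;        /* executed only for observations that pass the IF         */
run;
```

ex2_kept should contain only A01 and A03, with kept equal to 1 and 2, showing that the statement after the IF never ran for A02.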

Page 75: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

One observation per subject:

   ID   SCORE
1  A01  3
2  A02  .
3  A03  4

Multiple observations per subject (longitudinal data):

   ID   SCORE
1  A01  3
2  A01  4
3  A01  2
4  A02  4
5  A02  2

Identify the beginning/end of measurement for each subject

This can be accomplished by using the BY-group processing method

Page 76: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

SAS locates the beginning and end of a BY-group by creating two temporary indicator variables for each BY variable: FIRST.VARIABLE and LAST.VARIABLE

Suppose ID is the “BY” variable:

    ID   SCORE  FIRST.ID  LAST.ID
1   A01  3      1         0
2   A01  4      0         0
3   A01  2      0         1
4   A02  4      1         0
5   A02  2      0         1

FIRST.ID = 1: SAS reads the 1st observation for ID = A01

LAST.ID = 1: SAS reads the last observation for ID = A01
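FIRST.ID and LAST.ID are temporary and are not written to the output data set. One way to inspect them is to copy them into ordinary variables, as in this sketch. The names FIRST_ID and LAST_ID are hypothetical, and the DATALINES rows reproduce the five observations shown on the slide.

```sas
data ex3;
    input id $ score;
    datalines;
A01 3
A01 4
A01 2
A02 4
A02 2
;
run;

/* The input must already be sorted (or indexed) by ID for BY to work */
data flags;
    set ex3;
    by id;
    first_id = first.id;   /* copy the temporary indicators */
    last_id  = last.id;    /* into variables that are kept  */
run;

proc print data=flags;
run;
```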

Page 77: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

Calculating the total scores for each subject

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

Ex3_1:

    ID   TOTAL
1   A01  9
2   A02  6

proc sort data=ex3;
    by id;
run;

data ex3_1 (drop=score);
    set ex3;
    by id;
    if first.id = 1 then total = 0;
    total + score;
    if last.id = 1;
run;

Page 78: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

1st iteration:

_N_ is set to 1 and _ERROR_ to 0

FIRST.ID and LAST.ID are set to 1 (only at the beginning of the 1st iteration)

ID and SCORE are set to missing

TOTAL is set to 0 because of the SUM statement

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
1        0            1             1                    .          0

Page 79: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

1st iteration: The SET statement is executed. The 1st observation is copied to the PDV; FIRST.ID is set to 1 and LAST.ID to 0.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
1        0            1             0            A01     3          0

Page 80: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

1st iteration: FIRST.ID = 1, so TOTAL is set to 0

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
1        0            1             0            A01     3          0

Page 81: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

1st iteration: TOTAL is accumulated

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
1        0            1             0            A01     3          3

Page 82: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

1st iteration: The subsetting IF statement evaluates to FALSE because LAST.ID ≠ 1. SAS returns to the beginning of the DATA step to begin the 2nd iteration.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
1        0            1             0            A01     3          3

Page 83: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

2nd iteration: _N_ is incremented to 2. The values of the rest of the variables are retained.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
2        0            1             0            A01     3          3

Page 84: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

2nd iteration: The 2nd observation is copied to the PDV. It is not the first observation for A01, so FIRST.ID is set to 0. It is not the last observation for A01, so LAST.ID is set to 0.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
2        0            0             0            A01     4          3

Page 85: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

2nd iteration: FIRST.ID ≠ 1, so TOTAL is not reset

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
2        0            0             0            A01     4          3

Page 86: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

2nd iteration: TOTAL is accumulated

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
2        0            0             0            A01     4          7

Page 87: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

2nd iteration: The subsetting IF statement evaluates to FALSE because LAST.ID ≠ 1. SAS returns to the beginning of the DATA step to begin the 3rd iteration.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
2        0            0             0            A01     4          7

Page 88: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

3rd iteration: _N_ is incremented to 3. The values of the rest of the variables are retained.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
3        0            0             0            A01     4          7

Page 89: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

3rd iteration: The 3rd observation is copied to the PDV. It is not the first observation for A01, so FIRST.ID is set to 0. It is the last observation for A01, so LAST.ID is set to 1.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
3        0            0             1            A01     2          7

Page 90: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

3rd iteration: FIRST.ID ≠ 1, so TOTAL is not reset

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
3        0            0             1            A01     2          7

Page 91: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

3rd iteration: TOTAL is calculated

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
3        0            0             1            A01     2          9

Page 92: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

3rd iteration: The subsetting IF statement evaluates to TRUE

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
3        0            0             1            A01     2          9

Page 93: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

3rd iteration: SAS reaches the end of the 3rd iteration. The implicit OUTPUT executes. SAS returns to the beginning of the DATA step to begin the 4th iteration.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
3        0            0             1            A01     2          9

Ex3_1:

    ID   TOTAL
1   A01  9

Page 94: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

4th iteration: _N_ is incremented to 4. The values of the remaining variables are retained.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
4        0            0             1            A01     2          9

Ex3_1:

    ID   TOTAL
1   A01  9

Page 95: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

4th iteration: The 4th observation is copied to the PDV; FIRST.ID is set to 1 and LAST.ID to 0.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
4        0            1             0            A02     4          9

Ex3_1:

    ID   TOTAL
1   A01  9

Page 96: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

4th iteration: FIRST.ID = 1, so TOTAL is set to 0

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
4        0            1             0            A02     4          0

Ex3_1:

    ID   TOTAL
1   A01  9

Page 97: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

4th iteration: TOTAL is calculated

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
4        0            1             0            A02     4          4

Ex3_1:

    ID   TOTAL
1   A01  9

Page 98: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

4th iteration: The subsetting IF statement evaluates to FALSE. SAS returns to the beginning of the DATA step to begin the 5th iteration.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
4        0            1             0            A02     4          4

Ex3_1:

    ID   TOTAL
1   A01  9

Page 99: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

5th iteration: _N_ is incremented to 5. The values of the remaining variables are retained.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
5        0            1             0            A02     4          4

Ex3_1:

    ID   TOTAL
1   A01  9

Page 100: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

5th iteration: The 5th observation is copied to the PDV; FIRST.ID is set to 0 and LAST.ID to 1.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
5        0            0             1            A02     2          4

Ex3_1:

    ID   TOTAL
1   A01  9

Page 101: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

5th iteration: FIRST.ID ≠ 1, so TOTAL is not reset

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
5        0            0             1            A02     2          4

Ex3_1:

    ID   TOTAL
1   A01  9

Page 102: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

5th iteration: TOTAL is calculated

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
5        0            0             1            A02     2          6

Ex3_1:

    ID   TOTAL
1   A01  9

Page 103: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

5th iteration: The subsetting IF statement evaluates to TRUE

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
5        0            0             1            A02     2          6

Ex3_1:

    ID   TOTAL
1   A01  9

Page 104: The essence of data step programming

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

5th iteration: SAS reaches the end of the 5th iteration. The implicit OUTPUT executes.

ex3:

    ID   SCORE
1   A01  3
2   A01  4
3   A01  2
4   A02  4
5   A02  2

PDV:

_N_ (D)  _ERROR_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  SCORE (D)  TOTAL (K)
5        0            0             1            A02     2          6

Ex3_1:

    ID   TOTAL
1   A01  9
2   A02  6
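The walk-through above can be reproduced in the SAS log. The sketch below is the same DATA step with one added PUT _ALL_ statement, which writes every PDV variable (including _N_, _ERROR_, FIRST.ID, and LAST.ID) to the log on each iteration. The DATALINES rows reproduce the five observations from the slides; the placement of the PUT statement is a choice made for this illustration.

```sas
data ex3;
    input id $ score;
    datalines;
A01 3
A01 4
A01 2
A02 4
A02 2
;
run;

data ex3_1 (drop=score);
    set ex3;
    by id;
    if first.id = 1 then total = 0;
    total + score;
    put _all_;          /* trace: dump the PDV to the log */
    if last.id = 1;     /* only the last row per ID is output */
run;
```

Placing PUT _ALL_ before the subsetting IF means all five iterations are logged, not only the two that reach the implicit OUTPUT.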

Page 105: The essence of data step programming

RESTRUCTURING DATASETS

Restructuring datasets:

Data with one observation per subject (the wide format):

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

Data with multiple observations per subject (the long format):

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

Page 106: The essence of data step programming

RESTRUCTURING DATASETS

Restructuring datasets:

Data with one observation per subject (the wide format):

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

Data with multiple observations per subject (the long format):

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

S1 – S3 in the wide format correspond to SCORE in the long format

TIME distinguishes the different measurements for each subject

Page 107: The essence of data step programming

RESTRUCTURING DATASETS

The transformation can be easily done by using ARRAY/PROC TRANSPOSE (See my paper “The Many Ways to Effectively Utilize Array Processing”, paper 244-2011)

For simpler cases, this can also be accomplished without advanced techniques

Here is a solution that uses multiple OUTPUT statements in one DATA step

Page 108: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

Transform wide to long

2 observations to read: 2 DATA step iterations

Use multiple OUTPUT statements

Any missing values in S1 – S3 will not be output to long

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;
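The three repeated TIME/SCORE/OUTPUT groups can also be written as a loop over an array, which is the ARRAY approach mentioned earlier. This is a sketch only: the data set name LONG_ARRAY and the array name S are hypothetical, and the DATALINES rows reproduce the wide table from the slide.

```sas
data wide;
    input id $ s1 s2 s3;
    datalines;
A01 3 4 5
A02 4 . 2
;
run;

data long_array (drop=s1-s3);
    set wide;
    array s[3] s1-s3;            /* s[1]=s1, s[2]=s2, s[3]=s3 */
    do time = 1 to 3;
        score = s[time];
        if not missing(score) then output;   /* skip missing scores */
    end;
run;

proc print data=long_array;
run;
```

Each pass through the DO loop plays the role of one TIME/SCORE/OUTPUT group in the original program, so the result should match the long data set shown above.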

Page 109: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: _N_ is set to 1; the other variables are set to missing

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
        .       .       .       .         .          1

Page 110: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: The 1st observation from wide is copied to the PDV

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       .         .          1

Page 111: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: TIME is set to 1

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       1         .          1

Page 112: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: SCORE is assigned the value of S1 (3)

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       1         3          1

Page 113: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: SCORE is not missing, so ID, TIME, and SCORE are written to long

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       1         3          1

Long:

    ID   TIME  SCORE
1   A01  1     3

Page 114: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: TIME is set to 2

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       2         3          1

Long:

    ID   TIME  SCORE
1   A01  1     3

Page 115: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: SCORE is assigned the value of S2 (4)

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       2         4          1

Long:

    ID   TIME  SCORE
1   A01  1     3

Page 116: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: SCORE is not missing, so ID, TIME, and SCORE are written to long

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       2         4          1

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4

Page 117: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: TIME is set to 3

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       3         4          1

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4

Page 118: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: SCORE is assigned the value of S3 (5)

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       3         5          1

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4

Page 119: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: SCORE is not missing, so ID, TIME, and SCORE are written to long

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       3         5          1

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5

Page 120: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

1st iteration: There is no implicit OUTPUT at the end of the DATA step because explicit OUTPUT statements are used. SAS returns to the beginning of the DATA step to begin the 2nd iteration.

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       3         5          1

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5

Page 121: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: _N_ is incremented to 2. ID and S1-S3 are retained from the previous iteration. TIME and SCORE are set to missing.

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A01     3       4       5       .         .          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5

Page 122: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: The 2nd observation from wide is copied to the PDV

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       .         .          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5

Page 123: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: TIME is set to 1

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       1         .          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5

Page 124: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: SCORE is assigned the value of S1 (4)

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       1         4          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5

Page 125: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: ID, TIME, and SCORE are written to long

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       1         4          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4

Page 126: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: TIME is set to 2

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       2         4          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4

Page 127: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: SCORE is assigned the value of S2 (missing)

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       2         .          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4

Page 128: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: SCORE is missing, so no output is generated

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       2         .          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4

Page 129: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: TIME is set to 3

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       3         .          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4

Page 130: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: SCORE is assigned the value of S3 (2)

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       3         2          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4

Page 131: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: ID, TIME, and SCORE are written to long

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       3         2          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

Page 132: The essence of data step programming

FROM WIDE FORMAT TO LONG FORMAT

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

data long (drop=s1-s3);
    set wide;
    time = 1; score = s1; if not missing(score) then output;
    time = 2; score = s2; if not missing(score) then output;
    time = 3; score = s3; if not missing(score) then output;
run;

2nd iteration: SAS returns to the beginning of the DATA step to begin the 3rd iteration. With no more observations to read in the 3rd iteration, SAS goes to the next DATA or PROC step.

PDV:

ID (K)  S1 (D)  S2 (D)  S3 (D)  TIME (K)  SCORE (K)  _N_ (D)
A02     4       .       2       3         2          2

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

Page 133: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

Page 134: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

Reading 5 observations but only creating 2 observations

You are not copying data from the PDV to the final dataset at each iteration

You only need to generate one observation once all the observations for each subject have been processed

Page 135: The essence of data step programming

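FROM LONG FORMAT TO WIDE FORMAT

Long (with the wide variable that receives each SCORE):

    ID   TIME  SCORE  Assigned to
1   A01  1     3      S1
2   A01  2     4      S2
3   A01  3     5      S3
4   A02  1     4      S1
5   A02  3     2      S3

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

SCORE is assigned to S1, S2, or S3 according to TIME:

if time = 1 then s1 = score;
else if time = 2 then s2 = score;
else s3 = score;

S1, S2, and S3 need to be retained across iterations: RETAIN

Use BY-group processing: BY ID

Output to the final data set when LAST.ID = 1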

Page 136: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

Wide:

    ID   S1  S2  S3
1   A01  3   4   5
2   A02  4   .   2

proc sort data=long;
    by id;
run;

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;
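For comparison, the same long-to-wide reshaping can be sketched with PROC TRANSPOSE, the procedure mentioned earlier as an alternative to DATA step programming. This is not the approach walked through on these slides. The output name WIDE_T is hypothetical, and the DATALINES rows reproduce the long table shown above.

```sas
data long;
    input id $ time score;
    datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

/* One output row per ID; the TIME value, prefixed with "s",
   names each new column (s1, s2, s3) */
proc transpose data=long out=wide_t (drop=_name_) prefix=s;
    by id;        /* input must be sorted by ID */
    id time;
    var score;
run;

proc print data=wide_t;
run;
```

Because A02 has no row with TIME = 2, its S2 column is left missing by the procedure.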

Page 137: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

1st iteration: _N_ is set to 1; FIRST.ID and LAST.ID are set to 1; the other variables are set to missing

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
1        1             1                    .         .          .       .       .

Page 138: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

1st iteration: The SET statement copies the 1st observation to the PDV

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
1        1             1            A01     1         3          .       .       .

Page 139: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

1st iteration: The SET statement copies the 1st observation to the PDV. FIRST.ID is set to 1 since this is the 1st observation for A01. LAST.ID is set to 0 since this is not the last observation for A01.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
1        1             0            A01     1         3          .       .       .

Page 140: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

1st iteration: Since TIME = 1, S1 is assigned the value of SCORE (3)

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
1        1             0            A01     1         3          3       .       .

Page 141: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

1st iteration: The subsetting IF statement evaluates to FALSE. SAS returns to the beginning of the DATA step to begin the 2nd iteration.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
1        1             0            A01     1         3          3       .       .

Page 142: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

2nd iteration: _N_ is incremented to 2

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
2        1             0            A01     1         3          3       .       .

Page 143: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

2nd iteration:

FIRST.ID and LAST.ID are retained; they are automatic variables

ID, TIME, and SCORE are retained; they are from the input dataset

S1, S2, and S3 are retained because of the RETAIN statement

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
2        1             0            A01     1         3          3       .       .

Page 144: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

2nd iteration: The SET statement copies the 2nd observation to the PDV

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
2        1             0            A01     2         4          3       .       .

Page 145: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

2nd iteration: The SET statement copies the 2nd observation to the PDV. FIRST.ID is set to 0; this is not the first observation for A01. LAST.ID is set to 0; this is not the last observation for A01 either.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
2        0             0            A01     2         4          3       .       .

Page 146: The essence of data step programming

FROM LONG FORMAT TO WIDE FORMAT

Long:

    ID   TIME  SCORE
1   A01  1     3
2   A01  2     4
3   A01  3     5
4   A02  1     4
5   A02  3     2

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;

2nd iteration: Since TIME = 2, S2 is assigned the value of SCORE (4)

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
2        0             0            A01     2         4          3       4       .

Page 147: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

2nd iteration: The subsetting IF statement evaluates to false, so no observation is written. SAS returns to the beginning of the DATA step to begin the 3rd iteration.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

2 0 0 A01 2 4 3 4 .

FROM LONG FORMAT TO WIDE FORMAT
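The PDV states traced on these slides can also be observed directly in the SAS log. The following is a minimal sketch, assuming the LONG dataset from the slides already exists and is sorted by ID; the only addition to the slide code is a PUT _ALL_ statement, which writes every variable in the PDV, including automatic variables such as _N_, _ERROR_, FIRST.ID, and LAST.ID, to the log once per iteration. The order of variables in the log may differ from the layout used on the slides.

```sas
/* Sketch: trace the PDV in the log, one line per iteration.      */
/* Assumes WORK.LONG (ID, TIME, SCORE) exists and is sorted by ID. */
data wide (drop=time score);
   set long;
   by id;
   retain s1 - s3;
   if time = 1 then s1 = score;
   else if time = 2 then s2 = score;
   else s3 = score;
   /* Write all PDV variables to the log before the subsetting IF, */
   /* so that iterations that are not output are still visible.    */
   put _all_;
   if last.id;
run;
```

Placing the PUT statement before the subsetting IF is a deliberate choice: statements after a false subsetting IF are not executed, so a PUT placed after it would only show the iterations that reach the implicit OUTPUT.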

Page 148: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration: _N_ is incremented to 3. The rest of the variables are retained.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 0 A01 2 4 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

Page 149: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration: The SET statement copies the 3rd observation to the PDV.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 0 A01 3 5 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

Page 150: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration: The SET statement copies the 3rd observation to the PDV. FIRST.ID = 0; this is not the first observation for A01. LAST.ID = 1; this is the last observation for A01.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

Page 151: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration: Since TIME = 3, S3 is set to the value of SCORE (5).

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 152: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration: The subsetting IF statement evaluates to true.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 153: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration: The implicit OUTPUT executes; variables marked with (K) are copied to the dataset WIDE. SAS returns to the beginning of the DATA step to begin the 4th iteration.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 154: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration: _N_ is incremented to 4. The rest of the variables are retained.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 155: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration: The SET statement copies the 4th observation to the PDV.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 0 1 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 156: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration: The SET statement copies the 4th observation to the PDV. FIRST.ID = 1; this is the first observation for A02. LAST.ID = 0; this is not the last observation for A02.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 157: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration: Since TIME = 1, S1 is set to the value of SCORE (4).

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 158: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration: The subsetting IF statement evaluates to false, so no observation is written. SAS returns to the beginning of the DATA step to begin the 5th iteration.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 159: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration: _N_ is incremented to 5. The rest of the variables are retained.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 160: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration: The SET statement copies the 5th observation to the PDV.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 1 0 A02 3 2 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 161: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration: The SET statement copies the 5th observation to the PDV. FIRST.ID = 0; this is not the first observation for A02. LAST.ID = 1; this is the last observation for A02.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 162: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration: Since TIME = 3, S3 is set to the value of SCORE (2).

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 4 2

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 163: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration: The subsetting IF statement evaluates to true.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 4 2

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 164: The essence of data step programming

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration: The implicit OUTPUT executes.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 4 2

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 4 2

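How to fix this? A02 has no observation with TIME = 2, so S2 should be missing for A02, but it holds the value 4 retained from A01.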

FROM LONG FORMAT TO WIDE FORMAT

Page 165: The essence of data step programming

data wide (drop=time score);
   set long;
   by id;
   retain s1 - s3;
   if first.id then do;
      s1 = .;
      s2 = .;
      s3 = .;
   end;
   if time = 1 then s1 = score;
   else if time = 2 then s2 = score;
   else s3 = score;
   if last.id;
run;

FROM LONG FORMAT TO WIDE FORMAT
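The reset at the start of each BY group can be written more compactly. The following is a sketch of one alternative, not the form used on the slides: it declares an array over S1 through S3 (the array name `s` is chosen here for illustration) and uses the CALL MISSING routine to set all elements to missing when FIRST.ID is 1. It assumes the LONG dataset from the slides exists and is sorted by ID.

```sas
/* Sketch: same fix as on the slide, with the three assignments  */
/* replaced by one CALL MISSING over an array (alternative form). */
data wide (drop=time score);
   set long;
   by id;
   array s{3} s1 - s3;      /* hypothetical array name, numeric elements */
   retain s1 - s3;
   /* Reset the retained values at the first observation of each ID */
   if first.id then call missing(of s{*});
   if time = 1 then s1 = score;
   else if time = 2 then s2 = score;
   else s3 = score;
   if last.id;
run;
```

This form scales more easily if the number of time points grows, since only the array bounds and the variable list need to change.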

Page 166: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration: _N_ is incremented to 4. The rest of the variables are retained.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 167: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration: The SET statement copies the 4th observation to the PDV.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 0 1 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 168: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration: The SET statement copies the 4th observation to the PDV. FIRST.ID = 1; this is the first observation for A02. LAST.ID = 0; this is not the last observation for A02.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 169: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration: Since FIRST.ID = 1, S1 through S3 are reset to missing.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 . . .

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 170: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration: Since TIME = 1, S1 is set to the value of SCORE (4).

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 4 . .

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 171: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration: The subsetting IF statement evaluates to false, so no observation is written. SAS returns to the beginning of the DATA step to begin the 5th iteration.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 4 . .

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 172: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration: _N_ is incremented to 5. The rest of the variables are retained.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 1 0 A02 1 4 4 . .

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 173: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration: The SET statement copies the 5th observation to the PDV.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 1 0 A02 3 2 4 . .

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 174: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration: The SET statement copies the 5th observation to the PDV. FIRST.ID = 0; this is not the first observation for A02. LAST.ID = 1; this is the last observation for A02.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 . .

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 175: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration: Since FIRST.ID is not 1, the DO group is not executed.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 . .

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 176: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration: Since TIME = 3, S3 is set to the value of SCORE (2).

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 . 2

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 177: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration: The subsetting IF statement evaluates to true.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 . 2

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

Page 178: The essence of data step programming

data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration: SAS reaches the end of the 5th iteration. The implicit OUTPUT executes.

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 . 2

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

FROM LONG FORMAT TO WIDE FORMAT
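The corrected walkthrough can be reproduced end to end. The sketch below recreates the five-observation LONG dataset shown on the slides with a DATALINES block (the INPUT layout is an assumption, since the slides do not show how LONG was created), runs the corrected DATA step, and prints the result. The printed dataset should match the final slide: A01 with S1-S3 of 3, 4, 5, and A02 with 4, missing, 2.

```sas
/* Recreate the LONG dataset from the slides (already sorted by ID). */
data long;
   input id $ time score;
   datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

/* Corrected long-to-wide step, as shown on the slides. */
data wide (drop=time score);
   set long;
   by id;
   retain s1 - s3;
   /* Reset the retained variables at the start of each BY group */
   if first.id then do;
      s1 = .;
      s2 = .;
      s3 = .;
   end;
   if time = 1 then s1 = score;
   else if time = 2 then s2 = score;
   else s3 = score;
   /* Keep only the last observation of each ID */
   if last.id;
run;

proc print data=wide;
run;
```

If the input were not already ordered by ID, a PROC SORT with BY ID would be needed before the second DATA step, because the BY statement expects sorted (or indexed) input.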

Page 179: The essence of data step programming

CONCLUSION

The most important part of DATA step processing is to understand how data is transferred to the PDV and how data is copied from the PDV to a new dataset.

To be a successful SAS programmer, we must be able to thoroughly comprehend how DATA steps are processed

Page 180: The essence of data step programming

REFERENCES

Cody, Ron. 2001. Longitudinal Data and SAS®: A Programmer’s Guide. Cary, NC: SAS Institute Inc.

Page 181: The essence of data step programming

ACKNOWLEDGEMENT

I would like to thank MaryAnne DePesquo for inviting me to present at SGF 2011.

Page 182: The essence of data step programming

CONTACT INFORMATION

Arthur X. Li

City of Hope Comprehensive Cancer Center

Division of Information Science

1500 East Duarte Road

Duarte, CA 91010-3000

Work Phone: (626) 256-4673 ext. 65121

Fax: (626) 471-7106

E-mail: [email protected]