the essence of data step programming

The Essence of DATA Step Programming

Arthur LiCity of Hope Comprehensive Cancer Center

Department of Information Science

INTRODUCTION

SAS programming

DATA step programming

Understanding how SAS processes the data during the compilation and execution phases

Fundamental:

Essence:

A COMMON BEFUDDLEMENT

The newly-created SAS dataset is not what we intended there are more or less observationsthe value of the variable was not retained

correctly

Reason:Learning only SAS language syntaxNot understanding the fundamental SAS

programming concepts

INTRODUCTION

We will cover…what happens “behind the scenes” while creating a

SAS dataset Learn how a new dataset is created

one observation at a time a raw text file/SAS dataset PDVSAS dataset

The SUM and RETAIN statements BY-group processing Transposing dataset examples

DATA STEP PROCESSING OVERVIEW

Compilation phase:Each statement is scanned for syntax errors.

Execution phase:The DATA step reads and processes the input data.

If there is no syntax error

A DATA step is processed in two-phase sequences:

DATA STEP PROCESSING OVERVIEW

Variable names Columns

Name 1-7

Height 9-10

Weight 12-14

Program1:

data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;

Data Entry Error

The column input method:Each variable is occupied in a fixed fieldThe values are standard character or numerical values

Creating a new variable: BMI

Barbara 61 12DJohn 62 175

Example1.txt12345678901234567890

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

Used to hold raw dataWill not be created when reading a SAS dataset

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

PDV is created

Memory area where SAS builds its new data set, 1 observation at a time.

_N_ D _ERROR_D

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

PDV is created

Automatic variables:_N_ = 1: 1st observation is being processed_N_ = 2: 2nd observation is being processed

_N_ D _ERROR_D

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

PDV is created

Automatic variables:_ERROR_ = 1: signals the data error of the currently-processed observation

_N_ D _ERROR_D

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

A space is added to the PDV for each variable

_N_ D _ERROR_D Height KName K Weight K

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV

BMI is added to the PDV

_N_ D _ERROR_D Height KName K Weight K BMI K

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV _N_ D _ERROR_D Height KName K Weight K BMI K

D = dropped

K = kept

COMPILATION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV _N_ D _ERROR_D Height KName K Weight K BMI K

Checks for syntax errorsinvalid variable names invalid optionsincorrect punctuationsmisspelled keywords

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV_N_ D _ERROR_D Name K Height K Weight K BMI K

The DATA step works like a loopIt repetitively executes statements

reads data values creates observations one at a time

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer


1st Iteration:At the beginning

1 0

_N_ 1, _ERROR_ 0The remaining variables are set to missing

. . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer



Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1st data line input bufferThe input pointer @ the beginning of the input buffer

The INFILE statement identifies the location of Exampl1.txt

1 0 . . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer



Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

The INPUT statement reads data values: input buffer PDV

. . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer



Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

input buffer (columns 1-7) “Name” in the PDV

Barbara . . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer



Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

The input pointer @ column 8

Barbara . . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer



Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0 . .

input buffer (columns 9-10) “Height” in the PDV

Barbara 61

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer



Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0


Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer



Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

1 0

Tries to read Weight – invalid value

Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

Tries to read Weight – invalid value _ERROR_ 1

_N_ D _ERROR_D Name K Height K Weight K BMI K

1 1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D



1 1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

BMI will remain missing: operations on a missing value a missing value.


1 1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

The OUTPUT statement is executed

Only values marked with (K) are copied as a single observation to the SAS dataset ex1

Name Height Weight BMI

1 Barbara 61 . .

Ex1:


1 1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

1st Iteration:

B a r b a r a 6 1 1 2 D

At the end of the DATA step, two things occur automatically:

Ex1:


1 Barbara 61 . .


1 1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

1. The SAS system returns to the beginning of the DATA step

Ex1:


1 Barbara 61 . .


1 1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

2. The values of the variables in the PDV are reset to missing _N_ ↑ 2

_ERROR_ 0

Ex1:


1 Barbara 61 . .


2 0 . . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

2nd data line input buffer The input pointer @

beginning of the input buffer

Ex1:


1 Barbara 61 . .


2 0 . . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

The INPUT statement is executed

Ex1:


2 0 .62John 175


1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

BMI is calculated

Ex1:


2 0 31.867862John 175


1 Barbara 61 . .

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

The OUTPUT statement is executed

Ex1:


2 0 31.867862John 175


1 Barbara 61 . .

2 John 62 175 31.8678

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

2nd Iteration:

J o h n 6 2 1 7 5

At the end of the DATA step, two things occur automatically:

Ex1:


2 0 31.867862John 175


1 Barbara 61 . .

2 John 62 175 31.8678

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

Ex1:


2 0 31.867862John 175

1. The SAS system returns to the beginning of the DATA step


1 Barbara 61 . .

2 John 62 175 31.8678

EXECUTION PHASE


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer

PDV


Example1.txt12345678901234567890

Ex1:

2. The values of the variables in the PDV are reset to missing _N_ ↑3


1 Barbara 61 . .

2 John 62 175 31.8678


3 0 . . .

EXECUTION PHASE


proc print data=ex1;run;

There are no more records to readThe SAS system next DATA/PROC step

THE OUTPUT STATEMENT

data ex1; set example1; BMI = 700*weight/(height*height);

run;

The explicit OUTPUT statement:

write the current observation from the PDV to a SAS dataset immediately

not at the end of the DATA step

output;


data ex1; set example1; BMI = 700*weight/(height*height);

run;

It tells SAS to write observations to the dataset at the end of the DATA step

The implicit OUTPUT statement:


Using explicit OUTPUT will override the implicit OUTPUT

We can use more than one OUTPUT statement in the DATA step

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

…Input buffer


Barbara 61 12DJohn 62 175Raw data

SAS dataset

When Reading a raw dataset …


1 Barbara 61 . .

2 John 62 175 31.8678


data ex1; set example1; BMI = 700*weight/(height*height); output;run;


SAS dataset

When Reading a SAS dataset …

SAS dataset

Input dataset:Example1(after “set”)

Output dataset:Ex1(after “data”)

Name Height Weight

1 Barbara 61 .

2 John 62 175


1 Barbara 61 . .

2 John 62 175 31.8678


When reading a raw dataset, SAS sets each variable value in the PDV to missing at the beginning of each iteration of execution, except for …

the automatic variablesvariables that are named in the RETAIN or SUM statementdata elements in a _TEMPORARY_ arrayvariables created in the options of the FILE/INFILE statement



PDV

1st Iteration:At the beginning of the

execution phase, SAS sets each variable to missing in the PDV


Example1: Name Height Weight

1 Barbara 61 170

2 John 62 175


1 0 . . .



PDV

1st Iteration:The SET statement is

executed



1 Barbara 61 170

2 John 62 175


1 0 .Barbara 170 61



PDV

1st Iteration:BMI is calculated



1 Barbara 61 170

2 John 62 175


1 0 Barbara 31.9807170 61



PDV

1st Iteration:Output statement is

executed


Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175


1 Barbara 61 170 31.9807


1 0 Barbara 31.9807170 61



PDV

2nd Iteration:When Reading a SAS dataset …

Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175


1 Barbara 61 170 31.9807


2 0 Barbara .170 61

Variables exist in the input dataset

SAS sets each variable to missing in the PDV only before the 1st iteration of the execution

Variables will retain their values in the PDV until they are replaced by the new values



PDV

2nd Iteration:When Reading a SAS dataset …

Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175


1 Barbara 61 170 31.9807


2 0 Barbara .170 61

Variables being created in the DATA step

SAS sets each variable to missing in the PDV at the beginning of every iteration of the execution



PDV

2nd Iteration:SET statement is executed


Example1:

Ex1:

Name Height Weight

1 Barbara 61 170

2 John 62 175


1 Barbara 61 170 31.9807


2 0 John .175 62

THE RETAIN STATEMENT

ID SCORE

1 A01 3

2 A02 .

3 A03 4

Consider the following dataset:

We would like to create a new variable that accumulates the values of SCORE

TOTAL

3

3

7


ID SCORE

1 A01 3

2 A02 .

3 A03 4

Consider the following dataset:

How to do it?Set the TOTAL to 0 at the first iteration of the

execution Then at each iteration of the execution, add

values from SCORE to TOTAL

TOTAL

3

3

7

Problem: TOTAL is a new variable that you want to create TOTAL will be set to missing in the PDV at the beginning of every iteration of the execution.


To fix this problem, we can use the RETAIN statement:

RETAIN VARIABLE <VALUE>;

Prevents the VARIABLE from being initialized each time the DATA step executes


To fix this problem, we can use the RETAIN statement:

RETAIN VARIABLE <VALUE>;

Name of the variable that we will want to retain

A numeric valueUsed to initialize the VARIABLE

only at the first iteration of the DATA step execution

Not specifying an initial value VARIABLE is initialized as missing


data ex2_2; set ex2; retain total 0; total = sum(total, score);run;

PDV _N_ D _ERROR_D ID K Total K

The execution phase begins immediately after the completion of the compilation phase

ID SCORE

1 A01 3

2 A02 .

3 A03 4

Score K



PDV

_N_ 1, _ERROR_ 0ID, SCORE missingTOTAL 0 because of the RETAIN

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:

_N_ D _ERROR_D ID K Total KScore K

1 0 . 0



PDV

1st observation from ex2 PDV.

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:


1 0 3 0 A01



PDV

The RETAIN statement is a compile-time only statement

It does not execute during the execution phase

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:


1 0 3 0 A01



PDV

TOTAL is calculated

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:


1 0 3 3 A01



PDV

The implicit OUTPUT statement tells the SAS system to write observations to the dataset

ID SCORE

1 A01 3

2 A02 .

3 A03 4

1st Iteration:ID SCORE TOTAL

1 A01 3 3

Ex2_2:


1 0 3 3 A01



PDV

_N_ ↑2 ID and SCORE are retained from the

previous iteration because data are read from an existing SAS dataset

TOTAL is also retained because the RETAIN statement is used

ID SCORE

1 A01 3

2 A02 .

3 A03 4

2nd Iteration:ID SCORE TOTAL

1 A01 3 3

Ex2_2:


2 0 3 3 A01



PDV

2nd observation from ex2 PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4


1 A01 3 3

Ex2_2:


2 0 . 3 A02



PDV

TOTAL is calculated

ID SCORE

1 A01 3

2 A02 .

3 A03 4


1 A01 3 3

Ex2_2:


2 0 . 3 A02



PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4


1 A01 3 3

2 A02 . 3

Ex2_2:

The implicit OUTPUT: The contents in PDV Ex2_2


2 0 . 3 A02



PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4

3rd Iteration:ID SCORE TOTAL

1 A01 3 3

2 A02 . 3

Ex2_2:

_N_ ↑ 3. ID and SCORE are retained

from the previous iteration. TOTAL is also retained.


3 0 . 3 A02



PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4


1 A01 3 3

2 A02 . 3

Ex2_2:

3rd observation from ex2 PDV


3 0 4 3 A03



PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4


1 A01 3 3

2 A02 . 3

Ex2_2:

TOTAL is calculated


3 0 4 7 A03



PDV

ID SCORE

1 A01 3

2 A02 .

3 A03 4


1 A01 3 3

2 A02 . 3

3 A03 4 7

Ex2_2:

The implicit OUTPUT:The contents in PDV Ex2_2


3 0 4 7 A03

THE SUM STATEMENT

The SUM statement has the following form:

VARIABLE + EXPRESSION;

The numeric accumulator variable that is to be created

It is automatically set to 0 at the beginning of the first iteration of the DATA step execution

Retained in following iterations

Any SAS expression If EXPRESSION is evaluated

to a missing value, it is treated as 0

THE SUM STATEMENT

data ex2_2; set ex2;

run;

retain total 0;total = sum(total, score);

The previous program can be re-written as…

THE SUM STATEMENT

data ex2_2; set ex2;

run;

The previous program can be re-written as…

total + score;

THE SUBSETTING IF STATEMENT

We use the subsetting IF statement to continue processing only the observations that meet the condition of the specified expression

IF EXPRESSION;

If EXPRESSION is true for the observation, SAS continues to execute statements in the DATA step includes the current observation in the data set

THE SUBSETTING IF STATEMENT

Use the IF statement to continue processing only the observations that meet the condition of the specified expression

IF EXPRESSION;

If EXPRESSION is false for the observation, no further statements are processed for that obs.SAS immediately returns to the beginning of DATA step the remaining program statements in the DATA step are

not executed and the current observation is not written to the output data set

THE BY-GROUP PROCESSING IN THE DATA STEP

ID SCORE

1 A01 3

2 A02 .

3 A03 4

One observation per subject

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

Multiple observations per subject-- Longitudinal data

Identify the beginning/end of measurement for each subject

This can be accomplished by using the BY-group processing method


ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

SAS locates the beginning and end of a BY-group by creating two temporary indicator variables for each BY variable: FIRST.VARIABLELAST.VARIABLE

Suppose ID is the “BY” variable:

FIRST.ID

1

0

0

1

0

LAST.ID

0

0

1

0

1

SAS reads the 1st observation for ID = A01

SAS reads the last observation for ID = A01


ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

Calculating the total scores for each subject

ID TOTAL

1 A01 9

2 A02 6

proc sort data=ex3; by id;run;data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;


data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

1st iteration:_N_ 1, _ERROR_ 0 FIRST.ID 1, LAST.ID 1 only at beginning of 1st iterationID, Score missingTOTAL 0 because of the SUM statement

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

1 0 1 1 . 0



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

The SET statement is executed1st observation PDVFIRST.ID 1 and LAST.ID 0


1 0 1 0 A01 3 0

1st iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID = 1: TOTAL 0


1 0 1 0 A01 3 0

1st iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is accumulated


1 0 1 0 A01 3 3

1st iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

The subsetting IF statement is evaluated to be FALSE because LAST.ID ≠ 1


1 0 1 0 A01 3 3

SAS returns to the beginning of the DATA step to begin the 2nd iteration

1st iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑ 2The values for the rest of the variables are retained


2 0 1 0 A01 3 3

2nd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

2nd observation PDVNot the first observation for A01: FIRST.ID 0Not the last observation for A01: LAST.ID 0


2 0 0 0 A01 4 3

2nd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID ≠ 1: no execution


2 0 0 0 A01 4 3

2nd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is accumulated


2 0 0 0 A01 4 7

2nd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2


2 0 0 0 A01 4 7

The subsetting IF statement is evaluated to be FALSE because LAST.ID ≠ 1

SAS returns to the beginning of the DATA step to begin the 3rd iteration

2nd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑3The values for the rest of the variables are retained


3 0 0 0 A01 4 7

3rd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

3rd observation PDVNot the first observation: FIRST.ID 0 Last observation for A01: LAST.ID 1


3 0 0 1 A01 2 7

3rd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID ≠ 1: no execution


3 0 0 1 A01 2 7

3rd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is calculated


3 0 0 1 A01 2 9

3rd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2


3 0 0 1 A01 2 9

The subsetting IF statement is evaluated to be TRUE

3rd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

SAS reaches the end of the 3rd iterationThe implicit OUTPUT executesSAS returns to the beginning of the

DATA step to begin the 3rd iteration

ID TOTAL

1 A01 9

Ex3_1:


3 0 0 1 A01 2 9

3rd iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑ 4The values for the remaining

variables are retained

ID TOTAL

1 A01 9

Ex3_1:


4 0 0 1 A01 2 9

4th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

4th observation PDVFIRST.ID 1LAST.ID 0

ID TOTAL

1 A01 9

Ex3_1:


4 0 1 0 A02 4 9

4th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID = 1: TOTAL 0ID TOTAL

1 A01 9

Ex3_1:


4 0 1 0 A02 4 0

4th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is calculatedID TOTAL

1 A01 9

Ex3_1:


4 0 1 0 A02 4 4

4th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

ID TOTAL

1 A01 9

Ex3_1:


4 0 1 0 A02 4 4

The subsetting IF statement is evaluated to be FALSE

SAS returns to the beginning of the DATA step to begin the 5th iteration

4th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑ 5The values for the remaining

variables are retained

ID TOTAL

1 A01 9

Ex3_1:


5 0 1 0 A02 4 4

5th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

5th observation PDVFIRST.ID 0LAST.ID 1

ID TOTAL

1 A01 9

Ex3_1:


5 0 0 1 A02 2 4

5th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID ≠ 1: no execution ID TOTAL

1 A01 9

Ex3_1:


5 0 0 1 A02 2 4

5th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is calculated ID TOTAL

1 A01 9

Ex3_1:


5 0 0 1 A02 2 6

5th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

The subsetting IF statement is evaluated to be TRUE

ID TOTAL

1 A01 9

Ex3_1:


5 0 0 1 A02 2 6

5th iteration:



PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

ID TOTAL

1 A01 9

2 A02 6

Ex3_1:


5 0 0 1 A02 2 6

SAS reaches the end of the 5th iteration

The implicit OUTPUT executes

5th iteration:

RESTRUCTURING DATASETS

Restructuring datasets:

data with one observation per

subject (the wide format)

data with multiple observations per

subject (the long format)

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


Restructuring datasets:

data with one observation per

subject (the wide format)

data with multiple observations per

subject (the long format)

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2S1 – S3 SCORE

Distinguish different measurements for each subject


The transformation can be easily done by using ARRAY/PROC TRANSPOSE (See my paper “The Many Ways to Effectively Utilize Array Processing”, paper 244-2011)

This can also be accomplished without advanced techniques for more simple cases

Here is a solution for using multiple OUTPUT statements in one DATA step

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide: Long:ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Transform wide long2 observations to read 2 DATA step

iterationsUse multiple OUTPUT statementAny missing values in S1 – S3 will not be

outputted to long

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;

1st iteration:

_N_ 1Other variables missing

KID

.

DS1

.

DS2

.

DS3

.

KTIME

.

KSCORE

1

K_N_


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

1st observation from the wide PDV

A01

KID

3

DS1

4

DS2

5

DS3

.

KTIME

.

KSCORE

1

K_N_


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

Time 1

A01

KID

3

DS1

4

DS2

5

DS3

1

KTIME

.

KSCORE

1

K_N_


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

Score value from S1(3)

A01

KID

3

DS1

4

DS2

5

DS3

1

KTIME

3

KSCORE

1

K_N_


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

SCORE ≠ missing: ID, TIME, and SCORE Long

A01

KID

3

DS1

4

DS2

5

DS3

1

KTIME

3

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

TIME 2

A01

KID

3

DS1

4

DS2

5

DS3

2

KTIME

3

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

Score value from S2(4)

A01

KID

3

DS1

4

DS2

5

DS3

2

KTIME

4

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

SCORE ≠missing: ID, TIME, and SCORE Long

A01

KID

3

DS1

4

DS2

5

DS3

2

KTIME

4

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

TIME 3

A01

KID

3

DS1

4

DS2

5

DS3

3

KTIME

4

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

SCORE value from S3(5)

A01

KID

3

DS1

4

DS2

5

DS3

3

KTIME

5

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:

SCORE ≠missing: ID, TIME, and SCORE Long

A01

KID

3

DS1

4

DS2

5

DS3

3

KTIME

5

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


1st iteration:There is no more implicit OUTPUT statementSAS returns to the beginning of the DATA step to

begin the 2nd iteration

A01

KID

3

DS1

4

DS2

5

DS3

3

KTIME

5

KSCORE

1

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:_N_ ↑2ID and S1-S3 are retained from the previous iterationTIME, SCORE missing

A01

KID

3

DS1

4

DS2

5

DS3

.

KTIME

.

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

2nd observation from the Wide PDV

A01

KID

4

DS1

.

DS2

2

DS3

.

KTIME

.

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

TIME 1

A01

KID

4

DS1

.

DS2

2

DS3

1

KTIME

.

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

SCORE value from S1 (4)

A01

KID

4

DS1

.

DS2

2

DS3

1

KTIME

4

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

ID, TIME, and SCORE Long

A01

KID

4

DS1

.

DS2

2

DS3

1

KTIME

4

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

TIME 2

A01

KID

4

DS1

.

DS2

2

DS3

2

KTIME

4

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

SCORE the value from S2 (missing)

A01

KID

4

DS1

.

DS2

2

DS3

2

KTIME

.

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

SCORE = missing: no output is generated

A01

KID

4

DS1

.

DS2

2

DS3

2

KTIME

.

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

TIME 3

A01

KID

4

DS1

.

DS2

2

DS3

3

KTIME

.

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

SCORE the value from S3 (2)

A01

KID

4

DS1

.

DS2

2

DS3

3

KTIME

2

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:

ID, TIME, and SCORE Long

A01

KID

4

DS1

.

DS2

2

DS3

3

KTIME

2

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Long:


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


2nd iteration:SAS returns to the beginning of the DATA step to begin the

3rd iterationWith no more observations to read in the 3rd iteration, SAS

goes to the next DATA or PROC step

A01

KID

4

DS1

.

DS2

2

DS3

3

KTIME

2

KSCORE

2

K_N_

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Long:


FROM LONG FORMAT TO WIDE FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Reading 5 observations but only creating 2 observations

You are not copying data from the PDV to the final dataset at each iteration

You only need to generate one observation once all the observations for each subject have been processed


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

S1

S2

S3

S1

S3

if time = 1 then s1 = score;else if time = 2 then s2 = score;else s3 = score;

Use BY-group processing: BY ID Output to the final data when LAST.ID = 1

SCORE S1, S2 S3

RETAIN


ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

proc sort data=long; by id;run;data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

1ST iteration:_N_ 1FIRST.ID 1, LAST.ID 1Other variables missing

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

1 1 1 . . . . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


1ST iteration:The SET statement copies the 1st observation PDV


1 1 1 A01 1 3 . . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


1ST iteration:The SET statement copies the 1st observation PDVFIRST.ID 1 since this is the 1st observation for A01LAST.ID 0 since this is not the last observation for A01


1 1 0 A01 1 3 . . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


1ST iteration:Since TIME = 1, S1 SCORE (3)


1 1 0 A01 1 3 3 . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


1ST iteration:The subsetting IF statement is evaluated to be FALSE SAS returns to the beginning of the DATA step to begin the

2nd iteration


1 1 0 A01 1 3 3 . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


2nd iteration:_N_ ↑2


2 1 0 A01 1 3 3 . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


2nd iteration: FIRST.ID and LAST.ID are retained; they are automatic variables ID, TIME, SCORE are retained; they are from input dataset S1, S2, and S3 are retained because of the RETAIN statement


2 1 0 A01 1 3 3 . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


2nd iteration:The SET statement copies the 2nd observation to the PDV


2 1 0 A01 2 4 3 . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


2nd iteration:The SET statement copies the 2nd observation to the PDVFIRST.ID 0; this is not the first observation for A01LAST.ID 0; this is not the last observation for A01 either


2 0 0 A01 2 4 3 . .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


2nd iteration:Since TIME = 2, S2 SCORE (4)


2 0 0 A01 2 4 3 4 .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


2nd iteration:The subsetting IF statement is evaluated to be FALSE SAS returns to the beginning of the DATA step to begin the 3rd

iteration


2 0 0 A01 2 4 3 4 .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


3rd iteration:_N_ ↑3The rest of the variables are retained


3 0 0 A01 2 4 3 4 .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


3rd iteration:The SET statement copies the 3rd observation PDV


3 0 0 A01 3 5 3 4 .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


3rd iteration:The SET statement copies the 3rd observation PDVFIRST.ID 0; this is not the first observation for A01LAST.ID 1; this is the last observation for A01


3 0 1 A01 3 5 3 4 .


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


3rd iteration:Since TIME = 3, S3 SCORE (5)


3 0 1 A01 3 5 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


3rd iteration:The subsetting IF statement is evaluated to be true


3 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


3rd iteration:The implicit OUTPUT executes - variables marked with

(K) are copied to the dataset wideSAS returns to the beginning of the DATA step to

begin the 4th iteration


3 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


4th iteration:_N_ ↑4The rest of the variables are retained


4 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


4th iteration:The SET statement copies the 4th observation PDV


4 0 1 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


4th iteration:The SET statement copies the 4th observation PDVFIRST.ID 1; this is the first observation for A02LAST.ID 0; this is not the last observation for A02


4 1 0 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


4th iteration:Since TIME = 1, S1 SCORE (4)


4 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


4th iteration:The subsetting IF statement is evaluated to be FALSE SAS returns to the beginning of the DATA step to begin the 5 th

iteration


4 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2




5 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2




5 1 0 A02 3 2 4 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


5th iteration:The SET statement copies the 5th observation PDVFIRST.ID 0; this is not the first observation for A02LAST.ID 1; this is the last observation for A02


5 0 1 A02 3 2 4 4 5

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2




5 0 1 A02 3 2 4 4 2

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


5th iteration:The subsetting IF statement is evaluated to be TRUE


5 0 1 A02 3 2 4 4 2

ID S1 S2 S3

1 A01 3 4 5


ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2


5th iteration:The implicit OUTPUT executes


5 0 1 A02 3 2 4 4 2

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 4 2

How to fix this?


data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2



4 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2



4 0 1 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration:The SET statement copies the 4th observation PDVFIRST.ID 1; this is the first observation for A02LAST.ID 0; this is not the last observation for A02


4 1 0 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration:Since FIRST.ID = 1, S1 – S3 missing


4 1 0 A02 1 4 . . .

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2



4 1 0 A02 1 4 4 . .

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

4th iteration:The subsetting IF statement is evaluated to be falseSAS returns to the beginning of the DATA step to begin the 5th

iteration


4 1 0 A02 1 4 4 . .

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2



5 1 0 A02 1 4 4 . .

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2



5 1 0 A02 3 2 4 . .

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration:The SET statement copies the 5th observation PDVFIRST.ID 0; this is not the first observation for A02LAST.ID 1; this is the last observation for A02


5 0 1 A02 3 2 4 . .

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration:Since FIRST.ID ≠1, no execution


5 0 1 A02 3 2 4 . .

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2



5 0 1 A02 3 2 4 . 2

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration:The subsetting IF statement is evaluated to be true


5 0 1 A02 3 2 4 . 2

ID S1 S2 S3

1 A01 3 4 5



ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

5th iteration:SAS reaches the end of the 5th iterationThe implicit OUTPUT executes


5 0 1 A02 3 2 4 . 2

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2


CONCLUSION

The most important part of DATA step processing is to understand how data is transformed to the PDV and how data is copied from the PDV to a new dataset

To be a successful SAS programmer, we must be able to thoroughly comprehend how DATA steps are processed

REFERENCES

Cody, Ron. 2001. Longitudinal Data and SAS® A Programmer’s Guide. Cary, NC: SAS Institute Inc.

ACKNOWLEDGEMENT

I would like to thank MaryAnne DePesquo for inviting me to present at the SGF 2011

CONTACT INFORMATION

Arthur X. Li

City of Hope Comprehensive Cancer Center

Division of Information Science

1500 East Duarte Road

Duarte, CA 91010 - 3000

Work Phone: (626) 256-4673 ext. 65121

Fax: (626) 471-7106

E-mail: [email protected]

the essence of data step programming

Education

d height

height output run pdv

height output run pdv

pdv input buffer

input buffer pdv

height output run pdv

input data

data error