the essence of data step programming
DESCRIPTION
The fundamental of SAS programming is DATA step programming. The essence of DATA step programming is to understand how SAS processes the data during the compilation and execution phases. In this paper, you will be exposed to what happens “behind the scenes” while creating a SAS dataset. You will learn how a new dataset is created, one observation at a time, from either a raw text file or an existing SAS dataset, to the program data vector (PDV) and from the PDV to the newly-created SAS dataset. Once you fully understand DATA step processing, learning the SUM and RETAIN statements will become easier to grasp. Relating to this topic, this paper will also cover BY-group processing.TRANSCRIPT
The Essence of DATA Step Programming
Arthur LiCity of Hope Comprehensive Cancer Center
Department of Information Science
INTRODUCTION
SAS programming
DATA step programming
Understanding how SAS processes the data during the compilation and execution phases
Fundamental:
Essence:
A COMMON BEFUDDLEMENT
The newly-created SAS dataset is not what we intended there are more or less observationsthe value of the variable was not retained
correctly
Reason:Learning only SAS language syntaxNot understanding the fundamental SAS
programming concepts
INTRODUCTION
We will cover…what happens “behind the scenes” while creating a
SAS dataset Learn how a new dataset is created
one observation at a time a raw text file/SAS dataset PDVSAS dataset
The SUM and RETAIN statements BY-group processing Transposing dataset examples
DATA STEP PROCESSING OVERVIEW
Compilation phase:Each statement is scanned for syntax errors.
Execution phase:The DATA step reads and processes the input data.
If there is no syntax error
A DATA step is processed in two-phase sequences:
DATA STEP PROCESSING OVERVIEW
Variable names Columns
Name 1-7
Height 9-10
Weight 12-14
Program1:
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
Data Entry Error
The column input method:Each variable is occupied in a fixed fieldThe values are standard character or numerical values
Creating a new variable: BMI
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
Used to hold raw dataWill not be created when reading a SAS dataset
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
PDV is created
Memory area where SAS builds its new data set, 1 observation at a time.
_N_ D _ERROR_D
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
PDV is created
Automatic variables:_N_ = 1: 1st observation is being processed_N_ = 2: 2nd observation is being processed
_N_ D _ERROR_D
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
PDV is created
Automatic variables:_ERROR_ = 1: signals the data error of the currently-processed observation
_N_ D _ERROR_D
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
A space is added to the PDV for each variable
_N_ D _ERROR_D Height KName K Weight K
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
BMI is added to the PDV
_N_ D _ERROR_D Height KName K Weight K BMI K
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV _N_ D _ERROR_D Height KName K Weight K BMI K
D = dropped
K = kept
COMPILATION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV _N_ D _ERROR_D Height KName K Weight K BMI K
Checks for syntax errorsinvalid variable names invalid optionsincorrect punctuationsmisspelled keywords
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
The DATA step works like a loopIt repetitively executes statements
reads data values creates observations one at a time
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
1st Iteration:At the beginning
1 0
_N_ 1, _ERROR_ 0The remaining variables are set to missing
. . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
1st data line input bufferThe input pointer @ the beginning of the input buffer
The INFILE statement identifies the location of Exampl1.txt
1 0 . . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
1 0
The INPUT statement reads data values: input buffer PDV
. . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
1 0
input buffer (columns 1-7) “Name” in the PDV
Barbara . . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
1 0
The input pointer @ column 8
Barbara . . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
1 0 . .
input buffer (columns 9-10) “Height” in the PDV
Barbara 61
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
1 0
The input pointer @ column 11
Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
1 0
Tries to read Weight – invalid value
Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
Tries to read Weight – invalid value _ERROR_ 1
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
The input pointer @ column 15
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
BMI will remain missing: operations on a missing value a missing value.
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
The OUTPUT statement is executed
Only values marked with (K) are copied as a single observation to the SAS dataset ex1
Name Height Weight BMI
1 Barbara 61 . .
Ex1:
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1st Iteration:
B a r b a r a 6 1 1 2 D
At the end of the DATA step, two things occur automatically:
Ex1:
Name Height Weight BMI
1 Barbara 61 . .
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
1. The SAS system returns to the beginning of the DATA step
Ex1:
Name Height Weight BMI
1 Barbara 61 . .
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
2. The values of the variables in the PDV are reset to missing _N_ ↑ 2
_ERROR_ 0
Ex1:
Name Height Weight BMI
1 Barbara 61 . .
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 . . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
2nd Iteration:
J o h n 6 2 1 7 5
2nd data line input buffer The input pointer @
beginning of the input buffer
Ex1:
Name Height Weight BMI
1 Barbara 61 . .
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 . . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
2nd Iteration:
J o h n 6 2 1 7 5
The INPUT statement is executed
Ex1:
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 .62John 175
Name Height Weight BMI
1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
2nd Iteration:
J o h n 6 2 1 7 5
BMI is calculated
Ex1:
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 31.867862John 175
Name Height Weight BMI
1 Barbara 61 . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
2nd Iteration:
J o h n 6 2 1 7 5
The OUTPUT statement is executed
Ex1:
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 31.867862John 175
Name Height Weight BMI
1 Barbara 61 . .
2 John 62 175 31.8678
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
2nd Iteration:
J o h n 6 2 1 7 5
At the end of the DATA step, two things occur automatically:
Ex1:
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 31.867862John 175
Name Height Weight BMI
1 Barbara 61 . .
2 John 62 175 31.8678
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
Ex1:
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 31.867862John 175
1. The SAS system returns to the beginning of the DATA step
Name Height Weight BMI
1 Barbara 61 . .
2 John 62 175 31.8678
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV
Barbara 61 12DJohn 62 175
Example1.txt12345678901234567890
Ex1:
2. The values of the variables in the PDV are reset to missing _N_ ↑3
Name Height Weight BMI
1 Barbara 61 . .
2 John 62 175 31.8678
_N_ D _ERROR_D Name K Height K Weight K BMI K
3 0 . . .
EXECUTION PHASE
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
proc print data=ex1;run;
There are no more records to readThe SAS system next DATA/PROC step
THE OUTPUT STATEMENT
data ex1; set example1; BMI = 700*weight/(height*height);
run;
The explicit OUTPUT statement:
write the current observation from the PDV to a SAS dataset immediately
not at the end of the DATA step
output;
THE OUTPUT STATEMENT
data ex1; set example1; BMI = 700*weight/(height*height);
run;
It tells SAS to write observations to the dataset at the end of the DATA step
The implicit OUTPUT statement:
THE OUTPUT STATEMENT
Using explicit OUTPUT will override the implicit OUTPUT
We can use more than one OUTPUT statement in the DATA step
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; infile 'C:\Arthur\example1.txt'; input name $ 1-7 height 9-10 weight 12-14; BMI = 700*weight/(height*height); output;run;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …
…Input buffer
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
Barbara 61 12DJohn 62 175Raw data
SAS dataset
When Reading a raw dataset …
Name Height Weight BMI
1 Barbara 61 . .
2 John 62 175 31.8678
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV_N_ D _ERROR_D Name K Height K Weight K BMI K
SAS dataset
When Reading a SAS dataset …
SAS dataset
Input dataset:Example1(after “set”)
Output dataset:Ex1(after “data”)
Name Height Weight
1 Barbara 61 .
2 John 62 175
Name Height Weight BMI
1 Barbara 61 . .
2 John 62 175 31.8678
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
When reading a raw dataset, SAS sets each variable value in the PDV to missing at the beginning of each iteration of execution, except for …
the automatic variablesvariables that are named in the RETAIN or SUM statementdata elements in a _TEMPORARY_ arrayvariables created in the options of the FILE/INFILE statement
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV
1st Iteration:At the beginning of the
execution phase, SAS sets each variable to missing in the PDV
When Reading a SAS dataset …
Example1: Name Height Weight
1 Barbara 61 170
2 John 62 175
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 0 . . .
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV
1st Iteration:The SET statement is
executed
When Reading a SAS dataset …
Example1: Name Height Weight
1 Barbara 61 170
2 John 62 175
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 0 .Barbara 170 61
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV
1st Iteration:BMI is calculated
When Reading a SAS dataset …
Example1: Name Height Weight
1 Barbara 61 170
2 John 62 175
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 0 Barbara 31.9807170 61
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV
1st Iteration:Output statement is
executed
When Reading a SAS dataset …
Example1:
Ex1:
Name Height Weight
1 Barbara 61 170
2 John 62 175
Name Height Weight BMI
1 Barbara 61 170 31.9807
_N_ D _ERROR_D Name K Height K Weight K BMI K
1 0 Barbara 31.9807170 61
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV
2nd Iteration:When Reading a SAS dataset …
Example1:
Ex1:
Name Height Weight
1 Barbara 61 170
2 John 62 175
Name Height Weight BMI
1 Barbara 61 170 31.9807
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 Barbara .170 61
Variables exist in the input dataset
SAS sets each variable to missing in the PDV only before the 1st iteration of the execution
Variables will retain their values in the PDV until they are replaced by the new values
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV
2nd Iteration:When Reading a SAS dataset …
Example1:
Ex1:
Name Height Weight
1 Barbara 61 170
2 John 62 175
Name Height Weight BMI
1 Barbara 61 170 31.9807
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 Barbara .170 61
Variables being created in the DATA step
SAS sets each variable to missing in the PDV at the beginning of every iteration of the execution
THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET
data ex1; set example1; BMI = 700*weight/(height*height); output;run;
PDV
2nd Iteration:SET statement is executed
When Reading a SAS dataset …
Example1:
Ex1:
Name Height Weight
1 Barbara 61 170
2 John 62 175
Name Height Weight BMI
1 Barbara 61 170 31.9807
_N_ D _ERROR_D Name K Height K Weight K BMI K
2 0 John .175 62
THE RETAIN STATEMENT
ID SCORE
1 A01 3
2 A02 .
3 A03 4
Consider the following dataset:
We would like to create a new variable that accumulates the values of SCORE
TOTAL
3
3
7
THE RETAIN STATEMENT
ID SCORE
1 A01 3
2 A02 .
3 A03 4
Consider the following dataset:
How to do it?Set the TOTAL to 0 at the first iteration of the
execution Then at each iteration of the execution, add
values from SCORE to TOTAL
TOTAL
3
3
7
Problem: TOTAL is a new variable that you want to create TOTAL will be set to missing in the PDV at the beginning of every iteration of the execution.
THE RETAIN STATEMENT
To fix this problem, we can use the RETAIN statement:
RETAIN VARIABLE <VALUE>;
Prevents the VARIABLE from being initialized each time the DATA step executes
THE RETAIN STATEMENT
To fix this problem, we can use the RETAIN statement:
RETAIN VARIABLE <VALUE>;
Name of the variable that we will want to retain
A numeric valueUsed to initialize the VARIABLE
only at the first iteration of the DATA step execution
Not specifying an initial value VARIABLE is initialized as missing
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV _N_ D _ERROR_D ID K Total K
The execution phase begins immediately after the completion of the compilation phase
ID SCORE
1 A01 3
2 A02 .
3 A03 4
Score K
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
_N_ 1, _ERROR_ 0ID, SCORE missingTOTAL 0 because of the RETAIN
ID SCORE
1 A01 3
2 A02 .
3 A03 4
1st Iteration:
_N_ D _ERROR_D ID K Total KScore K
1 0 . 0
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
1st observation from ex2 PDV.
ID SCORE
1 A01 3
2 A02 .
3 A03 4
1st Iteration:
_N_ D _ERROR_D ID K Total KScore K
1 0 3 0 A01
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
The RETAIN statement is a compile-time only statement
It does not execute during the execution phase
ID SCORE
1 A01 3
2 A02 .
3 A03 4
1st Iteration:
_N_ D _ERROR_D ID K Total KScore K
1 0 3 0 A01
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
TOTAL is calculated
ID SCORE
1 A01 3
2 A02 .
3 A03 4
1st Iteration:
_N_ D _ERROR_D ID K Total KScore K
1 0 3 3 A01
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
The implicit OUTPUT statement tells the SAS system to write observations to the dataset
ID SCORE
1 A01 3
2 A02 .
3 A03 4
1st Iteration:ID SCORE TOTAL
1 A01 3 3
Ex2_2:
_N_ D _ERROR_D ID K Total KScore K
1 0 3 3 A01
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
_N_ ↑2 ID and SCORE are retained from the
previous iteration because data are read from an existing SAS dataset
TOTAL is also retained because the RETAIN statement is used
ID SCORE
1 A01 3
2 A02 .
3 A03 4
2nd Iteration:ID SCORE TOTAL
1 A01 3 3
Ex2_2:
_N_ D _ERROR_D ID K Total KScore K
2 0 3 3 A01
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
2nd observation from ex2 PDV
ID SCORE
1 A01 3
2 A02 .
3 A03 4
2nd Iteration:ID SCORE TOTAL
1 A01 3 3
Ex2_2:
_N_ D _ERROR_D ID K Total KScore K
2 0 . 3 A02
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
TOTAL is calculated
ID SCORE
1 A01 3
2 A02 .
3 A03 4
2nd Iteration:ID SCORE TOTAL
1 A01 3 3
Ex2_2:
_N_ D _ERROR_D ID K Total KScore K
2 0 . 3 A02
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
ID SCORE
1 A01 3
2 A02 .
3 A03 4
2nd Iteration:ID SCORE TOTAL
1 A01 3 3
2 A02 . 3
Ex2_2:
The implicit OUTPUT: The contents in PDV Ex2_2
_N_ D _ERROR_D ID K Total KScore K
2 0 . 3 A02
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
ID SCORE
1 A01 3
2 A02 .
3 A03 4
3rd Iteration:ID SCORE TOTAL
1 A01 3 3
2 A02 . 3
Ex2_2:
_N_ ↑ 3. ID and SCORE are retained
from the previous iteration. TOTAL is also retained.
_N_ D _ERROR_D ID K Total KScore K
3 0 . 3 A02
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
ID SCORE
1 A01 3
2 A02 .
3 A03 4
3rd Iteration:ID SCORE TOTAL
1 A01 3 3
2 A02 . 3
Ex2_2:
3rd observation from ex2 PDV
_N_ D _ERROR_D ID K Total KScore K
3 0 4 3 A03
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
ID SCORE
1 A01 3
2 A02 .
3 A03 4
3rd Iteration:ID SCORE TOTAL
1 A01 3 3
2 A02 . 3
Ex2_2:
TOTAL is calculated
_N_ D _ERROR_D ID K Total KScore K
3 0 4 7 A03
THE RETAIN STATEMENT
data ex2_2; set ex2; retain total 0; total = sum(total, score);run;
PDV
ID SCORE
1 A01 3
2 A02 .
3 A03 4
3rd Iteration:ID SCORE TOTAL
1 A01 3 3
2 A02 . 3
3 A03 4 7
Ex2_2:
The implicit OUTPUT:The contents in PDV Ex2_2
_N_ D _ERROR_D ID K Total KScore K
3 0 4 7 A03
THE SUM STATEMENT
The SUM statement has the following form:
VARIABLE + EXPRESSION;
The numeric accumulator variable that is to be created
It is automatically set to 0 at the beginning of the first iteration of the DATA step execution
Retained in following iterations
Any SAS expression If EXPRESSION is evaluated
to a missing value, it is treated as 0
THE SUM STATEMENT
data ex2_2; set ex2;
run;
retain total 0;total = sum(total, score);
The previous program can be re-written as…
THE SUM STATEMENT
data ex2_2; set ex2;
run;
The previous program can be re-written as…
total + score;
THE SUBSETTING IF STATEMENT
We use the subsetting IF statement to continue processing only the observations that meet the condition of the specified expression
IF EXPRESSION;
If EXPRESSION is true for the observation, SAS continues to execute statements in the DATA step includes the current observation in the data set
THE SUBSETTING IF STATEMENT
Use the IF statement to continue processing only the observations that meet the condition of the specified expression
IF EXPRESSION;
If EXPRESSION is false for the observation, no further statements are processed for that obs.SAS immediately returns to the beginning of DATA step the remaining program statements in the DATA step are
not executed and the current observation is not written to the output data set
THE BY-GROUP PROCESSING IN THE DATA STEP
ID SCORE
1 A01 3
2 A02 .
3 A03 4
One observation per subject
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
Multiple observations per subject-- Longitudinal data
Identify the beginning/end of measurement for each subject
This can be accomplished by using the BY-group processing method
THE BY-GROUP PROCESSING IN THE DATA STEP
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
SAS locates the beginning and end of a BY-group by creating two temporary indicator variables for each BY variable: FIRST.VARIABLELAST.VARIABLE
Suppose ID is the “BY” variable:
FIRST.ID
1
0
0
1
0
LAST.ID
0
0
1
0
1
SAS reads the 1st observation for ID = A01
SAS reads the last observation for ID = A01
THE BY-GROUP PROCESSING IN THE DATA STEP
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
Calculating the total scores for each subject
ID TOTAL
1 A01 9
2 A02 6
proc sort data=ex3; by id;run;data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
1st iteration:_N_ 1, _ERROR_ 0 FIRST.ID 1, LAST.ID 1 only at beginning of 1st iterationID, Score missingTOTAL 0 because of the SUM statement
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
1 0 1 1 . 0
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
The SET statement is executed1st observation PDVFIRST.ID 1 and LAST.ID 0
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
1 0 1 0 A01 3 0
1st iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
FIRST.ID = 1: TOTAL 0
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
1 0 1 0 A01 3 0
1st iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
TOTAL is accumulated
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
1 0 1 0 A01 3 3
1st iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
The subsetting IF statement is evaluated to be FALSE because LAST.ID ≠ 1
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
1 0 1 0 A01 3 3
SAS returns to the beginning of the DATA step to begin the 2nd iteration
1st iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
_N_ ↑ 2The values for the rest of the variables are retained
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
2 0 1 0 A01 3 3
2nd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
2nd observation PDVNot the first observation for A01: FIRST.ID 0Not the last observation for A01: LAST.ID 0
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
2 0 0 0 A01 4 3
2nd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
FIRST.ID ≠ 1: no execution
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
2 0 0 0 A01 4 3
2nd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
TOTAL is accumulated
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
2 0 0 0 A01 4 7
2nd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
2 0 0 0 A01 4 7
The subsetting IF statement is evaluated to be FALSE because LAST.ID ≠ 1
SAS returns to the beginning of the DATA step to begin the 3rd iteration
2nd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
_N_ ↑3The values for the rest of the variables are retained
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
3 0 0 0 A01 4 7
3rd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
3rd observation PDVNot the first observation: FIRST.ID 0 Last observation for A01: LAST.ID 1
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
3 0 0 1 A01 2 7
3rd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
FIRST.ID ≠ 1: no execution
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
3 0 0 1 A01 2 7
3rd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
TOTAL is calculated
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
3 0 0 1 A01 2 9
3rd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
3 0 0 1 A01 2 9
The subsetting IF statement is evaluated to be TRUE
3rd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
SAS reaches the end of the 3rd iterationThe implicit OUTPUT executesSAS returns to the beginning of the
DATA step to begin the 3rd iteration
ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
3 0 0 1 A01 2 9
3rd iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
_N_ ↑ 4The values for the remaining
variables are retained
ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
4 0 0 1 A01 2 9
4th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
4th observation PDVFIRST.ID 1LAST.ID 0
ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
4 0 1 0 A02 4 9
4th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
FIRST.ID = 1: TOTAL 0ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
4 0 1 0 A02 4 0
4th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
TOTAL is calculatedID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
4 0 1 0 A02 4 4
4th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
4 0 1 0 A02 4 4
The subsetting IF statement is evaluated to be FALSE
SAS returns to the beginning of the DATA step to begin the 5th iteration
4th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
_N_ ↑ 5The values for the remaining
variables are retained
ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
5 0 1 0 A02 4 4
5th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
5th observation PDVFIRST.ID 0LAST.ID 1
ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
5 0 0 1 A02 2 4
5th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
FIRST.ID ≠ 1: no execution ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
5 0 0 1 A02 2 4
5th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
TOTAL is calculated ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
5 0 0 1 A02 2 6
5th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
The subsetting IF statement is evaluated to be TRUE
ID TOTAL
1 A01 9
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
5 0 0 1 A02 2 6
5th iteration:
THE BY-GROUP PROCESSING IN THE DATA STEP
data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;
PDV
ID SCORE
1 A01 3
2 A01 4
3 A01 2
4 A02 4
5 A02 2
ID TOTAL
1 A01 9
2 A02 6
Ex3_1:
_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D
5 0 0 1 A02 2 6
SAS reaches the end of the 5th iteration
The implicit OUTPUT executes
5th iteration:
RESTRUCTURING DATASETS
Restructuring datasets:
data with one observation per
subject (the wide format)
data with multiple observations per
subject (the long format)
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
RESTRUCTURING DATASETS
Restructuring datasets:
data with one observation per
subject (the wide format)
data with multiple observations per
subject (the long format)
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2S1 – S3 SCORE
Distinguish different measurements for each subject
RESTRUCTURING DATASETS
The transformation can be easily done by using ARRAY/PROC TRANSPOSE (See my paper “The Many Ways to Effectively Utilize Array Processing”, paper 244-2011)
This can also be accomplished without advanced techniques for more simple cases
Here is a solution for using multiple OUTPUT statements in one DATA step
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide: Long:ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
Transform wide long2 observations to read 2 DATA step
iterationsUse multiple OUTPUT statementAny missing values in S1 – S3 will not be
outputted to long
data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
_N_ 1Other variables missing
KID
.
DS1
.
DS2
.
DS3
.
KTIME
.
KSCORE
1
K_N_
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
1st observation from the wide PDV
A01
KID
3
DS1
4
DS2
5
DS3
.
KTIME
.
KSCORE
1
K_N_
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
Time 1
A01
KID
3
DS1
4
DS2
5
DS3
1
KTIME
.
KSCORE
1
K_N_
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
Score value from S1(3)
A01
KID
3
DS1
4
DS2
5
DS3
1
KTIME
3
KSCORE
1
K_N_
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
SCORE ≠ missing: ID, TIME, and SCORE Long
A01
KID
3
DS1
4
DS2
5
DS3
1
KTIME
3
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
TIME 2
A01
KID
3
DS1
4
DS2
5
DS3
2
KTIME
3
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
Score value from S2(4)
A01
KID
3
DS1
4
DS2
5
DS3
2
KTIME
4
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
SCORE ≠missing: ID, TIME, and SCORE Long
A01
KID
3
DS1
4
DS2
5
DS3
2
KTIME
4
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
TIME 3
A01
KID
3
DS1
4
DS2
5
DS3
3
KTIME
4
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
SCORE value from S3(5)
A01
KID
3
DS1
4
DS2
5
DS3
3
KTIME
5
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:
SCORE ≠missing: ID, TIME, and SCORE Long
A01
KID
3
DS1
4
DS2
5
DS3
3
KTIME
5
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
1st iteration:There is no more implicit OUTPUT statementSAS returns to the beginning of the DATA step to
begin the 2nd iteration
A01
KID
3
DS1
4
DS2
5
DS3
3
KTIME
5
KSCORE
1
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:_N_ ↑2ID and S1-S3 are retained from the previous iterationTIME, SCORE missing
A01
KID
3
DS1
4
DS2
5
DS3
.
KTIME
.
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
2nd observation from the Wide PDV
A01
KID
4
DS1
.
DS2
2
DS3
.
KTIME
.
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
TIME 1
A01
KID
4
DS1
.
DS2
2
DS3
1
KTIME
.
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
SCORE value from S1 (4)
A01
KID
4
DS1
.
DS2
2
DS3
1
KTIME
4
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
ID, TIME, and SCORE Long
A01
KID
4
DS1
.
DS2
2
DS3
1
KTIME
4
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
TIME 2
A01
KID
4
DS1
.
DS2
2
DS3
2
KTIME
4
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
SCORE the value from S2 (missing)
A01
KID
4
DS1
.
DS2
2
DS3
2
KTIME
.
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
SCORE = missing: no output is generated
A01
KID
4
DS1
.
DS2
2
DS3
2
KTIME
.
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
TIME 3
A01
KID
4
DS1
.
DS2
2
DS3
3
KTIME
.
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
SCORE the value from S3 (2)
A01
KID
4
DS1
.
DS2
2
DS3
3
KTIME
2
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:
ID, TIME, and SCORE Long
A01
KID
4
DS1
.
DS2
2
DS3
3
KTIME
2
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
Long:
FROM WIDE FORMAT TO LONG FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
Wide:data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output;run;
2nd iteration:SAS returns to the beginning of the DATA step to begin the
3rd iterationWith no more observations to read in the 3rd iteration, SAS
goes to the next DATA or PROC step
A01
KID
4
DS1
.
DS2
2
DS3
3
KTIME
2
KSCORE
2
K_N_
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
Long:
FROM WIDE FORMAT TO LONG FORMAT
FROM LONG FORMAT TO WIDE FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
Reading 5 observations but only creating 2 observations
You are not copying data from the PDV to the final dataset at each iteration
You only need to generate one observation once all the observations for each subject have been processed
FROM LONG FORMAT TO WIDE FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
S1
S2
S3
S1
S3
if time = 1 then s1 = score;else if time = 2 then s2 = score;else s3 = score;
Use BY-group processing: BY ID Output to the final data when LAST.ID = 1
SCORE S1, S2 S3
RETAIN
FROM LONG FORMAT TO WIDE FORMAT
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
proc sort data=long; by id;run;data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
1ST iteration:_N_ 1FIRST.ID 1, LAST.ID 1Other variables missing
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
1 1 1 . . . . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
1ST iteration:The SET statement copies the 1st observation PDV
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
1 1 1 A01 1 3 . . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
1ST iteration:The SET statement copies the 1st observation PDVFIRST.ID 1 since this is the 1st observation for A01LAST.ID 0 since this is not the last observation for A01
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
1 1 0 A01 1 3 . . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
1ST iteration:Since TIME = 1, S1 SCORE (3)
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
1 1 0 A01 1 3 3 . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
1ST iteration:The subsetting IF statement is evaluated to be FALSE SAS returns to the beginning of the DATA step to begin the
2nd iteration
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
1 1 0 A01 1 3 3 . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
2nd iteration:_N_ ↑2
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
2 1 0 A01 1 3 3 . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
2nd iteration: FIRST.ID and LAST.ID are retained; they are automatic variables ID, TIME, SCORE are retained; they are from input dataset S1, S2, and S3 are retained because of the RETAIN statement
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
2 1 0 A01 1 3 3 . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
2nd iteration:The SET statement copies the 2nd observation to the PDV
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
2 1 0 A01 2 4 3 . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
2nd iteration:The SET statement copies the 2nd observation to the PDVFIRST.ID 0; this is not the first observation for A01LAST.ID 0; this is not the last observation for A01 either
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
2 0 0 A01 2 4 3 . .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
2nd iteration:Since TIME = 2, S2 SCORE (4)
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
2 0 0 A01 2 4 3 4 .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
2nd iteration:The subsetting IF statement is evaluated to be FALSE SAS returns to the beginning of the DATA step to begin the 3rd
iteration
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
2 0 0 A01 2 4 3 4 .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
3rd iteration:_N_ ↑3The rest of the variables are retained
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
3 0 0 A01 2 4 3 4 .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
3rd iteration:The SET statement copies the 3rd observation PDV
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
3 0 0 A01 3 5 3 4 .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
3rd iteration:The SET statement copies the 3rd observation PDVFIRST.ID 0; this is not the first observation for A01LAST.ID 1; this is the last observation for A01
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
3 0 1 A01 3 5 3 4 .
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
3rd iteration:Since TIME = 3, S3 SCORE (5)
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
3 0 1 A01 3 5 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
3rd iteration:The subsetting IF statement is evaluated to be true
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
3 0 1 A01 3 5 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
3rd iteration:The implicit OUTPUT executes - variables marked with
(K) are copied to the dataset wideSAS returns to the beginning of the DATA step to
begin the 4th iteration
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
3 0 1 A01 3 5 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
4th iteration:_N_ ↑4The rest of the variables are retained
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 0 1 A01 3 5 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
4th iteration:The SET statement copies the 4th observation PDV
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 0 1 A02 1 4 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
4th iteration:The SET statement copies the 4th observation PDVFIRST.ID 1; this is the first observation for A02LAST.ID 0; this is not the last observation for A02
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 1 0 A02 1 4 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
4th iteration:Since TIME = 1, S1 SCORE (4)
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 1 0 A02 1 4 4 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
4th iteration:The subsetting IF statement is evaluated to be FALSE SAS returns to the beginning of the DATA step to begin the 5 th
iteration
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 1 0 A02 1 4 4 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
5th iteration:_N_ ↑5The rest of the variables are retained
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 1 0 A02 1 4 4 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
5th iteration:The SET statement copies the 5th observation PDV
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 1 0 A02 3 2 4 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
5th iteration:The SET statement copies the 5th observation PDVFIRST.ID 0; this is not the first observation for A02LAST.ID 1; this is the last observation for A02
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
5th iteration:Since TIME = 3, S3 SCORE (2)
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 4 2
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
5th iteration:The subsetting IF statement is evaluated to be TRUE
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 4 2
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
5th iteration:The implicit OUTPUT executes
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 4 2
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 4 2
How to fix this?
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
4th iteration:_N_ ↑4The rest of the variables are retained
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 0 1 A01 3 5 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
4th iteration:The SET statement copies the 4th observation PDV
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 0 1 A02 1 4 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
4th iteration:The SET statement copies the 4th observation PDVFIRST.ID 1; this is the first observation for A02LAST.ID 0; this is not the last observation for A02
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 1 0 A02 1 4 3 4 5
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
4th iteration:Since FIRST.ID = 1, S1 – S3 missing
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 1 0 A02 1 4 . . .
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
4th iteration:Since TIME = 1, S1 SCORE (4)
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 1 0 A02 1 4 4 . .
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
4th iteration:The subsetting IF statement is evaluated to be falseSAS returns to the beginning of the DATA step to begin the 5th
iteration
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
4 1 0 A02 1 4 4 . .
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
5th iteration:_N_ ↑5The rest of the variables are retained
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 1 0 A02 1 4 4 . .
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
5th iteration:The SET statement copies the 5th observation PDV
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 1 0 A02 3 2 4 . .
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
5th iteration:The SET statement copies the 5th observation PDVFIRST.ID 0; this is not the first observation for A02LAST.ID 1; this is the last observation for A02
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 . .
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
5th iteration:Since FIRST.ID ≠1, no execution
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 . .
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
5th iteration:Since TIME = 3, S3 SCORE (2)
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 . 2
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
5th iteration:The subsetting IF statement is evaluated to be true
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 . 2
ID S1 S2 S3
1 A01 3 4 5
FROM LONG FORMAT TO WIDE FORMAT
data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;
ID TIME SCORE
1 A01 1 3
2 A01 2 4
3 A01 3 5
4 A02 1 4
5 A02 3 2
5th iteration:SAS reaches the end of the 5th iterationThe implicit OUTPUT executes
_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K
5 0 1 A02 3 2 4 . 2
ID S1 S2 S3
1 A01 3 4 5
2 A02 4 . 2
FROM LONG FORMAT TO WIDE FORMAT
CONCLUSION
The most important part of DATA step processing is to understand how data is transformed to the PDV and how data is copied from the PDV to a new dataset
To be a successful SAS programmer, we must be able to thoroughly comprehend how DATA steps are processed
REFERENCES
Cody, Ron. 2001. Longitudinal Data and SAS® A Programmer’s Guide. Cary, NC: SAS Institute Inc.
ACKNOWLEDGEMENT
I would like to thank MaryAnne DePesquo for inviting me to present at the SGF 2011
CONTACT INFORMATION
Arthur X. Li
City of Hope Comprehensive Cancer Center
Division of Information Science
1500 East Duarte Road
Duarte, CA 91010 - 3000
Work Phone: (626) 256-4673 ext. 65121
Fax: (626) 471-7106
E-mail: [email protected]