DSCI 325: Handout 13 – Combining Data Sets in SAS Spring 2017

A variety of methods exist for combining datasets. Specifically in this handout, we will discuss the following methods:

• Appending and Concatenating – these involve adding ROWS to a data set
• Merging – this involves adding COLUMNS to a data set

The following table gives a more complete definition of and an example of each method:

Method

Appending – this adds the observations in the second data set directly to the end of the original data set

Concatenating – this copies all observations from the data sets you want to combine and creates a new data set

Merging – this involves combining observations from two or more data sets into a single observation in a new data set

Questions:

1. Suppose that three data sets named JanSales, FebSales, and MarSales need to be combined to create a data set named Qtr1Sales. Which method should be used?

2. Suppose that a Sales data set needs to be combined with a Target data set by month to compare the sales data to the target data. Which method should be used?

3. Suppose the FebSales data set needs to be added to the YTD data set. Which method should be used?


Using PROC APPEND to Combine Datasets with the Same Variable Structures

Consider the following SAS data sets. Emps originally contains employee information on all employees hired prior to 2012, and Emps2012 contains only those employees hired in 2012. Note that both files have the same variables.

Emps

Emps2012

To combine these two data sets and view the result, we can run the following program:

PROC APPEND BASE = Emps

DATA = Emps2012;

RUN;

PROC PRINT DATA = Emps;

RUN;

Emps

The log shows the following:


Using PROC APPEND to Combine Datasets with Different Variable Structures

Once again, consider the Emps data set. Recall that this now contains employee information on all employees hired through 2012. Suppose that in 2013, we stopped recording the gender of the employees. These data are given in the file Emps2013.

Emps

Emps2013

Now, suppose we run the following program to combine the two data sets:

PROC APPEND BASE = Emps

DATA = Emps2013;

RUN;

PROC PRINT DATA = Emps;

RUN;

Emps

The log displays the following warning, but the procedure still worked.


Next, suppose instead that we decided in 2013 to stop recording gender and also to start recording information on the employees’ highest degree earned.

Emps

Emps2013

Consider the following program and result in the log window:

PROC APPEND BASE = Emps

DATA = Emps2013;

RUN;

PROC PRINT DATA = Emps;

RUN;

Note that when the DATA= data set contains a variable that is not included in the BASE= data set, SAS does not carry out the append. We could use the FORCE option as the log suggests:

PROC APPEND BASE = Emps

DATA = Emps2013 FORCE;

RUN;

PROC PRINT DATA = Emps;

RUN;


SAS returns the following:

Emps
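To experiment with this behavior without the course files, the following is a minimal sketch using small made-up stand-ins for Emps and Emps2013. The variable names (First, Gender, HireYear, Degree) and all values are assumptions for illustration only; they are not the contents of the actual course data sets.

```sas
/* Hypothetical stand-in for Emps: records Gender but not Degree. */
DATA Emps;
   INPUT First $ Gender $ HireYear;
   DATALINES;
Stacey F 2011
Gloria F 2012
James M 2012
;
RUN;

/* Hypothetical stand-in for Emps2013: drops Gender and adds Degree. */
DATA Emps2013;
   INPUT First $ HireYear Degree $;
   DATALINES;
Brett 2013 BS
Renee 2013 MS
;
RUN;

/* FORCE lets the append proceed even though Degree is not in the BASE= data set. */
PROC APPEND BASE = Emps
            DATA = Emps2013 FORCE;
RUN;

/* Gender should be missing for the 2013 rows, and Degree should not appear. */
PROC PRINT DATA = Emps;
RUN;
```

Running the same program without FORCE should reproduce the situation described earlier, where the log reports the problem and no observations are added.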

Another Example:

Consider the following data, which were recorded from the Rushford-Peterson boys’ basketball team. These data can be found in the RP Game 1 – RP Game 3 csv files on the course storage space.

Game1

Game2

Read the above data into SAS data sets named Game1 and Game2, respectively.


Then, run a PROC CONTENTS for each data set.

PROC CONTENTS DATA=Game1;

RUN;

PROC CONTENTS DATA=Game2;

RUN;

Next, run the following program to add the data from Game2 to the original Game1 data set:

PROC APPEND

BASE = Game1

DATA = Game2;

RUN;

Once again, PROC APPEND does not produce any output. The Log window can be used to verify that this procedure was not successful.

Question: Why was PROC APPEND not successful in this example?


The FORCE option can be used to overcome this problem:

PROC APPEND

BASE = Game1

DATA = Game2 FORCE;

RUN;

PROC CONTENTS DATA=Game1;

RUN;

Consider the following output from PROC CONTENTS.

Questions:

1. What is the name of the appended dataset?

2. How many observations are in the appended dataset?

3. How many variables are in the appended dataset?


4. Suppose that I accidentally submitted my program a second time (i.e., I hit the submit button again). Consider the upper portion of the PROC CONTENTS output and a print-out of the data for the Game1 dataset.

What is the effect of the second submission of this program?

To resolve this error, you can remove certain observations (via their observation number) using the automatic SAS variable _N_. This is shown next.

DATA GAME1;

SET GAME1;

IF _N_ >= 15 then DELETE;

RUN;

PROC PRINT DATA=GAME1;

RUN;

When is the FORCE Option Needed?

The FORCE option is needed when the DATA= data set contains variables that either

• are not in the BASE= data set (note that SAS drops these extra variables from the appended data)
• are longer than the corresponding variables in the BASE= data set (note that SAS truncates the values from the DATA= data set so that they fit into the length specified in the BASE= data set)
• do not have the same type as the corresponding variables in the BASE= data set (SAS replaces all values for the variable from the DATA= data set with missing values and keeps the variable type that was specified in the BASE= data set)
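The length case in the list above can be tried with a minimal sketch like the following. The data set names BaseDemo and NewDemo, the variables City and Sales, and the values are all hypothetical and chosen only to make the truncation visible.

```sas
/* Hypothetical base data set: City is stored with length 5. */
DATA BaseDemo;
   LENGTH City $ 5;
   INPUT City $ Sales;
   DATALINES;
Eyota 10
;
RUN;

/* Hypothetical data set to append: City is stored with length 12. */
DATA NewDemo;
   LENGTH City $ 12;
   INPUT City $ Sales;
   DATALINES;
Rushford 20
;
RUN;

/* FORCE is needed here because City is longer in the DATA= data set. */
PROC APPEND BASE = BaseDemo
            DATA = NewDemo FORCE;
RUN;

/* The appended City value should be cut to the first 5 characters. */
PROC PRINT DATA = BaseDemo;
RUN;
```

Checking the log after the PROC APPEND step is a good habit, since SAS reports there which variables were affected.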


Final Comments on the APPEND Procedure

Note that PROC APPEND works with only two data sets at a time in one step. Also, the observations in the base data set are not read, and the variable information in the descriptor portion of the base data set cannot change. We have a lot more flexibility when we use the SET statement, which is discussed in the next section.

Concatenating Data Sets with the Same Variables

To concatenate two or more data sets in SAS, we use the SET statement in the DATA step. For example, consider the data from Game 1 and Game 2 used previously in the handout.

PROC CONTENTS DATA=Game1;
RUN;

PROC CONTENTS DATA=Game2;
RUN;

The output from PROC CONTENTS:

Game1

Game2


The following code can be used to concatenate the data from Game 1 and Game 2 to create a new data set called Games.

DATA Games;

SET Game1 Game2;

RUN;

PROC CONTENTS DATA=Games; RUN;

The result:

Note that any number of data sets can be used in the SET statement. The observations from the first data set in the SET statement will appear first. The observations from the second data set follow, and so on.
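As a minimal sketch of concatenating more than two data sets, the following program builds three small made-up data sets and stacks them with one SET statement. The names Week1, Week2, Week3, and AllWeeks, the variables, and the values are hypothetical.

```sas
/* Three hypothetical data sets with the same variables. */
DATA Week1;
   INPUT Player $ Points;
   DATALINES;
Adams 12
Baker 7
;
RUN;

DATA Week2;
   INPUT Player $ Points;
   DATALINES;
Adams 9
Baker 14
;
RUN;

DATA Week3;
   INPUT Player $ Points;
   DATALINES;
Adams 11
;
RUN;

/* Observations are read in the order the data sets are listed. */
DATA AllWeeks;
   SET Week1 Week2 Week3;
RUN;

PROC PRINT DATA = AllWeeks;
RUN;
```

Reordering the names in the SET statement changes only the order of the observations in AllWeeks, not which observations it contains.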


Questions:

1. How is the following code different from what was shown above?

DATA Game1;

SET Game1 Game2;

RUN;

PROC CONTENTS DATA=Game1;

RUN;

2. Try running the following code.

DATA Games;

SET Game2 Game1;

RUN;

PROC CONTENTS DATA=Games; RUN;

PROC PRINT DATA=Games; RUN;

What is the result of this code?


Concatenating Data Sets with Different Variables

Recall that we also have data on a third game of R-P boys’ basketball. Read in the data for Game 3 and run the following PROC CONTENTS:

PROC CONTENTS DATA=Games; RUN;

PROC CONTENTS DATA=Game3; RUN;

Games

Game3

What do you notice about the variables in the two data sets?


Run the following code to concatenate all three data sets:

DATA Game123;

SET Game2 Game1 Game3;

RUN;

PROC PRINT DATA = Game123;

RUN;

The results are shown below:

Finally, note that the following code can be used to rename the variables in the Game3 data set so that all of the FG3 and FGA3 data will be read into the same columns.

DATA Game123;

SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));

RUN;

PROC PRINT DATA = Game123;

RUN;


PROC APPEND versus using the SET statement

• The data set that results from concatenating two data sets with the SET statement is the same as the data set that results from concatenating them with the APPEND procedure if the two data sets contain the same variables.
• The APPEND procedure concatenates much faster than the SET statement because it does not process the observations from the BASE= data set.
• The two methods differ when the variables differ between data sets.
   o PROC APPEND uses all variables in the BASE= data set and assigns missing values to observations from the DATA= data set where appropriate; it cannot include variables found only in the DATA= data set
   o The SET statement uses all variables and assigns missing values where appropriate
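The difference described above can be seen side by side with a minimal sketch such as the following. The data sets OldDemo and NewDemo, their variables, and their values are hypothetical.

```sas
/* Hypothetical data sets whose variables differ: Gender versus Degree. */
DATA OldDemo;
   INPUT ID Gender $;
   DATALINES;
1 F
2 M
;
RUN;

DATA NewDemo;
   INPUT ID Degree $;
   DATALINES;
3 BS
;
RUN;

/* SET keeps every variable: ID, Gender, and Degree. */
DATA ViaSet;
   SET OldDemo NewDemo;
RUN;

/* Copy OldDemo first so that PROC APPEND does not modify the original. */
DATA ViaAppend;
   SET OldDemo;
RUN;

/* PROC APPEND keeps only the BASE= variables: ID and Gender. */
PROC APPEND BASE = ViaAppend
            DATA = NewDemo FORCE;
RUN;

PROC PRINT DATA = ViaSet; RUN;
PROC PRINT DATA = ViaAppend; RUN;
```

Comparing the two printed data sets shows which variables each method keeps and where missing values are filled in.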


Interleaving Data Sets in SAS

Consider the data sets Game1, Game2, and Game3 from the previous examples. Suppose that these had already been sorted according to Number.


Note that when we concatenate these data sets, the resulting data set is no longer sorted by Number.

DATA Game123;

SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));

RUN;

PROC PRINT DATA = Game123 HEADING=vertical WIDTH=minimum;

RUN;

Of course, we could use the following code to perform this sort.

DATA Game123;

SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));

RUN;

PROC SORT DATA=Game123;

BY Number;

RUN;

PROC PRINT; RUN;


However, if the original data sets are already sorted, it is more efficient to preserve that order when combining the data sets. This can be accomplished by using a BY statement with a SET statement in the DATA step.

DATA Game123;

SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));

BY Number;

RUN;

PROC PRINT DATA = Game123 HEADING=vertical WIDTH=minimum;

RUN;

This is known as interleaving the data sets. Note that before you can interleave observations, the original data sets must be sorted by the BY variable(s).
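A minimal sketch of interleaving with made-up data follows. The data sets G1Demo and G2Demo and their values are hypothetical; both are entered already sorted by Number, so no PROC SORT step is needed first.

```sas
/* Two hypothetical data sets, each already in ascending order of Number. */
DATA G1Demo;
   INPUT Number Points;
   DATALINES;
3 10
12 4
21 8
;
RUN;

DATA G2Demo;
   INPUT Number Points;
   DATALINES;
3 6
15 2
21 11
;
RUN;

/* SET with BY interleaves the observations and keeps the sort order. */
/* Within each value of Number, rows from G1Demo come before G2Demo.  */
DATA BothDemo;
   SET G1Demo G2Demo;
   BY Number;
RUN;

PROC PRINT DATA = BothDemo;
RUN;
```

Removing the BY statement from the last DATA step turns the interleave back into a plain concatenation, which makes the two results easy to compare.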


Combining SAS Data Sets with a One-to-One Merge

Recall that merging data sets involves combining observations from two or more data sets into a single observation in a new data set.

In a one-to-one match merge, each observation in one data set is related to exactly one observation in the other data set(s). To see how this merge is successfully accomplished in SAS, suppose the original data sets were given as follows.

EmpsAU

PhoneH

Unsuccessful Attempt #1

First, try using the following code to merge the data sets:

DATA EmpsAUH;

MERGE EmpsAU PhoneH;

RUN;

PROC PRINT DATA=EmpsAUH; RUN;

What is the problem?


Unsuccessful Attempt #2

To get around the above problem, specify a BY variable. For example, try the following code:

DATA EmpsAUH;

MERGE EmpsAU PhoneH;

BY EmpID;

RUN;

PROC PRINT DATA=EmpsAUH; RUN;

Now, check the log window:

Note that observations must be sorted by the common variable(s) that are being matched; otherwise, the merge is unsuccessful.

Successful Attempt

Consider the following code:

PROC SORT DATA = EmpsAU; BY EmpID;

PROC SORT DATA = PhoneH; BY EmpID;

DATA EmpsAUH;

MERGE EmpsAU PhoneH;

BY EmpID;

RUN;

PROC PRINT DATA=EmpsAUH; RUN;


Note that if the data have been sorted in descending sequence, the following merge attempt is unsuccessful.

PROC SORT DATA = EmpsAU; BY DESCENDING EmpID;

PROC SORT DATA = PhoneH; BY DESCENDING EmpID;

DATA EmpsAUH;

MERGE EmpsAU PhoneH;

BY EmpID;

RUN;

To remedy this, use the DESCENDING option in the BY statement of the DATA step:

PROC SORT DATA = EmpsAU; BY DESCENDING EmpID;

PROC SORT DATA = PhoneH; BY DESCENDING EmpID;

DATA EmpsAUH;

MERGE EmpsAU PhoneH;

BY DESCENDING EmpID;

RUN;

PROC PRINT DATA=EmpsAUH; RUN;


Merging Data Sets with Identically Named Variables

Suppose the original data sets used above had been initially stored as follows:

EmpsAU

PhoneH

Note that both data sets contain a variable named First; however, Togar’s name is misspelled as “Togur” in the PhoneH data set. Suppose the data sets are merged with the following code:

PROC SORT DATA = EmpsAU; BY EmpID;

PROC SORT DATA = PhoneH; BY EmpID;

DATA EmpsAUH;

MERGE EmpsAU PhoneH;

BY EmpID;

RUN;

PROC PRINT DATA=EmpsAUH; RUN;

The resulting data set is shown below:

What did SAS do here?


Combining SAS Data Sets with a One-to-Many Merge

A one-to-many merge occurs when a single observation in one data set is related to more than one observation in another data set. For example, consider the following data sets:

EmpsAU

PhoneHW

Consider the following program and output:

PROC SORT DATA = EmpsAU; BY EmpID;

PROC SORT DATA = PhoneHW; BY EmpID;

DATA EmpsAUHW;

MERGE EmpsAU PhoneHW;

BY EmpID;

RUN;

PROC PRINT DATA=EmpsAUHW; RUN;
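For a version that can be run without the course files, here is a minimal sketch of a one-to-many merge. The data sets EmpDemo and PhoneDemo, their variables, and their values are hypothetical; both are entered already sorted by EmpID.

```sas
/* Hypothetical "one" side: one observation per employee. */
DATA EmpDemo;
   INPUT EmpID First $;
   DATALINES;
101 Ana
102 Ben
;
RUN;

/* Hypothetical "many" side: employee 101 has two phone numbers. */
DATA PhoneDemo;
   INPUT EmpID Type $ PhoneNo $;
   DATALINES;
101 Home 555-0101
101 Work 555-0102
102 Home 555-0103
;
RUN;

/* The values from EmpDemo are repeated for each matching phone record. */
DATA EmpPhoneDemo;
   MERGE EmpDemo PhoneDemo;
   BY EmpID;
RUN;

PROC PRINT DATA = EmpPhoneDemo;
RUN;
```

The merged data set should have one observation per phone record, with First carried along for every record that shares the same EmpID.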


Merging with Nonmatches

Consider the following data sets:

EmpsAU

PhoneC

Note that Employees 121152 and 121153 are listed in only one of the data sets; i.e., they have no match. Consider the following code and output:

PROC SORT DATA = EmpsAU; BY EmpID;

PROC SORT DATA = PhoneC; BY EmpID;

DATA EmpsAUC;

MERGE EmpsAU PhoneC;

BY EmpID;

RUN;

PROC PRINT DATA=EmpsAUC; RUN;

Note that the final result contains both the matches (observations with data from both input data sets) and the non-matches (observations with data from only one of the data sets).


Using the IN= Option

Suppose you wanted to eliminate the non-matches from the previous data set. This can be accomplished using the IN= data set option, which creates a temporary variable that indicates whether a data set contributed data to the current observation.

For example, consider the following code.

PROC SORT DATA = EmpsAU; BY EmpID;

PROC SORT DATA = PhoneC; BY EmpID;

DATA EmpsAUC;

MERGE EmpsAU (in=Emps)

PhoneC (in=PhoneNum);

BY EmpID;

RUN;

PROC PRINT DATA=EmpsAUC; RUN;

When you run the above program and look at the output, it is identical to what was obtained in the previous example. This is because the IN= option does not create new variables to be stored in the final data set; instead, the variables Emps and PhoneNum exist only during the DATA step. They can, however, be used to create other variables or for subsetting. For example, consider the following programs and resulting output:

DATA EmpsAUC;
MERGE EmpsAU (in=Emps)
PhoneC (in=PhoneNum);
BY EmpID;
IF Emps=1 and PhoneNum=1;
RUN;

PROC PRINT DATA=EmpsAUC; RUN;

DATA EmpsAUC;
MERGE EmpsAU (in=Emps)
PhoneC (in=PhoneNum);
BY EmpID;
IF Emps=1;
RUN;

PROC PRINT DATA=EmpsAUC; RUN;

DATA EmpsAUC;
MERGE EmpsAU (in=Emps)
PhoneC (in=PhoneNum);
BY EmpID;
IF PhoneNum=1;
RUN;

PROC PRINT DATA=EmpsAUC; RUN;
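The programs above use the IN= variables for subsetting. As a minimal sketch of the other use mentioned earlier, creating new variables, the following program copies the temporary IN= values into variables that are stored in the output data set. The data sets EmpsDemo and PhoneDemo, the variable names, and the values are hypothetical.

```sas
/* Hypothetical data sets: 101 has no phone record, and 103 has no employee record. */
DATA EmpsDemo;
   INPUT EmpID First $;
   DATALINES;
101 Ana
102 Ben
;
RUN;

DATA PhoneDemo;
   INPUT EmpID PhoneNo $;
   DATALINES;
102 555-0103
103 555-0104
;
RUN;

DATA Flagged;
   MERGE EmpsDemo (IN=InEmps)
         PhoneDemo (IN=InPhone);
   BY EmpID;
   /* IN= variables are temporary, so copy them into stored variables. */
   FromEmps  = InEmps;
   FromPhone = InPhone;
RUN;

/* FromEmps and FromPhone are 1 when that data set contributed to the row. */
PROC PRINT DATA = Flagged;
RUN;
```

Keeping the flags in the data set makes it possible to inspect the non-matches before deciding whether to drop them.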