SAS Essentials II: Better-Looking SAS for a Better Community AnnMaria De Mars, The Julia Group, Santa Monica, CA

ABSTRACT

Experienced programmers don't just write code that runs, they also look professional. That doesn't refer to their designer wardrobe (a quick glance around at your co-workers probably told you that) but to their code, log and output. In this paper, better-looking and better-designed programs are demonstrated using PROC FREQ and macros. SAS system level options are examined for their effectiveness in producing better-looking logs and output. PROC FORMAT, PROC TABULATE, PROC PRINT and ODS are used to create better-looking reports. SAS/Graph and Graph-N-Go are used to make good-looking graphs.

Now here's the catch - where do you get the time and opportunity to try out new techniques? The examples used here are from projects done for various community service organizations, from sports organizations to public schools. In the end, your program looks good, your output looks good and you've improved both your programming skills and your community. Use your SAS skills to produce reports for your child's sports league and you'll never have to chaperone in the freezing cold again! And who says nice guys finish last?

INTRODUCTION

One way to break out of that vicious circle of "can't get a job without experience, can't get experience without a job" is to take advantage of the many opportunities to use open data to help inform your community. “Open data”, that is, data freely available to anyone to use or publish, is available from an enormous range of government and non-profit sources. My personal favorite sources for open data are the websites of data.gov, the U.S. Census Bureau and the National Center for Education Statistics. These three sites alone provide several hundred thousand data sets to choose from. Applying your SAS skills to open data gives you experience with different data set types, statistical techniques and procedures as well as the potential for doing some good for your community.

This is the second of a three-paper series on SAS applied to open data. There is often substantial work involved in preparing a data set for analysis, and an earlier paper dealt with that process (De Mars, 2011a). We’re going to assume that work has already been done. The next step is to produce presentation-quality results.

EXAMPLE 1: AMERICAN COMMUNITY SURVEY - MAKING PRESENTATION-QUALITY TABLES

This example uses the 2009 American Community Survey Public Use Microdata Sample (U.S. Census, 2009). The Public Use Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). There are 3,030,728 records in the data set. Now you see an advantage of open data - it's often huge. You get the opportunity to use BIG DATA (or at least relatively big data). That is an experience everyone wants you to have, but people are rightfully a bit nervous about letting novices touch their data sets with millions of records because, well, they might need that data later, no?

Let’s begin with a simple procedure for a presentation to middle school students. We want to know how many people on the census forms selected “White” as their race, how many selected “Black”, how many checked both boxes and how many checked neither. We could do a PROC FREQ.

PROC FREQ DATA = lib.pums9 ;
TABLES racblk*racwht ;

No matter how much you love SAS you must confess that this produces some of the ugliest output ever.

The SAS System                      23:18 Saturday, August 6, 2011   1

The FREQ Procedure

Table of RACBLK by RACWHT

RACBLK(Race includes Black)     RACWHT(Race includes White)

Frequency|
Percent  |
Row Pct  |
Col Pct  |0       |1       |  Total
---------|--------|--------|
0        |3.188E7 |2.342E8 |2.661E8
         |  10.38 |  76.28 |  86.66
         |  11.98 |  88.02 |
         |  45.10 |  99.09 |
---------|--------|--------|
1        |3.881E7 |2148908 |4.095E7
         |  12.64 |   0.70 |  13.34
         |  94.75 |   5.25 |
         |  54.90 |   0.91 |
---------|--------|--------|
Total     7.068E7  2.363E8   3.07E8
            23.02    76.98   100.00


Not only is this table really ugly, it’s not even usable for a presentation. Most middle school students are not going to be able to interpret numbers in scientific notation. We’d like to get rid of the date, give the output a title other than “The SAS System” and have “Yes” and “No” show up for each race variable instead of 1 and 0. We’d also like to get rid of the scientific notation and have numbers as human beings read them, not calculators. According to the SAS Procedures Guide for Version 8 (SAS Institute, 1999), "When scientific notation is used, only the first few significant digits are shown. If you need more significant digits than PROC FREQ displays, create an output data set by specifying OUT= in the TABLES statement. Then use PROC PRINT and assign an appropriate format to the variable COUNT."

The later documentation has been silent on this issue, so if there is a simple way to get rid of scientific notation in PROC FREQ, I haven’t found it. We want to create an output data set anyway. Think about the fact that you are working with three million records. Charting, sorting, applying IF statements based on the value of 3,000,000+ records is inefficient. So, our challenge is to both become more efficient and change our output to the improved version below, in one and the same program.

Population Distribution by Race

Race includes Black | Race includes White | 2009 Population | Percent of Population
No                  | Yes                 |     234,175,873 | 76.3
Yes                 | No                  |      38,805,561 | 12.6
No                  | No                  |      31,876,214 | 10.4
Yes                 | Yes                 |       2,148,908 |  0.7

2009 American Community Survey Data

Here is the code to produce the table above. At first glance, it may seem an unreasonable amount of code for one little table, but there is a method at work here, trust me.

OPTIONS NODATE NONUMBER ;
PROC FORMAT ;
VALUE $YN "0" = "No" "1" = "Yes" ;
PROC FREQ DATA = lib.pums9 NOPRINT;
TABLES racblk* racwht / OUT = lib.blkwhitmix ;
WEIGHT pwgtp ;
PROC SORT DATA = lib.blkwhitmix ;
BY DESCENDING PERCENT ;
ODS RTF FILE = "C:\Users\AnnMaria\Documents\pc_pus\sasout\RaceDist.rtf" STYLE = OCEAN ;
TITLE "Population Distribution by Race" ;
FOOTNOTE "2009 American Community Survey Data" ;
PROC PRINT DATA = lib.blkwhitmix SPLIT = " " ;
ID racblk ;
VAR racwht COUNT PERCENT ;
FORMAT COUNT COMMA14. PERCENT 8.1 racblk racwht $yn. ;
LABEL COUNT = "2009 Population" PERCENT = "Percent of Population" ;

WHAT WE’RE DOING WITH THIS PROGRAM AND WHY WE ARE DOING IT

Starting from the top of our program:

OPTIONS NODATE NONUMBER ;

This removes the date and page number from the first line of our output.

PROC FORMAT ;

The format procedure begins with a PROC FORMAT statement.

VALUE $YN "0" = "No" "1" = "Yes" ;

The VALUE statement will create a format. You can have as many VALUE statements as you like. A few points to note here:

• A character format begins with a $


• Even if the values are numbers, if the variable to which your format is going to be applied is a character variable, you need to put those numbers in quotes, just like any time you are referencing a character variable’s value.

• Unlike other SAS names, a format name cannot end in a number.

• This format is temporary because I did not store it anywhere. Just like a temporary data set, when the SAS session ends, this format will be gone. For this reason, you probably want to give it some thought before applying a temporary format to variables stored in a permanent data set.

With these two statements, I have created a new temporary format. A format has to be defined before it can be used in your SAS program, which is why it is a good habit to put your FORMAT procedure before any other DATA or PROC steps.
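If the temporary format becomes a nuisance, one option is to store it permanently. The following is a minimal sketch, not part of the original program: it assumes the libref lib used throughout this paper, and the folder path in the LIBNAME statement is hypothetical.

```sas
/* Hypothetical path: point the libref at a folder you can write to */
LIBNAME lib "C:\myproject\sasdata" ;

/* LIBRARY = lib stores the format in the catalog lib.formats
   instead of the temporary catalog work.formats */
PROC FORMAT LIBRARY = lib ;
   VALUE $YN "0" = "No"
             "1" = "Yes" ;
RUN ;

/* Add lib.formats to the catalogs SAS searches for a format name */
OPTIONS FMTSEARCH = (lib) ;
```

In a later session, only the LIBNAME and OPTIONS FMTSEARCH statements would need to be run again before using $YN.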

PROC FREQ DATA = lib.pums9 NOPRINT;

The NOPRINT option doesn’t really matter in this case, but it’s a good habit to get into when you don’t need the printed output. With some variables, for example income, the procedure could produce thousands of lines of useless output.

TABLES racblk* racwht / OUT = lib.blkwhitmix ;

This TABLES statement will produce a cross-tabulation with the first variable being the row variable and the second one the column variable. Don’t forget the * . If you leave out the asterisk you’ll get two tables, one a frequency distribution of the variable racblk and a second a frequency distribution of the racwht variable. The OUT = option will write the counts and percentages to a data set.

WEIGHT pwgtp ;

Many open data sets come from surveys and include a weight variable, which you name in a WEIGHT statement. Don’t forget this!!! In the case of the American Community Survey, leaving off the WEIGHT statement means that your counts will be off by a factor of 101.
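One way to convince yourself of the effect of the weight is to run the same table twice. This is only a sketch for comparison, using a one-way table on racblk to keep the output short; it assumes the same lib.pums9 data set and pwgtp weight variable used above.

```sas
/* Unweighted: counts the records in the sample */
PROC FREQ DATA = lib.pums9 ;
   TABLES racblk ;
RUN ;

/* Weighted: each record counts pwgtp times,
   so the totals estimate the population */
PROC FREQ DATA = lib.pums9 ;
   TABLES racblk ;
   WEIGHT pwgtp ;
RUN ;
```

The first table should total the number of records in the file, and the second the estimated population.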

To see your output data set, click on the EXPLORER tab, double-click on the LIBRARIES tab, double-click on the name of the library, in this case, Lib, and then double-click on the data set. The data set created by the FREQ procedure can be seen in the window to the right.

PROC SORT DATA = lib.blkwhitmix ;
BY DESCENDING PERCENT ;

This step will sort the data set, listing the groups in order of their percentage, from highest to lowest.

ODS RTF FILE = "C:\Users\AnnMaria\Documents\pc_pus\sasout\RaceDist.rtf" STYLE = OCEAN ;

This statement opens an RTF file in the specified directory. The STYLE = option is optional, but I live by the ocean so I was feeling like an ocean style. The STYLE = option sets the colors and fonts. SAS has dozens of different styles to choose from with no more effort than changing the word after the STYLE = . If you’d like to see what sort of output styles like “Meadow”, “Harvest” and “Brick” produce, this site from Louisiana State University http://stat.lsu.edu/SAS_ODS_styles/SAS_ODS_styles.htm gives dozens of examples of SAS styles you can select.

TITLE "Population Distribution by Race" ; FOOTNOTE "2009 American Community Survey Data" ;

The TITLE and FOOTNOTE statements add a descriptive title at the top and a footnote at the bottom.


PROC PRINT DATA = lib.blkwhitmix SPLIT = " " ;

The SPLIT= option in the PROC PRINT statement will cause the labels for each variable to split and go to a new line whenever the character in the quotes is encountered. In this case, the label will start a new line after a blank space.

ID racblk ;
VAR racwht COUNT PERCENT ;

The ID statement will print the racblk variable first in each line rather than an observation number. The VAR statement lists, in order, the variables to print. These will come after the ID variable.

FORMAT COUNT COMMA14. PERCENT 8.1 racblk racwht $yn. ;

The FORMAT statement specifies that the count variable will have a width of 14, and include commas. The percent variable will have a width of 8 with one decimal place. Since there are two variables listed before the $yn format, both racblk and racwht will use the $yn format we created above. NOTE! I created a temporary format and I am using it in the PROC PRINT step. A format used in a PROC step does not permanently change the format of the stored variable, it only changes it for that step. As a general rule, avoid using temporary formats for permanent data sets if you can and you will run into fewer format error problems.

LABEL COUNT = "2009 Population" PERCENT = "Percent of Population" ;

The LABEL statement assigns a label to each variable, and because the SPLIT = option was used, these labels will be split to a new line between words.
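The program as listed opens the RTF destination but never shows it being closed. A minimal sketch of the two statements that would normally follow the PROC PRINT step, assuming the ODS RTF FILE = statement above is still in effect:

```sas
/* End the PROC PRINT step so its output is sent
   to the open destination */
RUN ;

/* Close the RTF destination so the file is completed
   and can be opened in a word processor */
ODS RTF CLOSE ;
```

Any procedures run between the ODS RTF FILE = statement and ODS RTF CLOSE are written to the same file, so in a longer program the close would go after the last piece of output wanted in that document.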

EXAMPLE 2: AMERICAN COMMUNITY SURVEY - GRAPHS

The first part of our presentation to students involves explaining the choices that statisticians make. Results are presented as unbiased reality but, in fact, at each point along the way, many decisions were made that involve judgment calls. The first decision we’ll have the students make is whether or not to keep those people who checked that their race was both “white” and “black”. To decide whether this group whose answer is “both” should be kept as a separate group, we create a bar chart and look at how they stack up relative to the rest of the population. As we can see below, this is a very small group relative to the other three groups.

The program to create this graph is shown below.

DATA byrace ;
SET lib.blkwhitmix ;
IF racblk = 1 AND racwht = 0 THEN Race = "Black" ;
ELSE IF racblk = 0 AND racwht = 1 THEN Race = "White" ;
ELSE IF racblk = 0 AND racwht = 0 THEN Race = "Other" ;
ELSE IF racblk = 1 AND racwht = 1 THEN Race = "Mixed" ;
PERCENT = PERCENT/ 100 ;
RUN;
AXIS1 LABEL = ( ANGLE = 90 "Percent") ORDER = (0 to 1 by .1 ) ;
AXIS2 ORDER = ("White" "Black" "Mixed" "Other" ) ;
PATTERN1 COLOR = BLACK ;
PATTERN2 COLOR= GRAY ;
PATTERN3 COLOR = BROWN ;
PATTERN4 COLOR=WHITE ;
PROC GCHART DATA=byrace ;
VBAR Race / raxis = axis1 maxis = axis2 SUMVAR= percent TYPE=SUM OUTSIDE= SUM PATTERNID = MIDPOINT ;
LABEL Race = "Race" ;
FORMAT percent percent8.1 ;

At first glance, this really seems a bit of overkill. Why not just open Excel, type the numbers in and be done with it? There are (as we’ll see in the last example in this paper) simpler options for output. The reason for going to this extent with the American Community Survey is that we are going to produce a lot of output: bar charts, pie charts, tables and scattergrams. Setting up the first chart requires some effort but, as you will see, the options stay set throughout the program, and at each step, less and less effort is required.

WHAT WE’RE DOING WITH THIS PROGRAM AND WHY WE ARE DOING IT

We want to create a new variable, race, based on the survey respondents’ checked answers to the two boxes for black and white, that is the variables racblk and racwht. Here is the first time we’ll be glad we created an output dataset. Rather than perform the logic in the DATA step for the 3,030,728 records in the PUMS data set, we only need to do it for the four records in the output file from the frequency procedure.

DATA byrace ;
SET lib.blkwhitmix ;
IF racblk = 1 AND racwht = 0 THEN Race = "Black" ;
ELSE IF racblk = 0 AND racwht = 1 THEN Race = "White" ;
ELSE IF racblk = 0 AND racwht = 0 THEN Race = "Other" ;
ELSE IF racblk = 1 AND racwht = 1 THEN Race = "Mixed" ;
PERCENT = PERCENT/ 100 ;
run;

In the data set saved by the frequency procedure, PERCENT is not saved as a decimal, rather, 40.1% is saved as 40.1. For later use, we want that to be an actual decimal, so divide it by 100.

Note that the OPTIONS, TITLE and FOOTNOTE statements earlier in our program set the title and footnote and removed the page number and date. We don’t need to do it again. All of these - TITLE, FOOTNOTE, OPTIONS - will remain the same throughout our program unless we use another TITLE, FOOTNOTE or OPTIONS statement to change them.

These next six statements will also apply to any relevant output throughout our program.

AXIS1 LABEL = ( ANGLE = 90 "Percent") ORDER = (0 to 1 by .1 ) ;

The first part of this statement labels the axis with the text in quotes, in this case “Percent”. It also sets the angle for the axis label rotation to be 90 degrees, in other words, it prints sideways instead of at the end of the axis. The ORDER = option specifies that the axis minimum will be 0 and maximum will be 1 with tick marks at .1 . Without the ORDER = , the axis is set based on the data. In this case, it would have had a maximum of 80%.

AXIS2 ORDER = ("White" "Black" "Mixed" "Other" ) ;

This statement specifies the order to display the categories. Because the main point of this chart is for the students to use in discussing whether the “Mixed” group should be included in a comparison of black and white survey respondents, we wanted it put right after the black and white bars. The ORDER = option forces that order. Without this option, the responses would have been in alphabetical order.

PATTERN1 COLOR = BLACK ; PATTERN2 COLOR= GRAY ; PATTERN3 COLOR = BROWN ; PATTERN4 COLOR=WHITE ;

Without PATTERN statements, SAS will pick the colors by default. It makes a more effective graphic to have the bar representing “Black” respondents colored black and the one representing “White” respondents colored white. NOTE! If you look back at the graph, this can be confusing, since the first bar is white, the second is black, the third gray and the fourth brown. This doesn’t seem to match the PATTERN statements. It does, however, if you consider that the PATTERN statements are separate from the AXIS statements. The patterns are assigned to the values of Race in alphabetical order, not in the order set by AXIS2. Since “Black” is first alphabetically, it is assigned the color from PATTERN1.

PROC GCHART DATA=byrace ;

This statement begins the GCHART procedure, using the data set byrace.


VBAR Race / raxis = axis1 maxis = axis2 SUMVAR= percent TYPE=SUM OUTSIDE= SUM PATTERNID = MIDPOINT ;

Note! This is all one statement. The VBAR statement will create a vertical bar chart. The variable to be charted is race. Stop right here and you’ll get a no-frills bar chart. We, however, would like a lot of frills, which is what all of the options after the “/” will give us.

The RAXIS = option specifies the AXIS1 statement be used for the response axis. Why a separate AXIS statement? Doesn’t this seem silly? Why couldn’t you just say X and be done with it here and not have a separate statement? If you think about this for a moment, you’ll realize that if you changed your mind and wanted a horizontal bar chart, all of a sudden the response is going to be the Y axis. Note! If you don’t specify to use axis1 or axis2 on this statement, they will not be used and SAS will use the defaults for those axes.

The MAXIS = option specifies the AXIS2 statement be used for the midpoint axis (that would be the ones with the categories, or midpoints).
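To illustrate why the axes are named by role rather than by X and Y, here is a hedged sketch of the horizontal version of the same chart. It is not from the original project. It assumes the byrace data set and the AXIS and PATTERN statements above have already been run in the same session; NOSTATS is added here to suppress the table of statistics that HBAR prints beside the bars by default.

```sas
/* Same chart turned on its side: the response axis (raxis) is now
   horizontal and the midpoint axis (maxis) is now vertical, yet the
   axis1 and axis2 assignments do not change */
PROC GCHART DATA = byrace ;
   HBAR Race / raxis = axis1 maxis = axis2
               SUMVAR = percent TYPE = SUM
               NOSTATS PATTERNID = MIDPOINT ;
   LABEL Race = "Race" ;
   FORMAT percent percent8.1 ;
RUN ;
QUIT ;
```

The ANGLE = 90 in AXIS1 was chosen for a vertical axis, so for a horizontal chart you would probably want to remove it from the AXIS1 statement.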

The SUMVAR = option specifies that the value charted for race is the summary of the given variable. In this example, it is the “percent” variable, but any numeric variable in the data set could be used. Without the SUMVAR = option, the value for every category would be 1 because we are analyzing the data that came from the frequency procedure and that we recoded in our DATA step above. There is only one record for each race category. Note! SUMVAR = does not stand for “sum” but rather for “summary”. Several types of summary statistics can be specified.

The TYPE = option specifies the type of summary statistic to use. Together with the SUMVAR = option, the TYPE = option causes SAS to chart for each category the sum of the percent variable. Since there is only one record for each category, the “sum” charted will be the percent given on that record.

OUTSIDE = SUM causes SAS to print the value of the sum outside of each bar.

PATTERNID = MIDPOINT assigns patterns based on the value of the midpoint. Given that this is a categorical variable, the midpoint is each category value. As noted above, the patterns will be assigned in alphabetical order.

LABEL Race = "Race" ;
FORMAT percent percent8.1 ;

These two statements are pretty obvious. LABEL determines the label printed for race. FORMAT uses the percent format for the percent variable.

ADDING A PIE CHART

After all of this work, we’re pretty satisfied, but in discussions with the teacher, we decide it would be useful to have a pie chart and talk to the students about the value of different types of graphics in answering different questions. In the first chart, we were really interested in deciding if it would make a big difference whether or not we included the people who had checked both black and white for race. Our second question involves what percentage of the population is composed of black and white races combined relative to all other races. We decide that a pie chart would make a good graphic for this.

Since we already have the TITLE, FOOTNOTE, OPTIONS, LABEL, PATTERN, PROC GCHART and FORMAT statements written, we only need to add one statement to create our pie chart. We don’t even need a new procedure. Both PIE and VBAR statements can be used in the same procedure. Notice we are still using the four records created in our original frequency procedure. We haven’t touched the 3,000,000 plus records since step one.

The one statement is

PIE Race / NOHEADING ASCENDING SUMVAR= percent TYPE= SUM ;

PIE is, no surprise here, the statement to create a pie chart. All that is required is PIE variable-name. However, as usual, we’d like a few options.

NOHEADING removes the default heading, which in this case would be “Sum of Percentage by Race”. That’s unnecessary information.

ASCENDING will order the pie slices by size. Because our main question to be answered here is, “How much of the population is black or white, versus everything else?”, I want these two largest slices to be together.

The pie chart this statement produces is shown below. There are other options we could have used, for example EXPLODE = "Other" to pull out the “Other” slice and see how much of the pie was left. Very good advice on using some of these options to produce communication-effective pie charts can be found in the presentation by that name from Bessler (2007).
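For reference, a sketch of what the PIE statement might look like with the EXPLODE = option mentioned above added. This variant was not run for the paper; it assumes the byrace data set created earlier.

```sas
/* EXPLODE = pulls the named slice away from the rest of the pie */
PROC GCHART DATA = byrace ;
   PIE Race / NOHEADING ASCENDING
              SUMVAR = percent TYPE = SUM
              EXPLODE = "Other" ;
   FORMAT percent percent8.1 ;
RUN ;
QUIT ;
```

The value in quotes has to match the value of Race exactly, including capitalization.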


EXAMPLE 3: AMERICAN COMMUNITY SURVEY - PROC TABULATE

Still in our discussion with middle school students on race in America, we ask them if they think people are biased in who they choose to date or marry (see De Mars, 2011). To help them answer this question, we’d like a table that shows race by gender, and the percentage of possible dating / marriage partners in the population if, in fact, dating and marriage occurred without respect to race. Here is the program to produce the results we want.

PROC FORMAT ;
VALUE $YN "0" = "No" "1" = "Yes" ;
VALUE $sex "1" = "Male" "2" = "Female" ;
PROC FREQ DATA = lib.pums9 ;
TABLES sex*racblk / OUT = lib.blkwhtsex ;
TABLES st*racblk / OUT = lib.blkwhtst ;
WEIGHT pwgtp ;
WHERE racblk = "1" OR racwht = "1" ;
run ;
%MACRO mkrace(dsn) ;
DATA &dsn ;
SET lib.&dsn ;
IF racblk = 1 THEN Race = "Black" ;
ELSE IF racblk = 0 THEN Race = "White" ;
Percent = percent/ 100 ;
RUN;
%MEND mkrace ;
%mkrace(blkwhtsex) ;
PROC TABULATE DATA = blkwhtsex ;
CLASS race sex ;
VAR count percent ;
TABLE race* sex ALL, count*(SUM= ' '*F=COMMA12.0) percent*(SUM = ' '*F=PERCENT8.1) ;
LABEL Count = "2009 Population" Percent = "Percent" ;
FORMAT sex $sex. ;

WHAT WE’RE DOING WITH THIS PROGRAM AND WHY WE ARE DOING IT

First, go back to the PROC FORMAT and create one more format, for sex. The PROC FORMAT is now:

PROC FORMAT ;
VALUE $YN "0" = "No" "1" = "Yes" ;
VALUE $sex "1" = "Male" "2" = "Female" ;


Having learned from our earlier experience, the first thing we are going to do is create an output data set from the frequency procedure.

PROC FREQ DATA = lib.pums9 ;
TABLES sex*racblk / OUT = lib.blkwhtsex ;
TABLES st*racblk / OUT = lib.blkwhtst ;
WEIGHT pwgtp ;
WHERE racblk = "1" OR racwht = "1" ;
run ;

Remember the WEIGHT statement!!! Even though I said that above, I’m reminding you here because forgetting the WEIGHT statement is a common novice mistake and, in this case, it will cause your results to be, literally, 10,000% wrong. That’s a lot of wrong. The second TABLES statement is not needed for the PROC TABULATE output at the end, but since we’ll need the data set of race by state later, we went ahead and created it in this step. In the actual project, there were a lot of TABLES statements in this step.

The only new statement here from our first step is the WHERE statement. Having concluded our discussion above we have decided to drop out the “Other” category and include those who had checked both categories. We also decided to consider everyone who checked “Black” for their race as black, whether they also checked “White” or not. If some students disagree with us, that is good because the point of this whole project with the schools is to get them talking and thinking about statistics. If they think our designation is wrong or unfair, this is going to be the most passion they’ve ever had about statistics.

We’re going to do the same two things with a lot of data sets, because now we have made two decisions. The first is to use the output from PROC FREQ for analysis, so we’re going to be dividing that percent variable by 100 each time. The second is to categorize people who selected “Black” as their race as black. Whenever you find yourself doing the same bit of code over and over, think about creating a macro. Macro programming is not nearly as scary as some people make it out to be. The trick is to start early in your career with very simple macros and just get progressively more complex. The example below, with only one macro parameter, is about as simple as you can get. Although we only use it one time in this paper, in the actual project, we used it over and over.

Let’s look at this macro line by line.

%MACRO mkrace(dsn) ;

This creates a macro named mkrace and specifies that this macro will require one parameter, which is named dsn.

DATA &dsn ;

This statement creates a data set. When the macro is run &dsn will be replaced with whatever we provided when we called the macro.

SET lib.&dsn ;

This statement reads in a data set from the library referenced by “lib” and named whatever value I had supplied for &dsn.

IF racblk = 1 THEN Race = "Black" ;
ELSE IF racblk = 0 THEN Race = "White" ;
Percent = percent/ 100 ;
RUN;

These are just IF, ELSE and assignment statements like every other IF, ELSE and assignment statement you have written in your life. The fact that they occur in the middle of a macro makes no difference whatsoever.

%MEND mkrace ;

This ends the definition of the mkrace macro. Now, to call this macro, all I need to do is:

%mkrace(blkwhtsex) ;
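When you are new to macros, it helps to see the statements a macro actually generates. The sketch below is an illustration rather than part of the original project code: it turns on the MPRINT system option and calls the macro for the second data set created by the frequency procedure, lib.blkwhtst. It assumes the mkrace macro has already been defined in the session.

```sas
/* MPRINT writes the statements the macro generates to the log */
OPTIONS MPRINT ;

/* Reads lib.blkwhtst and creates the temporary data set blkwhtst */
%mkrace(blkwhtst) ;

/* Turn the option back off to keep later logs short */
OPTIONS NOMPRINT ;
```

In the log, each generated DATA step statement should appear with &dsn already replaced by blkwhtst.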

Before moving on to the next procedure, let’s recap what we did here, because it’s important. We used a PROC FREQ to create a couple of permanent SAS data sets. The first one, blkwhtsex, included four records. We read this tiny data set and created a new, temporary data set, also four records, with a new variable, “race”, and the variable “percent” now in a decimal format.

Your mileage may vary. There are a couple of choices I made here for reasons of my own. I mention these choices because part of becoming an experienced programmer is making decisions and judgments. Even if your decision is to copy an example, you should know why the example includes the specific choices it does. Here are the choices I made and why.

• I did not make the libref a macro parameter. I always used “lib” in the LIBNAME statement in this project because it saves me having to specify a library as well as a data set name when I use a macro. To see how to specify both the library and data set, see the earlier paper (De Mars, 2011a).

• Could I have just created a format using PROC FORMAT for race? Yes. The reason I chose not to do that is,


o This is a temporary data set with four records. The time and storage to read every record and create a new variable is as close to nothing as one could get. Thus, the advantage of using PROC FORMAT in many cases, that is, it is faster and takes up less storage space than creating a new variable, is really irrelevant, and

o I am going to use this race variable a lot. The odds of my forgetting to apply the format at some point and having to re-run the analysis to produce some output are high. Given this, it’s less trouble for me to create the macro.
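For comparison, the PROC FORMAT route that was not taken could be sketched roughly as follows. This is only an illustration: the format name $blkrace and the data set demo are made up, racblk is assumed to be a character variable coded "1" and "0", and the two counts are simply the Black and White totals from the table in this example.

```sas
/* Hypothetical format; the name $blkrace is invented for this sketch */
PROC FORMAT ;
   VALUE $blkrace "1" = "Black"
                  "0" = "White" ;
RUN ;

/* Tiny stand-in data set: one record per value of racblk */
DATA demo ;
   INPUT racblk $ count ;
   DATALINES ;
1 40954469
0 234175873
;
RUN ;

/* The format has to be applied in every step that should show the labels */
PROC FREQ DATA = demo ;
   TABLES racblk ;
   WEIGHT count ;
   FORMAT racblk $blkrace. ;
RUN ;
```

The FORMAT statement in the last step is exactly the line that is easy to forget, which is the trade-off described above.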

Now, we’re going to do a PROC TABULATE using this temporary data set:

PROC TABULATE DATA = blkwhtsex ;

This statement is pretty obvious. It begins the TABULATE procedure, using the data set blkwhtsex we created with our macro.

CLASS race sex ;

The class statement specifies the classification or categorical variables that we’ll use for our table. All variables used in a table must have been specified in either a CLASS or a VAR statement. Variables in a CLASS statement can be either character or numeric.

VAR count percent ;

These are numeric variables that will be used in an analysis.

The TABLE statement takes the form

TABLE row-variables , column-variables ;

You can also have a page variable, not included in this example. The last set of variables specified will be the column variables.

Just like the frequency procedure, crossing two variables with an * means that these variables will be cross-classified. Without the asterisk, results will be produced for each variable separately. The keyword ALL requests that statistics be produced for the total population. Statistics and format for a variable are specified by an * followed by the format or statistic. To specify multiple statistics or formats, you can use parentheses.

count*(SUM= ' '*F=COMMA12.0)

is the same as

count*SUM= ' '*F=COMMA12.0

TABLE race* sex ALL,
      count*(SUM= ' '*F=COMMA12.0) percent*(SUM = ' '*F=PERCENT8.1) ;

The first part of our TABLE statement, then, requests statistics for race by sex and for the total population. The second part specifies the first column variable will be count, with the SUM statistic, and this statistic will not have a label over it, that is, the label text is a blank space. The format will be a width of 12, 0 decimal places and commas. The second column variable will be percent, with the SUM statistic, again, no label, and in a percent format with one decimal place.

LABEL Count = "2009 Population" Percent = "Percent" ;
FORMAT sex $sex. ;

These last two statements should be familiar from above. These just define the labels for the two column variables and specify the format for sex, which uses the $sex format we created in the PROC FORMAT.
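As a rough, self-contained sketch of the whole step, the four records can be typed in by hand and run through the same TABULATE statements. The counts and proportions below are copied from the table in this example; the coding of sex as "1" and "2" in the $sex format is an assumption made for this illustration, since the original PROC FORMAT step is not repeated here.

```sas
/* Assumed coding for sex; the paper's own $sex format is defined earlier */
PROC FORMAT ;
   VALUE $sex "1" = "Male"
              "2" = "Female" ;
RUN ;

/* Hand-entered stand-in for the data set the macro creates */
DATA blkwhtsex ;
   LENGTH race $ 5 sex $ 1 ;
   INPUT race $ sex $ count percent ;
   DATALINES ;
Black 1 19565078 0.071
Black 2 21389391 0.078
White 1 115771666 0.421
White 2 118404207 0.430
;
RUN ;

PROC TABULATE DATA = blkwhtsex ;
   CLASS race sex ;                 /* categorical variables */
   VAR count percent ;              /* analysis variables    */
   TABLE race*sex ALL,              /* rows: race by sex, plus a total row */
         count*(SUM=' '*F=COMMA12.0)
         percent*(SUM=' '*F=PERCENT8.1) ;
   LABEL count = "2009 Population" percent = "Percent" ;
   FORMAT sex $sex. ;
RUN ;
```

Because percent is stored as a proportion, the PERCENT8.1 format is what turns 0.071 into 7.1% in the finished table.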

Population Distribution by Race

Race     Sex       2009 Population    Percent
Black    Male           19,565,078       7.1%
Black    Female         21,389,391       7.8%
White    Male          115,771,666      42.1%
White    Female        118,404,207      43.0%
All                    275,130,342     100.0%

2009 American Community Survey Data

Here is our table. Since this is part of the same ODS RTF FILE we specified at the beginning, it is still using the STYLE=OCEAN. It still uses the same title and footnote specified at the beginning. This is actually a good thing. Having all of your tables match in style gives your presentation a more professional appearance.


EXAMPLE 4: AMERICAN COMMUNITY SURVEY - MAKING A MAP

Our final demographic graphic with the American Community Survey is a map. After trying several possible types of maps (not shown here), it seemed that the best graphic would be a map of the United States with the states shaded by the percentage of their population that is African-American. Before showing this graph, students are asked to guess which states have the highest and lowest percentage of African-Americans in their population.

To force the percentages to fit specific categories, another VALUE statement was added to the PROC FORMAT at the top of our program.

VALUE grays LOW - .002 = "<2%"
            .003 - .005 = "3-5%"
            .006 - .009 = "6-9%"
            .010 - high = "10 - 12%" ;

The syntax LOW - some number = “formatted value”

assigns the formatted value on the right hand side of the equals sign to all of the values from the minimum in the data set to the specified number. Similarly,

some number - HIGH = “formatted value”

will assign the values from the specified number to the maximum value. The rest of the program is:

%mkrace(blkwhtst) ;

DATA blkwhtst ;
   SET blkwhtst ;
   STATE = INPUT(st,BEST8.) ;
   WHERE racblk = "1" ;
   pct = ROUND(percent,.001) ;
RUN ;

TITLE "African-American Population by State " ;
TITLE2 "By Percent" ;
PATTERN1 COLOR = White ;
PATTERN2 V=M3N45 color=black;
PATTERN3 COLOR= Gray ;
PATTERN4 COLOR= Black ;

PROC GMAP DATA = blkwhtst MAP = MAPS.US ;
   ID STATE ;
   CHORO pct / DISCRETE STATISTIC=MEAN ;
   WHERE STATE NE 72 ;
   FORMAT pct grays. ;
   LABEL pct = "Percentage African-American" ;
RUN ;
QUIT ;


WHAT WE’RE DOING WITH THIS PROGRAM AND WHY WE ARE DOING IT

%mkrace(blkwhtst) ;

This calls the macro we created above (remember our macro?), creates a temporary data set named ‘blkwhtst’, sets the value of percent to a decimal and creates a variable named ‘race’. It’s reading in the permanent data set named blkwhtst that we created in the PROC FREQ step earlier.

WHERE DO WE GET THE MAP?

SAS 9.2 ships with a library of map data sets. In the normal installation, these maps will be stored in a library clearly labeled “maps”. You don’t have to assign this library or do anything. You should be able to look in your explorer window and see it. Go ahead, try it. If you open the US data set, and select COLUMN NAMES from the VIEW menu, you’ll see that you have a variable named STATE whose values match the st variable in our own data set, blkwhtst, except for one small problem. The STATE variable in MAPS.US is numeric. You can tell this by the fact that it is right-justified, while character variables, such as STATECODE, are left-justified. You know that your st variable is a character variable because you did the PROC CONTENTS when you were running all of the data quality checks on your open data set. (What? You didn’t do the data quality checks? Go back and read the paper on data quality (De Mars, 2011a) right now!)

Before I can use the MAPS.US data set I need to have a variable in my data set that matches it. This next step creates a variable to match the STATE variable in the MAPS.US data set. It also does a little more clean up of the data set while we’re at it.

DATA blkwhtst ;
SET blkwhtst ;
STATE = INPUT(st,BEST8.) ;

After reading in the data from the blkwhtst data set, our assignment statement creates a new, numeric variable, STATE. The INPUT function reads the st variable with a numeric informat. If there were character values in this field, that could cause problems, but there are no characters, just numbers 1 - 72.

where racblk = "1" ;

Because I want to map the percentage of African-American residents in each state, I only need to keep the records where the respondent checked “black” as his or her race.

pct = ROUND(percent,.001) ;

The use of the ROUND function will round the variable, percent, to the nearest .001. Without this, SAS would map each value of percent with a different color. There is another reason for creating a new variable here. PERCENT is a keyword in the GMAP procedure. It is generally both a bad idea and confusing to use keywords as variable names.
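A quick way to convince yourself what these two assignment statements do is to try them on a single made-up record in a DATA _NULL_ step. The values "06" and 0.12345 below are invented for illustration and are not taken from the ACS data.

```sas
DATA _NULL_ ;
   st = "06" ;                      /* made-up character state code      */
   state = INPUT(st, BEST8.) ;      /* character "06" becomes numeric 6  */

   percent = 0.12345 ;              /* made-up proportion                */
   pct = ROUND(percent, .001) ;     /* rounded to the nearest .001       */

   PUT state= pct= ;                /* writes state=6 pct=0.123 to the log */
RUN ;
```

Nothing is written to a data set here; the PUT statement just prints the converted values to the log.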

TITLE "African-American Population by State " ;
Title2 "By Percent" ;

Now I need to change the title. This will replace the previous TITLE statement and add a second title line underneath. Notice that the footnote on the graph stays the same, since there is no new FOOTNOTE statement.

PATTERN1 COLOR = White ;
PATTERN2 V=M3N45 color=black;
PATTERN3 COLOR= Gray ;
PATTERN4 COLOR= Black ;

The previous patterns started with black. The patterns are used from the lowest percentage of our variable to be graphed, African-American residents, to the highest. It would be confusing to have the states with the lowest percentage of African-Americans black and those with the highest percentage shown on the chart in white. PATTERN1, the states with the lowest percentage, will now show up in white. We need something between white and gray, though. The V = option on the PATTERN statement gives a value for shading. The default, without the V= option, is solid. In this case, it doesn’t really matter what value we select, as long as it isn't solid, so the M3N45 pattern was as good as any other. You can find a list of pattern types in more detail than you could possibly ever want to know in the SAS/GRAPH(R) Reference Guide (SAS Institute, 2011).

PROC GMAP DATA = blkwhtst MAP = MAPS.US ;

The GMAP procedure will create a map using the data from the blkwhtst data set and the mapping data from MAPS.US.

ID STATE ;

This is the variable that defines the map area. It must be in both the DATA = and MAP = data sets and the name, type and length must match in both data sets. We have no worries, because we made sure of that in our DATA step above.

CHORO pct / DISCRETE STATISTIC=MEAN ;

The CHORO statement assigns patterns to the map area based on the formatted value of the variable given, in this case, pct.

The DISCRETE option specifies that a separate color and pattern be used for each discrete response. Without this, GMAP treats a numeric variable such as pct as continuous and groups its values into ranges of its own choosing. We don’t want that; we want different patterns just for the four different categories.

WHERE STATE NE 72 ;

If you peek into the MAPS.US data set, you’ll notice that 72 is Puerto Rico. In this particular analysis, we didn’t want Puerto Rico, so we dropped it. Do not get clever and put WHERE STATE LE 50. The state codes are not the numbers 1 through 50, so that will drop the people in Virginia through Wyoming from your data and make them angry.

FORMAT pct grays. ;

This applies the format we created above so that our data fall into four categories and also so that the text from the format shows in the legend.

Label pct = "Percentage African-American" ;

This labels our variable for the legend at the bottom of the page.
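Before drawing the map, one way to check how a format like grays sorts values into categories is to apply it with the PUT function in a DATA _NULL_ step. The sketch below repeats the VALUE statement from this example; the four raw values are made up. It also illustrates why rounding to .001 matters here: the ranges leave small gaps (for instance between .002 and .003), and an unrounded value that lands in a gap matches no range and comes back as a plain number rather than a category label.

```sas
PROC FORMAT ;
   VALUE grays LOW - .002  = "<2%"
               .003 - .005 = "3-5%"
               .006 - .009 = "6-9%"
               .010 - HIGH = "10 - 12%" ;
RUN ;

DATA _NULL_ ;
   /* Made-up proportions; .0027 falls in the gap between two ranges */
   DO raw = .0004, .0027, .0046, .0118 ;
      pct = ROUND(raw, .001) ;
      unrounded = PUT(raw, grays.) ;   /* category of the raw value     */
      rounded   = PUT(pct, grays.) ;   /* category of the rounded value */
      PUT raw= unrounded= rounded= ;
   END ;
RUN ;
```

Comparing the two columns in the log shows which raw values would have ended up as their own "category" on the map without the ROUND step.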

EXAMPLE 5: A DIFFERENT TYPE OF OPEN DATA & ODS STATISTICAL GRAPHICS

This next example is the complete opposite of the first four. Rather than a data set with millions of records and hundreds of variables, it uses 21 records and three variables. The data are not from the federal government but rather from a non-profit sports organization. The target audience is not middle school students but adults. The purpose of this analysis was not to present information about statistics and the work of statisticians, but rather, to answer two specific questions. The only common factors between these two analyses are that both used open data freely available to anyone with an Internet connection and both involved presentation of information to a non-technical audience.

There had been some discussion regarding whether the number of competitors in judo in the U.S. was declining or not. This may seem like a simple question, but there are several different organizations that register judo competitors. To complicate matters further, the national championships had added new divisions several times. Originally, the U.S. national championships were contested only in the male and female weight divisions that competed in the Olympics. Over the years, separate divisions were added for competitors over 35 years of age, for visually-impaired competitors and other categories not contested in the Olympics. Also, any observed variable is going to fluctuate over time. The question was whether this year-to-year fluctuation masked a significant downward trend. The data were posted to a forum on judo with the request that anyone with expertise be kind enough to analyze the data and report back.

Reading the data into SAS was a simple matter of writing an INPUT statement, followed by a CARDS statement, pasting the data, and adding a semi-colon at the end. Yes, a DATALINES statement would have worked as well as a CARDS statement and yes, anyone who still uses the CARDS statement is old.

DATA competitors ;

INPUT year females males ;

CARDS ;

1990 118 267

1991 84 210

....

2011 66 146

;

The first chart I need to produce to answer this question is very simple and I am going to use Graph-N-Go which seems to be custom made for really simple questions. A surprising number of SAS programmers don’t even know they have Graph-N-Go available, but it’s right there next to the RUN menu under SOLUTIONS.


When you select Graph-N-Go, a new window will pop up with an icon you are supposed to recognize as a SAS dataset.

Click on that and a new window will pop up with the words SAS dataset at the top, an empty box and, next to it, a button with three dots, causing you to ask yourself, “What the heck am I supposed to do now?” The answer is to click on the button with the three dots. Click on that and yet another window will pop up.

The next window should look familiar. It has the libraries available to you in the left pane, including the WORK library, SASUSER, MAPS and any libraries you might have defined with a LIBNAME statement. Select the library you want to use. Then, in the right pane, select the dataset you want to use. In this case we are going to select the “WORK” library and the dataset named “competitors”.

On the left of the window are several buttons. We want a line plot, so we’re going to click on the line plot and drag it to the large pane in the bottom right. An empty box appears with the title Plot 1. We right-click on the empty box and from the drop-down menu select PROPERTIES.

In the pop-up window is a drop-down menu with the title DATA MODEL. By this point we are wondering if it might be easier to learn SAS/GRAPH after all, but we forge ahead, selecting from the drop-down menu the one dataset that we identified previously, work.competitors.

There are five tabs at the top of the PROPERTIES window: “General”, “Data”, “Titles/Footnotes”, “Appearance” and “Object Size”. We’re going to click on the DATA tab and from the drop-down menu next to X, select “Males” as the variable that we want to plot and under Y, we’ll select “Year”. We’ll also select REGRESSION from the drop-down menu under PLOT STYLES.


We’ll click the TITLES tab and give a title for the plot.

We click OK and the chart below is produced. If we didn’t like the size, we could right-click on it, select Grow/Shrink and then drag on the side of the plot to increase or decrease its size, or check the box next to MAXIMIZE, which will enlarge the graph to fit the window.

We click on MAXIMIZE and are happy with this size, so we simply right-click on the chart, pick EXPORT and from the options select “External File”. There will be a pre-filled default directory, name and type, something like:

C:\Users\Yourname\My SAS files\9.2\males.bmp

If you want to change any of that, to the right is the ubiquitous box with the three dots again. Click on that and a new pop-up window will allow you to change the folder, file name and type.

Here is our plot and it seems pretty clear that there is a downward trend. The middle line is our regression line, showing that the prediction is a straight line downward. The two dashed lines are the confidence intervals.
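For readers who would rather stay in the program editor, a similar plot can be sketched in code with PROC SGPLOT, which is a swapped-in technique and not what was used for the chart in this paper. The data set name competitors_sample and the title are hypothetical, and only the three records actually printed in this paper (1990, 1991 and 2011) are typed in, so the fitted line here will not match the one from the full 21 records. Year is placed on the horizontal axis as a choice made for this sketch.

```sas
/* Only the three records shown in the paper; the full data has 21 */
DATA competitors_sample ;
   INPUT year females males ;
   DATALINES ;
1990 118 267
1991 84 210
2011 66 146
;
RUN ;

PROC SGPLOT DATA = competitors_sample ;
   /* Scatter plot with a fitted regression line;         */
   /* CLM adds confidence limits for the mean prediction  */
   REG X = year Y = males / CLM ;
   TITLE "Male Competitors by Year" ;   /* hypothetical title */
RUN ;
```

Swapping Y = males for Y = females gives the corresponding plot for female competitors.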

The plot for male competitors worked fine but when we do the same steps to get a plot for female competitors it looks decidedly odd. There appears to be an upward trend to a point, and then a downward trend. In 1988, women’s competition was added to the Olympics for the first time. I speculate that this may have caused women’s competition to swing up, counter to the overall downward trend, but then after the excitement of having qualified as an Olympic sport faded, it, too, would show a downward trend. Also, elections are held for a new board of the National Governing Body each Olympic year and they take office the following year.


To test this hypothesis, that there was an upward trend followed by a downward trend, we can use PROC REG, the SAS regression procedure.

ODS GRAPHICS ON ;

This statement turns ODS Statistical Graphics output on. If you have not used ODS Graphics yet, you need to try it. Simply put, SAS tries to guess what you would most likely want as graphics output and produces it. It’s as simple as that.

PROC REG ;
MODEL females = year / STB ;
WHERE year < 2002 ;

The PROC REG statement calls the regression procedure. It will use the most recently created data set, which is the temporary file created above. The MODEL statement gives the dependent variable (females) = the independent (year). The option STB is for ‘standardized regression coefficient’. More information about standardized coefficients can be found in the related paper on Statistics for Hamsters (De Mars, 2011b). The WHERE statement selects only those records where the year is less than 2002.

The next procedure is identical except for the WHERE statement and produces the same analyses for the years after 2001.

PROC REG ;
MODEL females = year / STB ;
WHERE year > 2001 ;
RUN;
ODS GRAPHICS OFF ;

The statement at the end turns ODS graphics off.

The REG procedure with ODS graphics produces a lot of output. This is the reason you probably want to turn it off if you don’t specifically need the graphics. In with all of the other charts is the one below that addresses our particular question. On the right side it gives the R-Square value of .1501. The square root of this, that is the R value, is .39. In other words, we can tell the inquiring minds that want to know that from 1990 to 2001 there was a correlation of about .40 between year and the number of female competitors, which means the number of female competitors tended to increase over those years.


Examining the output for our second PROC REG, we find this next plot. This plot has an R-square of .63. In other words, the correlation between year and the number of female competitors is -.80. You don’t really need the numbers though to see what we had here was a somewhat modest upward trend followed by a very steep downward trend in the number of competitors. What to do about it is the decision of the people in the organization, but the facts are very hard to deny when presented in this manner. The number of competitors is clearly in decline for both males and females, a trend that has been going on for over a decade for women and much longer for men.
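The step from R-square to a correlation can be checked with a few lines of arithmetic. With a single predictor, the correlation is the square root of R-square, carrying the sign of the slope. The sketch below uses only the two R-square values quoted above; the variable names r_early and r_late are made up for this illustration.

```sas
DATA _NULL_ ;
   /* 1990-2001: R-square .1501, upward slope, so r is positive (about .39) */
   r_early = SQRT(.1501) ;

   /* After 2001: R-square .63, downward slope, so r is negative (about -.79) */
   r_late = -SQRT(.63) ;

   PUT r_early= 6.3 r_late= 6.3 ;   /* write both values to the log */
RUN ;
```

Since the reported R-square of .63 is itself rounded, the second value is best read as a correlation of roughly -.8.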


CONCLUSION

One advantage open data has over the data sets used with most textbooks is the potential for analysis of big data sets. These analyses almost force the programmer to learn more efficient techniques for processing data. While the first example seems an awful lot of effort to produce a single table, most of this work was re-used over and over throughout our examples. The output data set created in the frequency procedure was used repeatedly, and the TITLE, FOOTNOTE and OPTIONS statements applied to several graphs and tables. The PATTERN and AXIS statements applied to several charts. The formats created in the PROC FORMAT step were also used in various output produced for this project. Completing a textbook exercise, one might not see the advantages of going through all of these extra steps for one chart or table. “We have the numbers from the PROC FREQ, you can just make a table in Word or PowerPoint and insert those numbers,” a new programmer is likely to complain. To make one graph or table, that is probably true, but when there are multiple tables and graphs to be produced, the time put in up front pays off. Similarly, code that performs a simple task like creating a new variable or formatting a variable can be reused through a simple macro.

A second advantage of the use of open data is that, given the number of data sets available, these can be used for almost any type of project, procedure or analysis that the programmer wishes to experience.

The third advantage, as can be seen from our last example, is that even small, simple data sets can lend themselves to a moderately sophisticated statistical analysis.

A further advantage of the use of open data occurs when analysis is done to assist a particular audience. This in itself is a learning experience. The days of transom engineering are over. The value of the ability to produce accurate numbers is greatly increased when paired with the ability to convey information based on those numbers. Presenting national demographics to a class of seventh-graders or presenting regression analyses to an audience of judo coaches are real challenges that cause the programmer to seek new and better means of presentation.

Creating and implementing an open data project for a community program provides experience not just in trying different SAS techniques but also in tailoring the output of those to the needs of the intended audience. Not only does the community organization served benefit from this technique, but it also increases the marketable skills of the programmer and provides him or her a larger portfolio to point to of statements, options and procedures with which he or she has professional experience.

REFERENCES

Besler, L. (2007). Communication-effective pie charts. Presentation at the annual meeting of the SAS Users Group International. www2.sas.com/proceedings/forum2007/134-2007.pdf

De Mars, A. (2011a). SAS® Functions for a Better Functioning Community. Paper presented at the annual meeting of Western Users of SAS Software. San Francisco, CA.

De Mars, A. (2011b). SAS Essentials III: Statistics for Hamsters. Paper presented at the annual meeting of Western Users of SAS Software. San Francisco, CA.

SAS Institute (1999). SAS Procedures Guide. SAS Institute Inc., Cary, NC.

SAS Institute (2011). SAS/GRAPH(R) 9.2: Reference, Second Edition. SAS Institute Inc., Cary, NC.

ACKNOWLEDGMENTS

Thank you to Kirby Posey of the U.S. Census Bureau for invaluable assistance in verifying the variable coding and estimates.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

AnnMaria De Mars
The Julia Group
2111 7th St. #8
Santa Monica, CA 90405
(310) 717-9089
[email protected]
http://www.thejuliagroup.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.