Course Notes in Statistical Softwares
Statistics 125
John Carlo P. Daquis
Course Overview
Statistics 125, or Statistical Softwares, covers three applications: Microsoft Excel, SPSS and SAS. These three are the packages most often used in the succeeding courses of the BS Statistics program, so a student needs a good grasp of them. The course is divided into four parts: an introduction, MS Excel, SPSS and SAS. The first part gives the student an overview of Windows, of office suites other than MS Office, and of different statistical softwares. It also includes a comprehensive list of keyboard shortcuts. The MS Excel part consists of five chapters covering data entry, Excel functions, presenting data, analyzing data, and macro programming, and gives the student in‐depth knowledge of MS Excel. Lastly, the third and fourth parts serve as introductions to SPSS and SAS, respectively.
Course Goals
By the end of this course, students should be able to:
Master the basic functions of MS Excel, SPSS and SAS
Perform various statistical methods using the softwares
Create informative and high‐impact materials in data presentation
Enhance statistical programming skills through Excel Macros and SAS programming
Appreciate statistics through the power of statistical softwares
Course Requirements
Grading System
Long Exams            50%
Machine Problems      30%
Final Exam            10%
Quizzes/Assignments   10%
Grading Scale
95 – 100%    1.0
90 – 94%     1.25
85 – 89%     1.5
80 – 84%     1.75
76 – 79%     2.0
72 – 75%     2.25
68 – 71%     2.5
64 – 67%     2.75
60 – 63%     3.0
55 – 59%     4.0
54 below     5.0
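As an illustration of how the grading system and grading scale combine, here is a minimal Python sketch. The function names and dictionary keys are hypothetical, and the rule of rounding the weighted average to the nearest whole percent before the lookup is an assumption, since the notes do not say how fractional percentages are treated.

```python
import math

# Component weights taken from the Grading System above.
WEIGHTS = {
    "long_exams": 0.50,
    "machine_problems": 0.30,
    "final_exam": 0.10,
    "quizzes_assignments": 0.10,
}

# (lower bound in percent, grade) pairs from the Grading Scale, highest first.
SCALE = [
    (95, 1.0), (90, 1.25), (85, 1.5), (80, 1.75), (76, 2.0),
    (72, 2.25), (68, 2.5), (64, 2.75), (60, 3.0), (55, 4.0),
]


def final_percentage(scores):
    """Weighted average of the component scores, each given in percent."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def grade_point(percentage):
    """Map a percentage to a grade using the scale above."""
    # Assumption: round half up to a whole percent before the lookup.
    whole = math.floor(percentage + 0.5)
    for lower_bound, grade in SCALE:
        if whole >= lower_bound:
            return grade
    return 5.0  # 54 and below


scores = {
    "long_exams": 90,
    "machine_problems": 85,
    "final_exam": 80,
    "quizzes_assignments": 95,
}
print(grade_point(final_percentage(scores)))
```

With these sample scores the weighted average is 45 + 25.5 + 8 + 9.5 = 88, which falls in the 85 – 89% band.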
Course Outline
I. INTRODUCTION
1. Introduction
   a. Windows Interface
   b. Microsoft Office and other Office Suites
   c. MS Excel and Statistical Softwares
   d. Some Useful Keyboard Shortcuts
II. MICROSOFT EXCEL
2. Getting Started with Microsoft Excel
   a. Introducing Excel
   b. Entering and Editing Worksheet Data
   c. Essential Worksheet Operations
   d. Working with Cells and Ranges
3. Working with Formulas and Functions
   a. Introducing Formulas and Functions
   b. Creating Formulas that Manipulate Text
   c. Creating Formulas that Count and Sum
   d. Creating Formulas that Lookup Variables
   e. Array Formulas
4. Creating Charts and Graphics
   a. Getting Started with Creating Charts
   b. Learning Advanced Charting
   c. Enhancing Your Work with Pictures and Drawings
5. Analyzing Data with Excel
   a. Statistical Functions in Excel
   b. Analyzing Data with Pivot Tables
   c. Analyzing Data Using Analysis Toolpak
6. Programming Excel in VBA
   a. Introducing VBA Macros
   b. Doing Simple Macro Programs
III. SPSS
7. SPSS Overview
   a. Windows, Menus and Dialog Boxes
   b. Selecting Variables
   c. Basic Steps in Data Analysis
8. The Data Editor
   a. Variables and Variable Definitions
   b. Defining and Copying Variable Properties
   c. Entering and Editing Data
   d. Go to Case and Case Selection
9. Data Transformations
   a. Computing Variables
   b. Count Occurrences of Values within Cases
   c. Recoding Variables: Into Same and Into Different
   d. Rank Cases
   e. Automatic Recode
   f. Visual Binning
10. Working with the Output Viewer
   a. The Output Viewer
   b. Pivot Tables
   c. The Syntax Viewer
11. SPSS Procedures
   a. Frequencies
   b. Descriptives
   c. Explore
   d. Crosstabs
   e. Summarize
   f. Means
   g. T‐test and ANOVA
   h. Correlations
IV. SAS
12. SAS: Getting Started
   a. What is SAS
   b. The SAS Interface
   c. SAS Programs and Statements
   d. SAS Datasets
   e. Library and Member Names
   f. DATA Step
   g. Proc Step
13. Getting the Data into the SAS System
   a. Viewtable Window
   b. Multiple Lines per Observation
   c. Multiple Observations per Line
   d. Reading Part of the Raw File
   h. Options in the INFILE Statement
   i. Delimited Raw Files
14. Working with Data in SAS
   a. Creating and Redefining Variables
   b. SAS Functions
   c. If‐Then Statements
   d. Arrays
   e. SAS Control Statements
15. SAS Basic Statistical Procedures
   a. PROC MEANS
   b. PROC FREQ
   c. PROC UNIVARIATE
   d. PROC TTEST
   e. PROC ANOVA
   f. PROC CORR
   g. PROC REG
   h. PROC PLOT
   i. PROC CHART
V. R (Optional)
Living has two main blocks – the DATA and PROCEDURE; You gain knowledge through experience; You experience happiness to smile; You smile at pain to appreciate; You appreciate blessings of the Divine;
You LEARN; LISTEN;
VALUE; You LOVE;
Then you PRINT those lessons in your heart; SORT tears of joy from cries of sorrow; FORMAT yourself for the whole world to see; You comfort your friends in a number of /*selfless*/ iterations;
You advise your loved ones with an ARRAY of /*heartfelt*/ instructions;
You TEACH and DREAM; RUN and SOAR; You CRY; You LOVE;
There are countless events in a library of memories;
Limitless variables to believe in; Oh, how wonderful to realize a myriad of options;
As you go line BY line, observation BY observation; I thank you then, GOD;
For this magnificently beautiful Dataset For a perfectly executed Creation For this most wonderful Program – ;
LIFE; JCDMJE
Introduction
Windows Interface
MS Office and other Office Suites
Statistical Softwares
Some Keyboard Shortcuts
Course Notes in Statistical Softwares
THE WINDOWS INTERFACE
Windows is a Graphical User Interface (GUI) type of Operating System produced by Microsoft. It allows people to manage files and run software programs on computers. Files and folders are visually represented as icons. When a folder is accessed, a browsing window opens. Similarly, accessing programs and applications will open application windows.
Windows provides two major ways of navigating programs. One is through the Desktop icons: to open a file, the default action is a double‐click, or a single click followed by the Enter key. The other is through the Start Menu, which holds many of the programs and folders that Windows can access.
HISTORY OF WINDOWS
Microsoft first began development of the Interface Manager (subsequently renamed Microsoft Windows) in September 1981. Although the first prototypes used Multiplan (early Microsoft Spreadsheet Program) and Word‐like menus at the bottom of the screen, the interface was changed in 1982 to use pull‐down menus and dialogs, as used on the Xerox Star.
Figure 1‐1 Multiplan and Xerox Star Interface
Microsoft finally announced Windows in November 1983, with pressure from just‐released VisiOn and impending TopView. Windows promised an easy‐to‐use graphical interface, device‐independent graphics and multitasking support.
Figure 1‐2 VisiOn and TopView: Early GUI Systems
The development was delayed several times, however, and Windows 1.0 hit the store shelves only in November 1985. The selection of applications was sparse, and Windows sales were modest. From then on, a series of Windows versions has been released.
Version       Release Date
1.0           November 1985
2.0 (/286)    1987
/386          1987
3.0           May 1990
3.x           April 1992
95            August 1995
98            June 1998
2000          February 17, 2000
ME            September 14, 2000
XP            October 25, 2001
Vista         January 30, 2007
7             Expected 2009‐2010
Table 1‐1 Windows Home Versions
Table 1‐2 Windows Releases: Logos and Interface (Windows 1.0, Windows 2.0 (/286), Windows 2.1 (/386), Windows 3.0, Windows 3.x)
THE DESKTOP
The desktop is the main screen area that you see after you turn on your computer and log on to Windows. Like the top of an actual desk, it serves as a surface for your work. When you open programs or folders, they appear on the desktop. You can also put things on the desktop, such as files and folders, and arrange them however you want.
The desktop is sometimes defined more broadly to include the taskbar and Windows Sidebar. The taskbar sits at the bottom of your screen. It shows you which programs are running and allows you to switch between them. It also contains the Start button, which you can use to access programs, folders, and computer settings.
Figure 1‐3 Windows Vista Interface
1. DESKTOP ICONS
2. WINDOWS START BUTTON – The Start Menu and the Start Button provide an easier, consolidated way to open programs and serve as the central launching point for applications and tasks.
3. QUICK LAUNCH BAR – A customizable taskbar element which provides shortcuts to applications and commands such as Internet Explorer, Windows Media Player and Show Desktop.
4. BROWSING WINDOW – It shows the files contained in a specific folder, drive or computer.
5. TASKBAR – The taskbar is the desktop bar used to launch and monitor applications.
6. DESKTOP
7. NOTIFICATION AREA – Commonly referred to as the SYSTEM TRAY. The notification area is a portion of the taskbar which mainly contains icons that show status information (download progress, virus alerts, volume, new messages) as well as some minimized programs.
8. SIDEBAR – The Sidebar contains gadgets which can perform various tasks, such as a clock, post‐it note, weather reporter, picture viewer and calendar.
MICROSOFT OFFICE AND OTHER OFFICE SUITES
An office suite is a collection of application software, often sharing a similar interface, that provides the functions needed for office use. Office suites run on different platforms and operating systems, and may be online (web‐based) or offline. The typical components of an office suite are a word processor, a spreadsheet, a presentation program, database management, graphics software and email. Currently, the dominant office suite is Microsoft Office.
The table below shows some examples of different office suites:
Office Suite         Word Processor       Spreadsheet            Presentation
MS Office            MS Word              MS Excel               MS PowerPoint
Open Office          OpenOffice Writer    OpenOffice Calc        OpenOffice Impress
Apple iWork          iWork Pages          iWork Numbers          iWork Keynote
KOffice              KWord                KSpread                KPresenter
Lotus Symphony       Lotus Documents      Lotus Spreadsheets     Lotus Presentations
Corel WordPerfect    WordPerfect          Quattro Pro            Presentations
ThinkFree            ThinkFree Write      ThinkFree Calc         ThinkFree Show
Star Office          Star Office Writer   Star Office Calc       Star Office Impress
SoftMaker            TextMaker            PlanMaker              Presentations
Celframe             Celframe Write       Celframe Spreadsheet   Power Presentation
Zoho                 Zoho Writer          Zoho Sheet             Zoho Show
Table 1‐3 Examples of Office Suites
STATISTICAL SOFTWARES
Statistical softwares are applications and software packages whose main task is to perform various statistical procedures and analyses. Such procedures may include the following:
Descriptive Statistics
Presentation of Data – Charts, Tables, Histograms and Frequencies
Parametric Statistical Inference
Non‐parametric Tests
Regression and Time Series Analysis
Sampling
Multivariate Data Analysis
Experimental Designs
Market Research
Many statistical softwares cater to a wide range of analyses. However, some are specialized: Eviews in regression and time series analysis, and WinBugs in Bayesian inferential statistics.
EXAMPLES OF STATISTICAL SOFTWARES
Below are some examples of statistical softwares:
Eviews
GenStat
JMP
Minitab
R Language
SAS
S‐plus
SPSS
Stata
Statistica
Systat
WinBugs
Table 1‐4 Examples of Statistical Softwares
KEYBOARD SHORTCUTS
Knowing and using the shortcut keys in Windows and its applications is valuable to every computer user, professionals and home users alike. Shortcut keys make everyday computer use quicker and more efficient. Most of the time, a task that takes a couple of mouse clicks can be done with a single keyboard shortcut.
General Windows Shortcuts
SHORTCUT FUNCTION
CTRL+C Copy
CTRL+X Cut
CTRL+V Paste
CTRL+Z Undo
DELETE Delete
SHIFT+DELETE Delete the selected item permanently without placing the item in the Recycle Bin
CTRL while dragging an item Copy the selected item
CTRL+SHIFT while dragging an item Create a shortcut to the selected item
F2 key Rename the selected item
CTRL+RIGHT ARROW Move the insertion point to the beginning of the next word
CTRL+LEFT ARROW Move the insertion point to the beginning of the previous word
CTRL+DOWN ARROW Move the insertion point to the beginning of the next paragraph
CTRL+UP ARROW Move the insertion point to the beginning of the previous paragraph
CTRL+SHIFT with any of the arrow keys Highlight a block of text
SHIFT with any of the arrow keys Select more than one item in a window or on the desktop, or select text in a document
CTRL+A Select all
F3 key Search for a file or a folder
ALT+ENTER View the properties for the selected item
ALT+F4 Close the active item, or quit the active program
ALT+ENTER Display the properties of the selected object
ALT+SPACEBAR Open the shortcut menu for the active window
CTRL+F4 Close the active document in programs that enable you to have multiple documents open simultaneously
ALT+TAB Switch between the open items
ALT+ESC Cycle through items in the order that they had been opened
F6 key Cycle through the screen elements in a window or on the desktop
F4 key Display the Address bar list in My Computer or Windows Explorer
SHIFT+F10 Display the shortcut menu for the selected item
ALT+SPACEBAR Display the System menu for the active window
CTRL+ESC Display the Start menu
ALT+Underlined letter in a menu name Display the corresponding menu
Underlined letter in a command name on an open menu Perform the corresponding command
F10 key Activate the menu bar in the active program
RIGHT ARROW Open the next menu to the right, or open a submenu
LEFT ARROW Open the next menu to the left, or close a submenu
F5 key Update the active window
BACKSPACE View the folder one level up in My Computer or Windows Explorer
ESC Cancel the current task
SHIFT when you insert a CD‐ROM into the CD‐ROM drive Prevent the CD‐ROM from automatically playing
Dialog Box Keyboard Shortcuts
SHORTCUT FUNCTION
CTRL+TAB Move forward through the tabs
CTRL+SHIFT+TAB Move backward through the tabs
TAB Move forward through the options
SHIFT+TAB Move backward through the options
ALT+Underlined letter Perform the corresponding command or select the corresponding option
ENTER Perform the command for the active option or button
SPACEBAR Select or clear the check box if the active option is a check box
Arrow keys Select a button if the active option is a group of option buttons
F1 key Display Help
F4 key Display the items in the active list
BACKSPACE Open a folder one level up if a folder is selected in the Save As or Open dialog box
Windows Logo Keyboard Shortcuts
SHORTCUT FUNCTION
Windows Logo Display or hide the Start menu
Windows Logo+BREAK Display the System Properties dialog box
Windows Logo+D Display the desktop
Windows Logo+M Minimize all of the windows
Windows Logo+SHIFT+M Restore the minimized windows
Windows Logo+E Open My Computer
Windows Logo+F Search for a file or a folder
CTRL+Windows Logo+F Search for computers
Windows Logo+F1 Display Windows Help
Windows Logo+L Lock the keyboard
Windows Logo+R Open the Run dialog box
Windows Logo+U Open Utility Manager
Accessibility Keyboard Shortcuts
SHORTCUT FUNCTION
Right SHIFT for eight seconds Switch FilterKeys either on or off
Left ALT+left SHIFT+PRINT SCREEN Switch High Contrast either on or off
Left ALT+left SHIFT+NUM LOCK Switch the MouseKeys either on or off
SHIFT five times Switch the StickyKeys either on or off
NUM LOCK for five seconds Switch the ToggleKeys either on or off
Windows Logo+U Open Utility Manager
Windows Explorer Keyboard Shortcuts
SHORTCUT FUNCTION
END Display the bottom of the active window
HOME Display the top of the active window
NUM LOCK+Asterisk sign (*) Display all of the subfolders that are under the selected folder
NUM LOCK+Plus sign (+) Display the contents of the selected folder
NUM LOCK+Minus sign (‐) Collapse the selected folder
LEFT ARROW Collapse the current selection if it is expanded, or select the parent folder
RIGHT ARROW Display the current selection if it is collapsed, or select the first subfolder
Internet Explorer Navigation
SHORTCUT FUNCTION
CTRL+B Open the Organize Favorites dialog box
CTRL+E Open the Search bar
CTRL+F Start the Find utility
CTRL+H Open the History bar
CTRL+I Open the Favorites bar
CTRL+L Open the Open dialog box
CTRL+N Start another instance of the browser with the same Web address
CTRL+O Open the Open dialog box, the same as CTRL+L
CTRL+P Open the Print dialog box
CTRL+R Update the current Web page
CTRL+W Close the current window
MS Excel: Getting Started
Introducing Excel
Entering and Editing Worksheet Data
Essential Worksheet Operations
Working with Cells and Ranges
INTRODUCING EXCEL
Microsoft Excel is by far the most popular spreadsheet program. It is part of the Microsoft Office Suite. Excel’s main purpose is to perform numerical calculations. Now, considering the field of statistics, here are some of the uses of Excel:
Computation of Descriptive Statistics
Survey methods which include questionnaire design, sampling and encoding of data
Data Presentation: tables and charts
Performing different statistical methods (e.g. tests of hypothesis, regression and correlation), usually with the aid of the Analysis Toolpak add‐in
WORKBOOKS AND WORKSHEETS
There are two ways to open MS Excel. The first is to double‐click the MS Excel application icon, which opens a blank workbook. The second is to find an Excel file icon and double‐click it, which opens the selected file. By default, Excel files have the extension “.xls”.
Figure 2‐1 An Excel Program Icon (left) and File Icon (Right)
All Excel operations are performed in a workbook file, which appears in its own window. One can have more than one workbook window open. Each workbook contains one or more worksheets. Each worksheet is divided into rows and columns, which form individual cells. Each cell may contain a value, a formula or text. Figure 2‐2 shows the essential parts of the Excel interface, and Table 2‐1 below briefly describes each part.
Name Description
Active cell indicator This dark outline indicates the currently active (out of 16,777,216) cell.
Column headings Letters ranging from A to IV (256 columns)
Formula bar Shows information (text, value or formula) on the active cell
Horizontal scrollbar Enables you to scroll the sheet horizontally.
Menu bar Excel’s main menu. Each Menu drops down another set of menu items.
Name box Displays the active cell address or the name of the selected cell, range, or object.
Row headings Numbers ranging from 1 to 65,536
Sheet tabs Tabs which represent a different sheet in the workbook.
Tab scroll buttons Scrolls the sheet tabs to display tabs which are not visible
Status bar This bar displays various messages as well as the status of keyboard “lock” keys
Task pane Displays options that are relevant to the task you are performing
Title bar Displays the program name, the file name and holds window‐modifying buttons
Toolbars Contains buttons which represent Excel commands
Vertical scrollbar Lets you scroll the sheet vertically.
Table 2‐1 Parts of an Excel Screen
Figure 2‐2 The MS Excel 2003 Screen
Every worksheet consists of rows (numbered 1 through 65,536) and columns (labeled A through IV). After column Z comes column AA; after column AZ comes column BA, and so on. The intersection of a row and a column is a single cell. At any given time, one cell is the active cell. You can identify the active cell by its darker border, as shown in Figure 2‐2. Its address (its column letter and row number) appears in the Name box. Depending on the technique that you use to navigate through a workbook, you may or may not change the active cell when you navigate.
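The column labeling scheme described above (A to Z, then AA, AB, and so on) can be sketched in Python. This is only an illustration of the lettering rule, not code that Excel exposes; the function name `column_label` is hypothetical.

```python
def column_label(number):
    """Return the letter label of the column with the given 1-based number."""
    if number < 1:
        raise ValueError("column numbers start at 1")
    label = ""
    while number > 0:
        # Subtract 1 first because the lettering has no zero digit:
        # A stands for 1 and Z for 26.
        number, remainder = divmod(number - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


for n in (1, 26, 27, 53, 256):
    print(n, column_label(n))
```

Column 27 comes out as AA, column 53 as BA, and column 256, the last column of an Excel 2003 worksheet, as IV.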
One can navigate Excel with the keyboard or with the mouse. Refer to the appendix of the course notes for a complete list of MS Excel 2003 keyboard shortcuts. Using the mouse in Excel is much like using it in any other GUI software on Windows: left‐clicking a menu opens a drop‐down menu, and left‐clicking a command button performs a command. A left click on a cell makes it the active cell. A right click on a part of the screen shows a menu of commands applicable to wherever the mouse cursor is pointed. If the mouse has a scroll wheel, one can use it to scroll vertically. Also, clicking the wheel and moving the mouse in any direction scrolls the worksheet automatically in that direction.
MENUS AND TOOLBARS
The menu bar is located directly below the title bar. It is always available for the user to execute commands. Using the menu is quite straightforward. Click the menu that you want to open, and it drops down to display menu items. Click the menu item to issue the command.
Excel also includes toolbars that provide another way of issuing commands. Toolbar commands may act as a convenient substitute for menu commands. However, some toolbar buttons do not have a menu counterpart. Excel lets you customize the toolbars or even display multiple toolbars at the same time. Nonetheless, the two most common toolbars are the Standard and Formatting toolbars. The Standard toolbar, located beneath the menu bar, has buttons for commonly performed tasks like adding a column of numbers, printing, sorting, and other operations. The Formatting toolbar, located beneath the Standard toolbar, has buttons for various formatting operations like changing text size or style, formatting numbers and placing borders around cells. Table 2‐2 lists Excel’s most used built‐in toolbars.
Toolbar Use
Standard Issues commonly used commands
Formatting Changes how your worksheet or chart looks
Borders Adds borders around selected areas
Chart Manipulates charts
Drawing Inserts or edits drawings on a worksheet
Control Toolbox Adds controls (buttons, spinners, and so on) to a worksheet
Formula Auditing Identifies errors in your worksheet
Picture Inserts or edits graphic images
PivotTable Works with pivot tables
Protection Controls what types of changes can be made in your worksheet
Reviewing Provides tools to use workbooks in groups
Text to Speech Provides tools to read aloud cell contents
Web Provides tools to access the Internet from Excel
WordArt Inserts or edits a picture composed of words
Table 2‐2 Excel’s Most Used Built‐in Toolbars
Figure 2‐3 Standard and Formatting Toolbars
MOVING THE ACTIVE CELL
Cell selection and movement around the worksheet are similar operations in Excel. To select a given cell or make it active, simply click on that cell. Use the mouse or the arrow keys to move around the worksheet. To move the active cell in another direction, simply press the appropriate arrow key.
MS EXCEL 2007
The user interface of MS Office 2007, and with it Excel 2007, has been revamped. Microsoft describes it as a “results‐oriented user interface”. That is, commands and features that were often hard to find or buried somewhere in the menus and toolbars are now easier to find in the Ribbon, a set of task‐oriented tabs which contain logically grouped commands and tasks.
The whole interface of Office 2007 has a name: MS Office Fluent. Fluent brings many noticeable interface changes. Some of them, together with Excel’s internal changes, are:
The MS Office 2007 button which replaces the File Menu
The Ribbon, a tab‐based interface which replaces all other menus and toolbars.
Live Preview
1,048,576 Rows and 16,384 Columns (from 65,536 x 256)
Unique colors in a spreadsheet: from 256 to 4.3 Billion
More Graphical Options (Smart Art, 3d Options)
By default, the Ribbon has seven tabs—Home, Insert, Page Layout, Formulas, Data, Review, and View. Contextual tabs appear as needed depending on the object being worked with. The icons under each tab are grouped logically into blocks on the Ribbon, and only the most frequently‐used command icons are displayed.
Live Preview is a useful feature in the new UI. Unlike earlier versions where the results of formatting would be visible only after they were applied, in this version, just hovering the cursor over a formatting option will reflect the change in the content. This removes the need for undoing any applied changes if one is not satisfied.
Figure 2‐4 Microsoft Excel 2007 Interface
ENTERING AND EDITING WORKSHEET DATA
Cells in MS Excel can hold any of three basic types of data:
Numerical values which include dates and times
Text
Formulas
An Excel workbook can also hold objects such as charts, drawings, pictures and buttons. They reside on a draw layer, which is an invisible layer on top of each worksheet.
To enter a numerical value into a cell, move the cell pointer to the appropriate cell, type the value, and then press Enter. The active cell moves to the cell directly below. Alternatively, one can press Tab instead of Enter, in which case the active cell moves to the right. The value is displayed in the cell as well as in the formula bar when that cell is active. When entering values, one can include decimal points and currency symbols, along with plus, minus and comma signs. However, when a value is followed by text (e.g. 13 captains), the whole entry is treated as text.
Excel handles dates by using a serial number system. The earliest date that Excel understands is January 1, 1900. This date has a serial number of 1. January 2, 1900, has a serial number of 2, and so on. The last date Excel can handle is December 31, 9999, which has a serial number of 2,958,465. In working with times, Excel uses fractional days. That is, if 40423 is September 2, 2010, then 40423.62 is 9/2/2010 2:52:48 PM.
One can also enter data in the worksheet by using the fill handle. The fill handle is the small black square at the bottom right of the active cell.
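The date serial number system described above can be sketched in Python. This is a minimal sketch, not Excel's own implementation, and the names `EXCEL_DAY_ZERO` and `excel_serial_to_datetime` are hypothetical. It assumes serial numbers of 61 or more (March 1, 1900 onward): Excel also counts a February 29, 1900 that never existed, so earlier serials are off by one day under this mapping.

```python
from datetime import datetime, timedelta

# Day zero chosen so that serial 61 maps to March 1, 1900.
# Assumption: only serials >= 61 are converted (see the note above).
EXCEL_DAY_ZERO = datetime(1899, 12, 30)
SECONDS_PER_DAY = 24 * 60 * 60


def excel_serial_to_datetime(serial):
    """Convert an Excel date serial (whole days plus a day fraction)."""
    whole_days = int(serial)
    # The fractional part is the time of day; round to the nearest second
    # to absorb floating-point error in the fraction.
    seconds = round((serial - whole_days) * SECONDS_PER_DAY)
    return EXCEL_DAY_ZERO + timedelta(days=whole_days, seconds=seconds)


print(excel_serial_to_datetime(40423.62))
print(excel_serial_to_datetime(2958465))
```

The fraction 0.62 of a day is 0.62 x 86,400 = 53,568 seconds, or 14 hours 52 minutes 48 seconds, which matches the 2:52:48 PM in the text.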
Figure 2‐5 The Fill Handle (Encircled)
The fill handle, in its simplest form, is used to extend a series of values. For example, if you type the number 1 in any cell and the number 2 in a cell that adjoins it, you can use the fill handle to continue the series up to any number desired. To do this, select the two cells (starting from the one with the number 1), hover the mouse pointer over the fill handle until it changes to a small black cross, then left‐click and drag in the direction you want the series to run. You can also do the same by entering a starting number in any cell, selecting the cell, holding down the Ctrl key and then dragging the fill handle. If you do not hold down the Ctrl key, Excel will simply copy the same number.
Process 2‐1 Entering Data Using the Fill Handle
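The behaviour of the fill handle described above can be mimicked in a few lines of Python. This is only a sketch of the arithmetic, assuming the series continues with the difference between the two starting cells; the function names are hypothetical and nothing here interacts with Excel.

```python
def fill_series(first, second, count):
    """Extend the series started by two adjoining cells to `count` values."""
    step = second - first  # the increment implied by the two starting cells
    return [first + i * step for i in range(count)]


def fill_copy(value, count):
    """Dragging a single cell without Ctrl simply repeats the value."""
    return [value] * count


print(fill_series(1, 2, 5))
print(fill_series(10, 15, 4))
print(fill_copy(7, 3))
```

Starting from 1 and 2 the sketch gives 1, 2, 3, 4, 5, while a single copied cell just repeats its value.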
MODIFYING CELL CONTENTS
After entering data, one can modify it in several ways:
Erase the cell’s contents
Replace the cell’s contents with something else
Edit the cell’s contents
Erasing Contents in a Cell
To erase the contents of a cell, just click the cell and press Delete. To erase more than one cell, select all the cells that you want to erase and then press Delete. Pressing Delete removes the cell’s contents but doesn’t remove any formatting (such as bold, italic, or a different number format) that may have been applied to the cell. One can also erase a range of cells by using the fill handle. Just click the fill handle and drag back across the selected cells.
Replacing the Contents of a Cell
To replace the contents of a cell with something else, just click the cell and type the new entry, which replaces the previous contents.
Editing Contents of a Cell
Process 2‐2 Editing the Contents of a Cell
The first step of Process 2‐2 causes Excel to go into Edit mode. In Edit mode, two new icons appear in the formula bar: the X icon to cancel the edit and the check mark icon to accept the edit.
Steps in Process 2‐1:
1. Type numbers in 2 adjoining cells.
2. Select the 2 cells.
3. Left click on the fill handle.
4. Drag over the desired cells to be filled.
Steps in Process 2‐2:
1. Double‐click the cell, press F2 after selecting the cell, or click inside the formula bar.
2. Edit the selected cell.
3. Press Enter or Tab.
APPLYING NUMBER FORMATTING
Number formatting refers to the process of changing the appearance of values contained in cells. The primary reason for formatting numbers is to improve their readability. One can format numbers on the Number tab of the Format Cells dialog box. There are three ways to access the dialog box:
Select the Format → Cells command
Right‐click → choose Format Cells
Press Ctrl + 1
Figure 2‐6 The Format Cells Dialog Box
ESSENTIAL WORKSHEET OPERATIONS
Making the Worksheet the Active Sheet
If the workbook has too many sheets to show all the tabs at once, use the tab‐scrolling buttons to scroll the sheet tabs. A worksheet can be activated in three ways:
Click the sheet tab
Press Ctrl+PgUp (previous sheet)
Press Ctrl+PgDn (next sheet)
Adding a New Worksheet
The following are three ways to add a new worksheet:
Select the Insert → Worksheet command
Press Shift + F11
Right‐click the sheet tab → Insert → Worksheet → OK
Deleting a Worksheet
To delete a worksheet, one can do either of the following:
Select the Edit → Delete Sheet command
Right‐click the sheet tab → Delete
Changing the Name of a Worksheet
The default names for worksheets are Sheet1, Sheet2, Sheet3 and so on. Such names are not very descriptive. To change a worksheet name, one may use any of these methods:
Select Format → Sheet → Rename → type the new name
Double‐click the sheet tab → type the new name
Right‐click the sheet tab → Rename → type the new name
Note that sheet names may be up to 31 characters long but cannot include the symbols : (colon), / (slash), \ (backslash), ? (question mark) and * (asterisk).
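The naming rule above is easy to check before a sheet is renamed. The Python sketch below tests only the two constraints stated in these notes (length and the five listed symbols); the function name is hypothetical, and rejecting an empty name is an added assumption.

```python
# Symbols the notes list as not allowed in a sheet name.
FORBIDDEN_SYMBOLS = set(":/\\?*")
MAX_NAME_LENGTH = 31


def is_valid_sheet_name(name):
    """Check a proposed sheet name against the rule stated above."""
    if not name:
        return False  # assumption: an empty name is not accepted
    if len(name) > MAX_NAME_LENGTH:
        return False
    return not any(symbol in FORBIDDEN_SYMBOLS for symbol in name)


for candidate in ("Survey Data 2010", "Q1/Q2 Results", "x" * 32):
    print(repr(candidate), is_valid_sheet_name(candidate))
```

Here the first name passes, the second fails because of the slash, and the third fails because it is 32 characters long.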
Changing a Sheet Tab’s Color
To change the color of a sheet tab, right‐click the tab and choose Tab Color. Then select the color in the Format Tab Color dialog box.
Rearranging Worksheets
One can move or copy a worksheet in the following ways:
Select the Edit → Move or Copy Sheet command
Right‐click the sheet tab → Move or Copy command
Move a worksheet: click the worksheet tab → drag to the desired location
Copy a worksheet: click the worksheet tab → hold Ctrl → drag to the desired location
Figure 2‐7 The Move or Copy Dialog Box
Split Worksheet and Freeze Panes
To split panes, either (1) select the Window → Split command or (2) drag the horizontal or vertical split bar. To freeze panes, select the cell below and to the right of the row and column to be frozen, then select the Window → Freeze Panes command.
WORKING WITH ROWS AND COLUMNS
Inserting rows and columns in Excel does not add rows and columns to the spreadsheet; the number of rows and columns is fixed. Rather, inserting a row (column) moves the other rows (columns) down (right) to accommodate the new one. To insert new rows (columns), one can do any of the following:
Click the row number (column letter) in the worksheet border → select the Insert → Rows (Columns) command
Click the row number (column letter) in the worksheet border → right‐click → Insert
Move the cell pointer to the row (column) where one wishes to insert → select the Insert → Rows (Columns) command
One can also insert cells instead of entire rows or columns. To do this, select the range → select Insert → Cells (or right‐click → Insert) → choose to shift cells right or down.
Figure 2‐8 The Insert Cells Dialog Box
WORKING WITH CELLS AND RANGES
A cell is an individual element of a worksheet that can hold a single value, formula or text. It is identified by its address, which consists of a column letter and a row number. A group of cells is called a range. A range’s address is its upper‐left cell address and lower‐right cell address separated by a colon. Below are examples of different range addresses:
B76           A range consisting of a single cell
A3:C3         Three cells that occupy a single row and three columns
C5:C105       101 cells in column C
F2:J6         25 cells (5 rows x 5 columns)
D:D           Entire column D
5:5           Entire row 5
A1:IV65536    All cells in a worksheet
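To check cell counts like those in the examples above, one can parse a range address and multiply the number of rows by the number of columns. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and it handles only addresses made of cell references (such as B76 or F2:J6), not whole‐column or whole‐row addresses like D:D or 5:5.

```python
import re

# One cell reference: column letters followed by a row number.
CELL_PATTERN = re.compile(r"([A-Za-z]+)(\d+)")


def column_number(letters):
    """Convert a column label such as 'IV' to its 1-based number."""
    number = 0
    for letter in letters.upper():
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def parse_cell(reference):
    """Split a reference like 'C105' into (column number, row number)."""
    match = CELL_PATTERN.fullmatch(reference)
    if match is None:
        raise ValueError(f"not a cell reference: {reference!r}")
    return column_number(match.group(1)), int(match.group(2))


def range_size(address):
    """Count the cells in a range address such as 'F2:J6' or 'B76'."""
    corners = address.split(":")
    if len(corners) == 1:
        corners = corners * 2  # a single cell is its own opposite corner
    (col1, row1), (col2, row2) = (parse_cell(c) for c in corners)
    return (abs(col2 - col1) + 1) * (abs(row2 - row1) + 1)


for address in ("B76", "A3:C3", "C5:C105", "F2:J6", "A1:IV65536"):
    print(address, range_size(address))
```

Rows 5 through 105 inclusive give 105 - 5 + 1 = 101 cells for C5:C105, and A1:IV65536 gives 256 x 65,536 = 16,777,216 cells, the total mentioned for the active cell indicator.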
SELECTING RANGES
Selecting a Range
One can select a range in the following ways:
Mouse drag
Shift + directional keys / Shift + mouse click
F8 → directional keys → F8
Type the range in the Name box
Edit → Go To command → type the range / F5 → type the range
Figure 2‐9 The Go To Dialog Box
Selecting Complete Rows and Columns
The following are ways of selecting complete rows or columns:
Click the row or column border
For multiple selection, click a row or column border and drag to highlight additional rows or columns
Ctrl + Click multiple rows (columns) to select non-adjacent rows (columns)
Shift + Spacebar to select a row
Ctrl + Spacebar to select a column
Ctrl + A to select the entire spreadsheet
Selecting Noncontiguous Ranges
Ctrl + mouse click and drag
Select a range → press Shift + F8 → select the additional ranges
Edit → Go To command (or press F5/Ctrl+G) → type the ranges separated by commas
Selecting Special Types of Cells
To select special types of cells, use the Go To Special dialog box by choosing Edit → Go To → Special.
Figure 2‐9 The Go To Special Dialog Box
COPYING OR MOVING RANGES
The following are ways to copy (or cut) and paste a selected range:
Toolbar buttons: Select range → Copy (Cut) button → select destination → Paste button
Menu commands: Select range → Edit → Copy (Cut) → select destination → Edit → Paste
Shortcut: Select range → Ctrl + C (X for Cut) → select destination → Ctrl + V
Drag and drop (cut): Select range → click one of its 4 borders → drag to the desired cell → drop
Drag and drop (copy): Select range → click one of its 4 borders → Ctrl + drag to the desired cell → drop
PASTE SPECIAL
Sometimes one may not wish to copy everything in a cell. For example, one may want to copy only the values and not the formulas in a cell, or to copy only the formatting of one cell to another without overwriting the entry in the destination cell. For these cases, Copy followed by the Edit → Paste Special command is useful. Note that Paste Special does not work after a Cut.
Figure 2‐9 The Paste Special Dialog Box
ADDING COMMENTS TO CELLS
One can add comments to clarify important items in the worksheet. To add a comment to a cell, select the cell and then choose Insert → Comment (or press Shift+F2).
Figure 2‐10 A Small Red Triangle Signifying an Attached Comment
Working with Formulas and Functions
Introducing Formulas & Functions
Formulas that Manipulate Text
Formulas that Count & Sum
Formulas that Lookup Variables
Array Formulas
INTRODUCING FORMULAS & FUNCTIONS
A formula, when entered into a cell, performs a calculation and returns a result. To distinguish it from text, a formula always begins with an equal sign. A formula may consist of any of the following:
Values and/or text
Mathematical Operators
Cell References
Worksheet Functions
The calculation result is displayed in the cell where the formula is written. The actual formula can nonetheless be seen in the formula bar.
Mathematical Operators
Operators are symbols that tell Excel what type of mathematical operation you want the formula to perform. A list of Excel operators can be found in the table below.
Operator Name
+ Addition
‐ Subtraction
* Multiplication
/ Division
^ Exponentiation
& Concatenation
= Equal to
> Greater Than
< Less Than
>= Greater Than or Equal To
<= Less Than or Equal To
Table 3‐1 MS Excel Operators
When there is more than one operator in a formula, Excel follows rules that determine the order in which the operations are carried out. One should therefore know the precedence of Excel operators in order to get the desired answer. Table 3-2 lists the order of precedence of MS Excel operators.
Operator Name Precedence
^ Exponentiation 1
* Multiplication 2
/ Division 2
+ Addition 3
‐ Subtraction 3
& Concatenation 4
= Equal to 5
< Less Than 5
> Greater Than 5
Table 3‐2 MS Excel Operator Precedence
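As a quick illustration of Table 3-2, the sketch below works through how a formula such as =2+3*4^2 would be evaluated one operator at a time. It is written in Python only so that the steps can be checked; the three formulas and all variable names are illustrative assumptions and are not taken from the course files.

```python
# Hypothetical Excel formula: =2+3*4^2
# Evaluated one operator at a time, following Table 3-2.
power = 4 ** 2          # precedence 1: exponentiation (Excel's ^)
product = 3 * power     # precedence 2: multiplication
total = 2 + product     # precedence 3: addition

# Hypothetical Excel formula: ="Sum: "&2+3
# Addition (precedence 3) runs before concatenation (precedence 4).
label = "Sum: " + str(2 + 3)

# Hypothetical Excel formula: =2+3=5
# The comparison (precedence 5) is applied last, to the finished sum.
is_equal = (2 + 3) == 5

print(total, label, is_equal)
```

Each intermediate variable stands for one pass of Excel over the formula, from the highest precedence to the lowest.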
However, when one wishes an operation of lower precedence to be performed before one of higher precedence, parentheses should be used. Parentheses override precedence: the operations inside them are computed first. If there is more than one operation inside the parentheses, the precedence still holds LOCALLY inside the parentheses.
Worksheet Functions
Certain calculations are tedious or even impossible to perform using operators alone. To write more efficient formulas, one should make use of worksheet functions. For example, consider the formula:
=(A1+B1+C1+D1+E1+F1+G1+H1)/8
It adds 8 values and divides the sum by 8 (the average). Using functions, the formula above can be simplified to:
=SUM(A1:H1)/8
Further, since it is a calculation of the mean, one can just perform the computation using just one function:
=AVERAGE(A1:H1)
Another example of a worksheet function is the following:
=IF(B3>=40000, "WIN", "LOSE")
The formula above checks if the value in cell B3 is greater than or equal to 40,000. If the value satisfies the condition, WIN is displayed; otherwise LOSE is displayed. Such conditional statements cannot be written in terms of operators alone.
Cell Reference
The first formula in the worksheet functions section includes 8 cell references. These references enable formulas to work with the data contained in those cells or ranges rather than with fixed values. That is, when the data change, there is no need to edit the formulas themselves.
There are three types of cell references:
1. Relative Cell Reference: The cell reference is an offset from the current row and column. When one copies a formula from one cell to another, the row and column references in the formula change accordingly.
Figure 3‐1 Relative Cell Reference
2. Absolute Cell Reference: Unlike a relative cell reference, an absolute cell reference does not change when the formula is copied, since it refers to the actual cell address instead of an offset.
3. Mixed Cell Reference: Here, the row reference is relative and the column reference is fixed, or vice versa.
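The behaviour of the three reference types when a formula is copied can be sketched outside Excel. The Python functions below are an illustration only: copy_reference, col_to_num and num_to_col are hypothetical helper names, not Excel features, and the sketch assumes a single-cell reference and a copy that stays within the worksheet.

```python
import re

def col_to_num(letters):
    """Convert a column label such as 'B' or 'AA' to a 1-based number."""
    num = 0
    for ch in letters:
        num = num * 26 + (ord(ch) - ord("A") + 1)
    return num

def num_to_col(num):
    """Convert a 1-based column number back to a column label."""
    letters = ""
    while num > 0:
        num, rem = divmod(num - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def copy_reference(ref, rows_down, cols_right):
    """Return ref as it would read after the formula holding it is copied
    rows_down rows down and cols_right columns to the right."""
    m = re.fullmatch(r"(\$?)([A-Z]+)(\$?)(\d+)", ref)
    if m is None:
        raise ValueError(f"not a single-cell reference: {ref!r}")
    col_fixed, col, row_fixed, row = m.groups()
    # A "$" in front of a part freezes it; otherwise the part is an offset.
    new_col = col if col_fixed else num_to_col(col_to_num(col) + cols_right)
    new_row = row if row_fixed else str(int(row) + rows_down)
    return f"{col_fixed}{new_col}{row_fixed}{new_row}"

# Copy each reference one row down and two columns to the right.
for ref in ["B7", "$B$7", "B$7", "$B7"]:
    print(ref, "->", copy_reference(ref, 1, 2))
```

Only the parts of the address without a dollar sign move with the copy, which is the distinction among relative, absolute and mixed references.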
One can change the type of a cell reference by pressing F4 repeatedly to cycle through the reference types. For example, if one writes =B7, pressing F4 results in =$B$7 (fixed column, fixed row). Pressing F4 again results in =B$7 (fixed row only), then =$B7 (fixed column only) on the third press, and back to =B7 on the fourth press. Alternatively, one can just type the dollar signs.
Excel Error Values
Sometimes after a formula is entered, Excel displays a value that begins with a hash mark (#). This means the formula is returning an error value. To get rid of the error display, one has to correct the formula. The table below lists the types of error values that may appear in a cell:
Error Value Explanation
#DIV/0! The formula is trying to divide by zero or an empty cell
#NAME? The formula uses a name that Excel doesn’t recognize.
#N/A Occurs when a value is not available to a function or formula.
#NULL! Two or more cell references are not separated correctly.
#NUM! A problem with a number exists, such as an invalid numeric argument.
#REF! The formula contains an invalid cell reference. This usually happens when a referenced cell has been moved or deleted.
#VALUE! The formula includes one or more arguments of the wrong type.
Table 3-3 Excel Error Values
FORMULAS THAT MANIPULATE TEXT
When one enters something in a cell, Excel automatically determines whether it is a number, a formula or text. At times, numbers are not really quantitative but qualitative in nature (e.g. a zip code). In such cases, one may store the number in the cell as text. To store a number as text, one can do either of the following:
Format → Cells → Number tab → select Text in the Category list
Precede the number with an apostrophe
Figure 3‐2 Format Cells Dialog Box
TEXT FUNCTIONS
This section presents some of the useful text functions in Excel. For a complete list, refer to the Appendix of the notes.
Determine Whether a Cell Contains Text: ISTEXT Function
The ISTEXT function takes a single argument and returns TRUE if the argument contains text and FALSE if it does not.
=ISTEXT(A1) → TRUE/FALSE
The CODE and CHAR Functions
The CODE function returns the character code for its argument. For example, the formula =CODE("A") returns the value 65. If the argument contains more than one character, CODE only considers the first character. That is, =CODE("STAT 125") = CODE("S") = 83. On the other hand, the CHAR function returns the character that corresponds to the number in its argument. For a full list of character codes, please refer to the Appendix.
=CODE("A") → 65
=CHAR(65) → A
Comparing If Two Strings are Identical: EXACT Function
There are two ways to check whether two texts are identical. The first is the formula =A1=B1, which returns TRUE if they are identical and FALSE otherwise. The "=" operator is not case-sensitive. If one wishes to make a case-sensitive comparison, the EXACT function must be used.
="Stat"="stat" → TRUE
=EXACT("Stat","stat") → FALSE
Joining Two or More Cells: The Ampersand & and CONCATENATE Function
Excel uses the ampersand as its concatenation operator. Concatenation is simply joining the contents of two or more cells, values or text. For example, if A1 contains 10,000 and C3 contains "The net profit is", then =C3 & A1 yields "The net profit is10,000". Notice that there is no space between the two cell contents. Revising the formula to =C3 & " " & A1 yields the desired result. Excel also provides the CONCATENATE function, which can hold up to 30 arguments.
=A1 & " " & B1 → <A1 contents><space><B1 contents>
=CONCATENATE("Statistical"," ","Softwares") → Statistical Softwares
Removing Excess Spaces: The TRIM Function
The TRIM function removes any excess spaces in a cell. That is, it removes leading and trailing spaces and replaces internal strings of multiple spaces by a single space.
=TRIM(" You have just been reloaded with 100 Pesos ") → You have just been reloaded with 100 Pesos
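For readers who want to check the behaviour of these text functions outside Excel, the sketch below mimics them in Python. It is an approximation under stated assumptions: excel_trim is a hypothetical helper name, the sample strings are made up, and ord/chr agree with CODE/CHAR only for plain ASCII characters such as letters and digits.

```python
def excel_trim(text):
    """Mimic TRIM: drop leading and trailing spaces and collapse
    internal runs of spaces into a single space."""
    return " ".join(part for part in text.split(" ") if part)

code_of_a = ord("A")                    # like =CODE("A")
char_of_65 = chr(65)                    # like =CHAR(65)
first_char_only = ord("STAT 125"[0])    # CODE looks at the first character only

# The "=" comparison in Excel ignores case; EXACT does not.
equal_ignoring_case = "Stat".lower() == "stat".lower()   # like ="Stat"="stat"
equal_exactly = "Stat" == "stat"                         # like =EXACT("Stat","stat")

# The & operator and CONCATENATE both join pieces end to end.
joined = "Statistical" + " " + "Softwares"

trimmed = excel_trim("   You have   just been  reloaded ")

print(code_of_a, char_of_65, first_char_only)
print(equal_ignoring_case, equal_exactly)
print(joined)
print(trimmed)
```

The case-insensitive comparison is written out with lower() to make explicit what Excel's "=" does implicitly.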
Counting the Number of Characters in a String: LEN Function
The LEN function returns the number of characters in its argument. LEN accepts only one argument, so the total length of several strings is obtained by adding the lengths of the individual arguments.
=LEN(A1) → <number of characters in cell A1>
Changing the Case of Text: UPPER, LOWER and PROPER Functions
These three functions perform the following:
=UPPER("text to uppercase") → TEXT TO UPPERCASE
=LOWER("TEXT TO LOWERCASE") → text to lowercase
=PROPER("first letter in each word is capital") → First Letter In Each Word Is Capital
Extracting Characters from a String: The LEFT, RIGHT and MID Functions
Sometimes one only wants to extract part of a string in a cell. For example, the accounting department only needs the surnames of all employees in the company, or the college registrar needs to extract the last five digits of the student numbers of the students currently enrolled in Stat 125. Three functions let users extract characters from a string, depending on which part of the string one wants to get. For example, if cell E6 contains the string "2007-00051 JOANNE SANTOS":
=LEFT(E6, 10) → 2007-00051
=RIGHT(E6, 6) → SANTOS
=MID(E6, 12, 2) → JO
Repeating a Character or a String: REPT Function
The REPT function simply repeats a character or string a given number of times.
=REPT("again ", 3) → again again again
Other Text Functions: SUBSTITUTE, REPLACE, FIND and SEARCH
Both SUBSTITUTE and REPLACE change part of a text string into another text. Their difference is as follows:
In the SUBSTITUTE function, one knows which text to replace but does not know its position in the text string. The function is case-sensitive.
The REPLACE function replaces the text that occurs in a specific location in the string. It is used when one knows the position of the text to be replaced but not the actual text.
=SUBSTITUTE("SubstituteQwithQSpaces","Q"," ") → Substitute with Spaces
=REPLACE("Replace withQQQQSpace",13,4, " ") → Replace with Space
On the other hand, the FIND and SEARCH functions return the starting position of a particular substring within a string. The only difference is that FIND is case-sensitive while SEARCH is not.
=FIND("W", "small w big W", 1) → 13
=SEARCH("W", "small w big W", 1) → 7
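The extraction and search functions above all count characters from 1, which is easy to get wrong. The Python sketch below re-implements them so that the examples can be verified; the function names simply echo the Excel ones and are assumptions of this sketch, and the wildcard support of Excel's SEARCH is deliberately left out.

```python
def left(text, num_chars):
    """Like LEFT: the first num_chars characters."""
    return text[:num_chars]

def right(text, num_chars):
    """Like RIGHT: the last num_chars characters."""
    return text[-num_chars:] if num_chars > 0 else ""

def mid(text, start_num, num_chars):
    """Like MID: num_chars characters starting at 1-based start_num."""
    return text[start_num - 1:start_num - 1 + num_chars]

def replace(old_text, start_num, num_chars, new_text):
    """Like REPLACE: overwrite by position, not by content."""
    return old_text[:start_num - 1] + new_text + old_text[start_num - 1 + num_chars:]

def find(find_text, within_text, start_num=1):
    """Like FIND: case-sensitive, returns a 1-based position."""
    position = within_text.find(find_text, start_num - 1)
    if position == -1:
        raise ValueError("#VALUE!: text not found")
    return position + 1

def search(find_text, within_text, start_num=1):
    """Like SEARCH without wildcards: same as FIND but ignores case."""
    return find(find_text.lower(), within_text.lower(), start_num)

e6 = "2007-00051 JOANNE SANTOS"
print(left(e6, 10), right(e6, 6), mid(e6, 12, 2))
print(mid(e6, 6, 5))   # last five digits of the student number
print(replace("Replace withQQQQSpace", 13, 4, " "))
print("SubstituteQwithQSpaces".replace("Q", " "))   # SUBSTITUTE works by content
print(find("W", "small w big W"), search("W", "small w big W"))
```

Note how replace needs a position and a length, while Python's built-in str.replace (the SUBSTITUTE analogue) needs only the text to look for.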
COUNTING AND SUMMING FUNCTIONS
A counting formula returns the number of cells in a specified range that meet a certain criterion. On the other hand, a summing formula returns the sum of the values of the cells in a range that meet a certain criterion. Table 3-4 below lists the Excel worksheet functions used in creating counting and summing formulas.
FUNCTION DESCRIPTION
COUNT Returns the number of cells in a range that contain a numeric value
COUNTA Returns the number of nonblank cells in a range
COUNTBLANK Returns the number of blank cells in a range
COUNTIF Returns the number of cells that meet a specified criterion in a range
SUM Returns the sum of its arguments
SUMIF Returns the sum of cells that meet a specified criterion in a range
SUMPRODUCT Multiplies corresponding cells in two or more ranges and returns the sum of those products
SUMSQ Returns the sum of the squares of its arguments
SUMX2PY2 Returns the sum of the sums of squares of corresponding values in two ranges
SUMXMY2 Returns the sum of squares of the differences of corresponding values in two ranges
SUMX2MY2 Returns the sum of the differences of squares of corresponding values in two ranges
Table 3-4 List of MS Excel Counting Functions
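The functions in Table 3-4 differ only in which cells they keep and what they do to each value, which a few lines of Python make concrete. This is a sketch under assumptions: the data are made up, None stands in for a blank cell, and the criteria of COUNTIF and SUMIF are passed as Python functions rather than Excel criteria strings such as ">=75".

```python
cells = [10, "n/a", None, 3.5]      # hypothetical range; None is a blank cell
x = [1, 2, 3]                       # hypothetical first range
y = [4, 5, 6]                       # hypothetical second range
scores = [60, 75, 90, 82]           # hypothetical scores

count = sum(1 for c in cells if isinstance(c, (int, float)))   # COUNT
counta = sum(1 for c in cells if c is not None)                # COUNTA
countblank = sum(1 for c in cells if c is None)                # COUNTBLANK

def countif(values, criterion):
    """COUNTIF: how many values satisfy the criterion."""
    return sum(1 for v in values if criterion(v))

def sumif(values, criterion, sum_values=None):
    """SUMIF: add sum_values (or values) where the criterion holds."""
    if sum_values is None:
        sum_values = values
    return sum(s for v, s in zip(values, sum_values) if criterion(v))

sumproduct = sum(a * b for a, b in zip(x, y))           # SUMPRODUCT
sumsq = sum(a * a for a in x)                           # SUMSQ
sumx2py2 = sum(a * a + b * b for a, b in zip(x, y))     # SUMX2PY2
sumx2my2 = sum(a * a - b * b for a, b in zip(x, y))     # SUMX2MY2
sumxmy2 = sum((a - b) ** 2 for a, b in zip(x, y))       # SUMXMY2

passed = countif(scores, lambda v: v >= 75)
passed_total = sumif(scores, lambda v: v >= 75)
print(count, counta, countblank, passed, passed_total)
print(sumproduct, sumsq, sumx2py2, sumx2my2, sumxmy2)
```

Comparing the last three lines of the computation shows why SUMX2MY2 (difference of squares) and SUMXMY2 (square of differences) are easy to confuse but give different results.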
LOOKUP FUNCTIONS: THE VLOOKUP AND HLOOKUP
A lookup function basically returns a value from a table/range by looking up another value. One good example is when you are looking at your grade or current standing in a certain subject. First you find your name or student number in the first column, and then at that particular row, you look for the value which corresponds to your grade or standing. Another good analogy is the use of a telephone directory. If you want to find a person's telephone number, you first locate the name (look it up) and then retrieve the corresponding number. There are many cases wherein looking up a value in a table can be very tedious, especially when one is finding a number of items or when the table is considerably large. Hence one may opt to use Excel's lookup functions.
The VLOOKUP Function
The VLOOKUP function looks up a value in the first column of the lookup table and returns the corresponding value in a specified table column. The lookup table is arranged vertically. The syntax for the VLOOKUP function is
=VLOOKUP(lookup_value,table_array,col_index_num,range_lookup)
which means "I want to search for lookup_value in the first column of table_array, then return its corresponding value under column col_index_num". The VLOOKUP function's arguments are as follows:
lookup_value: The value to be looked up in the first column of the lookup table.
table_array: The range that contains the lookup table.
col_index_num: The column number within the table from which the matching value is returned.
range_lookup: Optional. If TRUE or omitted, an approximate match is returned. (If an exact match is not found, the next largest value that is less than lookup_value is returned.) If FALSE, VLOOKUP searches for an exact match and returns #N/A if it cannot find one.
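The difference between an approximate and an exact lookup is easiest to see on a small table. The Python sketch below imitates VLOOKUP under stated assumptions: the function name vlookup and the table layout are choices of this sketch, the first column must be sorted in ascending order for the approximate match, and the lookup table borrows the lower limits of the grading scale in the course requirements purely as sample data.

```python
from bisect import bisect_right

def vlookup(lookup_value, table_array, col_index_num, range_lookup=True):
    """Imitate VLOOKUP on a list of rows; returns "#N/A" when nothing matches."""
    first_column = [row[0] for row in table_array]
    if range_lookup:
        # Approximate match: the last row whose key is <= lookup_value.
        # Assumes first_column is sorted in ascending order.
        position = bisect_right(first_column, lookup_value)
        if position == 0:
            return "#N/A"
        return table_array[position - 1][col_index_num - 1]
    # Exact match: the first row whose key equals lookup_value.
    for row in table_array:
        if row[0] == lookup_value:
            return row[col_index_num - 1]
    return "#N/A"

# Column 1: lowest percentage of each bracket; column 2: grade.
grade_table = [
    (0, 5.0), (55, 4.0), (60, 3.0), (64, 2.75), (68, 2.5), (72, 2.25),
    (76, 2.0), (80, 1.75), (85, 1.5), (90, 1.25), (95, 1.0),
]

print(vlookup(83, grade_table, 2))          # approximate match
print(vlookup(83, grade_table, 2, False))   # exact match fails
print(vlookup(80, grade_table, 2, False))   # exact match succeeds
```

A bracket table like this one is the typical use of range_lookup = TRUE, while looking up names or student numbers normally calls for FALSE.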
The HLOOKUP Function
The HLOOKUP function is similar to the VLOOKUP function, except that the lookup table is arranged horizontally instead of vertically.
=HLOOKUP(lookup_value,table_array,row_index_num,range_lookup)
Note that VLOOKUP and HLOOKUP are not case-sensitive. Also, when table_array contains non-unique entries, that is, when lookup_value occurs more than once in the first column or row, VLOOKUP and HLOOKUP only consider the first occurrence. Therefore, these lookup functions work best when the lookup values are unique.
Performing Left Lookups: Combining MATCH and INDEX Functions
A limitation of the VLOOKUP and HLOOKUP functions is that they ONLY look up values in the 1ST COLUMN (VLOOKUP) or the 1ST ROW (HLOOKUP) of table_array. That means these functions cannot return values to the left of or above lookup_value. To perform left and upper lookups, one must use a combination of the MATCH and INDEX functions.
The MATCH function returns the relative position of a cell in a range that matches a specified value.
=MATCH(lookup_value, lookup_array, match_type)
lookup_value: The value to be matched in lookup_array.
lookup_array: The range being searched. It could be a row or a column.
match_type: 0 for an exact match; 1 for the largest value less than or equal to lookup_value; -1 for the smallest value greater than or equal to lookup_value.
The MATCH syntax means "I want to know the relative position of lookup_value in lookup_array". If the array is a column, MATCH returns a row number; if the array is a row, MATCH returns a column number.
The INDEX function returns a cell from a range based on its row and column number relative to the array.
=INDEX(array, row_num, col_num)
When the MATCH function becomes an argument of the INDEX function, one can perform left and upper lookups.
Left Lookup
=INDEX(array, MATCH(lookup_value, lookup_array, match_type), col_num)
Via the MATCH function, the lookup_value is found in a specific row, say r. Then the INDEX function returns the value in the rth row and column col_num. If col_num refers to a column to the left of the lookup value, a left lookup is performed. Otherwise, the formula above functions the same as VLOOKUP.
Upper Lookup
An upper lookup can be performed in a very similar fashion:
=INDEX(array, MATCH(lookup_value, lookup_array, match_type)-1, same col_num as lookup_value)
The "-1" in the second argument means "one row above" the row where lookup_value is found. The 3rd argument should be the column number where the lookup value is found in the array. Note that if a positive integer is added to the MATCH result instead of subtracting 1, the formula returns a value below lookup_value, much as HLOOKUP does.
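The INDEX and MATCH combination can be traced step by step in a short Python sketch. Everything here is illustrative: match and index are simplified stand-ins (only match_type 0 is covered), and the roster is made-up sample data in which the student number sits to the right of the surname, so that a VLOOKUP on the student number could not return the surname.

```python
def match(lookup_value, lookup_array):
    """Like MATCH with match_type 0: 1-based position of the first exact match."""
    for position, value in enumerate(lookup_array, start=1):
        if value == lookup_value:
            return position
    raise ValueError("#N/A: value not found")

def index(array, row_num, col_num):
    """Like INDEX: the cell at 1-based (row_num, col_num) of a list of rows."""
    return array[row_num - 1][col_num - 1]

# Hypothetical roster: column 1 = surname, column 2 = student number.
roster = [
    ["Santos", "2007-00051"],
    ["Reyes", "2007-00102"],
    ["Cruz", "2007-00233"],
]
student_numbers = [row[1] for row in roster]   # the lookup_array (column 2)

# Left lookup: find the row via column 2, return the value in column 1.
row_found = match("2007-00102", student_numbers)
surname = index(roster, row_found, 1)

# Upper lookup: same column as the lookup value, one row above it.
number_above = index(roster, row_found - 1, 2)

print(row_found, surname, number_above)
```

Because MATCH only reports a position, the column handed to INDEX is free to lie on either side of the lookup column.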
ARRAY FORMULAS
In constructing an FDT, the raw data are arranged in increasing or decreasing order to form an array. In this course, an array is any collection of items operated on collectively or individually; it does not need to be arranged in a particular order. In Excel, an array can be either one-dimensional or two-dimensional. One-dimensional arrays are stored in a range that is a row (horizontal array) or a column (vertical array). Two-dimensional arrays are stored in a rectangular range of cells.
Figure 3-3 Horizontal (Blue), Vertical (Pink) and Two-Dimensional (Red) Arrays
Array formulas allow Excel users to perform multiple and more complex calculations than ordinary formulas. This section discusses two kinds of array formulas: a multiple-cell array formula, which returns an array that occupies several cells, and a single-cell array formula. Creating an array formula is much the same as creating an ordinary one. The general procedure is summarized in the process below.
Process 3-1 Creating an Array Formula
1. Select the cell or range where you want to show the result
2. Input the array formula
3. Press CTRL+SHIFT+ENTER
Count the Number of Characters in an Array
Refer to the Excel file ArrayFormulas.xls. The original table (cols A-D) presents different products sold by committees of a certain organization, ORGSOC. For illustration purposes, let us say we want to count the number of characters in the range A3 to B11. Without an array formula, one would get the length of EACH cell using the LEN function and then add them all up. With an array formula, one can simply do the following:
Process 3-2 An Array Formula in Counting the Number of Characters in a Range of Cells
1. Select the cell where you want to show the result
2. Input: =SUM(LEN(A3:B11))
3. Press CTRL+SHIFT+ENTER
Note that the formula above is a single-cell array formula. Also, notice that the LEN function takes a whole range as its argument, whereas in a standard Excel formula LEN takes only a single argument. Furthermore, after pressing CTRL+SHIFT+ENTER, the formula is enclosed in braces "{ }".
Getting the Product of Corresponding Cells in Two Arrays
Columns C and D of the same file show the quantity of each product sold by a committee and its corresponding price per piece. Multiplying these two columns yields the sales of these products by committee. One can do this by multiplying C3 by D3 and double-clicking the Fill Handle. An alternative is to use a multiple-cell array formula:
Process 3-3 An Array Formula in Computing the Product of Corresponding Cells of Two Arrays
1. Select the range F3:F11
2. Input: =C3:C11*D3:D11 (select C3:C11, type *, select D3:D11)
3. Press CTRL+SHIFT+ENTER
Computing for the Sum of Products of Two Arrays
The usual way of getting the sum of products of two arrays is to first get the individual products (col E) and then get their total (cell B14). Another way is to use the SUMPRODUCT function. The SUMPRODUCT function takes arrays of the same dimension as its arguments, multiplies the corresponding cell pairs and adds up the products. The third option is to use a single-cell array formula:
Process 3-4 An Array Formula in Computing the Sum of Products
1. Select the cell where you want to show the result: B16
2. Input: =SUM(C3:C11*D3:D11)
3. Press CTRL+SHIFT+ENTER
Another useful array formula is one which uses the SUM-IF combination. The IF function can also be combined with other functions such as AVERAGE or COUNT. But first let us define the IF function.
Conditional Function: The IF Function
The IF function returns one of two values, depending on whether a logical statement is satisfied or not. Its syntax is as follows:
=IF(logical_test, [value_if_true], [value_if_false])
Basically, the IF function tests whether the condition stated in logical_test is met. If it is, the function returns [value_if_true]; otherwise it returns [value_if_false].
Computing for the Sum of Values which Satisfy a Condition
For example, suppose you only want the total sales made by a particular committee. One way is to first extract the sales of that committee and add them all up (cols G-H). Another way is to use a single-cell array formula:
Process 3-5 An Array Formula Using a SUM-IF Combination
1. Select the cell where you want to show the result: B19
2. Input: =SUM(IF(B3:B11="Marketing",F3:F11,0))
3. Press CTRL+SHIFT+ENTER
Alternatively, one can use the SUMIF function:
=SUMIF(range, criteria, [sum_range])
range: The range containing the values that determine whether to include a particular cell in the sum.
criteria: An expression that determines whether to include a particular cell in the sum.
sum_range: Optional. The range that contains the cells you want to sum. If you omit this argument, the function uses the range specified in the first argument.
Remarks on Array Formulas
After seeing some basic array formulas, three fundamental rules can now be established:
1. Arguments within an array formula must be of the same dimension.
2. Creating an array formula ALWAYS ends with CTRL+SHIFT+ENTER. Hence, array formulas are sometimes also called CSE formulas.
3. After pressing CTRL+SHIFT+ENTER, braces { } will enclose the formula. This cannot be done by manually typing the braces.
Furthermore, for multiple-cell array formulas, here are some other remarks:
Editing a cell in an array formula changes the whole array formula.
After editing a cell, one should still press CTRL+SHIFT+ENTER.
Further Use of Excel Array Formulas
In Stat 135 (Matrix Theory) you will be dealing with vectors and matrices, which are generally arrays. Matrix functions like MINVERSE (matrix inverse) and MMULT (matrix multiplication) can be of big assistance in checking whether your manual computations (which are tedious) are indeed correct.
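What an array formula does "in one go" can be imitated with element-by-element loops. The Python sketch below mirrors the four array formulas of this chapter on a tiny made-up table; the product names, committees, quantities and prices are assumptions of the sketch and are not the contents of ArrayFormulas.xls.

```python
# Hypothetical stand-ins for columns A to D of the worksheet.
products = ["Shirt", "Mug", "Pin"]                   # column A
committees = ["Marketing", "Finance", "Marketing"]   # column B
quantity = [10, 5, 20]                               # column C
price = [250, 120, 30]                               # column D

# {=SUM(LEN(A:B))}: apply LEN to every cell, then add the lengths.
total_characters = sum(len(text) for text in products + committees)

# {=C*D} entered over a range: one product per row (multiple-cell result).
sales = [q * p for q, p in zip(quantity, price)]

# {=SUM(C*D)} or SUMPRODUCT: the same products collapsed into one cell.
total_sales = sum(q * p for q, p in zip(quantity, price))

# {=SUM(IF(B="Marketing", F, 0))}: keep a sale only where the test holds.
marketing_sales = sum(
    sale if committee == "Marketing" else 0
    for committee, sale in zip(committees, sales)
)

print(total_characters, sales, total_sales, marketing_sales)
```

The zip calls make the first rule of array formulas visible: the ranges are paired cell by cell, so they have to be of the same dimension.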
Creating Charts and Graphics
Getting Started with Creating Charts
Learning Advanced Charting
Enhancing Your Work With Pictures And Drawings
CHARTS
Charts or graphs are visual representations of numeric values that make the figures more understandable. Because a chart is essentially a picture, it is very useful in summarizing data and is essential in revealing the characteristics, behavior and interrelationships of a series of numbers. Displaying data in a well-conceived chart makes the data more understandable. On the other hand, different kinds of charts, although based on the very same data, convey messages from different viewpoints. For example, a line graph and a vertical column chart can both represent time series data graphically, but they give different emphasis.
CHART TYPES
Choosing the “correct” chart type is an essential factor in making the message more relevant and convincing. In almost every case, the underlying message in a chart is some type of comparison. These comparisons may include the following:
Comparing Items to Other Items: For example, a chart may compare different kinds of waste products produced by Metro Manila.
Comparing Data over Time: For example, a chart may display a person's daily basal heart rate and indicate trends and changes over time.
Making Relative Comparisons: A pie chart of the breakdown of a household's monthly income depicts relative values (money) in terms of the slices of the pie.
Comparing Data Relationships: As in XY scatterplots.
Comparing Frequencies: As in histograms.
Identifying Outliers or Unusual Situations
These comparisons help in choosing which type of chart to use. Be warned that Excel will almost always produce different kinds of charts from the same data on request. That is, the software has no capability of determining which graph is best to use.
TYPES OF GRAPHS
Pie Charts
Use a pie chart to display one set of data as parts of a whole, or when a main item is divided into several categories.
[Figure: pie chart "Budget Allocation of a Middle-Class Household in Metro Manila, 2nd Qtr 2008". Slices: Food 35%, Transportation 15%, Education 10%, Monthly Bills 20%, Repair and Utility 5%, Leisure and Recreation 10%, Savings 5%]
Vertical Column Charts
Vertical column charts are useful in showing data changes over a period of time or in comparing several items.
[Figure: vertical column chart "Total Assets of Ledger & Clemens and SIX Barons, 2008 (in Billion Php)". Categories: Q1, Q2, Q3; series: Ledger & Clemens, SIX Barons]
Horizontal Bar Charts
Horizontal bar charts are used to illustrate comparisons among individual items. Sometimes, pictures are used to substitute the bars. Such graphs are called pictographs.
[Figure: horizontal bar chart "Breakdown of Philippine Solid Waste, 2007 (in 1,000 tons)". Categories: Paper, Plastics, Food Scraps, Textiles, Metals, Wood, Glass, Other]
[Figure: "Happiness Ratio Among Students Across 5 UP Courses". Categories: Course 1 to Course 5; series: Happiness Ratio]
Line Charts
Like vertical column charts, line charts are used to display data across a period of time. The emphasis of a line chart is not on the actual values, but on the trend or movement of the data across time.
[Figure: line chart "Annual Philippine Exports of Electronics and Clothing, 1990-2007 (in Billion Php)". X axis: years 1990 to 2008; Y axis: Exports (in Billions); series: Electronics, Clothing]
Radar Graphs
Radar graphs are used to compare two or more sets of data based on several attributes.
[Figure: radar graph "Arcade Game Scores of Kadyo and Nanding based on 5 Attributes". Attributes: Accuracy, Speed, Reaction, Precision, Power; series: Kadyo, Nanding]
Scatter Plots
Scatter plots are used mainly in determining the relationship between two variables.
[Figure: scatter plot "Predicted vs Actual Values of Chlorine Concentration". Axes: Predicted Values, Actual Values]
Surface Charts
Surface charts are used to display the effects of two variables, X and Y, in determining the value of a variable Z.
[Figure: surface chart "Tensile Strength Measurements". Axes: Temperature, Seconds, Tensile Strength; legend bands: 0-200, 200-400, 400-600]
Bubble Charts
Bubble charts are used to display three values in one graph: the X values, the Y values and the radius of the bubble.
[Figure: bubble chart "Market Penetration Index of Six Electronic Firms". Axes: Total Sales, Company Growth; bubbles: Gitz, DGS Int'l, Lizardo, Pheng-Xiu, Monroe, Lanaday]
EXCEL CHARTING
A chart is essentially an object that Excel creates upon request. The data on which the chart is based are, of course, stored in cells in worksheets. By default, a chart resides in an exclusive chart sheet, but it can also be placed in any Excel worksheet.
CREATING EXCEL CHARTS
The fastest way to create a chart is to do the following:
Process 4-1 Creating a Chart with the F11 Keystroke
1. Input the data to be charted
2. Select the range in step 1, including row and column titles
3. Press F11
The result is a default chart created in another sheet. Of course, one can still customize the chart after it is created. A very common way of creating charts in Excel is to do the first two steps in the process above and then choose the Insert → Chart command. A Chart Wizard dialog box will open.
Figure 4-1 The Chart Wizard Dialog Box
Process 4-2 Creating a Chart Using the Chart Wizard
1. Input the data to be charted
2. Select the range in step 1, including row and column titles
3. Click Insert then Chart to open the Chart Wizard
4. Perform the 4 Chart Wizard steps then click Finish
The Chart Wizard
Creating a chart using the Chart Wizard is a process involving 4 steps:
Process 4-3 Steps Involved in Using the Chart Wizard
1. Select Chart Type
2. Input Source Data
3. Chart Options: Titles, Legend, X and Y Values, etc.
4. Specify Chart Location
Analyzing Data With Excel
Statistical Functions in Excel
Analyzing Data with Pivot Tables
Analyzing Data Using Analysis Toolpak
A REVIEW OF DESCRIPTIVE STATISTICS
Before doing any actual statistical procedures in Excel, let us first have a detailed walkthrough of the basic statistical process. This review section is divided into three parts, the first being a review of descriptive statistics.
Descriptive statistics are statistical methods used to describe a collection of data at hand, gathered in various ways. This area of statistics is essential to understanding the behavior of the data being studied in order to make intelligent decisions. It forms the basis of virtually every quantitative (and sometimes qualitative) data analysis.
There are various ways of describing data at hand:
Graphical Representations
Tabular Descriptions
Summary Statistics or Measures
So far we have discussed the first item. The second is mainly a discussion of the Frequency Distribution Table, or FDT.
THE FREQUENCY DISTRIBUTION TABLE
A frequency distribution table is defined as an organized tabulation of the number of individual observations located in each category, called a class interval.
The Frequency Distribution Table (FDT) is a way to show frequencies in an organized way. It is part of the tabular presentation of data. In the FDT, the number of times an observation has occurred, or the frequency of the observation, is tallied. In most cases where there are many possible outcomes, the observations are grouped into class intervals instead of counting the frequencies of individual values. The "frequency" of a class interval is then the number of observations that fall within that interval.
Parts of the FDT
The FDT is composed of 9 columns. Refer to the table below:
CLASS INTERVALS  FREQ  LCB   UCB   CM    RF    RFP  <CF  >CF
16 – 21          10    15.5  21.5  18.5  0.13  13   10   80
22 – 27          14    21.5  27.5  24.5  0.18  18   24   70
28 – 33           8    27.5  33.5  30.5  0.10  10   32   56
34 – 39          10    33.5  39.5  36.5  0.13  13   42   48
40 – 45           9    39.5  45.5  42.5  0.11  11   51   38
46 – 51           3    45.5  51.5  48.5  0.04   4   54   29
52 – 57           9    51.5  57.5  54.5  0.11  11   63   26
58 – 63           2    57.5  63.5  60.5  0.03   3   65   17
64 – 69           9    63.5  69.5  66.5  0.11  11   74   15
70 – 75           6    69.5  75.5  72.5  0.08   8   80    6
                 n = 80
Table 5‐1 The Frequency Distribution Table
Class Intervals: These are pairs of numbers, called class limits, which define a class. The lower number in the interval is called the lower class limit and the upper number in the interval is called the upper class limit.
Class Frequency (FREQ): This is the number of observations that fall within the class interval. It is good practice to include the sum of the frequencies in the last row of this column.
Class Boundaries (LCB & UCB): These numbers are the true class limits. The upper class boundary is halfway between the upper class limit of the same class and the lower class limit of the next class, while the lower class boundary is halfway between the lower class limit of the same class and the upper class limit of the previous class.
Class Marks (CM): Numbers halfway between the class limits or class boundaries of the same class.
Relative Frequency (RF): The RF is computed as the frequency of the class divided by the total number of observations.
Relative Frequency Percentage (RFP): It is equal to RF x 100%.
Less than Cumulative Frequency (<CF): It is the number of observations less than the UCB of the same class.
Greater than Cumulative Frequency (>CF): It is the number of observations greater than the LCB of the same class.
Constructing the FDT
Creating the FDT can be summarized in three steps. Of course, the second and third steps require a number of processes:
Process 5‐1 Constructing the FDT: Arrange raw data into an array → Fill up FDT columns → Construct the histograms and ogives
1. Arrange the Raw Data into an Array
Let us start with the raw data (80 observations):
45 70 45 48 37 68 35 74
25 75 37 71 69 21 53 31
37 22 65 31 62 16 28 23
27 36 27 35 45 68 41 30
19 57 70 39 40 53 57 57
26 42 41 21 55 54 20 24
38 25 21 60 64 45 23 70
69 33 65 20 31 54 40 26
31 35 47 25 48 55 33 35
25 66 20 25 67 16 27 20
The raw data above is then ordered (preferably ascending) to form what we call an array. The array below is read down each column:
16 22 26 33 38 45 57 68
16 23 27 33 39 47 57 68
19 23 27 35 40 48 57 69
20 24 27 35 40 48 60 69
20 25 28 35 41 53 62 70
20 25 30 35 41 53 64 70
20 25 31 36 42 54 65 70
21 25 31 37 45 54 65 71
21 25 31 37 45 55 66 74
21 26 31 37 45 55 67 75
2. Determine the Number of Classes
Let K be the number of classes and n the number of observations. K can be approximated using Sturges' Formula: K = 1 + 3.322 log(n), where log is the base‐10 logarithm. So in our example:
K = 1 + 3.322 log(80) = 7.322065 ≈ 8 (round up)
Alternatively, one can use the $2^K$ rule, which chooses the smallest K for which $2^K > n$:
K = 6: $2^6 = 64 > 80$? No.
K = 7: $2^7 = 128 > 80$? Yes!
And of course, one's professional judgment determines the number of classes. The example proceeds with K = 7.
3. Determine the Class Size
The class size C can be approximated (by rounding up) using the following equation:
C ≈ (highest value – lowest value) / number of classes
C ≈ (75 – 16)/7 = 8.428571 ≈ 9
4. Determine the Lowest Class Limit
The first class interval should include the smallest observation. On the other hand, the lowest class limit must not be set so low that the highest class interval fails to accommodate the highest observation. In the example, let's say the lowest class limit is 12, so that the first class interval is 12 – 20.
5. Determine the Class Limits
First determine the lower class limits by adding C to the lower class limit of the previous class. Since we started with 12, the next lower class limit is 12 + 9 = 21, followed by 21 + 9 = 30, and so on. Each upper class limit is one less than the lower class limit of the next class.
CLASS
12 – 20
21 – 29
30 – 38
39 – 47
48 – 56
57 – 65
66 – 74
75 – 83
Before the actual tallying of frequencies, check if the last class frequency is nonzero. Otherwise, adjust the lowest class limit.
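Steps 2 to 5 can be sketched in Python as follows. This is only a minimal illustration and not part of the original notes: the function names are hypothetical, Sturges' formula is taken with a base‐10 logarithm, and each upper class limit is computed as lower limit + C − 1, which assumes integer‐valued data.

```python
import math


def sturges_classes(n):
    """Sturges' formula K = 1 + 3.322 log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


def two_to_k_classes(n):
    """Smallest K for which 2**K > n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def class_limits(lowest_limit, class_size, highest_value):
    """(lower, upper) class limits, assuming integer data."""
    limits = []
    lower = lowest_limit
    while lower <= highest_value:
        limits.append((lower, lower + class_size - 1))
        lower += class_size
    return limits


n, lowest, highest = 80, 16, 75
k = two_to_k_classes(n)                  # the notes proceed with this K
c = math.ceil((highest - lowest) / k)    # class size, rounded up
limits = class_limits(12, c, highest)    # 12 is the chosen lowest class limit

print(sturges_classes(n), k, c)
print(limits)
```

Starting at 12 with C = 9 produces eight class intervals, the last of which (75 – 83) still contains the highest observation.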
6. Tally Frequencies for Each Class
CLASS     FREQ
12 – 20    7
21 – 29   18
30 – 38   16
39 – 47   11
48 – 56    8
57 – 65    8
66 – 74   11
75 – 83    1
          n = 80
Here, it is good practice to add the frequencies and check that their sum equals the total number of observations.
7. Complete the FDT
LCB/UCB: The UCB is halfway between the upper class limit of the same class and the lower class limit of the next class, while the LCB is halfway between the lower class limit of the same class and the upper class limit of the previous class.
CM: Class mark = (UCB + LCB) / 2
RF: Relative Frequency = frequency of the class / number of observations
RFP: Relative Frequency Percentage = RF x 100%
<CF: Add the frequencies of the current class and all classes before it (the observations less than the UCB)
>CF: Add the frequencies of the current class and all classes after it (the observations greater than the LCB)
CLASS     FREQ  LCB   UCB   CM  RF    RFP  <CF  >CF
12 – 20    7    11.5  20.5  16  0.09   9    7   80
21 – 29   18    20.5  29.5  25  0.23  23   25   73
30 – 38   16    29.5  38.5  34  0.20  20   41   55
39 – 47   11    38.5  47.5  43  0.14  14   52   39
48 – 56    8    47.5  56.5  52  0.10  10   60   28
57 – 65    8    56.5  65.5  61  0.10  10   68   20
66 – 74   11    65.5  74.5  70  0.14  14   79   12
75 – 83    1    74.5  83.5  79  0.01   1   80    1
          n = 80
Table 5‐2 The Frequency Distribution Table with 8 Class Intervals
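As a cross‐check of Table 5‐2, the tallying and the remaining FDT columns can be sketched in Python. This is an illustrative sketch, not the notes' own procedure: `build_fdt` and its dictionary keys are hypothetical names, and the class boundaries are taken as limit ± 0.5, which assumes integer data with a gap of 1 between consecutive classes.

```python
RAW = [
    45, 70, 45, 48, 37, 68, 35, 74,
    25, 75, 37, 71, 69, 21, 53, 31,
    37, 22, 65, 31, 62, 16, 28, 23,
    27, 36, 27, 35, 45, 68, 41, 30,
    19, 57, 70, 39, 40, 53, 57, 57,
    26, 42, 41, 21, 55, 54, 20, 24,
    38, 25, 21, 60, 64, 45, 23, 70,
    69, 33, 65, 20, 31, 54, 40, 26,
    31, 35, 47, 25, 48, 55, 33, 35,
    25, 66, 20, 25, 67, 16, 27, 20,
]
# Class limits 12-20, 21-29, ..., 75-83 (class size 9)
LIMITS = [(12 + 9 * i, 20 + 9 * i) for i in range(8)]


def build_fdt(data, limits):
    """Return one dict per class with the FDT columns."""
    n = len(data)
    rows = []
    cumulative = 0
    for lower, upper in limits:
        freq = sum(1 for x in data if lower <= x <= upper)
        cumulative += freq
        rows.append({
            "class": (lower, upper),
            "freq": freq,
            "lcb": lower - 0.5,            # assumes a gap of 1 between classes
            "ucb": upper + 0.5,
            "cm": (lower + upper) / 2,
            "rf": freq / n,
            "rfp": 100 * freq / n,
            "lt_cf": cumulative,           # <CF
            "gt_cf": n - (cumulative - freq),  # >CF
        })
    return rows


rows = build_fdt(RAW, LIMITS)
for r in rows:
    print(r["class"], r["freq"], r["lcb"], r["ucb"], r["cm"],
          round(r["rf"], 2), r["lt_cf"], r["gt_cf"])
```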
Alternatively, one can also create an FDT with 10 class intervals, as shown in Table 5‐1.
HISTOGRAMS AND OGIVES
The Frequency Histogram
The frequency histogram is a graph of frequencies. Each vertical column represents the frequency of observations in a particular class interval. It tells us about the shape of the frequency distribution.
Figure 5‐1 The Frequency Histogram (x‐axis: Class Marks; y‐axis: Frequency)
Sometimes, instead of the actual frequencies, the height of each bar represents the relative frequency. This histogram is called the relative frequency histogram.
The Frequency Polygon
Instead of a vertical column chart, the frequency polygon uses a line chart to display frequencies. It is made by creating a line graph with the class marks as its X values and the frequencies as its Y values. Extra class marks with frequency 0 are added at the leftmost and rightmost parts of the graph to close it, hence a polygon.
Figure 5‐2 The Frequency Polygon (x‐axis: Class Marks; y‐axis: Frequency)
The Ogives
The term ogive refers to the pointed, curved nose of a projectile such as a bullet or a missile. Ogives are also seen in church architecture, particularly the pointed arch windows of Gothic churches. Ogives in statistics look very similar. An ogive plot has two components:
1. < Ogive: <CF plotted against UCB, the increasing line.
2. > Ogive: >CF plotted against LCB, the decreasing line.
Figure 5‐3 Plot of the Less Than and Greater Than Ogives (x‐axis: Class Boundaries; y‐axis: Cumulative Frequency)
The ogives have the same interpretation as the cumulative frequencies. The intersection of the two lines marks where the median (the middle observation) falls.
MEASURES OF CENTRAL TENDENCY – UNGROUPED DATA
When we say ungrouped data, we mean that the data we are working with is not summarized in any way, though it can be arranged into an array. In our case, ungrouped data is the raw data, while grouped data is the FDT. Measures of central tendency are values which convey information about the center of the data set. The most common measures of central tendency are the mean, median and mode.
The Arithmetic Mean
The arithmetic mean is the most common average. Simply called the mean, it is used far more frequently than the other means, the geometric mean and the harmonic mean. The mean is computed by adding all the values and dividing by the number of observations. Let us denote the population mean with N observations by $\mu$ and the sample mean with n observations by $\bar{X}$. Then $\mu$ and $\bar{X}$ are computed as:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
In some cases, the weights of the observations are not equal. One example is the computation of the GWA (General Weighted Average), wherein most of the subjects are worth 3 units while some are worth 4 units (Stat 131) or 5 units (Math 17 to Math 54). In such cases, the weighted average is used, denoted by $\bar{X}_W$:
$$\bar{X}_W = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$
The Median
The median, also a measure of location, is simply defined as the positional middle of an arranged data set. Denote the median by $M_d$ and let $X_{(i)}$ be the ith observation in the array. The median is calculated as follows:
If n is ODD: $M_d = X_{([n+1]/2)}$
If n is EVEN: $M_d = \dfrac{X_{(n/2)} + X_{([n/2]+1)}}{2}$
Unlike the mean, the median is a positional measure, so it is affected only by the position of the values and not by their magnitude.
The Mode
The mode, denoted by $M_o$, is the value that occurs most frequently in the data. It depends on the frequency of the values and hence is unaffected by extreme observations. The mode may not exist, or if it does, it may not be unique. A data set with only one mode is unimodal, bimodal if it has two modes and trimodal if it has three.
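The definitions above translate almost directly into code. The sketch below is a hedged illustration in Python with hypothetical function names; it uses DATA 1 from the practice exercises, and the grades and units in the weighted example are made up for illustration. The convention that a data set has no mode when no value repeats is an assumption of this sketch.

```python
from collections import Counter


def mean(xs):
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def median(xs):
    a = sorted(xs)                      # the array
    n = len(a)
    if n % 2 == 1:
        return a[(n + 1) // 2 - 1]      # X_((n+1)/2), shifted to 0-based index
    return (a[n // 2 - 1] + a[n // 2]) / 2


def modes(xs):
    """All most-frequent values; empty list if no value repeats."""
    counts = Counter(xs)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


DATA_1 = [84, 91, 22, 40, 78, 40, 51, 71, 16, 56, 23]
print(mean(DATA_1), median(DATA_1), modes(DATA_1))

# Hypothetical grades with 3, 4 and 5 units as weights
print(weighted_mean([1.5, 2.0, 1.75], [3, 4, 5]))
```

Returning a list from `modes` makes the "may not exist, may not be unique" cases explicit: an empty list means no mode, and two entries mean the data set is bimodal.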
MEASURES OF CENTRAL TENDENCY – GROUPED DATA
If our data is summarized in an FDT, the information on the ACTUAL values of the observations is lost. Nonetheless, approximations of these measures can still be computed from an FDT.
Approximated Mean – Grouped Data
The mean is approximated by the following formula:
$$\bar{X} \approx \frac{\sum_{i=1}^{k} f_i \, CM_i}{n}$$
Where
$f_i$ = frequency of the ith class
$CM_i$ = class mark of the ith class
n = number of observations
k = number of classes
Approximated Median – Grouped Data
The median is approximated by the following formula:
$$M_d \approx LCB_{md} + c\left(\frac{n/2 - {<}CF_{md-1}}{f_{md}}\right)$$
Where
$LCB_{md}$ = lower class boundary of the median class
c = class size of the median class
n = number of observations
${<}CF_{md-1}$ = less than cumulative frequency of the class preceding the median class
$f_{md}$ = frequency of the median class
The median class is the first class whose <CF is greater than or equal to n/2.
Approximated Mode – Grouped Data
The mode is approximated by the following formula:
$$M_o \approx LCB_{mo} + c\left(\frac{f_{mo} - f_1}{2f_{mo} - f_1 - f_2}\right)$$
Where
$LCB_{mo}$ = lower class boundary of the modal class
c = class size of the modal class
$f_{mo}$ = frequency of the modal class
$f_1$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class following the modal class
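Applied to Table 5‐2, the three grouped‐data formulas can be sketched as below. This is an illustrative Python sketch with hypothetical function names; it treats a missing neighbor of the modal class as having frequency 0 and takes the first class with the highest frequency as the modal class, both of which are assumptions of the sketch rather than rules stated in the notes.

```python
FREQS = [7, 18, 16, 11, 8, 8, 11, 1]                      # Table 5-2
LCBS = [11.5, 20.5, 29.5, 38.5, 47.5, 56.5, 65.5, 74.5]
MARKS = [16, 25, 34, 43, 52, 61, 70, 79]
C = 9                                                      # class size


def grouped_mean(freqs, marks):
    return sum(f * m for f, m in zip(freqs, marks)) / sum(freqs)


def grouped_median(freqs, lcbs, c):
    n = sum(freqs)
    before = 0                          # <CF of the preceding class
    for f, lcb in zip(freqs, lcbs):
        if before + f >= n / 2:         # first class whose <CF reaches n/2
            return lcb + c * (n / 2 - before) / f
        before += f


def grouped_mode(freqs, lcbs, c):
    i = freqs.index(max(freqs))         # first class with the highest frequency
    f_mo = freqs[i]
    f1 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
    return lcbs[i] + c * (f_mo - f1) / (2 * f_mo - f1 - f2)


print(grouped_mean(FREQS, MARKS))       # about 41.65
print(grouped_median(FREQS, LCBS, C))   # median class is 30 - 38
print(grouped_mode(FREQS, LCBS, C))     # modal class is 21 - 29
```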
MEASURES OF LOCATION
Also called fractiles or quantiles, these are values below which a specified fraction or percentage of the observations must fall. Special forms of quantiles are the percentiles, deciles, quartiles and the median.
Percentiles
Percentiles are values that divide an array of observations into 100 equal parts. For example, the first percentile, $P_1$, is the value below which 1% of the values fall. Similarly, $P_{95}$, or the 95th percentile, is the value below which 95% of the values fall. The percentile is computed as follows:
$P_i$ = the value of the $\left(\dfrac{i(n+1)}{100}\right)$th observation in the array.
When the number of observations is less than 100 or the positioning value is not a whole number, one can round the positioning value off to a whole number.
Deciles, Quartiles and the Median
Deciles, quartiles and the median can be expressed in terms of percentiles. For example, the deciles are measures of location which divide the array of observations into ten equal parts. Hence the jth decile, denoted by $D_j$, may be defined as the value below which j × 10% of the observations fall. For example, the 4th decile, or $D_4$, is the value below which 40% of the observations fall.
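The positioning rule for percentiles can be sketched as follows. This is a minimal Python illustration with hypothetical function names; rounding the positioning value half up and clamping it to the range 1 to n are assumptions of this sketch, since the notes only say that the value may be rounded off. The data is DATA 1 from the practice exercises.

```python
import math


def percentile(data, i):
    """Value of the (i(n+1)/100)th observation in the array."""
    a = sorted(data)
    n = len(a)
    position = i * (n + 1) / 100
    rank = math.floor(position + 0.5)   # round half up (assumption)
    rank = min(max(rank, 1), n)         # keep the rank inside the array
    return a[rank - 1]


def decile(data, j):
    return percentile(data, 10 * j)     # D_j = P_(10j)


def quartile(data, j):
    return percentile(data, 25 * j)     # Q_j = P_(25j)


DATA_1 = [84, 91, 22, 40, 78, 40, 51, 71, 16, 56, 23]
print(percentile(DATA_1, 50))   # the median
print(quartile(DATA_1, 1))
print(decile(DATA_1, 4))
```

Statistical packages use several different percentile conventions, so values from Excel, SPSS or SAS may differ slightly from this rule.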
MEASURES OF DISPERSION
Measures of dispersion are values used to describe the degree of spread in the data set, or its variation from the center. Measures of dispersion can be either absolute or relative. The common measures of dispersion are the standard deviation and the range.
The Range
Before defining the range, let us first define order statistics. Given an array with n observations, the ith order statistic is
$X_{(i)}$ = ith observation in the array
so that $X_{(1)}$ is the minimum value and $X_{(n)}$ is the maximum value. The range is now defined as
Range = $X_{(n)} - X_{(1)}$ (Ungrouped data)
Range = Highest Class Limit − Lowest Class Limit (Data in FDT)
The Variance and Standard Deviation
Let us denote the population variance by $\sigma^2$ and the sample variance by $s^2$. Then the variance is defined as follows:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
Taking the positive square roots, we can now define the standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
Note that the variance may be viewed as the "average squared deviation from the mean". It is not in the same unit as the original observations (for example, if the observations are expressed in km, then the variance is in km²). Hence, taking the square root (the standard deviation) brings it back to the original unit of measure.
If the sum of squared deviations from the sample mean were divided by n, the resulting sample variance would be a BIASED estimator of the population variance. To correct for the bias, the sum is divided by n − 1 instead of the usual n.
To facilitate computation, the computational formula for the variance is used:
$$s^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n-1)}$$
Taking the square root yields the computational formula for the standard deviation.
The standard deviation is a measure of absolute dispersion.
The Coefficient of Variation, CV
The Coefficient of Variation, denoted by CV, is a measure of relative dispersion used to compare the scatter of one distribution with that of another. Instead of the standard deviation, one should use the coefficient of variation when comparing two data sets with different units or significantly different means. The CV is defined by the following formulas, for a population and for a sample respectively:
$$CV = \frac{\sigma}{\mu} \times 100\% \qquad CV = \frac{s}{\bar{X}} \times 100\%$$
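The dispersion measures can be checked numerically. The Python sketch below uses hypothetical function names and DATA 1 from the practice exercises; it computes the sample variance from both the definitional and the computational formula to show that they agree.

```python
import math


def data_range(xs):
    return max(xs) - min(xs)            # X_(n) - X_(1)


def sample_variance(xs):
    """Definitional formula: squared deviations divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_computational(xs):
    """Computational formula: (n*sum(x^2) - (sum x)^2) / (n(n-1))."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


def cv_percent(xs):
    """Sample coefficient of variation, in percent."""
    xbar = sum(xs) / len(xs)
    return math.sqrt(sample_variance(xs)) / xbar * 100


DATA_1 = [84, 91, 22, 40, 78, 40, 51, 71, 16, 56, 23]
print(data_range(DATA_1))
print(sample_variance(DATA_1), sample_variance_computational(DATA_1))
print(math.sqrt(sample_variance(DATA_1)), cv_percent(DATA_1))
```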
PRACTICE EXERCISES – UNGROUPED DATA
The table below presents 20 sets of randomly generated numbers. Using a calculator, practice computing the mean, median and mode. You can also compute some measures of dispersion.
DATA 1 84 91 22 40 78 40 51 71 16 56 23
DATA 2 32 86 96 39 20 29 56 91 64
DATA 3 92 86 85 73 25 95 61 36 17 97 45 34 98
DATA 4 17 15 88 61 68 22 56
DATA 5 56 55 22 50 11 15 44 94 75 85 58
DATA 6 25 12 14 63 97 27 97 80 97 78 51 40 100
DATA 7 19 60 25 23 14 88 19 47 61 71 81 48 67
DATA 8 89 61 43 84 60 79 100 64 51
DATA 9 72 61 43 98 91 84 61 18 92 27 100
DATA 10 69 64 59 33 47 60 28 93 26 60 91
DATA 11 96 57 62 31 48 13 71 53 86 12 44 74 59 14
DATA 12 12 88 59 16 100 90 12 92 62
DATA 13 77 47 44 71 43 88 86 50 12 56 73
DATA 14 26 60 60 95 26 72 19
DATA 15 89 38 63 71 91 20 63 50 59 85 54
DATA 16 12 94 82 66
DATA 17 78 20 25 68 11 97 14 60 91 47 48 100 43 26 91
DATA 18 40 39 73 61 50 63 23 84 98 33 31 16 71 95 71 66 68
DATA 19 22 96 85 88 22 14 25 43 43
DATA 20 51 56 61 13 44 97 19 100
The answers are listed below. In the MODE column, "none" means that no value repeats, and two values are listed when the data set is bimodal. The CV column is reported as a proportion (standard deviation divided by the mean); multiply by 100 to express it as a percentage.
DATA     MEAN       MEDIAN  MODE    RANGE  STANDARD DEVIATION  CV
DATA 1   52.000000  51      40      75     26.313495           0.506028743
DATA 2   57.000000  56      none    76     28.874729           0.506574201
DATA 3   64.923077  73      none    81     29.956807           0.461420006
DATA 4   46.714286  56      none    73     28.715352           0.61470172
DATA 5   51.363636  55      none    83     27.306676           0.531634397
DATA 6   60.076923  63      97      88     33.732925           0.561495557
DATA 7   47.923077  48      19      74     25.633512           0.534888688
DATA 8   70.111111  64      none    57     18.857654           0.268968122
DATA 9   67.909091  72      61      82     28.644212           0.421802326
DATA 10  57.272727  60      60      67     22.680789           0.396013773
DATA 11  51.428571  55      none    84     26.592984           0.517085806
DATA 12  59.000000  62      12      88     36.823905           0.624133988
DATA 13  58.818182  56      none    76     22.710430           0.386112413
DATA 14  51.142857  60      26, 60  76     28.322127           0.553784599
DATA 15  62.090909  63      63      71     21.769036           0.350599401
DATA 16  63.500000  74      none    82     36.198527           0.570055538
DATA 17  54.600000  48      91      89     31.556752           0.577962483
DATA 18  57.764706  63      71      82     24.363214           0.421766427
DATA 19  48.666667  43      22, 43  82     32.318725           0.664083395
DATA 20  55.125000  53.5    none    87     31.674404           0.574592359
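CONFIDENCE INTERVAL ESTIMATION
$(1-\alpha)100\%$ Confidence Interval for the population mean µ
A. $$\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
B. $$\left(\bar{X} - t_{\alpha/2,v}\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2,v}\frac{S}{\sqrt{n}}\right), \text{ where } v = n - 1$$
C. $$\left(\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right)$$
$(1-\alpha)100\%$ Confidence Interval for the difference between two population means µ1 − µ2
Independent Samples
D. $$\left((\bar{X}_1 - \bar{X}_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\ (\bar{X}_1 - \bar{X}_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$$
E. $$\left((\bar{X}_1 - \bar{X}_2) - t_{\alpha/2,v}\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\ (\bar{X}_1 - \bar{X}_2) + t_{\alpha/2,v}\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right)$$
Where $S_p = \sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$ and $v = n_1 + n_2 - 2$
F. $$\left((\bar{X}_1 - \bar{X}_2) - t_{\alpha/2,v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}},\ (\bar{X}_1 - \bar{X}_2) + t_{\alpha/2,v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right)$$
Where $$v = \frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\dfrac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2 - 1}}$$
G. $$\left((\bar{X}_1 - \bar{X}_2) - z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}},\ (\bar{X}_1 - \bar{X}_2) + z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right)$$
Two Related/Paired Samples
H. $$\left(\bar{d} - t_{\alpha/2,v}\frac{S_d}{\sqrt{n}},\ \bar{d} + t_{\alpha/2,v}\frac{S_d}{\sqrt{n}}\right)$$
Where $d_i = x_i - y_i$, $\bar{d} = \dfrac{\sum_{i=1}^{n} d_i}{n}$, $S_d = \sqrt{\dfrac{n\sum_{i=1}^{n} d_i^2 - \left(\sum_{i=1}^{n} d_i\right)^2}{n(n-1)}}$, $v = n - 1$ and n = # of pairs
$(1-\alpha)100\%$ Confidence Interval for the population proportion p
I. $$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}\right)$$
$(1-\alpha)100\%$ Confidence Interval for the difference between two population proportions p1 − p2, where n1 and n2 are large
J. $$\left((\hat{p}_1 - \hat{p}_2) - z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}},\ (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}\right)$$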
To determine which formula should be used for a particular kind of data or the given in a problem, one may refer to the flowchart below:
Estimating what?
  Mean
    Mean µ
      σ² known: A
      σ² unknown
        n ≤ 30: B
        n > 30: C
    Difference µ1 − µ2
      Independent Samples
        σ² known: D
        σ² unknown
          n1 and n2 > 30: G
          σ1² = σ2²: E
          σ1² ≠ σ2²: F
      Two Related/Paired Samples: H
  Proportion
    Proportion p: I
    Difference p1 − p2: J
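Two of the intervals above, A and I, need only normal quantiles and can be sketched with the Python standard library. This is an illustrative sketch: the function names are hypothetical and the inputs in the usage lines are made‐up numbers, not data from these notes. The t‐based intervals (B, E, F, H) would need t quantiles, which the standard library does not provide.

```python
import math
from statistics import NormalDist


def z_interval_mean(xbar, sigma, n, confidence):
    """Formula A: interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)     # z_(alpha/2)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def z_interval_proportion(x, n, confidence):
    """Formula I: interval for a proportion, x successes out of n."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    q_hat = 1 - p_hat
    half_width = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical inputs for illustration only
low, high = z_interval_mean(xbar=100, sigma=15, n=36, confidence=0.95)
p_low, p_high = z_interval_proportion(x=40, n=100, confidence=0.95)
print(low, high)
print(p_low, p_high)
```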
HYPOTHESIS TESTING
Process 5‐2 Steps in Hypothesis Testing
1. State the null hypothesis (Ho) and the alternative hypothesis (Ha).
2. Choose the level of significance α.
3. Select the appropriate test statistic and state the decision rule with respect to the critical region.
4. Collect the data and compute the value of the test statistic from the sample data.
5. Make the decision. If the test statistic falls within the rejection region, reject Ho. Otherwise, do not reject Ho.
On critical regions, remember the following:
Ha: θ > θo → $Z > Z_{\alpha}$ or $t > t_{\alpha}$
Ha: θ < θo → $Z < -Z_{\alpha}$ or $t < -t_{\alpha}$
Ha: θ ≠ θo → $|Z| > Z_{\alpha/2}$ or $|t| > t_{\alpha/2}$
Ha: the 2 variables are not independent → $\chi^2 > \chi^2_{\alpha,(r-1)(c-1)}$
LIST OF TEST STATISTICS
Testing a hypothesis on the population mean
A. $$Z = \frac{\bar{X} - \mu_o}{\sigma/\sqrt{n}}$$
B. $$t = \frac{\bar{X} - \mu_o}{S/\sqrt{n}}, \text{ where } v = n - 1$$
C. $$Z = \frac{\bar{X} - \mu_o}{S/\sqrt{n}}$$
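The five steps can be traced in code for test statistic A. The sketch below is a hedged Python illustration: `z_test_mean` and its `alternative` labels are hypothetical names, and the inputs are made‐up numbers rather than data from these notes.

```python
import math
from statistics import NormalDist


def z_test_mean(xbar, mu_0, sigma, n, alpha, alternative):
    """Test statistic A with the critical-region decision rule.

    alternative is "greater", "less" or "two-sided" (labels assumed here).
    Returns (z, reject_ho).
    """
    z = (xbar - mu_0) / (sigma / math.sqrt(n))
    normal = NormalDist()
    if alternative == "greater":        # Ha: mu > mu_0
        reject = z > normal.inv_cdf(1 - alpha)
    elif alternative == "less":         # Ha: mu < mu_0
        reject = z < -normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":    # Ha: mu != mu_0
        reject = abs(z) > normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("unknown alternative: " + alternative)
    return z, reject


# Hypothetical inputs for illustration only
z, reject = z_test_mean(xbar=103, mu_0=100, sigma=15, n=100,
                        alpha=0.05, alternative="two-sided")
print(z, reject)
```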
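Testing the difference between two population means
Independent Samples
D. $$Z = \frac{(\bar{X}_1 - \bar{X}_2) - d_o}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$
E. $$t = \frac{(\bar{X}_1 - \bar{X}_2) - d_o}{S_p\sqrt{1/n_1 + 1/n_2}}$$
Where $S_p = \sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$ and $v = n_1 + n_2 - 2$
F. $$t = \frac{(\bar{X}_1 - \bar{X}_2) - d_o}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$$
Where $$v = \frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\dfrac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2 - 1}}$$
G. $$Z = \frac{(\bar{X}_1 - \bar{X}_2) - d_o}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$$
Two Related/Paired Samples
H. $$t = \frac{\bar{d} - d_o}{S_d/\sqrt{n}}$$
Where $d_i = x_i - y_i$, $\bar{d} = \dfrac{\sum_{i=1}^{n} d_i}{n}$, $S_d = \sqrt{\dfrac{n\sum_{i=1}^{n} d_i^2 - \left(\sum_{i=1}^{n} d_i\right)^2}{n(n-1)}}$, $v = n - 1$ and n = # of pairs
Testing a hypothesis on proportions
I. $$Z = \frac{x - np_o}{\sqrt{np_oq_o}}$$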
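Testing the difference between two proportions
J. $$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_p\hat{q}_p\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
Where $\hat{p}_p = \dfrac{x_1 + x_2}{n_1 + n_2}$ and $\hat{q}_p = 1 - \hat{p}_p$
Testing for independence
K. $$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Where
$O_{ij}$ = observed number of cases in the ith row of the jth column
$E_{ij}$ = expected number of cases under Ho = (column total) x (row total) / (grand total)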
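Test statistic K can be computed directly from a table of observed counts. The Python sketch below uses a hypothetical function name and a made‐up 2 x 2 table; because the standard library has no chi‐square quantile function, the critical value is passed in as a number looked up from a chi‐square table (3.841 for α = 0.05 with 1 degree of freedom).

```python
def chi_square_independence(observed):
    """Return (chi_square, degrees_of_freedom) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # E_ij = (column total) x (row total) / (grand total)
            e_ij = col_totals[j] * row_totals[i] / grand_total
            chi_square += (o_ij - e_ij) ** 2 / e_ij
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, df


# Hypothetical 2 x 2 table of observed counts, for illustration only
table = [[20, 30],
         [30, 20]]
chi_square, df = chi_square_independence(table)
critical_value = 3.841          # from a chi-square table: alpha = 0.05, df = 1
print(chi_square, df, chi_square > critical_value)
```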
To determine which test statistic should be used for a particular kind of data or the given in a problem, one may refer to the flowchart below:
Testing what?
  Mean
    Mean µ
      σ² known: A
      σ² unknown
        n ≤ 30: B
        n > 30: C
    Difference µ1 − µ2
      Independent Samples
        σ² known: D
        σ² unknown
          n1 and n2 > 30: G
          σ1² = σ2²: E
          σ1² ≠ σ2²: F
      Two Related/Paired Samples: H
  Proportion
    Proportion p: I
    Difference p1 − p2: J
  Independence: K
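The flowchart is a small decision tree and can be written as a lookup function. This is only a sketch in Python: the function name, the `target` labels and the keyword arguments are hypothetical, chosen here to mirror the branches of the flowchart.

```python
def choose_formula(target, *, sigma_known=False, large_samples=False,
                   equal_variances=False):
    """Return the letter (A to K) of the formula the flowchart selects.

    large_samples means n > 30 (or both n1 and n2 > 30).
    """
    if target == "mean":
        if sigma_known:
            return "A"
        return "C" if large_samples else "B"
    if target == "mean_difference_independent":
        if sigma_known:
            return "D"
        if large_samples:
            return "G"
        return "E" if equal_variances else "F"
    if target == "mean_difference_paired":
        return "H"
    if target == "proportion":
        return "I"
    if target == "proportion_difference":
        return "J"
    if target == "independence":
        return "K"
    raise ValueError("unknown target: " + target)


print(choose_formula("mean", sigma_known=True))
print(choose_formula("mean_difference_independent", equal_variances=True))
```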
PRACTICE EXERCISES – CONFIDENCE INTERVALS AND HYPOTHESIS TESTING
Confidence Intervals
1. A statistics test was given to 50 girls and 75 boys. The girls made an average of 80 with a standard deviation of 4 and the boys had an average of 86 with a standard deviation of 6. Find a 95% confidence interval for the difference of the true population means of statistics scores between boys and girls.
2. In a study of 50 successful pregnancies involving natural birth, the mean pregnancy term was found to be 274 days. Suppose that, from previous studies, the standard deviation of a pregnancy term may be safely assumed to be 14 days. Construct a 98% confidence interval for the true mean pregnancy term.
3. In a test to measure the performance of two comparable models of cars, Jissan and Conda, 128 cars of each model were driven on the same terrain with 40 liters of the same gasoline in each car. The mean number of kilometers for Jissan was 547 km with a standard deviation of 16 km; the mean number of kilometers for Conda was 539 km with a standard deviation of 25 km. Obtain a 98% confidence interval for the difference between the true mean number of kilometers for the two models of cars.
4. Two types of soft drinks, Pepti Cola and Krokikana, were tested for the amount of glucose (g/100ml). The findings are summarized in the following table.
GLUCOSE (g/100ml)    Pepti Cola  Krokikana
Mean                 4.10        4.63
Standard Deviation   0.06        0.09
Number of Samples    15          18
Find an estimate of the difference between the true means of glucose amount in the two kinds of drinks, using a 90% confidence interval. Assume unequal variances.
Hypothesis Testing
1. A statistics test was given to 50 girls and 75 boys. The girls made an average of 80 with a standard deviation of 4 and the boys had an average of 86 with a standard deviation of 6. Is there sufficient evidence at the 0.05 level of significance that the average grades of girls and boys are different?
2. A sample of 262 leukemia cases was made up of 150 male cases and 112 female cases. Does this provide evidence for a gender preference of the disease? Use a 0.01 level of significance.
3. A random sample of 10 MS Statistics students sat a Probability examination consisting of a theoretical and an application part. Their scores out of 100 are given in the table below.
Student           A   B   C   D   E   F   G   H   I   J
Theory Part       30  42  49  50  63  38  43  36  54  42
Application Part  52  58  42  67  94  68  22  34  55  48
Assuming the differences to be normally distributed, is the difference between the scores on the two parts of the probability exam significant at the 5% level of significance?
4. In a poll, 883 males and 893 females were asked, "If you could have only one of the following, which would you pick: money, health, or love?" Their responses are presented in the table below.
        Money  Health  Love
Men     82     446     355
Women   46     574     273
Are gender and response independent? Use a 0.10 level of significance.
TABLE OF MS EXCEL STATISTICAL FUNCTIONS
FUNCTION What it Does
AVEDEV Returns the average of the absolute deviations of data points from their mean
AVERAGE Returns the average of its arguments
AVERAGEA Returns the average of its arguments and includes evaluation of text and logical values
BETADIST Returns the cumulative beta probability density function
BETAINV Returns the inverse of the cumulative beta probability density function
BINOMDIST Returns the individual term binomial distribution probability
CHIDIST Returns the one‐tailed probability of the chi‐squared distribution
CHIINV Returns the inverse of the one‐tailed probability of the chi‐squared distribution
CHITEST Returns the test for independence
CONFIDENCE Returns the confidence interval for a population mean
CORREL Returns the correlation coefficient between two data sets
COUNT Counts how many numbers are in the list of arguments
COUNTA Counts how many values are in the list of arguments
COUNTBLANK Counts the number of blank cells in the argument range
COUNTIF Counts the number of cells that meet the criteria you specify in the argument
COVAR Returns covariance, the average of the products of paired deviations
CRITBINOM Returns the smallest value for which the cumulative binomial distribution is greater than or equal to a criterion value
DEVSQ Returns the sum of squares of deviations
EXPONDIST Returns the exponential distribution
FDIST Returns the F probability distribution
FINV Returns the inverse of the F probability distribution
FISHER Returns the Fisher transformation
FISHERINV Returns the inverse of the Fisher transformation
FORECAST Returns a value along a linear trend
FREQUENCY Returns a frequency distribution as a vertical array
FTEST Returns the result of an F‐Test
GAMMADIST Returns the gamma distribution
GAMMAINV Returns the inverse of the gamma cumulative distribution
GAMMALN Returns the natural logarithm of the gamma function, Γ(x)
GEOMEAN Returns the geometric mean
GROWTH Returns values along an exponential trend
HARMEAN Returns the harmonic mean
HYPGEOMDIST Returns the hypergeometric distribution
INTERCEPT Returns the intercept of the linear regression line
KURT Returns the kurtosis of a data set
LARGE Returns the kth largest value in a data set
LINEST Returns the parameters of a linear trend
LOGEST Returns the parameters of an exponential trend
LOGINV Returns the inverse of the lognormal distribution
LOGNORMDIST Returns the cumulative lognormal distribution
MAX Returns the maximum value in a list of arguments, ignoring logical values and text
MAXA Returns the maximum value in a list of arguments, including logical values and text
MEDIAN Returns the median of the given numbers
MIN Returns the minimum value in a list of arguments, ignoring logical values and text
MINA Returns the minimum value in a list of arguments, including logical values and text
MODE Returns the most common value in a data set
NEGBINOMDIST Returns the negative binomial distribution
NORMDIST Returns the normal cumulative distribution
NORMINV Returns the inverse of the normal cumulative distribution
NORMSDIST Returns the standard normal cumulative distribution
NORMSINV Returns the inverse of the standard normal cumulative distribution
PEARSON Returns the Pearson product moment correlation coefficient
PERCENTILE Returns the kth percentile of values in a range
PERCENTRANK Returns the percentage rank of a value in a data set
PERMUT Returns the number of permutations for a given number of objects
POISSON Returns the Poisson distribution
PROB Returns the probability that values in a range are between two limits
QUARTILE Returns the quartile of a data set
RANK Returns the rank of a number in a list of numbers
RSQ Returns the square of the Pearson product moment correlation coefficient
SKEW Returns the skewness of a distribution
SLOPE Returns the slope of the linear regression line
SMALL Returns the kth smallest value in a data set
STANDARDIZE Returns a normalized value
STDEV Estimates standard deviation based on a sample, ignoring text and logical values
STDEVA Estimates standard deviation based on a sample, including text and logical values
STDEVP Calculates standard deviation based on the entire population, ignoring text and logical values
STDEVPA Calculates standard deviation based on the entire population, including text and logical values
STEYX Returns the standard error of the predicted y‐value for each x in the regression
TDIST Returns the student’s t‐distribution
TINV Returns the inverse of the student’s t‐distribution
TREND Returns values along a linear trend
TRIMMEAN Returns the mean of the interior of a data set
TTEST Returns the probability associated with a student’s t‐Test
VAR Estimates variance based on a sample, ignoring logical values and text
VARA Estimates variance based on a sample, including logical values and text
VARP Calculates variance based on the entire population, ignoring logical values and text
VARPA Calculates variance based on the entire population, including logical values and text
WEIBULL Returns the Weibull distribution
ZTEST Returns the one‐tailed P‐value of a z‐test
Table 5‐3 List of Excel Statistical Functions
Programming Excel with VBA
Getting Started: MS Excel VBA
Introducing Macro Programming
Doing Simple Macro Programs
WHAT ARE MACROS
A macro is a sequence of statements or instructions, usually a program, that automates part of a job in MS Excel for faster and more efficient output. In MS Excel, the macro programming language is Visual Basic for Applications (VBA). After developing a macro, one can execute it anytime to perform time‐consuming, error‐prone and tedious procedures automatically. There are two ways of creating a macro in Excel:
1. RECORD A MACRO – All actions performed within the software are recorded and converted into a VBA macro. When you execute/run this macro, Excel performs the actions again.
2. WRITE THE CODE – As in Visual Basic or any programming language, one can write, or hard‐code, the process. This is especially useful for processes which cannot be created by recording alone. Also, writing a macro rather than recording it can make the program more efficient, because the unnecessary lines that a recording produces are left out.
Uses of Macro Programming
Macro programming in Excel has a lot of uses. The following list contains but a few of the things that you can do with VBA macros.
Insert a text string or formula
Automate a procedure that you perform frequently
Automate repetitive reports
TWO TYPES OF VBA MACROS
It is important to understand the two types of macro: a Sub and a Function.
VBA Sub Procedures
A VBA Sub procedure can be thought of as a command that can be run by either the user or another macro. An Excel workbook can have any number of Sub procedures.
Figure 6‐1 A VBA Sub Procedure
A Sub procedure always starts with the keyword Sub, then the macro's name and then a pair of parentheses. The End Sub statement signals the end of the procedure. The macro above, named Compute, is an example of a Sub procedure; it computes in cell F2 the average of the numbers in cells C10 and D10. The second line then tells Excel to set the active cell's font color to black. The Sub procedure Compute also contains a comment. Comments are notes to yourself or to other users, and are ignored by VBA. They are shown in green and are preceded by an apostrophe. A comment can be placed on its own line or after a statement. When VBA sees an apostrophe, it ignores the rest of the text in the line.
Executing a VBA Sub Procedure
You can execute a VBA Sub procedure in any of the following ways:
Choose Tools → Macro → Macros (shortcut Alt+F8), then select any macro from the list and run it.
Press the procedure's shortcut key combination, if it has one.
In the Visual Basic Editor, move the cursor anywhere within the code and press F5 or the play button.
"Call" the procedure from another VBA procedure.
VBA Functions
VBA Functions, like the usual worksheet functions, return a single value. A VBA function can be executed by other VBA procedures or used in worksheet formulas, just as you would use Excel's built‐in worksheet functions.
Figure 6‐2 A VBA Function
The VBA function above, named CubeRt, returns the cube root of a number. Notice that the function needs an argument, in this case the variable num. Unlike a Sub procedure, a Function procedure begins with the keyword Function and ends with the End Function statement.
SOME VBA DEFINITIONS
Code: VBA instructions that are produced in a module sheet when you record a macro. You can also enter VBA code manually.
Controls: Objects on a UserForm (or in a worksheet) that you manipulate. Examples include buttons, check boxes, and list boxes.
Function: One of two types of VBA macros that you can create. (The other is a Sub procedure.) A function returns a single value. You can use VBA functions in other VBA macros or in your worksheets.
Macro: A set of VBA instructions that are performed automatically.
Method: An action that is taken on an object. For example, applying the Clear method to a range object erases the contents and formatting of the cells.
Module: A container for VBA code.
Object: An element that you manipulate with VBA. Examples include ranges, charts, drawing objects, and so on.
Procedure: Another name for a macro. A VBA procedure can be a Sub procedure or a Function procedure.
Property: A particular aspect of an object. For example, a range object has properties, such as Height, Style, and Name.
Sub Procedure: One of two types of Visual Basic macros that you can create. The other is a function.
UserForm: A container that holds controls for a custom dialog box and holds VBA code to manipulate the controls.
VBA: Visual Basic for Applications. The macro language that is available in Excel as well as in the other applications in Microsoft Office.
Visual Basic Editor: The window (separate from Excel) that you use to create VBA macros and UserForms.
CREATING A VBA MACRO IN EXCEL
Creating a macro is not a linear process; it is a cycle consisting of four stages: Record, Write, Test and Diagnose.
Process 6‐1 The VBA Macro Creation Process
Note that the upper part of the process (Record and Write) is the creation part, and the lower half (Test and Diagnose) is the diagnosis part.
RECORDING YOUR FIRST MACRO
Now let us record our very first VBA macro in Excel.
0. Choose Tools → Macro → Security → Security Level tab → Medium option → press OK.
1. Choose Tools → Macro → Record New Macro → press OK.
2. Press the Relative Reference icon.
3. Type your name in cell E5 then press Enter.
4. Select E5 again and change its font color to blue.
5. Copy your name (Ctrl+C).
6. Select cells A1:D10.
7. Paste (Ctrl+V).
8. Press the “Stop Recording” button.
Examining the Macro
The macro was recorded in a new module, Module1. To view the code in this module, you must activate the Visual Basic Editor (VBE). You can activate the VBE in either of two ways:
Press Alt+F11
Choose Tools → Macro → Visual Basic Editor
The Visual Basic Editor looks like this:
Figure 6‐3 The Microsoft Visual Basic Editor
Your macro program will more or less look like this:
Sub Macro10()
'
' Macro10 Macro
' Macro recorded 1/13/2009 by cjdaquis
'
'
    Windows("Book2").Activate
    ActiveCell.Offset(4, 4).Range("A1").Select
    ActiveCell.FormulaR1C1 = "John Carlo P. Daquis"
    ActiveCell.Select
    Selection.Font.ColorIndex = 5
    Selection.Copy
    ActiveCell.Offset(-4, -4).Range("A1:D10").Select
    ActiveSheet.Paste
End Sub
Before proceeding, try to understand the code first.
Testing the Macro
Now, display the VB editor side‐by‐side with the Excel worksheet. Delete the used contents in the Excel worksheet and select cell A1. In the VB editor, place the cursor anywhere in the Sub procedure and then press F5. The whole procedure will run at once. Delete the contents of the worksheet again and select A1. This time, press F8 in the VB editor. Each press of F8 runs one statement of the procedure, which allows you to see the changes in the worksheet step by step.
Repeat the process a third time, but now select any cell other than A1 and run the program by pressing F8. The program still works, BUT the results are not displayed in the same cells as when cell A1 was selected. This is because the recorded macro does not refer to the cells by their actual addresses but by offsets relative to the active cell at the time the program is executed.
What if you ALWAYS have to refer to a fixed cell, instead of a cell relative to another one? This is one of the disadvantages of recording a macro. To improve the program, we must modify it by writing code.
Editing the Macro
By now you may have understood some simple elements of VBA code. For example, objects, properties and methods are separated by periods. Intuitively you know that Windows("Book2").Activate means “Excel, activate Book2” and that Selection.Copy is, obviously, “copy the selection”. With such intuition and basic knowledge, we are now ready to write simple VBA code. The table below presents the steps that need to be done, each with its corresponding (written) code.
INSTRUCTION                                           WRITTEN CODE
Type your name in Cell E5 then press enter.           Range("E5").Select
                                                      ActiveCell.Value = "John Carlo P. Daquis"
Select again E5 and change its font color to Blue.    Selection.Font.ColorIndex = 5
Copy your name, Ctrl+C.                               Selection.Copy
Select Cells A1:D10.                                  Range("A1:D10").Select
Paste, Ctrl+V.                                        ActiveSheet.Paste
Table 6‐1 An Example of Instructions Translated into VBA Codes
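Gathered into procedures, the written statements of Table 6‐1 might look like the sketch below. The procedure names WriteName and WriteNameDirect are hypothetical. The second version is an optional variation, not part of the original exercise, showing that the same result can be obtained without selecting any cells.

```vba
' Sketch: the statements from Table 6-1 placed in one Sub procedure.
' The name WriteName is hypothetical.
Sub WriteName()
    Range("E5").Select
    ActiveCell.Value = "John Carlo P. Daquis"
    Selection.Font.ColorIndex = 5      ' 5 is blue in the default palette
    Selection.Copy
    Range("A1:D10").Select
    ActiveSheet.Paste
End Sub

' Variation (an assumption, not from the notes): work on the Range
' objects directly instead of selecting them first.
Sub WriteNameDirect()
    With Range("E5")
        .Value = "John Carlo P. Daquis"
        .Font.ColorIndex = 5
        .Copy Destination:=Range("A1:D10")
    End With
End Sub
```

Both procedures always write to E5 and paste into A1:D10, whichever cell is active when they are run.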
Now you don’t need to worry about “stray” outputs, since your name will always be written in E5 and always be pasted in the range A1:D10. Record the macro again, but this time in absolute reference mode. The code you will get is very similar to what you have written.
Another example
This time, record a macro and do the following:
1. Select C5:E13.
2. Right click → Format Cells → Border tab → choose Outline → OK.
3. Stop recording.
The recorded program, surprisingly, is long:
Sub Macro15()
'
' Macro15 Macro
' Macro recorded 1/13/2009 by cjdaquis
'
'
    Range("C5:E13").Select
    Selection.Borders(xlDiagonalDown).LineStyle = xlNone
    Selection.Borders(xlDiagonalUp).LineStyle = xlNone
    With Selection.Borders(xlEdgeLeft)
        .LineStyle = xlContinuous
        .Weight = xlThin
        .ColorIndex = xlAutomatic
    End With
    With Selection.Borders(xlEdgeTop)
        .LineStyle = xlContinuous
        .Weight = xlThin
        .ColorIndex = xlAutomatic
    End With
    With Selection.Borders(xlEdgeBottom)
        .LineStyle = xlContinuous
        .Weight = xlThin
        .ColorIndex = xlAutomatic
    End With
    With Selection.Borders(xlEdgeRight)
        .LineStyle = xlContinuous
        .Weight = xlThin
        .ColorIndex = xlAutomatic
    End With
    Selection.Borders(xlInsideVertical).LineStyle = xlNone
    Selection.Borders(xlInsideHorizontal).LineStyle = xlNone
End Sub
Most of the lines here are unnecessary. Note that the code above can be further reduced to:
Sub Macro15()
    Range("C5:E13").Select
    With Selection.Borders(xlEdgeLeft)
        .LineStyle = xlContinuous
    End With
    With Selection.Borders(xlEdgeTop)
        .LineStyle = xlContinuous
    End With
    With Selection.Borders(xlEdgeBottom)
        .LineStyle = xlContinuous
    End With
    With Selection.Borders(xlEdgeRight)
        .LineStyle = xlContinuous
    End With
End Sub
Note that the cells we are working on have no prior formatting; for example, there are no diagonal borders before the program is run, so the lines that remove them can be dropped. Deleting lines in the recorded code makes the program shorter and more efficient, but deleting more than is required may lead to incomplete processes or even errors. A very good tip, therefore, is to first convert a line into a comment, run the code and see whether anything changes before deleting the line.
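As an illustration of how far handwritten code can shorten a recorded macro, the same outline border can be sketched with a loop over the four edges. This goes one step beyond the notes: the procedure name OutlineBorder is hypothetical, and the sketch assumes, as above, that the cells have no prior border formatting.

```vba
' Sketch: outline border on C5:E13 without selecting the range first.
' The name OutlineBorder is hypothetical.
Sub OutlineBorder()
    Dim edge As Variant
    ' Visit the four outer edges in turn and draw a continuous line on each.
    For Each edge In Array(xlEdgeLeft, xlEdgeTop, xlEdgeBottom, xlEdgeRight)
        Range("C5:E13").Borders(edge).LineStyle = xlContinuous
    Next edge
End Sub
```

When trimming a recorded macro by hand, the commenting tip still applies: put an apostrophe in front of a suspect line, rerun the macro and compare the result before removing the line for good.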
ABSOLUTE VERSUS RELATIVE RECORDING
You may have noticed the difference between absolute and relative recording of macros. By default, when you record a macro, Excel stores the EXACT references of the cells you have selected. For example, the line
Range("B1:C10").Select
means exactly what it says: “Select the cells in the range B1:C10.” In absolute recording, you work on the same cells you have selected, regardless of where the active cell is.
When recording in relative mode, selecting a range of cells is translated differently, depending on where the active cell is located right before the program is invoked. For example, if you are recording a macro and the active cell is A1, selecting the cells in range B1:B10 generates the following statement:
ActiveCell.Offset(0, 1).Range("A1:A10").Select
The program line above tells us: “From the active cell (which is A1), move 0 rows down and 1 column right, then treat this new cell as if it were cell A1. Now select the cells in what would be the new range A1:A10.” In other words, relative reference starts out by using the current active cell as its base cell and stores relative references to this cell.
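The difference can be seen by running the two short procedures below from different starting cells. This is only a sketch; the procedure names SelectAbsolute and SelectRelative are hypothetical. Note that selecting B1:B10 from A1 is a block ten rows high and one column wide, which is why the relative form uses Range("A1:A10") counted from the offset cell.

```vba
' Absolute: always selects B1:B10, wherever the active cell is.
Sub SelectAbsolute()
    Range("B1:B10").Select
End Sub

' Relative: selects a block of 10 rows and 1 column that starts
' one column to the right of the active cell.
Sub SelectRelative()
    ActiveCell.Offset(0, 1).Range("A1:A10").Select
End Sub
```

With A1 active, both procedures select B1:B10. With D5 active, SelectAbsolute still selects B1:B10, whereas SelectRelative selects E5:E14.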
HOW VBA WORKS
VBA is by far the most complicated feature in Excel. It is very powerful, yet sometimes overwhelming for a user‐programmer. This section summarizes how VBA works.
1. After writing or recording code in a module sheet, you can execute the macro in any of the ways mentioned before. VBA modules are stored in an Excel workbook, and a workbook can hold any number of VBA modules. To view or edit a VBA module, you must activate the Visual Basic Editor window (press Alt+F11 to toggle between Excel and the VBE window).
2. A VBA module consists of procedures. A procedure is basically computer code that performs some action. The following is an example of a simple Sub procedure called SumCell (it adds the contents of A1 and B1 and displays the result in C1):
Sub SumCell()
    Range("C1").Value = "=A1 + B1"
End Sub
3. A VBA module can also store Function procedures. A Function procedure performs some calculations and returns a single value. A function can be called from another VBA procedure or can even be used in a worksheet formula. Here’s an example of a function named AddTwo. (It adds two values, which are supplied as arguments.)
Function AddTwo(arg1, arg2)
    AddTwo = arg1 + arg2
End Function
4. VBA manipulates objects. Examples of objects include a workbook, a worksheet, a range on a worksheet, a single cell, and a jpeg picture.
5. Objects are arranged in a hierarchy and can act as containers for other objects. For example, Excel itself is an object called Application, and it contains other objects, such as Workbook objects. The Workbook object can contain other objects, such as Worksheet objects and Chart objects. A Worksheet object can contain objects such as Range objects, PivotTable objects, and so on. The arrangement of these objects is referred to as an object model.
6. Objects that are alike form a collection. For example, the Worksheets collection consists of all worksheets in a particular workbook.
7. You refer to an object in your VBA code by specifying its position in the object hierarchy, using a period as a separator. For example, you can refer to a workbook named Book1.xls as
Application.Workbooks("Book1")
This refers to the Book1.xls workbook in the Workbooks collection. The Workbooks collection is contained in the Application object (that is, Excel). Extending this to another level, you can refer to Sheet1 in Book1 as follows:
Application.Workbooks("Book1").Worksheets("Sheet1")
You can take it to still another level and refer to a specific cell as follows:
Application.Workbooks("Book1").Worksheets("Sheet1").Range("A1")
8. If you omit specific references, Excel uses the active objects. If Book1 is the active workbook, the preceding reference can be simplified as follows:
Worksheets("Sheet1").Range("A1")
If you know that Sheet1 is the active sheet, you can simplify the reference even more:
Range("A1")
9. Objects have properties. A property can be thought of as a setting for an object. For example, a Range object has properties such as Value and Name. You can use VBA both to determine object properties and to change them.
10. You can assign values to variables. To assign the value in cell A1 on Sheet1 to a variable called Interest, use the following VBA statement:
Interest = Worksheets("Sheet1").Range("A1").Value
11. Objects have methods. A method is an action that is performed with the object. For example, one of the methods of a Range object is ClearContents. This method clears the contents of the range. You specify methods by combining the object with the method, separated by a period. For example, to clear the contents of cells in the range D1:E15, use the following statement:
Range("D1:E15").ClearContents
12. VBA also includes all the constructs of modern programming languages, including arrays, looping, and so on.
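The points above can be tied together in one short procedure. This is a sketch under stated assumptions: the active workbook is assumed to contain a sheet named Sheet1 whose cell A1 holds a number, and the procedure name DemoObjects is hypothetical. Be aware that it overwrites B1 and clears D1:E15 on that sheet, so try it on a blank workbook.

```vba
' Sketch tying the points above together. Sheet1, the cell addresses and
' the procedure name DemoObjects are assumptions for illustration.
Sub DemoObjects()
    Dim Interest As Double
    Dim ws As Worksheet
    Dim i As Long

    ' A reference that walks down the object hierarchy, kept in a variable.
    Set ws = Application.ActiveWorkbook.Worksheets("Sheet1")

    ' Read a property into a variable, then change a property.
    Interest = ws.Range("A1").Value
    ws.Range("B1").Value = Interest * 2

    ' Apply a method to a Range object.
    ws.Range("D1:E15").ClearContents

    ' A collection and a loop: list the name of every worksheet.
    For i = 1 To ActiveWorkbook.Worksheets.Count
        Debug.Print ActiveWorkbook.Worksheets(i).Name
    Next i
End Sub
```

The worksheet names are printed in the Immediate window of the Visual Basic Editor (Ctrl+G displays it).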
References
Parts 1 & 2
Book References
Websites
Further Reading
REFERENCES
Main References
Walkenbach, John. “Microsoft Office Excel 2003 Bible”. Wiley Publishing, Inc. Indianapolis, Indiana: 2003.
Stinson, Craig and Dodge, Mark. “Microsoft Office Excel 2003 Inside Out”. Microsoft Press. Redmond, Washington: 2004.
Website References
http://en.wikipedia.org
http://office.microsoft.com/en-us/excel/default.aspx
http://www.utexas.edu/its/training/handouts/UTOPIA_ExcelGS/
http://www.bettersolutions.com/excel.aspx
Further Reading
Aitken, Peter. “Excel Programming Weekend Crash Course”. Wiley Publishing, Inc. Indianapolis, Indiana: 2003.
Harvey, Greg. “Excel Timesaving Techniques for Dummies”. Wiley Publishing, Inc. Indianapolis, Indiana: 2005.
Frye, Curtis; Freeze, Wayne and Buckingham, Felicia. “Microsoft Office Excel 2003 Programming Inside Out”. Microsoft Press. Redmond, Washington: 2004.
SPSS Overview
What is SPSS
The SPSS Interface
Crosstabs
Descriptives
WHAT IS SPSS
SPSS is an acronym for Statistical Package for the Social Sciences. This software was originally developed as a programming language used to conduct statistical analyses. Since then it has grown into a complex and powerful application which uses both GUI and programming interfaces to provide a variety of processes for encoding, data management, presentation and analysis of data.
This software performs most of the basic statistical methods like frequencies, descriptives, exploratory data analysis, regression, estimation and hypothesis testing. Moreover, it can do more complex and specialized statistical processes like multivariate data analysis and, through add‐ons, market research techniques and neural networks, to name a few.
Figure 7‐1 The SPSS Logo
SPSS History (Excerpts from http://www.spss.com/corpinfo/history.htm)
In 1968, Norman H. Nie, C. Hadlai (Tex) Hull and Dale H. Bent, three young men from disparate professional backgrounds, developed a software system based on the idea of using statistics to turn raw data into information essential to decision‐making.
This revolutionary statistical software system was called SPSS, which stood for the Statistical Package for the Social Sciences. Nie, Hull and Bent developed SPSS out of the need to quickly analyze volumes of social science data gathered through various methods of research.
THE SPSS INTERFACE
There are several windows in SPSS:
Data View
Variable View
Output View
Draft Output View
Script View
Syntax Editor
But since we are just starting to learn the software, we will concentrate on the first three windows mentioned above. In fact, those three are the most essential.
The Data View
The data view is much like the interface in Excel. It has cells, where each column represents a specific variable, question or attribute, and each row is an observation, response set or measurement. For example, if we translate cell E17 in Excel to SPSS, it refers to the response or measurement of the 17th observation on the 5th question or attribute. The Data View is the first interface you will see right after opening the software.
To appreciate this, let us say that we need to analyze a data set that has already been encoded. The instructions on how to input data are reserved for later discussions. The first file type that we are going to access is the SPSS dataset file, which has the extension *.sav. To open an SPSS dataset, one can do the following:
Select File → Open → Data… (Shortcut: Ctrl + O)
Open the *.sav file (say survey_sample.sav)
Figure 7‐2 The SPSS Data View
Selecting Cells in Data View
Selecting cells is similar in some ways to Excel:
One cell ‐ Left‐click on the cell
A range of cells ‐ Click‐and‐drag / Shift + Arrow Keys
All cells ‐ Ctrl + A
One column ‐ Left‐click on the column header
One row ‐ Left‐click on the row header
Multiple contiguous rows / columns ‐ Click‐and‐drag on row/column headers
Multiple non‐contiguous rows / columns ‐ Left‐click, then hold Ctrl and left‐click on the other row/column headers
Multiple non‐contiguous cells ‐ cannot be done
The Variable View
As in Excel, sheets are accessed by clicking on tabs. In SPSS, however, there can only be one data view sheet. The other sheet is called the variable view; pressing the Variable View tab activates it. The variable view is where you input the definitions of each of the variables in the data set. These definitions include the name of the variable, type, width, number of decimals, labels, values configuration, column width, alignment and level of measurement. These variable definitions will be explained in detail later. Note that in the variable view the variables are listed as rows, whereas in the data view they are listed as columns. Each column of the variable view represents one attribute of the variables.
Figure 7‐3 The SPSS Variable View
The Output View
The output view is where you can see the results of your various queries such as statistical tests, frequency distributions, regression models, charts and crosstabs. In Excel, the output is displayed in the same or in another worksheet. In SPSS, the output is displayed in the Output view. By now, you may have noticed that each interface in SPSS performs a different function. In order to view this interface, we need to perform a simple summary procedure called cross tabulation, or crosstabs.
Procedure: Crosstabs
A crosstab is a summary procedure which displays the joint frequency distribution of two or more variables. The following is an example of an SPSS crosstab output between the sex and race of the respondents in our survey dataset:
Table 7‐1 SPSS Crosstab Sample Output
The procedure of creating a basic crosstab in SPSS is fairly straightforward.
Process 7‐1 SPSS Crosstabs Procedure
1. Choose Analyze → Descriptive Statistics → Crosstabs.
2. Add row and column variables.
3. Add Test Statistics (optional) / Change Cell Display / Add Charts.
The crosstabs output above is partially interpreted as follows: “Out of the 2832 respondents, 1232 of them are males, 1002 of which are whites.” Also, notice that the crosstab output only displays frequencies. Column/row percentages, expected values per cell and the like can also be displayed. Crosstabs can also be accompanied by a clustered bar chart. To do this, mark the Display clustered bar charts checkbox in the crosstabs dialog box (see encircled in red).
Figure 7‐4 Crosstabs Dialog Box
The resulting clustered bar chart will look like this:
Figure 7‐5 SPSS Sample Clustered Bar Chart
If one wishes to conduct a hypothesis test, the crosstabs procedure gives the user the option to choose which test statistics to display, covering both parametric and non‐parametric tests.
An important caution in using crosstabs is that it is best used for discrete (nominal and ordinal) data. We can use crosstabs between sex and race, as in our example, but not for continuous data like income, temperature or time in seconds. We can, however, recode or transform such data into discrete categories. For example, the monthly income of a household can be recoded into income groups: below 5,000 pesos, 5,000‐14,999, 15,000‐24,999 and so on. Discussions on SPSS recoding and transformations will follow later.
After configuring the crosstabs and pressing OK, another window appears. That is the output view. The output view has two panels: the standard viewer panel and the outline viewer panel. The former is the bigger pane, where one can see the output of the procedure. The smaller pane displays the outline of the results in the output view. The output can be saved as a file with the extension *.spv.
Figure 7‐6 SPSS Output View
Let us conclude this section by introducing another process which is Descriptives.
Procedure: Descriptives
The Descriptives procedure outputs summary statistics of the chosen variable(s). By default, it outputs the number of observations N, the mean, maximum, minimum and standard deviation. It can also display other statistics such as the sum, variance, skewness and kurtosis.
Process 7‐2 SPSS Descriptives Procedure
1. Choose Analyze → Descriptive Statistics → Descriptives.
2. Add variables.
3. Add other statistics (optional).
The result, with most of the statistics displayed, is given in the table below:
Figure 7‐7 SPSS Descriptives Sample Output
In the next section, we will discuss how to enter and modify data. It includes definitions of variable names, descriptions of variable attributes, creating a new variable and a section on variable recoding and transformations.
The Data Editor
Variables and Variable Definitions
Defining Variable Properties
Entering and Editing Data
Go to Case and Case Selection
VARIABLE TYPES
SPSS uses what we call strongly typed variables; that is, each variable is defined according to the type of data it will contain. Since variables in SPSS are defined in this way, we need to make sure that the values we enter are consistent with the type of the variable.
Numeric. These are number‐value variables. The values of numeric variables are displayed in standard numeric format. The period is the decimal separator. The data editor accepts numeric values in standard format or in scientific notation.
Comma. These are also numeric variables, but the values are displayed with commas delimiting every three places. The period is the decimal separator. The data editor accepts comma values in standard format, with or without commas, or in scientific notation.
Dot. Same as comma variables, but this time the dot delimits every three places and the comma is the decimal separator. The data editor accepts dot values in standard format, with or without dots, or in scientific notation.
Scientific Notation. A numeric variable whose values are displayed with an embedded “E” followed by a signed power‐of‐ten exponent. For example, 1.23E2 stands for 1.23 × 10^2 = 123.
Date. A numeric variable in a date format. You can select a date format from a list.
Dollar. A numeric variable in dollar format.
Custom Currency. A numeric variable whose values are displayed in one of the custom currency formats that you have defined in the Currency tab of the Options dialog box.
String. Values of a string variable are not treated as numeric, even if they consist of digits. They can contain any characters up to the specified length. Note that string values are case sensitive: uppercase and lowercase letters are considered distinct.
Figure 8‐1 Variable Type Dialog Box
VARIABLE NAMES AND LABELS
Variable names can have at most 8 characters (recent versions of SPSS accept longer names, up to 64 characters, but short names remain a good habit). This keeps us from giving our variables extravagant names. SPSS variable names should be short but intuitive. The following are the rules in naming variables:
Variable names can be at most 8 characters long.
Variable names must begin with a letter.
Variable names must not end with a period.
Variable names cannot contain spaces or special characters other than @, #, _ and $.
Variable names must not be duplicated.
Variable names are not case‐sensitive. SPSS treats them all as lowercase characters.
The following keywords cannot be used as variable names: ALL AND BY EQ GE GT LE LT NE NOT OR TO WITH
Variable labels, on the other hand, are just descriptions of your variables. They can be up to 256 characters long. Variable labels can contain the special characters and reserved keywords which are not allowed in variable names.
VIEWING VARIABLE INFORMATION
To view information about a variable, one does not have to go to the variable view, search for the variable and then look up its definitions. SPSS has a Variables button which, when clicked, opens a dialog box containing a list of the variables and information about each of them.
Figure 8‐2 Variables Dialog Box
VALUE LABELS
Value labels are descriptive tags for each value of a variable. They are particularly useful when you assign numeric codes to represent non‐numeric values, especially categories. Value labels can be at most 60 characters long.
Figure 8‐3 Value Label Dialog Box
The process below tells us how to create value labels:
Process 8‐1 Creating Value Labels
1. Click the button in the Values cell.
2. Specify the value.
3. For each value, enter the label.
4. Click Add.
5. Click on the value‐"label" description to change/remove the label.
In the Data View, the labels can be viewed instead of the values by clicking on the Value Labels button.
VALUE MEASUREMENT LEVEL
In SPSS, one can specify the level of measurement of a variable as Nominal, Ordinal or Scale (Interval or Ratio). Here is a brief description of the levels of measurement.
Nominal. Data values are categorical and have no intrinsic order. Nominal variables can be strings, or numeric values that represent distinct categories. In our case, we prefer the latter way of defining nominal variables.
Ordinal. This level is still categorical in nature but now has a certain ordering; for example, rating scales and income categories are ordinal variables. Ordinal variables can be strings, or numeric values that represent distinct categories. Again, we prefer the second way. Note: In ordinal variables defined by strings, the ALPHABETIC ORDER is assumed to reflect the correct ordering of the categories. For example, the categories low, medium and high are ordered as high, low and medium, which is obviously not the correct order.
Scale. Scale variables are always numeric. These are values on an interval or ratio scale.
Copying an Attribute of a Variable
The variable view functions like a typical spreadsheet program. That means one can copy an attribute of a certain variable and paste it to another variable. This is particularly useful when you are dealing with survey questions with the same answers or items which use the same rating scale.
ENTERING AND EDITING DATA
The most direct way of entering data is through the data editor in the data view. One can enter data in any order: by case (one row at a time) or by variable (one column at a time). When a cell is selected, it becomes the active cell and is highlighted. Its value is displayed in the cell editor above the data editor.
Figure 8‐4 The Cell Editor Displaying the Active Cell Value
When entering or editing a cell value, the value is not recorded until the Enter key is pressed or another cell is selected. Lastly, entering data in an empty column automatically creates a new variable.
In creating a new data set, it is generally better to define the variables before the actual encoding of data. This makes the encoding more efficient. Moreover, “stray” values like strings in numeric variables are easier to notice; hence data treatment is done even before the value is encoded.
GO TO CASE/VARIABLE
Real‐life data, especially national survey data, are likely to have tens of thousands of cases spanning hundreds of variables. Finding a particular case or variable would indeed eat up time. Fortunately, SPSS has the Go To Case/Variable command, which simply takes you to the case or variable you want to select. To use it, click on the Go To Case (or Go To Variable) button and then type or select your desired case (variable).
Figure 8‐5 The Go To Case/Variable Dialog Box
Notice that the Go To dialog box has two tabs: one for Go To Case and the other for Go To Variable. This means that Go To Variable can also be accessed from Go To Case and vice versa. Alternatively, you can select Edit → Go To Case/Variable from the menu (Note: in previous versions, use the Data menu instead).
CASE SELECTION
More often than not, we may want to analyze only a part of the data. Such a part may be the first n respondents, or the male respondents only, or the adult respondents only. Intuitively, one may do so by first sorting the data according to the filter variable (in SPSS, by clicking on Data → Sort Cases), optionally discarding the cases which will not be used, and then performing the analysis on the selected data only.
Alternatively, we can use the Select Cases option by clicking on its button or through the menu Data → Select Cases.
Figure 8‐6 Select Cases Dialog Box
In selecting cases, we have the output option to (i) filter out unselected cases, (ii) copy selected cases to a new dataset or (iii) delete unselected cases. In (i), excluded cases are marked by a diagonal line through the row number, whereas in (ii), a new dataset containing only the selected cases is created. Note that the new dataset is not yet saved to the hard disk. It is NOT recommended to use option (iii)!
All Cases
Choose this option if you want to turn off the filter. All cases will again be used.
If Condition is Satisfied
Here you can choose the cases which satisfy the conditions you have specified. At least one filtering variable must be used in the condition.
Figure 8‐7 Select Cases: If Dialog Box
Random Sample of Cases and Based on Time or Case Range
In Random Sample of Cases, one may specify the percentage of cases to be sampled, or the exact number of cases to be drawn from the first N cases. In Based on Time or Case Range, one specifies which range of cases will be selected.
Data Transformations
Computing Variables
Count Values Within Cases
Recoding Variables
Ranking Cases
Automatic Recode
Visual Binning
COMPUTING VARIABLES
Another way of entering data is by computing variables. Here, a new variable is created as a function of existing variable(s). To compute a variable, click on the menu Transform → Compute Variable. The Compute Variable dialog box will appear. In the dialog box you can write the name of the new variable (Target Variable) and its corresponding formula (Numeric Expression).
Figure 9‐1 Compute Variable Dialog Box
COUNT VALUES WITHIN CASES
In a set of questions, for example YES/NO questions, one may be interested in counting the number of “YES” (or “NO”) answers given by each respondent. One can do this by using the Count Values within Cases command. Click on Transform → Count Values within Cases. This dialog box will appear:
Figure 9‐2 Count Values Within Cases Dialog Box
The whole process is illustrated below:
Process 9‐1 Count Values Within Cases
1. Choose Transform → Count Values within Cases.
2. Specify the name of the new variable (label optional).
3. Include all relevant variables.
4. Define the values to be counted.
5. Select cases (optional), then press OK.
Figure 9‐3 Count Values Within Cases: Values To Count Dialog Box
RECODING VARIABLES You can modify existing data or create new variables by recoding them. You can recode values within existing variables (INTO SAME) or you can create new variables based on encoded value of existing variables (INTO DIFFERENT). Recoding applies to both string and numeric variables. However if multiple variables are selected, they must all be of the same type. Recoding string and numeric variables together is not permissible. The process of recoding is summarized by the following:
Process 9‐2 Recoding Variables
An important reminder in recoding variables would be this: have a backup copy of your datafile. Recoding, knowingly or not, may result to loss of information. Such information may be very, very important. This is more frustrating if there is only one copy of the file (and most frustrating if you have purchased the data).
RECODING INTO SAME VARIABLES As the name implies, the recoding into same variables allows us specifically to do two things:
i. Reassign values of existing variables, or ii. Collapse ranges of existing values into new categories
By any means, DO NOT USE Recoding into Same Variables if you are to collapse existing values into new categories. Collapsing results to loss of information and step down in level of measurement. Its only advantage against recoding into different variables is that this procedure saves disk and memory space.
RECODING INTO DIFFERENT VARIABLES Recoding into different variables functions the very same way as in recoding into same variables. Obviously, the difference is that in Recoding into Different Variables, new variables are being created. This is particularly the thing which makes this kind of recoding advisable. Again, there is no loss of information. The dialog box is very much the same as in recoding into same, the only difference is that the there is a section to fill‐in the name of the output variable (and label, optional).
Figure 9‐4 Recode into Same / Different Variables Dialog Box
Old and New Values
Old values can be one of the following options:
i. Value : An individual value is recoded. The value must be of the same data type as the variable being recoded.
ii. System‐missing : Numeric system‐missing values are indicated by periods. String variables cannot have system‐missing values.
iii. System/user missing : User‐defined missing values or system‐missing values.
iv. Range : Inclusive range of values.
v. All other values : Values not included in any option.
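The Value, Range and All other values options can be pictured as an ordered list of rules, where the first rule that matches decides the new value. The Python sketch below is only an illustration of that logic and not SPSS itself; the function `recode`, the rule format and the age data are all hypothetical, and missing values are deliberately left out.

```python
def recode(value, rules, else_value=None):
    """Return the recoded value using the first matching rule.

    rules is a list of (kind, old, new) tuples (a hypothetical format):
      ("value", 5, 50)        recodes the single value 5 into 50
      ("range", (0, 17), 1)   recodes the inclusive range 0..17 into 1
    else_value plays the role of "All other values"; when it is None,
    values that match no rule are left unchanged.
    """
    for kind, old, new in rules:
        if kind == "value" and value == old:
            return new
        if kind == "range" and old[0] <= value <= old[1]:
            return new
    return value if else_value is None else else_value


# Recoding INTO A DIFFERENT variable: the original list is kept intact.
ages = [15, 22, 37, 64, 80]
rules = [("range", (0, 17), 1), ("range", (18, 64), 2)]
age_group = [recode(a, rules, else_value=3) for a in ages]

print(ages)       # [15, 22, 37, 64, 80]
print(age_group)  # [1, 2, 2, 2, 3]
```

Because `age_group` is a new list, the original ages are still available, which mirrors the advantage of recoding into different variables.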
Transform > Recode into Same/Different Variables > Include variable(s) you want to recode > Define the old and new values > Indicate output variable name (into different) > Select cases (optional) > OK
Figure 9‐4 Old and New Values Dialog Box
RANK CASES
Ranking cases allows us to create new variables containing the ranks of the observations on a particular variable. The process of ranking is summarized below:
Process 9‐2 Creating Ranking Variables
Figure 9‐5 Rank Cases Dialog Box
Note that the Rank Cases command goes hand‐in‐hand with Sort Cases. Simply put, you can sort using ranks.
Rank Types
For convenience, we will just use the default ranking scheme, which is simply Rank.
Ties
The table below explains the different methods of handling ties:
Values Mean Low High Sequential
23 1 1 1 1
27 3 2 4 2
27 3 2 4 2
27 3 2 4 2
29 5 5 5 3
32 6 6 6 4
Table 9‐1 Different Methods of Handling Ties
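The four columns of the table above can be reproduced with a few lines of arithmetic. The following Python sketch is an illustration only (the function name `rank_with_ties` is hypothetical and this is not how SPSS is implemented); it assumes ascending order.

```python
def rank_with_ties(values, method="mean"):
    """Rank values in ascending order using one of four tie methods."""
    ordered = sorted(values)
    distinct = sorted(set(values))
    ranks = {}
    for v in distinct:
        low = ordered.index(v) + 1         # first position taken by v
        high = low + ordered.count(v) - 1  # last position taken by v
        if method == "mean":
            ranks[v] = (low + high) / 2
        elif method == "low":
            ranks[v] = low
        elif method == "high":
            ranks[v] = high
        elif method == "sequential":
            # consecutive integers over the distinct values
            ranks[v] = distinct.index(v) + 1
        else:
            raise ValueError("unknown tie method: " + method)
    return [ranks[v] for v in values]


values = [23, 27, 27, 27, 29, 32]
for method in ("mean", "low", "high", "sequential"):
    print(method, rank_with_ties(values, method))
```

The three tied values of 27 occupy positions 2, 3 and 4, so they receive 3 (mean), 2 (low), 4 (high) or 2 (sequential), exactly as in the table.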
Transform > Rank Cases > Include variable(s) you want to rank > Include the "by" variable (optional) > Indicate if ascending or descending order > Specify ranking method and dealing with ties > OK
AUTOMATIC RECODE
Automatic Recode allows users to convert string or numeric values into consecutive integers. To recode string or numeric values into integers, choose Transform > Automatic Recode, select one or more variables to recode and, for each variable being recoded, indicate the name of the output variable. Below is the Automatic Recode dialog box.
Figure 9‐6 Automatic Recode Dialog Box
Some remarks on automatic recode:
Whenever a new variable is created, Automatic Recode retains any defined variable and value labels from the old variable. If the original variable has no defined value label, the original value is used as the label for the recoded value.
String values are recoded alphabetically. Uppercase letters precede their lowercase counterparts.
Missing values are recoded into missing values. The order of the missing values is preserved, and they are recoded into values higher than any nonmissing value. For example, if the original variable has 97 nonmissing values, then the lowest missing value will be recoded as 98, and that 98 is in turn a missing value for the new variable.
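The remarks above can be illustrated with a small Python sketch. It is not SPSS code: the function `automatic_recode` is hypothetical, the sort key is one assumed reading of "uppercase letters precede their lowercase counterparts", and value labels are ignored.

```python
def automatic_recode(values, missing=()):
    """Map distinct values to consecutive integers starting at 1.

    Nonmissing values come first in sorted order; values listed in
    `missing` are coded after them, keeping their own order.
    """
    missing = set(missing)

    def sort_key(v):
        # For strings: alphabetical, with "A" placed before "a" (assumption).
        return (v.lower(), v) if isinstance(v, str) else v

    valid = sorted({v for v in values if v not in missing}, key=sort_key)
    miss = sorted({v for v in values if v in missing}, key=sort_key)
    codes = {v: i for i, v in enumerate(valid + miss, start=1)}
    return [codes[v] for v in values], codes


letters, letter_codes = automatic_recode(["b", "A", "a", "B", "a"])
print(letter_codes)  # {'A': 1, 'a': 2, 'B': 3, 'b': 4}
print(letters)       # [4, 1, 2, 3, 2]

# 98 and 99 stand for user-missing codes in this made-up example.
scores, score_codes = automatic_recode([10, 30, 99, 20, 98], missing=(98, 99))
print(scores)        # [1, 3, 5, 2, 4]
```

In the second call there are three nonmissing values, so the lowest missing value is coded 4, which parallels the "97 nonmissing values, lowest missing value becomes 98" example.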
VISUAL BINNING
Visual Binning is a visual tool in SPSS that helps users create new variables by grouping values of existing variables into a limited number of categories, or bins. Visual binning lets the user create categorical variables from continuous ones (e.g., actual monthly income into distinct categories), or collapse ordinal categories into a smaller set of categories (e.g., 9 categories into three: low, medium and high).
Visual binning can be accessed by clicking Transform > Visual Binning. The process of visual binning is summarized as follows:
Process 9‐3 Visual Binning
It is worth noting that visual binning can also be done with recode into different variables. Whereas recode into different (or same) variables is a more flexible way of categorizing values of variables, visual binning, as the name implies, provides a visual way of categorization.
Transform > Visual Binning > Include variable(s) to bin > Specify binned (new) variable name > Inspect histogram of the selected variable > Make cutpoints and labels > Indicate if upper endpoints are included/excluded in the bin > OK
Figure 9‐7 Visual Binning Dialog Box
The first (variable inclusion) step is not shown, but here, the starting salary is the variable to be binned. Notice that the name of the new variable is salarycat. The histogram displayed is the frequency distribution of the selected variable. The blue and red lines indicate the cutoff points, with the red line marking the cutoff point currently selected in the grid below.
There are 6 cutoff points resulting in 7 bins. In general, n‐1 cutoff points result in n bins. Each bin in the example is labeled from 1 to 7. Labeling is done by simply typing the labels in the Label column. The cutoff points themselves are made in a new dialog box:
Figure 9‐7 Make Cutoffs Dialog Box
Equal Width Intervals
In equal width intervals, there are three entries:
i. First Cutpoint Location : Upper bound of the 1st category
ii. Number of Cutpoints : Defines the categories, n‐1 cutpoints = n bins
iii. Width : Width of each category
You only need to fill in two entries; the third one is automatically computed. Nonetheless, it may still be better to fill out all three. For example, defining only the first two entries will most likely result in an inconvenient interval width (e.g., 8 bins with the 1st cutoff at 10000 results in an interval width of 7928.571).
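How the three entries translate into bins can be sketched in Python. This is an illustration only: the function names and the salary figures are hypothetical, and the rule used for the automatically computed third entry in SPSS is not reproduced here. The sketch simply lays out cutpoints from a first location, a count and a width, and shows the effect of including or excluding the upper endpoint.

```python
import bisect


def equal_width_cutpoints(first, n_cutpoints, width):
    """Cutpoints at first, first + width, first + 2 * width, ..."""
    return [first + k * width for k in range(n_cutpoints)]


def bin_number(value, cutpoints, upper_included=True):
    """Return the 1-based bin of value; n cutpoints give n + 1 bins."""
    if upper_included:
        # a value equal to a cutpoint stays in the lower bin
        return bisect.bisect_left(cutpoints, value) + 1
    # a value equal to a cutpoint moves to the upper bin
    return bisect.bisect_right(cutpoints, value) + 1


cuts = equal_width_cutpoints(first=10000, n_cutpoints=3, width=10000)
print(cuts)  # [10000, 20000, 30000]

salaries = [9000, 10000, 10001, 30000, 45000]  # made-up values
print([bin_number(s, cuts) for s in salaries])                        # [1, 1, 2, 3, 4]
print([bin_number(s, cuts, upper_included=False) for s in salaries])  # [1, 2, 2, 4, 4]
```

Three cutpoints produce four bins, and only the values that sit exactly on a cutpoint (10000 and 30000) change bins when the endpoint rule changes.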
Equal Percentiles Based on Scanned Cases
Here, percentiles are used as cutpoints. You only need to fill in one entry; the other entry is automatically computed. The two entries are as follows:
i. Number of Cutpoints : Number of percentile cutoff points
ii. Width(%) : Percentage of the observations that will fall in each interval
If there is only one cutpoint, or the width(%) is equal to 50, the cutoff point is the median itself.
Cutpoints at Mean and Selected Standard Deviations Based on Scanned Cases
With the mean as the middle cutpoint, this option uses the mean and selected standard deviations as the cutpoints of the observations. The option has three checkboxes, one each for 1, 2 and 3 standard deviations. Not checking any of them results in the mean dividing the observations into two parts (i.e., LOW and HIGH, or BELOW MEAN and ABOVE MEAN). Checking at least one of the boxes makes the mean and the mean plus/minus the selected standard deviations the cutoff points. For example, the histograms below show the cutoff points for different scenarios:
Mean as Lone Cutpoint
Mean and +/‐ 1 Standard Deviation as Cutpoints
Mean and +/‐ 2 Standard Deviations as Cutpoints
Mean and +/‐ 1 SD and +/‐ 2 SDs as Cutpoints
Mean and +/‐ 1, 2 and 3 SDs as Cutpoints
Figure 9‐8 Different Scenarios on Mean and Standard Deviations as Cutpoints
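The scenarios in the figure differ only in which multiples of the standard deviation are added to and subtracted from the mean. A minimal Python sketch of that arithmetic follows; the function name and data are hypothetical, and the use of the sample standard deviation is an assumption rather than something stated in these notes.

```python
from statistics import mean, stdev


def mean_sd_cutpoints(data, sd_multiples=()):
    """Cutpoints at the mean and at mean +/- k standard deviations.

    sd_multiples lists the checked boxes, e.g. (1, 2) for +/-1 and +/-2 SD.
    An empty tuple leaves the mean as the lone cutpoint.
    """
    m = mean(data)
    s = stdev(data)  # sample standard deviation (assumption)
    cuts = {m}
    for k in sd_multiples:
        cuts.update((m - k * s, m + k * s))
    return sorted(cuts)


data = [1, 2, 3, 4, 5]                  # made-up observations
print(mean_sd_cutpoints(data))          # [3]
print(mean_sd_cutpoints(data, (1,)))    # three cutpoints, four bins
print(mean_sd_cutpoints(data, (1, 2)))  # five cutpoints, six bins
```

Each checked box adds two cutpoints, so checking all three boxes gives seven cutpoints and eight bins.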
The Output and Syntax Viewer
Output Viewer: Revisited
Pivot Tables
The Syntax Viewer
THE OUTPUT VIEWER: REVISITED
When a procedure is run, its syntax and output are displayed in a window called the Output Viewer or the SPSS Viewer. For consistency with the previous chapters, let us call it the Output Viewer. The output can be saved as an SPSS Viewer file with the extension *.spv. In the Output Viewer, one can basically do the following:
i. Browse results/output
ii. Copy the process syntax
iii. Show/hide and delete selected objects in the output viewer
iv. Edit tables with the Pivot Table
v. Change the display order of the objects/results
vi. Move items between the output viewer and other applications
vii. Show SPSS system errors
Figure 10‐1 The Output Viewer
The Output Viewer has two panes: the outline pane on the left and the output pane on the right. The active or selected object in the view is marked by a border and a red arrow on the right. Note that the menu bar still displays the menus seen in the data editor window, plus more. SPSS is designed so that whichever view (data editor, output or syntax) you are currently using, you can still perform most of the SPSS processes.
Outline Pane
In the outline pane, the whole output and each new process is represented by a yellow ring‐bound book icon, and each individual object by an opened or closed book icon.
A single click on an outline item (icon or label) activates the corresponding object and makes it visible.
Clicking on a hierarchy's collapse icon collapses the hierarchy and hides the objects under it in the output pane.
Double‐clicking on an object's icon hides the object in the output pane; to make it visible again, click on the icon.
Double‐clicking on the object/hierarchy text lets you edit the label.
Moving an object/hierarchy in the outline pane rearranges the order of the objects/hierarchies in the output pane.
Pressing Delete while an object/hierarchy is selected deletes the object or the whole hierarchy in the output pane.
The Log item, marked with its own icon, contains the syntax of the process/hierarchy.
Except for the Notes object, all objects are visible.
Output Pane
The output pane is where you actually browse the output of your process.
You can browse by using the scroll bars or via the outline pane.
Clicking on an object activates it. Ctrl+click activates more than one object.
To delete an object, select it then press Delete.
To move an object, select it and drag it to the desired place.
A selected object can be copied through Edit > Copy, Ctrl+C, or right‐click > Copy.
Double‐clicking on a table activates the edit mode or opens the pivot table interface.
PIVOT TABLES
Pivot tables allow users to interactively rearrange the rows, columns and layers in the table. More than that, the pivot table interface also includes table editing and formatting commands. In order to open the pivot table interface, double‐click on a table in the output pane of the output viewer. The Pivot Table dialog box will appear:
Figure 10‐2 The Pivot Table Interface
The Pivot Table in SPSS can perform various tasks:
i. Transposing rows and columns
ii. Moving rows and columns
iii. Creating layers
iv. Grouping/ungrouping rows and columns
v. Showing/hiding cells
vi. Changing the orientation of row and column labels
vii. Editing text in a table
viii. Table formatting
Transpose Rows and Columns
On the pivot table interface, choose Pivot > Transpose Rows and Columns. This command simply transposes the table. For example, the original table is this:
Table 10‐1 Transposed Table
Moving Rows and Columns
In order to move rows and columns or create layers, the Pivoting Trays window should be open. If the Pivoting Trays window does not show up, click Pivot > Pivoting Trays. The Pivoting Trays window will be displayed:
Figure 10‐3 The Pivoting Trays Window
To move a row or a column, click on any entry in the row or column and drag it to another place. For example, if you want a crosstabulation with marital status as columns subdivided by gender, click and drag the variable label marital status and drop it right above the variable label gender, making the following change (crosstabs cropped):
This yields the following pivot transformation:
Table 10‐2 Rearranged Table
Creating Layers
Creating layers follows the very same process as moving columns and rows. Open the Pivoting Trays window (again, click Pivot > Pivoting Trays), then click on a variable and drag and drop it onto the Layer field. The resulting crosstab will be divided into layers; the visible table is the table of the top layer. To change the layer, select a category from the drop‐down list of layers:
Table 10‐3 Selecting a Layer
Rotate Pivot Labels
To rotate pivot labels, click Format > Rotate Inner Column Labels or Format > Rotate Outer Row Labels.
Table 10‐4 Rotated Outer Row and Inner Column Labels
Edit Text and Formatting Tables
To edit a text, go to table edit mode or open the pivot table, then double‐click on the entry you want to edit. Caution: editing is not restricted to category names; you can also edit the data themselves! To format a cell, click on a particular cell (note that every part of the table is a cell, including the title), then click Format > Cell Properties or right‐click > Cell Properties. The Cell Properties dialog box will open:
Figure 10‐4 Cell Properties Dialog Box
The dialog box has three tabs which indicate the attributes you can edit: Font and Background, Format Value and Alignment and Margins. Formatting tables has two commands: Table Properties and Table Looks. In Table Properties, there are five tabs to work on: General, Footnotes, Cell Formats, Borders and Printing. Table Looks is similar to predesigned table styles in Microsoft Office Suites.
Figure 10‐5 Table Looks and Table Properties Dialog Boxes
THE SYNTAX VIEWER: INTRODUCTION
The Syntax Viewer is where you can view and edit the SPSS command language (program), which allows users to save procedures and hence automate certain tasks. It is comparable to Macros in Microsoft Office. To open the Syntax Viewer, either click File > New > Syntax or open an existing syntax file with the extension *.sps.
Figure 10‐6 The SPSS Syntax Viewer
SYNTAX RULES
Though we will not be discussing how to code commands, here are the rules in creating SPSS commands:
i. Each command/procedure begins on a new line and ends with a period.
ii. Each subcommand must be preceded by a slash, indicating the start of a new subcommand.
iii. Complete variable names are required.
iv. Text in apostrophes or quotation marks must be typed on a single line.
v. Each syntax line must not exceed 80 characters.
vi. A period indicates decimals (even in regions where the comma is used).
vii. Three or four letter abbreviations can be used for many commands.
viii. The whole syntax is case insensitive.
The last two items above basically tell us that this command block:
FREQUENCIES VARIABLES=JOBCAT GENDER
  /PERCENTILES=25 50 75
  /BARCHART.
is basically the same as:
freq var=jobcat gender /percent=25 50 75 /bar.
COPYING SYNTAX FROM OUTPUT LOG
Copying syntax from the output log is very similar to recording a Macro in Excel, in that you first do the procedure and then obtain the code. The output log is represented by its own icon (and, of course, the label Log). Select the Log > Ctrl+C or right‐click > Copy > Paste in the Syntax Editor.
RUNNING THE COMMAND
The run button in the SPSS Syntax Viewer is the play button (shortcut Ctrl+R), which runs the current command. There are other types of runs, as seen in the Run menu:
Figure 10‐7 Different Run Commands
Run All : Runs all commands in the syntax viewer.
Selection : Runs the highlighted/selected command.
Current : Runs the current command block.
To End : Runs all commands from the current cursor position to the end.
Some SPSS Statistical Procedures
Frequencies
Descriptives
Explore
Crosstabs
Means
T‐test
Correlations
FREQUENCIES
Frequencies in SPSS is a good place to start when one wants to describe variables. The frequencies procedure provides various statistics and graphical displays. The procedure can be accessed by clicking Analyze > Descriptive Statistics > Frequencies. A dialog box will then be shown where you include the variables. The procedure will output the following table:
Table 11‐1 Frequencies Output Table
The frequencies procedure is most applicable to data of nominal or ordinal levels. It can still be applied to scale variables, but the result may not be helpful for interpretation.
Optionally, one can also include statistics and graphs, and set the format for the ordering of the results displayed.
Table 11‐1 Frequencies: Statistics Output Table
Figure 11‐1 Frequencies: Different Charts Output
When you have included more than one variable, the order in which they are displayed follows the order in which the variables were selected. The display order option lets you rearrange the output, either alphabetically ascending or descending.
DESCRIPTIVES
The descriptives procedure is another summary procedure in SPSS. It can be accessed by clicking Analyze > Descriptive Statistics > Descriptives. It is most effective on data with many cases, particularly on variables of the scale level, and produces a table of descriptive statistics:
Table 11‐2 Descriptives Default Output Table
Again, there is an option to include more summary statistics and rearrange the table output, as displayed in the following dialog box:
Figure 11‐2 Descriptives: Options Dialog Box
EXPLORE
The explore procedure is both a descriptive and an Exploratory Data Analysis (EDA) tool of SPSS. The explore command presents descriptive statistics of selected variables, together with EDA techniques: the Stem and Leaf Display and the Box and Whisker Plot. To run this procedure, click Analyze > Descriptive Statistics > Explore. The explore dialog box will then appear:
Figure 11‐3 Explore Dialog Box
You can leave the Factor List and Label Cases by boxes blank. Including only the variable being explored, the following tables are the result:
Figure 11‐4 Explore: Default Descriptive Statistics and Exploratory Data Analysis
Including a factor list divides the variable of interest into the categories of the variable in the factor list. For example, if the starting salary is our variable of interest (dependent list) and our factor list is the variable gender, SPSS will display 2 sets of explore output, one for males and one for females. The Label Cases by option simply labels the cases according to the value labels of the chosen variable. This is particularly useful in identifying which cases are the outliers in the boxplots.
Explore: Statistics
Figure 11‐4 Explore: Statistics Dialog Box
Descriptives : Mean, 95% CI for Mean, 5% Trimmed Mean, Median, Variance, Standard Deviation, Minimum, Maximum, Range, Interquartile Range, Skewness, Kurtosis.
M‐Estimators : Robust measures of central tendency.
Outliers : Actually extreme values; displays the 5 highest and 5 lowest observations.
Percentiles : 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles.
Explore: Plots
Figure 11‐4 Explore: Plots Dialog Box
Boxplots : Factor levels together – separate display for each dependent variable. Dependents together – separate display for each category defined by the factor variable.
Descriptive : Either Stem‐and‐Leaf Display or Histogram, or both.
Normality Plots With Tests : Displays Normal Probability, or Quantile‐Quantile, Plots.
Spread vs Level : Levene’s test for homogeneity of variance of the original or the transformed data (more on this in your Stat 146).
Explore: Options
Figure 11‐5 Explore: Missing Values Dialog Box
Exclude cases listwise : Cases with missing values for any dependent or factor variable are excluded.
Exclude cases pairwise
Report Values : Missing values for factor variables are treated as a separate category. All output is produced for this “missing” category.
CROSSTABS
For review, recall the crosstabs procedure in pages 68‐71 of the notes. The crosstabs procedure displays two‐way or multidimensional tables, usually frequency tables, and provides a variety of tests and measures of association for two‐way tables.
Crosstabs: Statistics
Figure 11‐6 Crosstabs: Statistics Dialog Box
The SPSS Base Users Guide provides a good background on these tests.
Crosstabs: Cell Display
Figure 11‐6 Crosstabs: Cell Display Dialog Box
Counts : Observed (Oij) and Expected under the assumption of independence (Eij).
Percentages : Displays row/column percentages per category, as well as the percentage of the particular cell across all observations.
Residuals : Unstandardized – difference between an observed value and an expected value. Standardized – the unstandardized residual divided by an estimate of its standard deviation. Adjusted standardized – the residual divided by an estimate of its standard error.
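The cell display quantities above can be worked out by hand for a small table. The Python sketch below uses the usual textbook formulas, Eij = (row total × column total) / N, standardized residual (Oij − Eij) / √Eij, and adjusted standardized residual (Oij − Eij) / √(Eij (1 − row total/N)(1 − column total/N)). The function name and the 2×2 counts are hypothetical, and the sketch is not taken from SPSS.

```python
from math import sqrt


def crosstab_cells(observed):
    """Return expected counts and residuals for each cell of a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    cells = []
    for i, row in enumerate(observed):
        out_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            resid = o - e                          # unstandardized residual
            std = resid / sqrt(e)                  # standardized residual
            adj = resid / sqrt(
                e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )                                      # adjusted standardized residual
            out_row.append(
                {"observed": o, "expected": e, "residual": resid,
                 "standardized": std, "adjusted": adj,
                 "row_pct": 100 * o / row_totals[i],
                 "col_pct": 100 * o / col_totals[j],
                 "total_pct": 100 * o / n}
            )
        cells.append(out_row)
    return cells


table = [[30, 10],
         [20, 40]]  # made-up counts
cells = crosstab_cells(table)
first = cells[0][0]
print(first["expected"], first["residual"])  # 20.0 10.0
print(round(first["standardized"], 3), round(first["adjusted"], 3))
```

For the first cell the row total is 40, the column total is 50 and N is 100, so the expected count is 20 and the unstandardized residual is 10.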
MEANS
The means procedure in SPSS can be accessed by clicking Analyze > Compare Means > Means. This procedure compares means and other descriptive statistics of the values of the dependent variable grouped by the categories of the independent variable.
Figure 11‐7 Means Dialog Box
Means: Layers
You can include more than one variable in the independent list. Two variables in a single layer yield two separate tables of compared means:
Table 11‐3 Comparing Means: Single Layer, Two Variables
On the other hand, pressing the Next button enables you to create another subdivision of the dependent variable. So instead of having two separate tables, a single table is presented, with the dependent variable divided into the categories of the variable in the first layer, then further subdivided by the categories of the variable in the second layer. To illustrate:
Table 11‐4 Comparing Means: Two Layers, One Variable Per Layer
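The difference between two variables in one layer and one variable in each of two layers is just a difference in grouping keys. A small Python sketch, with a hypothetical function `group_means` and made‐up salary records, shows the idea; it is not the SPSS procedure itself.

```python
from collections import defaultdict
from statistics import mean


def group_means(records, dependent, layers):
    """Mean of `dependent` for each combination of the `layers` variables."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[var] for var in layers)
        groups[key].append(rec[dependent])
    return {key: mean(vals) for key, vals in sorted(groups.items())}


records = [  # made-up data
    {"gender": "f", "jobcat": "clerical", "salary": 20000},
    {"gender": "f", "jobcat": "manager", "salary": 40000},
    {"gender": "m", "jobcat": "clerical", "salary": 22000},
    {"gender": "m", "jobcat": "clerical", "salary": 26000},
    {"gender": "m", "jobcat": "manager", "salary": 48000},
]

# Single layer, two variables: two separate tables.
by_gender = group_means(records, "salary", ["gender"])
by_jobcat = group_means(records, "salary", ["jobcat"])

# Two layers, one variable per layer: one nested table.
nested = group_means(records, "salary", ["gender", "jobcat"])

print(by_gender)
print(by_jobcat)
print(nested)
```

The nested result has one mean for every gender and job category combination, whereas the first two calls each summarize the salaries by a single variable.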
T‐TEST
There are three types of t‐tests available: one sample, independent samples and paired samples t‐tests. All of these can be accessed by clicking Analyze > Compare Means > the desired t‐test. The one sample t‐test compares the mean of one variable against a hypothesized mean value. The independent samples t‐test compares the means of one variable for two groups of cases.
Figure 11‐7 Independent Samples t‐test Dialog Box
Include the variable(s) to be tested in the Test Variables pane, and the variable which defines the groups in the Grouping Variable box. Here, an independent samples t‐test is performed to test whether there is a significant difference between the true mean salary of male graduates and that of female graduates in a certain university. The output is as follows:
Table 11‐5 Independent Samples t‐test Output
Conveniently, SPSS outputs two test statistics, one for equal variances assumed and another for equal variances not assumed.
The paired samples t‐test compares the means of two variables for a single group. To illustrate, let us compare the initial and final weights of patients who undergo a certain diet plan (dietstudy.sav) and test whether the true mean initial weight and final weight are significantly different.
Figure 11‐8 Paired Samples t‐test Dialog Box
The setup above will yield the following table, which indicates a significant difference.
Table 11‐6 Paired Samples t‐test Output
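The three test statistics discussed in this section can be computed directly from their textbook formulas. The Python sketch below is only an illustration with made‐up numbers (the function names are hypothetical and nothing here comes from dietstudy.sav). It returns the t statistic and degrees of freedom only; p‐values are left out because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Independent samples t-test, equal variances assumed."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2


def welch_t(x, y):
    """Independent samples t-test, equal variances not assumed."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def paired_t(first, second):
    """Paired samples t-test on the differences first - second."""
    d = [a - b for a, b in zip(first, second)]
    t = mean(d) / sqrt(variance(d) / len(d))
    return t, len(d) - 1


group_a = [1, 2, 3, 4, 5]   # made-up values
group_b = [2, 4, 6, 8, 10]
print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))

initial = [10, 12, 14]      # made-up weights
final = [8, 11, 11]
print(paired_t(initial, final))
```

With equal group sizes the two independent samples statistics coincide and only the degrees of freedom differ, which is why the two rows of the SPSS output often look similar.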
BIVARIATE CORRELATIONS
Correlations measure how variables or rank orders are related. SPSS has three bivariate correlation coefficients: Pearson’s correlation coefficient, and the two nonparametric correlations Spearman’s rho and Kendall’s tau. For example, is the patient’s age related to the amount of weight he has lost in the diet plan? The following setup will answer it:
Figure 11‐9 Bivariate Correlations Dialog Box
Which yields the following result:
The correlation between age in years and weight difference is 0.382. At the 0.05 level of significance, there is insufficient evidence that there is a significant correlation between the two variables.
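For reference, Pearson's coefficient and Spearman's rho can be computed from their definitions as in the Python sketch below. The function names and data are hypothetical (they are not the diet study values), and Spearman's rho is taken here as the Pearson correlation of mean ranks, which is the usual definition when ties are present.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_ranks(values):
    """Ascending ranks, with ties given the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson_r(mean_ranks(x), mean_ranks(y))


def t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


x = [1, 2, 3, 4, 5]   # made-up values
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4), round(t_statistic(r, len(x)), 4))
print(spearman_rho([1, 2, 3], [1, 8, 27]))  # 1.0
```

The second example shows why Spearman's rho is called a rank correlation: the relationship is curved but perfectly monotone, so the ranks agree exactly.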
References
Part 3
Book References
Further Reading
REFERENCES
Main Reference
SPSS Base 13.0 Users Guide. SPSS Inc. Chicago, Illinois. 2004.
Further Reading
Pallant, Julie. “SPSS Survival Manual”. Open University Press. Philadelphia, Pennsylvania. 2001.
Gupta, Vijay. “SPSS for Beginners”. VJBooks Inc.
SAS: Getting Started
What is SAS
The SAS Interface
SAS Programs and Statements
SAS Data Sets
Library and Member Names
DATA Step
PROC Step
WHAT IS SAS
SAS is a program‐based software application which enables users to do the following data‐driven tasks:
i. Data Access (What data is needed?)
ii. Data Management (How do you form the data to meet certain requirements?)
iii. Data Analysis (What meaningful information is in the data?)
iv. Data Presentation (How can this information and its significance be communicated?)
In SAS, we use statements (much like sentences) to write a series of instructions. This is the SAS program.
Figure 12‐1 The SAS Logo
Historically speaking, SAS was developed back in the 1970s, when it started as a software package for statistical computing, much like SPSS. Hence, SAS was originally an acronym for Statistical Analysis System. By the early 1980s, SAS had branched out from its humble beginnings into graphics, spreadsheets and online data entry. Today, with the advancement of computing technology, SAS has grown into one of the most flexible data‐driven software packages, with a myriad of statistical analysis procedures.
THE SAS INTERFACE
SAS has 5 basic windows: the 2 left pane windows, which are the Results and Explorer windows, and 3 programming windows, which are the Log, Output and Program Editor.
Figure 12‐2 The SAS Interface
Program Editor Much like a notepad, this is where you input data as well as create, edit or submit a SAS program.
Log Records notes about the SAS session. The main concern here is that the log window can point out errors and warnings associated with the program that has been run.
Output This is where the output of the SAS program is displayed. The output window is comparable to the output pane of SPSS.
Results It displays an outline of the SAS output similar to that of the outline pane in SPSS.
Explorer Gives you easy access to your SAS files and libraries.
SAS PROGRAMS AND STATEMENTS
A SAS program is a series of SAS statements executed in order. This means that the statements must be placed in the program in a correct and ordered way. SAS programs are comparable to a statistician’s everyday activity:
I would like to compute for the mean.
The data I will use is the record of Stat 101 students.
I will use the variable: scores in their first exam.
Produce averages for every course.
The statistician first says what he wants to do (1st line), then gives the details (2nd to 4th lines). Note that the statements are ordered. If you jumble the statements, the program becomes unnatural and unclear. A SAS program is an ordered set of SAS statements similar to the series of statements above.
SAS Statements
SAS statements are the “sentences” which comprise a SAS program. Like any language, SAS has a few rules to follow. Fortunately, they are easy to remember. The most important rule in writing a SAS program is:
EVERY SAS STATEMENT ENDS WITH A SEMICOLON (;).
Memorize this most important rule and know it by heart, for even an experienced SAS user will occasionally forget the semicolon, just as a child (or even an adult) forgets to write a period after a sentence. Note that a SAS statement is not necessarily a single line. You can have a series of statements on a single line, or a single SAS statement covering two or more lines.
SAS Program Layout
As long as each SAS statement ends with a semicolon, SAS does not care how the program is laid out. A neat‐looking program, with indentation, uppercase letters for commands and lowercase letters for variables and data, is very helpful, but none of these is required.
i. SAS statements can be in upper or lowercase, or combined.
ii. SAS statements can continue on the next line, as long as you don’t split words and you end the statement with a semicolon.
iii. SAS statements can be on the same line as other statements, separated by a semicolon.
iv. SAS statements can start in any column.
Comments
Comments are statements which SAS skips. They are not part of the program per se, but they help the programmer make the program more understandable. There are two ways of creating a SAS comment:
Way 1 : * Asterisk - semicolon comment;
Way 2 : /* Slash asterisk - asterisk slash comment */
Errors
SAS errors are displayed in the LOG window, in red letters. One should expect errors, especially as a beginner. It may be just a misspelled command or variable, or an omitted semicolon. No need to panic.
SAS DATA SETS
SAS is very flexible and can read almost any kind of data. Nonetheless, these data follow a very familiar convention: SAS data sets, as in SPSS, are tables in which each row is an observation and each column is an attribute of that observation.
Id   Name      Height   Weight
34   Iyo       46       55
35   Michi     42       41
36   Myka      40       35
37   Poppert   44       52
38             41       40
39   Andrea    43       .
Table 12‐1 SAS Data Set Arrangement
In a single data set, SAS can handle a whopping 32,767 variables. The number of observations, on the other hand, is limited only by your computer’s capacity to store and handle the data set.
Data Types
SAS has only two data types: numeric and character. If a variable contains letters or special characters, then it must be character data. But if the variable contains only numbers, it may be numeric or character. ZIP codes are a classic example of a character variable which contains only numbers. In the table above, height and weight are numeric, while id could be either numeric or character. It is your call.
SAS Data Set and Variable Naming
There are a few rules to remember in naming SAS data sets and variables:
i. Names must be at most 32 characters long.
ii. They must start with a letter or an underscore.
iii. They can contain letters, numerals or underscores.
iv. SAS names can contain upper and lowercase letters.
The last point means that SAS is case insensitive. For example, the data set name scorestanding is the very same as ScoreStanding, or even sCoReStAnDiNg. For variable names, SAS remembers the first occurrence of each variable name and uses that case when printing results. In this course, we will use the convention that all data set names and variables are in lowercase.
Missing Values
In SAS, missing character data are represented by a blank, while missing numeric data are marked by a period (refer to the table above).
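The four naming rules listed earlier in this section are easy to check mechanically. The Python sketch below is only an illustration of the rules as stated in these notes; the helper names `is_valid_sas_name` and `same_sas_name` are hypothetical and are not part of SAS.

```python
import re

# Rule ii: starts with a letter or underscore.
# Rule iii: followed by letters, numerals or underscores.
# Rule i: at most 32 characters in total (1 + up to 31 more).
SAS_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_sas_name(name):
    """True when name satisfies the naming rules listed in the notes."""
    return SAS_NAME.fullmatch(name) is not None


def same_sas_name(a, b):
    """Rule iv: names that differ only in case refer to the same thing."""
    return a.lower() == b.lower()


for candidate in ["scorestanding", "_temp1", "1stscore", "my-var", "x" * 33]:
    print(candidate[:15], is_valid_sas_name(candidate))

print(same_sas_name("scorestanding", "sCoReStAnDiNg"))  # True
```

Names that start with a numeral, contain a hyphen, or run past 32 characters are rejected, while the differently cased spellings of scorestanding are treated as one name.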
DATA AND PROC STEPS
A SAS program is divided into two main blocks: the DATA step and the PROC step. The DATA step creates a SAS data set and the PROC step processes the data. In summary, the following table presents the basic differences between the two steps:
DATA                           PROC
Begins with DATA statement     Begins with PROC statement
Reads and modifies data        Performs analysis or function
Creates a SAS data set         Produces results or report
Table 12‐2 Basic Difference Between DATA and PROC Step
However, as you learn more about SAS, you will discover that a DATA step can produce an output and the PROC step can also create data.
THE DATA STEP The first important thing to know is how to get data into the SAS system. SAS uses the DATA step in order to construct SAS datasets. A SAS dataset can be created by a SAS program which includes several or one data step. These datasets can be combined to form new datasets or can be permanently stored in a computer. How does SAS execute a DATA step? This is another very important rule: The part “SAS steps execute line by line” simply means that SAS executes line 1 of the program, then line 2, and so on. The part “and observation by observation” is not that obvious. It means that SAS takes the first observation then runs it all the way through the DATA step, before looping back to pick up the second observation and so on. Later on we will understand more about this through examples. SAS Constants, Variables and Expressions SAS constants can either be numeric or character. Numeric constants are just numbers. Decimal points, negative signs and scientific notations are valid for numeric constants. A character constant on the other hand consists of 1 – 200 characters enclosed in single quotes. SAS variables, which are user‐given names can be numeric or character. When a variable exists but does not have a value, it is said to be missing. A SAS expression is a series of constants, variables and operators that produces a value. Consider your very first SAS program:
DATA first;
    name = 'Ignacio';
    wpounds = 153;
    wkilos = wpounds * .45359237;
RUN;

PROC PRINT DATA = first;
RUN;
The program above creates a dataset named first with a single observation and three variables: name, weight in pounds (wpounds) and weight in kilos (wkilos). Notice that the variables are assigned a character constant, a numeric constant and an expression, respectively. This is also our first PROC step: PROC PRINT simply prints the dataset. If no dataset name is indicated, SAS prints the most recently created dataset. The output looks like this:

The SAS System          00:49 Wednesday, February 25, 2009  12

Obs    name       wpounds    wkilos
  1    Ignacio      153      69.3996

To run a SAS program, press the running man button (Submit). If you only want to run part of a program, highlight that part first and then press Submit.

INPUT and DATALINES statements
To clarify the remark above that DATA steps execute line by line and observation by observation, consider a program with 4 variables: name = name of the person, y = weight, x1 = height and x2 = average number of calories consumed daily. There are 4 observations:

DATA second;
    INPUT name $ y x1 x2;
    DATALINES;
Esther 160 69 400
Fraulein 152 70 500
Igor 180 72 4500
Vangie 160 68 2500
;
RUN;
This program shows a very basic way of creating data through direct input. It creates the dataset second, which has 4 observations and 4 variables, 1 of which is a character variable. The INPUT statement tells SAS which variables to read; the dollar sign ($) after name marks it as a character variable. The DATALINES statement then instructs SAS to treat the lines that follow as observations until a line containing a semicolon is reached. The DATALINES statement can be replaced by the CARDS statement. Going back, let us write the program again, this time showing SAS’ invisible loop:
/* i = 1 */
/* DO UNTIL i = N */
DATA second;
    INPUT name $ y x1 x2;    /* Reads only the ith observation */
    DATALINES;
Esther 160 69 400
Fraulein 152 70 500
Igor 180 72 4500
Vangie 160 68 2500
;
RUN;
/* i = i + 1 then LOOP */
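One way to watch the invisible loop at work is to print the automatic variable _N_, which counts the iterations of the DATA step. The sketch below is only an illustration; the dataset name loopdemo is made up, and only two of the four variables are read to keep it short:

```sas
DATA loopdemo;
    INPUT name $ y;
    /* _N_ is an automatic variable that counts DATA step iterations.
       PUT writes one line to the log every time the step runs. */
    PUT 'Iteration ' _N_ ': ' name= y=;
    DATALINES;
Esther 160
Fraulein 152
Igor 180
;
RUN;
```

After submitting, check the log: there is one line per observation, which shows that the whole DATA step was executed once for each data line.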
INFILE Statement
The INFILE statement tells SAS to read raw data from an external file instead of from lines typed in the program. For example, suppose we have saved the data in a text file, example.txt, with the following content:

Esther 160 69 400
Fraulein 152 70 500
Igor 180 72 4500
Vangie 160 68 2500

The program above then becomes:

DATA second;
    INFILE 'C:\example.txt';
    INPUT name $ y x1 x2;
RUN;

Note that no CARDS or DATALINES statement is needed, since the data now come from the file.
The output is the same as that of the preceding program:

The SAS System          00:49 Wednesday, February 25, 2009  15

Obs    name        y     x1     x2
  1    Esther     160    69     400
  2    Fraulein   152    70     500
  3    Igor       180    72    4500
  4    Vangie     160    68    2500

Joining Datasets: SET and MERGE Statements
These two SAS statements help you create a new SAS dataset from a combination of existing datasets, whether those were built by direct input or by using the INFILE statement. In summary, the two statements do the following:

SET : concatenates the observations of datasets into a single dataset (vertical).
MERGE : joins the variables of datasets into a single dataset (horizontal).

To illustrate the SET statement, suppose there is another dataset, third. Its variables are the same as those in second, but the observations are different. The following program creates third:

DATA third;
    INPUT name $ y x1 x2;
    CARDS;
Lucky 140 60 375
Stephen 175 70 1250
;
RUN;
This creates a dataset third with the same variables as second but different observations. Suppose we want to concatenate the two datasets, second and third. For this we use the SET statement:

DATA setexample;
    SET second third;
RUN;

Next, the program

DATA setexample2;
    SET setexample;
    IF x2 > 1000;
RUN;

creates a new dataset from setexample, but this time only the observations with an average daily calorie consumption greater than 1000 are retained. The conditioning statement here is IF, and the operator > is used.
Let us say that we want to combine two datasets, but instead of the first dataset being followed by the second, the observations should be interleaved in the order of a variable. The program

DATA setexample3;
    SET second third;
    BY name;
RUN;

creates a new dataset with the same observations as setexample, but this time ordered by the variable name. So instead of Lucky (1st observation, 2nd dataset) following Vangie (last observation, 1st dataset), Lucky follows Igor. Note that each input dataset must already be sorted by the BY variable; otherwise, the DATA step will stop with an error.
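When the input datasets are not yet in order, they can be sorted first with PROC SORT (discussed later in these notes). The following is a minimal sketch under that assumption; the dataset names groupa, groupb and interleaved and their values are hypothetical:

```sas
/* Two small datasets whose observations are NOT in name order */
DATA groupa;
    INPUT name $ y;
    DATALINES;
Vangie 160
Esther 160
;
RUN;

DATA groupb;
    INPUT name $ y;
    DATALINES;
Stephen 175
Lucky 140
;
RUN;

/* Sort each dataset by the BY variable before interleaving */
PROC SORT DATA = groupa; BY name; RUN;
PROC SORT DATA = groupb; BY name; RUN;

/* SET with BY now works: Esther, Lucky, Stephen, Vangie */
DATA interleaved;
    SET groupa groupb;
    BY name;
RUN;
```

Without the two PROC SORT steps, the last DATA step would stop because the BY variable is not in sorted order.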
Now, suppose there is another variable for each of the observations in setexample: x3 = 1 (exercising) or 0 (no exercise). The following program creates this dataset:

DATA fourth;
    INPUT x3;
    DATALINES;
1
0
0
1
1
1
;
RUN;

The following program creates a dataset which merges the datasets setexample and fourth:

DATA mergeexample;
    MERGE setexample fourth;
RUN;

PROC PRINT DATA = mergeexample noobs;
RUN;
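The merge above pairs observations purely by their position in each dataset. When both datasets carry a common identifying variable, a safer alternative is to add a BY statement so that SAS matches observations on that variable. The sketch below assumes hypothetical datasets bodyweight and exercise that both contain name:

```sas
DATA bodyweight;
    INPUT name $ y;
    DATALINES;
Igor 180
Esther 160
;
RUN;

DATA exercise;
    INPUT name $ x3;
    DATALINES;
Esther 1
Igor 0
;
RUN;

/* Both datasets must be sorted by the BY variable */
PROC SORT DATA = bodyweight; BY name; RUN;
PROC SORT DATA = exercise;   BY name; RUN;

/* Match-merge: rows are joined where name is equal,
   regardless of the original row order */
DATA matched;
    MERGE bodyweight exercise;
    BY name;
RUN;
```

Here Esther gets x3 = 1 and Igor gets x3 = 0 even though the two datasets listed them in different orders.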
Of course, the observations in the first dataset should correspond to the observations in the second dataset, since this merge pairs them purely by position. So make sure that the new dataset makes sense. There is also a new option in the PROC PRINT statement, noobs, which suppresses the Obs column in the output table.

Choosing and Renaming Variables: KEEP, DROP and RENAME Statements
In summary, these statements do the following tasks:
KEEP : creates a dataset from an existing one, keeping only the specified variables.
DROP : creates a dataset from an existing one, dropping all the specified variables.
RENAME : renames an existing variable of the new dataset.
The program below creates a new dataset kdrexample from the existing mergeexample, but keeping only the variable name:
DATA kdrexample;
    SET mergeexample;
    KEEP name;
RUN;

PROC PRINT noobs;
RUN;
Note that “DATA = kdrexample” is omitted in the PROC PRINT step. If the dataset name is omitted in a PROC step, SAS processes the most recently created dataset.

In contrast, the DROP statement keeps only the variables that are not listed:

DATA kdrexample2;
    SET mergeexample;
    DROP x1;
RUN;

PROC PRINT; RUN;

This time, notice that the two lines of the PROC step are collapsed into one. This is perfectly fine, as long as every statement ends with a semicolon. You can even write a whole DATA step on a single line; again, there should be a semicolon at the end of each statement. You can also specify DROP and KEEP as dataset options in the DATA statement. For example,
DATA kdrexample3 (KEEP = name) kdrexample4 (DROP = x1);
    SET mergeexample;
RUN;

creates two datasets from mergeexample, where kdrexample3 is identical to kdrexample and kdrexample4 is identical to kdrexample2. It is also possible to rename variables from an existing dataset in the newly created dataset. For example,

DATA kdrexample5;
    SET mergeexample;
    KEEP name y;
    RENAME y = weight;
RUN;

This DATA step creates a new dataset kdrexample5 which keeps the variables name and y from the mergeexample dataset; in the new dataset, y is renamed weight. The variable weight takes the values of y.

Creating a Permanent SAS Dataset
So far, all the SAS datasets that we have created are just TEMPORARY datasets, meaning that when you end the SAS session, the software automatically deletes them. To create a permanent SAS dataset instead, give the DATA statement a quoted path:
DATA 'C:\Users\cjdaquis\Desktop\permaexample';
    SET mergeexample;
RUN;

The program above creates a permanent dataset permaexample which contains the whole mergeexample dataset. The file is stored in the specified directory with the filename extension .sas7bdat. The single quotation marks tell SAS that it is a permanent SAS dataset. If no directory is specified, SAS creates the permanent dataset in the current working directory, which is displayed at the lower right of the SAS window.

Two Levels of SAS Dataset Names
It is not always visible, but all SAS dataset names have two levels: the libref (short for library reference) and the member name, the actual dataset name which uniquely identifies the dataset within the library. These two levels are separated by a period.
All SAS Datasets have two levels: libref.membername
Note that all temporary SAS datasets have a libref WORK (check the log). All datasets in WORK are deleted after each session. Our first permanent SAS dataset, permaexample also has an invisible and unknown libref name automatically created by SAS. There is no need for us to know it. But if you insist, it can be located in the explorer window.
Figure 12‐3 The Automatically Created SAS Libname and the Working Directory

Creating a Permanent SAS Dataset using the LIBNAME Statement
The LIBNAME statement explicitly tells SAS which library reference (libref) points to which directory, so that the libref can be used as the first level of a dataset name:
LIBNAME example 'C:\Users\cjdaquis\Desktop\';

DATA example.libnameex;
    SET mergeexample;
RUN;

PROC PRINT DATA = example.libnameex;
    TITLE 'Respondent Profile';
RUN;
The program above tells SAS to create a library reference that points to the specified directory. A dataset is then created in that library under the member name libnameex. This dataset is permanent, unlike the existing dataset mergeexample, which remains temporary. In the PROC PRINT step, a title for the output is added via the TITLE statement.
In summary, here are some sample DATA statements and the characteristics of the datasets they create:
Data Statement                   Libref                   Member Name    Type
DATA example;                    WORK                     example        Temporary
DATA 'c:\MySASLib\example2';     automatically created    example2       Permanent
DATA firstprogram.example3;      firstprogram             example3       Permanent
Table 12‐1 SAS Datasets
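To see the two levels at work for a temporary dataset, the short sketch below (the dataset name twolevel is made up) creates a dataset with a one‐level name and then prints it using its full two‐level name:

```sas
/* A one-level name is stored in the temporary WORK library */
DATA twolevel;
    x = 1;
RUN;

/* The same dataset, referred to by libref.membername */
PROC PRINT DATA = WORK.twolevel;
RUN;
```

Both names refer to the same dataset; writing WORK explicitly is optional for temporary datasets.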
THE PROC STEP
The PROC step starts with the word PROC. While the DATA step reads and modifies data, the PROC step mainly performs a specific analysis or function and produces results or reports. In summary, PROC steps are used for the following:

1. Sorting, printing and summarizing data
2. Statistical procedures
3. Data management
The PROC step varies from the simple PROC PRINT and PROC IMPORT to the statistical PROC SURVEYMEANS or PROC REG, then to data management procedures like PROC SQL.
PROC SORT
The following DATA step creates the dataset heroes and inputs the faction, primary attribute, and starting attributes of selected heroes in a certain game:
DATA heroes;
    INPUT name $ 3-15 @18 faction $ primeattrib $ strtstr strtagi strtint;
    CARDS;
  Atropos        Scourge Int 18 18 18
  Pudge          Scourge Str 25 14 14
  KnightDavion   Sentinel Str 19 19 15
  Yurnero        Sentinel Agi 20 20 14
  Ezalor         Sentinel Int 16 15 22
  Naix           Scourge Str 25 18 15
  Kardel         Sentinel Agi 16 21 15
  Clinkz         Scourge Agi 15 22 15
  Krobelus       Scourge Int 19 14 20
  Meepo          Scourge Agi 23 23 20
  Banehallow     Scourge Str 22 16 15
  Rattletrap     Sentinel Str 24 13 17
  Rylai          Sentinel Int 16 16 21
;
RUN;
In the program above, the numbers in the INPUT statement tell SAS where the variable values are located in the data lines. For example, 3-15 tells SAS that the values of the preceding variable (name) are found in columns 3 to 15, while @18 tells SAS that the next variable (faction) starts at the 18th column. As it stands, the dataset heroes is in no particular order. PROC SORT takes care of this:
PROC SORT DATA = heroes OUT = heroessort;
    BY faction DESCENDING strtstr;
RUN;
This creates another dataset, heroessort, which is an ordered copy of the existing heroes. Here, new options are introduced:

OUT : Creates a new dataset as the output of the PROC step.
BY : The dataset is sorted according to the listed variables.
DESCENDING : Sorts the variable that follows it from highest to lowest (numeric) or in reverse alphabetical order (character).

The default sort order is ascending. We now see that PROC steps, via the OUT option, can also create new datasets.

In sorting, one can also add the NODUPKEY option, which removes duplicates in the data. Extra care must be taken with NODUPKEY: for each combination of values of the BY variables it keeps only the first observation and deletes ALL succeeding ones, even if they differ in other variables, so it may delete distinct observations unintentionally.
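The following sketch illustrates the caution about NODUPKEY on a tiny hypothetical dataset (picks and onepick are made‐up names):

```sas
DATA picks;
    INPUT faction $ name $;
    DATALINES;
Scourge Pudge
Sentinel Ezalor
Scourge Naix
;
RUN;

/* NODUPKEY keeps one observation per value of the BY variable.
   Naix is a different hero, but shares faction = Scourge with Pudge,
   so one of the two Scourge rows is dropped. */
PROC SORT DATA = picks OUT = onepick NODUPKEY;
    BY faction;
RUN;
```

The output dataset onepick has only two observations, one per faction. If the intention were to remove only rows that are duplicated in full, every variable would have to be listed in the BY statement.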
PROC PRINT
A simple PROC PRINT step for the sorted data is the following:

PROC PRINT DATA = heroessort;
RUN;

It gives the following (truncated) output:

Selected Heroes          14:04 Thursday, March 5, 2009   4

Obs    name          faction    primeattrib    strtstr    strtagi    strtint
  1    Pudge         Scourge    Str                 25         14         14
  2    Naix          Scourge    Str                 25         18         15
  3    Meepo         Scourge    Agi                 23         23         20
  4    Banehallow    Scourge    Str                 22         16         15
  5    Krobelus      Scourge    Int                 19         14         20
  6    Atropos       Scourge    Int                 18         18         18
  7    Clinkz        Scourge    Agi                 15         22         15
Optional statements can be added to have the following program:
PROC PRINT DATA = heroessort;
    TITLE 'Selected Heroes';
    BY faction;
    ID primeattrib;
    VAR name strtagi strtint strtstr;
    SUM strtagi strtint strtstr;
RUN;
There are many options introduced here:
BY : The output will be grouped by the BY variable. Note that the data must be presorted with the very same BY variable.
ID : Instead of the observation number being printed, the variable in the ID option will appear on the left hand side of the page.
VAR : Selects the variables to be printed and their order.
SUM : Prints the totals of the variables in the list.
The last program above will produce the following output:

Selected Heroes          14:04 Thursday, March 5, 2009   5

------------------------------ faction=Scourge ------------------------------

primeattrib    name            strtagi    strtint    strtstr
Str            Pudge                14         14         25
Str            Naix                 18         15         25
Agi            Meepo                23         20         23
Str            Banehallow           16         15         22
Int            Krobelus             14         20         19
Int            Atropos              18         18         18
Agi            Clinkz               22         15         15
-----------                    -------    -------    -------
faction                            125        117        147

------------------------------ faction=Sentinel -----------------------------

primeattrib    name            strtagi    strtint    strtstr
Str            Rattletrap           13         17         24
Agi            Yurnero              20         14         20
Str            KnightDavion         19         15         19
Int            Ezalor               15         22         16
Agi            Kardel               21         15         16
Int            Rylai                16         21         16
-----------                    -------    -------    -------
faction                            104        104        111
                               =======    =======    =======
                                   229        221        258
PROC FORMAT – ASSIGNING VALUE LABELS
One of the uses of PROC FORMAT is to assign value labels to the values of variables in the data. This is particularly useful when the data are coded. A person reading the coded output will have a hard time deciphering what the values represent (e.g., what is 1 or 0 in the variable Sex?). To solve the problem, we create custom formats via PROC FORMAT. PROC FORMAT in SAS plays basically the same role as VALUE LABELS in SPSS. Suppose that UP Stat has conducted a survey regarding a planned undergraduate summer outing. The program below inputs the data:
DATA statouting;
    INPUT willjoin $ sex date place $ @@;
    DATALINES;
Y 1 1 EK Y 2 3 LB
Y 2 2 LB N 1 2 SU
Y 1 2 OP Y 1 1 OP
N 2 1 OP Y 2 1 EK
Y 1 3 EK Y 2 2 SU
Y 2 1 EK Y 1 1 BO
Y 2 3 OP Y 2 1 EK
Y 2 2 EK Y 1 1 EK
Y 1 2 EK Y 2 1 BO
N 2 2 SU Y 1 1 EK
;
RUN;
If the Datalines have Multiple Observations per Line
The problem in the datalines is that there are two observations per line. By default, SAS will only be able to read the 1st observation, and noticing that it has now filled up the values of all the four variables, it will immediately proceed to the next line, and that is the 3rd observation, thus ignoring the second observation.
To solve this, one must include a double at sign, @@, at the end of the INPUT statement, which tells SAS to hold the current line of data. After an observation has been recorded, SAS does not immediately go to the next line. Rather, it continues reading the very same line for the next observation. This way, SAS is forced to read until the end of the line before proceeding to the next line.

Going back, printing the raw data above may not be very comprehensible to readers. But once value labels are assigned to the values in the actual observations, the data become easier to read. PROC FORMAT takes on the task:
PROC FORMAT;
    VALUE $join  'Y' = 'Yes'
                 'N' = 'No';
    VALUE gender 1 = 'Male'
                 2 = 'Female';
    VALUE mon    1 = 'March'
                 2 = 'April'
                 3 = 'May';
    VALUE $place 'BO' = 'Boracay'
                 'EK' = 'Enchanted Kingdom'
                 'LB' = 'Los Banos'
                 'OP' = 'Manila Oceanpark'
                 'SU' = 'Subic';
RUN;
The program above creates custom formats. Take note that the names right after the keyword VALUE are NOT variables; they are just format names. That is, gender is not a new variable but the name of the format that says: “If the value is 1, label it Male; if the value is 2, label it Female”. Also, for character values, the format name should start with a “$” to indicate that the values being labeled are strings.
After specifying custom formatting, the next step is to use the formats in printing the data:
PROC PRINT DATA = statouting;
    FORMAT willjoin $join. sex gender. date mon. place $place.;
RUN;
SAS will then print not the original values but the labels defined by the format attached to each variable. From this simple PRINT procedure we have learned the following:

1. The FORMAT statement attaches a format to every variable listed.
2. The format name follows immediately after the variable it will format.
3. To differentiate a format name from a variable name, the former ends with a period.
4. Format names can be the same as variable names.
The (cropped) formatted result is displayed below:

The SAS System          07:05 Friday, March 6, 2009   1

Obs    willjoin    sex       date     place
  1    Yes         Male      March    Enchanted Kingdom
  2    Yes         Female    May      Los Banos
  3    Yes         Female    April    Los Banos
  4    No          Male      April    Subic
  5    Yes         Male      April    Manila Oceanpark
  6    Yes         Male      March    Manila Oceanpark
  7    No          Female    March    Manila Oceanpark
  8    Yes         Female    March    Enchanted Kingdom
  9    Yes         Male      May      Enchanted Kingdom
 10    Yes         Female    April    Subic

Other than custom formats, SAS has its own list of standard formats. For further details, read the SAS documentation.
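As a brief illustration of SAS's built‐in (standard) formats, the sketch below applies two of them, COMMA10.2 and DATE9., to a made‐up dataset (fmtdemo, amount and surveyday are hypothetical names). No PROC FORMAT step is needed because these formats already exist in SAS:

```sas
DATA fmtdemo;
    amount = 1234.5;
    surveyday = MDY(3, 6, 2009);     /* SAS date value for March 6, 2009 */
    /* COMMA10.2 adds thousands separators and 2 decimals;
       DATE9. displays a SAS date value as ddMONyyyy */
    FORMAT amount COMMA10.2 surveyday DATE9.;
RUN;

PROC PRINT DATA = fmtdemo NOOBS;
RUN;
```

The printed values are 1,234.50 and 06MAR2009. The stored values are unchanged; as with custom formats, only the display differs.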
Getting the Data into the SAS System
Topics: Viewtable Window; Multiple Lines per Observation; Multiple Observations per Line; Reading Part of the Raw File; Options in the INFILE Statement; Delimited Raw Files
THE VIEWTABLE WINDOW
Another way of entering data is through what we call the Viewtable window. The Viewtable window is a spreadsheet in which the user creates a new dataset simply by encoding the values, as if using Excel or SPSS.
Figure 12‐4 The Viewtable Window
To open the Viewtable window, click on Tools > Table Editor. The Viewtable window is composed of rows and columns, corresponding to cases/observations and attributes/variables, respectively.

To edit a variable name : Double‐click on the column header.
To select an entire row/column : Click and hold the row/column header.
To encode data : Select a cell and type the value.
To edit variable attributes : Right‐click on the column header and click Column Attributes. A dialog box will appear.
Figure 12‐5 Column Attribute Dialog Box
In the column attributes, you can specify the variable name (as in the INPUT statement) and determine whether it is of character or numeric type. To save the table, simply save it using the standard procedures.
MULTIPLE LINES PER OBSERVATION
There are times when the data for one observation spread over more than one line:
Pudge Scourge Str 25 14 14 KnightDavion Sentinel Str 19 19 15 Yurnero Sentinel Agi 20 20 14 Ezalor Sentinel Int 16 15 22 Naix Scourge Str 25 18 15 Kardel Sentinel Agi 16 21 15
In such a case, SAS automatically goes on to the next line if it runs out of data before it has read all the variables in the INPUT statement, so the normal DATA step would work just fine. But if you know that your raw data have multiple lines per observation, it is better to EXPLICITLY TELL SAS WHERE TO GO. It puts you more in control of your data.

For example, suppose an observation spans four lines, of which you only want to read the first, second and fourth. To do this, you have to introduce what we call LINE POINTERS:
Figure 13‐1 SAS Line Pointers

The program below reads the data from the text file heroes.txt using line pointers:
LIBNAME dota 'C:\Users\cjdaquis\Desktop\';              /*<--change directory*/

DATA dota.heroes2;
    INFILE 'C:\Users\cjdaquis\Desktop\heroes.txt';      /*<--change directory*/
    INPUT name $
          / faction $
          #3 primeattrib $ strtstr strtagi strtint;
RUN;
In the LIBNAME statement, a library named dota is created at the specified directory. Then a member of that library is PERMANENTLY created, since the dataset heroes2 now belongs to a permanent library and not to the temporary WORK library.
LIBNAME dota2 'C:\Users\cjdaquis\Desktop\';             /*<--same location*/

PROC PRINT DATA = dota2.heroes2;
RUN;
This time, the libref in the LIBNAME statement is dota2, but IT POINTS TO THE SAME LOCATION AS BEFORE, which means that two librefs now refer to the same directory. Both libraries have a common member, heroes2, which is a permanent SAS dataset.
/  : Slash pointer. Go to the next line.
#n : Pound‐n pointer. Go to line n.
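The two line pointers can be tried without an external file. The sketch below uses in‐line data; the dataset name pointerdemo and the three‐lines‐per‐hero layout are assumptions made for illustration and are not necessarily the layout of heroes.txt:

```sas
DATA pointerdemo;
    INPUT name $                      /* line 1 of each observation   */
          / faction $                 /* slash: move to the next line */
          #3 strtstr strtagi strtint; /* #3: jump to line 3           */
    DATALINES;
Pudge
Scourge
25 14 14
Ezalor
Sentinel
16 15 22
;
RUN;

PROC PRINT DATA = pointerdemo;
RUN;
```

Because the largest pointer is #3, SAS treats every group of three lines as one observation, so the six data lines above produce two observations.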
MULTIPLE OBSERVATIONS PER LINE
Normally, SAS assumes that each line of raw data represents exactly one observation. When there are multiple observations per line of data, you can use the double at sign “@@” at the end of the INPUT statement.
Figure 13‐1 SAS Stop and Hold Sign
The sign @@ tells SAS to stop, hold that line of data and continue reading observations from it. Let us take for example the dataset we considered in the previous chapter. Suppose that UP Stat has conducted a survey regarding a planned undergraduate summer outing. The program below inputs the data:
DATA statouting;
    INPUT willjoin $ sex date place $ @@;
    DATALINES;
Y 1 1 EK Y 2 3 LB
Y 2 2 LB N 1 2 SU
Y 1 2 OP Y 1 1 OP
N 2 1 OP Y 2 1 EK
Y 1 3 EK Y 2 2 SU
Y 2 1 EK Y 1 1 BO
Y 2 3 OP Y 2 1 EK
Y 2 2 EK Y 1 1 EK
Y 1 2 EK Y 2 1 BO
N 2 2 SU Y 1 1 EK
;
RUN;
There are 4 variables in the program. The problem is that there are two observations per line. Without the “@@”, SAS reads only the first four values on each line and then moves to the next line, so it records the 1st, 3rd, 5th, ... observations and skips the second observation on every line.
OPTIONS IN THE INFILE STATEMENT: FIRSTOBS AND OBS
Let’s say we have the following data in Ice Cream.txt:

Ice Cream Sales Data for the month of February 2009
Flavor    Location  Cones Sold
Chocolate Tambayan      57
CnC       Tambayan      53
Mango     Tambayan      45
Chocolate Stat Walk     34
CnC       Stat Walk     22
Mango     Stat Walk     19
Data Verified by FinComm

We want to input the data above using the INFILE statement, but the problem is that the raw data are sandwiched between a title, a header line and a closing remark. For such cases, we use the FIRSTOBS and OBS options in the INFILE statement:

LIBNAME examples 'G:\Work and Study Docs\Stat 125\Datasets';

DATA examples.icsales;
    INFILE 'G:\Work and Study Docs\Stat 125\Datasets\Ice Cream.txt' FIRSTOBS = 3 OBS = 8;
    INPUT Flavor $ 1-10 Location $ 11-20 @25 Cones;
RUN;
@@ : Double at signs. Stop and hold the line of data.
The SAS program inputs data from the text file starting at the 3rd line and ending at the 8th line. In general:

FIRSTOBS = n : tells SAS to start reading at the nth line
OBS = k : tells SAS to stop reading after the kth line

Note that OBS is not the number of observations, but the line number of the last line to be read.
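The same two options can be tried on in‐line data by writing INFILE DATALINES. The sketch below is a reduced, hypothetical version of the ice cream file (icdemo is a made‐up dataset name):

```sas
DATA icdemo;
    LENGTH flavor $ 10;                       /* room for 'Chocolate' */
    INFILE DATALINES FIRSTOBS = 2 OBS = 4;    /* read lines 2 to 4 only */
    INPUT flavor $ cones;
    DATALINES;
Flavor Cones
Chocolate 57
CnC 53
Mango 45
Verified by FinComm
;
RUN;
```

Lines 2 to 4 are read, so icdemo has 3 observations (OBS minus FIRSTOBS plus 1); the header on line 1 and the remark on line 5 are never read.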
OPTIONS IN THE INFILE STATEMENT: MISSOVER AND TRUNCOVER
Suppose that in a certain subject there are 5 long exams. Of the five students below, four missed at least one exam, and three of them missed the last:

2007-542153 79 75 62 54
2007-624609 66 55 43 22
2007-000415 89 . 56 52 45
2006-168910 82 82 . 47
2007-515611 66 79 32 85 56
Normally, when a line ends before all five scores are found and there is no period to mark the missing last value, SAS goes on to the next line to look for it. The MISSOVER option tells SAS that if it runs out of data on a line, it should not go to the next data line but instead assign missing values to any remaining variables:
DATA examples.longexam;
    INFILE 'G:\Work and Study Docs\Stat 125\Datasets\les.txt' MISSOVER;
    INPUT studnum $ 1-11 Test1 - Test5;
RUN;
When variables are of the form varname1, varname2, … , varnameN, instead of manually typing each variable one can just use the dash, as in Test1 - Test5 above.

The TRUNCOVER option is used in column input when some data lines are shorter than others. By default, if a variable’s field (columns) extends past the end of the data line, SAS goes to the next line to start reading the variable’s value. The TRUNCOVER option tells SAS to read data for the variable until it reaches the end of the data line, or the last column specified in the format or column range, whichever comes first. For example:
----+----1----+----2----+----3----+--
John Santos    114   Silay Ave.
Esther Tores   1302  Washington Drive
Stanly Garcia  45    G.B. 14th St.

The correct SAS program to read this file is:

DATA examples.regis;
    INFILE 'G:\Work and Study Docs\Stat 125\Datasets\addresses.txt' FIRSTOBS = 2 TRUNCOVER;
    INPUT name $ 1-15 number 16-19 street $ 22-37;
RUN;
One may wonder: what is the default behavior of SAS here? The default INFILE option, under which SAS goes to the next line if it runs out of data or if the variable’s field extends past the end of the data line, is called FLOWOVER.
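The effect of MISSOVER can be seen on a small in‐line example. This is only a sketch with made‐up student numbers and three exams instead of five (examdemo is a hypothetical dataset name):

```sas
DATA examdemo;
    INFILE DATALINES MISSOVER;
    INPUT studnum $ test1 - test3;
    DATALINES;
A001 79 75 62
A002 66 55
A003 89 . 56
;
RUN;

PROC PRINT DATA = examdemo;
RUN;
```

With MISSOVER, student A002 gets a missing value for test3 and the dataset has three observations. If MISSOVER is removed, the default FLOWOVER behavior makes SAS look for A002's third score on the next line, so the line of A003 is consumed in the attempt and the dataset no longer has one observation per student.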
READING DELIMITED FILES – DELIMITER OR DLM OPTION
If the raw file is comma delimited:

Grace,3,1,5,2,6
Martin,1,2,4,1,3
Scott,9,10,4,8,6

The following SAS program correctly reads the raw data:

DATA examples.grades;
    INFILE 'G:\Work and Study Docs\Stat 125\Datasets\grades.txt' DLM = ',';
    INPUT name $ week1 - week5;
RUN;

If the raw file is tab delimited, use the DLM = '09'X option.
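A related INFILE option worth knowing is DSD, which treats two consecutive delimiters as a missing value instead of as a single separator. The sketch below uses hypothetical in‐line data (gradesdemo is a made‐up name) in which Grace has no score for week 2:

```sas
DATA gradesdemo;
    INFILE DATALINES DLM = ',' DSD;
    INPUT name $ week1 - week3;
    DATALINES;
Grace,3,,5
Martin,1,2,4
;
RUN;

PROC PRINT DATA = gradesdemo;
RUN;
```

With DSD, week2 is missing for Grace and week3 is 5. With DLM = ',' alone, the two adjacent commas would be treated as one delimiter and the 5 would be read into week2 instead.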
Working with Data in SAS
Topics: Creating and Redefining Variables; SAS Functions; If‐Then Statements; Arrays; SAS Control Statements
WORKING WITH DATA IN SAS
So far, we have learned the following important points in SAS:

i. Every SAS statement ends with a semicolon.
ii. DATA steps execute line by line and observation by observation.
iii. All SAS datasets have two levels: libref.membername

Plus, there is another important thing to note in SAS: a SAS program has two main blocks, the DATA and PROC steps.

Now, let us delve deeper into the logical side of SAS programming. In this chapter, we will learn more about how to manage SAS datasets.
CREATING AND REDEFINING VARIABLES
This simple SAS program shows how we can create and redefine variables. Basically, the creation and redefinition of variables is done through the assignment statement variable = expression:

DATA statgrades;
    INPUT Name $ S114 S115 S117 S121 S124;
    Batch = 2007;
    College = 'Stat';
    S124 = S124 - 0.25;
    CGWA = SUM(S114,S115,S117,S121,S124) / 5;
    DATALINES;
Lanie 1.25 2.00 1.75 1.75 2.50
Albert 2.50 2.75 2.75 3.00 2.25
Ismael 1.00 1.25 1.75 1.50 1.50
;
RUN;
Notice that after the usual INPUT statement there are 4 assignment statements; three create new variables and one redefines an existing variable:

Batch = 2007                               Numeric constant assigned
College = 'Stat'                           String constant assigned
S124 = S124 - 0.25                         Variable is redefined
CGWA = SUM(S114,S115,S117,S121,S124)/5     Variable is created using a function
Like any other computing software, SAS has a set of functions. These functions may take string or numeric variables or constants as their arguments. The next section discusses some of these SAS functions.
Printing the dataset yields the output below:

Grades of Selected Sophomore Students          6
07:04 Wednesday, March 11, 2009

Obs    Name      S114    S115    S117    S121    S124    Batch    College    CGWA
  1    Lanie     1.25    2.00    1.75    1.75    2.25     2007    Stat       1.80
  2    Albert    2.50    2.75    2.75    3.00    2.00     2007    Stat       2.60
  3    Ismael    1.00    1.25    1.75    1.50    1.25     2007    Stat       1.35

Note that the variable S124 appears only once because the old value is replaced by the new value. Redefining an existing variable has the advantage of not cluttering the dataset. But, as in recoding in SPSS, writing the redefined values to a new variable or dataset is much recommended so as not to lose any information.
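One caution about computing an average as SUM(...) divided by a fixed count: SUM skips missing values, so the divisor becomes wrong when a grade is missing. The MEAN function divides by the number of nonmissing arguments instead. A minimal sketch with hypothetical data (gwa, avg_sum and avg_mean are made‐up names):

```sas
DATA gwa;
    INPUT name $ s114 s115 s117;
    avg_sum  = SUM(s114, s115, s117) / 3;   /* always divides by 3        */
    avg_mean = MEAN(s114, s115, s117);      /* divides by nonmissing count */
    DATALINES;
Lanie 1.25 2.00 1.75
Albert 2.50 . 2.75
;
RUN;

PROC PRINT DATA = gwa;
RUN;
```

For Lanie both averages agree. For Albert, who has a missing grade, avg_sum is 5.25 / 3 = 1.75 while avg_mean is 5.25 / 2 = 2.625. Which of the two is appropriate depends on how a missing grade should be treated.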
SAS FUNCTIONS
As in MS Excel, performing operations using only mathematical and text operators may be quite tedious, or sometimes even impossible. Fortunately, functions simplify such undertakings. SAS also has functions, close to 300 of them, which take the following general form:

function_name(argument1, argument2,…)

Here are some things to remember about SAS functions:

i. All functions must have parentheses even if they require/have no arguments.
ii. Arguments are separated by commas and can be either constants or variables, numeric or string, depending on the function.
iii. Functions can be nested (a function as an argument of another function).
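The three points above can be seen in one short DATA step. This is only an illustrative sketch; the dataset and variable names (nested, initials, root, rundate) are made up:

```sas
DATA nested;
    name = 'ignacio';
    /* Nested functions: SUBSTR extracts the first three letters,
       then UPCASE converts them to uppercase, giving 'IGN' */
    initials = UPCASE(SUBSTR(name, 1, 3));
    /* ABS is evaluated first, then SQRT, giving 4 */
    root = SQRT(ABS(-16));
    /* TODAY takes no argument but still needs its parentheses */
    rundate = TODAY();
    FORMAT rundate DATE9.;
RUN;

PROC PRINT DATA = nested;
RUN;
```

The innermost function is always evaluated first, and its result becomes the argument of the function that encloses it.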
SAS FUNCTIONS ‐ EXAMPLES
For your reference, the following is a (very long) list of SAS functions from the SAS documentation, categorized according to their use:
Arithmetic Functions
ABS(argument) returns absolute value
DIM<n>(array‐name) returns the number of elements in a one‐dimensional array or the number of elements in a specified dimension of a multidimensional array.
n specifies the dimension, in a multidimensional array, for which you want to know the number of elements.
DIM(array‐name,bound‐n) returns the number of elements in a one‐dimensional array or the number of elements in the specified dimension of a multidimensional array
bound‐n specifies the dimension in a multidimensional array, for which you want to know the number of elements.
HBOUND<n>(array‐name) returns the upper bound of an array
HBOUND(array‐name,bound‐n) returns the upper bound of an array
LBOUND<n>(array‐name) returns the lower bound of an array
LBOUND(array‐name,bound‐n) returns the lower bound of an array
MAX(argument,argument, ...) returns the largest value of the numeric arguments
MIN(argument,argument, ...) returns the smallest value of the numeric arguments
MOD(argument‐1, argument‐2) returns the remainder
SIGN(argument) returns the sign of a value or 0
SQRT(argument) returns the square root
Character Functions
BYTE(n) returns one character in the ASCII or EBCDIC collating sequence, where n is an integer representing a specific ASCII or EBCDIC character
COLLATE(start‐position<,end‐position>) | (start‐position<,,length>) returns an ASCII or EBCDIC collating sequence character string
COMPBL(source) removes multiple blanks between words in a character string
COMPRESS(source<,characters‐to‐remove>) removes specific characters from a character string
DEQUOTE(argument) removes quotation marks from a character value
INDEX(source,excerpt) searches the source for the character string specified by the excerpt
INDEXC(source,excerpt‐1<, ... excerpt‐n>) searches the source for any character present in the excerpt
INDEXW(source,excerpt) searches the source for a specified pattern as a word
LEFT(argument) left‐aligns a SAS character string
LENGTH(argument) returns the length of an argument
LOWCASE(argument) converts all letters in an argument to lowercase
QUOTE(argument) adds double quotation marks to a character value
RANK(x) returns the position of a character in the ASCII or EBCDIC collating sequence
REPEAT(argument,n) repeats a character expression
REVERSE(argument) reverses a character expression
RIGHT(argument) right‐aligns a character expression
SCAN(argument,n<,delimiters>) returns a given word from a character expression
SOUNDEX(argument) encodes a string to facilitate searching
SUBSTR(argument,position<,n>)=characters‐to‐replace replaces character value contents
var=SUBSTR(argument,position<,n>) extracts a substring from an argument. (var is any valid SAS variable name.)
TRANSLATE(source,to‐1,from‐1<,...to‐n,from‐n>) replaces specific characters in a character expression
TRANWRD(source,target,replacement) replaces or removes all occurrences of a word in a character string
TRIM(argument) removes trailing blanks from character expression and returns one blank if the expression is missing
TRIMN(argument) removes trailing blanks from character expressions and returns a null string if the expression is missing
UPCASE(argument) converts all letters in an argument to uppercase
VERIFY(source,excerpt‐1<,...excerpt‐n>) returns the position of the first character unique to an expression
Date and Time Functions
DATDIF(sdate,edate,basis) returns the number of days between two dates
DATE() returns the current date as a SAS date value
DATEJUL(julian‐date) converts a Julian date to a SAS date value
DATEPART(datetime) extracts the date from a SAS datetime value
DATETIME() returns the current date and time of day
DAY(date) returns the day of the month from a SAS date value
DHMS(date,hour,minute,second) returns a SAS datetime value from date, hour, minute, and second
HMS(hour,minute,second) returns a SAS time value from hour, minute, and second
HOUR(<time | datetime>) returns the hour from a SAS time or datetime value
INTCK('interval',from,to) returns the number of time intervals in a given time span
INTNX('interval',start‐from,increment<,'alignment'>)
advances a date, time, or datetime value by a given interval, and returns a date, time, or datetime value
JULDATE(date) returns the Julian date from a SAS date value
MDY(month,day,year) returns a SAS date value from month, day, and year values
MINUTE(time | datetime) returns the minute from a SAS time or datetime value
MONTH(date) returns the month from a SAS date value
QTR(date) returns the quarter of the year from a SAS date value
SECOND(time | datetime) returns the second from a SAS time or datetime value
TIME() returns the current time of day
TIMEPART(datetime) extracts a time value from a SAS datetime value
TODAY() returns the current date as a SAS date value
WEEKDAY(date) returns the day of the week from a SAS date value
YEAR(date) returns the year from a SAS date value
YRDIF(sdate,edate,basis) returns the difference in years between two dates
YYQ(year,quarter) returns a SAS date value from the year and quarter
Mathematical Functions
AIRY(x) returns the value of the AIRY function
DAIRY(x) returns the derivative of the AIRY function
DIGAMMA(argument) returns the value of the DIGAMMA function
ERF(argument) returns the value of the (normal) error function
ERFC(argument) returns the value of the complementary (normal) error function
EXP(argument) returns the value of the exponential function
GAMMA(argument) returns the value of the GAMMA function
IBESSEL(nu,x,kode) returns the value of the modified Bessel function
JBESSEL(nu,x) returns the value of the Bessel function
LGAMMA(argument) returns the natural logarithm of the GAMMA function
LOG(argument) returns the natural (base e) logarithm
LOG2(argument) returns the logarithm to the base 2
LOG10(argument) returns the logarithm to the base 10
TRIGAMMA(argument) returns the value of the TRIGAMMA function
Noncentrality Functions
CNONCT(x,df,prob) returns the noncentrality parameter from a chi‐squared distribution
FNONCT(x,ndf,ddf,prob) returns the value of the noncentrality parameter of an F distribution
TNONCT(x,df,prob) returns the value of the noncentrality parameter from the Student's t distribution
Probability and Density Functions
CDF('dist',quantile,parm‐1,...,parm‐k) computes cumulative distribution functions
LOGPDF|LOGPMF('dist',quantile,parm‐1,...,parm‐k)
computes the logarithm of a probability density (mass) function. The two functions are identical.
LOGSDF('dist',quantile,parm‐1,...,parm‐k) computes the logarithm of a survival function
PDF|PMF('dist',quantile,parm‐1,...,parm‐k) computes probability density (mass) functions
POISSON(m,n) returns the probability from a POISSON distribution
PROBBETA(x,a,b) returns the probability from a beta distribution
PROBBNML(p,n,m) returns the probability from a binomial distribution
PROBCHI(x,df<,nc>) returns the probability from a chi‐squared distribution
PROBF(x,ndf,ddf<,nc>) returns the probability from an F distribution
PROBGAM(x,a) returns the probability from a gamma distribution
PROBHYPR(N,K,n,x<,r>) returns the probability from a hypergeometric distribution
PROBMC returns probabilities and critical values (quantiles) from various distributions for multiple comparisons of the means of several groups
PROBNEGB(p,n,m) returns the probability from a negative binomial distribution
PROBBNRM(x,y,r) returns the probability from the standardized bivariate normal distribution
PROBNORM(x) returns the probability from the standard normal distribution
PROBT(x,df<,nc>) returns the probability from a Student's t distribution
SDF('dist',quantile,parm‐1,...,parm‐k) computes a survival function
Quantile Functions
BETAINV(p,a,b) returns a quantile from the beta distribution
CINV(p,df<,nc>) returns a quantile from the chi‐squared distribution
FINV(p,ndf,ddf<,nc>) returns a quantile from the F distribution
GAMINV(p,a) returns a quantile from the gamma distribution
PROBIT(p) returns a quantile from the standard normal distribution
TINV(p,df<,nc>) returns a quantile from the t distribution
Sample Statistics Functions
CSS(argument,argument,...) returns the corrected sum of squares
CV(argument,argument,...) returns the coefficient of variation
KURTOSIS(argument,argument,...) returns the kurtosis (or 4th moment)
MAX(argument,argument, ...) returns the largest value
MIN(argument,argument, ...) returns the smallest value
MEAN(argument,argument, ...) returns the arithmetic mean (average)
MISSING(numeric‐expression | character‐expression)
returns a numeric result that indicates whether the argument contains a missing value
N(argument,argument, ....) returns the number of nonmissing values
NMISS(argument,argument, ...) returns the number of missing values
ORDINAL(count,argument,argument,...) returns the largest value of a part of a list
RANGE(argument,argument,...) returns the range of values
SKEWNESS(argument,argument,argument,...) returns the skewness
STD(argument,argument,...) returns the standard deviation
STDERR(argument,argument,...) returns the standard error of the mean
SUM(argument,argument,...) returns the sum
USS(argument,argument,...) returns the uncorrected sum of squares
VAR(argument,argument,...) returns the variance
Trigonometric and Hyperbolic Functions
ARCOS(argument) returns the arccosine
ARSIN(argument) returns the arcsine
ATAN(argument) returns the arctangent
COS(argument) returns the cosine
COSH(argument) returns the hyperbolic cosine
SIN(argument) returns the sine
SINH(argument) returns the hyperbolic sine
TAN(argument) returns the tangent
TANH(argument) returns the hyperbolic tangent
Truncation Functions
CEIL(argument) returns the smallest integer that is greater than or equal to the argument
FLOOR(argument) returns the largest integer that is less than or equal to the argument
FUZZ(argument) returns the nearest integer if the argument is within 1E‐12 of that integer
INT(argument) returns the integer value
ROUND(argument,round‐off‐unit) rounds to the nearest round‐off unit
TRUNC(number, length) truncates a numeric value to a specified length
Table 14‐1 SAS Functions
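A few of the functions in Table 14‐1 can be tried out in a short DATA _NULL_ step, which writes to the SAS log instead of creating a dataset. This is only a minimal sketch: the variable names and input values are made up for illustration.

```sas
DATA _NULL_;
    /* Hypothetical values, chosen only for illustration */
    name  = 'juan dela cruz';
    word1 = SCAN(name, 1);                          /* first word of the string */
    caps  = UPCASE(SUBSTR(name, 1, 4));             /* first four characters, in uppercase */
    bday  = MDY(3, 11, 2009);                       /* SAS date value for March 11, 2009 */
    days  = INTCK('day', bday, MDY(3, 12, 2009));   /* number of day boundaries crossed */
    z     = PROBIT(0.975);                          /* standard normal quantile */
    avg   = MEAN(4, 3, 5, ., 2);                    /* missing values are ignored */
    PUT word1= caps= bday= DATE9. days= z= 6.3 avg=;
RUN;
```

The PUT statement prints each variable as name=value, so the results can be checked directly in the log.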
SAS IF‐THEN STATEMENTS
Sometimes, you want to do the following:
i. Apply statements only to observations which satisfy a certain condition or conditions.
ii. Group observations.
iii. Choose only a subset of the data.
These three things can be done using SAS IF‐THEN statements. The general form of an IF‐THEN statement is as follows:
IF Condition THEN action;
SAS IF‐THEN: Assigning Values to Variables Satisfying a Condition
Consider the following car data in the file cars.txt:
Sentra 2005 Nissan 4 Sand
Z3 2002 . . Silver
Civic 2007 . 4 Black
QQ 2008 Chery 2 Yellow
Rio 2006 Hyundai 4 White
Corolla 2008 Toyota 4 Black
Miata 2006 Mazda 2 Teal
City 2009 . 4 Black
The following program reads the data from the file, uses a series of IF‐THEN statements to fill in missing data, and creates a new variable:
LIBNAME examples 'G:\Work and Study Docs\Stat 125\Datasets';
DATA examples.cars;
    INFILE 'G:\Work and Study Docs\Stat 125\Datasets\cars.txt';
    INPUT model $ year make $ seats color $;
    IF year <= 2005 THEN status = 'classic';
    IF model = 'Civic' OR model = 'City' THEN make = 'Honda';
    IF model = 'Z3' THEN DO;
        make = 'BMW';
        seats = 2;
    END;
RUN;
The program uses three IF‐THEN statements, and its output is displayed below. The program assigns a status only to cars from 2005 or earlier (older than 2006). Hence for the other observations, status is blank. The second IF‐THEN statement has two conditions, while the last has more than one action.
The following comparison and logical operators can be used in IF‐THEN statements:
Symbol     Mnemonic   Use
=          EQ         Equal
~=         NE         Not equal
>          GT         Greater than
<          LT         Less than
>=         GE         Greater than or equal
<=         LE         Less than or equal
&          AND        All comparisons are true
| or !     OR         At least one comparison is true
Table 14‐2 SAS IF‐THEN Operators
A single IF‐THEN statement can have only one action. Adding the keywords DO and END lets you perform more than one action:
IF Condition THEN DO;
    Action1;
    Action2;
    …
    ActionK;
END;
The DO statement treats all succeeding statements as actions until a matching END statement appears. Conversely, instead of more than one action, there can be more than one condition. In that case, the keywords AND and OR are used:
IF Condition1 AND/OR Condition2 AND/OR … AND/OR ConditionN THEN action;
Note that with AND, all conditions must be satisfied for the IF‐THEN statement to perform the action. With OR, at least one condition needs to be satisfied. To specify what happens when the condition is not satisfied, the programmer may use the IF‐THEN‐ELSE form:
IF Condition THEN action;
ELSE action;
The form above can be generalized further into another use of IF‐THEN: grouping observations using the IF‐THEN‐ELSE statement.
CAR MAKES                      07:04 Wednesday, March 11, 2009  13
model      year    make       seats    color     status
Sentra     2005    Nissan       4      Sand      classic
Z3         2002    BMW          2      Silver    classic
Civic      2007    Honda        4      Black
QQ         2008    Chery        2      Yellow
Rio        2006    Hyundai      4      White
Corolla    2008    Toyota       4      Black
Miata      2006    Mazda        2      Teal
City       2009    Honda        4      Black
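The three forms discussed in this section (several conditions, several actions and an ELSE branch) can be combined in one DATA step. The sketch below assumes that the examples.cars dataset from the program above already exists; the dataset name cars2, the variables type and family, and the category labels are hypothetical.

```sas
DATA cars2;
    SET examples.cars;              /* dataset created by the earlier program */
    LENGTH type $ 8;                /* hypothetical new character variable */
    /* Two conditions joined by AND, using one symbol and one mnemonic operator */
    IF seats = 2 AND year GE 2006 THEN type = 'coupe';
    /* DO-END groups two actions under a single condition */
    ELSE IF seats = 4 THEN DO;
        type = 'sedan';
        family = 1;                 /* hypothetical indicator variable */
    END;
    /* Reached only when none of the conditions above is true */
    ELSE type = 'other';
RUN;
```

Because each branch after the first starts with ELSE, every observation receives exactly one value of type.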
SAS IF‐THEN‐ELSE: Grouping Observations
Expanding the IF‐THEN‐ELSE statement of the preceding section, the IF‐THEN‐ELSE logic takes the basic form:
IF condition1 THEN action1;
ELSE IF condition2 THEN action2;
…
ELSE IF conditionN THEN actionN;
ELSE action;
The last line of the form is the action performed on all observations that fail to satisfy any of the listed conditions. This makes the groups not just mutually exclusive but also collectively exhaustive. Consider the following program, which creates a permanent SAS dataset on IQ:
LIBNAME examples 'G:\Work and Study Docs\Stat 125\Datasets';
DATA examples.iq;
    INPUT IQ @@;
    LENGTH iqlev $13;
    IF IQ <= 70 THEN iqlev = 'deficient';
    ELSE IF IQ <= 85 AND IQ > 70 THEN iqlev = 'low';
    ELSE IF IQ <= 100 THEN iqlev = 'average';
    ELSE IF IQ <= 115 THEN iqlev = 'above average';
    ELSE IF IQ <= 130 THEN iqlev = 'high';
    ELSE IF IQ <= 145 THEN iqlev = 'superior';
    ELSE iqlev = 'gifted';
    DATALINES;
96 89 67 87 101 103 103 96 147 126 101 96
93 88 94 85 97 114 113 124 80 142 99 127
;
RUN;
The program above groups the IQ scores into their corresponding levels: deficient, low, average, above average, high, superior and gifted. The last ELSE statement classifies all values not belonging to any of the preceding iqlev groups as gifted. Note that the first ELSE IF statement explicitly states both the lower and the upper bound of its category, while the succeeding ELSE IF statements specify only the upper bound. The lower bound is in fact unnecessary: since IQ values less than or equal to 70 have already been classified as deficient, an ELSE IF branch is reached only by values above the previous upper bound.
SAS IF‐THEN: Subsetting Your Data
When you only want to select the part of the raw file which satisfies a certain condition or conditions, a subsetting IF statement (an IF with no THEN) is used. For example, in the data books.txt:
---------1---------2---------3
Lord of the Rings      Tolkien
Eva Luna               Allende
Coraline               Gaiman
Stardust               Gaiman
Silmarillon            Tolkien
Of Love and Shadows    Allende
American Gods          Gaiman
Children of Hurin      Tolkien
Zorro                  Allende
The following program creates a dataset which only contains the novels authored by Neil Gaiman:
DATA examples.gaiman;
    INFILE 'G:\Work and Study Docs\Stat 125\Datasets\books.txt' FIRSTOBS = 2;
    INPUT novel $ 1-19 @24 author $;
    IF author = 'Gaiman';
RUN;
Note that the comparison in the subsetting IF is case‐sensitive: 'Gaiman' does not match 'gaiman' or 'GAIMAN'.
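One way around the case sensitivity is to convert the value to uppercase before comparing it. The following is only a sketch: it reuses the file path from the program above, and the dataset name gaiman2 is hypothetical.

```sas
DATA examples.gaiman2;
    INFILE 'G:\Work and Study Docs\Stat 125\Datasets\books.txt' FIRSTOBS = 2;
    INPUT novel $ 1-19 @24 author $;
    /* Compare in uppercase so that gaiman, GAIMAN and Gaiman all match */
    IF UPCASE(author) = 'GAIMAN';
RUN;
```

The same subset can also be written with an explicit action, IF UPCASE(author) NE 'GAIMAN' THEN DELETE;, which removes the unwanted observations instead of keeping the wanted ones.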
SAS ARRAYS
An array is an ordered group of similar items. In SAS, an array is a group of variables. An array can be either character or numeric, depending on its member variables, but never both. It may be composed of existing variables or of new variables which are to be created. The general form of an array is the following:
ARRAY name (n) variable1 variable2 … variablen;
Here n, the dimension of the array, should be equal to the number of variables listed in the array. The array is not stored in the dataset itself; it is only defined for the duration of the DATA step. You can name arrays the way you name variables, as long as the name does not match any existing variable or SAS keyword. To illustrate how variables are referenced in an array, consider the following ARRAY statement:
ARRAY books (6) $ allende dahl kundera marquez murakami tolkien;
The statement above creates an array books with six elements (variables). The elements are assigned as follows: books(1) is the variable allende, books(2) is the variable dahl, and so on until books(6), which is tolkien. The most common use of arrays is to simplify a program. Consider the SAS program below, which reads rankings of seven Filipino dishes on a scale from 1 to 5, with 5 being the highest and 9 denoting a missing value:
DATA examples.cuisines;
    INPUT adobo tinola menudo pinakbet karekare sinigang binagoongan;
    ARRAY filfood (7) adobo tinola menudo pinakbet karekare sinigang binagoongan;
    DO i = 1 TO 7;
        IF filfood(i) = 9 THEN filfood(i) = .;
    END;
    meanrank = MEAN(OF adobo -- binagoongan);
    DATALINES;
4 3 5 9 9 2 4
5 2 4 3 9 2 3
1 9 3 4 5 5 1
5 3 2 4 3 4 9
9 9 3 4 5 4 5
4 5 5 9 3 2 3
1 1 3 2 4 1 4
;
RUN;
Aside from reading the data, the program creates an array of dimension 7, filfood, assigning filfood(1) to adobo, filfood(2) to tinola and so on. To recode 9 into a missing value (a period), the typical solution would be to use 7 IF‐THEN statements, one for each variable in the array. With the use of an array, the recoding is reduced to a very simple DO loop. Also, note that the variable meanrank gets the average ranking per observation, computed after the 9s have been recoded to missing. Instead of listing all the arguments of the function MEAN, the keyword OF is followed by the first and last variables joined by a double dash "--", which stands for all variables between them in the dataset.
Instead of the normal printing, try to run this and see what happens (ODS: Output Delivery System):
ODS PDF FILE = 'G:\Work and Study Docs\Stat 125\Datasets\Cuisines.pdf';
PROC PRINT DATA = examples.cuisines NOOBS;
    TITLE 'Filipino Cuisine Survey';
RUN;
ODS PDF CLOSE;
Filipino Cuisine Survey                      8
                12:42 Thursday, March 12, 2009
adobo  tinola  menudo  pinakbet  karekare  sinigang  binagoongan  i  meanrank
  4      3       5        .         .         2           4       8   3.60000
  5      2       4        3         .         2           3       8   3.16667
  1      .       3        4         5         5           1       8   3.16667
  5      3       2        4         3         4           .       8   3.50000
  .      .       3        4         5         4           5       8   4.20000
  4      5       5        .         3         2           3       8   3.66667
  1      1       3        2         4         1           4       8   2.28571
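The column i in the printout is the DO loop counter, which is left at 8 once the loop ends and is stored with every observation. A slightly tidier variant of the program is sketched below. It lets SAS count the array elements and drops the counter; the dataset name cuisines2 is hypothetical and only the first two data lines are repeated.

```sas
DATA cuisines2;
    INPUT adobo tinola menudo pinakbet karekare sinigang binagoongan;
    /* (*) lets SAS determine the dimension from the variable list */
    ARRAY filfood (*) adobo -- binagoongan;
    DO i = 1 TO DIM(filfood);          /* DIM returns the number of elements */
        IF filfood(i) = 9 THEN filfood(i) = .;
    END;
    DROP i;                            /* keep the loop counter out of the dataset */
    meanrank = MEAN(OF filfood(*));    /* mean over all elements of the array */
    DATALINES;
4 3 5 9 9 2 4
5 2 4 3 9 2 3
;
RUN;
```

With DIM, the loop does not need to be edited when a variable is added to or removed from the array.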
SAS CONTROL STATEMENTS
So far, we have encountered only two SAS control statements: the IF‐THEN‐ELSE and DO‐END statements. There are three more: GOTO‐RETURN, STOP and SELECT‐OTHERWISE.
SAS Control Statements: GOTO‐RETURN
A GOTO statement tells SAS to jump immediately to another statement in the same DATA step and begin executing statements from that point. For example:
DATA passfail;
    INPUT grade @@;
    IF grade = 5 THEN GOTO FAIL;
    result = 'pass';
    FAIL: RETURN;
    CARDS;
1.25 1.0 5 3 2.75 1.25 5 5 1.0 3.0 2.25 2.5 5 3 1.75 1.75 1.75
;
RUN;
The program checks whether the input value of grade is equal to 5. If it is not, result is set to pass; if it is, SAS jumps to the statement labeled FAIL, skipping the assignment. The labeled statement is a RETURN statement, which tells SAS to write the current observation and begin processing a new one.
SAS Control Statements: STOP
The STOP statement stops the processing of a SAS DATA step. The observation being processed when the STOP statement is encountered is not added to the dataset, and processing resumes with the first statement after the DATA step. For example, consider the following program:
DATA stopgrade;
    INPUT grade @@;
    IF grade = 2.5 THEN STOP;
    CARDS;
1.25 1.0 5 3 2.75 1.25 5 5 1.0 3.0 2.25 2.5 5 3 1.75 1.75 1.75
;
RUN;
When a grade of 2.5 is encountered, the building of the dataset stops. Hence the last observation in the output is 2.25, the value just before the first 2.5. The values from 2.5 up to the last value, 1.75, are not written into the SAS dataset.
SAS Control Statements: SELECT‐OTHERWISE
The SELECT‐OTHERWISE statement replaces a sequence of IF‐THEN‐ELSE statements. It takes the form:
SELECT;
    WHEN (expression1) statement1;
    WHEN (expression2) statement2;
    ...
    OTHERWISE statement;
END;
The IQ program, in which a series of IF‐THEN‐ELSE statements was used to code iqlev, can now be rewritten with a SELECT group:
DATA examples.iq2;
    INPUT IQ @@;
    LENGTH iqlev $13;
    SELECT;
        WHEN (IQ <= 70) iqlev = 'deficient';
        WHEN (IQ <= 85) iqlev = 'low';
        WHEN (IQ <= 100) iqlev = 'average';
        WHEN (IQ <= 115) iqlev = 'above average';
        WHEN (IQ <= 130) iqlev = 'high';
        WHEN (IQ <= 145) iqlev = 'superior';
        OTHERWISE iqlev = 'gifted';
    END;
    DATALINES;
<datalines here>
;
RUN;
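SELECT also has a second form in which an expression is placed after the keyword and each WHEN lists the values to match. As a sketch, the car data from the IF‐THEN section could be grouped by manufacturer as follows. This assumes that the examples.cars dataset exists; the dataset name carorigin and the variable origin are hypothetical.

```sas
DATA carorigin;
    SET examples.cars;
    LENGTH origin $ 8;               /* hypothetical new variable */
    SELECT (make);                   /* the value of make is compared with each WHEN list */
        WHEN ('Nissan', 'Honda', 'Toyota', 'Mazda') origin = 'Japan';
        WHEN ('Hyundai')                            origin = 'Korea';
        WHEN ('BMW')                                origin = 'Germany';
        OTHERWISE                                   origin = 'other';
    END;
RUN;
```

As in the IF‐THEN‐ELSE chain, only the first matching WHEN is executed, and OTHERWISE catches everything else.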
SAS: Basic Statistical Procedures
PROC MEANS
PROC FREQ
PROC UNIVARIATE
PROC TTEST
PROC ANOVA
PROC CORR
PROC REG
PROC MEANS
The MEANS procedure displays summary descriptive statistics of your data. It takes the general form:
PROC MEANS DATA = dataset OPTIONS;
    VAR variables;
    BY variables;
    CLASS variables;
RUN;
The VAR statement specifies the variables for which statistics are calculated. If it is omitted, SAS computes the statistics of all numeric variables, except for variables specified in the BY or CLASS statement. The BY statement is used to obtain separate analyses for the categories of the variables it lists; the data must first be sorted by those variables. The CLASS statement works in basically the same way as the BY statement: its variables define the subgroups, but the data need not be sorted.
By default, PROC MEANS produces the number of non‐missing values, the mean, the standard deviation, the minimum and the maximum of each numeric variable. The following is a list of descriptive statistics that can be requested from PROC MEANS:
STAT       USAGE                          STAT       USAGE
CLM        two‐sided confidence limits    RANGE      range
CSS        corrected sum of squares       SKEWNESS   skewness
CV         coefficient of variation       STDDEV     standard deviation
KURTOSIS   kurtosis                       STDERR     standard error
LCLM       lower confidence limit         SUM        sum
MAX        maximum                        SUMWGT     sum of weights
MEAN       mean                           UCLM       upper confidence limit
MIN        minimum                        USS        uncorrected sum of squares
N          non‐missing values             VAR        variance
NMISS      missing values                 PROBT      p‐value of the t statistic
MEDIAN     median (50th percentile)       T          t statistic
Q1         1st quartile                   Q3         3rd quartile
Pxx        xxth percentile
Table 15‐1 PROC MEANS Statistics
For example, the following program creates a SAS dataset and computes the default statistics of the normal body temperatures and heartbeats per minute of 130 individuals:
LIBNAME examples 'H:\Work and Study Docs\Stat 125\Datasets';
DATA examples.normtemp;
    INFILE 'H:\Work and Study Docs\Stat 125\Datasets\normaltemp.txt';
    INPUT nbtemp sex hrtrate @@;
RUN;
PROC MEANS data=examples.normtemp;
RUN;
This produces the following output:
The MEANS Procedure
Variable      N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
nbtemp      130      98.2492308       0.7331832      96.3000000     100.8000000
sex         130       1.5000000       0.5019342       1.0000000       2.0000000
hrtrate     130      73.7615385       7.0620767      57.0000000      89.0000000
-------------------------------------------------------------------------------
Note that once you specify statistics options, the default statistics are no longer displayed unless you request them as well. The program below gets the 90% confidence intervals of the normal body temperatures and the heartbeats per minute of the 130 individuals:
PROC MEANS data=examples.normtemp N ALPHA = 0.10 CLM;
    VAR nbtemp hrtrate;
    BY sex;
    OUTPUT OUT = examples.ntempresult;
RUN;
The default level of significance is 0.05, but you may specify your own level via the ALPHA option. The SAS program calculates and displays the number of valid observations and the 90% confidence intervals of the variables nbtemp and hrtrate, grouped by sex. Summary statistics are also written to another SAS dataset, ntempresult, via the OUTPUT OUT option. Try to run the program again, this time using the CLASS statement instead of the BY statement.
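The exercise above can be sketched as follows. BY‐group processing expects the data to be sorted by the BY variable, so the BY version is preceded by a PROC SORT step, which is harmless if the data are already in order. The dataset name normsorted is hypothetical.

```sas
/* BY version: sort first, then analyze each sex separately */
PROC SORT DATA = examples.normtemp OUT = normsorted;
    BY sex;
RUN;

PROC MEANS DATA = normsorted N MEAN ALPHA = 0.10 CLM;
    VAR nbtemp hrtrate;
    BY sex;
RUN;

/* CLASS version: same subgroups, shown in a single table, no sorting needed */
PROC MEANS DATA = examples.normtemp N MEAN ALPHA = 0.10 CLM;
    VAR nbtemp hrtrate;
    CLASS sex;
RUN;
```

The two versions should report the same statistics; the main visible difference is the layout, since BY prints one table per group while CLASS prints one combined table.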
PROC FREQ
The SAS FREQ procedure is very important in analyzing categorical (or ordinal) data. It produces n‐way crosstabulations of variables and outputs important statistics for testing the association of the variables in the frequency table. The general template for PROC FREQ is as follows:
PROC FREQ DATA = dataset;
    TABLES variable‐combinations / OPTIONS;
    BY variables;
RUN;
The TABLES statement lets you specify the variables to be included in the frequency tables. The formats for specifying tables are as follows:
ONE‐WAY TABLE   : TABLES variable1;
TWO‐WAY TABLE   : TABLES variable1*variable2;
THREE‐WAY TABLE : TABLES variable1*variable2*variable3;
The OPTIONS, on the other hand, let you choose the table information and the statistics to display. Some of the statistics presented here may be unfamiliar to you, but they are nonetheless listed for future reference:
STATISTICS   USAGE
CMH          Cochran‐Mantel‐Haenszel statistics
EXACT        Fisher's exact test for tables larger than 2x2
MEASURES     measures of association: Pearson and Spearman correlation coefficients, gamma, Kendall's tau‐b, Stuart's tau‐c, Somers' D, lambda, odds and risk ratios, and confidence intervals
PLCORR       polychoric correlation coefficient
RELRISK      relative risk measures for 2x2 tables
TREND        Cochran‐Armitage test for trend
TABLE OPT    USAGE
NOFREQ       No frequencies in the table
NOPERCENT    No percentages
NOROW        No row percentages
NOCOL        No column percentages
NOCUM        No cumulative frequencies and percentages
NOPRINT      Does not print the table
ALL          Requests all the tests and measures given by CHISQ, MEASURES and CMH
Table 15‐2 PROC FREQ Statistics and Table Options
The following is fictional data on two bus lines, Galloper (G) and Amihan (A), indicating whether each trip was on time (O) or late (L):
G O G L G L A O G O G O G O A L A O A L
A O G O A L G O A L A O G O G O A L G L
G O A L G O A L G O A L G O A O G L G O
G O G O G O G L G O G O A L A L A O A L
G L G O A L A O G O G O G O G L A O A L
The program below reads the data and then runs the default PROC FREQ:
LIBNAME examples 'H:\Work and Study Docs\Stat 125\Datasets';
DATA examples.bus;
    INFILE 'H:\Work and Study Docs\Stat 125\Datasets\buses.txt';
    INPUT liner $ ontimelate $ @@;
RUN;
PROC FREQ DATA = examples.bus;
    TABLES liner * ontimelate;
    TITLE;
RUN;
And the following is the output:
When some options are added:
PROC FREQ DATA = examples.bus;
    TABLES liner * ontimelate / NOPRINT CHISQ;
    TITLE;
RUN;
This yields the table below. The CHISQ option requests chi‐square tests of association, while the NOPRINT option tells SAS not to print the actual frequency table:
Note that the chi‐square test concludes that there is indeed an association between the bus liner and the arrival time.
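Several TABLES statements, each with its own options, may appear in one PROC FREQ step. The sketch below, which assumes the examples.bus dataset created above, requests the two one‐way tables and a trimmed two‐way table.

```sas
PROC FREQ DATA = examples.bus;
    /* Two one-way frequency tables, one per variable */
    TABLES liner ontimelate;
    /* Two-way table showing only counts and row percentages, plus the chi-square tests */
    TABLES liner * ontimelate / NOCOL NOPERCENT CHISQ;
RUN;
```

Keeping only the row percentages makes it easy to compare the share of late trips between the two bus lines.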
PROC UNIVARIATE
PROC UNIVARIATE is very much like the EXPLORE procedure in SPSS. Just like PROC MEANS, it outputs descriptive statistics. Aside from that, the UNIVARIATE procedure gives emphasis to the analysis of the distribution of the data; hence stem‐and‐leaf displays, box‐and‐whisker plots and normal probability plots are available. The general form of PROC UNIVARIATE is as follows:
PROC UNIVARIATE DATA = dataset OPTIONS;
    VAR variables;
    BY variables;
RUN;
Here are some of the available options in PROC UNIVARIATE:
PLOT   : gives a stem‐and‐leaf display, a box‐and‐whisker plot and a normal probability plot
NORMAL : tests the null hypothesis of normality of the data
To illustrate, consider the following statistics test scores of 30 students:
56 78 84 73 90 44 76 87 92 95 85 67 90 87 74 64 73 78 69 56 87 73 100 54 81 78 69 64 73 65
The SAS program using the UNIVARIATE procedure is as follows:
LIBNAME examples 'H:\Work and Study Docs\Stat 125\Datasets';
DATA examples.statscore;
    INFILE 'H:\Work and Study Docs\Stat 125\Datasets\statgrades.txt';
    INPUT score @@;
RUN;
PROC UNIVARIATE DATA = examples.statscore PLOT NORMAL;
    VAR SCORE;
    TITLE;
RUN;
This produces a very comprehensive analysis of the data, from descriptive statistics to plots of the distribution of the data:
The UNIVARIATE Procedure
Variable: score

Moments
N                          30    Sum Weights                30
Mean                     75.4    Sum Observations         2262
Std Deviation      13.2029255    Variance           174.317241
Skewness           -0.3367612    Kurtosis           -0.1696661
Uncorrected SS         175610    Corrected SS           5055.2
Coeff Variation    17.5105113    Std Error Mean     2.41051337

Basic Statistical Measures
Location                  Variability
Mean      75.40000        Std Deviation           13.20293
Median    75.00000        Variance               174.31724
Mode      73.00000        Range                   56.00000
                          Interquartile Range     20.00000

Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   31.27964     Pr > |t|     <.0001
Sign           M         15     Pr >= |M|    <.0001
Signed Rank    S      232.5     Pr >= |S|    <.0001
Tests for Normality
Test                   --Statistic---      -----p Value------
Shapiro-Wilk           W       0.981702    Pr < W        0.8688
Kolmogorov-Smirnov     D       0.094545    Pr > D       >0.1500
Cramer-von Mises       W-Sq    0.032009    Pr > W-Sq    >0.2500
Anderson-Darling       A-Sq    0.216843    Pr > A-Sq    >0.2500

Quantiles (Definition 5)
Quantile       Estimate
100% Max            100
99%                 100
95%                  95
90%                  91
75% Q3               87
50% Median           75
25% Q1               67
10%                  56
5%                   54
1%                   44
0% Min               44

Extreme Observations
----Lowest----      ----Highest---
Value     Obs       Value     Obs
   44       6          90       5
   54      24          90      13
   56      20          92       9
   56       1          95      10
   64      28         100      23

Stem Leaf         #    Boxplot
  10 0            1       |
   9 5            1       |
   9 002          3       |
   8 5777         4    +-----+
   8 14           2    |     |
   7 6888         4    *--+--*
   7 33334        5    |     |
   6 5799         4    +-----+
   6 44           2       |
   5 66           2       |
   5 4            1       |
   4                      |
   4 4            1       |
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1
Normal Probability Plot
[Text plot of score: vertical axis from 42.5 to 102.5, horizontal axis (normal quantiles) from -2 to +2]
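The tests for location in the output above use the default hypothesized value Mu0=0, which is rarely a meaningful value for test scores. A different value can be supplied with the MU0= option, as in this sketch. The value 75 is an arbitrary choice made only for illustration.

```sas
PROC UNIVARIATE DATA = examples.statscore MU0 = 75 NORMAL;
    VAR score;   /* the location tests now test a center of 75 instead of 0 */
RUN;
```

Only the Tests for Location table is affected by MU0=; the descriptive statistics and the tests for normality stay the same.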
PROC TTEST
PROC TTEST is used for two things: the one‐sample t‐test compares a sample mean to a given value, and the two‐sample t‐test compares two means. The general form of the procedure is as follows:
PROC TTEST DATA = dataset OPTIONS;
    CLASS variable;
    VAR variables;
RUN;
The CLASS statement is NOT optional in the t‐test for group means. This statement defines the grouping variable, which must have two and only two levels. Either a numeric or a character variable can be used in the CLASS statement. The VAR statement, on the other hand, selects the variables to be tested. If it is omitted, SAS tests all numeric variables except, of course, the CLASS variable.
One Sample t‐test
A one‐sample t‐test can be used to compare a sample mean to a given value. This example, taken from Huntsberger and Billingsley (1989, p. 290), tests whether the mean length of a certain type of court case is 80 days using 20 randomly chosen cases:
DATA time;
    INPUT time @@;
    DATALINES;
43 90 84 87 116 95 86 99 93 92
121 71 66 98 79 102 60 112 105 98
;
RUN;
PROC TTEST H0=80 ALPHA=0.1;
    VAR time;
RUN;
This is a one‐sample test of the null hypothesis that the population mean is equal to 80, at the 10% level of significance (the default is 5%). In the output, ALPHA=0.1 gives 90% confidence limits. The result is as follows:
The TTEST Procedure

Statistics
                 Lower CL             Upper CL    Lower CL               Upper CL
Variable     N       Mean     Mean        Mean     Std Dev    Std Dev     Std Dev    Std Err
time        20     82.447    89.85      97.253        15.2     19.146      26.237     4.2811

T-Tests
Variable    DF    t Value    Pr > |t|
time        19       2.30      0.0329
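The t value, p‐value and confidence limits in the one‐sample output can be reproduced by hand with the PROBT and TINV functions from Table 14‐1. This is only an illustrative check: the mean, standard error and degrees of freedom are copied from the printed output, and the variable names are hypothetical.

```sas
DATA _NULL_;
    xbar = 89.85;  se = 4.2811;  df = 19;    /* values read from the TTEST output */
    t = (xbar - 80) / se;                    /* test statistic for H0: mean = 80 */
    p = 2 * (1 - PROBT(ABS(t), df));         /* two-sided p-value */
    lower = xbar - TINV(0.95, df) * se;      /* 90% confidence limits for the mean */
    upper = xbar + TINV(0.95, df) * se;
    PUT t= 6.2 p= 6.4 lower= 7.3 upper= 7.3;
RUN;
```

The values written to the log should agree with the printed t Value, Pr > |t| and confidence limits up to rounding.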
Comparing Group Means
In the following example, the Math55 final examination scores of male and female students are compared. The data are read by the following SAS statements:
DATA mathscores;
    INPUT gender $ score @@;
    DATALINES;
f 75 f 76 f 80 f 77 f 80 f 77 f 73
m 82 m 80 m 85 m 85 m 78 m 87 m 82
;
RUN;
PROC TTEST;
    CLASS gender;
    VAR score;
    TITLE;
RUN;
Again, the CLASS statement names the variable which distinguishes the groups being compared; in this case, males and females. The VAR statement specifies the variable to be used as the response for the test. The SAS result is as follows:
Equality of variances is also tested, and here you do not reject the null hypothesis of equal variances. Hence for this example, we use the t‐test assuming equal variances. The test rejects the null hypothesis of equal mean scores between males and females. Another method for unequal variances is the Cochran approximation, which is requested with the COCHRAN option in the PROC TTEST statement.
Comparing Paired Group Means
The SAS program below tests whether students significantly decreased their depression scores after enrolling in a depression management program:
DATA programeffect;
    INPUT student $ before after @@;
    DATALINES;
A 79 77 B 69 60 C 86 42 D 59 61 E 67 49
F 73 56 G 88 82 H 53 48 I 55 48 J 61 50
;
RUN;
ODS HTML;
PROC TTEST;
    PAIRED before * after;
    TITLE;
RUN;
ODS HTML CLOSE;
The ODS HTML and ODS HTML CLOSE statements, like ODS PDF, create an HTML counterpart of the SAS output:
Note that the difference in the scores is indeed statistically significant (for any level of significance greater than 0.0186).
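A paired t‐test is a one‐sample t‐test on the differences, so the same analysis can be sketched by computing the difference first. The dataset name diffs and the variable diff are hypothetical.

```sas
DATA diffs;
    SET programeffect;
    diff = before - after;      /* positive values mean the score went down */
RUN;

PROC TTEST DATA = diffs H0 = 0;
    VAR diff;                   /* tests whether the mean difference is zero */
RUN;
```

The t value and p‐value from this step should match those of the PAIRED statement.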
PROC ANOVA: One Way Analysis of Variance
The ANOVA procedure is one of several in the SAS System that perform analysis of variance. PROC ANOVA is specifically designed for balanced data, that is, data with equal numbers of observations in each classification. If your data are not balanced, then you should use the more general GLM (General Linear Models) procedure, which has basically the same structure as PROC ANOVA.
The ANOVA procedure has two required statements: the CLASS and MODEL statements. The general form of PROC ANOVA is as follows:
PROC ANOVA DATA = dataset OPTIONS;
    CLASS variable;
    MODEL dependent = effects;
    MEANS variable / OPTIONS;
RUN;
The CLASS statement lets you specify the classification variable(s). For one‐way ANOVA, only one variable is listed. This statement must always come before the MODEL statement. The MODEL statement, on the other hand, defines the dependent variable and the effects. For one‐way ANOVA, the effect is the variable specified in the CLASS statement. The MEANS statement lets you compute the mean of the dependent variable for each of the main effects in the MODEL statement. For one‐way ANOVA, these are the means for every category of the effect variable. Another good thing about the MEANS statement is its options, which are multiple comparison tests:
BON     : Bonferroni t‐tests
DUNCAN  : Duncan’s Multiple Range Test
SCHEFFE : Scheffe’s Multiple Comparison Procedure
T       : Pairwise t‐tests (Fisher’s least significant difference)
TUKEY   : Tukey’s Studentized Range Test
The default level of significance for the tests above is 0.05. To change the level of significance to, say, 0.10, add ALPHA = 0.10 to the options.
For example, your varsity friend complains that the players from all the other volleyball teams seem taller than those on her team. With ANOVA, you can decide whether there is indeed a difference in heights among the teams. The data are as follows; notice that there are five teams with 12 players each:
maroon 56 maroon 49 maroon 54 maroon 48 maroon 52 maroon 44 maroon 46 maroon 47 maroon 56 maroon 55 maroon 46 maroon 53 blue 47 blue 57 blue 49 blue 48 blue 55 blue 53 blue 50 blue 52 blue 46 blue 49 blue 56 blue 48 green 56 green 46 green 48 green 57 green 50 green 54 green 49 green 54 green 52 green 53 green 49 green 48 red 54 red 54 red 59 red 57 red 51 red 56 red 60 red 58 red 50 red 56 red 57 red 58 gold 54 gold 56 gold 49 gold 46 gold 48 gold 57 gold 56 gold 46 gold 48 gold 54 gold 52 gold 51
Since every team has the same number of observations, PROC ANOVA is used (otherwise, PROC GLM would be used). To find out which teams, if any, differ significantly from the rest, we add the T and SCHEFFE options to the MEANS statement to output pairwise t-tests and Scheffe’s test. The SAS program is as follows:
LIBNAME examples 'H:\Work and Study Docs\Stat 125\Datasets';

DATA examples.vballheights;
   INFILE 'H:\Work and Study Docs\Stat 125\Datasets\vball.txt';
   INPUT team $ height @@;
RUN;

PROC ANOVA DATA = examples.vballheights;
   CLASS team;
   MODEL height = team;
   MEANS team / T SCHEFFE;
RUN;
The output of PROC ANOVA is quite long, so let’s go through it piece by piece:
The first part of the output gives information about the classification variable. The CLASS variable is team, and it has 5 categories, for a total of 60 observations. The second part is the ANOVA table:
The model is indeed significant (an F statistic of 4.10 yields a p-value of 0.0056); that is, at least one pair of teams has significantly different mean heights. The T and SCHEFFE options in the MEANS statement compare the heights between the teams:
For the T option, team red (classified as A) is significantly different from all other teams. Pairwise, the average heights of the four remaining teams (classified as B) are not significantly different from one another. Let’s see if the same holds for the SCHEFFE grouping:
The SCHEFFE option yields two overlapping groups: A (red, gold and green) and B (gold, green, blue and maroon). The same conclusions hold for the Scheffe grouping, except for two pairs: the average heights of players in the pairs red-gold and red-green are not significantly different. You can also add other tests or change the level of significance.
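As a minimal sketch of that last remark, the program below reruns the volleyball analysis with Tukey’s and Bonferroni’s tests at a 0.10 level of significance. It assumes that the examples library and the examples.vballheights data set created earlier are still available; the choice of TUKEY and BON is for illustration only. The second step shows how the same statements could be written in PROC GLM, which is the procedure to use when the teams do not have equal numbers of players.

```sas
* Same one-way ANOVA with other multiple comparison tests;
* and a 0.10 level of significance (illustrative choices);
PROC ANOVA DATA = examples.vballheights;
   CLASS team;
   MODEL height = team;
   MEANS team / TUKEY BON ALPHA = 0.10;
RUN;

* PROC GLM takes the same statements and also handles unbalanced data;
PROC GLM DATA = examples.vballheights;
   CLASS team;
   MODEL height = team;
   MEANS team / SCHEFFE;
RUN;
QUIT;
```

Because the volleyball data are balanced, both procedures should give the same ANOVA table here; PROC GLM only becomes necessary once the group sizes differ.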
PROC CORR
The CORR procedure in SAS computes correlations. A correlation coefficient measures the degree of linear relationship between two variables. A correlation of zero implies that there is no linear relationship (the variables may be unrelated or related nonlinearly). A correlation of 1 implies perfect positive linear correlation, while -1 implies perfect negative linear correlation. The general form of PROC CORR is as follows:

PROC CORR DATA = dataset OPTIONS;
   VAR variable(s);
   WITH variable;
RUN;

The default correlation coefficient is PEARSON, so no option needs to be typed to obtain it. Type SPEARMAN if you want to output the Spearman correlation coefficient. Other options are HOEFFDING (Hoeffding’s D statistic) and KENDALL (Kendall’s tau-b coefficient). The VAR statement lists the variables to be correlated. If there is no VAR statement, SAS will correlate all pairs of numeric variables. The WITH statement, on the other hand, limits the output to the correlations between each variable in the VAR statement and the variable in the WITH statement. That is, whereas “VAR a b c;” produces correlations a-b, a-c and b-c, “VAR b c; WITH a;” produces only correlations a-b and a-c.

For example, each student in a Statistics 101 class recorded three values: test score, number of hours spent surfing the net in the week prior to the test, and number of hours spent exercising during the same week. The raw data (exercise.txt) are as follows:

57 7 3 79 8 5 85 6 6 74 5 1 91 4 5
45 10 1 77 6 2 88 4 4 93 3 8 76 9 4
86 2 7 68 5 3 91 6 6 85 7 6 75 6 3
65 5 2 74 1 6 79 6 3 70 7 2 57 8 2
88 9 5 74 9 4 101 1 7 55 9 1 82 6 5
79 6 3 70 5 2 65 8 2 74 8 4 66 7 3

The following program reads the data and performs the correlation procedure:

LIBNAME examples 'H:\Work and Study Docs\Stat 125\Datasets';

DATA examples.exernet;
   INFILE 'H:\Work and Study Docs\Stat 125\Datasets\exercise.txt';
   INPUT score internet exercise @@;
RUN;

PROC CORR DATA = examples.exernet;
   VAR internet exercise;
   WITH score;
RUN;

The output is as follows:
The first part of the output presents the descriptive statistics for each variable. It is followed by a correlation matrix that displays the requested correlations. In our example, the correlation coefficients are Pearson coefficients, and below each coefficient is the p-value for the test of the null hypothesis that there is no linear correlation. We find that internet use during the week prior to the exam is negatively correlated with exam scores, while exercise is positively correlated with them. Both correlation coefficients are significant at any alpha greater than 0.0015.
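As a hedged variation on the PROC CORR example, the sketch below requests Spearman coefficients and then drops the WITH statement so that all pairs of variables are correlated. It assumes the examples.exernet data set from the example is available; whether rank correlations are more appropriate than Pearson for these data is a judgment left to the analyst.

```sas
* Spearman rank correlations of score with internet and exercise;
PROC CORR DATA = examples.exernet SPEARMAN;
   VAR internet exercise;
   WITH score;
RUN;

* No WITH statement: every pair among the VAR variables is correlated;
* PEARSON and SPEARMAN together request both kinds of coefficients;
PROC CORR DATA = examples.exernet PEARSON SPEARMAN;
   VAR score internet exercise;
RUN;
```

The second step produces a full 3 by 3 matrix for each type of coefficient, which also shows how internet and exercise hours relate to each other.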
PROC REG
The REG procedure is only one of the many SAS procedures that can perform a regression analysis. The procedure fits the model by least squares. The basic form of the regression procedure in SAS is as follows:

PROC REG DATA = dataset;
   MODEL dependent = independent;
   PLOT yvariable*xvariable;
RUN;

The REG procedure only requires the MODEL statement, which lets you specify the regression model. Note that the model can have more than one independent variable, and the independent variables may include transformed terms, so the fitted relationship need not be a straight line. The PLOT statement is used to generate scatterplots of your data.

In the following example from the book Regression Methods (RJ Freund and PD Minton, 1979), page 111, 44 fish were measured. The task is to find out whether the length of a fish (mm) is a function of its age (days) and the water temperature (degrees Celsius). Each record lists the fish id, age, water temperature and length:

1 14 25 620 2 28 25 1315 3 41 25 2120 4 55 25 2600
5 69 25 3110 6 83 25 3535 7 97 25 3935 8 111 25 4465
9 125 25 4530 10 139 25 4570 11 153 25 4600 12 14 27 625
13 28 27 1215 14 41 27 2110 15 55 27 2805 16 69 27 3255
17 83 27 4015 18 97 27 4315 19 111 27 4495 20 125 27 4535
21 139 27 4600 22 153 27 4600 23 14 29 590 24 28 29 1305
25 41 29 2140 26 55 29 2890 27 69 29 3920 28 83 29 3920
29 97 29 4515 30 111 29 4520 31 125 29 4525 32 139 29 4565
33 153 29 4566 34 14 31 590 35 28 31 1205 36 41 31 1915
37 55 31 2140 38 69 31 2710 39 83 31 3020 40 97 31 3030
41 111 31 3040 42 125 31 3180 43 139 31 3257 44 153 31 3214

The following SAS program outputs a plot and the regression model:

LIBNAME examples 'H:\Work and Study Docs\Stat 125\Datasets';

DATA examples.regfish;
   INFILE 'H:\Work and Study Docs\Stat 125\Datasets\fish.txt';
   INPUT id age watertemp length @@;
RUN;

PROC REG DATA = examples.regfish LINEPRINTER;
   MODEL length = age watertemp;
   PLOT length*age;
RUN;

The LINEPRINTER option is used if you don’t have the SAS/GRAPH module. The PROC REG output is as follows:
The first table above is the ANOVA section. It reports the test of the null hypothesis that all coefficients except the intercept are zero, together with the R-square and the R-square adjusted for the degrees of freedom. The F-test (with p-value < 0.0001) rejects the null hypothesis that the coefficients of the independent variables are all zero, which tells us that the model is significant. The R-square, or coefficient of determination, is high (0.80956), which means that about 81% of the variability in fish length is explained by the variables age and water temperature. Hence, we proceed to interpreting the actual model:
From the parameter estimates you can construct the following estimated regression equation:

fish length = 3904.26602 + 26.24068 * age – 106.41364 * watertemp

That is, holding water temperature constant, a one-day increase in the age of a fish increases its estimated length by about 26.24 millimeters. Similarly, holding age constant, an increase in water temperature of 1 degree Celsius decreases the estimated fish length by about 106.41 millimeters. To complete the analysis, let us inspect the plot of fish length against age:
Note that there is indeed an increasing trend between fish length and age.
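The estimated regression equation can also be used to compute predicted lengths. The following is a minimal sketch under stated assumptions: the data set name fishpred and the variable name predlength are hypothetical, the coefficients are copied from the equation above, and the two age and temperature combinations are chosen only for illustration.

```sas
* Predicted fish length from the estimated regression equation;
* The age (days) and watertemp (degrees Celsius) values are illustrative;
DATA fishpred;
   INPUT age watertemp;
   predlength = 3904.26602 + 26.24068*age - 106.41364*watertemp;
   DATALINES;
83 27
125 31
;
RUN;

* Display the predictions;
PROC PRINT DATA = fishpred;
RUN;
```

Comparing these predictions with the observed lengths at the same age and temperature gives a rough idea of how well the linear model fits individual fish.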
Part 4
References
   Book References
   Further Reading
REFERENCES

Main Reference
Delwiche, Lora D. and Slaughter, Susan J. “The Little SAS Book, Second Edition”. SAS Publishing. Cary, North Carolina. 1996.
Evans, Michael. “SAS Manual for Introduction to the Practice of Statistics, Third Edition”. University of Toronto, Canada.

Further Reading
You can also refer to the following manuals published by The SAS Institute, SAS Campus Drive, Cary, North Carolina, 27513:
SAS Language Reference, Version 8-9.
SAS Procedures Guide, Version 8-9.
SAS/STAT User’s Guide, Version 8-9.