lecture 4

Lecture 4 • Ways to get data into SAS • Some practice programming • Review of statistical concepts

Upload: channer

Post on 12-Jan-2016



Page 1: Lecture 4

Lecture 4

• Ways to get data into SAS

• Some practice programming

• Review of statistical concepts

Page 2: Lecture 4

Getting data into SAS

• DATALINES statement– Data is contained within a data step

• INFILE statement– Data contained in separate file

• PROC IMPORT– Data contained in separate file

Page 3: Lecture 4

* List Directed Input: Reading data values separated by spaces.;

DATA bp;
  INFILE DATALINES;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;
TITLE 'Data Separated by Spaces';
PROC PRINT DATA=bp;
RUN;

Obs  clinic  dbp6  sbp6  dbpbl  sbpbl
  1    C       84   138     93    143
  2    D       89   150     91    140
  3    A       78   116    100    162
  4    A        .     .     86    155
  5    C       81   145     86    140

Page 4: Lecture 4

* List Directed Input: Reading data values separated by commas;

DATA bp;
  INFILE DATALINES DLM = ',';
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,.,.,86,155
C,81,145,86,140
;
RUN;
TITLE 'Data separated by a comma';
PROC PRINT DATA=bp;
RUN;

Page 5: Lecture 4

* List Directed Input: Reading data values from a .csv type file;

DATA bp;
  INFILE DATALINES DLM = ',' DSD;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140
;
TITLE 'Reading in Data using the DSD Option';
PROC PRINT DATA=bp;
RUN;

Page 6: Lecture 4

* List Directed Input: Reading data values separated by tabs (.txt files);

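* The values below are separated by single tab characters. In row 4, consecutive tabs mark the two missing values;

DATA bp;
  INFILE DATALINES DLM = '09'x DSD;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C	84	138	93	143
D	89	150	91	140
A	78	116	100	162
A			86	155
C	81	145	86	140
;
TITLE 'Reading in Data separated by a tab';
PROC PRINT DATA=bp;
RUN;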

Page 7: Lecture 4

* Reading data from an external file;

DATA bp;
  INFILE '/home/ph5415/data/bp.csv' DSD FIRSTOBS = 2;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
RUN;
TITLE 'Reading in Data from an External File';
PROC PRINT DATA=bp;
RUN;

Content of bp.csv

clinic,dbp6,sbp6,dbpbl,sbpbl
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140

Page 8: Lecture 4

* Using PROC IMPORT to read in data;

PROC IMPORT DATAFILE='/home/ph5415/data/bp.csv'
  OUT = bp
  DBMS = csv REPLACE;
  GETNAMES = yes;
RUN;

TITLE 'Reading in Data Using PROC IMPORT';

PROC PRINT DATA=bp;
RUN;
PROC CONTENTS DATA=bp;
RUN;

Page 9: Lecture 4

The CONTENTS Procedure

Data Set Name:  WORK.BP                            Observations:          5
Member Type:    DATA                               Variables:             5
Engine:         V8                                 Indexes:               0
Created:        18:15 Tuesday, January 25, 2005    Observation Length:    40
Last Modified:  18:15 Tuesday, January 25, 2005    Deleted Observations:  0
Protection:                                        Compressed:            NO
Data Set Type:                                     Sorted:                NO
Label:

-----Alphabetic List of Variables and Attributes-----

#    Variable    Type    Len    Pos
-----------------------------------
1    clinic      Char      8     32
2    dbp6        Num       8      0
4    dbpbl       Num       8     16
3    sbp6        Num       8      8
5    sbpbl       Num       8     24

Page 10: Lecture 4

Some Definitions

• Statistics: The art and science of collecting, analyzing, presenting, and interpreting numerical data.

• Data: Facts and figures that are analyzed

• Dataset: All the data collected for a study

• Elements: Units on which data are collected
– People, companies, schools, households

• Variables: Characteristics measured on elements
– People (height, weight)
– Companies (number of employees)
– Schools (percentage of students who graduate in 5 years)
– Households (number of computers owned)

Page 11: Lecture 4

Informal Definition

• Statistics: Gaining information, in a scientific way, about something you do not know

Page 12: Lecture 4

Start With Research Question

• What is the proportion of persons without health insurance in Minnesota?

• Do newer blood pressure (BP) medications prevent heart disease better than older medications?

• What is the relationship between grade point average and SAT scores?

• Do persons who eat more fruits and vegetables (F&V) have a lower risk of developing colon cancer?

• Does the program DARE reduce the risk of young persons trying drugs?

Page 13: Lecture 4

Statistics

1) Start with a question

2) Design a study and collect data

3) Compute summary data to assess the question

4) Make conclusions (inference)

Page 14: Lecture 4

Statistical Inference

• Estimation (Chapter 4)

• Hypothesis Testing (Chapter 5)
– Comparing population proportions (Chapter 6)
– Comparing population means (Chapter 7)

Page 15: Lecture 4

Common Parameters to Estimate

Parameter          Parameter Description
$\mu$              Mean of population
$\pi$              Proportion with a certain trait
$\rho$             Correlation between 2 variables
$\mu_1 - \mu_2$    Difference between 2 means
$\pi_1 - \pi_2$    Difference between 2 proportions
$\sigma$           Population standard deviation

Page 16: Lecture 4

Statistical Inference

1) Population with mean $\mu$ = ?

2) A simple random sample of n elements is selected from the population.

3) The sample data provide a value for the sample mean $\bar{x}$.

4) The value of $\bar{x}$ is used to make inferences about the value of $\mu$.

Page 17: Lecture 4

Sampling

• Sample: a subset of the target population (usually a simple random sample, in which each possible sample has equal probability of occurring)

• Different samples yield different estimates

• Trying to understand the population parameter (the “true value”)
– It’s usually not possible to measure the population value

Page 18: Lecture 4

Point Estimate

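Parameter          Point Estimate
$\mu$              Sample mean, $\bar{x}$
$\pi$              Sample proportion, $p$
$\rho$             Sample correlation, $r$
$\mu_1 - \mu_2$    Difference between 2 sample means, $\bar{x}_1 - \bar{x}_2$
$\pi_1 - \pi_2$    Difference between 2 sample proportions, $p_1 - p_2$
$\sigma$           Sample standard deviation, $s$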

Page 19: Lecture 4

Interval Estimation

In general, confidence intervals are of the form:

$\text{estimate} \pm 1.96 \times SE$

SE = standard error of your estimate

Estimate = mean, proportion, regression coefficient, odds ratio...

1.96 = multiplier for a 95% CI, based on the normal distribution

Page 20: Lecture 4

Estimation

“What is the average total cholesterol level for MN residents?”

Random sample of cholesterol levels

sample mean = sum of values / number of observations

$\bar{X} = \frac{\sum X}{n}$

$\bar{X}$ estimates the population mean $\mu$

Page 21: Lecture 4

Estimation

“What is the average total cholesterol level for MN residents?”

sample standard deviation:

$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

$s$ estimates the population standard deviation $\sigma$
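As a rough check on the two formulas, the sketch below computes the sample mean and standard deviation in SAS. It reuses the five baseline systolic readings (sbpbl) from the bp example earlier in the lecture rather than cholesterol data; the dataset name bp_sketch is made up for this illustration.

```sas
* Sketch: sample mean and standard deviation with PROC MEANS;
* Values are the five sbpbl readings from the bp example (bp_sketch is a hypothetical name);
DATA bp_sketch;
  INPUT sbpbl @@;   * trailing @@ reads several observations from one line;
  DATALINES;
143 140 162 155 140
;
RUN;

* N = number of observations, MEAN = sample mean, STD = sample standard deviation (n - 1 divisor);
PROC MEANS DATA=bp_sketch N MEAN STD;
  VAR sbpbl;
RUN;
```

By hand: the values sum to 740, so $\bar{X} = 740/5 = 148$. The squared deviations sum to 398, so $s = \sqrt{398/4} \approx 9.97$, which is what PROC MEANS should report.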

Page 22: Lecture 4

Confidence Interval Example

Suppose a sample of n = 100

mean = 215 mg/dL, standard deviation = 20 mg/dL

95% CI = $\bar{X} \pm 1.96 \times s/\sqrt{n}$

= (215 - 1.96*20/10, 215 + 1.96*20/10) = (211.08, 218.92), approximately (211, 219)

$s/\sqrt{n}$ = standard error of the mean (here 20/10 = 2)
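The same arithmetic can be done in a DATA step. This is a minimal sketch using the summary values from the example; the dataset and variable names (ci_sketch, xbar, se, lower, upper) are chosen here for illustration only.

```sas
* Sketch: 95% CI for a mean from summary statistics;
DATA ci_sketch;
  n    = 100;                  * sample size;
  xbar = 215;                  * sample mean (mg/dL);
  s    = 20;                   * sample standard deviation (mg/dL);
  se    = s / SQRT(n);         * standard error of the mean = 2;
  lower = xbar - 1.96 * se;    * lower limit = 211.08;
  upper = xbar + 1.96 * se;    * upper limit = 218.92;
RUN;

PROC PRINT DATA=ci_sketch NOOBS;
RUN;
```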

Page 23: Lecture 4

Properties of Confidence Intervals

• As sample size increases, the CI gets narrower
– If you could sample the whole population, $\bar{X} = \mu$

• Can use different levels of confidence
– 90, 95, 99% common
– More confidence means a wider interval, so a 90% CI is narrower than a 99% CI

• Changes with population standard deviation
– A more variable population means a wider interval
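These properties can be seen numerically. The sketch below tabulates the width of a normal-based interval, $2 \times z \times s/\sqrt{n}$, for a few sample sizes and confidence levels. The value s = 20 is borrowed from the cholesterol example; the sample sizes and the dataset name ci_width are assumptions made for illustration.

```sas
* Sketch: CI width by confidence level and sample size;
DATA ci_width;
  s = 20;                              * standard deviation from the cholesterol example;
  DO conf = 0.90, 0.95, 0.99;
    z = PROBIT(1 - (1 - conf) / 2);    * normal quantile, about 1.96 when conf = 0.95;
    DO n = 25, 100, 400;
      width = 2 * z * s / SQRT(n);     * upper limit minus lower limit;
      OUTPUT;
    END;
  END;
RUN;

PROC PRINT DATA=ci_width NOOBS;
  FORMAT z 6.3 width 6.2;
RUN;
```

Because the width is proportional to $1/\sqrt{n}$, quadrupling the sample size halves the interval at any confidence level. Doubling s doubles the width.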

Page 24: Lecture 4

Caution with Confidence Intervals

– Data should be from a random sample

– More complicated sampling requires different methods
• Example: multistage or stratified sampling

– Outliers can cause problems

– Non-normal data can change the confidence level
• Skewed data are a big problem

– Bias is not accounted for
• Non-responders
• Target and sampled populations differ

Page 25: Lecture 4

95% Confidence Intervals with SAS

1) Construct from output

estimate +/- 1.96*SE

2) Provided automatically by some procedures

PROC MEANS DATA = STUDENTS LCLM UCLM;
  VAR AGE;
RUN;
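A slightly fuller sketch of the second approach follows. The STUDENTS dataset is not shown in the lecture, so a small stand-in with made-up ages is created here purely so the step runs; the values are hypothetical. CLM requests both limits at once and ALPHA=0.05 gives 95% confidence. Note that PROC MEANS bases its limits on the t distribution, so for small samples they will be somewhat wider than estimate +/- 1.96*SE.

```sas
* Sketch: 95% confidence limits for a mean from PROC MEANS;
* The ages below are hypothetical stand-in values, not data from the lecture;
DATA students;
  INPUT age @@;
  DATALINES;
21 23 22 25 30 27 24 22 26 28
;
RUN;

* N, MEAN, STDERR give the pieces of the interval; CLM gives the lower and upper limits;
PROC MEANS DATA=students N MEAN STDERR CLM ALPHA=0.05;
  VAR age;
RUN;
```

Changing ALPHA= to 0.10 or 0.01 would give 90% or 99% limits.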