thomas songer, phd with acknowledgment to several slides provided by m rahbar and moataza mahmoud...

Thomas Songer, PhD, with acknowledgment to several slides provided by M Rahbar and Moataza Mahmoud Abdel Wahab

Introduction to Research Methods in the Internet Era

Data Collection and Descriptive Statistics

Introduction to Biostatistics

Key Lecture Concepts

• Distinguish between different strategies for obtaining a sample from a population

• Distinguish between different forms of data collection

• Identify key approaches to organize and portray your data

• Understand the measures of central tendency and variability in your data


Descriptive & Inferential Statistics

Descriptive Statistics deal with the enumeration, organization and graphical representation of data from a sample

Inferential Statistics deal with reaching conclusions from incomplete information, that is, generalizing from the specific sample

Inferential statistics use available information in a sample to draw inferences about the population from which the sample was selected (Rahbar)

Epidemiology is…

• The study of disease and its treatment, control, and prevention in a population of individuals.

• Whole populations may be examined, but…

• More frequently, samples of the population may be examined. Samples that are studied must be representative of the population for the results to be generalized to the total population.

Torrence 1997

Hypothetical Population

[Figure: a hypothetical population and three samples drawn from it (Sample 1, Sample 2, Sample 3). For each sample: Representative? Y / N]

Sampling Approaches

• Convenience Sampling: select the most accessible and available subjects in target population. Inexpensive, less time consuming, but sample is nearly always non-representative of target population.

• Random Sampling (Simple): select subjects at random from the target population. Requires first identifying everyone in the target population. Frequently provides a representative sample.


Sampling Approaches

• Systematic Sampling: Identify all in target population, and select every xth person as a subject.

• Stratified Sampling: Identify important sub-groups in your target population. Sample from these groups randomly or by convenience. Ensures that important sub-groups are included in sample. May not be representative.

• More complex sampling
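The sampling approaches above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 people, the `sex` sub-group, the sample sizes, and all function names are hypothetical and do not come from the lecture.

```python
import random
from collections import defaultdict

# Hypothetical target population: 100 people, each tagged with a sub-group.
population = [
    {"id": i, "sex": "F" if i % 2 == 0 else "M"} for i in range(1, 101)
]

rng = random.Random(42)  # fixed seed so the sketch is reproducible


def convenience_sample(pop, k):
    """Take whoever is easiest to reach; here, simply the first k listed."""
    return pop[:k]


def simple_random_sample(pop, k):
    """Every listed member has the same chance of being selected."""
    return rng.sample(pop, k)


def systematic_sample(pop, step):
    """Pick a random starting point, then take every step-th person."""
    start = rng.randrange(step)
    return pop[start::step]


def stratified_sample(pop, key, k_per_stratum):
    """Sample at random within each sub-group so every sub-group appears."""
    strata = defaultdict(list)
    for person in pop:
        strata[person[key]].append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample


conv = convenience_sample(population, 10)
srs = simple_random_sample(population, 10)
syst = systematic_sample(population, 10)
strat = stratified_sample(population, "sex", 5)

print([p["id"] for p in conv])
print(sorted(p["id"] for p in srs))
print([p["id"] for p in syst])
print(sorted((p["sex"], p["id"]) for p in strat))
```

Note that the simple random, systematic, and stratified versions all need a complete list of the target population, whereas the convenience sample does not.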

Sampling Error

• The discrepancy between the true population parameter and the sample statistic

• Sampling error likely exists in most studies, but can be reduced by using larger sample sizes

• Sampling error is approximately proportional to 1 / √n, where n is the sample size

• Note that larger sample sizes also require time and expense to obtain, and that large sample sizes do not eliminate sampling error
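A small simulation can illustrate the 1 / √n relationship. The sketch below assumes a hypothetical Normal population with mean 50 and standard deviation 10 (values chosen only for illustration) and compares the observed spread of the sample mean with σ / √n, the standard error of the mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

POP_MEAN, POP_SD = 50.0, 10.0  # hypothetical population parameters
REPLICATES = 2000              # repeated samples drawn per sample size


def spread_of_sample_means(n):
    """Standard deviation of the sample mean across repeated samples of size n."""
    means = [
        statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))
        for _ in range(REPLICATES)
    ]
    return statistics.stdev(means)


observed = {n: spread_of_sample_means(n) for n in (25, 100)}
expected = {n: POP_SD / n ** 0.5 for n in (25, 100)}

for n in (25, 100):
    print(f"n={n:>3}  observed={observed[n]:.2f}  sigma/sqrt(n)={expected[n]:.2f}")
```

Quadrupling the sample size from 25 to 100 roughly halves the sampling error, but it does not remove it.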


Research Process

Research question

Hypothesis

Identify research design

Data collection

Presentation of data

Data analysis

Interpretation of data

Polgar, Thomas

Types of Data Collection

• Surveys/Questionnaires

– Self-report

– Interviewer-administered

– Proxy

• Direct medical examination

• Direct measurement (e.g. blood draws)

• Administrative records


Understanding and Presenting Data


Types of Data

1. Categorical: (e.g., Sex, Marital Status, income category)

2. Continuous: (e.g., Age, income, weight, height, time to achieve an outcome)

3. Discrete: (e.g., Number of Children in a family)

4. Binary or Dichotomous: (e.g., response to all Yes or No type of questions)


Brain Size and IQ

What types of data do these variables represent?

Gender   FSIQ   VIQ   PIQ   Weight   Height   MRI Count
Female    133   132   124      118     64.5      816932
Male      140   150   124      124     72.5     1001121
Male      139   123   150      143     73.3     1038437
Male      133   129   128      172     68.8      965353
Female    137   132   134      147     65        951545
Female     99    90   110      146     69        928799
Female    138   136   131      138     64.5      991305
Female     92    90    98      175     66        854258
Male       89    93    84      134     66.3      904858
Male      133   114   147      172     68.8      955466
Female    132   129   124      118     64.5      833868

Scale of Data

1. Nominal: These data do not represent an amount or quantity (e.g., Marital Status, Sex)

2. Ordinal: These data represent an ordered series of relationships (e.g., level of education)

3. Interval: These data are measured on an interval scale having equal units but an arbitrary zero point (e.g., Temperature in Fahrenheit)

4. Ratio: These data have equal units and a true zero point, so one value can be compared meaningfully with another (e.g., weight: 100 kg is twice 50 kg)

Organizing Data and Presentation

• Frequency Table

• Frequency Histogram

• Relative Frequency Histogram

• Frequency polygon

• Relative Frequency polygon

• Bar chart

• Pie chart

• Box plot

Frequency Table

• Generally, the first approach to examining your data.

• Identifies distribution of variables overall

• Identifies potential outliers

– Investigate outliers as possible data entry errors

– Investigate a sample of others for data entry errors

Frequency Table

A research study has been conducted examining the number of children in the families living in a community. The following data has been collected based on a random sample of n = 30 families from the community.

2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6

Organize this data in a Frequency Table!


X = No. of Children   Count (Frequency)   Relative Freq.
0                     2                   2/30 = 0.067
1                     3                   3/30 = 0.100
2                     5                   5/30 = 0.167
3                     5                   5/30 = 0.167
4                     6                   6/30 = 0.200
5                     4                   4/30 = 0.133
6                     2                   2/30 = 0.067
7                     2                   2/30 = 0.067
8                     1                   1/30 = 0.033
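The same frequency table can be produced programmatically. The following is a minimal Python sketch using the 30 observations listed above; the variable names are illustrative only.

```python
from collections import Counter

# Number of children in each of the n = 30 sampled families.
children = [2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7,
            3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6]

counts = Counter(children)  # value -> frequency
n = len(children)

print(f"{'X':>2} {'Count':>5} {'Rel. freq.':>10}")
for value in sorted(counts):
    print(f"{value:>2} {counts[value]:>5} {counts[value] / n:>10.3f}")
```

A quick check that the counts add up to n is a cheap way to catch tallying errors.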


Frequency Table

Now, construct a similar frequency table for the age of patients with heart-related problems in a clinic.

The following data were collected from a random sample of n = 30 patients who went to the emergency room of the clinic for heart-related problems.

The measurements are: 42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47, 63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69.

Age Groups   Frequency   Relative Frequency
32-36 yr     2           2/30 = 0.067
37-41 yr     3           3/30 = 0.100
42-46 yr     3           3/30 = 0.100
47-51 yr     3           3/30 = 0.100
52-56 yr     8           8/30 = 0.267
57-61 yr     3           3/30 = 0.100
62-66 yr     4           4/30 = 0.133
67-72 yr     4           4/30 = 0.133
Total        n = 30
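For grouped data, tallying by machine avoids miscounting a class. The sketch below is a minimal Python illustration that assumes the age classes shown in the table and the 30 ages listed above; the variable names are illustrative only.

```python
# Ages of the n = 30 sampled patients.
ages = [42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47,
        63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69]

# Class limits (inclusive) as used in the table; the last class is one
# year wider than the others so that it reaches the oldest patient.
bins = [(32, 36), (37, 41), (42, 46), (47, 51),
        (52, 56), (57, 61), (62, 66), (67, 72)]

n = len(ages)
freq = {(lo, hi): sum(lo <= a <= hi for a in ages) for lo, hi in bins}

print(f"{'Age group':<10} {'Freq':>4} {'Rel. freq.':>10}")
for (lo, hi), count in freq.items():
    print(f"{lo}-{hi} yr   {count:>4} {count / n:>10.3f}")
print(f"Total      {sum(freq.values()):>4}")
```

Counting the listed ages directly gives 3 patients aged 42-46 and 4 patients aged 67-72, and the class frequencies sum to 30.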

Frequency Polygon

• Use to identify the distribution of your data

[Figure: frequency polygon of frequency (y-axis, 0 to 9) against age in years (x-axis groups 20-, 30-, 40-, 50-, 60-69), with separate lines for Female and Male]

Table 1 in a paper

Describe your study population in a frequency table

Table Title

Name of variable (Units of variable)   Frequency (n)   %   Mean (SD)
-
- Categories
-
Total

Measures of Central Tendency

Where is the heart of the distribution?

1. Mean

2. Median

3. Mode

Sample Mean

The arithmetic mean (or, simply, mean) is computed by summing all the observations in the sample and dividing the sum by the number of observations.

For a sample of five household incomes, 6,000, 10,000, 10,000, 14,000, 50,000, the sample mean is

$$ \bar{x} = \frac{6000 + 10000 + 10000 + 14000 + 50000}{5} = 18000 $$

Median

In a list ranked from smallest measurement to the highest, the median is the middle value

In our example of five household incomes, we first rank the measurements: 6,000 10,000 10,000 14,000 50,000

The sample median is 10,000

Mode

• The value which occurs with the greatest frequency

• Applicable even to nominal data
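The three measures of central tendency can be checked with Python's standard `statistics` module. This is a minimal sketch using the five household incomes from the lecture example.

```python
import statistics

incomes = [6000, 10000, 10000, 14000, 50000]

mean = statistics.mean(incomes)      # sum of observations / number of observations
median = statistics.median(incomes)  # middle value of the ranked list
mode = statistics.mode(incomes)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

The single large income of 50,000 pulls the mean up to 18,000, while the median and mode both stay at 10,000.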


Measures of non-central locations

• Quartiles

• Quintiles

• Percentiles
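A minimal Python sketch of these measures, assuming the 30 patient ages listed earlier in the lecture as example data. It relies on the standard library's default ("exclusive") interpolation method; other software may use slightly different conventions and return slightly different cut points.

```python
import statistics

ages = [42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47,
        63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69]

quartiles = statistics.quantiles(ages, n=4)      # 3 cut points -> 4 equal groups
quintiles = statistics.quantiles(ages, n=5)      # 4 cut points -> 5 equal groups
percentiles = statistics.quantiles(ages, n=100)  # 99 cut points

print("Quartiles:", quartiles)
print("Quintiles:", quintiles)
print("90th percentile:", percentiles[89])
```

The second quartile and the 50th percentile are both the median, which ties these non-central measures back to the measures of central tendency.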


Measures of Dispersion or Variability

• Range (report the highest and lowest values in a distribution; the difference between these values is the range)

• Variance

• Standard deviation (the square root of the variance)


Sample Variance

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

s = standard deviation (square root of the variance)

Calculation of Variance and Standard deviation

$$ s^2 = \frac{(6000 - 18000)^2 + (10000 - 18000)^2 + (10000 - 18000)^2 + (14000 - 18000)^2 + (50000 - 18000)^2}{5 - 1} = 328{,}000{,}000 $$

$$ s = \sqrt{328{,}000{,}000} \approx 18110.77 $$
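The calculation can be reproduced step by step in Python. This minimal sketch uses the five household incomes from the lecture and cross-checks the hand calculation against the standard `statistics` module; the variable names are illustrative only.

```python
import math
import statistics

incomes = [6000, 10000, 10000, 14000, 50000]

mean = sum(incomes) / len(incomes)
squared_deviations = [(x - mean) ** 2 for x in incomes]
variance = sum(squared_deviations) / (len(incomes) - 1)  # divide by n - 1
sd = math.sqrt(variance)

print(f"variance = {variance:,.0f}")
print(f"standard deviation = {sd:.2f}")

# Cross-check against the library's sample variance and standard deviation.
assert variance == statistics.variance(incomes)
assert math.isclose(sd, statistics.stdev(incomes))
```

Dividing by n - 1 rather than n is what makes this the sample variance rather than the population variance.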


Mean and Standard deviation (SD)

Data set 1: 7, 7, 7, 7, 7, 7   Mean = 7, SD = 0

Data set 2: 6, 7, 7, 7, 7, 8   Mean = 7, SD = 0.63

Data set 3: 2, 3, 7, 8, 9, 13   Mean = 7, SD = 4.05
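The comparison can be verified in Python. This sketch assumes that the 18 values on the slide form three data sets of six observations each, a grouping inferred from the reported means and standard deviations; the labels are illustrative only.

```python
import statistics

data_sets = {
    "no spread": [7, 7, 7, 7, 7, 7],
    "small spread": [6, 7, 7, 7, 7, 8],
    "large spread": [2, 3, 7, 8, 9, 13],
}

summary = {
    label: (statistics.mean(values), statistics.stdev(values))
    for label, values in data_sets.items()
}

for label, (mean, sd) in summary.items():
    print(f"{label:<13} mean={mean}  SD={sd:.2f}")
```

All three sets share the same mean of 7, so the mean alone cannot distinguish them; the standard deviation can. The sample SD of the third set is √16.4 ≈ 4.0497, which is 4.05 to two decimals (4.04 if truncated rather than rounded).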


Empirical Rule

For a Normal distribution, approximately:

a) 68% of the measurements fall within one standard deviation around the mean

b) 95% of the measurements fall within two standard deviations around the mean

c) 99.7% of the measurements fall within three standard deviations around the mean

Suppose the reaction time of a particular drug has a Normal distribution with a mean of 10 minutes and a standard deviation of 2 minutes

Approximately,

a) 68% of the subjects taking the drug will have reaction time between 8 and 12 minutes

b) 95% of the subjects taking the drug will have reaction time between 6 and 14 minutes

c) 99.7% of the subjects taking the drug will have reaction time between 4 and 16 minutes
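The empirical rule can be checked against the exact Normal probabilities. This minimal Python sketch uses the standard library's `NormalDist` with the mean of 10 minutes and standard deviation of 2 minutes from the example; the variable names are illustrative only.

```python
from statistics import NormalDist

reaction_time = NormalDist(mu=10, sigma=2)  # minutes

shares = {}
for k in (1, 2, 3):
    low = reaction_time.mean - k * reaction_time.stdev
    high = reaction_time.mean + k * reaction_time.stdev
    # Probability of falling between low and high under the Normal curve.
    shares[k] = reaction_time.cdf(high) - reaction_time.cdf(low)
    print(f"within {k} SD ({low:.0f} to {high:.0f} min): {shares[k]:.1%}")
```

The exact values are about 68.3%, 95.4%, and 99.7%, which the empirical rule rounds to 68%, 95%, and 99.7%.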
