today’s outline types of statistics - ut southwestern … statistics ... inferential statistics...



Page 1: Today’s Outline Types of Statistics - UT Southwestern … Statistics ... Inferential statistics • Generalizing from the sample. ... – Interquartile range is the difference between

1

June 27, 2013 1

INTRODUCTION TO BIOSTATISTICS FOR GRADUATE AND MEDICAL STUDENTS

Descriptive Statistics and Graphically Visualizing Data

Beverley Adams Huet, MS
Assistant Professor
Department of Clinical Sciences, Division of Biostatistics

June 27, 2013 2

Files for today (June 27) Lecture and Homework

Lecture #2 (1 file) PPT presentation

Homework #1 (1 file) Biostat_HW_Huet_assigned062713.docx

Medical Student Research Page -- Summer II
http://www.utsouthwestern.edu/education/medical-school/academics/research/summer/course-ii.html

June 27, 2013 3

Today’s Outline

Describing data
• Descriptive statistics
  – Measures of central tendency
  – Measures of dispersion
• Other statistics
  – Coefficient of variation
  – Standard error of the mean
• Transformations
• Histograms and other graphs

June 27, 2013 4

Types of Statistics

Descriptive statistics
• Which summary statistics to use to organize and describe the data?
• Proportion, mean, median, SD, percentiles

Inferential statistics
• Generalizing from the sample. Which test?
• T-test, Fisher’s Exact, ANOVA, survival analysis
• Bayesian approaches

June 27, 2013 5

Summary of commonly used statistical tests

Columns give the type of outcome variable; rows give the goal of the analysis.

| Goal | Continuous measurement (from a normal distribution) | Rank, score, or measurement (from non-normal distribution) | Binomial (e.g. heads or tails) | Survival (time to event) |
|---|---|---|---|---|
| Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan-Meier survival curve |
| Compare one group to a hypothetical value | One-sample t test | Wilcoxon signed-rank test | Chi-square or binomial test | |
| Compare two unpaired (independent) groups | Two-sample (unpaired) t test | Mann-Whitney (Wilcoxon rank sum) test | Fisher's exact test (or chi-square for large samples) | Log-rank test or Mantel-Haenszel |
| Compare two paired groups | Paired t test | Wilcoxon signed-rank test | McNemar test | Conditional proportional hazards regression |
| Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression |
| Compare three or more matched groups | Repeated-measures analysis | Friedman test | Cochran's Q test | Conditional proportional hazards regression |
| Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients | |
| Predict value from another measured variable | Simple linear regression | Nonparametric regression | Simple logistic regression | Cox proportional hazard regression |
| Predict value from several measured or binomial variables | Multiple linear regression, ANCOVA | | Multiple logistic regression | Cox proportional hazard regression |

June 27, 2013

Censored data

Cannot be measured beyond some limit

• Left censoring
• Right censoring

6


June 27, 2013

Left Censored data

• Lab data – “undetectable”, “below lower limit”

• Example CRP “< 0.2 mg/dL”

Cannot be measured beyond some limit

| Subject | CRP |
|---|---|
| 001 | 0.7 |
| 002 | 1.6 |
| 003 | <0.2 |
| 004 | 3.8 |

Censored at the limit of detectability

7 June 27, 2013

Right Censored data

• Right censoring

- “Survival” data – the period of observation was cut off before the event of interest occurred.

Cannot be measured beyond some limit

Note – an event in a ‘survival’ analysis may be infection, fracture, transplant, metastasis

8

June 27, 2013

Right censored survival data

[Figure: Subject (y-axis, 0 to 10) versus study time in months (x-axis, 0 to 12). Legend: survival time known; censored. Annotations: “Event” at 3 months; lost to follow-up at 9 months.]

9 June 27, 2013

Right censored survival data

[Figure: Subject (y-axis, 0 to 10) versus study time in months (x-axis, 0 to 12). Legend: survival time known; censored.]

Survival Analysis

[Figure: Survival (y-axis, 0.0 to 1.0) versus Time (x-axis, 0 to 12), drawn as a step function.]

June 27, 2013

Descriptive statistics

• Measures of Central Tendency
• Measures of Dispersion

Measures of Central Tendency*

• Mean
• Median
• Geometric mean
• Mode

*or Measures of Location

[Figure: histogram (x-axis 0 to 100, y-axis 0 to 350).]


June 27, 2013 13

Measures of Central Tendency*

• Mean
  – Arithmetic average or balance point
  – Discrete/continuous data; symmetric distribution
  – May be sensitive to outliers
  – Sample mean symbol is denoted as ‘x-bar’

$\bar{X} = \frac{\sum X}{N}$

Fasting plasma glucose, n=6

| SubjectID | Glucose mg/dL |
|---|---|
| 0204 | 145 |
| 0205 | 126 |
| 0206 | 136 |
| 0210 | 97 |
| 0211 | 264 |
| 0212 | 144 |
| Mean | 152 |

*or Measures of Location

June 27, 2013 14

Fasting plasma glucose, n=6

[Figure: Mean, Glucose mg/dL (axis 0 to 200).]

[Figure: Fasting Plasma Glucose; Glucose, mg/dL (axis 0 to 300).]

| SubjectID | Glucose mg/dL |
|---|---|
| 0204 | 145 |
| 0205 | 126 |
| 0206 | 136 |
| 0210 | 97 |
| 0211 | 264 |
| 0212 | 144 |
| Mean | 152 |
| Median | 140 |

What about other measures of central tendency?

June 27, 2013 15

Measures of Central Tendency*

*or Measures of Location

In a symmetric, unimodal distribution, the median, mode and mean will have the same value.

[Figure: distribution plot (x-axis 0 to 10, y-axis 0 to 100).]

In a non-symmetric (skewed) distribution, the median, mode and mean may not have the same value.

June 27, 2013 16

Measures of Central Tendency

Median

• Middle value when the data are ranked in order (if the sample size is an even number then the median is the average of the two middle values)
• 50th percentile
• Ordinal/discrete/continuous data
• Useful with highly skewed discrete or continuous data
• Relatively insensitive to outliers

June 27, 2013 17

Measures of Central Tendency

The median of 13, 11, 17 is 13
The median of 13, 11, 568 is 13
The median of 14, 12, 11, 568 is 13

June 27, 2013 18

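Measures of Central Tendency

| SubjectID | Glucose mg/dL |
|---|---|
| 0204 | 145 |
| 0205 | 126 |
| 0206 | 136 |
| 0210 | 97 |
| 0211 | 264 |
| 0212 | 144 |
| Mean | 152 |
| Median | 140 |

Order the glucose values from smallest to largest

| SubjectID | Glucose mg/dL |
|---|---|
| 0210 | 97 |
| 0205 | 126 |
| 0206 | 136 |
| 0212 | 144 |
| 0204 | 145 |
| 0211 | 264 |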
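The ordering step can be checked with a few lines of Python. This is only a sketch using the six glucose values from the slide and the standard library `statistics` module; the variable names are illustrative, not part of the lecture.

```python
from statistics import mean, median

# Fasting plasma glucose (mg/dL) from the slide, keyed by SubjectID.
glucose = {"0204": 145, "0205": 126, "0206": 136,
           "0210": 97, "0211": 264, "0212": 144}

values = sorted(glucose.values())   # order from smallest to largest
n = len(values)

sample_mean = mean(values)          # sum of the values divided by n

# n is even, so the median is the average of the two middle values.
middle_two = (values[n // 2 - 1], values[n // 2])
sample_median = median(values)

print("ordered values:", values)
print("mean:", sample_mean)
print("two middle values:", middle_two, "median:", sample_median)
```

The single large value (264) pulls the mean above five of the six observations, while the median stays near the bulk of the data.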


June 27, 2013 19

Gonick & Smith (1993) The Cartoon Guide to Statistics.

The median is often better than the mean for describing the center of the data

June 27, 2013 20

Geometric mean

Report for log transformed data

• Never larger than the arithmetic mean for positive data; the two are equal only when all values are identical.

• Can be used instead of the arithmetic mean when data have a skewed or log-normal distribution

• Find the mean on the log scale; taking the antilog of this mean yields the geometric mean.

$G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$

June 27, 2013 21

Geometric mean

Sometimes we can transform our data

[Figure: histograms of Creatinine and of Log10(Creatinine), the log transformed data.]

June 27, 2013 22

Geometric mean

Loge transformed data

| SubjectID | Glucose mg/dL | ln(Glucose) |
|---|---|---|
| 0204 | 145 | 4.976734 |
| 0205 | 126 | 4.836282 |
| 0206 | 136 | 4.912655 |
| 0210 | 97 | 4.574711 |
| 0211 | 264 | 5.575949 |
| 0212 | 144 | 4.969813 |
| Mean | 152 | 4.9743573 |
| SD | 57.644 | 0.330 |
| Median | 140 | 4.941234093 |

Geometric mean: back-transform (antilog) the mean of the log transformed data

Take the antilog of the mean: exp(4.974357) = 144.6558278
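The back-transformation can be reproduced in Python as a rough check. The sketch below uses the slide's glucose values, computes the geometric mean both ways (antilog of the mean log, and n-th root of the product), and compares it with the arithmetic mean. The names are illustrative.

```python
import math
from statistics import mean

glucose = [145, 126, 136, 97, 264, 144]   # mg/dL, from the slide

# Route 1: mean on the natural-log scale, then take the antilog.
log_values = [math.log(x) for x in glucose]
mean_log = mean(log_values)
gm = math.exp(mean_log)

# Route 2: n-th root of the product of the values.
gm_root = math.prod(glucose) ** (1 / len(glucose))

print(f"mean of ln(glucose): {mean_log:.6f}")
print(f"geometric mean:      {gm:.4f}")
print(f"n-th root of product:{gm_root:.4f}")
print(f"arithmetic mean:     {mean(glucose)}")
```

Both routes give the same number up to floating-point rounding, and it falls below the arithmetic mean of 152.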

June 27, 2013 23

Measures of Central Tendency

Mode

• Most frequently occurring value in the distribution

• Nominal/ordinal/discrete/continuous data

The mode of 13, 11, 22, 11, 17 = 11
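A small Python sketch of the same example, assuming only the standard library. `Counter` shows the tally behind the answer, and `multimode` is included because real data can have more than one most-frequent value.

```python
from collections import Counter
from statistics import mode, multimode

values = [13, 11, 22, 11, 17]   # example from the slide

counts = Counter(values)        # how often each value occurs
most_frequent = mode(values)    # the single most frequent value
all_modes = multimode(values)   # list of every value tied for most frequent

print(counts)
print("mode:", most_frequent, "all modes:", all_modes)
```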

June 27, 2013 24

Gonick & Smith (1993) The Cartoon Guide to Statistics.

Measures of Dispersion


June 27, 2013 25

Measures of Dispersion

• Also known as
  – Measures of Spread
  – Measures of Variability

• Commonly used measures of variability
  – Standard deviation
  – Range
  – Percentiles

June 27, 2013 26

Measures of Dispersion

• Standard Deviation
  Square root of average of the squared deviations
  – Has the same units as the original observation
  – NEVER negative

Sample standard deviation (Roman letters) [n-1 in the denominator corrects for bias]

$s_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

Population standard deviation (Greek letters; lower case sigma)

$\sigma_x = \sqrt{\frac{\sum (X - \mu)^2}{n}}$

June 27, 2013 27

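Standard deviation calculation

| SubjectID | Glucose mg/dL | Deviation from mean (glucose minus mean of 152) | Squared deviation |
|---|---|---|---|
| 0204 | 145 | -7 | 49 |
| 0205 | 126 | -26 | 676 |
| 0206 | 136 | -16 | 256 |
| 0210 | 97 | -55 | 3025 |
| 0211 | 264 | 112 | 12544 |
| 0212 | 144 | -8 | 64 |
| Mean | 152 | | |
| Median | 140 | | |

| n | Sum of squares | Variance (sum of squares divided by n-1) | SD (square root of variance) |
|---|---|---|---|
| 6 | 16614 | 3322.8 | 57.644 |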
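The hand calculation can be traced step by step in Python. This is a sketch with the slide's six glucose values; `statistics.stdev` (which also divides by n-1) is used only as a cross-check.

```python
import math
from statistics import stdev

glucose = [145, 126, 136, 97, 264, 144]   # mg/dL, from the slide
n = len(glucose)

mean_glucose = sum(glucose) / n                    # 152
deviations = [x - mean_glucose for x in glucose]   # glucose minus mean
squared = [d ** 2 for d in deviations]             # squared deviations
sum_of_squares = sum(squared)

variance = sum_of_squares / (n - 1)   # divide by n-1 (sample variance)
sd = math.sqrt(variance)              # square root of variance

print("sum of squares:", sum_of_squares)
print("variance:", variance)
print(f"SD: {sd:.3f}  (library check: {stdev(glucose):.3f})")
```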

Standard deviation and the Normal Distribution

• Approximately 68% of the observations fall within 1 SD of the mean
• Approximately 95% of the observations fall within 2 SDs of the mean
• Approximately 99.7% of the observations fall within 3 SDs of the mean
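These percentages follow from the normal cumulative distribution function. As a sketch, Python's standard library `statistics.NormalDist` can reproduce them for a standard normal (mean 0, SD 1); the same proportions hold for any normal distribution.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD: {100 * p:.1f}%")
```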

June 27, 2013 29

Percentiles

From Primer of Biostatistics by Stanton A Glantz

June 27, 2013 30

Percentiles

• The value below which a given percentage of the values occur
  – The 50th percentile is the median
  – Quartiles
    • The 25th percentile is first quartile (Q1)
    • The 75th percentile is third quartile (Q3)
  – Interquartile range is the difference between 25th and 75th percentiles
  – Other commonly reported percentiles
    • 10th and 90th percentiles
    • 5th and 95th percentiles
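A minimal sketch of quartiles and the interquartile range in Python, using the six glucose values from the earlier slides. Note the assumption: software packages interpolate percentiles in different ways, so Q1 and Q3 from `statistics.quantiles(..., method="inclusive")` may not match another program exactly, especially with a sample this small.

```python
from statistics import median, quantiles

glucose = [145, 126, 136, 97, 264, 144]   # mg/dL, from the earlier slides

# n=4 asks for the three cut points that split the data into quartiles.
# method="inclusive" is one of several interpolation conventions.
q1, q2, q3 = quantiles(glucose, n=4, method="inclusive")
iqr = q3 - q1   # difference between the 75th and 25th percentiles

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Whatever the convention, the 50th percentile is the median.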


“Special” Percentiles: Quartiles

[Figure: a distribution divided into four parts of 25% each by Q1, Q2, and Q3, with the IQR marked.]

June 27, 2013 31

June 27, 2013 32

Percentiles

From Primer of Biostatistics by Stanton A Glantz

June 27, 2013 33

Summarizing data with medians and percentiles

June 27, 2013 34

Box and Whisker Plot

From Lang and Secic, How to Report Statistics in Medicine

Copyright ©2007 The Endocrine Society

Dovio, A. et al. J Clin Endocrinol Metab 2007;92:1803-1808

FIG. 1. Serum levels of OPG (A) and sRANKL (B) in CS patients and controls. Values are median, 25th and 75th percentile, and range. *, P < 0.01 by Mann-Whitney U test.

Box and Whisker plots

June 27, 2013 35 June 27, 2013 36

Descriptive statistics with percentiles

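/* Descriptive statistics for HDL by group. With no data= option, */
/* proc means uses the most recently created data set.            */
proc means n mean std median min p25 p75 max maxdec=5;
  title3 'Descriptive statistics';
  class group;   /* one set of statistics per level of group */
  var hdl;       /* analysis variable */
run;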


June 27, 2013 37

Measures of shape

“Skewed left” – a long left tail “Skewed right” – a long right tail

Skewness measures symmetry

June 27, 2013 38

Figure 1. Histograms of the fasting insulin levels in Japanese men and women.

Characteristics of the Insulin Resistance Syndrome in a Japanese Population The Jichi Medical School Cohort Study

Arteriosclerosis, Thrombosis, and Vascular Biology. 1996;16:269-274

Skewed left or right??

June 27, 2013 39

Examples of variables having a skewed distribution

Skewness measures symmetry

• Triglycerides, insulin, HOMA

• Bilirubin, leptin, CRP, viral load counts

• Urine albumin

• Income

• Health care costs

• Hospital length of stay

June 27, 2013 40

Other statistics
• Coefficient of variation
• Standard error of the mean

June 27, 2013 41

Coefficient of variation (CV)

The standard deviation expressed as a proportion of the mean:

$CV = \frac{SD}{Mean}$

Often expressed as a percent:

$CV = \frac{SD}{Mean} \times 100\%$

June 27, 2013 42

Coefficient of variation (CV)

“The magnitude of total intra-individual variability based on coefficient of variation (CV) for these lipids in premenopausal women (CV, 4% to 8.1%) was similar to that found for men (CV, 4.3% to 9.1%) and for postmenopausal women (CV, 3.7% to 6.7%).”

Metabolism. 2000 Sep;49(9):1101-5.


June 27, 2013 43

Coefficient of variation (CV)

Femoral Neck Hip Bone Density

| | ID1 | ID2 | ID3 | ID4 |
|---|---|---|---|---|
| Scan1 | 0.691 | 0.775 | 0.754 | 0.846 |
| Scan2 | 0.688 | 0.779 | 0.780 | 0.832 |
| Scan3 | 0.693 | 0.775 | 0.753 | 0.798 |
| Mean | 0.690667 | 0.776333 | 0.762333 | 0.825333 |
| SD | 0.002517 | 0.002309 | 0.015308 | 0.024685 |
| CV | 0.003644 | 0.002975 | 0.02008 | 0.029909 |
| CV% | 0.36% | 0.30% | 2.01% | 2.99% |

Subject A

| | CO2 | AST | ALT |
|---|---|---|---|
| Day1 | 28 | 24 | 27 |
| Day2 | 31 | 29 | 27 |
| Day3 | 29 | 30 | 28 |
| Mean | 29.33333 | 27.66667 | 27.33333 |
| SD | 1.527525 | 3.21455 | 0.57735 |
| CV | 0.052075 | 0.116189 | 0.021123 |
| CV% | 5.21% | 11.62% | 2.11% |

$CV = \frac{SD}{Mean}$

$CV = \frac{SD}{Mean} \times 100\%$

June 27, 2013 44

Coefficient of variation (CV)

• A measure of spread

• A unitless fraction (SD/mean)

• Used for comparing the relative variability of variables measured in different units
  – e.g., HDL-cholesterol (4.6%) and triglycerides (2.7%)

• Used for comparing the relative variability of variables measured in the same units
  – e.g., inter-assay versus intra-assay variability
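As an illustration, the CV for the repeated femoral neck bone density scans shown earlier can be computed in a few lines of Python. This is a sketch: the helper name `coefficient_of_variation` is made up for this example, and the SD is the sample SD (n-1 denominator).

```python
from statistics import mean, stdev

# Three repeated femoral neck scans per subject, from the earlier slide.
scans = {
    "ID1": [0.691, 0.688, 0.693],
    "ID2": [0.775, 0.779, 0.775],
    "ID3": [0.754, 0.780, 0.753],
    "ID4": [0.846, 0.832, 0.798],
}

def coefficient_of_variation(values):
    """SD expressed as a proportion of the mean (unitless)."""
    return stdev(values) / mean(values)

cv_percent = {subject: 100 * coefficient_of_variation(v)
              for subject, v in scans.items()}

for subject, cv in cv_percent.items():
    print(f"{subject}: CV = {cv:.2f}%")
```

Because the CV is unitless, the same helper could be applied to the CO2, AST and ALT values even though they are measured on different scales.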

June 27, 2013 45

Standard error of the mean

• Is NOT a descriptive statistic

• The standard error of the mean is useful in the calculations of confidence intervals and significance tests

Do not summarize continuous data with the mean and the standard error of the mean.

Lang and Secic (2006) How to Report Statistics in Medicine

June 27, 2013 46

Standard error of the mean (also called SEM, SE, standard error)

SEM = Standard deviation / Square root of the sample size

or

$SEM = \frac{SD}{\sqrt{n}}$

Which is smaller - SD or SEM?

June 27, 2013 47

Standard error (SE, SEM, standard error of the mean)

Why is the standard error commonly used as a descriptive statistic and graphed as ‘error bars’?

• It is smaller than the standard deviation
• ‘Looks’ better?

The only role of the standard error…is to distort and conceal the data.

Feinstein, Clinical Biostatistics

June 27, 2013 48

Standard deviation (SD) and Standard error of the mean (SEM)

$SEM = \frac{SD}{\sqrt{n}}$

$SD = SEM \times \sqrt{n}$

These formulas do need to be memorized!

Can convert one to the other if the sample size (n) is known


June 27, 2013 49

Standard deviation (SD) and Standard error of the mean (SEM)

Can convert one to the other if the sample size (n) is known

SD = 40, n = 64, SEM = ?
SEM = 40/8 = 5

SEM = 12, n = 81, SD = ?
SD = 12 * 9 = 108
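The two conversions can be written as a pair of small Python functions. This is a sketch; the function names are illustrative.

```python
import math

def sem_from_sd(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

def sd_from_sem(sem, n):
    """Recover the SD from a reported SEM and sample size."""
    return sem * math.sqrt(n)

print(sem_from_sd(40, 64))   # SD = 40, n = 64
print(sd_from_sem(12, 81))   # SEM = 12, n = 81
```

This is handy when a paper reports only mean ± SEM and the spread of the individual observations is what matters.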

June 27, 2013 50

Make friends with your data!

Look at your data!

Making friends with your data

Don’t run away!

June 27, 2013 51

Transformations

Why transform data?

• Many statistical analyses include assumptions
  – Normality (Normally distributed)
  – The different groups have the same standard deviation
  – Linearity for correlation or modeling

June 27, 2013 52

Transformations

• Linear

• Nonlinear

[Figure: Plasma Glucose, mmol/L (y-axis, 0.00 to 16.00) versus mg/dL (x-axis, 0 to 300).]

June 27, 2013 53

Linear Transformations

Add, subtract, multiply, divide…

• A straight line is obtained when plotting the new values against the original values

• Mean and standard deviation of the transformed values are easily obtained

June 27, 2013 54

Linear Transformations: Adding/subtracting a constant

• The mean increases/decreases by the same amount as the constant

• The standard deviation is unaffected

| Obs | X | X+20 |
|---|---|---|
| a | 5 | 25.00 |
| b | 7 | 27.00 |
| c | 2 | 22.00 |
| d | 9 | 29.00 |
| e | 3 | 23.00 |
| f | 1 | 21.00 |
| Mean | 4.5 | 24.50 |
| SD | 3.082 | 3.082 |
| Median | 4 | 24.00 |

[Figure: Adding a constant; X+20 (y-axis, 0.00 to 35.00) versus X (x-axis, 0 to 10).]
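A quick Python check of the adding-a-constant example, using the six X values from the slide. This is a sketch with illustrative names.

```python
import math
from statistics import mean, median, stdev

x = [5, 7, 2, 9, 3, 1]            # observations a to f from the slide
shifted = [v + 20 for v in x]     # add the constant 20 to every value

print("mean:  ", mean(x), "->", mean(shifted))
print("median:", median(x), "->", median(shifted))
print(f"SD:     {stdev(x):.3f} -> {stdev(shifted):.3f}")
```

The mean and median move up by exactly 20, while the SD is unchanged because every deviation from the mean is unchanged.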


June 27, 2013 55

Linear Transformations - Multiplying by a constant

Conversion from ‘conventional’ units to International System of Units (SI) units

Multiply by 0.0555 to convert mg/dL to mmol/L

[Figure: Plasma Glucose, mmol/L (y-axis, 0.00 to 16.00) versus mg/dL (x-axis, 0 to 300).]

| SubjectID | Glucose mg/dL | Glucose mmol/L (converted to SI units) |
|---|---|---|
| 0204 | 145 | 8.05 |
| 0205 | 126 | 6.99 |
| 0206 | 136 | 7.55 |
| 0210 | 97 | 5.38 |
| 0211 | 264 | 14.65 |
| 0212 | 144 | 7.99 |
| Mean | 152 | 152*0.0555 = 8.44 |
| SD | 57.644 | 57.64*0.0555 = 3.20 |
| Median | 140 | 140*0.0555 = 7.77 |

Both the mean and SD are multiplied by 0.0555
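The unit conversion can be verified in Python with the slide's glucose values and the 0.0555 factor. This is a sketch; the constant and variable names are illustrative.

```python
import math
from statistics import mean, median, stdev

MGDL_TO_MMOL = 0.0555                       # conversion factor from the slide
glucose_mgdl = [145, 126, 136, 97, 264, 144]
glucose_mmol = [v * MGDL_TO_MMOL for v in glucose_mgdl]

# Summary statistics computed on the converted values...
mean_mmol = mean(glucose_mmol)
sd_mmol = stdev(glucose_mmol)
median_mmol = median(glucose_mmol)

# ...match the original summaries multiplied by the same constant.
print(f"mean:   {mean_mmol:.2f} vs {mean(glucose_mgdl) * MGDL_TO_MMOL:.2f}")
print(f"SD:     {sd_mmol:.2f} vs {stdev(glucose_mgdl) * MGDL_TO_MMOL:.2f}")
print(f"median: {median_mmol:.2f} vs {median(glucose_mgdl) * MGDL_TO_MMOL:.2f}")
```

There is no need to convert every observation and recompute; scaling the reported mean, SD and median gives the same result.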

June 27, 2013 56

Nonlinear transformation

Copyright ©1996 BMJ Publishing Group Ltd.

Bland, J M. et al. BMJ 1996;312:1079

Fig 1--Serum triglyceride and log10 serum triglyceride concentrations in cord blood for 282 babies, with best fitting normal distribution

June 27, 2013 57 June 27, 2013 58

Transformations

Common non-linear transformations

• Log transformation (log10 or loge)

• Square-root transformation
  – Less dramatic than the log transformation

• Reciprocal transformation
  – More drastic than the log transformation
  – Use for extremely skewed data distributions

June 27, 2013 59

Transformations

For assessing linear association

[Figure: Untransformed; % Liver Fat (y-axis, 0 to 50) versus % Total Body Fat (x-axis, 0 to 60). Pearson correlation coefficient = 0.28]

[Figure: loge transformed; % Liver Fat on the loge scale (y-axis, -6 to 5) versus % Total Body Fat (x-axis, 0 to 60). Pearson correlation coefficient = 0.38]

June 27, 2013 60

Log transformation

• When might a log transformation be useful?
  – Remove positive (right) skewness
  – The standard deviation is greater than half of the mean (if the measure cannot be negative)
  – The mean is larger than the median
    • Mean liver fat = 6.2% SD=6.3% (median=4.2)
  – The mean is proportional to the standard deviation
  – When comparing several groups the variances or standard deviations are not equal


June 27, 2013 61

Leptin (ng/ml)

| | Group A | Group B |
|---|---|---|
| n | 33 | 29 |
| Mean | 36.3 | 27.4 |
| SD | 19.5 | 15.1 |
| Median | 29 | 19 |

These data might need a log transformation – why?
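One way to approach the question is to apply the rules of thumb from the log transformation slide to the leptin summary statistics. The Python sketch below does only that; the function name `log_transform_hints` is made up here, and the checks are screening heuristics, not a formal test of normality.

```python
# Summary statistics for leptin (ng/ml) from the slide.
leptin = {
    "Group A": {"n": 33, "mean": 36.3, "sd": 19.5, "median": 29},
    "Group B": {"n": 29, "mean": 27.4, "sd": 15.1, "median": 19},
}

def log_transform_hints(summary):
    """Apply the slide's rules of thumb for a non-negative measure."""
    return {
        "sd_greater_than_half_mean": summary["sd"] > summary["mean"] / 2,
        "mean_larger_than_median": summary["mean"] > summary["median"],
        "sd_to_mean_ratio": summary["sd"] / summary["mean"],
    }

hints = {group: log_transform_hints(s) for group, s in leptin.items()}
for group, h in hints.items():
    print(group, h)
```

In both groups the SD exceeds half of the mean and the mean exceeds the median, and the SD-to-mean ratio is similar across groups (the SD rises with the mean), all of which point toward right skewness.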

62

Log Transformation Serum Creatinine

Original units Log transformed

June 27, 2013 63

Histogram with normal distribution overlay

[Figure: histogram with a normal curve overlaid (x-axis 0 to 100, y-axis 0 to 350).]

Statistics and graph software
SigmaPlot and GraphPad Prism can be downloaded from the UTSW Information Resources INTRAnet

June 27, 2013 64

Histogram

• Group the data into intervals (x-axis)

• Height of the bar indicates the frequency (y-axis)

• Each “bar” begins/ends at the “true limits” of the interval.

• Bars are presented next to each other (unless the data in the next interval has a frequency of 0).

• Bars are usually the same width

• Frequencies correspond to the area of each bar

Histograms for continuous/discrete data

Relative frequency histogram

Absolute frequency histogram

Absolute Relative
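The grouping step behind a histogram can be sketched in Python without any plotting library. Assumptions for this example: the six glucose values from the earlier slides are reused as data, and the interval width of 50 mg/dL is an arbitrary choice made for illustration.

```python
from collections import Counter

glucose = [145, 126, 136, 97, 264, 144]   # mg/dL, from the earlier slides
BIN_WIDTH = 50                            # arbitrary interval width

def bin_lower_limit(value, width=BIN_WIDTH):
    """Lower limit of the interval [lower, lower + width) containing value."""
    return (value // width) * width

counts = Counter(bin_lower_limit(v) for v in glucose)

# Walk through every interval, including empty ones (frequency 0).
absolute = {}
relative = {}
for lower in range(min(counts), max(counts) + BIN_WIDTH, BIN_WIDTH):
    absolute[lower] = counts.get(lower, 0)
    relative[lower] = absolute[lower] / len(glucose)
    bar = "#" * absolute[lower]
    print(f"{lower:3d}-{lower + BIN_WIDTH:3d}: {bar:<5} "
          f"n={absolute[lower]}  ({100 * relative[lower]:.0f}%)")
```

The absolute and relative frequency histograms have the same shape; only the y-axis scale differs.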

June 27, 2013 66

Spaghetti plot for Repeated Measurements

Each subject is observed under multiple conditions or on multiple occasions

Spaghetti plot

Multiple time points


June 27, 2013 67

Forest plot for Meta-analysis

Forest plot

Descriptive statistics

June 27, 2013 68

Look at your data!

Do not rely on p-values. The non-significant results might be just as interesting or enlightening.

A difference must be a difference to make a difference