biost 514/517 biostatistics i / applied biostatistics...

15
Lecture 8: Diagnostic Testing, Parametric Distributions 1 1 BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen Kerr, Ph.D. Associate Professor of Biostatistics University of Washington Lecture 8: Probability, Diagnostic Testing, Random Variables, Parametric Distributions October 23, 2013 2 Lecture Outline • Probability Diagnostic Testing Random variables: expectation and variance Normal distribution Some discrete distributions: Bernoulli, Binomial, Poisson Probability Biostatisticians mostly use the long-run frequency definition of probability: If we do lots of coin tosses, half of them will come up heads. The long-run frequency of heads is ½, so P(heads)=1/2. There are other definitions of probability… subjective probability definitions based on symmetry … We care most about the properties of probability. 3 4 Conditional Probabilities All probabilities are conditional – all probabilities depend on the context. P(heads)=0.5 depends on the coin being fair. If E is an event or proposition of interest, and I is information or data, then P( E | I ) is the “probability of E conditional on I” or “probability of E given I.”

Upload: others

Post on 17-Mar-2020

33 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

1

11

BIOST 514/517Biostatistics I / Applied Biostatistics I

Kathleen Kerr, Ph.D.Associate Professor of Biostatistics

University of Washington

Lecture 8:Probability, Diagnostic Testing,

Random Variables, Parametric DistributionsOctober 23, 2013

22

Lecture Outline

• Probability• Diagnostic Testing• Random variables: expectation and variance• Normal distribution• Some discrete distributions: Bernoulli, Binomial,

Poisson

Probability

• Biostatisticians mostly use the long-run frequency definition of probability:– If we do lots of coin tosses, half of them will come up

heads.– The long-run frequency of heads is ½, so

P(heads)=1/2.• There are other definitions of probability…

– subjective probability– definitions based on symmetry

• … We care most about the properties of probability.

33 44

Conditional Probabilities

All probabilities are conditional – all probabilities depend on the context.

P(heads)=0.5 depends on the coin being fair.

If E is an event or proposition of interest, and I is information or data, then P( E | I ) is the “probability of E conditional on I” or “probability of E given I.”

Page 2: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

2

55

Conditional Probability Example

Role a fair die.E=get a 4I=get an even number

P( E | role a fair die)=1/6P( E | role a fair die and I)=1/3

6

E

Sample space: set of all possible outcomes

Event: subset of outcomes of interest

1)(0 EPProbability of event E:

Rules of Probability

Combining Probability

• P(A or B) = P(A) + P(B) ‒ P(A and B)• P(A and B) = P( A | B ) P( B )• P(A and B) = P( B | A ) P( A )• P(A)= P(A and B) + P(A and not B)

77 88

Example: Events

• A father’s blood type is A and a mother’s blood type is B.

• Event: “child has blood type A”• Event: “child has an ‘O’ allele”

Page 3: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

3

9

S

A

AC

Complement eventAc

S"A or B"A

BUnion eventAB [A or B]

S"both A and B"

A

B

Intersection eventAB [A and B] [AB]

10

A B

Disjoint or Mutually Exclusive events[A B = (“empty set”)]

Example: A father’s blood type is A and a mother’s blood type is B.  A=“first child has blood type A” B=“first child has blood type B”

Example: A and Ac are always mutually exclusive events

1111

Rules of Probability

• Let A and B denote events and S denote the sample space

• Rule 1: 0 P(A) 1• Rule 2: P(S) = 1• Rule 3: Addition rule for disjoint events

P(A or B) = P(A) + P(B)• Rule 4: Complement rule

P(Ac) = 1 – P(A)

1212

Rules of Probability • General addition rule: What is P(A or B)?

If events are disjoint, then P(A B) = 0 andP(A or B) = P(A) + P(B)

)()()(#

)(#)(#)(##

)(#)(

ABPBPAPS

ABBAS

BorABorAP

S"A or B"A

B

Page 4: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

4

1313

Rules of Probability

• Conditional probability: What is P(A|B)? “the probability of A given B”

SA given B

A

B

B = updated sample space

original sample space

)()()|(

BPBAPBAP

1414

Rules of Probability

)()|()()()()|(

BPBAPBAPBPBAPBAP

Equivalently:

1515

Rules of Probability

Definition: Two events A and B are independent if P(A|B)=P(A) or, equivalently, if P(B|A) =P(B) .

Interpretation: If two events do not influence each other (that is, if knowing

one has occurred does not change the probability of the other occurring), the events are independent.

Using the multiplicative rule, if two events A and B are independent, then

P(AB) = P(A) P(B).1616

Caution

• “Independent” and “mutually exclusive” sound similar in everyday language. But in probability they mean VERY different things.

Page 5: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

5

17

A

B

S

Independence is harder to visualize

Independence means that P(A|B)=P(A)[area of A in relation to S is the same as the area of AB in relation to B]

18

Rules of Probability• Partitioning

)()|()()|(A

)()(

cc

c

BPBAPBPBAPP

BandAorBandAA

Disjoint!

A

B Bc

1919

Rules of Probability

• Bayes’ theorem/Bayes’ rule

)()|()()|()()|(

)()()|(

BnotPBnotAPBPBAPBPBAP

APBandAPABP

Definition of conditionalprobability Partitioning

2020

• “The term ‘controversial theorem’ sounds like an oxymoron, but Bayes' theorem has played this part for two-and-a-half centuries. Twice it has soared to scientific celebrity, twice it has crashed, and it is currently enjoying another boom.” -Bradley Efron, Science 2013

Bayes’ rule… Not just for Bayesians!

Page 6: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

6

2121

Diagnostic Testing

• We characterize the “accuracy” of a diagnostic test with the sensitivity and specificity

• Sensitivity of test: Positivity in diseased– P( + | D )

• Specificity of test: Negativity in healthy (non-diseased)– P( - | H )

2222

Predictive Values

• Neither sensitivity nor specificity directly answers the patient’s question: I got a positive test result. Am I diseased? Or: I got a negative test result. Am I not diseased?

• Positive predictive value: Probability of disease when test is positive– P( D | + )

• Negative predictive value: Probability of healthy when test is negative– P( H | - )

2323

Computing Predictive Values

• Apply Bayes’ Rule to relate PPV and NPV to Sensitivity and Specificity

DDHH

HHH

HHDDDDD

Pr|PrPr|PrPr|Pr|Pr

Pr|PrPr|PrPr|Pr|Pr

2424

PPV: Relationship to Prevalence

• Need to know sensitivity, specificity, AND prevalence of disease

)1()1(

Pr|PrPr|PrPr|Pr|Pr

PrevSpecPrevSensPrevSensPPV

HHDDDDD

Page 7: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

7

2525

NPV: Relationship to Prevalence

• Need to know sensitivity, specificity, AND prevalence of disease

PrevSensPrevSpecPrevSpecNPV

DDHHHHH

)1()1()1(

Pr|PrPr|PrPr|Pr|Pr

2626

Ex: Syphilis and VDRL

• Typical study: Sample by disease

• Sensitivity of test: Probability of positive in diseased– 90% of subjects with syphilis test positive– (Actually depends on duration of infection)

• Specificity of test: Probability of negative in healthy– 98% of subjects without syphilis test negative– (Actually depends on age, and presence/absence of

certain other diseases)

2727

Ex: PPV, NPV at STD Clinic

• Ex: 1000 patients at STD clinic – Prevalence of syphilis 30%– PPV: 95% with positive VDRL have syphilis (270/284)

Syphilis

Yes No | Tot

VDRL Pos 270 14 | 284

Neg 30 686 | 716

Total 300 700 | 1000

2828

Ex: PPV, NPV in Marriage License

• Ex: Screening for marriage license– Prevalence of syphilis 2%– PPV: 47% with positive VDRL have syphilis (18/38)

Syphilis

Yes No | Tot

VDRL Pos 18 20 | 38

Neg 2 960 | 962

Total 20 980 | 1000

Page 8: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

8

2929

Bottom Line

• The Positive and Negative Predictive Values are clinically important, but the Sensitivity and Specificity are more likely to be generalizable.

• The relationship between (Sensitivity, Specificity) and (PPV, NPV) depends heavily on the prevalence.

• When a disease or condition is rare, even a test with excellent sensitivity can have extremely poor PPV.

Rare Disease

• Consider testing 10,000 people, ten of whom have the disease, using a test with 95% sensitivity and 95% specificity.– 95% sensitivity means that we expect 9 -10 of the

people with the disease to test positive.– 95% specificity means that about 9500 of the 9990

people without the disease will test negative.• So roughly 500 people test positive, roughly 10 of them

have the disease, so PPV≈10/500=2%.

3030

Thresholds

The relationship between false positives and prevalence means that the same test may be used with different thresholds in different circumstances. For example, the Mantoux skin test for tuberculosis involves injecting TB antigen under the skin and measuring the resulting indurated area 2–3 days later. The recommended thresholds are• 15mm induration in healthy people with no risk factors• 10mm in people with occupational or other potential exposure• 5mm in people for whom TB would be especially dangerous.E.g., HIV infection or immunosuppression identify a population for whom a false negative is relatively more serious, so a lower PPV can be tolerated in favor of a better NPV.

A higher threshold means a test with lower sensitivity and higher specificity 3131

32

Page 9: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

9

Random Variables

• Mathematical definition: a function from the sample space to the real numbers.– “Rule” that assigns a numerical value to each point in the

sample space• Working definition: some measurement that has a

distribution across subjects– Denoted with a capital letter or a mnemonic– X, Y, A, AGE,

• We usually use lower-case letters to represent the values a random variable can attain. We talk about “events” such as A = a1, A > a2, a3 ≤ A ≤ a4

3333

Random Variables: Examples

• Sampling space is our population• Random variables examples:

– HEIGHT: Height– CHOL: Cholesterol level– BP: Blood pressure– H=1 if person is hypertensive, 0 otherwise– STAGE: Stage of cancer

3434

Expectation

• The expected value of a random variable is the average (mean) of that variable over the population. For a discrete variable there is a simple formula:

• ∑

• Example: roll two fair dice. Let X=the number of 6’s.• The expected value of X is 0·25/36+ 1·10/36+

2·1/36=12/36=1/3• The expected number of 6’s we get when we roll two fair

dice is 1/3 .

353536

Example:Experiment: Roll 2 fair dice. Sample space = {

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)(6,1) (6,2) (6,3) (6,4) (6,5) (6,6) }

Random Variable: X: number of “6” that occur when rolling two fair dice.

Takes values 0, 1 and 2 X 0 1 2P(X=x) 25/36 10/36 1/36

X=0 X=1

X=1X=2

Page 10: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

10

Expectation

• For continuous variables the concept is the same but the formula requires calculus.

• Properties of expectations:– E[X+Y]=E[X]+E[Y]– For a number c, E[cX] = c E[X]

3737

Variability

• We can also talk about the variance of a random variable. Let E[X]=μ . Then var[X]=E[X‒μ]2.

• Properties of variance:– var[cX] = c2 var[X]– Var[X+Y] = var[X]+var[Y]+2cov[X,Y]– Only when X and Y are independent:

• Var[X+Y] = var[X]+var[Y]

3838

Example

• We can use these formulas to work out the expected value of the mean for independent*, identically distributed X1, X2, … , Xn:

*do not need to assume independence 3939

Example

• We can use these formulas to work out the variance of the mean for independent, identically distributed X1, X2, … , Xn.

• Therefore, SD4040

Page 11: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

11

Some parametric distributions

• Continuous– Normal

• Discrete– Bernoulli– Binomial– Poisson

414142

Continuous Random Variables: Normal• Notation: X ~ N(, )

– Continuous quantitative random variable X is normally distributed if the density of its probability distribution is

– The normal distribution corresponds to the familiar “bell-shaped” curve

– Symmetric around – Expected Value: – Variance is 2 and standard deviation is

2

2

2exp

21)(

xxf

43 44

Page 12: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

12

45

Standard Normal Random Variable• Notation: Z ~ N(0, 1)

– A Normal random variable with mean 0 and standard deviation 1.

– Symmetric around 0– Expected Value: 1– Variance: 1, Standard Deviation: 1– (a standard Normal random variable is often

denoted with Z)

2exp

21)(

2zzf

46

Standard Normal Random Variables

Z ~ N(0,1)

47

Properties of Normal Random Variables

• Let Y= aX + b.

• If X ~ N (mean , standard deviation ), then

• Let W = aX + bY.

• If X ~ N (, ) is independent of Y ~ N (, ) then

222,where),,(~ abaNY YYYY

22222,where),,(~ babaNW WWWW

Select Discrete Random Variables

• Bernoulli• Binomial• Poisson

48

Page 13: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

13

49

Discrete Random Variable: Bernoulli• Bernoulli: takes only two possible values

– Examples: • Patient has a disease or not• Patient responds to a treatment or not• Patient dies or not

• A Bernoulli random variable X is complete characterized by a single parameter p=P(X=1)– Because P(X=0)=1-p

50

Discrete Random Variable: Binomial• number of times that a particular binary event

occurs in a sequence of n independent observations with the same probability p of occurrence– Related to Bernoulli– Example: Roll a fair die 60 times, event of interest is

getting a 4. So 60 “trials” with probability p=1/6 of “success” each time. Let X be the total number of 4’s. X has a binomial distribution. Mathematically, X can be any number between 0 and 60, but E[X]=10

– Example: n patients participate in a study and take aspirin; each patient is followed for 2 years on aspirin; the number of patients who experience a heart attack during the study period is a Binomial random variable. Number of trials n is known, interest is in p=P(heart attack)

51

0 1 2 3 4 5

0.0

0.1

0.2

0.3

0.4

Prob

abilit

y

0 1 2 3 4 5

0.05

0.15

0.25

Prob

abilit

y

0 1 2 3 4 5

0.0

0.1

0.2

0.3

0.4

obab

ty

0 2 4 6 8 10

0.00

0.10

0.20

0.30

Prob

abilit

y

0 2 4 6 8 10

0.00

0.10

0.20

Prob

abilit

y

0 2 4 6 8 10

0.00

0.10

0.20

0.30

obab

ty

0 20 40 60 80 100

0.00

0.04

0.08

Prob

abilit

y

0 20 40 60 80 100

0.00

0.02

0.04

0.06

0.08

Prob

abilit

y

0 20 40 60 80 100

0.00

0.04

0.08

obab

ty

p=0.2 p=0.5 p=0.8

n=5

n=10

n=100

BINOMIAL(n,p)

52

Discrete Random Variable: Poisson• Poisson: number of events in a specific time

period or area. – Examples:

• number of telephone calls during business hours• number of individuals with heat exhaustion at UWMC

emergency room– A Poisson random variable is characterized

by a single parameter, the rate, often denoted with the Greek letter “lambda” λ

– A Poisson random variable can take on any non-negative integer value (0, 1, 2, 3, ….). However, the most likely values are the values around λ

Page 14: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

14

53

0 10 20 30 40 50

0.0

0.1

0.2

0.3

0.4

0 10 20 30 40 50

0.0

0.1

0.2

0.3

0 10 20 30 40 50

0.00

0.05

0.10

0.15

0 10 20 30 40 50

0.00

0.02

0.04

0.06

0.08

0.10

=0.8 =1

=5 =15

POISSON()

Better description: Binomial or Poisson?

• X=“Number of HPV infected women among UW female sophomores”

• Y=“number of bicycling accidents in Seattle per year”

54

• Some distributions that are not Normal (may not even be continuous) can be well approximated with a Normal distribution.

55 56

Fractio

n

X61 110

0

.05 Poisson

Fraction

X0 36

0

.12Binomial

Normal Approximations to some discrete variables

Page 15: BIOST 514/517 Biostatistics I / Applied Biostatistics Icourses.washington.edu/b517/Lectures/L08.pdf · 2013-10-25 · BIOST 514/517 Biostatistics I / Applied Biostatistics I Kathleen

Lecture 8: Diagnostic Testing, Parametric Distributions

15

57

Normal Approximations to some discrete variables

• 1. Binomial–When both np and n(1-p) are “large” a Normal may be used to

approximate the binomial. [ Rule of thumb: np>5, n(1-p) > 5]–: X ~ Bin(n,p)

– E(X) = np– V(X) = np(1-p)

–Approximation:

• 2. Poisson–When is “large” a Normal may be used to approximate the Poisson [ Rule of thumb: > 10].–: X ~ Po()

– E(X) = – V(X) =

–Approximation:

))1(,(~ pnpnpNX

),(~ NX 58

Summary: Parametric DistributionsRandom Variable

Discrete Continuous

Bernoulli(p)E[X]=p

Var[X]=p(1-p)

Binomial(n,p)E[X] = np

Var[X] = np(1-p)

Poisson()E[X]=

Var[X]=

Normal(,)E[X]=

Var[X]= 2