biost 514/517 biostatistics i / applied biostatistics...

Lecture 8: Diagnostic Testing, Parametric Distributions

BIOST 514/517Biostatistics I / Applied Biostatistics I

Kathleen Kerr, Ph.D.Associate Professor of Biostatistics

University of Washington

Lecture 8:Probability, Diagnostic Testing,

Random Variables, Parametric DistributionsOctober 23, 2013

Lecture Outline

• Probability• Diagnostic Testing• Random variables: expectation and variance• Normal distribution• Some discrete distributions: Bernoulli, Binomial,

Poisson

Probability

• Biostatisticians mostly use the long-run frequency definition of probability:– If we do lots of coin tosses, half of them will come up

heads.– The long-run frequency of heads is ½, so

P(heads)=1/2.• There are other definitions of probability…

– subjective probability– definitions based on symmetry

• … We care most about the properties of probability.

Conditional Probabilities

All probabilities are conditional – all probabilities depend on the context.

P(heads)=0.5 depends on the coin being fair.

If E is an event or proposition of interest, and I is information or data, then P( E | I ) is the “probability of E conditional on I” or “probability of E given I.”

Conditional Probability Example

Role a fair die.E=get a 4I=get an even number

P( E | role a fair die)=1/6P( E | role a fair die and I)=1/3

Sample space: set of all possible outcomes

Event: subset of outcomes of interest

1)(0 EPProbability of event E:

Rules of Probability

Combining Probability

• P(A or B) = P(A) + P(B) ‒ P(A and B)• P(A and B) = P( A | B ) P( B )• P(A and B) = P( B | A ) P( A )• P(A)= P(A and B) + P(A and not B)

Example: Events

• A father’s blood type is A and a mother’s blood type is B.

• Event: “child has blood type A”• Event: “child has an ‘O’ allele”

Complement eventAc

S"A or B"A

BUnion eventAB [A or B]

S"both A and B"

Intersection eventAB [A and B] [AB]

Disjoint or Mutually Exclusive events[A B = (“empty set”)]

Example: A father’s blood type is A and a mother’s blood type is B. A=“first child has blood type A” B=“first child has blood type B”

Example: A and Ac are always mutually exclusive events

• Let A and B denote events and S denote the sample space

• Rule 1: 0 P(A) 1• Rule 2: P(S) = 1• Rule 3: Addition rule for disjoint events

P(A or B) = P(A) + P(B)• Rule 4: Complement rule

P(Ac) = 1 – P(A)

Rules of Probability • General addition rule: What is P(A or B)?

If events are disjoint, then P(A B) = 0 andP(A or B) = P(A) + P(B)

)()()(#

)(#)(#)(##

ABPBPAPS

BorABorAP

S"A or B"A

• Conditional probability: What is P(A|B)? “the probability of A given B”

SA given B

B = updated sample space

original sample space

)()()|(

BPBAPBAP

)()|()()()()|(

BPBAPBAPBPBAPBAP

Equivalently:

Definition: Two events A and B are independent if P(A|B)=P(A) or, equivalently, if P(B|A) =P(B) .

Interpretation: If two events do not influence each other (that is, if knowing

one has occurred does not change the probability of the other occurring), the events are independent.

Using the multiplicative rule, if two events A and B are independent, then

P(AB) = P(A) P(B).1616

Caution

• “Independent” and “mutually exclusive” sound similar in everyday language. But in probability they mean VERY different things.

Independence is harder to visualize

Independence means that P(A|B)=P(A)[area of A in relation to S is the same as the area of AB in relation to B]

Rules of Probability• Partitioning

)()|()()|(A

BPBAPBPBAPP

BandAorBandAA

Disjoint!

• Bayes’ theorem/Bayes’ rule

)()|()()|()()|(

)()()|(

BnotPBnotAPBPBAPBPBAP

APBandAPABP

Definition of conditionalprobability Partitioning

• “The term ‘controversial theorem’ sounds like an oxymoron, but Bayes' theorem has played this part for two-and-a-half centuries. Twice it has soared to scientific celebrity, twice it has crashed, and it is currently enjoying another boom.” -Bradley Efron, Science 2013

Bayes’ rule… Not just for Bayesians!

Diagnostic Testing

• We characterize the “accuracy” of a diagnostic test with the sensitivity and specificity

• Sensitivity of test: Positivity in diseased– P( + | D )

• Specificity of test: Negativity in healthy (non-diseased)– P( - | H )

Predictive Values

• Neither sensitivity nor specificity directly answers the patient’s question: I got a positive test result. Am I diseased? Or: I got a negative test result. Am I not diseased?

• Positive predictive value: Probability of disease when test is positive– P( D | + )

• Negative predictive value: Probability of healthy when test is negative– P( H | - )

Computing Predictive Values

• Apply Bayes’ Rule to relate PPV and NPV to Sensitivity and Specificity

HHDDDDD

Pr|PrPr|PrPr|Pr|Pr

PPV: Relationship to Prevalence

• Need to know sensitivity, specificity, AND prevalence of disease

)1()1(

Pr|PrPr|PrPr|Pr|Pr

PrevSpecPrevSensPrevSensPPV

HHDDDDD

NPV: Relationship to Prevalence

• Need to know sensitivity, specificity, AND prevalence of disease

PrevSensPrevSpecPrevSpecNPV

DDHHHHH

)1()1()1(

Pr|PrPr|PrPr|Pr|Pr

Ex: Syphilis and VDRL

• Typical study: Sample by disease

• Sensitivity of test: Probability of positive in diseased– 90% of subjects with syphilis test positive– (Actually depends on duration of infection)

• Specificity of test: Probability of negative in healthy– 98% of subjects without syphilis test negative– (Actually depends on age, and presence/absence of

certain other diseases)

Ex: PPV, NPV at STD Clinic

• Ex: 1000 patients at STD clinic – Prevalence of syphilis 30%– PPV: 95% with positive VDRL have syphilis (270/284)

Syphilis

Yes No | Tot

VDRL Pos 270 14 | 284

Neg 30 686 | 716

Total 300 700 | 1000

Ex: PPV, NPV in Marriage License

• Ex: Screening for marriage license– Prevalence of syphilis 2%– PPV: 47% with positive VDRL have syphilis (18/38)

Syphilis

Yes No | Tot

VDRL Pos 18 20 | 38

Neg 2 960 | 962

Total 20 980 | 1000

Bottom Line

• The Positive and Negative Predictive Values are clinically important, but the Sensitivity and Specificity are more likely to be generalizable.

• The relationship between (Sensitivity, Specificity) and (PPV, NPV) depends heavily on the prevalence.

• When a disease or condition is rare, even a test with excellent sensitivity can have extremely poor PPV.

Rare Disease

• Consider testing 10,000 people, ten of whom have the disease, using a test with 95% sensitivity and 95% specificity.– 95% sensitivity means that we expect 9 -10 of the

people with the disease to test positive.– 95% specificity means that about 9500 of the 9990

people without the disease will test negative.• So roughly 500 people test positive, roughly 10 of them

have the disease, so PPV≈10/500=2%.

Thresholds

The relationship between false positives and prevalence means that the same test may be used with different thresholds in different circumstances. For example, the Mantoux skin test for tuberculosis involves injecting TB antigen under the skin and measuring the resulting indurated area 2–3 days later. The recommended thresholds are• 15mm induration in healthy people with no risk factors• 10mm in people with occupational or other potential exposure• 5mm in people for whom TB would be especially dangerous.E.g., HIV infection or immunosuppression identify a population for whom a false negative is relatively more serious, so a lower PPV can be tolerated in favor of a better NPV.

A higher threshold means a test with lower sensitivity and higher specificity 3131

Random Variables

• Mathematical definition: a function from the sample space to the real numbers.– “Rule” that assigns a numerical value to each point in the

sample space• Working definition: some measurement that has a

distribution across subjects– Denoted with a capital letter or a mnemonic– X, Y, A, AGE,

• We usually use lower-case letters to represent the values a random variable can attain. We talk about “events” such as A = a1, A > a2, a3 ≤ A ≤ a4

Random Variables: Examples

• Sampling space is our population• Random variables examples:

– HEIGHT: Height– CHOL: Cholesterol level– BP: Blood pressure– H=1 if person is hypertensive, 0 otherwise– STAGE: Stage of cancer

Expectation

• The expected value of a random variable is the average (mean) of that variable over the population. For a discrete variable there is a simple formula:

• ∑

• Example: roll two fair dice. Let X=the number of 6’s.• The expected value of X is 0·25/36+ 1·10/36+

2·1/36=12/36=1/3• The expected number of 6’s we get when we roll two fair

dice is 1/3 .

353536

Example:Experiment: Roll 2 fair dice. Sample space = {

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)(6,1) (6,2) (6,3) (6,4) (6,5) (6,6) }

Random Variable: X: number of “6” that occur when rolling two fair dice.

Takes values 0, 1 and 2 X 0 1 2P(X=x) 25/36 10/36 1/36

X=0 X=1

X=1X=2

Expectation

• For continuous variables the concept is the same but the formula requires calculus.

• Properties of expectations:– E[X+Y]=E[X]+E[Y]– For a number c, E[cX] = c E[X]

Variability

• We can also talk about the variance of a random variable. Let E[X]=μ . Then var[X]=E[X‒μ]2.

• Properties of variance:– var[cX] = c2 var[X]– Var[X+Y] = var[X]+var[Y]+2cov[X,Y]– Only when X and Y are independent:

• Var[X+Y] = var[X]+var[Y]

Example

• We can use these formulas to work out the expected value of the mean for independent*, identically distributed X1, X2, … , Xn:

*do not need to assume independence 3939

Example

• We can use these formulas to work out the variance of the mean for independent, identically distributed X1, X2, … , Xn.

• Therefore, SD4040

Some parametric distributions

• Continuous– Normal

• Discrete– Bernoulli– Binomial– Poisson

414142

Continuous Random Variables: Normal• Notation: X ~ N(, )

– Continuous quantitative random variable X is normally distributed if the density of its probability distribution is

– The normal distribution corresponds to the familiar “bell-shaped” curve

– Symmetric around – Expected Value: – Variance is 2 and standard deviation is

Standard Normal Random Variable• Notation: Z ~ N(0, 1)

– A Normal random variable with mean 0 and standard deviation 1.

– Symmetric around 0– Expected Value: 1– Variance: 1, Standard Deviation: 1– (a standard Normal random variable is often

denoted with Z)

Standard Normal Random Variables

Z ~ N(0,1)

Properties of Normal Random Variables

• Let Y= aX + b.

• If X ~ N (mean , standard deviation ), then

• Let W = aX + bY.

• If X ~ N (, ) is independent of Y ~ N (, ) then

222,where),,(~ abaNY YYYY

22222,where),,(~ babaNW WWWW

Select Discrete Random Variables

• Bernoulli• Binomial• Poisson

Discrete Random Variable: Bernoulli• Bernoulli: takes only two possible values

– Examples: • Patient has a disease or not• Patient responds to a treatment or not• Patient dies or not

• A Bernoulli random variable X is complete characterized by a single parameter p=P(X=1)– Because P(X=0)=1-p

Discrete Random Variable: Binomial• number of times that a particular binary event

occurs in a sequence of n independent observations with the same probability p of occurrence– Related to Bernoulli– Example: Roll a fair die 60 times, event of interest is

getting a 4. So 60 “trials” with probability p=1/6 of “success” each time. Let X be the total number of 4’s. X has a binomial distribution. Mathematically, X can be any number between 0 and 60, but E[X]=10

– Example: n patients participate in a study and take aspirin; each patient is followed for 2 years on aspirin; the number of patients who experience a heart attack during the study period is a Binomial random variable. Number of trials n is known, interest is in p=P(heart attack)

0 1 2 3 4 5

abilit

0 1 2 3 4 5

abilit

0 1 2 3 4 5

0 2 4 6 8 10

abilit

0 2 4 6 8 10

abilit

0 2 4 6 8 10

0 20 40 60 80 100

abilit

0 20 40 60 80 100

abilit

0 20 40 60 80 100

p=0.2 p=0.5 p=0.8

BINOMIAL(n,p)

Discrete Random Variable: Poisson• Poisson: number of events in a specific time

period or area. – Examples:

• number of telephone calls during business hours• number of individuals with heat exhaustion at UWMC

emergency room– A Poisson random variable is characterized

by a single parameter, the rate, often denoted with the Greek letter “lambda” λ

– A Poisson random variable can take on any non-negative integer value (0, 1, 2, 3, ….). However, the most likely values are the values around λ

0 10 20 30 40 50

=0.8 =1

=5 =15

POISSON()

Better description: Binomial or Poisson?

• X=“Number of HPV infected women among UW female sophomores”

• Y=“number of bicycling accidents in Seattle per year”

• Some distributions that are not Normal (may not even be continuous) can be well approximated with a Normal distribution.

Fractio

X61 110

.05 Poisson

Fraction

.12Binomial

Normal Approximations to some discrete variables

• 1. Binomial–When both np and n(1-p) are “large” a Normal may be used to

approximate the binomial. [ Rule of thumb: np>5, n(1-p) > 5]–: X ~ Bin(n,p)

– E(X) = np– V(X) = np(1-p)

–Approximation:

• 2. Poisson–When is “large” a Normal may be used to approximate the Poisson [ Rule of thumb: > 10].–: X ~ Po()

– E(X) = – V(X) =

–Approximation:

))1(,(~ pnpnpNX

),(~ NX 58

Summary: Parametric DistributionsRandom Variable

Discrete Continuous

Bernoulli(p)E[X]=p

Var[X]=p(1-p)

Binomial(n,p)E[X] = np

Var[X] = np(1-p)

Poisson()E[X]=

Var[X]=

Normal(,)E[X]=

Var[X]= 2

biost 514/517 biostatistics i / applied biostatistics...

Documents

stata review: part ii biost/epi 536 discussion section...

lecture 7 linear regression...

biost 572 final talk - university of...

biology 458 biostatistics prototypes - binghamton...

ΓΕΝΙΚΑ ΜΑΘΗΜΑΤΙΚΑ Περιγραφική...

biost 536 / epi 536 categorical data analysis in...

lecture 16 regression with time-to-event outcomeslecture 16...

gtz-vorhaben zur praktischen umsetzung der biost-nachv...

biology 458 biostatistics prototypes · biology 458...

generalit biost 2015 inter (1)

lecture 12 logistic regression - university of...

profesor biost rom

detecting dna-protein interactions xinghua lu dept...

biost 518 / biost 515 applied biostatistics ii ... · biost...

biost 536 lecture 12 1 lecture 12 – introduction to...

homework #2€¦ · web viewbiost 518: applied...

lecture 10 polynomial regression - university of...

question biost 536 / epi 536 categorical data analysis in...

teste biost 2015

analysis of next generation sequence data biost 2055...