Probability & Statistics-Based Models
August 1, 2007 MathFest 2007 San Jose, CA
Raina Robeva – Sweet Briar College
Introduction
Quantitative Traits (Limit Theorems)
Luria – Delbruck Experiments
Evaluating risks from time series data
Elementary Probability
Random Variables
$X = X(\omega),\ \omega \in \Omega$
$(\Omega, \mathcal{F}, P)$ - Probability Space
Histograms
Elementary Probability
Set of all outcomes - $\Omega$
Examples:
1) Flipping a coin: $\Omega = \{H, T\}$
2) Rolling a die: $\Omega = \{1, 2, 3, 4, 5, 6\}$
3) Rolling two dice: $\Omega = \{(i, j) : i, j = 1, \dots, 6\}$
Elementary Probability
Elementary Events – the elements of $\Omega$
Events – the subsets of $\Omega$: $A, B, C, \dots$
Definition of Probability:
$P(A) = \frac{|A|}{|\Omega|}$, where $|A|$ = number of elements of $A$
How do we find probabilities?
We Count!
Chromosomes and Genes
Genes are found on chromosomes and code for a specific trait
The possible alternative forms of the genes are called alleles.
Chromosomes are large DNA molecules found in the cell’s nucleus
Each gene has a specified place on the chromosome called a locus.
The human Chromosome 11 contains 28 genes. The first 5 genes from the tip of the short arm form a cluster of genes that encode components of hemoglobin.
Problem
One gene, two types of alleles: a (recessive) and A (dominant)
$\Omega = \{aa, aA, Aa, AA\}$ - All possible sequences of length 2 comprised of a and A
k = number of dominant alleles (0, 1, or 2)
If E = “exactly k dominant alleles”, find P(E).
$P(E) = \frac{|E|}{|\Omega|}$, where
$|E| = 1$ when $k = 0$
$|E| = 2$ when $k = 1$
$|E| = 1$ when $k = 2$
Problem (cont.)
$\Omega = \{aa, aA, Aa, AA\}$
$P(E) = \frac{|E|}{|\Omega|}$, which gives
$P(E) = 1/4$ when $k = 0$
$P(E) = 1/2$ when $k = 1$
$P(E) = 1/4$ when $k = 2$
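The counting argument for one gene can be checked by brute force. The sketch below lists every sequence of a and A of a given length and counts the dominant alleles; the function name `dominant_allele_distribution` is made up for this illustration and is not part of the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dominant_allele_distribution(positions=2):
    """Return {k: P(exactly k dominant alleles)} by direct counting."""
    # Omega: every sequence of the given length built from 'a' and 'A'.
    omega = ["".join(seq) for seq in product("aA", repeat=positions)]
    # |E| for each k: how many outcomes contain exactly k copies of 'A'.
    counts = Counter(outcome.count("A") for outcome in omega)
    # P(E) = |E| / |Omega|, kept as exact fractions.
    return {k: Fraction(counts[k], len(omega)) for k in sorted(counts)}


if __name__ == "__main__":
    for k, prob in dominant_allele_distribution().items():
        print(f"k = {k}: P(E) = {prob}")
```

Calling the function with `positions=8` repeats the same count for four genes, at the cost of listing all 256 outcomes.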
Gregor Mendel – experiments with peas
Round - dominant; Wrinkled - recessive
Parental Generation ($P$)
First Filial Generation ($F_1$): only round peas in $F_1$
Second Filial Generation ($F_2$): 3:1 ratio of round vs. wrinkled in $F_2$
Phenotypic Ratios
1:3 (1:2:1)
$P(\text{round}) = 1 - \frac{1}{4} = 75\%$
$P(\text{wrinkled}) = \frac{1}{4} = 25\%$
Quantitative Traits (1909)
Herman Nilsson – Ehle
Parental Generation ($P$)
First Filial Generation ($F_1$): All of intermediate color
Second Filial Generation ($F_2$): Two new shades appear
Phenotypic Ratios
1 : 4 : 6 : 4 : 1
1 : 6 : 15 : 20 : 15 : 6 : 1
…
Quantitative Traits – Examples
Polygenic Hypothesis
n genes, two types of alleles: a and A
N = 2n – total positions
k = number of dominant alleles (0, 1, 2, …, N)
If E = “exactly k dominant alleles”, find P(E) = ?
Polygenic Hypothesis – set of outcomes
[Figure: four genes, labeled 1 to 4]
$\Omega$ - All possible sequences of length 8 comprised of a and A; $|\Omega| = 2^8$
In general, $|\Omega| = 2^N = 2^{2n}$
Polygenic Hypothesis
Alleles a and A are equally likely
N = 2n – total positions
k = number of dominant alleles (0, 1, 2, …, N)
If E = “exactly k dominant alleles”, find P(E).
$|E| = \binom{N}{k} = \frac{N!}{k!\,(N-k)!}$
$|\Omega| = 2^N$
$P(E) = \frac{|E|}{|\Omega|} = \frac{\binom{N}{k}}{2^N}$
Example: Nilsson-Ehle (1909)
Nilsson – Ehle: Two genes (n = 2), N = 2n = 4 number of alleles
X – number of a alleles in the N loci
$P(X = k) = \frac{\binom{N}{k}}{2^N} = \frac{\binom{4}{k}}{16}$
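The phenotypic ratios quoted for Nilsson-Ehle's experiments are the binomial coefficients in the formula above. A minimal sketch, assuming equally likely alleles as the slides do; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def phenotypic_ratio(n_genes):
    """Counts |E| = C(N, k) for k = 0..N, where N = 2 * n_genes positions."""
    total_positions = 2 * n_genes
    return [comb(total_positions, k) for k in range(total_positions + 1)]


def allele_count_probabilities(n_genes):
    """P(X = k) = C(N, k) / 2**N, assuming both alleles are equally likely."""
    total_positions = 2 * n_genes
    return [Fraction(c, 2 ** total_positions) for c in phenotypic_ratio(n_genes)]


if __name__ == "__main__":
    print(phenotypic_ratio(2))            # two genes
    print(phenotypic_ratio(3))            # three genes
    print(allele_count_probabilities(2))  # P(X = k) for N = 4
```

With `n_genes=2` the counts reproduce the 1 : 4 : 6 : 4 : 1 ratio, and with `n_genes=3` the 1 : 6 : 15 : 20 : 15 : 6 : 1 ratio.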
Random Variables
$X = X(\omega)$
Discrete – X takes integer values
X is “known” when we know P(X=k) for all possible k
Continuous – X can be any value from an interval
X is “known” when we know:
the distribution function F(x) = P(X < x);
the probability density function f(x) = d/dx [F(x)], so that
$F(x) = \int_{-\infty}^{x} f(t)\,dt$
Common Discrete Random Variables
Bernoulli (p): X takes values k = 0, 1
P(X=1) = p; P(X=0) = 1-p
Binomial Bin(N, p): X takes values k = 0, 1, 2, …, N
$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$
Poisson Po($\lambda$): X takes values k = 0, 1, 2, 3, …
$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
Parameters: p for Bernoulli (p); N and p for Bin(N, p); $\lambda$ for Po($\lambda$)
[Figure: Binomial distributions for N = 20, p = 0.5; N = 20, p = 0.2; N = 20, p = 0.7]
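Common Continuous Random Variables
Exponential: X takes values in $(0, \infty)$
$F(x) = 1 - e^{-\lambda x}$
$f(x) = \lambda e^{-\lambda x}$
Gaussian (Normal): X takes values in $(-\infty, \infty)$
$N(\mu, \sigma)$: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
$N(0, 1)$: $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$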
Bell-Shaped Distribution of Quantitative Traits
Traits are controlled not by one but by several different genes. The genes are independent and contribute cumulatively to the expression of the characteristic (Polygenic Hypothesis).
The distribution of the trait is Binomial (2n, p), where n is the number of genes and p is the frequency of the non-contributing allele in the population.
The distribution is approximately Gaussian.
Further “smoothing” by environmental factors.
[Figure: Binomial distributions for N = 8, p = 0.2; N = 20, p = 0.5; N = 50, p = 0.7]
When Np is large and N(1-p) is large, then
Binomial (N, p) ~ Normal $(Np, \sqrt{Np(1-p)})$, where the second parameter is the standard deviation.
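To see how close the approximation is, one can compare the Binomial probabilities with the Normal density at each integer k. This is a rough numerical check, not part of the original slides; the helper names are hypothetical, and the Normal is parameterized by its mean Np and standard deviation sqrt(Np(1-p)).

```python
from math import comb, exp, pi, sqrt


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def normal_pdf(x, mean, sd):
    """Density of a Normal distribution with the given mean and standard deviation."""
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))


def max_gap(n, p):
    """Largest |Binomial pmf - Normal density| over k = 0..n."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return max(abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd))
               for k in range(n + 1))


if __name__ == "__main__":
    # The three parameter pairs shown in the slide's figure.
    for n, p in [(8, 0.2), (20, 0.5), (50, 0.7)]:
        print(f"N = {n}, p = {p}: max gap = {max_gap(n, p):.4f}")
```

The gap is noticeably larger when Np is small, as with N = 8 and p = 0.2, which is the situation the condition on Np and N(1-p) rules out.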
Why the “bell-shaped” distribution of quantitative traits?
Moivre (1667 - 1754)
Laplace (1749 - 1827)
Central Limit Theorem
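Aggregate Characteristics
Mean Value: $E(X) = \sum_k k\,P(X = k)$ (discrete); $E(X) = \int x f(x)\,dx$ (continuous)
Variance: $Var(X) = E[X - E(X)]^2 = E(X^2) - [E(X)]^2$; the standard deviation is $\sqrt{Var(X)}$
Moments of order m: $E(X^m) = \sum_k k^m P(X = k)$ (discrete); $E(X^m) = \int x^m f(x)\,dx$ (continuous)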
Examples
Binomial (N, p): $E(X) = Np$, $Var(X) = Npq$, where $q = 1 - p$
Poisson ($\lambda$): $E(X) = \lambda$, $Var(X) = \lambda$
Gaussian ($\mu$, $\sigma$): $E(X) = \mu$, $Var(X) = \sigma^2$
Poisson Distribution Arises When…
Events of low intensity occurring in time
X(t) – the number of events that have occurred in [0, t]
X(t) has a Poisson distribution with parameter $\lambda t$:
$P(X(t) = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$
Average number of events per unit time = $\lambda$
Poisson Distribution Arises When…
Events of low intensity occurring independently of one another
X – the number of events that have occurred in a unit surface/volume over time t
X has a Poisson distribution with parameter $\lambda t$:
$P(X = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$
Average number of events per unit surface/volume per unit time = $\lambda$
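The Law of Large Numbers (1713)
If X is a random variable with $E(X) = \mu$, and $X_1, X_2, \dots, X_n$ are independent observations of X, then
$\frac{X_1 + X_2 + \dots + X_n}{n} \to \mu$ as $n \to \infty$,
or, equivalently,
$\bar{X}_n \to \mu$ as $n \to \infty$.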
Example – Ordinary Coin Toss Game
1. Toss a coin
2. If Heads, win $1
3. If Tails, win nothing
4. Let X_i be your win for game i
5. Average payback to you: E(X_i) = (1/2)(0) + (1/2)(1) = $0.50
6. By the Law of Large Numbers, (X_1 + X_2 + … + X_n)/n → 0.5 as n → ∞
Simulation Example
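A simulation of this game takes only a few lines. The sketch below uses Python's standard pseudo-random generator with a fixed seed; the function name `average_win` and the seed are choices made for this illustration, not taken from the slides.

```python
import random


def average_win(n_games, seed=0):
    """Average payback over n_games coin tosses: $1 for Heads, nothing for Tails."""
    rng = random.Random(seed)
    # Each game pays 1 with probability 1/2 and 0 otherwise.
    total = sum(1 if rng.random() < 0.5 else 0 for _ in range(n_games))
    return total / n_games


if __name__ == "__main__":
    for n in (10, 100, 10_000, 1_000_000):
        print(f"n = {n:>9}: average win = {average_win(n):.4f}")
```

As n grows, the printed averages settle near 0.5, which is what the Law of Large Numbers predicts for this game.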
Example – St. Petersburg Game
1. Toss a coin
2. If Heads, win $2
3. If Tails, keep tossing until it falls Heads
4. If first Heads on N-th toss, win $2^N
H $2, TH $4, TTH $8, TTTH $16, etc.
5. With probability 1/2^N we win $2^N
6. Average payback to you: (1/2)(2) + (1/2)^2 (2^2) + (1/2)^3 (2^3) + … = 1 + 1 + 1 + …, which grows without bound
St. Petersburg Game – a sample run
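A sample run of the St. Petersburg game can be generated as follows. This is a sketch with hypothetical function names. It also computes the expected payoff when the game is cut off after a fixed number of tosses, to show that each additional toss adds exactly 1 to the expectation.

```python
import random


def play_once(rng):
    """Toss until the first Heads; pay 2**N if it arrives on toss N."""
    tosses = 1
    while rng.random() < 0.5:  # treat this branch as Tails: keep tossing
        tosses += 1
    return 2 ** tosses


def truncated_expected_payoff(max_tosses):
    """Sum of (1/2**N) * 2**N for N = 1..max_tosses; every term equals 1."""
    return sum((1 / 2 ** n) * 2 ** n for n in range(1, max_tosses + 1))


def running_averages(n_games, seed=0):
    """Average payoff after each game of a simulated run."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for game in range(1, n_games + 1):
        total += play_once(rng)
        averages.append(total / game)
    return averages


if __name__ == "__main__":
    print(truncated_expected_payoff(10))
    run = running_averages(100_000)
    for n in (100, 1_000, 10_000, 100_000):
        print(f"after {n:>7} games: average payoff = {run[n - 1]:.2f}")
```

Unlike the ordinary coin toss game, the running average here does not settle down, because occasional very large payoffs keep pushing it up. The Law of Large Numbers does not apply because E(X) is not finite.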
Random Processes (Temporal Stochastic Models)
Random Process: X(t) – Random variable that changes in time
When t = 0, 1, 2, … – Discrete Random Process
When t changes continuously – Continuous Random Process
In addition, since for any value of t, X(t) can be a discrete or a continuous random variable, there are four possibilities for the process {X(t), t}.
{X(t), t} is defined through its probability distribution $p_{ix}(t) = P(X(t) = x \mid X(0) = i)$.
For example, if X(t) can take values x = 0, 1, 2, …, then $p_i(t) = [p_{i0}(t), p_{i1}(t), p_{i2}(t), \dots]$ is the probability distribution of X.
Single Population Immigration-Death Process
Deterministic Model
X(t) = population size at time t
I = rate of immigration
a = per capita death rate
$\frac{dX}{dt} = I - aX$
Stochastic Model (Kolmogorov – Chapman DE)
$X(t + \Delta t) = x$ can happen when:
X(t) = x and no change over $\Delta t$. (Event A)
X(t) = x + 1 and one death over $\Delta t$. (Event B)
X(t) = x - 1 and one immigration over $\Delta t$. (Event C)
More than a unit change over $\Delta t$, which has probability $o(\Delta t)$. (Event D)
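Kolmogorov – Chapman Equations
$p_n(t) = \Pr(X(t) = n)$
$p_n(t + \Delta t) = a(n+1)\Delta t\,p_{n+1}(t) + I\Delta t\,p_{n-1}(t) + (1 - (an + I)\Delta t)\,p_n(t) + o(\Delta t)$
The four terms are P(B), P(C), P(A), and P(D), in that order.
Subtract $p_n(t)$, divide by $\Delta t$, and let $\Delta t \to 0$:
$\frac{d}{dt} p_n(t) = a(n+1)\,p_{n+1}(t) + I\,p_{n-1}(t) - (I + an)\,p_n(t), \quad n > 0$
$\frac{d}{dt} p_0(t) = -I\,p_0(t) + a\,p_1(t), \quad n = 0$
Demo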
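The demo itself is not in the transcript. One common way to simulate such a process is an event-by-event (Gillespie-type) simulation: wait an exponentially distributed time with rate I + aX, then choose an immigration or a death in proportion to their rates. The sketch below takes that approach. The function name, the parameter values, and the use of this particular algorithm are assumptions made for illustration.

```python
import random


def simulate_immigration_death(immigration, death_rate, t_end, x0=0, rng=None):
    """Return X(t_end) for one realization of the immigration-death process."""
    rng = rng or random.Random()
    t, x = 0.0, x0
    while True:
        total_rate = immigration + death_rate * x
        if total_rate <= 0:
            return x  # no event can occur any more
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total_rate)
        if t > t_end:
            return x
        # Pick the event type in proportion to its rate.
        if rng.random() * total_rate < immigration:
            x += 1  # one immigration
        else:
            x -= 1  # one death


if __name__ == "__main__":
    rng = random.Random(2007)
    # Hypothetical parameter values: I = 10, a = 0.5, X(0) = 0, t = 10.
    finals = [simulate_immigration_death(10.0, 0.5, 10.0, rng=rng)
              for _ in range(1000)]
    print("sample mean of X(10):", sum(finals) / len(finals))
    print("I / a               :", 10.0 / 0.5)
```

Individual runs fluctuate, but the sample mean over many runs stays close to the deterministic level I/a once t is large compared with 1/a.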
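How are the Stochastic and Deterministic Models Related?
Define $\bar{X} = E(X) = \sum_n n\,p_n(t)$
Multiply the K-C equation by n and sum over n
$\frac{d}{dt} p_n(t) = a(n+1)\,p_{n+1}(t) + I\,p_{n-1}(t) - (I + an)\,p_n(t), \quad n > 0$
$\frac{d}{dt}\bar{X} = a\sum_n n(n+1)\,p_{n+1}(t) + I\sum_n n\,p_{n-1}(t) - \sum_n n(I + an)\,p_n(t)$
$= a\sum_n \left[(n+1)^2 p_{n+1} - (n+1)\,p_{n+1}\right] + I\sum_n \left[(n-1)\,p_{n-1} + p_{n-1}\right] - I\sum_n n\,p_n - a\sum_n n^2 p_n$
$= I - a\sum_n (n+1)\,p_{n+1}(t) = I - a\bar{X}$
The mean value of the stochastic process X satisfies the deterministic equation
$\frac{d}{dt}\bar{X} = I - a\bar{X}$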
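This relationship can be checked numerically by integrating the equations for p_n(t) directly and comparing the resulting mean with the solution of the deterministic equation. The sketch below uses a plain Euler step and cuts the state space off at a hypothetical n_max. Both are simplifications chosen for brevity, and all names and parameter values are illustrative.

```python
from math import exp


def integrate_master_equation(immigration, death_rate, t_end, n_max=100, dt=0.002):
    """Euler integration of dp_n/dt, starting from X(0) = 0, truncated at n_max."""
    p = [0.0] * (n_max + 1)
    p[0] = 1.0
    for _ in range(int(round(t_end / dt))):
        new_p = p[:]
        for n in range(n_max + 1):
            # Probability flowing out of state n during one step.
            up = immigration * p[n] if n < n_max else 0.0  # truncation: no exit upward
            down = death_rate * n * p[n]
            new_p[n] -= dt * (up + down)
            if n < n_max:
                new_p[n + 1] += dt * up    # one immigration: n -> n + 1
            if n > 0:
                new_p[n - 1] += dt * down  # one death: n -> n - 1
        p = new_p
    return p


def mean_of(p):
    """Mean value sum_n n * p_n."""
    return sum(n * pn for n, pn in enumerate(p))


def deterministic_mean(immigration, death_rate, t, x0=0.0):
    """Solution of dX/dt = I - aX with X(0) = x0."""
    equilibrium = immigration / death_rate
    return equilibrium + (x0 - equilibrium) * exp(-death_rate * t)


if __name__ == "__main__":
    p = integrate_master_equation(10.0, 0.5, 10.0)
    print("mean from p_n(t):      ", mean_of(p))
    print("deterministic solution:", deterministic_mean(10.0, 0.5, 10.0))
```

The two printed values agree up to the discretization error of the Euler step, provided n_max is large enough that almost no probability reaches the cut-off.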
Luria-Delbruck Experiments
Darwinian Model - mutations are equally likely to occur at any moment in time.
Lamarckian Model - mutations evolve only in response to an environmental cue.
When do mutations occur?
Luria-Delbruck Experiments (1943)
Large number of bacterial cultures, starting each one from a small number of cells.
Plate the cultures on nutrient agar plates on which a large amount of a virus has been plated first. Incubate.
Luria SE & Delbruck M. Mutations of Bacteria from Virus Sensitivity to Virus Resistance. Genetics 28:491(1943).
Control
Count the Number of Colonies
Two opposing hypotheses
Hypothesis 1 (Acquired Immunity, Directed Mutation): A small number of bacteria mutate to acquire resistance only after they are exposed to the virus. Survival confers immunity not only to the individual but also to its offspring, and the colonies grow.
Hypothesis 2 (Mutation + Selection): Mutations occur randomly, but the probability that a bacterium mutates from sensitive to resistant is small. This mutation is completely independent of the presence of the virus. When the bacteria are added to the plates, the mutants are already resistant to the virus. Only these mutants proliferate into colonies on the plate.
What is the Distribution of the Mutant Cells at the time of plating?
Under the Directed Mutation Hypothesis: Poisson
$E(X) = Var(X)$, so $Var(X)/E(X) = 1$
Under the Mutation + Selection Hypothesis: Non-Poisson (Luria-Delbruck Distribution)
$E(X) \neq Var(X)$; $Var(X)$ is very large
Large variation in the number of mutants
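The contrast between the two hypotheses shows up clearly in a small simulation of many parallel cultures. The sketch below is a simplified model, not the original demo: every culture starts from one sensitive cell and doubles for a fixed number of generations. Under mutation + selection, each daughter cell independently becomes a mutant with probability p at every division. Under directed mutation, each final cell becomes resistant with a fixed probability on exposure. All names and parameter values are hypothetical.

```python
import random
from statistics import mean, variance


def mutants_mutation_selection(generations, p, rng):
    """Mutants at plating when mutations arise at random during growth."""
    sensitive, mutants = 1, 0
    for _ in range(generations):
        daughters = 2 * sensitive
        # Each daughter of a sensitive cell mutates independently with probability p.
        new_mutants = sum(1 for _ in range(daughters) if rng.random() < p)
        mutants = 2 * mutants + new_mutants  # existing mutants also double
        sensitive = daughters - new_mutants
    return mutants


def mutants_directed(generations, p_exposure, rng):
    """Resistant cells when resistance is acquired only on exposure to the virus."""
    cells = 2 ** generations
    return sum(1 for _ in range(cells) if rng.random() < p_exposure)


def variance_to_mean(samples):
    """Sample variance divided by sample mean (close to 1 for Poisson-like counts)."""
    return variance(samples) / mean(samples)


if __name__ == "__main__":
    rng = random.Random(1943)
    growth = [mutants_mutation_selection(10, 0.005, rng) for _ in range(500)]
    exposure = [mutants_directed(10, 0.05, rng) for _ in range(500)]
    print("mutation + selection: Var/E =", round(variance_to_mean(growth), 2))
    print("directed mutation:    Var/E =", round(variance_to_mean(exposure), 2))
```

Cultures in which a mutation happens early carry a large clone of mutants at plating. These occasional "jackpot" cultures are what inflate the variance under the mutation + selection hypothesis.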
What is the average number of resistant cells under continuous mutation?
Assume that mutation can only occur at the time of division.
Assume that each cell can mutate with a constant probability p.
Generation (i) | Average number of mutant cells in generation i | Expected number of mutants at the end from this generation
0 | $p$ | $p\,2^N$
1 | $2p$ | $2p \cdot 2^{N-1} = p\,2^N$
2 | $2^2 p$ | $2^2 p \cdot 2^{N-2} = p\,2^N$
3 | $2^3 p$ | $2^3 p \cdot 2^{N-3} = p\,2^N$
4 | $2^4 p$ | $2^4 p \cdot 2^{N-4} = p\,2^N$
5 | $2^5 p$ | $2^5 p \cdot 2^{N-5} = p\,2^N$
6 | $2^6 p$ | $2^6 p \cdot 2^{N-6} = p\,2^N$
$E(X) = p\,2^N + p\,2^N + \dots + p\,2^N$ (one term $p\,2^N$ for each generation)
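The sum of equal contributions can be compared with an exact calculation of the expected values. The sketch below tracks the expected numbers of sensitive and mutant cells generation by generation. It assumes that each daughter cell mutates independently with probability p at each of N divisions, starting from a single sensitive cell. Under that indexing there are N contributions, so the approximation reads N p 2^N. A count that also includes generation 0 would have one more term. The function names are hypothetical.

```python
def expected_mutants(generations, p):
    """Exact expected number of mutants after the given number of divisions."""
    sensitive, mutants = 1.0, 0.0
    for _ in range(generations):
        new_mutants = 2 * sensitive * p        # daughters of sensitive cells that mutate
        mutants = 2 * mutants + new_mutants    # existing mutants double
        sensitive = 2 * sensitive - new_mutants
    return mutants


def approximate_expected_mutants(generations, p):
    """Approximation: each of the N divisions contributes about p * 2**N mutants."""
    return generations * p * 2 ** generations


if __name__ == "__main__":
    for p in (1e-2, 1e-4, 1e-7):
        exact = expected_mutants(20, p)
        approx = approximate_expected_mutants(20, p)
        print(f"p = {p:g}: exact = {exact:.6g}, approximation = {approx:.6g}")
```

The exact value under these assumptions is 2^N (1 - (1 - p)^N), which approaches N p 2^N as p becomes small.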
Mutation.xls
AcqIm.xls
Biological ESTEEM
Lea and Coulson (1949)
Lea, D.E. and Coulson, C.A. (1949) The distribution of the number of mutants in bacterial populations. J. Genetics 49, 264-285
Theorem. Let $X_t$ denote the number of mutant cells in the culture at time t. If p is the probability for a single cell to mutate and $m = p\,2^n$, then the probability generating function of the distribution, defined by
$\varphi(x, m) = \sum_k P(X_t = k)\,x^k$,
has the form
$\varphi(x, m) = (1 - x)^{m(1-x)/x}$
More recent work on the Luria-Delbruck distribution
Evaluating risk from time series data
Glucose Variability and Risk Assessment in Diabetes
Heart Rate Variability and the Risk for Neonatal Sepsis
Sixteen Million people in the United States have Diabetes Mellitus.
In both human and economic terms, diabetes is one of the nation's most costly diseases. Diabetes is the leading cause of kidney failure, blindness in adults, and amputations. It is a major risk factor for heart disease, stroke, and birth defects. Diabetes shortens average life expectancy by up to 15 years, and costs our nation in excess of $100 billion annually in health-related expenditures, more than any other single chronic disease. Diabetes spares no group, affecting young and old, all races and ethnic groups, the rich and the poor.
Blood Glucose Fluctuation Characteristics Quantified from Self-Monitoring Data
Definitions
• Type 1 Diabetes, also referred to as Insulin Dependent Diabetes Mellitus (IDDM), is the type of diabetes in which the pancreas produces no insulin or extremely small amounts;
• Type 2 Diabetes is the type of diabetes in which the body doesn’t use its insulin effectively or doesn’t produce enough insulin;
• Insulin is a hormone secreted by the pancreas that regulates the metabolism of glucose;
• Blood Glucose (BG) is the concentration of glucose in the bloodstream;
• The BG levels are measured in mg/dl (USA) and in mmol/L (most elsewhere);
• The two scales are directly related by: 18 mg/dl = 1 mM (1 mmol/L).
[Figure: blood glucose regulation diagram. Target Blood Glucose Range: 70-180 mg/dl (DCCT, 1993). Labels: Hyperglycemia, Hypoglycemia, Severe Hypoglycemia, Food, Insulin, Counter-regulation]
Severe Hypoglycemia
• Severe hypoglycemia (SH) is defined as a low BG resulting in stupor, seizure, or unconsciousness that precludes self-treatment (The Diabetes Control and Complications Trial Research Group, 1997). Four percent of the deaths among individuals with IDDM are attributed to SH (DCCT Study Group, 1991).
• Although most severe hypoglycemic episodes are not fatal, there remain numerous negative sequelae leading to compromised occupational and scholastic functioning, social embarrassment, poor judgment, serious accidents, and possible permanent cognitive dysfunction (Gold AE et al., 1993; Deary et al., 1993; Lincoln et al., 1996).
• Fear of severe hypoglycemia is identified as the major barrier to improved metabolic control (Cryer et al., 1994).
[Figure: BG Fluctuations: T1DM. Vertical axis 0 to 600, horizontal axis 0 to 30]
[Figure: BG Fluctuations: T2DM. Vertical axis 0 to 600, horizontal axis 0 to 30]
Average Glycemia and Glucose Variability
[Figure: two panels, Person A: HbA1c=8.0% and Person B: HbA1c=8.0%. Vertical axis: Blood Glucose (mg/dl), 0 to 400. Horizontal axis: Time (days), 0 to 30]
Blood Glucose (BG) Monitoring Systems
Self-Monitoring BG Devices (typically 3-10 measurements/24 hours)
Continuous BG Monitoring Systems (up to 288 measurements/24 hours)
The Distribution of the BG Levels: (Mean=6.7, SD=3.6, Normality hypothesis is rejected, P<0.05)
[Figure: histogram of Frequency (0 to 30) against BG (mM), 0 to 20. Regions: Hypoglycemia, Target Range, Hyperglycemia. Marked: Clinical Center, Numerical Center, Standard Data Range, and Data Range if Symmetrization is used]
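Symmetrization of the BG Scale:
Assumptions:
A1: The transformed whole BG range should be symmetric around 0.
A2: The transformed target BG range should be symmetric around 0.
Transformation: $f(BG, \alpha, \beta) = [(\ln(BG))^{\alpha} - \beta]$, $\alpha, \beta > 0$
That satisfies the conditions:
A1: $f(33.3, \alpha, \beta) = -f(1.1, \alpha, \beta)$ and A2: $f(10, \alpha, \beta) = -f(3.9, \alpha, \beta)$.
Which leads to the equations:
$(\ln(33.3))^{\alpha} - \beta = -[(\ln(1.1))^{\alpha} - \beta]$
$(\ln(10.0))^{\alpha} - \beta = -[(\ln(3.9))^{\alpha} - \beta]$
$\gamma \cdot [(\ln(33.3))^{\alpha} - \beta] = -\gamma \cdot [(\ln(1.1))^{\alpha} - \beta] = \sqrt{10}$ (scaling)
When solved numerically:
$\alpha = 1.033$, $\beta = 1.871$ and $\gamma = 1.774$ (when BG is in mM)
$\alpha = 1.084$, $\beta = 5.381$ and $\gamma = 1.509$ (when BG is in mg/dl)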
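The numerical solution can be reproduced with a short root-finding routine. Adding the two symmetry conditions eliminates β and leaves one equation in α: (ln 33.3)^α + (ln 1.1)^α = (ln 10)^α + (ln 3.9)^α. The sketch below solves it by bisection. The search bracket [0.5, 2] is an assumption of this sketch and excludes the trivial root α = 0. The function name, and the mg/dl range 20 to 600 with target 70 to 180 used in the second call, are also assumptions.

```python
from math import log, sqrt


def solve_symmetrization(bg_min, bg_max, target_low, target_high):
    """Return (alpha, beta, gamma) for f(BG) = gamma * (ln(BG)**alpha - beta)."""

    def gap(alpha):
        # Zero when both symmetry conditions hold for the same beta.
        return (log(bg_max) ** alpha + log(bg_min) ** alpha
                - log(target_high) ** alpha - log(target_low) ** alpha)

    lo, hi = 0.5, 2.0  # assumed bracket around the root
    for _ in range(100):
        mid = (lo + hi) / 2
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    alpha = (lo + hi) / 2
    # Symmetry of the whole range fixes beta; the scaling condition fixes gamma.
    beta = (log(bg_max) ** alpha + log(bg_min) ** alpha) / 2
    gamma = sqrt(10) / (log(bg_max) ** alpha - beta)
    return alpha, beta, gamma


if __name__ == "__main__":
    print("BG in mM:   ", solve_symmetrization(1.1, 33.3, 3.9, 10.0))
    print("BG in mg/dl:", solve_symmetrization(20.0, 600.0, 70.0, 180.0))
```

The routine needs every BG bound to exceed 1 so that its logarithm is positive. Both ranges used here satisfy that.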
Symmetrization Function:
f(BG) = 1.774 * (ln(BG)^1.033 - 1.871)
[Figure: f(BG), from -3.5 to 3.5, plotted against BG (mM), 1 to 34, with the Clinical Center and the Numerical Center marked]
Distribution of the Transformed BG Levels:
[Figure: histogram of Frequency (0 to 50) against f(BG), -2.5 to 2.5. Regions: Hypoglycemia, Target Range, Hyperglycemia. Marked: Symmetrized Data Range, Clinical and Numerical Center]
Defining the Low and High Blood Glucose Indices:
The BG risk function: $r(BG) = 10 \cdot f(BG)^2$
Let $x_1, x_2, \dots, x_n$ be a series of n BG readings, and let
$rl(BG) = r(BG)$ if $f(BG) < 0$ and 0 otherwise;
$rh(BG) = r(BG)$ if $f(BG) > 0$ and 0 otherwise.
The Low Blood Glucose [Risk] Index (LBGI) and the High BG [Risk] Index (HBGI) are then defined as:
$LBGI = \frac{1}{n} \sum_{i=1}^{n} rl(x_i)$
$HBGI = \frac{1}{n} \sum_{i=1}^{n} rh(x_i)$
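These definitions translate directly into code. The sketch below uses the symmetrization function quoted in the slides for BG in mM, f(BG) = 1.774 * (ln(BG)^1.033 - 1.871). The function names and the sample readings are invented for illustration, and readings are assumed to be above 1 mM so that ln(BG) is positive.

```python
from math import log

# Parameters of the symmetrization function for BG measured in mM.
ALPHA, BETA, GAMMA = 1.033, 1.871, 1.774


def f(bg):
    """Symmetrized BG value; negative below the clinical center, positive above."""
    if bg <= 1:
        raise ValueError("this sketch assumes BG readings above 1 mM")
    return GAMMA * (log(bg) ** ALPHA - BETA)


def risk(bg):
    """BG risk function r(BG) = 10 * f(BG)**2."""
    return 10 * f(bg) ** 2


def lbgi_hbgi(readings):
    """Return (LBGI, HBGI) for a series of BG readings in mM."""
    n = len(readings)
    low = sum(risk(x) for x in readings if f(x) < 0) / n   # rl(x) terms
    high = sum(risk(x) for x in readings if f(x) > 0) / n  # rh(x) terms
    return low, high


if __name__ == "__main__":
    sample = [4.2, 6.1, 9.8, 3.1, 12.5, 7.0, 2.6, 15.3]  # invented readings (mM)
    low, high = lbgi_hbgi(sample)
    print(f"LBGI = {low:.2f}, HBGI = {high:.2f}")
```

Because both sums are divided by the total number of readings n, a series with only a few low readings gets a small LBGI even when those readings are very low.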
Symmetrization of the BG Measurement Scale
[Figure: r(BG), 0 to 100, plotted as y = 10 * x^2 against the Transformed BG Scale, -3 to 3. Left: Hypoglycemia, Low BG Risk. Center: Target Range, Clinical and Numerical Center. Right: Hyperglycemia, High BG Risk]
Risk Analysis of Blood Glucose Data: Theory and Algorithms
• Evaluation of HbA1c
• Assessment of Long-Term Risk for [Severe] Hypoglycemia
• Assessment of Short-Term Risk for [Severe] Hypoglycemia
• Predicts 40% of SH episodes for the subsequent 6 months;
• Predicts 50% of imminent SH episodes (24 hours);
• The technology has been licensed by Lifescan Inc, Milpitas, CA.
The Blood Glucose Risk Function: (As Defined on the Original Blood Glucose Scale)
[Figure: r(BG), 0 to 100, plotted against BG Level (mM), 0 to 34. Regions: Low BG Risk, Target Range, High BG Risk]
Heart Rate Variability and the Risk for Neonatal Sepsis
• 4 million births
• 40,000 very low birth weight (VLBW, <1500 grams) infants
• 15,000 NICU beds
• 400,000 NICU admissions
Neonatal Sepsis: A Major Public Health Problem
• Risk of sepsis is high
– 25 - 40% of VLBW infants develop sepsis while in the neonatal intensive care unit
• Significant mortality and morbidity
– In VLBW infants, sepsis doubles the risk of dying
– Length of stay is increased by 1 month
– Health care costs are increased
Current Practice for Infants at Risk for Sepsis
• Nurse relates that infant in NICU is “not acting right” or “looks a little off”
• Physicians must take the cautious approach, suspecting sepsis
• Assessment includes invasive tests:
– CBC, blood culture, urine culture, lumbar puncture
• Intervention: antibiotics
Problems with Current Medical Practice
• Nurses’ and physicians’ subjective assessments are neither sensitive nor specific
• Diagnostic tests have important limitations:
– invasive
– not performed until infant has clinical signs
– various CBC components range from 11% to 77%
Need for Better Risk Assessment for Neonatal Sepsis
• Tremendous need for continuous non-invasive monitoring for sepsis
• Any device that adds objective information about infant’s state of health from continuous risk assessment monitoring would be helpful
[Figure: Magnitude of RR interval [msec], 300 to 600, plotted against Time [RR interval number], 0 to 4096, in three panels A, B, and C]
[Figure: histograms of Difference from median [msec], -20 to 120, for panels A, B, and C, with the median marked. A: Sample Asymmetry=1.37, R1=42, R2=57.5. B: Sample Asymmetry=2.97, R1=27, R2=79.5. C: Sample Asymmetry=11.8, R1=45.5, R2=538.5]