New Statistical Algorithms for Analyzing Multi-Batch CSF Data with Systematic Variations
Hao Zhou
Statistics
CPCP-talk
Joint work with Sathya N. Ravi, Vamsi K. Ithapu, Sterling C. Johnson, Grace Wahba, Vikas Singh
CSF: Same participant’s values may change across batches
Dataset: a quick introduction
12 CSF protein levels of 701 subjects were collected in two different batches (measured at two different time points): 413 subjects in batch 1 and 288 in batch 2.
A subset of 85 individuals have data from both batches (measured as different values); the others are available in only one batch.
Domain Adaptation and variability across CSF batches
Domain Adaptation (DA): in many real-world datasets, training/testing (or source/target) samples may come from different “domains”.
Domain adaptation ideas can be applied to the CSF problem.
Inputs/features in the source and target domains are denoted by x_s and x_t; outputs/labels are denoted by y_s and y_t.
Transform the source and target domains so that the feature/covariate distributions match across domains.
A simple example: grades in a class
A binary setup where Pr(x_s) ≠ Pr(x_t) and Pr(y_s | x_s) ≠ Pr(y_t | x_t). (Hint: solved by the transformation x_s ↦ 100 − x_s; a toy sketch follows the figure.)
Figure: Grade density functions for the source and target domains (left), and the conditional probability of studying well given the grade for each domain (right).
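As a toy illustration of the hint above (a minimal sketch with made-up grade distributions, not data from the talk), flipping the source grades via x_s ↦ 100 − x_s aligns both the feature distribution and the conditional label rule with the target domain:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: source grades cluster high, target grades cluster low,
    # and the "studies well" label rule is mirrored between the two domains.
    xs = rng.normal(75, 10, 5000)           # source grades
    xt = rng.normal(25, 10, 5000)           # target grades
    ys = (xs > 70).astype(int)              # studies well in the source domain
    yt = (xt < 30).astype(int)              # mirrored rule in the target domain

    xs_flipped = 100 - xs                   # the hint: x_s -> 100 - x_s

    print("mean(x_t) =", xt.mean(), " mean(100 - x_s) =", xs_flipped.mean())
    # After the flip the label rules also coincide: x_s > 70 <=> 100 - x_s < 30,
    # so Pr(y | transformed x_s) matches Pr(y | x_t).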
Use MMD as a distance measure between distributions
Maximum Mean Discrepancy (MMD) (Gretton et al., 2012): a statistic that measures the distance between two distributions.
\[
\mathrm{MMD}(x_s, x_t) = \Big\| \frac{1}{m}\sum_{i=1}^{m} K(x_t^i, \cdot) - \frac{1}{n}\sum_{i=1}^{n} K(x_s^i, \cdot) \Big\|_{\mathcal{H}} \tag{1}
\]
The objective function of our estimation problem (minimal MMD).
\[
\min_{\lambda \in \Omega_\lambda} \min_{\beta \in \Omega_\beta} \Big\| \frac{1}{m}\sum_{i=1}^{m} K(g(x_t^i, \beta), \cdot) - \frac{1}{n}\sum_{i=1}^{n} K(h(x_s^i, \lambda), \cdot) \Big\|_{\mathcal{H}} \tag{2}
\]
Our hypothesis and how it differs from MMD
H0: there exist λ and β such that Pr(g(x_t, β)) = Pr(h(x_s, λ)).
HA: no λ and β exist such that Pr(g(x_t, β)) = Pr(h(x_s, λ)).
Figure: Density functions of two distributions, N(0,1) and N(1,1). MMD rejects this case, but ours does not.
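To see why, a small continuation of the sketch above (reusing the mmd helper and the x, y samples; the location-shift model is an illustrative choice of g): the raw MMD between N(0,1) and N(1,1) samples is large, but minimizing over a shift β in g(x_t, β) = x_t + β drives it to roughly zero, so H0 is not rejected.

    betas = np.linspace(-3, 3, 121)
    scores = [mmd(y + b, x) for b in betas]    # shift the target toward the source
    best = betas[int(np.argmin(scores))]
    print("raw MMD:", mmd(y, x))
    print("best shift:", best, "-> minimized MMD:", min(scores))  # ~0 near beta = -1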
CSF: Same participant’s values may change across batches
Dataset details
12 CSF protein levels of 701 subjects were collected in two different batches (measured at two different time points): 413 subjects in batch 1 and 288 in batch 2.
The panel includes sAppα, sAppβ, 1-38-Tr, 1-40-Tr, 1-42-Tr, MCP-1, YKL40, NFL, Ab-42, hTau, PTau, and Neurogranin.
A subset of 85 individuals have data from both batches; the others are available in only one batch.
A linear standardization transformation between the two batches serves as a ‘gold’ standard.
Our algorithm does not use information about corresponding samples; it compares the two batches’ distributions directly.
Calculate the difference between batch 2 and transformed batch 1
We transform the batch 1 data of those individuals who have data in both batches.
We then calculate the ℓ1 relative error between the transformed batch 1 data and the batch 2 data of those individuals (a minimal sketch of this metric follows the figure).
Figure: Mean relative ℓ1 error for each of the 12 proteins under four methods: All (ours), Subset (ours), None, and the gold standard.
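The metric itself is simple; a minimal sketch (the array names and the per-protein linear map are hypothetical placeholders, not the talk's fitted transformations):

    import numpy as np

    def mean_relative_l1_error(batch1_transformed, batch2):
        # Per-protein mean of |transformed batch-1 value - batch-2 value| / |batch-2 value|
        rel = np.abs(batch1_transformed - batch2) / np.abs(batch2)
        return rel.mean(axis=0)                 # one value per protein column

    # Hypothetical paired data: 85 subjects x 12 proteins.
    rng = np.random.default_rng(0)
    batch2 = rng.lognormal(6, 0.5, (85, 12))
    slope, intercept = 1.1, 5.0                 # stand-in linear map, same for all proteins
    batch1 = (batch2 - intercept) / slope + rng.normal(0, 5, (85, 12))
    print(mean_relative_l1_error(slope * batch1 + intercept, batch2))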
Predict Hippocampal Volumes based on transformed CSF
We used the ‘transformed’ CSF data from the two batches and performed a multiple regression to predict the left/right Hippocampal Volume.
Performance is measured by the correlation between the predicted and actual Hippocampal Volume; 10-fold cross-validation is used to form the training and testing datasets.
Model           Left          Right
gold standard   0.46 ± 0.15   0.37 ± 0.16
Subset (ours)   0.48 ± 0.15   0.39 ± 0.15
All (ours)      0.48 ± 0.15   0.40 ± 0.15
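A minimal sketch of this evaluation loop (with synthetic stand-in data; in the talk the features are the 12 harmonized protein levels):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(701, 12))                        # stand-in harmonized CSF features
    y = X @ rng.normal(size=12) + rng.normal(0, 3, 701)   # stand-in hippocampal volume

    corrs = []
    for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        pred = LinearRegression().fit(X[train], y[train]).predict(X[test])
        corrs.append(np.corrcoef(pred, y[test])[0, 1])
    print(f"{np.mean(corrs):.2f} +/- {np.std(corrs):.2f}")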
Plot of transformed batch 1 & 2 data for all participants
Figure: Batch 2 values plotted against transformed batch 1 values for each of the 12 proteins (sAppα, sAppβ, 1-38-Tr, 1-40-Tr, 1-42-Tr, MCP-1, YKL40, NFL, Ab1-42, hTau, PTau, Neurogranin), with ‘Paired’ and ‘All’ participants marked separately.
Plot of transformed batch 1 & 2 data for participants present in both batches
Figure: 1-38-Tr, batch 2 vs. batch 1 values for the paired participants, under the ‘all’ transformation (left) and ‘none’ (right).
Plot of transformed batch 1 & 2 data for participants present in both batches
Figure: MCP-1, batch 2 vs. batch 1 values for the paired participants, under the ‘all’ transformation (left) and ‘none’ (right).
Plot of transformed batch 1 & 2 data for participants present in both batches
Figure: NFL, batch 2 vs. batch 1 values for the paired participants, under the ‘all’ transformation (left) and ‘none’ (right).
Recap of our method
The objective function of our estimation problem (minimal MMD).
\[
\mathcal{M}(\lambda, \beta) = \min_{\lambda \in \Omega_\lambda} \min_{\beta \in \Omega_\beta} \Big\| \frac{1}{m}\sum_{i=1}^{m} K(g(x_t^i, \beta), \cdot) - \frac{1}{n}\sum_{i=1}^{n} K(h(x_s^i, \lambda), \cdot) \Big\|_{\mathcal{H}} \tag{3}
\]
The hypothesis test:
H0: there exist λ and β such that Pr(g(x_t, β)) = Pr(h(x_s, λ)).
HA: no λ and β exist such that Pr(g(x_t, β)) = Pr(h(x_s, λ)).
Assumptions
(A1) \|K(h(x_s, \lambda_1), \cdot) - K(h(x_s, \lambda_2), \cdot)\| \le L_h \, d(\lambda_1, \lambda_2)^{r_h} \quad \forall x_s;\ \lambda_1, \lambda_2 \in \Omega_\lambda
(A2) \|K(g(x_t, \beta_1), \cdot) - K(g(x_t, \beta_2), \cdot)\| \le L_g \, d(\beta_1, \beta_2)^{r_g} \quad \forall x_t;\ \beta_1, \beta_2 \in \Omega_\beta
Hypothesis Testing Consistency
Theorem (Hypothesis Testing)
(a) Whenever H0 is true, with probability at least 1 − α,
\[
0 \le \mathcal{M}(\lambda, \beta) \le \sqrt{\frac{2K(m+n)\log \alpha^{-1}}{mn}} + \frac{2\sqrt{K}}{\sqrt{n}} + \frac{2\sqrt{K}}{\sqrt{m}} \tag{4}
\]
(b) Whenever HA is true, with probability at least 1 − ε,
\[
-\frac{\sqrt{K}}{\sqrt{n}}\Big(4 + \sqrt{C(h,\varepsilon)} + \frac{d_\lambda}{2 r_h}\log n\Big) - \frac{\sqrt{K}}{\sqrt{m}}\Big(4 + \sqrt{C(g,\varepsilon)} + \frac{d_\beta}{2 r_g}\log m\Big) \le \mathcal{M}(\lambda, \beta) - \mathcal{M}^*(\lambda_A, \beta_A) \le \sqrt{\frac{2K(m+n)\log \varepsilon^{-1}}{mn}} + \frac{2\sqrt{K}}{\sqrt{n}} + \frac{2\sqrt{K}}{\sqrt{m}} \tag{5}
\]
where C(h, \varepsilon) = \log(2|\Omega_\lambda|) + \log \varepsilon^{-1} + \frac{d_\lambda}{r_h}\log\frac{L_h}{\sqrt{K}}, and C(g, \varepsilon) is defined analogously.
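Part (a) gives a concrete decision rule; a minimal sketch (K here is an assumed upper bound on the kernel values, e.g. K = 1 for a Gaussian kernel): accept H0 at level α when the minimized MMD falls below the right-hand side of (4).

    import numpy as np

    def h0_threshold(m, n, alpha, K=1.0):
        # Right-hand side of bound (4): the acceptance threshold under H0.
        return (np.sqrt(2 * K * (m + n) * np.log(1 / alpha) / (m * n))
                + 2 * np.sqrt(K) / np.sqrt(n) + 2 * np.sqrt(K) / np.sqrt(m))

    # e.g., accept H0 if the minimized MMD M(lambda, beta) <= h0_threshold(288, 413, 0.05)
    print(h0_threshold(288, 413, 0.05))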
Convergence Consistency
Theorem (MMD Convergence)
Under H0
\[
\big\| \mathbb{E}_{x_s} K(h(x_s, \lambda), \cdot) - \mathbb{E}_{x_t} K(g(x_t, \beta), \cdot) \big\|_{\mathcal{H}} \to 0
\]
at rate \min\big(\sqrt{\log n}/\sqrt{n},\ \sqrt{\log m}/\sqrt{m}\big).
Theorem (Consistency)
Under H0, the estimators λ and β are consistent.
Simulation for test power
Left panel: x_s is indicated by the legend, x_t ∼ N(10, 4), and the model is x_t = λ_1 x_s + λ_2.
Right panel: x_s ∼ N(0, 1), x_t ∼ N(10, 4), and the model is indicated by the legend (a sketch of one simulation cell follows the figure).
Figure: Acceptance rate vs. sample size (log2 scale). Left, ‘Normal target vs. different sources’: Normal(0,1), Laplace(0,1), Exponential(1). Right, ‘Models linear in parameters’: a·x² + b·x + c and a·log(|x|) + b.
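A compact sketch of one cell of this experiment (grid search over (λ1, λ2) is an illustrative stand-in for the talk's optimizer; reuses the mmd helper and the h0_threshold function from earlier sketches):

    # One power-simulation cell: does the test accept H0 for a correctly
    # specified linear model? (N(10, 4) is read as variance 4, i.e. sd 2.)
    rng = np.random.default_rng(1)
    n = 256
    xs = rng.normal(0, 1, (n, 1))
    xt = rng.normal(10, 2, (n, 1))

    # Fit x_t = l1 * x_s + l2 by grid search on the empirical MMD.
    grid1 = np.linspace(0.5, 4, 36)      # includes the true slope 2
    grid2 = np.linspace(5, 15, 41)       # includes the true intercept 10
    best = min((mmd(l1 * xs + l2, xt), l1, l2) for l1 in grid1 for l2 in grid2)
    print("min MMD:", best[0], "at lambda1, lambda2 =", best[1], best[2])
    # The acceptance rate is the fraction of replications whose minimized MMD
    # falls below the bound-(4) threshold, e.g. h0_threshold(n, n, 0.05).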
Simulation for estimation error
x_s ∼ N(0, 1), x_t ∼ N(10, 4), and the model is x_t = λ_1 × x_s + λ_2.
Matching the two distributions forces λ_1² = 4 (variances) and λ_2 = 10 (means), so the true parameters are λ_1 = 2 and λ_2 = 10; the ℓ1 error is |λ_1 − 2| for the slope curve and |λ_2 − 10| for the intercept curve.
Figure: ℓ1 estimation error vs. sample size (log2 scale) for the slope and intercept estimates (normal vs. normal).
An Ellipsoid Constraint
Theorem (Linear transformation)
Under H0, with g(·) the identity and h(x_s, λ) = φ(x_s)^T λ, define
\[
\Omega_\lambda := \Big\{ \lambda : \frac{1}{n}\sum_{i=1}^{n} \big\| x_t^i - \phi(x_s^i)^T \lambda \big\|^2 \le 3 \sum_{k=1}^{p} \mathrm{Var}(x_{t,k}) + \epsilon \Big\}.
\]
For any ε, α > 0 and sufficiently large sample size, a neighborhood of λ_0 is contained in Ω_λ with probability at least 1 − α.
Here the subscript k in x_{t,k} denotes the k-th dimensional feature of x_t.
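A minimal sketch of checking membership in this constraint set (the feature map φ and the data are illustrative placeholders):

    import numpy as np

    def in_ellipsoid(lam, xt, phi_xs, eps=0.1):
        # lam: (q, p) coefficients; phi_xs: (n, q) features; xt: (n, p) targets.
        resid = ((xt - phi_xs @ lam) ** 2).sum(axis=1).mean()
        bound = 3 * xt.var(axis=0).sum() + eps
        return resid <= bound

    rng = np.random.default_rng(0)
    xs = rng.normal(0, 1, (100, 1))
    phi_xs = np.hstack([xs, np.ones((100, 1))])     # phi(x) = (x, 1): an affine map
    xt = phi_xs @ np.array([[2.0], [10.0]]) + rng.normal(0, 0.3, (100, 1))
    print(in_ellipsoid(np.array([[2.0], [10.0]]), xt, phi_xs))   # True: near lambda_0
    print(in_ellipsoid(np.array([[0.0], [0.0]]), xt, phi_xs))    # False: residual too large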
Signomial Geometric Programming (SGP)
Monomial: \exp(a^T y + b)
Posynomial: \sum_{k=1}^{K_0} \exp(a_{0k}^T y + b_{0k})
Signomial Geometric Programming:
\[
\min_y \ \sum_{k=1}^{K_0} \exp(a_{0k}^T y + b_{0k}) - \sum_{l=1}^{L_0} \exp(c_{0l}^T y + d_{0l}) \tag{6}
\]
\[
\text{s.t.} \ \sum_{k=1}^{K_i} \exp(a_{ik}^T y + b_{ik}) - \sum_{l=1}^{L_i} \exp(c_{il}^T y + d_{il}) \le 0 \tag{7}
\]
Idea
min f(x) ⇔ sup γ s.t. f(x) − γ ≥ 0 for all x; relax the resulting “nonnegative signomial” constraint.
This yields a series of convex problems that give increasingly tight bounds (a toy instance follows).
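As a toy instance of the SGP form (6)-(7), a minimal sketch: a 1-D signomial solved with a local SLSQP solver, which is an off-the-shelf stand-in rather than the convex-relaxation series described above.

    import numpy as np
    from scipy.optimize import minimize

    # Objective (6): exp(y) + exp(-y) - exp(0.5 y), a posynomial minus a monomial.
    f = lambda y: np.exp(y[0]) + np.exp(-y[0]) - np.exp(0.5 * y[0])
    # Constraint (7): exp(2y) - exp(y + 1) <= 0, i.e. y <= 1.
    g = lambda y: np.exp(y[0] + 1) - np.exp(2 * y[0])   # SLSQP expects g(y) >= 0

    res = minimize(f, x0=[0.0], method="SLSQP",
                   constraints=[{"type": "ineq", "fun": g}])
    print(res.x, res.fun)   # a local minimizer; certified bounds need the relaxations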
Conclusions
A statistical framework to harmonize CSF measurements across batches/sites
Assumption: the same “concept” is captured across sites
Constructions for hypothesis tests
Participants don’t need to be represented twice in different batches for calibration
The End, Thank You