motivation & relevance statistical analysis of high ... › sites › default › files ›...
TRANSCRIPT
Statistical Analysis of High-Dimensional
Biomedical Data
Analytical Goals, Common Approaches and
Challenges
Axel Benner
German Cancer Research Center (DKFZ), Heidelberg, Germany
July 18, 2019
Motivation & Relevance
Increasing use and availability of health-related metrics
Omics data (e.g., genomics, transcriptomics, proteomics)
Electronic health records
Big data / high dimensionality
Big data
typically characterized by very large sample size n
High dimensionality
number of unknown parameters p is of much larger order thansample size n (p >> n)
1
Motivation & Relevance
High dimensionality introduce computational and statistical
challenges
Heterogeneity (e.g., di↵erent sources, technologies)
Noise accumulation (accumulation of estimation errors)
Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev. 20142
Example: Noise accumulation
−4 −3 −2 −1 0 1 2 3
−2−1
01
2
p= 5
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●●
●
●
●●
●
●
●
●
●●●
●
●
●
●
●
● ●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●
●●
●●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●●●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●
−4 −2 0 2
−4−2
02
4
p= 50
●
●
●
●
●●
●
●
●
●
●●● ●
●●
●
●
●
●
● ●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●● ●
●
●
●
●
●
●
●●
●
●●
●●
●
●●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●● ●
●
●● ●
●
●●
●● ●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●●
−6 −4 −2 0 2 4 6−6
−4−2
02
46
p= 500
●
●
●
●
●
●
●
●
●●●
●
●
●
●
● ●●
●
● ●
●
●
●●
●
●
●
●
●●
●●●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●●●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●●
●
● ●●
●
●
●●
●●
●
●
●
●
●
●●●
●
●
●
●
● ●
●
●
●
●
●
●●
●●●
●
●●
●
●
●
●
●●
●● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●●●
●●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
−10 −5 0 5
−50
5
p= 1000
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
● ●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●●
●
●
●●
●●
●
●●
●●
●
●
●●
●
●●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Projections of the observed data (n=100, each class) onto the first
two principal components of the p-dimensional feature space.
Motivation & Relevance
High dimensionality introduce computational and statistical
challenges
Heterogeneity (e.g., di↵erent sources, technologies)
Noise accumulation (accumulation of estimation errors)
Spurious correlation (uncorrelated variables may have high
sample correlations in high dimensions)
Incidental endogeneity (correlations between predictors and
residual noise)
These features of high-dimensional data often make traditional
statistical methods invalid!
Guidance required on how to deal with these challenges.
Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev. 20142
14 Members (July 2019)
Federico Ambrogi (University of Milan, Italy)
Axel Benner (DKFZ Heidelberg, Germany)
Harald Binder (Freiburg University, Germany)
Anne-Laure Boulesteix (LMU Munich, Germany)
Tomasz Burzykowski (Hasselt University, Belgium)
Riccardo De Bin (University Oslo, Norway)
W. Evan Johnson (Boston University, USA)
Lara Lusa (University of Ljubljana, Slovenia)
Lisa McShane (NCI, USA)
Stefan Michiels (University Paris-Sud, France)
Eugenia Migliavacca (Nestle Institute of Health Sciences
Lausanne, Switzerland)
Jorg Rahnenfuhrer (TU Dortmund, Germany)
Sherri Rose (Harvard Medical School, USA)
Willi Sauerbrei (Freiburg University, Germany)3
Subtopics
(All in context of very large number of predictor, descriptor, or
outcome variables)
1. Data pre-processing
2. Exploratory data analysis
3. Data reduction
4. Multiple testing
5. Prediction modeling/algorithms
6. Comparative e↵ectiveness and causal inference
7. Design considerations
8. Data simulation methods
9. Resources for publicly available high-dimensional data sets
4
Current Goals
Overview paper
Statistical analysis of high-dimensional biomedical data:A gentle introduction to analytical goals, common approachesand challenges
Simulation paper
Guidance for planning, conducting and reporting simulationstudies for comparing analytic approaches for biomedical data:General concepts with additional considerations forhigh-dimensional data
Guidance for analysis processes
Examples for data analysis processes for specific types of HDDRecommendations for best practicesR-Code with interpretations
5
Overview paper
Topics
Initial data analysis
Exploratory data analysis
Multiple testing
Prediction
6
Initial Data Analysis
Analytical goals
Describe distributions of variables and identify inconsistent,
suspicious or unexpected values
Identify missing values and consider strategies to address
Identify systematic e↵ects due to data collection and adjust if
required
Simplify data and refine/update analysis plan if required
Common approaches
Graphical displays: Scatterplots, Histograms, Heatmaps, ...
Descriptive statistics
Projections: Principal component analysis (PCA)
7
Exploratory data analysis
Analytical goals
Identify interesting data characteristics
Analyze data structure
Common approaches
Projections into fewer dimensions: PCA, t-Distributed
Stochastic Neighbor Embedding (t-SNE), ...
Descriptive statistics (e.g., to identify regions with relatively
large data density)
Cluster analysis
Hierarchical clustering, K-means, Partitioning around Medoids(PAM), ...
Creation of prototypical samples
8
Exploratory Data Analysis
Example: PCA vs. t-SNE
PCA performs a linear mapping of the data to a
lower-dimensional space such that the variance of the data is
maximized.
t-SNE (van der Maaten & Hinton, 2008) is a variation of
Stochastic Neighbor Embedding (SNE, Hinton & Roweis,
2002) that minimizes the divergence between the distribution
of the input data and the distribution in the low-dimensional
space.
The main di↵erence between t-SNE and PCA is that PCA
focuses on preserving the distances between widely separated
data points whereas t-SNE tries to preserve the distances
between nearby high-dimensional data points.
i.e. t-SNE reduces the dimensionality of data mainly based on
local properties of the data9
Example: PCA vs. t-SNE
−1.5−1.0−0.5 0.0 0.5 1.0 1.5−3−2
−1 0
1 2
3
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
x
y
z
●
●
●
●●
●
●
●
●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●● ●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●●
●●
●
● ●
●●
●
●
●
●
●
●
●
●●●
●
●
●● ●●
●●
● ●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●● ●●
●
● ●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●●
●
●
●●
●
●●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●●
●●
●
●● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
●●
●
●
●
● ●
●
● ●
●
● ●●
●
●
●
●
●
●
●●
●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
● ●●
●
●
●●
●●
●
●
● ●
●
●
●
●●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
●
●●
●
●
●
●
●
●
● ●
●
●
●●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
● ●●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●●●
●●
●
●
●
●
●●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
● ●
●●
●
●●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●
● ●
●
●
●
●
● 3D S Curve
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●●
● ●
●
●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●●
●
●●
●
●●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
● ●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−2 −1 0 1 2
−1.0
−0.5
0.0
0.5
1.0
PC1
PC2 ●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
● ●●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●
● ●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●● ●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●● ●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
−20 0 20−15
−10
−50
510
15
tSNE1
tSNE2
10
Example: PCA vs. t-SNE
Extended Data Figure 3
183 mm
RF estimated tumour purity
40−49%50−59%60−69%70−79%80−89%
non-tumor class
020
040
060
080
010
0012
0014
00
Num
ber o
f ref
eren
ce c
ases
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
RF model to estimate tumour purity (TCGA training data)
ABSOLUTE purity
RF
estim
ated
pur
ity (o
ob)
Mean sqared error = 0.015
a
b
c
LGm1
LGm2
LGm3
LGm4
LGm5
LGm6
other tumor
O IDH
A IDH
A IDH, HG
DMG, K27
GBM, MIDGBMRTK I
GBM, MYCN
GBM, MES
GBM, RTK II
GBM, G34
GBM, RTK III
d
Genome-wide DNA methylation profiles of n = 2,801 brain tumor
samples (Capper et al., Nature 2018)11
Multiple Testing
Analytical goals
Identify informative variables for an outcome
Identify informative groups of variables
12
Multiple Testing
Statistical testing of thousands of hypotheses
requires alternative procedures to control the false discovery
rates and to improve the power of the tests.
Many di↵erent scenarios
Find variables with di↵erent distributions between
pre-specified classes of subjects or with association with
outcome
Enriched variables classes in a list of selected variables
Common approaches
Control for false discoveries (e.g. FDR, empirical Bayes)
Global testing versus one-at-a-time testing
Enrichment tests (e.g., gene set enrichment analysis)
13
Prediction
Analytical goals
Construct prediction models
Assess performance and validate prediction models
14
Prediction
Goal: Construct prediction models
Common approaches
Dimension reduction
Statistical modelling
Ridge regression, lasso and their modificationsBoostingSupport vector machinesTrees and Random forestsNeural networks and deep learning
15
Prediction
Problem: Standard methods break down
For n ⌧ p cannot fit standard regression model
Redundancy in variables:
huge correlation as problem for stable variable selection
Example: Incidental endogeneity
Gene expression example (Fan et al. 2014):
(Ten-fold cross validated) L1-penalized least squares
regression (37 genes are selected) - refit ordinary least squares
regression on the selected model to calculate residuals.
16
Example: Incidental endogeneity
Figure 3. Illustration of incidental endogeneity on a microarry gene expression data. Left panel:
Distribution of the sample correlation . Right panel:
Distribution of the sample correlation . Here represents the residual noise after the Lasso fit. We provide the distributions of the sample correlations using both the raw data and permuted data.
Fan et al. Page 31
Natl Sci Rev. Author manuscript; available in PMC 2014 December 01.
NIH
-PA Author Manuscript
NIH
-PA Author Manuscript
NIH
-PA Author Manuscript
Red: Empirical distribution of the correlations between the
predictors and the residuals
Blue: ”null distribution” of the spurious correlations by randomly
permuting the orders of rows in the design matrix.
Prediction
Penalized regression is ine�cient when the dimensionality p of the
covariates is ultrahigh (Fan & Lv, JRSS-B 2008)
Dimension reduction
Apply screening methods that are based on correlation
learning to reduce dimensionality from ultrahigh to a
moderate scale
This enhances finite sample performance of subsequent
regression methods (Fan & Lv, JRSS-B 2008)
Speed is essential: Fast distance correlation screening using
martingale residuals, which is computationally e�cient and easy to
implement (Edelmann et al. BiomJ, 2019)
16
Example: Incidental endogeneity
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Model size
True
Pos
itive
Rat
e (T
PR)
25 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 10025 50 75 100
COXCINDEXCRCDCSCRISFASTRESIBCORSISIPODFPSISRCDCSLASSO
Figure S5: True positive rate vs. model size considering linear and nonlineare↵ects. A plasmode simulation of 100,000 CpG sites is done for a sample sizen=300 and 50% censoring (500 data sets).
12
True positive rate vs. model size considering linear and nonlinear e↵ects.
Plasmode simulation of 100,000 CpG sites, sample size n=300 and 50%
censoring (500 data sets).
Prediction
Goal: Assess performance and validate prediction models
Problem: Improper evaluation (e.g., resubstitution) drastically
overestimates model performance
(and is still extremely common)
Common approaches
Calibration and prediction accuracy
Choice of performance measures (e.g., MSE, AUC, Brier
score)
Risk of overfitting
) Stability of model selection
16
Simulation paper
Goal: Guidance for planning, conducting and reporting
simulation studies for comparing analytic approaches for
biomedical data
For high-dimensional data heavier reliance on simulated data
necessary
Data often generated to address complex research questions,
and analytical methods may be correspondingly tailored
Wide range of specialized data and analysis approaches, thus
often not su�cient number of data sets available
TG9 cooperation with Simulation Panel obvious!
17
Simulation paper
Issues specific to high-dimensional data (HDD)
Underlying (biological) mechanism not well understood
Di�cult to simulate realistic correlation structure and suitable
multivariate distributions
Common Approaches
Simulations based on assumed distributions (e.g. normal,
Poisson, negative binomial)
Simulation using extracted parameters from pilot data
Simulation using real data (e.g., plasmode data)
More about this:
Victor Kipnis: Issues in modern biomedical simulation studies
18
Example: ”Real data” simulation of HDD
Useful approach for realistic high-dimensional data generation
Plasmode data:
Real data (e.g., omics data from actual biological specimens)
which are manipulated such that the parameters of interest
are known with certainty.
Name from plasm=form, and mode=measure
References:
Cattell, R. B. (1966). Handbook of Multivariate ExperimentalPsychology. Rand McNally, Chicago.Mehta et al., Physiological Genomics 2006;28(1):24-32
19
Example: ”Real data” simulation of HDD
Advantages of plasmode data
Distributions/correlations are taken directly from real data
Appropriate permutation, resampling, or modification of real
data o↵ers flexibility to generate data with desired features
Can combine with outcome models to generate dependent
variables associated with realistic HDD as independent
variables
20
”Real data” simulation of HDD
Example: Generate data for evaluation of multiple testing methods
Permute subject/specimen IDs to generate a null distribution
Global null allows assessment of ”weak control” of falsepositives for a multiple testing procedure
Add back defined e↵ects on specific individual variables
Allows assessment of both ”power” for true positives and”strong control” of false positives for a multiple testingprocedure
21
Outlook
TG 9: Large topic with links to Bioinformatics and Molecular
Epidemiology
But: high-dimensional challenges also in non-omics settings
Overlap with many other topic groups, but always with
high-dimensional flavor
Cooperation with other Topic Groups is essential
22
Outlook
Guidance is essential!
(Source: Sidney Harris)
23
Thank You
Federico Ambrogi Axel Benner
Harald Binder Anne-Laure Boulesteix
Tomasz Burzykowski Riccardo De Bin
Evan Johnson Lara Lusa
Lisa McShane Stefan Michiels
Eugenia Migliavacca Jorg Rahnenfuhrer
Sherri Rose Willi Sauerbrei
24