The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Big Challenges for Statisticians
Hongtu Zhu, Ph.D.
Department of Biostatistics and Biomedical Research Imaging Center
The University of North Carolina at Chapel Hill,
Chapel Hill, NC 27599, USA
Thank NSF and SAMSI!
Thank organizers!
Thank you!
Science
Statistics
Part 1. Technical Challenges
Imaging Science
Imaging Science is a multidisciplinary field concerned with the generation, collection,
duplication, analysis, modification, and visualization of images.
As an evolving field, it includes research and researchers from
Physics, Mathematics, Statistics, Electrical Engineering, Computer
Vision, Computer Science and Perceptual Psychology.
From Wikipedia, the free encyclopedia
Three key components
• Image acquisition: studies the physical mechanisms, mathematical models, and algorithms by which imaging devices generate image observations.
• Image interpretation/application: seeing, monitoring, and interpreting the targeted world or patterns being imaged.
• Image processing: any linear or nonlinear operator that acts on images and produces targeted patterns.
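To make the "linear or nonlinear operator" idea concrete, here is a minimal sketch on a one-dimensional row of pixel values. The function names and the toy data are assumptions chosen for illustration; they are not from the slides. A moving average is a linear operator, while a median filter is a nonlinear one.

```python
from statistics import median


def moving_average(signal, width=3):
    """Linear operator: each output is the mean of a window of inputs."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        # The window is truncated at the borders of the signal.
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def median_filter(signal, width=3):
    """Nonlinear operator: each output is the median of a window of inputs."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(median(window))
    return out


# Toy data: a flat image row with one impulse-noise spike at index 3.
row = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
smoothed = moving_average(row)  # spreads the spike over its neighbours
despiked = median_filter(row)   # removes the isolated spike
```

On this toy row the linear filter smears the spike into neighbouring pixels, whereas the median filter restores the flat row, which is one reason both kinds of operator appear in image processing.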
Level 1: Imaging Data
Image panels: Structural MRI, Diffusion MRI, Functional MRI (resting), Functional MRI (task)
Overview
• Structural MRI
• Diffusion MRI
• Functional MRI
• Complementary techniques: PET, EEG/MEG, Calcium, CT
- Variety of acquisitions
- Measurement basics
- Limitations & artefacts
- Analysis principles
- Acquisition tips
[Diagram labels]
Image Acquisition
Signal Models & Noise Sources
Image Preprocessing
Representation, Segmentation, Registration
Data Analysis & Interpretation
Statistical Modeling & Inference
Mathematics & Statistics
Image Processing
Computer Science/Engineering
Individual Imaging Analysis
Imaging Construction
Image Segmentation
Multimodal Analysis
DTI FLAIR
Marc
Group Imaging Analysis
Longitudinal/Family Brain
Group Differences
Prediction
Imaging Genetics
NC/Diseased
Registration
Hibar, Dinggang, Martin
FDA: Functional Data Analysis
Diagram labels: $f$, $T$
$\hat{F} = T[f]$
FDA: Functional Data Analysis
[Diagram labels]
Images
Registration
Smoothing
Voxel-wise Statistical Models
Estimation
Multiple Comparisons
Prediction
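The multiple-comparisons step can be illustrated with a short sketch. It assumes voxel-wise p-values have already been computed by some statistical model (the values below are made up), and it compares a Bonferroni correction with the Benjamini-Hochberg false discovery rate procedure. The slides name neither method; both are swapped in here as common examples.

```python
def bonferroni(p_values, alpha=0.05):
    """Declare a voxel significant if p <= alpha / (number of voxels)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten voxels (illustrative only).
voxel_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf_hits = bonferroni(voxel_p)
fdr_hits = benjamini_hochberg(voxel_p)
```

With these illustrative numbers Bonferroni keeps one voxel and Benjamini-Hochberg keeps two, which shows the usual trade-off between strict family-wise control and power. Real images have spatially dependent voxels, which this sketch ignores.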
Ill-posed inverse problems
Diagram labels: $F$, $f$, $T$
$\hat{F} = T[f]$
$d(F, \hat{F}) \to 0$?
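A small numerical sketch shows why the question of whether d(F, F̂) goes to 0 is not trivial. It assumes a hypothetical 2x2 forward operator `A` that maps the true image F to the data f, and takes T to be either naive inversion or a Tikhonov-regularised inverse. None of these specific choices come from the slides.

```python
import math


def solve_2x2(m, b):
    """Solve the 2x2 linear system m x = b by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    x0 = (b[0] * m[1][1] - m[0][1] * b[1]) / det
    x1 = (m[0][0] * b[1] - m[1][0] * b[0]) / det
    return [x0, x1]


def tikhonov_2x2(a, f, lam):
    """Minimise ||a x - f||^2 + lam ||x||^2 by solving (a^T a + lam I) x = a^T f."""
    ata = [[sum(a[k][i] * a[k][j] for k in range(2)) + (lam if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    atf = [sum(a[k][i] * f[k] for k in range(2)) for i in range(2)]
    return solve_2x2(ata, atf)


def distance(u, v):
    """Euclidean distance, standing in for d(F, F_hat)."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


# Hypothetical, nearly singular forward operator and true image.
A = [[1.0, 1.0], [1.0, 1.0001]]
truth = [1.0, 1.0]                 # F; the exact data would be [2.0, 2.0001]
f_noisy = [2.0, 2.0002]            # f with a perturbation of 1e-4 in one entry

naive = solve_2x2(A, f_noisy)                 # T = plain inversion
regularised = tikhonov_2x2(A, f_noisy, 1e-3)  # T = regularised inverse

print(distance(naive, truth), distance(regularised, truth))
```

In this toy case a data perturbation of 1e-4 moves the naive reconstruction far from the truth, while the regularised one stays close. The regularisation weight `1e-3` is an arbitrary illustrative choice.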
Level 2: A Multiscale Physical System
The van Essen diagram
stimulus – activity – measurement chain
Robinson
A Multi-modal Approach
• Different models at different scales.
• Ladder of overlapping models.
• Must be testable against multiple phenomena.
Level 3: Data Integration
Ritchie et al. (2015). Nature Reviews Genetics.
Meta-dimensional analysis: An approach whereby all scales of data are combined simultaneously to produce complex models defined as multiple variables from multiple scales of data.
Multi-staged analysis: A stepwise or hierarchical analysis method that reduces the search space through different stages of analysis.
Systems genomics: An analysis approach that models the complex inter- and intra-individual variations of traits and diseases using data from next-generation omic data.
Data integration: The incorporation of multi-omic information in a meaningful way to provide a more comprehensive analysis of a biological point of interest.
In this Review, we describe the principles of meta-dimensional analysis and multi-staged analysis, and provide an overview of some of the approaches that are used to predict a given quantitative or categorical outcome, the tools available to implement these analyses, and the various strengths and weaknesses of these strategies. In addition, we describe the analytical challenges that emerge with data sets of this magnitude, and provide our perspective on how such systems genomic analyses might develop in the future.
Why integrate data?
Data integration can have numerous meanings; however, in this Review, we use it to mean the process by which different types of omic data are combined as predictor variables to allow more thorough and comprehensive modelling of complex traits or phenotypes (which are likely to be the result of an elaborate interplay among biological variation at various levels of regulation) through the identification of more informative models. Data integration methods are now emerging that aim to bridge the gap between our ability to generate vast amounts of data and our understanding of biology, thus reflecting the complexity within biological systems.
The primary motivation behind integrated data analysis is to identify key genomic factors, and importantly their interactions, that explain or predict disease risk or other biological outcomes. The success in understanding the genetic and genomic architecture of complex phenotypes has been modest, and this could be due to our limited exploration of the interactions among the genome, transcriptome, metabolome and so on. Data integration may provide improved power to identify the important genomic factors and their interactions (BOX 1). In addition, modelling the complexity of, and the interactions between, variation in DNA, gene expression, methylation, metabolites and proteins may improve our understanding of the mechanism or causal relationships of complex-trait architecture.
There are two main approaches to data integration: multi-staged analysis, which involves integrating information using a stepwise or hierarchical analysis approach; and meta-dimensional analysis, which refers to the concept of integrating multiple different data types to build a multivariate model associated with a given outcome [16–18].
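The contrast between the two approaches can be sketched in a few lines. Everything below is an illustrative assumption rather than a method from the Review: the toy data, the names `snp_a` and `snp_b`, the use of Pearson correlation as the association measure, and the 0.5 filter threshold. The multi-staged function narrows the search space in a first stage before testing against the outcome. The meta-dimensional function only performs the concatenation step that would precede fitting a multivariate model.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def multi_staged(snps, expression, outcome, threshold=0.5):
    """Stage 1: keep SNPs associated with expression.
    Stage 2: test only the retained SNPs against the outcome."""
    stage1 = {name: g for name, g in snps.items()
              if abs(pearson(g, expression)) >= threshold}
    return {name: pearson(g, outcome) for name, g in stage1.items()}


def meta_dimensional(snps, expression):
    """Concatenate all data types into one feature vector per subject,
    ready for a single multivariate model (model fitting not shown)."""
    names = sorted(snps)
    return [[snps[n][i] for n in names] + [expression[i]]
            for i in range(len(expression))]


# Toy data for six subjects (made up for illustration).
snps = {"snp_a": [0, 0, 1, 1, 2, 2], "snp_b": [0, 1, 0, 1, 0, 1]}
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
outcome = [0, 0, 0, 1, 1, 1]

staged = multi_staged(snps, expression, outcome)
combined = meta_dimensional(snps, expression)
```

Here `staged` retains only `snp_a`, because `snp_b` is nearly uncorrelated with expression, while `combined` holds one row per subject with both SNPs and the expression value side by side.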
[Figure: flow of information across Genome, Epigenome, Transcriptome, Proteome, Metabolome and Phenome.
Genome: SNP, CNV, LOH, genomic rearrangement, rare variant.
Epigenome: DNA methylation, histone modification, chromatin accessibility, TF binding, miRNA.
Transcriptome: gene expression, alternative splicing, long non-coding RNA, small RNA.
Proteome: protein expression, post-translational modification, cytokine array.
Metabolome: metabolite profiling in serum, plasma, urine, CSF, etc.
Phenome: cancer, metabolic syndrome, psychiatric disease.]
Figure 1 | Biological systems multi-omics from the genome, epigenome, transcriptome, proteome and metabolome to the phenome. Heterogeneous genomic data exist within and between levels, for example, single-nucleotide polymorphism (SNP), copy number variation (CNV), loss of heterozygosity (LOH) and genomic rearrangement, such as translocation, at the genome level; DNA methylation, histone modification, chromatin accessibility, transcription factor (TF) binding and microRNA (miRNA) at the epigenome level; gene expression and alternative splicing at the transcriptome level; protein expression and post-translational modification at the proteome level; and metabolite profiling at the metabolome level. Arrows indicate the flow of genetic information from the genome level to the metabolome level and, ultimately, to the phenome level. The red crosses indicate inactivation of transcription or translation. CSF, cerebrospinal fluid; Me, methylation; TFBS, transcription factor-binding site.
© 2015 Macmillan Publishers Limited. All rights reserved
[Figure 1. A simplified flow chart for psychiatric disorders: from genes to symptoms.
Levels: Genes, Molecules, Cells, Brain, Endophenotypes, Symptoms.
Other labels: Genomics; Epigenomics; Expression; RNA genes, protein-coding genes; RNA, proteins, metabolites; Transcriptomics; Proteomics; Metabolomics; Interactomics; neuron development, organelle; Cell biology; Neuroscience; Imaging; Brain interactome; Structure, circuits, physiology; Behavioral tests; Diagnosis; Self-report; feedback; Environmental, social and psychological factors.]
Zhao and Castellanos (2016). Discovery science strategies in studies of the pathophysiology of child and adolescent psychiatric disorders: promises and limitations.
Big Data Integration in Health Informatics
http://en.wikipedia.org/wiki/DNA_sequence
Diagram labels: G, I, E, D, Selection
E: environmental factors
G: genetic/genomics
D: disease
I: imaging/device
Part 2. Career Challenges
Career Development
Start with simple projects
Learn from others
Try hard to get involved in some large studies
Think about how to do things better, and better in what sense.
More papers.
Develop new tools and packages.
Write more grants
Training
SAMSI videos and slides for summer schools and lectures.
Short courses at major conferences.
New Graduate Courses
Collaborations
Good Mentors: Theory and Applications.
Good Collaborators: Radiology, Neuroscience,
Psychiatry, Psychology, Computer Science, …
Data Sets
Big Public Data Sets:
• Alzheimer’s Disease Neuroimaging Initiative (ADNI)
• NIH MRI Study of Normal Brain Development
• National Database for Autism Research
• Human Connectome Project
• The Cancer Genome Atlas (TCGA)
• UK Biobank
https://en.wikipedia.org/wiki/List_of_neuroscience_databases
UK Biobank Project
The Human Connectome Project
The HCP aims to elucidate the neural pathways that underlie brain function and behavior.
Resting-state fMRI (rfMRI) and dMRI provide information about brain connectivity.
Task-evoked fMRI reveals much about brain function.
Structural MRI captures the shape of the highly convoluted cerebral cortex.
Behavioral data relate brain circuits to individual differences in cognition, perception, and personality.
Magnetoencephalography (MEG) combined with electroencephalography (EEG) yields information about brain function on a millisecond time scale.
The Heavily Connected Brain
Peter Stern, “Connection, connection, connection…”, Science, Nov. 1, 2013, Vol. 342, no. 6158, p. 577
Software
http://www.nitrc.org/
NITRC = The Source for Neuroimaging Tools and Resources
Statistical Parametric Mapping (SPM)
FMRIB Software Library (FSL)
Analysis of Functional NeuroImages (AFNI)
3D Slicer
FreeSurfer
……
Conferences
Human Brain Mapping (HBM)
ISMRM conference
SNF conference
Information Processing in Medical Imaging (IPMI)
SIAM Conference on Imaging Science (IS)
Medical Image Computing and Computer Assisted Intervention (MICCAI)
International Symposium on Biomedical Imaging (ISBI)
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Neural Information Processing Systems Foundation (NIPS)
Publications
NeuroImage
Medical Image Analysis
IEEE Transactions on Medical Imaging
Human Brain Mapping
IEEE Transactions on Signal Processing
IEEE Transactions on Image Processing
IEEE Signal Processing Magazine
SIAM Journal on Imaging Sciences
IEEE Transactions on Pattern Analysis and Machine Intelligence
Annals of Applied Statistics
Biometrics
Biostatistics
Journal of the American Statistical Association, ACS
Part 3. Software Challenges
http://www.nitrc.org/
NITRC = The Source for Neuroimaging Tools and Resources
Software Development
Our community lacks a good, widely used statistical software package for neuroimaging data analysis.
Software Development
• Share responsibilities and information
• Common input and output files compatible with major packages
• Build small Rcpp and Matlab packages
• Release them through your own websites, our Neuroconduct website, and http://www.nitrc.org/
• Focus on a few key tools and expand from them
• Encourage other groups to download and use them.
Start a Neuroconduct project
Software Development
1. Simulators for different imaging modalities
• Evaluate image processing tools
• Evaluate statistical methods (group analysis, reliability)
2. Standardize all image processing and analysis pipelines
• fMRI and resting fMRI
• EEG/MEG
• DTI
• CT
• Calcium
• PET
3. Develop new tools to do multi-modal analysis
4. Develop new tools to integrate imaging, genetic, and clinical data
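As a rough illustration of item 1, the sketch below simulates a two-group study with known active voxels and scores a simple detection rule against that ground truth. All names, the Gaussian noise model, the effect size, and the threshold |t| > 3 are assumptions made for the example. A realistic simulator would model the physics and spatial structure of a specific modality.

```python
import math
import random
from statistics import mean, stdev


def simulate_group_study(n_voxels, n_per_group, active, effect, seed=0):
    """Return (controls, patients): lists of subjects, each a list of voxel
    values. Patients get an added `effect` in the `active` voxels."""
    rng = random.Random(seed)
    controls = [[rng.gauss(0.0, 1.0) for _ in range(n_voxels)]
                for _ in range(n_per_group)]
    patients = [[rng.gauss(effect if v in active else 0.0, 1.0)
                 for v in range(n_voxels)]
                for _ in range(n_per_group)]
    return controls, patients


def welch_t(x, y):
    """Welch two-sample t statistic for one voxel."""
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se


def detection_rates(detected, active, n_voxels):
    """True-positive and false-positive rates against the known truth."""
    tpr = len(detected & active) / len(active)
    fpr = len(detected - active) / (n_voxels - len(active))
    return tpr, fpr


N_VOXELS = 50
ACTIVE = set(range(5))  # ground truth, known only because the data are simulated
controls, patients = simulate_group_study(N_VOXELS, 20, ACTIVE, effect=2.0, seed=1)
t_map = [welch_t([s[v] for s in patients], [s[v] for s in controls])
         for v in range(N_VOXELS)]
detected = {v for v, t in enumerate(t_map) if abs(t) > 3.0}
tpr, fpr = detection_rates(detected, ACTIVE, N_VOXELS)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Because the truth is known by construction, the same harness could be rerun with different seeds, effect sizes, or detection rules to compare methods on equal footing.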