integrative multi-scale analyses
DESCRIPTION
Integrative analyses of large scale spatio-temporal datasets play increasingly important roles in many areas of science and engineering. Our recent work in this area is motivated by application scenarios involving complementary digital microscopy, Radiology and "omic" analyses in cancer research. In these scenarios, our objective is to use a coordinated set of image analysis, feature extraction and machine learning methods to predict disease progression and to aid in targeting new therapies. We describe methods we have developed for extraction, management and analysis of features along with the systems software methods for optimizing execution on high end CPU/GPU platforms. We will also describe biomedical results obtained from these studies and extensions of the computational methods to broader application areas.
TRANSCRIPT
Joel Saltz MD, PhD
Emory University
February 2013
Integrative Multi-Scale Analyses
Center for Comprehensive Informatics
a.k.a. “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Application Targets
• Multi-dimensional spatial-temporal datasets
– Radiology and Microscopy Image Analyses
– Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation
– Biomass monitoring and disaster surveillance using multiple types of satellite imagery
– Weather prediction using satellite and ground sensor data
– Analysis of Results from Large Scale Simulations
• Correlative and cooperative analysis of data from multiple sensor modalities and sources
• What-if scenarios and multiple design choices or initial conditions
Core Transformations
• Data Cleaning and Low Level Transformations
• Data Subsetting, Filtering, Subsampling
• Spatio-temporal Mapping and Registration
• Object Segmentation
• Feature Extraction, Object Classification
• Spatio-temporal Aggregation
• Change Detection, Comparison, and Quantification
Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)
National Science Foundation Grand Challenge in Land Cover Dynamics
• Remote sensing analysis of high resolution satellite images.
• Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling
• Maps of the world's tropical rain forest during the past three decades.
Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, John Townshend
Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results
Dimitri Mavriplis, Raja Das, Joel Saltz
a.k.a. “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Whole Slide Imaging: Scale
Data per slide: 500MB to 100GB
Roughly 250-500M Slides/Year in USA
Total: 0.1-10 Exabytes/year
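The yearly total comes from multiplying per-slide size by slide count. The sketch below works through that arithmetic, assuming decimal units (1 MB = 10^6 bytes), which the slide does not specify. Multiplying the extremes gives a span of 0.125 to 50 EB/year. The slide's upper figure of 10 EB is lower, presumably because few slides reach 100 GB, though that reading is an assumption.

```python
# Decimal storage units (an assumption; the slide does not say MB vs MiB).
MB = 10**6
GB = 10**9
EB = 10**18


def yearly_exabytes(bytes_per_slide: int, slides_per_year: int) -> float:
    """Total yearly data volume in exabytes."""
    return bytes_per_slide * slides_per_year / EB


# Smallest slides at the lowest volume, largest slides at the highest volume.
low = yearly_exabytes(500 * MB, 250 * 10**6)
high = yearly_exabytes(100 * GB, 500 * 10**6)
print(f"{low} to {high} EB/year")
```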
Pathology Computer Assisted Diagnosis
Shimada, Gurcan, Kong, Saltz
Computerized Classification System for Grading Neuroblastoma
• Background Identification
• Image Decomposition (Multi-resolution levels)
• Image Segmentation (EMLDA)
• Feature Construction (2nd order statistics, Tonal Features)
• Feature Extraction (LDA) + Classification (Bayesian)
• Multi-resolution Layer Controller (Confidence Region)
[Flowchart: multi-resolution tile classification]
TRAINING: Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training
TESTING: Image Tile, Initialization (I = L), Background? (Yes/No), Create Image I(L), Segmentation, Feature Construction, Feature Extraction, Classification, Within Confidence Region? (Yes/No), I = I - 1, I > 1? (Yes/No), Label
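The multi-resolution layer controller can be sketched as a loop that starts at the coarsest level and moves to a finer one only when the classifier is not confident. This is a minimal illustration, not the authors' implementation: the `classify` callback, the `min_confidence` threshold and the 0-based pyramid indexing are all hypothetical stand-ins for the confidence-region test in the flowchart.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical classifier signature: tile at one resolution -> (label, confidence in [0, 1]).
Classifier = Callable[[object], Tuple[str, float]]


def classify_multiresolution(
    pyramid: Sequence[object],
    classify: Classifier,
    min_confidence: float = 0.9,
) -> Tuple[str, int]:
    """Classify a tile, refining resolution only when needed.

    pyramid[0] is full resolution and pyramid[-1] is the coarsest level.
    Returns (label, index of the level at which the decision was made).
    """
    level = len(pyramid) - 1
    while True:
        label, confidence = classify(pyramid[level])
        # Accept a confident decision, or the best available one at full resolution.
        if confidence >= min_confidence or level == 0:
            return label, level
        level -= 1


def toy_classifier(tile_width: int) -> Tuple[str, float]:
    # Toy stand-in: a "tile" is just its pixel width; wider tiles give more confidence.
    return "class_a", min(1.0, tile_width / 1024)


label, level = classify_multiresolution([2048, 1024, 512, 256], toy_classifier)
fallback = classify_multiresolution([10, 5], toy_classifier)
```

Most tiles would be settled at a coarse level under this scheme, so the expensive full-resolution path runs only for ambiguous tiles.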
Using TCGA Data to Study Glioblastoma
Diagnostic Improvement
Molecular Classification
Predictors of Progression
Digital Pathology
Neuroimaging
TCGA Network
Morphological Tissue Classification
Nuclei Segmentation
Cellular Features
Lee Cooper, Jun Kong
Whole Slide Imaging
Oligodendroglioma Astrocytoma
Nuclear Qualities
Can we use image analysis of TCGA GBMs to inform diagnostic criteria based on molecular or clinical endpoints?
Application: Oligodendroglioma Component in GBM
Millions of Nuclei Defined by n Features
• Bottom-up analysis: let features define and drive the analysis
• Top-down analysis: use the features with existing diagnostic constructs
TCGA Whole Slide Images
Jun Kong
Nuclear Analysis Workflow
Step 1: Nuclei Segmentation
• Identify individual nuclei and their boundaries
Step 2: Feature Extraction
• Describe individual nuclei in terms of size, shape, and texture
Step 3: Nuclei Classification
• Nuclear Qualities: Oligodendroglioma (1) to Astrocytoma (10)
Representative Nuclei: Oligo, Astro
Comparison of Machine-based Classification to Human Based Classification
Separation of GBM, Oligo1, Oligo2 as Designated by Neuropathologists
Separation of GBM, Oligo1 and Oligo2 as Designated by Machine
Survival Analysis
Human Machine
Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
Oligo Related Genes
• Myelin Basic Protein
• Proteolipoprotein
• HoxD1
Nuclear features most associated with Oligo Signature Genes:
• Circularity (high)
• Eccentricity (low)
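For reference, the two shape features named above can be computed from a segmented nuclear boundary as follows. The definitions used here (circularity = 4πA/P², eccentricity from the axes of a fitted ellipse) are common conventions and are assumed, since the slides do not state the exact formulas used in the study.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area_perimeter(boundary: List[Point]) -> Tuple[float, float]:
    """Area (shoelace formula) and perimeter of a closed polygon."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(boundary)
    for i in range(n):
        x0, y0 = boundary[i]
        x1, y1 = boundary[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter


def circularity(boundary: List[Point]) -> float:
    """4*pi*A / P^2: equals 1 for a circle and is smaller for irregular shapes."""
    area, perimeter = polygon_area_perimeter(boundary)
    return 4.0 * math.pi * area / perimeter**2


def eccentricity(major_axis: float, minor_axis: float) -> float:
    """Eccentricity of an ellipse: 0 for a circle, approaching 1 when elongated."""
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# A 64-sided regular polygon stands in for a round nucleus.
round_nucleus = [
    (math.cos(2 * math.pi * k / 64), math.sin(2 * math.pi * k / 64)) for k in range(64)
]
print(circularity(square), circularity(round_nucleus), eccentricity(5.0, 3.0))
```

Under these definitions a round, compact nucleus scores high on circularity and low on eccentricity, which is the pattern the slide associates with the oligo signature genes.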
Millions of Nuclei Defined by n Features
• Bottom-up analysis: let nuclear features define and drive the analysis
• Top-down analysis: analyze features in context of existing diagnostic constructs
Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information
Lee Cooper, Carlos Moreno
Consensus clustering of morphological signatures
Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
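The co-clustering idea can be sketched as follows: run K-means many times, and for each pair of samples record how often they land in the same cluster. This is a toy illustration under stated assumptions. Resampling 80% of the points per run is a common consensus-clustering choice that the slides do not confirm, and `kmeans` and `consensus_matrix` are illustrative names, not the study's code.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def _dist2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, rng: random.Random, iters: int = 20) -> List[int]:
    """Plain Lloyd's algorithm; returns a cluster index for each point."""
    centers = list(rng.sample(points, k))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: _dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_matrix(points: List[Vector], k: int, runs: int = 200,
                     frac: float = 0.8, seed: int = 0) -> List[List[float]]:
    """Fraction of runs in which each pair co-clusters, among runs that sampled both."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Two well-separated toy groups of one-dimensional "signatures".
toy = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
consensus = consensus_matrix(toy, k=2, runs=50)
print(consensus[0][1], consensus[0][3])
```

A candidate number of clusters is well supported when the consensus values sit close to 0 or 1, meaning pairs are grouped consistently from run to run.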
Nuclear Features Used to Classify GBMs
[Figure: clustering heat map with clusters labeled 1 to 3; silhouette area versus number of clusters (2 to 7); silhouette value by cluster (1 to 3)]
Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
• Prognostically-significant (logrank p=4.5e-4)
[Figure: feature indices by cluster (CC, CM, PB); survival versus days (0 to 3000) for CC, CM and PB]
Associations
Molecular Correlates of MR Features Using TCGA Data
MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools
MR Features compared to TCGA Transcriptional Classes and Genetic Alterations
David Gutman
Capturing structured annotations and markups/ AIM Data Service
VASARI Feature Set
Scott Hwang, Chad Holder, Adam Flanders
Prognostic Significance of VASARI Features
Tests Between Groups: 0-33% vs. 34-95% Proportion enhancing
Test        ChiSquare   DF   P-Value
Log-Rank    12.4775     3    0.0059*
Wilcoxon    10.0802     3    0.0179*
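The reported P-values can be checked from the chi-square statistics and degrees of freedom in the table. The sketch below uses the closed form of the chi-square upper tail for 3 degrees of freedom, Q(x) = erfc(√(x/2)) + (2/√π)·√(x/2)·e^(−x/2), so it needs only the standard library. The function name is illustrative.

```python
import math


def chi2_upper_tail_3df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    half = x / 2.0
    return math.erfc(math.sqrt(half)) + (
        2.0 / math.sqrt(math.pi) * math.sqrt(half) * math.exp(-half)
    )


# Statistics as reported in the table above.
p_logrank = chi2_upper_tail_3df(12.4775)
p_wilcoxon = chi2_upper_tail_3df(10.0802)
print(f"Log-Rank p = {p_logrank:.4f}, Wilcoxon p = {p_wilcoxon:.4f}")
```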
a.k.a. “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!
Extreme DataCutter Prototype
DataCutter
• Pipeline of filters connected through logical streams
• In transit processing
• Flow control between filters and streams
• Developed 1990s-2000s; led to IBM System S
Extreme DataCutter
• Two level hierarchical pipeline framework
• In transit processing
• Coarse grained components coordinated by a Manager that distributes work on pipeline stages between nodes
• Fine grained pipeline operations managed at the node level
• Both levels employ filter/stream paradigm
Bottom line – everything ends up as DAGs
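The filter/stream paradigm can be illustrated with a generic sketch in which each filter runs in its own thread and bounded queues act as streams, so a slow downstream filter applies back-pressure to the ones before it. This is not the DataCutter API: `run_filter`, `run_pipeline` and `buffer_size` are hypothetical names, and the sketch omits error handling (a filter that raises would stall the pipeline).

```python
import queue
import threading
from typing import Callable, Iterable, List

_END = object()  # sentinel marking the end of a stream


def run_filter(fn: Callable, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Read items from one stream, transform them, write them to the next."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate end-of-stream downstream
            return
        outbox.put(fn(item))


def run_pipeline(source: Iterable, filters: List[Callable], buffer_size: int = 4) -> list:
    # One bounded queue per stream; put() blocks when a stream is full (flow control).
    streams = [queue.Queue(maxsize=buffer_size) for _ in range(len(filters) + 1)]
    workers = [
        threading.Thread(target=run_filter, args=(fn, streams[i], streams[i + 1]))
        for i, fn in enumerate(filters)
    ]

    def feed() -> None:
        for item in source:
            streams[0].put(item)
        streams[0].put(_END)

    workers.append(threading.Thread(target=feed))
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


squares = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * x])
print(squares)
```

A linear chain is the simplest case; a filter that fans out to several streams, or merges several, turns the same structure into a general DAG.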
Extreme DataCutter – Two Level Model
Coarse Grained Level
Node Level Work Scheduling
Fine Grained Level
Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)
Challenge: Structured/Unstructured Grid Calculations with Unpredictable Runtime Dependencies
Key Kernel in Distance Transform, Morphological Reconstruction, Delaunay Triangulation
“Speedup” relative to single CPU core
Large Scale Data Management
Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
Highly optimized spatial query and analyses
Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
Spatial Centric – Pathology Imaging “GIS”
PAIS: Example Queries
• Point query: human marked point inside a nucleus
• Window query: return markups contained in a rectangle
• Spatial join query: algorithm validation/comparison
• Containment query: nuclear feature aggregation in tumor regions
Algorithm Validation: Intersection between Two Result Sets (Spatial Join)
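Two of the query types above, the point query and the window query, can be sketched in a few lines over polygon markups. This is a brute-force illustration with hypothetical names (`point_query`, `window_query`, the `markups` dictionary). It does not reflect the PAIS schema or the indexed, optimized implementations the slide refers to.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray-casting test; points exactly on an edge are not handled specially."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge crosses the horizontal line through pt
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(poly: Polygon) -> Box:
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return min(xs), min(ys), max(xs), max(ys)


def point_query(markups: Dict[str, Polygon], pt: Point) -> List[str]:
    """Ids of markups (e.g. segmented nuclei) that contain the marked point."""
    return [mid for mid, poly in markups.items() if point_in_polygon(pt, poly)]


def window_query(markups: Dict[str, Polygon], window: Box) -> List[str]:
    """Ids of markups lying entirely inside the rectangle."""
    wx0, wy0, wx1, wy1 = window
    hits = []
    for mid, poly in markups.items():
        x0, y0, x1, y1 = bounding_box(poly)
        if wx0 <= x0 and wy0 <= y0 and x1 <= wx1 and y1 <= wy1:
            hits.append(mid)
    return hits


markups = {
    "nucleus_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "nucleus_b": [(10, 10), (12, 10), (11, 13)],
}
print(point_query(markups, (2, 2)), window_query(markups, (9, 9, 13, 14)))
```

Because the window is a convex rectangle, a polygon lies inside it exactly when its bounding box does, which is why the window query needs no per-edge test.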
VLDB 2012
Change Detection, Comparison, and Quantification
Approach to Integrated Sensor Data Analysis Framework
• Abstract templates specify dataset geometry
• Templates describe collections of space-time regions
• Mapping to memory hierarchies provided by user defined mapping functions
• Leverages Parashar’s DataSpaces
a.k.a. “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
• Example Project: Find hot spots in readmissions within 30 days
– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
– What fraction of patients with a given set of diseases will be readmitted within 30 days?
– How do severity and time course of co-morbidities affect readmissions?
– Geographic analyses
• Compare and contrast with UHC Clinical Data Base
– Repeat analyses across all UHC hospitals
– Are we performing the same?
– How are UHC-curated groupings of patients (e.g., product lines) useful?
• Need a repeatable process that we can apply identically to both local and UHC data
Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
Overall System
[Architecture diagram. Components: Source data (three sources), Database Mapper, Data Processing, Metadata Manager, Metadata Repository, Query Specification, I2b2 Database, I2b2 Web Server, Query tools, Study-specific Database. Roles: Investigator, Data Analyst, Data Modeler]
5-year Datasets from Emory and University Healthcare Consortium
• EUH, EUHM and WW (inpatient encounters)
• Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
• Encounter location (down to unit for Emory)
• Providers (Emory only)
• Discharge disposition
• Primary and secondary ICD9 codes
• Procedure codes
• DRGs
• Medication orders (Emory only)
• Labs (Emory only)
• Vitals (Emory only)
• Geographic information (CDW only + US Census and American Community Survey)
Analytic Information Warehouse
Using Emory & UHC Data to Find Associations With 30-day Readmits
• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining
– Too many diagnosis codes, procedure codes
– Continuous variables (e.g., labs) require interpretation
– Temporal relationships between variables are implicit
• Solution: Transform the data into a much smaller set of variables using heuristic knowledge
– Categorize diagnosis and procedure codes using code hierarchies
– Classify continuous variables using standard interpretations (e.g., high, normal, low)
– Identify temporal patterns (e.g., frequency, duration, sequence)
– Apply standard data mining techniques
Analytic Information Warehouse
Derived Variables
• 30-day readmit
• The 9 Emory Enhanced Risk Assessment Tool diagnosis categories
• UHC product lines
• Variables derived from a combination of codes and/or laboratory test results
– Obesity
– Diabetes/uncontrolled diabetes
– End-stage renal disease (ESRD)
– Pressure ulcer
– Sickle cell disease/sickle cell crisis
• Temporal variables derived over multiple encounters
– Multiple MI
– Multiple 30-day readmissions
– Chemotherapy within 180 (or 365) days before surgery
– Previous encounter within the last 90 (or 180) days
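A temporal derived variable such as the 30-day readmit flag can be computed by ordering each patient's encounters and comparing each discharge with the next admission. The sketch below is illustrative only. The `Encounter` record and its field names are hypothetical, and the exclusions mentioned earlier in the deck (chemotherapy and radiation therapy readmits) are not modeled.

```python
from datetime import date
from typing import Dict, List, NamedTuple


class Encounter(NamedTuple):
    patient_id: str
    admit: date
    discharge: date


def flag_30_day_readmits(encounters: List[Encounter],
                         window_days: int = 30) -> Dict[Encounter, bool]:
    """Mark an encounter True if the same patient is admitted again
    within window_days of its discharge. Assumes encounters are distinct."""
    by_patient: Dict[str, List[Encounter]] = {}
    for enc in encounters:
        by_patient.setdefault(enc.patient_id, []).append(enc)

    flags: Dict[Encounter, bool] = {}
    for visits in by_patient.values():
        visits.sort(key=lambda e: e.admit)
        for current, following in zip(visits, visits[1:]):
            gap = (following.admit - current.discharge).days
            flags[current] = 0 <= gap <= window_days
        flags[visits[-1]] = False  # no later encounter on record
    return flags


encounters = [
    Encounter("p1", date(2012, 1, 1), date(2012, 1, 5)),
    Encounter("p1", date(2012, 1, 20), date(2012, 1, 22)),
    Encounter("p1", date(2012, 6, 1), date(2012, 6, 3)),
    Encounter("p2", date(2012, 3, 1), date(2012, 3, 2)),
]
flags = flag_30_day_readmits(encounters)
print([flags[e] for e in encounters])
```

The same pattern, with a different window and a look-back instead of a look-ahead, would yield variables such as "previous encounter within the last 90 days".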
30-Day Readmission Rates for Derived Variables
Emory Health Care
Geographic Analyses
UHC Medicine General Product Line (#15)
Analytic Information Warehouse
Predictive Modeling for Readmission
• Random forests (ensemble of decision trees)
– Create a decision tree using a random subset of the variables in the dataset
– Generate a large number of such trees
– All trees vote to classify each test example in a training dataset
– Generate a patient-specific readmission risk for each encounter
• Rank the encounters by risk for a subsequent 30-day readmission
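The voting scheme described above can be sketched with a deliberately simplified forest. Each "tree" here is a one-split stump trained on a bootstrap sample and a random subset of binary derived variables, and the risk is the fraction of trees voting for readmission. The stump simplification, the toy data and all function names are assumptions for illustration, not the model used in the study.

```python
import random
from typing import List, Sequence, Tuple

Row = Sequence[int]           # binary derived variables for one encounter
Stump = Tuple[int, int, int]  # (variable index, vote if variable == 0, vote if variable == 1)


def fit_stump(rows: List[Row], labels: List[int], features: List[int]) -> Stump:
    """Pick the candidate variable whose majority-vote split makes the fewest errors."""
    best = None
    for f in features:
        votes = []
        for value in (0, 1):
            subset = [y for r, y in zip(rows, labels) if r[f] == value]
            votes.append(int(sum(subset) * 2 > len(subset)) if subset else 0)
        errors = sum(votes[r[f]] != y for r, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, (f, votes[0], votes[1]))
    return best[1]


def fit_forest(rows: List[Row], labels: List[int], n_trees: int = 100,
               n_features: int = 2, seed: int = 0) -> List[Stump]:
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]           # bootstrap sample of encounters
        features = rng.sample(range(len(rows[0])), n_features)  # random subset of variables
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest


def readmission_risk(forest: List[Stump], row: Row) -> float:
    """Fraction of trees voting 'readmitted' for this encounter."""
    return sum(v1 if row[f] else v0 for f, v0, v1 in forest) / len(forest)


# Toy data: columns are hypothetical flags; the label follows the first column.
rows = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
forest = fit_forest(rows, labels, n_trees=50)
risk_hi = readmission_risk(forest, [1, 0, 0])
risk_lo = readmission_risk(forest, [0, 0, 0])
ranked = sorted(rows, key=lambda r: readmission_risk(forest, r), reverse=True)
```

Sorting encounters by this score gives the risk ranking described in the last bullet.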
Sharath Cholleti
Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
Status of Clinical Phenotype Characterization
• Integrative dataset analysis can leverage patient information gathered over many encounters
• Temporal analyses can generate derived variables that appear to correlate with readmissions
• Predictive modeling has promise of providing decision support
• Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper
• Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker and quality improvement efforts
• Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue
a.k.a. “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
• Digital Pathology R01(s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
• ORNL HPC collaboration: Scott Klasky, David Pugmire ORNL
Thanks to
• National Cancer Institute
• National Library of Medicine
• National Science Foundation
• Cardiovascular Research Grid (NHLBI)
• Minority Health Grid (ARRA)
• Emory Health Care
• Kaiser Health Care
• Winship Cancer Institute
• Oak Ridge National Laboratory
• Woodruff Health Sciences
Thanks!