Bioinformatics and Life Sciences –
Standards and Programming for
Heterogeneous Architectures
Eric Stahlberg, Ph.D.
(SAIC-Frederick contractor)
stahlbergea@mail.nih.gov
SIAM Conference on Parallel Processing for Scientific Computing
Savannah, GA,
February 16, 2012
Caveats: Content and statements following do not constitute any official position or
endorsement, whether stated or implied.
All copyrights of referenced material remain with the original owner.
Context for Heterogeneous Acceleration
• Cancer kills every 55 seconds
• Cancer research utilizes bioinformatics heavily
• Bioinformatics is computationally intensive
• Faster solutions help cancer research move faster
• Faster and better clinical applications help to impact patient lives
• Today’s Goal: Encourage paths to improve bioinformatics applications for cancer research
Three Key Needs
• Faster and better applications
• Better education and preparation in parallel and distributed computing
• Better and faster data handling solutions
SAIC-Frederick, Inc.
• Technical and operations contractor to the U.S. National Cancer Institute
• Federally Funded Research and Development Center for DHHS
• Many technical and operational areas of support for the NCI, including bioinformatics
NCI Center for Cancer Research
BRANCHES
Cell and Cancer Biology
Dermatology
Experimental Immunology
Experimental Transplantation and Immunology
Genetics
HIV and AIDS Malignancy
HIV DRP Host-Virus Interaction
Medical Oncology
Metabolism
Neuro-Oncology
Pediatric Oncology
Radiation Biology
Radiation Oncology
Surgery
Urologic Oncology
Vaccine
LABS
Basic Research Laboratory
Cancer and Developmental Biology Laboratory
Chemical Biology Laboratory
Gene Regulation and Chromosome Biology Laboratory
HIV DRP Retroviral Replication Laboratory
Laboratory of Biochemistry and Molecular Biology
Laboratory of Cancer Biology and Genetics
Laboratory of Cancer Prevention
Laboratory of Cell and Developmental Signaling
Laboratory of Cell Biology
Laboratory of Cellular and Molecular Biology
Laboratory of Cellular Oncology
Laboratory of Experimental Carcinogenesis
Laboratory of Experimental Immunology
Laboratory of Genome Integrity
Laboratory of Human Carcinogenesis
Laboratory of Immune Cell Biology
Laboratory of Metabolism
Laboratory of Molecular Biology
Laboratory of Molecular Immunoregulation
Laboratory of Molecular Pharmacology
Laboratory of Pathology
Laboratory of Population Genetics
Laboratory of Protein Dynamics and Signaling
Laboratory of Receptor Biology and Gene Expression
Laboratory of Tumor Immunology and Biology
Macromolecular Crystallography Laboratory
Molecular Targets Laboratory
Structural Biophysics Laboratory
PROGRAMS
Cancer and Inflammation
CCR Nanobiology
HIV Drug Resistance
Molecular Discovery
Molecular Imaging
Mouse Cancer Genetics
Life Science Application Areas
• Image processing
– 3D imaging
– 2D imaging
• Sequence and protein analysis
– Microarray
– Next Generation Sequence Analysis
– Proteomics
• Simulation
– Molecular interactions and dynamics
– Complex systems biology simulations
• Data mining and analytics
– Statistics
– Graph and cluster analysis
– Population analysis
Dataflow View of Basic Biology
[Figure: dataflow diagram of basic biology. Legend: data source, transform, process. Flows: DNA information flow, protein feedback loop, intercellular communication. Elements: DNA, RNA, proteins, duplicated DNA, new cell, intra-cellular functions, cell functions. Steps: transcription, translation, mitosis.]
Metabolic Pathways at Higher Resolution
Source: http://web.expasy.org/cgi-bin/pathways/show_thumbnails.pl
Next Generation Sequencing Focus
• Used to understand complex biological systems
• Common types of NGS applications
– ChIPseq
– RNAseq
– miRNAseq
– Epigenetic studies
• Large and growing dataset sizes
• Identify, associate, and compare within individual experiments
• Integrate and compare across experiments
Data Acquisition Costs Plummet
Large Data Challenges
Big Data = Big Challenges
• Volume of available data is growing rapidly
• One run produces hundreds of gigabytes of data*
• Policy issues
– HIPAA, security and protection
– Move it, store it, delete it?
– Validation and clinical liability
• Metadata - reliable secondary value
*Reference: Barski and Zhao, Journal of Cellular Biochemistry, 107:11-18, 2009
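A back-of-the-envelope sketch can make the "move it, store it, delete it?" question concrete. The run size, link speeds, and the 70% efficiency factor below are illustrative assumptions, not figures from the talk, and `transfer_hours` is a hypothetical helper name.

```python
def transfer_hours(size_gb: float, link_gbit_per_s: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move size_gb gigabytes over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved (protocol overhead, contention, disk speed).
    """
    if size_gb < 0 or link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed, or efficiency")
    size_gbit = size_gb * 8  # gigabytes -> gigabits
    seconds = size_gbit / (link_gbit_per_s * efficiency)
    return seconds / 3600


# Assumed example: one 300 GB sequencing run over three hypothetical links.
for link in (0.1, 1.0, 10.0):
    print(f"{link:>5} Gbit/s: {transfer_hours(300, link):.2f} h")
```

Multiplying the per-run estimate by the number of runs in an experiment gives a rough sense of when moving the data stops being practical.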
Generic NextGen Workflow
General NGS Workflow
Sequence Acquisition → Data Quality Evaluation → Sequence Read Mapping → Analysis of Mapped Reads → Compare Across Samples
• Experimental data is progressively concentrated to become knowledge for decision making
• Per-sample volume of information is reduced as data is analyzed
• Concentrated results are integrated to inform decisions
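The stages above can be sketched as a toy pipeline in which each step hands a smaller result to the next. This is only an illustration under simplifying assumptions: reads are `(sequence, quality)` pairs with Phred+33 quality strings, mapping is exact-match only, and all function names, the quality threshold, and the bin size are hypothetical.

```python
from collections import Counter


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def quality_filter(reads, min_mean_q=20.0):
    """Data quality evaluation: keep reads whose mean quality is high enough."""
    return [(seq, qual) for seq, qual in reads
            if qual and mean_phred(qual) >= min_mean_q]


def map_reads(reads, reference):
    """Read mapping (exact match only, for brevity): start position of each hit."""
    positions = []
    for seq, _ in reads:
        pos = reference.find(seq)
        if pos != -1:
            positions.append(pos)
    return positions


def summarize(positions, bin_size=10):
    """Analysis of mapped reads: number of reads starting in each reference bin."""
    return Counter(pos // bin_size for pos in positions)


def compare(summary_a, summary_b):
    """Compare across samples: per-bin difference in read counts."""
    bins = sorted(set(summary_a) | set(summary_b))
    return {b: summary_a[b] - summary_b[b] for b in bins}


# Toy data: four reads shrink to three, then to two positions, then to bin counts.
reference = "ACGTACGTTTGACCAGTAGGCAT"
reads = [("ACGT", "IIII"), ("TTGA", "!!!!"), ("CCAG", "IIII"), ("GGGG", "IIII")]
kept = quality_filter(reads)
positions = map_reads(kept, reference)
summary = summarize(positions)
print(len(reads), len(kept), len(positions), dict(summary))
```

Each stage returns less data than it received, which is the concentration the bullets describe.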
Example Areas in Next Gen Sequencing
Illustrative Next Gen Sequencing Apps
• Genome Assembly
– Combine small fragments of DNA/RNA into high-confidence composite contigs
– Connect the small pieces into a larger ‘string’ consistent with observed sequences and known biology
• Read Mapping
– Start with a known baseline reference genome
– Map smaller pieces of DNA/RNA to their “correct” location on the reference genome, allowing for mismatches, insertions, and deletions
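Both application areas can be illustrated with deliberately naive sketches. The assembly function below uses a simple greedy suffix/prefix overlap merge, and the mapping function tolerates mismatches only (Hamming distance); handling insertions and deletions would need a dynamic-programming alignment that is not shown. Neither is how production assemblers or mappers work, and all names and thresholds are assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Genome assembly sketch: repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]


def map_read(read: str, reference: str, max_mismatches: int = 1):
    """Read mapping sketch: best (position, mismatches) placement, or None."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for x, y in zip(read, window) if x != y)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


contigs = greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"])
print(contigs)
print(map_read("CGGT", contigs[0]))
```

Both functions scan every candidate, so the cost grows quickly with the number of fragments and the reference length. That growth is one reason these workloads are candidates for acceleration.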
RNA Sequencing Overview
Source: http://www.bgisequence.com/eu/services/sequencing-services/rna-sequencing/rna-seq/
Challenges in Next Gen Sequencing
Key NGS Challenges
• Transferring large datasets
• Processing huge datasets
• Integrating datasets
• Proliferation of sequencing capabilities
• Growing data volumes too great to store results
• Overcoming ambiguity with algorithmic improvements
• Reproducibility over time
• Translation to clinical application
• Applications are parallel but not system friendly
Contrasting Application Goals
Research vs. Clinical Application
Research Application Aims
• Agile
• Rapid incorporation of new advances
• Ad hoc development process
• Open source
• Documented as needed
• Generally portable
• Limited liability for failures
• Marginal testing
• Reproducibility
• Speed
Clinical Application Aims
• Stable
• Measured incorporation of proven advances
• Development process required
• Licensed and proprietary
• Well documented
• Supportable
• Liability for failure
• Certification of testing
• Reproducibility
• Speed
Three Key Needs
• Faster and better applications
• Better education and preparation in parallel and distributed computing
• Better and faster data handling solutions
Why Better Education in PDC?
• Improved application development
– Higher speed applications
– More robust applications in PDC environments
– More efficient applications
– Better interoperability among PDC technologies
• More effective application use at run time
– Analysts know how to use parallel computing effectively
– Understanding of scalability to better relate problem size to computational resources
– Improved planning of large computational analysis efforts
– Better run-time efficiency
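One standard way to relate problem size to computational resources is Amdahl's law, a textbook model that the slides do not name and that is used here only as an illustration. The sketch assumes a fixed fraction of the work parallelizes perfectly; the 96-hour job and the 90% parallel fraction are made-up inputs.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup when parallel_fraction of the work scales across workers."""
    if not 0.0 <= parallel_fraction <= 1.0 or workers < 1:
        raise ValueError("parallel_fraction must be in [0, 1] and workers >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def estimated_hours(serial_hours: float, parallel_fraction: float, workers: int) -> float:
    """Projected wall-clock time for a job that takes serial_hours on one worker."""
    return serial_hours / amdahl_speedup(parallel_fraction, workers)


# Assumed example: a 96-hour analysis in which 90% of the work parallelizes.
for workers in (1, 8, 64, 512):
    print(f"{workers:>4} workers: {estimated_hours(96, 0.9, workers):6.1f} h")
```

Under these assumptions the projected time levels off near the serial portion no matter how many workers are added, which is the kind of planning insight the bullets point to.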
Changing a Way of Thinking
Education is key
– Teaching parallel and accelerated computing across the CS curriculum
– Innovative NSF-funded project
– Incorporating parallel computing into CS, software development, and computational science
– Workshop and website under development
– See www.accel2apps.org for more information
Courses Enhanced
– Computer Literacy
– Intro to Programming
– Data Structures
– Algorithms
– Programming Languages
– Computer Hardware
– Computational Modeling
– Bioinformatics (applications)
– Computational chemistry (applications)
We gratefully acknowledge the support of the National Science Foundation
Grant CCF-0915805, SHF:Small:RUI:Collaborative Research: Accelerators to Applications –
Supercharging the Undergraduate Computer Science Curriculum
Why Heterogeneous Acceleration?
• Problems are large
– Recent sample runs have taken up to 4 days to compute
– Experiments include many samples
– Data is becoming too large to move
– Instrument systems are becoming smaller and cheaper
– Trend to generate much more data continues
• Technologies are heterogeneous
– Multicore is pervasive and proven
– GPU technology is affordable and available
– FPGAs have a history in fast bioinformatics
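Because experiments include many independent samples, the simplest use of multicore hardware is to analyze one sample per process. The sketch below uses Python's standard `concurrent.futures.ProcessPoolExecutor`; the GC-fraction calculation is only a stand-in for an expensive per-sample analysis, and the function names and sample layout are assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def sample_gc(reads: list) -> float:
    """Stand-in for a costly per-sample analysis: GC fraction over all reads."""
    return gc_fraction("".join(reads))


def analyze_samples(samples: dict, workers: int = 4) -> dict:
    """Analyze independent samples in separate processes, one task per sample."""
    names = list(samples)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(sample_gc, (samples[name] for name in names))
        return dict(zip(names, results))
```

A call such as `analyze_samples({"tumor": [...], "normal": [...]})` would typically sit under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. GPU or FPGA offload follows the same pattern of handing independent work to another device, but needs device-specific tooling not shown here.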
Parallel Computing in Bioinformatics
• Parallel computing and bioinformatics
– 182 articles in PubMed since 1995
• GPU and bioinformatics
– 50 articles dating back to 2007
– 33 articles in CUDA and bioinformatics
• Message passing and bioinformatics
– 26 articles with ‘message passing’ and bioinformatics
• FPGA and bioinformatics
– 22 articles in PubMed since 1993
• OpenMP and bioinformatics
– 6 articles in OpenMP and bioinformatics
• OpenCL and bioinformatics
– 3 articles reported
[Figure: bar chart of article counts (axis 0 to 200) for Parallel Computing, GPU, CUDA, Message Passing, FPGA, OpenMP, and OpenCL]
Biowulf – NIH HPC Resource
Biowulf at NIH
GPU cluster available
Weighing the Merits of Standards
Relative Merits of Standards
• Pros
– Stabilizes development efforts
– Improve portability of algorithms and applications
– Raise productivity and innovation
– Improve robustness of mission-critical applications
– Improve supportability
– Channels creativity and innovation
– Easier education
• Cons
– Takes time for community adoption
– Possible performance penalty in some cases
Open Accelerator
• Not to be confused with OpenACC
• Open Accelerator Initiative provides community knowledge base of accelerated computing activity
– Components, performance, literature, and more to come
• Encourages interoperability among technologies and standards
• Registration services support application reproducibility and certification
• Downloads: OpenFPGA draft GenAPI standard
• Visit www.openaccelerator.org
Summary
• Faster and better applications
– Heterogeneous acceleration
– Support standards and interoperability
– Multiple areas exist
• Better education and preparation in parallel and distributed computing
– Improved application development
– Ease of application use
• Better and faster data handling solutions
– Not addressed here
Acknowledgements
• Colleagues at NCI CCR, SAIC-F ABCC
• National Science Foundation CISE
• Colleagues at Wittenberg and Clemson: Dr. Steven Bogaerts, Dr. Kyle Burke, Dr. Brian Shelburne, Dr. Melissa Smith
• OpenFPGA and OpenAccelerator communities
Contact information:
estahlberg (-at-) gmail.com or stahlbergea (-at-) mail.nih.gov