C O M P U T E | S T O R E | A N A L Y Z E
Enabling patient-centric NGS workflows through a Hadoop®-optimized compute platform and graph analytics
Molecular Tri-Con 2015, February 2015
David Anstey, Global Head Life Sciences, Cray Inc.
The Life Sciences/Healthcare Communities Market and Technology Drivers
• Precision Medicine: The race to understand individual patients, diseases and treatments at the molecular level
• Pace of Technology: Organizations struggling to keep compute infrastructures up to date with rapidly changing life sciences technologies
• Cluster Sprawl: Ad-hoc cluster infrastructures increasing complexity, reliability and usability challenges
• Data Science: New data sources and emerging analytical approaches to enable predictive modeling and knowledge discovery
• Rise of High Performance Analytics: Convergence of analytics and supercomputing opening new opportunities to meet the pace of discovery
Copyright 2015 Cray Inc.
The Quest for In-Time Analytics
[Figure: response time frames, with axis marks at <30ms, 30ms, 10min and >10min. Labels: Low-Latency; Batch; Streaming data; Stationary data; "Few data scientists who wrangle data"; "Business analysts accustomed to interactive time frames"]
Low-latency applications require performance optimizations
• Memory-storage hierarchies
• Fast interconnects
Multi-Step Analytics Pipelines
Data Prep/ETL
Stream Processing
Data Mining
Interactive Queries
Actionable Insight
Analytics Pipeline
Convergence of Analytics and Supercomputing
High-Performance Computing
• Finance: portfolio optimization, pricing, risk
• Energy: seismic modeling
• Life sciences: genomics, drug discovery
• Scientific: simulation, weather forecasting
Traditional Big Data
• Batch analytics
• Undifferentiated systems
“Simulation is the original Big Data Market” – IDC
High-Performance Big Data Analytics
• Low-latency analytics
• Next-generation architecture
Analytics Solutions
Urika-XA™: Extreme Analytics Platform
• Turnkey advanced analytics platform
• Next-generation system architecture
• Engineered for performance
Urika-GD™: Graph Discovery Appliance
• Discover unknown and hidden relationships in big data
• Real-time data discovery
• Realize rapid time-to-value
Single Platform for Varied Analytic Workloads
• In the real world, advanced analytics pipelines require an assembly of task-specific tools
• These tools and tasks place very different demands on infrastructure, and have resulted in specialized infrastructure (appliances)
• Supports the widest range of workloads in a single system footprint
Workload types, from "Throughput Matters" to "Latency Matters":
• Batch Analytics: every item in a dataset once
• Iterative Analytics: same subset of data several times
• Interactive Discovery: different subsets each time
Basic Profiling Statistics Machine Learning Streaming Data Prep
Urika-XA Extreme Analytics Platform
Turnkey Advanced Analytics Platform
• Open platform for both pre-configured and user-installed tools
• Hadoop® and Apache Spark™ ecosystem
• Emerging high performance analytic workloads
• Unified system management interface
Next-Generation System Architecture
• High performance storage technologies
• Battle-tested on cutting-edge government/scientific analytic applications
• Ready for the enterprise
Engineered for Performance
• Dense footprint: 48 nodes (Intel® Xeon® processor), over 1,500 cores, 6 TB memory
• 38 TB SSD and 120 TB POSIX-compliant high-performance storage
• FDR InfiniBand for datapaths and 1 GbE for management
• Scale out to multi-rack configurations
Advanced Analytics at Lower TCO
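The single-rack totals above imply rough per-node resources. The following is a minimal sketch of that arithmetic only. It assumes the 48 nodes are identical, treats "over 1,500 cores" as a lower bound of 1,500, and takes 1 TB as 1,024 GB; the slide states none of these.

```python
# Back-of-envelope per-node figures from the slide's single-rack totals.
# Assumptions (not stated in the slide): all nodes are identical,
# "over 1,500 cores" is treated as a lower bound of 1,500,
# and 1 TB is taken as 1,024 GB.

NODES = 48
MIN_TOTAL_CORES = 1500
TOTAL_MEMORY_TB = 6
GB_PER_TB = 1024


def per_node(total: float, nodes: int = NODES) -> float:
    """Divide a rack-level total evenly across the nodes."""
    return total / nodes


min_cores_per_node = per_node(MIN_TOTAL_CORES)
memory_gb_per_node = per_node(TOTAL_MEMORY_TB * GB_PER_TB)

if __name__ == "__main__":
    print(f"cores per node:  more than {min_cores_per_node:.2f}")
    print(f"memory per node: {memory_gb_per_node:.0f} GB")
```

Under these assumptions each node carries just over 31 cores and 128 GB of memory, which would be about 4 GB of memory per core.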
Pre-Integrated, Open Platform
Pre-Integrated Hardware and Software
• Cloudera Enterprise Hadoop® and Apache Spark™
• Tuned for optimal performance
• Ready to run out of the box
Open and Extensible
• Open platform for user-installed analytics tools
• Scale compute and storage independently
• POSIX compliance for data staging and non-HDFS workloads
• Future-proof, no appliance lock-in
What we’re hearing…
“My bioinfo support team is getting better at handling one exome. But I’m sequencing 1000 exomes this month.” – Researcher, Top 10 Pharma
“… today’s Bio-IT professionals have to design, deploy, and support IT infrastructures with life cycles measured over several years, in the face of an innovation explosion where major laboratory and research enhancements arrive on the scene every few months.” – Chris Dagdigian of The BioTeam
“I’ve been waiting 4 months for my bioinformatics team to do my NGS analysis.” – Researcher, Top 10 Pharma
“Although hundreds of thousands of samples have been sequenced, our ability to find, associate, and implicate genetic variants and candidate disease genes far outstrips our ability to understand them.” – Sean Sanders, Science/AAAS
Three NGS Challenges: Sequence Assembly, Bioinformatics and Data Management
Sequence Assembly is a Bottleneck
• Sequence costs down. Sequence volumes up. Huge volume makes assembly the challenge.
• The rate at which genotypic variation can be characterized is now limited by computational tools, not by sequencing technology.
NGS Bioinformatics is Complex
• High complexity – Interpreting the meaning of NGS sequences involves annotation, integration, visualization, and collaboration, requiring diverse expertise.
• Performance – Bioinformatics is computationally demanding in both performance and scale.
NGS Data Management is Overwhelming
• Production – Huge data volumes and excessive data movement are pushing the limits of many storage and networking infrastructures.
• Archive – NGS workflows generate huge volumes of data, which are both tedious and costly to retain.
Cray’s Next Generation Sequencing Solution: Accelerated Time to Discovery
[Diagram labels: Next Generation Sequencers; Genome Assembly; High Throughput NGS Storage and Archive Environment; Bioinformatics Analytics. Pipeline steps: Base Calling, Assembly, Variant Analysis, QC, Annotation. Applications: Personalized Medicine, Pathway Modeling, Hypothesis Generation, Alternative Indications, Biomarker Prediction, Patient Selection]
Manage all aspects of NGS pipeline in one environment
• Address data transfer and compute bottlenecks
• Speed up whole-genome resequencing analysis
• Fast short-read alignment
• Calculate differential gene expression from large RNA-Seq data sets
• “Single pane of glass” management interface
Enterprise Benefits
• Open architecture
• Reduced footprint
• Eliminates cluster sprawl
• Out-of-the-box performance with flexibility to meet evolving needs
• Pay-as-you-grow storage/archival performance and capacity
• Minimal management burden and lower TCO
Next Generation Sequencing: Urika-XA for all aspects of NGS Bioinformatics
Next generation sequencers
Urika-XA to simplify the NGS workflow
Manage all aspects of NGS pipeline in one environment
• Address data transfer and compute bottlenecks
• Speed up whole-genome resequencing analysis
• Fast short-read alignment
• Calculate differential gene expression from large RNA-Seq data sets
• “Single pane of glass” management interface
Eliminate cluster sprawl
Reduce data movement
…Add a Scalable Archive Strategy to NGS
Tiered Adaptive Storage (TAS) for active data use and archiving
• Policy-based data movement
• Performs at scale
Next generation sequencers
Urika-XA to simplify the NGS workflow
• NGS generates enormous amounts of data
• Once data is processed, much of it is no longer needed but must be saved
• A proper archive strategy will eliminate bottlenecks, improve performance and reduce costs
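Policy-based data movement can be illustrated with a small age-based rule. This is a minimal sketch under stated assumptions, not a description of how Tiered Adaptive Storage works: the tier names, the thresholds and the `FileRecord`, `choose_tier` and `plan_moves` helpers are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy: thresholds and tier names are illustrative only
# and are not taken from Cray's Tiered Adaptive Storage product.
ACTIVE_DAYS = 30      # accessed within this window: keep on fast storage
NEARLINE_DAYS = 180   # older than ACTIVE_DAYS but within this: cheaper disk


@dataclass
class FileRecord:
    path: str
    days_since_access: int


def choose_tier(record: FileRecord) -> str:
    """Return the tier a file should live on under the sketch policy."""
    if record.days_since_access <= ACTIVE_DAYS:
        return "fast"
    if record.days_since_access <= NEARLINE_DAYS:
        return "nearline"
    return "archive"


def plan_moves(records, current_tiers):
    """List (path, from_tier, to_tier) for files whose tier should change."""
    moves = []
    for rec in records:
        target = choose_tier(rec)
        current = current_tiers[rec.path]
        if target != current:
            moves.append((rec.path, current, target))
    return moves


# Made-up example files: raw reads untouched for a year, a recent result.
example_records = [
    FileRecord("run1/raw_reads.fastq", 365),
    FileRecord("run1/variants.vcf", 3),
]
example_tiers = {"run1/raw_reads.fastq": "fast", "run1/variants.vcf": "fast"}
example_moves = plan_moves(example_records, example_tiers)
```

In this example only the year-old raw reads are scheduled to move, from the fast tier to the archive tier, while the recently used variant file stays where it is.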
Bioinformatics-in-a-Box™ Enterprise NGS Data Solution
● NCGR: >20 years of bioinformatics expertise
● Integrates into corporate IT infrastructure
● Fully customizable
● Handles new and legacy data, including microarrays
● Naturally multi-user with secure, worldwide access
Make Genomic Data Usable
Genotype Patient Data
Annovar
ClinVar
Sample
dbSNP
Uniprot
KEGG
1000 Genomes
MeSH / UMLS / ICD-9 / SNOMED
OMIM
NCBI GENE
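One way to picture making genotype data usable is a join between patient variants and lookup tables drawn from annotation sources such as those named above. The sketch below is an assumption-laden illustration: the records, identifiers and labels are made up, the `annotate` helper is hypothetical, and real sources would need their own parsers and keys.

```python
# Hypothetical, made-up records: identifiers, genes and labels below are
# placeholders, not real dbSNP, ClinVar or OMIM entries.
patient_genotypes = [
    {"patient": "P1", "variant_id": "var-A", "genotype": "0/1"},
    {"patient": "P1", "variant_id": "var-B", "genotype": "1/1"},
    {"patient": "P2", "variant_id": "var-C", "genotype": "0/1"},
]

# One lookup table per annotation source, keyed by variant identifier.
annotation_sources = {
    "clinical_significance": {"var-A": "example-label-1"},
    "gene": {"var-A": "GENE1", "var-B": "GENE2"},
}


def annotate(genotypes, sources):
    """Attach every available annotation to each genotype record."""
    annotated = []
    for record in genotypes:
        row = dict(record)  # copy so the input records stay unchanged
        for source_name, table in sources.items():
            # Missing annotations are kept explicit as None.
            row[source_name] = table.get(record["variant_id"])
        annotated.append(row)
    return annotated


annotated_rows = annotate(patient_genotypes, annotation_sources)
```

Keeping unannotated variants in the output with `None` values, rather than dropping them, makes gaps in coverage visible to whoever interprets the result.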
Urika-XA: Flexible & Open Bioinformatics Appliance for Analysis and Interpretation @ Scale
Flexible and Open Analytics Platform
High value with low technical barrier to entry
• Puts “big data” power in the hands of bioinformaticians & data scientists
• Start-ups through large multinational organizations
Flexible software stack based on requirements
• Start small - scale out to multi-rack configurations
• Accommodates a wide range of bioinformatics problems (bioinformatics, statistics, machine learning, text mining etc.)
Engineered for Performance at Scale
Lower TCO than building & maintaining multiple systems
• Adapt compute resources to demand
• Dense footprint - Over 1,500 cores, 6 TB memory
High-Performance Storage
• Innovative use of storage technologies
• 38 TB SSD and 120 TB POSIX-compliant
InfiniBand interconnects
HPC Hadoop
• Cloudera modified for HPC performance
Make Big Problems Smaller Using Apache Spark
● Correlating patient abnormal lab results with adverse events
● Data sets: Clinical data and full text of PubMed Central (PMC)
● Performance:
  ● Non-interactive on the cloud to…
  ● A fully interactive environment on the Urika-XA platform
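The slide does not show how the correlation was computed. As a rough illustration of the kind of work involved, the sketch below counts how often an abnormal lab result and an adverse event occur in the same patient. It is plain Python standing in for the emit-pairs-then-sum-by-key pattern that Spark would distribute (for example with `flatMap` and `reduceByKey`); the records and the `map_pairs`, `reduce_counts` and `cooccurrence` helpers are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical, made-up patient records for illustration only.
records = [
    {"patient": "P1", "abnormal_labs": {"lab-X"}, "adverse_events": {"event-1"}},
    {"patient": "P2", "abnormal_labs": {"lab-X", "lab-Y"}, "adverse_events": {"event-1"}},
    {"patient": "P3", "abnormal_labs": {"lab-Y"}, "adverse_events": {"event-2"}},
]


def map_pairs(record):
    """Map step: emit ((lab, event), 1) for every pair seen in one patient."""
    labs = sorted(record["abnormal_labs"])
    events = sorted(record["adverse_events"])
    return [((lab, event), 1) for lab, event in product(labs, events)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per (lab, event) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def cooccurrence(patient_records):
    """Count patients in which each lab/event pair occurs together."""
    pairs = (pair for record in patient_records for pair in map_pairs(record))
    return reduce_counts(pairs)


pair_counts = cooccurrence(records)
```

Raw co-occurrence counts are only a starting point; a real analysis would compare them against how often each lab result and event occur on their own.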
Performance of the Text Mining Task
[Bar chart: Time (min), axis from 0 to 300, for Cloud, Hadoop and Spark]
Actionable Knowledge about Patients
Superior Analytics Performance at Lower TCO
Performance features
• High compute and memory density
• InfiniBand fast interconnect
• High-speed SSDs
• High-performance Sonexion® storage
TCO reduction
• Single-platform consolidation of multiple environments
• Accelerated time to value
• Ease of management with converged infrastructure
• Cray reliability and single point of support
Extended Portfolio of Cray Solutions – Scaling across the Performance Spectrum
Cray® CS400™ Series Cluster Supercomputers
Capacity Computing Focus
• Price/performance/watt
• Flexible system configurations
• Industry-standard technologies
• Manageability and reliability
• Modular scalability
Cray® XC40™ Series Supercomputers
Capability Computing Focus
• Application scalability
• HPC-optimized HW, SW & IP
• Price/performance
• Roadmap upgradability
• Reliability/availability/serviceability
Cray® Urika-XA™ Extreme Analytics Platform
Advanced Analytics at Lower TCO
• Pre-integrated, open platform
• Hadoop and Spark ecosystem
• Unified system management interface
• High performance storage technologies
• Multi-rack configurations
Based on the Intel® Xeon® processor
Thank You
Dave Anstey: [email protected]
Ted Slater: [email protected]
Matt Gianni: [email protected]
Tom Bourgoin: [email protected]
Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.