COMPUTE | STORE | ANALYZE

Enabling patient-centric NGS workflows through a Hadoop®-optimized compute platform and graph analytics

Molecular Tri-Con 2015, February 2015
David Anstey, Global Head Life Sciences, Cray Inc.


The Life Sciences/Healthcare Communities Market and Technology Drivers

• Precision Medicine: the race to understand individual patients, diseases and treatments at the molecular level
• Pace of Technology: organizations struggle to keep compute infrastructures up to date with rapidly changing life sciences technologies
• Cluster Sprawl: ad-hoc cluster infrastructures increase complexity and create reliability and usability challenges
• Data Science: new data sources and emerging analytical approaches enable predictive modeling and knowledge discovery
• Rise of High Performance Analytics: the convergence of analytics and supercomputing opens new opportunities to meet the pace of discovery

Copyright 2015 Cray Inc.


The Quest for In-Time Analytics

[Figure: response time frames (<30ms, 30ms, 10min, >10min), ranging from Low-Latency to Batch. Labels: Streaming data; Stationary data; Business analysts accustomed to interactive time frames; Few data scientists who wrangle data]

Low-latency applications require performance optimizations
• Memory-storage hierarchies
• Fast interconnects


Multi-Step Analytics Pipelines

Analytics Pipeline
• Data Prep/ETL
• Stream Processing
• Data Mining
• Interactive Queries
• Actionable Insight


Convergence of Analytics and Supercomputing

High-Performance Computing
• Finance: portfolio optimization, pricing, risk
• Energy: seismic modeling
• Life sciences: genomics, drug discovery
• Scientific: simulation, weather forecasting

Traditional Big Data
• Batch analytics
• Undifferentiated systems

“Simulation is the original Big Data Market” – IDC

High-Performance Big Data Analytics
• Low-latency analytics
• Next-generation architecture


Analytics Solutions

Urika-XA™ Extreme Analytics Platform
• Turnkey advanced analytics platform
• Next-generation system architecture
• Engineered for performance

Urika-GD™ Graph Discovery Appliance
• Discover unknown and hidden relationships in big data
• Real-time data discovery
• Realize rapid time-to-value

Single Platform for Varied Analytic Workloads

• In the real world, advanced analytics pipelines require an assembly of task-specific tools

• These tools and tasks place very different infrastructure demands, and have resulted in specialized infrastructure (appliances)

• Supports the widest range of workloads in a single system footprint

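Workload types and their data access patterns
• Batch Analytics: every item in a dataset once
• Iterative Analytics: same subset of data several times
• Interactive Discovery: different subsets each time

Example workloads: Basic Profiling, Statistics, Machine Learning, Streaming, Data Prep

Throughput Matters (batch end of the spectrum) to Latency Matters (interactive end of the spectrum)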
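The three access patterns above can be made concrete with a small sketch. This is only an illustration in plain Python, not code from the talk: `CountingDataset` and the three workload functions are hypothetical names, and the read counter stands in for real I/O so the difference between the patterns is visible.

```python
class CountingDataset:
    """Wraps a list and counts how many times each item is read."""

    def __init__(self, items):
        self._items = list(items)
        self.reads = [0] * len(self._items)

    def __len__(self):
        return len(self._items)

    def read(self, index):
        self.reads[index] += 1
        return self._items[index]


def batch_total(data):
    # Batch analytics: every item in the dataset is read exactly once.
    return sum(data.read(i) for i in range(len(data)))


def iterative_mean(data, indices, passes):
    # Iterative analytics: the same subset is re-read on every pass.
    estimate = 0.0
    for _ in range(passes):
        subset_mean = sum(data.read(i) for i in indices) / len(indices)
        estimate += 0.5 * (subset_mean - estimate)  # damped update toward the mean
    return estimate


def interactive_queries(data, queries):
    # Interactive discovery: each query touches a different subset.
    return [[data.read(i) for i in q] for q in queries]


batch_data = CountingDataset(range(8))
batch_total(batch_data)
print(batch_data.reads)   # each item read once

iter_data = CountingDataset(range(8))
iterative_mean(iter_data, indices=[0, 1, 2, 3], passes=3)
print(iter_data.reads)    # the chosen subset is read once per pass

query_data = CountingDataset(range(8))
interactive_queries(query_data, [range(0, 2), range(5, 8)])
print(query_data.reads)   # only the queried items are touched
```

The read counts suggest why the slide separates the cases: a single full scan rewards throughput, repeated reads of one subset reward keeping that subset in memory, and unpredictable small reads reward low latency per request.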


Urika-XA Extreme Analytics Platform

Turnkey Advanced Analytics Platform

• Open platform for both pre-configured and user-installed tools

• Hadoop® and Apache Spark™ ecosystem

• Emerging high performance analytic workloads

• Unified system management interface

Next-Generation System Architecture

• High performance storage technologies

• Battle-tested on cutting-edge government/scientific analytic applications

• Ready for the enterprise

Engineered for Performance

• Dense footprint: 48 nodes (Intel® Xeon® processor), over 1,500 cores, 6 TB memory

• 38 TB SSD and 120 TB POSIX-compliant high-performance storage

• FDR InfiniBand for datapaths and 1 GbE for management

• Scale out to multi-rack configurations

Advanced Analytics at Lower TCO
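As a rough reading of the single-rack figures above, the per-node resources can be estimated with simple arithmetic. The sketch below assumes cores and memory are spread evenly across the 48 nodes and that 1 TB is 1,024 GB; the slides state neither, so treat the results as back-of-the-envelope estimates rather than published specifications.

```python
import math

NODES = 48
MIN_TOTAL_CORES = 1500   # the slide says "over 1,500 cores", so this is a lower bound
TOTAL_MEMORY_TB = 6
GB_PER_TB = 1024         # assumption: binary units

# With more than 1,500 cores spread evenly, each node needs at least this many.
min_cores_per_node = math.ceil(MIN_TOTAL_CORES / NODES)

# Memory per node if the 6 TB is divided evenly.
memory_per_node_gb = TOTAL_MEMORY_TB * GB_PER_TB / NODES

# Upper bound on memory per core, since the real core count exceeds 1,500.
max_memory_per_core_gb = TOTAL_MEMORY_TB * GB_PER_TB / MIN_TOTAL_CORES

print(min_cores_per_node)
print(memory_per_node_gb)
print(round(max_memory_per_core_gb, 2))
```

Under these assumptions the rack works out to at least 32 cores and 128 GB of memory per node, or roughly 4 GB of memory per core.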


Pre-Integrated, Open Platform

Pre-Integrated Hardware and Software

• Cloudera Enterprise Hadoop® and Apache Spark™

• Tuned for optimal performance

• Ready to run out of the box

Open and Extensible

• Open platform for user-installed analytics tools

• Scale compute and storage independently

• POSIX compliance for data staging and non-HDFS workloads

• Future-proof, no appliance lock-in


What we’re hearing…

“My bioinfo support team is getting better at handling one exome. But I’m sequencing 1000 exomes this month.” – Researcher, Top 10 Pharma

“… today’s Bio-IT professionals have to design, deploy, and support IT infrastructures with life cycles measured over several years, in the face of an innovation explosion where major laboratory and research enhancements arrive on the scene every few months.” – Chris Dagdigian of The BioTeam

“I’ve been waiting 4 months for my bioinformatics team to do my NGS analysis.” – Researcher, Top 10 Pharma

“Although hundreds of thousands of samples have been sequenced, our ability to find, associate, and implicate genetic variants and candidate disease genes far outstrips our ability to understand them.” – Sean Sanders, Science/AAAS


Three NGS Challenges: Sequence Assembly, Bioinformatics and Data Management

Sequence Assembly is a Bottleneck

Sequence costs down. Sequence volumes up. Huge volume makes assembly the challenge. The rate at which genotypic variation can be characterized is now limited by computational tools, not by sequencing technology.

NGS Bioinformatics is Complex

• High complexity – Interpreting the meaning of NGS sequences involves annotation, integration, visualization, and collaboration, which requires diverse expertise.
• Performance – Bioinformatics is computationally demanding in both performance and scale.

NGS Data Management is Overwhelming

• Production – Huge data volumes and excessive data movement are pushing the limits of many storage and networking infrastructures.
• Archive – NGS workflows generate huge volumes of data, which are both tedious and costly to retain.


Cray’s Next Generation Sequencing Solution: Accelerated Time to Discovery

[Figure: NGS solution overview. Input: Next Generation Sequencers. Panels: Genome Assembly; High Throughput NGS Storage and Archive Environment; Bioinformatics Analytics. Pipeline steps: Base Calling, Assembly, Variant Analysis, QC, Annotation. Applications: Personalized Medicine, Pathway Modeling, Hypothesis Generation, Alternative Indications, Biomarker Prediction, Patient Selection]

Manage all aspects of NGS pipeline in one environment

• Address data transfer and compute bottlenecks

• Speed up whole-genome resequencing analysis

• Fast short-read alignment

• Calculate differential gene expression from large RNA-Seq data sets

• “Single pane of glass” management interface

Enterprise Benefits

• Open architecture

• Reduced footprint

• Eliminates cluster sprawl

• Out-of-the-box performance with flexibility to meet evolving needs

• Pay-as-you-grow storage/archival performance and capacity

• Minimal management burden and lower TCO


Next Generation Sequencing: Urika-XA for all aspects of NGS Bioinformatics

Next generation sequencers

Urika-XA to simplify the NGS workflow

Manage all aspects of NGS pipeline in one environment
• Address data transfer and compute bottlenecks
• Speed up whole-genome resequencing analysis
• Fast short-read alignment
• Calculate differential gene expression from large RNA-Seq data sets
• “Single pane of glass” management interface

Eliminate cluster sprawl

Reduce data movement


…Add a Scalable Archive Strategy to NGS

Tiered Adaptive Storage (TAS) for active data use and archiving

• Policy-based data movement
• Performs at scale

Next generation sequencers

Urika-XA to simplify the NGS workflow

• NGS generates enormous amounts of data
• Once data is processed, much of it is no longer needed but must still be saved
• A proper archive strategy will eliminate bottlenecks, improve performance and reduce costs
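To illustrate what policy-based data movement between storage tiers can look like, here is a minimal sketch of an age-based placement rule. It is not a description of how Tiered Adaptive Storage is configured: the tier names, the 30-day and 180-day thresholds, and the `FileRecord`, `target_tier` and `plan_moves` names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    path: str
    last_access: datetime
    tier: str = "fast"


# Hypothetical policy: data stays on a tier while its age is within the limit.
POLICY = [
    (timedelta(days=30), "fast"),
    (timedelta(days=180), "capacity"),
]
ARCHIVE_TIER = "archive"  # everything older than the last limit


def target_tier(record, now):
    """Return the tier the policy assigns to a file, based on last access."""
    age = now - record.last_access
    for limit, tier in POLICY:
        if age <= limit:
            return tier
    return ARCHIVE_TIER


def plan_moves(records, now):
    """List (path, current_tier, new_tier) for files the policy would move."""
    moves = []
    for record in records:
        new_tier = target_tier(record, now)
        if new_tier != record.tier:
            moves.append((record.path, record.tier, new_tier))
    return moves


now = datetime(2015, 2, 1)
records = [
    FileRecord("run1.bam", datetime(2015, 1, 20)),
    FileRecord("run2.bam", datetime(2014, 10, 1)),
    FileRecord("run3.bam", datetime(2014, 1, 1)),
]
for move in plan_moves(records, now):
    print(move)
```

Separating the plan from the actual copy keeps the policy easy to audit: the list of proposed moves can be reviewed or rate-limited before any data leaves the fast tier.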


Bioinformatics-in-a-Box™ Enterprise NGS Data Solution
● NCGR: > 20 years bioinformatics expertise
● Integrates into corporate IT infrastructure
● Fully customizable
● Handles new and legacy data, including microarrays
● Naturally multi-user with secure, worldwide access


Make Genomic Data Usable

Genotype Patient Data

Annovar

ClinVar

Sample

dbSNP

Uniprot

KEGG

1000 Genomes

MeSH / UMLS / ICD9 / SNOMED

OMIM

NCBI GENE
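Linking patient and genotype data to the reference sources listed above naturally produces a graph, and one basic graph query is finding how two entities are connected. The sketch below uses a breadth-first search over a toy graph; the node identifiers are made-up placeholders, not real records from any of these databases, and the code does not represent how Urika-GD stores or queries data.

```python
from collections import deque

# Hypothetical placeholder edges: patient -> sample -> variant -> gene -> disease/pathway
EDGES = [
    ("patient:P1", "sample:S1"),
    ("sample:S1", "variant:V1"),
    ("variant:V1", "gene:G1"),
    ("gene:G1", "pathway:W1"),
    ("gene:G1", "disease:D1"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_adjacency(EDGES)
print(shortest_path(graph, "patient:P1", "disease:D1"))
```

In this toy graph the patient reaches the disease node through the sample, variant and gene nodes, which is the kind of chain that integrating the annotation sources is meant to expose.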


Urika-XA: Flexible & Open Bioinformatics Appliance for Analysis and Interpretation @ Scale

Flexible and Open Analytics Platform

High value with low technical barrier to entry
• Puts “big data” power in the hands of bioinformaticians & data scientists
• Start-ups through large multinational organizations

Flexible software stack based on requirements
• Start small - scale out to multi-rack configurations
• Accommodates a wide range of bioinformatics problems (bioinformatics, statistics, machine learning, text mining etc.)

Engineered for Performance at Scale

Lower TCO than building & maintaining multiple systems
• Adapt compute resources to demand
• Dense footprint - Over 1,500 cores, 6 TB memory

High-Performance Storage
• Innovative use of storage technologies
• 38 TB SSD and 120 TB POSIX-compliant

InfiniBand interconnects

HPC Hadoop
• Cloudera modified for HPC performance


Make Big Problems Smaller Using Apache Spark

● Correlating patient abnormal lab results with adverse events

● Data sets: Clinical data and full text of PubMed Central (PMC)

● Performance:
  ● Non-interactive on the cloud to…
  ● A fully interactive environment on the Urika-XA platform
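The slides do not show how the lab results were correlated with adverse events, so the following is only a sketch of one simple approach: group both record types by patient, then count how often each abnormal lab result and adverse event occur in the same patient. It is written in plain Python so it runs without a cluster; the records, field values and function names are hypothetical, and co-occurrence counting is a stand-in for whatever statistic the original analysis used.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records: (patient_id, abnormal_lab) and (patient_id, adverse_event)
abnormal_labs = [
    ("p1", "ALT_high"),
    ("p2", "ALT_high"),
    ("p3", "K_low"),
    ("p1", "K_low"),
]
adverse_events = [
    ("p1", "liver_injury"),
    ("p2", "liver_injury"),
    ("p3", "arrhythmia"),
]


def group_by_patient(pairs):
    """Collect the distinct values seen for each patient id."""
    grouped = {}
    for patient_id, value in pairs:
        grouped.setdefault(patient_id, set()).add(value)
    return grouped


def cooccurrence_counts(labs, events):
    """Count patients in which a given lab result and adverse event both occur."""
    labs_by_patient = group_by_patient(labs)
    events_by_patient = group_by_patient(events)
    counts = Counter()
    # Only patients present in both data sets can contribute a pair (an inner join).
    for patient_id in labs_by_patient.keys() & events_by_patient.keys():
        counts.update(product(labs_by_patient[patient_id], events_by_patient[patient_id]))
    return counts


counts = cooccurrence_counts(abnormal_labs, adverse_events)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The group-then-join-then-count shape is the part that carries over to a distributed setting: each step works on key-value pairs keyed by patient, which is the style of operation that frameworks such as Spark parallelize.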


Performance of the Text Mining Task

[Figure: bar chart of time in minutes (axis from 0 to 300) for the text mining task on Cloud, Hadoop and Spark]


Patient Data to…


Actionable Knowledge about Patients


Superior Analytics Performance at Lower TCO

Performance features
• High compute and memory density
• InfiniBand fast interconnect
• High-speed SSDs
• High-performance Sonexion® storage

TCO reduction
• Single-platform consolidation of multiple environments
• Accelerated time to value
• Ease of management with converged infrastructure
• Cray reliability and single point of support


Extended Portfolio of Cray Solutions – Scaling across the Performance Spectrum

Cray® CS400™ Series Cluster Supercomputers

Capacity Computing Focus
• Price/performance/watt
• Flexible system configurations
• Industry-standard technologies
• Manageability and reliability
• Modular scalability

Cray® XC40™ Series Supercomputers

Capability Computing Focus
• Application scalability
• HPC-optimized HW, SW & IP
• Price/performance
• Roadmap upgradability
• Reliability/availability/serviceability

Cray® Urika-XA™ Extreme Analytics Platform

Advanced Analytics at Lower TCO
• Pre-integrated, open platform
• Hadoop and Spark ecosystem
• Unified system management interface
• High performance storage technologies
• Multi-rack configurations

Based on the Intel® Xeon® processor


Thank You

Dave Anstey: [email protected]
Ted Slater: [email protected]
Matt Gianni: [email protected]
Tom Bourgoin: [email protected]


Legal Disclaimer

Copyright 2015 Cray Inc.

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.

Cray Inc. may make changes to specifications and product descriptions at any time, without notice.

All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.

Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.