easygenomics iscb cloud section 2012

Contact [email protected]

http://www.easygenomics.com

Next Generation Bioinformatics on the Cloud

Xing Xu, Ph.D.
Director of Cloud Computing Product

Topics for Today

Behind the cloud product
- BGI
- The team

The product: EasyGenomics
- Why are we building this product?
- What can this product do?

Future direction and open questions

BGI

The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers.
- Now more than 150 sequencers, with 6 TB/day sequencing throughput.

MODEL                  INSTALLATION
ABI 3730XL             16
Roche 454              1
ABI SOLiD 4            27
Solexa GA IIx          6
Illumina HiSeq 2000    135

BGI

The world's largest genome sequencing center

The largest computing and storage center for genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capability have increased 10,000-fold!
- Still increasing …

BGI


One of the world's leading research institutes in genomics

Since 2007:
- 253 papers in high-impact journals
- Including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authorships
- 369 patent applications
- 254 software authorships

BGI


BGI has the sequencing capacity, hardware resources, and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis, and data interpretation.

Team for the Cloud Platform

Run like a software company

Managers are from leading software and hardware companies, such as HP, Microsoft, and Lenovo.

Team members are young, energetic, and ambitious.

Fully supported by BGI in-house algorithm development teams.

Product

Development

Testing

Operation

BGI Support

Team for the Cloud Platform

Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li, Shengchang Gu, etc.
- GPU Lab: Bingqiang Wang, etc.
- Pipeline: Liang Wang, etc.

Test & QA Team
- Xin Guan, Jingjuan Liu, etc.

PMO & IT Operation
- Wenjun Zeng, Litong Lai, Jing Tian, etc.

Product Team
- Xing Xu, Jing Guo, Fang Fang, etc.

Other BGI Teams


Trend of Volume and Cost


Geographical side of the problem

Sequencing happens EVERYWHERE.

Images from omicsmaps.com

BGI

Difficulties of Analysis

Stage                    Task                  Difficulties
Primary analysis         Base calling          Data throughput, data storage
Secondary analysis       Mapping               Computation intensive, data storage
Tertiary analysis        Variant calling       Complicated algorithms, computation intensive
Post-tertiary analysis   In-depth annotation   Lack of knowledge

http://www.labbase.net/Product/ProductItems-26-126-937-50362.html

http://image.baidu.com/i?ct=503316480&z=&tn=baiduimagedetail&word=bioinformatics&in=8933&cl=2&lm=-1&pn=3&rn=1&di=42697375980&ln=1343&fr=&fmq=&ic=&s=&se=&sme=0&tab=&width=&height=&face=&is=&istype=

Problems and Solutions

Problems:

• Big genomic data
• Geographical distribution
• Algorithm integration
• Computational demand

Solutions:

• Cloud
• High-speed data exchange
• Pipelines
• Distributed workloads

EasyGenomics™

EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.

- Algorithms, workflows, reports
- Computational resources, database, data management
- Web portal, simple UI, high-speed connection

Key Features

- Bioinformatics Workflows
- Data Management
- High Speed Connection

Bioinformatics Workflow

Four steps: Upload, Create a Sample, Perform Analyses, Download Results

Algorithms: Carefully chosen, tested and optimized

Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly

Homepage

Four task portals

Status of recent works

Warning and Logging

Navigation Tabs

Bioinformatics Workflow: Pipelines

[Figures: pipeline diagrams for Exome Resequencing and RNA-Seq Transcriptome]

Bioinformatics Workflow: Comprehensive Reports

Data Management

- "Sample", "Analysis", "Project"
- Mimics the real research procedure
- Automatic management of the underlying data structure

[Diagram labels: Raw Data; Sample A, Sample B; Analysis I, Analysis II, Analysis X; Project I]
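The Sample / Analysis / Project organization can be sketched as a small data model. This is only an illustration under assumptions: the three top-level names follow the slide, but every field and method name here (`ReadGroup`, `fastq_files`, `add_read_group`, `samples`) is hypothetical and says nothing about how EasyGenomics stores data internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReadGroup:
    """One lane of raw data attached to a sample (hypothetical fields)."""
    lane: str
    fastq_files: List[str]


@dataclass
class Sample:
    name: str
    read_groups: List[ReadGroup] = field(default_factory=list)

    def add_read_group(self, lane: str, fastq_files: List[str]) -> None:
        self.read_groups.append(ReadGroup(lane, list(fastq_files)))


@dataclass
class Analysis:
    name: str
    workflow: str
    sample: Sample
    parameters: Dict[str, object] = field(default_factory=dict)


@dataclass
class Project:
    name: str
    analyses: List[Analysis] = field(default_factory=list)

    def samples(self) -> List[Sample]:
        """Unique samples used by this project's analyses, in first-seen order."""
        seen: Dict[str, Sample] = {}
        for analysis in self.analyses:
            seen.setdefault(analysis.sample.name, analysis.sample)
        return list(seen.values())


sample_a = Sample("Sample A")
sample_a.add_read_group("lane1", ["a_lane1_1.fq", "a_lane1_2.fq"])
sample_b = Sample("Sample B")

project = Project("Project I", [
    Analysis("Analysis I", "exome", sample_a),
    Analysis("Analysis II", "exome", sample_a, {"min_quality": 20}),
    Analysis("Analysis X", "rnaseq", sample_b),
])
print([s.name for s in project.samples()])  # ['Sample A', 'Sample B']
```

Keeping the sample separate from the analysis is what lets one sample be reused across several analyses with different parameters.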

Create a Sample

Add read groups

Sample Page

Individual report for each lane

Summarized report for all lanes

Data Management: Security

Access, Multi-tenancy, Isolation, Compliance

• Username/Password
• Biometric access
• HTTPS, Aspera fasp™
• Trusted database connection
• ACL, data encryption
• Physical isolation
• Virtual isolation
• ISO 27000

High Speed Data Exchange

Aspera's patented fasp™ high-speed file transfer technology

10 to 100 times faster than FTP

Transfer 24GB in 30 Seconds

Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.

A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
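As a quick sanity check on the transfer figures, the average rate can be worked out directly. This sketch assumes decimal units (1 GB = 10^9 bytes = 8 Gbit); the function names are illustrative only.

```python
def average_rate_gbit_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput in Gbit/s, taking 1 GB = 8 Gbit (decimal units)."""
    return size_gb * 8 / seconds


def seconds_at_rate(size_gb: float, rate_gbit_per_s: float) -> float:
    """Time needed to move size_gb at a constant rate."""
    return size_gb * 8 / rate_gbit_per_s


print(f"{average_rate_gbit_per_s(24, 30):.1f} Gbit/s")  # 6.4 Gbit/s
print(f"{seconds_at_rate(24, 8):.0f} s")                # 24 s
```

Under these assumptions, 24 GB in 30 seconds averages 6.4 Gbit/s, somewhat below the quoted ~8 Gbit/s. The quoted figure may reflect a peak rate, rounding, or a different unit convention.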

Amount of Data that can be transferred in 24hr
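The daily volume a link can carry follows from simple arithmetic. The sketch below assumes decimal units and a fully utilized link unless `utilization` is lowered. The 10 Gbit/s rate and the 6 TB/day sequencing throughput come from the slides, while the other link rates are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def terabytes_per_day(rate_gbit_per_s: float, utilization: float = 1.0) -> float:
    """Data volume (TB, decimal) moved in 24 hours at a sustained rate."""
    gigabytes = rate_gbit_per_s / 8 * SECONDS_PER_DAY * utilization
    return gigabytes / 1000


def required_rate_gbit_per_s(tb_per_day: float) -> float:
    """Sustained rate needed to ship tb_per_day within 24 hours."""
    return tb_per_day * 1000 * 8 / SECONDS_PER_DAY


for rate in (0.1, 1, 10):  # illustrative link rates in Gbit/s
    print(f"{rate:>4} Gbit/s -> {terabytes_per_day(rate):7.2f} TB/day")

# Rate needed to keep up with 6 TB/day of sequencing output
print(f"{required_rate_gbit_per_s(6):.2f} Gbit/s")
```

At a sustained 10 Gbit/s, a link moves 108 TB per day, well above the 6 TB/day the sequencers produce.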


Easy-to-Use UI

Reusability
- Reuse the same sample for different analyses (different parameters)
- Reuse all parameter settings for different analyses

Simple UI and interactive features
- As easy as online shopping
- Shortcuts for predefined settings, yet fully customizable for advanced users
- Handle batch analyses in one setting

Create an Analysis

Selected sample(s)

• One selected sample => Single Analysis
• Multiple selected samples => Batch Analyses
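The single-versus-batch rule can be sketched as one function that fans a shared parameter set out over the selected samples. This is a hypothetical illustration, not the EasyGenomics API: `create_analyses`, the dictionary keys, and the example parameter name are all assumptions.

```python
from typing import Dict, List


def create_analyses(samples: List[str], workflow: str,
                    parameters: Dict[str, object]) -> List[dict]:
    """Return one analysis record per selected sample.

    One sample yields a single analysis; several samples yield a batch
    that reuses the same parameter settings.
    """
    if not samples:
        raise ValueError("select at least one sample")
    mode = "single" if len(samples) == 1 else "batch"
    return [
        # Copy the parameters so each analysis can be edited independently.
        {"sample": s, "workflow": workflow, "mode": mode,
         "parameters": dict(parameters)}
        for s in samples
    ]


settings = {"min_base_quality": 20}  # hypothetical parameter
single = create_analyses(["Sample A"], "exome", settings)
batch = create_analyses(["Sample A", "Sample B", "Sample C"], "exome", settings)
print(single[0]["mode"], len(batch), batch[0]["mode"])  # single 3 batch
```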

Selectable modules

Predefined Settings

Shortcut

Customizable

Project Table

Add/Remove Project

Operation shortcuts

Project list table

Filter and search box

Analysis Table

Sample Table

A typical use case

Future directions

What is the market? Which direction to go?
- Cloud on public infrastructure vs. cloud on private infrastructure
- SaaS vs. PaaS
- Data analysis is only one step of the whole process.
- What will be a sustainable model for the cloud service?

Market Position

- Cloud Service Providers
- Annotation Providers
- Sequencing Service Providers
- Instrument Manufacturers
- Personal Genetic Testing Providers
- Software Providers

illumina

NOW

https://dnanexus.com/

http://www.nextbio.com/b/

http://en.wikipedia.org/wiki/File:GenomeSpace_Logo.png

Challenge and Solution

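DNAnexus
- Cloud: Public
- Reasoning: Great demand on space and computation resources
- Positioning: Infrastructure (PaaS)
- Advantage: Funding, advance in the field

BaseSpace (Illumina)
- Cloud: Public
- Reasoning: Great demand on space and computation resources
- Positioning: App store
- Advantage: Sequencing service, community of partners

GenomeSpace
- Cloud: Public
- Reasoning: Great demand on space and computation resources
- Positioning: Platform for accessing available tools
- Advantage: Strong connection to academia

EasyGenomics
- Cloud: Private
- Reasoning: Security, privacy issues
- Positioning: SaaS solution
- Advantage: Sequencing service, development capability

Ingenuity / NextBio
- Cloud: Private
- Reasoning: Security, privacy issues
- Positioning: Information. They work with the results from NGS, not the raw reads.
- Advantage: Experience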

Public vs Private Cloud

Public Cloud
Pros:
− "Limitless" resources
− Share data with a wide range of people
− Offers a nice platform
Cons:
− Security and reliability
− Short-term cost saving vs. long-term cost nightmare

Private Cloud
Pros:
− Flexibility
− Security and privacy control
− Long-term cost saving
Cons:
− Big initial investment
− Maintaining the infrastructure and software on the cloud

But the line between public and private cloud is blurring.

A sustainable model for cloud service?

Key components of cost
- Storage
- Computational resource
- Data transfer
- Software usage

App store or Cell phone plan

Long term cost vs Short term cost
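The cost components and the long-term versus short-term trade-off can be put into a toy model. Everything numeric below is a placeholder chosen for illustration, not a real price or a BGI figure, and the function and key names are assumptions.

```python
import math
from typing import Dict, Optional


def monthly_cost(storage_tb: float, cpu_core_hours: float, transfer_tb: float,
                 software_fee: float, prices: Dict[str, float]) -> float:
    """Sum the four cost components: storage, compute, transfer, software."""
    return (storage_tb * prices["storage_per_tb"]
            + cpu_core_hours * prices["cpu_core_hour"]
            + transfer_tb * prices["transfer_per_tb"]
            + software_fee)


def break_even_month(upfront: float, private_monthly: float,
                     public_monthly: float) -> Optional[int]:
    """First month in which a private cloud's cumulative cost
    (upfront + monthly) is no higher than a pay-as-you-go public cloud's."""
    saving = public_monthly - private_monthly
    if saving <= 0:
        return None  # the private option never catches up
    return math.ceil(upfront / saving)


# Placeholder unit prices (hypothetical)
prices = {"storage_per_tb": 20, "cpu_core_hour": 0.5, "transfer_per_tb": 10}
public = monthly_cost(100, 10_000, 5, 300, prices)
print(public)                                   # 7350.0
print(break_even_month(120_000, 2_000, 7_000))  # 24
```

With these placeholder numbers, the large initial investment is recovered after 24 months. This is the sense in which short-term saving on a public cloud can turn into a higher long-term cost.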

Data analysis is NOT ALL!

EPM

Project Management, Sample Center, Wet Lab Operation, Bioinformatics Data Analysis

EPM Management System

Budgeting, Tasking, Receipt/Storage, Handover, Sample QC, Sample prep, Workflow, Sequencing, Data analysis, Data QC, Sales, Billing

Web-based Interface

Management, Interfacing, Query, Statistics

Roadmap of EasyGenomics

Jun 2012: EG1.1
• New result reports
• Fully integrated data exchange interface

Aug 2012: EG1.2
• New read filtering step, 20x speedup

Sep 2012: EG1.3
• Data import from BGI sequencing service

Dec 2012 (est.): EG1.5
• QC indicator, QC module
• New sample report
• Transcriptome workflows
• Reference management

Apr 2013 (est.): EG2.0
• iRODS data management
• Data sharing, collaboration
• Users' own applications
• Comparison and filtering tools
• Visualization

www.EasyGenomics.com

Free beta trial is ongoing!

Interpretation is the KEY

Analysis and Interpretation is the KEY

Enabling Technology


Best Practice Award for IT Infrastructure

Human Genome      SOAPdenovo     EasyGenomics™ (192 cores)
Genome Coverage   86%            86%
Assembly Time     70 h           55 h
No. of Servers    1              15
Memory Size       500 GB x 1     24 GB x 15
Mode              Centralized    Distributed
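The comparison can be reduced to two derived numbers: the speedup and the total memory in each setup. The sketch reads the columns as one centralized SOAPdenovo run and one distributed EasyGenomics run. The dictionary keys are names chosen for this example.

```python
soapdenovo = {"assembly_hours": 70, "servers": 1, "memory_gb_per_server": 500}
easygenomics = {"assembly_hours": 55, "servers": 15, "memory_gb_per_server": 24}


def total_memory_gb(run: dict) -> int:
    """Aggregate memory across all servers in a run."""
    return run["servers"] * run["memory_gb_per_server"]


def speedup(baseline: dict, candidate: dict) -> float:
    """How many times faster the candidate finished than the baseline."""
    return baseline["assembly_hours"] / candidate["assembly_hours"]


print(f"speedup: {speedup(soapdenovo, easygenomics):.2f}x")  # speedup: 1.27x
print(total_memory_gb(soapdenovo), total_memory_gb(easygenomics))  # 500 360
```

On these figures, the distributed run reaches the same 86% coverage about 1.27 times faster while using 360 GB of aggregate memory spread over commodity-sized nodes instead of a single 500 GB machine.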

Hadoop-based Flexible Computing

Enabling Technology

SOAP Hadoop (Gaea)

GPU
