easygenomics iscb cloud section 2012


Upload: xing-xu

Post on 15-Apr-2017


Category:

Software



TRANSCRIPT

Page 1: Easygenomics ISCB Cloud section 2012

Contact [email protected]

http://www.easygenomics.com

Next Generation Bioinformatics on the Cloud

Xing Xu, Ph.D.
Director of Cloud Computing Product

Page 2: Easygenomics ISCB Cloud section 2012

Topics for Today

Behind the cloud product
- BGI
- The team

The product: EasyGenomics
- Why are we building this product?
- What can this product do?

Future direction and open questions


Page 3: Easygenomics ISCB Cloud section 2012

BGI

The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers.
- Now more than 150 sequencers and 6 TB/day of sequencing throughput.

Model                  Installations
ABI 3730XL             16
Roche 454              1
ABI SOLiD 4            27
Solexa GA IIx          6
Illumina HiSeq 2000    135

Page 4: Easygenomics ISCB Cloud section 2012

BGI

The world's largest genome sequencing center

The largest computing and storage center for genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capability has increased 10,000-fold!
- Still increasing …

Page 5: Easygenomics ISCB Cloud section 2012

BGI

The world's largest genome sequencing center

The largest computing and storage center for genomics in China

One of the world's leading research institutes in genomics

Since 2007:
- 253 papers in high-impact journals, including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors
- 369 patent applications
- 254 software authorships

Page 6: Easygenomics ISCB Cloud section 2012

BGI

The world's largest genome sequencing center

The largest computing and storage center for genomics in China

One of the world's leading research institutes in genomics

BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.

Page 7: Easygenomics ISCB Cloud section 2012

Team for the Cloud Platform

Run like a software company

Managers are from leading software and technology companies, such as HP, Microsoft, and Lenovo.

Team members are young, energetic, and ambitious.

Fully supported by BGI in-house algorithm development teams.

Team structure (diagram): Product, Development, Testing, Operation, BGI Support

Page 8: Easygenomics ISCB Cloud section 2012

Team for the Cloud Platform

Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li, Shengchang Gu, etc.
- GPU Lab: Bingqiang Wang, etc.
- Pipeline: Liang Wang, etc.

Test & QA Team
- Xin Guan, Jingjuan Liu, etc.

PMO & IT Operation
- Wenjun Zeng, Litong Lai, Jing Tian, etc.

Product Team
- Xing Xu, Jing Guo, Fang Fang, etc.

Other BGI Teams

Page 9: Easygenomics ISCB Cloud section 2012

Topics for Today

Behind the cloud product
- BGI
- The team

The product: EasyGenomics
- Why are we building this product?
- What can this product do?

Future direction and open questions

Page 10: Easygenomics ISCB Cloud section 2012

Trend of Volume and Cost


Page 11: Easygenomics ISCB Cloud section 2012

Geographical side of the problem

Sequencing happens EVERYWHERE.

Map images from omicsmaps.com; BGI is labeled on the map.

Page 12: Easygenomics ISCB Cloud section 2012

Difficulties of Analysis

Primary analysis: base calling
- Data throughput
- Data storage

Secondary analysis: mapping
- Computation intensive
- Data storage

Tertiary analysis: variant calling
- Complicated algorithms
- Computation intensive

Post-tertiary analysis: in-depth annotation
- Lack of knowledge

Page 13: Easygenomics ISCB Cloud section 2012

Problems and Solutions

Problems:
• Big genomic data
• Geographical distribution
• Algorithm integration
• Computational demand

Solutions:
• Cloud
• High Speed Data Exchange
• Pipelines
• Distributed Workloads

Page 14: Easygenomics ISCB Cloud section 2012

EasyGenomics™

EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.

- Algorithms, workflows, reports
- Computational resources
- Database, data management
- Web portal, simple UI
- High-speed connection

Page 15: Easygenomics ISCB Cloud section 2012

Key Features
- Bioinformatics Workflows
- Data Management
- High Speed Connection

Page 16: Easygenomics ISCB Cloud section 2012

Bioinformatics Workflow

Four steps: Upload, Create a Sample, Perform Analyses, Download Results

Algorithms: Carefully chosen, tested and optimized

Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly

Page 17: Easygenomics ISCB Cloud section 2012

Homepage

Four task portals

Status of recent works

Warning and Logging

Navigation Tabs

Page 18: Easygenomics ISCB Cloud section 2012

Bioinformatics Workflow: Pipelines

Pipeline diagrams: Exome Resequencing, RNA-Seq Transcriptome

Page 19: Easygenomics ISCB Cloud section 2012

Bioinformatics Workflow: Comprehensive Reports

Page 20: Easygenomics ISCB Cloud section 2012

Bioinformatics Workflow: Comprehensive Reports

Page 21: Easygenomics ISCB Cloud section 2012

Data Management

“Sample”, “Analysis”, “Project”
- Mimicking real research procedure
- Automatic management of underlying data structure

Diagram: Raw Data → Sample A, Sample B → Analysis I, Analysis II, Analysis X → Project I
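The Sample / Analysis / Project hierarchy on this slide can be sketched as a small data model. This is only an illustration: the class names, field names, file names and the assignment of analyses to samples are hypothetical and are not taken from EasyGenomics itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """A named set of raw data files (for example, one file per lane)."""
    name: str
    raw_files: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    """One workflow run on one sample with a given parameter set."""
    name: str
    sample: Sample
    workflow: str
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Project:
    """A group of analyses, mirroring a real research project."""
    name: str
    analyses: List[Analysis] = field(default_factory=list)

    def samples(self) -> List[str]:
        """Names of the distinct samples used, in first-use order."""
        seen: List[str] = []
        for analysis in self.analyses:
            if analysis.sample.name not in seen:
                seen.append(analysis.sample.name)
        return seen


# Hypothetical data: the same sample is reused by two analyses.
sample_a = Sample("Sample A", ["lane1.fq.gz", "lane2.fq.gz"])
sample_b = Sample("Sample B", ["lane3.fq.gz"])
project = Project("Project I", [
    Analysis("Analysis I", sample_a, "Exome Resequencing"),
    Analysis("Analysis II", sample_a, "Exome Resequencing", {"min_depth": "10"}),
    Analysis("Analysis X", sample_b, "RNA-Seq"),
])
print(project.samples())  # prints ['Sample A', 'Sample B']
```

Keeping the raw files on the sample rather than on the analysis is what lets one sample feed several analyses without re-uploading data.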

Page 22: Easygenomics ISCB Cloud section 2012

Create a Sample

Add read groups

Page 23: Easygenomics ISCB Cloud section 2012

Sample Page

Individual report for each lane

Summarized report for all lanes

Page 24: Easygenomics ISCB Cloud section 2012

Data Management: Security

Access
• Username/Password
• Biometric access
• HTTPS, Aspera fasp™

Multi-tenancy
• Trusted database connection
• ACL, data encryption

Isolation
• Physical isolation
• Virtual isolation

Compliance
• ISO27000

Page 25: Easygenomics ISCB Cloud section 2012

High Speed Data Exchange

Aspera’s patented fasp™ high-speed file transfer technology

10 to 100x faster than FTP

Page 26: Easygenomics ISCB Cloud section 2012

Transfer 24GB in 30 Seconds

Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.

Page 27: Easygenomics ISCB Cloud section 2012

Transfer 24GB in 30 Seconds

Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.

A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
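As a quick check of the figures on this slide, the average rate can be computed from file size and elapsed time. The sketch below assumes decimal units (1 GB = 10^9 bytes) and ignores protocol overhead; the function names are illustrative only.

```python
def average_rate_gbit_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput in Gbit/s for size_gb decimal gigabytes."""
    return size_gb * 8 / seconds


def transfer_time_s(size_gb: float, rate_gbit_per_s: float) -> float:
    """Seconds needed to move size_gb decimal gigabytes at a sustained rate."""
    return size_gb * 8 / rate_gbit_per_s


rate = average_rate_gbit_per_s(24, 30)   # 24 GB in exactly 30 s
time_at_8 = transfer_time_s(24, 8)       # 24 GB at a sustained 8 Gbit/s

print(f"24 GB in 30 s averages {rate:.1f} Gbit/s")
print(f"24 GB at 8 Gbit/s takes {time_at_8:.0f} s")
```

Under these assumptions, exactly 30 seconds corresponds to an average of 6.4 Gbit/s, while a sustained 8 Gbit/s would move 24 GB in about 24 seconds. The slide's two numbers are therefore presumably rounded, or the ~8 Gbit/s refers to a peak rather than an average rate.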

Page 28: Easygenomics ISCB Cloud section 2012

Amount of Data that can be transferred in 24hr
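The chart itself is not recoverable from the transcript, but the underlying arithmetic is simple. The sketch below converts a sustained link rate into data moved per day, using decimal units; the example rates are illustrative and are not taken from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def terabytes_per_day(rate_mbit_per_s: float) -> float:
    """Decimal terabytes moved in 24 h at a sustained rate in Mbit/s."""
    bytes_per_second = rate_mbit_per_s * 1e6 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


def required_rate_mbit_per_s(tb_per_day: float) -> float:
    """Sustained rate in Mbit/s needed to move tb_per_day decimal terabytes."""
    return tb_per_day * 1e12 * 8 / SECONDS_PER_DAY / 1e6


# Illustrative link speeds (not the values plotted on the slide).
for rate in (100, 1_000, 10_000):
    print(f"{rate:>6} Mbit/s -> {terabytes_per_day(rate):.2f} TB/day")

# Rate needed to ship one day of the 6 TB/day sequencing output quoted earlier.
print(f"6 TB/day needs about {required_rate_mbit_per_s(6):.0f} Mbit/s")
```

For scale, a fully used 1 Gbit/s link moves about 10.8 TB in a day, so shipping 6 TB/day of sequencing output needs roughly 556 Mbit/s sustained.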


Page 29: Easygenomics ISCB Cloud section 2012

Easy-to-Use UI

Reusability
- Reuse the same sample for different analyses (different parameters)
- Reuse all parameter settings for different analyses

Simple UI and interactive features
- As easy as online shopping
- Shortcuts for predefined settings, yet fully customizable for advanced users
- Handle batch analyses in one setting

Page 30: Easygenomics ISCB Cloud section 2012

Create an Analysis

Selected sample(s)

• One selected sample => Single Analysis
• Multiple selected samples => Batch Analyses
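The single-versus-batch rule can be sketched as one function that applies the same workflow and parameter settings to every selected sample. The function name, the dictionary layout and the `min_quality` parameter are hypothetical; they only illustrate the rule, not the actual EasyGenomics interface.

```python
from typing import Dict, List


def create_analyses(samples: List[str], workflow: str,
                    parameters: Dict[str, str]) -> List[dict]:
    """Create one analysis per selected sample, sharing the same settings."""
    if not samples:
        raise ValueError("select at least one sample")
    mode = "single" if len(samples) == 1 else "batch"
    return [
        {
            "sample": sample,
            "workflow": workflow,
            "parameters": dict(parameters),  # copy so runs stay independent
            "mode": mode,
        }
        for sample in samples
    ]


settings = {"min_quality": "20"}  # hypothetical parameter
single = create_analyses(["Sample A"], "RNA-Seq", settings)
batch = create_analyses(["Sample A", "Sample B"], "RNA-Seq", settings)
print(single[0]["mode"], len(batch), batch[0]["mode"])  # prints: single 2 batch
```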

Page 31: Easygenomics ISCB Cloud section 2012

Create an Analysis

Selectable modules

Predefined Settings

Shortcut

Page 32: Easygenomics ISCB Cloud section 2012

Create an Analysis

Page 33: Easygenomics ISCB Cloud section 2012

Create an Analysis

Customizable

Page 34: Easygenomics ISCB Cloud section 2012

Create an Analysis

Page 35: Easygenomics ISCB Cloud section 2012

Project Table

Annotated screenshot: Add/Remove Project, operation shortcuts, project list table, filter and search box

Page 36: Easygenomics ISCB Cloud section 2012

Analysis Table

Page 37: Easygenomics ISCB Cloud section 2012

Sample Table

Page 38: Easygenomics ISCB Cloud section 2012

A typical use case

Page 39: Easygenomics ISCB Cloud section 2012

Topics for Today

Behind the cloud product
- BGI
- The team

The product: EasyGenomics
- Why are we building this product?
- What can this product do?

Future direction and open questions

Page 40: Easygenomics ISCB Cloud section 2012

Future directions

What is the market? Which direction to go?
- Cloud on public infrastructure vs cloud on private infrastructure
- SaaS vs PaaS
- Data analysis is only one step of the whole process.
- What will be a sustainable model for the cloud service?

Page 41: Easygenomics ISCB Cloud section 2012

Market Position

Segments shown in the diagram:
- Cloud Service Providers
- Annotation Providers
- Sequencing Service Providers
- Instrument Manufacturers
- Personal Genetic Testing Providers
- Software Providers

Labels on the diagram: illumina, NOW

Page 42: Easygenomics ISCB Cloud section 2012

Challenge and Solution

DNAnexus
- Cloud: Public
- Positioning: Infrastructure (PaaS)
- Advantage: Funding; advance in the field

BaseSpace (Illumina)
- Cloud: Public
- Positioning: App store
- Advantage: Sequencing service; community of partners

GenomeSpace
- Cloud: Public
- Positioning: Platform for accessing available tools
- Advantage: Strong connection to academia

EasyGenomics
- Cloud: Private
- Positioning: SaaS solution
- Advantage: Sequencing service; development capability

Ingenuity / NextBio
- Cloud: Private
- Positioning: Information. They work with the results from NGS, not the raw reads.
- Advantage: Experience

Reasoning
- Public cloud: great demand on space and computation resources
- Private cloud: security and privacy issues

Page 43: Easygenomics ISCB Cloud section 2012

Public vs Private Cloud

Public Cloud
Pros:
− “Limitless” resources
− Share data with a wide range of people
− Offers a nice platform
Cons:
− Security and reliability
− Short-term cost saving vs long-term cost nightmare

Private Cloud
Pros:
− Flexibility
− Security and privacy control
− Long-term cost saving
Cons:
− Big initial investment
− Maintaining the infrastructure and software on the cloud

But the line between public and private cloud is blurring.

Page 44: Easygenomics ISCB Cloud section 2012

A sustainable model for cloud service?

Key components of cost
- Storage
- Computational resource
- Data transfer
- Software usage

App store or cell phone plan

Long-term cost vs short-term cost
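One way to read the "app store or cell phone plan" question is as metered pay-per-use pricing versus a flat subscription; that reading is an assumption, not something the slide states. The sketch below adds up the four cost components under each model. All prices and usage figures are placeholders, not real EasyGenomics rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    storage_tb_months: float  # TB stored, summed over months
    core_hours: float         # CPU core-hours consumed
    transfer_tb: float        # TB moved in or out
    software_runs: int        # number of pipeline runs


@dataclass
class Rates:
    per_tb_month: float
    per_core_hour: float
    per_tb_transfer: float
    per_run: float


def pay_per_use_cost(usage: Usage, rates: Rates) -> float:
    """Metered cost: sum of the four components listed on the slide."""
    return (usage.storage_tb_months * rates.per_tb_month
            + usage.core_hours * rates.per_core_hour
            + usage.transfer_tb * rates.per_tb_transfer
            + usage.software_runs * rates.per_run)


def cheaper_option(usage: Usage, rates: Rates, flat_fee: float) -> str:
    """Compare metered cost against a flat subscription fee for the same period."""
    return "pay-per-use" if pay_per_use_cost(usage, rates) < flat_fee else "flat plan"


rates = Rates(per_tb_month=40.0, per_core_hour=0.25,
              per_tb_transfer=100.0, per_run=5.0)   # placeholder prices
light = Usage(storage_tb_months=2, core_hours=400, transfer_tb=0.5, software_runs=4)
heavy = Usage(storage_tb_months=10, core_hours=5000, transfer_tb=2, software_runs=20)

for label, usage in (("light", light), ("heavy", heavy)):
    cost = pay_per_use_cost(usage, rates)
    print(label, cost, cheaper_option(usage, rates, flat_fee=1000.0))
```

With these placeholder numbers the light user is better off metered and the heavy user is better off on a flat plan, which is the short-term versus long-term tension the slide points at.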

Page 45: Easygenomics ISCB Cloud section 2012

Data analysis is NOT ALL!

EPM (management system)

Stages: Project Management, Sample Center, Wet Lab Operation, Bioinformatics Data Analysis

Steps: Budgeting, Tasking, Receipt/Storage, Handover, Sample QC, Sample prep, Workflow, Sequencing, Data analysis, Data QC

Also shown: Sales, Billing

Web-based Interface: Management, Interfacing, Query, Statistics

Page 46: Easygenomics ISCB Cloud section 2012

Roadmap of EasyGenomics

EG1.1 (Jun 2012)
• New result reports
• Fully integrated data exchange interface

EG1.2 (Aug 2012)
• New read filtering step, 20x speed-up

EG1.3 (Sep 2012)
• Data import from BGI sequencing service

EG1.5 (est. Dec 2012)
• QC indicator, QC module
• New sample report
• Transcriptome workflows
• Reference management

EG2.0 (est. Apr 2013)
• iRODS data management
• Data sharing, collaboration
• Users' own applications
• Comparison and filtering tools
• Visualization

Page 47: Easygenomics ISCB Cloud section 2012

www.EasyGenomics.com

Free beta trial is ongoing!

Page 48: Easygenomics ISCB Cloud section 2012

Interpretation is the KEY

Analysis and Interpretation is the KEY

Page 49: Easygenomics ISCB Cloud section 2012

Enabling Technology

Best Practice Award for IT Infrastructure

Human Genome       SOAPdenovo     EasyGenomics™ (192 cores)
Genome Coverage    86%            86%
Assembly Time      70 h           55 h
No. of Servers     1              15
Memory Size        500 GB x 1     24 GB x 15
Mode               Centralized    Distributed

Hadoop-based Flexible Computing
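The comparison above can be reduced to a few derived figures. The sketch below uses only the numbers from the slide; the variable and function names are illustrative.

```python
# Figures as reported on the slide for the human genome assembly.
centralized = {"hours": 70, "servers": 1, "gb_per_server": 500}
distributed = {"hours": 55, "servers": 15, "gb_per_server": 24}


def total_memory_gb(run: dict) -> int:
    """Aggregate memory across all servers used by a run."""
    return run["servers"] * run["gb_per_server"]


speedup = centralized["hours"] / distributed["hours"]
time_saved = 1 - distributed["hours"] / centralized["hours"]

print(f"speed-up: {speedup:.2f}x")
print(f"time saved: {time_saved:.1%}")
print(f"total memory: {total_memory_gb(centralized)} GB vs "
      f"{total_memory_gb(distributed)} GB")
```

By these figures the distributed run is about 1.27x faster (roughly 21% less wall-clock time) at the same coverage, and it uses 360 GB of aggregate memory on 24 GB nodes instead of a single 500 GB machine.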

Page 50: Easygenomics ISCB Cloud section 2012

Enabling Technology

SOAP Hadoop (Gaea)

GPU
