easygenomics iscb cloud section 2012

Contact Us [email protected] http://www.easygenomics.com Next Generation Bioinformatics on the Cloud Xing Xu, Ph.D Director of Cloud Computing Product

Author: xing-xu

Post on 15-Apr-2017








Good afternoon, ladies and gentlemen. First of all, I would like to thank everyone for coming to the session. My name is Sifei He, Director of the BGI Cloud Initiative, driving BGI's cloud-based omics effort. I am glad to introduce my colleague Dr. Xing Xu, Senior Product Manager at BGI, responsible for all aspects of EasyGenomics, BGI's latest SaaS-based bioinformatics product.


Topics for Today
Behind the cloud product
  BGI
  The team
The product: EasyGenomics
  Why are we building this product?
  What can this product do?

Future direction and open questions


BGI
The world's largest genome sequencing center
  Started with the Human Genome Project in 1999 with only a few sequencers.
  Now more than 150 sequencers, 6 TB/day sequencing throughput.

MODEL                   INSTALLATION
ABI 3730XL              16
Roche 454               1
ABI SOLiD 4             27
Solexa GA IIx           6
Illumina HiSeq 2000     135


BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China

  20,000+ CPU cores
  19 NVIDIA GPUs
  220+ Tflops peak performance
  17 PB data storage
  Storage and computation capability has increased 10,000-fold!
  Still increasing


BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
One of the world's leading research institutes in genomics
  Since 2007, 253 papers in high-impact journals
    Including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors
  369 patent applications
  254 software authorships


BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
One of the world's leading research institutes in genomics

BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.


Team for the Cloud Platform

Run like a software company
Managers are from leading technology companies, such as HP, Microsoft, and Lenovo.
Team members are young, energetic, and ambitious.
Fully supported by BGI in-house algorithm development teams.
BGI Support


Team for the Cloud Platform
Development Team
  Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
  Flex Lab: Yan Li, Shengchang Gu, etc.
  GPU Lab: Bingqiang Wang, etc.
  Pipeline: Liang Wang, etc.
Test & QA Team
  Xin Guan, Jingjuan Liu, etc.
PMO & IT Operation
  Wenjun Zeng, Litong Lai, Jing Tian, etc.
Product Team
  Xing Xu, Jing Guo, Fang Fang, etc.
Other BGI Teams





Topics for Today
Behind the cloud product
  BGI
  The team

The product: EasyGenomics
  Why are we building this product?
  What can this product do?

Future direction and open questions


Trend of Volume and Cost

This morning I was reading Monday's USA Today. One of the cover stories was about an 18-year-old girl whose family history includes Huntington's disease and who has decided to take a genetic test to see whether she carries the fatal gene. What impressed me is how close genetic testing and disease now are to our daily lives. Imagine that by 2030 the UN President candidates all publish their complete genomes: who would you vote for?

A few years ago it was science fiction, but look at the trend today. The cost of sequencing 1 Mb of DNA has gone down dramatically and, thanks to these great instruments, the total number of human genomes sequenced has gone from 1 in 2003, when the Human Genome Project released its data, to a few thousand today. The number may vary but the trend won't change. If the red dotted Moore's law line continues as it has, we may well see $1000 a genome in 2012 or 2013, and the price will continue to drop toward $0.

In contrast, we will be able to sequence a lot more genomes than today, and I'd like to quote Martin Leach's "Humanity Genome", or "Hunome".

Geographical side of the problem
Sequencing happens EVERYWHERE.

Geographical side of the problem
Images from omicsmaps.com


Over the past few years, we have been thinking of $1000 a genome and of course have done tons of great work to achieve that.

Go big. Getting just 0.1% of the world's human population sequenced would cost $7 billion and generate around 700 petabytes of raw ATGC, the equivalent of 85 billion copies of The Complete Harry Potter Collection eBook.
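The figures in this note can be sanity-checked with a little arithmetic. The sketch below is only a back-of-envelope check: the world population of about 7 billion is an assumption that the talk does not state, and the per-genome cost and data volume are derived from the quoted totals rather than taken from the slides.

```python
from fractions import Fraction

# Figures quoted in the talk.
FRACTION_SEQUENCED = Fraction(1, 1000)   # 0.1% of the population
TOTAL_COST_USD = 7_000_000_000           # $7 billion
TOTAL_RAW_DATA_BYTES = 700 * 10**15      # 700 PB, using decimal petabytes

# ASSUMPTION (not stated in the talk): a world population of about 7 billion.
WORLD_POPULATION = 7_000_000_000

# Number of people sequenced, then the per-genome cost and raw data volume
# that the quoted totals imply.
genomes = int(WORLD_POPULATION * FRACTION_SEQUENCED)
cost_per_genome_usd = TOTAL_COST_USD // genomes
bytes_per_genome = TOTAL_RAW_DATA_BYTES // genomes

print(f"genomes sequenced : {genomes:,}")
print(f"implied cost      : ${cost_per_genome_usd:,} per genome")
print(f"implied raw data  : {bytes_per_genome / 10**9:.0f} GB per genome")
```

Under that population assumption, the totals correspond to 7 million genomes at $1,000 and roughly 100 GB of raw data each, which is consistent with the $1000-a-genome target mentioned earlier.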

And that's not the end of the story. The OmicsMaps team created this nice map to illustrate sequencing capacity around the world. As the price of sequencing drops, there is reason to believe the map will look like this in a few years!

The point is, sequencing is a commodity and it happens everywhere. Key takeaway 1.


Difficulties of Analysis

A lot of the time when I chat with collaborators and partners, everyone talks about the opportunities and possibilities introduced by NGS. Unfortunately, not all of them possess the knowledge and skills needed to handle the tremendous amount of data generated by NGS, which has indeed become one of the biggest obstacles to fully utilizing this technology.

On the other hand, scientists often have to deal with numerous difficulties, such as data deliveries on hard drives, management of computing and storage resources, installation and integration of multiple algorithms, and optimization of a number of parameters, to get reliable and meaningful results.

If you wonder how BGI solved it, you are in the right session. If you want to access BGI's bioinformatics solution, the next 20 slides are just for you.


Problems and Solutions
Problems:
  Big genomic data
  Geographical distribution
  Algorithm integration
  Computational demand

Solutions:
  Cloud
  High Speed Data Exchange
  Pipelines
  Distributed Workloads

Again, just to summarize what we have learned: big genomics data, geographical distribution, algorithm integration, computational demand. Wherever there are problems, there is a solution.

  Cloud: unlimited storage, computation, access from anywhere
  High speed data exchange
  Well tested and optimized algorithms
  Fine tuned resource management

Together that makes up EasyGenomics.

EasyGenomics
EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.

  Algorithms, Workflows, Reports
  Computational Resources
  Database, Data management
  Web portal, Simple UI
  High speed connection

If we look at EasyGenomics from a feature perspective, the web portal, algorithms, workflows, resources, database, and high speed data exchange are all packed into one simple solution on the cloud.

Key Features


Bioinformatics Workflow
Four steps: Upload, Create a Sample, Perform Analyses, Download Results
Algorithms: Carefully chosen, tested and optimized
Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly

At the heart of EasyGenomics is our Bioinformatics Core. 5 workflows with carefully chosen algorithms, tested and optimized. Filtering, QC Report, and Alignment, along with other supporting features.


Four task portals

Status of recent works

Warning and Logging

Navigation Tabs

Bioinformatics Workflow--- Pipelines


Exome Resequencing
RNASeq Transcriptome

Bioinformatics Workflow---Comprehensive Reports

Bioinformatics Workflow---Comprehensive Reports

Data Management
Sample, Analysis, Project
Mimicking real research procedure
Automatic management of underlying data structure

Raw Data
Sample A, Sample B
Analysis I, Analysis II, Analysis X

Project I

When a user starts a new analysis project, there are three atomic objects he or she needs to look at: a Sample, which is created by aggregating raw data; an Analysis, which takes Samples as input; and a Project, which encloses multiple Analyses.

Filtering, QC Report, and Alignment are built in so that users don't have to worry about them. Different pipelines may have different handles, but the basics remain the same. In this way, EasyGenomics provides a unified underlying data structure, mimicking your real research procedures.
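The Sample, Analysis, and Project relationship described in these notes can be pictured as a small data model. The following is only an illustrative sketch, not the EasyGenomics schema: the class fields, the `samples()` helper, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """A sample aggregates raw data files (file names here are hypothetical)."""
    name: str
    raw_data_files: List[str]


@dataclass
class Analysis:
    """An analysis takes one or more samples as input."""
    name: str
    workflow: str
    samples: List[Sample]
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Project:
    """A project encloses multiple analyses."""
    name: str
    analyses: List[Analysis] = field(default_factory=list)

    def samples(self) -> List[Sample]:
        """Return each distinct sample used in the project, in first-seen order."""
        seen: List[Sample] = []
        for analysis in self.analyses:
            for sample in analysis.samples:
                if not any(sample is known for known in seen):
                    seen.append(sample)
        return seen


sample_a = Sample("Sample A", ["lane1.fq.gz", "lane2.fq.gz"])
sample_b = Sample("Sample B", ["lane3.fq.gz"])
project = Project("Project I", [
    Analysis("Analysis I", "Exome Resequencing", [sample_a]),
    Analysis("Analysis II", "Exome Resequencing", [sample_a, sample_b]),
])
print([s.name for s in project.samples()])
```

Keeping the sample separate from the analysis is what allows the same sample to be reused by several analyses without re-uploading its raw data.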


Create a Sample

Add read groups


Sample Page

Individual report for each lane

Summarized report for all lanes

Data management---Security

At EasyGenomics, we are serious about information security and have designed a secure multitenancy architecture from the ground up. Critical user data is protected with 256-bit encryption to make sure everyone is in stealth mode.

Sample and project data are stored in the user's designated virtual partition so that no one, not even the EasyGenomics operations team, can see them.

As with many online applications, a secure login mechanism is provided, and every interaction you make with the system is encrypted using the secure HTTP protocol. When it comes to data transfer security, EasyGenomics has partnered with Aspera to send and receive your data fast and securely. Last but not least, we will never store your password in plain text!

High Speed Data Exchange
Aspera's patented fasp high-speed file transfer technology

10~100X faster than FTP

Today we announced a partnership with Aspera to deploy fasp technology with EasyGenomics. Some of you may be familiar with Aspera: it is the same technology used at the European Bioinformatics Institute and the National Center for Biotechnology Information for delivering large chunks of data over the open Internet. Speed? You know it! 10~100 times faster than FTP.

Transfer 24GB in 30 Seconds
Demonstrated 10Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.
A 24GB file was transferred from China to the US in 30 seconds (~8Gbits/s).
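The transfer figures on this slide can be related to each other with a short calculation. This is a rough sketch that assumes decimal gigabytes and ignores protocol overhead; the function names are illustrative only. With those assumptions the average rate for 24 GB in 30 seconds comes out near 6.4 Gbit/s, so the ~8 Gbit/s on the slide may refer to a peak rate or to a size convention the slide does not spell out.

```python
def average_rate_gbit_per_s(size_gb: float, seconds: float) -> float:
    """Average rate in Gbit/s for `size_gb` decimal gigabytes moved in `seconds`."""
    return size_gb * 8 / seconds


def seconds_needed(size_gb: float, rate_gbit_per_s: float) -> float:
    """Time in seconds to move `size_gb` decimal gigabytes at a sustained rate."""
    return size_gb * 8 / rate_gbit_per_s


# 24 GB in 30 seconds, as quoted on the slide.
average = average_rate_gbit_per_s(24, 30)
# How long the same file would take at a sustained 8 Gbit/s.
time_at_8 = seconds_needed(24, 8)

print(f"average rate         : {average:.1f} Gbit/s")
print(f"share of 10 Gbps link: {average / 10:.0%}")
print(f"time at 8 Gbit/s     : {time_at_8:.0f} s")
```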

Amount of Data that can be transferred in 24hr


Easy-to-Use UI
Reusability
  Reuse the same sample for different analyses (different parameters)
  Reuse all parameter settings for different analyses

Simple UI and interactive features
  As easy as online shopping
  Shortcuts for predefined settings, while remaining fully customizable for advanced users
  Handle batch analyses in one setting


Create an Analysis

Selected sample(s)
One selected sample => Single Analysis

Multiple selected samples => Batch Analyses
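One way to read the single-versus-batch rule is that the same settings are applied once per selected sample. The sketch below works under that assumption; it is not the EasyGenomics API, and the function name, dictionary keys, and the `min_base_quality` parameter are hypothetical.

```python
from copy import deepcopy
from typing import Dict, List


def create_analyses(selected_samples: List[str],
                    workflow: str,
                    parameters: Dict[str, object]) -> List[dict]:
    """Build one analysis per selected sample, all sharing the same settings.

    One selected sample yields a single analysis; several yield a batch.
    """
    if not selected_samples:
        raise ValueError("select at least one sample")
    return [
        # Each analysis gets its own copy so later edits do not leak across the batch.
        {"sample": sample, "workflow": workflow, "parameters": deepcopy(parameters)}
        for sample in selected_samples
    ]


settings = {"min_base_quality": 20}  # hypothetical parameter
single = create_analyses(["Sample A"], "Exome Resequencing", settings)
batch = create_analyses(["Sample A", "Sample B"], "Exome Resequencing", settings)
print(len(single), len(batch))
```

Copying the parameters per analysis keeps the "reuse all parameter settings" convenience while still letting an individual analysis in the batch be adjusted afterwards.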

Create an Analysis

Selectable modules

Predefined Settings



Project Table

Add/Remove Project

Operation shortcuts

Project list table
Filter and search box

Analysis Table


Sample Table

A typical use case


Topics for Today
Behind the cloud product
  BGI
  The team

The product: EasyGenomics
  Why are we building this product?
  What can this product do?

Future direction and open questions


Future directions
What is the market?
Which direction to go?
  Cloud on the public infrastructure vs cloud on the private infrastructure
  SaaS vs PaaS
Data analysis is only one step of the whole process.
What will be the sustainable model for the cloud service?

Cloud Service Providers

Market Position

Annotation Providers
Sequencing Service Providers
Instrument Manufacturers
Personal Genetic Testing Providers
Illumina

Software Providers


Challenge and Solution

Provider:     DNANexus | Basespace (Illumina) | GenomeSpace | EasyGenomics | Ingenuity/NextBio
Cloud:        Public | Public | Public | Private | Private
Reasoning:    Great demand on space and computation resources; Security, Privacy issue
Positioning:  Infrastructure (PaaS) | App Store | Platform for accessing available tools | SaaS Solution | Information (they work with the results from NGS, not the raw reads)
Advantage:    Funding; Advance in the field; Sequencing service; Community of Partners; Strong connection to academia; Sequencing Service; Development Capability; Experience


Public vs Private Cloud
Public Cloud
  Pros:
    Limitless resources
    Share data with a wide range of people
    Offering a nice platform
  Cons:
    Security and reliability
    Short-term cost saving vs long-term cost nightmare
Private Cloud
  Pros:
    Flexibility
    Security and privacy control
    Long-term cost saving
  Cons:
    Big initial investment
    Maintaining the infrastructure and software on the cloud

But the line between public and private cloud is blurring.

A sustainable model for cloud service?
Key components of cost
  Storage
  Computational resource
  Data transfer
  Software usage

App store or Cell phone plan

Long term cost vs Short term cost

Data analysis is NOT ALL!
EPM
Project Management

Sample Center

Wet Lab Operation

Bioinformatics Data Analysis


Management System

Budgeting, Tasking, Receipt/Storage, Handover, Sample QC, Sample prep, Workflow, Sequencing, Data analysis, Data QC


Web-based Interface



Roadmap of EasyGenomics
EG1.1 (in Jun)
  New result reports
  Fully Integrated Data Exchange Interface
EG1.2 (in Aug)
  New read filtering step, speed up 20x
EG1.3 (in Sep)
  Data import from BGI sequencing service
EG1.5 (est. in Dec)
  QC indicator, QC module
  New Sample report
  Transcriptome workflows
  Reference management
EG2.0 (est. in Apr, 2013)
  iRODS data management
  Data sharing, collaboration
  User own applications
  Comparison, Filtering tools
  Visualization

www.EasyGenomics.com
Free Beta Trial is ongoing!!

Analysis and Interpretation is the KEY

That 700 PB does freak out a lot of people, but if anyone in this room asks me what matters most in today's Big Genomics Data era, I would say information. Raw ATGC does NOT make any sense on its own. When you look into the so-called sex chromosome, 200 million bp decide our gender and more.

Up until today, we have uncovered only a very limited part of the knowledge hidden in our genes. While sequencing continues to be a thrilling race, discovering the information behind Big Genomics Data presents a huge challenge to the community. And turning those scientific discoveries into consumable applications is the silver bullet.

Key takeaway 2: Analysis and interpretation of genome data is the KEY, and turning sequencing information into applications is the Silver Bullet.


Enabling Technology

Best Practice Award for IT Infrastructure

Hadoop-based Flexible Computing

EasyGenomics has integrated the de novo application on the Hadoop framework, with comparable performance and without the high memory constraint. A resequencing workflow based on the Hadoop framework is in development.


Enabling Technology
SOAP Hadoop (Gaea)




The amount of data that can be transferred in 24 hours at different data transfer bandwidths. (Assuming the input read size is 10 GB, the total results are about 50 GB, the clean reads are about 10 GB, and the aligned reads (BAM) are about 20 GB.)
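The relationship this caption describes can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions: the bandwidth values are examples (the figure's own values are not preserved in this text), protocol overhead is ignored, and it assumes each analysis uploads the 10 GB of input reads and downloads the full 50 GB of results.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Sizes taken from the caption, in decimal GB.
INPUT_READS_GB = 10
TOTAL_RESULTS_GB = 50


def gb_per_day(bandwidth_mbit_per_s: float) -> float:
    """Decimal GB that a sustained link of the given speed can move in 24 hours."""
    bits = bandwidth_mbit_per_s * 1e6 * SECONDS_PER_DAY
    return bits / 8 / 1e9


def analyses_per_day(bandwidth_mbit_per_s: float) -> int:
    """Whole analyses whose input upload plus result download fit in 24 hours."""
    per_analysis_gb = INPUT_READS_GB + TOTAL_RESULTS_GB
    return int(gb_per_day(bandwidth_mbit_per_s) // per_analysis_gb)


# Example bandwidths in Mbit/s (assumed values, not taken from the figure).
for mbit in (10, 100, 1000):
    print(f"{mbit:>5} Mbit/s: {gb_per_day(mbit):>8.0f} GB/day, "
          f"{analyses_per_day(mbit)} analyses/day")
```

Downloading only the aligned reads (about 20 GB) instead of the full result set would raise the number of analyses that fit into a day at the same bandwidth.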

Customer's Local Resources

A normal use case of EasyGenomics combined with the customer's local computational resources. The double-line items are the customer's data or resources. The single-line items are results and data within BGI and the EasyGenomics platform. The widths of the arrows represent the sizes of the data flows (not to scale).