easygenomics iscb cloud section 2012
Next Generation Bioinformatics on the Cloud
Xing Xu, Ph.D., Director of Cloud Computing Product
Contact: [email protected], www.easygenomics.com
Good afternoon, ladies and gentlemen. First of all, I would like to thank everyone for coming to this session. My name is Sifei He, Director of the BGI Cloud Initiative, driving BGI's cloud-based omics effort. I am glad to introduce my colleague Dr. Xing Xu, Senior Product Manager at BGI, responsible for all aspects of EasyGenomics, BGI's latest SaaS-based bioinformatics product.
Topics for Today
Behind the cloud product
  BGI
  The team
The product: EasyGenomics
  Why are we building this product?
  What can this product do?
Future direction and open questions
BGI
The world's largest genome sequencing center
Started with the Human Genome Project in 1999 with only a few sequencers.
Now more than 150 sequencers, 6 TB/day sequencing throughput.
MODEL: ABI 3730XL, Roche 454, ABI SOLiD 4, Solexa GA IIx, Illumina HiSeq 2000
INSTALLATION: 161276135
BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
20,000+ CPU cores
19 NVIDIA GPUs
220+ Tflops peak performance
17 PB data storage
Storage and computation capability have increased 10,000-fold!
Still increasing
BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
One of the world's leading research institutes in genomics
Since 2007, 253 papers in high-impact journals
Including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors
369 patent applications
254 software authorship
BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
One of the world's leading research institutes in genomics
BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
Team for the Cloud Platform
Run like a software company
Managers are from leading software companies, such as HP, Microsoft, and Lenovo.
Team members are young, energetic, and ambitious.
Fully supported by BGI in-house algorithm development teams.
BGI Support
Team for the Cloud Platform
Development Team
  Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
  Flex Lab: Yan Li, Shengchang Gu, etc.
  GPU Lab: Bingqiang Wang, etc.
  Pipeline: Liang Wang, etc.
Test & QA Team: Xin Guan, Jingjuan Liu, etc.
PMO & IT Operation: Wenjun Zeng, Litong Lai, Jing Tian, etc.
Product Team: Xing Xu, Jing Guo, Fang Fang, etc.
Other BGI Teams
Topics for Today
Behind the cloud product
  BGI
  The team
The product: EasyGenomics
  Why are we building this product?
  What can this product do?
Future direction and open questions
Trend of Volume and Cost
This morning I was reading Monday's USA Today. One of the cover stories was about an 18-year-old girl whose family history includes Huntington's disease and who has decided to take a genetic test to see whether she carries the fatal gene. What impressed me is how close genetic testing and disease now are to our daily lives. Imagine that by 2030 the UN President candidates all publish their complete genomes: who would you vote for?
A few years ago this was science fiction, but look at the trend today. The cost of sequencing 1 Mb of DNA has gone down dramatically and, thanks to these great instruments, the total number of human genomes sequenced has gone from 1 in 2003, when the Human Genome Project released its data, to a few thousand today. The number may vary but the trend won't change. If the red-dotted Moore's law trend continues as it has, we may well see $1000 a genome in 2012 or 2013, and the price will continue to drop toward $0.
In contrast, we will be able to sequence a lot more genomes than today, and I'd like to quote Martin Leach's Humanity Genome, or Hunome.
Geographical side of the problem
Sequencing happens EVERYWHERE.
Images from omicsmaps.com
BGI
Over the past few years, we have been thinking about the $1000 genome and, of course, have done a great deal of work to achieve it.
Go big. Getting just 0.1% of the world's human population sequenced would cost $7 billion and generate around 700 petabytes of raw ATGC, equivalent to 85 billion copies of The Complete Harry Potter Collection eBook.
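The 0.1% estimate can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope illustration: the world population of about 7 billion and the roughly 100 GB of raw data per genome are assumptions chosen because they reproduce the quoted $7 billion and 700 PB, not figures stated in the talk.

```python
# Back-of-envelope check of the "0.1% of humanity" estimate.
# All inputs below are assumptions chosen to match the figures quoted in the talk.
WORLD_POPULATION = 7_000_000_000   # assumed: roughly 7 billion people in 2012
FRACTION_SEQUENCED = 0.001         # 0.1% of the population
COST_PER_GENOME_USD = 1_000        # the "$1000 genome" target
RAW_DATA_PER_GENOME_GB = 100       # assumed raw data volume per genome, in GB


def sequencing_estimate(population, fraction, cost_per_genome, gb_per_genome):
    """Return (number of genomes, total cost in USD, total raw data in PB)."""
    genomes = round(population * fraction)
    total_cost = genomes * cost_per_genome
    total_pb = genomes * gb_per_genome / 1_000_000  # 1 PB = 1,000,000 GB
    return genomes, total_cost, total_pb


genomes, cost, petabytes = sequencing_estimate(
    WORLD_POPULATION, FRACTION_SEQUENCED, COST_PER_GENOME_USD, RAW_DATA_PER_GENOME_GB
)
print(f"{genomes:,} genomes, ${cost:,}, {petabytes:,.0f} PB of raw data")
```

Changing the assumed per-genome data volume scales the storage figure linearly, which is why the talk notes that the number may vary while the trend does not.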
And that's not the end of the story. The omicsmaps.com team created this nice map to illustrate sequencing capacity around the world. As the price of sequencing drops, there is reason to believe the map will look like this in a few years!
The point is, sequencing is a commodity and it happens everywhere. That is key takeaway 1.
Difficulties of Analysis
Often when I chat with collaborators and partners, everyone talks about the opportunities and possibilities introduced by NGS. Unfortunately, not all of them possess the knowledge and skills needed to handle the tremendous amount of data generated by NGS, which has become one of the biggest obstacles to fully utilizing this technology.
On the other hand, to get reliable and meaningful results, scientists often have to deal with numerous difficulties, such as data delivery on hard drives, management of computing and storage resources, installation and integration of multiple algorithms, and optimization of many parameters.
If you wonder how BGI solved this, you are in the right session. If you want to access BGI's bioinformatics solution, the next 20 slides are just for you.
Problems and Solutions
Problems:
  Big genomic data
  Geographical distribution
  Algorithm integration
  Computational demand
Solutions:
  Cloud
  High Speed Data Exchange
  Pipelines
  Distributed Workloads
Again, just to summarize what we have learned: big genomic data, geographical distribution, algorithm integration, computational demand. Wherever there are problems, there are solutions:
Cloud: unlimited storage and computation, access from anywhere
High speed data exchange
Well tested and optimized algorithms
Fine-tuned resource management
Together, these make up EasyGenomics.
EasyGenomics
EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.
Algorithms, Workflows, Reports
Computational Resources
Database, Data management
Web portal, Simple UI
High speed connection
If we look at EasyGenomics from a feature perspective, the web portal, algorithms, workflows, resources, database, and high speed data exchange are all packed into a simple solution on the cloud.
Key Features
Bioinformatics Workflow
Four steps: Upload, Create a Sample, Perform Analyses, Download Results
Algorithms: Carefully chosen, tested and optimized
Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
At the heart of EasyGenomics is our Bioinformatics Core: 5 workflows with carefully chosen algorithms, tested and optimized, plus Filtering, QC Report, and Alignment along with other supporting features.
Homepage
Four task portals
Status of recent works
Warning and Logging
Navigation Tabs
Bioinformatics Workflow --- Pipelines
Exome Resequencing
RNA-Seq Transcriptome
Bioinformatics Workflow --- Comprehensive Reports
Data Management
Sample, Analysis, Project
Mimicking real research procedure
Automatic management of underlying data structure
Diagram labels: Raw Data; Sample A, Sample B; Analysis I, Analysis II, Analysis X; Project I
When a user starts a new analysis project, there are three basic objects he or she needs to look at: the Sample, which is created by aggregating raw data; the Analysis, which takes Samples as input; and the Project, which encloses multiple Analyses.
Filtering, QC Report, and Alignment are built in, so users don't have to worry about them. Different pipelines may have different handles, but the basics remain the same. In this way, EasyGenomics provides a unified underlying data structure that mimics your real research procedures.
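To make the Sample, Analysis, Project relationship concrete, here is a minimal sketch of such a data model. It is an illustration only: the class names follow the three objects described above, but the fields, the helper method, and the file names are hypothetical and do not describe the actual EasyGenomics implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """A sample aggregates one or more raw data files (for example, per-lane reads)."""
    name: str
    raw_files: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    """An analysis takes one or more samples as input and runs one workflow on them."""
    name: str
    workflow: str
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Project:
    """A project encloses multiple analyses."""
    name: str
    analyses: List[Analysis] = field(default_factory=list)

    def samples(self) -> List[Sample]:
        """Return each distinct sample used by any analysis, in first-seen order."""
        seen: List[Sample] = []
        for analysis in self.analyses:
            for sample in analysis.samples:
                if not any(sample is known for known in seen):
                    seen.append(sample)
        return seen


# Hypothetical file names, used only to show how the objects nest.
sample_a = Sample("Sample A", ["lane1_1.fq.gz", "lane1_2.fq.gz"])
sample_b = Sample("Sample B", ["lane2_1.fq.gz", "lane2_2.fq.gz"])
project = Project("Project I", [
    Analysis("Analysis I", "Exome Resequencing", [sample_a]),
    Analysis("Analysis II", "Exome Resequencing", [sample_a, sample_b]),
])
print([s.name for s in project.samples()])
```

Because a sample is an object in its own right, the same sample can feed several analyses without its raw data being duplicated.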
Create a Sample
Add read groups
Sample Page
Individual report for each lane
Summarized report for all lanes
Data management---Security
At EasyGenomics, we are serious about information security and have designed a secure multitenancy architecture from the ground up. Critical user data is protected with 256-bit encryption to make sure everyone is in stealth mode.
Sample and project data are stored in the user's designated virtual partition so that no one, not even the EasyGenomics operation team, can see them.
As in many online applications, a secure login mechanism is provided, and every interaction you make with the system is encrypted using the secure HTTP protocol. When it comes to data transfer security, EasyGenomics has partnered with Aspera to send and receive your data quickly and securely. Last but not least, we will never store your password in plain text!
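As an illustration of what not storing passwords in plain text can look like in practice, here is a minimal sketch using a salted PBKDF2 hash from the Python standard library. It is a generic example under stated assumptions, not a description of how EasyGenomics handles credentials; the function names and the iteration count are hypothetical.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # assumed work factor; tune it for the deployment hardware


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); only these two values are stored, never the password."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)


salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
print(verify_password("wrong password", salt, digest))
```

The server keeps only the salt and the digest, so a leaked credential table does not directly reveal any user's password.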
High Speed Data Exchange
Aspera's patented fasp high-speed file transfer technology
10~100X faster than FTP
Today we announced a partnership with Aspera to deploy fasp technology with EasyGenomics. Some of you may be familiar with Aspera: it is the same technology used at the European Bioinformatics Institute and the National Center for Biotechnology Information for delivering large volumes of data over the open Internet. Speed? You know it! 10~100 times faster than FTP.
Transfer 24GB in 30 Seconds
Demonstrated 10Gbps ultra high speed data exchange with UC Davis and NCBI in June.
A 24GB file was transferred from China to the US in 30 seconds (~8Gbits/s).
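The relationship between file size, transfer time, and line rate can be sketched as follows, assuming decimal units (1 GB = 8 Gbit) and ignoring protocol overhead. With these assumptions, 24 GB in exactly 30 seconds averages 6.4 Gbit/s, while a sustained 8 Gbit/s would move 24 GB in 24 seconds, so the quoted ~8 Gbit/s presumably reflects a shorter actual transfer time or a peak rate; that reading is an assumption, not something the slide states.

```python
def average_rate_gbps(size_gb: float, seconds: float) -> float:
    """Average payload rate in Gbit/s for size_gb decimal gigabytes sent in `seconds`."""
    return size_gb * 8 / seconds  # 1 GB = 8 Gbit


def transfer_seconds(size_gb: float, rate_gbps: float) -> float:
    """Time in seconds needed to send size_gb decimal gigabytes at rate_gbps Gbit/s."""
    return size_gb * 8 / rate_gbps


rate = average_rate_gbps(24, 30)
print(f"24 GB in 30 s averages {rate:.1f} Gbit/s")
print(f"24 GB at 8 Gbit/s takes {transfer_seconds(24, 8):.0f} s")
```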
Amount of Data that can be transferred in 24hr
Easy-to-Use UI
Reusability
  Reuse the same sample for different analyses (different parameters)
  Reuse all parameter settings for different analyses
Simple UI and interactive features
  As easy as online shopping
  Shortcuts for predefined settings, while remaining fully customizable for advanced users
  Handle batch analyses in one setting
Create an Analysis
Selected sample(s)
  One selected sample => Single Analysis
  Multiple selected samples => Batch Analyses
Selectable modules
Predefined settings (shortcut)
Customizable
Project Table
Add/Remove Project
Operation short cuts
Project list tableFilter and search box
Analysis Table
Sample Table
A typical use case
Topics for Today
Behind the cloud product
  BGI
  The team
The product: EasyGenomics
  Why are we building this product?
  What can this product do?
Future direction and open questions
Future directions
What is the market?
Which direction to go?
Cloud on public infrastructure vs cloud on private infrastructure
SaaS vs PaaS
Data analysis is only one step of the whole process.
What will be the sustainable model for the cloud service?
Market Position (NOW)
Cloud Service Providers
Annotation Providers
Sequencing Service Providers
Instrument Manufacturers
Personal Genetic Testing Providers
Illumina
Software Providers
Challenge and Solution
Providers: DNANexus; Basespace (Illumina); GenomeSpace; EasyGenomics; Ingenuity / NextBio
Cloud: Public; Public; Public; Private; Private
Reasoning: Great demand on space and computation resources; Security, Privacy issue
Positioning: Infrastructure (PaaS); App Store; Platform for accessing available tools; SaaS Solution; Information (they work with the results from NGS, not the raw reads)
Advantage: Funding; Advance in the field; Sequencing service; Community of Partners; Strong connection to academia; Sequencing Service; Development Capability; Experience
Public vs Private Cloud
Public Cloud
  Pros:
    Limitless resources
    Share data with a wide range of people
    Offers a nice platform
  Cons:
    Security and reliability
    Short-term cost saving vs long-term cost nightmare
Private Cloud
  Pros:
    Flexibility
    Security and privacy control
    Long-term cost saving
  Cons:
    Big initial investment
    Maintaining the infrastructure and software on the cloud
But the line between public and private cloud is blurring.
A sustainable model for cloud service?
Key components of cost
  Storage
  Computational resource
  Data transfer
  Software usage
App store or cell phone plan
Long-term cost vs short-term cost
Data analysis is NOT ALL!
EPM Management System
Web-based Interface
Project Management
Sample Center
Wet Lab Operation
Bioinformatics Data Analysis
Diagram labels: Budgeting, Tasking, Receipt/Storage, Handover, Sample QC, Sample prep, Workflow, Sequencing, Data analysis, Data QC, Sales, Billing, Management, Interfacing, Query, Statistics
Roadmap of EasyGenomics
EG1.1 (in Jun)
  New result reports
  Fully Integrated Data Exchange Interface
EG1.2 (in Aug)
  New read filtering step, speed up 20x
EG1.3 (in Sep)
  Data import from BGI sequencing service
EG1.5 (est. in Dec)
  QC indicator, QC module
  New Sample report
  Transcriptome workflows
  Reference management
EG2.0 (est. in Apr, 2013)
  iRODS data management
  Data sharing, collaboration
  User own applications
  Comparison, Filtering tools
  Visualization
www.EasyGenomics.com
Free Beta Trial is ongoing!!
Analysis and Interpretation is the KEY
That 700 PB does freak a lot of people out, but if anyone in this room asks me what matters most in today's Big Genomics Data era, I would say information. Raw ATGC does NOT make any sense on its own. When you try to look into the so-called sex chromosome, 200 million bp decide our gender and more
Up until today, we have uncovered only a very limited part of the knowledge hidden in our genes. While sequencing continues to be a thrilling race, discovering the information behind Big Genomics Data presents a huge challenge to the community. And turning those scientific discoveries into consumable applications is the silver bullet.
Key takeaway 2: Analysis and interpretation of genome data is the KEY, and applying sequencing information to applications is the silver bullet.
Enabling Technology
Best Practice Award for IT Infrastructure
Hadoop-based Flexible Computing
SOAP
Hadoop (Gaea)
GPU
EasyGenomics has integrated the de novo application on the Hadoop framework, with comparable performance and without the high memory constraint. A resequencing workflow based on the Hadoop framework is under development.
Amount of data transferred in 24 hours at different data transfer bandwidths. (Assuming the input read size is 10GB, the total results are about 50GB, the clean reads are about 10GB, and the aligned reads (BAM) are about 20GB.)
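The quantity plotted in this figure can be approximated with simple arithmetic. The sketch below assumes decimal units and a link that is fully used for 24 hours; the list of bandwidths is a set of hypothetical example values, while the per-sample sizes are the ones given in the caption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Per-sample data sizes from the caption, in GB.
SIZES_GB = {
    "input reads": 10,
    "total results": 50,
    "clean reads": 10,
    "aligned reads (BAM)": 20,
}


def gb_per_day(bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Data volume in GB that a link of bandwidth_mbps Mbit/s can move in 24 hours.

    `efficiency` is an assumed fraction of the nominal bandwidth actually achieved.
    """
    bits = bandwidth_mbps * 1e6 * efficiency * SECONDS_PER_DAY
    return bits / 8 / 1e9  # bits -> bytes -> GB


for bandwidth in (10, 100, 1000, 10000):  # hypothetical link speeds in Mbit/s
    daily = gb_per_day(bandwidth)
    result_sets = int(daily // SIZES_GB["total results"])
    print(f"{bandwidth:>6} Mbit/s: {daily:>9.0f} GB/day, "
          f"about {result_sets} full result sets of {SIZES_GB['total results']} GB")
```

Lowering `efficiency` gives a rough feel for why a transfer protocol that keeps the link saturated matters as much as the nominal bandwidth.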
Customer's Local Resources
A typical use case of EasyGenomics together with the customer's local computational resources. Double-line items are the customer's data or resources. Single-line items are results and data within BGI and the EasyGenomics platform. The widths of the arrows represent the sizes of the data flows (not to scale).