peter li: gigadb and galaxy - revolutionizing data dissemination, organization and analysis

GigaDB and Galaxy: revolutionizing data dissemination, organization and analysis Peter Li GigaScience [email protected]

Upload: gigascience-bgi-hong-kong

Post on 28-Jan-2015


DESCRIPTION

Peter Li's talk on GigaDB and Galaxy at BGI's 3rd Bioinformatics Software and Data Release Conference at #ICG7 in Hong Kong, 28th November 2012

TRANSCRIPT

Page 1: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

GigaDB and Galaxy: revolutionizing data dissemination, organization and analysis

Peter Li

GigaScience

[email protected]

Page 2: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

www.gigasciencejournal.com

Journal and database for large-scale data

Editor-in-Chief: Laurie Goodman

Editor: Scott Edmunds

Commissioning Editor: Nicole Nogoy

Lead Curator: Tam Sneddon

Data Platform: Peter Li

in conjunction with

Page 3: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis
Page 4: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Why another *omics journal?

Already many journals publishing research involving large data sets

Results reproducibility

Page 5: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Unrepeatability of scientific results

Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155.

Out of 18 microarray papers, results from 10 could not be reproduced

Page 6: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

How are we supporting data reproducibility?

Data sets

Analyses

GigaScience paper

Linked to

Linked to

Community tools for data reproduction and reuse

DOI

Page 7: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Paper DOI

Data set DOI

Linking of papers and data by citation of DOIs
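The linking idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the `10.0000/...` identifiers are placeholders rather than real GigaScience or GigaDB DOIs, and the only external fact assumed is that the public resolver at https://doi.org/ redirects a DOI to its landing page.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI such as '10.1234/abc' into a resolvable URL."""
    doi = doi.strip()
    # Accept the common 'doi:' prefix and drop it.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Very light sanity check: DOIs start with '10.' and contain a '/'.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' as-is; percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


def link_paper_to_data(paper_doi, dataset_dois):
    """Hypothetical helper: pair one paper DOI with the data set DOIs it cites."""
    return {
        "paper": doi_to_url(paper_doi),
        "datasets": [doi_to_url(d) for d in dataset_dois],
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real DOIs.
    links = link_paper_to_data(
        "10.0000/example-paper",
        ["doi:10.0000/example-dataset"],
    )
    print(links)
```

Because both the paper and the data set carry their own DOI, either one can cite the other in a reference list, which is what makes the data citable independently of the paper.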

Page 8: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

http://gigadb.org

Page 9: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”… (see more)

Page 10: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Aspera data transfer

Faster download speeds

Page 11: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

BGI Datasets Get DOI®s

Plants
- Chinese cabbage
- Cucumber
- Foxtail millet
- Pigeonpea
- Potato
- Sorghum

Microbe
- E. coli O104:H4 TY-2482
- T2D gut metagenome

Cell-Lines
- Chinese Hamster Ovary
- Mouse methylomes

Human
- Asian individual (YH): DNA Methylome, Genome Assembly, Transcriptome
- Cancer (14TB)
- Single cell bladder cancer
- HBV infected exomes
- Ancient DNA: Saqqaq Eskimo, Aboriginal Australian

Vertebrates
- Darwin’s Finch
- Giant panda
- Macaque: Chinese rhesus, Crab-eating
- Mini-Pig
- Naked mole rat
- Parrot, Puerto Rican
- Penguin: Emperor penguin, Adelie penguin
- Pigeon, domestic
- Polar bear
- Sheep
- Tibetan antelope

Invertebrate
- Ant: Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant
- Roundworm
- Schistosoma
- Silkworm
- Parasitic nematode
- Pacific oyster

Legend: Released pre-publication / Paper published in GigaScience

39 data sets

Page 12: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Currently: 39 public datasets*

10 citations in references*

Humans
- Ancient DNA: Aboriginal Australian, Saqqaq Eskimo
- Asian individual (YH)

Page 13: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

What about the analyses?

Data sets

Analyses

GigaScience paper

Linked to

Linked to

How will we make analyses available for downloading and execution?

Page 14: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Example workflow: Investigate the evolutionary relationships between proteins

Protein sequences

Bioinformatics data analyses as workflows

Query

Multiple sequence alignment
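To make the idea of an analysis as a workflow concrete, here is a minimal Python sketch of the two steps named on this slide chained together. Everything in it is an assumption for illustration: the step functions, the in-memory "database" and its sequences are made up, and `pad_align` only right-pads sequences with gaps, so it stands in for a real multiple sequence alignment tool without being one.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable]


def query(ids: List[str]) -> List[str]:
    """Placeholder query step: look sequences up in a tiny in-memory table."""
    database = {"P1": "MKTAY", "P2": "MKTA", "P3": "MKSAYL"}  # made-up sequences
    return [database[i] for i in ids]


def pad_align(seqs: List[str]) -> List[str]:
    """Placeholder for multiple sequence alignment.

    Right-pads every sequence with '-' to a common width.
    This is NOT a real alignment algorithm.
    """
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def run_workflow(steps: List[Step], data):
    """Run the steps in order, feeding each output into the next step.

    Returns the final result plus a history of (step name, output) pairs,
    so every intermediate result can be inspected or re-run.
    """
    history = []
    for name, func in steps:
        data = func(data)
        history.append((name, data))
    return data, history


if __name__ == "__main__":
    workflow = [("Query", query), ("Multiple sequence alignment", pad_align)]
    result, history = run_workflow(workflow, ["P1", "P2", "P3"])
    for name, output in history:
        print(name, "->", output)
```

Recording the output of every step is the property that matters for reproducibility: someone else can rerun the same ordered steps on the same input and compare each intermediate result.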

Page 15: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Implement GigaScience workflows in a community-accepted format

http://galaxyproject.org

Over 20,000 main Galaxy server users

Over 500 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

Page 16: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Tool list Tool parameterisation Results panel

Page 17: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Pilot project - Integrate BGI SOAP package into Galaxy

Enable SOAP tools to be used from within Galaxy workflows

Page 18: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Data analysis pipelines

SOAP1 SOAP2 SOAPdenovo1 SOAPdenovo2 SOAPsnp SOAPsplice

Integrate BGI SOAP package into Galaxy

Python wrapper (one for each SOAP tool)
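The slides do not show what the wrappers contain. As a rough sketch of the general pattern, a Python wrapper around a command-line tool usually turns a set of parameters into an argument list, runs the program and hands back its output. The function names below are hypothetical, and no real SOAP flags are used: the demo simply runs the Python interpreter itself so that it works on any machine.

```python
import subprocess
import sys


def build_command(executable, options, inputs):
    """Build an argument list: executable, then flag/value pairs, then inputs.

    A value of None means the flag takes no argument.
    """
    cmd = [executable]
    for flag, value in options.items():
        cmd.append(flag)
        if value is not None:
            cmd.append(str(value))
    cmd.extend(inputs)
    return cmd


def run_tool(cmd):
    """Run the command and return its standard output.

    check=True raises CalledProcessError on a non-zero exit code, so a
    failed tool run is reported as an error instead of passing silently.
    """
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for a real tool: run the current Python interpreter.
    demo = build_command(sys.executable, {"-c": "print('wrapped tool ran')"}, [])
    print(run_tool(demo), end="")
```

Passing the command as a list rather than a single shell string avoids quoting problems with file names, which matters when the inputs are paths chosen by a workflow system rather than typed by hand.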

Page 19: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

GitHub open code repository

https://github.com/gigascience

Page 20: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Tool list Tool parameterisation Results panel

Page 21: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

SOAPdenovo2 Galaxy workflow

Page 22: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

http://www.myexperiment.org

Page 23: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Why publish in GigaScience?

Benefit
• Data hosted in GigaDB
• Allocation of DOIs to data
• Metadata in ISA-Tab format
• Galaxy tool integration
• Use of tools in Galaxy workflows

Added value
• No need to use own servers
• Citable data
• Aids reuse of data
• Supports reuse of tools
• Improves documentation
• Shows how tool can be used with other bioinformatics software

Page 24: Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Thanks to:

• Tin-Lap Lee and Huayan Gao - CUHK
• Tam, Jesse, Scott, Nicole & Laurie - GigaScience

[email protected]