Nicole Nogoy: GigaScience...how licensing can change the way we do research

Post on 28-Jan-2015

DESCRIPTION

Nicole Nogoy's talk at the Victoria University of Wellington on: "GigaScience...how licensing can change the way we do research", March 7th 2014

TRANSCRIPT

Background

...how licensing can change the way we do research

Nicole Nogoy, VUW, 7 March 2014

Open-Data Open-Source

Open-Review Open-Access

www.gigasciencejournal.com

Journal, data-platform and database for large-scale data

Editor-in-Chief: Laurie Goodman
Executive Editor: Scott Edmunds
Commissioning Editor: Nicole Nogoy
Lead Curator: Chris Hunter
Data Platform: Peter Li
Data Scientist: Rob Davidson

in conjunction with

Why? How?

What can be achieved?

It's all about the re-use

To do this everything needs to be free and accessible to be read by humans & machines*

* See: http://www.biomedcentral.com/about/datamining

Take home message:

Era of Data-Driven Science

Using networking power of the internet to tackle problems

Can ask new questions & find patterns & connections hidden in others' data

Build on each other's efforts quicker & more efficiently

Harness wisdom of the crowds: crowdsourcing, citizen science

Big Potential:

Big Challenges: cultural and technical

Removing silos and putting in the commons

Usability: interoperable standards/formats for humans/machines

Good for a field: Genomics/Bioinformatics

Long term sharing infrastructure:

Strong use of standards/policies:

Plummeting cost/explosion in volumes:

Sharing aids specific communities…

Rice v Wheat: consequences of publicly available genome data.

[Figure: rice vs. wheat by year, 1996 to 2008; vertical axis 0 to 700; axis label: Papers]

Sharing aids authors…

Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Sharing Detailed Research Data Is Associated with Increased Citation Rate.

Every 10 datasets collected contribute to at least 4 papers in the following 3 years.

Piwowar HA, Vision TJ, Whitlock MC (2011). Data archiving is a good investment. Nature 473 (7347), 285. DOI: 10.1038/473285a

Established in 1995

We’re not laughing now

Problem: growing replication gap

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Growing Issue: increasing number of retractions

>15X increase in last decade

Strong correlation of “retraction index” with higher impact factor

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

At the current rate of increase, by 2045 as many papers will be retracted as published!

Reasons

• Data not available
  • From the start – Lost over time
• Software not available
  • From the start – Lost over time
• Lack of standards
  • None established – Not followed
• Unclear methods
• Missing information
• Honest errors
• Pure and simple data fabrication

Impact

Wasted time

Wasted money

**Delayed ‘payoff’ to the community**

*** Distrust of Scientists and science***

Juliet Jacobs found out she had lung cancer, she was terrified

But the research at Duke turned out to be wrong. Its gene-based tests proved worthless, and the research behind them was discredited.

Ms. Jacobs died a few months after treatment

How?

GigaSolution: deconstructing the paper

Provide infrastructure and mechanisms of reward for:

• Data availability

• Metadata/curation

• Interoperability

• Availability of workflows

• Transparent analyses

Data

Metadata

Methods

Analyses

GigaSolution: deconstructing the paper

www.gigadb.org

www.gigasciencejournal.com

World's largest genomics organisation with: 20PB storage, 20.5K cores, 212TFlops, >1000 bioinformaticians

Utilizes big-data infrastructure and expertise from:

Combines and integrates:

Open-access journal

Data Publishing Platform

Data Analysis Platform

Open-Access

Why/what/how?

Where does licensing fit?

Importance of licensing: ability to mine & reuse content

“By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

=

=

Needs to be:

NC and ND impose unnecessary restrictions and are not counted as “true OA”

CC0 better than CC-BY for datasets to prevent “attribution stacking”

Budapest Open Access Initiative:

Importance of licensing: ability to mine & reuse content

=

Prevents translations, causes incompatibility issues when mixing with other licenses, some combinations illegal (e.g. CC-NC-SA & CC-BY-SA), hinders non-profits and mixed collaborations, practically unenforceable, and dealing with requests is more trouble than it's worth.
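The licensing points on these slides can be sketched in a few lines of Python. This is a simplified model, not a legal tool: the function names are hypothetical, the "true OA" test only encodes the slide's statement that NC and ND do not count, and the mixing rule is an assumption that a share-alike source forces the combined work to carry exactly that source's licence.

```python
def parse_license(code):
    """Turn a code such as 'CC-BY-NC-SA' or 'CC0' into a set of licence elements."""
    code = code.upper()
    if code == "CC0":
        return frozenset()  # public-domain waiver: no conditions at all
    return frozenset(part for part in code.split("-") if part != "CC")


def is_true_open_access(code):
    """Per the slide: NC and ND restrictions are not counted as 'true OA'."""
    return not ({"NC", "ND"} & parse_license(code))


def can_mix(code_a, code_b):
    """Simplified remix check (assumption, see lead-in).

    - ND material cannot be adapted, so it cannot be mixed.
    - The combined work must carry every condition of both sources.
    - A share-alike (SA) source requires the combined work to use
      exactly its own licence, so it must already equal the union.
    """
    a, b = parse_license(code_a), parse_license(code_b)
    if "ND" in a or "ND" in b:
        return False
    combined = a | b
    return all(source == combined for source in (a, b) if "SA" in source)


# The slide's own example of an impossible combination:
nc_sa_with_by_sa = can_mix("CC-NC-SA", "CC-BY-SA")
# CC0 data adds no conditions, so it mixes with anything adaptable:
cc0_with_by = can_mix("CC0", "CC-BY")
```

Under this model the CC-NC-SA and CC-BY-SA pair is rejected because each share-alike licence demands that the result carry its own terms and nothing else, which both cannot be true at once.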

Further reading:
http://www.nature.com/nature/journal/v495/n7442/full/495440a.html
http://blogs.ch.cam.ac.uk/pmr/2011/11/29/scientists-should-never-use-cc-nc-this-explains-why/

Use of non CC-BY by publishers = “double dipping” (selling content, reprints, etc.)

• Gives authors control over the integrity of their work and the right to be properly acknowledged and cited.

• Does not grant publicity rights, and attribution can be used to clearly disclaim endorsement

• Restrictions rarely benefit author, and inhibit reuse

Open-Data

Data Publishing

Why/what/how?

?

New incentives/credit

Credit where credit is overdue:

“One option would be to provide researchers who release data to public repositories with a means of accreditation.”

“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”

Nature Biotechnology 27, 579 (2009)

Prepublication data sharing (Toronto International Data Release Workshop)

“Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”

Nature 461, 168-170 (2009)
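One way to picture the "search the literature for all papers that used a particular data set" idea is a scan of paper text for dataset DOIs. The sketch below is illustrative only: the paper titles and texts are made up, the function names are hypothetical, the regular expression is a common loose DOI pattern rather than an official one, and the `10.5524/` prefix is assumed from the GigaDB DOIs quoted elsewhere in this talk.

```python
import re
from collections import defaultdict

# Loose DOI pattern: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return every DOI-like string in text, minus trailing punctuation."""
    return [match.rstrip(".,;:)") for match in DOI_PATTERN.findall(text)]


def index_papers_by_dataset(papers, dataset_prefix="10.5524/"):
    """Map each dataset DOI to the titles of the papers that mention it."""
    index = defaultdict(list)
    for title, text in papers.items():
        for doi in sorted(set(find_dois(text))):
            if doi.startswith(dataset_prefix):
                index[doi].append(title)
    return dict(index)


# Hypothetical paper snippets, for illustration only.
papers = {
    "Paper A": "We reanalysed the genome data (doi:10.5524/100001).",
    "Paper B": "Data: http://dx.doi.org/10.5524/100001 and http://dx.doi.org/10.5524/100039",
    "Paper C": "See doi:10.1371/journal.pone.0000308 for background.",
}
index = index_papers_by_dataset(papers)
```

Here `index` credits the first dataset to both Paper A and Paper B, while Paper C, which cites only a journal article, does not appear. A citable dataset DOI is what makes this kind of attribution lookup possible at all.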

?

New incentives/credit

“increase acceptance of research data as legitimate, citable contributions to the scholarly record”.

“data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.

= Data Citation?

http://www.force11.org/datacitation

Anatomy of a Publication

Data

Idea

Study

Analysis

Answer

Metadata

Anatomy of a Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

GigaScience Data Publishing Platform

Currently 60 datasets & almost 50TB data

• TBs of data from: BGI, ACRG, G10K
• Provide curation & integration with other DBs

Many data types…

BGI Datasets Get DOIs

Plants
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum
• Wheat A+B

Other
• fMRI

Microbe/metagenomics
• E. coli O104:H4 TY-2482
• T2D gut metagenome
• Bulk pooled insects
• T. tengcongensis proteome

Cell-Lines
• Chinese Hamster Ovary
• Mouse methylomes
• Cancer quantitative proteomics

Human
• Asian individual (YH)
  - DNA Methylome
  - Genome Assembly v1+2
  - Transcriptome
• Cancer (14TB)
• Single cell bladder cancer
• HBV infected exomes
• Ancient DNA
  - Saqqaq Eskimo
  - Aboriginal Australian

Vertebrates
• Darwin’s Finch
• Giant panda
• Macaque
  - Chinese rhesus
  - Crab-eating
• Mini-Pig
• Naked mole rat
• Parrot, Puerto Rican
• Penguin
  - Emperor penguin
  - Adelie penguin
• Pigeon, domestic
• Polar bear
• DA and F344 rats
• Sheep
• Tibetan antelope

Invertebrate
• Ant
  - Florida carpenter ant
  - Jerdon’s jumping ant
  - Leaf-cutter ant
• Roundworm
• Schistosoma
• Silkworm
• Parasitic nematode
• Pacific oyster

Released pre-publication

Paper Published in GigaScience

Cloud solutions?

Reward better handling of metadata…

Novel tools/formats for data interoperability/handling.

BMC Research Awards 2013

Winner of open data award

Open-Source

The new way of doing science?

Why/what/how?

Open-Source: the source of it all

• Transparent, fast, collaborative

• Long history, large community

• Many licenses

• Many repositories

• Many users/platforms

Software community understands benefits

Open-Review

Why/what/how?

New & more transparent peer-review: Pre-publication: pre-prints

New & more transparent peer-review: During-publication: open-review

BMC Series Medical Journals

New & more transparent peer-review: Post-publication review

Open content lets you do interesting things post-publication:

New pub models:

Altmetrics:

Comments, blogs, online journal clubs

Examples

Open-Data

Data Publishing

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

Our first DOI:

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
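The dataset citation on this slide pairs a `doi:` identifier with a resolver link. A minimal sketch of how the two forms relate is given below; the function names are hypothetical, the citation layout simply mirrors the slide (authors, year, title, publisher, link), and the default resolver is the `http://dx.doi.org/` address shown in the talk.

```python
def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Turn 'doi:10.5524/100001' (or a bare or already-linked DOI) into a resolver URL."""
    doi = doi.strip()
    for prefix in ("doi:", "DOI:", "http://dx.doi.org/", "https://doi.org/"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return resolver + doi.strip()


def format_dataset_citation(authors, year, title, publisher, doi):
    """Assemble a dataset citation in the order used on the slide."""
    return f"{authors} ({year}) {title}. {publisher}. {doi_to_url(doi)}"


# Author list abbreviated here for illustration.
citation = format_dataset_citation(
    "Li, D; et al.",
    2011,
    "Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    "BGI Shenzhen",
    "doi:10.5524/100001",
)
```

Because the identifier is stable, the same dataset can be cited before any paper describing it exists, which is the point of releasing the E. coli data with its own DOI.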

The People's Parrot: Amazona vittata

Puerto Rican Parrot Genome Project

Rarest parrot, national bird of Puerto Rico

Community funded from artworks, fashion shows, crowdfunding…

Genome annotated by students in community college as part of bioinformatics education

Paper and Data published in GigaScience and GigaDB

Taras K Oleksyk, et al. (2012) A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14

Steven J. O’Brien (2012) Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13

Oleksyk et al. (2012) Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience. http://dx.doi.org/10.5524/100039

Disseminating new types of data

Open-Source

Software Publishing

How are we supporting data reproducibility?

Data sets

Analyses

Linked to

Linked to

DOI

DOI

Open-Paper

Open-Review

DOI:10.1186/2047-217X-1-18

~21,000 accesses

Open-Code

8 reviewers tested data in ftp server & named reports published

DOI:10.5524/100044

Open-Pipelines

Open-Workflows

DOI:10.5524/100038

Open-Data

78GB CC0 data

Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/

~21,000 downloads

Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2

New & more transparent peer-review:The GigaScience way:

8 referees downloaded & tested data, then signed reports

New & more transparent peer-review:The GigaScience way:

Real-time open-review = paper in arXiv + blogged reviews

Implement workflows in a community-accepted format

http://galaxyproject.org

Over 36,000 main Galaxy server users

Over 1000 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

GigaGalaxy

[Screenshot labels: Tool list, Tool parameterisation, Results panel]

GigaGalaxy & Metabolomics

Changing the way we publish:

“Deconstructed” Journal

“Regular” Journal

“Conscientious” Online Journal

www.gigasciencejournal.com

Give us your data, papers & pipelines*

Help us make it happen!

Contact us:
nicole@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com

* APCs currently FREE until end of December 2014, saving you up to £1,250 – courtesy of BGI

Thanks to:

Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)

Team: Peter Li, Chris Hunter, Rob Davidson, Jesse Si Zhe, Scott Edmunds, Nicole Nogoy, Laurie Goodman

CBIIT

Funding from:

Follow us:
@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/

www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
