BioMed Central's open data initiatives
DESCRIPTION
An overview of most of BioMed Central's open data projects in publishing. Presented at the Alliance for Permanent Access conference, 7th November 2012
TRANSCRIPT
BioMed Central’s open data initiatives
Alliance for Permanent Access conference
7th November 2012
Iain Hrynaszkiewicz
Publisher (Open Science), BioMed Central
[email protected]
@iainh_z
About BioMed Central
• Launched in 2000, largest global publisher of peer-reviewed open access journals (>240)
• >136,000 peer-reviewed open access articles published
• Part of Springer Science+Business Media since 2008
• Publish using Creative Commons (CC-BY) licenses
• Non-journal products include ISRCTN database
• Interested in innovation and recognise the growing need for data sharing and publication
http://blogs.biomedcentral.com/bmcblog/tag/Open-Data/
BioMed Central and open data
• Increasing transparency in scientific research and scholarly communication is at the core of our strategy
• Data are an increasingly integral part of scholarly communication, with many opportunities for increasing the pace of knowledge discovery
• Publishers, particularly open access publishers, are well-placed to share information across domain boundaries
http://www.biomedcentral.com/about/access
“By ‘open data’ BioMed Central means that these data are freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. BioMed Central encourages the use of fully open formats wherever possible.”
BioMed Central open data initiatives
• Data journals and article types
• Open Data Award
• Data hosting, citation, deposition and linking
• Lab notebook-journal integration (LabArchives)
• Data licensing
• Guidance and best practice e.g. human subjects – confidentiality and consent
• Data formats and standards – efficient reuse
• Facilitation of data/text mining research
Problem: Lack of credit/recognition for data sharing and publication
• In science, credit is everything, but incentives for data publication are still emerging
• Datasets are not generally as discoverable and citable as journal articles – yet
• Requirements for data sharing are field/location-specific
• Need more empirical evidence of the benefits of data publication for individual scientists
Solution #1: Journals and article types enabling data publication
Data notes: “[B]riefly describe a biomedical data set or database, with the data being readily accessible and attributed to a source” http://bit.ly/y3Jb3b
Data notes: “[E]xceptional datasets deposited in our GigaScience repository that have been selected for further peer review” http://bit.ly/yPBsAA
Research: E.g. The International Stroke Trial database http://www.trialsjournal.com/content/12/1/101
Solution #2: Open Data Award
“We ... recognize researchers who ... have demonstrated leadership in the sharing, standardization, publication, or re-use of biomedical research data.”
http://www.biomedcentral.com/researchawards/opendata
Solution #3: Enable and encourage/require data citation
“References... Only articles, datasets and abstracts that have been published or are in press, or are available through public e-print/preprint servers, may be cited…
Dataset with persistent identifier
Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. http://dx.doi.org/10.5524/100012.”
http://blogs.biomedcentral.com/bmcblog/2012/01/19/citing-and-linking-data-to-publications-more-journals-more-examples-more-impact/
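The dataset reference style quoted above (authors, year, title, publisher, DOI link) can be sketched in a few lines of Python. This is only an illustration of the pattern visible in the example: the `DatasetCitation` class and its field names are hypothetical and are not part of any BioMed Central tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetCitation:
    """Hypothetical container for the fields visible in the example reference."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    doi: str

    def format(self) -> str:
        # Authors are separated by "; ", as in the example reference.
        author_list = "; ".join(self.authors)
        # The DOI is rendered as a resolvable link, matching the example.
        return (
            f"{author_list} ({self.year}): {self.title}. "
            f"{self.publisher}. http://dx.doi.org/{self.doi}"
        )


# Values copied from the example reference quoted above.
sorghum = DatasetCitation(
    authors=[
        "Zheng, L-Y", "Guo, X-S", "He, B", "Sun, L-J", "Peng, Y", "Dong, S-S",
        "Liu, T-F", "Jiang, S", "Ramachandran, S", "Liu, C-M", "Jing, H-C",
    ],
    year=2011,
    title="Genome data from sweet and grain sorghum (Sorghum bicolor)",
    publisher="GigaScience",
    doi="10.5524/100012",
)
print(sorghum.format())
```

Keeping the DOI as a separate field means the same record can be rendered either as a reference-list entry or as a plain link.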
Problem: Where can data be stored – permanently?
• Publishers not best placed to run repositories for long term preservation of large datasets
• Mirrors of publisher content not able to accept arbitrary amounts of additional data
• Many data repositories exist but most are domain/location specific and there are many different types of funding model, license agreement and persistent identifiers in use
Solution #1: Journal with integrated database
www.gigasciencejournal.com
www.biomedcentral.com
GigaScience publishes ‘big-data’ studies from the entire spectrum of life sciences
Editor-in-Chief: Laurie Goodman, BGI (USA)
Editor: Scott Edmunds, BGI (China)
Assistant Editor: Alexandra Basford, BGI (China)
Benefits
• The BGI is covering all APCs for the first year after launch
• Novel publishing format: manuscript publication and data hosting
• Assignment of data DOIs allows separate data citation
http://gigadb.org/
GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”…
Anatomy of a GigaScience Publication
[Diagram labels: Data, Idea, Study, Analysis, Answer, Metadata]
Solution #2: Comprehensive author information on available data repositories
http://datacite.org/repolist
http://www.biomedcentral.com/about/supportingdata
Solution #3: Research on repositories
http://publicationethics.org/files/u661/EthicalEditing_Autumn2012_final.pdf
We are looking for repositories with interests in clinical research data – can you help?
Problem: Data are not consistently linked to publications
• Data deposition policies are not established in all fields
• Even where they are, links/accession numbers tend to be inconsistently presented and rarely cited
• Researchers may, independently of journal requirements, deposit data in repositories
• A missed opportunity to enhance the literature
Solution #1: ‘Availability of supporting data’ article section
• A tool to put data deposition policies – encouraged or mandated – into practice
• Provides links in a consistent place within an article to supporting data, regardless of the location or format of the data
• Data must be permanently available (DOI or equivalent)
• ~50 journals including GigaScience, BMC series
http://www.biomedcentral.com/about/supportingdata
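As a rough illustration of the "DOI or equivalent" requirement, the sketch below checks whether a supporting-data link contains something shaped like a DOI. The pattern is a common heuristic and an assumption here, not BioMed Central's actual validation rule, and the function name `extract_doi` is hypothetical. A match only shows that the link looks like a DOI; it does not confirm that the DOI resolves or that the data are permanently available.

```python
import re
from typing import Optional

# Heuristic (assumed) shape of a DOI: "10.", a 4-9 digit registrant code,
# a slash, then a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(link: str) -> Optional[str]:
    """Return the DOI-like substring of a link, or None if there is none."""
    match = DOI_PATTERN.search(link)
    return match.group(0) if match else None


# A DOI link and an ordinary web page, both taken from this presentation.
print(extract_doi("http://dx.doi.org/10.5524/100012"))
print(extract_doi("http://www.biomedcentral.com/about/supportingdata"))
```

Accession numbers and other "equivalent" persistent identifiers would need their own checks, which is part of why a consistent article section for these links is useful.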
Availability of supporting data
BMC Res Notes 2012, 5:21 http://www.biomedcentral.com/1756-0500/5/21/
GigaScience 2012, 1:3 http://www.gigasciencejournal.com/content/1/1/3
Solution #3: Lab notebook integration
• BMC authors entitled to LabArchives’ (http://www.labarchives.com/bmc) online lab notebook with 100 MB of free storage
• Features include:
- Data publishing with DOI assignment
- Citable, linkable data supporting publications
- Reusable/integrate-able data with CC0 waiver
- Integrated manuscript submission to BMC journals
- Additional free storage (standard is 25 MB)
http://blogs.openaccesscentral.com/blogs/bmcblog/entry/labarchives_and_biomed_central_a
LabArchives partnership
24 Oct 2012
Open data partnership leads to release of data from Nobel Prize-winning laboratory for public use
http://www.biomedcentral.com/presscenter/pressreleases/20121024c
Problem: Licensing that restricts efficient data integration and (re)use
“The data should be released in standardized formats without intellectual property constraints.” Conway PH, VanLare JM: Improving Access to Health Care Data: The Open Government Strategy. JAMA 2010;304(9):1007-1008.
http://pantonprinciples.org/
http://www.isitopendata.org/
“[P]eople mis-use copyright licenses on uncopyrightable materials and data sets: the confusion of the legal right of attribution in copyright with the academic and professional norm of citation of one's efforts.” John Wilbanks, VP, Science, Creative Commons, http://bit.ly/djl5Fa August 11, 2010
“...any restrictions on use should be strongly resisted and we endorse explicit encouragement of open sharing.” Schofield et al.: Post-publication sharing of data and tools. Nature 2009, 461:171.
Why Creative Commons CC0?
• interoperability: CC0 is human and machine-readable
• universality: CC0 is global and universal and widely recognized
• simplicity: no need for humans to make, and respond to, individual data requests – avoids “attribution stacking” with CC-BY licenses
Schaeffer P: Why does Dryad use CC0? http://blog.datadryad.org/2011/10/05/why-does-dryad-use-cc0/
http://creativecommons.org/publicdomain/zero/1.0/
Solution: Stakeholder engagement and community collaboration, leadership
Public consultation on implementing CC0 for data published in open access journals: closes 10th November 2012
http://blogs.biomedcentral.com/bmcblog/2012/09/10/put-the-open-in-open-data/
Hrynaszkiewicz I, Cockerill MJ: Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals. BMC Research Notes 2012, 5:494 http://www.biomedcentral.com/1756-0500/5/494
Implementing CC0 in journals – how?
• Specify a date from which the new license would apply to data (CC-BY remains for other content)
• Only applies to data submitted to the journal
• Some relatively minor technical and operational implications
• Cultural change may be the biggest challenge
• Consultation is identifying common concerns, FAQs, and further definitions and use cases for open data in journal publications
Hrynaszkiewicz I, Cockerill MJ: Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals. BMC Research Notes 2012, 5:494 http://www.biomedcentral.com/1756-0500/5/494
Problem: Lack of guidance, exemplars, incentives to make data reusable
• Sharing/publishing detailed human subjects data, in the absence of explicit consent, can potentially infringe privacy (ethically and legally)
• Data are more (re)usable if published in community endorsed, standard formats
• Standards and appropriate guidance do not yet exist in all domains
• Few incentives to follow data standards
Solution #1: Work with journal editors to produce guidance where it is needed
BMJ 2010;340:c181
Co-published in: Trials 2010, 11:9
Solution #2: Publish exemplars
Solution #3: Incentivize, promote and share best practice and standards
http://www.biomedcentral.com/bmcresnotes/series/datasharing
http://biosharing.org/standards_view
Problem: Adding value to data of use to researchers, readers and publishers
• Text/data mining applications are often research project- or research-specific and not always attractive to commercial publishing platforms and their customers
• Value to the non-expert can be limited
• Makes business model/case challenging for publishers
http://www.biomedcentral.com/about/datamining/
www.casesdatabase.com
www.casesdatabase.com – coming soon
The future...
Image adapted from Gillam et al: The Healthcare Singularity and the Age of Semantic Medicine. In The Fourth Paradigm (2009)
Questions?
Iain Hrynaszkiewicz
Publisher (Open Science), BioMed Central
[email protected]
http://www.mendeley.com/profiles/iain-hrynaszkiewicz/
http://uk.linkedin.com/in/iainhz
@iainh_z