Open Data HK: open science meets open data. A primer from Scott Edmunds


Upload: scott-edmunds

Post on 02-Jul-2015


DESCRIPTION

Talk by Scott Edmunds on open science meets open data at ODHK.MEET.11 on the 21st November 2013 at Delaney's Hong Kong.

TRANSCRIPT

Page 1: Open Data HK: open science meets open data. A primer from Scott Edmunds

Scott Edmunds

@SCEdmunds

@GigaScience

meets

Open science primer

Page 2:

Can this be considered open data?

http://biology.clc.uc.edu/fankhauser/labs/genetics/dna_isolation/thymus_dna.htm

Page 3:

Does this qualify as open source?

http://2011.igem.org/Team:UC_Davis

Page 4:

What is Open (Science) Data?

• Something very very very geeky

• Free & open access to data about the world around us

o Searchable, findable

o Machine-readable, app-makeable, Excel-usable

o Without restrictions/limitations

• This (examples)

Page 5:

About me:

• Scott Edmunds

• Molecular biology, sci editing & comms

• Scientific journal & (big) data publishing

• Reproducibility & open science

Journal, data-platform and database for large-scale biological data

www.gigasciencejournal.com

Page 6:

About me:

Page 7:

About my employer:

• Formerly Beijing Genomics Institute

• Founded in 1999 (1% of HGP)

• China’s 1st citizen-managed not-for-profit research institute, funded by commercial sequencing-as-a-service (BGI Tech)

• Now the largest genomic organization in the world

• HQ in Shenzhen, most data production in BGI HK (Tai Po)

Page 8:

Standing on the shoulders of giants

Page 9:

Open Data 1665?

“Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.”

Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995

Page 10:

OKFN: 8 types of open data

http://science.okfn.org/

Page 11:

Panton Principles

http://pantonprinciples.org/

Page 12:

Science Data Volumes

[Figure: data volumes by field]

Volume labels: Exabytes, Petabytes, 100’s of Petabytes

Fields: Astrophysics, HE Physics, Biology

Examples: Square Kilometer Array, Large Hadron Collider, Sequencing, Mass Spec, Imaging

Page 13:

The long tail of scientific data…

Esoteric formats, poorly structured

Tabular, often spreadsheet-based

Issues the open data community is well used to (data cleaning, scraping, etc.)

Page 14:

Open Data in Physics

1961: CERN pre-prints shelf

http://cerncourier.com/cws/article/cern/28654

1991-date: arXiv

http://arxiv.org/

Page 15:

Open Data in Biology

1934: newsletter era

1980: database era

1987: online era

2010’s: “bioinformatics bingo” era

Page 16:

BGI HK Chamber O’Illumina’s

The LHC of Biology?

20PB of storage

Page 17:

Open Data in Chemistry

Page 18:

Closed Data in Chemistry

Page 19:

Genomics: open-data success story?

Page 20:

Sharing/reproducibility helped by stability of:

1. Platforms

2. Repositories

3. Standards

[Figure labels: 1st Gen, 2nd Gen]

Page 21:

Genomics Data Sharing Policies…

Bermuda Accords 1996/1997/1998:

1. Automatic release of sequence assemblies within 24 hours.

2. Immediate publication of finished annotated sequences.

3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.

Fort Lauderdale Agreement, 2003:

1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production.

2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.

Toronto International data release workshop, 2009:

The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.

Page 22:

Sharing aids fields…

Rice v Wheat: consequences of publicly available genome data.

[Bar chart: rice v wheat, y-axis scale 0 to 700]

Page 23:

Digitizing the world

Can we make everything open data?

Page 24:

NO

Page 25:

NO

The (non-) human centipede: first sequence

Page 26:

[Diagram labels:]

SOURCE

USER

NARRATIVE DATA

PUBLISHER

EXTERNAL DATABASES (ArrayExpress, Morphbank)

DATA PRODUCTION

CURATION/INTEGRATION

• Genomics

• Barcoding

• Imaging

• microCT

• Video

(SOCIAL) MEDIA

Page 27:

NO

Page 28:

What is open science? 5 flavours:

Benedikt Fecher and Sascha Friesike: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2272036

Page 29:

Democratic:

Page 30:

Biggest Challenge: Closed Access

WWW.RIGHTTORESEARCH.ORG

Page 31:

Biggest Challenge: Closed Access

Handful of closed access STM publishers control market

Force libraries to buy “bundles”

Revenue >$9B

Average cost /article >$5000 USD

Publishers retain copyright

Prevent data mining of content

Withhold information from the 99.9% who need it!

Page 32:

Biggest Challenge: Closed Access

Page 33:

Publishing: better than a gold mine

See: http://alexholcombe.wordpress.com/2013/01/09/scholarly-publishers-and-their-high-profits/

Page 34:

Increasing strain on library budgets

[Line chart: MIT library purchases v inflation 1986-2006. Y-axis: percentage change, -50% to 400%. X-axis: year, 1986 to 2004. Series: Consumer Price Index, Serial Expenditures, # Serials Purchased, # Books Purchased, Book Expenditures. Annotations: Journal expenditure, Inflation.]

Page 35:

Too expensive for Harvard…

Page 36:

The good news: the fightback has started…

http://thecostofknowledge.com/

Page 37:

The Solution: Open Access

Budapest Open Access Initiative:

“By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

• Maximizes reuse and access

• Gives authors control over the integrity of their work and the right to be properly acknowledged and cited.

• “Real” OA asks for no restrictions/limitations = CC-BY

Page 38:

Hong Kong: off the map

https://www.openaccessbutton.org/

Push the button!

Page 39:

Hong Kong: good with theses…

http://hub.hku.hk/

Page 40:

Hong Kong: still some work to go with OA

…Singapore beats us

Page 41:

Pragmatic:

Infrastructure:

Page 42:

Pragmatic/Infrastructure:

Crowdsourcing, wisdom of the masses

Wiki science:

GeneWiki

• 10,000 distinct gene pages.

• 1.42 million words and 78MB data.

• 50 million views & 15,000 edits per year.

http://en.wikipedia.org/wiki/Portal:Gene_Wiki

GitHub science:

A hypothetical Git workflow for a scientific collaboration involving 3 authors. Karthik Ram: http://www.scfbm.org/content/8/1/7

Page 43:

Open Lab Notebooks

Page 44:

Our crowdsourcing example:

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Page 47:

Downstream consequences:

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1. Citations (~180)

2. Therapeutics (primers, antimicrobials)

3. Platform Comparisons

4. Example for faster & more open science

Page 48:

1.3 The power of intelligently open data

The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.

Page 51:

Pragmatic/Infrastructure:

Open Innovation Challenges

http://www.gov.hk/en/theme/psi/contest/contest_events.htm

http://www.scientificamerican.com/openinnovation/

Page 52:

Public:

Page 53:

Indie Science

Biohacker spaces

CoResearch labs

Crowdfunding

DIYbio

Open hardware

http://www.perlsteinlab.com/

Page 54:

Biggest crowdfunding successes

Page 55:

Utilizing students: iGEM

iGEM:

http://2011.igem.org/Team:UC_Davis

Page 56:

The “People’s Parrot”

Puerto Rican Parrot Genome Project (Amazona vittata)

Rarest parrot, national bird of Puerto Rico

Community funded from artworks, fashion shows, beer brands, crowdfunding…

Genome annotated by students in community college as part of bioinformatics education

Paper and Data published in GigaScience and GigaDB

Taras K Oleksyk, et al. (2012): A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14

Steven J. O’Brien (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13

Oleksyk et al. (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience. http://dx.doi.org/10.5524/100039

Page 57:

Public: Citizen Science

Galaxy Zoo:

Zooniverse:

887,355 “Zooites” and counting

https://www.zooniverse.org/

Page 58:

Public: Citizen Science

1987-1997

http://sabap2.adu.org.za/

Page 59:

Easy to get started…

http://crowdcrafting.org/

Page 60:

Public: Games with a Purpose

http://fold.it/

http://www.sciencegamecenter.org/

Page 62:

OpenSciDev

http://openscidev.com/

Page 63:

OpenSciDev

http://openscidev.com/

Questions asked:

1. What value framework is a prerequisite for open science?

2. How can open science support visibility and communication of science outside formal academic structures?

3. How can open science create education?

4. How can the economic and social value of open science be measured?

Currently working on:

• Writing working paper on these questions

• Building networks across Africa, Asia, Latin America and the Caribbean.

• Setting up call for funding for OpenSciDev projects ($2-3M)

Page 64:

To summarize:

• Open data is more than just government data (although research data is mostly government-funded too)

• Need for OA advocates & policies in Hong Kong (role for ODHK?)

• Much the science community can still learn about open licensing

• Much the wider open data community can learn about community engagement from Citizen Science, GWAP, etc.

• Asia (inc HK) behind US/EU on many of these activities, but can we learn lessons from success of iGEM and “Jamboree” model? […King]