open science 2014

Code as a Research Product: Open Source for Open Science. Dan Gezelter (@gezelter), OpenScience.org (also: The University of Notre Dame)

Upload: dan-gezelter

Post on 10-May-2015


DESCRIPTION

Code as a Research Product: Open Source for Open Science. Given at the NIAID Bioinformatics festival 2014.

TRANSCRIPT

Page 1: Open science 2014

Code as a Research Product: Open Source for Open Science

Dan Gezelter @gezelter

OpenScience.org (also: The University of Notre Dame)

Page 2: Open science 2014

Suppose your colleague sends you an email that says, I’ve found something amazing. I don’t have time to tell you exactly what it is, or how I found it, but here’s proof that I discovered it:

smaismrmilmepoetaleumibunenugttauiras

Page 4: Open science 2014

On the 25th of July in 1610, Galileo discovered that Saturn was apparently situated between two smaller companions that always moved together. Wanting to establish his priority of discovery, he sent to Kepler (and others) the following anagram, which he informed them was a coded description of his latest discovery:

smaismrmilmepoetaleumibunenugttauiras

Altissimum planetam tergeminum observavi

I have observed the highest of the planets [Saturn] three-formed

Page 5: Open science 2014

Sadly, this kind of scientific communication was common at the time. Newton, Huygens, Hooke, and Leonardo all used similar devices to hide their discoveries and methods from each other.

Page 6: Open science 2014

In 1665, the Philosophical Transactions (one of the earliest scientific journals) was founded by Henry Oldenburg.

Page 7: Open science 2014

The Royal Society is collaborating with JSTOR to digitize, preserve, and extend access to Philosophical Transactions (1665-1678).

www.jstor.org

Page 9: Open science 2014

Although the importance of reproducibility is as old as scientific inquiry, the importance of sharing scientific methodology was adopted slowly.

Page 11: Open science 2014

Over the next 200 years, publishing and sharing methodology became commonplace.

In August of 1867, the chemist William Crookes wrote an obituary for his mentor, Michael Faraday. He recounted Faraday’s advice to his students:

“The secret,” said he, “is comprised in three words — Work, Finish, Publish.”

It must be confessed that young chemists of the present day follow this advice, carefully omitting the second word.

Page 12: Open science 2014

Surely science has continued to evolve since 1867…

Page 14: Open science 2014

Today, thousands of scientific papers report on computations that cannot be reproduced without access to secret software.

The inner workings of this secret software are hidden from skeptics and other researchers.

If you try to reproduce the capabilities of the secret software in another code, the entity that owns this software bans the university where you work.

Think I’m exaggerating? BannedByGaussian.org

Page 15: Open science 2014

Science has been Open since 1665. We just need to remind ourselves of this fact every few years…

Page 19: Open science 2014

What is Open Science?

• Open Source
• Open Notebook
• Open Data
• Open Metadata
• Open Access

Transparency in experimental methodology, observation, and collection of data.

Public availability and re-use of scientific data.

Public accessibility and transparency of scientific communication.

Page 20: Open science 2014

What is Open Science?

Open Science is the idea that scientific knowledge of all kinds should be shared publicly as early as is practical in the discovery process.

Page 21: Open science 2014

Reproducibility

Reproducibility of experiments is one of the foundations of science.

We expect the results of empirical tests to be universal. Independent scientists should be able to subject theories to similar tests in different locations, on different equipment, and at different times, and get similar answers.

Page 22: Open science 2014

Reproducible Computational Science

• For simple models and small data sets, calculations are reproducible in principle and in practice.

• As simulations become more complex and data sets become larger, calculations that are reproducible in principle are no longer reproducible in practice without access to the code, data, and meta-data.

• Reproducibility now requires public access to code, data, and meta-data.

Page 23: Open science 2014

Reproducible Computational Science

Reports of numerical experiments should include:

1. All source code needed to reproduce the calculation
2. All input data used to perform the calculation
3. All meta-data required to allow other codes to use the input data

These are equivalent to the methodology section of an experimental paper. This standard requires Open Source, Open Data, and Open Meta-data for reproducible computational science.
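One way to make the three-item checklist above concrete is a small manifest that records a checksum for every source file, input file, and meta-data file behind a reported calculation. The sketch below is illustrative only: the `build_manifest` helper, the category names, and the throwaway file names are assumptions, not part of any standard or of the tools discussed in this talk.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files_by_category: dict) -> dict:
    """Map each category (hypothetical names: code, data, metadata)
    to a {filename: checksum} table."""
    return {
        category: {p.name: sha256_of(p) for p in paths}
        for category, paths in files_by_category.items()
    }


# Tiny demonstration using throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "simulate.py").write_bytes(b"print('run')\n")
    (root / "input.xyz").write_bytes(b"1\n\nAr 0.0 0.0 0.0\n")
    (root / "units.json").write_bytes(json.dumps({"length": "angstrom"}).encode())
    manifest = build_manifest({
        "code": [root / "simulate.py"],
        "data": [root / "input.xyz"],
        "metadata": [root / "units.json"],
    })

print(json.dumps(manifest, indent=2))
```

Publishing a manifest like this next to a paper lets a reader confirm they hold exactly the code, data, and meta-data that produced the reported numbers.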

Page 24: Open science 2014

Reproducible Research Standard

1. Release media components (text, figures) under CC BY.
2. Release code components under MIT license or similar.
3. Attribution license on selection and arrangement of data.
4. Release data under CC0.

V. Stodden, “Enabling reproducible research: Licensing for scientific innovation,” International Journal of Communications Law & Policy 13, 1 (2009).

Page 25: Open science 2014

Why aren’t all scientific programs open source?

Page 26: Open science 2014

Two Open Source Science Codes

                               Jmol                         OpenMD
Started:                       1998                         2004
Purpose:                       Molecular Visualization      Molecular Dynamics
Languages:                     Java                         C++, Python
Developers:                    38                           17 (graduate students)
Lead Developers:               7                            1
Code base:                     472,956 lines                92,308 lines
Person-Years:                  125                          23
Estimated Development Costs:   $6,848,949                   $629,389
Explicitly-funded Costs:       $0                           $0
Downloads:                     Over 831,656 at SourceForge  5,472
                               alone (possibly millions
                               more)
External Citations:            221                          21
Citations to lead developers:  157                          21
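The person-year and cost rows were taken from ohloh.net (see Sources). As a rough illustration of how estimates of this kind are commonly produced, the sketch below applies the basic COCOMO "organic" formula, effort = 2.4 × KLOC^1.05 person-months, to the two line counts. Treating this as the model behind the table is an assumption, the salary argument is a free parameter, and the function names are hypothetical. The two codes are Jmol and OpenMD, discussed on the following slides.

```python
def cocomo_person_years(lines_of_code: int) -> float:
    """Basic COCOMO, organic mode: person-months = 2.4 * KLOC**1.05."""
    kloc = lines_of_code / 1000.0
    person_months = 2.4 * kloc ** 1.05
    return person_months / 12.0


def estimated_cost(lines_of_code: int, salary_per_year: float) -> float:
    """Effort multiplied by an assumed average yearly salary."""
    return cocomo_person_years(lines_of_code) * salary_per_year


# Line counts are the "Code base" figures from the table.
for name, loc in [("Jmol", 472_956), ("OpenMD", 92_308)]:
    print(f"{name}: about {cocomo_person_years(loc):.0f} person-years")
```

With these inputs the formula gives roughly 129 and 23 person-years, close to the table's 125 and 23. That suggests, but does not confirm, that a model of this kind was used.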

Page 27: Open science 2014

Comparative History

[Diagram. Labels: Post-doc; Pharma researcher; Graphics guru; Informatics Post-doc; Graduate Student Academic; Grad Student 1; Grad Student 2; Grad Student 3; Advisor; Group Code; Other Groups; Code Re-use]

Page 28: Open science 2014

Jmol is a useful tool

• Filled a void created by the death of a closed-source tool.

• Developed by a series of project leads and their geographically-distributed teams. The lead developers hand off the code when they become too busy.

• Application focus changed dramatically over 16 years.

• External users of the code tend to run the application rather than re-use algorithms.

• Jmol became the standard tool for embedding chemical structures in web pages:
  • RCSB Protein Data Bank (PDB)
  • Inorganic Crystal Structure Database
  • Viewer for Folding @ Home projects
  • Can be directly included in Sakai, Moodle, and WebAssign sites

Page 29: Open science 2014

Jmol is part of other scientific tools

• Bioclipse (integrated environment for biomolecule investigation)
• CaGe
• ChemPad (3D models calculated on-the-fly from a formula sketched by hand on a tablet PC)
• iBabel (a GUI for Openbabel)
• Janocchio (calculates NMR coupling constants and NOEs)
• Molecular Workbench
• PFAAT (Protein Family Alignment Annotation Tool)
• ProteinGlimpse
• Spice
• STING Millennium
• STRAP
• Taverna

Page 30: Open science 2014

Jmol helps disseminate data

Jmol provides structure visualization for:

• ACS Chemical Biology
• Biochemical Journal
• Chemical Reviews (ACS)
• Crystallography Journals Online (IUCr)
• Molecular BioSystems (Royal Soc. Chem.)
• Nature Chemical Biology
• Nature Structural & Molecular Biology
• Inorganic Chemistry (ACS)
• JACS
• Journal of Chemical Education
• Journal of Molecular Biology
• Journal of Natural Products
• Organic Letters

Page 32: Open science 2014

OpenMD

• Merged student codes that carried out similar tasks.
• Development was done within one research group and was piggy-backed on other funded projects.
• A journal article outlines the code’s capabilities, and attribution is requested in the license.
• Application development preserved group memory.
• External users of the code tend to re-use algorithms rather than run the application.

Page 33: Open science 2014

OpenMD

• Tight coupling of data & meta-data

• Code versioning information stored in generated data

• Reproduce simulations easily, but reproducibility is not the same as replicability.

• In parallel architectures, replicability may not be possible.
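One reason bit-for-bit replication can fail on parallel machines is that floating-point addition is not associative, so the order in which partial sums are combined can change the last digits of a result. The sketch below illustrates this with plain Python floats. The chunked summation is only a stand-in for a parallel reduction, not OpenMD's actual code, and the helper name is hypothetical.

```python
import math
import random

# Floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def chunked_sum(values, n_chunks):
    """Sum the way a parallel reduction might: per-chunk partial sums
    first, then a sum over the partial results."""
    size = math.ceil(len(values) / n_chunks)
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)


rng = random.Random(12345)  # fixed seed so the demonstration repeats
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8)
          for _ in range(10_000)]

# Same numbers, different grouping: the totals typically agree to many
# digits but need not be identical in the trailing bits.
print(chunked_sum(values, 1))
print(chunked_sum(values, 16))
```

A simulation can therefore be reproducible (same physics, statistically equivalent results) without being replicable down to the last bit.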

<OpenMD version=2>
  <MetaData>
molecule{
  name = "D";
  atom[0]{
    type = "D";
    position(0.0, 0.0, 0.0);
    orientation(0.0, 0.0, 0.0);
  }
}
component{
  type = "D";
  nMol = 3456;
}
ensemble = NVE;
forceField = "Multipole";
cutoffMethod = "shifted_force";
electrostaticScreeningMethod = "damped";
cutoffRadius = 12.0;
dampingAlpha = 0.14;
dt = 1.0;
runTime = 1e3;
sampleTime = 10.0;
statusTime = 1.0;
seed = 8675309;
## Last run using OpenMD Version: 2.1 Revision: 1972
  </MetaData>
  <Snapshot>
…
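Because the run parameters and the code version travel with the data, a reader can recover them mechanically. The sketch below pulls simple `key = value;` assignments and the version comment out of a meta-data block like the one above. It is a hypothetical illustration built on regular expressions, not OpenMD's own parser, and it deliberately skips nested blocks such as `molecule{...}`.

```python
import re

# A shortened copy of the meta-data block, for demonstration only.
SAMPLE = """<OpenMD version=2>
<MetaData>
ensemble = NVE;
forceField = "Multipole";
cutoffRadius = 12.0;
dt = 1.0;
seed = 8675309;
## Last run using OpenMD Version: 2.1 Revision: 1972
</MetaData>
"""


def read_metadata(text: str) -> dict:
    """Collect `key = value;` lines found inside <MetaData>."""
    block = re.search(r"<MetaData>(.*?)</MetaData>", text, re.DOTALL)
    if block is None:
        return {}
    pairs = re.findall(r"^\s*(\w+)\s*=\s*(.+?);\s*$", block.group(1), re.MULTILINE)
    return {key: value.strip('"') for key, value in pairs}


def read_code_version(text: str):
    """Return (version, revision) from the trailing comment, if present."""
    match = re.search(r"Version:\s*(\S+)\s+Revision:\s*(\d+)", text)
    return (match.group(1), match.group(2)) if match else None


params = read_metadata(SAMPLE)
print(params["seed"], params["forceField"])
print(read_code_version(SAMPLE))
```

Values are returned as strings. A real reader would convert them to numbers and handle the nested molecule and component blocks.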

Page 34: Open science 2014

Why aren’t all scientific programs open source?

Page 35: Open science 2014

Every discussion in science ends up in a discussion on tenure and grants.

Page 36: Open science 2014

Once a year, academic scientists do this:

6 – College of Science 2013 Accomplishments – Department of Chemistry and Biochemistry

My graduate students have done excellent research that is well respected in my community, and after their degrees were awarded, they have all gone on to positions in which they can contribute to science and to society in meaningful ways. We are deeply engaged in the university’s goal to become a pre-eminent research university. This past year, our work in the OpenScience movement was recognized nationally and internationally at the White House Champions of Change event. That recognition contributed directly to communication with the external constituents of the university (Goal 5).

Generate Personal “Citation Report”

1. Go to ISI Web of Knowledge: http://www.webofknowledge.com
2. Select the Web of Science tab at the top
3. In 2nd box type in:
   • Last name and first initial (no commas). Use all variations, i.e. Doe J OR Doe JR
   • Make sure that Author appears in the small box on the right
4. Under Current Limits at the bottom of the page, select:
   • All Years (normally the default)
   • Click on Search above
5. Under Refine Results in the column on the left side of the page, you may be able to perform refinements that will help to limit the list to only your citations.
6. Click on Refine button
7. Click on Create Citation Report (graph icon/upper right)
8. After generating the citation report, remove any citations that are not yours by checking the left-hand box on that citation
9. Rerun Citation Report
10. Copy and paste both green plots into the space below (highlight both or right-click and copy each one).
11. Please also copy and paste your citation metrics (h-index, citations etc.), which appear to the right of the plots.

You are done! (Don’t forget to Log Out)
* If you need help in generating these citation plots, please contact Thurston Miller - He will be happy to assist you! *

Copy and paste Citation Report bar graphs (2) into the space below (please submit as PC viewable graphs, which QuickTime often are not). Please also copy and paste your citation metrics in this space.

Page 37: Open science 2014

Recognition & Attribution

• Scientists stay alive professionally by publishing
  • Paper count
  • Citation count
  • h-index
• Time spent on open science projects reduces publication rates
• Scientific software tools are often not cited
• Even if they were cited, how would that citation get tied to a researcher?
• How can a scientist show her institution the value of her project?

Attribution metrics should (but don’t) take into account:

1. Effort to maintain a useful resource
2. Importance to the scientific community
3. Externalities beyond the scientific community
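For readers unfamiliar with the last of the publication metrics: the h-index is the largest number h such that h of a researcher's papers each have at least h citations. A minimal sketch of that definition follows. The function name and the sample counts are made up for illustration.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


papers = [10, 8, 5, 4, 3]  # hypothetical citation counts
print(h_index(papers))  # 4
```

Nothing in this calculation sees software, data sets, or any other research product that is not indexed as a cited paper, which is the gap the slide describes.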

Page 38: Open science 2014

Recognition & Attribution

Until recently, there was no way to measure open products of research (outside of traditional publications) with a simple metric that can be used by institutions.

This is starting to change: ImpactStory, fidgit DOI lookups...

What we need is institutional recognition of alternative metrics.

Page 41: Open science 2014

Recognition & Attribution

What we need is institutional recognition of alternative metrics

And a pony

There is no drive to make these changes in the academic world.

There’s no drive for this in the for-profit or society journals. The new PLOS One sharing policy is a refreshing change.

The funding agencies may be our best hope for recognizing code & data as primary research products.

Page 42: Open science 2014

Sustainability

Academic scientists know almost nothing about good coding practices:

• Source version control systems (cvs, svn, git, Hg)
• Agile (or any other) development models
• Design patterns
• Object-oriented languages
• Strong typing
• Public source repositories (SourceForge, github)
• Differences among open source licenses
• Modern build & testing systems
• Unit testing
• Bug & issue tracking
• Designing for usability and usability testing
• UI design
• Error handling
• Introductory user documentation

Because they aren’t forced into good practices, scientists often create code that is impossible to maintain effectively. This does not lead to sustainable open science.

Page 44: Open science 2014

Sustainability

Why not employ professional programmers to do scientific coding?

Computer scientists often know little about the domain sciences

• It is significantly faster for me to train a computational chemist in good coding practices than it is to train even an accomplished programmer in the various disciplines we use.

Page 45: Open science 2014

Sustainability

Resources are also necessary for sustainable Open Science

NIH, DOE & DARPA fund specific kinds of science. There is little room for projects which enrich the overall scientific enterprise, but don’t constitute novel research themselves. Tools are rarely funded.

Funding agencies should require delivery of primary research products:

• code in a public repository
• data in a public repository
• make depositions a part of the reporting structure for funded grants

Page 47: Open science 2014

Outlook

Open source is essential for reproducible, open science.

There are no easy solutions to problems of Recognition, Attribution, and Sustainability.

That doesn’t mean we get to step away from open source. To do so would be to go back to 1611:

Haec immatura a me jam frustra leguntur oy

Cynthiae figuras aemulatur mater amorum
“The mother of love imitates the shape of Cynthia”
(Venus imitates the phases of the moon)

Page 48: Open science 2014

Acknowledgments

The Alfred P. Sloan Foundation: startup funding for the Open Science Project

Brian Glanz & the Open Science Federation: supporters and friends of Open Science

Michael Nielsen, author of Reinventing Discovery: for making us aware of the Galileo & Faraday stories

Victoria Stodden, Columbia University: developer of the Reproducible Research Standard

The National Science Foundation: OpenMD was indirectly supported under grant CHE-0848243

Page 49: Open science 2014

Sources

• E. A. Partridge and H. C. Whitaker, “Galileo’s Work on Saturn’s Rings - A Historical Correction,” Popular Astronomy 3, 408-414 (1896).

• Henry Oldenburg, “The Introduction,” Philosophical Transactions 1, 1-3 (1665) doi:10.1098/rstl.1665.0002

• William Crookes, “Faraday,” The Chemical News XVI(404), 110-111 (1867)

• C. Hempel, Philosophy of Natural Science 49 (1966).

• The distinction between verifiable in principle and verifiable in practice was originally made in: A. J. Ayer, Language, Truth and Logic, (New York: Dover, 1946) p. 32.

• E. Sober Philosophy of Biology (Boulder: Westview Press, 2000), pp. 50-51.

• J. Lett, Science, Reason and Anthropology, The Principles of Rational Inquiry (Oxford: Rowman & Littlefield, 1997), p. 47

• The Yale Law School Roundtable on Data and Code Sharing, “Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science,” Computing in Science & Engineering 12(5), 8-12 (2010) doi: 10.1109/MCSE.2010.113

• V. Stodden, “The Legal Framework for Reproducible Scientific Research: Licensing and Copyright,” Computing in Science & Engineering 11(1), 35-40 (2009) doi: 10.1109/MCSE.2009.19

• V. Stodden, “Enabling reproducible research: Licensing for scientific innovation,” International Journal of Communications Law & Policy 13, 1 (2009).

• Jmol is available at jmol.sf.net

• OpenMD is available at openmd.org

• Source code analysis and cost estimates were done at ohloh.net; reference counts are from webofknowledge.com