Fake Science, Fake News


Volume 6 Issue 4

©2018 DataONE, 1312 Basehart SE, University of New Mexico, Albuquerque, NM 87106

In this wondrous age of fake news, media attention has shifted to science. Most of us remember the stink that was made of “climate gate” several years ago and the infrequent references to falsified experimental data in the health sciences and other science and engineering disciplines. Many of us are also familiar with work led by the Center for Open Science that has demonstrated that the results of many studies are not repeatable.

I won’t revisit these studies now, but will highlight two papers: one in The Atlantic, which declares that “the scientific paper is obsolete” (1), and a second in The Economist, which implies that peer review is in trouble (2). The first paper bases its argument on the observation that science and scientific papers are much more complex today:

“The earliest papers were in some ways more readable than papers today. They were less specialized, more direct, shorter, and far less formal. Calculus had only just been invented. Entire data sets could fit in a table on a single page. What little ‘computation’ contributed to the results was done by hand and could be verified in the same way.” (1)

The article goes on to suggest that today’s paper is responsible for the replication crisis, i.e., the failure of papers “to report what you’ve actually discovered, clearly enough that someone else can discover it for themselves.” The author then describes the history of the Mathematica Notebook, IPython Notebook, and Jupyter Notebook, and postulates that a Jupyter-like notebook could end the replication crisis, provided that journals mandate the submission of notebooks and that data sharing and work sharing become commonplace practices for earning prestige and funding.

The second article (2) includes estimates of 400,000 papers published this past year in questionable journals, i.e., journals that claim to support peer review but do not in fact do so. The article places the blame, in part, on open-access journals and on academic administrators who “cannot tell good publications from bad, or do not care.” Suggestions for remedying the problem include “abandon anonymous peer review altogether, and make the process open and transparent” and “return to journal subscriptions” (thereby eliminating open-access journals).

I agree that many scientific papers do not adequately enable someone else to re-discover the same results and that peer review is sometimes flawed, but I do not believe that either article delved deeply into the issues, nor did they offer satisfactory solutions. In particular, creating Jupyter-like Notebooks will not, alone, solve the replication crisis; likewise, abandoning anonymous peer review and open-access journals altogether will only exacerbate the “peer-review crisis.”

I do, in fact, think that Jupyter Notebooks represent a tremendous advance in that they make it possible to capture and document data management and analytical procedures throughout an experiment or project and that, further, they serve as a valuable teaching and research collaboration tool, saving significant time and money. But many other services and products also promote and support replicability, including workflow solutions (e.g., Kepler, Taverna, Pegasus), ontologies that allow researchers to use the same language in describing entities and processes, and emerging provenance-tracking systems. DataONE and affiliated data repositories such as the Arctic Data Center and Dryad (in partnership with California Digital Library) are pioneering new, user-friendly metadata management systems, quality-checking systems, and data citation tracking systems that enable researchers to more rapidly document their work (i.e., describe their data and analytical procedures), ensure data are of high quality, and track how data were subsequently used. These and other solutions by organizations such as DataCite and the Center for Open Science are positively contributing to resolving the replication and peer-review crises, without going back to the stone age where proprietary software solutions, monolithic publishers, and other dinosaurs ruled the day. We will be discussing many of these solutions at the upcoming DataONE Users Group meeting, where Matt Jones and I will also give a talk outlining how DataONE-affiliated data repositories can work with professional societies to publish “data papers.”

William Michener, Principal Investigator, DataONE

(1) James Somers, 5 April 2018, The scientific paper is obsolete, The Atlantic.
(2) Anonymous, 21 June 2018, Some science journals that claim to peer review papers do not do so, The Economist.


Summer 2018

Featured Resource

Data Management Skillbuilding Hub

The DataONE Community Engagement and Outreach group is excited to launch a new suite of resources aimed at helping all of us improve our data management skills, the Data Management Skillbuilding Hub (https://dataoneorg.github.io/Education/). Whether you are looking to learn about data sharing, working with metadata, or developing best practices surrounding other parts of the data life cycle, we’ve got you covered. Learn more in the Working Group Focus on page 4.

A key feature of these resources is users’ ability to help make them better. That’s you! We invite you to check out the content and let us know what you might change or add, or you can work directly in GitHub to offer amendments yourself. Community resources are only great when we truly work as a community.



The DUGout

The DataONE Users Group is holding its ninth annual meeting in Tucson, Arizona, on July 16, 2018, once again collocating with the Earth Science Information Partners summer meeting. The meeting theme is “Building a Community of Scientific Data Repositories in an Open Science Landscape.” All DUG members are invited to attend the annual meeting and anyone may participate remotely.

The meeting agenda focuses on DataONE sustainability, as the second five-year NSF funding period will end in September 2019.

CyberSPOT: Can schema.org content publishing be utilized as an alternative to a read-only DataONE Member Node?

Schema.org has emerged as a popular and easily deployed mechanism for adding rich structured metadata to web-accessible resources. Schema.org is a community activity that maintains and promotes schemas for structured data resources with a general goal of promoting machine-readable descriptions of resources shared across the Internet. The schema.org vocabulary can be expressed in several different encodings, with JSON-LD being a popular syntax. An important aspect of this approach is that a resource referenced by a sitemap.xml can be both machine and human readable. A typical approach would be for the sitemap.xml file to refer to landing pages for datasets. Those landing pages can be in HTML and so provide a rich experience for human visitors to the site. By embedding schema.org JSON-LD markup within the HTML, machine visitors to the site can similarly gain a detailed description of the dataset.
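To make the landing-page idea concrete, here is a minimal Python sketch of a page that serves both audiences. It is an illustration under stated assumptions, not a DataONE or repository implementation: the dataset values, the mydata.info URLs, and the helper names `landing_page` and `extract_json_ld` are all hypothetical, and the regular expression is a deliberate simplification of what a real indexer would do with an HTML parser.

```python
import json
import re

# Hypothetical dataset description; every value here is a placeholder.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example stream temperature observations",
    "description": "Placeholder description of an example dataset.",
    "url": "https://mydata.info/datasets/example-1",
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://mydata.info/datasets/example-1/data.csv",
            "encodingFormat": "text/csv",
        }
    ],
}


def landing_page(metadata: dict) -> str:
    """Return a minimal HTML landing page with the JSON-LD embedded."""
    json_ld = json.dumps(metadata, indent=2)
    return (
        "<html><head>\n"
        f"<title>{metadata['name']}</title>\n"
        f'<script type="application/ld+json">\n{json_ld}\n</script>\n'
        "</head><body>\n"
        f"<h1>{metadata['name']}</h1>\n"
        f"<p>{metadata['description']}</p>\n"
        "</body></html>"
    )


def extract_json_ld(html: str) -> list:
    """Pull every JSON-LD block out of a page (simplified regex approach)."""
    pattern = re.compile(
        r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
    )
    return [json.loads(block) for block in pattern.findall(html)]


page = landing_page(dataset)        # what a human visitor's browser renders
recovered = extract_json_ld(page)   # what a machine visitor can read back
```

The same HTML document carries the heading and description for people and the structured `Dataset` record for indexers, so no separate metadata endpoint is needed.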

DataONE Member Nodes implement a standard set of service interfaces that enable Coordinating Nodes to discover, synchronize, and manage replication of the content, and users to add content to the Member Node repository. These DataONE Member Node APIs are consistent across Member Nodes and so facilitate ease of access, since one tool can work across all repositories instead of being customized for each particular repository implementation. Recognizing the differing publicly exposed capabilities of data repositories, DataONE defines four tiers of implementation, with Tier 1 being read-only, public access; Tier 2 read-only with access control; Tier 3 read-write with access control; and Tier 4 extending Tier 3 with support for acting as a replication target.
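The tier structure can be summarized in a few lines of Python. This is only a sketch: it assumes each tier includes the capabilities of the tiers below it, as the description above implies, and the capability labels are illustrative names, not DataONE API method names.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The four Member Node tiers as described in the text."""
    READ_ONLY_PUBLIC = 1
    READ_ONLY_ACCESS_CONTROL = 2
    READ_WRITE_ACCESS_CONTROL = 3
    REPLICATION_TARGET = 4


# Lowest tier at which each capability becomes available.
# The capability names are hypothetical labels chosen for this sketch.
CAPABILITY_MIN_TIER = {
    "public_read": Tier.READ_ONLY_PUBLIC,
    "access_controlled_read": Tier.READ_ONLY_ACCESS_CONTROL,
    "write": Tier.READ_WRITE_ACCESS_CONTROL,
    "replication_target": Tier.REPLICATION_TARGET,
}


def supports(tier: Tier, capability: str) -> bool:
    """True if a node at `tier` offers `capability` (tiers are cumulative)."""
    return tier >= CAPABILITY_MIN_TIER[capability]
```

For example, `supports(Tier.READ_ONLY_PUBLIC, "write")` is false, which is why a Tier 1 node is the natural comparison point for a read-only publishing mechanism.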

DataONE recently explored whether a schema.org approach might sufficiently emulate the characteristics of a Tier 1 Member Node, and so provide a very lightweight and widely deployable mechanism for repositories to leverage the benefits of participation in the DataONE federation.

The general mechanisms for content discovery leveraged by web crawlers and indexers such as those implemented by Google Search, Microsoft Bing, Yahoo, Yandex, and many others are straightforward. Given a URL, or even just a domain name (e.g., “mydata.info”), a typical pattern of discovery employed by a web crawler is:

1. Retrieve and process the robots.txt file expected to be in the standard location of http(s)://mydata.info/robots.txt. The robots.txt file contains simple rules that deny access or point to where one or more sitemap.xml files are located.

2. Retrieve the sitemap.xml file and parse to identify the locations of additional sitemaps and resources identified by the sitemap. The sitemap.xml file basically provides a list of URLs pointing to resources available on the web site. Each of the entries also includes a time stamp indicating when the resource was last modified.

3. Each resource listed in the sitemap.xml is retrieved for indexing.

In this manner, an indexing service can easily locate resources using a well-established pattern that is non-intrusive and straightforward to implement. A key challenge is that the resources referenced by the sitemap.xml can be completely arbitrary, making it difficult for indexers interested in structured data to efficiently and reliably locate and retrieve the structured metadata and data. In contrast, a DataONE Member Node precisely describes the available resources so that a Coordinating Node or any other client may clearly interpret what the resource is and so appropriately catalog, index, and otherwise manage the resource.
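The three discovery steps can be sketched in Python using only the standard library. To keep the example runnable offline, the robots.txt and sitemap.xml contents are inline placeholder strings for the hypothetical mydata.info site; a real crawler would fetch them over HTTP. The function names and the `last_harvest` date are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Placeholder file contents for a hypothetical site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://mydata.info/sitemap.xml
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://mydata.info/datasets/example-1</loc>
    <lastmod>2018-06-01</lastmod>
  </url>
  <url>
    <loc>https://mydata.info/datasets/example-2</loc>
    <lastmod>2018-06-15</lastmod>
  </url>
</urlset>
"""


def sitemap_locations(robots_txt: str) -> list:
    """Step 1: collect the Sitemap entries listed in robots.txt."""
    locations = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            locations.append(value.strip())
    return locations


def sitemap_entries(sitemap_xml: str) -> list:
    """Step 2: list (URL, last-modified) pairs from a sitemap."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


sitemaps = sitemap_locations(ROBOTS_TXT)
entries = sitemap_entries(SITEMAP_XML)

# Step 3: choose which resources to retrieve for indexing. ISO 8601 dates
# compare correctly as strings, so anything modified after the last harvest
# is selected.
last_harvest = "2018-06-10"
to_fetch = [loc for loc, lastmod in entries if lastmod > last_harvest]
```

Nothing in these steps says what each listed URL actually contains, which is the gap that structured markup on the landing pages is meant to close.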

The DUGout cont’d

The meeting will include many opportunities for attendees to contribute ideas as DataONE transitions to a more community-driven effort in 2019 and beyond. The DataONE Users Group Steering Committee will introduce a process for updating the DUG’s Procedural Guidelines (a.k.a. the DUG “charter”) so that the evolving role of the DUG is clearly defined for the future. The morning features a distinguished panel on sustainability with Bill Michener, Erin Robinson, Matt Jones, and Amber Budden, followed by birds-of-a-feather sessions in which attendees can discuss the sustainability of various DataONE components, including education and outreach, services, technology, and software.

Other meeting components will be familiar to DUG attendees, including a project update from DataONE leadership, Member Node showcase, DUG business meeting, community presentations, breakout sessions, and a poster session and reception.

Information gathered from DUG Meeting attendees about DataONE sustainability will be brought to subsequent meetings about sustainability and to the September All Hands meeting.

All DUG members are invited to consider applying to join the DUG Steering Committee for 2018-2019. Contact [email protected] to express interest.

Karl Benedict and Robert J. Sandusky, DUG co-chairs

Upcoming Events

Members of the DataONE Team will be at the following events. Full information on training activities can be found at bit.ly/D1Training and our calendar is available at bit.ly/D1Events.

Jul. 16: DataONE Users Group Meeting (DUG), Tucson, AZ
https://www.dataone.org/dataone-users-group/2018-meeting

Jul. 17-20: ESIP Summer Meeting, Tucson, AZ
http://www.esipfed.org/meetings/upcoming-meetings/esip-summer-meeting-2018

Aug. 5-10: Ecological Society of America Meeting, New Orleans, LA
https://esa.org/neworleans/

Oct. 11-12: FORCE2018, Montreal, Canada
https://www.force11.org/meetings/force2018

Nov. 5-9: International Data Week, Botswana, Africa
http://internationaldataweek.org

CyberSPOT cont’d: Can schema.org content publishing be utilized as an alternative to a read-only DataONE Member Node?

If a repository uses schema.org markup in its Dataset landing pages, the resources referenced by the sitemap.xml are no longer arbitrary, and so can be reliably processed by structured data indexers. Many different schemas are defined by schema.org. Of particular interest to DataONE are the Dataset, DataCatalog, and related structures, as these align well with the concept of a Dataset and repository as used in DataONE.

Recent work with the IEDA EarthChem repository has demonstrated that, with some minor additions to the schema.org Dataset JSON-LD contained in the landing pages, the resources exposed by that repository using the sitemap.xml pattern described above can effectively operate as a Member Node using the GMN (Generic Member Node) service as an intermediary “slender node” (i.e., proxy service).

This activity is evolving, though the initial results are good and demonstrate successful synchronization and indexing of content exposed solely through the sitemap along with schema.org marked-up landing pages. One aspect that is somewhat cumbersome is providing a more complete description of the components of a dataset. For example, although the constituents of a dataset are listed in the distribution section of the Dataset, properties such as checksums and well-defined relationships between components are lacking in the schema.org Dataset schema. However, the inherent extensibility of the standard means other existing standards may be leveraged for these purposes once a community consensus emerges.

Based on this encouraging outcome, we anticipate future alterations to the DataONE Coordinating Nodes to directly support schema.org resources as Tier 1 Member Nodes in the DataONE federation.
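As one illustration of how the checksum gap mentioned above might be bridged through extensibility, the Python sketch below attaches a checksum to a distribution entry using an extension namespace. Everything under the `ex:` prefix, including the namespace URL and the property names, is hypothetical; the article does not say which vocabulary the community will settle on, and this is not the approach used in the EarthChem work.

```python
import hashlib
import json

# Hypothetical extension namespace; not a real, agreed-upon vocabulary.
EXTENSION_NS = "https://example.org/terms#"


def describe_file(content_url: str, payload: bytes) -> dict:
    """Build a distribution entry that carries a checksum for its file."""
    return {
        "@type": "DataDownload",
        "contentUrl": content_url,
        "contentSize": str(len(payload)),
        # Properties below come from the hypothetical extension namespace.
        "ex:checksum": {
            "ex:algorithm": "SHA-256",
            "ex:value": hashlib.sha256(payload).hexdigest(),
        },
    }


def verify(entry: dict, payload: bytes) -> bool:
    """Check downloaded bytes against the checksum recorded in the entry."""
    return hashlib.sha256(payload).hexdigest() == entry["ex:checksum"]["ex:value"]


payload = b"site,temp_c\nA,11.2\nB,12.7\n"  # placeholder file content
dataset = {
    "@context": {"@vocab": "https://schema.org/", "ex": EXTENSION_NS},
    "@type": "Dataset",
    "name": "Example dataset with checksummed distribution",
    "distribution": [
        describe_file("https://mydata.info/datasets/example-1/data.csv", payload)
    ],
}

json_ld = json.dumps(dataset, indent=2)  # ready to embed in a landing page
```

A harvesting client could call `verify` after retrieving each file to confirm that the bytes it received match what the repository described.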

Working Group Focus

Community Engagement and Outreach

We are collecting and processing massive amounts of information these days, and the question on all our minds is how best to manage these data. Some of us do it better than others, and we can all learn from one another.

The DataONE Community Engagement and Outreach working group (CEO WG) has pulled together foundational resources that help us all become better data stewards and made them available on a community-owned website, the Data Management Skillbuilding Hub. Whether you want to learn about data sharing, working with metadata, or developing best practices surrounding other parts of the data life cycle, we’ve got you covered.

Initially developed in 2011-2012, the Education Modules put together by the CEO WG went through a process of peer review and a final revision in 2016. The following year the working group decided to move the content to an open repository (GitHub) in an effort to increase community engagement around data management and education, and to build interest in shared maintenance and ongoing sustainability. You can read more about our rationale in the publication “Using Peer Review to Support Development of Community Resources for Research Data Management” (Soyka et al., 2017).

At launch, the repository contains ten lessons associated with the data life cycle. The lessons contain slides with annotations that can be viewed using your browser, as well as a hands-on exercise, a one-page explanatory sheet, and in some cases, supplementary data as part of a learning activity.

The best part of this new resource? You. To improve these materials, we need your feedback and input. You can give us any comments or work directly in GitHub to offer amendments yourself. Community resources are only great when we truly work as a community.

IRL: an example of these resources in action

At the Friday Harbor Marine Labs, Dr. Robin Elahi is working with undergrads to find out how marine intertidal communities have changed over the past six decades. After designing the course, he and his co-instructor built a syllabus that taught students everything from basic library search methods and field sampling techniques to the biggest problem of all: what do I do with all these data?

Instead of taking time to pull together another set of lessons and exercises for his students, Dr. Elahi used a selection of data management education materials developed at DataONE.

“Rather than reinvent the wheel, I used DataONE education materials to provide my students with the current best practices in data management. It was easy to adapt the lesson and exercise to fit the needs of my class.”

Check out what Dr. Elahi shared with his class, and how he modified the DataONE materials to meet his needs: https://github.com/elahi/fhl470/blob/master/lectures/fhl470_spreadsheets_SP18.pdf


Member Node Description

In each newsletter issue we will highlight one of our current Member Nodes. The full list of Member Nodes and summary metrics can be found on the DataONE.org site at bit.ly/D1CMNs.

PANGAEA Data Publisher for Earth and Environmental Science
https://www.pangaea.de/

PANGAEA is an open access data publisher for Earth and environmental science supporting the long-term archiving and publication of georeferenced data. PANGAEA’s interoperability framework enables the dissemination of metadata and data to registries, data portals, and other service providers such as DataONE. The system also allows data to be published as supplements to science articles or as citable data collections in combination with data journals.

Each dataset can be identified, shared, published, and cited by using a Digital Object Identifier (DOI). Datasets include numerical and textual data in addition to image and video data, each as a georeferenced record that contains parameter value, parameter type, and spatial and temporal coverage. In PANGAEA, tens of thousands of different measurement and observation types have been defined, not only to improve the consistency of archived data but also to compensate for the heterogeneity in the different scientific fields. PANGAEA is furnished with an editorial system for data ingest and curation and ensures integrity and authenticity through quality control and the dissemination of metadata according to international standards.

PANGAEA was initiated in the 1990s and has been operational on the Internet since 1995. Hosted by the Alfred Wegener Institute for Polar and Marine Research (AWI), Bremerhaven, and the Center for Marine Environmental Sciences (MARUM), Bremen, in Germany, PANGAEA continues to be a data management partner in numerous national and international projects. Additionally, PANGAEA is a World Data Center accredited by the International Council for Science World Data System (ICSU WDS) and a World Radiation Monitoring Center (WRMC) within the World Meteorological Organisation Information System (WIS).

PANGAEA is open to any project, institution, or individual scientist to use or to archive and publish data.

1312 Basehart SE
University of New Mexico
Albuquerque, NM 87106
Fax: 505.246.6007

DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.

Project Director: William [email protected], 505.814.7601
Executive Director: Rebecca [email protected], 505.382.0890
Director of Community Engagement and Outreach: Amber [email protected], 505.205.7675
Director of Development and Operations: Dave [email protected]

Upcoming Webinar

Open Science as a Movement: Mozilla’s efforts to build community and open leadership in science

Tuesday, September 11, 2018, 9 am Pacific / 12 noon Eastern

Information and registration at: https://www.dataone.org/upcoming-webinar

Ecological Society of America Highlight

DataONE is teaming up with other data training and service providers to create a Data Help Desk at the ESA meeting this year. Partners from the Arctic Data Center, EDI, iDigBio, DataCite, ESIP, and members of the community will be giving short demos of tips, tools, and tricks in the Exhibit Hall. Join us to find out more about these organizations and add some new skills to your toolbox. For more information and to see the schedule of DataONE talks, sessions, and workshops, visit https://www.dataone.org/training-activities.

By The Numbers

Metrics inclusive of June 2018. Metrics marked * are monthly averages. Source: cn.dataone.org. Only the first version of each file is counted.

Data discoverable through DataONE: chart of files uploaded, 2012 to 2018 (series: metadata, data; values shown: 1.51 M, 792 K).

Our community: 40 Member Nodes; 46 TB of content; 99.99% uptime of Coordinating Nodes. New Member this Quarter: PANGAEA Data Publisher for Earth and Environmental Science.

Other metrics shown: users trained; DataONE Users Group members; visits to the public webpage*; visitors to our search page*; unique downloads*; searches conducted*. Values as extracted, in order of appearance: 5,000+; 19,087; 488; 2,293; 1,111; 372,676; 6,627; 34; 80; 210.

Education and outreach: 217 Education Module downloads*.
Most Visited Pages: Best Practices: Create and document data backup policy; Data Management Planning; Best Practices home page.
Most Downloaded Resources: Best Practices Primer; Data Management Plan; Data Management Guide: Public participation.

Webinar Series: 31 webinars; 93 average number of attendees; 1608 unique webinar attendees. Map: Global Online Views of Webinars in Last Quarter.

Example from Mauna Loa