Data accessibility and the role of informatics in predicting the biosphere

Alex Hardisty
Director of Informatics Projects, School of Computer Science & Informatics
Coordinator, FP7 BioVeL project, www.biovel.eu
email: [email protected] /alexhardisty (occasionally!)


Post on 29-Nov-2014


DESCRIPTION

The variety, distinctiveness and complexity of life – biodiversity, in other words, and by implication the ecosystems in which it is situated – is our life support system. It is absolutely essential and more important than almost everything else, but it is typically taken for granted. Today’s big societal challenges – food and water security, coping with environmental change, and aspects of human health – are beyond the abilities of any one individual or research group to solve. Solving them depends not only on collaboration to deliver the appropriate scientific evidence but increasingly on vast amounts of data from multiple sources (environmental, taxonomic, genomic and ecological) gathered by manual observation and automated sensors, digitisation, remote sensing, and genetic sequencing. In April 2012 we called the biodiversity and ecosystems research communities to arms to formulate a consensus view on establishing an infrastructure to improve the accessibility of the ever-increasing volumes of biological data. We published the whitepaper “A decadal view of biodiversity informatics: challenges and priorities”, which has since been viewed more than 24,000 times. We envisage a shared and maintained multi-purpose network of computationally-based processing services sitting on top of an open data domain. By open data domain we mean data that is accessible, i.e., published, registered and linked. BioVeL, pro-iBiosphere, ViBRANT and other FP7-funded projects have all explored aspects of this vision.

TRANSCRIPT

Page 1: Data accessibility and the role of informatics in predicting the biosphere

Data accessibility and the role of informatics in predicting the biosphere

Alex Hardisty
Director of Informatics Projects, School of Computer Science & Informatics
Coordinator, FP7 BioVeL project, www.biovel.eu
email: [email protected] /alexhardisty (occasionally!)

Page 2: Data accessibility and the role of informatics in predicting the biosphere

Structuring the biodiversity informatics community at the European level and beyond

Biodiversity Informatics Horizons 2013

• Integration: Make better use of what we have

• Cooperation: Data from the whole world is needed

• Promotion: Europe is well placed to offer leadership

180 experts conclude that there is “a growing need for predictive biosphere modelling”


Page 3: Data accessibility and the role of informatics in predicting the biosphere


What if …?

Imagine if we could …

… Predict community level dynamics of ecosystems (i.e., behaviours) at scales from local to global, based on the ecology and biology of all individual organisms …

e.g., Ecosystems: Time to model all life on Earth. Purves et al., Nature 493 (2013)

Image: Stuart Miles / FreeDigitalPhotos.net

Page 4: Data accessibility and the role of informatics in predicting the biosphere

Imagine if we could …

… Measure and calculate “Essential Biodiversity Variables” (EBVs) …

… for any geographic area (continental, regional, local), by any person anywhere, using data for that area that may be held by any (research) infrastructure. Not only that, but also learn how to forecast EBVs.

Page 5: Data accessibility and the role of informatics in predicting the biosphere

Photo: Smokestacks against skyline and sunset, Estonia. © Curt Carnemark / World Bank Photo Collection

Depend on collaboration to deliver the evidence, i.e., based on synthesis and modelling of

• Increasingly large amounts of data from multiple sources (environmental, taxonomic, genomic and ecological)

• Gathered by manual observation and automated sensors, digitisation, next-generation sequencing and remote sensing

Beyond the abilities of any one individual or any single research community to collect, observe or generate.

Variety, Velocity and Volume of “Big Data”

Page 6: Data accessibility and the role of informatics in predicting the biosphere

From an informatics perspective, how close are we to that?

9 research infrastructures from around the world exhibit “a satisfactory level of potential interoperability”

Charts (three panels, each scored on a scale from 0% to 100%):

• Data: Data sharing and QC; Data types; Data source tracking; Data citation tracking; Data integration; GIS; Standards; Technology

• Service logic: Software architecture; Programming languages; Authentication; Authorization; Middleware; Technology; Standards; Computing infrastructure

• General: Topical coverage; Geographical coverage; Infrastructure topology; Native interoperability and enablers; Merging of science & policy needs; Merging of science & industry needs; Engagement of citizens; Access policy; Licensing and business model; Funding; User applications & interfaces

Page 7: Data accessibility and the role of informatics in predicting the biosphere


Image from climateprediction.net

A computational challenge: Greater than that of weather forecasting; greater than that of climate prediction?

Harfoot MBJ, Newbold T, Tittensor DP, Emmott S, et al. (2014) Emergent Global Patterns of Ecosystem Structure and Function from a Mechanistic General Ecosystem Model. PLoS Biol 12(4): e1001841. doi:10.1371/journal.pbio.1001841

For 1 km resolution, “… 3 to 6 orders of magnitude larger, … an exascale problem”
Jack K. Horner, Independent consultant & Adviser to KU Biodiversity Institute
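To get a rough feel for what 1 km resolution implies, the sketch below simply counts grid cells. It assumes a coarse reference grid of 1° cells, which is an illustrative assumption and not a figure taken from the slide or from Harfoot et al., and it uses an approximate Earth surface area of 510 million km². Cell count is only one driver of computational cost, so this is an order-of-magnitude illustration rather than a reconstruction of the quoted estimate.

```python
import math

# Approximate total surface area of Earth in square kilometres.
EARTH_SURFACE_KM2 = 510e6


def cells_at_degree_resolution(deg: float) -> int:
    """Number of cells in a regular latitude/longitude grid with `deg`-sized cells."""
    return int(360 / deg) * int(180 / deg)


def cells_at_km_resolution(km: float) -> float:
    """Approximate number of square cells of side `km` needed to cover the Earth."""
    return EARTH_SURFACE_KM2 / (km * km)


# Assumed coarse reference grid (1 degree); hypothetical choice for illustration.
coarse_cells = cells_at_degree_resolution(1.0)
fine_cells = cells_at_km_resolution(1.0)

# How many orders of magnitude more cells does the 1 km grid need?
orders_of_magnitude = math.log10(fine_cells / coarse_cells)

print(f"1 degree grid: {coarse_cells:,} cells")
print(f"1 km grid:     {fine_cells:,.0f} cells")
print(f"Increase:      about {orders_of_magnitude:.1f} orders of magnitude")
```

Under these assumptions the cell count alone grows by just under four orders of magnitude, before any increase in time steps, organisms per cell or model complexity is considered.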

Page 8: Data accessibility and the role of informatics in predicting the biosphere

The situation today can be likened to meteorology in the 1950s, 60s and 70s (and later in climatology), when the emergence of numerical weather prediction drove demand for:

• New observations

• The emergence of a global infrastructure for acquiring, mobilising and normalising data, and

• Better models of global atmospheric behaviour

Page 9: Data accessibility and the role of informatics in predicting the biosphere

Accessible data is useful data, not just for research

Diagram labels: Data and information; Direct provision of data/information; Indirect provision through reports; National policies/reports; Regional policies/reports; Global policies/reports; Assessment processes; Green accounting etc.

Diagram courtesy of EC FP7 EU BON project

Page 10: Data accessibility and the role of informatics in predicting the biosphere


To be able to predict the biosphere we need to mobilise data and make it accessible

Page 11: Data accessibility and the role of informatics in predicting the biosphere


It’s a journey towards

• Global data, covering the whole planet. There are significant gaps everywhere today

• Making all our small-scale, local data – which often characterises the current day practice of field ecology – global

That is to say, we have to mobilise, clean, normalise and quality assure many small sets of data that together can give us the global data we need to calibrate models.

We are achieving that for certain classes of data, but it is not without its difficulties.

Page 12: Data accessibility and the role of informatics in predicting the biosphere

Issues arise in each of the 4 stages of mobilising data for synthesis

• Data acquisition
  – Standardised measurement protocols

• Data curation
  – Assigning the right metadata and persistent identifiers
  – Finding a home for the data, and putting it there

• Data discovery and access
  – Finding relevant data
  – Machine-readable access to data, i.e., a web service (WS) front-end

• Data processing / analysis, including re-use
  – Owners want attribution
  – Tracking provenance and following licensing conditions
  – Problems at every step, on every workflow run

http://envri.eu/rm

Page 13: Data accessibility and the role of informatics in predicting the biosphere

Peter Desmet and Bart Aelterman, 22nd Nov 2013, peterdesmet.com

“Our analysis of the licenses of all 11.000+ GBIF registered datasets shows a bleak picture. Very few GBIF registered datasets can be easily and legally used, let alone without restrictions. This is mainly due to data being published with no or a non-standard license.”

See also:

“Showing you this map of aggregated bullfrog occurrences would be illegal”

http://peterdesmet.com/posts/illegal-bullfrogs.html

Page 14: Data accessibility and the role of informatics in predicting the biosphere


But see http://www.gbif.org/page/9773: “New approaches to data licensing and management” agreed by GBIF Governing Board, Sept 2014, which begins to address this situation: CC0, CC-BY, CC-BY-NC
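The licensing problem described above can be made concrete with a small sketch. It sorts dataset records into those carrying one of the three licences named on the slide (CC0, CC-BY, CC-BY-NC) and those with no licence or a non-standard one. The record layout, the sample datasets and the function name `classify_license` are hypothetical and chosen only for illustration; this is not GBIF's registry format or API.

```python
from collections import Counter
from typing import Optional

# The three standard licences named on the slide.
STANDARD_LICENSES = {"CC0", "CC-BY", "CC-BY-NC"}


def classify_license(raw: Optional[str]) -> str:
    """Return the standard licence name, or 'missing' / 'non-standard'."""
    if raw is None or not raw.strip():
        return "missing"
    # Tolerate case and spacing differences such as "cc by" or "CC BY NC".
    normalised = raw.strip().upper().replace(" ", "-")
    return normalised if normalised in STANDARD_LICENSES else "non-standard"


# Hypothetical dataset records; real registry metadata will look different.
datasets = [
    {"title": "Example survey A", "license": "CC0"},
    {"title": "Example survey B", "license": "cc-by"},
    {"title": "Example survey C", "license": None},
    {"title": "Example survey D", "license": "Free for academic use, ask first"},
    {"title": "Example survey E", "license": "CC BY NC"},
]

summary = Counter(classify_license(d["license"]) for d in datasets)

for category, count in sorted(summary.items()):
    print(f"{category}: {count}")
```

Datasets that end up as "missing" or "non-standard" are the ones a workflow cannot reuse automatically, because a machine cannot decide what their terms permit.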

Page 15: Data accessibility and the role of informatics in predicting the biosphere

Data re-use: Owners want attribution
Example 1) Taxonomic data refinement workflow

BioSTIF

CoL: 3 levels of attribution
• complete work
• contributing database of the record
• expert who provides taxonomic scrutiny of the individual record

GBIF data use agreement
• Respect restrictions of access to sensitive data.
• Identifier of ownership of data must be retained with every data record (through the workflow).
• Publicly acknowledge the Data Publishers whose biodiversity data they have used.
• Any additional terms and conditions of use set by the Data Publisher.

Tool license(s)
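One way to picture the requirement that ownership identifiers stay with every record through a workflow is the sketch below. Each record carries its attribution, and every workflow step returns new records that keep it, so the final output can still list the publishers to acknowledge. The `Record` class, its fields, the sample values and the helper functions are all hypothetical illustrations and are not part of BioVeL, BioSTIF, CoL or GBIF software.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Record:
    """A hypothetical data record that carries its own attribution."""
    name: str                    # the scientific name being refined
    publisher: str               # data publisher to acknowledge
    source_db: str               # contributing database of the record
    steps: Tuple[str, ...] = ()  # provenance: workflow steps applied so far


def apply_step(records: List[Record], step_name: str,
               transform: Callable[[str], str]) -> List[Record]:
    """Transform each name while keeping attribution and logging the step."""
    return [
        replace(r, name=transform(r.name), steps=r.steps + (step_name,))
        for r in records
    ]


def acknowledgements(records: List[Record]) -> List[Tuple[str, str]]:
    """Distinct (publisher, source database) pairs to acknowledge."""
    return sorted({(r.publisher, r.source_db) for r in records})


# Hypothetical input records from two made-up publishers.
records = [
    Record("  rana catesbeiana ", "Hypothetical Museum A", "Hypothetical DB 1"),
    Record("Lithobates catesbeianus", "Hypothetical Agency B", "Hypothetical DB 2"),
]

cleaned = apply_step(records, "trim-whitespace", str.strip)
cleaned = apply_step(cleaned, "capitalise-genus", lambda s: s[:1].upper() + s[1:])

for record in cleaned:
    print(record.name, "|", record.publisher, "|", " -> ".join(record.steps))
print("Acknowledge:", acknowledgements(cleaned))
```

Making the records immutable means a step cannot silently drop the publisher or source database; it can only produce a new record that copies them forward.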

Page 16: Data accessibility and the role of informatics in predicting the biosphere

More problems at every step, on every run
Example 2) Niche Modelling Workflow

Workflow diagram, steps:

• High quality occurrence data set
• Select layers with environmental factors that are likely to influence the distribution of the species
• Create model
  – Select algorithm
  – Select parameter values for the chosen algorithm
  – Assemble the model on openModeller service
• Model test
  – Test the performance of the parameter in the model
  – Test performance of the distribution prediction on the model
• Model projection
  – Project Model with original layers
  – Select prediction layers
  – Project Model with prediction layers
• Visualize and publish results

Loop: Changing algorithm, parameter values, and set of layers

Issues annotated on the diagram:

• License on algorithm
• License on software
• Licenses on environmental data layers
• Permissions to use
• AuthN/AuthZ
• 3rd party software
• All issues associated with publication
• Moving data from one service to another

Page 17: Data accessibility and the role of informatics in predicting the biosphere

In a recent EU BON study

• Only 35% of surveyed datasets (wider scope than just GBIF) are accessible under an open license or waiver, without restriction on use

• For 29 scientific questions relating to needs of European environmental policy, the availability of datasets to answer the questions is in the range ‘satisfactory’ (3) to ‘poor’ (2)

Page 18: Data accessibility and the role of informatics in predicting the biosphere


Multiple initiatives to make data more accessible; some are general purpose

… builds the social and technical bridges that enable open sharing of data … researchers and innovators openly sharing data across technologies, disciplines, and countries to address the grand challenges of society.

https://rd-alliance.org/

… successful community supported conventions, policies and practices for data identifiers, formats, checklists and vocabularies that enable data interoperability, citation and stewardship.

http://www.datafairport.org/

ORCID and DataCite initiatives to uniquely identify (respectively) scientists and data sets

Page 19: Data accessibility and the role of informatics in predicting the biosphere

Some are more domain specific

Promoting free and open access to biodiversity information

A framework to focus effort and investment to deliver biodiversity knowledge more effectively
www.biodiversityinformatics.org/

www.bouchout-declaration.org

Page 20: Data accessibility and the role of informatics in predicting the biosphere


A shared and maintained multi-purpose network of computationally-based processing services in an open data domain

Image: CoolDesign / FreeDigitalPhotos.net

With 78 contributors, we published the whitepaper, April 2013 - since viewed more than 34,000 times.

Page 21: Data accessibility and the role of informatics in predicting the biosphere

Building a heterogeneous Service Network

Users’ workflows and applications

Sustained Service and Data Providers: GBIF, CoL, OBIS, WoRMS, EMBL-EBI, BGBM, CRIA, EoL, BHL, ALA, LTER, etc. & more.
www.biodiversitycatalogue.org

Recognised and stable Infrastructure Providers: National, EGI.eu, PRACE, commercial, EUDAT, etc.

Page 22: Data accessibility and the role of informatics in predicting the biosphere


Preparing the next, coordinated steps

Diagram from LinkD Concept Note, September 2014

Page 23: Data accessibility and the role of informatics in predicting the biosphere

What we want to do in LinkD?

Develop the highly responsive digital framework required to enable high throughput research and support science of scale towards the long term vision of modelling Life on Earth

New H2020 proposal, “VRE call”, January 2015

Diagram labels: LinkD (Science of Scale, Life on Earth); ELODINS; ENVRI+

From slides by Vince Smith, LinkD proposal coordinator, Natural History Museum, London

Page 24: Data accessibility and the role of informatics in predicting the biosphere

Photo: A lone farmer walks among rice paddies. © DFATD-MAECD/Tick Collins

Take home message: “It’s a journey”

• Accessible data is the enabler of “in-silico” science that leads towards predicting the biosphere

• A shared multi-purpose network of processing services, sitting on top of open data, is the route to interoperability

• Working together as a community is essential