Data accessibility and the role of informatics in predicting the biosphere
DESCRIPTION
The variety, distinctiveness and complexity of life – biodiversity in other words and by implication the ecosystems in which it is situated – is our life support system. It is absolutely essential and more important than almost everything else but it is typically taken for granted. Today’s big societal challenges – food and water security, coping with environmental change and aspects of human health – are beyond the abilities of any one individual or research group to solve. Solving them depends not only on collaboration to deliver the appropriate scientific evidence but increasingly on vast amounts of data from multiple sources (environmental, taxonomic, genomic and ecological) gathered by manual observation and automated sensors, digitisation, remote sensing, and genetic sequencing. In April 2012 we called the biodiversity and ecosystems research communities to arms to formulate a consensus view on establishing an infrastructure to improve the accessibility of the ever-increasing volumes of biological data. We published the whitepaper “A decadal view of biodiversity informatics: challenges and priorities”, which has since been viewed more than 24,000 times. We envisage a shared and maintained multi-purpose network of computationally-based processing services sitting on top of an open data domain. By open data domain we mean data that is accessible, i.e., published, registered and linked. BioVeL, pro-iBiosphere, ViBRANT and other FP7 funded projects have all explored aspects of this vision.
TRANSCRIPT
1
Data accessibility and the role of informatics in predicting the biosphere
Alex Hardisty
Director of Informatics Projects, School of Computer Science & Informatics
Coordinator, FP7 BioVeL project, www.biovel.eu
email: [email protected] /alexhardisty (occasionally!)
Structuring the biodiversity informatics community at the European level and beyond
Biodiversity Informatics Horizons 2013
• Integration: Make better use of what we have
• Cooperation: Data from the whole world is needed
• Promotion: Europe is well placed to offer leadership
180 experts conclude that there is “a growing need for predictive biosphere modelling”
2
3
What if …?
Imagine if we could …
… Predict community level dynamics of ecosystems (i.e., behaviours) at scales from local to global, based on the ecology and biology of all individual organisms …
e.g., Ecosystems: Time to model all life on Earth. Purves et al., Nature 493 (2013)
Image: Stuart Miles / FreeDigitalPhotos.net
4
Imagine if we could …
… Measure and calculate “Essential Biodiversity Variables” …
… for any geographic area (continental, regional, local), by any person anywhere, using data for that area that may be held by any (research) infrastructure. Not only that, but also learn how to forecast EBVs
5
Photo: Smokestacks against skyline and sunset, Estonia. © Curt Carnemark / World Bank Photo Collection
Depend on collaboration to deliver the evidence, i.e., based on synthesis and modelling of
• Increasingly large amounts of data from multiple sources (environmental, taxonomic, genomic and ecological)
• Gathered by manual observation and automated sensors, digitisation, nextgen sequencing and remote sensing
Beyond the abilities of any one individual or any single research community to collect, observe or generate.
Variety, Velocity and Volume of “Big Data”
6
[Figure: three panels, each scaled from 0% to 100%]
Data: Data sharing and QC; Data types; Data source tracking; Data citation tracking; Data integration; GIS; Standards; Technology
Service logic: Software architecture; Programming languages; Authentication; Authorization; Middleware; Technology; Standards; Computing infrastructure
General: Topical coverage; Geographical coverage; Infrastructure topology; Native interoperability and enablers; Merging of science & policy needs; Merging of science & industry needs; Engagement of citizens; Access policy; Licensing and business model; Funding; User applications & interfaces
9 research infrastructures from around the world exhibit “a satisfactory level of potential interoperability”
From an informatics perspective, how close are we to that?
7
Image from climateprediction.net
A computational challenge: Greater than that of weather forecasting; greater than that of climate prediction?
Harfoot MBJ, Newbold T, Tittensor DP, Emmott S, et al. (2014) Emergent Global Patterns of Ecosystem Structure and Function from a Mechanistic General Ecosystem Model. PLoS Biol 12(4): e1001841. doi:10.1371/journal.pbio.1001841
For 1 km resolution, “… 3 to 6 orders of magnitude larger, … an exascale problem”
Jack K. Horner
Independent consultant & Adviser to KU Biodiversity Institute
8
The situation today can be likened to meteorology in the 1950s, 60s and 70s (and later in climatology), when the emergence of numerical weather prediction drove demand for:
• New observations
• The emergence of a global infrastructure for acquiring, mobilising and normalising data, and
• Better models of global atmospheric behaviour
9
Accessible data is useful data, not just for research
Direct provision of data/information
Indirect provision through reports
Data and information
National policies/reports
Regional policies/reports
Global policies/reports
Assessment processes
Green accounting
etc
Diagram courtesy of EC FP7 EU BON project
10
To be able to predict the biosphere we need to mobilise data and make it accessible
11
It’s a journey towards
• Global data, covering the whole planet. There are significant gaps everywhere today
• Making all our small-scale, local data – which often characterises the current day practice of field ecology – global
That is to say, we have to mobilise, clean, normalise and quality assure many small sets of data that together can give us the global data we need to calibrate models
We are achieving that for certain classes of data but it is not without its difficulties
Issues arise in each of the 4 stages of mobilising data for synthesis
• Data acquisition
– Standardised measurement protocols
• Data curation
– Assigning the right metadata and persistent identifiers
– Finding a home for the data – and putting it there
• Data discovery and access
– Finding relevant data
– Machine-readable access to data, i.e., a web service (WS) front-end
• Data processing / analysis, including re-use
– Owners want attribution
– Tracking provenance and following licensing conditions
– Problems at every step, on every workflow run
http://envri.eu/rm
12
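The four stages listed above can be sketched as a small provenance log that travels with a dataset. This is an illustrative sketch only: the field names, the helper functions and the `urn:uuid:` value standing in for a real persistent identifier are all assumptions, not part of the ENVRI reference model or of any infrastructure's API.

```python
import json
import uuid
from datetime import datetime, timezone

# The four stages named on the slide.
STAGES = ("acquisition", "curation", "discovery_and_access", "processing")


def new_dataset(title, licence, protocol):
    """Create a metadata record; the UUID is only a stand-in for a real persistent identifier."""
    return {
        "id": f"urn:uuid:{uuid.uuid4()}",
        "title": title,
        "licence": licence,
        "protocol": protocol,
        "provenance": [],
    }


def record_stage(dataset, stage, note):
    """Append one provenance entry, rejecting stages outside the four above."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    dataset["provenance"].append(
        {"stage": stage, "note": note, "at": datetime.now(timezone.utc).isoformat()}
    )
    return dataset


# Hypothetical example values throughout.
dataset = new_dataset("Hypothetical pond survey", "CC-BY", "standardised transect protocol")
record_stage(dataset, "acquisition", "field observations collected")
record_stage(dataset, "curation", "metadata and identifier assigned")
print(json.dumps(dataset, indent=2))  # machine-readable form of the record
```

Keeping the log inside the record means that whoever re-uses the data at the processing stage can still see where it came from and under which licence it was published.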
13
Peter Desmet and Bart Aelterman, 22nd Nov 2013, peterdesmet.com
“Our analysis of the licenses of all 11.000+ GBIF registered datasets shows a bleak picture. Very few GBIF registered datasets can be easily and legally used, let alone without restrictions. This is mainly due to data being published with no or a non-standard license.”
See also:
“Showing you this map of aggregated bullfrog occurrences would be illegal”
http://peterdesmet.com/posts/illegal-bullfrogs.html
14
But see http://www.gbif.org/page/9773
“New approaches to data licensing and management” agreed by GBIF Governing Board Sept 2014, which begins to address this situation: CC0, CC-BY, CC-BY-NC
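To illustrate why a small set of standard licences helps, here is a minimal sketch that sorts dataset licence strings into the three options named above and counts everything else as missing or non-standard. The alias table, the example records and the function names are hypothetical; this is not how GBIF or the Desmet and Aelterman analysis classifies licences.

```python
# The three licence options named on the slide.
STANDARD_LICENCES = ("CC0", "CC-BY", "CC-BY-NC")

# Hypothetical spellings; real metadata would need a much longer table.
ALIASES = {
    "cc0": "CC0",
    "cc0 1.0": "CC0",
    "cc-by": "CC-BY",
    "cc by": "CC-BY",
    "cc by 4.0": "CC-BY",
    "cc-by-nc": "CC-BY-NC",
    "cc by-nc": "CC-BY-NC",
}


def classify_licence(raw):
    """Return one of STANDARD_LICENCES, or None for a missing or non-standard licence."""
    if not raw:
        return None
    key = " ".join(raw.lower().split())  # normalise case and whitespace
    return ALIASES.get(key)


def summarise(datasets):
    """Count datasets per standard licence, plus those with none or a non-standard one."""
    counts = {name: 0 for name in STANDARD_LICENCES}
    counts["missing or non-standard"] = 0
    for licence in datasets.values():
        counts[classify_licence(licence) or "missing or non-standard"] += 1
    return counts


# Made-up example records: dataset name -> licence string as published.
example = {
    "dataset-a": "CC0 1.0",
    "dataset-b": "cc by",
    "dataset-c": None,
    "dataset-d": "Free for academic use, ask first",
}
print(summarise(example))
```

A free-text licence such as the one on `dataset-d` cannot be interpreted by a machine, which is the practical obstacle the quoted analysis describes.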
Data re-use: Owners want attribution
Example 1) Taxonomic data refinement Workflow
BioSTIF
CoL 3 levels of attribution
• complete work
• contributing database of the record
• expert who provides taxonomic scrutiny of the individual record
GBIF data use agreement
• Respect restrictions of access to sensitive data.
• Identifier of ownership of data must be retained with every data record (through the workflow)
• Publicly acknowledge the Data Publishers whose biodiversity data they have used.
• Any additional terms and conditions of use set by the Data Publisher.
Tool license(s)
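The requirement that ownership stays attached to every record through the workflow can be sketched as follows. The `Record` fields, the cleaning step and the publisher names are hypothetical and are not taken from the BioVeL workflow or the GBIF data model; the sketch only shows the idea of transforming one field while the attribution fields pass through untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    scientific_name: str
    publisher: str  # Data Publisher to acknowledge publicly
    owner_id: str   # identifier of ownership, retained with every record


def refine_names(records, clean):
    """Apply a name-cleaning function; replace() copies all other fields unchanged."""
    return [replace(r, scientific_name=clean(r.scientific_name)) for r in records]


def acknowledgements(records):
    """List every Data Publisher whose records reached the end of the workflow."""
    return sorted({r.publisher for r in records})


# Made-up input records.
raw = [
    Record("  rana catesbeiana ", "Publisher A", "owner:001"),
    Record("Lithobates catesbeianus", "Publisher B", "owner:002"),
]
cleaned = refine_names(raw, lambda name: name.strip().capitalize())
print(acknowledgements(cleaned))
```

Making the record immutable (`frozen=True`) is a design choice in this sketch: a workflow step cannot drop the owner identifier by accident, it can only produce a new record that copies it.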
15
More problems at every step, on every run
Example 2) Niche Modelling Workflow
[Workflow diagram] Stages: Create model; Model test; Model projection
Steps shown in the diagram:
• High quality occurrence data set
• Select layers with environmental factors that are likely to influence the distribution of the species
• Select algorithm
• Select parameter values for the chosen algorithm
• Assemble the model on openModeller service
• Test the performance of the parameter in the model
• Test performance of the distribution prediction on the model
• Select prediction layers
• Project Model with original layers
• Project Model with prediction layers
• Visualize and publish results
Loop: Changing algorithm, parameter values, and set of layers
Issues annotated on the diagram:
• License on algorithm
• License on software
• Licenses on environmental data layers
• Permissions to use
• AuthN/AuthZ
• 3rd party software
• All issues associated with publication
• Moving data from one service to another
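The loop over algorithm, parameter values and layer sets multiplies the number of runs, and each run inherits the licensing questions of its components. The sketch below enumerates the runs and flags those that use a component with no stated licence. The component names, the licence catalogue and `plan_runs` are hypothetical; nothing here calls openModeller or any real service.

```python
from itertools import product

# Hypothetical catalogue: component -> licence (None means no licence stated).
LICENCES = {
    "algorithm:algo-a": "CC-BY",
    "algorithm:algo-b": "CC0",
    "layers:climate-a": "CC-BY",
    "layers:landcover-b": None,
}


def plan_runs(algorithms, parameter_sets, layer_sets):
    """Enumerate every workflow run and list its components lacking a licence."""
    runs = []
    for algorithm, params, layers in product(algorithms, parameter_sets, layer_sets):
        components = [f"algorithm:{algorithm}"] + [f"layers:{name}" for name in layers]
        unlicensed = [c for c in components if LICENCES.get(c) is None]
        runs.append(
            {"algorithm": algorithm, "params": params, "layers": layers, "unlicensed": unlicensed}
        )
    return runs


runs = plan_runs(
    algorithms=["algo-a", "algo-b"],
    parameter_sets=[{"threshold": 0.5}, {"threshold": 0.8}],
    layer_sets=[("climate-a",), ("climate-a", "landcover-b")],
)
blocked = [r for r in runs if r["unlicensed"]]
print(len(runs), "runs planned;", len(blocked), "need licence clarification")
```

With two algorithms, two parameter sets and two layer sets the sketch plans eight runs, and a single unlicensed layer affects every run that includes it.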
16
17
In a recent EU BON study
Only 35% of surveyed datasets (wider scope than just GBIF) are accessible under an open license or waiver, without restriction on use
For 29 scientific questions relating to needs of European environmental policy, the availability of datasets to answer the questions is in the range ‘satisfactory’ (3) to ‘poor’ (2)
18
Multiple initiatives to make data more accessible; some are general purpose
… builds the social and technical bridges that enable open sharing of data … researchers and innovators openly sharing data across technologies, disciplines, and countries to address the grand challenges of society.
https://rd-alliance.org/
… successful community supported conventions, policies and practices for data identifiers, formats, checklists and vocabularies that enable data interoperability, citation and stewardship.
http://www.datafairport.org/
ORCID and DataCite initiatives to uniquely identify (respectively) scientists and data sets
19
Some are more domain specific
Promoting free and open access to biodiversity information
A framework to focus effort and investment to deliver biodiversity knowledge more effectively
www.biodiversityinformatics.org/
www.bouchout-declaration.org
20
A shared and maintained multi-purpose network of computationally-based processing services in an open data domain
Image: CoolDesign / FreeDigitalPhotos.net
With 78 contributors, we published the whitepaper in April 2013; it has since been viewed more than 34,000 times.
21
Building a heterogeneous Service Network
Users’ workflows and applications
Sustained Service and Data Providers
GBIF, CoL, OBIS, WoRMS, EMBL-EBI, BGBM, CRIA, EoL, BHL, ALA, LTER, and more.
www.biodiversitycatalogue.org
Recognised and stable Infrastructure Providers
National, EGI.eu, PRACE, commercial, EUDAT, etc.
22
Preparing the next, coordinated steps
Diagram from LinkD Concept Note, September 2014
[Diagram labels: LinkD; Science of Scale; Life on Earth; ELODINS; ENVRI+]
What do we want to do in LinkD?
From slides by Vince Smith, LinkD proposal coordinator, Natural History Museum, London
New H2020 proposal
“VRE call”
January 2015
Develop the highly responsive digital framework required to enable high throughput research and support science of scale towards the long-term vision of modelling Life on Earth
24
Photo: A lone farmer walks among rice paddies. © DFATD-MAECD/Tick Collins
Take home message: “It’s a journey”
• Accessible data is the enabler of “in-silico” science that leads towards predicting the biosphere
• A shared multi-purpose network of processing services, sitting on top of open data is the route to interoperability
• Working together as a community is essential