GLOBIS-B(654003)
Projectacronym:GLOBIS-B
Projectfulltitle:“GLOBalInfrastructuresforSupportingBiodiversityresearch”
Grantagreementno.:654003
D3.1:Technicalissuesandrisksassociatedwith
generalchallengesofprovisioningresearch
infrastructurestodelivercapabilitiesforEBV
processing
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page2of86
Due-Date: 29th April 2016
Actual Delivery: 30th September 2016
Lead Partner: Cardiff University
Dissemination Level: PU
Status: Public version
Version: 1.0
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page3of86
DOCUMENTINFO
Dateandversionno. Author Comments/Changes
15thJune201616thJune2016,v0.1
AlexHardisty,CU Initialdraft,structuringandoutliningofcontent
16thJune2016,v0.217th–21stJune2016,v0.323rdJune2016,v0.4
AlexHardisty,CU Continuedwritingwhilewaitingforearlyreviewcommentandinputfromothers
30thJune2016,v0.51stJuly2016,v0.6
AlexHardisty,CU Improvedthesectionontranslationalrisks;addedmorecontenttotechnicalissuessub-sections
12thJuly2016,v0.7 AlexHardisty,CU Fillingoutstandinggaps,readyforreview
14thJuly2016,v0.8 AlexHardisty,CU Nearfinaldraftforreviewbyprojectstaff;(stillmissingsectionsfromDavid).
27thJuly2016,v0.915thAugust-3rdSeptember2016,v0.10
DavidManset,GnubilaAlexHardisty,CULucyBastin,JointResearchCentreoftheEuropeanCommissionLeeBelbin,AtlasofLivingAustralia
Insertedsectiononbusinessmodelcanvasasoutcomeofworkshop2.Otherminortidy-ups.Independentcriticalreviewandediting.
30thSeptember2016,v1.0 AlexHardisty,CU Finaleditingandpublication.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page4of86
TABLEOFCONTENTS
1 Executivesummary...................................................................................................................................7
2 Contributors.............................................................................................................................................7
3 Background...............................................................................................................................................8
4 Pre-workshopquestionsandresponses.................................................................................................10
4.1 KeystepsofworkflowsforEBVsspeciesdistribution/abundance................................................10
4.2 Suitabletechnicalapproachtoperformworkflow........................................................................10
4.3 Technicaloptionsavailable............................................................................................................11
4.4 Top3-5technicalchallenges..........................................................................................................11
5 Outcomesfromworkshop1...................................................................................................................12
6 Casestudies............................................................................................................................................15
7 Roleofestablishedandemerginginfrastructures..................................................................................17
7.1 Currentandemerginginfrastructures...........................................................................................17
7.2 ProvisioningresearchinfrastructurestodelivercapabilitiesforEBVprocessing..........................19
8 Generalchallengesandconsiderations..................................................................................................20
8.1 ResponsibilityforEBVproduction.................................................................................................20
8.2 TheEBVproductioncycle..............................................................................................................20
8.3 EBVdatamodelandstructures.....................................................................................................21
8.4 Datastandards...............................................................................................................................21
9 Outcomesfromworkshop2...................................................................................................................21
9.1 OnlyopendataforEBVs................................................................................................................21
9.2 GeneralisedworkflowforEBVprocessing.....................................................................................22
9.3 MappingRIsscopetogeneralisedworkflowsteps........................................................................22
9.3.1 Dashboardofworkflowsteps,risksandneeds.......................................................................22
9.3.2 StrategymodelcanvasforEBVs..............................................................................................23
10 TechnicalissuesforEBVimplementation...............................................................................................25
10.1 On-demand,on-the-flyproductionversusperiodiccyclicproduction..........................................25
10.2 Quality,integrityandaccessibilityofdata.....................................................................................26
10.2.1 Dataquality.............................................................................................................................26
10.2.2 Dataintegrity...........................................................................................................................26
10.2.3 Accessibilityofdata.................................................................................................................27
10.3 Taxonomicmatchingandreconciliation........................................................................................28
10.4 Balancingautomationandexpertinput........................................................................................28
10.5 Standardisedworkflowapproach..................................................................................................29
10.5.1 Rationaleforadoptingastandardisedworkflowapproach....................................................29
10.5.2 Mainissues..............................................................................................................................29
10.6 Dataproductstorageandpublishing.............................................................................................30
10.7 Implementingthehypercubeapproach........................................................................................30
10.8 StandardisedAPIsandmeta-descriptionofthem.........................................................................31
10.9 Provenanceandtraceability,reproducibilityandrepeatability.....................................................31
10.10 Distributedversuscentralisedprocessing.....................................................................................32
10.11 Standardization..............................................................................................................................33
10.11.1 Choiceofstandards.................................................................................................................33
10.11.2 Issuesofstandardization.........................................................................................................33
10.11.3 CompliancewithrequirementsoftheINSPIREDirective........................................................33
10.11.4 Responsibilityforstandardization...........................................................................................34
10.12 Persistentidentifiers......................................................................................................................34
10.13 DifferentICTscenarios...................................................................................................................35
10.14 Softwareenforcementofdatalicensingtermsandconditions.....................................................35
11 TechnicalrisksforEBVimplementation.................................................................................................35
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page5of86
11.1 Readiness/capabilityofinfrastructurestosupportEBVsproduction..........................................35
11.2 Openaccessforservicedependencies..........................................................................................36
11.3 Multilingualism..............................................................................................................................36
11.4 Linkingacrossinfrastructures........................................................................................................36
11.5 Thediversityoftoolsandtechniques............................................................................................36
11.6 PrecisenatureofEBVdataproducts.............................................................................................37
12 TranslationalrisksofEBVsmethodsresearch........................................................................................37
12.1 Theinnovationchain......................................................................................................................37
12.2 Particulargapsintranslation.........................................................................................................40
13 Conclusionsandrecommendations.......................................................................................................41
14 References..............................................................................................................................................42
Annex1:Responsestopre-workshopquestionsQ5-Q8.................................................................................45
Annex2:Figure3expandedforreadability....................................................................................................80
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page6of86
LISTOFTABLESANDFIGURES
Table1:ProjectpartnersandsupportingresearchinfrastructuresoftheGLOBIS-Bproject.......................................9Table2:Mainworkflowstepssuggestedbygroupdiscussions................................................................................13Table3:ICTrelated-challengesarisingfromcasestudies.........................................................................................17Table4:ICTrelated-challengesarisingfrominfrastructureinitiatives......................................................................19Table5:Examplestructuringforhierarchicalprocessing.........................................................................................32Table6:ExamplesofdifferenttypesofEBVdataproduct........................................................................................39
Figure1:PotentialstepsforcalculationofEssentialBiodiversityVariables(EBVs,green);andrelatedscientific(blue)andtechnical(red)questionsandchallenges...................................................................................................9
Figure2:ExamplesofkeyworkflowstepsforbuildingEBVdatasetsonspeciesdistributionsandabundances........22Figure3:Dashboard:Workflowsteps,risksandneeds(expandedforreadabilityinannex2)..................................23Figure4:GLOBIS-B'sBusinessStrategyModelCanvas..............................................................................................24Figure5:Methodologygeneralisation.....................................................................................................................25Figure6:Theinnovationchain-transitionfromresearchtoproductisationtomassuseAdaptedfrom(Robinsonand
Propp2008)...................................................................................................................................................38
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page7of86
1 ExecutivesummaryThisdocumentidentifiesanddescribestechnicalissuesandrisksassociatedwiththegeneralchallengesof
provisioning research infrastructures to deliver capabilities for processing Essential Biodiversity Variables
(EBV)[Pereira2013].
Thisdocumentistheresultofpreparatoryworkforworkshop1(Leipzig,29thFebruary–2
ndMarch2016),
theworkshopitselfandfollow-upactivitiesaftertheworkshopandleadinguptoworkshop2(Sevilla,14–
15thJune2016).Itisarecordofinformationavailableatamomentintime,togetherwithsomeanalysisasa
resultofactivitycarriedoutintasks3.1,3.2and3.3oftheDescriptionofWork.Thedocumentisconcerned
mainlywithassistingtocompletetheworkpackageobjectivesto:
• Maptheuserrequirementswithexistinginfrastructuredataandprocessingcapabilities,identifying
gapswheretheyexist;and,
• DesignamethodologicalframeworktotranslatecandidateEBVsintotransnationalandcross-
infrastructurescientificworkflows.
Thedocumentprovidesinsightstotheproblemofhowcooperatingbiodiversityresearchinfrastructurescan
contributetofrontierresearchintotheharmonisedimplementationofEssentialBiodiversityVariables,by
focusingonofferingdata,workflowsandcomputationalservices.
This document summarises technical issues and risks that have to be addressed to enable practical
development and delivery of EBVs data products. There are several unresolved general challenges and
considerationsassociatedwithharmonisedEBVimplementation,anditplacesthetechnicalissuesandrisks
in the context of those. These general challenges are about assigning responsibility for EBV production,
understanding the EBV production cycle and its needs in terms of datamodel structure and applicable
standards. It is likely that new infrastructurewill beneeded toprovide the tools andworkflows for EBV
production.AsingleandsharedunderstandingofstrategytowardthetechnicalaspectsofEBVdataproducts
productionisneededamongallstakeholders.
There is a significantunresolvedproblem in termsof thenatureof the translationprocess thatmustbe
followedinordertomovefromfrontierresearchthatprovestheprinciplesofEBVstoastateofregularmass
productionofEBVdataproducts,akintoclimatevariabledataproducts.
We recommend the development of a technical roadmap for the next 3 – 5 years, in conjunctionwith
identifyingfirststepstogainpracticalexperienceoftheissuesassociatedwithproducingEBVdataproducts.
2 ContributorsThemaincontributorstothepresentdocumentareAlexHardisty(CardiffUniversity,UK),WP3leaderand
DavidManset(Gnubila,France).Othermembersoftheprojectteamandtheparticipantsofworkshop1have
providedinformationthatsupportsandformsthecontributionsofHardistyandManset.Inparticular,we
acknowledge contributions and inspirations from Daniel Amariles, Lee Belbin, Matthias Obst, Hannu
Saarenmaa,BrianWee,andKristenWilliams.Criticalproof-readingandeditinghasbeencarriedoutbyLucy
Bastin(whoalsoprovidedsuggestionsforTable6)andLeeBelbin.
Muchof thework reported in thepresent document is basedon the evolutionof ideas, the conduct of
workshopsandanalysisoftheoutcomes.Inparticular,thecasestudiesmentionedinsection5areworksin
progressthatwillyieldimportantlessonsregardingthepracticalities(proofs-of-principle)oftheinformatics
capabilitiesandcapacitiesneededforproducingEBVdataproducts.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page8of86
3 BackgroundAkeyquestionforglobalbiodiversitymonitoringishowthemulti-lateralcooperationofdatacollectors,data
providers,monitoringschemes,andbiodiversityresearchinfrastructurescanbeachievedatthegloballevel
tosupporttheharmonisedimplementationofEBVs.Asyet,therehasbeenlittleappliedevaluationofEBVs
fortheirsignificanceinconstructingbiodiversityindicatorsandtheirapplicabilityatdifferentspatiotemporal
scales.Frontierresearchinthisarearequirestheavailabilityandaccessibilityofsubstantialdatasetswith
sufficient spatiotemporal coverage. GLOBIS-B aims at elucidating how the cooperating research
infrastructures(Table1)maycontributetosuchanobjectivebyfocusingonofferingdata,workflowsand
computationalservicesforsupportingthecalculationofEBVs…
• …foranygeographicarea,smallorlarge,fine-grainedorcoarse;
• …atatemporalscaledeterminedbyneedand/orthefrequencyofavailableobservations;
• …atapointintimeinthepast,presentdayorinthefuture;
• …asappropriate,foranyspecies,assemblage,ecosystem,biome,etc.
• …usingdataforthatarea/topicthatmaybeheldbyanyandacrossmultipleresearch
infrastructures;
• …usingaharmonized,widelyacceptedprotocol(workflow)capableofbeingexecutedinany
researchinfrastructure;
• …byany(appropriate)personanywhere.
The scientific questions and methods related to the above aspects will assist in defining the user
requirementsforextractingandhandlingdatatosupportmeasurementofbiodiversitychange.Tothisend,
theprojectbringstogetherkeyscientistswithglobalresearchinfrastructureoperators,technicalexpertsand
legalinteroperabilityexpertstoaddressresearchneedsandinfrastructureservicesunderpinningtheconcept
ofEBVs.WiththisfocusonresearchneedsforcalculatingandtestingEBVs,theshort-termattentionison
ad-hoc on-demand services (and related workflows) in the cooperating research infrastructures. The
experiences obtained may later lead to a more systematic, periodic production cycle where EBV data
productsareproduced,updatedandextendedannually,quarterlyormonthly.
ThelistedsupportingresearchinfrastructuresrepresentthosethathaveagreedtocontributetotheGLOBIS-B
project.FromKisslingetal.(2015)
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page9of86
Table1:ProjectpartnersandsupportingresearchinfrastructuresoftheGLOBIS-Bproject.
TheGLOBIS-Bprojectisuniqueinbringingtogetherbiodiversityscientistsandbiodiversityinformaticsstaff
from research infrastructures to discuss and develop a framework for implementing EBVs. Further
backgroundisavailableinKisslingetal.2015.Theproposed4workshopsaremeantasexperiments,wheredifferent scenariosareconsideredonhowscientistsmaywant to test the relevanceofEBVs forbuilding
indicators,andwhichdata,workflowsandcomputationalcapacityeachscenariowill require. Inturn, the
cooperating research infrastructureswill consider thechallengesandpotential solutionsofproviding the
requireddataandworkflowservicestoachieveglobalinteroperability.Thisrequiresdetaileddiscussionof
thenecessarystepsandtoolsneededtomovefromdatacollection,integration,filtering,throughmodelling,
testingandvalidationtothefinalpresentationofanEBV(Figure2,green).
Figure1:PotentialstepsforcalculationofEssentialBiodiversityVariables(EBVs,green);
andrelatedscientific(blue)andtechnical(red)questionsandchallenges.
Importantscientificdiscussionpoints(Figure1,blue)willbe:
• Whichdataareneededandhowtheyhavetobeintegrated,filteredandharmonised;
• Whichanalyticaltoolsandmodelsneedtobeimplementedandtested;and,
• HowtopresentandvisualizeEBVs.
Relatedtechnicaldiscussionpoints(Figure1,red)are:
• Howworkflowshavetobedesigned;
• WhichInformationandCommunicationTechnology(ICT)approachesandoptionsareavailable;
and,
• Howtheapproachescanbemadeinteroperable.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page10of86
TheaimistoformulatekeyresearchquestionsfortestingEBVsandtohelpidentifywhichtechnicalandlegal
challenges the research infrastructures are facing to support the interoperable, on-demand, ad-hoc
calculationofEBVs.
4 Pre-workshopquestionsandresponsesIn preparation for the firstworkshop andhaving inmind the different perspectives (scientific, technical,
legal/policy)oftheexperts,participantswereaskedtoanswerquestionsonscientificaspectsofEBVsand
thetechnicalaspectsofresearchinfrastructures.Approximately25-30answerswerereceivedrelatingto
thetechnicalaspects(Annex1)andthesehavebeenanalysedthematically.Thequestions,withasummary
oftheemergentthemesintheanswersforeachonearegiveninthefollowingsub-sections.
4.1 KeystepsofworkflowsforEBVsspeciesdistribution/abundanceQ5.Whatarethekeystepsofaworkflow(s)forcalculatingspeciesdistributionand/orabundanceEBVs,startingfromaccessingtherawdatatopresentingavisualresult?Whatarethecomplexitiesinvolved?Whatdatapreparationisneeded?
Manypossibleapproachesandsequencesofstepswerementionedbytherespondents,varyingaccording
totheirexperienceandbackgrounds.Theseranged,forexample,fromdirectinferenceusingremotesensing
imagerythroughtoderivativeapproachesbasedonsomeformofmodelling.Inallcases,dataavailabilityand
quality is mentioned as being of particular concern. These concerns are described often in terms of
understanding fitness for purpose of the available data, gaps in the data and accessibility of data.
Respondentsalsosupplieddetailsofspecificstepstobetakentopreparedataforusee.g.,byeliminating
non-relevantdata,byperformingconsistencychecksandcorrectingbiases,bystandardisingwhatisneeded,
metadatatoassessprecisionandaccuracy,etc.Thetendencyofbothtaxonomyandmethodstoevolvewas
highlightedasasignificantchallenge.
Manyresponsesrecognisedtheneedtoachieveabalancebetweenhavinganautomatedapproachbased
oncommontoolsinasequenceofsteps(aworkflow)andtheapplicationofexperthumaninputatmultiple
pointstoverifyandassurethecorrectprogressoftheprocedure.Thisisparticularlyimportantduringthe
researchphasewhiledifferentapproachestoproducingEBVdataproductsarebeingtestedandassessed.
Access to intermediateoutputs for thispurpose isessential,withdocumentation (i.e., traceability)being
identifiedasthekey.Severalvariationsinthepossiblesequenceofstepsweresuggestedbutgenerallythese
can be consolidated together in sequence to give a generalised workflow for EBV production that has
approximately12-15steps.(Note:Seealsosection9below.)
4.2 SuitabletechnicalapproachtoperformworkflowQ6.Whatisasuitabletechnical(ICT)approachtoperformthisworkflow(s)forcalculatingEBVs(anyplace,anytime,usingdataanywhere,byanyone)?Whatspecialconsiderationshavetobetakenintoaccount?
Manyanswerssuggestusingastandardisedworkflowapproach,althoughthereisvariabilityinthedetails.
Within these responses, adaptationsof existing tools aswell as several alternative approachesbasedon
SpeciesDistributionModelling(SDM),orOnlineAnalyticalProcessing(OLAP),etc.aresuggested.Theseneed
tobe investigated further.Understanding the consequencesofdemandon computational capability and
capacityareimportant.
AcommonthemeiswrappingofdataresourcesandtoolswithstandardAPIs;and/ortheuseofstandardized
vocabulariesformeta-descriptionofthoseAPIs.This isconsistentwiththegeneraltrendtowardsgreater
machine-to-machine(i.e.,software-to-software)interactions.Workonreferencearchitectureandprofilesof
recommendedstandardsisalsomentionedandneedstobepursued.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page11of86
ThelevelofpractitionerskillsneededtosetupandoperateanypipelineforproducingEBVdataproductsis
oftenmentioned.Thereisasharedconcernthatitisnotsomethingthatanuntrainedpersoncanjustturn
thehandlefor.Thereisasetofopinionsrangingfrom"Idoubtthatcannedapproachesareoptimal"to"the
analyseshavetobeadaptedaccordingtotheresearchquestions".Thesereflectalotofuncertaintyamong
theparticipantsontheprecisenatureofEBVdataproducts(section11.6),whattheywillbeusedforand
how they should be produced. This reflects the diversity of thought in the community on EBV research,
development and production. This latter aspect is a theme taken up later in section 12 on translational
aspectsofEBVsmethodsresearch.
4.3 TechnicaloptionsavailableQ7.Whatare the technicaloptionsavailableandwhat ispossible toachieve todayorwithin thenext12months(i.e.,bymid-2017)?Whatdataand/orworkflows,softwareetc.areavailabletoday?Whereisitandhowcanitbeused?
Severalrespondentssuggestedthatitshouldbepossibletomakesubstantialprogresswithin12monthsby,
for exampleworkingwith different user groups and experts, conducting a survey and evaluation of the
availabletoolsandresources,adaptingfromexistingtoolssandmovingtowardsaframeworkfordatathat
avoidsduplicationindatapreparation.Thereareindividualtoolsavailablethataregoodforparticulartasks.
However,theoverridingimpressioncomingfromtheanswersisthatthelandscapeisfragmented.Thereare
manytoolsandtechnologiesthatareusedbydifferentgroupswithinthecommunitysothereisapriorityto
betterunderstandtheneeds/requirementstoconvergeonglobalinteroperability.
4.4 Top3-5technicalchallengesQ8.Whatarethetop3-5technicalchallengesofsupportinginteroperableEBVcalculationsonaglobalbasis?Howcanthesebeaddressedandinwhattimeperiod?Whohastodosomething?
Respondentsmentionedchallengesarisingfrom:
• Datasparsityandvariability,withmostbiodiversitybeingeithercompletelyunrepresentedorvery
under-representedbytheavailabledata.Inthespecificcontextofabundance,thelackofaccurate
andwidespreadrecordingofabundancedatawithconfirmedabsences,(ratherthanjustpresence-
onlydata)wasmentionedseveraltimesasbeingasubstantialbarriertoproducingusefuldata
products.
• Metadataandqualityassuranceofdatawerementionedagainastechnicalchallenges,aswasthe
taxonomicresolutionissue.
• Automationofmodellingusingconsistentandrigorousapproaches,designingforefficient
interactionofcomputingandhumansteps,computationalcapacity,etc.
Tools for the traceability of the entire process, based on an effective system of unique identifiers for
biodiversityoccurrences/objectsareconsideredimportant.However,moreimportantisthereproducibility
androbustnessoftheEBVcalculations,ensuringthattheyscalecorrectly,thatresultsareconsistentandthat
theyaretrustedbydecisionmakers.
One of the outstanding conceptual challenges for the development of EBVs is agreement on common
scales/units/indicesofmeasurementtoallowdataon,forexample,speciesdistributionstobe integrated
sensiblywithdataonhabitatextent,orallelicdiversitywithhabitatextent.Itwassuggestedthatcombining
differentEBVsinastandardisedwayistherealpoweroftheconceptualapproach.
DistributedversuscentralisedresponsibilityandproceduresforproducingEBVsisanimportantissue.Trade-
offshereare toensureconsistencyofcalculationandadequate resourcingat, forexamplenational level
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page12of86
versustheneedforresourcescentrallytocarryoutthistask,anderrorsarisingfromlimitedunderstanding
ofthedatainacentralisedoperation.
Many respondents recognised the need for a global, cooperative effort for 3-5 years on developing and
documentinga system that is truly accessible toa rangeof stakeholders thatneedaccess to robustand
reliable EBV data products. That system is envisaged to comprise a set of specifications, processes and
proceduresandtheirtechnicalimplementationatnationalandinternationallevel.Theroleofnationaland
internationalinfrastructurewithguaranteedlong-termfundingiscriticalinthisvaluechainfromrawdata
(GBIFandBONs)tocreatingEBVdataproductsthroughtoindicatorsdemandedbyIPBES,CBDandothers.It
iscriticalandhastobeclarified.
Thisefforthastoincludeagreementontopicssuchasstandarddatarepresentations,standarddatacontent,
standardmethods fordefiningworkflows formanipulating thedata intoanEBVdataproduct.Thereare
requirementsforcontinueddatamobilisation,e-Servicesasstandardbuildingblocksandencouragingmore
opendataaccessandsharing.
Havingtherightpeoplewiththeskillsandknowledgetodealwiththetechnicalimplementationissues(e.g.,
statistics,workflows,databases)andtheassociatedinteroperabilityandlegalissuesatagloballevelisseen
ascriticaltosuccessandasanissuethatneedsattention.
5 Outcomesfromworkshop1In5parallelgroups,participantstoworkshop1discussedtheabove4questionsandreportedbacktoplenary
session.ImportantpointsrelatingtolikelyworkflowstepsaresummarisedinFigure2below.Technicalissues
arisingfromthesediscussions,reportbacksandfurtherplenarydiscussionarelistedthereafter.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page13of86
Majorworkflowstep Group1 Group2 Group3 Group4 Group5Datapreparation,qualityassurance
1.Requireautomatedmeans
for:
-Assessingdataquality(e.g.,
ALA’sdataqualityflags)
-Assessingrestrictionsonuse
(e.g.licensingrestrictions).
2.Removeduplicates(same
recordfrommultiple
institutes:anissueof
designingpersistentunique
identifiers)
3.Encourageuseof
taxonomicnameresolution
service
1.Getgoodqualityspecies
distributionsdata
-Whatspecieslocations
availablenow.Needforgood
metadata,andtounderstand
thedata(presencerecords).
Dealwithbias.Characterise
wherepeoplemayhave
sampled;alongsidewhat?
2.Getgoodqualityco-
variancedata
-Dowehavethosedata
availableasGIScovariates
3.Expertinputonneededco-
variancedata;possibleuseof
traitsinformationand
databases;howvariables
affectdistrib’n.
Workflowtomobilizedata
moreeasilye.g.,
-Automaticextractand
convertdataandpublish
throughIPT,BDJ(needsto
includeeDWC)
-Datacleaning/checkthe
quality
-CreateeDWC
-Providepolygonsina
definedformat
-GBIFwindowtoextractEBV
data
DWCasbasis;extendedDWC
includingtheneedsofEBV
Example:Detectionof
correctedoccupancybased
oncameratrapdata
1.Identificationofpotential
datasources
-Imagesalongwithmetadata
2.Observations
-Imagetaggingalongwith
commonlyagreedupon
metadatastandard(Camera
Trapfederatedminimum
datastandard)
3.Dataquality
-Basedonthemetadata
proceedtoqualitycontrol
4.‘Validated’observations
-Speciespres/absmatrix
1.Defineadatasetofspecies.
2.Integrateandputin
relationwithotherdataand
metadata:location
information,habitat,human
density,landcover,genetic
information,etc.
3.Youdon’twantto
homogenizeheterogenicdata
andlooseinformation.
Modelling Largematrixinversion
1:Usebiodiversitymodels
identifiedbyEU-BONas
relevant={models}
2:ForeachMin{Model}:
SelectNdimensionsofthe
hypercube.RunmodelM
againstdimensionsN.Add
modeloutputasanother
dimensionintohypercube
4.Importanceofexperts’role
interactingwiththe
workflows:toknowlikely
distribution,tounderstand
observationsthathavebeen
made,internallyinthemodel
doesitlooklikeitspredicting
well
5.Model(selectionof
suitablemodel)
6.Detectioncorrected
occupancy(EBV)
4.Buildmodelkeepingtrack
ofallparameters(and
quality);versionit.
Validation 5.ExpertvalidationofEBV/
modeloutput/results
-Lookatthemodeloutputs
toseeiftheexpected
distributionmatchesreality
-Discusswithexpertsand
adjust,accordingto
understanding
-Makeamodelofrecords
againstsampledplaces.
1.Expert/peer-review
validationofmodels
Table2:Mainworkflowstepssuggestedbygroupdiscussions.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page14of86
The questions about suitable ICT,what can be achieved today and the top 3-5 challenges of supportinginteroperableEBVcalculationspromptedmanydiscussionresponses.Participantsdescribedtheseintheirownterms,accordingtotheirownexpertise,backgroundsandexperiences.The listbelowgroupssimilarresponsesinalogicalsequencetoprovideemergingissues:
1. A rangeof ICT approaches is likely due to the particular EBV and the organisation(s) responsible forproducingthedataproductsassociatedwiththatEBV.Forexample,aremotesensingbasedEBVcouldbequickertooperationalisethanonethatisbasedondistributionmodelling.
2. DifferentICTsystemsmaybeneededforprototypingandproduction.3. Reproducibility.4. Repeatable ‘computer actionable’ workflows / processes / procedures that can integrate data from
multiple sources. Workflows are a means of documenting and implementing steps in sequence(expressedalsoas ‘aplatformtomerge thedifferentanalysismethods’) capturingexpertknowledgeneededtorepeattheprocedure.
5. Theoutputsofthecalculations/modelshavetobecapturedforre-usee.g.,asreferencedataset.Thisisaprovenanceissue.
6. Workflowsshouldbeopentopeerreviewandthustransparentintheirdescriptionandoperation.7. Executionservicesareneededtoruntheworkflowsandwhilesomeareavailable,theyneedtobemore
robust.8. One set of data services / products is needed for scientists and another peer-reviewed, carefully
documentedsetisneededforpolicymakers.9. Newproductsshouldbeeasilyproducedasnewdatabecomesavailable.10. Should EBV data products be published via some kind of federated database with links to all the
underlyingdataproviders?ThisiswhatGEOBONhasbeenconsideringbutpracticalitiesanddetailsarenotclearyet.Therearesimilaritieswiththehypercubeapproach.
11. Ahypercubeapproachprovidesaconceptualmodel thataidscommunityengagement.Everyonecanprojecthis/herowncontributiontothewhole.Thehypercubeapproachallowsustoswitchtoanewparadigm of thinking; thinking in terms of a continuum “hypervolume" (which is an n-dimensionalrepresentationofanecologicalnichedelimitedbythevaluesofnvariables.Hutchinson1957,Blonderetal2014).Deltabetweentwopointsinthehypercubeisameasureofchange.
12. Data inhypercubes,ofwhichtheremaybemanyhastobediscoverableandaccessible,sometadatastandardizationwouldbehelpful.Metadataimprovesdiscoverabilityandnavigationaroundandthroughdatasets.
13. Agreementsaboutdataformatandsafedatatransferprotocolsareneededforeachtypeofinformation.14. Nodatastandardsareavailableyet.DetailedspecificationanddocumentationisneededforeachEBV
dataset/product.15. Forpresentation/visualizationofinformationweshouldusemapsshowingdataintimeslicesinlayers.16. Substantialandeasilyaccessible(cloud)computingresourcesareneededforrunningmodelsthatmay
depend on large datasets of environmental variables and primary biodiversity data. Theremay be asubstantialcostassociatedwithanalyses.
17. SometoolsareavailabletodaythatcanbeappliedtoproducingEBVdataproductsbutagreementonapplicabilityisneeded.
18. Thesystemsshouldbeusablenon-expertusers(user-friendlyinterface).19. ItislikelythatnewplatformsforproducingEBVdataproductswillneedtobedeveloped.20. Managing the survey data (plot field data, remote sensing, etc.) is important for any analyses.
Maintainingthedatastructure,informationrelatingtothedataquality,completenessandqualityofthemetadataisfundamental.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page15of86
21. Thechallengeofaccuratetaxonomicidentificationraisesassociatedresolutionandreconciliationissues,andquestionsabouttoolstosupportthat.Synchronisationoftaxonomiesneedsimprovement.
22. Notonlythedifficultyofconsistentlyproducinganysinglevariablebutalsothedifficultyofconsistentlyproducingacrossseveralvariablessothatwhentheyareusedtogethertheyaremeaningful.
23. Asuiteofdifferentactivitiesisneededtodelivertheevidencelevel(See,forexampleHobernetal.2013).Onesuggestionisthatdataispresentedasalinkedopengraphwithportalsasintegratedviewsofthatgraph.Dataportalsofaggregateddataareviewsoftheevidencewehave.EBVdataproductswillbebasedonaccesstoavailabledata.
24. WhatistheminimummetadatainformativeforEBVs?Whatistheminimumdatatobemobilisedforrelevance?Whatkindofmetadatadoweneedforthedifferentkindsofdataweareusing?
6 CasestudiesWorkshopparticipantsfocusedontheEBVclass‘Speciespopulations’asthedata,modelsandunderstandingarethebestdevelopedtogainmorein-depthunderstanding.
Participantswereencouragedtothinkaboutpracticalimplementationcasesthatcouldbeusedtouncoverthe issues affecting EBVdatasetproduction. These caseswerebuilt on current activities and capabilitiesalreadyinplacetoday.Participantswereencouragedtocontinuediscussionsaftertheworkshop.
Eachcasestudysummarybelowexplainsthekeyactivity,thepurposeandwhatisoriginal/novelaboutit.
e-Birdandatlases:ReviewingallbiodiversitymonitoringprojectsbutfocussesespeciallyoneBirdandotherprojectsthatemployvolunteerstocollectobservations.Thegoalistoprovideaglobalmonitoringsystemforbiodiversityingeneralandbirdsspecifically.Participantsinsuchprojectscansubmitobservations,photos,sounds,andvideoviatheinternetandmobileapplications.Alldataisopenlyavailable.ForeBirdthesoftwarestackandfunctionalityisoriginal,asarethedataanalysesandvisualizationsofmodelscreatedfromeBirddata.
PreparinginvasivespeciesdatatowardsEBVspeciesdistribution: DevelopingaworkflowtoidentifykeyissuestowardsautomationoftheuseofspeciesdistributiondataforEBVsacrosstwoDataPublishers(GBIFandtheALA).ThepurposeistoevaluatetheviabilityofDataPublishers’supportforspeciesdistributionEBVs,documentthelikelyworkflowandtheissuesarising.TheoriginalityliesinaligningDataPublishers’supportforspeciesdistributionEBVs,identificationofissuesofdataintegrationandevaluation(fitnessforuse)ofGBIFandALAdatausinginvasivespeciesastheexemplar.
PublishingLPIdatathroughGBIF:TheLivingPlanetDatabase(LPD)isworkingwithGBIFonthepublicationoftheLivingPlanetIndex(LPI)datathroughGBIF.ThisinvolvesconvertingthedatafromtheLPDintoDarwinCoreFormatandinstallingtheGBIFIntegratedPublishingToolkit(IPT)toprovideregularupdatesfromtheLPDtoGBIF.
MarineEBVpilot:Theactivityistoidentifyrepresentativeproblemsinthestructureandprocessingofmarinedata when calculating Essential Biodiversity Variables for Species populations. The purpose is to gainempirical insight into the validity of existing marine data for EBV calculations and the identification ofobstaclesfortheircalculation.
Metagenomics-Metabarcoding-ELIXIR:Thegoal is to include theentirebiodiversityofmicrobiomes fromanyhabitatsintoanEBV,sheddingnewlightontheirincreasinglyevidentroleaspromotersorindicatorsofecosystemchanges.EBVscanbecomputedsimultaneouslyforanumberofspecies,intheorderofhundredsorthousandsofmicrobialtaxafromagivensample.SeverallargemetagenomiceffortsarealreadygeneratingdatasetsthatcanpotentiallybeusedtomodelmicrobialEBVs.TheseincludetheEarthMicrobiomeProjectandtheMetaSUBproject.Thefirstaimsatgeneratingacomprehensiveatlasofthemicrobialdiversityon
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page16of86
theplanet,withhundredsofthousandsofsamplescollectedworldwide.MetaSUBfocusesonthemicrobialdiversityofcitiesandsubwaysanditisattractiveasmostoftheseenvironmentstypesareconsistentevenin different continents. EBVs based on this data can thus be directly compared. Although the temporaldimension is missing at the moment, association between species presence/absence and geography orenvironmenttypeswillbepossibleatunprecedentedscale.Finally,therelevantcomplementaryactivitiesofELIXIR, the European infrastructure for biological data, concerning data resources and analytical toolsdesignedspecificallyformarinemetagenomicswillalsobeconsideredinthecompositionoftheworkflowformodellingmicrobialEBV.
Implementation ofWildlife Picture Index (WPI) as an EBV: The aim is to create an EBV from primary,standardized,highqualityspeciesdata(~250speciesand~500populationsoftropicalforestground-dwellingmammalsandbirds)fromcameratrapdatacomingfrom17sitesin15countries.ThesecondaryaimistoexposethesedatatotheGEOBONcommunityofpractitionerstoimprovedatauseandanalyticdevelopmentofthesekeydatasets.Theoriginalityisintheuseofstandardizedsensor(cameratrapimage)data,easilyreplicableandscalable.
Table3belowbringstogetheranyimmediateinformationfromeachoftheabovementionedcasestudiesthatmightbehelpfultosupportidentificationofICT-relatedtechnicalissuesandrisks.
Casestudyname Challengesandrisks Specificmissingservices EssentialresourceseBirdandatlases Volunteerparticipationwith
littletrainingrequiresspecificmethodsofdataanalysestocontrolforobservationbiasanddataqualitychallenges.ThesearebeingovercomeandeBirddataisnowbeingusedinmanypeer-reviewedscientificpublications.
Mostmonitoringprojectsneedtoconnectwithindividualswithincountryorregions.Thisisoftenthemostdifficultaspectofmonitoringprojects.Additionally,languagescouldbeabarrierandallaspectsoftheapplication(s)needtobetranslatedintomultiplelanguages.
ThemostessentialresourcetooperateaprojectlikeeBirdisopenaccesstothewebbyallpeopleineverycountry.Thisisnotalwaysthecase.Forexample,mainlandChinadoesnotprovideaccesstoGooglemaps,whichisanimportantpartofeBirdforfindingthelocationyouwereobservingbirds.Evenstill,morethan100,000observationshavebeencollectedinChinaandtherateofgrowthis@40%annually.
PreparinginvasivespeciesdatatowardsEBVspeciesdistribution
1.HavingconsistentwebservicesacrossDataPublishers.2.Dataintegrationandspecificallytaxonomicnameintegration.3.Criteriaforrecordacceptance(fitnessforuse).4.AstandardforrecordstestingandassertionsacrossDataPublishers.
1.ConsistentwebservicesacrossDataPublishers.2.Toolsfortaxonomicnamesintegration.3.Criteriaforrecordacceptance(fitnessforuse).4.AstandardforrecordstestingandassertionsacrossDataPublishers.
1.Infrastructureforexecutingworkflows.2.Workflowsdevelopmentcapability/capacity.
PublishingLPIdatathoughGBIF
1.Gettingdataintotherightformat.2.GainingagreementfromalldataproviderstoLPDtopermitpublishingthedatathroughGBIF.
Attributionofdatawiththeinvolvedmultipledatasources.
SupportfromGBIFforconvertingthedatatoDarwinCoreformatandforsettinguptheIPTtocontinuallyupdatethepublisheddata.
MarineEBVpilot -- Improvementofthedataprocessingpipeline.
--
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page17of86
Casestudyname Challengesandrisks Specificmissingservices EssentialresourcesMetagenomics-Metabarcoding-ELIXIR
1.Spatialdimensionisaparticularlychallengingaspectofagenetic-relatedEBV.Thereisanextremelylargeheterogeneityinanymacro-environmentandhabitatsthataredirectlyincontactshowonlyasmallnumberofcommonspecies.2.Thereisstillamajordisagreementastodefinethe“speciesconcept”formicrobialorganism,evenifsomeoperationalproceduresarecommonlyadopted.3.Therelativeabundanceofaspeciescanchangeasaconsequenceofabundanceshiftsofotherspeciesanditisdifficulttointerprettheminisolation.4.Reproducibilityofthesequencedataqualityandquantityinlongtermassessmentsandoftherelevantanalyses.
5.Internationallysupportedgoldstandardsforreferencemoleculardatabasesandtheirrelevanttaxonomicclassification
Furtheradvancesinhigh-throughputmolecularandbioinformaticresourceswillsignificantlyimprovefuturemicrobialclassificationsandidentifications.
-Richandproperlyannotatedmolecularreferencestandardresources,internationallysupported,includingconsistentmetadata.-Powerfulcomputationalinfrastructuresandpipelinestomanageandanalyseinareproducibleandaccuratewayhugeamountofgenomicdata.
ImplementationofWPIasanEBV
Datascarcity,unwillingnessofpeopletoshare.Knowingwhentomakedataopenlyavailableandhowtocontrolthat,takingintoaccountanylegalramificationsoftheinformationcontainedinthedata(e.g.,revealinglocationofprotectedspecies).
RatherthanexposingdatadirectlytoGBIFTEAMareconsideringsharingdataviaVertnet.ReadytoproduceandshareanEBVfrommillionsofcameratraprecordsunderTEAMandWildlifeInsight;workingtoregistertheEBVwithGEOBONandpublishthemetadata.
TobuildbetterAPIandwebsservicesforTEAMaswellasalltheWildlifeMonitoringAnalyticssoftware(includingWildlifeInsightswhichhasWCSdatainit).
Table3:ICTrelated-challengesarisingfromcasestudies.
7 Roleofestablishedandemerginginfrastructures
7.1 CurrentandemerginginfrastructuresResearch infrastructures (RIs)are facilities, resourcesandservicesdesignedtosupportefficient research.Most usually, research infrastructures are deployments of technologies (scientific instrumentation,computational capabilities, databases, software, networking, etc.) integrated for data management, forexperimentation,analysispipelines,modelbuildingandtesting,simulation,sharingandcollaboration.
RIrepresentativeshavebeenencouragedtothinkabouthowRIscanpracticallysupporttheneedsofEBVs,fromtheperspectiveofsupportingresearchintotheEBVsthemselvesandfortheproductionofEBVdataforusebyothers.
TheinformationbelowprovidedbyRIssummariseshowtheywillsupportEBVdataproduction.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page18of86
Distributed System of Scientific Collections (DiSSCo): Is developing a European Distributed System forScientificCollectionsthatrelieson: i)generationofaccessibleand interoperablecontent (Digitisation); ii)developmentofe-Infrastructuresthatmakescontentopenlyinteroperableandlinked;andiii)harmonisationofpolicies, standardizationofprocessesandexpert training.Thepurposeof the initiative is toallow thescientificcollectionstorespondtourgentsocietalchallengesbyembarkingonanunprecedenteddigitisationanddatamobilisationeffort.Originalandnovelaspectsoftheinitiativeinclude:i)addressingdigitizationina holistic way (taxonomy, imaging, trait (morphology, ecology) extraction, molecular and chemicalinformation,biogeography);ii)scalingup,linkingtogetherthelargestNaturalHistoryCollectionsinEurope;iii)creatingaregistryofspecimensandexpertsacrossEurope;iv)providingapanEuropeanaccesspointforallcollectiondata;andv)servingdatatoothere-infrastructures(e.g.,GBIF,GenBank,etc.).
CRIA:Toolsandservicessupportdatagathering,datatransformationanddatasharingtoenablesharingofbiological collection data (speciesLink) and providing mechanisms to generate ecological niche models(openModeller).Thepurposeofsuchtoolsistosupport,forexampleinitiativeslike“BiogeographyofFloraandFungifromBrazil”1,wherescientistsusebothresourcesthroughauser-friendlyinterfacetogeneratepotentialdistributionmodelsintheBrazilianterritoryfor~3,500plantsspeciesuntilnow.
LifeWatch:IsinitiatingarangeofactivitiesthatincludesworkingtoimprovespeciesobservationsdataforEBVuse;new individualandcommunity (terrestrial andmarine) sensors forgeneratingEBVusabledata;developingaVirtualLaboratoryofferingEarthObservationinformationforEBVs;anddevelopmentanduseof species and individual trait databases with work on ontologies, semantics and (trait) identifiers fordifferentgroupstomakesystemsinteroperableacrosstaxonomicandgeographicboundaries.ThepurposeoftheseactivitiesistoprovidesupportforresearchersonEBVdevelopments,andsubsequentlysupporttogovernmentalandprivateorganisationsinterestedinEBVsandIndicators.Dedicatedvirtualenvironmentsallowingresearcherstofocusontheirscienceratherthanontechnicalproblemsistheinnovativeandoriginalaspect of this initiative. A number of past and current EU projects, including BioVeL are serving as pilotimplementations.Moreover,theBiomolecularThematicCentre(CTB)oftheItaliannodeofLifeWatch,alongwiththerelatedMolecularBiodiversityLaboratory(MoBiLab)isprovidingskillsandadvancedfacilitiesformolecularandbioinformaticsanalysesofmetagenomicdataforsupportingmicrobialEBVsmodelling.
LTER Europe: For a network of long-term monitoring sites across Europe focussed on understandingecosystem processes, LTER Europe is collecting, managing and providing data and information forbiodiversity for the development and validation of EBVs. LTER Europe is enhancing the availability andaccessibilityoflongtermobservationdata.LTEREuropeseekstoinnovateonnovelmethodologicalaspectsofspeciesandcommunitymonitoringthatcouldleaddirectlytoEBVusabledata.
CAS Biodiversity Committee: Activities are concerned with: i) building up species distribution databasefocused on mammals, birds and higher plants; ii) monitoring biodiversity changes via the BiodiversityMonitoringNetworkinChina;andiii)analysingbiodiversitystatusandtrends.Theactivitiesareforstudyingrulesforbiodiversityorigin,evolutionandchangeandforsupportingimplementationactivitiesacrossChinafortheConventiononBiologicalDiversity(CBD).
NationalGermplasmBankofWildSpecies inChina(GBOWS):Thekeyactivity isbankingofseeds,DNA,plants invitro,microbialcultures,andanimalcell lineswithafocusonendangered,endemicspeciesandthosewith potential economic or scientific value. This is for the purpose of collecting and safeguardingChina’swildspeciesbiologicalresources.
NEON:Noinformationavailableattimeofwriting.
1http://biogeo.inct.florabrasil.net
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page19of86
SANBI:Noinformationavailableattimeofwriting.
DataONE:Noinformationavailableattimeofwriting.
Elixir: The CNR GLOBIS-B team is also part of the Italian node of Elixir and participates in the H2020EXCELERATEworkpackagefocusedonMarineMetagenomics.Itsresearchisfocusedonthedevelopmentofmolecularandbioinformaticresourcesformicrobiomeanalysis.BoththepipelinesfordatamanagementandtaxonomicanalysisandthemoleculardataresourcescansupportthemodellingofmicrobiomeEBVs.
Table4belowbringstogetheranyimmediateinformationfromeachoftheabovementionedinfrastructuresinitiatives thatmightbehelpful tosupport identificationof ICT-relatedtechnical issuesandrisks thatarefacedbytheinfrastructuresastheypreparetosupportEBVs.
Infrastructurename Challengesandrisks Specificmissingservices EssentialresourcesDistributedSystemofScientificCollections(DiSSCo)
Generatingthecontent(newdata)andlinkingthemtogetherisaherculeantaskatthescaleenvisagedforthisInfrastructure,involvingdatastandardsandlarge-scaledigitisationactivity.
1.Completetaxonomicbackbone(asaservice).
2.CommonURIframework.
3.Agreeduponontologiesandcontrolledvocabularies.
4.Robustdatabrokeringmechanisms.
1.CloudServices(Cloudcomputingandstorage).
2.Accesstoweb-servicesbyinternationaldataaggregators.
3.LinkstoservicesprovidedbyotherRIs(e.g.LifeWatch,ELIXIR).
CRIA -- -- --
LifeWatch 1.Developinggenericvirtualenvironmentsthatsuitavarietyofresearchinterests.
2.Trusteddatafrommultiplesourcesisneeded.
Dataservices. Accessinglargecomputationalcapacity.
LTEREurope 1.Dataavailability,dataharmonization,usingacommonspecieslist.
2.DatahastobeprovidedbythelocalLTERsitesinaharmonizedmanner.
1.Standardiseddataservicesfromdataproviders(currentlyunderdevelopment).
2.Commondiscoveryandaccessportal(currentlyunderdevelopment).
AspectsofdataprocessingandworkflowenginesaredelegatedtoRIs(e.g.,LifeWatch)andEuropeanscaleprojects(e.g.,EUDAT2020,ENVRIplus)focusingontheseaspects.
CASBiodiversityCommittee
-- -- --
NationalGermplasmBankofWildSpeciesinChina
Thetechnologyofkeepinglivematerialsforaslongaspossible.
-- --
NEON -- -- --
SANBI -- -- --
DataONE -- -- --
Elixir StandardsandguidelinesformoleculardatainteroperabilitywithotherEBVs.
1.MetagenomicandMeta-barcodingdata/metadataretrievalsystemandanalysis
2.Servicesforcomparativeandstatisticalanalyses.
1.Powerfulcomputationalplatforms
2.Standardreferencetaxonomiesandontologies
Table4:ICTrelated-challengesarisingfrominfrastructureinitiatives.
7.2 ProvisioningresearchinfrastructurestodelivercapabilitiesforEBVprocessingImplementing EBVs at production-scale requires global cooperation among biodiversity research,observation and data infrastructures to serve comparable data sets and offer analytical processingcapabilities.ThiscooperationimpliesinteroperabilitybetweentheRIsattheleveloftheservicelogic(García2014,Kissling2015),basedonagreeddatastandardsandwidelyacceptedmethods(workflows)capableof
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page20of86
beingexecutedinanyresearchorobservationinfrastructurefromanywhereintheworld.ThisposesseveralchallengesofprovisioninginfrastructurestodelivercapabilitiesforEBVprocessing.
8 Generalchallengesandconsiderations
8.1 ResponsibilityforEBVproductionWhichkindsoforganisation(s)aregoingtoproduceEBVdataproducts?Itisprobablynottheroleofdatapublishers to takeon the full responsibilityofbuildingparticularEBVworkflows, sonew infrastructure islikelytobeisneededtoprovidethetoolsandworkflowsforEBVproduction.
TherearefewRIstodayintheareaofbiodiversityscienceandecologythatarewell-founded,sustainedandinapositiontotakeontheresponsibilityforproducingEBVdataproducts.Thebusinesscaseneedstobemadeforthiskindofactivity,andforassociatedsustainedfunding.TheregionalBiodiversityObservationNetworks(BON)havearoletoplay.TherearemultiplekindsofdataproductsrelatedtoEBVs.SomeproductsareneededduringthephaseofresearchintoEBVsthemselves.OtherproductsareneededfornewsciencebasedonEBVsdata.A thirdkindofproduct isneeded to supportpolicy-makinganddecisions.Differentinfrastructurescanhavedifferentresponsibilitiesforeachcomponent.
RIs are one option for implementing support for EBV development responsibility, while BiodiversityObservationNetworks(BON)focusismoreonproduction-quality.BONsareabetterfitforproducingEBVsdata products as they have the necessary political and financial support. The RIs are more likely to befocussedonsupportingresearchintothedevelopmentofEBVsdataproductsandEBVmethodsratherthanintheiractualproduction.
It isnecessary tounderstand the scopeand responsibilityofeachof theactors / stakeholdershavinganinterestinEBVs.Thisincludestheestablished/emergingresearchinfrastructuresdataproviders/publishersandmonitoringinfrastructures.Datapublishers,forexamplemustensuredatahavethenecessaryancillaryinformationandconsistent implementationofdataqualityassertions [Belbin2013] thatenablepotentialuserstomakeappropriatedecisionsaboutfitnessforuse.Otheractorshavetoprovidethe‘super-tools’andworkflowsfor implementingdifferentaspectsofEBVproduction, forexample imputingmissingvariables,generating covariates to facilitate bias correction, applying workflows for standardised modellingapplications that allow a level of user intervention, and so forth. Knowing what the steps are in EBVprocessingisthereforeimportant.Thisisaddressedinsection9.2.
The‘innovationchain’bywhichtheemergingEBVmethodsprogressfrombeingascientificresearchoutputthroughseveralstages:fromproofofprinciple,throughmethod/product/servicedevelopmentandintoscience / business / policyproductionanduse is addressed further in section12.At all stages there aredifferentICTissuesthathavetobeaddressed.
8.2 TheEBVproductioncycleOnceweunderstandwhoisresponsibleforwhat,doweaimforanorganised,cyclic(periodic)andsystematicapproachtoproduceandpublishEBVdataproducts,e.g.,annually?Ordoweintendamuchmoread-hoc,on-demand, on-the-fly kind of production process? Each approach places different requirements on theproductionprocessforEBVdataproducts,thenecessaryinfrastructureandtheinformatics.
Ananalogy:Cyclicversuson-demandprintingofbooks:Considerabookprinting. Inacyclicorganisedapproach,thebookcontent(thedata)isheldbythepublisherandsenttobeprintedine.g.,10,000copiesevery fewyears,withsomerevisionseach time. Inanon-demandapproach, thecontent isheldby thepublisherandmadeavailableforindividualreaderstoprintwhentheywantit.Theprinting,bindinganddistributionprocessdifferssubstantiallybetweenthetwocases.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page21of86
AswithEssentialClimateVariables[Bojinski2014]productionofEBVdataproductshastofollowstandard,repeatable,openandtransparentprocedureswithclearandaccessibledocumentationandscrutinyateachstep.
SomeofthetechnicalimplicationsofthetwopossibleapproachestotheEBVproductioncycleareconsideredinsection10.1below.
8.3 EBVdatamodelandstructuresHypercubesaremulti-dimensionaldataarraysorganisedalonglinesoftime,space,speciesandpotentiallyotherdimensions.Thesearrayscanbe“sliced,diced,rolled-upordrilledinto”[Chaudhuri1997].Hypercubesarea logical structure that sitsbetweenapplications (e.g., specific indicators, specific scientists,orothersoftware examples) wanting to make use of EBV data products, and the underlying data on which theproductsarebased.ForeachEBV,areweexpectingtohaveasinglehypercubethatgrowsovertime?Orareweexpectingmultiple-interlinkedhypercubesdelineated,forexamplebygeography,bytimeanddistributedbyownership?
AretheremoreeffectiveconceptualandpracticalstructuresthataresuitableforEBVproduction?
8.4 DatastandardsWhenwehave decidedonmacro-data structures,what is the set of core data classes thatwe considerimportant enough to standardise (specimen, collection, taxon name, taxon concept, sequence, gene,publication, taxon trait, etc.)?Whatmetadata is required tomake the EBV data products discoverable,accessibleandusable?
Howaredataobjects linked andhoware these linksmaintained? For example, from specimen to taxonconcept to taxon name to publication, or from sequence to associated sequences to taxon concepts tospeciesoccurrences?
How can Darwin Core and its extensions help in the production process of EBVs related to speciesdistributionsandabundances?
Thereisaneedforcontrolledvocabulariesandeventuallyforontologiesexpressingrelationsbetweentheconceptsdefinedbythecontrolledvocabularies.Controlledvocabularieshavetobekepttotheminimumnecessary.Multilingualismmust also be supported to allow for mapping between concepts in differentcountries.
9 Outcomesfromworkshop2Participants to workshop 2 continued the discussions and provided further technical information byansweringasurveyfromRIsandEBVgroupsstandpoints.Thelatterprovidedinitialmaterialtobrainstormingon a possible methodological framework to structure EBV developments based on existing strategicmanagementtools.
9.1 OnlyopendataforEBVsThereissupportfortheprincipletoonlyuseopendataintheproductionofEBVdataproducts.Thisprobablymeans(intheshort-termatleast)thatonlyafewdatasets/sourcescanbeused.However,arequirementforinputdatatobeopencouldbeanincentivefordataproviderstopublishandshareunderopenlicensing,sincemanydataprovidersareunderpressuretoshowvalue-addeduseandre-useoftheirproducts.
GBIF advocates free and open access to biodiversity data and recently passed a milestone [GBIF 2016]whereby“allspeciesoccurrencedatasetsonGBIF.orgnowcarrystandardizedlicences,givingdatausersand
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page22of86
publishersgreater clarityandenabling them toproceedwith confidenceabout the termsofuse fordataaccessedthroughtheGBIFnetwork.”Theseareopen,CreativeCommonslicenses.
9.2 GeneralisedworkflowforEBVprocessingAsillustratedinFigure2below,threemajortypesofEBVdatasetscanbedifferentiated:EBV-usabledatasets(orangeworkflowsteps),EBV-readydatasets(greenworkflowsteps),andderived&modelledEBVdata(blueworkflowsteps).Eachsetofworkflowstepscanleadtopublisheddatasetsofthetypeindicated.
Figure2:ExamplesofkeyworkflowstepsforbuildingEBVdatasetsonspeciesdistributionsandabundances.
ItispossibletomapeachResearchInfrastructuretothestepsshowninFigure2foranindicationtheirscopeofresponsibility,fortheirabilitytosupportEBVsproductionandtheirgapsincapabilityandcapacity.
9.3 MappingRIsscopetogeneralisedworkflowsteps
9.3.1 Dashboardofworkflowsteps,risksandneedsThetechnicalinformationhasbeenanalysedandvisualizedinadashboard,Figure3below(expandedforreadability in annex 2). The analysis is unfortunately biased since not all questions were answered.Nevertheless,thisfirstexplorationhighlightssomeinterestingtrendsandgaps.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page23of86
Figure3:Dashboard:Workflowsteps,risksandneeds(expandedforreadabilityinannex2)
ThisvisualizationopenedthediscussiononamethodologicalframeworktosupportEBVgroupsinorganisingtheirfuturedevelopmentsandassessingtheirrelatedkeyperformanceindicators.DavidManset(Gnubila)suggestedtouseabusinessmodelcanvasasafocalcollaborationtoolandtoadaptittotheneedsofEBVdevelopments,i.e.,tocreatea“strategymodelcanvasforEBVs”.
9.3.2 StrategymodelcanvasforEBVsThe Business Model Canvas2 is a strategic management3 template for developing new or documentingexistingbusinessmodels. It isavisualchartwithelementsdescribinganorganisation'sorproduct'svalueproposition, infrastructure, customers, and finances. It assists organisations in aligning their activities byillustrating potential trade-offs. The Business Model Canvas, illustrated in Figure 4 below, was initiallyproposed by Alexander Osterwalder4. It is based on his earlier work on Business Model Ontology(Osterwalder2004).
2"BusinessModelCanvas."Wikipedia.WikimediaFoundation,n.d.Web.9July2016.3"StrategicManagement."Wikipedia.WikimediaFoundation,n.d.Web.9July2016.4TheBusinessModelCanvasnonlinearthinking.typepad.com,July05,2008.AccessedFeb25,2010.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page24of86
Figure4:GLOBIS-B'sBusinessStrategyModelCanvas
Intheresultingstrategymodelcanvas,weusetheterms“users”and“funding”.Thecanvascanbeusedasasimple1-pagesummaryofthevalueprovidedbythetargetEBV,itsusers,howitrelatestothemandthroughwhatchannels.Thecanvasmakesitpossibletoformaliseassociatedkeyactivities,resourcesandpartnerstoimplementtheEBV.Thecanvasalsoconsiderscostsandsourcesoffunding.Overall,thecanvas isafocalpoint for discussionwhere participants can place and position elements in the different sections, whilethinkingabout the“flow”,e.g., fromthepartners to theresources, toassociatedaddedvalueandfinallyusers. Itmustbeseenasacompassandamapatthesametime,whichoncecompletedgivesanoverallpicture.
As illustrated in Figure 5 below, a first canvas has been presented at workshop 2 that will be used todocumentworkwithAtlasofLivingAustraliaonthespeciesdistributionEBV.Workshop3willdevelopthisfurther.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page25of86
Figure5:Methodologygeneralisation
Thedocumentedcanvaseswillbeavailablefor interestedEBVgroups.ThesegroupscouldthentakeoverassociateddevelopmentsandbenefitfromasingleandsharedunderstandingoftheirEBVstrategy.
10 TechnicalissuesforEBVimplementation
10.1 On-demand,on-the-flyproductionversusperiodiccyclicproductionAsintroducedinsection8.2,therearetwopossibletechnicalapproachestoproducingEBVdataproducts–on-demand,on-the-flyasneeded;versusaperiodiccyclicapproachwhereEBVdataproductsareproduced,published,updatedandextended,forexampleannually,quarterlyormonthly.
On-demand,on-the-flyproductionrequiresreadyaccesstorelevantrawdata,theworkflowandprocessingcapacitytotransformrawdatatotheselectedEBVproductfortheindicatedarea(local,regional,national)at the time of interest. Processing capacity “at the touch of a button” is necessary to service theinstantaneousdemandoftherequest(andofsimultaneousrequests).Thesizeandcomplexityofrequestsarenotknowninadvancealthoughgeographicalareaandresolutioncanbeconstrained.Repeatabilityisakeyrequirement,suchthatiftheEBVisagainrequestedon-demandforthesameplaceandtime,withthesame data then the same answer should be delivered5. EBV data production is ad-hoc, responding to
5Note,ofcoursethatwhenmoredatafortheplaceandtimebecomesavailablethiscouldleadtoadifferentanswer.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page26of86
demandsofthemoment,withthequalityassurancechecksin-builtintheprocedure.ArchivingoftheEBVproductsisnotrequired.
Intheperiodiccyclicalapproach,EBVdataproductionismoresystematic,canbeaggregatedoverlargeareas(potentially,thewholeglobe)andpublished/archivedasaneverextendingdatabase(s)ofinformationtobequeriedtoprovidethedatafortheindicatedplaceorareaofinterest(local,regional,national)atthetimeofinterest.Processingcapacitycanbeestimatedinadvance.PeriodicityoftheproductioncycleforanEBVdataproductcanbetunedtotheavailableprocessingcapacityandtotheexpectedtemporalsensitivityofthatEBV.Theinformationisgeneratedonce,archivedandthenavailablesemi-permanentlytobere-usedasneeded. The anytime, anyplace requirement is met not by on-demand computation but by queryingpreviouslycomputeddataproductsthathaveundergoneapost-productionqualityassuranceassessment.
Thischoiceofon-demandorperiodicproductionisfundamentaltothewayproductionprocessesaredefinedand forhow infrastructuresareorganisedandoptimised for calculating, archivingand servingEBVsdataproducts.Thechoicehastobefeasible,efficient,andaffordable.Globalcooperation isneededtoensureconsistency,servingcomparablerawdatasetsandprocessingcapabilitiesforproductionandmaintainingappropriatearchives.TheworkflowsforproducingEBVsdataproductshavetobecapableofbeingexecutedinanyinfrastructure,andfromanywhereintheworld.Thechoiceraisesissuesforpermissionstouseprimarydata,forsecondarydata,forcitationandattributionandforprovenancetracking.
10.2 Quality,integrityandaccessibilityofdata
10.2.1 DataqualityKnowingwhetherdataareofsufficientqualityisamatterofknowingwhatthedataistobeusedforandofknowingtherequirementsthatdatahastomeetforthatpurpose.IntheEBVcontext,thechallengeistodefinetherequirementsthesourcedatahastomeettobefitforthepurposeofproducingEBVdataproducts.Suchdataisknownas‘EBVusabledata’.ThisisnotanICTchallenge.
The technical ICT challenges come from filteringoutnon-compliantdata,orenhancingdata tomeet therequirements.Modifyingdatatomakeitfit-for-purposeconflictswiththeneedtomaintainintegrityofthedata(section10.2.2below).However,reconcilingdatacollectedbydifferentmethodsovertimeisessentialto provide a like-for-like basis for further processing. Automated workflows for data discovery, access,retrieval,preparationandqualityassurancecanhelptominimiseandmanagetheconsequencesfordataintegrity.
Dataqualityassertionsorflagsbasedupontestsandcorrectionsmadebydatapublishersatthetimedataisdepositedintheirrepositorycangoalongwaytowardshelpingpotentialusers(whetherhumanormachine)toknowwhentherequirementsarebeingmet.ThisisdependentontherequirementsforEBVusabledatabeingexpressed incomparable terms.Qualityassurancecanbebasedon ‘errorsbyexception’ checking;eithermanuallybyinspectionor(moredesirably)inanautomatedway.Suchassertionsandflagsmayalsobeassociatedwithdataafteritspublication,usingemergingtoolsandstandardsforannotation.
10.2.2 DataintegrityDataintegrity6isfundamentalifEBVdataproductsaretobeusedforscientificresearchorforenvironmentalpolicyanddecision-making
6AccordingtotheUSA’sFDAmorethan20yearsago,dataintegrityreferstocompleteness,consistencyandaccuracyofdata;alsoincludingelementsofbeingattributable,legible,contemporaneouslyrecorded,originaloratruecopyandaccurate.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page27of86
Datashouldbesecure,traceableandfreefrommanipulativemodification(eitheraccidentalordeliberate)during theEBVproductionprocess.Preservingdata integritybecomes challengingwhen integratingdatafrommultiple sources andpassingdata throughmultiple steps/transformations/processes, eachperhapsoperatedbydifferentserviceproviders.Thereisaneedtoensurethatdataarepassedthroughconsecutivestepsintheprocessingchainintheirentiretyi.e.,withtheircontextualmetadata.Toreducethepossibilityof introducing errors, there is a need to avoid manual manipulations of the data at all stages. Datamanagementsystemsandprocessingworkflowshavetoassistinpreventingmistakesfrombeingmadeandinalertinghumanoperatorswhenproblemsaredetected.
Fixity provisions protect datasets that are too expensive to reproduce in the event of loss, as well aspreventingthemfromchangingafterpublication.PersistentIdentifiers(section10.12)havearoletoplayhere.Decisionshavetobetakenontheleveloffixityneeded,balancingcostofprovidingitagainstlikelihoodofchangeinthedataproducts,potentialfortheirlossandliabilityarisingfromthatlossorfrommanipulativemodification.
Integritygoeshand-in-handwithdataquality.It’sdifficultfordatatobecomplete,consistentandaccurateifqualityrequirementsarenotbeingmet for thedata. Integrity isalsorelatedtoadequatemetadata, toprovenanceandtraceability(section10.9)andtostandardization(section10.11),especiallyofdataformats.
10.2.3 AccessibilityofdataAvailabilityofdatareferstohavingtherightdatainsufficientquantityandqualitythatiscapableofbeingusedforthetaskathand.Datathatisappropriateandsufficientisnotthetechnicalissueherebutaccesstothesedatais.Thisisatechnical,proceduralandlegalissue.
Accessibilityoftheavailabledatameans:
• Canwediscover thedata?Has it beenpublishedand catalogued somewhere? Is there sufficientmetadataandcanwediscoveritbyinterrogatingthecatalogue(i.e.,fromasoftwareprogram)tofindthedataweneed?
• Canweretrievethedata?Isthatretrievalprocessmanualorautomated?Canitbedonebyapieceofsoftware?Whatmetadatainformationisneededtoretrievethedata?
• Areweallowedtousethedata(i.e.,duetoanylicensingrestrictions)?Isthereapricetopay?Whatare the mechanisms for signifying our agreement with the licensing conditions and for makingpayment(ifany)?
Accessibilitycanalsomean:Isthedatawewanttouse‘opendata’7orcloseddata?
Giving workflows and other software direct access to validated collections of open data (i.e., throughstandardizedWebserviceAPIs)minimisesinconsistencies,thedangersofaccidentallyincorporatingeitherpoor quality data or closed data,whilst alsomaintaining integrity of the data into the processing chain.Working on the technical interface between the EBV data production process and the relevant datapublishersisapriority.
Althoughtheprovisionofstandardiseddataservices(Webservice/RESTAPIs)makesiteasierforsoftwaretodiscover,retrieveandutilisedata,andinthelongrunreducesdevelopmentandintegrationeffort,thisisbynomeansanessentialrequirementtobeplacedondatapublishers.
7Inthesenseofthedefinitionathttp://opendefinition.org/
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page28of86
10.3 TaxonomicmatchingandreconciliationA‘completetaxonomicbackboneasaservice’ismentionedasbeingamissingpartoftheinfrastructureforproducingEBVdataproduct. Theneed for taxonomic look-up,matchingand reconciliation capabilities ismentionedseveraltimesasrequired.Whatismeantbytheseneeds?WhatisnecessaryfordatapreparationforEBVs?
TheEUBONUnifiedTaxonomicInformationService(UTIS)8 isdescribedasbeinga‘taxonomicbackbone’.AccordingtoEUBON,theUTISallowsafederatedsearchtoberunonmultipleEuropeancheckliststoreturnaunifiedresultsetoftheindividualresponsesofthevariouschecklists.UTISconnectsthewebservicesofthePan-EuropeanSpeciesdirectoriesInfrastructure(EU-Nomen),EUNISwhichfullycoversNatura2000,theCatalogue of Life (CoL), and the World Register of Marine Species (WoRMS). UTIS can be used in fullcompliancewithAppendix3oftheINSPIREdirective.Currently,itispossibletosearchfortaxaandsynonymsbyscientificnameorvernacularnamestrings.Wherethesearchstringmatchesasynonym,theacceptedtaxon is resolved. Furthermore, UTIS also provides a search mode for resolving taxon identifiers. Theresponsesofthesearchwebservicealwaysincludeinformationontheclassificationandoptionallyonrelatedtaxaasfarasthisdataisdeliveredbythechecklistproviders.
Similarly,VLIZ(Belgium)buildsacentralTaxonomicBackbone(TB)forLifeWatch9.AccordingtoLifeWatchBelgium, this TB includes species information services - taxonomy access services, a taxonomic editingenvironment,speciesoccurrenceservicesandcatalogueservices.TBbringstogetherdifferentcomponentdatabasesanddatasystems.Inadditiontotaxonomicinformation(taxonomicdatabases,speciesregistersandnomenclatures), theTBwill also includebiogeographicaldata (speciesobservations), ecologicaldata(traits),genomicdataandlinkstotheavailableliterature.
A further category of tools is those that provide semi-automated helpwithmatching names and cross-mapping between taxon concepts. Chinese Academy of Sciences Beijing Insitute of Zoology has its“Taxonomic Tree Tool” (TTT)10 while Cardiff University offers its Cross-mapping Tool11. TAXAMATCH12supportsfuzzysearches(e.g.,typosinnames)andiswidelyacknowledgedandused.
10.4 BalancingautomationandexpertinputThereisaneedtoachieveabalancebetweenhavinganautomatedapproachbasedoncommontoolsinasequenceofsteps(aworkflow)andtheessentialapplicationofexperthuman inputatmultiplepointstoverifyandassurethecorrectprogressoftheprocedure.
VariousfactorsassumedifferingimportanceaccordingtothephaseofEBVwork(see12.1below).
DuringtheresearchphaseofEBVdevelopment,accesstointermediateoutputsisessential.Traceabilityofthe inputs, algorithms and parameters is key. The researchermust also have the opportunity to adjustparametersofeachsteptoachievethedesiredeffect.Thisphaseisaniterativeapproachtodevelopingthemethods.
Whenthemethodbecomesstable,reviewed,andacceptedbythecommunity,thereislessneedforhumanorexpert intervention.Automationforefficientuseofcomputationalresources,speedandperformance,etc.becomesmoreimportantthanmanualintervention.Humaninterventionisneededforoccasionalquality
8http://cybertaxonomy.eu/eu-bon/utis/1.0/9http://www.lifewatch.be/en/taxonomic_backbone10http://ttt.biodinfo.org/TF/11<urlneeded>12http://www.cmar.csiro.au/datacentre/taxamatch.htm
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page29of86
controlpurposesorselectionofchoicesfromarestrictednumberofagreedparameterranges–timeperiodandgeographicalareabeingobviousones.
10.5 Standardisedworkflowapproach
10.5.1 RationaleforadoptingastandardisedworkflowapproachWorkflowssupportautomationofroutinetasks.Theyarerobustandreliableandtheworktheyperformisreproducible13.Regardlessofpreciseimplementationmechanism,workflowsareidealwhenastandardised,repeatableapproachwithtraceability(provenance)isneeded(Davidson2008,GobleandDeRoure2009).Workflows are a standard way of doing something and can be an approved procedure in a regulatoryenvironment.
ProductionofEBVdataproductswillinevitablyrequireasubstantiallevelofdataintegrationacrossalargeand dispersed number of data providers, as well as complex data preparation and processing steps.Historically,suchworkhasinvolvedacombinationofmanualworkandbespokecomputing‘scripts’codedinavarietyofprogramminglanguages,withmultipleanddifferentuserinterfaces.Ifsuchprocessingstepsaremeanttobemaintainedandrepeatedovermanyyears,itisunlikelythattheexperts,theirskills,orthecoststo support them remain the same. Hence it is advantageous to develop and preserve all data access,integration,andprocessingstepsofamethodforpreparingEBVdataproductsasaserviceinaworkflow-orientedinfrastructure[Hardisty2016].However,asingleandfullyautomatedEBVproductionworkflowmaynot be realistic, as such processingwill always need some degree of flexibility that can be expensive todevelopandmaintaininaworkflowinfrastructure.However,themostgenerickeystepsofanEBVcalculationcouldbeprovidedbyworkflowsystems.
Aswehaveseen(section9.2,Figure2)ageneralisedsetofworkflowstepsforproducingEBVdatasetscanbe formalisedto take intoaccountmultiplesequentialactivities, suchasaggregationofvariousrawdatasources,dataqualitychecks, identifyingduplicatedata,taxonomicname/cross-mapping,choosing/usingappropriatestatisticalmodelsandcalculatinguncertainty.Thesekindsofstepsare likelytobeneeded inmostEBVsprocedures.
10.5.2 MainissuesThemaintechnicalissuesaroundworkflowsimplementationare:
• Choiceofappropriateworkflowrepresentationlanguage,andtheassociatedexecutionmechanism;
• Implementationofadistributedservicemodelinwhichworkflowsorchestrateexecutionofmultipleindependentservicesexploitingdatafrommultiplesources;
• Toensureworkflowsaresufficientlytransparent intheirdescriptionandoperationtobeopentoscrutinyandpeerreview;
• Toensureeaseofmaintenanceandenhancementovermanyyears;
• Theinclusionofflexibilitytopermithumanexpertinspectionof intermediateresultsandinputtocontroltheexecutionprocedure.
ThechoiceofappropriateworkflowrepresentationlanguagemustacknowledgethatRiswidelyembracedasadefactoprogramming languageforecologytoday,mainlybecause it isopen,extensibleandhasthewidestvarietyofecology-specificRpackagesavailable.Ontheotherhand,languagessuchasPython,CorJava,with their supportingprogrammingandexecutionenvironmentscanbringdifferentadvantagesnotavailablefromcurrentRenvironments.EBVproductionhastobeexecutedinandacrossavarietyofdifferent
13Providedthatstepsaretakentopreservenotonlythedatausedbutalsotheprocessingstepsencapsulatedbytheworkflow.Thisisthenotionof‘researchobjects’.See,forexamplehttp://www.researchobject.org/.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page30of86
infrastructures.Thus,factorssuchasabstractionandseparationfromunderlyingcomputationalresourcesarrangementsandcomprehensibilityhavetobetakenintoaccounttoensureworkflowsaresufficientlyeasyto use, transparent and maintainable. An ability to interface to and interact with a wide range ofheterogeneouslocalisedandremoteservicetypesisessential.
Someexecutionanddataservicesareavailabletodaybutthere isaneedtomovetowardscommonandmorerobustonesthatareinfrastructureorientedratherthandesktoporiented.Thereshouldbenosoftwareinstallationdependenciesfortheuser.Capabilitieshavetobeflexibleenough(‘elastic’)tosupportfluctuatinglevelsofcomputationalandstoragedemandthatcannotbeknowninadvance.Fundamentally,toolsandservices deployed for light, local, benign use by (a few)well-known or internal users are not sufficient.Serviceshavetosupportheavy,remote,andpossiblymalicioususeby ‘riskystrangers’,aswellasusebymultipleuserssimultaneously.Thatistosay,serviceproviders,althoughactinglocallyhavetothinkglobally.
Practitioners have to be sufficiently skilled to develop and test the necessary workflows for such anenvironment. Theyhave tohaveknowledgeof the infrastructure(s) andof theavailable servicese.g., bydiscovering them from the Biodiversity Catalogue14. They have to understand how to optimally exploitcomputingandstorageresourcesforhighlyscalableapplicationsworkinginconjunctionwithlargequantitiesofdistributeddata.
10.6 DataproductstorageandpublishingLargeamountsofderiveddatadeliveredbytheEBVproductionprocesswillhavetobestoredandmanagedonalong-term,semi-permanentbasis.Thesedataproductsarelikelytobeproducedinadistributedmannersohowshouldtheybepublishedforthecommunity?Thiswillbeespeciallychallengingifwedecidetotakeadistributedhypercubeapproach(8.3above).
Whatevertheapproach,wemustdefinethecommonspecificationsforeachEBVdataproduct.Whatdoesthedataproductcontain?Howisitstructured?Whatarethedimensionsoftheproduct?AreanydimensionscommonacrossallorasubsetofEBVdataproducts?Inwhatunitsisthedatapresented?Whatmetadataisneeded?Howisthedatarepresentedandstored?Anappropriatebody,suchasGEOBON,TDWGorRDAhastoberesponsibleforthesespecifications.Seesection10.11belowonStandardization.
ForeachfamilyofEBVs,someagreementoncommondimensionsseemsrequired.FortheoverallcollectionofEBVs,afewcommondimensionsareneeded.ThreecommondimensionsneededbyeveryEBVaretime,spaceandspeciesname/taxa.
Aroughcalculationisneededtodeterminehowmuchphysicalstoragecapacityisrequired.Initiativesareneeded to take on the burden of the storage commitment, which has to be funded, of course. Suchcalculations have to take into account whether the semi-permanent storage of EBVs data will be donecentrally (i.e., inasingle repository,perhapswithduplicatedmirrorsites)orasanetworkofdistributed,independentbutinteroperablestorageservices.
10.7 ImplementingthehypercubeapproachnDimensionalcubes, implementedusingtraditionalrelationaldatabaseshavelongbeenafeatureofdatawarehouses,whererelationalqueriesareusedto“slice,dice,drill-downandroll-up”subsetsofthedatafor“OnLine Analytical Processing” (OLAP) in business intelligence applications15. Today, array-oriented
14BiodiversityCatalogueistheWebservicesregistryforthebiodiversitysciences.Itcanbefoundatwww.biodiversitycatalogue.org.15Foraninsightintodirectionsinthehypercubeparadigmasthebasisforbusinessintelligenceanddecision-supportapplications,seethis2008blogpost:http://blogs.forrester.com/james_kobielus/08-07-14-olaps_cube_crumbling_around_edges.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page31of86
databases like SciDB, Rasdaman, andMonetDB are increasingly used for their ability to easily handle n-dimensionalarrays.
AhypercubeapproachisproposedasthemechanismforwarehousingEBVdataproducts(seesection8.3).
Thetechnicalchallengeistomanageacollectionofhypercubeswhentheyarephysicallydistributed,ownedandloadedbydifferentorganisations,withroll-upacrosshypercubesbeingnecessarytoachievearegionalorglobalviewofthedata.Cataloguingthesehypercubesisrequiredfordiscoveryandeffctiveuse..
10.8 StandardisedAPIsandmeta-descriptionofthemAs‘softwareeatstheworld’ [Andreessen2011]andsoftwaredevelopment itselfcomestorelymoreandmoreoncomponent-basedapproaches,ApplicationProgrammingInterfaces(API)becomebyfarthemostimportant mechanism for achieving interoperability between services. RESTful API Modelling Language(RAML)16isastandardizedmechanismfordesigninganddescribingAPIs.
Well-knownAPIswithstandardizedmeta-descriptionsmakeiteasiertoaccessdataresourcesandassociatedservices.WhendetailsoftheseAPIs(services)areavailableinawell-knowncataloguesuchastheBiodiversityCatalogue14,itiseasiertolinkacrossinfrastructures(11.4below).
Data publishers, tool providers, service providers and infrastructure operators should all encourage andembraceaculturalshifttowardwelldesigned,welldescribed,standardisedAPIsthatareeasytofind,touseandmanage.
10.9 Provenanceandtraceability,reproducibilityandrepeatabilityNeedsrelatedtoprovenanceandtraceabilityrecurofteninconversationsaboutusingandprocessingdata.Tracing data back to its original source is frequently stated as necessary for evaluating results andconclusions.“Whatdatawasthisbasedon?”isthequestionmostoftenasked.Trackingofdatacitationsandre-useofdataisnecessaryforthefundingofadataobservation/collectionactivity.Confirmationthatdatalicensingconditionshavenotbeenbreachedisrequired.Informationaboutthetoolsandcapabilitiesthatwereusedtodoananalysisisneededsothattheanalysiscanbereproduced.
Towhatextentisitnecessarytobeabletoreproducetheentiresequenceofdatacollection,processingandevents that leads to a particular EBV data product? If the process itself is replicated elsewhere (e.g., inmultiple infrastructures)will it reproduce the sameoutputs?How important is reproducibility andwhy?Whatshouldhappenintheeventofadisaster(fireinadatastoragefacility,forexample)?
Reproducibilityisnotthesameasrepeatability17.Repeatable‘computeractionable’workflows/processes/proceduresareessentialtoensurethattheprocesscanbecarriedoutoverandoveragaini.e.,whenevernewdataproductisneed.
Toolsfortraceabilityhavetobebuiltusingstandardmechanisms(suchastheW3CPROVfamily)forcollectingprovenance information, which itself has to be enabled throughout the tool-chain. To capture therelationships that constitute provenance (e.g., composition, transformation, filtering, etc.) persistent
16http://www.raml.org/17 Reproducibility is the ability to duplicate or replicate a sequence of actions, elsewhere and/or by another person workingindependently.Repeatabilityistheabilitytoperformthesametask/workflow,bythesamepersononthesameitem,underthesameorverysimilarconditions(e.g.,withslightlydifferentinputparameters)andwithinareasonablyshortperiodoftimeafterthefirsttimeitwasdone.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page32of86
identifiers (section 10.12) are key for unambiguously identifying the entities (e.g., datasets, algorithms,intermediateproducts,etc.)linkedbythoserelationships.
10.10 DistributedversuscentralisedprocessingDistributedversuscentralisedresponsibilityintheproceduresforproducingEBVsisimportant.Werefertothisasthe“warehousechoice”becauseofitssimilaritytothechoicebetweenasinglecentralwarehousetosupplyforexample,allfoodsupermarkets,orasetofdistributedautonomouswarehouses,eachservingasmallernumberofsupermarkets,orpossiblyjustone.
Distributedorcentralisedprocessingisnotacomputerorgroupofcomputersinasingleplaceversususingcomputersdistributedacrossmultiple locations.Distributedorcentralisedprocessing isalsonotachoicebetweenusinganorganisation’sowncomputers,oroutsourcingtoa3rdparty(suchasacloudcomputingprovider).Thosechoicesareseparatefromthewarehousechoice.Seesection10.13onICTscenarios.
In the warehouse choice we are concerned with allocation of responsibility for performing the task ofproducingEBVdataproducts,considering:
• Locationandownershipofthedataneededforthetask;
• GeographicalareatowhichtherequiredEBVdataproductrelates;• TheneedtoensureconsistencyofcalculationacrossallproducersofEBVdataproducts;
• Thelevelofresourcingneededat,forexamplenationallevelversustheneedtosupportmajorresourcesmorecentrallytocarryoutthetask;
• Theriskoferrorsarising,eitherthroughlimitedunderstandingofthedatainacentralisedoperation,orthroughinconsistentapplicationoftheproceduresinadistributedoperation.
Thewarehouseissueisasocio-political-technicalissue,tobeconsideredinrelationtoorganisation,financingandgovernanceoftheEBVprocedures,takingintoaccountnationalandglobalreportingrequirementsandlikelyuseofEBVdata.
Onescenario,illustratedinTable5foreseesprimarydataobservationsandinitialprocessingorganisedalongpoliticaladministrativedivisionsatanational(orlowerunit)level,withaggregationandroll-uptothelevelof regions and continents18. Within such a two-level scheme, the lower levels can organise their owncentralisedordistributedcomputingwhiletheupperaggregationlevelcanalsoorganiseitselfinacentralisedordistributedmanner.Theupperlevelcanharvestautomaticallyfromthelowerlevel.
Purposeandlevelofdataprocessing Unitresponsibilityfordatamanagementandprocessing
Computingandstorageprovision
Primarydataobservations,initialprocessingandpublishing
Nationalorlowerlevelsofadministration
Canbecentralisedatordistributedacrosstheunitlevel
Aggregation(roll-up),interpolation,extrapolation-forwideruse
Supra-national/regional Centralisedforstorageandpublication;Singleglobalsource,perhapsreplicatedatregionallevelforsecurity.
Processingperroll-upregionAutomatedharvestingfromlowerlevel.Withinregionprocessingcanbecentralisedordistributed.
Table5:Examplestructuringforhierarchicalprocessing
18IUCNProtectedAreas,forexample.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page33of86
Section11.6suggestsadoptingthenomenclatureofNASAdataprocessinglevelsasthemeanstodefineEBVdata products more precisely. Using such an approach could also help to determine issues related tocentralisedordistributedprocessingandassociatedresponsibilities.
10.11 Standardization
10.11.1 ChoiceofstandardsStandards applicable to EBV data, data products, formats, exchange and services fall into three maincategories.
Thefirstcategoryisconcernedwithselectingfromthosemoregeneral-purposestandards,recommendationsand specifications already published by widely recognised organisations such as ISO/IEC, W3C, OpenGeospatialConsortium,OASIS,RDA,etc.SuchstandardsareindependentoftheEBV-specificnatureofthebusiness.Theymaycoverforexample,standardmeansofpersistentlyidentifyingdata,standardmeansoftransferrringitovernetworksandstandardmeansofrecordingandtrackingitsprovenance.
Thesecondcategory isconcernedwithselectingfromdomain-specificstandardssuchasthoseundertheresponsibilityofTDWG(seebelow)forrepresentingandexchangingvarioustypesofbiodiversitydata.Again,thesearelikelytobepre-existingspecificationssuchasDarwinCoreandTAPIR.
Finally,thereisacategoryofspecificationsspecifictothenatureoftheEBVbusinessthatdonotyetexist.Themembersofthiscategoryhavetobeidentified,defined,writtenandagreed(10.11.4below).
10.11.2 IssuesofstandardizationSpecificationofdataformatshastobebasedonanunderstandingofhowthedataismostlikelytobeusedacrossdifferentcategoriesofend-users.Similarly,specificationsofdatatransferprotocolswillbebasedonhow,whatandwhendataistransferred.Thewaythatdataistobestoredandmadeavailable(published)andhowthatisorganised(8.3and10.6above)hasalsotobetakenintoaccount.
Theissuestartswiththerelevantprimarybiodiversitydatathatisalreadybeingcollectedandpublished,andendswiththeEBVdataproductsforresearchersandpolicyanddecision-makers.
Ithasbeenmentionedthatsafedatatransferprotocolsareneeded.Whatdoes‘safe’mean?Safefromwhat?Thishastobefurtherstudied.
10.11.3 CompliancewithrequirementsoftheINSPIREDirectiveInEurope,EBVdataarepartofthe“InfrastructureforSpatialInformationintheEuropeanCommunity”19andhavetobediscoverablethroughthe INSPIREGeoportal20.Thus,compliancewiththerequirementsoftheINSPIREDirective[EUParliament2007]ismandatoryfordataproductsandservicesrelatedtoEBVsintheEU.Theserequirements,forwhichmanytechnicalspecificationshavealreadybeenpublishedincludespecificprovisions relating to metadata; harmonisation of spatial data; unique identification of spatial objects;services fordiscovery, viewing,downloadingand transformingdata; interoperabilityof servicesanddatasharing,etc.
Thereareunlikely tobeanysignificant incompatibilitiesarisingwhenaspectsof thespecificationsput inplacetomeetEuropeanmandatoryrequirementsareappliedmorewidelyina‘bestpractice’senseforall
19http://inspire.ec.europa.eu/20http://inspire-geoportal.ec.europa.eu/
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page34of86
EBVdata,productsandservices.Bythiswemean,forexamplethatmetadatadefinedforspecificpurposesinEuropeislikelytoalsobeusefulelsewhere.
10.11.4 ResponsibilityforstandardizationWhichbody(s)shouldberesponsiblefornewEBVdatastandards?
GEOBON21theBiodiversityObservationNetworkofGEO,isapartofGEO,TheGrouponEarthObservations.GEO BON is building the pathways to link biodiversity data and metadata to GEOSS, the Global EarthObservationSystemofSystems,whichwillprovidedecision-supporttoolstoawidevarietyofusers.GEOBONisaresponsetothelackoforganisedorcoherentinfrastructuretocollectthebiodiversityinformationnecessarytomonitorprogresstowardsobjectivesoftheConventiononBiologicalDiversity(CBD)StrategicPlan.TheGEOBONSecretariatcoordinatesworkonEssentialBiodiversityVariablesandisthemostlikelybodytocoordinateactivitiesrelatedtostandardization.However,itisnotnecessarilytheappropriatebodytocarryouttheactualworkofdraftingandagreeingthenecessarystandards;anactivitythatrequiresahighdegreeoftechnicalskill.Suchataskcouldbedelegated,forexampletoGEOBONWorkingGroup8ondataintegrationandinteroperability,informaticsandportals22.
‘Biodiversity Information Standards (TDWG)’23 focuses on developing and promoting standards forrecordingandexchangingbiological/biodiversitydataaboutorganisms.TDWGisresponsible,forexamplefortheDarwinCore(DwC)standardandhasafutureroletoplayinEBVrelatedstandards.
TheResearchDataAlliance (RDA)24 isa community-drivenorganization thathas thegoalofbuilding thesocialandtechnicalinfrastructuretoenableopensharingofdata.Itwasinitiatedin2013bytheEuropeanCommission, the United States Government's National Science Foundation and National Institute ofStandardsandTechnology,andtheAustralianGovernment’sDepartmentof Innovation.Working inclosecooperationwiththeInternationalCouncilforScience(ICSU)WorldDataSystem(WDS)25anditsCommitteeonDataforScienceandTechnology(CODATA)26,RDAissupportedbymembersfrommorethan110countriesandhasawideremitcoveringallaspectsofdatastandardization.ManyoftherecommendationsemergingfromRDAareapplicabletoenvironmentalsciencesandhencetoEBVs.
10.12 PersistentidentifiersPersistent identifiers (PID) are the means by which long-lasting (i.e., semi-permanent) references toimportantartefactssuchasdatasetsarecreated.GiventhatthereareseveraldifferentPIDmechanismsinpresent-dayuse,thereshouldideallybeaglobalagreementonasinglemechanismforidentifyingallEBVdataproducts.ThemostubiquitousPIDistheDigitalObjectIdentifier(DOI).DOIshavebeenwidelyusedfortraditionalpublicationsandmorerecently,datasets.AdoptingtheDOImechanismallowsthird-partyservicesfor PID assignment, registration and resolution, such as those provided by DataCite and Crossref to beleveraged.
Global agreement will also be needed on the detailed procedures for using a PID mechanism to issue,maintain and track globally unique PIDs. Such an agreement has to include specification of the detailedmetadatacontenttobestoredalongsideaPIDinaPIDregistry,sothatunambiguouscitation,discoveryandre-useofEBVdataproductsispossible.
21http://www.geobon.org/22http://geobon.org/working-groups/working-group-8-data-integration-and-inter-operability-informatics-and-portals/23http://www.tdwg.org/24https://www.rd-alliance.org/25https://www.icsu-wds.org/26http://www.codata.org/
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page35of86
ProvisionhastobemadetoaccommodatetheuseofPIDmechanismsthatareusedforpersistentlyanduniquely identifying the source data that contributes to making EBV data products and those used foridentifying software and other resources that form part of the production process. Workflows have toaccommodatetheheterogeneityofidentificationmechanismsinusearoundthesourcesoftheirinputdata.Theseaspectsareessentialtomaintaindataintegrity(section10.2.2),fortracingprovenance(section10.9)andforpublication,citationandtrackingdatare-use.
Unambiguousreferencingofsubsetsofdata-so-called‘slice,dice,roll-upanddrill-down’subsetsmayalsobeneeded.
ManyfurtherconsiderationsrelatingtoPIDsaregivenin[Atkinson2016].
10.13 DifferentICTscenariosAssuggestedinsection10.10,multipleICTscenariosarepossible,balancingdifferentcomputeandstorageoptionsagainstoneanotherindifferentways.DifferentorganisationsplayingintheEBVdataproductsgamearelikelytoadoptdifferentICTsolutions.ToensureconsistencyofproductionofEBVdataproducts,andtofosterinteroperabilityofbothdataproductsandservicesforproducingthem,thesedifferenceshavetobeaccommodatedbyprovidingacommon‘EBVplatformarchitecture’thatcanbesupportedbythedifferentorganisationsinvolved.Bycommonplatformarchitecturewemeanasetofcomponents(microservices27)with lowvarietyandhighreusabilitypotential thatcanbeadoptedbyallplayers28anddeployedtotheirpreferredICTsolutions.
10.14 SoftwareenforcementofdatalicensingtermsandconditionsWorkflow-orientedprocessingchainshavetoenforcedatalicensingtermsandconditionsfrombeginningtoend.Evenusingopendatacarrieswithitamoralobligationtoacknowledgethesourceofthedataandhowithasbeenusedtocreatethefinaldataproducts.Inanautomatedprocess,thismeanscarryingtherelevantinformation(orapointertoit)foranyparticulardatarecordthroughtheentirechainofprocessingsothatitcanbepresentedaspartofthefinalpresentationoftheEBVdataproduct.
AnEBVdataproductcouldbepreparedonthebasisofsomeconfidentialorsensitivedata(notcommerciallylicensed,butprotected).TheEBVdataproductitselfmightbepublicinformationbutnotthesourcedata.
11 TechnicalrisksforEBVimplementationThesub-sectionsbelowareasetofrisksidentifiedsofar.
11.1 Readiness/capabilityofinfrastructurestosupportEBVsproductionAsnotedinsection8.1,itisnotcleareventhatRIsaretheresponsibleinfrastructuresforEBVproduction.AfewadvancedRIsmaybecapableofsupportingresearchintoEBVsandassociatedproceduresthemselves,butarenotabletosupportproductionatscale.RIstodayarelikelytobeprovidingsupportattheproof-of-principlestage(see12.1andFigure6).Wemayseefirstforaysintoexperimentalintegrationinthecomingyearortwo.
27Microservices are separately deployed units, each representing a service component. Eachmicroservice can be administeredseparately,whilstbeingusedinconjunctionwithanother.Theconceptofmicroservicesisbecomingestablishedasanewpatternofapplication deployment in response to increasingly continuous ‘DevOps’ styles of application delivery, especially in relation to‘SoftwareasaService’(SaaS)styleofapplicationdeliveryonCloudinfrastructures.ItisanoverallsimplificationofthedistributedService-OrientedArchitecture(SOA)pattern.MicroservicesworkwellwithworkflowmanagementsystemssuchasApacheTaverna.28Seealsoparagraphsonpage36relatingto‘integratedplatformforproduction’.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page36of86
ItiscriticaltogaugethetimescalewithinwhichresponsibleinfrastructurescanreachtherequiredreadinessversusthatofthepoliticaldemandtodeliverEBVdataproductsreliablyandasneeded.
11.2 OpenaccessforservicedependenciesInadistributed informaticsapproach,workflows forproducingEBVdataproductsmight relyon servicesoperatedandprovidedbythirdparties.GoogleMapsisonesuchexample.AnotherisGoogleEarthEngine,where the use of a powerful service can require the workflow operator to relinquish someownership/copyrightovertheresultingproduct.
Some organisations and/or countries may not permit access to some Internet based services that areconsideredessentialforcorrectoperationofaworkfloworprocedure.Therefore,whenestablishingcriticalandperhaps long-termservicedependencies, it isnecessarytoconsiderthefactorsthatmayrenderthatserviceinaccessibleorunusableforsomeoperatorsofaworkfloworprocedure.
11.3 MultilingualismAllaspectsoftheEBVproductsandapplication(s)havetobetranslatedintomultiplelanguages.
MultiplesourcesofdatafortheEBVprocessmayimplymultiplelanguagemetadata.Acomputer-assistedthesauruscapabilitycouldbenecessarytoassistautomateddataintegration.
11.4 LinkingacrossinfrastructuresAchieving interoperability across infrastructures is the mechanism by which an integrated means ofproducing and delivering harmonised EBV data products at a variety of resolutions and for differentgeographicareascanbeachieved.When facedwith theprobability thatdifferent ICTapproachescanbeadoptedby eachof thedifferent infrastructures,wehave tobe clear abouthow interoperability canbeachieved.ItisthemainaimoftheGLOBIS-Bprojecttodeterminehowtoachieveinteroperability,usingEBVsasthemotivatingusecase.
TheprecedingCReATIVE-Bprojectalreadyconsideredthedifferentkindsoftechnicalinteroperability,andthelevelatwhichinteroperabilityhastooptimallytakeplace[Hardisty2014].
Achieving interoperability means supporting transactions at the service level across infrastructures -supportingserviceleveltransactionsbetweenusersand/orservices.Usersorservicesofoneinfrastructureshouldbeabletoaccessandutiliseresourcesandservicesofanotherinfrastructuretoachievetheirgoal.
To achieve such interoperability requires coordinated implementation of selected standards andspecifications,usingprecisedefinitions(“profiles”)ofhoweachstandardorspecificationcanbeadoptedandimplementedtosolvethespecifictransactionneed.StandardizedAPIswithstandardizedmeta-descriptions,well-knownbybeingregisteredinacataloguesuchastheBiodiversityCatalogue29isoneexample.Standarddatabrokeringservicesareanother.
Amechanism(i.e.organisation,people)isneededwherebyeachinfrastructurecanberepresentedandcancontributetoandmakeagreementsonstandards,specificationsandtheirimplementation.
11.5 ThediversityoftoolsandtechniquesManypossibleapproachestoproducingEBVshavebeendiscussed.Whathasbeensuggestedisbasedonworkshopparticipants’expertise,theirexperiencewithtoolsandpersonalpreferences.
29BiodiversityCatalogueistheWebservicesregistryforthebiodiversitysciences.Itcanbefoundatwww.biodiversitycatalogue.org.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page37of86
Thebesttoolsforthejobhavetobeselected,basedonanobjectiveevaluationofhowparticulartoolsmeetthespecificneedsoftheprocedures.Insomecases,severaltoolscanfulfiltherequirementsand,solongastheoutputproducedmeetstheneedsoftheprocess,thereisnoreasontoexcludeoneinfavourofanother.Toolswillevolvewithexperienceandresearch.
Thekeyconsiderationsrisksarebasedonthereliability,easeofmaintenanceanduseofappropriatetools.Focussingon a small numberof tools aroundwhich the community congregates, andputting effort intoimprovingthosetoolsseemsaneffectivestrategy.
11.6 PrecisenatureofEBVdataproductsAs noted at the end of section 4.2, it’s clear from the workshop discussions that there is a lack ofunderstanding and therefore consensus on the precise nature of EBV data products.We know (see, forexamplethetoppartofFigure2)thatEBVdataproductsstartfromprimaryobservationdatasets,movingthrough a process of harmonisation eventually to produce interpolated/extrapolated datasets. This ishoweveravaguedefinitionthathastobesubstantiallyimprovedviaamoreformalspecificationofthedetailsforeachEBVdataproduct.
OnepossibilitycouldbetoadoptthenomenclatureofNASAdataprocessinglevels(0–4)30,attemptingtodefine:a)thecharacteristicsoftheEBVdataateachlevel;andb)theprocessingneededtomovedatafromoneleveltothenext.Thiswouldhelptodeterminewhathastobeprocessed(andwhere–seealsosection10.10)aswellaswhathastobestored/retained,andpublishedasusabledataproducts.
12 TranslationalrisksofEBVsmethodsresearch
12.1 TheinnovationchainThestepsfromwherewearetodaytowardsquality-controlledregularEBVproductionarestagesofwhatisknowninthecommercial/industrialworldasan‘innovationchain’[Kline1986];extendingfromfundamentallaboratory research, to appliedmethods, aproductor servicedevelopmentphase, field testing, possibleregulatoryapproval,in-useevaluationsandultimatelytowidespreadadoptionanduse(Figure6).
30https://en.wikipedia.org/wiki/Remote_sensing#Data_processing_levels
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page38of86
Figure6:Theinnovationchain-transitionfromresearchtoproductisationtomassuseAdaptedfrom(RobinsonandPropp2008)
Reading Figure 6 from left to right, with terms in italics being first reference to elements in the figure:ResearchersinprojectscarryoutresearchonindividualEBVs,andthedataandmethodsneededtosupportthem, leadingtoscientificproof-of-principleof thespecificEBV(EBVxorEBVy inthefigure).Computingtechnologyproof-of-principlecanalsoberesearchedatthistimetosolvespecificproblems,dealingwithdataquality issues, or investigating how best to store massive amounts of data, for example. A particulartechnologicalchallengeatthisstageistoselecttherighttools,dataandinformaticstechniquestosupportafirstroundofexperimentalintegration.Thatistosay,tofindinformaticsapproachesthatcanbeappliedincommontoimplementthescientificmethodsforproductionacrossmultipleEBVs.Adopting,forexample,aworkflowapproachtowardsthegenerationofeachEBVdataproductwouldshowsomelevelofintegrationinsofarassomestepsintheworkflowarelikelytobethesameorquitesimilarfordifferentEBVs.However,becauseoftheexperimentalnatureoftheworkatthisstage,thestepsmaynotworkwellwithoutsoftwareadaptation or manual intervention during execution of the workflows. Such a prototype is suitable foroperationonlybypersonswith sufficient expertise and knowledge to check that everythingproceeds asexpectedthroughthevarioussteps.Becauseit’saprototype,someelementsmaybemissing.Theprototypemaynotbeeasilyadaptableforallscenarios,combinationsofparameters,etc.norwillallerrorcaseswillbeproperlycateredfor.Prototypeinfrastructureatthisstageisoftendifficulttomanage,maintainandsupport.
Transitionfromlaboratorytomarketsysteminvolvesinformaticsdevelopmentanddeploymentandbusinessdevelopment i.e., service model and business model definition, planning and deployment, as well asunderstandingofhowtheproductswillactuallybeused.Itiswhereresearchends,andwheredevelopmentinto usable, scalable, flexible data products and their associated robust and reliable ‘commercial-grade’productionandsupportprocessesbegins.Thisisthestagethatconvertstheexperimentalintegrationintoan integratedplatform forproduction.By ‘platform’wemeana setof stable, reusable core components[Baldwin2009]thatsupporttheproductionprocessesforEBVdataproducts.By‘integrated’wemeanunified
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page39of86
asawhole,withthecomponentelementscombinedharmoniously31.ThesecorecomponentsmayincludebothICTelementsandbusinesselements.Togethertheplatformcomponentsoffertheessentialfunctionsfortheintendedbusinessneed.
An integratedplatformforproductionandmanagementofEBVdataproductsmay, forexampleoffergeneralisedcomponentsforeachofthegeneralisedworkflowstepsoutlinedinsection9.2abovethatcanbe used for all EBVs. These components can work in general-purpose utility (cloud) computingenvironments, such as those offered by Amazon, Microsoft, Google, etc., making it possible for anyBiodiversityObservationNetworktoproduceEBVdataproductsfromthedatacollectedbythenetwork.These workflow components will interact and work seamlessly with a large-scale EBV data repositorydatabase,qualityassurance,publishingandarchivalserviceofferedbytheintegratedplatformthatpermitsdifferentBONsandotheractorstocontributeto,manageandpublishqualityassuredEBVdataproducts.Theintegratedplatform,withadditionalcustomcomponentsontopmakesitpossibletoderiveandproducedifferentkindsofEBVdataproducttoserveavarietyofdifferentapplicationpurposes.ApplicationspecificEBV data products are likely to be precise and tailored towards one particular application or type ofapplication.Applicationgeneric EBVdataproducts are less specialised anduseful for abroader rangeofapplications. General-purpose application independent EBV data products are neutral as regards anyparticularintendeduseandhaveverybroadapplication.ExamplesofeacharegiveninTable6below.
ExampleEBVdataproducts Applicationclassofdataproduct TypicalapplicationsorusesPhytoplanktonprimaryproductivity
Applicationindependent Monitoringtrendsincarbonfixation,foodwebmodelling,valuationofecosystemservicesforfisheries
Sizestructureofphytoplanktonassemblage
Applicationgeneric Earlywarningofnutrientenrichment,trendsinmixingandpotentialtoxicevents
Presence/abundanceofparticularzooxanthellaeclades
Applicationspecific Assessmentofcoralreefresistancetotemperaturechange
Wetlandextent/recurrence Applicationindependent Estimationofhabitatforamphibian&otherspecies,climatecirculationmodels,flood/droughtresilienceandfoodsecurityforhumanpopulations.
Seasonalityofwetland Applicationgeneric Assessingeffectivenessofprotectedareasformigratingmammalsandbirds.
Wetlandplantspeciescomposition Applicationspecific Identifyingsuccessionanddrying,invasivespeciesmonitoring.
Table6:ExamplesofdifferenttypesofEBVdataproduct
ThereistheneedtoidentifyandunderstandwhattheEBVdataproductsandrelatedoutputsarelikelytobeusedforinmassuser(varioussocietal)contexts.UsescanincludenewscientificresearchbasedonEBVdata,policyanddecision-making,andappliedconservation(forexample).Therearealsomultiplepathsbywhichthe integratedplatformcanleadtodifferentkindsofEBVdataproduct.Assuggested insection7above,responsibilityformaintainingandoperatingtheintegratedplatformandforproducingtheEBVdataproductsmostlikelylieswiththeindividualResearchInfrastructuresandBiodiversityMonitoringNetworks(bubblelabelled ‘RIs and BONs’ in the figure). Actors such as data publishers also have roles to play. Finally, anunderstandingofhowEBVdataproductsgetadoptedintoeverydayworkingpracticesofthemassusersisrequired.
31Awell-knownexampleofanintegratedplatformistheMicrosoftWindowsoperatingsystem,whichcontainsadefinedsetofcorevaluablecomponents(suchas,forexampletheabilitytodisplaymultiplewindowsonascreen,theabilitytointeractwiththeuserthroughdialogboxes,andfilehandling)thatcanbeexploitedinabroadrangeofdifferentapplications(email,wordprocessing,etc.).
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page40of86
AsillustratedbythelargegreyarrowinFigure6,demandfromandneedsofmassuserscreatea‘pull’fromthemarket–theright-handsideoftheinnovationchain–thatcanexpediteresolutionofthemanyotherchallenges earlier in the chain. Those challenge areas (‘arenas of development’ is the term coined byRobinson)showninthebottompartofthefigure(andalreadyreferredintheexplanationsabove)becomethefocifortargetedinterventionsandactivitiesthat‘oilthechain’ofinnovation.
Roadmaps help us to understand andplanwhat needs to be done. The knowledgeneeded to optimallysupport progress from proof-of-principle through to reliable, repeatable, sustained delivery of EBV dataproductstothemassuserscanbedeterminedandthenapplied.Gapsidentifiedinthenecessaryknowledgecanbe turned into recommendations for intervention. Such interventionsmay include, for examplenewresearch, capacity and capability building, and policy actions ormarket stimulations32 to create amorefavourableenvironmentforthetechnologytoemergeinto.
12.2 ParticulargapsintranslationPreviously[Hardisty2011,Elwyn2012]demonstratedparticulargapsinhownewtechnologiestranslatefromtheresearchlaboratoryintohealthcaresettings,andhownewtechnologiesbecomeembeddedasaroutinepartofeverydayworkingpractice[May2009,May2013].
Research and development, policy-making, decision-making, service planning, delivery and action are ascomplexinenvironmentalandbiodiversitymanagementsettingsasinhealthcaresettings.Thedesignandexploitationoftechnology,particularlysoftwaretechnologyisbeneficialtobotharenas.Inbothcasesthereisamixoftechnicaldesignissues,dataavailabilityandqualityissues,legalandregulatoryissues,economicissues;aswellasthesociologicalandpsychologicalfactorsatplay.Thereisnoreasontosupposethatthetwogapsintranslationuncoveredby[Hardisty2011,Elwyn2012]inoneparticularexampleofhealthcaretechnologytranslationdonotalsoexistelsewhere.
The first gap in translation occurs once a scientific proof-of-principle has been established. The gap isrepresentedbytheneedtoactively involvestakeholderactors(intheEBVcasethis isresearchscientists,policy-makers,decision-makers,conservationmanagers,etc.)inthedesign,productionandexploratoryuseofprototypedataproducts.Wecallthisthe‘settingoffirstuse’.Thissettingiswheretheproof-of-principleisturnedintosomethingthatcanbeproperlyused(trialled)forthefirsttime.Suchtrialsareoftencarefullycontrolledandlimitedinscope,e.g.,intermsperhapsofEBVtype,chosenspecies,selecteddata,area,timeperiod, resolution, etc. Theactors (users) in that settingof first useare known, knowledgeable andwellsupported,limitingthethingsthatcangowrong.Nevertheless,thereissomerisk,arisingmainlyfromlackofiterativecollaborationtoactonearlyfeedbackandadjusttobuildwhatisreallyneeded.
Thesecondgap in translationariseswhenmovingbeyond firstuse toextend, scaleandembed thedataproductsmorewidely.Forexample,embeddinguseofthedataproducts intothe livebusinessprocessesoperatedbythevariousactors,orintobusinessservicesofferedbythoseactorsasintermediariestotheirownendusers,forexampleenvironmentagenciesorconservationbodies.Inthiscase,thespecificendusersarelikelyunknownandmaynotbesupported.Theriskatthisstageisfailuretoproperlyunderstandtheimplementationworkthatactorscarryoutinordertoadoptnewdataproductsintotheirroutinebusinesspractices.Thisworkinvolveselementsofeducation,buildingcommunitiesofpractice,activitytomakethedataproductoperationalintheirworkingpracticesandappraisaloftheeffectandimpactofthenewdata
32AssuggestedduringWorkshop1inLeipzig,onesuch‘marketstimulation’,tobetakenattheleveloftheG7+20couldbetoestablishnational-level biodiversitybureauxor ecosystem servicesbureauxas apublic good; essential assets for collecting andactingonbiodiversitydata.Comparethese,forexamplewiththeroleofweatherbureauxinmodernsociety.AnotherpossibilityistoleverageonassessmentsfromIPBES.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page41of86
productonworkoutcomes[May2009].Buildingconfidenceandcreatingtrustarealargecomponentofsuchwork.
Thesamekindsofactorsare implicated inbothactivitiesbuttheyareoften insufficiently involvedattheearlier stages, leading to lack of understanding and hence lack of confidence and trust. Therefore,understandinghowthedataproductsarelikelytobeadoptedandembeddedintolivebusinessprocesses,service delivery and routineworking practices is crucial knowledge for the earlier product developmentphase.
The gaps we have explained and hence the risks presented are intensified when there is insufficientunderstandingof thetranslationalprocessby those involved inandaffectedby it.Theprocesshas tobeorchestratedasaclearprogrammeofworkwithaccompanyinginterventions.Inthissenseitcomesbacktoproper andwidespread understanding of the innovation chain explained above and the challenge areaswhereattentionhastobefocussed.
13 ConclusionsandrecommendationsResearch Infrastructures (RI) have to respond, either individually or as a coordinated community (e.g.,throughtheGLOBIS-Bproject)to[Pereira2013]onhowRIscansupport interoperableworkflowsforEBVmeasurement.Theyhavetosay,forexamplehowtheycan:
• Supportinteroperableworkflowsformeasuringessentialbiodiversityacrossthetreeoflife;
• Supportthetraceabilityofdatathroughtheworkflowsi.e.,provenance;• Maintainmetadata;and,
• Demonstrateamulti-lateralcooperationamongexistingRIs.
ItisclearthatBiodiversityObservationNetworks(BON)alsohavetobeincludedinmakingsucharesponse.
Wearecurrentlynotadvancedenoughtoprovidedefinitiveanswers.Mappingtheissues(section10)andrisks (section11)againstthevariousstepsofthegeneralisedworkflowforEBVs(section9.2)canhelptoidentify the likely areas of difficulty and prioritise the issues and risks for further attention. Key factorsaffectingdecisionsrelyonresponsibilityandgovernanceforproducingEBVs(section8.1),warehousechoice(section10.10)andEBVdataproductpublicationandstorage(sections8.3and10.6).
Decisions have to be made with political and social elements in mind and from a better developedunderstandingofthetranslationalapproachtofollow(section12).Likeclimatevariables,aperiodiccyclicproductionprocessfordeliveringEBVdataproductsismostlikelywhatisneeded(section10.1).Thishastobeverifiedamongthestakeholders.
Weneedtodevelopthetechnicalstrategytowardsthefuturewithfinerdetailsonaroadmapforthenext3–5years.Tocreatethatroadmapisthemainrecommendationarisingfromthepresentreport.Therearetwomainthreadstosucharoadmap,reflectingtheinformaticsandbusinessdevelopmentsneededtobridgethetwokeygapsoftheinnovationchain(section12.2).
Firstly, theexperimental integrationhastoevolvefrom,developfromand illustratewhatcanbedone ineach infrastructure from a set of scientific and technical case studies. This process will identify keybottlenecks(scientific,technical,legal),addressingmanyofthespecifictechnicalissueshighlighted.
Second,agenericsolutionandrecommendations forsolvingthesebottleneckshastoemerge, leadingtospecificationanddeploymentoftheintegratedplatformforproduction,necessarytomakethetransitiontomassproductionofEBVdataproducts.
Pre-requisitestotheroadmapare:
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page42of86
• HavingabetterunderstandingofthelikelyoverallICTdeploymentscenario;• KnowinghowtoorganisetheworkflowtomeetEBVdataproductrequirements33;
• KnowingwhichorganisationsandinfrastructuresareresponsibleforeachaspectofEBVdataproductsmassproduction;and
• Makingthewarehousechoice.
In parallel we should aim to gainmore practical experience by enabling and supporting RIs and datapublisherstorecognisethestepsandtheissuesinvolvedinsupportingaparticularEBVorsetofEBVsandtheircurrentgaps.Thiswillenablesomestakeholdersto identifytheirsupport forsomeclassesofEBVs.Identificationofabilitiesandgapswilllaygroundworkforstandardapproaches(e.g.,todataqualitytestsandassertions)thatinfrastructurescanconvergeto.Itwillsupportthemfromthebottomupastheyconsidertheirvaluepropositionstrategies(section9.3.2).
14 References[Andreessen2011] Andreessen,M.(2011).WhysoftwareiseatingtheWorld.WallStreetJournal20th
August2011.
[Atkinson2016] Atkinson,M., Hardisty, A., Filgueira, R., Alexandru, C., Vermeulen, A., Jeffery, K.,Loubrieu, T., Candela, L.,Magagna, B.,Martin, P., Chen, Y. and Hellström,M., Aconsistent characterisation of existing and planned Research Infrastructures,Technical report D5.1, ENVRIplus project, 203 pages, May 2016.http://www.envriplus.eu/wp-content/uploads/2016/06/A-consistent-characterisation-of-RIs.pdf.
[Baldwin2011] Baldwin,CY.,andWoodard,CJ.(2011).Thearchitectureofplatforms:Aunifiedview.In: Gawer, A, (ed.) (2011). Platforms, markets and innovation. Edward ElgarPublishing.ISBN9781848440708.
[Belbin2013] Belbin,L.,Daly,J.,Hirsch,T.,Hobern,D.andLaSalle,J.(2013).Aspecialist’sauditofaggregatedoccurrencerecords:An‘aggregators’response.ZooKeys305:67–76.doi:10.3897/zookeys.305.5438.
[Bojinski2014] Bojinski, S., Verstraete,M., Peterson, T. C., Richter, C., Simmons,A.,& Zemp,M.(2014). The conceptof essential climate variables in support of climate research,applications, and policy. Bulletin of the American Meteorological Society, 95(9),1431-1443.
[Chaudhuri1997] Chaudhuri,S.,andUmeshwar,D.(1997).AnoverviewofdatawarehousingandOLAPtechnology.ACMSigmodrecord26.1(1997):65-74.doi:10.1145/248603.248616
[Elwyn2012] Elwyn,G.,Hardisty,A.,Peirce,S.,May,C.,Evans,R.,Robinson,D.,Bolton,C.,Yousef,Z.,Conley,E.,Rana,O.,Gray,A.,andPreece,A.(2012).Detectingdeteriorationinpatients with chronic disease using telemonitoring: navigating the 'trough ofdisillusionment'.JournalofEvaluationinClinicalPracticeVolume18,Issue4,August2012,pages896–903.doi:10.1111/j.1365-2753.2011.01701.x
[EUParliament2007] EUParliament.(2007).Directive2007/2/ECoftheEuropeanParliamentandoftheCouncilof14March2007establishingan Infrastructure forSpatial Information in
33Currentlywearemainlydealingwithlegacydatacollectedforotherpurposes.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page43of86
theEuropeanCommunity(INSPIRE).OfficialJournaloftheEuropeanUnion,vol.50,no.L108,April2007.http://data.europa.eu/eli/dir/2007/2/oj.
[García2014] GarcíaEA,BellisariL,DeLeoF,HardistyA,KeuchkerianS,KonijnJ,etal.(2014)FlockTogether with CReATIVE-B: A Roadmap of Global Research Data InfrastructuresSupportingBiodiversityandEcosystemScience.http://orca.cf.ac.uk/88151/.
[GBIF2016] Licensing milestone for data access in GBIF.org. 18th August 2016.http://www.gbif.org/newsroom/news/data-licensing-milestone. Accessed: 7thSeptember2016.
[Hardisty2014] Hardisty, A., andManset, D. (2014). CReATIVE-B Deliverable D3.2: Guidelines forInteroperabilityforBiodiversityandEcosystemResearchInfrastructures.[TechnicalReport]. Cardiff: Cardiff University, School of Computer Science and Informatics.http://orca.cf.ac.uk/92562/.
[Hardisty2011] Hardisty,A.,Peirce,S.C.,Preece,A.D.,Bolton,C.E.,Conley,E.C.,Gray,W.A.,Rana,O.F.,Yousef,Z.,Elwyn,G.(2011).Bridgingtwotranslationgaps:anewinformaticsresearch agenda for telemonitoring of chronic disease. International Journal ofMedical Informatics Volume 80, Issue 10, Pages 734–744.http://dx.doi.org/10.1016/j.ijmedinf.2011.07.002.
[Hobern2013] Hobern,D.,Apostolico,A.,Arnaud,E.,Bello,J.C.,Canhos,D.,Dubois,G.,Field,D.,AlonsoGarcia,E.,Hardisty,A.,Harrison,J.andHeidorn,B.(2013)Globalbiodiversityinformaticsoutlook:deliveringbiodiversityknowledgeintheinformationage.GlobalBiodiversity Information Facility, Copenhagen. ISBN 87-92020-52-6.http://imsgbif.gbif.org/CMS_ORC/?doc_id=5353&download=1.
[Kissling2015] Kissling, W.D., Hardisty, A., García, E.A., Santamaria, M., De Leo, F., Pesole, G.,Freyhof, J., Manset, D., Wissel, S., Konijn, J. & Los, W. (2015) Towards globalinteroperability for supporting biodiversity research on essential biodiversityvariables(EBVs).Biodiversity,16,99–107.
[Kline1986] Kline,S.J.andRosenberg,N.(1986).Anoverviewofinnovation.Thepositivesumstrategy.R.LandauandN.Rosenberg.Washington,NationalAcademyPress:275-304.
[May2009] May, C. and Finch, T. (2009). Implementation, embedding, and integration: anoutlineofNormalizationProcessTheory.Sociology43(3):535-554.
[May2013] May, C. (2013) Towards a general theory of implementation. ImplementationScience8(1):18.
[Osterwalder2004] Osterwalder, A. (2004) TheBusinessModelOntology - A Proposition InADesignScienceApproach.PhDthesisUniversityofLausanne.
[Pereira2013] Pereira,H.M.,Ferrier,S.,Walters,M.,Geller,G.N., Jongman,R.H.G.,Scholes,R.J.,Bruford,M.W.,Brummitt,N.,Butchart,S.H.M.,Cardoso,A.C.,Coops,N.C.,Dulloo,E.,Faith,D.P.,Freyhof,J.,Gregory,R.D.,Heip,C.,Höft,R.,Hurtt,G.,Jetz,W.,Karp,D.S.,McGeoch, M.A., Obura, D., Onoda, Y., Pettorelli, N., Reyers, B., Sayre, R.,Scharlemann, J.P.W., Stuart, S.N., Turak, E.,Walpole,M.&Wegmann,M. (2013)EssentialBiodiversityVariables.Science,339,277-278.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page44of86
[Robinson2008] Robinson,D.K.R.andPropp,T.(2008)."Multi-pathmappingforalignmentstrategiesinemergingscienceandtechnologies."Technol.Forecast.Soc.75(4):517-538.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page45of86
Annex1:Responsestopre-workshopquestionsQ5-Q8Question5
Whatarethekeystepsofaworkflow(s)forcalculatingspeciesdistributionand/orabundanceEBVs,startingfromaccessingtherawdatatopresentingavisualresult?Whatarethecomplexitiesinvolved?Whatdatapreparationisneeded?
ANSWER:
• Manypossibleapproacheshavebeendescribedintheliterature,rangingfromdirectdetectionfromRSthroughtoSDM(seeabove).
• Mainchallengeisavoidingdevelopingcategoricaldata–sticktocontinuousEBVsANSWER:
• Rawdatashouldbeuploadedtogetherwithametadatafilementioningtheimportantcharacteristicsofthedataset(nameofthemonitoringprogram,typeofdata(occurrenceorcount),timedurationofthemonitoringperiod(start–endordatesoffirstandlastrecords),indicativelocation(e.g.nationalorregionallevel),samplingprotocols(numberofspeciesandtaxonomicgroup,samplingdesignandfrequency,observationtechniques(binocular,satellite,microscope,etc),etc.)aswellasthenameandcontactofthedataprovider(nameoftheinstituteorstaffmember).
• Thedataprovidermayalsoneedtocommittodatasharingagreements:eitherbyagreeingtothedirectaccessto/downloadoftherawdatasothattheuploadeddatacanbe(re-)usedand(re)analysedbyanyotheruser,oratleasttheuseofthedatasetandtheaccessto/downloadofthederivedEBVcalculationsfromhisdatasetbyotherusers.
• Importantly,datapreparationtofittoagivenformatwouldbeapre-requisite.Theformatofthedatasetsshouldmatchwithspecificrequirementsdetailedinsomeguidelinesdescribinge.g.theminimumnumberoffieldcategoriesandtheinformationwhicharesupposedtoberecorded(speciesname,siteID,Latitude-Longitude,quantities(filledwith0/1fordistributionandintegeraboveorequal0forabundance,“NA”fornosamplingorgapsinthesamplingdesign,etc),unitofthequantities(e.g.numberofindividuals,densities,catchperuniteffort)aswellasanomenclaturefornamingeachfieldcategoriessothateachdatasetshouldhavetheexactsamenameforeachfieldcategories.Athesaurusofspeciesnamesandthegeographicalreferencesystem(e.g.WGS84)shouldalsobeprovided.Immediatelyaftertheuploadofthedataset,anautomaticqualitycontrolcouldbeperformedtocheckwhetherthenameofthefieldcategories,theircontent,thenameofallspeciesinthedataset,thegeographicalcoordinates,etc.fitrigorouslytotheformatguidelines.Ifanyrecommendationoftheguidelinesisnotrespected,thedatasetcouldnotbeconsideredforfurtheranalysisandwouldneedtobemodifiedaccordingly.Ifthedatasetisrespectingtheguidelines,itcouldbeallowedtogofurther.
• Then,theuserwouldbeproposedtoperformsomeanalysisbyhimselfornot,whetherhisinterestistomakeuseofhisdataoronlyprovidethemtoothers.Analysiscouldbeperformedthroughasimplifiedinterfacebychoosingstatisticalmethodsandvisualisationtoolsproposedtotheuser.ApplyingthesemethodsandtoolsforcalculatingandvisualizingEBVswouldconsistinlinkingthedatasettopre-writtenscriptsorsoftwarethatwouldhavebemadeavailable,selectedandreviewedbyexperts.TheusermightalsobeofferedtheopportunitytoaccesstootherdatasetsalreadyuploadedinthedatabaseforcalculatingandvisualisingEBVsfromotherspeciesorplaces.Theusercouldalsodownloadfiles(csv,text,etc.)withtheoutcomesoftheanalysis(EBVestimates,trends,maps,etc.).
ANSWER:
• Availabilityofaconsistentsuiteofdata
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page46of86
• DataQualitytestsarethefirstandmostimportantsteptoeliminatenon-relevantdata.• Aconsistentmethodologyforsurveyingabundanceanddistributions.
• IleavethedetailsofSDMandGDMstepstoJaneElithandKristenWilliams.Itisnolongermyexpertise.
ANSWER:
• Foralldata,thelevelofsupportingevidencefortheobservation/measurement/traitshouldbeknownorestimatedanddatashouldbeincorporatedbasedonanappropriateminimumconfidencelevel.
• Foralldata,precisionandaccuracyshouldbeunderstoodforkeydimensions(spatial,temporal,taxonomic,environmental)anddatashouldbeincorporatedbasedonappropriateprecisionandaccuracyforthemodelinquestion.
ANSWER:
• Establishadatadocumentationandaqualityassuranceplanfortheworkflow.
• Determinatesourceofdata:researchprojects,involvingothersectorashealth,hydrocarbons,defense,agronomy,etc;biologicalcollections:citizenscience.Itmeansaculturalchangeformostandacapacityenhancementinthismatters.Alsodatauselicencesmustbeclearinordertoavoidlegalissues.
• Integration:Oncethedataispublishedthemechanismforupdatingandindexingthedatashouldbecleanandfullyinteroperable,that'swhydatastandardizationissoimportant.
• Processing:datacanbeprocessedwiththerelevantvariablesforresearch.Here,thecomputationalcapacityanddevelopmentofscriptsarecomponentscanmakethisstepthemostautomatizedoftheworkflow.
• Analyse:Oncethedataisprocessedcanbeanalysedforgettinginformation.Inthisstepinformatictoolsareusefulbuttheexpertjudgmentdeterminateswhatcanbeconcludedfromtheanalysis,whetherifsomethingcanbeconcludedorifthereisnotenoughinformationfor.
• Visualization:Allshouldbeshownontheweb,whetheritcanbepresentedasamap,table,chartsofstatisticsorapublication.Acontentmanagementsystemwillfacilitatethatwithsomedevelopmentsanddesigntoenhancethewayitcanbegiventothestakeholders.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page47of86
ANSWER:
ANSWER:
• Thekeystepswillbeinthefront-endoftheworkflow.Issuesofintegratingdatabystandardizingreferencesystemsandspatialandtemporalcoveragearemosttimeconsumingandmaybeproblematicinhistoricdatawherecontextualinformationislacking.IfeelthenextmostimportantissueistraceabilitybetweentheevolvingconceptsofEBVs,thedataandalgorithmsusedtocalculatethesemetrics,andevidenceforthevalidityoftheapproachinthescienceliterature.Ithinkformalreferencestoscientificevidence,validateddataandalgorithmsareincreasingimportantwhendisplayingthefinalmetrictodecisionmakers.
ANSWER:
• Formoleculardata(DNAbarcodeormetagenomicsequences):
• Step1:developmentandimplementationofaquerysystemthatintegratestheinformationfromdifferentworldwideavailabledatabases/infrastructure.Thecentralsearchingcriterioncouldbethenameofthetaxonomicclass,possiblyofthespecies,combinedwithothercriteriasuchasthesamplinggeographicallocationortime;
• Step2:developmentandimplementationofasystemtomanageanydatasubmittedbytheuserfortheinvestigatedspecies.Thereshouldbeatemporaryordefinitivedataandanalysisresultsstoragesystem;
• Step3:Implementationintheinfrastructureoftoolsandreferencedatabasesfortaxonomicanalysis.
• Step4:Implementationintheinfrastructureoftoolsforcomparativestatisticalanalysisofpresence/abundanceofspeciesindifferentgeographicalareasandtimepoints.
ANSWER:
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page48of86
• Accessingtherelevantdataatglobalscaleisoneofthebiggestchallengesanditisofhighestimportancetofindagoodbalancebetweenbottom-upandtop-downapproacheswhenfacingthischallenge.Top-downapproachessuchasGBIFimplementaninfrastructureandthenaskcountries/regionstosharedatatofeedthesystem.Theseapproachesareinterestingbecauseoftheirlargespatialscale,buttheymaypreventfromeasilycontrollingforimportantissuessuchasunevensamplingeffort,datastorage/sharingandmobilizationamongcountries(Becketal.,2014).Bottom-up,network-of-networkapproachessuchastheEuroBirdPortal(http://www.eurobirdportal.org/ebp/en/)thatintendtoconnectdifferentsystemstoeachothersareinitiatedbythecountriesthemselvesandhavethepotentialtoprovideinformationassociatedwithalowerlevelofspatialbiasbecausesampling/recordingeffortand/or“sharingwillingness”maybeestimatedinamorestraightforwardway.Reconcilingthetwoapproachesisabigchallengeahead.
ANSWER:
Dependsonthedata,butIwilloutlineforpresence/absencedata(cameratrapimagesorrecordings):
• Importdatafromcameratraps/recorderfromasamplingseasonandensureeachimage/recordinghasthefollowinginfo:Projectname,sitename,date,time,spatialcoordinates,speciesname,andothermetadata(personidentifyingtheimage,samplingperiodname,samplingperioddates,etc.).
• Basicdataconsistencycheck.Forexample,arealldatesandtimeswithintheexpectedtimeframe?Arespeciesnamesconsistentacrossthedata.
• Foreachspeciesinthedatasetcreateamatrixofsamplingpoints(rows)vs.time(columns)indaysoranyothermeaningfultimeinterval.Fillthismatrixwith1,0,orNAdependingonwhetherthespecieswasseenatthispointonthistime(1),notseen(0)orthepointwasnotsampledonthisday.Thismatrixistheinputofbasicoccupancyanalysis.Amatrixofnumberofsamplingeventsofaspeciescanalsobecreated(definedasthenumberoftimesthespecieswassampledatapointatthistime)forabundanceanalysis.Userneedstodeterminewhatconstitutesasamplingevent(e.g.seriesofimagesthatareatleast5minapartintime).Fromthiseventmatrix,pointabundanceanalysiscanbeperformedusingbinomialmixedmodels.
• Ifspeciesobservationsaresparse(speciesdetectedatlessthan5%ofthepointspersamplingperiod)anddetectionprobabilityislow(<0.05),mostmodelswillhavedifficultyconverging.Inthiscase,combinespeciestogether,ordonaïveanalyses(withoutcorrectingfordetectionprobability)withoutcovariates.
• Chooseaseriesofcovariatesthatcanbe:spatial(valueofcovariatechangeswithspace),temporal(valueofcovariatechangeswithtime),both(valueofcovariatechangeswithspaceandtime).Thesecovariatescanbeusedtomodeloccupancy/abundanceaswellasdetectionprobability.
• Temporalanalysiswillrequirefittingadynamicoccupancyordynamicbinomialmixturemodel(forabundance).Softwarealreadyexistsforthis(Presence,packageunmarkedinR,orTEAM’sBayesiananalysisusingJAGSinR).
• Ensurethatmodelrecoverspatternsadequatelybycheckingmodelconsistency.ANSWER:
• Needagoodcomputationalinfrastructure• NeedstandardizationofcomputationalanalysispipelinesANSWER:
• Gettingandprocessingrawdataarethekeystep.Atpresent,rawdatawerecollectedbydifferentagencieswithdifferentstandardsanddifferentdataquality,anddistributedindifferentplacesandlackofdatasharing.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page49of86
ANSWER:
• Thisisgeneralwithoutthinkingaboutwhetheritneedstobeautomated/rolledoutglobally:• Firstthinkaboutwhatyouaretryingtoachievewiththemodellingandwhatsortofdatathatrequires
(don’tstartwiththeavailabledata;thinkfirst).Monitoringchangeismuchmorechallengingthanaone-offspeciesdistributionmodel,and–despitepublishedexamplestothecontrary–Idon’tviewpresence-onlydataassuitable.
• Collectrelevantspeciesobservationdata.Ifthisisnotyourowndataneedssometimetounderstandit–togettogripswiththesurveydesign,tounderstandsurveyeffort,togetafeelingforwhetherspeciesidentificationisreliableetc.Checkwhetherthedatameettherequirementsforthesortofmodellingthatisappropriate.Ifpredictionsaretobemadeacrosslandscapes,dothesamplescoverthemainenvironmentalgradientslikelytobeimportanttothespecies?Checkforanyerrorsinthedata(terrestrialrecordsinthesea;mismatchbetweentextualdescriptionsandlat/longcoordinates;recordsforriverinefishesonland;etc).
• Assessthecoverageofthesamples,oftheenvironmentalandgeographicgradientsintheregionofinterest–thisisrelevantforunderstandingwhetherpredictionstounsampledsitesarelikelytobewellinformed.Ref:Cawsey,E.M.,Austin,M.P.&Baker,B.L.(2002)RegionalvegetationmappinginAustralia:acasestudyinthepracticaluseofstatisticalmodelling.BiodiversityandConservation,11,2239-2274.
• Gathercovariatedata,forboththeobservationprocess(detection)andthestateprocess(occupancy/abundance)–wantvariablesrelevanttothespeciesatagrainthatrepresentsthoseenvironmentsproperly(e.g.aspectisirrelevantatcoarsergrainbutmaybeimportanttothetemperaturesexperiencedbythespecies).Evaluatecorrelationsbetweencovariatesanddecidewhethertoreducethecovariateset,andhow.
• Fitamodel(whatmethodisgoingtobeusedformodelselection?–bigissue),testitsfitandpredictiveability(thelattere.g.withcross-validation).Atthisstageneedtodecidewhatminimumpredictiveabilityisacceptableformonitoringchange.
• Ifrequired,predictoccupancy/abundanceoverwholeregion.• Repeatintime2.Question:doyouusesamesetofcandidatecovariates,orsamesetofcovariates
selectedinthefinalmodelattime1(intuitively:thefirst)• Calculatechange.Includeuncertaintythroughout.ANSWER:
Themaincomplexitiesare:
• Dataaccessibility,whichincludespeoplenotwantingtosharetheirdata.• Dataharmonization,whichisrelatedtodataandmetadatastandards
ANSWER:
• Selectingaspeciesorgroupofspeciesfromataxonomy.• Fetching,providingorgettingautomaticsuggestions(e.g.possiblemisspellings)foralternativenames
forthespecies.• Retrievingwhatyoucall“rawdata”fromeachpossibledatasourceusingallnames.• Manuallyand/orautomaticallyfilteringretrieveddatabasedondataquality,geographical,and/or
temporalparameters.• CalculatingtheEBV.• Storingresults,preferablywithalldataused.• Organizingandpresentingresults.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page50of86
• Therecanbelotsofcomplexitiesdependingonthedetailsoftheworkflow.Forexample,EBVcalculationmaybeassimpleascalculatingtheextentofoccurrencebasedonpointdata,butitmayalsoinvolvecomplexecologicalnichemodellingtechniqueswithpreandpostprocessingsteps.Dataqualityfilterscanbenumerous.Additionally,manystepscouldinvolvehumaninteraction,whichcouldseriouslyaffectworkflowefficiency.Andifthelevelofinteractionincertainstepsistoohigh,itmaybenecessarytoeithercreateintermediarydatabases/servicesorchangetheoriginaldatarepositoriestoacceptsomespecifictaggingmechanismthroughwebservices,sothatfurtherworkflowrunsdonotrequireduplicateworktoreviewthesamedata.
ANSWER:
• Identificationofthetaxonomic,spatialandtemporalscaleswherebestavailablebiodiversitydataareenoughtomakereliabledistributionandabundancepredictions.Taxonandlocationsspecificverificationsandcalibrations.
ANSWER:
• Ithinkthattheworkflowshouldbeginwithprotocolsfordatacollectionandverification.Atthisstage,itwillbeimportanttocollectthemetadatarequiredtodeterminelegalandpolicyinteroperability(perhapsagenericstandardlikeDublinCorecouldbeexpandedbydrawingonotherstandardsandframeworks,includingtheEuropeanInteroperabilityFramework(http://ec.europa.eu/isa/documents/isa_annex_ii_eif_en.pdf).Ofcourse,allrawdatasetswithoutsufficientmetadatawillneedtobere-visited,andnewmetadataadded,beforethesedatasetscanbeused.
• Havingaccuratemetadatatodocumentthingslikedatapolicies,theamountandtypeofPIIincludedinadataset(ifany),andthelegaljurisdictionofdatacollectionisafirststeptowardssupportinginteroperability.Automatedmatchmakingcouldbesufficienttodeterminekeyaspectsoflegalinteroperability,forexamplebyusinginformationincludingnationalprovenance,data/databasestructure,andlicensingtodeterminewhethertwodatasetsarecompatiblefromanintellectualpropertyperspective.
• But,automatedmetadatamatchmakingbetweendatasetsisnotacompletesolution.Forsomedatasets,includingthosecollectedbycitizenscientistsforaspecificpurpose,theinitialgoalsofdatacollectionmaybecompatiblewithsomeformsofre-usebutnotothers.Thiscouldbethoughtofasaformofpoliticalorethicalinteroperability,andisn’talwaysgiventhesameweightaslegalandpolicyconcerns.Thereshouldbesomewayforthecreatorsofadatasettoindicate(uponuploading)theirpreferencesforre-use(isautomatedmatchmakingOK,ordoesthesystemneedtosendtherequesttoahumanresearcherforreview?).
• Whileallnecessarystepsoftheworkflowshouldbeuncoveredthroughthedesignprocess(seeQ6),afewfeaturesthatmayappealtocitizenscientistsincludetheabilitytosaveanin-progressorcompletedanalysis/visualization,theabilitytoexportavisualizationwithreferencestosourcedataandmetadata,theabilitytoworkcollaboratively,andtheabilitytoaskquestionsofexpertsorothersthroughadiscussionfeature.
ANSWER:
• Datapreparation:o Collecttherawdata(potentiallyfromarangeofsources)o Identifyduplicatedatao Quality-checkthespatialandtemporalreferenceifpresent,filterdatawhichcannotbepinned
downintimeandspace.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page51of86
o Lookforobviousoutliersorpossiblemisidentificationsusingmodelledspeciesranges,andhandletheseaccordingtoaconsistentstrategy.
o Investigateandinterpretanyflagswhichmayindicatebreeding/migratory/etcstatus.o Ensurethattaxonomicdefinitions,unitsetcmatchorcanbetransformedbetweenthedifferent
datasets.o Optional:Ifworkingwithabundances,transformobservationstoinferredabundancewhere
samplingeffort/strategyisknown.o Performallnecessarytransformationsandharmonisethedata.
• Computespeciesdistributionsorabundancevalues/mapsforspecifictimesteps.Aparticularcomplexityhereisidentifyingwhereanabsenceofrecordsactuallyindicatesabsenceorlossofthespecies.
• Combinethevalues/mapstogetanideaofCHANGEbetweentheepochs.Here,acomplicationisthattheuncertaintyatbothstepsmaydrownoutanyactualchangesignal.
• Generateclear,usablemaps/tables/charts(ideallyaccessibleviatheweb,withlinkstofullclearmetadataonhowthecomputationwascarriedout,andlinkstothesourcedata).Rawresultsshouldalsobemadeavailable(alongwithmetadata)sothatuserscanpresenttheresultsintheirownchosenway.
ANSWER:
• Changeofdistribution:DownloaddatafromGBIF.Cleanitbymergingsynonymsandduplicaterecords.Choosemeaningfulvariables,runprincipalcomponentanalysis(PCA),choosetherightcalibrationarea,andcalibratetheecologicalnichemodel.Themaindifficultyistogetmodernenvironmentaldatalayers,becauseWorldClimisnowratheroutdated.Therearedatagaps,foinstanceinEasternEurope.Datamanagementisachallengeingeneral.OpenModellerdoesnotofferPCA,etc.Lotsoftechnicalhurdlesanddealingwithdatagaps.
• Changeinabundance:DownloaddatafromGBIF.Cleanitasabove.Cross-tabulateitintoaOLAPhypercube,aimingatreasonabledatadensityateachcell.Computetrendsinvariousspatiotemporalresolutions.Thecomplexityistomaintainperformancewhenusinghundredsofmillionsofrecords.Datacleaningatthatscaleischallenge,becauseitcanonlybedoneautomatically.
ANSWER:
• InSparta,westartwithasetofrecordsfrommultiplespeciesthatwebelievetoberecordedasanassemblage,bywhichwemeanthatrecordsofonespeciescanbeusedtoinferabsencesofothers(seeIsaac&Pocock2015foraUK-centricexpositionofthedataissues).SpartainterfaceswithapackagecalledrnbnthatcollectsdatafromtheBritishNationalBiodiversityNetwork(theUKnodeofGBIF)anditwouldbetrivialtolinkitdirectlywithrGBIF.Thechallengeistoidentifytheassemblage,i.e.whichsetofrecordscanbeconsideredtobeco-recorded.Therestoftheworkflowiswell-definedinSparta,butinvolveconvertingtheserecordsintoa‘visitmatrix’whereeachrowisauniquecombinationofsiteanddate.Ourestimateofsamplingintensityisthelistlength(thenumberofspeciesrecorded,followingSzaboetal2010).Weuse1km2anddayprecision,butcoarserresolutionsarequitepossible.Eachspecies’detectionhistoryisappendedtothismatrix:themodellingthereafterisdescribedinthescientificliterature(e.g.vanStrienetal2013,Powneyetal2015).Thereareissuesofspatialcoverageandspatialautocorrelationthatwehaveyettomaster.
ANSWER:
• Knowingthefitnessofthedataforaspecificquestionisessential.Selectthedatabasedonproperdocumentation.Rawdatacanonlybere-usedforotherapplicationsifproperlystandardizedand
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page52of86
qualitycontrolled.ForEurOBISandOBISweusetaxonmatchingtools,spatialendenvironmentaloutlierdetectioninanautomatedsystemtoassignqualityflagstotherecords.
• Relevantpublication:Vandepitteetal.2015DatabaseANSWER:
• DrawingonexperiencefromEssentialClimateVariables(Bojinskietal2014)itiscleartheprocessandmethodologyforgeneratingEBVdataproductsismulti-stage.
o Assemblingtherelevantrawdataisthefirststep.Insomecasesthiscanbequitestraightforward;retrievingrelevantoccurrencerecordsfromGBIF,forexample.Inothercases,lessso;perhapsrequiringadditionalprocessingandtransformationofsatelliteimages(forexample)orestablishmentofnewobservingprotocols.
o Asecondstepcouldbeconcernedwithadjustingtheassembleddatatoaccountforheterogeneitiesinit;perhapsintheobservingmethodsortofillgapswheredataisabsent.
o DependingonthenatureoftheEBV,amodellingandcorrelationstepmaycomenext,leadingtotheEBVproductitself.
o Thereafter,comespost-productionqualityassurance-achecknecessarytoensuretheuniformityoftheproductwhencomparedtothesameproductcalculatedforadifferentplaceoratadifferenttimeorusingdifferentdata.
o Allofthishastobecarriedoutinastandard,repeatable,openandtransparentmannerwithclearandaccessibledocumentationofeachstep,suchthatitcanbesubjecttoexpertscrutinyandpeerreview.
o Andfinally,theEBVproductneedstobeupdatedregularly,perhapsinnearreal-timesothattheinformationcanbeusedasthebasisformonitoringchange.
ANSWER:
• Keysteps:
o Knowledgeofscientificquestion,samplingdesign,samplingprocedure
o Datamanagement,includingprotocols,standards,etc.
o Datastandardization/normalization
o EBVcalculationmethodassessmentandknowledgeofitsstatisticalproperties
o Identificationoftheappropriateworkflow
o Identificationofthestatisticalpackage
o Identificationofthewebservicewhichprovidesaccesstothepackage
o Assessmentofthecomputationallimitationsoftheofferedwebservice
o Datauploadingandmassaging
o ExecutionoftheEBVcalculation
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page53of86
o Visualizationofresults
• Complexities:
o InfinitenumberofcomplexitiesstartingfromthedesignandsamplingproceduretoEBVscalculation.Unlesseverysinglestepoftheabovestepsoftheprocessiscrystalclear,biasmaycomeatanystageandmayhaveconsiderableeffectonourcalculationandestimation.
ANSWER:
• I’massumingthisquestionaddresseswhattodoaftertherawdatahasbeencollectedandisn’taskingabouthowtogoaboutdoingthemonitoring.Theworkflowdependsonthedatabeingusedandtheintendedoutput.I’lldescribeoneexampleforabundancedata.
• Rawabundancedataareloggedinordertomaketrendscomparablebetweenpopulationsanddiscountthesizeofthepopulationunits(ifappropriate).Thisisessentialifdifferenttypesofpopulationcountshaveusede.g.sightingsperkmoftransectversusbiomass.
• Useanappropriatemodeltoprocessthetimeseriesdata.Thiswilldependonthetypeofdataandwhetheritisstructuredorunstructured.Generalisedadditivemodelscanbeusedforlongertimeseries,linearmodelsforshorterones(Collenetal2009;Bucklandetal2005).Othermodelsareusedwhenusingstructuredsurveydata(Gregoryetal2005).
• Useancillaryinformationattachedtothepopulationdatatodisaggregatetheresultsinmeaningfulwaysandanswermorespecificquestions.
• Geometricmeanisthemostcommonmethodofcombiningamultispeciesabundanceindicator.Usuallyeachspeciesisweightedequallybutthiscanbealteredifthereisanimbalanceinthedataset.
• Complexitieso Addressingbiasinthedata.Weightingcanbeappliedtoaddressunder-representationof
speciesorregions.
o Otherissuesofrepresentationincludehowmuchofaglobalspeciespopulationshouldbemonitoredinlightofthefactthatwewillalwaysbelimitedinhowmuchwecanmonitor.
o Otherissueswhenmonitoringabundanceparticularlyformigratoryspeciescanbeinunderstandingifapopulationhasmovedorshifteditsrangeratherthangenuinelydecreasedinnumbers.
o Understandingwhatmakesupapopulation–isitdefinedinanecologicalsenseordoesitrefertothegeographicalextentofastudysiteandalltheindividualsofaspecieslocatedthere.
ANSWER:
• SeeanswertoQ8(thefirst2priorities).Certainlyreliablemetadataisthekeyissue.
• Inmoreabstractandgeneralterms,myanswerissimple.IwouldtestwhetherthedatamanagementlifecycleanalysisofGEO(iftheyareadaptabletobiodiversitydata–whichIamnotsosureabout)andothersarewellarticulatedortheyarestillwishfulthinking.Toomanypeoplehavebeenthinkingaboutitnottotaketheiroutputsseriously.
ANSWER:
• Speciesdistribution:
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page54of86
o Nameresolutionandreconciliation
o Data(occurrence)harvestingfromacrossresourcesbasedontaxonconceptqueries
o Fitforpurposeevaluation
o Datacleaningalgorithms(confidencelevelthresholds)
o Dataaggregation(occurrencerecords)
o Projectionovermulti-layeredenvironments
o Ad-hocgeo-correlationservices
ANSWER:
• Dealingwithdatafromdifferentdatasourcestheintegrationandanalysismainlyfocusesontwocomplexities:a)thechangingtaxonomyandb)changingmethodsintheestimationofabundanceanddistribution.Informationontheunderlyingmethodsappliedandtherelateduncertaintiesinthequantificationneedstobetakenintoaccountforthecalculation.Theestimationoftheoveralluncertaintyisoneoftheimportantissues.
• Themainissueistogetdataonanationalscaleforthreatenedspecies.Whilemonitoringforcertainspeciesworkswelle.g.alsousingtrackingmechanisms(e.g.whalesorbirds),forothersitismorecomplicatedtogetconsistentfigures.
• Issuestobeaddressed:
o Taxonomicreference
o Methodcomparisonforestimationofabundanceanddistribution
o Estimationofthesingleandoveralluncertainty
o Dealingwithdatagaps(bothintemporalaswellasinspatialterms)
o Dataavailability
• Theissueonusingprovenanceandqualityinformationinautomatedworkflowsisanissuefortheintegrationofinformationfromdifferentsources.
ANSWER:
• Theworkflowwillmostlikelybead-hocdependingonthespecificquestionbeingaddressed.
• Anyworkflowwillincludeatleastthefollowingsteps:
o Itisnecessarytofacilitatetheaccessto(meta-)datareliableandidentifiedresources;
o Datagapanalysistoidentifypotentialholesaffectingdistributions(seee.g.http://www.gbif.org/resource/82566);
o Datacleaning(seee.g.http://http://www.gbif.org/resource/80528)todetectsuitable,fit-for-usedata
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page55of86
(http://www.gbif.org/resource/80623,http://www.unav.es/unzyec/papers/ztp720_postprint.pdf)
o Toprovidethepropere-Toolstoalsoguaranteetheir(semantic)inter-operability;
o Nichemodelling(http://press.princeton.edu/titles/9641.html);
o Sometypeofvisualizationrangingfromthemostbasic(e.g.GIS)tocomplex,network/relationshipgraphsandtheirfurtherintegrationintoproperVirtualResearchEnvironments-VRE(seealsoQuestion6).
• AneffectiveapproachbasedonCOMMONtoolsassimpleaspossible(forexamplePython-based)butflexibletoallowtheusersdetailedanalysis,couldbeencouraged.
ANSWER:
• Dataprep,havingthesourcedataevaluatedtoseeifitisfitforpurpose(https://www.youtube.com/watch?feature=player_embedded&v=eVYwt86mC_4QandComplexityisinagreeinghowtocalculatethemeasureoftheEBVmathematicallysothatitcanbeimplementedinsoftware.
• Oncehavethatalgorithmicallywouldsuggestthatthedataisprocessedintoamulti-dimensionalcube(OLAP–seehttps://en.wikipedia.org/wiki/OLAP_cube)toenableinteractivedashboardatmultiplescales.
ANSWER:
• Therearelikelytobemanyvariationsofaparticularworkflow,dependingonthequestionbeingasked.Thegeneralapproachneedstobescientificallywell-establishedbutalsosufficientlyflexible.
• TheexistingBioVeLenvironmentimplementsmanyofthesestepsalready:
o Accessing,preparingandcleaningthedatatoremoveerroneousrecordsarefundamentalsteps.Conveyingchangesineditingexistingdatabacktothedataprovidersisalsoveryimportant.Forexample,weroutinelyutiliseGBIFdatabutofteneditexistingco-ordinatesmoreaccuratelyoraddco-ordinateswherethereisdetailedlocalityinformationbutnogeo-reference.
o Forspeciesdistributions,ecologicalnichemodellingisawidely-usedtoolforindividualspecies,orbymakingspeciesrichnesssurfacesbystackingmodeloutputs.Modelsrunagainstdifferenttime-stampeddatasetscanrevealchangesinaspeciesdistributionEBVformultiplespecies,ifcomparedagainstanappropriatebaseline.
• Again,theinterestingquestionsarenotsomuchchangesinparticularEBVsovertimebutininferringmechanismsandpredictingfuturepatternsbyintegratingapproachesfordifferentEBVsintosharedindicatorsbasedonacommonmathematicalframework.
ANSWER:
• IthinkthemainissuehereisthebroaddeploymentoftheExtendedDarwinCorewithEventCore.Thisextendedcoreallowsustomanipulatethestructureddatacomingfromsystematicmonitoringefforts,somethingthatwasnotpossibletodowiththeoriginalDarwinCore.Thenextstepwillbetocombinedatafromvarioussources,includingopportunisticandsystematicsampling,togeneratespecies
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page56of86
populationsanddistributionEBVs.Thismayalsoincludetheuseofproxiessuchasland-coverandclimatetoexpandfromthesamplepointstoacontinuoussurface(walltowallmonitoring).
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page57of86
Question6
Whatisasuitabletechnical(ICT)approachtoperformthisworkflow(s)forcalculatingEBVs(anyplace,anytime,usingdataanywhere,byanyone)?Whatspecialconsiderationshavetobetakenintoaccount?
ANSWER:
• TheworkflowforcalculatingEBVswouldideallybeaccessibleatanytimebyanyone.Forthispurpose,anydataprovidershouldcommittoadatasharingagreement(seeQuestion5).Anyuser,eitherprovidingdataornot,wouldneedtocommittoanotheragreementforaccessing,downloadingorusingdatasetsavailableinthedatabase.Thisuseragreementwouldspecifye.g.non-commercialuseofthedata,EBVcalculationsoranyoutcomesarisingfromtheuseofthedatabase.Besides,theusershouldcommittociteallthedatasourcesthatwouldhavebeenused,analysedordownloadedwhenevertheoutcomeswillbepublishedorcommunicated(citationofthenameofeachdatatheproviderandthenameoftheschemefromwhodataasbeen).
ANSWER:
• MonitoringabundanceanddistributionEBVsisunlikelytobeproperlydetermined,unlessspecificprojectteamsareappropriatelyqualified.
• Obviouslyastandardworkflowisrequired,witheachstepjustifiedasefficientandrobust(BestCurrentPractice).
• Whileanyone,anywherecouldusetheapproach,itisunlikelythatatotally‘canned’approachcouldbeoptimalgivencurrentstateofknowledge.
ANSWER:
• Thisquestionisadistraction.Dependingontherepeatabilityandnecessaryparameterisationofthemodelsinquestion,ICTsolutionsmaybebasedonworkflowengines(wherethisasuitablerapidexploration/prototypingapproach)oronrobustlyimplementedalgorithms.ParallelprocessingtechnologieslikeHadoopmaybeimportant.However,allofthisisfranklyrelativelytrivialoncewedeterminewhatwehopetomodelandhowthosemodelsrelatetothesourcedata.
ANSWER:
• Workflowcanbeperformedindifferentconditions,thetechnologiescanbeadaptedtoawell-definedsoftwarearchitecture.Establishinganappropriatesolutionarchitecturetoperformthisworkflowmeansthatwecanaddresstheprincipalconcernsaccordingtotherequirements.Thatalsomeansthatrequirementgatheringshouldbeanexercisethatmustbedonewiththebestjudgement.
ANSWER:
• Datamustcomefromseveralsourcesandshouldbeabletobedynamicallyinterlinkdifferentdatabasesanddocorrections,pointingoutdiscrepanciesetc.Weneedtomobilizealldataandnotfocusonlyonopenaccessdata.Thisbringscopyrightissues.
ANSWER:
• Thedata,contextualinformationandalgorithmsusedinworkflowswillbewidelydistributedandheterogeneousinnature.Therewillneedtobeagreementonanumberofstandardstoenabletheseresourcestobebroughttogether.Iwouldexpectthesetoincludewrappingtheseresourcesasweb
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page58of86
serviceswithstandardvocabulariestodescribethemwheninterrogated.Howtheseservicesarethenmarshalledintoaworkflowcanbecarriedoutinanyworkflowenginethatadheretotheservicestandards
ANSWER:
• ImplementationofanalysissystemsinaWorkflowManagementSystemsuchasGalaxyorTaverna.
• Thesystemsshouldbeusableevenbynon-expertsandincludeauser-friendlyinterface.
ANSWER:
• TEAMoffersananalyticalenginetoperformoccupancyanalysisofcameratrapdatawithorwithoutcovariates(wpi.teamnetwork.org)TheanalysiscanalsobeperformedbyrunningRscriptsforpre-processingandmodelfitting.Inthenearfuturewewillhavethecapabilityofdoingabundanceanalysesaswell.
ANSWER:
• TheCreative-Bdocument“D3.1Comparisonoftechnicalbasisofbiodiversitye-infrastructures”documentstheconceptualarchitecture(figure26)forhowapplications,servicelogic,andresourcescanbestackedandinterfaced.Workflowsareincorporatedintotheframeworkaspartoftheservicelogic.What’smissingfromthatconceptualization,whichwouldbeusefulforthecomputationofEBVs,istoassociatetechnicalstandardsorimplementationtechnologiesthatare“GLOBIS-Brecommended”.Forexample,the“DataResource”componentneedstohaveassociatedwithitstandardsandtechnologies(formetadata,forcatalogservices,fordiscoveryservices,foraccessservices,etc)thatisnotjustalaundrylistofstandardsandtechnologies,butaconstrained,vetted,set:toomuchandwideofasetofoptions,anditbecomesuseless.Technologyimplementers/systemintegratorsappreciatea“sanctioned”,controlledsuiteofoptionsgiventothem,withperhapsareferencearchitectureofhowonesuchcombinationofstandardsandtechnologiesisusedtoimplementasolution.
• Ifeelthatdevelopingsolutionsattheconceptuallevel(liketheconceptualarchitectureabove,butsupplementedwithasmallsuiteof“sanctioned”standardsandtechnologies)coupledwithadocumentedinstanceofonesuchimplementationoftheconceptwouldencourageotherEBVprojectstotrytoadoptasolutionthatwouldbeinteroperablewitheachother,atleastatvariousspotsinthedifferentimplementations.
• Anexampleofaconstrainedsetofoptionsforstandardsandtechnologieshasveryrecently(late2015)beenproposedbytheUSGrouponEarthObservations,USGEO,whichistheUSbodytotheworldwideGEO.ThedatamanagementsubcommitteeofUSGEOhasadraftversionofthe“CommonFrameworkforEarth-ObservationData”(https://www.whitehouse.gov/blog/2015/12/09/improving-access-earth-observations),whichshouldbefinalizedsometimein2016.
ANSWER:
• Idon’tthinkanyworkflowforthispurposeisavailable,now.Qualityandstandardofrowdatahavetobetakenintoaccount.
ANSWER:
• (unsureaboutthemeaningofthisquestion).“byanyone”?Idoubtthatit’ssafe/possibletoautomatethistotheextentthatsomeonewithnomodellingexperiencecoulddoit.Maxent(thespeciesdistributionmodellingsoftware)isagoodexampleofsomethingmadeavailabletonon-expertsthatis
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page59of86
thenusedwithpoorchoicesbymanyusers(becauseit’shardfornewcomerstounderstandthenuancesofthechoices,andmostpeoplegofordefaultsanddon’tunderstandtheimplicationsoftheirchoices).Needsareasonablelevelofcommitmentforsomeonetodeveloptheexpertisetorunthemodelsproperly.Ibelievethemodelsneededforchangedetectionareastepmoredifficultandneedtrainedpeopletorunthem.
• Presumablysomestepsindataprepcanbeautomated.Toolscanbedevelopedtoreportonavailablecovariates.Modelling:E-bird(USA)isagoodexampleofsophisticatedmodellinganalysesappliedtovastquantitiesofdata.Hastakenateamwithconsiderablestatisticalandcomputingskillstosetitupandtocontinuallyevaluatemodeloutput(withinputfromspeciesexperts).Hasn’tbeenleftforotherstorunit(i.e.theanalyticalteamisalwaysthere).
ANSWER:
• Assumingthatthewholeprocessneedstobereplicablebyanyone,theworkflowwouldneedtointeractwithpubliclyaccessibledatarepositoriesandnotdependonspecifichidden/privatedatafromcertainresearchers/institutions.Timeandgeographicrangecouldeasilybeworkflowparametersiftheunderlyingrawdatacontainthesedimensions.Thingscangettrickierifyoualsowanttohandleincompleterawdata,suchasoccurrencedatawithoutcoordinates,havingonlyadescriptionofaplaceinnaturallanguage.Anothercriticalstepistohandleuncertainty.Forexample,occurrencerecordswithhighspatialuncertaintywouldprobablyneedtobediscardedinlocal/regionalscalecalculations,butcouldstillbesuitableforcontinental/globalscalecalculations.
ANSWER:
• Ithinkthisquestioncanonlybeansweredthroughtheprocessofcooperative(oratleastuser-centered)design.TheGLOBIS-Bteamcouldbeginbybrainstormingasetofpersonasrepresentingrelevantstakeholders,includingscientists,policymakers,anddifferenttypesofcitizensciencevolunteers.Then,usersfromeachstakeholdergroupcouldberecruitedtoinformthedesignanddevelopmentofanEBVcalculationplatform.
ANSWER:
• Datapreparation:
o Collectionofrawdatarequirestheusertobeabletoeasilydiscoverallthepossiblesourcesofdatarelatedtoaparticulartaxonomicgroup.Tosomedegreethisispossiblethroughcataloguesearchesbutisstillchallenging.
o Processingandformattingthedatarequiresthatitisoriginallyavailableinaneasily-transformeddigitalformatwhereeachdatasethasatleastsomecommontags,fieldsetc.
o QArequirestheusertohaveaclearideaofwhatconstitutesareasonableorimpossiblevalue/observation.
o Transformationofdatasetsmaybecomputationallyintensive.
o Mostimportantly,eveniftherearesharedopen-sourcelibraries(e.g.,inpythonandR)forperformingtheabovetasks,therewillalwaysbeanelementofuserparameterizationandtweaking,meaningthatpotentiallyhugeamountsofeffortcouldbeexpendedtoproduceEBVsfromthesamedatasetswhichareinconsistent.Theidealwouldbethataggregation,qualitychecking,harmonization,catalogueharvesting/metadatapublicationetcwereallcarriedout
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page60of86
beforetheusergetstothedata.TheagencywhichcomesclosesttodoingthisjobatthemomentisGBIF.
• ComputationofEBVsincludinggap-filling/inference/interpolation.Ifuncertaintyintermsofdetection/misidentificationistobequantified,someMonteCarlosimulations/randompermutationsofthedatawouldbenecessary.Ifpositionalaccuracyislikelytobeaproblem,thisshouldalsobeacknowledgedandtheimpactofproblematicobservationsassessed.
• Computingchangebetweentimesteps–probablythesimpleststep–plainmathsormapalgebra,thoughifuncertaintyistakenintoaccounttherearemorecalculationsnecessarytocalculatelower/upperboundsorquantiles.
• Presentingvisualresults–recommendopeninteroperablewebservicesformaps/standardformatteddata(seebelowforhowthiscanbedone)..
• Whatspecialconsiderationshavetobetakenintoaccount?
o Technicalcapacityofusers,necessaryinvestmentintraining/effort,accesstodata(orunittests)forverification,testingandvalidation–accessibilityofsoftware(i.e.,opensource/freewarevs.corporate).Legal/copyrightrestrictionsoncomponentdata,andwhetherthesepercolatethroughtoderivedproducts.Necessitytoaggregateorobfuscatesensitiverecords.Onebigquestion–howwill‘anyuser,anywhere’beabletogetgoodadviceonwhentheavailabledataistoosparseorinaccurateforuseintheirchosencontext?
ANSWER:
• ComputationsneedtobeperformedinportalsusingOLAP.
ANSWER:
• Werelyheavilyonclustercomputing.SomeofourdatasetsarereachingthelimitsofavailableRAMlimitation?Wearealsofindingthatmanyinvertebrategroupslacksufficientdatatoworkat1km2anddateprecision.
ANSWER:
• [Note:Don’tunderstandwhyeveryoneshouldbeabletocalculateEBV’s.Itdoesrequiresomeskill’s..]
• Virtuallabsthatmakedataandalgorithmsavailablethroughwebservicesoffersmanypossibilities,thisistheapproachtakenbyLifewatchandBiovel,Biodiversitycatalogue.
• Overviewofvirtuallabsforthemarineworld:http://marine.lifewatch.eu/
• Statisticalpackagescaneasilyharvestdatafromwebservice,webuiltseveralinterfacesbasedonR,Rstudio,Rshiny.
• Agood,scalableinfrastructureusingOGCstandardwebservices(WMS/WFS/WCS/WPS)isgeoserver.EMODNETmakesalldataproductsavailableasOGCcompliantwebservices.
• Thedifferentdatasourcescanbequeriedsimultaneously.
o http://www.emodnet.eu/dataservices/
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page61of86
ANSWER:
• Behindtheaboveexplanationliesasignificantissuethathastobeaddressedearly-onbyscientistsandthepotentialend-usersofEBVproducts.ItconcernsafundamentalchoicebetweencalculatinganEBVdataproducton-demandon-the-fly,versusamoreperiodicsystematicproductioncyclewhereEBVdataproductsareproduced,updatedandextended,forexampleannually,quarterlyormonthly.
o Simplistically,on-demand,on-the-flyproductionrequiresreadyaccesstorelevantrawdata,andtotheworkflowandprocessingcapacitytotransformthisrawdatatotheselectedEBVproductfortheindicatedplaceorareaofinterest(local,regional,national)atthetimestampofinterest.Processingcapacity“atthetouchofabutton”isnecessarytoservicetheinstantaneousdemandoftherequest(andofsimultaneousrequests).Sizeandcomplexityofrequestsisnotknowninadvance(althoughthiscanbecontrolledbylimitinggeographicalareaandresolution).Repeatabilityisakeyrequirement,suchthatiftheEBVisagainrequestedon-demandforthesameplaceandtime,thesameanswerhastobedelivered.EBVdataproductionisad-hoc,respondingtodemandsofthemoment,withthequalityassurancechecksin-builtintheprocedure.ArchivingoftheEBVproductsisnotrequired.
o Inthecyclicalapproach,EBVdataproductionissystematic,aggregatedoverlargeareas(potentially,thewholeglobe)andarchivedasaneverextendingdatabase(s)ofinformationtobequeriedtoprovidethedatafortheindicatedplaceorareaofinterest(local,regional,national)atthetimestampofinterest.Processingcapacitycanbeestimatedinadvance.PeriodicityoftheproductioncycleforanEBVcanbetunedtotheavailableprocessingcapacityandtotheexpectedtemporalsensitivityofthatEBV.Theinformationisgeneratedonce,archivedandthenavailableforever(orasetperiodoftime)tobeusedandre-usedasneeded.Theanytime,anyplacerequirementismetnotbyon-demandcomputationbutbyqueryingpreviouslycomputeddataproductsthathaveundergoneapost-productionqualityassuranceassessment.
ANSWER:
• Firstpartofthequestionnotentirelyunderstood
• Specialconsiderations:unlimitedcomputationalcapacity;transparency(leadstoadequaterepeatabilityofanyobservationandanalysis)
ANSWER:
• Severaltechniquesexisttoprocessabundancedata,forexamplethesoftwarepackageTRIM,themethodbehindtheLivingPlanetIndex(Collenetal2009)bothofwhichcouldprobablybedevelopedintoanonlineplatformforusebyanyone.Theformerisusedforabundancedatausingastandardisedmonitoringprotocolforasetofspeciese.g.birds,butterflies.Thelatterapproachcanincorporateabundancedatafromanyspecies,methodandunitofdata
ANSWER:
• Combinationofexisting“official”datamanagementsystems(e.g.digitizednationalbiodiversityinventories)withqualitycontrolledcitizens´sciencebasedapps,andadequateVREsfamiliartobiodiversityscienceactors.Ihavenotatallenoughinformedknowledgetodescribethem(withtheexceptionofveryspecificmarinespeciesestimations).WhatIamsureofisthataclusterofverysophisticatedpoliciesconcerningallthelife-cycledataflowsandreusesisanunavoidabledevelopmentthatnecessarilyhastobeinplace.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page62of86
ANSWER:
• Theimplementationofwebprocessingservices(WPS)withdefinedinputinterfacesincludingqualityinformationwillbeapossiblesolution.TheimplementationofthesestandardisedWPScouldbedoneondifferentplatforms.
• Implementationofdataservicesforspeciesdistributionandabundancedataincludingasemantictaxonomymappingtool.Enhancingservicebasedavailabilityofdataonspeciesisapre-requisiteforthemodellingapproaches.Asearlierstatedinformationonthequalityanduncertaintyofcertainmethodsneedstobeprovidedwiththedata.Hereisalimitationinthecurrentdataserviceswhicheitherfocusonspatialdataservices(e.g.speciesdistributionmaps)orsensorbasedobservations(e.g.asinglespeciesobservation).Furtherdevelopmentontheservicesforthesekindofdataisneeded.
ANSWER:
• Thishasalreadybeenapproachedwithoccurrencedata:
o SeetheGBIFportal(http://www.gbif.org)anddevelopmentsbuiltaroundit:forexampleWALLACE(http://protea.eeb.uconn.edu:3838/wallace2/).Ingeneral,webservicesandRESTabletomilklargedatabasesandproducesubsetsofdataalreadycondensedaccordingtocriteriasuppliedbytheuserwillbethepreferredmethod,asmostscientistsorpractitionersarelikelytopreferexperimentationonthedata(asopposedtofinalproductssuchasready-madenichemodels)
o SeealsotheLifeWatchMarineVirtualResearchEnvironment(VirtualLab)developments(lifewatch.eu)whichcommonconstructionblocksarealsobeingusedfortheimplementationoftheLifeWatchFreshwaterVirtualResearchEnvironment
• Ingeneralterms,thesedevelopmentscouldbeintegrated(beforebeingadaptedaccordingly)inLifeWatchICTdistributede-Infrastructure,astheEuropeanReferencePlatform(ESFRI)inordertofurthercomposesomee-ServicestooffertheseEBVsvaluesinavisualwaythroughthedevelopmentofinturnproperVirtualResearchEnvironments-VRE.Allthisprocessinvolvesanalysingin-detailthefinalusers(“customers”:researchers,decisionmakers-environmentalmanagers)requirements.
• Therefore,allofthisshouldbeperformedthroughthedesign,establishment,deploymentandmaintenanceofanOPENESSandBigDataparadigms-basedConceptualFrameworksuchasLifeWatche-Infrastructureisofferedatthedisposaltothispurpose.
ANSWER:
• UsedatawarehousingtechniquesthroughtheETL(ExtractTransformLoad)processtoaggregatethedataandbuildtheOLAPcube.
ANSWER:
• Again,manyoftheindividualstepsincarryingoutsuchaworkflowhavebeentackledalready,withseveralrunningsequentiallye.g.intheBioVeLenvironment.Havingsuchworkflowsopensource,ormakinguseofrepositoriesofpre-writtencode,sothatanalysescanbeadaptedtoparticularresearchquestionsisanimportantfactorintheirsuccessfulimplementation.
ANSWER:
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page63of86
• Asstatedabove,acombinationofstructureddatainDarwinExtendedCore,andsomestatisticalinference(e.g.correctionforsamplingbias,trenddetection),willbethefirsttargets.UseofSDM’sorhabitatsuitabilitymodelswithremotesensingofproxyvariables(e.g.landcover)mayalsobeusedtoexpandthedatafromthemonitoringpointstocontinuoussurfaces.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page64of86
Question7
Whatarethetechnicaloptionsavailableandwhatispossibletoachievetodayorwithinthenext12months?Whatdataand/orworkflows,softwareetc.areavailabletoday?Whereisitandhowcanitbeused?
ANSWER:
• GlobalanalysisusingthefreelyavailableRSiscentral–postagestampapproachesofjoiningmanylocalstudiesresultininconsistentoutput
ANSWER:
• Dataavailable:seeexamplesinQuestion1.
• Software/methodsavailableforcalculatingorvisualizingEBVs:TRIM(software),PRESENCE(occupancysoftwarefordistributionEBVs)orR-scriptofoccupancymodels,n-mixturemodelsorvisualisationtoolsthatcouldbemadeavailablefrompublicationsoranyexpertcontributor.Q-GISorGRASSareopensourcesoftwarethatcansupportthemappingandthevisualisationoftheEBVs.
ANSWER:
• ThereareDataPublishersthathaveagoodfoundationofdistributiondata,e.g.,GBIF,ALA,CRIA,BISON,SANBIetc.
• MethodssuchasMaxEntandGDMarewellknownandrobust.
• WorkflowsforSDMarewidelyavailable,e.g.,R,BCCvL,BioVel,
• Methodsforestimatingabundancearewellestablished.
ANSWER:
• Fordata,GBIFisprobablythemostcompleteoccurrencedatapoolandweshouldallworktogethertoaggregateallpossibledataonoccurrenceandsampleeventsinoneplace,andcollaborateindataqualityimprovementstothewhole.
• SignificantexistingGISandremote-sensedenvironmentaldatasetsexistandshouldbemadeaccessiblethroughaconsistentdiscoveryandaccesscatalogue.
ANSWER:
• Therearemanytoolsandtechnologiesthatcanspeedupthedevelopmentofthisworkflow:
o StandardsasPlinianCoreandDarwinCore.
o Publishingtools,suchGBIFIPT.
o Queuemessagestechnologies:ApacheKaftka,AmazonSQS.
o Indexingtechnologies:Solr,ElasticSearch
o Powerfulrelationalandnon-relationaldatabases:PostgreSQL,MongoDB,Hadoop.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page65of86
o Maptechnologies:Mapbox,CartoDB.
o Statsvisualization:Kibana.
o Andseveraldataqualitytools:http://community.gbif.org/pg/pages/view/39746/list-of-data-quality-related-tools-in-the-gbif-catalogue
ANSWER:
• WeareworkingalotwithOGCEFandO&Mstandardsatpresenttobringtogetherthedescriptionoftheorigins,configurationandaccessibilityofenvironmentalmonitoringdata.ThisisbeingusedtodeliverSOSwebservices.Therearereferenceimplementation(e.g.52N)forthesestandardswhichmanygroupsareworkingwith.Iwouldliketoexplorehowthesestandardscouldbeappliedtobiodiversity(e.g.fromGBIFetc)tonotonlydeliverspeciesdatabuttheenvironmentalcontextaroundthem.TherewouldthenbemanywaystoassembletheseservicesintoworkflowfromsimplepythonscriptstoTavernastylesystems.
ANSWER:
• Data:
o AsconcernsDNA-barcodingdata,alotofresourcesareavailableonline,suchasGenBankorBOLD.InBOLDeachbarcodesequenceisassociatedwithawellcuratedtaxonomicdescription,withthecollectionsiteanddate,theorganismpicture,etc.
o Asconcernsmetagenomicdata,amongthemostusedreferencedatabasesthereareRDP,GreenGenes,Silva,ITSoneDBePR2/HMaDB.PreviousmetagenomicprojectsequencescanbeexploredfromvariousonlinearchivessuchasEBImetagenomics,MeganDB,iMicrobe,etc
• Workflows,software:
o FortaxonomicassignmentofDNA-barcodingsequencessomeoftheonlineavailablephylogeneticpipelineareSAPeRaxML.AlsointheBOLDsitethetaxonomicassignmentisavailablebutitrequiresapreliminaryregistrationandnotmorethan100sequencescanbeanalysedeachtime.
o FortaxonomicassignmentofmetagenomicdatasetssomeoftheonlineavailablepipelineareBioMaS,QIIME,Mothur,MetaShot,Kraken,Sparta.
o Metagenassist,DESeq2,PhyloseqandMetagenomeseqpackagesareamongthecommonlyusedpackageforthestatisticalandcomparativeanalysis.
ANSWER:
• WecandooccupancyanalysisofcameratrapdataonamassivescalenowusingTEAM’swildlifepictureanalyticssystem.Thiswillcalculatepopulationtrendsandcombinetheseonaflexiblebiodiversityindex(thewildlifepictureindex).Withinthenextmonthswecouldalsoaccommodateothersourcesofdata(acoustic)andperformabundancebasedanalysis.
ANSWER:
• Theanalysispipelinecanbestandardizedinthenext12months
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page66of86
ANSWER:
• Idon’tthinkanyworkflowforthispurposeisavailable,now.
ANSWER:
• Analystscurrentlyrunthesemodelsusingspecialisedsoftwarepackages,withR(thefreestatisticalsoftware),etc.
• Arelevantissue:thereiscurrentlyquiteapushtowards“reproduciblescience”–e.g.https://zoonproject.wordpress.com/.Theabilitytotraceanalysescouldbeexcellent.
ANSWER:
• FirstofallyouneedtochoosewhatEBVsshouldbecalculatedandhow(inmanycasesthesameEBVcanbecalculatedindifferentways).YoucouldstartbylistingthepossibleEBVs,assigningeachonearankof“importance/impact”andarankofassociateddataavailability,thenlistthepossiblewaystocalculatethem,assigningeachwayalevelofcomplexitytofinallydecidewhatcanbedoneinthegiventimeframe.Scientistswillalsoneedtodefinewhichkindofdatawillbeused.Forinstance,ifthewholeGBIFdatabasewillbeused,youmayconsideraspecificpartnershipwiththemtobuildthenewapplicationdirectlyontopoftheirdatabase.Ontheotherside,ifonlyspecificpartsofitwillbeused,youmaycreateaseparateapplication,stillwithsignificantwebserviceinteractiontoretrievedataandalocaldatabasetostoreresults.Aninterfaceontopofthatdatabasecouldbeusedtodisplayresults.Therearemanypossibilitiesforimplementation,includingworkflowmanagementtoolsandothersoftwareframeworks–it’shardtotellatthispointwhatcouldbethebestoptions.
ANSWER:
• Inadditiontotimestampedspeciesoccurrenceandstaticenvironmentaldata,itwouldbegoodtoinvolvespeciesinteraction,physiologicaladaptionanddynamichabitatloss&qualitylayers–iftherearemodelsthatarereadytoconsumesuchdata
ANSWER:
• Withinthenext12monthsitispossibletoa)writethespecificationsforanEBVplatformbyworkingwithdifferentusergroups,andb)inparallel,conductasurveyofmajorexistingsoftwareusedbybiodiversityexpertsandtechnicalexperts.
ANSWER:
• Asstatedabove,GBIFistheagencycurrentlyperformingmanyoftheidentifieddatapreparationtasks.Softwarelibrariesandtoolsexistformostofthesteps(commercialGIS,QuantumGIS,PostGISspatialqueries,R/python/Matlablibraries…butthequestioniswhetheritmakessenseforthisdatapreparationefforttobeduplicated,orwhetheritwouldbepossibletosetupatoolbox/frameworkforthisworkflowwhichcouldbeshared.Ifso,pythoncouldbeausefullanguagesinceithasmanystatistical,datamanipulationandspatiallibraries,andcanoptionallyinteractwithArcMap/QGISwherethoseareinstalledonauser’smachine.Technically,Ithinkthatatleastaprototypeworkflowfordatapreparationcouldbeproducedinthenext12months,thoughthescrapinganddiscoveryofallrelevantinputdataisabigchallenge.
• Computation-Open-sourcelibrariessuchasSparta(https://github.com/BiologicalRecordsCentre/sparta)maybeusefulforthisprocess.Foridentification
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page67of86
andhandlingofproblematicpositionalreferencing,seee.g.http://onlinelibrary.wiley.com/doi/10.1111/j.1600-0587.2013.00205.x/abstract.
• Presentingvisualresults–manyaccessibleandinteroperableICTsolutionsareavailable,e.g.,OGCWMS/WFS/WCS(toolslikeGeonode,CartoDBandMapboxhaveloweredtheentrybarrier),andforraw/tabularresults,RESTserviceswhichreturnJSONorothereasilyusableformatsthatdon’trequirecorporatesoftwaretovisualize/chart.
ANSWER:
• TheSwedishLifeWatchAnalysisPortalhttps://www.analysisportal.se/isagoodexampleofwhatweneed.Itjustneedstobescaleduptoanycountryand(sub-)continental,andglobalscales.TheydonotspeakofEBVs,althoughtheyarecomputingsomethingsimilar.EUBONisworkingonthis.
ANSWER:
• Wearebeginningtoexploresupercomputingoptions,includingthroughMicrosoftAzure.
ANSWER:
• Thechoice-on-demandproductionorperiodicproduction-isfundamentalbecauseofitsimplicationsforthewaythatproductionprocessesaredefined,andforhowinfrastructuresareorganisedandoptimisedforcalculating,archivingandservingEBVsdata.Thechoicehastobefeasible,efficient,andaffordable.Globalcooperationisneededtoensureconsistency,servingcomparablerawdatasetsandprocessingcapabilitiesforproductionandmaintainingappropriatearchives.TheworkflowsforproducingEBVsdatahavetobecapableofbeingexecutedinanyinfrastructure,andfromanywhereintheworld.Thechoiceraisesissuesforpermissionstouseprimarydata,forsecondarydata,forcitationandattributionandforprovenancetracking.
• Nowandwithin12months,on-demandcalculationispossibleusingtheBioVeLinfrastructureand,forexampleanadaptedgenericENMworkflow.
o genericENMworkflowonBioVeLportal:https://portal.biovel.eu/workflows/440
o myExperiment:http://www.myexperiment.org/workflows/3355.html
o Documentation:https://wiki.biovel.eu/x/ooSk
ANSWER:
• Technicaloptions:
o Machoftheinfrastructurealreadyinplace:e-infrastructures,suchasLifeWatch,BioVel,iMarine,ViBRANT,thatcouldserveasbuildingblocksoftheinfrastructurerequired
• Nexttwelvemonthtarget:
o Registryofthee-infrastructuresinplace
o Listtheirtechnicalspecsandfeatures
o Deliveraplanbywhichtheservicestheyprovidecanbeinteroperable
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page68of86
• Dataand/orworkflows:
o Mostofdataneededareciteabove
o Workflowsandstatisticalsoftwareoperationalinthecontextofseverale-Infrastructures:BioVel,LifeWatch,ViBRANT,iMarine,Aquamaps,etc.
ANSWER:
• Icouldonlyprovideawildguess.I´dratherdeclinetoanswer.Ihavethefeelingthoughthatopenreusabledata(withappropriatemetadata)issimplynotavailableyet.CertainlythedeparturepointaresomealreadyexistingsystemssuchasGBIForOBIS
ANSWER:
• CurrentlythemainsourcesofinformationwillbeGBIFontheonesideanddatafromtheEuropeanFFHdirectiveontheother.Anissuewiththeestimationonpopulationstatusandtrendsisthattheunderlyingdataarenotprovidedtogetherwiththeestimation.Inadditionduetolackinginformationapartofestimationarebasedonexpertjudgementbacked-upbyin-situdata.
• Intheshorttermfocuses(next12month),theintegrationofinformationonstatusandtrendsasacompilationofestimationswillbethemostsuitableprocedureforawiderangeofspecies.Forsomealready(e.g.certainwhaleandbirdspecies)estimationmodelsonaglobalscaleexist.
• Options:ImplementationofWFS/WCSservicesforspeciesdataobservation.ExtensionofSOSservicesforspeciesobservations(issueofcomplexmonitoringschema).
• Theavailabilityofdatatoestimatespeciespopulationincludingchangesintimeseemstobeagreaterissuethanthetechnicallimitations.Neverthelesstheautomaticintegrationofuncertaintiesalongthechainofmethodsisstillanissuetobesolved.
ANSWER:
• TakingupQuestion6storyline,GBIFispossiblythemostadvanced,already-availableportalandisseedingagrowingcommunityofdevelopersthatareproducingservicesbasedonitsAPI.Also,niche-calculatingservicesareveryusefulasentrypoints.Otherglobalortheme-specificdatasets(e.g.OBIS)areequallyimportant,althoughtheyaregenerallybecomingincreasinglyinteroperable.
• Tothisregard,anintegrationofsomeoftheserelevantdevelopmentsinLifeWatchICTdistributede-Infrastructureshouldbefeasiblewithinthenext12months’timeperiod.
ANSWER:
• Theissueismorethesourcedata,evaluatingifitisfitforpurposeandexpressingthemeasurealgorithmically.Anynumberofoff-the-shelfsoftwareprovidersandopensourcesolutionsexistfordatawarehousing.
ANSWER:
• Seeabove(e.g.BioVeL).Forspeciesdistributionsmostofthestepsarealreadyinplace,saveperhapsexplicitlyvisualisingtheseresultsinthecontextofchangeinaparticularEBV.Analysesofchangeinspeciesabundancearemoreconstrainedbyavailabledata;abundancedataismorespatiallyand
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page69of86
taxonomicallybiased,theanalysesmorecomplexandvaried,butalsopotentiallymoreinformativeofthecausesofgenuinechangesinpopulationabundance.
ANSWER:
• Therearealotofchallengesindoinganythingin12months.IthinkthemaingoalshouldbemobilizingpopulationabundanceandatlasdatasetsintotheDarwinExtendedCore,anddeployacoupleofappsthatcanperformstatisticalanalysesonthose
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page70of86
Question8
Whatarethetop3-5technicalchallengesofsupporting interoperableEBVcalculationsonaglobalbasis?Howcanthesebeaddressedandinwhattimeperiod?Whohastodosomething?
ANSWER:
• Funding
• Suitablehighresolutionimagery(hyperspectral,lidar,hypertemporal)
• Linkbetweenpolicyandspaceagencies
ANSWER:
• Someofthemainchallengeswouldbe:
o Todefinethedataformatguidelinesinawaythattheycanbeappliedtoanykindofmonitoring(standardizedsurveyand/oropportunisticdata)andanytaxonomicgroups(seeQuestion5)
o OncetheEBVabundanceanddistributionwouldbeclearlydefined,toagreeonrobustandsuitablestatisticalmethodsforcalculatingbothdistributionandabundanceEBVswithrespecttotherequesteddataformatandtheheterogeneityofthemonitoring.
o Todefinecriticalstepsoftheworkflowaswelladkeylinkagesinordertoimplementitandmakeitoperational.
• Howitcanbeaddressed:
o Organisingworkshops
o Engagingparticipativecontributionofdataowners/statisticians/biodiversityexperts/technicalITexperts
o Learningandgettinginspiredbyprevioussuccessfulendeavours(e.g.seetheexcellentpublicationbyBarkeretal2015detailingaveryefficientworkflowoflargescaleecologicaldata)
o Barkeretal.2015.EcologicalMonitoringThroughHarmonizingExistingData:LessonsfromtheBorealAvianModellingProject.WildlifeSocietyBulletin9999:1–8;2015;DOI:10.1002/wsb.567
• Whohastodosomething:
o StatisticalandbiodiversityexpertsneedtodefinerobustandsuitablemethodsforEBVscalculationtogether,aswellasprovidingsoftwareorscripts.
o Technicalexpertsofdatabasemanagement/ITexpertsneedtosupporttheimplementationoftheworkflow.
o Biodiversityexpertsandtechnicalexpertsneedtoworktogetherfordefiningtheguidelinesandthedatasharing/datauseagreements.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page71of86
• Timeperiod:Withinthenext2-5yearsseemsreasonable.
ANSWER:
• Whatspecies?Theycan’tbeconsistentinternationally.
• Lackofsystematicdata.Lackofconsistent,internationallyagreedsystematicsurveysofkey/indicator/targetspecies.
• Whilenottechnical,itisthe‘political’thatislikelytobethemostlimitingfactorinachievingeffectiveabundanceanddistributionchangemonitoring.Long-term,ongoingfundingisrequiredforregularsurveys.
ANSWER:
• Thepatchinessandsparsenessofdataespeciallyacrosscontinentalscales.
• LackofclarityaroundpriorityspeciesfororganisinganddeliveringEBVs.
• Absenceofconsistentmodellingapproachesfordeliveringatleastbest-availableEBVdata(orevenaclearandconsistentvisionforaglobalmodelledEBVcomponentine.g.GEOSS)
• LackofclarityaroundGEOBON'splaceindeliveringthemodelleddatalayertositbetweene.g.GBIFontheonesideandIPBESandonwards(toCBD,etc.)ontheother.
ANSWER:
• Bydefaultthecomputationalcapacityisachallenge,butitcanbeovercomeeasily.Dataqualityissuesareveryimportanttoaddressinordertohavemoretrustfulresults,thatinvolvesgeoreferencesforhistoricaldatathatcanbehardtodeterminate.Inmyopinion,themostdifficultchallengesareactuallysocial,sincethecultureofmakingdataopenisnotalwayswellreceived,soassertiveapproximationtodataholderswillbekeyforenrichthesystemcontent.
ANSWER:
• Skillandtraining–withoutthepeoplewiththeskillsandknowledgetodealwithinteroperabilityissuesatagloballevel,thiscannothappen–Researchfundingbodiesneedtoaddressthis(seeBelmontforumreporthttp://www.bfe-inf.org/document/community-edition-place-stand-e-infrastructures-and-data-management-global-change-research)
• Standardsdevelopmentandadoption–Inordertoprovidegloballyinteroperableresourcesrequiresagreementonstandards.Theinternetistheobviousplacetostartthisandmanysciencecommunitiesnowusethisasthebasisoftheirresearchcollaboration.Thiswillhavetorunthroughtoagreementonvocabulariesandontologiestolinkinformationtogether–Thereareexistingstandardsforsomeofthisbutthereneedstobesomemechanism/governancetodriveadoptionofthesewithinday-to-daywork.
• DataPolicy–thereclearlyneedstoopennessintheavailabilityofdata(andalgorithms)sosupportworkflowoperations.Theselegalframeworksexistbutarenotalwaysenforceastheconflictwithresearchersexpectationsof“ownership”ofthedatatheyhavecreated.Thisisstillabigculturalissueinthatneedstobeaddressedatresearchagencylevelwithinlegaljurisdictionsandbyscientificrewardswithinscientificjournals.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page72of86
• Long-termfundingforinformaticsR&Dandsystemsoperation–researchersanddecisionmakingwillnottrustsystemsthathaveuncertainlifespans.Wecannotconvinceresearcherstoentrustdatatosystemswhichhavenolong-termfunding.Ifdecisionsupportsystemsareseenasimportantfordealingwithenvironmentchallenges,theymustbeseenaspartofnational/internationalinfrastructure.Thestabilityrequiredtoestablishthesesystemsandwaysofworkingcannotbeachievedthroughaseriesof3to5yearsresearchgrants.Thereneedstoarollingfundingandreviewofwhatisessentialinfrastructurefordevelopmentofessentialbiodiversityindicators–internationalagreementrequiredbetweenfundingbodies(??)–seeBelmontorRDA??
ANSWER:
• Dataformat:theinformaticsformatofdatatobecorrelatedcanbehighlyvariable.Firstofall,uniqueformatsmustbeselectedforeachtypeofdata.Thentheinfrastructureshouldbedesignedtomanageandintegratetheseformats.
• Accesstodata:itwouldbenecessarytodefineifaccesstodataandtoolsavailableintheinfrastructureisunrestricted,restrictedtocertainusersorunderasimpleregistration.
• Thedatatransferprotocolsmustbesafe:forexample,theuserwhosubmitshisdatadoesnotwanttheybecomepublic.
• Astoragesystemshouldbeimplementedandthefollowingquestionsshouldbeaddressed:whichdatatokeep?Howlong?
• Itcouldbeusefultoinvestigateotherbioinformaticinfrastructure(Elixir,Lifewatch,BioVeL,etc.)andplatformsalreadyavailableforstorageandanalysisofmolecularbiodiversitydata.
ANSWER:
• Ensurethatdatacollectedunderdifferentprotocolsisstandardizedinsomewayandweightedaccordingly
• Platformstosharedatabasesinafederatedway,sothatdatacanbeeasilyaccessibleononesite(forexampleseewildlifeinsights.orgforcameratrapdata).
• Automationofanalysesisachallenge;modelbuildingandconstructionrequiressomedegreeofuserinputunlessanarrowclassofmodelsandcovariatesisrun.
• Willingnessofresearchers,institutionsandgovernmentstosharedataonaglobalscale.
ANSWER:
• Aspartofongoingqualityassessments,evenifonehasanautomatedworkflowtoingestdatafromavarietyofsources,thereshouldbeawaytocomputeaggregatedindicatorsofdataqualityforagivensetofdatasources.Supposeaworkflowingestsfromfourthreedatastreams:
o Dataset1:SubsetofdataretrievedwithconstraintsAfromrepositoryX
o Dataset2:SubsetofdataretrievedwithconstraintsBfromrepositoryX
o Dataset3:SubsetofdataretrievedwithconstraintsAfromrepositoryY
• Somesuiteofqualityindicatorsshouldbecomputableagainstalldatasets,toproduceinformationlike:
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page73of86
Qualityindicator1 Qualityindicator2 Qualityindicator3
Dataset1
Dataset2
Dataset3
• Thequalityindicatorcouldbeofthetype{High,Medium,Low}oraquantitativescore.Itmaybeagoodideatorunauditsofthosequalityindicatorsonsomeregular,orirregular,schedule,ifyou’rerunningaservicetocomputeEBVsforpolicyuse.ThiswillenablesomelevelofcertificationofthequalityoftheEBVforpolicyuse,whichmayeaseanyconcernsaboutdecision-makingbasedoncomputedEBVsthatusedatafromvarioussources.
• ThereisarecentlyNSFfundedprojectcalledMetaDIG(leadbystafffromDataONEandtheUSNationalCenterforEcologicalAnalysisandSynthesis)lookingintodevelopingqualityindicatorsformetadataanddata.Theiraimistodevelopasuiteofqualityindicatorsthatvariouscommunitiesofpractice(e.g.theUSGeologicalSurvey,theLong-termEcologicalResearchnetwork)canuse.
• Provenancemetadatastandard.IamnotsurehowwidespreadisthecommunityacceptanceoftheW3CPROVstandardforprovenancecapture,butitismyhopethatabodylikeGLOBIS-BplaysapartinexaminingtheapplicabilityoftheprovenanceschemathatPROVrecommends,anddetermineswhetheritissuitableforthecomputationofEBVs.Ifeelthatlikewithqualityindicators,makingsurethatworkflowsareaccompaniedbyprovenancemetadatawillbeessentialatsomepointdowntheroad.Thisisespeciallytruegiventheimportanceofreproducibility,whichhasbeenanissuediscussedwithintheBelmontForume-infrastructureanddatamanagementcooperativeresearchaction.
ANSWER:
• Implementationoftheworkflow(s)inadistributedenvironment,assigntasksonnodes(servers/hubs)andlinkthemtogether,2.Efficiencyoftheworkflow,3.Updatingofworkflowandnodes,4.Visualizationforresults
• Ifinstitutionsinbiodiversityinformaticsintheworldworktogether,thisjobcanbedonein36to60months.
• Themostimportantthingisfindenoughfundandtheprojectbewelldesigned.
ANSWER:
• substantialworkassessingavailablespeciesdataandwhetheritcanbemassagedintoaformsuitableforchangemodelling(e.g.intoaformthatallowsdetectiontobeestimated)
• currentonlinedataoftentreataspresence-onlydatathatareactuallyfromstructuredsurveys(i.e.absencesaren’trecorded;itcanbehardtoidentifyallsitesfromtheonesurvey;informationonsurveymethodscanbelost).Improvethis?
• geographicalbiasesincollectionsdataarealreadywellknown(e.g.Amano,T.&Sutherland,W.J.(2013)ProceedingsoftheRoyalSocietyB280.Whataretheprioritiesformonitoringchange?Arepriorityregionsthemostpoorlysampled?Ifso,cansurveysbedesignedtosatisfyseveralneedsatonce(sodecisionsneedingdataNOWareserved,aswellaslongertermaimsformonitoringchange)
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page74of86
• environmentalpredictors–aretheyadequate,andatfineenoughresolutiontobeusefulformonitoring?–samefordetectioncovariates.
• Modelling:howtomanagethetradeoffbetweenwantingitrigorousenoughtoenablereliableestimatesofchange,yetsomehowwidelyavailable?
ANSWER:
• LivingPlanetIndexfromWWF-ZSLandMapofLifefromYale,butfrombothinitiativestheunderlyingdatathatisusedtoderivetheindicesandmodelsisnotavailable.
ANSWER:
• Again,thiswoulddependontheEBVsthatneedtobecalculatedandhowtheywillbecalculated,butpotentialchallengesinclude:
o Handlinglargevolumesofdata(fetchingthemremotelyandprocessingthem).
o Designingforefficienthumaninteraction,ifthiswillbeneeded.
o Dependingonchangestobemadeonthird-partysystems(suchasaskingotherinitiativestocreatenewwebservicesontopoftheirdataormakeotheradjustmentsontheirsystemssothattheycanbeintegratedwithGLOBIS-B).
ANSWER:
• Identifyingandsupportingkeynon-technicalaspectsofinteroperability,includingsemantic,legal,policy,andpoliticalorethicalconsiderations(timeframe:1-3years;couldbeaccomplishedbyahandfulofworkshopsfollowedbyperiodofcommentandconsultation).
• Findingabalancebetweenautomatedmatchmakingandmatchmakingthatrequireshumaninput.ThischallengewillbecontinuallyrevisitedthroughtheEBVandplatformdesignprocess(timeframe:3years+,dependingonfunding).
• Developinganddocumentingasystemthatistrulyaccessibletoarangeofstakeholderaudiences,includingprofessionalresearchers,amateurresearchers,educators,andpolicymakers.AchievingthiswillrequireaclearstatementofgoalsandpurposeadvancedbytheGLOBIS-Bteamandcollaboratorsfollowedbyaninclusivedesignprocess(timeframe:3-5years+,dependingonfunding).
• Makingsurethatthisprojectreachesthewidestpossibleaudiences,withinandbeyondthebiodiversityandlargerscientificcommunity(3-5years+).
ANSWER:
• Patchinessofthedata:distinguishinggapsfromabsences.Totacklethis,directedandsystematicsamplingisneeded.High-qualitycitizenscienceprojectshavesomepotentialbutareofrestrictedvalueingeographically/politicallyinaccessibleareas.
• Gettingnon-digitised/archiveddataintoGBIF–alreadyunderwaywithnewtaskforce,andsomewell-designedcitizenscienceprojectsbasedaroundnaturalists’notebooks.Ensuringthatthesenewobservationsalsofeedimprovedrangemodelling.Researchcouncilsandindividualscientistsalsomaybeabletosupportthiseffort.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page75of86
• ReproducibilityandrobustnessoftheEBVcalculations-ensuringthattheyscalecorrectlywhencomputedatsmallerscales,thatresultsareconsistentandwillbetrustedbydecisionmakers.
ANSWER:
• Fordistributionmodelling,gettingenvironmentaldatalayersbeyondWorldClim.
• Forabundance,downloadandcleaningoffullGBIFdataintoanOLAPcube.
ANSWER:
• SGDR(suigenerisdatabaseright)-Itisfundamentaltokeepinmindthedistinctionbetweendatacreationanddatacollection:onlydatacollection(orpresentation/verification)canleadtotheexistenceofSGDR(ifalltheotherrequirementsaremet).IfdataarecreatedthereisnoSGDR.Tomakethings"easier"thereistheuncleardefinitionofdatacreationanddatacollection(adifferencethatnotnecessarilycorrespondstothescientific/epistemologicaldefinition).
• Importanceofcorrectlabellingofdataandmetadata-Itismandatorythatalldata/datasetareproperlylabelledwiththerighttools:PublicDomainMark,CC0,CCPL.
• TDM,copyright,SGDRandlicences:itisimportanttoemploylicencesthataddressproperlytheseconsiderations(e.g.CCPLv4.0yes;CCPLv3.0dependsbutusuallyno;CCPLv2.0no).
• Inordertolicencedataproperlyitisimportantthatnotonlytherightlegaltoolsbeavailable(tosomeextenttheyalreadyare,e.g.licences)butthatalsotherightsetofincentivesbeavailable(i.e.ifinordertoobtaingrantsorgettenureresearchersneedtohavehighImpactFactors,thentheywillpublishinhighIFjournalsthatnotnecessarilyfollowOAprinciples).Therefore,researcherscannotbe"leftalone"indealingwithcopyright/assessmentissues,buttheyneedprotectivelegislativeinterventions(liketheGermanandDutch,notliketheSpanishorItalian;theUKsolutionisdebatable)+therightsetofincentivesfromfundingbodiesandemployers(e.g.onlypapers/datasetsself-archivedinOA,akagreenroad,willbeusedforevaluationpurposes).
• OAtobesuccessfulrequiresanewapproachnotonlyinthepublicationofscience,butalsoinitsevaluation/assessment.
ANSWER:
• OurcontributiontoEBVarefromtwoangles,copyrightanddatasharingpoliciesontheonehand,andformthepublishedrecord.
• Whatwecancontributeistolookatdata,dataqualityandhowthisrelatestoopenaccesstothedata
• Fromthepublishedrecordthisonlymakessenseintwospecificaspects:Publishingdatasetssothattheycanbecited,e.g.usingeitherGBIForPensoftpublishingfacilities,whichisrelevantinthelongertermtosetupmonitoringschemes.
• Anotheraspectofthepublishedrecordisthatisoftentheonlysourceforrarespeciesbeyondbutterflies,orplants.Thismightaddaspeciallayeroftaxathatrepresentalargepartofbiodiversityandareinmostcasescompletelyunderrepresented.Atthesametime,thequestionmightberaise,whethertheknowndataisstrongenoughtocontributemorethananecdotalevidencetoEBV.
• ForustheexperiencetoparticipateinGLOBIS-Bisthatweareveryinterestedtofindoutweakpointsindata,EBVworkflowanddatapublishing,andhowwecanimprovefuturedatapublishing.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page76of86
ANSWER:
• Highqualitydatafromcharismaticorganismsinrichcountriesoftennotcomparablewithsparsedatafromelsewhere.Weneedtoavoidthelowestcommondenominator.Iseethisessentiallyasamodellingproblem,ratherthanadataavailabilityproblem.
• Metadata(seeabove):moresophisticatedobservationmodelswillbecomputationallyintensive.
• Multispeciesmodelswillbeevenmorecomputationally-demanding.
• Spatialscale,spatialresolutionandtemporalresolutionoftheoutputs.
ANSWER:
• Datagenerationisstillthelimitingfactor:weneedtomeasurefaster,cheaper,automated.
• LifewatchBelgiumdevotesalargepartofbudgettoinstallbiosensornetworksforthemeasurementofphytoplankton,zooplankton,fish,bird,bats.
• Someexamples:http://rshiny.lifewatch.be/
• TheJericoNextandAtlantosprojectsareexamplesatEuropeanandTransAtlanticscale.
ANSWER:
• Therearemultipletechnicalchallengesbutthemainchallengeliesingettingresearchinfrastructuresoperatorstoworktogetheratthegloballeveltopursueanagreedroadmap(e.g.,basedonthatcomingfromtheCReATIVE-Bproject)thatensuresthatthevariousresearchinfrastructuresareinterlinkedandinteroperableinbothtechnicalandlegalterms.Thisrequiresinvestmentfunding.TheresponsibilityshouldbetakenupbytheBelmontForum,perhaps?
ANSWER:
• Challenges:
o Secureunlimitedcomputationalcapacity
o Ensuretransparencyoftheprocess
o Providewebservicesbywhichviewingofdata,usingofdataandworkflow/softwarewillbetrackedandreportedbacktothedevelopers
o MappingofEBVsatglobalscales
• Waystoaddresstechchallenges:
o Engaginggridandcloudinfrastructure
o Developtoolsforthetraceabilityoftheentireprocessonthecyberspace
o Developthepipelinelinksbetweenthedataandworkflows/softwareavailable,aswellaswiththeavailablee-infrastructures
• Whohastodo?
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page77of86
o Scientificcommunityfromaroundtheworld-mobilizingthelargeNetworks:e.g.MARS,WAMS,etc.formarinebenthicbiodiversity,therearemanycommunitiesforotherregimes
o ICTcommunityrelevantwiththebiodiversityinformatics
o EUandotherfundingagenciesatnational,regional,continentalandglobalscale,tocreatetheappropriatefundinginstruments,atleastforthecoordinationoftheactivities.
ANSWER:
• Agreedmetadatastandards(includingrightsstatementmetadata,whichdonotexistatthismoment)foralltheexistingdatasetsunderanswertoQ1.
• Agreedstandardtoincorporateabundance-baseddataoccurrences(Isthereanyoneonplacewithenoughconsensusandreliability?).
• AgreedGISstandardstofacilitatethechartsexpressionsofQ2(andopensourcebased)
• Assumingthatthereisminimumagreementonanswertotheprevious7Qs(whichisabackgroundminimalneed,atleastforsomespeciesortaxa)themainneedIassumeisconductingareallifetestinginwhichdigitizednationalinventories,GBIFdatasets,andbiodiversityspecies-relateddataminingofscientificpublicationsandcitizens´sciencecrowdsourceddata,usingmultiple(oratleastdouble)VREbasedITstocontrolreliabilityofresults.Itwouldneedclearpolicyagreementsandfunding.
ANSWER:
• Datamobilisationacrossmultiplesources(incl.legacyliteratureandcollections)
• UseofcommonorinterchangeableStandardsthatwillallowdatainteroperability
• Openandwell-documentedwebservicestoservedata
• RobustregistriesofdataservicesandStandards
• Developmentofend-userservicestailoredtospecificaudiences
ANSWER:
• Technicalchallenges:
o Datadescriptionincludingtheuncertaintiesinamachinereadablemanner.Oneoftheissuesistheprovisionofthisinformationwhichisnotprimarilyatechnicalissue
o Taxonomicreferencesandmappingofspeciesnamesandspeciesgroups–shouldbealreadybesolvedintheGBIFcontext,butstillintheFFHdirectiveitisanissue
o Provisionoftimeseriesofspeciesobservationsincludingabundanceinformation–withtheissueonhowtoupscalefromregionaldatatoaglobalperspective
• Theestablishmentofconsistentmonitoringschemesforbiodiversityonnationalscaleareanimportantpre-requisiteforfurtheractivities.MethodologicaltheuseofhighresolutionEOdataforhabitatandspeciesestimationneedstobeevaluated.E.g.EcoPotentialwillfocusontheidentificationofwhalesinoneofthetestareasbasedonEOSentineldata.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page78of86
ANSWER:
• Thefirstandforemostchallengeistohaveaglobaldatabaseofoccurrencesthatincludeabundancedata.That’samajorstepfromthecurrent,presence-onlydatasetsthatmakethebulkofgloballyavailablebiodiversitydata.Achallengethatiscurrentlybeingaddressedisincorporatingsampledata.Forthistoworkproperly,therearestillunresolvedchallengesthatmightbeontrackduringthetimeframe:
o Aneffectivesystemforuniqueidentifiers(GUIDS)forbiodiversityoccurrences/objects;
o Anagreed-upon,provenstandardtoincorporateabundance-basedandsample-baseddatatooccurrencedatasets;
o areliablewaytorepresent/identify/describeabsencedata;
o Aclean,authoritativetaxonomicbackboneallowingeasyidentificationoftaxonconcepts,duplications,synonyms,andoveralldeduplicationofoccurrencedata.Someofthesechallenges,aswellasmanyothers,wereidentifiedbyalargenumberofpractitionersthroughacontentneedsassessmentcarriedoutafewyearsago(seehttps://journals.ku.edu/index.php/jbi/article/view/4126andhttps://journals.ku.edu/index.php/jbi/article/view/4094).
• Therefore,andinordertoachievethesegoals,aproperOrganizationalKnowledgeManagementMethodology(OKM)shouldbeestablishedandthenrefined-maintainedbyanOKMCommittee.TheOKMshouldbebasedonthefollowingpremises:
o HowtoidentifysomepracticalcasesfromtheperspectiveofrelevantbioticsandabioticsEBVindicatorstobeperformed.Thiswouldlargelydependonthe“quality”ofthe(meta-)dataresourcesabovementioned.ThisanalysisshouldbeperformedbyaScientificCommittee.
o Tothispurpose,tofurtherintegrate-adaptexistingWorkflowsdevelopmentsintotheLifeWatchICTdistributede-Infrastructure,sothatsomeessential“blocks”givenintheformofe-ServicescanbeofferedinordertocalculateEBVsvaluesandthenpresentedinavisualwaythroughthedevelopmentofproperVirtualResearchEnvironments-VRE.ThisshouldbecoordinatedbyaICTTechnicalCommittee.
• Therefore,notonlywearetalkingaboutthecreationandmaintenanceofaEBVs“ontology-driven”system,butalsoofhowtoguaranteethe“engineering”mechanismsassociatedtotheirintegrationintotheLifeWatch(andsimilar)e-InfrastructuresfromtheICTperspective.
ANSWER:
• Itisobviouslynecessarytohaveacommondatavocabulary,commonstandardandprotocolfordataexchangebutthisalsodependsonwhichstepoftheprocessingisdonebywhom.Thereareatleasttwobroadalternativesandeachofthesehavetheirownchallengesforglobalassessment.
o EBVsarecalculatedseparatelyforeachjurisdiction,regioncontinentandthenaggregatedorreportedglobally.
§ Advantagesofthisisthatmuchoftheburdenisdistributedamongcountries/jurisdictionsdolittleneedstobedonecentrally.Thiswouldputgreateronusoncountriestocoordinatenational,monitoring,assessmentandreportingofbiodiversity
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page79of86
andensureastrongerlinkbetweenbiodiversitymonitoringandmanagementactions,policyandlegislation
§ Challenges:Toensureconsistencyincalculation,dataqualitystandardsetc.amongcountries.Alsosomecountriesjurisdictionwillsimplynothavethestaffandresourcestodotheseanalysessosomeofthiswillhavetobedonecentrally
o EBVsarecalculatedgloballyusingdataobtainedfromeachjurisdiction
§ Advantages:transparencyandconsistencyofcalculation
§ Challenges:majorresourcesmayberequiredtochaseup,acquiredata.Anyerrorsmaynotbeeasilyrecognisedbecausethedatawillbeprocessesbypeoplewhohavelimitedknowledgeofthedata
ANSWER:
• AssumingquestionsonthedefinitionofEBVsdonotneedtobefurtheraddressed,thereneedstobeabroadrecognitionthatmostbiodiversityiscurrentlyun/under-representedbyavailabledata.Themaintechnicalchallengeswouldthenbethat:
o Accurateandwidespreadrecordingofabundancedatawithconfirmedabsences,ratherthanjustpresence-onlydata.ThiswouldsensiblybuildontheexistingGBIFarchitecture.
o Thetaxonomicbackboneneedstobeimprovedandmadeexpliciti.e.synonymiesmadeclear.
o Overtime,changestoexistingpointdatasets(e.g.GBIF)needtobea)recordedandb)explicitlypresentedi.e.newspecimenrecords,newgeo-referencingofspecimenlocalities,editstothetaxonomyandlocationdetailsofexistingrecordssuchasre-determinationsandmoreprecisegeo-referencing.Forplantspecimens,duplicaterecordsofthesamecollectionfromdifferentinstitutionsneedtobeexplicitlylinked;ifgeo-referencingisundertakenretrospectivelythismaydifferbetweenduplicatesofthesamecollection.
o Anestablishedbutflexibleworkflowforspeciesdistributionmodellingandstackingspeciesextentswouldbeimperative.
o OneoftheoutstandingconceptualchallengesforthedevelopmentofEBVsisagreementoncommonscales/units/indicesofmeasurementtoallowdataone.g.speciesdistributionstobeintegratedsensiblywithdataone.g.habitatextent,andinawaycomparablefore.g.allelicdiversitywithe.g.habitatextent.CombiningdifferentEBVsinastandardisedwayistherealpowerofthewholeconceptualapproach.Alternativemetricssuchaseffectivenumbersmaybeworthexploring.
ANSWER:
• Themainproblemiscollectingthedataandpublishingthedataopenly.Atleastwecouldmakealotofinroadsonthelater.
END.
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page80of86
Annex2:Figure3expandedforreadabilityInthisannex,Figure3:Dashboard:Workflowsteps,risksandneedshasbeenexpanded/brokenapartintoitsconstituentpartsoverseveralpagestomakeiteasiertoread.ThechartsillustrateEBVsvsRI/ACTWorkflowSteps,Needs&Risks
GeneticComposition12%
SpeciesPopulation50%
SpeciesTraits13%
CommunityComposition
13%
EcosystemFunction6%
EcosystemStructure6%
EBVCLASSESBEINGADDRESSED
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page81of86
85% 90%
45%60%
70%80%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
DataCollectionand/orRetrieval
DataTransformation
ModelBuilding TestingandValidation
Presentation EBVOutput(y/n)
WorkflowStepsOverview
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page82of86
0%
10%
20%
30%
40%
Taxonomy,ontology,…
Databrokering
Dataqualitycontrol…
Openaccess
Funding
Interoperability
Languages
Interactivtity(connection…
Cloudaccess
StorageAccess
Dataaccessinternationally
Servicecatalogue
Needs
0%
10%
20%
30%
40%
Taxonomy,ontology,vocabulary
Interrelationbetweenspecies
Spatialdimensionandassociatedheterogeneity
Databrokering
Dataqualitycontrol(methodo
&tools)Interoperability
Sustainability
Fragmentationofcollections
Lackofpolicy
Strategyalignment
Risks
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page83of86
57% 57%
86%100%
71%
43%
71%86%
71% 71%86% 86%
71%57%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Identifyanddiscover
appropriatedatasources
Importrawdatawithmetadata
Checkdatasharing
agreementsandlicenses
Basicdataconsistencycheckwith
filters(relevant
dates,timesetc.)
Combine/joindifferentdatasets
Identifyduplicatedata
Checkwhetherdatafitforpurpose
Matchtaxonomy
Detaileddataqualitycheckanddatacleaning(errors,precision,accuracy,outliers)
Checkwhetherdatacoverage
matchesforpurpose
Datapre-processing
Dataanalysisand
processing
Visualisation Download
WorkflowStepsDetailed
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page84of86
0%
71% 71%
100%
0%14%
64%86%
100%
0%0%10%20%30%40%50%60%70%80%90%
100%
WorkflowStepsCoverage
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page85of86
3% 1% 1%4% 5%
3%5% 5% 5% 6% 5% 5%
3% 3% 3%0%
5%
10%
15%
Automatedprocessing,
workflow,etc.
Avoidingduplicationofeffort;sharingofintermediate
products
Closingdatagaps;
accessibilityofrelevantco-variatedata
Dataaccessibility,sharingand
usageagreements
Dataharmonization,normalisation,standardisation
Dataintegrationchallenges
Dataquality-assurancechecksandmetadata
Dataquality-fitnessforpurpose
Dataquality-preparationfor
analysis
Manypossibleapproaches
Partialautomationofanalysis,expertjudgement
Presentation,provenance,traceability,verification
Standardisationandvalidationofmethod
Sustainedinfrastructure;maintainingcapability,capacityandperformance
Taxonomicanalysisandreconciliation
WorfkflowKeyStepsforSpeciesDistribution/AbundanceEBVExample
GLOBIS-B(654003)
D3.1 Technical issues and risks v0.1 Page86of86
0%
10%
20%
30%
40%
50%
60%
70%
80%
AssociatedEBVExamplesConsidered