bionym
![Page 1: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/1.jpg)
iME4d - BiOnym
A concept-mapping workflow for taxon names reconciliation
Friday 7 March 2014 – Rome
Fabio Fiorellato, Edward Vanden Berghe, Gianpaolo Coro, Nicolas Bailly
![Page 2: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/2.jpg)
Big Data makes its way into biology
• Data volumes increase dramatically
– Management of large databases (millions of records) easier
• no longer the realm of professional IT people
– Biologists wake up to the advantages of
• Good data management, including preservation
• Data sharing
• Makes it possible to do science in a different way
![Page 3: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/3.jpg)
‘Big Data’: Need for data integration
• Becoming a very realistic possibility
– Management of DBs of millions of records
• Needs integration of small, restricted-scope datasets into massive databases
– Intra-discipline integration (homogeneous)
– Inter-discipline integration (heterogeneous)
• Individual studies too small to inform on a scale commensurate with problems humankind faces
– Evidence-based management of living resources
– Climate change, global warming…
![Page 4: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/4.jpg)
iMarine biodiversity ‘ecosystem’
Taxon name enrichment
Taxon name reconciliation
Taxon name access
Occurrence data access
Environmental data access
openModeller
AquaMaps
Distribution modelling
Occurrence data enrichment
Occurrence data reconciliation
![Page 5: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/5.jpg)
Central role of taxon name reconciliation
Taxon name enrichment
Taxon name reconciliation
Taxon name access
Occurrence data access
Environmental data access
openModeller
AquaMaps
Distribution modelling
Occurrence data enrichment
Occurrence data reconciliation
![Page 6: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/6.jpg)
Taxonomic names are the keys…
• … Keys to bind together information on the same taxon from different sources
• But there are problems
– Different research groups use different spellings
– Accidental misspellings
– Synonym, homonym reconciliation (but outside scope of BiOnym)
![Page 7: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/7.jpg)
Some people can’t type
• Asthenognathas inaefaipes
• Asthenognathus inaeqipes
• Asthenognathus maefaipes
• Astheognathus inaequipes
• Asthenognathus inaeguipes
• Astheognathus inaeqinipes
• Asthenognathus inaequipes
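One way to put a number on how far each variant above is from the intended name is an edit distance. The sketch below uses a plain Levenshtein distance normalised to a 0..1 score. It is an illustration only: the slides do not say which similarity measure BiOnym's matchlets use, and taking the last spelling on the slide as the reference name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Assumption: the last spelling listed on the slide is the intended one.
REFERENCE = "Asthenognathus inaequipes"
VARIANTS = [
    "Asthenognathas inaefaipes",
    "Asthenognathus inaeqipes",
    "Asthenognathus maefaipes",
    "Astheognathus inaequipes",
    "Asthenognathus inaeguipes",
    "Astheognathus inaeqinipes",
]


def rank_variants(reference, variants):
    """Return (score, variant) pairs, most similar first."""
    return sorted(((similarity(reference, v), v) for v in variants), reverse=True)


if __name__ == "__main__":
    for score, variant in rank_variants(REFERENCE, VARIANTS):
        print(f"{score:.2f}  {variant}")
```

A threshold on such a score is what separates "probably the same name" from "probably a different name"; where to put it depends on the use case.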
![Page 8: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/8.jpg)
Things can go wrong with Excel…
• Clupea harengus Linnaeus, 1758
• Clupea harengus Linnaeus, 1759
• Clupea harengus Linnaeus, 1760
• Clupea harengus Linnaeus, 1761
• Clupea harengus Linnaeus, 1762
• …
![Page 9: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/9.jpg)
… very wrong
• Clupea harengus Linnaeus, 1758
• Clupea harengus Linnaeus, 1759
• Clupea harengus Linnaeus, 1760
• …
• Clupea harengus Linnaeus, 2254
• Clupea harengus Linnaeus, 2255
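The pattern on the two slides above (the same name and author on adjacent rows, with the year creeping up by one each time) is regular enough to flag automatically before any matching is attempted. The following is a minimal sketch of such a check; the function name `find_fill_down_runs`, the `min_run` parameter and the "name, year" layout are assumptions made for the illustration, not part of BiOnym.

```python
import re

# Assumed layout: "<name and author>, <4-digit year>" on every row.
AUTHORSHIP = re.compile(r"^(?P<name>.+?),\s*(?P<year>\d{4})$")


def find_fill_down_runs(rows, min_run=3):
    """Find runs of adjacent rows with identical text and a year rising by exactly 1.

    Returns a list of (name, first_year, last_year) for runs of at least
    `min_run` rows, which is the signature of a spreadsheet fill-down.
    """
    runs = []
    current = None  # [name, first_year, last_year, length]
    for row in rows:
        match = AUTHORSHIP.match(row.strip())
        parsed = (match["name"], int(match["year"])) if match else None
        if (current and parsed
                and parsed[0] == current[0]
                and parsed[1] == current[2] + 1):
            current[2] = parsed[1]   # extend the running sequence
            current[3] += 1
            continue
        if current and current[3] >= min_run:
            runs.append(tuple(current[:3]))
        current = [parsed[0], parsed[1], parsed[1], 1] if parsed else None
    if current and current[3] >= min_run:
        runs.append(tuple(current[:3]))
    return runs


if __name__ == "__main__":
    rows = [f"Clupea harengus Linnaeus, {year}" for year in range(1758, 1763)]
    rows.append("Gadus morhua Linnaeus, 1758")
    print(find_fill_down_runs(rows))
```

In a flagged run only the first row is likely to carry the original year, so the remaining rows are candidates for correction rather than for matching.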
![Page 10: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/10.jpg)
Taxonomic names are the keys…
• … Keys to bind together information on the same taxon from different sources
• But there are problems
– Different research groups use different spellings
– Accidental misspellings
• Reconciliation is a necessity, not a luxury!
![Page 11: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/11.jpg)
Existing systems…
• … Are not flexible
– We need flexibility, as our use case will dictate what the ‘optimal’ behaviour of the system is
• E.g. manual vs automatic systems
• … Are often coupled to a single ‘reference list’
– Using a different taxonomic scope for test and reference only increases false positives
• E.g. TaxaMatch with IRMNG…
• … Don’t always have the throughput needed for large-scale projects
– Largest database approx. 20M names: too many pairs!
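The "too many pairs" remark can be made concrete with a little arithmetic. The sketch below counts the comparisons an all-against-all match would need and shows one common way to cut them down, blocking on a cheap key so that only names sharing the key are compared. Blocking is named here as a generic technique and is not claimed to be what BiOnym does; the input size of 100,000 names and the first-letter key are assumptions chosen for the example.

```python
from collections import defaultdict


def naive_pair_count(n_input: int, n_reference: int) -> int:
    """Comparisons needed if every input name is scored against every reference name."""
    return n_input * n_reference


def first_letter(name: str) -> str:
    """Hypothetical blocking key: the lower-cased first character of the name."""
    return name[:1].lower()


def blocked_candidates(input_names, reference_names, key=first_letter):
    """Yield only (input, reference) pairs whose blocking keys are equal."""
    blocks = defaultdict(list)
    for ref in reference_names:
        blocks[key(ref)].append(ref)
    for name in input_names:
        for ref in blocks.get(key(name), []):
            yield name, ref


if __name__ == "__main__":
    # 20M reference names is the figure from the slide; 100,000 inputs is assumed.
    print(f"{naive_pair_count(100_000, 20_000_000):,} naive comparisons")
    reference = ["Clupea harengus", "Gadus morhua", "Caranx hippos"]
    print(list(blocked_candidates(["Clupea harengas"], reference)))
```

The trade-off is that any misspelling affecting the blocking key itself (here, the first letter) is never compared against the right reference name, so the key has to be chosen with the expected errors in mind.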
![Page 12: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/12.jpg)
Our need
• A flexible, highly customisable, workflow-based approach to taxon name matching
– User controls input
– Output can be used as input in other processes
– Running on high performance computing infrastructure
BiOnym!
![Page 13: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/13.jpg)
Introduction to BiOnym
• As a workflow for taxon name mapping and reconciliation, it is a real-world application of the concept-mapping principles
• It is focused on the domain of taxonomy, with an initial restriction to marine species only
• Provides a full workflow (not only the concept mapping part)
• Tries to address - and possibly solve - many issues common to the taxonomic community
• Its key concept is “species taxonomy”, where concept properties are the taxonomic atoms
• Is open to integration of third-party components
• Takes advantage of the iMarine distributed infrastructure
![Page 14: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/14.jpg)
The iMarine solution: existing state-of-the-art
• A general purpose concept mapping framework (COMET) was already available in FAO:
– based on an existing FAO product (limited to the fishing vessels domain) initially developed with the support of the Japanese trust fund
– domain independent (can be tailored to any custom domain with little effort)
– provided with all the necessary building blocks and components for general purpose usage
![Page 15: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/15.jpg)
The iMarine solution: the quest for integration
• The integration of COMET inside iMarine was hailed and expected.
• Its main challenges:
– Identify and define the custom domain (biological taxonomy)
– Design and implement:
• custom COMET matchlets (engines assigning similarity scores to pairs of names)
• additional, reusable tools for data interchange and data preparation (DwCA converter, input parser, pre- and post-processors)
– Enable components to be easily distributed among worker nodes inside the infrastructure
– Integration in the iMarine Statistical Manager
![Page 16: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/16.jpg)
The iMarine solution: a success story
• The COMET integration inside iMarine, as part of the BiOnym workflow, is an example of a success story:
– Solving the integration challenges required limited effort
• Harvest names for input through iMarine tools
• Send output from BiOnym/COMET on to further tools
– The core matching capabilities of BiOnym were first made available in June 2013
• Pre- and post-processing; parsing
• Matching through (a series of) matchlets, assigning a similarity score to pairs of names
– The modular architecture enabled developers to add new functionalities or improve existing ones with ease
![Page 17: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/17.jpg)
BiOnym key concepts and features
• Its modular architecture is open to contribution and alternatives
– Workflow stages can be plugged in with custom business implementations
– Can leverage third-party components (e.g. the input data parsing is available either as an in-house component or as a wrapper of the GNI parser from globalnames.org)
• Based on standard and open formats
– Reference data are synthesized from DwCA files
– Input data and matching results are expected and produced as CSV files
– Matching results can also be emitted as XML files in the COMET format
• High flexibility
– Multiple chained matchers, each with its own configuration and thresholds
– Third-party matchers (e.g. Tony Rees’ TaxaMatch) can be seamlessly ‘wrapped’ and plugged into the workflow
– Support for collaborative matching results evaluation (expected soon)
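The idea of "multiple chained matchers, each with its own configuration and thresholds" can be sketched in a few lines. This is an illustration under stated assumptions, not the COMET or BiOnym API: the `Matcher` class, `run_chain` and the three scoring functions are hypothetical names, the standard library's `difflib.SequenceMatcher` stands in for a real matchlet, and the rule that a name stops at the first matcher whose threshold it meets is one plausible chaining policy among several.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


@dataclass
class Matcher:
    """One stage of the chain: a scoring function plus its acceptance threshold."""
    name: str
    score: Callable[[str, str], float]
    threshold: float


def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def case_insensitive(a: str, b: str) -> float:
    return 1.0 if a.casefold() == b.casefold() else 0.0


def fuzzy(a: str, b: str) -> float:
    # Stand-in similarity; a real matchlet would be domain-specific.
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def run_chain(name: str, reference: List[str],
              chain: List[Matcher]) -> Optional[Tuple[str, str, float]]:
    """Try each matcher in order; return (matcher, reference name, score) or None."""
    if not reference:
        return None
    for matcher in chain:
        best_score, best_ref = max((matcher.score(name, ref), ref) for ref in reference)
        if best_score >= matcher.threshold:
            return matcher.name, best_ref, best_score
    return None


CHAIN = [
    Matcher("exact", exact, 1.0),
    Matcher("case-insensitive", case_insensitive, 1.0),
    Matcher("fuzzy", fuzzy, 0.9),
]

if __name__ == "__main__":
    reference = ["Clupea harengus", "Gadus morhua"]
    for query in ("Clupea harengus", "clupea HARENGUS", "Clupea harengas", "Xyz abc"):
        print(query, "->", run_chain(query, reference, CHAIN))
```

Putting cheap, strict matchers first means the expensive fuzzy stage only has to decide the names the earlier stages could not settle, and wrapping a third-party matcher amounts to supplying another `score` function.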
![Page 18: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/18.jpg)
BiOnym System: Overview
![Page 19: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/19.jpg)
BiOnym Workflow
![Page 20: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/20.jpg)
Where are we?
• Infrastructure has largely been built
• User-friendly GUI is under development
• Evaluation
– Efficiency: speed of computations
• Parallel system, compares well with others
– Effectiveness: are the results OK?
• Ran experiments on different test datasets
– Deliberately introducing misspellings in known lists
– ‘Real’ misspellings manually corrected for other purposes
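The first kind of effectiveness experiment mentioned above, deliberately introducing misspellings into a known list, can be sketched as follows. Everything here is an assumption made for illustration: the single-character error model in `misspell`, the `recovery_rate` metric, and the use of `difflib.get_close_matches` as the matcher under test. The slides do not describe the actual experimental protocol.

```python
import random
from difflib import get_close_matches

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def misspell(name: str, rng: random.Random) -> str:
    """Apply one random edit (delete, substitute or transpose) away from the ends.

    Assumes the name has at least three characters.
    """
    i = rng.randrange(1, len(name) - 1)
    op = rng.choice(("delete", "substitute", "transpose"))
    if op == "delete":
        return name[:i] + name[i + 1:]
    if op == "substitute":
        return name[:i] + rng.choice(LETTERS) + name[i + 1:]
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]


def closest(name, reference):
    """Matcher under test: the single most similar reference name, if any."""
    found = get_close_matches(name, reference, n=1, cutoff=0.0)
    return found[0] if found else None


def recovery_rate(reference, matcher, seed=0):
    """Fraction of corrupted names that the matcher maps back to their source."""
    rng = random.Random(seed)
    hits = sum(matcher(misspell(true_name, rng), reference) == true_name
               for true_name in reference)
    return hits / len(reference)


if __name__ == "__main__":
    known = ["Clupea harengus", "Gadus morhua", "Thunnus albacares", "Solea solea"]
    print(f"recovered {recovery_rate(known, closest):.0%} of corrupted names")
```

Because the corrupted names are generated from a known list, the correct answer for every test case is available without manual curation; the second kind of test set, real misspellings corrected by hand, checks that the synthetic error model is not too kind.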
![Page 21: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/21.jpg)
The BiOnym Interface
Never mind the small print.
Step 1: Select your data
Step 2: Compose the matching process. This relies on infrastructure resources
Step 3: Review results. This can be private and ‘for your eyes only’, or public.
![Page 22: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/22.jpg)
The BiOnym Workflow
![Page 23: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/23.jpg)
Visualising quality assessment of the results of BiOnym
![Page 24: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/24.jpg)
Where to from here?
• Validation
– Not in terms of quality of output but…
– Uptake by the biodiversity community
• Sustainability
– Who will take over maintenance after iMarine ends?
• BiOnym is a tool, it is the means to an end
– Support Ecosystem Approach to Fisheries
![Page 25: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/25.jpg)
iMarine biodiversity ‘ecosystem’
Taxon name enrichment
Taxon name reconciliation
Taxon name access
Occurrence data access
Environmental data access
openModeller
AquaMaps
Distribution modelling
Occurrence data enrichment
Occurrence data reconciliation
![Page 26: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/26.jpg)
BiOnym in its environment
Ecological modelling – Rich data management
Taxa Authority File
Vernacular Names Authority File
Darwin Core Archive
Based on the COMET Framework developed by Fabio Fiorellato (FAO)
![Page 27: BiOnym](https://reader033.vdocuments.net/reader033/viewer/2022051110/54b4ee694a795994458b46d6/html5/thumbnails/27.jpg)
Biodiversity Maps Generation
Retrieve via any GeoNetwork
Ecological modelling – Processing