building a names backbone
Building a “names backbone”
Nicky Nicolson, RBG Kew
A names backbone
== “an environment for the management of multiple
overlapping classifications and tracking how these
change over time”
Not a monolith:
• Built on a layered view of the domain – clearly
separating names and taxonomy
• Names form the objective basis for higher layers
The current situation…
Many overlapping systems, few links
… and what we’re aiming for:
Authoritative data, reduced duplication, many more links
Names backbone: a layered environment
Name occurrence layer, AKA “Nomen-clutter”
== any attempt at the transcription of a name
Names layer
Holds objective published facts about a name:
- Orthography
- Authorship
- Protologue reference
- Type citation
- Objective synonymy
Concepts layer
Hypotheses draw names together to form concepts via heterotypic synonymy
The (current) problem:
Most people want to operate at concept level…
… but have to start right down at the lowest level
Solving the problem…
We need to provide ways to allow people to better
navigate between the layers, and better focus their
efforts – e.g. build classifications using the same
objective bases.
We started with a blank sheet of paper – it’s hard to get
existing systems to conform to the layering that we
need
Drawbacks of data models used to date
• conflate the storage of names and concepts
• store only a single classification
• store only the end product of a thought process, not work in progress
• are difficult to version
• are difficult to query effectively (for hierarchies etc.)
A new (graph) model
• Stores data as graphs – composed of nodes and
directed relationships
• Both nodes and relationships can hold data as
properties
• Supports highly interconnected data
• Supports self-referential data
• Optimised for queries on relationships
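A minimal sketch of such a property graph, written in plain Python rather than in any particular graph database. The class and method names (`Node`, `Relationship`, `Graph`, `relate`, `outgoing`) are illustrative assumptions, not the system described in the talk; the example names are borrowed from the versioning slide later in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node; arbitrary data lives in `properties`."""
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge that can also carry properties."""
    source: str    # node_id the edge starts from
    target: str    # node_id the edge points to
    rel_type: str
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def relate(self, source, target, rel_type, **properties):
        # Both endpoints must already exist. Nodes of the same kind can be
        # linked to each other, which is what makes the data self-referential.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        rel = Relationship(source, target, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node_id, rel_type=None):
        """Relationships leaving `node_id`, optionally filtered by type."""
        return [r for r in self.relationships
                if r.source == node_id
                and (rel_type is None or r.rel_type == rel_type)]


g = Graph()
g.add_node("n1", name="Aus bus", author="Jones")
g.add_node("n2", name="Xus yus", author="Smith")
g.relate("n1", "n2", "synonym", classification="WCS")
```

Here the name data sits on the nodes and the `classification` property sits on the relationship, so a query such as `g.outgoing("n1", "synonym")` works on the links rather than on node attributes.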
Using a graph model to hold concept data: Attempt #1
Two nodes, with name + status properties, and an “accepted_as” link.
== a naïve use of the graph model: status is stored in two places (explicitly in the status property, implicitly by participation in the relationship)
Using a graph model to hold concept data: Attempt #2
More strict about the separation of the nomenclatural information (the nodes) and the taxonomic information (the relationships between nodes), but the link is still very sparse…
Using a graph model to hold concept data: Attempt #3
Add an attribute to indicate which classification asserts this subjective relationship:
Taxonomic status of a name is inferred from its participation in a subjective taxonomic relationship.
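One way to picture this inference, as a sketch only: status is never stored on a name, it is computed from the relationships that a given classification asserts. The `Synonymy` tuple, the `status` function, the "unplaced" label and the second classification called "Other" are assumptions made for illustration.

```python
from collections import namedtuple

# A subjective relationship: `source` is treated as a synonym of `target`,
# according to `classification`.
Synonymy = namedtuple("Synonymy", "source target classification")

relationships = [
    Synonymy("Aus bus Jones", "Xus yus Smith", "WCS"),
    # Hypothetical second opinion that reverses the synonymy.
    Synonymy("Xus yus Smith", "Aus bus Jones", "Other"),
]


def status(name, classification, relationships):
    """Infer the status of `name` within one classification.

    Nothing is read from the name itself: only its participation in
    that classification's relationships counts.
    """
    asserted = [r for r in relationships if r.classification == classification]
    if any(r.source == name for r in asserted):
        return "synonym"
    if any(r.target == name for r in asserted):
        return "accepted"
    return "unplaced"
```

With this data, `status("Aus bus Jones", "WCS", relationships)` gives `"synonym"`, while the same name under `"Other"` gives `"accepted"`: two opinions resting on the same two name records.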
Links become more interesting than the nodes
Expand the data held on the subjective relationship to allow it to be computationally assessed
Multiple opinions – using the same name nodes
Reuse the name nodes to store multiple opinions using the same basic facts (name nodes)
Relationships held
Objective, e.g.:
• Combination-basionym
• Later_homonym
• Alternative_name_for
• …
Subjective, e.g.:
• Parent_child (taxonomic placement)
• Synonym (heterotypic synonymy)
• …
Objective relationships are “stronger” than subjective ones
Supporting versioning
We keep all relationships; modifications to the data just mark relationships as no longer current.
We can always resurrect the state of the graph
== persistent identification of taxon concepts
Versioning = name id +
classification + state
We can always resurrect the state of the graph.
Versioning enables remote curation of the data
State 1, according to WCS:
Xus yus Smith (A)
= Aus bus Jones (S)
State 2, according to WCS:
Xus zus White (A)
= Xus yus Smith (S)
= Aus bus Jones (S)
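The two states above can be reproduced with a small sketch in which relationships are never deleted, only marked as no longer current. The `VersionedRel` and `History` classes and the integer state counter are assumptions for illustration; the talk does not say how state is actually recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionedRel:
    synonym: str
    accepted: str
    classification: str
    valid_from: int                  # state in which the link was asserted
    valid_to: Optional[int] = None   # first state in which it is no longer current


class History:
    def __init__(self):
        self.rels = []

    def assert_synonym(self, synonym, accepted, classification, state):
        # Retire any current link for this synonym instead of deleting it.
        for r in self.rels:
            if (r.synonym == synonym
                    and r.classification == classification
                    and r.valid_to is None):
                r.valid_to = state
        self.rels.append(VersionedRel(synonym, accepted, classification, state))

    def snapshot(self, classification, state):
        """Resurrect the synonym -> accepted links as they stood in `state`."""
        return {r.synonym: r.accepted
                for r in self.rels
                if r.classification == classification
                and r.valid_from <= state
                and (r.valid_to is None or state < r.valid_to)}


h = History()
# State 1: Xus yus Smith is accepted, Aus bus Jones is its synonym.
h.assert_synonym("Aus bus Jones", "Xus yus Smith", "WCS", state=1)
# State 2: Xus zus White is accepted; both earlier names become synonyms.
h.assert_synonym("Xus yus Smith", "Xus zus White", "WCS", state=2)
h.assert_synonym("Aus bus Jones", "Xus zus White", "WCS", state=2)
```

`h.snapshot("WCS", 1)` still returns the State 1 view after the State 2 edits, because the original link was retired rather than overwritten.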
What can be done with this kind of
data model?
• Client systems can reliably connect to a version of a
concept
• We can see how concepts change over time
• Researchers can query the data to compare
classifications and identify areas of dispute
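As a rough sketch of the comparison query, assume each classification is reduced to a mapping from synonym to accepted name. The function names and the empty second classification are hypothetical; a real query would run against the graph itself.

```python
def accepted_name(name, links):
    """Follow synonym -> accepted links until a name has no onward link."""
    seen = set()
    while name in links and name not in seen:   # `seen` guards against cycles
        seen.add(name)
        name = links[name]
    return name


def disputes(names, links_a, links_b):
    """Names that the two classifications resolve to different accepted names."""
    result = {}
    for n in names:
        a, b = accepted_name(n, links_a), accepted_name(n, links_b)
        if a != b:
            result[n] = (a, b)
    return result


wcs = {"Aus bus Jones": "Xus yus Smith"}
other = {}   # hypothetical classification that accepts both names as they are
names = ["Aus bus Jones", "Xus yus Smith"]
```

Calling `disputes(names, wcs, other)` flags only "Aus bus Jones", the one name on which the two opinions disagree.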
Longer term:
• Examine the “computed acceptance” rules used in
TPL - could these be run on the relationships in the
names backbone?
Building it: we first focussed on
the top two layers…
… but we need a way to manage
the name occurrences
Building the name occurrence layer:
Populating it:
• Seed it with an authoritative set of names
• Add the version history of these names – how were
these names transcribed in the past?
Using it:
• Load candidate name occurrences and match them,
storing metrics on the match.
Reviewing – a “data improvement” team to:
• Verify the matches, focussing on ambiguity (that
which can’t be done computationally) == annotation
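A sketch of the load-and-match step, using the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matching method is actually used. The 0.8 threshold, the `needs_review` flag and the record layout are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def best_match(occurrence, authoritative_names, threshold=0.8):
    """Match one transcribed name against the seeded names and keep the metric.

    Assumes `authoritative_names` is non-empty.
    """
    scored = [
        (SequenceMatcher(None, occurrence.lower(), name.lower()).ratio(), name)
        for name in authoritative_names
    ]
    score, name = max(scored)
    return {
        "occurrence": occurrence,
        "match": name if score >= threshold else None,
        "score": round(score, 3),
        # Anything short of an exact match is queued for the review team.
        "needs_review": score < 1.0,
    }


seed = ["Xus yus", "Aus bus"]
exact = best_match("Xus yus", seed)
fuzzy = best_match("Xus yuss", seed)
unmatched = best_match("Qqq", seed)
```

Storing the score next to the link lets reviewers sort by ambiguity and spend their time on the matches that could not be settled computationally.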
Services: name occurrence layer
- Data input / output: DwCA
- Linking and reviewing links
- RSS feeds to indicate activity
Services: names layer
- Data input / output: TCS
- Propose addition / edit of names
- RSS feeds to indicate activity
Services: concepts layer
- Data input / output: TCS
- Create classifications using names
- Propose addition / edit of names to names layer
- RSS feeds
The names backbone is an
extensible environment:
• Links “name occurrences” to names
• Separates curation of names and concepts
• Supports building concepts on the same objective
basis: enables sharing and reuse of foundation data.
• Allows many relationships to form concepts – supports multiple overlapping classifications
• Allows distributed curation of the concepts.