Building a “names backbone”

Nicky Nicolson, RBG Kew

A names backbone

== “an environment for the management of multiple overlapping classifications and tracking how these change over time”

Not a monolith:

• Built on a layered view of the domain – clearly separating names and taxonomy

• Names form the objective basis for higher layers

The current situation…

Many overlapping systems, few links

… and what we’re aiming for:

Authoritative data, reduced duplication, many more links

Names backbone: a layered environment

Name occurrence layer, AKA “Nomen-clutter”

== any attempt at the transcription of a name

Names layer

Holds objective published facts about a name:

- Orthography

- Authorship

- Protologue reference

- Type citation

- Objective synonymy

Concepts layer

Hypotheses draw names together to form concepts via heterotypic synonymy
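The three layers above can be sketched as three small record types. This is only an illustration: the class and field names (`NameOccurrence`, `Name`, `Concept`, `matched_name_id` and so on) are assumptions chosen for the sketch, not the schema used in the names backbone itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Name:
    """Names layer: objective, published facts about one name."""
    name_id: str
    orthography: str
    authorship: str
    protologue_reference: str = ""
    type_citation: str = ""


@dataclass(frozen=True)
class NameOccurrence:
    """Name occurrence layer: one transcription of a name, as found in a source."""
    verbatim: str
    source: str
    matched_name_id: Optional[str] = None  # link up to the names layer, once matched


@dataclass
class Concept:
    """Concepts layer: a hypothesis that draws several names together."""
    classification: str
    accepted_name_id: str
    synonym_name_ids: List[str] = field(default_factory=list)


# Illustrative data only.
xus = Name("n1", "Xus yus", "Smith")
aus = Name("n2", "Aus bus", "Jones")
occurrence = NameOccurrence("Xus yus Sm.", "herbarium label", matched_name_id=xus.name_id)
concept = Concept("WCS", accepted_name_id=xus.name_id, synonym_name_ids=[aus.name_id])
```

Each layer refers only to the layer beneath it by identifier, so a concept can be revised without touching the name records it is built from.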

The (current) problem:

Most people want to operate at concept level…

… but have to start right down at the lowest level

Solving the problem…

We need to provide ways to allow people to better navigate between the layers, and better focus their efforts – e.g. build classifications using the same objective bases.

We started with a blank sheet of paper – it’s hard to get existing systems to conform to the layering that we need

Drawbacks of data models used to date

• Conflate the storage of names and concepts

• Store only a single classification

• Store only the end product of a thought process, not work in progress

• Are difficult to version

• Are difficult to query effectively (for hierarchies etc.)

A new (graph) model

• Stores data as graphs – composed of nodes and directed relationships

• Both nodes and relationships can hold data as properties

• Supports highly interconnected data

• Supports self-referential data

• Optimised for queries on relationships
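A minimal in-memory sketch of the property-graph idea described above: nodes and directed relationships, both carrying properties. The `Graph`, `Node` and `Relationship` classes are hypothetical and written for this illustration; they do not represent any particular graph database or the system actually used.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    source: str    # id of the node the relationship starts from
    target: str    # id of the node it points to
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


class Graph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.relationships: List[Relationship] = []

    def add_node(self, node_id: str, **properties: Any) -> Node:
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        return node

    def relate(self, source: str, target: str, rel_type: str, **properties: Any) -> Relationship:
        rel = Relationship(source, target, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node_id: str, rel_type: str) -> List[Relationship]:
        """Directed lookup: relationships of one type leaving a node."""
        return [r for r in self.relationships
                if r.source == node_id and r.rel_type == rel_type]


graph = Graph()
graph.add_node("n1", name="Xus yus", author="Smith")
graph.add_node("n2", name="Aus bus", author="Jones")
graph.relate("n2", "n1", "synonym", classification="WCS")
```

Because relationships are first-class records, a query such as `graph.outgoing("n2", "synonym")` works directly on the links rather than on columns of a name table.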

Using a graph model to hold concept data: Attempt #1

Two nodes, with name + status properties, and an “accepted_as” link.

== a naïve use of the graph model: status is stored in 2 places (explicitly in the status property, implicitly by participation in the relationship)



More strict about the separation of the nomenclatural information (the nodes) and the taxonomic information (the relationships between nodes), but the link is still very sparse…



Add an attribute to indicate which classification asserts this subjective relationship:

Taxonomic status of a name is inferred from its participation in a subjective taxonomic relationship.
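As a sketch of how status could be inferred rather than stored: the function below looks only at the relationships asserted by one classification. The `Relationship` tuple, the `infer_status` function and the status strings are assumptions made for this example.

```python
from typing import List, NamedTuple


class Relationship(NamedTuple):
    source: str          # the name placed in synonymy
    target: str          # the name it is treated as a synonym of
    rel_type: str
    classification: str  # which classification asserts this opinion


def infer_status(name: str, classification: str, rels: List[Relationship]) -> str:
    """Derive a name's status from the subjective links it takes part in."""
    mine = [r for r in rels
            if r.classification == classification and r.rel_type == "synonym"]
    if any(r.source == name for r in mine):
        return "synonym"
    if any(r.target == name for r in mine):
        return "accepted"
    return "unplaced"  # this classification says nothing about the name


rels = [Relationship("Aus bus Jones", "Xus yus Smith", "synonym", "WCS")]
status_of_aus = infer_status("Aus bus Jones", "WCS", rels)
status_of_xus = infer_status("Xus yus Smith", "WCS", rels)
```

No status property is stored on the name nodes, so there is only one place where the opinion lives. A fuller version would also need to recognise accepted names that have no synonyms, for example through a placement relationship.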

Links become more interesting than the nodes

Expand the data held on the subjective relationship to allow it to be computationally assessed

Multiple opinions – using the same name nodes

Reuse the name nodes to store multiple opinions built on the same basic facts

Relationships held

Objective, e.g.:

• Combination-basionym

• Later_homonym

• Alternative_name_for

• …

Subjective, e.g.:

• Parent_child (taxonomic placement)

• Synonym (heterotypic synonymy)

• …

Objective relationships are “stronger” than subjective ones
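The slide does not define “stronger”. One possible reading, assumed here purely for illustration, is that an objective relationship takes precedence when both kinds are present between the same names. The relationship type strings come from the list above; `Kind`, `RELATIONSHIP_KINDS` and `strongest` are hypothetical names.

```python
from enum import Enum
from typing import Iterable, List


class Kind(Enum):
    OBJECTIVE = "objective"
    SUBJECTIVE = "subjective"


# Relationship types from the list above, mapped to their kind.
RELATIONSHIP_KINDS = {
    "combination_basionym": Kind.OBJECTIVE,
    "later_homonym": Kind.OBJECTIVE,
    "alternative_name_for": Kind.OBJECTIVE,
    "parent_child": Kind.SUBJECTIVE,
    "synonym": Kind.SUBJECTIVE,
}


def strongest(rel_types: Iterable[str]) -> List[str]:
    """Keep only objective relationship types if any are present (assumed rule)."""
    rel_types = list(rel_types)
    objective = [t for t in rel_types if RELATIONSHIP_KINDS[t] is Kind.OBJECTIVE]
    return objective or rel_types
```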

Supporting versioning

We keep all relationships; modifications to the data just mark relationships as no longer current.

We can always resurrect the state of the graph

== persistent identification of taxon concepts

Versioning = name id + classification + state

Versioning enables remote curation of the data

State 1, according to WCS:

Xus yus Smith (A)

= Aus bus Jones (S)

State 2, according to WCS:

Xus zus White (A)

= Xus yus Smith (S)

= Aus bus Jones (S)
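The two WCS states above can be reproduced with a small sketch in which relationships are never deleted, only marked as no longer current. The `Synonymy` record, its `from_state` / `until_state` fields and the `retire` and `snapshot` helpers are assumptions made for illustration, not the backbone's actual versioning mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Synonymy:
    synonym: str
    accepted: str
    classification: str
    from_state: int                    # first state in which the link is current
    until_state: Optional[int] = None  # first state in which it is no longer current


def retire(rel: Synonymy, state: int) -> None:
    """Mark a relationship as no longer current instead of deleting it."""
    rel.until_state = state


def snapshot(rels: List[Synonymy], classification: str, state: int) -> List[Synonymy]:
    """Return the relationships that were current in the given state."""
    return [
        r for r in rels
        if r.classification == classification
        and r.from_state <= state
        and (r.until_state is None or state < r.until_state)
    ]


# State 1: Xus yus Smith is accepted, Aus bus Jones is its synonym.
rels = [Synonymy("Aus bus Jones", "Xus yus Smith", "WCS", from_state=1)]

# State 2: Xus zus White becomes accepted; the old link is retired, not removed.
retire(rels[0], 2)
rels.append(Synonymy("Xus yus Smith", "Xus zus White", "WCS", from_state=2))
rels.append(Synonymy("Aus bus Jones", "Xus zus White", "WCS", from_state=2))

state1 = {(r.synonym, r.accepted) for r in snapshot(rels, "WCS", 1)}
state2 = {(r.synonym, r.accepted) for r in snapshot(rels, "WCS", 2)}
```

A client holding the triple (name id, classification, state) can ask for `snapshot(...)` at that state and get back the concept exactly as it was when first referenced.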

What can be done with this kind of data model?

• Client systems can reliably connect to a version of a concept

• We can see how concepts change over time

• Researchers can query the data to compare classifications and identify areas of dispute

Longer term:

• Examine the “computed acceptance” rules used in TPL - could these be run on the relationships in the names backbone?
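Comparing classifications to find areas of dispute could look something like the sketch below. It only detects one kind of disagreement (the same name placed under different accepted names), and the second classification, the name “Cus dus Brown” and all helper names are invented for the example.

```python
from typing import Dict, List, Set, Tuple

# (synonym, accepted, classification) triples; the data is invented for illustration.
Relationship = Tuple[str, str, str]


def placements(rels: List[Relationship]) -> Dict[str, Dict[str, str]]:
    """Map each name to the accepted name that each classification assigns it."""
    result: Dict[str, Dict[str, str]] = {}
    for synonym, accepted, classification in rels:
        result.setdefault(synonym, {})[classification] = accepted
    return result


def disputed(rels: List[Relationship]) -> Set[str]:
    """Names that classifications place under different accepted names."""
    return {
        name
        for name, by_classification in placements(rels).items()
        if len(set(by_classification.values())) > 1
    }


rels = [
    ("Aus bus Jones", "Xus yus Smith", "WCS"),
    ("Aus bus Jones", "Xus zus White", "Other checklist"),
    ("Cus dus Brown", "Xus yus Smith", "WCS"),
    ("Cus dus Brown", "Xus yus Smith", "Other checklist"),
]
areas_of_dispute = disputed(rels)
```

Because both classifications reuse the same name nodes, the comparison is a simple grouping over relationships; no name reconciliation step is needed first.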

Building it: we first focussed on the top two layers…

… but we need a way to manage the name occurrences

Building the name occurrence layer:

Populating it:

• Seed it with an authoritative set of names

• Add the version history of these names – how were these names transcribed in the past?

Using it:

• Load candidate name occurrences and match them, storing metrics on the match.

Reviewing – a “data improvement” team to:

• Verify the matches, focussing on ambiguity (that which can’t be done computationally) == annotation
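The load, match and review steps above might be sketched as follows. The similarity measure (`difflib.SequenceMatcher` from the Python standard library), the 0.9 review threshold, and the names `AUTHORITATIVE_NAMES` and `match_occurrence` are all assumptions for illustration; the slides do not say which matching metrics are actually stored.

```python
from difflib import SequenceMatcher
from typing import Dict

# Seed data: a tiny, invented "authoritative" set of names keyed by name id.
AUTHORITATIVE_NAMES: Dict[str, str] = {
    "n1": "Xus yus Smith",
    "n2": "Aus bus Jones",
}


def match_occurrence(verbatim: str,
                     names: Dict[str, str] = AUTHORITATIVE_NAMES,
                     review_below: float = 0.9) -> dict:
    """Match one name occurrence to its closest name and keep the metric."""
    scored = [
        (SequenceMatcher(None, verbatim.lower(), name.lower()).ratio(), name_id)
        for name_id, name in names.items()
    ]
    score, name_id = max(scored)
    return {
        "verbatim": verbatim,
        "name_id": name_id,
        "score": round(score, 3),
        # Low-scoring matches are queued for the "data improvement" team.
        "needs_review": score < review_below,
    }


exact = match_occurrence("Xus yus Smith")
fuzzy = match_occurrence("Xus yuss Sm.")
```

Storing the score alongside the link lets reviewers sort by ambiguity and spend their time on the matches a program cannot settle.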

Services: name occurrence layer

- Data input / output: DwCA

- Linking and reviewing links

- RSS feeds to indicate activity

Services: names layer


- TCS

- Propose addition / edit of names

- RSS feeds to indicate activity


Services: concepts layer

- TCS

- Create classifications using names

- Propose addition / edit of names to names layer

- RSS feeds

The names backbone is an extensible environment:

• Links “name occurrences” to names

• Separates curation of names and concepts

• Supports building concepts on the same objective basis: enables sharing and reuse of foundation data.

• Allows many relationships to form concepts – supports multiple overlapping classifications

• Allows distributed curation of the concepts.
