
Pattern Recognition in Biological Data

A talk for CMT108 “Pattern Recognition & Data Mining” by Richard White on 23rd November 2012

(based on a previous talk “Organising Information about Organisms: Classification etc.” given to the COMSC Biodiversity Informatics Group on 22/11/2012)

For more information see:

http://biodiversity.cs.cf.ac.uk/wiki/big:meetings:cmt108


The Knowledge Pyramid

• In order to become Information, Data needs to be described by terms which give it meaning.

The Knowledge Pyramid: example

• Read this from the bottom upwards!


Names for objects

M62


Names refer to objects

Number 62 in Messier's catalogue of non-comets:

• M62 → a globular star cluster

• Name: identifier → object (or concept)


Making use of identifiers

• Old:
  – Indexes, catalogues (Linnaeus, Messier, Argos, ...)

• New:
  – data files
  – databases
  – XML files, RDF etc.


Changing concepts

• Alpheratz

• α Andromedae

• δ Pegasi


A jellyfish


Names for species

• Purple-striped jellyfish

• Chrysaora colorata

• Pelagia colorata


Carl Linnaeus


Biological examples

• How are names and identifiers for objects used in biological applications?

• Biodiversity informatics (1730 onwards; keeping track of 2 to 10 million species of organisms: animals, plants, fungi, microbes, ...)

• Bioinformatics (1990 onwards; centred on the gene sequences which specify proteins)


Some uses of biological information

• Bioinformaticians studying genetic material want to use information about a species to understand its development

• Biodiversity scientists (including taxonomists, ecologists, etc.) want to use molecular data to enhance their classifications, phylogenies and ecological models

• Others want to use species data to predict effects of climate change etc.


Biodiversity informatics

• Biological data needs to be linked together in many analyses

• Links often involve the species name as the key linking element

• Good application for Linked Data (IF the species concepts and links are stable)

Changing species concepts

• Good account of the species concept at http://en.wikipedia.org/wiki/Species and how species are named at http://en.wikipedia.org/wiki/Species_naming

• Biological species concept
  – Not usually usable in practice, hence the species concept is subjective …

• … which leads to differences of interpretation

• New knowledge → instability of names

• Need Rules of Nomenclature, developed after Linnaeus by Adanson and others ...



Rules of nomenclature 1

• A species (concept, as understood with current knowledge) has the name of whichever type specimen is included within its scope (range of variation, "circumscription")

• if there is no such type specimen with a published name, a new name is published, with a type specimen

• if there is more than one type specimen with a published name, the oldest name takes priority
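The three rules above can be sketched as a small function. This is only an illustration: the `TypeSpecimen` class, its `year_published` field and the placeholder names ("Aus bus" etc.) are assumptions made for the example, and real codes of nomenclature have many further provisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeSpecimen:
    name: str            # name published with this type specimen
    year_published: int  # hypothetical field used to decide priority


def name_for_concept(types_in_scope, propose_new_name):
    """Choose the name of a species concept from the type specimens it includes.

    types_in_scope: type specimens falling within the circumscription.
    propose_new_name: callable that returns a new name when none is available.
    """
    if not types_in_scope:
        # No type specimen with a published name: a new name is published.
        return propose_new_name()
    # One or more types in scope: the oldest published name takes priority.
    return min(types_in_scope, key=lambda t: t.year_published).name


# Placeholder names and dates, invented for illustration.
types = [TypeSpecimen("Aus bus", 1850), TypeSpecimen("Aus cus", 1790)]
print(name_for_concept(types, lambda: "Aus novus"))
```

Changing the circumscription (the list passed in) can change the result, which is the source of the name instability discussed on the following slides.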

Natural History Museum


Changing species names

The net effect of the rules of nomenclature is that

1. a newly discovered species gets a new name (good)

2. if a species is subdivided (found with better knowledge to be two or more species), one of the parts keeps the original name (hmmm)

3. if two (or more) species are merged (considered to be insufficiently distinct, or duplicated publications for the same species), one of the original names is retained for the new species (uh oh)

4. if a species is moved to a different genus (including the cases where a genus is split or genera are merged), its name changes (not very good at all)


Checklists

• These effects mean that at different points in time (or with different taxonomic opinions) the same name may apply to different species, and different names may have been applied to the same species

• It is therefore necessary to keep track of species names (in "checklists") and the allocation of identifiers to the species concepts denoted by the names.


Use of checklists

• Taxonomic experts refer to such checklists when handling information about species

• In order to build computer-based species information systems, it is desirable as far as possible to automate the application of checklists to assist in the integration of data from different sources (to create knowledge)
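A minimal sketch of applying a checklist automatically, assuming a checklist can be reduced to two lookup tables. The identifier `concept:001`, the record fields, and the choice of which jellyfish name is treated as accepted are all assumptions made for illustration.

```python
# Hypothetical checklist: every known name points to a concept identifier,
# and every concept has one accepted name.
NAME_TO_CONCEPT = {
    "Chrysaora colorata": "concept:001",
    "Pelagia colorata": "concept:001",  # another name for the same concept
}
ACCEPTED_NAME = {"concept:001": "Chrysaora colorata"}


def resolve(name):
    """Return (concept_id, accepted_name), or None if the name is not listed."""
    concept = NAME_TO_CONCEPT.get(name.strip())
    if concept is None:
        return None
    return concept, ACCEPTED_NAME[concept]


def integrate(records):
    """Group records from different sources by concept, not by name string."""
    grouped, unresolved = {}, []
    for rec in records:
        hit = resolve(rec["name"])
        if hit is None:
            unresolved.append(rec)  # left for a taxonomic expert to check
        else:
            grouped.setdefault(hit[0], []).append(rec)
    return grouped, unresolved


records = [
    {"name": "Pelagia colorata", "source": "A"},
    {"name": "Chrysaora colorata", "source": "B"},
    {"name": "Unknownus nameus", "source": "C"},
]
grouped, unresolved = integrate(records)
print(len(grouped["concept:001"]), len(unresolved))
```

Keying the data on a concept identifier rather than on the name itself is what allows records stored under different names to be brought together.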

Taxonomy as a thesaurus

Naming species

• Several projects in COMSC are concerned with the terms (names) used to specify the kinds (species) of organisms to which data applies.

• Species are the fundamental units of biological organism variation.

• Various uncertainties arise from
  – changes in data and classification procedures
  – traditional taxonomic practice in assigning “scientific names” (“Latin names”) to species,
  – the Rules of Nomenclature which should be applied, and
  – the consequences of using these names in practice.

Hierarchical classification

• Species are grouped into a hierarchical classification of larger groups (“higher taxa”)

• Many of these taxa have approximate equivalents in everyday usage: mammals, cats, grasses, etc.

• A hierarchical classification can be represented by a tree (a special case of a directed acyclic graph)

Hierarchical classification

• Hierarchical classification systems are familiar from library catalogues, for example, and other thesauri
  – these generally have to be constructed manually, before encountering the data (books, for example)

• Biologists construct hierarchical classifications from the data, after assembling data tables

Alternative classifications

• Multiple classifications exist, because different experts (taxonomists) have differing opinions or use different data sets to form their classifications.

• Current projects in COMSC include cross-mapping between checklists of species in i4Life and OpenBio.

• We are experimenting with ways to cross-map between hierarchies in the i4Life project
  – surprisingly, we cannot find published examples of this being done, either in biology or in other domains. (This may be due to our unfamiliarity with alternative terminology used in other disciplines.)

Predictive classification

• A useful classification has predictive value:

• If you know that an organism is a mammal,
  – you know that it almost certainly has fur
  – gives birth to live young
  – produces milk to feed its young

• If you also know it is in the cat family, you know
  – all the previous things, plus
  – it has retractile claws
  – other features which are harder to define but make it look, well, like a cat

• A predictive classification of objects is very useful if information about them is stored in a database
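The mammal and cat example can be sketched as feature lookup along a lineage in the hierarchy. The tables below are a simplified, hypothetical fragment (intermediate ranks are omitted), and the predictions are only "almost certain", as the slide says.

```python
# Hypothetical fragment of a hierarchy: child taxon -> parent taxon.
PARENT = {
    "Felis catus": "Felidae",
    "Felidae": "Mammalia",
    "Mammalia": None,
}

# Features stored once, at the taxon where they become predictable.
FEATURES = {
    "Mammalia": {"has fur", "gives birth to live young", "produces milk"},
    "Felidae": {"retractile claws"},
}


def lineage(taxon):
    """List the taxon followed by all of its ancestors, nearest first."""
    chain = []
    while taxon is not None:
        chain.append(taxon)
        taxon = PARENT[taxon]
    return chain


def predicted_features(taxon):
    """Collect the features predicted by every group the taxon belongs to."""
    feats = set()
    for t in lineage(taxon):
        feats |= FEATURES.get(t, set())
    return feats


print(sorted(predicted_features("Felis catus")))
```

In a database this means a feature need only be stored once, against the higher taxon, rather than repeated for every species.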

Natural classification

• This predictive value depends on the classification being a “natural” one,
  – which reflects as many features of the organism as possible.

• Special-purpose classifications may be based on a restricted range of information, e.g.
  – plants which are annual, biennial or perennial,
  – toadstools which are edible or poisonous,
  – animals which can be kept without a licence, etc.

• Previous classifications of organisms have not always been “natural” …

Celestial Emporium of Benevolent Knowledge

Contains a taxonomy which divides all animals into 14 categories:

• Those that belong to the emperor

• Embalmed ones

• Those that are trained

• Suckling pigs

• Mermaids

• Fabulous ones

• Stray dogs

• Those that are included in this classification

• Those that tremble as if they were mad

• Innumerable ones

• Those drawn with a very fine camel hair brush

• Others

• Those that have just broken a flower vase

• Those that, at a distance, resemble flies

Celestial Emporium of Benevolent Knowledge

Sadly, according to Wikipedia (http://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevolent_Knowledge%27s_Taxonomy), this allegedly ancient Chinese source quoted by Jorge Luis Borges in 1942 is apparently fictitious.


Creating a “natural” classification

• Need to take into account as much information as possible,
  – so that the resulting classification is as predictive as possible.

• In practice, biologists tend to short-circuit this process by choosing characteristics that they feel are
  – more fundamental
  – associated with many other characteristics, thus imparting predictive power.

• Most biologists now choose attributes felt to
  – reflect the evolutionary changes in organisms,
  – thus being associated with other characteristics,
  – therefore providing predictive power.

Taxonomic data sets

Matrix of

• species (rows, data objects) × characteristics (columns, attributes)
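Many of the methods that follow start from the distances between rows of such a matrix. A minimal sketch, assuming numeric characteristics and Euclidean distance; the species labels and values are invented.

```python
import math

# Hypothetical data matrix: rows are species, columns are characteristics.
species = ["sp_A", "sp_B", "sp_C"]
matrix = [
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 3.0],
    [4.0, 4.0, 3.0],
]


def distance_matrix(rows):
    """Euclidean distance between every pair of rows (data objects)."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


D = distance_matrix(matrix)
for label, row in zip(species, D):
    print(label, [round(d, 2) for d in row])
```

Other distance or similarity measures could be substituted, for example for presence/absence characteristics.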

Other types of data

which are subject to classification studies:

• Vegetation: matrix of species (rows, attributes) × locations (columns, data objects)

• Bioinformatics: protein structures, enzyme types, etc.

• Non-biological data: chemical structures, …

Ordination methods

• Based on linear model:
  – Principal Components Analysis (PCA)
  – Principal Coordinates Analysis (PCoA)

• Non-linear (non-parametric):
  – multidimensional scaling including NMMDS (Kruskal scaling etc.)

PCA (1)

The original data attributes as axes:

PCA (2)

Find new axes and rotate the data onto them (taking the axes of largest variance first)
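The rotation can be sketched for the simplest case of two attributes, where the angle of the first principal axis has a closed form. This is an illustrative assumption: with more attributes, PCA is normally computed from an eigen-decomposition of the covariance matrix. The function name `pca_2d` and the sample points are invented.

```python
import math


def pca_2d(points):
    """Rotate 2-attribute data onto its principal axes.

    Returns (scores, variances): scores are coordinates on the new axes,
    variances are the variances along the first and second new axis.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    # Entries of the 2 x 2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centred) / (n - 1)
    syy = sum(y * y for _, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)
    # Angle of the axis of largest variance (closed form for two attributes).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    scores = [(c * x + s * y, -s * x + c * y) for x, y in centred]
    var1 = sum(a * a for a, _ in scores) / (n - 1)
    var2 = sum(b * b for _, b in scores) / (n - 1)
    return scores, (var1, var2)


scores, (var1, var2) = pca_2d([(2, 1), (3, 4), (5, 4.5), (7, 8)])
print(round(var1, 3), round(var2, 3))
```

After the rotation the two new axes are uncorrelated, and the first carries as much of the total variance as any single axis can.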

Clustering methods

• K-means “clustering”
  – fixed number of clusters
  – can be useful when added to a PCA plot

• Agglomerative, including
  – single linkage, complete linkage,
  – average linkage (UPGMA), weighted (WPGMA),
  – Ward's method (Euclidean distance), etc.

• Divisive (particularly developed at Canberra and Southampton for use in vegetation studies):
  – monothetic (pick best attribute for separation)
  – polythetic (AXOR based on PCA axes, etc.)
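As an illustration of the agglomerative family, here is a minimal sketch of average linkage (UPGMA) working from a distance matrix. The function name, the return format and the distances are assumptions for the example; it favours clarity over efficiency.

```python
def upgma(dist, labels):
    """Average-linkage agglomerative clustering.

    dist: symmetric matrix (list of lists) of inter-object distances.
    Returns the merges in order as (cluster_a, cluster_b, height), where each
    cluster is a tuple of labels and height is the mean distance between them.
    """
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def average(a, b):
        # Mean of all pairwise distances between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of current clusters and merge them.
        a, b = min(
            ((x, y) for k, x in enumerate(clusters) for y in clusters[k + 1:]),
            key=lambda pair: average(*pair),
        )
        merges.append((
            tuple(labels[i] for i in a),
            tuple(labels[i] for i in b),
            average(a, b),
        ))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Invented distances between four objects.
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 4],
    [10, 10, 4, 0],
]
for merge in upgma(dist, ["A", "B", "C", "D"]):
    print(merge)
```

Replacing `average` with the minimum or maximum pairwise distance would give single or complete linkage respectively. The merge heights are what a dendrogram plots.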

Dendrograms

Visualisation of clustering (other than K-means) by means of dendrograms

Divisive classification

• Previously described method is agglomerative: larger clusters are built from smaller ones

• Divisive methods start with all objects in one cluster, and divide and subdivide repeatedly
  – monothetic: pick the attribute at each step which best separates the clusters (on some criterion)
  – polythetic: multivariate, e.g. find the first PCA axis as a guide to where to divide (“AXOR”)

• Results can still be shown as dendrograms

Quality of a classification or ordination

• Cophenetic correlation coefficient
  – applicable to methods which start from a distance (or similarity) matrix
  – calculate the distance between all pairs of objects according to the classification (dendrogram) or ordination (scatter plot)
  – calculate the correlation between these values and the original inter-object distances

• Assess the “predictive” property of the classification
  – make predictions of attribute values from the classification
  – test them for correlation with actual known values, e.g. using Tukey's jackknife.
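The cophenetic correlation can be sketched for a dendrogram as follows. The dendrogram is assumed to be given as a list of merges `(left_members, right_members, height)`; that format, the function names and the example distances are all invented for illustration.

```python
import math
from itertools import combinations


def cophenetic_distances(merges):
    """Height at which each pair of objects is first joined in the dendrogram."""
    coph = {}
    for left, right, height in merges:
        for a in left:
            for b in right:
                coph[frozenset((a, b))] = height
    return coph


def pearson(xs, ys):
    """Ordinary product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cophenetic_correlation(original, merges, labels):
    """Correlate dendrogram distances with the original distances."""
    coph = cophenetic_distances(merges)
    pairs = [frozenset(p) for p in combinations(labels, 2)]
    return pearson([original[p] for p in pairs], [coph[p] for p in pairs])


labels = ["A", "B", "C", "D"]
original = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 4,
}
merges = [
    (("A",), ("B",), 2.0),
    (("C",), ("D",), 4.0),
    (("A", "B"), ("C", "D"), 8.0),
]
print(round(cophenetic_correlation(original, merges, labels), 3))
```

A value near 1 suggests the dendrogram distorts the original distances very little. For an ordination, the distances measured on the scatter plot would take the place of the dendrogram heights.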

Uses of a classification (1)

• Taxonomic revision
  – taxonomists typically do not rely on automatically generated clusters, but may use them to guide and support their decisions
  – Example needed

• Identification keys
  – an identification key is similar to a monothetic divisive classification
  – for example …

Example identification key

Part of a computer-generated key to the grass genera of the Australian Capital Territory

Estimating phylogeny

Now widely used because of the easy availability of molecular sequence data

• Build a tree which attempts to estimate the presumed course of evolutionary change and divergence in a group of organisms

• Try to localise specific changes to a particular point along a branch in the tree
  – e.g. sequence change, gene substitution, morphological feature
  – minimise the number of occurrences of such changes (parsimony)

• The resulting tree is a cladogram, not a dendrogram
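Counting the minimum number of changes on a candidate tree can be sketched with Fitch's method, one standard way of scoring a tree for a single character. The talk does not say which method was used, so this is a swapped-in illustration; the nested-tuple tree format and the example states are assumptions.

```python
def fitch(tree, states):
    """Minimum number of state changes on a fixed binary tree (Fitch's method).

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf name to its observed character state.
    Returns (possible_states_at_this_node, number_of_changes_below_it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree, so no extra change is needed here.
        return common, left_cost + right_cost
    # No shared state: one change must occur on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


# One sequence position observed in five hypothetical organisms.
states = {"a": "A", "b": "A", "c": "G", "d": "G", "e": "G"}
tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = (("a", "c"), ("b", ("d", "e")))
print(fitch(tree_1, states)[1], fitch(tree_2, states)[1])
```

Under parsimony the tree needing fewer changes is preferred. A full analysis would sum the counts over many characters and search over many candidate trees.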

Cladogram


Uses of a classification (2)

• As the basis for reliable retrieval of data
  – Clusters given names (terms in a thesaurus), often but not necessarily by taxonomists
  – these terms used in data storage and retrieval

• To increase knowledge
  – e.g. when investigating the effects of climate change
  – By recognising different types of organisms so that variation does not obscure the results
  – By studies of the causes of the variation
  – For its own sake (greater knowledge of the organisms)

Data quality in information retrieval

• noise in data

• errors in data (e.g. GBIF observation locations)

• accuracy of the assignment of terms (species names) to objects being searched for

GBIF observations portal

• GBIF is the Global Biodiversity Information Facility (http://www.gbif.org/)

• Provides downloadable information on the locations of millions of observations of individual organisms (of known species), including the locations where museum specimens were collected.

• An excellent resource for studies of the distribution of organisms, with applications in modelling the effects of climate change, amongst others.

• Yesson C, Brewer PW, Sutton T, Caithness N, Pahwa JS, et al. (2007) How Global Is the Global Biodiversity Information Facility? PLoS ONE 2(11): e1124. doi:10.1371/journal.pone.0001124 investigated this further …


Yesson et al. (2007)

• The GBIF portal was queried for georeferenced data (i.e. records with latitude/longitude coordinates) for all species names from ILDIS (the plant family Leguminosae/Fabaceae at www.ildis.org), including synonyms but excluding a few special cases.

• This consisted of 31,086 names representing 20,003 species.
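The kind of selection described above can be sketched as a filter over downloaded records. The record fields (`name`, `latitude`, `longitude`) and the sample values are hypothetical and do not reflect the actual GBIF data format or the procedure used by Yesson et al.

```python
def is_georeferenced(record):
    """True if the record has a latitude and longitude within valid ranges."""
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def select_records(records, names):
    """Keep georeferenced records whose name is in the list being queried."""
    wanted = set(names)
    return [r for r in records if r.get("name") in wanted and is_georeferenced(r)]


# Invented records for illustration.
records = [
    {"name": "Vicia faba", "latitude": 51.5, "longitude": -3.2},
    {"name": "Vicia faba", "latitude": None, "longitude": None},
    {"name": "Vicia faba", "latitude": 123.0, "longitude": 10.0},  # impossible latitude
    {"name": "Felis catus", "latitude": 40.0, "longitude": 2.0},
]
print(len(select_records(records, ["Vicia faba"])))
```

A range check like this only catches impossible coordinates; plausible but wrong locations need other tests, for example comparison with the known distribution of the species.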


Yesson et al. (2007) Fig. 1

Importance of correct species name use …

In conservation …

• Recently fisheries biologists have tried to conserve the endangered Greenback cutthroat trout, Oncorhynchus clarkii stomias, by rearing thousands of fish and stocking them in Colorado rivers, lakes and streams.

• Unfortunately (as shown using mitochondrial and nuclear genetic markers), the fish used to stock were mostly the closely related Colorado River cutthroat trout, Oncorhynchus clarkii pleuriticus.

• Metcalf JL, Pritchard VL, Silvestri SM, Jenkins JB, Wood JS, Cowley DE, Evans RP, Shiozawa DK, Martin AP (2007) Across the great divide: genetic forensics reveals misidentification of endangered cutthroat trout populations. Molecular Ecology 2007 Aug 28. DOI: 10.1111/j.1365-294X.2007.03472.x

… and in data aggregation

• Data resources such as GBIF and the Catalogue of Life (www.catalogueoflife.org) are frequently updated.

• A retrieval using a particular species name may therefore produce different results on different occasions.

• This is as you might expect, but …


Errors in names

• Erroneous information will be retrieved if
  – the terms (species names) being used in the search are wrong
  – the species data is recorded in the database using incorrect names

• Chapman, A. D. 2005. Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.

Chapman (2005): Nomenclatural Error

• “Names form the major key for accessing information in primary species databases. If the name is wrong, then access to the information by users will be difficult, if not impossible. In spite of having rules for biological nomenclature for around 100 years, the nomenclatural and taxonomic information in a database ... is often the most difficult in which to detect and clean errors. It is also the area that causes the most angst and loss of confidence amongst users in primary species databases. This is often due to ignorance amongst users of the need for taxonomic changes and nomenclatural changes, but is also partly due to taxonomists not fully documenting and explaining these changes to users...”

• “The easier of these errors to clean is the nomenclatural data – the misspellings. Lists of names (and synonyms) are the key tools for helping with this task...”
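Using a list of names to catch misspellings can be sketched with approximate string matching from the Python standard library. The checklist contents, the function name and the 0.85 similarity cutoff are assumptions chosen for the example, not values from Chapman (2005).

```python
import difflib


def suggest_names(name, checklist, cutoff=0.85):
    """Return checklist names that closely match a possibly misspelt name."""
    if name in checklist:
        return [name]  # exact match: nothing to correct
    return difflib.get_close_matches(name, checklist, n=3, cutoff=cutoff)


checklist = ["Chrysaora colorata", "Pelagia colorata", "Oncorhynchus clarkii"]
print(suggest_names("Chrysaora colorta", checklist))
```

Suggestions like these still need checking by someone who knows the group, since two valid names can differ by a single letter.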

Chapman (2005): Taxonomic Error

• “Taxonomic error – the inaccurate identification or misidentification of the collection is the most difficult of errors to detect and clean. ... experts working in taxonomic groups examine the specimens from time to time and determine their circumscription or identification...”

• “Traditional tools include publications such as taxonomic revisions, national and regional floras and faunas, and illustrated checklists. Newer tools include automated and computer-generated keys to taxa; interactive electronic publications with illustrations, descriptions, keys, and illustrated glossaries; character-based databases; imaging tools; scientific image databases that include images of types; systematic images of collections; and easily accessible on-line images (both scientifically verified and others).”

Take-home message

Accurate and reliable information retrieval from databases requires that

• The objects described are accurately classified so that their data is precisely targeted (dissimilar objects are not lumped together)

• The search terms used to locate these objects (e.g. species names) are accurate

• The data values describing the objects (e.g. geographical locations) are accurate
