
Pattern Recognition in Biological Data

A talk for CMT108 “Pattern Recognition & Data Mining” by Richard White on 23rd November 2012

(based on a previous talk “Organising Information about Organisms: Classification etc.” given to the COMSC Biodiversity Informatics Group on 22/11/2012)

For more information see: http://biodiversity.cs.cf.ac.uk/wiki/big:meetings:cmt108


The Knowledge Pyramid

• In order to become Information, Data needs to be described by terms which give it meaning.


The Knowledge Pyramid: example

• Read this from the bottom upwards!


Names for objects

M62


M62


Names refer to objects

Number 62 in Messier's catalogue of non-comets:

• M62 → a globular star cluster

• Name: identifier → object (or concept)


Making use of identifiers

• Old:
  – Indexes, catalogues (Linnaeus, Messier, Argos, ...)

• New:
  – data files
  – databases
  – XML files, RDF etc.


Changing concepts

• Alpheratz

• α Andromedae

• δ Pegasi


A jellyfish


Names for species

• Purple-striped jellyfish

• Chrysaora colorata

• Pelagia colorata


Carl Linnaeus


Biological examples

• How are names and identifiers for objects used in biological applications?

• Biodiversity informatics (1730 onwards; keeping track of 2 to 10 million species of organisms: animals, plants, fungi, microbes, ...)

• Bioinformatics (1990 onwards; centred on the gene sequences which specify proteins)


Some uses of biological information

• Bioinformaticians studying genetic material want to use information about a species to understand its development

• Biodiversity scientists (including taxonomists, ecologists, etc.) want to use molecular data to enhance their classifications, phylogenies and ecological models

• Others want to use species data to predict effects of climate change etc.


Biodiversity informatics

• Biological data needs to be linked together in many analyses

• Links often involve the species name as the key linking element

• Good application for Linked Data (IF the species concepts and links are stable)


Changing species concepts

• Good account of the species concept at http://en.wikipedia.org/wiki/Species and how species are named at http://en.wikipedia.org/wiki/Species_naming

• Biological species concept
  – Not usually usable in practice, hence species concept is subjective …

• … which leads to differences of interpretation

• New knowledge → instability of names

• Need Rules of Nomenclature, developed after Linnaeus by Adanson and others ...


Rules of nomenclature 1

• A species (concept, as understood with current knowledge) has the name of whichever type specimen is included within its scope (range of variation, "circumscription")

• if there is no such type specimen with a published name, a new name is published, with a type specimen

• if there is more than one type specimen with a published name, the oldest name takes priority
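
The priority rule above can be sketched in a few lines. This is only an illustration: the `TypeSpecimen` class, the function name and the names and years are hypothetical, and the real codes of nomenclature contain many more conditions than "oldest wins".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeSpecimen:
    name: str  # published name attached to this type specimen
    year: int  # year in which that name was published


def name_for_concept(types_in_scope):
    """Return the name a species concept should carry, or None if a new name is needed."""
    if not types_in_scope:
        # No published type specimen falls within the circumscription:
        # a new name (with a new type specimen) has to be published.
        return None
    # More than one candidate: the oldest published name takes priority.
    return min(types_in_scope, key=lambda t: t.year).name


# Hypothetical names and dates, for illustration only.
types = [TypeSpecimen("Aus bus", 1850), TypeSpecimen("Aus cus", 1790)]
print(name_for_concept(types))
print(name_for_concept([]))
```

Note that the result depends entirely on which type specimens are judged to fall within the circumscription, which is why a change of taxonomic opinion can change the name.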


Natural History Museum


Changing species names

The net effect of the rules of nomenclature is that

1. a newly discovered species gets a new name (good)
2. if a species is subdivided (found with better knowledge to be two or more species), one of the parts keeps the original name (hmmm)
3. if two (or more) species are merged (considered to be insufficiently distinct, or duplicated publications for the same species), one of the original names is retained for the new species (uh oh)
4. if a species is moved to a different genus (including the cases where a genus is split or genera are merged), its name changes (not very good at all)


Checklists

• These effects mean that at different points in time (or with different taxonomic opinions) the same name may apply to different species, and different names may have been applied to the same species

• It is therefore necessary to keep track of species names (in "checklists") and the allocation of identifiers to the species concepts denoted by the names.
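
A minimal sketch of what such a checklist might look like as a data structure, using the jellyfish names from earlier in the talk. The identifier `sp:0001` and the function name are hypothetical, and treating Pelagia colorata as a synonym of Chrysaora colorata is an assumption made for the example.

```python
# Each species concept gets a stable identifier; every name points at a concept.
ACCEPTED = {"sp:0001": "Chrysaora colorata"}  # identifier is hypothetical

NAME_TO_CONCEPT = {
    "Chrysaora colorata": "sp:0001",
    "Pelagia colorata": "sp:0001",  # assumed here to be a synonym
}


def resolve(name):
    """Return (concept identifier, accepted name) for a name, or None if unknown."""
    concept = NAME_TO_CONCEPT.get(name)
    if concept is None:
        return None
    return concept, ACCEPTED[concept]


print(resolve("Pelagia colorata"))
print(resolve("Aus bus"))
```

This flat mapping only handles several names for one concept. To handle the same name applying to different concepts at different times, the key would probably need to include the checklist (or its version) as well as the name.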


Use of checklists

• Taxonomic experts refer to such checklists when handling information about species

• In order to build computer-based species information systems, it is desirable as far as possible to automate the application of checklists to assist in the integration of data from different sources (to create knowledge)


Taxonomy as a thesaurus

Naming species

• Several projects in COMSC are concerned with the terms (names) used to specify the kinds (species) of organisms to which data applies.

• Species are the fundamental units of biological organism variation.

• Various uncertainties arise from
  – changes in data and classification procedures
  – traditional taxonomic practice in assigning “scientific names” (“Latin names”) to species,
  – the Rules of Nomenclature which should be applied, and
  – the consequences of using these names in practice.


Hierarchical classification

• Species are grouped into a hierarchical classification of larger groups (“higher taxa”)

• Many of these taxa have approximate equivalents in everyday usage: mammals, cats, grasses, etc.

• A hierarchical classification can be represented by a tree (a special case of a directed acyclic graph)


Hierarchical classification

• Hierarchical classification systems are familiar from library catalogues, for example, and other thesauri
  – these generally have to be constructed manually, before encountering the data (books, for example)

• Biologists construct hierarchical classifications from the data, after assembling data tables


Alternative classifications

• Multiple classifications exist, because different experts (taxonomists) have differing opinions or use different data sets to form their classifications.

• Current projects in COMSC include cross-mapping between checklists of species in i4Life and OpenBio.

• We are experimenting with ways to cross-map between hierarchies in the i4Life project
  – surprisingly, we cannot find published examples of this being done, either in biology or in other domains. (This may be due to our unfamiliarity with alternative terminology used in other disciplines.)


Predictive classification

• A useful classification has predictive value:

• If you know that an organism is a mammal,
  – you know that it almost certainly has fur
  – gives birth to live young
  – produces milk to feed its young

• If you also know it is in the cat family, you know
  – all the previous things, plus
  – it has retractile claws
  – other features which are harder to define but make it look, well, like a cat

• A predictive classification of objects is very useful if information about them is stored in a database
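
One way to picture this in a database setting is to store each attribute once, at the level of the hierarchy where it becomes predictable, and let lower taxa inherit it. The sketch below assumes a deliberately simplified two-level hierarchy (the cat family placed directly under mammals, skipping intermediate ranks); the table layout and function name are illustrative only.

```python
# Child taxon -> parent taxon: a tree stored as parent pointers.
PARENT = {"Felidae": "Mammalia", "Mammalia": None}

# Attributes recorded at the highest level where they (almost always) hold.
ATTRIBUTES = {
    "Mammalia": {"has fur", "gives birth to live young", "produces milk"},
    "Felidae": {"retractile claws"},
}


def predicted_attributes(taxon):
    """Collect the attributes of a taxon and of all its ancestors."""
    result = set()
    while taxon is not None:
        result |= ATTRIBUTES.get(taxon, set())
        taxon = PARENT.get(taxon)
    return result


print(sorted(predicted_attributes("Felidae")))
```

Knowing only that an organism belongs to the cat family is then enough to predict four attributes, three of which were never stored against that family directly.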


Natural classification

• This predictive value depends on the classification being a “natural” one,
  – which reflects as many features of the organism as possible.

• Special-purpose classifications may be based on a restricted range of information, e.g.
  – plants which are annual, biennial or perennial,
  – toadstools which are edible or poisonous,
  – animals which can be kept without a licence, etc.

• Previous classifications of organisms have not always been “natural” …


Celestial Emporium of Benevolent Knowledge

Contains a taxonomy which divides all animals into one of 14 categories:

• Those that belong to the emperor
• Embalmed ones
• Those that are trained
• Suckling pigs
• Mermaids
• Fabulous ones
• Stray dogs
• Those that are included in this classification
• Those that tremble as if they were mad
• Innumerable ones
• Those drawn with a very fine camel hair brush
• Others
• Those that have just broken a flower vase
• Those that, at a distance, resemble flies


Celestial Emporium of Benevolent Knowledge

Sadly, according to Wikipedia http://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevolent_Knowledge%27s_Taxonomy

this allegedly ancient Chinese source quoted by Jorge Luis Borges in 1942 is apparently fictitious.


Creating a “natural” classification

• Need to take into account as much information as possible,
  – so that the resulting classification is as predictive as possible.

• In practice, biologists tend to short-circuit this process by choosing characteristics that they feel are
  – more fundamental
  – associated with many other characteristics, thus imparting predictive power.

• Most biologists now choose attributes felt to
  – reflect the evolutionary changes in organisms,
  – thus being associated with other characteristics,
  – therefore providing predictive power.


Taxonomic data sets

Matrix of

• species (rows, data objects) × characteristics (columns, attributes)
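
As a concrete picture of such a matrix, the sketch below uses three invented species scored for three invented binary characteristics, and derives a species-by-species distance matrix from it. The distance used (the fraction of characteristics on which two species differ) is just one simple choice among many.

```python
# Rows are species (data objects), columns are characteristics (attributes).
# All names and values are made up for illustration.
species = ["sp_A", "sp_B", "sp_C"]
characters = ["has_wings", "has_fur", "lays_eggs"]
matrix = [
    [1, 0, 1],  # sp_A
    [1, 0, 1],  # sp_B
    [0, 1, 0],  # sp_C
]


def mismatch_distance(row_a, row_b):
    """Fraction of characteristics on which two species differ."""
    differing = sum(a != b for a, b in zip(row_a, row_b))
    return differing / len(row_a)


# Square, symmetric distance matrix between all pairs of species.
distances = [[mismatch_distance(a, b) for b in matrix] for a in matrix]
for name, row in zip(species, distances):
    print(name, row)
```

A distance matrix of this kind is the usual starting point for many ordination and clustering methods.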


Other types of data

which are subject to classification studies:

• Vegetation: matrix of species (rows, attributes) × locations (columns, data objects)

• Bioinformatics: protein structures, enzyme types, etc.

• Non-biological data: chemical structures, …


Ordination methods

• Based on linear model:
  – Principal Components Analysis (PCA)
  – Principal Coordinates Analysis (PCoA)

• Non-linear (non-parametric):
  – multidimensional scaling including NMMDS (Kruskal scaling etc.)


PCA (1)

The original data attributes as axes:


PCA (2)

Find new axes and rotate them (using the largest)
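
For two attributes the "new axis" can be found by hand, which makes the idea easy to check. The sketch below is restricted to 2-D data so that the largest eigenvalue of the covariance matrix has a closed form; real analyses would use a numerical library. The function name and sample points are illustrative.

```python
import math


def first_principal_axis(points):
    """Return (unit vector, variance along it) for the first principal axis of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centred data.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Matching eigenvector; if sxy is zero the original axes are already principal.
    if sxy != 0:
        vx, vy = largest - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    length = math.hypot(vx, vy)
    return (vx / length, vy / length), largest


# Points lying exactly on the line y = x: the first axis should be the diagonal.
axis, variance = first_principal_axis([(0, 0), (1, 1), (2, 2), (3, 3)])
print(axis, variance)
```

The eigenvalue is the variance of the data along the new axis, so "using the largest" means keeping the direction along which the objects are most spread out.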


Clustering methods

• K-means “clustering”
  – fixed number of clusters
  – can be useful when added to a PCA plot

• Agglomerative, including
  – single linkage, complete linkage,
  – average linkage (UPGMA), weighted (WPGMA),
  – Ward's method (Euclidean distance), etc.

• Divisive (particularly developed at Canberra and Southampton for use in vegetation studies):
  – monothetic (pick best attribute for separation)
  – polythetic (AXOR based on PCA axes, etc.)
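
Of the agglomerative methods listed, single linkage is the simplest to write down. The following is a naive sketch (cubic time or worse, fine for a handful of objects) that records the sequence of merges; the distance matrix is invented for the example.

```python
def single_linkage(dist):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical distances between four objects (0..3).
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
merges = single_linkage(dist)
for a, b, d in merges:
    print(sorted(a), "+", sorted(b), "at distance", d)
```

Replacing `min` with `max` in the inner distance calculation would give complete linkage instead. The list of merges and their heights is exactly the information a dendrogram displays.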


Dendrograms

Visualisation of clustering (other than K-means) by means of dendrograms


Divisive classification

• Previously described method is agglomerative: larger clusters are built from smaller ones

• Divisive methods start with all objects in one cluster, and divide and subdivide repeatedly
  – monothetic: pick the attribute at each step which best separates the clusters (on some criterion)
  – polythetic: multivariate, e.g. find the first PCA axis as a guide to where to divide (“AXOR”)
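
A monothetic division can be sketched as a short recursive function. The criterion used here, choosing the binary attribute that splits the current group most evenly, is only a stand-in for "some criterion" and is not claimed to be the one used by any published method; the objects and attribute values are invented.

```python
def divide(items, attributes):
    """Recursively split items on one binary attribute at a time.

    items maps an object name to a tuple of 0/1 attribute values.
    Returns a sorted list of names (leaf) or (attribute, absent_branch, present_branch).
    """
    def count(attribute):
        return sum(values[attribute] for values in items.values())

    # Only attributes that actually vary within this group can separate it.
    usable = [a for a in attributes if 0 < count(a) < len(items)]
    if len(items) <= 1 or not usable:
        return sorted(items)
    # Assumed criterion: the most even split (ties go to the first attribute).
    chosen = min(usable, key=lambda a: abs(2 * count(a) - len(items)))
    absent = {n: v for n, v in items.items() if v[chosen] == 0}
    present = {n: v for n, v in items.items() if v[chosen] == 1}
    return (chosen, divide(absent, attributes), divide(present, attributes))


# Four hypothetical objects scored for three binary attributes (0, 1, 2).
items = {"A": (1, 1, 0), "B": (1, 0, 0), "C": (0, 1, 1), "D": (0, 0, 1)}
tree = divide(items, [0, 1, 2])
print(tree)
```

Each internal node of the result asks about a single attribute, which is what distinguishes a monothetic division from a polythetic one.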

• Results can still be shown as dendrograms


Quality of a classification or ordination

• Cophenetic correlation coefficient
  – applicable to methods which start from a distance (or similarity) matrix
  – calculate the distance between all pairs of objects according to the classification (dendrogram) or ordination (scatter plot)
  – calculate the correlation between these values and the original inter-object distances

• Assess the “predictive” property of the classification
  – make predictions of attribute values from the classification
  – test them for correlation with actual known values, e.g. using Tukey's jackknife.
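
The cophenetic correlation can be sketched for the single-linkage case, where the distance implied by the dendrogram has a convenient shortcut: the height at which two objects first join equals the smallest possible "largest step" on any path between them. That shortcut applies to single linkage only; other linkage methods need the actual merge heights. The distance matrix and function names are illustrative.

```python
import math


def single_linkage_cophenetic(dist):
    """Distances between objects as implied by a single-linkage dendrogram."""
    n = len(dist)
    coph = [row[:] for row in dist]
    # Minimax-path variant of the Floyd-Warshall algorithm.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                coph[i][j] = min(coph[i][j], max(coph[i][k], coph[k][j]))
    return coph


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cophenetic_correlation(dist):
    """Correlate original distances with dendrogram-implied distances."""
    coph = single_linkage_cophenetic(dist)
    pairs = [(i, j) for i in range(len(dist)) for j in range(i + 1, len(dist))]
    return pearson([dist[i][j] for i, j in pairs], [coph[i][j] for i, j in pairs])


# Hypothetical distances between four objects.
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
print(round(cophenetic_correlation(dist), 3))
```

A value close to 1 suggests the dendrogram distorts the original distances very little; lower values suggest the hierarchy is being imposed on data that do not fit it well.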


Uses of a classification (1)

• Taxonomic revision
  – taxonomists typically do not rely on automatically generated clusters, but may use them to guide and support their decisions
  – Example needed


Uses of a classification (2)

• Identification keys
  – an identification key is similar to a monothetic divisive classification
  – for example …


Example identification key

Part of a computer-generated key to the grass genera of the Australian Capital Territory


Estimating phylogeny

Now widely used because of the easy availability of molecular sequence data

• Build a tree which attempts to estimate the presumed course of evolutionary change and divergence in a group of organisms

• Try to localise specific changes to a particular point along a branch in the tree
  – e.g. sequence change, gene substitution, morphological feature
  – minimise the number of occurrences of such changes (parsimony)

• The resulting tree is a cladogram, not a dendrogram
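
To make "minimise the number of changes" concrete, the sketch below counts the minimum number of changes a given tree requires for a single character, using Fitch's algorithm (named here as one standard way to do this; the talk does not say which method is meant). The trees and character states are invented, and searching over all possible trees is left out.

```python
def fitch(tree, states):
    """Return (candidate states at the root, minimum number of changes).

    tree is either a leaf name (str) or a pair (left, right) of subtrees.
    states maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one change is needed below this node.
    return left_set | right_set, left_cost + right_cost + 1


# One hypothetical sequence position scored in four organisms.
states = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch(tree_1, states)[1], fitch(tree_2, states)[1])
```

On this character the first tree needs fewer changes than the second, so parsimony prefers it. In practice the counts are summed over many characters and compared across many candidate trees.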


Cladogram


Uses of a classification (3)

• As the basis for reliable retrieval of data
  – Clusters given names (terms in a thesaurus)
    • often but not necessarily by taxonomists
  – these terms used in data storage and retrieval

• To increase knowledge
  – e.g. when investigating the effects of climate change
  – By recognising different types of organisms so that variation does not obscure the results
  – By studies of the causes of the variation
  – For its own sake (greater knowledge of the organisms)


Data quality in information retrieval

• noise in data

• errors in data (e.g. GBIF observation locations)

• accuracy of the assignment of terms (species names) to objects being searched for
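
Some location errors can be caught with very simple checks before any analysis starts. The sketch below shows the kind of screening that might be applied to downloaded observation records; the specific checks and the function name are assumptions for illustration and do not describe what GBIF itself does.

```python
def coordinate_problems(lat, lon):
    """Return a list of reasons why a latitude/longitude pair looks suspect."""
    if lat is None or lon is None:
        return ["missing coordinate"]
    problems = []
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    if lat == 0 and lon == 0:
        problems.append("exactly 0,0 (possibly a placeholder for missing data)")
    return problems


# Hypothetical records: (latitude, longitude).
for record in [(51.48, -3.18), (95.0, 10.0), (0, 0), (None, 12.5)]:
    print(record, coordinate_problems(*record))
```

Checks like these find only impossible or implausible values. A record with plausible but wrong coordinates (for example a swapped sign) needs comparison against other information, such as the known range of the species.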


GBIF observations portal

• GBIF is the Global Biodiversity Information Facility www.gbif.org

• Provides downloadable information on the locations of millions of observations of individual organisms (of known species), including the locations where museum specimens were collected.

• An excellent resource for studies of the distribution of organisms, with applications in modelling the effects of climate change, amongst others.

• Yesson C, Brewer PW, Sutton T, Caithness N, Pahwa JS, et al (2007) How Global Is the Global Biodiversity Information Facility? PLoS ONE 2(11): e1124. doi:10.1371/journal.pone.0001124 investigated this further …


Yesson et al. (2007)

• The GBIF portal was queried for georeferenced data (i.e. those with latitude/longitude coordinates)

• For all species names from ILDIS (the plant family Leguminosae/Fabaceae at www.ildis.org), including synonyms but excluding a few special cases. This consisted of 31,086 names representing 20,003 species.


Yesson et al. (2007) Fig. 1


Yesson et al. (2007) Fig. 2


Yesson et al. (2007) Fig. 4


Yesson et al. (2007) Fig. 3


Importance of correct species name use …


In conservation …

• Recently fisheries biologists have tried to conserve the endangered Greenback cutthroat trout, Oncorhynchus clarkii stomias by rearing thousands of fish and stocking them in Colorado rivers, lakes and streams.

• Unfortunately (as shown using mitochondrial and nuclear genetic markers), the fish used to stock were mostly the closely related Colorado River cutthroat trout, Oncorhynchus clarkii pleuriticus.

• Metcalf JL, Pritchard VL, Silvestri SM, Jenkins JB, Wood JS, Cowley DE, Evans RP, Shiozawa DK, Martin AP (2007) Across the great divide: genetic forensics reveals misidentification of endangered cutthroat trout populations. Molecular Ecology 2007 Aug 28. DOI: 10.1111/j.1365-294X.2007.03472.x


… and in data aggregation

• Data resources such as GBIF and the Catalogue of Life (www.catalogueoflife.org) are frequently updated.

• A retrieval using a particular species name may therefore produce different results on different occasions.

• This is as you might expect, but …


Errors in names

• Erroneous information will be retrieved if
  – The terms (species names) being used in the search are wrong
  – The species data is recorded in the database using incorrect names

• Chapman, A. D. 2005. Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.


Chapman (2005): Nomenclatural Error

• “Names form the major key for accessing information in primary species databases. If the name is wrong, then access to the information by users will be difficult, if not impossible. In spite of having rules for biological nomenclature for around 100 years, the nomenclatural and taxonomic information in a database ... is often the most difficult in which to detect and clean errors. It is also the area that causes the most angst and loss of confidence amongst users in primary species databases. This is often due to ignorance amongst users of the need for taxonomic changes and nomenclatural changes, but is also partly due to taxonomists not fully documenting and explaining these changes to users...”

• “The easier of these errors to clean is the nomenclatural data – the misspellings. Lists of names (and synonyms) are the key tools for helping with this task...”
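
Using a list of names to catch misspellings can be sketched with approximate string matching from the Python standard library. The checklist contents, the cutoff of 0.85 and the function name are illustrative choices, not taken from Chapman's report.

```python
import difflib

# A tiny stand-in for a real checklist of names and synonyms.
CHECKLIST = ["Chrysaora colorata", "Pelagia colorata", "Oncorhynchus clarkii"]


def suggest_names(name, checklist=CHECKLIST, cutoff=0.85):
    """Return checklist names that closely resemble a possibly misspelled name."""
    return difflib.get_close_matches(name, checklist, n=3, cutoff=cutoff)


print(suggest_names("Chrysaora colorota"))  # one letter wrong
print(suggest_names("Homo sapiens"))        # not in the checklist at all
```

A suggestion of this kind still needs human confirmation: two valid names can differ by a single letter, so a close match is not proof of a misspelling.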


Chapman (2005): Taxonomic Error

• “Taxonomic error – the inaccurate identification or misidentification of the collection is the most difficult of errors to detect and clean. ... experts working in taxonomic groups examine the specimens from time to time and determine their circumscription or identification...”

• “Traditional tools include publications such as taxonomic revisions, national and regional floras and faunas, and illustrated checklists. Newer tools include automated and computer-generated keys to taxa; interactive electronic publications with illustrations, descriptions, keys, and illustrated glossaries; character-based databases; imaging tools; scientific image databases that include images of types; systematic images of collections; and easily accessible on-line images (both scientifically verified and others).”


Take-home message

Accurate and reliable information retrieval from databases requires that

• The objects described are accurately classified so that their data is precisely targeted (dissimilar objects are not lumped together)

• The search terms used to locate these objects (e.g. species names) are accurate

• The data values describing the objects (e.g. geographical locations) are accurate
