Knowledge extraction and visualisation using rule-based machine learning

Dr. Jaume Bacardit

Interdisciplinary Computing and Complex Systems (ICOS) research group

University of Nottingham

[email protected]

ICOS seminar. 11/10/2012



Preface

• I came to Nottingham in 2005 to work as a postdoc in a project applying evolutionary rule learning to protein structure prediction (EPSRC GR/T07534/01). In the project we managed to:

– Generate predictors that are competitive with the state of the art

– Extract human-readable explanations providing new knowledge

– Propose several improvements to the learning algorithms so they could scale to big problems

• When I became a lecturer in 2008 I started several collaborations with experimentalists analysing biological data of all kinds, always with the goal of extracting knowledge

– Thanks to having sets of rules, it is relatively straightforward to develop a generic methodology to extract knowledge from them, that can be applied almost straight away to a variety of datasets

– Still, we are only at the tip of the iceberg: there are many ways in which this analysis can be made more efficient/reliable/useful

RULE LEARNING

A set of rules as a knowledge representation

[Figure: 2D plot with axes X and Y, each ranging from 0 to 1, annotated with the rules below]

If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then

If (X>0.75 and Y>0.75) then

If (X<0.25 and Y<0.25) then

Everything else
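
A minimal sketch of how a rule set like this one can be evaluated as a decision list: rules are tried in order, the first match decides, and "everything else" acts as the default. The class labels are hypothetical placeholders, since the predicted classes are not recoverable from the slide text.

```python
# Class labels below are hypothetical placeholders, not taken from the slide.
def classify(x, y):
    """Evaluate the rules top to bottom; the first rule that matches decides."""
    if (x < 0.25 and y > 0.75) or (x > 0.75 and y < 0.25):
        return "class_1"
    if x > 0.75 and y > 0.75:
        return "class_2"
    if x < 0.25 and y < 0.25:
        return "class_3"
    return "default"  # "Everything else"


if __name__ == "__main__":
    for point in [(0.1, 0.9), (0.9, 0.9), (0.1, 0.1), (0.5, 0.5)]:
        print(point, classify(*point))
```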

Another example

Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)

The BioHEL rule learning system

• BioHEL [Bacardit et al., 09] is an evolutionary learning system that applies the Iterative Rule Learning (IRL) approach

• Designed explicitly to deal with noisy large-scale datasets

• IRL was first used in EC by the SIA system [Venturini, 93]

BioHEL’s learning paradigm

– IRL has been used for many years in the ML community, under the name of separate-and-conquer

– A standard elitist Genetic Algorithm generates each rule
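
The separate-and-conquer loop can be sketched as follows. This is an illustration only: the rule learner here is a plain random search standing in for the genetic algorithm, and the rule format ("attribute > threshold" tests stored in a dict) and all function names are assumptions made for the example.

```python
import random


def covers(rule, example):
    """A rule is a dict {attribute_index: threshold}; every 'att > threshold' test must hold."""
    return all(example[i] > t for i, t in rule.items())


def learn_one_rule(examples, labels, target, rng, tries=200):
    """Stand-in for the GA: random search for the rule maximising hits minus mistakes."""
    best, best_score = None, 0
    n_atts = len(examples[0])
    for _ in range(tries):
        att = rng.randrange(n_atts)
        rule = {att: rng.choice(examples)[att]}
        hits = sum(1 for e, l in zip(examples, labels) if covers(rule, e) and l == target)
        mistakes = sum(1 for e, l in zip(examples, labels) if covers(rule, e) and l != target)
        if hits - mistakes > best_score:
            best, best_score = rule, hits - mistakes
    return best


def iterative_rule_learning(examples, labels, target, seed=0):
    """Learn one rule, remove the examples it covers, repeat on what is left."""
    rng = random.Random(seed)
    examples, labels = list(examples), list(labels)
    rule_set = []
    while any(l == target for l in labels):
        rule = learn_one_rule(examples, labels, target, rng)
        if rule is None:  # no useful rule found; stop
            break
        rule_set.append(rule)
        remaining = [(e, l) for e, l in zip(examples, labels) if not covers(rule, e)]
        examples = [e for e, _ in remaining]
        labels = [l for _, l in remaining]
    return rule_set
```

Examples left uncovered at the end fall through to the default rule.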

BioHEL’s characteristics 1/2

• Objective function that tries to balance the generation of accurate and general rules

– Accurate: not making many mistakes

– General: covering as many examples as possible and covering as much of the search space as possible

• Attribute list rule representation

– Automatically identifying the relevant attributes for a given rule and discarding all the other ones

• Ensemble mechanisms

– Exploiting the GA’s stochasticity to construct ensembles of rule sets, all of them generated from the same data, but with different random seeds; also ensembles for ordinal classification
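
A toy sketch of the first two ideas, under stated assumptions: a rule stores tests only for the attributes it lists, and a score trades accuracy against coverage. The `Rule` class, the interval tests and the weighted sum in `toy_fitness` are hypothetical illustrations and are not BioHEL's actual representation or objective function.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # Attribute-list idea: only the attributes present in this dict are tested;
    # every other attribute is ignored by the rule.
    intervals: dict  # attribute index -> (lower bound, upper bound)
    prediction: str

    def matches(self, example):
        return all(lo <= example[a] <= hi for a, (lo, hi) in self.intervals.items())


def accuracy_and_coverage(rule, examples, labels):
    """Accuracy: fraction of matched examples predicted correctly.
    Coverage: fraction of all examples the rule matches."""
    matched = [l for e, l in zip(examples, labels) if rule.matches(e)]
    if not matched:
        return 0.0, 0.0
    accuracy = sum(l == rule.prediction for l in matched) / len(matched)
    coverage = len(matched) / len(examples)
    return accuracy, coverage


def toy_fitness(rule, examples, labels, weight=0.9):
    """Hypothetical weighted trade-off between accuracy and generality."""
    accuracy, coverage = accuracy_and_coverage(rule, examples, labels)
    return weight * accuracy + (1 - weight) * coverage
```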

BioHEL’s characteristics 2/2

• The ILAS windowing scheme

– Efficiency enhancement method. Training set divided into strata. Different GA iterations use different strata for their evaluation using a round-robin policy

• GPGPU-based fitness evaluation

– Obtaining ~50x speedups on large datasets on its own and ~700x speedups in combination with ILAS
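
The round-robin use of strata can be sketched as below. The interleaved split and the function names are assumptions made for illustration; in particular, this sketch makes no attempt to balance classes within each stratum.

```python
def make_strata(n_examples, n_strata):
    """Split example indices into n_strata disjoint, roughly equal strata (interleaved)."""
    return [list(range(s, n_examples, n_strata)) for s in range(n_strata)]


def stratum_for_iteration(strata, iteration):
    """Round-robin policy: GA iteration i evaluates rules on stratum i mod n_strata."""
    return strata[iteration % len(strata)]


if __name__ == "__main__":
    strata = make_strata(10, 3)
    for iteration in range(5):
        print(iteration, stratum_for_iteration(strata, iteration))
```

Each fitness evaluation then touches only about 1/n_strata of the training set, which is where the efficiency gain comes from.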

CASE STUDIES

Mining –omics data

Protein contact map prediction

Functional Network Reconstruction for seed germination

Microarray data obtained from seed tissue of Arabidopsis thaliana

122 samples represented by the expression level of almost 14000 genes

It had been experimentally determined whether each of the seeds had germinated or not

Can we learn to predict germination/dormancy from the microarray data?

Bassel et al., Plant Cell 23(9):3101-3116, 2011

Generating rule sets

BioHEL was able to predict the outcome of the samples with 93.5% accuracy (10 x 10-fold cross-validation)

Learning from a scrambled dataset (labels randomly assigned to samples) produced ~50% accuracy

If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 Predict germination

If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 Predict germination

If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 Predict germination

If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 Predict germination

Everything else Predict dormancy

Identifying regulators

Rule building process is stochastic

– Generates different rule sets each time the system is run

But if we run the system many times, we can see some patterns in the rule sets

– Genes appearing much more frequently than the rest

– Some associated with dormancy

– Some associated with germination

We generated 10K rule sets for each outcome

– Rules predicted one of the two outcomes

– Default rule captured the other

Known regulators appear with high frequency in the rules
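
A minimal sketch of the counting step, assuming each rule has already been reduced to the list of gene identifiers it tests. The data layout and the function name are assumptions for the example, and the toy input in the usage section is made up.

```python
from collections import Counter


def gene_frequencies(rule_sets):
    """Count in how many rules each gene appears, over many rule sets.

    rule_sets: list of rule sets; each rule set is a list of rules;
    each rule is a list of the gene identifiers used in its conditions.
    """
    counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            counts.update(set(rule))  # count a gene once per rule
    return counts


if __name__ == "__main__":
    # Toy input: two tiny rule sets, not real results.
    toy = [
        [["At1g27595", "At3g49000"], ["At4g34710"]],
        [["At1g27595", "At2g40475"]],
    ]
    for gene, n in gene_frequencies(toy).most_common():
        print(gene, n)
```

Sorting genes by this count gives the ranking from which high-frequency candidates can be picked.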

Generating co-prediction networks of interactions

• For each of the rules shown before to be true, all of the conditions in it need to be true at the same time

– Each rule is expressing an interaction between certain genes

• From a high number of rule sets we can identify pairs of genes that co-occur with high frequency and generate functional networks with a methodology coined as co-prediction

• The network shows a different topology when compared to other types of network construction methods (e.g. by gene co-expression)

• Different regions in the network contain the germination and dormancy genes.

• Other visualisations providing the big picture exist (Urbanowicz et al., 2012)
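
The co-occurrence counting behind such a network can be sketched as follows. This is an illustration under assumptions: rules are given as lists of gene identifiers, and the `min_count` cut-off for keeping an edge is a hypothetical parameter, not a value from the study.

```python
from collections import Counter
from itertools import combinations


def co_prediction_edges(rule_sets, min_count=2):
    """Return {(gene_a, gene_b): count} for pairs of genes that appear in the
    same rule at least min_count times across all rule sets."""
    pair_counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            # Sorting makes (a, b) and (b, a) the same undirected edge.
            for a, b in combinations(sorted(set(rule)), 2):
                pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}
```

The resulting dictionary is an edge list with weights, which can be loaded into any graph tool for visualisation.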

Experimental validation

We have experimentally verified this analysis

By ordering and planting knockouts for the highly ranked genes

We have been able to identify four new regulators of germination, with phenotypes different from the wild type

Same analysis. Different datasets

• We applied the same principle to three cancer datasets from the literature (E. Glaab et al., PLoS ONE (2012) 7(7):e39932)

• We checked PubMed to see if the genes linked together in BioHEL’s rules appeared together in the literature

• We used Point-Wise Mutual Information (PMI) to quantify that the genes do not appear linked together in the literature by chance

• Compared the PMI scores of the highly ranked pairs of genes with random pairs

BioHEL’s scores were much better than random
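
For reference, the standard definition of point-wise mutual information computed from document counts is sketched below. The counts in the usage example are made up, and the exact PMI variant used in the study may differ from this textbook form.

```python
import math


def pmi(n_both, n_a, n_b, n_total):
    """PMI = log2( p(a, b) / (p(a) * p(b)) ), estimated from counts.

    n_both:  documents mentioning both genes
    n_a:     documents mentioning gene a
    n_b:     documents mentioning gene b
    n_total: documents in the collection
    """
    if min(n_both, n_a, n_b) == 0:
        return float("-inf")  # no co-occurrence evidence at all
    p_ab = n_both / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return math.log2(p_ab / (p_a * p_b))


if __name__ == "__main__":
    # Made-up counts: the pair co-occurs 10x more often than chance predicts.
    print(pmi(n_both=10, n_a=100, n_b=100, n_total=10_000))
```

A PMI of zero means the two genes co-occur exactly as often as independence would predict; positive values indicate they appear together more often than by chance.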

And to lots of other datasets!

• These datasets were generated using transcriptomics technology

– Looks at RNA

• There are lots of other –omics (hundreds of them)

– Proteomics

– Lipidomics

– Metabolomics

– Next-generation sequencing

• Each –omics requires specific preprocessing, but the learning and knowledge extraction process is exactly the same

• Lots of datasets out there

Another example different from -omics

• Protein Structure Prediction aims to predict the 3D structure of a protein based on its primary sequence

Prediction types of PSP

• There are several kinds of prediction problems within the scope of PSP

– The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence

– There are many structural properties of individual residues within a protein that can be predicted, for instance:

• The secondary structure state of the residue

• If a residue is buried in the core of the protein or exposed on the surface

– Accurate predictions of these sub-problems can simplify the general 3D PSP problem

Contact Map prediction

• Prediction, for each pair of residues in a protein, whether these residues are in contact (have a small distance between them in the 3D structure) or not

• This problem can be represented by a binary matrix: 1 = contact, 0 = non-contact. Plotting this matrix reveals the main traits in the protein structure

• Very sparse characteristic: Less than 2% of contacts in native structures

• Training sets easily reach millions of residue pairs

• Our method was one of the top predictors in the last two editions of the CASP competition (actually, the best sequence-based predictor in last CASP)

[Figure: example contact map, with regions labelled "helices" and "sheets"]

(Bacardit et al., Bioinformatics (2012) 28 (19): 2441-2448)
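
A minimal sketch of how the binary matrix is obtained from a known structure. The 8 Å threshold and the use of one coordinate per residue are assumptions for illustration (the slides only say "a small distance"), and the diagonal is set to 0 by convention here.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact matrix from one 3D coordinate per residue.

    threshold (in Angstroms) is an assumed cut-off, not a value from the slides.
    """
    n = len(coords)
    return [
        [
            1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
    for row in contact_map(toy_coords):
        print(row)
```

In the prediction setting the coordinates are unknown, so each cell of this matrix becomes one training or test example described by sequence-derived attributes.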

Steps for CM prediction

1. Prediction of

– Secondary structure (using PSIPRED)

– Solvent Accessibility

– Recursive Convex Hull

– Coordination Number

2. Integration of all these predictions plus other sources of information

3. Final CM prediction (using BioHEL)

Using BioHEL [Bacardit et al., 09]

Characterisation of the contact map problem

Three types of input information were used

1. Detailed information of three different windows of residues centered around

– The two target residues (2x)

– The middle point between them

2. Information about the connecting segment between the two target residues

3. Global protein information


Samples and ensembles

Training set contained 32 million pairs of AA and 631 attributes (+60GB of disk space)

50 samples of 660K examples are generated from the training set with a ratio of 2:1 non-contacts/contacts

BioHEL is run 25 times for each sample

Prediction is done by a consensus of 1250 rule sets

Confidence of prediction is computed based on the votes distribution in the ensemble.

Whole training process took about 25K CPU hours
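
The ensemble size follows from 50 samples x 25 runs = 1250 rule sets. A minimal sketch of consensus voting with a confidence value is given below; taking the confidence to be the fraction of votes for the winning label is an assumption, since the slides only say it is computed from the vote distribution.

```python
from collections import Counter


def consensus(votes):
    """Majority vote over the predictions of all rule sets.

    Returns (winning label, fraction of the ensemble that voted for it).
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


if __name__ == "__main__":
    n_rule_sets = 50 * 25  # samples x runs per sample
    # Made-up vote split for one residue pair.
    votes = ["contact"] * 900 + ["non-contact"] * (n_rule_sets - 900)
    print(consensus(votes))
```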

[Figure: Training set → Samples (x50) → Rule sets (x25) → Consensus → Predictions]

Knowledge extraction in contact map prediction

• Basic analysis is exactly the same

Frequent attributes

Frequent pairs of attributes

But analysis can be much more refined

• Because the representation has a very clear structure and we have lots of domain knowledge

• For instance, there are several ways to aggregate the ranks of individual attributes based on characteristics from the representation/domain

Ranks aggregated by source of information

Ranks aggregated by amino acid type
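
One simple way to aggregate ranks by group is sketched below. Using the mean rank per group is an assumption made for the example, and the attribute and group names are hypothetical.

```python
from collections import defaultdict


def aggregate_ranks(attribute_ranks, group_of):
    """Average the ranks of individual attributes within each group.

    attribute_ranks: {attribute name: rank}, where 1 is the most frequent attribute
    group_of:        {attribute name: group label}, e.g. source of information
    """
    grouped = defaultdict(list)
    for attribute, rank in attribute_ranks.items():
        grouped[group_of[attribute]].append(rank)
    return {group: sum(ranks) / len(ranks) for group, ranks in grouped.items()}


if __name__ == "__main__":
    # Hypothetical attribute names and ranks.
    ranks = {"ss_win1": 1, "ss_win2": 3, "sa_win1": 2, "sa_win2": 6}
    groups = {"ss_win1": "SS", "ss_win2": "SS", "sa_win1": "SA", "sa_win2": "SA"}
    print(aggregate_ranks(ranks, groups))
```

Swapping the `group_of` mapping (source of information, amino acid type, window position) gives a different view of the same ranking without rerunning the learner.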

CHALLENGES AND OPPORTUNITIES

The knowledge extraction can be much more refined

• We just looked at what attributes appear in the rules, but not yet at the shape of the predicates

• Sometimes biasing the representation helps generate knowledge that is more useful to the domain experts

– In the experiments with the seed data BioHEL was constrained to generate only predicates “Att>X”

– But we always have to be careful when introducing bias

Is the knowledge real?

• Data is far from perfect, lots of spurious peaks

• Probably many of the edges in the network are false positives

• Strategies for filtering the knowledge

– Classic blind feature selection?

– Contrast the knowledge with databases of curated information about the genes/interactions

• Some of these are quite pricy!

• Or we need strong text mining skills

– Careful balance is needed, we don’t want to filter true positives

– Using expert knowledge to bias the learning process (Moore & White, 2006)

Modelling the ML problem

• Datasets annotated as “case/controls” are easy

• What happens with N>2 labels?

– Tricky for decision lists, as there is an implicit overlap between rules

• What happens with continuous annotations?

– There are similar examples in the literature using model trees (Nepomuceno-Chamorro et al., 2010)

• What happens when the annotation is a time course?

– Ordinal classification problem

References

• BioHEL

– Improving the scalability of rule-based evolutionary learning. J. Bacardit, E.K. Burke and N. Krasnogor. Memetic Computing journal 1(1):55-67, 2009

– Speeding Up the Evaluation of Evolutionary Learning Systems using GPGPUs. M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO2010), 1039-1046, ACM Press, 2010

– Modelling the Initialisation Stage of the ALKR Representation for Discrete Domains and GABIL Encoding. M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation - GECCO2011, pages 1291-1298. ACM, 2011

– Post-processing Operators for Decision Lists. M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation - GECCO2012, pages 847-854. ACM, 2012

– Analysing BioHEL using challenging boolean functions. M. Franco, N. Krasnogor and J. Bacardit. Evolutionary Intelligence, 5(2):87-102, June 2012

References

• Knowledge extraction and visualisation

– Prediction of Recursive Convex Hull Class Assignments for Protein Residues. Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Bioinformatics, 24(7):916-923, 2008

– Automated Alphabet Reduction for Protein Datasets. J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. BMC Bioinformatics 10:6, 2009

– Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets. George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume Bacardit. The Plant Cell, 23(9):3101-3116, 2011

– E. Glaab, J. Bacardit, J.M. Garibaldi and N. Krasnogor. Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data. PLoS ONE 7(7):e39932. 2012. doi:10.1371/journal.pone.0039932

– J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics (2012) 28 (19): 2441-2448. doi:10.1093/bioinformatics/bts472

– HP Fainberg, K. Bodley, J. Bacardit, D. Li, F. Wessely, NP. Mongan, ME. Symonds, L. Clarke and A. Mostyn, Reduced neonatal mortality in Meishan piglets: a role for hepatic fatty acids? PLoS ONE, in press, 2012

References

• Related work

– Nepomuceno-Chamorro, I.A., Aguilar-Ruiz, J.S., and Riquelme, J.C. (2010). Inferring gene regression networks with model trees. BMC Bioinformatics 11: 517

– Moore, J. and White, B., Exploiting expert knowledge in genetic programming for genome-wide genetic analysis, Parallel Problem Solving from Nature-PPSN IX, pp. 969-977, 2006

– R. J. Urbanowicz, A. Granizo-MacKenzie, and J. H. Moore. Instance-linked attribute tracking and feedback for Michigan-style supervised learning classifier systems. In GECCO ’12: Proceedings of the 14th annual conference on Genetic and evolutionary computation, pages 927–934. ACM Press, 2012

Acknowledgements

• Natalio Krasnogor

• Michael Holdsworth

• George Bassel

• Enrico Glaab

• Pawel Widera

• Maria Franco

• Anna Swan

• Hernan Fainberg

• EPSRC GR/T07534/01 & EP/H016597/1
