Post on 15-Jan-2016
Classification by Machine Learning Approaches
Michael J. Kerner – kerner@cbs.dtu.dk
Center for Biological Sequence Analysis, Technical University of Denmark
Outline
• Introduction to Machine Learning
• Datasets, Features
• Feature Selection
• Machine Learning Approaches (Classifiers)
• Model Evaluation and Interpretation
• Examples, Exercise
Machine Learning – Data Driven Prediction
To Learn: “to gain knowledge or understanding of or skill in by study, instruction, or experience”
(Merriam Webster English Dictionary, 2005)
Machine Learning: Learning the theory automatically from the data, through a process of inference, model fitting, or learning from examples:
Automated extraction of useful information from a body of data by building good probabilistic models.
Ideally suited for areas with lots of data in the absence of a general theory.
Why do we need Machine Learning?
• Some tasks cannot be defined well, except by examples (e.g. recognition of faces or people).
• Large amounts of data may have hidden relationships and correlations. Only automated approaches may be able to detect these.
• The amount of knowledge about a certain problem / task may be too large for explicit encoding by humans (e.g. in medical diagnostics)
• Environments change over time, and new knowledge is constantly being discovered. A continuous redesign of the systems “by hand” may be difficult.
The Machine Learning Approach
[Figure: input data (e.g. gene expression profiles, …) → machine learning classifier → prediction: yes / no]
Machine Learning
• Learning Task:
– What do we want to learn or predict?
• Data and assumptions:
– What data do we have available?
– What is their quality?
– What can we assume about the given problem?
• Representation:
– What is a suitable representation of the examples to be classified?
• Method and Estimation:
– Are there possible hypotheses?
– Can we adjust our predictions based on the given results?
• Evaluation:
– How well does the method perform?
– Might another approach/model perform better?
Learning Tasks
• Classification:
– Prediction of an item class.
• Forecasting:
– Prediction of a parameter value.
• Characterization:
– Find hypotheses that describe groups of items.
• Clustering:
– Partitioning of the (unassigned) data set into clusters with common properties. (Unsupervised learning)
Emergence of Large Datasets
Dataset examples:
• Image processing
• Spam email detection
• Text mining
• DNA micro-array data
• Protein function
• Protein localization
• Protein-protein interaction
• …
Dataset Examples
• Edible or poisonous?
• mRNA splicing: mRNA splice site prediction
Protein Function Prediction: ProtFun
• Predict as many biologically relevant features as we can from the sequence
• Train artificial neural networks for each category
• Assign a probability for each category from the NN outputs
############## ProtFun 2.2 predictions ########
>KCNA1_HUMAN
# Functional category Prob Odds
Amino_acid_biosynthesis 0.042 1.893
Biosynthesis_of_cofactors 0.119 1.654
Cell_envelope 0.031 0.507
Cellular_processes 0.027 0.373
Central_intermediary_metabolism 0.046 0.731
Energy_metabolism 0.036 0.395
Fatty_acid_metabolism 0.019 1.485
Purines_and_pyrimidines 0.214 0.879
Regulatory_functions 0.013 0.083
Replication_and_transcription 0.019 0.073
Translation 0.129 2.925
Transport_and_binding =>0.717 1.748
# Enzyme/nonenzyme Prob Odds
Enzyme 0.231 0.807
Nonenzyme =>0.769 1.078
# Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.040 0.193
Transferase (EC 2.-.-.-) 0.056 0.163
Hydrolase (EC 3.-.-.-) 0.062 0.195
Lyase (EC 4.-.-.-) 0.020 0.430
Isomerase (EC 5.-.-.-) 0.010 0.321
Ligase (EC 6.-.-.-) 0.017 0.326
# Gene Ontology category Prob Odds
Signal_transducer 0.061 0.284
Receptor 0.055 0.323
Hormone 0.001 0.206
Structural_protein 0.002 0.086
Transporter 0.469 4.299
Ion_channel 0.207 3.633
Voltage-gated_ion_channel =>0.280 12.736
Cation_channel 0.348 7.560
Transcription 0.163 1.270
Transcription_regulation 0.166 1.331
Stress_response 0.011 0.125
Immune_response 0.031 0.370
Growth_factor 0.005 0.372
Metal_ion_transport 0.159 0.345
Complexity of datasets:
• Many instances (examples)
• Instances with multiple features (properties / characteristics)
• Dependencies between the features (correlations)
Data Preprocessing
Instance selection:
– Remove identical / inconsistent / incomplete instances (e.g. reduction of homologous genes, removal of wrongly annotated genes)
Feature transformation / selection:
– Projection techniques (e.g. principal components analysis)
– Compression techniques (e.g. minimum description length)
– Feature selection techniques
Benefits of Feature Selection
• Attain good, and often even better, classification performance using a small subset of features
– Less noise in the data
• Provide more cost-effective classifiers
– Fewer features to take into account → smaller datasets → faster classifiers
• Identification of (biologically) relevant features for the given problem
Feature Selection
[Figure: two feature selection schemes.
Filter approach: all features → feature subset selection → optimal features → learning algorithm.
Wrapper approach: all features → feature subset selection, in which a feature subset search algorithm proposes selected features and the learning algorithm returns their evaluation as the selection criterion → optimal features → learning algorithm.]
Filter Approach
• Independent of the classification model
• A relevance measure for each feature is calculated
• Features with a value lower than a selected threshold t will be removed
Example: Feature-class entropy
• Measures the “uncertainty” about the class when observing feature i
f1  f2  f3  f4  class
1   0   1   1   1
0   1   1   0   1
1   0   1   0   1
0   1   0   1   1
1   0   0   0   0
0   0   1   0   0
1   1   0   1   0
0   1   0   1   0
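The entropy example can be worked through in a few lines of Python. This is a minimal sketch that assumes "feature-class entropy" means the conditional entropy H(class | feature i) in bits; the slide does not give the exact formula, and the names `DATA`, `entropy` and `feature_class_entropy` are illustrative only.

```python
import math
from collections import Counter

# The eight example instances from the table: (f1, f2, f3, f4, class)
DATA = [
    (1, 0, 1, 1, 1),
    (0, 1, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (0, 1, 0, 1, 1),
    (1, 0, 0, 0, 0),
    (0, 0, 1, 0, 0),
    (1, 1, 0, 1, 0),
    (0, 1, 0, 1, 0),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def feature_class_entropy(data, i):
    """Assumed definition: H(class | feature i), i.e. the class entropy
    within each feature value, weighted by how often that value occurs."""
    groups = {}
    for row in data:
        groups.setdefault(row[i], []).append(row[-1])
    return sum(len(g) / len(data) * entropy(g) for g in groups.values())

scores = {f"f{i + 1}": feature_class_entropy(DATA, i) for i in range(4)}
for name, h in scores.items():
    print(name, round(h, 3))
```

Under this definition a lower entropy means less remaining uncertainty about the class, so f3 is the most informative feature in this small table while f1, f2 and f4 leave the class fully uncertain (1 bit). To threshold on "relevance" as described, one could use a score such as 1 - H; that choice is an assumption, not something stated on the slide.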
Wrapper approach
• Specific to a classification algorithm
• The search for a good feature subset is guided by a search algorithm
• The algorithm uses the evaluation of the classifier as a guide to find good feature subsets
• Search algorithm examples: sequential forward or backward search, genetic algorithms
Sequential backward elimination
– Starts with the set of all features
– Iteratively discards the feature whose removal results in the best classification performance
Wrapper approach
Full feature set: f1,f2,f3,f4
Step 1: f2,f3,f4 (0.7) | f1,f3,f4 (0.8) | f1,f2,f4 (0.1) | f1,f2,f3 (0.75)
Step 2, from f1,f3,f4: f3,f4 (0.85) | f1,f4 (0.1) | f1,f3 (0.8)
Step 3, from f3,f4: f4 (0.2) | f3 (0.7)
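The search in this example can be sketched as a short greedy loop. The scores below are copied from the example; `evaluate` is a hypothetical stand-in for training and scoring a real classifier on each subset, and the function names are illustrative.

```python
# Classification performance of each evaluated subset, copied from the example.
# In practice each value would come from training and evaluating a classifier.
SCORES = {
    frozenset({"f2", "f3", "f4"}): 0.7,
    frozenset({"f1", "f3", "f4"}): 0.8,
    frozenset({"f1", "f2", "f4"}): 0.1,
    frozenset({"f1", "f2", "f3"}): 0.75,
    frozenset({"f3", "f4"}): 0.85,
    frozenset({"f1", "f4"}): 0.1,
    frozenset({"f1", "f3"}): 0.8,
    frozenset({"f4"}): 0.2,
    frozenset({"f3"}): 0.7,
}

def evaluate(subset):
    """Hypothetical stand-in for 'train a classifier on subset and score it'."""
    return SCORES[frozenset(subset)]

def backward_elimination(features, evaluate):
    """Greedy search: at each step drop the feature whose removal scores best."""
    current = frozenset(features)
    path = []
    while len(current) > 1:
        candidates = [current - {f} for f in sorted(current)]
        current = max(candidates, key=evaluate)
        path.append((sorted(current), evaluate(current)))
    return path

path = backward_elimination(["f1", "f2", "f3", "f4"], evaluate)
best_subset, best_score = max(path, key=lambda step: step[1])
print(path)
print(best_subset, best_score)
```

The search visits f1,f3,f4, then f3,f4, then f3, and the best subset along that path is f3,f4 with a score of 0.85. Because the search is greedy, it only evaluates the subsets on this path and is not guaranteed to find the globally best subset.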
Classification Methods
- Decision trees
- Hidden Markov Models (HMMs)
- Support vector machines
- Artificial Neural Networks
- Bayesian methods
- …
Decision Trees
• Simple, practical and easy to interpret
• Given a set of instances (with a set of features), a tree is constructed with internal nodes as the features and the leaves as the classes
Example Dataset: Shall we play golf?
Instance Attributes / Features Class
day outlook temperature humidity windy Play Golf ?
1 sunny hot high FALSE no
2 sunny hot high TRUE no
3 overcast hot high FALSE yes
4 rainy mild high FALSE yes
5 rainy cool normal FALSE yes
6 rainy cool normal TRUE no
7 overcast cool normal TRUE yes
8 sunny mild high FALSE no
9 sunny cool normal FALSE yes
10 rainy mild normal FALSE yes
11 sunny mild normal TRUE yes
12 overcast mild high TRUE yes
13 overcast hot normal FALSE yes
14 rainy mild high TRUE no
today sunny cool high TRUE ?
Example: Shall we play golf today?
WEKA data file (arff format):
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Feature compositions
[Figure: YES / NO class composition for each attribute value: outlook (sunny, overcast, rainy), temperature (hot, cool, mild), humidity (high, normal), windy (True, False)]
Decision Trees
J48 pruned tree
------------------
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8

[Figure labels: Attributes / Features, Attribute Values, Classes]
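To make the J48 tree printed above concrete, the sketch below hand-codes it as nested conditions and applies it to the 14 training instances and to the unlabeled "today" instance (sunny, cool, high, TRUE). The function name `j48_predict` is illustrative; WEKA itself is not called here.

```python
def j48_predict(outlook, humidity, windy):
    """Hand-coded version of the pruned tree (temperature is never tested)."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook: {outlook!r}")

# Training instances as (outlook, temperature, humidity, windy, play)
TRAIN = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

correct = sum(j48_predict(o, h, w) == play for o, _t, h, w, play in TRAIN)
today = j48_predict("sunny", "high", True)
print(f"{correct}/{len(TRAIN)} training instances classified correctly")
print("Play golf today?", today)
```

The tree reproduces all 14 training labels and predicts "no" for today. A perfect score on the training data says little about performance on new examples, which is why a separate evaluation is still needed.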
Artificial Neural Networks (ANNs)
Artificial Neuron
Neural Network
Overfitting
Overfitting: A classifier that performs well on the training examples, but poorly on new examples.
Training and testing on the same data will generally produce a classifier that looks good on this dataset but is heavily overfitted.
To avoid overfitting:
• Use separate training and testing data
• Use cross-validation
• Use the simplest model possible
Performance Evaluation
Cross-Validation (10 fold)
[Figure: the data are split into a training set (9/10), used to train the ML classifier, and a test set (1/10), used for performance evaluation; this is repeated 10x]
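The 10-fold scheme can be sketched in plain Python as follows. This is an illustration under stated assumptions, not WEKA's implementation: the data are shuffled with a fixed seed, there are at least k instances, and `train_fn` / `predict_fn` are placeholders for whichever classifier is being evaluated.

```python
import random

def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.
    Assumes n_instances >= k so that no fold is empty."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(instances, labels, train_fn, predict_fn, k=10):
    """Average accuracy over k folds; each instance is tested exactly once."""
    accuracies = []
    for train, test in k_fold_splits(len(instances), k):
        model = train_fn([instances[i] for i in train],
                         [labels[i] for i in train])
        hits = sum(predict_fn(model, instances[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)

# Usage with a trivial majority-class "classifier" (purely illustrative)
majority_train = lambda X, y: max(set(y), key=y.count)
majority_predict = lambda model, x: model
demo_accuracy = cross_validate(list(range(20)), ["yes"] * 20,
                               majority_train, majority_predict)
```

With 100 instances and k = 10, every round trains on 90 instances and tests on the remaining 10, matching the 9/10 and 1/10 split in the figure.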
Confusion Matrix
TP True Positives
TN True Negatives
FP False Positives
FN False Negatives
                       Predicted Label
                       positive   negative
Known Label  positive  TP         FN
Known Label  negative  FP         TN
Performance Evaluation
• Precision (PPV): TP / (TP + FP)
– Percentage of correct positive predictions
• Recall / Sensitivity: TP / (TP + FN)
– Percentage of positively labeled instances, also predicted as positive
• Specificity: TN / (TN + FP)
– Percentage of negatively labeled instances, also predicted as negative
• Accuracy: (TP + TN) / (TP + TN + FP + FN)
– Percentage of correct predictions
• Correlation Coefficient: cc = (TP * TN - FP * FN) / sqrt((TP + FP) * (FP + TN) * (TN + FN) * (FN + TP))
– -1 ≤ cc ≤ 1
– cc = 1: no FP or FN
– cc = 0: random
– cc = -1: only FP and FN
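The measures listed above can be computed directly from the four confusion-matrix counts. The sketch below assumes the correlation coefficient is the Matthews correlation coefficient, whose denominator is the square root of the product of the four sums; returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
import math

def evaluate_confusion(tp, tn, fp, fn):
    """Return the performance measures from the confusion-matrix counts."""
    def ratio(num, den):
        # Assumption: report 0.0 when a denominator is zero (measure undefined).
        return num / den if den else 0.0
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "cc": ratio(tp * tn - fp * fn,
                    math.sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))),
    }

# Made-up counts, for illustration only
measures = evaluate_confusion(tp=40, tn=45, fp=5, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

A classifier with no false positives or false negatives reaches cc = 1, and one that gets every prediction wrong reaches cc = -1, in line with the bounds given above.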
ROC – Receiver Operating Characteristic
[Figure: ROC curve. x-axis: False Positive Rate, (1 - Specificity) = FP / (FP + TN); y-axis: True Positive Rate, Sensitivity = TP / (TP + FN)]
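A ROC curve is traced by lowering the decision threshold on a classifier's scores and recording the false positive rate and true positive rate at each step. The sketch below assumes numeric scores where higher means "more positive", labels coded as 1 / 0, and at least one instance of each class. The area under the curve is a commonly used summary that the slides do not mention, and it is included here only as an illustration.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by lowering the decision threshold
    over every distinct score; labels are 1 (positive) or 0 (negative)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up scores, for illustration only
perfect = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
mixed = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(area_under_curve(perfect), area_under_curve(mixed))
```

A classifier that ranks every positive above every negative gives a curve through the top-left corner, while a random ranking stays close to the diagonal.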
Case Study - Splice Site Prediction
Splice site prediction:
Correctly identify the borders of introns and exons in genes (splice sites)
• Important for gene prediction
• Split up into 2 tasks:
– Donor prediction (exon -> intron)
– Acceptor prediction (intron -> exon)
Case Study - Splice Site Prediction
• Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence
– Donor sites: GT
– Acceptor sites: AG
• Classification problem:
– Distinguish between true GT, AG and false GT, AG.
Case Study - Splice Site Prediction
Features:
• Position dependent features, e.g. an A on position 1, C on position 17, …
atcgatcagtatcgat GT ctgagctatgag (positions 1, 2, 3, …, 17, …, 28)
• Position independent features, e.g. subsequence “TCG” occurs, “GAG” occurs, …
atcgatcagtatcgat GT ctgagctatgag
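Both feature types can be extracted from the example sequence with a few lines of Python. The sketch assumes that positions 1 to 28 number the context with the GT dinucleotide itself excluded (16 + 12 nucleotides), which matches "A on position 1, C on position 17". The function names and the choice of 3-mers are illustrative.

```python
# Example sequence: 16 nt upstream, the GT dinucleotide, 12 nt downstream.
UPSTREAM, SITE, DOWNSTREAM = "atcgatcagtatcgat", "GT", "ctgagctatgag"

# Assumption: positions 1..28 number the context without the GT itself.
context = (UPSTREAM + DOWNSTREAM).upper()

def position_dependent(context):
    """One boolean feature per (base, position) pair, e.g. ('A', 1)."""
    return {(base, pos): context[pos - 1] == base
            for pos in range(1, len(context) + 1) for base in "ATCG"}

def position_independent(context, k=3):
    """The set of k-mers that occur anywhere in the context."""
    return {context[i:i + k] for i in range(len(context) - k + 1)}

dep = position_dependent(context)
kmers = position_independent(context)
print(dep[("A", 1)], dep[("C", 17)])   # position dependent features
print("TCG" in kmers, "GAG" in kmers)  # position independent features
```

Position dependent features grow with the context length (four per position), whereas position independent features grow with the number of possible subsequences, regardless of where they occur.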
Original Data – Human Acceptor Splice Sites
>HUMGLUT4B_3535
GGGCCCCTAGCGGAAGGAAAAAAATCATGGTTCCATGTGACATGCTGTGTCTTTGTGTCTGCCTGTTCAGGATGGGGAACCCCCTCAGCA
>HUMGLUT4B_3763
GAGGACAGGTGTCTCGGGGGTGGTGGAAAGGGGACGGTCTGCAGGAAATCTGTCCTCTGCTGTCCCCCAGGTGATTGAACAGAGCTACAA
>HUMGLUT4B_4028
TGGGGGAAACAGGAAGGGAGCCACTGCTGGGTGCCCTCACCCTCACAGCCTCACTCTGTCTGCCTGCCAGGAAAAGGGCCATGCTGGTCA
>HUMGLUT4B_4276
TGGGCTTTCAGATGGGAATGGACACCTGCCCTCAGCCCTCTCTTCTTCCCTCGCCCAGGGCTGACATCAGGGCTGGTGCCCATGTACGTG
>HUMGLUT4B_4507
ATATGGTGGGCTTCCAAGGTAAGGCAGAAGGGCTGAGTGACCTGCCTTCTTTCCCAACCTTCTCCCACAGGTGCTGGGCTTGGAGTCCCT
>HUMGLUT4B_4775
GCCTCCGCCTCATCTTGCTAGCACCTGGCTTCCTCTCAGGTCCCCTCAGGCCTGACCTTCCCTTCTCCAGGTCTGAAGCGCCTGACAGGC
>HUMGLUT4B_5125
CCAGCCTGTTGTGGCTGGAGTAGAGGAAGGGGCATTCCTGCCATCACTTCTTCTTCTCCCCCACCTCTAGGTTTTCTATTATTCGACCAG
>HUMGLUT4B_5378
CCTCACCCACGCGGCCCCTCCTACTTCCCGTGCCCAAAAGGCTGGGGTCAAGCTCCGACTCTCCCCGCAGGTGTTGTTGGTGGAGCGGGC
>HUMGLUT4B_5995
CTGAGTTGAGGGCAAGGGAAGATCAGAAAGGCCTCAACTGGATTCTCCACCCTCCCTGTCTGGCCCCTAGGAGCGAGTTCCAGCCATGAG
>HUMGLUT4B_6716
CTGGTTGCCTGAAACTACCCCTTCCCTCCCCACCTCACTCCGTCAACACCTCTTTCTCCACCTGTCCCAGGAGGCTATGGGGCCCTACGT
>HSRPS6G_1493
CTTTGTAGATGGCTCTACAATTACCTGTATAGATAGTTTCGTAAACTATTTCCCCCCTTTTAATCCTTAGCTGAACATCTCCTTCCCAGC
[...]
Arff Data File - WEKA
@RELATION splice-train
@ATTRIBUTE -68_A {0,1}
@ATTRIBUTE -68_T {0,1}
@ATTRIBUTE -68_C {0,1}
@ATTRIBUTE -68_G {0,1}
@ATTRIBUTE -67_A {0,1}
@ATTRIBUTE -67_T {0,1}
@ATTRIBUTE -67_C {0,1}
@ATTRIBUTE -67_G {0,1}
[...]
@ATTRIBUTE 20_A {0,1}
@ATTRIBUTE 20_T {0,1}
@ATTRIBUTE 20_C {0,1}
@ATTRIBUTE 20_G {0,1}
@ATTRIBUTE class {true,false}
@DATA
0,0,0,1,0,0,0,1, [...] ,1,0,0,0,true
0,0,0,1,1,0,0,0, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,0,0,0,1,true
[...]
1,0,0,0,0,1,0,0, [...] ,0,1,0,0,true
0,0,0,1,0,0,1,0, [...] ,0,0,1,0,true
0,0,1,0,0,0,1,0, [...] ,0,0,0,1,true
0,0,1,0,0,0,1,0, [...] ,0,0,1,0,true
The original sequence files in FASTA format have been converted to represent the four DNA bases in a binary fashion:
A: 1 0 0 0
T: 0 1 0 0
C: 0 0 1 0
G: 0 0 0 1
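The binary conversion can be sketched as a small lookup. The mapping is the one given above; the function name `encode` is illustrative, and the actual conversion script used for the course data is not shown in the slides.

```python
# Binary encoding of the four DNA bases, as given above
ENCODING = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}

def encode(sequence):
    """Concatenate the 4-bit code of every base in the sequence."""
    features = []
    for base in sequence.upper():
        features.extend(ENCODING[base])
    return features

print(encode("GT"))   # [0, 0, 0, 1, 0, 1, 0, 0]
```

Each nucleotide contributes four binary features, so a context of n nucleotides yields 4n features.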
Case Study - Splice Site Prediction
• Local context of 88 nucleotides around the splice site
• 88 position dependent features
• A=1000, T=0100, C=0010, G=0001 → 352 binary features
• Reduce the dataset to contain fewer but relevant features: 352 binary features → 15 binary features
Case Study – Splice Site Sequence Logos
[Figure: sequence logos for acceptor sites and donor sites, with positions labeled relative to the splice site (labels range from -18 to +4)]
Exercise:
• Building a prediction tool for human mRNA splice sites
• Feature selection for classification of splice sites
• Tool: The WEKA machine learning toolkit.
• Go to http://www.cbs.dtu.dk/~kerner/GeneDisc_Course_2007_MJK/ and follow the instructions
Acknowledgements
Slides and Exercises Adapted from and inspired by:
Søren Brunak
David Gilbert, Aik Choon Tan
Yvan Saeys