Machine learning for bioinformatics A/Prof Nicola Armstrong Mathematics and Statistics Murdoch University


Post on 03-Jun-2020


Page 1: Machine learning for bioinformaticsbioinformatics.org.au/winterschool/wp-content/uploads/sites/15/201… · Machine learning for bioinformatics ... Machine learning(ML) is the scientific

Machine learning for bioinformatics
A/Prof Nicola Armstrong
Mathematics and Statistics
Murdoch University


Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead.

Wikipedia


In Bioinformatics…


Classification


Discrimination: Training set (data with known classes) → Classification technique → Classification rule

Prediction: Data with unknown classes → Classification rule → Class assignment
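The two stages can be sketched in a few lines of Python. This is only an illustration: the nearest-centroid rule, the toy feature values and the class names "normal" and "tumour" are assumptions made for the example, not something taken from the slides.

```python
from statistics import mean

def train_nearest_centroid(samples, labels):
    """Discrimination: build a classification rule (one centroid per class)."""
    by_class = {}
    for features, label in zip(samples, labels):
        by_class.setdefault(label, []).append(features)
    # The centroid is the feature-wise mean of the training samples in a class.
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_class.items()
    }

def predict(centroids, features):
    """Prediction: assign the class whose centroid is closest (squared Euclidean)."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical training set: two features per sample, classes known.
training_samples = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
training_labels = ["normal", "normal", "tumour", "tumour"]

rule = train_nearest_centroid(training_samples, training_labels)
new_sample_class = predict(rule, [2.9, 3.0])  # data with unknown class
```

Here `rule` plays the part of the classification rule, and `predict` performs the class assignment for new data.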


Classification Rule

Classification technique
Feature selection
Parameters [pre-determined, estimable]
Distance measure
Aggregation methods

The classification rule is like a black box; some methods provide more insight into the contents of the box.


Classification Techniques

• Decision Tree based Methods
  – e.g. random forests (Breiman 2001)
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes (DLDA) and Bayesian Belief Networks
• Support Vector Machines


Multiple Regression

Linear

Logistic

$$P(Y = 1) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype:
  – Linear: continuous phenotypic measurement.
  – Logistic: 0 = no, 1 = yes.
• β are the regression coefficients.
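As a small worked example of the logistic model, the sketch below turns a linear predictor into a probability. The coefficient values and the genotype coding (0, 1 or 2 copies of an allele) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def linear_predictor(intercept, coefficients, x):
    """beta_0 + beta_1 * x_1 + ... + beta_p * x_p"""
    return intercept + sum(b * xi for b, xi in zip(coefficients, x))

def logistic_probability(intercept, coefficients, x):
    """P(Y = 1) under the logistic model.

    e^eta / (1 + e^eta) is algebraically the same as 1 / (1 + e^-eta).
    """
    eta = linear_predictor(intercept, coefficients, x)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for two SNPs, genotypes coded 0/1/2.
beta_0 = -1.0
beta = [0.5, -0.25]
genotypes = [2, 0]

# Linear predictor: -1.0 + 0.5 * 2 - 0.25 * 0 = 0.0
probability = logistic_probability(beta_0, beta, genotypes)
```

A linear predictor of zero corresponds to a probability of one half; positive values push the probability towards 1 and negative values towards 0.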


Ridge Regression

X are the genomic information at each locus, e.g. SNPs or methylation levels.
Y is the phenotype.
β^ridge is the ridge regression coefficient.
λ ≥ 0 is the tuning parameter that controls the amount of ridge penalty.
$\lambda \sum_{j=1}^{p} \beta_j^2$ is the ridge penalty.


LASSO (the Least Absolute Shrinkage and Selection Operator)

X are the genomic information at each locus, e.g. SNPs or methylation levels.
Y is the phenotype.
β^lasso is the lasso regression coefficient.
λ ≥ 0 is the tuning parameter that controls the amount of lasso penalty.
$\lambda \sum_{j=1}^{p} |\beta_j|$ is the lasso penalty.


Elastic Net

X are the genomic information at each locus, e.g. SNPs or methylation levels.
Y is the phenotype.
β0, β are the elastic net regression coefficients.
λ ≥ 0 is the tuning parameter that controls the amount of elastic net penalty.
0 ≤ α ≤ 1 is the elastic penalty weight.
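To see how λ and α interact, the sketch below computes the three penalties for a given coefficient vector. The ridge and lasso penalties use the standard sum-of-squares and sum-of-absolute-values forms. The elastic net form follows the parameterisation used by the glmnet package, which is an assumption here because the formula on the slide did not survive extraction.

```python
def ridge_penalty(beta, lam):
    """lambda * sum of squared coefficients."""
    return lam * sum(b * b for b in beta)

def lasso_penalty(beta, lam):
    """lambda * sum of absolute coefficients."""
    return lam * sum(abs(b) for b in beta)

def elastic_net_penalty(beta, lam, alpha):
    """glmnet-style mix (assumed): lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical coefficients; the intercept beta_0 is not penalised.
beta = [1.0, -2.0, 0.0]
lam = 0.5

ridge = ridge_penalty(beta, lam)                 # 0.5 * (1 + 4 + 0)
lasso = lasso_penalty(beta, lam)                 # 0.5 * (1 + 2 + 0)
mixed = elastic_net_penalty(beta, lam, alpha=0.5)
```

Under this parameterisation α = 1 reproduces the lasso penalty and α = 0 gives half the ridge penalty, so α slides between the two behaviours.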


Ridge vs. LASSO vs. Elastic Net

* Not well – when the number of SNPs >> the number of people, the maximum number of variables that LASSO can select before it saturates is equal to the number of people.

All regression methods rely on the linearity assumption.

Random Forest

Neural Networks

Ensemble packages in R

• Allow application and evaluation of multiple techniques on a dataset
• Simple and easy to use

• CMA
• caret
• ClassifyR


Classification

| Method Description | Function(s) | DM | DV | DD |
|---|---|---|---|---|
| Wrapper for sparsediscrim’s diagonal LDA function dlda. | DLDAtrainInterface, DLDApredictInterface | ✓ | | |
| Wrapper for PoiClaClus’s Poisson LDA function classify. | classifyInterface | ✓ | | |
| Wrapper for glmnet’s elastic net GLM function glmnet. | elasticNetGLMinterface | ✓ | | |
| Wrapper for pamr’s Nearest Shrunken Centroid functions pamr.train and pamr.predict. | NSCtrainInterface, NSCpredictInterface | ✓ | | |
| Wrapper for multinomial logistic regression as implemented in CRAN package ‘mnlogit’. | logisticRegressionTrainInterface, logisticRegressionPredictInterface | ✓ | | |
| Fisher’s Linear Discriminant Analysis | fisherDiscriminant | ✓ | ✓* | |
| Feature-wise mixtures of normals and voting | mixModelsTrain, mixModelsPredict | ✓ | ✓ | ✓ |
| Feature-wise kernel density estimation and voting | naiveBayesKernel | ✓ | ✓ | ✓ |
| Wrapper for randomForest's function randomForest. | randomForestInterface | ✓ | ✓ | ✓ |
| Wrapper for e1071’s Support Vector Machine function svm. | SVMinterface | ✓ | ✓† | ✓† |

* If ordinary numeric measurements have been transformed to absolute deviations by subtractFromLocation.

† If kernel is not “linear”.

MODEL PERFORMANCE

Validation: Performance assessment

• Can be based on:
  – Cross-validation
  – Test set
  – Independent testing on future dataset.
  – Independent testing on existing dataset (integrative analysis).


Cross-validation

For i = 1, …, 100:
  Partition data into n disjoint sets S1, S2, …, Sn
  For k = 1, …, n:
    Omit Sk
    Using all data except Sk, build classifier
    Use classifier to predict classes for Sk

Summary statistics of performance
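The procedure can be sketched as repeated n-fold cross-validation. Everything specific below is an assumption for illustration: the one-feature simulated data, the toy threshold classifier, n = 5 folds and the fixed random seed.

```python
import random
from statistics import mean

def threshold_classifier(train):
    """Toy classifier: cut halfway between the two class means."""
    mean_0 = mean(x for x, y in train if y == 0)
    mean_1 = mean(x for x, y in train if y == 1)
    cut = (mean_0 + mean_1) / 2.0
    if mean_1 > mean_0:
        return lambda x: int(x > cut)
    return lambda x: int(x < cut)

def cross_validate(data, build_classifier, n_folds, rng):
    """One round of n-fold cross-validation; returns the misclassification rate."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    # Partition the data into n disjoint sets S_1, ..., S_n.
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    errors = 0
    for k, held_out in enumerate(folds):
        # Omit S_k and build the classifier on everything else.
        train = [obs for j, fold in enumerate(folds) if j != k for obs in fold]
        classify = build_classifier(train)
        # Predict classes for S_k and count the mistakes.
        errors += sum(classify(x) != y for x, y in held_out)
    return errors / len(data)

rng = random.Random(0)
# Simulated data: 20 observations per class, class means 0 and 3.
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(20)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(20)]

# Repeat the whole procedure 100 times with a fresh partition each time.
error_rates = [cross_validate(data, threshold_classifier, 5, rng) for _ in range(100)]
summary = (mean(error_rates), min(error_rates), max(error_rates))
```

Repeating the partition many times reduces the dependence of the estimate on any single random split, which is why the summary is taken over all 100 rounds.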


Metrics for Performance Evaluation

• Focus on the predictive capability of a model
  – Rather than how fast it takes to classify or build models, scalability, etc.

• Confusion Matrix:

|                  | PREDICTED Class=Yes | PREDICTED Class=No |
|------------------|---------------------|--------------------|
| ACTUAL Class=Yes | a                   | b                  |
| ACTUAL Class=No  | c                   | d                  |

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)


Accuracy

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

• Issue if data highly skewed/biased
  – 0.5% of data is in class 1 and rest is in class 0. Model has 99.5% accuracy! But, your model could just be: classify each observation to be in category 0.
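The skewed-data caveat is easy to reproduce. The sketch below assumes a hypothetical dataset of 1000 observations, 0.5% of which are in class 1, and a "model" that always predicts class 0.

```python
# Hypothetical data: 5 of 1000 observations (0.5%) are in class 1.
actual = [1] * 5 + [0] * 995
# A trivial model that classifies every observation as class 0.
predicted = [0] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)

accuracy = (tp + tn) / (tp + tn + fp + fn)
# Fraction of true class-1 observations that the model finds.
sensitivity = tp / (tp + fn)
```

Accuracy comes out at 99.5% even though the model never detects a single class 1 observation, so its sensitivity is zero.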


Other metrics

• Misclassification rate: 1 - Accuracy
• Sensitivity / Recall / true positive rate: TP/(TP+FN)
• Specificity / true negative rate: TN/(TN+FP)
• Positive predictive value / Precision: TP/(TP+FP)
• Negative predictive value: TN/(TN+FN)
• F-score: harmonic mean of precision & recall: 2*(precision*recall)/(precision+recall)
  – 1 = good, 0 = bad, doesn’t consider TNs
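These definitions translate directly into code. The function below is a minimal sketch that takes the four confusion-matrix counts; the example counts are made up, and the function assumes every denominator is non-zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Metrics from confusion-matrix counts (assumes no denominator is zero)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # positive predictive value
    recall = tp / (tp + fn)      # sensitivity, true positive rate
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "npv": tn / (tn + fn),
        # Harmonic mean of precision and recall; true negatives play no part.
        "f_score": 2.0 * precision * recall / (precision + recall),
    }

# Hypothetical counts: a = TP, b = FN, c = FP, d = TN.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

With these counts the sensitivity is 40/50 and the specificity is 45/50, while the F-score depends only on TP, FP and FN.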


ROC (Receiver Operating Characteristic)

• Developed in 1950s for signal detection theory to analyze noisy signals
  – Characterize the trade-off between positive hits and false alarms
• ROC curve plots sensitivity (on the y-axis) against (1-specificity) (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve
  – changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point


ROC Curve

(Sensitivity, 1-specificity):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal

• Diagonal line:
  – Random guessing
  – Below diagonal line:
    • prediction is opposite of the true class
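A ROC curve can be traced by sweeping the decision threshold over the classifier scores. The sketch below is illustrative only: the scores and labels are invented, and the area under the curve is computed with the trapezoidal rule. Note that the points are returned in plotting order, (1 - specificity, sensitivity), which is the reverse of the (Sensitivity, 1-specificity) ordering used in the list above.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Threshold above every score: everything is declared negative.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Hypothetical classifier scores (higher = more likely positive) and true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

The curve starts where everything is declared negative and ends where everything is declared positive. Random guessing gives an area of about 0.5 and the ideal classifier gives 1.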


Model Selection

• In practice, often not much difference in performance between several approaches.
• Aim to choose the model which is:
  – Interpretable - can we see or understand why the model is making the decisions it makes?
  – Simple - easy to explain and understand
  – Accurate
  – Fast (to train and test)
  – Scalable (can be applied to a large dataset)

EXAMPLES

Hidden Markov Models

[Figure: HMM trellis. Hidden states 1, 2, …, K at each position (state path πi); observations x1, x2, x3, …, xK.]


Gene finding

[Figure: an input sequence, the most probable path through the hidden states, and the resulting gene prediction (exon 1, exon 2, exon 3).]

input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACG
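The most probable path through an HMM is usually found with the Viterbi algorithm. The sketch below is a toy illustration, not a real gene finder: the two states, the start, transition and emission probabilities, and the short test sequence are all hypothetical. All probabilities are non-zero so that the logarithms are defined.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path, computed in log space."""
    # best[t][s]: log-probability of the best path that ends in state s at position t.
    best = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    back = []
    for symbol in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position.
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            pointers[s] = prev
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-state model: GC-rich "exon" and AT-rich "intron".
states = ("exon", "intron")
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
emit_p = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

path = viterbi("GCGCGCATATAT", states, start_p, trans_p, emit_p)
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale input.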


Crossovers in meiosis

ChromHMM: annotating genomic regions

Ernst & Kellis 2012


MammaPrint

van ‘t Veer et al Nature 2002; van de Vijver et al NEJM 2002

Based on correlation