Machine learning for bioinformatics

A/Prof Nicola Armstrong
Mathematics and Statistics
Murdoch University

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead.

Wikipedia

In Bioinformatics…

Classification

Discrimination: Training set (data with known classes) → Classification technique → Classification rule

Prediction: Data with unknown classes → Classification rule → Class assignment
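The discrimination and prediction steps can be sketched in a few lines. This is a toy illustration only: the nearest-centroid technique, the two-feature data, and the class names "control" and "case" are assumptions chosen for brevity, not something the slides prescribe.

```python
# Toy sketch: nearest-centroid classification (assumed technique, made-up data).

def train_nearest_centroid(samples, labels):
    """Discrimination: build a classification rule from data with known classes."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    # The rule is one centroid (mean vector) per class.
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Prediction: assign the class whose centroid is closest to x."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))


training_set = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
known_classes = ["control", "control", "case", "case"]

rule = train_nearest_centroid(training_set, known_classes)
assigned = [predict(rule, x) for x in [[0.9, 1.1], [3.1, 3.0]]]
print(assigned)
```

Here the "rule" is simply the pair of class centroids, which makes it one of the more transparent boxes; other techniques produce rules that are harder to inspect.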

Classification Rule

• Classification technique
• Feature selection
• Parameters [pre-determined, estimable]
• Distance measure
• Aggregation methods

The classification rule is like a black box; some methods provide more insight into the contents of the box.

Classification Techniques

• Decision Tree based Methods
– e.g. random forests (Breiman 2001)
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes (DLDA) and Bayesian Belief Networks
• Support Vector Machines

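Multiple Regression

Linear: $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$

Logistic: $P(Y = 1) = \dfrac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$

• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype:
– Linear: continuous phenotypic measurement.
– Logistic: 0 = no, 1 = yes.
• β are the regression coefficients.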
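As a small worked example of the logistic model, the sketch below turns a linear predictor into a probability. The coefficient values and the two-locus genotype coding are made-up assumptions for illustration.

```python
import math


def logistic_probability(beta0, betas, x):
    """P(Y = 1) = e^eta / (1 + e^eta), with eta = beta0 + sum_j beta_j * x_j."""
    eta = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return math.exp(eta) / (1.0 + math.exp(eta))


# Hypothetical coefficients for two loci (not estimated from any data).
beta0, betas = -1.0, [0.5, 1.0]

p_a = logistic_probability(beta0, betas, [2, 0])  # eta = 0, so the probability is 1/2
p_b = logistic_probability(beta0, betas, [2, 1])  # eta = 1
print(p_a, p_b)
```

A linear predictor of zero corresponds to a probability of one half, and each unit increase in the predictor multiplies the odds by e.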

Ridge Regression

• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• β^ridge is the ridge regression coefficient.
• λ ≥ 0 is the tuning parameter that controls the amount of ridge penalty.
• $\lambda \sum_{j} \beta_j^2$ (summed over the loci) is the ridge penalty.

LASSO (the Least Absolute Shrinkage and Selection Operator)

• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• β^lasso is the lasso regression coefficient.
• λ ≥ 0 is the tuning parameter that controls the amount of lasso penalty.
• $\lambda \sum_{j} |\beta_j|$ (summed over the loci) is the lasso penalty.

Elastic Net

• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• β0, β is the elastic net regression coefficient.
• λ ≥ 0 is the tuning parameter that controls the amount of elastic net penalty.
• 0 ≤ α ≤ 1 is the elastic penalty weight.
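To make the three penalties concrete, here is a minimal sketch that evaluates each one for a given coefficient vector. The elastic net form follows the convention used by glmnet, λ[α‖β‖₁ + (1 − α)‖β‖₂²/2]; that choice is an assumption, since the slides do not show the formula, and the coefficient values are invented.

```python
def ridge_penalty(betas, lam):
    """lambda * sum of squared coefficients."""
    return lam * sum(b * b for b in betas)


def lasso_penalty(betas, lam):
    """lambda * sum of absolute coefficients."""
    return lam * sum(abs(b) for b in betas)


def elastic_net_penalty(betas, lam, alpha):
    """Assumed glmnet-style mix: alpha = 1 is the lasso penalty,
    alpha = 0 is half the ridge penalty."""
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b * b for b in betas)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


betas = [0.5, -2.0, 0.0]  # hypothetical coefficients (intercept excluded)
lam = 0.1

ridge = ridge_penalty(betas, lam)
lasso = lasso_penalty(betas, lam)
enet_lasso_end = elastic_net_penalty(betas, lam, alpha=1.0)
enet_ridge_end = elastic_net_penalty(betas, lam, alpha=0.0)
print(ridge, lasso, enet_lasso_end, enet_ridge_end)
```

The squared penalty punishes the large coefficient far more than the small one, while the absolute-value penalty grows linearly, which is what lets the lasso set some coefficients exactly to zero.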

Ridge vs. LASSO vs. Elastic Net

* Not well: when the number of SNPs >> the number of people, the maximum number of variables that LASSO can select before it saturates is equal to the number of people.

All regression methods rely on the linearity assumption.

Random Forest

Neural Networks

Ensemble packages in R

• Allow application and evaluation of multiple techniques on a dataset
• Simple and easy to use
• CMA
• caret
• ClassifyR

Classification

| Method Description | Function(s) | DM | DV | DD |
|---|---|---|---|---|
| Wrapper for sparsediscrim's diagonal LDA function dlda. | DLDAtrainInterface, DLDApredictInterface | ✓ | | |
| Wrapper for PoiClaClus's Poisson LDA function classify. | classifyInterface | ✓ | | |
| Wrapper for glmnet's elastic net GLM function glmnet. | elasticNetGLMinterface | ✓ | | |
| Wrapper for pamr's Nearest Shrunken Centroid functions pamr.train and pamr.predict. | NSCtrainInterface, NSCpredictInterface | ✓ | | |
| Wrapper for multinomial logistic regression as implemented in CRAN package 'mnlogit'. | logisticRegressionTrainInterface, logisticRegressionPredictInterface | ✓ | | |
| Fisher's Linear Discriminant Analysis | fisherDiscriminant | ✓ | ✓* | |
| Feature-wise mixtures of normals and voting | mixModelsTrain, mixModelsPredict | ✓ | ✓ | ✓ |
| Feature-wise kernel density estimation and voting | naiveBayesKernel | ✓ | ✓ | ✓ |
| Wrapper for randomForest's function randomForest. | randomForestInterface | ✓ | ✓ | ✓ |
| Wrapper for e1071's Support Vector Machine function svm. | SVMinterface | ✓ | ✓† | ✓† |

* If ordinary numeric measurements have been transformed to absolute deviations by subtractFromLocation.

† If kernel is not "linear".

MODEL PERFORMANCE

Validation: Performance assessment

• Can be based on:
– Cross-validation
– Test set
– Independent testing on future dataset.
– Independent testing on existing dataset (integrative analysis).

Cross-validation

• For i = 1, …, 100:
– Partition data into n disjoint sets S1, S2, …, Sn
– For k = 1, …, n:
  · Omit Sk
  · Using all data except Sk, build classifier
  · Use classifier to predict classes for Sk
• Summary statistics of performance
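The repeated cross-validation loop can be sketched as follows. To keep the example self-contained it uses a deliberately trivial majority-class classifier; that classifier, the toy data, the fold count of 5, and the function names are all assumptions for illustration, and any build/predict pair could be substituted.

```python
import random
import statistics
from collections import Counter


def partition(n_items, n_folds, rng):
    """Shuffle the indices and split them into n_folds disjoint sets S_1..S_n."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def repeated_cv(samples, labels, build, predict, n_folds=5, n_repeats=100, seed=0):
    """Return one accuracy per repeat (the outer 'for i = 1..100' loop)."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        correct = 0
        for held_out in partition(len(samples), n_folds, rng):
            omitted = set(held_out)
            train = [i for i in range(len(samples)) if i not in omitted]
            rule = build([samples[i] for i in train], [labels[i] for i in train])
            correct += sum(predict(rule, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(samples))
    return accuracies


# Placeholder classifier: always predicts the most common training class.
def build_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(rule, sample):
    return rule


samples = [[float(i)] for i in range(10)]
labels = ["no"] * 8 + ["yes"] * 2

accuracies = repeated_cv(samples, labels, build_majority, predict_majority)
print(statistics.mean(accuracies), min(accuracies), max(accuracies))
```

Every observation is predicted exactly once per repeat, always by a classifier that never saw it, and the spread of the per-repeat accuracies gives the summary statistics of performance.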

Metrics for Performance Evaluation

• Focus on the predictive capability of a model
– Rather than how long it takes to classify or build models, scalability, etc.
• Confusion Matrix:

| | PREDICTED Class=Yes | PREDICTED Class=No |
|---|---|---|
| ACTUAL Class=Yes | a | b |
| ACTUAL Class=No | c | d |

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Accuracy

$\text{Accuracy} = \dfrac{a + d}{a + b + c + d} = \dfrac{TP + TN}{TP + TN + FP + FN}$

• Issue if data highly skewed/biased
– 0.5% of data is in class 1 and the rest is in class 0. Model has 99.5% accuracy! But your model could just be: classify each observation to be in category 0.
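The skewed-data issue is easy to check numerically. The sketch below assumes a made-up dataset of 1000 observations, 5 of which (0.5%) are in class 1, and a "classifier" that assigns everything to class 0.

```python
def accuracy(tp, fn, fp, tn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fn + fp + tn)


# Everything is predicted as class 0: no true or false positives at all.
tp, fn, fp, tn = 0, 5, 0, 995

acc = accuracy(tp, fn, fp, tn)
sensitivity = tp / (tp + fn)  # fraction of class-1 observations that were found
print(acc, sensitivity)
```

The accuracy looks excellent while the model identifies none of the class-1 observations, which is why accuracy alone is a poor summary for imbalanced classes.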

Other metrics

• Misclassification rate: 1 - Accuracy
• Sensitivity / Recall / true positive rate: TP/(TP+FN)
• Specificity / true negative rate: TN/(TN+FP)
• Positive predictive value / Precision: TP/(TP+FP)
• Negative predictive value: TN/(TN+FN)
• F-score: harmonic mean of precision & recall, 2*(precision*recall)/(precision+recall)
– 1 = good, 0 = bad, doesn't consider TNs
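These metrics can all be computed directly from the four cells of the confusion matrix. The counts used below are invented for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the listed metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "misclassification_rate": 1 - (tp + tn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "negative_predictive_value": tn / (tn + fn),
        "f_score": 2 * (precision * recall) / (precision + recall),
    }


# Hypothetical counts: a = 40, b = 10, c = 5, d = 45.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the function divides by zero if a denominator is empty (for example, no predicted positives), so real code would need to guard those cases.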

ROC (Receiver Operating Characteristic)

• Developed in 1950s for signal detection theory to analyze noisy signals
– Characterize the trade-off between positive hits and false alarms
• ROC curve plots sensitivity (on the y-axis) against (1 - specificity) (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve
– Changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point

ROC Curve

(Sensitivity, 1 - specificity):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class
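A minimal sketch of how the points of a ROC curve arise from sweeping a threshold over classifier scores. The scores and labels are made up, and the area under the curve (a common one-number summary, not mentioned on the slide) is computed with the trapezoid rule. Points are stored as (x, y) = (1 - specificity, sensitivity), matching the axes described above.

```python
def roc_points(scores, labels):
    """One (1 - specificity, sensitivity) point per threshold; labels are 0/1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: all declared negative
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical classifier scores
labels = [1, 1, 0, 1, 0, 0]              # 1 = positive class

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

Lowering the threshold moves the point from "everything negative" towards "everything positive"; a curve that hugs the ideal corner gives an area near 1, while random guessing gives about 0.5.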

Model Selection

• In practice, often not much difference in performance between several approaches.
• Aim to choose the model which is:
– Interpretable: can we see or understand why the model is making the decisions it makes?
– Simple: easy to explain and understand
– Accurate
– Fast (to train and test)
– Scalable (can be applied to a large dataset)

EXAMPLES

Hidden Markov Models

[Figure: HMM trellis. At each position the hidden state πi takes one of the values 1, 2, …, K; the observations are x1, x2, x3, …, xK.]

Gene finding

Input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACG

Most probable path (gene prediction): exon 1, exon 2, exon 3

[Figure: the input sequence annotated with the predicted exons.]
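The "most probable path" through an HMM is usually found with the Viterbi algorithm. The sketch below is a toy version: the two states ("exon", "intron"), all probabilities, and the short sequence are invented for illustration and are far simpler than a real gene-finding model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (computed in log space)."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][obs]))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical model: "exon" favours G/C, "intron" favours A/T.
states = ("exon", "intron")
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
emit_p = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GCGCATAT", states, start_p, trans_p, emit_p)
print(path)
```

Working in log space avoids numerical underflow on long sequences, where products of many small probabilities would otherwise round to zero.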

Crossovers in meiosis

ChromHMM: annotating genomic regions

Ernst & Kellis 2012

MammaPrint

van 't Veer et al. Nature 2002; van de Vijver et al. NEJM 2002

Based on correlation
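One way to read "based on correlation" is that a sample's expression profile is compared with a reference template and classified according to how strongly the two correlate. The sketch below shows that general idea only: the five-gene template, the sample values, and the cutoff of 0.4 are hypothetical placeholders and do not reproduce the published signature or its threshold.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def classify_by_correlation(profile, template, cutoff):
    """Label a sample by whether it correlates with the template above cutoff."""
    if pearson(profile, template) >= cutoff:
        return "matches template"
    return "does not match template"


template = [0.8, -0.5, 1.2, -1.0, 0.3]   # hypothetical reference profile
sample_a = [0.7, -0.4, 1.0, -0.9, 0.2]   # hypothetical patient profiles
sample_b = [-0.6, 0.5, -1.1, 0.8, -0.1]
cutoff = 0.4                             # hypothetical threshold

call_a = classify_by_correlation(sample_a, template, cutoff)
call_b = classify_by_correlation(sample_b, template, cutoff)
print(call_a, call_b)
```

Because correlation ignores overall scale, two profiles with the same pattern of up- and down-regulated genes match even if one has uniformly weaker signal.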
