TRANSCRIPT
Machine learning for bioinformatics
A/Prof Nicola Armstrong
Mathematics and Statistics
Murdoch University
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead.
Wikipedia
In Bioinformatics…
Classification
[Diagram] Discrimination: Training Set (data with known classes) → Classification Technique → Classification rule.
[Diagram] Prediction: Data with unknown classes → Classification rule → Class Assignment.
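The two stages in the diagram can be sketched in a few lines of Python. This is only an illustration, not a method taken from the slides: it assumes a nearest-centroid rule, and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from math import dist


def learn_rule(training_data, training_classes):
    """Discrimination: build a classification rule from data with known classes.

    The rule here is one centroid (feature-wise mean) per class.
    """
    grouped = defaultdict(list)
    for features, label in zip(training_data, training_classes):
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def assign_class(rule, features):
    """Prediction: assign the class whose centroid is closest (Euclidean)."""
    return min(rule, key=lambda label: dist(rule[label], features))


# Hypothetical toy data: two features per sample.
train_x = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["control", "control", "case", "case"]

rule = learn_rule(train_x, train_y)          # discrimination
print(assign_class(rule, [0.9, 1.1]))        # prediction for an unknown sample
print(assign_class(rule, [2.8, 3.0]))
```

Keeping the rule as a plain dictionary makes the "black box" easy to inspect, which is one reason centroid-style rules are often used for teaching.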
Classification Rule
• Classification technique
• Feature selection
• Parameters [pre-determined, estimable]
• Distance measure
• Aggregation methods
The classification rule is like a black box; some methods provide more insight into the contents of the box.
Classification Techniques
• Decision Tree based Methods – e.g. random forests (Breiman 2001)
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes (DLDA) and Bayesian Belief Networks
• Support Vector Machines
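Multiple Regression
Linear: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$
Logistic: $P(Y = 1) = \dfrac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}$
• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype:
– Linear: continuous phenotypic measurement.
– Logistic: 0 = no, 1 = yes.
• β are the regression coefficients.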
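A minimal sketch of how the two models turn genomic predictors into a prediction: the linear model returns the linear predictor itself, while the logistic model passes it through e^η / (1 + e^η) to give a probability that Y = 1. The coefficient values and the 0/1/2 genotype coding are assumptions made for illustration.

```python
from math import exp


def linear_predictor(beta0, beta, x):
    """beta0 + beta1*x1 + ... + betap*xp for one individual."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))


def logistic_probability(beta0, beta, x):
    """P(Y = 1) under the logistic model.

    Written as 1 / (1 + e^-eta), which is algebraically equal to
    e^eta / (1 + e^eta).
    """
    eta = linear_predictor(beta0, beta, x)
    return 1.0 / (1.0 + exp(-eta))


# Hypothetical coefficients; genotypes coded as 0, 1 or 2 minor alleles at three SNPs.
beta0, beta = -1.0, [0.5, 0.0, 0.25]
genotype = [2, 1, 0]

print(linear_predictor(beta0, beta, genotype))      # continuous phenotype
print(logistic_probability(beta0, beta, genotype))  # probability that Y = 1
```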
Ridge Regression
• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• $\beta^{\text{ridge}}$ is the ridge regression coefficient.
• λ ≥ 0 is the tuning parameter that controls the amount of ridge penalty.
• $\lambda \sum_{j} \beta_j^2$ is the ridge penalty.
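For reference, the ridge estimate is usually written as the penalised least-squares problem below. The indexing is an assumption not stated on the slide: n individuals (index i), p loci (index j) and an unpenalised intercept β0.

```latex
\hat{\beta}^{\text{ridge}}
  = \underset{\beta}{\arg\min}
    \left\{
      \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
      + \lambda \sum_{j=1}^{p} \beta_j^{2}
    \right\}
```

With λ = 0 this reduces to ordinary least squares; larger λ shrinks the coefficients towards zero without setting them exactly to zero.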
LASSO (the Least Absolute Shrinkage and Selection Operator)
• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• $\beta^{\text{lasso}}$ is the lasso regression coefficient.
• λ ≥ 0 is the tuning parameter that controls the amount of lasso penalty.
• $\lambda \sum_{j} |\beta_j|$ is the lasso penalty.
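For reference, the LASSO estimate is usually written as the penalised least-squares problem below. The indexing is an assumption not stated on the slide: n individuals (index i), p loci (index j) and an unpenalised intercept β0.

```latex
\hat{\beta}^{\text{lasso}}
  = \underset{\beta}{\arg\min}
    \left\{
      \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
      + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
    \right\}
```

Because the penalty uses absolute values rather than squares, some coefficients are set exactly to zero, which is what makes LASSO a selection operator.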
Elastic Net
• X are the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• β0, β are the elastic net regression coefficients.
• λ ≥ 0 is the tuning parameter that controls the amount of elastic net penalty.
• 0 ≤ α ≤ 1 is the elastic net penalty weight.
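One common way to write the elastic net problem is shown below. It assumes the parameterisation used by the glmnet package, which may differ from the formula on the original slide; n individuals and the row vector x_i of genomic predictors for individual i are notational assumptions.

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^{2}
  + \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^{2}
                   + \alpha\,\lVert \beta \rVert_1 \right]
```

In this form α = 1 gives the LASSO penalty and α = 0 gives the ridge penalty; the factor 1/(2n) is glmnet's convention and only rescales λ.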
Ridge vs. LASSO vs. Elastic Net
* Not well: when the number of SNPs >> the number of people, the maximum number of variables that LASSO can select before it saturates is equal to the number of people.
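To illustrate why LASSO selects variables while ridge only shrinks them, here is a minimal sketch under a strong simplifying assumption: standardised, mutually uncorrelated predictors (an orthonormal design), where both penalised estimates have closed forms in terms of the least-squares coefficients. The objective assumed is the residual sum of squares plus λ times the penalty. The function names and coefficient values are hypothetical, and the sketch does not demonstrate the saturation behaviour itself.

```python
def ridge_coefficient(beta_ols, lam):
    """Ridge under an orthonormal design: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)


def lasso_coefficient(beta_ols, lam):
    """LASSO under an orthonormal design: soft-thresholding at lam / 2."""
    threshold = lam / 2.0
    if abs(beta_ols) <= threshold:
        return 0.0                      # variable dropped from the model
    return beta_ols - threshold if beta_ols > 0 else beta_ols + threshold


# Hypothetical least-squares coefficients for four loci.
ols = [2.0, -0.5, 0.25, -3.0]
lam = 1.0

ridge = [ridge_coefficient(b, lam) for b in ols]
lasso = [lasso_coefficient(b, lam) for b in ols]

print("ridge:", ridge)
print("lasso:", lasso)
print(sum(b != 0 for b in lasso), "of", len(ols), "variables kept by LASSO")
```

Ridge keeps every locus with a smaller coefficient, whereas LASSO sets the two weakest coefficients to exactly zero.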
All regression methods rely on the linearity assumption.
Random Forest
Neural Networks
Ensemble packages in R
• Allow application and evaluation of multiple techniques on a data set
• Simple and easy to use
• CMA
• caret
• ClassifyR
Classification
Method Description | Function(s) | DM | DV | DD
Wrapper for sparsediscrim's diagonal LDA function dlda. | DLDAtrainInterface, DLDApredictInterface | ✓ | |
Wrapper for PoiClaClus's Poisson LDA function classify. | classifyInterface | ✓ | |
Wrapper for glmnet's elastic net GLM function glmnet. | elasticNetGLMinterface | ✓ | |
Wrapper for pamr's Nearest Shrunken Centroid functions pamr.train and pamr.predict. | NSCtrainInterface, NSCpredictInterface | ✓ | |
Wrapper for multinomial logistic regression as implemented in CRAN package 'mnlogit'. | logisticRegressionTrainInterface, logisticRegressionPredictInterface | ✓ | |
Fisher's Linear Discriminant Analysis | fisherDiscriminant | ✓ | ✓* |
Feature-wise mixtures of normals and voting | mixModelsTrain, mixModelsPredict | ✓ | ✓ | ✓
Feature-wise kernel density estimation and voting | naiveBayesKernel | ✓ | ✓ | ✓
Wrapper for randomForest's function randomForest. | randomForestInterface | ✓ | ✓ | ✓
Wrapper for e1071's Support Vector Machine function svm. | SVMinterface | ✓ | ✓† | ✓†
* If ordinary numeric measurements have been transformed to absolute deviations by subtractFromLocation.
† If kernel is not "linear".
MODEL PERFORMANCE
Validation: Performance assessment
• Can be based on:
– Cross-validation
– Test set
– Independent testing on future data set.
– Independent testing on existing data set (integrative analysis).
Cross-validation
For i = 1, …, 100:
  Partition data into n disjoint sets S1, S2, …, Sn
  For k = 1, …, n:
    Omit Sk
    Using all data except Sk, build classifier
    Use classifier to predict classes for Sk
Summary statistics of performance
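The repeated cross-validation loop can be sketched as follows. This is a minimal illustration using only the Python standard library: the function names are hypothetical, accuracy is assumed as the performance measure, and a majority-class predictor stands in for a real classification technique.

```python
import random
from statistics import mean, stdev


def repeated_cross_validation(data, classes, build_classifier,
                              n_folds, n_repeats=100, seed=0):
    """Return the mean and standard deviation of accuracy over repeats."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    accuracies = []
    for _ in range(n_repeats):                     # For i = 1, ..., 100
        rng.shuffle(indices)
        # Partition into n disjoint sets S1, ..., Sn.
        folds = [indices[k::n_folds] for k in range(n_folds)]
        correct = 0
        for held_out in folds:                     # For k = 1, ..., n: omit Sk
            omitted = set(held_out)
            train = [j for j in indices if j not in omitted]
            classifier = build_classifier([data[j] for j in train],
                                          [classes[j] for j in train])
            correct += sum(classifier(data[j]) == classes[j] for j in held_out)
        accuracies.append(correct / len(data))
    return mean(accuracies), stdev(accuracies)     # summary statistics


def majority_class(train_data, train_classes):
    """Placeholder technique: always predict the most frequent training class."""
    most_common = max(sorted(set(train_classes)), key=train_classes.count)
    return lambda features: most_common


# Hypothetical toy data: 8 samples of class 0 and 4 of class 1.
toy_data = [[float(j)] for j in range(12)]
toy_classes = [0] * 8 + [1] * 4

mean_accuracy, sd_accuracy = repeated_cross_validation(
    toy_data, toy_classes, majority_class, n_folds=4)
print(mean_accuracy, sd_accuracy)
```

The classifier is rebuilt inside the inner loop so that the omitted set Sk never influences the rule used to predict it.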
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
– Rather than how long it takes to classify or build models, scalability, etc.
• Confusion Matrix:
                    PREDICTED CLASS
                    Class=Yes    Class=No
ACTUAL   Class=Yes  a            b
CLASS    Class=No   c            d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Accuracy
$\text{Accuracy} = \dfrac{a + d}{a + b + c + d} = \dfrac{TP + TN}{TP + TN + FP + FN}$
• Issue if data highly skewed/biased
– 0.5% of data is in class 1 and the rest is in class 0. Model has 99.5% accuracy! But your model could just be: classify each observation to be in category 0.
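The skewed-data problem is easy to reproduce. The sketch below builds a hypothetical data set of 1000 observations with 0.5% in class 1, applies the trivial "everything is class 0" model and computes accuracy from the confusion-matrix counts; the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical skewed data: 5 of 1000 observations (0.5%) are in class 1.
actual = [1] * 5 + [0] * 995
predicted = [0] * 1000           # trivial model: classify everything as class 0

tp, fn, fp, tn = confusion_counts(actual, predicted)
print("accuracy:", accuracy(tp, fn, fp, tn))
print("true positives found:", tp)
```

The accuracy is high even though not a single class 1 observation is identified, which is why accuracy alone is not enough on unbalanced data.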
Other metrics
• Misclassification rate: 1 - Accuracy
• Sensitivity / Recall / true positive rate: TP/(TP+FN)
• Specificity / true negative rate: TN/(TN+FP)
• Positive predictive value / Precision: TP/(TP+FP)
• Negative predictive value: TN/(TN+FN)
• F-score: harmonic mean of precision & recall
– 2*(precision*recall)/(precision+recall)
– 1 = good, 0 = bad, doesn't consider TNs
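These definitions translate directly into code. A minimal sketch, with a hypothetical function name and made-up counts; it assumes every denominator is non-zero.

```python
def performance_metrics(tp, fn, fp, tn):
    """Compute the listed metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard against division by zero).
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,   # equals 1 - accuracy
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "negative_predictive_value": tn / (tn + fn),
        "f_score": 2 * (precision * recall) / (precision + recall),
    }


# Hypothetical counts: a = TP = 40, b = FN = 10, c = FP = 20, d = TN = 30.
metrics = performance_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the F-score uses only TP, FP and FN, so changing the number of true negatives leaves it unchanged.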
ROC (Receiver Operating Characteristic)
• Developed in 1950s for signal detection theory to analyze noisy signals
– Characterize the trade-off between positive hits and false alarms
• ROC curve plots sensitivity (on the y-axis) against (1-specificity) (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve
– changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point
ROC Curve
Points written as (Sensitivity, 1-specificity):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class
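A minimal sketch of how the points of a ROC curve arise from changing the threshold of a classifier that outputs a score. The scores and labels are made up, the function name is hypothetical, and both classes are assumed to be present. Points are returned in the same (sensitivity, 1 - specificity) order used in the list above.

```python
def roc_points(actual, scores, positive=1):
    """Return one (sensitivity, 1 - specificity) pair per threshold.

    An observation is declared positive when its score is >= the threshold.
    """
    n_pos = sum(a == positive for a in actual)
    n_neg = len(actual) - n_pos
    points = [(0.0, 0.0)]                  # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and a == positive
                 for a, s in zip(actual, scores))
        fp = sum(s >= threshold and a != positive
                 for a, s in zip(actual, scores))
        points.append((tp / n_pos, fp / n_neg))
    return points


# Hypothetical classifier scores for six observations.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for sensitivity, false_alarm_rate in roc_points(actual, scores):
    print(f"sensitivity={sensitivity:.2f}  1-specificity={false_alarm_rate:.2f}")
```

Lowering the threshold moves the point from (0,0), where everything is declared negative, towards (1,1), where everything is declared positive.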
Model Selection
• In practice, often not much difference in performance between several approaches.
• Aim to choose the model which is:
– Interpretable - can we see or understand why the model is making the decisions it makes?
– Simple - easy to explain and understand
– Accurate
– Fast (to train and test)
– Scalable (can be applied to a large data set)
EXAMPLES
Hidden Markov Models
[Figure: trellis of hidden states 1, 2, …, K (state path πi) at each position, emitting observations x1, x2, x3, …, xK]
Gene finding
input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACG
[Figure: the most probable path through the hidden states for the input sequence gives the gene prediction (exon 1, exon 2, exon 3).]
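The most probable path of an HMM is typically found with the Viterbi algorithm. The sketch below is a toy illustration only: the two states, all probabilities and the short sequence are invented for the example and are far simpler than a real gene finder. It assumes every probability is positive so that logarithms can be taken.

```python
from math import log


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] + log(trans_p[r][s]))
            best[t][s] = (best[t - 1][prev] + log(trans_p[prev][s])
                          + log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical two-state model: GC-rich "exon" and AT-rich "intron".
states = ("exon", "intron")
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
emit_p = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GCGCATAT", states, start_p, trans_p, emit_p)
print(path)
```

Working in log-probabilities avoids numerical underflow on long sequences, which matters for genome-scale input.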
Crossovers in meiosis
ChromHMM: annotating genomic regions
Ernst & Kellis 2012
MammaPrint
van 't Veer et al. Nature 2002; van de Vijver et al. NEJM 2002
Based on correlation