
CODATA-RDA, Advanced workshop on Bioinformatics, Trieste 2018

Introduction to Machine Learning

Amel Ghouila
[email protected]
@AmelGhouila

Institut Pasteur de Tunis


Session overview

01 Introduction to basic concepts of Data mining and Machine learning

02 Machine learning taxonomy

03 Supervised classification vs unsupervised classification

04 Algorithm examples

05 Examples of applications in Bioinformatics


https://www.linkedin.com/pulse/technology-increase-vs-department-budgets-sam-errington/


From data to knowledge


AI & ML

•  AI is a broader concept than ML; it addresses the use of computers to mimic the cognitive functions of humans.

•  When machines carry out tasks based on algorithms in an intelligent manner, that is AI.

•  ML is a subset of AI and focuses on the ability of machines to receive a set of data and learn from it, improving their algorithms as they learn more about the information being processed.


ML & Data mining

•  ML embodies the principles of DM.
•  DM and ML share the same foundation but apply it in different ways.
•  DM requires human interaction.
•  DM cannot see the relationships between different data aspects with the same depth as ML.
•  ML learns from the data and allows the machine to teach itself.
•  DM is typically used as an information source for ML to pull from.
•  ML is more about building the prediction model.


AI, ML & DM

•  Data mining produces insights.
•  ML produces predictions.
•  AI produces actions.

https://medium.freecodecamp.org/using-machine-learning-to-predict-the-quality-of-wines-9e2e13d7480d


Deep learning

•  Deep learning is a subset of ML.
•  Deep learning algorithms go a level deeper than classical ML, involving many layers.
•  Layers: a nested hierarchy of related concepts.
•  The answer to a question is obtained by answering other, related, deeper questions.


Data is at the heart of ML

•  Machine learning algorithms are driven by the data used.
•  Data quality is very important.
•  Identifying incomplete, incorrect and irrelevant parts of the data is an important step.
•  Preprocessing data before applying ML is a crucial step.


How do we humans make decisions? Do we all make the same decisions?

[Diagram: observations, external information and experiences are combined with beliefs, creativity, common sense and limited memory; decisions come from comparing to expectations and analysing the differences.]


How does a computer work?

It follows instructions given by a human.


Artificial intelligence

[Diagram: data, computing and storage enable fast responses, the ability to memorize large amounts of data, the simulation of human behaviour and cognitive processes, and the capture and preservation of human expertise.]


Artificial intelligence

[Diagram: machine learning algorithms take data and results as input and produce rules and predictions.]


How do machines learn?

[Diagram: data is used to create models; the models are evaluated and refined, then used for decisions (prediction, categorization).]


Introduction to Machine Learning [1]

•  Learning begins with observations or data.
   –  Examples: direct experience, or instruction.

•  The system looks for patterns in data and makes better decisions in the future based on the examples that we provide.

•  The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust actions accordingly.

Input data → Machine learning (model) → Prediction


Introduction to Machine Learning [2]

•  For example, in the context of genome annotation, a machine learning system can be used to:
   –  'learn' how to recognize the locations of transcription start sites (TSSs) in a genome sequence
   –  identify splice sites and promoters

•  In general, if one can compile a list of sequence elements of a given type, then a machine learning method can probably be trained to recognize those elements.


Introduction to Machine Learning [3]

•  Any machine learning problem can be represented with the following three concepts:
   –  We will have to learn to solve a task T.
      •  For example, perform genome annotation.
   –  We will need some experience E to learn to perform the task. Usually, experience is represented through a dataset.
      •  For gene prediction, experience comes as a set of sequences whose genes have been previously discovered and their locations annotated.
   –  We will need a measure of performance P to know how well we are solving the task, and also to know whether, after making some modifications, our results are improving or getting worse.
      •  The percentage of genes that our gene prediction model correctly classifies as genes could be P for our gene prediction task.


The ML taxonomy


The ML taxonomy

•  Machine learning algorithms are often categorized as supervised or unsupervised.

•  We also have semi-supervised machine learning and reinforcement machine learning.


Supervised Machine Learning


Supervised Machine Learning Algorithms [1]

•  Apply what has been learned in the past to new data, using labeled examples to predict future events.

•  Starting from the analysis of a known training dataset, the learning algorithm produces a prediction model that can provide targets for any new input (after sufficient training).

•  The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify and improve the prediction model accordingly.


Classification vs regression

https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/


Classification vs regression

Classification | Regression
Discrete, categorical variable | Continuous (real number range)
Supervised learning problem | Supervised learning problem
Assign the output to a class (a label) | Predict the output value using training data
Predict the type of tumor (harmful vs not harmful) | Predict a house price, predict survival time


Validation of supervised ML algorithm results

•  To test the performance of the learning system:
   –  The system can be tested with sequences where the labels are known (and were excluded from the training set because they were intended to be used for this purpose).
   –  Based on the results on the test data, the performance of the learning system can be assessed.


Training set and test set

[Diagram: the dataset is split into a training set, used to train the algorithm, and a testing set, used to estimate the accuracy of the model.]

Split the dataset randomly! Use cross-validation. Watch out for underfitting and overfitting problems.
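As a rough illustration (not part of the original slides), the sketch below performs such a random split with scikit-learn's train_test_split; the dummy data, the 80/20 ratio and the fixed random_state are arbitrary assumptions.

```python
# Minimal sketch of a random train/test split (assumes scikit-learn is installed).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)           # dummy feature matrix: 100 samples, 5 features
y = np.random.randint(0, 2, 100)     # dummy binary labels

# Hold out 20% of the samples as a test set; shuffle reproducibly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (80, 5) (20, 5)
```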


K-fold cross validation

https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/#type-of-learning-problems
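A minimal sketch of k-fold cross-validation with scikit-learn (not from the slides); the logistic-regression model, the synthetic data and the choice of 5 folds are illustrative assumptions.

```python
# Minimal sketch of 5-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Train and evaluate on 5 different train/test partitions, then average the accuracy.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```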


Examples of supervised learning algorithms


Linear regression

•  Regression algorithms can be used, for example, when some continuous value needs to be computed, as compared to classification where the output is categorical.
•  So whenever there is a need to predict some future value of a process which is currently running, a regression algorithm can be used.
•  Operating on a two-dimensional set of observations (two continuous variables), simple linear regression attempts to fit, as best as possible, a line through the data points.
•  The regression line (our model) becomes a tool that can help uncover underlying trends in our dataset.
•  The regression line, when properly fitted, can serve as a predictive model for new events.
•  Linear regressions are, however, unstable when features are redundant, i.e. if there is multicollinearity.
•  Example where linear regression can be used:
   –  using gene expression data to classify (or predict) tumor types


Applying linear regression
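The original slide illustrated this with a figure; as a rough stand-in (an assumption, not the author's example), the sketch below fits scikit-learn's LinearRegression to synthetic one-dimensional data and predicts a new value.

```python
# Minimal sketch: fit a line through noisy (x, y) points and predict a new value.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))               # one continuous predictor
y = 2.5 * x.ravel() + 1.0 + rng.normal(0, 1, 50)   # noisy linear response

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)            # estimated slope and intercept
print(model.predict([[4.0]]))                      # prediction for a new observation
```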


Decision Trees (Supervised)

•  Single trees are used very rarely, but in composition with many others they build very efficient algorithms such as Random Forest or Gradient Tree Boosting.
•  Decision trees easily handle feature interactions and they are non-parametric, so there is no need to worry about outliers or whether the data is linearly separable.
•  Disadvantages are:
   –  Often the tree needs to be rebuilt when new examples come in.
   –  Decision trees easily overfit, but ensemble methods like random forests (or boosted trees) take care of this problem.
   –  They can also take a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be).
•  Trees are excellent tools for helping to choose between several courses of action.
•  Example: classification of genomic islands using decision trees and ensemble algorithms.
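A minimal decision-tree sketch with scikit-learn (an illustrative addition, not from the slides); the synthetic two-class data and the max_depth value are assumptions.

```python
# Minimal sketch: train and evaluate a single decision tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Limiting the depth is a simple guard against the overfitting mentioned above.
tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on held-out data
```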


Random Forest (Supervised)

•  Random Forest is an ensemble of decision trees.
•  It can solve both regression and classification problems with large datasets.
•  It also helps identify the most significant variables among thousands of input variables.
•  Random Forest is highly scalable to any number of dimensions and generally has quite acceptable performance.
•  However, with Random Forest, learning may be slow (depending on the parameterization) and it is not possible to iteratively improve the generated models.
•  Random Forest can be used in real-world applications such as:
   –  predicting patients at high risk for certain diseases
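A minimal random-forest sketch with scikit-learn (illustrative only); the synthetic data, the 100 trees and the use of feature_importances_ to rank variables are assumptions.

```python
# Minimal sketch: random forest classification plus variable-importance ranking.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

forest = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_train, y_train)
print(forest.score(X_test, y_test))              # held-out accuracy

# Rank features by their estimated importance (most significant variables first).
print(np.argsort(forest.feature_importances_)[::-1][:5])
```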


Support Vector Machines (Supervised)

•  Support Vector Machine (SVM) is a supervised machine learning technique that is widely used in pattern recognition and classification problems, when your data has exactly two classes.
•  Advantages include high accuracy, and even if the data is not linearly separable in the base feature space, SVM can work well with an appropriate kernel.
•  However, SVMs are memory-intensive, hard to interpret, and difficult to tune.
•  SVM is especially popular in text classification problems, where very high-dimensional spaces are the norm.
•  SVM can be used in real-world bioinformatics applications such as:
   –  detecting persons with common diseases such as diabetes
   –  classification of genomic islands
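A minimal SVM sketch with scikit-learn (illustrative only); the RBF kernel, the feature-scaling step and the synthetic two-class data are assumptions.

```python
# Minimal sketch: SVM with an RBF kernel on scaled features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=30, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# SVMs are sensitive to feature scales, so standardize before fitting.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```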


Naive Bayes (Supervised)

•  It is a classification technique based on Bayes' theorem.
•  Advantages include:
   –  very easy to build and particularly useful for very large datasets;
   –  can outperform even highly sophisticated classification methods;
   –  a good choice when CPU and memory resources are a limiting factor;
   –  a good method if something fast and easy that performs pretty well is needed.
•  Its main disadvantage is that it does not consider the interactions between features.
•  Naive Bayes can be used in real-world applications such as:
   –  mining housekeeping genes
   –  genetic association studies
   –  discovering Alzheimer genetic biomarkers from whole genome sequencing (WGS) data
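A minimal Gaussian naive Bayes sketch with scikit-learn (illustrative only); the Gaussian feature likelihoods and the synthetic data are assumptions, and other variants (multinomial, Bernoulli) exist for count or binary features.

```python
# Minimal sketch: Gaussian naive Bayes classification.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

nb = GaussianNB().fit(X_train, y_train)       # fast to train, no tuning required
print(nb.score(X_test, y_test))
print(nb.predict_proba(X_test[:3]))           # class probabilities for 3 samples
```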


Unsupervised Learning

https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms


Unsupervised Machine Learning Algorithms [1]

•  In contrast to supervised machine learning algorithms, they:
   –  are applied when the information used to train is neither classified nor labeled;
   –  can infer a function to describe a hidden structure from unlabeled data;
   –  do not figure out the right output, but explore the data and can draw inferences from datasets to describe hidden structures in unlabeled data.

•  The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
   –  Algorithms are left to their own devices to discover and present the interesting structure in the input data.


Unsupervised Machine Learning Algorithms [2]

•  Unsupervised learning problems can be further grouped into clustering, association and dimensionality reduction problems:
   –  Clustering: a clustering problem is where you want to discover the inherent groupings in the data, such as clustering DNA sequences into functional groups.
   –  Association: an association rule learning problem is where you want to discover rules that describe large portions of your data, such as using association-analysis-based techniques for pre-processing protein interaction networks for the task of protein function prediction.
   –  Dimensionality reduction: often we are working with data of high dimensionality (each observation comes with a high number of measurements); a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out.
      •  For example, in a gene-expression analysis, dimension reduction can be used to find a list of candidate genes of a more operable length, ideally including all the relevant genes.
      •  Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power.


Unsupervised Machine Learning Algorithms [3]

•  Clustering is an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships.
•  Each cluster that arises during the analysis defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, which is why clustering is also sometimes called unsupervised classification.


Unsupervised Machine Learning Algorithms [5]

•  Taking the example of the gene-finding model, when a labeled training set is not available, unsupervised learning is required.
•  Consider the interpretation of a heterogeneous collection of epigenomic data sets, such as those generated by the Encyclopedia of DNA Elements (ENCODE) Consortium and the Roadmap Epigenomics Project.


Unsupervised Machine Learning Algorithms [6]

•  A priori, we expect that the patterns of chromatin accessibility, histone modifications and transcription factor binding along the genome should be able to provide a detailed picture of the biochemical and functional activity of the genome.
   –  We may also expect that these activities could be accurately summarized using a fairly small set of labels.
•  To discover what types of label best explain the data, rather than imposing a pre-determined set of labels on the data, an unsupervised learning method can be applied.


Unsupervised Machine Learning Algorithms [7]

•  Such a method uses only the unlabeled data and the desired number of different labels as input, to automatically partition the genome into segments and assign a label to each segment, with the goal of assigning the same label to segments that have similar data.
•  The unsupervised approach requires an additional step in which semantics must be manually assigned to each label, but it provides the benefits of enabling training when labeled examples are unavailable and the ability to identify potentially novel types of genomic elements.


Supervised vs unsupervised learning

Supervised | Unsupervised
Input data is labelled | Input data is unlabelled
Uses a training dataset | Uses just the input dataset
Known number of classes | Unknown number of classes
Guided by an expert (labelled data provided) | Self-guided learning (using some criteria)
Goal: predict a class or value label | Goal: analyse the data, determine its structure/grouping
Classification and regression | Clustering, dimensionality reduction, density estimation


Supervised vs unsupervised learning

https://www.cisco.com/c/m/en_us/network-intelligence/service-provider/digital-transformation/get-to-know-machine-learning.html


Clustering validation

Assessing the quality of a clustering algorithm's results is important to avoid finding random patterns in the data.

•  Internal cluster validation: uses only information internal to the clustering process, without reference to external information (cluster separability, cluster homogeneity, etc.).
   –  Clusters should be well separated and the intra-cluster distance should be small.
   –  Silhouette coefficient: estimates the average distance between clusters.
   –  Dunn index: estimates distances between objects in the same cluster vs objects in different clusters (should be maximized).
•  External cluster validation: compares results to externally known results (example: class labels), compares the results of different clustering methods, etc.
•  Relative cluster validation: evaluates the clustering methods by varying different parameter values for the same algorithm (example: number of clusters).
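A minimal internal-validation sketch with scikit-learn (illustrative only); the synthetic blobs, the k-means clustering and the use of silhouette_score are assumptions, and the Dunn index is not in scikit-learn so it would need a separate implementation.

```python
# Minimal sketch: internal cluster validation with the silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=5)

# Compare silhouette scores for different numbers of clusters k.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=5).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher is better
```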


Examples of unsupervised learning algorithms


Neural Networks (can be supervised or unsupervised)

•  Neural networks learn the weights of the connections between neurons. When all weights are trained, the neural network can be used to predict a class or a quantity.
•  With neural networks, extremely complex models can be trained, and they can be used as a kind of black box.
•  Disadvantages:
   –  parameterization is extremely difficult in neural networks;
   –  they are also very resource- and memory-intensive.
•  NNs can be combined with the "deep" approach to build models that can pick up previously unpredictable cases.
•  They may be applied for classification, predictive modelling and biomarker identification within datasets of high complexity, such as transcript or gene expression data generated from DNA microarray analysis, or peptide/protein-level data generated by mass spectrometry.
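A minimal feed-forward neural network sketch using scikit-learn's MLPClassifier (illustrative only); the single hidden layer of 32 units, the scaling step and the synthetic data are assumptions.

```python
# Minimal sketch: a small multi-layer perceptron (feed-forward neural network).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

# One hidden layer of 32 units; the connection weights are learned by backpropagation.
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=6))
net.fit(X_train, y_train)
print(net.score(X_test, y_test))
```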


Principal Component Analysis (PCA) (Unsupervised)

•  PCA provides dimensionality reduction.
•  Sometimes you have a wide range of features, probably highly correlated with each other, and models can easily overfit on a huge amount of data. Then PCA can be applied.
•  Advantage:
   –  in addition to the low-dimensional sample representation, it provides a synchronized low-dimensional representation of the variables. The synchronized sample and variable representations provide a way to visually find variables that are characteristic of a group of samples.
•  PCA can be used in bioinformatics to:
   –  analyse gene expression data
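A minimal PCA sketch with scikit-learn (illustrative only); the synthetic 50-dimensional data, the standardization step and the choice of two components are assumptions.

```python
# Minimal sketch: project high-dimensional data onto its first two principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 50))                 # dummy data: 100 samples, 50 features

X_scaled = StandardScaler().fit_transform(X)   # centre and scale each feature
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                              # (100, 2): low-dimensional representation
print(pca.explained_variance_ratio_)           # variance captured by each component
```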


K-Means (Unsupervised)

•  The goal of k-means is to find groups in the data, with the number of groups represented by the variable K.
•  The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
•  Advantage: easy to implement, and fast and efficient in terms of computational cost.
•  Disadvantages include:
   –  initial seeds have a strong impact on the final results;
   –  the order of the data has an impact on the final results;
   –  k-means needs to know in advance how many clusters there will be in your data, so this may require a lot of trials to "guess" the best number K of clusters to define.
•  Example: popular and simple partition-based computational models for clustering microarray data.


Semi-supervised Machine Learning Algorithms [1]

•  In supervised learning, the algorithm receives as input a collection of data points, each with an associated label, whereas in unsupervised learning the algorithm receives the data but no labels.
   –  The semi-supervised setting is a mixture of these two approaches: the algorithm receives a collection of data points, but only a subset of these data points have associated labels.
•  So, they fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data.
•  The systems that use this method are able to considerably improve learning accuracy.


Semi-supervised Machine Learning Algorithms [2]

•  Consider the gene-finding model where the system is provided with labeled data and unlabeled data.
   –  The learning procedure begins by constructing an initial gene-finding model on the basis of the labeled subset of the training data alone.
   –  Next, the model is used to scan the genome, and tentative labels are assigned throughout the genome.
   –  These tentative labels can then be used to improve the learned model, and the procedure iterates until no new genes are found.
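This iterative retraining loop is essentially self-training; a minimal sketch of the same idea with scikit-learn's SelfTrainingClassifier is below (illustrative only; the logistic-regression base model, the synthetic data and the fraction of hidden labels are assumptions).

```python
# Minimal sketch: self-training on partially labeled data.
# Unlabeled samples are marked with the label -1, as scikit-learn expects.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=9)

rng = np.random.default_rng(9)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.9] = -1        # hide ~90% of the labels

# The base classifier is refit repeatedly, adding its most confident predictions
# as tentative labels until no new points can be labeled.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print((model.predict(X) == y).mean())           # agreement with the true labels
```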


Semi-supervised Machine Learning Algorithms [3]

•  In practice, gene-finding systems are often trained using a semi-supervised approach, in which the input is a collection of annotated genes and an unlabeled whole-genome sequence.
•  The semi-supervised approach can work much better than a fully supervised approach because the model is able to learn from a much larger set of genes (all of the genes in the genome) rather than only the subset of genes that have been identified with high confidence.


Reinforcement Machine Learning Algorithms [1]

•  The learning system interacts with the environment by producing actions and discovers errors or rewards.
   –  The goal is to develop a system (agent) that improves its performance based on interactions with its environment.
•  Through its interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward, via an exploratory trial-and-error approach or deliberative planning.


Reinforcement Machine Learning Algorithms [2]

•  The idea behind reinforcement learning is that an agent will learn from the environment by interacting with it and receiving rewards for performing actions.
•  Learning from interaction with the environment comes from our natural experiences.
   –  Consider a child in a living room who sees a fireplace and approaches it.
   –  It's warm, it's positive, the child feels good (positive reward +1) and understands that fire is a positive thing.
   –  Next he tries to touch the fire and it burns his hand (negative reward -1). He then understands that fire is positive when he is a sufficient distance away, because it produces warmth, but if he gets too close to it, he will be burned.


Deep Learning Algorithms [1]

•  Also known as deep structured learning or hierarchical learning.
•  It is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.
•  Can perform learning in supervised and/or unsupervised manners.
•  Teaches computers to do what comes naturally to humans: learn by example.
   –  Key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost.
   –  Used in medical research:
      •  Cancer researchers are using deep learning to automatically detect cancer cells.
      •  Teams at UCLA built an advanced microscope that yields a high-dimensional data set used to train a deep learning application to accurately identify cancer cells.


Deep Learning Algorithms [2]

•  While deep learning was first theorized in the 1980s, there are two main reasons it has only recently become useful:
   –  Deep learning requires large amounts of labeled data.
      •  For example, driverless car development requires millions of images and thousands of hours of video.
   –  Deep learning requires substantial computing power.
      •  High-performance GPUs have a parallel architecture that is efficient for deep learning.
      •  When combined with clusters or cloud computing, this enables development teams to reduce training time for a deep learning network from weeks to hours or less.


Deep Learning Algorithms [3]

•  Most deep learning methods use neural network architectures, which is why deep learning models are often referred to as deep neural networks.
•  The term "deep" usually refers to the number of hidden layers in the neural network.
   –  Traditional neural networks only contain 2-3 hidden layers, while deep networks can have as many as 150.
•  Deep learning models are trained by using large sets of labeled data and neural network architectures that learn features directly from the data, without the need for manual feature extraction.
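A minimal deep neural network sketch in Keras (illustrative only, and assumes TensorFlow is installed); the input size, the three hidden layers of 64 units and the binary-classification output are arbitrary choices.

```python
# Minimal sketch: a small "deep" feed-forward network with several hidden layers.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(100,)),              # 100 input features
    keras.layers.Dense(64, activation="relu"),     # hidden layer 1
    keras.layers.Dense(64, activation="relu"),     # hidden layer 2
    keras.layers.Dense(64, activation="relu"),     # hidden layer 3
    keras.layers.Dense(1, activation="sigmoid"),   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10, batch_size=32) would train on labeled data.
```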


Deep Learning Algorithms [4]

•  Deep learning is now one of the most active fields in machine learning and has been shown to improve performance in image and speech recognition.
•  The potential of deep learning in high-throughput biology is clear:
   –  it allows us to better exploit the availability of increasingly large and high-dimensional data sets (e.g. from DNA sequencing, RNA measurements, flow cytometry or automated microscopy) by training complex networks with multiple layers that capture their internal structure.


Deep Learning Algorithms [5]

•  Example:
   –  Multi-label Deep Learning for Gene Function Annotation in Cancer Pathways [Renchu Guan, Xu Wang, Mary Qu Yang, Yu Zhang, Fengfeng Zhou, Chen Yang & Yanchun Liang, Scientific Reports, volume 8 (2018)]
   –  Applied deep learning to explore full texts of biomedical articles, where detailed methodologies, experimental results, critical discussions and interpretations can be found, for the analysis of gene multi-functions relevant to cancer pathways derived from full-text biomedical publications.
      •  Without the involvement of a biologist to do a feature study of the data.
   –  Experimental results on eight KEGG cancer pathways revealed that this new system is not only superior to classical multi-label learning models, but can also identify numerous gene functions related to important cancer pathways.


Opportunities for Deep Learning in Genomics

https://towardsdatascience.com/opportunities-and-obstacles-for-deep-learning-in-biology-and-medicine-6ec914fe18c2

From: Machine learning in bioinformatics. Brief Bioinform. 2006;7(1):86-112. doi:10.1093/bib/bbk007

Applications of ML in Bioinformatics


Problem at hand | Data | Method(s)
Identification of biomarkers | Proteomics datasets, transcriptomics datasets | BioHEL rule-based learning method (Swan et al., 2015)
Cancer classification from microarray gene expression data | Transcriptomics data | J4.8 decision tree from Weka, Naive Bayes, SVM (Peng et al., 2007)
Inference of demographic history and recombination rates in population genetics | Population genomic datasets | Artificial Neural Network (ANN) (Blum et al., 2010)
Showing how relationships among individuals sampled from Europe largely mirrored geography | Population genetics data | PCA (Novembre et al., 2008)
Uncovering differences in evolutionary rates along a chromosome | Phylogenetic data | Hidden Markov Model (HMM) (Schrider et al., 2018)
Quantifying the ability of TF-binding signals to statistically predict the expression levels of promoters | Cell-line-specific TF binding data | Random Forest, Support Vector Regression (SVR), multivariate adaptive regression splines (MARS)
Genomic selection in breeding wheat for rust resistance | Genomic selection data | Reproducing kernel Hilbert space, Bayesian LASSO, random forest regression, support vector classification (González-Camacho et al., 2018)
Assigning each read from high-throughput sequencing of the DNA present in a sample to a taxonomic clade | Metagenomics data | Compositional approaches based on the generative NB classifier and on Kraken (Vervier et al., 2016)

Some examples of tools used for ML in bioinformatics


Is there a perfect ML technique?

•  There is not one solution (one machine learning algorithm) or one approach that fits all problems.

•  For each problem, there is not one single solution.


Which technique to use?

•  Size, quality and nature of the data to be analysed.

•  The question, the answer expected, and also the expected accuracy.

•  How the result will be used.
•  Time and computing resources available.
•  It is always good to check the performance of different algorithms and compare the results.


What kind of data do you have?

•  If the data to be analysed is labelled, it is a supervised learning problem.

•  However, even when labels are available, it is not always the case that taking a supervised approach is a good idea (size and quality of the training and test sets).

•  In general, supervised learning should be employed only when the training set and test set are expected to exhibit similar statistical properties.


What kind of data do you have?

•  If the data to be analysed is unlabelled and the aim is to find structure, it is an unsupervised learning problem.

•  If the aim is to optimize an objective function by interacting with an environment, it is a reinforcement learning problem.

•  When supervised learning is feasible, it is often the case that additional, unlabelled data points are easy to obtain.

•  How do you decide between a supervised and a semi-supervised approach?

•  A good rule of thumb is to use semi-supervised learning if you do not have very much labelled data but you do have a very large amount of unlabelled data.


What is the expected output?

•  If the output of your model is a number, it is a regression problem.
   –  Two-class classification of gene expression data
•  If the output of your model is a class, it is a classification problem.
   –  Genomic classification of AML
•  If the output of your model is a set of input groups, it is a clustering problem.
   –  Patterns in gene expression at different developmental stages of zebrafish


Tools

•  All the methods listed above are already available in Python, R (https://www.r-project.org/about.html) or Matlab using existing packages. Some basic code needs to be written.

•  If you are not used to writing code, you may use a tool like WEKA (https://www.cs.waikato.ac.nz/ml/weka/) or RapidMiner (https://rapidminer.com/): the methods are already implemented and you simply need to load your data in csv, arff, … format and run the selected methods.

•  Some useful R packages implementing many ML techniques: https://cran.r-project.org/web/views/MachineLearning.html


Some Online Resources

•  https://machinelearningmastery.com/start-here/

•  https://www.datascience.com/blog
•  https://www.mathworks.com/discovery/machine-learning.html

•  https://www.coursera.org/browse/data-science


Sources

•  http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#data-preparation

•  https://medium.mybridge.co/30-amazing-machine-learning-projects-for-the-past-year-v-2018-b853b8621ac7

•  Shakuntala Baichoo and Zahra Mungloo slides (H3ABionet, ML group)