CODATA-RDA, Advanced workshop on Bioinformatics, Trieste 2018
Introduction to Machine Learning
Amel Ghouila
[email protected]
@AmelGhouila
Session overview
01 Introduction to basic concepts of Data mining and Machine learning
02 Machine learning taxonomy
03 Supervised classification vs unsupervised classification
04 Algorithms examples
05 Examples of applications in Bioinformatics
https://www.linkedin.com/pulse/technology-increase-vs-department-budgets-sam-errington/
AI & ML
• AI is a broader concept than ML; it addresses the use of computers to mimic the cognitive functions of humans.
• When machines carry out tasks based on algorithms in an intelligent manner, that is AI.
• ML is a subset of AI that focuses on the ability of machines to receive a set of data and learn from it, improving their algorithms as they learn more about the information being processed.
ML & Data mining
• ML embodies the principles of DM.
• DM and ML share the same foundation but apply it in different ways:
– DM requires human interaction.
– DM can't see the relationships between different data aspects with the same depth as ML.
– ML learns from the data and allows the machine to teach itself.
– DM is typically used as an information source for ML to pull from.
– ML is more about building the prediction model.
AI, ML & DM
• Data mining produces insights.
• ML produces predictions.
• AI produces actions.
https://medium.freecodecamp.org/using-machine-learning-to-predict-the-quality-of-wines-9e2e13d7480d
Deep learning
• Deep learning is a subset of ML.
• Deep learning algorithms go a level deeper than classical ML, involving many layers.
• Layers: a nested hierarchy of related concepts.
• The answer to a question is obtained by answering other, related, deeper questions.
Data is at the heart of ML
• Machine learning algorithms are driven by the data used.
• Data quality is very important.
• Identifying incomplete, incorrect and irrelevant parts of the data is an important step.
• Preprocessing data before applying ML is a crucial step.
How do we humans make decisions? Do we all make the same decisions?
• Inputs: observations, external information, experiences
• Shaped by beliefs, creativity and common sense; limited by memory
• Process: compare to expectations, analyze differences
How does a computer work?
• It follows instructions given by a human.
Artificial intelligence
• Fast response
• Ability to memorize big amounts of data
• Simulate human behavior and cognitive processes
• Capture and preserve human expertise
Enabled by data, computing and storage.
Artificial intelligence
• Machine learning algorithms take data and rules as input and produce results and predictions.
How do machines learn?
• Data to model: create models
• Evaluate models
• Refine models
• Decision: prediction, categorization
Introduction to Machine Learning [1]
• Learning begins with observations or data
– Examples: direct experience, or instruction
• The system looks for patterns in data and makes better decisions in the future based on the examples that we provide.
• The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust actions accordingly.
Input Data → Machine Learning (Model) → Prediction
Introduction to Machine Learning [2]
• For example, in the context of genome annotation, a machine learning system can be used to:
– 'learn' how to recognize the locations of transcription start sites (TSSs) in a genome sequence
– identify splice sites and promoters
• In general, if one can compile a list of sequence elements of a given type, then a machine learning method can probably be trained to recognize those elements.
Introduction to Machine Learning [3]
• Any machine learning problem can be represented with the following three concepts:
– We will have to learn to solve a task T.
• For example, perform genome annotation.
– We will need some experience E to learn to perform the task. Usually, experience is represented through a dataset.
• For gene prediction, experience comes as a set of sequences whose genes have been previously discovered and their locations annotated.
– We will need a measure of performance P to know how well we are solving the task, and also to know whether, after doing some modifications, our results are improving or getting worse.
• The percentage of genes that our gene prediction model correctly classifies as genes could be P for our gene prediction task.
The ML taxonomy
• Machine learning algorithms are often categorized as supervised or unsupervised.
• We also have semi-supervised machine learning and reinforcement machine learning.
Supervised Machine Learning Algorithms [1]
• Apply what has been learned in the past to new data, using labeled examples to predict future events.
• Starting from the analysis of a known training dataset, the learning algorithm produces a prediction model that can provide targets for any new input (after sufficient training).
• The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify and improve the prediction model accordingly.
Classification vs regression
https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/
Classification vs regression

Classification | Regression
Discrete, categorical variable | Continuous (real number range)
Supervised classification problem | Supervised regression problem
Assign the output to a class (a label) | Predict the output value using training data
Predict the type of tumor (harmful vs not harmful) | Predict a house price, predict survival time
Validation of supervised ML algorithm results
• To test the performance of the learning system:
– The system can be tested with sequences where the labels are known (and were excluded from the training set because they were intended to be used for this purpose).
– Based on the results of the test data, the performance of the learning system can be assessed.
Training set and test set
• The dataset is split into a training set and a testing set:
– Training set: used to train the algorithm
– Testing set: used to estimate the accuracy of the model
• Split the dataset randomly!
• Use cross-validation.
• Beware of underfitting and overfitting problems.
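As an illustration (not from the slides), a random train/test split can be sketched in a few lines of pure Python. The function name, data layout and 25% hold-out fraction are all invented for this example:

```python
import random

def train_test_split(data, labels, test_fraction=0.25, seed=0):
    """Randomly split parallel lists of examples and labels into
    a training set and a held-out testing set."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)      # reproducible random shuffle
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [data[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [data[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test
```

Shuffling before splitting is what makes the split random; fixing the seed keeps the experiment reproducible.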
K-fold cross-validation
https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/#type-of-learning-problems
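The k-fold idea from the figure can be sketched as a small index generator (a pure-Python illustration; the function name is hypothetical): each of the k folds serves once as the test set while the remaining folds form the training set.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over n_samples examples."""
    # distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size
```

A model would be trained and evaluated once per yielded pair, and the k accuracy estimates averaged.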
Linear regression
• Regression algorithms can be used, for example, when some continuous value needs to be computed, as compared to classification where the output is categorical.
• So whenever there is a need to predict some future value of a process which is currently running, a regression algorithm can be used.
• Operating on a two-dimensional set of observations (two continuous variables), simple linear regression attempts to fit, as best as possible, a line through the data points.
• The regression line (our model) becomes a tool that can help uncover underlying trends in our dataset.
• The regression line, when properly fitted, can serve as a predictive model for new events.
• Linear regressions are however unstable when features are redundant, i.e. if there is multicollinearity.
• An example where linear regression can be used:
– Using gene expression data to classify (or predict) tumor types
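Fitting a line through two-dimensional observations, as described above, reduces to two closed-form formulas. A minimal pure-Python sketch (function names are invented for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Use the fitted regression line as a predictive model."""
    return slope * x + intercept
```

Once `slope` and `intercept` are fitted, `predict` is the "tool for new events" the slide mentions.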
Decision Trees (Supervised)
• Single trees are used very rarely, but in composition with many others they build very efficient algorithms such as Random Forest or Gradient Tree Boosting.
• Decision trees easily handle feature interactions, and they are non-parametric, so there is no need to worry about outliers or whether the data is linearly separable.
• Disadvantages are:
– Often the tree needs to be rebuilt when new examples come in.
– Decision trees easily overfit, but ensemble methods like random forests (or boosted trees) take care of this problem.
– They can also take a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be).
• Trees are excellent tools for helping to choose between several courses of action.
• Example: classification of genomic islands using decision trees and ensemble algorithms
Random Forest (Supervised)
• Random Forest is an ensemble of decision trees.
• It can solve both regression and classification problems with large datasets.
• It also helps identify the most significant variables from thousands of input variables.
• Random Forest is highly scalable to any number of dimensions and generally has quite acceptable performance.
• However, with Random Forest, learning may be slow (depending on the parameterization) and it is not possible to iteratively improve the generated models.
• Random Forest can be used in real-world applications such as:
– Predicting patients at high risk for certain diseases
Support Vector Machines (Supervised)
• Support Vector Machine (SVM) is a supervised machine learning technique that is widely used in pattern recognition and classification problems when your data has exactly two classes.
• Advantages include high accuracy, and even if the data is not linearly separable in the base feature space, SVM can work well with an appropriate kernel.
• However, SVMs are memory-intensive, hard to interpret, and difficult to tune.
• SVM is especially popular in text classification problems where very high-dimensional spaces are the norm.
• SVM can be used in real-world bioinformatics applications such as:
– detecting persons with common diseases such as diabetes
– classification of genomic islands
Naive Bayes (Supervised)
• It is a classification technique based on Bayes' theorem.
• Advantages include:
– very easy to build and particularly useful for very large datasets.
– can outperform even highly sophisticated classification methods.
– a good choice when CPU and memory resources are a limiting factor.
– a good method if something fast and easy that performs pretty well is needed.
• Its main disadvantage is that it does not consider the interactions between features.
• Naive Bayes can be used in real-world applications such as:
– mining housekeeping genes
– genetic association studies
– discovering Alzheimer's genetic biomarkers from whole genome sequencing (WGS) data
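The "naive" feature-independence assumption mentioned above can be shown with a tiny categorical classifier. This is a pure-Python sketch for illustration, not code from the workshop; the function names, the add-one (Laplace) smoothing choice and the toy data are all assumptions:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Count per-class feature-value frequencies for a categorical
    naive Bayes classifier. `samples` is a list of feature tuples."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (class, position) -> value counts
    for sample, label in zip(samples, labels):
        for pos, value in enumerate(sample):
            feature_counts[(label, pos)][value] += 1
    values = [set(s[pos] for s in samples) for pos in range(len(samples[0]))]
    return class_counts, feature_counts, values

def predict_nb(model, sample):
    """Return the class maximizing log P(class) + sum_i log P(feature_i | class),
    treating features as independent given the class (the naive assumption)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)
        for pos, value in enumerate(sample):
            num = feature_counts[(label, pos)][value] + 1     # Laplace smoothing
            den = count + len(values[pos])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because training is just counting, this is why naive Bayes is cheap on CPU and memory, as the slide notes.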
Unsupervised Learning
https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms
Unsupervised Machine Learning Algorithms [1]
• In contrast to supervised machine learning algorithms, they:
– are applied when the information used to train is neither classified nor labeled.
– can infer a function to describe a hidden structure from unlabeled data.
– do not figure out the right output, but explore the data and can draw inferences from datasets to describe hidden structures in unlabeled data.
• The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
– Algorithms are left to their own devices to discover and present the interesting structure in the input data.
Unsupervised Machine Learning Algorithms [2]
• Unsupervised learning problems can be further grouped into clustering, association and dimensionality reduction problems:
– Clustering: a clustering problem is where you want to discover the inherent groupings in the data, such as clustering DNA sequences into functional groups.
– Association: an association rule learning problem is where you want to discover rules that describe large portions of your data, such as using association analysis-based techniques for pre-processing protein interaction networks for the task of protein function prediction.
– Dimensionality reduction: often we are working with data of high dimensionality (each observation comes with a high number of measurements); a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out.
• For example, in a gene-expression analysis, dimension reduction can be used to find a list of candidate genes with a more operable length, ideally including all the relevant genes.
• Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power.
Unsupervised Machine Learning Algorithms [3]
• Clustering is an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships.
• Each cluster that arises during the analysis:
– defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, which is why clustering is also sometimes called unsupervised classification.
Unsupervised Machine Learning Algorithms [5]
• Taking the example of the gene-finding model, when a labeled training set is not available, unsupervised learning is required.
• Consider the interpretation of a heterogeneous collection of epigenomic data sets, such as those generated by the Encyclopedia of DNA Elements (ENCODE) Consortium and the Roadmap Epigenomics Project.
Unsupervised Machine Learning Algorithms [6]
• A priori, we expect that the patterns of chromatin accessibility, histone modifications and transcription factor binding along the genome should be able to provide a detailed picture of the biochemical and functional activity of the genome.
– We may also expect that these activities could be accurately summarized using a fairly small set of labels.
• To discover what types of label best explain the data, rather than imposing a pre-determined set of labels on the data, an unsupervised learning method can be applied.
Unsupervised Machine Learning Algorithms [7]
– It will use only the unlabeled data and the desired number of different labels to assign as input, to automatically partition the genome into segments and assign a label to each segment, with the goal of assigning the same label to segments that have similar data.
• The unsupervised approach requires an additional step in which semantics must be manually assigned to each label, but it provides the benefit of enabling training when labeled examples are unavailable and the ability to identify potentially novel types of genomic elements.
Supervised vs unsupervised learning

Supervised | Unsupervised
Input data is labelled | Input data is unlabelled
Uses training dataset | Uses just input dataset
Known number of classes | Unknown number of classes
Guided by expert (labelled data provided) | Self-guided learning (using some criteria)
Goal: predict class or value label | Goal: analyse data, determine data structure/grouping
Classification and regression | Clustering, dimensionality reduction, density estimation
Supervised vs unsupervised learning
https://www.cisco.com/c/m/en_us/network-intelligence/service-provider/digital-transformation/get-to-know-machine-learning.html
Clustering validation
Assessing the quality of a clustering algorithm's results is important to avoid finding random patterns in data.
• Internal cluster validation: uses only information internal to the clustering process, without reference to external information (cluster separability, cluster homogeneity, etc.)
– Clusters should be well-separated, and intra-cluster distances should be small.
– Silhouette coefficient: compares each object's average distance to its own cluster with its average distance to the nearest other cluster.
– Dunn index: ratio of the smallest distance between objects in different clusters to the largest distance between objects in the same cluster (should be maximized).
• External cluster validation: compares results to an externally known result (example: class labels), compares different clustering methods' results, etc.
• Relative cluster validation: evaluates the clustering methods by varying different parameter values for the same algorithm (example: number of clusters).
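The silhouette coefficient can be computed directly from its definition. The sketch below works on 1-D points for brevity (an illustration, not workshop code; the function name and the convention of scoring singleton clusters as 0 are assumptions):

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points with cluster labels.

    For each point: a = mean distance to the other members of its own
    cluster, b = mean distance to the nearest other cluster, and the
    silhouette is (b - a) / max(a, b). Values near +1 indicate compact,
    well-separated clusters."""
    clusters = set(labels)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:                      # singleton cluster: score it 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in clusters if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Comparing the score of a proposed clustering against shuffled labelings is one way to check that the structure found is not a random pattern.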
Neural Networks (Can be supervised or unsupervised)
• Neural networks learn the weights of the connections between neurons. When all weights are trained, the neural network can be used to predict a class or a quantity.
• With neural networks, extremely complex models can be trained, and they can be used as a kind of black box.
• Disadvantages:
– parameterization is extremely difficult in neural networks.
– They are also very resource- and memory-intensive.
• NNs can be joined with the "deep" approach to build models that can pick up previously unpredictable cases.
• They may be applied for classification, predictive modelling and biomarker identification within datasets of high complexity, such as transcript or gene expression data generated from DNA microarray analysis, or peptide/protein level data generated by mass spectrometry.
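To make "learning the weights of connections" concrete, here is the smallest possible network: a single sigmoid neuron trained by gradient descent on a toy 1-D problem. Everything here (names, learning rate, epoch count, data) is an invented illustration, not workshop material:

```python
import math

def train_neuron(xs, ys, lr=0.5, epochs=500):
    """Train a single sigmoid neuron (logistic unit) by gradient
    descent on scalar inputs xs with 0/1 labels ys.
    Returns the learned (weight, bias)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # forward pass
            grad = p - y               # d(log-loss)/d(w*x + b)
            w -= lr * grad * x         # weight update
            b -= lr * grad             # bias update
    return w, b

def neuron_predict(w, b, x):
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0
```

A real network stacks many such units into layers and backpropagates the gradient through them; the parameterization difficulty the slide mentions comes from choices like the learning rate and epoch count hard-coded here.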
Principal Component Analysis (PCA) (Unsupervised)
• PCA provides dimensionality reduction.
• Sometimes you have a wide range of features, probably highly correlated with each other, and models can easily overfit on a huge amount of data. Then PCA can be applied.
• Advantage:
– in addition to the low-dimensional sample representation, it provides a synchronized low-dimensional representation of the variables. The synchronized sample and variable representations provide a way to visually find variables that are characteristic of a group of samples.
• PCA can be used in bioinformatics to:
– analyse gene expression data
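For intuition, the first principal component of 2-D data can be found with power iteration on the covariance matrix. This is a pure-Python teaching sketch (the function name and the fixed start vector are assumptions), limited to two features for brevity:

```python
def first_principal_component(points, iterations=100):
    """Direction of maximum variance for 2-D points: mean-center the
    data, build the 2x2 covariance matrix, then repeatedly multiply a
    start vector by it (power iteration) until it converges to the
    leading eigenvector."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.1)                    # arbitrary non-degenerate start vector
    for _ in range(iterations):
        v = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
        v = (v[0] / norm, v[1] / norm)
    return v
```

Projecting each centered point onto this unit vector gives the one-dimensional sample representation; the vector's own coordinates are the "synchronized" variable loadings the slide refers to.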
K-Means (Unsupervised)
• The goal of k-means is to find groups in the data, with the number of groups represented by the variable K.
• The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
• Advantages: easy to implement, and fast and efficient in terms of computational cost.
• Disadvantages include:
– Initial seeds have a strong impact on the final results.
– The order of the data has an impact on the final results.
– K-means needs to know in advance how many clusters there will be in your data, so this may require a lot of trials to "guess" the best K number of clusters to define.
• Example: popular and simple partition computational models for clustering microarray data
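The assign-then-update loop (Lloyd's algorithm) can be sketched on scalar data in a few lines. This is an illustration, not workshop code; the deterministic evenly-spaced seeding is an assumption made precisely because, as noted above, the initial seeds strongly affect the result:

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm for k-means on scalar data: alternately
    assign each point to its nearest centroid and move each centroid
    to the mean of its assigned points."""
    pts = sorted(points)
    # deterministic seeds: evenly spaced points from the sorted data
    centroids = [pts[i * (len(pts) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    assignments = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
    return centroids, assignments
```

Higher-dimensional k-means is identical except that distances and means are computed per feature.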
Semi-supervised Machine Learning Algorithms [1]
• In supervised learning, the algorithm receives as input a collection of data points, each with an associated label, whereas in unsupervised learning the algorithm receives the data but no labels.
– The semi-supervised setting is a mixture of these two approaches: the algorithm receives a collection of data points, but only a subset of these data points have associated labels.
• So, these methods fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data.
• Systems that use this method are able to considerably improve learning accuracy.
Semi-supervised Machine Learning Algorithms [2]
• Consider the gene-finding model, where the system is provided with labeled data and unlabeled data.
– The learning procedure begins by constructing an initial gene-finding model on the basis of the labeled subset of the training data alone.
– Next, the model is used to scan the genome, and tentative labels are assigned throughout the genome.
– These tentative labels can then be used to improve the learned model, and the procedure iterates until no new genes are found.
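That iterative self-training procedure can be sketched with a deliberately tiny stand-in model, a 1-nearest-neighbour rule on 1-D points. Everything here (names, the distance-based confidence criterion, the toy data) is invented for illustration:

```python
def self_train_1nn(labeled, unlabeled, rounds=10):
    """Self-training sketch: in each round, label the unlabeled point
    the current model is most confident about (here: closest to an
    already-labeled point), add it to the training set, and repeat.

    `labeled` is a list of (x, label) pairs; `unlabeled` a list of x."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        def nearest(x):
            return min(labeled, key=lambda pair: abs(pair[0] - x))
        # confidence = distance to the nearest labeled point (smaller is better)
        best = min(unlabeled, key=lambda x: abs(nearest(x)[0] - x))
        labeled.append((best, nearest(best)[1]))  # tentative label joins training data
        unlabeled.remove(best)
    return labeled
```

Real gene finders replace the 1-NN rule with a sequence model, but the loop structure is the same: tentative labels feed the next round of training.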
Semi-supervised Machine Learning Algorithms [3]
• In practice, gene-finding systems are often trained using a semi-supervised approach, in which the input is a collection of annotated genes and an unlabeled whole-genome sequence.
• The semi-supervised approach can work much better than a fully supervised approach because the model is able to learn from a much larger set of genes (all of the genes in the genome) rather than only the subset of genes that have been identified with high confidence.
Reinforcement Machine Learning Algorithms [1]
• The learning system interacts with the environment by producing actions and discovers errors or rewards.
– The goal is to develop a system (agent) that improves its performance based on interactions with its environment.
• Through its interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward, via an exploratory trial-and-error approach or deliberative planning.
Reinforcement Machine Learning Algorithms [2]
• The idea behind reinforcement learning is that an agent will learn from the environment by interacting with it and receiving rewards for performing actions.
• Learning from interaction with the environment comes from our natural experiences.
– Consider a child in a living room who sees a fireplace and approaches it.
– It's warm, it's positive, the child feels good (positive reward +1) and understands that fire is a positive thing.
– Next he tries to touch the fire and it burns his hand (negative reward -1). He then understands that fire is positive when he is a sufficient distance away, because it produces warmth. But getting too close to it, he will be burned.
Deep Learning Algorithms [1]
• Also known as deep structured learning or hierarchical learning.
• It is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.
• Can perform learning in supervised and/or unsupervised manners.
• Teaches computers to do what comes naturally to humans: learn by example.
– Key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost.
– Used in medical research:
• Cancer researchers are using deep learning to automatically detect cancer cells.
• Teams at UCLA built an advanced microscope that yields a high-dimensional data set used to train a deep learning application to accurately identify cancer cells.
Deep Learning Algorithms [2]
• While deep learning was first theorized in the 1980s, there are two main reasons it has only recently become useful:
– Deep learning requires large amounts of labeled data.
• For example, driverless car development requires millions of images and thousands of hours of video.
– Deep learning requires substantial computing power.
• High-performance GPUs have a parallel architecture that is efficient for deep learning.
• When combined with clusters or cloud computing, this enables development teams to reduce training time for a deep learning network from weeks to hours or less.
Deep Learning Algorithms [3]
• Most deep learning methods use neural network architectures, which is why deep learning models are often referred to as deep neural networks.
• The term "deep" usually refers to the number of hidden layers in the neural network.
– Traditional neural networks only contain 2-3 hidden layers, while deep networks can have as many as 150.
• Deep learning models are trained by using large sets of labeled data and neural network architectures that learn features directly from the data, without the need for manual feature extraction.
Deep Learning Algorithms [4]
• Deep learning is now one of the most active fields in machine learning and has been shown to improve performance in image and speech recognition.
• The potential of deep learning in high-throughput biology is clear:
– it makes it possible to better exploit the availability of increasingly large and high-dimensional data sets (e.g. from DNA sequencing, RNA measurements, flow cytometry or automated microscopy) by training complex networks with multiple layers that capture their internal structure.
Deep Learning Algorithms [5]
• Example:
– Multi-label Deep Learning for Gene Function Annotation in Cancer Pathways [Renchu Guan, Xu Wang, Mary Qu Yang, Yu Zhang, Fengfeng Zhou, Chen Yang & Yanchun Liang, Scientific Reports, volume 8 (2018)]
– Applied deep learning to the analysis of gene multi-functions relevant to cancer pathways, exploring the full texts of biomedical publications, where detailed methodologies, experimental results, critical discussions and interpretations can be found.
• Without the involvement of a biologist to do a feature study of the data.
– Experimental results on eight KEGG cancer pathways revealed that this new system is not only superior to classical multi-label learning models, but can also recover numerous gene functions related to important cancer pathways.
Opportunities for Deep Learning in Genomics
https://towardsdatascience.com/opportunities-and-obstacles-for-deep-learning-in-biology-and-medicine-6ec914fe18c2
Applications of ML in Bioinformatics
From: Machine learning in bioinformatics. Brief Bioinform. 2006;7(1):86-112. doi:10.1093/bib/bbk007
Problem at hand | Data | Method/s
Identification of biomarkers | Proteomics datasets, transcriptomics datasets | BioHEL, a rule-based learning method (Swan et al, 2015)
Cancer classification from microarray gene expression data | Transcriptomics data | J4.8 decision tree from Weka, Naïve Bayes, SVM (Peng et al, 2007)
Inference of demographic history and recombination rates in population genetics | Population genomic datasets | Artificial Neural Network (ANN) (Blum et al, 2010)
Showing how relationships among individuals sampled from Europe largely mirrored geography | Population genetics | PCA (Novembre et al, 2008)
Uncover differences in evolutionary rates along a chromosome | Phylogenetic data | Hidden Markov Model (HMM) (Schrider et al, 2018)
Quantify the ability of TF-binding signals to statistically predict the expression levels of promoters | Cell-line-specific TF binding data | Random Forest, Support Vector Regression (SVR), multivariate adaptive regression splines (MARS)
Genomic selection in breeding wheat for rust resistance | Genomic selection data | Reproducing kernel Hilbert space, Bayesian LASSO, random forest regression, support vector classification (González-Camacho et al, 2018)
Assigning each read from high-throughput sequencing of the DNA present in the sample to a taxonomic clade | Metagenomics data | Compositional approaches based on the generative NB classifier and based on Kraken (Vervier et al, 2016)
Some examples of tools used for ML in bioinformatics
Is there a perfect ML technique?
• There is no single solution (no one machine learning algorithm) or approach that fits all problems.
• For each problem, there is not one single solution either.
Which technique to use?
• Size, quality and nature of the data to be analysed.
• The question, the answer expected, and also the expected accuracy.
• How the result will be used.
• Time and computing resources available.
• It is always good to check the performance of different algorithms and compare results.
What kind of data do you have?
• If the data to be analysed is labelled, it is a supervised learning problem.
• However, even when labels are available, it is not always the case that taking a supervised approach is a good idea (size and quality of training and test sets).
• In general, supervised learning should be employed only when the training set and test set are expected to exhibit similar statistical properties.
What kind of data do you have?
• If the data to be analysed is unlabelled and the aim is to find structure, it is an unsupervised learning problem.
• If the aim is to optimize an objective function by interacting with an environment, it is a reinforcement learning problem.
• When supervised learning is feasible, it is often the case that additional, unlabelled data points are easy to obtain.
• How do you decide between a supervised and a semi-supervised approach?
• A good rule of thumb is to use semi-supervised learning if you do not have very much labelled data but you have a very large amount of unlabelled data.
What is the expected output?
• If the output of your model is a number, it is a regression problem.
– Two-class classification of gene expression data
• If the output of your model is a class, it is a classification problem.
– Genomic classification of AML
• If the output of your model is a set of input groups, it is a clustering problem.
– Patterns in gene expression at different developmental stages of zebrafish
Tools
• All the methods listed above are already available in Python, R (https://www.r-project.org/about.html) or Matlab using existing packages. Some basic code needs to be written.
• If you are not used to writing code, you may use a tool like WEKA (https://www.cs.waikato.ac.nz/ml/weka/) or RapidMiner (https://rapidminer.com/): the methods are already implemented and you simply need to load your data in csv, arff, … format and run the selected methods.
• Some useful R packages implementing many ML techniques: https://cran.r-project.org/web/views/MachineLearning.html
Some Online Resources
• https://machinelearningmastery.com/start-here/
• https://www.datascience.com/blog
• https://www.mathworks.com/discovery/machine-learning.html
• https://www.coursera.org/browse/data-science
Sources
• http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#data-preparation
• https://medium.mybridge.co/30-amazing-machine-learning-projects-for-the-past-year-v-2018-b853b8621ac7
• Shakuntala Baichoo and Zahra Mungloo slides (H3ABionet, ML group)