knowledge discovery - ist department at ritrpv/local/syllabi/discovery/knowledgediscovery1.pdfthe...

Post on 24-May-2018


Knowledge Discovery

Our goal

From data ... to information ... to knowledge ... to understanding (wisdom)

Why do we need Knowledge Discovery?

•  Data explosion: web usage, automated data collection tools, mature database technology
•  Too much data and too little knowledge
•  Humans not able to sift through the data effectively
•  Computational approaches to data analysis are required for the continually increasing, accumulated data

Potential Applications

•  Market analysis, customer relationship management
•  Risk analysis and management
•  Fraud detection
•  Text mining: newsgroups, email, documents
•  Web mining of logs, data streams for customization, advertising, marketing
•  Biology and Medicine - many types of high-throughput data for diagnostics, predictive and personalized medicine


Even Better: Consult the Domain Expert(s)

The Process

•  Guided Discovery
    – PBL
    – Knowledge Discovery
    – Learn through examples and practice
•  Same general approach may be applied to many different problem domains
•  Select appropriate methods to customize approach
•  No one right answer!

Running Example of KD

•  Gene Expression Data
•  Why a good example?
    – Biotechnology advances created huge influx of data
    – Biologists not equipped to analyze the data
    – Computational scientists didn't understand the biology
    – KDD process sorely needed
    – Has significantly advanced over the last 10 years

Papers

•  Data preprocessing and transformation
    – Quackenbush
•  Need for standards
    – MAGE-ML
    – www.mged.org
•  Mining large data sets for patterns
    – Molecular Classification of Cancer
    – Golub et al.

A Typical Scenario

•  A biologist designs and runs an experiment and delivers samples (along with $$) to the Functional Genomics lab for high-throughput gene expression analysis. A couple of weeks later the biologist picks up a CD with multiple files containing the raw data and some preprocessed data… Not knowing how to analyze the data, the biologist calls in your help…
•  Where do we start?
    – Understand the domain and the problems

High Throughput Systems for Studying Global Gene Expression are Complex

•  Need to learn about and consider:
    – the biology behind the experiments & the interpretation of the experiments
    – how the data is acquired (biotechnology)
    – the data issues

Biology Basics: The Flow of Information

A gene is expressed in 2 steps:
    – DNA is transcribed into RNA (mRNA)
    – RNA is translated into protein

Genotype to Phenotype

•  Individual cells in an organism have the same genes (DNA)
    – the genotype

but…. not all genes are active (expressed) in each cell

•  It is the expression of thousands of genes and their products (RNA, proteins), functioning in a complicated and orchestrated way, that makes a specific cell what it is.
    – the phenotype

Gene Expression Depends on Context

•  The subsets of genes that are expressed (RNA/protein) will differ among cells, tissues, organs, conditions…
    – the subset expressed confers unique properties to the cell

[Figure: example cell types: muscle, neuron, liver]

Differential Gene Expression

•  The level of expression of genes also differs with the cellular context
•  i.e. the amount of a given RNA will vary
•  We can think of gene expression (in higher organisms) as having both an “on/off” switch and a “volume” control

What Biologists Want to Know: Specific Patterns of Gene Expression

•  Tissue/Cell type-specific
    – e.g. skin cell vs. brain cell
    – e.g. keratinocyte vs. melanocyte
•  Developmental stage
    – e.g. embryonic skin cell vs. adult skin cell
•  Disease state
    – e.g. normal skin cell vs. skin tumor cell
•  Environment-specific (drugs, toxins)
    – e.g. skin cell untreated vs. treated

But also, the more difficult problem: Gene Networks

•  Genes and their products are related through their roles in:
    – metabolic pathways
    – cell signalling networks

Metabolic Pathway

[Figure: metabolic pathway map, from the KEGG Database]

Cell Signalling Networks

[Figure: cell signalling network, from www.mpi-dortmund.mpg.de/departments/dep1/signaltransduktion/image3.gif]

What can we learn by studying global patterns of gene expression?

•  Individual gene expression patterns
•  Classifications: for diagnosis, prediction…
    – Groups of Genes
    – Molecular taxonomy of disease
•  Gene Networks/Pathways:
    – Reconstruction of metabolic & regulatory pathways

Now that we have some understanding of the domain and goals…

•  What about the data?
    – How are the data generated?
    – Data type?
    – Data quality?
    – Need for data cleaning and preprocessing?

Knowledge Discovery Process: Consult the Domain Expert(s)

GeneChip® Oligonucleotide Array

High-throughput gene expression analysis

Recall that DNA and RNA are composed of strings of nucleotides

•  A gene of interest will have a specific nucleotide sequence
•  DNA and RNA sequences can form bonds with complementary bases on another string - called base-pairing.
•  When we do this experimentally we call it hybridization, and we can detect it by labeling one of the strings (aka strands)

GeneChip® Expression Analysis

[Figure: Hybridization and Staining. Labels: Array; cRNA Target; Hybridized Array; Streptavidin-phycoerythrin conjugate. Courtesy of M. Hessner, CAAGED Workshop]

How do Affymetrix microarrays work?

•  12-20 probes are picked to “interrogate” a gene; the idea is to get multiple measurements. Each probe is a 25-mer oligonucleotide that binds to a gene
•  The collection of probes designed to hybridize to the same gene is called a “probe set”; there may be tens of thousands of these probe sets on a given chip
•  Probe sets have identification names called “Affymetrix IDs”, which look like “10329_g_at”, etc. On any GeneChip, some probe sets are dedicated to “Quality Control”; these begin with “AFFX_”
•  Take-home message: have to learn a lot of terminology
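The naming convention above can be applied mechanically to separate quality-control probe sets from gene probe sets. A minimal Python sketch, assuming probe set IDs are available as plain strings; the function names are hypothetical, and accepting an "AFFX-" spelling alongside the slide's "AFFX_" is an assumption about how some exports write the prefix.

```python
# Prefixes that mark quality-control probe sets.
# "AFFX_" comes from the slide; "AFFX-" is an assumed alternative spelling.
CONTROL_PREFIXES = ("AFFX_", "AFFX-")


def is_control_probeset(probeset_id: str) -> bool:
    """Return True if the ID looks like a quality-control probe set."""
    return probeset_id.upper().startswith(CONTROL_PREFIXES)


def split_probesets(ids):
    """Split a list of probe set IDs into (gene probe sets, control probe sets)."""
    controls = [p for p in ids if is_control_probeset(p)]
    genes = [p for p in ids if not is_control_probeset(p)]
    return genes, controls


ids = ["10329_g_at", "AFFX_CreX_at", "AFFX_BioB_at"]
genes, controls = split_probesets(ids)
print("gene probe sets:", genes)
print("control probe sets:", controls)
```

Keeping the controls in a separate list, rather than discarding them, leaves them available for quality checks later.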

Affymetrix Chips

[Figure: Affymetrix chip with 300,000 “Probes”; Perfect Match and Mismatch; Average Difference Values. Courtesy of J. Glasner, CAAGED Workshop]

Affymetrix Analysis

•  The high-resolution image of the scanned microarray is stored as a DAT file
•  Since the probes are laid out in a grid, and each probe position is determined by its X-Y coordinates, one can compute the PM and MM probe intensities from the pixelated image
•  The CDF (chip description file) library file contains the X-Y layout of every probe

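Affymetrix Data Flow

[Figure: data flow diagram, reconstructed from its labels]

Hybridized GeneChip → Scan Chip → DAT file → Process Image (GCOS) → CEL file
CEL file + CDF file → MAS5 (GCOS) → CHP file → TXT file
Other files shown: RPT file, EXP file

GeneChip Operating Software (GCOS) - Affymetrix: http://www.affymetrix.com/products/software/specific/gcos.affx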

Affymetrix File Types

•  DAT file:
    – Raw (TIFF) optical image of the hybridized chip
•  CDF file (Chip Description File):
    – Provided by Affy, describes layout of chip
•  CEL file:
    – Processed DAT file (intensity/position values)
    – http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/cel.html
•  CHP file:
    – The “CHP” file contains summarized gene expression scores after probe cells are analyzed;
    – format is:

        Gene            Avg.D   Presence
        AFFX_CreX_at    48      A
        AFFX_BioB_at    149     P

•  TXT file:
    – Probe set expression values with annotation (CHP file in text format)
•  RPT file:
    – Generated by Affy software, report of QC info
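The three-column layout shown above (Gene, Avg.D, Presence) is simple to read programmatically. A minimal sketch, assuming a whitespace-separated text export with exactly those three columns; `parse_chp_text` and `ProbeSetCall` are hypothetical names, and reading "P" and "A" as present and absent calls is an assumption based on the two example rows. It targets the TXT form described above, not the CHP file itself.

```python
from typing import NamedTuple


class ProbeSetCall(NamedTuple):
    gene: str        # probe set ID, e.g. "AFFX_BioB_at"
    avg_diff: float  # summarized expression score (Avg.D column)
    presence: str    # detection call; assumed "P" = present, "A" = absent


def parse_chp_text(text: str) -> list:
    """Parse 'Gene Avg.D Presence' rows, skipping the header and malformed lines."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "Gene":
            continue
        gene, avg_diff, presence = fields
        rows.append(ProbeSetCall(gene, float(avg_diff), presence))
    return rows


sample = """Gene Avg.D Presence
AFFX_CreX_at 48 A
AFFX_BioB_at 149 P
"""
calls = parse_chp_text(sample)
present = [c.gene for c in calls if c.presence == "P"]
print(present)
```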

Knowledge Discovery Process: Consult the Domain Expert(s)

Data Quality

•  Most data mining techniques can tolerate some level of imperfection in the data, but improving data quality can improve the quality of analyses
•  Main issues
    – Noise
    – Outliers
    – Missing values
    – Duplicate data
    – Inconsistent data

There are Many Problems Facing Expression Analysis on the Biotech Side

•  Standardization & quality control in the experiments (affects data quality at many levels)
•  Cost

Problem in reproducibility of experimental data

•  Lots of variation in arrays
    – more than 100 experimental steps
•  Sources of variation
    – biological variability in each RNA extract
    – each labeling reaction is different
    – each slide is a separate hybridization
    – spots on the slide are variable across slides (and within slides when double spotted)
    – each “color” is scanned separately
•  Need Replicates and Statistics!

Outcome

•  “Noisy” data
•  Data preprocessing is necessary
    – normalization
    – scaling
•  Heavy reliance on statistics today

What do the spots (intensity measurements) represent?

•  Fluorescence intensity is a measure of the relative abundance of individual mRNAs (expressed genes) in given samples
    – e.g. experimental relative to control
•  But, gene expression experiments are run on “multiple samples”. Why?
•  We are trying to understand a dynamic process - each sample only represents a “snapshot”
    – Compare among samples (different arrays)
    – Compare across a time-course of related samples

How can we use the data?

•  For microarrays we can only really depend on between-sample fold change, not absolute values or within-sample comparisons (in general, a fold change of more than 1.3-2.0)
•  Take-home message: Have to be careful when comparing between arrays and from experiment to experiment
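A between-sample fold change can be computed and thresholded in a few lines. This is a minimal sketch, assuming positive intensities for the same gene in two samples; the function names are hypothetical, and the default cutoff of 2.0 is just one value from the 1.3-2.0 range quoted above. The comparison is done on the log2 scale so that increases and decreases are treated symmetrically.

```python
import math


def fold_change(treated: float, control: float) -> float:
    """Ratio treated/control for one gene; both intensities must be positive."""
    if treated <= 0 or control <= 0:
        raise ValueError("intensities must be positive")
    return treated / control


def is_changed(treated: float, control: float, threshold: float = 2.0) -> bool:
    """True if expression differs by at least `threshold`-fold, up or down."""
    log_ratio = math.log2(fold_change(treated, control))
    return abs(log_ratio) >= math.log2(threshold)


print(fold_change(400.0, 100.0), is_changed(400.0, 100.0))  # 4-fold increase
print(fold_change(100.0, 400.0), is_changed(100.0, 400.0))  # 4-fold decrease
print(fold_change(120.0, 100.0), is_changed(120.0, 100.0))  # small change
```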

Pre-processing

•  Gene filtering
    – control genes
    – uninformative genes
•  Normalization and scaling
    – allows comparisons across arrays
    – scaling to control dynamic range
•  Transformation
    – logarithmic transformation for improved statistical properties
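The three steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the procedure used for the slides: the data layout (a dict from gene ID to one intensity per array), the function name `preprocess`, the "AFFX_" control prefix, and scaling every array to a common median of 100 are all choices made for the example. Filtering of uninformative genes is omitted.

```python
import math
from statistics import median


def preprocess(matrix, target_median=100.0, control_prefix="AFFX_"):
    """matrix: {gene_id: [intensity per array]}, all intensities positive."""
    # 1. Gene filtering: drop control probe sets.
    kept = {g: v for g, v in matrix.items() if not g.startswith(control_prefix)}
    n_arrays = len(next(iter(kept.values())))
    # 2. Scaling: give every array (column) the same median intensity,
    #    so that values can be compared across arrays.
    factors = [
        target_median / median(v[j] for v in kept.values())
        for j in range(n_arrays)
    ]
    # 3. Transformation: log2 of the scaled values.
    return {
        g: [math.log2(x * f) for x, f in zip(v, factors)]
        for g, v in kept.items()
    }


raw = {
    "AFFX_BioB_at": [149.0, 300.0],
    "g1": [50.0, 100.0],
    "g2": [100.0, 200.0],
    "g3": [200.0, 400.0],
}
result = preprocess(raw)
for gene, values in result.items():
    print(gene, [round(v, 2) for v in values])
```

In the toy data the second array is uniformly twice as bright as the first, so after scaling each gene has the same value on both arrays.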

Normalization

[Figure: plot with axes Cy3 signal (log2) and Cy5 signal (log2)]

Take-home Message

•  Important to remember that once preprocessing, normalization, and transformation of the data have occurred, all downstream mining will be affected.

Data Representation

•  Flat file
•  Vector data
•  Sparse matrix (text) data
•  Sequence data (e.g. web or genomic)
•  Time series
•  Image data
•  Spatio-temporal

Three levels of microarray gene expression data processing

Brazma et al., Nature Genetics, 29:365-371, 2001

Outcomes of Microarray Analysis

Large, complex data sets of high dimensionality
    – example of a routine study: 50,000 “genes” from 20 samples - approx. 1-2 × 10^6 pieces of data

Challenges for Bioinformatics
•  annotation, storage, retrieval, sharing of data
•  information from the data

Knowledge Discovery Process: Consult the Domain Expert(s)

State of Microarray Data

•  Wide availability of technology has given rise to a large number of distributed databases
•  Data scattered among many independent sites (accessible via Internet) or not publicly available at all
•  Need for standardization!

MGED Group and Standardization Issues

•  Microarray Gene Expression Database (MGED) Group: www.mged.org
•  MGED is taking on the challenge of standardization
•  Four major projects

MGED Projects

•  MIAME - The formulation of the minimum information about a microarray experiment required to interpret and verify the results.
•  MAGE - The establishment of a data exchange format (MAGE-ML) and object model (MAGE-OM) for microarray experiments.
•  Ontologies - The development of ontologies for microarray experiment description and biological material (biomaterial) annotation in particular.
•  Normalization - The development of recommendations regarding experimental controls and data normalization methods.

MAGE-ML

•  the XML representation of the MAGE-OM
•  the DTD (document type definition) is what is specified in MAGE-ML
    – rules or declarations
    – what tags can be used
    – what tags contain
•  MAGE-OM
    – http://www.mged.org/Workgroups/MAGE/mage-om.html
    – mapping of microarray experimental workflow to the OM
•  DTD
    – http://www.omg.org/docs/dtc/03-05-03.dtd
•  MAGE-STK software toolkit
    – defines an API to MAGE-OM
    – in Java, Perl, C++
•  Used to
    – export data to MAGE-ML
    – store data in a relational database
    – input data to analysis tools
•  Reader: MAGE-ML docs into objects
•  Writer: objects into MAGE-ML

Knowledge Discovery Process: Consult the Domain Expert(s)

Data Mining Techniques

•  Exploratory data analysis
•  Descriptive modeling
•  Predictive modeling
•  Pattern discovery
•  others

Exploratory Data Analysis

•  Interactive and visual
•  Insight and feel for the data in a broad sense
    – Provide summaries
        •  e.g. max/min, mean/median, variance etc.
    – Visualization
        •  Histograms, scatter plots
•  Useful for data validation or verification
•  Simple exploratory data analysis is invaluable
    – Always get a cursory view of the data before applying data mining algorithms
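A cursory view of the kind described above needs nothing beyond the standard library. The sketch below is an illustration only: `summarize` and `histogram_counts` are hypothetical helper names, the population variance is an arbitrary choice, and the sample values are the six expression values of one gene used purely as toy input.

```python
from statistics import mean, median, pvariance


def summarize(values):
    """Basic summaries: min/max, mean/median, (population) variance."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "variance": pvariance(values),
    }


def histogram_counts(values, bins=3):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # top edge goes in last bin
        counts[index] += 1
    return counts


values = [1, 8, 12, 16, 12, 8]
summary = summarize(values)
counts = histogram_counts(values, bins=3)
print(summary)
print(counts)
```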

Pattern Discovery

•  Discover interesting local patterns in data rather than characterizing data globally
•  Market basket data
    – Discover that if customers buy wine and bread, they buy cheese with a 0.9 probability
    – Known as association rules
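The probability attached to a rule such as "wine and bread → cheese" is usually called its confidence. A minimal sketch of how it is computed, assuming each transaction is a set of item names; the baskets are invented toy data (so the result is 0.75 rather than the 0.9 quoted above) and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Toy baskets, invented for illustration.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "bread", "cheese", "milk"},
]
rule_confidence = confidence(baskets, {"wine", "bread"}, {"cheese"})
print(round(rule_confidence, 2))
```

Four of the five baskets contain wine and bread, and three of those four also contain cheese.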

Descriptive Modeling

•  Build model for underlying process
    – Simulate the data if needed
•  Cluster analysis to find natural groups in the data
•  Bayesian network to find dependency models among variables

Predictive Modeling

•  Predict a variable Y, given a p-dimensional vector X
    – Classification: Y is categorical
    – Regression: Y is real-valued
•  Much like function approximation
    – Learning the relationship between Y and X
•  Statistics and machine learning have many algorithms for predictive modeling
    – Emphasis is often on predictive accuracy rather than understanding the model itself.

Mining of Expression Data

Recall that:
•  A gene expression pattern derived from a single microarray is simply a snapshot (one experimental sample vs reference)
•  Usually want to understand a process or changes in expression over a collection of samples
    – gene expression profile

Working with Gene Expression Data

•  Hypothesis-driven approaches
    – Typically model-oriented
    – Descriptive statistics relying on prior knowledge and good design
•  Discovery-based
    – Few, if any, a priori hypotheses
    – Data-driven and algorithm-oriented
    – Statistical algorithms
    – Machine learning using heuristic techniques

Testing Hypotheses

•  Based on prior biological knowledge
•  Simplest
    – look for individual differentially expressed genes
    – fold changes
•  Scatter plot
•  Statistical measures

Scatter plot

Some simple statistics

•  If we are looking at samples that seem to belong to two groups or conditions:
    – the t-test compares the means of two groups while accounting for the standard error of the difference of the means
•  ANOVA if we want to extend the analysis to more than two groups
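The t statistic described above is the difference of the two group means divided by the standard error of that difference. A minimal sketch using only the standard library; it uses Welch's form (separate variances per group), which is a choice made for this example rather than something the slides specify, and `welch_t` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """t = (mean_a - mean_b) / standard error of the difference of the means.

    Uses the sample variance of each group separately (Welch's form).
    Each group needs at least two values.
    """
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Toy log-expression values for one gene in two conditions (invented).
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_value = welch_t(control, treated)
print(round(t_value, 3))
```

Turning the statistic into a p-value requires the t distribution; where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.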

But, gene chips allow us to measure thousands of genes...

•  Across multiple samples

Goal of Analysis of Expression Matrix

•  Some statistical methods applied to:
    1.  “Group” similar genes together => groups of functionally similar genes.
    2.  “Group” similar cell samples together.
    3.  “Extract” representative genes in each group.

Typical approach

•  Look for patterns
    – compare rows to find evidence for co-regulation of genes
    – compare columns to find evidence for relatedness among samples

1) Choose a measure of similarity (distance) among the objects being compared - each row or column is considered a vector in space
2) Then, group together objects (genes or samples) with similar properties - this is a multidimensional analysis

An experiment

•  12 Genes
•  Expression values at 0, 2, 4, 6, 8 and 10 hours

Table 4.2 of Campbell/Heyer

Name   0 hrs   2 hrs   4 hrs   6 hrs   8 hrs   10 hrs
C      1       8       12      16      12      8
D      1       3       4       4       3       2
E      1       4       8       8       8       8
F      1       1       1       .25     .25     .1
G      1       2       3       4       3       2
H      1       .5      .33     .25     .33     .5
I      1       4       8       4       1       .5
J      1       2       1       2       1       2
K      1       1       1       1       3       3
L      1       2       3       4       3       2
M      1       .33     .25     .25     .33     .5
N      1       .125    .0833   .0625   .0833   .125

Take logs

Name   0 hrs   2 hrs   4 hrs   6 hrs   8 hrs   10 hrs
C      0       3.0     3.58    4.0     3.58    3.0
D      0       1.58    2.0     2.0     1.58    1.0
E      0       2.0     3.0     3.0     3.0     3.0
F      0       0       0       -2.0    -2.0    -3.32
G      0       1.0     1.58    2.0     1.58    1.0
H      0       -1.0    -1.6    -2.0    -1.6    -1.0
I      0       2.0     3.0     2.0     0       -1.0
J      0       1.0     0       1.0     0       1.0
K      0       0       0       0       1.58    1.58
L      0       1.0     1.58    2.0     1.58    1.0
M      0       -1.6    -2.0    -2.0    -1.6    -1.0
N      0       -3.0    -3.59   -4.0    -3.59   -3.0

•  Compare
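The logged values above are consistent with base-2 logarithms of the expression ratios (for example, a ratio of 8 becomes 3.0). A minimal sketch that reproduces two of the rows, assuming base 2 and rounding to two decimals; `log2_row` is a hypothetical helper name.

```python
import math


def log2_row(row, digits=2):
    """Base-2 logarithm of each expression ratio, rounded for display."""
    return [round(math.log2(x), digits) for x in row]


# Rows C and F of the expression table (ratios relative to 0 hrs).
ratios = {
    "C": [1, 8, 12, 16, 12, 8],
    "F": [1, 1, 1, .25, .25, .1],
}
logged = {name: log2_row(row) for name, row in ratios.items()}
for name, row in logged.items():
    print(name, row)
```

On the log scale a doubling (+1) and a halving (-1) are the same distance from zero, which is why induced and repressed genes become easier to compare.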

How Similar are two Rows?

•  How similar are the expressions of two genes?
•  First we'll normalize each row
•  Calculate the mean and standard deviation for each gene
•  Normalize each value by subtracting the mean and dividing by the standard deviation.

How Similar are two Rows?

•  Calculate the Pearson Correlation between pairs of rows
•  Correlation quantifies the extent to which the expression patterns of two genes go up or down together, regardless of their magnitudes.
•  Calculated by taking the dot product of the two normalized vectors, scaled by the number of values

> (pc '( 1 2 3 4 3 2 )    ; row G
      '( 1 2 3 4 3 2 ))   ; row L
1.0
> (pc '( 1 2 3 4 3 2 )    ; row G
      '( 1 3 4 4 3 2 ))   ; row D
0.8971499589146109
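The implementation of `pc` is not shown on the slides, so the following Python version is an assumed equivalent rather than the original code: it normalizes each row (subtract the mean, divide by the population standard deviation) and then averages the products of the normalized values. The function name `pc` mirrors the calls above; `normalize` is a hypothetical helper.

```python
from math import sqrt


def normalize(row):
    """Z-score a row: subtract its mean, divide by its population standard deviation."""
    n = len(row)
    mu = sum(row) / n
    sd = sqrt(sum((x - mu) ** 2 for x in row) / n)
    return [(x - mu) / sd for x in row]  # fails for a constant row (sd == 0)


def pc(a, b):
    """Pearson correlation: dot product of the normalized rows divided by N."""
    za, zb = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


G = [1, 2, 3, 4, 3, 2]
L = [1, 2, 3, 4, 3, 2]
D = [1, 3, 4, 4, 3, 2]
H = [1, .5, .33, .25, .33, .5]
M = [1, .33, .25, .25, .33, .5]

print(round(pc(G, L), 3))
print(round(pc(G, D), 3))
print(round(pc(D, M), 3))
print(round(pc(G, H), 3))
```

Because each row is normalized first, multiplying a whole row by a constant does not change the result; only the shape of the profile matters.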

Some other pairs (rows from Table 4.2 of Campbell/Heyer)

> (pc '( 1 3 4 4 3 2)            ; row D
      '( 1 .33 .25 .25 .33 .5))  ; row M
-0.9260278787295065
> (pc '( 1 2 3 4 3 2)            ; row G
      '( 1 .5 .33 .25 .33 .5))   ; row H
-0.9090853650855358

Pearson Correlation

•  pc(G,L) = 1 -- identically expressed genes
•  pc(G,D) = .897 -- similarly expressed genes
•  pc(D,M) = -.926 -- reciprocally expressed
•  pc(G,H) = -.909 -- also reciprocally expressed

Descriptive and Predictive Modeling

•  Clustering
•  Feature extraction/selection
•  Classification - discrimination analysis

Analytic Approaches

•  Clustering: identification of associations between data points; organization of data into groups
•  Unsupervised clustering: genes clustered by similarity/correlation, or other criteria based on X-values - no useful external information about the Y-variables (the response) is used → doesn't reveal groups of genes with special interest for tissue discrimination
•  Supervised methods: grouping of variables (genes), controlled by information about the X and Y variables → supervised algorithms try to find gene clusters whose average expression profile has great potential for explaining the response Y, i.e. for tissue discrimination
•  Unsupervised clustering algorithms
    – Hierarchical
    – K-means
    – Self-organizing maps
    – Others

Gene Expression Matrix & Hierarchical Clustering

[Figure: gene expression matrix with axes labeled "genes" and "samples", shown with hierarchical clustering. Eisen et al., http://www.pnas.org/cgi/content/full/95/25/14863]

Theory

•  Hierarchical clustering works by sequentially joining the two nearest clusters, then the next two closest clusters, and so on, so that the nearest clusters are joined first and the farthest clusters last.
•  Initially each individual data point is its own cluster

Hierarchical Clustering Algorithm

•  Given a set of N items to be clustered, and an N*N distance (or similarity) matrix.

1.  Start by assigning each item to a cluster, so that if you have N items, you will now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be defined as the same as the distances (similarities) between the items they contain.
2.  Find the closest (most similar) pair of clusters and merge them into a single cluster. You now have one cluster less.
3.  Compute distances (similarities) between the new cluster and each of the old clusters.
4.  Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
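The four steps above can be sketched directly in Python. This is a deliberately naive illustration, not an efficient or reference implementation: the function name `hierarchical_cluster`, the merge-history return format, and the toy one-dimensional data are assumptions. Cluster distances are recomputed from the item distances each round, with `linkage=min` giving the shortest member-to-member distance and `linkage=max` the longest.

```python
from itertools import combinations


def hierarchical_cluster(dist, linkage=min):
    """Agglomerative clustering on a symmetric N x N distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of item indices.
    """
    # Step 1: every item starts in its own cluster.
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    # Step 4: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 3: distance between two clusters, derived from item distances.
        def cluster_distance(a, b):
            return linkage(dist[i][j] for i in a for j in b)

        # Step 2: find the closest pair of clusters and merge it.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy data: five points on a line; the distance is the absolute difference.
points = [0, 1, 5, 6, 20]
dist = [[abs(p - q) for q in points] for p in points]

merges = hierarchical_cluster(dist)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The merge history is the information a dendrogram displays: which clusters were joined, and at what distance.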

Hierarchical in action

Variations of Hierarchical Algorithm

•  Step 3 (computing distances between the new cluster and each of the old clusters) can be done in several different ways: single linkage, average linkage and complete linkage.
•  In single linkage the distance between clusters is equal to the shortest distance from any one member of one cluster to any one member of the other cluster.
•  In average linkage the distance between two clusters is defined as the average distance from any member of one cluster to any member of the other cluster.
•  In complete linkage the distance between clusters is defined as the maximum distance from any one member of the first cluster to any one member of the second cluster.
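The three linkage rules differ only in how they summarize the member-to-member distances between two clusters. A minimal sketch, assuming clusters are lists of numbers and using the absolute difference as the item distance; all function names and the toy clusters are hypothetical.

```python
from statistics import mean


def pair_distances(cluster_a, cluster_b, dist):
    """All distances between a member of one cluster and a member of the other."""
    return [dist(a, b) for a in cluster_a for b in cluster_b]


def single_linkage(cluster_a, cluster_b, dist):
    return min(pair_distances(cluster_a, cluster_b, dist))   # shortest distance


def average_linkage(cluster_a, cluster_b, dist):
    return mean(pair_distances(cluster_a, cluster_b, dist))  # average distance


def complete_linkage(cluster_a, cluster_b, dist):
    return max(pair_distances(cluster_a, cluster_b, dist))   # maximum distance


def absolute_difference(a, b):
    return abs(a - b)


first = [1.0, 2.0]
second = [5.0, 9.0]
print(single_linkage(first, second, absolute_difference))
print(average_linkage(first, second, absolute_difference))
print(complete_linkage(first, second, absolute_difference))
```

The four pairwise distances here are 4, 8, 3 and 7, so the same two clusters are 3, 5.5 or 8 apart depending on the rule, which is why the choice of linkage can change the resulting tree.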

Variations of Hierarchical Algorithm

•  Self Organizing Tree Algorithm
    – Unsupervised neural network with a binary tree topology
    – Combination of SOM and hierarchical clustering
    – Runtime is approximately linear
        •  Faster than normal hierarchical method
    – Uses divisive method
        •  In comparison to bottom-up method of hierarchical

Advantages

•  Hierarchical clustering results in a visual representation that is convenient for humans to analyze
•  Unlike k-means and SOM, does not require the number of clusters to be chosen a priori

Why cluster analysis may not be “the” answer

•  Clustering methods typically require user inputs
    – Example: distance measure
•  Clustering methods differ in the way that the number of clusters is specified.
•  Clustering methods are often sensitive to the initialization condition (starting guess)
•  Local vs. global sampling of clustering space

Cluster Analysis Challenges

•  “Noise” in the data itself
•  Large data sets
    – most of the techniques currently used were not developed for multidimensional data
•  What about networks?
    – limitation of cluster analysis: similarity in expression pattern suggests co-regulation but doesn't reveal cause-effect relationships

Feature Selection & Classification

•  First, identify features (genes) that discriminate between classes
•  Then use features for classification
    – machine learning approach
    – supervised analysis
    – assignment of a new sample to a previously specified class, based on sample features and a trained classifier

“Classic” Example: Classification of AML vs. ALL

•  Comparing 2 acute leukemias
    – acute myeloid leukemia (AML)
    – acute lymphoid leukemia (ALL)
•  Biological/Clinical Problems:
    – previously, no single reliable test to distinguish them
    – differ greatly in clinical course & response to treatments

Golub et al., Science Oct 15 1999: 531-537

Study Design

The prediction of a new sample is based on 'weighted votes' of a set of informative genes
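One way to picture a weighted-vote predictor is sketched below. It assumes the scheme usually attributed to Golub et al.: each informative gene gets a signal-to-noise weight (difference of class means over the sum of class standard deviations), votes according to how far the new sample lies from the midpoint of the two class means, and the prediction strength PS compares the winning and losing vote totals. The function names, the class labels, the use of the sample standard deviation, the 0.3 cutoff as a default, and the toy data are all assumptions for illustration.

```python
from statistics import mean, stdev


def train(class1, class2):
    """class1, class2: lists of samples; each sample lists expression values
    for the same informative genes. Returns one (weight, boundary) per gene."""
    model = []
    for g in range(len(class1[0])):
        x1 = [sample[g] for sample in class1]
        x2 = [sample[g] for sample in class2]
        m1, m2 = mean(x1), mean(x2)
        weight = (m1 - m2) / (stdev(x1) + stdev(x2))  # signal-to-noise weight
        boundary = (m1 + m2) / 2                      # midpoint of class means
        model.append((weight, boundary))
    return model


def predict(model, sample, ps_threshold=0.3):
    """Return (label, prediction strength); low-strength calls are 'uncertain'."""
    votes = [w * (x - b) for (w, b), x in zip(model, sample)]
    v1 = sum(v for v in votes if v > 0)    # total vote for class 1
    v2 = -sum(v for v in votes if v < 0)   # total vote for class 2
    total = v1 + v2
    ps = abs(v1 - v2) / total if total else 0.0
    if ps < ps_threshold:
        return "uncertain", ps
    return ("class1" if v1 > v2 else "class2"), ps


# Toy training data: two genes, three samples per class (invented values).
class1 = [[10, 2], [12, 3], [11, 1]]
class2 = [[2, 9], [3, 11], [1, 10]]
model = train(class1, class2)

print(predict(model, [10, 3]))
print(predict(model, [2, 10]))
print(predict(model, [8, 7.5]))
```

In the third call the two genes vote in opposite directions with nearly equal weight, so the prediction strength is low and the sample is left unassigned.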

Results of the study

1) Clustering of microarray data using tumors of known type
    – found 1100 of 6817 genes correlated with class distinction
2) Formation of a class predictor = 50 most informative genes used as a training set
    – classification of unknown tumors

Results

How to test the validity of class predictors?

•  Cross-validation tests: The 50-gene predictor assigned 36 of the 38 samples as either AML or ALL and the remaining two as uncertain (PS < 0.3). All 36 predictions agreed with the patients' clinical diagnosis;
•  Independent test: The 50-gene predictor was applied to an independent collection of 34 leukemia samples. The predictor assigned 29 of the 34 samples, and the accuracy was 100%;
•  Prediction strength: median PS = 0.77 in cross-validation and 0.73 in independent test (Fig. 3A).

Results

Class discovery

•  If the AML-ALL distinction were not already known, could it have been discovered simply on the basis of gene expression?

Results

Two-cluster analysis

(1). Cluster tumors by gene expression:

•  A two-cluster SOM was applied to automatically group the 38 initial leukemia samples into two classes on the basis of the expression pattern of all 6817 genes.

Results

Determine whether putative classes produced are meaningful.

•  The clusters were first evaluated by comparing them to the known AML-ALL classes (Fig. 4A). Class A1 contained mostly ALL (24 of 25 samples) and class A2 contained mostly AML (10 of 13 samples). The SOM was thus quite effective at automatically discovering the two types of leukemia.

Results

•  How could one evaluate such putative clusters if the "right" answer were not already known?

Class discovery could be tested by class prediction; if putative classes reflect true structure, then a class predictor based on these classes should perform well.
