Knowledge Discovery

Upload: truonghuong

Post on 24-May-2018


TRANSCRIPT

Page 1: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples

Knowledge Discovery

Page 2

Our goal

From data ... to information ... to knowledge ... to understanding (wisdom)

Page 3

Why do we need Knowledge Discovery?

•  Data explosion: web usage, automated data collection tools, mature database technology
•  Too much data and too little knowledge
•  Humans are not able to sift through the data effectively
•  Computational approaches to data analysis are required for the continually increasing, accumulated data

Page 4

Potential Applications

•  Market analysis, customer relationship management
•  Risk analysis and management
•  Fraud detection
•  Text mining: newsgroups, email, documents
•  Web mining of logs, data streams for customization, advertising, marketing
•  Biology and medicine: many types of high-throughput data for diagnostics, predictive and personalized medicine

Page 5

Link to image reference

Page 6

Link to image reference

Page 7

Page 8

Even Better

Consult the Domain Expert(s)

Page 9

The Process

•  Guided Discovery
–  PBL
–  Knowledge Discovery
–  Learn through examples and practice
•  Same general approach may be applied to many different problem domains
•  Select appropriate methods to customize approach
•  No one right answer!

Page 10

Running Example of KD

•  Gene expression data
•  Why a good example?
–  Biotechnology advances created a huge influx of data
–  Biologists not equipped to analyze the data
–  Computational scientists didn’t understand the biology
–  KDD process sorely needed
–  Has significantly advanced over the last 10 years

Page 11

Papers

•  Data preprocessing and transformation
–  Quackenbush
•  Need for standards
–  MAGE-ML
–  www.mged.org
•  Mining large data sets for patterns
–  Molecular Classification of Cancer
–  Golub et al.

Page 12

A Typical Scenario

•  A biologist designs and runs an experiment and delivers samples (along with $$) to the Functional Genomics lab for high-throughput gene expression analysis. A couple of weeks later the biologist picks up a CD with multiple files containing the raw data and some preprocessed data… Not knowing how to analyze the data, the biologist calls in your help…
•  Where do we start?
–  Understand the domain and the problems

Page 13

High Throughput Systems for Studying Global Gene Expression are Complex

•  Need to learn about and consider:
–  the biology behind the experiments & the interpretation of the experiments
–  how the data is acquired (biotechnology)
–  the data issues

Page 14

Biology Basics: The Flow of Information

A gene is expressed in 2 steps:
–  DNA is transcribed into RNA (mRNA)
–  RNA is translated into protein

Page 15

Genotype to Phenotype

•  Individual cells in an organism have the same genes (DNA)
–  the genotype
but… not all genes are active (expressed) in each cell
•  It is the expression of thousands of genes and their products (RNA, proteins), functioning in a complicated and orchestrated way, that makes a specific cell what it is.
–  the phenotype

Page 16

Gene Expression Depends on Context

•  The subsets of genes that are expressed (RNA/protein) will differ among cells, tissues, organs, conditions…
–  the subset expressed confers unique properties to the cell

[Figure labels: muscle, neuron, liver]

Page 17

Differential Gene Expression

•  The level of expression of genes also differs with the cellular context
•  i.e. the amount of a given RNA will vary
•  We can think of gene expression (in higher organisms) as having both an “on/off” switch and a “volume” control

Page 18

What Biologists Want to Know: Specific Patterns of Gene Expression

•  Tissue/cell type-specific
–  e.g. skin cell vs. brain cell
–  e.g. keratinocyte vs. melanocyte
•  Developmental stage
–  e.g. embryonic skin cell vs. adult skin cell
•  Disease state
–  e.g. normal skin cell vs. skin tumor cell
•  Environment-specific (drugs, toxins)
–  e.g. skin cell untreated vs. treated

Page 19

But also, the more difficult problem: Gene Networks

•  Genes and their products are related through their roles in:
–  metabolic pathways
–  cell signalling networks

Page 20

Metabolic Pathway

From KEGG Database

Page 21

Cell Signalling Networks

www.mpi-dortmund.mpg.de/departments/dep1/signaltransduktion/image3.gif

Page 22

What can we learn by studying global patterns of gene expression?

•  Individual gene expression patterns
•  Classifications: for diagnosis, prediction…
–  Groups of genes
–  Molecular taxonomy of disease
•  Gene networks/pathways:
–  Reconstruction of metabolic & regulatory pathways

Page 23

Now that we have some understanding of the domain and goals…

•  What about the data?
–  How are the data generated?
–  Data type?
–  Data quality?
–  Need for data cleaning and preprocessing?

Page 24

Knowledge Discovery Process

Consult the Domain Expert(s)

Page 25

GeneChip® Oligonucleotide Array

High-throughput gene expression analysis

Page 26

Recall that DNA and RNA are composed of strings of nucleotides

•  A gene of interest will have a specific nucleotide sequence
•  DNA and RNA sequences can form bonds with complementary bases on another string, called base-pairing.
•  When we do this experimentally we call it hybridization, and we can detect it by labeling one of the strings (aka strands)
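
Base-pairing can be sketched in a few lines of Python. This is only an illustration for DNA strands: the function names, the 4-base probe and the target sequence are made up for the example, and real hybridization also tolerates some mismatches, which this exact-match check ignores.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that would base-pair with `seq` (both read 5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def can_hybridize(probe: str, target: str) -> bool:
    """True if `probe` is perfectly complementary to some stretch of `target`."""
    return reverse_complement(probe) in target.upper()


probe = "ATGC"        # hypothetical probe (real probes are much longer)
target = "TTGCATGG"   # hypothetical labeled target strand
print(reverse_complement(probe))      # GCAT
print(can_hybridize(probe, target))   # True
```

The check looks for the reverse complement because the two strands of a duplex run in opposite directions.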

Page 27

GeneChip® Expression Analysis

Hybridization and Staining

[Figure labels: Array, cRNA Target, Hybridized Array, Streptavidin-phycoerythrin conjugate]

Courtesy of M. Hessner, CAAGED Workshop

Page 28

How do Affymetrix microarrays work?

•  12-20 probes are picked to “interrogate” a gene; the idea is to get multiple measurements. Each probe is a 25-mer oligonucleotide that binds to a gene
•  The collection of probes that are designed to hybridize to the same gene is called a “probe set”. There may be tens of thousands of these probe sets on a given chip
•  Probe sets have identification names called “Affymetrix IDs”, which look like “10329_g_at”, etc. On any GeneChip, some probe sets are dedicated to “Quality Control”; these begin with “AFFX_”
•  Take-home message: have to learn a lot of terminology

Page 29

Affymetrix Chips

300,000 “Probes”
Perfect Match and Mismatch
Average Difference Values

Courtesy of J. Glasner, CAAGED Workshop

Page 30

Affymetrix Analysis

•  The high-resolution image of the scanned microarray is stored as a DAT file
•  Since the probes are laid out in a grid, and each probe position is determined in terms of its X-Y coordinates, one can compute the PM and MM probe intensities from the pixelated image
•  The CDF (chip definition file) library file contains the XY layout of every probe

Page 31

Affymetrix Data Flow

[Flow diagram: Hybridized GeneChip → Scan Chip → DAT file → Process Image (GCOS) → CEL file (with CDF file) → MAS5 (GCOS) → CHP file; also shown: TXT file, RPT file, EXP file]

GeneChip Operating Software (GCOS) - Affymetrix
http://www.affymetrix.com/products/software/specific/gcos.affx

Page 32

Affymetrix File Types

•  DAT file:
–  Raw (TIFF) optical image of the hybridized chip
•  CDF file (Chip Description File):
–  Provided by Affy, describes layout of chip
•  CEL file:
–  Processed DAT file (intensity/position values)
–  http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/cel.html
•  CHP file:
–  The “CHP” file contains summarized gene expression scores after probe cells are analyzed;
–  format is:
      Gene          Avg.D  Presence
      AFFX_CreX_at  48     A
      AFFX_BioB_at  149    P
•  TXT file:
–  Probe set expression values with annotation (CHP file in text format)
•  RPT file:
–  Generated by Affy software, report of QC info
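
The three-column layout shown for the CHP/TXT data is easy to read programmatically. The sketch below parses only the two example rows from the slide, assuming whitespace-separated columns and assuming that the calls "P" and "A" stand for present and absent. Real exported files carry more columns, and the function name is invented for the example.

```python
SAMPLE = """\
Gene Avg.D Presence
AFFX_CreX_at 48 A
AFFX_BioB_at 149 P
"""


def parse_expression_table(text):
    """Parse whitespace-separated rows of: gene, average difference, call."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    records = []
    for gene, avg_diff, call in lines[1:]:   # lines[0] is the header row
        records.append({"gene": gene, "avg_diff": float(avg_diff), "call": call})
    return records


records = parse_expression_table(SAMPLE)
present = [r["gene"] for r in records if r["call"] == "P"]
print(present)   # ['AFFX_BioB_at']
```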

Page 33

Knowledge Discovery Process

Consult the Domain Expert(s)

Page 34

Data Quality

•  Most data mining techniques can tolerate some level of imperfection in the data, but improving data quality can improve the quality of analyses
•  Main issues
–  Noise
–  Outliers
–  Missing values
–  Duplicate data
–  Inconsistent data

Page 35

There are Many Problems Facing Expression Analysis on the Biotech Side

•  Standardization & quality control in the experiments (affects data quality at many levels)
•  Cost

Page 36

Problem in reproducibility of experimental data

•  Lots of variation in arrays
–  more than 100 experimental steps
•  Sources of variation
–  biological variability in each RNA extract
–  each labeling reaction is different
–  each slide is a separate hybridization
–  spots on the slide are variable across slides (and within slides when double spotted)
–  each “color” is scanned separately
•  Need replicates and statistics!

Page 37

Outcome

•  “Noisy” data
•  Data preprocessing is necessary
–  normalization
–  scaling
•  Heavy reliance on statistics today

Page 38

What do the spots (intensity measurements) represent?

•  Fluorescence intensity is a measure of the relative abundance of individual mRNAs (expressed genes) in given samples
–  e.g. experimental relative to control
•  But, gene expression experiments are run on “multiple samples”. Why?
•  We are trying to understand a dynamic process; each sample only represents a “snapshot”
–  Compare among samples (different arrays)
–  Compare across a time-course of related samples

Page 39

How can we use the data?

•  For microarrays we can only really depend on between-sample fold change, not on absolute values or within-sample comparisons (in general, a fold change above 1.3-2.0)
•  Take-home message: have to be careful when comparing between arrays and from experiment to experiment.
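
A between-sample fold change is just the ratio of one gene's signal on two arrays. The sketch below flags genes that pass a cutoff in either direction. The 2.0 default is taken from the upper end of the range quoted above; the gene names and intensities are hypothetical, and the arrays are assumed to have been normalized already.

```python
import math


def log2_fold_change(sample_value, reference_value):
    """log2 ratio of one gene's signal in two samples; 1.0 means a 2-fold increase."""
    return math.log2(sample_value / reference_value)


def changed_genes(sample, reference, min_fold=2.0):
    """Genes whose between-sample fold change meets `min_fold`, up or down."""
    cutoff = math.log2(min_fold)
    return sorted(
        gene for gene in sample
        if abs(log2_fold_change(sample[gene], reference[gene])) >= cutoff
    )


# Hypothetical normalized intensities for three genes on two arrays.
treated = {"geneA": 800.0, "geneB": 210.0, "geneC": 50.0}
control = {"geneA": 200.0, "geneB": 200.0, "geneC": 200.0}
print(changed_genes(treated, control))   # ['geneA', 'geneC']
```

Working on the log2 scale treats a 4-fold increase and a 4-fold decrease symmetrically (+2 and -2).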

Page 40

Pre-processing

•  Gene filtering
–  control genes
–  uninformative genes
•  Normalization and scaling
–  allows comparisons across arrays
–  scaling to control dynamic range
•  Transformation
–  logarithmic transformation for improved statistical properties
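
The three pre-processing steps can be chained as below. This is a minimal sketch, not the method used by any particular package: median scaling, the target median of 100, the floor of 1 and the probe set names other than the AFFX_ control are all assumptions made for the example.

```python
import math
from statistics import median


def preprocess(array, control_prefix="AFFX_", target_median=100.0, floor=1.0):
    """Filter control probe sets, scale to a common median, then log2-transform.

    `array` maps probe set name -> raw intensity for one chip.
    """
    # 1. Gene filtering: drop quality-control probe sets.
    kept = {name: value for name, value in array.items()
            if not name.startswith(control_prefix)}
    # 2. Scaling: bring the median of every array to `target_median`.
    factor = target_median / median(kept.values())
    # 3. Transformation: log2, with a floor so very small values stay finite.
    return {name: math.log2(max(value * factor, floor))
            for name, value in kept.items()}


chip = {"AFFX_BioB_at": 149.0, "g1_at": 50.0, "g2_at": 200.0, "g3_at": 800.0}
result = preprocess(chip)
print(sorted(result))   # ['g1_at', 'g2_at', 'g3_at']
```

Applying the same function to every array puts them on a common scale before any comparison across arrays.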

Page 41

Normalization

[Scatter plot: Cy3 signal (log2) vs. Cy5 signal (log2)]

Page 42

Take-home Message

•  Important to remember that once preprocessing, normalization, and transformation of the data have occurred, all downstream mining will be affected.

Page 43

Data Representation

•  Flat file
•  Vector data
•  Sparse matrix (text) data
•  Sequence data (e.g. web or genomic)
•  Time series
•  Image data
•  Spatio-temporal

Page 44

Three levels of microarray gene expression data processing

Brazma et al., Nature Genetics, 29:365-371, 2001

Page 45

Outcomes of Microarray Analysis

Large, complex data sets of high dimensionality
–  example of a routine study: 50,000 “genes” from 20 samples, approx. 1-2 × 10^6 pieces of data

→ challenges for Bioinformatics
•  annotation, storage, retrieval, sharing of data
•  information from the data

Page 46

Knowledge Discovery Process

Consult the Domain Expert(s)

Page 47

State of Microarray Data

•  Wide availability of technology has given rise to a large number of distributed databases
•  Data scattered among many independent sites (accessible via Internet) or not publicly available at all
•  Need for standardization!

Page 48

MGED Group and Standardization Issues

•  Microarray Gene Expression Database (MGED) Group
   www.mged.org
•  MGED is taking on the challenge of standardization
•  Four major projects

Page 49

MGED Projects

•  MIAME: The formulation of the minimum information about a microarray experiment required to interpret and verify the results.
•  MAGE: The establishment of a data exchange format (MAGE-ML) and object model (MAGE-OM) for microarray experiments.

Page 50

MGED Projects

•  Ontologies: The development of ontologies for microarray experiment description and biological material (biomaterial) annotation in particular.
•  Normalization: The development of recommendations regarding experimental controls and data normalization methods.

Page 51

MAGE-ML

•  the XML representation of the MAGE-OM
•  the DTD (document type definition) is what is specified in MAGE-ML
–  rules or declarations
–  what tags can be used
–  what tags contain

Page 52

•  MAGE-OM
•  http://www.mged.org/Workgroups/MAGE/mage-om.html
•  mapping of microarray experimental workflow to the OM

Page 53

•  DTD
•  http://www.omg.org/docs/dtc/03-05-03.dtd

Page 54

•  MAGE-STK software toolkit
–  defines an API to MAGE-OM
–  in Java, Perl, C++
•  Used to
–  export data to MAGE-ML
–  store data in a relational database
–  input data to analysis tools
•  Reader: MAGE-ML docs into objects
•  Writer: objects into MAGE-ML

Page 55

Knowledge Discovery Process

Consult the Domain Expert(s)

Page 56

Data Mining Techniques

•  Exploratory data analysis
•  Descriptive modeling
•  Predictive modeling
•  Pattern discovery
•  others

Page 57

Exploratory Data Analysis

•  Interactive and visual
•  Insight and feel for the data in a broad sense
–  Provide summaries, e.g. max/min, mean/median, variance etc.
–  Visualization: histograms, scatter plots
•  Useful for data validation or verification
•  Simple exploratory data analysis is invaluable
–  Always get a cursory view of the data before applying data mining algorithms
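
A cursory view of this kind needs very little code. The sketch below uses only the Python standard library; the function names and the six example measurements are made up, and the text histogram is a stand-in for a real plot.

```python
from statistics import mean, median, variance


def summarize(values):
    """Return basic summary statistics for one list of measurements."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "variance": variance(values),   # sample variance (n - 1 denominator)
    }


def text_histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = {}
    for v in values:
        edge = (v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


signal = [1, 2, 3, 4, 3, 2]   # hypothetical measurements
print(summarize(signal))
print(text_histogram(signal, 2))   # {0: 1, 2: 4, 4: 1}
```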

Page 58

Pattern Discovery

•  Discover interesting local patterns in data rather than characterizing the data globally
•  Market basket data
–  Discover that if customers buy wine and bread, they buy cheese with a 0.9 probability
–  Known as association rules
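
The probability attached to a rule such as "wine and bread implies cheese" is usually called its confidence: the share of baskets containing wine and bread that also contain cheese. The sketch below computes it directly; the four baskets are invented, so the result is not the 0.9 quoted on the slide.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets, made up for illustration.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "olives"},
    {"wine", "bread"},
    {"bread", "milk"},
]
print(confidence(baskets, {"wine", "bread"}, {"cheese"}))   # about 0.67
```

Here three baskets contain wine and bread and two of those also contain cheese, so the confidence is 2/3.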

Page 59

Descriptive Modeling

•  Build model for underlying process
–  Simulate the data if needed
•  Cluster analysis to find natural groups in the data
•  Bayesian network to find dependency models among variables

Page 60

Predictive Modeling

•  Predict a variable Y, given a p-dimensional vector X
–  Classification: Y is categorical
–  Regression: Y is real-valued
•  Much like function approximation
–  Learning the relationship between Y and X
•  Statistics and machine learning have many algorithms for predictive modeling
–  Emphasis is often on predictive accuracy rather than understanding the model itself.

Page 61

Mining of Expression Data

Recall that:
•  A gene expression pattern derived from a single microarray is simply a snapshot (one experimental sample vs. reference)
•  Usually want to understand a process or changes in expression over a collection of samples

[Figure label: gene expression profile]

Page 62

Working with Gene Expression Data

•  Hypothesis-driven approaches
–  Typically model-oriented
–  Descriptive statistics relying on prior knowledge and good design
•  Discovery-based
–  Few, if any, a priori hypotheses
–  Data-driven and algorithm-oriented
–  Statistical algorithms
–  Machine learning using heuristic techniques

Page 63

Testing Hypotheses

•  Based on prior biological knowledge
•  Simplest
–  look for individual differentially expressed genes
–  fold changes
•  Scatter plot
•  Statistical measures

Page 64

Scatter plot

Page 65

Some simple statistics

•  If we are looking at samples that seem to belong to two groups or conditions:
•  The t-test compares the means of two groups while accounting for the standard error of the difference of the means
•  Use ANOVA if we want to extend the analysis to more than two groups
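
The t statistic is the difference of the two group means divided by the standard error of that difference. The slide does not say which variant of the test is meant; the sketch below assumes Welch's form (unequal variances), and the expression values are hypothetical. Turning t into a p-value needs the t distribution, which is left to a statistics package.

```python
import math
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Welch's t: difference of means divided by its standard error."""
    se = math.sqrt(variance(group_a) / len(group_a) +
                   variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical log2 expression values of one gene in two conditions.
normal = [1.0, 2.0, 3.0, 4.0, 5.0]
tumor = [2.0, 4.0, 6.0, 8.0, 10.0]
print(round(t_statistic(normal, tumor), 3))   # -1.897
```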

Page 66

But, gene chips allow us to measure thousands of genes....

•  Across multiple samples

Page 67

Goal of Analysis of Expression Matrix

•  Some statistical methods applied to:
1.  “Group” similar genes together => groups of functionally similar genes.
2.  “Group” similar cell samples together.
3.  “Extract” representative genes in each group.

Page 68

Typical approach

•  Look for patterns
–  compare rows to find evidence for co-regulation of genes
–  compare columns to find evidence for relatedness among samples

1) Choose a measure of similarity (distance) among the objects being compared; each row or column is considered a vector in space
2) Then, group together objects (genes or samples) with similar properties; this is a multidimensional analysis

Page 69

An experiment

•  12 genes
•  Expression values at 0, 2, 4, 6, 8 and 10 hours

Page 70

Table 4.2 of Campbell/Heyer

Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     1      8      12     16     12     8
D     1      3      4      4      3      2
E     1      4      8      8      8      8
F     1      1      1      .25    .25    .1
G     1      2      3      4      3      2
H     1      .5     .33    .25    .33    .5
I     1      4      8      4      1      .5
J     1      2      1      2      1      2
K     1      1      1      1      3      3
L     1      2      3      4      3      2
M     1      .33    .25    .25    .33    .5
N     1      .125   .0833  .0625  .0833  .125

Page 71

•  Take logs

Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     0      3.0    3.58   4.0    3.58   3.0
D     0      1.58   2.0    2.0    1.58   1.0
E     0      2.0    3.0    3.0    3.0    3.0
F     0      0      0      -2.0   -2.0   -3.32
G     0      1.0    1.58   2.0    1.58   1.0
H     0      -1.0   -1.6   -2.0   -1.6   -1.0
I     0      2.0    3.0    2.0    0      -1.0
J     0      1.0    0      1.0    0      1.0
K     0      0      0      0      1.58   1.58
L     0      1.0    1.58   2.0    1.58   1.0
M     0      -1.6   -2.0   -2.0   -1.6   -1.0
N     0      -3.0   -3.59  -4.0   -3.59  -3.0

•  Compare
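
The logged values are base-2 logarithms of the ratios in Table 4.2 (for example, a ratio of 8 becomes 3.0). A short check in Python, with a helper name chosen for the example:

```python
import math


def log2_row(values, ndigits=2):
    """log2-transform one row of expression ratios and round for display."""
    return [round(math.log2(v), ndigits) for v in values]


print(log2_row([1, 8, 12, 16, 12, 8]))                       # row C
print(log2_row([1, 0.125, 0.0833, 0.0625, 0.0833, 0.125]))   # row N
```

After the transform, rows C and N are close to mirror images of each other: an 8-fold increase (+3.0) and an 8-fold decrease (-3.0) sit at equal distance from zero, which is not true of the raw ratios 8 and .125.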

Page 72

How Similar are two Rows?

•  How similar are the expressions of two genes?
•  First we’ll normalize each row
•  Calculate the mean and standard deviation for each gene
•  Normalize each value by subtracting the mean and dividing by the standard deviation.
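
Row normalization as described can be written as follows. The slide does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form (`pstdev`), and the function name is made up.

```python
from statistics import mean, pstdev


def normalize_row(values):
    """Subtract the row mean and divide by the row standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)   # population SD; an assumption, see the note above
    if sigma == 0:
        raise ValueError("constant row: standard deviation is zero")
    return [(v - mu) / sigma for v in values]


row_g = [1, 2, 3, 4, 3, 2]   # gene G from Table 4.2
z = normalize_row(row_g)
print([round(v, 2) for v in z])
```

After normalization every row has mean 0 and standard deviation 1, so two genes with the same shape of profile but different magnitudes end up with the same normalized row.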

Page 73

How Similar are two Rows?

•  Calculate the Pearson Correlation between pairs of rows
•  Correlation quantifies the extent to which the expression patterns of two genes go up or down together, regardless of their magnitudes.
•  Calculated from the dot product of the two normalized vectors

> (pc '(1 2 3 4 3 2)    ; row G
      '(1 2 3 4 3 2))   ; row L
1.0
> (pc '(1 2 3 4 3 2)    ; row G
      '(1 3 4 4 3 2))   ; row D
0.8971499589146109
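
The definition of `pc` is not shown on the slides. A Python equivalent, written here as a sketch of the standard Pearson formula, gives the same values for rows G, L and D:

```python
import math


def pc(xs, ys):
    """Pearson correlation coefficient of two equal-length rows."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]   # centered rows
    dy = [y - mean_y for y in ys]
    dot = sum(a * b for a, b in zip(dx, dy))
    return dot / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


G = [1, 2, 3, 4, 3, 2]
L = [1, 2, 3, 4, 3, 2]
D = [1, 3, 4, 4, 3, 2]
print(pc(G, L))             # 1.0
print(round(pc(G, D), 3))   # 0.897
```

Dividing the dot product of the centered rows by the product of their lengths is the same as taking the dot product of the two normalized rows and dividing by the number of values.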

Page 74

Some other pairs

Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     1      8      12     16     12     8
D     1      3      4      4      3      2
E     1      4      8      8      8      8
F     1      1      1      .25    .25    .1
G     1      2      3      4      3      2
H     1      .5     .33    .25    .33    .5
I     1      4      8      4      1      .5
J     1      2      1      2      1      2
K     1      1      1      1      3      3
L     1      2      3      4      3      2
M     1      .33    .25    .25    .33    .5
N     1      .125   .0833  .0625  .0833  .125

> (pc '(1 3 4 4 3 2)            ; row D
      '(1 .33 .25 .25 .33 .5))  ; row M
-0.9260278787295065
> (pc '(1 2 3 4 3 2)            ; row G
      '(1 .5 .33 .25 .33 .5))   ; row H
-0.9090853650855358

Page 75

Pearson Correlation

•  pc(G, L) = 1 -- identically expressed genes
•  pc(G, D) = .897 -- similarly expressed genes
•  pc(D, M) = -.926 -- reciprocally expressed
•  pc(G, H) = -.909 -- also reciprocally expressed

Page 76

Descriptive and Predictive Modeling

•  Clustering
•  Feature extraction/selection
•  Classification: discrimination analysis

Page 77

Analytic Approaches

•  Clustering: identification of associations between data points; organization of data into groups
•  Unsupervised clustering: genes clustered by similarity/correlation, or other criteria based on X-values
–  no useful external information about the Y-variables (the response) is used
→ doesn’t reveal groups of genes with special interest for tissue discrimination
•  Supervised methods: grouping of variables (genes), controlled by information about the X and Y variables
→ supervised algorithms try to find gene clusters whose average expression profile has great potential for explaining the response Y, i.e. for tissue discrimination

Page 78

•  Unsupervised clustering algorithms
–  Hierarchical
–  K-means
–  Self-organizing maps
–  Others

Page 79

Gene Expression Matrix & Hierarchical Clustering

[Figure: gene expression matrix with genes as rows and samples as columns]

Eisen et al.
http://www.pnas.org/cgi/content/full/95/25/14863

Page 80

Theory

•  Hierarchical clustering works by joining the two nearest clusters, then the next two closest clusters, and so on, so that the nearest clusters are joined first and the farthest clusters last.
•  Initially each individual data point is its own cluster

Page 81: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples

Hierarchical Clustering Algorithm

•  Given a set of N items to be clustered, and an N*N distance (or similarity) matrix:

1.  Start by assigning each item to a cluster, so that if you have N items, you will now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be defined as the same as the distances (similarities) between the items they contain.

2.  Find the closest (most similar) pair of clusters and merge them into a single cluster. You now have one cluster less.

3.  Compute distances (similarities) between the new cluster and each of the old clusters.

4.  Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
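The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not an efficient implementation: the function name `agglomerative`, the `linkage` parameter, and the toy positions are assumptions made for this example, and cluster distances are recomputed from scratch on every pass rather than updated incrementally.

```python
def agglomerative(dist, linkage=min):
    """Merge clusters bottom-up and return the list of merges performed.

    dist    -- symmetric N x N matrix (list of lists) of item distances
    linkage -- reduces the item-to-item distances between two clusters to
               one number (min is an assumed default for this sketch)
    """
    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(dist))]
    merges = []
    # Step 4: repeat until a single cluster of size N remains.
    while len(clusters) > 1:
        # Steps 2-3: compute cluster-to-cluster distances, find closest pair.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage([dist[i][j]
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy data (hypothetical): four items located at 0, 1, 5 and 7 on a line.
positions = [0, 1, 5, 7]
dist = [[abs(p - q) for q in positions] for p in positions]
merges = agglomerative(dist)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The returned merge list is the information a dendrogram displays: which clusters were joined, and at what distance.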

Hierarchical in action

Variations of Hierarchical Algorithm

•  Step 3 (computing distances between the new cluster and each of the old clusters) can be done in several different ways: single linkage, average linkage and complete linkage.

•  In single linkage, the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster.

•  In average linkage, the distance between two clusters is the average of the distances from each member of one cluster to each member of the other cluster.

•  In complete linkage, the distance between two clusters is the maximum distance from any member of the first cluster to any member of the second cluster.
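To make the three definitions concrete, the sketch below computes each linkage for the same pair of clusters. The helper names and the one-dimensional toy clusters are assumptions chosen for illustration; any distance function could be passed in place of the absolute difference.

```python
from itertools import product


def pairwise(a, b, dist):
    """All distances from a member of cluster a to a member of cluster b."""
    return [dist(x, y) for x, y in product(a, b)]


def single_linkage(a, b, dist):
    return min(pairwise(a, b, dist))      # shortest member-to-member distance


def average_linkage(a, b, dist):
    d = pairwise(a, b, dist)
    return sum(d) / len(d)                # mean over all member pairs


def complete_linkage(a, b, dist):
    return max(pairwise(a, b, dist))      # longest member-to-member distance


# Toy clusters (hypothetical) on a line; the pairwise distances are 4, 6, 3, 5.
cluster_a = [0, 1]
cluster_b = [4, 6]
abs_diff = lambda x, y: abs(x - y)

single = single_linkage(cluster_a, cluster_b, abs_diff)      # 3
average = average_linkage(cluster_a, cluster_b, abs_diff)    # 4.5
complete = complete_linkage(cluster_a, cluster_b, abs_diff)  # 6
print(single, average, complete)
```

For any pair of clusters, the single-linkage distance is never larger than the average, and the average never larger than the complete-linkage distance, which is why the choice can change which clusters merge first.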

Variations of Hierarchical Algorithm

•  Self Organizing Tree Algorithm
– Unsupervised neural network with a binary tree topology
– Combination of SOM and hierarchical clustering
– Runtime is approximately linear
   •  Faster than normal hierarchical method
– Uses divisive method
   •  In comparison to bottom-up method of hierarchical

Advantages

•  Hierarchical clustering results in a visual representation that is convenient for humans to analyze

•  Unlike k-means and SOM, it does not require the number of clusters to be specified a priori

Why cluster analysis may not be “the” answer

•  Clustering methods typically require user inputs:

Example: distance measure

•  Clustering methods differ in the way that the number of clusters is specified.

•  Clustering methods are often sensitive to the initialization condition (starting guess)

•  Local vs. global sampling of clustering space
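The sensitivity to the starting guess can be shown with a small k-means run. The sketch below is a plain-Python version of the standard assign-then-update iteration; the four toy points and both sets of starting centers are hypothetical, chosen so that one start reaches the tight left/right grouping and the other settles in a poorer local optimum.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point (ties go to the first)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, init_centers, max_iter=100):
    centers = [tuple(float(v) for v in c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move each center to the mean of its members (keep it if empty).
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else old)
        if new_centers == centers:   # converged: centers stopped moving
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, sse


# Corners of a wide rectangle: the natural grouping is left pair vs right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_a, sse_a = kmeans(points, [(0, 0), (4, 0)])  # one center per side
labels_b, sse_b = kmeans(points, [(0, 0), (0, 1)])  # both centers on the left
print(labels_a, sse_a)
print(labels_b, sse_b)
```

Both runs converge, but the second groups the points as bottom row vs top row and ends with a much larger within-cluster sum of squares. Running from several starting guesses and keeping the best result is a common way to reduce this effect.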

Cluster Analysis Challenges

•  “Noise” in the data itself

•  Large data sets
– most of the techniques currently used were not developed for multidimensional data

•  What about networks?
– limitation of cluster analysis: similarity in expression pattern suggests co-regulation but doesn't reveal cause-effect relationships

Feature Selection & Classification

•  First, identify features (genes) that discriminate between classes

•  Then use features for classification
– machine learning approach
– supervised analysis
– assignment of a new sample to a previously specified class, based on sample features and a trained classifier
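As a sketch of the first step, the example below ranks genes by how well they separate two classes. The scoring rule, a signal-to-noise ratio (difference of class means divided by the sum of class standard deviations), is an assumption made for this example and only one of many possible criteria; the gene names and expression values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class2):
    """Large absolute values mean the gene separates the two classes well."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))


# Hypothetical expression values: gene -> (samples of class 1, samples of class 2)
expression = {
    "geneA": ([8.0, 9.0, 10.0], [1.0, 2.0, 3.0]),   # much higher in class 1
    "geneB": ([5.0, 6.0, 7.0], [5.0, 7.0, 6.0]),    # no difference
    "geneC": ([2.0, 4.0, 6.0], [7.0, 9.0, 11.0]),   # higher in class 2
}

scores = {gene: signal_to_noise(c1, c2)
          for gene, (c1, c2) in expression.items()}

# Rank by absolute score: the sign only says which class the gene marks.
ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
print(ranked)
```

The top-ranked genes would then be passed to the classifier as its features.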

“Classic” Example: Classification of AML vs. ALL

•  Biological/Clinical Problems:
– previously, no single reliable test to distinguish them
– differ greatly in clinical course & response to treatments

Golub et al., Science Oct 15 1999: 531-537

•  Comparing 2 acute leukemias
– acute myeloid leukemia (AML)
– acute lymphoid leukemia (ALL)

Golub et al., Science Oct 15 1999: 531-537

Study Design

The prediction of a new sample is based on 'weighted votes' of a set of informative genes
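A minimal sketch of a weighted-vote predictor follows. It assumes each informative gene carries a weight (how strongly it correlates with the class distinction) and a midpoint between the two class means, and that a gene votes with strength weight × (expression − midpoint). The gene names, parameter values, the convention that positive votes favour AML, and the function name are all hypothetical; the 0.3 prediction-strength (PS) cut-off is the one quoted in the results for this study.

```python
def weighted_vote(sample, gene_params, threshold=0.3):
    """Return (predicted class or 'uncertain', prediction strength)."""
    v_pos = v_neg = 0.0
    for gene, (weight, midpoint) in gene_params.items():
        vote = weight * (sample[gene] - midpoint)
        if vote > 0:
            v_pos += vote          # assumed convention: positive = AML
        else:
            v_neg += -vote         # negative votes count toward ALL
    total = v_pos + v_neg
    if total == 0:
        return "uncertain", 0.0
    # Prediction strength: margin of victory relative to all votes cast.
    ps = abs(v_pos - v_neg) / total
    if ps < threshold:
        return "uncertain", ps
    return ("AML" if v_pos > v_neg else "ALL"), ps


# Hypothetical informative genes: name -> (weight, midpoint of class means)
gene_params = {"g1": (3.5, 5.5), "g2": (-1.25, 6.5), "g3": (0.5, 4.0)}
new_sample = {"g1": 9.0, "g2": 4.0, "g3": 3.0}

label, ps = weighted_vote(new_sample, gene_params)
print(label, round(ps, 3))
```

Here g1 and g2 both vote for AML (12.25 and 3.125) and g3 votes weakly for ALL (0.5), so the margin is large and the sample is assigned rather than left uncertain.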

Results of the study

1) Clustering of microarray data using tumors of known type

found 1100 of 6817 genes correlated with class distinction

2) Formation of a class predictor = 50 most informative genes used as a training set

classification of unknown tumors

Golub et al., Science Oct 15 1999: 531-537

Results

How to test the validity of class predictors?

•  Cross-validation tests: The 50-gene predictor assigned 36 of the 38 samples as either AML or ALL and the remaining two as uncertain (PS < 0.3). All 36 predictions agreed with the patients' clinical diagnosis;

•  Independent test: The 50-gene predictor was applied to an independent collection of 34 leukemia samples. The predictor assigned 29 of the 34 samples, and the accuracy was 100%;

•  Prediction strength: median PS = 0.77 in cross-validation and 0.73 in independent test (Fig. 3A).

Results

Class discovery

•  If the AML-ALL distinction were not already known, could it have been discovered simply on the basis of gene expression?

Results

Two-cluster analysis

(1). Cluster tumors by gene expression:

•  A two-cluster SOM was applied to automatically group the 38 initial leukemia samples into two classes on the basis of the expression pattern of all 6817 genes.

Results

Determine whether putative classes produced are meaningful.

•  The clusters were first evaluated by comparing them to the known AML-ALL classes (Fig. 4A). Class A1 contained mostly ALL (24 of 25 samples) and class A2 contained mostly AML (10 of 13 samples). The SOM was thus quite effective at automatically discovering the two types of leukemia.
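The agreement between the two SOM clusters and the known classes can be worked out directly from the counts quoted above. In this sketch the minority counts (1 AML in A1, 3 ALL in A2) are derived from "24 of 25" and "10 of 13", assuming every sample is either ALL or AML; the "purity" summary is an illustrative measure, not one reported on the slide.

```python
# Known class counts inside each SOM cluster, from the figures quoted above.
clusters = {
    "A1": {"ALL": 24, "AML": 1},    # 24 of 25 samples are ALL
    "A2": {"ALL": 3, "AML": 10},    # 10 of 13 samples are AML
}

total = sum(sum(counts.values()) for counts in clusters.values())

# Purity: fraction of samples that belong to their cluster's majority class.
majority = sum(max(counts.values()) for counts in clusters.values())
purity = majority / total

print(f"{majority} of {total} samples match their cluster's majority class "
      f"({purity:.1%})")
```

So 34 of the 38 samples, roughly 89%, fall in the cluster dominated by their own leukemia type.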

Results

•  How could one evaluate such putative clusters if the "right" answer were not already known?

Class discovery could be tested by class prediction: if putative classes reflect true structure, then a class predictor based on these classes should perform well.
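One way to picture this test: treat the putative cluster labels as if they were true classes, build a predictor from them, and measure how well it predicts held-out samples. The sketch below uses leave-one-out cross-validation with a nearest-centroid classifier. Both choices, and the toy two-gene data, are assumptions made for illustration and are not the predictor used in the study.

```python
def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def predict(train, x):
    """Assign x to the putative class whose centroid is nearest."""
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    return min(by_label, key=lambda lab: sq_dist(x, centroid(by_label[lab])))


def loo_accuracy(samples):
    """Leave each sample out in turn, train on the rest, predict its label."""
    hits = 0
    for i, (vec, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        hits += predict(train, vec) == label
    return hits / len(samples)


# Hypothetical expression profiles (two genes per sample).
profiles = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
            (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]

# Putative classes that follow the structure in the data ...
structured = list(zip(profiles, ["A1", "A1", "A1", "A2", "A2", "A2"]))
# ... versus putative classes assigned without regard to it.
arbitrary = list(zip(profiles, ["A1", "A2", "A1", "A2", "A1", "A2"]))

acc_structured = loo_accuracy(structured)
acc_arbitrary = loo_accuracy(arbitrary)
print(acc_structured, acc_arbitrary)
```

Labels that follow real structure in the data are predicted well on held-out samples, while labels assigned without regard to that structure are predicted less reliably.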
