presented by john quackenbush, ph.d. at the june 10, 2003 meeting of the pharmacology toxicology...
Presented by John Quackenbush, Ph.D.
at the June 10, 2003 meeting of the Pharmacology Toxicology Subcommittee of the Advisory Committee for Pharmaceutical Science
Challenges in Data Management and Analysis for Microarrays
FDA
10 June 2003
Selecting the Appropriate Platform
Affymetrix GeneChip™ Expression Analysis
Generate DNA sequence
Design and synthesize chips
Obtain RNA samples (Control, Test)
Prepare fluorescently labeled probes
Hybridize and wash chips
Scan chips
Analyze (PM, MM)
Microarray Overview I
Microbial ORFs: design PCR primers, PCR products
Eukaryotic genes: select cDNA clones, PCR products
Microtiter plate: many different plates containing different genes
Microarray slide (with 60,000 or more spotted genes)
For each plate set, many identical replicas
Microarray Gene Chip Overview II
Obtain RNA samples (Control, Test)
Prepare fluorescently labeled probes
Hybridize, wash
Measure fluorescence in 2 channels (red/green)
Analyze the data to identify patterns of gene expression
Microarray Expression Analysis
Tissue Selection
Differential State/Stage Selection
RNA Preparation and Labeling
Competitive Hybridization
Gene Spots on an Array
Fluorescence Intensity
Expression Measurement
Platform-related issues
Lack of standardization makes direct comparison of results a challenge
Lot-to-lot variation in arrays can introduce artifacts: are the results dependent on the biology or on the arrays (or technician or reagent lots or ....)
Commercial arrays provide a standard and remove some design considerations (one sample, one array), but cost up to 10x (or greater) more than in-house arrays
Arrays demand good LIMS systems for sample tracking
Microarray Analysis
General Microarray Strategy
Choose an experimentally interesting and tractable model system
Design an experiment with comparisons between related variants
Include sufficient biological replication to make good estimates
Hybridize and collect data
Normalize and filter
Mine data for biological patterns of expression
Integrate expression data with other ancillary data, including genotype, phenotype, the genome, and its annotation
Annotating and Comparing Arrays
TIGR Gene Indices home page
www.tigr.org/tdb/tgi
~60 species
>16,000,000 sequences
The Mouse Gene Index <http://www.tigr.org/tdb/mgi>
A TC Example
Babak Parvizi
GO Terms and EC Numbers
The TIGR Gene Indices <http://www.tigr.org/tdb/tgi>
Dan Lee, Ingeborg Holt
Tentative Orthologues and Paralogues
Building TOGs: Reflexive, Transitive Closure
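The slide names the closure step but does not show it. A minimal sketch of the transitive part follows, assuming the input is a list of pairwise matches that already hold in both directions (one possible reading of "reflexive" here, which the slide does not spell out). The sequence identifiers and the function name `build_groups` are hypothetical.

```python
from collections import defaultdict


def build_groups(pairs):
    """Group sequence IDs by taking the transitive closure of pairwise
    matches, i.e. the connected components of the match graph."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    # Sort members and groups so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical matches between sequences from three species.
pairs = [
    ("human_TC1", "mouse_TC7"),
    ("mouse_TC7", "rat_TC3"),
    ("human_TC9", "mouse_TC2"),
]
groups = build_groups(pairs)
```

Here `human_TC1` and `rat_TC3` end up in the same group even though they were never matched directly, because both match `mouse_TC7`.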
Thanks to Woytek Makałowski and Mark Boguski
TOGA: A Sample Alignment: bithoraxoid-like protein
Gene Finding in Humans is easy!
Razvan Sultana
Gene Finding in Humans is easy?
Razvan Sultana
Gene Finding in Humans is difficult?
Razvan Sultana
Gene Finding in Humans is difficult?
Razvan Sultana
A genome and its annotation is only a hypothesis that must be tested.
http://pga.tigr.org/tools.shtml
RESOURCERER
Jennifer Tsai
RESOURCERER: An Example
RESOURCERER: Using Genetic Markers
Next step: Integrate QTLs
Annotation Issues
The “complete” genome is incomplete
Gene names are not yet well defined
  One gene may have many names
  One gene may have many sequences
  One sequence may have many names
Analysis and interpretation depend on well annotated gene sets
  Gene names, Gene Ontology Assignments, and pathway information
Cross-species comparisons require good knowledge of orthologues and paralogues
Tools and Techniques for Array Analysis
Analysis steps
Design the experiment
Perform the hybridizations and generate images
Analyze images to identify genes and expression levels (hybridization intensities)
Normalize expression levels to facilitate comparisons
Analyze expression data to find biologically relevant patterns
MADAM: Microarray Data Manager
Available with OSI source and MySQL
Joseph White, Jerry Li, Alexander Saeed, Vasily Sharov, Syntek Inc.
MAGE-ML export by June
Why Normalize Data?
Goal is to measure ratios of gene expression levels: $(\mathrm{ratio})_i = R_i / G_i$, where $R_i$ and $G_i$ are, respectively, the measured red and green intensities for the $i$th spot.
In a self-self hybridization, we would expect all ratios to be equal to one: $R_i / G_i = 1$ for all $i$. But they may not be.
Why not?
  Unequal labeling efficiencies for Cy3/Cy5
  Noise in the system
  Differential expression
Normalization brings (appropriate) ratios back to one.
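As an illustration of "bringing ratios back to one", here is a minimal sketch of global median centering of log2 ratios. This is a deliberately simple stand-in and not the LOWESS procedure used in the talk. It assumes most spots are not differentially expressed, so the median log ratio should be zero. The function name and intensity values are made up for the example.

```python
import math
from statistics import median


def median_normalize(red, green):
    """Return log2(R_i / G_i) for each spot, shifted so the median is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)  # global dye bias estimated from all spots
    return [lr - shift for lr in log_ratios]


# Hypothetical intensities: the red channel reads about 2x too bright
# for four spots, and one spot differs from the rest.
red = [200.0, 400.0, 800.0, 1600.0, 100.0]
green = [100.0, 200.0, 400.0, 800.0, 200.0]
normalized = median_normalize(red, green)
```

After centering, the four spots with the shared two-fold bias sit at a log2 ratio of zero (ratio of one), and the remaining spot stands out at a log2 ratio of -2.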
LOWESS Results
MIDAS: Data Analysis
Wei Liang
Available with source
Variance Stabilization, Adding Error Models, MAANOVA, Automated Reporting
MeV: Data Mining Tools
Alexander Saeed, Alexander Sturn, Nirmal Bhagabati, John Braisted, Syntek Inc., Datanaut, Inc.
Available with OSI source
Analysis Issues
There is no standard method for data analysis
The same algorithm with a small change in parameters (such as distance metric) can produce very different results
Data normalization plays a big role in identifying “differentially expressed” genes
Much of the apparent disparity in microarray datasets can be attributed to differences in data analysis methods, from image processing to normalization to data mining
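The point about distance metrics can be made concrete with a small sketch. Given the same query profile and the same candidates, Euclidean distance and Pearson correlation distance pick different nearest neighbours. The profiles and gene names are invented for illustration.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: small when profiles have the same shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def nearest(query, candidates, metric):
    """Name of the candidate profile closest to the query under `metric`."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))


query = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "gene_a": [10.0, 20.0, 30.0, 40.0],  # same shape, much larger magnitude
    "gene_b": [2.0, 1.0, 4.0, 3.0],      # similar magnitude, different shape
}
by_euclidean = nearest(query, candidates, euclidean)
by_pearson = nearest(query, candidates, pearson_distance)
```

Euclidean distance favours `gene_b` because its values are numerically close, while Pearson distance favours `gene_a` because it rises in step with the query. A clustering run on top of either metric would group these genes differently.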
Data Reporting Standards
What data should we collect? EVERYTHING
Publications on Microarray Data Exchange Standards
Nature Genetics 29, December 2001
<http://www.mged.org>
MAGE-ML: XML-based data exchange format
MIAME Standards: Nature family, Cell family, EMBO reports, Bioinformatics, Genome Research, Genome Biology, Science, The Lancet, and others….
Standardization Issues
MIAME Standards are a start, but still evolving
Implementation will require further development of ontologies to create standard descriptors
MIAME-Tox <http://www.mged.org/MIAME1.1-DenverDraft.DOC> represents an attempt to extend this to toxicology
Software must be developed to read/write MAGE-ML
Public databases need to be extended to meet Tox needs
Science
Integrating Expression with other data
[Figure: innate and adaptive immunity pathway diagram. Labels: LPS, LBP, BPI, CD14, MD-2, TLR Proteins, MyD88, IRAK2, TRAF-6, NIK, IκB, Degradation, NF-κB; Immunomodulatory Genes; Antigen Presentation; Cytokines and Adhesion Proteins; Inflammatory Cell Recruitment; Pathophysiologic Conditions (Sepsis, ARDS, Asthma); Innate Immunity; Adaptive Immunity. Adapted from Godowski. NEJM 1999; 340:1835.]
David Schwartz
BXD Recombinant Inbred Strains (n=32)
[Figure: Lavage PMNs x 10^3/ml by strain, axis ticks 200 to 1000. Examples: C57BL/6, DBA/2, BXD5, BXD29, BXD39, BXD42.]
[Figure: hybridization design. Labels: P1, P2, L1, L2, L3, H1, H2, H3; P1+, P2+, L1+, L2+, L3+, H1+, H2+, H3+; R (P1+P2).]
53 Hybridizations
Result: ~425 “significant” genes
IDEA: Build QTL Maps and use those to filter expression data
Goal: Find differentially expressed genes genetically linked to response
BXD Recombinant Inbred Strains (n=32)
[Figure: Lavage PMNs x 10^3/ml by strain, axis ticks 200 to 1000. Examples: C57BL/6, DBA/2, BXD5, BXD29, BXD39, BXD42.]
Microarray Expression-QTL Consensus Candidate Genes
Genes in QTL: 525
Genes by Microarray: 426
Overlap: 46
Candidate genes for follow-up and validation
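The filtering idea reduces to a set intersection: keep only genes that are both in a QTL interval and called differentially expressed. A minimal sketch, with placeholder gene identifiers rather than the real 525 and 426 gene lists:

```python
def consensus_candidates(qtl_genes, array_genes):
    """Genes that lie in a QTL interval AND are significant on the array."""
    return sorted(set(qtl_genes) & set(array_genes))


# Placeholder identifiers, not the genes from the study.
qtl_genes = ["gene_A", "gene_B", "gene_C", "gene_D"]
array_genes = ["gene_C", "gene_E", "gene_A"]
candidates = consensus_candidates(qtl_genes, array_genes)
```

In this toy case two genes survive the filter. The practical difficulty, noted elsewhere in the talk, is that both lists must use the same identifiers for the same gene before the intersection means anything.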
Candidate Gene Set for LPS response
BG076932  annexin A1 (Anxa1)
BG085317  arginase type II (Arg2)
BG064781  cytidine 5'-triphosphate synthase (Ctps)
BG085740  ets-related transcription facto
BG063515  ferritin heavy chain (Fth)
BG078398  MARCKS-like protein (Mlp)
AW556835  protein tyrosine phosphatase, non-receptor type 2 (Ptpn2)
BG077485  ring finger protein (C3HC4 type) 19 (Rnf19)
BG085186  surfactant protein-D gene
AW550270  tenascin C (Tnc)
BG065761  tumor necrosis factor, alpha-induced protein 2 (Tnfaip2)
BG074379  co-chaperone mt-GrpE#2 precursor putative
BG080688  CSF-1
BG067349  C-type lectin Mincle
BG073439  DKFZp564O1763
AW551388  E2F-like transcriptional repressor protein
BG076460  glutamate-cysteine ligase catalytic subunit (GLCLC)
BG080666  gly96
BG067921  GTP binding protein
BG072974  DKFZp547B146
BG070296  DKFZp566F164
BG074109  Hsp86-1
BG077487  hypoxia inducible factor 1
BG078274  I kappa B alpha gene
BG084405  IAP-1
BG069214  inhibitor of apoptosis protein 1
BG067127  interferon regulatory factor 1
BG080268  KC
BG070106  lipocalin
BG064651  MAIL
BG063925  metallothionein II
BG077818  metallothionein-I
BG073108  MHC class III region RD
BG064928  mitogen-responsive 96
BG072801  S100A9
BG086320  SDF-1-beta
BG072793  T-cell activating protein
BG073446  TH1 protein
BG072227  TNFa
BG068491
BG071081
BG067341
BG067620
BG067670
BG066678
BG071169
Sleep Deprivation Studies in Mouse
[Figure: study design timeline with ticks at 0, 3, 6, 9, and 12; "zz" symbols mark sleep.]
Experimental Paradigm
Compare gene expression between sleeping and sleep-deprived mice in cortex and hypothalamus
Perform 3 biological replicates
Normalize and filter data and use data mining techniques to select distinct patterns of gene expression
Use Gene Ontology (GO) assignments to classify genes by cellular localization, molecular function, biological process
Use GO analysis to develop an understanding of response
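The GO classification step amounts to tallying how many of the selected genes fall under each annotation. A minimal sketch, assuming annotations are available as a mapping from gene to a list of terms. The gene names, the terms, and the "unannotated" placeholder are all illustrative.

```python
from collections import Counter


def tally_go_terms(selected_genes, go_annotations):
    """Count how many selected genes carry each GO term."""
    counts = Counter()
    for gene in selected_genes:
        # Genes with no annotation are counted under a placeholder label.
        for term in go_annotations.get(gene) or ["unannotated"]:
            counts[term] += 1
    return counts


# Illustrative annotations only.
go_annotations = {
    "gene_1": ["energy metabolism"],
    "gene_2": ["energy metabolism", "stress response"],
    "gene_3": ["signal transduction"],
}
selected = ["gene_1", "gene_2", "gene_4"]
counts = tally_go_terms(selected, go_annotations)
```

A tally like this only describes the selected set. Deciding whether a category is over-represented would additionally need the counts for all genes on the array.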
Differential Expression in Cortex
Energy Metabolism; Transcription; Mitochondrial and Ribosomal Proteins
Stress Response
Metabolism and Signal Transduction
Differential Expression in Hypothalamus
Sleep signaling
Predicting Outcome
The problem
Patients present with tumors, many of which are indistinguishable.
Histology can provide some information, but it has little predictive power.
Microarrays provide a “fingerprint” that can serve as a phenotypic measure that may be linked to outcome.
This is a huge problem in data mining.
The problem in pictures: Adenocarcinomas
cDNA Multi-Organ Cancer Classifier
32k Human Arrays
77 tumor samples; 144 hybridization assays (breast, ovary, lung)
Normalization and flip-dye replica consistency check
Statistical filtering of genes (Kruskal-Wallis H-test), p < 0.05: 685 genes
UNSUPERVISED CLASSIFICATION: hierarchical clustering (Pearson correlation)
SUPERVISED CLASSIFICATION: artificial neural network training and validation
Divide experiments into training and validation sets: Training 75%, Validation 25%
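For readers unfamiliar with the Kruskal-Wallis filter, here is a minimal sketch of the H statistic for a single gene across tumor classes. It assigns average ranks to ties but omits the tie correction factor, and it stops short of the p-value (H is compared against a chi-square distribution with k - 1 degrees of freedom for k groups). The expression values are invented.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# One gene measured in three tumor classes (invented values).
informative = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
uninformative = [[1.0, 4.0, 7.0], [2.0, 5.0, 8.0], [3.0, 6.0, 9.0]]
h_informative = kruskal_wallis_h(informative)
h_uninformative = kruskal_wallis_h(uninformative)
```

A gene whose values separate cleanly by class gets a large H, and a gene whose values are spread evenly across classes gets a small one. In practice a library routine such as `scipy.stats.kruskal` would also apply the tie correction and return the p-value.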
Neural Networks and Cancer
Input data: a list of genes with expression levels
Output data: a tumor type call
“Hidden layers” allow complex connections
Training: adjusts weights and connections
Breast Tumor
Tumors in the Universal Classifier
543 tumor samples; 21 tumor types; 95% of all cancers
Tumor Type                                      Number of Samples   Array Platform
Bladder                                         19                  U95, HU6800
Breast                                          42                  U95, HU6800, TIGR 32k
Central Nervous – Atypical Teratoid/Rhabdoid    10                  HU6800
Central Nervous – Glioma                        10                  HU6800
Central Nervous – Medulloblastoma               70                  HU6800
Colon                                           41                  U95, HU6800, TIGR 32k
Stomach/EG Junction                             30                  U95, TIGR 32k
Kidney                                          31                  U95, HU6800, TIGR 32k
Leukemia – Acute Lymphocytic B Cell             10                  HU6800
Leukemia – Acute Lymphocytic T Cell             10                  HU6800
Leukemia – Acute Myelogenous                    10                  HU6800
Lung – Adenocarcinoma                           71                  U95, HU6800, TIGR 32k
Lung – Squamous Cell Carcinoma                  21                  U95
Lymphoma – Follicular                           11                  HU6800
Lymphoma – Large B Cell                         11                  HU6800
Melanoma                                        10                  HU6800
Mesothelioma                                    10                  HU6800
Ovary                                           44                  U95, HU6800, TIGR 32k
Pancreas                                        26                  U95, HU6800, TIGR 32k
Prostate                                        42                  U95, HU6800
Uterus                                          10                  HU6800
[Figure: classifier pipeline. Stages: Data Acquisition; Normalization and Scaling; Statistical Screening; Neural Network Training and Validation. Elements: Microarray Database; All Normalized and Scaled Genes; Kruskal-Wallis, Bonferroni f(x); Correlative Gene Subset; Training Set (Tumor 1, Tumor 2, ... Tumor n); Test Set (Tumor 1, Tumor 2, ... Tumor n); Classifier.]
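The figure pairs the Kruskal-Wallis screen with a Bonferroni step but does not say how it was applied. A minimal sketch of the standard Bonferroni rule follows, assuming one p-value per gene and a family-wise level of 0.05. The gene names and p-values are invented.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Keep genes whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return sorted(gene for gene, p in p_values.items() if p < threshold)


# Invented per-gene p-values from a screening test.
p_values = {
    "gene_1": 0.001,
    "gene_2": 0.02,
    "gene_3": 0.0124,
    "gene_4": 0.5,
}
selected = bonferroni_select(p_values)
```

With four tests the per-gene threshold is 0.0125, so `gene_2` passes an uncorrected 0.05 cutoff but fails after correction. On arrays with tens of thousands of genes the threshold becomes very strict, which is one reason the surviving gene subsets are small.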
[Figure: combining array platforms. U95A = 124; Hu6800 = 136. Per-platform gene tables (U95A, Hu6800, TIGR), e.g. Gene 1 2.2, Gene 2 0.5, Gene 3 1.2, ... Scaling options: Average Across Chips using Reference; Gene-by-Gene using Reference.]
Summary
We collected 540 expression profiles
  21 tumor types
  95% of all cancers
10 Independent Classifiers
  75% of data for training, 25% for test
  Average ~88% accuracy
Web based Classifier available
  So far, 7 of 8* in classification
  84% accuracy in classifying primary source of mets
* Bad RNA
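The evaluation scheme in the summary (several independent 75%/25% splits, accuracy averaged over them) can be sketched independently of the classifier. The example below swaps in a nearest-centroid classifier for the neural network purely to keep the code short. The data, the seed, and all function names are illustrative assumptions.

```python
import random


def fit_centroids(samples, labels):
    """Mean expression profile per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


def repeated_holdout_accuracy(samples, labels, n_repeats=10,
                              train_fraction=0.75, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train, test = indices[:cut], indices[cut:]
        centroids = fit_centroids([samples[i] for i in train],
                                  [labels[i] for i in train])
        correct = sum(predict(centroids, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Toy data: two well-separated classes with two "genes" each.
samples = [[0.1 * i, 0.1 * i] for i in range(8)] + \
          [[10 + 0.1 * i, 10 + 0.1 * i] for i in range(8)]
labels = ["breast"] * 8 + ["lung"] * 8
mean_accuracy = repeated_holdout_accuracy(samples, labels)
```

On this artificially clean data every split is classified perfectly. Real expression data would give a spread of accuracies across splits, which is what the averaged figure in the summary condenses.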
Further challenges in analysis?
Statistical significance is not the same as biological significance
If you perturb a system, many genes change their expression levels
Multiple pathways and features in the data can be revealed through different analysis methods
Genes which are good for classification or prognostics may not be biologically relevant
Extracting meaning from microarrays will require new software and tools
The most important thing we need is more data collected and stored in a standard fashion
Barriers to Toxicology Applications
The “complete” genomes are incomplete
Many of the signatures we see on arrays do not have immediate biological implications
Most often genes are included on the arrays that are used solely for normalization
Larger datasets may reveal diagnostic or prognostic patterns that are not obvious at present
Reported “variation” in the assays must be understood
  Differences in laboratory and analysis protocols are likely sources
There is a need to define QC and analysis standards
There is clearly a need for a large database of expression profiles linked to other relevant ancillary information
Science is built with facts as a house is with stones – but a collection of facts is no more a science than a heap of stones is a house.
– Jules Henri Poincare
Acknowledgments
<johnq@tigr.org>
The TIGR Gene Index Team: Foo Cheung, Svetlana Karamycheva, Yudan Lee, Babak Parvizi, Geo Pertea, Razvan Sultana, Jennifer Tsai, John Quackenbush, Joseph White
Funding provided by the Department of Energy and the National Science Foundation
TIGR Human/Mouse/Arabidopsis Expression Team: Emily Chen, Bryan Frank, Renee Gaspard, Jeremy Hasseman, Heenam Kim, Lara Linford, Simon Kwong, John Quackenbush, Shuibang Wang, Yonghong Wang, Ivana Yang, Yan Yu
Array Software Hit Team: Nirmal Bhagabati, John Braisted, Tracey Currier, Jerry Li, Wei Liang, John Quackenbush, Alexander I. Saeed, Vasily Sharov, Mathangi Thaiagarjian, Joseph White
Assistant: Sue Mineo
Funding provided by the National Cancer Institute, the National Heart, Lung, Blood Institute, and the National Science Foundation
H. Lee Moffitt Center/USF: Timothy J. Yeatman, Greg Bloom
TIGR PGA Collaborators: Norman Lee, Renae Malek, Hong-Ying Wang, Truong Luu, Bobby Behbahani
TIGR Faculty, IT Group, and Staff
PGA Collaborators: Gary Churchill (TJL), Greg Evans (NHLBI), Harry Gavaras (BU), Howard Jacob (MCW), Anne Kwitek (MCW), Allan Pack (Penn), Beverly Paigen (TJL), Luanne Peters (TJL), David Schwartz (Duke)
Emeritus: Jennifer Cho (TGI), Ingeborg Holt (TGI), Feng Liang (TGI), Kristie Abernathy (mA), Sonia Dharap (mA), Julie Earle-Hughes (mA), Cheryl Gay (mA), Priti Hegde (mA), Rong Qi (mA), Erik Snesrud (mA)