limmabotta/didattica/l1.3.pdf · 2007. 5. 28. · • t-statistics is widespread in assessing...
TRANSCRIPT
• The t-statistic is widely used to assess differential expression.
• Unstable variance estimates that arise when the sample size is small can be corrected using:
– Error fudge factors (SAM)
– Bayesian methods (Limma)
Limma
Linear model analysis of microarrays
Bayesian regularized t-test (Baldi & Long 2001)
$$t = \frac{m_C - m_T}{\sqrt{\dfrac{\sigma_C^2}{n_C} + \dfrac{\sigma_T^2}{n_T}}}$$
The method tries to decouple the mean–variance dependency by modeling the variance of the expression of a gene as a function of the mean expression of the gene.
The empirical variance is modulated by $\nu_0$ 'pseudo-observations' associated with a background variance $\sigma_0^2$.
Bayesian regularized t-test
The main goal of this approach is to stabilize the variance estimates that arise when the sample size is small, making the t-test results more robust.
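The regularization can be sketched in a few lines. This is a minimal illustration, assuming the posterior variance form (ν0·σ0² + (n−1)·s²) / (ν0 + n − 2) described by Baldi & Long (2001); the function names, the way the background variance is supplied, and the example values are hypothetical and not taken from the slides.

```python
from math import sqrt
from statistics import mean, variance


def regularized_variance(values, sigma0_sq, nu0):
    """Blend the empirical variance with a background variance.

    Assumed form: (nu0 * sigma0^2 + (n - 1) * s^2) / (nu0 + n - 2),
    where nu0 acts as a number of pseudo-observations.
    """
    n = len(values)
    s_sq = variance(values)  # unbiased sample variance
    return (nu0 * sigma0_sq + (n - 1) * s_sq) / (nu0 + n - 2)


def regularized_t(control, treated, sigma0_sq_c, sigma0_sq_t, nu0):
    """t-statistic computed with regularized variances in the denominator."""
    var_c = regularized_variance(control, sigma0_sq_c, nu0)
    var_t = regularized_variance(treated, sigma0_sq_t, nu0)
    return (mean(control) - mean(treated)) / sqrt(
        var_c / len(control) + var_t / len(treated)
    )


# Hypothetical log-expression values for one gene, three replicates per group.
control = [8.00, 8.10, 8.05]
treated = [9.00, 9.10, 8.95]
print(regularized_t(control, treated, sigma0_sq_c=0.04, sigma0_sq_t=0.04, nu0=10))
```

With few replicates the empirical variance can be tiny by chance; the pseudo-observations pull it toward the background variance, which keeps the denominator of the t-statistic from collapsing.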
Bayesian regularized t-test
The regularized t-test makes the presence of significant differential expression more evident.
BH correction
• BH is the most widely used method for the correction of type I errors in microarray analysis.
• However, it has some limitations due to its initial hypotheses:
– The gene expressions are independent of each other.
– The raw distribution of p-values should be uniform in the non-significant range.
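For reference, the BH (Benjamini-Hochberg) adjustment itself is short. The sketch below is a plain-Python illustration of the step-up procedure, not the code used by any specific package; the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A gene is called significant at FDR level q when its adjusted p-value is at most q. If the raw p-values carry no excess of small values, the adjusted values stay large and no gene passes.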
The application of BH correction to these p-values will not produce any differentially expressed genes!
Venn Diagrams
Time Course experiments
• maSigPro is an R package for the analysis of single and multiseries time course microarray experiments.
• maSigPro follows a two-step regression strategy to find genes with:
– significant temporal expression changes
– significant differences between experimental groups.
• Time course experimental design:
– We denote as experimental groups the experimental factor (dummy variables) for which temporal profiles are defined (e.g. "Treatment A", "Tissue1", etc.)
– Conditions are each experimental group vs. time combination (e.g. "Treatment A at Time 0"). Conditions may or may not have replicates.
– Variables are the regression variables defined by the maSigPro approach for the experiment regression model.
– maSigPro defines dummy variables to model differences between experimental groups.
– Dummy variables, Time and their interactions are the variables of the regression model.
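To make the roles of dummy variables, Time and their interactions concrete, here is a small sketch that builds one design-matrix row per condition. It is an illustration only: it assumes a model that is linear in time, and the function names and group labels are hypothetical rather than maSigPro's own code.

```python
def design_row(group, time, groups, reference):
    """Build [intercept, Time, dummies..., dummy x Time...] for one sample."""
    # One dummy variable per non-reference experimental group.
    dummies = [1.0 if group == g else 0.0 for g in groups if g != reference]
    # Interaction terms let each group have its own slope over time.
    interactions = [d * time for d in dummies]
    return [1.0, float(time)] + dummies + interactions


def design_matrix(conditions, reference):
    """conditions: list of (experimental group, time) pairs, one per array."""
    groups = sorted({g for g, _ in conditions})
    return [design_row(g, t, groups, reference) for g, t in conditions]


# Hypothetical two-group design measured at two time points.
conditions = [("Control", 0), ("Control", 3), ("TreatmentA", 0), ("TreatmentA", 3)]
for row in design_matrix(conditions, reference="Control"):
    print(row)
```

A non-zero coefficient on a dummy column indicates a shift between groups, and a non-zero coefficient on an interaction column indicates a different temporal profile.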
Time Course design for maSigPro
All this information should be collapsed in the Target column of the targets file, using _ to combine the data. This can be done using the function JOIN in Excel.
IMPORTANT: each treatment at each time has its corresponding untreated control!
Do not forget!
• The multiple testing problem is also present in maSigPro analysis.
• Therefore, before running maSigPro, remember to perform some filtering based on functional information or on the sample distribution.
Some parameters need to be set
Q: The first step is to compute a regression fit for each gene. The p-value associated with the F-statistic of each model is computed and subsequently used to select significant genes. maSigPro corrects this p-value for multiple comparisons by applying false discovery rate (FDR) procedures. The level of FDR control is given by the function parameter Q.
Alpha: maSigPro applies a variable selection procedure to find the significant variables for each gene. This will ultimately be used to find the profile differences between experimental groups. At each regression step the p-value of each variable is computed, and variables enter or leave the model when this p-value is lower or higher than the given cut-off value alfa.
R-squared: The following step is to generate lists of significant genes according to the way we want to see the results. As a filter, maSigPro uses the R-squared of the regression model.
What is the R-squared coefficient?
• r.squared: the "fraction of variance explained by a linear model"
R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2)
where R[i] is the i-th residual and y* is the mean of y[i] if there is an intercept and zero otherwise.
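The definition translates directly into code. The following is a minimal sketch in plain Python; the function name and the toy data are illustrative.

```python
def r_squared(y, fitted, intercept=True):
    """R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2)."""
    # y* is the mean of y when the model has an intercept, zero otherwise.
    y_star = sum(y) / len(y) if intercept else 0.0
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    total_ss = sum((yi - y_star) ** 2 for yi in y)
    return 1.0 - residual_ss / total_ss


y = [1.0, 2.0, 3.0]
print(r_squared(y, fitted=[1.0, 2.0, 3.0]))  # fitted values equal the data
print(r_squared(y, fitted=[2.0, 2.0, 2.0]))  # fitted values equal the mean
```

The two calls correspond to the extreme cases: a model that reproduces the data exactly, and a model that explains nothing beyond the mean.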
R-squared graphical view
[Figure: scatter plots of Y vs. X comparing the residual sum of squares, Sum(R[i]^2), with the total sum of squares, Sum((y[i] - y*)^2)]
• General case: R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2)
• Sum(R[i]^2) = 0: R^2 = 1 - 0 / Sum((y[i] - y*)^2) = 1
• Sum(R[i]^2) = Sum((y[i] - y*)^2): R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2) = 0
Computation information is available in the main R window.
Step 1
The procedure first fits this global model by the least-squares technique to identify differentially expressed genes, and selects significant genes by applying false discovery rate control procedures.
Step 2
Secondly, stepwise regression is applied as a variable selection strategy to study differences between experimental groups and to find statistically significant different profiles.
Analysis pipeline
[Diagram stages: Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction, Quality control]
Annotation
• An important issue in microarray data analysis is the specific association of probe identifiers with genome annotated transcripts.
• A critical point in annotation is the way in which the association between probes and genes is produced.
Annotation in Affymetrix
• NetAffx: Affymetrix annotation repository
• Bioconductor:
– uses a specific annotation library, AnnBuilder, to create annotation libraries starting from the association probe set identifier → GenBank accession number (i.e. the primary target for probe design).
• RESOURCERER (Tsai et al. 2001):
– the annotation tool at the TIGR center uses EST and gene sequences stored in the TGI databases (www.tigr.org/tdb/tgi.shtml).
– They provide an analysis of publicly available EST and gene sequence data for the identification of transcripts and their placement in a genomic context, and the identification of orthologs and paralogs wherever possible.
• Neither Bioconductor nor TIGR methods operate at the probe level, nor do they consider the limited reliability of some sets due to probe cross-hybridization or erroneous probe/transcript annotation.
• Ensembl:
– Annotation with the Ensembl tool is built by direct matching of Affymetrix probes over the Ensembl sequence database.
– Its weak point is that matching of only 50% of the probes of a specific set to an Ensembl gene is needed for a true association definition "probe set identifier"/"Ensembl gene identifier".
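The 50% rule can be illustrated with a toy function. This is only a sketch of the idea as stated above, assuming "at least 50%" as the cut-off; the function name, the input layout and the gene identifiers are hypothetical and do not reproduce the Ensembl pipeline.

```python
def assign_probe_set(probe_hits, min_fraction=0.5):
    """Return the genes matched by at least min_fraction of the probes.

    probe_hits holds, for each probe of one probe set, the identifier of
    the gene it matches, or None when the probe matches nothing.
    """
    counts = {}
    for gene in probe_hits:
        if gene is not None:
            counts[gene] = counts.get(gene, 0) + 1
    n_probes = len(probe_hits)
    return sorted(g for g, c in counts.items() if c / n_probes >= min_fraction)


# Hypothetical probe set with 11 probes: 6 match G1, 3 match G2, 2 match nothing.
hits = ["G1"] * 6 + ["G2"] * 3 + [None] * 2
print(assign_probe_set(hits))
```

In this example the probe set is associated with G1 even though almost half of its probes either match another gene or do not match at all, which is the weakness noted above.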
Gene Ontology
Ontologies
• An ontology is a specification of a conceptualization:
– a hierarchical mapping of concepts within a given frame of reference.
• An ontology is a restricted structured vocabulary of terms that represent domain knowledge.
• An ontology specifies a vocabulary that can be used to exchange queries and assertions.
• A commitment to the use of the ontology is an agreement to use the shared vocabulary in a consistent way.
The Gene Ontology
• The goal of the Gene Ontology (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
– http://www.geneontology.org/
• For genes and gene products, the Gene Ontology (GO) Consortium is an initiative designed to address the problem of defining a common set of terms and descriptions for basic biological functions.
• GO provides a restricted vocabulary as well as clear indications of the relationships between terms.
• The Gene Ontology (GO) consortium produces three independent ontologies for gene products.
• The three ontologies are:
– molecular function of a gene product, which is defined to be the biochemical activity or action of the gene product (MF 7220).
– biological process, interpreted as a biological objective to which the gene product contributes (BP 9529).
– cellular component, a component of a cell that is part of some larger object or structure (CC 1536).
The Graph Structure of GO
• The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents.
• GO node is interchangeable with GO term.
• Child terms are more specific than their parents:
– The term "transmembrane receptor protein-tyrosine kinase" is a child of "transmembrane receptor" and "protein tyrosine kinase".
• The relationship between a child and a parent can be characterized by the relations:
– is a
– has a (part of)
• "mitotic chromosome" is a child of "chromosome" and the relationship is an is a relation.
• "telomere" is a child of "chromosome" with the has a relation.
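A DAG of this kind can be represented as a mapping from each term to its parents. The sketch below uses the example terms from the slides; the relation attached to the "transmembrane receptor protein-tyrosine kinase" edges is an assumption made for illustration, and the data structure is not that of any GO software.

```python
# Each child term maps to a list of (parent term, relation) pairs.
PARENTS = {
    "transmembrane receptor protein-tyrosine kinase": [
        ("transmembrane receptor", "is a"),   # relation assumed for illustration
        ("protein tyrosine kinase", "is a"),  # relation assumed for illustration
    ],
    "mitotic chromosome": [("chromosome", "is a")],
    "telomere": [("chromosome", "has a (part of)")],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following child-to-parent edges."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relation in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(ancestors("transmembrane receptor protein-tyrosine kinase"))
print(ancestors("telomere"))
```

Because a term may have more than one parent, a gene annotated to a child term is implicitly annotated to all of its ancestors, which is why the traversal collects a set rather than a single path.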
GO structure
[Figure: graph of GO relationships for the term transcription factor (GO:0003700), with the top node marked]
Induced GO graph for a set of differentially expressed genes.
GO can be used to link differentially expressed genes to specific functional classes.
[Figure: the induced GO graph, with the top node marked, colored according to unadjusted hypergeometric p-value ≤ 0.01]
Consider a population of genes representing a diverse set of GO terms, shown in the figure as different colors.
Many methods can be used to identify a set of differentially expressed genes.
What are some of the predominant GO terms represented in the set of differentially expressed genes, and how should significance be assigned to a discovered GO term?
Example:
Population size: 40 genes
Subset of differentially expressed genes: 12 genes
10 genes, shown in light blue, have a common GO term and 8 occur within the set of differentially expressed genes.
Contingency Matrix
A 2x2 contingency matrix is typically used to capture the relationship between membership in the differentially expressed subset and membership in a GO term.

                 Subset: in   Subset: out
GO term: in           8            2
GO term: out          4           26
Hypergeometric Distribution

             a      b   |  a+b
             c      d   |  c+d
           ---------------------
            a+c    b+d  |   n

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule, with $n = a + b + c + d$:

$$p = \frac{\dfrac{(a+c)!}{a!\,c!} \times \dfrac{(b+d)!}{b!\,d!}}{\dfrac{n!}{(a+b)!\,(c+d)!}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}$$
Assigning Significance to the Findings
The hypergeometric test permits us to determine whether there are non-random associations between the two variables: membership in the differentially expressed subset and membership in a particular Gene Ontology term.

( 2x2 contingency matrix )
                 Subset: in   Subset: out
GO term: in           8            2
GO term: out          4           26

p ≈ .0002
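The worked example can be checked numerically. The sketch below evaluates the hypergeometric rule with binomial coefficients and sums it over matrices at least as extreme as the observed one, keeping the margins fixed. The function names are illustrative, and the one-sided summation is an assumption about how the slide's p-value was obtained.

```python
from math import comb


def matrix_probability(a, b, c, d):
    """Probability of one 2x2 matrix with fixed margins (hypergeometric rule)."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def enrichment_p_value(a, b, c, d):
    """One-sided p-value: top-left cell equal to or larger than observed."""
    total = 0.0
    max_a = min(a + b, a + c)  # the top-left cell cannot exceed either margin
    for k in range(a, max_a + 1):
        shift = k - a
        # Moving genes into the top-left cell while keeping all margins fixed.
        total += matrix_probability(k, b - shift, c - shift, d + shift)
    return total


# a: in GO term and in subset, b: in GO term only,
# c: in subset only, d: in neither (values from the example).
p = enrichment_p_value(8, 2, 4, 26)
print(f"{p:.6f}")
```

For the example matrix the result is about 2.3e-4, in line with the p ≈ .0002 reported on the slide; the observed matrix alone contributes nearly all of that probability.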
EASE (Expression Analysis Systematic Explorer)
• EASE analysis identifies prevalent biological themes within gene clusters.
• The highest-ranking themes derived by a computational method can recapitulate manually derived themes in previously published microarray, proteomics and SAGE results, and provide evidence that these themes are stable to varying methods of gene selection.
Hosack et al. Genome Biol., 4:R70-R70.8, 2003.
EASE Results
• Consider all of the results
EASE reports all themes represented in a cluster. Although some themes may not reach statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.
• Independently verify roles
Once found, biological themes should be independently verified using annotation resources.
GOstats package
• To perform an analysis using the hypergeometric-based test, one needs to define a gene universe and a list of selected genes from the universe.
• To identify the set of expressed genes from a microarray experiment, R. Gentleman (GOstats developer) proposed that a non-specific filter be applied and that the genes that pass the filter be used to form the universe for any subsequent functional analyses.
The reason for this representation is the selection of the GO terms that contain smaller subsets.
Classification
• The task of diagnosing cancer on the basis of microarray data has been termed class prediction in the literature.
• The task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile.
The example classification problem used in the PAM publication
• Data for small round blue cell tumors (SRBCT) of childhood (Khan et al. 2001), consisting of expression measurements on 2,308 genes, were obtained from glass-slide cDNA microarrays.
• The tumors are classified as:
– Burkitt lymphoma (BL),
– Ewing sarcoma (EWS),
– neuroblastoma (NB),
– rhabdomyosarcoma (RMS).
• A total of 63 training samples and 25 test samples were provided, although five of the latter were not SRBCTs.
PAM
• PAM is a modification of the nearest-centroid method, called "nearest shrunken centroid."
• PAM uses "de-noised" versions of the centroids as prototypes for each class.
[Figure: centroids (grey) and shrunken centroids (red) for the SRBCT dataset. The overall centroid has been subtracted from the centroid of each class.]
• SRBCT classification: training (tr, green), cross-validation (cv, red), and test (te, blue) errors are shown as a function of the threshold parameter.
• The value 4.34 is chosen and yields a subset of 43 selected genes.
• Shrunken differences d_ik for the 43 genes having at least one nonzero difference.
• The genes with nonzero components in each class are almost mutually exclusive.
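The shrinkage behind the selected genes can be sketched as soft thresholding of the differences d_ik. This is a minimal illustration, assuming the d_ik (standardized differences between class centroids and the overall centroid) have already been computed; the function names and the toy values are hypothetical.

```python
def soft_threshold(d, delta):
    """Move d toward zero by delta; values within delta of zero become 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def shrink_differences(d_ik, delta):
    """Apply soft thresholding to every gene (row) and class (column)."""
    return [[soft_threshold(d, delta) for d in row] for row in d_ik]


def selected_genes(shrunken):
    """Indices of genes with at least one nonzero shrunken difference."""
    return [i for i, row in enumerate(shrunken) if any(v != 0.0 for v in row)]


# Hypothetical d_ik for three genes and two classes, threshold 4.34 as on the slide.
d_ik = [[5.0, -1.0], [2.0, -3.0], [-6.0, 0.5]]
shrunken = shrink_differences(d_ik, delta=4.34)
print(shrunken)
print(selected_genes(shrunken))
```

Genes whose differences are shrunk to zero in every class no longer influence the classification, so raising the threshold reduces the number of selected genes.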
PAM performance
• Misclassification rates for seven classifiers on six microarray datasets, based on 50 random partitions into learning sets (two-thirds of the data) and test sets (one-third of the data).
• The nearest shrunken centroid classifier (PAM), as well as the simple benchmarks NNR and DLDA, do surprisingly well and can almost keep up except on the prostate data (the largest dataset in the analysis).
• The success of such methodologically simple tools is limited to gene expression datasets with small sample size.
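The evaluation scheme (repeated random partitions into two-thirds learning and one-third test data) can be sketched as follows. This is an illustration of the resampling only; the function names, the fixed seed and the use of 63 samples are assumptions for the example, and no classifier is implemented here.

```python
import random


def random_partitions(n_samples, n_partitions=50, learning_fraction=2 / 3, seed=0):
    """Return (learning indices, test indices) for each random partition."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_learn = round(n_samples * learning_fraction)
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_partitions):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        splits.append((sorted(shuffled[:n_learn]), sorted(shuffled[n_learn:])))
    return splits


def misclassification_rate(predicted, actual):
    """Fraction of test samples assigned to the wrong class."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


splits = random_partitions(63)
learning, test = splits[0]
print(len(splits), len(learning), len(test))
```

A classifier would be trained on each learning set and scored on the matching test set, and the 50 misclassification rates would then be summarized per classifier and dataset.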