CS 5263 Bioinformatics
Lecture 23: Microarray Data Analysis

What is a Microarray
- Conceptually similar to a (reverse) Northern blot
- (Many) probes, rather than mRNAs, are fixed on some surface in an ordered way
- [Figure: probe spots labeled Gene 1 to Gene 300]

Microarray categories
- cDNA microarray
  - Each probe is the cDNA of a gene (500-5000 nt)
- Oligonucleotide microarray
  - Each probe is a synthesized short DNA (uniquely corresponding to a substring of a gene)
  - Affymetrix: ~25-mers
  - Agilent: ~60-mers

Spotted cDNA microarray

Array Manufacturing
- Each tube contains cDNAs corresponding to a unique gene.
- The cDNAs are pre-amplified and spotted onto a glass slide.

Experiment
- [Figure labels: cy3, cy5]

Data acquisition
- Computer programs are used to process the image into digital signals.
- Segmentation: determine the boundary between signal and background
- Results: gene expression ratios between two samples

Affymetrix GeneChip

Array Design
- Multiple probes (11~16) for each gene
- (Figure from Affymetrix Inc.)

Experiment
- (Figure from Affymetrix Inc.)
- Each probe set combines to give an absolute expression level.
- Image segmentation is relatively easy, but the background still needs to be subtracted.

- Affymetrix GeneChip: one-color design
- cDNA microarray: two-color design

Preprocessing
- Image processing: analog to digital
- Background subtraction: account for non-specific hybridization
- Transformation: convenience, normal distribution assumptions
- Normalization: remove systematic biases
- Filtering, averaging, etc.: remove random noise
- The order may be different, and steps may be combined.

Background subtraction
- For cDNA array, relatively straightforward
- For oligo array, how to combine PM (perfect match) and MM (mismatch) probes?
  - Does MM really measure non-specific hybridization?
  - Recent studies suggest ignoring MM entirely or using it with caution
- Available software tools
  - MAS 5 (by Affymetrix)
  - dChip
  - GCRMA

Transformation
- Log transformation for two-color array

Transformation
- Log transformation for one-color array
- When you get a data set from someone, be careful with the scale
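A minimal sketch of why ratios are usually log-transformed: on the log2 scale a two-fold increase and a two-fold decrease are symmetric around zero. The function name and the intensities are illustrative assumptions, not values from the lecture.

```python
import math

def log2_ratio(cy5, cy3):
    """Log2 ratio of two channel intensities (both must be positive)."""
    return math.log2(cy5 / cy3)

# Hypothetical intensities: a two-fold increase and a two-fold decrease.
up = log2_ratio(2000.0, 1000.0)
down = log2_ratio(1000.0, 2000.0)
print(up, down)
```

On the linear scale the same two changes would be 2.0 and 0.5, which is why it matters to know which scale a received data set uses.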

Normalization
- Purpose:
  - Correct for systematic errors
  - Make data from different samples comparable

Best approach to detect problems: visualization

An example data set
- J DeRisi, V Iyer, and P Brown, Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale, Science, 278: 680-686, 1997
- Yeast cells grow in glucose medium
- When glucose was depleted, cells change their metabolic pathways
- cDNA microarray
- Test: 2, 4, 6, 8, 10, 12, 14 hours after growth
- Control: 0 hour
- No replicates!
- No normalization!
- Use fold-change to get differentially expressed genes!

Histogram of log ratios
- Median = -0.27
- Two possibilities:
  - Dye effect
  - Sample difference

Total intensity normalization
- mean(cy3) = 3141
- mean(cy5) = 2838
- 3141 / 2838 = 1.11

Other options:
- Use median
- Use a subset of genes
  - Exclude 10% extreme
  - House-keeping genes
  - Spike-in genes
  - Etc.

Net effect: constant factor for every gene
- Median = -0.1
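Total intensity normalization can be sketched as follows: one constant factor rescales the cy5 channel so that both channels have the same mean. The function names and the toy intensities are assumptions for illustration; only the 3141 / 2838 ratio comes from the slide.

```python
from statistics import mean

def total_intensity_factor(cy3, cy5):
    """Constant factor that rescales cy5 so both channels have equal mean."""
    return mean(cy3) / mean(cy5)

def normalize_cy5(cy3, cy5):
    """Multiply every cy5 intensity by the same factor."""
    k = total_intensity_factor(cy3, cy5)
    return [k * v for v in cy5]

# Toy intensities (made up for illustration, not the DeRisi data).
cy3 = [1200.0, 300.0, 4500.0, 800.0]
cy5 = [1000.0, 280.0, 4000.0, 720.0]
cy5_norm = normalize_cy5(cy3, cy5)

# The slide's own numbers: 3141 / 2838 is about 1.11.
slide_factor = 3141 / 2838
print(round(slide_factor, 2))
```

Because every gene is multiplied by the same factor, every log2 ratio shifts by the same constant, so the histogram of log ratios moves but does not change shape.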

Intensity-intensity plot
- Total intensity normalization worked well here

Intensity-intensity plot
- Did not work well for this experiment
- Dye-swapping can probably help?

M-A plot
- A: log2(cy5 * cy3) = log2(cy5) + log2(cy3)
- M: log2(cy5 / cy3) = log2(cy5) - log2(cy3)
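As a small sketch, M and A for a single spot can be computed directly from the two definitions above. The function name and the intensities are hypothetical.

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A) for one spot, using the definitions on the slide."""
    m = math.log2(cy5) - math.log2(cy3)   # log ratio
    a = math.log2(cy5) + math.log2(cy3)   # log of the intensity product
    return m, a

# Hypothetical spot: cy5 is four times brighter than cy3.
m, a = ma_values(cy5=4096.0, cy3=1024.0)
print(m, a)
```

Many software packages define A as the average of the two log intensities, i.e. half of the sum used here; that only rescales the horizontal axis of the plot.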

M-A plot
- Dependency of M on A

Lowess normalization
- Lowess: Locally Weighted Regression
- Fit local polynomial functions
- M adjusted according to fitted line
- [Figure: M-A plots, axes A and M]
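The idea can be sketched with a simplified local linear fit: for each spot, fit a weighted line to the spots with the nearest A values, then subtract the fitted value from M. This is a bare-bones illustration with tricube weights and no robustness iterations, not the full lowess procedure; the function names, the `frac` parameter, and the toy data are assumptions.

```python
def lowess_fit(a, m, frac=0.5):
    """Local linear fit of M on A with tricube weights (no robustness steps)."""
    n = len(a)
    k = min(n, max(3, int(frac * n)))      # neighbours used in each local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12          # half-width of the local window
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        xbar = sum(wi * x for wi, x in zip(w, a)) / sw
        ybar = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted

def lowess_normalize(a, m, frac=0.5):
    """Subtract the fitted intensity-dependent trend from every M value."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]

# Toy data: M depends on A through a pure dye bias, with no real signal.
a_vals = [6.0 + 0.5 * i for i in range(20)]
m_vals = [0.1 * x - 1.0 for x in a_vals]
m_norm = lowess_normalize(a_vals, m_vals)
```

After normalization the toy M values are all essentially zero, since the whole trend was a dependency of M on A.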

Replicate filtering
- Genes with very high variability in replicates are questionable
- [Figure: scatter plots with axes Ratio 1 vs Ratio 2 and Log2(ratio1) vs Log2(ratio2)]

Preprocessing questions
- What kind of array is it?
  - Two-color? One-color?
  - Oligo array? cDNA array?
- How is the experiment designed?
  - Time series?
  - Test vs control? (what kind of control?)
- What kind of preprocessing has been done?
  - What values are given: raw intensities or ratios?
  - Transformation? Log scale? Linear scale?
  - Normalization: within-array? Cross-array?
- What are the next steps?
  - Identifying differentially expressed genes?
  - Clustering?

Identify differentially expressed genes
- Naïve approach: fold change
  - Log2(cy5 / cy3) > 1: up-regulated / induced
  - Log2(cy5 / cy3) < -1: down-regulated / repressed
- Still widely used because it is very simple
- Main problem: genes with low expression levels may have a large fold change by chance
  - From 10 to 100: ten-fold
  - From 1000 to 3000: three-fold
  - However: low intensity => relatively high variance
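A minimal sketch of the fold-change rule, assuming cy5 is the test sample and cy3 the control (the slide does not say which is which). The function name and the labels it returns are illustrative.

```python
import math

def call_by_fold_change(cy5, cy3, cutoff=1.0):
    """Classify one gene by its log2 ratio against a symmetric cutoff."""
    m = math.log2(cy5 / cy3)
    if m > cutoff:
        return "induced"
    if m < -cutoff:
        return "repressed"
    return "unchanged"

# The two examples from the slide, plus two made-up genes.
low_expr = call_by_fold_change(cy5=100.0, cy3=10.0)      # ten-fold
high_expr = call_by_fold_change(cy5=3000.0, cy3=1000.0)  # three-fold
small = call_by_fold_change(cy5=1500.0, cy3=1000.0)
down = call_by_fold_change(cy5=100.0, cy3=400.0)
print(low_expr, high_expr, small, down)
```

The rule treats the 10-to-100 gene as the stronger call even though a change of 90 intensity units may be within noise, which is the weakness the slide points out.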

Problem with fold change
- The most differentially expressed genes are the ones with the lowest average expression levels

More robust estimation of differential expression
- Estimate variance as a function of average expression
- Compute a Z-score depending on location: $Z(x) = (x - \mu) / \sigma(x)$
  - $x$: log2(R/G) value
  - $\mu$: local mean
  - $\sigma(x)$: local standard deviation
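One way to sketch the local Z-score is with a sliding window: sort genes by average intensity, and compare each log ratio with the mean and standard deviation of its neighbours in that ordering. Using a window over A as the "location" and including the gene itself in its own window are assumptions of this sketch, as are the names and toy data.

```python
from statistics import mean, stdev

def local_z_scores(a, m, window=5):
    """Z-score of each M value against the genes with the nearest A values."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    z = [0.0] * len(a)
    half = window // 2
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        neighbours = [m[j] for j in order[lo:hi]]   # includes gene i itself
        mu = mean(neighbours)
        sigma = stdev(neighbours)
        z[i] = (m[i] - mu) / sigma if sigma > 0 else 0.0
    return z

# Toy data: nine genes, one clear outlier in M at index 4.
a_vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
m_vals = [0.0, 0.1, -0.1, 0.0, 2.0, 0.1, -0.1, 0.0, 0.1]
z = local_z_scores(a_vals, m_vals, window=9)
```

With real data the window would be much narrower than the data set, so that low-intensity genes are judged against the larger variance of their low-intensity neighbours.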

SAM (Significance Analysis of Microarrays)
- Test: replicate 1, replicate 2, replicate 3
- Control: replicate 1, replicate 2, replicate 3

Which one is more significantly differentially expressed?

         T1     T2     T3     C1     C2     C3   Ratio
Gene1  1000   2000   1500    200    300    250       6
Gene2  1000   2000   3000   1000   1500    500       2
Gene3   100   1000    100     20     80     50       8
Gene4  1800   1700   1900   1000    800    900       2

SAM (Significance Analysis of Microarrays)
- Basic idea: Student's t-test

- Larger t => higher significance
- P-value can be directly computed for t-test
- Permutation test can be used

         T1     T2     T3     C1     C2     C3      t
Gene1  1000   2000   1500    200    300    250    4.3
P1     1000    300    300    200   2000    200   -0.4
P2     2000   1000    200    300   1000    250   0.96
P3      200   1000   1000    300    250   1000   0.60
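The t statistic and an exhaustive permutation test can be sketched on the four genes from the table. This uses the ordinary two-sample t statistic with unpooled variances as a stand-in for the basic idea; SAM itself uses a modified statistic, which is not reproduced here. Function names are illustrative, and the sketch assumes no relabelling produces two zero-variance groups.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def t_statistic(test, control):
    """Two-sample t statistic with unpooled standard error."""
    se = sqrt(variance(test) / len(test) + variance(control) / len(control))
    return (mean(test) - mean(control)) / se

def permutation_p_value(test, control):
    """Two-sided p-value over every way of relabelling the replicates."""
    pooled = list(test) + list(control)
    observed = abs(t_statistic(test, control))
    n_extreme = n_total = 0
    for idx in combinations(range(len(pooled)), len(test)):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        n_total += 1
        if abs(t_statistic(a, b)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total

genes = {
    "Gene1": ([1000.0, 2000.0, 1500.0], [200.0, 300.0, 250.0]),
    "Gene2": ([1000.0, 2000.0, 3000.0], [1000.0, 1500.0, 500.0]),
    "Gene3": ([100.0, 1000.0, 100.0], [20.0, 80.0, 50.0]),
    "Gene4": ([1800.0, 1700.0, 1900.0], [1000.0, 800.0, 900.0]),
}
t_stats = {name: t_statistic(t, c) for name, (t, c) in genes.items()}
p_gene1 = permutation_p_value(*genes["Gene1"])
```

Gene4 gets the largest t even though its ratio is only 2, because its replicates agree closely. With three replicates per group there are only 20 relabellings, so the smallest two-sided permutation p-value is 2/20.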

False Discovery Rate (FDR)
- Multiple testing problem
  - P-value cutoff = 0.05
  - We tested 10000 genes
  - Would expect 500 genes by chance at this significance level
- Bonferroni correction
  - Use p-value cutoff 0.05 / 10000
  - Meaning among all genes selected, only 0.05 false positives are expected
  - Too conservative
- False Discovery Rate (FDR)
  - FDR = 0.1, meaning among all genes selected (say 100), we would expect 10 to be false positives
  - Acceptable to biologists
  - Several different approaches to estimate
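To make the contrast concrete, here is a sketch of the Bonferroni cutoff next to the Benjamini-Hochberg step-up procedure, which is one of the several approaches to controlling FDR (the slide does not name a specific one). The p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests that pass the cutoff alpha / (number of tests)."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / n]

def benjamini_hochberg(p_values, fdr=0.1):
    """Indices of tests selected by the Benjamini-Hochberg step-up rule."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold.
        if p_values[i] <= rank / n * fdr:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values for ten genes.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
strict = bonferroni(p_vals, alpha=0.05)
relaxed = benjamini_hochberg(p_vals, fdr=0.1)
print(strict, relaxed)
```

On these toy values Bonferroni keeps a single gene while the FDR procedure keeps five, which illustrates why Bonferroni is considered too conservative for microarray screens.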

Microarray data analysis
- Often perform multiple experiments under varying conditions
  - Temperature
  - Time series
  - Different chemical treatment
  - Different tissue
  - Different mutant

Microarray data analysis
- For each gene we have a vector e_j = (e_1j, e_2j, ..., e_dj)

What to do next?

Supervised vs unsupervised learning
- Supervised learning (classification)
  - Associate genes with phenotypes
  - E.g.: genes A, B and C induced => cancer; repressed => not cancer
  - Goal: to learn such a function from data
  - Classification algorithms: decision tree, SVM, neural networks, naïve Bayes, etc.
- Unsupervised learning (clustering)

AML: acute myeloid leukemia
ALL: acute lymphoblastic leukemia
Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286: 531-537, 1999

Clustering microarray data
- Group genes into co-expressed sets
  - Genes with similar expression patterns across multiple experiments may be co-regulated

- Group experiments into clusters
  - Experiments within the same group may have similar gene expression signatures
  - For example, disease sub-types that can be classified from gene expression data

Clustering microarray data
- How to tell if two expression vectors are similar?
  - Define the (dis)-similarity measure between two vectors
- How to group multiple profiles into meaningful subsets?
  - Describe the clustering procedure
- Are the results meaningful?
  - Evaluate biological meaning of a clustering

(Dis)-similarity measures
- Euclidean distance
- Pearson correlation coefficient
- Cosine similarity
- Etc.
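The three measures named above can be sketched in a few lines each. The expression vectors are made up; the point of the example is that two profiles with the same shape but different magnitude are far apart in Euclidean distance yet perfectly correlated.

```python
from math import sqrt

def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson correlation: cosine similarity after mean-centering."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

# Hypothetical expression profiles across four experiments.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # same shape as g1, twice the magnitude
g3 = [4.0, 3.0, 2.0, 1.0]   # opposite trend to g1

print(euclidean(g1, g2), pearson(g1, g2), pearson(g1, g3))
```

Which measure is appropriate depends on whether magnitude or only the pattern across experiments should count as similarity.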

Clustering algorithms
- Hierarchical clustering
- K-means clustering
- Self Organizing Maps (SOMs)
- Spectral clustering
- Etc.

Hierarchical clustering
- Agglomerative or divisive (less popular)
- Agglomerative basic idea:
  - Given n genes
  - Initially, every gene is in its own cluster
  - For each iteration: find the two most similar genes (or gene groups) and combine them into one cluster
  - Terminate when only one cluster is left

(how to define similarity between two groups?)

Hierarchical clustering
- Exact behavior depends on how to compute the distance between two clusters
- No need to specify number of clusters
- A distance cutoff is often chosen to break tree into clusters
- [Figure: tree with leaves a, b, c, d, e, f]

Distance between clusters
- Single-linkage
  - Not recommended
- Complete-linkage
- Average-linkage
- Centroid method
- http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
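The agglomerative procedure with a choice of linkage can be sketched as below. This is a naive, slow version for illustration only; it supports single, complete, and average linkage (not the centroid method), uses Euclidean distance, and stops merging once the closest pair of clusters is farther apart than a distance cutoff. All names and the toy points are assumptions.

```python
from math import dist

def cluster_distance(c1, c2, points, linkage="average"):
    """Distance between two clusters given as lists of point indices."""
    d = [dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(d)          # closest pair
    if linkage == "complete":
        return max(d)          # farthest pair
    return sum(d) / len(d)     # average over all pairs

def agglomerative(points, cutoff, linkage="average"):
    """Merge the two closest clusters until none are within the cutoff."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j], points, linkage), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Toy 2-D "expression profiles": two well-separated groups of three.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
found = sorted(sorted(c) for c in agglomerative(pts, cutoff=3.0))
print(found)
```

Running without a cutoff (for example, a very large one) merges everything into a single cluster, which corresponds to building the full tree.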

K-means
- Basic idea:
  - Given n genes
  - Estimate number of clusters: k
  - (Randomly) choose k genes as cluster centers
  - Repeat until assignment is stable:
    - Assign each gene to the closest center
    - Re-compute center for each cluster
- Similarity to EM. Objective function: minimize total distance to cluster centers.
- http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
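A minimal sketch of the k-means loop, assuming Euclidean distance and random initial centers drawn from the data. The `init` and `seed` parameters, the empty-cluster rule, and the toy points are assumptions added for illustration.

```python
import random
from math import dist

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (labels, centers) after iterating until assignments are stable."""
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

# Toy 2-D points: two well-separated groups of three.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(pts, k=2, init=[pts[0], pts[3]])
print(labels)
```

Unlike hierarchical clustering, k must be chosen up front, and different random starts can give different results on less clearly separated data.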

An example
- A synthetic data set
- [Figure: heat map, axes Genes and Experiments]

Hierarchical clustering
Average linkage. Cluster genes only.

Average linkage. Cluster both genes and experiments.

K-means
- K = 15

Another view of clusters
- [Figure: cluster profiles, axes Experiments and Log ratio]

Evaluating clustering
- Do genes in the same cluster share similar functions?
  - Functional enrichment analysis
- Do genes in the same cluster share similar cis-regulatory motifs?
  - Motif finding

Gene Ontology (GO)
- Gene functions were often defined using free text
  - Hard to extract, transfer, revise, predict, annotate, comprehend, manage
- The vocabulary should be pre-defined and commonly agreed upon
- Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes

Gene ontology
- Two parts
  - Ontology: list of vocabularies (terms) to use
  - Annotations: characterizing genes using ontology terms

Three ontology categories
- Biological process
- Molecular function
- Cellular components

Part of a GO graph

Each GO category is a directed acyclic graph

A term can have multiple parents, and multiple children.

A gene can be annotated by multiple terms.

If a gene is annotated by a child term, it is automatically annotated by all ancestor terms.
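The propagation rule can be sketched on a tiny directed acyclic graph. The term names and the gene below are hypothetical placeholders, not real GO identifiers or annotations.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from a term."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct, parents):
    """Extend each gene's direct annotations with every ancestor term."""
    full = {}
    for gene, terms in direct.items():
        full[gene] = set(terms)
        for term in terms:
            full[gene] |= ancestors(term, parents)
    return full

# Hypothetical mini-ontology: a term may have more than one parent.
parents = {
    "glycolysis": ["carbohydrate metabolism", "energy generation"],
    "carbohydrate metabolism": ["metabolism"],
    "energy generation": ["metabolism"],
    "metabolism": ["biological process"],
}
direct = {"geneX": {"glycolysis"}}
full = propagate(direct, parents)
print(sorted(full["geneX"]))
```

Because of this propagation, broad terms near the root annotate very many genes, which matters when testing a cluster for enrichment.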

Functional enrichment analysis
- Total number of genes: 6000
- Cluster A: 100 genes
  - What functions do these 100 genes have?
  - Do they share some functions significantly?
- [Figure legend: gene with function of interest; gene with other functions]
- Example: among 6000 genes, 60 genes have function F. In cluster A, 55 out of 100 genes also have function F.
- Significance can be computed using the cumulative hypergeometric test.

Significance of enrichment
- M = 6000 (total number of genes)
- m = 100 (genes in cluster A)
- N = 60 (genes with function F)
- n = 55 (genes in cluster A with function F)
- P-value = 8e-100
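The cumulative hypergeometric p-value for this example can be sketched with exact integer arithmetic. The function name is illustrative; the arguments follow the slide's M, N, m, n, and the tail is taken as "n or more genes with the function".

```python
from math import comb

def hypergeom_enrichment_p(M, N, m, n):
    """P(X >= n) when m genes are drawn from M, of which N have the function."""
    upper = min(N, m)
    favourable = sum(comb(N, k) * comb(M - N, m - k)
                     for k in range(n, upper + 1))
    return favourable / comb(M, m)

# The numbers from the slide.
p = hypergeom_enrichment_p(M=6000, N=60, m=100, n=55)
print(p)
```

Python integers are exact, so the huge binomial coefficients do not overflow; only the final division is done in floating point.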

An application
- Tavazoie et al, Systematic determination of genetic network architecture, Nature Genetics, 22, 1999
- 3000 yeast genes, 15 time points during cell cycle
- Use k-means clustering, k=30
- Clusters correlate well with known function
- AlignACE motif finding on upstream regions of length 600
- Many motifs known

Cell-division cycle
- A cell duplicates its genome and divides into two identical cells
- Four phases
  - G1 (preparation)
  - S (DNA duplication)
  - G2 (preparation)
  - M (cell division)

Motifs in Clusters

Enriched functions in clusters

Overview of course

Main themes by biological subject
- Sequence analysis
  - Alignment
  - String matching
  - Motif finding
  - Gene prediction
  - RNA secondary structure prediction
- Functional genomics
  - Microarray
  - Functional enrichment analysis

Main themes by algorithmic techniques
- Dynamic programming
  - Alignment, HMM, RNA structure
- Probabilistic modeling
  - HMM (regular grammar)
  - RNA structure (context free grammar)
  - Motif finding
- Suffix trees
  - Applications in bioinfo
- Clustering

Sequence alignment
- DP algorithm for global and local alignment
  - Needleman-Wunsch
  - Smith-Waterman
- Alignment with affine gap cost
- Linear space alignment
- Heuristic alignment
  - Bounded alignment
  - BLAST
- Alignment statistics
  - Extreme value distribution
- Multiple sequence alignment

Probabilistic models
- HMMs
  - HMMs for pair-wise sequence alignment
  - HMMs for multiple alignment
  - HMMs for gene prediction
- Stochastic context-free grammar for RNA structure prediction
- Viterbi and posterior decoding
- Expectation-maximization
- Gibbs Sampling

Goals
- Basis of sequence analysis and other computational biology algorithms
- Overall picture about the field
- Read / criticize research articles
- Think about the sub-field that best suits your background to explore
- Communicate and exchange ideas with (computational) biologists

Good luck with your final exams

See you on Dec 11

Please remember to turn in your homework and final project report by Dec 6
