CS 5263 Bioinformatics Lecture 23 Microarray Data Analysis

Upload: alison-johnson

Post on 13-Jan-2016

TRANSCRIPT

  • CS 5263 Bioinformatics, Lecture 23: Microarray Data Analysis

  • What is a microarray?
    • Conceptually similar to a (reverse) Northern blot
    • (Many) probes, rather than mRNAs, are fixed on a surface in an ordered way
    • (Figure: spots labeled Gene 1 through Gene 300)

  • Microarray categories
    • cDNA microarray: each probe is the cDNA of a gene (500-5000 nt)
    • Oligonucleotide microarray: each probe is a short synthesized DNA (uniquely corresponding to a substring of a gene)
      • Affymetrix: ~25-mers
      • Agilent: ~60-mers

  • Spotted cDNA microarray

  • Array manufacturing
    • Each tube contains cDNAs corresponding to a unique gene
    • The cDNAs are pre-amplified and spotted onto a glass slide

  • Experiment (figure labels: cy3, cy5)

  • Data acquisition
    • Computer programs are used to process the image into digital signals
    • Segmentation: determine the boundary between signal and background
    • Results: gene expression ratios between two samples

  • Affymetrix GeneChip

  • Array design: multiple probes (11-16) for each gene (figure from Affymetrix Inc.)

  • Experiment (figure from Affymetrix Inc.)
    • The probes in each probe set are combined to give an absolute expression level
    • Image segmentation is relatively easy, but the background still needs to be subtracted

  • Affymetrix GeneChip: one-color design. cDNA microarray: two-color design.

  • Preprocessing
    • Image processing: analog to digital
    • Background subtraction: account for non-specific hybridization
    • Transformation: convenience, normal distribution assumptions
    • Normalization: remove systematic biases
    • Filtering, averaging, etc.: remove random noise
    • The order of the steps may differ, and steps may be combined

  • Background subtraction
    • For cDNA arrays, relatively straightforward
    • For oligo arrays, how to combine PM (perfect match) and MM (mismatch) probes?
      • Does MM really measure non-specific hybridization?
      • Recent studies suggest ignoring MM entirely or using it with caution
    • Available software tools
      • MAS 5 (by Affymetrix)
      • dChip
      • GCRMA

  • Transformation: log transformation for two-color arrays

  • Transformation: log transformation for one-color arrays
    • When you get a data set from someone, be careful with the scale
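
A minimal sketch of why the log scale is convenient for two-color ratios: a two-fold increase and a two-fold decrease become symmetric around zero. The function name `log2_ratio` and the intensity values are illustrative assumptions, not taken from the lecture.

```python
import math

def log2_ratio(cy5, cy3):
    """Log2 expression ratio for one spot of a two-color array."""
    return math.log2(cy5 / cy3)

# On the linear scale these ratios are 2.0 and 0.5 (asymmetric around 1).
# On the log2 scale they are symmetric around 0.
print(log2_ratio(2000, 1000))  # 1.0
print(log2_ratio(1000, 2000))  # -1.0
```

This is also why the scale matters when receiving data: a value of 1.0 means "two-fold up" on the log2 scale but "no change" on the linear ratio scale.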

  • Normalization
    • Purpose: correct for systematic errors and make data from different samples comparable
    • Best approach to detect problems: visualization

  • An example data set
    • J. DeRisi, V. Iyer, and P. Brown, Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale, Science, 278: 680-686, 1997
    • Yeast cells grow in glucose medium
    • When glucose is depleted, the cells change their metabolic pathways
    • cDNA microarray
    • Test: 2, 4, 6, 8, 10, 12, 14 hours after growth; control: 0 hour
    • No replicates!
    • No normalization!
    • Uses fold change to get differentially expressed genes!

  • Histogram of log ratios
    • Median = -0.27
    • Two possibilities: dye effect or sample difference

  • Total intensity normalization
    • mean(cy3) = 3141
    • mean(cy5) = 2838
    • 3141 / 2838 = 1.11
    • Other options:
      • Use the median
      • Use a subset of genes: exclude the 10% most extreme, house-keeping genes, spike-in genes, etc.
    • Net effect: a constant factor for every gene
    • Median = -0.1 (after normalization)
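
The scaling on this slide can be sketched as follows. This is a minimal illustration that assumes the factor is applied to the cy5 channel so that both channels end up with the same mean; the function names and the toy intensities are hypothetical.

```python
import math
from statistics import mean

def total_intensity_factor(cy3, cy5):
    """Constant factor that makes mean(cy5) equal to mean(cy3)."""
    return mean(cy3) / mean(cy5)

def normalized_log_ratios(cy3, cy5):
    """log2(cy5 / cy3) after rescaling cy5 by the total-intensity factor."""
    k = total_intensity_factor(cy3, cy5)
    return [math.log2(k * r / g) for g, r in zip(cy3, cy5)]

# Slide numbers: the factor is 3141 / 2838, about 1.11.
print(round(3141 / 2838, 2))

# Toy data where cy5 is uniformly half as bright as cy3.
cy3 = [100, 200, 300]
cy5 = [50, 100, 150]
print(normalized_log_ratios(cy3, cy5))
```

Because the factor is the same for every gene, it only shifts the whole histogram of log ratios by log2(factor); it cannot remove an intensity-dependent bias.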

  • Intensity-intensity plot: total intensity normalization worked well here

  • Intensity-intensity plot
    • Total intensity normalization did not work well for this experiment
    • Dye-swapping can probably help?

  • M-A plot
    • A: log2(cy5 * cy3) = log2(cy5) + log2(cy3)
    • M: log2(cy5 / cy3) = log2(cy5) - log2(cy3)

  • M-A plot: dependency of M on A
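
A small sketch of the M and A values for a single spot, following the definitions on the slide above. Note that many references define A as half of this sum (the average log intensity); the code keeps the slide's definition. The function name `ma_values` is an assumption.

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A) for one spot, using the slide's definitions."""
    m = math.log2(cy5) - math.log2(cy3)  # log ratio
    a = math.log2(cy5) + math.log2(cy3)  # log of the intensity product
    return m, a

print(ma_values(8, 2))  # (2.0, 4.0)
```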

  • Lowess normalization
    • Lowess: locally weighted regression
    • Fit local polynomial functions
    • M is adjusted according to the fitted line
    • (Figure: two M-A plots with axes A and M)

  • Replicate filtering
    • Genes with very high variability across replicates are questionable
    • (Figure axes: Log2(ratio 1) vs Log2(ratio 2); Ratio 1 vs Ratio 2)

  • Preprocessing questions
    • What kind of array is it?
      • Two-color or one-color?
      • Oligo array or cDNA array?
    • How is the experiment designed?
      • Time series?
      • Test vs control? (What kind of control?)
    • What kind of preprocessing has been done?
      • What values are given: raw intensities or ratios?
      • Transformation? Log scale? Linear scale?
      • Normalization: within-array? Cross-array?
    • What are the next steps?
      • Identifying differentially expressed genes?
      • Clustering?
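
The idea of lowess normalization can be sketched with a simplified local linear fit. This is an illustration under stated assumptions, not a full lowess implementation: it uses a tricube kernel, a nearest-neighbor bandwidth set by the hypothetical parameter `frac`, degree-1 local polynomials, and no robustness iterations.

```python
import math

def tricube(u):
    """Tricube kernel: weight 1 at the target point, falling to 0 at distance 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def local_fit(a_values, m_values, a0, frac=0.5):
    """Fit a weighted straight line around a0 and evaluate it at a0."""
    n = len(a_values)
    k = min(n, max(3, math.ceil(frac * n)))   # number of neighbors in the window
    dists = sorted(abs(a - a0) for a in a_values)
    h = dists[k - 1] * 1.0001 + 1e-12         # bandwidth just past the k-th neighbor
    w = [tricube((a - a0) / h) for a in a_values]
    sw = sum(w)
    a_bar = sum(wi * a for wi, a in zip(w, a_values)) / sw
    m_bar = sum(wi * m for wi, m in zip(w, m_values)) / sw
    sxx = sum(wi * (a - a_bar) ** 2 for wi, a in zip(w, a_values))
    sxy = sum(wi * (a - a_bar) * (m - m_bar)
              for wi, a, m in zip(w, a_values, m_values))
    slope = sxy / sxx if sxx > 0 else 0.0
    return m_bar + slope * (a0 - a_bar)

def lowess_normalize(a_values, m_values, frac=0.5):
    """Subtract the locally fitted trend from every M value."""
    return [m - local_fit(a_values, m_values, a, frac)
            for a, m in zip(a_values, m_values)]

# Toy data: M depends linearly on A, so the adjusted M values are ~0.
a = [float(i) for i in range(1, 11)]
m = [0.5 * x - 1 for x in a]
print([round(v, 6) for v in lowess_normalize(a, m)])
```

Unlike total intensity normalization, the correction here differs from gene to gene because it depends on A.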

  • Identify differentially expressed genes
    • Naive approach: fold change
      • log2(cy5 / cy3) > 1: up-regulated / induced
      • log2(cy5 / cy3) < -1: down-regulated / repressed
    • Still widely used because it is very simple
    • Main problem: genes with low expression levels may show a large fold change by chance
      • From 10 to 100: ten-fold
      • From 1000 to 3000: three-fold
      • However: low intensity => relatively high variance
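
The fold-change rule above can be sketched in a few lines. The function name, the labels, and the `threshold` parameter are illustrative assumptions; the cutoffs of +1 and -1 on the log2 scale come from the slide.

```python
import math

def call_by_fold_change(cy5, cy3, threshold=1.0):
    """Classify one gene from its log2(cy5 / cy3) value."""
    m = math.log2(cy5 / cy3)
    if m > threshold:
        return "induced"
    if m < -threshold:
        return "repressed"
    return "unchanged"

# Both slide examples pass the cutoff, although the low-intensity
# measurement (10 -> 100) is far less reliable than 1000 -> 3000.
print(call_by_fold_change(100, 10))     # induced
print(call_by_fold_change(3000, 1000))  # induced
print(call_by_fold_change(1500, 1000))  # unchanged
```

The rule looks only at the ratio, which is exactly why it cannot tell a noisy low-intensity gene from a reliably measured one.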

  • Problem with fold change: the genes that look most differentially expressed are the ones with the lowest average expression levels

  • More robust estimation of differential expression
    • Estimate the variance as a function of average expression
    • Compute a Z-score that depends on location: Z(x) = (x - μ) / σ(x)
      • x: log2(R/G) value
      • μ: local mean
      • σ(x): local standard deviation
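
One way to sketch the location-dependent Z-score is a sliding window over genes sorted by average intensity. This is an assumption about how "local" is defined (the lecture does not specify it); the `window` parameter and function name are hypothetical.

```python
from statistics import mean, stdev

def local_z_scores(a_values, m_values, window=50):
    """Z-score of each M value relative to genes with similar intensity A."""
    n = len(a_values)
    order = sorted(range(n), key=lambda i: a_values[i])  # genes sorted by A
    half = window // 2
    z = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(n, rank + half + 1)
        neighbors = [m_values[j] for j in order[lo:hi]]
        if len(neighbors) < 2:
            continue                      # not enough data for a local variance
        mu = mean(neighbors)              # local mean
        sd = stdev(neighbors)             # local standard deviation
        z[i] = (m_values[i] - mu) / sd if sd > 0 else 0.0
    return z

# Toy data: nine unremarkable genes and one outlier at the highest intensity.
a = list(range(10))
m = [0, 0, 0, 0, 0, 0, 0, 0, 0, 5]
print(local_z_scores(a, m, window=10))
```

With this scheme a given log ratio earns a smaller Z-score in a noisy (typically low-intensity) region than in a quiet one.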

  • SAM (Significance Analysis of Microarrays)
    • Test: replicate 1, replicate 2, replicate 3
    • Control: replicate 1, replicate 2, replicate 3
    • Which one is more significantly differentially expressed?

             T1    T2    T3    C1    C2    C3    Ratio
      Gene1  1000  2000  1500  200   300   250   6
      Gene2  1000  2000  3000  1000  1500  500   2
      Gene3  100   1000  100   20    80    50    8
      Gene4  1800  1700  1900  1000  800   900   2

  • SAM (Significance Analysis of Microarrays)
    • Basic idea: Student's t-test
    • Larger t => higher significance
    • A p-value can be computed directly from the t-test
    • A permutation test can also be used

             T1    T2    T3    C1    C2    C3    t
      Gene1  1000  2000  1500  200   300   250   4.3
      P1     1000  300   300   200   2000  200   -0.4
      P2     2000  1000  200   300   1000  250   0.96
      P3     200   1000  1000  300   250   1000  0.60
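
The t values in the table can be reproduced with a plain two-sample t-statistic, and the permutation idea can be sketched by enumerating every relabeling of the six measurements. This is a hedged illustration of the basic idea only: it uses the unequal-variance form of the statistic (identical to the pooled form when both groups have the same size), it is not the actual SAM statistic, and the function names are assumptions.

```python
import math
from itertools import combinations
from statistics import mean, variance

def t_statistic(test, control):
    """Two-sample t-statistic (unequal-variance form)."""
    diff = mean(test) - mean(control)
    se = math.sqrt(variance(test) / len(test) + variance(control) / len(control))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(test, control):
    """Two-sided p-value from all ways of relabeling the pooled values."""
    pooled = list(test) + list(control)
    observed = abs(t_statistic(test, control))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(test)):
        chosen = set(idx)
        t_group = [pooled[i] for i in idx]
        c_group = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(t_statistic(t_group, c_group)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total

gene1_test, gene1_control = [1000, 2000, 1500], [200, 300, 250]
print(round(t_statistic(gene1_test, gene1_control), 1))   # 4.3, as in the table
print(permutation_p_value(gene1_test, gene1_control))
```

With three replicates per group there are only 20 relabelings, so the smallest two-sided permutation p-value that can be reached is 2/20 = 0.1. This is one reason more replicates, or permutations pooled across genes, are needed in practice.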

  • False Discovery Rate (FDR)
    • Multiple testing problem
      • P-value cutoff = 0.05
      • We tested 10000 genes
      • We would expect 500 genes to pass by chance at this significance level
    • Bonferroni correction
      • Use p-value cutoff 0.05 / 10000
      • Meaning: among all genes selected, only 0.05 false positives are expected in total
      • Too conservative
    • False Discovery Rate (FDR)
      • FDR = 0.1 means that among all genes selected (say 100), we would expect 10 to be false positives
      • Acceptable to biologists
      • Several different approaches to estimate it
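
The two corrections can be compared with a short sketch. The slide does not say which FDR method is meant; the Benjamini-Hochberg step-up procedure is used here as one common choice, and the p-values are made up for illustration.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr=0.1):
    """Indices of tests declared significant at the given FDR (BH step-up)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / n * fdr:
            largest_rank = rank           # keep the largest rank that passes
    return sorted(order[:largest_rank])

# Slide numbers: 10000 tests at cutoff 0.05.
print(0.05 * 10000)                       # expected false positives: 500.0
print(bonferroni_cutoff(0.05, 10000))     # 5e-06

# Hypothetical p-values for ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, fdr=0.1))     # [0, 1, 2, 3, 4]
print(benjamini_hochberg(p, fdr=0.05))    # [0, 1]
```

Note the step-up behavior: at FDR = 0.1 the third and fourth genes are selected even though they miss their own rank thresholds, because the fifth gene passes its threshold.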

  • Microarray data analysis
    • Often perform multiple experiments under varying conditions
      • Temperature
      • Time series
      • Different chemical treatments
      • Different tissues
      • Different mutants

  • Microarray data analysis
    • For each gene j we have a vector e_j = (e_1j, e_2j, ..., e_dj)
    • What to do next?

  • Supervised vs unsupervised learning
    • Supervised learning (classification)
      • Associate genes with phenotypes
      • E.g.: genes A, B and C induced => cancer; repressed => not cancer
      • Goal: to learn such a function from data
      • Classification algorithms: decision tree, SVM, neural networks, naive Bayes, etc.
    • Unsupervised learning (clustering)

  • AML: acute myeloid leukemia. ALL: acute lymphoblastic leukemia.
    • Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286: 531-537, 1999

  • Clustering microarray data
    • Group genes into co-expressed sets
      • Genes with similar expression patterns across multiple experiments may be co-regulated
    • Group experiments into clusters
      • Experiments within the same group may have similar gene expression signatures
      • For example, disease sub-types that can be classified from gene expression data

  • Clustering microarray data
    • How to tell if two expression vectors are similar?
      • Define the (dis)-similarity measure between two vectors
    • How to group multiple profiles into meaningful subsets?
      • Describe the clustering procedure
    • Are the results meaningful?
      • Evaluate the biological meaning of a clustering

  • (Dis)-similarity measures
    • Euclidean distance
    • Pearson correlation coefficient
    • Cosine similarity
    • Etc.
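
The three measures listed above can be written directly from their standard definitions. The function names and the example vectors are illustrative assumptions.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson_correlation(x, y):
    # Pearson correlation is the cosine similarity of the mean-centered vectors.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])

# Two expression profiles with the same shape but different magnitude.
g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]
print(euclidean_distance(g1, g2))   # far apart in absolute terms
print(pearson_correlation(g1, g2))  # perfectly correlated
print(cosine_similarity(g1, g2))
```

The example shows why the choice matters: Euclidean distance separates genes by magnitude, while correlation groups genes whose profiles rise and fall together.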

  • Clustering algorithms
    • Hierarchical clustering
    • K-means clustering
    • Self Organizing Maps (SOMs)
    • Spectral clustering
    • Etc.

  • Hierarchical clustering
    • Agglomerative or divisive (less popular)
    • Agglomerative basic idea:
      • Given n genes
      • Initially, every gene is in its own cluster
      • In each iteration, find the two most similar genes (or gene groups) and combine them into one cluster
      • Terminate when only one cluster is left
    • (How to define similarity between two groups?)

  • Hierarchical clustering
    • Exact behavior depends on how the distance between two clusters is computed
    • No need to specify the number of clusters
    • A distance cutoff is often chosen to break the tree into clusters
    • (Figure: tree over items a, b, c, d, e, f)

  • Distance between clusters
    • Single-linkage (not recommended)
    • Complete-linkage
    • Average-linkage
    • Centroid method
    • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
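
The agglomerative procedure and the four cluster-distance definitions can be sketched together. This is a naive illustration (it recomputes all pairwise cluster distances in every iteration) that assumes Euclidean distance between expression vectors; the function names and the option of stopping at `n_clusters` instead of building the full tree are assumptions.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster, points):
    members = [points[i] for i in cluster]
    return [sum(col) / len(members) for col in zip(*members)]

def cluster_distance(c1, c2, points, linkage="average"):
    """Distance between two clusters given as lists of point indices."""
    if linkage == "centroid":
        return euclidean(centroid(c1, points), centroid(c2, points))
    d = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(d)            # closest pair
    if linkage == "complete":
        return max(d)            # farthest pair
    if linkage == "average":
        return sum(d) / len(d)   # mean over all pairs
    raise ValueError("unknown linkage: " + linkage)

def agglomerative(points, n_clusters=1, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]],
                                           points, linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]          # safe because i < j
    return clusters

points = [[0.0], [1.0], [10.0], [11.0], [50.0]]
print(agglomerative(points, n_clusters=3))                    # [[0, 1], [2, 3], [4]]
print(agglomerative(points, n_clusters=2, linkage="single"))  # [[0, 1, 2, 3], [4]]
```

Stopping at a fixed number of clusters plays the same role as cutting the tree at a distance cutoff.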

  • K-means
    • Basic idea:
      • Given n genes
      • Estimate the number of clusters: k
      • (Randomly) choose k genes as cluster centers
      • Assign each gene to the closest center
      • Re-compute the center of each cluster
      • Repeat the last two steps until the assignment is stable
    • Similar to EM
    • Objective function: minimize total distance to cluster centers
    • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

  • An example: a synthetic data set (figure axes: Genes, Experiments)

  • Hierarchical clustering: average linkage, clustering genes only
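
The k-means loop described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and a seeded random start for reproducibility; the `seed` and `max_iter` parameters and the toy points are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centers) after iterating assignment and re-centering."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # k random genes as centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break                                        # assignment is stable
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                                  # keep old center if empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

points = [(0, 0), (1, 1), (2, 0), (100, 100), (101, 101), (102, 100)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The result depends on the initial centers in general, so in practice the procedure is often repeated from several random starts and the run with the smallest objective value is kept.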

  • Average linkage. Cluster both genes and experiments.

  • K-means: K = 15

  • Another view of clusters (figure axes: Experiments, Log ratio)

  • Evaluating clustering
    • Do genes in the same cluster share similar functions?
      • Functional enrichment analysis
    • Do genes in the same cluster share similar cis-regulatory motifs?
      • Motif finding

  • Gene Ontology (GO)
    • Gene functions were often defined using free text
      • Hard to extract, transfer, revise, predict, annotate, comprehend, manage
    • The vocabulary should be pre-defined and commonly agreed upon
    • Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes

  • Gene ontology
    • Two parts
      • Ontology: the vocabulary (terms) to use
      • Annotations: characterizing genes using ontology terms
    • Three ontology categories
      • Biological process
      • Molecular function
      • Cellular component

  • Part of a GO graph

    Each GO category is a directed acyclic graph

    A term can have multiple parents, and multiple children.

    A gene can be annotated by multiple terms.

    If a gene is annotated by a child term, it is automatically annotated by all ancestor terms.
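
The propagation rule can be sketched on a toy directed acyclic graph. The ontology below is hypothetical and only loosely modeled on GO term names; it is not real GO data, and the function names are assumptions.

```python
# Hypothetical toy ontology: each term maps to its list of parent terms.
PARENTS = {
    "biological process": [],
    "metabolic process": ["biological process"],
    "transport": ["biological process"],
    "carbohydrate metabolic process": ["metabolic process"],
    "carbohydrate transport": ["transport"],
    # A term with two parents, which a tree could not represent.
    "toy term with two parents": ["carbohydrate metabolic process",
                                  "carbohydrate transport"],
}

def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def propagate(direct_terms, parents=PARENTS):
    """Direct annotations plus every ancestor term."""
    full = set(direct_terms)
    for t in direct_terms:
        full |= ancestors(t, parents)
    return full

print(sorted(propagate({"toy term with two parents"})))
```

A gene annotated only with the most specific term ends up annotated with all six terms, which is what makes enrichment tests on broader terms meaningful.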

  • Functional enrichment analysis
    • Total number of genes: 6000
    • Cluster A: 100 genes
    • What functions do these 100 genes have?
    • Do they share some functions significantly?
    • (Figure legend: gene with function of interest; gene with other functions)
    • Example: among 6000 genes, 60 genes have function F. In cluster A, 55 out of 100 genes also have function F.
    • Significance can be computed using the cumulative hypergeometric test.

  • Significance of enrichment
    • M = 6000 (total number of genes)
    • m = 100 (genes in cluster A)
    • N = 60 (genes with function F)
    • n = 55 (genes in cluster A with function F)
    • P-value = 8e-100
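
The cumulative hypergeometric p-value for this example can be computed exactly with integer arithmetic. A minimal sketch, assuming the p-value is the upper tail P(X >= n) and using the slide's symbols M, N, m, n; the function name is an assumption, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def hypergeom_tail(M, N, m, n):
    """P(X >= n): chance that a random cluster of m genes, drawn from M genes
    of which N have function F, contains at least n genes with function F."""
    total = comb(M, m)
    upper = min(N, m)
    favorable = sum(comb(N, k) * comb(M - N, m - k) for k in range(n, upper + 1))
    return float(Fraction(favorable, total))

p = hypergeom_tail(M=6000, N=60, m=100, n=55)
print(p)  # should be close to the 8e-100 quoted on the slide
```

For comparison, a random cluster of 100 genes would be expected to contain only 100 * 60 / 6000 = 1 gene with function F, which is why observing 55 is so extreme.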

  • An application
    • Tavazoie et al., Systematic determination of genetic network architecture, Nature Genetics, 22, 1999
    • 3000 yeast genes, 15 time points during the cell cycle
    • Use k-means clustering, k = 30
    • Clusters correlate well with known functions
    • AlignACE motif finding on upstream regions of length 600
    • Many of the motifs are already known

  • Cell-division cycle
    • A cell duplicates its genome and divides into two identical cells
    • Four phases
      • G1 (preparation)
      • S (DNA duplication)
      • G2 (preparation)
      • M (cell division)

  • Motifs in Clusters

  • Enriched functions in clusters

  • Overview of course

  • Main themes by biological subject
    • Sequence analysis
      • Alignment
      • String matching
      • Motif finding
      • Gene prediction
      • RNA secondary structure prediction
    • Functional genomics
      • Microarray
      • Functional enrichment analysis

  • Main themes by algorithmic techniques
    • Dynamic programming
      • Alignment, HMM, RNA structure
    • Probabilistic modeling
      • HMM (regular grammar)
      • RNA structure (context-free grammar)
      • Motif finding
    • Suffix trees
      • Applications in bioinformatics
    • Clustering

  • Sequence alignment
    • DP algorithms for global and local alignment
      • Needleman-Wunsch
      • Smith-Waterman
    • Alignment with affine gap cost
    • Linear space alignment
    • Heuristic alignment
      • Bounded alignment
      • BLAST
    • Alignment statistics
      • Extreme value distribution
    • Multiple sequence alignment

  • Probabilistic models
    • HMMs
      • HMMs for pair-wise sequence alignment
      • HMMs for multiple alignment
      • HMMs for gene prediction
    • Stochastic context-free grammar for RNA structure prediction
    • Viterbi and posterior decoding
    • Expectation-maximization
    • Gibbs sampling

  • Goals
    • Basis of sequence analysis and other computational biology algorithms
    • Overall picture of the field
    • Read / criticize research articles
    • Think about which sub-field best suits your background to explore
    • Communicate and exchange ideas with (computational) biologists

  • Good luck with your final exams

    See you on Dec 11

    Please remember to turn in your homework and final project report by Dec 6