cot 6930 hpc & bioinformatics microarray data analysis xingquan zhu dept. of computer science...
TRANSCRIPT
![Page 1: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/1.jpg)
COT 6930 HPC & Bioinformatics
Microarray Data Analysis
Xingquan Zhu
Dept. of Computer Science and Engineering
![Page 2: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/2.jpg)
[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype, with the databases for each level: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]
![Page 3: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/3.jpg)
Outline
• Gene Expression and Biological Network
  • What, Why, and How
• DNA Microarray
  • Microarray Construction
  • Comparative Hybridization
  • Data Analysis
• Public Databases
![Page 4: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/4.jpg)
Gene Expression
• Gene expression
  • Genes are expressed when they are transcribed into RNA
  • Amount of mRNA indicates gene activity
    • No mRNA → gene is off
    • mRNA present → gene is on & performing function
• Biologically
  • Some genes are always expressed in all tissues
    • Estimated 10,000 housekeeping / ubiquitous genes
  • Other genes are selectively on
    • Depending on tissue, disease, and/or environment
  • Change in environment → change in gene expression
    • So organism can respond
![Page 5: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/5.jpg)
Biological Network
• Gene expression does not happen in isolation
• Individual genes code for function
  • Produce mRNA → protein performing function
• Sets of genes can form pathways
  • Gene products can turn on / off other genes
• Sets of pathways can form networks
  • When pathways interact
• Biology is a study of networks
  • Genes
  • Proteins
  • Etc.
![Page 6: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/6.jpg)
Type of Biological Networks
• Genetic network
  • Interactions between genes, gene products
• Gene regulation network
  • Network of control decisions to turn genes on / off
  • Subset of genetic network
• Metabolic network
  • Network of interactions between proteins
  • Synthesize / break down molecules (enzymes, cofactors)
![Page 7: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/7.jpg)
An example of Genetic Network
![Page 8: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/8.jpg)
Gene Regulation Network
![Page 9: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/9.jpg)
An example of Metabolic network
![Page 10: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/10.jpg)
Examining Biological Networks – Benefits
• Learn about gene function / regulation
  • Tissue differentiation
  • Response to environmental factors
• Identify / treat diseases
  • Discover genetic causes of disease
  • Evaluate effect of drugs
• Detect impact of DNA sequence variation (mutations)
  • Detection of mutations (e.g., SNPs)
  • Genetic typing
![Page 11: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/11.jpg)
Examining Biological Networks – Approach
• Measure protein / mRNA in cells
  • In different tissues (e.g., brain vs. muscle)
    • Find gene / protein with tissue-specific function
  • As environment changes
    • Find genes / proteins responsible for response
  • In healthy & diseased tissues
    • Find proteins / genes responsible for disease (if any)
    • Help identify diseases based on gene expression
  • In different individuals
    • Detect DNA sequence variation
![Page 12: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/12.jpg)
Examining Biological Networks
• Direct approach
  • Measure protein production / interaction in cell
    • 2D electrophoresis
    • Mass spectrometry
    • Protein microarray
• Advantages
  • Precise results on proteins
• Disadvantages
  • Low throughput (for now)
![Page 13: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/13.jpg)
Examining Biological Networks
• Indirect approach
  • Measure mRNA production (gene expression) in cell
    • Random ESTs
    • DNA microarray
• Advantages
  • High throughput
  • Can test large variety of mRNA simultaneously
• Disadvantages
  • RNA level not always correlated with protein level / function
  • Misses changes at protein level
  • Results may thus be less precise
![Page 14: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/14.jpg)
Outline
• Gene Expression and Biological Network
  • What, Why, and How
• DNA Microarray
  • Microarray Construction
  • Comparative Hybridization
  • Data Analysis
• Public Databases
![Page 15: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/15.jpg)
DNA Microarray
Question: How to determine whether a gene is expressed, or how to measure mRNA?
![Page 16: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/16.jpg)
DNA Microarray
![Page 17: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/17.jpg)
Hybridization to the Chip
![Page 18: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/18.jpg)
The Chip is Scanned
![Page 19: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/19.jpg)
Images
![Page 20: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/20.jpg)
Video: http://www.youtube.com/watch?v=VNsThMNjKhM
![Page 21: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/21.jpg)
Oligonucleotide (GeneChip) vs. Spotted Arrays
• GeneChip Microarray
  • A gene is represented by a probe set
  • A set of (11-16) probes forms a probe set
  • Probe length: 25 bp
  • Can use small amount of RNA
  • Efficient hybridization
• Spotted Microarray
  • One probe per gene
  • Probe length: hundreds to 1k bp
  • Less expensive
![Page 22: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/22.jpg)
[Figure: layout of a 1.28 cm × 1.28 cm GeneChip, showing a probe set, a probe pair (PM and MM) and a probe cell]
GeneChip: Chip → Probe set → Probe pair → Probe
![Page 23: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/23.jpg)
GeneChip Array Design
• 25-mer unique oligo
• Mismatch in the middle nucleotide
• Multiple probes (11~16) for each gene
(from Affymetrix Inc.)
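The probe-pair design can be sketched in a few lines. This is a minimal sketch, assuming the MM probe replaces the middle base of the PM probe with its complement and that a probe set is summarized by the mean PM minus MM difference. The function names, the 25-mer and the intensity values are hypothetical and are not taken from Affymetrix software.

```python
# Hypothetical helpers illustrating PM/MM probe pairs; not an Affymetrix API.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def make_mismatch(pm_probe: str) -> str:
    """Build an MM probe by complementing the middle base of a PM probe."""
    middle = len(pm_probe) // 2  # index 12 (the 13th base) for a 25-mer
    swapped = DNA_COMPLEMENT[pm_probe[middle]]
    return pm_probe[:middle] + swapped + pm_probe[middle + 1:]


def average_difference(pm_intensities, mm_intensities):
    """Summarize one probe set as the mean of PM - MM over its probe pairs."""
    if not pm_intensities or len(pm_intensities) != len(mm_intensities):
        raise ValueError("need the same, non-zero number of PM and MM values")
    pairs = zip(pm_intensities, mm_intensities)
    return sum(pm - mm for pm, mm in pairs) / len(pm_intensities)


pm = "ACGTACGTACGTGACGTACGTACGT"  # made-up 25-mer
print(make_mismatch(pm))
print(average_difference([120.0, 300.0, 210.0], [40.0, 100.0, 70.0]))
```

Subtracting MM from PM is meant to remove signal from cross hybridization, which is why each gene carries several probe pairs rather than a single probe.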
![Page 24: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/24.jpg)
Affymetrix GeneChip
![Page 25: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/25.jpg)
Affymetrix GeneChip
![Page 26: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/26.jpg)
DNA Microarray Design & Analysis
• Microarray
  • Microarray construction
  • Array design
    • Choosing probe sequences
• Comparative Hybridization (data collection)
  • Measure relative amount of mRNA
• Image processing of scanned images
  • Spot detection, normalization, quantization
• Data Analysis
  • Statistical test, noise handling (low-level)
  • Clustering, classification (high-level)
![Page 27: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/27.jpg)
cDNA
• Complementary DNA
  • Sequences are the complements of the original mRNA sequences
• Why don't we simply capture mRNA?
  • The environment is full of RNA-digesting enzymes
  • Free RNA is quickly degraded
  • To prevent the experimental samples from being lost, they are reverse-transcribed back into the more stable DNA form
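The complement relationship between mRNA and cDNA can be illustrated with a short sketch. It assumes the mRNA is written 5'→3' and returns the first-strand cDNA, also written 5'→3'; the function name is hypothetical.

```python
# Hypothetical helper: the name is illustrative, not from the slides.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    # Complement every base, reading the mRNA backwards so that the
    # result is also written in the 5'->3' direction.
    return "".join(RNA_TO_CDNA[base] for base in reversed(mrna.upper()))


print(reverse_transcribe("AUGGCC"))  # GGCCAT
```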
![Page 28: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/28.jpg)
cDNA
![Page 29: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/29.jpg)
DNA Microarray Construction
• Construction
  • Drops (spots) of cDNA fragments as probes
  • Attach to glass slide / nylon array at known locations
  • Use mechanical pins & robotics
• Use
  • Label cDNA with fluorescent dyes (fluor)
  • Measure contrast in intensity
  • Use laser / CCD scanner
![Page 30: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/30.jpg)
DNA Microarray: Automatic Detection
![Page 31: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/31.jpg)
DNA Microarray
• Choice of probe
  • Include genes of interest
    • Examine sequence databases
  • Avoid redundancy
    • No duplicate probes
  • Avoid cross hybridization
    • GeneChip alleviates this problem by using probe pairs (PM, MM)
• Can use software to help choose probes
• Or simply buy pre-designed arrays
  • Complete genomes of yeast, Drosophila, C. elegans
  • 33,000+ human genes from GenBank RefSeq on 2 microarrays
  • Expensive but labor-saving
![Page 32: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/32.jpg)
DNA Microarray Design & Analysis
• Microarray
  • Microarray construction
    • Spotted cDNA arrays, in situ photolithography…
  • Array design
    • Choosing probe sequences
• Comparative Hybridization (data collection)
  • Measure relative amount of mRNA
• Image processing of scanned images
  • Spot detection, normalization, quantization
• Data Analysis
  • Statistical test, noise handling (low-level)
  • Clustering, classification (high-level)
![Page 33: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/33.jpg)
Comparative Hybridization
• Goal
  • Measure relative amount of mRNA expressed
• Algorithm
  1. Choose cell populations
  2. mRNA extraction and reverse transcription
  3. Fluorescent labeling of cDNAs (normalized)
  4. Hybridization to microarray
  5. Scan the hybridized array
  6. Interpret scanned image
![Page 34: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/34.jpg)
Comparative Hybridization
![Page 35: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/35.jpg)
Comparative Hybridization
![Page 36: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/36.jpg)
Comparative Hybridization
• Color determined by relative RNA concentrations
• Brightness determined by total concentration
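One way to turn the two statements above into numbers is a log ratio for "color" and a sum for "brightness". This is only a sketch under the assumption that each spot has one intensity per labeled sample; the function and argument names are hypothetical.

```python
import math


def spot_summary(sample_a: float, sample_b: float):
    """Return (log2 ratio, total intensity) for one two-channel spot."""
    # Relative concentration ("color"): 0 means equal, +1 means sample A
    # is twice sample B, -1 means sample B is twice sample A.
    log_ratio = math.log2(sample_a / sample_b)
    # Total concentration ("brightness"): sum of both channels.
    brightness = sample_a + sample_b
    return log_ratio, brightness


print(spot_summary(800.0, 200.0))  # (2.0, 1000.0)
```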
![Page 37: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/37.jpg)
DNA Microarray Methodology
• Anatomy of a Comparative Gene Expression Study
  • http://www.cs.wustl.edu/~jbuhler/research/array/#diagram
• Flash Animation
  • http://www.bio.davidson.edu/courses/genomics/chip/chip.html
![Page 38: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/38.jpg)
DNA Microarray Design & Analysis
• Microarray
  • Microarray construction
    • Spotted cDNA arrays, in situ photolithography…
  • Array design
    • Choosing probe sequences
• Comparative Hybridization (data collection)
  • Measure relative amount of mRNA
• Image processing of scanned images
  • Spot detection, normalization, quantization
• Data Analysis
  • Statistical test, noise handling (low-level)
  • Clustering, classification (high-level)
![Page 39: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/39.jpg)
Streamlined Array Analysis
Workflow: Raw data → Normalize → Filter → Significance / Classification / Clustering → Gene lists → Function (Gene Ontology)
• Filter: Present/Absent, Minimum value, Fold change
• Significance / Classification: t-test, Machine learning
• Clustering: Hierarchical CL, Biclustering

Raw data (six samples, in order: normal, tumor, tumor, normal, normal, tumor):

| ID_REF | VALUE (normal) | ABS_CALL | VALUE (tumor) | ABS_CALL | VALUE (tumor) | ABS_CALL | VALUE (normal) | ABS_CALL | VALUE (normal) | ABS_CALL | VALUE (tumor) | ABS_CALL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AFFX-BioB-5_at | 210.6 | P | 234.6 | P | 362.5 | P | 389 | P | 305.6 | P | 330.5 | P |
| AFFX-BioB-M_at | 393 | P | 327.8 | P | 501.4 | P | 816.5 | P | 542 | P | 440.8 | P |
| AFFX-BioB-3_at | 264.9 | P | 164.6 | P | 244.7 | P | 379.7 | P | 261.3 | P | 303.7 | P |
| AFFX-BioC-5_at | 738.6 | P | 676.1 | P | 737.6 | P | 1191.2 | P | 917 | P | 767.9 | P |
| AFFX-BioC-3_at | 356.3 | P | 365.9 | P | 423.4 | P | 711.6 | P | 560.3 | P | 484.9 | P |
| AFFX-BioDn-5_at | 566.3 | P | 442.2 | P | 649.7 | P | 834.3 | P | 599.1 | P | 606.9 | P |
| AFFX-BioDn-3_at | 3911.8 | P | 3703.7 | P | 4680.9 | P | 6037.7 | P | 4653.7 | P | 4232 | P |
| AFFX-CreX-5_at | 6433.3 | P | 5980 | P | 7734.7 | P | 10591 | P | 8162.1 | P | 8428 | P |
| AFFX-CreX-3_at | 11917.8 | P | 9376.7 | P | 11509.3 | P | 16814.4 | P | 13861.8 | P | 13653.4 | P |
| AFFX-DapX-5_at | 12.2 | A | 44.3 | M | 31.2 | A | 37.7 | P | 33.3 | A | 12.8 | A |
| AFFX-DapX-M_at | 57.8 | M | 42.5 | A | 79 | M | 48.8 | P | 39.5 | A | 39.2 | A |
| AFFX-DapX-3_at | 29.8 | A | 6.2 | A | 23.4 | A | 28.4 | A | 3.2 | A | 7.6 | A |
| AFFX-LysX-5_at | 15.3 | A | 16.2 | A | 15.6 | A | 16.7 | A | 3.1 | A | 3.9 | A |
| AFFX-LysX-M_at | 33.2 | A | 12 | A | 17.7 | A | 37.3 | A | 49.2 | A | 9.1 | A |
| AFFX-LysX-3_at | 40.7 | M | 10.7 | A | 36.2 | A | 22.1 | A | 22.8 | A | 28.2 | A |
| AFFX-PheX-5_at | 7.8 | A | 3 | A | 7.6 | A | 5.6 | A | 5 | A | 6.4 | A |
| AFFX-PheX-M_at | 4.2 | A | 4.8 | A | 6.8 | A | 6.1 | A | 3.7 | A | 5.5 | A |
| AFFX-PheX-3_at | 54.2 | A | 39.6 | A | 19.4 | A | 16.1 | A | 44.7 | A | 31.2 | A |
| AFFX-ThrX-5_at | 8.2 | A | 11.2 | A | 13.2 | A | 9.5 | A | 8.5 | A | 7.5 | A |
| AFFX-ThrX-M_at | 38.1 | A | 30.6 | A | 37.6 | A | 7.2 | A | 26.9 | A | 36.3 | A |
| AFFX-ThrX-3_at | 15.2 | A | 5 | A | 15 | A | 8.3 | A | 36.8 | A | 11.5 | A |
| AFFX-TrpnX-5_at | 11.2 | A | 11.8 | A | 22.2 | A | 22.1 | A | 8.9 | A | 35.6 | A |
| AFFX-TrpnX-M_at | 9 | A | 8.1 | A | 9.1 | A | 8.7 | A | 8.1 | A | 12 | A |
| AFFX-TrpnX-3_at | 19.8 | A | 12.8 | A | 11.8 | A | 43.2 | M | 17.4 | A | 10 | A |
| AFFX-HUMISGF3A/M97935_5_at | 82.7 | P | 120.7 | P | 92.7 | P | 46.4 | P | 55.9 | P | 46.5 | P |
| AFFX-HUMISGF3A/M97935_MA_at | 397.6 | P | 416.7 | P | 244.8 | A | 181.4 | A | 197.5 | A | 192.3 | A |
| AFFX-HUMISGF3A/M97935_MB_at | 206.2 | P | 303 | P | 300.8 | P | 253.5 | P | 195.3 | P | 216 | P |
| AFFX-HUMISGF3A/M97935_3_at | 663.8 | P | 723.9 | P | 812.1 | P | 666.1 | P | 629.4 | P | 754.1 | P |
| AFFX-HUMRGE/M10098_5_at | 547.6 | P | 405.9 | P | 6894.7 | P | 3496.1 | P | 1958.5 | P | 5799.4 | P |
| AFFX-HUMRGE/M10098_M_at | 239.1 | P | 175.8 | P | 3675 | P | 1348.6 | P | 695.9 | P | 2428.2 | P |
| AFFX-HUMRGE/M10098_3_at | 1236.4 | P | 721.4 | P | 9076.1 | P | 7795.9 | P | 4237.1 | P | 7890 | P |
| AFFX-HUMGAPDH/M33197_5_at | 19508 | P | 19267.1 | P | 22892 | P | 26584 | P | 29666.6 | P | 25038.1 | P |
| AFFX-HUMGAPDH/M33197_M_at | 18996.6 | P | 20610.4 | P | 21573.7 | P | 29936 | P | 30106.6 | P | 22380.2 | P |
| AFFX-HUMGAPDH/M33197_3_at | 18016.4 | P | 17463.8 | P | 20921.3 | P | 26908.3 | P | 28382.2 | P | 21885 | P |
| AFFX-HSAC07/X00351_5_at | 23294.6 | P | 21783.7 | P | 18423.3 | P | 21858.9 | P | 23517.1 | P | 19450.3 | P |
| AFFX-HSAC07/X00351_M_at | 25373.1 | P | 24922.8 | P | 22384.2 | P | 25760.2 | P | 27718.5 | P | 21401.6 | P |
| AFFX-HSAC07/X00351_3_at | 20032.8 | P | 20251.1 | P | 20961.7 | P | 23494.6 | P | 23381.2 | P | 21173.3 | P |
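The Present/Absent and minimum-value filters can be sketched on three rows copied from the raw data on this slide. The sketch assumes that the ABS_CALL letters P, M and A stand for Present, Marginal and Absent detection calls; the thresholds `min_present` and `min_value` are hypothetical choices, not values from the slides.

```python
# Three probe sets from the slide's raw data: (VALUE, ABS_CALL) per sample.
rows = {
    "AFFX-BioB-5_at": [(210.6, "P"), (234.6, "P"), (362.5, "P"),
                       (389.0, "P"), (305.6, "P"), (330.5, "P")],
    "AFFX-DapX-5_at": [(12.2, "A"), (44.3, "M"), (31.2, "A"),
                       (37.7, "P"), (33.3, "A"), (12.8, "A")],
    "AFFX-PheX-5_at": [(7.8, "A"), (3.0, "A"), (7.6, "A"),
                       (5.6, "A"), (5.0, "A"), (6.4, "A")],
}


def passes_filter(measurements, min_present=3, min_value=50.0):
    """Keep a probe set if it is called Present often enough and its
    largest value reaches the minimum (both thresholds are assumptions)."""
    present = sum(1 for _, call in measurements if call == "P")
    peak = max(value for value, _ in measurements)
    return present >= min_present and peak >= min_value


kept = [probe_set for probe_set, m in rows.items() if passes_filter(m)]
print(kept)  # ['AFFX-BioB-5_at']
```

Filtering first reduces the number of genes passed on to the significance tests and clustering steps.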
![Page 40: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/40.jpg)
Microarray data
[Figure: data matrix with rows Gene 1, Gene 2, …, Gene N and columns Exp 1 (E1), Exp 2 (E2), Exp 3 (E3)]
![Page 41: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/41.jpg)
Microarray data analysis
begin with a data matrix (gene expression values versus samples)
Typically, there are many genes (>> 10,000) and few samples (~ 10)
![Page 42: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/42.jpg)
Low-Level Data Analysis
• Normalization:
  • When you have variability in measurements, you need replication and statistics to find real differences
• Significance test:
  • It's not just the genes with a 2-fold increase, but those with a significant p-value across replicates
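The difference between fold change and significance can be sketched with two made-up genes that share the same fold change but differ in replicate noise. This is a minimal sketch: all values are hypothetical, Welch's t statistic is one possible test rather than the one the slides prescribe, and turning the statistic into a p-value needs a t-distribution (for example via a library routine such as `scipy.stats.ttest_ind`).

```python
import math
from statistics import mean, variance


def fold_change(treated, control):
    """Ratio of mean expression, treated over control."""
    return mean(treated) / mean(control)


def welch_t(a, b):
    """Welch's t statistic for two independent groups of replicates."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical replicate measurements (3 control, 3 treated) for two genes.
steady_control, steady_treated = [100, 110, 90], [200, 230, 210]
noisy_control, noisy_treated = [100, 20, 180], [400, 30, 210]

for name, control, treated in [
    ("steady gene", steady_control, steady_treated),
    ("noisy gene", noisy_control, noisy_treated),
]:
    print(name, round(fold_change(treated, control), 2),
          round(welch_t(treated, control), 2))
```

Both genes show the same roughly 2-fold increase, but only the steady gene has a large t statistic, which is the point of looking at replicates and not at fold change alone.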
![Page 43: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/43.jpg)
Sources of Variability in Raw Data
• Biological variability
• Sample preparation
  • Probe labeling
  • RNA extraction
• Experimental condition
  • temperature, time, mixing, etc.
• Scanning
  • laser and detector, chemistry of the fluorescent label
• Image analysis
  • identifying and quantifying each spot on the array
![Page 44: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/44.jpg)
Data Normalization
• Can control for many of the experimental sources of variability (systematic, not random or gene specific)
• Bring each image to the same average brightness
• Can use simple math or fancy:
  • divide by the mean (whole chip or by sectors)
  • LOESS (locally weighted regression)
• No sure biological standards
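The "divide by the mean" option can be sketched as follows. This is a minimal whole-chip version with hypothetical names and made-up intensities; per-sector scaling and LOESS are not shown.

```python
def normalize_to_mean(intensities, target=1.0):
    """Scale one chip so that its mean intensity equals `target`."""
    chip_mean = sum(intensities) / len(intensities)
    return [target * value / chip_mean for value in intensities]


# Two made-up chips measuring the same sample; chip_b was scanned twice as bright.
chip_a = [100.0, 200.0, 300.0]
chip_b = [200.0, 400.0, 600.0]

print(normalize_to_mean(chip_a))  # [0.5, 1.0, 1.5]
print(normalize_to_mean(chip_b))  # [0.5, 1.0, 1.5]
```

After scaling, the overall brightness difference between the two chips disappears while the relative differences between genes on each chip are preserved.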
![Page 45: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/45.jpg)
Scatter plots
• One of the most common visualization methods for microarray data
• Useful to compare gene expression values from two microarray experiments (e.g. control, experimental)
• Each dot corresponds to a gene expression value
• Most dots fall along a line
• Outliers represent up-regulated or down-regulated genes
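Picking outliers off the diagonal can be sketched as a threshold on the log ratio of the two experiments. The 2-fold cut-off (`threshold=1.0` in log2 units) and the function name are assumptions for illustration, not values from the slides.

```python
import math


def classify(control: float, experimental: float, threshold: float = 1.0) -> str:
    """Label one gene by how far it sits from the diagonal of the scatter plot."""
    log_ratio = math.log2(experimental / control)
    if log_ratio >= threshold:
        return "up"        # above the diagonal: up-regulated
    if log_ratio <= -threshold:
        return "down"      # below the diagonal: down-regulated
    return "unchanged"     # close to the diagonal


print(classify(100.0, 400.0))  # up
print(classify(400.0, 100.0))  # down
print(classify(100.0, 120.0))  # unchanged
```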
![Page 46: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/46.jpg)
Scatter plot analysis of microarray data
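[Figure: scatter plot with points ordered from low to high expression level; outliers above the diagonal are labeled "up", outliers below it "down"]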
![Page 47: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/47.jpg)
Differential Gene Expression in Different Tissue and Cell Types
[Figure: scatter plots comparing samples labeled Brain, Astrocyte, Astrocyte and Fibroblast]
• The major goal of a scatter plot is to identify genes that are differentially regulated between different experimental conditions.
• We are interested in outliers
![Page 48: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/48.jpg)
DNA Microarray Design & Analysis
• Microarray
  • Microarray construction
    • Spotted cDNA arrays, in situ photolithography…
  • Array design
    • Choosing probe sequences
• Comparative Hybridization (data collection)
  • Measure relative amount of mRNA
• Image processing of scanned images
  • Spot detection, normalization, quantization
• Data Analysis
  • Statistical test, noise handling (low-level)
  • Clustering, classification (high-level)
![Page 49: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/49.jpg)
Higher Level Data Analysis
• Computational tasks:
  • Clustering
  • Classification
  • Statistical validation
  • Data visualization
  • Pattern detection
• Biological problems:
  • Discovery of common sequences in co-regulated genes
  • Meta-studies using data from multiple experiments
  • Linkage between gene expression data and gene sequence/function/metabolic pathways databases
![Page 50: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/50.jpg)
Microarray data
[Figure: data matrix with rows Gene 1, Gene 2, …, Gene N and columns Exp 1 (E1), Exp 2 (E2), Exp 3 (E3)]
![Page 51: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/51.jpg)
Why care about “clustering”?
[Figure: expression matrix with columns E1, E2, E3, shown once with rows Gene 1, Gene 2, …, Gene N and once with the gene rows reordered]
• Discover functional relation
  • Similar expression → functionally related
• Assign function to unknown gene
• Find which gene controls which other genes
![Page 52: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/52.jpg)
Types of Clustering Methods
• Hierarchical
  • Link similar genes, build up to a tree of all
• K-means clustering
• Self Organizing Maps (SOM)
  • Split all genes into similar sub-groups
  • Finds its own groups (machine learning)
• Bi-clustering
![Page 53: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/53.jpg)
Some distance measures
Given vectors x = (x1, …, xn), y = (y1, …, yn)
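Euclidean distance:

$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Manhattan distance:

$$d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Correlation distance:

$$d_C(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the components of x and y.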
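The three distance measures on this slide can be written directly in code. This is a plain standard-library sketch with hypothetical function names; it assumes both vectors have the same length and, for the correlation distance, that neither vector is constant.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_distance(x, y):
    """One minus the Pearson correlation of the two vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - covariance / (spread_x * spread_y)


print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
print(correlation_distance([1, 2, 3], [2, 4, 6]))
```

Correlation distance ignores overall magnitude: two genes whose profiles rise and fall together are close even if one is expressed at twice the level of the other, whereas Euclidean and Manhattan distance would keep them apart.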
![Page 54: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/54.jpg)
Finding a Centroid
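We use the following equation to find the n-dimensional centroid point (CP) of k n-dimensional points:

$$CP(x_1, x_2, \ldots, x_k) = \left( \frac{\sum_{i=1}^{k} x_i^{\text{1st}}}{k}, \; \frac{\sum_{i=1}^{k} x_i^{\text{2nd}}}{k}, \; \ldots, \; \frac{\sum_{i=1}^{k} x_i^{\text{nth}}}{k} \right)$$

where $x_i^{\text{jth}}$ is the j-th coordinate of point $x_i$.

Let's find the centroid of three 2D points, say (2,4), (5,2) and (8,9):

$$CP = \left( \frac{2 + 5 + 8}{3}, \; \frac{4 + 2 + 9}{3} \right) = (5, 5)$$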
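The centroid formula translates into a few lines of code. The function name is hypothetical; the input is the worked example from this slide.

```python
def centroid(points):
    """Average each coordinate over all k points."""
    k = len(points)
    # zip(*points) groups the 1st coordinates, the 2nd coordinates, and so on.
    return tuple(sum(coordinates) / k for coordinates in zip(*points))


print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```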
![Page 55: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/55.jpg)
Hierarchical Clustering
[Figure: expression data with columns E1, E2, E3]

• Treat each example as a cluster
• While (clusters > 1)
  • Merge the two clusters with the least distance
  • Update cluster centroid
  • Clusters--
• Endwhile

• Easy: no need to specify the number of clusters beforehand
• The “tree” structure can be hard to interpret
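The merge loop above can be sketched in runnable form. This is a minimal sketch, assuming Euclidean distance between cluster centroids as the "least distance" criterion (other linkage rules exist); the function names and the `target` parameter, which allows stopping before a single cluster remains, are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def hierarchical(points, target=1):
    """Agglomerative clustering; returns (clusters, merge history)."""
    clusters = [[p] for p in points]          # each example starts as a cluster
    merges = []
    while len(clusters) > target:
        # Find the pair of clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]    # centroid is recomputed on demand
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)               # one cluster fewer than before
    return clusters, merges


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters, merges = hierarchical(points, target=2)
print(clusters)
```

The merge history is what a dendrogram draws: reading it from first to last merge gives the tree from the leaves up.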
![Page 56: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/56.jpg)
K-means Algorithm
1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or another distance metric)
3. Calculate new center points for each cluster using only points within the cluster
4. Re-cluster all data using the new center points
   • This step could cause data points to be placed in a different cluster
5. Repeat steps 3 & 4 until no data points move from one cluster to another in step 4, or some other convergence criterion is met
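The five steps can be sketched in code. This is a minimal sketch, assuming Euclidean distance and initial centers sampled from the data points; the `seed` and `max_iter` parameters and the rule of keeping an old center when a cluster becomes empty are choices made here, not taken from the slides.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # step 1
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: put every point in the cluster of its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:                 # step 5: nothing moved
            break
        assignment = new_assignment
        # Step 3: recompute each center from the points inside its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

On this toy data the two well-separated groups are recovered; on real expression data the result can depend on the random initial centers, so running with several seeds is a common precaution.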
![Page 57: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/57.jpg)
An example with k=2
1. We Pick k=2 centers at random
2. We cluster our data around these center points
![Page 58: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/58.jpg)
K-means example with k=2
3. We recalculate centers based on our current clusters
![Page 59: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/59.jpg)
K-means example with k=2
4. We re-cluster our data around our new center points
![Page 60: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/60.jpg)
K-means example with k=2
5. We repeat the last two steps until no more data points are moved into a different cluster
![Page 61: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/61.jpg)
Cluster Quality
• Since any data can be clustered, how do we know our clusters are meaningful?
  • The size (diameter) of the cluster vs. the inter-cluster distance
  • Distance between the members of a cluster and the cluster's center
  • Diameter of the smallest sphere
![Page 62: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/62.jpg)
Cluster Quality Continued
[Figure: two examples, each with cluster size=5; inter-cluster distance=20 in one and distance=5 in the other]
Quality of a cluster is assessed by the ratio of the distance to the nearest cluster to the cluster diameter
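The ratio described above can be sketched as follows. This sketch assumes that the diameter is the largest pairwise distance inside the cluster and that the distance to another cluster is measured between centroids; both are interpretations, and the function names and points are made up. The numbers mirror the figure labels as read here (size 5, distances 20 and 5).

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def diameter(cluster):
    """Largest distance between any two members of the cluster."""
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))


def quality(cluster, other_clusters):
    """Distance to the nearest other cluster divided by the cluster diameter."""
    own_center = centroid(cluster)
    nearest = min(math.dist(own_center, centroid(c)) for c in other_clusters)
    return nearest / diameter(cluster)


cluster = [(0.0, 0.0), (5.0, 0.0)]            # diameter 5, centroid (2.5, 0)
print(quality(cluster, [[(22.5, 0.0)]]))      # 4.0: well separated
print(quality(cluster, [[(7.5, 0.0)]]))       # 1.0: neighbor as close as the cluster is wide
```

A larger ratio means the cluster is compact relative to its separation from its neighbors.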
![Page 63: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/63.jpg)
Cluster Quality Continued
Quality can be assessed simply by looking at the diameter of a cluster
A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.
![Page 64: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/64.jpg)
k-means comments
• Strength
  • Easy
  • Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Weakness
  • Sensitive to the initial seeds
  • Applicable only when a mean is defined; then what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers
  • Not suitable to discover clusters with non-convex shapes
![Page 65: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/65.jpg)
A Problem of K-means
• Sensitive to outliers
  • Outlier: objects with extremely large values
  • May substantially distort the distribution of the data
• When mean is not meaningful
  • K-medoids: use the medoid, the most centrally located object in a cluster, in place of the mean
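The effect of an outlier on the mean versus the medoid can be sketched with made-up points. The function names are hypothetical, and the medoid is taken here as the member with the smallest total Euclidean distance to all other members.

```python
import math


def mean_point(points):
    """Coordinate-wise mean; need not coincide with any data point."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def medoid(points):
    """The member with the smallest summed distance to all members."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Four nearby points plus one made-up outlier with extremely large values.
points = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]

print(mean_point(points))  # pulled far away from the four nearby points
print(medoid(points))      # stays on one of the four nearby points
```

Because the medoid must be an actual data point, a single extreme value cannot drag the cluster center into empty space the way it drags the mean.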
[Figure: two scatter plots, both axes running from 0 to 10]
![Page 66: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/66.jpg)
A Problem of K-means: Differing Density
Original Points K-means (3 Clusters)
![Page 67: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/67.jpg)
Clusters with non-convex shapes
Original Points K-means (2 Clusters)
![Page 68: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/68.jpg)
A parallel k-means package
• Parallel K-Means Data Clustering
  • http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html
![Page 69: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/69.jpg)
Other clustering methods
• Self Organizing Maps (SOM)
  • Determine its own groups by using neural networks
• Bi-clustering
  • Simultaneously merge columns and rows into clusters
    • Group of genes
    • Group of examples
![Page 70: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/70.jpg)
Two-way clustering of genes (y-axis) and cell lines (x-axis)
![Page 71: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/71.jpg)
Outline
• Gene Expression and Biological Network
  • What, Why, and How
• DNA Microarray
  • Microarray Construction
  • Comparative Hybridization
  • Data Analysis
• Public Databases
![Page 72: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/72.jpg)
Public Databases
• Gene expression data is an essential aspect of annotating the genome
• Publication and data exchange for microarray experiments
• Data mining / meta-studies
• Common data format: XML
• MIAME (Minimum Information About a Microarray Experiment)
![Page 73: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/73.jpg)
GEO at the NCBI
![Page 74: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/74.jpg)
Array Express at EMBL
![Page 75: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/75.jpg)
Array Express at EMBL
![Page 76: COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering](https://reader035.vdocuments.net/reader035/viewer/2022062800/56649e165503460f94b00988/html5/thumbnails/76.jpg)
Outline
• Gene Expression and Biological Network
  • What, Why, and How
• DNA Microarray
  • Microarray Construction
  • Comparative Hybridization
  • Data Analysis
• Public Databases