Department of Animal Sciences Department of Biostatistics & Medical Informatics
University of Wisconsin - Madison
Guilherme J. M. Rosa
Microarrays and RNAseq: Technologies and Data Processing
OUTLINE
• Introduction
  – Central dogma of molecular biology
• Transcriptional Profiling Technologies
  – Earlier methods
  – Microarrays
  – RT-PCR
  – RNA-Seq
• Data Acquisition and Data Pre-processing
CENTRAL DOGMA OF MOLECULAR BIOLOGY
Environment
Genetics
Phenotype Gene expression
GENE EXPRESSION ASSAY TECHNOLOGIES (TRANSCRIPTION LEVEL):
• Southern blotting and Northern blotting
• Microarrays
• RT-PCR
• RNAseq
SOUTHERN BLOTTING
Detection of specific DNA fragments by gel-transfer hybridization
1) The mixture of double-stranded DNA fragments generated by restriction nuclease treatment of DNA is separated according to length by electrophoresis.
2) A sheet of either nitrocellulose paper or nylon paper is laid over the gel, and the separated DNA fragments are transferred to the sheet by blotting.
3) The gel is supported on a layer of sponge in a bath of alkali solution, and the buffer is sucked through the gel and the nitrocellulose paper by paper towels stacked on top of the nitrocellulose.
4) As the buffer is sucked through, it denatures the DNA and transfers the single-stranded fragments from the gel to the surface of the nitrocellulose sheet, where they adhere firmly. (This transfer is necessary to keep the DNA firmly in place while the hybridization procedure is carried out.)
5) The nitrocellulose sheet is carefully peeled off the gel.
6) The sheet containing the bound single-stranded DNA fragments is placed in a sealed container together with buffer containing a radioactively labeled DNA probe specific for the required DNA sequence.
7) The sheet is exposed for a prolonged period to the probe under conditions favoring hybridization.
8) The sheet is removed from the container and washed thoroughly, so that only probe molecules that have hybridized to the DNA on the paper remain attached.
9) After autoradiography, the DNA that has hybridized to the labeled probe will show up as bands on the autoradiograph.
• An adaptation of this technique to detect specific sequences in RNA is called Northern blotting. In this case mRNA molecules are electrophoresed through the gel and the probe is usually a single-stranded DNA molecule.
• Northern blots allow investigators to determine the molecular weight of an mRNA and to measure relative amounts of the mRNA present in different samples.
MICROARRAY TECHNOLOGY
Microarrays use the natural chemical attraction between DNA and RNA molecules to determine the expression level of genes
C pairs with G and A pairs with T or U
A good match sticks, a bad match doesn't
Homemade arrays: PCR products or pre-synthesized oligonucleotide probes are spotted using robot technology; two-color system
Affymetrix: High-density oligonucleotide array, short (25-mer) oligos synthesized in situ (photolithography); single-channel
Agilent (HP): pre-synthesized oligonucleotide (60-mer) probes are printed using inkjet technology; two-color system
Illumina: pre-synthesized oligonucleotide (50-mer); single-channel, multiple arrays (6 or 8) per slide
NimbleGen: pre-synthesized oligonucleotide (60-mer); multiple probes per gene; 4-plex (4 samples per array)
Variations: kind of probe (PCR product or oligos), length of oligos, how probes are deposited on the slide, pre-synthesized or synthesized in situ, number of samples co-hybridized on each slide
TWO COLOR VS. SINGLE CHANNEL SYSTEMS
[Figure: single channel vs. two color systems]
MICROARRAY PLATFORMS
Multi-step process to extract RNA from the sample and make millions of copies.
Chop up the RNA
At the same time the RNA is copied, molecules of a chemical called biotin (orange cups) are attached to each strand. These biotin molecules will act as a molecular glue for fluorescent molecules.
Wash Sample Over the Array
Fluorescent stain that sticks to the biotin
Amount of fluorescent stain proportional to the amount of RNA molecules that hybridized to the DNA probe
Comparing Gene Expression between “loud speakers” and “normal speakers”
[Figure: workflow of a microarray study. Boxes: Biological question (differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Image analysis; Normalization; Estimation; Testing; Clustering; Discrimination; Biological verification and interpretation]
Microarray Technology
Two-color systems
MICROARRAY TECHNOLOGY TWO-COLOR PLATFORMS (cDNA or LONG OLIGOS)
An Actual Gene Expression Image
Example: 4 × 12 patches (print tips), 19 × 19 spots per patch
IMAGE ANALYSIS
STEPS IN IMAGE PROCESSING
1. Addressing: locate centers.
2. Segmentation: classification of pixels either as signal or background (using seeded region growing).
3. Information extraction: for each spot of the array, calculate signal intensity pairs, background and quality measures.
Some image analysis software: ArrayWorx, Dapple, GenePix, ImaGene, ScanAlyze, Spot, UCSF Spot, etc.
SEGMENTATION METHODS
• Fixed circles
• Adaptive Circle
• Adaptive Shape – Edge detection – Seeded Region Growing
• Histogram Methods – Adaptive threshold
• Clustering algorithms – Robust to “sickle-cell”, “donut-shaped” spots.
Seeded Region Growing (Yang et al., 2002)
Fixed Circle
SOME LOCAL BACKGROUNDS
GenePix
QuantArray
ScanAlyze
GeneTAC LS IV
Background adjustment method more important than segmentation (Yang et al., 2002)
QUANTIFICATION OF EXPRESSION
• For each spot on the slide, one may calculate:
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
(fg = foreground, bg = background)
• And combine them in the log (base 2) ratio:
log2(Red intensity / Green intensity)
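The per-spot quantification above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular image analysis package: the function name `log2_ratio` is hypothetical, and treating non-positive background-corrected intensities as missing is an assumption, since the slides do not say how such spots are handled.

```python
import math

def log2_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(Red / Green) for one spot.

    Returns None when either corrected intensity is not positive
    (assumption: such spots are treated as missing).
    """
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Red = 1000, Green = 500, so the log2 ratio is log2(2) = 1.0
print(log2_ratio(1200.0, 200.0, 700.0, 200.0))
```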
Data pre-processing (Normalization)
(Geschwind, 2001)
Sources of Variability in Microarray Experiments
Biological heterogeneity
Specimen collection/Handling effects
Biological Heterogeneity within Specimen
RNA extraction/amplification
Fluor labeling
Hybridization
size/shape of spot
sample distribution across slide
Scanning (Voltage/power/software)
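DYE-INTENSITY BIAS
M-A Plot (log intensity ratio vs. mean log-intensity)
$$M = \log\left(\frac{\mathrm{Cy3}}{\mathrm{Cy5}}\right) = \log(\mathrm{Cy3}) - \log(\mathrm{Cy5})$$
$$A = \log\sqrt{\mathrm{Cy3} \times \mathrm{Cy5}} = \frac{\log(\mathrm{Cy3}) + \log(\mathrm{Cy5})}{2}$$
[Figure: M-A plot with M on the vertical axis and A on the horizontal axis (low to high intensities). Ideal situation: points scattered around M = 0. Points with M > 0 have Cy3 > Cy5; points with M < 0 have Cy3 < Cy5.]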
LO(W)ESS: Locally-Weighted Regression and Smoothing Scatterplots
Basic Idea:
• For x = x0 (center of the neighborhood), specify a neighborhood: a certain radius, or a smoothing parameter (measured as a percentage of the data points)
• Weighted least squares (weights given by a decreasing function of the distances from x0) to fit linear or quadratic functions at x0
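The basic idea can be illustrated with a small local linear fit in plain Python. This is a sketch under stated assumptions rather than a reference LOESS implementation: the tricube weight function and the default `span` of 0.4 are common choices that the slides do not specify, the names `tricube` and `loess_at` are hypothetical, and no robustness iterations are performed.

```python
def tricube(u):
    """Tricube weight: decreases with |u| and is zero for |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_at(x0, xs, ys, span=0.4):
    """Weighted least-squares line fitted around x0, evaluated at x0.

    span is the smoothing parameter: the fraction of the data points
    that defines the neighborhood of x0.
    """
    n = len(xs)
    k = min(n, max(2, round(span * n)))
    dists = sorted(abs(x - x0) for x in xs)
    # Radius = distance to the k-th nearest point, inflated slightly so
    # that the boundary point keeps a (tiny) positive weight.
    radius = dists[k - 1] * (1.0 + 1e-6) + 1e-12
    w = [tricube((x - x0) / radius) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# On exactly linear data the local fit reproduces the line.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
print(loess_at(4.5, xs, ys))
```

For an M-A plot, the fitted curve would be obtained by evaluating `loess_at(a, A, M)` at each observed value `a` of A.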
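M-A Plot
[Figure: M-A plot with the fitted LOESS curve and the reference line M = 0]
Normalization
• Adjusted values:
$$M^* = M - \hat{M}$$
where $M^*$ is the normalized M and $\hat{M}$ is the LOESS-predicted M.
• Normalized intensities:
$$\log(\mathrm{Cy3})^* = A + \frac{M^*}{2} \quad \text{and} \quad \log(\mathrm{Cy5})^* = A - \frac{M^*}{2}$$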
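As a worked example of the adjustment, the sketch below maps one spot's log intensities to their normalized values given a LOESS-predicted M. The function name `normalize_spot` and its arguments are hypothetical; the arithmetic follows the definitions of M and A (M = log Cy3 − log Cy5, A = average of the two logs).

```python
def normalize_spot(log_cy3, log_cy5, m_hat):
    """Return (log Cy3*, log Cy5*) after removing the predicted bias m_hat."""
    m = log_cy3 - log_cy5          # M
    a = (log_cy3 + log_cy5) / 2.0  # A
    m_star = m - m_hat             # M* = M - M_hat
    return a + m_star / 2.0, a - m_star / 2.0

# M = 1.0, A = 9.5, M_hat = 0.5, so M* = 0.5 and the pair is (9.75, 9.25)
print(normalize_spot(10.0, 9.0, 0.5))
```

Note that A is unchanged by the adjustment: only the difference between the two channels is corrected.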
Normalization
• LOESS (Local Regression)
[Figure: M-A plots before and after LOESS normalization]
M-A plots for two arrays (each column), before and after quantile normalization (each row).
Large points represent 20 spiked-in probes
Irizarry et al. (2003)
PRINT-TIP SPECIFIC NORMALIZATION
SCALE NORMALIZATION
• Some scale adjustments may be required so that the relative expression levels from one particular experiment (slide) do not dominate the average relative expression levels across replicate experiments.
(Yang et al., 2002)
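WITHIN-SLIDE SCALE NORMALIZATION
Scale normalization across print-tips
$$M_{ij} \sim N(0, a_i^2 \sigma^2)$$
where $M_{ij}$ is the jth log-ratio in the ith print-tip ($j = 1, \ldots, n_i$; $i = 1, \ldots, I$).
Estimate of the scale factor $a_i$:
$$\hat{a}_i^2 = \frac{\sum_{j=1}^{n_i} M_{ij}^2}{\sqrt[I]{\prod_{k=1}^{I} \sum_{j=1}^{n_k} M_{kj}^2}}$$
Normalized values:
$$M_{ij}^* = \frac{M_{ij}}{\hat{a}_i}$$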
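A minimal sketch of this scale normalization, assuming the log-ratios have already been location-normalized (centered near zero) and that every print-tip group contains at least one non-zero value. The function name `scale_normalize` and the list-of-lists input layout are hypothetical; the scale factor is the group's sum of squares divided by the geometric mean of the sums of squares over the I print-tip groups, then square-rooted.

```python
import math

def scale_normalize(groups):
    """groups[i] holds the log-ratios M_ij of print-tip group i.

    Returns (normalized_groups, a_hat), where a_hat[i] estimates a_i.
    """
    sum_sq = [sum(m * m for m in g) for g in groups]
    # I-th root of the product, computed on the log scale for stability.
    log_geo_mean = sum(math.log(s) for s in sum_sq) / len(sum_sq)
    geo_mean = math.exp(log_geo_mean)
    a_hat = [math.sqrt(s / geo_mean) for s in sum_sq]
    normalized = [[m / a for m in g] for g, a in zip(groups, a_hat)]
    return normalized, a_hat

# The second group is twice as spread out as the first; after scaling,
# both groups have the same spread.
normalized, a_hat = scale_normalize([[1.0, -1.0], [2.0, -2.0]])
print(a_hat)
print(normalized)
```

By construction the estimated scale factors multiply to one, so the overall scale of the slide is left unchanged.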
SPATIAL EFFECTS ON SLIDE
Top 2.5% of ratios red, bottom 2.5% of ratios green
For each patch (or print tip), visualize intensities of control genes or use “robust” measures such as the median and trimmed means.
Trimmed mean ($\bar{x}_\alpha$): mean after eliminating the 100·α% smallest and the 100·α% largest values, i.e., the mean of the middle 100·(1 − 2α)% of the values.
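The trimmed mean can be sketched directly from this definition. The function name `trimmed_mean` is hypothetical, and rounding the number of trimmed values down to an integer is an assumption, since the slides do not say how fractional counts are handled.

```python
def trimmed_mean(values, alpha):
    """Mean of the middle 100*(1 - 2*alpha)% of the values."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must be in [0, 0.5)")
    xs = sorted(values)
    k = int(alpha * len(xs))  # values dropped at each end (rounded down)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

# With alpha = 0.25 and 8 values, the 2 smallest and 2 largest are dropped,
# so the outlier 100 has no influence: mean of [3, 4, 5, 6] = 4.5
print(trimmed_mean([1, 2, 3, 4, 5, 6, 7, 100], 0.25))
```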
SPATIAL TREND (‘Median filter’) (Wilson et al., 2003)
• Median log ratio over the spatial neighborhood of each spot (e.g. 3 × 3 block)
• Patch (print-tip) within array, and spot within patch, may be included in the model for the analysis of the data (we’ll see it later)
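A median filter over a grid of log ratios might look like the sketch below. It is an illustration only, not the procedure of Wilson et al. (2003): the function names are hypothetical, neighborhoods are simply clipped at the slide edges, and subtracting the estimated trend from each spot is an assumption about how the trend would be used.

```python
from statistics import median

def spatial_trend(grid, half_width=1):
    """Median over each spot's neighborhood (3 x 3 when half_width = 1)."""
    n_rows, n_cols = len(grid), len(grid[0])
    trend = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            block = [
                grid[i][j]
                for i in range(max(0, r - half_width), min(n_rows, r + half_width + 1))
                for j in range(max(0, c - half_width), min(n_cols, c + half_width + 1))
            ]
            row.append(median(block))
        trend.append(row)
    return trend

def remove_spatial_trend(grid, half_width=1):
    """Subtract the median-filter trend from every spot (assumed usage)."""
    trend = spatial_trend(grid, half_width)
    return [[m - t for m, t in zip(g_row, t_row)]
            for g_row, t_row in zip(grid, trend)]

grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
print(spatial_trend(grid)[1][1])  # median of all nine values: 5.0
```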
black: low expression flagged spots
Row 1: M-A plot and image of log(R/G) spot values
Row 2: Same after LOESS normalization
Row 3: Same after pin normalization
Array 5: Before and after housekeeping normalization
Before and after spatial normalization (rows 1 and 2); four arrays (each column)
Wilson et al. (2003)
Microarray Technology
High Density Oligonucleotide Arrays
MICROARRAY TECHNOLOGY Illumina BeadChip Technology
MICROARRAY TECHNOLOGY Affymetrix Genechip® Gene Expression Microarrays
In situ Oligonucleotide Synthesis: Photolithography
22 different probes for each gene (11 pairs of PM-MM)
An Actual Gene Expression Image
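EXPRESSION INDEX
Signal (expression index), for P probe pairs (PM = perfect match, MM = mismatch) of gene g:
$$s_g = \frac{1}{P} \sum_{i=1}^{P} (PM_{ig} - MM_{ig})$$
“average difference” or “signal”, written with J probe pairs indexed by j:
$$y_g = \frac{1}{J} \sum_{j=1}^{J} (PM_{jg} - MM_{jg})$$
• SOME ALTERNATIVE APPROACHES
  – Model Based Expression Index (MBEI): Li and Wong (2001)
  – MAS 5.0 Statistical Algorithm: Affymetrix (2001)
  – Robust Multichip Average (RMA): Irizarry et al. (2003)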
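The average-difference signal is simple enough to state as code. The function name `average_difference` and the example intensities are hypothetical; the inputs are the perfect-match and mismatch intensities of the probe pairs belonging to one gene.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Three probe pairs with differences 5, 10 and 15: signal = 10.0
print(average_difference([10.0, 20.0, 30.0], [5.0, 10.0, 15.0]))
```

Because MM can exceed PM for some probes, this index can be negative, which is one motivation for the alternative approaches listed above.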
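QUANTILE NORMALIZATION (Bolstad et al., 2003)
• Goal: to give every array the same distribution of probe intensities
• Let $X_{p \times n} = [x_1, x_2, \ldots, x_n]$, where $x_i$ holds the probe intensities for array $i$
• Sort each column of $X$ to give $X_{sort}$
• Take the means across rows of $X_{sort}$ and assign their values to each element in the row to get $X^*$
• Get $X_{normalized}$ by rearranging each column of $X^*$ to have the original ordering as in $X$

$$X = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix}
\qquad
X_{sort} = \begin{bmatrix} x_{1(1)} & x_{2(1)} & \cdots & x_{n(1)} \\ x_{1(2)} & x_{2(2)} & \cdots & x_{n(2)} \\ \vdots & \vdots & & \vdots \\ x_{1(p)} & x_{2(p)} & \cdots & x_{n(p)} \end{bmatrix}$$

$$X^* = \begin{bmatrix} \bar{x}_1 & \bar{x}_1 & \cdots & \bar{x}_1 \\ \bar{x}_2 & \bar{x}_2 & \cdots & \bar{x}_2 \\ \vdots & \vdots & & \vdots \\ \bar{x}_p & \bar{x}_p & \cdots & \bar{x}_p \end{bmatrix}$$

$X_{norm}$: each entry of $X$ is replaced by the row mean $\bar{x}_r$ for the rank $r$ that the entry held within its own array (column), so the subscripts of $\bar{x}$ differ from column to column.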
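The sort, average and rearrange steps can be sketched as follows. This is a minimal illustration, not the implementation of Bolstad et al. (2003): the function name `quantile_normalize` and the column-wise input layout are hypothetical, and ties are broken by probe position, which is an assumption.

```python
def quantile_normalize(columns):
    """columns[i] is the list of p probe intensities for array i."""
    n, p = len(columns), len(columns[0])
    # Sort each column of X to give X_sort.
    sorted_cols = [sorted(col) for col in columns]
    # Means across rows of X_sort (one mean per rank).
    row_means = [sum(col[r] for col in sorted_cols) / n for r in range(p)]
    normalized = []
    for col in columns:
        # Probe indices of this array ordered from smallest to largest.
        order = sorted(range(p), key=lambda j: col[j])
        out = [0.0] * p
        for rank, j in enumerate(order):
            out[j] = row_means[rank]  # put the rank mean back in place
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0],
          [4.0, 1.0, 6.0]]
# Rank means are 1.5, 3.5 and 5.5.
print(quantile_normalize(arrays))
```

After normalization every array contains exactly the same set of values; only their assignment to probes differs.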
CYCLIC LOESS (High Density Oligonucleotide Arrays) (Bolstad et al., 2003)
• M-A plot from two arrays at a time
• Normalization (LOESS) carried out in a pairwise manner
• Adjustments for each of the arrays in each pair are recorded
• For any array k, the adjustments relative to arrays 1, …, k − 1, k + 1, …, n are weighted and applied to array k
• Generally only 1 or 2 complete iterations suffice
qRT-PCR Technology
PCR
RELATIVE QUANTIFICATION (COMPARATIVE CT)
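Estimating Fold Change Between Two Observations
$$FC = \frac{X_{S,q}}{X_{S,cb}} = \frac{K(1+E)^{-\Delta C_{T,q}}}{K(1+E)^{-\Delta C_{T,cb}}} = (1+E)^{-\Delta\Delta C_T}$$
where $\Delta\Delta C_T = \Delta C_{T,q} - \Delta C_{T,cb}$, and each $\Delta C_T$ is the difference $C_T(X) - C_T(R)$ between target X and reference R.
If $E = 1$:
$$FC_{q,cb} = \frac{X_{S,q}}{X_{S,cb}} = 2^{-\Delta\Delta C_T}$$
If $E_X \neq E_R \neq 1$:
$$FC = \frac{X_{S,q}}{X_{S,cb}} = \frac{(1+E_X)^{[C_{T,cb}(X) - C_{T,q}(X)]}}{(1+E_R)^{[C_{T,cb}(R) - C_{T,q}(R)]}}$$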
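A small worked example of the fold-change calculation, written so that the general form with separate efficiencies reduces to the comparative C_T result when both efficiencies equal 1. The function name `fold_change` and its argument names are hypothetical; `q` denotes the sample of interest and `cb` the sample it is compared against, following the subscripts in the formulas.

```python
def fold_change(ct_x_q, ct_r_q, ct_x_cb, ct_r_cb, e_x=1.0, e_r=1.0):
    """Fold change of target X between q and cb, relative to reference R.

    e_x and e_r are the amplification efficiencies of X and R;
    with e_x = e_r = 1 the result equals 2 ** (-ddCT).
    """
    numerator = (1.0 + e_x) ** (ct_x_cb - ct_x_q)
    denominator = (1.0 + e_r) ** (ct_r_cb - ct_r_q)
    return numerator / denominator

# dCT(q) = 22 - 18 = 4, dCT(cb) = 25 - 18 = 7, ddCT = -3, so FC = 2**3 = 8
print(fold_change(ct_x_q=22.0, ct_r_q=18.0, ct_x_cb=25.0, ct_r_cb=18.0))
```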
COMMENTS
• ΔΔCT methods lack generality for statistical analyses of hierarchically replicated qRT-PCR data
• Linear mixed models are more appropriate for the analysis of relative quantification RT-PCR data *
* Steibel JP, Poletto R, Coussens PM and Rosa GJM. A powerful and flexible linear mixed model framework for the analysis of relative quantification RT-PCR data. Genomics 94: 146-152, 2009.
RNAseq Technology
FIRST GENERATION SEQUENCING (Sanger, 1977)
The Sanger Method
1. Create an entire sequence of nested sub-fragments, including the original fragment
2. Figure out which base each fragment ends with (4 tubes and using a gel)
Automation of the Sanger method
• Fluorescently labeled dideoxynucleotides
• The gel is “read” by a fluorimeter and the data are stored in a computer file
NEXT GENERATION SEQUENCING
• Takes advantage of miniaturization to engage in massively parallel analysis
• Applications: whole genome sequencing, RNA-Seq, ChIP-Seq, etc.
• Anything we can do with microarrays, we can probably do better with sequencing techniques
DIFFERENT PLATFORMS
Platforms: Illumina Solexa System; 454 Roche; ABI SOLiD; Helicos Bioscience
Platforms differ in:
• Read length
• Number of reads
• Total throughput (size of the data)
• Time for the analysis
• Costs
RNA-Seq TECHNOLOGY
Measuring transcriptomes with RNA-Seq
Two key concepts related to RNA-Seq
• Paired-end sequencing [Figure labels: 220 bp; two reads, 80 bp; 150 bp]
• Mapping [Figure labels: genome; transcriptome; reads; splice junction fragments]
Tasks with RNA-Seq data
• Differential expression. Given: RNA-Seq reads from two different samples and transcript sequences. Do: predict which transcripts have different abundances between the two samples.
• Assembly. Given: RNA-Seq reads (and possibly a genome sequence). Do: reconstruct full-length transcript sequences from the reads.
• Quantification. Given: RNA-Seq reads and transcript sequences. Do: estimate the relative abundances of transcripts (“gene expression”).
Advantages of RNA-Seq over Microarrays
1. No reference sequence needed
2. Low background noise
3. High technical reproducibility
4. Larger dynamic range of expression levels
5. Analysis of alternative splicing
6. Identification and characterization of novel transcripts
RNA-Seq Computational Pipeline
• The number of reads (counts) mapping to the biological feature of interest (e.g. gene, transcript or exon) is considered to be linearly related to the abundance of the target feature
Gene counts depend on:
  – sequencing depth
  – gene length
  – expression level
NORMALIZATION
MEASURING EXPRESSION
Problem: Need to scale RNA counts per gene to total sample coverage
Solution: Divide counts by the total number of reads in the sample, in millions (counts per million reads)
Problem: Longer genes have more reads, which gives a better chance to detect DE
Solution: Divide counts by gene length
• Differential expression requires comparison of 2 or more RNA-Seq samples
• The number of reads (coverage) will not be exactly the same for each sample
Normalization method
$$\mathrm{RPKM} = \frac{\text{number of reads of the region}}{\dfrac{\text{total reads}}{1{,}000{,}000} \times \dfrac{\text{region length}}{1000}}$$
Counts are divided by the transcript length (kb) times the total number of millions of mapped reads: reads per kb per million reads sequenced
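The RPKM calculation can be checked with a short example. The function name `rpkm` and the example counts are hypothetical; the region length is given in base pairs and converted to kilobases inside the function.

```python
def rpkm(region_reads, total_reads, region_length_bp):
    """Reads per kilobase of region per million reads."""
    millions_of_reads = total_reads / 1_000_000
    length_kb = region_length_bp / 1_000
    return region_reads / (millions_of_reads * length_kb)

# 500 reads on a 2 kb gene in a library of 10 million reads:
# 500 / (10 * 2) = 25.0
print(rpkm(500, 10_000_000, 2_000))
```

Two genes with the same RPKM but different lengths therefore have different raw counts, which is why raw counts alone are not comparable across genes.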
ANALYSIS OF DIFFERENTIAL EXPRESSION
• Parametric approaches: counts modeled using known probability distributions such as Binomial, Poisson or Negative Binomial; linear model methodology (Gaussian approximation) for transformed data, e.g. log transformation
• Non-parametric approaches: Chi-squared based approaches; Fisher’s exact test; sampling-based approaches
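As one illustration of the approaches listed, Fisher's exact test can be applied to a single gene by arranging its counts in a 2 × 2 table (reads from the gene vs. all other reads, in sample 1 vs. sample 2). The sketch below is a plain-Python version under stated assumptions: the function names are hypothetical, the two-sided p-value sums all tables no more probable than the observed one, and the log-gamma arithmetic loses precision for very large library sizes, so a statistical package would be preferable in practice.

```python
import math

def _log_choose(n, k):
    """log of the binomial coefficient C(n, k) via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    a, b: reads from the gene in samples 1 and 2
    c, d: all remaining reads in samples 1 and 2
    """
    gene_total, sample1_total, total = a + b, a + c, a + b + c + d

    def log_p(k):
        # Hypergeometric probability of k gene reads falling in sample 1.
        return (_log_choose(sample1_total, k)
                + _log_choose(total - sample1_total, gene_total - k)
                - _log_choose(total, gene_total))

    lo = max(0, gene_total + sample1_total - total)
    hi = min(gene_total, sample1_total)
    observed = log_p(a)
    p = sum(math.exp(log_p(k)) for k in range(lo, hi + 1)
            if log_p(k) <= observed + 1e-7)
    return min(1.0, p)

print(fisher_exact_two_sided(3, 1, 1, 3))
```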