
Department of Animal Sciences Department of Biostatistics & Medical Informatics

University of Wisconsin - Madison

Guilherme J. M. Rosa

Microarrays and RNAseq: Technologies and Data Processing

OUTLINE

• Introduction
   – Central dogma of molecular biology

• Transcriptional Profiling Technologies
   – Earlier methods
   – Microarrays
   – RT-PCR
   – RNA-Seq

• Data Acquisition and Data Pre-processing

CENTRAL DOGMA OF MOLECULAR BIOLOGY

Environment

Genetics

Phenotype Gene expression

GENE EXPRESSION ASSAY TECHNOLOGIES (TRANSCRIPTION LEVEL):

• Southern blotting and Northern blotting
• Microarrays
• RT-PCR
• RNAseq

SOUTHERN BLOTTING

Detection of specific DNA fragments by gel-transfer hybridization:

1)  The mixture of double-stranded DNA fragments generated by restriction nuclease treatment of DNA is separated according to length by electrophoresis.

2)  A sheet of either nitrocellulose paper or nylon paper is laid over the gel, and the separated DNA fragments are transferred to the sheet by blotting.

3)  The gel is supported on a layer of sponge in a bath of alkali solution, and the buffer is sucked through the gel and the nitrocellulose paper by paper towels stacked on top of the nitrocellulose.

4)  As the buffer is sucked through, it denatures the DNA and transfers the single-stranded fragments from the gel to the surface of the nitrocellulose sheet, where they adhere firmly. (This transfer is necessary to keep the DNA firmly in place while the hybridization procedure is carried out.)

5)  The nitrocellulose sheet is carefully peeled off the gel.

6)  The sheet containing the bound single-stranded DNA fragments is placed in a sealed container together with buffer containing a radioactively labeled DNA probe specific for the required DNA sequence.

7)  The sheet is exposed for a prolonged period to the probe under conditions favoring hybridization.

8)  The sheet is removed from the container and washed thoroughly, so that only probe molecules that have hybridized to the DNA on the paper remain attached.

9)  After autoradiography, the DNA that has hybridized to the labeled probe will show up as bands on the autoradiograph.

•  An adaptation of this technique to detect specific sequences in RNA is called Northern blotting. In this case mRNA molecules are electrophoresed through the gel and the probe is usually a single-stranded DNA molecule.

•  Northern blots allow investigators to determine the molecular weight of an mRNA and to measure relative amounts of the mRNA present in different samples.

MICROARRAY TECHNOLOGY

Microarrays use the natural chemical attraction between DNA and RNA molecules to determine the expression level of genes.

C pairs with G, and A pairs with T or U.

A good match sticks; a bad match doesn't.

Homemade arrays: PCR products or pre-synthesized oligonucleotide probes are spotted using robot technology; two-color system

Affymetrix: high-density oligonucleotide array, short (25-mer) oligos synthesized in situ (photolithography); single-channel

Agilent (HP): pre-synthesized oligonucleotide (60-mer) probes are printed using inkjet technology; two-color system

Illumina: pre-synthesized oligonucleotide (50-mer) probes; single-channel, multiple arrays (6 or 8) per slide

NimbleGen: pre-synthesized oligonucleotide (60-mer) probes; multiple probes per gene; 4-plex (4 samples per array)

Variations: kind of probe (PCR product or oligos), length of oligos, how probes are deposited on the slide, pre-synthesized or synthesized in situ, and number of samples co-hybridized on each slide

TWO COLOR VS. SINGLE CHANNEL SYSTEMS

SINGLE CHANNEL

TWO COLOR SYSTEMS

MICROARRAY PLATFORMS

Multi-step process to extract RNA from the sample and make millions of copies.

Chop up the RNA.

At the same time the RNA is copied, molecules of a chemical called biotin (orange cups) are attached to each strand. These biotin molecules will act as a molecular glue for fluorescent molecules.

Wash Sample Over the Array

Fluorescent stain that sticks to the biotin

The amount of fluorescent stain is proportional to the amount of RNA molecules that hybridized to the DNA probe.

Comparing Gene Expression between “loud speakers” and “normal speakers”

[Figure: workflow of a microarray study. Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Image analysis → Normalization → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation]

Microarray Technology

Two-color systems

MICROARRAY TECHNOLOGY TWO-COLOR PLATFORMS (cDNA or LONG OLIGOS)

An Actual Gene Expression Image

Example: 4 × 12 patches (print tips), 19 × 19 spots per patch

IMAGE ANALYSIS

STEPS IN IMAGE PROCESSING

1. Addressing: locate the spot centers.

2. Segmentation: classify pixels as either signal or background (e.g. using seeded region growing).

3. Information extraction: for each spot of the array, calculate signal intensity pairs, background and quality measures.

Some image analysis software: ArrayWorx, Dapple, GenePix, ImaGene, ScanAlyze, Spot, UCSF Spot, etc.

SEGMENTATION METHODS

•  Fixed circles

•  Adaptive Circle

•  Adaptive Shape
   –  Edge detection
   –  Seeded Region Growing

•  Histogram Methods
   –  Adaptive threshold

•  Clustering algorithms
   –  Robust to “sickle-cell”, “donut-shaped” spots.

Seeded Region Growing (Yang et al., 2002)

Fixed Circle

SOME LOCAL BACKGROUNDS

GenePix

QuantArray

ScanAlyze

GeneTAC LS IV

The background adjustment method is more important than the segmentation method (Yang et al., 2002)

QUANTIFICATION OF EXPRESSION

• For each spot on the slide, we may calculate:

Red intensity = Rfg - Rbg

Green intensity = Gfg - Gbg

(fg = foreground, bg = background)

• And combine them in the log (base 2) ratio:

Log2(Red intensity / Green intensity)
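As a minimal sketch of the quantification step, the following Python function computes the background-corrected log2(Red/Green) ratio per spot. The function name, the example intensities, and the choice to return NaN for spots whose corrected intensity is not positive are assumptions made for illustration; image analysis packages handle such spots in their own ways.

```python
import numpy as np

def log2_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(Red/Green) for each spot."""
    red = np.asarray(r_fg, dtype=float) - np.asarray(r_bg, dtype=float)
    green = np.asarray(g_fg, dtype=float) - np.asarray(g_bg, dtype=float)
    # Assumption: a spot with a non-positive corrected intensity gets NaN,
    # because its log ratio is undefined.
    ok = (red > 0) & (green > 0)
    out = np.full(red.shape, np.nan)
    out[ok] = np.log2(red[ok] / green[ok])
    return out

# Hypothetical foreground/background values for three spots
ratios = log2_ratio(
    r_fg=[1100, 300, 90], r_bg=[100, 100, 100],
    g_fg=[600, 900, 400], g_bg=[100, 100, 100],
)
print(ratios)
```

The third spot has a red foreground below its background, so its ratio is left undefined rather than forced to a number.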

Data pre-processing (Normalization)

(Geschwind, 2001)

Sources of Variability in Microarray Experiments

Biological heterogeneity

Specimen collection/Handling effects

Biological Heterogeneity within Specimen

RNA extraction/amplification

Fluor labeling

Hybridization

size/shape of spot

sample distribution across slide

Scanning (Voltage/power/software)

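DYE-INTENSITY BIAS

M-A Plot (log intensity ratio vs. mean log-intensity)

$$M = \log\left(\frac{\mathrm{Cy3}}{\mathrm{Cy5}}\right) = \log(\mathrm{Cy3}) - \log(\mathrm{Cy5})$$

$$A = \log\sqrt{\mathrm{Cy3} \times \mathrm{Cy5}} = \frac{\log(\mathrm{Cy3}) + \log(\mathrm{Cy5})}{2}$$

[Figure: M-A plot with M on the vertical axis and A on the horizontal axis (low to high intensities). Ideal situation: points scattered around M = 0. Region labels: Cy3 > Cy5 (M > 0) and Cy3 < Cy5 (M < 0).]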

LO(W)ESS: Locally-Weighted Regression and Smoothing Scatterplots

Basic Idea:

• For x = x0 (center of the neighborhood), specify a neighborhood: a certain radius, or a smoothing parameter (measured as a percentage of the data points)

• Weighted least squares (weights given by a decreasing function of the distances from x0) to fit linear or quadratic functions at x0
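The basic idea can be sketched in a few lines of Python. This is an illustrative local linear smoother, not the exact algorithm of any particular package: the function name `loess_fit`, the default `span` of 0.4, the tricube weight function, and the simulated M-A data are all assumptions, and the robustness iterations used by standard LOWESS implementations are left out.

```python
import numpy as np

def loess_fit(x, y, span=0.4):
    """Local linear regression evaluated at every observed x.

    span: fraction of the data points included in each neighborhood.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(span * n))))  # points per neighborhood
    fitted = np.empty(n)
    for i, x0 in enumerate(x):
        dist = np.abs(x - x0)
        # Radius of the neighborhood: distance to the k-th nearest point
        radius = max(np.sort(dist)[k - 1], 1e-12)
        # Tricube weights (an assumed choice): decrease with distance from x0
        w = np.clip(1.0 - (dist / radius) ** 3, 0.0, None) ** 3
        # Weighted least squares for a line centered at x0
        design = np.column_stack([np.ones(n), x - x0])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0]  # the intercept is the fitted value at x0
    return fitted

# Simulated M-A data with an intensity-dependent trend (hypothetical)
rng = np.random.default_rng(0)
a = np.sort(rng.uniform(6, 14, 200))
m = 0.5 - 0.05 * (a - 10) ** 2 + rng.normal(0, 0.1, 200)
m_hat = loess_fit(a, m)
```

A larger `span` gives a smoother curve; a smaller one follows local structure more closely.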

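M-A Plot

[Figure: M-A plot with the fitted curve and a horizontal reference line at M = 0.]

Normalization

• Normalized M (observed M minus loess-predicted M):

$$M^* = M - \hat{M}$$

• Normalized intensities (adjusted values):

$$\log(\mathrm{Cy3}^*) = A + \frac{M^*}{2} \quad \text{and} \quad \log(\mathrm{Cy5}^*) = A - \frac{M^*}{2}$$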
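The adjustment can be written as a short Python sketch: the normalized M is the observed M minus the smoother's prediction, and the adjusted log intensities are A plus or minus half of the normalized M. The function names are hypothetical, and the straight-line `linear_trend` is only a stand-in so the block runs on its own; in practice a loess smoother would be passed in its place.

```python
import numpy as np

def normalize_ma(log_cy3, log_cy5, smoother):
    """Intensity-dependent normalization on the M-A scale.

    smoother(a, m) must return the predicted M at each A.
    """
    log_cy3 = np.asarray(log_cy3, dtype=float)
    log_cy5 = np.asarray(log_cy5, dtype=float)
    m = log_cy3 - log_cy5            # M = log(Cy3) - log(Cy5)
    a = (log_cy3 + log_cy5) / 2.0    # A = mean log intensity
    m_star = m - smoother(a, m)      # normalized M
    # Adjusted log intensities: same A, difference equal to M*
    return a + m_star / 2.0, a - m_star / 2.0, m_star

def linear_trend(a, m):
    """Stand-in smoother (assumption): a global straight line, not loess."""
    slope, intercept = np.polyfit(a, m, 1)
    return slope * a + intercept

# Hypothetical data whose dye bias is exactly linear in A
a_true = np.array([8.0, 10.0, 12.0, 14.0])
m_true = 0.2 * a_true - 1.0
cy3_adj, cy5_adj, m_star = normalize_ma(
    a_true + m_true / 2, a_true - m_true / 2, linear_trend
)
```

Because the simulated bias is purely linear, the normalized M values come out at zero and both adjusted channels equal A.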

Normalization

• LOESS (Local Regression) [Figure panels: Before, After]

[Figure: M-A plots for two arrays (each column), before and after quantile normalization (each row). Large points represent 20 spiked-in probes. Irizarry et al. (2003)]

PRINT-TIP SPECIFIC NORMALIZATION

SCALE NORMALIZATION

• Some scale adjustments may be required so that the relative expression levels from one particular experiment (slide) do not dominate the average relative expression levels across replicate experiments (Yang et al., 2002).

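WITHIN-SLIDE SCALE NORMALIZATION

Scale normalization across print-tips

$M_{ij}$: jth log-ratio in the ith print-tip ($i = 1, \ldots, I$; $j = 1, \ldots, n$)

$$M_{ij} \sim N(0,\; a_i^2 \sigma^2)$$

Estimate of the scale factor $a_i$:

$$\hat{a}_i^2 = \frac{\sum_{j=1}^{n} M_{ij}^2}{\sqrt[I]{\prod_{k=1}^{I} \sum_{j=1}^{n} M_{kj}^2}}$$

Normalized values:

$$M_{ij}^* = \frac{M_{ij}}{\hat{a}_i}$$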
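A hedged Python sketch of scale normalization across print-tips follows. It assumes the estimator takes the sum of squared log-ratios in each print-tip group divided by the geometric mean of those sums over all groups, so that the estimated squared scale factors multiply to one; the function name and the example values are hypothetical.

```python
import numpy as np

def scale_normalize(m, tip):
    """Rescale log-ratios so every print-tip group has a comparable spread.

    m:   log-ratios for all spots on one slide
    tip: print-tip (patch) label of each spot
    """
    m = np.asarray(m, dtype=float)
    tips, idx = np.unique(np.asarray(tip), return_inverse=True)
    # Sum of squared log-ratios within each print-tip group
    ss = np.array([np.sum(m[idx == g] ** 2) for g in range(len(tips))])
    # Geometric mean of the sums across the I groups (the I-th root of the product)
    geo_mean = np.exp(np.mean(np.log(ss)))
    a_hat = np.sqrt(ss / geo_mean)   # estimated scale factor per print-tip
    return m / a_hat[idx], dict(zip(tips.tolist(), a_hat.tolist()))

# Two hypothetical print-tips; the second is four times more spread out
m_obs = [1, -1, 2, -2, 4, -4, 8, -8]
tip_id = [0, 0, 0, 0, 1, 1, 1, 1]
m_scaled, a_est = scale_normalize(m_obs, tip_id)
print(a_est)  # estimated scale factors per print-tip
```

After rescaling, both groups have the same spread, while the overall scale of the slide is preserved through the geometric-mean constraint.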

SPATIAL EFFECTS ON SLIDE

Top 2.5% of ratios red, bottom 2.5% of ratios green



For each patch (or print tip), visualize intensities of control genes or use “robust” measures such as the median and trimmed means.

Trimmed mean ($\bar{x}_\alpha$): mean after eliminating the 100α% smallest and the 100α% largest values, i.e., the mean of the middle 100(1 − 2α)% of the values.
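A small Python sketch of the trimmed mean, under the assumption that the number of values dropped at each end is the integer part of α times the sample size (conventions differ between implementations). The function name and the example values are illustrative.

```python
import numpy as np

def trimmed_mean(values, alpha):
    """Mean after dropping the 100*alpha% smallest and largest values."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must be in [0, 0.5)")
    x = np.sort(np.asarray(values, dtype=float))
    cut = int(np.floor(alpha * len(x)))  # values removed at each end
    return x[cut:len(x) - cut].mean()

# One outlying spot barely moves the trimmed mean
spots = [1.0, 2.0, 3.0, 4.0, 100.0]
print(np.mean(spots), trimmed_mean(spots, 0.2))
```

With α = 0.2 and five values, one value is removed from each end, so the outlier at 100 no longer affects the summary.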

SPATIAL TREND (‘Median filter’) (Wilson et al., 2003)

• Median log ratio over the spatial neighborhood of each spot (e.g. a 3 × 3 block)

• Patch (print-tip) within array, and spot within patch, may be included in the model for the analysis of the data (we’ll see it later)

Wilson et al. (2003)
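The median filter can be sketched in Python as a moving median over a block around each spot. The function name, the NaN padding at the slide edges, and the final step of subtracting the trend from the log ratios are assumptions for illustration and are not taken from Wilson et al. (2003).

```python
import numpy as np

def neighborhood_median(m_grid, size=3):
    """Median log ratio in a size x size block centered on each spot."""
    m = np.asarray(m_grid, dtype=float)
    r = size // 2
    # Pad with NaN so that edge spots use only the neighbors that exist
    padded = np.pad(m, r, mode="constant", constant_values=np.nan)
    trend = np.empty_like(m)
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            trend[i, j] = np.nanmedian(padded[i:i + size, j:j + size])
    return trend

# Hypothetical 5 x 5 patch: constant spatial offset plus one true signal
grid = np.ones((5, 5))
grid[2, 2] = 10.0
trend = neighborhood_median(grid)
residual = grid - trend   # assumed use: remove the spatial trend
```

The median ignores the single high spot, so the estimated trend stays flat and the signal survives the subtraction.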

black: low expression flagged spots

Row 1: M-A plot and image of log(R/G) spot values

Row 2: Same after Loess normalization

Row 3: Same after pin normalization

Array 5: Before and after housekeeping normalization

Before and after spatial normalization (rows 1 and 2); four arrays (each column)

Wilson et al. (2003)

Microarray Technology

High Density Oligonucleotide Arrays

MICROARRAY TECHNOLOGY Illumina BeadChip Technology

MICROARRAY TECHNOLOGY Affymetrix Genechip® Gene Expression Microarrays

In situ Oligonucleotide Synthesis: Photolithography

22 different probes for each gene (11 pairs of PM-MM)

An Actual Gene Expression Image

Signal (expression index):

$$s_g = \frac{1}{P} \sum_{i=1}^{P} (PM_{ig} - MM_{ig})$$

EXPRESSION INDEX

“average difference” or “signal”:

$$y_g = \frac{1}{J} \sum_{j=1}^{J} (PM_{gj} - MM_{gj})$$

• SOME ALTERNATIVE APPROACHES

   – Model Based Expression Index (MBEI): Li and Wong (2001)

   – MAS 5.0 Statistical Algorithm: Affymetrix (2001)

   – Robust Multichip Average (RMA): Irizarry et al. (2003)
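As a minimal sketch, the average difference for a gene is the mean of the PM minus MM differences over its probe pairs. The function name and the intensities below are hypothetical, and only three probe pairs per gene are used to keep the example short.

```python
import numpy as np

def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of each gene.

    pm, mm: arrays of shape (genes, probe pairs).
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    return np.mean(pm - mm, axis=-1)

# Two hypothetical genes with three probe pairs each
pm = [[120, 200, 150], [80, 90, 100]]
mm = [[100, 150, 110], [85, 70, 90]]
signal = average_difference(pm, mm)
print(signal)
```

Note that a single pair with MM above PM pulls the index down, which is one motivation for the alternative approaches listed on the slide.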

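QUANTILE NORMALIZATION (Bolstad et al., 2003)

• Goal: to give each array the same distribution of probe intensities

• Let $X_{p \times n} = [x_1, \ldots, x_n]$, where $x_i$ holds the probe intensities for array $i$:

$$X = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix}$$

• Sort each column of X to give $X_{sort}$:

$$X_{sort} = \begin{bmatrix} x_{1(1)} & x_{2(1)} & \cdots & x_{n(1)} \\ x_{1(2)} & x_{2(2)} & \cdots & x_{n(2)} \\ \vdots & \vdots & & \vdots \\ x_{1(p)} & x_{2(p)} & \cdots & x_{n(p)} \end{bmatrix}$$

• Take the means across rows of $X_{sort}$ and assign their values to each element in the row to get $X^*$:

$$X^* = \begin{bmatrix} \bar{x}_{1} & \bar{x}_{1} & \cdots & \bar{x}_{1} \\ \bar{x}_{2} & \bar{x}_{2} & \cdots & \bar{x}_{2} \\ \vdots & \vdots & & \vdots \\ \bar{x}_{p} & \bar{x}_{p} & \cdots & \bar{x}_{p} \end{bmatrix}$$

• Get $X_{normalized}$ by rearranging each column of $X^*$ to have the original ordering as in X. Each column of $X_{normalized}$ then contains the same values $\bar{x}_{1}, \ldots, \bar{x}_{p}$, in a column-specific order.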
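The quantile normalization steps can be sketched with NumPy as follows. The function name and the small example matrix are hypothetical, and ties are broken by position (a stable sort), which is an assumption; published implementations may average tied ranks instead.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile normalization of a probes x arrays matrix."""
    x = np.asarray(x, dtype=float)
    # Sort each column (array) and remember where each value came from
    order = np.argsort(x, axis=0, kind="stable")
    x_sort = np.take_along_axis(x, order, axis=0)
    # Mean across each row of the sorted matrix
    row_means = x_sort.mean(axis=1)
    # Put the k-th row mean back where the k-th smallest value of each column was
    x_norm = np.empty_like(x)
    np.put_along_axis(x_norm, order, row_means[:, None], axis=0)
    return x_norm

# Hypothetical intensities: 4 probes (rows) on 3 arrays (columns)
x = np.array([[5, 4, 3],
              [2, 1, 4],
              [3, 6, 6],
              [4, 2, 8]])
x_norm = quantile_normalize(x)
print(x_norm)
```

After normalization every column contains exactly the same set of values, so all arrays share one empirical distribution while each probe keeps its rank within its array.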

CYCLIC LOESS (High Density Oligonucleotide Arrays)

(Bolstad et al., 2003)

• M-A plot from two arrays at a time

• Normalization (LOESS) carried out in a pairwise manner

• Adjustments for each of the arrays in each pair are recorded

• For any array k, the adjustments relative to arrays 1, … , k – 1, k + 1, … , n are weighted and applied to array k

• Generally only 1 or 2 complete iterations suffice

qRT-PCR Technology

RELATIVE QUANTIFICATION (COMPARATIVE CT)

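Estimating Fold Change Between Two Observations

$$FC_{q,cb} = \frac{X_{S,q}}{X_{S,cb}} = \frac{K\,(1+E)^{-\Delta C_{T,q}}}{K\,(1+E)^{-\Delta C_{T,cb}}} = (1+E)^{-\Delta\Delta C_T}$$

If E = 1:

$$FC_{q,cb} = \frac{X_{S,q}}{X_{S,cb}} = 2^{-\Delta\Delta C_T}$$

If $E_X \neq E_R \neq 1$:

$$FC_{q,cb} = \frac{X_{S,q}}{X_{S,cb}} = \frac{(1+E_X)^{\left[C_T^{cb}(X) - C_T^{q}(X)\right]}}{(1+E_R)^{\left[C_T^{cb}(R) - C_T^{q}(R)\right]}}$$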
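A hedged Python sketch of the two fold-change calculations is given below. It assumes the usual comparative CT definitions, which the slide does not spell out: ΔCT is the CT of the target gene (X) minus the CT of the reference gene (R) within an observation, and ΔΔCT is ΔCT of observation q minus ΔCT of the calibrator cb. Function names and CT values are hypothetical.

```python
def fold_change_ddct(ct_x_q, ct_r_q, ct_x_cb, ct_r_cb, efficiency=1.0):
    """Fold change of q relative to cb with a common efficiency E.

    With efficiency = 1 this reduces to 2 ** (-ddCT).
    """
    d_ct_q = ct_x_q - ct_r_q        # delta CT in observation q
    d_ct_cb = ct_x_cb - ct_r_cb     # delta CT in the calibrator cb
    dd_ct = d_ct_q - d_ct_cb        # delta-delta CT
    return (1.0 + efficiency) ** (-dd_ct)

def fold_change_unequal(ct_x_q, ct_r_q, ct_x_cb, ct_r_cb, e_x, e_r):
    """Fold change when target (E_X) and reference (E_R) efficiencies differ."""
    numerator = (1.0 + e_x) ** (ct_x_cb - ct_x_q)
    denominator = (1.0 + e_r) ** (ct_r_cb - ct_r_q)
    return numerator / denominator

# Hypothetical CT values: the target crosses threshold 2 cycles earlier in q
fc_equal = fold_change_ddct(22.0, 18.0, 24.0, 18.0)
fc_general = fold_change_unequal(22.0, 18.0, 24.0, 18.0, e_x=0.95, e_r=0.90)
print(fc_equal, fc_general)
```

When both efficiencies equal 1, the second function returns the same value as the first, which is a quick consistency check between the two formulas.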

COMMENTS

• ΔΔCT methods lack generality for statistical analyses of hierarchically replicated qRT-PCR data

• Linear mixed models are more appropriate for the analysis of relative quantification RT-PCR data *

* Steibel JP, Poletto R, Coussens PM and Rosa GJM. A powerful and flexible linear mixed model framework for the analysis of relative quantification RT-PCR data. Genomics 94: 146-152, 2009.

RNAseq Technology

FIRST GENERATION SEQUENCING (Sanger, 1977)

The Sanger Method

1.  Create a complete series of nested subfragments, including the original fragment

2.  Figure out which base each fragment ends with, using 4 tubes and a gel

Automation of the Sanger method

• Fluorescently labeled dideoxynucleotides

• The gel is “read” by a fluorimeter and the data are stored in a computer file

NEXT GENERATION SEQUENCING

• Takes advantage of miniaturization to engage in massively parallel analysis

• Applications: whole genome sequencing, RNA-Seq, ChIP-Seq, etc.

• Anything we can do with microarrays, we can probably do better with sequencing techniques

DIFFERENT PLATFORMS

Platforms differ in:

• Read length

• Number of reads

• Total throughput (size of the data)

• Time for the analysis

• Costs

Platforms shown: Illumina (Solexa) system, Roche 454, ABI SOLiD, Helicos BioSciences

RNA-Seq TECHNOLOGY

Measuring transcriptomes with RNA-Seq

Two key concepts related to RNA-Seq

• Paired-end sequencing: two reads per fragment (figure labels: 220 bp, 80 bp, 150 bp)

• Mapping: reads are mapped to the genome or to the transcriptome; some fragments span splice junctions

Tasks with RNA-Seq data

• Differential expression
   – Given: RNA-Seq reads from two different samples and transcript sequences
   – Do: Predict which transcripts have different abundances between the two samples

• Assembly
   – Given: RNA-Seq reads (and possibly a genome sequence)
   – Do: Reconstruct full-length transcript sequences from the reads

• Quantification
   – Given: RNA-Seq reads and transcript sequences
   – Do: Estimate the relative abundances of transcripts (“gene expression”)

Advantages of RNA-Seq over Microarrays

1.  No reference sequence needed

2.  Low background noise

3.  High technical reproducibility

4.  Larger dynamic range of expression levels

5.  Analysis of alternative splicing

6.  Identification and characterization of novel transcripts


RNA-Seq Computational Pipeline

• The number of reads (counts) mapping to the biological feature of interest (e.g. gene, transcript or exon) is considered to be linearly related to the abundance of the target feature

Gene counts depend on:
   – sequencing depth
   – gene length
   – expression level

NORMALIZATION

MEASURING EXPRESSION

Problem: Need to scale RNA counts per gene to total sample coverage

Solution: Divide counts by the total number of mapped reads, in millions (counts per million reads)

Problem: Longer genes have more reads, which gives a better chance of detecting differential expression (DE)

Solution: Divide counts by gene length

• Differential expression requires comparison of 2 or more RNA-Seq samples

• The number of reads (coverage) will not be exactly the same for each sample

Normalization method

$$\mathrm{RPKM} = \frac{\text{number of reads of the region}}{\dfrac{\text{total reads}}{1{,}000{,}000} \times \dfrac{\text{region length}}{1{,}000}}$$

Counts are divided by the transcript length (kb) times the total number of millions of mapped reads: reads per kilobase per million mapped reads
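The RPKM calculation is simple enough to show as a short Python sketch. The function name and the numbers are hypothetical; the region length is taken in base pairs and the total is the number of mapped reads in the sample.

```python
def rpkm(region_reads, total_reads, region_length_bp):
    """Reads per kilobase of region per million mapped reads."""
    millions_of_reads = total_reads / 1_000_000
    length_kb = region_length_bp / 1_000
    return region_reads / (millions_of_reads * length_kb)

# Hypothetical gene: 500 reads, 2 kb long, in a library of 20 million mapped reads
value = rpkm(region_reads=500, total_reads=20_000_000, region_length_bp=2_000)
print(value)  # 12.5
```

Doubling the sequencing depth while also doubling the gene's read count leaves the value unchanged, which is the purpose of the depth scaling.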

ANALYSIS OF DIFFERENTIAL EXPRESSION

• Parametric approaches: counts modeled using known probability distributions such as Binomial, Poisson or Negative Binomial; linear model methodology (Gaussian approximation) for transformed data, e.g. log transformation

• Non-parametric approaches: Chi-squared based approaches; Fisher’s exact test; sampling-based approaches
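As one illustration of the approaches listed, the sketch below applies a two-sided Fisher's exact test to a single gene, using a 2 × 2 table of reads from the gene versus all other reads in two samples. This is a standard-library-only sketch under assumed conventions (the two-sided p-value sums all tables that are no more probable than the observed one); the function name and the counts are hypothetical, and the test ignores biological replication.

```python
from math import exp, lgamma

def _log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    For RNA-Seq: a, c = reads from the gene in samples 1 and 2;
    b, d = all remaining reads in samples 1 and 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_prob(x):
        # Hypergeometric probability of x gene reads falling in sample 1
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(total, col1)

    log_observed = log_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    p_value = sum(
        exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= log_observed + 1e-7
    )
    return min(1.0, p_value)

# Hypothetical gene: 30 of 10,000 reads in sample 1 vs 60 of 10,000 in sample 2
p = fisher_exact_two_sided(30, 9_970, 60, 9_940)
print(p)
```

Working with log-probabilities keeps the calculation stable for library sizes in the millions, where the binomial coefficients themselves would be astronomically large.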
