Nadia Davidson - Introduction to RNA-Seq
DESCRIPTION
The central dogma of genetics is that the genome, composed of DNA, encodes many thousands of genes that can be transcribed into RNA. The RNA may then be translated into amino acids, giving a functional protein. While the genome of an individual is identical in each cell throughout their body, the number of transcribed copies of each gene, as RNA, differs because each tissue type has different functional requirements. An important area of research within genetics is to study the genome in action, through RNA, for example by comparing the quantities of each gene’s RNA between different tissue types, through development, in disease or in different environments. This is known as differential gene expression analysis. RNA-Seq, or high-throughput RNA sequencing, has accelerated research in this area. The technology works by reverse transcribing the RNA back into DNA, shearing it into smaller fragments, then reading each fragment’s sequence in parallel to give millions of short “reads”, each approximately 50-200 bases in length. With this data comes a computational and statistical challenge, because the biology must be inferred from millions of short sequences. Along with technical biases, there is true biological variability between samples of the same type, which must be accounted for. In this talk I discuss the applications of RNA-Seq, its challenges and some of the bioinformatics strategies being employed to analyse this complex data. In particular, I will focus on the steps involved in differential gene expression analysis, for both model organisms, such as human, and more exotic organisms without a sequenced genome. First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/
TRANSCRIPT
![Page 1: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/1.jpg)
Nadia Davidson Murdoch Childrens Research Institute
Introduction to RNA-Seq
Winter School in Mathematical and Computational Biology 2014
![Page 2: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/2.jpg)
The central dogma of molecular biology
Image from Wikipedia
![Page 3: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/3.jpg)
Alternative splicing
DNA RNA
![Page 4: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/4.jpg)
Transcriptional abundance
DNA RNA
2 copies
multiple copies, different “splice” variants
![Page 5: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/5.jpg)
Transcriptional abundance
RNA – cell type A
RNA – cell type B
Different quantities, different “splice” variants
![Page 6: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/6.jpg)
Benefits and opportunities of RNA-seq
• Differential expression
– Comparing the expression between different samples
• Whole transcriptome sequencing
– Annotation of new exons, transcribed regions, genes or non-coding RNAs
– The ability to look at alternative splicing
– Allele specific expression
– RNA editing
– Fusion genes in cancer
– Etc.
Figure panels (tracks labelled DNA and RNA):
– Allele specific expression: alleles A and G. Which copy is expressed more?
– RNA editing: base change after transcription (G)
– Fusion genes: structural rearrangement in the genome fuses Gene A to Gene B
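As a rough illustration of the “which copy is expressed more?” question, the sketch below counts the bases that aligned reads show at a single heterozygous site. The alignments, the site position 104 and the function name `allele_counts` are all invented for illustration, and reads are assumed to align without gaps or splicing.

```python
from collections import Counter


def allele_counts(alignments, snp_position):
    """Count the base each read shows at one genomic position.

    alignments: list of (start, sequence) pairs; start is the genomic
    coordinate of the first base (assumption: ungapped alignments).
    """
    counts = Counter()
    for start, sequence in alignments:
        offset = snp_position - start
        # Only reads that overlap the site contribute a base.
        if 0 <= offset < len(sequence):
            counts[sequence[offset]] += 1
    return counts


# Hypothetical reads around a site where the two gene copies carry A and G.
alignments = [(100, "ACGTA"), (102, "GTACC"), (101, "CGTGC"), (110, "TTTTT")]
counts = allele_counts(alignments, snp_position=104)
print(dict(counts))
```

A real analysis would also need to account for mapping bias towards the reference allele and test whether the imbalance is statistically meaningful.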
![Page 7: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/7.jpg)
RNA-Seq
@HWI-ST945:93:c02g4acxx
GGAAAAGGCAGAGGGTGGACTAAATGCTCAATCATGGGATTCTAATCTGG
+
CCCFFFFFHHHHHJJFGIIJJJJJJJJJJJJJGJJJJJGIIJJJJJJJJJJJIHIJJJJJIIJJJ
Millions to billions of these
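A read like this is stored as a four-line FASTQ record: an identifier line starting with `@`, the base sequence, a `+` separator, and one quality character per base. The sketch below decodes one such record. It assumes the common Phred+33 quality encoding, and the record and the helper name `parse_fastq_record` are made up for illustration rather than taken from the slide.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# Hypothetical record, shortened to ten bases.
record = ["@read1", "GGAAAAGGCA", "+", "CCCFFFFFHH"]
read_id, sequence, scores = parse_fastq_record(record)
print(read_id, sequence, scores)
```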
![Page 8: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/8.jpg)
RNA-Seq data analysis
• Whole transcriptome sequencing:
– What were the original full length transcript sequences?
– This talk
• Differential expression:
– Do we have more blue transcripts in one cell type than another?
– Next talk
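To make the differential expression question concrete, here is a toy sketch that scales read counts to counts per million and compares two cell types with a log2 fold change. The counts, the gene names and the pseudocount of 1 are invented for illustration. A real analysis needs replicates and a statistical model, which is the subject of the next talk.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts by library size so samples are comparable."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B / A) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Hypothetical read counts for two transcripts in two cell types.
cell_type_a = {"blue": 100, "red": 900}
cell_type_b = {"blue": 1600, "red": 400}

lfc = log2_fold_change(counts_per_million(cell_type_a),
                       counts_per_million(cell_type_b))
print(lfc)
```

Scaling by library size matters here because cell type B was sequenced twice as deeply in this made-up example; comparing raw counts would overstate the change.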
![Page 9: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/9.jpg)
What were the original full length transcript sequences…
if we have a reference genome?
![Page 10: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/10.jpg)
The reference annotation
• Model organisms have a reference annotation
• E.g. ENSEMBL, RefSeq, UCSC, GENCODE all provide the position of known genes in the reference genome
• Often, we assume these are the full set of transcripts of a gene
• But how do we know which gene a read came from?
UCSC screen shot (hg19, chrX q13.2, 72,800,000 to 72,900,000, scale 50 kb): Ensembl Gene Predictions - Ensembl 75, showing transcripts ENST00000602584, ENST00000438453, ENST00000421245, ENST00000373504, ENST00000373502, ENST00000498407 and ENST00000498318
![Page 11: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/11.jpg)
Mapping reads to the genome
Cole Trapnell & Steven L Salzberg, Nature Biotechnology 27, 455-457 (2009)
• Some reads can be mapped wholly to the genome (grey)
• Other reads need to be ‘split’ across splice sites (blue)
• Software: TopHat, STAR, Subread
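The difference between a wholly mapped read and a split read can be sketched with exact string matching. This toy example is not how TopHat, STAR or Subread work: those use indexed, mismatch-tolerant algorithms. It assumes exact matches, takes the first occurrence only, and accepts a split only when the skipped sequence starts with GT and ends with AG, the canonical intron boundaries. The genome, reads and the name `map_read` are invented for illustration.

```python
def map_read(genome, read, min_anchor=3):
    """Return genomic blocks (start, end) for a read, or None if unmapped."""
    # Case 1: the read maps wholly to the genome.
    start = genome.find(read)
    if start != -1:
        return [(start, start + len(read))]
    # Case 2: try splitting the read in two across a candidate intron.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        if left_start == -1:
            continue
        left_end = left_start + split
        right_start = genome.find(right, left_end)
        if right_start == -1:
            continue
        intron = genome[left_end:right_start]
        # Assumption of this sketch: only canonical GT...AG introns are accepted.
        if intron.startswith("GT") and intron.endswith("AG"):
            return [(left_start, left_end), (right_start, right_start + len(right))]
    return None


# Hypothetical two-exon gene.
exon1, intron, exon2 = "ATGGCCTTAG", "GTAAGTCCCCCCAG", "CATTGGACGA"
genome = exon1 + intron + exon2

unspliced = map_read(genome, exon1[2:9])            # lies within exon 1
spliced = map_read(genome, exon1[-5:] + exon2[:5])  # spans the junction
print(unspliced, spliced)
```

Without the GT...AG check, the spliced read in this example would be placed at the wrong junction, because the last bases of exon 1 also match the end of the intron.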
![Page 12: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/12.jpg)
What were the original full length transcript sequences…
if we have a reference genome but want to find something novel?
![Page 13: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/13.jpg)
Genome guided assembly
Map reads
Graph splicing events
Traverse the graph
Gene function? E.g. BLAST against the protein database or a related species (Blast2GO)
Software: Cufflinks, Scripture
Jeffrey A. Martin & Zhong Wang, Nature Reviews Genetics 12, 671-682 (October 2011)
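The “graph splicing events, traverse the graph” steps can be sketched as follows: exons become nodes, each junction seen in a split read becomes a directed edge, and every path from a first exon to a last exon is a candidate transcript. This is only an illustration of the idea, not the algorithm used by Cufflinks or Scripture. The exon names, junctions and function names are hypothetical, and the graph is assumed to be acyclic.

```python
from collections import defaultdict


def build_splice_graph(junctions):
    """Directed graph: upstream exon -> set of downstream exons."""
    graph = defaultdict(set)
    for upstream, downstream in junctions:
        graph[upstream].add(downstream)
    return graph


def enumerate_transcripts(graph, exons):
    """List every path from an exon with no incoming junction to a dead end."""
    has_incoming = {d for targets in graph.values() for d in targets}
    starts = sorted(e for e in exons if e not in has_incoming)
    transcripts = []

    def walk(path):
        next_exons = sorted(graph.get(path[-1], ()))
        if not next_exons:
            transcripts.append(path)
            return
        for exon in next_exons:
            walk(path + [exon])

    for start in starts:
        walk([start])
    return transcripts


# Hypothetical gene: split reads support E1-E2, E2-E3 and the exon-skipping E1-E3.
exons = ["E1", "E2", "E3"]
junctions = [("E1", "E2"), ("E2", "E3"), ("E1", "E3")]
transcripts = enumerate_transcripts(build_splice_graph(junctions), exons)
print(transcripts)
```

Enumerating every path grows quickly for genes with many exons, which is one reason real assemblers use read coverage to choose a smaller set of transcripts.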
![Page 14: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/14.jpg)
What were the original full length transcript sequences…
if we don’t have a reference genome?
![Page 15: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/15.jpg)
De novo transcriptome assembly
• Like genome assembly
• But also needs to deal with:
– Splicing
– Non-uniform coverage
• Software: Trinity, Oases, Trans-ABySS
Figure 3 from Francis et al., BMC Genomics 2013: de novo assembly statistics plotted against reads (millions, 0 to 80) for the samples C. multidentata, H. californensis, P. robusta, H. imbricata, S. similis, D. gigas and Mouse C10. Panels: (A) number of transcripts, (B) mean transcript length (bp), (C) median transcript length (bp), (D) N50 (bp), (E) number of loci, (F) loci per million reads, (G) transcripts per million reads, (H) average transcripts per locus.
• Challenges:
– Accuracy
– Computational requirements
– Lots of transcripts. Need to filter and cluster transcripts into genes (e.g. with Corset, CD-HIT-EST, assembler information etc.)
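Assemblies are often summarised with statistics such as N50, one of the panels in the Francis et al. figure on this slide. N50 is the length L such that transcripts of length L or longer contain at least half of all assembled bases. A minimal sketch, using made-up transcript lengths:

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no transcript lengths given")


# Hypothetical assembled transcript lengths in bp.
transcript_lengths = [100, 200, 300, 400, 500]
print(n50(transcript_lengths))
```

For a transcriptome, N50 should be read with care: a higher value does not by itself mean a more accurate assembly, since transcripts have a natural range of lengths.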
![Page 16: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/16.jpg)
What were the original full length transcript sequences…
if we have a reference genome but it’s not very good?
![Page 17: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/17.jpg)
More common than you may think
– Non-model organisms:
• A badly assembled genome
• No reference genome, but one of a related species
– Model organisms:
• Cancer
• Poorly assembled regions in an otherwise good reference genome
– No standard approach
![Page 18: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/18.jpg)
Example - Annotating the chicken W sex chromosome
Chicken is a model organism, but the sequenced reference W chromosome is poorly assembled, with missing sequence.
Motivation: The mechanism for sex determination in birds has not been proven. Are there any novel W genes which could be involved?
Source: http://mac122.icu.ac.jp/gen-ed/mendel-gifs/13-sex-chromosomes.JPG
![Page 19: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/19.jpg)
Experiment and analysis
Extracted and sequenced mRNA from the gonads of 4 female and 4 male embryonic chickens
1.4 billion 100bp paired-end reads
Re-assembled the reference annotation sequences (Ensembl), with a genome guided assembly (Cufflinks) and a de novo assembly (ABySS)
Identified W genes as those with female specific expression
Discovered 2 novel W genes and, for 1/3 of the known W gene sequences which were previously incomplete, found the full length sequences.
Some W candidates were followed up in the lab for sex determination studies
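Female birds are ZW and males are ZZ, so a W-linked gene should produce reads in female samples only. The sketch below shows one simple way to flag such candidates from an expression table. The thresholds, gene names, values and the function name are hypothetical, and this is not the exact procedure used in the study.

```python
def female_specific_genes(expression, min_female=10.0, max_male=1.0):
    """Genes expressed in every female sample and (near) absent in every male.

    expression: gene -> {"female": [values], "male": [values]}
    The thresholds are illustrative assumptions, not values from the study.
    """
    candidates = []
    for gene, groups in expression.items():
        if min(groups["female"]) >= min_female and max(groups["male"]) <= max_male:
            candidates.append(gene)
    return sorted(candidates)


# Hypothetical normalised expression for 4 female and 4 male samples.
expression = {
    "geneA": {"female": [50, 42, 61, 38], "male": [0, 0, 0.5, 0]},
    "geneB": {"female": [30, 28, 35, 31], "male": [29, 33, 30, 27]},
    "geneC": {"female": [0, 2, 1, 0], "male": [0, 0, 0, 0]},
}
w_candidates = female_specific_genes(expression)
print(w_candidates)
```

Allowing a small amount of male signal guards against discarding true W genes because of a few mis-mapped reads from a similar gene on the Z chromosome.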
![Page 20: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/20.jpg)
An example of one W gene
Ayers et al, 2013
Figure: read coverage (axis labelled 0 and 194) for blastoderm and gonads against base position in the transcript (0 to 2500), with tracks for the reference annotation, genome, genome guided assembly and de novo assembly. Colour key: on the W chromosome in the reference chicken genome; on “Unknown” contigs in the reference chicken genome; on an autosome in the reference chicken genome.
Take home message: All approaches have their strengths and limitations
![Page 21: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/21.jpg)
Summary
• RNA-seq is very powerful!
– It allows both the transcript sequence and the relative quantities to be measured.
– It has numerous applications:
• It complements DNA sequencing by telling us how the genome is actually used in a particular cell type.
• In some cases (e.g. non-model organisms) it can circumvent the need for DNA sequencing.
– There are standard pipelines for some applications, but many require a problem specific solution. Challenging but fun!
![Page 22: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/22.jpg)
Acknowledgements
MCRI Bioinformatics
The (Alicia) Oshlack Lab
This research was partly conducted within the Poultry CRC, established and supported under the Australian Government’s Cooperative Research Centres Program.
Red Jungle Fowl (credit: NHGRI)
Chicken W genes: MCRI Comparative Development
Craig Smith
Katie Ayers
Feel free to email me with questions: [email protected]
![Page 23: Nadia Davidson - Introduction to rna-seq](https://reader033.vdocuments.net/reader033/viewer/2022052505/554e765ab4c9054a698b4dce/html5/thumbnails/23.jpg)
More information
• General:
– Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nature Reviews Genetics, 2009
• Differential Expression Pipelines and Reviews:
– Alicia Oshlack et al., From RNA-seq reads to differential expression results, Genome Biology, 2010
– Anders et al., Count-based differential expression analysis of RNA sequencing data using R and Bioconductor, Nature Protocols, 2013
– http://bioinf.wehi.edu.au/RNAseqCaseStudy/
• Assembly Pipelines and Reviews:
– Jeffrey A. Martin & Zhong Wang, Next-generation transcriptome assembly, Nature Reviews Genetics, 2011
– https://code.google.com/p/corset-project/wiki/Example
– Haas et al., De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis, Nature Protocols, 2013
• The human transcriptome (ENCODE):
– Sarah Djebali et al., Landscape of transcription in human cells, Nature, 2012