
RNA-seq analysis: it's complicated

March 2017

Defining functional DNA elements in the human genome. Kellis M et al. PNAS 2014;111:6131-6138

RNA reads are not enough to identify functional RNAs

The steps taken from sample to RNA-seq reads give different results

RNA -> enrichments -> library -> reads

•  Enrichments: PolyA (mRNA), RiboMinus (- rRNA), size <50 nt (miRNA), ...
•  Library: size of fragment, strand specific, 5' end specific, 3' end specific, ...
•  Reads: single end (1 read per fragment), paired end (2 reads per fragment)

Mapping (Pär Engström)

•  Use an RNA-specific mapper
•  Use a two-pass workflow
•  STAR or HISAT
•  For long (PacBio) reads, STAR, BLAT or GMAP can be used
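As a minimal sketch of the mapping advice above, the helper below assembles a STAR command line with two-pass mode switched on. The function name `build_star_command`, the index directory, and the file names are placeholders chosen for illustration, not part of the course material. The helper only builds the argument list and does not run STAR.

```python
import shlex


def build_star_command(genome_dir, read1, read2=None, threads=4, prefix="sample_"):
    """Return a STAR argument list for one sample (hypothetical helper)."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,      # directory holding a prebuilt STAR index
        "--readFilesIn", read1,
    ]
    if read2 is not None:               # paired end: add the second mate file
        cmd.append(read2)
    if read1.endswith(".gz"):           # let STAR decompress gzipped FASTQ on the fly
        cmd += ["--readFilesCommand", "zcat"]
    cmd += [
        "--twopassMode", "Basic",       # junctions found in pass 1 guide pass 2
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]
    return cmd


if __name__ == "__main__":
    command = build_star_command("star_index", "reads_1.fastq.gz", "reads_2.fastq.gz")
    print(shlex.join(command))
    # To run it for real: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string keeps file names with spaces intact when the list is later passed to `subprocess.run`.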

Gene and isoform detection

Long reads might be the way to go

Promises and pitfalls

Long reads
•  Low throughput (-)
•  Complete transcripts (++)
•  Not quantitative (-)
•  Only highly expressed genes (-)
•  Expensive (-)
•  Low background noise (+)
•  Easy downstream analysis (++)

Short reads
•  High throughput (++)
•  Quantitative (++)
•  Fractions of transcripts (-)
•  Full dynamic range (+-)
•  Unlimited dynamic range (+)
•  Cheap (+)
•  Low background noise (+)
•  Strand specificity (+)
•  Re-sequencing (+)

RNA-seq analysis workflow

[Figure: Reads (reads.fastq.gz) and Reference (genome.fa) -> Mapping (STAR) -> Mapped reads (mappedReads.bam) -> Gene expression, using gene annotation (ref.bed / ref.gtf)]
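The last step of the workflow, going from mapped reads plus a gene annotation to gene expression, can be sketched as a simple overlap count. This is a toy illustration under strong assumptions: reads are reduced to a single start position, genes are BED-style 0-based half-open intervals, and the function name `count_reads_per_gene` is made up for this example. Dedicated counting tools such as featureCounts or htseq-count handle spliced reads, strandedness and multi-mapping, which this sketch ignores.

```python
def count_reads_per_gene(genes, read_starts):
    """Count reads whose start position falls inside each gene interval.

    genes: list of (chrom, start, end, name), BED-style 0-based half-open.
    read_starts: list of (chrom, position) for mapped reads.
    """
    counts = {name: 0 for _, _, _, name in genes}
    for chrom, pos in read_starts:
        for g_chrom, start, end, name in genes:
            if chrom == g_chrom and start <= pos < end:
                counts[name] += 1
    return counts


# Toy annotation and toy read positions (made-up values).
genes = [("chr1", 100, 200, "geneA"), ("chr1", 300, 400, "geneB")]
reads = [("chr1", 150), ("chr1", 199), ("chr1", 200), ("chr1", 350), ("chr2", 150)]
print(count_reads_per_gene(genes, reads))
```

The read at position 200 is not counted for geneA because the interval end is exclusive, and the chr2 read matches no gene.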

Do a lot of QC

[Figure: the same workflow (Reads and Reference -> Mapping with STAR -> Mapped reads -> Gene expression) with QC steps attached]

•  Read QC: FastQC
•  Mapping statistics
•  Mapping QC: RSeQC
•  Outlier detection, sample swaps
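One simple way to look for outliers among samples is to compare each sample's expression profile with all the others. The sketch below is an assumption-laden illustration, not the procedure used in the course: it takes log-scale expression values per sample, computes Pearson correlations, and flags samples whose median correlation to the rest falls below a threshold. The names `flag_outliers` and the cut-off of 0.8 are hypothetical.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def flag_outliers(log_expr, threshold=0.8):
    """Return sample names whose median correlation to the others is below threshold."""
    flagged = []
    for name, profile in log_expr.items():
        others = [pearson(profile, other)
                  for other_name, other in log_expr.items() if other_name != name]
        if median(others) < threshold:
            flagged.append(name)
    return flagged


# Toy log-expression profiles for five genes (made-up numbers).
samples = {
    "leaf_1": [1.0, 5.0, 9.0, 2.0, 7.0],
    "leaf_2": [1.2, 4.8, 9.1, 2.2, 6.9],
    "leaf_3": [0.9, 5.1, 8.8, 1.9, 7.2],
    "odd":    [9.0, 2.0, 1.0, 8.0, 3.0],
}
print(flag_outliers(samples))
```

The median is used rather than the mean so that a single deviating sample does not drag down the score of every well-behaved sample. A flagged sample still needs to be checked against the sample sheet before it is called a swap.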

More variation when using TopHat2 with default settings than when using STAR or Stampy with default settings

[Figure: Compare mapping efficacy of different RNA-seq assemblers. x-axis: samples grouped by species (C. rubella, C. grandiflora, hybrid); y-axis: percent properly mapped paired-end reads (ticks 60 to 90); colour: Stampy, STAR, Tophat2; shape: flower, leaf.]
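The quantity compared between mappers above, the percentage of properly mapped paired-end reads, can be derived from the FLAG field of SAM/BAM records. The sketch below assumes one plausible definition (primary paired records with the proper-pair bit set, divided by all primary paired records). The slides do not say exactly how the plotted value was computed, and the function name is hypothetical. `samtools flagstat` reports a comparable "properly paired" figure.

```python
# Bit values defined by the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def percent_properly_paired(flags):
    """Percentage of primary paired-end records flagged as properly paired."""
    primary = [f for f in flags
               if f & PAIRED and not f & (SECONDARY | SUPPLEMENTARY)]
    if not primary:
        return 0.0
    proper = sum(1 for f in primary if f & PROPER_PAIR)
    return 100.0 * proper / len(primary)


# Toy FLAG values: four properly paired reads (99, 147, 83, 163),
# two paired but not proper (73, 133) and one secondary alignment (355).
flags = [99, 147, 83, 163, 73, 133, 355]
print(round(percent_properly_paired(flags), 1))
```

Secondary and supplementary alignments are skipped so that a read mapping to several places is only counted once.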

RNA QC

PCA analysis detected potential sample swaps

Principal component 1 separates samples from flowers and leaves

[Figure: scatter plot of PC1 (x-axis, -150 to 150) against PC2 (y-axis), one labelled point per sample]

Principal components 2 and 3 separate the different species

[Figure: scatter plot of PC2 (x-axis, -60 to 80) against PC3 (y-axis); groups: C. grandiflora, C. rubella, hybrid]
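To make the idea behind these PCA plots concrete, the sketch below extracts the first principal component of a small samples-by-genes matrix and shows that its scores split two groups of samples. It is a dependency-free illustration using power iteration; the slides do not state which tool produced the plots, and in practice one would normally use an existing PCA routine (for example `prcomp` in R). The function name and the toy values are made up.

```python
import math
import random


def first_principal_component(matrix, iterations=200, seed=0):
    """Return (loadings, scores) of PC1 for a samples x genes matrix."""
    n_samples, n_genes = len(matrix), len(matrix[0])
    # Centre each gene (column) on its mean across samples.
    means = [sum(row[j] for row in matrix) / n_samples for j in range(n_genes)]
    centred = [[row[j] - means[j] for j in range(n_genes)] for row in matrix]

    # Random starting direction, then repeated multiplication by X^T X.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
        v = [sum(centred[i][j] * scores[i] for i in range(n_samples))
             for j in range(n_genes)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]

    scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
    return v, scores


# Toy log-expression values: rows are samples, columns are four genes.
names = ["flower_1", "flower_2", "leaf_1", "leaf_2"]
data = [
    [8.0, 2.0, 5.0, 5.1],
    [7.8, 2.2, 5.1, 4.9],
    [2.1, 8.1, 5.0, 5.0],
    [1.9, 7.9, 4.9, 5.2],
]
loadings, pc1_scores = first_principal_component(data)
for name, score in zip(names, pc1_scores):
    print(name, round(score, 2))
```

In this toy data the two genes that differ between tissues dominate the variance, so the flower and leaf samples get PC1 scores of opposite sign. The overall sign of a principal component is arbitrary, so only the separation is meaningful.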

Differential expression analysis (Mikael Huss)

The identification of genes (or other types of genomic features, such as transcripts or exons) that are expressed in significantly different quantities in distinct groups of samples, be it biological conditions (drug-treated vs. controls), diseased vs. healthy individuals, different tissues, different stages of development, or something else.

Typically univariate analysis (one gene at a time), even though we know that genes are not independent
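The "one gene at a time" idea can be illustrated with a deliberately simple per-gene comparison: a log2 fold change and a Welch t statistic computed separately for every gene. This is only a sketch of the univariate principle under the assumption of a pseudocount of 1 and log2-transformed counts. It is not how DESeq2 or Sleuth work, since those fit dedicated count models and share information across genes. All names and numbers below are made up.

```python
import math
from statistics import mean, variance


def per_gene_tests(counts_a, counts_b, pseudocount=1.0):
    """For each gene return (log2 fold change B vs A, Welch t statistic)."""
    results = {}
    for gene in counts_a:
        a = [math.log2(c + pseudocount) for c in counts_a[gene]]
        b = [math.log2(c + pseudocount) for c in counts_b[gene]]
        log2_fc = mean(b) - mean(a)
        # Standard error of the difference, allowing unequal variances.
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        t_stat = log2_fc / se if se > 0 else float("nan")
        results[gene] = (log2_fc, t_stat)
    return results


# Toy counts: three control and three treated replicates per gene.
control = {"geneA": [100, 110, 90], "geneB": [50, 55, 45]}
treated = {"geneA": [400, 420, 380], "geneB": [52, 48, 51]}
for gene, (fc, t) in per_gene_tests(control, treated).items():
    print(gene, round(fc, 2), round(t, 1))
```

Each gene is tested in isolation here, which is exactly the simplification the slide points out: dependencies between genes are ignored.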

Decision tree for software selection (2016)

[Figure: decision tree for choosing differential expression software; the only recoverable node label is Sleuth]

Gene-set analysis (GSA)

[Figure: gene-level data (genes x samples, e.g. PPARG) -> gene-set analysis -> gene-set data (results), e.g. Immune response, Pyruvate]

Gene sets can be based on:
•  GO terms
•  Pathways
•  Chromosomal locations
•  Transcription factors
•  Histone modifications
•  Diseases
•  etc.

We will focus on transcriptomics and differential expression analysis. However, GSA can in principle be used on all types of genome-wide data.
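One of the simplest forms of gene-set analysis is an over-representation test: given a list of differentially expressed genes, ask whether a gene set contains more of them than expected by chance. The sketch below uses the hypergeometric distribution for this. It is one common flavour of GSA chosen for illustration, not necessarily the method used in the course lab, and the function name and toy gene identifiers are made up.

```python
from math import comb


def overrepresentation_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe: number of genes tested in total
    n_in_set:   number of those genes that belong to the gene set
    n_selected: number of differentially expressed genes
    n_overlap:  number of differentially expressed genes in the gene set
    """
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 20 genes tested, 5 in the pathway, 4 called significant.
universe = {f"g{i}" for i in range(20)}
pathway = {"g0", "g1", "g2", "g3", "g4"}
significant = {"g0", "g1", "g10", "g11"}
p = overrepresentation_pvalue(
    len(universe), len(pathway), len(significant), len(pathway & significant)
)
print(round(p, 4))
```

The gene universe should be the set of genes that were actually tested, since using the whole genome instead tends to make the p-values look smaller than they should.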

Analysis regarding Type II Diabetes

[Figure: gene-set network plot, "mean, distinct-directional (both)". Nodes are KEGG gene sets (for example KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_OXIDATIVE_PHOSPHORYLATION, KEGG_RIBOSOME, KEGG_PROTEASOME, KEGG_CELL_CYCLE, KEGG_PARKINSONS_DISEASE). Node size: 15 to 177 genes; edge width: 1 to 84 genes; colour scales: p-values up and p-values down, each from 0 to 0.001.]

Expression of genes on pathway

miRNA seq analysis

(Berezikov et al. Genome Research, 2011.)

Single cell sequencing

(Sandberg, Nature Methods 2014)

Exercises

•  Mapping
   –  STAR
   –  HISAT2
•  Tutorial for reference guided assembly
   –  Cufflinks
   –  StringTie
•  Tutorial for de novo assembly
   –  Trinity
•  Visualise mapped reads and assembled transcripts on reference
   –  IGV
•  RNA quality control
   –  Tutorial for RNA-seq quality control
•  Differential expression analysis
   –  DESeq2
   –  Kallisto and Sleuth
   –  Multivariate analysis in SIMCA
•  Small RNA analysis
   –  miRNA analysis
•  Introductory
   –  Introduction to the RNA-seq data provided
   –  Short introduction to R
   –  Short introduction to IGV
•  Beta labs
   –  Single cell RNA PCA and clustering
   –  Gene set analysis
•  UPPMAX
   –  sbatch script example

Need help??

•  We are here for you. Apply for help.
