RNAseq analysis: it's complicated
March 2017



Defining functional DNA elements in the human genome Kellis M et al. PNAS 2014;111:6131-6138

RNA reads are not enough to identify functional RNAs

Different steps from sample to RNA-seq will give different results

RNA -> enrichments -> library -> reads

•  Enrichments
  –  PolyA (mRNA)
  –  RiboMinus (- rRNA)
  –  Size <50 nt (miRNA)
  –  …
•  Library
  –  Size of fragment
  –  Strand specific
  –  5' end specific
  –  3' end specific
  –  …
•  Reads
  –  Single end (1 read per fragment)
  –  Paired end (2 reads per fragment)

Mapping  (Pär  Engström)  

•  Use an RNA-specific mapper
•  Use a two-pass workflow
•  STAR or HISAT
•  For long (PacBio) reads, STAR, BLAT or GMAP can be used
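As an illustration of the two-pass advice, the sketch below assembles a STAR command line that uses STAR's built-in two-pass mode (`--twopassMode Basic`). It only builds and prints the command; it does not run STAR. The index directory, file names, thread count and output prefix are hypothetical placeholders, and this is one possible setup rather than the configuration used in the course.

```python
import shlex


def star_two_pass_command(genome_dir, fastq_1, fastq_2=None, threads=4,
                          prefix="sample_"):
    """Build (but do not run) a STAR command line using two-pass mode."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,      # directory holding a pre-built STAR index
        "--readFilesIn", fastq_1,
    ]
    if fastq_2 is not None:
        cmd.append(fastq_2)             # second mate for paired-end data
    if fastq_1.endswith(".gz"):
        # Tell STAR how to decompress gzipped FASTQ input.
        cmd += ["--readFilesCommand", "zcat"]
    cmd += [
        "--twopassMode", "Basic",       # junctions found in pass 1 are reused in pass 2
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]
    return cmd


# Hypothetical paths, for illustration only.
command = star_two_pass_command("star_index", "reads_1.fastq.gz", "reads_2.fastq.gz")
print(shlex.join(command))
```

Returning the command as a list keeps it easy to inspect or to hand to `subprocess.run` once the paths point at real data.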

Gene and isoform detection

Long  reads  might  be  the  way  to  go  

Promises and pitfalls

Long reads
•  Low throughput (-)
•  Complete transcripts (++)
•  Not quantitative (-)
•  Only highly expressed genes (-)
•  Expensive (-)
•  Low background noise (+)
•  Easy downstream analysis (++)

Short reads
•  High throughput (++)
•  Quantitative (++)
•  Fractions of transcripts (-)
•  Full dynamic range (+-)
•  Unlimited dynamic range (+)
•  Cheap (+)
•  Low background noise (+)
•  Strand specificity (+)
•  Re-sequencing (+)

 

RNA-seq analysis workflow

•  Reads (reads.fastq.gz) and Reference (genome.fa) -> Mapping (STAR) -> Mapped reads (mappedReads.bam)
•  Mapped reads and Gene annotation (ref.bed / ref.ga) -> Gene expression

Do  a  lot  of  QC  

•  Read QC: FastQC
•  Mapping statistics
•  Mapping QC: RSeQC
•  Outlier detection, sample swaps

More variation when using TopHat2 with default settings than when using STAR or Stampy with default settings

[Figure: Compare mapping efficacy of different RNA-seq assemblers. x-axis: samples grouped by species (C. rubella, C. grandiflora, hybrid); y-axis: percent properly mapped paired-end reads (axis ticks 60 to 90); colour: Stampy, STAR, Tophat2; shape: flower, leaf.]
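The quantity compared across mappers here, the percentage of properly mapped paired-end reads, can be derived from the FLAG field of SAM records. The sketch below is a minimal illustration, assuming plain-text SAM lines and assuming the percentage is taken over primary alignments of paired reads; the exact definition behind the plot is not stated in the slides, and the example records are made up.

```python
def percent_properly_paired(sam_lines):
    """Percentage of paired reads (primary alignments) flagged as properly paired."""
    paired = 0
    proper = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                      # skip blank lines and header lines
        flag = int(line.split("\t")[1])   # FLAG is the second SAM column
        if flag & 0x900:
            continue                      # skip secondary (0x100) and supplementary (0x800)
        if flag & 0x1:                    # read is paired in sequencing
            paired += 1
            if flag & 0x2:                # each segment properly aligned
                proper += 1
    return 100.0 * proper / paired if paired else 0.0


# Made-up records: only the first two columns matter for this calculation.
example = [
    "@HD\tVN:1.6",
    "read1\t99\tchr1\t100",    # paired, proper pair, first in pair
    "read1\t147\tchr1\t300",   # paired, proper pair, second in pair
    "read2\t73\tchr1\t500",    # paired, mate unmapped
    "read2\t133\t*\t0",        # paired, read itself unmapped
    "read1\t355\tchr2\t100",   # secondary alignment, ignored
]
result = percent_properly_paired(example)
print(result)
```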

RNA QC

PCA analysis detected potential sample swaps

Principal  component  1  separates  samples  from  flowers  and  leaves    

[Figure: PCA plot of PC1 (x-axis) against PC2 (y-axis), one point per sample; sample names carry an _F or _L suffix. Annotated group: C. rubella.]

Principal components 2 and 3 separate the different species

[Figure: PCA plot of PC2 (x-axis) against PC3 (y-axis); annotated groups: C. grandiflora, C. rubella, hybrid.]
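To make the PCA-based check concrete, here is a small dependency-free sketch that computes first-principal-component scores for a samples-by-genes matrix using power iteration. It is only an illustration under stated assumptions: the sample names and expression values are invented, only PC1 is computed (the slides also use PC2 and PC3), and a real analysis would normally use a numerical library on normalised, log-transformed data.

```python
import math


def pc1_scores(matrix, iterations=500):
    """Return the PC1 score of each sample (row) in a samples x genes matrix."""
    n_samples = len(matrix)
    n_genes = len(matrix[0])
    # Centre every gene (column) on its mean across samples.
    means = [sum(row[j] for row in matrix) / n_samples for j in range(n_genes)]
    centred = [[row[j] - means[j] for j in range(n_genes)] for row in matrix]
    # Gram matrix (samples x samples); its leading eigenvector, scaled by the
    # square root of its eigenvalue, gives the PC1 scores of the samples.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred] for r1 in centred]
    vec = [float(i + 1) for i in range(n_samples)]  # arbitrary non-constant start
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break                                   # no variance in the data
        vec = [x / norm for x in new]
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec))
                     for v, row in zip(vec, gram))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [v * scale for v in vec]


# Invented toy data: two flower and two leaf samples, three genes.
expression = {
    "flower_1": [10, 1, 5],
    "flower_2": [11, 1, 5],
    "leaf_1": [1, 10, 5],
    "leaf_2": [1, 11, 5],
}
names = list(expression)
scores = pc1_scores([expression[name] for name in names])
for name, score in zip(names, scores):
    print(name, round(score, 2))
```

In this toy data the two tissues end up on opposite sides of zero on PC1. A sample whose score falls with the other tissue's group would be a candidate swap worth checking; the sign of a principal component is arbitrary, so only the grouping matters.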

Differential expression analysis (Mikael Huss)

The identification of genes (or other types of genomic features, such as transcripts or exons) that are expressed in significantly different quantities in distinct groups of samples, be it biological conditions (drug-treated vs. controls), diseased vs. healthy individuals, different tissues, different stages of development, or something else.

Typically univariate analysis (one gene at a time), even though we know that genes are not independent
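The "one gene at a time" idea can be sketched as a loop that compares the two groups separately for every gene. The example below computes a log2 fold change and a Welch-style t statistic per gene on invented values. It is a deliberately simplified illustration, not the statistical model used by dedicated differential expression tools, and it assumes already-normalised expression values with at least two samples per group.

```python
import math
from statistics import mean, variance


def per_gene_comparison(expression, pseudocount=1.0):
    """Compare group B against group A separately for every gene.

    expression maps gene -> (values_in_group_a, values_in_group_b).
    Returns gene -> (log2 fold change, t statistic or None).
    """
    results = {}
    for gene, (group_a, group_b) in expression.items():
        # The pseudocount avoids taking the logarithm of zero.
        log_a = [math.log2(v + pseudocount) for v in group_a]
        log_b = [math.log2(v + pseudocount) for v in group_b]
        log2_fc = mean(log_b) - mean(log_a)
        # Standard error of the difference, allowing unequal variances.
        se = math.sqrt(variance(log_a) / len(log_a) + variance(log_b) / len(log_b))
        t_stat = log2_fc / se if se > 0 else None
        results[gene] = (log2_fc, t_stat)
    return results


# Invented toy values: (controls, treated) for two hypothetical genes.
toy = {
    "gene_1": ([1, 3], [7, 15]),
    "gene_2": ([3, 3], [3, 3]),
}
stats = per_gene_comparison(toy)
for gene, (log2_fc, t_stat) in stats.items():
    print(gene, log2_fc, t_stat)
```

Each gene is handled independently of all others, which is exactly the simplification the slide points out.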

Decision tree for software selection (2016)

[Figure: decision tree for software selection; the only labels recovered from the figure are "Sleuth" and "?".]

Gene-set analysis (GSA)

[Figure: gene-level data (genes by samples, example gene PPARG) is converted by gene-set analysis into gene-set data (results), with example sets "Immune response" and "Pyruvate".]

Gene-set analysis
•  GO-terms
•  Pathways
•  Chromosomal locations
•  Transcription factors
•  Histone modifications
•  Diseases
•  etc…

We will focus on transcriptomics and differential expression analysis. However, GSA can in principle be used on all types of genome-wide data.
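One simple form of gene-set analysis is an over-representation test: given a list of selected genes (for example, differentially expressed ones), ask whether a gene set contains more of them than expected by chance. The sketch below uses a one-sided hypergeometric test. It is only one of several GSA approaches and not necessarily the method behind the results shown in these slides; the gene and set names are invented.

```python
from math import comb


def overrepresentation_p(universe_size, set_size, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution."""
    total = comb(universe_size, n_selected)
    upper = min(set_size, n_selected)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def gene_set_p_values(gene_sets, selected_genes, universe):
    """Run the over-representation test for every gene set."""
    universe = set(universe)
    selected = set(selected_genes) & universe
    results = {}
    for name, members in gene_sets.items():
        members = set(members) & universe   # ignore genes that were not measured
        overlap = len(members & selected)
        results[name] = overrepresentation_p(
            len(universe), len(members), len(selected), overlap
        )
    return results


# Invented example: 10 measured genes, 3 of them selected.
universe = [f"g{i}" for i in range(1, 11)]
gene_sets = {"set_A": ["g1", "g2", "g3"], "set_B": ["g8", "g9", "g10"]}
p_values = gene_set_p_values(gene_sets, ["g1", "g2", "g3"], universe)
print(p_values)
```

Because the test only needs a gene list and set memberships, the same function applies to any genome-wide data type from which a list of selected genes can be drawn.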

Analysis regarding Type II Diabetes

[Figure: gene-set network, "mean, distinct-directional (both)". Colour scales: p-values up and p-values down, each from 0 to 0.001. Node size: 15 to 177 genes. Edge width: 1 to 84 genes. Gene sets shown: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_PENTOSE_PHOSPHATE_PATHWAY, KEGG_STEROID_BIOSYNTHESIS, KEGG_OXIDATIVE_PHOSPHORYLATION, KEGG_PURINE_METABOLISM, KEGG_PYRIMIDINE_METABOLISM, KEGG_GLUTATHIONE_METABOLISM, KEGG_LINOLEIC_ACID_METABOLISM, KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_LACTO_AND_NEOLACTO_SERIES, KEGG_TERPENOID_BACKBONE_BIOSYNTHESIS, KEGG_AMINOACYL_TRNA_BIOSYNTHESIS, KEGG_RIBOSOME, KEGG_DNA_REPLICATION, KEGG_PROTEASOME, KEGG_PROTEIN_EXPORT, KEGG_CELL_CYCLE, KEGG_CARDIAC_MUSCLE_CONTRACTION, KEGG_FOCAL_ADHESION, KEGG_ECM_RECEPTOR_INTERACTION, KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION, KEGG_ALZHEIMERS_DISEASE, KEGG_PARKINSONS_DISEASE, KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALS, KEGG_HUNTINGTONS_DISEASE, KEGG_VIBRIO_CHOLERAE_INFECTION.]

Expression  of  genes  on  pathway  

miRNA  seq  analysis  

(Berezikov  et  al.  Genome  Research,  2011.)  

(Sandberg, Nature Methods 2014)

Single  cell  sequencing  

Exercises

•  Mapping
  –  STAR
  –  HISAT2
•  Tutorial for reference guided assembly
  –  Cufflinks
  –  StringTie
•  Tutorial for de novo assembly
  –  Trinity
•  Visualise mapped reads and assembled transcripts on reference
  –  IGV
•  RNA quality control
  –  Tutorial for RNA seq Quality Control
•  Differential expression analysis
  –  DESeq2
  –  Kallisto and Sleuth
  –  Multivariate analysis in SIMCA
•  Small RNA analysis
  –  miRNA analysis
•  Introductory
  –  Introduction to the RNA seq data provided
  –  Short introduction to R
  –  Short introduction to IGV
•  Beta labs
  –  Single cell RNA PCA and clustering
  –  Gene set analysis
•  UPPMAX
  –  sbatch script example

Need  help??  

•  We  are  here  for  you.  Apply  for  help.