
Page 1: (A tool kit for) Deep Seq Data Analysis Christophe.antoniewski@upmc.fr  Mouse Genetics January 24, 2013, 14:00– 16:30

(A tool kit for) Deep Seq Data Analysis

Christophe.antoniewski@upmc.fr

http://drosophile.org

Mouse Genetics, January 24, 2013, 14:00–16:30

Page 2:

Why deep seq analyses?

Your project involves:

• Mutation and SNP identification or analysis (genome re-sequencing)
• Gene/disease linkage (genome re-sequencing)
• Pathogen identification (de novo sequence assembly or re-sequencing)
• Transcriptome analysis (RNAseq)
• DNA methylation study (MeDIP-seq)
• Chromatin study (ChIPseq)
• Transcription factor study (ChIPseq)
• miRNAs, siRNA, piRNA, tRF, etc. (small RNA seq)
• Single cell transcriptome analysis

Deep seq

• Qualitative information
• Quantitative information

Page 3:

Sequencing Technologies

Page 4:

Sequencing Technologies: Quantitative Facts

From

Page 5:

Sequencing Technologies: Focus on Illumina technology

« Library »

« Clusters »

Cluster Sequences

Page 6:

For mRNA seq, non-directional

Page 7:

Small RNAseq library preparation (Directional)

20-30nt RNA gel purification

(Biases)

Library “Bar coding”

Page 8:

ChIPseq library preparation (Non Directional)

Page 9:

Page 10:

What can I do with my sequence reads?

Locus discovery / mutation discovery / splicing annotation

Annotation & visualization

Read quantitative profiling (transcriptome, chromatin profiling, etc.)

Statistics

Structure analysis of precursors, signatures…

Maths & Statistics

…

Page 11:

Flowchart of a sequencing project

Think about the number of replicates from the start. What am I going to sequence?

• Platform Selection: inherent biases, specific benefits (read length, single or paired ends, number of reads)
• Library Preparation: whole genome, whole exome, target enrichment; size selection, amplification, single cell protocol
• Sequencing: number of cycles, number of lanes
• Quality Control: adapter clipping, quality trimming; contaminant and error identification
• Alignment: Bowtie, BWA, … (P Flicek & E Birney, Nature Methods 2009)
• Assembly: Velvet, SSAKE, … (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))
• Visualization & Statistics: R & open source software tools
  • Normalization (library comparison)
  • Peak finding (binding sites, breakpoints, etc…)
  • Differential calling (expression, variants, etc.)

Page 12:

A case study: miRNAs (and other small RNAs)

metHen1

snoRNA, tRNA, rRNA fragments+

Page 13:

Small RNAseq library preparation (Directional)

20-30nt RNA gel purification

(Biases)

Library “Bar coding”

Page 14:

Basic Material

• A sequence file (fastq format)
• A computer with enough RAM (8 gigabytes is a good start)
• A Unix-compliant operating system + a bit of « basic know-how »
• A couple of very useful software tools with a Graphical User Interface (GUI)
  • TextWrangler, an advanced text editor with RegEx integration
  • R (for statistics and, more importantly, graphics)
  • …
• GALAXY is an (our) option
• Knowledge of at least one programming language
  • Python, Perl, Java, C++…

Page 15:

What does this big* fastq file contain?

* Size limit to open a text file with a text editor (~1.2 Gb)

Unix Terminal: more <path/to/the/file>

$ more GKG-13.fastq
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^

………

Each record takes four lines: header, sequence, header, quality (ASCII encoded).
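The four-line layout above can be read with a few lines of Python. This is only a minimal sketch: the names `read_fastq` and `phred_scores` are made up for the illustration, it assumes strictly four lines per record with no trailing blank line, and the quality offset of 64 is an assumption (older Illumina pipelines; many other files use 33).

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of fastq lines.

    Assumes exactly four lines per record, with no line wrapping inside
    sequences or qualities and no trailing blank line.
    """
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not record:
            return
        if (len(record) < 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed fastq record: %r" % (record,))
        header, sequence, _, quality = record
        yield header[1:], sequence, quality


def phred_scores(quality, offset=64):
    """Decode an ASCII quality string into integer scores.

    offset=64 is an assumption, not something the slide states.
    """
    return [ord(char) - offset for char in quality]


# First record of the slide, given as a list of lines instead of a file.
example = [
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n",
]
for header, sequence, quality in read_fastq(example):
    print(header, len(sequence), min(phred_scores(quality)))
```

With a real file, `read_fastq(open("GKG-13.fastq"))` would stream the records one by one without loading the whole file in memory.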

Page 16:

How many sequence reads in my file?

wc -l <path/to/my/file>

$ wc -l GKG-13.fastq

25703828 GKG-13.fastq

>>> 25 703 828 / 4 = 6 425 957 sequence reads

#!/usr/bin/env python3
# Count the reads and the distinct sequences of a fastq file.
import sys

readDic = {}
Nbre_reads = 0
with open(sys.argv[1]) as F:
    for Nbre_lines, line in enumerate(F, start=1):
        # the sequence is the 2nd line of each 4-line fastq record
        if Nbre_lines % 4 == 2:
            Nbre_reads += 1
            readDic[line] = readDic.get(line, 0) + 1
print("%s reads" % Nbre_reads)
print("%s distinct sequences" % len(readDic))
print("%f complexity" % (len(readDic) / float(Nbre_reads)))

$ fastq_complexity.py GKG-13.fastq
6 425 957 reads
550 706 distinct sequences
0.085700 complexity

Page 17:

Do my sequence reads contain the adapter?

cat <path/file> | grep CTGTAGG | wc -l

grep -c "CTGTAGG" <path/file>

lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep CTGTAGG | wc -l

6355061

lbcd-05:GKG13demo deepseq$ grep -c "CTGTAGG" GKG-13.fastq

6355061

6 355 061 out of 6 425 957 sequences… not bad (98.9%)

My 3’ adapter: CTGTAGGCACCATCAAT

By contrast (a contrario), an unrelated 7-mer is rare:

lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep ATCTCGT | wc -l

308
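grep looks at every line of the file, so header and quality lines are searched too. The same count restricted to sequence lines could be sketched in Python as follows. The function name `adapter_fraction` is invented for the example, and the code assumes a plain fastq file with four lines per record.

```python
def adapter_fraction(lines, seed="CTGTAGG"):
    """Count the reads whose sequence line contains the adapter seed.

    Returns (reads_with_seed, total_reads, fraction). Only the 2nd line
    of each 4-line record is inspected, unlike grep on the whole file.
    """
    total = with_seed = 0
    for index, line in enumerate(lines):
        if index % 4 == 1:  # sequence line of the record
            total += 1
            if seed in line:
                with_seed += 1
    fraction = with_seed / total if total else 0.0
    return with_seed, total, fraction


# Two toy records (not taken from the real file).
toy = [
    "@read1", "ACGTACGTCTGTAGGCACC", "+read1", "hhhhhhhhhhhhhhhhhhh",
    "@read2", "ACGTACGTACGTACGTACG", "+read2", "hhhhhhhhhhhhhhhhhhh",
]
print(adapter_fraction(toy))
```

On a real file the call would be `adapter_fraction(open("GKG-13.fastq"))`.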

Page 18:

Take home message n°1

Unix Operating Systems already contain powerful native tools for text analysis

$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l

• cat GKG-13.fastq: outputs the content of a file, line by line
• |: the output is passed to the input of the next command
• perl -ne: the perl interpreter is called with the -ne options (loop & execute)
• 'print if /^[ATGCN]{22}CTGTAGG/': inline perl code, built around a regular expression
• |: the output is passed to the input of the next command
• wc -l: wc with the -l option counts the lines
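For readers more at ease with Python, the same regular expression can be written with the `re` module. This is a sketch only: the two function names are invented here, and the second one goes a step further than the slide by tabulating every insert length instead of testing just 22.

```python
import re
from collections import Counter

ADAPTER_START = "CTGTAGG"  # first 7 nt of the 3' adapter used on the slides


def count_with_insert_length(sequences, length, adapter_start=ADAPTER_START):
    """Python counterpart of: perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l"""
    pattern = re.compile(r"^[ATGCN]{%d}%s" % (length, adapter_start))
    return sum(1 for sequence in sequences if pattern.match(sequence))


def insert_length_histogram(sequences, adapter_start=ADAPTER_START):
    """Tabulate the number of nucleotides found before the adapter start."""
    pattern = re.compile(r"^([ATGCN]*?)" + adapter_start)
    histogram = Counter()
    for sequence in sequences:
        match = pattern.match(sequence)
        if match:
            histogram[len(match.group(1))] += 1
    return histogram


# Two sequences copied from the fastq excerpt of the slides.
reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
]
print(count_with_insert_length(reads, 22))
print(insert_length_histogram(reads))
```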

Page 19:

Quality Control. Can I trust my sequences?

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Demo with the GUI version

Page 20:

Quality Control. Can I trust my sequences?

fastQC in GALAXY

Page 21:

How to clip the adapter?

3’ adapter: CTGTAGGCACCATCAAT

http://hannonlab.cshl.edu/fastx_toolkit/index.html

fastq_to_fasta -r -n -i GKG-13.fastq | fastx_clipper -a CTGTAGGCACCATCAAT -l 18 -o GKG-13_clip-pipe.fasta
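To make the clipping step less of a black box, here is a simplified Python sketch of the idea. It is not how fastx_clipper works internally: the matching below is a plain exact search (an assumption made for brevity), the function names are invented, and dropping reads that contain N is meant to mirror what I understand to be the clipper's default behaviour.

```python
def clip_adapter(sequence, adapter="CTGTAGGCACCATCAAT", min_seed=7):
    """Cut the read where the 3' adapter starts; return it unchanged otherwise.

    The adapter may be complete or truncated by the end of the read, but
    at least min_seed of its nucleotides must be present (exact match).
    """
    seed = adapter[:min_seed]
    start = sequence.find(seed)
    while start != -1:
        tail = sequence[start:]
        if adapter.startswith(tail) or tail.startswith(adapter):
            return sequence[:start]
        start = sequence.find(seed, start + 1)
    return sequence


def clip_reads(sequences, min_length=18, keep_n=False):
    """Yield clipped reads of at least min_length nt (the -l 18 of the slide)."""
    for sequence in sequences:
        clipped = clip_adapter(sequence)
        if len(clipped) < min_length:
            continue
        if "N" in clipped and not keep_n:
            continue
        yield clipped


# Toy reads: an insert of 22 nt, an insert of 16 nt, and a read containing N.
toy_reads = [
    "TAGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "GAGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
]
print(list(clip_reads(toy_reads)))
```

Only the first toy read survives: the second is shorter than 18 nt once clipped, and the third still contains an N.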

Page 22:

Clipping in GALAXY

Page 23:

http://bowtie-bio.sourceforge.net/

Bowtie aligns reads on indexed genomes

• Download, install Bowtie and rtfm.
• Download your genome (format FASTA)
• Build the Bowtie index using bowtie-build

deepseq$ bowtie-build fasta_libraries/dmel-all-chromosome-r5.37.fasta dmel-r5.37

~nn min (but indexed references available)

deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:24 dmel-r5.37.rev.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:24 dmel-r5.37.rev.2.ebwt
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:20 dmel-r5.37.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:20 dmel-r5.37.2.ebwt
-rw-r--r-- 1 deepseq staff 331K Mar 24 17:16 dmel-r5.37.3.ebwt
-rw-r--r-- 1 deepseq staff 39M Mar 24 17:16 dmel-r5.37.4.ebwt

Page 24:

A bowtie alignment (Demo on Mac)

deepseq$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -k 1 -p 2 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa > GKG13_bowtie_output.tabulated

• ~/bin/bowtie-0.12.7/indexes/dmel: base name of the Bowtie index
• -f GKG-13_clip-pipe.fasta: the reads are in fasta format
• -v 1: at most 1 mismatch per alignment
• -k 1: report at most 1 alignment per read
• -p 2: use 2 threads
• --al droso_matched_GKG-13.fa: write the reads that align to this file
• --un unmatched_GKG13.fa: write the reads that fail to align to this file
• > GKG13_bowtie_output.tabulated: redirect the alignment report to a file

# reads processed: 5997502
# reads with at least one reported alignment: 5045151 (84.12%)
# reads that failed to align: 952351 (15.88%)
Reported 5045151 alignments to 1 output stream(s)

Page 25:

Bowtie outputs

deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa

Tabular alignment report:

deepseq$ more GKG13_bowtie_output.tabulated
21 + 2L 20487495 TGGAATGTAAAGAAGTATGGAG
30 - 3L 15836559 GTGAATTCTCCCAGTGCCAAG
25 + 3R 5916902 TGAACACAGCTGGTGGTATCC
23 - 2L 11953462 CCCGTGAATTCTTCCAGTGCCATT
27 + 3R 5916902 TGAACACAGCTGGTGGTATC
26 - 3R 9289997 TCCTGCGGCACTAGTACTTA
18 - 2L 11953465 GTGAATTCTTCCAGTGCCATT
22 - 3R 8377246 ATTGCTGGAATCAAGTTGCTGAC
20 + 3L 11650036 TTTGTGACCGACACTAACGGGTA
24 + 2R 16493585 TGGAAGACTAGTGATTTTGTT
28 + 3L 10358380 TAGGAACTTCATACCGTGCTCT
35 + X 18022302 CTTGTGCGTGTGACAGCGGCT
41 - 3RHet 138608 TGGCGACCGTGACAGGACCCG
42 + 3R 5916902 TGAACACAGCTGGTGGTATCC

Aligned sequences:

deepseq$ more droso_matched_GKG-13.fa
>21
TGGAATGTAAAGAAGTATGGAG
>26
TAAGTACTAGTGCCGCAGGA
>24
TGGAAGACTAGTGATTTTGTT
>23
AATGGCACTGGAAGAATTCACGGG
>27
TGAACACAGCTGGTGGTATC

Unaligned sequences:

deepseq$ more unmatched_GKG13.fa
>29
AGGGGGCTATTTCACTACTGGA
>33
CGATGATGACGGTACCCGTAGA
>37
GCTAGTCGGTACTTGAAAC
>59
TGGTTGCAATAGCTTCTGGCGGA
>61
GATGAGTGCTAGATGTAGGGA
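The tabular report is easy to parse. The sketch below assumes the default Bowtie output, which is tab-separated with the read name, strand, reference name, 0-based offset and sequence in the first five columns; the names `Hit`, `parse_bowtie_line`, `hits_per_reference` and `reverse_complement` are invented for the example. Note that for a hit on the minus strand the reported sequence is the reverse complement of the read, as read 26 above illustrates.

```python
from collections import Counter, namedtuple

Hit = namedtuple("Hit", "read_id strand reference offset sequence")

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_bowtie_line(line):
    """Parse the first five tab-separated columns of a Bowtie report line."""
    fields = line.rstrip("\n").split("\t")
    return Hit(fields[0], fields[1], fields[2], int(fields[3]), fields[4])


def hits_per_reference(lines):
    """Count the alignments per (reference, strand) pair."""
    counts = Counter()
    for line in lines:
        hit = parse_bowtie_line(line)
        counts[(hit.reference, hit.strand)] += 1
    return counts


def reverse_complement(sequence):
    """Recover the original read from a minus-strand hit."""
    return sequence.translate(COMPLEMENT)[::-1]


# Three lines of the report shown above, rebuilt with tabs.
report = [
    "21\t+\t2L\t20487495\tTGGAATGTAAAGAAGTATGGAG\n",
    "26\t-\t3R\t9289997\tTCCTGCGGCACTAGTACTTA\n",
    "27\t+\t3R\t5916902\tTGAACACAGCTGGTGGTATC\n",
]
print(hits_per_reference(report))
print(reverse_complement(parse_bowtie_line(report[1]).sequence))
```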

Page 26:

A pipeline for small RNA annotation (see in GED Galaxy)

Hierarchical annotation of sequence datasets

[Flowchart] Sequence reads (fasta format) go through a series of Bowtie alignments. The references shown on the slide are: Pre-miRNAs (miRBase), Transposons, Genes, Non coding RNAs, Intergenic regions, and Viruses, transgenes, etc… Each Bowtie step outputs the matched reads (fasta) with their read count, and passes the unmatched reads on to the next Bowtie step. The last step leaves the remaining unmatched sequences.
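The logic of such a cascade fits in a few lines of Python. This is only a toy sketch of the principle, not the Galaxy workflow itself: the function names are invented, the reference sequences are made up, and an exact substring search stands in for the Bowtie alignment.

```python
def annotate_hierarchically(reads, references, matcher):
    """Assign each read to the first reference class it matches.

    reads: dict of read_id -> sequence
    references: ordered list of (class_name, reference_data)
    matcher: function(sequence, reference_data) -> bool (stand-in for Bowtie)
    Returns (matched reads per class, remaining unmatched reads).
    """
    remaining = dict(reads)
    report = {}
    for class_name, reference_data in references:
        matched = {read_id: sequence
                   for read_id, sequence in remaining.items()
                   if matcher(sequence, reference_data)}
        report[class_name] = matched
        # only the unmatched reads go on to the next reference
        remaining = {read_id: sequence
                     for read_id, sequence in remaining.items()
                     if read_id not in matched}
    return report, remaining


def exact_substring_match(sequence, reference_sequences):
    """Toy matcher: the read must occur verbatim in one reference sequence."""
    return any(sequence in reference for reference in reference_sequences)


# Made-up data, for illustration only.
toy_reads = {"r1": "TGGAATG", "r2": "CCCGGG", "r3": "AAAAAA"}
toy_references = [
    ("pre-miRNAs", ["GGTGGAATGTAA"]),
    ("transposons", ["TTCCCGGGTT", "GGTGGAATGTAA"]),
]
report, unmatched = annotate_hierarchically(
    toy_reads, toy_references, exact_substring_match)
for class_name, matched in report.items():
    print(class_name, len(matched))
print("remaining unmatched", len(unmatched))
```

Read r1 matches both toy references but is counted only once, in the first class it meets, which is the point of a hierarchical annotation.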

Page 27:

samtools

http://samtools.sourceforge.net/

Preparation of a BAM file and its associated index (~3 min)

$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -M 1 --best -p 2 -S | samtools view -bS -o GKG-13_clip-pipe.fasta.bam -
$ samtools sort GKG-13_clip-pipe.fasta.bam GKG-13_clip-pipe.fasta.bam.sorted
$ samtools index GKG-13_clip-pipe.fasta.bam.sorted.bam

80M GKG-13_clip-pipe.fasta.bam
42M GKG-13_clip-pipe.fasta.bam.sorted.bam
306K GKG-13_clip-pipe.fasta.bam.sorted.bam.bai

SAM format → BAM format (for Genome Browsers)

• Sorted
• Indexed
• Compressed

Page 28:

Read visualization in a Genome Browser

• Upload of BAM file to a remote server (amazon cloud)
• Passing the URL to Ensembl (Gbrowse, Modencode, etc.)

Page 29:

Naive and primed murine pluripotent stem cells have distinct miRNA signatures

ESC1 ESC2 EpiSC2 EpiSC1 EpiSC3

A. Jouneau (INRA Jouy en Josas)
E. Heard (Institut Curie)
C. Antoniewski (Institut Pasteur)
M. Cohen-Tannoudji (Institut Pasteur)

Page 30:

miRNA profiling

Sequence reads (fasta format) → Bowtie against Pre-miRNAs (miRBase) → Bowtie output → Parsing → read maps for all miRNAs and hit list for miR_5p and miR_3p

deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt

miR profiling / hit list aggregation

Page 31:

Differential Calling

Sequence reads (fasta format) → Bowtie against Pre-miRNAs (miRBase) → Bowtie output → Parsing → read maps for all miRNAs and hit list for miR_5p and miR_3p

deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt

DESeq

Heatplus

(Bioconductor)

http://www.r-project.org/

Page 32:

touRism

Load DESeq, gplots and RColorBrewer

library(DESeq)
library(gplots)
library(RColorBrewer)

# read the hit table and use the gene column as row names
countsTable <- read.delim( "~/Documents/Pasteur_DEMO/mouse_hits.txt", header=TRUE, stringsAsFactors=TRUE )
head(countsTable)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[ , -1 ]
head(countsTable)
summary(countsTable)
plot(countsTable)
plot(log(countsTable,10))

# declare the conditions and normalize the libraries
conds <- c( "EPI", "EPI", "EPI", "ES", "ES" )
cds <- newCountDataSet( countsTable, conds )
cds <- estimateSizeFactors( cds )
sizeFactors( cds )

# sample-to-sample distances on variance-stabilized data
cds <- estimateDispersions( cds, method="pooled" )
vsd <- getVarianceStabilizedData( cds )
dists <- dist( t( vsd ) )
heatmap( as.matrix( dists ), symm=TRUE, margins=c(12,12), cexRow=1, cexCol=1 )

# heat map of the 100 rows with the highest variance across samples
SampleVar <- apply(vsd,1,var)
vsd2 <- cbind(vsd,SampleVar)
vsd3 <- vsd2[order(vsd2[,6], decreasing=TRUE),]
head(vsd3)
vsd3 <- head(vsd3,100)
vsd3 <- vsd3[,-6]
head(vsd3)
heatmap.2(vsd3, col=brewer.pal(11, "RdBu"), scale="none", trace="none", margins=c(3,45), cexRow=0.7, cexCol=1, dendrogram="column", density.info="none", keysize=0.7)

# differential calling, EPI versus ES
cds <- estimateDispersions( cds, method="per-condition", sharingMode="fit-only" )
res <- nbinomTest( cds, "EPI", "ES" )
# column 8 is padj: drop the rows without an adjusted p-value, then sort
resNA <- res[ !is.na(res$padj), ]
resNA[order(resNA$padj), ]
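Since normalization deserves a close look, here is what `estimateSizeFactors` computes, re-written as a small Python sketch (pure standard library, function name invented). As far as I understand the DESeq method, each library's size factor is the median, over genes, of the ratio between the gene count in that library and the geometric mean of the gene counts across all libraries; genes with a zero count in any library are left out.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows, one per gene, each row holding the raw counts
    of that gene in every library (same column order in all rows).
    """
    n_samples = len(counts[0])
    # genes with a zero count have no finite log geometric mean: skip them
    log_rows = [[math.log(value) for value in row]
                for row in counts if all(value > 0 for value in row)]
    if not log_rows:
        raise ValueError("no gene has non-zero counts in all libraries")
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    return [
        math.exp(median(row[j] - mean
                        for row, mean in zip(log_rows, log_geo_means)))
        for j in range(n_samples)
    ]


# Toy table: the second library is sequenced exactly twice as deep.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # ignored: zero count in the first library
]
factors = size_factors(toy_counts)
print(factors)
```

Dividing each column by its size factor puts the libraries on a common scale before any comparison between conditions.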

Page 33:

Deep Seq Data Analysis, Final Take Home Messages

Think about your deep seq replicates from the start

Keep your hands on your data, from the « fastq stage » onward

Keep your hands on the analysis, because this is your project

Always keep an eye on « Normalization » and « Differential »

Don’t be afraid of bioinformatics, but don’t reinvent the wheel

It’s open source, open manual

It’s not magic, yes you can

It’s fun

You cannot escape, so take it easy.