(A tool kit for) Deep Seq Data Analysis

[email protected]

http://drosophile.org

Mouse Genetics, January 24, 2013, 14:00–16:30

Why deep seq analyses?

Your project involves:

• Mutation and SNP identification or analysis (genome re-sequencing)
• Gene/Disease Linkage (genome re-sequencing)
• Pathogen identification (de novo sequence assembly or re-sequencing)
• Transcriptome analysis (RNAseq)
• DNA methylation study (MeDIP-seq)
• Chromatin study (ChIPseq)
• Transcription factor study (ChIPseq)
• miRNAs, siRNA, piRNA, tRF, etc... (small RNA seq)
• Single cell transcriptome analysis

Deep seq

• Qualitative information
• Quantitative information

Sequencing Technologies

Sequencing Technologies: Quantitative Facts

Sequencing Technologies: Focus on Illumina technology

« Library »

« Clusters »

Cluster Sequences

For mRNA seq, non Directional

20-30nt RNA gel purification

small RNAseq library preparation (Directional)

(Biases)

Library “Bar coding”

ChIPseq library preparation (Non Directional)

What can I do with my sequence reads?

• Locus discovery/mutation discovery/Splicing annotation
  Annotation & visualization

• Read quantitative profiling (Transcriptome, chromatin profiling, etc..)
  Statistics

• Structure analysis of precursors, signatures…
  Maths & Statistics

• …

Platform Selection

Library Preparation

Sequencing

Quality Control

Alignment / Assembly

Visualization & Statistics
• Normalization (library comparison)
• Peak finding (Binding sites, Breakpoints, etc…)
• Differential Calling (expression, variants, etc)

Think about the number of replicates from the start

What am I going to sequence ?

Inherent biases
Specific benefits (Read length, single or paired ends, number of reads)

Whole genome
Whole exome
Target enrichment

Size selection
Amplification
Single Cell Protocol

Number of Cycles
Number of lanes

Adapter Clipping
Quality trimming

Contaminant and Error identification

Bowtie, BWA, … (P Flicek & E Birney, Nature Methods 2009)

Velvet, SSAKE, … (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))

R & Open Source software tools

Flowchart of a sequencing project

A case study: miRNAs (and other small RNAs)

metHen1

snoRNA, tRNA, rRNA fragments+

20-30nt RNA gel purification

small RNAseq library preparation (Directional)

(Biases)

Library “Bar coding”

Basic Material

• A sequence file (fastq format)
• A computer with enough RAM (8 Gigabytes is a good start)
• A Unix compliant Operating System + a bit of « basic know how »
• A couple of very useful software tools with a Graphical User Interface (GUI)
  • TextWrangler, an advanced text editor with RegEx integration
  • R (for statistics and, more importantly, Graphics)
  • …
• GALAXY is an (our) option
• Knowledge of at least one programming language
  • Python, Perl, Java, C++…

What does this big* fastq file contain?

* Size limit to open a text file with a text editor (~1.2 Gb)

Unix Terminal: more <path/to/the/file>

$ more GKG-13.fastq
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^

………

Line 1: Header
Line 2: Sequence
Line 3: Header
Line 4: Quality (ASCII encoded)
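The four-line record layout above can be read with a few lines of Python. This is only a minimal sketch: the function names `read_fastq` and `phred_scores` are made up for the illustration, and the quality offset of 64 is an assumption suggested by the letters seen in the quality lines (Illumina pipelines of that period); more recent files generally use an offset of 33.

```python
from itertools import islice

def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of fastq lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of file (or truncated last record)
        header, sequence, _second_header, quality = record
        yield header[1:], sequence, quality  # drop the leading "@"

def phred_scores(quality, offset=64):
    """Convert an ASCII quality string to integer scores.

    offset=64 is an assumption (older Illumina encoding); use 33 otherwise.
    """
    return [ord(char) - offset for char in quality]

# First record of the listing above, used as demo input.
demo = [
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n",
]
for name, seq, qual in read_fastq(demo):
    print(name, len(seq), phred_scores(qual)[:5])
```

With this encoding, the "B" found at the second position of every read decodes to a score of 2, which is consistent with the "N" called at that position.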

How many sequence reads in my file?

wc -l <path/to/my/file>

$ wc -l GKG-13.fastq

25703828 GKG-13.fastq

>>> 25 703 828 / 4 = 6 425 957 sequence reads

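#!/usr/bin/env python
# fastq_complexity.py: count the reads and the distinct sequences of a fastq file
import sys

readDic = {}      # sequence -> number of occurrences
Nbre_reads = 0
Nbre_lines = 0
with open(sys.argv[1]) as F:
    for line in F:
        Nbre_lines += 1
        if Nbre_lines % 4 == 2:   # the sequence is the 2nd line of each 4-line record
            Nbre_reads += 1
            readDic[line] = readDic.get(line, 0) + 1
print("%s reads" % Nbre_reads)
print("%s distinct sequences" % len(readDic))
print("%f complexity" % (len(readDic) / float(Nbre_reads)))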

$ fastq_complexity.py GKG-13.fastq
6 425 957 reads
550 706 distinct sequences
0.085700 complexity

Do my sequence reads contain the adapter?

cat <path/file> | grep CTGTAGG | wc -l

grep -c "CTGTAGG" <path/file>

lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep CTGTAGG | wc -l

6355061

lbcd-05:GKG13demo deepseq$ grep -c "CTGTAGG" GKG-13.fastq

6355061

6 355 061 out of 6 425 957 sequences… not bad (98.9%)

My 3’ adapter: CTGTAGGCACCATCAAT

By contrast, with a different sequence:

lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep ATCTCGT | wc -l

308

Take home message n°1

Unix Operating Systems already contain powerful native tools for text analysis

$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l

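cat GKG-13.fastq: outputs the content of the file, line by line

|: the output is passed to the input of the next command

perl -ne: the perl interpreter is called with the -ne options (loop over the input lines & execute)

'print if /^[ATGCN]{22}CTGTAGG/': inline perl code; the regular expression matches lines that start with 22 nucleotides followed by the beginning of the adapter

|: the output is passed to the input of the next command

wc -l: wc with the -l option counts the lines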
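For readers more at ease with Python than with perl, the same count can be sketched as follows. The regular expression is the one of the one-liner; the function name `count_matching_lines` and the three demo lines (sequence lines taken from the fastq listing shown earlier) are only there for the illustration.

```python
import re

# Same pattern as the perl one-liner: 22 nucleotides, then the start of the adapter.
PATTERN = re.compile(r"^[ATGCN]{22}CTGTAGG")

def count_matching_lines(lines, pattern=PATTERN):
    """Count the lines that match the pattern (perl -ne 'print if ...' | wc -l)."""
    return sum(1 for line in lines if pattern.search(line))

demo = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # 22-nt insert, then the adapter
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",  # 30-nt insert, then the adapter
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # 16-nt insert, then the adapter
]
print(count_matching_lines(demo))  # 1

# On a real file (path given as an example):
# with open("GKG-13.fastq") as fastq:
#     print(count_matching_lines(fastq))
```

Changing the number between the braces gives the count for another insert size, which is a quick way to look at the size distribution of the small RNAs in the library.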

Quality Control. Can I trust my sequences?

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Demo with the GUI version

fastQC in GALAXY

Quality Control. Can I trust my sequences?

How to clip the adapter?

3’ adapter: CTGTAGGCACCATCAAT

http://hannonlab.cshl.edu/fastx_toolkit/index.html

fastq_to_fasta -r -n -i GKG-13.fastq | fastx_clipper -a CTGTAGGCACCATCAAT -l 18 -o GKG-13_clip-pipe.fasta
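To make the clipping step less of a black box, here is a minimal Python sketch of what it does in principle. It is not the algorithm of fastx_clipper: the adapter is only searched as an exact match (complete, or truncated by the end of the read), and the names `clip_adapter`, `clip_reads`, `min_overlap` and `keep_n` are assumptions made for the illustration. Discarding inserts that contain N is likewise an assumption about the default behaviour of the tool. The first demo read is an illustrative sequence, the other two come from the fastq listing shown earlier.

```python
def clip_adapter(sequence, adapter="CTGTAGGCACCATCAAT", min_overlap=7):
    """Return the sequence with the 3' adapter removed (exact matching only)."""
    pos = sequence.find(adapter)
    if pos >= 0:
        return sequence[:pos]
    # The adapter may be truncated by the end of the read: try the longest prefix first.
    longest = min(len(adapter) - 1, len(sequence))
    for overlap in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:overlap]):
            return sequence[:-overlap]
    return sequence  # no adapter found, sequence left unchanged

def clip_reads(sequences, min_length=18, keep_n=False):
    """Clip each read and keep the inserts of at least min_length nucleotides."""
    for sequence in sequences:
        insert = clip_adapter(sequence)
        if len(insert) < min_length:
            continue
        if "N" in insert and not keep_n:
            continue
        yield insert

demo = [
    "TAGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # 22-nt insert + truncated adapter
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # 16-nt insert: too short after clipping
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",  # 30-nt insert containing an N
]
for insert in clip_reads(demo):
    print(insert)
```

The `min_length=18` default mirrors the `-l 18` option of the command above; real clippers also tolerate mismatches in the adapter, which this sketch does not.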

Clipping in GALAXY

http://bowtie-bio.sourceforge.net/

Bowtie aligns reads on indexed genomes

• Download and install Bowtie, and read the manual
• Download your genome (FASTA format)
• Build the Bowtie index using bowtie-build

deepseq$ bowtie-build fasta_libraries/dmel-all-chromosome-r5.37.fasta dmel-r5.37

~nn min (but indexed references available)

deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:24 dmel-r5.37.rev.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:24 dmel-r5.37.rev.2.ebwt
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:20 dmel-r5.37.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:20 dmel-r5.37.2.ebwt
-rw-r--r-- 1 deepseq staff 331K Mar 24 17:16 dmel-r5.37.3.ebwt
-rw-r--r-- 1 deepseq staff 39M Mar 24 17:16 dmel-r5.37.4.ebwt

deepseq$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -k 1 -p 2 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa > GKG13_bowtie_output.tabulated

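A bowtie alignment (Demo on Mac)

~/bin/bowtie-0.12.7/indexes/dmel: the Bowtie index
-f GKG-13_clip-pipe.fasta: the reads, in fasta format
-v 1: at most 1 mismatch per alignment
-k 1: report at most 1 alignment per read
-p 2: use 2 processor threads
--al droso_matched_GKG-13.fa: write the reads that align to this file
--un unmatched_GKG13.fa: write the reads that fail to align to this file
> GKG13_bowtie_output.tabulated: redirect the alignment report to this file

# reads processed: 5997502
# reads with at least one reported alignment: 5045151 (84.12%)
# reads that failed to align: 952351 (15.88%)
Reported 5045151 alignments to 1 output stream(s)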

Bowtie outputs

deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa

deepseq$ more GKG13_bowtie_output.tabulated
(columns: read name, strand, chromosome, 0-based offset, sequence)
21 + 2L 20487495 TGGAATGTAAAGAAGTATGGAG
30 - 3L 15836559 GTGAATTCTCCCAGTGCCAAG
25 + 3R 5916902 TGAACACAGCTGGTGGTATCC
23 - 2L 11953462 CCCGTGAATTCTTCCAGTGCCATT
27 + 3R 5916902 TGAACACAGCTGGTGGTATC
26 - 3R 9289997 TCCTGCGGCACTAGTACTTA
18 - 2L 11953465 GTGAATTCTTCCAGTGCCATT
22 - 3R 8377246 ATTGCTGGAATCAAGTTGCTGAC
20 + 3L 11650036 TTTGTGACCGACACTAACGGGTA
24 + 2R 16493585 TGGAAGACTAGTGATTTTGTT
28 + 3L 10358380 TAGGAACTTCATACCGTGCTCT
35 + X 18022302 CTTGTGCGTGTGACAGCGGCT
41 - 3RHet 138608 TGGCGACCGTGACAGGACCCG
42 + 3R 5916902 TGAACACAGCTGGTGGTATCC

deepseq$ more droso_matched_GKG-13.fa
>21
TGGAATGTAAAGAAGTATGGAG
>26
TAAGTACTAGTGCCGCAGGA
>24
TGGAAGACTAGTGATTTTGTT
>23
AATGGCACTGGAAGAATTCACGGG
>27
TGAACACAGCTGGTGGTATC

deepseq$ more unmatched_GKG13.fa
>29
AGGGGGCTATTTCACTACTGGA
>33
CGATGATGACGGTACCCGTAGA
>37
GCTAGTCGGTACTTGAAAC
>59
TGGTTGCAATAGCTTCTGGCGGA
>61
GATGAGTGCTAGATGTAGGGA

GKG13_bowtie_output.tabulated: tabular alignment report

droso_matched_GKG-13.fa: aligned sequences

unmatched_GKG13.fa: unaligned sequences
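The tabular report is easy to parse once the columns are known. The sketch below assumes whitespace-separated columns in the order shown on the slide (read name, strand, reference, offset, sequence); the function names are hypothetical. It also illustrates why read 26 looks different in the report and in the fasta of matched reads: for an alignment on the minus strand, the report shows the sequence of the reference strand, that is, the reverse complement of the read.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ATGCN", "TACGN")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def parse_alignment(line):
    """Split one report line into (read name, strand, reference, offset, read sequence)."""
    name, strand, reference, offset, sequence = line.split()[:5]
    if strand == "-":
        # Restore the sequence as it was read by the sequencer.
        sequence = reverse_complement(sequence)
    return name, strand, reference, int(offset), sequence

# Three lines taken from the report shown above.
demo = [
    "21 + 2L 20487495 TGGAATGTAAAGAAGTATGGAG",
    "26 - 3R 9289997 TCCTGCGGCACTAGTACTTA",
    "27 + 3R 5916902 TGAACACAGCTGGTGGTATC",
]
alignments = [parse_alignment(line) for line in demo]
print(alignments[1][4])                                 # TAAGTACTAGTGCCGCAGGA
print(Counter(align[2] for align in alignments))        # reads per chromosome arm
```

The same loop, applied to the whole report, gives the number of reads per chromosome or per strand without loading the 351M file into a spreadsheet.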

A pipeline for small RNA annotation (see in GED Galaxy)

Hierarchical annotation of sequence datasets

Sequence reads (fasta format) are aligned with Bowtie against one reference set at a time. At each step, the matched reads (fasta) give a read count and the unmatched reads are passed to the next Bowtie alignment.

Reference sets:
• Pre-miRNAs (miRBase)
• Transposons
• Genes
• Non coding RNAs
• Intergenic regions
• Viruses, transgenes, etc…

Remaining unmatched sequences
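The logic of such a hierarchical annotation can be sketched in a few lines of Python. This is a toy version under strong assumptions: exact substring matching stands in for the Bowtie alignment, and the function name, the reference labels and the sequences of the demo are all invented for the illustration.

```python
def hierarchical_annotation(reads, references):
    """Assign each read to the first reference set in which it is found.

    reads: list of sequences.
    references: ordered list of (label, list of reference sequences).
    Returns (read count per label, remaining unmatched reads).
    """
    counts = {}
    unmatched = list(reads)
    for label, ref_sequences in references:
        matched, remaining = [], []
        for read in unmatched:
            # Toy stand-in for Bowtie: the read must be an exact substring.
            if any(read in ref for ref in ref_sequences):
                matched.append(read)
            else:
                remaining.append(read)
        counts[label] = len(matched)
        unmatched = remaining  # only the unmatched reads go to the next step
    return counts, unmatched

# Invented demo data.
references = [
    ("pre-miRNAs", ["AAATGGAATGTAAAGAAGTATGGAGCC"]),
    ("transposons", ["GGGTTTACGTACGTTTT"]),
]
reads = ["TGGAATGTAAAGAAGTATGGAG", "ACGTACG", "CCCCCCCC", "TGGAATGTAAAGAAGTATGGAG"]
counts, leftover = hierarchical_annotation(reads, references)
print(counts)
print(leftover)
```

Because a read is removed from the pool as soon as it matches, the order of the reference sets matters: a read that matches both a pre-miRNA and a transposon is only counted in the first category.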

samtools

http://samtools.sourceforge.net/

Preparation of a BAM file and its associated index

$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -M 1 --best -p 2 -S | samtools view -bS -o GKG-13_clip-pipe.fasta.bam - ;
  samtools sort GKG-13_clip-pipe.fasta.bam GKG-13_clip-pipe.fasta.bam.sorted ;
  samtools index GKG-13_clip-pipe.fasta.bam.sorted.bam

306K GKG-13_clip-pipe.fasta.bam.sorted.bam.bai

42M GKG-13_clip-pipe.fasta.bam.sorted.bam

80M GKG-13_clip-pipe.fasta.bam

Sam format

Bam format (for Genome Browsers)
• Sorted
• Indexed
• Compressed

~3 min

Upload of BAM file to a remote server (amazon cloud)

Passing the URL to Ensembl (Gbrowse, Modencode, etc..)

Read visualization in a Genome Browser

Naive and primed murine pluripotent stem cells have distinct miRNA signatures

ESC1 ESC2 EpiSC2 EpiSC1 EpiSC3

A. Jouneau (INRA Jouy en Josas)
E. Heard (Institut Curie)
C. Antoniewski (Institut Pasteur)
M. Cohen-Tannoudji (Institut Pasteur)

miRNA profiling



Bowtie Output

Parsing

Read maps for all miRNAs Hit list for miR_5p and miR_3p

deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt

miR profiling / hit list aggregation
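The profiler script itself is not shown here, so the following is only a sketch of what a hit list aggregation could look like: reads are counted per reference (third column of a Bowtie report), then the hit lists of several libraries are merged into one table with a "gene" column, which is the kind of table read later by the R script. The function names, the library names and the miRNA names of the demo are hypothetical.

```python
from collections import Counter

def hit_list(alignment_lines):
    """Count the reads aligned to each reference (third column of a Bowtie report)."""
    return Counter(line.split()[2] for line in alignment_lines if line.strip())

def aggregate(hit_lists):
    """Merge per-library hit lists into a table (list of rows, header first).

    hit_lists: dict library name -> Counter of reads per reference.
    A reference absent from a library gets a count of 0.
    """
    libraries = list(hit_lists)
    genes = sorted(set().union(*hit_lists.values()))
    table = [["gene"] + libraries]
    for gene in genes:
        table.append([gene] + [hit_lists[lib].get(gene, 0) for lib in libraries])
    return table

# Invented demo data: two tiny Bowtie reports.
es1 = hit_list(["1 + mir-A 10 ACGT", "2 + mir-A 10 ACGT", "3 + mir-B 4 TTGA"])
epi1 = hit_list(["1 + mir-B 4 TTGA", "2 + mir-C 7 GGCA"])
for row in aggregate({"ES1": es1, "EPI1": epi1}):
    print("\t".join(str(cell) for cell in row))
```

Filling the missing values with zeros matters: a miRNA that is absent from one library is often the most interesting line of the table.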

Differential Calling



Bowtie Output

Parsing

Read maps for all miRNAs Hit list for miR_5p and miR_3p

deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt

DESeq

Heatplus

(Bioconductor)

http://www.r-project.org/

touRism

countsTable <- read.delim( "~/Documents/Pasteur_DEMO/mouse_hits.txt", header=TRUE, stringsAsFactors=TRUE )

head(countsTable)

rownames(countsTable)<- countsTable$gene

countsTable <- countsTable[ , -1 ]

head(countsTable)

summary(countsTable)

plot(countsTable)

plot(log(countsTable,10))

conds <- c( "EPI", "EPI", "EPI", "ES", "ES" )

cds <- newCountDataSet( countsTable, conds )

cds <- estimateSizeFactors( cds )

sizeFactors( cds )

cds = estimateDispersions( cds, method="pooled")

vsd <- getVarianceStabilizedData( cds )

dists <- dist( t( vsd ) )

heatmap( as.matrix( dists ), symm=TRUE, margins=c(12,12),cexRow=1, cexCol=1)

SampleVar<-apply(vsd,1,var)

vsd2<-cbind(vsd,SampleVar)

vsd3<-vsd2[order(vsd2[,6], decreasing=TRUE),]

head(vsd3)

vsd3<-head(vsd3,100)

vsd3<-vsd3[,-6]

head(vsd3)

heatmap.2(vsd3, col=brewer.pal(11, "RdBu"), scale="none", trace="none", margins=c(3,45), cexRow=0.7, cexCol=1, dendrogram="column", density.info="none", keysize=0.7)

cds = estimateDispersions( cds, method="per-condition", sharingMode="fit-only")

res = nbinomTest( cds, "EPI", "ES" )

resNA = res[!is.na(res[,8]), ]   # drop rows without adjusted p-value (safe even when there is no NA)

resNA[order(resNA[,8]), ]

Load DESeq, gplots and RColorBrewer before running the script: library(DESeq); library(gplots); library(RColorBrewer)
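The normalization step hidden behind estimateSizeFactors deserves a closer look, since every later comparison depends on it. The sketch below illustrates the median-of-ratios idea in plain Python: for each gene, the counts are compared to their geometric mean across libraries, and the size factor of a library is the median of these ratios. It is an illustration of the principle, not the DESeq code; the function name and the demo counts are invented.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict gene -> list of read counts (one value per library).
    Returns a list with one size factor per library.
    """
    n_libraries = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_libraries)]
    for values in counts.values():
        if min(values) <= 0:
            continue  # a zero count gives no usable geometric mean
        log_geo_mean = sum(math.log(v) for v in values) / n_libraries
        for i, value in enumerate(values):
            log_ratios[i].append(math.log(value) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]

# Invented demo: the second library was sequenced twice as deep as the first.
demo_counts = {"mir-A": [10, 20], "mir-B": [100, 200], "mir-C": [5, 10], "mir-D": [0, 3]}
factors = size_factors(demo_counts)
print(factors)
normalized = {gene: [v / f for v, f in zip(values, factors)]
              for gene, values in demo_counts.items()}
print(normalized["mir-B"])
```

Dividing each count by the size factor of its library makes the two libraries comparable; using a median rather than the total number of reads keeps a few very abundant miRNAs from driving the normalization.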

Deep Seq Data Analysis, Final Take Home Messages

Think about your deep seq replicates from the start

Keep a hand on your data, from the « fastq stage » onwards

Keep a hand on the analysis because this is your project

Always keep an eye on « Normalization » and « Differential »

Don’t be afraid of bioinformatics, but don’t reinvent the wheel

It’s open source, open manual

It’s not magic, yes you can

It’s fun

You cannot escape, so take it easy.
