(A tool kit for) Deep Seq Data Analysis
http://drosophile.org
Mouse Genetics
January 24, 2013, 14:00–16:30
Why deep seq analyses?
Your project involves:
• Mutation and SNP identification or analysis (genome re-sequencing)
• Gene/disease linkage (genome re-sequencing)
• Pathogen identification (de novo sequence assembly or re-sequencing)
• Transcriptome analysis (RNAseq)
• DNA methylation study (MeDIP-seq)
• Chromatin study (ChIPseq)
• Transcription factor study (ChIPseq)
• miRNAs, siRNA, piRNA, tRF, etc. (small RNA seq)
• Single cell transcriptome analysis
Deep seq:
• Qualitative information
• Quantitative information
Sequencing Technologies
Sequencing Technologies: Quantitative Facts
From
Sequencing Technologies: Focus on Illumina technology
« Library »
« Clusters »
Cluster Sequences
For mRNA seq, non Directional
20-30nt RNA gel purification
small RNAseq library preparation (Directional)
(Biases)
Library “Bar coding”
ChIPseq library preparation (Non Directional)
What can I do with my sequence reads?
• Locus discovery / mutation discovery / splicing annotation: Annotation & visualization
• Read quantitative profiling (transcriptome, chromatin profiling, etc.): Statistics
• Structure analysis of precursors, signatures…: Maths & Statistics
• …
Platform Selection
Library Preparation
Sequencing
Quality Control
Alignment / Assembly
Visualization & Statistics
• Normalization (library comparison)
• Peak finding (binding sites, breakpoints, etc.)
• Differential calling (expression, variants, etc.)
Think about the number of replicates from the start
What am I going to sequence?
Inherent biases, specific benefits (read length, single or paired ends, number of reads)
Whole genome, whole exome, target enrichment
Size selection, amplification, single cell protocol
Number of cycles, number of lanes
Adapter clipping, quality trimming
Contaminant and error identification
Bowtie, BWA, … (P Flicek & E Birney, Nature Methods 2009)
Velvet, SSAKE, … (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))
R & open source software tools
Flowchart of a sequencing project
A case study: miRNAs (and other small RNAs)
metHen1
snoRNA, tRNA, rRNA fragments+
20-30nt RNA gel purification
small RNAseq library preparation (Directional)
(Biases)
Library “Bar coding”
Basic Material
• A sequence file (fastq format)
• A computer with enough RAM (8 gigabytes is a good start)
• A Unix-compliant operating system + a bit of « basic know-how »
• A couple of very useful software tools with a Graphical User Interface (GUI)
    • TextWrangler, an advanced text editor with RegEx integration
    • R (for statistics and, more importantly, graphics)
    • …
• GALAXY is an (our) option
• Knowledge of at least one programming language
    • Python, Perl, Java, C++…
What does this big* fastq file contain?
* Size limit to open a text file with a text editor (~1.2 Gb)
In a Unix terminal: more <path/to/the/file>
$ more GKG-13.fastq
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
………
Each record has four lines: Header, Sequence, Header, Quality (ASCII encoded)
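The four-line record layout (header, sequence, header, ASCII-encoded quality) can be read with a few lines of Python. This is a minimal sketch that is not part of the original slides. The function names read_fastq and decode_quality are hypothetical, and the quality offset of 64 is an assumption that fits the characters seen in this file (older Illumina pipelines); many current files use an offset of 33 instead.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a fastq file handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' line (repeats the header), ignored here
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality

def decode_quality(quality, offset=64):
    """Convert ASCII-encoded qualities to integers (the offset is an assumption)."""
    return [ord(char) - offset for char in quality]

# First record of GKG-13.fastq, as displayed by the more command
example = io.StringIO(
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n"
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n"
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n"
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n"
)
for header, sequence, quality in read_fastq(example):
    scores = decode_quality(quality)
    print(header, len(sequence), min(scores), max(scores))
```

With a real file, pass open("GKG-13.fastq") instead of the io.StringIO example.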
How many sequence reads in my file?
wc -l <path/to/my/file>
$ wc -l GKG-13.fastq
25703828 GKG-13.fastq
>>> 25 703 828 lines / 4 = 6 425 957 sequence reads
#!/usr/bin/python
import sys

readDic = {}     # sequence -> number of occurrences
Nbre_reads = 0
Nbre_lines = 0
F = open(sys.argv[1])
for line in F:
    Nbre_lines += 1
    if Nbre_lines % 4 == 2:    # 2nd line of each 4-line record: the sequence
        Nbre_reads += 1
        readDic[line] = readDic.get(line, 0) + 1
F.close()
print("%s reads" % Nbre_reads)
print("%s distinct sequences" % len(readDic))
print("%f complexity" % (len(readDic) / float(Nbre_reads)))

$ fastq_complexity.py GKG-13.fastq
6 425 957 reads
550 706 distinct sequences
0.085700 complexity
Do my sequence reads contain the adapter?
cat <path/file> | grep CTGTAGG | wc -l
grep -c "CTGTAGG" <path/file>
lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
lbcd-05:GKG13demo deepseq$ grep -c "CTGTAGG" GKG-13.fastq
6355061
6 355 061 out of 6 425 957 sequences… not bad (98.9%)
My 3’ adapter: CTGTAGGCACCATCAAT
By contrast, with a 7-mer that is not part of the adapter:
lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep ATCTCGT | wc -l
308
Take home message n°1
Unix Operating Systems already contain powerful native tools for text analysis
$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l
• cat GKG-13.fastq: outputs the content of a file, line by line
• |: the output is passed to the input of the next command
• perl -ne: the perl interpreter is called with the -ne options (loop & execute)
• 'print if /^[ATGCN]{22}CTGTAGG/': inline perl code; the part between the slashes is a regular expression
• |: the output is passed to the input of the next command
• wc -l: wc with the -l option counts the lines
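The same filter can be written in Python. This is a minimal sketch, not taken from the slides; the function name count_inserts_with_adapter is hypothetical. As in the perl one-liner, the pattern keeps lines that start with 22 nucleotides immediately followed by the first 7 nucleotides of the 3’ adapter.

```python
import re

# Same regular expression as in the perl one-liner
PATTERN = re.compile(r"^[ATGCN]{22}CTGTAGG")

def count_inserts_with_adapter(lines):
    """Count lines made of 22 nucleotides immediately followed by CTGTAGG."""
    return sum(1 for line in lines if PATTERN.match(line))

reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # 22 nt insert, then the adapter
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",  # 21 nt insert, then the adapter
]
print(count_inserts_with_adapter(reads))  # prints 1
```

On a real file, the open file handle can be passed directly, for instance count_inserts_with_adapter(open("GKG-13.fastq")). Changing {22} to another length gives the count for another insert size.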
Quality Control. Can I trust my sequences?
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Demo with the GUI version
fastQC in GALAXY
Quality Control. Can I trust my sequences?
How to clip the adapter?
3’ adapter: CTGTAGGCACCATCAAT
http://hannonlab.cshl.edu/fastx_toolkit/index.html
fastq_to_fasta -r -n -i GKG-13.fastq | fastx_clipper -a CTGTAGGCACCATCAAT -l 18 -o GKG-13_clip-pipe.fasta
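To make the clipping step concrete, here is a simplified Python sketch of what the command line above is expected to do: remove the 3’ adapter and discard reads shorter than 18 nt after clipping (-l 18). It is not the algorithm used by fastx_clipper. It uses exact matching only, the function names and the min_overlap parameter are hypothetical, and dropping reads that still contain an N is an assumption about the clipper defaults.

```python
ADAPTER = "CTGTAGGCACCATCAAT"  # 3' adapter given in the slides

def clip_adapter(read, adapter=ADAPTER, min_overlap=7):
    """Return the read without its 3' adapter (exact matching only).

    Looks for the full adapter anywhere in the read, then for an adapter
    prefix of at least min_overlap nucleotides at the 3' end of the read.
    """
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    for overlap in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adapter found, the read is kept as is

def clip_reads(reads, min_length=18):
    """Clip each read, then drop reads containing N or shorter than min_length."""
    for read in reads:
        clipped = clip_adapter(read)
        if "N" not in clipped and len(clipped) >= min_length:
            yield clipped

reads = [
    "TAGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # 22 nt insert + truncated adapter
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # 16 nt insert with an N: dropped
]
print(list(clip_reads(reads)))
```

Real clippers also tolerate mismatches in the adapter, which this sketch does not.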
Clipping in GALAXY
http://bowtie-bio.sourceforge.net/
Bowtie aligns reads on indexed genomes
• Download and install Bowtie, and read the manual.
• Download your genome (FASTA format)
• Build the Bowtie index using bowtie-build
deepseq$ bowtie-build fasta_libraries/dmel-all-chromosome-r5.37.fasta dmel-r5.37
~nn min (but indexed references available)
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:24 dmel-r5.37.rev.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:24 dmel-r5.37.rev.2.ebwt
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:20 dmel-r5.37.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:20 dmel-r5.37.2.ebwt
-rw-r--r-- 1 deepseq staff 331K Mar 24 17:16 dmel-r5.37.3.ebwt
-rw-r--r-- 1 deepseq staff 39M Mar 24 17:16 dmel-r5.37.4.ebwt
deepseq$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -k 1 -p 2 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa > GKG13_bowtie_output.tabulated
A bowtie alignment (Demo on Mac)
• ~/bin/bowtie-0.12.7/indexes/dmel: basename of the Bowtie index
• -f GKG-13_clip-pipe.fasta: the reads are in fasta format
• -v 1: at most 1 mismatch per alignment
• -k 1: report at most 1 alignment per read
• -p 2: use 2 threads
• --al droso_matched_GKG-13.fa: write the reads that aligned to this file
• --un unmatched_GKG13.fa: write the reads that failed to align to this file
• > GKG13_bowtie_output.tabulated: redirect the alignment report to this file
# reads processed: 5997502
# reads with at least one reported alignment: 5045151 (84.12%)
# reads that failed to align: 952351 (15.88%)
Reported 5045151 alignments to 1 output stream(s)
Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa
deepseq$ more GKG13_bowtie_output.tabulated
21 + 2L 20487495 TGGAATGTAAAGAAGTATGGAG
30 - 3L 15836559 GTGAATTCTCCCAGTGCCAAG
25 + 3R 5916902 TGAACACAGCTGGTGGTATCC
23 - 2L 11953462 CCCGTGAATTCTTCCAGTGCCATT
27 + 3R 5916902 TGAACACAGCTGGTGGTATC
26 - 3R 9289997 TCCTGCGGCACTAGTACTTA
18 - 2L 11953465 GTGAATTCTTCCAGTGCCATT
22 - 3R 8377246 ATTGCTGGAATCAAGTTGCTGAC
20 + 3L 11650036 TTTGTGACCGACACTAACGGGTA
24 + 2R 16493585 TGGAAGACTAGTGATTTTGTT
28 + 3L 10358380 TAGGAACTTCATACCGTGCTCT
35 + X 18022302 CTTGTGCGTGTGACAGCGGCT
41 - 3RHet 138608 TGGCGACCGTGACAGGACCCG
42 + 3R 5916902 TGAACACAGCTGGTGGTATCC
deepseq$ more droso_matched_GKG-13.fa
>21
TGGAATGTAAAGAAGTATGGAG
>26
TAAGTACTAGTGCCGCAGGA
>24
TGGAAGACTAGTGATTTTGTT
>23
AATGGCACTGGAAGAATTCACGGG
>27
TGAACACAGCTGGTGGTATC
deepseq$ more unmatched_GKG13.fa
>29
AGGGGGCTATTTCACTACTGGA
>33
CGATGATGACGGTACCCGTAGA
>37
GCTAGTCGGTACTTGAAAC
>59
TGGTTGCAATAGCTTCTGGCGGA
>61
GATGAGTGCTAGATGTAGGGA
Tabular alignment report
Aligned sequences Unaligned sequences
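The tabular alignment report is easy to parse. The sketch below is illustrative only and uses hypothetical function names. It assumes the five fields displayed on the slide (read name, strand, reference, offset, sequence), separated by whitespace; any further columns present in a real report are ignored.

```python
from collections import Counter

def parse_bowtie_line(line):
    """Split one alignment line and return its first five fields."""
    fields = line.split()
    name, strand, reference, offset, sequence = fields[:5]
    return name, strand, reference, int(offset), sequence

def hits_per_reference(lines):
    """Count reported alignments per (reference, strand) pair."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip empty lines
            _, strand, reference, _, _ = parse_bowtie_line(line)
            counts[(reference, strand)] += 1
    return counts

# A few lines taken from the report shown on the slide
report = [
    "21 + 2L 20487495 TGGAATGTAAAGAAGTATGGAG",
    "25 + 3R 5916902 TGAACACAGCTGGTGGTATCC",
    "27 + 3R 5916902 TGAACACAGCTGGTGGTATC",
    "23 - 2L 11953462 CCCGTGAATTCTTCCAGTGCCATT",
]
for (reference, strand), count in sorted(hits_per_reference(report).items()):
    print(reference, strand, count)
```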
Hierarchical annotation of sequence datasets
A pipeline for small RNA annotation (see in GED Galaxy)
Sequence reads (fasta format)
• Bowtie is run against a series of references: Pre-miRNAs (miRBase), Transposons, Genes, Non coding RNAs, Intergenic regions, Viruses, transgenes, etc…
• Each Bowtie step outputs matched reads (fasta) and a read count
• The unmatched reads of each step are the input of the next Bowtie step
• Remaining unmatched sequences
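The cascade of Bowtie runs, where the unmatched reads of one step feed the next, can be sketched as a small command builder. This is an assumption-laden illustration, not the Galaxy pipeline itself: the function name build_cascade, the index names and the output file names are all hypothetical, and the order of the references is assumed. Only Bowtie options already used in this talk (-f, -v, -k, -p, --al, --un) appear. The sketch only builds the commands, it does not run Bowtie.

```python
def build_cascade(reads_fasta, index_names,
                  bowtie_options=("-v", "1", "-k", "1", "-p", "2")):
    """Return one bowtie step per reference index.

    Each step is a dictionary holding the argument list ("args") and the file
    that should receive the alignment report ("stdout"). The unmatched reads
    of a step (--un) are the input of the following step.
    """
    steps = []
    current_input = reads_fasta
    for number, index in enumerate(index_names, start=1):
        matched = "step%d_%s_matched.fa" % (number, index)
        unmatched = "step%d_%s_unmatched.fa" % (number, index)
        args = ["bowtie", index, "-f", current_input]
        args += list(bowtie_options)
        args += ["--al", matched, "--un", unmatched]
        steps.append({"args": args,
                      "stdout": "step%d_%s_output.tabulated" % (number, index)})
        current_input = unmatched
    return steps

# Hypothetical index names, one per reference class of the pipeline
indexes = ["premiRNAs", "transposons", "genes", "ncRNAs", "intergenic", "viruses"]
for step in build_cascade("GKG-13_clip-pipe.fasta", indexes):
    print(" ".join(step["args"]), ">", step["stdout"])
```

Each step could then be executed with subprocess.run(step["args"], stdout=open(step["stdout"], "w"), check=True), provided Bowtie and the indexes are installed.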
samtools
http://samtools.sourceforge.net/
Preparation of a BAM file and its associated index
$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -M 1 --best -p 2 -S | samtools view -bS -o GKG-13_clip-pipe.fasta.bam -
$ samtools sort GKG-13_clip-pipe.fasta.bam GKG-13_clip-pipe.fasta.bam.sorted
$ samtools index GKG-13_clip-pipe.fasta.bam.sorted.bam
306K GKG-13_clip-pipe.fasta.bam.sorted.bam.bai
42M GKG-13_clip-pipe.fasta.bam.sorted.bam
80M GKG-13_clip-pipe.fasta.bam
Sam format, Bam format (for Genome Browsers)
• Sorted
• Indexed
• Compressed
~3 min
Upload of BAM file to a remote server (Amazon cloud)
Passing the URL to Ensembl (Gbrowse, modENCODE, etc.)
Read visualization in a Genome Browser
Naive and primed murine pluripotent stem cells have distinct miRNA signatures
ESC1 ESC2 EpiSC2 EpiSC1 EpiSC3
A. Jouneau (INRA Jouy en Josas), E. Heard (Institut Curie), C. Antoniewski (Institut Pasteur), M. Cohen-Tannoudji (Institut Pasteur)
miRNA profiling
Sequence reads (fasta format)
Bowtie Pre-miRNAs (miRBase)
Bowtie Output
Parsing
Read maps for all miRNAs Hit list for miR_5p and miR_3p
deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt
miR profiling / hit list aggregation
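Hit list aggregation amounts to merging the per-library hit lists into a single table with one row per miRNA and one column per library. The sketch below shows one possible way to do it. The function name aggregate_hit_lists, the miRNA names and the counts are made up for illustration, and the first column is named "gene" on the assumption that the table will later be read by a script expecting that column name.

```python
def aggregate_hit_lists(hit_lists):
    """Merge per-library hit lists into one table (a list of rows).

    hit_lists maps a library name to a {miRNA name: read count} dictionary.
    A miRNA that is absent from a library gets a count of 0.
    """
    libraries = sorted(hit_lists)
    genes = sorted(set(gene for hits in hit_lists.values() for gene in hits))
    table = [["gene"] + libraries]
    for gene in genes:
        table.append([gene] + [hit_lists[lib].get(gene, 0) for lib in libraries])
    return table

# Made-up names and counts, for illustration only
hit_lists = {
    "ESC1": {"mir-A_5p": 120, "mir-B_3p": 15},
    "EpiSC1": {"mir-A_5p": 30, "mir-C_5p": 8},
}
for row in aggregate_hit_lists(hit_lists):
    print("\t".join(str(cell) for cell in row))
```

Writing the rows as tab-separated text gives a file that can be loaded with read.delim in R.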
Differential Calling
Sequence reads (fasta format)
Bowtie Pre-miRNAs (miRBase)
Bowtie Output
Parsing
Read maps for all miRNAs Hit list for miR_5p and miR_3p
deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt
DESeq
Heatplus
(Bioconductor)
http://www.r-project.org/
touRism
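# Load the packages used below
library(DESeq)
library(gplots)        # provides heatmap.2
library(RColorBrewer)  # provides brewer.pal

# Read the hit table: a "gene" column followed by one count column per library
countsTable <- read.delim( "~/Documents/Pasteur_DEMO/mouse_hits.txt", header=TRUE, stringsAsFactors=TRUE )
head(countsTable)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[ , -1 ]
head(countsTable)
summary(countsTable)
plot(countsTable)
plot(log(countsTable,10))

# One condition label per library, in column order
conds <- c( "EPI", "EPI", "EPI", "ES", "ES" )
cds <- newCountDataSet( countsTable, conds )

# Normalization: one size factor per library
cds <- estimateSizeFactors( cds )
sizeFactors( cds )

# Sample-to-sample distances on variance-stabilized data
cds <- estimateDispersions( cds, method="pooled" )
vsd <- getVarianceStabilizedData( cds )
dists <- dist( t( vsd ) )
heatmap( as.matrix( dists ), symm=TRUE, margins=c(12,12), cexRow=1, cexCol=1 )

# Heatmap of the 100 rows with the highest variance across samples
SampleVar <- apply(vsd, 1, var)
vsd2 <- cbind(vsd, SampleVar)
vsd3 <- vsd2[order(vsd2[,6], decreasing=TRUE),]
head(vsd3)
vsd3 <- head(vsd3, 100)
vsd3 <- vsd3[,-6]
head(vsd3)
heatmap.2(vsd3, col=brewer.pal(11, "RdBu"), scale="none", trace="none", margins=c(3,45), cexRow=0.7, cexCol=1, dendrogram="column",
          density.info="none", keysize=0.7)

# Differential calling between the two conditions
cds <- estimateDispersions( cds, method="per-condition", sharingMode="fit-only" )
res <- nbinomTest( cds, "EPI", "ES" )
# Drop the rows with NA in column 8 (adjusted p-value), then sort by that column
resNA <- res[!is.na(res[,8]), ]
resNA[order(resNA[,8]), ]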
Load DESeq, gplots and RColorBrewer
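To give an idea of what the estimateSizeFactors step in the R session does, here is a minimal Python sketch of the median-of-ratios normalization as it is usually described for DESeq. It is an illustration written for this transcript, not the package code; the function name size_factors and the counts are made up.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors, one per library.

    counts is a list of rows (one per gene), each row holding one count per
    library. Genes with a zero count in any library are skipped because
    their geometric mean is zero.
    """
    n_libraries = len(counts[0])
    log_ratios = [[] for _ in range(n_libraries)]
    for row in counts:
        if min(row) <= 0:
            continue
        log_geo_mean = sum(math.log(count) for count in row) / n_libraries
        for lib, count in enumerate(row):
            log_ratios[lib].append(math.log(count) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]

# Made-up counts: the second library was sequenced twice as deeply
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],
]
factors = size_factors(counts)
print(factors)
# Dividing each count by the size factor of its library makes libraries comparable
normalized = [[count / factor for count, factor in zip(row, factors)] for row in counts]
print(normalized[0])
```

In this toy example the two size factors differ by a factor of 2, so the normalized counts of the first three genes become equal in both libraries.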
Deep Seq Data Analysis, Final Take Home Messages
Think about your deep seq replicates from the start
Keep a hand on your data, from the « fastq stage » onward
Keep a hand on the analysis because this is your project
Always keep an eye on « Normalization » and « Differential »
Don’t be afraid of bioinformatics, but don’t reinvent the wheel
It’s open source, open manual
It’s not magic, yes you can
It’s fun
You cannot escape, so take it easy.