macmanes evolution2014 trimming talk
DESCRIPTION
This is a talk I presented at the Evolution meeting in Raleigh, NC in June 2014. It describes the work to date establishing optimal trimming for mRNAseq data.
TRANSCRIPT
Optimal Trimming of mRNA sequence data
Matthew MacManes, University of New Hampshire
Twitter: @PeroMHC
Quality trimming of NGS data
• Universal practice
[Figure: probability of nucleotide error (y-axis, 0.0 to 0.4) by nucleotide position (x-axis, 0 to 150)]
Quality trimming of NGS data
[Figure: probability of nucleotide error (y-axis, 0.0 to 0.4) by nucleotide position (x-axis, 0 to 150); Phred=5]
Quality trimming of NGS data
[Figure: probability of nucleotide error (y-axis, 0.0 to 0.4) by nucleotide position (x-axis, 0 to 150); Phred=10]
Quality trimming of NGS data
[Figure: probability of nucleotide error (y-axis, 0.0 to 0.4) by nucleotide position (x-axis, 0 to 150); Phred=20]
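The thresholds on these slides relate to the y-axis through the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion follows; the function names are illustrative and do not come from the talk.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to a base-call error probability."""
    return -10 * math.log10(p)


# Error probabilities for the trimming thresholds used in the talk.
for q in (2, 5, 10, 20):
    print(f"Phred={q:>2}  P(error)={phred_to_error_prob(q):.3f}")
```

Under this definition, Phred=5 corresponds to an error probability of about 0.32, Phred=10 to 0.1, and Phred=20 to 0.01.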
Trimming Experiment
• 2 Illumina datasets, adapter trimmed.
• Subsampled to 10M, 20M, 50M, 75M, 100M PE reads.
• Trimmed at Phred 0, 2, 5, 10, 20
• Assembled using Trinity and SOAPdenovo-Trans
• Developed metrics for evaluating transcriptome assemblies.
MacManes, Frontiers in Genetics 2014
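To make the idea of trimming at a Phred threshold concrete, here is a minimal sketch that removes low-quality bases from the 3' end of a single read. It is an illustration only and is not the procedure used in the talk: real trimmers typically apply more elaborate rules (for example, sliding windows), and the function names and the Phred+33 encoding are assumptions.

```python
def decode_phred33(qual_string: str) -> list:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def trim_3prime(seq: str, quals: list, threshold: int) -> str:
    """Drop bases from the 3' end while their Phred score is below threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end]


# Hypothetical read whose last four bases have Phred scores 20, 10, 5, 2.
seq = "ACGTACGT"
quals = decode_phred33("IIII5+&#")

for threshold in (0, 2, 5, 10, 20):
    print(threshold, trim_3prime(seq, quals, threshold))
```

On this made-up read, thresholds 0 and 2 leave the read intact, while each higher threshold removes more bases from the 3' end.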
[Figure: number of nucleotide errors per Mb of assembly (y-axis, 1000 to 1800) by trimming level (No Trim, Phred=2, Phred=5, Phred=10, Phred=20); series: 10M, 20M, 50M, 75M, 100M]
Quality trimming reduces error
MacManes, Frontiers in Genetics 2014
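The y-axis metric normalizes error counts by assembly size so that assemblies of different lengths can be compared. A minimal sketch of that normalization follows; the function name and the example numbers are hypothetical and are not results from the talk.

```python
def errors_per_mb(n_errors: int, assembly_length_bp: int) -> float:
    """Nucleotide errors per megabase (1,000,000 bp) of assembled sequence."""
    if assembly_length_bp <= 0:
        raise ValueError("assembly length must be positive")
    return n_errors / (assembly_length_bp / 1_000_000)


# Hypothetical assembly: 60,000 errors across 40 Mb of assembled sequence.
print(errors_per_mb(60_000, 40_000_000))  # 1500.0
```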
[Figure, two panels: number of nucleotide errors per Mb of assembly by trimming level (No Trim, Phred=2, Phred=5, Phred=10, Phred=20). Panel 1: y-axis 4000 to 7000; series: SOAP10M, SOAP20M. Panel 2: y-axis 1000 to 1800; series: 10M, 20M, 50M, 75M, 100M]
Quality trimming reduces error
[Figure: percent difference in number of unique BLAST hits (y-axis, -5 to 1) by trimming level (No Trim, Phred=2, Phred=5, Phred=10, Phred=20); series: 10M, 20M, 50M, 75M, 100M]
Quality trimming reduces BLAST hits
MacManes, Frontiers in Genetics 2014
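The plotted quantity is a percent difference, so negative values mean fewer unique BLAST hits than the reference. A minimal sketch follows, assuming the reference is the untrimmed assembly at the same read depth (the slide does not state the baseline); the function name and counts are hypothetical.

```python
def percent_diff(value: float, baseline: float) -> float:
    """Percent difference of value relative to baseline (negative means a loss)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline


# Hypothetical counts of unique BLAST hits, not data from the talk.
untrimmed_hits = 10_000
trimmed_hits = 9_700
print(percent_diff(trimmed_hits, untrimmed_hits))  # -3.0
```

The same calculation would apply to the complete CDS counts shown on the later slides.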
[Figure, two panels: percent difference in number of unique BLAST hits by trimming level (No Trim, Phred=2, Phred=5, Phred=10, Phred=20). Panel 1: y-axis -5 to 1; series: 10M, 20M, 50M, 75M, 100M. Panel 2: y-axis -6 to 0; series: SOAP10M, SOAP20M]
Quality trimming reduces BLAST hits
[Figure: percent difference in number of complete CDS (y-axis, -15 to 0) by trimming level (No Trim, Phred=2, Phred=5, Phred=10, Phred=20); series: 10M, 20M, 50M, 75M, 100M]
Quality trimming reduces complete CDS
MacManes, Frontiers in Genetics 2014
[Figure, two panels: percent difference in number of complete CDS by trimming level (No Trim, Phred=2, Phred=5, Phred=10, Phred=20). Panel 1: y-axis -15 to 0; series: 10M, 20M, 50M, 75M, 100M. Panel 2: y-axis -15 to 0; series: SOAP10M, SOAP20M]
Quality trimming reduces complete CDS
Summary
• Trimming does reduce assembly error, but at the cost of content & contiguity.
• Proposed guidelines.
1. To maximize assembly content and contiguity ➠ Trim at Phred=0 or 2
2. If concerned about error ➠ Trim at Phred=5
3. In most cases, do not trim at Phred ≥ 10
MacManes, Frontiers in Genetics 2014
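The proposed guidelines can be written down as a small lookup, which may be handy when scripting a pipeline. This is only a sketch of the summary slide: the priority labels ("content", "error") and function names are hypothetical.

```python
# Thresholds from the summary slide, keyed by hypothetical priority labels.
GUIDELINES = {
    "content": [0, 2],  # maximize assembly content and contiguity
    "error": [5],       # concerned about assembly error
}


def suggested_thresholds(priority: str) -> list:
    """Return the Phred trimming thresholds suggested for a given priority."""
    if priority not in GUIDELINES:
        raise ValueError(f"unknown priority: {priority!r}")
    return GUIDELINES[priority]


def is_discouraged(threshold: int) -> bool:
    """True for thresholds the summary advises against (Phred >= 10)."""
    return threshold >= 10


print(suggested_thresholds("content"), is_discouraged(20))
```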