reference’free(analysis(of( genomes(with(k’mers( · preqc(’(repeats(figure 2: the estimated...

28
Referencefree analysis of genomes with kmers

Upload: others

Post on 02-Nov-2019

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Reference-­‐free  analysis  of  genomes  with  k-­‐mers  

Page 2: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51
Page 3: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51
Page 4: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51
Page 5: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51
Page 6: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Preqc  -­‐  repeats  

Figure 2: The estimated repeat branch rate for each genome as a function of

k. The yeast data stops at k=51 as the number of repeat branches found falls

below the minimum threshold for emitting an estimate.

20

h8ps://github.com/jts/sga/wiki/Preqc  

Page 7: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Preqc  –  GC  bias  /  coverage  

(a)

(b)

(c)

Figure 4: Two-dimensional histogram of (GC, count) pairs for 31-mers for the

fish (a), yeast (b) and oyster (c) data sets

22

h8ps://github.com/jts/sga/wiki/Preqc  

Page 8: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Preqc  –  predicted  conDg  lengths  

Figure 5: The N50 length of simulated contigs for k from 21 to 91, in increments

of 5

23

h8ps://github.com/jts/sga/wiki/Preqc  

Page 9: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Preqc  –  esDmated  genome  size  

Genome Reference-Free Estimate Published size

yeast 13 Mbp 12 Mbp [30]

oyster 537 Mbp 545-637 Mbp [9]

fish 922 Mbp 1000 Mbp [2]

bird 1094 Mbp 1200 Mbp [2]

snake 1408 Mbp 1600 Mbp [2]

human 2913 Mbp 3102 Mbp (GRC37)

Table 1: The genome size estimates from our method compared to previously

published estimates

29

h8ps://github.com/jts/sga/wiki/Preqc  

Page 10: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51
Page 11: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Khmer-­‐recipes  

h8p://khmer-­‐recipes.readthedocs.org/en/latest/  

Page 12: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Reference  &  quality-­‐score  independent  approaches  (k-­‐mers)  

Zhang  et  al.,  h8ps://peerj.com/preprints/890/  

Page 13: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Mouse  mRNAseq  

Zhang  et  al.,  h8ps://peerj.com/preprints/890/  

Page 14: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

K-­‐mer  abundance  trimming  removes  errors  effecDvely!  

Zhang  et  al.  PLoS  One,  2014  

Page 15: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

CTB  research  -­‐  diginorm  

h8p://arxiv.org/abs/1203.4802  

Page 16: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Approach:  Digital  normalizaDon  (a  computaDonal  version  of  library  normalizaDon)  

Species A

Species B

Ratio 10:1Unnecessary data

81%

Suppose  you  have  a  diluDon  factor  of  A  (10)  to  B(1).    To  get  10x  of  B  you  need  to  get  100x  of  A!    

Overkill!!    

This  100x  will  consume  disk  space  and,  because  of  

errors,  memory.    

We  can  discard  it  for  you…  

Page 17: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

True sequence (unknown)

Reads(randomly sequenced)

Digital  normalizaDon  

Page 18: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

True sequence (unknown)

Reads(randomly sequenced)

X

Digital  normalizaDon  

Page 19: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

True sequence (unknown)

Reads(randomly sequenced)

XX

XX

XX

XX

X

X

X

Digital  normalizaDon  

Page 20: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

True sequence (unknown)

Reads(randomly sequenced)

XX

XX

XX

XX

X

X

X

Digital  normalizaDon  

Page 21: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

True sequence (unknown)

Reads(randomly sequenced)

XX

XX

XX

XX

X

If next read is from a highcoverage region - discard

X

X

Digital  normalizaDon  

Page 22: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Digital  normalizaDon  True sequence (unknown)

Reads(randomly sequenced)

XX

XX

XX

XX

X

XX

X

XX

XX

X

X

XX

X

XX

XRedundant reads

(not needed for assembly)

Page 23: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Digital  normalizaDon  approach  A  digital  analog  to  cDNA  library  normalizaDon,  

diginorm:    

•  Is  single  pass:  looks  at  each  read  only  once;  

•  Does  not  “collect”  the  majority  of  errors;  

•  Keeps  all  low-­‐coverage  reads;  

•  Smooths  out  coverage  of  regions.  

Page 24: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Coverage  before  digital  normalizaDon:  

(MD  amplified)  

Page 25: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Coverage  aeer  digital  normalizaDon:  

Normalizes  coverage    Discards  redundancy    Eliminates  majority  of  errors    Scales  assembly  dramaDcally.    Assembly  is  98%  idenDcal.  

Page 26: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Digital  normalizaDon  approach  A  digital  analog  to  cDNA  library  normalizaDon,  diginorm  is  a  read  prefiltering  approach  that:  

 •  Is  single  pass:  looks  at  each  read  only  once;  

•  Does  not  “collect”  the  majority  of  errors;  

•  Keeps  all  low-­‐coverage  reads;  

•  Smooths  out  coverage  of  regions.  

Page 27: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Con(g  assembly  is  significantly  more  efficient  and  now  scales  with  underlying  genome  size  

•  Transcriptomes,  microbial  genomes  incl  MDA,  and  most  metagenomes  can  be  assembled  in  under  50  GB  of  RAM,  with  idenDcal  or  improved  results.  

Page 28: Reference’free(analysis(of( genomes(with(k’mers( · Preqc(’(repeats(Figure 2: The estimated repeat branch rate for each genome as a function of k. The yeast data stops at k=51

Digital  normalizaDon  retains  informaDon,  while  discarding  data  and  errors