SeqMap: mapping massive amount of oligonucleotides to the genome

Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396

The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing

Nathan Clement et al. Bioinformatics (2010) 26: 38-45

Presented by: Xia Li

Short-read mapping software

Software     Technique                                        Reference
GNUMAP       Hashing refs + base quality + repeated regions   Clement et al., 2010
Novoalign    Hashing refs                                     Novocraft, unpublished
SOAP         Hashing refs                                     Li et al., 2008
SeqMap       Hashing reads                                    Jiang et al., 2008
RMAP         Hashing reads + read quality                     Smith et al., 2008
Eland        Hashing reads                                    Cox, unpublished
Bowtie       BWT                                              Langmead et al., 2009
Slider       Lexicographic sorting + base quality             Malhis et al., 2009

SeqMap

• Motivation

– Hashing the genome usually needs a large amount of memory (e.g. SOAP needs 14 GB of memory when mapping to the human genome)

– Allow more substitutions and insertions/deletions

SeqMap

• Pigeonhole principle

– Spaced seed alignment

– ELAND, SOAP, RMAP

• Hash reads

• Insertion/deletion: 2/4 combinations with 1/2 shifted one nucleotide to its left or right

[Figure labels: Short read; Split into 4 parts; All combinations of 2/4 parts; Short read lookup table (indexed by 2 parts); Reference genome. Image credit: J. Ruan]
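The pigeonhole idea on this slide can be sketched as follows: if a read is split into 4 parts and at most 2 substitutions are allowed, at least 2 parts must match exactly, so indexing every 2-of-4 combination of parts finds all candidates. This is a minimal illustration, not SeqMap's implementation. The function names (`split_parts`, `build_read_index`, `map_reads`) are hypothetical, all reads are assumed to have the same length, and only substitutions are handled (the shifted-part trick for insertions/deletions is left out).

```python
from collections import defaultdict
from itertools import combinations


def split_parts(seq, n_parts=4):
    """Split seq into n_parts contiguous pieces; return (offset, piece) pairs."""
    size = len(seq) // n_parts
    bounds = [i * size for i in range(n_parts)] + [len(seq)]
    return [(bounds[i], seq[bounds[i]:bounds[i + 1]]) for i in range(n_parts)]


def part_keys(seq, n_parts=4, n_keep=2):
    """Yield one hash key per choice of n_keep of the n_parts pieces."""
    parts = split_parts(seq, n_parts)
    for combo in combinations(range(n_parts), n_keep):
        yield tuple(parts[i] for i in combo)


def build_read_index(reads):
    """Hash the reads (not the genome): key -> list of read ids."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for key in part_keys(read):
            index[key].append(read_id)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(genome, reads, max_mismatches=2):
    """Scan the genome once and report (read_id, position) hits."""
    read_len = len(reads[0])  # assumption: all reads share one length
    index = build_read_index(reads)
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for key in part_keys(window):
            for read_id in index.get(key, ()):
                # Seed hit: verify the full read against this window.
                if hamming(window, reads[read_id]) <= max_mismatches:
                    hits.add((read_id, pos))
    return sorted(hits)
```

Because the reads are hashed rather than the genome, the size of the lookup table grows with the number of reads, and the genome is only streamed through once.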

Experiment & Result

• Deal with more substitutions and insertion/deletion

– Test data: randomly generate a DNA sequence of length 1 Mb, then add 100 Kb of random substitutions, N's, and insertions/deletions
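A test set of this kind could be produced along the following lines. This is only a sketch under assumptions: the slide does not say how the edits were distributed, so the equal mix of the four edit types, the function names (`random_dna`, `mutate`), and the fixed seed are all illustrative choices, and the sizes in the usage note are scaled down.

```python
import random


def random_dna(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, n_edits, rng):
    """Apply n_edits random edits: substitution, N, insertion, or deletion."""
    out = list(seq)
    for _ in range(n_edits):
        pos = rng.randrange(len(out))
        kind = rng.choice(["sub", "N", "ins", "del"])  # assumed equal mix
        if kind == "sub":
            out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
        elif kind == "N":
            out[pos] = "N"
        elif kind == "ins":
            out.insert(pos, rng.choice("ACGT"))
        elif len(out) > 1:  # deletion, never emptying the sequence
            del out[pos]
    return "".join(out)
```

For example, `mutate(random_dna(10_000, rng), 1_000, rng)` with `rng = random.Random(0)` keeps the 10:1 ratio of sequence length to edits described above at a smaller scale.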

GNUMAP

• Motivation

– Base uncertainty

• Such as nearly equal or low probabilities for A, C, G or T

• Filter low-quality reads [RMAP] -> discard up to half of the reads (Harismendy et al., 2009)

– Repeated regions in the genome

• Discard them -> loss of up to half of the data (Harismendy et al., 2009)

• Record one -> unequal mapping to some of the repeat regions

• Record all -> each location having 3 times the correct score

GNUMAP

• Flow-chart

[Figure labels: Read from sequencer (GGGTACAACCAT); Probabilistic Needleman-Wunsch; Alignment score. Sequences shown: ACTGAACCATACGGGTACTGAACCATGAA, AACCAT, GGGTACAACCATTAC]

Read is added to both repeat regions proportionally to their match quality, weighted by its number of occurrences in the genome

Slide credit: N. Clement
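The proportional assignment described above can be illustrated with a small sketch. It is not GNUMAP's implementation: an ungapped per-base likelihood stands in for the probabilistic Needleman-Wunsch score, the occurrence-count weighting is omitted because the slide does not spell out its formula, and the names (`uniform_confidence`, `match_likelihood`, `allocate_read`) and the 0.97 call probability are assumptions made for illustration.

```python
def uniform_confidence(seq, p_called=0.97):
    """Turn a called read into per-position base probabilities (assumed model)."""
    p_other = (1.0 - p_called) / 3.0
    return [{b: (p_called if b == base else p_other) for b in "ACGT"}
            for base in seq]


def match_likelihood(read_probs, ref_segment):
    """Product over positions of P(read base equals the reference base)."""
    likelihood = 1.0
    for probs, ref_base in zip(read_probs, ref_segment):
        likelihood *= probs.get(ref_base, 0.0)
    return likelihood


def allocate_read(read_probs, genome, candidate_positions):
    """Share one read among candidate positions in proportion to match quality."""
    length = len(read_probs)
    scores = {pos: match_likelihood(read_probs, genome[pos:pos + length])
              for pos in candidate_positions}
    total = sum(scores.values())
    if total == 0:
        return {pos: 0.0 for pos in scores}
    return {pos: score / total for pos, score in scores.items()}
```

With the genome string from the slide, `AACCAT` occurs at offsets 4 and 20, so a confidently called `AACCAT` read is split evenly between the two copies instead of being discarded or counted in full at each location.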

Experiment & Result

Comments

• SeqMap

– Pros: handles more substitutions and insertions/deletions

– Cons: memory consuming, not fast

• GNUMAP

– Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)

– Cons: memory consuming, slow, more noise
