presented by: xia li

SeqMap: mapping massive amount of oligonucleotides to the genome Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396 The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next- generation sequencing Nathan Clement et al. Bioinformatics (2010) 26: 38-45 Presented by: Xia Li

Upload: alaina

Post on 20-Feb-2016


Page 1: Presented by: Xia Li

SeqMap: mapping massive amount of oligonucleotides to the genome

Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396

The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing

Nathan Clement et al. Bioinformatics (2010) 26: 38-45

Presented by: Xia Li

Page 2

Short-read mapping software

Software    Technique                                         Reference
GNUMAP      Hashing refs + base quality + repeated regions    Clement et al., 2010
Novoalign   Hashing refs                                      Novocraft, unpublished
SOAP        Hashing refs                                      Li et al., 2008
SeqMap      Hashing reads                                     Jiang et al., 2008
RMAP        Hashing reads + read quality                      Smith et al., 2008
Eland       Hashing reads                                     Cox, unpublished
Bowtie      BWT                                               Langmead et al., 2009
Slider      Lexicographic sorting + base quality              Malhis et al., 2009

Page 3

SeqMap

• Motivation
  – Hashing the genome usually needs a large amount of memory (e.g. SOAP needs 14 GB of memory when mapping to the human genome)
  – Allow more substitutions and insertions/deletions

Page 4

SeqMap

• Pigeonhole principle
  – Spaced seed alignment
  – ELAND, SOAP, RMAP
• Hash reads
• Insertion/deletion: 2/4 combinations with 1/2 shifted one nucleotide to its left or right

[Figure labels: Short read; Split into 4 parts; All combinations of 2/4 parts; Short read look-up table (indexed by 2 parts); Reference genome. Image credit: J. Ruan]
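The pigeonhole idea on this slide can be sketched in a few lines of Python. If a read is split into 4 parts and at most 2 substitutions are allowed, at least 2 of the 4 parts must match the genome exactly, so every read is hashed under all 6 combinations of 2 parts and the genome is scanned against that table. This is only an illustration under stated assumptions, not SeqMap's actual code: the names `part_bounds`, `keys_for` and `map_reads` are hypothetical, all reads are assumed to have the same length, and the shifted combinations used for insertions/deletions are left out.

```python
from collections import defaultdict
from itertools import combinations


def part_bounds(read_len, n_parts=4):
    """(start, end) of n_parts contiguous parts; the last part takes the remainder."""
    step = read_len // n_parts
    bounds = [(i * step, (i + 1) * step) for i in range(n_parts)]
    bounds[-1] = (bounds[-1][0], read_len)
    return bounds


def keys_for(seq, bounds, combos):
    """Yield one hash key per combination of parts: (part indices, part strings)."""
    for combo in combos:
        yield combo, tuple(seq[bounds[i][0]:bounds[i][1]] for i in combo)


def map_reads(genome, reads, max_mismatches=2, n_parts=4):
    """Return sorted (read_id, genome_position) pairs with <= max_mismatches substitutions."""
    read_len = len(reads[0])  # assumption: all reads share one length
    bounds = part_bounds(read_len, n_parts)
    # Pigeonhole: with k mismatches, at least n_parts - k parts are error-free.
    combos = list(combinations(range(n_parts), n_parts - max_mismatches))

    # Hash the reads (not the genome), indexed by every combination of parts.
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for key in keys_for(read, bounds, combos):
            index[key].append(rid)

    # Scan the genome once; verify each candidate by counting mismatches.
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for key in keys_for(window, bounds, combos):
            for rid in index.get(key, ()):
                if sum(a != b for a, b in zip(reads[rid], window)) <= max_mismatches:
                    hits.add((rid, pos))
    return sorted(hits)


genome = "TTGACCAGTAGGCATCGAACTGGATCCA"
reads = ["CAGTAGGCATCG",   # exact copy of genome[5:17]
         "CTGTAGGCATAG"]   # same region with two substitutions
print(map_reads(genome, reads))
```

Because the table is built over the reads, its size grows with the number of reads rather than with the genome, which is the trade-off the "Hash reads" bullet points at.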

Page 5

Experiment & Result

Page 6

Experiment & Result

• Deal with more substitutions and insertion/deletion

Randomly generate a 1 Mb DNA sequence, then add 100 Kb of random substitutions, N's and insertions/deletions
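A test sequence of this kind could be produced roughly as follows. The slide does not say how the edits were distributed, so everything here is an assumption: the hypothetical helpers `random_dna` and `mutate`, the uniform choice between edit types, the fixed seed, and the reading of "100 Kb" as 100,000 edited positions.

```python
import random

BASES = "ACGT"


def random_dna(length, rng):
    """Uniform random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq, n_edits, rng, kinds=("sub", "N", "ins", "del")):
    """Apply n_edits random edits at distinct positions; return (new_seq, edits)."""
    positions = rng.sample(range(len(seq)), n_edits)
    edits = {pos: rng.choice(kinds) for pos in positions}
    out = []
    for pos, base in enumerate(seq):
        kind = edits.get(pos)
        if kind == "sub":      # substitute with a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "N":      # unknown base call
            out.append("N")
        elif kind == "ins":    # keep the base, insert a random one after it
            out.append(base)
            out.append(rng.choice(BASES))
        elif kind == "del":    # drop the base
            continue
        else:                  # position not edited
            out.append(base)
    return "".join(out), edits


rng = random.Random(0)               # assumed seed, for reproducibility only
reference = random_dna(1_000_000, rng)
mutated, edits = mutate(reference, 100_000, rng)
```

Keeping the `edits` dictionary makes it possible to check afterwards which simulated differences a mapper tolerated.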

Page 7

GNUMAP

• Motivation
  – Base uncertainty
    • Such as nearly equal or low probabilities for A, C, G or T
    • Filter low-quality reads [RMAP] -> discard up to half of the reads (Harismendy et al., 2009)
  – Repeated regions in the genome
    • Discard them -> loss of up to half of the data (Harismendy et al., 2009)
    • Record one -> unequal mapping to some of the repeat regions
    • Record all -> each location having 3 times the correct score

Page 8

GNUMAP

• Flow-chart

Page 9

Probabilistic Needleman-Wunsch
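The slide's figure is not in the transcript, so the following is only a rough sketch of what a probabilistic Needleman-Wunsch can look like, not GNUMAP's exact recurrence. It assumes each read position is given as a dictionary of base probabilities (derived from base qualities) and that the substitution score is the expected match/mismatch score under those probabilities. The function names and the scoring values (`match`, `mismatch`, `gap`) are hypothetical.

```python
def one_hot(seq):
    """Represent a hard-called read as per-position base probabilities."""
    return [{base: 1.0} for base in seq]


def prob_nw_score(read_probs, ref, match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment score of a probabilistic read against a reference string."""
    n, m = len(read_probs), len(ref)
    prev = [j * gap for j in range(m + 1)]      # row 0: leading gaps in the read
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m             # column 0: leading gaps in the reference
        probs = read_probs[i - 1]
        for j in range(1, m + 1):
            p = probs.get(ref[j - 1], 0.0)      # P(read base == reference base)
            sub = p * match + (1.0 - p) * mismatch
            cur[j] = max(prev[j - 1] + sub,     # align read base to reference base
                         prev[j] + gap,         # gap in the reference
                         cur[j - 1] + gap)      # gap in the read
        prev = cur
    return prev[m]


# A read whose third base is equally likely to be G or T.
uncertain = [{"A": 1.0}, {"C": 1.0}, {"G": 0.5, "T": 0.5}, {"T": 1.0}]
print(prob_nw_score(uncertain, "ACGT"), prob_nw_score(uncertain, "ACTT"))
print(prob_nw_score(one_hot("ACGT"), "ACTT"))
```

With the uncertain base, both candidate references score the same, whereas a hard base call would penalise one of them fully; this is the kind of information a quality filter throws away.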

Page 10

Alignment Score

ACTGAACCATACGGGTACTGAACCATGAA

AACCAT

GGGTACAACCATTAC

Read from sequencer

GGGTACAACCAT

Read is added to both repeat regions proportionally to their match quality, weighted by its # of occurrences in the genome
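As a small worked example using the sequences on this slide: the read AACCAT occurs twice in the genome string, so instead of counting it once at one site, or fully at both, each site receives a share proportional to its alignment score. The helper names `find_all` and `split_read` are hypothetical, the scores are made-up placeholders, and the sketch simply normalizes by the total score; the exact weighting used by GNUMAP is not spelled out in the transcript.

```python
def find_all(genome, read):
    """Start positions of every exact (possibly overlapping) occurrence of read."""
    positions, start = [], genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def split_read(location_scores):
    """Split one read across locations in proportion to their scores (shares sum to 1)."""
    total = sum(location_scores.values())
    if total <= 0:
        raise ValueError("at least one location needs a positive score")
    return {loc: score / total for loc, score in location_scores.items()}


genome = "ACTGAACCATACGGGTACTGAACCATGAA"
read = "AACCAT"

# Equal match quality at both repeat copies: each copy gets half of the read.
equal = split_read({pos: 1.0 for pos in find_all(genome, read)})

# Placeholder scores where the first copy matches better than the second.
unequal = split_read({4: 3.0, 20: 1.0})
print(equal, unequal)
```

Because the shares of a read always sum to one, a read that maps to several repeat copies does not inflate the total signal the way "record all" does.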

Slide credit: N. Clement

Page 11

Experiment & Result

Page 12

Comments

• SeqMap
  – Pros: deals with more substitutions/insertions/deletions
  – Cons: memory consuming, not fast
• GNUMAP
  – Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)
  – Cons: memory consuming, slow, more noise