Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler. Ben Langmead. Based on work by Juha Kärkkäinen.

Post on 30-Dec-2015

Page 1: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Ben Langmead

Based on work by Juha Kärkkäinen

Page 2: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Motivation

• Burrows-Wheeler Transformation (BWT) of a large text allows:
– Fast exact matching
– Compact representation (compared to suffix tree/array)
– More readily compressible (basis of bzip)

• The FM Index exploits an indexed and compressed BWT to allow:
– Exact matching in time linear in the size of the pattern
– Memory footprint as much as 50% smaller than original string

• FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk

Page 3: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Background

• Recall that BWT is derived from the Burrows-Wheeler matrix, which is related to the Suffix array

[Figure: the text acaacg$ with its suffix array and Burrows-Wheeler matrix; the last column of the matrix is the BWT, gc$aaac]
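The relationship between the suffix array, the Burrows-Wheeler matrix and the BWT can be checked on the slide's example text. The following is a minimal sketch using naive sorting; the function names are illustrative and not taken from the talk, and the text is assumed to end in a `$` sentinel that sorts before every other character.

```python
def suffix_array(text):
    """Naive suffix array: start offsets ordered by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bw_matrix(text):
    """Burrows-Wheeler matrix: every rotation of the text, in sorted order."""
    return sorted(text[i:] + text[:i] for i in range(len(text)))


def bwt_from_matrix(text):
    """BWT read off as the last column of the Burrows-Wheeler matrix."""
    return "".join(row[-1] for row in bw_matrix(text))


def bwt_from_sa(text):
    """BWT read off the suffix array: the character preceding each suffix.

    For offset 0 the index -1 wraps around to the final '$' sentinel.
    """
    return "".join(text[i - 1] for i in suffix_array(text))


text = "acaacg$"
print(suffix_array(text))     # [6, 2, 0, 3, 1, 4, 5]
print(bwt_from_matrix(text))  # gc$aaac
print(bwt_from_sa(text))      # gc$aaac
```

Both routes give the same string because, with a unique smallest sentinel, sorting rotations and sorting suffixes produce the same row order.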

Page 4: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Problem

• Memory footprint of building and storing suffix array is much larger than the BWT itself
– Human genome: SA: ~12 GB, BWT: ~0.8 GB
– Attempt to build BWT over whole human genome on a 32 GB server exhausts memory and crashes (I tried)
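One plausible way to arrive at figures of this size is sketched below. The three constants are assumptions, not values stated in the slides: a genome of roughly 3.1 billion bases, 4-byte suffix array entries, and a 2-bit encoding of each base in the BWT.

```python
# All three constants are assumptions made for this back-of-envelope estimate.
GENOME_BASES = 3_100_000_000  # approximate length of the human genome
SA_BYTES_PER_ENTRY = 4        # one 32-bit offset per suffix
BWT_BITS_PER_BASE = 2         # A, C, G, T packed into 2 bits each


def sa_gigabytes(num_bases):
    """Size of a suffix array with one fixed-width entry per base."""
    return num_bases * SA_BYTES_PER_ENTRY / 1e9


def bwt_gigabytes(num_bases):
    """Size of a bit-packed BWT of the same text."""
    return num_bases * BWT_BITS_PER_BASE / 8 / 1e9


print(sa_gigabytes(GENOME_BASES))   # 12.4
print(bwt_gigabytes(GENOME_BASES))  # 0.775
```

Under these assumptions the suffix array is 16 times larger than the packed BWT, before counting any working memory used by the sort itself.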

Page 5: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting”
– Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007

• Observation:
– BWT[i] depends only on SA[i], not on any other element of SA

• Corollary:
– No need to keep all of SA in memory at once!

• Solution:
– Build SA and BWT a small “chunk” or “block” at a time
– Greatly reduces the memory overhead
• By something like a factor of B, where B = # of blocks
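The observation can be written as BWT[i] = T[SA[i] - 1], with the sentinel used when SA[i] = 0. The sketch below illustrates the corollary: each BWT segment is produced from its own block of SA entries and nothing else. The function names are illustrative, and the full suffix array is passed in only to keep the example short; in a real blockwise build each block would be produced and consumed in turn.

```python
def bwt_char(text, sa_entry):
    """BWT character for one suffix array entry.

    Uses only this entry and the text; index -1 wraps to the '$' sentinel
    when sa_entry is 0.
    """
    return text[sa_entry - 1]


def bwt_in_blocks(text, sa, block_size):
    """Yield BWT segments, each computed from one block of SA entries."""
    for start in range(0, len(sa), block_size):
        block = sa[start:start + block_size]
        yield "".join(bwt_char(text, entry) for entry in block)


text = "acaacg$"
sa = [6, 2, 0, 3, 1, 4, 5]  # suffix array of the text above
print(list(bwt_in_blocks(text, sa, 3)))  # ['gc$', 'aaa', 'c']
```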

Page 6: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• Typical suffix sort:

Page 7: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• Blockwise suffix sort:

Page 8: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• Calculate and sort a random sample of the suffixes
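A minimal sketch of this step, assuming the sample is drawn uniformly without replacement and sorted by naive string comparison. The function name, the `seed` parameter and the use of Python's `random` module are choices made for the example, not details from the talk.

```python
import random


def sample_suffixes(text, num_samples, seed=0):
    """Pick num_samples distinct suffix offsets at random, then sort them
    by the suffixes they begin (naive comparison of text[i:])."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(text)), num_samples)
    return sorted(positions, key=lambda i: text[i:])


text = "acaacg$"
sample = sample_suffixes(text, 3)
print([text[i:] for i in sample])  # three suffixes in lexicographic order
```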

Page 9: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• Samples are used as “bookends” for “buckets”

[Figure: the sorted samples mark the boundaries between buckets B1, B2, B3, B4]
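With k sorted samples there are k + 1 buckets, and a suffix's bucket can be found by binary search against the samples. The sketch below assumes a suffix equal to a sample goes into the bucket to that sample's right; that tie-breaking rule and the function name are choices made for the example.

```python
from bisect import bisect_right


def bucket_of(text, splitters, i):
    """Return the 0-based bucket index of the suffix starting at offset i.

    splitters is a list of suffix offsets already in lexicographic order.
    A suffix equal to a splitter lands in the bucket to that splitter's right.
    """
    splitter_suffixes = [text[s:] for s in splitters]
    return bisect_right(splitter_suffixes, text[i:])


text = "acaacg$"
splitters = [2, 3, 4]  # suffixes aacg$, acg$, cg$ (already sorted)
print([bucket_of(text, splitters, i) for i in range(len(text))])
# [1, 2, 1, 2, 3, 3, 0]
```

Three samples give four buckets here: `$` alone in the first, then {aacg$, acaacg$}, {acg$, caacg$} and {cg$, g$}.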

Page 10: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• In B linear-time passes over the text (B = # buckets), sort all suffixes into buckets, one bucket at a time, then sort the bucket

[Figure: Pass 1 over buckets B1, B2, B3, B4]

Page 11: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• After a bucket has been sorted and turned into a BWT segment, it is discarded

[Figure: Pass B over buckets B1, B2, B3, B4]
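Putting the passes together, the whole procedure can be sketched as a generator that holds one bucket at a time. This is an illustration under simplifying assumptions, not the implementation from the talk or the paper: suffixes are compared as plain Python strings (so each pass is a single scan over the offsets but not linear time), the text must end in a unique smallest sentinel, and the function name and `seed` parameter are made up for the example.

```python
import random
from bisect import bisect_right


def blockwise_bwt(text, num_buckets, seed=0):
    """Yield the BWT of text as one segment per bucket.

    Only the current bucket's suffix offsets are held at any time.
    """
    n = len(text)
    rng = random.Random(seed)
    # Random sample of suffixes, sorted; these are the bucket boundaries.
    sample = sorted(rng.sample(range(n), num_buckets - 1),
                    key=lambda i: text[i:])
    splitters = [text[s:] for s in sample]

    for b in range(num_buckets):
        # One scan over the text: keep only the suffixes that fall in bucket b.
        bucket = [i for i in range(n)
                  if bisect_right(splitters, text[i:]) == b]
        # Sort just this bucket.
        bucket.sort(key=lambda i: text[i:])
        # Turn the sorted bucket into a BWT segment; the bucket list is then
        # dropped when the next iteration rebinds the name.
        yield "".join(text[i - 1] for i in bucket)


print("".join(blockwise_bwt("acaacg$", 4)))  # gc$aaac
```

Because the buckets partition the suffixes into consecutive lexicographic ranges, concatenating the segments in bucket order gives the same string as sorting all suffixes at once.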

Page 12: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Solution

• Good time bounds in the presence of long repeats require use of a difference cover sample
– Acts like an oracle that determines relative lexicographical order of two suffixes that share a prefix of some length v
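The combinatorial core of a difference cover can be shown in a few lines. A set D is a difference cover modulo v if every residue 0..v-1 is a difference of two elements of D; it follows that for any two offsets i and j there is a shift d < v that moves both onto positions whose residues lie in D. The sketch below checks the property and finds that shift by brute force. The function names are illustrative, and building and ranking the actual sample of suffixes is not shown.

```python
def is_difference_cover(cover, v):
    """True if every residue modulo v is a difference of two cover elements."""
    diffs = {(a - b) % v for a in cover for b in cover}
    return diffs == set(range(v))


def shift_into_cover(i, j, cover, v):
    """Smallest d in [0, v) such that (i + d) % v and (j + d) % v are both
    in the cover. Such a d always exists when cover is a difference cover."""
    members = set(cover)
    for d in range(v):
        if (i + d) % v in members and (j + d) % v in members:
            return d
    raise ValueError("cover is not a difference cover modulo v")


cover, v = [1, 2, 4], 7
print(is_difference_cover(cover, v))    # True
print(shift_into_cover(0, 1, cover, v)) # 1
```

If the suffixes at i and j agree on their first v characters, their order is the order of the suffixes at i + d and j + d, both of which start at sampled positions whose relative ranks are already known.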

Page 13: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Project Goals

• Basic goal:
– Write a correct, usable library implementing blockwise SA sort and BWT building
– Characterize performance and time/space tradeoffs

• Stretch goals:
– Fine-tune for performance and memory usage
– Implement difference cover sample
• Question: is this necessary for good performance on real-life inputs?

Page 14: Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Concluding Remarks

• BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way