blockwise suffix sorting for space-efficient burrows-wheeler
DESCRIPTION
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler. Ben Langmead Based on work by Juha K ä rkk ä inen. Motivation. Burrows-Wheeler Transformation (BWT) of a large text allows: Fast exact matching Compact representation (compared to suffix tree/array) - PowerPoint PPT PresentationTRANSCRIPT
Blockwise Suffix Sorting forSpace-Efficient Burrows-
WheelerBen Langmead
Based on work by Juha Kärkkäinen
Motivation• Burrows-Wheeler Transformation (BWT) of a large text allows:
– Fast exact matching– Compact representation (compared to suffix tree/array)– More readily compressible (basis of bzip)
• The FM Index exploits an indexed and compressed BWT to allow:– Exact matching in time linear in the size of the pattern– Memory footprint as much as 50% smaller than original
string
• FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk
Background
• Recall that BWT is derived from the Burrows-Wheeler matrix, which is related to the Suffix array
a c a a c g $ g c $ a a a c
Suffix array BurrowsWheelerMatrix
Last column
BWTText
Problem
• Memory footprint of building and storing suffix array is much larger than the BWT itself– Human genome: SA: ~12 GB, BWT: ~0.8 GB– Attempt to build BWT over whole human genome on a 32
GB server exhausts memory and crashes (I tried)
Solution• Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix
Sorting”– Theoretical Computer Science, 387 (3), pp. 249-257, Sept.
2007
• Observation:– BWT[i] depends only on SA[i], not on any other element of
SA
• Corollary:– No need to keep all of SA in memory at once!
• Solution:– Build SA and BWT a small “chunk” or “block” at a time– Greatly reduces the memory overhead
• By something like a factor of B, where B = # of blocks
Solution
• Typical suffix sort:
Solution
• Blockwise suffix sort:
Solution
• Calculate and sort a random sample of the suffixes
Solution
• Samples are used as “bookends” for “buckets”
? $
B1 B2 B3 B4
Solution
• In B linear-time passes over the text (B = # buckets), sort all suffixes into buckets, one bucket at a time, then sort the bucket
$
B1 B2 B3 B4Pass 1
Solution
• After a bucket has been sorted and turned into a BWT segment, it is discarded
Pass B B1 B2 B3 B4
$
Solution
• Good time bounds in the presence of long repeats require use of a difference cover sample– Acts like an oracle that determines relative
lexicographical order of two suffixes that share a prefix of some length v
Project Goals
• Basic goal:– Write a correct, usable library implementing blockwise
SA sort and BWT building– Characterize performance and time/space tradeoffs
• Stretch goals:– Fine-tune for performance and memory usage– Implement difference cover sample
• Question: is this necessary for good performance on real-life inputs?
Concluding Remarks
• BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way