blockwise suffix sorting for space-efficient burrows-wheeler

Blockwise Suffix Sorting forSpace-Efficient Burrows-

WheelerBen Langmead

Based on work by Juha Kärkkäinen

Motivation• Burrows-Wheeler Transformation (BWT) of a large text allows:

– Fast exact matching– Compact representation (compared to suffix tree/array)– More readily compressible (basis of bzip)

• The FM Index exploits an indexed and compressed BWT to allow:– Exact matching in time linear in the size of the pattern– Memory footprint as much as 50% smaller than original

string

• FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk

Background

• Recall that BWT is derived from the Burrows-Wheeler matrix, which is related to the Suffix array

a c a a c g $ g c $ a a a c

Suffix array BurrowsWheelerMatrix

Last column

BWTText

Problem

• Memory footprint of building and storing suffix array is much larger than the BWT itself– Human genome: SA: ~12 GB, BWT: ~0.8 GB– Attempt to build BWT over whole human genome on a 32

GB server exhausts memory and crashes (I tried)

Solution• Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix

Sorting”– Theoretical Computer Science, 387 (3), pp. 249-257, Sept.

2007

• Observation:– BWT[i] depends only on SA[i], not on any other element of

SA

• Corollary:– No need to keep all of SA in memory at once!

• Solution:– Build SA and BWT a small “chunk” or “block” at a time– Greatly reduces the memory overhead

• By something like a factor of B, where B = # of blocks

Solution

• Typical suffix sort:

Solution

• Blockwise suffix sort:

Solution

• Calculate and sort a random sample of the suffixes

Solution

• Samples are used as “bookends” for “buckets”

? $

B1 B2 B3 B4

Solution

• In B linear-time passes over the text (B = # buckets), sort all suffixes into buckets, one bucket at a time, then sort the bucket

$

B1 B2 B3 B4Pass 1

Solution

• After a bucket has been sorted and turned into a BWT segment, it is discarded

Pass B B1 B2 B3 B4

$

Solution

• Good time bounds in the presence of long repeats require use of a difference cover sample– Acts like an oracle that determines relative

lexicographical order of two suffixes that share a prefix of some length v

Project Goals

• Basic goal:– Write a correct, usable library implementing blockwise

SA sort and BWT building– Characterize performance and time/space tradeoffs

• Stretch goals:– Fine-tune for performance and memory usage– Implement difference cover sample

• Question: is this necessary for good performance on real-life inputs?

Concluding Remarks

• BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way

blockwise suffix sorting for space-efficient burrows-wheeler

Documents