umd bioscience nov2008

1
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center for Bioinformatics and Computation Biology, University of Maryland, College Park, MD, USA http://bowtie-bio.sourceforge.net Abstract Bowtie Index Bowtie is an ultrafast, memory-efficient program for aligning short reads to mammalian genomes. Burrows- Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour to the human genome in a small memory footprint: approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve greater alignment speed. Bowtie is free, open source software available for download from http://bowtie-bio.sourceforge.net Burrows-Wheeler Transform (BWT) Burrows-Wheeler Matrix Bowtie builds a genome index using a scheme based on the Burrows-Wheeler Transform (BWT) [1] and the FM Index [2]. The Burrows-Wheeler Transformation of a text T, BWT(T), is constructed as shown below. The Burrows-Wheeler Matrix of T is the matrix whose rows are all distinct cyclic rotations of T$ sorted lexicographically. BWT(T) is the sequence of characters in the rightmost column of the matrix. the i th occurrence of character X in the last column… …corresponds to the same text character as the i th occurrence of X in the first column The Burrows-Wheeler Matrix has a property called the LF mapping: the i th occurrence of character X in the last column corresponds to the same text character as the i th occurrence of X in the first column. This property underlies algorithms that use 351x 59 x Results Reads aligned (%) Bowtie - SOAP- mode 67.4 SOAP 67.3 Bowtie - Maq- mode 71.9 Maq 74.7 Bowtie Search Algorithm Bowtie’s search algorithm is based on the FM Index exact matching algorithm [2]. This algorithm proceeds in rounds where progressive rounds calculate the range of matrix rows beginning with progressively longer suffixes of the query string. The range of rows under consideration at each step is delimited by the red and green arrows. As the algorithm considers additional characters in the query (purple), the arrows are updated using the LF mapping (blue). Rows still under consideration when all query characters are exhausted correspond to reference positions matching the query (orange). “a” “a” “a “c c “a “a “a “c “a “a “a acaacg || aaa acaacg | | aaa acaacg || aaa A Bowtie genome index consists of the BWT plus some auxiliary data structures. All told, the index is only about 65% larger than the input genome. This is radically smaller than other indexing schemes such as suffix trees, suffix arrays, and seed hash tables. This allows Bowtie to run easily and efficiently on a typical computer with 2 gigabytes of RAM. To allow mismatches in alignments, Bowtie extends on the above algorithm by “backtracking” to previously matched query characters and making substitutions. Bowtie searches the space of legal alignments using a quality- aware, greedy, depth-first search. Sequence: Phred Quals: (higher number = higher confidence) G C C A T A C G G A T T A G C C 40 40 35 40 40 40 40 30 30 20 15 15 40 40 40 40 G C C A T A C G G A C T A G C C 40 40 35 40 40 40 40 30 30 20 15 15 40 40 40 40 G C C A T A C G G G C T A G C C 40 40 35 40 40 40 40 30 30 20 15 15 40 40 40 40 References 1. Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report 124 2. Ferragina P, Manzini G: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE Computer Society; 2000 Bowtie aligns reads dramatically faster than Maq (http://maq.sf.net) and SOAP (http://soap.genomics.org.cn), two competing short read alignment tools. The chart shows throughput in reads/CPU hour, running Bowtie 0.9.6, SOAP 1.10, and Maq 0.6.6. Each program was set to report at most one hit per read, a typical setting for resequencing applications. Maq and SOAP were run with default parameters, and Bowtie was run with parameters to mimic both programs’ reporting constraints. Bowtie and Maq were compared on a 2.4GHz Intel Core 2 Duo workstation with 2GB of RAM. A Bowtie index of the human genome can be built on a computer with 2 GB of RAM in less than a day (right), though indexing is faster with more RAM (left). When aligning on computers with multiple processor cores, Bowtie’s throughput increases significantly. Rightmost query positions are the likeliest backtracking targets because shorter suffixes are likely to occur in the genome “by chance.” When more than one mismatch is allowed, excessive backtracking can make search very slow. To avoid this, Bowtie uses “double indexing”: Bowtie indexes the genome as well as the reverse (mirror image) of the genome. Searching the mirror index with a mirror query is equivalent to a normal search that proceeds left-to-right instead of right-to-left. Normal search combined with mirrored search avoids most or all excessive backtracking without missing valid alignments. Bowtie index of the reference is substantially smaller than SOAP’s, which is too large to run on most workstations. Maq achieves a small footprint by indexing small batches of reads. The chart below shows performance for aligning 8.84 million 35-base-pair reads from the 1000 Genomes pilot (SRA Accession SRR001115) to the human genome (NCBI build 36.3). Bowtie has comparable sensitivity to SOAP and Maq. Maq finds some 3-mismatch alignments when asked for 2- mismatch alignments, accounting for its slightly higher sensitivity. acctagattcagaggtcaccataggcacatgcag Don’t backtrack to positions in this region of the read Allow mismatches in this part of the read Matching gacgtacacggataccactggagacttagatcca Matching reverse Above: double indexing applied to a 1-mismatch search. Searches allowing more than one mismatch follow a similar pattern, though some excessive backtracking is unavoidable. Bowtie Index Suffix Tree Suffix Array Hash Tables 1.3 gigabytes >35 gigabytes>12 gigabytes >12 gigabytes Human genome memory footprint Above: Backtracking scenarios leading to 3 distinct 1-mismatch alignments of “aaa” Because Bowtie builds a compact index of the genome, Bowtie indexes are easy to distribute over the Internet and easy to reuse. Reuse allows the cost of building or downloading the index to be amortized over many runs. Pre-built indexes for model organisms including human, chimp, mouse and dog can be downloaded from http://bowtie-bio.sourceforge.net

Upload: jonathan-uesato

Post on 06-Nov-2015

226 views

Category:

Documents


2 download

DESCRIPTION

bio

TRANSCRIPT

  • Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

    Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. SalzbergCenter for Bioinformatics and Computation Biology, University of Maryland, College Park, MD, USA

    http://bowtie-bio.sourceforge.netAbstract

    Bowtie Index

    Bowtie is an ultrafast, memory-efficient program for aligning short reads to mammalian genomes. Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour to the human genome in a small memory footprint: approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve greater alignment speed. Bowtie is free, open source software available for download from http://bowtie-bio.sourceforge.net Burrows-Wheeler Transform (BWT)Burrows-Wheeler MatrixBowtie builds a genome index using a scheme based on the Burrows-Wheeler Transform (BWT) [1] and the FM Index [2]. The Burrows-Wheeler Transformation of a text T, BWT(T), is constructed as shown below. The Burrows-Wheeler Matrix of T is the matrix whose rows are all distinct cyclic rotations of T$ sorted lexicographically. BWT(T) is the sequence of characters in the rightmost column of the matrix.the ith occurrence of character X in the last columncorresponds to the same text character as the ith occurrence of X in the first column The Burrows-Wheeler Matrix has a property called the LF mapping: the ith occurrence of character X in the last column corresponds to the same text character as the ith occurrence of X in the first column. This property underlies algorithms that use the BWT to navigate or search the text. Results

    Bowtie Search Algorithm

    Bowties search algorithm is based on the FM Index exact matching algorithm [2]. This algorithm proceeds in rounds where progressive rounds calculate the range of matrix rows beginning with progressively longer suffixes of the query string.The range of rows under consideration at each step is delimited by the red and green arrows. As the algorithm considers additional characters in the query (purple), the arrows are updated using the LF mapping (blue). Rows still under consideration when all query characters are exhausted correspond to reference positions matching the query (orange).aaaccaaacaaaacaacg || aaaacaacg| |aaaacaacg || aaaA Bowtie genome index consists of the BWT plus some auxiliary data structures. All told, the index is only about 65% larger than the input genome. This is radically smaller than other indexing schemes such as suffix trees, suffix arrays, and seed hash tables. This allows Bowtie to run easily and efficiently on a typical computer with 2 gigabytes of RAM.To allow mismatches in alignments, Bowtie extends on the above algorithm by backtracking to previously matched query characters and making substitutions. Bowtie searches the space of legal alignments using a quality-aware, greedy, depth-first search.References

    Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report 124Ferragina P, Manzini G: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE Computer Society; 2000

    Bowtie aligns reads dramatically faster than Maq (http://maq.sf.net) and SOAP (http://soap.genomics.org.cn), two competing short read alignment tools. The chart shows throughput in reads/CPU hour, running Bowtie 0.9.6, SOAP 1.10, and Maq 0.6.6. Each program was set to report at most one hit per read, a typical setting for resequencing applications. Maq and SOAP were run with default parameters, and Bowtie was run with parameters to mimic both programs reporting constraints. Bowtie and Maq were compared on a 2.4GHz Intel Core 2 Duo workstation with 2GB of RAM.A Bowtie index of the human genome can be built on a computer with 2 GB of RAM in less than a day (right), though indexing is faster with more RAM (left).When aligning on computers with multiple processor cores, Bowties throughput increases significantly. Rightmost query positions are the likeliest backtracking targets because shorter suffixes are likely to occur in the genome by chance. When more than one mismatch is allowed, excessive backtracking can make search very slow. To avoid this, Bowtie uses double indexing: Bowtie indexes the genome as well as the reverse (mirror image) of the genome. Searching the mirror index with a mirror query is equivalent to a normal search that proceeds left-to-right instead of right-to-left. Normal search combined with mirrored search avoids most or all excessive backtracking without missing valid alignments.Bowtie index of the reference is substantially smaller than SOAPs, which is too large to run on most workstations. Maq achieves a small footprint by indexing small batches of reads.The chart below shows performance for aligning 8.84 million 35-base-pair reads from the 1000 Genomes pilot (SRA Accession SRR001115) to the human genome (NCBI build 36.3).Bowtie has comparable sensitivity to SOAP and Maq. Maq finds some 3-mismatch alignments when asked for 2-mismatch alignments, accounting for its slightly higher sensitivity.reverseAbove: double indexing applied to a 1-mismatch search. Searches allowing more than one mismatch follow a similar pattern, though some excessive backtracking is unavoidable.Bowtie IndexSuffix TreeSuffix ArrayHash Tables1.3 gigabytes>35 gigabytes>12 gigabytes>12 gigabytesHuman genome memory footprintAbove: Backtracking scenarios leading to 3 distinct 1-mismatch alignments of aaaBecause Bowtie builds a compact index of the genome, Bowtie indexes are easy to distribute over the Internet and easy to reuse. Reuse allows the cost of building or downloading the index to be amortized over many runs. Pre-built indexes for model organisms including human, chimp, mouse and dog can be downloaded from http://bowtie-bio.sourceforge.net

    Reads aligned (%)Bowtie - SOAP-mode67.4 SOAP67.3 Bowtie - Maq-mode71.9 Maq74.7

    Chart1

    13619SOAP

    804Maq

    1169Bowtie

    Peak VM (Megabytes)

    Memory available (gigabytes)

    Human index build

    Sheet1

    Peak VM (Megabytes)

    SOAP13,619

    Maq804

    Bowtie1169

    To resize chart data range, drag lower right corner of range.

    Chart1

    13619

    804

    1169

    Peak VM (Megabytes)

    Megabytes

    Search - peak memory footprint

    Sheet1

    Peak VM (Megabytes)

    SOAP13,619

    Maq804

    Bowtie1169

    To resize chart data range, drag lower right corner of range.