a high performance multiple sequence alignment system for pyrosequencing reads from multiple...



J. Parallel Distrib. Comput. 72 (2012) 83–93

Contents lists available at SciVerse ScienceDirect

J. Parallel Distrib. Comput.

journal homepage: www.elsevier.com/locate/jpdc

Research note

A high performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes

Fahad Saeed a, Alan Perez-Rathke b, Jaroslaw Gwarnicki b, Tanya Berger-Wolf b, Ashfaq Khokhar c,∗

a The National Heart Lung and Blood Institute (NHLBI), National Institutes of Health (NIH) Bethesda MD, USA
b Department of Computer Science, University of Illinois at Chicago, IL, USA
c Department of Electrical and Computer Engineering, University of Illinois at Chicago, IL, USA

Article info

Article history:
Received 7 July 2010
Received in revised form 15 August 2011
Accepted 17 August 2011
Available online 16 September 2011

Keywords:
Multiple sequence alignment
Pyrosequencing
Parallel algorithms
Computational biology
Genome alignment and mapping

Abstract

Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping of reads from multiple reference genomes is not possible using a pairwise mapping algorithm. Existing multiple sequence alignment (MSA) methods cannot be used to align the reads with respect to each other and the reference genomes, because they do not take into account the position of these short reads with respect to the genome and are highly inefficient for a large number of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such a large number of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns the erroneous reads, and has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high quality multiple alignment of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with increasing number of processors.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

For over a decade, Sanger sequencing has been the cornerstone of genome sequencing, including that of microbial genomes. Improvements in DNA sequencing techniques and advances in data storage and analysis, as well as developments in bioinformatics, have reduced the cost to a mere $8000–$10,000 per megabase of high quality genome draft sequence. However, the need for more efficient and cost effective approaches has led to the development of new sequencing technologies such as the 454 GS20 sequencing platform. It is a non-cloning, pyrosequencing based platform that is capable of generating 1 million overlapping reads in a single run. However, a multitude of factors, such as relatively short read lengths compared to Sanger sequencing (as of 2010, an average of 100–400 nt per read compared to 800–1000 nt for Sanger sequencing), lack of a paired end protocol, and limited accuracy of individual reads for repetitive DNA, particularly in the case of monopolymer repeats, present many computational challenges [17] in making pyrosequencing useful for biology and bioinformatics applications.

One of the most widely employed preprocessing steps for many applications, including haplotype reconstruction [10,51], microbial community analysis [29], and analysis of genes for diseases [18], is the alignment of these reads with the wild type. For important applications such as viral population estimation or haplotype reconstruction of various viruses, e.g., HIV in a population, scientists usually have information about the wild type genome of the virus. To date, numerous tools have been suggested for mapping short reads to a single reference genome. These methods include strategies for indexing substrings of short reads or references with the use of k-mer counts or spaced seeds. The list of academic tools available for such mapping includes Bowtie, BWA, CloudBurst, MAQ, MOM, MosaikAligner, mrFAST, mrsFAST, Pash, PASS, PatMaN, RazerS, RMAP, SeqMap, SHRiMP, SliderII, SOAP, SOAP2, ssaha2 [2,4,6,15,19,23,25,28,26,30,31,35,38,39,43,47,52,1], and commercial tools such as ZOOM [27]. The current demand in alignment can be met with new indexing strategies and efficient data processing [24]. However, these strategies usually come at the cost of simplifying the mapping problem and not allowing complex alignments, including gaps or alignment with multiple reference genomes.

∗ Corresponding author.
E-mail address: [email protected] (A. Khokhar).

0743-7315/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2011.08.001

Natural strains of genomes that have a high level of individual difference, such as natural inbred strains of Arabidopsis or divergent strains of HIV, pose a considerable alignment challenge [45]. It has been shown in [3,53] that several percent of the reference genome can be missing or very divergent in distinct strains of different species. This poses a problem for pairwise mapping algorithms because the mappings can result in regions that are inaccessible when aligned to the wildtype of the species. This is particularly true for algorithms that do not take into account mismatches and gaps. Fig. 1 shows an example set of reads from different strains of genomes from the same species.

Fig. 1. Reads from four different divergent strains of genomes of the same species.

Apart from the above-mentioned difficulties in using pairwise mapping algorithms for short reads, aligning against only a single reference has its own shortcomings. Aligning against a single reference biases the analysis toward a comparison within the sequence space highly conserved with the reference. Taking into account all the strains of the genome reduces this bias. Aligning each read individually with each reference genome loses important information, such as SNP variation across strains, besides increasing computation time and memory. Therefore, some kind of merging strategy is required to interpret the alignment results.

One such strategy, called GenomeMapper, was recently introduced by Schneeberger et al. in [45]. It performs simultaneous alignment of short reads against multiple genomes using a hash-based data structure [45,37]. However, the accuracy of such alignments would be limited by the graph structure that captures the information from different strains of the same species. The method also has the drawback that the reads are pairwise aligned to the graph structure, so the comparison of reads with each other is not achieved. In this paper, we present a solution to the problem of aligning pyroreads from multiple genomes using a multiple alignment methodology on multiprocessor platforms.

In theory, multiple sequences can be aligned by scoring every pair-wise alignment and maximizing the sum of all the pair-wise alignment scores. Finding the alignment that maximizes this sum is, however, an NP-complete problem [50]. Toward this end, dynamic programming based solutions of $O(L^N)$ complexity have been pursued, where N is the number of sequences and L is the average length of a sequence. Such exact optimizations are not practical for a large number of sequences, as is the case in pyrosequencing, making heuristic algorithms the only feasible option. The literature on these heuristics is vast and includes widely used works, such as Notredame et al. [36], Edgar [7], Thompson et al. [48], Do et al. [5], and Morgenstern [32].

Recently we introduced a multiple alignment system for pyrosequencing reads, known as pyro-align [42]. Although the sequential algorithm is highly efficient compared to other alignment algorithms, it is still not feasible to align a large number of reads from multiple genomes on a single processor. It takes over 7 days to align 200,000 pyroreads on a single processor using this best known sequential algorithm [42], and the time increases drastically with the number of reads.

In this paper, we present a parallel multiple alignment solution for aligning reads from multiple genomes. The objective of our work is to develop a high performance multiple alignment system for small, error prone reads from multiple genome strains of the same species, such that the errors in the alignment are ‘highlighted’ and the system is able to handle a large number of reads, as may be expected from a pyrosequencing procedure. We emphasize that this problem is considerably harder than, and different from, mapping the reads to a single reference genome.

The multiple sequence alignment algorithm presented here, distinct from pairwise mapping or genome reconstruction, is specifically designed for the alignment of reads from multiple genomes from different strains of the same species. The proposed algorithm combines the domain decomposition concept introduced in [40,41] with the exploitation of pyroread characteristics for parallelization, and is therefore capable of aligning a very large number of pyroreads. It takes into account the position of the reads with respect to the reference genomes, and assigns weights to the leading and trailing gaps of the reads. After the decomposition, the reads are distributed among multiple processors using a metric calculated from the reads. The reads are multiple aligned on each processor with respect to each other and the genome. Then, the aligned reads on the processors are merged by exploiting the characteristics specific to pyro-reads. We assume that the reads may be generated from one or many genomes, with ‘forward’ orientation. We also assume that a wildtype of the species is available, as is generally the case for resequencing experiments. In our experiments, we have used the HIV-pol gene as the reference genome (with a length of 1971 bp) and the simulator ReadSim [44] to generate the reads. Since the complexity of multiple alignment is dominated by the number of reads, the length of the genome has minimal effect in terms of memory and timings. Please note that the above genome is only used as a proof of concept for the technique. The importance of aligning reads from multiple genomes is discussed in the literature [37,45] and is not restated here in the interest of brevity. We are able to align 200,000 reads on a 24 processor system in just 278 min. The alignment of the same number of reads takes approximately 7 days on a single node system.

For the sake of completeness, let us first formally define the multiple sequence alignment problem in its generic form, without going into issues such as scoring functions. Let N sequences be presented as a set $S = \{S_1, S_2, S_3, \ldots, S_N\}$ and let $S' = \{S'_1, S'_2, S'_3, \ldots, S'_N\}$ be the aligned sequence set, such that all the sequences in $S'$ are of equal length, have maximum overlap, and the score of the global map is maximum according to some scoring mechanism suitable for the application. A perfect multiple alignment for pyroreads would be one in which the reads are aligned with each other such that the position of the reads with respect to the reference genome(s) is conserved, the reads have maximum overlap (for conserved genome regions), and the reads are of equal length after the alignment, including leading and trailing gaps.


2. Proposed parallel multiple sequence alignment algorithm for pyroreads: P-Pyro-Align

The intuitive idea behind the proposed P-Pyro-Align algorithm is to first place the reads in the correct position with respect to the wildtype and then use progressive alignment to achieve the final alignment. The starting position of a read with respect to the genome is referred to as its s-rank and is used to redistribute the reads on multiple processors. Regular sampling [14] based on s-rank is used to achieve load-balancing among multiple processors. The final alignment is produced by combining the alignments produced on each processor. To produce the global alignment we exploit the characteristics of the pyroreads and bring in techniques from data structures and parallel computing, such as histogram mapping of indel positions on each processor, sampling based load-balancing and many-to-many communications, to realize a low complexity, highly scalable solution in terms of time, memory and communication.

First in Fig. 2 we present an outline of the proposed parallel alignment algorithm, P-Pyro-Align. This outline is followed by a detailed description of the salient steps of the algorithm.

Algorithm 1 P-Pyro-Align

Require: p processors for computation.

Require: N pyrosequencing reads and a reference genome/wildtype, such that a distinct block of N/p reads is assigned to each processor.

Ensure: Gaps are inserted in each of the reads so that

• All reads have the same length (including trailing and leading gaps).

• The score of the global map is maximum and the error insertions and deletions are ‘highlighted’ after the alignment.

1. Locally compute a semi-global alignment of all the sequences with the reference genome in each processor (this will place each read in the correct position w.r.t. the genome).

2. Use parallel sample sort to sort the reads based on s-ranks.

3. Align sequences in each processor using a progressive alignment system with trailing and leading gaps. The gaps are assigned different weights based on the position of the reads. The leading and trailing gaps are assigned weights different from those of the indels that are encountered in between base pairs of the reads.

4. Merge the aligned reads on each processor using a customized merging tree.

5. Continue the merging till all sequences are merged to produce a global multiple alignment.

6. Output the final alignment.

2.1. Semi-global alignment

The first step is to determine the position of each read with respect to the reference genome. If this step is omitted, there are a number of alignments that would be correct locally, but would be inaccurate if analyzed in the global context. A read that is not constrained in terms of genomic position may give the same score (SP score) for the multiple alignment but would be incorrect in the context of the reference. To accomplish the task of ‘placing’ the reads in the correct context with respect to the reference genome we employ a semi-global alignment procedure.1

1 This can be replaced with an indexing methodology for large genomes; for multiprocessors, in favor of simplicity, we employ simple semi-global alignment, which can be enhanced using indexing.

The semi-global alignment is also referred to as overlapping alignment because the sequences are globally aligned ignoring the start and end gaps. For semi-global alignment we use a well known modified version of the Needleman–Wunsch algorithm [34] described in [46]. The modification to the basic version of Needleman–Wunsch is required to ignore the leading and trailing gaps of the reads when aligning to the reference genome. If the leading and trailing gaps are not ignored, then, considering the short length of the reads, the alignment scores would be dominated by these gaps, giving an inaccurate alignment with respect to the genome.

Let the two sequences to be aligned be s and t, of lengths m and n respectively, and let $M(i, j)$ represent the optimal alignment score of the prefixes $s_1, \ldots, s_i$ and $t_1, \ldots, t_j$. Since we do not wish to penalize the starting gaps, we modify the dynamic programming matrix by initializing the first row and first column to zero. The gaps at the end are also not to be penalized. $M(m, j)$ is then the score of optimally aligning s with $t_1, \ldots, t_j$. The optimal alignment is therefore detected as the maximum value on the last row or column: the best score is $M(i, j) = \max_{k,l}\left(M(k, n), M(m, l)\right)$, and the alignment can be obtained by tracing the path from $M(i, j)$ back to $M(0, 0)$. For additional details on semi-global alignment we refer the reader to [46].
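As a minimal sketch of the semi-global recurrence described above, the following Python function fills the matrix M with a zero first row and column, takes the best score from the last row or last column, and traces back to find where the read starts on the reference. The function name `semi_global_align`, the linear gap penalty and the score values (match 1, mismatch -1, gap -2) are illustrative assumptions; they are not taken from the paper or from [46].

```python
def semi_global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Overlap (semi-global) alignment of read s against reference t.

    Returns (best_score, start), where start is the 0-based reference
    offset at which the traceback leaves the matrix (the s-rank).
    Scoring values are assumed for illustration only.
    """
    m, n = len(s), len(t)
    # First row and first column stay zero: leading gaps are free.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = M[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            M[i][j] = max(diag, M[i - 1][j] + gap, M[i][j - 1] + gap)

    # Trailing gaps are free: the best cell lies on the last row or column.
    candidates = [(M[m][j], m, j) for j in range(n + 1)]
    candidates += [(M[i][n], i, n) for i in range(m + 1)]
    score, i, j = max(candidates)

    # Trace back until the first row or first column is reached.
    while i > 0 and j > 0:
        diag = M[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
        if M[i][j] == diag:
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score, j
```

Under these assumed scores, the read `GATTA` aligned against the reference `ACGTGATTACAGG` scores 5 and starts at reference offset 4, which plays the role of the s-rank used in the following sections.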

Once each read has been semi-globally aligned with the wildtype genome on each processor, we obtain reads with leading and trailing gaps, where the first character after the leading gaps is the starting position of the read with respect to the wildtype genome.

2.2. Domain decomposition of the reads

After the reads are aligned with respect to the wildtype, each processor has the position information of its reads. This read position, referred to as s-rank, is important for realizing correct alignment. We also exploit this information to decompose our domain of pyro-reads on multiprocessor platforms. The decomposition is similar to our generic strategy for multiple alignment [40].

Since we are assuming N/p random sequences on each processor, each processor may have reads with starting positions anywhere from the start to the end of the genome. However, we aim to reallocate the reads for progressive alignment such that the reads that are close to each other in terms of s-rank are clustered on a single processor.

Progressive alignments are produced in each processor by combining pairwise alignments of the most similar sequences. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a tree, called a guide tree, and a second stage in which the multiple sequence alignment (MSA) is built by adding the sequences sequentially to the growing MSA according to the guide tree.

The method followed by most progressive multiple alignment algorithms is to first compute a pairwise similarity measure for each pair of sequences and organize this measure in matrix form. A tree is constructed from this distance matrix using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) or neighbor joining. The progressive alignment is then built following the branching order of the tree, giving a multiple alignment. These steps require $O(N^2)$ time each, where N is the number of reads.

In the case of pyrosequencing, we exploit the fact that the reads are coming from nearly the same reference in terms of species. This in turn implies that the reads starting from the same or nearly the same ‘starting’ point with respect to the reference genome are likely to be similar to each other. Therefore, we already have the ordering information, or the ‘guide tree’.

Fig. 2. A visually represented summary of the proposed algorithm.

Let there be N/p reads in each processor, $R = R_1, R_2, \ldots, R_{N/p}$, generated from the reference genome of length $L_G$. Also, let the length of each read be denoted by $L(R_p)$. After executing semi-global alignment using the algorithm discussed in the previous section, let each read be represented by $R_{pq}$, where the pth read has q leading gaps and $L_G - q - L(R_p)$ trailing gaps. Then the reordering algorithm reorders the reads such that the read $R_{pq}$ comes ‘before’ $R_{(p+1)(q+x)}$, with $1 \le q \le L_G$ and $1 \le x \le L(R_p)$.
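The ordering rule above amounts to sorting the semi-globally aligned reads by their number of leading gaps. A minimal sketch, assuming each read is stored as a string padded with `-` to the reference length (the helper names `s_rank` and `order_by_s_rank` are hypothetical and not part of the paper):

```python
def s_rank(gapped_read, gap="-"):
    """Number of leading gap characters, i.e. the read's start offset q."""
    return len(gapped_read) - len(gapped_read.lstrip(gap))


def order_by_s_rank(gapped_reads):
    """Return the reads sorted by ascending s-rank (stable for ties)."""
    return sorted(gapped_reads, key=s_rank)


# Three reads already placed on a reference of length 9.
reads = ["---ACGT--", "ACG------", "-----GTAC"]
ordered = order_by_s_rank(reads)
```

Because the sort is stable, reads that share a starting position keep their original relative order; the paper does not say how such ties are resolved, so this is only one possible choice.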

In an ideal load-balancing situation the number of reads on each processor would be the range of the s-ranks divided by the number of processors. However, the coverage of each position in pyro-sequencing is not constant and can vary considerably, which would lead to an unbalanced load on the processors. Therefore, we employ a regular sampling [14] based distribution of reads on multiple processors. For the sake of completeness, the analysis of the distribution of the reads using regular sampling is discussed extensively in Section 3.
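The sampling based redistribution can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the MPI implementation used in the paper: each "processor" is a plain Python list of s-ranks, every block is assumed non-empty, each block contributes p − 1 evenly spaced samples, and the pivots are taken at evenly spaced positions of the sorted sample list. The function names are hypothetical.

```python
from bisect import bisect_right


def regular_samples(sorted_ranks, p):
    """Pick p - 1 evenly spaced samples from one processor's sorted block."""
    w = len(sorted_ranks)
    return [sorted_ranks[(i * w) // p] for i in range(1, p)]


def choose_pivots(all_samples, p):
    """Root step: sort the gathered samples and pick p - 1 pivots."""
    ordered = sorted(all_samples)
    return [ordered[(i * len(ordered)) // p] for i in range(1, p)]


def redistribute(blocks):
    """Simulate the redistribution of s-ranks over p = len(blocks) buckets."""
    p = len(blocks)
    samples = []
    for block in blocks:
        samples.extend(regular_samples(sorted(block), p))
    pivots = choose_pivots(samples, p)
    buckets = [[] for _ in range(p)]
    for block in blocks:
        for rank in block:
            # bisect_right sends a rank equal to a pivot to the higher bucket.
            buckets[bisect_right(pivots, rank)].append(rank)
    return pivots, [sorted(b) for b in buckets]


# Three processors, four s-ranks each (made-up values).
blocks = [[5, 900, 30, 410], [100, 20, 700, 650], [300, 15, 820, 500]]
pivots, buckets = redistribute(blocks)
```

After redistribution, bucket i holds only s-ranks between consecutive pivots, so reads that start near each other on the genome end up on the same processor.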

2.3. Pair-wise and profile–profile alignments

The ordering and the distribution of the reads on multiple processors determined in the preceding step is now used to conduct the progressive alignment. The procedure described here is executed on each of the processors independently after the reads have been distributed according to the s-ranks. Traditional progressive alignment requires that the sequences most similar to each other are aligned first. Thereafter, sequences are added one by one to the multiple alignment in an order determined by some similarity metric. In order to devise a low complexity system, we presented a hierarchical progressive alignment procedure in [42]. However, our experiments with extremely large data sets suggest that this hierarchical strategy does not work accurately due to weighted leading and trailing gaps. The parallel progressive alignment procedure is formulated as follows.

On each processor, pair-wise local alignment using standard Smith–Waterman is executed on each overlapping pair of reads (the ordering is still the same as discussed in the previous section). After this stage, the reads are aligned in pairs such that we have $N/2p$ pairs of aligned reads on each processor. These $N/2p$ pairs of reads are then used for profile alignments as discussed below.

Profile–profile alignments are used to re-align two or more existing alignments (in our case the pairs of aligned reads). This is useful for two reasons: the user may want to add sequences gradually, or the user may want to keep one high quality profile fixed and keep adding sequences aligned to that fixed profile [48].

Fig. 3. The merging strategy for alignments.

In this stage of the algorithm, the $N/2p$ pairs of aligned reads have to be combined to get a multiple alignment. The paired profiles are then combined one by one, with reads added in ascending order of s-ranks.
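The pair-wise stage that produces these pairs can be sketched as follows: reads already ordered by s-rank are grouped two at a time, and each pair is scored with a textbook Smith–Waterman recurrence. The scoring values (match 2, mismatch -1, linear gap -2) and the helper names are assumptions for illustration; the paper does not state the parameters it uses, and only the score (not the traceback) is computed here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def adjacent_pairs(ordered_reads):
    """Group s-rank ordered reads two at a time (a trailing read stays alone)."""
    return [tuple(ordered_reads[i:i + 2]) for i in range(0, len(ordered_reads), 2)]


pairs = adjacent_pairs(["ACGTTGCA", "GTTGCAAT", "TGCAATCC", "CAATCCGG"])
scores = [smith_waterman_score(a, b) for a, b in pairs]
```

With w reads on a processor this yields w/2 pairs, matching the count of $N/2p$ pairs given in the text when w = N/p is even.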

In order to apply pair-wise alignment functions to profiles, a scoring function must be defined, similar to the substitution methods defined for pair-wise alignments. Profile sum of pairs (PSP) is the function used in Clustalw [48], Mafft [21] and Muscle [8] to maximize the Sum of Pairs (SP) score, which in turn maximizes the alignment score such that the columns in the profiles are preserved. Profile functions have evolved to be quite complex, and a good discussion of these can be found in [8,20,33,9]. For our purposes we use the profile functions from the Clustalw system.
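To make the Sum of Pairs (SP) objective concrete, the sketch below scores an alignment column by column, summing a pair score over every pair of rows. It is a simplified illustration only: the score values (match 1, mismatch -1, residue against gap -2, gap against gap 0) are assumptions, and this is not the PSP function implemented in Clustalw, which uses substitution matrices and sequence weights.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (assumed toy scoring)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """SP score of an alignment given as equal-length gapped strings."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )


example_sp = sum_of_pairs(["ACGT", "AC-T", "ACGA"])
```

For the three-row example, the columns contribute 3, 3, -3 and -1, giving an SP score of 2 under the assumed scoring.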

2.4. The merging strategy

The ordering and alignment of the reads with weighted leading and trailing gaps on each processor gives a very high quality alignment. The important question is how to merge independently aligned reads on different processors into a single alignment in an efficient manner, such that the computations required are minimal and the communication between processors is far less than the computational load.

Merging for general multiple alignment on multiple processors was introduced for the domain decomposition strategy in [40]. We further modify the technique introduced in [40] to give a low complexity merging solution for pyrosequencing reads in terms of computational and communication load.

Fig. 4. The final alignment of the reads.

Table 1
Computation costs.

Step | O(time)
Semi-global alignment of w = N/p reads with reference genome | $w \times L_R \times L_G$
Sorting of N/p sequences based on s-rank | $w \log w$
Sample k = p − 1 sequences | $w$
Sorting of k × p sample s-ranks | $(k \times p) \log(k \times p)$
Clustalw executed on w = N/p sequences in parallel | $w^2 + L_R^2$
Computations for merging | $2 \times w \times L_G$
Total computation cost for w = N/p | $O\left(\left(\frac{N}{p}\right)^2 + (L_R)^2 + \frac{N}{p} L_R L_G\right)$

We observe that after the alignment performed in each processor, large columns of gaps are introduced over the entire range because of the insertions and deletions in pyrosequencing reads. This is precisely the only information required to merge two high quality alignments. Since the reads are coming from the same reference species, a large column of indels in one alignment and the absence of any in the other alignment implies that the indels in the first alignment are due to insertions and/or deletions in those reads. Thus, insertion of a large column of indels in the second alignment would make a perfect global alignment for pyro-reads. The concept is illustrated in Fig. 3. The important point to note here is that, unlike traditional profile–profile alignment, the only information that is shared among processors is the position of the large indels in the alignment on individual processors. This small amount of data is communicated among the processors involved in the merging. An example final alignment from the P-Pyro-Align algorithm can be seen in Fig. 4.
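The merging idea can be sketched with a simplified model. Here each processor's alignment is assumed to be a list of equal-length rows plus a map from reference position to the number of gap columns that the local alignment opened immediately before that position. Only these maps would need to be exchanged; each block then pads itself with the gap columns it lacks. The data layout and the names `pad_block` and `merge_blocks` are hypothetical, chosen for illustration rather than taken from the paper.

```python
def pad_block(rows, inserts, merged, ref_len, gap="-"):
    """Add the gap columns that other blocks have and this block lacks."""
    padded = []
    for row in rows:
        out, col = [], 0
        for r in range(ref_len):
            own = inserts.get(r, 0)
            out.append(row[col:col + own])               # this block's indel columns
            out.append(gap * (merged.get(r, 0) - own))   # columns seen only elsewhere
            out.append(row[col + own])                   # column of reference position r
            col += own + 1
        padded.append("".join(out))
    return padded


def merge_blocks(block_a, block_b, ref_len):
    """Merge two (rows, inserts) blocks into one list of aligned rows."""
    rows_a, ins_a = block_a
    rows_b, ins_b = block_b
    merged = {r: max(ins_a.get(r, 0), ins_b.get(r, 0)) for r in set(ins_a) | set(ins_b)}
    rows = pad_block(rows_a, ins_a, merged, ref_len)
    rows += pad_block(rows_b, ins_b, merged, ref_len)
    return rows, merged


# Block A has one indel column before reference position 3; block B has none.
block_a = (["ACG-TA-", "-CGGTAC"], {3: 1})
block_b = (["--GTAC", "ACGT--"], {})
merged_rows, merged_inserts = merge_blocks(block_a, block_b, ref_len=6)
```

In this toy case the rows of block B each receive one extra gap column before reference position 3, so all four rows end up with length 7 and the inserted base in block A stays visible as an isolated column.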

3. Analysis of computation and communication costs

For the computation and communication analysis we use a coarse grained computing model such as the C3-model [13,12]. Also, for analysis purposes, we assume that the Clustalw system [48] is being used at each processor as the underlying profile–profile alignment system. It must be noted that the computation complexity of the alignment step will vary depending on the sequential MSA system used for alignment within each processor.

In the following analysis we assume that each processor has w = N/p pyro-reads, where N is the total number of reads to be aligned, and p is the number of processors. The average length of a read is $L_R$ and the length of the genome is $L_G$. In Table 1, we outline the computation cost of each step of the algorithm.
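The per-step terms of Table 1 can be evaluated numerically to see which term dominates for a given configuration. The sketch below is only a rough aid: constants hidden by the O-notation are ignored, the logarithm base (2) is an assumption since the table does not specify one, and the function name `computation_terms` is hypothetical.

```python
import math


def computation_terms(N, p, LR, LG):
    """Evaluate the Table 1 cost terms for w = N/p reads per processor."""
    w = N / p
    k = p - 1
    kp = k * p
    return {
        "semi_global": w * LR * LG,
        "local_sort": w * math.log2(w) if w > 1 else 0.0,
        "sampling": w,
        "sample_sort": kp * math.log2(kp) if kp > 1 else 0.0,
        "alignment": w ** 2 + LR ** 2,
        "merging": 2 * w * LG,
    }


# N, p and LG follow the experiment described in the introduction;
# the average read length LR = 250 is an assumed value.
terms = computation_terms(N=200_000, p=24, LR=250, LG=1971)
dominant = max(terms, key=terms.get)
```

Because w = N/p enters the alignment term quadratically, doubling the number of processors reduces the $w^2$ part of that term by a factor of four, while the semi-global and merging terms shrink only linearly.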

3.1. Communication cost

The communication overhead is an important factor that dictates the performance of a distributed message passing parallel system. If the communication overhead is much higher than the computation cost, the performance of the system is limited. Fortunately, the communication cost of our system is much less than the cost of the alignment. Essentially, the proposed P-Pyro-Align algorithm has three rounds of communication. In the first round, a small set of samples is collected at the root processor and a set of pivots based on s-ranks is broadcast from the root processor. In the second round, reads are redistributed to achieve better alignments and balanced load distribution. In the third round, the locations of large indel columns from each processor are broadcast to all the other processors. For the analysis of the communication costs we have adopted the coarse grained computation model [13,22] and domain decomposition communication cost analysis [40]. However, we ignore the message start up costs and assume unit time to transmit each data byte.

We have assumed the Regular Sampling strategy [14] because of its suitability to our problem domain. Some of the reasons are:

1. The strategy is independent of the distribution of the original data, in contrast to some other strategies such as that of Huang and Chow [16].

2. It helps in partitioning the data into ordered subsets of approximately equal size. This is an efficient strategy for load balancing, as an unequal number of sequences on different processors would mean unequal computational load, leading to poor performance. In the presence of data skew, regular sampling guarantees that no processor computes more than $2N/p$ sequences [14].

3. It has been shown in [14] that regular sampling yields optimal partitioning results as long as $N > p^3$, i.e., the number of data items $N$ is much larger than the number of processors $p$, which would be the normal case in the MSA application.

4. The partitioning of data can also be achieved by dividing the length of the genome by the number of processors and distributing the reads according to the pivots from the division. However, distributing randomly positioned reads w.r.t. the genome from pyrosequencing can create unbalanced computational loads on multiprocessors. Hence, partitioning the data trivially does not give any theoretical guarantees for load distribution, making the process more random and unreliable.
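The pivot selection behind regular sampling can be illustrated with a small single-process simulation. This is a minimal sketch, not the P-Pyro-Align implementation: the function name `regular_sample_pivots`, the representation of each processor's reads as a plain list of numeric s-ranks, and the exact indices at which samples and pivots are taken are all assumptions made for illustration; the gather at the root processor is simulated by concatenating lists.

```python
def regular_sample_pivots(local_ranks, p):
    """Choose p - 1 pivots from p blocks of s-ranks by regular sampling.

    local_ranks: list of p lists, one list of s-ranks per (simulated) processor.
    The sampling indices used here are an illustrative choice.
    """
    samples = []
    for ranks in local_ranks:
        if not ranks:
            continue  # an empty block contributes no samples
        ranks = sorted(ranks)
        w = len(ranks)
        # Each processor contributes p - 1 evenly spaced samples.
        samples.extend(ranks[(j * w) // p] for j in range(1, p))
    # "Root processor": sort the gathered samples, pick p - 1 evenly spaced pivots.
    samples.sort()
    return [samples[(j * len(samples)) // p] for j in range(1, p)]


if __name__ == "__main__":
    blocks = [[5, 1, 9, 3, 7, 11], [2, 4, 6, 8, 10, 12], [13, 15, 17, 14, 16, 18]]
    print(regular_sample_pivots(blocks, 3))
```

Because the samples are taken from locally sorted blocks, the pivots follow the actual distribution of s-ranks rather than the length of the genome, which is the property the list above relies on for load balancing.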

3.1.1. First communication round

Assuming $k = p - 1$, i.e., each processor chooses $p - 1$ samples, the complexity of the first phase is $O(p^2 L_R) + O(p \log p) + O(k \times p \log p)$, where $O(p^2 L_R)$ is the time to collect $p(p - 1)$ samples of average length $L_R$ at the root processor, $O(p \log p)$ is the time required to broadcast $p - 1$ pivots to all the processors, and $O(k \times p \log p)$ is the time required to broadcast $k \times p$ sequences to all the processors.

3.1.2. Second communication round

In the second round each processor sends the sequences having s-rank in the range of bucket $i$ to processor $i$. Each processor partitions its block of reads into $p$ sub-blocks using the pivots as bucket boundaries, and sends sub-block $i$ to processor $i$. The sizes of these sub-blocks can vary from 0 to $N/p$ sequences depending on the initial data distribution. Taking the average case where the elements in the processor are distributed uniformly, each sub-block will have $N/p^2$ sequences. Thus this step would require $O(N/p)$ time assuming an all-to-all personalized broadcast communication primitive [12]. However, in the following we show that, based on regular sampling, no processor will receive more than $2N/p$ elements in total in the worst case. Therefore the overall communication cost will still be $O\left(\frac{N}{p} L_R\right)$.
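The bucketing step described above can be sketched as follows. This is an illustrative single-process version under stated assumptions: the function name `partition_by_pivots` and the representation of a read as an `(s_rank, read_id)` tuple are hypothetical, and the actual redistribution between processors (an all-to-all exchange) is not shown.

```python
from bisect import bisect_left


def partition_by_pivots(reads, pivots):
    """Split reads into len(pivots) + 1 buckets using sorted pivots.

    reads:  list of (s_rank, read_id) tuples (hypothetical representation).
    Bucket i receives reads with pivots[i-1] < s_rank <= pivots[i].
    """
    buckets = [[] for _ in range(len(pivots) + 1)]
    for s_rank, read_id in reads:
        # bisect_left returns the first index whose pivot is >= s_rank.
        buckets[bisect_left(pivots, s_rank)].append((s_rank, read_id))
    return buckets


if __name__ == "__main__":
    reads = [(r, f"read{r}") for r in range(1, 19)]
    for i, bucket in enumerate(partition_by_pivots(reads, [9, 15])):
        print(i, [rank for rank, _ in bucket])
```

In a message passing setting, bucket `i` on every processor would then be sent to processor `i`, so the largest bucket total determines the worst-case load.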

Let us denote the pivots chosen in the first phase by the array $y_1, y_2, y_3, \ldots, y_{p-1}$. Consider any processor $i$, where $1 < i < p$. All the sequences to be processed by processor $i$ must have s-rank $> y_{i-1}$ and $\le y_i$. There are $(i - 2)p + \frac{p}{2}$ sequences of the regular sample which are $\le y_{i-1}$, implying that there are at least $lb = \left((i - 2)p + \frac{p}{2}\right)\frac{N}{p^2}$ sequences in the entire data that have s-rank $\le y_{i-1}$. On the other hand, there are $(p - i)p - \frac{p}{2}$ sequences in the regular sample that have s-rank $> y_i$. Thus, there are $ub = \left((p - i)p - \frac{p}{2}\right)\frac{N}{p^2}$ sequences of $N$ which are $> y_i$. Since the total number of sequences is $N$, at most $N - ub - lb$ sequences will be assigned to processor $i$. It is easy to show that this expression is upper bounded by $2N/p$. The cases $i = 1$ and $i = p$ are special because the pivot interval for these two processors is $p/2$. The load for these processors will always be less than $2N/p$. Due to page limitations, we refer to [14] for further details of the analysis.
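For completeness, the bound follows directly from the two expressions for $lb$ and $ub$ given above; the worked step below uses only those definitions:

$$
N - ub - lb
= N - \frac{N}{p^2}\left[\left((p - i)p - \frac{p}{2}\right) + \left((i - 2)p + \frac{p}{2}\right)\right]
= N - \frac{N}{p^2}\left(p^2 - 2p\right)
= \frac{2N}{p}.
$$

The terms in $i$ and the two $\frac{p}{2}$ terms cancel, so the bound does not depend on which interior processor $i$ is considered.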

3.1.3. Third communication round

In the third round each processor sends the positions of long indel columns in the alignments. In the analysis we assume a worst-case scenario where a long column of indels is encountered at each alternating position, i.e., after each column of DNA nucleotides a column of indels is encountered in the alignment. We assume that the columns from each processor are broadcast to every other processor. The length of each column is equal to the number of reads $N$, and the number of columns is at most $L_G$. The communication cost is then $O(N L_G \log p)$.

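The total asymptotic time complexity $T$ of the algorithm is shown in Eq. (3).

$$\text{Computation Costs} = O\left(\left(\frac{N}{p}\right)^2\right) + O\left((L_R)^2\right) + O\left(\frac{N}{p} L_R L_G\right) \tag{1}$$

$$\text{Communication Costs} = O(p^2 L_R) + O(p \log p) + O\left(\frac{N}{p} L_R\right) + O(N L_G \log p) \tag{2}$$

$$T \approx O\left(\left(\frac{N}{p}\right)^2\right) + O\left((L_R)^2\right) + O\left(\frac{N}{p} L_R L_G\right) + O\left(\frac{N}{p} L_R\right) + O(N L_G \log p). \tag{3}$$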
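To see how the individual terms of Eqs. (1) and (2) scale with $p$, they can be evaluated numerically. The sketch below is only an illustration: it sets every hidden constant to 1 and uses a base-2 logarithm, both of which are assumptions, so the absolute values carry no meaning and only the ratios between runs with different $p$ are informative. The function name and dictionary keys are hypothetical labels, and the last communication term is the worst-case bound from Section 3.1.3.

```python
import math


def cost_terms(N, p, L_R, L_G):
    """Evaluate each asymptotic term of Eqs. (1) and (2) with unit constants."""
    w = N / p                # reads per processor
    log_p = math.log2(p)     # logarithm base is an assumption
    return {
        "comp_quadratic": w ** 2,           # O((N/p)^2)
        "comp_read": L_R ** 2,              # O((L_R)^2)
        "comp_genome": w * L_R * L_G,       # O((N/p) L_R L_G)
        "comm_samples": p ** 2 * L_R,       # O(p^2 L_R)
        "comm_pivots": p * log_p,           # O(p log p)
        "comm_redistribute": w * L_R,       # O((N/p) L_R)
        "comm_indels": N * L_G * log_p,     # O(N L_G log p), worst case
    }


if __name__ == "__main__":
    one = cost_terms(100_000, 1, 250, 1971)
    four = cost_terms(100_000, 4, 250, 1971)
    # The quadratic term shrinks by p^2, the linear terms only by p.
    print(one["comp_quadratic"] / four["comp_quadratic"])
    print(one["comp_genome"] / four["comp_genome"])
```

The read length of 250 is an arbitrary value inside the 100–400 nt range mentioned in the paper; 1971 is the length of the wildtype used in the experiments.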

4. Performance analysis

The performance evaluation process has been divided into two parts: the first part deals with the quality assessment, and the second part deals with traditional HPC metrics such as execution time, scalability, memory requirements, etc. The performance evaluation of the P-Pyro-Align algorithm is carried out on a Beowulf cluster consisting of 24 Intel Xeon processors, each running at 3.0 GHz, with 512 KB cache and 2 GB DRAM memory. As for the interconnection network, the system uses Intel Gigabit network interface cards on each cluster node. The operating system on each node is RedHat Linux 4.2 (Kernel level: 2.6.9-22.ELsmp).

4.1. Experimental setup and quality assessment

As discussed earlier in the paper, the exact solution for multiple alignment is not feasible and heuristics are employed. Most of these heuristics perform well in practice, but there is generally no theoretical justification possible for them [11]. For P-Pyro-Align it can be shown that the semi-global alignment of the reads with the reference genome is analogous to center star alignment. The center star alignment is shown to give results within a factor of 2 of the optimal alignment [11] in the worst case, and the same can be expected from the semi-global alignment of reads with the reference genome. The accuracy of the later stages is confirmed by the rigorous quality assessment procedure described in the section below.

The assessment of the quality is done in the spirit of determining how good the multiple alignment of the reads is. Aligning reads from multiple genomes has been shown to increase the number of recovered SNPs by 15%, of deletions by 22.6% and of insertions by 37.2% [45]. However, our quality assessment is to determine how good the multiple alignment is. Clearly, the advantages discussed for using multiple genomes over a single reference would be evident if the alignment quality obtained is good [45].

To investigate the quality of the alignment produced by the algorithm we have used the ReadSim simulator [44] to generate the reads. The quality assessment of multiple alignment is generally carried out using benchmarks such as Prefab [7] or BaliBase [49]. However, these benchmarks are not designed to assess the quality of the aligned reads produced from pyrosequencing, and there are no benchmarks available specifically for these reads. Therefore, a system has to be developed to assess the quality of the aligned reads. The experimental setup for the quality assessment of the alignment procedure is shown in Fig. 5 and is explained below.

Our quality assessment has two objectives: (1) to assess the quality of the alignment produced by P-Pyro-Align with respect to the original genome, and (2) to ensure that the system is able to handle reads from multiple genomes for alignment.

We used an HIV pol gene virus with a length of 1971 bp as the wildtype for the experiments. The wildtype is then used to produce 6 sets of genomes, randomly mutated at different rates. The six sets of genomes are Dist-003, Dist-005, Dist-007, Dist-010, Dist-013 and Dist-015, with mutations of 3%, 5%, 7%, 10%, 13% and 15%, respectively.² Using the mutated genomes, 5000 reads were generated using standard ReadSim simulator parameters with forward orientation.

The generated reads from these mutated genomes were then aligned with the wildtype using the proposed P-Pyro-Align algorithm. This procedure is adopted because scientists generally only have a wildtype of the microbial genomes available, and it therefore depicts a more practical scenario.

² Random mutations at distinct positions, as well as inserted rows of mutations from 2 bp to 17 bp, were used.


Fig. 5. The experimental setup for the quality assessment of the multiple alignment program.

Fig. 6. The quality of alignment with different mutation rates for P-Pyro-Align, pairwise and gap propagation methods.

After the alignment, a majority consensus of the reads is obtained. A metric referred to as the quality score is defined as the percentage of the consensus obtained from the alignment that is correct with respect to the true genome. The quality score is then calculated by comparing the consensus obtained from the aligned reads with the original genome from which the reads were generated.
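A majority consensus and the quality score can be computed along the following lines. This is a minimal sketch of one possible reading of the definition above, not the evaluation code used in the paper. It assumes that the aligned reads are given as equal-length strings, that `.` marks positions a read does not cover, that `-` marks an indel, that columns won by the gap symbol are dropped from the consensus, and that every genome position is covered by at least one read. All names are hypothetical.

```python
from collections import Counter


def majority_consensus(rows, gap="-", pad="."):
    """Column-wise majority vote over aligned reads (equal-length strings)."""
    consensus = []
    for column in zip(*rows):
        votes = Counter(c for c in column if c != pad)
        if not votes:
            continue  # no read covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != gap:  # columns won by the gap symbol are dropped
            consensus.append(winner)
    return "".join(consensus)


def quality_score(consensus, genome):
    """Percentage of genome positions at which the consensus agrees."""
    matches = sum(a == b for a, b in zip(consensus, genome))
    return 100.0 * matches / len(genome)


if __name__ == "__main__":
    genome = "ACGTACGT"
    rows = ["ACGT....", ".CGTAC..", "..GAACGT", "....ACGT"]
    consensus = majority_consensus(rows)
    print(consensus, quality_score(consensus, genome))
```

In the example, the third read carries an erroneous base in the fourth column, which is outvoted by the two other reads covering that position.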

We compare the accuracy of the algorithm with two different methods. First, we compare with methods that use simple pairwise alignment of the reads with the reference genome. Simple pairwise alignment is used to show how poor the alignment can be with respect to the original when multiple genomes are used. Second, we compare it with a gap propagation method, proposed in [10]. For this, we have used the ShoRAH software package [51], which uses a pairwise gap propagation method to mimic multiple sequence alignment and is similar to mapping algorithms such as Bowtie, MAQ, etc. Simply put, the gap propagation method builds a multiple alignment from pairwise alignments by ‘propagating’ the gaps: for every position where at least one read has an inserted base, a gap is inserted in the reference genome and, consequently, in all reads that overlap the genome at that position. The complexity of the procedure is at least $O(N^2)$, and it overestimates the gaps. However, it mimics multiple alignment and thus is a good comparison with our system. A majority consensus is obtained from each type of alignment discussed above. Our experiments suggest that, in the event of multiple genomes, mapping algorithms such as Bowtie perform close to the gap propagation method if the genetic difference between the genomes is not large.
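The idea of gap propagation can be sketched in a few lines. The code below is an illustration of the description above and is not the ShoRAH implementation. It assumes that each pairwise alignment is given as a hypothetical tuple `(start, ref_aln, read_aln)`, where `start` is the 0-based reference position of the first column, `ref_aln` contains `-` wherever the read has an inserted base, and the alignment begins and ends on a reference base. Positions not covered by a read are written as `.`.

```python
def propagate_gaps(reference, alignments, gap="-", pad="."):
    """Build a multiple alignment from pairwise alignments by propagating gaps."""
    length = len(reference)
    max_ins = [0] * (length + 1)  # widest insertion seen before each position
    parsed = []
    for start, ref_aln, read_aln in alignments:
        inserts, bases, pos = {}, {}, start
        for r, q in zip(ref_aln, read_aln):
            if r == gap:  # the read has an inserted base here
                inserts[pos] = inserts.get(pos, "") + q
            else:         # match, mismatch, or deletion in the read
                bases[pos] = q
                pos += 1
        for k, ins in inserts.items():
            max_ins[k] = max(max_ins[k], len(ins))
        parsed.append((start, pos, inserts, bases))

    def render(start, end, inserts, bases):
        row = []
        for k in range(length):
            covered = start <= k < end
            ins = inserts.get(k, "") if covered else ""
            fill = gap if covered else pad
            # Pad every insertion slot to the widest insertion at this position.
            row.append(ins + fill * (max_ins[k] - len(ins)))
            row.append(bases[k] if covered else pad)
        return "".join(row)

    ref_row = render(0, length, {}, dict(enumerate(reference)))
    return ref_row, [render(*item) for item in parsed]


if __name__ == "__main__":
    ref_row, rows = propagate_gaps(
        "ACGT", [(0, "AC-GT", "ACTGT"), (1, "CGT", "CGT")]
    )
    print(ref_row)
    for row in rows:
        print(row)
```

A single inserted base in one read opens a gap column in the reference and in every other overlapping read, which is why the method tends to overestimate gaps as the number of reads grows.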

The results of the alignments obtained and the accuracy of the consensus are shown in Fig. 6 for 5000 reads. The accuracy of the consensus obtained using just the pairwise alignment is less than 58% for 3% mutations and continues to decrease, reaching 42% accuracy for 15% mutations. The quality obtained from P-Pyro-Align is always greater than 93% and is sustained with increasing mutations. The accuracy of the gap propagation procedure is comparable to P-Pyro-Align for small mutation rates, but as the mutations increase the accuracy of gap propagation decreases; the procedure is also not feasible for a large number of reads.

To illustrate that the alignment system works with reads from multiple genomes, we have used a mixture of mutated reads from Dist-003, Dist-005, Dist-007, Dist-010, Dist-013 and Dist-015 to get a new set of reads. The new set contains reads (2500 from each genome) from the mutated sets of genomes. The reads are then aligned by the P-Pyro-Align algorithm using the wildtype as the reference genome. The results of alignment for this mixture set are shown in Fig. 7 for the Dist-003/Dist-005, Dist-005/Dist-007, Dist-007/Dist-010, Dist-010/Dist-013 and Dist-013/Dist-015 mixtures. It must be noted here that we do not have a ‘ground truth’ genome in these cases, and hence no genome is available against which to compare the consensus obtained from the alignment.

However, we do know the mutation rates for the genomes from which the mixture sets were generated. Therefore, if an optimal alignment of these reads is obtained, the ‘mutation’ in the consensus should not be greater than the combined mutations of the genomes. For example, consider the case of the Dist-003/Dist-005 mixture. We know that the mutation rates for the genomes from which the reads were generated were 3% and 5% with respect to the wildtype. Therefore, for an accurate alignment, the consensus of the alignment should not vary by more than 8%, in the worst case, when compared to the wildtype. The same would be true for the other cases considered, according to the mutation rates of the genomes. As can be seen, the results of the alignment compared with the wildtype are well within the expected limits. The accuracy of the pairwise alignment of the reads with the reference genome (in this case the wildtype), and that obtained using the propagation method, are also shown for comparison.

Fig. 7. The quality of alignment with different mixtures of mutation rates for P-Pyro-Align, pairwise and gap propagation methods.

Fig. 8. The quality of alignment produced using P-Pyro-Align with increasing average length of the reads.

The quality of the alignment produced using P-Pyro-Align also had to be tested for various other parameters such as the number of reads and the length of the reads. We present the results for our algorithm with increasing average length of the reads and with increasing number of reads. The quality of the alignment for 100 bp–400 bp reads using P-Pyro-Align on 8 processors is shown in Fig. 8. As can be seen from the figure, with increasing read length the quality of the alignment is always greater than 99.6%.

The quality of the alignment produced with an increasing number of reads, up to 100,000, is reported in Fig. 9. As can be seen, with an increasing number of reads the quality does not deteriorate and is always greater than 99%. Some decrease in the quality can be attributed to the merge strategy, which may introduce additional indels for a large number of reads.

If the parallel alignment algorithm does not sustain quality with an increasing number of processors, then the scalability of the system is limited. We have tested the quality produced by P-Pyro-Align with an increasing number of reads, up to 100,000, and up to 24 processors. As can be seen in Fig. 10, increasing the decomposition granularity has no significant effect on the quality of the alignment produced. With up to 24 processors and 100,000 reads, the quality is never less than 98%.

Fig. 9. The quality of alignment produced using P-Pyro-Align with increasing number of reads.

Fig. 10. The quality of alignment produced using P-Pyro-Align with increasing number of processors.

4.2. Performance in terms of high performance computing parameters

In this section we analyze the performance of our algorithm in terms of traditional HPC parameters such as execution time, speedups and scalability. The objective of the evaluation is to determine the advantages of the proposed P-Pyro-Align algorithm in terms of speedups and reduction in execution time.

For the sake of coherence we generated the same kinds of sets for the execution time evaluation as were selected for the quality assessment. We report results for up to 0.5 million reads. To the best of the authors' knowledge, there are no published reports of multiple alignment of such a large number of reads in the literature.

As can be seen in Fig. 11, the execution time decreases sharply with an increasing number of processors. We are able to align 100,000 reads on a 24 processor system in just 139 min. The time for aligning 200,000 reads on a 24 processor system is 278 min. The same number of reads would take approximately 7 days on a single processor. The number of days needed to obtain an alignment on a single processor system increases drastically with an increasing number of reads.

As shown in Fig. 12, P-Pyro-Align exhibits linear speedup on a 24 processor system. For a smaller number of reads, the speedups initially obtained are near-linear, e.g., for 20,000 reads the speedup obtained on 8 processor nodes is approximately a factor of 7; but with increasing processors the speedup decreases. This is because, with a smaller number of reads and more processors, the communication overhead is larger relative to the computation cost. However, with an increasing number of reads, e.g., 0.5 million, super-linear speedups can be observed. This is because the $(1/p)^2$ factor decrease becomes dominant as the number of reads on each processor increases, which is in agreement with the complexity analysis in Section 3. It is safe to assume that the same behavior would be exhibited with increasing number of reads and processors.

Fig. 11. Scalability of the execution time w.r.t. the number of processors.

Fig. 12. Speedup for P-Pyro-Align with increasing number of processors.
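Speedup and parallel efficiency follow directly from measured execution times. The short sketch below uses the standard definitions (serial time divided by parallel time, and speedup divided by the number of processors); the function names are hypothetical. The example plugs in the figures quoted above for 200,000 reads, taking "approximately 7 days" as exactly 7 × 24 × 60 minutes, so the resulting numbers are only indicative.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-processor time to parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """Speedup per processor; values above 1 indicate super-linear speedup."""
    return speedup(t_serial, t_parallel) / p


if __name__ == "__main__":
    t_serial = 7 * 24 * 60   # about 7 days on one processor, in minutes
    t_parallel = 278         # minutes on 24 processors
    print(round(speedup(t_serial, t_parallel), 1))
    print(round(efficiency(t_serial, t_parallel, 24), 2))
```

An efficiency above 1 is what one would expect when the per-processor work contains a term that shrinks faster than $1/p$, such as the $(N/p)^2$ term in Eq. (1).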

5. Conclusion

With rapidly increasing knowledge of variants, alignment of reads from multiple genomes would be of increasing importance. We have demonstrated that multiple alignment for a large number of error-prone reads is not only possible but can also produce meaningful results. It gives access to regions that are highly divergent from the first reference or wildtype and would be rendered useless with mapping/pairwise alignments with the reference genome.

To this end, we have presented a highly scalable data parallel algorithm to multiple align a large number of short reads from the pyrosequencing procedure. The proposed multiple alignment can take advantage of data decomposition to get results in a reasonable time. We also presented the quality assessment results and compared those with the results obtained by the simple pairwise alignment procedure and ‘propagation’ methods.

We emphasize that, as of today, there are no other multiple alignment methods available that can align up to 0.5 million short error-prone pyrosequencing reads, with the potential to align an even larger number of reads on large parallel systems. We believe that the parallel alignment method presented in the paper would find its uses in a multitude of computational biology and bioinformatics problems requiring alignments of a large number of reads.

Acknowledgments

This work was completed when Fahad Saeed was a PhD candidate at the Department of Electrical and Computer Engineering, University of Illinois at Chicago, IL, USA.

References

[1] C. Alkan, J. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari, J. Kitzman, C. Baker, M. Malig, O. Mutlu, S. Sahinalp, R. Gibbs, E. Eichler, Personalized copy number and segmental duplication maps using next-generation sequencing, Nat. Genet. (2009).

[2] D. Campagna, A. Albiero, A. Bilardi, E. Caniato, C. Forcato, S. Manavski, N. Vitulo, G. Valle, Pass: a program to align short sequences, Bioinformatics 25 (2009) 967–968.

[3] R. Clark, G. Schweikert, C. Toomajian, S. Ossowski, G. Zeller, P. Shinn, N. Warthmann, T. Hu, G. Fu, D. Hinds, H. Chen, K. Frazer, D. Huson, B. Scholkopf, M. Nordborg, G. Ratsch, J. Ecker, D. Weigel, Common sequence polymorphisms shaping genetic diversity in arabidopsis thaliana, Science 317 (2007) 338–342.

[4] C. Coarfa, A. Milosavljevic, Pash 2.0: scaleable sequence anchoring for next-generation sequencing technologies, Pac. Symp. Biocomput, vol. 13, 2008, pp. 102–113.

[5] C. Do, M. Mahabhashyam, M. Brudno, S. Batzoglou, Probcons: probabilistic consistency-based multiple sequence alignment, Genome Res. 15 (2005) 330–340.

[6] H. Eaves, Y. Gao, Mom: maximum oligonucleotide mapping, Bioinformatics 25 (2009) 969–970.

[7] R. Edgar, Muscle: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res. 32 (5) (2004).

[8] R. Edgar, Muscle: a multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics (2004) 1471–2105.

[9] R.C. Edgar, K. Sjolander, A comparison of scoring functions for protein sequence profile alignment, Bioinformatics 20 (8) (2004) 1301–1308.

[10] N. Eriksson, L. Pachter, Y. Mitsuya, S.-Y. Rhee, C. Wang, B. Gharizadeh, M. Ronaghi, R.W. Shafer, N. Beerenwinkel, Viral population estimation using pyrosequencing, PLoS Comput. Biol. 4 (2008) e1000074.

[11] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.

[12] S.E. Hambrusch, F. Hameed, A.A. Khokhar, Communication operations on coarse-grained mesh architectures, Parallel Comput. 21 (5) (1995) 731–751.

[13] S.E. Hambrusch, A. Khokhar, C3: a parallel model for coarse-grained machines, J. Parallel Distrib. Comput. 32 (2) (1996) 139–154.

[14] S. Hanmao, S. Jonathan, Parallel sorting by regular sampling, J. Parallel Distrib. Comput. 14 (4) (1992) 361–372.

[15] F. Hormozdiari, C. Alkan, E. Eichler, S. Sahinalp, Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes, Genome Res. 19 (2009) 1270–1278.

[16] J.S. Huang, Y.C. Chow, Parallel sorting and data partitioning by sampling, in: Proceedings of the IEEE Computer Society’s Seventh International Computer Software and Applications Conference, 1983, pp. 627–631.

[17] C.A. Hutchison III, DNA sequencing: bench to bedside and beyond, Nucleic Acids Res. 35 (2007) 6227–6237.

[18] X.-L. Hou, Q.-Y. Cao, H.-Y. Jia, Z. Chen, Pyrosequencing analysis of the gyrb gene to differentiate bacteria responsible for diarrheal diseases, Eur. J. Clin. Microbiol. Infect. Dis. 27 (7) (2007) 587–596.

[19] H. Jiang, W. Wong, Seqmap: mapping massive amount of oligonucleotides to the genome, Bioinformatics 24 (2008) 2395–2396.

[20] D.T. Jones, W.R. Taylor, J.M. Thornton, The rapid generation of mutation data matrices from protein sequences, BMC Bioinformatics 8 (3) (1991) 275–282.

[21] K. Katoh, K. Misawa, K. Kuma, T. Miyata, Mafft: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res. 30 (14) (2002) 3059–3066.

[22] V. Kumar, A. Grama, A. Gupta, G. Karpis, Introduction to Parallel Computing: Design and Analysis of Parallel Algorithms, Benjamin-Cummings Pub. Co., 1994.

[23] B. Langmead, C. Trapnell, M. Pop, S. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biol. 10 (2009) R25.

[24] B. Langmead, C. Trapnell, M. Pop, S. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biol. 10 (2009) R25.

[25] H. Li, R. Durbin, Fast and accurate short read alignment with burrows-wheeler transform, Bioinformatics 25 (2009) 1754–1760.

[26] R. Li, Y. Li, K. Kristiansen, J. Wang, Soap: short oligonucleotide alignment program, Bioinformatics 24 (2008) 713–714.

[27] H. Lin, Z. Zhang, M. Zhang, B. Ma, M. Li, Zoom! zillions of oligos mapped, Bioinformatics 24 (2008) 2431–2437.

[28] H. Li, J. Ruan, R. Durbin, Mapping short DNA sequencing reads and calling variants using mapping quality scores, Genome Res. 18 (2008) 1851–1858.


[29] Z. Liu, C. Lozupone, M. Hamady, F.D. Bushman, R. Knight, Short pyrosequencing reads suffice for accurate microbial community analysis, Nucleic Acids Res. (2007) gkm541.

[30] R. Li, C. Yu, Y. Li, T. Lam, S. Yiu, K. Kristiansen, J. Wang, Soap2: an improved ultrafast tool for short read alignment, Bioinformatics 25 (2009) 1966–1967.

[31] N. Malhis, Y. Butterfield, M. Ester, S. Jones, Slider: maximum use of probability information for alignment of short sequence reads and snp detection, Bioinformatics 25 (2009) 6–13.

[32] B. Morgenstern, Dialign: multiple DNA and protein sequence alignment at bibiserv, Nucleic Acids Res. 32 (2004) W33–W36.

[33] T. Muller, R. Spang, M. Vingron, Estimating amino acid substitution models: a comparison of dayhoff’s estimator, the resolvent approach and a maximum likelihood method, Mol. Biol. Evol. 19 (1) (2002) 8–13.

[34] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970) 443–453.

[35] Z. Ning, A. Cox, J. Mullikin, Ssaha: a fast search method for large DNA databases, Genome Res. 11 (2001) 1725–1729.

[36] C. Notredame, D. Higgins, J. Heringa, T-coffee: a novel method for multiple sequence alignments, J. Mol. Biol. 302 (2000) 205–217.

[37] S. Ossowski, K. Schneeberger, R. Clark, C. Lanz, N. Warthmann, D. Weigel, Sequencing of natural strains of arabidopsis thaliana with short reads, Genome Res. 18 (2008) 2024–2033.

[38] K. Prufer, U. Stenzel, M. Dannemann, R. Green, M. Lachmann, J. Kelso, Patman: rapid alignment of short sequences to large databases, Bioinformatics 24 (2008) 1530–1531.

[39] S. Rumble, P. Lacroute, A. Dalca, M. Fiume, A. Sidow, M. Brudno, Shrimp: accurate mapping of short color-space reads, PLoS Comput. Biol. 5 (2009) e1000386.

[40] F. Saeed, A.A. Khokhar, A domain decomposition strategy for alignment of multiple biological sequences on multiprocessor platforms, J. Parallel Distrib. Comput. 69 (7) (2009) 666–677.

[41] F. Saeed, A. Khokhar, Sample-align-d: a high performance multiple sequence alignment system using phylogenetic sampling and domain decomposition, in: Proc. 23rd IEEE International Parallel and Distributed Processing Symposium, April 2007.

[42] F. Saeed, A. Khokhar, O. Zagordi, N. Beerenwinkel, Multiple sequence alignment system for pyrosequencing reads, in: Proceedings of the 1st International Conference on Bioinformatics and Computational Biology, BICoB, 2009, pp. 362–375.

[43] M. Schatz, Cloudburst: highly sensitive read mapping with mapreduce, Bioinformatics 25 (2009) 1363–1369.

[44] M.S.R. Schmid, S.C. Schuster, D. Huson, Readsim-a simulator for sanger and 454 sequencing, 2006.

[45] K. Schneeberger, J. Hagmann, S. Ossowski, N. Warthmann, S. Gesing, O. Kohlbacher, D. Weigel, Simultaneous alignment of short reads against multiple genomes, Genome Biol. 10 (9) (2009) R98.

[46] C. Setubal, J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing, 1997.

[47] A. Smith, Z. Xuan, M. Zhang, Using quality scores and longer reads improves accuracy of solexa read mapping, BMC Bioinformatics 9 (2008) 128.

[48] J. Thompson, D. Higgins, T.J. Gibson, Clustal w: improving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 222 (1994) 4673–4690.

[49] J.D. Thompson, F. Plewniak, O. Poch, Balibase: a benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics 15 (1) (1999) 87–88.

[50] L. Wang, T. Jiang, On the complexity of multiple sequence alignment, J. Comput. Biol. 1 (4) (1994) 337–348.

[51] Wang, Mitsuya, Gharizadeh, Ronaghi, Shafer, Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance, Genome Res. 17 (8) (2007) 1195–1201.

[52] D. Weese, A. Emde, T. Rausch, A. Doring, K. Reinert, Razers: fast read mapping with sensitivity control, Genome Res. 19 (2009) 1646–1654.

[53] G. Zeller, R. Clark, K. Schneeberger, A. Bohlen, D. Weigel, G. Ratsch, Detecting polymorphic regions in arabidopsis thaliana with resequencing microarrays, Genome Res. 18 (2008) 918–929.

Fahad Saeed is a Research Fellow in the Epithelial Systems Biology Laboratory (ESBL) at the National Heart Lung and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, Maryland. He was a postdoctoral fellow in the same lab from Aug 2010 to March 2011. He received his Ph.D. in the Department of Electrical and Computer Engineering, University of Illinois at Chicago in 2010. He has served as visiting scientist in the Department of Bio-Systems Science and Engineering (D-BSSE), ETH Zürich and the Swiss Institute of Bioinformatics (SIB) in 2008 and 2009 respectively. Dr. Saeed is the recipient of numerous national and international awards and honors. He is serving as the Program co-chair of the 4th Bioinformatics and Computational Biology conference (BICoB), 2012. He has been nominated for the best dissertation award of the year at the University of Illinois in the year 2010. He was the recipient of an NIH postdoctoral fellowship for the year 2010–2011 and recipient of a ThinkSwiss fellowship for the years 2007 and 2008 by the government of Switzerland. His research interests include parallel and distributed algorithms and architectures, proteomics and genomics, and parallel algorithms for bioinformatics applications. He is a member of IEEE and ACM.

Alan Perez-Rathke received his B.S. and M.S. degrees in computer science from the University of Illinois at Urbana–Champaign and Chicago campuses respectively. His thesis work has largely focused on applications of high performance computing to the field of bioinformatics. Prior to becoming a graduate student, Alan worked for several years as a software engineer at a large video game company. He now wants to leverage his technical skills to help promote research in genomics, proteomics, and medicine, particularly in the area of cancer. Currently, he hopes to fulfill his goal of combining computer science and medicine through acceptance into a combined MD/Ph.D. program.

Jaroslaw Gwarnicki received his B.S. and M.S. degrees in Computer Science from the University of Illinois at Chicago. Since being awarded his Bachelor's degree he has been employed at a major gaming industry company where, as a Senior Software Engineer, he specializes in the area of optimizations and parallel computing. His plans for the future involve applying the skills and ideas obtained through the presented research to improve and advance the game development industry.

Tanya Berger-Wolf is an Associate Professor in the Department of Computer Science at the University of Illinois at Chicago, where she heads the Computational Population Biology Lab. Her research interests are in applications of computational techniques to problems in population biology of plants, animals, and humans, from genetics to social interactions. Dr. Berger-Wolf received her Ph.D. in Computer Science from the University of Illinois at Urbana–Champaign in 2002. She has spent two years as a postdoctoral fellow at the University of New Mexico working in computational phylogenetics and a year at the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) doing research in computational epidemiology. She has received numerous awards for her research and mentoring, including the US National Science Foundation CAREER Award in 2008 and the UIC Mentor of the Year Award in 2009.

Ashfaq Khokhar received his Ph.D. in Computer Engineering from the University of Southern California in 1993. After his Ph.D., he spent two years as a Visiting Assistant Professor in the Department of Computer Sciences and School of Electrical and Computer Engineering at Purdue University. In 1995, he joined the Department of Electrical and Computer Engineering at the University of Delaware, where he first served as Assistant Professor and then as Associate Professor. In Fall 2000, Dr. Khokhar joined UIC in the Department of Computer Science and Department of Electrical and Computer Engineering, where he currently serves at the rank of Professor. Dr. Khokhar has published over 200 technical papers and book chapters in refereed conferences and journals in the areas of wireless networks, multimedia systems, data mining, and high performance computing. He is a recipient of the NSF CAREER award in 1998. His paper entitled ‘‘Scalable S-to-P Broadcasting in Message Passing MPPs’’ won the Outstanding Paper award in the International Conference on Parallel Processing in 1996. He has served as the Program Chair of the 17th Parallel and Distributed Computing Conference (PDCS), 2004, Vice Program Chair for the 33rd International Conference on Parallel Processing (ICPP), 2004, and Program Chair of the IEEE Globecom Symposium on Ad hoc and Sensor Networks, 2009. He is a Fellow of IEEE for his contributions to multimedia computing and databases. His research interests include wireless and sensor networks, multimedia systems, data mining, and high performance computing.