high performance direct pairwise comparison of genomic sequences christopher mueller, mehmet...

High Performance Direct Pairwise Comparison of Genomic Sequences

Christopher Mueller, Mehmet Dalkilic, Andrew Lumsdaine

HiCOMB

April 4, 2005

Denver, Colorado

April 4, 2005 High-Performance Direct Pairwise Comparison of Large Genomic Sequences 2

Introduction Goals

Generate data for large format visualization Exploit parallel features present in commodity hardware

SIMD/vector processors SMP/multiple processors per machine Clusters

Genome Comparison Dot plot is the only complete method for comparing genomes Often ruled out due to quadratic running time Size of data has an upper bound and modern hardware is

approaching the point where this bound is (almost) within reach Target Data

DNA sequences, one direction (5’ to 3’) Target Platform

Apple dual processor G5, Altivec vector processor


Related Work

BLAST Apple and Genentech (AGBLAST), 5x speedup using

Altivec Smith-Waterman

Rognes and Seeberg, 6x speedup using MMX HMMER

Erik Lindahl, 30% improvement using Altivec Hardware Solutions

Various commercial FPGA solutions exist for different algorithms (e.g., TimeLogic’s DeCypher platform offers BLAST, HMM, SW)


SIMD Overview

3

2

5

3 2 1 4

2 4 5 9

5 6 6 13

+

Normal SIMD

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Image from http://developer.apple.com/hardware/ve

Single Instruction, Multiple Data Perform the same operation on many

data items at once Vector registers can be divided

according to the data type The Altivec registers in the G5 are 128

bits wide. Vector programming using gcc on

Apple G5s is one step removed from assembly programming

Functions are thin wrappers around assembly calls

The optimizer does not cover vector operations

Memory loads and stores are handled by the programmer and must be properly byte aligned


The Dot Plot

QuickTime™ and aTIFF (LZW) decompressor


Dotplot comparing the human and fly mitochondrial genomes (generated by DOTTER)

NAÏVE_DOTPLOT(qseq, sseq, win, strig): // qseq - column sequence // sseq - row sequence // win - number of elements to compare // for each point // strig - number of matches required // for a point

for each q in qseq: for each s in sseq: score = 0 for each (q’, s’) in (qseq[q:q+win], s[s:s+win]): if q’ == s’: score += 1 end if q’ end for each (q’,s’)

if score > strig: AddDot(q, s) end if score

end for each s end for each q

qseq

sseq

win = 3strig = 2


The Standard Algorithm

STD_DOTPLOT(qScores, s, win, strig): dotvec = zeros(len(q))

for each char c in s: dotvec = shift(dotvec, 1) dotvec += qScores[c]

if index(c) > win: delchar = s[index(c) - win] dotvec -= shift(qScores[delchar], win)

for each dot in dotvec > strig: display(dot) end for each dot

end for iend DOTPLOT




Data Parallel Dot Plot

VECTOR_DOTPLOT(qScores, s, win, strig):

// Group diagonals by the upper and lower // triangular sections of the martix for each vector diagonal D: runningScore = vector(0)

for each char c in s: score = VecLoad(qScores[c]) runningScore = VecAdd(score, r_score)

if index(c) > win: delChar = s[index(c) - win] delscore = VecLoad(qScores[delChar]) runningScore = VecSub(score, delscore)

if VecAnyElementGte(runningScore, strig): scores = VectorUnpack(runningScore) for each score in scores > strig: Output(row(c), col(score), score) end for each score end if VecGte()

end for each c end for each Dend VECTOR_DOTPLOT




Coarse Grained Parallelism

Block Level Parallelism Block the matrix into columns Overlap by the number of

characters in the window

Single Machine Run one thread per processor Create one memory mapped file

per processor

Cluster Run one instance per machine

and one thread per processor. Store results locally (e.g. /tmp)


Model-driven Implementation

Goal: Break the algorithm into basic operations that can be modeled independently to understand the performance

issues at each step.

Data Streams(data read speed)

Vector Operations(instruction throughput)

Sparse Matrix Format

Data output

(data write speed)


Data Stream Models

Single stream pointer is similar to indexing, but a little slower

For the four score streams, indexed 1/4 of the time, maintaining the pointers costs more than lookup

Pointer vs. Index numbers varied based on the compiler version

// Base case// S-sequence is one stream pointers++;

// Q-sequence is four streamsuchar *qScore[4];

// Option 1: Four Pointers// Keep pointers to the current// position in the score vectorsqScore[0]++;qScore[1]++;qScore[2]++;qScore[3]++;score = *qScore[*s];

// Option 2: Index// Index the score vectors with // a counteri++;score = qScore[*s][i];

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

Pointer Four Pointers Index

Data Stream Performance (Mops)


Vector Performance Models

Attempts to model infrequent write operations were unsuccessful

Storing all dots yields high performance, but this is not practical for large comparisons

StoreFreq provides a lower bound on performance

// Model Variablesuchar *data = randseq(), out[16]; long i = 0, l = len(data);vector uchar sum = 0, value; // VecAddfor(i = 0; i < l - 16; i++) { value = VecLoad(data[i]); sum = VecAdd(value, sum);}

// StoreAllfor(i = 0; i < l - 16; i++) { value = VecLoad(data[i]); sum = VecAdd(value, sum); out = VecStore(sum); Save(out);}

// StoreFreqint freq = l * alpha;for(i = 0; i < l - 16; i++) { value = VecLoad(data[i]); sum = VecAdd(value, sum); if(i % freq) { // Pipeline stall! out = VecStore(sum); Save(out); }}

Vector Model Performance (Mops)


Pipeline Management// Sequence of Vector Operations

// score = VecLoad(qScores[c])score1 = vec_ld(0, ptemp); // unalgined score2 = vec_ld(16, ptemp); // loads vperm = vec_lvsl(0, ptemp); score = vec_perm(score1, score2, vperm);

runningScore = vec_add(score, r_score);

// delscore = VecLoad(qScores[delChar])score1 = vec_ld(0, ptemp); score2 = vec_ld(16, ptemp); vperm = vec_lvsl(0, ptemp); delscore = vec_perm(score1, score2, vperm);

runningScore = vec_sub(score, delscore);

if(vec_any_ge(runningScore, strig)) { scores = vec_st(runningScore) // Main processor for(i = 0; i < 16; i++) { if(hit[i] > ustrig ) { dm.AddDot(y, x + i, hit[i]); } }}



Each line shows each cycle for one instruction. Instructions are offset (x-axis) based on starting time. Time flows from top to bottom (y-axis).

The left plot shows a series of add/delete steps with no dots generated.

The bottom plot shows the pipeline being interrupted when a dot is generated.



Cycle-accurate Plots of the Instructions in Flight


Sparse Matrix Format// Option 1// std::vector CSR-eqse Sparse Matrixstruct Dot { int col; int value;};

struct Row { int num; vector<Dot> cols;};

typedef vector<Row*> DotMatrixVec;

// Option 2// Memory Mapped Coordinate-wise// Sparse Matrixstruct RowDot { int row; int col; int value;};

RowDot *out = (RowDot*)mmap(…);

Sparse Matrix Format Performance (Mops)

Both approaches required some maintenance to avoid exhausting main memory

mmap avoids a second pass through the data during the save step

3.85x

6.78x

1.0x

0

100

200

300

400

500

600

700

800

900

1000

DOTTER base std::vector mmap


Data Location

Large, shared data is often located on network drives

This adds a network hop for all disk I/O

Even for infrequent I/O, this can significantly affect performance

Data Location Performance (Mops)

The std::vector sparse matrix had a slight benefit.

The mmap sparse matrix improved significantly with local data storage.

1.0x1.0x

1.35x

1.98x

0

100

200

300

400

500

600

700

800

900

1000

std::vector mmap

NFSLocal


Traditional Manual Optimizations

Prefetch G5 hardware prefetch is very good Attempts to optimize had negative impact

Blocking Slight negative impact due to burps in the stream

Unrolling Complicated code very quickly No measurable improvement


System Details Apple Dual 2.0 GHz G5, 3.5 GB RAM 100 Mbit network to file server OS X 10.3.5 (Darwin Kernel Version 7.5.0) g++ 3.3 (build 1620)

-O3 -fast (different compiler, aggressive optimizations) -altivec (limited optimizations) Upgrade from 1614 to 1620 improved DOTTER’s performance

by 30% Libraries

Boost::thread Data (from GenBank)

Mitochondrial genomes E. Coli, Listeria bacterial genomes


0

200

400

600

800

1000

1200

1400

1600

1800

2000

DOTTER (1 CPU) Data-parallel,mitochondrial (1 CPU)

Data-parallel,mitochondrial (2 CPU)

Data-parallel, bacterial(2 CPU)

Results

Single Machine Mitochondrial (~20 kbp)

DOTTER vs. Data-parallel Bacterial (4.5 Mbp)

Data-parallel only

Cluster (16 dual processor 2.3 GHz G5s)

Bacterial Comparison 92 min, 8 sec (1 node) 5 min, 42 sec (16 nodes)

Final Results (Mops)

7.0x

13.0x

Scalability (time/nodes)

0

5

10

15

20

25

30

35

40

45

0 2 4 6 8 10 12 14 16 18

Number of Nodes

Gops

Actual

Linear

1.0x

Scalability


Visualization

Results rendered to PDF Target Displays

2x4, 6400x2400 tiled display wall

IBM T221, 3840x2400, 204 dpi display Magnifying glass required

High resolution formats 600 dpi laser printer 1200 dpi ink jet printer High resolution, no

interactivity


Conclusions

Modern commodity hardware is close to providing the performance necessary for large direct genomic comparisons. 5,000,000 base pair sequences are realistic (bacteria) 50,000,000 base pair sequences are possible (small

human chromosomes) It is important to take a careful, experimental

approach to implementation and to test all assumptions.


Acknowledgements

Jeremiah Willcock helped develop the initial prototype Eric Wernert, Craig Jacobs, and Charlie Moad from the

UITS Advanced Visualization Lab at Indiana University provided visualization support

This work was supported by a grant from the Lilly Endowment

ReferencesApple Developer’s Connection, Velocity Engine and Xcode, from, Apple Developer Connection, Cupertino, CA, 2004.

http://developer.apple.com/hardware/ve http://developer.apple.com/tools/xcode

A. J. Gibbs and G. A. McIntyre, The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences, Eur J Biochem, 16 (1970), pp. 1-11.

E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix Program with Dynamic Threshold Control Suited for Genomic DNA and Protein-Sequence Analysis, Gene-Combis, 167 (1995), pp. 1-10.

http://developer.apple.com/hardware/ve

http://developer.apple.com/tools/xcode

high performance direct pairwise comparison of genomic sequences christopher mueller, mehmet...

Documents