
Post on 21-Dec-2015


Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST

Larissa Smelkov

Biological Sequence Alignment

Global
• Goal: To identify conserved regions and differences
• Algorithm: Needleman-Wunsch
• Application: Comparing two genes with same function (human vs. mouse); comparing two proteins with similar function

Local
• Goal: To see whether 2 strings have a common substring
• Algorithm: Smith-Waterman
• Application: Searching for local similarities in large sequences (newly sequenced genomes); looking for motifs in 2 proteins

Protein Responsible for Iron Transport

Human

MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLKMLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMTVRELVAIGRYPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGGERQRAWIAMLVAQDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVIAVLHDINMAARYCDYLVALRGGEMIAQGTPAEIMRGETLEMIYGIPMGILPHPAGAAPVSFVY

Chicken

MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNLRDLTQQERISLTCVQKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYKLKPIAAEIYEHTEGSTTSYYAVAVVKKGTEFTVNDLQGKTSCHTGLGRSAGWNIPIGTLLHWGAIEWEGIESGSVEQAVAKFFSASCVPGATIEQKLCRQCKGDPKTKCARNAPYSGYSGAFHCLKDGKGDVAFVKHTTVNENAPDLNDEYELLCLDGSRQPVDNYKTCNWARVAAHAVVARDDNKVEDIWSFLSKAQSDFGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIMLKRVPSLMSQLYLGFEYYSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRWSVVSNGDVECTVVDETKDCIIKIMKGEADAV

Similar Substrings

DSLSGGERQ-RA-WIAMLVAQDSRC
: :::  :: :  ::: : :  : :
DQLSGSPRQNRIQWIAVLKAEKSKC

Talk Outline

• Problem Description
• Smith-Waterman Algorithm
• BLAST
• ParAlign
• TurboBLAST
• Comparison

Problems of Comparison of 2 Sequences

Evolution Factor
• Additions
• Deletions
• Substitutions

Human Factor
• Typos
• Duplicates

Solution

Smith-Waterman Algorithm (S-W)
• Score Matrix
• Gap Penalty

Score Matrix: BLOSUM45

A R N D C Q E G H I L K M F P S T W Y V B Z X

A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -2 -2 0 -1 -1 0

R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 -1 -2 -1 0 -1

N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3 4 0 -1

D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3 5 1 -1

C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 -2 -3 -2

Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3 0 4 -1

E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3 1 4 -1

G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3 -1 -2 -1

H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 2 -3 0 0 -1

I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3 -3 -3 -1

L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1 -3 -2 -1

K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 -1 -2 0 1 -1

M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1 -2 -1 -1

F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0 -3 -3 -1

P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3 -2 -1 -1

S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1 0 0 0

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5 -3 -1 0 0 -1 0

W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15 3 -3 -4 -2 -2

Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1 -2 -2 -1

V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5 -3 -3 -1

B -1 -1 4 5 -2 0 1 -1 0 -3 -3 0 -2 -3 -2 0 0 -4 -2 -3 4 2 -1

Z -1 0 0 1 -3 4 4 -2 0 -3 -2 1 -1 -3 -1 0 -1 -2 -2 -3 2 4 -1

X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -2 -1 -1 -1 -1 -1

Pairwise Alignment Example

ELEPHANT

PANTHER

S-W: Dynamic Programming Matrix

        E   L   E   P   H   A   N   T
    0   0   0   0   0   0   0   0   0
P   0   0  -3   0   9  -2  -1  -2  -1
A   0  -1  -1  -1  -1  -2   5  -1   0
N   0   0  -3   0  -2   1  -1   6   0
T   0  -1  -1  -1  -1  -2   0   0   5
H   0   0  -2   0  -2  10  -2   1  -2
E   0   6  -3   6   0   0  -1   0  -1
R   0   0  -2   0  -2   0  -2   0  -1

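S-W: Formula

\[
T[i, j] = \max
\begin{cases}
T[i-1, j-1] + \mathrm{score}(s[i], t[j]) \\
T[i-1, j] - g \\
T[i, j-1] - g \\
0
\end{cases}
\]

g: gap penalty

g = 8 (in our example)

Each new cell T[i, j] (marked "?" on the slide) is computed from its three neighbours T[i-1, j-1], T[i-1, j] and T[i, j-1].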
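The recurrence can be tried out on the ELEPHANT / PANTHER example. The following is a minimal Python sketch, not the presenter's code: it assumes the substitution scores listed in the example matrix, a linear gap penalty g = 8, and a simple traceback that stops at the first zero cell. The names `SCORE` and `smith_waterman` are illustrative. Under these assumptions the best local score works out to 17 for PHANT against P-ANT; individual cells need not agree with every value in the hand-filled matrix on the slides.

```python
# Substitution scores copied from the ELEPHANT / PANTHER example matrix.
COLS = "ELEPHANT"
ROWS = {
    "P": [0, -3, 0, 9, -2, -1, -2, -1],
    "A": [-1, -1, -1, -1, -2, 5, -1, 0],
    "N": [0, -3, 0, -2, 1, -1, 6, 0],
    "T": [-1, -1, -1, -1, -2, 0, 0, 5],
    "H": [0, -2, 0, -2, 10, -2, 1, -2],
    "E": [6, -3, 6, 0, 0, -1, 0, -1],
    "R": [0, -2, 0, -2, 0, -2, 0, -1],
}
SCORE = {(a, b): v for a, row in ROWS.items() for b, v in zip(COLS, row)}


def smith_waterman(s, t, g=8):
    """Return (best score, aligned part of s, aligned part of t)."""
    T = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            T[i][j] = max(
                T[i - 1][j - 1] + SCORE[s[i - 1], t[j - 1]],  # match / mismatch
                T[i - 1][j] - g,                              # gap in t
                T[i][j - 1] - g,                              # gap in s
                0,                                            # start a new local alignment
            )
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_s, out_t = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        if T[i][j] == T[i - 1][j - 1] + SCORE[s[i - 1], t[j - 1]]:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] - g:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


score, panther_part, elephant_part = smith_waterman("PANTHER", "ELEPHANT")
print(score)
print(elephant_part)
print(panther_part)
```

The two nested loops make the O(mn) cost of S-W visible: every cell of the matrix is filled exactly once.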

S-W: Dynamic Programming Matrix (filled in)

        E   L   E   P   H   A   N   T
    0   0   0   0   0   0   0   0   0
P   0   0   0   0   9   1   0   0   0
A   0   0   0   0   1   7   6   0   0
N   0   0   0   0   0   2   4  12   4
T   0   0   0   0   0   0   2   4  19
H   0   0   0   0   0  10   2   3  11
E   0   6   0   6   0   2   9   1   3
R   0   0   4   0   4   0   1   9   1

S-W: Result Alignment

ELEPHANT
   : :::
   P-ANTHER

S-W: Summary

Uses
• Score matrix
• Gap penalties

Complexity: O(mn)

Sensitivity: High

Growth of GenBank

~33 million sequences as of Feb. 14, 2004

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

BLAST: Basic Local Alignment Search Tool

BLAST: Steps

• Divide both sequences into words of length w (default w = 3)
• Calculate score for each pair
• Extend high-scored pairs to increase score

BLAST: Divide Sequences

PANTHER: PAN, ANT, NTH, THE, HER

ELEPHANT: ELE, LEP, EPH, PHA, HAN, ANT

BLAST: Calculate Score

Word1   Word2   Letter scores   Score
ELE     PAN       0  -1   0       -1
LEP     PAN      -3  -1  -2       -6
EPH     PAN       0  -1   1        0
PHA     PAN       9  -2  -1        6
HAN     PAN      -2   5   6        9
ANT     PAN      -1  -1   0       -2

BLAST: Sort Pairs on Score

Word1   Word2   Score
ANT     ANT       16
PAN     HAN        9
PAN     PHA        6
PAN     EPH        0
PAN     ELE       -1
PAN     ANT       -2
PAN     LEP       -6
…       …          …
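The first two BLAST steps (splitting into words and scoring word pairs) can be sketched in a few lines of Python. This is an illustration under assumptions, not BLAST's actual implementation: it scores every PANTHER word against every ELEPHANT word using the letter scores from the example matrix, with w = 3 as on the slide. The helper names `words` and `word_score` are made up for this sketch.

```python
from itertools import product

# Substitution scores copied from the ELEPHANT / PANTHER example matrix.
COLS = "ELEPHANT"
ROWS = {
    "P": [0, -3, 0, 9, -2, -1, -2, -1],
    "A": [-1, -1, -1, -1, -2, 5, -1, 0],
    "N": [0, -3, 0, -2, 1, -1, 6, 0],
    "T": [-1, -1, -1, -1, -2, 0, 0, 5],
    "H": [0, -2, 0, -2, 10, -2, 1, -2],
    "E": [6, -3, 6, 0, 0, -1, 0, -1],
    "R": [0, -2, 0, -2, 0, -2, 0, -1],
}
SCORE = {(a, b): v for a, row in ROWS.items() for b, v in zip(COLS, row)}


def words(seq, w=3):
    """All overlapping words of length w, in order of position."""
    return [seq[k:k + w] for k in range(len(seq) - w + 1)]


def word_score(panther_word, elephant_word):
    """Sum of letter-by-letter substitution scores for two equal-length words."""
    return sum(SCORE[x, y] for x, y in zip(panther_word, elephant_word))


# Score every word pair and sort with the highest score first.
pairs = sorted(
    ((p, e, word_score(p, e))
     for p, e in product(words("PANTHER"), words("ELEPHANT"))),
    key=lambda item: item[2],
    reverse=True,
)

for p, e, s in pairs[:5]:
    print(p, e, s)
```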

BLAST: Extension

(Figure: the filled-in dynamic programming matrix and the ELEPHANT / PANTHER score matrix from the S-W example, repeated for the extension step.)
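The extension step can be illustrated with a small ungapped extension routine. This is a sketch under assumptions: the slides do not give the stopping rule, so a simple drop-off threshold (`x_drop`, a hypothetical parameter chosen here) ends the extension once the running score falls too far below the best score seen. Letter scores are those of the example matrix, and `extend` is an illustrative name.

```python
# Substitution scores copied from the ELEPHANT / PANTHER example matrix.
COLS = "ELEPHANT"
ROWS = {
    "P": [0, -3, 0, 9, -2, -1, -2, -1],
    "A": [-1, -1, -1, -1, -2, 5, -1, 0],
    "N": [0, -3, 0, -2, 1, -1, 6, 0],
    "T": [-1, -1, -1, -1, -2, 0, 0, 5],
    "H": [0, -2, 0, -2, 10, -2, 1, -2],
    "E": [6, -3, 6, 0, 0, -1, 0, -1],
    "R": [0, -2, 0, -2, 0, -2, 0, -1],
}
SCORE = {(a, b): v for a, row in ROWS.items() for b, v in zip(COLS, row)}


def extend(s, t, i, j, w=3, x_drop=5):
    """Extend the seed s[i:i+w] / t[j:j+w] without gaps in both directions.

    Returns (best score, extended part of s, extended part of t).
    x_drop is an assumed cut-off, not a value taken from the slides.
    """
    best = sum(SCORE[x, y] for x, y in zip(s[i:i + w], t[j:j + w]))
    start_s, start_t, end_s, end_t = i, j, i + w, j + w

    # Extend to the right, remembering the best-scoring end point.
    run, p, q = best, end_s, end_t
    while p < len(s) and q < len(t):
        run += SCORE[s[p], t[q]]
        p, q = p + 1, q + 1
        if run > best:
            best, end_s, end_t = run, p, q
        elif best - run > x_drop:
            break

    # Extend to the left, remembering the best-scoring start point.
    run, p, q = best, start_s, start_t
    while p > 0 and q > 0:
        p, q = p - 1, q - 1
        run += SCORE[s[p], t[q]]
        if run > best:
            best, start_s, start_t = run, p, q
        elif best - run > x_drop:
            break

    return best, s[start_s:end_s], t[start_t:end_t]


# Seed PAN / HAN (score 9) grows by one matching T on the right.
print(extend("PANTHER", "ELEPHANT", 0, 4))
# Seed ANT / ANT (score 16) cannot be improved by extension.
print(extend("PANTHER", "ELEPHANT", 1, 5))
```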

BLAST: Summary

Uses
• Score matrix
• Gap penalties
• Heuristics to reduce computations

Complexity: O(m) with O(n) processors

Sensitivity: Low

Sensitivity

Task: Align 2 sequences: AXBXCXDXE and ABCDE

Smith-Waterman:

AXBXCXDXE
: : : : :
A-B-C-D-E

BLAST: Ø (no similar substrings)

S-W vs. BLAST

(Figure: S-W and BLAST placed on axes of speed and sensitivity.)

S-W and BLAST

Using them now
• Too costly
• Inefficient
• Time-consuming

Solution
• More heuristics
• More parallelism

ParAlign

ParAlign: Steps

• Find ungapped alignments
• Calculate approximate alignment scores
• Choose high-scored sequences
• Apply S-W

ParAlign: Microparallelism

• Divide wide registers into smaller units
• Perform the same operation on different data sources
• Modern microprocessors have this technology built in
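The idea of splitting one wide register into smaller units can be imitated in plain Python integers. The sketch below is only an emulation for illustration: it packs eight 8-bit lanes into one 64-bit word and adds all lanes at once. The slides do not say which hardware instructions ParAlign uses, so nothing here should be read as its implementation; `pack`, `unpack` and `packed_add` are made-up names.

```python
LANES = 8                       # eight 8-bit units inside one 64-bit word
HIGH = 0x8080808080808080       # top bit of every lane
LOW = 0x7F7F7F7F7F7F7F7F        # lower seven bits of every lane


def pack(values):
    """Pack eight small integers (0..255) into one 64-bit word."""
    word = 0
    for k, v in enumerate(values):
        word |= (v & 0xFF) << (8 * k)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * k)) & 0xFF for k in range(LANES)]


def packed_add(a, b):
    """Add all eight lanes with one pass; each lane wraps around modulo 256.

    The low seven bits are added normally (a carry cannot leave its lane),
    then the top bit of each lane is fixed up with XOR.
    """
    return ((a & LOW) + (b & LOW)) ^ ((a ^ b) & HIGH)


scores = pack([1, 2, 3, 4, 5, 6, 7, 250])
bonus = pack([10] * LANES)
print(unpack(packed_add(scores, bonus)))
```

A single `packed_add` call replaces eight separate additions, which is the kind of saving the slide attributes to microparallelism.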

ParAlign: Calculate Scores in Parallel

(Figure: the ELEPHANT / PANTHER score matrix from the S-W example.)

ParAlign: Estimate of Gaps

        E   L   E   P   H   A   N   T
    0   0   0   0   0   0   0   0   0
P   0   0  -3   0   9  -2  -1  -2  -1
A   0  -1  -1  -4  -1   7   3  -2  -2
N   0   0  -4  -1  -6   0   6   9  -2
T   0  -1  -1  -5  -2  -8   1   6  14
H   0   0  -3  -1  -7   8 -10   2   4
E   0   6  -3   3  -1  -7   7 -10   1
R   0   0   4  -3   1  -1  -9   7 -11

ParAlign: Apply S-W in Parallel

(Figure: the same matrix as on the "ParAlign: Estimate of Gaps" slide.)

ParAlign: Summary

Uses
• SIMD technology (single instruction, multiple data)
• S-W Algorithm
• Heuristics to reduce computations

Requirement for machine: Modern microprocessor

Speed: Fast

Sensitivity: Medium

TurboBLAST

TurboBLAST: Steps

• Divide the job: parts of query against partition of database
• Apply BLAST
• Merge results
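The divide / apply / merge pattern above can be sketched as follows. This is a toy illustration, not TurboBLAST itself: `toy_search` is a hypothetical stand-in for a real BLAST run (it only counts shared 3-letter words), and the sequences, partition layout and chunk size are invented for the example.

```python
from itertools import product


def make_tasks(queries, db_parts, chunk_size):
    """Pair every chunk of queries with every database partition."""
    chunks = [queries[k:k + chunk_size]
              for k in range(0, len(queries), chunk_size)]
    return list(product(chunks, db_parts))


def toy_search(query, subject):
    """Hypothetical stand-in for BLAST: number of shared 3-letter words."""
    q = {query[k:k + 3] for k in range(len(query) - 2)}
    s = {subject[k:k + 3] for k in range(len(subject) - 2)}
    return len(q & s)


def run_task(task):
    """Search one chunk of queries against one database partition."""
    chunk, part = task
    return [(q, subj, toy_search(q, subj)) for q in chunk for subj in part]


def merge(results):
    """Combine per-task hit lists into one ranked hit list per query."""
    merged = {}
    for hits in results:
        for q, subj, score in hits:
            merged.setdefault(q, []).append((score, subj))
    return {q: sorted(hits, reverse=True) for q, hits in merged.items()}


queries = ["PANTHER", "ELEPHANT"]
db_parts = [["ANTHEM", "PHANTOM"], ["PANTRY"]]   # two database partitions

tasks = make_tasks(queries, db_parts, chunk_size=1)
merged = merge(run_task(task) for task in tasks)
for q, hits in merged.items():
    print(q, hits)
```

Because every task is independent, the calls to `run_task` could run on different machines; only `merge` needs to see all the results.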

TurboBLAST: Implementation

A three-tier system

Components
• Client
• Master
• Workers

TurboBLAST: Schema

Client
• Divides job into tasks
• Writes results to file

Master (TurboHub and File Provider)
• Sets up tasks
• Manages execution
• Coordinates Workers
• Provides VSM

Workers
• Divide task
• Schedule subtasks
• Solve subtasks
• Merge results

Arrows in the diagram: job, tasks, task, request, DB request, DB part, results.

It distributes work not by pushing it out, but by posting information about what work needs to be done and letting the machines grab work from the remote locations.
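The "post the work and let the machines grab it" model can be imitated with a shared queue. The sketch below is an analogy using Python threads, assuming nothing about TurboHub's real (Java) interfaces: tasks are posted on a shared board, and each worker pulls the next task whenever it is free. The squaring step is a placeholder for a BLAST computation.

```python
import queue
import threading


def worker(name, board, results):
    """Keep grabbing posted tasks until none are left."""
    while True:
        try:
            task = board.get_nowait()      # pull work; nothing is pushed to us
        except queue.Empty:
            return
        results.put((name, task, task * task))   # placeholder computation


# Post all tasks up front; no task is assigned to a particular worker.
board = queue.Queue()
for task in range(10):
    board.put(task)

results = queue.Queue()
workers = [
    threading.Thread(target=worker, args=(f"worker-{k}", board, results))
    for k in range(3)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

finished = []
while not results.empty():
    finished.append(results.get())

print(sorted((task, answer) for _, task, answer in finished))
```

A faster worker simply comes back for more tasks sooner, so load balancing falls out of the pull model without a central scheduler deciding who gets what.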

TurboBLAST: Client

• Takes a BLAST job and divides it into a number of initial BLAST tasks
• Submits these tasks to the Master
• Retrieves the results and writes them to file

TurboBLAST: Master

• Accepts tasks from Clients and sets them up for processing by the Workers
• Includes TurboHub (the server portion of a parallel execution system)
• Includes File Provider (Java application that manages the databases)

TurboBLAST: Worker

• Workers are processors
• Run a Java application and perform the BLAST computations
• Merge the results
• Are responsible for scheduling

TurboHub

TurboHub is an execution engine for parallel and distributed Java applications

• Scalable high performance
• Wide range of computing environments
• Manages the flow of data through the workflows
• Schedules the components
• Transforms data between components
• Balances load
• Handles errors

TurboBLAST: TurboHub

• Manages task execution
• Coordinates the Workers
• Provides a virtual shared memory
• Supports dynamic changes in the set of Workers
• Supports fault tolerance

TurboBLAST: File Provider

• Maintains a copy of each database
• Delivers all or part of each database to Workers as they require them

TurboBLAST: Advantages

• Size of each task is optimal: processing is efficient on the processor that computes the task
• Large set of tasks: no waste of time for processors
• No algorithm change
  • Support for all flavors of BLAST
  • Easy to update
• Applicable to different environments (PC, Macintosh …)

TurboBLAST: Experiment

Input data
• 500 proteins
• 200-400 amino acids in each

Database
• 1,681,522,266 sequences

Hardware
• IBM Linux cluster
• 8 dual-processor workstations
  • 2 Pentium III processors, 996 MHz each
  • 2 Gbyte memory
• 100 Mbit Ethernet

TurboBLAST: Results of Experiment

(Figure: speed-up versus number of processors, 2 to 16, comparing TurboBLAST with the ideal line X=Y.)

TurboBLAST: Summary

Divide and Conquer: use many copies of BLAST in parallel

Uses: BLAST Algorithm

Requirement for each machine
• Java VM
• Local BLAST executable

Speed: Very fast

Sensitivity: Low

Comparison of Algorithms/Products

(Figure: S-W, BLAST, ParAlign and TurboBLAST placed on axes of speed and sensitivity.)

References

R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing. "TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub". Intl. Parallel and Distributed Processing Symposium (IPDPS), 2002.

T. Rognes. "ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches". Oxford University Press, 2001.

Don’t ask any Questions, please…

PS

Web site where you can donate your computer time to take part in the search for methods to cure cancer:

http://www.the-optimists.org.uk