Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST
Larissa Smelkov
Post on 21-Dec-2015
Biological Sequence Alignment
Local
• Goal: To see whether 2 strings have a common substring
• Algorithm: Smith-Waterman
• Application:
  • Searching for local similarities in large sequences (newly sequenced genomes)
  • Looking for motifs in 2 proteins
Global
• Goal: To identify conserved regions and differences
• Algorithm: Needleman-Wunsch
• Application:
  • Comparing two genes with same function (human vs. mouse)
  • Comparing two proteins with similar function
Protein Responsible for Iron Transport
Human: MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLKMLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMTVRELVAIGRYPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGGERQRAWIAMLVAQDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVIAVLHDINMAARYCDYLVALRGGEMIAQGTPAEIMRGETLEMIYGIPMGILPHPAGAAPVSFVY
Chicken: MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNLRDLTQQERISLTCVQKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYKLKPIAAEIYEHTEGSTTSYYAVAVVKKGTEFTVNDLQGKTSCHTGLGRSAGWNIPIGTLLHWGAIEWEGIESGSVEQAVAKFFSASCVPGATIEQKLCRQCKGDPKTKCARNAPYSGYSGAFHCLKDGKGDVAFVKHTTVNENAPDLNDEYELLCLDGSRQPVDNYKTCNWARVAAHAVVARDDNKVEDIWSFLSKAQSDFGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIMLKRVPSLMSQLYLGFEYYSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRWSVVSNGDVECTVVDETKDCIIKIMKGEADAV
Problems of Comparison of 2 Sequences
• Evolution factor: additions, deletions, substitutions
• Human factor: typos, duplicates
Score Matrix: BLOSUM45
  A R N D C Q E G H I L K M F P S T W Y V B Z X
A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -2 -2 0 -1 -1 0
R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 -1 -2 -1 0 -1
N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3 4 0 -1
D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3 5 1 -1
C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 -2 -3 -2
Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3 0 4 -1
E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3 1 4 -1
G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3 -1 -2 -1
H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 2 -3 0 0 -1
I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3 -3 -3 -1
L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1 -3 -2 -1
K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 -1 -2 0 1 -1
M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1 -2 -1 -1
F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0 -3 -3 -1
P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3 -2 -1 -1
S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1 0 0 0
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5 -3 -1 0 0 -1 0
W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15 3 -3 -4 -2 -2
Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1 -2 -2 -1
V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5 -3 -3 -1
B -1 -1 4 5 -2 0 1 -1 0 -3 -3 0 -2 -3 -2 0 0 -4 -2 -3 4 2 -1
Z -1 0 0 1 -3 4 4 -2 0 -3 -2 1 -1 -3 -1 0 -1 -2 -2 -3 2 4 -1
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -2 -1 -1 -1 -1 -1
S-W: Dynamic Programming Matrix
      E  L  E  P  H  A  N  T
   0  0  0  0  0  0  0  0  0
P  0  0 -3  0  9 -2 -1 -2 -1
A  0 -1 -1 -1 -1 -2  5 -1  0
N  0  0 -3  0 -2  1 -1  6  0
T  0 -1 -1 -1 -1 -2  0  0  5
H  0  0 -2  0 -2 10 -2  1 -2
E  0  6 -3  6  0  0 -1  0 -1
R  0  0 -2  0 -2  0 -2  0 -1
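S-W: Formula
$$
T[i, j] = \max
\begin{cases}
T[i-1, j-1] + \mathrm{score}(s[i], t[j]) \\
T[i-1, j] - g \\
T[i, j-1] - g \\
0
\end{cases}
$$
g: gap penalty
g = 8 (in our example)
[Figure: cell T[i, j], marked "?", is computed from its neighbours T[i-1, j-1], T[i-1, j], and T[i, j-1]; the boundary cells are 0.]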
S-W: Filled Dynamic Programming Matrix (g = 8)
      E  L  E  P  H  A  N  T
   0  0  0  0  0  0  0  0  0
P  0  0  0  0  9  1  0  0  0
A  0  0  0  0  1  7  6  0  0
N  0  0  0  0  0  2  6 12  4
T  0  0  0  0  0  0  2  6 17
H  0  0  0  0  0 10  2  3  9
E  0  6  0  6  0  2  9  2  2
R  0  0  4  0  4  0  1  9  1
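The S-W recurrence can be sketched in a few lines of Python. This is a minimal illustration, not a production aligner: the substitution scores are copied from the PANTHER / ELEPHANT slide, the gap penalty is the linear g = 8 used in the example, and the names `SCORES`, `GAP`, and `smith_waterman` are chosen here for illustration only.

```python
# Substitution scores copied from the slides
# (rows: P A N T H E R, columns: E L E P H A N T).
SCORES = [
    [0, -3, 0, 9, -2, -1, -2, -1],   # P
    [-1, -1, -1, -1, -2, 5, -1, 0],  # A
    [0, -3, 0, -2, 1, -1, 6, 0],     # N
    [-1, -1, -1, -1, -2, 0, 0, 5],   # T
    [0, -2, 0, -2, 10, -2, 1, -2],   # H
    [6, -3, 6, 0, 0, -1, 0, -1],     # E
    [0, -2, 0, -2, 0, -2, 0, -1],    # R
]
GAP = 8  # linear gap penalty g from the example


def smith_waterman(scores, gap):
    """Fill the local-alignment table T using the recurrence above."""
    rows, cols = len(scores), len(scores[0])
    # Row 0 and column 0 are the zero boundary.
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + scores[i - 1][j - 1],  # match or substitution
                table[i - 1][j] - gap,                       # gap in one sequence
                table[i][j - 1] - gap,                       # gap in the other
                0,                                           # start a new local alignment
            )
    return table


table = smith_waterman(SCORES, GAP)
best = max(max(row) for row in table)  # score of the best local alignment

for row in table:
    print(" ".join(f"{value:3d}" for value in row))
```

Each cell depends only on its upper, left, and upper-left neighbours, which is why cells on the same anti-diagonal can be computed independently of each other.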
Growth of GenBank
~33 million sequences as of Feb. 14, 2004
Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
BLAST: Steps
• Divide both sequences into words of length w (default w = 3)
• Calculate the score for each pair of words
• Extend high-scoring pairs to increase the score
BLAST: Divide Sequences
PANTHER: PAN, ANT, NTH, THE, HER
ELEPHANT: ELE, LEP, EPH, PHA, HAN, ANT
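The word division can be sketched as a sliding window of length w over each sequence. The helper name `words` is an assumption made for this illustration.

```python
def words(sequence, w=3):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(words("PANTHER"))
print(words("ELEPHANT"))
```

A sequence of length n yields n - w + 1 words, so the number of word pairs to score grows with the product of the two sequence lengths.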
BLAST: Calculate Score
E L E
P A N
0 -1 0 score: -1
L E P
P A N
-3 -1 -2 score: -6
E P H
P A N
0 -1 1 score: 0
P H A
P A N
9 -2 -1 score: 6
H A N
P A N
-2 5 6 score: 9
A N T
P A N
-1 -1 0 score: -2
BLAST: Sort Pairs on Score
Word1  Word2  Score
ANT    ANT     16
PAN    HAN      9
PAN    PHA      6
PAN    EPH      0
PAN    ELE     -1
PAN    ANT     -2
PAN    LEP     -6
…      …        …
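Scoring and sorting the word pairs might look like the sketch below. It assumes Word1 comes from PANTHER and Word2 from ELEPHANT, reuses the substitution scores shown on the slides, and scores every pair; the names `SCORE`, `pair_score`, and `ranked` are illustrative. The slide lists only a selection of the pairs, so the full ranking contains more rows.

```python
QUERY, SUBJECT = "PANTHER", "ELEPHANT"

# Substitution scores copied from the slides
# (rows: letters of PANTHER, columns: letters of ELEPHANT).
ROWS = [
    [0, -3, 0, 9, -2, -1, -2, -1],   # P
    [-1, -1, -1, -1, -2, 5, -1, 0],  # A
    [0, -3, 0, -2, 1, -1, 6, 0],     # N
    [-1, -1, -1, -1, -2, 0, 0, 5],   # T
    [0, -2, 0, -2, 10, -2, 1, -2],   # H
    [6, -3, 6, 0, 0, -1, 0, -1],     # E
    [0, -2, 0, -2, 0, -2, 0, -1],    # R
]
SCORE = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(QUERY)
    for j, b in enumerate(SUBJECT)
}


def words(sequence, w=3):
    """Return every overlapping word of length w."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def pair_score(word1, word2):
    """Sum the position-by-position substitution scores of two words."""
    return sum(SCORE[a, b] for a, b in zip(word1, word2))


# Score every (Word1, Word2) pair and sort from highest to lowest score.
ranked = sorted(
    ((w1, w2, pair_score(w1, w2)) for w1 in words(QUERY) for w2 in words(SUBJECT)),
    key=lambda pair: pair[2],
    reverse=True,
)

for w1, w2, score in ranked[:5]:
    print(w1, w2, score)
```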
BLAST: Extension
      E  L  E  P  H  A  N  T
   0  0  0  0  0  0  0  0  0
P  0  0 -3  0  9 -2 -1 -2 -1
A  0 -1 -1 -1 -1 -2  5 -1  0
N  0  0 -3  0 -2  1 -1  6  0
T  0 -1 -1 -1 -1 -2  0  0  5
H  0  0 -2  0 -2 10 -2  1 -2
E  0  6 -3  6  0  0 -1  0 -1
R  0  0 -2  0 -2  0 -2  0 -1
BLAST: Summary
• Uses
  • Score matrix
  • Gap penalties
  • Heuristics to reduce computations
• Complexity: O(m) with O(n) processors
• Sensitivity: Low
Sensitivity
Task: Align 2 sequences: AXBXCXDXE and ABCDE
Smith-Waterman:
  AXBXCXDXE
  : : : : :
  A-B-C-D-E
BLAST:
  Ø (no similar substrings)
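The BLAST result for this example can be checked by comparing the words of the two sequences. The sketch below assumes seeding on identical words of length 3 only; real BLAST also accepts similar words that score above a threshold, so this is a simplification. The names `words` and `shared` are illustrative.

```python
def words(sequence, w=3):
    """Return every overlapping word of length w."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


# Words the two sequences have in common: the candidate seeds.
shared = set(words("AXBXCXDXE")) & set(words("ABCDE"))
print(shared)  # empty: there is nothing to extend
```

Every word of the first sequence contains an X, so no seed exists, even though the gapped alignment pairs A, B, C, D, and E.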
S-W and BLAST
• Using them now: too costly, inefficient, time-consuming
• Solution: more heuristics, more parallelism
ParAlign: Steps
• Find ungapped alignments
• Calculate approximate alignment scores
• Choose high-scoring sequences
• Apply S-W
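The first step, finding ungapped alignments, can be illustrated by scanning each diagonal of the score matrix for its best-scoring run. This is only a sketch of the idea, not ParAlign's actual scoring heuristic; the function name `best_ungapped_score` is an assumption, and the scores are those of the PANTHER / ELEPHANT example.

```python
# Substitution scores copied from the slides
# (rows: P A N T H E R, columns: E L E P H A N T).
SCORES = [
    [0, -3, 0, 9, -2, -1, -2, -1],   # P
    [-1, -1, -1, -1, -2, 5, -1, 0],  # A
    [0, -3, 0, -2, 1, -1, 6, 0],     # N
    [-1, -1, -1, -1, -2, 0, 0, 5],   # T
    [0, -2, 0, -2, 10, -2, 1, -2],   # H
    [6, -3, 6, 0, 0, -1, 0, -1],     # E
    [0, -2, 0, -2, 0, -2, 0, -1],    # R
]


def best_ungapped_score(scores):
    """Best score of any ungapped local alignment, found diagonal by diagonal."""
    rows, cols = len(scores), len(scores[0])
    best = 0
    for d in range(-(rows - 1), cols):  # d = j - i identifies one diagonal
        running = 0
        for i in range(rows):
            j = i + d
            if 0 <= j < cols:
                # Restart the run whenever its running sum drops below zero.
                running = max(0, running + scores[i][j])
                best = max(best, running)
    return best


print(best_ungapped_score(SCORES))
```

Diagonals never interact when no gaps are allowed, which makes this pass much cheaper than the full S-W table and easy to run on several diagonals at once. Only sequences whose approximate score is high enough would then be passed on to S-W.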
ParAlign: Microparallelism
• Divide wide registers into smaller units
• Perform the same operation on different data sources
• Modern microprocessors have this technology built in
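The idea of splitting a wide register into smaller units can be imitated in plain Python by treating one 64-bit integer as eight independent 8-bit lanes. The lane width, the lane count, and the names `pack`, `unpack`, and `add_lanes` are assumptions chosen for illustration; real SIMD hardware performs the lane-wise operation in a single instruction.

```python
LANES = 8                    # assumed: eight 8-bit lanes in one 64-bit word
LOW7 = 0x7F7F7F7F7F7F7F7F    # low seven bits of every lane
HIGH = 0x8080808080808080    # top bit of every lane


def pack(values):
    """Pack eight small integers into one 64-bit word, one per lane."""
    word = 0
    for k, value in enumerate(values):
        word |= (value & 0xFF) << (8 * k)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight lane values."""
    return [(word >> (8 * k)) & 0xFF for k in range(LANES)]


def add_lanes(a, b):
    """Add all eight lanes at once; each lane wraps modulo 256 on its own."""
    # Add the low seven bits so that no carry crosses a lane boundary,
    # then restore the top bit of each lane with an exclusive or.
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH)


x = pack([1, 2, 3, 4, 5, 6, 7, 8])
y = pack([10, 20, 30, 40, 50, 60, 70, 80])
print(unpack(add_lanes(x, y)))
```

One addition on the packed words updates all eight lanes, which is the effect the slide describes: the same operation applied to different data at the same time.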
ParAlign: Calculate Scores in Parallel
      E  L  E  P  H  A  N  T
   0  0  0  0  0  0  0  0  0
P  0  0 -3  0  9 -2 -1 -2 -1
A  0 -1 -1 -1 -1 -2  5 -1  0
N  0  0 -3  0 -2  1 -1  6  0
T  0 -1 -1 -1 -1 -2  0  0  5
H  0  0 -2  0 -2 10 -2  1 -2
E  0  6 -3  6  0  0 -1  0 -1
R  0  0 -2  0 -2  0 -2  0 -1
ParAlign: Estimate of Gaps
E L E P H A N T
0 0 0 0 0 0 0 0 0
P 0 0 -3 0 9 -2 -1 -2 -1
A 0 -1 -1 -4 -1 7 3 -2 -2
N 0 0 -4 -1 -6 0 6 9 -2
T 0 -1 -1 -5 -2 -8 1 6 14
H 0 0 -3 -1 -7 8 -10 2 4
E 0 6 -3 3 -1 -7 7 -10 1
R 0 0 4 -3 1 -1 -9 7 -11
ParAlign: Apply S-W in Parallel
E L E P H A N T
0 0 0 0 0 0 0 0 0
P 0 0 -3 0 9 -2 -1 -2 -1
A 0 -1 -1 -4 -1 7 3 -2 -2
N 0 0 -4 -1 -6 0 6 9 -2
T 0 -1 -1 -5 -2 -8 1 6 14
H 0 0 -3 -1 -7 8 -10 2 4
E 0 6 -3 3 -1 -7 7 -10 1
R 0 0 4 -3 1 -1 -9 7 -11
ParAlign: Summary
• Uses
  • SIMD technology (single instruction, multiple data)
  • S-W algorithm
  • Heuristics to reduce computations
• Requirement for machine: modern microprocessor
• Speed: Fast
• Sensitivity: Medium
TurboBLAST: Steps
• Divide the job: parts of the query against partitions of the database
• Apply BLAST
• Merge results
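The three steps can be sketched as follows. Everything here is a stand-in: `toy_search` replaces BLAST with a count of shared 3-letter words, and `chunks`, `run_job`, and the chunk sizes are hypothetical names and values chosen for illustration, not part of TurboBLAST.

```python
from itertools import product


def chunks(items, size):
    """Split a list into consecutive parts of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_search(query, subject):
    """Stand-in for BLAST (hypothetical scoring): number of shared 3-letter words."""
    q = {query[i:i + 3] for i in range(len(query) - 2)}
    s = {subject[i:i + 3] for i in range(len(subject) - 2)}
    return len(q & s)


def run_job(queries, database, query_chunk=1, db_chunk=2):
    # Step 1: divide the job into tasks (part of the query x part of the database).
    tasks = list(product(chunks(queries, query_chunk), chunks(database, db_chunk)))
    hits = []
    # Step 2: apply the search to each task; the tasks do not depend on each other.
    for query_part, db_part in tasks:
        for q in query_part:
            for s in db_part:
                score = toy_search(q, s)
                if score > 0:
                    hits.append((q, s, score))
    # Step 3: merge the partial results into one ordered list.
    return sorted(hits, key=lambda hit: (hit[0], -hit[2], hit[1]))


database = ["ELEPHANT", "PANDA", "TIGER", "ANTHER"]
print(run_job(["PANTHER"], database))
```

Because each task touches only its own part of the query and the database, the merged result does not depend on how the job was divided, so the tasks can be handed to different machines.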
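TurboBLAST: Schema
[Figure: schema connecting the Client, the Master (TurboHub and File Provider), and the Workers, with arrows labeled job, tasks, task, request, results, DB request, and DB part.]
• Client
  • Divides job into tasks
  • Writes results to file
• Master: TurboHub
  • Sets up tasks
  • Manages execution
  • Coordinates Workers
  • Provides VSM
• Master: File Provider
• Workers
  • Divide task
  • Schedule subtasks
  • Solve subtasks
  • Merge results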
Work is distributed not by pushing it out, but by posting information about what work needs to be done and letting the machines grab work from the remote locations.
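The difference between pushing work and letting machines grab it can be sketched with a shared queue. This is not TurboHub's API: the in-process `queue.Queue` stands in for the posted work, threads stand in for remote machines, and squaring a number stands in for a BLAST task.

```python
import queue
import threading


def worker(name, tasks, results):
    """Keep grabbing posted tasks until none are left."""
    while True:
        try:
            task = tasks.get_nowait()  # the worker pulls; nothing is pushed to it
        except queue.Empty:
            return
        results.put((name, task, task * task))  # placeholder for a BLAST task


def run(n_workers, task_list):
    tasks = queue.Queue()
    results = queue.Queue()
    for task in task_list:  # post the work that needs to be done
        tasks.put(task)
    threads = [
        threading.Thread(target=worker, args=(f"worker-{k}", tasks, results))
        for k in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


finished = run(3, list(range(10)))
print(sorted(task for _, task, _ in finished))
```

With this arrangement a fast worker simply takes more tasks than a slow one, and a worker that joins or leaves changes nothing for the others.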
TurboBLAST: Client
• Takes a BLAST job and divides it into a number of initial BLAST tasks
• Submits these tasks to the Master
• Retrieves the results and writes them to file
TurboBLAST: Master
• Accepts tasks from Clients and sets them up for processing by the Workers
• Includes TurboHub (the server portion of a parallel execution system)
• Includes File Provider (a Java application that manages the databases)
TurboBLAST: Worker
• Workers are processors
• Run a Java application and perform the BLAST computations
• Merge the results
• Are responsible for scheduling
TurboHub
TurboHub is an execution engine for parallel and distributed Java applications
• Scalable, high performance
• Wide range of computing environments
• Manages the flow of data through the workflows
• Schedules the components
• Transforms data between components
• Balances load
• Handles errors
TurboBLAST: TurboHub
• Manages task execution
• Coordinates the Workers
• Provides a virtual shared memory
• Supports dynamic changes in the set of Workers
• Supports fault tolerance
TurboBLAST: File Provider
• Maintains a copy of each database
• Delivers all or part of each database to Workers as they require them
TurboBLAST: Advantages
• Size of each task is optimal: processing is efficient on the processor that computes the task
• Large set of tasks: no waste of time for processors
• No algorithm change
  • Support for all flavors of BLAST
  • Easy to update
• Applicable to different environments (PC, Macintosh …)
TurboBLAST: Experiment
• Input data: 500 proteins, 200 – 400 amino acids in each
• Database: 1,681,522,266 sequences
• Hardware: IBM Linux cluster
  • 8 dual-processor workstations
  • 2 Pentium III processors, 996 MHz each
  • 2 Gbyte memory
  • 100 Mbit Ethernet
TurboBLAST: Results of Experiment
[Figure: speed-up (y-axis, 0 to 18) versus number of processors (x-axis, 2 to 16), comparing TurboBLAST with the line X=Y.]
TurboBLAST: Summary
• Divide and conquer: use many copies of BLAST in parallel
• Uses the BLAST algorithm
• Requirement for each machine
  • Java VM
  • Local BLAST executable
• Speed: Very fast
• Sensitivity: Low
References
• R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing, "TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub", Intl. Parallel and Distributed Processing Symposium (IPDPS), 2002.
• T. Rognes, "ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches", Oxford University Press, 2001.