Locating conserved genes in whole genome scale
Prudence Wong
University of Liverpool
June 2005
Joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)
Outline
Motivation
Challenges of Whole Genome Alignment
Four approaches and their performance
  Longest Common Subsequence
  Clustering Approach
  Mutation Sensitive Selection
  Hybrid Approach
Remarks
Mouse & Human
Do they look the same?
Mouse and human are genetically very similar.
What do we mean by similar?
Many genes found in human are also found in mouse: these are conserved genes.
[Figure: Mouse Chromosome 16 and Human Chromosome 16; labels: m16, h03]
Whole Genome Alignment
[Figure: Genome A and Genome B, each containing Gene X, Gene Y and Gene Z]
Identify regions on the genomes that possibly contain their conserved genes.
Differences in the ordering of conserved genes could be related to mutations.
For related species, the number of mutations is usually small.
[Figure annotation: possibly a mutation]
Data size
Usually very large (e.g., human chromosomes vs mouse chromosomes)
Examples

Human Chr No.   Length    Mouse Chr No.   Length
1               245M      1               134M
3               200M      2               181M
11              135M      7               134M
15              100M      8               129M
20              64M       16              99M

Cannot use global alignment tools because of the large size
Observations
A conserved gene may not be identical in the two genomes; nevertheless, there are some common substrings unique to this conserved gene (called MUM).
Locate all MUMs over the two genomes; yet not every MUM corresponds to conserved genes.
[Figure: Gene X and Gene Y appear in different orders on the two genomes; other MUMs are noise]
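To make the notion of a MUM (maximal unique match) concrete, here is a minimal brute-force sketch in Python. It is an illustration only: the function names `count_occurrences` and `find_mums` and the `min_len` threshold are assumptions made for this example, and the quadratic scan is only workable on toy strings, not on chromosome-sized inputs.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return sorted (pos_a, pos_b, length) triples for every match that
    occurs exactly once in each string and cannot be extended either way.
    Brute force, for illustration only."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return sorted(mums)


# Two toy "genes" that appear in swapped order in the two strings.
genome_a = "GATTACAxxCCGGA"
genome_b = "CCGGAyyGATTACA"
print(find_mums(genome_a, genome_b))  # [(0, 7, 7), (9, 0, 5)]
```

The two reported matches sit in opposite orders on the two strings, which is the situation pictured with Gene X and Gene Y.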
Number of MUMs

Mouse Chr No.   Human Chr No.   # of MUMs
7               19              52,394
15              22              71,613
16              16              66,536
16              22              61,200
17              16              29,001
17              19              56,236
19              11              29,814

The number of MUMs is small compared with the chromosome lengths.
[Figure: MUMs for M16-H03. x-axis: Human Chromosome 03 (0 to 70,000,000); y-axis: Mouse Chromosome 16 (0 to 50,000,000); conserved genes marked]
Generation of MUMs using suffix tree
How to choose the right MUMs?
MUM Selection
MUMmer-1 [Delcher et al. Nucleic Acids Research 1999]
  longest common subsequences (effectively assume no mutations)
MUMmer-2 [Delcher et al. Nucleic Acids Research 2002] & MUMmer-3 [Kurtz et al. Genome Biology 2004]
  clustering heuristics
  most popular tool to uncover conserved genes in WG scale
MaxMinCluster [Wong et al. Bioinformatics 2004*]
  clustering, optimization
MSS: Mutation Sensitive Selection [Chan et al. Bioinformatics 2005*]
  capture mutations
Hybrid approach [Chan et al. Bioinformatics 2005*]
  combine mutation sensitive and clustering approaches
* our results
Overview of Results
Average coverage (sensitivity), in %

                     Mouse/Human   Intragenus Baculoviridae   Intergenus Baculoviridae
MUMmer-3             77 (27)       66 (71)                    43 (62)
MaxMinCluster        84 (29)       69 (75)                    45 (59)
MSS                  91 (29)       79 (75)                    36 (53)
MUMmer-3+MSS         91 (28)       79 (75)                    48 (43)
MaxMinCluster+MSS    91 (27)       79 (82)                    51 (53)

coverage: % of published conserved genes reported
sensitivity: % of MUMs reported that reside in published conserved genes
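The two measures can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the evaluation procedure used for the table: genes are hypothetical `(start, end)` intervals on a single genome, MUMs are `(pos, length)` pairs, a MUM "resides in" a gene when it lies entirely inside the interval, and a gene counts as "reported" when at least one reported MUM resides in it.

```python
def inside(mum, gene):
    """True if the MUM (pos, length) lies entirely within gene (start, end)."""
    pos, length = mum
    start, end = gene
    return start <= pos and pos + length <= end


def coverage_and_sensitivity(mums, genes):
    """Return (coverage, sensitivity) in percent.

    coverage: share of published genes hit by at least one reported MUM
              (assumed meaning of "reported").
    sensitivity: share of reported MUMs that reside in a published gene.
    """
    covered = sum(1 for g in genes if any(inside(m, g) for m in mums))
    hits = sum(1 for m in mums if any(inside(m, g) for g in genes))
    coverage = 100.0 * covered / len(genes) if genes else 0.0
    sensitivity = 100.0 * hits / len(mums) if mums else 0.0
    return coverage, sensitivity


# Hypothetical data: four published genes, four reported MUMs.
genes = [(0, 100), (200, 300), (400, 500), (600, 700)]
mums = [(10, 20), (50, 30), (210, 25), (350, 20)]
print(coverage_and_sensitivity(mums, genes))  # (50.0, 75.0)
```

In this toy case two of four genes are hit (coverage 50%) and three of four MUMs fall inside a gene (sensitivity 75%).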
MSS outperforms MaxMinCluster and MUMmer-3 on closely related species
BUT MSS performs worse on species relatively farther apart
Both hybrid approaches perform well for species farther apart
Longest Common Subsequence (LCS)
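One way to picture the LCS idea is as picking the heaviest set of MUMs that appear in the same order on both genomes. The sketch below is a plain quadratic dynamic program over hypothetical `(pos_a, pos_b, length)` triples, with MUM length used as weight; it illustrates the idea only and is not taken from MUMmer, whose implementation may differ in detail.

```python
def select_colinear_mums(mums):
    """Pick a maximum-weight chain of MUMs whose positions increase in
    both genomes. Each MUM is (pos_a, pos_b, length); weight = length.
    O(n^2) dynamic programming, for illustration only."""
    order = sorted(mums)
    n = len(order)
    if n == 0:
        return []
    best = [m[2] for m in order]   # best chain weight ending at each MUM
    prev = [-1] * n                # back-pointers for chain recovery
    for i in range(n):
        for j in range(i):
            same_order = order[j][0] < order[i][0] and order[j][1] < order[i][1]
            if same_order and best[j] + order[i][2] > best[i]:
                best[i] = best[j] + order[i][2]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical MUMs; (20, 50, 5) is out of order relative to the others.
mums = [(0, 0, 10), (20, 50, 5), (30, 20, 8), (40, 30, 7), (60, 60, 12)]
print(select_colinear_mums(mums))
# [(0, 0, 10), (30, 20, 8), (40, 30, 7), (60, 60, 12)]
```

Any MUM that breaks the common order is dropped, which is why this kind of selection effectively assumes no mutations.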
LCS Approach (MUMmer-1) does not take mutations into account
MUMmer-2 & -3 cluster by heuristic
MaxMinCluster formalizes clustering as a combinatorial optimization problem
Clustering approach
Observations
Noise MUMs are usually short and isolated.
A conserved gene usually contains a sequence of MUMs that are close and have sufficient length => clusters
[Figure: Gene X and Gene Y appear in different orders on the two genomes; other MUMs are noise]
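The observation can be sketched as a greedy single pass over the MUMs. Everything specific here is an assumption for illustration: MUMs are `(pos_a, pos_b, length)` triples, two consecutive MUMs are "close" when the distance between them is at most `gap` on both genomes, and a cluster has "sufficient length" when its MUM lengths sum to at least `min_size`. It is not the heuristic of MUMmer-2/-3 nor the MaxMinCluster algorithm, and it ignores reversed matches.

```python
def cluster_mums(mums, gap=100, min_size=40):
    """Group MUMs (pos_a, pos_b, length) into clusters of nearby matches
    and drop clusters whose total length is below min_size."""
    clusters, current = [], []
    for mum in sorted(mums):
        if current:
            prev_a, prev_b, prev_len = current[-1]
            dist_a = mum[0] - (prev_a + prev_len)
            dist_b = mum[1] - (prev_b + prev_len)
            close = 0 <= dist_a <= gap and 0 <= dist_b <= gap
            if not close:
                clusters.append(current)
                current = []
        current.append(mum)
    if current:
        clusters.append(current)
    # Short, isolated groups are treated as noise.
    return [c for c in clusters if sum(m[2] for m in c) >= min_size]


# Hypothetical MUMs: two genuine clusters and one short isolated MUM.
mums = [(0, 1000, 25), (40, 1040, 20), (500, 300, 15),
        (2000, 5000, 30), (2050, 5060, 25)]
print(cluster_mums(mums))
# [[(0, 1000, 25), (40, 1040, 20)], [(2000, 5000, 30), (2050, 5060, 25)]]
```

The isolated MUM of length 15 forms a cluster on its own and is discarded because it falls short of `min_size`.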
Challenge
Challenge: some conserved genes do not induce clusters of sufficient length.
Solution: relax the definition of clusters to allow the presence of noise.
Noisy cluster
Suppose Gap=100, MinSize=40
[Figure: a 1-noisy cluster; annotations: "> 100 apart", "length = 20"]
Noisy cluster
Suppose Gap=100, MinSize=40
[Figure: a 2-noisy cluster; annotations: "> 100 apart", "length = 20"]
MaxMinCluster
Problem formulation
  find a collection of k-noisy clusters such that the smallest cluster has the maximum weight
Dynamic programming
  O(k^2 n^2) time, O(k^2 n) space
Capture mutations more directly
Mutation Sensitive Selection
Select subsets of MUMs
[Figure: subset of MUMs transformed by a few mutations]
Three types of mutations: reversal, transposition, reversed-transposition
k-mutated subsequences
Given two sequences A & B and an integer k, a pair of subsequence X of A & subsequence Y of B is called a pair of k-mutated subsequences if X can be transformed to Y by at most k mutations.
[Figure: a pair of 2-mutated subsequences, with one reversal and one transposition]
MUMs are signed; a reversal reverts the sign of MUMs.
Mutation Sensitive Selection
Problem formulation: To find a pair of k-mutated subsequences with maximum weight
We believe that the problem is NP-hard
  The Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem
We give an efficient approximation algorithm
  the resulting weight is close to (at least 1/(3k+1) times) the maximum possible weight
  O(n^2 log n + k n^2) time, O(n^2) space
Hybrid Approach
First apply clustering approach to identify clusters which are obviously conserved genes
  can apply either MUMmer-3 or MaxMinCluster
These clusters are treated as MUMs with bigger weight
Then apply MSS to process these MUMs together with the remaining MUMs
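The middle step, treating each cluster as a single heavier MUM, can be sketched as follows. This is a hypothetical illustration: the function name `collapse_clusters`, the `(pos_a, pos_b, weight)` representation, and the choice of summing MUM lengths as the cluster weight are all assumptions, and the MSS step that would consume the result is not shown.

```python
def collapse_clusters(mums, clusters):
    """Replace each cluster by one item (pos_a, pos_b, weight) and keep
    the unclustered MUMs as they are.

    mums: list of (pos_a, pos_b, length) tuples.
    clusters: list of lists of MUMs taken from `mums`.
    Cluster weight is assumed to be the sum of its MUM lengths.
    """
    clustered = {m for cluster in clusters for m in cluster}
    merged = []
    for cluster in clusters:
        start_a = min(m[0] for m in cluster)
        start_b = min(m[1] for m in cluster)
        weight = sum(m[2] for m in cluster)
        merged.append((start_a, start_b, weight))
    remaining = [m for m in mums if m not in clustered]
    return sorted(merged + remaining)


# Hypothetical input: two clusters found by a clustering step, one leftover MUM.
mums = [(0, 1000, 25), (40, 1040, 20), (500, 300, 15),
        (2000, 5000, 30), (2050, 5060, 25)]
clusters = [[(0, 1000, 25), (40, 1040, 20)],
            [(2000, 5000, 30), (2050, 5060, 25)]]
print(collapse_clusters(mums, clusters))
# [(0, 1000, 45), (500, 300, 15), (2000, 5000, 55)]
```

The output list would then be passed to an MSS implementation, so that obvious clusters weigh more than the remaining individual MUMs.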
Remarks
Experiments show that
  MaxMinCluster > LCS
  MSS > MaxMinCluster for closely related species
  MSS does not perform well for species relatively farther apart
  Hybrid approach is the best for both closely related and farther apart species
Thank you!
Q & A
Approximation Algorithm
Super-Backbone
  maximum weight common subsequences
Identify k mutation blocks
  having high weight
  do not overlap with Super-Backbone too much
  this is formulated as a sub-problem and solved optimally by dynamic programming
Report Super-Backbone & k mutation blocks
O(n^2 log n + k n^2) time, O(n^2) space
Mutations
Three types of mutations: reversal, transposition, reversed-transposition

a b c d e f g h i j k l m n o p q r s t u v w x y z
  reversal
a d c b e f g h i j k l m n o p q r s t u v w x y z
  transposition
a d c b e k l m n o p q r s t u v w x y f g h i j z
  reversed-transposition
a d c b e k l t s r q p o m n u v w x y f g h i j z
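The three operations can be replayed on the alphabet example with a short Python sketch. The function names and the index convention (half-open block `[i, j)`, with `dest` counted in the sequence after the block has been cut out) are assumptions made for this illustration; signs are left out because plain characters carry no orientation.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] in place."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def transposition(seq, i, j, dest):
    """Cut the block seq[i:j] and reinsert it at index dest of the rest."""
    block = seq[i:j]
    rest = seq[:i] + seq[j:]
    return rest[:dest] + block + rest[dest:]


def reversed_transposition(seq, i, j, dest):
    """Cut the block seq[i:j], reverse it, and reinsert it at index dest."""
    block = seq[i:j][::-1]
    rest = seq[:i] + seq[j:]
    return rest[:dest] + block + rest[dest:]


original = "abcdefghijklmnopqrstuvwxyz"
step1 = reversal(original, 1, 4)                 # reverse "bcd"
step2 = transposition(step1, 5, 10, 20)          # move "fghij" to just before "z"
step3 = reversed_transposition(step2, 9, 15, 7)  # reverse "opqrst", move before "mn"

print(step1)  # adcbefghijklmnopqrstuvwxyz
print(step2)  # adcbeklmnopqrstuvwxyfghijz
print(step3)  # adcbekltsrqpomnuvwxyfghijz
```

Three mutations turn the first sequence into the last one, so in the terminology of the talk the two sequences would form a pair of 3-mutated subsequences.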