Locating conserved genes in whole genome scale

Prudence Wong, University of Liverpool, June 2005

Joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)

Page 1: Locating conserved genes in whole genome scale

Locating conserved genes in whole genome scale

Prudence Wong

University of Liverpool, June 2005

joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)

Page 2: Locating conserved genes in whole genome scale

Outline

Motivation
Challenges of Whole Genome Alignment
Four approaches and their performance
  Longest Common Subsequence
  Clustering Approach
  Mutation Sensitive Selection
  Hybrid Approach
Remarks

Page 4: Locating conserved genes in whole genome scale

Mouse & Human: do they look the same?

Mouse and human are genetically very similar.

What do we mean by similar?

Many genes found in human are also found in mouse: these are conserved genes.

[Figure: Mouse Chromosome 16 and Human Chromosome 16; panels labelled m16 and h03]

Page 5: Locating conserved genes in whole genome scale

Whole Genome Alignment

Identify regions on the genomes that possibly contain their conserved genes.

[Figure: Genome A and Genome B, each showing Gene X, Gene Y and Gene Z; one position is annotated "possibly a mutation"]

Differences in the ordering of conserved genes could be related to mutations.

For related species, the number of mutations is usually small.

Page 7: Locating conserved genes in whole genome scale

Data size

Usually very large (e.g., human chromosomes vs mouse chromosomes)

Examples

Human Chr No.   Length   Mouse Chr No.   Length
1               245M     1               134M
3               200M     2               181M
11              135M     7               134M
15              100M     8               129M
20              64M      16              99M

Cannot use global alignment tools because of the large size

Page 8: Locating conserved genes in whole genome scale

Observations

A conserved gene may not be identical in the two genomes; nevertheless, there are some common substrings unique to this conserved gene (called MUMs, maximal unique matches).

Locate all MUMs over the two genomes, yet not every MUM corresponds to conserved genes.

[Figure: one genome shows Gene X, Gene Y; the other shows Gene Y, Gene X; stray MUMs are labelled "Noise"]
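To make the notion of a MUM concrete, here is a minimal brute-force sketch in Python. It is an illustration only, not the suffix-tree method the talk refers to and not MUMmer's code; the names `find_mums`, `occurs_once` and `min_len` are hypothetical. It reports every match that cannot be extended on either side and that occurs exactly once in each string.

```python
def occurs_once(text: str, pattern: str) -> bool:
    """True if pattern occurs exactly once in text (overlapping hits count)."""
    first = text.find(pattern)
    return first != -1 and text.find(pattern, first + 1) == -1


def find_mums(a: str, b: str, min_len: int = 2) -> list[tuple[int, int, int]]:
    """Return (start_in_a, start_in_b, length) for every maximal unique match.

    Brute force, roughly cubic time: fine for toy strings, hopeless for
    chromosomes, which is why real tools use suffix trees.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # A match that can be extended to the left is not maximal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            match = a[i:i + length]
            if length >= min_len and occurs_once(a, match) and occurs_once(b, match):
                mums.append((i, j, length))
    return mums


print(find_mums("xabcy", "zabcw"))  # [(1, 1, 3)]
print(find_mums("abab", "ab"))      # [] because "ab" is not unique in "abab"
```

The second call shows why uniqueness matters: a repeated substring gives no reliable anchor between the two genomes.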

Page 9: Locating conserved genes in whole genome scale

Number of MUMs

Mouse Chr No.   Human Chr No.   # of MUMs
7               19              52,394
15              22              71,613
16              16              66,536
16              22              61,200
17              16              29,001
17              19              56,236
19              11              29,814

The number of MUMs is small compared with the chromosome length

Page 10: Locating conserved genes in whole genome scale

MUMs for M16-H03

[Figure: plot of MUM positions with conserved genes marked; x-axis: Human Chromosome 03 (0 to 70,000,000); y-axis: Mouse Chromosome 16 (0 to 50,000,000)]

Page 11: Locating conserved genes in whole genome scale

Generation of MUMs using a suffix tree

How to choose the right MUMs?

Page 13: Locating conserved genes in whole genome scale

MUM Selection

MUMmer-1 [Delcher et al. Nucleic Acids Research 1999]
  longest common subsequences (effectively assumes no mutations)
MUMmer-2 [Delcher et al. Nucleic Acids Research 2002] & MUMmer-3 [Kurtz et al. Genome Biology 2004]
  clustering heuristics
  most popular tool to uncover conserved genes in WG scale
MaxMinCluster [Wong et al. Bioinformatics 2004*]
  clustering, optimization
MSS: Mutation Sensitive Selection [Chan et al. Bioinformatics 2005*]
  captures mutations
Hybrid approach [Chan et al. Bioinformatics 2005*]
  combines mutation sensitive and clustering approaches

* our results

Page 14: Locating conserved genes in whole genome scale

Overview of Results

Average coverage (sensitivity), in %

                      Mouse/Human   Intragenus Baculoviridae   Intergenus Baculoviridae
MUMmer-3              77 (27)       66 (71)                    43 (62)
MaxMinCluster         84 (29)       69 (75)                    45 (59)
MSS                   91 (29)       79 (75)                    36 (53)
MUMmer-3+MSS          91 (28)       79 (75)                    48 (43)
MaxMinCluster+MSS     91 (27)       79 (82)                    51 (53)

coverage: % of published conserved genes reported
sensitivity: % of MUMs reported that reside in published conserved genes
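The two measures can be written down as a short Python sketch. The slide does not say how a gene counts as "reported" or when a MUM "resides" in a gene, so the rules below are assumptions made for illustration: a MUM resides in a gene when its interval lies fully inside the gene's interval on one genome, and a gene is reported when at least one MUM resides in it. The function name and the toy data are hypothetical.

```python
def coverage_and_sensitivity(mums, genes):
    """Compute (coverage %, sensitivity %) as defined on the slide.

    mums:  (start, end) intervals of reported MUMs on one genome
    genes: (start, end) intervals of published conserved genes
    Assumption: "resides in" means full containment of the MUM interval.
    """
    def inside(mum, gene):
        return gene[0] <= mum[0] and mum[1] <= gene[1]

    genes_hit = sum(1 for g in genes if any(inside(m, g) for m in mums))
    mums_in_genes = sum(1 for m in mums if any(inside(m, g) for g in genes))
    coverage = 100.0 * genes_hit / len(genes) if genes else 0.0
    sensitivity = 100.0 * mums_in_genes / len(mums) if mums else 0.0
    return coverage, sensitivity


# Toy data, not taken from the experiments in the talk.
genes = [(0, 100), (200, 300), (400, 500), (600, 700)]
mums = [(10, 30), (40, 60), (210, 220), (350, 360)]
print(coverage_and_sensitivity(mums, genes))  # (50.0, 75.0)
```

Two of the four genes contain a MUM (coverage 50%), and three of the four MUMs fall inside a gene (sensitivity 75%).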

Page 15: Locating conserved genes in whole genome scale

Overview of Results (table as on Page 14)

MSS outperforms MaxMinCluster and MUMmer-3 on closely related species

Page 16: Locating conserved genes in whole genome scale

Overview of Results (table as on Page 14)

But MSS performs worse on species that are farther apart

Page 17: Locating conserved genes in whole genome scale

Overview of Results (table as on Page 14)

Both hybrid approaches perform well for species farther apart

Page 19: Locating conserved genes in whole genome scale

Longest Common Subsequence (LCS)
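As a rough illustration of the LCS idea, the sketch below keeps the largest set of MUMs whose positions increase in both genomes, which is a longest increasing subsequence computation. This is a simplified Python sketch under stated assumptions, not MUMmer-1's implementation: each MUM is reduced to a hypothetical pair `(pos_a, pos_b)`, positions are assumed distinct, and MUM lengths are ignored.

```python
from bisect import bisect_left


def longest_ordered_mums(mums):
    """Return the longest subset of (pos_a, pos_b) pairs increasing in both genomes.

    Sort by position in genome A, then take a longest strictly increasing
    subsequence of the positions in genome B (patience-sorting style).
    """
    ordered = sorted(mums)
    tails = []       # tails[k]: smallest pos_b that ends a run of length k + 1
    tails_idx = []   # index in `ordered` of the MUM behind tails[k]
    prev = [-1] * len(ordered)
    for idx, (_, pos_b) in enumerate(ordered):
        k = bisect_left(tails, pos_b)
        if k > 0:
            prev[idx] = tails_idx[k - 1]
        if k == len(tails):
            tails.append(pos_b)
            tails_idx.append(idx)
        else:
            tails[k] = pos_b
            tails_idx[k] = idx
    # Walk the back-pointers from the end of the longest run.
    chain = []
    idx = tails_idx[-1] if tails_idx else -1
    while idx != -1:
        chain.append(ordered[idx])
        idx = prev[idx]
    return chain[::-1]


mums = [(1, 5), (2, 1), (3, 2), (4, 6), (5, 3)]
print(longest_ordered_mums(mums))  # [(2, 1), (3, 2), (5, 3)]
```

Any MUM that breaks the common order, for example one moved by a transposition, is simply dropped, which is the sense in which this approach effectively assumes no mutations.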

Page 20: Locating conserved genes in whole genome scale

The LCS approach (MUMmer-1) does not take mutations into account.

MUMmer-2 and MUMmer-3 cluster by heuristics; MaxMinCluster formalizes clustering as a combinatorial optimization problem.

Page 21: Locating conserved genes in whole genome scale

Clustering approach

Observations

Noise MUMs are usually short and isolated.

A conserved gene usually contains a sequence of MUMs that are close and have sufficient length => clusters

[Figure: one genome shows Gene X, Gene Y; the other shows Gene Y, Gene X; stray MUMs are labelled "Noise"]
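The observation above suggests a simple filter, sketched here in Python. This is a deliberately simplified illustration, not the heuristic used by MUMmer-2/3 nor the MaxMinCluster algorithm: the function name is hypothetical, the parameters `gap` and `min_size` are assumed to play the roles of Gap and MinSize used later in the talk, and orientation is ignored. MUMs are `(pos_a, pos_b, length)` triples.

```python
def cluster_mums(mums, gap=100, min_size=40):
    """Group nearby MUMs and keep only groups with enough total length.

    A new cluster starts whenever the next MUM is more than `gap` away from
    the end of the previous one in either genome. Clusters whose total MUM
    length is below `min_size` are treated as noise and dropped.
    """
    clusters, current = [], []
    for mum in sorted(mums):
        if current:
            prev_a, prev_b, prev_len = current[-1]
            pos_a, pos_b, _ = mum
            close = (abs(pos_a - (prev_a + prev_len)) <= gap
                     and abs(pos_b - (prev_b + prev_len)) <= gap)
            if not close:
                clusters.append(current)
                current = []
        current.append(mum)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(length for _, _, length in c) >= min_size]


mums = [(0, 1000, 25), (50, 1050, 20), (5000, 200, 20), (9000, 9000, 30)]
print(cluster_mums(mums))  # [[(0, 1000, 25), (50, 1050, 20)]]
```

The two neighbouring MUMs form a cluster of total length 45 and survive; the two isolated MUMs are shorter than `min_size` on their own and are discarded as noise.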

Page 22: Locating conserved genes in whole genome scale

Challenge

Challenge: some conserved genes do not induce clusters of sufficient length

Solution: relax the definition of clusters to allow the presence of noise

Page 23: Locating conserved genes in whole genome scale

Noisy cluster

Suppose Gap=100, MinSize=40

[Figure: a 1-noisy cluster; labels: "> 100 apart", "length = 20"]

Page 24: Locating conserved genes in whole genome scale

Noisy cluster

Suppose Gap=100, MinSize=40

[Figure: a 2-noisy cluster; labels: "> 100 apart", "length = 20"]

Page 25: Locating conserved genes in whole genome scale

MaxMinCluster

Problem formulation
  find a collection of k-noisy clusters such that the smallest cluster has the maximum weight

Dynamic programming
  O(k^2 n^2) time, O(k^2 n) space

Page 26: Locating conserved genes in whole genome scale

Mutation Sensitive Selection: capture mutations more directly

Page 27: Locating conserved genes in whole genome scale

Mutation Sensitive Selection

select subsets of MUMs

[Figure: a subset of MUMs, transformed by a few mutations]

three types of mutations: reversal, transposition, reversed-transposition

Page 28: Locating conserved genes in whole genome scale

k-mutated subsequences

Given two sequences A & B and an integer k, a pair of subsequence X of A & subsequence Y of B is called a pair of k-mutated subsequences if X can be transformed to Y by at most k mutations.

[Figure: a pair of 2-mutated subsequences; labels: "reversal", "transposition"]

MUMs are signed; a reversal flips the sign of the MUMs it covers.

Page 29: Locating conserved genes in whole genome scale

Mutation Sensitive Selection

Problem formulation:
  To find a pair of k-mutated subsequences with maximum weight

We believe that the problem is NP-hard
  The Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem

We give an efficient approximation algorithm
  the resulting weight is close to (at least 1/(3k+1) times) the maximum possible weight
  O(n^2 log n + k n^2) time, O(n^2) space

Page 31: Locating conserved genes in whole genome scale

Hybrid Approach

first apply the clustering approach to identify clusters which are obviously conserved genes
  can apply either MUMmer-3 or MaxMinCluster
these clusters are treated as MUMs with bigger weight
then apply MSS to process these MUMs together with the remaining MUMs
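The middle step, treating each cluster as a single heavier MUM, might look like the Python sketch below. The talk does not say how the bigger weight is computed, so summing the member lengths is an assumption made here for illustration; `collapse_clusters` is a hypothetical helper, and the selection step that would follow (MSS) is not shown.

```python
def collapse_clusters(clusters, leftover_mums):
    """Replace each cluster by one weighted pseudo-MUM and pool it with the rest.

    clusters:      list of clusters, each a list of (pos_a, pos_b, length)
    leftover_mums: MUMs not in any cluster, as (pos_a, pos_b, length)
    Returns (pos_a, pos_b, weight) triples sorted by position in genome A.
    Assumption: a cluster's weight is the sum of its members' lengths.
    """
    pooled = []
    for cluster in clusters:
        pos_a = min(m[0] for m in cluster)
        pos_b = min(m[1] for m in cluster)
        weight = sum(m[2] for m in cluster)
        pooled.append((pos_a, pos_b, weight))
    # Unclustered MUMs keep their own length as weight.
    pooled.extend(leftover_mums)
    return sorted(pooled)


clusters = [[(0, 1000, 25), (50, 1050, 20)]]
leftover = [(5000, 200, 20)]
print(collapse_clusters(clusters, leftover))  # [(0, 1000, 45), (5000, 200, 20)]
```

The pooled list is what a weight-based selection step would then work on, with clusters counting for more than isolated MUMs.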

Page 33: Locating conserved genes in whole genome scale

Remarks

Experiments show that
  MaxMinCluster > LCS
  MSS > MaxMinCluster for closely related species
  MSS does not perform well for species relatively farther apart
  Hybrid approach is the best for both closely related and farther apart species

Page 34: Locating conserved genes in whole genome scale

Thank you!

Q & A

Page 35: Locating conserved genes in whole genome scale

Approximation Algorithm

Super-Backbone
  maximum weight common subsequences

Identify k mutation blocks
  having high weight
  do not overlap with Super-Backbone too much
  this is formulated as a sub-problem and solved optimally by dynamic programming

Report Super-Backbone & k mutation blocks

O(n^2 log n + k n^2) time, O(n^2) space
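The first step, a maximum weight common subsequence, can be sketched with a quadratic dynamic program in Python. This covers only the backbone step and says nothing about how the k mutation blocks are found; it is an illustrative sketch, not the authors' implementation. MUMs are hypothetical `(pos_a, pos_b, weight)` triples with distinct positions, and signs are ignored.

```python
def max_weight_backbone(mums):
    """Return (total_weight, chain) for the heaviest order-preserving subset.

    best[i] is the largest total weight of a chain that ends at ordered[i],
    where a chain must increase in both genome positions.
    """
    ordered = sorted(mums)
    n = len(ordered)
    if n == 0:
        return 0, []
    best = [m[2] for m in ordered]
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            # ordered[j] precedes ordered[i] in genome A; require the same in B.
            if ordered[j][1] < ordered[i][1] and best[j] + ordered[i][2] > best[i]:
                best[i] = best[j] + ordered[i][2]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    total, chain = best[end], []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    return total, chain[::-1]


mums = [(1, 5, 50), (2, 1, 10), (3, 2, 10), (4, 6, 10), (5, 3, 10)]
print(max_weight_backbone(mums))  # (60, [(1, 5, 50), (4, 6, 10)])
```

With weights, two MUMs totalling 60 beat the three-MUM chain of weight 30, so a long MUM is not sacrificed to keep several short ones in order.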

Page 36: Locating conserved genes in whole genome scale

Mutations

three types of mutations: reversal, transposition, reversed-transposition

original:
a b c d e f g h i j k l m n o p q r s t u v w x y z

after reversal:
a d c b e f g h i j k l m n o p q r s t u v w x y z

after transposition:
a d c b e k l m n o p q r s t u v w x y f g h i j z

after reversed-transposition:
a d c b e k l t s r q p o m n u v w x y f g h i j z
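The alphabet example can be replayed with two small string helpers in Python. The helper names and the index conventions (half-open slices, destination given as an index in the sequence after the block is cut out) are choices made for this sketch, not notation from the talk.

```python
def reversal(seq: str, i: int, j: int) -> str:
    """Reverse the block seq[i:j], leaving the rest untouched."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def move_block(seq: str, i: int, j: int, dest: int, reverse: bool = False) -> str:
    """Cut seq[i:j] out and reinsert it at index `dest` of the remainder.

    With reverse=False this is a transposition; with reverse=True the block
    is also flipped, giving a reversed-transposition.
    """
    block, rest = seq[i:j], seq[:i] + seq[j:]
    if reverse:
        block = block[::-1]
    return rest[:dest] + block + rest[dest:]


s0 = "abcdefghijklmnopqrstuvwxyz"
s1 = reversal(s0, 1, 4)                       # reverse "bcd"
s2 = move_block(s1, 5, 10, 20)                # move "fghij" to just before "z"
s3 = move_block(s2, 9, 15, 7, reverse=True)   # move "opqrst" before "mn", flipped

print(" ".join(s1))  # a d c b e f g h i j k l m n o p q r s t u v w x y z
print(" ".join(s2))  # a d c b e k l m n o p q r s t u v w x y f g h i j z
print(" ".join(s3))  # a d c b e k l t s r q p o m n u v w x y f g h i j z
```

Each printed line matches the corresponding row of the example, so three mutations take the original sequence to the final one.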