identifying conserved spatial patterns in genomes

Identifying conserved spatial patterns

in genomes

Rose HobermanComputer Science Department

Carnegie Mellon University

University of Chicago0ct 20, 2006

My focus: Spatial Comparative Genomics

Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and

evolves.

A simple model of a genome

an ordered list of genes

4 53 71 62 8 9 11 12 1310 14 15 1716 19 20183

Genomic Change

speciation

Sequence Mutation + Chromosomal Rearrangements

species 2

species 1

Ancestral genome

Types of Genomic Rearrangements

4 53 71 62

8 97 11 12104 531 62 13 14 15 1716 19 2018

8 9 11 12 1310 14 15 1716 19 2018

Inversions

431 2

Duplications/Insertions

3 89111213 10141517 161920 18

20191813 14 15 1716

Loss

8 97 11 12104 531 62 431 2 2013 14 15 1716

Fissions and fusions

4 53 71 2 3 13141517 161920 18 3 891112 10

Types of Genomic Rearrangements

InversionsDuplications/InsertionsLoss

An Essential Task forSpatial Comparative Genomics

4 53 71 2

8 97 11 12104 531 62 431 2

13141517 161920 18

2013 14 15 1716

3

431 2

891112 10

Species 2

Species 1

Identify chromosomal regions that descended from the same region in the genome of the common ancestor

Outline

Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic

regions?

Problem Background Why is this challenging? Introduction to cluster finding

Results Statistics for pairwise clusters Statistics for three-way clusters

Pevzner, Tesler. Genome Research 2003

Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution

Reconstruct history of chromosomal rearrangements

Infer ancestral genetic map

Phylogeny reconstruction

Identify ancient whole genome duplications

Identification of homologous chromosomal segments is a key task in comparative genomics

Guillaume Bourque et al. Genome Research 2004

1. Genome evolution Reconstruct

history of chromosomal rearrangements









Whole genome duplication

chromosome 2chromosome 1

Ancestral chromosome

chr 2

chr 1





Identify ancient whole genome duplications McLysaght et al

Nature Genetics, 2002.

Infer functional associations

Predict operons

Identify horizontal transfers

2. Understand gene function and regulation in bacteria

Identification of conserved chromosomal structure is a key task in comparative genomics

Loss

Insertion

Inversions

Outline


regions?



Closely related genomes

Related regions are easy to identify

4 53 71 2

8 97 11 12104 531 62 431 2

13141517 161920 18

2013 14 15 1716

3

431 2

891112 10Species 1

Species 2

Five hundred million years...

Homologous regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly

preserved

More Distantly Related Genomes

45

3 71 2

89 11 12104 51 62 31 2

89 11121310 141517 161920

18

2013 1415 17 16

3

4 1

18

7

11

Gene clusters Similar gene content Neither gene content nor order is perfectly

45

3 71 2

89 11 12104 51 62 31 2

89 11121310 141517 161920

18

2013 1415 17 16

3

4 1

18

7

11

The signature of diverged regions

A Framework for Identifying Gene Clusters

1. Find homologous genes

2. Formally define a “gene cluster”

3. Design an algorithm to find clusters

4. Statistically verify clusters

Why Validate Clusters Statistically?

After sufficient time has passed, gene order will become randomized

Uniform random data tends to be “clumpy”

Some genes will end up close together in both genomes simply by chance

43 71 2

89 11 12104 51 62 31 2

1310 141517 161920

2013 1415 17 16

3

4 1

18

7

11

Cluster Statistics

r-windows

max-gap

2 regionsDurand & Sankoff,

2003my work

3 regions my work unsolved

Cluster Models

Outline


regions?






3. Design an algorithm to find clusters


Gene Homology

Identification of homologous gene pairs generally based on sequence similarity conserved genomic context is also informative

Assumptions matches are binary (similarity scores are

discarded) each gene is homologous to at most one other

gene in the other genome


1. Find homologous genes Formally define a “gene cluster”

3. Devise an algorithm to identify clusters


Where are the gene clusters?

Intuitive notion: pairs of regions that are dense with homologs

How can we formalize this intuition?

The r-window cluster definition

Two windows of size r that share at least m homologous gene pairs

r is a user-specified parameter

r = 6

(Calvacanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)

A max-gap chain

The distance or “gap” between genes is equal to the number of intervening genes

A set of genes in a genome form a max-gap chain if the gap between adjacent genes is never

greater than g (a user-specified parameter)

g= 2 gap= 3

The max-gap cluster definition

g= 2 g= 3gap= 3

A set of genes form a max-gap cluster in two genomes if

1. the genes forms a max-gap chain in each genome

2. the cluster is maximal (i.e. not contained within a larger cluster)



2. Formally define a “gene cluster” Devise an algorithm to identify

clusters


Max-gap search algorithms

Many genomic studies use a max-gap criteria Each group designs their own search algorithm These are often greedy algorithms, but greedy

algorithms miss disordered max-gap clusters

Hoberman et al, RECOMB Comp Genomics 2005

Greedy, Agglomerative Algorithms

1. initialize a cluster as a single homologous pair

2. search for a gene in proximity on both chromosomes

3. either extend the cluster and repeat, or terminate

g = 2

Greedy Algorithms Impose Order Constraints

g = 2

A max-gap cluster of size four

A greedy, agglomerative algorithm will not find this cluster since there is no max-gap cluster of size

two

Algorithms and Definition Mismatch

Greedy, bottom-up algorithms will not find all max-gap clusters

There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002)

Cluster statistics depend on the search space, which depends on which algorithm is used




3. Devise an algorithm to identify clusters

Statistically verify clustersAn example…

Example: Whole genome self-comparison to detect duplicated blocks

Chose g=30

Compared all human chromosomes to all other chromosomes to find max-gap gene clusters

McLysaght et al. Nature Genetics, 2002.

How can we use statistical analysis?

Could two regions display this degree of similarity simply by chance?

Is g=30 a reasonable choice of gap size? Are larger clusters less likely to occur by

chance? How large does a cluster have to be before we

are surprised to observe it?

McLysaght et al. Nature Genetics, 2002.

3 genes duplicatedout of ~25 13 genes

29 genes

10 genes duplicatedout of ~100

Chr 3

Chr 17

Outline


regions?



Cluster Statistics

r-windows

max-gap


2003my work


Cluster Models

The max-gap definition is the most widely used cluster definition in genomic analyses

Yet there is no formal statistical model for max-gap clusters

Overbeek et al 1999: inferring functional coupling of genes in bacteriaVision et al 2000: origins of genomic duplications in ArabidopsisFriedman and Hughes 2001: gene duplication and structure of eukaryotic

genomesTamames 2001: evolution of gene order conservation in prokaryotesVandepoele et al 2002: microcolinearity between Arabidopsis and riceMcLysaght et al 2002: genomic duplication during early chordate

evolutionSimillion et al 2002: hidden duplications in Arabidopsis Blanc et al 2003: recent polyploidy in ArabidopsisLuc et al 2003: gene teams for comparative genomicsChen et al 2004: operon prediction in newly sequenced bacteriaBourque et al 2005: comparison of mammalian and chicken genome

architectures …

Is this cluster biologically meaningful? Could it have occurred in a comparison of

two random genomes?

The QuestionSuppose two whole genomes were

compared, and this max-gap cluster was identified:

Statistical Testing

Hypothesis testing Alternate hypothesis: shared ancestry Null hypothesis: random gene order

Discard clusters that could have arisen under the null model

Determine the probability of observing a similar cluster under the null hypothesis

Given an allowed gap size of g, what is the probability of observing a max-gap cluster

containing exactly h matching gene pairs?

How do we calculate this probability?

h=4

The Problem

The Inputs

n: number of genes in each genome

m: number of matching genes pairs

g: the maximum gap allowed in a cluster

h: number of matching genes in the cluster

h=4

m=6 n=22

g=2

The probability when m = n

…the probability of a max-gap cluster is 1

(regardless of the allowed gap size)

If gene content is identical…

Probability of a cluster of size h when m < n

h genes m genesm-h genes

1. Create chains of h genes in both

genomes

2. Place m-h remaining genes so they do not

extend the cluster

3. Normalize to get a probability

*

Basic approachEnumerate all ways to:

Probability of observing a cluster of size h

number of ways to place h genes so they form a chain

in both genomes

number of ways to place m-h remaining genes so they do not

extend the cluster

All configurations of m gene pairs in two genomes of size n

Number of ways to place h genes in two genomes so they form a cluster

h genes m genesm-h genes

Choose h genes to compose

the cluster

Assign each gene to a

selected spot in each genome

Select h spots in each genome, so they form a

max-gap chain

Hoberman, Sankoff, Durand RECOMB Comparative Genomics, 2004



in both genomes


extend the cluster

All configurations of m gene pairs in two genomes of size n

Approach: design a rule specifying where the genes can

be placed so that the cluster is not extended count the positions that satisfy the rule

Counting the number of ways to place m-h genes outside the cluster

g = 1

h = 3

Rule 1: A gene can go anywhere except in the cluster (the white box).

g = 1

gaps ≤ 1


Too lenient

overcounts

Rule 2: Every gene must be at least g+1 positions from

the cluster (outside the grey box).

g = 1

g + 1

g +1


Too strict

undercounts

g = 1

h = 3 gap > 1

gap > 1


Rule 2: Every gene must be at least g+1 positions from

the cluster (outside the grey box).

Too strict

undercounts

g = 1

gap > 1


Rule 3: At most one member of each gene pair can be in

the grey box.

Too lenient

overcounts

Rule 3: At most one member of each gene pair can be in

the grey box.

g = 1

Counting the number of ways to place m-h genes outside the clustergaps ≤ 1

Too lenient

overcounts

g = 1

Acceptable positions for a gene depend on the positions of the remaining genes

Use strict and lenient rules to calculate upper and lower bounds on G


Estimating G

Upper bound: Erroneously

enumerates this configuration

Lower bound: Fails to enumerate this configuration



in both genomes


extend the cluster

Hoberman, Sankoff, Durand Journal of Computational Biology, 2005

What can we learn from this statistical result? Are we less likely to observe a large cluster

(containing more gene pairs) than a small cluster?

How large does a cluster have to be before we are surprised to observe it?

How do we choose the maximum allowed gap value?

Clu

ster

Pro

babi

lity

g=10

Whole-genome comparison cluster statistics

g=20

n=1000, m=250

h (cluster size)

With a significance

threshold of 10-4, any cluster

containing 8 or more genes is

significant.

E. coli vs B. Subtilis

Under null hypothesis, by g=25 all genes should form a single cluster

Completecluster doesn’t form until g=110

Typical operonsizes

clusters above the orange line are significant at the .001 level

Algorithm:Bergeron et al, 02

Statistics:Hoberman et al, 05

By g=9, probabilities no longer decrease monotonically with cluster size

Statistical analysis of max-gap gene clusters

1. Data analysis: Identifies statistically significant max-gap clusters

2. Search: Suggests a more principled approach for choosing a gap size that will yield significant clusters

3. Modeling: Provides insight into criteria for cluster definitions larger clusters should be less likely to occur by

chance

Outline


regions?



Cluster Statistics

r-windows

max-gap


2003my work


Cluster Models

Joint work with Narayanan Raghupathy

Duplicated segments can be particularly difficult to identify


chromosome 2

chromosome 1


chr 2

chr 1

Duplicated segments can be particularly difficult to identify


chr 2

chr 1


Reciprocal gene loss

Comparisons of multiple genomes offer significantly more power to detect highly diverged homologous segments

Arabidopsis thaliana region 1

Rice chr 2

Arabidopsis thaliana region 2

Simillion et al, PNAS 2002

Rice

At 2At 1

5 3

0

1

20 22

22

126 26

At 2At 1

Existing Statistical Approaches

x2

x23x123

x3x13x1

x12

1. Only consider x123, genes shared between all regions disregards pairwise overlaps2. Conduct multiple pairwise comparisons does not consider the greater impact of genes

shared among all three regions requires that at least two pairwise clusters are

independently significant

W1W3

W2

Durand and Sankoff, J Comp Biology, 2003Methods reviewed in Simillion et al, Bioessays, 2004

No existing statistical approach considers all these quantities in a single test.

r-windows for pairwise comparison

Two windows of size r that share at least m homologous gene pairs

r = 6

With two regions there is a natural test statistic

An r-window cluster spanning two regions can be characterized by the number of shared

genes, m.

mr-m r-m

Durand and Sankoff, J Comp Biology, 2003

The probability that two random windows share m

genes:

r = 6

With three regions there are many more quantities to consider

Three-way overlap: x123

Pairwise overlaps: x12, x23, x13

x3

x23x123

x2x12x1

x13

We want to consider both types of overlaps

Our r-window statistics for three windows

Given three randomly selected windows of length r,

we derived equations for the probability of observing such a cluster

under a null hypothesis of random gene order.

),,,( 232313131212123123 xXxXxXxXP

Raghupathy, Hoberman, Durand. APBC 2007.

x3

x23x123

x2x12x1

x13

First tests that uses both pairwise and three-way overlaps

How serious are these limitations?

Pairwise tests have a number of limitations:

1. They do not consider the greater impact of genes shared among all three regions

2. They require that at least two pairwise comparisons are independently significant

Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.

What is the impact of ignoring three-way genes?

a) Three genes are shared by all three windows

b) Three distinct genes are shared by each pair of windows

030

0 303

3

Each pair of regions shares exactly

three genes

Which cluster is least likely to occur by chance?

a)

0h0

0

h0h

h

b)

n=5000, r=100

h

h

hxxij 123,00, 123 xhxij

(a) is less likely to occur than (b)

a) Three genes are shared by all three windows

b) Three distinct genes are shared by each pair of windows

Pairwise tests cannot distinguish these two clusters

but pairwise tests score them identically

How much of a limitation is this?

Pairwise tests have a number of limitations:

1. They do not consider the greater impact of genes shared among all three regions

2. They require that at least two pairwise comparisons are independently significant

Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.

When no genes appear in all three regions, are pairwise statistical tests sufficient?

n=5000, r =100, x123=0

h

α=0.001

Two Pairwise Tests

h hh

h ?h

Three-window

0

0

When no genes appear in all three regions, are pairwise statistical tests sufficient?

The three-window test is strictly more general than two pairwise tests

The three-window test will always reject the null hypothesis for any cluster in which the pairwise tests reject the null hypothesis

Two Pairwise Tests

5 55

8 ?8

Three-window

0

0

These results suggest that multi-region tests can identify more distantly related homologous regions.

There is an ongoing debate about whether there was 1 or 2 whole genome duplications in early vertebrates

More powerful statistical tests might help resolve this issue

Summary1. First statistical tests for max-gap clusters,

the most commonly used definition determines minimum significant cluster size allows more principled selection of gap

parameter reveals problems with max-gap definition

2. More sensitive statistical tests for comparisons of three genomic regions first test to consider both pairwise and three-

way overlaps can identify more distantly related regions may be able to resolve outstanding questions in

molecular evolution

Acknowledgements

Collaborators Dannie Durand, Biological Sciences and

Computer Science, CMU David Sankoff, Math and Statistics University of

Ottawa Narayanan Raghupathy, Biological Sciences,

CMU

Funders Barbara Lazarus Women@IT Fellowship The Sloan Foundation

Thanks

Questions?

Advantages of an analytical approach

Analyzing incomplete datasets Principled parameter selection Understanding statistical trends Insight into tradeoffs between

definitions Efficiency Accuracy

1 2 3 4 5 …. n-L+1 …. n

Ways to place the leftmost gene in the chain, so there are at least L-1 places left

The maximum length of the chain is: L = (h-1)g + h

The number of ways to create a chain of h genes

Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

There are h-1 gaps in a chain of h genes

Choices for the size of each gap

(from 0 to g)


Chains near the end of

the genome

Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

There are h-1 gaps in a chain of h genes

1 2 3 4 5 …. n-L+1 …. n

Choices for the size of each gap

(from 0 to g)


l = w-1

Gaps are constrained:

And sum of gaps is constrained:

Counting clusters at the end of the genome

l = m

g1 g2 g3 gm-1

l < w

A known solution:

l = w-1

Gaps are constrained:

And sum of gaps is constrained:

Counting clusters at the end of the genome

l = m

…

d(m,g,m)

w-2

Cluster Length

d(m,g,m)

+ …

d(m,g,m) + d(m,g,m+1) +

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

ww-1 … m+1m

d(m,g,m+1)+

d(m,g,m) d(m,g,m+1)+

Line of Symmetry

l = w-1

l = m

Exploiting Symmetry

g=

1

1

g g-1

g-1

=

wm

m+1 w-1

1

2 g-2

=m+2 w-2 1

2 g-2

g

g

g-1 g-1

g

g

l=

…

d(m,g,m)

w-2

Cluster Length

d(m,g,m)

+ …

d(m,g,m) + d(m,g,m+1) +

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

ww-1 … m+1m

d(m,g,m+1)+

d(m,g,m) d(m,g,m+1)+

(g+1)m-1

(g+1)m-1

(g+1)m-1

l = w-1

l = m

Number of ways to position h genes in a genome of n genes so they form a max-gap chain

Ways to placeremaining h-1

genes

Starting positions near end

Starting positions

identifying conserved spatial patterns in genomes

Documents

chromosomal regions

genome research

key task

genome duplicationspevzner

genome changes

homologous regions

related genomic regions

ancestral chromosomechr