identifying conserved spatial patterns in genomes
DESCRIPTION
Identifying conserved spatial patterns in genomes. Rose Hoberman Computer Science Department Carnegie Mellon University. University of Chicago 0ct 20, 2006. My focus: Spatial Comparative Genomics. - PowerPoint PPT PresentationTRANSCRIPT
Identifying conserved spatial patterns
in genomes
Rose HobermanComputer Science Department
Carnegie Mellon University
University of Chicago0ct 20, 2006
My focus: Spatial Comparative Genomics
Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and
evolves.
A simple model of a genome
an ordered list of genes
4 53 71 62 8 9 11 12 1310 14 15 1716 19 20183
Genomic Change
speciation
Sequence Mutation + Chromosomal Rearrangements
species 2
species 1
Ancestral genome
Types of Genomic Rearrangements
4 53 71 62
8 97 11 12104 531 62 13 14 15 1716 19 2018
8 9 11 12 1310 14 15 1716 19 2018
Inversions
431 2
Duplications/Insertions
3 89111213 10141517 161920 18
20191813 14 15 1716
Loss
8 97 11 12104 531 62 431 2 2013 14 15 1716
Fissions and fusions
4 53 71 2 3 13141517 161920 18 3 891112 10
Types of Genomic Rearrangements
InversionsDuplications/InsertionsLoss
An Essential Task forSpatial Comparative Genomics
4 53 71 2
8 97 11 12104 531 62 431 2
13141517 161920 18
2013 14 15 1716
3
431 2
891112 10
Species 2
Species 1
Identify chromosomal regions that descended from the same region in the genome of the common ancestor
Outline
Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic
regions?
Problem Background Why is this challenging? Introduction to cluster finding
Results Statistics for pairwise clusters Statistics for three-way clusters
Pevzner, Tesler. Genome Research 2003
Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution
Reconstruct history of chromosomal rearrangements
Infer ancestral genetic map
Phylogeny reconstruction
Identify ancient whole genome duplications
Identification of homologous chromosomal segments is a key task in comparative genomics
Guillaume Bourque et al. Genome Research 2004
1. Genome evolution Reconstruct
history of chromosomal rearrangements
Infer ancestral genetic map
Phylogeny reconstruction
Identify ancient whole genome duplications
Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution
Reconstruct history of chromosomal rearrangements
Infer ancestral genetic map
Phylogeny reconstruction
Identify ancient whole genome duplications
Whole genome duplication
chromosome 2chromosome 1
Ancestral chromosome
chr 2
chr 1
Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution
Reconstruct history of chromosomal rearrangements
Infer ancestral genetic map
Phylogeny reconstruction
Identify ancient whole genome duplications McLysaght et al
Nature Genetics, 2002.
Infer functional associations
Predict operons
Identify horizontal transfers
2. Understand gene function and regulation in bacteria
Identification of conserved chromosomal structure is a key task in comparative genomics
Loss
Insertion
Inversions
Outline
Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic
regions?
Problem Background Why is this challenging? Introduction to cluster finding
Results Statistics for pairwise clusters Statistics for three-way clusters
Closely related genomes
Related regions are easy to identify
4 53 71 2
8 97 11 12104 531 62 431 2
13141517 161920 18
2013 14 15 1716
3
431 2
891112 10Species 1
Species 2
Five hundred million years...
Homologous regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly
preserved
More Distantly Related Genomes
45
3 71 2
89 11 12104 51 62 31 2
89 11121310 141517 161920
18
2013 1415 17 16
3
4 1
18
7
11
Gene clusters Similar gene content Neither gene content nor order is perfectly
45
3 71 2
89 11 12104 51 62 31 2
89 11121310 141517 161920
18
2013 1415 17 16
3
4 1
18
7
11
The signature of diverged regions
A Framework for Identifying Gene Clusters
1. Find homologous genes
2. Formally define a “gene cluster”
3. Design an algorithm to find clusters
4. Statistically verify clusters
Why Validate Clusters Statistically?
After sufficient time has passed, gene order will become randomized
Uniform random data tends to be “clumpy”
Some genes will end up close together in both genomes simply by chance
43 71 2
89 11 12104 51 62 31 2
1310 141517 161920
2013 1415 17 16
3
4 1
18
7
11
Cluster Statistics
r-windows
max-gap
2 regionsDurand & Sankoff,
2003my work
3 regions my work unsolved
Cluster Models
Outline
Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic
regions?
Problem Background Why is this challenging? Introduction to cluster finding
Results Statistics for pairwise clusters Statistics for three-way clusters
A Framework for Identifying Gene Clusters
1. Find homologous genes
2. Formally define a “gene cluster”
3. Design an algorithm to find clusters
4. Statistically verify clusters
Gene Homology
Identification of homologous gene pairs generally based on sequence similarity conserved genomic context is also informative
Assumptions matches are binary (similarity scores are
discarded) each gene is homologous to at most one other
gene in the other genome
A Framework for Identifying Gene Clusters
1. Find homologous genes Formally define a “gene cluster”
3. Devise an algorithm to identify clusters
4. Statistically verify clusters
Where are the gene clusters?
Intuitive notion: pairs of regions that are dense with homologs
How can we formalize this intuition?
The r-window cluster definition
Two windows of size r that share at least m homologous gene pairs
r is a user-specified parameter
r = 6
(Calvacanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)
A max-gap chain
The distance or “gap” between genes is equal to the number of intervening genes
A set of genes in a genome form a max-gap chain if the gap between adjacent genes is never
greater than g (a user-specified parameter)
g= 2 gap= 3
The max-gap cluster definition
g= 2 g= 3gap= 3
A set of genes form a max-gap cluster in two genomes if
1. the genes forms a max-gap chain in each genome
2. the cluster is maximal (i.e. not contained within a larger cluster)
A Framework for Identifying Gene Clusters
1. Find homologous genes
2. Formally define a “gene cluster” Devise an algorithm to identify
clusters
4. Statistically verify clusters
Max-gap search algorithms
Many genomic studies use a max-gap criteria Each group designs their own search algorithm These are often greedy algorithms, but greedy
algorithms miss disordered max-gap clusters
Hoberman et al, RECOMB Comp Genomics 2005
Greedy, Agglomerative Algorithms
1. initialize a cluster as a single homologous pair
2. search for a gene in proximity on both chromosomes
3. either extend the cluster and repeat, or terminate
g = 2
Greedy Algorithms Impose Order Constraints
g = 2
A max-gap cluster of size four
A greedy, agglomerative algorithm will not find this cluster since there is no max-gap cluster of size
two
Algorithms and Definition Mismatch
Greedy, bottom-up algorithms will not find all max-gap clusters
There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002)
Cluster statistics depend on the search space, which depends on which algorithm is used
A Framework for Identifying Gene Clusters
1. Find homologous genes
2. Formally define a “gene cluster”
3. Devise an algorithm to identify clusters
Statistically verify clustersAn example…
Example: Whole genome self-comparison to detect duplicated blocks
Chose g=30
Compared all human chromosomes to all other chromosomes to find max-gap gene clusters
McLysaght et al. Nature Genetics, 2002.
How can we use statistical analysis?
Could two regions display this degree of similarity simply by chance?
Is g=30 a reasonable choice of gap size? Are larger clusters less likely to occur by
chance? How large does a cluster have to be before we
are surprised to observe it?
McLysaght et al. Nature Genetics, 2002.
3 genes duplicatedout of ~25 13 genes
29 genes
10 genes duplicatedout of ~100
Chr 3
Chr 17
Outline
Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic
regions?
Problem Background Why is this challenging? Introduction to cluster finding
Results Statistics for pairwise clusters Statistics for three-way clusters
Cluster Statistics
r-windows
max-gap
2 regionsDurand & Sankoff,
2003my work
3 regions my work unsolved
Cluster Models
The max-gap definition is the most widely used cluster definition in genomic analyses
Yet there is no formal statistical model for max-gap clusters
Overbeek et al 1999: inferring functional coupling of genes in bacteriaVision et al 2000: origins of genomic duplications in ArabidopsisFriedman and Hughes 2001: gene duplication and structure of eukaryotic
genomesTamames 2001: evolution of gene order conservation in prokaryotesVandepoele et al 2002: microcolinearity between Arabidopsis and riceMcLysaght et al 2002: genomic duplication during early chordate
evolutionSimillion et al 2002: hidden duplications in Arabidopsis Blanc et al 2003: recent polyploidy in ArabidopsisLuc et al 2003: gene teams for comparative genomicsChen et al 2004: operon prediction in newly sequenced bacteriaBourque et al 2005: comparison of mammalian and chicken genome
architectures …
Is this cluster biologically meaningful? Could it have occurred in a comparison of
two random genomes?
The QuestionSuppose two whole genomes were
compared, and this max-gap cluster was identified:
Statistical Testing
Hypothesis testing Alternate hypothesis: shared ancestry Null hypothesis: random gene order
Discard clusters that could have arisen under the null model
Determine the probability of observing a similar cluster under the null hypothesis
Given an allowed gap size of g, what is the probability of observing a max-gap cluster
containing exactly h matching gene pairs?
How do we calculate this probability?
h=4
The Problem
The Inputs
n: number of genes in each genome
m: number of matching genes pairs
g: the maximum gap allowed in a cluster
h: number of matching genes in the cluster
h=4
m=6 n=22
g=2
The probability when m = n
…the probability of a max-gap cluster is 1
(regardless of the allowed gap size)
If gene content is identical…
Probability of a cluster of size h when m < n
h genes m genesm-h genes
1. Create chains of h genes in both
genomes
2. Place m-h remaining genes so they do not
extend the cluster
3. Normalize to get a probability
*
Basic approachEnumerate all ways to:
Probability of observing a cluster of size h
number of ways to place h genes so they form a chain
in both genomes
number of ways to place m-h remaining genes so they do not
extend the cluster
All configurations of m gene pairs in two genomes of size n
Probability of observing a cluster of size h
number of ways to place h genes so they form a chain
in both genomes
number of ways to place m-h remaining genes so they do not
extend the cluster
All configurations of m gene pairs in two genomes of size n
Number of ways to place h genes in two genomes so they form a cluster
h genes m genesm-h genes
Choose h genes to compose
the cluster
Assign each gene to a
selected spot in each genome
Select h spots in each genome, so they form a
max-gap chain
Hoberman, Sankoff, Durand RECOMB Comparative Genomics, 2004
Probability of observing a cluster of size h
number of ways to place h genes so they form a chain
in both genomes
number of ways to place m-h remaining genes so they do not
extend the cluster
All configurations of m gene pairs in two genomes of size n
Approach: design a rule specifying where the genes can
be placed so that the cluster is not extended count the positions that satisfy the rule
Counting the number of ways to place m-h genes outside the cluster
g = 1
h = 3
Rule 1: A gene can go anywhere except in the cluster (the white box).
g = 1
gaps ≤ 1
Counting the number of ways to place m-h genes outside the cluster
Too lenient
overcounts
Rule 2: Every gene must be at least g+1 positions from
the cluster (outside the grey box).
g = 1
g + 1
g +1
Counting the number of ways to place m-h genes outside the cluster
Too strict
undercounts
g = 1
h = 3 gap > 1
gap > 1
Counting the number of ways to place m-h genes outside the cluster
Rule 2: Every gene must be at least g+1 positions from
the cluster (outside the grey box).
Too strict
undercounts
g = 1
gap > 1
Counting the number of ways to place m-h genes outside the cluster
Rule 3: At most one member of each gene pair can be in
the grey box.
Too lenient
overcounts
Rule 3: At most one member of each gene pair can be in
the grey box.
g = 1
Counting the number of ways to place m-h genes outside the clustergaps ≤ 1
Too lenient
overcounts
g = 1
Acceptable positions for a gene depend on the positions of the remaining genes
Use strict and lenient rules to calculate upper and lower bounds on G
Counting the number of ways to place m-h genes outside the cluster
Estimating G
Upper bound: Erroneously
enumerates this configuration
Lower bound: Fails to enumerate this configuration
Probability of observing a cluster of size h
number of ways to place h genes so they form a chain
in both genomes
number of ways to place m-h remaining genes so they do not
extend the cluster
Hoberman, Sankoff, Durand Journal of Computational Biology, 2005
What can we learn from this statistical result? Are we less likely to observe a large cluster
(containing more gene pairs) than a small cluster?
How large does a cluster have to be before we are surprised to observe it?
How do we choose the maximum allowed gap value?
Clu
ster
Pro
babi
lity
g=10
Whole-genome comparison cluster statistics
g=20
n=1000, m=250
h (cluster size)
With a significance
threshold of 10-4, any cluster
containing 8 or more genes is
significant.
E. coli vs B. Subtilis
Under null hypothesis, by g=25 all genes should form a single cluster
Completecluster doesn’t form until g=110
Typical operonsizes
clusters above the orange line are significant at the .001 level
Algorithm:Bergeron et al, 02
Statistics:Hoberman et al, 05
By g=9, probabilities no longer decrease monotonically with cluster size
Statistical analysis of max-gap gene clusters
1. Data analysis: Identifies statistically significant max-gap clusters
2. Search: Suggests a more principled approach for choosing a gap size that will yield significant clusters
3. Modeling: Provides insight into criteria for cluster definitions larger clusters should be less likely to occur by
chance
Outline
Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic
regions?
Problem Background Why is this challenging? Introduction to cluster finding
Results Statistics for pairwise clusters Statistics for three-way clusters
Cluster Statistics
r-windows
max-gap
2 regionsDurand & Sankoff,
2003my work
3 regions my work unsolved
Cluster Models
Joint work with Narayanan Raghupathy
Duplicated segments can be particularly difficult to identify
Whole genome duplication
chromosome 2
chromosome 1
Ancestral chromosome
chr 2
chr 1
Duplicated segments can be particularly difficult to identify
Ancestral chromosome
chr 2
chr 1
Whole genome duplication
Reciprocal gene loss
Comparisons of multiple genomes offer significantly more power to detect highly diverged homologous segments
Arabidopsis thaliana region 1
Rice chr 2
Arabidopsis thaliana region 2
Simillion et al, PNAS 2002
Rice
At 2At 1
5 3
0
1
20 22
22
126 26
At 2At 1
Existing Statistical Approaches
x2
x23x123
x3x13x1
x12
1. Only consider x123, genes shared between all regions disregards pairwise overlaps2. Conduct multiple pairwise comparisons does not consider the greater impact of genes
shared among all three regions requires that at least two pairwise clusters are
independently significant
W1W3
W2
Durand and Sankoff, J Comp Biology, 2003Methods reviewed in Simillion et al, Bioessays, 2004
No existing statistical approach considers all these quantities in a single test.
r-windows for pairwise comparison
Two windows of size r that share at least m homologous gene pairs
r = 6
With two regions there is a natural test statistic
An r-window cluster spanning two regions can be characterized by the number of shared
genes, m.
mr-m r-m
Durand and Sankoff, J Comp Biology, 2003
The probability that two random windows share m
genes:
r = 6
With three regions there are many more quantities to consider
Three-way overlap: x123
Pairwise overlaps: x12, x23, x13
x3
x23x123
x2x12x1
x13
We want to consider both types of overlaps
Our r-window statistics for three windows
Given three randomly selected windows of length r,
we derived equations for the probability of observing such a cluster
under a null hypothesis of random gene order.
),,,( 232313131212123123 xXxXxXxXP
Raghupathy, Hoberman, Durand. APBC 2007.
x3
x23x123
x2x12x1
x13
First tests that uses both pairwise and three-way overlaps
How serious are these limitations?
Pairwise tests have a number of limitations:
1. They do not consider the greater impact of genes shared among all three regions
2. They require that at least two pairwise comparisons are independently significant
Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.
What is the impact of ignoring three-way genes?
a) Three genes are shared by all three windows
b) Three distinct genes are shared by each pair of windows
030
0 303
3
Each pair of regions shares exactly
three genes
Which cluster is least likely to occur by chance?
a)
0h0
0
h0h
h
b)
n=5000, r=100
h
h
hxxij 123,00, 123 xhxij
(a) is less likely to occur than (b)
a) Three genes are shared by all three windows
b) Three distinct genes are shared by each pair of windows
Pairwise tests cannot distinguish these two clusters
but pairwise tests score them identically
How much of a limitation is this?
Pairwise tests have a number of limitations:
1. They do not consider the greater impact of genes shared among all three regions
2. They require that at least two pairwise comparisons are independently significant
Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.
When no genes appear in all three regions, are pairwise statistical tests sufficient?
n=5000, r =100, x123=0
h
α=0.001
Two Pairwise Tests
h hh
h ?h
Three-window
0
0
When no genes appear in all three regions, are pairwise statistical tests sufficient?
The three-window test is strictly more general than two pairwise tests
The three-window test will always reject the null hypothesis for any cluster in which the pairwise tests reject the null hypothesis
Two Pairwise Tests
5 55
8 ?8
Three-window
0
0
These results suggest that multi-region tests can identify more distantly related homologous regions.
There is an ongoing debate about whether there was 1 or 2 whole genome duplications in early vertebrates
More powerful statistical tests might help resolve this issue
Summary1. First statistical tests for max-gap clusters,
the most commonly used definition determines minimum significant cluster size allows more principled selection of gap
parameter reveals problems with max-gap definition
2. More sensitive statistical tests for comparisons of three genomic regions first test to consider both pairwise and three-
way overlaps can identify more distantly related regions may be able to resolve outstanding questions in
molecular evolution
Acknowledgements
Collaborators Dannie Durand, Biological Sciences and
Computer Science, CMU David Sankoff, Math and Statistics University of
Ottawa Narayanan Raghupathy, Biological Sciences,
CMU
Funders Barbara Lazarus Women@IT Fellowship The Sloan Foundation
Thanks
Questions?
Advantages of an analytical approach
Analyzing incomplete datasets Principled parameter selection Understanding statistical trends Insight into tradeoffs between
definitions Efficiency Accuracy
1 2 3 4 5 …. n-L+1 …. n
Ways to place the leftmost gene in the chain, so there are at least L-1 places left
The maximum length of the chain is: L = (h-1)g + h
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so there are at least L-1 slots left
There are h-1 gaps in a chain of h genes
Choices for the size of each gap
(from 0 to g)
The number of ways to create a chain of h genes
Chains near the end of
the genome
Ways to place the leftmost gene in the chain, so there are at least L-1 slots left
There are h-1 gaps in a chain of h genes
1 2 3 4 5 …. n-L+1 …. n
Choices for the size of each gap
(from 0 to g)
The number of ways to create a chain of h genes
l = w-1
Gaps are constrained:
And sum of gaps is constrained:
Counting clusters at the end of the genome
l = m
g1 g2 g3 gm-1
l < w
A known solution:
l = w-1
Gaps are constrained:
And sum of gaps is constrained:
Counting clusters at the end of the genome
l = m
…
d(m,g,m)
w-2
Cluster Length
d(m,g,m)
+ …
d(m,g,m) + d(m,g,m+1) +
d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)
d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)
ww-1 … m+1m
d(m,g,m+1)+
d(m,g,m) d(m,g,m+1)+
Line of Symmetry
l = w-1
l = m
Exploiting Symmetry
g=
1
1
g g-1
g-1
=
wm
m+1 w-1
1
2 g-2
=m+2 w-2 1
2 g-2
g
g
g-1 g-1
g
g
l=
…
d(m,g,m)
w-2
Cluster Length
d(m,g,m)
+ …
d(m,g,m) + d(m,g,m+1) +
d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)
d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)
ww-1 … m+1m
d(m,g,m+1)+
d(m,g,m) d(m,g,m+1)+
(g+1)m-1
(g+1)m-1
(g+1)m-1
l = w-1
l = m
Number of ways to position h genes in a genome of n genes so they form a max-gap chain
Ways to placeremaining h-1
genes
Starting positions near end
Starting positions