Page 1: Multiple Species Gene Finding Sourav Chatterji souravc@eecs.berkeley.edu

Multiple Species Gene Finding

Sourav Chatterji

souravc@eecs.berkeley.edu

Predicting Replication Origins in Yeast. Breier AM, Chatterji S, Cozzarelli NR. Genome Biol. 2004;5(4):R22.

Comparative GeneFinding using SLAM.

Paired Splice Site Detection in SLAM. Zhao X, Huang H, Speed TP. Proceedings of RECOMB 2004; 68-75.

Rat Genome Sequencing Consortium. Nature. 2004 Apr 1;428(6982):493-521.

Multiple Species GeneFinding. Chatterji S, Pachter L. Proceedings of RECOMB 2004; 187-193.

Evidence Based GeneFinding - Work in Progress.

State of the Genomes

5 Roundworm Genomes
  C. elegans and C. briggsae completed.
  3 other worm genomes in progress.

11 Fruitfly Genomes
  D. melanogaster completed.
  Sequencing of 7 genomes in progress.
  3 more genomes in pipeline.

18 Mammalian Genomes
  Human, Mouse, and Rat genomes published.
  Sequencing of 6 genomes in progress.
  9 other genomes in pipeline.
  4 primate genomes: Orangutan, Macaque, Chimpanzee and Human.

Outline

GeneFinding by Gibbs Sampling
  Ab-Initio GeneFinding in Vertebrates
  Gibbs Sampling in HMMs
  Gene finding by Gibbs sampling
  Results

Evidence Based Multiple Species GeneFinding
  Evidence based GeneFinding
  ExonAligner: An Exon Alignment Program
  Initial Results
  Proposals for Future Work

Gene Finding in Vertebrates

Single organism gene finding: GENSCAN/GENIE/SNAP…
  Based on generalized HMMs
  Viterbi Sequence (Gene Annotation).
  High Sensitivity/Low Specificity

Conserved regions among related species are more likely to be functional than divergent regions.

IDEA: Comparative-based gene finding

Comparative (Pairwise) Gene Finding

ROSETTA : Global alignment followed by gene finding [Batzoglou and Pachter et al., 1999]

SLAM : Simultaneous Global alignment and gene finding [Alexandersson et al. 2001]

TWINSCAN : Blast alignment followed by gene finding [Korf et al. 2001]

AGENDA : Global alignment followed by gene finding [Rinner and Morgenstern, 2002]

DOUBLESCAN : Simultaneous alignment and gene finding [Meyer and Durbin, 2002]

SGP2 : Blast alignment followed by gene finding [Parra et al. 2003]

The Good News

Gene structure
  Number of exons conserved (86% human/mouse)
  Exons have similar lengths (91% identical, remainder almost all differ by a multiple of 3)
  Intron lengths are divergent (~1% identical length)

Sequence similarity
  Exons highly conserved (both amino acids & DNA)
  Intron sequences dissimilar

Waterston et al., 2003

The Bad News

Difficult to generalize many pairwise methods to multiple sequence methods

Alignment
  Exons may be misaligned (much shorter than introns)
  Multiple sequence alignment is much harder than pairwise sequence alignment

Long Conserved Non-Coding Sequences
  Confuse methods that rely on conservation in a naive way

Missing Sequence

Multiple Species Comparative Gene Finding (with Alignment)

McAuliffe et al. (2004), Siepel et al. (2004)

Multiple Species Comparative Gene Finding (without Alignment)

Gibbs Sampling for Biological Sequence Analysis

Introduced by Lawrence et al. 1993
  Motif Detection

Extensions
  Multiple Motifs in a Sequence
  Multiple Types of Motifs
  Phylogenetic Relationships between Sequences

Applications
  Alignment
  Linkage Analysis

Gibbs Sampling

Aim: To sample from the joint distribution $p(x_1, x_2, \ldots, x_n)$ when it is easy to sample from the conditional distributions $p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ but not from the joint distribution.

Method: Iteratively sample $x_i^{t}$ from the conditional distribution $p(x_i \mid x_1^{t}, \ldots, x_{i-1}^{t}, x_{i+1}^{t-1}, \ldots, x_n^{t-1})$.

Theorem: For discrete distributions, the distribution of $(x_1^{t}, x_2^{t}, \ldots, x_n^{t})$ converges to $p(x_1, x_2, \ldots, x_n)$.
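The update rule on this slide can be tried on a toy problem. The following is a minimal sketch, assuming a made-up joint distribution over two binary variables (the table `JOINT` is purely illustrative and not from the talk); each sweep resamples one coordinate at a time from its conditional.

```python
import random
from collections import Counter

# Hypothetical joint distribution over two binary variables (x1, x2).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def sample_conditional(i, state, rng):
    """Sample x_i from p(x_i | all other coordinates of `state`)."""
    weights = []
    for value in (0, 1):
        candidate = list(state)
        candidate[i] = value
        weights.append(JOINT[tuple(candidate)])  # proportional to the conditional
    return rng.choices((0, 1), weights=weights)[0]

def gibbs(n_iter, rng):
    """Run the sampler and return the state reached after each full sweep."""
    state = [0, 0]
    visited = []
    for _ in range(n_iter):
        # x_i at sweep t is drawn given the already updated x_1..x_{i-1}
        # and the not yet updated x_{i+1}..x_n from sweep t-1.
        for i in range(len(state)):
            state[i] = sample_conditional(i, state, rng)
        visited.append(tuple(state))
    return visited

def empirical_distribution(n_iter=20000, seed=0):
    """Fraction of sweeps spent in each state; should approach JOINT."""
    rng = random.Random(seed)
    counts = Counter(gibbs(n_iter, rng))
    return {s: counts[s] / n_iter for s in JOINT}
```

Calling `empirical_distribution()` returns visit frequencies that should be close to the entries of `JOINT`, which is the convergence statement of the theorem in miniature.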

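Connection to HMMs

[Figure: HMM with hidden states Z1, Z2, …, Zm and observations Y1, Y2, …, Ym. Edges between consecutive hidden states are labeled s; edges from each hidden state to its observation are labeled t.]

t = output probabilities

s = transition probabilities

Difficult to sample from $P(Z \mid Y)$

Easy to sample the parameters $\theta$ from $P(\theta \mid Z, Y)$

Easy to sample $Z$ from $P(Z \mid \theta, Y)$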
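One standard way to draw a hidden path given the parameters and the observations is forward filtering followed by backward sampling. The sketch below assumes a plain (non-generalized) HMM with known tables; the names `init`, `trans` and `emit` are illustrative and are not taken from the talk.

```python
import random

def sample_hidden_path(obs, init, trans, emit, rng):
    """Draw one hidden path Z given the parameters and the observations Y.

    obs   : list of observed symbol indices (Y)
    init  : init[k]     = P(Z_1 = k)
    trans : trans[j][k] = P(Z_{t+1} = k | Z_t = j)   (the "s" parameters)
    emit  : emit[k][y]  = P(Y_t = y | Z_t = k)       (the "t" parameters)
    """
    n_states = len(init)
    states = list(range(n_states))

    # Forward pass: alpha[t][k] is proportional to P(Z_t = k | Y_1..Y_t).
    alpha = []
    for t, y in enumerate(obs):
        if t == 0:
            row = [init[k] * emit[k][y] for k in states]
        else:
            prev = alpha[-1]
            row = [emit[k][y] * sum(prev[j] * trans[j][k] for j in states)
                   for k in states]
        total = sum(row)
        alpha.append([v / total for v in row])  # normalise to avoid underflow

    # Backward pass: sample the last state, then each Z_t given Z_{t+1}.
    path = [rng.choices(states, weights=alpha[-1])[0]]
    for t in range(len(obs) - 2, -1, -1):
        nxt = path[-1]
        weights = [alpha[t][j] * trans[j][nxt] for j in states]
        path.append(rng.choices(states, weights=weights)[0])
    path.reverse()
    return path
```

The cost is linear in the sequence length for a fixed number of states, which is what makes the "sample Z" step cheap.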

The Motif Finding Problem

Fixed width unknown motif.

1 motif per sequence, unknown location.

PSSM (all entries unknown):

     P1  P2  P3  P4  P5
A    ?   ?   ?   ?   ?
C    ?   ?   ?   ?   ?
G    ?   ?   ?   ?   ?
T    ?   ?   ?   ?   ?

The Motif Finding HMM

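$\theta$ : PSSM parameters

Y : Observed sequences

Z : Alignment

[Figure: the PSSM (columns P1 to P5, rows A, C, G, T, all entries unknown) next to an HMM with hidden states Z1, Z2, …, Zm-1, Zm over observations Y1, Y2, …, Ym-1, Ym; the hidden states run BG, P1, …, P5, BG.]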

Gibbs Sampling for Motif Detection

Sample $\theta$ from $P(\theta \mid Z, Y)$ [sample PSSM from alignment]

Sample $Z$ from $P(Z \mid \theta, Y)$ [find positions from PSSM]

Samples from $P(Z, \theta \mid Y)$

[Figure: PSSM with unknown entries and the motif HMM (states BG, P1, …, P5, BG).]
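The two alternating steps on this slide can be sketched as follows. This is a simplified illustration, not the talk's implementation: it assumes a uniform background model, a symmetric Dirichlet prior with a hypothetical `pseudocount`, and a uniform prior over motif start positions.

```python
import random

ALPHABET = "ACGT"

def sample_dirichlet(counts, rng):
    """Draw a probability vector from Dirichlet(counts) via gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def sample_pssm(seqs, positions, width, rng, pseudocount=1.0):
    """Step 1: sample the PSSM given the alignment, one Dirichlet per column."""
    pssm = []
    for col in range(width):
        counts = [pseudocount] * len(ALPHABET)
        for seq, pos in zip(seqs, positions):
            counts[ALPHABET.index(seq[pos + col])] += 1
        pssm.append(sample_dirichlet(counts, rng))
    return pssm

def sample_positions(seqs, pssm, rng, background=0.25):
    """Step 2: sample one motif start per sequence given the PSSM."""
    width = len(pssm)
    positions = []
    for seq in seqs:
        weights = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0  # likelihood ratio of motif model vs. uniform background
            for col in range(width):
                ratio *= pssm[col][ALPHABET.index(seq[start + col])] / background
            weights.append(ratio)
        positions.append(rng.choices(range(len(weights)), weights=weights)[0])
    return positions

def gibbs_motif(seqs, width, n_iter=200, seed=0):
    """Alternate the two steps, starting from random motif positions."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    pssm = sample_pssm(seqs, positions, width, rng)
    for _ in range(n_iter):
        positions = sample_positions(seqs, pssm, rng)
        pssm = sample_pssm(seqs, positions, width, rng)
    return positions, pssm
```

For example, `gibbs_motif(["TTGACGTAAC", "CCGACGTTTG", "AGGACGTCCA"], width=5)` returns one start position per sequence together with the last sampled PSSM. A single run is one draw from the chain, so in practice several runs or an average over iterations would be inspected.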

Gibbs Sampling for HMMs

N sequences independently generated by an HMM.

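Three types of random variables:
  $\theta$ : parameters
  $Z = Z_1, Z_2, \ldots, Z_i, \ldots, Z_N$ : hidden variables, where $Z_i = Z_i^1, Z_i^2, \ldots, Z_i^m$
  $Y = Y_1, Y_2, \ldots, Y_i, \ldots, Y_N$ : observed variables, where $Y_i = Y_i^1, Y_i^2, \ldots, Y_i^m$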

Gibbs Sampling for HMMs

N sequences independently generated by an HMM.

Three types of random variables:
  $\theta$ : parameters
  $Z = Z_1, Z_2, \ldots, Z_i, \ldots, Z_N$ : hidden variables
  $Y = Y_1, Y_2, \ldots, Y_i, \ldots, Y_N$ : observed variables

Aim: To sample from the distribution $P(Z, \theta \mid Y)$

Iterations of a Gibbs Sampler
  Sample $Z_i$ from $p(Z_i \mid Y, \theta)$
  Sample $\theta$ from $p(\theta \mid Y, Z)$
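The parameter step of this iteration is simple when the prior is conjugate. The sketch below assumes independent symmetric Dirichlet priors on every transition and emission row (the prior and the `pseudocount` value are assumptions, not stated in the talk): the posterior is again Dirichlet, with the observed counts added to the pseudocounts.

```python
import random

def sample_dirichlet(counts, rng):
    """Draw a probability vector from Dirichlet(counts) via gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def sample_parameters(hidden, observed, n_states, n_symbols, rng, pseudocount=1.0):
    """Draw (transition, emission) tables given all hidden paths and observations.

    hidden   : list of N state paths  Z_1..Z_N (lists of state indices)
    observed : list of N sequences    Y_1..Y_N (lists of symbol indices)
    """
    trans_counts = [[pseudocount] * n_states for _ in range(n_states)]
    emit_counts = [[pseudocount] * n_symbols for _ in range(n_states)]
    for path, symbols in zip(hidden, observed):
        for z, y in zip(path, symbols):
            emit_counts[z][y] += 1          # state z emitted symbol y
        for a, b in zip(path, path[1:]):
            trans_counts[a][b] += 1         # transition a -> b was used
    trans = [sample_dirichlet(row, rng) for row in trans_counts]
    emit = [sample_dirichlet(row, rng) for row in emit_counts]
    return trans, emit
```

Alternating this step with a draw of each hidden path given the parameters gives one full sweep of the sampler described above.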

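[Figure: state diagram of the gene-finding HMM. State labels: E00, E01, E02, Intron0; E10, E11, E12, Intron1; E20, E21, E22, Intron2; EI0, EI1, EI2; ET0, ET1, ET2; SingleExon; IG. Exon states are replicated per exon class, e.g. $E^1_{00}, E^2_{00}, \ldots, E^k_{00}$.]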

Gibbs Sampling for Gene Finding

Initial Predictions

Sample $Z_1$ from $P(Z_1 \mid Z_{[-1]}, Y)$

Sample $Z_2$ from $P(Z_2 \mid Z_{[-2]}, Y)$

Learning the Number of Exon Classes

Find Significant Hits Among Peptides

Each Connected Component forms a Class of Genes
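The grouping step can be sketched with a union-find structure. How a "significant hit" is decided is not specified on the slides, so the sketch assumes the hits are already given as pairs of peptide indices (the `hits` list is a hypothetical input).

```python
def exon_classes(n_peptides, hits):
    """Group peptides 0..n_peptides-1 into connected components of the hit graph."""
    parent = list(range(n_peptides))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a      # merge the two components

    classes = {}
    for peptide in range(n_peptides):
        classes.setdefault(find(peptide), []).append(peptide)
    return sorted(classes.values())
```

With six peptides and hits `[(0, 1), (1, 2), (4, 5)]` this yields the classes `[[0, 1, 2], [3], [4, 5]]`, so the number of classes falls out of the data rather than being fixed in advance.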

Testing

1.6 Mb of data from the NISC Comparative Sequencing Project
  Divided into large genomic regions (100-200 kb), some of which contained multiple genes

Selection Criteria
  4 mammals roughly equidistant from each other: Human, Mouse/Rat, Dog/Cat, Pig/Cow.
  Available RefSeq annotations with no obvious alternative splicing

Results

Method     Nucl. Sn  Nucl. Sp  Exon Sn  Exon Sp
Gibbs      0.897     0.886     0.714    0.628
Genscan    0.911     0.548     0.777    0.518
Twinscan   0.692     0.856     0.440    0.513
SLAM       0.791     0.881     0.632    0.527
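The slides do not define Sn and Sp. The sketch below assumes the convention commonly used in gene-prediction benchmarks, where sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP), computed here at the nucleotide level from per-base coding labels.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (Sn, Sp) from two equal-length sequences of 0/1 coding labels."""
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)        # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)    # coding, missed
    fp = sum(1 for t, p in pairs if p and not t)    # non-coding, predicted coding
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0  # gene-finding convention
    return sensitivity, specificity
```

For instance, with true labels `[0, 1, 1, 1, 1, 0, 0, 0]` and predictions `[0, 0, 1, 1, 1, 1, 0, 0]` there are 3 true positives, 1 false negative and 1 false positive, giving Sn = Sp = 0.75. Exon-level figures would count whole exons with both boundaries correct instead of single bases.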

Robustness Results

Method             Nucl. Sn  Nucl. Sp  Exon Sn  Exon Sp
Gibbs (before)     0.939     0.950     0.763    0.735
Gibbs (after)      0.885     0.910     0.740    0.703
Genscan (before)   0.911     0.680     0.771    0.612
Genscan (after)    0.866     0.652     0.748    0.594
Twinscan (before)  0.694     0.895     0.465    0.604
Twinscan (after)   0.665     0.853     0.465    0.598
SLAM (before)      0.927     0.911     0.718    0.566
SLAM (after)       0.438     0.936     0.250    0.646

Conclusions

Efficient
  Running time O(kNL)
  Memory requirements O(L)
  k = #iterations, N = #sequences, L = max. length
  Converges rapidly.

No Alignment Required!

Symmetric Prediction for All Species

Application: rapid comparative-based annotation of newly sequenced genomes

Robust
  Rearrangements
  Draft Quality Sequence

Outline

GeneFinding by Gibbs Sampling
  Ab-Initio GeneFinding in Vertebrates
  Overview of Gibbs Sampling
  Gene finding by Gibbs sampling
  Results

Evidence Based Multiple Species GeneFinding
  Evidence based GeneFinding
  ExonAligner: An Exon Alignment Program
  Initial Results
  Proposals for Future Work

Evidence Based GeneFinding

Procrustes : cDNA Evidence, DP based Spliced Alignment [Gelfand et al. 1996]

Genewise : Protein evidence, combines genefinding HMM with protein profile HMM, part of the ENSEMBL pipeline. [Birney et al. 1996]

Projector : Evidence from orthologous genes in related species, uses pair HMM based model. [Meyer and Durbin 2004]

Evidence

Annotation

Evidence Based GeneFinding

Large scale sequencing efforts.
  8 Drosophila genomes very soon
    D. melanogaster well annotated.
  9 mammalian genomes by early 2005
    Human genome well annotated.

Aim : Rapid annotation of newly sequenced genomes.
  Use well annotated genomes as evidence.

Draft Quality Genomes
  Robustness for Sequencing Errors

Using sequences from multiple species will result in more accurate annotations.

Will also give us high quality multiple alignments.

Data to study the evolution of genomes.

Evidence Based Multiple Species GeneFinding

Basic Idea : Use annotations from a reference genome (e.g. D. melanogaster or H. sapiens) as evidence to annotate newly sequenced genomes.

Use Whole Genome Homology Maps (courtesy Colin Dewey)

Project exons from reference genome into every other genome.

Join projections to get multiple alignments. Use orthologous sequences from multiple species to get more accurate annotations.

Produce annotations with all supporting evidence.

Exploit phylogenetic relationships among the species.

Projecting Annotated Exons

Annotation

Homology Map

ExonAligner

ExonAligner : An Exon Alignment Program

Mixture of global and local alignment.
  Penalize overhanging ends in Evidence.
  Overhanging ends in Target are OK.

Exploit the property that they code for homologous proteins.
  Special Dynamic Programming Matrix

Robust.
  Sequencing Errors.
  Phase Shifts.

Chaining Algorithm for large sequences.

Evidence

Target
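The "global in the evidence, local in the target" idea can be illustrated with a plain nucleotide-level dynamic program. This is a deliberately simplified sketch, not ExonAligner itself: it ignores codon structure, and the scores `match`, `mismatch` and `gap` are placeholder values.

```python
def fit_alignment_score(evidence, target, match=1, mismatch=-1, gap=-2):
    """Best score for aligning ALL of `evidence` to SOME substring of `target`.

    Returns (score, target_start, target_end), where
    target[target_start:target_end] is the aligned part of the target.
    """
    n, m = len(evidence), len(target)
    # score[i][j]: best score of evidence[:i] against a target substring ending at j.
    # Row 0 is all zeros: leading overhang in the target is free.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    start = [list(range(m + 1)) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap   # unaligned evidence is penalised
        start[i][0] = 0
        for j in range(1, m + 1):
            sub = match if evidence[i - 1] == target[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + sub, start[i - 1][j - 1]),  # (mis)match
                (score[i - 1][j] + gap, start[i - 1][j]),          # gap in target
                (score[i][j - 1] + gap, start[i][j - 1]),          # gap in evidence
            ]
            score[i][j], start[i][j] = max(options, key=lambda o: o[0])
    # Trailing overhang in the target is free: take the best cell of the last row.
    end = max(range(m + 1), key=lambda j: score[n][j])
    return score[n][end], start[n][end], end
```

For example, `fit_alignment_score("ACGT", "TTACGTTT")` returns `(4, 2, 6)`: the whole evidence is matched, and the flanking target bases cost nothing.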

The Dynamic Programming Matrix

The figure only shows edges into the black node. The red edges represent non-codon gaps, i.e. gaps caused by phase shifts or sequencing errors, whose length is not a multiple of 3. They are heavily penalized.
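A gap score with this behaviour could look like the sketch below. All three weights are made-up placeholders chosen only to show the shape of the penalty; the talk does not give the actual values.

```python
def gap_penalty(length, codon_gap=-3.0, per_codon=-1.0, frameshift=-20.0):
    """Score for a gap of `length` nucleotides (all weights are hypothetical)."""
    if length <= 0:
        return 0.0
    codons, remainder = divmod(length, 3)
    penalty = codon_gap + per_codon * codons   # opening cost plus cost per whole codon
    if remainder:
        penalty += frameshift                  # non-codon gap: length not a multiple of 3
    return penalty
```

With these placeholder weights a 6-nucleotide gap scores -5.0 while a 1-nucleotide gap scores -23.0, so an alignment only accepts a frame-breaking gap when the surrounding evidence is strong.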

Chaining Algorithms

Widely used in large scale alignment algorithms.
  MUMmer [Salzberg et al. 1999]
  AVID [Bray et al. 2002]
  LAGAN [Brudno et al. 2003]

Step 1: Find good local alignments or fragments.

Step 2: Select a consistent subset of fragments for chaining and call these fragments anchors.

Step 3: Join the anchors to get an alignment.

The ExonAligner Chaining Algorithm

Construction of Fragments.
  Translate target sequence in the 3 frames.
  Find significant hits with (translated) evidence and use them as fragments.

Selection of Anchors.
  Construct weighted DAG from fragments.
  Weigh edges by using dynamic programming.
  Use nodes in the shortest path in the DAG as anchors.

Use dynamic programming to join anchors together.
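The anchor-selection step can be sketched as a path problem over fragments. The version below is a simplification under stated assumptions: each fragment carries a single precomputed score (the talk weighs edges by dynamic programming instead), and the best chain is found by maximising the total score, which is equivalent to a shortest path after negating the weights. The tuple layout is hypothetical.

```python
def select_anchors(fragments):
    """Pick the highest-scoring chain of mutually consistent fragments.

    Each fragment is (evidence_start, evidence_end, target_start, target_end, score)
    with half-open coordinates. Fragment a may precede fragment b when a ends
    before b starts in both sequences.
    """
    if not fragments:
        return []
    frags = sorted(fragments)                 # order by evidence_start
    best = [f[4] for f in frags]              # best chain score ending at each fragment
    prev = [None] * len(frags)
    for j, (es_j, _, ts_j, _, score_j) in enumerate(frags):
        for i in range(j):
            _, ee_i, _, te_i, _ = frags[i]
            consistent = ee_i <= es_j and te_i <= ts_j
            if consistent and best[i] + score_j > best[j]:
                best[j] = best[i] + score_j
                prev[j] = i
    # Trace back from the best chain end.
    j = max(range(len(frags)), key=lambda k: best[k])
    chain = []
    while j is not None:
        chain.append(frags[j])
        j = prev[j]
    return chain[::-1]
```

This quadratic version is enough to show the idea; the regions between consecutive anchors would then be joined by dynamic programming as described in the last step.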

Exploiting Phylogeny

Recoverable Exons/Genes

Project Using Exon Aligner

Preliminary Results

Created a homology map of Human, Chimp, Rat, Mouse and Chicken genomes
  266836 exon cliques.
  45543 non-convex exon cliques.
  27502 of these recoverable.

Used ExonAligner to map 3300 human RefSeq genes into the chimp genome.
  Robustness of algorithm critical.
  500 of the 42662 exons had non-codon gaps.

These alignments will in turn be used to learn parameters for ExonAligner.
  Extrapolate parameters for other species.

An Illustrative Example

RefSeq Gene NM_030575
  Single exon gene, 221 a.a. protein.
  No orthologous gene found by Genewise.

Potential orthologous gene in chimp
  2 non-codon gaps in alignment (1 insertion and 1 deletion separated by 60 nt).
  212 out of 221 amino acids are matches.

Is this a real ortholog?
  Phase Shift/Sequencing Error?
  Find orthologous genes in other species and use multiple alignment.

Future Work

Extend ExonAligner for Multiple Species
  Robust realignment
  Take into account codon structure
  Robust for phase shifts/sequencing errors

Annotation with supporting evidence.
  Basic Evidence, e.g. RefSeq Gene Annotation
  Multiple Alignment with Orthologous Features
  Score : Statistical Significance of the Feature

Future Work

Comprehensive Annotation Program
  Put Evidence Based and Ab initio methods together
  Try to use alignment/homology in Gibbs Sampler

Rapid annotation of Drosophila and Mammalian genomes
  Berkeley AAA group for Drosophila genomes.

Study the evolution of genes
  Find human specific genes