
Page 1: Gene discovery using combined signals from genome sequence and natural selection Michael Brent Washington University The mouse genome analysis group

Gene discovery using combined signals from genome sequence and natural selection

Michael Brent
Washington University

The mouse genome analysis group

Page 2:

GENSIPS 10/7/2002 2

Genes are read out via mRNA & processing

Page 3:

RNA Processing

Page 4:

A typical human gene structure

Page 5:

In a mammalian genome

Finding all the genes is hard
• Mammalian genomes are large
  – 5,051 miles of 10pt type
  – Raleigh to Tripoli, Libya
• Only about 1.5% protein coding
  – Raleigh to Winston-Salem

Page 6:

Genes are fairly unconstrained

Intron length is highly variable
• ~5% are 40-100 nt long
• ~3% are longer than 30,000 nt

Distance between genes is highly variable
• From 10^3 to 10^6 nt or more (probably)

Page 7:

Exons per gene (RefSeq)

[Figure: histogram of percent of genes (y-axis, 0% to 12%) by number of exons (x-axis, 1 to 30+); series: RefSeq]

Page 8:

Background is not random

Segmental duplications
• Entire regions duplicate, then diverge slowly

Processed pseudogenes
• Spliced transcripts integrate back into the genome
  – Sequence is similar to source genes
  – Generally not functional

Page 9:

Gene prediction: two approaches

1. Transcript-based (e.g., GeneWise)
   A. Map experimentally determined sequences of spliced transcripts to their genomic source
   B. Map transcript sequences to genomic regions that could produce similar transcripts

2. De novo (genome only)
• Model DNA patterns characteristic of gene components
  – Splice donor and acceptor
  – Protein coding sequence
  – Translation start and stop

Page 10:

Advantages and disadvantages

Transcript-based
• Advantage: conservative
  – Evidence of transcription for every exon
• Disadvantage: conservative
  – Can’t find “truly novel” genes
• Still subject to error

Page 11:

Advantages and disadvantages

De novo
• Advantage 1: Less biased toward
  – Known transcripts
  – Transcripts that can be sequenced easily
• Advantage 2: Genome sequencing is easy
• Disadvantages
  – No direct evidence of transcription
  – Presumably, more false positives

Page 12:

Single-genome de novo: Genscan

Strengths
• For mammalian sequence, one of the best single-genome, de novo gene predictors
• Widely used to great practical advantage
• De facto standard for mammalian sequence

Limitations
• Predicts >45K genes (best est.: 25-30K)
• Predicts >315K exons (best est.: 200K-250K)
• Gets only 9% of known genes exactly right*
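The slide does not say how "exactly right" is scored. A minimal sketch, assuming it means that every exon boundary of a predicted gene matches a known gene: the function name and the (start, end) exon representation below are illustrative, not taken from the talk.

```python
def exact_gene_accuracy(known, predicted):
    """Fraction of known genes reproduced exactly by some prediction.

    Each gene is a sequence of (start, end) exon coordinates.
    Assumption: "exactly right" means all exon boundaries match.
    """
    predicted_set = {tuple(sorted(gene)) for gene in predicted}
    known_list = [tuple(sorted(gene)) for gene in known]
    hits = sum(1 for gene in known_list if gene in predicted_set)
    return hits / len(known_list)


# Toy data: the first gene is predicted exactly, the second is off by one base.
known_genes = [[(1, 5), (10, 20)], [(30, 40)]]
predicted_genes = [[(1, 5), (10, 20)], [(30, 41)]]
print(exact_gene_accuracy(known_genes, predicted_genes))
```

Under this strict definition a single misplaced splice site makes the whole gene count as wrong, which is one reason exact-gene figures sit far below exon-level figures.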

Page 13:

Dual genome de novo

We developed algorithms that use two genomes to
• Reduce the number of false positives
• Refine the details of the structures

Page 14:

Single-genome de novo method

Probability model
• Assigns probability to annotated DNA sequences:

5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
[Sequence annotated with the labels: Exon, 5’ UTR, Intron]

Optimization algorithm
• Given a DNA sequence, find the most probable annotation, according to the model
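The optimization step can be illustrated with the Viterbi algorithm on a plain two-state HMM. This is a toy sketch only: the states ("noncoding", "coding") and every probability below are invented for illustration, and Genscan itself uses a much richer generalized HMM.

```python
import math


def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return the most probable state path for seq and its log-probability."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit[s][seq[0]] for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for ch in seq[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][ch]
            ptr[s] = p
        back.append(ptr)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], best[last]


def logs(table):
    return {key: math.log(value) for key, value in table.items()}


# Hypothetical toy parameters: "coding" is GC-rich, "noncoding" is AT-rich.
STATES = ["noncoding", "coding"]
LOG_START = logs({"noncoding": 0.5, "coding": 0.5})
LOG_TRANS = {
    "noncoding": logs({"noncoding": 0.9, "coding": 0.1}),
    "coding": logs({"noncoding": 0.1, "coding": 0.9}),
}
LOG_EMIT = {
    "noncoding": logs({"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}),
    "coding": logs({"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}),
}

path, score = viterbi("AAAAAAGCGCGCGC", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path)
```

Working in log space avoids numerical underflow on long sequences; the path returned is the "most probable annotation" in the sense of the slide, restricted to this toy model.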

Page 15:

Genscan’s generative model

CCATGGCGTCTTCAGGCAGTGACTC
[Figure labels: Intron, Exon, Intron]

Page 16:

Genscan’s generative model

Generalized HMM
• States correspond to gene features
• Model generates DNA sequence by passing through states
• The probability of annotated DNA sequence is the probability of
  – generating the DNA sequence
  – by passing through states corresponding to the annotation.
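The probability of an annotated sequence can be sketched as a sum of log terms, one group per annotated segment: a transition into the state, a duration term for the segment length, and an emission term for its bases. Everything below (state names, geometric durations, per-base emission tables) is a simplifying assumption for illustration; Genscan's actual length and content models are more elaborate.

```python
import math


def annotated_log_prob(seq, segments, log_init, log_trans, log_dur, log_emit):
    """Log-probability of seq together with its annotation.

    segments: list of (state, length) pairs covering seq from left to right.
    log_dur:  function (state, length) -> log-probability of that duration.
    """
    if sum(length for _, length in segments) != len(seq):
        raise ValueError("segments must cover the sequence exactly")
    total, pos, prev = 0.0, 0, None
    for state, length in segments:
        total += log_init[state] if prev is None else log_trans[prev][state]
        total += log_dur(state, length)
        total += sum(log_emit[state][base] for base in seq[pos:pos + length])
        pos, prev = pos + length, state
    return total


# Hypothetical toy model: two alternating states with geometric durations.
LOG_INIT = {"exon": math.log(0.5), "intron": math.log(0.5)}
LOG_NEXT = {"exon": {"intron": 0.0}, "intron": {"exon": 0.0}}
STOP_PROB = {"exon": 0.2, "intron": 0.1}
LOG_BASE = {
    "exon": {b: math.log(p) for b, p in zip("ACGT", (0.2, 0.3, 0.3, 0.2))},
    "intron": {b: math.log(0.25) for b in "ACGT"},
}


def geometric_log_dur(state, length):
    stop = STOP_PROB[state]
    return (length - 1) * math.log(1.0 - stop) + math.log(stop)


print(annotated_log_prob("ACG", [("exon", 2), ("intron", 1)],
                         LOG_INIT, LOG_NEXT, geometric_log_dur, LOG_BASE))
```

The explicit duration term is what separates a generalized HMM from an ordinary one, where segment lengths are implicitly geometric.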

Page 17:

Dual genome prediction

Input
• Target and informant genomes

Idea
• Patterns of evolution since the last common ancestor may reveal gene structure

Page 18:

Two conservation signals

1. Local alignment signal
• Selective pressures differ by feature
• This leaves a characteristic signature

2. Structural signal
• Locations of introns tend to be conserved

Page 19:

Characteristic local alignments

Coding exon
human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
       |||||||||||||||||||| || ||||| || || |||
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC

Intron (non-coding)
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||     ||  | |||||||||  || ||   ||
mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
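The difference between the two alignments can be quantified by counting identical columns and gap columns. The sketch below uses the four sequences from the slide; the function and constant names are illustrative.

```python
def identity_stats(top, bottom):
    """Return (identical columns, gap columns, identity fraction) for a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    gaps = sum(1 for a, b in zip(top, bottom) if a == "-" or b == "-")
    return matches, gaps, matches / len(top)


# Aligned sequences copied from the slide.
CODING_HUMAN = "TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC"
CODING_MOUSE = "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC"
INTRON_HUMAN = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
INTRON_MOUSE = "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT"

for label, top, bottom in [("coding", CODING_HUMAN, CODING_MOUSE),
                           ("intron", INTRON_HUMAN, INTRON_MOUSE)]:
    matches, gaps, fraction = identity_stats(top, bottom)
    print(f"{label}: {matches} identical columns, {gaps} gap columns, identity {fraction:.2f}")
```

In this example the coding alignment has no gaps and more identical columns than the intron alignment, which is the kind of signature the slide refers to.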

Page 20:

Conservation of intron location

Page 21:

Align → predict → filter → test

[Figure: pipeline stages WU-BLAST (align) → TWINSCAN (predict) → Aligned Intron Filter (filter) → Validation by RT-PCR (test)]

TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC

TCTGCCACC
|| || ||
TCAGCTACT

Page 22:

[Figure: two steps inside TWINSCAN, representation change followed by gHMM decoding]

Local alignment
TCTGCCACC
|| || ||
TCAGCTACT

Conservation sequence
TCTGCCACC
||:||:||

Page 23:

BLAST Alignments

[Figure labels: Target, Informant]

Pages 24-27:

Projecting BLAST Alignments

[Figure in four successive frames; labels: Target, Informant]

Page 28:

Conservation sequence

Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||         | |||||||||  || ||   ||
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT

Pair each nucleotide of the target with
• “|” if it is aligned and identical

Page 29:

Conservation sequence

Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||         |:|||||||||::||:||   ||:
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT

Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap

Page 30:

Conservation sequence

Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||.........|:|||||||||::||:||   ||:
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT

Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
• “.” if it is unaligned

Page 31:

Conservation sequence

human         CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
conservation  ||||||.........|:|||||||||::||:||   ||:

Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
• “.” if it is unaligned

Page 32:

Conservation sequence

human         CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
conservation  ||||||.........|:|||||||||::||:||||:

Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
• “.” if it is unaligned
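The three pairing rules can be sketched as a small function that walks over projected local alignments. The input format, a list of (target start, aligned target, aligned informant) tuples, is an assumption made for this example, and when alignments overlap this sketch simply lets the later one win rather than choosing the best one.

```python
def conservation_sequence(target_length, alignments):
    """Build a conservation sequence with one symbol per target nucleotide.

    alignments: list of (target_start, target_aln, informant_aln), where
    target_start is 0-based and the aligned strings use '-' for gaps.
    """
    cons = ["."] * target_length  # "." marks unaligned target positions
    for start, target_aln, informant_aln in alignments:
        pos = start
        for t, i in zip(target_aln, informant_aln):
            if t == "-":
                continue  # gap in the target: no target nucleotide to label
            cons[pos] = "|" if t == i else ":"  # mismatch or informant gap gives ":"
            pos += 1
    return "".join(cons)


# The human (target) sequence and the two projected alignment blocks from the slides.
TARGET = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC"
BLOCKS = [
    (0, "CTAGAG", "CTAGAG"),
    (15, "AAACAGGTACCGCAGTGC---CCC", "AGACAGGTACCATAGGGCTCTCCT"),
]

print(TARGET)
print(conservation_sequence(len(TARGET), BLOCKS))
```

Because gap columns in the target are skipped, the result has exactly one symbol per target base, matching the ungapped display on the final slide of this sequence.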

Page 33:

Twinscan: Extending the model

Probability model
• Assigns probability to annotated DNA:

5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
  |||........|:||||:|||||||||:||::||
[Sequence annotated with the labels: Exon, 5’ UTR, Intron]

Optimization
• Given DNA and conservation sequence, find the most probable annotation, according to the model

Page 34:

Twinscan

• Each state “generates” DNA and conservation sequence independently
• Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
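Because each state generates the two sequences independently, the joint emission probability factors into a DNA term and a conservation term, which add in log space. The sketch below uses per-base tables with invented numbers; Twinscan's real models are richer, so treat the tables and names as assumptions.

```python
import math


def joint_log_emission(state, dna, cons, log_dna, log_cons):
    """log P(dna, cons | state), assuming the two sequences are generated independently."""
    if len(dna) != len(cons):
        raise ValueError("DNA and conservation sequence must have equal length")
    return sum(log_dna[state][base] + log_cons[state][symbol]
               for base, symbol in zip(dna, cons))


def log_table(table):
    return {key: math.log(value) for key, value in table.items()}


# Hypothetical toy parameters: exons are assumed to be more conserved than introns.
LOG_DNA = {
    "exon": log_table({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
    "intron": log_table({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
}
LOG_CONS = {
    "exon": log_table({"|": 0.8, ":": 0.15, ".": 0.05}),
    "intron": log_table({"|": 0.4, ":": 0.3, ".": 0.3}),
}

for state in ("exon", "intron"):
    print(state, joint_log_emission(state, "ATGG", "||||", LOG_DNA, LOG_CONS))
```

With identical DNA tables for both states, any difference in score comes from the conservation sequence alone, which isolates the contribution of the second genome.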

Page 35:

Performance Evaluation

RefSeq
• A set of ~13,000 “known” mRNAs
• Represents ~40-50% of human genes
  – Usually, only one of several splices
• Mapping to genome is imperfect
• Best available gold standard

Page 36:

Page 37:

[Figure: histogram of percent of genes (y-axis, 0% to 16%) by number of exons (x-axis, 1 to 30+); series: RefSeq, Twinscan]

Page 38:

Page 39:

Page 40:

Short term goal

All multi-exon human genes
• Predict accurately
  – Integrate information from more genomes
• Verify at least one intron experimentally
• Follow up with full-length verification

Page 41:

Acknowledgments

Funding agencies
• National Institutes of Health (NHGRI)
• National Science Foundation (DBI)

Sequencing centers
• Sanger, Whitehead, Wash. U.

My group
• Ian Korf, Paul Flicek, Evan Keibler, Ping Hu

Collaborators
• Roderic Guigo, Josep Abril, Genis Parra
  – Pankaj Agarwal
• Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis

Page 42:

Other clades

Plants
• Arabidopsis thaliana, cabbage, rice

Nematodes
• C. elegans, C. briggsae

Fungi
• Cryptococcus neoformans (JEC21, H99)

Page 43:

Pair HMM algorithms (SLAM,…)

• Input is orthologous sequences.
• Aligns and predicts simultaneously, using a joint probability model
• Predicts orthologous genes in 2 sequences
• All predicted CDS is aligned
• Some aligned regions are not predicted CDS
  – Labeled conserved non-coding sequence

Page 44:

The algorithms (SLAM,…)

sgp2
• Alignment before prediction (tblastx)
• Predicts genes in target sequence only
• Doesn’t need orthologous input sequences
  – Paralogs & low-coverage shotgun can help
• Modifies scores of all potential exons, by
  – At each base, adding the tblastx score of the best overlapping local alignment (roughly)
  – To the geneid scores of that potential exon
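The score modification can be sketched roughly as follows. This is not sgp2's actual formula: the per-base HSP score, the half-open coordinates, and the `weight` parameter are all assumptions introduced to make the idea concrete.

```python
def sgp2_like_exon_score(exon, geneid_score, hsps, weight=1.0):
    """Add, at each base of a candidate exon, the score of the best overlapping HSP.

    exon: (start, end), half-open target coordinates.
    hsps: list of (start, end, per_base_score); a hypothetical stand-in for
          tblastx alignments projected onto the target.
    """
    start, end = exon
    bonus = 0.0
    for pos in range(start, end):
        overlapping = [score for (s, e, score) in hsps if s <= pos < e]
        if overlapping:
            bonus += max(overlapping)  # best overlapping local alignment
    return geneid_score + weight * bonus


# Toy example: two overlapping HSPs partly covering a 10-base candidate exon.
toy_hsps = [(0, 15, 2.0), (12, 30, 3.0)]
print(sgp2_like_exon_score((10, 20), 5.0, toy_hsps))
```

An exon with no overlapping alignment keeps its original score, so the method still works where the informant genome offers no evidence.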

Page 45:

The algorithms

TWINSCAN
• Alignment before prediction (blastn)
• Predicts in target sequence only
• Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by
  – At each base, applying a feature-specific scoring model (estimated for this purpose)
  – To the best overlapping local alignment, and adding the result
  – To Genscan scores for that feature

Page 46:

% Aligned, CDS vs. other

[Figure: bar chart of fraction aligned (y-axis, 0 to 1) for coding and non-coding sequence; series: Genscan, sgp2, Twinscan, Sanger]

Page 47:

Syntenic Gene Prediction (sgp2)

[Figure labels: Query Sequence, tblastx HSPs, geneid Exons, HSPs Projections, SGP Exons]

Page 48:

Why work on gene finding?

Genes are
• Components responsible for biological function
• Variations cause human disease / susceptibility
• Controls for modifying biological function
  – Human gene therapy
  – Agriculture
  – Nanotechnology, etc.