comparative genomics and proteomics in ensembl sep 2006

Post on 16-Jan-2016


Comparative genomics and proteomics in Ensembl

Sep 2006

2 of 56

Overview

• Rationale
• Species available
• Comparative proteomics
– Orthologue and paralogue prediction
– Protein clustering into families
• Comparative genomics
– Genome-wide DNA alignments
– Synteny block characterisation
• Future and perspectives

3 of 56

Compara

The Compara database is a single multispecies database

• Gene orthology/paralogy prediction
• Protein clustering
• Whole genome alignments
• Synteny regions

4 of 56

The era of sequencing genomes

[Figure: phylogenetic tree of sequenced species on a time axis in million years (ticks at 100, 200, 300, 400, 500 and 1000). Node ages are printed on the figure but cannot be tied to branches in this transcript. Clade labels: Fungi, Nematoda, Arthropoda, Urochordata, Chordata, Vertebrata, Teleostei, Tetrapoda, Amphibia, Amniota, Aves, Mammalia, Metatheria, Eutheria.]

Red: whole genome assembly available
Green: whole genome assembly due within the next year in Ensembl

* 19 species currently in Ensembl
+ 10 Pre! Ensembl

S. cerevisiae (baker’s yeast) *
C. elegans (nematode) *
A. mellifera (honey bee) *
D. rerio (zebrafish) *
D. melanogaster (fruitfly) *
A. gambiae (African malaria mosquito) *
A. aegypti (yellow fever mosquito) +
C. intestinalis (transparent sea squirt) *
C. savignyi (sea squirt) +
T. rubripes (torafugu) *
T. nigroviridis (spotted green pufferfish) *
O. latipes (Japanese medaka)
G. aculeatus (stickleback) +
O. aries (sheep)
G. gallus (chicken) *
X. laevis (African clawed frog)
M. musculus (house mouse) *
R. norvegicus (Norway rat) *
M. mulatta (rhesus macaque) *
P. troglodytes (chimpanzee) *
C. familiaris (dog) *
F. catus (cat)
E. caballus (horse)
S. scrofa (pig)
B. taurus (cow) *
M. domestica (opossum) *
L. africana (elephant) +
H. sapiens (human) * +
X. tropicalis (western clawed frog) *

5 of 56

Comparing different species

• From the Ensembl perspective, Compara joins species through
– orthologous/paralogous gene links
– chromosome synteny links
– protein family links
• From a broader perspective
– Where are syntenic regions located?
– How many genes are conserved?
– Where are orthologous/paralogous genes?
– Is gene order conserved?
– Where are potential regulatory regions?
– What is missing in one species, present only in another?

6 of 56

Orthologue and Paralogue Prediction

• Evolutionary studies
• Identify potential species-specific proteins/genes
• Identify orthologues of (human) genes in model organisms

7 of 56

Gene Evolution

• Divergence

• Speciation / Duplication

• Change within allelic population

• Point Mutations / Selection / Drift

• Exon/domain shuffling

• Transposition / Translocation

• Retroposition (reverse transcription)

• Horizontal gene transfer?

Orthologues and Paralogues

Reconstruct the Molecular Evolutionary history from the evidence visible within the known extant genes

8 of 56

Homologue Relationships

• Orthologues: any pairwise gene relation where the ancestor node is a speciation event
• Paralogues: any pairwise gene relation where the ancestor node is a duplication event
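The two definitions above amount to a lookup on a gene tree: find the most recent common ancestor of the two genes and read off the event at that node. The following is a minimal sketch of that idea. The toy tree, the node names and the `relation` helper are illustrative assumptions, not Ensembl data or the Ensembl API.

```python
# Hypothetical toy gene tree: child -> parent. Names are illustrative only.
PARENT = {
    "H1": "n1", "M1": "n1",
    "H2": "n2", "M2": "n2",
    "n1": "root", "n2": "root",
}
# Event assumed at each internal node (normally the result of tree reconciliation).
EVENT = {"n1": "speciation", "n2": "speciation", "root": "duplication"}


def ancestors(node):
    """Return the path from a node up to the root, the node itself included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def relation(gene_a, gene_b):
    """Classify two distinct genes by the event at their most recent common ancestor."""
    seen = set(ancestors(gene_a))
    for node in ancestors(gene_b):
        if node in seen:
            return "orthologue" if EVENT[node] == "speciation" else "paralogue"
    raise ValueError("genes are not in the same tree")


print(relation("H1", "M1"))  # common ancestor n1 is a speciation
print(relation("H1", "H2"))  # common ancestor root is a duplication
```

In this toy tree H1 and M2 also come out as paralogues, because the duplication at the root precedes both speciation nodes.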

9 of 56

Homologue Relationships

[Figure: gene tree on a time axis with duplication and speciation nodes, relating genes A, A1, A2, M1, M2, M2’, H1 and H2; gene pairs are labelled as inparalogues, outparalogues or orthologues.]

Orthologous genes have originated from a single ancestor (and often have equivalent functions). Paralogous genes are related via duplication:

• Inparalogues (ortholog_one2one, ortholog_one2many, etc.): duplication follows speciation
• Between_species_paralog (outparalogues): duplication precedes speciation

10 of 56

Orthology Prediction Algorithm

• Find orthologous genes by comparing the protein sets of two species (only the longest peptide is considered)
• blastp+sw all versus all (on a paired species basis)
• Build a graph of gene relations based on BRH (best reciprocal hit) and BSR (BLAST score ratio)
• Extract connected components (single linkage clusters), each cluster representing a gene family

[Figure: graphs of hits between mouse and human genes.]
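The graph-building and clustering steps can be sketched in a few lines. This is a simplified illustration, not the Ensembl pipeline: it uses only best reciprocal hits (the BSR criterion is left out because the slides do not define its threshold), and the score dictionaries, gene names and function names are all hypothetical.

```python
from collections import defaultdict


def best_hits(scores):
    """scores: {(query, target): score}. Return each query's best-scoring target."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def single_linkage_clusters(edges):
    """Connected components of the undirected graph defined by the edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Made-up alignment scores for two species.
human_vs_mouse = {("H1", "M1"): 200, ("H1", "M2"): 90, ("H2", "M2"): 150}
mouse_vs_human = {("M1", "H1"): 195, ("M2", "H2"): 140, ("M2", "H1"): 80}
pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
families = single_linkage_clusters(pairs)
```

With more than two species, the edges from every species pair would be pooled before extracting components, so one cluster can span many genomes.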

11 of 56

GeneTree prediction: MUSCLE/PHYML

• Multiple alignment of clusters with MUSCLE (clusters based on BRH and BSR)
• Unrooted gene tree built using PHYML (Guindon & Gascuel, 2003)
• Tree reconciliation (gene tree with species tree) to call duplication events on internal nodes and root the tree using RAP (Dufayard et al. 2005)
• Infer pairwise relations of orthology and paralogy types (from each tree)
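To give a feel for what "calling duplication events on internal nodes" means, here is a small sketch using the species-overlap heuristic. This is explicitly not RAP and not the reconciliation Ensembl runs: it assumes the tree is already rooted and binary, and it calls a node a duplication whenever its two subtrees share a species. The nested-tuple tree encoding and the `species_gene` leaf naming are assumptions made for the example.

```python
def label_events(tree):
    """Label internal nodes of a rooted binary gene tree.

    tree is either a leaf name such as "human_A1" (species before the first
    underscore) or a pair (left_subtree, right_subtree).
    Returns (species under this node, [(subtree, event), ...]) in post-order.
    """
    if isinstance(tree, str):
        return {tree.split("_")[0]}, []
    left, right = tree
    left_species, left_events = label_events(left)
    right_species, right_events = label_events(right)
    # Shared species on both sides means the same genome carries two copies.
    event = "duplication" if left_species & right_species else "speciation"
    return left_species | right_species, left_events + right_events + [(tree, event)]


# Hypothetical family: one ancestral duplication, then a human/mouse split in each copy.
toy_tree = (("human_A1", "mouse_A1"), ("human_A2", "mouse_A2"))
species, events = label_events(toy_tree)
for subtree, event in events:
    print(event, subtree)
```

A full reconciliation also uses the species tree topology, which lets it infer gene losses and choose a root; the overlap test above ignores both.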

12 of 56

Molecular Phylogenetics

• Protein sequences in different species, both:

• Provide information about the history of evolution

• Reconstruct evolution

• We are after an alignment that equally reflects all species:

• Modeling the branching processes by comparing gene and species trees (tree reconciliation)

13 of 56

Phylogenies

Revealing the evolutionary history that has led to the organisms at the current stage.

– Leaves are real genomes
– Internal nodes are ancestors

[Figure legend: duplication node; speciation node or leaf.]

14 of 56

Orthologue and Paralogue types

• ortholog_one2one
• ortholog_one2many
• ortholog_many2many
• apparent_ortholog_one2one
• within_species_paralog
• between_species_paralog
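One way to read the three main orthologue labels is by counting how many genes each species contributes to an orthologous group. The sketch below encodes that reading; it is an interpretation of the type names, not the rule Ensembl applies, and it does not cover apparent_ortholog_one2one or the paralogue types. The function name and its arguments are hypothetical.

```python
def orthology_type(genes_in_species_a, genes_in_species_b):
    """Name the orthology type from the gene counts on each side of a speciation."""
    if genes_in_species_a < 1 or genes_in_species_b < 1:
        raise ValueError("each species must contribute at least one gene")
    if genes_in_species_a == 1 and genes_in_species_b == 1:
        return "ortholog_one2one"
    if genes_in_species_a == 1 or genes_in_species_b == 1:
        return "ortholog_one2many"
    return "ortholog_many2many"


print(orthology_type(1, 1))
print(orthology_type(1, 3))
print(orthology_type(2, 4))
```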

15 of 56

…in Ensembl…

16 of 56

Orthologue and Paralogue types

17 of 56

GeneView

18 of 56

GeneView

19 of 56

GeneTreeView

Links to ATV and JalView

GeneTree

MUSCLE protein alignment

20 of 56

Duplication node (red)

Speciation node (blue)

GeneTreeView

21 of 56

ATV

22 of 56

Protein clustering into families

• Cluster proteins from different organisms that may share the same function

• Obtain some kind of description for ‘novel’ genes/proteins

• Locate family members over the whole genome

• Identify possible orthologues and paralogues in other species

23 of 56

Protein Dataset

• Nearly a million proteins clustered:
– All Ensembl proteins from all species in Ensembl
   • 513,256 predicted proteins
– All metazoan (animal) proteins in UniProt
   • 55,892 UniProt/Swiss-Prot
   • 469,725 UniProt/TrEMBL
• BLASTP all versus all, then clustering with MCL

24 of 56

Clustering Strategy

• BLASTP all-versus-all comparison
• Markov clustering
• For each cluster:
– Calculation of multiple sequence alignments with ClustalW
– Assignment of a consensus description

25 of 56

Markov Clustering (MCL)

• MCL stands for the Markov CLustering algorithm, based on flow simulation in graphs (http://micans.org/mcl/)
• Keeps only very well inter-connected nodes (proteins) in the same graph (cluster)
• Allows rapid and accurate detection of protein families on a large scale
• Automatic description and ClustalW multiple alignment applied to each cluster

[Figure: MCL.]
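The flow simulation behind MCL alternates two matrix operations: expansion (squaring the column-stochastic matrix, which spreads flow along paths) and inflation (raising entries to a power and renormalising, which strengthens strong flows and starves weak ones). The pure-Python sketch below shows that loop on a toy similarity graph. It is a teaching aid under simplifying assumptions (dense matrices, a fixed number of iterations, a naive cluster read-out), not the mcl program itself, and the function names are hypothetical.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, iterations=50):
    """Cluster a graph given as a symmetric matrix of non-negative weights."""
    n = len(adjacency)
    # Self-loops keep every column non-empty and damp oscillations.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: one step of matrix squaring.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entry-wise power, then renormalise the columns.
        m = normalise_columns([[value ** inflation for value in row] for row in m])
    # Read-out: each non-empty row lists the nodes attracted to that row's node.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(cluster) for cluster in clusters)


# Two disconnected triangles of mutually similar proteins (toy weights).
triangle_pair = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(triangle_pair))
```

In practice the edge weights would come from the BLASTP all-versus-all scores, and a higher inflation value tends to give smaller, tighter clusters.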

26 of 56

Link to FamilyView

ProtView

27 of 56

FamilyView

Ensembl family members within human

Ensembl family members in other species

JalView multiple alignments

28 of 56

For each cluster

• We store
– Description and score
– Multiple alignment
• Future extensions
– Improving descriptions
– Multiple alignment assessment
– Build phylogeny on each cluster
• Using the multiple alignment
• Using dS values (mainly inside mammals)
• Extend paralogous prediction

29 of 56

Aligning complete genomes

30 of 56

Whole Genome Alignments

• Understand what evolution has done on the species compared, after speciation
– What is missing in one species, present only in another?
– Differences between closely related species may help understanding speciation
• Define syntenic regions, those long regions of DNA sequence where order and orientation are highly conserved
• Conserved non-coding regions
– Guides to putative regulatory regions

31 of 56

Evolution at the DNA level

…ACTGACATGTACCA…

…AC----CATGCACCA…

• Sequence edits: mutation, deletion
• Rearrangements: inversion, translocation, duplication

32 of 56

Basic Idea

• Functional sequences evolve more slowly than non-functional sequences
• Comparing genomic sequences from species at different evolutionary distances allows us to identify:
– Coding genes
– Non-coding genes
– Non-coding regulatory sequences

33 of 56

Aligning large genomic sequences

• Independent from protein/gene predictions
• Should find all highly similar regions between two sequences
• Should allow for segments without similarity, rearrangements etc.
• Issues
– Heavy process
– Scalability, as more and more genomes are sequenced
– Time constraint
– Computes run only by few dedicated groups
– As the «true» alignment is not known, it is difficult to measure the alignment accuracy and apply the right method

34 of 56

Using a local aligner

• Local alignment
– Finds all highly similar regions over 2 sequences
• Finds the orthologous as well as all the paralogous sequences
– Separated by segments without alignment
– Can handle rearranged sequences
– Needs post-filtering to limit excessively overlapping alignments

35 of 56

Local v Global Alignment

[Figure: local and global alignment panels comparing the sequences AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTTAATC and AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA.]

Local
Advantages: Compares large genomic regions (requires syntenic maps). Can detect rearrangements like translocations, inversions and duplications (!)
Disadvantages: Fails to identify insertions or deletions

Global
Advantages: Detects insertions and deletions
Disadvantages: Fails to detect rearrangements (inversions)
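The contrast between the two modes shows up directly in the dynamic-programming recurrences: a global alignment must pay for every unaligned base, whereas a local alignment may restart from zero anywhere. The sketch below computes both scores with textbook Needleman-Wunsch and Smith-Waterman recurrences and a linear gap penalty. The scoring values and the function name are assumptions for illustration; genome-scale aligners use seeding heuristics rather than full tables like these.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-1):
    """Return (global_score, local_score) for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    glob = [[0] * cols for _ in range(rows)]
    loc = [[0] * cols for _ in range(rows)]
    # Global alignment charges for leading gaps; local alignment does not.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap
    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + step,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            # The extra 0 lets a local alignment start afresh at any cell.
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + step,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[-1][-1], best_local


# A short conserved segment embedded in unrelated flanking sequence.
print(alignment_scores("ACGT", "TTTTACGTTTTT"))  # (-4, 4)
```

The embedded segment scores 4 locally, while the global score is dragged down to -4 by the eight flanking bases that have to be gapped.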

36 of 56

Glocal Alignment Problem

Find the least-cost transformation of one sequence into another using new operations

•Sequence edits (indels, mutations)

•Inversions

•Translocations

•Duplications

•A combination of these

GTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAG

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACT

Glocal aligner (Brudno et al., 2003)

37 of 56

BLASTZ-net, tBLAT and MLAGAN

• BLASTZ-net (comparison at the nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse

• Translated BLAT (comparison at the amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish

• MLAGAN global alignment is used for multispecies alignments

38 of 56

all versus all approach using BLASTZ (collaboration with UCSC)

• Can handle large sequences
• Uses a 2-weighted spaced seeding strategy
• Dynamic masking
• Makes a distinction between repeat and non-repeat sequences (soft masking)
• Tries aligning inside repeats
• One iterative step with a lower threshold to expand alignments

39 of 56

Blastz strategy

• 10Mb Human fragments (3000)
• 30Mb Mouse fragments (100)
• Lineage-specific repeats removed
• 48 hours on 1024 CPUs
• Generates 9Gb of output
• When filtered for best hit on Human, reduced to 2.5Gb
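The figures on this slide give a rough sense of the size of the job. The arithmetic below is a back-of-the-envelope sketch that assumes every human fragment is compared with every mouse fragment and that all CPUs are busy for the full wall-clock time; neither assumption is stated on the slide, and the variable names are only for illustration.

```python
human_fragments = 3000      # 10Mb pieces, as quoted on the slide
mouse_fragments = 100       # 30Mb pieces, as quoted on the slide
cpus = 1024
wall_clock_hours = 48
raw_output = 9.0            # size of the unfiltered output, in the slide's units
filtered_output = 2.5       # size after best-hit filtering, in the same units

# Assumed all-versus-all pairing of fragments.
pairwise_jobs = human_fragments * mouse_fragments
# Assumed full utilisation of every CPU.
cpu_hours = cpus * wall_clock_hours
cpu_minutes_per_job = cpu_hours * 60 / pairwise_jobs
fraction_kept = filtered_output / raw_output

print(pairwise_jobs)                    # 300000
print(cpu_hours)                        # 49152
print(round(cpu_minutes_per_job, 2))    # 9.83
print(round(fraction_kept, 2))          # 0.28
```

Under these assumptions each fragment pair costs on the order of ten CPU-minutes, and best-hit filtering keeps a little over a quarter of the raw output.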

40 of 56

Blastz human genome coverage

• 40% of the human genome is covered by an alignment of mouse sequences

When the alignments are rescored with a very stringent “tight” matrix that looks for high conservation (>70% identity), the coverage goes down to 6%
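A threshold such as ">70% identity" needs a definition of identity. The sketch below uses one common convention, counting matches over the columns where neither sequence has a gap; the slide does not say which convention or scoring matrix was actually used, so both the definition and the helper names are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of matching bases over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def is_highly_conserved(aligned_a, aligned_b, threshold=70.0):
    """Apply the >70% identity cut-off quoted on the slide."""
    return percent_identity(aligned_a, aligned_b) > threshold


# Toy alignment: 10 gap-free columns, 9 of them identical.
print(percent_identity("ACTGACATGTACCA", "AC----ATGCACCA"))  # 90.0
```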

41 of 56

DNA/DNA matches web display

ContigView human EPO

Conserved sequences

42 of 56

DotterView

Mouse sequence

Human sequence

43 of 56

Multiple alignments

• Currently 3 sets:
– MLAGAN-primates:
– MLAGAN-amniote vertebrates:
– MLAGAN-eutherian mammals:

44 of 56

Strategy

• Use all coding exons
• Get sets of best reciprocal hits
• Create orthology maps
• Build multiple global alignments

45 of 56

MultiContigView

46 of 56

Multiple alignments

ContigView human EPO

47 of 56

Alignment on basepair level

Human

Dog

Rat

Mouse

Export alignments

AlignSliceView

48 of 56

MultiContigView vs. AlignSliceView

49 of 56

AlignView

50 of 56

GeneSeqalignView

51 of 56

GeneSeqalignView

52 of 56

Syntenic Regions

• Genome alignments are refined into larger syntenic regions

• Alignments are clustered together when the relative distance between them is less than 100 kb and order and orientation are consistent

• Any clusters less than 100 kb are discarded
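The clustering rule above can be sketched as a single pass over alignments sorted along the reference genome. The sketch assumes all alignments lie on one reference chromosome, treats "distance" as the gap between consecutive alignments in both genomes, and measures block length on the reference; the `Alignment` record and the function names are hypothetical, and the real pipeline may define these details differently.

```python
from dataclasses import dataclass

MAX_GAP = 100_000      # alignments closer than this are chained (100 kb)
MIN_LENGTH = 100_000   # blocks shorter than this are discarded (100 kb)


@dataclass
class Alignment:
    ref_start: int
    ref_end: int
    other_chrom: str
    other_start: int
    other_end: int
    strand: int  # +1 for same orientation, -1 for inverted


def consistent(prev, cur):
    """True if cur can extend a block ending in prev (same chromosome, strand, order)."""
    if prev.other_chrom != cur.other_chrom or prev.strand != cur.strand:
        return False
    if cur.ref_start - prev.ref_end >= MAX_GAP:
        return False
    # On the minus strand the other genome is traversed in decreasing coordinates.
    if prev.strand == 1:
        other_gap = cur.other_start - prev.other_end
    else:
        other_gap = prev.other_start - cur.other_end
    return 0 <= other_gap < MAX_GAP


def syntenic_blocks(alignments):
    """Chain consistent neighbouring alignments, then drop blocks under 100 kb."""
    blocks, current = [], []
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        if current and not consistent(current[-1], aln):
            blocks.append(current)
            current = []
        current.append(aln)
    if current:
        blocks.append(current)
    return [b for b in blocks if b[-1].ref_end - b[0].ref_start >= MIN_LENGTH]


toy_alignments = [
    Alignment(0, 60_000, "chr4", 1_000_000, 1_060_000, 1),
    Alignment(80_000, 150_000, "chr4", 1_075_000, 1_145_000, 1),
    Alignment(400_000, 420_000, "chr7", 5_000, 25_000, 1),
]
blocks = syntenic_blocks(toy_alignments)
```

Here the first two alignments chain into one 150 kb block, while the isolated 20 kb hit on chr7 is discarded.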

53 of 56

SyntenyView

Human chromosome

Mouse chromosomes

Mouse chromosomes

Orthologues

54 of 56

Syntenic blocks

CytoView

55 of 56

Outlook

• OrthoView
• Displaying alignments both from whole genome alignments and on orthologues
• Consider all isoforms for each gene
• Calculate dN/dS

56 of 56

Acknowledgements

• Abel Ureta-Vidal
• Benoît Ballester
• Kathryn Beal
• Stephen Fitzgerald
• Javier Herrero
• Albert Vilella

Ensembl team

Sep 2006

57 of 56

Basic idea

[Figure: an ancestor sequence passes through a speciation event; mutations and selection act on the descendant sequences, which are then compared by alignment. Legend: mutation, regulatory region, exon.]

58 of 56

Global v Local Alignments

Local
Advantages: Compares large genomic regions (uses syntenic maps). Can detect rearrangements like translocations, inversions and duplications (!)
Disadvantages: Fails to identify insertions or deletions

Global
Advantages: Detects insertions and deletions
Disadvantages: Fails to detect rearrangements (inversions)

[Figure: local and global panels; schematic of an inversion (-) and a duplication of segments 1 and 2.]

Glocal aligner (Brudno et al., 2003): pairwise only

59 of 56

Inparalogues vs Outparalogues

Adapted from Sonnhammer & Koonin (2002) TIG 18, 12: 620

60 of 56

Problems: weak orthologies

61 of 56

Problems: misalignments

62 of 56

Possible solutions

• Weak orthologies:
• Poor alignments:
– report to author
– edit alignments, detect wrong edges, redefine blocks
– use another aligner

63 of 56

From Edgar, R. C. (2004) NAR 32:1792-1797