melampsora genome annotation and genome structure analysis

Melampsora Genome Annotation and Genome Structure Analysis First Annotation Workshop of the Melampsora Genome Consortium Yao-Cheng Lin Bioinformatics & Evolutionary Genomics VIB Department of Plant Systems Biology, UGent


Post on 24-Feb-2016


TRANSCRIPT

Page 1: Melampsora Genome Annotation and Genome Structure Analysis

Melampsora Genome Annotation and Genome Structure Analysis

First Annotation Workshop of the Melampsora Genome Consortium

Yao-Cheng Lin

Bioinformatics & Evolutionary Genomics

VIB Department of Plant Systems Biology, UGent

Page 2: Melampsora Genome Annotation and Genome Structure Analysis

Overview

• Gene prediction (structure annotation)
• Gene family analysis
• Phylogenetic position of Melampsora

Page 3: Melampsora Genome Annotation and Genome Structure Analysis

EuGène: gene prediction platform

[Diagram: EuGène combines intrinsic and extrinsic information to turn genomic sequence into predicted genes]

• Input: genomic sequence
• Intrinsic information
– Coding IMM, intronic IMM: content potential for coding, intronic and intergenic regions
– FunSiP: splice sites (GT/AG)
– Translation start site
• Extrinsic information
– TE & repeat database: RepeatMasker
– Protein databases: BlastX
– EST databases: BlastN, GenomeThreader
– Puccinia genomic sequence: TblastX
• Other prediction programs
• Output: predicted genes, alternative models

Page 4: Melampsora Genome Annotation and Genome Structure Analysis

Resources for Melampsora gene prediction

• Gene models for training
– Previously identified core genes in basidiomycetes
– Genes with manual curation from INRA-Nancy
• Splice site training/prediction
– FunSiP: developed by Michiel Van Bel, who also helped with training
• BlastX database
– 8 basidiomycete proteomes, Fungi RefSeq, SwissProt
• TBLASTX database
– Puccinia graminis genomic sequence
• EST libraries
– JGI Sanger sequencing
– 454 pyrosequencing (the first MIRA assembly)
• Repeat libraries
– Hadi/Marie-Pierre
– In-house script, collected from the first run of gene prediction
– Masked area from JGI
• EuGene 3.4

Page 5: Melampsora Genome Annotation and Genome Structure Analysis

Gene prediction – comparison of two prediction results

                                       EuGene           JGI
Number of protein coding genes         17,167           16,694
Coding sequence < 300 aa               6,989 (40.7%)    8,212 (49.2%)
Average gene length (bp)               1,742.7          1,685.5
Average coding sequence length (bp)    1,369.7          1,131.4
Average exon length (bp)               261.1            235
Average exon number                    5.3              4.8
Average intron length (bp)             86.9             117.8
SwissProt support                      6,521 (38.0%)    5,699 (34.1%)
EST support                            6,152 (35.8%)    6,241 (37.4%)
EST support (< 300 aa)                 1,066            995

Page 6: Melampsora Genome Annotation and Genome Structure Analysis

Gene prediction – protein length distribution

[Figure: protein length distribution. x-axis: protein length (aa), 100 to 1,900; y-axis: frequency (%), 0 to 40; series: Melampsora JGI, Melampsora EuGene, Laccaria, Puccinia]

Page 7: Melampsora Genome Annotation and Genome Structure Analysis

Example: metallothionein-like protein

• Metallothionein-like protein in Magnaporthe
– Protein length: 22 amino acids (MMT1)
– Six cysteine residues
– Mmt1 mutants lose the ability to cause plant disease
• Difficulties in in silico identification
– Sequence divergence
– Short sequences are easily rejected by an E-value cut-off
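
The E-value problem for short proteins can be illustrated with the standard BLAST relation E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. The following is a minimal sketch only: the database size is a hypothetical value, the function names are illustrative, and BLAST's effective-length corrections are ignored.

```python
import math

def expected_hits(query_len, db_len, bit_score):
    """E-value from the Karlin-Altschul relation E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

def min_bit_score(query_len, db_len, evalue_cutoff):
    """Smallest bit score that still satisfies E <= evalue_cutoff."""
    return math.log2(query_len * db_len / evalue_cutoff)

DB_LEN = 10_000_000  # hypothetical database size in residues (assumption)
CUTOFF = 1e-5        # example cutoff (assumption for this sketch)

# Required bit score for a 22-aa protein and for two longer queries.
for length in (22, 100, 300):
    print(length, round(min_bit_score(length, DB_LEN, CUTOFF), 1))
```

The required bit score changes only slowly with query length, but a 22-residue protein offers at most 22 aligned positions to accumulate that score, so even moderate divergence can push a true homolog above the cutoff.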

Page 8: Melampsora Genome Annotation and Genome Structure Analysis

Overview

• Gene prediction and annotation platform
• Gene family analysis
• Phylogeny position of Melampsora

Page 9: Melampsora Genome Annotation and Genome Structure Analysis

Gene family expansion and contraction

• Gene family clustering
– Similarity search with 12 fungal genomes (10 basidiomycetes, 2 ascomycetes): all-against-all BLASTP, E-value cutoff 1e-5.
– Gene families constructed by TribeMCL with inflation factor 4.0.
• Species/lineage-specific gene family expansions
– The mean gene family size and standard deviation were calculated for all gene families (excluding SSFs and orphans).
– To center and normalize the data, the matrix of profiles was transformed into a matrix of z-scores.
• Functional assignment
– Domain based: RPS-BLAST
– HMM profile for each family -> search the SwissProt and NR databases.
– GO terms.
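
As an illustration of how clustered families could be turned into a per-genome profile, here is a minimal sketch. It rests on assumptions not stated in the slides: gene IDs follow a hypothetical `species|gene` convention, an orphan is a gene left alone in its cluster, and a species-specific family (SSF) is a multi-gene cluster drawn from a single species. The function names are illustrative.

```python
from collections import Counter

def species_of(gene_id):
    """Hypothetical ID convention: 'species|gene'."""
    return gene_id.split("|", 1)[0]

def build_profile(clusters, species):
    """Split clusters into a count profile, orphans and species-specific families.

    Returns (profile, orphans, ssfs); each profile row holds the gene count
    per species, in the order given by `species`.
    """
    profile, orphans, ssfs = [], [], []
    for members in clusters:
        counts = Counter(species_of(g) for g in members)
        if len(members) == 1:
            orphans.append(members)      # single gene, no family
        elif len(counts) == 1:
            ssfs.append(members)         # all members from one species
        else:
            profile.append([counts.get(sp, 0) for sp in species])
    return profile, orphans, ssfs

clusters = [
    ["A|g1", "B|g1", "B|g2", "C|g1"],  # shared family
    ["B|g3", "B|g4", "B|g5"],          # species-specific family
    ["C|g2"],                          # orphan
]
profile, orphans, ssfs = build_profile(clusters, ["A", "B", "C"])
print(profile)  # [[1, 2, 1]]
```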

Page 10: Melampsora Genome Annotation and Genome Structure Analysis

Protein phylogeny profile / z-score

Protein phylogeny profile (rows: gene families; columns: genomes)

Family    A     B     C     Mean    SD
1         5     10    15    10      5
2         4     6     5     5       1
3         20    5     10    11.7    7.6
100       1     1     1                     (core gene family)
201       0     10    0                     (species-specific gene family)

Z-score profile

Family    A      B      C
1         -1     0      1
2         -1     1      0
3         1.1    -0.9   -0.2

Z = (gene number - mean gene number) / standard deviation
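
The z-score transformation on this slide can be sketched in a few lines. This is an illustration rather than the author's script: it uses the sample standard deviation (n - 1), which reproduces the SD values shown for families 1 to 3, and it assumes that families with zero variance are assigned z = 0.

```python
from statistics import mean, stdev

def z_score_profile(profile):
    """Convert each family's gene counts into z-scores across genomes."""
    z_rows = []
    for row in profile:
        mu = mean(row)
        sd = stdev(row)  # sample SD (n - 1); matches the slide's numbers
        if sd == 0:
            # Assumption: a family with identical counts everywhere gets z = 0.
            z_rows.append([0.0] * len(row))
        else:
            z_rows.append([(x - mu) / sd for x in row])
    return z_rows

# Families 1 to 3 from the slide, genomes A, B, C.
profile = [[5, 10, 15], [4, 6, 5], [20, 5, 10]]
for row in z_score_profile(profile):
    print([round(z, 1) for z in row])
# [-1.0, 0.0, 1.0]
# [-1.0, 1.0, 0.0]
# [1.1, -0.9, -0.2]
```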

Page 11: Melampsora Genome Annotation and Genome Structure Analysis

Fungi genomes characteristics

Genome                         Genome size (Mb)   Genes     Genes < 300 a.a.   GC content (%)
Magnaporthe grisea             41.7               12,832    5,312 (41.4%)      51.6
Neurospora crassa              39.23              9,822     3,445 (35.1%)      49.3
Sporobolomyces roseus          21.1               5,536     1,714 (31.0%)      49.5
Puccinia graminis              88.64              20,566    11,319 (55.0%)     43.0
Melampsora larici-populina     101.1              16,694    8,212 (49.2%)      42.1
Ustilago maydis                19.7               6,522     1,668 (25.6%)      54.0
Malassezia globosa             8.9                4,286     1,468 (34.3%)      52.0
Postia placenta                90.9               12,415    4,629 (37.3%)      52.4
Phanerochaete chrysosporium    35.1               10,048    3,579 (35.6%)      53.2
Laccaria bicolor               64.9               19,036    10,013 (52.6%)     46.6
Coprinus cinereus              37.5               13,544    5,487 (40.5%)      51.6
Cryptococcus neoformans        19.5               7,170     2,372 (33.1%)      48.2

Page 12: Melampsora Genome Annotation and Genome Structure Analysis

Orphans / Species specific gene families

[Figure: bar chart per genome. y-axis: % of genes, 0 to 80; series: orphans, genes in species-specific families; genomes (left to right): Neurospora crassa, Magnaporthe grisea, Cryptococcus neoformans, Coprinus cinereus, Laccaria bicolor, Phanerochaete chrysosporium, Postia placenta, Malassezia globosa, Ustilago maydis, Sporobolomyces roseus, Puccinia graminis, Melampsora larici-populina]

Page 13: Melampsora Genome Annotation and Genome Structure Analysis

Difference in average gene family size

[Figure: mean z-score per genome. y-axis: mean z-score, -0.5 to 0.4; genomes: Neurospora crassa, Magnaporthe grisea, Cryptococcus neoformans, Coprinus cinereus, Laccaria bicolor, Phanerochaete chrysosporium, Postia placenta, Malassezia globosa, Ustilago maydis, Sporobolomyces roseus, Puccinia graminis f. sp. tritici, Melampsora larici-populina]

*Total: 8,035 families, excluding the species-specific families

Page 14: Melampsora Genome Annotation and Genome Structure Analysis

Hierarchical clustering of gene family

[Figure: hierarchical clustering heat map of gene family profiles. Genomes: N. crassa, M. grisea, S. roseus, P. graminis, M. larici-populina, U. maydis, M. globosa, P. placenta, P. chrysosporium, C. cinereus, L. bicolor, C. neoformans]

• The 100 most variable profiles were selected, ranked by standard deviation.

• Red: Protein kinase, esterase lipase, cre recombinase, DNA/RNA helicase, Leucine-rich repeat

• Blue: major facilitator superfamily
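
Selecting the most variable profiles before clustering can be sketched as follows. This is an illustration under assumptions: the slides do not say whether the standard deviation was taken on raw counts or on z-scores, so the sketch simply ranks whatever per-genome values it is given, and the family names are hypothetical.

```python
from statistics import stdev

def most_variable(profiles, top_n=100):
    """Return the IDs of the top_n families with the largest standard deviation.

    profiles: dict mapping family ID -> list of per-genome values.
    """
    ranked = sorted(profiles, key=lambda fam: stdev(profiles[fam]), reverse=True)
    return ranked[:top_n]

# Toy profiles for three hypothetical families across three genomes.
profiles = {
    "fam1": [5, 10, 15],
    "fam2": [4, 6, 5],
    "fam3": [20, 5, 10],
}
print(most_variable(profiles, top_n=2))  # ['fam3', 'fam1']
```

The selected rows would then be passed to whichever hierarchical clustering tool produces the heat map.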

Page 15: Melampsora Genome Annotation and Genome Structure Analysis

Overview

• Gene prediction and annotation platform
• Gene family analysis
• Phylogeny position of Melampsora

Page 16: Melampsora Genome Annotation and Genome Structure Analysis

Phylogenies of Melampsora

• Construct the Melampsora phylogenetic tree based on FUNYBASE with selected fungal genomes.

• FUNYBASE: single-copy gene family (246 genes) within 21 fungi species (mostly ascomycetes).

• 22 selected species:
– Ascomycete: Aspergillus nidulans, Coccidioides immitis, Fusarium graminearum, Mycosphaerella graminicola, Magnaporthe grisea, Neurospora crassa, Nectria haematococca, Pyrenophora tritici-repentis, Stagonospora nodorum, Schizosaccharomyces pombe, Sclerotinia sclerotiorum
– Basidiomycete: Coprinus cinereus, Cryptococcus neoformans, Laccaria bicolor, Malassezia globosa, Melampsora larici-populina, Phanerochaete chrysosporium, Puccinia graminis, Postia placenta, Sporobolomyces roseus, Ustilago maydis
– Zygomycete: Rhizopus oryzae

*new genome; reject in FUNYBASE

Page 17: Melampsora Genome Annotation and Genome Structure Analysis

Phylogenies of Melampsora - Method

• 246 HMM models for the conserved protein sequence blocks in FUNYBASE.

• For each genome, search the whole proteome with HMMER and retain the protein sequence of the best hit for each model.

• 148 models have a single-copy gene in each of our 22 selected species.

• Concatenate the 148 single-copy orthologs for tree building.
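
The filtering and concatenation steps above can be sketched as below. The slides do not state how copy number was assessed, so this sketch assumes a model counts as single-copy when exactly one protein per species passes the search threshold, and that each model's sequences have already been aligned. All names and data structures are hypothetical.

```python
def single_copy_models(hits, species):
    """Keep models with exactly one hit in every species.

    hits: dict model -> dict species -> list of protein IDs (assumed layout).
    """
    return [
        model for model, per_species in hits.items()
        if all(len(per_species.get(sp, [])) == 1 for sp in species)
    ]

def concatenate(alignments, models, species):
    """Join the aligned sequences of the chosen models into one string per species.

    alignments: dict model -> dict species -> aligned sequence (assumed layout).
    """
    return {sp: "".join(alignments[m][sp] for m in models) for sp in species}

species = ["A", "B"]
hits = {
    "m1": {"A": ["a1"], "B": ["b1"]},
    "m2": {"A": ["a2", "a3"], "B": ["b2"]},  # two copies in A: dropped
    "m3": {"A": ["a4"], "B": ["b3"]},
}
alignments = {
    "m1": {"A": "MK-L", "B": "MKAL"},
    "m3": {"A": "GG", "B": "GA"},
}
models = single_copy_models(hits, species)
supermatrix = concatenate(alignments, models, species)
print(models)       # ['m1', 'm3']
print(supermatrix)  # {'A': 'MK-LGG', 'B': 'MKALGA'}
```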

Page 18: Melampsora Genome Annotation and Genome Structure Analysis

Melampsora in the phylogenetic tree of fungi

Tree built with phylo_win: neighbor-joining method with Poisson correction, 500 bootstrap replicates.
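
For reference, the Poisson correction converts the observed proportion of differing amino acid sites p into a distance d = -ln(1 - p). The sketch below illustrates that formula only; it is not phylo_win's code, and skipping gapped columns pair by pair is an assumption about gap handling.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance undefined when every site differs")
    return -math.log(1.0 - p)

# One difference over five comparable sites: p = 0.2.
print(round(poisson_distance("MK-LGG", "MKALGA"), 4))  # 0.2231
```

A pairwise matrix of such distances is the input a neighbor-joining implementation would work from.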

Page 19: Melampsora Genome Annotation and Genome Structure Analysis

Acknowledgements

• Gent
– Stephane Rombauts
– Michiel Van Bel
– Klaas Vandepoele
– Kenny Billiau
– Thomas Abeel
– Pierre Rouzé
– Lieven Sterck
– Yves Van de Peer
• Nancy
– Stéphane Hacquard
– Emilie Tisserant
– Marie-Pierre Oudot-Le Secq
– Sébastien Duplessis
– Francis Martin