Yeast Genome Project

Introduction

• Saccharomyces cerevisiae

• It is perhaps the most useful diploid yeast, having been instrumental to winemaking, baking and brewing since ancient times

• It is one of the most intensively studied eukaryotic model organisms in molecular and cell biology

• Size: 5-10 µm in diameter

• Sequenced in year: 1996

• Strain sequenced: S288C

• Databases:
  o Munich Information Centre for Protein Sequences (MIPS): http://www.mips.biochem.mpg.de/mips/yeast/
  o Yeast Protein Database (YPD): http://quest7.proteome.com/YPDhome.html
  o Saccharomyces Genome Database (SGD): http://genome-www.stanford.edu/Saccharomyces/

• Schizosaccharomyces pombe (Fission yeast)

• It is used as a model organism in molecular and cell biology

• Size: 3 to 4 µm in diameter and 7 to 14 µm in length

• Sequenced in year: 2002

• Strain sequenced: 972 h- by a European sequencing consortium (EUPOM) including 13 laboratories and the Wellcome Trust Sanger Institute; Cold Spring Harbor Laboratory

• Databases:
  o PomBase: http://www.pombase.org
  o Broad Institute (Schizosaccharomyces group): http://www.broadinstitute.org/annotation/genome/schizosaccharomyces_group/MultiHome.html

• Candida albicans

• Most common human fungal pathogen

• It is a diploid fungus that grows both as yeast and as filamentous cells. It is a causal agent of opportunistic oral and genital infections in humans and of candidal onychomycosis, an infection of the nail plate

• Size: 2.0-7.0 µm in diameter, 3.0-8.5 µm in length

• Sequenced in year: 2004, by consortia formed by the Stanford technology centre

• Strain sequenced: SC5314

• Databases:
  o Candida database: http://www.candidagenome.org
  o Broad Institute (Candida group): http://www.broadinstitute.org/annotation/genome/candida_group/MultiHome.html

• The baker's yeast Saccharomyces cerevisiae was the first eukaryote whose genome was entirely sequenced

• Mitochondrial DNA was sequenced in segments in the 1980s.

• In 1989, it was decided to initiate a yeast sequencing project within the frame of the EU biotechnology programmes; some 35 European laboratories became initially involved in this enterprise [Vassarotti & Goffeau, 1992]

• Chromosome III was the first chromosome to be completed, in 1992, followed by XI and II, both in 1994

• The publication of the 315 kb sequence of yeast chromosome III was a remarkable scientific landmark, not only because it was the first eukaryotic chromosome ever to be sequenced, but primarily because it revealed the extent of what remained to be understood in the genome of an otherwise extensively studied organism such as Saccharomyces cerevisiae

• Soon after its beginning, several other laboratories joined the project and agreed upon an international collaboration that enabled the whole yeast genome sequence to be finalized in 1995

• More than 600 scientists in Europe, North America and Japan became involved in this effort and the entire sequence was released in April 1996.

Figure: Consortia involved in the yeast genome sequencing project (EU = 55.9%, UK = 17.6%, USA = 20.0%, Canada = 4.3%, Japan = 2.2%)

Cloning and Mapping Procedures

• The sequencing of chromosome III started from a collection of overlapping plasmid or phage lambda clones that were distributed by the DNA coordinator to the contracting laboratories. However, it soon became evident that ordered cosmid libraries were much more advantageous to aid large-scale sequencing.

• To construct a library with as complete coverage as possible with as few clones as possible, the cloned DNA fragments should be randomly distributed on the DNA.

• Under these conditions, the number of clones (N) in a library representing each genomic segment with a given probability (P) is

N = ln(1 - P) / ln(1 - f)

where f is the insert length expressed as a fraction of the genome size [Clarke & Carbon, 1976].

• For example, with the size of 12,800 kb for the yeast genome and assuming an average insert length of 35 kb, a cosmid library containing 4600 random clones would represent the yeast genome at P=99.99%, i.e. about twelve times the genome equivalent
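The library-size formula above can be checked numerically. The following is a minimal Python sketch, not part of the original project tooling; the function names are illustrative assumptions, and the inputs (12,800 kb genome, 35 kb inserts, 4600 clones) are the figures quoted in the example. With these inputs the formula gives on the order of 3,400 clones for P = 99.99%, so a 4600-clone library (about twelve genome equivalents) sits beyond that threshold.

```python
import math


def clones_needed(p, insert_kb, genome_kb):
    """Clarke & Carbon: N = ln(1 - P) / ln(1 - f), with f = insert / genome."""
    f = insert_kb / genome_kb
    return math.log(1.0 - p) / math.log(1.0 - f)


def coverage_probability(n_clones, insert_kb, genome_kb):
    """Same relation solved for P: P = 1 - (1 - f)^N."""
    f = insert_kb / genome_kb
    return 1.0 - (1.0 - f) ** n_clones


def genome_equivalents(n_clones, insert_kb, genome_kb):
    """Total cloned DNA expressed in multiples of the genome size."""
    return n_clones * insert_kb / genome_kb


if __name__ == "__main__":
    genome_kb, insert_kb = 12_800, 35  # values quoted in the text
    print("clones for P = 99.99%:", math.ceil(clones_needed(0.9999, insert_kb, genome_kb)))
    print("P for 4600 clones:", coverage_probability(4600, insert_kb, genome_kb))
    print("genome equivalents:", genome_equivalents(4600, insert_kb, genome_kb))
```

The sketch assumes, as the text does, that inserts are randomly distributed over the genome; cloning bias would raise the number of clones actually required.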

• A low number of clones was of interest in setting up ordered yeast cosmid libraries, or specific sublibraries sorted out from an unordered cosmid library by colony hybridization, using specific chromosomal DNA purified by pulsed-field gel electrophoresis as a probe

• The 'nested chromosomal fragmentation' method [Thierry & Dujon, 1992] was then applied to rapid sorting of these clones

• Finally, a set of overlapping cosmids was sufficient to build a contig of a specific chromosome

• This approach has also been successfully applied to many of the other chromosomes sequenced in the yeast genome project

• To facilitate sequencing and assembly of the sequences, contigs of overlapping cosmids and fine-resolution physical maps of the respective chromosomes were constructed first, by application of classical mapping methods (fingerprints, cross-hybridization) or by novel methods developed for this programme, such as site-specific chromosome fragmentation [Thierry & Dujon, 1992] or the high resolution cross-hybridization matrix [Scholler et al., 1995]

• These techniques were also of interest for other genomes and, particularly, for mapping YAC inserts

Sequencing strategies and Sequence Assembly

• In the European network, clones were distributed to the collaborating laboratories according to a scheme worked out by the DNA coordinators

• Each contracting laboratory was free to apply sequencing strategies and techniques of its own provided that the sequences were entirely determined on both strands and unambiguous readings were obtained

• Two principal approaches were used to prepare subclones for sequencing:
  1) generation of sub-libraries by the use of a series of appropriate restriction enzymes or from nested deletions of appropriate sub-fragments made by exonuclease III
  2) generation of shotgun libraries from whole cosmids or subcloned fragments by random shearing of the DNA

• Sequencing by the Sanger technique was done:
  1) manually, labelling with [35S]dATP being the preferred method of monitoring
  2) by automated devices

• Two types of devices for on-line detection with fluorescence labelling were employed:
  1) Applied Biosystems ABI373A
  2) Pharmacia A.L.F.

• One laboratory used the direct blotting electrophoresis system from GATC company (Konstanz). Similar procedures were applied to the sequencing of chromosomes outside the European network. The American laboratories largely relied on machine-based large-scale sequencing.

Sequencing Telomeres

• The yeast chromosome telomeres presented a particular problem

• Due to their repetitive sub-structures and the lack of appropriate restriction sites, they could not be cloned by conventional procedures, with only a few exceptions

• Largely, telomeres were physically mapped relative to the terminal-most cosmid inserts using the I-SceI chromosome fragmentation procedure [Thierry & Dujon, 1992]

• The sequences were then determined from specific plasmid clones obtained by 'telomere trap cloning', an elegant strategy developed by E. Louis at Oxford [Louis, 1994; Louis & Borts, 1995]

Sequence Assembly

• Within the European network, all original sequences were submitted by the collaborating laboratories to the Martinsried Institute for Protein Sequences (MIPS), which acted as an informatics centre

• The sequences were kept in a data library, assembled into progressively growing contigs, and updated during the course of the project by the application of appropriate criteria in a number of quality controls, starting with chromosome XI

• In collaboration with the DNA coordinators, the final chromosome sequences were derived. For the other yeast chromosomes as well, automated procedures were employed for sequence assembly, based for example on:
  1) the program package developed at Cambridge [e.g. Dear & Staden, 1991]
  2) the ACeDB program developed for the C. elegans genome project [Thierry-Mieg & Durbin, 1992]

• In any case, correct assembly of the sequences was guaranteed by establishing that the order of restriction sites predicted from the sequence was consistent with the physical maps of these sites that had been determined independently and care was taken to perform quality controls that would result in a high accuracy

• From theoretical considerations taking all types of errors together, an average sequence accuracy of 99.9% corresponds to about one error per 1,000 bases

• In practice, care was taken to minimize frameshift errors, which represented about two thirds of all sequencing errors and thus would have the most deleterious effects on gene interpretation. Meanwhile, all sequences have been systematically checked for errors again and were corrected in the data libraries.
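The restriction-map consistency check described above can be illustrated with a small sketch: predict the order of restriction sites from an assembled sequence and compare it with an independently mapped order. This is a simplified Python illustration under stated assumptions, not the procedure used at MIPS; the function names and the toy sequence are hypothetical, and only plain recognition-site matching is performed (EcoRI = GAATTC, BamHI = GGATCC, HindIII = AAGCTT).

```python
def restriction_sites(sequence, site):
    """Return 0-based start positions of every occurrence of `site`."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def predicted_site_order(sequence, enzymes):
    """Merge the sites of several enzymes into one list ordered along the sequence."""
    hits = []
    for name, site in enzymes.items():
        hits.extend((pos, name) for pos in restriction_sites(sequence, site))
    return [name for _pos, name in sorted(hits)]


def consistent_with_map(sequence, enzymes, mapped_order):
    """True if the order predicted from the sequence equals the mapped order."""
    return predicted_site_order(sequence, enzymes) == list(mapped_order)


if __name__ == "__main__":
    enzymes = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}
    toy_contig = "AAGAATTCTTGGATCCAAAAGCTTGGGAATTC"  # hypothetical sequence
    physical_map = ["EcoRI", "BamHI", "HindIII", "EcoRI"]  # hypothetical map
    print(predicted_site_order(toy_contig, enzymes))
    print(consistent_with_map(toy_contig, enzymes, physical_map))
```

A real check would also compare fragment lengths against the physical map, since a misassembly can preserve site order while changing distances.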

The sequences have been interpreted using the following principles:
  i. All intron splice site/branch-point pairs detected by using specially defined patterns were listed
  ii. All ORFs containing at least 100 contiguous sense codons and not contained entirely in a longer ORF were listed
  iii. Centromere and telomere regions, as well as tRNA genes and Ty elements or remnants thereof, were sought by comparison with previously characterized datasets
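The ORF criterion in principle ii (at least 100 contiguous sense codons) can be sketched as a scan for stop-free stretches in all six reading frames. This is a minimal Python sketch under stated assumptions, not the annotation pipeline used by the project: it uses the standard nuclear stop codons, does not require a start codon, and does not filter ORFs contained in a longer ORF in another frame or on the other strand. All function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_stretches(seq, min_codons=100):
    """Yield (frame, start, end, n_codons) for runs of sense codons on one strand."""
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - run_start) // 3
                if n_codons >= min_codons:
                    yield frame, run_start, pos, n_codons
                run_start = pos + 3  # next run begins after the stop codon
            pos += 3
        # Run that reaches the end of the sequence without meeting a stop
        n_codons = (pos - run_start) // 3
        if n_codons >= min_codons:
            yield frame, run_start, pos, n_codons


def find_orfs(seq, min_codons=100):
    """Scan both strands; '-' coordinates refer to the reverse-complemented string."""
    seq = seq.upper()
    orfs = [("+",) + hit for hit in stop_free_stretches(seq, min_codons)]
    orfs += [("-",) + hit
             for hit in stop_free_stretches(reverse_complement(seq), min_codons)]
    return orfs


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 120 + "TAA"  # hypothetical 121-codon ORF plus stop
    for orf in find_orfs(toy):
        print(orf)
```

Because stretches are delimited stop to stop, two stretches in the same frame can never be nested, which covers the same-frame case of the "not contained entirely in a longer ORF" rule.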

• FASTA, BLASTX and FLASH were used for similarity searches, in combination with the Protein Sequence Database of PIR-International and other public databases

• Protein signatures were detected by using the PROSITE dictionary, as well as BLOCKS and PRODOM domains

• Analyses of base composition, nucleotide pattern frequencies, GC profiles and ORF distribution profiles were performed by using GCG programs or the X11 program package

• For calculations of the GC content of ORFs, the algorithm CODONS was used

• This information was compiled at the end of the sequencing project to annotate all genetic elements in the yeast genome

Classification of S. cerevisiae genes

ORF sizes in the S. cerevisiae genome

At the time the yeast genome sequencing project was finalized, comparison of the total sequence with public databases revealed:

• Some 28.4% of the yeast ORFs corresponded either to previously known protein-encoding genes or to genes whose functions have been determined previously or during the course of the project

• An estimated 5.6% of the total remained questionable ORFs

• 66% of the total ORFs represented novel putative yeast genes

• 14.8% of the total had homologues among gene products from yeast or other organisms whose functions are known

• 14.4% of the total had recognizable motifs or weak homologies to genes of experimentally characterized functions

• Remaining 37.7% of the total ORFs had either homologues to ORFs of unknown function on other

• Thus, approximately 2200 of the yeast genes had to be categorized as ‘genes of unknown function’, sometimes called ‘orphans’

A most useful inventory of the yeast proteins had been compiled in the Yeast Proteome Database (YPD) [Garrels et al., 1996] and is updated regularly.

The mystery of orphans

• ‘Orphans’ are defined by the absence of known function and of structural homologs of known function, so it seems only natural that, with time, they will vanish.

• Functions of a few genes previously classified as orphans were reported during the sequencing project itself

• The most striking result from the chromosome III sequence was that approximately half of all protein-coding ORFs revealed by the sequence had no clear-cut sequence homologs in any organism, including yeast itself

• Thus, with the first sequence of a eukaryotic chromosome, it was the discovery of the extent of our ignorance, rather than the discovery of many new genes, that was the most conspicuous finding

• Exact figures depend on the stringency criteria applied to determine the significance of sequence similarities

• On average, 30-35% of all ORFs of the yeast genome are orphans.

• Even in the absence of homologs, computers can provide some clues about the nature of some orphans.

• For example, prediction of transmembrane segments resulted in the striking conclusion that up to 35-40% of the predicted proteins from chromosome III have transmembrane helices. Ultimately, the function of each sequence-predicted ORF can only be demonstrated by experiments

• The total number of orphans in the yeast genome is about 2000

• It is clear that orphans, by and large, are not fundamentally different from other yeast genes in terms of expression.

  o If orphans are real genes, why were they not discovered before? Genome redundancy is a possible explanation. As sequencing progressed, structural homologs to earlier orphans were regularly discovered in the yeast genome. Statistically, however, there is no indication that orphans tend to be more frequently duplicated than the genes previously characterized by classical genetics or their structural homologs. If anything, the converse seems to be true.

Gene Density and Gene Arrangement of Protein-encoding Genes in S. cerevisiae

• From the number of genes and the total size of the yeast genome one arrives at a gene density of about one gene per 2 kb

• Gene density in all yeast chromosomes is rather similar

• Excluding the ORFs contributed by the Ty elements, ORFs occupy on average 70% of the sequences. This leaves only limited space for the intergenic regions which can be thought to harbour the major regulatory elements involved in chromosome maintenance, DNA replication and transcription.

• The compact nature of the S. cerevisiae genome is apparent when compared to more complex eukaryotic systems.

• C. elegans contains a potential protein-encoding gene only every 5-6 kb [Hodgkin et al., 1994]

• In the human genome, gene density had been estimated to be as low as one gene in 30 kb [Olson, 1993]; since the draft sequence became available, this figure has been revised to one gene in about 100 kb

• Schizosaccharomyces pombe possesses a lower gene density (one gene per 2.3 kb) than S. cerevisiae. The difference between the two yeast genomes appears to be due to the fact that in the fission yeast 40% of the genes contain introns, whereas only a minor fraction (< 5%) of the protein-encoding genes in S. cerevisiae are found to be interrupted by introns

• Generally, ORFs appear to be rather evenly distributed among the two strands of the single chromosomes. In some chromosomes (e.g. I, II, VIII), there is a slight excess of coding capacity on one of the strands, the significance of which is not known

• Average base composition of yeast DNA is 38.4% (G+C)

• GC content of:
  1. protein-coding regions (40.2%)
  2. non-coding regions (35.1%)

• Coding regions are evenly distributed between the two strands

• Average ORF size is 1450 bp

• The average sizes of inter-ORF regions vary between 630 and 945 bp for different chromosomes:
  1. 618 bp on average for 'divergent promoters' (36.2% GC)
  2. 326 bp for 'convergent terminators' (29.3% GC)
  3. 517 bp for 'promoter-terminator combinations' (34.2% GC)

• Average base composition has been found to be symmetrical over entire chromosomes

• The ORFs themselves show a significant excess of homopurine pairs on the coding strand

• Regional variations of base composition with similar amplitudes were first noted along chromosome III

• A most interesting observation was that the compositional periodicity correlates with local gene density, reaching more than 85% in GC-rich regions, followed by segments of comparably lower gene density (50-55%) in AT-rich regions [Dujon et al., 1994].
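The regional variation in base composition noted above is usually visualized as a sliding-window GC profile. The following is a minimal Python sketch, assuming a plain DNA string as input; it is not the GCG or X11 tooling mentioned earlier, and the function names, window size and step are illustrative choices.

```python
def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases (A, C, G, T) in `seq`."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def gc_profile(seq, window, step):
    """Return (window_start, gc_fraction) pairs for windows sliding along `seq`."""
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    toy = "GGCCGGCC" * 5 + "AATTAATT" * 5  # hypothetical GC-rich then AT-rich region
    for start, gc in gc_profile(toy, window=20, step=20):
        print(start, round(gc, 2))
```

Plotting such a profile next to the ORF positions along a chromosome is one way to look for the correlation between GC-rich segments and gene density described in the text.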

Functional elements of yeast chromosome:
  1. Centromere
  2. Telomere
  3. Origins of replication

Complex and Simple repeats

• The yeast genome is remarkably poor in repeated sequences

• A unique constellation of repetitious sequences is found at the two ends of chromosome I. Approximately 30 kb in each subtelomeric region carry similar (but non-essential) genes and a 15 kb repeat

• These terminal regions represent the yeast equivalent to heterochromatin, and the occurrence of this type of DNA suggests that its presence gives this chromosome the critical length required for proper stability and function

• The 30 kb region can be removed from each end without affecting vegetative growth, although chromosome stability is considerably reduced

• Besides the Ty elements, it is the rDNA on chromosome XII that most significantly contributes to repetitiveness. A cluster of some 15 tandem repeats (2 kb each) containing the CUP1 gene and contributing to polymorphic variation is found on chromosome VIII

• Repeated stretches of short oligonucleotides exist. These include poly(A) or poly(T) tracts, alternating poly(AT) or poly(TG) tracts, and direct or inverted long repeats

Genome Inventory of S. cerevisiae

Graphical View of Protein Coding Genes of S. cerevisiae (as of Nov 20, 2013)

S. cerevisiae gene products that are annotated to one or more terms in each GO aspect

Distribution of Gene Products among Molecular Function Categories

Distribution of Gene Products among Cellular Component Categories

Distribution of Gene Products among Biological Process Categories

Genome Inventory of S. pombe (2004 and 2013)

Genome Inventory of C. albicans

Graphical View of Protein Coding Genes of C. albicans (as of Nov 20, 2013)

C. albicans gene products that are annotated to one or more terms in each GO aspect

Distribution of Gene Products among Cellular Component Categories

Distribution of Gene Products among Biological Process Categories

Distribution of Gene Products among Molecular Function Categories

Feature type (total)          Saccharomyces cerevisiae   Schizosaccharomyces pombe   Candida albicans
No. of genes                  6,607                      5,123                       6,214
Chromosome length (bp)        12,157,105                 12,362,167                  14,324,315
Nuclear genome (bp)           12,071,326                 12,342,737                  14,283,895
Mitochondrial genome (bp)     85,779                     19,430                      40,420
No. of chromosomes            16                         3                           8
Mean coding length (bp)       1485                       1426                        1439
No. of introns                272                        4730                        224
Coding percentage             69.9 %                     57.5 %                      61.5 %
Non-coding RNA                92                         450                         -
GC content                    39 %                       36 %                        33.46 %
Gene density (bp per gene)    2124                       2528                        2342
Unique proteins               1104                       681                         1218
Pseudogenes                   19                         29                          7
Centromere                    16                         3                           8
tRNA                          299                        171                         156
rRNA                          27                         47                          6
snRNA                         6                          7                           5

Table 1: Frequency and Characteristics of Short Tandem Repeats in the Coding Sequences of Fungal Genomes

Table 2: Number, Abundance Ranking, and Proportion of Gene Products Containing the Indicated InterPro Protein Domain in yeast species and human

Genetic and Physical maps

• The genetic map of S. cerevisiae [Mortimer et al., 1992] has been of considerable value to yeast molecular biologists

• DNA probes from some known genes that had been mapped to particular chromosomes were used for chromosomal walking. Finally, however, physical maps of all chromosomes have been constructed without reference to the genetic maps.

• Apart from local expansion or contraction of the genetic map, and the fact that the overall frequency of meiotic recombination increases as chromosome size decreases, the orders of the genes positioned on the chromosomes by genetic and physical mapping broadly agree

• Thus, the comparison of the physical and genetic maps shows that most of the linkages have been established to give the correct gene order, but that in many cases the relative distances derived from genetic mapping are imprecise. The obvious imprecision of the genetic maps may be due to the fact that different yeast strains have been used in establishing the linkages

Genetic and Physical map of yeast chromosome II

Genetic redundancy in yeast

• There is a considerable degree of internal genetic redundancy in the yeast genome

• It is difficult to correlate physical redundancy completely to functional redundancy because even in yeast gene functions have been precisely defined to a limited extent

• Duplicated sequences are confined to nearly the entire coding region of these genes and do not extend into the intergenic regions

• Corresponding gene products share high similarity in terms of amino acid sequence or sometimes are even identical and, therefore, may be functionally redundant

• Due to sequence differences within the promoter regions, gene expression should vary according to the nature of the regulatory elements or other (regulatory) constraints; it may well be that one gene copy is highly expressed while another one is lowly expressed; turning on or off expression of a particular copy within a gene family may depend on the differentiated status of the cell (such as mating type, sporulation, etc.)

• Classical examples of redundant genes in subtelomeric regions are the yeast MEL, SUC, MGL and MAL genes. Subtelomeric regions of several yeast chromosomes share highly conserved segments, in some instances up to 30 kb, which carry duplicated genes the functions of which are largely unknown.

• Duplicated genes have also been found in clusters, e.g. on chromosome II, and a cluster of three hexose transporter genes on chromosome VIII

• Cluster Homology Regions (CHRs): Comparison of the complete chromosome sequences with each other revealed that there are large chromosome segments in which homologous genes are arranged in the same order, with the same relative transcriptional orientations, on two or more chromosomes. This is responsible for 30-40% of total redundancy

• Chromosomes II and IV share the longest CHR, comprising a pair of pericentric regions of 170 and 120 kb, respectively, that share 18 pairs of homologous genes

• Significance: Whatever the relative timescale and mechanisms of duplications, these events, followed by mutations affecting functional properties, offer a chance of improved environmental fitness. On the other hand, the high gene density in yeast indicates a strong tendency to maintain a compact genome; therefore, compensatory mechanisms must exist to remove non-functional or superfluous gene copies.

Figure: View of 53 clustered gene duplications between the 16 chromosomes of yeast

Table: Gene duplication in S. pombe and S. cerevisiae using NCBI BlastClust

Sequence Variation among Yeast Strains

• Polymorphism among different yeast strains is due to the following factors:

1) variable number of gene copies from repeated gene families

2) individual patterns caused by the presence or absence of particular Ty elements

3) plasticity of the chromosome ends

4) excisions or inversions of particular gene regions

5) chromosome breakage has been found to occur in yeast, resulting in karyotypes deviating from the 'normal' picture

Yeast Mitochondrial genome

• The mitochondrial genes and their mosaic intronic structure were first identified in S. cerevisiae, and the first mitochondrial gene ever sequenced was from S. cerevisiae; the complete mitochondrial genome sequence followed in 1998

• The multi-copy mitochondrial genome of S. cerevisiae is characterized by:
  o low gene density and high A+T content
  o highly heterogeneous base composition
  o a G+C content of the genes of approximately 30%
  o intergenic spacers composed of quasi-pure A+T stretches of several hundred base pairs, interrupted by more than 150 (G+C)-rich clusters ranging from 10 to 80 bp in length (this shows why scientists have sequenced the genes and neglected the intergenic regions)

• The genome contains the genes for:
  o cytochrome c oxidase subunits I, II and III (cox1, cox2 and cox3)
  o ATP synthase subunits 6, 8 and 9 (atp6, atp8 and atp9)
  o apocytochrome b (cytb)
  o a ribosomal protein (var1)
  o several intron-related open reading frames (ORFs)

• It also contains 7-8 replication origin-like (ori) elements and encodes 21S and 15S ribosomal RNAs, 24 tRNAs that can recognize all codons, and the 9S RNA component of RNase P

• cox1 gene and, to a lesser extent, the cytb, 21S RNA and 15S RNA genes constitute the largest blocks of higher G+C density

• atp6, atp9, cox2, cox3 and tRNA genes appear as small G+C-enriched islands in the middle of A+T and G+C cluster-rich regions

Red- Exons; Grey- Introns; Yellow- rRNA; Green- tRNA; Dark blue- Ori elements

Human-Yeast connection

• By comparing the catalogue of human sequences available in the databases with the ORFs on the completed yeast chromosomes at the amino acid level, it is estimated that:
  o >30% of the yeast genes have homologues among the human genes
  o As expected, most of the genes of known function categorized in this way represent basic functions in both organisms
  o More similarities become apparent when ESTs are included in the analysis
  o The most compelling protagonists among these homologues are yeast genes that bear substantial similarity to human 'disease genes'
  o The yeast genome is 200 times smaller than the human one
  o The yeast genome is only 9-10 times less complex in its capacity to code for proteins

• Applications:
  o Yeast may be a simple system to assay novel drugs or ligands, in view of the conservation of some basic mechanisms between yeast and human cells
  o This conservation makes some yeast genes important for the study of human genetics

S. cerevisiae genes related to human disease genes

S. cerevisiae genes related to nucleotide excision repair (NER) genes

S. pombe genes related to human disease genes

S. pombe genes related to human cancer genes

Figure: Comparison of homologous genes from different species

Figure: Orthologs in different species

Figure: Comparison of proteins in S. pombe (S.p.), S. cerevisiae (S.c.) and C. elegans (C.e.). (a) Pie chart comparing the homology of proteins of S. pombe with those of S. cerevisiae and C. elegans; (b) Pie chart comparing the homology of proteins of S. cerevisiae with those of S. pombe and C. elegans

The S. cerevisiae sequence was approximately 60 times larger than any sequence previously attempted, which explains why Goffeau felt compelled to invite the cooperation of a group of laboratories

At the time, the sequencing of model organisms such as S. cerevisiae appeared to be the logical step towards the eventual characterization of the human genome, a task that seemed beyond the scope of technology due to its tremendous size of 3,000 Mb

Thank you

By: Nazish Nehal, M. Tech (Biotechnology), University School of Biotechnology (USBT), Guru Gobind Singh Indraprastha University, New Delhi (INDIA)
