
Comparative Analysis of Promoter Sequences:

The discovery of the Pribnow-box and some follow-up discoveries

Philipp Bucher

In Silico Analysis of Proteins

Celebrating the 20th Anniversary of Swiss-Prot

Fortaleza – Brazil, Aug 3 2006


Why a talk on promoters at a protein meeting?

Aren’t promoters DNA sequences?

No. Promoters are not DNA sequences.

Any general representation of promoters, or algorithm to predict promoters, does not relate to intrinsic properties of DNA.

In fact, a profile or hidden Markov model representing promoter sequences constitutes a description of the DNA-binding surfaces of a protein in terms of base pair preferences.

Not surprisingly, therefore, the first consensus sequence for an E. coli promoter element was derived from seven sequences originating from six different species, including a eukaryotic virus.


Early comparative analysis of E. coli promoter sequences

FIG. 4. Comparison of promoter sequences (see text). b, Homologous sequence probably engaged by RNA polymerase; i, mRNA initiation point (underlined). Hyphens have been omitted. SV40, simian virus 40; w.t., wild type.

Among the promoter sequences, there is a homologous, 7-base sequence lying to the left of the initiation points. I feel that the DNA sequence

5' T-A-T-Pu-A-T-G 3'
3' A-T-A-Py-T-A-C 5'

is implicated in the formation of a tight binary complex with RNA polymerase.

Text and Figures from: Pribnow (1975) Proc. Nat. Acad. Sci. USA 72, 784-788.


E. coli promoters: Chapter 2

A second sequence motif, located about 35 bp upstream of the initiation site (the -35 region), was discovered based on a larger promoter sequence collection.


E. coli promoters: Chapter 3

The figure below illustrates the concept of functional homology between two promoter sequences. In particular, these footprint results confirm that the -35 and -10 elements are correctly assigned even though the spacing between the two elements is different (Siebenlist et al. 1980, Cell 20, 269-281).


The program TargSearch implements an early sequence profile method using position-specific residue weights and scores for alternative spacer lengths.
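The idea of position-specific weights combined with scores for alternative spacer lengths can be sketched as follows. This is only a toy illustration, not the TargSearch program: the consensus-derived weights, the spacer scores, and the function names are all assumptions made for the example.

```python
BASES = "ACGT"

def consensus_weights(consensus, match=1.0, mismatch=-1.0):
    """Build a toy position-specific weight table from a consensus string."""
    return [{b: (match if b == c else mismatch) for b in BASES} for c in consensus]

# Hypothetical weights for the two promoter elements (illustration only).
MINUS35 = consensus_weights("TTGACA")
MINUS10 = consensus_weights("TATAAT")
# Hypothetical score for each allowed spacer length between the two elements.
SPACER_SCORES = {15: -2.0, 16: -1.0, 17: 0.0, 18: -1.0, 19: -2.0}

def block_score(matrix, seq, start):
    """Sum the weights of the bases found under the matrix; unknown bases score 0."""
    return sum(col.get(seq[start + i], 0.0) for i, col in enumerate(matrix))

def best_promoter_hit(seq):
    """Return (score, start of -35 block, spacer length) of the best hit, or None."""
    seq = seq.upper()
    best = None
    for start in range(len(seq) - len(MINUS35) + 1):
        for spacer, spacer_score in SPACER_SCORES.items():
            start10 = start + len(MINUS35) + spacer
            if start10 + len(MINUS10) > len(seq):
                continue  # the -10 block would run past the end of the sequence
            score = (block_score(MINUS35, seq, start)
                     + spacer_score
                     + block_score(MINUS10, seq, start10))
            if best is None or score > best[0]:
                best = (score, start, spacer)
    return best
```

Calling `best_promoter_hit` on a sequence returns the highest-scoring combination of -35 block, spacer, and -10 block, so two promoters with different spacing can both be aligned correctly.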


Prediction of the rate constant for open complex formation with TargSearch scores


Early work on E. coli promoters: Important contributions to computational biology

• Representation of functional molecular sequence motifs by IUPAC consensus sequences and weight matrices.

• A definition of functional homology and an experimental criterion for correct alignment of DNA sequence motifs.

• Prediction algorithms using profile or HMM-like target descriptions.

• The idea that quantitative promoter prediction scores can, and perhaps should, be viewed as predictors of a protein property: the selectivity of RNA polymerase for a particular DNA ligand sequence.
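As a small illustration of the first point, Pribnow's sequence T-A-T-Pu-A-T-G can be written with IUPAC ambiguity codes as TATRATG (R = purine) and matched against a sequence. The sketch below translates a consensus into a regular expression; the function names are hypothetical and chosen for the example.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regular expression string."""
    return "".join(IUPAC[c] for c in consensus.upper())

def find_matches(consensus, seq):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead keeps the regex engine from skipping overlapping matches.
    pattern = re.compile("(?=" + consensus_to_pattern(consensus) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

A consensus sequence gives a yes/no answer for each position, whereas a weight matrix gives a graded score; that difference is what makes quantitative prediction possible.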


Eukaryotic promoters: Differences with regard to E. coli promoters and other biological facts

• Eukaryotic polymerases do not have intrinsic affinity for specific promoter sequences.

• Eukaryotic promoters are recognized by a variety of transcription factors, each recognizing a specific target motif.

• The binding sites of proteins that direct RNA polymerase to the promoter may be located at larger and more variable distances from the initiation sites. Moreover, these sites may occur in either orientation, or even downstream of the start site.

• Tissue and developmental-stage specificity.

• Epigenetic silencing mediated by chromatin condensation or DNA methylation.


EPD Essentials

Promoter definition: An experimentally mapped transcription initiation site.

Important assumption: A capped 5’ end of a eukaryotic mRNA is generated by transcriptional initiation, not endonucleolytic cleavage.

Primary data: (i) RNA sequencing, nuclease protection, and primer extension data published in journal articles, (ii) 5’ ESTs from cDNA clones obtained with the oligo-capping method (only recently).

Purpose: (i) Comparative analysis of promoter elements, (ii) training and test set for promoter prediction algorithms, (iii) resource for experimental researchers.


Signal Search Analysis Essentials

History: Signal Search Analysis is an ancient method that I developed in the early eighties in Max Birnstiel’s lab in Zurich (first published in 1984).

Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.

Recent event: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites.

Note the difference: SSA programs serve to characterize motifs that occur at constrained distances from sites, not motifs that are over-represented within sequence sets.

There are hundreds of programs that address the latter problem, but only very few that serve the same purpose as the SSA programs!


Locally Over-represented Sequence Motifs


TATA-box Signal Occurrence Profile for Human Promoters


Definition of a Locally Over-represented Sequence Motif

The definition of a locally over-represented sequence motif has three components:

1. A weight matrix or consensus sequence defining the motif

2. A cut-off value

3. A preferred region of occurrence with respect to a functional site, e.g. a transcription initiation site

The weight matrix or consensus sequence allows one to compute a match score for any subsequence of a promoter that has the same length as the matrix.

The cut-off value determines which subsequences constitute a motif match.

The preferred region is the third criterion necessary to decide whether a given promoter contains a given locally over-represented sequence motif or not.

The difference in occurrence frequency inside and outside of the preferred region can be used as an objective function to optimize the three components of a locally over-represented sequence motif listed above.
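The three components and the objective function described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SSA software: the sequences are assumed to be aligned so that index `tss_index` is the initiation site, the toy weight matrix is built from a consensus, and the objective is taken to be the per-position match frequency inside the preferred region minus the frequency outside it.

```python
BASES = "ACGT"

def consensus_weights(consensus, match=1.0, mismatch=-1.0):
    """Toy weight matrix: one {base: weight} table per motif position."""
    return [{b: (match if b == c else mismatch) for b in BASES} for c in consensus]

def match_score(matrix, seq, start):
    """Score the subsequence of matrix length that begins at `start`."""
    return sum(col.get(seq[start + i], 0.0) for i, col in enumerate(matrix))

def local_overrepresentation(seqs, tss_index, matrix, cutoff, region):
    """Match frequency inside `region` minus match frequency outside it.

    `region` is an inclusive (first, last) pair of match start positions
    relative to the initiation site (negative values are upstream).
    """
    first, last = region
    width = len(matrix)
    inside_hits = inside_sites = outside_hits = outside_sites = 0
    for seq in seqs:
        seq = seq.upper()
        for start in range(len(seq) - width + 1):
            hit = match_score(matrix, seq, start) >= cutoff
            if first <= start - tss_index <= last:
                inside_sites += 1
                inside_hits += hit
            else:
                outside_sites += 1
                outside_hits += hit
    inside = inside_hits / inside_sites if inside_sites else 0.0
    outside = outside_hits / outside_sites if outside_sites else 0.0
    return inside - outside
```

An optimization procedure would vary the matrix, the cut-off, and the region borders and keep the combination that maximizes the returned value.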


An algorithm to optimize a locally over-represented sequence motif


A weight matrix definition for the TATA-box motif

See also: Bucher (1990) J. Mol. Biol. 212, 563-578.


Promoter prediction

Benchmark results from Fickett & Hatzigeorgiou 1997, Genome Res. 7, 861-878

Note: The false/random discovery rates (about 1 in 1 kb) are about 2 orders of magnitude too high if one assumes one promoter per 100 kb for the human genome (perhaps an underestimation).

At this unacceptably high false discovery rate the sensitivity barely exceeds 50% for most of the programs.
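The consequence of these numbers can be made explicit with a little arithmetic. The sketch below simply plugs in the figures quoted above (one false hit per kb, one promoter per 100 kb, 50% sensitivity); the function name and the choice of a 100 kb region are assumptions made for illustration.

```python
def expected_hits(region_kb, promoters_per_kb, false_hits_per_kb, sensitivity):
    """Return (true hits, false hits, fraction of predictions that are true)."""
    true_hits = region_kb * promoters_per_kb * sensitivity
    false_hits = region_kb * false_hits_per_kb
    return true_hits, false_hits, true_hits / (true_hits + false_hits)

# 100 kb of genome, one promoter per 100 kb, one false hit per kb, 50% sensitivity.
true_hits, false_hits, fraction_true = expected_hits(100, 1 / 100, 1.0, 0.5)
print(true_hits, false_hits, round(fraction_true, 4))
```

Under these assumptions a 100 kb region yields about 100 false predictions for every half of a true one, so fewer than 1 in 100 predictions would be correct.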


Why is eukaryotic promoter prediction so hard?

Technical reasons:

– Too few promoters mapped experimentally

– Low quality of experimental data resulting in inexact or wrong transcription initiation site mapping

Biological reasons:

– Transcription initiation appears to be often a fuzzy process. The initiation sites pertaining to one promoter may be scattered over 50 bp or more.

– There may be many useless promoters giving rise to rapidly degraded non-functional transcripts.

– There may be too many promoter classes recognized by different combinations of transcription factors.

– Tissue and developmental stage specificity. Most promoters are in fact silent in most tissues. Promoter prediction is partly a tissue-specific problem.


Progress may come from new technologies

Introduction of high-throughput technologies for cDNA (mRNA) 5’ end sequencing. Recent papers:

Oligo-capping technique: Suzuki et al. (2001) Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 11:677-684.

CAGE: Carninci et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. doi:10.1038/ng1789.

Close to one million 5’ tags of human transcripts have been analyzed with these techniques.

Processing of cDNA 5’ tags has tripled the number of promoter entries in EPD in less than two years.

We have coined the term “in silico primer extension” to designate the process of TSS mapping with cDNA 5’ tag data.


In Silico Primer Extension: Essentials

Purpose:
1. to map transcription start sites to a genome,
2. to study the regulation of alternative promoter usage.

Experimental procedures:
1. full-length cDNA synthesis (e.g. oligo-capping method),
2. generation of 5’ tags (EST sequencing, 5’ SAGE, CAGE).

Computational procedures:
1. mapping of 5’ tags to the genome,
2. identification of clusters in mRNA 5’ end profiles.
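The second computational step can be illustrated with a very simple stand-in. The sketch below assumes the 5’ tags have already been mapped to genomic coordinates and groups their start positions with a plain gap-based rule; it is not the clustering algorithm of any particular program, and the function names and the `max_gap` and `min_tags` parameters are hypothetical.

```python
from collections import Counter

def five_prime_profile(tag_starts):
    """Count how many mapped tags start at each genomic position."""
    return Counter(tag_starts)

def cluster_profile(profile, max_gap=10, min_tags=3):
    """Split occupied positions into clusters wherever the gap exceeds `max_gap`.

    Returns a list of (first, last, tag_count, most_used_position) tuples,
    keeping only clusters supported by at least `min_tags` tags.
    """
    groups, current = [], []
    for pos in sorted(profile):
        if current and pos - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)

    clusters = []
    for positions in groups:
        count = sum(profile[p] for p in positions)
        if count >= min_tags:
            # The most frequently used start position serves as the reference TSS.
            top = max(positions, key=lambda p: profile[p])
            clusters.append((positions[0], positions[-1], count, top))
    return clusters
```

Each surviving cluster corresponds to one candidate promoter, so a gene may yield zero, one, or several of them.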


Promoter region defined by transcription start sites (TSS)

[Figure: schematic showing genomic DNA with aligned cDNAs, the TSS, and the promoter, compared with a conventional primer extension experiment using a gene-specific primer.]


In Silico (Digital) versus in Vitro (Analog) Primer Extension

ccgagtcccctcacccctttccttcccacAGGTCCCTGGCCAAAGATTTATTTCTCTTGACAACCA


Our in Silico Primer Extension Pipeline

[Figure: pipeline flowchart. Inputs: UniGene entry, RefSeq entry, genome sequence (2 kb), GenBank/EMBL 5’ EST entries of selected libraries, trace files, cDNA 5’ tags (50 nt). Steps: BLAST, profile-based multiple sequence alignment method, mRNA 5’ end profile, 1-D clustering by MADAP. Output: zero to several promoter entries.]


Definition of Promoter Sites and Classes from cDNA 5’ end Profiles with the Program MADAP

[Figure: number of 5’ ends of NEDO transcripts plotted against genomic position, for the regions R84046905-84046987 and R84047148-84047231; scale bars 10 bp and 45 bp.]


In silico PE versus conventional techniques

Characterization of three optional promoters in the 5' region of the human aldolase A gene. Maire P. et al. (1987) J. Mol. Biol. 197, 425-438.

[Figure: number of 5’ ends of DBTSS transcripts plotted against genomic position; scale bar 100 bp.]


Is in silico primer extension really accurate and reliable enough for promoter analysis?


Comparative Evaluation of Human Promoter Sets Compiled by Different Methods

Questions addressed:

1. What is the overlap and agreement in transcription start site definitions between the four data sets?

2. Are any of the data sets contaminated by a substantial number of non-promoter sequences?

3. Which method defines the transcription start site most accurately?

4. Are any of the four promoter compilations biased with regard to promoter subclasses?


Comparative Evaluation of Human Promoter Sets Compiled by Different Methods

Goal of the project: to compare four different promoter (transcription start site) compilations:

1. EPD: Manually compiled promoter collection based primarily on nuclease protection and primer extension experiments published in the biological journal literature.

2. PRESTA: Automatically compiled promoter collection relying on author-submitted sequence feature annotations in EMBL sequence entries and confirmatory evidence from public EST sequences.

3. DBTSS (NEDO): Transcription start sites inferred from 5’ end sequences of full-length enriched cDNA libraries obtained with the oligo-capping method.

4. MGC: Transcription start sites inferred from 5’ end sequences of full-length enriched cDNA libraries from the Mammalian Gene Collection (MGC) program.


Promoter Elements and Sequence Properties used for the Evaluation of Different Promoter Sets

Locally over-represented sequence motifs:

• TATA-box: site selector element, occurs around position –27, estimated frequency in human promoters: 64%.

• Initiator: site selector element, presumably occurs exactly at initiation site, estimated frequency in human promoters: 50%.

• CCAAT-box: upstream promoter element, occurs in a large upstream region with peak frequency at –80, estimated frequency in human promoters: 23%.

• GC-box: upstream promoter element, occurs in a large upstream region with peak frequency at –50, estimated frequency in human promoters: 52%.

Other known sequence features:

• CpG islands: regions of 200-1000 bp with a ratio of CpGobs / CpGexp > 0.6 and a C+G content > 50%, occur around transcription initiation sites, estimated frequency based on promoters in EPD: 39%.
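The CpG island criterion lends itself to a short worked example. The sketch below assumes the commonly used definition of the expected CpG count, (number of C × number of G) / sequence length, which the slide does not spell out; the function names are hypothetical, and only the lower length bound of 200 bp is checked.

```python
def cpg_stats(seq):
    """Return (CpG observed/expected ratio, C+G fraction) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_observed = seq.count("CG")
    # Assumed definition of the expected CpG count: (#C * #G) / length.
    cpg_expected = c_count * g_count / n if n else 0.0
    ratio = cpg_observed / cpg_expected if cpg_expected else 0.0
    gc_fraction = (c_count + g_count) / n if n else 0.0
    return ratio, gc_fraction

def looks_like_cpg_island(seq, min_len=200, min_ratio=0.6, min_gc=0.5):
    """Apply the thresholds quoted above to a single candidate region."""
    ratio, gc_fraction = cpg_stats(seq)
    return len(seq) >= min_len and ratio > min_ratio and gc_fraction > min_gc
```

In practice the test would be applied to windows sliding along the region around each transcription initiation site rather than to one fixed sequence.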


TATA-box Profiles for Four Different Promoter Sets


Initiator Profiles for Four Different Promoter Sets


CCAAT-box Profiles for Four Different Promoter Sets


GC-box Profiles for Four Different Promoter Sets


In silico analysis of larger promoter sequence sets.

The previous results have shown that in silico primer extension is accurate, perhaps even more accurate than conventional methods.

However:

Was data set size really the bottleneck in promoter analysis?

Have we already gained new insights into promoter structure from analyzing larger promoter sets defined by in silico primer extension?

A recent study of about 2000 Drosophila promoters may give a preliminary answer to this question.


The best conserved and most abundant Drosophila core promoter elements as found by Uwe Ohler and coworkers


In particular, the most significant and undoubtedly most frequent, most conserved, and thus probably most important Drosophila promoter element corresponds to the following motif:

Thirty years of very intensive and expensive wet lab molecular biology research have not uncovered that motif!


Back to Proteins:

What is the protein that binds to the most important promoter element of Drosophila?

Guesses from the audience may be sent to:

[email protected]