Laboratory of Computational Biology
http://www.eead.csic.es/compbio
Estación Experimental de Aula Dei
CSIC, Zaragoza, Spain
Álvaro Sebastian Yagüe
[email protected]
The relation between amino-acid substitutions in the interface of transcription factors and their recognized DNA motifs
February 2, 2010 - V National Conference BIFI 2011
Content index
• DNA recognition and binding
• 3D footprinting
• footprintDB database
• alignment of DNA motifs
• alignment of protein interfaces
DNA recognition and binding
DNA-binding proteins
DNA-binding proteins contain DNA-binding domains and therefore have a
specific or general affinity for single- or double-stranded DNA.
Jones CE, Olson OM: Sequence-specific DNA-protein interaction: the lac repressor. J Theor Biol 64:323-332, 1977.
[Figure: lac repressor, with Tyr 7, Tyr 12 and Tyr 17 labelled]
DNA-binding proteins
Lewis M, Chang G, Horton NC, Kercher MA, Pace HC, Schumacher MA, Brennan RG, Lu P: Crystal structure of the lactose operon repressor and its complexes with DNA and inducer. Science 271:1247-1254, 1996.
3D footprinting
Methods for studying protein-DNA interactions
Method | Advantages | Limitations
Nitrocellulose filter binding assay | Relatively simple handling | No localisation of binding site
Footprinting assays | Technical simplicity | Incomplete binding frequently results in unclear footprint
Methylation interference | Combined analysis of binding site and effect of epigenetic variations | Very complex workflow
Electrophoretic mobility shift assay (EMSA) | Technically simple assay that permits semi-quantitative studies | In complex analyses, no immediate information on binding sites or proteins involved
Chromatin immunoprecipitation (ChIP) | Applicable also for in vivo analyses | Relies very strongly on antibody specificity
DNA adenine methyltransferase identification (DamID) | In vivo detection | Requirement of exogenous fusion proteins
Surface plasmon resonance (SPR) | Real-time recording of association and dissociation | No high throughput
Systematic evolution of ligands by exponential enrichment (SELEX) | Enables in vitro selection of optimal binding partners | Only selection of best binding events
Yeast one-hybrid system | In vivo assay | Very complex system
DNA microarrays | High throughput | Analysis process for individual proteins
Protein microarrays | High throughput | Monomer-specificity
Proximity ligation | Highly specific and sensitive down to single-molecule detection | Complex sample preparation
Atomic force microscopy, X-ray crystallography, nuclear magnetic resonance | High-resolution structural information | No use for definition of interaction pairs or identification of genomic locations
Helwa R, Hoheisel JD: Analysis of DNA-protein interactions: from nitrocellulose filter binding assays to microarray studies. Anal Bioanal Chem 398:2551-2561.
3D Footprinting
3D footprinting is a computational technique developed in our lab that annotates
DNA-binding interfaces by analyzing published 3D structures from the PDB.
http://floresta.eead.csic.es/3dfootprint/
3D-footprint calculated interface for 1D5Y:
Interface residues for 1d5y_A TF: 32, 34, 35, 37, 38
footprintDB
footprintDB
We have designed, implemented and curated a database with more than 3000 unique
DNA-binding proteins (mostly transcription factors, TFs) and 4000 Position Weight
Matrices (PWMs) extracted from the literature and other repositories.
The DNA-binding interface residues of the TF sequences in footprintDB are annotated
by aligning the sequences with 3D-footprint templates.
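The annotation step described above can be pictured as transferring residue positions from a template to a query through a pairwise alignment. The sketch below is only an illustration under stated assumptions: the function name `transfer_interface`, the two aligned sequences and the position list are hypothetical, and the real footprintDB pipeline may work differently.

```python
def transfer_interface(template_aln, query_aln, template_positions):
    """Map 1-based template residue numbers onto the aligned query.

    Both inputs are rows of one pairwise alignment ('-' marks a gap).
    Returns (template_position, query_position_or_None, query_residue) tuples.
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(template_positions)
    t_pos = 0  # residues seen so far in the template
    q_pos = 0  # residues seen so far in the query
    mapped = []
    for t_res, q_res in zip(template_aln, query_aln):
        if t_res != "-":
            t_pos += 1
        if q_res != "-":
            q_pos += 1
        if t_res != "-" and t_pos in wanted:
            # A gap in the query means there is no equivalent residue.
            mapped.append((t_pos, q_pos if q_res != "-" else None, q_res))
    return mapped


# Hypothetical aligned sequences and interface positions, for illustration only.
template = "MKRT-QNAK"
query = "MRRIAQN-K"
print(transfer_interface(template, query, [2, 3, 5, 7]))
```

Template positions that fall opposite a gap in the query are reported with `None`, so they can be dropped or flagged rather than silently shifted.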
footprintDB
Database | Description | TFs | PWMs
TRANSFAC | Data on transcription factors, their experimentally proven binding sites, their positional weight matrices and regulated genes. | 367 | 608
JASPAR CORE | Curated, non-redundant set of profiles, derived from published collections of experimentally defined transcription factor binding sites for eukaryotes. | 443 | 465
RegulonDB | Curated data of the transcriptional regulatory network of Escherichia coli K12. | 70 | 70
3D-footprint | Database of DNA-binding protein structures that is updated weekly with Protein Data Bank complexes. | 1006 | 1225
AthaMap | Genome-wide map of potential transcription factor and small RNA binding sites in Arabidopsis thaliana. | 42 | 48
Drosophila CTFM | Motif models reported in 51 primary references in the form of PWMs for 56 Drosophila melanogaster transcription factors. | 59 | 62
ZIFDB | Repository of information on C2H2 zinc fingers and engineered zinc-finger arrays. | 858 | 873
ZifBASE | An extensive collection of various natural and engineered zinc finger proteins. | 139 | 144
AGRIS | Resource of Arabidopsis promoter sequences, transcription factors and their target genes. | 53 | 53
UniPROBE | Repository of experimental data from universal protein binding microarray (PBM) experiments. | 296 | 437
PLACE | Database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. | 28 | 480
footprintDB
footprintDB predicts:
1. Transcription factors which bind a specific DNA site or motif
2. DNA motifs likely to be recognised by a specific DNA-binding protein
http://floresta.eead.csic.es/footprintdb/
alignment of protein interfaces
The rationale behind footprintDB is the observation that proteins which recognize a
similar DNA motif most often have a similar set of residues at the interface.
DNA motif ~ TF interface
yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-yAATTArg ~ RRRIQNAK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK
Alignment of protein interfaces
Alignment of protein interfaces
Noyes et al. have recently shown that homeodomain binding specificities depend on
the interface residues involved in DNA motif recognition.
Noyes, M.B., Christensen, R.G., Wakabayashi, A., Stormo, G.D., Brodsky, M.H., Wolfe, S.A.: Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133 (2008) 1277-1289
Interface alignment with footprintDB annotated interfaces
yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK
Alignment of protein interfaces
Unknown homeodomain protein
Homeodomain interface residues
RRRIQNAK
Predicted DNA binding motif
TAATTArc
The ROC curve shows that interface alignments improve DNA motif predictions in comparison with BLAST scores.
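The prediction for the unknown homeodomain can be sketched as a nearest-interface lookup. This is a minimal illustration, not the footprintDB scoring scheme: plain residue identity is an assumption standing in for whatever interface score is actually used, and the names `KNOWN`, `interface_identity` and `predict_motif` are hypothetical. The four motif/interface pairs are the ones listed on the slide.

```python
# Motif / interface pairs taken from the slide.
KNOWN = [
    ("yCAATTAws", "RKRTQNTK"),
    ("yaATTAam", "RRRIQNTK"),
    ("TAATTArc", "RRRIQNAK"),
    ("tmATTAAs", "KRRIQNMK"),
]


def interface_identity(a, b):
    """Fraction of identical residues between two pre-aligned interfaces."""
    if len(a) != len(b):
        raise ValueError("interfaces must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_motif(query_interface, known=KNOWN):
    """Return the motif whose annotated interface is most similar to the query."""
    motif, interface = max(
        known, key=lambda pair: interface_identity(query_interface, pair[1])
    )
    return motif, interface_identity(query_interface, interface)


print(predict_motif("RRRIQNAK"))
```

For the interface `RRRIQNAK` the lookup returns the motif `TAATTArc` with identity 1.0, matching the prediction shown on the slide. A substitution matrix could replace the identity score without changing the structure of the lookup.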
Alignment of protein interfaces
Scoring of aligned protein interfaces will predict which DNA motif an unknown
DNA-binding protein binds more accurately than other scoring methods such as
local alignment.
[Figure panels: Homeodomains; bZIPs]
alignment of DNA motifs
DNA motif alignment issues
• Three alignment combinations (forward and reverse-complement strands): ATC / GTT ; ATC / AAC ; GAT / GTT
longer calculation time and higher false positive rate than a single pairwise alignment
• Different motif sizes: TgAGt / ackrTGACGTCAycra
not a big issue if the score is divided by the number of aligned nucleotides
• Small motifs are prone to false high-scoring alignments, due to the small
nucleotide alphabet size: AGt / CGT
high similarity thresholds are required, particularly with individual Zinc Fingers
that usually recognize 3 nts
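The three alignment combinations listed above follow from reverse-complementing one motif at a time. The sketch below reproduces the ATC / GTT example. The function names are hypothetical, and the complement table covers the standard IUPAC nucleotide codes because the motifs in this deck use degenerate letters.

```python
# Complement table for the standard IUPAC nucleotide codes (both cases).
COMPLEMENT = str.maketrans(
    "ACGTRYKMSWBDHVNacgtrykmswbdhvn",
    "TGCAYRMKSWVHDBNtgcayrmkswvhdbn",
)


def reverse_complement(seq):
    """Reverse complement of a DNA consensus, IUPAC-aware."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_combinations(a, b):
    """The three strand pairings that have to be aligned for motifs a and b."""
    return [
        (a, b),
        (a, reverse_complement(b)),
        (reverse_complement(a), b),
    ]


print(strand_combinations("ATC", "GTT"))
```

Aligning both reverse complements against each other is not listed, since it is equivalent to the forward/forward comparison.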
DNA motif alignment issues
• Complex motifs (multimeric proteins): ackrTGACGTCAycra /
rTGACwmAGCA
they are not easy to align and heteromultimers might bind different sites
• A single motif for TFs with multiple DNA-binding domains
it might not be possible to know which domain binds to each submotif
• TFs with different annotated motifs
as a result of different oligomeric conformations or experimental approaches
• Motifs with very low information content: akaTTrchhaAhcw
they might be genuine or result from low-resolution experiments; a source of false-positive (FP) hits
Alignment of DNA motifs
Some families of transcription factors and their particularities:
Family | Motifs | Multimeric | Multidomain
Homeodomain | TAATkr, TGAyA | Sometimes | Unusual
Basic helix-loop-helix (bHLH) | CACGTG, CAsshG | Always (homodimers, heterodimers) | Never
Basic leucine zipper (bZIP) | CACGTG, -ACGT-, TGAGTC | Always (homodimers, heterodimers) | Never
MYB | GkTwGkTr | Usual (multimers) | Usual
High mobility group (HMG) | mTT(T)GwT, TTATC, ATTCA | Sometimes | Unusual
GAGA | GAGA | Never | Never
Fork head | TrTTTr | Unusual | Never
Fungal Zn(2)-Cys(6) binuclear cluster | CGG | Usual (homodimers) | Never
Ets | GGAw | Usual (homodimers, heterodimers, multimers) | Never
Rel homology domain (RHD) | GGnnwTyCC' | Always (homodimers, heterodimers) | Never
Interferon regulatory factor | AAnnGAAA | Always (homodimers, heterodimers, multimers) | Never
Alignment of DNA motifs
Motifs are aligned with an ungapped Smith-Waterman algorithm, and motif
similarity is calculated as the sum of the Pearson Correlation
Coefficients of the aligned motif positions.
Simple example with two consensus sequences:
G C C
G A C
Similarity: 1 + 0 + 1 = 2; 2 / 3 = 0.67
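An ungapped local alignment of two count matrices can be sketched as follows. This is an illustration under assumptions, not the lab's implementation: every offset between the two motifs is tried, and on each offset the best-scoring contiguous run of columns is kept, with the per-column Pearson coefficient as the score. The function names are hypothetical, and the two matrices are those of the GCC / GAC example in this deck (columns in A, C, G, T order).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two count columns; 0.0 if either is constant (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def ungapped_local_alignment(m1, m2):
    """Best ungapped local alignment: (score, start in m1, start in m2, length)."""
    best = (0.0, 0, 0, 0)
    for offset in range(-(len(m2) - 1), len(m1)):
        i, j = max(offset, 0), max(-offset, 0)
        run_score, run_start, k = 0.0, 0, 0
        while i + k < len(m1) and j + k < len(m2):
            if run_score <= 0:
                # A non-positive running score cannot help: restart the run here.
                run_score, run_start = 0.0, k
            run_score += pearson(m1[i + k], m2[j + k])
            if run_score > best[0]:
                best = (run_score, i + run_start, j + run_start, k - run_start + 1)
            k += 1
    return best


GCC = [(0, 0, 6, 0), (1, 4, 0, 1), (0, 4, 0, 2)]
GAC = [(0, 0, 3, 1), (3, 1, 0, 0), (0, 4, 0, 0)]
score, start1, start2, length = ungapped_local_alignment(GCC, GAC)
print(round(score, 2), start1, start2, length, round(score / length, 2))
```

Dividing the score by the aligned length, as in the last printed value, is the normalisation suggested for motifs of different sizes.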
Alignment of DNA motifs
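Motif 1 (GCC):
   A C G T
01 0 0 6 0 G
02 1 4 0 1 C
03 0 4 0 2 C
Motif 2 (GAC):
   A C G T
01 0 0 3 1 G
02 3 1 0 0 A
03 0 4 0 0 C
Pearson Correlation Coefficient, position 1:
$$r_1 = \frac{(0-1.5)(0-1) + (0-1.5)(0-1) + (6-1.5)(3-1) + (0-1.5)(1-1)}{\sqrt{\left[(0-1.5)^2 + (0-1.5)^2 + (6-1.5)^2 + (0-1.5)^2\right]\left[(0-1)^2 + (0-1)^2 + (3-1)^2 + (1-1)^2\right]}} = 0.94$$
Simil = r_1 + r_2 + r_3 = 0.94 + 0.14 + 0.87 = 1.95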
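The worked example can be checked with a few lines of code. The sketch below computes the Pearson coefficient of each pair of aligned columns (counts in A, C, G, T order) and sums them. The function names are hypothetical, and returning 0.0 for a constant column is an assumption, since the slide does not cover that case.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two count columns; 0.0 if either is constant (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def position_similarities(m1, m2):
    """Pearson coefficient for every pair of aligned motif positions."""
    return [pearson(c1, c2) for c1, c2 in zip(m1, m2)]


# Count matrices from the slide, one (A, C, G, T) tuple per position.
MOTIF_1 = [(0, 0, 6, 0), (1, 4, 0, 1), (0, 4, 0, 2)]  # consensus GCC
MOTIF_2 = [(0, 0, 3, 1), (3, 1, 0, 0), (0, 4, 0, 0)]  # consensus GAC

per_position = position_similarities(MOTIF_1, MOTIF_2)
print([round(r, 2) for r in per_position])  # [0.94, 0.14, 0.87]
print(round(sum(per_position), 2))          # 1.95
```

The three coefficients and their sum agree with the values on the slide.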
4900 TRANSFAC individual DNA sites were aligned with their
corresponding DNA motifs (PWMs), yielding a mean similarity of 0.70
P0 A C G T
01 2 0 4 0 G
02 1 0 4 1 G
03 0 6 0 0 C
04 2 0 0 4 T
05 0 0 0 6 T
06 0 6 0 0 C
07 0 6 0 0 C
08 3 0 0 3 W
09 1 4 1 0 C
AGCTTCCTC
GGCATCCAG
GTCTTCCTA
AGCTTCCAC
GGCATCCAC
GACTTCCTC
DNA motifs have a large variability
Half of the DNA sites share <0.70 similarity with their motif
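The relation between individual sites and their motif can be made concrete by rebuilding the count matrix from the sites. In the sketch below the six 9-mers are obtained by splitting the run of site letters on the slide into words of nine nucleotides, which is an assumption about how the slide was laid out; the resulting counts do reproduce the matrix shown. The name `count_matrix` is hypothetical.

```python
# Six sites read off the slide by splitting the letter run into 9-mers (assumption).
SITES = [
    "AGCTTCCTC",
    "GGCATCCAG",
    "GTCTTCCTA",
    "AGCTTCCAC",
    "GGCATCCAC",
    "GACTTCCTC",
]


def count_matrix(sites):
    """Count A, C, G and T at every position of equally long sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [
        tuple(sum(site[i] == base for site in sites) for base in "ACGT")
        for i in range(length)
    ]


for position, counts in enumerate(count_matrix(SITES), start=1):
    print(f"{position:02d}", *counts)
```

Each site differs from the pooled counts at one or more positions, which is why a single site rarely reaches a similarity of 1 against its own motif.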
Alignment of DNA motifs
4900 TRANSFAC individual DNA sites were aligned against random
footprintDB database motifs, yielding a mean similarity of 0.47.
[Figure: site AGCTTCCTC aligned against a random motif ("?") with positions 01 to 09]
Individual DNA sites and motifs can yield moderate similarities by chance
Alignment of DNA motifs
Which motif similarity threshold should
we use to identify DNA sites and motifs?
[Figure: site AGCTTCCTC next to its TRANSFAC motif, the same nine-position count matrix shown above]
0.47 < ? < 0.70
Alignment of DNA motifs
[Figure: ROC curve, TPR against FPR, both axes from 0 to 1]
A ROC curve interpolated from the TPR and FPR of the TRANSFAC alignments shows
that motif similarity thresholds between 0.60 and 0.55 give a sensitivity (TPR)
of 0.71-0.80 and a specificity (1-FPR) of 0.88-0.74.
similarity: 0.55 - 0.60
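How a similarity threshold translates into TPR and FPR can be sketched as below. The scores are hypothetical toy values, not the TRANSFAC data: `true_scores` stands for sites aligned with their own motif and `random_scores` for sites aligned with random motifs. The function name `tpr_fpr` is also an assumption.

```python
def tpr_fpr(true_scores, random_scores, threshold):
    """TPR and FPR when every score >= threshold is called a hit."""
    tp = sum(score >= threshold for score in true_scores)
    fp = sum(score >= threshold for score in random_scores)
    return tp / len(true_scores), fp / len(random_scores)


# Hypothetical toy similarities, for illustration only.
true_scores = [0.92, 0.81, 0.74, 0.66, 0.58, 0.52, 0.45, 0.71, 0.63, 0.88]
random_scores = [0.31, 0.44, 0.52, 0.61, 0.38, 0.47, 0.57, 0.29, 0.49, 0.41]

for threshold in (0.55, 0.60):
    tpr, fpr = tpr_fpr(true_scores, random_scores, threshold)
    print(f"threshold={threshold:.2f} sensitivity={tpr:.2f} specificity={1 - fpr:.2f}")
```

Sweeping the threshold from 0 to 1 and plotting TPR against FPR gives the ROC curve; raising the threshold trades sensitivity for specificity, as in the 0.55 to 0.60 range quoted above.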
Alignment of DNA motifs
Thanks for your attention
Laboratory of Computational Biology
Estación Experimental de Aula Dei / CSIC
Av. Montañana 1.005
50059 Zaragoza (Spain)
Tel.: +34 976716089
Web: http://www.eead.csic.es/compbio/
Questions?