![Page 1: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/1.jpg)
MedGen 505
Gene Regulation Bioinformatics
Wyeth W. Wasserman
www.cisreg.ca
![Page 2: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/2.jpg)
CMMT
Overview
• TFBS Prediction with Motif Models
• Improving Specificity of Predictions
• Analysis of Sets of Co-Expressed and Co-Regulated Genes
![Page 3: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/3.jpg)
Transcription Factor Binding Sites (over-simplified for pedagogical purposes)
[Figure labels: URE, TATA, URF, Pol-II]
![Page 4: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/4.jpg)
Teaching a computer to find TFBS…
![Page 5: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/5.jpg)
Laboratory Discovery of TFBS
[Figure: series of LUCIFERASE constructs and their measured ACTIVITY]
![Page 6: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/6.jpg)
Representing Binding Sites for a TF
• A set of sites represented as a consensus
– VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites
A 14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C  3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G  4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T  0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
• A single site
– AAGTTAATGA
Set of binding sites
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
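A minimal sketch of how a count matrix can be tallied from aligned sites of equal length. The function name `build_pfm` is a hypothetical choice, and only the first three sites from the set on this slide are used to keep the output small.

```python
from collections import Counter

def build_pfm(sites):
    """Count A/C/G/T occurrences at each position of equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for i in range(length):
        column = Counter(site[i] for site in sites)
        for base in "ACGT":
            pfm[base][i] = column[base]
    return pfm

# First three sites from the slide (illustrative subset).
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the result sums to the number of sites, which is the "depth" referred to later when counts are converted to weights.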
![Page 7: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/7.jpg)
PFMs to PWMs
Add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic
f matrix
A  5  0  1  0  0
C  0  2  2  4  0
G  0  3  1  0  4
T  0  0  1  1  1
w matrix
A  1.6 -1.7 -0.2 -1.7 -1.7
C -1.7  0.5  0.5  1.3 -1.7
G -1.7  1.0 -0.2 -1.7  1.3
T -1.7 -1.7 -0.2 -0.2 -0.2
f matrix to w matrix:
$$\log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$
TGCTG = 0.9
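A hedged sketch of the f matrix to w matrix conversion. The slide does not spell out the pseudocount or the log base, so the choices below are assumptions: a total pseudocount of sqrt(N) split by background frequency, normalization by N + sqrt(N), a uniform background p(b) = 0.25, and log base 2. Under these assumptions the weights agree with the slide's w matrix to one decimal place.

```python
import math

def pfm_to_pwm(pfm, background=0.25):
    """Convert counts to log2-odds weights (pseudocount scheme is assumed)."""
    n_sites = sum(pfm[base][0] for base in "ACGT")   # depth of the alignment
    total_pseudo = math.sqrt(n_sites)                # assumed: sqrt(N) pseudocounts per column
    pwm = {}
    for base in "ACGT":
        pwm[base] = [
            math.log2(((count + total_pseudo * background)
                       / (n_sites + total_pseudo)) / background)
            for count in pfm[base]
        ]
    return pwm

def score(pwm, site):
    """Sum the column weights for one candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

# f matrix from the slide.
f_matrix = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
w_matrix = pfm_to_pwm(f_matrix)
print(round(score(w_matrix, "TGCTG"), 1))  # 0.9
```

Working in log space is what makes point 3 on the slide useful: a site score is a plain sum of column weights rather than a product of probabilities.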
![Page 8: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/8.jpg)
Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche 1997)
• MyoD binding sites predicted about once every 600 bp (Fickett 1995)
• The Futility Conjecture
– Nearly 100% of predicted transcription factor binding sites have no function in vivo
![Page 9: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/9.jpg)
JASPAR
AN OPEN-ACCESS DATABASE
OF TF BINDING PROFILES
![Page 10: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/10.jpg)
PROBLEM: Too many spurious predictions
Actin, alpha cardiac
![Page 11: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/11.jpg)
Terms
• Specificity – The portion of predictions that are correct
• Sensitivity – The portion of “positives” that are detected
• The detection of TFBS is limited by terrible specificity. Why?
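To make the two terms concrete, here is a small sketch using the definitions on this slide. Note that "specificity" as defined here (the portion of predictions that are correct) is the quantity often called positive predictive value elsewhere. The counts are invented purely for illustration.

```python
def specificity_as_defined_here(true_positives, false_positives):
    """Portion of predictions that are correct."""
    return true_positives / (true_positives + false_positives)

def sensitivity(true_positives, false_negatives):
    """Portion of real sites ("positives") that are detected."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical scan: 1000 predicted sites, 9 of them real, 12 real sites in total.
tp, fp, fn = 9, 991, 3
print(specificity_as_defined_here(tp, fp))  # 0.009
print(sensitivity(tp, fn))                  # 0.75
```

In this made-up case most real sites are found, yet fewer than 1% of predictions are correct, which is the pattern the slide describes.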
![Page 12: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/12.jpg)
Method #1: Phylogenetic Footprinting
70,000,000 years of evolution reveals most regulatory regions
![Page 13: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/13.jpg)
Phylogenetic Footprinting
[Figure: FoxC2 plot; axis from 0% to 100% in steps of 20%]
![Page 14: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/14.jpg)
Phylogenetic Footprinting to Identify Functional Segments
[Figure: % identity (y-axis) against 200 bp window start position in the human sequence (x-axis)]
Actin gene compared between human and mouse with DPB.
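A rough sketch of how a plot like this can be computed: percent identity in a fixed-size window slid along a pairwise alignment. It assumes the alignment has already been produced by some other tool and uses "-" for gaps; the function name and the tiny example sequences are hypothetical, and window positions here are alignment columns rather than human sequence coordinates.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return (window_start, percent_identity) pairs along an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both bases agree and neither is a gap.
    matches = [int(a == b and a != "-") for a, b in zip(aligned_a, aligned_b)]
    results = []
    for start in range(0, len(matches) - window + 1, step):
        pct = 100.0 * sum(matches[start:start + window]) / window
        results.append((start, pct))
    return results

# Toy alignment with a 5-column window (the slide uses 200 bp windows).
human = "ACGTACGTAC"
mouse = "ACGTTCGT-C"
profile = window_identity(human, mouse, window=5)
print(profile)
```

Windows whose identity exceeds a chosen cutoff would then be kept as candidate conserved segments for TFBS scanning.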
![Page 15: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/15.jpg)
Phylogenetic Footprinting Dramatically Reduces Spurious Hits
[Figure: Human and Mouse sequence panels for actin, alpha cardiac]
![Page 16: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/16.jpg)
Performance: Human vs. Mouse
• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)
• 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
SELECTIVITY SENSITIVITY
![Page 17: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/17.jpg)
ConSite (www.cisreg.ca)
NEW: Ortholog Sequence Retrieval Service
![Page 18: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/18.jpg)
Emerging Issues
• Multiple sequence comparisons
– Incorporate phylogenetic trees
– Visualization
• Analysis of closely related species
– Phylogenetic shadowing
• Genome rearrangements
– Inversion compatible alignment algorithm
• Higher order models of TFBS
![Page 19: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/19.jpg)
OnLine Resources for Phylogenetic Footprinting
• Linked to TFBS
– ConSite
– rVISTA
• Alignments
– Blastz
– Lagan
– Avid
– ORCA
• Visualization
– Sockeye
– Vista Browser
– PipMaker
![Page 20: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/20.jpg)
Method #2: Discrimination of Regulatory Modules
TFs do NOT act in isolation
![Page 21: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/21.jpg)
Layers of Complexity in Metazoan Transcription
![Page 22: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/22.jpg)
Diverse and non-uniform use of terms: Partial glossary for tutorial
• Promoter – Sufficient to support the initiation of transcription; orientation dependent; includes TSS
• Regulatory Regions
– Proximal – adjacent to promoter
– Distal – some distance away from promoter (vague)
– May be positive (enhancing) or negative (repressing)
• TSS – transcription start site
• TFBS – single transcription factor binding site
• Modules – Sets of TFBS that function together
[Figure: gene schematic showing distal regulatory regions, a proximal regulatory region and the promoter region, with TFBS, TATA, TSS and EXON labels]
![Page 23: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/23.jpg)
Detecting Clusters of TF Binding Sites
• Trained Methods
– Sufficient examples of real clusters to establish weights on the relative importance of each TF
• Statistical Over-Representation of Combinations
– Binding profiles available for a set of biologically motivated TFs
![Page 24: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/24.jpg)
Training for the detection of liver cis-regulatory modules (CRMs)
![Page 25: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/25.jpg)
Models for Liver TFs…
HNF1
C/EBP
HNF3
HNF4
![Page 26: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/26.jpg)
Logistic Regression Analysis
“logit”
Optimize vector to maximize the distance between output values for positive and negative training data.
Output value is:
$$p(x) = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}$$
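A minimal sketch of the output function. It assumes the logit is an intercept plus a weighted sum of per-TF scores within a sequence window; the weights, intercept and scores below are invented for illustration and are not the trained liver model.

```python
import math

def logistic(logit):
    """p = e^logit / (1 + e^logit), written to avoid overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    return math.exp(logit) / (1.0 + math.exp(logit))

def module_probability(scores, weights, intercept):
    """Combine per-TF scores into one output value (weights are hypothetical)."""
    logit = intercept + sum(weights[tf] * scores[tf] for tf in weights)
    return logistic(logit)

# Hypothetical weights for the four liver TF models named on the earlier slide.
weights = {"HNF1": 1.2, "C/EBP": 0.6, "HNF3": 0.9, "HNF4": 0.8}
window_scores = {"HNF1": 1.0, "C/EBP": 0.0, "HNF3": 1.0, "HNF4": 0.5}
print(module_probability(window_scores, weights, intercept=-2.0))
```

Training then amounts to choosing the weight vector so that output values for positive windows sit far from those for negative windows.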
![Page 27: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/27.jpg)
Performance of the Liver Model
• Performance
– Sensitivity: 60% of known CRMs detected
– Specificity: 1 prediction/35,000 bp
• Limitations
– Applies to genes expressed late in hepatocyte differentiation
– Requires 10-15 genes in positive training set
– This model doesn’t account for multiple sites for the same TF
• New methods from several groups address this limit
![Page 28: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/28.jpg)
[Figure: UGT1A1; liver module model score (y-axis) against “window” position in sequence (x-axis); series: Wildtype, Other]
![Page 29: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/29.jpg)
Making better predictions
• Profiles make far too many false predictions to have predictive value in isolation
• Phylogenetic footprinting eliminates ~90% of false predictions
• Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context
![Page 30: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/30.jpg)
Method #3: Higher Order Models
Position-position dependence
![Page 31: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/31.jpg)
What is a higher-order background model?
Zero-order: p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29
$$P(\text{seq}) = \prod_{i=1}^{N} P(\text{nucleotide}_i)$$
First-order: [Figure: diagram of base-to-base transitions among A, C, G, T]
m-th order: The chance of drawing base x is dependent on the identity of the previous m bases
Probabilistic Methods for Pattern Discovery (7)
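The difference between the two kinds of background model can be sketched as follows. The zero-order frequencies are the ones on this slide; the training sequence, the pseudocount of 1, and the uniform fallback for unseen contexts are assumptions made for the example.

```python
import math
from collections import defaultdict

ZERO_ORDER = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}

def zero_order_log_prob(seq, freqs=ZERO_ORDER):
    """log P(seq) as a sum of independent per-base log-probabilities."""
    return sum(math.log(freqs[base]) for base in seq)

def train_markov(training_seq, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from a training sequence."""
    counts = defaultdict(lambda: {b: pseudocount for b in "ACGT"})
    for i in range(order, len(training_seq)):
        context = training_seq[i - order:i]
        counts[context][training_seq[i]] += 1
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values())
        model[context] = {b: c / total for b, c in base_counts.items()}
    return model

def markov_log_prob(seq, model, order=1):
    """log P(seq[order:]) given each base's preceding `order` bases."""
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        # Unseen context: fall back to a uniform 0.25 (an assumption).
        total += math.log(probs[seq[i]] if probs else 0.25)
    return total

model = train_markov("ACGTACGTACGTTTTT", order=1)
print(zero_order_log_prob("ACGT"), markov_log_prob("ACGT", model))
```

A higher-order model needs more training sequence, because the number of contexts grows as 4 to the power m.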
![Page 32: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/32.jpg)
Linking co-expressed genes to candidate transcription factors
![Page 33: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/33.jpg)
Deciphering Regulation of Co-Expressed Genes
![Page 34: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/34.jpg)
oPOSSUM Procedure
Set of co-expressed genes
Automated sequence retrieval from EnsEMBL
Phylogenetic Footprinting
Detection of transcription factor binding sites
Statistical significance of binding sites
Putative mediating transcription factors
ORCA
![Page 35: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/35.jpg)
Statistical Methods for Identifying Over-represented TFBS
• Z scores
– Based on the number of occurrences of the TFBS relative to background
– Normalized for sequence length
– Simple binomial distribution model
• Fisher exact probability scores
– Based on the number of genes containing the TFBS relative to background
– Hypergeometric probability distribution
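The two measures can be sketched with the standard library alone. This is a generic illustration of a binomial Z score and a one-tailed hypergeometric (Fisher exact) probability, not the oPOSSUM implementation; the parameter names and example counts are hypothetical.

```python
import math

def binomial_z_score(hits, target_length, background_rate):
    """Z score for the number of site occurrences in the target sequences.

    background_rate is occurrences per position in the background set,
    which is how sequence length is normalized here.
    """
    expected = target_length * background_rate
    sd = math.sqrt(target_length * background_rate * (1.0 - background_rate))
    return (hits - expected) / sd

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): n target genes drawn from N genes, K of which contain the site."""
    denom = math.comb(N, n)
    return sum(math.comb(K, x) * math.comb(N - K, n - x)
               for x in range(k, min(n, K) + 1)) / denom

# Hypothetical counts: 20 hits in 100 positions versus a background rate of 0.1,
# and 8 of 10 target genes with the site versus 200 of 1000 background genes.
print(binomial_z_score(20, 100, 0.1))
print(hypergeom_upper_tail(8, 10, 200, 1000))
```

The Z score responds to the number of occurrences, while the Fisher score responds to the number of genes, so the two can rank the same TF differently.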
![Page 36: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/36.jpg)
The oPOSSUM Database
• Orthologous genes: 8468
• Promoter pairs: 6911
• Promoters with TFBS: 6758
• Total # of TFBS predictions: 1638293
• Overall failure rate: 20.2%
![Page 37: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/37.jpg)
Validation using Reference Gene Sets
TFs with experimentally-verified sites in the reference sets.
A. Muscle-specific (23 input; 16 analyzed)
TF Rank Z-score Fisher
SRF 1 21.41 1.18e-02
MEF2 2 18.12 8.05e-04
c-MYB_1 3 14.41 1.25e-03
Myf 4 13.54 3.83e-03
TEF-1 5 11.22 2.87e-03
deltaEF1 6 10.88 1.09e-02
S8 7 5.874 2.93e-01
Irf-1 8 5.245 2.63e-01
Thing1-E47 9 4.485 4.97e-02
HNF-1 10 3.353 2.93e-01
B. Liver-specific (20 input; 12 analyzed)
TF Rank Z-score Fisher
HNF-1 1 38.21 8.83e-08
HLF 2 11.00 9.50e-03
Sox-5 3 9.822 1.22e-01
FREAC-4 4 7.101 1.60e-01
HNF-3beta 5 4.494 4.66e-02
SOX17 6 4.229 4.20e-01
Yin-Yang 7 4.070 1.16e-01
S8 8 3.821 1.61e-02
Irf-1 9 3.477 1.69e-01
COUP-TF 10 3.286 2.97e-01
![Page 38: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/38.jpg)
Application to Microarray Data Sets
1. NF-κB inhibition microarray study
![Page 39: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/39.jpg)
Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)
TF Class Rank Z-score Fisher No. Genes
p65 REL 1 36.57 5.66e-12 62
NF-kappaB REL 2 32.58 5.82e-11 61
c-REL REL 3 26.02 8.59e-08 63
Irf-2 TRP-CLUSTER 4 20.39 5.74e-04 6
SPI-B ETS 5 16.59 1.23e-03 135
Irf-1 TRP-CLUSTER 6 15.4 9.55e-04 23
Sox-5 HMG 7 15.38 2.56e-02 126
p50 REL 8 14.72 2.23e-03 19
Nkx HOMEO 9 13.66 2.29e-03 111
Bsap PAIRED 10 13.2 9.92e-02 1
FREAC-4 FORKHEAD 11 12.05 1.66e-03 92
n-MYC bHLH-ZIP 25 6.695 1.84e-03 102
ARNT bHLH 26 6.695 1.84e-03 102
HNF-3beta FORKHEAD 29 5.948 3.32e-03 47
SOX17 HMG 31 5.406 8.60e-03 79
![Page 40: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/40.jpg)
C-Myc SAGE Data
• c-Myc transcription factor dimerizes with the Max protein
• Key regulator of cell proliferation, differentiation and apoptosis
• Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
• They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR
![Page 41: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/41.jpg)
Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)
TF Class Rank Z-score Fisher No. Genes
Myc-Max bHLH-ZIP 1 21.68 5.35e-03 7
Staf ZN-FINGER, C2H2 2 20.17 1.70e-02 2
Max bHLH-ZIP 3 18.32 2.16e-02 12
SAP-1 ETS 4 13.23 1.61e-04 13
USF bHLH-ZIP 5 11.90 1.84e-01 16
SP1 ZN-FINGER, C2H2 6 11.68 4.40e-02 12
n-MYC bHLH-ZIP 7 11.11 1.55e-01 20
ARNT bHLH 8 11.11 1.55e-01 20
Elk-1 ETS 9 10.92 3.88e-03 19
Ahr-ARNT bHLH 10 10.17 1.11e-01 25
![Page 42: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/42.jpg)
C-Fos Microarray Experiment
• In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line
• We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs
![Page 43: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/43.jpg)
Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)
TF Class Rank Z-score Fisher No. Genes
c-FOS bZIP 1 17.53 2.60e-05 45
RREB-1 ZN-FINGER, C2H2 2 8.899 1.41e-01 1
PPARgamma-RXRal NUCLEAR RECEPTOR 3 3.991 2.98e-01 1
CREB bZIP 4 3.626 1.25e-01 10
E2F Unknown 5 2.965 7.67e-02 15
NF-kappaB REL 6 2.915 1.04e-01 17
SRF MADS 7 2.707 2.24e-01 2
MEF2 MADS 8 2.634 1.32e-01 13
c-REL REL 9 2.467 5.79e-02 22
Staf ZN-FINGER, C2H2 10 2.385 3.74e-01 1
Ahr-ARNT bHLH 15 1.716 2.57e-03 63
deltaEF1 ZN-FINGER, C2H2 23 0.271 5.39e-03 75
Elk-1 ETS 21 0.7875 8.12e-03 37
MZF_1-4 ZN-FINGER, C2H2 27 -0.2421 5.41e-03 73
n-MYC bHLH-ZIP 30 -0.8738 8.20e-03 51
ARNT bHLH 31 -0.8738 8.20e-03 51
![Page 44: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/44.jpg)
oPOSSUM Server
![Page 45: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/45.jpg)
http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum
INPUT A LIST OF CO-EXPRESSED GENES
![Page 46: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/46.jpg)
SELECT YOUR TFBS PROFILES
![Page 47: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/47.jpg)
SELECT:
1. CONSERVATION
2. PSSM MATCH THRESHOLD
3. PROMOTER REGION
4. STATISTICAL MEASURE
![Page 48: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/48.jpg)
de novo Discovery of TF Binding Sites
![Page 49: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/49.jpg)
Pattern Discovery
![Page 50: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/50.jpg)
de novo Pattern Discovery
• Exhaustive – e.g. YMF (Sinha & Tompa)
– Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
• Monte Carlo/Gibbs Sampling – e.g. AnnSpec (Workman & Stormo)
– Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics
![Page 51: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/51.jpg)
Exhaustive methods
Word based methods: How likely are X words in a set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B; range
![Page 52: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/52.jpg)
Over-representation
How many words of type ’AGGAGTGA’ are found in our sequences?
$$P(w \text{ begins in } i) = \prod_{j=1}^{k} p(a_j)$$
$$E(X_w) = (n - k + 1) \prod_{j=1}^{k} p(a_j)$$
How likely is this result?
$$Z_w = \frac{X_w - E(X_w)}{\sqrt{\mathrm{Var}(X_w)}}$$
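A minimal sketch of the over-representation calculation for one word, using the expectation from this slide. The variance is not given on the slide, so a simple binomial approximation that ignores overlapping occurrences is assumed here; the function names and the toy sequence are hypothetical.

```python
import math

def count_word(seq, word):
    """Count (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def word_z_score(seqs, word, base_probs):
    """Return (observed, expected, z) for one word over a set of sequences."""
    k = len(word)
    p_word = math.prod(base_probs[b] for b in word)        # P(w begins at a position)
    positions = sum(max(len(s) - k + 1, 0) for s in seqs)  # sum of (n - k + 1)
    observed = sum(count_word(s, word) for s in seqs)
    expected = positions * p_word
    variance = positions * p_word * (1.0 - p_word)         # assumed binomial approximation
    return observed, expected, (observed - expected) / math.sqrt(variance)

uniform = {b: 0.25 for b in "ACGT"}
observed, expected, z = word_z_score(["CCGGAATCCGGAAT"], "CCGGAA", uniform)
print(observed, expected, z)
```

An exhaustive method repeats this for every word of length k and reports the words with the largest Z scores.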
![Page 53: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/53.jpg)
Exhaustive methods
Find all words of length 7 in the yeast genome
Make a lookup table:
AAACCTTT 456
TTTTTTTT 57788
GATAGGCA 589
Etc...
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
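A lookup table of this kind can be sketched in a few lines. The function name `kmer_table` is hypothetical, and the input is just the first bases of the sequence shown on this slide rather than a whole genome.

```python
from collections import Counter

def kmer_table(genome, k):
    """Map every word of length k to its number of occurrences in genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

fragment = "GTCTTTATCTTCAAAGTTGTCTGTCC"
table = kmer_table(fragment, 7)
for word, count in table.most_common(3):
    print(word, count)
```

Building the table once for the genome gives the background counts against which words in a promoter set can be compared.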
![Page 54: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/54.jpg)
The Gibbs Sampling algorithm
Two data structures used:
1) Current pattern nucleotide frequencies $q_{i,1}, \ldots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \ldots, p_{i,4}$
2) Current positions of site startpoints in the N sequences $a_1, \ldots, a_N$, i.e. the alignment that contributes to $q_{i,j}$.
One starting point in each sequence is chosen randomly initially.
[Figure: example sequence tgacttcctgatctctagacctcatgacctct]
![Page 55: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/55.jpg)
Iteration step
A. Remove one sequence z from the set. Update the current pattern according to
$$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$$
where $b_j$ is the pseudocount for symbol j and $B$ is the sum of all pseudocounts in the column.
B. “Score” the current pattern against each possible occurrence $a_k$ in z. Draw a new $a_k$ with probabilities based on the respective score divided by the background model.
[Figure: example sequence tgacttcctgatctctagacctcatgacctct with sequence z marked]
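The two steps can be sketched as a small site sampler. This is an illustrative version under several assumptions: one site per sequence, a fixed motif width, a uniform background, a pseudocount of 0.5 per base, and upper-case A/C/G/T input. The count of symbol j in column i over the remaining N - 1 sites plays the role of c in the update formula. Function names and example sequences are hypothetical.

```python
import random

BASES = "ACGT"

def column_freqs(seqs, starts, width, skip, pseudo=0.5):
    """q[i][b] = (count + pseudocount) / (N - 1 + B), leaving out sequence `skip`."""
    n_used = len(seqs) - 1
    total_pseudo = pseudo * len(BASES)           # B: sum of pseudocounts in a column
    q = []
    for i in range(width):
        counts = {b: pseudo for b in BASES}      # start each count at its pseudocount
        for s, (seq, start) in enumerate(zip(seqs, starts)):
            if s != skip:
                counts[seq[start + i]] += 1
        q.append({b: counts[b] / (n_used + total_pseudo) for b in BASES})
    return q

def gibbs_sample(seqs, width, sweeps=200, background=None, seed=0):
    """Return one sampled start position per sequence after `sweeps` passes."""
    rng = random.Random(seed)
    background = background or {b: 0.25 for b in BASES}
    # One starting point in each sequence is chosen randomly initially.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for z in range(len(seqs)):
            q = column_freqs(seqs, starts, width, skip=z)        # step A
            weights = []
            for a in range(len(seqs[z]) - width + 1):            # step B
                ratio = 1.0
                for i in range(width):
                    base = seqs[z][a + i]
                    ratio *= q[i][base] / background[base]
                weights.append(ratio)
            # Draw the new start in z with probability proportional to its ratio.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["ACGTTGACCTAGT", "TTGACCTACGATC", "GGCATTGACCTAA"]
print(gibbs_sample(seqs, width=7, sweeps=50, seed=1))
```

Because each new position is drawn at random rather than chosen greedily, repeated runs with different seeds can settle on different alignments, which is worth checking before trusting any single result.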
![Page 56: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/56.jpg)
Applied Pattern Discovery is Acutely Sensitive to Noise
[Figure: pattern similarity vs. true MEF2 profile (y-axis, 10 to 18) against sequence length (x-axis, 0 to 600); series: True Mef2 Binding Sites]
![Page 57: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/57.jpg)
Four Approaches to Improve Sensitivity
• Better background models
– Higher-order properties of DNA
• Phylogenetic Footprinting
– Human:Mouse comparison eliminates ~75% of sequence
• Regulatory Modules
– Architectural rules
• Limit the types of binding profiles allowed
– TFBS patterns are NOT random
![Page 58: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/58.jpg)
Information segmentation
Information content distributions of TFBS are distinctly non-random
(Wasserman et al 2000)
Palindromicity, dyads (van Helden et al 2000)
Variable gaps (Hu 2003)
TFBSs are not randomly drawn
Enhancing pattern detection sensitivity
![Page 59: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/59.jpg)
Pattern discovery methods using biochemical constraints
![Page 60: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/60.jpg)
Some profile constraints have been explored…
• Segmentation of informative columns
• Palindromic patterns
![Page 61: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/61.jpg)
Our Hypothesis
• Point 1: Structurally-related DNA binding domains interact with similar target sequences
• Exceptions exist (e.g. Zn-fingers)
• Point 2: There are a finite number of binding domains used in human TFs
• Approximately 20-25
• Idea: We could use the shared binding properties for each family to focus pattern detection methods
• Constrain the range of patterns sought
![Page 62: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/62.jpg)
Comparison of profiles requires alignment and a scoring function
• Scoring function based on sum of squared differences
• Align frequency matrices with modified Needleman-Wunsch algorithm
• Calculate empirical p-values based on simulated set of matrices
[Figure: histogram of Score (x-axis) against Frequency (y-axis)]
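A simplified sketch of profile comparison with a sum-of-squared-differences column score. Instead of the modified Needleman-Wunsch alignment named on this slide, it slides one profile along the other without gaps and keeps the offset with the lowest mean difference; empirical p-values are not computed. Profiles are assumed to be lists of per-column frequency dictionaries, and all names are hypothetical.

```python
def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns."""
    return sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT")

def best_ungapped_alignment(profile_a, profile_b, min_overlap=3):
    """Return (mean SSD, offset) for the best ungapped overlap, or None."""
    best = None
    first = -(len(profile_b) - min_overlap)
    last = len(profile_a) - min_overlap
    for offset in range(first, last + 1):
        # Pair column i of profile_a with column i - offset of profile_b.
        pairs = [(profile_a[i], profile_b[i - offset])
                 for i in range(len(profile_a))
                 if 0 <= i - offset < len(profile_b)]
        if len(pairs) < min_overlap:
            continue
        mean_ssd = sum(column_ssd(a, b) for a, b in pairs) / len(pairs)
        if best is None or mean_ssd < best[0]:
            best = (mean_ssd, offset)
    return best

def one_hot(base):
    """Helper for the toy example: a column with all weight on one base."""
    return {b: 1.0 if b == base else 0.0 for b in "ACGT"}

long_profile = [one_hot(b) for b in "ACGTA"]
short_profile = [one_hot(b) for b in "CGT"]
print(best_ungapped_alignment(long_profile, short_profile))  # (0.0, 1)
```

A lower mean difference means more similar profiles, so a family-level comparison would look for the family whose members give the smallest values.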
![Page 63: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/63.jpg)
Intra-family comparisons more similar than inter-family
TF Database(JASPAR)
COMPARE
Match to bHLH
Jackknife Test 87% correct
Independent Test Set 93% correct
![Page 64: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/64.jpg)
![Page 65: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/65.jpg)
FBPs enhance sensitivity of pattern detection
![Page 66: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/66.jpg)
![Page 67: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/67.jpg)
REVIEWING THE TOP POINTS
![Page 68: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/68.jpg)
Orientation
Regulatory regions problem space
Sets of binding sites
AATCACCAAATCACCAAATCACCAAATCACCAAATCTCCCAATCTCCGAATCACACAATCATCAAATCTCACAATCTCTGAGTCCCCAAATCCCGGAATCTGAGAATCCATAATTCAGCCAATAACTTGATAACCTAATTAGACGATTACAGGATTAGCGATTCTTCCTATGAACAGATTAAAAAGACCCCA
Specificity profiles for binding sites
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ]
C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ]
G [ 0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ]
T [ 0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]
Clusters of binding sites
Transcription factors
Transcription factor binding sites
Regulatory nucleotide sequences
[Figure labels: URE, TATA, URF, Pol-II]
![Page 69: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/69.jpg)
Analysis of regulatory regions with TFBS
Detecting binding sites in a single sequence
Scanning a sequence against a PWM
A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]
C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]
G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]
T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
Abs_score = 13.4 (sum of column scores)
Sp1
Calculating the relative score
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)
$$\text{Rel\_score} = \frac{\text{Abs\_score} - \text{Min\_score}}{\text{Max\_score} - \text{Min\_score}} \times 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \times 100\% = 93\%$$
Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%
Ouch.
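A short sketch of scanning with a relative-score threshold. The `relative_score` call uses the Abs_score, Max_score and Min_score values quoted on this slide; the scan itself is demonstrated on the small five-column w matrix from the "PFMs to PWMs" slide, not on the Sp1 matrix. Function names and the toy sequence are hypothetical.

```python
def relative_score(abs_score, min_score, max_score):
    """Rescale an absolute score to 0-100% of the matrix's score range."""
    return 100.0 * (abs_score - min_score) / (max_score - min_score)

def scan(sequence, pwm, threshold_pct=75.0):
    """Return (start, site, relative score) for windows at or above the threshold."""
    width = len(pwm["A"])
    max_score = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    min_score = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        abs_score = sum(pwm[b][i] for i, b in enumerate(window))
        rel = relative_score(abs_score, min_score, max_score)
        if rel >= threshold_pct:
            hits.append((start, window, round(rel, 1)))
    return hits

print(round(relative_score(13.4, -10.3, 15.2)))  # 93

# Toy matrix: the w matrix from the "PFMs to PWMs" slide.
toy_pwm = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
hits = scan("TTAGCCGTT", toy_pwm, threshold_pct=90.0)
print(hits)
```

Lowering the threshold admits weaker matches, and on long sequences the number of hits grows quickly, which is the problem illustrated by the insulin receptor scan.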
![Page 70: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/70.jpg)
Low specificity of profiles:
• too many hits
• great majority not biologically significant
A dramatic improvement in the percentage of biologically significant detections
Scanning a single sequence
Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions
Analysis of regulatory regions with TFBS
Phylogenetic Footprints
![Page 71: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/71.jpg)
Pattern Discovery
![Page 72: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649cc05503460f94987206/html5/thumbnails/72.jpg)
Concluding Thoughts
• Bioinformatics is often constrained by our understanding of biochemistry rather than computational or statistical limitations
• Evolution has a powerful influence on the performance of many bioinformatics methods
• Computational predictions have value, but only if you understand the limitations of the methods