![Page 1: Computational Investigation of Gene Regulatory Elementscbs/projects/2004_presentation... · 2008-09-25 · Hypothesis is that conserved elements are under selective pressure due to](https://reader033.vdocuments.net/reader033/viewer/2022042309/5ed56a0a6551673b635ad76e/html5/thumbnails/1.jpg)
1
Computational Investigation of Gene Regulatory Elements
Ryan Weddle
Computational Biosciences
Internship Presentation
12/15/2004
2
Table of Contents
Introduction . . . . 3
Goals . . . . . 9
Methods . . . . 12
Results . . . . . 21
Discussion . . . . 37
Acknowledgements . . 43
3
Introduction
4
Invasive Glioma
- Glioma is a particularly devastating type of brain cancer caused by mutations to glial cells.
- While tumors may be treated through traditional means such as chemotherapy and radiation therapy, these means are less effective at preventing spread and recurrence.
- This is because invasive glioma migrates into other parts of the brain by way of phenotypically different invasive cells.
- These cells are not rapidly dividing and are thus less affected by traditional anti-cancer therapy.
5
Different Tumor Cells
- Tumor composed of core and periphery
- Motile cells are more prevalent in periphery
- Laser capture microdissection used to separate cell populations
Figure labels: Tumor Periphery, Tumor Core, Motile Cells
6
What makes them different?
- Microarray analysis indicated a set of 15 differentially expressed genes.
- The differential levels of mRNA between the two cell populations were verified with qPCR analysis.
7
What does this mean?
- When a set of genes is differentially expressed in this manner, it is often hypothesized that they may be co-regulated.
- If they are co-regulated, then understanding their regulation is useful if we wish to prevent their function through some therapeutic means.
8
Gene Regulation
- Eukaryotic gene regulation is much more complicated than bacterial gene regulation.
- Takes place on several levels:
  - Chromatin remodeling
  - Transcriptional control
  - Message control
  - Translational control
- We hope to understand the transcriptional control through computational means.
9
Project Goals
10
Exploratory Investigation
This project aims to gain understanding of the mechanisms that regulate these differentially expressed genes.
- Leverage sequence data
- Investigate known methods
- Investigate new methods
- Generate and test hypotheses
11
Leveraging Sequence Data
Two senses in which we are taking advantage of the DNA sequence resources now available:
- Searching genomic sequence data around our genes for transcription factor binding sites
- Using sequence data from multiple genomes to narrow our search
12
Methods
13
Investigating Known Methods
- Phylogenetic Footprinting
- Transfac Database
- Pattern Detection Algorithms
- Association Rule Mining
14
Phylogenetic Footprinting
- Look at sequence which has been conserved over evolutionary time:
  - Ignore coding sequences
  - Ignore known repeating sequences
- Hypothesis is that conserved elements are under selective pressure due to some functional role.
- We used PipMaker to create visualizations, and the blastz software program to compute ungapped alignments.
- Due to limited availability at onset of project, we used only human and mouse genomes.
15
Example: BCL2L2 Gene Pip
- Black regions are ungapped alignments:
  - Human vs mouse
  - Long segments are often coding sequence
  - Notice some upstream conservation
- Percent identity indicated by y-axis.
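The percent identity plotted on the y-axis can be illustrated with a short sketch. It assumes two gap-free aligned segments of equal length. The function name and the example segments are hypothetical and are not taken from PipMaker or blastz.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions that match in two equal-length, gap-free aligned segments."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped segments must have equal length")
    if not seq_a:
        raise ValueError("segments must be non-empty")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


# Hypothetical aligned segments (not real BCL2L2 sequence).
human = "ACGTTGCAAC"
mouse = "ACGTAGCAAT"
print(percent_identity(human, mouse))  # 8 of 10 positions match -> 80.0
```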
16
Transfac Database
Database of known transcription factor binding sites
- Catalogues known occurrences
- Represents TFBS by consensus sequences and weight matrix methods
17
Matrix Example: TATA
XX
PO      A      C      G      T
01   3.00   3.89   2.48   3.99   N
02   0.00  10.19   2.66   0.52   C
03   0.33   3.33   0.00   9.71   T
04   9.76   0.00   0.00   3.61   A
05   0.00   0.00   0.00  13.36   T
06  13.36   0.00   0.00   0.00   A
07  12.40   0.00   0.00   0.96   A
08  13.36   0.00   0.00   0.00   A
09  13.36   0.00   0.00   0.00   A
10   3.92   1.36   6.11   1.97   R
XX
BA  total weight of sequences: 13.36
XX
CC  consind generated matrix (random_expectation: 0.30)
XX
//
18
Transfac Utilization
- We can use Transfac to scan DNA sequences:
  - Find potential occurrences
  - Different scores for different quality of matches
- Cannot be used to find novel binding sites, only novel occurrences of known binding sites.
- Useful tool, but too noisy to be relied on in automated processes.
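To make the idea of scanning with a weight matrix concrete, here is a minimal sketch that slides the TATA matrix from the earlier slide along a sequence and reports windows scoring close to the matrix maximum. The additive score and the `min_fraction` cutoff are simplifying assumptions for illustration, not the scoring scheme used by the Transfac tools, and the function names are hypothetical.

```python
# Rows of the TATA matrix from the earlier slide, in A, C, G, T order.
TATA_MATRIX = [
    (3.00, 3.89, 2.48, 3.99),
    (0.00, 10.19, 2.66, 0.52),
    (0.33, 3.33, 0.00, 9.71),
    (9.76, 0.00, 0.00, 3.61),
    (0.00, 0.00, 0.00, 13.36),
    (13.36, 0.00, 0.00, 0.00),
    (12.40, 0.00, 0.00, 0.96),
    (13.36, 0.00, 0.00, 0.00),
    (13.36, 0.00, 0.00, 0.00),
    (3.92, 1.36, 6.11, 1.97),
]
COLUMN = {"A": 0, "C": 1, "G": 2, "T": 3}


def window_score(window, matrix=TATA_MATRIX):
    """Sum the matrix weight of the observed base at each position."""
    return sum(row[COLUMN[base]] for row, base in zip(matrix, window))


def scan(sequence, matrix=TATA_MATRIX, min_fraction=0.8):
    """Return (start, window, fraction_of_max_score) for windows above the cutoff."""
    best = sum(max(row) for row in matrix)
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        if any(base not in COLUMN for base in window):
            continue  # skip masked or ambiguous positions such as N
        fraction = window_score(window, matrix) / best
        if fraction >= min_fraction:
            hits.append((start, window, round(fraction, 3)))
    return hits


# Hypothetical sequence containing the best-scoring window for this matrix.
for hit in scan("GGGTCTATAAAAGGGG"):
    print(hit)
```

Lowering `min_fraction` quickly increases the number of reported windows, which is one way to see why matrix hits alone are noisy.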
19
Pattern Detection Algorithms
- Pattern detection algorithms are useful when we are looking for novel motifs.
- We used the MEME/MAST tools to search our conserved sequences for novel motifs:
  - Most interesting result was an already known splice sequence
  - MEME works best when you know how many occurrences you are expecting and where you are expecting them.
20
Association Rule Mining
- ARM is a mechanism for finding rules about association between different elements.
  - Classical example is “market basket analysis”
  - Here we are interested in any interesting patterns in the occurrence of TFBS identified by Transfac in our conserved sequences.
- Results in many low quality rules:
  - Typically infrequent or low confidence
  - Best rules found due to overlapping putative binding sites - little informational content
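The notions of "infrequent" and "low confidence" correspond to the standard support and confidence measures of a rule. The sketch below computes both for pairwise rules, treating each conserved sequence as a "basket" of TFBS names. The site names, thresholds, and function name are hypothetical, and real ARM implementations (such as Apriori) handle larger itemsets far more efficiently.

```python
from itertools import permutations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (lhs, rhs, support, confidence) for all ordered item pairs.

    support    = fraction of transactions containing both lhs and rhs
    confidence = fraction of transactions containing lhs that also contain rhs
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for lhs, rhs in permutations(items, 2):
        both = sum(1 for t in transactions if lhs in t and rhs in t)
        lhs_count = sum(1 for t in transactions if lhs in t)
        support = both / n
        confidence = both / lhs_count
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical putative binding sites found in four conserved sequences.
conserved = [
    {"TATA", "SP1"},
    {"TATA", "SP1", "AP1"},
    {"SP1"},
    {"TATA", "SP1"},
]
for rule in pair_rules(conserved):
    print(rule)
```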
21
Exploratory Investigation Results
22
Investigating Novel Methods
All existing methods had shortcomings when applied to our dataset:
- Transfac highly uncertain
- Pattern detection and association rule mining failed to yield interesting results
- Too few elements for meaningful clustering, etc.
How can we reframe the problem?
23
Scaling It All Up
Association rule mining is intended for large databases.
- Our gene/TF universe was probably too small to result in interesting rules.
What if we could scale it up?
- Look at every subsequence up to a certain length in each genomic region
- Determine identity between short sequences by allowing slight mismatch
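The two steps above can be sketched as follows: enumerate every subsequence in a length range, then treat two subsequences as the same element when they differ at only a few positions. The use of Hamming distance and the `max_mismatch` parameter are assumptions for illustration; the source does not say how mismatch was measured.

```python
def kmers(sequence, k_min, k_max):
    """Yield (start, kmer) for every subsequence of length k_min..k_max."""
    for k in range(k_min, k_max + 1):
        for start in range(len(sequence) - k + 1):
            yield start, sequence[start:start + k]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def similar(a, b, max_mismatch=1):
    """True when two kmers have the same length and at most max_mismatch differences."""
    return len(a) == len(b) and hamming(a, b) <= max_mismatch


print(list(kmers("ACGT", 2, 3)))
print(similar("ACGTAC", "ACCTAC"))
```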
24
Kmer Analysis
- ARM can be modified to find very low support rules that have high certainty - the “needle in haystack.”
- We can build a database of all TFBS-sized short sequences in our conserved sequence data:
  - Mine this database for association rules
  - Interesting rules might indicate functional relationships.
25
Building the Kmer Database
- Sequence data for each gene was obtained from both mouse and human genomes
  - Repeat sequences and coding regions were masked out.
- Kmer library for all 6-11mers with several degrees of mismatch was constructed
  - 150,000 occurrences of 80,000 unique kmers
  - 550MB on disk
  - 40MB when we exclude all but perfect matches
26
Refining the Kmer Database
This is still a very large database!
- Likely to result in many rules
- Hard to analyze
How can we easily measure the similarity within this database, before devoting time to implementing new algorithms?
- Narrow database to include only exactly matching 11mers
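One way to picture the narrowed database is as an index from each 11mer to the gene regions in which it occurs, keeping only 11mers seen in more than one region. This is a sketch under assumptions: the slides do not define exactly what counts as a "match", masked bases are assumed to appear as `N`, and the region names and sequences are made up.

```python
from collections import defaultdict


def shared_kmers(regions, k=11):
    """Map each exact k-mer to the set of region names containing it.

    Only k-mers that occur in at least two different regions are kept.
    """
    seen = defaultdict(set)
    for name, sequence in regions.items():
        sequence = sequence.upper()
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if set(kmer) <= set("ACGT"):  # skip windows touching masked bases
                seen[kmer].add(name)
    return {kmer: names for kmer, names in seen.items() if len(names) >= 2}


# Hypothetical conserved regions; geneC is fully masked.
regions = {
    "geneA": "TTACGTACGTACGAA",
    "geneB": "GGACGTACGTACGCC",
    "geneC": "NNNNNNNNNNNNNNN",
}
print(shared_kmers(regions))
```

The number of keys in the returned dictionary is one possible definition of the match count used in a comparison against random data.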
27
Research Hypothesis
“There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be expected from random sequence data.”
If we can confirm this hypothesis, we can assert that there is interesting informational content at the sequence level.
- Worthwhile to investigate further
28
Randomized Sequence Data
We needed a basis for comparison to determine whether the short sequence similarity observed in our data set was significant.
- Generate random sequence data that maintains the same nucleotide bias for each sequence fragment
- Perform kmer analyses on each of these random trials
- 100 trials in total
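A minimal sketch of the randomization step: for each fragment, draw a new sequence of the same length from that fragment's own base frequencies, and repeat for a number of trials. Sampling independently per position is an assumption here (shuffling the fragment would be an alternative that preserves the counts exactly), and the function names are hypothetical.

```python
import random
from collections import Counter


def randomized_like(fragment, rng):
    """Random sequence of the same length, drawn from the fragment's base frequencies."""
    counts = Counter(fragment.upper())
    bases = sorted(counts)
    weights = [counts[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(fragment)))


def random_trials(fragments, n_trials=100, seed=0):
    """Build n_trials randomized copies of the whole fragment set."""
    rng = random.Random(seed)  # fixed seed so trials are reproducible
    return [[randomized_like(f, rng) for f in fragments] for _ in range(n_trials)]


# Hypothetical fragments standing in for conserved sequence.
trials = random_trials(["AACCGGTTAC", "AAAATTTT"], n_trials=3)
for trial in trials:
    print(trial)
```

Each trial can then be passed through the same kmer analysis as the real data to build a null distribution of match counts.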
29
Research Hypothesis Results
Randomly Generated Sequences
31
Summary Statistics
- 11mer distributions calculated for both
  - Uniform nucleotide distribution
  - Same distribution as in target data
    A=26.7% C=23.0% G=26.1% T=24.2%
- Z-test: Is our observed count of 73 11mers higher than the population mean?
  - Z score = -46
  - P-value < 10^-6
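For readers unfamiliar with the test, the sketch below shows one common way to compute a z score and a one-sided p-value from a set of null-trial counts. The null counts are made up for illustration and are not the project's data. Sign conventions vary: here z is (observed - mean) / standard deviation, so a positive z means the observed count lies above the null mean.

```python
import math
from statistics import mean, stdev


def z_test_upper(observed, null_counts):
    """Return (z, p) where p is the upper-tail probability under a normal null."""
    mu = mean(null_counts)
    sigma = stdev(null_counts)  # sample standard deviation of the null trials
    z = (observed - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, p


# Hypothetical match counts from five random trials (illustration only).
null_counts = [8, 10, 12, 10, 10]
z, p = z_test_upper(73, null_counts)
print(f"z = {z:.1f}, p = {p:.3g}")
```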
32
Hypothesis Revisited
- Results looked promising...
- However, they depended on assumptions about random sequence data.
- Therefore, we revised our hypothesis:
“There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be found by randomly sampling sets of genes from the human and mouse genomes.”
- Confirming this hypothesis would provide concrete evidence that our observed 11mer similarity constituted a meaningful departure from the norm.
33
Analyzing Random Genes
- Downloaded all human-mouse homologs from EMBL
- Performed pre-processing on all homolog pairs
  - Repeatmasking
  - Blastz for phylogenetic footprinting
- Randomly selected 100 sets of genes
- Performed 11mer analysis on every set
- Catalogued results
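The sampling step can be sketched as below, together with a simple empirical comparison of an observed count against the counts from the random sets. The set size of 15 mirrors the number of differentially expressed genes but is an assumption, since the slides do not state the size of each random set. Gene names and counts are hypothetical.

```python
import random


def sample_gene_sets(gene_pool, set_size=15, n_sets=100, seed=0):
    """Draw n_sets random gene sets (without replacement within a set)."""
    rng = random.Random(seed)
    return [rng.sample(gene_pool, set_size) for _ in range(n_sets)]


def empirical_tails(observed, null_counts):
    """Fractions of null counts that are >= observed and <= observed."""
    n = len(null_counts)
    upper = sum(count >= observed for count in null_counts) / n
    lower = sum(count <= observed for count in null_counts) / n
    return upper, lower


# Hypothetical pool of homolog names.
pool = [f"gene{i}" for i in range(200)]
gene_sets = sample_gene_sets(pool)
print(len(gene_sets), gene_sets[0][:3])

# Hypothetical match counts for four random sets, compared with 73.
print(empirical_tails(73, [100, 120, 60, 150]))
```

Reporting both tails makes it visible when the observed count falls below, rather than above, most of the random sets.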
34
Research Hypothesis Results
Randomly Selected Sequences
35
Distributions Overlay
36
Comparing the Distributions
- All distributions appear normal.
- 73 observed 11mer matches are clearly
  - More occurrences than expected from random sequence
  - Far fewer than expected from randomly selected genes
- What’s going on here?
37
Discussion
38
Conclusions
- 73 observed 11mer matches are anecdotally interesting.
  - Transfac matches for TATA, various TFs
- However, our most exhaustive results indicate that we cannot claim that the number of matches is statistically significant.
- But there are more variables involved in the final analysis, which could be controlled for in further analyses.
39
Possible Confounding Factors
- Amount of conserved sequence may differ due to:
  - Percent conservation
  - Size of genes
- Controlled for in random sequence generation, but not in random gene selection
  - Assumes all genes are comparable
- Controlling for these factors could be a good avenue for future research
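One possible way to control for the amount of conserved sequence, offered only as a sketch of the idea and not something the project did, is to accept a random gene set only when its total conserved length is close to that of the target set. The tolerance, the rejection-sampling approach, and all names and lengths below are assumptions.

```python
import random


def matched_gene_sets(conserved_len, target_total, set_size=15, n_sets=100,
                      tolerance=0.1, seed=0, max_tries=100000):
    """Rejection-sample gene sets whose summed conserved length is within
    +/- tolerance of target_total. May return fewer than n_sets if
    max_tries is exhausted."""
    rng = random.Random(seed)
    genes = sorted(conserved_len)
    low = target_total * (1 - tolerance)
    high = target_total * (1 + tolerance)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_sets:
            break
        candidate = rng.sample(genes, set_size)
        total = sum(conserved_len[gene] for gene in candidate)
        if low <= total <= high:
            accepted.append(candidate)
    return accepted


# Hypothetical conserved lengths (bp) for 100 genes.
conserved_len = {f"g{i}": 1000 + 10 * i for i in range(100)}
sets = matched_gene_sets(conserved_len, target_total=7500, set_size=5, n_sets=10)
print(len(sets))
```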
40
Questioning Assumptions
Everything rests on the assumption that our target set of genes is co-regulated by common elements at the DNA sequence level.
- Further assumption that regulatory mechanism is local to the genes
- What about chromatin and its role in regulation?
41
Suggestions for Future Work
- It would be useful to repeat the final tests while controlling for gene size and conservation.
- Consider testing these same methods on an already well characterized set of co-regulated genes, rather than on an investigative data set.
- Research methods for taking chromatin and DNA sequence structure into account.
42
Things I Learned
- In exploratory investigations, Perl is your best friend.
  - Much of this would have been impossible to do manually.
  - Perl really is faster for rapid prototyping when you don’t know in advance what your needs will be.
- You can try new methods on old data, or old methods on new data, but developing new methods on new data is difficult.
43
Acknowledgements
Dr. Jeff Touchman . . . Tgen, ASU
Dr. Phillip Stafford . . . Tgen, ASU
Dr. Rosemary Renaut . . . ASU
Dr. Michael Berens . . . Tgen
Dr. Huan Liu . . . ASU
Dominique Hoelzinger . . . Tgen
Maulik Shah . . . Tgen, ASU