TOOLS AND TECHNIQUES FOR ASSESSING FUNCTIONAL RELEVANCE OF GENOMIC LOCI
A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES
OF MIDDLE EAST TECHNICAL UNIVERSITY
BY
BURÇAK OTLU
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN COMPUTER ENGINEERING
JUNE 2017
Approval of the thesis:
TOOLS AND TECHNIQUES FOR ASSESSING FUNCTIONAL RELEVANCE OF GENOMIC LOCI
submitted by BURÇAK OTLU in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering Department, Middle East Technical University by,
Prof. Dr. Gülbin Dural Ünver
Dean, Graduate School of Natural and Applied Sciences
Prof. Dr. Adnan Yazıcı
Head of Department, Computer Engineering
Prof. Dr. Tolga Can
Supervisor, Computer Engineering Department, METU
Prof. Dr. Sündüz Keleş
Co-supervisor, Department of Statistics, University of Wisconsin–Madison, USA
Examining Committee Members:
Prof. Dr. M. Volkan Atalay
Computer Engineering Department, METU
Prof. Dr. Tolga Can
Computer Engineering Department, METU
Assoc. Prof. Dr. Murat Manguoğlu
Computer Engineering Department, METU
Assist. Prof. Dr. Öznur Taştan Okan
Computer Engineering Department, Bilkent University
Assist. Prof. Dr. Can Alkan
Computer Engineering Department, Bilkent University
Date:
I hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.
Name, Last Name: BURÇAK OTLU
Signature :
ABSTRACT
TOOLS AND TECHNIQUES FOR ASSESSING FUNCTIONAL RELEVANCE OF GENOMIC LOCI
Otlu, Burçak
Ph.D., Department of Computer Engineering
Supervisor: Prof. Dr. Tolga Can
Co-Supervisor: Prof. Dr. Sündüz Keleş
June 2017, 144 pages
Genomic studies identify genomic loci representing genetic variations, transcription
factor occupancy, or histone modification through next generation sequencing (NGS)
technologies. Interpreting these loci requires evaluating them against known genomic
and epigenomic annotations. In this thesis, we develop tools and techniques to assess
the functional relevance of a set of genomic intervals. Toward this goal, we first
introduce the Genomic Loci ANnotation and Enrichment Tool (GLANET) as a comprehensive
annotation and enrichment analysis tool. The input query to GLANET is a set of genomic
intervals. GLANET annotates these loci and performs enrichment analysis on them with
a rich library that includes (i) gene-centric regions, including the genes' non-coding
neighborhoods, (ii) a large collection of regulatory regions from ENCODE, and (iii)
gene sets derived from pathways. As a key feature, users can easily extend this library
with new gene sets and genomic intervals. GLANET implements a sampling-based
enrichment test that can account for the GC content and/or mappability biases inherent
to NGS technologies and that shows high statistical power and a well-controlled Type-I
error rate. Other key features of GLANET include assessing the impact of single
nucleotide variants on transcription factor binding sites when the input consists of
SNPs only, and gene set enrichment analysis that is not only exon based but also
regulation based, considering the introns and proximal regions of the genes in a gene
set. GLANET also allows joint enrichment analysis for TF binding sites and KEGG
pathways. With this option, users can evaluate whether the input set is enriched
concurrently with binding sites of TFs and the genes within a KEGG pathway. This joint
enrichment analysis provides a detailed functional interpretation of the input loci.
As a second contribution, we designed novel data-driven computational experiments for
assessing the power and Type-I error of enrichment procedures. These data-driven
computational experiments make detailed quantitative comparisons of GLANET with other
tools possible. Our results on these computational experiments showcase GLANET's unique
capabilities as well as its robustness, speed, and accuracy. Finally, as a third
contribution, we present an efficient algorithmic solution for finding common
overlapping intervals over n interval sets. Our strategy first constructs one segment
tree for each interval set and then converts each segment tree into an indexed segment
tree forest by cutting the tree at a certain depth. Experiments on real data show that
this data structure decreases the search time. This novel representation also enables
parallel computations on each segment tree in the forest. We also extend this solution
to the problem of finding at least k common overlapping intervals over n interval sets.
We hope that the tools and techniques developed herein will expedite genomic research
and help improve our understanding of the molecular biology of the cell and the
mechanisms underlying diseases.
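To give a rough sense of what a sampling-based enrichment test involves, the following minimal Python sketch estimates an empirical p-value by comparing the number of query intervals that overlap an annotation with the counts obtained from randomly placed intervals of the same lengths. It is not GLANET's implementation: the function names, the single linear "chromosome", the uniform placement, and the add-one p-value convention are all assumptions made for illustration, and the GC content and mappability matching mentioned in the abstract is omitted.

```python
import random


def count_overlapping(query, annotation):
    """Count query intervals (start, end), ends inclusive, that overlap any annotation interval."""
    return sum(
        any(q_start <= a_end and a_start <= q_end for a_start, a_end in annotation)
        for q_start, q_end in query
    )


def sample_like(query, chrom_length, rng):
    """Hypothetical helper: place one uniformly random interval per query interval, keeping its length."""
    sampled = []
    for q_start, q_end in query:
        length = q_end - q_start + 1
        start = rng.randrange(0, chrom_length - length + 1)
        sampled.append((start, start + length - 1))
    return sampled


def empirical_p_value(query, annotation, chrom_length, n_samplings=1000, seed=0):
    """Return (observed overlap count, empirical p-value) for enrichment of the annotation."""
    rng = random.Random(seed)
    observed = count_overlapping(query, annotation)
    # Number of samplings whose overlap count is at least as large as the observed one.
    at_least = sum(
        count_overlapping(sample_like(query, chrom_length, rng), annotation) >= observed
        for _ in range(n_samplings)
    )
    # Add-one convention (an assumption here) so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_samplings + 1)


if __name__ == "__main__":
    query = [(100, 199), (500, 549)]
    annotation = [(150, 160), (520, 530)]
    print(empirical_p_value(query, annotation, chrom_length=100_000))
```

In this toy setting both query intervals overlap the annotation, while randomly placed intervals almost never do, so the estimated p-value is small. A bias-aware test would additionally restrict the random placements to positions whose GC content and mappability resemble those of the original intervals.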
Keywords: Genomic Intervals, Interval Intersection, Single-Nucleotide Polymorphisms
(SNPs), Genomic Variants, Gene Sets, Annotation and Enrichment Analysis, Regulatory
Sequence Analysis, Joint Enrichment Analysis, DNA Regulatory Elements, n Interval Set
Intersection
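To make the n interval set intersection problem concrete, the sketch below reports the maximal regions that are covered by intervals from at least k of the n given sets (k = n gives the regions common to all sets). It deliberately uses a plain sort-and-sweep over interval endpoints rather than the indexed segment tree forest developed in this thesis, and the problem formulation (regions rather than tuples of intervals), the function name, and the inclusive integer coordinates are assumptions made for illustration.

```python
def regions_covered_by_at_least_k(interval_sets, k):
    """Return maximal regions (start, end), ends inclusive, covered by intervals from >= k sets."""
    events = []
    for set_id, intervals in enumerate(interval_sets):
        for start, end in intervals:
            events.append((start, 0, set_id))    # an interval of this set opens at start
            events.append((end + 1, 1, set_id))  # and stops covering just after its inclusive end
    events.sort()

    open_count = [0] * len(interval_sets)  # currently open intervals per set
    sets_covering = 0                      # number of sets with at least one open interval
    regions = []
    region_start = None
    i = 0
    while i < len(events):
        position = events[i][0]
        # Apply every event at this position before testing the coverage.
        while i < len(events) and events[i][0] == position:
            _, kind, set_id = events[i]
            if kind == 0:
                open_count[set_id] += 1
                if open_count[set_id] == 1:
                    sets_covering += 1
            else:
                open_count[set_id] -= 1
                if open_count[set_id] == 0:
                    sets_covering -= 1
            i += 1
        if sets_covering >= k and region_start is None:
            region_start = position
        elif sets_covering < k and region_start is not None:
            regions.append((region_start, position - 1))
            region_start = None
    return regions


if __name__ == "__main__":
    sets = [[(1, 10), (20, 30)], [(5, 25)], [(8, 22)]]
    print(regions_covered_by_at_least_k(sets, 3))  # regions common to all three sets
    print(regions_covered_by_at_least_k(sets, 2))  # regions covered by at least two sets
```

The sweep sorts all endpoints once, so it serves only as a simple baseline for checking results on small inputs; it does not provide the repeated fast queries or the per-tree parallelism that motivate the segment tree forest.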
ÖZ
TOOLS AND TECHNIQUES FOR ASSESSING THE FUNCTIONAL RELEVANCE OF GENOMIC LOCI
Otlu, Burçak
Ph.D., Department of Computer Engineering
Supervisor: Prof. Dr. Tolga Can
Co-Supervisor: Prof. Dr. Sündüz Keleş
June 2017, 144 pages
Genomic studies identify genomic loci, obtained through next generation sequencing
(NGS) technologies, that represent genetic variations, transcription factors, or
histone modifications. Interpreting these genomic loci requires evaluating them
against known genomic and epigenomic annotations. In this thesis, tools and techniques
are developed for assessing the functional relevance of genomic intervals. To this
end, we first present the Genomic Loci ANnotation and Enrichment Tool (GLANET) as a
comprehensive annotation and enrichment analysis tool. The input to GLANET is a set of
genomic intervals. GLANET annotates these genomic intervals and performs enrichment
analysis on them with a rich library that includes (i) gene-centric regions that also
cover the non-coding neighborhoods of genes, (ii) a large collection of regulatory
regions from ENCODE, and (iii) gene sets derived from pathways. As an important
feature, users can extend this library with new gene sets and genomic intervals.
GLANET implements a sampling-based enrichment test that can account for the GC content
and/or mappability biases inherent to NGS technologies and that shows high statistical
power and a well-controlled Type-I error rate. Other important features of GLANET
include assessing the effects of single nucleotide polymorphisms (SNPs) on
transcription factors when only SNPs are given as input, and performing gene set
enrichment analysis that is not only exon based but also regulation based, by taking
into account the introns and proximal regions of the genes in a gene set. GLANET also
allows joint enrichment analysis for TF binding sites and KEGG pathways. With this
option, users can evaluate whether the input set is enriched concurrently with TF
binding sites and the genes in a KEGG pathway. This joint enrichment analysis enables
a detailed functional interpretation of the input intervals. As a second contribution
of this thesis, we designed novel data-driven computational experiments to assess the
power and Type-I error of enrichment procedures. The data-driven computational
experiments also make a detailed quantitative comparison of GLANET with other tools
possible. Our results on these computational experiments showcase GLANET's unique
capabilities as well as its robustness, speed, and accuracy. Finally, as a third
contribution, we present an efficient algorithmic solution for finding common
overlapping intervals in n interval sets. Our strategy is based on first constructing
a segment tree for each given interval set and proceeds by cutting this tree at a
certain depth, converting the cut segment tree into an indexed segment tree forest.
Experiments on real data show that this data structure reduces the search time. This
new representation also enables parallel computations on each segment tree in the
forest. We also extended this solution to solve the problem of finding at least k
common overlapping intervals in n interval sets. We hope that the tools and techniques
developed in this thesis will expedite genomic research and help us understand the
molecular biology of the cell and the mechanisms underlying diseases.
Keywords: Genomic Intervals, Interval Intersection, Single-Nucleotide Polymorphisms,
Genomic Variants, Gene Sets, Annotation and Enrichment Analysis, Regulatory Sequence
Analysis, Joint Enrichment Analysis, DNA Regulatory Elements, n Interval Set
Intersection
To my mother and father, Feride and Fikret
To my daughter and son, Betül and Süleyman Ediz
ACKNOWLEDGMENTS
A PhD may take a long time; mine took six and a half years. Now I would like to go
back in time and remember some important dates and events that took place throughout
my PhD.
On September 29, 2009, at midnight, my daughter, a newborn just 49 days old, had an
operation at Hacettepe University Hospital. I would like to express my gratitude to
my father, Dr. Fikret Otlu, Prof. Dr. Özgür Deren and Prof. Dr. Cemalettin Aksoy for
being there and for the successful operation. After the operation, my daughter Betül
and I stayed in the hospital for 42 days for her treatment. During our stay, I decided
to pursue a PhD. Later on, my PhD journey officially started on September 13, 2010.
In 2010, I read a book by Prof. Dr. Pavel Pevzner and Neil C. Jones, titled "An
Introduction to Bioinformatics Algorithms (Computational Molecular Biology)", after
which I was determined to study Bioinformatics and Computational Biology. I would
like to thank them for their efforts and for this well-written book.
In the spring of 2012, I took a course from Assist. Prof. Can Alkan at Bilkent
University. For the course project, I told him that I would like to work on GWAS data,
and he referred me to Assist. Prof. Öznur Taştan. She started collaborating with me
and with Prof. Dr. Sündüz Keleş, who was at Bilkent University at that time on
sabbatical leave. We held our first meeting in Öznur Taştan's office. I remember that
Sündüz Keleş was wearing a black-and-white striped blouse and a black skirt, and that
she explained the "small n, big p" problem on the board.
My course project, which then turned into my PhD studies, started with Sündüz Keleş
and Öznur Taştan in this way. Together with Sündüz Keleş and Öznur Taştan, I
developed our tool, GLANET. At first, there was nothing, like a blank page. For more
than four years, day in and day out, step by step, I coded this tool. The GitHub
repository keeps all the history. Our four years of Skype meetings made GLANET evolve
over time. Sometimes I fell into the trap of perfectionism, implementing things that
were unnecessary for the sake of completeness, or that were necessary at the time but
later were not. Later on, journal reviewers determined new directions for the tool,
and many analyses and comparisons followed. I would like to thank Sündüz Keleş and
Öznur Taştan for our almost weekly meetings and for their support and guidance through
all these years. I would also like to thank Can Fırtına for his initial work on the
GUI, command-line arguments, and documentation of GLANET, which I continued and
maintained later on.
When I started my thesis in 2012, my advisor from METU, Prof. Dr. Tolga Can, was in
Cyprus on sabbatical leave. So we could not start together, but I am happy that we
finished together. I would like to thank him for his suggestions and brilliant ideas.
He was always available when I asked.
I would like to thank Prof. Dr. Afşin Sarıtaş for his help and support in looking
after our children; without his help I could not have looked after them this well.
I would like to thank my department for the beautiful room and for my corner next to
the window, from which I can see the sky, the clouds, the sun, and sometimes the moon,
which comforts me especially when I am overwhelmed and weary. I would like to thank
all of my professors and fellow assistants in the department for their friendship and
company.
I would like to thank TÜBİTAK ULAKBİM for providing the high performance and grid
computing resources used to carry out my experiments. I would like to acknowledge
that I was supported by The Scientific and Technological Research Council of Turkey
(TÜBİTAK 2211-C PhD Scholarship) during my PhD studies.
Although I have a family and I am the mother of two beautiful children, during my PhD
I felt loneliness deep down in my soul. What I want to say is that a PhD may become a
long period of your life in which you are mostly with yourself; you work alone most of
the time, and it requires endurance and perseverance. At least, that was the case for
me. Nonetheless, I am grateful for a lot of things. First of all, I am grateful for my
children, Betül and Süleyman Ediz; they are definitely the main driving forces in my
life. They made me strong, hopeful, and courageous. Then my parents, Feride and
Fikret: if they had not looked after my children, I could not have pursued this PhD.
As their legacy, I inherited a smiling face and a sincere, loving heart full of
passion and compassion. I am strong, confident, determined, courageous, faithful, and
hopeful. When I feel down, I like singing, especially classical Turkish music, and
sometimes I just remember the lyrics of a song, something like "I have a dream, a song
to sing to help me cope with anything". Whatever we do during the day, at the end of
the day, it is the heart that really matters. Our hearts are like our GPSes; they
somehow know our true calling and which direction to go. I hope we will all have the
courage to follow our hearts, dreams, and intuitions at any time, at any age. All the
work presented here is love made visible, as Khalil Gibran said. And I would like to
end my acknowledgments with the words of Winston Churchill, "Success is not final,
failure is not fatal: it is the courage to continue that counts."
TABLE OF CONTENTS
ABSTRACT
ÖZ
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF ABBREVIATIONS
CHAPTERS
1 INTRODUCTION
2 BIOLOGICAL BACKGROUND AND RELATED WORK
2.1 Biological Terms
2.2 Related Work Regarding Thesis Part 1
2.3 Related Work Regarding Thesis Part 2
3 ANNOTATION OF GENOMIC LOCI
3.1 User Query
3.2 GLANET Annotation Library
3.3 Library Representation
3.4 Interval Tree
3.5 Time and Space Complexity of Annotation
4 REGULATORY SEQUENCE ANALYSIS OF SINGLE NUCLEOTIDE POLYMORPHISMS
4.1 Regulatory Sequence Analysis
4.2 GLANET Use Case: Regulatory Sequence Analysis of OCD GWAS SNPs
5 ENRICHMENT ANALYSIS OF GENOMIC REGIONS
5.1 Enrichment Analysis
5.2 Random Interval Sampling Procedure
5.2.1 GC and Mappability Calculation
5.3 Time and Space Complexity of Random Interval Generation
5.4 Joint Enrichment Analysis of Transcription Factors and KEGG Pathways
5.5 Time and Space Complexity of Enrichment Analysis
6 DATA DRIVEN COMPUTATIONAL EXPERIMENTS
6.1 Design of Data-driven Computational Experiments
6.1.1 Type-I error experiments
6.1.2 Power experiments
6.1.3 Transcriptional activator and repressor elements
6.1.4 Genomic interval sets for expressed genes
6.1.5 Genomic interval sets for non-expressed genes
6.2 RESULTS
6.2.1 Data-driven Computational Experiments Results for Activator Elements
6.2.2 Data-driven Computational Experiments Results for Repressor Elements
6.2.3 GLANET GAT Comparison Results for Activator and Repressor Elements through Data-driven Computational Experiments
6.2.4 Assessing GLANET Enrichment Parameters through Wilcoxon Signed Rank Tests
6.2.5 Assessing GLANET Enrichment Parameters through ROC Curves and Comparison with GAT
7 GLANET USE CASES AND RUN TIME COMPARISONS
7.1 GLANET GAT Comparison with Additional Data-sets
7.2 Example Use Cases of GLANET
7.2.1 Enrichment Analysis of OCD GWAS SNPs
7.2.2 Enrichment Analysis of GATA2 Binding Regions for Gene Ontology Terms using User-defined Gene Sets Feature
7.3 GLANET Run Time Comparison
7.3.1 Comparison with GAT
7.3.2 Comparison with GREAT
8 FINDING OVERLAPPING INTERVALS FOR N GIVEN INTERVAL SETS
8.1 Segment Tree
8.2 Segment Tree Construction Complexity Analysis
8.3 Segment Tree Query
8.4 Motivation: Indexed Segment Tree Forest
8.4.1 Hash Function, Preset Value
8.4.2 Cut-off Depth
8.4.3 Moving Intervals That Were Stored in The Nodes Above The Cut-off Depth
8.4.4 Linking Segment Tree Nodes at Cut-off Depth to Each Other
8.5 Indexed Segment Tree Forest in More Details
8.6 Query in Indexed Segment Tree Forest
8.6.1 How to Guarantee at Most Two Additional Index Searches Are Enough?
8.7 Finding n Common Overlapping Intervals for n Given Interval Sets
8.8 Finding at Least k Common Overlapping Intervals for n Given Interval Sets
9 CONCLUSION AND FUTURE WORK
REFERENCES
APPENDICES
A GLANET DATA SOURCES
B TYPE-I ERROR, POWER AND ROC CURVE FIGURES
CURRICULUM VITAE
LIST OF TABLES
TABLES
Table 2.1 Available tools including GLANET are compared with respect to
their accepted input types and annotation libraries utilized.
Table 2.2 Available tools including GLANET are compared with respect to
their statistical tests carried out and enrichment options provided.
Table 2.3 Available tools including GLANET are compared with respect to
their enrichment analysis.
Table 5.1 GLANET main parameters for enrichment test.
Table 6.1 Data-driven Computational Experiments for GLANET
Table 6.2 Data-driven Computational Experiments for GAT
Table 6.3 Type-I error rates calculated in data-driven experiments conducted
with repressor elements, H3K27me3 and H3K9me3, in GM12878 and
K562 cell lines for α = 0.05.
Table 6.4 Type-I error rates calculated in data-driven experiments conducted
with repressor elements, H3K27me3 and H3K9me3, in GM12878 and
K562 cell lines for α = 0.001.
Table 6.5 Power calculated in data-driven experiments conducted with repressor
elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell
lines for α = 0.05.
Table 6.6 Power calculated in data-driven experiments conducted with repressor
elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell
lines for α = 0.001.
Table 6.7 One-sided Wilcoxon signed rank test results for testing whether the
Type-I error distribution of experiments generated under the parameter
setting specified in the row has lower mean of ranks compared to the
distribution of Type-I errors generated under the parameter setting specified
in the column, where the null hypothesis states that there is no difference.
A p-value presented in the cell indicates that setting in the corresponding
row has a lower mean of ranks in Type-I error distribution than the setting
in the corresponding column; if the cell is empty the opposite holds. The
p-values are less than or equal to the actual test result.
Table 6.8 Wilcoxon Signed Rank Tests for (woIF,wIF). Type-I error distribution
of wIF is less than Type-I error distribution of woIF. To decrease
Type-I error, we prefer wIF over woIF.
Table 6.9 Wilcoxon Signed Rank Tests for (EOO,NOOB). Type-I error distribution
of NOOB is less than Type-I error distribution of EOO. To decrease
Type-I error, we prefer NOOB over EOO.
Table 6.10 Summary of the random interval generation option that achieves
the lowest Type-I error for non-expressed and expressed gene intervals
using association measures EOO and NOOB and the two isochore family
options woIF and wIF.
Table 6.11 Kolmogorov-Smirnov test results. Null hypothesis states that the
distribution of GC content or mappability values calculated for 50,000
randomly sampled intervals from human genome and the corresponding
interval set are not different. Each row corresponds to Kolmogorov-Smirnov
testing of this null hypothesis. In all tests, the null hypothesis
is rejected (p-value < 2.2e-16). The first column lists the property of the
genome in question, the second column lists the distribution that is compared
with the genome, finally the last column lists the maximum distance
between the two distributions.
Table 6.12 GLANET and GAT ROC curves comparison results under (EOO,woIF)
setting.
Table 6.13 GLANET and GAT ROC curves comparison results under (EOO,wIF)
setting.
Table 6.14 GLANET and GAT ROC curves comparison results under (NOOB,woIF)
setting.
Table 6.15 GLANET and GAT ROC curves comparison results under (NOOB,wIF)
setting.
Table 6.16 We compared the winner settings from Tables 6.12-6.15 with each
other.
Table 6.17 ROC curves of different parameter settings where (woIF) setting is
on are compared.
Table 6.18 ROC curves of different parameter settings where (wIF) setting is
on are compared.
Table 6.19 ROC curves of different "Generate Random Data Options" are compared.
Table 7.1 Experiment1: Intervals of transcription factor Srf in Jurkat cell line
are overlapped with DNaseI hypersensitive sites in Jurkat cell line. Both
GAT and GLANET find enrichment of DNaseI(Jurkat) for Srf(Jurkat).
Table 7.2 Experiment2: Intervals of transcription factor Srf in Jurkat cell line
are overlapped with DNaseI hypersensitive sites in HepG2 cell line. Both
GAT and GLANET find enrichment of DNaseI(HepG2) for Srf(Jurkat).
Table 7.3 Experiment3: DNaseI hypersensitive sites in HepG2 cell line are
overlapped with DNaseI hypersensitive sites in Jurkat cell line. Both GAT
and GLANET find enrichment of DNaseI(Jurkat) for DNaseI(HepG2).
Table 7.4 Experiment4: Intervals of transcription factor Srf in Jurkat cell line
are overlapped with DNaseI hypersensitive sites in HepG2-Unique cell
line. Both GAT and GLANET find no enrichment of DNaseI(HepG2-Unique)
for Srf(Jurkat).
Table 7.5 GO semantic similarity scores calculated between the set of biological
process GO terms that GATA2 is annotated with and the set of
GO terms where GATA2 binding regions are found enriched based on
GLANET enrichment analysis in three different analysis modes (exon,
regulatory based and all-based).
Table 7.6 Elapsed CPU (user + system) run times in seconds for GLANET
and GAT runs for a given input query are provided.
Table 7.7 Elapsed wall clock times in seconds for GLANET and GAT runs
for a given input query are provided.
Table 7.8 CPU (user + system) times in seconds spent for GLANET and GAT
runs given the input query specified.
Table 7.9 Wall clock times in seconds spent for GLANET and GAT runs given
the input query specified.
Table 7.10 CPU (user + system) time in seconds spent for GLANET runs given
the input query specified. For 1,000 and 10,000 samplings, each run time
is the average of 10 individual runs.
Table 7.11 Wall clock time in seconds spent for GLANET runs given the input
query specified. For 1,000 and 10,000 samplings, each run time is the
average of 10 individual runs.
Table 8.1 Various preset values and cut-off depth decisions are compared.
Construction time and search time of indexed segment tree forest and segment
tree in wall clock time are averaged over 100 runs. P-values resulting
from paired t-test for search run times of indexed segment tree forest and
segment tree are provided.
Table A.1 GLANET data sources and their download dates.
LIST OF FIGURES
FIGURES
Figure 3.1 (a) Overall functionality of GLANET. (b) Gene-centric genomic
intervals are defined based on commonly used location analyses in ChIP-
seq and related studies [43]. GLANET uses these intervals to provide
detailed annotation of user query with respect to known genes. . . . . . . 22
Figure 3.2 Genomic intervals are represented in interval trees [44]. A separate
interval tree is constructed for each chromosome and genomic element
type, e.g. for transcription factor binding annotations. Each node contains
the low and high endpoints of the genomic interval, the color of the node
(red or black), the maximum high endpoint stored in the subtree rooted at
this node and the genomic elements annotated with this particular genomic
interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 4.1 Three main steps of regulatory sequence analysis in GLANET. . . . 29
Figure 4.2 GLANET regulatory sequence analysis for the OCD SNPs anno-
tated with TFs in the library. (a) SNP rs1891215 located at chr1:7,667,794
changes reference nucleotide A to G, and as a result, leads to a better
match to the STAT1 PFM, i.e., the p-value of the match to the STAT1 PFM
changes from 1.1e-3 to 6.1e-5. (b) SNP rs10946279 (chr6:170,553,248)
changes reference nucleotide C to T, thereby decreasing the significance
of the match to the MAX PFM, i.e., the p-value of the match increases
from 6.1e-5 to 1.5e-3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
xxiii
Figure 5.1 Box plots of GC content and mappability values for ten different
ENCODE files, for each element type. . . . . . . . . . . . . . . . . . . . 33
Figure 6.1 Design for data-driven computational experiments for expressed
genes. N is set to 1000. Activator elements are defined as H2AZ, H3K27ac,
H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1,
H4K20me1, [8] and POL2; whereas H3K27me3 and H3K9me3 constitute
the repressor elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Figure 6.2 Design for data-driven computational experiments for non-expressed
genes. N is set to 1000. Activator elements are defined as H2AZ, H3K27ac,
H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1,
H4K20me1, [8] and POL2; whereas H3K27me3 and H3K9me3 constitute
the repressor elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Figure 6.3 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Isochore
Family (wIF) heuristic using K562, (Non-expressed Genes, Completely-
Discard) and (Expressed Genes, Top5) results, for significance level of
0.05. (c, d) Type-I error and power estimated without Isochore Family
(woIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard)
and (Expressed Genes, Top5) results, for significance level of 0.05. . . . . 51
Figure 6.4 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Iso-
chore Family (wIF) heuristic using GM12878, (Non-expressed Genes,
CompletelyDiscard) and (Expressed Genes, Top5) results, for significance
level of 0.05. (c, d) Type-I error and power estimated without Isochore
Family (woIF) heuristic using GM12878, (Non-expressed Genes, Com-
pletelyDiscard) and (Expressed Genes, Top5) results, for significance level
of 0.05. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
xxiv
Figure 6.5 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Isochore
Family (wIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest)
and (Expressed Genes, Top20) results, for significance level of 0.05. (c, d)
Type-I error and power estimated without Isochore Family (woIF) heuris-
tic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed
Genes, Top20) results, for significance level of 0.05.
Figure 6.6 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Isochore
Family (wIF) heuristic using GM12878, (Non-expressed Genes, TakeTh-
eLongest) and (Expressed Genes, Top20) results, for significance level
of 0.05. (c, d) Type-I error and power estimated without Isochore Fam-
ily (woIF) heuristic using GM12878, (Non-expressed Genes, TakeThe-
Longest) and (Expressed Genes, Top20) results, for significance level of
0.05.
Figure 6.7 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Isochore
Family (wIF) heuristic using K562, (Non-expressed Genes, Completely-
Discard) and (Expressed Genes, Top5) results, for significance level of
0.001. (c, d) Type-I error and power estimated without Isochore Family
(woIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard)
and (Expressed Genes, Top5) results, for significance level of 0.001.
Figure 6.8 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Iso-
chore Family (wIF) heuristic using GM12878, (Non-expressed Genes,
CompletelyDiscard) and (Expressed Genes, Top5) results, for significance
level of 0.001. (c, d) Type-I error and power estimated without Iso-
chore Family (woIF) heuristic using GM12878, (Non-expressed Genes,
CompletelyDiscard) and (Expressed Genes, Top5) results, for significance
level of 0.001.
Figure 6.9 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Isochore
Family (wIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest)
and (Expressed Genes, Top20) results, for significance level of 0.001.
(c, d) Type-I error and power estimated without Isochore Family (woIF)
heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed
Genes, Top20) results, for significance level of 0.001.
Figure 6.10 Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles
are marked with ∗. (a, b) Type-I error and power estimated with Isochore
Family (wIF) heuristic using GM12878, (Non-expressed Genes, TakeTh-
eLongest) and (Expressed Genes, Top20) results, for significance level of
0.001. (c, d) Type-I error and power estimated without Isochore Fam-
ily (woIF) heuristic using GM12878, (Non-expressed Genes, TakeThe-
Longest) and (Expressed Genes, Top20) results, for significance level of
0.001.
Figure 6.11 Comparison of GLANET and GAT with respect to data-driven
computational experiments in terms of Type-I Error and Power for sig-
nificance level of 0.05. GLANET(wIF,wGC) and GAT(wIF) parameter
settings results are used. Results for the two association statistics - exis-
tence of overlap (EOO) and the number of overlapping bases (NOOB)
are displayed. (a, b) Type-I error and power of activator elements in
(Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5)
experiment settings, respectively. (c, d) Type-I error and power of re-
pressor elements in (Expressed Genes, Top5) and (Non-expressed Genes,
CompletelyDiscard) experiment settings, respectively. GLANET achieves
higher power for H3K9me3 than GAT.
Figure 6.12 Comparison of GLANET and GAT with respect to data-driven
computational experiments in terms of Type-I Error and Power for significance
level of 0.05. GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used.
Results for the two association statistics, existence of overlap (EOO) and the number
of overlapping bases (NOOB), are displayed. (a, b) Type-I error and power of activator
elements in (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20)
experiment settings, respectively. (c, d) Type-I error and power of repressor elements
in (Expressed Genes, Top20) and (Non-expressed Genes, TakeTheLongest) experiment
settings, respectively.
Figure 6.13 Comparison of GLANET and GAT with respect to data-driven
computational experiments in terms of Type-I Error and Power for significance
level of 0.001. GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used.
Results for the two association statistics, existence of overlap (EOO) and the number
of overlapping bases (NOOB), are displayed. (a, b) Type-I error and power of activator
elements in (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5)
experiment settings, respectively. (c, d) Type-I error and power of repressor elements
in (Expressed Genes, Top5) and (Non-expressed Genes, CompletelyDiscard) experiment
settings, respectively.
Figure 6.14 Comparison of GLANET and GAT with respect to data-driven
computational experiments in terms of Type-I Error and Power for significance
level of 0.001. GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used.
Results for the two association statistics, existence of overlap (EOO) and the number
of overlapping bases (NOOB), are displayed. (a, b) Type-I error and power of activator
elements in (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20)
experiment settings, respectively. (c, d) Type-I error and power of repressor elements
in (Expressed Genes, Top20) and (Non-expressed Genes, TakeTheLongest) experiment
settings, respectively.
Figure 6.15 Violin plots for (a) GC of randomly sampled intervals from hu-
man genome, GC of intervals of GM12878 non-expressed genes and ex-
pressed genes. (b) Mappability of randomly sampled intervals from hu-
man genome, mappability of intervals from non-expressed and expressed
gene-sets of GM12878.
Figure 6.16 Violin plots for (a) GC of randomly sampled intervals from hu-
man genome, GC of intervals of K562 non-expressed genes and expressed
genes. (b) Mappability of randomly sampled intervals from human genome,
mappability of intervals from non-expressed and expressed gene-sets of
K562.
Figure 6.17 ROC Curves for (a) H3K9ME3 in K562 under parameter (NOOB,
woIF) and experiment (CompletelyDiscard, Top5) (b) H3K9ME3 in K562
under parameter (NOOB, wIF) and experiment (CompletelyDiscard, Top5)
settings.
Figure 6.18 ROC Curves for (a) H4K20ME1 in GM12878 under parameter
(NOOB, wIF) and experiment (CompletelyDiscard, Top5) (b) H4K20ME1
in K562 under parameter (NOOB, wIF) and experiment (CompletelyDiscard,
Top5) settings.
Figure 6.19 ROC Curves for (a) H3K4ME1 in K562 under parameter (NOOB,
woIF) and experiment (TakeTheLongest, Top20) (b) H3K4ME1 in K562
under parameter (NOOB, woIF) and experiment (CompletelyDiscard, Top5)
settings.
Figure 6.20 ROC Curves for (a) POL2 in GM12878 under parameter (EOO,
woIF) and experiment (CompletelyDiscard, Top5) (b) POL2 in K562 un-
der parameter (NOOB, wIF) and experiment (TakeTheLongest, Top20)
settings.
Figure 7.1 GLANET and GAT are run on four experiments ranging from high
to low expected association between the compared genomic interval sets.
Each row depicts an experiment where the first set is input query and the
second set is a genomic element in the annotation library, e.g., experiment
Srf(Jurkat) vs. DNaseI(Jurkat) evaluates whether the binding regions of
transcription factor Srf in Jurkat cells are enriched for DNaseI accessible,
i.e., open chromatin, regions in the same cells.
Figure 8.1 Intervals (s1, s2, s3, s4, s5) are stored in the nodes. The arrows from
the nodes point to their canonical subsets.
Figure 8.2 Blue colored segment tree nodes at cut-off depth and red colored
nodes with no children at depth above the cut-off depth are stored in our
segment tree forest. To enhance fast access, these stored segment tree
nodes are connected to each other through forward and backward links.
Figure 8.3 Segment tree nodes with the same index are stored in a BST and
index now points to the root of BST. Blue and red colored nodes are origi-
nal segment tree nodes which are linked to each other. Blue colored nodes
are in fact the roots of the segment trees below them. Red colored nodes
do not have any children. Parents of these blue and red colored nodes are
the artificial nodes.
Figure 8.4 Searching the nodes pointed to by lowIndex and highIndex, the
nodes in between them, and at most two more nodes is enough.
Figure B.1 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in GM12878 for (EOO,CompletelyDiscard,Top5).
Figure B.2 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in GM12878 for (NOOB,CompletelyDiscard,Top5).
Figure B.3 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in K562 for (EOO,CompletelyDiscard,Top5).
Figure B.4 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in K562 for (NOOB,CompletelyDiscard,Top5).
Figure B.5 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in GM12878 for (EOO,TakeTheLongest,Top20).
Figure B.6 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in GM12878 for (NOOB,TakeTheLongest,Top20).
Figure B.7 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in K562 for (EOO,TakeTheLongest,Top20).
Figure B.8 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves
for H4K20ME1 in K562 for (NOOB,TakeTheLongest,Top20).
LIST OF ALGORITHMS
ALGORITHMS
Algorithm 5.1 generateRandomIntervals
Algorithm 8.1 findingNCommonOverlappingIntervalsForNIntervalSets
Algorithm 8.2 search
Algorithm 8.3 mainSearch
Algorithm 8.4 searchAtLinkedNode
Algorithm 8.5 searchForward
Algorithm 8.6 searchBackward
Algorithm 8.7 searchDownward
Algorithm 8.8 searchAtLowerNode
Algorithm 8.9 searchAtHigherNode
Algorithm 8.10 findingAtLeastKCommonOverlappingIntervalsForNIntervalSets
Algorithm 8.11 fillEndPointsAndIntervals
Algorithm 8.12 sortEndPoints: Sort allEndPoints in ascending order
Algorithm 8.13 constructSegmentTree: Using sortedAllEndPoints
Algorithm 8.14 storeIntervals: One interval set at a time
Algorithm 8.15 findAtLeastK
LIST OF ABBREVIATIONS
ABBRV Abbreviation
GLANET Genomic Loci ANnotation and Enrichment Tool
ENCODE Encyclopedia of DNA Elements
NGS Next Generation Sequencing
ChIP-seq Chromatin Immunoprecipitation Sequencing
BS-seq Bisulfite Sequencing
GWAS Genome Wide Association Studies
SNP Single-Nucleotide Polymorphism
CNV Copy Number Variation
LD Linkage Disequilibrium
DHSs DNaseI Hypersensitive Sites
TF Transcription Factor
TFBS Transcription Factor Binding Sites
HM Histone Modification
KEGG Kyoto Encyclopedia of Genes and Genomes
GO Gene Ontology
GUI Graphical User Interface
DNA Deoxyribonucleic Acid
RNA Ribonucleic Acid
RSA Regulatory Sequence Analysis
RSAT Regulatory Sequence Analysis Tool
PFM Position Frequency Matrix
TSS Transcription Start Site
UTR Untranslated Region
TPM Transcripts Per Million
DDCE Data Driven Computational Experiments
IF Isochore Family
wGC with GC
wM with Mappability
wGCM with GC and Mappability
woGCM without GC and Mappability
wIF with Isochore Family
woIF without Isochore Family
EOO Existence of Overlap
NOOB Number of Overlapping Bases
ROC Receiver Operating Characteristic
AUC Area Under Curve
OCD Obsessive Compulsive Disorder
BST Binary Search Tree
JOF Joint Overlap Analysis Framework
JEF Joint Enrichment Analysis Framework
VCF Variant Call Format
miRNA MicroRNA
CHAPTER 1
INTRODUCTION
High-throughput sequencing technologies are routinely used for cataloging genomic
variants [1, 2, 3], profiling protein-DNA interactions and histone modifications (e.g.,
ChIP-seq [4]), measuring DNA methylation (e.g., BS-seq [5]), and mapping accessible
chromatin (e.g., DNase-seq [6], ATAC-seq [7]). Analyses of these experiments reveal sets of genomic
intervals. Assessing the functional relevance of these genomic intervals requires inte-
grating them with already known genomic and epigenomic annotations. For example,
functional interpretation of a list of single nucleotide polymorphisms (SNPs) or copy
number variation (CNV) regions requires evaluating whether these genomic varia-
tion sites reside in gene coding regions, transcription factor binding sites, or histone
modification sites, or assessing whether the list is enriched with one or more path-
ways. Similarly, individual research groups often profile protein-DNA interactions
or histone modifications. A routine practice is to query resulting genomic intervals
against available consortia-derived genomic annotations such as those generated in
ENCODE [8] or against other available data generated by the research group. This
thesis develops tools and techniques that facilitate such analyses.
The work carried out in this thesis falls into two main parts. In the first part, we
concentrate on the annotation of genomic regions and on an enrichment analysis that
adjusts for genomic biases, using efficient data structures that store data at varying
resolutions. For this purpose, we developed GLANET as both an annotation and an
enrichment tool. Additionally, in order to assess the performance of its enrichment
procedure, we designed novel data-driven computational experiments. In the second
part of the thesis, we extend the annotation of genomic intervals to finding at least k
or all n common overlapping intervals from n given interval sets. To reduce the search
time, we propose novel data structures that also make further parallel computation
possible.
Several tools are available for annotation and enrichment analysis of genomic regions.
They differ in the types of input they accept, the annotation libraries and enrichment
tests they use, and the downstream analyses, if any, that they enable.
FunciSNP [9], HaploReg [10], ALIGATOR [11], Annotate-it [12], PANOGA [13]
and FORGE [14] only accept SNPs as input. ENCODE ChIP-Seq Significance Tool
[15] is similarly limited by providing annotation and enrichment only for input gene
lists. RegulomeDB [16], SnpEff [17], Ensembl SNP Effect Predictor (VEP) [18],
ANNOVAR [19] and FunciSNP do not provide enrichment analysis.
There are a few tools available for annotation and enrichment analysis of longer ge-
nomic intervals [20, 21, 22]. These are generally restricted by the annotation libraries
they utilize. For example, INRICH tests for enrichment of only pre-defined gene sets
[20]. GREAT [21] takes a set of non-coding genomic regions and provides analysis
with respect to the annotations of nearby genes. The enrichment analysis in GREAT
does not take into account potential genomic biases involved in generation of the in-
put genomic regions. In contrast, GAT [22] is more flexible. It takes as input genomic
intervals and user-provided annotation libraries. Compared to INRICH and GREAT,
GAT enables users to input a workspace to define a subset of the genome for estimating
an appropriate null distribution during enrichment analysis. However, GAT’s built-in
capabilities are restricted, and it does not work with gene-sets. Furthermore, it relies
on the user to define and provide input files to specify where the random samples will
be generated from. This knowledge, however, is often not available to the user.
In summary, the existing tools have a number of notable shortcomings. Firstly, the
majority of the tools are specific to inferring the potential functionality of a given set of
SNPs and do not accommodate longer genomic intervals, such as those resulting from
ChIP-seq and BS-seq experiments or insertion and deletion variants. Secondly, most of
these tools do not account for systematic biases such as mappability and GC content
introduced by the sequencing technologies [23, 24, 25, 26, 27, 28]. Thirdly, gene set
or pathway enrichment tools do not support analysis with non-coding upstream and
downstream regions of the genes. Finally, and perhaps most importantly, they work
with fixed annotation libraries and do not enable users to add their own libraries
for annotation and enrichment. The lack of such a feature limits the analyses that can
be accomplished with these tools.
We developed GLANET as an annotation and enrichment tool with several useful
built-in analysis capabilities for the human genome. The GLANET annotation library
includes a rich set of genomic information: (i) regions defined on and in the neigh-
borhood of coding regions that encompass regulatory regions; (ii) ENCODE-derived
potential regulatory regions that encompass binding sites for multiple transcription
factors, DNaseI hypersensitive sites, modification regions for multiple histones across
a wide variety of cell types; and (iii) gene sets derived from KEGG [29] pathways and
GO [30] terms. Users can easily annotate their input intervals with the genomic el-
ements defined in the annotation library and expand the GLANET library by adding
user-defined libraries and/or pre-defined gene sets.
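At its core, annotating an input interval means finding the library elements it overlaps. The following is a minimal sketch of that idea, not GLANET's implementation: it uses a plain linear scan, assumes half-open intervals on a single chromosome, and assumes the annotation intervals do not overlap each other. The function name `overlap_statistics` is hypothetical; the two returned values mirror the existence of overlap (EOO) and number of overlapping bases (NOOB) statistics named in the list of abbreviations.

```python
def overlap_statistics(query, annotation):
    """Return (existence_of_overlap, number_of_overlapping_bases).

    query      -- one half-open interval (start, end)
    annotation -- list of half-open intervals on the same chromosome,
                  assumed not to overlap each other
    """
    q_start, q_end = query
    bases = 0
    for a_start, a_end in annotation:
        # Length of the intersection; zero when the intervals are disjoint.
        bases += max(0, min(q_end, a_end) - max(q_start, a_start))
    return bases > 0, bases


library = [(0, 12), (15, 18), (30, 40)]
print(overlap_statistics((10, 20), library))
print(overlap_statistics((20, 30), library))
```

A linear scan is enough to illustrate the statistics; a tool that handles genome-scale libraries would need an indexed search structure instead.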
In order to evaluate whether the input intervals overlap significantly with the genomic
elements in the GLANET annotation library, GLANET implements an enrichment
procedure that accounts for mappability [23, 24, 25] and GC content [26, 27, 28] bi-
ases inherent to NGS. When the input intervals are derived from an NGS experiment,
these biases constrain the regions of the genome that can contribute to interval
generation. Few of the existing tools account for these biases. For example, FORGE [14]
randomly samples SNPs from regions that match the GC content of the input SNPs
to estimate a null distribution for enrichment testing. GAT [22] divides the genome
into isochore families that have similar GC content and performs sampling for each
isochore separately and, as a result, provides a coarse level matching of GC content.
GLANET estimates a null model from randomly sampled intervals that match each
interval of the input in terms of chromosome, length, mappability, and GC content as
opposed to operating on the average properties of the input intervals. Although this
sampling strategy is computationally intensive, GLANET conducts these analyses
rapidly by deploying efficient search strategies enabled by appropriately constructed
representations of the genomic intervals.
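The per-interval matching described above can be sketched as simple rejection sampling. This is an illustrative sketch under stated assumptions rather than GLANET's procedure: the chromosome is a plain string, intervals are half-open, only length and GC content are matched (mappability would need a per-base score track and is omitted), and the names `sample_matched_interval`, `gc_tol` and `max_tries` are hypothetical.

```python
import random


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sample_matched_interval(chrom_seq, start, end, gc_tol=0.1,
                            max_tries=10_000, rng=random):
    """Draw a random interval on the same chromosome with the same length
    and a GC fraction within gc_tol of the input interval's GC fraction.

    Falls back to the last candidate drawn if no match is found.
    """
    length = end - start
    target_gc = gc_fraction(chrom_seq[start:end])
    candidate = None
    for _ in range(max_tries):
        new_start = rng.randrange(0, len(chrom_seq) - length + 1)
        candidate = (new_start, new_start + length)
        window = chrom_seq[new_start:new_start + length]
        if abs(gc_fraction(window) - target_gc) <= gc_tol:
            return candidate
    return candidate


rng = random.Random(0)
toy_chromosome = "".join(rng.choice("ACGT") for _ in range(5000))
print(sample_matched_interval(toy_chromosome, 100, 200, rng=rng))
```

Matching each interval individually, rather than matching the average GC content of the whole input, is what makes the sampling expensive: every input interval triggers its own search for a comparable region.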
GLANET additionally provides several built-in analysis tools for specific input types.
When the input is a SNP list, users can evaluate whether the SNPs reside in transcrip-
tion factor binding regions and, if so, whether they are located in the actual transcrip-
tion factor binding motifs obtainable via either the reference or the SNP allele and
whether the variation potentially impacts the binding of TFs, either by enhancing or
disrupting binding motifs. GLANET enables joint enrichment analysis for transcrip-
tion factor binding and KEGG pathways. With this option, users can evaluate whether
the input set is enriched concurrently with binding sites of TFs and the genes within a
KEGG pathway. This joint enrichment analysis provides a detailed functional inter-
pretation of the input loci.
Beyond providing a comprehensive tool that can help answer a variety of questions,
this thesis contributes the design of data-driven computational experiments for
evaluating GLANET’s enrichment procedure. In order to assess the statistical power
and Type-I error of GLANET across its available parameter settings, we designed
data-driven computational experiments using large collections of ENCODE ChIP-seq
and RNA-seq data. These computational experiments indicated that, although the
GLANET enrichment test is often conservative in terms of Type-I error, it has high
statistical power. We present comparisons of GAT and GLANET and illustrate
applications of GLANET within different biological contexts.
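Type-I error and power are both rejection rates at a chosen significance level: the former over runs in which no association is expected, the latter over runs in which an association is expected. The sketch below illustrates the calculation only; the p-values are made-up numbers for illustration, not results from the thesis, and the name `rejection_rate` is hypothetical.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of runs whose p-value is at most alpha."""
    if not p_values:
        raise ValueError("need at least one p-value")
    return sum(p <= alpha for p in p_values) / len(p_values)


# Hypothetical p-values (illustration only, not thesis results).
# Runs where no association is expected:
null_runs = [0.40, 0.03, 0.75, 0.62, 0.91, 0.20, 0.55, 0.08, 0.33, 0.47]
# Runs where an association is expected:
alt_runs = [0.001, 0.004, 0.020, 0.300, 0.010, 0.002, 0.049, 0.006, 0.030, 0.015]

type_one_error = rejection_rate(null_runs, alpha=0.05)  # false rejections
power = rejection_rate(alt_runs, alpha=0.05)            # correct rejections
print(type_one_error, power)
```

A test is conservative when its estimated Type-I error stays below the nominal significance level.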
In the second part of the thesis, we provide solutions to the problem of finding common
overlapping intervals across n interval sets. This algorithmic problem is critical when
analyzing genomic intervals originating from multiple data sets. We divide the problem
into two sub-problems: finding n common overlapping intervals and finding at least k
common overlapping intervals for n interval sets. For the first sub-problem, we
construct one segment tree for each interval set and then convert each segment tree
into an indexed segment tree forest. We observed that this representation reduces the
search time. For the second sub-problem, we propose constructing one segment tree
for all n interval sets and finding the overlapping intervals immediately after the
construction of the segment tree is completed.
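To make the problem statement concrete, the sketch below finds the regions covered by at least k of n interval sets using a plain sweep over sorted endpoints. It is a baseline for illustration and deliberately not the segment tree approach developed in this thesis. It assumes non-empty half-open intervals on a single chromosome and k of at least 1; the function names are hypothetical. Setting k equal to n gives the regions common to all sets.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals within one set."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_by_at_least_k(interval_sets, k):
    """Return maximal half-open regions covered by at least k of the sets."""
    events = []
    for intervals in interval_sets:
        # Merging first ensures each set adds at most 1 to the depth.
        for start, end in merge_intervals(intervals):
            events.append((start, 1))   # one more set covers from here
            events.append((end, -1))    # one fewer set covers from here
    events.sort()  # at equal positions, ends (-1) come before starts (+1)

    regions, depth, region_start = [], 0, None
    for position, delta in events:
        previous, depth = depth, depth + delta
        if previous < k <= depth:        # coverage rises to k
            if regions and regions[-1][1] == position:
                region_start = regions.pop()[0]  # glue adjacent regions
            else:
                region_start = position
        elif depth < k <= previous:      # coverage drops below k
            regions.append((region_start, position))
    return regions


sets = [[(1, 10)], [(5, 15)], [(8, 20)]]
print(covered_by_at_least_k(sets, 3))
print(covered_by_at_least_k(sets, 2))
```

The sweep sorts all endpoints for every query, which is the kind of repeated work that a prebuilt, indexed search structure is meant to avoid.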
We can summarize our contributions in this thesis as follows:
• We develop a comprehensive annotation and enrichment tool with a rich set of
functionalities for the human genome. GLANET’s open source code is available
with a comprehensive user’s manual and other supporting materials.
• We design novel data-driven computational experiments for assessing the Type-I
error rate and power of GLANET’s enrichment analysis. We show that GLANET
has low Type-I error and high statistical power, and that it is sensitive to varying
experiment settings, parameter settings, and significance levels. The data-driven
computational experiments are also instrumental for assessing the enrichment
capabilities of other tools. Towards this aim, we conduct extensive experiments to
compare GLANET with existing enrichment tools with similar functionality.
• We present an algorithmic framework for finding n common overlapping intervals
and finding at least k overlapping intervals over n given interval sets. For this
problem, indexed short segment tree forests are constructed in lieu of one tall
segment tree, which leads to a reduction in search time. This representation is
inherently well suited for parallelization.
The rest of the thesis is organized as follows. In Chapter 2, we provide the necessary
background information on the biological terms that are used throughout the thesis, and
we present an up-to-date overview of related work. In Chapter 3 and Chapter 4, we
describe in detail two main functionalities of GLANET: (i) annotation of genomic
intervals and (ii) Regulatory Sequence Analysis (RSA) of SNPs, respectively. We
dedicate Chapter 5 to enrichment analysis and Chapter 6 to data-driven computational
experiments. In Chapter 7, we present various scenarios to showcase the extensive
built-in capabilities of GLANET, along with runtime comparisons. In Chapter 8,
we propose our solutions for finding n common overlapping intervals and at least k
such intervals over n interval sets. Finally, we conclude the thesis in Chapter 9 with
a final discussion and remarks on possible future directions.
CHAPTER 2
BIOLOGICAL BACKGROUND AND RELATED WORK
We utilize many biological terms throughout the thesis. In Section 2.1, we describe
them briefly from a computer engineer’s point of view. Next, we review the related
work in two separate sections, one for each part of the thesis introduced in Chapter 1.
In Section 2.2, we provide an overview of existing tools for genomic annotation and
enrichment analysis. In Section 2.3, we summarize the related work on finding common
overlapping intervals for n given interval sets.
2.1 Biological Terms
DNA Deoxyribonucleic acid (DNA) is the genetic material in almost all organisms
including humans.
RNA Like DNA, ribonucleic acid (RNA) is essential and performs many functions
in the cell. Unlike DNA, it is single stranded, and there exist many different
types of RNA.
mRNA Messenger RNA (mRNA) carries the genetic information necessary for the
synthesis of proteins. After transcription, the resulting mRNA is translated into
a protein.
Genome The genome is the complete set of genetic material of an organism, including its
genes. The whole human genome is a DNA sequence of more than 3 billion base
pairs, which resides in the cell nucleus.
Chromosome DNA and histone proteins are supercoiled and packaged into dense
structures called chromosomes. The human genome has 23 pairs of chromosomes
in somatic cells and 23 chromosomes in gametes (egg and sperm cells).
Cell The cell is the smallest basic unit of all organisms. A cell contains the whole genome
and has the ability to replicate itself. The human body contains trillions of cells.
Gene The gene is the key functional and physical unit in the genome. Proteins are
made according to the instructions that genes carry on the DNA. The human genome
contains around 25,000 genes.
Eukaryotes Eukaryotic organisms such as fungi, plants and animals have membrane-bound
organelles, and their genetic material is enclosed within a membrane-bound
nucleus in their cells.
Prokaryotes Prokaryotes such as bacteria are single-celled organisms with neither a
nucleus nor membrane-bound organelles.
Transcript Level It is the level at which a gene’s DNA is transcribed into mRNA.
Gene Structure We concentrate on gene structure in eukaryotes. Promoter regions
regulate gene expression. Promoters are upstream of the coding region, and genes
cannot be expressed without promoters. Enhancers can increase the expression of
genes; they often lie far upstream of the promoters, but they can also lie between
the genes and downstream of the genes. Coding regions are the exons of the genes,
which are transcribed into mRNAs. Non-coding regions are the introns of the genes
and the intergenic regions between the genes, which contain promoters and
enhancers that regulate gene expression. The Transcription Start Site (TSS) marks
the beginning of the 5’ untranslated region (5’UTR), after which the coding region
starts. The 3’ untranslated region (3’UTR) follows the end of the coding region and
extends to the transcription stop site. Therefore, the coding region of a gene lies
between the 5’UTR and the 3’UTR. We adopted the gene-centric regions as they are
depicted in Figure 3.1b.
NGS Next Generation Sequencing (NGS), also known as high-throughput sequencing,
sequences DNA and RNA. It is faster and cheaper, needs less DNA, and is more
accurate and reliable than the formerly used Sanger sequencing. It has had a
revolutionary effect on genomics and molecular biology research.
GC Content GC content is the ratio of the total number of guanines and cytosines to the
total number of bases in a given sequence:
$$\text{GC Content} = \frac{G + C}{A + T + G + C}$$
Mappability Mappability measures how uniquely a sequence can be mapped to a region of
the genome; it is also known as uniqueness.
Genetic Variation Genetic variations are the differences in DNA sequences in each
of our genomes. Here are the major types of variations:
• Mutations are changes in one or more nucleotides that arise at random or due to
environmental conditions. Mutations are less frequent than SNPs.
• Single Nucleotide Polymorphisms (SNPs) are like typos: each SNP is
one nucleotide that differs from the reference genome. They are the most
common type of genetic variation in the human genome; on average, there is
one SNP every 300 nucleotides.
• Copy Number Variations (CNVs) are structural genomic variants. Some
DNA sequences repeat themselves, and CNVs are variations in the number of
these repeats, which can result in deletions or insertions in the genome.
GWAS Genome-wide association studies (GWAS) investigate the genetic variations
associated with a certain phenotype. This method looks for the genetic variations
that occur more frequently in genomes with a certain phenotype than in
genomes without the phenotype.
LD Two or more polymorphic loci are in linkage disequilibrium (LD) if their
respective alleles do not associate independently (randomly). In other words,
linkage disequilibrium describes the dependent (nonrandom) association be-
tween pairs of alleles at different loci [31].
TFs Transcription factors are proteins that bind to specific DNA sequences, called
promoter and enhancer regions, and initiate and regulate the transcription of genes.
They increase or decrease the transcript level of genes, which results in expressed
or non-expressed genes, respectively. Expressed genes are turned on and produce
proteins, whereas non-expressed genes are turned off and do not produce proteins.
HMs Histones are the proteins around which DNA is wrapped, and they play a role in
regulating gene expression. Histone modifications are chemical changes to the histone
proteins that affect gene regulation.
Pathway A pathway is a collection of manually drawn diagrams depicting the cur-
rent knowledge on molecular interactions, reactions and relations of the biolog-
ical functions.
GO Term Gene Ontology is a structured vocabulary that represents molecular func-
tion, biological process or cellular component.
VCF Variant Call Format (VCF) is a standardized format for storing DNA polymorphism
data such as SNPs, insertions, deletions and structural variants. It was developed
for the 1000 Genomes Project [32].
miRNA Human microRNAs (miRNAs) are evolutionarily conserved, short, non-coding,
single-stranded RNA molecules. They are involved in gene regulation, are implicated
in many human diseases, and represent promising therapy options [33].
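Two of the terms defined in this section, GC content and linkage disequilibrium, are quantitative and can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the GC function ignores symbols other than A, C, G and T, and the LD function uses the standard textbook definitions of D, |D′| and r² for two biallelic loci, computed from haplotype and allele frequencies. The function and parameter names are hypothetical.

```python
def gc_content(sequence):
    """(G + C) / (A + T + G + C), ignoring other symbols such as N."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab -- frequency of the haplotype carrying allele A and allele B
    p_a  -- frequency of allele A (strictly between 0 and 1)
    p_b  -- frequency of allele B (strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b  # zero when the alleles associate independently
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


print(gc_content("GGCCAATT"))
print(ld_measures(0.5, 0.5, 0.5))    # alleles always travel together
print(ld_measures(0.25, 0.5, 0.5))   # alleles associate independently
```

When the haplotype frequency equals the product of the allele frequencies, D is zero and the loci are in linkage equilibrium; any departure from that product is what the LD definition above calls a nonrandom association.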
2.2 Related Work Regarding Thesis Part 1
Regarding the first part of the thesis, various tools provide enrichment
and/or annotation analysis on given genomic intervals. In this section, we describe
these tools in more detail. We also provide a comprehensive
summary of these tools in Tables 2.1-2.3.
We broadly classify the available tools into two main classes with respect to their
annotation and enrichment functionality. Notably, the majority of these tools are
specific to inferring the functionality of a given set of SNPs and do not accommodate
genomic loci of variable lengths obtained from NGS experiments such as ChIP-seq, BS-seq,
or CNV analysis. FunciSNP [9], HaploReg [10], ALIGATOR [11], Annotate-it [12],
PANOGA [13] and FORGE [14] only accept SNPs as input. Moreover, some tools
provide only annotation, not enrichment analysis: RegulomeDB [16], SnpEff [17],
Ensembl SNP Effect Predictor (VEP) [18], ANNOVAR [19] and FunciSNP do not
provide enrichment analysis. ENCODE ChIP-Seq Significance Tool [15] is similarly
limited by providing annotation and enrichment only for input gene lists. GREAT
[21], INRICH [20] and GAT [22] are the tools available for annotation and enrichment
analysis of longer genomic intervals.
FunciSNP identifies candidate regulatory SNPs of a GWAS [9]. As inputs, it takes GWAS
SNPs (tagSNPs) and a set of user-defined ENCODE ChIP-seq peak files (biofeatures)
that are known to be related to the disease/phenotype of the GWAS. It considers all
the SNPs within a certain window around the tagSNPs and, after overlapping them with
the given biofeatures, prioritizes only the overlapping SNPs by calculating the
linkage disequilibrium (LD) measures r² and D′. FunciSNP does not perform enrichment
analysis of functional elements or of any predefined gene sets; rather, it tries to
identify the candidate regulatory SNPs. HaploReg selects and displays the causal SNPs
within the same LD block that are enriched with a cell specific DNaseI hypersensitive
site and enhancer [10]. HaploReg gets the necessary LD information from the 1000
Genomes Project and provides r² and D′ measurements for all genomic variants and
their linked SNPs, which can be visualized along with their predicted chromatin state
in nine cell types, their conservation across mammals and their effect on regulatory
motifs. ALIGATOR takes LD-pruned SNPs of a GWAS and analyzes the enrichment of GO
pathways [11]. It calculates GO pathway specific empirical p-values and corrects
these empirical p-values for multiple testing using a bootstrap approach. Annotate-it
is aimed at experimentalists and enables them to load their samples and compare
variation among the samples [12]. It focuses only on single nucleotide variants and
annotates the variants with their possible consequences on the transcripts of a
certain gene, such as nonsense, essential splice site, nonsynonymous, synonymous and
UTR. FORGE outputs the enriched cell and tissue specific ENCODE derived DNA elements
for the given SNPs of a GWAS by using ChIP-Seq hotspots instead of peaks [14]. The
FORGE analysis tool annotates the given GWAS SNPs with functional elements from
either the ENCODE or the Roadmap Epigenomics project that are generated by the
Hotspot method, because hotspots reveal more tissue specific signal. For the given
SNP set, the number of overlaps is counted; background SNP sets are then created and
their overlaps are counted in the same way, and the enrichment of the given SNP set
is expressed as a z-score.
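The z-score mentioned above can be illustrated with a minimal sketch. It assumes that the overlap count of the given SNP set and the overlap counts of the background SNP sets are already available; the function name, the toy numbers and the use of the sample standard deviation are assumptions made for illustration, not details taken from FORGE.

```python
import statistics


def enrichment_z_score(observed_overlaps, background_overlaps):
    """Z-score of an observed overlap count against background overlap counts.

    background_overlaps holds one overlap count per background SNP set
    (hypothetical input; how the background sets are drawn is not shown here).
    """
    mean = statistics.mean(background_overlaps)
    sd = statistics.stdev(background_overlaps)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("background overlap counts have zero variance")
    return (observed_overlaps - mean) / sd


# Toy numbers: 18 overlaps for the given SNP set, five background sets.
background = [10, 12, 8, 11, 9]
z = enrichment_z_score(18, background)
print(round(z, 2))
```

A large positive z-score indicates that the given SNP set overlaps the elements more often than the background sets do.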
The ENCODE ChIP-Seq Significance Tool identifies the enriched ENCODE transcription
factors from a list of protein-coding genes, protein-coding transcripts, pseudogenes
or pseudotranscripts using a one-tailed hypergeometric test [15]. RegulomeDB scores
given genomic variants to assess their regulatory potential by counting the number of
different types of functional elements that overlap them [16]. Using a simple
heuristic, RegulomeDB scores the genomic variants such that a variant with a lower
score is more likely to be located in a functional region, since it has more overlaps
with known and predicted functional elements, whereas a variant with a higher score
is less likely to be located in one. Known and predicted functional elements include
DNaseI hypersensitivity regions, transcription factor binding sites and promoter
regions that have been biochemically characterized to regulate transcription. In
fact, RegulomeDB is a database that annotates genomic variants with known and
predicted functional elements in the intergenic regions of the human genome. The
RegulomeDB database includes public datasets from GEO, the ENCODE project and the
published literature. However, RegulomeDB does not provide enrichment analysis of
functional elements or predefined gene sets; instead, it categorizes and prioritizes
the given genomic variants with its scoring system. SnpEff annotates coding and
non-coding genomic variants; when a genomic variant hits an exon, it calculates the
coding effect of the variant, such as a codon change or an amino acid change [17]. It
performs neither functional element nor predefined gene set enrichment analysis.
Variant Effect Predictor (VEP) is an Ensembl [34] API that predicts the effects of
variants, such as amino acid and codon changes [18]. ANNOVAR annotates the given
genomic variants in a gene-based manner to identify the variants that cause amino
acid changes, in a region-based manner to identify variants in specific genomic
regions, and in a filter-based manner to identify the variants that are filtered
against pre-computed functional importance scores (such as the SIFT score) [19].
ANNOVAR aims to pinpoint functionally important genomic variants for autosomal
dominant diseases.
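As an illustration of the one-tailed hypergeometric test mentioned above for gene lists, the following sketch computes the upper-tail probability directly from binomial coefficients. The function name, the parameter names and the toy numbers are assumptions; the sketch does not reproduce the actual implementation of the ENCODE ChIP-Seq Significance Tool.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for a hypergeometric random variable X.

    population: number of background genes
    successes:  background genes that are targets of the transcription factor
    draws:      size of the input gene list
    observed:   input genes that are targets of the transcription factor
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 of them targets, a list of 3 genes, 2 of them targets.
p_value = hypergeom_upper_tail(population=10, successes=4, draws=3, observed=2)
print(p_value)
```

A small p-value suggests that the input gene list contains more targets of the transcription factor than expected by chance.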
INRICH takes SNPs, which may result from a GWAS, as input, generates LD-independent
genomic intervals from these SNPs and tests for the enrichment of predefined gene
sets such as KEGG pathways, GO terms and a diverse collection of gene sets from the
Molecular Signatures Database [20]. GREAT calculates the statistical significance of
the elements in its annotation libraries by incorporating distal binding sites up to
1 Mb away [21]. GREAT annotates the given genomic regions for human, mouse and
zebrafish using 20 ontologies. It performs a binomial test and a hypergeometric test
for the statistical enrichment of annotation terms and outputs the annotation terms
that are significantly associated with the given genomic regions. GAT finds the
enrichment of tracks with respect to annotations by generating samplings from a
workspace. Tracks are the interval sets of interest, annotations are regions of the
genome together with their annotations, and the workspace contains the accessible
regions of the genome with which the intervals of the samplings have to overlap. If
the tracks consist of highly mappable regions of the genome, the user has to adjust
for this bias by providing a workspace file restricted to highly mappable regions.
The same applies to other biases: the user has to know the properties of the
intervals in the tracks and provide the workspace file accordingly. GLANET takes this
burden from the user and corrects for GC content, isochore family and mappability
using its bias files, which are prepared offline at varying resolutions.
There are various other tools such as GEMINI [35] and Variant Tools [36]. GEMINI
tries to isolate the underlying variants of a disease by annotating the genomic
variations of samples [35]. GEMINI loads a VCF file of the genotypes of samples and
annotates the variants of the samples (disease/phenotype) with its database;
therefore it takes a long time. In addition to its off-the-shelf tools, GEMINI
provides a database framework in which users can write their own SQL queries and a
Python programming interface with which they can implement their own code. Variant
Tools annotates and analyzes genomic variants of samples in order to associate
variants and genes with diseases [36].
We conclude the related work regarding the first part of the thesis with tools and
methods for assessing enrichment analysis. To the best of our knowledge, there is no
tool or method for assessing the performance of enrichment analysis of given genomic
intervals with respect to other genomic interval sets.
Table 2.1: Available tools including GLANET are compared with respect to their
accepted input types and annotation libraries utilized.
Tool (Version) | Input: SNPs | Input: Genomic Intervals | Input: Format | Annotation Libraries: Predefined Gene Sets | Annotation Libraries: Genes | Annotation Libraries: Data Sources | Allows User Provided Annotation Libraries
RegulomeDB (v1.1) | ✓ | ✓ | dbSNP Ids, VCF, BED, GFF3 | | Gencode v7 | ENCODE, Roadmap Epigenomics, dbSNP, GEO, published literature, eQTL, dsQTL, predicted annotations, DNase footprinting, PWMs, DNA Methylation |
SnpEff (v4.2) | ✓ | ✓ | VCF, TXT, SAMTools Pileup Format | KEGG, GO, MSigDb, Reactome | Ensembl | ENCODE, Roadmap Epigenomics, NextProt, UCSC, Motif annotations | ✓
Ensembl SNP Effect Predictor (VEP) (Ensembl release 83) | ✓ | ✓ | VCF, Pileup, HGVS notations | | RefSeq, Ensembl, Gencode | 1000 Genomes, Ensembl transcripts, Gencode and RefSeq transcripts | ✓
ANNOVAR | ✓ | ✓ | VCF, GFF3 | | RefSeq, UCSC, Ensembl, Gencode, AceView | ENCODE, 1000 Genomes, dbSNP, SIFT, UCSC regions, OMIM, Exome Sequencing Project, MutationTaster, Polyphen, Complete Genomics and many other data sources | ✓
FunciSNP (v1.12.0) | ✓ | | dbSNP Ids | | UCSC known genes | ENCODE, Roadmap Epigenomics, 1000 Genomes, TCGA, FAIRE sites, DNaseI hypersensitive sites |
HaploReg (v4.1) | ✓ | * | dbSNP Ids | | RefSeq, Gencode | ENCODE, Roadmap Epigenomics, 1000 Genomes, dbSNP, eQTL, motif instances |
ALIGATOR | ✓ | | dbSNP Ids | GO | Entrez | dbSNP |
Annotate-it (v0.4) | ✓ | | VCF | KEGG, GO, BIOCARTA, Reactome | Ensembl | 1000 Genomes, OMIM, 200 Danish Exomes, Polyphen2, SIFT, LRT, MutationTaster, Anatomical gene expression (eGenetics/SANBI dataset), HPO, EPCC and LDDB phenotype ontologies to annotate samples |
ENCODE ChIP-Seq Significance Tool | | | User given gene list | | Ensembl, Gencode, Entrez | ENCODE, HAVANA, HUGO Gene Nomenclature Committee |
PANOGA | ✓ | | dbSNP Ids | KEGG | | Protein-Protein Interaction Data |
FORGE (v1.1) | ✓ | | dbSNP Ids, VCF, BED | | | ENCODE, Roadmap Epigenomics, 1000 Genomes, GEO, omni genotyping arrays, GWAS SNP arrays |
Variant Tools (v2.7.0) | ✓ | ✓ | dbSNP Ids, VCF, BED, GFF3, CSV, Plink | KEGG | | ENCODE, 1000 Genomes, dbSNP, Exome Sequencing Project, dbNSFP, UCSC, HapMap project, GWAS catalog | ✓
GEMINI (v0.18.2) | ✓ | ✓ | VCF | KEGG | | ENCODE, 1000 Genomes, dbSNP, ClinVar, UCSC, OMIM, HPRD, Exome Sequencing Project | ✓
GREAT (v3.0.0) | | ✓ | BED | GO, MSigDb, Panther, BioCyc | Ensembl genes | 20 ontologies including disease ontologies, phenotype ontologies, miRNA motifs, miRNA targets |
INRICH (v1.1) | ✓ | ✓ | dbSNP Ids | KEGG, GO, MSigDb | Entrez | |
GAT (v1.2.2) | | ✓ | BED | | | | ✓
GLANET (v1.0) | ✓ | ✓ | dbSNP Ids, BED, GFF3, narrowPeak | KEGG, GO | RefSeq | ENCODE | ✓
*Only accepts one single region
Table 2.2: Available tools including GLANET are compared with respect to their
statistical tests carried out and enrichment options provided.
Tool (Version) | Statistical Model or Test | Takes into Account Genomic Biases | Correction for Multiple Hypothesis Testing
RegulomeDB (v1.1) | | |
SnpEff (v4.2) | | |
Ensembl SNP Effect Predictor (VEP) (Ensembl release 83) | | |
ANNOVAR | | |
FunciSNP (v1.12.0) | | LD |
HaploReg (v4.1) | Binomial Test | LD |
ALIGATOR | Permutation Approach | LD | Bootstrap Approach
Annotate-it (v0.4) | Filter-based Approach, Weighted sum Approach, Gamma-based approximation for the null distribution of weighted sum statistic | |
ENCODE ChIP-Seq Significance Tool | One-tailed Hypergeometric Test | | Benjamini-Hochberg FDR
PANOGA | Two-sided test based on the hypergeometric distribution | LD | Bonferroni Correction
FORGE (v1.1) | Background Distribution, Z-score | LD, GC, minor allele frequency (maf) and distance to the nearest transcription start site (TSS) | Bonferroni Correction
Variant Tools (v2.7.0) | Fisher's Exact Test for Single Variant Analysis, Single gene rare variant tests, Conditional rare variants analysis, etc. | |
GEMINI (v0.18.2) | Built-in analyses such as find de novo mutations, find compound heterozygotes and so on | |
GREAT (v3.0.0) | Binomial Test, Hypergeometric Test | | Bonferroni Correction, Benjamini-Hochberg FDR
INRICH (v1.1) | Permutation Approach | LD | Bootstrap Approach
GAT (v1.2.2) | Sampling method | Chromosome Identity, GC and Mappability (Not tailored for each given interval) | Storey's q-value, Benjamini-Hochberg FDR
GLANET (v1.0) | Sampling Based Approach, Z-score | GC, Mappability, Isochore Family, Interval Length, Interval Chromosome | Bonferroni Correction, Benjamini-Hochberg FDR
Table 2.3: Available tools including GLANET are compared with respect to their
enrichment analysis.
Tool (Version) | Provides Enrichment | Enrichment Analysis: DNA Regulatory Elements | Enrichment Analysis: Predefined Gene Sets | Enrichment Analysis: Others | User Interface
RegulomeDB (v1.1) | | | | | Web
SnpEff (v4.2) | | | | | Command Line
Ensembl SNP Effect Predictor (VEP) (Ensembl release 83) | | | | | Web
ANNOVAR | | | | | Command Line
FunciSNP (v1.12.0) | | | | | Command Line
HaploReg (v4.1) | ✓ | DNaseI hypersensitive sites | | Enhancer | Web
ALIGATOR | ✓ | | GO | | Command Line
Annotate-it (v0.4) | ✓ | | | Annotate-it provides candidate gene lists, aggregate functionality scores, phenotype-specific gene prioritization, and statistical methods for disease-gene finding in case/control studies | Web
ENCODE ChIP-Seq Significance Tool | ✓ | Transcription Factors | | | Web
PANOGA | ✓ | | | Identifies sub-networks within protein-protein interaction networks | Web
FORGE (v1.1) | ✓ | DNaseI hotspots | | Cell Type Specific Enrichment | Web
Variant Tools (v2.7.0) | ✓ | | | Use more than 20 association analysis methods to associate variants and genes with qualitative or quantitative traits | Command Line
GEMINI (v0.18.2) | | | | Enables users to write their own SQL queries | Command Line
GREAT (v3.0.0) | ✓ | Transcription Factors | GO, MSigDb, Panther, BioCyc | Gene Expression Data, Regulatory Motifs, Gene Families | Web
INRICH (v1.1) | ✓ | | KEGG, GO, MSigDb | | GUI, Command Line
GAT (v1.2.2) | ✓ | | | | Command Line
GLANET (v1.0) | ✓ | DNaseI hypersensitive sites, Histone Modifications, Transcription Factors | KEGG, GO | User defined gene-set enrichment, user defined library enrichment | GUI, Command Line
2.3 Related Work Regarding Thesis Part 2
Concerning the second part of the thesis, several existing tools perform interval
intersection [37, 38, 39] and other genomic analyses. The UCSC Genome Browser has
been evolving continuously since its first launch. Recently, the Data Integrator
feature was released in the UCSC Genome Browser; it allows users to combine and
extract data from multiple tracks (up to 5 tracks) simultaneously [37]. BEDTools was
developed for the comparison, manipulation and annotation of genomic features in BAM,
BED, GFF and VCF formats [38]. BEDOPS is a highly scalable and easily parallelizable
genome analysis toolkit, which enables tasks to be split easily by chromosome for
distributing whole-genome analyses across a computational cluster [39]. NCList
defines its own dedicated data structure for interval databases [40]. Tabix indexes
tab-delimited files and thereby converts a sequential access file into a random
access file [41]. Layer et al. propose a novel parallel "slice-then-sweep" algorithm
for n-way interval set intersection, with a non-containing intervals restriction on
the intervals in the given data sets [42].
CHAPTER 3
ANNOTATION OF GENOMIC LOCI
Annotation is the process of finding overlapping intervals between the user query
and the intervals stored in GLANET's annotation library. Users are not restricted to
GLANET's library, however: GLANET allows users to expand the annotation library with
their own user defined gene sets and genomic intervals.
In this chapter, we define the user query and the annotation library, and describe
how the library is represented using interval trees. Figure 3.1a provides an overview
of the workflow and capabilities of GLANET. We describe the individual components in
more detail below.
3.1 User Query
Users can query SNPs or genomic intervals of varying length for annotation and/or
enrichment analysis. GLANET supports commonly used input formats such as BED,
narrowPeak, GFF3, 0-based or 1-based coordinates, and reference SNP (RS) identifiers
for SNPs. Overlapping genomic intervals in the query are merged prior to analysis to
avoid inducing dependencies among the query intervals.
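The merging step described above can be sketched as follows. This is a minimal illustration that assumes intervals are given as (chromosome, start, end) tuples with 1-based, end-inclusive coordinates; the function names are hypothetical and the sketch is not GLANET's own preprocessing code.

```python
def bed_to_one_based(chrom, start, end):
    """Convert a BED-style 0-based, half-open interval to 1-based, end-inclusive."""
    return (chrom, start + 1, end)


def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals with inclusive ends."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Shares at least one base with the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


query = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 301, 400), ("chr2", 50, 60)]
merged_query = merge_intervals(query)
print(merged_query)
```

In this sketch, adjacent intervals that share no base, such as chr1:100-300 and chr1:301-400, are kept separate; whether they should be joined is a design choice that the text above does not specify.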
3.2 GLANET Annotation Library
[Figure 3.1: (a) workflow diagram showing the input (list of genomic intervals:
SNPs, insertions, deletions, ChIP-seq and BS-seq peaks, etc.), preprocessing, the
GLANET annotation library (genomic elements, gene sets, user defined gene sets and
user defined genomic elements), annotation, enrichment with genomic biases, and
regulatory sequence analysis for SNPs, together with their outputs; (b) diagram of
the gene-centric intervals 5d, 5p2, 5p1, exons, introns, 3p1, 3p2 and 3d.]
Figure 3.1: (a) Overall functionality of GLANET. (b) Gene-centric genomic intervals
are defined based on commonly used location analyses in ChIP-seq and related studies
[43]. GLANET uses these intervals to provide detailed annotation of user query with
respect to known genes.
GLANET annotation library contains lists of annotated genomic regions from the
literature. We refer to these as GLANET elements, or genomic elements. Each of these
elements is represented by a set of genomic intervals. The default GLANET annotation
library consists of the following genomic elements:
1. Non-coding regulatory annotations: Regulatory elements encompass non-coding
regions such as DNaseI hypersensitive sites (DHSs), transcription factor bind-
ing and histone modification regions across multiple cell types from the EN-
CODE project. Each element represents a set of genomic intervals that are
identified as peaks by the ENCODE project in a biochemical high through-
put assay. For example, STAT1_K562 represents genomic intervals bound by
transcription factor STAT1 in K562 cells.
2. Gene-centric elements: Gene-centric elements are defined for each gene and
are based on exons, introns, and six different regulatory regions that are either
proximal or distal to each RefSeq gene. We adopt the nomenclature from com-
monly used location analysis [43] and define 5p1, 5p2, and 5d as the regions
0 to 2kb, 2kb to 10kb, and 10kb to 100kb upstream of first exon of the gene,
respectively. Similarly, we define 3p1, 3p2, and 3d as the regions 0 to 2kb, 2kb
to 10kb, and 10kb to 100kb downstream of last exon of the gene, respectively
(Figure 3.1b). These gene-centric elements enable users to annotate their input
query with respect to known genes and more importantly non-coding regions
around them. These regions are further incorporated into pathway and gene set
enrichment analysis.
3. Functional gene sets: The input set of genomic intervals can also be queried
against pre-defined gene sets. GLANET includes gene sets derived from KEGG
pathways and GO Terms as its default functional gene sets. GLANET further
defines three classes of gene set elements as exon-based, regulation-based, and
all-based. Exon-based gene set elements include exons of the genes in each
individual gene set. In contrast, regulation-based gene set elements consist of
introns and the four different proximal noncoding regions, namely 5p1, 5p2,
3p1, and 3p2, of genes in each gene set. The third category, all-based gene
set elements, consists of exons, introns, and all six proximal and distal regions
of genes in each gene set. These three modes allow users to not only assess
enrichment of an input query with respect to exonic regions or full length of
genes but also enable regulation-centric enrichment analysis.
4. User-defined annotations: An important feature of GLANET is that users can
expand the GLANET annotation library with new genomic elements, i.e., ge-
nomic intervals or gene sets, and query against this extended library. This op-
tion broadens the applicability of GLANET to various settings. For example,
it enables investigating the input set against an in-house generated ChIP-seq
data analysis, or against gene sets derived from gene expression data analysis,
si/shRNA gene lists, or other functional assays. We present an example ap-
plication in Section 7.2.2, where we consider GATA2 bound regions in K562
cells as input query and utilize gene sets derived from GO term annotations as
user-defined annotations.
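The gene-centric regions defined in item 2 above can be derived from the genomic extent of a gene as in the following sketch. It assumes 1-based, end-inclusive coordinates, treats the 2kb, 10kb and 100kb limits as inclusive band ends, and mirrors the regions for genes on the minus strand; these conventions and all names are assumptions for illustration, not GLANET's documented behavior.

```python
# Distance bands (in bases from the gene boundary): proximal 1, proximal 2, distal.
BANDS = {"p1": (1, 2_000), "p2": (2_001, 10_000), "d": (10_001, 100_000)}


def gene_centric_regions(gene_low, gene_high, strand):
    """Return {name: (low, high)} for 5p1, 5p2, 5d, 3p1, 3p2 and 3d.

    gene_low and gene_high are the lowest and highest exon coordinates of the gene.
    Regions that would run past the chromosome start are clipped at position 1;
    genes closer than a band's near edge to the chromosome start are not handled.
    """
    regions = {}
    for suffix, (near, far) in BANDS.items():
        below = (max(1, gene_low - far), gene_low - near)  # lower coordinates
        above = (gene_high + near, gene_high + far)        # higher coordinates
        # On the plus strand, upstream (5') lies at lower coordinates.
        upstream, downstream = (below, above) if strand == "+" else (above, below)
        regions["5" + suffix] = upstream
        regions["3" + suffix] = downstream
    return regions


plus = gene_centric_regions(200_000, 250_000, "+")
minus = gene_centric_regions(200_000, 250_000, "-")
print(plus["5p1"], plus["3d"], minus["5p1"])
```

Regulation-based gene set elements would then combine the introns with 5p1, 5p2, 3p1 and 3p2, while all-based elements would add the exons, 5d and 3d.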
[Figure 3.2: diagram of the genomic intervals of one genomic element type on a single
chromosome (different colors indicate different genomic elements) and the interval
tree constructed from them; each node is labeled with [low, high], its color, the
maximum high endpoint stored in the subtree rooted at the node and the annotated
genomic elements.]
Figure 3.2: Genomic intervals are represented in interval trees [44]. A separate
interval tree is constructed for each chromosome and genomic element type, e.g. for
transcription factor binding annotations. Each node contains the low and high
endpoints of the genomic interval, the color of the node (red or black), the maximum
high endpoint stored in the subtree rooted at this node and the genomic elements
annotated with this particular genomic interval.
We provide details on data sources in Table A.1 of Appendix A.
3.3 Library Representation
A genomic interval is a continuous stretch of the genome with chromosomal start
and end coordinates denoted by [t1, t2] with t1 ≤ t2, where t1 is the low endpoint
and t2 is the high endpoint of the interval. Each genomic element in the GLANET
library is defined by a set of such genomic intervals. For example, in exon based
analysis, a gene is represented by the set of genomic intervals of its exons. Similarly,
a transcription factor’s binding regions or histone modification sites are represented
by a set of genomic intervals that corresponds to ChIP-Seq peaks. GLANET stores
these genomic intervals in interval trees (Figure 3.2).
An interval tree is a red-black tree in which each node x stores the low and high end-
points, t1 and t2, of an interval and an integer value max which is the maximum high
endpoint stored in the subtree rooted at this node x [44]. On each node of the tree, we
also store the genomic annotations associated with the interval stored on that node.
For each element type in the annotation library, e.g., genomic elements representing
all transcription factor binding regions across all cell lines, chromosome-specific in-
terval trees are constructed (Figure 3.2). Then, for annotation and enrichment analysis
the appropriate interval trees are searched for query intervals using the interval tree
search algorithm as described in [44].
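The role of the max value stored on each node can be illustrated with a simplified sketch. Instead of a red-black tree, it builds a balanced binary search tree once from sorted intervals, which is a deliberate simplification; the class and function names are hypothetical and the code is not the implementation from [44].

```python
class Node:
    def __init__(self, low, high, annotation):
        self.low, self.high, self.annotation = low, high, annotation
        self.left = self.right = None
        self.max_high = high  # maximum high endpoint in the subtree rooted here


def build(intervals):
    """Build a balanced tree from (low, high, annotation) tuples keyed on low."""
    items = sorted(intervals)

    def rec(lo, hi):
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        node = Node(*items[mid])
        node.left, node.right = rec(lo, mid), rec(mid + 1, hi)
        for child in (node.left, node.right):
            if child is not None:
                node.max_high = max(node.max_high, child.max_high)
        return node

    return rec(0, len(items))


def find_overlaps(node, low, high, hits=None):
    """Collect all stored intervals that overlap the closed query [low, high]."""
    if hits is None:
        hits = []
    if node is None or node.max_high < low:
        return hits  # nothing in this subtree reaches the query
    find_overlaps(node.left, low, high, hits)
    if node.low <= high and low <= node.high:
        hits.append((node.low, node.high, node.annotation))
    if node.low <= high:  # otherwise every interval to the right starts too late
        find_overlaps(node.right, low, high, hits)
    return hits


tree = build([(10, 20, "A"), (15, 40, "B"), (50, 60, "C"), (5, 8, "D")])
print(find_overlaps(tree, 18, 50))
```

The max_high check lets the search skip whole subtrees that end before the query starts, which is what keeps the search cost well below a full scan when few intervals overlap.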
GLANET annotation overlaps each genomic interval in the input set with genomic
elements in its annotation library and provides the following options for quantifying
the overlap:
1. Existence of overlap (EOO): This option simply evaluates whether a given input
interval intersects, by at least 1 base pair (bp), any of the intervals of a genomic
element in the annotation library. GLANET provides flexibility in the overlap
definition: by default, an intersection of at least a single base is considered an
overlap, and users can provide a higher threshold for defining overlap. Finally, the
fraction of input intervals overlapping each genomic element is reported as the
query-level association statistic for each genomic element in the annotation library.
2. Number of overlapping bases (NOOB): In order to take into account the size
of the intersection between a given input interval and the intervals of a genomic
element, NOOB uses the actual number of overlapping bases. The total number of
overlapping bases across all the input intervals is reported as the query-level
association statistic for each element in the annotation library. In this
calculation, each overlapping base is counted only once.
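The two statistics above can be sketched for a single genomic element on one chromosome as follows. The sketch uses a brute-force comparison of closed (low, high) intervals rather than interval trees, and the function names and the min_overlap parameter are assumptions made for illustration.

```python
def overlap_bases(a, b):
    """Number of bases shared by two closed intervals given as (low, high)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def eoo_fraction(query, element, min_overlap=1):
    """Fraction of query intervals overlapping the element by at least min_overlap bases."""
    hits = sum(
        1 for q in query if any(overlap_bases(q, e) >= min_overlap for e in element)
    )
    return hits / len(query)


def noob_total(query, element):
    """Total number of overlapping bases, counting each base only once."""
    covered = set()
    for q in query:
        for e in element:
            low, high = max(q[0], e[0]), min(q[1], e[1])
            covered.update(range(low, high + 1))  # empty range when low > high
    return len(covered)


query = [(100, 200), (300, 400), (500, 600)]
element = [(150, 180), (170, 250), (390, 395)]
print(eoo_fraction(query, element), noob_total(query, element))
```

With these toy intervals, two of the three query intervals overlap the element; raising min_overlap to 10 leaves only the first one, since the second shares just six bases with the element.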
3.4 Interval Tree
Interval tree is a well-known and widely used space partitioning tree. We adopted the
interval tree implementation provided in [44]. Its space complexity is O(n);
construction and query require O(n log n) and O(min(n, k log n)) time, respectively,
for n given intervals and k hits.
3.5 Time and Space Complexity of Annotation
Annotation is performed by searching for each query interval in the interval tree. The
time complexity of a query search in an interval tree is O(min(n, k log n)), where n
is the number of all genomic intervals in the interval tree (number of nodes) and k
is the number of genomic intervals overlapping the query interval. Typically, k log n
is smaller than n. For m query intervals, the time complexity of annotation is
O(m · min(n, k log n)).
We construct chromosome based interval trees for each element type, namely, DNaseI
hypersensitive sites (DHSs), Transcription Factors (TFs), Histone Modifications
(HMs) and RefSeq Genes. Space complexity of each interval tree is O(n), where n
is the number of intervals stored in the interval tree.
CHAPTER 4
REGULATORY SEQUENCE ANALYSIS OF SINGLE
NUCLEOTIDE POLYMORPHISMS
GLANET provides regulatory sequence analysis (RSA) when the user query consists of
Single Nucleotide Polymorphisms (SNPs) only. For each input SNP, GLANET finds the
overlapping transcription factors (TFs) and, for each TF, gathers its position
frequency matrix (PFM). Next, GLANET retrieves the reference, altered and extended
reference DNA sequences centered at the SNP position and checks whether the binding
affinity of the TF increases or decreases, with respect to the TF's PFM, because of
the SNP. In this chapter, we describe the steps of our RSA and conclude the chapter
with a use case of GLANET, namely RSA for Obsessive Compulsive Disorder (OCD) GWAS
SNPs.
4.1 Regulatory Sequence Analysis
GLANET provides a detailed regulatory sequence analysis for SNP input queries.
This analysis takes advantage of the available ENCODE transcription factor binding
regions in the default GLANET annotation library. GLANET first finds the
transcription factor binding regions in which each SNP resides. Then, the locations
of the SNPs residing in a TF binding region are evaluated for overlap with a
significant motif match using the position frequency matrices (PFMs) of the
corresponding TFs. This evaluation is carried out with both the reference and the SNP
alleles. Specifically, for evaluating a single SNP with respect to one PFM, GLANET
retrieves the DNA subsequence of the reference genome within a 41 bp window centered
at the SNP locus. It then assesses, with the RSAT tool [45], whether this subsequence
provides a significant match to the PFM with either the reference or the SNP allele.
Both Jaspar Core [46] and Encode motifs [47] are utilized as part of GLANET's PFM
library.
An overview of regulatory sequence analysis can be found in Figure 4.1. GLANET
performs regulatory sequence analysis in three main steps:
1. In the first step, SNP and TF pairs for which SNP resides in the binding region
of the TF are found. This is accomplished by overlapping the positions of the
SNPs with transcription factor binding sites provided in the annotation library.
2. In the second step, GLANET generates three subsequences around the SNP
site: reference, SNP and extended sequences. These sequences are used to
statistically assess whether the SNP can alter the transcription factor binding.
Reference and SNP sequences are 41 bps long and they are created by taking
±20 bps upstream and downstream sequences around SNP locus. Extended
reference sequence is a 401 bp region centered at SNP locus and is used to
check if the SNP site is actually the most likely binding site in the vicinity of
the SNP.
3. In the third step, GLANET scans each of the sequences (Reference, SNP, Extended)
for a matching motif site and evaluates the statistical significance of the match
using RSAT [45]. For this, the position frequency matrices (PFMs) for the annotated
TFs are obtained from Jaspar Core and Encode motifs [46, 47]. This step results in
three p-values: pref, psnp and pextended. The smaller the p-value, the better the
match is.
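The construction of the three sequences in the second step can be sketched as follows. The sketch assumes that the chromosome sequence is available as an in-memory string and that the SNP position is a 0-based index into it; the function name, the parameters and the toy sequence are hypothetical.

```python
def snp_windows(chrom_seq, snp_pos, alt_allele, flank=20, ext_flank=200):
    """Return the (reference, SNP, extended) sequences centered at snp_pos.

    flank=20 gives the 41 bp reference and SNP sequences;
    ext_flank=200 gives the 401 bp extended reference sequence.
    """
    if snp_pos < ext_flank or snp_pos + ext_flank >= len(chrom_seq):
        raise ValueError("SNP is too close to the end of the sequence")
    ref_seq = chrom_seq[snp_pos - flank:snp_pos + flank + 1]
    # Replace the central base of the reference window with the SNP allele.
    snp_seq = ref_seq[:flank] + alt_allele + ref_seq[flank + 1:]
    ext_seq = chrom_seq[snp_pos - ext_flank:snp_pos + ext_flank + 1]
    return ref_seq, snp_seq, ext_seq


toy_chromosome = "ACGT" * 200  # 800 bases; position 400 holds an 'A'
ref_seq, snp_seq, ext_seq = snp_windows(toy_chromosome, snp_pos=400, alt_allele="C")
print(len(ref_seq), len(snp_seq), len(ext_seq))
```

Each of the three sequences would then be scanned with the PFMs of the overlapping TFs, as described in the third step.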
In this scenario, we consider only the cases where the SNP location is found to be
the best matching site within the peak; that is, we consider only cases where
pextended is not smaller than the minimum of psnp and pref.
[Figure 4.1: flow diagram. Step 1: find SNP and transcription factor (TF) pairs
where the SNP falls into the TF's binding site, using the SNP file and the
transcription factor file. Step 2: for each SNP in the list, create three
subsequences around the SNP locus; the reference and altered SNP sequences include
20 nucleotides downstream and upstream of the SNP locus, and the extended sequence is
retrieved from the reference genome within a 401 bp window centered at the SNP locus.
Step 3: scan each sequence with the TF's position frequency matrices using RSAT,
compare pref, psnp and pextended, and determine SNPs potentially affecting TF motif
sites.]
Figure 4.1: Three main steps of regulatory sequence analysis in GLANET.
Let pref and psnp denote the p-values of motif matches with the reference and SNP
alleles, respectively. Since we precondition our analysis on the fact that the SNP
overlaps a TF binding region, we also evaluate whether the region harbors a motif
match to the PFM that does not overlap the SNP location. Let pextended denote the
p-value of such a match. If pextended is smaller than psnp, GLANET filters the SNP
out in the post-analysis, as the binding region has a better motif match that does
not overlap the SNP location. If the SNP location is the best place for the motif to
match, GLANET compares psnp and pref. If psnp is larger than pref, the SNP has a
potentially disrupting effect: it decreases the binding affinity of the TF. If the
converse holds, GLANET suggests that the SNP is creating a sequence motif that is
more favorably recognized by the TF. In other words, the SNP has a potentially
enhancing effect, which increases the binding affinity of the TF.
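The comparison of the three p-values described above can be sketched as a small decision function. The sketch applies the filter stated earlier, namely that pextended must not be smaller than the minimum of psnp and pref; the function name and the returned labels are hypothetical, and the pextended values in the usage lines are placeholders chosen only so that the SNPs pass the filter.

```python
def classify_snp_effect(p_ref, p_snp, p_extended):
    """Classify a SNP from the motif-match p-values of the three sequences."""
    if p_extended < min(p_ref, p_snp):
        # A better motif match exists elsewhere in the binding region.
        return "filtered"
    if p_snp > p_ref:
        return "disrupting"  # the SNP allele matches the motif less well
    if p_snp < p_ref:
        return "enhancing"   # the SNP allele matches the motif better
    return "no change"


# p_ref and p_snp as reported for two OCD SNPs; p_extended=1.0 is a placeholder.
print(classify_snp_effect(p_ref=1.1e-3, p_snp=6.1e-5, p_extended=1.0))
print(classify_snp_effect(p_ref=6.1e-5, p_snp=1.5e-3, p_extended=1.0))
```

Returning an explicit "filtered" label keeps the post-analysis filter separate from the comparison of psnp and pref, so either rule can be changed independently.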
4.2 GLANET Use Case: Regulatory Sequence Analysis of OCD GWAS SNPs
Following up OCD SNPs with GLANET regulatory sequence analysis revealed that
some of these SNPs might be affecting TF binding. For example, SNP rs1891215
resides within a STAT1 binding region and has a match to STAT1 PFM with pref
of 1.1e-3. As the SNP changes the allele from A to G, it generates a better STAT1
binding site with psnp of 6.1e-5 (Figure 4.2a). In contrast, the SNP rs10946279 resides
within a MAX binding region. This location has a match to the MAX PFM with a pref
of 6.1e-5; however, the SNP alters the match (psnp = 1.5e-3), potentially disrupting
the binding site (Figure 4.2b). All regulatory sequence post-analysis results of OCD
SNPs are available in Supp. Table S20 under http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/.
[Figure 4.2 panels: (a) rs1891215, STAT1 motif: reference CTTCTGGAAAA, SNP CTTCTGGGAAA. (b) rs10946279, MAX motif: reference GCCGTGCGAT, SNP GCTGTGCGAT.]
Figure 4.2: GLANET regulatory sequence analysis for the OCD SNPs annotated
with TFs in the library. (a) SNP rs1891215 located at chr1:7,667,794 changes refer-
ence nucleotide A to G, and as a result, leads to a better match to the STAT1 PFM,
i.e., the p-value of the match to the STAT1 PFM changes from 1.1e-3 to 6.1e-5. (b)
SNP rs10946279 (chr6:170,553,248) changes reference nucleotide C to T, thereby
decreasing the significance of the match to the MAX PFM, i.e., the p-value of the
match increases from 6.1e-5 to 1.5e-3.
CHAPTER 5
ENRICHMENT ANALYSIS OF GENOMIC REGIONS
Enrichment analysis identifies one or more common functional themes in
the input query set by assessing the statistical significance of the overlaps between
the user query and the intervals of elements stored in GLANET's annotation library. For
this purpose, GLANET employs sampling-based enrichment analysis. In each sampling,
a random interval is generated for every user input query interval by matching its
chromosome and length by default and, on request, its GC content, mappability and
isochore family, jointly or separately. GLANET also allows users to
expand its default annotation library by providing their own user-defined gene sets
and library. Furthermore, GLANET offers joint enrichment analysis for transcription
factor and KEGG Pathway pairs. We explain all of these in detail in this chapter.
5.1 Enrichment Analysis
To evaluate the statistical significance of the overlaps, GLANET calculates the ob-
served and expected test statistics using one of the association statistics options listed
in Table 5.1. GLANET computes the observed test statistics for each member ele-
ment of selected element type by finding the overlaps between the input query and
the genomic intervals of element in the annotation library. To calculate expected test
statistics, GLANET estimates empirical null distributions by randomly sampling in-
tervals that match the characteristics of the input query intervals. We use a resampling
based approach to obtain the empirical null distribution of the test statistic. We col-
lect test statistics of B samplings, each with n randomly generated genomic intervals,
where n is the number of input intervals in the query. The bth sampling is represented
by randomly generated genomic intervals, $S_b = \{s_{b1}, s_{b2}, \ldots, s_{bn}\}$, $\forall b \in \{1, \ldots, B\}$,
that match the given genomic intervals' properties. The collection of overlap statistics
across multiple random samplings is then used to estimate an empirical null distribution
for the overlap statistic and to calculate an empirical p-value:
$$p = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}(k_b \ge k).$$
Here k denotes the observed test statistic and kb is the overlap statistic of randomly
generated genomic intervals Sb from bth sampling. The indicator function returns 1
when the inequality holds and 0 otherwise. Multiple testing correction to account for
large numbers of genomic elements is performed with two options: Bonferroni pro-
cedure [48] for controlling family-wise error rate and Benjamini-Hochberg procedure
[49] for controlling the false discovery rate.
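As an illustration of the empirical p-value and the two multiple testing corrections named above, the following Python sketch may help. It is a generic textbook formulation under our own naming (all function names are hypothetical), not an excerpt of GLANET.

```python
def empirical_p_value(observed, sampled):
    """Fraction of the B sampled statistics k_b that are >= the observed k."""
    return sum(1 for k_b in sampled if k_b >= observed) / len(sampled)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values (controls the family-wise error rate)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each genomic element receives one empirical p-value; the list of p-values over all elements is then passed to one of the two correction functions.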
The key part of estimating the empirical null distribution of enrichment test is the
random interval sampling step. The random intervals are generated such that they
match the properties of each member of the input interval set, as opposed to the
average properties of these intervals. Matched properties of each input interval are its
chromosome, length, GC content, mappability and isochore family. Among these
properties, GC content and mappability are the systematic biases that are introduced
by the NGS technologies. In other words, these technologies restrict the genomic re-
gions that can contribute to resulting intervals. To validate the introduced GC content
and mappability biases in the intervals of ENCODE derived DNA elements in our
annotation library, we evaluated the GC content and mappability values of all inter-
vals for each ENCODE file. We sorted the ENCODE files with respect to their mean
GC content and mappability values in ascending order and selected the ten different
files that almost equally separate the sorted mean values to show how GC content and
mappability vary for DNaseI hypersensitive sites (DHSs), transcription factor binding
sites (TFBSs) and histone modifications (HMs). Figures 5.1a, 5.1c and 5.1e show
the box plots of GC content of intervals of ten different ENCODE files sorted with
respect to their mean GC contents; GC content values vary mostly
between 0.4 and 0.6. Figures 5.1b, 5.1d and 5.1f show the box
plots of mappability values of intervals of ten different ENCODE files sorted with
respect to their mean mappability values; mean mappability values vary
mostly between 0.8 and 1.0. We can conclude that intervals obtained from ENCODE
data sets tend to be highly mappable with average GC content.
(a) GC contents for DHSs (b) Mappability values for DHSs
(c) GC contents for HMs (d) Mappability values for HMs
(e) GC contents for TFBSs (f) Mappability values for TFBSs
Figure 5.1: Box plots of GC content and mappability values for ten different EN-
CODE files, for each element type.
Table 5.1: GLANET main parameters for enrichment test.

Association Statistic Options
  EOO     Overlap statistic is 1 or 0 based on whether the input interval overlaps with any of the genomic element intervals or not.
  NOOB    Overlap test statistic is the exact number of overlapping bases between the input interval and the genomic element intervals.

Random Interval Generation Matching Options
  wGC     For an input interval, randomly sample an interval with the same length from the same chromosome such that it matches the GC content of the query interval.
  wM      Randomly sample an interval with the same length from the same chromosome such that it matches the mappability of the query interval.
  wGCM    Randomly sample an interval with the same length from the same chromosome such that it matches both mappability and GC content of the query interval.
  woGCM   Randomly sample an interval with the same length from the same chromosome.

Random Interval Generation Start Options
  wIF     Starts the random interval search within the same chromosome with a matching GC isochore family. When the GC option is on, it provides a good start for GC matching. When the GC option is not selected, it provides coarse-grained GC matching.
  woIF    Starts the random interval search at a random position within the chromosome.
GLANET provides flexibility in which property to consider. Users can account for GC
content and mappability bias jointly or separately, or choose not to match either of these
properties. The availability of these modes provides flexibility for cases wherein
the input genomic intervals are generated by different technologies. In matching the
GC content, genomic intervals are matched with varying resolution depending on
the length of given genomic intervals, i.e., the shorter the genomic interval, the more
precise the GC content matching is. A detailed description of the GC and mappability
matching procedure is available in Algorithm 5.1.
GLANET also offers an Isochore Family (IF) option in matching GC. The genome
is divided into regions that are characterized by similar GC content composition.
These regions are called isochores and fall into five families, named L1, L2, H1, H2, and H3
in accordance with increasing GC levels, < 38%, 38–42%, 42–47%, 47–52%, and > 52%,
respectively, as defined in [50, 51]. Each chromosome is divided into 100,000
bps long intervals, and each such interval is tagged with its appropriate isochore family.
When the with Isochore Family (wIF) option is selected, the input interval's
isochore family is calculated first, and a random interval of 100,000 bps is selected
from the appropriate isochore family pool of that chromosome. Subsequently, a random
interval of the input interval's length is sampled from this 100,000 bps long interval.
If the GC and/or mappability option is also selected, a random interval is repeatedly
selected until one is generated whose GC content and/or mappability, depending on the
selected mode, is within a preset threshold of the input interval's value. When the GC
option is selected, wIF provides a good starting point for GC matching; when it is not
selected, wIF provides a very coarse-grained matching of GC.
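The isochore family assignment can be illustrated with a short Python sketch. The GC boundaries are those quoted above; how values falling exactly on a boundary are assigned is not stated in the text, so the half-open ranges used here, as well as the function names, are assumptions.

```python
def gc_fraction(sequence):
    """Fraction of G and C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    return sum(1 for base in sequence if base in "GC") / len(sequence)


def isochore_family(gc):
    """Map a GC fraction in [0, 1] to an isochore family.

    Boundaries follow the percentages quoted in the text; assigning each
    boundary value to the higher family is an assumption.
    """
    gc_percent = gc * 100.0
    if gc_percent < 38:
        return "L1"
    if gc_percent < 42:
        return "L2"
    if gc_percent < 47:
        return "H1"
    if gc_percent < 52:
        return "H2"
    return "H3"
```

Under the wIF option, the family computed for an input interval would select the pool of 100,000 bps windows from which the random interval is drawn.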
The different options for the enrichment test are summarized in Table 5.1.
5.2 Random Interval Sampling Procedure
To perform enrichment analysis, GLANET generates a null distribution of the test
statistics by first sampling random intervals and then calculating these intervals' overlap
with the annotation library element intervals. The random intervals are generated such
that they match the properties of each member of the input interval set, as opposed
to the average properties of these intervals. The algorithm for generating random
intervals is outlined in Algorithm 5.1. Note that we do not include the relaxation
steps of the thresholds for the sake of clarity. Here we provide the details of this random
interval generation scheme.
The input interval set may contain overlapping intervals. In such cases, GLANET
preprocesses the input by merging overlapping intervals into a single interval to avoid
dependency among them. Similarly, the random intervals for an input interval set
are always selected such that they do not overlap. GLANET provides four main
parameters for random interval generation: with GC (wGC), with Mappability (wM),
with GC and Mappability (wGCM), and without GC and Mappability (woGCM).
GLANET random interval generation can also be run without Isochore Family (woIF)
or with Isochore Family (wIF). Regardless of which option is selected, for each
input interval a corresponding random interval of the same length from the same
chromosome is sampled. When the given interval's length is greater than 100,000
bps, GLANET does not account for GC content and/or mappability when generating
random intervals, even if one of these options (wGC, wM, wGCM) is on, since GC content
and mappability values are not meaningful for very large intervals. When the
wGC, wM, or wGCM option is selected, in addition to the length and chromosome
of the given interval, GLANET also matches the given interval's GC, mappability, or both
GC and mappability, respectively, as follows:
• GC Option or Mappability Option Selected: If one of the wGC or wM options
is selected, GLANET tries to match the GC content or mappability value of
the given interval. The same procedure applies for matching GC or mappability
values. GLANET first generates a random interval and calculates its GC content
or mappability depending on which option is selected. This random interval is
accepted if its value is close to the corresponding value of input interval within
a pre-defined threshold. Otherwise, GLANET generates a new random interval
until an acceptable random interval is obtained. If after a certain number of
attempts, no random interval can be found because it is not within the threshold
distance to the GC or mappability of the input interval, then the threshold for
the acceptable match is increased by a small increment. Again, after a certain
amount of trials, if relaxing this threshold does not help, GLANET chooses the
random interval with the minimum difference in GC content or mappability up
to that point.
• GC and Mappability Option Selected: If the wGCM option is on, GLANET
selects a random interval with GC content and mappability values close to those of the
input interval. A random interval is considered acceptable if its GC content
and mappability values are within a pre-defined distance to the input interval’s
values. If the random interval values do not match, a new interval is sampled
until an acceptable random interval is obtained. If after a certain number of at-
tempts, no random interval can be generated because it is not within the thresh-
old distance to the GC or mappability of the input interval, the threshold for the
acceptable match is increased by a small increment. If relaxing this threshold
does not help, GLANET chooses the random interval with the minimum sum
of the differences in GC content and mappability up to that point.
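The accept, relax and fall-back logic described in the two cases above can be sketched as follows for a single matched property. This is a simplified illustration, not GLANET's implementation: the function name, the default threshold, the increment and the trial counts are all hypothetical, and the check against previously generated intervals is omitted.

```python
import random


def sample_matching_interval(chrom_len, length, target, value_of,
                             threshold=0.05, step=0.01,
                             tries_per_level=100, max_levels=5, rng=random):
    """Sample a (start, end) interval whose value is close to `target`.

    `value_of` maps an interval to its GC content or mappability.
    All numeric defaults are illustrative assumptions.
    """
    best, best_diff = None, float("inf")
    for level in range(max_levels):
        tolerance = threshold + level * step  # relax the threshold stepwise
        for _ in range(tries_per_level):
            start = rng.randrange(0, chrom_len - length + 1)
            candidate = (start, start + length)
            diff = abs(value_of(candidate) - target)
            if diff < best_diff:
                best, best_diff = candidate, diff
            if diff <= tolerance:
                return candidate
    # No acceptable interval found: return the closest one seen so far.
    return best
```

For the wGCM case, `diff` would be replaced by the sum of the GC and mappability differences, with each difference checked against its own threshold.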
5.2.1 GC and Mappability Calculation
In order to calculate the GC content and mappability of given intervals, GLANET
pre-computes GC content and mappability values of genomic regions and stores them on
disk. The GC content of the genomic regions is calculated at various resolutions:
1 bp, 100 bps, 1000 bps, 10,000 bps and 100,000 bps. At run time, GLANET
constructs a GC byte list or GC interval tree from one of these pre-computed sets of GC
content values based on the mode of the input interval lengths. Specifically, the shorter
the input intervals are, the more precise the GC calculation is. If the mode of the given
intervals' lengths is short (≤ 100 bps), GLANET calculates the GC content of the given
intervals at one-base resolution and stores the values in a byte list. Otherwise, GLANET
stores the GC contents of 100 bps, 1000 bps or 10,000 bps long intervals in interval trees.
When the mode is > 100 and ≤ 1000 bps, GLANET calculates GC content at 100-base
resolution; when the mode is > 1000 and ≤ 10,000 bps, at 1000-base resolution; and
when the mode is > 10,000 and ≤ 100,000 bps, at 10,000-base resolution. When the
mode is longer than 100,000 bps, GLANET calculates GC content at 10,000-base
resolution only for intervals shorter than 100,000 bps and does not calculate GC content
for longer intervals.
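The choice of GC resolution from the mode of the input interval lengths can be summarized in a small Python sketch. The function name is hypothetical, and ties between equally frequent lengths are resolved arbitrarily here, which the text does not specify.

```python
from collections import Counter


def gc_resolution(interval_lengths):
    """Return the GC resolution (in bases) for a set of input interval lengths.

    Thresholds follow the text; tie-breaking for the mode is an assumption.
    """
    mode_length = Counter(interval_lengths).most_common(1)[0][0]
    if mode_length <= 100:
        return 1        # one-base resolution, kept in a byte list
    if mode_length <= 1000:
        return 100      # interval tree of 100 bps bins
    if mode_length <= 10_000:
        return 1000     # interval tree of 1000 bps bins
    return 10_000       # interval tree of 10,000 bps bins
```

A query consisting mostly of SNPs (length 1) would therefore use one-base resolution, whereas a set of ChIP-seq peaks a few hundred bases long would use 100-base resolution.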
Algorithm 5.1: generateRandomIntervals
Require: wIF, tM, tGC, tValue, LMAX
  wIF: If true, isochore family pools will be used in random interval generation.
  tM: Threshold to match mappability within this value.
  tGC: Threshold to match GC content within this value.
  tValue: Stands for tM or tGC.
  LMAX: Maximum interval length for which GC and mappability will be accounted for (default is 100,000 bps).

for each chromosome chr_i do
    S_i ← subset of intervals in S that are on chr_i
    if S_i ≠ ∅ then
        for each sampling b in {1, . . . , B} do
            S_i^(b) ← ∅
            for each given interval g in S_i do
                gLen ← length(g)
                if gLen ≤ LMAX and wGCM then
                    r ← generateARandomIntervalwGCM(g)
                else if gLen ≤ LMAX and (wGC or wM) then
                    r ← generateARandomIntervalwGCorwM(g)
                else   (gLen > LMAX or woGCM)
                    r ← getARandomInterval(chr_i, gLen)
                end if
                S_i^(b) ← S_i^(b) ∪ {r}
            end for
        end for
    end if
end for
Function generateARandomIntervalwGCM(g)
    gGC ← calculateGC(g)
    gM ← calculateMappability(g)
    do
        do
            if wIF then
                gIF ← findIsochoreFamily(g)
                r ← getARandomInterval(chr_i, gLen, gIF)
            else
                r ← getARandomInterval(chr_i, gLen)
            end if
        while r overlaps with an already generated interval in S_i^(b)
        rGC ← calculateGC(r)
        rM ← calculateMappability(r)
    while (|rGC − gGC| > tGC) or (|rM − gM| > tM)
    return r

Function generateARandomIntervalwGCorwM(g)
    gValue ← calculateGC(g) or calculateMappability(g)
    do
        do
            if wIF then
                gIF ← findIsochoreFamily(g)
                r ← getARandomInterval(chr_i, gLen, gIF)
            else
                r ← getARandomInterval(chr_i, gLen)
            end if
        while r overlaps with an already generated interval in S_i^(b)
        rValue ← calculateGC(r) or calculateMappability(r)
    while (|rValue − gValue| > tValue)
    return r
Mappabilities of genomic intervals are obtained from ENCODE; the source files
are listed in Table A.1 of Appendix A. A query interval can be part of a single source
interval or overlap with multiple source intervals with different mappability values, as
provided in the original source. In either case, its mappability is estimated by calculating
the weighted average of the source mappability values, where each weight is the
proportion of the query interval's length that overlaps with the corresponding source
mappability interval.
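The weighted average just described can be written as a short Python sketch. Half-open (start, end) coordinates and the function name are assumptions made for illustration; GLANET's coordinate convention may differ.

```python
def interval_mappability(query, source):
    """Length-weighted average mappability of a query interval.

    `query` is (start, end); `source` is a list of (start, end, value)
    mappability intervals. Coordinates are treated as half-open.
    """
    q_start, q_end = query
    q_len = q_end - q_start
    total = 0.0
    for s_start, s_end, value in source:
        overlap = min(q_end, s_end) - max(q_start, s_start)
        if overlap > 0:
            # Weight = proportion of the query length covered by this source interval.
            total += value * overlap / q_len
    return total
```

For a 10 bps query covered half by a source interval with mappability 1.0 and half by one with mappability 0.5, the estimate is 0.75.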
5.3 Time and Space Complexity of Random Interval Generation
We generate a random interval for each sampling and for each query interval. Therefore,
the time complexity of random interval generation is O(b ∗ m) times the sum
of the time complexities of the GC, mappability and isochore family calculations when
the wGC, wM and wIF random interval generation options are selected, where b is the
number of samplings and m is the number of query intervals.
For each sampling and query interval, we calculate the GC content, mappability and
isochore family of the query interval and of the randomly generated interval, depending on
the options chosen. To avoid infinite loops, each calculation has its own preset number
of trials; if the procedure cannot generate a random interval within a threshold, it
selects the best random interval generated up to that point.
We store different data structures in memory for GC calculation depending on
the required resolution. When the query intervals consist of SNPs, or when the mode of
the interval lengths is ≤ 100 nucleotides, we need the GC data at the highest resolution;
therefore, we keep the GC content of each nucleotide in chromosome-based byte array
lists. Each byte in a GC byte array list contains the GC content of 7 nucleotides. Therefore,
the space complexity of a GC byte array list is proportional to the size of the chromosome.
For other query intervals, we keep the GC content of 100, 1000 or 10,000 bps long
intervals in chromosome-based interval trees, depending on the mode of the query
interval lengths. Therefore, the space complexity of the interval trees is proportional to
the size of the human chromosomes.
We keep the isochore family of 100,000 bps long intervals in chromosome-based array
lists. Therefore, the space complexity of the isochore family data structure is proportional
to the size of each human chromosome.
We keep two chromosome-based array lists for mappability. The start and end
positions of intervals with a specific mappability value are stored in an integer array list in
ascending order, and their corresponding mappability values are stored in a short array
list. The data is gathered from the mappability bigWig files listed in
Table A.1 of Appendix A. Therefore, the space complexity of the mappability data
structures is proportional to the number of intervals provided in the chromosome-based
bigWig files.
The time complexity of GC calculation with the GC byte array list is O(1), since we access
the corresponding byte or bytes directly by array index, and the time
complexity of GC calculation with the GC interval tree is equal to the cost of an interval tree
search. Isochore family calculation relies on the calculated GC; therefore, the time
complexity of isochore family calculation is equal to the time complexity of GC calculation.
The time complexity of mappability calculation is equal to the time complexity of a binary
search in the mappability integer array list; the indexes returned by the binary
search are then used to access the corresponding mappability values in the mappability
short array list in O(1) time.
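The binary search lookup can be illustrated with Python's standard `bisect` module. For readability this sketch keeps interval starts and ends in two separate sorted lists, whereas the text describes a single integer array list; that layout, the half-open coordinates and the function name are assumptions.

```python
from bisect import bisect_right


def mappability_at(position, starts, ends, values):
    """Look up the mappability value covering `position`.

    `starts` and `ends` are parallel sorted lists of non-overlapping,
    half-open intervals; `values` holds one value per interval.
    Returns None when no interval covers the position.
    """
    i = bisect_right(starts, position) - 1  # O(log n) binary search
    if i >= 0 and position < ends[i]:
        return values[i]                    # O(1) access by index
    return None
```

The same index can serve as the starting point for collecting all source intervals that overlap a longer query interval.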
5.4 Joint Enrichment Analysis of Transcription Factors and KEGG Pathways
GLANET enables joint enrichment analysis for TF binding sites and KEGG path-
ways. With this option, users can evaluate whether the input set is enriched concur-
rently with binding sites of TFs and the genes within a KEGG pathway. This joint
enrichment analysis provides a detailed functional interpretation of the input loci.
For example, for the given query intervals, TF enrichment analysis alone
may not reveal enrichment for any particular TF, whereas a joint enrichment
analysis of genomic elements representing TF binding regions and KEGG pathways
may identify several enriched transcription factor and pathway pairs. Therefore, joint
enrichment analysis may provide more information than TF or KEGG Pathway
enrichment analysis provides separately.
Separate enrichment analysis for TFs or KEGG pathways with respect to query
intervals requires overlapping TFs or KEGG Pathway intervals with query intervals,
which involves finding overlapping intervals for 2 interval sets. However, joint en-
richment analysis of TFs and KEGG pathways with respect to query intervals requires
finding common overlapping intervals for 3 interval sets, namely, TFs, KEGG Path-
way and query intervals.
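The difference between the two-set and three-set overlap problems can be made concrete with a brute-force Python sketch. It only illustrates what a common overlap among a query interval, a TF interval and a KEGG pathway gene interval means; it uses nested loops rather than the interval trees GLANET relies on, and the function names and half-open coordinates are assumptions.

```python
def overlap(a, b):
    """Return the common part of two half-open intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def joint_overlaps(query, tf_intervals, pathway_intervals):
    """Find (query, TF, pathway gene) triples that share a common region."""
    hits = []
    for q in query:
        for tf in tf_intervals:
            q_tf = overlap(q, tf)
            if q_tf is None:
                continue
            for gene in pathway_intervals:
                common = overlap(q_tf, gene)
                if common is not None:
                    hits.append((q, tf, gene, common))
    return hits
```

A triple is reported only when all three intervals share at least one base, which is a stricter condition than the query overlapping the TF and the gene separately.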
Later on, in Chapter 8, we generalize the problem of finding common overlapping
intervals from 2 or 3 interval sets to n interval sets. We provide our proposed solutions
for two problems: finding n common overlapping intervals for n interval sets, and
finding at least k common overlapping intervals for n interval sets.
5.5 Time and Space Complexity of Enrichment Analysis
Enrichment analysis annotates the randomly generated data of each sampling. The time
complexity of enrichment analysis is therefore the time complexity of random interval
generation plus that of annotating all samplings. The time complexity of annotating all
samplings is O(b ∗ m ∗ min(n, k log n)), where b is the number of samplings, m is the
number of query intervals, n is the number of intervals stored in the interval tree and k is
the number of hits. The time and space complexity of random interval generation is
presented in Section 5.3.
We construct chromosome-based interval trees for each element type, namely,
DNaseI hypersensitive sites (DHSs), Transcription Factors (TFs), Histone Modifications
(HMs) and RefSeq Genes. The space complexity of each interval tree is O(n),
where n is the number of intervals stored in the interval tree.
CHAPTER 6
DATA DRIVEN COMPUTATIONAL EXPERIMENTS
We designed novel data-driven computational experiments to evaluate GLANET's
enrichment procedure in terms of Type-I error and power. We show that GLANET's
enrichment test has low Type-I error and high statistical power, and that it is sensitive
to varying experiment parameters, GLANET parameters, and significance levels.
The data-driven computational experiments also enable us to assess the enrichment
capabilities of other tools. Towards this aim, we conduct extensive experiments to
compare GLANET with another enrichment tool, GAT. In this chapter, we
present the design of the data-driven computational experiments with detailed
explanations. We provide the experiment results and compare GLANET and GAT based
on these results. Moreover, we interpret the results further with Wilcoxon signed rank
tests and ROC curves.
6.1 Design of Data-driven Computational Experiments
The key idea of these experiments is that at the TSSs of expressed genes, we would
expect to observe enrichment of RNA polymerase II (POL2) occupancy and of histone
modifications that are related to transcriptional activation. In contrast, for the TSSs of
non-expressed genes, we would expect enrichment of histone modification elements
that are associated with transcriptional repression.
We used data from K562 and GM12878 cells and defined expressed and non-expressed
gene sets based on RNA-seq analysis of these cells. Genomic intervals that cover the
500 bps upstream and 100 bps downstream of the first exon of the genes in these sets
were retrieved. We based our experiments on the enrichment of activator and repressor
elements in these intervals. In these experiments, our null hypothesis always stated
that there is no enrichment. For each simulation, we sampled non-overlapping intervals
from the TSS regions of the relevant gene set (expressed or non-expressed genes)
and evaluated, separately with GLANET, the enrichment of 12 histone modifications with
roles in transcriptional repression or activation and of POL2 occupancy. Based on these
simulations, we calculated Type-I error and power as follows:
6.1.1 Type-I error experiments
These experiments evaluate whether the GLANET enrichment procedure can control
Type-I error (the probability of rejecting the null hypothesis when it is true, thus making
a false rejection) by considering settings where the null hypothesis is true. In the case
of non-expressed genes, the null hypothesis is that intervals located around
the TSSs of non-expressed genes are not enriched with activator elements. Similarly,
in experiments conducted with expressed genes, the null hypothesis is that the intervals
around the TSSs of expressed genes are not enriched with repressor elements.
The Type-I error rate is estimated as the fraction of simulations in which we incorrectly
reject the null hypothesis.
6.1.2 Power experiments
These experiments evaluate the power (the probability of rejecting the null hypothesis
when the alternative hypothesis is true, thus making a correct rejection) of the GLANET
enrichment procedure by considering cases where the alternative hypothesis is true. In
experiments conducted with non-expressed genes, our null hypothesis states that the
intervals are not enriched with repressor elements. Similarly, in the case of expressed
genes, the null hypothesis is that the genomic intervals are not enriched with activator
elements. Power is then estimated as the fraction of simulations in which we correctly
reject the null hypothesis.
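Both quantities are rejection rates over the simulations and differ only in which hypothesis is true for the element being tested. A minimal Python sketch follows; the function name and the use of "less than or equal to" at the significance level are assumptions.

```python
def rejection_rate(p_values, alpha):
    """Fraction of simulations in which the null hypothesis is rejected.

    `p_values` holds one (possibly corrected) p-value per simulation.
    Rejecting when p <= alpha is an assumption about the boundary case.
    """
    return sum(1 for p in p_values if p <= alpha) / len(p_values)
```

Applied to p-values of a repressor element on intervals of expressed genes, this rate estimates Type-I error; applied to p-values of an activator element on the same intervals, it estimates power.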
The design for the data-driven computational experiments is summarized in Figures 6.1 and
6.2. Each experiment consisted of sampling 500 non-overlapping intervals from the
relevant gene set described below and repeating the sampling procedure for 1,000
simulations to estimate Type-I error and power. The list of genomic elements and
further details on how we defined the expressed and non-expressed gene sets
and the regions around the TSSs are given below.
Computational experiments with expressed genes
Step 1. Define the set of expressed genes based on RNA-seq expression data.
Step 2. Retrieve 601 bps intervals around the genes' first exons.
Step 3. Sample 500 intervals from the interval set.
Step 4a. Input intervals to GLANET and check if the activator is enriched. Repeat steps 3 and 4a N times. Power: Number of times the activator is found enriched / N.
Step 4b. Input intervals to GLANET and check if the repressor is enriched. Repeat steps 3 and 4b N times. Type-I error: Number of times the repressor is found enriched / N.
Figure 6.1: Design for data-driven computational experiments for expressed genes.
N is set to 1000. Activator elements are defined as H2AZ, H3K27ac, H3K4me2,
H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1, H4K20me1
and POL2 [8], whereas H3K27me3 and H3K9me3 constitute the repressor elements.
6.1.3 Transcriptional activator and repressor elements
We considered histone modifications and POL2 occupancy in two groups: (1) activator
elements, including POL2 and the modifications H2AZ, H3K27ac, H3K4me2,
H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1 and H4K20me1, which are
associated with transcriptional activation at TSSs [8]; and (2) repressor elements, including
the modifications H3K9me3 and H3K27me3 [8]. However, some of these elements are
observed to exhibit both activator and repressor features and/or are reported to be
present in regions other than the TSSs, such as gene bodies or the 3' end. We marked
the H3K36me3, H3K4me1, H4K20me1, and H3K9me3 modifications as ambiguous
elements, as their role at TSSs is ambiguous [8, 52, 53].
After processing the RNA-seq data of GM12878 and K562 cell lines with the ENCODE
RNA-seq data analysis pipeline (https://www.encodeproject.org/rna-seq/small-rnas/),
we defined expressed and non-expressed gene sets.
Computational experiments with non-expressed genes
Step 1. Define the non-expressed genes based on RNA-seq expression data.
Step 2. Retrieve 601 bps genomic intervals around their first exons and filter genomic intervals based on DNaseI exclusion criteria.
Step 3. Sample 500 intervals from the interval set.
Step 4a. Input intervals to GLANET and check if the repressor is enriched. Repeat steps 3 and 4a N times. Power: Number of times the repressor is found enriched / N.
Step 4b. Input intervals to GLANET and check if the activator is enriched. Repeat steps 3 and 4b N times. Type-I error: Number of times the activator is found enriched / N.
Figure 6.2: Design for data-driven computational experiments for non-expressed
genes. N is set to 1000. Activator elements are defined as H2AZ, H3K27ac,
H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1,
H4K20me1 and POL2 [8], whereas H3K27me3 and H3K9me3 constitute the
repressor elements.
Both the GM12878 and K562 RNA-seq data included two biological replicates. For
each gene, we utilized the lowest and highest transcripts per million (TPM) values
across replicates for defining the expressed and non-expressed gene sets, respectively.
6.1.4 Genomic interval sets for expressed genes
We defined two sets of expressed genes with varying levels of stringency by considering
the top 5th and top 20th percentiles of genes ranked by descending
TPM value. In each case, genomic intervals that cover the 500 bps upstream and 100
bps downstream of the first exon of the genes in these sets were retrieved. We refer to
these two genomic interval sets as Top5 and Top20.
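A possible way to derive such interval sets is sketched below in Python. The strand handling, the 1-based inclusive coordinates (which give 500 + 1 + 100 = 601 bps) and the rounding used for the percentile cut are assumptions made for illustration; the function names are hypothetical.

```python
def tss_interval(first_exon_start, first_exon_end, strand,
                 upstream=500, downstream=100):
    """Return a 1-based inclusive (start, end) window around the TSS.

    Assumes the TSS is the 5' end of the first exon on the gene's strand.
    """
    if strand == "+":
        tss = first_exon_start
        return (tss - upstream, tss + downstream)
    tss = first_exon_end
    return (tss - downstream, tss + upstream)


def top_percent_genes(tpm_by_gene, percent):
    """Return the genes in the top `percent` by descending TPM value.

    `tpm_by_gene` is assumed to hold, per gene, the lowest TPM across replicates.
    """
    ranked = sorted(tpm_by_gene, key=tpm_by_gene.get, reverse=True)
    keep = max(1, round(len(ranked) * percent / 100))
    return ranked[:keep]
```

Calling `top_percent_genes` with 5 and 20 and mapping each selected gene through `tss_interval` would give interval sets analogous to Top5 and Top20.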
6.1.5 Genomic interval sets for non-expressed genes
We labeled genes with zero TPM values as non-expressed genes and formed a tentative
interval set by taking 500 bps upstream and 100 bps downstream of these genes'
first exons. [54] and others observed that DNaseI hypersensitivity and gene expression
correlate positively; therefore, we further filtered these intervals with respect to
their cell type specific DNaseI signal. We considered two modes of DNaseI overlap
exclusion: (i) discarding the interval completely from the interval set if it has any
overlap with a DNase-seq peak (CompletelyDiscard), and (ii) keeping
the interval but reducing it to its longest sub-interval without DNase-seq peak overlap
(TakeTheLongest). In experiments conducted with non-expressed genes, we operated
with these two different interval sets: CompletelyDiscard and TakeTheLongest.
The DNaseI overlap exclusion accounted for the fact that zero TPM values might
arise as an artifact of sequencing depth and resulted in a conservatively defined set of
non-expressed genes.
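The two exclusion modes can be expressed with a small interval subtraction routine. The following Python sketch assumes half-open (start, end) coordinates and peaks on the same chromosome as the interval; the function names are hypothetical and the code is an illustration rather than the procedure actually used.

```python
def subtract_peaks(interval, peaks):
    """Return the parts of `interval` not covered by any DNase-seq peak."""
    start, end = interval
    pieces = []
    cursor = start
    for p_start, p_end in sorted(peaks):
        if p_end <= cursor or p_start >= end:
            continue  # peak lies outside the remaining part of the interval
        if p_start > cursor:
            pieces.append((cursor, p_start))
        cursor = max(cursor, p_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


def completely_discard(interval, peaks):
    """Keep the interval only if no peak overlaps it; otherwise return None."""
    return interval if subtract_peaks(interval, peaks) == [interval] else None


def take_the_longest(interval, peaks):
    """Reduce the interval to its longest peak-free part, or None if fully covered."""
    pieces = subtract_peaks(interval, peaks)
    return max(pieces, key=lambda p: p[1] - p[0]) if pieces else None
```

An interval with a single peak in its first third would be dropped under CompletelyDiscard but kept, shortened to its last two thirds, under TakeTheLongest.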
6.2 Results
We designed and conducted novel data-driven computational experiments to assess
the Type-I error and power of GLANET's enrichment procedure. In this section, we
report results of these data-driven computational experiments, which validate the
enrichment procedure in a controlled setting. We explore the effect of GLANET enrichment
parameters together with experiment parameters, which necessitated 128,000 runs of
GLANET, as indicated in Table 6.1.
Next, we compare GAT and GLANET through data-driven computational experiments,
which required 32,000 GAT runs, as described in Table 6.2. GAT performs
coarse-grained GC matching, which is not exactly the same as the wGC or wIF option of
GLANET; nevertheless, for GAT we use wGC and wIF interchangeably throughout the text.
6.2.1 Data-driven Computational Experiments Results for Activator Elements
We performed the data-driven computational experiments summarized in Figures 6.1
and 6.2 under all possible enrichment analysis parameter settings of GLANET listed
in Table 5.1. We varied the association measure modes, EOO or NOOB and con-
sidered cases where we accounted for GC, and/or mappability or ignored these two
47
Table 6.1: Data-driven Computational Experiments for GLANET
GLANET DDCE
Group                  Parameter                                   Values                                                   Options
Experiment Parameters  Cell Line                                   GM12878, K562                                            2
Experiment Parameters  Experiment Scenario                         Expressed Genes, Non-expressed Genes                     2
Experiment Parameters  Experiment Setting                          Top5, Top20 (Expressed Genes);                           2
                                                                   CompletelyDiscard, TakeTheLongest (Non-expressed Genes)
GLANET Parameters      Association Statistic Option                EOO, NOOB                                                2
GLANET Parameters      Random Interval Generation Matching Option  wGC, wM, wGCM, woGCM                                     4
GLANET Parameters      Random Interval Generation Start Option     woIF, wIF                                                2
2 x 2 x 2 x 2 x 4 x 2 = 128 different Experiment and GLANET parameter combinations
1000 simulations for each parameter combination
128,000 runs of GLANET
Table 6.2: Data-driven Computational Experiments for GAT
GAT DDCE
Group                  Parameter                                   Values                                                   Options
Experiment Parameters  Cell Line                                   GM12878, K562                                            2
Experiment Parameters  Experiment Scenario                         Expressed Genes, Non-expressed Genes                     2
Experiment Parameters  Experiment Setting                          Top5, Top20 (Expressed Genes);                           2
                                                                   CompletelyDiscard, TakeTheLongest (Non-expressed Genes)
GAT Parameters         Association Statistic Option                EOO, NOOB                                                2
GAT Parameters         Random Interval Generation Matching Option  wGC, woGC                                                2
2 x 2 x 2 x 2 x 2 = 32 different Experiment and GAT parameter combinations
1000 simulations for each parameter combination
32,000 runs of GAT
biases in the random interval generation step. These settings are with GC (wGC),
with mappability (wM), with GC and mappability (wGCM), and without GC and
mappability (woGCM). Furthermore, we considered the with Isochore Family and
without Isochore Family options, which we refer to as wIF and woIF, respectively.
Together with the two association measures, these constituted 16 different
parameter settings. As described above, we varied the definitions
of non-expressed and expressed genes too; for the expressed gene setting we have
Top5, which is the conservatively defined set of expressed genes, and Top20, which
is less conservatively defined. For the non-expressed interval set, CompletelyDiscard
is a more stringent definition than TakeTheLongest. We repeated these experiments
for the K562 and GM12878 cell lines in order to get a complete picture of the
performance of the GLANET enrichment procedure.
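As a quick check of the run counts, the experiment grid can be enumerated with a short sketch. The option labels are taken from Table 6.1; the variable names are illustrative only.

```python
from itertools import product

# Experiment parameters (labels as in Table 6.1)
cell_lines = ["GM12878", "K562"]
experiment_settings = {
    "Expressed Genes": ["Top5", "Top20"],
    "Non-expressed Genes": ["CompletelyDiscard", "TakeTheLongest"],
}

# GLANET enrichment parameters
association_statistics = ["EOO", "NOOB"]
matching_options = ["wGC", "wM", "wGCM", "woGCM"]
isochore_options = ["wIF", "woIF"]

# 2 x 4 x 2 = 16 GLANET parameter settings
glanet_settings = list(
    product(association_statistics, matching_options, isochore_options)
)

# 2 cell lines x 2 scenarios x 2 settings per scenario = 8 experiment configurations
experiment_configs = [
    (cell, scenario, setting)
    for cell in cell_lines
    for scenario, settings in experiment_settings.items()
    for setting in settings
]

all_combinations = list(product(experiment_configs, glanet_settings))
simulations_per_combination = 1000
total_runs = len(all_combinations) * simulations_per_combination
```

Here `len(glanet_settings)` is 16, `len(all_combinations)` is 128, and `total_runs` is 128,000, matching the counts reported in Table 6.1.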
Through these data-driven computational experiments, we assessed GLANET Type-I
error and power. We provide the results for significance levels of α = 0.05 and
α = 0.001, which are displayed in Figures 6.3-6.10.
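One way to read these quantities is as rejection rates over the simulations of a given parameter combination. The sketch below assumes each simulation yields one p-value per element and that a test is rejected when its p-value is at or below α. The helper name and the example p-values are hypothetical, and whether raw or multiple-testing-adjusted p-values are used is not specified here.

```python
from typing import Sequence


def rejection_rate(p_values: Sequence[float], alpha: float) -> float:
    """Fraction of simulations whose p-value is at or below alpha."""
    if not p_values:
        raise ValueError("at least one simulation is required")
    return sum(p <= alpha for p in p_values) / len(p_values)


# Hypothetical p-values for one activator element (illustrative numbers only).
# Non-expressed genes: enrichment is not expected, so rejections are Type-I errors.
null_p_values = [0.50, 0.20, 0.04, 0.90]
# Expressed genes: enrichment is expected, so rejections contribute to power.
alternative_p_values = [0.001, 0.03, 0.20, 0.0005]

type_one_error = rejection_rate(null_p_values, alpha=0.05)
power = rejection_rate(alternative_p_values, alpha=0.05)
```

For repressor elements the roles of the two scenarios are swapped: rejections on expressed genes count toward Type-I error, and rejections on non-expressed genes count toward power.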
Figure 6.3 summarizes the results of experiments conducted with activator elements
for the expressed genes (Top5) and non-expressed genes (CompletelyDiscard) settings
for K562. Overall, we observe that the Type-I error is well below the target
significance level (α = 0.05) without sacrificing power in all sixteen modes of the
GLANET enrichment analysis. One exception is H3K4me1, for which the Type-I error is
significantly higher than the target level. This could potentially be attributed to
its ambiguous role on promoters, as it also acts downstream of TSSs [8] and is
reported to exhibit repressor features [52]. Interestingly, the enrichment assessment
of this mark for non-expressed genes is the one most affected by the bias adjustment
in the null distribution estimation. The Type-I error involving this mark improves
significantly when GC and/or mappability are accounted for, regardless of the
association statistic utilized for enrichment, without a negative impact on power.
Similarly, using the wIF option improves its Type-I error. Another exception is the
H3K36me3 mark, which has considerably low power. This is also one of the elements
whose role on promoters is ambiguous; H3K36me3 is reported to have a preference for
the 3’ end of active genes [8]. When the same experiments were conducted in the
GM12878 cell line, we obtained similar results, with even lower Type-I errors
(Figure 6.4).
When we use a looser interval exclusion criterion in generating intervals of
non-expressed genes (TakeTheLongest) and a less stringent definition of expressed
genes (Top20), the Type-I errors are higher. They are higher even for some
non-ambiguous elements in both K562 and GM12878 cells (Figures 6.5 and 6.6). This
indicates that GLANET is not universally conservative across all settings. When we
re-assessed Type-I errors and power at a more stringent significance level of 0.001,
the Type-I errors were controlled in the (CompletelyDiscard) and (Top5) experiments
without loss of power (Figures 6.7 and 6.8), with the exception of the ambiguous
elements H3K4me1, H3K36me3, and H4K20me1. When the less stringent experiment settings
(TakeTheLongest, Top20) are used at this significance level, there are a few elements
with Type-I error above the target significance level and power less than one
(Figures 6.9 and 6.10).
[Plot panels: (a) K562, Non-exp, α=0.05, wIF; (b) K562, Exp, α=0.05, wIF; (c) K562, Non-exp, α=0.05, woIF; (d) K562, Exp, α=0.05, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.3: Assessment of GLANET Type-I error and power with data-driven com-
putational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuris-
tic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes,
Top5) results, for significance level of 0.05. (c, d) Type-I error and power estimated
without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, Com-
pletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05.
[Plot panels: (a) GM12878, Non-exp, α=0.05, wIF; (b) GM12878, Exp, α=0.05, wIF; (c) GM12878, Non-exp, α=0.05, woIF; (d) GM12878, Exp, α=0.05, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.4: Assessment of GLANET Type-I error and power with data-driven com-
putational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic
using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes,
Top5) results, for significance level of 0.05. (c, d) Type-I error and power estimated
without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes,
CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of
0.05.
[Plot panels: (a) K562, Non-exp, α=0.05, wIF; (b) K562, Exp, α=0.05, wIF; (c) K562, Non-exp, α=0.05, woIF; (d) K562, Exp, α=0.05, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.5: Assessment of GLANET Type-I error and power with data-driven com-
putational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic
using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20)
results, for significance level of 0.05. (c, d) Type-I error and power estimated with-
out Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, TakeThe-
Longest) and (Expressed Genes, Top20) results, for significance level of 0.05.
[Plot panels: (a) GM12878, Non-exp, α=0.05, wIF; (b) GM12878, Exp, α=0.05, wIF; (c) GM12878, Non-exp, α=0.05, woIF; (d) GM12878, Exp, α=0.05, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.6: Assessment of GLANET Type-I error and power with data-driven
computational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using
GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20)
results, for significance level of 0.05. (c, d) Type-I error and power estimated without
Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes, TakeThe-
Longest) and (Expressed Genes, Top20) results, for significance level of 0.05.
[Plot panels: (a) K562, Non-exp, α=0.001, wIF; (b) K562, Exp, α=0.001, wIF; (c) K562, Non-exp, α=0.001, woIF; (d) K562, Exp, α=0.001, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.7: Assessment of GLANET Type-I error and power with data-driven com-
putational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuris-
tic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes,
Top5) results, for significance level of 0.001. (c, d) Type-I error and power estimated
without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, Com-
pletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.001.
[Plot panels: (a) GM12878, Non-exp, α=0.001, wIF; (b) GM12878, Exp, α=0.001, wIF; (c) GM12878, Non-exp, α=0.001, woIF; (d) GM12878, Exp, α=0.001, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.8: Assessment of GLANET Type-I error and power with data-driven com-
putational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic
using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes,
Top5) results, for significance level of 0.001. (c, d) Type-I error and power estimated
without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes,
CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of
0.001.
[Plot panels: (a) K562, Non-exp, α=0.001, wIF; (b) K562, Exp, α=0.001, wIF; (c) K562, Non-exp, α=0.001, woIF; (d) K562, Exp, α=0.001, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.9: Assessment of GLANET Type-I error and power with data-driven com-
putational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic
using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20)
results, for significance level of 0.001. (c, d) Type-I error and power estimated with-
out Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, TakeThe-
Longest) and (Expressed Genes, Top20) results, for significance level of 0.001.
[Plot panels: (a) GM12878, Non-exp, α=0.001, wIF; (b) GM12878, Exp, α=0.001, wIF; (c) GM12878, Non-exp, α=0.001, woIF; (d) GM12878, Exp, α=0.001, woIF. Panels (a, c) plot Type I Error and panels (b, d) plot Power, each for EOO and NOOB under wGC, wGCM, wM, and woGCM. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]
Figure 6.10: Assessment of GLANET Type-I error and power with data-driven com-
putational experiments. Histone marks with ambiguous activator roles are marked
with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic
using GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes,
Top20) results, for significance level of 0.001. (c, d) Type-I error and power es-
timated without Isochore Family (woIF) heuristic using GM12878, (Non-expressed
Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level
of 0.001.
6.2.2 Data-driven Computational Experiments Results for Repressor Elements
Experiments with the repressor element H3K27me3 resulted in zero Type-I error except
for a few cases in GM12878 (Tables 6.3 and 6.4). In these experiments, GLANET
attained a power of one across all settings, as shown in Tables 6.5 and 6.6.
Experiments with the repressor element H3K9me3 resulted in a Type-I error of zero
for GM12878 and, depending on the parameter selection, Type-I errors above the set
significance level in the K562 cell line (Tables 6.3 and 6.4). The power in both
cell lines for this histone mark is low (Tables 6.5 and 6.6). H3K9me3 is also one
of the ambiguous elements in terms of its repressive role on promoters.
Overall, we observe that GLANET controls Type-I error well without loss of power.
Type-I error control is significantly better with the NOOB association statistic.
Accounting for GC and mappability biases and using the wIF option lower the Type-I
error. We further explore how these different parameters affect GLANET enrichment
analysis in Sections 6.2.4 and 6.2.5.
Table 6.3: Type-I error rates calculated in data-driven experiments conducted with
repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for
α = 0.05.
Type-I Error, α = 0.05
                                                 wIF                           woIF
Element   Cell line  Expressed Genes  Statistic  wGC    wM     wGCM   woGCM    wGC    wM     wGCM   woGCM
H3K27me3  GM12878    Top5             EOO        0      0      0      0        0      0      0      0
H3K27me3  GM12878    Top20            EOO        0.001  0      0      0.001    0.002  0.001  0      0.006
H3K27me3  GM12878    Top5             NOOB       0      0      0      0        0      0      0      0
H3K27me3  GM12878    Top20            NOOB       0      0      0      0.001    0.001  0.001  0      0.008
H3K27me3  K562       Top5             EOO        0      0      0      0        0      0      0      0
H3K27me3  K562       Top20            EOO        0      0      0      0        0      0      0      0
H3K27me3  K562       Top5             NOOB       0      0      0      0        0      0      0      0
H3K27me3  K562       Top20            NOOB       0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top5             EOO        0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top20            EOO        0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top5             NOOB       0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top20            NOOB       0      0      0      0        0      0      0      0
H3K9me3   K562       Top5             EOO        0      0      0      0        0      0      0      0
H3K9me3   K562       Top20            EOO        0.079  0.051  0.052  0.083    0.103  0.081  0.066  0.126
H3K9me3   K562       Top5             NOOB       0      0      0      0        0      0      0      0
H3K9me3   K562       Top20            NOOB       0.042  0.023  0.025  0.041    0.06   0.039  0.035  0.085
Table 6.4: Type-I error rates calculated in data-driven experiments conducted with
repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for
α = 0.001.
Type-I Error, α = 0.001
                                                 wIF                           woIF
Element   Cell line  Expressed Genes  Statistic  wGC    wM     wGCM   woGCM    wGC    wM     wGCM   woGCM
H3K27me3  GM12878    Top5             EOO        0      0      0      0        0      0      0      0
H3K27me3  GM12878    Top20            EOO        0      0      0      0        0      0      0      0
H3K27me3  GM12878    Top5             NOOB       0      0      0      0        0      0      0      0
H3K27me3  GM12878    Top20            NOOB       0      0      0      0        0      0      0      0
H3K27me3  K562       Top5             EOO        0      0      0      0        0      0      0      0
H3K27me3  K562       Top20            EOO        0      0      0      0        0      0      0      0
H3K27me3  K562       Top5             NOOB       0      0      0      0        0      0      0      0
H3K27me3  K562       Top20            NOOB       0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top5             EOO        0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top20            EOO        0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top5             NOOB       0      0      0      0        0      0      0      0
H3K9me3   GM12878    Top20            NOOB       0      0      0      0        0      0      0      0
H3K9me3   K562       Top5             EOO        0      0      0      0        0      0      0      0
H3K9me3   K562       Top20            EOO        0.002  0.001  0.001  0.002    0.003  0.001  0.001  0.005
H3K9me3   K562       Top5             NOOB       0      0      0      0        0      0      0      0
H3K9me3   K562       Top20            NOOB       0.001  0      0      0.001    0.001  0      0      0.001
Table 6.5: Power calculated in data-driven experiments conducted with repressor
elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.05.
Power, α = 0.05
                                                     wIF                           woIF
Element   Cell line  Non-expressed Genes  Statistic  wGC    wM     wGCM   woGCM    wGC    wM     wGCM   woGCM
H3K27me3  GM12878    CompletelyDiscard    EOO        1      1      1      1        1      1      1      1
H3K27me3  GM12878    TakeTheLongest       EOO        1      1      1      1        1      1      1      1
H3K27me3  GM12878    CompletelyDiscard    NOOB       1      1      1      1        1      1      1      1
H3K27me3  GM12878    TakeTheLongest       NOOB       1      1      1      1        1      1      1      1
H3K27me3  K562       CompletelyDiscard    EOO        1      1      1      1        1      1      1      1
H3K27me3  K562       TakeTheLongest       EOO        1      1      1      1        1      1      1      1
H3K27me3  K562       CompletelyDiscard    NOOB       1      1      1      1        1      1      1      1
H3K27me3  K562       TakeTheLongest       NOOB       1      1      1      1        1      1      1      1
H3K9me3   GM12878    CompletelyDiscard    EOO        0.134  0.151  0.163  0.154    0.161  0.182  0.177  0.214
H3K9me3   GM12878    TakeTheLongest       EOO        0.186  0.199  0.209  0.211    0.221  0.244  0.234  0.299
H3K9me3   GM12878    CompletelyDiscard    NOOB       0.076  0.098  0.103  0.095    0.094  0.113  0.106  0.133
H3K9me3   GM12878    TakeTheLongest       NOOB       0.096  0.113  0.124  0.119    0.12   0.134  0.129  0.168
H3K9me3   K562       CompletelyDiscard    EOO        0.003  0.004  0.003  0.004    0.004  0.005  0.005  0.007
H3K9me3   K562       TakeTheLongest       EOO        0.002  0.003  0.002  0.003    0.002  0.005  0.004  0.006
H3K9me3   K562       CompletelyDiscard    NOOB       0.003  0.004  0.004  0.004    0.004  0.005  0.005  0.006
H3K9me3   K562       TakeTheLongest       NOOB       0.005  0.004  0.005  0.005    0.005  0.005  0.005  0.005
Table 6.6: Power calculated in data-driven experiments conducted with repressor
elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.001.
Power, α = 0.001
                                                     wIF                           woIF
Element   Cell line  Non-expressed Genes  Statistic  wGC    wM     wGCM   woGCM    wGC    wM     wGCM   woGCM
H3K27me3  GM12878    CompletelyDiscard    EOO        1      1      1      1        1      1      1      1
H3K27me3  GM12878    TakeTheLongest       EOO        1      1      1      1        1      1      1      1
H3K27me3  GM12878    CompletelyDiscard    NOOB       1      1      1      1        1      1      1      1
H3K27me3  GM12878    TakeTheLongest       NOOB       1      1      1      1        1      1      1      1
H3K27me3  K562       CompletelyDiscard    EOO        1      1      1      1        1      1      1      1
H3K27me3  K562       TakeTheLongest       EOO        1      1      1      1        1      1      1      1
H3K27me3  K562       CompletelyDiscard    NOOB       1      1      1      1        1      1      1      1
H3K27me3  K562       TakeTheLongest       NOOB       1      1      1      1        1      1      1      1
H3K9me3   GM12878    CompletelyDiscard    EOO        0.003  0.005  0.004  0.005    0.006  0.006  0.008  0.017
H3K9me3   GM12878    TakeTheLongest       EOO        0.008  0.009  0.009  0.011    0.012  0.013  0.013  0.023
H3K9me3   GM12878    CompletelyDiscard    NOOB       0      0.002  0.001  0.001    0      0.003  0.001  0.004
H3K9me3   GM12878    TakeTheLongest       NOOB       0.005  0.005  0.004  0.005    0.005  0.005  0.004  0.007
H3K9me3   K562       CompletelyDiscard    EOO        0      0      0      0        0      0      0      0
H3K9me3   K562       TakeTheLongest       EOO        0      0      0      0        0      0      0      0
H3K9me3   K562       CompletelyDiscard    NOOB       0      0      0      0        0      0      0      0
H3K9me3   K562       TakeTheLongest       NOOB       0      0      0      0        0      0      0      0
6.2.3 GLANET GAT Comparison Results for Activator and Repressor Elements
through Data-driven Computational Experiments
Among the available annotation and enrichment tools, GAT is the only one that takes
genomic intervals as input and facilitates accounting for mappability bias. GAT
also relies on sampling-based null distribution estimation. It accommodates potential
genomic biases in the input query by allowing users to define a workspace. This
workspace specifies which regions of the genome should be utilized in sampling the
intervals for null distribution generation. From a practical standpoint, defining
this workspace is not straightforward. GLANET, on the other hand, adjusts for GC
and mappability biases by matching each input interval with its default library.
Overall, GAT’s matching procedure is coarser than GLANET’s: GAT matches these
properties in a coarse-grained fashion, whereas GLANET takes a more fine-grained
approach and matches the GC content and/or mappability of each individual input
interval. The two association measures GAT utilizes are the number of overlapping
bases between the two sets of genomic intervals and the number of intervals in the
segments of interest that overlap. These two measures coincide with the NOOB and
EOO association statistics of GLANET, respectively.
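The two association statistics can be illustrated with a minimal sketch. It assumes half-open (start, end) intervals on a single chromosome and merges the element intervals first so that each base is counted once. The function names are hypothetical, and this is neither GLANET's nor GAT's implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumption)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlapping_bases(interval: Interval, merged_elements: List[Interval]) -> int:
    """Number of bases of `interval` covered by the merged element intervals."""
    start, end = interval
    return sum(
        max(0, min(end, e_end) - max(start, e_start))
        for e_start, e_end in merged_elements
    )


def eoo_and_noob(inputs: List[Interval], elements: List[Interval]) -> Tuple[int, int]:
    """EOO: input intervals with any overlap; NOOB: total overlapping bases."""
    merged_elements = merge(elements)
    per_interval = [overlapping_bases(iv, merged_elements) for iv in inputs]
    eoo = sum(1 for bases in per_interval if bases > 0)
    noob = sum(per_interval)
    return eoo, noob


inputs = [(0, 10), (20, 30), (40, 50)]
elements = [(5, 25), (22, 28)]
eoo, noob = eoo_and_noob(inputs, elements)
```

In this example two of the three input intervals overlap an element (EOO = 2), and together they share 13 bases with the elements (NOOB = 13). EOO treats a one-base overlap and a full overlap alike, whereas NOOB weights each interval by the extent of its overlap.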
We compared GLANET and GAT with the same data-driven computational experiments
for all settings and computed the element-specific Type-I error and power of GAT
at the 0.001 and 0.05 significance levels. We observed that GAT is also conservative
in terms of Type-I error for the more stringent experiment settings
(CompletelyDiscard, Top5). Additionally, GLANET achieves a better Type-I error rate
for certain elements such as H3K4me1, and also better power for the H3K36me3 and
H4K20me1 elements, compared to GAT, as shown in Figures 6.11 and 6.13. For the less
stringent experiment settings (TakeTheLongest, Top20), results show that GLANET
Type-I error and power are comparable to or better than those of GAT (Figures 6.12
and 6.14). We extended this analysis with ROC curves by varying the significance
level, as detailed in Section 6.2.5.
[Plot panels: (a) Non-exp, CompletelyDiscard, α=0.05; (b) Exp, Top5, α=0.05; (c) Exp, Top5, α=0.05; (d) Non-exp, CompletelyDiscard, α=0.05. Panels (a, c) plot Type I Error and panels (b, d) plot Power for GAT and GLANET, each for EOO and NOOB in GM12878 and K562. Activator elements (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Repressor elements (c, d): H3K27ME3, *H3K9ME3.]
Figure 6.11: Comparison of GLANET and GAT with respect to data-driven com-
putational experiments in terms of Type-I Error and Power for significance level of
0.05. GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used. Re-
sults for the two association statistics - existence of overlap (EOO) and the number
of overlapping bases (NOOB) are displayed. (a, b) Type-I error and power of activa-
tor elements in (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes,
Top5) experiment settings, respectively. (c, d) Type-I error and power of repressor el-
ements in (Expressed Genes, Top5) and (Non-expressed Genes, CompletelyDiscard)
experiment settings, respectively. GLANET achieves higher power for H3K9me3
than GAT.
[Plot panels: (a) Non-exp, TakeTheLongest, α=0.05; (b) Exp, Top20, α=0.05; (c) Exp, Top20, α=0.05; (d) Non-exp, TakeTheLongest, α=0.05. Panels (a, c) plot Type I Error and panels (b, d) plot Power for GAT and GLANET, each for EOO and NOOB in GM12878 and K562. Activator elements (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Repressor elements (c, d): H3K27ME3, *H3K9ME3.]
Figure 6.12: Comparison of GLANET and GAT with respect to data-driven
computational experiments in terms of Type-I Error and Power for significance level of
0.05. GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used. Re-
sults for the two association statistics - existence of overlap (EOO) and the number
of overlapping bases (NOOB) are displayed. (a, b) Type-I error and power of acti-
vator elements in (Non-expressed Genes, TakeTheLongest) and (Expressed Genes,
Top20) experiment settings, respectively. (c, d) Type-I error and power of repressor
elements in (Expressed Genes, Top20) and (Non-expressed Genes, TakeTheLongest)
experiment settings, respectively.
[Plot panels: (a) Non-exp, CompletelyDiscard, α=0.001; (b) Exp, Top5, α=0.001; (c) Exp, Top5, α=0.001; (d) Non-exp, CompletelyDiscard, α=0.001. Panels (a, c) plot Type I Error and panels (b, d) plot Power for GAT and GLANET, each for EOO and NOOB in GM12878 and K562. Activator elements (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Repressor elements (c, d): H3K27ME3, *H3K9ME3.]
Figure 6.13: Comparison of GLANET and GAT with respect to data-driven
computational experiments in terms of Type-I Error and Power for significance level of
0.001. GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used. Re-
sults for the two association statistics - existence of overlap (EOO) and the number
of overlapping bases (NOOB) are displayed. (a, b) Type-I error and power of activa-
tor elements in (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes,
Top5) experiment settings, respectively. (c, d) Type-I error and power of repressor el-
ements in (Expressed Genes, Top5) and (Non-expressed Genes, CompletelyDiscard)
experiment settings, respectively.
[Plot panels: (a) Non-exp, TakeTheLongest, α=0.001; (b) Exp, Top20, α=0.001; (c) Exp, Top20, α=0.001; (d) Non-exp, TakeTheLongest, α=0.001. Panels (a, c) plot Type I Error and panels (b, d) plot Power for GAT and GLANET, each for EOO and NOOB in GM12878 and K562. Activator elements (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Repressor elements (c, d): H3K27ME3, *H3K9ME3.]
Figure 6.14: Comparison of GLANET and GAT with respect to data-driven computational
experiments in terms of Type-I Error and Power for significance level of 0.001.
GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used. Results for the two
association statistics, existence of overlap (EOO) and the number of overlapping bases
(NOOB), are displayed. (a, b) Type-I error and power of activator elements in
(Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) experiment settings,
respectively. (c, d) Type-I error and power of repressor elements in (Expressed Genes, Top20)
and (Non-expressed Genes, TakeTheLongest) experiment settings, respectively.
6.2.4 Assessing GLANET Enrichment Parameters through Wilcoxon Signed
Rank Tests
To get a comprehensive view of how GLANET parameters would affect the enrich-
ment test performance, we summarize our results across different experiments con-
ducted with various activator and repressor elements and different parameter settings.
We concentrate on Type-I error, as it is more variable than the power.
We gathered the Type-I error of activator and repressor elements for 26 different sig-
nificance levels starting from 0 to 0.25 in increments of 0.01. We considered more
stringent and less stringent settings for expressed and non-expressed genes which are
(Top5,CompletelyDiscard) and (Top20,TakeTheLongest). We had 2 repressor ele-
ments for both of the cell lines. However, we had 11 and 10 activator elements for
K562 and GM12878, respectively. As a result, for expressed genes and non-expressed
genes, we had 208 and 1092 Type-I errors considered in Wilcoxon signed rank tests,
respectively.
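The counts quoted above can be checked with a few lines of arithmetic. The sketch below assumes, in line with the figure captions of this chapter, that Type-I errors in the expressed-gene experiments come from the repressor elements and Type-I errors in the non-expressed-gene experiments come from the activator elements; the variable names are illustrative only.

```python
# Number of Type-I error observations entering the Wilcoxon signed rank tests.
significance_levels = [i / 100 for i in range(26)]  # 0.00, 0.01, ..., 0.25
n_levels = len(significance_levels)                 # 26 significance levels

experiment_settings = 2    # more stringent and less stringent
repressor_pairs = 2 + 2    # 2 repressor elements in each of the 2 cell lines
activator_pairs = 11 + 10  # 11 activators in K562, 10 in GM12878

# Assumption: repressors yield Type-I errors under expressed genes,
# activators yield Type-I errors under non-expressed genes.
expressed_count = repressor_pairs * n_levels * experiment_settings
non_expressed_count = activator_pairs * n_levels * experiment_settings

print(expressed_count, non_expressed_count)
```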
We carried out Wilcoxon signed rank tests to assess the statistical significance of the
difference between the Type-I errors achieved by different GLANET parameter settings.
The null hypothesis states that there is no difference in the mean of the ranks of the two
distributions, whereas the alternative hypothesis is that the first distribution has a lower
mean of ranks than the second one. We carried out these tests for non-expressed and expressed
simulations separately. Table 6.7 lists the p-values of the tests.
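As an illustration of the one-sided test described above, the following is a minimal pure-Python sketch of a Wilcoxon signed rank test using the normal approximation. It is not the code used for Table 6.7 (the software used for those tests is not stated here); the function name, the omission of tie and continuity corrections, and the toy inputs are all assumptions made for this example.

```python
import math


def wilcoxon_signed_rank_less(x, y):
    """One-sided Wilcoxon signed rank test via the normal approximation.

    Alternative hypothesis: paired values in x tend to be lower than in y.
    Zero differences are dropped and tied |differences| get average ranks.
    No tie or continuity correction is applied (a simplification).
    Returns (w_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return w_plus, p_value


# Hypothetical paired Type-I errors for two parameter settings.
setting_a = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
setting_b = [0.15, 0.30, 0.45, 0.60, 0.75, 0.90]
print(wilcoxon_signed_rank_less(setting_a, setting_b))
```

A small p-value here supports the alternative that the first setting yields lower Type-I errors than the second. For small samples an exact test would be preferable to the normal approximation.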
In general, matching GC and/or mappability reduces the Type-I error ranks on average
compared to the woGCM setting. Therefore, we conclude that accounting for these biases
leads to a more realistic empirical null distribution. If the input is sourced from a
constrained region of the genome, sampling uniformly at random from the genome
optimistically concludes that there is an enrichment of the genomic element even though
there is none, leading to higher Type-I errors. In both expressed and non-expressed gene
sets we observe that matching GC reduces the Type-I error. This is consistent with the fact
that transcription start sites (TSSs) have higher GC content [55].
Table 6.7: One-sided Wilcoxon signed rank test results for testing whether the Type-
I error distribution of experiments generated under the parameter setting specified in
the row has lower mean of ranks compared to the distribution of Type-I errors gen-
erated under the parameter setting specified in the column, where the null hypothesis
states that there is no difference. A p-value presented in the cell indicates that setting
in the corresponding row has a lower mean of ranks in Type-I error distribution than
the setting in the corresponding column; if the cell is empty the opposite holds. The
p-values are less than or equal to the actual test result.
Wilcoxon signed rank test p-values
Non-expressed(EOO,woIF) Non-expressed(NOOB,woIF)
wGC wM wGCM woGCM wGC wM wGCM woGCM
wGC 2.2e-16 2.2e-16 2.2e-16 2.2e-16 2.2e-16 2.2e-16
wM 2.2e-16 2.2e-16
wGCM 2.2e-16 2.2e-16 2.2e-16 2.2e-16
woGCM
Non-expressed(EOO,wIF) Non-expressed(NOOB,wIF)
wGC wM wGCM woGCM wGC wM wGCM woGCM
wGC 1.9e-04 2.2e-16 2.2e-16 1.004e-14 2.2e-16 6.524e-15
wM 2.2e-16 2.2e-16 2.2e-16
wGCM
woGCM 1.97e-04 2.39e-11
Expressed(EOO,woIF) Expressed(NOOB,woIF)
wGC wM wGCM woGCM wGC wM wGCM woGCM
wGC 5.47e-12 1.2e-12
wM 1.18e-09 5.5e-12 1.75e-09 1.2e-12
wGCM 5.51e-10 1.17e-09 5.5e-12 3.75e-10 5.38e-10 1.2e-12
woGCM
Expressed(EOO,wIF) Expressed(NOOB,wIF)
wGC wM wGCM woGCM wGC wM wGCM woGCM
wGC 1.43e-04 3.93e-03
wM 1.14e-09 2.78e-06 7.88e-10 2.57e-09 7.80e-06 1.75e-09
wGCM 1.15e-09 7.70e-10 2.56e-09 1.75e-09
woGCM
As shown in Table 6.7, we observed that for non-expressed genes, wGC achieved
lower Type-I errors than the other options. For expressed genes, wGCM achieved
lower Type-I errors than the others when woIF was on. However, when wIF was on,
wM performed better in terms of Type-I error. This is because wIF provides
coarse-grained GC matching. We also pooled the Type-I errors for (woIF,wIF) and observed
that wIF achieves lower Type-I errors than woIF (Table 6.8) in general and NOOB
provides lower Type-I errors than EOO (Table 6.9).
Overall, we observe that Type-I error control is significantly better with the NOOB
association statistic. Accounting for GC and mappability biases and using the wIF
option both lower the Type-I error.
Table 6.8: Wilcoxon Signed Rank Tests for (woIF,wIF). Type-I error distribution of
wIF is less than Type-I error distribution of woIF. To decrease Type-I error, we prefer
wIF over woIF.
Wilcoxon signed rank test p-values
woIF wIF
woIF NA 1
wIF 2.2e-16 NA
Table 6.9: Wilcoxon Signed Rank Tests for (EOO,NOOB). Type-I error distribution
of NOOB is less than Type-I error distribution of EOO. To decrease Type-I error, we
prefer NOOB over EOO.
Wilcoxon signed rank test p-values
EOO NOOB
EOO NA 1
NOOB 2.2e-16 NA
Finally, we notice an interesting difference in the experiment results conducted with
expressed and non-expressed genes. As listed in Table 6.10, matching only GC
in non-expressed genes results in the lowest Type-I errors. The experiments on ex-
pressed gene intervals show that matching mappability in addition to GC is required to
achieve lower Type-I errors. Thus, we observe that for these data-driven experiments,
accounting for mappability is not critical in the non-expressed gene set whereas it
is critical for the expressed case. We next asked whether the GC and mappability
distributions of these interval sets can explain this result.
Table 6.10: Random interval generation option that achieves the lowest Type-I error
for non-expressed and expressed gene intervals using the association measures EOO and
NOOB and the two isochore family options woIF and wIF.
Gene-set(AssociationMeasure,IsochoreFamily)   Random Interval Generation Mode
Non-expressed(EOO,woIF)                       wGC
Non-expressed(EOO,wIF)                        wGC
Non-expressed(NOOB,woIF)                      wGC
Non-expressed(NOOB,wIF)                       wGC
Expressed(EOO,woIF)                           wGCM
Expressed(EOO,wIF)                            wM
Expressed(NOOB,woIF)                          wGCM
Expressed(NOOB,wIF)                           wM
We considered the empirical GC and mappability distributions of the gene set intervals
and compared them with the two distributions computed on the whole genome.
We sampled 50,000 intervals, each 600 bp long, from the human genome uniformly at
random. Figures 6.15 and 6.16 display violin plots of GC and mappability of these random
intervals and of the intervals for the expressed and non-expressed genes in GM12878 and
K562 cell lines, respectively. As shown in Figures 6.15a and 6.16a, the GC distributions
of non-expressed genes and expressed genes are similar to each other and they are both
considerably different from the whole genome, especially in the lower tail
(Kolmogorov-Smirnov test, p-value ≤ 2.2e-16). This supports the observation that matching
for GC is important in simulations conducted with both the non-expressed and expressed
gene sets. The same does not hold for the mappability distributions: the mappability
distribution of the non-expressed genes' promoter intervals is more similar to that of the
whole genome than that of the expressed genes' intervals (Figures 6.15b and 6.16b).
Although both expressed and non-expressed gene intervals are significantly different from
the genome based on the two-sample Kolmogorov-Smirnov test (p-value ≤ 2.2e-16), the test
statistic, which quantifies the distance between the two compared distributions, is
smallest between the mappability distributions of the human genome and the non-expressed
gene set in both of the cell lines (Table 6.11).
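The "maximum distance" used in this comparison is the two-sample Kolmogorov-Smirnov statistic, that is, the largest absolute difference between the two empirical cumulative distribution functions. The sketch below shows one way to compute it in pure Python; the function name and the toy GC values are hypothetical and are not taken from the thesis data.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs of
    the two samples. Because both ECDFs are right-continuous step functions,
    it is enough to evaluate them at the observed values.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    distance = 0.0
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Hypothetical GC fractions of a few genome-wide and gene-set intervals.
genome_gc = [0.35, 0.38, 0.41, 0.44]
gene_set_gc = [0.41, 0.44, 0.52, 0.60]
print(ks_statistic(genome_gc, gene_set_gc))
```

A smaller statistic means the interval set resembles the genome-wide background more closely for that property, which is how the values in Table 6.11 are read.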
[Figure 6.15 plots omitted. Panel (a) y-axis: GC; panel (b) y-axis: mappability.
Violins in each panel: WholeGenome, GM12878 Non-expressed, GM12878 Expressed.]
Figure 6.15: Violin plots for (a) GC of randomly sampled intervals from human
genome, GC of intervals of GM12878 non-expressed genes and expressed genes.
(b) Mappability of randomly sampled intervals from human genome, mappability of
intervals from non-expressed and expressed gene-sets of GM12878.
[Figure 6.16 plots omitted. Panel (a) y-axis: GC; panel (b) y-axis: mappability.
Violins in each panel: WholeGenome, K562 Non-expressed, K562 Expressed.]
Figure 6.16: Violin plots for (a) GC of randomly sampled intervals from human
genome, GC of intervals of K562 non-expressed genes and expressed genes. (b) Map-
pability of randomly sampled intervals from human genome, mappability of intervals
from non-expressed and expressed gene-sets of K562.
Table 6.11: Kolmogorov-Smirnov test results. Null hypothesis states that the distribution
of GC content or mappability values calculated for 50,000 randomly sampled
intervals from human genome and the corresponding interval set are not different.
Each row corresponds to Kolmogorov-Smirnov testing of this null hypothesis. In all
tests, the null hypothesis is rejected (p-value < 2.2e-16). The first column lists the
property of the genome in question, the second column lists the distribution that is
compared with the genome, finally the last column lists the maximum distance between
the two distributions.
Kolmogorov-Smirnov Test Results
Property      Interval Set               Maximum Distance
GC            Non-expressed (GM12878)    0.1454
GC            Expressed (GM12878)        0.1462
GC            Non-expressed (K562)       0.1241
GC            Expressed (K562)           0.1897
Mappability   Non-expressed (GM12878)    0.0794
Mappability   Expressed (GM12878)        0.1693
Mappability   Non-expressed (K562)       0.0898
Mappability   Expressed (K562)           0.1585
6.2.5 Assessing GLANET Enrichment Parameters through ROC Curves and
Comparison with GAT
We plotted element-based and cell-based ROC curves for each possible GLANET parameter
combination and experiment setting. We also included GAT in our ROC curves for
comparison. While plotting ROC curves, we labeled each activator element as enriched
and not enriched under the expressed and non-expressed genes scenarios, respectively.
Similarly, we labeled each repressor element as not enriched and enriched under the
expressed and non-expressed genes scenarios, respectively.
In each ROC curve, we considered the expressed and non-expressed genes scenarios together;
therefore, we had an equal number of not-enriched and enriched labels with their
accompanying p-values for 1000 runs of all possible GLANET parameter combinations.
We drew ROC curves separately for the more stringent (CompletelyDiscard,Top5) and
less stringent (TakeTheLongest,Top20) experiment settings.
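The labeling scheme above turns each run into a (label, p-value) pair, from which a ROC curve follows by sweeping the significance threshold. The following is a minimal sketch of that construction and of the trapezoidal AUC; it is not the plotting code used for the figures, and the function names and toy data are hypothetical.

```python
def roc_points(labels, p_values):
    """Sweep the significance threshold and return (FPR, TPR) points.

    labels: 1 for 'enriched', 0 for 'not enriched' (both must occur).
    p_values: enrichment p-values; a run is called enriched when p <= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(p_values)):
        tp = sum(1 for y, p in zip(labels, p_values) if y == 1 and p <= threshold)
        fp = sum(1 for y, p in zip(labels, p_values) if y == 0 and p <= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: two enriched and two not-enriched runs.
labels = [1, 0, 1, 0]
p_values = [0.01, 0.02, 0.03, 0.04]
print(auc(roc_points(labels, p_values)))
```

With equal numbers of enriched and not-enriched labels, as in the setup described above, an AUC of 0.5 corresponds to p-values that carry no information about the label.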
For approximately 66.6% of the elements (H2AZ, H3K27AC, H3K4ME2, H3K4ME3,
H3K79ME2, H3K9AC, POL2, H3K27ME3), the ROC curves are alike, with only minimal
differences when all other variables are varied. However, for approximately 33.3%
of the elements (H4K20ME1, H3K9ME3, H3K4ME1, H3K36ME3), the ROC curves
differ from each other. For these elements, ROC curves change from element to
element (e.g. POL2 or H3K9ME3), from cell line to cell line (GM12878 or K562),
from parameter setting to parameter setting (e.g. (EOO,woIF,wGCM) or (NOOB,wIF,wM)),
and from experiment setting to experiment setting ((CompletelyDiscard,Top5) or
(TakeTheLongest,Top20)). Therefore, pooling all results and providing only one ROC
curve is neither possible nor meaningful.
To exemplify this situation, we present 8 ROC curve figures in Figures 6.17-6.20.
In Figures 6.17a and 6.17b, everything is the same except the parameter
setting, which is changed from (NOOB,woIF) to (NOOB,wIF). In Figure 6.17a,
GLANET(NOOB,woIF,wGCM) achieved the highest AUC, whereas in Figure 6.17b,
GLANET(NOOB,wIF,wM) achieved the highest AUC with the help of the coarse-grained
GC matching option, wIF.
In Figures 6.18a and 6.18b, all variables are the same except the cell line which
is changed from GM12878 to K562. In Figure 6.18a, GLANET(NOOB,wIF,wGC)
and GLANET(NOOB,wIF,woGCM) achieved the highest AUCs. For this case, wIF,
coarse grain GC matching performed very well. Even woGCM achieved higher AUC
than wGCM and wM. However, in Figure 6.18b, for K562 cell line, under all random
interval generation options GLANET and GAT performed very well.
In Figures 6.19a and 6.19b, all variables are the same except the experiment set-
ting which is changed from less stringent (TakeTheLongest,Top20) to more stringent
(CompletelyDiscard,Top5). In Figure 6.19a, GLANET(NOOB,woIF,wGC) achieved
the highest AUC. However, in Figure 6.19b, under all random interval generation op-
tions GLANET and GAT performed very well.
In Figures 6.20a and 6.20b, we changed everything except the element POL2: the cell
line from GM12878 to K562, the parameter setting from (EOO,woIF) to (NOOB,wIF), and
the experiment setting from (CompletelyDiscard,Top5) to (TakeTheLongest,Top20).
We observed that in both Figures 6.20a and 6.20b, GLANET and GAT performed very well
under all random interval generation options.
Figure 6.17: ROC Curves for (a) H3K9ME3 in K562 under parameter (NOOB, woIF)
and experiment (CompletelyDiscard, Top5) (b) H3K9ME3 in K562 under parameter
(NOOB, wIF) and experiment (CompletelyDiscard, Top5) settings.
Figure 6.18: ROC Curves for (a) H4K20ME1 in GM12878 under parameter (NOOB,
wIF) and experiment (CompletelyDiscard, Top5) (b) H4K20ME1 in K562 under pa-
rameter (NOOB, wIF) and experiment (CompletelyDiscard, Top5) settings.
There are 13 and 12 elements in K562 and GM12878, respectively, which makes 25
element-cell pairs. We considered 2 experiment settings: (CompletelyDiscard,Top5)
and (TakeTheLongest,Top20) which makes 50 ROC curve figures. We plotted each
ROC curve figure under 4 parameter settings: (EOO,woIF), (EOO,wIF),(NOOB,woIF)
and (NOOB,wIF) which makes 200 ROC curve figures. And in each ROC curve fig-
ure, we plotted the 5 ROC curves resulting from GLANET(wGC,wM,wGCM,woGCM)
and GAT(woGCM) parameter settings.
Figure 6.19: ROC Curves for (a) H3K4ME1 in K562 under parameter (NOOB, woIF)
and experiment (TakeTheLongest, Top20) (b) H3K4ME1 in K562 under parameter
(NOOB, woIF) and experiment (CompletelyDiscard, Top5) settings.
Figure 6.20: ROC Curves for (a) POL2 in GM12878 under parameter (EOO, woIF)
and experiment (CompletelyDiscard, Top5) (b) POL2 in K562 under parameter
(NOOB, wIF) and experiment (TakeTheLongest, Top20) settings.
To compare quantitatively which tool and parameter setting performed better than
the others through ROC curves, we compared the AUCs of the ROC curves with each other
using the pROC R package [56]. Each time, we compared the AUCs of two ROC curves and
checked whether the AUC of the first ROC curve was statistically higher than the AUC of
the second ROC curve. If it was significantly higher, we increased the number of wins
for the first ROC curve; if it was significantly lower, we increased its number of
losses; otherwise, we increased its number of ties. We made the corresponding update
for the second ROC curve. At the end, we accumulated the number of wins, ties and losses
in (Wins/Ties/Losses) representation for each tool and parameter setting's ROC curve.
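The bookkeeping behind the (Wins/Ties/Losses) representation can be sketched as follows. The example only illustrates the tallying and the mirrored update for the second curve; the statistical comparison of two AUCs itself (done with pROC in the thesis) is not reproduced. The helper names are hypothetical, and the input cells are the GAT(woGCM) row of Table 6.12.

```python
def tally(cells):
    """Sum a list of 'Wins/Ties/Losses' cells into (wins, ties, losses)."""
    wins = ties = losses = 0
    for cell in cells:
        w, t, l = (int(v) for v in cell.split("/"))
        wins += w
        ties += t
        losses += l
    return wins, ties, losses


def mirror(cell):
    """Return the cell seen from the second curve: wins and losses swap."""
    w, t, l = cell.split("/")
    return f"{l}/{t}/{w}"


# GAT(woGCM) against the four GLANET settings, as listed in Table 6.12.
gat_row = ["1/44/5", "3/37/10", "3/38/9", "3/37/10"]
print(tally(gat_row))    # row totals, to compare with the last columns of the table
print(mirror("1/44/5"))  # GLANET(woGCM) versus GAT(woGCM)
```

Each pairwise cell sums to 50, matching the 50 ROC curve figures (25 element-cell pairs under 2 experiment settings) available per parameter setting.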
Under (EOO,woIF) parameter setting, GLANET(EOO,woIF,wGCM) achieved the
highest number of wins. The comparison results are presented in Table 6.12. Under
(EOO,wIF) parameter setting, GLANET(EOO,wIF,wM) achieved the highest num-
ber of wins. The comparison results are presented in Table 6.13.
Table 6.12: GLANET and GAT ROC curves comparison results under (EOO,woIF)
setting.
(EOO,woIF)      GAT(woGCM)  GLANET(woGCM)  GLANET(wGC)  GLANET(wM)  GLANET(wGCM)  Wins  Ties  Losses
GAT(woGCM)      -           1/44/5         3/37/10      3/38/9      3/37/10       10    156   34
GLANET(woGCM)   5/44/1      -              3/38/9       3/38/9      3/38/9        14    158   28
GLANET(wGC)     10/37/3     9/38/3         -            5/41/4      3/43/4        27    159   14
GLANET(wM)      9/38/3      9/38/3         4/41/5       -           3/42/5        25    159   16
GLANET(wGCM)    10/37/3     9/38/3         4/43/3       5/42/3      -             28    160   12
Table 6.13: GLANET and GAT ROC curves comparison results under (EOO,wIF)
setting.
(EOO,wIF)       GAT(woGCM)  GLANET(woGCM)  GLANET(wGC)  GLANET(wM)  GLANET(wGCM)  Wins  Ties  Losses
GAT(woGCM)      -           5/38/7         3/40/7       3/39/8      5/39/6        16    156   28
GLANET(woGCM)   7/38/5      -              2/45/3       3/42/5      3/42/5        15    167   18
GLANET(wGC)     7/40/3      3/45/2         -            3/43/4      3/43/4        16    171   13
GLANET(wM)      8/39/3      5/42/3         4/43/3       -           3/47/0        20    171   9
GLANET(wGCM)    6/39/5      5/42/3         4/43/3       0/47/3      -             15    171   14
Under (NOOB,woIF) parameter setting, GLANET(NOOB,woIF,woGCM) achieved
the highest number of wins. Interestingly, the numbers of wins are all very close to
each other. The comparison results are presented in Table 6.14. Under (NOOB,wIF)
parameter setting, GLANET(NOOB,wIF,wM) achieved the highest number of wins.
The comparison results are presented in Table 6.15.
Table 6.14: GLANET and GAT ROC curves comparison results under (NOOB,woIF)
setting.
(NOOB,woIF)     GAT(woGCM)  GLANET(woGCM)  GLANET(wGC)  GLANET(wM)  GLANET(wGCM)  Wins  Ties  Losses
GAT(woGCM)      -           2/44/4         6/39/5       7/38/5      6/39/5        21    160   19
GLANET(woGCM)   4/44/2      -              6/39/5       6/39/5      6/39/5        22    161   17
GLANET(wGC)     5/39/6      5/39/6         -            4/42/4      6/40/4        20    160   20
GLANET(wM)      5/38/7      5/39/6         4/42/4       -           5/40/5        19    159   22
GLANET(wGCM)    5/39/6      5/39/6         4/40/6       5/40/5      -             19    158   23
Table 6.15: GLANET and GAT ROC curves comparison results under (NOOB,wIF)
setting.
(NOOB,wIF)      GAT(woGCM)  GLANET(woGCM)  GLANET(wGC)  GLANET(wM)  GLANET(wGCM)  Wins  Ties  Losses
GAT(woGCM)      -           1/41/8         1/41/8       0/40/10     1/39/10       3     161   36
GLANET(woGCM)   8/41/1      -              6/42/2       5/41/4      5/41/4        24    165   11
GLANET(wGC)     8/41/1      2/42/6         -            5/41/4      7/39/4        22    163   15
GLANET(wM)      10/40/0     4/41/5         4/41/5       -           6/44/0        24    166   10
GLANET(wGCM)    10/39/1     4/41/5         4/39/7       0/44/6      -             18    163   19
To decide the best tool and parameter setting with respect to ROC curves, we compared
the winners of the (EOO,woIF), (EOO,wIF), (NOOB,woIF) and (NOOB,wIF) parameter
settings with each other. GLANET(EOO, wIF, wM) achieved the highest number of wins
among them, followed by GLANET(EOO, woIF, wGCM) and GLANET(NOOB, wIF, wM).
GLANET(NOOB, woIF, woGCM) was the worst among them. The comparison results are
presented in Table 6.16.
Table 6.16: We compared the winner settings from Tables 6.12-6.15 with each other.
Compare Winners              GLANET(EOO,woIF)  GLANET(EOO,wIF)  GLANET(NOOB,woIF)  GLANET(NOOB,wIF)  Wins  Ties  Losses
                             (wGCM)            (wM)             (woGCM)            (wM)
GLANET(EOO,woIF)(wGCM)       -                 4/40/6           7/39/4             7/36/7            18    115   17
GLANET(EOO,wIF)(wM)          6/40/4            -                9/38/3             7/38/5            22    116   12
GLANET(NOOB,woIF)(woGCM)     4/39/7            3/38/9           -                  6/38/6            13    115   22
GLANET(NOOB,wIF)(wM)         7/36/7            5/38/7           6/38/6             -                 18    112   20
Under (woIF) and (wIF) parameter settings, GLANET(wGC) and GLANET(wM)
achieved the highest number of wins as they are shown in Tables 6.17 and 6.18,
respectively. When we pooled all ROC curves for each random interval generation
option, GLANET(wM) beat the others (Table 6.19).
We also plotted element-based and cell-based Type-I error, power and ROC curve fig-
ures resulting from data-driven computational experiments for all possible GLANET
parameter and experiment settings. Corresponding figures for H4K20ME1 can be
found in Figures B.1- B.8 of Appendix B.
Table 6.17: ROC curves of different parameter settings where (woIF) setting is on
are compared.
woIF            GAT(woGCM)  GLANET(woGCM)  GLANET(wGC)  GLANET(wM)  GLANET(wGCM)  Wins  Ties  Losses
GAT(woGCM)      -           3/88/9         9/76/15      10/76/14    9/76/15       31    316   53
GLANET(woGCM)   9/88/3      -              9/77/14      9/77/14     9/77/14       36    319   45
GLANET(wGC)     15/76/9     14/77/9        -            9/83/8      9/83/8        47    319   34
GLANET(wM)      14/76/10    14/77/9        8/83/9       -           8/82/10       44    318   38
GLANET(wGCM)    15/76/9     14/77/9        8/83/9       10/82/8     -             47    318   35
Table 6.18: ROC curves of different parameter settings where (wIF) setting is on are
compared.
wIF             GAT(woGCM)  GLANET(woGCM)  GLANET(wGC)  GLANET(wM)  GLANET(wGCM)  Wins  Ties  Losses
GAT(woGCM)      -           6/79/15        4/81/15      3/79/18     6/78/16       19    317   64
GLANET(woGCM)   15/79/6     -              8/87/5       8/83/9      8/83/9        39    332   29
GLANET(wGC)     15/81/4     5/87/8         -            8/84/8      10/82/8       38    334   28
GLANET(wM)      18/79/3     9/83/8         8/84/8       -           9/91/0        44    337   19
GLANET(wGCM)    16/78/6     9/83/8         8/82/10      0/91/9      -             33    334   33
Table 6.19: ROC curves of different "Generate Random Data Options" are compared.
All Pooled      GAT(woGCM)  GLANET(woGCM)  GLANET(wGC)  GLANET(wM)  GLANET(wGCM)  Wins  Ties  Losses
GAT(woGCM)      -           9/167/24       13/157/30    13/155/32   15/154/31     50    633   117
GLANET(woGCM)   24/167/9    -              17/164/19    17/160/23   17/160/23     75    651   74
GLANET(wGC)     30/157/13   19/164/17      -            17/167/16   19/165/16     85    653   62
GLANET(wM)      32/155/13   23/160/17      16/167/17    -           17/173/10     88    655   57
GLANET(wGCM)    31/154/15   23/160/17      16/165/19    10/173/17   -             80    652   68
CHAPTER 7
GLANET USE CASES AND RUN TIME COMPARISONS
In this chapter, we compare GLANET and GAT on additional data-sets and present
the results, followed by two use cases of GLANET. First, we carried out
enrichment analysis of OCD GWAS SNPs for the elements in GLANET's default
annotation library. Second, we performed enrichment analysis of GATA2 binding
sites in K562 cell line for Gene Ontology terms. Lastly, we conclude the chapter
with run time comparisons of GLANET with GAT and GREAT.
7.1 GLANET GAT Comparison with Additional Data-sets
We repeated the experiments provided in the GAT supplementary website [57] with
GLANET. The detailed results for these additional experiments are provided in Tables
7.1-7.4. Results for GAT runs are obtained from the GAT tutorial
(http://gat.readthedocs.org/en/latest/tutorialIntervalOverlap.html). For
each experiment, GLANET results are computed in sixteen different parameter settings.
GLANET is run with different modes of random data generation (wGC, wM,
wGCM, woGCM), isochore family (woIF, wIF) and association measure (EOO, NOOB).
In each of the Tables 7.1-7.4, the Observed column shows the association measure value
calculated between the given sets, set1 and set2. The Expected and StdDev columns show
the mean and standard deviation of the association measure values of the samplings,
respectively. Fold change is (1 + Observed) / (1 + Expected). The enrichment
result is provided by the p-value column.
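The fold change definition above can be checked against the tabulated values. The short sketch below recomputes it for two rows of Table 7.1; the function name is an illustrative choice, and the inputs are the Observed and Expected values reported in that table.

```python
def fold_change(observed, expected):
    """Fold change as defined in the text: (1 + Observed) / (1 + Expected)."""
    return (1 + observed) / (1 + expected)


# GAT (NOOB,woGCM,woIF) row of Table 7.1; the table reports 81.5301.
gat_fold = fold_change(20183, 246.5650)

# GLANET (EOO,wGC,woIF) row of Table 7.1; the table reports 26.9130.
glanet_fold = fold_change(450, 15.7577)

print(round(gat_fold, 4), round(glanet_fold, 4))
```

Adding one to both the numerator and the denominator keeps the ratio defined when the expected overlap is zero.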
These experiments evaluate the significance of the overlap of binding regions of
transcription factor Srf in Jurkat cells with three different sets of DHSs from Jurkat and
HepG2 cells. These experiments also exemplify another use case of GLANET where
the input intervals are TF binding regions.
[Figure 7.1 plot omitted. Columns: EOO, NOOB. Rows: Srf(Jurkat) vs. DNaseI(Jurkat);
Srf(Jurkat) vs. DNaseI(HepG2); DNaseI(HepG2) vs. DNaseI(Jurkat); Srf(Jurkat) vs. DNaseI(HepG2U).
x-axis: wGC, wM, wGCM, woGCM. y-axis: Fold Change (0 to 80).
Legend: GAT(NOOB,woGCM,woIF), GLANET(wIF), GLANET(woIF).]
Figure 7.1: GLANET and GAT are run on four experiments ranging from high to
low expected association between the compared genomic interval sets. Each row
depicts an experiment where the first set is input query and the second set is a genomic
element in the annotation library, e.g., experiment Srf(Jurkat) vs. DNaseI(Jurkat)
evaluates whether the binding regions of transcription factor Srf in Jurkat cells are
enriched for DNaseI accessible, i.e., open chromatin, regions in the same cells.
The first experiment (Srf(Jurkat) vs. DNaseI(Jurkat)) assesses whether Srf binding
sites in Jurkat cells, identified by [58], are enriched in DHSs [8] from the same cells.
Given that a majority of the transcription factor binding events reside in open
chromatin regions, we expect to observe significant enrichment. The second experiment
conducts the same analysis with the same input against a library that contains DHSs
from HepG2 cells. The third experiment checks whether DHSs from both cell types
overlap significantly. Both GAT and GLANET report significant enrichment
for these three experiments (p-values are listed in Tables 7.1, 7.2 and
7.3), consistent with the expectations. The fourth experiment targets DHSs identified
in HepG2 cells but not in the Jurkat cells (HepG2 unique) as the genomic element.
It evaluates whether Srf binding sites in Jurkat cells are enriched for these HepG2
specific DNaseI hypersensitive sites. Both GAT and GLANET conclude that the
observed overlap between Srf binding sites from Jurkat cells and DHSs specific to
HepG2 cells is not statistically larger than what would be expected under the null
distribution. A comprehensive list of p-values with exact values of the overlaps is
provided in Table 7.4.
Along with a p-value quantifying enrichment, GAT reports fold enrichment, which
is defined as the ratio of the observed number of overlapping nucleotides to the
expected number of overlapping nucleotides based on randomizations. We also
calculate the fold enrichment for GLANET for these experiments. In Figure 7.1,
we observe that all enrichment modes of GLANET result in conclusions consistent
with expectations and the GAT results, while the (wGCM,wIF) setting is the most
conservative in terms of fold enrichment. Of the sixteen settings of GLANET, results with
the (NOOB,woGCM,woIF) parameter setting agree most closely with the GAT results.
This is expected because GAT also uses NOOB as the association measure and
does not account for GC and mappability in these experiments.
Table 7.1: Experiment1: Intervals of transcription factor Srf in Jurkat cell line are
overlapped with DNaseI hypersensitive sites in Jurkat cell line. Both GAT and
GLANET find enrichment of DNaseI(Jurkat) for Srf(Jurkat).
Experiment1 Set1: Srf(Jurkat) Set2: DNaseI(Jurkat)
Tool Parameter Settings Observed Expected StdDev Fold Change p-value
GAT (NOOB,woGCM,woIF) 20183 246.5650 105.5933 81.5301 1.0e-03
GLANET
(EOO,wGC,woIF) 450 15.7577 3.8662 26.9130 0
(EOO,wM,woIF) 450 7.6723 2.7149 52.0046 0
(EOO,wGCM,woIF) 450 17.3464 4.0456 24.5824 0
(EOO,woGCM,woIF) 450 6.6257 2.5610 59.1421 0
GLANET
(EOO,wGC,wIF) 450 15.5799 3.8328 27.2016 0
(EOO,wM,wIF) 450 11.9761 3.4328 34.7562 0
(EOO,wGCM,wIF) 450 17.3041 4.0071 24.6392 0
(EOO,woGCM,wIF) 450 10.9239 3.2333 37.8231 0
GLANET
(NOOB,wGC,woIF) 20183 599.3644 158.8155 33.6195 0
(NOOB,wM,woIF) 20183 288.3931 112.5672 69.7459 0
(NOOB,wGCM,woIF) 20183 668.5556 169.8404 30.1453 0
(NOOB,woGCM,woIF) 20183 247.9067 105.5192 81.0906 0
GLANET
(NOOB,wGC,wIF) 20183 595.9552 160.3645 33.8115 0
(NOOB,wM,wIF) 20183 453.3407 140.4382 44.4248 0
(NOOB,wGCM,wIF) 20183 657.1246 168.5808 30.6689 0
(NOOB,woGCM,wIF) 20183 413.4114 136.8533 48.7052 0
Table 7.2: Experiment2: Intervals of transcription factor Srf in Jurkat cell line are
overlapped with DNaseI hypersensitive sites in HepG2 cell line. Both GAT and
GLANET find enrichment of DNaseI(HepG2) for Srf(Jurkat).
Experiment2 Set1: Srf(Jurkat) Set2: DNaseI(HepG2)
Tool Parameter Settings Observed Expected StdDev Fold Change p-value
GAT (NOOB, woGCM, woIF) 18965 597.1380 166.9945 31.7084 1.0e-03
GLANET
(EOO,wGC,woIF) 381 49.4944 6.1386 7.5651 0
(EOO,wM,woIF) 381 15.8633 3.9072 22.6527 0
(EOO,wGCM,woIF) 381 55.9002 6.3335 6.7135 0
(EOO,woGCM,woIF) 381 13.5410 3.6388 26.2705 0
GLANET
(EOO,wGC,wIF) 381 55.2896 6.5083 6.7863 0
(EOO,wM,wIF) 381 34.2100 5.5440 10.8491 0
(EOO,wGCM,wIF) 381 62.4521 6.6809 6.0202 0
(EOO,woGCM,wIF) 381 30.8020 5.3329 12.0118 0
GLANET
(NOOB,wGC,woIF) 18965 2298.8933 295.0334 8.2464 0
(NOOB,wM,woIF) 18965 699.2644 177.4524 27.0840 0
(NOOB,wGCM,woIF) 18965 2592.5174 305.0763 7.3128 0
(NOOB,woGCM,woIF) 18965 595.3543 165.2816 31.8032 0
GLANET
(NOOB,wGC,wIF) 18965 2532.3832 310.1418 7.4864 0
(NOOB,wM,wIF) 18965 1531.4211 257.4727 12.3764 0
(NOOB,wGCM,wIF) 18965 2874.7601 316.5953 6.5951 0
(NOOB,woGCM,wIF) 18965 1375.0903 246.4372 13.7825 0
Table 7.3: Experiment3: DNaseI hypersensitive sites in HepG2 cell line are over-
lapped with DNaseI hypersensitive sites in Jurkat cell line. Both GAT and GLANET
find enrichment of DNaseI(Jurkat) for DNaseI(HepG2).
Experiment3 Set1: DNaseI(HepG2) Set2: DNaseI(Jurkat)
Tool Parameter Settings Observed Expected StdDev Fold Change p-value
GAT (NOOB,woGCM,woIF) 6163503 456928.2770 8119.7800 13.4890 1.0e-03
GLANET
(EOO,wGC,woIF) 37863 4486.2310 63.3604 8.4381 0
(EOO,wM,woIF) 37863 4729.1280 62.8720 8.0048 0
(EOO,wGCM,woIF) 37863 4980.2900 63.7331 7.6012 0
(EOO,woGCM,woIF) 37863 4021.9370 61.3296 9.4120 0
GLANET
(EOO,wGC,wIF) 37863 4779.9930 62.7600 7.9196 0
(EOO,wM,wIF) 37863 5330.1410 66.3065 7.1024 0
(EOO,wGCM,wIF) 37863 5304.6820 67.4277 7.1365 0
(EOO,woGCM,wIF) 37863 4679.8590 62.3539 8.0891 0
GLANET
(NOOB,wGC,woIF) 6163503 514669.0700 8361.7736 11.9756 0
(NOOB,wM,woIF) 6163503 542634.9810 8866.3420 11.3584 0
(NOOB,wGCM,woIF) 6163503 577794.1580 9186.0057 10.6672 0
(NOOB,woGCM,woIF) 6163503 457457.8130 7800.8096 13.4733 0
GLANET
(NOOB,wGC,wIF) 6163503 548311.7080 8391.4861 11.2408 0
(NOOB,wM,wIF) 6163503 616187.0040 9160.8373 10.0026 0
(NOOB,wGCM,wIF) 6163503 614923.7840 8718.6997 10.0231 0
(NOOB,woGCM,wIF) 6163503 536616.5930 8472.0299 11.4858 0
Table 7.4: Experiment4: Intervals of transcription factor Srf in Jurkat cell line are
overlapped with DNaseI hypersensitive sites unique to the HepG2 cell line (HepG2-Unique).
Both GAT and GLANET find no enrichment of DNaseI(HepG2-Unique) for Srf(Jurkat).
Experiment4 Set1: Srf(Jurkat) Set2: DNaseI(HepG2-Unique)
Tool Parameter Settings Observed Expected StdDev Fold Change p-value
GAT (NOOB,woGCM,woIF) 425 324.6790 117.8233 1.3080 1.85e-01
GLANET
(EOO,wGC,woIF) 9 21.5893 4.4931 0.4426 9.995e-01
(EOO,wM,woIF) 9 8.9383 2.9387 1.0062 5.403e-01
(EOO,wGCM,woIF) 9 24.3285 4.6873 0.3948 9.998e-01
(EOO,woGCM,woIF) 9 7.5673 2.7146 1.1672 3.486e-01
GLANET
(EOO,wGC,wIF) 9 27.1950 4.9426 0.3546 1e+00
(EOO,wM,wIF) 9 18.8889 4.2593 0.5027 9.956e-01
(EOO,wGCM,wIF) 9 29.8631 5.1835 0.3240 1e+00
(EOO,woGCM,wIF) 9 17.0837 4.0756 0.5529 9.878e-01
GLANET
(NOOB,wGC,woIF) 425 951.4744 206.6389 0.4472 9.973e-01
(NOOB,wM,woIF) 425 379.5031 131.6084 1.1195 3.46e-01
(NOOB,wGCM,woIF) 425 1066.5066 216.2852 0.3990 9.998e-01
(NOOB,woGCM,woIF) 425 324.0335 122.5103 1.3106 2.053e-01
GLANET
(NOOB,wGC,wIF) 425 1186.2319 224.0918 0.3588 9.998e-01
(NOOB,wM,wIF) 425 816.2769 189.3228 0.5212 9.867e-01
(NOOB,wGCM,wIF) 425 1309.7033 235.8895 0.3250 1e+00
(NOOB,woGCM,wIF) 425 731.7741 182.2007 0.5813 9.603e-01
7.2 Example Use Cases of GLANET
7.2.1 Enrichment Analysis of OCD GWAS SNPs
We next illustrate how GLANET can be used to analyze a set of SNPs identified
in an obsessive compulsive disorder (OCD) genome-wide association study (GWAS)
[59]. This set of 2,340 SNPs was identified as significant in the case-control, trios,
and/or combined case-control-trios analyses performed by [59].
We first conduct KEGG pathway analysis using GLANET in three modes: exon-based,
regulation-based, and all-based. These modes vary the genic region definition
as defined in Figure 3.1b. The number of random samplings for these enrichment
analyses is set to 10,000. Several potential pathways are found. Interestingly,
GLANET regulation-based enrichment analysis identifies the glutamatergic synapse
pathway (hsa04724) as enriched; this is one of the pathways that KEGG reports as
associated with OCD. Both DLGAP1 and GRIK1 genes are part of this pathway and they
overlap with OCD associated SNPs in their intronic regions: DLGAP1 overlaps with
rs1628281, rs767887, rs1791397, rs11081062, rs11663827, rs1116345, rs615916 and
rs7230434, whereas GRIK1 overlaps with rs363524 and rs363514. Additionally,
other SNPs overlap with regulatory regions of other genes in this pathway, such as
rs6479056 with PPP3R2(5p1) and GRIN3A(intron), rs17124656 with GNG2(5p1),
rs1559157 with GRIA1(intron), etc. The full list of genes where overlaps take place
for the glutamatergic synapse pathway is provided in Supp. Table S18 under
http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/.
A key outcome of this application is that standard pathway analysis, which utilizes
only the exonic regions of pre-defined genes, can fail to identify pathways that are
biologically relevant through their regulatory roles. For example, the long-term
depression pathway (hsa04730) is significantly enriched, with a BH FDR adjusted
p-value of 1.62e-02, only in the regulation-based analysis. The link between OCD and
depression has long been established, and the majority of OCD patients also suffer
from depression [60]. GLANET enables such an analysis within minutes.
We also conducted enrichment analysis of OCD SNPs with the default GLANET annotation
libraries representing transcription factor binding regions and histone modifications.
The complete list of enrichment analysis results is provided in Supp. Table S19 under
http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/. Although TF enrichment
analysis did not reveal enrichment for any particular TF, a joint enrichment analysis
of genomic elements representing TF binding regions and KEGG pathways identified
several enriched transcription factor and pathway pairs.
7.2.2 Enrichment Analysis of GATA2 Binding Regions for Gene Ontology Terms
using User-defined Gene Sets Feature
GLANET allows the expansion of the annotation library with user-defined gene sets
and/or genomic intervals. We designed this key feature to give users the flexibility
to include as many gene sets and genomic intervals in their analysis as they wish.
Here, we present a proof of principle application in which the user-defined gene sets
are Gene Ontology (GO) terms [30] based on biological process. For each GO term, we
curate a gene set from the genes that are annotated with that particular GO term based
on experimental evidence (reported with one of the GO evidence codes: EXP, IDA, IPI,
IMP, IGI, IEP). Utilizing GLANET's user-defined gene set feature, these gene sets
are loaded into the GLANET annotation library.
We used GATA2 binding regions (i.e., peaks from the relevant ChIP-seq experiment)
from K562 cells as input to GLANET and assessed which of the GO term gene sets are
enriched in these regions. GATA2 is a transcription factor crucial in maintaining the
proliferation and survival of early hematopoietic cells and in preferential
differentiation to erythroid or megakaryocytic lineages [61, 62]. As we expect a
subset of GATA2 binding regions to be in close proximity to the genes that GATA2
regulates, such an analysis should identify the significantly enriched biological
processes. We conduct this analysis with the three genic region definitions:
exon-based, regulation-based and all-based. GLANET correctly identifies several
enriched GO terms that are related to the specific biological role of GATA2, such
as regulation of definitive erythrocyte differentiation (GO:0010724), platelet
formation (GO:0030220), and eosinophil fate commitment (GO:0035854) (Supp. Table
S21 is available under http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/).
To quantify the similarity between the set of GO terms that GATA2 is annotated with
and the set of GO terms GLANET found enriched, we calculate GO semantic similarity
scores between these two sets using the GOSemSim R package [63]. Semantic similarity
scores are computed using the Wang measure with the rcmax method. The resulting
scores are provided in Table 7.5. As can be seen in the table, the set of GO terms
found enriched with GLANET is highly similar to the set of GO terms annotated with
the GATA2 gene, and the similarity increases once we incorporate the non-coding
regions of the genes in the gene set, where the GATA2 binding takes place.
Table 7.5: GO semantic similarity scores calculated between the set of biological
process GO terms that GATA2 is annotated with and the set of GO terms where GATA2
binding regions are found enriched based on GLANET enrichment analysis in three
different analysis modes (exon-based, regulation-based and all-based).
Enrichment Mode              | Exon | Regulatory | All
GO Semantic Similarity Score | 0.43 | 0.73       | 0.99
GLANET's user-defined gene set feature renders this enrichment analysis
straightforward. In other settings, gene sets that are derived from gene expression
analysis or functional assays can be loaded into the GLANET annotation library. In
addition to gene sets, users can also load genomic regions, such as ChIP-seq or copy
number variation regions, as genomic elements into the GLANET annotation library
through the user-defined library feature.
7.3 GLANET Run Time Comparison
We compare GLANET against GAT and GREAT with respect to run time. GLANET
and GAT are compared based on genomic interval enrichment, as GAT does not offer
gene set enrichment. The comparison with GREAT is based on gene set enrichment,
as GREAT only offers enrichment based on annotations of nearby genes. All
GAT and GLANET runs were performed on the following system configuration: CPU:
Intel(R) Xeon(R) CPU E7-4850 v3 @ 2.20GHz. Memory: 1TB. Operating system:
Ubuntu 16.04.2 LTS.
7.3.1 Comparison with GAT
We compare GAT and GLANET in two different experimental settings. For the first
comparison setting, input intervals are randomly selected from the promoter regions
of non-expressing genes in the GM12878 cell line from (Non-Expressing,
Completely-Discard), where each interval is 601 bps long. The all ENCODE setting
checks the enrichment of all ENCODE elements in the GLANET library, which encompass
histone modifications, transcription factor sites, and DNaseI hypersensitive sites
for all cell lines (568 files). The subset ENCODE setting includes only 12 histone
modifications and POL2, as described in Section 6.1.3. Both GLANET and GAT are run
under the parameter setting (NOOB, wIF, woGCM). We varied the number of intervals and
the number of samplings. Results for 1,000 and 10,000 samplings are averaged over
10 runs; for 100,000 samplings, each run time in the tables is the average of 5
individual runs. It is worth noting that moving from subset ENCODE to all ENCODE
increased the GLANET run time only moderately. The resulting CPU times (user +
system) and wall clock times are provided in Tables 7.6 and 7.7.
Table 7.6: Elapsed CPU (user + system) run times in seconds for GLANET and GAT
runs for a given input query are provided.
Input query: promoter regions of non-expressing genes in GM12878.
Number of Input Intervals | Number of Samplings | GLANET - all ENCODE | GLANET - subset ENCODE | GAT - subset ENCODE
500  | 1,000   | 826    | 690   | 145
500  | 10,000  | 1,169  | 856   | 1,463
500  | 100,000 | 4,447  | 2,140 | 14,353
1000 | 1,000   | 1,395  | 1,283 | 147
1000 | 10,000  | 1,650  | 1,165 | 1,538
1000 | 100,000 | 9,137  | 3,866 | 14,341
2000 | 1,000   | 1,396  | 1,179 | 155
2000 | 10,000  | 2,429  | 1,270 | 1,583
2000 | 100,000 | 14,724 | 6,257 | 16,039
In the second comparison setting, we used the data provided in the GAT web tutorial,
as described in Section 7.1. Srf(Jurkat) consists of 556 Srf transcription factor
binding site intervals, each 51 bps long, from the Jurkat cell line. DNaseI(Jurkat)
and DNaseI(HepG2) comprise the DNaseI hypersensitive sites in the Jurkat and HepG2
cell lines, respectively. DNaseI(HepG2Unique) consists of the DNaseI hypersensitive
sites found in HepG2 but not in the Jurkat cell line. DNaseI(Jurkat) has 159,613
intervals, each 151 bps long. DNaseI(HepG2) has 144,171 intervals with an average
length of 360 bps. DNaseI(HepG2Unique) has 106,308 intervals with an average length
of 275 bps. For 1,000 and 10,000 samplings, each run time is the average of 10 runs.
For 100,000 samplings, each run time is the average of 5 individual runs. The results
for CPU times (user + system) and wall clock times are listed in Tables 7.8 and 7.9.
Table 7.7: Elapsed wall clock times in seconds for GLANET and GAT runs for a
given input query are provided.
Input query: promoter regions of non-expressing genes in GM12878.
Number of Input Intervals | Number of Samplings | GLANET - all ENCODE | GLANET - subset ENCODE | GAT - subset ENCODE
500  | 1,000   | 108   | 80  | 59
500  | 10,000  | 173   | 72  | 565
500  | 100,000 | 750   | 126 | 5,505
1000 | 1,000   | 142   | 120 | 47
1000 | 10,000  | 298   | 125 | 509
1000 | 100,000 | 1,386 | 330 | 4,418
2000 | 1,000   | 129   | 87  | 44
2000 | 10,000  | 260   | 110 | 455
2000 | 100,000 | 1,648 | 483 | 4,777
All run time results for GLANET and GAT are reported as CPU (user + system) time
and wall clock time in seconds. CPU time is the time that a single CPU would need
to complete the process; thus, when multithreading is used, it is the sum of the
times spent in each thread of a run (as reported by the time command in Unix).
Since GLANET and GAT are multi-threaded applications, wall clock times are less
than CPU times. During these runs, 16GB of memory is reserved for GLANET and GAT,
except for the DNaseI(HepG2)-DNaseI(Jurkat) runs with 100,000 samplings, for which
GLANET required 64GB of memory.
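The distinction between CPU (user + system) time and wall clock time can be
illustrated with a minimal Python sketch. It is not part of GLANET or GAT; the
function name time_command and the example workload are assumptions made for
illustration only. The child-process fields of os.times() are populated on Unix-like
systems.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run a command and return (cpu_seconds, wall_seconds).

    CPU time is the user + system time accumulated by the child process,
    summed over all of its threads; wall time is the elapsed real time.
    """
    before = os.times()
    wall_start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - wall_start
    after = os.times()
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return cpu, wall


if __name__ == "__main__":
    # Hypothetical workload: a short CPU-bound Python one-liner.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    cpu, wall = time_command(cmd)
    print(f"CPU (user + system): {cpu:.2f} s, wall clock: {wall:.2f} s")
```

For a multi-threaded child process, the reported CPU time can exceed the wall
clock time, which is the pattern seen in the GLANET and GAT measurements.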
Table 7.8: CPU (user + system) times in seconds spent for GLANET and GAT runs
given the input query specified.
Input Query | User Defined Library | Number of Samplings | GLANET woIF wGC | GLANET woIF wM | GLANET woIF wGCM | GLANET wIF wGC | GLANET wIF woGCM | GAT wIF woGCM
Srf(Jurkat)    | DNaseI(Jurkat)      | 1,000   | 505       | 498     | 741       | 492       | 473     | 86
Srf(Jurkat)    | DNaseI(Jurkat)      | 10,000  | 923       | 589     | 1,056     | 712       | 582     | 792
Srf(Jurkat)    | DNaseI(Jurkat)      | 100,000 | 4,158     | 1,812   | 3,856     | 2,710     | 1,777   | 7,383
DNaseI(HepG2)  | DNaseI(Jurkat)      | 1,000   | 16,843    | 7,942   | 24,602    | 13,386    | 2,428   | 1,125
DNaseI(HepG2)  | DNaseI(Jurkat)      | 10,000  | 167,079   | 69,248  | 250,360   | 127,134   | 16,693  | 12,476
DNaseI(HepG2)  | DNaseI(Jurkat)      | 100,000 | 2,066,470 | 766,951 | 2,700,420 | 1,447,620 | 262,553 | 97,659
Srf(Jurkat)    | DNaseI(HepG2)       | 1,000   | 518       | 499     | 741       | 509       | 495     | 82
Srf(Jurkat)    | DNaseI(HepG2)       | 10,000  | 951       | 585     | 1,056     | 715       | 551     | 792
Srf(Jurkat)    | DNaseI(HepG2)       | 100,000 | 4,312     | 1,779   | 4,002     | 2,712     | 1,746   | 7,296
Srf(Jurkat)    | DNaseI(HepG2Unique) | 1,000   | 519       | 499     | 752       | 492       | 485     | 76
Srf(Jurkat)    | DNaseI(HepG2Unique) | 10,000  | 945       | 596     | 1,049     | 692       | 565     | 701
Srf(Jurkat)    | DNaseI(HepG2Unique) | 100,000 | 4,042     | 1,745   | 3,987     | 2,728     | 1,734   | 6,924
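Table 7.9: Wall clock times in seconds spent for GLANET and GAT runs given the
input query specified.
Input Query | User Defined Library | Number of Samplings | GLANET woIF wGC | GLANET woIF wM | GLANET woIF wGCM | GLANET wIF wGC | GLANET wIF woGCM | GAT wIF woGCM
Srf(Jurkat)    | DNaseI(Jurkat)      | 1,000   | 224     | 230     | 439     | 222     | 218    | 16
Srf(Jurkat)    | DNaseI(Jurkat)      | 10,000  | 277     | 235     | 470     | 245     | 230    | 144
Srf(Jurkat)    | DNaseI(Jurkat)      | 100,000 | 712     | 284     | 843     | 416     | 262    | 1,300
DNaseI(HepG2)  | DNaseI(Jurkat)      | 1,000   | 3,152   | 1,195   | 4,708   | 2,411   | 229    | 193
DNaseI(HepG2)  | DNaseI(Jurkat)      | 10,000  | 29,436  | 11,014  | 44,775  | 21,647  | 1,423  | 2,164
DNaseI(HepG2)  | DNaseI(Jurkat)      | 100,000 | 323,963 | 101,847 | 402,297 | 215,717 | 16,669 | 16,256
Srf(Jurkat)    | DNaseI(HepG2)       | 1,000   | 233     | 232     | 434     | 228     | 224    | 15
Srf(Jurkat)    | DNaseI(HepG2)       | 10,000  | 301     | 251     | 496     | 265     | 239    | 155
Srf(Jurkat)    | DNaseI(HepG2)       | 100,000 | 735     | 275     | 917     | 413     | 255    | 1,268
Srf(Jurkat)    | DNaseI(HepG2Unique) | 1,000   | 227     | 230     | 426     | 221     | 223    | 13
Srf(Jurkat)    | DNaseI(HepG2Unique) | 10,000  | 287     | 239     | 480     | 244     | 232    | 118
Srf(Jurkat)    | DNaseI(HepG2Unique) | 100,000 | 707     | 272     | 902     | 410     | 258    | 1,226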
7.3.2 Comparison with GREAT
We compared GLANET with GREAT on the basis of GO term gene set enrichment. The input
was GATA2 transcription factor binding sites in the K562 cell line, and their
enrichment was checked against gene sets derived from GO terms as described in
Section 7.2.2. The input included 7,407 intervals with an average length of 256 bps.
Tables 7.10 and 7.11 report the GLANET CPU and wall clock run times, respectively.
GREAT is not available as a stand-alone command line application; thus, its results
are obtained from the online web service. When run on the server, GREAT was very
fast: it completed one analysis in less than 1 minute, as its enrichment procedure is
not based on sampling but instead assumes a parametric distribution. Since we do not
know how each GREAT run is parallelized on the server, we do not know the actual
CPU time. Therefore, it is not possible for us to compare the run times directly,
and we have no information on the actual memory used for the GREAT analysis.
Table 7.10: CPU (user + system) time in seconds spent for GLANET runs given the
input query specified. For 1,000 and 10,000 samplings, each run time is the average
of 10 individual runs.
Input query: GATA2 binding sites in K562.
Enrichment of GO Terms | Number of Samplings | GLANET Association Measure | GLANET Random Interval Generation | GLANET Isochore Family | CPU Run Time (in secs)
BP            | 1,000  | NOOB | woGCM | woIF | 522.75
BP, MF and CC | 1,000  | NOOB | woGCM | woIF | 658.34
BP            | 10,000 | NOOB | woGCM | woIF | 3069.14
BP, MF and CC | 10,000 | NOOB | woGCM | woIF | 5808.08
BP            | 10,000 | NOOB | wGC   | wIF  | 6838.32
BP, MF and CC | 10,000 | NOOB | wGC   | wIF  | 9718.03
Table 7.11: Wall clock time in seconds spent for GLANET runs given the input
query specified. For 1,000 and 10,000 samplings, each run time is the average of 10
individual runs.
Input query: GATA2 binding sites in K562.
Enrichment of GO Terms | Number of Samplings | GLANET Association Measure | GLANET Random Interval Generation | GLANET Isochore Family | Wall Clock Run Time (in secs)
BP            | 1,000  | NOOB | woGCM | woIF | 63.63
BP, MF and CC | 1,000  | NOOB | woGCM | woIF | 77.42
BP            | 10,000 | NOOB | woGCM | woIF | 332.28
BP, MF and CC | 10,000 | NOOB | woGCM | woIF | 606.79
BP            | 10,000 | NOOB | wGC   | wIF  | 586.59
BP, MF and CC | 10,000 | NOOB | wGC   | wIF  | 845.74
CHAPTER 8
FINDING OVERLAPPING INTERVALS FOR N GIVEN
INTERVAL SETS
Genomic interval intersection is crucial for attaining biological insights from genomic
data sets coming from NGS technologies. In this chapter, we generalize this genomic
interval intersection problem of finding common overlapping intervals from 2 or 3
interval sets to n interval sets. We divide the finding overlapping intervals problem
into two sub-problems: finding n common overlapping intervals from n given inter-
val sets and finding at least k common overlapping intervals from n given interval
sets. We propose two different solutions to each of these sub-problems. For finding n
common overlapping intervals from n given interval sets, we first construct a segment
tree for each interval set and then convert each segment tree into an indexed segment
tree forest. We show that constructing indexed short segment trees rather than one
tall segment tree reduces the search time. For finding at least k common overlapping
intervals from n given interval sets, we construct one big segment tree for all
interval sets and find the overlapping intervals immediately after the construction
of the segment tree is completed.
8.1 Segment Tree
A segment tree is a data structure for storing intervals. Like the interval tree, the
segment tree is a well-known space partitioning tree. It uses O(n log n) storage and
can be constructed in O(n log n) time for n given intervals. Finding all intervals
in the segment tree that contain a query point qx requires O(log n + k) time for n
given intervals and k hits [64].
Let $I := \{[x_1 : x'_1], [x_2 : x'_2], \ldots, [x_n : x'_n]\}$ be a set of n
intervals on the real line. Let $p_1, p_2, \ldots, p_m$ be the list of distinct
interval endpoints, sorted from left to right. We partition the real line at these
points $p_i$ and call the regions of this partitioning elementary intervals. Thus
the elementary intervals induced by the points $p_1, p_2, \ldots, p_{m-1}, p_m$ are,
from left to right, $(-\infty : p_1), [p_1 : p_1], (p_1 : p_2), [p_2 : p_2], \ldots,
(p_{m-1} : p_m), [p_m : p_m], (p_m : \infty)$.
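The partitioning into elementary intervals can be sketched in a few lines of Python.
This is an illustrative sketch only; the function name elementary_intervals and the
tuple layout (low, high, low_closed, high_closed) are assumptions, and the input is
assumed to be a non-empty list of closed intervals.

```python
def elementary_intervals(intervals):
    """Return the elementary intervals induced by the endpoints of closed intervals.

    Each elementary interval is a tuple (low, high, low_closed, high_closed).
    For m distinct endpoints, 2m + 1 elementary intervals are returned.
    """
    points = sorted({p for low, high in intervals for p in (low, high)})
    inf = float("inf")
    result = [(-inf, points[0], False, False)]      # (-inf : p_1)
    for i, p in enumerate(points):
        result.append((p, p, True, True))           # the single point [p : p]
        nxt = points[i + 1] if i + 1 < len(points) else inf
        result.append((p, nxt, False, False))       # the open gap (p : next)
    return result


if __name__ == "__main__":
    for elem in elementary_intervals([(1, 5), (3, 8)]):
        print(elem)
```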
To this end, we build a balanced binary search tree T whose leaves correspond to the
elementary intervals induced by the endpoints of the intervals in I in an ordered
way: the leftmost leaf corresponds to the leftmost elementary interval, and so on.
We denote the elementary interval corresponding to a leaf µ by Int(µ).
The internal nodes of T correspond to intervals that are the union of the intervals
of their two children: the interval Int(ν) corresponding to an internal node ν is the
union of Int(νleftChild) and Int(νrightChild), and hence the union of the elementary
intervals at the leaves of the subtree rooted at ν. In particular, the parent of two
leaf nodes has an interval Int(ν) that is the union of the two elementary intervals
Int(νleftChild) and Int(νrightChild) at those leaves.
Each internal node ν in T has its interval Int(ν), whereas each leaf node µ has its
elementary interval Int(µ). Each node ν also stores a set of intervals, its canonical
subset I(ν), where I(ν) ⊆ I. The canonical subset of node ν stores the intervals
[x : x′] ∈ I such that Int(ν) ⊆ [x : x′] and Int(parent(ν)) ⊈ [x : x′].
As a result, the constructed balanced binary tree T is a segment tree. This way of
construction ensures non-overlapping, consecutive intervals for the nodes at any
depth, from left to right; in effect, it provides a natural binning at any depth of
the tree. Figure 8.1 illustrates how 5 intervals are stored in the leaves and
internal nodes of the segment tree constructed from the endpoints of the 5 given
intervals [64].
Figure 8.1: Intervals (s1, s2, s3, s4, s5) are stored in the nodes. The arrows from the
nodes point to their canonical subsets.
8.2 Segment Tree Construction Complexity Analysis
To construct a segment tree for an interval set of n intervals, we proceed as follows.
We sort the endpoints of the n intervals in O(n log n) time and define an elementary
interval at each endpoint and between each pair of consecutive endpoints. We then
construct a balanced binary tree on these elementary intervals, in which the interval
of each internal node is the union of the intervals of its left and right children,
all the way up to the root. This can be done bottom-up in linear time. In the last
phase, each of the n intervals is attached to every node whose interval Int(ν) is
totally contained in the interval but whose parent's interval is not. As a result, an
interval can be attached to more than one node, and the number of intervals attached
to nodes decreases as we go up in the tree, since the node's interval Int(ν) becomes
larger.
8.3 Segment Tree Query
A query starts at the root node. If the query point qx lies in the node's interval
Int(ν), the intervals stored at that node are reported and the query continues on the
left or right child of that node, whichever contains qx, visiting one node per level
of the tree. The time complexity of a segment tree query is O(log n + k), where n is
the number of intervals and k is the number of intervals in the segment tree that
contain the query point qx [64]. As a result, it is preferable to construct the
segment tree for the interval set with the highest number of intervals and to use the
interval set with fewer intervals as the query intervals.
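The construction and point query described in Sections 8.2 and 8.3 can be sketched
as follows. This is a minimal illustration, not the GLANET implementation: the names
Node, build, insert, segment_tree and stab are assumptions, the tree is built
recursively rather than bottom-up, and elementary intervals are represented by their
rank (point p_i has rank 2i + 1 and the open gap before p_i has rank 2i, with i
counted from 0).

```python
from bisect import bisect_left


class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi      # range of elementary-interval ranks covered
        self.left = self.right = None
        self.canonical = []            # canonical subset: intervals stored here


def build(lo, hi):
    """Build a balanced binary tree over elementary-interval ranks lo..hi."""
    node = Node(lo, hi)
    if lo < hi:
        mid = (lo + hi) // 2
        node.left, node.right = build(lo, mid), build(mid + 1, hi)
    return node


def insert(node, a, b, interval):
    """Attach interval (covering ranks a..b) to its canonical nodes."""
    if a <= node.lo and node.hi <= b:
        node.canonical.append(interval)    # Int(node) is contained in interval
        return
    if node.left is None:
        return
    if a <= node.left.hi:
        insert(node.left, a, b, interval)
    if b >= node.right.lo:
        insert(node.right, a, b, interval)


def segment_tree(intervals):
    """Return (root, sorted endpoints) for a list of closed intervals."""
    points = sorted({p for iv in intervals for p in iv})
    root = build(0, 2 * len(points))       # 2m + 1 elementary intervals
    for low, high in intervals:
        a = 2 * bisect_left(points, low) + 1
        b = 2 * bisect_left(points, high) + 1
        insert(root, a, b, (low, high))
    return root, points


def stab(root, points, q):
    """Report all stored intervals that contain the query point q."""
    i = bisect_left(points, q)
    rank = 2 * i + 1 if i < len(points) and points[i] == q else 2 * i
    hits, node = [], root
    while node is not None:                # one node per level of the tree
        hits.extend(node.canonical)
        if node.left is None:
            break
        node = node.left if rank <= node.left.hi else node.right
    return hits
```

A call such as stab(*segment_tree(intervals), q) then returns the intervals
containing q; each of them is found at exactly one node on the root-to-leaf path.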
8.4 Motivation: Indexed Segment Tree Forest
After analyzing the segment trees constructed for real data sets, we observed that
nodes at the top of the segment tree (approximately the top two thirds of the segment
tree) either store no intervals or hold only a few intervals in their canonical
subsets. Intervals are mostly stored in the bottom nodes, which constitute
approximately the bottom one third of the segment tree.
Keeping the whole segment tree, with a significant number of nodes holding few or no
intervals, is unnecessary, and passing through all these nodes for each query in
order to find overlapping intervals increases the query time. Instead of having one
tall segment tree, we can cut the segment tree at a certain depth close to the bottom
of the tree and obtain as many short segment trees as there are segment tree nodes at
this cut-off depth, plus the segment tree nodes with no children above this cut-off
depth. The closer the cut-off depth is to the bottom of the tree, the higher the
number of short segment trees will be.
8.4.1 Hash Function, Preset Value
By using one universal hash function, shown in Equation 8.1, we index these short
segment trees, aiming to reach each short segment tree in O(1) time instead of
O(cut-off) time. The preset value in the hash function determines the number of
segment trees that share the same index, which is called a collision.
$\mathit{hash\_index} = (\mathit{node.interval.lowEndPoint} / \mathit{presetValue})$ (8.1)
The lower the preset value, the fewer segment trees share the same index.
However, this may result in a sequential search of more than one segment tree, which
is not preferred. The higher the preset value, the more segment trees share the same
index. In that case, reaching a segment tree may take not O(1) time but O(height of
the binary search tree (BST) formed from the short segment trees with the same index)
time, instead of O(cut-off) time. As long as the height of the BST formed from the
segment trees with the same index is less than the cut-off depth, search in the
indexed segment tree forest will still be faster than search in one tall segment tree.
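The effect of the preset value on collisions can be illustrated with a small Python
sketch. It assumes that the division in Equation 8.1 is an integer (floor) division;
the function names hash_index and index_nodes and the node intervals used below are
hypothetical.

```python
from collections import defaultdict


def hash_index(low_end_point, preset_value):
    """Equation 8.1, assuming integer (floor) division."""
    return low_end_point // preset_value


def index_nodes(node_intervals, preset_value):
    """Group the intervals (low, high) of cut-off-depth nodes by hash index."""
    index_to_nodes = defaultdict(list)
    for low, high in node_intervals:
        index_to_nodes[hash_index(low, preset_value)].append((low, high))
    return dict(index_to_nodes)


if __name__ == "__main__":
    # Hypothetical node intervals at the cut-off depth.
    nodes = [(100, 900), (1200, 4000), (12000, 15000), (51000, 60000)]
    print(index_nodes(nodes, 10_000))    # three distinct indexes, one collision
    print(index_nodes(nodes, 100_000))   # all four nodes collide at index 0
```

With the smaller preset value, only the first two nodes share an index; with the
larger one, all nodes fall under the same index and would have to be organized in a
BST.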
8.4.2 Cut-off Depth
We decide on the cut-off depth by considering two factors: 1) the total number of
intervals stored in the canonical subsets of nodes in the top part of the tree, above
the cut-off depth, and 2) the number of segment tree nodes at the cut-off depth. The
lower the cut-off depth (i.e., the closer it is to the bottom of the tree), the more
segment trees there will be in the forest. We tried three different ways of deciding
on the cut-off depth.
In the first approach, we first construct the segment tree and then cut it at 75% of
its total depth, which is close to the bottom of the tree. For instance, if the total
depth of the segment tree is 20, then we cut the tree at a cut-off depth of 15 and
consider the segment tree nodes at this depth and the nodes above the cut-off depth
with no children.
In the second approach, after we construct the segment tree, we traverse it in a
breadth-first manner and stop at the depth where the number of intervals stored
in the nodes up to that depth is greater than or equal to 1% of the total number of
intervals.
In the third approach, during the construction of the segment tree, we keep track of
the number of intervals stored in the nodes and choose the cut-off depth where the
number of intervals stored in the nodes up to that depth is greater than or equal to
1% of the total number of intervals. We call these three approaches AFTER_CONS_75%,
AFTER_CONS_BFT and DURING_CONS, respectively.
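The two cut-off rules can be sketched as follows. This is a hedged illustration:
the function names, the per-depth counts, and the convention that the root is at
depth 0 are assumptions, and the sketch works on precomputed counts of stored
intervals per depth instead of traversing an actual tree.

```python
def cutoff_75_percent(total_depth):
    """AFTER_CONS_75%: cut the tree at 75% of its total depth."""
    return (total_depth * 3) // 4


def cutoff_one_percent(intervals_per_depth, total_intervals, fraction=0.01):
    """Return the first depth at which the cumulative number of intervals
    stored in nodes up to that depth reaches `fraction` of total_intervals.

    intervals_per_depth[d] is the number of intervals stored at depth d
    (root at depth 0). Falls back to the deepest level if never reached.
    """
    running = 0
    for depth, count in enumerate(intervals_per_depth):
        running += count
        if running >= fraction * total_intervals:
            return depth
    return len(intervals_per_depth) - 1


if __name__ == "__main__":
    print(cutoff_75_percent(20))
    # Hypothetical counts: most intervals are stored near the bottom of the tree.
    print(cutoff_one_percent([0, 0, 1, 2, 50, 947], total_intervals=1000))
```

AFTER_CONS_BFT and DURING_CONS apply the same 1% rule; they differ only in whether
the counts are gathered by a breadth-first traversal after construction or
accumulated while the tree is being built.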
Cut-off depth and preset value are the two parameters that affect the performance of
search in indexed segment tree forest.
Figure 8.2: Blue colored segment tree nodes at the cut-off depth and red colored nodes
with no children at depths above the cut-off depth are stored in our segment tree
forest. To enable fast access, these stored segment tree nodes are connected to each
other through forward and backward links.
8.4.3 Moving Intervals That Were Stored in The Nodes Above The Cut-off
Depth
All intervals attached to nodes above the cut-off depth must be redistributed to
nodes at the cut-off depth. If an interval is attached to a node above the cut-off
depth, it is attached to that node's descendant nodes at the cut-off depth. If the
node has no descendants at the cut-off depth, there are two cases. If the node has no
descendants at all, we add the node holding the interval directly to the forest.
Otherwise, we attach the interval to its lowest descendant nodes that have no
children and add these descendant nodes to our segment tree forest, giving first
priority to the node closest to the cut-off depth in order to preserve the order
between the intervals of the nodes. Note that this extra work is needed for only a
small number of nodes.
8.4.4 Linking Segment Tree Nodes at Cut-off Depth to Each Other
To ensure fast access between consecutive segment tree nodes at the cut-off depth, we
connect these nodes to each other through forward and backward pointers.
We call these nodes linked nodes (Figure 8.2).
8.5 Indexed Segment Tree Forest in More Details
We cut the segment tree at the cut-off depth and keep the segment tree nodes at this
cut-off depth in an indexed segment tree forest. Each segment tree node at the
cut-off depth is in fact the root of the segment tree below it. We compute a hash
index for each such segment tree node using the hash function and store the
[index, segment tree node] pairs in a map.
We use the single universal hash function given in Equation 8.1 and tested various
preset values, such as 10,000, 50,000, 100,000, and 500,000. The preset value
affects the number of different segment trees having the same hash index
(collisions) in the map. The smaller the preset value, the fewer the collisions;
the higher the preset value, the more collisions there are.
In case of a collision, we construct a binary search tree (BST) from the segment tree
nodes with the same index, and the index in the map then points to the root of this
BST. As long as the height of the newly created BST is less than the cut-off depth, we
still decrease the search time from O(cut-off depth) to O(depth of BST). As
future work, we may construct an interval tree from the segment tree nodes with the
same index instead of a BST, since an interval tree is a balanced tree whereas a BST
is not necessarily balanced.
The original segment tree nodes in this BST are connected to each other; these are
the linked nodes mentioned above. The parent nodes of these linked nodes in the BST,
on the other hand, constitute the artificial nodes, as shown in Figure 8.3.
8.6 Query in Indexed Segment Tree Forest
For each query interval, we compute its lowIndex and highIndex using its low and
high endpoints, respectively. We start searching at the linked node pointed to by
lowIndex if it exists; otherwise, we find the lowerIndex (the highest index lower
than lowIndex), start searching at the node pointed to by lowerIndex, and continue
searching forward. If this is not possible, we start searching at the linked node
pointed to by highIndex if it exists; if not, we compute higherIndex (the lowest
index higher than highIndex), search the node pointed to by higherIndex, and continue
searching backward. If there is no node pointed to by higherIndex, there are no
intervals overlapping with the query interval. The pseudocode of the algorithms is
provided in Algorithms 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8 and 8.9.
Figure 8.3: Segment tree nodes with the same index are stored in a BST, and the index
then points to the root of the BST. Blue and red colored nodes are original segment
tree nodes, which are linked to each other. Blue colored nodes are in fact the roots
of the segment trees below them. Red colored nodes do not have any children. The
parents of these blue and red colored nodes are the artificial nodes.
8.6.1 How to Guarantee at Most Two Additional Index Searches Are Enough?
As shown in Figure 8.4, we first compute lowIndex and highIndex using the query
low and high endpoints, respectively. Then we search the segment trees pointed to
by one of these indexes in the order $lowIndex_i$, $lowIndex_{i-1}$, $highIndex_j$,
$highIndex_{j+1}$.
Here we explain why we need to consider at most two more segment trees, those pointed
to by the indexes $lowIndex_{i-1}$ and $highIndex_{j+1}$ (Figure 8.4).
Figure 8.4: Searching the nodes pointed to by lowIndex and highIndex, the nodes in
between them, and at most two more nodes is enough.
$lowIndex_i = queryLowEndPoint / presetValue$ (8.2)
$highIndex_j = queryHighEndPoint / presetValue$ (8.3)
$lowIndex_{i-2} < lowIndex_{i-1} < lowIndex_i$ (8.4)
$lowIndex_{i-1} < lowIndex_i \Rightarrow$ (8.5)
$lowNode_{i-1}.interval.lowEndPoint < queryLowEndPoint$ (8.6)
From the preserved order between the intervals of consecutive nodes, we know that
$lowNode_{i-2}.interval.highEndPoint < lowNode_{i-1}.interval.lowEndPoint$ (8.7)
Equations 8.6 and 8.7 imply that
$lowNode_{i-2}.interval.highEndPoint < queryLowEndPoint$ (8.8)
As a result of inequality 8.8, $lowNode_{i-2}.interval$ and the query interval cannot
overlap. Therefore, we need to look at only one more index preceding $lowIndex_i$
and search the segment tree pointed to by that index and forward. In the same manner,
we need to consider only one more index subsequent to $highIndex_j$.
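The argument above can be checked with a simplified Python sketch. It is not the
thesis implementation: the linked nodes are modeled as a sorted list of disjoint
closed intervals, the BST per index is replaced by the position of the first node
with that index, integer division is assumed, and only the forward search from the
lower index is shown. The names build_index and overlapping_nodes are hypothetical.

```python
from bisect import bisect_left


def build_index(nodes, preset_value):
    """Map each hash index to the position of the first linked node with it.

    `nodes` is a list of disjoint closed intervals (low, high) sorted by low,
    standing in for the linked segment tree nodes at the cut-off depth.
    """
    first_pos = {}
    for pos, (low, _high) in enumerate(nodes):
        first_pos.setdefault(low // preset_value, pos)
    return first_pos, sorted(first_pos)


def overlapping_nodes(query, nodes, first_pos, keys, preset_value):
    """Return the linked nodes whose intervals overlap the closed query."""
    q_low, q_high = query
    if not keys:
        return []
    low_index = q_low // preset_value
    j = bisect_left(keys, low_index)       # keys[j] is the first index >= low_index
    start_key = keys[max(j - 1, 0)]        # step back by at most one index
    hits = []
    for pos in range(first_pos[start_key], len(nodes)):
        low, high = nodes[pos]
        if low > q_high:                   # sorted nodes: nothing further overlaps
            break
        if high >= q_low:
            hits.append((low, high))
    return hits


if __name__ == "__main__":
    nodes = [(0, 4000), (4001, 9000), (9001, 12000), (12001, 30000), (30001, 31000)]
    first_pos, keys = build_index(nodes, 10_000)
    # The query has lowIndex 1, but the overlapping node is indexed under 0.
    print(overlapping_nodes((11000, 11500), nodes, first_pos, keys, 10_000))
```

The example query shows why one preceding index must be examined: the node
(9001, 12000) is indexed under 0 but still overlaps a query whose lowIndex is 1.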
8.7 Finding n Common Overlapping Intervals for n Given Interval Sets
Given n interval sets, we use one of them as the query intervals, and for each of the
(n − 1) remaining interval sets we construct a chromosome-based indexed segment
tree forest. We find the intervals overlapping with the query intervals by searching
the indexed segment tree forests of the (n − 1) remaining interval sets one by one.
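The overall procedure can be sketched in Python with a simple stand-in for the
forest search. The sketch below is an assumption-laden illustration: it replaces the
indexed segment tree forest with a two-pointer sweep over sorted lists, assumes that
the intervals within each set do not overlap each other, handles a single chromosome,
and reports the shared sub-intervals, which is only one possible way of reporting
common overlaps. The function names are hypothetical.

```python
def intersect_two(a_intervals, b_intervals):
    """Intersections between two lists of closed intervals (low, high).

    Assumes the intervals within each list are pairwise non-overlapping.
    """
    a, b = sorted(a_intervals), sorted(b_intervals)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if low <= high:                    # the two intervals overlap
            out.append((low, high))
        if a[i][1] < b[j][1]:              # advance the interval that ends first
            i += 1
        else:
            j += 1
    return out


def common_overlaps(interval_sets):
    """Regions covered by one interval from each of the n interval sets."""
    sets = sorted(interval_sets, key=len)
    common = sets[0]                       # smallest set as the query intervals
    for other in sets[1:]:                 # the remaining n - 1 sets, one by one
        common = intersect_two(common, other)
    return common


if __name__ == "__main__":
    sets = [[(1, 10), (20, 30)], [(5, 25)], [(8, 22), (28, 40)]]
    print(common_overlaps(sets))
```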
We tested our indexed segment tree forest approach using hotspot peaks for five
fetal adrenal tissues: fAdrenal-DS12528, fAdrenal-DS15123, fAdrenal-DS17319,
fAdrenal-DS17677 and fAdrenal-DS20343, which contain 193,835, 188,966, 137,386,
132,500 and 195,098 intervals, respectively. We measured the wall clock search times
of 100 runs using the indexed segment tree forest and the segment tree. Using a
paired t-test, we verified that, given an appropriate preset value and cut-off depth,
the search run times of the indexed segment tree forest are statistically
significantly lower than the search run times of the segment tree. The averaged wall
clock run times of construction and search are listed in Table 8.1.
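For reference, the paired t statistic used in such a comparison can be computed as
in the following Python sketch. The function name and the sample values are
hypothetical and are not the measurements reported in this chapter; the sketch
returns only the statistic and its degrees of freedom, and the p-value would then be
read from the t distribution.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for paired samples x and y.

    t is the mean of the pairwise differences divided by its standard error.
    Raises ZeroDivisionError if all differences are identical.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)              # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical search times in milliseconds for four paired runs.
    tree_times = [141.0, 138.5, 150.2, 139.9]
    forest_times = [127.3, 125.0, 131.8, 128.4]
    print(paired_t_statistic(tree_times, forest_times))
```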
Table 8.1: Various preset values and cut-off depth decisions are compared.
Construction time and search time of indexed segment tree forest and segment tree in
wall clock time are averaged over 100 runs. P-values resulting from paired t-test for
search run times of indexed segment tree forest and segment tree are provided.
Preset Value | Cut-off Depth Decision | Construction Time (in millisecs) | Search Time (in millisecs) | T-Test p-value
500,000 | After Construction 75%  | 126470.75 | 136.7  | 0.05812
500,000 | After Construction BFT  | 122114.23 | 137.01 | 0.05168
500,000 | During Construction     | 118195.75 | 135.17 | 0.04848
500,000 | Segment Tree            | 122297.07 | 157.56 |
100,000 | After Construction 75%  | 122979.78 | 139.47 | 0.6844
100,000 | After Construction BFT  | 112195.09 | 126.7  | 0.04633
100,000 | During Construction     | 116809.28 | 122.76 | 0.02144
100,000 | Segment Tree            | 112263.3  | 143.59 |
50,000  | After Construction 75%  | 121466.31 | 121.91 | 0.03485
50,000  | After Construction BFT  | 121255.48 | 128.2  | 0.1242
50,000  | During Construction     | 116043.7  | 127.71 | 0.09609
50,000  | Segment Tree            | 126385.71 | 141.64 |
10,000  | After Construction 75%  | 116232.69 | 133.55 | 0.5923
10,000  | After Construction BFT  | 115737.17 | 146.23 | 0.1682
10,000  | During Construction     | 115734.24 | 144.98 | 0.3418
10,000  | Segment Tree            | 112578.32 | 137.31 |
Algorithm 8.1: findingNCommonOverlappingIntervalsForNIntervalSets
Require: n interval sets
Require: outputfile File to output common overlapping intervals
1: queryIntervals ← smallest interval set
2: overlappingIntervalsList ← ∅
3: for each remaining n − 1 interval set do
4:   index2NodeMap ← constructIndexedSegmentTreeForest
5:   search(queryIntervals, index2NodeMap, overlappingIntervalsList)
6: end for
7: return overlappingIntervalsList
Algorithm 8.2: search
Require: queryIntervals
Require: index2NodeMap
Require: overlappingIntervalsList
1: qOvIntList : queryOverlappingIntervalsList
2: for each query interval do
3:   qOvIntList ← mainSearch(query, index2NodeMap, presetValue)
4:   update overlappingIntervalsList with qOvIntList
5: end for
Algorithm 8.3: mainSearch
Require: query(lowEndPoint, highEndPoint)
Require: index2NodeMap
Require: presetValue > 0
1: overlappingIntervals ← ∅
2: lowIndex ← lowEndPoint / presetValue
3: highIndex ← highEndPoint / presetValue
4: lowNode ← index2NodeMap.get(lowIndex)
5: if lowNode ≠ null and linked(lowNode) then
6:     searchAtLinkedNode(lowNode, query, overlappingIntervals)
7: else
8:     lowerIndex ← getLowerIndex(index2NodeMap, lowIndex)
9:     lowerNode ← index2NodeMap.get(lowerIndex)
10:    if lowerNode ≠ null then
11:        searchAtLowerNode(lowerNode, query, overlappingIntervals)
12:    else
13:        highNode ← index2NodeMap.get(highIndex)
14:        if highNode ≠ null and linked(highNode) then
15:            searchAtLinkedNode(highNode, query, overlappingIntervals)
16:        else
17:            higherIndex ← getHigherIndex(index2NodeMap, highIndex)
18:            higherNode ← index2NodeMap.get(higherIndex)
19:            if higherNode ≠ null then
20:                searchAtHigherNode(higherNode, query, overlappingIntervals)
21:            end if
22:        end if
23:    end if
24: end if
25: return overlappingIntervals
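
The index arithmetic and neighbour lookup used by mainSearch can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: it assumes that index2NodeMap is a java.util.TreeMap keyed by integer index, and that getLowerIndex and getHigherIndex return the nearest stored index strictly below and strictly above the given one, which the pseudocode does not spell out. The class name IndexLookupSketch is hypothetical.

```java
import java.util.TreeMap;

// Hypothetical sketch of the index arithmetic and neighbour lookup in mainSearch.
final class IndexLookupSketch {

    // Steps 2-3 of Algorithm 8.3: an end point is mapped to an index by integer division.
    static int indexOf(int endPoint, int presetValue) {
        if (presetValue <= 0) {
            throw new IllegalArgumentException("presetValue must be positive");
        }
        return endPoint / presetValue;
    }

    // Assumption: getLowerIndex returns the greatest stored index strictly below
    // the given index, or null when no such index exists.
    static <V> Integer getLowerIndex(TreeMap<Integer, V> index2NodeMap, int index) {
        return index2NodeMap.lowerKey(index);
    }

    // Assumption: getHigherIndex returns the least stored index strictly above
    // the given index, or null when no such index exists.
    static <V> Integer getHigherIndex(TreeMap<Integer, V> index2NodeMap, int index) {
        return index2NodeMap.higherKey(index);
    }
}
```

A sorted map is only one possible choice here; it makes both neighbour lookups logarithmic in the number of indexed nodes, whereas a hash map would need a separate structure to find the nearest stored index.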
Algorithm 8.4: searchAtLinkedNode
Require: node is a linked original node
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
1: searchForward(node, query, overlappingIntervals)
2: searchBackward(node.backwardNode, query, overlappingIntervals)

Algorithm 8.5: searchForward
Require: node is a linked original node
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
1: low: lowEndPoint
2: high: highEndPoint
3: if node ≠ null and node.interval.low ≤ high then
4:     if low ≤ node.interval.high then
5:         add node.canonicalSubset to overlappingIntervals
6:         if node.left ≠ null and low ≤ node.left.interval.high then
7:             searchDownward(node.left, query, overlappingIntervals)
8:         end if
9:         if node.right ≠ null and node.right.interval.low ≤ high then
10:            searchDownward(node.right, query, overlappingIntervals)
11:        end if
12:    end if
13:    searchForward(node.forwardNode, query, overlappingIntervals)
14: end if
Algorithm 8.6: searchBackward
Require: node is a linked original node
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
1: low: lowEndPoint
2: high: highEndPoint
3: if node ≠ null and low ≤ node.interval.high then
4:     if node.interval.low ≤ high then
5:         add node.canonicalSubset to overlappingIntervals
6:         if node.left ≠ null and low ≤ node.left.interval.high then
7:             searchDownward(node.left, query, overlappingIntervals)
8:         end if
9:         if node.right ≠ null and node.right.interval.low ≤ high then
10:            searchDownward(node.right, query, overlappingIntervals)
11:        end if
12:    end if
13:    searchBackward(node.backwardNode, query, overlappingIntervals)
14: end if

Algorithm 8.7: searchDownward
Require: query(lowEndPoint, highEndPoint)
Require: node ≠ null
Require: node and query overlap
Require: overlappingIntervals
1: add node.canonicalSubset to overlappingIntervals
2: if node.left ≠ null and low ≤ node.left.interval.high then
3:     searchDownward(node.left, query, overlappingIntervals)
4: end if
5: if node.right ≠ null and node.right.interval.low ≤ high then
6:     searchDownward(node.right, query, overlappingIntervals)
7: end if
Algorithm 8.8: searchAtLowerNode
Require: lowerNode ≠ null
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
1: if linked(lowerNode) then
2:     searchForward(lowerNode, query, overlappingIntervals)
3: else
4:     if overlaps(query, lowerNode) then
5:         searchDownward(lowerNode, query, overlappingIntervals)
6:     end if
7:     node ← findRightMostNode(lowerNode)
8:     searchForward(node.forwardNode, query, overlappingIntervals)
9: end if

Algorithm 8.9: searchAtHigherNode
Require: higherNode ≠ null
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
1: if linked(higherNode) then
2:     searchBackward(higherNode, query, overlappingIntervals)
3: else
4:     if overlaps(query, higherNode) then
5:         searchDownward(higherNode, query, overlappingIntervals)
6:     end if
7:     node ← findLeftMostNode(higherNode)
8:     searchBackward(node.backwardNode, query, overlappingIntervals)
9: end if
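
Algorithms 8.8 and 8.9 call overlaps(query, node) without defining it. Assuming closed intervals, which is consistent with the ≤ comparisons in Algorithms 8.5 to 8.7, the test can be sketched in Java as below. The class name ClosedInterval and its fields are hypothetical and serve only as illustration.

```java
// Hypothetical closed-interval type; both end points belong to the interval.
final class ClosedInterval {
    final int low;
    final int high;

    ClosedInterval(int low, int high) {
        if (low > high) {
            throw new IllegalArgumentException("low must not exceed high");
        }
        this.low = low;
        this.high = high;
    }

    // Two closed intervals overlap when each one starts no later than the other ends.
    boolean overlaps(ClosedInterval other) {
        return this.low <= other.high && other.low <= this.high;
    }
}
```

Under this assumption, intervals that merely touch at one end point, such as [10, 20] and [20, 30], count as overlapping.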
8.8 Finding at Least k Common Overlapping Intervals for n Given Interval Sets
Constructing one segment tree or one indexed segment tree forest for each interval set
solves the problem of finding n common overlapping intervals for n given interval
sets. However, to find at least k common overlapping intervals for n given interval
sets in this way, we would have to call our indexed segment tree forest solution
C(n, k) times. We use this option only to verify the correctness of our newly
proposed solution for finding at least k common overlapping intervals
for n given interval sets. In this section, we provide our algorithm for finding at least
k common overlapping intervals for n interval sets using one segment tree for all
interval sets, where 2 ≤ k ≤ n. To improve the performance of the algorithms, we
implemented them using the fork/join framework of Java 1.8, aiming
to take advantage of multiple processors as much as possible.
The pseudocode of these algorithms is provided in Algorithms 8.10, 8.11, 8.12, 8.13,
8.14, and 8.15.
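
As an illustration of the fork/join style mentioned above, the following Java sketch counts how many query intervals overlap a single target interval by splitting the query array recursively. It is not the thesis implementation: the class name OverlapCountTask, the sequential cut-off THRESHOLD, and the representation of an interval as an int[]{low, high} row are all assumptions made for this example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical fork/join task: counts query intervals overlapping a target interval.
final class OverlapCountTask extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 1_000; // assumed sequential cut-off

    private final int[][] queries; // each row is {low, high}, a closed interval
    private final int from;        // first row to process (inclusive)
    private final int to;          // last row to process (exclusive)
    private final int targetLow;
    private final int targetHigh;

    OverlapCountTask(int[][] queries, int from, int to, int targetLow, int targetHigh) {
        this.queries = queries;
        this.from = from;
        this.to = to;
        this.targetLow = targetLow;
        this.targetHigh = targetHigh;
    }

    @Override
    protected Integer compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: count sequentially.
            int count = 0;
            for (int i = from; i < to; i++) {
                if (queries[i][0] <= targetHigh && targetLow <= queries[i][1]) {
                    count++;
                }
            }
            return count;
        }
        // Otherwise split in half, run the left half asynchronously,
        // compute the right half in this thread, then combine.
        int mid = from + (to - from) / 2;
        OverlapCountTask left = new OverlapCountTask(queries, from, mid, targetLow, targetHigh);
        OverlapCountTask right = new OverlapCountTask(queries, mid, to, targetLow, targetHigh);
        left.fork();
        int rightCount = right.compute();
        return left.join() + rightCount;
    }

    // Convenience entry point that runs the task on the common pool.
    static int count(int[][] queries, int targetLow, int targetHigh) {
        return ForkJoinPool.commonPool()
                .invoke(new OverlapCountTask(queries, 0, queries.length, targetLow, targetHigh));
    }
}
```

The same divide-and-combine pattern applies whenever the work splits into independent pieces, for example per chromosome or per subtree.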
Algorithm 8.10: findingAtLeastKCommonOverlappingIntervalsForNIntervalSets
Require: n interval sets
Require: k
Require: output File to output common overlapping intervals
1: allEndPoints ← fillEndPointsAndIntervals(n)
2: sortedAllEndPoints ← sortEndPoints(allEndPoints)
3: root ← constructSegmentTree(sortedAllEndPoints)
4: storeIntervals(n, root)
5: ovIntList ← ∅
6: lastOvIntList ← ∅
7: findAtLeastK(root, k, ovIntList, lastOvIntList, output)
Algorithm 8.11: fillEndPointsAndIntervals
Require: n interval sets
1: for each interval set i do
2:     for each interval j in interval set i do
3:         add lowEndPoint_{i,j} and highEndPoint_{i,j} to allEndPoints
4:         add interval j to intervals_i
5:     end for
6: end for
7: return allEndPoints

Algorithm 8.12: sortEndPoints: Sort allEndPoints in ascending order
Require: allEndPoints
1: sort end points
2: return sortedAllEndPoints

Algorithm 8.13: constructSegmentTree: Using sortedAllEndPoints
Require: sortedAllEndPoints
1: construct segment tree
2: return root of the segment tree
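
The first two steps, collecting the end points of all interval sets and sorting them, can be sketched in Java as follows. This is a minimal sketch with hypothetical names (EndPointsSketch, sortedDistinctEndPoints); it represents each interval as an int[]{low, high}, and it additionally removes duplicate end points, which is an assumption that the pseudocode does not state.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of Algorithms 8.11 and 8.12 combined.
final class EndPointsSketch {

    // Collects the low and high end points of every interval in every interval set
    // and returns them in ascending order without duplicates (assumed behaviour).
    static int[] sortedDistinctEndPoints(List<List<int[]>> intervalSets) {
        TreeSet<Integer> endPoints = new TreeSet<>();
        for (List<int[]> intervalSet : intervalSets) {
            for (int[] interval : intervalSet) {
                endPoints.add(interval[0]); // low end point
                endPoints.add(interval[1]); // high end point
            }
        }
        int[] result = new int[endPoints.size()];
        int i = 0;
        for (int endPoint : endPoints) {
            result[i++] = endPoint;
        }
        return result;
    }
}
```

The resulting array is the kind of input from which the elementary intervals of a segment tree are usually derived.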
Algorithm 8.14: storeIntervals: One interval set at a time
Require: n interval sets
1: for each interval set i do
2:     for each interval j in interval set i do
3:         store interval j to segment tree
4:         update node.intervalSetNumbers with i
5:     end for
6: end for
7: prune segment tree from the nodes that do not have at least k numbers in their node.intervalSetNumbers
Algorithm 8.15: findAtLeastK
Require: node
Require: k
Require: overlappingIntervalsList
Require: lastOverlappingIntervalsList
Require: output File to output common overlapping intervals
1: intervalSetNumbers ← node.intervalSetNumbers
2: newOvIntList: newOverlappingIntervalsList
3: ovIntList: overlappingIntervalsList
4: nIntSetNum2IntMap: newIntervalSetNumber2IntervalMap
5: exIntSetNum2IntMap: existingIntervalSetNumber2IntervalMap
6: intSetNum: intervalSetNumber
7: exInt: existingInterval
8: lastOvIntList: lastOverlappingIntervalsList
9: if (intervalSetNumbers.size ≥ k) then
10:    if (node.canonicalSubset ≠ null) then
11:        for each interval_i in node.canonicalSubset do
12:            create newOvIntList
13:            if (ovIntList = ∅) then
14:                create nIntSetNum2IntMap
15:                add interval_i to nIntSetNum2IntMap
16:                add nIntSetNum2IntMap to newOvIntList
17:            else
18:                for each exIntSetNum2IntMap in ovIntList do
19:                    if exIntSetNum2IntMap contains intSetNum of interval_i then
20:                        exInt ← exIntSetNum2IntMap.get(intSetNum)
21:                        if (exInt ≠ interval_i) then
22:                            create nIntSetNum2IntMap from exIntSetNum2IntMap
23:                            add interval_i to nIntSetNum2IntMap
24:                            if (nIntSetNum2IntMap.size ≥ k) then
25:                                checkforAtLeastKCommonOverlapsUpdateOutput
26:                            else
27:                                add nIntSetNum2IntMap to newOvIntList
28:                            end if
29:                        end if
30:                    else
31:                        add interval_i to exIntSetNum2IntMap
32:                        if (exIntSetNum2IntMap.size ≥ k) then
33:                            checkforAtLeastKCommonOverlapsUpdateOutput
34:                        end if
35:                    end if
36:                end for
37:            end if
38:            add newOvIntList to ovIntList
39:        end for
40:    end if
41:    if (node.left ≠ null) then
42:        findAtLeastK(node.left, k, ovIntList, lastOvIntList, output)
43:    end if
44:    if (node.right ≠ null) then
45:        findAtLeastK(node.right, k, ovIntList, lastOvIntList, output)
46:    end if
47: end if
CHAPTER 9
CONCLUSION AND FUTURE WORK
The research carried out in this thesis can be examined in two main parts. In the first
part, we developed a comprehensive annotation and enrichment analysis
tool, GLANET, which implements a sampling-based enrichment test that accounts
for genomic biases and has several useful built-in analysis capabilities. Following
GLANET, we designed novel data-driven computational experiments to assess our
enrichment analysis in detail in terms of its Type-I error and power. Through these
experiments, we investigated the effect of correcting for genomic biases, separately
and jointly, on enrichment analysis along with GLANET’s other parameters. These
experiments also provide a methodology for benchmarking the enrichment analyses of
other tools. To exemplify this use case, we compared GLANET with another tool,
GAT [22]. In the second part of the thesis, we extended the annotation analysis
provided in GLANET for finding common overlapping intervals from 2 or 3 interval sets
to n given interval sets. To this end, we proposed a novel indexed segment tree forest
data structure with its accompanying algorithms and showed that the indexed segment
tree forest reduces search time.
NGS technologies allow us to sequence whole genomes rapidly, analyze gene expression
through RNA sequencing, study somatic variations by sequencing patient samples,
and analyze epigenetic factors such as genome-wide DNA methylation and DNA-protein
interactions. These different sequencing methods all produce
genomic intervals of interest. Interpreting these genomic intervals requires overlapping
them with already annotated genomic intervals through annotation. The immediate
follow-up step is to find the statistically significant overlaps among them via
enrichment analysis. Various tools exist, each with different shortcomings. Some
of them do not accept genomic intervals of varying length but only SNPs, some do
not provide enrichment analysis but only annotation, some allow enrichment
analysis only for gene lists, and most do not account for genomic biases during
enrichment analysis. To overcome these shortcomings, we developed the Genomic
Loci ANnotation and Enrichment Tool, GLANET, with many built-in capabilities.
First of all, GLANET offers an easy-to-run desktop and command line application
with its open source code available at https://github.com/burcakotlu/GLANET
and full documentation provided at https://glanet.readthedocs.org.
GLANET performs flexible annotation and enrichment analysis of a given set
of fixed or varying length loci. To annotate the given genomic intervals with the
intervals in the default library or in the user-extended library, we utilize interval
trees. We construct chromosome-based interval trees and find overlapping intervals
through interval tree search. We find the statistically significant overlaps between
given interval sets by conducting sampling-based enrichment analysis. Genomic biases
inherent to NGS technologies, such as GC content and mappability, restrict the
regions of the genome that can contribute to and show up in the resulting intervals. We
adjust for these biases during the random interval generation phase of enrichment
analysis. By correcting for these biases when generating the samplings, we aim to reduce
false positives in enrichment analysis.
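
To make the idea of a sampling-based test concrete, the following Java sketch computes a generic empirical p-value from the overlap counts of random samplings. It is an illustration only: it assumes the p-value is the fraction of samplings with at least as many overlaps as the observed data, which is a common convention and not necessarily the exact statistic GLANET uses. The class and method names are hypothetical.

```java
// Hypothetical sketch of an empirical p-value for a sampling-based enrichment test.
final class EmpiricalPValueSketch {

    // Returns the fraction of samplings whose overlap count is at least the observed count.
    static double empiricalPValue(int observedOverlaps, int[] sampledOverlaps) {
        if (sampledOverlaps.length == 0) {
            throw new IllegalArgumentException("at least one sampling is required");
        }
        int atLeastAsExtreme = 0;
        for (int sampled : sampledOverlaps) {
            if (sampled >= observedOverlaps) {
                atLeastAsExtreme++;
            }
        }
        return (double) atLeastAsExtreme / sampledOverlaps.length;
    }
}
```

In such a scheme, bias correction enters through how the random intervals are generated, not through this final counting step.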
Overall, we can summarize the features of GLANET as follows. GLANET utilizes
a rich pre-defined annotation library that contains regions defined not only on the exons
of genes but also on their intronic and regulatory regions, GO terms, KEGG
pathways, and a large collection of regulatory genomic element libraries from the
ENCODE project. One key feature of GLANET is that the user can expand its default
library with user-defined gene sets and genomic intervals. This option makes GLANET
especially suitable for research groups that generate genomic interval data or gene sets
through a variety of high-throughput experiments and routinely perform enrichment
analysis. Other unique features of GLANET include gene-set enrichment
analysis with the non-coding neighborhood of genes, regulatory sequence analysis
for SNP queries, joint enrichment analysis of TF-pathway pairs, and an enrichment
procedure that accounts for mappability and GC content biases separately
or jointly. GLANET can be used in a variety of interesting biological applications,
some of which we showcase throughout the thesis; it was also used earlier in [65].
Secondly, to the best of our knowledge, there is no tool or method for assessing
the performance of the enrichment analysis that we conduct. To evaluate how
accounting for genomic biases and other GLANET parameters affects the Type-I error
rate and power of our enrichment procedure, we designed novel data-driven computational
experiments. We considered two cell lines, GM12878 and K562, each having
two replicates of RNA-seq data. Leveraging the expression levels of genes in these
RNA-seq data, we determined the expressed and non-expressed genes. Based on
our literature review, we defined each histone modification and transcription factor
element in our library as an activator or a repressor element. We described two experiment
settings, based on more and less stringent definitions of expressed and non-expressed
genes. The key idea of these experiments is as follows: we
expect enrichment of activator elements in the proximal regions of the expressed
genes. In contrast, in the proximal regions of the non-expressed genes, we expect
enrichment of repressor elements. By means of these experiments, we analyzed the
effect of genomic biases, separately and jointly, on the enrichment analysis, and we
calculated the element-based and cell-based Type-I error rate and power of our
enrichment analysis. We observed that for input types where the mappability and/or
GC distribution is not close to that of the genome, not accounting for GC
and/or mappability results in large Type-I errors. Overall, our data-driven computational
experiments illustrate that GLANET has high power for detecting enrichment
with conservative Type-I error control. These experiments can be easily adapted by
other tools to assess their own performance. To exemplify this usage, we evaluated
another tool, GAT, and assessed its Type-I error rate and power. Furthermore, for
comparison, we provided element-based and cell-based ROC curves and Type-I
error and power figures for GLANET and GAT, based on the results of
our data-driven computational experiments.
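
Type-I error rate and power are both rejection rates, computed over different groups of tests. The Java sketch below shows this calculation under stated assumptions: each test has already been labelled as a case where enrichment is or is not expected, and a test is rejected when its p-value is at or below a significance level alpha. The class name, method name, and the choice of alpha are hypothetical and do not come from the thesis.

```java
// Hypothetical sketch: Type-I error rate and power as rejection rates.
final class RejectionRateSketch {

    // Fraction of tests whose p-value is at or below the significance level alpha.
    // Applied to tests where no enrichment is expected, this estimates the Type-I error rate;
    // applied to tests where enrichment is expected, it estimates the power.
    static double rejectionRate(double[] pValues, double alpha) {
        if (pValues.length == 0) {
            throw new IllegalArgumentException("at least one p-value is required");
        }
        int rejected = 0;
        for (double p : pValues) {
            if (p <= alpha) {
                rejected++;
            }
        }
        return (double) rejected / pValues.length;
    }
}
```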
In the second part of the thesis, we extended the problem of finding common overlapping
intervals from 2 or 3 interval sets to n interval sets. This time, we utilized the
segment tree, which is another space-partitioning data structure. We observed that the top
part of the segment tree holds no intervals at all or only a few intervals.
Intervals are mostly stored in the bottom part of the segment tree. Based on
this observation, we cut the segment tree at a certain depth and indexed the segment
tree nodes at this cut-off depth as well as the nodes without offspring above this cut-off
depth. In this manner, we represented the original segment tree as an indexed forest of
shorter segment trees. We showed that this representation reduces the search
time, which we verified by t-tests.
Additionally, we developed an algorithm for finding at least k common overlapping
intervals out of n interval sets. We constructed one segment tree for all intervals
coming from the n interval sets. We kept track of the intervals and their source interval
set numbers at each node. This augmentation enabled us to find common overlapping
intervals immediately after the storage of intervals is completed. Because of the nature
of the segment tree, one interval can be stored in more than one node, and this may
result in the same intervals being output multiple times as overlapping intervals. To overcome
this challenge, we proposed an augmented data structure that holds the most recently found
overlapping intervals at each depth. Unfortunately, for the problem of finding at least k
common overlapping intervals out of n interval sets, the segment tree cannot be pruned
after the intervals of each interval set are stored; pruning is possible only once the
storage of all intervals is completed. This resulted in a long run time. However, it is still
shorter than the straightforward approach of calling our solution for finding i common
overlapping intervals for i interval sets, for every i with k ≤ i ≤ n and for all possible
combinations, C(n, i).
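
To give a sense of how quickly the straightforward approach grows, the following Java sketch counts the number of calls it would need, that is, the sum of C(n, i) for i from k to n. The class and method names are hypothetical, and the sketch only counts combinations; it says nothing about the cost of each individual call.

```java
import java.math.BigInteger;

// Hypothetical sketch: number of subset combinations visited by the straightforward approach.
final class CombinationCountSketch {

    // Binomial coefficient C(n, r), computed iteratively with exact integer arithmetic.
    static BigInteger binomial(int n, int r) {
        if (r < 0 || r > n) {
            return BigInteger.ZERO;
        }
        BigInteger result = BigInteger.ONE;
        for (int j = 1; j <= r; j++) {
            // After this step, result equals C(n - r + j, j), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - r + j))
                           .divide(BigInteger.valueOf(j));
        }
        return result;
    }

    // Sum of C(n, i) for k <= i <= n.
    static BigInteger callsNeeded(int n, int k) {
        BigInteger total = BigInteger.ZERO;
        for (int i = k; i <= n; i++) {
            total = total.add(binomial(n, i));
        }
        return total;
    }
}
```

For example, with n = 5 and k = 3 the count is 10 + 5 + 1 = 16.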
Research in this thesis can be extended in several directions as follows:
• Variant Call Format (VCF) has become a primary format for representing SNP,
indel, and structural variation calls [32]. Supporting VCF as an accepted input
format in GLANET may increase its usage.
• Noncoding RNAs play a significant role in cellular processes and in diseases.
Functional annotation and enrichment with respect to noncoding RNAs such as
microRNAs and lncRNAs can be conducted using the user-defined library option
of GLANET. Integrating this information into the GLANET default library
is left as future work.
• We employ the fork/join framework of Java 1.8 in enrichment analysis. As future
work, the annotation step can also utilize parallelism.
• Currently, GLANET supports analysis only for the human genome. As future
work, it can be further developed to work with model organisms such
as Arabidopsis thaliana (plant), Saccharomyces cerevisiae (yeast), Drosophila
melanogaster (fruit fly), Mus musculus (mouse), and Danio rerio (zebrafish).
• Finding common overlapping intervals for n interval sets can be solved by using
one segment tree for each interval set, and this solution can be applied in
parallel for each chromosome. Moreover, we showed that representing one
segment tree as an indexed segment tree forest decreases search time. This
representation also makes it possible to search each segment tree in the forest in
parallel, which we leave as future work.
• We provided a solution for finding at least k common overlapping intervals out of n
interval sets. As future work, we can provide customized parallel implementations
of this solution.
• We can collect and present our solutions for finding n or at least k common
overlapping intervals for n given interval sets under a “Joint Overlap Analysis
Framework” (JOF). One application of this framework is the discovery of n or at
least k common overlapping transcription factors (TFs), histone modifications
(HMs), and/or DNaseI hypersensitive sites (DHSs), which constitute the n given
interval sets.
• Furthermore, we can take the resulting common overlapping intervals from
JOF and carry out enrichment analysis with respect to other interval sets of
interest such as Copy Number Variations (CNVs), SNPs, genomic variants,
DNA regulatory elements, and genic regions. We can make use of the sampling-based
enrichment analysis already provided in GLANET. This will enable us to
extend JOF to a “Joint Enrichment Analysis Framework” (JEF). One application
area for JEF is finding co-enriched TFs for GWAS SNPs or CNVs.
REFERENCES
[1] The International HapMap 3 Consortium, D. M. Altshuler, R. A. Gibbs, L. Peltonen,
E. Dermitzakis, S. F. Schaffner, F. Yu, et al., “Integrating common and rare genetic
variation in diverse human populations,” Nature, vol. 467, pp. 52–58, Sep 2010.
[2] G. A. McVean, D. M. Altshuler (Co-Chair), R. M. Durbin (Co-Chair), G. R.
Abecasis, D. R. Bentley, A. Chakravarti, A. G. Clark, P. Donnelly, E. E. Eichler,
P. Flicek, et al., “An integrated map of genetic variation from 1,092 human
genomes,” Nature, vol. 491, pp. 56–65, Oct 2012.
[3] R. McLendon, A. Friedman, D. Bigner, E. G. Van Meir, D. J. Brat, G. M. Mastrogianakis,
J. J. Olson, T. Mikkelsen, N. Lehman, K. Aldape, et al., “Comprehensive
genomic characterization defines human glioblastoma genes and
core pathways,” Nature, vol. 455, pp. 1061–1068, Sep 2008.
[4] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold, “Genome-wide map-
ping of in vivo protein-dna interactions,” Science, vol. 316, pp. 1497–1502, Jun
2007.
[5] R. P. Darst, C. E. Pardo, L. Ai, K. D. Brown, and M. P. Kladde, “Bisulfite
sequencing of DNA.,” Current protocols in molecular biology / edited by Fred-
erick M. Ausubel ... [et al.], vol. Chapter 7, July 2010.
[6] L. Song and G. E. Crawford, “Dnase-seq: A high-resolution technique for map-
ping active gene regulatory elements across the genome from mammalian cells,”
Cold Spring Harbor Protocols, vol. 2010, pp. pdb.prot5384–pdb.prot5384, Feb
2010.
[7] J. D. Buenrostro, B. Wu, H. Y. Chang, and W. J. Greenleaf, ATAC-seq: A Method
for Assaying Chromatin Accessibility Genome-Wide. John Wiley & Sons, Inc.,
2001.
[8] B. E. Bernstein, E. Birney, I. Dunham, E. D. Green, C. Gunter, and M. Snyder,
“An integrated encyclopedia of dna elements in the human genome,” Nature,
vol. 489, no. 7414, pp. 57–74, 2012.
[9] S. G. Coetzee, S. K. Rhie, B. P. Berman, G. A. Coetzee, and H. Noushmehr,
“Funcisnp: An r/bioconductor tool integrating functional non-coding data sets
with genetic association studies to identify candidate regulatory snps,” Nucleic
acids research, vol. 40, no. 18, 2012.
[10] L. D. Ward and M. Kellis, “Haploreg: a resource for exploring chromatin states,
conservation, and regulatory motif alterations within sets of genetically linked
variants,” Nucleic acids research, vol. 40, no. Database issue, pp. D930–4, 2012.
[11] P. Holmans, E. K. Green, J. S. Pahwa, M. A. Ferreira, S. M. Purcell, P. Sklar,
M. J. Owen, M. C. O’Donovan, and N. Craddock, “Gene ontology analysis
of gwa study data sets provides insights into the biology of bipolar disorder,”
American journal of human genetics, vol. 85, no. 1, pp. 13–24, 2009.
[12] A. Sifrim, J. K. Van Houdt, L. C. Tranchevent, B. Nowakowska, R. Sakai,
G. A. Pavlopoulos, K. Devriendt, J. R. Vermeesch, Y. Moreau, and J. Aerts,
“Annotate-it: a swiss-knife approach to annotation, analysis and interpretation
of single nucleotide variation in human disease,” Genome medicine, vol. 4, no. 9,
p. 73, 2012.
[13] B. Bakir-Gungor, E. Egemen, and O. U. Sezerman, “Panoga: a web server
for identification of snp-targeted pathways from genome-wide association study
data,” Bioinformatics, vol. 30, no. 9, pp. 1287–1289, 2014.
[14] I. Dunham, E. Kulesha, V. Iotchkova, S. Morganella, and E. Birney, “Forge: A
tool to discover cell specific enrichments of gwas associated snps in regulatory
regions [version 1; referees: 2 approved with reservations],” F1000Research,
vol. 4, no. 18, 2015.
[15] R. K. Auerbach, B. Chen, and A. J. Butte, “Relating genes to function: identify-
ing enriched transcription factors using the encode chip-seq significance tool,”
Bioinformatics, vol. 29, no. 15, pp. 1922–4, 2013.
[16] A. P. Boyle, E. L. Hong, M. Hariharan, Y. Cheng, M. A. Schaub, M. Kasowski,
K. J. Karczewski, J. Park, B. C. Hitz, S. Weng, J. M. Cherry, and M. Snyder,
“Annotation of functional variation in personal genomes using regulomedb,”
Genome research, vol. 22, no. 9, pp. 1790–7, 2012.
[17] P. Cingolani, A. Platts, L. Wang le, M. Coon, T. Nguyen, L. Wang, S. J. Land,
X. Lu, and D. M. Ruden, “A program for annotating and predicting the effects
of single nucleotide polymorphisms, snpeff: Snps in the genome of drosophila
melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[18] W. McLaren, B. Pritchard, D. Rios, Y. Chen, P. Flicek, and F. Cunningham,
“Deriving the consequences of genomic variants with the ensembl api and snp
effect predictor,” Bioinformatics, vol. 26, no. 16, pp. 2069–70, 2010.
[19] K. Wang, M. Li, and H. Hakonarson, “Annovar: functional annotation of genetic
variants from high-throughput sequencing data,” Nucleic acids research, vol. 38,
no. 16, p. e164, 2010.
[20] P. H. Lee, C. O’Dushlaine, B. Thomas, and S. M. Purcell, “Inrich: interval-
based enrichment analysis for genome-wide association studies,” Bioinformat-
ics, vol. 28, no. 13, pp. 1797–9, 2012.
[21] C. Y. McLean, D. Bristor, M. Hiller, S. L. Clarke, B. T. Schaar, C. B. Lowe,
A. M. Wenger, and G. Bejerano, “Great improves functional interpretation of
cis-regulatory regions,” Nature biotechnology, vol. 28, no. 5, pp. 495–501, 2010.
[22] A. Heger, C. Webber, M. Goodson, C. P. Ponting, and G. Lunter, “GAT: a simu-
lation framework for testing the association of genomic intervals,” Bioinformat-
ics, vol. 29, pp. 2046–2048, Aug. 2013.
[23] J. Rozowsky, G. Euskirchen, R. K. Auerbach, Z. D. Zhang, T. Gibson, R. Bjorn-
son, N. Carriero, M. Snyder, and M. B. Gerstein, “Peakseq enables system-
atic scoring of chip-seq experiments relative to controls,” Nature biotechnology,
vol. 27, no. 1, pp. 66–75, 2009.
[24] D. Chung, P. F. Kuan, B. Li, R. Sanalkumar, K. Liang, E. H. Bresnick, C. Dewey,
and S. Keles, “Discovering transcription factor binding sites in highly repetitive
regions of genomes with multi-read analysis of chip-seq data,” PLoS computa-
tional biology, vol. 7, no. 7, p. e1002111, 2011.
[25] M. S. Cheung, T. A. Down, I. Latorre, and J. Ahringer, “Systematic bias in high-
throughput sequencing data and its correction by beads,” Nucleic acids research,
vol. 39, no. 15, p. e103, 2011.
[26] Y. C. Chen, T. Liu, C. H. Yu, T. Y. Chiang, and C. C. Hwang, “Effects of gc bias
in next-generation-sequencing data on de novo genome assembly,” PloS one,
vol. 8, no. 4, p. e62856, 2013.
[27] Y. Benjamini and T. P. Speed, “Summarizing and correcting the gc content bias
in high-throughput sequencing,” Nucleic acids research, vol. 40, no. 10, p. e72,
2012.
[28] J. Dabney and M. Meyer, “Length and gc-biases during sequencing library am-
plification: a comparison of various polymerase-buffer systems with ancient
and modern dna sequencing libraries,” BioTechniques, vol. 52, no. 2, pp. 87–94,
2012.
[29] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “Kegg for in-
tegration and interpretation of large-scale molecular data sets,” Nucleic acids
research, vol. 40, no. D1, pp. D109–D114, 2012.
[30] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P.
Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-
Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald,
G. M. Rubin, and G. Sherlock, “Gene Ontology: tool for the unification of
biology,” Nature Genetics, vol. 25, pp. 25–29, May 2000.
[31] P. J. Croucher, Linkage Disequilibrium. John Wiley & Sons, Ltd, 2001.
[32] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo,
R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin,
and 1000 Genomes Project Analysis Group, “The variant call format and vcftools,”
Bioinformatics, vol. 27, no. 15, p. 2156, 2011.
[33] M. Leclercq, A. B. Diallo, and M. Blanchette, “Prediction of human mirna target
genes using computationally reconstructed ancestral mammalian sequences,”
Nucleic Acids Research, vol. 45, no. 2, p. 556, 2017.
[34] A. Yates, W. Akanni, M. R. Amode, D. Barrell, K. Billis, D. Carvalho-Silva,
C. Cummins, P. Clapham, S. Fitzgerald, L. Gil, C. G. Girón, L. Gordon,
T. Hourlier, S. E. Hunt, S. H. Janacek, N. Johnson, T. Juettemann, S. Keenan,
I. Lavidas, F. J. Martin, T. Maurel, W. McLaren, D. N. Murphy, R. Nag,
M. Nuhn, A. Parker, M. Patricio, M. Pignatelli, M. Rahtz, H. S. Riat, D. Shep-
pard, K. Taylor, A. Thormann, A. Vullo, S. P. Wilder, A. Zadissa, E. Birney,
J. Harrow, M. Muffato, E. Perry, M. Ruffier, G. Spudich, S. J. Trevanion, F. Cun-
ningham, B. L. Aken, D. R. Zerbino, and P. Flicek, “Ensembl 2016,” Nucleic
Acids Research, vol. 44, no. D1, p. D710, 2016.
[35] U. Paila, B. A. Chapman, R. Kirchner, and A. R. Quinlan, “Gemini: integrative
exploration of genetic variation and genome annotations,” PLoS computational
biology, vol. 9, no. 7, p. e1003153, 2013.
[36] F. A. San Lucas, G. Wang, P. Scheet, and B. Peng, “Integrated annotation and
analysis of genetic variants from next-generation sequencing studies with vari-
ant tools,” Bioinformatics, vol. 28, no. 3, pp. 421–2, 2012.
[37] M. L. Speir, A. S. Zweig, K. R. Rosenbloom, B. J. Raney, B. Paten, P. Ne-
jad, B. T. Lee, K. Learned, D. Karolchik, A. S. Hinrichs, S. Heitner, R. A.
Harte, M. Haeussler, L. Guruvadoo, P. A. Fujita, C. Eisenhart, M. Diekhans,
H. Clawson, J. Casper, G. P. Barber, D. Haussler, R. M. Kuhn, and W. J. Kent,
“The UCSC Genome Browser database: 2016 update.,” Nucleic acids research,
vol. 44, pp. D717–D725, Jan. 2016.
[38] A. R. Quinlan and I. M. Hall, “BEDTools: a flexible suite of utilities for compar-
ing genomic features.,” Bioinformatics (Oxford, England), vol. 26, pp. 841–842,
Mar. 2010.
[39] S. Neph, M. S. Kuehn, A. P. Reynolds, E. Haugen, R. E. Thurman, A. K. John-
son, E. Rynes, M. T. Maurano, J. Vierstra, S. Thomas, R. Sandstrom, R. Hum-
bert, and J. A. Stamatoyannopoulos, “BEDOPS: high-performance genomic fea-
ture operations,” Bioinformatics, vol. 28, pp. 1919–1920, July 2012.
[40] A. V. Alekseyenko and C. J. Lee, “Nested Containment List (NCList): a new
algorithm for accelerating interval query of genome alignment and interval
databases.,” Bioinformatics (Oxford, England), vol. 23, pp. 1386–1393, June
2007.
[41] H. Li, “Tabix: fast retrieval of sequence features from generic TAB-delimited
files,” Bioinformatics, vol. 27, pp. 718–719, Mar. 2011.
[42] R. M. Layer and A. R. Quinlan, “A parallel algorithm for n -way interval set
intersection,” Proceedings of the IEEE, vol. PP, no. 99, pp. 1–10, 2015.
[43] K. R. Blahnik, L. Dou, H. O’Geen, T. McPhillips, X. Xu, A. R. Cao, S. Iyen-
gar, C. M. Nicolet, B. Ludascher, I. Korf, and P. J. Farnham, “Sole-search: an
integrated analysis program for peak detection and functional annotation using
chip-seq data,” Nucleic acids research, vol. 38, no. 3, p. e13, 2010.
[44] T. H. Cormen, Introduction to algorithms. Cambridge, Mass.: MIT Press,
3rd ed., 2009.
[45] M. Thomas-Chollier, O. Sand, J.-V. Turatsinze, R. Janky, M. Defrance,
E. Vervisch, S. Brohée, and J. van Helden, “RSAT: regulatory sequence anal-
ysis tools,” Nucleic Acids Research, vol. 36, pp. W119–W127, July 2008.
[46] A. Mathelier, X. Zhao, A. W. Zhang, F. Parcy, R. Worsley-Hunt, D. J. Arenillas,
S. Buchman, C.-y. Y. Chen, A. Chou, H. Ienasescu, J. Lim, C. Shyr, G. Tan,
M. Zhou, B. Lenhard, A. Sandelin, and W. W. Wasserman, “JASPAR 2014: an
extensively expanded and updated open-access database of transcription factor
binding profiles.,” Nucleic acids research, vol. 42, pp. D142–D147, Jan. 2014.
[47] P. Kheradpour and M. Kellis, “Systematic discovery and characterization of reg-
ulatory motifs in ENCODE TF binding experiments,” Nucleic Acids Research,
vol. 42, pp. gkt1249–2987, Dec. 2013.
[48] C. E. Bonferroni, “Teoria statistica delle classi e calcolo delle probabilità,” Pub-
blicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di
Firenze, vol. 8, pp. 3–62, 1936.
[49] Y. Benjamini and Y. Hochberg, “Controlling the False Discovery Rate: A Practi-
cal and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical
Society. Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.
[50] M. Costantini, O. Clay, F. Auletta, and G. Bernardi, “An isochore map of human
chromosomes,” Genome research, vol. 16, pp. 536–541, Apr. 2006.
[51] G. Bernardi, “Misunderstandings about isochores. Part 1,” Gene, vol. 276,
pp. 3–13, Oct. 2001.
[52] J. Cheng, R. Blum, C. Bowman, D. Hu, A. Shilatifard, S. Shen, and B. D. Dyn-
lacht, “A Role for H3K4 Monomethylation in Gene Repression and Partitioning
of Chromatin Readers.,” Molecular cell, vol. 53, pp. 979–992, Mar. 2014.
[53] A. Barski, S. Cuddapah, K. Cui, T. Y. Roh, D. E. Schones, Z. Wang, G. Wei,
I. Chepelev, and K. Zhao, “High-resolution profiling of histone methylations in
the human genome.,” Cell, vol. 129, no. 4, pp. 823–837, 2007.
[54] W. Shu, H. Chen, X. Bo, and S. Wang, “Genome-wide analysis of the rela-
tionships between DNaseI HS, histone modifications and gene expression re-
veals distinct modes of chromatin domains,” Nucleic Acids Research, vol. 39,
pp. 7428–7443, Sept. 2011.
[55] A. M. Deaton and A. Bird, “CpG islands and the regulation of transcription,”
Genes & Development, vol. 25, pp. 1010–1022, May 2011.
[56] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and
M. Müller, “pROC: an open-source package for R and S+ to analyze and compare
ROC curves,” BMC Bioinformatics, vol. 12, p. 77, 2011.
[57] A. Heger, “GAT tutorial.” https://gat.readthedocs.org. Accessed: 2016-05-13.
[58] A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou,
R. M. Myers, and A. Sidow, “Genome-wide analysis of transcription factor
binding sites based on ChIP-Seq data,” Nat Meth, vol. 5, pp. 829–834, Sept.
2008.
[59] S. E. Stewart, D. Yu, J. M. Scharf, B. M. Neale, J. A. Fagerness, et al., “Genome-
wide association study of obsessive-compulsive disorder,” Molecular psychia-
try, vol. 18, no. 7, pp. 788–98, 2013.
[60] T. Overbeek, K. Schruers, and E. Griez, “Comorbidity of obsessive-compulsive
disorder and depression: prevalence, symptom severity, and treatment effect,”
The Journal of clinical psychiatry, vol. 63, no. 12, pp. 1–478, 2002.
[61] F.-Y. Tsai and S. H. Orkin, “Transcription factor GATA-2 is required for
proliferation/survival of early hematopoietic cells and mast cell formation, but not
for erythroid and myeloid terminal differentiation,” Blood, vol. 89, no. 10,
pp. 3636–3643, 1997.
[62] K. Kitajima, M. Tanaka, J. Zheng, H. Yen, A. Sato, D. Sugiyama, H. Umehara,
E. Sakai, and T. Nakano, “Redirecting differentiation of hematopoietic progenitors
by a transcription factor, GATA-2,” Blood, vol. 107, no. 5, pp. 1857–1863,
2006.
[63] G. Yu, F. Li, Y. Qin, X. Bo, Y. Wu, and S. Wang, “GOSemSim: an R package
for measuring semantic similarity among GO terms and gene products,”
Bioinformatics (Oxford, England), vol. 26, pp. 976–978, Apr. 2010.
[64] M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars, Computational Ge-
ometry: Algorithms and Applications. Springer, softcover reprint of hardcover
3rd ed. 2008 ed., Nov. 2010.
[65] C. Yao, B. H. Chen, R. Joehanes, B. Otlu, X. Zhang, C. Liu, T. Huan, O. Tas-
tan, L. A. Cupples, J. B. Meigs, C. S. Fox, J. E. Freedman, P. Courchesne,
C. J. O’Donnell, P. J. Munson, S. Keles, and D. Levy, “Integromic analysis of
genetic variation and gene expression identifies networks for cardiovascular disease
phenotypes,” Circulation, vol. 131, no. 6, pp. 536–549,
2015.
APPENDIX A
GLANET DATA SOURCES
Table A.1: GLANET data sources and their download dates.
Data | Source | Download Date (dd/mm/yyyy)
ENCODE DNaseI hypersensitive sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/openchrom/jan2011/idrPeaks/conservative/ | 29/03/2013
ENCODE DNaseI hypersensitive sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/dnase/jul2010/ | 29/03/2013
ENCODE Transcription factor binding sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/peaks/jan2011/spp/optimal/ | 22/03/2013
ENCODE Histone modification sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/peaks/jan2011/histone_macs/optimal/ | 29/03/2013
hg19 RefSeq genes | http://genome.ucsc.edu/ | 18/11/2014
hg19 chromosome sizes | http://genome.ucsc.edu/goldenPath/help/hg19.chrom.sizes | 22/05/2013
KEGG pathways | http://rest.kegg.jp/list/pathway/hsa | 23/09/2013
KEGG pathway to gene mapping | http://www.genome.jp/linkdb/linkdb.html | 18/06/2013
GC fasta files | http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ | 19/07/2013
Mappability bigWig files | ftp://hgdownload.cse.ucsc.edu/apache/htdocs/goldenPath/hg19/encodeDCC/wgEncodeMapability/ | 18/07/2013
JASPAR CORE PFMs | http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_all.txt | 26/08/2014
ENCODE motifs | http://compbio.mit.edu/encode-motifs/ | 25/02/2014
NCBI REMAP API supported assemblies | Downloaded by remap_api.pl within GLANET when a Regulatory Sequence Analysis is requested (remap_api.pl source: ftp://ftp.ncbi.nlm.nih.gov/pub/remap). | 01/04/2016
Latest RefSeq assembly IDs | Downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/ within GLANET each time a Regulatory Sequence Analysis is requested. | 01/04/2016
Gene IDs | ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz | 18/11/2014
APPENDIX B
TYPE-I ERROR, POWER AND ROC CURVE FIGURES
In Appendix B, we provide the Type-I error, power, and ROC curve figures for H4K20ME1
that result from the data-driven computational experiments under all possible GLANET
parameter and experiment settings. For the sake of completeness, we provide the
Type-I error, power, and ROC curve figures for all elements at
http://burcak.ceng.metu.edu.tr/PhDThesis/ in
BurcakOtlu_PhD_Thesis_ElementBased_TypeIError_Power_ROCCurve_Figures.pdf.
We plotted the ROC curves using the plotROC R package and compared the AUCs of the
ROC curves with one another using the pROC R package.
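As a rough illustration of what an ROC curve and its AUC summarize, the following Python sketch sweeps a threshold over p-values and integrates the resulting curve with the trapezoidal rule. It is not the plotROC or pROC implementation, and it does not reproduce pROC's statistical comparison of AUCs. The function names `roc_points` and `auc`, the use of p-values as the ranking score, and the example data are assumptions made for illustration only.

```python
from typing import List, Tuple


def roc_points(p_values: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping a threshold over p-values.

    labels: 1 for a truly enriched case, 0 for a non-enriched case (assumed coding).
    A case is called positive when its p-value is <= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both positive and negative cases are required")

    points = [(0.0, 0.0)]  # threshold below every p-value: nothing is called
    for threshold in sorted(set(p_values)):
        tp = sum(1 for p, y in zip(p_values, labels) if p <= threshold and y == 1)
        fp = sum(1 for p, y in zip(p_values, labels) if p <= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points  # the last point is (1.0, 1.0)


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical example: two enriched and two non-enriched cases.
example_p = [0.01, 0.02, 0.03, 0.04]
example_y = [1, 0, 1, 0]
example_auc = auc(roc_points(example_p, example_y))
```

In this hypothetical example, three of the four (enriched, non-enriched) pairs are ranked correctly, so the AUC equals 0.75.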
We drew the Type-I error and power figures by computing the Type-I error and power
values at significance levels ranging from 0 to 0.25 in increments of 0.01.
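The grid of significance levels described above can be sketched in Python as follows. This is a minimal illustration, not GLANET code: it assumes that Type-I error is estimated as the fraction of rejections among simulations with no true enrichment, that power is the fraction of rejections among simulations with true enrichment, and that a test rejects when its p-value is at most the significance level. The function names and the p-values are hypothetical.

```python
from typing import List, Tuple


def significance_levels(start: float = 0.0, stop: float = 0.25,
                        step: float = 0.01) -> List[float]:
    """Significance levels from start to stop (inclusive) in steps of step."""
    count = int(round((stop - start) / step))
    # Rounding removes floating-point drift such as 0.060000000000000005.
    return [round(start + i * step, 10) for i in range(count + 1)]


def rejection_rate(p_values: List[float], alpha: float) -> float:
    """Fraction of tests rejected at level alpha (rejection rule: p <= alpha)."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


def error_and_power_curves(null_p: List[float],
                           alt_p: List[float]) -> List[Tuple[float, float, float]]:
    """Return (alpha, empirical Type-I error, empirical power) per level."""
    return [(alpha, rejection_rate(null_p, alpha), rejection_rate(alt_p, alpha))
            for alpha in significance_levels()]


# Hypothetical p-values from simulations without and with true enrichment.
null_p_values = [0.03, 0.20, 0.45, 0.80]
alt_p_values = [0.001, 0.004, 0.02, 0.30]
curves = error_and_power_curves(null_p_values, alt_p_values)
```

Each tuple in `curves` corresponds to one point on a Type-I error plot and one point on a power plot at the same significance level.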
Figure B.1: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in GM12878 for (EOO,CompletelyDiscard,Top5).
Figure B.2: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in GM12878 for (NOOB,CompletelyDiscard,Top5).
Figure B.3: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in K562 for (EOO,CompletelyDiscard,Top5).
Figure B.4: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in K562 for (NOOB,CompletelyDiscard,Top5).
Figure B.5: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in GM12878 for (EOO,TakeTheLongest,Top20).
Figure B.6: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in GM12878 for (NOOB,TakeTheLongest,Top20).
Figure B.7: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in K562 for (EOO,TakeTheLongest,Top20).
Figure B.8: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for
H4K20ME1 in K562 for (NOOB,TakeTheLongest,Top20).
CURRICULUM VITAE
PERSONAL INFORMATION
Surname, Name: Otlu, Burçak
Nationality: Turkish (TC)
Date and Place of Birth: 10.09.1977, Izmir
Phone: 0 312 210 5541
Fax: 0 312 210 5544
EDUCATION
Degree | Institution | Year of Graduation
M.S. | Department of Computer Engineering, METU | 2002
B.S. | Department of Computer Engineering, METU | 1999
High School | Ankara Cumhuriyet High School | 1995
High School | Ankara Science High School | 1994
PROFESSIONAL EXPERIENCE
Year | Place | Enrollment
2010-2016 | Middle East Technical University | Research Assistant
2006-2009 | Solveka Software | Senior Functional Developer
2005-2006 | Oyak Technology | Software Engineer
1999-2004 | Middle East Technical University | Research Assistant
PUBLICATIONS
In Preparation
1. Joint Overlap Analysis Framework, B. Otlu, T. Can (in draft)
International Journal Publications
1. B. Otlu, C. Firtina, S. Keles, O. Tastan, GLANET: Genomic Loci Annotation
and Enrichment Tool, Bioinformatics, accepted 10 May 2017, published online
24 May 2017
2. C. Yao, B.H. Chen, R. Joehanes, B. Otlu, X. Zhang, C. Liu, T. Huan, O. Tas-
tan, L.A. Cupples, J.B. Meigs, C.S. Fox, J.E. Freedman, P. Courchesne, C.J.
O’Donnell, P.J. Munson, S. Keles, D. Levy, Integromic analysis of genetic vari-
ation and gene expression identifies networks for cardiovascular disease phe-
notypes, Circulation, Volume 131, Issue 6, 10 February 2015, Pages 536-549.
(printed)
International Conference Poster Presentations
1. GLANET: Genomic Loci Annotation and Enrichment Tool, B. Otlu, O. Tastan,
S. Keles, The 13th European Conference on Computational Biology, ECCB, 7-
10 September 2014, Poster Presentation, Strasbourg, France
AWARDS AND SCHOLARSHIPS
TÜBITAK, 2211-C PhD Scholarship (2014-2017)
METU, PhD Student Lecture Performance Award (2012)