
TOOLS AND TECHNIQUES FOR ASSESSING FUNCTIONAL RELEVANCE OF GENOMIC LOCI

A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES

OF MIDDLE EAST TECHNICAL UNIVERSITY

BY

BURÇAK OTLU

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR

THE DEGREE OF DOCTOR OF PHILOSOPHY IN

COMPUTER ENGINEERING

JUNE 2017

Approval of the thesis:

TOOLS AND TECHNIQUES FOR ASSESSING FUNCTIONAL RELEVANCE OF GENOMIC LOCI

submitted by BURÇAK OTLU in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering Department, Middle East Technical University by,

Prof. Dr. Gülbin Dural Ünver
Dean, Graduate School of Natural and Applied Sciences

Prof. Dr. Adnan Yazıcı
Head of Department, Computer Engineering

Prof. Dr. Tolga Can
Supervisor, Computer Engineering Department, METU

Prof. Dr. Sündüz Keles
Co-supervisor, Department of Statistics, University of Wisconsin–Madison, USA

Examining Committee Members:

Prof. Dr. M. Volkan Atalay
Computer Engineering Department, METU

Prof. Dr. Tolga Can
Computer Engineering Department, METU

Assoc. Prof. Dr. Murat Manguoglu
Computer Engineering Department, METU

Assist. Prof. Dr. Öznur Tastan Okan
Computer Engineering Department, Bilkent University

Assist. Prof. Dr. Can Alkan
Computer Engineering Department, Bilkent University

Date:

I hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.

Name, Last Name: BURÇAK OTLU

Signature :


ABSTRACT

TOOLS AND TECHNIQUES FOR ASSESSING FUNCTIONAL RELEVANCE OF GENOMIC LOCI

Otlu, Burçak

Ph.D., Department of Computer Engineering

Supervisor : Prof. Dr. Tolga Can

Co-Supervisor : Prof. Dr. Sündüz Keles

June 2017, 144 pages

Genomic studies identify genomic loci representing genetic variations, transcription factor occupancy, or histone modification through next generation sequencing (NGS) technologies. Interpreting these loci requires evaluating them with known genomic and epigenomic annotations. In this thesis, we develop tools and techniques to assess the functional relevance of a set of genomic intervals. Towards this goal, we first introduce the Genomic Loci ANnotation and Enrichment Tool (GLANET) as a comprehensive annotation and enrichment analysis tool. The input query to GLANET is a set of genomic intervals. GLANET annotates and performs enrichment analysis on these loci with a rich library that includes: (i) gene-centric regions that encompass their non-coding neighborhood, (ii) a large collection of regulatory regions from ENCODE, and (iii) gene sets derived from pathways. As a key feature, users can easily extend this library with new gene sets and genomic intervals. GLANET implements a sampling-based enrichment test that can account for GC content and/or mappability biases inherent to NGS technologies, and that shows high statistical power and a well-controlled Type-I error rate. Other key features of GLANET include assessment of the impact of single nucleotide variants on transcription factor binding sites when the input consists of SNPs only, and gene set enrichment analysis that is not only exon based but also regulation based, considering the introns and proximal regions of the genes in a gene set. GLANET also allows joint enrichment analysis for TF binding sites and KEGG pathways. With this option, users can evaluate whether the input set is enriched concurrently with binding sites of TFs and the genes within a KEGG pathway. This joint enrichment analysis provides a detailed functional interpretation of the input loci. As a second contribution, we designed novel data-driven computational experiments for assessing the power and Type-I error of enrichment procedures. These experiments make detailed quantitative comparisons of GLANET with other tools possible. Our results on these computational experiments showcase GLANET’s unique capabilities as well as its robustness, speed and accuracy. Finally, as a third contribution, we present an efficient algorithmic solution for finding common overlapping intervals over n interval sets. Our strategy first constructs one segment tree for each interval set and then converts each segment tree to an indexed segment tree forest by cutting the tree at a certain depth. Experiments on real data show that this data structure decreases the search time. This novel representation also enables parallel computations on each segment tree in the forest. We also extend this solution to the problem of finding at least k common overlapping intervals over n interval sets. The tools and techniques developed herein will hopefully expedite genomic research and help improve our understanding of the molecular biology of the cell and the mechanisms underlying diseases.

Keywords: Genomic Intervals, Interval Intersection, Single-Nucleotide Polymorphisms (SNPs), Genomic Variants, Gene Sets, Annotation and Enrichment Analysis, Regulatory Sequence Analysis, Joint Enrichment Analysis, DNA Regulatory Elements, n Interval Set Intersection


ÖZ

TOOLS AND TECHNIQUES FOR ASSESSING FUNCTIONAL RELEVANCE OF GENOMIC LOCI

Otlu, Burçak

Ph.D., Department of Computer Engineering

Supervisor : Prof. Dr. Tolga Can

Co-Supervisor : Prof. Dr. Sündüz Keles

June 2017, 144 pages

Genomic studies identify genomic loci, obtained with next generation sequencing (NGS) technologies, that represent genetic variations, transcription factor occupancy, or histone modifications. Interpreting these genomic loci requires evaluating them against known genomic and epigenomic annotations. In this thesis, tools and techniques are developed for assessing the functional relevance of genomic intervals. Towards this goal, we first present the Genomic Loci ANnotation and Enrichment Tool (GLANET) as a comprehensive annotation and enrichment analysis tool. The input to GLANET is a set of genomic intervals. GLANET performs annotation and enrichment analysis on these genomic intervals with a rich library that includes (i) gene-centric regions that also cover the non-coding neighborhoods of genes, (ii) a large collection of regulatory regions from ENCODE, and (iii) gene sets derived from pathways. As an important feature, users can extend this library with new gene sets and genomic intervals. GLANET applies a sampling-based enrichment test that can account for the GC content and/or mappability biases inherent to NGS technologies and that shows high statistical power and a well-controlled Type-I error rate. Other important features of GLANET include assessing the impact of single nucleotide variants on transcription factors when the input consists only of single nucleotide variants, and performing gene set enrichment analysis that is not only exon based but also regulation based, by taking into account the introns and proximal regions of the genes in the gene set. GLANET also allows joint enrichment analysis for TF binding sites and KEGG pathways. With this option, users can evaluate whether the input set is enriched concurrently with both TF binding sites and the genes in a KEGG pathway. This joint enrichment analysis enables a detailed functional interpretation of the input intervals. As a second contribution of this thesis, we designed new data-driven computational experiments for assessing the power and Type-I error of enrichment procedures. The data-driven computational experiments also make a detailed quantitative comparison of GLANET with other tools possible. Our results on these computational experiments showcase GLANET’s unique capabilities as well as its robustness, speed and accuracy. Finally, as a third contribution, we present an efficient algorithmic solution for finding common overlapping intervals over n interval sets. Our strategy is based on first constructing a segment tree for each interval set, and it proceeds by cutting this tree at a certain depth and converting the cut segment tree to an indexed segment tree forest. Experiments on real data show that this data structure decreases the search time. This new representation also enables parallel computations on each segment tree in the forest. In addition, we extended this solution to solve the problem of finding at least k common overlapping intervals over n interval sets. We hope that the tools and techniques developed in this thesis will expedite genomic research and help us understand the molecular biology of the cell and the mechanisms underlying diseases.

Keywords: Genomic Intervals, Interval Intersection, Single Nucleotide Variants, Genomic Variants, Gene Sets, Annotation and Enrichment Analysis, Regulatory Sequence Analysis, Joint Enrichment Analysis, DNA Regulatory Elements, n Interval Set Intersection


To my mother and father, Feride and Fikret

To my daughter and son, Betül and Süleyman Ediz


ACKNOWLEDGMENTS

A PhD may take a long time; mine took six and a half years. Now, I would like to go back in time and remember some important dates and events that took place throughout my PhD.

On September 29, 2009, at midnight, my daughter, a newborn baby just 49 days old, had an operation in Hacettepe University Hospital. I would like to present my gratitude to my father, Dr. Fikret Otlu, Prof. Dr. Özgür Deren and Prof. Dr. Cemalettin Aksoy for their existence and for the successful operation. After the operation, I stayed with my daughter, Betül, in the hospital for 42 days for her treatment. During our stay, I decided to pursue a PhD. Later on, my PhD journey started officially on September 13, 2010.

In 2010, I read a book by Prof. Dr. Pavel Pevzner and Neil C. Jones, titled "An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)"; after that, I was determined to study Bioinformatics and Computational Biology. I would like to thank them for their efforts and for this well-written book.

In the spring of 2012, I took a course from Assist. Prof. Can Alkan at Bilkent University. For the course project, I told him that I would like to work on GWAS data, and he referred me to Assist. Prof. Öznur Tastan, who started collaborating with me and with Prof. Dr. Sündüz Keles, who was at Bilkent University at that time for her sabbatical leave. We held our first meeting in Öznur Tastan’s office. I remember that Sündüz Keles was wearing a black-and-white striped blouse and a black skirt, and she explained the "small n, big p" problem on the board.

My course project, which then turned into my PhD studies, started with Sündüz Keles and Öznur Tastan in this way. Together with Sündüz Keles and Öznur Tastan, I developed our tool, GLANET. At first, there was nothing, like a blank page. For more than four years, day in day out, step by step, I coded this tool. The GitHub repository keeps all the history. Our four-year-long Skype meetings made GLANET evolve over time. Sometimes I fell into the traps of perfectionism: unnecessary implementations for the feeling of completeness, or implementations that were necessary at that time but later were not. Later on, journal reviewers determined the new directions for the tool. A lot of analyses and comparisons took place this time. I would like to thank Sündüz Keles and Öznur Tastan for almost weekly meetings, support and guidance through all these years. I would also like to thank Can Fırtına for his initial work on the GUI, command line arguments and documentation of GLANET, which I continued and maintained later on.

When I started my thesis in 2012, my advisor from METU, Prof. Dr. Tolga Can, was in Cyprus for his sabbatical leave. So we couldn’t start together, but I’m happy that we finished together. I would like to thank him for his suggestions and brilliant ideas. He was always available when I asked for help.

I would like to thank Prof. Dr. Afsin Sarıtas for his help and support in looking after our children; without his help I couldn’t have looked after them this well.

I would like to thank my department for the beautiful room and my corner next to the window, from where I can see the sky, clouds, sun and sometimes the moon, which relieves me especially when I’m overwhelmed and weary. I would like to thank all of my professors and assistant friends in the department for their friendship and company.

I would like to thank TÜBITAK ULAKBIM for providing high performance and grid computing resources for carrying out my experiments. I would like to acknowledge that I have been supported by The Scientific and Technological Research Council of Turkey (TÜBITAK 2211-C PhD Scholarship) during my PhD studies.

Although I have a family and I’m a mother of two beautiful children, during my PhD I felt loneliness deep down in my soul. What I want to say is that a PhD may become a long period of your life in which you are mostly by yourself; you work alone most of the time, and it requires endurance and perseverance. At least, that was the case for me.

Nonetheless, I’m grateful for a lot of things. First of all, I’m grateful for my children, Betül and Süleyman Ediz; they are definitely the main driving forces in my life. They made me strong, hopeful and courageous. Then my parents, Feride and Fikret: if they hadn’t looked after my children, I couldn’t have pursued this PhD. As a legacy, I took a smiling face and a sincere, loving heart full of passion and compassion from them. I’m strong, confident, determined, courageous, faithful and hopeful. When I feel down, I like singing, especially classical Turkish music, and sometimes I just remember the lyrics of a song, something like "I have a dream, a song to sing, to help me cope with anything". Whatever we do during the day, at the end of the day, it is the heart that really matters. And our hearts are like our GPSes; they somehow know our true calling and which direction to go. I hope we will all have the courage to follow our hearts, dreams and intuitions at any time, at any age. All the work presented here is love made visible, as Khalil Gibran said. And I would like to end my acknowledgments with the words of Winston Churchill: "Success is not final, failure is not fatal: it is the courage to continue that counts."


TABLE OF CONTENTS

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

ÖZ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii

LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxi

LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxii

CHAPTERS

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 BIOLOGICAL BACKGROUND AND RELATED WORK . . . . . . 7

2.1 Biological Terms . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Related Work Regarding Thesis Part 1 . . . . . . . . . . . . 10


2.3 Related Work Regarding Thesis Part 2 . . . . . . . . . . . . 19

3 ANNOTATION OF GENOMIC LOCI . . . . . . . . . . . . . . . . . 21

3.1 User Query . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2 GLANET Annotation Library . . . . . . . . . . . . . . . . . 21

3.3 Library Representation . . . . . . . . . . . . . . . . . . . . 24

3.4 Interval Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.5 Time and Space Complexity of Annotation . . . . . . . . . . 26

4 REGULATORY SEQUENCE ANALYSIS OF SINGLE NUCLEOTIDE POLYMORPHISMS . . . . . . . . . . 27

4.1 Regulatory Sequence Analysis . . . . . . . . . . . . . . . . 27

4.2 GLANET Use Case: Regulatory Sequence Analysis of OCD GWAS SNPs . . . . . . . . . . 30

5 ENRICHMENT ANALYSIS OF GENOMIC REGIONS . . . . . . . 31

5.1 Enrichment Analysis . . . . . . . . . . . . . . . . . . . . . . 31

5.2 Random Interval Sampling Procedure . . . . . . . . . . . . . 35

5.2.1 GC and Mappability Calculation . . . . . . . . . . 37

5.3 Time and Space Complexity of Random Interval Generation . 40

5.4 Joint Enrichment Analysis of Transcription Factors and KEGG Pathways . . . . . . . . . . 41

5.5 Time and Space Complexity of Enrichment Analysis . . . . . 42


6 DATA DRIVEN COMPUTATIONAL EXPERIMENTS . . . . . . . . 43

6.1 Design of Data-driven Computational Experiments . . . . . . 43

6.1.1 Type-I error experiments . . . . . . . . . . . . . . 44

6.1.2 Power experiments . . . . . . . . . . . . . . . . . 44

6.1.3 Transcriptional activator and repressor elements . . 45

6.1.4 Genomic interval sets for expressed genes . . . . . 46

6.1.5 Genomic interval sets for non-expressed genes . . 46

6.2 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.2.1 Data-driven Computational Experiments Results for Activator Elements . . . . . . . . . . 47

6.2.2 Data-driven Computational Experiments Results for Repressor Elements . . . . . . . . . . 59

6.2.3 GLANET GAT Comparison Results for Activator and Repressor Elements through Data-driven Computational Experiments . . . . . . . . . . 62

6.2.4 Assessing GLANET Enrichment Parameters through Wilcoxon Signed Rank Tests . . . . . . . . . . 67

6.2.5 Assessing GLANET Enrichment Parameters through ROC Curves and Comparison with GAT . . . . . . . . . . 72

7 GLANET USE CASES AND RUN TIME COMPARISONS . . . . . 81

7.1 GLANET GAT Comparison with Additional Data-sets . . . . 81


7.2 Example Use Cases of GLANET . . . . . . . . . . . . . . . 88

7.2.1 Enrichment Analysis of OCD GWAS SNPs . . . . 88

7.2.2 Enrichment Analysis of GATA2 Binding Regions for Gene Ontology Terms using User-defined Gene Sets Feature . . . . . . . . . . 89

7.3 GLANET Run Time Comparison . . . . . . . . . . . . . . . 90

7.3.1 Comparison with GAT . . . . . . . . . . . . . . . 91

7.3.2 Comparison with GREAT . . . . . . . . . . . . . 94

8 FINDING OVERLAPPING INTERVALS FOR N GIVEN INTERVAL SETS . . . . . . . . . . 97

8.1 Segment Tree . . . . . . . . . . . . . . . . . . . . . . . . . 97

8.2 Segment Tree Construction Complexity Analysis . . . . . . . 99

8.3 Segment Tree Query . . . . . . . . . . . . . . . . . . . . . . 99

8.4 Motivation: Indexed Segment Tree Forest . . . . . . . . . . . 100

8.4.1 Hash Function, Preset Value . . . . . . . . . . . . 100

8.4.2 Cut-off Depth . . . . . . . . . . . . . . . . . . . . 101

8.4.3 Moving Intervals That Were Stored in The Nodes Above The Cut-off Depth . . . . . . . . . . 102

8.4.4 Linking Segment Tree Nodes at Cut-off Depth to Each Other . . . . . . . . . . 102

8.5 Indexed Segment Tree Forest in More Details . . . . . . . . 103


8.6 Query in Indexed Segment Tree Forest . . . . . . . . . . . . 103

8.6.1 How to Guarantee at Most Two Additional Index Searches Are Enough? . . . . . . . . . . 104

8.7 Finding n Common Overlapping Intervals for n Given Interval Sets . . . . . . . . . . 105

8.8 Finding at Least k Common Overlapping Intervals for n Given Interval Sets . . . . . . . . . . 112

9 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . 117

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

APPENDICES

A GLANET DATA SOURCES . . . . . . . . . . . . . . . . . . . . . . 131

B TYPE-I ERROR, POWER AND ROC CURVE FIGURES . . . . . . 133

CURRICULUM VITAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143


LIST OF TABLES

TABLES

Table 2.1 Available tools including GLANET are compared with respect to

their accepted input types and annotation libraries utilized. . . . . . . . . . 14


Table 2.2 Available tools including GLANET are compared with respect to their statistical tests carried out and enrichment options provided. . . . . . 16


Table 2.3 Available tools including GLANET are compared with respect to their enrichment analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Table 5.1 GLANET main parameters for enrichment test. . . . . . . . . . . . 34

Table 6.1 Data-driven Computational Experiments for GLANET . . . . . . . 48

Table 6.2 Data-driven Computational Experiments for GAT . . . . . . . . . . 48

Table 6.3 Type-I error rates calculated in data-driven experiments conducted

with repressor elements, H3K27me3 and H3K9me3, in GM12878 and

K562 cell lines for α = 0.05. . . . . . . . . . . . . . . . . . . . . . . . . 59

Table 6.4 Type-I error rates calculated in data-driven experiments conducted

with repressor elements, H3K27me3 and H3K9me3, in GM12878 and

K562 cell lines for α = 0.001. . . . . . . . . . . . . . . . . . . . . . . . . 60

Table 6.5 Power calculated in data-driven experiments conducted with repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.05. . . . . . . . . . . 61

Table 6.6 Power calculated in data-driven experiments conducted with repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.001. . . . . . . . . . . 61

Table 6.7 One-sided Wilcoxon signed rank test results for testing whether the Type-I error distribution of experiments generated under the parameter setting specified in the row has lower mean of ranks compared to the distribution of Type-I errors generated under the parameter setting specified in the column, where the null hypothesis states that there is no difference. A p-value presented in the cell indicates that setting in the corresponding row has a lower mean of ranks in Type-I error distribution than the setting in the corresponding column; if the cell is empty the opposite holds. The p-values are less than or equal to the actual test result. . . . . . . . . . . 68

Table 6.8 Wilcoxon Signed Rank Tests for (woIF,wIF). Type-I error distribution of wIF is less than Type-I error distribution of woIF. To decrease Type-I error, we prefer wIF over woIF. . . . . . . . . . . 69

Table 6.9 Wilcoxon Signed Rank Tests for (EOO,NOOB). Type-I error distribution of NOOB is less than Type-I error distribution of EOO. To decrease Type-I error, we prefer NOOB over EOO. . . . . . . . . . . 69

Table 6.10 Table summarizes random interval generation option that achieves

the lowest Type-I error for non-expressed and expressed gene intervals

using association measures EOO and NOOB and the two isochore family

options woIF and wIF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


Table 6.11 Kolmogorov-Smirnov test results. Null hypothesis states that the distribution of GC content or mappability values calculated for 50,000 randomly sampled intervals from human genome and the corresponding interval set are not different. Each row corresponds to Kolmogorov-Smirnov testing of this null hypothesis. In all tests, the null hypothesis is rejected (p-value < 2.2e-16). The first column lists the property of the genome in question, the second column lists the distribution that is compared with the genome, finally the last column lists the maximum distance between the two distributions. . . . . . . . . . . 72

Table 6.12 GLANET and GAT ROC curves comparison results under (EOO,woIF)

setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Table 6.13 GLANET and GAT ROC curves comparison results under (EOO,wIF)

setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Table 6.14 GLANET and GAT ROC curves comparison results under (NOOB,woIF)

setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

Table 6.15 GLANET and GAT ROC curves comparison results under (NOOB,wIF)

setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

Table 6.16 We compared the winner settings from Tables 6.12-6.15 with each other. . . . . . . . . . . 78

Table 6.17 ROC curves of different parameter settings where (woIF) setting is

on are compared. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Table 6.18 ROC curves of different parameter settings where (wIF) setting is

on are compared. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Table 6.19 ROC curves of different "Generate Random Data Options" are compared. . . . . . . . . . . 80

Table 7.1 Experiment1: Intervals of transcription factor Srf in Jurkat cell line are overlapped with DNaseI hypersensitive sites in Jurkat cell line. Both GAT and GLANET find enrichment of DNaseI(Jurkat) for Srf(Jurkat). . . 84



Table 7.2 Experiment2: Intervals of transcription factor Srf in Jurkat cell line are overlapped with DNaseI hypersensitive sites in HepG2 cell line. Both GAT and GLANET find enrichment of DNaseI(HepG2) for Srf(Jurkat). . . 85

Table 7.3 Experiment3: DNaseI hypersensitive sites in HepG2 cell line are

overlapped with DNaseI hypersensitive sites in Jurkat cell line. Both GAT

and GLANET find enrichment of DNaseI(Jurkat) for DNaseI(HepG2). . . 86


Table 7.4 Experiment4: Intervals of transcription factor Srf in Jurkat cell line are overlapped with DNaseI hypersensitive sites in HepG2-Unique cell line. Both GAT and GLANET find no enrichment of DNaseI(HepG2-Unique) for Srf(Jurkat). . . . . . . . . . . 87

Table 7.5 GO semantic similarity scores calculated between the set of biological process GO terms that GATA2 is annotated with and the set of GO terms where GATA2 binding regions are found enriched based on GLANET enrichment analysis in three different analysis modes (exon, regulatory based and all-based). . . . . . . . . . . 90

Table 7.6 Elapsed CPU (user + system) run times in seconds for GLANET

and GAT runs for a given input query are provided. . . . . . . . . . . . . . 91

Table 7.7 Elapsed wall clock times in seconds for GLANET and GAT runs

for a given input query are provided. . . . . . . . . . . . . . . . . . . . . 92

Table 7.8 CPU (user + system) times in seconds spent for GLANET and GAT

runs given the input query specified. . . . . . . . . . . . . . . . . . . . . 93

Table 7.9 Wall clock times in seconds spent for GLANET and GAT runs given

the input query specified. . . . . . . . . . . . . . . . . . . . . . . . . . . 93

Table 7.10 CPU (user + system) time in seconds spent for GLANET runs given

the input query specified. For 1,000 and 10,000 samplings, each run time

is the average of 10 individual runs. . . . . . . . . . . . . . . . . . . . . . 94


Table 7.11 Wall clock time in seconds spent for GLANET runs given the input

query specified. For 1,000 and 10,000 samplings, each run time is the

average of 10 individual runs. . . . . . . . . . . . . . . . . . . . . . . . . 95

Table 8.1 Various preset values and cut-off depth decisions are compared. Construction time and search time of indexed segment tree forest and segment tree in wall clock time are averaged over 100 runs. P-values resulting from paired t-test for search run times of indexed segment tree forest and segment tree are provided. . . . . . . . . . . 106

Table A.1 GLANET data sources and their download dates. . . . . . . . . . . 131


LIST OF FIGURES

FIGURES

Figure 3.1 (a) Overall functionality of GLANET. (b) Gene-centric genomic intervals are defined based on commonly used location analyses in ChIP-seq and related studies [43]. GLANET uses these intervals to provide detailed annotation of user query with respect to known genes. . . . . . . 22

Figure 3.2 Genomic intervals are represented in interval trees [44]. A separate

interval tree is constructed for each chromosome and genomic element

type, e.g. for transcription factor binding annotations. Each node contains

the low and high endpoints of the genomic interval, the color of the node

(red or black), the maximum high endpoint stored in the subtree rooted at

this node and the genomic elements annotated with this particular genomic

interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

Figure 4.1 Three main steps of regulatory sequence analysis in GLANET. . . . 29

Figure 4.2 GLANET regulatory sequence analysis for the OCD SNPs annotated with TFs in the library. (a) SNP rs1891215 located at chr1:7,667,794 changes reference nucleotide A to G, and as a result, leads to a better match to the STAT1 PFM, i.e., the p-value of the match to the STAT1 PFM changes from 1.1e-3 to 6.1e-5. (b) SNP rs10946279 (chr6:170,553,248) changes reference nucleotide C to T, thereby decreasing the significance of the match to the MAX PFM, i.e., the p-value of the match increases from 6.1e-5 to 1.5e-3. . . . . . . . . . . 30


Figure 5.1 Box plots of GC content and mappability values for ten different

ENCODE files, for each element type. . . . . . . . . . . . . . . . . . . . 33

Figure 6.1 Design for data-driven computational experiments for expressed

genes. N is set to 1000. Activator elements are defined as H2AZ, H3K27ac,

H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1,

H4K20me1, [8] and POL2; whereas H3K27me3 and H3K9me3 constitute

the repressor elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Figure 6.2 Design for data-driven computational experiments for non-expressed genes. N is set to 1000. Activator elements are defined as H2AZ, H3K27ac, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1, H4K20me1, [8] and POL2; whereas H3K27me3 and H3K9me3 constitute the repressor elements. . . . . . . . . . . 46

Figure 6.3 Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05. . . . . 51



Figure 6.4 Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05. . . . . . . . . . . 52





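Figure 6.5 Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05. . . . . . . . . . . 53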




Figure 6.6 Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05. . . . . . . . . . . 54




Family (wIF) heuristic using K562, (Non-expressed Genes, Completely-

Discard) and (Expressed Genes, Top5) results, for significance level of

0.001. (c, d) Type-I error and power estimated without Isochore Family

(woIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard)

and (Expressed Genes, Top5) results, for significance level of 0.001. . . . 55




are marked with ∗. (a, b) Type-I error and power estimated with Iso-

chore Family (wIF) heuristic using GM12878, (Non-expressed Genes,


level of 0.001. (c, d) Type-I error and power estimated without Iso-

chore Family (woIF) heuristic using GM12878, (Non-expressed Genes,


level of 0.001. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56




Family (wIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest)

and (Expressed Genes, Top20) results, for significance level of 0.001.

(c, d) Type-I error and power estimated without Isochore Family (woIF)

heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Ex-

pressed Genes, Top20) results, for significance level of 0.001. . . . . . . . 57




Family (wIF) heuristic using GM12878, (Non-expressed Genes, TakeTh-

eLongest) and (Expressed Genes, Top20) results, for significance level of

0.001. (c, d) Type-I error and power estimated without Isochore Fam-

ily (woIF) heuristic using GM12878, (Non-expressed Genes, TakeThe-

Longest) and (Expressed Genes, Top20) results, for significance level of

0.001. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


Figure 6.11 Comparison of GLANET and GAT with respect to data-driven

computational experiments in terms of Type-I Error and Power for sig-

nificance level of 0.05. GLANET(wIF,wGC) and GAT(wIF) parameter

settings results are used. Results for the two association statistics - exis-

tence of overlap (EOO) and the number of overlapping bases (NOOB)

are displayed. (a, b) Type-I error and power of activator elements in

(Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5)

experiment settings, respectively. (c, d) Type-I error and power of re-

pressor elements in (Expressed Genes, Top5) and (Non-expressed Genes,

CompletelyDiscard) experiment settings, respectively. GLANET achieves

higher power for H3K9me3 than GAT. . . . . . . . . . . . . . . . . . . . 63

Figure 6.12 Comparison of GLANET and GAT with respect to data-driven




tence of overlap (EOO) and the number of overlapping bases (NOOB) are

displayed. (a, b) Type-I error and power of activator elements in (Non-

expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) exper-

iment settings, respectively. (c, d) Type-I error and power of repressor el-

ements in (Expressed Genes, Top20) and (Non-expressed Genes, TakeTh-

eLongest) experiment settings, respectively. . . . . . . . . . . . . . . . . 64







expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) ex-

periment settings, respectively. (c, d) Type-I error and power of repressor

elements in (Expressed Genes, Top5) and (Non-expressed Genes, Com-

pletelyDiscard) experiment settings, respectively. . . . . . . . . . . . . . . 65








expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) exper-

iment settings, respectively. (c, d) Type-I error and power of repressor el-

ements in (Expressed Genes, Top20) and (Non-expressed Genes, TakeTh-

eLongest) experiment settings, respectively. . . . . . . . . . . . . . . . . 66

Figure 6.15 Violin plots for (a) GC of randomly sampled intervals from hu-

man genome, GC of intervals of GM12878 non-expressed genes and ex-

pressed genes. (b) Mappability of randomly sampled intervals from hu-

man genome, mappability of intervals from non-expressed and expressed

gene-sets of GM12878. . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Figure 6.16 Violin plots for (a) GC of randomly sampled intervals from hu-

man genome, GC of intervals of K562 non-expressed genes and expressed

genes. (b) Mappability of randomly sampled intervals from human genome,

mappability of intervals from non-expressed and expressed gene-sets of

K562. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Figure 6.17 ROC Curves for (a) H3K9ME3 in K562 under parameter (NOOB,

woIF) and experiment (CompletelyDiscard, Top5) (b) H3K9ME3 in K562

under parameter (NOOB, wIF) and experiment (CompletelyDiscard, Top5)

settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Figure 6.18 ROC Curves for (a) H4K20ME1 in GM12878 under parameter

(NOOB, wIF) and experiment (CompletelyDiscard, Top5) (b) H4K20ME1

in K562 under parameter (NOOB, wIF) and experiment (CompletelyDis-

card, Top5) settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74


Figure 6.19 ROC Curves for (a) H3K4ME1 in K562 under parameter (NOOB,

woIF) and experiment (TakeTheLongest, Top20) (b) H3K4ME1 in K562

under parameter (NOOB, woIF) and experiment (CompletelyDiscard, Top5)

settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Figure 6.20 ROC Curves for (a) POL2 in GM12878 under parameter (EOO,

woIF) and experiment (CompletelyDiscard, Top5) (b) POL2 in K562 un-

der parameter (NOOB, wIF) and experiment (TakeTheLongest, Top20)

settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Figure 7.1 GLANET and GAT are run on four experiments ranging from high

to low expected association between the compared genomic interval sets.

Each row depicts an experiment where the first set is input query and the

second set is a genomic element in the annotation library, e.g., experiment

Srf(Jurkat) vs. DNaseI(Jurkat) evaluates whether the binding regions of

transcription factor Srf in Jurkat cells are enriched for DNaseI accessible,

i.e., open chromatin, regions in the same cells. . . . . . . . . . . . . . . . 82

Figure 8.1 Intervals (s1, s2, s3, s4, s5) are stored in the nodes. The arrows from

the nodes point to their canonical subsets. . . . . . . . . . . . . . . . . . . 99

Figure 8.2 Blue colored segment tree nodes at cut-off depth and red colored

nodes with no children at depth above the cut-off depth are stored in our

segment tree forest. To enhance fast access, these stored segment tree

nodes are connected to each other through forward and backward links. . . 102

Figure 8.3 Segment tree nodes with the same index are stored in a BST and

index now points to the root of BST. Blue and red colored nodes are origi-

nal segment tree nodes which are linked to each other. Blue colored nodes

are in fact the roots of the segment trees below them. Red colored nodes

do not have any children. Parents of these blue and red colored nodes are

the artificial nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Figure 8.4 Searching the nodes pointed by lowIndex and highIndex, the

nodes in between them, and plus two more nodes at most is enough. . . . . 104


Figure B.1 Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves

for H4K20ME1 in GM12878 for (EOO,CompletelyDiscard,Top5). . . . . 134


for H4K20ME1 in GM12878 for (NOOB,CompletelyDiscard,Top5). . . . 135


for H4K20ME1 in K562 for (EOO,CompletelyDiscard,Top5). . . . . . . . 136


for H4K20ME1 in K562 for (NOOB,CompletelyDiscard,Top5). . . . . . . 137


for H4K20ME1 in GM12878 for (EOO,TakeTheLongest,Top20). . . . . . 138


for H4K20ME1 in GM12878 for (NOOB,TakeTheLongest,Top20). . . . . 139


for H4K20ME1 in K562 for (EOO,TakeTheLongest,Top20). . . . . . . . . 140


for H4K20ME1 in K562 for (NOOB,TakeTheLongest,Top20). . . . . . . . 141


LIST OF ALGORITHMS

ALGORITHMS

Algorithm 5.1 generateRandomIntervals . . . . . . . . . . . . . . . . . . . . 38

Algorithm 8.1 findingNCommonOverlappingIntervalsForNIntervalSets . . . . 107

Algorithm 8.2 search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Algorithm 8.3 mainSearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

Algorithm 8.4 searchAtLinkedNode . . . . . . . . . . . . . . . . . . . . . . . 109

Algorithm 8.5 searchForward . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Algorithm 8.6 searchBackward . . . . . . . . . . . . . . . . . . . . . . . . . 110

Algorithm 8.7 searchDownward . . . . . . . . . . . . . . . . . . . . . . . . . 110

Algorithm 8.8 searchAtLowerNode . . . . . . . . . . . . . . . . . . . . . . . 111

Algorithm 8.9 searchAtHigherNode . . . . . . . . . . . . . . . . . . . . . . . 111

Algorithm 8.10 findingAtLeastKCommonOverlappingIntervalsForNIntervalSets 112

Algorithm 8.11 fillEndPointsAndIntervals . . . . . . . . . . . . . . . . . . . . 113

Algorithm 8.12 sortEndPoints: Sort allEndPoints in ascending order . . . . . 113

Algorithm 8.13 constructSegmentTree: Using sortedAllEndPoints . . . . . . 113

Algorithm 8.14 storeIntervals: One interval set at a time . . . . . . . . . . . . . 113

Algorithm 8.15 findAtLeastK . . . . . . . . . . . . . . . . . . . . . . . . . . . 114


LIST OF ABBREVIATIONS

ABBRV Abbreviation

GLANET Genomic Loci ANnotation and Enrichment Tool

ENCODE Encyclopedia of DNA Elements

NGS Next Generation Sequencing

ChIP-seq Chromatin Immunoprecipitation Sequencing

BS-seq Bisulfite Sequencing

GWAS Genome-Wide Association Studies

SNP Single-Nucleotide Polymorphism

CNV Copy Number Variation

LD Linkage Disequilibrium

DHSs DNaseI Hypersensitive Sites

TF Transcription Factor

TFBS Transcription Factor Binding Sites

HM Histone Modification

KEGG Kyoto Encyclopedia of Genes and Genomes

GO Gene Ontology

GUI Graphical User Interface

DNA Deoxyribonucleic Acid

RNA Ribonucleic Acid

RSA Regulatory Sequence Analysis

RSAT Regulatory Sequence Analysis Tool

PFM Position Frequency Matrix

TSS Transcription Start Site


UTR Untranslated Region

TPM Transcripts Per Million

DDCE Data Driven Computational Experiments

IF Isochore Family

wGC with GC

wM with Mappability

wGCM with GC and Mappability

woGCM without GC and Mappability

wIF with Isochore Family

woIF without Isochore Family

EOO Existence of Overlap

NOOB Number of Overlapping Bases

ROC Receiver Operating Characteristic

AUC Area Under Curve

OCD Obsessive Compulsive Disorder

BST Binary Search Tree

JOF Joint Overlap Analysis Framework

JEF Joint Enrichment Analysis Framework

VCF Variant Call Format

miRNA MicroRNA


CHAPTER 1

INTRODUCTION

High-throughput sequencing technologies are routinely used for cataloging genomic

variants [1, 2, 3], profiling protein-DNA interactions, histone modifications (ChIP-seq

[4]), DNA methylation (e.g., BS-seq [5]), and mapping of accessible chromatin (e.g.,

DNase-seq [6], ATAC-seq [7]). Analyses of these experiments reveal sets of genomic

intervals. Assessing the functional relevance of these genomic intervals requires inte-

grating them with already known genomic and epigenomic annotations. For example,

functional interpretation of a list of single nucleotide polymorphisms (SNPs) or copy

number variation (CNV) regions requires evaluating whether these genomic varia-

tion sites reside in gene coding regions, transcription factor binding sites, or histone

modification sites, or assessing whether the list is enriched with one or more path-

ways. Similarly, individual research groups often profile protein-DNA interactions

or histone modifications. A routine practice is to query resulting genomic intervals

against available consortia-derived genomic annotations such as those generated in

ENCODE [8] or against other available data generated by the research group. This

thesis develops tools and techniques that facilitate such analyses.

The work carried out in this thesis has two main parts. In the first part, we concentrate on the annotation of genomic regions and on enrichment analysis that adjusts for genomic biases, using efficient data structures that store data at varying resolutions. For this purpose, we developed GLANET as both an annotation and an enrichment tool. Additionally, in order to assess the performance of its enrichment procedure, we designed novel data-driven computational experiments.

In the second part of the thesis, we extend the annotation of genomic intervals to finding at least k or n common overlapping intervals from n given interval sets. To reduce the search time, we propose novel data structures that also make further parallel computation possible.

Several tools are available for annotation and enrichment analysis of genomic regions. They differ in the types of input they accept, their annotation libraries, their enrichment tests, and the downstream analyses, if any, that they enable.

FunciSNP [9], HaploReg [10], ALIGATOR [11], Annotate-it [12], PANOGA [13] and FORGE [14] only accept SNPs as input. The ENCODE ChIP-Seq Significance Tool [15] is similarly limited: it provides annotation and enrichment only for input gene lists. RegulomeDB [16], SnpEff [17], Ensembl SNP Effect Predictor (VEP) [18], ANNOVAR [19] and FunciSNP do not provide enrichment analysis.

There are a few tools available for annotation and enrichment analysis of longer ge-

nomic intervals [20, 21, 22]. These are generally restricted by the annotation libraries

they utilize. For example, INRICH tests for enrichment of only pre-defined gene sets

[20]. GREAT [21] takes a set of non-coding genomic regions and provides analysis

with respect to the annotations of nearby genes. The enrichment analysis in GREAT

does not take into account potential genomic biases involved in generation of the in-

put genomic regions. In contrast, GAT [22] is more flexible. It takes as input genomic

intervals and user-provided annotation libraries. Compared to INRICH and GREAT,

GAT enables users to input a workspace to define a subset of the genome for estimating an appropriate null distribution during enrichment analysis. However, GAT’s built-in capabilities are restricted, and it does not work with gene-sets. Furthermore, it relies on the user to define and provide input files that specify where the random samples will be generated from. This knowledge, however, is often not available to the user.

In summary, the existing tools have a number of notable shortcomings. Firstly, the majority of the tools are specific to inferring the potential functionality of a given set of SNPs and do not accommodate longer genomic intervals resulting from NGS experiments such as ChIP-seq and BS-seq, or insertion and deletion variants. Secondly, most of these tools do not account for systematic biases such as mappability and GC content introduced by the sequencing technologies [23, 24, 25, 26, 27, 28]. Thirdly, gene set


or pathway enrichment tools do not support analysis with non-coding upstream and

downstream regions of the genes. Finally, and perhaps most importantly, they work with fixed annotation libraries and do not enable users to add their own libraries for annotation and enrichment. The lack of such a feature limits the analyses that can be accomplished with these tools.

We developed GLANET as an annotation and enrichment tool with several useful built-in analysis capabilities for the human genome. The GLANET annotation library includes a rich set of genomic information: (i) regions defined on and in the neighborhood of coding regions that encompass regulatory regions; (ii) ENCODE-derived

potential regulatory regions that encompass binding sites for multiple transcription

factors, DNaseI hypersensitive sites, modification regions for multiple histones across

a wide variety of cell types; and (iii) gene sets derived from KEGG [29] pathways and

GO [30] terms. Users can easily annotate their input intervals with the genomic el-

ements defined in the annotation library and expand the GLANET library by adding

user-defined libraries and/or pre-defined gene sets.

In order to evaluate whether the input intervals overlap significantly with the genomic

elements in the GLANET annotation library, GLANET implements an enrichment

procedure that accounts for mappability [23, 24, 25] and GC content [26, 27, 28] bi-

ases inherent to NGS. When the input intervals are derived from an NGS experiment,

these biases constrain regions of the genome that can contribute to interval generation. Few of the existing tools account for these biases. For example, FORGE [14]

randomly samples SNPs from regions that match the GC content of the input SNPs

to estimate a null distribution for enrichment testing. GAT [22] divides the genome

into isochore families that have similar GC content and performs sampling for each

isochore separately and, as a result, provides a coarse level matching of GC content.

GLANET estimates a null model from randomly sampled intervals that match each

interval of the input in terms of chromosome, length, mappability, and GC content as

opposed to operating on the average properties of the input intervals. Although this

sampling strategy is computationally intensive, GLANET conducts these analyses

rapidly by deploying efficient search strategies enabled by appropriately constructed

representations of the genomic intervals.
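The per-interval matching described above can be illustrated with a minimal sketch. It is not GLANET's implementation: the function names, the tolerance parameters `gc_tol` and `map_tol`, the rejection-sampling loop, and the empirical p-value helper are all assumptions made for illustration, and the chromosome is represented by a plain string with a per-base mappability list.

```python
import random


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def sample_matched_interval(chrom_seq, chrom_map, length, target_gc, target_map,
                            gc_tol=0.1, map_tol=0.1, max_tries=10_000, rng=random):
    """Draw one random interval [start, end) of the given length from a single
    chromosome so that its GC content and mean mappability are within the
    (hypothetical) tolerances of the input interval's values."""
    if length > len(chrom_seq):
        raise ValueError("interval is longer than the chromosome")
    for _ in range(max_tries):
        start = rng.randrange(0, len(chrom_seq) - length + 1)
        end = start + length
        gc = gc_fraction(chrom_seq[start:end])
        mappability = sum(chrom_map[start:end]) / length
        if abs(gc - target_gc) <= gc_tol and abs(mappability - target_map) <= map_tol:
            return start, end
    return None  # no matching interval found within max_tries


def empirical_p_value(observed, null_values):
    """Fraction of null statistics that are at least as large as the observed one
    (an assumed way of turning the sampled null model into a p-value)."""
    return sum(value >= observed for value in null_values) / len(null_values)
```

Repeating the draw once per input interval yields one random interval set; repeating that many times gives the null statistics against which the observed overlap can be compared.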


GLANET additionally provides several built-in analysis tools for specific input types.

When the input is a SNP list, users can evaluate whether the SNPs reside in transcrip-

tion factor binding regions and, if so, whether they are located in the actual transcrip-

tion factor binding motifs obtainable via either the reference or the SNP allele and

whether the variation potentially impacts the binding of TFs, either by enhancing or

disrupting binding motifs. GLANET enables joint enrichment analysis for transcrip-

tion factor binding and KEGG pathways. With this option, users can evaluate whether

the input set is enriched concurrently with binding sites of TFs and the genes within a

KEGG pathway. This joint enrichment analysis provides a detailed functional inter-

pretation of the input loci.

Beyond providing a comprehensive tool that can help answer a variety of questions, this thesis contributes the design of data-driven computational experiments for evaluating GLANET’s enrichment procedure. In order to assess the statistical power and Type-I error of GLANET across its available parameter settings, we designed data-driven computational experiments using large collections of ENCODE ChIP-seq and RNA-seq data. These computational experiments indicated that while the GLANET enrichment test often performs conservatively in terms of Type-I error, it has high statistical power. We present comparisons of GAT and GLANET and illustrate applications of GLANET within different biological contexts.

In the second part of the thesis, we provide solutions to the problem of finding common overlapping intervals for n interval sets. This algorithmic problem is critical when analyzing genomic intervals originating from multiple data sets. We divide the problem into two sub-problems: finding n common overlapping intervals and finding at least k common overlapping intervals for n interval sets. For the first sub-problem, we construct one segment tree for each interval set and then convert each segment tree into an indexed segment tree forest. We observed that this representation reduces the search time. For the second sub-problem, we propose constructing one segment tree for all n interval sets and finding the overlapping intervals immediately after the construction of the segment tree is completed.
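To make the two sub-problems concrete, the sketch below solves them with a simple endpoint sweep rather than the segment tree structures proposed in this thesis; it is only meant to show what the expected output looks like. The function name and the half-open `(start, end)` interval convention are assumptions for illustration. Setting `k` equal to the number of sets gives the "n common overlapping intervals" case.

```python
def regions_with_at_least_k_sets(interval_sets, k):
    """Return maximal half-open regions (start, end) that are covered by
    intervals from at least k distinct interval sets.

    interval_sets: list of lists of (start, end) tuples with start < end.
    """
    events = []
    for set_idx, intervals in enumerate(interval_sets):
        for start, end in intervals:
            events.append((start, 1, set_idx))   # interval opens
            events.append((end, -1, set_idx))    # interval closes
    events.sort(key=lambda event: event[0])

    active = [0] * len(interval_sets)  # open intervals per set
    sets_present = 0                   # sets with at least one open interval
    regions = []
    region_start = None
    i = 0
    while i < len(events):
        pos = events[i][0]
        # Apply every event at this coordinate before testing coverage.
        while i < len(events) and events[i][0] == pos:
            _, delta, set_idx = events[i]
            if delta == 1:
                if active[set_idx] == 0:
                    sets_present += 1
                active[set_idx] += 1
            else:
                active[set_idx] -= 1
                if active[set_idx] == 0:
                    sets_present -= 1
            i += 1
        if sets_present >= k and region_start is None:
            region_start = pos
        elif sets_present < k and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions
```

The sweep sorts all endpoints once, so it runs in O(m log m) time for m intervals in total. It does not support repeated queries against a prebuilt index, which is where tree-based representations become useful.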

We can summarize our contributions in this thesis as follows:

• We develop a comprehensive annotation and enrichment tool with a rich set of


functionalities for the human genome. GLANET’s open source code is avail-

able with a comprehensive user’s manual and other supporting materials.

• We design novel data-driven computational experiments for assessing the Type-

I error rate and power of GLANET’s enrichment analysis. We show that GLANET

has low Type-I error with high statistical power and it is sensitive to varying

experiment and parameter settings, and significance levels. The data-driven

computational experiments are instrumental for assessing the enrichment capa-

bilities of other tools. Towards this aim, we conduct extensive experiments to

compare GLANET with existing enrichment tools with similar functionality.

• We present an algorithmic framework for finding n common overlapping intervals and finding at least k common overlapping intervals over n given interval sets. For this problem, indexed forests of short segment trees are constructed in lieu of one tall segment tree, which leads to a reduction in search time. This representation is inherently well suited for parallelization.

The rest of the thesis is organized as follows. In Chapter 2, we provide the necessary background on the biological terms used throughout the thesis and present an up-to-date overview of related work. In Chapter 3 and Chapter 4, two main functionalities of GLANET, (i) annotation of genomic intervals and (ii) Regulatory Sequence Analysis (RSA) of SNPs, are described in detail, respectively. We

dedicate Chapter 5 to enrichment analysis and Chapter 6 to data-driven compu-

tational experiments. In Chapter 7, we present various scenarios to showcase the

extensive built-in capabilities of GLANET and runtime comparisons. In Chapter 8,

we propose our solutions for finding n common overlapping intervals and at least k

such intervals over n interval sets. Finally, we conclude the thesis in Chapter 9 with

a final discussion and remarks on possible future directions.


CHAPTER 2

BIOLOGICAL BACKGROUND AND RELATED WORK

We utilize many biological terms throughout the thesis. In Section 2.1, we describe them briefly from a computer engineer’s point of view. Next, we present the related work in two separate sections, one for each part of the thesis introduced in Chapter 1. In Section 2.2, we provide an overview of existing tools for

genomic annotation and enrichment analysis, which is followed by Section 2.3, in

which we summarize the related work on finding common overlapping intervals for

n given interval sets.

2.1 Biological Terms

DNA Deoxyribonucleic acid (DNA) is the genetic material in almost all organisms

including humans.

RNA Like DNA, ribonucleic acid (RNA) is essential, and performs many functions

in the cell. Unlike DNA, it is single-stranded, and there exist many different

types of RNA.

mRNA Messenger RNA (mRNA) carries the genetic information necessary for the synthesis of proteins. After transcription, the resulting mRNA is translated into a protein.

Genome The genome is the complete set of genetic material of an organism, including its genes. The whole human genome is a DNA sequence of more than 3 billion base pairs, which resides in the cell nucleus.


Chromosome DNA and histone proteins are supercoiled and packaged into dense structures called chromosomes. Human somatic cells carry 23 pairs of chromosomes, and gametes (egg and sperm cells) carry 23 chromosomes.

Cell The cell is the smallest basic unit of all organisms. A cell contains the whole genome and has the ability to replicate itself. The human body contains trillions of cells.

Gene The gene is the key functional and physical unit in the genome. Proteins are made according to the instructions that genes carry on the DNA. The human genome consists of around 25,000 genes.

Eukaryotes Eukaryotic organisms such as fungi, plants and animals have membrane-bound organelles, and their genetic material is enclosed within a membrane-bound nucleus in their cells.

Prokaryotes Prokaryotes such as bacteria are single-celled organisms with no nucleus or membrane-bound organelles.

Transcript Level It is the level at which a gene’s DNA is transcribed into mRNA.

Gene Structure We concentrate on gene structure in eukaryotes. Promoter regions regulate gene expression. Promoters lie upstream of the coding region, and genes cannot be expressed without promoters. Enhancers increase gene expression; they are typically found far upstream of the promoters, but they can also lie between genes and downstream of genes. Coding regions are the exons of the genes, which are transcribed into mRNAs. Non-coding regions are the introns of the genes and the intergenic regions between the genes, which contain promoters and enhancers that regulate gene expression. The Transcription Start Site (TSS) marks the beginning of the 5’UTR, the untranslated region after which the coding region starts. The transcription stop site marks the end of the 3’UTR, the untranslated region that begins where the coding region ends. Therefore, the coding region of a gene lies between the 5’UTR and the 3’UTR. We adopt the gene-centric regions as they are depicted in Figure 3.1b.

NGS Next Generation Sequencing (NGS), also known as high-throughput sequencing, sequences DNA and RNA. It is faster and cheaper, needs less DNA, and is more accurate and reliable than the formerly used Sanger sequencing. It has had a revolutionary effect on genomics and molecular biology research.

GC Content GC content is the ratio of total number of guanines and cytosines to the

total number of bases in a given sequence.

$$\text{GC Content} = \frac{G + C}{A + T + G + C}$$

Mappability Mappability, also known as uniqueness, is a measure of how accurately sequencing reads can be mapped to a given region of the genome.

Genetic Variation Genetic variations are the differences in DNA sequences in each

of our genomes. Here are the major types of variations:

• Mutations are changes in one or more nucleotides that occur at random or due to environmental conditions. Mutations occur less frequently than SNPs.

• Single Nucleotide Polymorphisms (SNPs) are like typos: each SNP is a single nucleotide that differs from the reference genome. They are the most common type of genetic variation in the human genome; on average, there is one SNP every 300 nucleotides.

• Copy Number Variations (CNVs) are structural genomic variants. Some DNA sequences repeat themselves; CNVs are variations in the number of these repeats, which can result in deletions or insertions in the genome.

GWAS Genome-wide association studies (GWAS) investigate the genetic variations associated with a certain phenotype. This method looks for the genetic variations that exist more frequently in genomes with a certain phenotype than in genomes without the phenotype.

LD Two or more polymorphic loci are in linkage disequilibrium (LD) if their

respective alleles do not associate independently (randomly). In other words,

linkage disequilibrium describes the dependent (nonrandom) association be-

tween pairs of alleles at different loci [31].

TFs Transcription factors are proteins that bind to specific DNA sequences in promoter and enhancer regions, where they initiate and regulate the transcription of genes. They increase or decrease the transcript level of genes, which results in expressed or non-expressed genes, respectively. Expressed genes are turned on and produce proteins, whereas non-expressed genes are turned off and do not produce proteins.

HMs Histones are the proteins around which DNA is wrapped, and they take part in regulating gene expression. Histone modifications chemically modify the histone proteins and thereby affect gene regulation.

Pathway A pathway is a collection of manually drawn diagrams depicting the cur-

rent knowledge on molecular interactions, reactions and relations of the biolog-

ical functions.

GO Term Gene Ontology is a structured vocabulary that represents molecular func-

tion, biological process or cellular component.

VCF Variant Call Format (VCF) is a standardized format for storing DNA polymor-

phism data such as SNPs, insertions, deletions and structural variants developed

for the 1000 Genomes Project [32].

miRNA Human microRNAs (miRNAs) are evolutionarily conserved, short, non-coding, single-stranded RNA molecules. They are involved in gene regulation, implicated

in many human diseases and represent promising therapy options [33].
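The linkage disequilibrium entry above can be made concrete with the two measures that are commonly reported for a pair of biallelic loci, D′ and r². The sketch below uses the standard textbook definitions; the function name and its argument names are chosen here for illustration and do not come from any tool discussed in this thesis.

```python
def ld_measures(p_ab, p_a, p_b):
    """Compute (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # D is the deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    # D' rescales D by its largest attainable magnitude; the sign of D is kept.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the alleles at the two loci.
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared
```

When the alleles associate independently, `p_ab` equals `p_a * p_b` and all three values are zero; when the two alleles always occur together, D′ and r² both reach one.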

2.2 Related Work Regarding Thesis Part 1

Regarding the first part of the thesis, there are various tools that provide enrichment and/or annotation analysis on given genomic intervals. In this section, we describe these tools in more detail. We also provide a comprehensive summary of these tools in Tables 2.1-2.3.

We broadly classify available tools into two main classes with respect to their annotation and enrichment functionality. Notably, the majority of these tools are specific to inferring the functionality of a given set of SNPs and do not accommodate genomic loci of variable lengths obtained from NGS experiments such as ChIP-seq, BS-seq, and CNV analysis. FunciSNP [9], HaploReg [10], ALIGATOR [11], Annotate-it [12], PANOGA [13] and FORGE [14] only accept SNPs as input. Some tools provide only annotation and no enrichment analysis: RegulomeDB [16], SnpEff [17], Ensembl SNP Effect Predictor (VEP) [18], ANNOVAR [19] and FunciSNP. The ENCODE ChIP-Seq Significance Tool [15] is similarly limited: it provides annotation and enrichment only for input gene lists. GREAT [21], INRICH [20] and GAT [22] are the tools available for annotation and enrichment analysis of longer genomic intervals.

FunciSNP identifies candidate regulatory SNPs of a GWAS [9]. It takes as inputs the GWAS SNPs (tagSNPs) and a set of user-defined ENCODE ChIP-seq peak files (biofeatures) that are known to be related to the disease/phenotype of the GWAS. It considers all the SNPs within a certain window around the tagSNPs and, after overlapping them with the given biofeatures, prioritizes only the overlapping SNPs by calculating the LD measures r² and D′. FunciSNP does not perform enrichment analysis of functional elements or of any predefined gene sets; rather, it tries to identify the candidate regulatory SNPs.

HaploReg selects and displays the causal SNPs within the same Linkage Disequilibrium (LD) block that are enriched with a cell-specific DNaseI hypersensitive site and enhancer [10]. HaploReg obtains the necessary LD information from the 1000 Genomes Project and provides r² and D′ measurements for all genomic variants and their linked SNPs, which can be visualized along with their predicted chromatin state in nine cell types, their conservation across mammals, and their effect on regulatory motifs.

ALIGATOR takes LD-pruned SNPs of a GWAS and analyzes the enrichment of GO pathways [11]. It does so by calculating GO pathway-specific empirical p-values and correcting these empirical p-values for multiple testing with a bootstrap approach.

Annotate-it is aimed at experimentalists; it enables them to load their samples and compare variation among the samples [12]. It focuses only on single nucleotide variants and annotates the variants with their possible consequences on the transcripts of a certain gene, such as nonsense, essential splice site, nonsynonymous, synonymous and UTR.

FORGE outputs the enriched cell- and tissue-specific ENCODE-derived DNA elements for the given SNPs of a GWAS by using ChIP-Seq hotspots instead of peaks [14]. The FORGE analysis tool annotates the given GWAS SNPs with the functional elements from either the ENCODE or Roadmap Epigenomics projects that are generated by the Hotspot method, because hotspots reveal more tissue-specific signal. For the given SNP set, the number of overlaps is counted; background SNP sets are then created and their numbers of overlaps are counted, and the enrichment value of the given SNP set is expressed as a z-score.

The ENCODE ChIP-Seq Significance Tool identifies the enriched ENCODE transcription factors from a list of protein-coding genes, protein-coding transcripts, pseudogenes or pseudotranscripts using a one-tailed hypergeometric test [15]. RegulomeDB scores the given genomic variants to assess their regulatory potential by counting the number of different types of functional elements that overlap them [16]. Using a simple heuristic, RegulomeDB scores genomic variants such that a lower score means that the variant is more likely to be located in a functional region, since it has more overlaps with known and predicted functional elements, whereas a higher score means the opposite. Known and predicted functional elements include DNaseI hypersensitivity regions, transcription factor binding sites and promoter regions that have been biochemically characterized to regulate transcription. In fact, RegulomeDB is a database that annotates genomic variants with known and predicted functional elements in the intergenic regions of the human genome. The RegulomeDB database includes public datasets from GEO, the ENCODE project and published literature. However, RegulomeDB does not provide enrichment analysis of functional elements or predefined gene sets; instead, it categorizes and prioritizes the given genomic variants with its scoring system. SnpEff annotates coding and non-coding genomic variants; however, it calculates the coding effect of a variant, such as a codon change or an amino acid change, only when the variant hits an exon [17]. It performs neither functional element nor predefined gene set enrichment analysis. Variant Effect Predictor (VEP) is an Ensembl [34] API that predicts the effects of variants, such as amino acid changes and codon changes [18]. ANNOVAR annotates the given genomic variants in a gene-based manner to identify the variants that cause amino acid changes, in a region-based manner to identify variants in specific genomic regions, and in a filter-based manner to identify the variants that are filtered against pre-computed functional importance scores (such as the SIFT score) [19]. ANNOVAR aims to pinpoint functionally important genomic variants for autosomal dominant diseases.


INRICH takes SNPs, which may result from a GWAS, as input, generates LD-independent genomic intervals from these SNPs, and tests for the enrichment of predefined gene sets such as KEGG pathways, GO terms and a diverse collection of gene sets from the Molecular Signatures Database [20]. GREAT calculates the statistical significance of the elements in its annotation libraries by incorporating distal binding sites up to 1 Mb away [21]. GREAT annotates the given genomic regions for human, mouse and zebrafish using 20 ontologies. It performs a binomial test and a hypergeometric test for the statistical enrichment of annotation terms and outputs the annotation terms that are significantly associated with the given genomic regions. GAT finds the enrichment of tracks with respect to annotations by generating samplings from a workspace. Tracks are the interval sets of interest, annotations are regions of the genome with their annotations, and the workspace contains the accessible regions of the genome with which the sampled intervals have to overlap. If the tracks contain highly mappable regions of the genome, the user has to adjust for this bias by supplying the highly mappable regions of the genome in the workspace file. The same applies to other biases: the user has to know the properties of the intervals in the tracks and provide the workspace file accordingly. GLANET takes this burden from the user and handles the correction for GC content, isochore family and mappability using its offline-prepared bias files at varying resolutions.

There are various other tools, such as GEMINI [35] and Variant Tools [36]. GEMINI tries to isolate the underlying variants of a disease by annotating the genomic variations of samples [35]. GEMINI loads a VCF file of sample genotypes and annotates the variants of the samples (disease/phenotype) with its database, which takes a long time. GEMINI provides a database framework in which users can write their own SQL queries, and a Python programming interface for implementing custom code in addition to its off-the-shelf tools. Variant Tools annotates and analyzes the genomic variants of samples in order to associate variants and genes with diseases [36].

We conclude the related work regarding the first part of the thesis with tools and

methods for the assessment of enrichment analysis. To the best of our knowledge,

there is no tool or any method for assessing the performance of enrichment analysis

of the given genomic intervals with respect to other genomic interval sets.


Table 2.1: Available tools including GLANET are compared with respect to their accepted input types and annotation libraries utilized.

| Tool (Version) | SNPs | Genomic Intervals | Format | Pre-defined Gene Sets | Genes | Data Sources | Allows User Provided Annotation Libraries |
|---|---|---|---|---|---|---|---|
| RegulomeDB (v1.1) | ✓ | ✓ | dbSNP Ids, VCF, BED, GFF3 | | Gencode v7 | ENCODE, Roadmap Epigenomics, dbSNP, GEO, published literature, eQTL, dsQTL, predicted annotations, DNase footprinting, PWMs, DNA Methylation | |
| SnpEff (v4.2) | ✓ | ✓ | VCF, TXT, SAMTools Pileup Format | KEGG, GO, MSigDb, Reactome | Ensembl | ENCODE, Roadmap Epigenomics, NextProt, UCSC, Motif annotations | ✓ |
| Ensembl SNP Effect Predictor (VEP) (Ensembl release 83) | ✓ | ✓ | VCF, Pileup, HGVS notations | | RefSeq, Ensembl, Gencode | 1000 Genomes, Ensembl transcripts, Gencode and RefSeq transcripts | ✓ |
| ANNOVAR | ✓ | ✓ | VCF, GFF3 | | RefSeq, UCSC, Ensembl, Gencode, AceView | ENCODE, 1000 Genomes, dbSNP, SIFT, UCSC regions, OMIM, Exome Sequencing Project, MutationTaster, Polyphen, Complete Genomics and many other data sources | ✓ |
| FunciSNP (v1.12.0) | ✓ | | dbSNP Ids | | UCSC known genes | ...nomics, 1000 Genomes, TCGA, FAIRE sites, DNaseI hypersensitive sites | |
| HaploReg (v4.1) | ✓ | * | dbSNP Ids | | RefSeq, Gencode | ...nomics, 1000 Genomes, dbSNP, eQTL, motif instances | |
| ALIGATOR | ✓ | | dbSNP Ids | GO | Entrez | dbSNP | |
| Annotate-it (v0.4) | ✓ | | VCF | KEGG, GO, BIOCARTA, Reactome | Ensembl | 1000 Genomes, OMIM, 200 Danish Exomes, Polyphen2, SIFT, LRT, MutationTaster, Anatomical gene expression (eGenetics/SANBI dataset), HPO, EPCC and LDDB phenotype ontologies to annotate samples | |
| ENCODE ChIP-Seq Significance Tool | | | User given gene list | | Ensembl, Gencode, Entrez | ENCODE, HAVANA, HUGO Gene Nomenclature Committee | |
| PANOGA | ✓ | | dbSNP Ids | KEGG | | Protein-Protein Interaction Data | |
| FORGE (v1.1) | ✓ | | dbSNP Ids, VCF, BED | | | ...nomics, 1000 Genomes, GEO, omni genotyping arrays, GWAS SNP arrays | |
| Variant Tools (v2.7.0) | ✓ | ✓ | dbSNP Ids, VCF, BED, GFF3, CSV, Plink | KEGG | | ENCODE, 1000 Genomes, dbSNP, Exome Sequencing Project, dbNSFP, UCSC, HapMap project, GWAS catalog | ✓ |
| GEMINI (v0.18.2) | ✓ | ✓ | VCF | KEGG | | ENCODE, 1000 Genomes, dbSNP, ClinVar, UCSC, OMIM, HPRD, Exome Sequencing Project | ✓ |
| GREAT (v3.0.0) | | ✓ | BED | GO, MSigDb, Panther, BioCyc | Ensembl genes | 20 ontologies including disease ontologies, phenotype ontologies, miRNA motifs, miRNA targets | |
| INRICH (v1.1) | ✓ | ✓ | dbSNP Ids | KEGG, GO, MSigDb | Entrez | | |
| GAT (v1.2.2) | | ✓ | BED | | | | ✓ |
| GLANET (v1.0) | ✓ | ✓ | dbSNP Ids, BED, GFF3, narrowPeak | KEGG, GO | RefSeq | ENCODE | ✓ |

* Only accepts one single region.



statistical tests carried out and enrichment options provided.

| Tool (Version) | Statistical Model or Test | Takes into account Genomic Biases | Correction for Multiple Hypothesis Testing |
|---|---|---|---|
| RegulomeDB (v1.1) | | | |
| SnpEff (v4.2) | | | |
| Ensembl SNP Effect Predictor (VEP) (Ensembl release 83) | | | |
| ANNOVAR | | | |
| FunciSNP (v1.12.0) | | LD | |
| HaploReg (v4.1) | Binomial Test | LD | |
| ALIGATOR | Permutation Approach | LD | Bootstrap Approach |
| Annotate-it (v0.4) | Filter-based Approach, Weighted sum Approach, Gamma-based approximation for the null distribution of weighted sum statistic | | |
| ENCODE ChIP-Seq Significance Tool | One-tailed Hypergeometric Test | | Benjamini-Hochberg FDR |
| PANOGA | Two-sided test based on the hypergeometric distribution | LD | Bonferroni Correction |
| FORGE (v1.1) | Background Distribution, Z-score | LD, GC, minor allele frequency (maf) and distance to the nearest transcription start site (TSS) | Bonferroni Correction |
| Variant Tools (v2.7.0) | Fisher's Exact Test for Single Variant Analysis, Single gene rare variant tests, Conditional rare variants analysis, etc. | | |
| GEMINI (v0.18.2) | Built-in analyses such as find de novo mutations, find compound heterozygotes and so on | | |
| GREAT (v3.0.0) | Binomial Test, Hypergeometric Test | | Bonferroni Correction, ... |
| INRICH (v1.1) | Permutation Approach | LD | Bootstrap Approach |
| GAT (v1.2.2) | Sampling method | Chromosome Identity, GC and Mappability (Not tailored for each given interval) | Storey's q-value, Benjamini-Hochberg FDR |
| GLANET (v1.0) | Sampling Based Approach, Z-score | GC, Mappability, Isochore Family, Interval Length, Interval Chromosome | Bonferroni Correction, ... |


enrichment analysis.

| Tool (Version) | Provides Enrichment | DNA Regulatory Elements | Predefined gene sets | Others | User Interface |
|---|---|---|---|---|---|
| RegulomeDB (v1.1) | | | | | Web |
| SnpEff (v4.2) | | | | | Command Line |
| Ensembl SNP Effect Predictor (VEP) (Ensembl release 83) | | | | | Web |
| ANNOVAR | | | | | Command Line |
| FunciSNP (v1.12.0) | | | | | Command Line |
| HaploReg (v4.1) | ✓ | DNaseI hypersensitive sites, Enhancer | | | Web |
| ALIGATOR | ✓ | | GO | | Command Line |
| Annotate-it (v0.4) | ✓ | | | Annotate-it provides candidate gene lists, aggregate functionality scores, phenotype-specific gene prioritization, and statistical methods for disease-gene finding in case/control studies | Web |
| ENCODE ChIP-Seq Significance Tool | ✓ | Transcription Factors | | | Web |
| PANOGA | ✓ | | | Identifies sub-networks within protein-protein interaction networks | Web |
| FORGE (v1.1) | ✓ | DNaseI hotspots | | Cell Type Specific Enrichment | Web |
| Variant Tools (v2.7.0) | ✓ | | | Use more than 20 association analysis methods to associate variants and genes with qualitative or quantitative traits | Command Line |
| GEMINI (v0.18.2) | | | | Enables users to write their own SQL queries | Command Line |
| GREAT (v3.0.0) | ✓ | Transcription Factors | GO, MSigDb, Panther, BioCyc | Gene Expression Data, Regulatory Motifs, Gene Families | Web |
| INRICH (v1.1) | ✓ | | KEGG, GO, MSigDb | | GUI, Command Line |
| GAT (v1.2.2) | ✓ | | | | Command Line |
| GLANET (v1.0) | ✓ | DNaseI hypersensitive sites, Histone Modifications, Transcription Factors | KEGG, GO | User defined gene-set enrichment, user defined library enrichment | GUI, Command Line |

2.3 Related Work Regarding Thesis Part 2

Concerning the second part of the thesis, there are some existing tools that perform interval intersection [37, 38, 39] and other genomic analyses. The UCSC Genome Browser has been evolving continuously since its first launch. Recently, the Data Integrator feature was released in the UCSC Genome Browser; it allows users to combine and extract data from multiple tracks (up to 5 tracks) simultaneously [37]. BEDTools was developed for the comparison, manipulation and annotation of genomic features in BAM, BED, GFF and VCF formats [38]. BEDOPS is a highly scalable and easily parallelizable genome analysis toolkit, which enables tasks to be split by chromosome for distributing whole-genome analyses across a computational cluster [39]. NCList defines its own dedicated data structure for interval databases [40]. Tabix indexes tab-delimited files and converts a sequential access file into a random access file [41]. Layer et al. propose a novel parallel "slice-then-sweep" algorithm for n-way interval set intersection, with the restriction that the intervals in the given data sets are non-containing [42].


CHAPTER 3

ANNOTATION OF GENOMIC LOCI

Annotation is the process of finding overlapping intervals between the user query and the intervals stored in GLANET's annotation library. However, users are not restricted to GLANET's library: GLANET allows users to expand the annotation library with their own user-defined gene sets and genomic intervals.

In this chapter, we define the user query and the annotation library, and describe how the library is represented using interval trees. Figure 3.1a provides an overview of the workflow and capabilities of GLANET. We describe the individual components in more detail below.

3.1 User Query

Users can query SNPs or genomic intervals of varying length for annotation and/or enrichment analysis. GLANET supports commonly used input formats such as BED, narrowPeak, GFF3, 0-based or 1-based coordinates, and reference SNP (RS) identifiers for SNPs. Overlapping genomic intervals in the query are merged prior to analysis to avoid inducing dependencies among the query intervals.
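The merging step can be illustrated with a minimal Python sketch. It is not GLANET's own code: the function name `merge_overlapping`, the tuple layout and the use of inclusive end coordinates are assumptions made only for this example.

```python
from typing import List, Tuple

# (chromosome, start, end) with an inclusive end coordinate (assumed convention)
Interval = Tuple[str, int, int]

def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge query intervals that share at least one base on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previously kept interval: extend it instead of adding a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged

query = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 100, 200)]
merged_query = merge_overlapping(query)
```

Sorting first means each interval only has to be compared with the most recently kept one, so duplicates and chains of overlapping intervals collapse in a single pass.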

3.2 GLANET Annotation Library

Figure 3.1: (a) Overall functionality of GLANET. (b) Gene-centric genomic intervals are defined based on commonly used location analyses in ChIP-seq and related studies [43]. GLANET uses these intervals to provide detailed annotation of user query with respect to known genes.
[Panel (a) components: Input (list of genomic intervals such as SNPs, insertions, deletions, ChIP-seq and BS-seq peaks), Preprocess (remove duplicates and merge overlapping intervals), GLANET Annotation Library (genomic elements, gene sets, user defined gene sets, user defined genomic elements), Genomic Biases (pre-computed GC content and mappability), Annotation, Enrichment, and Regulatory Sequence Analysis for SNPs, each with its output. Panel (b) layout, 5' to 3': 5d, 5p2, 5p1, exons and introns, 3p1, 3p2, 3d, with boundaries at 2kb, 10kb and 100kb upstream and downstream.]

The GLANET annotation library contains lists of annotated genomic regions from the literature. We refer to these as GLANET elements, or genomic elements. Each of these elements is represented by a set of genomic intervals. The default GLANET annotation library consists of the following genomic elements:

1. Non-coding regulatory annotations: Regulatory elements encompass non-coding

regions such as DNaseI hypersensitive sites (DHSs), transcription factor bind-

ing and histone modification regions across multiple cell types from the EN-

CODE project. Each element represents a set of genomic intervals that are

identified as peaks by the ENCODE project in a biochemical high through-

put assay. For example, STAT1_K562 represents genomic intervals bound by

transcription factor STAT1 in K562 cells.

2. Gene-centric elements: Gene-centric elements are defined for each gene and
are based on exons, introns, and six different regulatory regions that are either
proximal or distal to each RefSeq gene. We adopt the nomenclature from commonly
used location analysis [43] and define 5p1, 5p2, and 5d as the regions
0 to 2kb, 2kb to 10kb, and 10kb to 100kb upstream of the first exon of the gene,
respectively. Similarly, we define 3p1, 3p2, and 3d as the regions 0 to 2kb, 2kb
to 10kb, and 10kb to 100kb downstream of the last exon of the gene, respectively
(Figure 3.1b). These gene-centric elements enable users to annotate their input
query with respect to known genes and, more importantly, the non-coding regions
around them. These regions are further incorporated into pathway and gene set
enrichment analysis.

3. Functional gene sets: The input set of genomic intervals can also be queried

against pre-defined gene sets. GLANET includes gene sets derived from KEGG

pathways and GO Terms as its default functional gene sets. GLANET further

defines three classes of gene set elements as exon-based, regulation-based, and

all-based. Exon-based gene set elements include exons of the genes in each

individual gene set. In contrast, regulation-based gene set elements consist of

introns and the four different proximal noncoding regions, namely 5p1, 5p2,

3p1, and 3p2, of genes in each gene set. The third category, all-based gene

set elements, consists of exons, introns, and all six proximal and distal regions

of genes in each gene set. These three modes allow users to not only assess

enrichment of an input query with respect to exonic regions or full length of

genes but also enable regulation-centric enrichment analysis.

4. User-defined annotations: An important feature of GLANET is that users can

expand the GLANET annotation library with new genomic elements, i.e., ge-

nomic intervals or gene sets, and query against this extended library. This op-

tion broadens the applicability of GLANET to various settings. For example,

it enables investigating the input set against an in-house generated ChIP-seq

data analysis, or against gene sets derived from gene expression data analysis,

si/shRNA gene lists, or other functional assays. We present an example ap-

plication in Section 7.2.2, where we consider GATA2 bound regions in K562

cells as input query and utilize gene sets derived from GO term annotations as

user-defined annotations.
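The six proximal and distal regions in item 2 can be derived from the gene boundaries as in the following Python sketch. The 1-based inclusive coordinates, the strand handling (upstream of a minus-strand gene lies at higher coordinates) and the function name `gene_centric_regions` are assumptions for illustration; the text above fixes only the distance bands.

```python
from typing import Dict, Tuple

# Distance bands in bases from the gene boundary: (nearest base, farthest base)
BANDS = {"p1": (1, 2_000), "p2": (2_001, 10_000), "d": (10_001, 100_000)}

def gene_centric_regions(gene_start: int, gene_end: int, strand: str) -> Dict[str, Tuple[int, int]]:
    """Return the 5p1, 5p2, 5d, 3p1, 3p2 and 3d regions of one gene.

    gene_start is the lowest coordinate of the gene and gene_end the highest
    (1-based, inclusive; an assumed convention).
    """
    regions: Dict[str, Tuple[int, int]] = {}
    for suffix, (near, far) in BANDS.items():
        lower_side = (max(1, gene_start - far), gene_start - near)  # toward the chromosome start
        higher_side = (gene_end + near, gene_end + far)             # toward the chromosome end
        if strand == "+":
            upstream, downstream = lower_side, higher_side
        else:
            upstream, downstream = higher_side, lower_side
        regions["5" + suffix] = upstream
        regions["3" + suffix] = downstream
    return regions

plus_gene = gene_centric_regions(200_000, 250_000, "+")
minus_gene = gene_centric_regions(200_000, 250_000, "-")
```

For genes closer than 100kb to the chromosome start, the sketch clips the lower bound at 1; clipping at the chromosome end would need chromosome sizes, which are left out here.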

Figure 3.2: Genomic intervals are represented in interval trees [44]. A separate interval tree is constructed for each chromosome and genomic element type, e.g. for transcription factor binding annotations. Each node contains the low and high endpoints of the genomic interval, the color of the node (red or black), the maximum high endpoint stored in the subtree rooted at this node and the genomic elements annotated with this particular genomic interval.

We provide details on data sources in Table A.1 of Appendix A.

3.3 Library Representation

A genomic interval is a continuous stretch of the genome with chromosomal start and end coordinates denoted by [t1, t2] with t1 ≤ t2, where t1 is the low endpoint and t2 is the high endpoint of the interval. Each genomic element in the GLANET library is defined by a set of such genomic intervals. For example, in exon-based analysis, a gene is represented by the set of genomic intervals of its exons. Similarly, a transcription factor's binding regions or histone modification sites are represented by a set of genomic intervals that correspond to ChIP-Seq peaks. GLANET stores these genomic intervals in interval trees (Figure 3.2).


An interval tree is a red-black tree in which each node x stores the low and high end-

points, t1 and t2, of an interval and an integer value max which is the maximum high

endpoint stored in the subtree rooted at this node x [44]. On each node of the tree, we

also store the genomic annotations associated with the interval stored on that node.

For each element type in the annotation library, e.g., genomic elements representing

all transcription factor binding regions across all cell lines, chromosome-specific in-

terval trees are constructed (Figure 3.2). Then, for annotation and enrichment analysis

the appropriate interval trees are searched for query intervals using the interval tree

search algorithm as described in [44].
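The role of the max field during search can be seen in the following Python sketch. To stay short, it swaps the red-black tree for a static balanced binary search tree built once from sorted intervals; the augmentation and the pruning rule are the same, but the insertion and rebalancing logic of [44] is not shown. The names `Node`, `build` and `find_overlaps` and the element labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    low: int                      # low endpoint t1
    high: int                     # high endpoint t2
    element: str                  # annotation stored with the interval
    max_high: int                 # maximum high endpoint in the subtree rooted here
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(intervals: List[Tuple[int, int, str]]) -> Optional[Node]:
    """Build a balanced tree from intervals sorted by low endpoint."""
    if not intervals:
        return None
    mid = len(intervals) // 2
    low, high, element = intervals[mid]
    node = Node(low, high, element, max_high=high)
    node.left = build(intervals[:mid])
    node.right = build(intervals[mid + 1:])
    for child in (node.left, node.right):
        if child is not None:
            node.max_high = max(node.max_high, child.max_high)
    return node

def find_overlaps(node: Optional[Node], q_low: int, q_high: int,
                  hits: Optional[List[str]] = None) -> List[str]:
    """Collect the elements of all intervals overlapping [q_low, q_high]."""
    if hits is None:
        hits = []
    # Prune: nothing in this subtree ends at or after the query start.
    if node is None or node.max_high < q_low:
        return hits
    find_overlaps(node.left, q_low, q_high, hits)
    if node.low <= q_high and q_low <= node.high:
        hits.append(node.element)
    # Every interval in the right subtree starts at or after node.low.
    if node.low <= q_high:
        find_overlaps(node.right, q_low, q_high, hits)
    return hits

library = sorted([(30, 40, "elem_D"), (5, 10, "elem_A"), (15, 18, "elem_C"), (8, 20, "elem_B")])
tree = build(library)
overlapping = find_overlaps(tree, 9, 16)
```

In GLANET's setting, one such tree would exist per chromosome and element type, and each query interval would be searched in the trees of its own chromosome.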

GLANET annotation overlaps each genomic interval in the input set with genomic

elements in its annotation library and provides the following options for quantifying

the overlap:

1. Existence of overlap (EOO): This option simply evaluates whether a given input
interval intersects at least 1 base pair (bp) with any of the intervals of a genomic
element in the annotation library. GLANET provides flexibility in the overlap
definition: by default, an intersection of at least a single base is considered
an overlap, but GLANET also allows users to provide a higher threshold for
defining overlap. Finally, the fraction of intervals overlapping each genomic
element is reported as the query-level association statistic for each genomic
element in the annotation library.

2. Number of overlapping bases (NOOB): In order to take into account the size
of the intersection between a given input interval and the intervals of a genomic
element, NOOB uses the actual number of overlapping bases. The
total number of overlapping bases across all the input intervals is reported as
the query-level association statistic for each element in the annotation library.
In this calculation each overlapping base is counted only once.
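The two association statistics can be sketched in Python as follows. This is a simplified illustration on a single chromosome with inclusive coordinates and a brute-force comparison in place of the interval tree search; the function names `eoo` and `noob` and the `min_overlap` parameter are assumptions. The query intervals are assumed to be already merged.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (low, high), inclusive, on one chromosome

def overlap_bases(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def eoo(query: List[Interval], element: List[Interval], min_overlap: int = 1) -> float:
    """Fraction of query intervals overlapping the element by at least min_overlap bases."""
    hits = sum(1 for q in query
               if any(overlap_bases(q, e) >= min_overlap for e in element))
    return hits / len(query)

def noob(query: List[Interval], element: List[Interval]) -> int:
    """Total number of query bases covered by the element, each base counted once."""
    total = 0
    for q in query:
        # Clip every overlapping element interval to the query interval.
        clipped = sorted((max(q[0], e[0]), min(q[1], e[1]))
                         for e in element if overlap_bases(q, e) > 0)
        covered_until = q[0] - 1
        for low, high in clipped:
            low = max(low, covered_until + 1)  # skip bases already counted
            if low <= high:
                total += high - low + 1
                covered_until = high
    return total

query = [(100, 200), (300, 400), (600, 700)]
element = [(150, 250), (180, 190), (390, 500)]
eoo_default = eoo(query, element)
eoo_strict = eoo(query, element, min_overlap=20)
noob_total = noob(query, element)
```

In this toy input the element intervals (150, 250) and (180, 190) overlap each other, so the sweep over the clipped intervals is what keeps a base from being counted twice.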

3.4 Interval Tree

The interval tree is a well-known and widely used space-partitioning tree. We adopted the interval tree implementation provided in [44]. Its space complexity is O(n); construction and query require O(n log n) and O(min(n, k log n)) time, respectively, for n given intervals and k hits.

3.5 Time and Space Complexity of Annotation

Annotation is performed by searching for each query interval in the interval tree. The time complexity of a query search in an interval tree is O(min(n, k log n)), where n is the number of all genomic intervals in the interval tree (number of nodes) and k is the number of genomic intervals overlapping the query interval. Typically, k log n is smaller than n. For m query intervals, the time complexity of annotation is O(m · min(n, k log n)).

We construct chromosome-based interval trees for each element type, namely, DNaseI hypersensitive sites (DHSs), Transcription Factors (TFs), Histone Modifications (HMs) and RefSeq Genes. The space complexity of each interval tree is O(n), where n is the number of intervals stored in the interval tree.


CHAPTER 4

REGULATORY SEQUENCE ANALYSIS OF SINGLE

NUCLEOTIDE POLYMORPHISMS

GLANET provides regulatory sequence analysis (RSA) when the user query consists of Single Nucleotide Polymorphisms (SNPs) only. For each input SNP, GLANET finds the overlapping transcription factors (TFs), and for each TF, GLANET gathers its position frequency matrix (PFM). Next, GLANET retrieves the reference, altered and extended reference DNA sequences centered at the SNP position and checks whether the SNP increases or decreases the binding affinity of the TF with respect to the TF's PFM. In this chapter, we describe the steps of our RSA and conclude the chapter with a use case of GLANET, which is RSA for Obsessive Compulsive Disorder (OCD) GWAS SNPs.

4.1 Regulatory Sequence Analysis

GLANET provides a detailed regulatory sequence analysis for SNP input queries. This analysis takes advantage of the available ENCODE transcription factor binding regions in the default GLANET annotation library. GLANET first finds the transcription factors in whose binding regions the SNP resides. Then, the locations of the SNPs residing in a TF binding region are evaluated for overlap with a significant motif match using the position frequency matrices (PFMs) of the corresponding TFs. This evaluation is carried out with both the reference and the SNP alleles. Specifically, for evaluating a single SNP with respect to one PFM, GLANET retrieves the DNA subsequence of the reference genome within a 41 bp window centered at the SNP locus. It then assesses, with the RSAT tool [45], whether this subsequence provides a significant match to the PFM with either the reference or the SNP allele. Both Jaspar Core [46] and Encode motifs [47] are utilized as part of GLANET's PFM library.

An overview of the regulatory sequence analysis can be found in Figure 4.1. GLANET performs regulatory sequence analysis in three main steps:

1. In the first step, SNP and TF pairs for which SNP resides in the binding region

of the TF are found. This is accomplished by overlapping the positions of the

SNPs with transcription factor binding sites provided in the annotation library.

2. In the second step, GLANET generates three subsequences around the SNP

site: reference, SNP and extended sequences. These sequences are used to

statistically assess whether the SNP can alter the transcription factor binding.

Reference and SNP sequences are 41 bps long and they are created by taking

±20 bps upstream and downstream sequences around SNP locus. Extended

reference sequence is a 401 bp region centered at SNP locus and is used to

check if the SNP site is actually the most likely binding site in the vicinity of

the SNP.

3. In the third step, GLANET scans each of the sequences (Reference, SNP,
Extended) for a matching motif site and evaluates the statistical
significance of the match using RSAT [45]. For this, the position frequency
matrices (PFMs) for the annotated TFs are obtained from Jaspar Core and Encode
motifs [46, 47]. This step results in three p-values: $p_{ref}$, $p_{snp}$ and
$p_{extended}$. The smaller the p-value, the better the match is.
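The construction of the three sequences in the second step can be sketched in Python as below. The chromosome is passed as a plain string and the SNP position as a 0-based index, both assumptions for illustration; the function name `snp_sequences` is hypothetical, and the motif scanning of the third step (done with RSAT in GLANET) is not reproduced.

```python
from typing import Tuple

def snp_sequences(chrom_seq: str, snp_index: int, snp_allele: str,
                  flank: int = 20, extended_flank: int = 200) -> Tuple[str, str, str]:
    """Return the (reference, SNP-altered, extended reference) sequences around a SNP.

    snp_index is a 0-based index into chrom_seq (assumed convention).
    """
    start = max(0, snp_index - flank)
    offset = snp_index - start  # position of the SNP inside the short window
    reference = chrom_seq[start:snp_index + flank + 1]
    altered = reference[:offset] + snp_allele + reference[offset + 1:]
    ext_start = max(0, snp_index - extended_flank)
    extended = chrom_seq[ext_start:snp_index + extended_flank + 1]
    return reference, altered, extended

toy_chromosome = "ACGT" * 150  # synthetic sequence, for illustration only
reference, altered, extended = snp_sequences(toy_chromosome, 300, "C")
```

With the default flanks, the reference and altered sequences are 41 bp long and differ only at the central base, while the extended sequence spans 401 bp, matching the window sizes given in the second step.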

In this scenario, we only consider the cases where the SNP location is found to be the best matching site within the peak; that is, we only consider cases where $p_{extended}$ is not smaller than the minimum of $p_{snp}$ and $p_{ref}$.

Figure 4.1: Three main steps of regulatory sequence analysis in GLANET.
[Panel titles: Step 1, find SNP and transcription factor (TF) pairs where the SNP falls into the TF's binding site. Step 2, for each SNP in the list, create three subsequences around the SNP locus: the reference and SNP-altered sequences include 20 nucleotides downstream and upstream of the SNP locus, and the extended sequence is retrieved from the reference genome within a 401 bp window centered at the SNP locus. Step 3, scan each sequence with the TF's position frequency matrices using RSAT, then compare the p-values to determine SNPs potentially affecting TF motif sites.]

Let $p_{ref}$ and $p_{snp}$ denote the p-values of the motif matches with the reference and SNP alleles, respectively. Since we precondition our analysis on the fact that the SNP overlaps a TF binding region, we also evaluate whether the region harbors a motif match to the PFM that does not overlap the SNP location. Let $p_{extended}$ denote the p-value of such a match. If $p_{extended}$ is smaller than $p_{snp}$, GLANET filters the SNP out in the post-analysis, as the binding region has a better motif match that does not overlap the SNP location. If the SNP location is the best place for the motif to match, GLANET compares $p_{snp}$ and $p_{ref}$. If $p_{snp}$ is larger than $p_{ref}$, the SNP has a potentially disrupting effect: it decreases the binding affinity of the TF. If the converse holds, GLANET suggests that the SNP creates a sequence motif that is more favorably recognized by the TF. In other words, the SNP has a potentially enhancing effect, which increases the binding affinity of the TF.
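The post-analysis decision can be summarized in a short Python sketch. The filtering condition used here is the one stated earlier in this section (keep a SNP only when the extended p-value is not smaller than the minimum of the SNP and reference p-values); this choice, the function name `classify_snp` and the returned labels are assumptions for illustration, not GLANET's actual output format.

```python
def classify_snp(p_ref: float, p_snp: float, p_extended: float) -> str:
    """Classify the potential effect of a SNP on TF binding from three motif-match p-values."""
    # A better motif match elsewhere in the binding region: the SNP site is
    # not the most likely binding site, so the SNP is filtered out.
    if p_extended < min(p_snp, p_ref):
        return "filtered"
    if p_snp > p_ref:
        return "disrupting"  # the match worsens with the SNP allele
    if p_snp < p_ref:
        return "enhancing"   # the match improves with the SNP allele
    return "no change"

# p_ref and p_snp follow the magnitudes reported for the OCD SNPs in Section 4.2;
# the p_extended values are hypothetical.
effect_a = classify_snp(p_ref=1.1e-3, p_snp=6.1e-5, p_extended=6.1e-5)
effect_b = classify_snp(p_ref=6.1e-5, p_snp=1.5e-3, p_extended=6.1e-5)
effect_c = classify_snp(p_ref=1.0e-3, p_snp=1.0e-4, p_extended=1.0e-6)
```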


4.2 GLANET Use Case: Regulatory Sequence Analysis of OCD GWAS SNPs

Following up OCD SNPs with GLANET regulatory sequence analysis revealed that some of these SNPs might be affecting TF binding. For example, SNP rs1891215 resides within a STAT1 binding region and has a match to the STAT1 PFM with a $p_{ref}$ of 1.1e-3. As the SNP changes the allele from A to G, it generates a better STAT1 binding site with a $p_{snp}$ of 6.1e-5 (Figure 4.2a). In contrast, the SNP rs10946279 resides within a MAX binding region. This location has a match to the MAX PFM with a $p_{ref}$ of 6.1e-5; however, the SNP alters the match ($p_{snp}$ = 1.5e-3), potentially disrupting the binding site (Figure 4.2b). All regulatory sequence post-analysis results of OCD SNPs are available in Supp. Table S20 under http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/.

Figure 4.2: GLANET regulatory sequence analysis for the OCD SNPs annotated with TFs in the library. (a) SNP rs1891215 located at chr1:7,667,794 changes reference nucleotide A to G, and as a result, leads to a better match to the STAT1 PFM, i.e., the p-value of the match to the STAT1 PFM changes from 1.1e-3 to 6.1e-5. (b) SNP rs10946279 (chr6:170,553,248) changes reference nucleotide C to T, thereby decreasing the significance of the match to the MAX PFM, i.e., the p-value of the match increases from 6.1e-5 to 1.5e-3.


CHAPTER 5

ENRICHMENT ANALYSIS OF GENOMIC REGIONS

Enrichment analysis enables identifying one or more common functional themes in the input query set by assessing the statistical significance of the overlaps between the user query and the intervals of elements stored in GLANET's annotation library. For this purpose, GLANET employs sampling-based enrichment analysis, which requires generating a random interval for each user input query interval in each sampling. By default, the random interval matches the chromosome and length of the query interval; on request, it also matches GC content, mappability and isochore family, jointly or separately. GLANET also allows users to expand its default annotation library by providing their own user-defined gene sets and library. Furthermore, GLANET offers joint enrichment analysis for transcription factor and KEGG pathway pairs. We explain all of these in detail in this chapter.

5.1 Enrichment Analysis

To evaluate the statistical significance of the overlaps, GLANET calculates the observed and expected test statistics using one of the association statistic options listed in Table 5.1. GLANET computes the observed test statistic for each member element of the selected element type by finding the overlaps between the input query and the genomic intervals of that element in the annotation library. To calculate the expected test statistics, GLANET estimates empirical null distributions by randomly sampling intervals that match the characteristics of the input query intervals. We use a resampling-based approach to obtain the empirical null distribution of the test statistic. We collect the test statistics of $B$ samplings, each with $n$ randomly generated genomic intervals, where $n$ is the number of input intervals in the query. The $b$th sampling is represented by the randomly generated genomic intervals $S_b = \{s_{b1}, s_{b2}, \ldots, s_{bn}\}$, $\forall b \in \{1, \ldots, B\}$, that match the properties of the given genomic intervals. The collection of overlap statistics across the random samplings is then used to estimate an empirical null distribution for the overlap statistic and to calculate an empirical p-value:

$$p = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}(k_b \geq k)$$

Here $k$ denotes the observed test statistic and $k_b$ is the overlap statistic of the randomly generated genomic intervals $S_b$ from the $b$th sampling. The indicator function $\mathbb{1}(\cdot)$ returns 1 when the inequality holds and 0 otherwise. Multiple testing correction, which accounts for the large number of genomic elements tested, is available with two options: the Bonferroni procedure [48] for controlling the family-wise error rate and the Benjamini-Hochberg procedure [49] for controlling the false discovery rate.
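
The empirical p-value and the two correction options can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of GLANET; the sketch simply applies the formula above and the textbook forms of the two procedures.

```python
from typing import List, Sequence


def empirical_p_value(observed: float, null_stats: Sequence[float]) -> float:
    """Fraction of the B samplings whose statistic k_b is at least the observed k."""
    if not null_stats:
        raise ValueError("at least one sampling is required")
    return sum(1 for k_b in null_stats if k_b >= observed) / len(null_stats)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With $B$ samplings the smallest non-zero p-value this estimator can return is $1/B$, so the number of samplings bounds the resolution available to the multiple testing correction.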

The key part of estimating the empirical null distribution of the enrichment test is the random interval sampling step. The random intervals are generated such that they match the properties of each member of the input interval set, as opposed to the average properties of these intervals. The matched properties of each input interval are its chromosome, length, GC content, mappability and isochore family. Among these properties, GC content and mappability are systematic biases introduced by NGS technologies. In other words, these technologies restrict the genomic regions that can contribute to the resulting intervals. To verify the presence of GC content and mappability biases in the intervals of the ENCODE-derived DNA elements in our annotation library, we evaluated the GC content and mappability values of all intervals for each ENCODE file. We sorted the ENCODE files with respect to their mean GC content and mappability values in ascending order and selected ten files that almost equally separate the sorted mean values, to show how GC content and mappability vary for DNaseI hypersensitive sites (DHSs), transcription factor binding sites (TFBSs) and histone modifications (HMs). Figures 5.1a, 5.1c and 5.1e show the box plots of the GC content of the intervals of ten different ENCODE files sorted with respect to their mean GC content; the GC content values vary mostly between 0.4 and 0.6. Figures 5.1b, 5.1d and 5.1f show the box plots of the mappability values of the intervals of ten different ENCODE files sorted with respect to their mean mappability values; the mean mappability values vary mostly between 0.8 and 1.0. We can conclude that intervals obtained from ENCODE data sets tend to be highly mappable with average GC content.

Figure 5.1: Box plots of GC content and mappability values for ten different ENCODE files, for each element type. (a) GC contents for DHSs. (b) Mappability values for DHSs. (c) GC contents for HMs. (d) Mappability values for HMs. (e) GC contents for TFBSs. (f) Mappability values for TFBSs.

Table 5.1: GLANET main parameters for enrichment test.

Association Statistic Options
  EOO      Overlap statistic is 1 or 0 based on whether the input interval overlaps with any of the genomic element intervals or not.
  NOOB     Overlap test statistic is the exact number of overlapping bases between the input interval and the genomic element intervals.

Random Interval Generation Matching Options
  wGC      For an input interval, randomly sample an interval with the same length from the same chromosome such that it matches the GC content of the query interval.
  wM       Randomly sample an interval with the same length from the same chromosome such that it matches the mappability of the query interval.
  wGCM     Randomly sample an interval with the same length from the same chromosome such that it matches both the mappability and the GC content of the query interval.
  woGCM    Randomly sample an interval with the same length from the same chromosome.

Random Interval Generation Start Options
  wIF      Starts the random interval search within the same chromosome with a matching GC isochore family. When the GC option is on, it provides a good start for GC matching. When the GC option is not selected, it provides coarse-grained GC matching.
  woIF     Starts the random interval search at a random position within the chromosome.

GLANET provides flexibility in which properties to consider. Users can account for GC content and mappability bias jointly or separately, or choose not to match either of these properties. The availability of these modes provides flexibility for cases in which the input genomic intervals are generated by different technologies. In matching the GC content, genomic intervals are matched with varying resolution depending on the length of the given genomic intervals, i.e., the shorter the genomic interval, the more precise the GC content matching is. A detailed description of the GC and mappability matching procedure is available in Algorithm 5.1.

GLANET also offers an Isochore Family (IF) option in matching GC. The genome is divided into five types of regions that are characterized by similar GC content composition. These regions are called isochores and are named L1, L2, H1, H2 and H3 in accordance with increasing GC levels, < 38%, 38–42%, 42–47%, 47–52% and > 52%, respectively, as defined in [50, 51]. Each chromosome is divided into 100,000 bps long intervals and each such interval is tagged with its isochore family. When the with Isochore Family (wIF) option is selected, the input interval’s isochore family is calculated first, and a random 100,000 bps long interval is selected from the matching isochore family pool of that chromosome. Subsequently, a random interval of the input interval’s length is sampled from this 100,000 bps long interval. If the GC and/or mappability option is also selected, random intervals are sampled repeatedly until one is found whose GC content and/or mappability, depending on the selected mode, is within a preset threshold of the input interval’s value. When the GC option is selected, wIF provides a good starting point for GC matching; when it is not selected, wIF provides a very coarse-grained matching of GC.
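
The isochore tagging step can be sketched in Python as follows. This is an illustrative sketch, not GLANET’s implementation: the function names are hypothetical, the assignment of the exact boundary values (for example, 38% to L2 and 52% to H2) is an assumption, and GC content is computed here over all bases of a window, including any ambiguous bases.

```python
from typing import List


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (assumption: all bases counted)."""
    if not sequence:
        raise ValueError("empty sequence")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def isochore_family(gc: float) -> str:
    """Map a GC fraction in [0, 1] to L1, L2, H1, H2 or H3."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("GC fraction must be between 0 and 1")
    if gc < 0.38:
        return "L1"
    if gc < 0.42:
        return "L2"
    if gc < 0.47:
        return "H1"
    if gc <= 0.52:
        return "H2"
    return "H3"


def tag_windows(chromosome: str, window: int = 100_000) -> List[str]:
    """Tag consecutive fixed-size windows of a chromosome with their isochore family."""
    return [
        isochore_family(gc_fraction(chromosome[i:i + window]))
        for i in range(0, len(chromosome), window)
    ]
```

Grouping the window indices by their tag gives, for each chromosome, the per-family pools from which a 100,000 bps starting window can be drawn.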

The different options for the enrichment test are summarized in Table 5.1.

5.2 Random Interval Sampling Procedure

To perform enrichment analysis, GLANET generates a null distribution of the test statistic by first sampling random intervals and then calculating the overlap of these intervals with the annotation library element intervals. The random intervals are generated such that they match the properties of each member of the input interval set, as opposed to the average properties of these intervals. The algorithm for generating random intervals is outlined in Algorithm 5.1. Note that we omit the threshold relaxation steps from the algorithm for the sake of clarity. Here we provide the details of this random interval generation scheme.

The input interval set may contain overlapping intervals. In such cases, GLANET preprocesses the input by merging overlapping intervals into a single interval to avoid dependency among them. Similarly, the random intervals for an input interval set are always selected such that they do not overlap. GLANET provides four main parameters for random interval generation: with GC (wGC), with Mappability (wM), with GC and Mappability (wGCM), and without GC and Mappability (woGCM). GLANET random interval generation can also be run without Isochore Family (woIF) or with Isochore Family (wIF). Regardless of which option is selected, for each input interval a corresponding random interval of the same length is sampled from the same chromosome. When the given interval’s length is greater than 100,000 bps, GLANET does not account for GC content and/or mappability in generating its random interval, even if one of these options (wGC, wM, wGCM) is on, since GC content and mappability values are not meaningful for very large intervals. When the wGC, wM or wGCM option is selected, GLANET matches, in addition to the length and chromosome of the given interval, its GC content, its mappability, or both, respectively, as follows:

• GC Option or Mappability Option Selected: If either the wGC or the wM option is selected, GLANET tries to match the GC content or the mappability value of the given interval. The same procedure applies for matching GC and mappability values. GLANET first generates a random interval and calculates its GC content or mappability, depending on which option is selected. This random interval is accepted if its value is within a pre-defined threshold of the corresponding value of the input interval. Otherwise, GLANET generates new random intervals until an acceptable one is obtained. If, after a certain number of attempts, no random interval within the threshold distance to the GC content or mappability of the input interval can be found, the threshold for an acceptable match is increased by a small increment. If, after a certain number of further trials, relaxing this threshold does not help, GLANET chooses the random interval with the minimum difference in GC content or mappability up to that point.

• GC and Mappability Option Selected: If the wGCM option is on, GLANET selects a random interval whose GC content and mappability values are both close to those of the input interval. A random interval is considered acceptable if its GC content and mappability values are within a pre-defined distance of the input interval’s values. If the random interval’s values do not match, new intervals are sampled until an acceptable one is obtained. If, after a certain number of attempts, no random interval within the threshold distance to the GC content and mappability of the input interval can be generated, the threshold for an acceptable match is increased by a small increment. If relaxing this threshold does not help, GLANET chooses the random interval with the minimum sum of the differences in GC content and mappability up to that point.
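
The accept, relax and fall-back logic described in the two cases above can be sketched for a single property as follows. This is a minimal sketch under stated assumptions: the function name, the threshold, increment and trial-count defaults are illustrative placeholders and not GLANET’s actual settings, `value_of` stands for a GC content or mappability calculation, and the check against overlaps with previously generated intervals is omitted.

```python
import random
from typing import Callable, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def sample_matched_interval(
    chrom_length: int,
    length: int,
    target: float,
    value_of: Callable[[Interval], float],
    threshold: float = 0.02,
    increment: float = 0.01,
    tries_per_round: int = 100,
    max_rounds: int = 5,
    rng: Optional[random.Random] = None,
) -> Interval:
    """Sample an interval whose property value is close to the target value."""
    if length <= 0 or length > chrom_length:
        raise ValueError("interval length must fit within the chromosome")
    if tries_per_round < 1 or max_rounds < 1:
        raise ValueError("at least one trial is required")
    rng = rng or random.Random()
    best: Optional[Interval] = None
    best_diff = float("inf")
    for _ in range(max_rounds):
        for _ in range(tries_per_round):
            start = rng.randrange(0, chrom_length - length + 1)
            candidate = (start, start + length)
            diff = abs(value_of(candidate) - target)
            if diff <= threshold:
                return candidate  # acceptable match found
            if diff < best_diff:
                best, best_diff = candidate, diff
        threshold += increment  # relax the acceptance threshold
    # No acceptable interval was found: fall back to the closest one seen.
    assert best is not None
    return best
```

For the wGCM case, the same loop applies with two targets and two thresholds, and the fall-back keeps the candidate with the smallest sum of the two differences.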


5.2.1 GC and Mappability Calculation

To calculate the GC content and mappability of given intervals, GLANET pre-computes the GC content and mappability values of genomic regions and stores them on disk. The GC content of the genomic regions is calculated at various lengths: 1 bp, 100 bps, 1000 bps, 10,000 bps and 100,000 bps. At runtime, GLANET constructs a GC interval tree (or, at one-base resolution, a byte list) from one of these pre-computed sets of GC content values based on the mode of the input interval lengths. Specifically, the shorter the input intervals are, the more precise the GC calculation is. If the mode of the given intervals’ lengths is short (≤ 100 bps), GLANET calculates the GC content of the given intervals at one-base resolution and stores the values in a byte list. Otherwise, GLANET stores the GC contents of 100 bps, 1000 bps or 10,000 bps long intervals in interval trees. When the mode is > 100 and ≤ 1000 bps, GLANET calculates GC content at 100-base resolution; when the mode is > 1000 and ≤ 10,000 bps, at 1000-base resolution; and when the mode is > 10,000 and ≤ 100,000 bps, at 10,000-base resolution. When the mode is longer than 100,000 bps, GLANET does not calculate GC content for intervals longer than 100,000 bps, but still calculates it at 10,000-base resolution for intervals shorter than 100,000 bps.
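
The resolution rule above can be summarized in a short Python sketch. The function names are hypothetical, and breaking ties between equally frequent lengths by first occurrence is an assumption that the text does not specify.

```python
from collections import Counter
from typing import Sequence


def gc_resolution(interval_lengths: Sequence[int]) -> int:
    """Return the GC resolution in bases chosen from the mode of the interval lengths."""
    if not interval_lengths:
        raise ValueError("no intervals given")
    # most_common(1) returns the first-encountered length among ties.
    mode_length = Counter(interval_lengths).most_common(1)[0][0]
    if mode_length <= 100:
        return 1
    if mode_length <= 1000:
        return 100
    if mode_length <= 10_000:
        return 1000
    # Modes above 10,000 bps, including those above 100,000 bps, use 10,000.
    return 10_000


def gc_is_matched(interval_length: int, l_max: int = 100_000) -> bool:
    """GC content is accounted for only for intervals no longer than l_max."""
    return interval_length <= l_max
```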


Algorithm 5.1: generateRandomIntervals

Require: wIF, tM, tGC, tValue, LMAX
    wIF: If true, isochore family pools will be used in random interval generation.
    tM: Threshold to match mappability within this value.
    tGC: Threshold to match GC content within this value.
    tValue: Stands for tM or tGC.
    LMAX: Maximum interval length for which GC and mappability will be accounted for (default is 100,000 bps).

for each chromosome chr_i do
    S_i ← subset of intervals in S that are on chr_i
    if S_i ≠ ∅ then
        for each sampling b in {1, . . . , B} do
            S_i^(b) ← ∅
            for each given interval g in S_i do
                gLen ← length(g)
                if gLen ≤ LMAX and wGCM then
                    r ← generateARandomIntervalwGCM(g)
                else if gLen ≤ LMAX and (wGC or wM) then
                    r ← generateARandomIntervalwGCorwM(g)
                else    // gLen > LMAX or woGCM
                    r ← getARandomInterval(chr_i, gLen)
                end if
                S_i^(b) ← S_i^(b) ∪ {r}
            end for
        end for
    end if
end for

Function generateARandomIntervalwGCM(g)
    gGC ← calculateGC(g)
    gM ← calculateMappability(g)
    do
        do
            if wIF then
                gIF ← findIsochoreFamily(g)
                r ← getARandomInterval(chr_i, gLen, gIF)
            else
                r ← getARandomInterval(chr_i, gLen)
            end if
        while r overlaps with an already generated interval in S_i^(b)
        rGC ← calculateGC(r)
        rM ← calculateMappability(r)
    while (|rGC − gGC| > tGC) or (|rM − gM| > tM)
    return r

Function generateARandomIntervalwGCorwM(g)
    gValue ← calculateGC(g) or calculateMappability(g)
    do
        do
            if wIF then
                gIF ← findIsochoreFamily(g)
                r ← getARandomInterval(chr_i, gLen, gIF)
            else
                r ← getARandomInterval(chr_i, gLen)
            end if
        while r overlaps with an already generated interval in S_i^(b)
        rValue ← calculateGC(r) or calculateMappability(r)
    while (|rValue − gValue| > tValue)
    return r

Mappability values of genomic intervals are obtained from ENCODE; the source files are listed in Table A.1 of Appendix A. A query interval can lie within a single source interval or overlap multiple source intervals with different mappability values. In either case, its mappability is estimated as a weighted average of the source mappability values, where each weight is the proportion of the query interval’s length that overlaps the corresponding source mappability interval.
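
The weighted average can be written out as a short Python sketch. The function name and the half-open coordinate convention are assumptions made for illustration, as is the choice that query bases not covered by any source interval contribute zero.

```python
from typing import Iterable, Tuple


def interval_mappability(
    query: Tuple[int, int],
    segments: Iterable[Tuple[int, int, float]],
) -> float:
    """Length-weighted average mappability of a query interval.

    query:    (start, end), end exclusive.
    segments: (start, end, mappability) source intervals, end exclusive.
    """
    q_start, q_end = query
    q_len = q_end - q_start
    if q_len <= 0:
        raise ValueError("query interval must have positive length")
    total = 0.0
    for s_start, s_end, value in segments:
        overlap = min(q_end, s_end) - max(q_start, s_start)
        if overlap > 0:
            # The weight is the share of the query covered by this segment.
            total += value * overlap / q_len
    return total
```

For example, a 10 bps query whose first half lies in a segment with mappability 1.0 and whose second half lies in a segment with mappability 0.5 receives the value 0.5 · 1.0 + 0.5 · 0.5 = 0.75.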

5.3 Time and Space Complexity of Random Interval Generation

We generate a random interval for each sampling and for each query interval. Therefore, the time complexity of random interval generation is $O(b \cdot m)$ times the sum of the time complexities of the GC, mappability and isochore family calculations, if the wGC, wM and wIF options are selected, where $b$ is the number of samplings and $m$ is the number of query intervals.

For each sampling and query interval, we calculate the GC content, mappability and isochore family of the query interval and of the randomly generated interval, depending on the options chosen. To avoid infinite loops, each calculation has its own preset number of trials; if the procedure cannot generate a random interval within the threshold, it selects the best random interval generated up to that point.

We store different data structures in memory for GC calculation depending on the required resolution. When the query intervals consist of SNPs, or the mode of the interval lengths is ≤ 100 nucleotides, we need the GC data at the highest resolution; therefore, we keep the GC content of each nucleotide in chromosome-based byte array lists. Each byte in a GC byte array list contains the GC content of 7 nucleotides. Therefore, the space complexity of a GC byte array list is proportional to the size of the chromosome. For other query intervals, we keep the GC content of 100, 1000 or 10,000 bps long intervals in chromosome-based interval trees, depending on the mode of the query interval lengths. Therefore, the space complexity of the interval trees is proportional to the size of the human chromosomes.

We keep the isochore family of each 100,000 bps long interval in chromosome-based array lists. Therefore, the space complexity of the isochore family data structure is proportional to the size of each human chromosome.

We keep two chromosome-based array lists for mappability. The start and end positions of the intervals with a specific mappability value are stored in an integer array list in ascending order, and their corresponding mappability values are stored in a short array list. The data are gathered from the mappability bigWig files listed in Table A.1 of Appendix A. Therefore, the space complexity of the mappability data structures is proportional to the number of intervals provided in the chromosome-based bigWig files.

The time complexity of GC calculation with the GC byte array list is $O(1)$, since we reach the corresponding byte or bytes directly by array index, and the time complexity of GC calculation with the GC interval tree is equal to the cost of an interval tree search. Isochore family calculation relies on the calculated GC content; therefore, its time complexity is equal to the time complexity of GC calculation. The time complexity of mappability calculation is equal to the time complexity of a binary search in the mappability integer array list, followed by $O(1)$ accesses to the corresponding mappability values in the mappability short array list using the indexes returned by the binary search.
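
The binary search followed by a constant-time index access can be sketched with Python’s `bisect` module. The array layout used here is a simplifying assumption rather than GLANET’s exact layout: `boundaries` holds sorted positions such that segment `i` covers `[boundaries[i], boundaries[i + 1])`, and `values[i]` is the mappability of segment `i`.

```python
from bisect import bisect_right
from typing import Sequence


def mappability_at(
    position: int,
    boundaries: Sequence[int],
    values: Sequence[float],
) -> float:
    """Look up the mappability of the segment that contains a position."""
    # Binary search: index of the last boundary that is <= position.
    i = bisect_right(boundaries, position) - 1
    if i < 0 or i >= len(values):
        return 0.0  # assumption: positions outside all segments map to 0
    return values[i]  # constant-time access by index
```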

5.4 Joint Enrichment Analysis of Transcription Factors and KEGG Pathways

GLANET enables joint enrichment analysis for TF binding sites and KEGG pathways. With this option, users can evaluate whether the input set is enriched concurrently with the binding sites of TFs and the genes within a KEGG pathway. This joint enrichment analysis provides a detailed functional interpretation of the input loci.

For example, for a given set of query intervals, TF enrichment analysis may not reveal enrichment for any particular TF, whereas a joint enrichment analysis of genomic elements representing TF binding regions and KEGG pathways may identify several enriched transcription factor and pathway pairs. Therefore, joint enrichment analysis may provide more information than TF enrichment or KEGG Pathway enrichment provides separately.

Separate enrichment analysis for TFs or KEGG pathways with respect to query intervals requires overlapping the TF or KEGG Pathway intervals with the query intervals, which involves finding overlapping intervals for 2 interval sets. In contrast, joint enrichment analysis of TFs and KEGG pathways with respect to query intervals requires finding common overlapping intervals for 3 interval sets, namely, the TF, KEGG Pathway and query intervals.
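
The difference between the two-set and three-set problems can be illustrated with a deliberately naive Python sketch. The function names are hypothetical, intervals are assumed to be half-open, and the nested loops stand in for the interval tree queries that a practical implementation would use.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive


def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the common part of two intervals, or None if they are disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def joint_overlaps(
    queries: Sequence[Interval],
    tf_intervals: Sequence[Interval],
    pathway_intervals: Sequence[Interval],
) -> List[Tuple[Interval, Interval, Interval, Interval]]:
    """Find (query, TF, pathway, common region) tuples with a shared overlap."""
    hits = []
    for q in queries:
        for tf in tf_intervals:
            q_tf = overlap(q, tf)
            if q_tf is None:
                continue
            for pw in pathway_intervals:
                # The pathway interval must overlap the query-TF overlap.
                common = overlap(q_tf, pw)
                if common is not None:
                    hits.append((q, tf, pw, common))
    return hits
```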

Later, in Chapter 8, we generalize this problem of finding common overlapping intervals from 2 or 3 interval sets to n interval sets. We provide our proposed solutions to two problems: finding n common overlapping intervals for n interval sets, and finding at least k common overlapping intervals for n interval sets.

5.5 Time and Space Complexity of Enrichment Analysis

Enrichment analysis annotates the randomly generated intervals of each sampling. The time complexity of enrichment is therefore the time complexity of random interval generation plus that of annotating all samplings. The time complexity of annotating all samplings is $O(b \cdot m \cdot \min(n, k \log n))$, where $b$ is the number of samplings, $m$ is the number of query intervals, $n$ is the number of intervals stored in the interval tree and $k$ is the number of hits. The time and space complexity of random interval generation is presented in Section 5.3.

We construct chromosome-based interval trees for each element type, namely, DNaseI hypersensitive sites (DHSs), Transcription Factors (TFs), Histone Modifications (HMs) and RefSeq Genes. The space complexity of each interval tree is $O(n)$, where $n$ is the number of intervals stored in the interval tree.

CHAPTER 6

DATA DRIVEN COMPUTATIONAL EXPERIMENTS

We designed novel data-driven computational experiments to evaluate GLANET’s enrichment procedure in terms of Type-I error and power. We show that GLANET’s enrichment test has low Type-I error with high statistical power and that it is sensitive to varying experiment parameters, GLANET parameters and significance levels. The data-driven computational experiments also enable us to assess the enrichment capabilities of other tools. Towards this aim, we conduct extensive experiments to compare GLANET with another enrichment tool, GAT. In this chapter, we present the design of the data-driven computational experiments with detailed explanations. We provide the experiment results and compare GLANET and GAT on the basis of these results. Moreover, we interpret the results further with Wilcoxon signed rank tests and ROC curves.

6.1 Design of Data-driven Computational Experiments

The key idea of these experiments is that at the TSSs of expressed genes, we would expect to observe enrichment of RNA polymerase II (POL2) occupancy and of histone modifications that are related to transcriptional activation. In contrast, for the TSSs of non-expressed genes, we would expect enrichment of histone modification elements that are associated with transcriptional repression.

We used data from K562 and GM12878 cells and defined expressed and non-expressed gene sets based on RNA-seq analysis of these cells. Genomic intervals that cover the 500 bps upstream and 100 bps downstream of the first exon of the genes in these sets were retrieved. We based our experiments on the enrichment of activator and repressor elements in these intervals. In these experiments, our null hypothesis always stated that there is no enrichment. For each simulation, we sampled non-overlapping intervals from the TSS regions of the relevant gene set (expressed or non-expressed genes) and, with GLANET, separately evaluated the enrichment of 12 histone modifications with roles in transcriptional repression or activation and of POL2 occupancy. Based on these simulations, we calculated Type-I error and power as follows:

6.1.1 Type-I error experiments

These experiments evaluate whether the GLANET enrichment procedure can control the Type-I error (the probability of rejecting the null hypothesis when it is true, thus making a false rejection) by considering settings where the null hypothesis is true. In the case of non-expressed genes, the null hypothesis is that the intervals located around the TSSs of non-expressed genes are not enriched with activator elements. Similarly, in experiments conducted with expressed genes, the null hypothesis is that the intervals around the TSSs of expressed genes are not enriched with repressor elements. The Type-I error rate is estimated as the fraction of simulations in which we incorrectly reject the null hypothesis.

6.1.2 Power experiments

These experiments evaluate the power (the probability of rejecting the null hypothesis when the alternative hypothesis is true, thus making a correct rejection) of the GLANET enrichment procedure by considering cases where the alternative hypothesis is true. In experiments conducted with non-expressed genes, our null hypothesis states that the intervals are not enriched with repressor elements. Similarly, in the case of expressed genes, the null hypothesis is that the genomic intervals are not enriched with activator elements. Power is then estimated as the fraction of simulations in which we correctly reject the null hypothesis.
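
Both quantities reduce to the same computation: the fraction of simulations in which the null hypothesis is rejected. A minimal Python sketch follows, assuming one p-value per simulation for a given element and assuming that rejection means a p-value at or below the significance level; the function name is hypothetical.

```python
from typing import Sequence


def rejection_rate(p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Fraction of simulations in which the null hypothesis is rejected."""
    if not p_values:
        raise ValueError("no simulations given")
    return sum(1 for p in p_values if p <= alpha) / len(p_values)
```

Applied to simulations in which the null hypothesis is true (for example, repressor elements on expressed genes), this rate estimates the Type-I error; applied to simulations in which the alternative is true (for example, activator elements on expressed genes), it estimates power.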

The design of the data-driven computational experiments is summarized in Figures 6.1 and 6.2. Each experiment consisted of sampling 500 non-overlapping intervals from the relevant gene set described below and repeating the sampling procedure for 1,000 simulations to estimate Type-I error and power. The list of genomic elements, how we defined the sets of expressed and non-expressed genes, and the regions around the TSSs are detailed below.

Computational experiments with expressed genes
Step 1. Define the set of expressed genes based on RNA-seq expression data.
Step 2. Retrieve 601 bps intervals around the genes’ first exons.
Step 3. Sample 500 intervals from the interval set.
Step 4a. Input the intervals to GLANET and check if the activator is enriched. Repeat steps 3 and 4a N times. Power: number of times the activator is found enriched / N.
Step 4b. Input the intervals to GLANET and check if the repressor is enriched. Repeat steps 3 and 4b N times. Type-I error: number of times the repressor is found enriched / N.

Figure 6.1: Design for data-driven computational experiments for expressed genes. N is set to 1000. Activator elements are defined as H2AZ, H3K27ac, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1, H4K20me1 [8] and POL2, whereas H3K27me3 and H3K9me3 constitute the repressor elements.

6.1.3 Transcriptional activator and repressor elements

We considered histone modifications and POL2 occupancy in two groups: (1) activator elements, including POL2 and the modifications H2AZ, H3K27ac, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1 and H4K20me1, which are associated with transcriptional activation at TSSs [8]; and (2) repressor elements, including the modifications H3K9me3 and H3K27me3 [8]. However, some of these elements have been observed to exhibit both activator and repressor features and/or have been reported to be present in regions other than the TSSs, such as gene bodies or the 3’ end. We marked the H3K36me3, H3K4me1, H4K20me1 and H3K9me3 modifications as ambiguous elements, as their role at the TSSs is ambiguous [8, 52, 53].

After processing the RNA-seq data of the GM12878 and K562 cell lines with the ENCODE RNA-seq data analysis pipeline (https://www.encodeproject.org/rna-seq/small-rnas/), we defined expressed and non-expressed gene sets. Both the GM12878 and K562 RNA-seq data included two biological replicates. For each gene, we utilized the lowest and highest transcripts per million (TPM) values across replicates for defining the expressed and non-expressed gene sets, respectively.

Computational experiments with non-expressed genes
Step 1. Define the non-expressed genes based on RNA-seq expression data.
Step 2. Retrieve 601 bps genomic intervals around their first exons and filter the genomic intervals based on the DNaseI exclusion criteria.
Step 3. Sample 500 intervals from the interval set.
Step 4a. Input the intervals to GLANET and check if the repressor is enriched. Repeat steps 3 and 4a N times. Power: number of times the repressor is found enriched / N.
Step 4b. Input the intervals to GLANET and check if the activator is enriched. Repeat steps 3 and 4b N times. Type-I error: number of times the activator is found enriched / N.

Figure 6.2: Design for data-driven computational experiments for non-expressed genes. N is set to 1000. Activator elements are defined as H2AZ, H3K27ac, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9acb, H3K36me3, H3K4me1, H4K20me1 [8] and POL2, whereas H3K27me3 and H3K9me3 constitute the repressor elements.

6.1.4 Genomic interval sets for expressed genes

We defined two sets of expressed genes with varying levels of stringency by considering the top 5th and top 20th percentiles of genes with respect to their TPM values in descending order. In each case, genomic intervals that cover the 500 bps upstream and 100 bps downstream of the first exon of the genes in these sets are retrieved. We refer to these two genomic interval sets as Top5 and Top20.
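
Retrieving such an interval can be sketched in Python as follows. The sketch rests on several assumptions that the text does not state: the TSS is taken as the 5’ end of the first exon, coordinates are 1-based and inclusive (which yields the 601 bps length shown in Figures 6.1 and 6.2), and upstream and downstream are interpreted relative to the gene’s strand. The function name is hypothetical, and clipping at chromosome ends is omitted.

```python
from typing import Tuple


def tss_interval(
    first_exon_start: int,
    first_exon_end: int,
    strand: str,
    upstream: int = 500,
    downstream: int = 100,
) -> Tuple[int, int]:
    """Return the 1-based inclusive interval around the TSS of a gene."""
    if strand == "+":
        tss = first_exon_start  # the 5' end is the lower coordinate
        return (tss - upstream, tss + downstream)
    if strand == "-":
        tss = first_exon_end  # the 5' end is the higher coordinate
        return (tss - downstream, tss + upstream)
    raise ValueError("strand must be '+' or '-'")
```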

6.1.5 Genomic interval sets for non-expressed genes

We labeled genes with zero TPM values as non-expressed genes and formed a tentative interval set by taking the 500 bps upstream and 100 bps downstream of these genes’ first exons. [54] and others observed that DNaseI hypersensitivity and gene expression correlate positively; therefore, we further filtered these intervals with respect to their cell type specific DNaseI signal. We considered two modes of DNaseI overlap exclusion: (i) discarding the interval completely from the interval set if it has any overlap with a DNase-seq peak (CompletelyDiscard), and (ii) keeping the interval but reducing it to its longest sub-interval without DNase-seq peak overlap (TakeTheLongest). In experiments conducted with non-expressed genes, we operated with these two different interval sets: CompletelyDiscard and TakeTheLongest. The DNaseI overlap exclusion accounted for the fact that zero TPM values might arise as an artifact of sequencing depth, and resulted in a conservatively defined set of non-expressed genes.
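
The two exclusion modes can be sketched in Python as follows. The function names mirror the mode names but are otherwise hypothetical; intervals and peaks are assumed to be half-open (start, end) pairs on the same chromosome, and when several peak-free sub-intervals share the maximum length, the first one is kept (an assumption).

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive


def completely_discard(interval: Interval, peaks: Iterable[Interval]) -> Optional[Interval]:
    """Keep the interval only if it overlaps no DNase-seq peak."""
    start, end = interval
    for p_start, p_end in peaks:
        if p_start < end and start < p_end:
            return None
    return interval


def take_the_longest(interval: Interval, peaks: Iterable[Interval]) -> Optional[Interval]:
    """Reduce the interval to its longest sub-interval free of DNase-seq peaks."""
    start, end = interval
    best: Optional[Interval] = None
    cursor = start  # left edge of the current peak-free stretch
    for p_start, p_end in sorted(peaks):
        if p_end <= cursor:
            continue  # peak lies entirely before the current stretch
        if p_start >= end:
            break  # this and all later peaks lie beyond the interval
        if p_start > cursor and (best is None or p_start - cursor > best[1] - best[0]):
            best = (cursor, p_start)
        cursor = max(cursor, p_end)
    if cursor < end and (best is None or end - cursor > best[1] - best[0]):
        best = (cursor, end)
    return best
```

An interval that is fully covered by peaks yields `None` in both modes, so it drops out of both interval sets.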

6.2 Results

We designed and conducted novel data-driven computational experiments to assess the Type-I error and power of GLANET’s enrichment procedure. In this section, we report the results of these data-driven computational experiments, which validate the enrichment procedure in a controlled setting. We explore the effect of the GLANET enrichment parameters together with the experiment parameters, which necessitated 128,000 runs of GLANET as indicated in Table 6.1.

Next, we compare GAT and GLANET through data-driven computational experiments, which required 32,000 GAT runs as described in Table 6.2. GAT performs coarse-grained GC matching, which is not exactly the same as the wGC or wIF option of GLANET; note, however, that throughout the text we use wGC and wIF interchangeably for GAT.

6.2.1 Data-driven Computational Experiments Results for Activator Elements

We performed the data-driven computational experiments summarized in Figures 6.1 and 6.2 under all possible enrichment analysis parameter settings of GLANET listed in Table 5.1. We varied the association measure mode, EOO or NOOB, and considered cases where we accounted for GC and/or mappability or ignored these two biases in the random interval generation step. These settings are with GC (wGC), with mappability (wM), with GC and mappability (wGCM) and without GC and mappability (woGCM). Furthermore, we considered the with Isochore Family and without Isochore Family options, which we refer to as wIF and woIF, respectively. These constituted 16 different parameter settings. As described above, we also varied the definitions of non-expressed and expressed genes: for the expressed gene setting we have Top5, which is the conservatively defined set of expressed genes, and Top20, which is less conservatively defined. For the non-expressed interval set, CompletelyDiscard is a more stringent definition than TakeTheLongest. We repeated these experiments for the K562 and GM12878 cell lines in order to get a complete picture of the performance of the GLANET enrichment procedure.

Table 6.1: Data-driven Computational Experiments (DDCE) for GLANET

Parameter group          Parameter                                     Values                                                       Count
Experiment Parameters    Cell Line                                     GM12878, K562                                                x 2
Experiment Parameters    Experiment Scenario                           Expressed Genes, Non-expressed Genes                         x 2
Experiment Parameters    Experiment Setting                            Top5, Top20 (Expressed Genes);
                                                                       CompletelyDiscard, TakeTheLongest (Non-expressed Genes)      x 2
GLANET Parameters        Association Statistic Option                  EOO, NOOB                                                    x 2
GLANET Parameters        Random Interval Generation Matching Option    wGC, wM, wGCM, woGCM                                         x 4
GLANET Parameters        Random Interval Generation Start Option       woIF, wIF                                                    x 2

128 different Experiment and GLANET parameter combinations
1000 simulations for each parameter combination
128,000 runs of GLANET

Table 6.2: Data-driven Computational Experiments (DDCE) for GAT

Parameter group          Parameter                                     Values                                                       Count
Experiment Parameters    Cell Line                                     GM12878, K562                                                x 2
Experiment Parameters    Experiment Scenario                           Expressed Genes, Non-expressed Genes                         x 2
Experiment Parameters    Experiment Setting                            Top5, Top20 (Expressed Genes);
                                                                       CompletelyDiscard, TakeTheLongest (Non-expressed Genes)      x 2
GAT Parameters           Association Statistic Option                  EOO, NOOB                                                    x 2
GAT Parameters           Random Interval Generation Matching Option    wGC, woGC                                                    x 2

32 different Experiment and GAT parameter combinations
1000 simulations for each parameter combination
32,000 runs of GAT

Through these data-driven computational experiments, we assessed the Type-I error and power of GLANET. We report the results for the significance levels α = 0.05 and α = 0.001 in Figures 6.3-6.10.
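As a minimal sketch of how such rates can be estimated from simulations, assume that every simulation yields one p-value per element and that an element is called enriched when its p-value is at or below α. Under this assumption, the fraction of calls in a scenario where no enrichment is expected estimates the Type-I error, and the fraction in a scenario where enrichment is expected estimates the power. The function name and the p-values below are hypothetical.

```python
def rejection_rate(p_values, alpha):
    """Fraction of simulations whose p-value is at or below alpha."""
    p_values = list(p_values)
    if not p_values:
        raise ValueError("at least one simulation is required")
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


# Made-up p-values for illustration only (the experiments use 1000 simulations).
null_scenario_p = [0.5, 0.03, 0.8, 0.2]          # no enrichment expected
enriched_scenario_p = [0.0005, 0.01, 0.04, 0.3]  # enrichment expected

type_one_error = rejection_rate(null_scenario_p, alpha=0.05)
power = rejection_rate(enriched_scenario_p, alpha=0.05)
print(type_one_error, power)
```

For activator elements the null scenario corresponds to the non-expressed gene intervals and the enriched scenario to the expressed gene intervals; for repressor elements the roles are reversed.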

Figure 6.3 summarizes the results of the experiments conducted with activator elements under the expressed genes (Top5) and non-expressed genes (CompletelyDiscard) settings for K562. Overall, we observe that the Type-I error is well below the target significance level (α = 0.05), without sacrificing power, in all sixteen modes of the GLANET enrichment analysis. One exception is H3K4me1, for which the Type-I error is significantly higher than the target level. This could potentially be attributed to its ambiguous role on promoters, as it also acts downstream of TSSs [8] and is reported to exhibit repressor features [52]. Interestingly, the enrichment assessment of this mark for non-expressed genes is the one most affected by the bias adjustment in the null distribution estimation. The Type-I error for this mark improves significantly when GC and/or mappability are matched, regardless of the association statistic used, without a negative impact on power. Similarly, using the wIF option improves its Type-I error. Another exception is the H3K36me3 mark, which has considerably low power. This is also one of the elements whose role on promoters is ambiguous; H3K36me3 is reported to have a preference for the 3’ end of active genes [8]. When the same experiments are conducted in the GM12878 cell line, we obtain similar results, with even lower Type-I errors (Figure 6.4).

When we use a looser interval exclusion criterion in generating the intervals of non-expressed genes (TakeTheLongest) and a less stringent definition of expressed genes (Top20), the Type-I errors are higher. They are higher even for some non-ambiguous elements in both K562 and GM12878 cells (Figures 6.5 and 6.6). This indicates that GLANET is not universally conservative across all settings. When we re-assessed the Type-I errors and power at a more stringent significance level of 0.001, the Type-I errors were controlled in the (CompletelyDiscard) and (Top5) experiments without loss of power (Figures 6.7 and 6.8), with the exception of the ambiguous elements H3K4me1, H3K36me3, and H4K20me1. When the less stringent experiment settings (TakeTheLongest, Top20) are used at this significance level, there are a few elements with Type-I error above the target significance level and power less than one (Figures 6.9 and 6.10).

[Figure 6.3 panels: (a) K562, Non-exp, α=0.05, wIF; (b) K562, Exp, α=0.05, wIF; (c) K562, Non-exp, α=0.05, woIF; (d) K562, Exp, α=0.05, woIF. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.3: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05.

[Figure 6.4 panels: (a) GM12878, Non-exp, α=0.05, wIF; (b) GM12878, Exp, α=0.05, wIF; (c) GM12878, Non-exp, α=0.05, woIF; (d) GM12878, Exp, α=0.05, woIF. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.4: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.05.


[Figure 6.5 plots. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.5: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05.


[Figure 6.6 plots. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.6: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.05.


[Figure 6.7 plots. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.7: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.001. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.001.


[Figure 6.8 plots. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.8: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.001. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) results, for significance level of 0.001.


[Figure 6.9 plots. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, H3K9ACB, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.9: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.001. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using K562, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.001.


[Figure 6.10 plots. Facets: EOO, NOOB. Legend: wGC, wGCM, wM, woGCM. Axis labels: Type I Error, Power. Elements: H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1.]

Figure 6.10: Assessment of GLANET Type-I error and power with data-driven computational experiments. Histone marks with ambiguous activator roles are marked with ∗. (a, b) Type-I error and power estimated with Isochore Family (wIF) heuristic using GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.001. (c, d) Type-I error and power estimated without Isochore Family (woIF) heuristic using GM12878, (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) results, for significance level of 0.001.

6.2.2 Data-driven Computational Experiments Results for Repressor Elements

Experiments with the repressor element H3K27me3 resulted in zero Type-I error except for a few cases in GM12878 (Tables 6.3 and 6.4), and GLANET attained a power of one across all settings for this element (Tables 6.5 and 6.6). Experiments with the repressor element H3K9me3 resulted in a Type-I error of zero for GM12878, and in Type-I errors above the set significance level for some parameter selections in K562 (Tables 6.3 and 6.4). The power for this histone mark is low in both cell lines (Tables 6.5 and 6.6). H3K9me3 is also one of the elements whose repressive role on promoters is ambiguous.

Overall, we observe that GLANET controls the Type-I error well without loss of power. Type-I error control is significantly better with the NOOB association statistic. Accounting for GC and mappability biases and using the wIF option lower the Type-I error. We further explore how these different parameters affect the GLANET enrichment analysis in Sections 6.2.4 and 6.2.5.

Table 6.3: Type-I error rates calculated in data-driven experiments conducted with repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.05.

Type-I Error, α = 0.05

                                                 wIF                         woIF
Element   Cell line  Expressed Genes  Statistic  wGC    wM     wGCM   woGCM  wGC    wM     wGCM   woGCM
H3K27me3  GM12878    Top5             EOO        0      0      0      0      0      0      0      0
H3K27me3  GM12878    Top20            EOO        0.001  0      0      0.001  0.002  0.001  0      0.006
H3K27me3  GM12878    Top5             NOOB       0      0      0      0      0      0      0      0
H3K27me3  GM12878    Top20            NOOB       0      0      0      0.001  0.001  0.001  0      0.008
H3K27me3  K562       Top5             EOO        0      0      0      0      0      0      0      0
H3K27me3  K562       Top20            EOO        0      0      0      0      0      0      0      0
H3K27me3  K562       Top5             NOOB       0      0      0      0      0      0      0      0
H3K27me3  K562       Top20            NOOB       0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top5             EOO        0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top20            EOO        0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top5             NOOB       0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top20            NOOB       0      0      0      0      0      0      0      0
H3K9me3   K562       Top5             EOO        0      0      0      0      0      0      0      0
H3K9me3   K562       Top20            EOO        0.079  0.051  0.052  0.083  0.103  0.081  0.066  0.126
H3K9me3   K562       Top5             NOOB       0      0      0      0      0      0      0      0
H3K9me3   K562       Top20            NOOB       0.042  0.023  0.025  0.041  0.06   0.039  0.035  0.085

Table 6.4: Type-I error rates calculated in data-driven experiments conducted with repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.001.

Type-I Error, α = 0.001

                                                 wIF                         woIF
Element   Cell line  Expressed Genes  Statistic  wGC    wM     wGCM   woGCM  wGC    wM     wGCM   woGCM
H3K27me3  GM12878    Top5             EOO        0      0      0      0      0      0      0      0
H3K27me3  GM12878    Top20            EOO        0      0      0      0      0      0      0      0
H3K27me3  GM12878    Top5             NOOB       0      0      0      0      0      0      0      0
H3K27me3  GM12878    Top20            NOOB       0      0      0      0      0      0      0      0
H3K27me3  K562       Top5             EOO        0      0      0      0      0      0      0      0
H3K27me3  K562       Top20            EOO        0      0      0      0      0      0      0      0
H3K27me3  K562       Top5             NOOB       0      0      0      0      0      0      0      0
H3K27me3  K562       Top20            NOOB       0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top5             EOO        0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top20            EOO        0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top5             NOOB       0      0      0      0      0      0      0      0
H3K9me3   GM12878    Top20            NOOB       0      0      0      0      0      0      0      0
H3K9me3   K562       Top5             EOO        0      0      0      0      0      0      0      0
H3K9me3   K562       Top20            EOO        0.002  0.001  0.001  0.002  0.003  0.001  0.001  0.005
H3K9me3   K562       Top5             NOOB       0      0      0      0      0      0      0      0
H3K9me3   K562       Top20            NOOB       0.001  0      0      0.001  0.001  0      0      0.001

Table 6.5: Power calculated in data-driven experiments conducted with repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.05.

Power, α = 0.05

                                                     wIF                         woIF
Element   Cell line  Non-expressed Genes  Statistic  wGC    wM     wGCM   woGCM  wGC    wM     wGCM   woGCM
H3K27me3  GM12878    CompletelyDiscard    EOO        1      1      1      1      1      1      1      1
H3K27me3  GM12878    TakeTheLongest       EOO        1      1      1      1      1      1      1      1
H3K27me3  GM12878    CompletelyDiscard    NOOB       1      1      1      1      1      1      1      1
H3K27me3  GM12878    TakeTheLongest       NOOB       1      1      1      1      1      1      1      1
H3K27me3  K562       CompletelyDiscard    NOOB       1      1      1      1      1      1      1      1
H3K9me3   GM12878    CompletelyDiscard    EOO        0.134  0.151  0.163  0.154  0.161  0.182  0.177  0.214
H3K9me3   GM12878    TakeTheLongest       EOO        0.186  0.199  0.209  0.211  0.221  0.244  0.234  0.299
H3K9me3   GM12878    CompletelyDiscard    NOOB       0.076  0.098  0.103  0.095  0.094  0.113  0.106  0.133
H3K9me3   GM12878    TakeTheLongest       NOOB       0.096  0.113  0.124  0.119  0.12   0.134  0.129  0.168
H3K9me3   K562       CompletelyDiscard    NOOB       0.003  0.004  0.004  0.004  0.004  0.005  0.005  0.006


Table 6.6: Power calculated in data-driven experiments conducted with repressor elements, H3K27me3 and H3K9me3, in GM12878 and K562 cell lines for α = 0.001.

Power, α = 0.001

                                                     wIF                         woIF
Element   Cell line  Non-expressed Genes  Statistic  wGC    wM     wGCM   woGCM  wGC    wM     wGCM   woGCM
H3K27me3  GM12878    CompletelyDiscard    NOOB       1      1      1      1      1      1      1      1
H3K9me3   GM12878    CompletelyDiscard    NOOB       0      0.002  0.001  0.001  0      0.003  0.001  0.004

6.2.3 GLANET GAT Comparison Results for Activator and Repressor Elements through Data-driven Computational Experiments

Among the available annotation and enrichment tools, GAT is the only one that takes genomic intervals as input and facilitates accounting for mappability bias. GAT also relies on sampling-based null distribution estimation. It accommodates potential genomic biases in the input query by allowing users to define a workspace, which specifies the regions of the genome that may be used when sampling intervals for null distribution generation. From a practical standpoint, defining this workspace is not straightforward. GLANET, on the other hand, adjusts for GC and mappability biases by matching each input interval using its default library. Overall, GAT's matching procedure is coarser than GLANET's: GAT matches these properties in a coarse-grained fashion, whereas GLANET takes a more fine-grained approach and matches the GC content and/or mappability of each individual input interval. The two association measures that GAT utilizes are the number of overlapping bases between the two sets of genomic intervals and the number of intervals in the segments of interest that overlap. These two measures correspond to the NOOB and EOO association statistics of GLANET, respectively.
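To make the two association statistics concrete, the following Python sketch computes both for a toy example. It is an illustration under simplifying assumptions, not the implementation used by GLANET or GAT: intervals are half-open (start, end) pairs on a single chromosome, the query intervals do not overlap each other, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_statistics(query, annotation):
    """Return (eoo, noob) for query intervals against annotation intervals.

    eoo  : number of query intervals overlapping the annotation by >= 1 base
    noob : total number of query bases covered by the annotation
    """
    # Merging first ensures that a base covered by several annotation
    # intervals is counted only once.
    merged = merge_intervals(annotation)
    eoo = 0
    noob = 0
    for q_start, q_end in query:
        covered = sum(
            max(0, min(q_end, a_end) - max(q_start, a_start))
            for a_start, a_end in merged
        )
        if covered > 0:
            eoo += 1
        noob += covered
    return eoo, noob


query = [(100, 200), (300, 400), (500, 600)]
annotation = [(150, 250), (180, 220), (390, 410)]
eoo, noob = overlap_statistics(query, annotation)
print(eoo, noob)
```

In this example two of the three query intervals overlap the annotation (EOO = 2), covering 50 and 10 bases (NOOB = 60). The quadratic scan over intervals is chosen for readability; a real tool would use an indexed structure such as an interval tree.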

We compared GLANET and GAT using the same data-driven computational experiments for all settings and computed the element-specific Type-I error and power of GAT at the 0.001 and 0.05 significance levels. We observed that GAT is also conservative in terms of Type-I error for the more stringent experiment settings (CompletelyDiscard, Top5). Additionally, GLANET achieves a better Type-I error rate for certain elements such as H3K4me1, and better power for the H3K36me3 and H4K20me1 elements, compared to GAT, as shown in Figures 6.11 and 6.13. For the less stringent experiment settings (TakeTheLongest, Top20), the results show that GLANET's Type-I error and power are comparable to or better than those of GAT (Figures 6.12 and 6.14). We extended this analysis with ROC curves by varying the significance level, as detailed in Section 6.2.5.

[Figure 6.11 panels: (a) Non-exp, CompletelyDiscard, α=0.05; (b) Exp, Top5, α=0.05; (c) Exp, Top5, α=0.05; (d) Non-exp, CompletelyDiscard, α=0.05. Facets: EOO, NOOB; GM12878, K562. Legend: GAT, GLANET. Axis labels: Type I Error, Power. Elements in (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Elements in (c, d): H3K27ME3, *H3K9ME3.]

Figure 6.11: Comparison of GLANET and GAT with respect to data-driven computational experiments in terms of Type-I Error and Power for significance level of 0.05. GLANET(wIF,wGC) and GAT(wIF) parameter settings results are used. Results for the two association statistics, existence of overlap (EOO) and the number of overlapping bases (NOOB), are displayed. (a, b) Type-I error and power of activator elements in (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) experiment settings, respectively. (c, d) Type-I error and power of repressor elements in (Expressed Genes, Top5) and (Non-expressed Genes, CompletelyDiscard) experiment settings, respectively. GLANET achieves higher power for H3K9me3 than GAT.

[Figure 6.12 panels: (a) Non-exp, TakeTheLongest, α=0.05; (b) Exp, Top20, α=0.05; (c) Exp, Top20, α=0.05; (d) Non-exp, TakeTheLongest, α=0.05. Facets: EOO, NOOB; GM12878, K562. Legend: GAT, GLANET. Axis labels: Type I Error, Power. Elements in (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Elements in (c, d): H3K27ME3, *H3K9ME3.]

Figure 6.12: Comparison of GLANET and GAT with respect to data-driven computational experiments in terms of Type-I Error and Power for significance level of 0.05. Results for the two association statistics, existence of overlap (EOO) and the number of overlapping bases (NOOB), are displayed. (a, b) Type-I error and power of activator elements in (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) experiment settings, respectively. (c, d) Type-I error and power of repressor elements in (Expressed Genes, Top20) and (Non-expressed Genes, TakeTheLongest) experiment settings, respectively.

[Figure 6.13 panels: (a) Non-exp, CompletelyDiscard, α=0.001; (b) Exp, Top5, α=0.001; (c) Exp, Top5, α=0.001; (d) Non-exp, CompletelyDiscard, α=0.001. Facets: EOO, NOOB; GM12878, K562. Legend: GAT, GLANET. Axis labels: Type I Error, Power. Elements in (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Elements in (c, d): H3K27ME3, *H3K9ME3.]

Figure 6.13: Comparison of GLANET and GAT with respect to data-driven computational experiments in terms of Type-I Error and Power for significance level of 0.001. Results for the two association statistics, existence of overlap (EOO) and the number of overlapping bases (NOOB), are displayed. (a, b) Type-I error and power of activator elements in (Non-expressed Genes, CompletelyDiscard) and (Expressed Genes, Top5) experiment settings, respectively. (c, d) Type-I error and power of repressor elements in (Expressed Genes, Top5) and (Non-expressed Genes, CompletelyDiscard) experiment settings, respectively.

[Figure 6.14 panels: (a) Non-exp, TakeTheLongest, α=0.001; (b) Exp, Top20, α=0.001; (c) Exp, Top20, α=0.001; (d) Non-exp, TakeTheLongest, α=0.001. Facets: EOO, NOOB; GM12878, K562. Legend: GAT, GLANET. Axis labels: Type I Error, Power. Elements in (a, b): H2AZ, H3K27AC, H3K4ME2, H3K4ME3, H3K79ME2, H3K9AC, POL2, *H3K36ME3, *H3K4ME1, *H4K20ME1. Elements in (c, d): H3K27ME3, *H3K9ME3.]

Figure 6.14: Comparison of GLANET and GAT with respect to data-driven computational experiments in terms of Type-I Error and Power for significance level of 0.001. Results for the two association statistics, existence of overlap (EOO) and the number of overlapping bases (NOOB), are displayed. (a, b) Type-I error and power of activator elements in (Non-expressed Genes, TakeTheLongest) and (Expressed Genes, Top20) experiment settings, respectively. (c, d) Type-I error and power of repressor elements in (Expressed Genes, Top20) and (Non-expressed Genes, TakeTheLongest) experiment settings, respectively.

6.2.4 Assessing GLANET Enrichment Parameters through Wilcoxon Signed Rank Tests

To get a comprehensive view of how the GLANET parameters affect the performance of the enrichment test, we summarize our results across the experiments conducted with various activator and repressor elements under different parameter settings. We concentrate on the Type-I error, as it is more variable than the power.

We gathered the Type-I errors of activator and repressor elements at 26 significance levels, from 0 to 0.25 in increments of 0.01. We considered both the more stringent and the less stringent settings for expressed and non-expressed genes, namely (Top5, CompletelyDiscard) and (Top20, TakeTheLongest). We had 2 repressor elements for each of the cell lines, and 11 and 10 activator elements for K562 and GM12878, respectively. As a result, we considered 208 Type-I errors for expressed genes (2 elements x 2 cell lines x 2 settings x 26 levels) and 1092 Type-I errors for non-expressed genes (21 elements x 2 settings x 26 levels) in the Wilcoxon signed rank tests.
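The two sample sizes can be checked with a few lines of Python. The factorization below is our reading of the counts given above (Type-I errors on expressed genes come from repressor elements, and those on non-expressed genes from activator elements); the variable names are illustrative.

```python
# Significance levels 0.00, 0.01, ..., 0.25
levels = [round(0.01 * i, 2) for i in range(26)]

settings_per_scenario = 2  # Top5/Top20 or CompletelyDiscard/TakeTheLongest
repressors_per_cell_line = {"K562": 2, "GM12878": 2}
activators_per_cell_line = {"K562": 11, "GM12878": 10}

# Type-I errors on expressed genes are measured with repressor elements.
expressed_count = (
    sum(repressors_per_cell_line.values()) * settings_per_scenario * len(levels)
)
# Type-I errors on non-expressed genes are measured with activator elements.
non_expressed_count = (
    sum(activators_per_cell_line.values()) * settings_per_scenario * len(levels)
)
print(expressed_count, non_expressed_count)
```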

We carried out Wilcoxon signed rank tests to assess the statistical significance of the
difference between the Type-I errors achieved by different GLANET parameter settings.
The null hypothesis states that there is no difference in the mean of the ranks of the
two distributions, whereas the alternative hypothesis states that the first distribution
has a lower mean of ranks than the second one. We carried out these tests for the
non-expressed and expressed simulations separately. Table 6.7 lists the p-values of
the tests.
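To make the one-sided test concrete, the sketch below implements a Wilcoxon signed rank test for paired Type-I error values with a normal approximation. It is an illustration only: it is not the code used for Table 6.7, it applies no tie or continuity correction to the variance, and the function name and example values are hypothetical.

```python
import math


def wilcoxon_signed_rank_less(first, second):
    """One-sided Wilcoxon signed rank test (normal approximation).

    Alternative hypothesis: paired values in `first` tend to be lower
    than those in `second`. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # W_plus is the sum of ranks of the positive differences.
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    mean = n * (n + 1) / 4
    std_dev = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std_dev
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    return w_plus, p_value


# Hypothetical paired Type-I errors under two parameter settings.
setting_a = [0.01 * i for i in range(1, 21)]
setting_b = [value + 0.05 + 0.001 * i for i, value in enumerate(setting_a)]
w_plus, p_value = wilcoxon_signed_rank_less(setting_a, setting_b)
```

A small p-value here supports the claim that the first setting yields lower Type-I errors than the second; swapping the arguments tests the opposite direction.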

In general, matching GC and/or mappability reduces the Type-I error ranks compared
to the woGCM setting on average. Therefore, we conclude that accounting for these
biases leads to a more realistic generation of the empirical null distribution. If the
input is sourced from a constrained region of the genome, sampling uniformly at
random from the genome optimistically concludes that there is an enrichment of the
genomic element even though there is none, leading to higher Type-I errors. In both
the expressed and non-expressed gene sets, we observe that matching GC reduces the
Type-I error. This is consistent with the fact that transcription start sites (TSSs) have
higher GC content [55].


Table 6.7: One-sided Wilcoxon signed rank test results for testing whether the Type-

I error distribution of experiments generated under the parameter setting specified in

the row has lower mean of ranks compared to the distribution of Type-I errors gen-

erated under the parameter setting specified in the column, where the null hypothesis

states that there is no difference. A p-value presented in the cell indicates that setting

in the corresponding row has a lower mean of ranks in Type-I error distribution than

the setting in the corresponding column; if the cell is empty the opposite holds. The

p-values are less than or equal to the actual test result.

Wilcoxon signed rank test p-values

Non-expressed(EOO,woIF) Non-expressed(NOOB,woIF)

wGC wM wGCM woGCM wGC wM wGCM woGCM

wGC 2.2e-16 2.2e-16 2.2e-16 2.2e-16 2.2e-16 2.2e-16

wM 2.2e-16 2.2e-16

wGCM 2.2e-16 2.2e-16 2.2e-16 2.2e-16

woGCM

Non-expressed(EOO,wIF) Non-expressed(NOOB,wIF)


wGC 1.9e-04 2.2e-16 2.2e-16 1.004e-14 2.2e-16 6.524e-15

wM 2.2e-16 2.2e-16 2.2e-16

wGCM

woGCM 1.97e-04 2.39e-11

Expressed(EOO,woIF) Expressed(NOOB,woIF)


wGC 5.47e-12 1.2e-12

wM 1.18e-09 5.5e-12 1.75e-09 1.2e-12

wGCM 5.51e-10 1.17e-09 5.5e-12 3.75e-10 5.38e-10 1.2e-12

woGCM

Expressed(EOO,wIF) Expressed(NOOB,wIF)


wGC 1.43e-04 3.93e-03

wM 1.14e-09 2.78e-06 7.88e-10 2.57e-09 7.80e-06 1.75e-09

wGCM 1.15e-09 7.70e-10 2.56e-09 1.75e-09

woGCM

As shown in Table 6.7, we observed that for non-expressed genes, wGC achieved

lower Type-I errors than the other options. For expressed genes, wGCM achieved

lower Type-I errors than the others when woIF was on. However, when wIF was on,

wM performed better in terms of Type-I error. This is because wIF provides coarse


grain GC matching. We also pooled the Type-I errors for (woIF,wIF) and observed

that wIF achieves lower Type-I errors than woIF (Table 6.8) in general and NOOB

provides lower Type-I errors than EOO (Table 6.9).

Overall, we observe that Type-I error control is significantly better with the NOOB
association measure. Accounting for GC and mappability biases and using the wIF
option both lower the Type-I error.

Table 6.8: Wilcoxon Signed Rank Tests for (woIF,wIF). Type-I error distribution of

wIF is less than Type-I error distribution of woIF. To decrease Type-I error, we prefer

wIF over woIF.


woIF wIF

woIF NA 1

wIF 2.2e-16 NA

Table 6.9: Wilcoxon Signed Rank Tests for (EOO,NOOB). Type-I error distribution

of NOOB is less than Type-I error distribution of EOO. To decrease Type-I error, we

prefer NOOB over EOO.


EOO NOOB

EOO NA 1

NOOB 2.2e-16 NA

Finally, we notice an interesting difference in the experiment results conducted with

expressed and non-expressed genes. As listed in Table 6.10, matching only GC

in non-expressed genes results in the lowest Type-I errors. The experiments on ex-

pressed gene intervals show that matching mappability in addition to GC is required to

achieve lower Type-I errors. Thus, we observe that for these data-driven experiments,

accounting for mappability is not critical in the non-expressed gene set whereas it

is critical for the expressed case. We next asked whether the GC and mappability

distributions of these interval sets can explain this result.


Table 6.10: Random interval generation option that achieves the lowest Type-I error
for non-expressed and expressed gene intervals, using the association measures EOO
and NOOB and the two isochore family options woIF and wIF.

Gene-set(AssociationMeasure, IsochoreFamily)    Random Interval Generation Mode
Non-expressed(EOO,woIF)                         wGC
Non-expressed(EOO,wIF)                          wGC
Non-expressed(NOOB,woIF)                        wGC
Non-expressed(NOOB,wIF)                         wGC
Expressed(EOO,woIF)                             wGCM
Expressed(EOO,wIF)                              wM
Expressed(NOOB,woIF)                            wGCM
Expressed(NOOB,wIF)                             wM

We considered the empirical GC and mappability distributions of the gene set
intervals and compared them with the corresponding distributions computed on the
whole genome. We sampled 50,000 intervals, each 600 bp long, from the human
genome uniformly at random. Figures 6.15 and 6.16 display violin plots of the GC and
mappability of these random intervals and of the intervals for the expressed and
non-expressed genes in the GM12878 and K562 cell lines, respectively. As shown in
Figures 6.15a and 6.16a, the GC distributions of non-expressed genes and expressed
genes are similar to each other, and both are considerably different from that of the
whole genome, especially in the lower tail (Kolmogorov-Smirnov test, p-value ≤
2.2e-16). This supports the conclusion that matching for GC is important in the
simulations conducted with both the non-expressed and expressed gene sets. The same
does not hold for the mappability distributions: the mappability distribution of the
non-expressed genes' promoter intervals is more similar to that of the whole genome
than that of the expressed genes' intervals (Figures 6.15b and 6.16b). Although both
expressed and non-expressed gene intervals are significantly different from the genome
based on the two-sample Kolmogorov-Smirnov test (p-value ≤ 2.2e-16), the test
statistic, which quantifies the distance between the two compared distributions, is
smallest between the mappability distributions of the human genome and the
non-expressed gene set in both of the cell lines (Table 6.11).
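The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the chromosome sizes are toy values, `sample_intervals` and `ks_statistic` are hypothetical helper names rather than GLANET functions, and only the two-sample Kolmogorov-Smirnov statistic (the maximum distance between the empirical distribution functions) is computed, not its p-value.

```python
import bisect
import random


def sample_intervals(chrom_sizes, n, length, rng):
    """Sample n intervals of a fixed length uniformly at random from the genome."""
    names = list(chrom_sizes)
    # Each chromosome is weighted by its number of valid start positions.
    weights = [chrom_sizes[name] - length + 1 for name in names]
    intervals = []
    for chrom in rng.choices(names, weights=weights, k=n):
        start = rng.randrange(chrom_sizes[chrom] - length + 1)
        intervals.append((chrom, start, start + length))
    return intervals


def gc_fraction(sequence):
    """Fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Toy chromosome sizes (hypothetical, not the human genome).
toy_sizes = {"chrA": 100_000, "chrB": 50_000}
toy_intervals = sample_intervals(toy_sizes, n=1000, length=600, rng=random.Random(0))
```

In the setting of the text, `ks_statistic` would be applied to the GC (or mappability) values of the 50,000 random intervals and of a gene set's intervals; a smaller value means the gene set resembles the genome more closely.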


[Figure 6.15 panels: (a) violin plots of GC (axis from 0.0 to 0.8) and (b) violin plots of mappability (axis from 0.0 to 1.0), each for WholeGenome, GM12878 Non-expressed and GM12878 Expressed.]

Figure 6.15: Violin plots for (a) GC of randomly sampled intervals from human

genome, GC of intervals of GM12878 non-expressed genes and expressed genes.

(b) Mappability of randomly sampled intervals from human genome, mappability of

intervals from non-expressed and expressed gene-sets of GM12878.

[Figure 6.16 panels: (a) violin plots of GC (axis from 0.0 to 0.8) and (b) violin plots of mappability (axis from 0.0 to 1.0), each for WholeGenome, K562 Non-expressed and K562 Expressed.]

Figure 6.16: Violin plots for (a) GC of randomly sampled intervals from human

genome, GC of intervals of K562 non-expressed genes and expressed genes. (b) Map-

pability of randomly sampled intervals from human genome, mappability of intervals

from non-expressed and expressed gene-sets of K562.


Table 6.11: Kolmogorov-Smirnov test results. The null hypothesis states that the
distribution of GC content or mappability values calculated for 50,000 randomly
sampled intervals from the human genome and that of the corresponding interval set
are not different. Each row corresponds to a Kolmogorov-Smirnov test of this null
hypothesis. In all tests, the null hypothesis is rejected (p-value < 2.2e-16). The first
column lists the property of the genome in question, the second column lists the
distribution that is compared with the genome, and the last column lists the maximum
distance between the two distributions.

Kolmogorov-Smirnov Test Results

Property       Interval Set               Maximum Distance
GC             Non-expressed (GM12878)    0.1454
GC             Expressed (GM12878)        0.1462
GC             Non-expressed (K562)       0.1241
GC             Expressed (K562)           0.1897
Mappability    Non-expressed (GM12878)    0.0794
Mappability    Expressed (GM12878)        0.1693
Mappability    Non-expressed (K562)       0.0898
Mappability    Expressed (K562)           0.1585

6.2.5 Assessing GLANET Enrichment Parameters through ROC Curves and

Comparison with GAT

We plotted element-based and cell-based ROC curves for each possible GLANET
parameter combination and experiment setting. We also included GAT in our ROC
curves for comparison. While plotting the ROC curves, we labeled each activator
element as enriched under the expressed genes scenario and as not enriched under the
non-expressed genes scenario. Conversely, we labeled each repressor element as not
enriched under the expressed genes scenario and as enriched under the non-expressed
genes scenario.

In each ROC curve, we considered the expressed and non-expressed genes scenarios
together; therefore, we had equal numbers of not-enriched and enriched labels, with
their accompanying p-values, for the 1000 runs of all possible GLANET parameter
combinations. We drew the ROC curves separately for the more stringent
(CompletelyDiscard,Top5) and less stringent (TakeTheLongest,Top20) experiment
settings.
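The labeling convention and the construction of an ROC curve from p-values can be sketched as below. This is an illustrative reconstruction, not the plotting code behind the figures: a run is called enriched when its p-value is at or below a threshold, the thresholds sweep over the observed p-values, and the example runs are hypothetical.

```python
def expected_label(element_type, gene_set):
    """Ground-truth label following the convention described in the text."""
    if element_type == "activator":
        return gene_set == "expressed"
    if element_type == "repressor":
        return gene_set == "non-expressed"
    raise ValueError(f"unknown element type: {element_type}")


def roc_points(p_values, labels):
    """(false positive rate, true positive rate) pairs, one per p-value threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(p_values)):
        called = [label for p, label in zip(p_values, labels) if p <= threshold]
        true_pos = sum(called)
        false_pos = len(called) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical runs: (element type, gene set, enrichment p-value).
runs = [
    ("activator", "expressed", 0.001),
    ("activator", "non-expressed", 0.70),
    ("repressor", "expressed", 0.90),
    ("repressor", "non-expressed", 0.004),
]
labels = [expected_label(kind, genes) for kind, genes, _ in runs]
p_values = [p for _, _, p in runs]
example_auc = auc(roc_points(p_values, labels))
```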


For approximately 66.6% of the elements (H2AZ, H3K27AC, H3K4ME2, H3K4ME3,
H3K79ME2, H3K9AC, POL2, H3K27ME3), the ROC curves are alike, with only minimal
differences as all other variables vary. However, for approximately 33.3% of the
elements (H4K20ME1, H3K9ME3, H3K4ME1, H3K36ME3), the ROC curves differ from
each other. For these elements, the ROC curves change from element to element (e.g.
POL2 or H3K9ME3), from cell line to cell line (GM12878 or K562), from parameter
setting to parameter setting (e.g. (EOO,woIF,wGCM) or (NOOB,wIF,wM)), and from
experiment setting to experiment setting ((CompletelyDiscard,Top5) or
(TakeTheLongest,Top20)). Therefore, pooling all results and providing only one ROC
curve is neither possible nor meaningful.

To exemplify this situation, we presented 8 ROC curve figures in Figures 6.17-

6.20. In Figures 6.17a and 6.17b, everything is the same except the parameter

setting which is changed from (NOOB,woIF) to (NOOB,wIF). In Figure 6.17a,

GLANET(NOOB,woIF,wGCM) achieved the highest AUC whereas in Figure 6.17b,

GLANET(NOOB,wIF,wM) achieved the highest AUC with the help of coarse grain

GC matching option, wIF.

In Figures 6.18a and 6.18b, all variables are the same except the cell line which

is changed from GM12878 to K562. In Figure 6.18a, GLANET(NOOB,wIF,wGC)

and GLANET(NOOB,wIF,woGCM) achieved the highest AUCs. For this case, wIF,

coarse grain GC matching performed very well. Even woGCM achieved higher AUC

than wGCM and wM. However, in Figure 6.18b, for K562 cell line, under all random

interval generation options GLANET and GAT performed very well.

In Figures 6.19a and 6.19b, all variables are the same except the experiment set-

ting which is changed from less stringent (TakeTheLongest,Top20) to more stringent

(CompletelyDiscard,Top5). In Figure 6.19a, GLANET(NOOB,woIF,wGC) achieved

the highest AUC. However, in Figure 6.19b, under all random interval generation op-

tions GLANET and GAT performed very well.

In Figures 6.20a and 6.20b, we changed everything except the element, POL2. We
changed the cell line from GM12878 to K562, the parameter setting from (EOO,woIF)
to (NOOB,wIF), and the experiment setting from (CompletelyDiscard,Top5) to
(TakeTheLongest,Top20). We observed that in both Figures 6.20a and 6.20b, GLANET
and GAT performed very well under all random interval generation options.

(a) (b)

Figure 6.17: ROC Curves for (a) H3K9ME3 in K562 under parameter (NOOB, woIF)

and experiment (CompletelyDiscard, Top5) (b) H3K9ME3 in K562 under parameter

(NOOB, wIF) and experiment (CompletelyDiscard, Top5) settings.

(a) (b)

Figure 6.18: ROC Curves for (a) H4K20ME1 in GM12878 under parameter (NOOB,

wIF) and experiment (CompletelyDiscard, Top5) (b) H4K20ME1 in K562 under pa-

rameter (NOOB, wIF) and experiment (CompletelyDiscard, Top5) settings.

There are 13 and 12 elements in K562 and GM12878, respectively, which makes 25
element-cell pairs. We considered 2 experiment settings, (CompletelyDiscard,Top5)
and (TakeTheLongest,Top20), which makes 50 ROC curve figures. We plotted each
ROC curve figure under 4 parameter settings, (EOO,woIF), (EOO,wIF), (NOOB,woIF)
and (NOOB,wIF), which makes 200 ROC curve figures. In each ROC curve figure, we
plotted the 5 ROC curves resulting from the GLANET(wGC,wM,wGCM,woGCM) and
GAT(woGCM) parameter settings.

Figure 6.19: ROC Curves for (a) H3K4ME1 in K562 under parameter (NOOB, woIF)
and experiment (TakeTheLongest, Top20) (b) H3K4ME1 in K562 under parameter
(NOOB, woIF) and experiment (CompletelyDiscard, Top5) settings.

Figure 6.20: ROC Curves for (a) POL2 in GM12878 under parameter (EOO, woIF)
and experiment (CompletelyDiscard, Top5) (b) POL2 in K562 under parameter
(NOOB, wIF) and experiment (TakeTheLongest, Top20) settings.

To compare quantitatively which tool and parameter setting performed better than
the others, we compared the AUCs of the ROC curves with each other using the pROC
R package [56]. In each comparison, we took two ROC curves and tested whether the
AUC of the first ROC curve differs statistically from the AUC of the second. If it was
significantly higher, we increased the number of wins of the first ROC curve; if it was
significantly lower, we increased its number of losses; otherwise, we increased its
number of ties. We made the corresponding update for the second ROC curve. At the
end, we accumulated the number of wins, ties and losses in (Wins/Ties/Losses)
representation for the ROC curve of each tool and parameter setting.
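The bookkeeping of wins, ties and losses can be sketched as below. The pairwise comparison is passed in as a function, because the statistical test itself was performed with pROC in R; the `toy_compare` function here is only a stand-in that thresholds the AUC difference and is not a statistical test. The curve names and AUC values are hypothetical.

```python
from itertools import combinations


def tally_wins_ties_losses(names, compare):
    """Accumulate (wins, ties, losses) per curve from all pairwise comparisons.

    compare(first, second) must return +1 if the AUC of `first` is
    significantly higher, -1 if it is significantly lower, and 0 otherwise.
    """
    record = {name: {"wins": 0, "ties": 0, "losses": 0} for name in names}
    for first, second in combinations(names, 2):
        outcome = compare(first, second)
        if outcome > 0:
            record[first]["wins"] += 1
            record[second]["losses"] += 1
        elif outcome < 0:
            record[first]["losses"] += 1
            record[second]["wins"] += 1
        else:
            record[first]["ties"] += 1
            record[second]["ties"] += 1
    return record


# Hypothetical AUC values for three curves.
TOY_AUC = {"curve_A": 0.70, "curve_B": 0.90, "curve_C": 0.71}


def toy_compare(first, second, margin=0.05):
    """Stand-in for a significance test: compares AUCs against a fixed margin."""
    difference = TOY_AUC[first] - TOY_AUC[second]
    if difference > margin:
        return 1
    if difference < -margin:
        return -1
    return 0


toy_record = tally_wins_ties_losses(list(TOY_AUC), toy_compare)
```

Because every comparison updates both curves, a win recorded in one row always appears as a loss in the mirrored cell, which is the symmetry visible in the comparison tables.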


Under (EOO,woIF) parameter setting, GLANET(EOO,woIF,wGCM) achieved the

highest number of wins. The comparison results are presented in Table 6.12. Under

(EOO,wIF) parameter setting, GLANET(EOO,wIF,wM) achieved the highest num-

ber of wins. The comparison results are presented in Table 6.13.

Table 6.12: GLANET and GAT ROC curves comparison results under (EOO,woIF)

setting.

(EOO,woIF)        GAT       GLANET    GLANET    GLANET    GLANET    Number of
                  (woGCM)   (woGCM)   (wGC)     (wM)      (wGCM)    Wins  Ties  Losses
GAT (woGCM)       -         1/44/5    3/37/10   3/38/9    3/37/10   10    156   34
GLANET (woGCM)    5/44/1    -         3/38/9    3/38/9    3/38/9    14    158   28
GLANET (wGC)      10/37/3   9/38/3    -         5/41/4    3/43/4    27    159   14
GLANET (wM)       9/38/3    9/38/3    4/41/5    -         3/42/5    25    159   16
GLANET (wGCM)     10/37/3   9/38/3    4/43/3    5/42/3    -         28    160   12

Table 6.13: GLANET and GAT ROC curves comparison results under (EOO,wIF)

setting.

(EOO,wIF)         GAT       GLANET    GLANET    GLANET    GLANET    Number of
                  (woGCM)   (woGCM)   (wGC)     (wM)      (wGCM)    Wins  Ties  Losses
GAT (woGCM)       -         5/38/7    3/40/7    3/39/8    5/39/6    16    156   28
GLANET (woGCM)    7/38/5    -         2/45/3    3/42/5    3/42/5    15    167   18
GLANET (wGC)      7/40/3    3/45/2    -         3/43/4    3/43/4    16    171   13
GLANET (wM)       8/39/3    5/42/3    4/43/3    -         3/47/0    20    171   9
GLANET (wGCM)     6/39/5    5/42/3    4/43/3    0/47/3    -         15    171   14

Under the (NOOB,woIF) parameter setting, GLANET(NOOB,woIF,woGCM) achieved
the highest number of wins. Interestingly, all numbers of wins are very close to each
other. The comparison results are presented in Table 6.14. Under the (NOOB,wIF)
parameter setting, GLANET(NOOB,wIF,wM) achieved the highest number of wins.
The comparison results are presented in Table 6.15.

Table 6.14: GLANET and GAT ROC curves comparison results under (NOOB,woIF)

setting.

(NOOB,woIF)       GAT       GLANET    GLANET    GLANET    GLANET    Number of
                  (woGCM)   (woGCM)   (wGC)     (wM)      (wGCM)    Wins  Ties  Losses
GAT (woGCM)       -         2/44/4    6/39/5    7/38/5    6/39/5    21    160   19
GLANET (woGCM)    4/44/2    -         6/39/5    6/39/5    6/39/5    22    161   17
GLANET (wGC)      5/39/6    5/39/6    -         4/42/4    6/40/4    20    160   20
GLANET (wM)       5/38/7    5/39/6    4/42/4    -         5/40/5    19    159   22
GLANET (wGCM)     5/39/6    5/39/6    4/40/6    5/40/5    -         19    158   23

Table 6.15: GLANET and GAT ROC curves comparison results under (NOOB,wIF)

setting.

(NOOB,wIF)        GAT       GLANET    GLANET    GLANET    GLANET    Number of
                  (woGCM)   (woGCM)   (wGC)     (wM)      (wGCM)    Wins  Ties  Losses
GAT (woGCM)       -         1/41/8    1/41/8    0/40/10   1/39/10   3     161   36
GLANET (woGCM)    8/41/1    -         6/42/2    5/41/4    5/41/4    24    165   11
GLANET (wGC)      8/41/1    2/42/6    -         5/41/4    7/39/4    22    163   15
GLANET (wM)       10/40/0   4/41/5    4/41/5    -         6/44/0    24    166   10
GLANET (wGCM)     10/39/1   4/41/5    4/39/7    0/44/6    -         18    163   19

To decide the best tool and parameter setting with respect to ROC curves, we
compared the winners of the (EOO,woIF), (EOO,wIF), (NOOB,woIF) and (NOOB,wIF)
parameter settings with each other. GLANET(EOO, wIF, wM) achieved the highest
number of wins among them, followed by GLANET(EOO, woIF, wGCM) and
GLANET(NOOB, wIF, wM). GLANET(NOOB, woIF, woGCM) achieved the lowest
number of wins among them. The comparison results are presented in Table 6.16.

Table 6.16: We compared the winner settings from Tables 6.12- 6.15 with each other.

Compare Winners             GLANET       GLANET      GLANET        GLANET       Number of
                            (EOO,woIF)   (EOO,wIF)   (NOOB,woIF)   (NOOB,wIF)
                            (wGCM)       (wM)        (woGCM)       (wM)         Wins  Ties  Losses
GLANET (EOO,woIF) (wGCM)    -            4/40/6      7/39/4        7/36/7       18    115   17
GLANET (EOO,wIF) (wM)       6/40/4       -           9/38/3        7/38/5       22    116   12
GLANET (NOOB,woIF) (woGCM)  4/39/7       3/38/9      -             6/38/6       13    115   22
GLANET (NOOB,wIF) (wM)      7/36/7       5/38/7      6/38/6        -            18    112   20

Under (woIF) and (wIF) parameter settings, GLANET(wGC) and GLANET(wM)

achieved the highest number of wins as they are shown in Tables 6.17 and 6.18,

respectively. When we pooled all ROC curves for each random interval generation

option, GLANET(wM) beat the others (Table 6.19).

We also plotted element-based and cell-based Type-I error, power and ROC curve fig-

ures resulting from data-driven computational experiments for all possible GLANET

parameter and experiment settings. Corresponding figures for H4K20ME1 can be

found in Figures B.1- B.8 of Appendix B.


Table 6.17: ROC curves of different parameter settings where (woIF) setting is on

are compared.

woIF              GAT        GLANET     GLANET     GLANET     GLANET     Number of
                  (woGCM)    (woGCM)    (wGC)      (wM)       (wGCM)     Wins  Ties  Losses
GAT (woGCM)       -          3/88/9     9/76/15    10/76/14   9/76/15    31    316   53
GLANET (woGCM)    9/88/3     -          9/77/14    9/77/14    9/77/14    36    319   45
GLANET (wGC)      15/76/9    14/77/9    -          9/83/8     9/83/8     47    319   34
GLANET (wM)       14/76/10   14/77/9    8/83/9     -          8/82/10    44    318   38
GLANET (wGCM)     15/76/9    14/77/9    8/83/9     10/82/8    -          47    318   35

Table 6.18: ROC curves of different parameter settings where (wIF) setting is on are

compared.

wIF               GAT        GLANET     GLANET     GLANET     GLANET     Number of
                  (woGCM)    (woGCM)    (wGC)      (wM)       (wGCM)     Wins  Ties  Losses
GAT (woGCM)       -          6/79/15    4/81/15    3/79/18    6/78/16    19    317   64
GLANET (woGCM)    15/79/6    -          8/87/5     8/83/9     8/83/9     39    332   29
GLANET (wGC)      15/81/4    5/87/8     -          8/84/8     10/82/8    38    334   28
GLANET (wM)       18/79/3    9/83/8     8/84/8     -          9/91/0     44    337   19
GLANET (wGCM)     16/78/6    9/83/8     8/82/10    0/91/9     -          33    334   33


Table 6.19: ROC curves of different “Generate Random Data Options" are compared.

All PooledGAT

(woGCM)

GLANET

(woGCM)

GLANET

(wGC)

GLANET

(wM)

GLANET

(wGCM)

Number of

Wins Ties Losses

GAT

(woGCM)9/167/24 13/157/30 13/155/32 15/154/31 50 633 117

GLANET

(woGCM)24/167/9 17/164/19 17/160/23 17/160/23 75 651 74

GLANET

(wGC)30/157/13 19/164/17 17/167/16 19/165/16 85 653 62

GLANET

(wM)32/155/13 23/160/17 16/167/17 17/173/10 88 655 57

GLANET

(wGCM)31/154/15 23/160/17 16/165/19 10/173/17 80 652 68


CHAPTER 7

GLANET USE CASES AND RUN TIME COMPARISONS

In this chapter, we first compare GLANET and GAT on additional data sets and
present the results. We then describe two use cases of GLANET. In the first, we carry
out an enrichment analysis of OCD GWAS SNPs for the elements in GLANET's default
annotation library. In the second, we perform an enrichment analysis of GATA2
binding sites in the K562 cell line for Gene Ontology terms. Finally, we conclude the
chapter with run time comparisons of GLANET with GAT and GREAT.

7.1 GLANET GAT Comparison with Additional Data-sets

We repeated the experiments provided in the GAT supplementary website [57] with
GLANET. The detailed results for these additional experiments are provided in Tables
7.1-7.4. Results for the GAT runs are obtained from the GAT tutorial
(http://gat.readthedocs.org/en/latest/tutorialIntervalOverlap.html). For each
experiment, GLANET results are computed in sixteen different parameter settings:
GLANET is run with different modes of random data generation (wGC, wM, wGCM,
woGCM), isochore family (woIF, wIF) and association measure (EOO, NOOB).

In each of the Tables 7.1-7.4, the Observed column shows the association measure
value calculated between the given sets, set1 and set2. The Expected and StdDev
columns show the mean and standard deviation of the association measure values of
the samplings, respectively. Fold change is one plus Observed divided by one plus
Expected. The enrichment result is provided by the p-value column.
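The column definitions above translate directly into a few lines of code. The sketch below is illustrative; the function names are hypothetical, and whether the reported StdDev is a sample or a population standard deviation is not stated in the text, so the sample version is an assumption.

```python
import statistics


def fold_change(observed, expected):
    """Fold change as defined in the text: (1 + Observed) / (1 + Expected)."""
    return (1.0 + observed) / (1.0 + expected)


def summarize_samplings(observed, sampled_values):
    """Expected, StdDev and Fold Change from the association values of the samplings."""
    expected = statistics.fmean(sampled_values)
    std_dev = statistics.stdev(sampled_values)  # assumption: sample standard deviation
    return expected, std_dev, fold_change(observed, expected)


# Observed and Expected values taken from Table 7.1.
glanet_eoo_wgc_woif = fold_change(450, 15.7577)
gat_noob = fold_change(20183, 246.5650)
print(round(glanet_eoo_wgc_woif, 4), round(gat_noob, 4))
```

Adding one to both numerator and denominator keeps the ratio finite when the expected association value is zero.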

These experiments evaluate the significance of the overlap of binding regions of
transcription factor Srf in Jurkat cells with three different sets of DHSs from Jurkat
and HepG2 cells. These experiments also exemplify another use case of GLANET where
the input intervals are TF binding regions.

[Figure 7.1 panels: fold change (axis from 0 to 80) under EOO and NOOB for the four experiments Srf(Jurkat) vs. DNaseI(Jurkat), Srf(Jurkat) vs. DNaseI(HepG2), DNaseI(HepG2) vs. DNaseI(Jurkat) and Srf(Jurkat) vs. DNaseI(HepG2U); groups wGC, wM, wGCM and woGCM; legend GAT(NOOB,woGCM,woIF), GLANET(wIF), GLANET(woIF).]

Figure 7.1: GLANET and GAT are run on four experiments ranging from high to

low expected association between the compared genomic interval sets. Each row

depicts an experiment where the first set is input query and the second set is a genomic

element in the annotation library, e.g., experiment Srf(Jurkat) vs. DNaseI(Jurkat)

evaluates whether the binding regions of transcription factor Srf in Jurkat cells are

enriched for DNaseI accessible, i.e., open chromatin, regions in the same cells.

The first experiment (Srf(Jurkat) vs. DNaseI(Jurkat)) assesses whether Srf binding
sites in Jurkat cells, identified by [58], are enriched in DHSs [8] from the same cells.
Given that a majority of transcription factor binding events reside in open chromatin
regions, we expect to observe significant enrichment. The second experiment conducts
the same analysis with the same input against a library that contains DHSs from
HepG2 cells. The third experiment checks whether the DHSs from the two cell types
overlap significantly. Both GAT and GLANET report significant enrichment for these
three experiments (p-values are listed in Tables 7.1, 7.2 and 7.3), consistent with
the expectations. The fourth experiment targets DHSs identified in HepG2 cells but
not in Jurkat cells (HepG2 unique) as the genomic element. It evaluates whether Srf
binding sites in Jurkat cells are enriched for these HepG2-specific DNaseI
hypersensitive sites. Both GAT and GLANET conclude that the observed overlap
between Srf binding sites from Jurkat cells and DHSs specific to HepG2 cells is not
statistically larger than what would be expected under the null distribution. A
comprehensive list of p-values with the exact values of the overlaps is provided in
Table 7.4.

Along with a p-value quantifying enrichment, GAT reports fold enrichment, which

is defined as the ratio of the observed number of overlapping nucleotides divided

by expected number of overlapping nucleotides based on randomizations. We also

calculate the fold enrichment for GLANET for these experiments. In Figure 7.1,

we observe that all enrichment modes of GLANET result in conclusions consistent

with expectations and the GAT results, while (wGCM,wIF) setting is most conser-

vative in terms of fold enrichment. Of the sixteen settings of GLANET, results with

(NOOB,woGCM,woIF) parameter setting agree most closely with the GAT results.

This is expected because GAT uses NOOB as the association measure as well and

does not account for GC and mappability in these experiments.


Table 7.1: Experiment1: Intervals of transcription factor Srf in Jurkat cell line are
overlapped with DNaseI hypersensitive sites in Jurkat cell line. Both GAT and
GLANET find enrichment of DNaseI(Jurkat) for Srf(Jurkat).

Experiment1 Set1: Srf(Jurkat) Set2: DNaseI(Jurkat)

Tool Parameter Settings Observed Expected StdDev Fold Change p-value

GAT (NOOB,woGCM,woIF) 20183 246.5650 105.5933 81.5301 1.0e-03

GLANET

(EOO,wGC,woIF) 450 15.7577 3.8662 26.9130 0

(EOO,wM,woIF) 450 7.6723 2.7149 52.0046 0

(EOO,wGCM,woIF) 450 17.3464 4.0456 24.5824 0

(EOO,woGCM,woIF) 450 6.6257 2.5610 59.1421 0

GLANET

(EOO,wGC,wIF) 450 15.5799 3.8328 27.2016 0

(EOO,wM,wIF) 450 11.9761 3.4328 34.7562 0

(EOO,wGCM,wIF) 450 17.3041 4.0071 24.6392 0

(EOO,woGCM,wIF) 450 10.9239 3.2333 37.8231 0

GLANET

(NOOB,wGC,woIF) 20183 599.3644 158.8155 33.6195 0

(NOOB,wM,woIF) 20183 288.3931 112.5672 69.7459 0

(NOOB,wGCM,woIF) 20183 668.5556 169.8404 30.1453 0

(NOOB,woGCM,woIF) 20183 247.9067 105.5192 81.0906 0

GLANET

(NOOB,wGC,wIF) 20183 595.9552 160.3645 33.8115 0

(NOOB,wM,wIF) 20183 453.3407 140.4382 44.4248 0

(NOOB,wGCM,wIF) 20183 657.1246 168.5808 30.6689 0

(NOOB,woGCM,wIF) 20183 413.4114 136.8533 48.7052 0



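Table 7.2: Experiment2: Intervals of transcription factor Srf in Jurkat cell line are
overlapped with DNaseI hypersensitive sites in HepG2 cell line. Both GAT and
GLANET find enrichment of DNaseI(HepG2) for Srf(Jurkat).

Experiment2 Set1: Srf(Jurkat) Set2: DNaseI(HepG2)

Tool Parameter Settings Observed Expected StdDev Fold Change p-value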

GAT (NOOB, woGCM, woIF) 18965 597.1380 166.9945 31.7084 1.0e-03

GLANET

(EOO,wGC,woIF) 381 49.4944 6.1386 7.5651 0

(EOO,wM,woIF) 381 15.8633 3.9072 22.6527 0

(EOO,wGCM,woIF) 381 55.9002 6.3335 6.7135 0

(EOO,woGCM,woIF) 381 13.5410 3.6388 26.2705 0

GLANET

(EOO,wGC,wIF) 381 55.2896 6.5083 6.7863 0

(EOO,wM,wIF) 381 34.2100 5.5440 10.8491 0

(EOO,wGCM,wIF) 381 62.4521 6.6809 6.0202 0

(EOO,woGCM,wIF) 381 30.8020 5.3329 12.0118 0

GLANET

(NOOB,wGC,woIF) 18965 2298.8933 295.0334 8.2464 0

(NOOB,wM,woIF) 18965 699.2644 177.4524 27.0840 0

(NOOB,wGCM,woIF) 18965 2592.5174 305.0763 7.3128 0

(NOOB,woGCM,woIF) 18965 595.3543 165.2816 31.8032 0

GLANET

(NOOB,wGC,wIF) 18965 2532.3832 310.1418 7.4864 0

(NOOB,wM,wIF) 18965 1531.4211 257.4727 12.3764 0

(NOOB,wGCM,wIF) 18965 2874.7601 316.5953 6.5951 0

(NOOB,woGCM,wIF) 18965 1375.0903 246.4372 13.7825 0


Table 7.3: Experiment3: DNaseI hypersensitive sites in HepG2 cell line are over-

lapped with DNaseI hypersensitive sites in Jurkat cell line. Both GAT and GLANET

find enrichment of DNaseI(Jurkat) for DNaseI(HepG2).

Experiment3 Set1: DNaseI(HepG2) Set2: DNaseI(Jurkat)

Tool Parameter Settings Observed Expected StdDev Fold Change p-value


GLANET

(EOO,wGC,woIF) 37863 4486.2310 63.3604 8.4381 0

(EOO,wM,woIF) 37863 4729.1280 62.8720 8.0048 0

(EOO,wGCM,woIF) 37863 4980.2900 63.7331 7.6012 0

(EOO,woGCM,woIF) 37863 4021.9370 61.3296 9.4120 0

GLANET

(EOO,wGC,wIF) 37863 4779.9930 62.7600 7.9196 0

(EOO,wM,wIF) 37863 5330.1410 66.3065 7.1024 0

(EOO,wGCM,wIF) 37863 5304.6820 67.4277 7.1365 0

(EOO,woGCM,wIF) 37863 4679.8590 62.3539 8.0891 0

GLANET

(NOOB,wGC,woIF) 6163503 514669.0700 8361.7736 11.9756 0

(NOOB,wM,woIF) 6163503 542634.9810 8866.3420 11.3584 0

(NOOB,wGCM,woIF) 6163503 577794.1580 9186.0057 10.6672 0

(NOOB,woGCM,woIF) 6163503 457457.8130 7800.8096 13.4733 0

GLANET

(NOOB,wGC,wIF) 6163503 548311.7080 8391.4861 11.2408 0

(NOOB,wM,wIF) 6163503 616187.0040 9160.8373 10.0026 0

(NOOB,wGCM,wIF) 6163503 614923.7840 8718.6997 10.0231 0

(NOOB,woGCM,wIF) 6163503 536616.5930 8472.0299 11.4858 0



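Table 7.4: Experiment4: Intervals of transcription factor Srf in Jurkat cell line are
overlapped with DNaseI hypersensitive sites that are unique to the HepG2 cell line
(HepG2-Unique). Both GAT and GLANET find no enrichment of
DNaseI(HepG2-Unique) for Srf(Jurkat).

Experiment4 Set1: Srf(Jurkat) Set2: DNaseI(HepG2-Unique)

Tool Parameter Settings Observed Expected StdDev Fold Change p-value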


GLANET

(EOO,wGC,woIF) 9 21.5893 4.4931 0.4426 9.995e-01

(EOO,wM,woIF) 9 8.9383 2.9387 1.0062 5.403e-01

(EOO,wGCM,woIF) 9 24.3285 4.6873 0.3948 9.998e-01

(EOO,woGCM,woIF) 9 7.5673 2.7146 1.1672 3.486e-01

GLANET

(EOO,wGC,wIF) 9 27.1950 4.9426 0.3546 1e+00

(EOO,wM,wIF) 9 18.8889 4.2593 0.5027 9.956e-01

(EOO,wGCM,wIF) 9 29.8631 5.1835 0.3240 1e+00

(EOO,woGCM,wIF) 9 17.0837 4.0756 0.5529 9.878e-01

GLANET

(NOOB,wGC,woIF) 425 951.4744 206.6389 0.4472 9.973e-01

(NOOB,wM,woIF) 425 379.5031 131.6084 1.1195 3.46e-01

(NOOB,wGCM,woIF) 425 1066.5066 216.2852 0.3990 9.998e-01

(NOOB,woGCM,woIF) 425 324.0335 122.5103 1.3106 2.053e-01

GLANET

(NOOB,wGC,wIF) 425 1186.2319 224.0918 0.3588 9.998e-01

(NOOB,wM,wIF) 425 816.2769 189.3228 0.5212 9.867e-01

(NOOB,wGCM,wIF) 425 1309.7033 235.8895 0.3250 1e+00

(NOOB,woGCM,wIF) 425 731.7741 182.2007 0.5813 9.603e-01


7.2 Example Use Cases of GLANET

7.2.1 Enrichment Analysis of OCD GWAS SNPs

We next illustrate how GLANET can be used to analyze a set of SNPs identified in an
obsessive compulsive disorder (OCD) genome-wide association study (GWAS) [59].
This set of 2,340 SNPs was identified as significant in at least one of the case-control,
trios, and combined case-control-trios analyses performed by [59].

We first conduct KEGG pathway analysis using GLANET in three modes: exon-based,
regulation-based, and all-based. These modes vary the genic region definition as
defined in Figure 3.1b. The number of random samplings for these enrichment
analyses is set to 10,000. Several potential pathways are found. Interestingly, the
GLANET regulation-based enrichment analysis identifies the glutamatergic synapse
pathway (hsa04724) as enriched; this is one of the pathways that KEGG reports as
associated with OCD. Both the DLGAP1 and GRIK1 genes are part of this pathway,
and they overlap with OCD-associated SNPs in their intronic regions: DLGAP1
overlaps with rs1628281, rs767887, rs1791397, rs11081062, rs11663827, rs1116345,
rs615916 and rs7230434, whereas GRIK1 overlaps with rs363524 and rs363514.
Additionally, other SNPs overlap with regulatory regions of other genes in this
pathway, such as rs6479056 with PPP3R2(5p1) and GRIN3A(intron), rs17124656 with
GNG2(5p1), and rs1559157 with GRIA1(intron). The full list of genes where overlaps
take place for the glutamatergic synapse pathway is provided in Supp. Table S18
under http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/.

A key outcome of this application is that standard pathway analysis that only uti-

lizes exonic regions of the pre-defined genes can fail to identify pathways that are

biologically relevant through their regulatory roles. For example, long-term depres-

sion pathway (hsa04730) is significantly enriched with a BH FDR adjusted p-value

of 1.62e-02 only in the regulation-based analysis. The link between OCD and
depression has long been established, and the majority of OCD patients also suffer
from depression [60]. GLANET enables such an analysis within minutes.
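For readers unfamiliar with the BH FDR adjustment mentioned above, the following is a minimal sketch of the standard Benjamini-Hochberg procedure for a list of p-values. It illustrates the adjustment in general; it is not taken from the GLANET source code, and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical empirical p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

An element is then reported as enriched when its adjusted p-value falls below the chosen FDR level.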

We also conducted enrichment analysis of OCD SNPs with default GLANET
annotation libraries representing transcription factor binding regions and histone
modifications. The complete list of enrichment analysis is provided in Supp. Table S19
under http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/. Although TF
enrichment analysis did not reveal enrichment for any particular TF, a joint
enrichment analysis of genomic elements representing TF binding regions and KEGG
pathways identified several enriched transcription factor and pathway pairs.

7.2.2 Enrichment Analysis of GATA2 Binding Regions for Gene Ontology Terms

using User-defined Gene Sets Feature

GLANET allows the expansion of the annotation library with user-defined gene sets
and/or genomic intervals. We designed this key feature to give users the flexibility to
include as many gene sets and genomic intervals in their analysis as they wish. Here,
we present a proof of principle application where the user-defined gene sets are Gene
Ontology (GO) terms [30] based on biological process. For each GO term, we curate a
gene set from the genes that are annotated with that particular GO term based on
experimental evidence (reported with one of the GO evidence codes EXP, IDA, IPI,
IMP, IGI, IEP). Utilizing GLANET's user-defined gene set feature, these gene sets
are loaded into the GLANET annotation library.
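The curation step described above can be sketched as a simple filter over annotation records. This is an illustration under stated assumptions: the records are given as hypothetical (gene, GO term, evidence code) tuples with placeholder identifiers, and the input format that GLANET expects for user-defined gene sets is not reproduced here.

```python
# Experimental GO evidence codes listed in the text.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def build_gene_sets(annotations):
    """Map each GO term to the genes annotated with it by experimental evidence."""
    gene_sets = {}
    for gene, go_term, evidence_code in annotations:
        if evidence_code in EXPERIMENTAL_CODES:
            gene_sets.setdefault(go_term, set()).add(gene)
    return gene_sets


# Hypothetical annotation records; gene names and GO identifiers are placeholders.
toy_annotations = [
    ("GENE_A", "GO:0000001", "IDA"),
    ("GENE_B", "GO:0000001", "IEA"),  # electronic annotation, filtered out
    ("GENE_C", "GO:0000001", "IMP"),
    ("GENE_A", "GO:0000002", "EXP"),
    ("GENE_D", "GO:0000003", "ISS"),  # not an experimental code, filtered out
]
toy_gene_sets = build_gene_sets(toy_annotations)
```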

We used GATA2 binding regions (i.e., peaks from the relevant ChIP-seq experiment) from K562 cells as input to GLANET and assessed which of the GO term gene sets are enriched in these regions. GATA2 is a transcription factor crucial for maintaining the proliferation and survival of early hematopoietic cells and for preferential differentiation to erythroid or megakaryocytic lineages [61, 62]. As we expect a subset of GATA2 binding regions to be in close proximity to the genes that GATA2 regulates, such an analysis should identify the significantly enriched biological processes. We conduct this analysis with the three genic region definitions: exon-based, regulation-based and all-based. GLANET correctly identifies several enriched GO terms that are related to the specific biological role of GATA2, such as regulation of definitive erythrocyte differentiation (GO:0010724), platelet formation (GO:0030220), and eosinophil fate commitment (GO:0035854) (Supp. Table S21 is available under http://burcak.ceng.metu.edu.tr/PhDThesis/SuppMaterials/). To quantify the similarity between the set of GO terms that GATA2 is annotated with and the set of GO terms GLANET found enriched, we calculate GO semantic similarity scores between these two sets using the GOSemSim R package [63]. Semantic similarity scores are computed using the Wang measure with the rcmax method. The resulting scores are provided in Table 7.5. As can be seen in the table, the set of GO terms found enriched by GLANET is highly similar to the set of GO terms annotated to the GATA2 gene, and the similarity increases once we incorporate the non-coding regions of the genes in the gene sets, where GATA2 binding takes place.

Table 7.5: GO semantic similarity scores calculated between the set of biological process GO terms that GATA2 is annotated with and the set of GO terms where GATA2 binding regions are found enriched based on GLANET enrichment analysis in three different analysis modes (exon-based, regulation-based and all-based).

                                Enrichment Mode
                                Exon    Regulatory    All
GO Semantic Similarity Score    0.43    0.73          0.99

GLANET’s user-defined gene set feature renders this enrichment analysis straightforward. In other settings, gene sets derived from gene expression analysis or functional assays can be loaded into the GLANET annotation library. In addition to gene sets, users can also load genomic regions, such as ChIP-seq or copy number variation regions, as genomic elements into the GLANET annotation library through the user-defined library feature.

7.3 GLANET Run Time Comparison

We compare GLANET against GAT and GREAT with respect to run time. GLANET and GAT are compared on genomic interval enrichment, as GAT does not offer gene set enrichment. Comparisons with GREAT are on the basis of gene set enrichment, as GREAT only offers enrichment based on annotations of nearby genes. All GAT and GLANET runs were performed on the following system configuration. CPU: Intel(R) Xeon(R) CPU E7-4850 v3 @ 2.20GHz. Memory: 1TB. Operating system: Ubuntu 16.04.2 LTS.





7.3.1 Comparison with GAT

We compare GAT and GLANET in two different experimental settings. In the first setting, the input intervals, each 601 bps long, are randomly selected from the promoter regions of non-expressing genes in the GM12878 cell line from (Non-Expressing, Completely-Discard). All ENCODE checks the enrichment of all ENCODE elements in the GLANET library, which encompass histone modifications, transcription factor binding sites, and DNaseI hypersensitive sites for all cell lines (568 files). Subset ENCODE only includes 12 histone modifications and POL2, as described in Section 6.1.3. Both GLANET and GAT are run under the parameter setting (NOOB, wIF, woGCM). We varied the number of input intervals and the number of samplings. Results for 1,000 and 10,000 samplings are averaged over 10 runs; for 100,000 samplings, each run time is the average of 5 individual runs. Notably, moving from the subset ENCODE library to the all ENCODE library increased the CPU time of GLANET by at most a factor of about 2.4. The resulting CPU times (user + system) and wall clock times are provided in Tables 7.6 and 7.7.

Table 7.6: Elapsed CPU (user + system) run times in seconds for GLANET and GAT runs for a given input query.

Input query: promoter regions of non-expressing genes in GM12878

Number of          Number of     CPU (user + system) run times of tools (in secs)
Input Intervals    Samplings     GLANET -        GLANET -          GAT -
                                 all ENCODE      subset ENCODE     subset ENCODE
500                1,000         826             690               145
500                10,000        1,169           856               1,463
500                100,000       4,447           2,140             14,353
1,000              1,000         1,395           1,283             147
1,000              10,000        1,650           1,165             1,538
1,000              100,000       9,137           3,866             14,341
2,000              1,000         1,396           1,179             155
2,000              10,000        2,429           1,270             1,583
2,000              100,000       14,724          6,257             16,039

Table 7.7: Elapsed wall clock times in seconds for GLANET and GAT runs for a given input query.

Input query: promoter regions of non-expressing genes in GM12878

Number of          Number of     Wall clock run times of tools (in secs)
Input Intervals    Samplings     GLANET -        GLANET -          GAT -
                                 all ENCODE      subset ENCODE     subset ENCODE
500                1,000         108             80                59
500                10,000        173             72                565
500                100,000       750             126               5,505
1,000              1,000         142             120               47
1,000              10,000        298             125               509
1,000              100,000       1,386           330               4,418
2,000              1,000         129             87                44
2,000              10,000        260             110               455
2,000              100,000       1,648           483               4,777

In the second comparison setting, we used the data provided in the GAT web tutorial, as described in Section 7.1. Srf(Jurkat) consists of 556 transcription factor binding site intervals, each 51 bps long, from the Jurkat cell line. DNaseI(Jurkat) and DNaseI(HepG2) comprise DNaseI hypersensitive sites in the Jurkat and HepG2 cell lines, respectively. DNaseI(HepG2Unique) consists of DNaseI hypersensitive sites found in HepG2 but not in the Jurkat cell line. DNaseI(Jurkat) has 159,613 intervals, each 151 bps long. DNaseI(HepG2) has 144,171 intervals with an average length of 360 bps. DNaseI(HepG2Unique) has 106,308 intervals with an average length of 275 bps. For 1,000 and 10,000 samplings, each run time is the average of 10 runs; for 100,000 samplings, each run time is the average of 5 individual runs. The CPU times (user + system) and wall clock times are listed in Tables 7.8 and 7.9.

All run time results for GLANET and GAT are given as CPU (user + system) time and wall clock time in seconds. CPU time is the time that a single CPU would need to complete the process; when multithreading is available, it is the sum of the times spent in each thread of a run (as reported by the time command in Unix). Since GLANET and GAT are multi-threaded applications, wall clock times are lower than CPU times. During these runs, 16GB of memory was reserved for GLANET and GAT, except for the DNaseI(HepG2)-DNaseI(Jurkat) runs with 100,000 samplings, for which GLANET required 64GB of memory.
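As a rough reading aid for the relation between the two kinds of timings, the ratio of CPU time to wall clock time estimates how many threads were busy on average during a run. The Python sketch below computes this ratio for two entries taken from Tables 7.6 and 7.7 (500 input intervals, 100,000 samplings). The function name is illustrative, and reading the ratio as "busy threads" assumes that time spent waiting on I/O is negligible.

```python
def average_busy_threads(cpu_seconds, wall_seconds):
    """Ratio of summed per-thread CPU time to elapsed wall clock time."""
    return cpu_seconds / wall_seconds


# Values copied from Tables 7.6 (CPU) and 7.7 (wall clock):
# 500 input intervals, 100,000 samplings.
glanet_all_encode = average_busy_threads(4447, 750)
gat_subset_encode = average_busy_threads(14353, 5505)

print(round(glanet_all_encode, 2))  # 5.93
print(round(gat_subset_encode, 2))  # 2.61
```

A ratio close to 1 would indicate an essentially single-threaded run; larger ratios indicate that more of the work was spread over several threads.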


Table 7.8: CPU (user + system) times in seconds spent for GLANET and GAT runs given the input query specified.

                                                        CPU (user + system) run times of tools (in secs)
                                                        GLANET                                                       GAT
Input           User Defined           Number of        woIF         woIF       woIF         wIF          wIF        wIF
Query           Library                Samplings        wGC          wM         wGCM         wGC          woGCM      woGCM
Srf(Jurkat)     DNaseI(Jurkat)         1,000            505          498        741          492          473        86
                                       10,000           923          589        1,056        712          582        792
                                       100,000          4,158        1,812      3,856        2,710        1,777      7,383
DNaseI(HepG2)   DNaseI(Jurkat)         1,000            16,843       7,942      24,602       13,386       2,428      1,125
                                       10,000           167,079      69,248     250,360      127,134      16,693     12,476
                                       100,000          2,066,470    766,951    2,700,420    1,447,620    262,553    97,659
Srf(Jurkat)     DNaseI(HepG2)          1,000            518          499        741          509          495        82
                                       10,000           951          585        1,056        715          551        792
                                       100,000          4,312        1,779      4,002        2,712        1,746      7,296
Srf(Jurkat)     DNaseI(HepG2Unique)    1,000            519          499        752          492          485        76
                                       10,000           945          596        1,049        692          565        701
                                       100,000          4,042        1,745      3,987        2,728        1,734      6,924

Table 7.9: Wall clock times in seconds spent for GLANET and GAT runs given the input query specified.

                                                        Wall clock run times of tools (in secs)
                                                        GLANET                                                 GAT
Input           User Defined           Number of        woIF       woIF       woIF       wIF        wIF        wIF
Query           Library                Samplings        wGC        wM         wGCM       wGC        woGCM      woGCM
Srf(Jurkat)     DNaseI(Jurkat)         1,000            224        230        439        222        218        16
                                       10,000           277        235        470        245        230        144
                                       100,000          712        284        843        416        262        1,300
DNaseI(HepG2)   DNaseI(Jurkat)         1,000            3,152      1,195      4,708      2,411      229        193
                                       10,000           29,436     11,014     44,775     21,647     1,423      2,164
                                       100,000          323,963    101,847    402,297    215,717    16,669     16,256
Srf(Jurkat)     DNaseI(HepG2)          1,000            233        232        434        228        224        15
                                       10,000           301        251        496        265        239        155
                                       100,000          735        275        917        413        255        1,268
Srf(Jurkat)     DNaseI(HepG2Unique)    1,000            227        230        426        221        223        13
                                       10,000           287        239        480        244        232        118
                                       100,000          707        272        902        410        258        1,226


7.3.2 Comparison with GREAT

We compared GLANET with GREAT on the basis of GO term gene set enrichment. The input was GATA2 transcription factor binding sites in the K562 cell line, and their enrichment was checked against gene sets derived from GO terms as described in Section 7.2.2. The input comprised 7,407 intervals with an average length of 256 bps. Tables 7.10 and 7.11 give the CPU and wall clock run times of GLANET, respectively. GREAT is not available as a stand-alone command line application; thus, its results are obtained from the online web service. When run on this server, GREAT was very fast: it completed one analysis in less than 1 minute, as its enrichment procedure is not based on sampling but instead assumes a parametric distribution. Since we do not know how each GREAT run is parallelized on the server, we do not know its actual CPU time, and we have no information on the memory used for the GREAT analysis. Therefore, a direct comparison of the run times is not possible.

Table 7.10: CPU (user + system) time in seconds spent for GLANET runs given the input query specified. For 1,000 and 10,000 samplings, each run time is the average of 10 individual runs.

Input query: GATA2 binding sites in K562

Enrichment        Number of     GLANET         Random        Isochore    CPU Run Time
of GO Terms       Samplings     Association    Interval      Family      (in secs)
                                Measure        Generation
BP                1,000         NOOB           woGCM         woIF        522.75
BP, MF and CC     1,000         NOOB           woGCM         woIF        658.34
BP                10,000        NOOB           woGCM         woIF        3069.14
BP, MF and CC     10,000        NOOB           woGCM         woIF        5808.08
BP                10,000        NOOB           wGC           wIF         6838.32
BP, MF and CC     10,000        NOOB           wGC           wIF         9718.03


Table 7.11: Wall clock time in seconds spent for GLANET runs given the input query specified. For 1,000 and 10,000 samplings, each run time is the average of 10 individual runs.

Input query: GATA2 binding sites in K562

Enrichment        Number of     GLANET         Random        Isochore    Wall Clock Run Time
of GO Terms       Samplings     Association    Interval      Family      (in secs)
                                Measure        Generation
BP                1,000         NOOB           woGCM         woIF        63.63
BP, MF and CC     1,000         NOOB           woGCM         woIF        77.42
BP                10,000        NOOB           woGCM         woIF        332.28
BP, MF and CC     10,000        NOOB           woGCM         woIF        606.79
BP                10,000        NOOB           wGC           wIF         586.59
BP, MF and CC     10,000        NOOB           wGC           wIF         845.74


CHAPTER 8

FINDING OVERLAPPING INTERVALS FOR N GIVEN

INTERVAL SETS

Genomic interval intersection is crucial for attaining biological insights from genomic data sets produced by NGS technologies. In this chapter, we generalize the genomic interval intersection problem of finding common overlapping intervals from 2 or 3 interval sets to n interval sets. We divide the problem into two sub-problems: finding n common overlapping intervals from n given interval sets, and finding at least k common overlapping intervals from n given interval sets. We propose a different solution for each of these sub-problems. For finding n common overlapping intervals from n given interval sets, we first construct a segment tree for each interval set and then convert each segment tree into an indexed segment tree forest. We show that constructing indexed short segment trees rather than one tall segment tree reduces the search time. For finding at least k common overlapping intervals from n given interval sets, we construct one big segment tree for all interval sets and find the overlapping intervals immediately after the construction of the segment tree is completed.

8.1 Segment Tree

A segment tree is a data structure for storing intervals. Like the interval tree, the segment tree is a well-known space partitioning tree. It uses O(n log n) storage and can be constructed in O(n log n) time for n given intervals. Finding all intervals in the segment tree that contain a query point q_x requires O(log n + k) time for n given intervals and k hits [64].

Let $I := \{[x_1 : x'_1], [x_2 : x'_2], \ldots, [x_n : x'_n]\}$ be a set of n intervals on the real line. Let $p_1, p_2, \ldots, p_m$ be the list of distinct interval endpoints, sorted from left to right. These points partition the real line, and we call the regions of this partition elementary intervals. From left to right, the elementary intervals induced by $p_1, p_2, \ldots, p_{m-1}, p_m$ are $(-\infty : p_1), [p_1 : p_1], (p_1 : p_2), [p_2 : p_2], \ldots, (p_{m-1} : p_m), [p_m : p_m], (p_m : +\infty)$.
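The partition into elementary intervals can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: intervals are given as (low, high) pairs, and each elementary interval is represented by a hypothetical tuple (low, high, low_closed, high_closed), which is not a representation taken from the thesis implementation.

```python
import math


def elementary_intervals(intervals):
    """List the elementary intervals induced by the endpoints, left to right.

    Each result is (low, high, low_closed, high_closed): open gaps alternate
    with closed single-point intervals [p : p].
    """
    points = sorted({p for interval in intervals for p in interval})
    result = []
    previous = -math.inf
    for p in points:
        result.append((previous, p, False, False))  # open gap (previous : p)
        result.append((p, p, True, True))           # closed point [p : p]
        previous = p
    result.append((previous, math.inf, False, False))  # (p_m : +infinity)
    return result


parts = elementary_intervals([(1, 3), (2, 5)])
print(len(parts))  # 9
```

With m distinct endpoints the partition always has 2m + 1 elementary intervals; here the endpoints are 1, 2, 3 and 5, so there are 9.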

To this end, we build a binary search tree T whose leaves correspond to the elementary intervals induced by the endpoints of the intervals in I, in an ordered way: the leftmost leaf corresponds to the leftmost elementary interval, and so on. We denote the elementary interval corresponding to a leaf µ by Int(µ).

The internal nodes of T correspond to intervals that are the union of the intervals of their two children: the interval Int(ν) corresponding to an internal node ν is the union of the intervals $Int(\nu_{leftChild})$ and $Int(\nu_{rightChild})$ of its children. In particular, for the parent ν of two leaf nodes, Int(ν) is the union of the elementary intervals $Int(\nu_{leftChild})$ and $Int(\nu_{rightChild})$ at those leaves.

Each internal node ν in T has its interval Int(ν), each leaf node µ has its elementary interval Int(µ), and every node ν stores a set of intervals I(ν) ⊆ I, called its canonical subset. The canonical subset of node ν contains the intervals [x : x′] ∈ I such that Int(ν) ⊆ [x : x′] and Int(parent(ν)) ⊈ [x : x′].

As a result, the constructed balanced binary tree T is a segment tree. This construction ensures that, at any depth, the intervals of the nodes are non-overlapping and consecutive from left to right. In effect, it provides a natural binning at any depth of the tree. Figure 8.1 shows how 5 intervals are stored in the leaves and internal nodes of a segment tree constructed from the endpoints of these 5 intervals [64].


Figure 8.1: Intervals (s1, s2, s3, s4, s5) are stored in the nodes. The arrows from the nodes point to their canonical subsets.

8.2 Segment Tree Construction Complexity Analysis

To construct a segment tree for an interval set of n intervals, we proceed as follows. We sort the endpoints of the n intervals in O(n log n) time and define an elementary interval at each endpoint and between each pair of consecutive endpoints. We then construct a binary tree on these elementary intervals, in which the interval of each internal node is the union of the intervals of its left and right children, all the way up to the root. This can be done bottom-up in linear time. In the last phase, each of the n intervals is attached to every node ν whose interval Int(ν) is totally contained in the interval while the interval of its parent is not. As a result, an interval can be attached to more than one node, and the number of intervals attached to the nodes decreases as we go up in the tree, since the node's interval Int(ν) becomes larger. Each interval is attached to at most two nodes per level of the tree, so this phase takes O(n log n) time in total, consistent with the bounds given in Section 8.1.

8.3 Segment Tree Query

A query starts at the root node. If the query point q_x lies in the node's interval Int(ν), the intervals stored at that node are reported and the query continues on the left or right child of that node, whichever contains q_x, visiting one node per level of the tree. The time complexity of a segment tree query is O(log n + k), where n is the number of intervals and k is the number of intervals in the segment tree that contain the query point q_x [64]. Consequently, it is better to construct the segment tree for the interval set with the largest number of intervals and to use the interval set with fewer intervals as the query intervals.
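The construction and the point query described in Sections 8.1 to 8.3 can be sketched in Python as follows. This is a minimal sketch, not the thesis implementation: it assumes closed intervals given as (low, high) pairs, numbers the elementary intervals from left to right (odd numbers are the endpoints themselves, even numbers are the open gaps), and builds the balanced tree top-down over these numbers rather than bottom-up. The names Node, build, attach, SegmentTree and stab are illustrative.

```python
from bisect import bisect_left


class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi      # elementary-interval numbers covered by Int(v)
        self.left = self.right = None
        self.canonical = []            # canonical subset I(v)


def build(lo, hi):
    """Build a balanced tree over elementary-interval numbers lo..hi."""
    node = Node(lo, hi)
    if lo < hi:
        mid = (lo + hi) // 2
        node.left = build(lo, mid)
        node.right = build(mid + 1, hi)
    return node


def attach(node, lo, hi, interval):
    """Store interval at the highest nodes whose range lies inside lo..hi."""
    if lo <= node.lo and node.hi <= hi:
        node.canonical.append(interval)
        return
    if node.left is None:
        return
    if lo <= node.left.hi:
        attach(node.left, lo, hi, interval)
    if hi >= node.right.lo:
        attach(node.right, lo, hi, interval)


class SegmentTree:
    def __init__(self, intervals):
        # Elementary interval 2*i + 1 is the point [p_i : p_i]; even numbers are gaps.
        self.points = sorted({p for interval in intervals for p in interval})
        self.root = build(0, 2 * len(self.points))
        for low, high in intervals:
            a = bisect_left(self.points, low)
            b = bisect_left(self.points, high)
            attach(self.root, 2 * a + 1, 2 * b + 1, (low, high))

    def stab(self, q):
        """Return all stored intervals that contain the query point q."""
        i = bisect_left(self.points, q)
        on_endpoint = i < len(self.points) and self.points[i] == q
        leaf = 2 * i + 1 if on_endpoint else 2 * i
        hits, node = [], self.root
        while node is not None:        # visits one node per level
            hits.extend(node.canonical)
            if node.left is None:
                break
            node = node.left if leaf <= node.left.hi else node.right
        return hits


tree = SegmentTree([(1, 5), (3, 8), (10, 12)])
print(sorted(tree.stab(4)))  # [(1, 5), (3, 8)]
print(tree.stab(9))          # []
```

Because the nodes that store one interval cover disjoint ranges, a root-to-leaf walk meets each containing interval exactly once, which is what gives the O(log n + k) query bound.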


8.4 Motivation: Indexed Segment Tree Forest

After analyzing the segment trees constructed for real data sets, we observed that the nodes at the top of the segment tree (approximately the top two thirds of the tree) store no intervals or only a few intervals in their canonical subsets. Intervals are mostly stored in the bottom nodes, which constitute approximately the bottom one third of the segment tree.

Keeping the whole segment tree, with a significant number of nodes that hold no or only a few intervals, is unnecessary, and passing through all these nodes for each query in order to find overlapping intervals increases the query time. Instead of having one tall segment tree, we can cut the segment tree at a certain depth close to the bottom of the tree and obtain as many short segment trees as there are segment tree nodes at this cut-off depth, plus the segment tree nodes with no children above this cut-off depth. The closer the cut-off depth is to the bottom of the tree, the higher the number of short segment trees will be.

8.4.1 Hash Function, Preset Value

By using one universal hash function, shown in Equation 8.1, we index these short segment trees, and we aim to reach each short segment tree in O(1) time instead of O(cut-off) time. The preset value in the hash function determines the number of segment trees with the same index, which is called a collision.

hash_index = node.interval.lowEndPoint / presetValue    (8.1)

The lower the preset value, the fewer segment trees share the same index. However, this may result in a sequential search of more than one segment tree, which is not preferred. The higher the preset value, the more segment trees share the same index. In that case, more than one segment tree has the same index, so reaching a segment tree may take not O(1) time but O(h) time, where h is the height of the binary search tree (BST) formed from the short segment trees with the same index. As long as the height of this BST is less than the cut-off depth, a search in the indexed segment tree forest is still faster than a search in one tall segment tree.
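The effect of the preset value on collisions can be sketched in a few lines of Python. The sketch assumes that the division in Equation 8.1 is an integer division, which the equation itself does not state, and the low end points used below are made-up values chosen only for illustration.

```python
from collections import defaultdict


def hash_index(low_end_point, preset_value):
    """Equation 8.1, read here as an integer division (an assumption)."""
    return low_end_point // preset_value


def group_by_index(low_end_points, preset_value):
    """Group the roots of the short segment trees by their hash index."""
    buckets = defaultdict(list)
    for low in low_end_points:
        buckets[hash_index(low, preset_value)].append(low)
    return dict(buckets)


# Hypothetical low end points of five short segment tree roots.
lows = [12_000, 48_000, 95_000, 130_000, 610_000]

print(len(group_by_index(lows, 10_000)))   # 5 indexes, no collision
print(len(group_by_index(lows, 100_000)))  # 3 indexes, three roots share index 0
```

Raising the preset value from 10,000 to 100,000 merges the first three roots under one index, which is the situation in which a BST over the colliding roots is needed.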

8.4.2 Cut-off Depth

We may decide on the cut-off depth by considering two factors: 1) the total number of intervals stored in the canonical subsets of the nodes in the top part of the tree, above the cut-off depth, and 2) the number of segment tree nodes at the cut-off depth. The closer the cut-off depth is to the bottom of the tree, the more segment trees there will be in the forest. We tried three different ways of deciding on the cut-off depth.

In the first approach, we first construct the segment tree and then simply cut it at 75% of its total depth, which is close to the bottom of the tree. For instance, if the total depth of the segment tree is 20, we cut the tree at a cut-off depth of 15 and consider the segment tree nodes at this depth and the nodes above the cut-off depth with no children.

In the second approach, after we construct the segment tree, we traverse it in a breadth-first manner and stop at the depth where the number of intervals stored in the nodes up to that depth is greater than or equal to 1% of the total number of intervals.

In the third approach, we keep track of the number of intervals stored in the nodes during the construction of the segment tree and choose the cut-off depth at which the number of intervals stored in the nodes up to that depth is greater than or equal to 1% of the total number of intervals. We call these three approaches AFTER_CONS_75%, AFTER_CONS_BFT and DURING_CONS, respectively.
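The two decision rules behind these approaches can be sketched in Python. The sketch makes several assumptions that are not stated in the text: depth 0 is the root, the per-depth counts of stored intervals are already available as a list, and "total number of intervals" is passed in explicitly by the caller. The function names are illustrative. The second and third approaches apply the same 1% rule and differ only in when the counts are collected, so one function covers both.

```python
def cutoff_by_depth_fraction(total_depth, percent=75):
    """First approach: cut at a fixed percentage of the total depth."""
    return total_depth * percent // 100


def cutoff_by_stored_intervals(stored_per_depth, total_intervals, percent=1):
    """Second and third approaches: first depth at which the number of intervals
    stored from the root down to that depth reaches the given percentage."""
    cumulative = 0
    for depth, stored in enumerate(stored_per_depth):
        cumulative += stored
        if cumulative * 100 >= percent * total_intervals:
            return depth
    return len(stored_per_depth) - 1


print(cutoff_by_depth_fraction(20))  # 15, as in the example in the text

# Hypothetical counts of stored intervals per depth (root first).
counts = [0, 0, 0, 1, 4, 20, 75]
print(cutoff_by_stored_intervals(counts, total_intervals=100))  # 3
```

Integer arithmetic is used for the percentage comparison to avoid floating-point rounding at the threshold.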

Cut-off depth and preset value are the two parameters that affect the performance of

search in indexed segment tree forest.


Figure 8.2: The blue segment tree nodes at the cut-off depth and the red nodes with no children above the cut-off depth are stored in our segment tree forest. For fast access, these stored segment tree nodes are connected to each other through forward and backward links.

8.4.3 Moving Intervals That Were Stored in The Nodes Above The Cut-off

Depth

All the intervals attached to the nodes above the cut-off depth must be distributed to the nodes at the cut-off depth. If an interval is attached to a node above the cut-off depth, it must be attached to the descendants of that node at the cut-off depth. If the node has no descendant at the cut-off depth, there are two cases: if the node has no descendants at all, we directly add the node holding the interval to the segment tree forest; otherwise, we attach the interval to its lowest descendants that have no children and add these descendants to the segment tree forest, giving priority to the node closest to the cut-off depth in order to keep the order between the intervals of the nodes. Note that this extra work is needed only for a small number of nodes.

8.4.4 Linking Segment Tree Nodes at Cut-off Depth to Each Other

To ensure fast access between consecutive segment tree nodes at the cut-off depth, we connect these nodes to each other through forward and backward pointers. We call these nodes linked nodes (Figure 8.2).


8.5 Indexed Segment Tree Forest in More Detail

We cut the segment tree at the cut-off depth and keep the segment tree nodes at this depth in an indexed segment tree forest. Each segment tree node at the cut-off depth is in fact the root of the segment tree below it. We compute a hash index for each of these nodes using the hash function and store the [index, segment tree node] pairs in a map.

We use one universal hash function, given in Equation 8.1, and tested various preset values such as 10,000, 50,000, 100,000, and 500,000. The preset value affects the number of different segment trees that have the same hash index (collisions) in the map. The smaller the preset value, the fewer the collisions; conversely, the higher the preset value, the more collisions there are.

In case of a collision, we construct a binary search tree (BST) from the segment tree nodes with the same index, and the index in the map then points to the root of this BST. As long as the height of the newly created BST is less than the cut-off depth, we still decrease the search time from O(cut-off depth) to O(depth of BST). As future work, we may construct an interval tree instead of a BST from the segment tree nodes with the same index, since an interval tree is balanced whereas a BST is not necessarily balanced.

The original segment tree nodes in this BST are connected to each other; these are the linked nodes mentioned above. The parent nodes of these linked nodes in the BST, on the other hand, are artificial nodes, as shown in Figure 8.3.

8.6 Query in Indexed Segment Tree Forest

Figure 8.3: Segment tree nodes with the same index are stored in a BST, and the index now points to the root of this BST. Blue and red nodes are original segment tree nodes, which are linked to each other. Blue nodes are in fact the roots of the segment trees below them. Red nodes do not have any children. The parents of these blue and red nodes are the artificial nodes.

For each query interval, we compute its lowIndex and highIndex using its low and high endpoints, respectively. We start searching at the linked node pointed to by lowIndex if it exists; otherwise, we find lowerIndex (the highest index lower than lowIndex), start searching at the node pointed to by lowerIndex, and continue searching forward. If this is not possible, we start searching at the linked node pointed to by highIndex if it exists; if not, we find higherIndex (the lowest index higher than highIndex), search the node pointed to by higherIndex, and continue searching backward. If there is no node pointed to by higherIndex, there are no intervals overlapping the query interval. The pseudocode of the algorithms is provided in Algorithms 8.1 to 8.9.
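The two fallback lookups, lowerIndex and higherIndex, can be sketched in Python with a binary search over the sorted list of stored indexes. This is only one possible realization and an assumption about how the map could be organized; it is comparable in spirit to the lowerKey and higherKey methods of Java's TreeMap. The function names are illustrative.

```python
from bisect import bisect_left, bisect_right


def lower_index(sorted_indexes, index):
    """Highest stored index strictly lower than `index`, or None if there is none."""
    pos = bisect_left(sorted_indexes, index)
    return sorted_indexes[pos - 1] if pos > 0 else None


def higher_index(sorted_indexes, index):
    """Lowest stored index strictly higher than `index`, or None if there is none."""
    pos = bisect_right(sorted_indexes, index)
    return sorted_indexes[pos] if pos < len(sorted_indexes) else None


stored = [2, 5, 9]              # hypothetical hash indexes present in the map
print(lower_index(stored, 7))   # 5
print(higher_index(stored, 5))  # 9
print(lower_index(stored, 2))   # None
```

Both lookups take O(log s) time for s stored indexes, so they do not dominate the cost of the search that follows.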

8.6.1 Why at Most Two Additional Index Searches Are Enough

As shown in Figure 8.4, we first compute lowIndex and highIndex using the low and high endpoints of the query, respectively. Then we search the segment trees pointed to by one of these indexes, in the order $lowIndex_{i}$, $lowIndex_{i-1}$, $highIndex_{j}$, $highIndex_{j+1}$.

Here we show why we need to consider at most two more segment trees, namely those pointed to by the indexes $lowIndex_{i-1}$ and $highIndex_{j+1}$ (Figure 8.4).

Figure 8.4: Searching the nodes pointed to by lowIndex and highIndex, the nodes between them, and at most two more nodes is enough.


$lowIndex_{i} = queryLowEndPoint / presetValue$ (8.2)

$highIndex_{j} = queryHighEndPoint / presetValue$ (8.3)

$lowIndex_{i-2} < lowIndex_{i-1} < lowIndex_{i}$ (8.4)

Since all indexes are obtained by dividing a low end point by the same positive preset value,

$lowIndex_{i-1} < lowIndex_{i} \Rightarrow$ (8.5)

$lowNode_{i-1}.interval.lowEndPoint < queryLowEndPoint$ (8.6)

From the preserved order between the intervals of consecutive nodes, we know that

$lowNode_{i-2}.interval.highEndPoint < lowNode_{i-1}.interval.lowEndPoint$ (8.7)

Inequalities 8.6 and 8.7 imply that

$lowNode_{i-2}.interval.highEndPoint < queryLowEndPoint$ (8.8)

As a result of inequality 8.8, $lowNode_{i-2}.interval$ and the query interval cannot overlap. Therefore, we need to look at only one more index preceding $lowIndex_{i}$ and search the segment tree pointed to by that index, moving forward. In the same manner, we need to consider only one more index subsequent to $highIndex_{j}$.

8.7 Finding n Common Overlapping Intervals for n Given Interval Sets

We have n interval sets. We use one of them as the query intervals, and for each of the (n − 1) remaining interval sets we construct a chromosome-based indexed segment tree forest. We find the intervals overlapping the query intervals by searching the indexed segment tree forests of the (n − 1) remaining interval sets one by one.

We tested our indexed segment tree forest approach using hotspot peaks for five fetal adrenal tissues: fAdrenal-DS12528, fAdrenal-DS15123, fAdrenal-DS17319, fAdrenal-DS17677 and fAdrenal-DS20343, which contain 193,835, 188,966, 137,386, 132,500 and 195,098 intervals, respectively. We measured the wall clock search times of 100 runs using the indexed segment tree forest and the segment tree. Using a paired t-test, we verified that, given an appropriate preset value and cut-off depth, the search run times of the indexed segment tree forest are statistically significantly lower than those of the segment tree. The averaged wall clock run times of construction and search are listed in Table 8.1.

Table 8.1: Various preset values and cut-off depth decisions are compared. Construction time and search time of the indexed segment tree forest and the segment tree, in wall clock time, are averaged over 100 runs. P-values resulting from the paired t-test for the search run times of the indexed segment tree forest and the segment tree are provided.

Preset      Cut-off                     Construction Time    Search Time       T-test
Value       Depth Decision              (in millisecs)       (in millisecs)    p-value
500,000     After Construction 75%      126470.75            136.7             0.05812
            After Construction BFT      122114.23            137.01            0.05168
            During Construction         118195.75            135.17            0.04848
            Segment Tree                122297.07            157.56

            Segment Tree                112263.3             143.59

            Segment Tree                126385.71            141.64

            Segment Tree                112578.32            137.31


Algorithm 8.1: findingNCommonOverlappingIntervalsForNIntervalSets
Require: n interval sets
Require: outputfile File to output common overlapping intervals
queryIntervals ← smallest interval set
overlappingIntervalsList ← ∅
for each remaining n − 1 interval set do
    index2NodeMap ← constructIndexedSegmentTreeForest
    search(queryIntervals, index2NodeMap, overlappingIntervalsList)
end for
return overlappingIntervalsList

Algorithm 8.2: search
Require: queryIntervals
Require: index2NodeMap
Require: overlappingIntervalsList
qOvIntList: queryOverlappingIntervalsList
for each query interval do
    qOvIntList ← mainSearch(query, index2NodeMap, presetValue)
    update overlappingIntervalsList with qOvIntList
end for


Algorithm 8.3: mainSearch
Require: query(lowEndPoint, highEndPoint)
Require: index2NodeMap
Require: presetValue > 0
overlappingIntervals ← ∅
lowIndex ← lowEndPoint / presetValue
highIndex ← highEndPoint / presetValue
lowNode ← index2NodeMap.get(lowIndex)
if lowNode ≠ null and linked(lowNode) then
    searchAtLinkedNode(lowNode, query, overlappingIntervals)
else
    lowerIndex ← getLowerIndex(index2NodeMap, lowIndex)
    lowerNode ← index2NodeMap.get(lowerIndex)
    if lowerNode ≠ null then
        searchAtLowerNode(lowerNode, query, overlappingIntervals)
    else
        highNode ← index2NodeMap.get(highIndex)
        if highNode ≠ null and linked(highNode) then
            searchAtLinkedNode(highNode, query, overlappingIntervals)
        else
            higherIndex ← getHigherIndex(index2NodeMap, highIndex)
            higherNode ← index2NodeMap.get(higherIndex)
            if higherNode ≠ null then
                searchAtHigherNode(higherNode, query, overlappingIntervals)
            end if
        end if
    end if
end if
return overlappingIntervals


Algorithm 8.4: searchAtLinkedNode
Require: node is a linked original node
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
searchForward(node, query, overlappingIntervals)
searchBackward(node.backwardNode, query, overlappingIntervals)

Algorithm 8.5: searchForward
Require: node is a linked original node
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
low: lowEndPoint
high: highEndPoint
if node ≠ null and node.interval.low ≤ high then
    if low ≤ node.interval.high then
        add node.canonicalSubset to overlappingIntervals
        if node.left ≠ null and low ≤ node.left.interval.high then
            searchDownward(node.left, query, overlappingIntervals)
        end if
        if node.right ≠ null and node.right.interval.low ≤ high then
            searchDownward(node.right, query, overlappingIntervals)
        end if
    end if
    searchForward(node.forwardNode, query, overlappingIntervals)
end if


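Algorithm 8.6: searchBackward
Require: node is a linked original node
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
low: lowEndPoint
high: highEndPoint
if node ≠ null and low ≤ node.interval.high then
    if node.interval.low ≤ high then
        add node.canonicalSubset to overlappingIntervals
        if node.left ≠ null and low ≤ node.left.interval.high then
            searchDownward(node.left, query, overlappingIntervals)
        end if
        if node.right ≠ null and node.right.interval.low ≤ high then
            searchDownward(node.right, query, overlappingIntervals)
        end if
    end if
    searchBackward(node.backwardNode, query, overlappingIntervals)
end if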

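Algorithm 8.7: searchDownward
Require: query(lowEndPoint, highEndPoint)
Require: node ≠ null
Require: node and query overlap
Require: overlappingIntervals
add node.canonicalSubset to overlappingIntervals
if node.left ≠ null and overlaps(query, node.left) then
    searchDownward(node.left, query, overlappingIntervals)
end if
if node.right ≠ null and overlaps(query, node.right) then
    searchDownward(node.right, query, overlappingIntervals)
end if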


Algorithm 8.8: searchAtLowerNode
Require: lowerNode ≠ null
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
if linked(lowerNode) then
    searchForward(lowerNode, query, overlappingIntervals)
else
    if overlaps(query, lowerNode) then
        searchDownward(lowerNode, query, overlappingIntervals)
    end if
    node ← findRightMostNode(lowerNode)
    searchForward(node.forwardNode, query, overlappingIntervals)
end if

Algorithm 8.9: searchAtHigherNode
Require: higherNode ≠ null
Require: query(lowEndPoint, highEndPoint)
Require: overlappingIntervals
if linked(higherNode) then
    searchBackward(higherNode, query, overlappingIntervals)
else
    if overlaps(query, higherNode) then
        searchDownward(higherNode, query, overlappingIntervals)
    end if
    node ← findLeftMostNode(higherNode)
    searchBackward(node.backwardNode, query, overlappingIntervals)
end if


8.8 Finding at Least k Common Overlapping Intervals for n Given Interval

Sets

Constructing one segment tree or one indexed segment tree forest for each interval set solves the problem of finding n common overlapping intervals for n given interval sets. However, to find at least k common overlapping intervals for n given interval sets with this approach, we have to call the per-interval-set indexed segment tree forest solution C(n, k) times. We resort to this option only to check the correctness of our newly proposed solution. In this section, we provide our algorithm for finding at least k common overlapping intervals for n interval sets using one segment tree for all interval sets, where 2 ≤ k ≤ n. To enhance the performance of the algorithms, we implemented them using the fork/join framework of Java 1.8, aiming to take advantage of multiple processors as much as possible.
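As an illustration of the fork/join style mentioned above, the following Java sketch traverses a binary tree with a RecursiveTask, forking one subtree while computing the other on the current thread. It is only a sketch under stated assumptions: DemoNode, CountTask, and the storedIntervals field are hypothetical names, and the task merely sums a counter rather than performing the overlap search itself.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical tree node; the field names are illustrative only. */
final class DemoNode {
    final int storedIntervals;   // number of intervals stored at this node
    final DemoNode left;
    final DemoNode right;

    DemoNode(int storedIntervals, DemoNode left, DemoNode right) {
        this.storedIntervals = storedIntervals;
        this.left = left;
        this.right = right;
    }
}

/** Sums storedIntervals over a subtree using the fork/join framework. */
final class CountTask extends RecursiveTask<Long> {
    private final DemoNode node;

    CountTask(DemoNode node) {
        this.node = node;
    }

    @Override
    protected Long compute() {
        if (node == null) {
            return 0L;
        }
        CountTask leftTask = new CountTask(node.left);
        leftTask.fork();                                        // left subtree runs asynchronously
        long rightSum = new CountTask(node.right).compute();    // right subtree on this thread
        return node.storedIntervals + rightSum + leftTask.join();
    }

    /** Runs the task on the common pool and returns the total. */
    static long count(DemoNode root) {
        return ForkJoinPool.commonPool().invoke(new CountTask(root));
    }
}
```

A production version would usually stop forking below some subtree size and fall back to a sequential traversal, since creating one task per node adds overhead.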

The pseudocode of the algorithms is provided in Algorithms 8.10, 8.11, 8.12, 8.13, 8.14, and 8.15.

Algorithm 8.10: findingAtLeastKCommonOverlappingIntervalsForNIntervalSets
Require: n interval sets
Require: k
Require: output File to output common overlapping intervals
1: allEndPoints ← fillEndPointsAndIntervals(n)
2: sortedAllEndPoints ← sortEndPoints(allEndPoints)
3: root ← constructSegmentTree(sortedAllEndPoints)
4: storeIntervals(n, root)
5: ovIntList ← ∅
6: lastOvIntList ← ∅
7: findAtLeastK(root, k, ovIntList, lastOvIntList, output)


Algorithm 8.11: fillEndPointsAndIntervals
Require: n interval sets
1: for each interval set i do
2:   for each interval j in interval set i do
3:     add lowEndPoint_{i,j} and highEndPoint_{i,j} to allEndPoints
4:     add interval j to intervals_i
5:   end for
6: end for
7: return allEndPoints

Algorithm 8.12: sortEndPoints: Sort allEndPoints in ascending order
Require: allEndPoints
1: sort end points
2: return sortedAllEndPoints

Algorithm 8.13: constructSegmentTree: Using sortedAllEndPoints
Require: sortedAllEndPoints
1: construct segment tree
2: return root of the segment tree

Algorithm 8.14: storeIntervals: One interval set at a time
Require: n interval sets
1: for each interval set i do
2:   for each interval j in interval set i do
3:     store interval j to segment tree
4:     update node.intervalSetNumbers with i
5:   end for
6: end for
7: prune the segment tree of the nodes that have fewer than k numbers in their node.intervalSetNumbers
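The steps in Algorithms 8.11 to 8.14 (collect end points, sort them, build the tree, store each interval) can be sketched in Java as follows. This is a simplified illustration, not the thesis implementation: the names SegNode, SegmentTreeSketch, and stab are hypothetical, coordinates are assumed to be closed integer positions, leaves are assumed to be the end points plus the gaps between them, and the intervalSetNumbers bookkeeping and pruning of Algorithm 8.14 are left out. The set number is simply kept alongside each stored interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical node type; field names loosely follow the pseudocode. */
final class SegNode {
    final int low;
    final int high;                 // node covers the closed integer range [low, high]
    SegNode left;
    SegNode right;
    final List<int[]> canonicalSubset = new ArrayList<>(); // entries: {low, high, setNumber}

    SegNode(int low, int high) {
        this.low = low;
        this.high = high;
    }
}

final class SegmentTreeSketch {
    /** Builds a balanced tree whose leaves are the end points and the gaps between them. */
    static SegNode build(int[] endPoints) {
        TreeSet<Integer> sorted = new TreeSet<>();
        for (int p : endPoints) {
            sorted.add(p);                                   // sorts and removes duplicates
        }
        List<SegNode> leaves = new ArrayList<>();
        Integer previous = null;
        for (int p : sorted) {
            if (previous != null && p - previous > 1) {
                leaves.add(new SegNode(previous + 1, p - 1)); // positions strictly in between
            }
            leaves.add(new SegNode(p, p));                    // the end point itself
            previous = p;
        }
        return leaves.isEmpty() ? null : merge(leaves, 0, leaves.size() - 1);
    }

    private static SegNode merge(List<SegNode> leaves, int from, int to) {
        if (from == to) {
            return leaves.get(from);
        }
        int mid = (from + to) / 2;
        SegNode parent = new SegNode(leaves.get(from).low, leaves.get(to).high);
        parent.left = merge(leaves, from, mid);
        parent.right = merge(leaves, mid + 1, to);
        return parent;
    }

    /** Stores [low, high] at the highest nodes whose range it fully covers. */
    static void store(SegNode node, int low, int high, int setNumber) {
        if (node == null || high < node.low || node.high < low) {
            return;                                           // no overlap with this subtree
        }
        if (low <= node.low && node.high <= high) {
            node.canonicalSubset.add(new int[] {low, high, setNumber});
            return;
        }
        store(node.left, low, high, setNumber);
        store(node.right, low, high, setNumber);
    }

    /** Returns every stored interval that contains the given position. */
    static List<int[]> stab(SegNode node, int position) {
        List<int[]> found = new ArrayList<>();
        while (node != null && node.low <= position && position <= node.high) {
            found.addAll(node.canonicalSubset);
            node = (node.left != null && position <= node.left.high) ? node.left : node.right;
        }
        return found;
    }
}
```

The sketch assumes that the end points of every stored interval were passed to build beforehand, as in Algorithm 8.10, so that each interval decomposes exactly into node ranges.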


Algorithm 8.15: findAtLeastK
Require: node
Require: k
Require: overlappingIntervalsList
Require: lastOverlappingIntervalsList
Require: output File to output common overlapping intervals
1: intervalSetNumbers ← node.intervalSetNumbers
2: newOvIntList: newOverlappingIntervalsList
3: ovIntList: overlappingIntervalsList
4: nIntSetNum2IntMap: newIntervalSetNumber2IntervalMap
5: exIntSetNum2IntMap: existingIntervalSetNumber2IntervalMap
6: intSetNum: intervalSetNumber
7: exInt: existingInterval
8: lastOvIntList: lastOverlappingIntervalsList
9: if (intervalSetNumbers.size ≥ k) then
10:   if (node.canonicalSubset ≠ null) then
11:     for each interval_i in node.canonicalSubset do
12:       create newOvIntList
13:       if (ovIntList = ∅) then
14:         create nIntSetNum2IntMap
15:         add interval_i to nIntSetNum2IntMap
16:         add nIntSetNum2IntMap to newOvIntList
17:       else
18:         for each exIntSetNum2IntMap in ovIntList do
19:           if exIntSetNum2IntMap contains intSetNum of interval_i then
20:             exInt ← exIntSetNum2IntMap.get(intSetNum)
21:             if (exInt ≠ interval_i) then
22:               create nIntSetNum2IntMap from exIntSetNum2IntMap
23:               add interval_i to nIntSetNum2IntMap
24:               if (nIntSetNum2IntMap.size ≥ k) then
25:                 checkforAtLeastKCommonOverlapsUpdateOutput
26:               else
27:                 add nIntSetNum2IntMap to newOvIntList
28:               end if
29:             end if
30:           else
31:             add interval_i to exIntSetNum2IntMap
32:             if (exIntSetNum2IntMap.size ≥ k) then
33:               checkforAtLeastKCommonOverlapsUpdateOutput
34:             end if
35:           end if
36:         end for
37:       end if
38:       add newOvIntList to ovIntList
39:     end for
40:   end if
41:   if (node.left ≠ null) then
42:     findAtLeastK(node.left, k, ovIntList, lastOvIntList, output)
43:   end if
44:   if (node.right ≠ null) then
45:     findAtLeastK(node.right, k, ovIntList, lastOvIntList, output)
46:   end if
47: end if
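One way to sanity-check the output of such an algorithm on small inputs is an independent sweep over the end points. The Java sketch below is not the segment-tree algorithm of this section; it is a plain sweep-line cross-check that reports the maximal regions covered by intervals from at least k distinct interval sets. The class name AtLeastKSweep, the {low, high, setNumber} encoding, and the closed integer coordinates are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AtLeastKSweep {
    /**
     * intervals[i] = {low, high, setNumber} with closed integer coordinates and
     * 0 <= setNumber < numSets. Returns maximal closed regions {low, high} that
     * are covered by intervals from at least k distinct interval sets.
     */
    static List<int[]> regions(int[][] intervals, int numSets, int k) {
        // Each interval contributes a start event at low and an end event at high + 1.
        int[][] events = new int[intervals.length * 2][];
        for (int i = 0; i < intervals.length; i++) {
            events[2 * i] = new int[] {intervals[i][0], +1, intervals[i][2]};
            events[2 * i + 1] = new int[] {intervals[i][1] + 1, -1, intervals[i][2]};
        }
        Arrays.sort(events, (a, b) -> Integer.compare(a[0], b[0]));

        int[] activePerSet = new int[numSets];   // currently open intervals per set
        int distinctSets = 0;                    // sets with at least one open interval
        boolean open = false;
        int regionStart = 0;
        List<int[]> result = new ArrayList<>();

        int i = 0;
        while (i < events.length) {
            int position = events[i][0];
            // Apply every event at this position before inspecting the coverage.
            while (i < events.length && events[i][0] == position) {
                int set = events[i][2];
                if (events[i][1] > 0) {
                    if (activePerSet[set]++ == 0) distinctSets++;
                } else {
                    if (--activePerSet[set] == 0) distinctSets--;
                }
                i++;
            }
            if (!open && distinctSets >= k) {
                open = true;
                regionStart = position;
            } else if (open && distinctSets < k) {
                open = false;
                result.add(new int[] {regionStart, position - 1});
            }
        }
        return result;
    }
}
```

Unlike Algorithm 8.15, this check reports covered regions rather than the combinations of intervals that produce them, so it can confirm where overlaps exist but not which interval tuples should be listed.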


CHAPTER 9

CONCLUSION AND FUTURE WORK

The research carried out in this thesis falls into two main parts. In the first part, we developed GLANET, a comprehensive annotation and enrichment analysis tool that implements a sampling-based enrichment test accounting for genomic biases and offers several useful built-in analysis capabilities. Following GLANET, we designed novel data-driven computational experiments to assess our enrichment analysis in detail in terms of its Type-I error and power. Through these experiments, we investigated how correcting for genomic biases, separately and jointly, affects enrichment analysis, along with the effect of GLANET's other parameters. These experiments also provide a methodology for benchmarking the enrichment analyses of other tools. To exemplify this use case, we compared GLANET with another tool, GAT [22]. In the second part of the thesis, we extended the annotation analysis provided in GLANET, which finds common overlapping intervals for 2 or 3 interval sets, to n given interval sets. To this end, we proposed a novel indexed segment tree forest data structure with its accompanying algorithms and showed that the indexed segment tree forest reduces search time.

NGS technologies allow us to sequence whole genomes rapidly, analyze gene expression through RNA sequencing, study somatic variations by sequencing patient samples, and analyze epigenetic factors such as genome-wide DNA methylation and DNA-protein interactions. These different sequencing methods all yield genomic intervals of interest. Interpreting these genomic intervals requires overlapping them with already annotated genomic intervals through annotation. The immediate follow-up step is to find the statistically significant overlaps among them via enrichment analysis. Various tools exist, each with different shortcomings. Some accept only SNPs rather than genomic intervals of varying length, some provide annotation but no enrichment analysis, some allow enrichment analysis only for gene lists, and most do not account for genomic biases during enrichment analysis. To overcome these shortcomings, we developed the Genomic Loci ANnotation and Enrichment Tool (GLANET), which has many built-in capabilities.

First of all, GLANET offers an easy-to-run desktop and command-line application with its open source code available at https://github.com/burcakotlu/GLANET and full documentation provided at https://glanet.readthedocs.org. GLANET performs flexible annotation and enrichment analysis of a given set of fixed- or varying-length loci. To annotate the given genomic intervals with the intervals in the default library or in the user-extended library, we utilize interval trees. We construct chromosome-based interval trees and find overlapping intervals through interval tree search. We find the statistically significant overlaps between given interval sets by conducting sampling-based enrichment analysis. Genomic biases inherent to NGS technologies, such as GC content and mappability, restrict the regions of the genome that can contribute to and show up in the resulting intervals. We adjust for these biases during the random interval generation phase of enrichment analysis. By correcting for these biases when generating samplings, we aim to reduce false positives in enrichment analysis.
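To make the idea of a sampling-based test concrete, the Java sketch below computes an empirical p-value from the number of overlaps observed for the input intervals and the numbers of overlaps obtained in the random samplings. The exact formula used by GLANET is not restated in this chapter, so the ratio used here (the fraction of samplings with at least as many overlaps as observed) is an assumption, as is the class name EmpiricalPValue.

```java
final class EmpiricalPValue {
    /**
     * Fraction of samplings whose overlap count is at least the observed count.
     * This convention is an assumption made for illustration.
     */
    static double of(int observedOverlaps, int[] sampledOverlaps) {
        if (sampledOverlaps.length == 0) {
            throw new IllegalArgumentException("at least one sampling is required");
        }
        int atLeastAsExtreme = 0;
        for (int sampled : sampledOverlaps) {
            if (sampled >= observedOverlaps) {
                atLeastAsExtreme++;
            }
        }
        return (double) atLeastAsExtreme / sampledOverlaps.length;
    }
}
```

In this sketch the bias correction would enter earlier, when the random intervals behind sampledOverlaps are generated to match the GC content and mappability of the input intervals.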

Overall, we can summarize the features of GLANET as follows. GLANET utilizes a rich pre-defined annotation library that contains regions defined not only on the exons of genes but also on their intronic and regulatory regions, GO terms, KEGG pathways, and a large collection of regulatory genomic element libraries from the ENCODE project. One key feature of GLANET is that the user can expand its default library with user-defined gene sets and genomic intervals. This option makes GLANET especially suitable for research groups that generate genomic interval data or gene sets through a variety of high-throughput experiments and routinely perform enrichment analysis. Other unique features of GLANET include gene-set enrichment analysis that covers the non-coding neighborhood of the genes, regulatory sequence analysis for SNP queries, joint enrichment analysis of TF-pathway pairs, and an enrichment procedure that accounts for mappability and GC content biases separately or jointly. GLANET can be used in a variety of interesting biological applications, some of which we showcase throughout the thesis; it was also used earlier in [65].

Secondly, to the best of our knowledge, there is no tool or method for assessing the performance of the kind of enrichment analysis that we conduct. To evaluate how accounting for genomic biases and other GLANET parameters affects the Type-I error rate and power of our enrichment procedure, we designed novel data-driven computational experiments. We considered two cell lines, GM12878 and K562, each having two replicates of RNA-seq data. Using the expression levels of genes in these RNA-seq data, we determined the expressed and non-expressed genes. Based on our literature review, we labeled each histone modification and transcription factor element in our library as an activator or a repressor element. We described two experiment settings, based on more and less stringent definitions of expressed and non-expressed genes. The key idea of these experiments is as follows: we expect activator elements to be enriched in the proximal regions of expressed genes and, in contrast, repressor elements to be enriched in the proximal regions of non-expressed genes. By means of these experiments, we analyzed the effect of genomic biases, separately and jointly, on the enrichment analysis, and we calculated the element-based and cell-based Type-I error rate and power of our enrichment analysis. We observed that for input types whose mappability and/or GC distribution is not close to that of the genome, not accounting for GC and/or mappability results in large Type-I errors. Overall, our data-driven computational experiments illustrate that GLANET has high power for detecting enrichment with conservative Type-I error control. These experiments can be easily adapted by other tools to assess their own performance. To exemplify this usage, we evaluated another tool, GAT, and assessed its Type-I error rate and power. Furthermore, for comparison, we provided element-based and cell-based ROC curves as well as Type-I error and power figures for GLANET and GAT, based on the results of our data-driven computational experiments.
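The two quantities reported by these experiments can be written down in a few lines. The Java sketch below assumes that every test comes with a label saying whether enrichment is expected (for example, an activator element near expressed genes) and a flag saying whether the tool called it enriched. The class name ErrorRates and the array encoding are hypothetical.

```java
final class ErrorRates {
    /** Fraction of tests with no expected enrichment that were nevertheless called enriched. */
    static double typeIErrorRate(boolean[] calledEnriched, boolean[] enrichmentExpected) {
        int nulls = 0;
        int falsePositives = 0;
        for (int i = 0; i < calledEnriched.length; i++) {
            if (!enrichmentExpected[i]) {
                nulls++;
                if (calledEnriched[i]) falsePositives++;
            }
        }
        return nulls == 0 ? Double.NaN : (double) falsePositives / nulls;
    }

    /** Fraction of tests with expected enrichment that were called enriched. */
    static double power(boolean[] calledEnriched, boolean[] enrichmentExpected) {
        int alternatives = 0;
        int truePositives = 0;
        for (int i = 0; i < calledEnriched.length; i++) {
            if (enrichmentExpected[i]) {
                alternatives++;
                if (calledEnriched[i]) truePositives++;
            }
        }
        return alternatives == 0 ? Double.NaN : (double) truePositives / alternatives;
    }
}
```

Both methods return NaN when the corresponding group of tests is empty, so that an undefined rate is not mistaken for a rate of zero.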

In the second part of the thesis, we extended the problem of finding common overlapping intervals for 2 or 3 interval sets to n interval sets. This time, we utilized the segment tree, which is another space-partitioning data structure. We observed that the top part of the segment tree holds no intervals or only a few; intervals are mostly stored in the bottom part of the segment tree. Based on this observation, we cut the segment tree at a certain depth and indexed the segment tree nodes at this cut-off depth as well as the nodes without offspring above this cut-off depth. In this manner, we represented the original segment tree as an indexed forest of shorter segment trees. We showed that this representation reduces the search time, a reduction we verified by t-tests.
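The cutting step described above can be sketched in Java as a single traversal that collects, from left to right, the nodes at the cut-off depth together with the childless nodes found above that depth. The names CutNode and ForestCut are hypothetical, and the sketch deliberately ignores what happens to intervals stored in nodes above the cut, which the full data structure has to handle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical binary tree node; payload fields are omitted for brevity. */
final class CutNode {
    final CutNode left;
    final CutNode right;

    CutNode(CutNode left, CutNode right) {
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }
}

final class ForestCut {
    /** Returns the roots of the shorter trees, in left-to-right order. */
    static List<CutNode> roots(CutNode root, int cutoffDepth) {
        List<CutNode> index = new ArrayList<>();
        collect(root, 0, cutoffDepth, index);
        return index;
    }

    private static void collect(CutNode node, int depth, int cutoffDepth, List<CutNode> index) {
        if (node == null) {
            return;
        }
        // Index nodes at the cut-off depth and leaves that end above it.
        if (depth == cutoffDepth || node.isLeaf()) {
            index.add(node);
            return;
        }
        collect(node.left, depth + 1, cutoffDepth, index);
        collect(node.right, depth + 1, cutoffDepth, index);
    }
}
```

Because the traversal visits children left to right, the resulting list is ordered by genomic position whenever the underlying tree is, which is what allows it to serve as an index.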

Additionally, we developed an algorithm for the problem of finding at least k common overlapping intervals out of n interval sets. We constructed one segment tree for all intervals coming from the n interval sets. We kept track of the intervals and their source interval set numbers at each node. This augmentation enabled us to find common overlapping intervals immediately after the storage of intervals is completed. Because of the nature of the segment tree, one interval can be stored in more than one node, which may cause the same intervals to be output multiple times as overlapping intervals. To overcome this challenge, we proposed an augmented data structure that holds the most recently found overlapping intervals at each depth. Unfortunately, for this problem the segment tree cannot be pruned after the intervals of each individual interval set are stored; pruning has to wait until the storage of all intervals is completed. This resulted in a long run time. However, it is still shorter than that of the straightforward approach, which calls our solution for finding i common overlapping intervals for i interval sets, where k ≤ i ≤ n, for all possible combinations, nCi.
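To give a sense of how quickly the straightforward approach grows, the Java sketch below counts the number of calls it would need, that is, the sum of nCi for k ≤ i ≤ n. The class and method names are hypothetical.

```java
import java.math.BigInteger;

final class CombinationCount {
    /** Binomial coefficient C(n, i), computed exactly. */
    static BigInteger choose(int n, int i) {
        if (i < 0 || i > n) {
            return BigInteger.ZERO;
        }
        BigInteger result = BigInteger.ONE;
        for (int j = 1; j <= i; j++) {
            // Before this step result holds C(n - i + j - 1, j - 1); the division is exact.
            result = result.multiply(BigInteger.valueOf(n - i + j))
                           .divide(BigInteger.valueOf(j));
        }
        return result;
    }

    /** Sum of C(n, i) for k <= i <= n. */
    static BigInteger callsNeeded(int n, int k) {
        BigInteger total = BigInteger.ZERO;
        for (int i = k; i <= n; i++) {
            total = total.add(choose(n, i));
        }
        return total;
    }
}
```

For example, with n = 5 and k = 3 the sum is 10 + 5 + 1 = 16 calls, and the count grows rapidly as n increases.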

Research in this thesis can be extended in several directions as follows:

• Variant Call Format (VCF) has become a primary format for representing SNP, indel, and structural variation calls [32]. Supporting VCF as an accepted input format in GLANET may increase its usage.

• Noncoding RNAs play a significant role in cellular processes and in diseases. Functional annotation and enrichment with respect to noncoding RNAs such as microRNAs and lncRNAs can be conducted using the user-defined library options of GLANET. Integrating this information into the GLANET default library is left as future work.

• We employ the fork/join framework of Java 1.8 in enrichment analysis. As future work, the annotation step can also utilize parallelism.

• Currently, GLANET supports analysis only for the human genome. As future work, it can be further developed to work with model organisms such as Arabidopsis thaliana (plant), Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fruit fly), Mus musculus (mouse), and Danio rerio (zebrafish).

• Finding common overlapping intervals for n interval sets can be solved by using one segment tree for each interval set, and this solution can be applied in parallel for each chromosome. Moreover, we showed that representing one segment tree as an indexed segment tree forest decreases search time. This representation also makes it possible to search each segment tree in the forest in parallel, which we leave as future work.

• We provided a solution for finding at least k common overlapping intervals out of n interval sets. As future work, we can provide customized parallel implementations of this solution.

• We can collect and present our solutions for finding n or at least k common overlapping intervals for n given interval sets under a “Joint Overlap Analysis Framework” (JOF). One application of this framework would be the discovery of n or at least k common overlapping transcription factors (TFs), histone modifications (HMs), and/or DNaseI hypersensitive sites (DHSs), which constitute the n given interval sets.

• Furthermore, we can take the resulting common overlapping intervals from JOF and carry out enrichment analysis with respect to other interval sets of interest such as copy number variations (CNVs), SNPs, genomic variants, DNA regulatory elements, and genic regions, making use of the sampling-based enrichment analysis already provided in GLANET. This will enable us to extend JOF to a “Joint Enrichment Analysis Framework” (JEF). One application area for JEF can be finding co-enriched TFs for GWAS SNPs or CNVs.


REFERENCES

[1] The International HapMap 3 Consortium, D. M. Altshuler, R. A. Gibbs, L. Peltonen, E. Dermitzakis, S. F. Schaffner, F. Yu, et al., “Integrating common and rare genetic variation in diverse human populations,” Nature, vol. 467, pp. 52–58, Sep 2010.

[2] G. A. McVean, D. M. Altshuler, R. M. Durbin, G. R. Abecasis, D. R. Bentley, A. Chakravarti, A. G. Clark, P. Donnelly, E. E. Eichler, P. Flicek, et al., “An integrated map of genetic variation from 1,092 human genomes,” Nature, vol. 491, pp. 56–65, Oct 2012.

[3] R. McLendon, A. Friedman, D. Bigner, E. G. Van Meir, D. J. Brat, G. M. Mastrogianakis, J. J. Olson, T. Mikkelsen, N. Lehman, K. Aldape, et al., “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, pp. 1061–1068, Sep 2008.

[4] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold, “Genome-wide mapping of in vivo protein-DNA interactions,” Science, vol. 316, pp. 1497–1502, Jun 2007.

[5] R. P. Darst, C. E. Pardo, L. Ai, K. D. Brown, and M. P. Kladde, “Bisulfite sequencing of DNA,” Current protocols in molecular biology / edited by Frederick M. Ausubel ... [et al.], vol. Chapter 7, July 2010.

[6] L. Song and G. E. Crawford, “DNase-seq: A high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells,” Cold Spring Harbor Protocols, vol. 2010, pp. pdb.prot5384–pdb.prot5384, Feb 2010.

[7] J. D. Buenrostro, B. Wu, H. Y. Chang, and W. J. Greenleaf, ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. John Wiley & Sons, Inc., 2001.

[8] B. E. Bernstein, E. Birney, I. Dunham, E. D. Green, C. Gunter, and M. Snyder, “An integrated encyclopedia of DNA elements in the human genome,” Nature, vol. 489, no. 7414, pp. 57–74, 2012.

[9] S. G. Coetzee, S. K. Rhie, B. P. Berman, G. A. Coetzee, and H. Noushmehr, “FunciSNP: An R/Bioconductor tool integrating functional non-coding data sets with genetic association studies to identify candidate regulatory SNPs,” Nucleic acids research, vol. 40, no. 18, 2012.

[10] L. D. Ward and M. Kellis, “HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants,” Nucleic acids research, vol. 40, no. Database issue, pp. D930–4, 2012.

[11] P. Holmans, E. K. Green, J. S. Pahwa, M. A. Ferreira, S. M. Purcell, P. Sklar, M. J. Owen, M. C. O’Donovan, and N. Craddock, “Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder,” American journal of human genetics, vol. 85, no. 1, pp. 13–24, 2009.

[12] A. Sifrim, J. K. Van Houdt, L. C. Tranchevent, B. Nowakowska, R. Sakai, G. A. Pavlopoulos, K. Devriendt, J. R. Vermeesch, Y. Moreau, and J. Aerts, “Annotate-it: a swiss-knife approach to annotation, analysis and interpretation of single nucleotide variation in human disease,” Genome medicine, vol. 4, no. 9, p. 73, 2012.

[13] B. Bakir-Gungor, E. Egemen, and O. U. Sezerman, “PANOGA: a web server for identification of SNP-targeted pathways from genome-wide association study data,” Bioinformatics, vol. 30, no. 9, pp. 1287–1289, 2014.

[14] I. Dunham, E. Kulesha, V. Iotchkova, S. Morganella, and E. Birney, “FORGE: A tool to discover cell specific enrichments of GWAS associated SNPs in regulatory regions [version 1; referees: 2 approved with reservations],” F1000Research, vol. 4, no. 18, 2015.

[15] R. K. Auerbach, B. Chen, and A. J. Butte, “Relating genes to function: identifying enriched transcription factors using the ENCODE ChIP-Seq significance tool,” Bioinformatics, vol. 29, no. 15, pp. 1922–4, 2013.

[16] A. P. Boyle, E. L. Hong, M. Hariharan, Y. Cheng, M. A. Schaub, M. Kasowski, K. J. Karczewski, J. Park, B. C. Hitz, S. Weng, J. M. Cherry, and M. Snyder, “Annotation of functional variation in personal genomes using RegulomeDB,” Genome research, vol. 22, no. 9, pp. 1790–7, 2012.

[17] P. Cingolani, A. Platts, L. Wang le, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden, “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.

[18] W. McLaren, B. Pritchard, D. Rios, Y. Chen, P. Flicek, and F. Cunningham, “Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor,” Bioinformatics, vol. 26, no. 16, pp. 2069–70, 2010.

[19] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic acids research, vol. 38, no. 16, p. e164, 2010.

[20] P. H. Lee, C. O’Dushlaine, B. Thomas, and S. M. Purcell, “INRICH: interval-based enrichment analysis for genome-wide association studies,” Bioinformatics, vol. 28, no. 13, pp. 1797–9, 2012.

[21] C. Y. McLean, D. Bristor, M. Hiller, S. L. Clarke, B. T. Schaar, C. B. Lowe, A. M. Wenger, and G. Bejerano, “GREAT improves functional interpretation of cis-regulatory regions,” Nature biotechnology, vol. 28, no. 5, pp. 495–501, 2010.

[22] A. Heger, C. Webber, M. Goodson, C. P. Ponting, and G. Lunter, “GAT: a simulation framework for testing the association of genomic intervals,” Bioinformatics, vol. 29, pp. 2046–2048, Aug. 2013.

[23] J. Rozowsky, G. Euskirchen, R. K. Auerbach, Z. D. Zhang, T. Gibson, R. Bjornson, N. Carriero, M. Snyder, and M. B. Gerstein, “PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls,” Nature biotechnology, vol. 27, no. 1, pp. 66–75, 2009.

[24] D. Chung, P. F. Kuan, B. Li, R. Sanalkumar, K. Liang, E. H. Bresnick, C. Dewey, and S. Keles, “Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-seq data,” PLoS computational biology, vol. 7, no. 7, p. e1002111, 2011.

[25] M. S. Cheung, T. A. Down, I. Latorre, and J. Ahringer, “Systematic bias in high-throughput sequencing data and its correction by BEADS,” Nucleic acids research, vol. 39, no. 15, p. e103, 2011.

[26] Y. C. Chen, T. Liu, C. H. Yu, T. Y. Chiang, and C. C. Hwang, “Effects of GC bias in next-generation-sequencing data on de novo genome assembly,” PloS one, vol. 8, no. 4, p. e62856, 2013.

[27] Y. Benjamini and T. P. Speed, “Summarizing and correcting the GC content bias in high-throughput sequencing,” Nucleic acids research, vol. 40, no. 10, p. e72, 2012.

[28] J. Dabney and M. Meyer, “Length and GC-biases during sequencing library amplification: a comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries,” BioTechniques, vol. 52, no. 2, pp. 87–94, 2012.

[29] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “KEGG for integration and interpretation of large-scale molecular data sets,” Nucleic acids research, vol. 40, no. D1, pp. D109–D114, 2012.

[30] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene Ontology: tool for the unification of biology,” Nature Genetics, vol. 25, pp. 25–29, May 2000.

[31] P. J. Croucher, Linkage Disequilibrium. John Wiley & Sons, Ltd, 2001.

[32] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project Analysis Group, “The variant call format and VCFtools,” Bioinformatics, vol. 27, no. 15, p. 2156, 2011.

[33] M. Leclercq, A. B. Diallo, and M. Blanchette, “Prediction of human miRNA target genes using computationally reconstructed ancestral mammalian sequences,” Nucleic Acids Research, vol. 45, no. 2, p. 556, 2017.

[34] A. Yates, W. Akanni, M. R. Amode, D. Barrell, K. Billis, D. Carvalho-Silva, C. Cummins, P. Clapham, S. Fitzgerald, L. Gil, C. G. Girón, L. Gordon, T. Hourlier, S. E. Hunt, S. H. Janacek, N. Johnson, T. Juettemann, S. Keenan, I. Lavidas, F. J. Martin, T. Maurel, W. McLaren, D. N. Murphy, R. Nag, M. Nuhn, A. Parker, M. Patricio, M. Pignatelli, M. Rahtz, H. S. Riat, D. Sheppard, K. Taylor, A. Thormann, A. Vullo, S. P. Wilder, A. Zadissa, E. Birney, J. Harrow, M. Muffato, E. Perry, M. Ruffier, G. Spudich, S. J. Trevanion, F. Cunningham, B. L. Aken, D. R. Zerbino, and P. Flicek, “Ensembl 2016,” Nucleic Acids Research, vol. 44, no. D1, p. D710, 2016.

[35] U. Paila, B. A. Chapman, R. Kirchner, and A. R. Quinlan, “GEMINI: integrative exploration of genetic variation and genome annotations,” PLoS computational biology, vol. 9, no. 7, p. e1003153, 2013.

[36] F. A. San Lucas, G. Wang, P. Scheet, and B. Peng, “Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools,” Bioinformatics, vol. 28, no. 3, pp. 421–2, 2012.

[37] M. L. Speir, A. S. Zweig, K. R. Rosenbloom, B. J. Raney, B. Paten, P. Nejad, B. T. Lee, K. Learned, D. Karolchik, A. S. Hinrichs, S. Heitner, R. A. Harte, M. Haeussler, L. Guruvadoo, P. A. Fujita, C. Eisenhart, M. Diekhans, H. Clawson, J. Casper, G. P. Barber, D. Haussler, R. M. Kuhn, and W. J. Kent, “The UCSC Genome Browser database: 2016 update,” Nucleic acids research, vol. 44, pp. D717–D725, Jan. 2016.

[38] A. R. Quinlan and I. M. Hall, “BEDTools: a flexible suite of utilities for comparing genomic features,” Bioinformatics (Oxford, England), vol. 26, pp. 841–842, Mar. 2010.

[39] S. Neph, M. S. Kuehn, A. P. Reynolds, E. Haugen, R. E. Thurman, A. K. Johnson, E. Rynes, M. T. Maurano, J. Vierstra, S. Thomas, R. Sandstrom, R. Humbert, and J. A. Stamatoyannopoulos, “BEDOPS: high-performance genomic feature operations,” Bioinformatics, vol. 28, pp. 1919–1920, July 2012.

[40] A. V. Alekseyenko and C. J. Lee, “Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases,” Bioinformatics (Oxford, England), vol. 23, pp. 1386–1393, June 2007.

[41] H. Li, “Tabix: fast retrieval of sequence features from generic TAB-delimited files,” Bioinformatics, vol. 27, pp. 718–719, Mar. 2011.

[42] R. M. Layer and A. R. Quinlan, “A parallel algorithm for n-way interval set intersection,” Proceedings of the IEEE, vol. PP, no. 99, pp. 1–10, 2015.

[43] K. R. Blahnik, L. Dou, H. O’Geen, T. McPhillips, X. Xu, A. R. Cao, S. Iyengar, C. M. Nicolet, B. Ludascher, I. Korf, and P. J. Farnham, “Sole-search: an integrated analysis program for peak detection and functional annotation using ChIP-seq data,” Nucleic acids research, vol. 38, no. 3, p. e13, 2010.

[44] T. H. Cormen, Introduction to algorithms. Cambridge, Mass.: MIT Press, 3rd ed., 2009.

[45] M. Thomas-Chollier, O. Sand, J.-V. Turatsinze, R. Janky, M. Defrance, E. Vervisch, S. Brohée, and J. van Helden, “RSAT: regulatory sequence analysis tools,” Nucleic Acids Research, vol. 36, pp. W119–W127, July 2008.

[46] A. Mathelier, X. Zhao, A. W. Zhang, F. Parcy, R. Worsley-Hunt, D. J. Arenillas, S. Buchman, C.-y. Y. Chen, A. Chou, H. Ienasescu, J. Lim, C. Shyr, G. Tan, M. Zhou, B. Lenhard, A. Sandelin, and W. W. Wasserman, “JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles,” Nucleic acids research, vol. 42, pp. D142–D147, Jan. 2014.

[47] P. Kheradpour and M. Kellis, “Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments,” Nucleic Acids Research, vol. 42, pp. gkt1249–2987, Dec. 2013.

[48] C. E. Bonferroni, “Teoria statistica delle classi e calcolo delle probabilità,” Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, vol. 8, pp. 3–62, 1936.

[49] Y. Benjamini and Y. Hochberg, “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.

[50] M. Costantini, O. Clay, F. Auletta, and G. Bernardi, “An isochore map of human chromosomes,” Genome research, vol. 16, pp. 536–541, Apr. 2006.

[51] G. Bernardi, “Misunderstandings about isochores. Part 1,” Gene, vol. 276, pp. 3–13, Oct. 2001.

[52] J. Cheng, R. Blum, C. Bowman, D. Hu, A. Shilatifard, S. Shen, and B. D. Dynlacht, “A Role for H3K4 Monomethylation in Gene Repression and Partitioning of Chromatin Readers,” Molecular cell, vol. 53, pp. 979–992, Mar. 2014.

[53] A. Barski, S. Cuddapah, K. Cui, T. Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev, and K. Zhao, “High-resolution profiling of histone methylations in the human genome,” Cell, vol. 129, no. 4, pp. 823–837, 2007.

[54] W. Shu, H. Chen, X. Bo, and S. Wang, “Genome-wide analysis of the relationships between DNaseI HS, histone modifications and gene expression reveals distinct modes of chromatin domains,” Nucleic Acids Research, vol. 39, pp. 7428–7443, Sept. 2011.

[55] A. M. Deaton and A. Bird, “CpG islands and the regulation of transcription,” Genes & Development, vol. 25, pp. 1010–1022, May 2011.

[56] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and M. Müller, “pROC: an open-source package for R and S+ to analyze and compare ROC curves,” BMC Bioinformatics, vol. 12, p. 77, 2011.

[57] A. Heger, “GAT tutorial.” https://gat.readthedocs.org. Accessed: 2016-05-13.

[58] A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou, R. M. Myers, and A. Sidow, “Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data,” Nat Meth, vol. 5, pp. 829–834, Sept. 2008.

[59] S. E. Stewart, D. Yu, J. M. Scharf, B. M. Neale, J. A. Fagerness, et al., “Genome-wide association study of obsessive-compulsive disorder,” Molecular psychiatry, vol. 18, no. 7, pp. 788–98, 2013.

[60] T. Overbeek, K. Schruers, and E. Griez, “Comorbidity of obsessive-compulsive disorder and depression: prevalence, symptom severity, and treatment effect,” The Journal of clinical psychiatry, vol. 63, no. 12, pp. 1–478, 2002.

[61] F.-Y. Tsai and S. H. Orkin, “Transcription factor GATA-2 is required for proliferation/survival of early hematopoietic cells and mast cell formation, but not for erythroid and myeloid terminal differentiation,” Blood, vol. 89, no. 10, pp. 3636–3643, 1997.

[62] K. Kitajima, M. Tanaka, J. Zheng, H. Yen, A. Sato, D. Sugiyama, H. Umehara, E. Sakai, and T. Nakano, “Redirecting differentiation of hematopoietic progenitors by a transcription factor, GATA-2,” Blood, vol. 107, no. 5, pp. 1857–1863, 2006.

[63] G. Yu, F. Li, Y. Qin, X. Bo, Y. Wu, and S. Wang, “GOSemSim: an R package for measuring semantic similarity among GO terms and gene products,” Bioinformatics (Oxford, England), vol. 26, pp. 976–978, Apr. 2010.

[64] M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars, Computational Geometry: Algorithms and Applications. Springer, softcover reprint of hardcover 3rd ed. 2008 ed., Nov. 2010.

[65] C. Yao, B. H. Chen, R. Joehanes, B. Otlu, X. Zhang, C. Liu, T. Huan, O. Tastan, L. A. Cupples, J. B. Meigs, C. S. Fox, J. E. Freedman, P. Courchesne, C. J. O’Donnell, P. J. Munson, S. Keles, and D. Levy, “Integromic analysis of genetic variation and gene expression identifies networks for cardiovascular disease phenotypes,” Circulation, vol. 131, no. 6, pp. 536–549, 2015.

APPENDIX A

GLANET DATA SOURCES

Table A.1: GLANET data sources and their download dates.

Data | Source | Download Date
ENCODE DNaseI hypersensitive sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/openchrom/jan2011/idrPeaks/conservative/ | 29/03/2013
ENCODE DNaseI hypersensitive sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/dnase/jul2010/ | 29/03/2013
ENCODE Transcription factor binding sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/peaks/jan2011/spp/optimal/ | 22/03/2013
ENCODE Histone modification sites | http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/peaks/jan2011/histone_macs/optimal/ | 29/03/2013
hg19 RefSeq genes | http://genome.ucsc.edu/ | 18/11/2014
hg19 chromosome sizes | http://genome.ucsc.edu/goldenPath/help/hg19.chrom.sizes | 22/05/2013
KEGG pathways | http://rest.kegg.jp/list/pathway/hsa | 23/09/2013
KEGG pathway to gene mapping | http://www.genome.jp/linkdb/linkdb.html | 18/06/2013
GC fasta files | http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ | 19/07/2013
Mappability bigWig files | ftp://hgdownload.cse.ucsc.edu/apache/htdocs/goldenPath/hg19/encodeDCC/wgEncodeMapability/ | 18/07/2013
JASPAR CORE pfms | http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_all.txt | 26/08/2014
ENCODE motifs | http://compbio.mit.edu/encode-motifs/ | 25/02/2014
NCBI REMAP API supported assemblies | Downloaded by remap_api.pl within GLANET when a Regulatory Sequence Analysis is requested (remap_api.pl source: ftp://ftp.ncbi.nlm.nih.gov/pub/remap). | 01/04/2016
Latest ref seq assembly ids | Downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/ within GLANET each time Regulatory Sequence Analysis is requested. | 01/04/2016

Gene ids ftp://ftp.ncbi.nlm.nih.gov/gene/

DATA/gene2refseq.gz

18/11/2014

132

http://www.genome.jp/linkdb/linkdb.html

http://www.genome.jp/linkdb/linkdb.html

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

ftp://hgdownload.cse.ucsc.edu/apache/htdocs/goldenPath/hg19/encodeDCC/wgEncodeMapability/



http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_all.txt



http://compbio.mit.edu/encode-motifs/

http://compbio.mit.edu/encode-motifs/

ftp://ftp.ncbi.nlm.nih.gov/pub/remap

ftp://ftp.ncbi.nlm.nih.gov/pub/remap

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz

APPENDIX B

TYPE-I ERROR, POWER AND ROC CURVE FIGURES

In this appendix, we provide the Type-I error, power, and ROC curve figures for H4K20ME1 that result from the data-driven computational experiments, for all possible GLANET parameter and experiment settings. For the sake of completeness, the Type-I error, power, and ROC curve figures for all elements are available under http://burcak.ceng.metu.edu.tr/PhDThesis/ in BurcakOtlu_PhD_Thesis_ElementBased_TypeIError_Power_ROCCurve_Figures.pdf.

We plotted the ROC curves using the plotROC R package and compared the AUCs of the ROC curves with one another using the pROC R package.
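To make concrete what an ROC curve and its AUC summarize, the following Python sketch computes both from a set of p-values with known labels. It is only an illustration and not the plotROC or pROC implementation used for the figures; the function names `roc_points` and `auc_trapezoid`, the rule that a test is called positive when its p-value is at most the threshold, and the toy data are all assumptions made for this example.

```python
from typing import List, Tuple


def roc_points(p_values: List[float], is_enriched: List[bool]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping a threshold over the p-values.

    Assumption: a test is called positive when its p-value is <= the threshold.
    """
    positives = sum(is_enriched)
    negatives = len(is_enriched) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    points = [(0.0, 0.0)]  # threshold below every p-value: nothing is called
    for threshold in sorted(set(p_values)):
        true_pos = sum(1 for p, y in zip(p_values, is_enriched) if y and p <= threshold)
        false_pos = sum(1 for p, y in zip(p_values, is_enriched) if not y and p <= threshold)
        points.append((false_pos / negatives, true_pos / positives))
    return points  # last point is (1.0, 1.0): everything is called


def auc_trapezoid(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: three truly enriched and three non-enriched tests.
toy_p = [0.01, 0.02, 0.30, 0.04, 0.50, 0.80]
toy_labels = [True, True, True, False, False, False]
toy_auc = auc_trapezoid(roc_points(toy_p, toy_labels))
```

For the toy data, 8 of the 9 (enriched, non-enriched) pairs are ranked correctly, so the AUC equals 8/9. Comparing two AUCs statistically, as pROC does, requires an additional test that this sketch does not attempt.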

The Type-I error and power figures report Type-I error and power values for significance levels ranging from 0 to 0.25 in increments of 0.01.
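The quantities on these figures can be sketched as follows: at each significance level, the empirical Type-I error is the fraction of non-enriched (null) experiments that are rejected, and the empirical power is the fraction of truly enriched experiments that are rejected. The Python sketch below is a minimal illustration under stated assumptions, not GLANET's code: the function names, the rejection rule `p <= alpha`, and the toy p-values are hypothetical.

```python
from typing import List, Tuple


def significance_levels(start: float = 0.0, stop: float = 0.25, step: float = 0.01) -> List[float]:
    """Grid of significance levels from start to stop (inclusive) in increments of step."""
    n_steps = round((stop - start) / step)
    # Round each level to suppress floating-point drift such as 0.07000000000000001.
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def rejection_rate(p_values: List[float], alpha: float) -> float:
    """Fraction of tests rejected at level alpha (assumed rule: reject when p <= alpha)."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


def error_and_power_curves(null_p: List[float], enriched_p: List[float]) -> List[Tuple[float, float, float]]:
    """Return (alpha, empirical Type-I error, empirical power) for every level on the grid."""
    return [
        (alpha, rejection_rate(null_p, alpha), rejection_rate(enriched_p, alpha))
        for alpha in significance_levels()
    ]


# Hypothetical toy p-values: four null experiments and four truly enriched ones.
toy_null_p = [0.03, 0.40, 0.60, 0.90]
toy_enriched_p = [0.001, 0.02, 0.04, 0.30]
toy_curves = error_and_power_curves(toy_null_p, toy_enriched_p)
```

Each tuple in `toy_curves` corresponds to one x-axis position on a Type-I error or power plot. A well-calibrated test keeps the Type-I error at or below the significance level while the power rises quickly.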


Figure B.1: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in GM12878 for (EOO,CompletelyDiscard,Top5).

Figure B.2: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in GM12878 for (NOOB,CompletelyDiscard,Top5).

Figure B.3: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in K562 for (EOO,CompletelyDiscard,Top5).

Figure B.4: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in K562 for (NOOB,CompletelyDiscard,Top5).

Figure B.5: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in GM12878 for (EOO,TakeTheLongest,Top20).

Figure B.6: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in GM12878 for (NOOB,TakeTheLongest,Top20).

Figure B.7: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in K562 for (EOO,TakeTheLongest,Top20).

Figure B.8: Element-based (a,b) Type-I Error, (c,d) Power and (e,f) ROC Curves for H4K20ME1 in K562 for (NOOB,TakeTheLongest,Top20).

CURRICULUM VITAE

PERSONAL INFORMATION

Surname, Name: Otlu, Burçak

Nationality: Turkish (TC)

Date and Place of Birth: 10.09.1977, Izmir

Phone: 0 312 210 5541

Fax: 0 312 210 5544

EDUCATION

| Degree | Institution | Year of Graduation |
| M.S. | Department of Computer Engineering, METU | 2002 |
| B.S. | Department of Computer Engineering, METU | 1999 |
| High School | Ankara Cumhuriyet High School | 1995 |
| High School | Ankara Science High School | 1994 |

PROFESSIONAL EXPERIENCE

| Year | Place | Enrollment |
| 2010-2016 | Middle East Technical University | Research Assistant |
| 2006-2009 | Solveka Software | Senior Functional Developer |
| 2005-2006 | Oyak Technology | Software Engineer |
| 1999-2004 | Middle East Technical University | Research Assistant |


PUBLICATIONS

In Preparation

1. Joint Overlap Analysis Framework, B. Otlu, T. Can (in draft)

International Journal Publications

1. B. Otlu, C. Firtina, S. Keles, O. Tastan, GLANET: Genomic Loci Annotation and Enrichment Tool, Bioinformatics, accepted 10 May 2017, published online 24 May 2017.

2. C. Yao, B.H. Chen, R. Joehanes, B. Otlu, X. Zhang, C. Liu, T. Huan, O. Tastan, L.A. Cupples, J.B. Meigs, C.S. Fox, J.E. Freedman, P. Courchesne, C.J. O’Donnell, P.J. Munson, S. Keles, D. Levy, Integromic analysis of genetic variation and gene expression identifies networks for cardiovascular disease phenotypes, Circulation, Volume 131, Issue 6, 10 February 2015, Pages 536-549. (printed)

International Conference Poster Presentations

1. GLANET: Genomic Loci Annotation and Enrichment Tool, B. Otlu, O. Tastan, S. Keles, The 13th European Conference on Computational Biology, ECCB, 7-10 September 2014, Poster Presentation, Strasbourg, France

AWARDS AND SCHOLARSHIPS

TÜBİTAK, 2211-C PhD Scholarship (2014-2017)

METU, PhD Student Lecture Performance Award (2012)
