Spatiotemporal Patterns of Genomic Expression in the Mammalian Brain

Noa Liscovitch

Interdisciplinary Studies Unit Gonda Multidisciplinary Brain Research Center

Ph.D. Thesis

Submitted to the Senate of Bar-Ilan University

Ramat-Gan, Israel July 2014

This work was carried out under the supervision of Dr. Gal Chechik

Gonda Multidisciplinary Brain Research Center, Bar-Ilan University

Acknowledgements

I’d like to express my deepest gratitude to my PhD advisor, Gal Chechik. Although I was a biology student with no

computational background at all, Gal took a chance on me when he accepted me as a student in his lab for

a four-year mutual commitment. This chance paid off tremendously for me, as I had the opportunity to

be mentored by a PI with a rare “hands-on” approach: teaching me everything from high level biological

and computational principles to everyday advice on code preparation and organization, always with an

incredible amount of patience for any of my questions, trivial as they may have seemed to him. When I was a young

research student, Gal always encouraged me to think independently and believe in my ideas, while

respecting my pace and giving me great personal space to progress as I saw fit, making this long journey

towards a PhD much more manageable and enjoyable.

I also thank my lab mates, especially the four whom I have been lucky to meet at the lab almost every day for the

last four years: Hadas Taubman, Lior Kirsch, Ossnat Bar Shira and Uri Shalit. Thank you for your friendship

and for endless amounts of professional and personal advice.

The staff of the Gonda secretariat: Aliza Shadmi, Asi Kirsch, Henia Gal, Ma’ayan Tsibelman and Tami

Rubenov, have been incredibly helpful, with a welcoming smile on their faces, and I could always count

on them for help in figuring out complex university bureaucracies or lending coffee and milk when needed.

My family: my mother, Billa Harari-Liscovitch and brother Dror Liscovitch, for their continuous support

and interest in my research and attempts to read my papers, and my father, Moti Liscovitch, who passed

away just several days before I started this PhD, and who I have missed dearly every single day since.

Lastly, my husband Lior: thank you for making me laugh, for fixing my computer, and for your love and support.

Table of Contents

Abstract

Chapter 1: Introduction

1.1. What is gene expression? Measuring the transcriptome as a proxy to the proteome

1.2 Technologies for measuring mRNA transcripts
1.2.1 DNA microarrays
1.2.2 RNA-Sequencing
1.2.3 In-situ hybridization (ISH)

1.3 Using genome-wide patterns of expression to study neural processes
1.3.1 Gene expression in brain development
1.3.2 Spatial patterns of expression in multiple scales; regional to single-cells
1.3.3 Inferring gene function from neural co-expression patterns
1.3.4 Beyond expression - post-transcriptional mechanisms in the brain

1.4 Dissertation outline

Chapter 2: Specialization of neural expression during mouse development

2.1 Introduction

2.2 Results
2.2.1 Changes in expression regionalization during development
2.2.2 Functional characteristics of early and post-natal regionalization
2.2.3 Expression conservation across regions and their embryonic origins
2.2.4 Comparison with human development

2.3 Methods
2.3.1 Data acquisition and pre-processing
2.3.2 Selecting brain region delineation
2.3.3 Contribution of individual genes to the hourglass shape and functional analysis
2.3.4 Identifying genes with similar sequences
2.3.5 Constructing reference curves for correlation analysis
2.3.6 Visualizing inter-region distances
2.3.7 Dissimilarity of one region to the rest of the brain
2.3.8 Mouse-human comparison

2.4 Discussion

Chapter 3: Methods to represent neural ISH images

3.1 Introduction

3.2 A visual representation of ISH images
3.2.1 Feature extraction
3.2.2 Feature aggregation using “Bags of visual words”
3.2.3 Applying a spatial pyramid kernel to the images
3.2.4 Using the representations for classification

3.3 A functional representation of ISH images
3.3.1 Data filtering and preprocessing
3.3.2 Creating the representations
3.3.3 Choosing parameters for analysis

Chapter 4: Analysis of neural ISH images

4.1 Explainable gene coexpression patterns using ISH functional representations
4.1.1 Calculating image-image similarities
4.1.2 Robustness of bag-of-words representations
4.1.3 Predicting functional annotations using brain ISH images
4.1.4 Comparison with Neuroblast, the ABA image-correlation tool
4.1.5 Identifying and explaining similarities between GABAergic neuron markers
4.1.6 Finding important spatial patterns in different scales using SIFT "visual words"
4.1.7 Inferring new gene functions via explainable similarities

4.2 Using ISH images to predict neural disease-related genes
4.2.1 Image classification based on disease-gene markers
4.2.2 Validation of results

4.3 Localizing genes to cerebellar layers using ISH image classification
4.3.2 Genome-wide predictions of cerebellum layer markers
4.3.3 Characterizing layer-specific genes

Chapter 5: Patterns of RNA editing in the brain

5.1 Introduction

5.2 Results
5.2.1 ADAR and ADARB1 expression in the brain
5.2.2 ADAR expression is positively correlated with potential editing targets
5.2.3 Effect of Alu location in the gene
5.2.4 Specificity of ADAR-target correlations
5.2.5 Relation of ADAR-target co-expression and editing potential in targets
5.2.6 Correlations with ADAR over development

5.3 Methods
5.3.1 The data
5.3.2 Choosing target and background sets
5.3.3 Testing ADAR-target correlations at different Alu locations
5.3.4 Functional analysis of gene sets

5.4 Discussion

Concluding remarks

References

Hebrew abstract

List of Figures

Figure 1.1: Measuring gene expression on the tissue using in situ hybridization

Figure 1.2: Mammalian brain development

Figure 1.3: Brain regionalization and patterning is controlled by gene expression

Figure 2.1: mouse brain developmental timeline.

Figure 2.2: Mean pair-wise dissimilarities between the regions.

Figure 2.3: robustness of hourglass shape to the sampling genes.

Figure 2.4: The hourglass shape is robust when removing highly variable genes.

Figure 2.5: The hourglass shape is robust throughout the brain.

Figure 2.6. Functional characterization of hourglass shape.

Figure 2.7. Sequence similarity vs. spatial correlation of gene pairs belonging to the GO category 'neuron fate commitment'.

Figure 2.8: cosine dissimilarity curve when computed with the three categories showing the largest correlation with each reference curve.

Figure 2.9: Changes in dissimilarity across individual brain regions.

Figure 2.10: Hierarchical clustering of 11 large brain regions over development.

Figure 2.11: Comparison with human data.

Figure 2.12: Cross-correlation between mouse and human expression profiles over development

Figure 2.13: Comparison with human data.

Figure 3.1: gene expression ISH images for the genes.

Figure 3.2: Gene expression for each gene was measured on a brain from a different individual mouse.

Figure 3.3: Calculating SIFT descriptors.

Figure 3.4: various patterns of expression taken from one ISH image, at the same scale.

Figure 3.5: Calculating LBP features.

Figure 3.6. Representing images using the Bag-of-visual-words model.

Figure 3.7: A spatial pyramid approach to extracting dense SIFT features.

Figure 3.8: Using compact ISH image representation as input for classifiers.

Figure 3.9: Each image series was represented with three slices.

Figure 3.10: Regular and expression-masked examples of ISH images as provided by the Allen Brain Atlas.

Figure 3.11. The raw data.

Figure 3.12. Illustration of the image processing pipeline.

Figure 3.13: Mean test-AUC values for dictionary size K=100, 200, 500, 1000.

Figure 3.14: Mean test-set AUCs for dictionary size K=100 versus K=1000.

Figure 3.15: Mean AUC (averaged over test-splits) for the GO categories vs. GO category size (number of genes in the category).

Figure 4.1: The similarity in the representation of same-gene pairs and different-gene pairs.

Figure 4.2. AUC scores for GO categories related to the nervous system and the remaining categories.

Figure 4.3. Precision at top-K for similarity.

Figure 4.4. Representing ISH images with visual words.

Figure 4.5. The visual words important in classifying Add2 GO categories are overlaid on the Add2 ISH image.

Figure 4.6: ROC curves for (A) Parkinson’s disease predictions and (B) epilepsy predictions.

Figure 4.7. Comparison with Purkinje-deficient mice and layer enrichment for cell types.

Figure 4.8: Examples of novel genetic markers.

Figure 5.1. ADAR and ADARB1 expression in the human brain based on the ABA-2013 dataset.

Figure 5.2. The distribution of spatial correlation values between ADAR and targets and between ADAR and a background set.

Figure 5.3: Effect vs. Alu location.

Figure 5.4. The distribution of spatial correlation values between ADAR and targets containing Alus in different locations and between ADAR and a background set.

Figure 5.6: 2D histograms of the correlation of genes with ADAR vs. the number of Alu repeats the genes contain.

Figure 5.7. The distribution of spatial correlation values between ADAR and targets with intronic Alus (orange) and between ADAR and a background set (blue), at different time points.

Figure 5.8. ADAR-target correlations over development.

Figure 5.9: Differential co-expression of ADAR and targets.

List of Tables

Table 2.1. Mean contribution values of GO categories at E11.5 and P28.

Table 3.1: Pearson's rho correlation values between AUC results for 2081 categories, compared across the 4 different dictionary sizes.

Table 4.1. The GO categories classified with highest test-set AUC values.

Table 4.2. Top-10 GO annotations explaining the similarities between the gene Synpo2 and Npepps and Rasa4.

Table 4.3: Top 10 predicted genes for the two diseases, and corresponding prediction scores.

Table 4.4: Prediction validation using two datasets: GAD and Linghu 2009

Table 4.5. Functional enrichment of genes localized to the white matter

Table 4.6. Functional enrichment of genes localized to the Purkinje layer

Table 4.7. Functional enrichment of genes localized to the granular layer

Table 4.8. Functional enrichment of genes localized to the molecular layer

Table 5.1. Number of target genes and background genes used in the analyses

List of Abbreviations

ABA: Allen Brain Atlas

AUC: Area Under the ROC Curve

BoW: Bag of Words

cDNA: Complementary DNA

devABA: Allen Developing Mouse Brain Atlas

FACS: Fluorescence-Activated Cell Sorting

GO: Gene Ontology

GBA: Guilt By Association

ISH: In-Situ Hybridization

LBP: Local Binary Patterns

PCA: Principal Component Analysis

PD: Parkinson’s disease

ROC: Receiver Operating Characteristic

SIFT: Scale Invariant Feature Transform

SVM: Support Vector Machine

Abstract

The vast complexity of the brain in terms of structure, function and development is enabled by the

coordinated work of thousands of genes in time and space. In fact, around 80% of genes are differentially

expressed in the brain, and more than half of mouse genes have been found to be involved in brain

development. Although considerable effort and progress have been made in the past century to advance

the field of neurogenomics, the rules and driving forces behind a proper execution of an organism's

functional brain are far from being fully understood.

Patterns of gene expression in the brain have been studied for decades in the context of single genes or

small groups of genes, but the complexity and scale of the mammalian brain call for the use of genomic

approaches to the study of expression patterns in the developing and the adult brain. In the last decade,

new high-throughput methods for biological data collection have enabled the accumulation of large neural

gene-expression datasets, allowing the study of complex neural processes from an integrative, genomic

point of view.

In this dissertation, several large neural gene expression databases are used to analyze genome-scale

patterns of expression from several angles: developmental, spatial and functional. The intention is to shed

light on higher-order principles of brain organization and function, and also on single-gene function in

the context of neural processes.

We first look into one of the most complex biological processes: brain development. As the brain develops,

specific regions are formed, their structure and function reflected in unique sets of expressed genes. We

investigated the temporal dynamics of changes in regional gene expression patterns throughout mouse

brain development. We identify a neurotypic phase around the time of birth, in which patterns of gene

expression become more homogeneous across the brain, creating an ‘hourglass’ shaped expression

divergence profile. We characterize the biological processes, genes and brain regions responsible for this

pattern, and also compare mouse neurodevelopmental expression patterns with parallel data from

human, finding striking similarities and differences between the two species.

We then describe methods to exploit the abundance of spatial information that exists in high-resolution

images that show a mapping of gene expression in brain tissues, by employing computer vision methods

to represent and classify the images. Methods for feature extraction and image representation have been

developed for natural images, so high-resolution images of gene expression pose a unique challenge for

analysis. After creating a representation for the images, we can use it as input for classifiers. We use the

representations calculated for neural expression images to extract meaningful biological information such

as layer-specific gene markers in the mouse cerebellum, identify spatial co-expression profiles, and predict

functional annotations for genes and disease-gene markers.

Finally, we look beyond gene expression and examine patterns of A-to-I RNA editing by ADARs, a post-transcriptional

modification of pre-mRNA that is essential for normal life and development in vertebrates.

Although most human genes have been shown to undergo editing, the exact role of RNA editing is still

unclear, and various functions have been proposed to explain its operation. We addressed one current

hypothesis stating that editing is a way to negatively regulate gene expression, by looking at co-expression

of ADAR and potential RNA editing targets. Instead of the negative co-expression expected, we found a

positive one, suggesting a complex regulation mediated by RNA editing in the human brain.

Chapter 1: Introduction

The brain is our most complex organ; the human brain is composed of 50-100 billion neurons, forming an

estimated 10^14 connections. There are numerous types of neurons that differ from each other

in their functional and morphological properties. Neurons in the vertebrate brain are also supported by a

vast number of glial cells, which comprise many subtypes as well, varying by function and location. Brain

cells are organized in different compositions and spatial patterns to form cell layers, which in turn form functionally

distinct brain regions. This complexity in structure and function is reflected in gene expression profiles

that are specialized in time and space. Given this, it is hardly surprising that around 80% of genes are

differentially expressed in the brain (Hawrylycz et al. 2012), and that more than half of mouse genes have

been found to be involved in brain development (Waterston et al. 2002).

Although considerable effort and progress have been made in the past century to advance the field of

neurogenomics, the rules and driving forces behind a proper execution of an organism's functional brain

are far from being fully understood. Many of the genes involved in brain patterning and function remain

unidentified. Gene expression in the brain is governed by regulatory networks of transcription factors and

other regulatory proteins controlling their concentrations in a highly localized and timely manner. Failure

to express the right gene at the right time and place can cause many neural diseases that have been found

to have a genetic basis: autism, Down syndrome, fragile X syndrome, Rett syndrome and

neurofibromatosis, to name a few (Walker, Russell, and Hodgetts 1987).

Patterns of gene expression in the brain have been studied for decades in the context of single genes or

small groups of genes, but the complexity and scale of the mammalian brain described above call for the

use of genomic approaches to the study of expression patterns in the developing and the adult brain.

Current methods to measure gene expression have significantly improved in accuracy, cost and runtime

(Malone and Oliver 2011). In the last decade, large amounts of data were accumulated using methods such

as DNA microarrays and RNA sequencing. Specifically, a couple of extensive gene-expression datasets are

available for the study of the brain (Lein et al. 2007; Sunkin et al. 2012). In addition, data from many

individual experiments were assembled into large repositories. As a result, a genomic approach to

neuroscience is enabling us to study complex neural processes from an integrative point of view (Boguski

and Jones 2004; Zhong and Sternberg 2007).

In this dissertation, several large neural gene expression databases are used to analyze genome-scale

patterns of expression from several angles: developmental, spatial and functional. The intention is to

shed light on higher-order principles of brain organization and function, and also on single-gene function

in the context of neural processes. Specifically, we look into patterns of expression in the developing brain,

explore methods to extract visual and semantic features from images showing high-resolution expression

mapping in the brain, identify markers of layers in the brain and gene co-expression patterns, examine

patterns of genes known to be related to neural disease and, finally, look into RNA editing, a process that

is known to be especially important in the brain.

The introductory chapter is organized as follows: I start by defining gene expression and discussing the

concept of measuring mRNA expression as a proxy for protein abundance. I then describe several popular

methods to measure gene expression. I continue by reviewing some of the approaches that have been

used so far to investigate genomic patterns of neural gene expression which are relevant to the work

presented in this thesis. The introductory chapter concludes with an outline of this dissertation.

1.1. What is gene expression? Measuring the transcriptome as a proxy to the proteome

A well-known concept in biology, called “the central dogma”, describes the main flow of genetic

information in the cell. The genetic information is stored in DNA, which codes for proteins. Proteins carry

out most of the actual work in the cell. The transfer of information from DNA to protein is mediated by molecules called

messenger RNA (mRNA). The specific phenotype and function of each cell are determined by the subset of

proteins it expresses at any given moment. Proteins vary greatly in size and 3D structure, while

mRNA molecules have a simpler and more uniform structure: each is a single-stranded

molecule, with four ribonucleotides as building blocks. The complex and highly variable nature of proteins

makes it difficult to measure protein abundances in cells and tissues, and so high-throughput methods

have focused mainly on the measurement of mRNA abundances as a proxy for protein levels. It is clear

that mRNA levels do not precisely reflect protein abundances, due to many post-transcriptional regulatory

processes such as mRNA and protein degradation, and varying rates of translation. Still, the general

assumption is that mRNA levels are correlated with protein levels. The concordance between mRNA and

protein levels has been studied, for example in yeast (Foss et al. 2007) and in Arabidopsis (Fu et al. 2009),

finding a weak but significant correlation between the two measures. A more recent study conducted in

mouse showed a modest correlation (r = 0.27) between the transcriptome and the proteome (Ghazalpour

et al. 2011). More recently, another study argued that both protein abundances and the relative importance

of transcription have been significantly underestimated (Li, Bickel, and Biggin 2014). While this

issue is still controversial, the wide consensus today is to measure gene expression using mRNA

abundances. Approximation of protein abundances may improve in the coming years with the advent of

more accurate ways to measure mRNA transcripts, or even with the development of high-throughput

methods to measure the proteome itself.
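
As an illustration of how such a concordance can be quantified, the following minimal sketch computes a Pearson correlation coefficient between paired mRNA and protein measurements. It is not taken from the cited studies: the function name `pearson_r` and the toy abundance values are hypothetical, and the cited works may have used other measures or preprocessing.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (log-scale abundances for five genes), not real data.
mrna_levels = [1.0, 2.0, 3.0, 4.0, 5.0]
protein_levels = [2.1, 1.9, 3.5, 3.0, 4.8]
r = pearson_r(mrna_levels, protein_levels)
print(f"r = {r:.2f}")
```

In practice each vector would hold one value per gene (or per sample), and a coefficient near zero would indicate that mRNA abundance is a poor predictor of protein abundance.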

1.2 Technologies for measuring mRNA transcripts

1.2.1 DNA microarrays

Most genome-scale gene expression studies in the past decade were conducted using DNA microarrays,

a method that measures expression levels of thousands of mRNA transcripts of interest, effectively

providing a “snapshot” of the transcriptome in a given condition, such as a tissue, organism,

subject, time point or cell type. In a microarray experiment, mRNA extracted from a tissue is hybridized

to a matrix of wells, containing different complementary DNA (cDNA) strands for all the sequences of

interest. Every well in the matrix shows transcript abundance for a different gene. There are two types of

microarray experiments: dual channel experiments, where two experimental conditions are labeled in

different colors and compared to each other, and single channel experiments, where transcript

abundance for the genes is compared to overall mRNA concentration in the sample. Microarray

technology is very powerful because it provides fast measurements for thousands of genes at a very low

cost, as opposed to older methods such as northern blotting and RT-PCR, and has thus been widely used

in biological and medical sciences since its introduction in 1995 (Schena et al. 1995). The two main

technical challenges when using this method are (a) quantifying the fluorescent signal, and (b) the need

to know the sequences of interest in advance, in order to create the library of the probes.
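
To make the dual-channel comparison concrete, the sketch below converts paired fluorescence intensities into per-gene log2 ratios, a common way to summarize two-channel data. It is only an assumed illustration and not the quantification pipeline of any dataset used in this thesis: the function name `log2_ratios`, the pseudocount and the intensity values are hypothetical.

```python
import math


def log2_ratios(test_intensities, reference_intensities, pseudocount=1.0):
    """Per-gene log2(test / reference) for a two-channel experiment.

    Both arguments map gene name -> background-corrected intensity.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    if pseudocount < 0:
        raise ValueError("pseudocount must be non-negative")
    ratios = {}
    for gene, test_value in test_intensities.items():
        ref_value = reference_intensities[gene]
        ratios[gene] = math.log2((test_value + pseudocount) / (ref_value + pseudocount))
    return ratios


# Hypothetical intensities for three genes in two labeled conditions.
test_channel = {"geneA": 399.0, "geneB": 99.0, "geneC": 49.0}
reference_channel = {"geneA": 99.0, "geneB": 99.0, "geneC": 199.0}
ratios = log2_ratios(test_channel, reference_channel)
# Positive values: higher in the test condition; negative: higher in the reference.
print(ratios)
```

A log scale is used so that a two-fold increase and a two-fold decrease are symmetric around zero.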

1.2.2 RNA-Sequencing

In recent years, several new sequencing technologies have been developed that make possible another

type of sequence-based mRNA annotation, which has been dubbed RNA-Seq (Mortazavi, Williams et al.

2008). Instead of sequencing RNAs one at a time, these new technologies produce sequence information

about most of the RNAs in a biological sample in a single experiment. The underlying sequencing

technologies, however, are incapable of producing full transcript sequences, as is possible with full-length

cDNA sequencing. Instead, they recover hundreds or thousands of short independent subsequences from

the mRNAs, with typical lengths of a few dozen base pairs. As with the earlier methods, these are

either aligned to a known reference genome, or they are mapped in an overlapping manner for a de novo

assembly of a transcriptome. While RNA-Seq does not require previous knowledge of reference sequences

and can be quantified in a more exact manner (the output data is essentially transcript counts), this method

is relatively new, and its measurement noise and biases are not as well understood as those of microarrays

(Auer and Doerge 2010; Malone and Oliver 2011).

1.2.3 In-situ hybridization (ISH)

A complementing aspect of gene expression studies is the spatial localization of mRNA or its protein

product. A popular method for acquiring this information is In Situ Hybridization (ISH) (Lein et al. 2007).

ISH is a potent method allowing for the localization of nucleic acid targets in fixed tissues, therefore

obtaining high resolution spatial information about gene expression. ISH should not be confused with

single-cell fluorescent in situ hybridization (FISH) image analysis, which aims to identify subcellular

structures. The principle of ISH is that mRNA, after several steps of preservation and fixation, can be

detected by using a complementary, fluorescently labeled probe, and gene expression can then be

measured in the context of tissue/cell morphology.

In an ISH experiment, a fluorescently labeled probe is hybridized to a complementary mRNA strand in the

tissue itself, which is thinly sliced, mapping the transcripts to their original location. Figure 1.1 depicts the

ISH process. This method measures RNA expression at a very high, even sub-cellular, spatial resolution,

but each experiment is typically limited to measure expression for one gene or a small number of genes.

1.3 Using genome-wide patterns of expression to study neural processes

The work presented in this dissertation attempts to gain insight into brain and gene function from the

analysis of genomic patterns of expression in the brain. This is done considering several aspects of gene

expression: temporal, looking at patterns of expression that evolve over time in the developing brain;

spatial, considering patterns of expression over regions, cell layers and single cells; evolutionary, where

patterns arising in different species are compared; and functional, taking a closer look at gene

co-expression patterns in the healthy and diseased brain. Finally, I discuss possible mechanisms of

post-transcriptional regulation of expression via RNA editing in the brain.

1.3.1 Gene expression in brain development

The vertebrate nervous system develops from the neural tube that has a posterior part, which later

develops into the spinal cord, and an anterior part, which divides into 3 primary vesicles: the

prosencephalon, the mesencephalon and the rhombencephalon. The prosencephalon further develops into

two secondary vesicles: the telencephalon and the diencephalon. The most posterior vesicle, the

rhombencephalon, forms two secondary vesicles as well: the metencephalon and the myelencephalon

(Figure 1.2A).

Brain regions are formed and wired through a series of partially overlapping cellular processes. The first

phase involves localized proliferation of neural precursor cells. These cells further divide to become

neurons and glia cells. The partially differentiated neural cells migrate from their site of origin to their final

Figure 1.1: Measuring gene expression on the tissue using in situ hybridization. Complementary DNA (cDNA) of the target gene’s mRNA is fluorescently labeled and placed onto the tissue of interest, which is thinly sliced. The cDNA is hybridized to the target mRNA and the tissue is imaged. The cells containing the mRNA of the target gene are fluorescently labeled. Figure created by Lior Kirsch.

locations in the nervous system (Ward et al. 2003). During migration, the neurons develop growth cones

that extend into axons and dendrites, forming synapses with other neurons (O’Connor and Tessier-Lavigne

1999). The neural cells further differentiate to form mature, specialized neurons and glia, and aggregate

into identifiable structures. Further steps include selective death of neurons, elimination of some

neuronal connections and stabilization of others (Buss and Oppenheim 2004). These modifications

continue throughout the organism’s lifetime, as a result of its experiences and its external and internal

environment. The major steps in human brain development are shown in Figure 1.2B, and a parallel

timeline of development for rat and human brains is shown in Figure 1.2C.

Figure 1.2: Mammalian brain development. (A) Division of the neural plate to the five embryonic vesicles. (B) Major events in

human brain development from conception to adulthood. Figure taken from (Tau and Peterson 2010). (C) The brains of rat and

human embryos at several matched stages of development (Bayer et al. 1993).

The creation of more refined and distinct regions in function and cytoarchitecture is called brain

regionalization. This process is usually carried out by genes whose localized expression in the brain signals

and induces the process of neural patterning: creating these functionally distinct regions in the brain. To

develop a properly functioning brain, the genes involved in development need to be expressed at precise

times and in particular locations. Characterizing the spatiotemporal expression patterns of those genes is

therefore of great importance in the field of developmental neuroscience.

Over the years, many development-related genes that have unique expression patterns have been

identified. For example, the spatial pattern of the Sonic Hedgehog gene (Shh) was found to be responsible

for generating different cell types in the neural tube, during early stages of vertebrate development, and

also responsible for generating the boundaries between the prethalamus and thalamus, and the pallium

and subpallium via regulatory interactions with other signaling molecules (Cavodeassi and Houart 2012)

(Figure 1.3A). Shh acts as a signaling molecule, whose concentration gradient determines the activation

levels of different homeobox transcription factors (Hox), responsible for the differentiation of neural cells

into different neuron types (Briscoe et al. 1999). The Hox gene family was first discovered by Edward

Lewis, who also found that the spatial patterning of the family members along the anterior-posterior axis

of the fruit-fly Drosophila melanogaster is responsible for the specification of each body segment as a

unique one with a unique function (Lewis 1978). Remarkably, these properties of the Hox family have

been preserved over evolution and they are also responsible for hindbrain patterning in the vertebrate

brain (Figure 1.3B). They include 4 classes which together control hindbrain segmentation into

rhombomeres in a combinatorial manner (McGinnis and Krumlauf 1992).

Figure 1.3: Brain regionalization and patterning is controlled by gene expression. (A) Shh expression mediates the creation of

boundaries between the prethalamus and thalamus, and the pallium and subpallium via regulatory interactions with other

signaling molecules. (B) Hox family genes combinatorially control segmentation in a highly evolutionarily conserved manner.

Although considerable effort and progress have been made in the past century to advance the field of

developmental neurobiology, the rules and driving forces behind a proper execution of an organism's

developmental scheme are far from being fully understood, and many genes involved in development

remain unidentified. Since it is hypothesized that more than half of our genes play a role in brain

development (Waterston et al. 2002), a natural approach to the issue is to analyze transcriptomic patterns

of expression, which allows the identification of new gene functions and larger principles of brain organization.

Most transcriptomic studies have been conducted on mouse; gene expression levels from publicly

available repositories have been used to identify putative transcription factors involved in M. musculus

brain development (Gray et al. 2004). Large-scale gene expression studies were also used to monitor

changes in gene expression in the neocortex and cerebellum of aging mice (Lee, Weindruch, and Prolla

2000), and to characterize different brain regions by expression patterns (Zapala et al. 2005). In recent

years several atlases of gene expression over development in the human brain have been created and

made available to the scientific community, allowing the investigation of genomic patterns of expression in the

brain and leading to important findings. New cortical germinal zones or postmitotic neurons have

been identified as sites of dynamic expression for many genes associated with neurological or psychiatric

disorders (Miller et al. 2014), differential gene expression was found to be more pronounced before birth

in the human brain (Kang et al. 2011), embryonic transcriptome signatures have been identified in adult

patterns of expression (Hawrylycz et al. 2012) and the relationship between genome and transcriptome

was studied in the developing brain, finding that race plays a minor role in the determination of gene

expression over cortical development, even when considering pronounced genetic differences between

individuals (Colantuoni et al. 2011).

The plethora of neural transcriptomic and genomic data, with the addition of other data modalities such

as protein/gene interactions and neural/neuronal connectivity data, suggests that the field of

developmental neurogenomics will provide some exciting breakthroughs in the upcoming years. Chapter

two of this dissertation is dedicated to the study of the dynamics of inter-regional changes in gene

expression over mouse brain development, with a comparison to similar patterns in human brain

development.

1.3.2 Spatial patterns of expression at multiple scales: regional to single cells

The brain is organized at multiple scales: from large regions with different functionalities through

specialized sub-regions, to single cells composing each sub-region and even smaller structures such as the

synaptic cleft or post-synaptic density, or neural subcellular compartments such as the synapse, the axon

and the dendrite. Many questions remain unanswered regarding the functionality

of many of these neural entities, and their relation to one another. For example, it is still unknown which

regions are composed of which cell types, what are the transcriptomic profiles of the different types of

neurons and glia and what is the relationship between different types of neurons and other important

neural cell types such as astrocytes.

Measuring transcriptomic profiles of regions using current methods usually involves a mixture of the many

cell types that exist in a tissue, making it difficult to delineate the patterns into functionally distinct ones.

There has been some progress in the purification of specific cell types and their gene expression signatures

(Bryant et al. 1999; Miller et al. 2009; Thomas et al. 2012; Vincent et al. 2002; Wu et al. 2014), but this has

been limited to a small number of cell types so far, and requires a lot of experimental effort.

Computational methods have also been developed to tackle this problem; for example, co-expression patterns

have been used to infer cell type compositions for different brain regions (Grange et al. 2014).

In recent years, the availability of ISH images of brains has grown significantly, making it possible to capture neural

gene expression patterns at cellular resolution. The data comes in the format of images, calling for the

implementation of image processing and analysis techniques and the development of new, specialized

tools. Chapter 3 of this dissertation discusses several methods to extract features and analyze neural ISH

images using approaches adapted from computer vision, a field that generally focuses on extracting

information from natural images. In chapter 4 these techniques are implemented to discover layer specific

expression of genes, neural co-expression patterns and potential gene-disease candidates.

1.3.3 Inferring gene function from neural co-expression patterns

Since the completion of the human genome project, and consequently the sequencing of genomes of

many other organisms, many new genes have been discovered (Venter et al. 2001). In spite of much

progress in attempts to characterize new genes using high throughput methods and computational

analyses, we still do not know the function of most genes. While it was shown that around 80% of the genes

are differentially expressed in the brain, it seems that the scientific community focuses its effort on a very

small number of genes. In fact, according to United States National Institute of Mental Health Director

Thomas Insel, over 99% of the neuroscience literature focuses on only 1% of the estimated 15,000–16,000

genes expressed in the brain (Gewin 2005). The genes that do have some known associated function are

likely to have more, unknown roles that change in different neural regions or in different developmental

stages. Kirsch and Chechik indeed show that the majority of human genes show a transcriptomic spatial

signature that reflects the embryonic origin of neural regions, suggesting that genes responsible for brain

patterning over development assume different roles in the adult brain (Kirsch and Chechik, in

preparation).

One way to infer new function is to look for genes that are expressed similarly to genes with known

functions. The similarity in expression suggests that the genes participate in similar biological processes.

This concept is well established as the “guilt by association principle” (GBA) (Oliver 2000). GBA essentially

means that genes that share functionality are likely to share other biological properties like similar

structure or protein domains, and are more likely to physically interact or associate in other manners.

This principle has been implemented successfully to discover new gene functions in many datasets (Horan

et al. 2008; Saito, Hirai, and Yonekura-Sakakibara 2008; Lee et al. 2004). An extension to the idea of

inferring functionality via coexpression is to look at differential coexpression, which is defined as changes

in gene–gene correlation structure between two sets of samples which are phenotypically distinct (de la

Fuente 2010). This approach can be used to detect regulatory transcriptional rewiring over development

(Gillis and Pavlidis 2011) or regulatory mechanisms that differ between healthy and unhealthy tissues or

subjects (Choi et al. 2005).
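
As a minimal illustration of differential coexpression, the following Python sketch compares the Pearson correlation of one gene pair between two groups of samples. The function names and the expression values are hypothetical and serve only to illustrate the definition above; they are not taken from any of the cited studies.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def differential_coexpression(group_a, group_b):
    """Change in the correlation of a gene pair between two sample groups.

    Each group is a pair of expression vectors (gene 1, gene 2),
    measured over the samples that belong to that group.
    """
    return pearson(*group_b) - pearson(*group_a)

# Hypothetical expression levels of two genes in four samples per group.
healthy = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])   # co-expressed
diseased = ([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])  # anti-correlated

delta = differential_coexpression(healthy, diseased)
print(delta)  # strongly negative: the pair loses its positive correlation
```

In a genome-wide setting the same difference would be computed for every gene pair, followed by a statistical test that is not shown here.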

In the context of genes and the brain, gene co-expression analysis has been used to predict the spatial

distribution of neural cell types in the mouse brain (Grange et al. 2014), predict pharmaceutical target

candidates for schizophrenia and Parkinson’s disease (Walker, Volkmuth, and Klingler 1999), identify

epigenetic changes in alcoholism (Ponomarev et al. 2012), elucidate disease mechanisms in autism

(Voineagu et al. 2011) and many more (Gaiteri et al. 2014).

The concept of studying gene coexpression patterns in order to learn about new gene relationships and

function is used throughout this dissertation. Chapter 3 presents a method to identify gene-gene

expression similarities based on high resolution images that display a mapping of expression in the brain,

and chapter 5 discusses the coexpression patterns of one specific gene, ADAR, that is especially

important to normal brain function.

1.3.4 Beyond expression - post-transcriptional mechanisms in the brain

After the gene is transcribed it is subjected to several regulatory mechanisms that account for a large

fraction of the difference between the transcriptome and the proteome, as previously discussed in section

1.1. The major source of variation in the proteome is derived from alternative splicing, where exons of a

gene can be combined in different ways to create different versions of proteins with different functional

domains. Another way to diversify the transcriptome, albeit a more subtle one, is by RNA editing. This

process is carried out by enzymes that bind to RNA molecules and change nucleotide sequence through a

deamination process. There are two main families of these enzymes: adenosine deaminase acting on RNA

(ADAR), which convert adenosine to inosine, read as guanosine by the translational machinery, and

cytosine deaminase acting on RNA (CDAR), also known as APOBEC proteins, which deaminate cytosine to

create uracil.

A-I RNA editing has been shown to be especially important in neural tissues. Only a few dozen sites of

RNA editing leading to protein recoding have been identified in mammals. These sites are highly

conserved and are significantly enriched in genes that are related to neural function such as ion channels.

For example, the glutamate-activated cation channel GluR2 AMPA receptor subunit undergoes editing

that changes an amino acid, and consequently changes the channel’s permeability to calcium. This editing

event is important to neuronal viability and interference with RNA editing can cause syndromes such as

epilepsy and ALS (Kawahara and Kwak 2005). Another example of functional RNA editing of an important

neuromodulator is the editing of the serotonin receptor 5-HT2CR. Aberrant patterns of editing for this gene

have been linked to several neuropsychiatric conditions such as depression, schizophrenia, and also

metabolic diseases such as obesity and diabetes (Nishikura 2010).

In recent years, there have been several attempts to identify RNA editing sites in genomes of different

species. The identification of an abundance of RNA editing sites in primate DNA has led to one particularly

captivating theory, where the large amount of editing is suggested as the main driving force in brain

evolution (Li and Church 2013). While most human genes have been shown to undergo editing, the

functionality of this process is still not clear. In some cases, RNA editing of genes has been shown to be a

part of their regulatory mechanisms via gene silencing or nuclear retention (Zhang and Carmichael 2001;

Nishikura 2010).

In chapter 5 of this dissertation we explore the coexpression structures of the A-I RNA editing enzymes

ADAR and ADARB1, and their putative editing targets in neural tissues, in an attempt to find out if there

is indeed a large scale regulatory mechanism that takes place in the human brain.

1.4 Dissertation outline

This thesis is organized as follows. Chapter 2 presents an investigation of the dynamics of inter-regional

dissimilarities in gene expression profiles in different mouse brain regions, using a coarse quantification

of regional gene expression from a genomic collection of ISH images. Chapter 3 describes methods to

exploit the abundance of spatial information that exists in the ISH images by employing computer vision

methods to represent the images and use them to extract biological information, and in chapter 4 these

methods are implemented and used to answer diverse biological questions. In chapter 5 we look beyond

gene expression, and examine patterns of RNA editing in the adult and developing human brain.

Chapter 2: Specialization of neural expression during mouse development

2.1 Introduction

The development of the nervous system is a highly complex process, involving the coordinated expression

of thousands of genes (Waterston et al. 2002; Colantuoni et al. 2011; Kang et al. 2011). Classical models

of development describe a process of brain regionalization, that transforms the neural plate through

several phases into increasingly refined regions (Krauss et al. 1991; Martínez 2001). In the adult, functional

compartments of the brain have been shown to exhibit unique transcriptome signatures (Sandberg 2000;

Datson et al. 2001), suggesting that the process of brain regionalization may be accompanied by a similar

trend in the transcriptome, where expression profiles become more region-specific as the brain develops.

Regional profiles of gene expression in the brain have been studied extensively. These profiles were used

to define new brain delineations based on gene expression (Bohland et al. 2010), conduct comparisons

between brains of different species (Khaitovich et al. 2004), predict neural connectivity (French and

Pavlidis 2011; Wolf et al. 2011), capture functional similarities between brain regions (Hawrylycz et al.

2012) and shed light on many aspects of human brain development (Kang et al. 2011; Colantuoni et al.

2011).

In this chapter, we look at changes in regional expression patterns in the mouse brain, aiming to study the

specific timing of functional specialization. We study expression across 36 developmental neural regions

which cover the complete mouse brain at several time points spanning embryonic and post-natal mouse

development, and also 41 adult brain regions. Expression was measured for thousands of genes, allowing

a large-scale, genomic approach to the study of brain regionalization. We also conduct an inter-species

comparison between expression patterns in mouse and human brain development.

This chapter studies three aspects of spatio-temporal transcriptome patterns: which biological processes

become spatially specialized, at what time points during development, and in which brain regions. We

first trace how expression regionalization changes during brain development. We then identify neural

processes that contribute to the regionalization at various developmental phases. Then, we identify the

brain regions which become largely dissimilar from other regions, and the genes that contribute to this

dissimilarity. Finally, we compare the specialization patterns we find in mouse with corresponding

patterns measured in human.

2.2 Results

To study gene expression specialization during development, we analyze expression primarily based on

ISH expression values obtained from the Allen Developing Mouse Brain Atlas (devABA) (Henry and

Hohmann 2012). In this data, mRNA transcript levels were measured for 2002 genes of special interest in

brain development at 7 developmental time-points spanning embryonic (E11.5, E13.5, E15.5, E18.5) and

post-natal phases (P4, P14, P28). We added another time point, P56, using expression measurements for

the same set of genes from the Allen Adult Mouse Brain Atlas (Lein et al. 2007) (Figure 2.1A). The genes

in the dataset, comprising around 10% of the mouse genome, were selected to include transcription

factors, neurotransmitters, neuroanatomical markers, genes important in brain development and genes

of general interest in neuroscience (see Section 2.3.1). We used per-region data that was quantified from

ISH images by combining all pixels with the same regional label, based on a mapping of each image to a

reference atlas made available by the Allen institute (http://www.brain-map.org). We analyze data from

36 anatomically-delineated regions of the developing brain and 41 regions of the adult brain. These

regions encompass the entire brain (see Section 2.3.2). The data and pre-processing are described in more

detail in Section 2.3.

Figure 2.1: Mouse brain developmental timeline. ISH for each gene was performed at eight time points during development.

Shown here are mid-sagittal slices for the gene Hmgn2, taken with permission from Allen Institute for Brain Science. Allen Mouse

Brain Atlas [Internet] Available from: http://mouse.brain-map.org/ (Lein et al. 2007)

2.2.1 Changes in expression regionalization during development

Aiming to understand how the transcriptome becomes specialized across different brain regions, we first

quantify the differences between expression profiles of brain regions, and examine how these differences

change during development.

We quantify the differences between brain regions in terms of the correlation between their gene

expression profiles. Specifically, for every pair of regions R1, R2, we represent each region as a vector of

expression levels, calculate their Pearson Correlation Coefficient (PCC) and compute 1 − PCC as the

dissimilarity between the regions. Figure 2.2 depicts the mean dissimilarity for each time point across all

pairs of brain regions. The dissimilarity varies significantly between ages (p-value < 10⁻¹⁶, ANOVA), and its

overall profile follows an 'hourglass' shape. During early development, the dissimilarity is actually reduced,

reaching its lowest value around birth (in E18.5 and P4), although one would expect that the process of

region specialization would lead to an increase in dissimilarity in early embryonic development. After

birth, the dissimilarity rises again. The variance of inter-region dissimilarity follows the changes in the

mean dissimilarity and decreases around birth as well. Interestingly, similar hourglass shapes were also

observed in the profiles of transcriptome variability across species during early development, providing

striking molecular evidence for the 'phylotypic stage' hypothesis (Kalinka et al. 2010a; Domazet-Lošo and

Tautz 2010a). The reduction in expression specialization across brain regions suggests a neurotypic phase

around birth in which all brain regions tend to have a more similar transcriptome.
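
The dissimilarity measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region names and expression values are toy data, and the helper names (`pearson`, `mean_dissimilarity`) are hypothetical rather than the code used for the analysis.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_dissimilarity(region_profiles):
    """Mean of 1 - PCC over all pairs of regions at one time point.

    region_profiles maps a region name to its vector of expression levels
    (one value per gene, with the same gene order in every region).
    """
    pairs = list(combinations(region_profiles.values(), 2))
    return sum(1.0 - pearson(r1, r2) for r1, r2 in pairs) / len(pairs)

# Toy data (hypothetical values): three regions, four genes.
toy = {
    "R1": [1.0, 2.0, 3.0, 4.0],
    "R2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with R1
    "R3": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with R1
}
print(mean_dissimilarity(toy))  # pair dissimilarities are 0, 2 and 2
```

Repeating this computation for each time point gives one mean dissimilarity value per age, which is the quantity plotted against developmental time.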

To test if the overall hourglass shape is a broad effect or strongly depends on a small set of genes, we

measured the dissimilarity using 100 random subsets of sizes K = 1000, 500, 200 and 100 genes. We find

that the hourglass shape is largely insensitive to the subset of genes analyzed (Figure 2.3). To further

Figure 2.2: Mean pair-wise dissimilarities between the regions. The curve is a second-order polynomial which minimizes the squared error of the fit to the data. Error bars denote data within 1.5 times the inter-quartile range, and the boxes show the lower and upper quartiles together with the median.

ensure that the hourglass effect is not driven by a small number of highly variable genes, we measured

the dissimilarity again, this time after removing the genes with the largest inter-region variability for each

time point. At each time point, we measured the standard deviation across regions for every gene, and

removed the top k genes with the highest standard deviation values (k = 50, 100, 200, 500). The hourglass

shape was robust even when removing the 500 most variable genes (25% of the dataset, Figure 2.4).
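
Both robustness checks (random gene subsets and removal of the most variable genes) can be sketched as follows. The data are randomly generated placeholders for a single time point, the function names are hypothetical, and the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def mean_dissimilarity(matrix):
    """Mean of 1 - PCC over all pairs of rows of a regions x genes matrix."""
    pairs = list(combinations(matrix, 2))
    return sum(1.0 - pearson(a, b) for a, b in pairs) / len(pairs)

def keep_genes(matrix, gene_idx):
    """Restrict every region profile to the given gene (column) indices."""
    return [[row[g] for g in gene_idx] for row in matrix]

def random_subset_dissimilarities(matrix, k, n_repeats=100, seed=0):
    """Dissimilarity recomputed on n_repeats random subsets of k genes."""
    rng = random.Random(seed)
    n_genes = len(matrix[0])
    return [
        mean_dissimilarity(keep_genes(matrix, rng.sample(range(n_genes), k)))
        for _ in range(n_repeats)
    ]

def drop_most_variable(matrix, k):
    """Remove the k genes with the highest standard deviation across regions."""
    n_genes = len(matrix[0])
    sd = [pstdev([row[g] for row in matrix]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: sd[g], reverse=True)
    return keep_genes(matrix, sorted(ranked[k:]))

# Hypothetical data for one time point: 6 regions x 200 genes.
data_rng = random.Random(1)
toy = [[data_rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(6)]

subset_values = random_subset_dissimilarities(toy, k=50, n_repeats=100)
trimmed = drop_most_variable(toy, k=20)
print(min(subset_values), max(subset_values), mean_dissimilarity(trimmed))
```

Applying both functions at every time point, and comparing the resulting curves with the curve computed from all genes, corresponds to the comparisons described above.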

Figure 2.3: Robustness of the hourglass shape to gene sampling. Dissimilarity curves were computed from random samples of (A) 1000, (B) 500, (C) 200 and (D) 100 genes. The shape is robust and is largely preserved even when using 100 genes, 5% of the full dataset.

We also tested the sensitivity of the hourglass shape to the selection of regions by computing the

dissimilarity repeatedly, each time with one region being excluded from the analysis ("leave one region

out", Figure 2.5A). To test how the delineation of the brain into regions may affect the results, we used

the hierarchical structure of the anatomical regions to select six sets of regions at increasing sizes (see

Section 2.3.2). Figure 2.5B depicts the dissimilarity profiles for each of the six sets, as computed at various

resolutions, from 488 developing and 631 adult small brain regions at the most refined level, to 48

developing and 13 adult brain regions at the coarsest level. The hourglass shape of the dissimilarity profile

is largely preserved in all delineations. Together, these results demonstrate that the hourglass shape is

robust throughout the dataset and is not constrained to specific genes or brain regions.
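
The "leave one region out" test can be illustrated with the short sketch below. It assumes the same 1 − PCC dissimilarity as before; the region names, values and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def mean_dissimilarity(region_profiles):
    """Mean of 1 - PCC over all pairs of regions."""
    pairs = list(combinations(region_profiles.values(), 2))
    return sum(1.0 - pearson(a, b) for a, b in pairs) / len(pairs)

def leave_one_region_out(region_profiles):
    """Mean dissimilarity recomputed with each region excluded in turn."""
    result = {}
    for left_out in region_profiles:
        rest = {name: profile for name, profile in region_profiles.items()
                if name != left_out}
        result[left_out] = mean_dissimilarity(rest)
    return result

# Toy data (hypothetical values): three regions, four genes.
toy = {
    "R1": [1.0, 2.0, 3.0, 4.0],
    "R2": [2.0, 4.0, 6.0, 8.0],
    "R3": [4.0, 3.0, 2.0, 1.0],
}
for region, value in leave_one_region_out(toy).items():
    print(f"without {region}: {value:.2f}")
```

If no single region drives the result, the values obtained for the different excluded regions stay close to one another at every time point.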

Figure 2.4: The hourglass shape is robust when removing highly variable genes. Inter-region distance curve was calculated for the data withholding top k most variable genes for each time point. Error bars represent standard error between brain regions.

Figure 2.5: The hourglass shape is robust throughout the brain. (A) Inter-region distance curve was calculated for the data

withholding one region at a time. The blue curve is the mean across brain regions, error bars represent standard deviations from

mean. (B) The dissimilarity curve using sets of regions taken from different levels of the reference atlas regional ontology tree,

starting from the leaf regions (level 1).

2.2.2 Functional characteristics of early and post-natal regionalization

Which biological processes could underlie the pattern of inter-region dissimilarity? In principle, the

hourglass shape could stem from functions or genes whose individual expression profiles follow the

hourglass shape. Alternatively, the shape could be the result of a mix of several biological processes, some

contributing to the decreasing phase of the hourglass and some contributing to the increasing phase. To

test these alternatives, we created a temporal profile for each gene that quantifies its contribution to the

hourglass shape at developmental time points (E11.5 - P28) (see Section 2.3.3). We then used the k-Means

clustering algorithm (Bishop 2006) to group the profiles into distinct clusters of genes that have congruent

developmental dissimilarity patterns, and searched for functional enrichment in these clusters using Gene

Ontology (GO) categories (see Section 2.3.3).
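
The clustering step can be illustrated with a plain implementation of Lloyd's k-means algorithm. This is a sketch only: the temporal profiles below are hypothetical, and the number of clusters, the initialization and the distance measure used in the actual analysis are not specified in this section.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for a list of profiles."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical per-gene contribution profiles over seven time points (E11.5 to P28).
early = [[0.9, 0.7, 0.5, 0.3, 0.2, 0.1, 0.1],
         [1.0, 0.8, 0.4, 0.3, 0.1, 0.1, 0.0],
         [0.8, 0.7, 0.5, 0.2, 0.2, 0.1, 0.1]]
late = [[0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 0.9],
        [0.0, 0.1, 0.1, 0.3, 0.6, 0.7, 1.0],
        [0.1, 0.2, 0.2, 0.2, 0.5, 0.9, 0.8]]

labels, centroids = kmeans(early + late, k=2)
print(labels)
```

With these toy profiles the genes that contribute early and those that contribute late fall into separate clusters, which mirrors the two families of clusters discussed next.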

We found two main families of clusters that were functionally enriched (pFDR, q-value < 0.01), each family

accounting for a different phase of the hourglass shape, and depicted in Figure 2.6. Genes from the first

family contributed largely to the dissimilarity during early embryonic development and are related to

nervous-system development categories, such as neuron differentiation, axonogenesis and forebrain

development (an example is shown in Figure 2.6A). In contrast, genes from the second family have

a high contribution to dissimilarity in late post-natal developmental time points (P14 and P28) and tend

to be related to experience dependent plasticity, with enriched categories such as regulation of synaptic

transmission, behavior, learning and memory (Figure 2.6B).

To quantify the relative contribution of GO categories to early embryonic and late postnatal dissimilarity,

we computed a category contribution index (see Section 2.3.3). The top contributing categories at E11.5

are related to nervous system construction, including positive regulation of neuroblast proliferation and

axonogenesis (Table 2.1). The top scoring categories at P28 are related to the utilization of the nervous

system, including regulation of neurotransmitter secretion and visual perception. An exception to this rule

is the category hindbrain development, ranked at #10 at P28, which is in agreement with the postnatal

timeline of hindbrain development (Moens and Prince 2002).

GO category | contribution at E11.5 | GO category | contribution at P28

positive regulation of neuroblast proliferation | 0.0022 | neurotransmitter metabolic process | 0.0011

retinal ganglion cell axon guidance | 0.002 | regulation of neurotransmitter secretion | 0.00039

CNS projection neuron axonogenesis | 0.0018 | sensory perception of sound | 0.00032

central nervous system neuron development | 0.0016 | regulation of neurotransmitter levels | 0.0003

midbrain development | 0.0015 | sensory perception of mechanical stimulus | 0.00028

central nervous system neuron axonogenesis | 0.0015 | synaptic transmission, dopaminergic | 0.00027

hindbrain development | 0.0012 | visual perception | 0.00027

neural tube development | 0.0011 | regulation of long-term synaptic plasticity | 0.00027

motor axon guidance | 0.0011 | sensory perception of light stimulus | 0.00024

negative regulation of glial cell differentiation | 0.0011 | hindbrain development | 0.00024

Table 2.1. Mean contribution values of GO categories at E11.5 and P28.

The observed expression dissimilarity means that each of these neural processes contains a mixture of

genes with different spatial expression patterns. Such spatial differences could result from specialization

at the level of gene families: the same process may be carried out in different brain regions using different

members of a common gene family. This is, for example, the case with homeobox genes, which are well known to

operate as pattern specifiers in the brain (Puelles and Rubenstein 1993; Vollmer and Clerc 2002).

To search for spatial specialization within gene families of interest, we collected pairs of genes from the

17 enriched GO categories discussed above. We computed both their spatial correlation at developmental

ages with peak dissimilarity (E11.5 and P28), and their sequence similarity (see Section 2.3.4). Results for

an example category 'neuron fate commitment' are presented in Figure 2.7.

Spatial specialization among members of the same gene family could explain apparent

inconsistencies in reports of how these genes cooperate.

One interesting example is the pair of paralogs Neurog1 and Neurog2, where there are mixed reports

suggesting that they sometimes operate in a synergistic way (Ma et al. 1999) and sometimes in a

redundant way (Takano-Maruyama, Chen, and Gaufo 2011). These genes are bHLH transcription factors


involved in neuronal differentiation determination and subtype specification during embryogenesis

(Zirlinger et al. 2002). Figure 2.6C shows that they display a complementary pattern of expression at E11.5

(ρ = -0.59, Pearson correlation): Neurog2 is prominently expressed in areas derived from the forebrain,

and Neurog1 is expressed more strongly in hindbrain areas. Their different spatial distributions could

explain why they were found to be redundant in some conditions (for example, in tissues where both are

expressed) but not in others.

Figure 2.6. Functional characterization of hourglass shape. (A), (B) Clusters of gene profiles that are functionally enriched. Each

profile is a measure of contribution to dissimilarity D (see Methods). Black bold curve is the mean of the cluster. Blue lines - all

the genes in the cluster; red lines - genes that are in the cluster and in the category; grey lines - genes that are not in the cluster

even though they belong to the category. (A) Neuron migration shows decreasing dissimilarity. (B) Learning or memory shows a post-natal

increase in dissimilarity. (C) Spatial expression of the genes Neurog1 and Neurog2 at E11.5 in 11 coarse regions, selected as

neuron differentiation genes with highly similar sequence.


Figure 2.7. Sequence similarity vs. spatial correlation of gene pairs belonging to the GO category 'neuron fate commitment'.

Pairs of genes with sequence similarity > 0 and spatial correlation > 0.2 are marked in yellow. Pairs of genes with sequence

similarity > 0 and spatial correlation < -0.2 are marked in red.

Similar neural processes were found using a complementary analysis where we first created a dissimilarity

curve for each GO category and then correlated them to reference curves which represent the embryonic

part of the hourglass and the post-natal part (Figure 2.8; see Section 2.3.5 for full analysis details). It

therefore appears that early inter-structure specialization is dominated by genes related to the

construction of the nervous system, while late variability is dominated by genes related to its operation.

Surprisingly, these results suggest that region dissimilarity in brain construction processes strongly

decreases at the same developmental phases where brain regions are actually known to become

anatomically segregated and specialized.


2.2.3 Expression conservation across regions and their embryonic origins

To further understand how changes in dissimilarity relate to the process of regionalization throughout

development, we next look into the question of which brain regions contribute to the overall dissimilarity.

Brain regions develop from three embryonic vesicles: the prosencephalon (forebrain), mesencephalon

(midbrain) and rhombencephalon (hindbrain). Zapala et al. (2005) showed that, in the adult mouse brain,

brain regions sharing an embryonic precursor also tend to share similar expression profiles. Here we

further examine the dynamics of this relation, testing how the embryonic origins of brain regions influence

the changes in their dissimilarity.

Specifically, we first visualize the changes in region dissimilarity over time. All regions were embedded in

a two dimensional space, while preserving the pair-wise dissimilarity of their expression profiles (using

non-metric multidimensional scaling (Bishop 2006), see Section 2.3.6). The embeddings for each time

point are shown in Figure 2.9, revealing how the hourglass shape manifests itself across individual regions.

In accordance with the hourglass shape, brain regions tend to be less dispersed in the two time points

that surround birth (Figure 2.9, E18.5, P4). To visualize the relation between expression profiles and the

embryonic origin of each region, we colored the regions in Figure 2.9 by their embryonic vesicle of origin.

Indeed, regions sharing the same origin tend to be clustered together throughout development. This

relation was also statistically significant (ρ = 0.33, p < 0.05, mean over all time points of Pearson correlation

between the dissimilarity and embryonic tree distance).

Figure 2.8: Dissimilarity computed with the three categories showing the largest correlation with each reference curve. The three highly correlated "embryonic" categories are related to construction of the nervous system: axonogenesis, neuron projection morphogenesis and cerebral cortex development. In contrast, the categories that are highly correlated to the post-natal dissimilarity curve are related to nervous system function: regulation of sensory perception of pain, regulation of neuronal synaptic plasticity and negative regulation of transmission of nerve impulse. These findings suggest that early dissimilarity is dominated by genes involved in axonogenesis and late dissimilarity by synaptic transmission and neural activity.

The regions that are most diverged in the developing post-natal time points are Isthmus and rhombomere

1, the two regions that give rise to the cerebellum (Figure 2.9, black arrows). In the adult time point, the

cerebellar cortex is, notably, the most unique region in the brain in terms of gene expression. These results

are to a large extent consistent with previous analysis of cerebellar gene expression (Zapala et al. 2005;

Lein et al. 2007). The post natal shift in cerebellar gene expression is also in agreement with the functional

role of the cerebellum, since the cerebellum is a motor coordination center that relies on sensory input

becoming available only after birth. Cerebellar development is also known to take place largely

after birth (Wang and Zoghbi 2001).

The same effect can be seen when performing a hierarchical clustering analysis on 11 large brain regions.

The precursor region to the cerebellum, the pre-pontine hindbrain (PPH), is clustered with other hindbrain

regions throughout embryonic development. Immediately after birth the PPH detaches from the

hindbrain cluster, becoming the most specialized region in the brain (Figure 2.10).

We next turned to identify the specific genes contributing to the post-natal shift in cerebellar gene

expression. We defined the contribution of each gene g to the cerebellar dissimilarity, as the difference

between the total cerebellar dissimilarity with and without g (see Section 2.3.7), and listed the top twenty

genes that contribute most to cerebellar distance at each of the three post-natal developmental time

points. Overall, 78% (32/41 unique genes) of the top contributing genes are known to be related to the

cerebellum, including genes that play an important role in cerebellar development or function like

Neurod1, Pvalb, Zic1 and Zic5. The remaining top genes (9/41) have not been previously linked to the

cerebellum, even though some of them ranked very high in our contribution lists. Examples include

heterogeneous nuclear ribonucleoprotein A/B (Hnrpab), which is ranked 8 at P4, and microfibrillar-associated

protein 4 (Mfap4), which is ranked 20 at P4 and 13 at P14. Hnrpab is a DNA and RNA binding

protein, and is suggested to be involved in cytostatic activity (Taga et al. 2010). Mfap4 is thought to be an

extracellular matrix protein which is involved in cell adhesion or intercellular interactions, and has almost

no other associated information. Both of these genes are interesting targets for further investigation as

potential contributors to cerebellar specialization.


Figure 2.9: Changes in dissimilarity across individual brain regions. Embedding of all regions onto a 2D plane using

multidimensional scaling. Each circle corresponds to a brain region, with a size that corresponds to the within-region expression

standard deviation and a color that corresponds to its embryonic origins. Red: forebrain, telencephalon; pink: forebrain,

diencephalon; cyan: midbrain; blue: hindbrain. Rhombomere 1 and isthmus in the developing post-natal time points, and the

cerebellar cortex and cerebellar nuclei at P56, are marked with black arrows.

Figure 2.10: Hierarchical clustering of 11 large brain regions over development. Dendrogram of 11 brain regions, created by

their gene expression profile at all time-points. Region abbreviations: rostral secondary prosencephalon (RSP); telencephalic

vesicle (Tel); peduncular hypothalamus (PedHy); prosomere 3 (p3); prosomere 2 (p2); prosomere 1 (p1); midbrain (M);

prepontine hindbrain (PPH); pontine hindbrain (PH); pontomedullary hindbrain (PMH); medullary hindbrain (MH). Hindbrain

structures are colored in red. The PPH (cerebellar precursor) shifts dramatically after birth.


2.2.4 Comparison with human development

The above findings show how the specificity of the regional expression profiles in the brain changes during

development. How do these findings generalize to other mammals? A recent study provides a good

opportunity to test these findings in humans (Kang et al. 2011). Kang and colleagues measured the

transcriptome of 57 human subjects using DNA microarrays in 11 cortical regions, the mediodorsal

nucleus of the thalamus, striatum, amygdala, hippocampus and the cerebellar cortex.

We first aimed to assess if the gene expression levels in mouse and human can be compared. We

considered the human genes that are orthologous to the 2002 mouse genes and computed the Spearman

correlation of the gene expression profiles of every pair of time points, averaged over brain regions (see

Section 2.3.8). Figure 2.11 depicts the cross-correlation between the human and the mouse

developmental timelines, showing a high correlation between the expression profiles of the two species,

which peaks along the mapping between the mouse and human brain development timelines proposed

by Clancy and Darlington (2001). The expression profiles of the two species are therefore highly

correlated at corresponding ages, with the correlation peaking at post-natal time points. This suggests that the mouse

and human neurodevelopmental datasets can be directly compared to each other, even though they

were measured using different methods (ISH and microarrays) and in different brain regions.


We next turned to compare expression in specific regions of the mouse and human brains, focusing on

four mouse brain regions which have parallel regions in the human data (see Section 2.3.8). The human

cortical areas were averaged and compared to the mouse dorsal pallium, the human mediodorsal nucleus

of the thalamus was compared to the mouse thalamus, the human cerebellar cortex was compared to

two mouse regions which were averaged: rhombomere1 and isthmus, and the human and mouse striatum

were compared as well. For each pair of parallel regions, we first looked at the overall temporal

correspondence of the mouse and human development timelines by computing the correlation between

expression levels of the two species during development. We computed the cross-species correlation as

described above for the four pairs of human regions and their parallel mouse regions, finding high

correlation values for all region pairs (Figure 2.12).

Figure 2.11: Comparison with human data. (A) Cross-correlation between mouse and human gene expression. The black line is taken from the known developmental timelines of the two species, based on anchor events (Clancy and Darlington 2001a).


Figure 2.12: Cross-correlation between mouse and human expression profiles over development. Coherence between

expression profiles for orthologous genes was measured using Spearman correlation, for every pair of time points in

mouse and human. (A) Thalamus (B) Cortex (C) Striatum and (D) Cerebellum. The black line depicts the mapping

between neurodevelopmental timelines of the two species proposed by (Clancy and Darlington 2001b).

We looked at region-specific dissimilarity and traced how the dissimilarity of each of the four regions from

all other brain regions changes over development, in both mouse and human (see Section 2.3.7). The

specialization patterns in mouse and human show partial correspondence (Figure 2.13). While the

thalamus is specialized very early in human (Figure 2.13B), at 4-8pcw, in the mouse it keeps a relatively

constant distance from the rest of the regions (Figure 2.13A). In mouse, the cortex is specialized right

before birth (Figure 2.13C), while in human there is a decrease in specialization over time (Figure 2.13D).

The striatum in mouse becomes specialized right before birth (Figure 2.13E), and in human it keeps a more or

less constant distance (Figure 2.13F). The region with the highest correspondence between mouse and

human is the cerebellum, which becomes specialized right after birth in both species (Figure 2.13G,H).

The differences between mouse and human regional specialization are striking, and the fact that the most

similar profile is for the cerebellum is especially interesting given that the cerebellum shows the lowest

inter-species correlations for the post-natal time points (Figure 2.12).

Figure 2.13: Comparison with human data. Region-specific dissimilarity curves of four brain regions in mouse and human. (A)

mouse thalamus, (B) human mediodorsal nucleus of the thalamus, (C) mouse dorsal pallium, (D) human cortical regions, (E)

mouse striatum, (F) human striatum, (G) mouse rhombomere 1 and isthmus and (H) human cerebellar cortex. Error bars denote

standard deviation across regions.


2.3 Methods

2.3.1 Data acquisition and pre-processing

The detailed process of data acquisition was described in (Lein et al. 2007). 2002 genes were chosen from

five classes: (1) Transcription factors, including homeobox, basic helix-loop-helix, forkhead, nuclear

receptor, high mobility group and POU domain genes. (2) Neuropeptides, neurotransmitters, and their

receptors, in particular genes involved in dopaminergic, serotonergic, glutamatergic and GABAergic

signaling. (3) Neuroanatomical marker genes. (4) Genes relevant to brain development including axon

guidance, receptor tyrosine kinases and their ligands. (5) Genes of general interest including common

drug targets, ion channels, cell adhesion, genes involved in neurotransmission, G-protein-coupled

receptors and genes involved in neurodevelopmental diseases. One animal was used to measure expression

for each gene.

Brain regions may change dramatically in size and shape, which complicates the comparison of gene expression

in the brain across different developmental stages. Here, expression density for each brain region in each

time point was measured while taking the differences in size into account. The expression density for each

brain region R is defined as the sum of expressing pixels in R divided by the total number of pixels that

intersect R (taken from: http://developingmouse.brain-map.org/docs/InformaticsDataProcessing.pdf).

Since the expression measurements for each gene come from different individual brains whose 3D shapes

differ, the registration process is prone to mistakes, especially for small regions. To avoid errors that stem

from erroneous registration, we selected a set of regions that are large relative to the magnitude of the

registration perturbation.

2.3.2 Selecting brain region delineation

We used the hierarchical structure of the anatomical regions as defined in the reference atlas ontologies

available in the Allen Brain Atlas website to define six delineations of the brain into sets of regions. These

delineations are achieved by considering several levels of the tree in a serial manner. We started with the

set of leaf regions and then repeatedly took their "parent" regions five times, yielding six sets of regions

corresponding to six levels of the ontology tree. The most refined level has 488 developing and 631 adult

small brain regions, and the coarsest level 48 developing and 13 adult brain regions. For some time points,

expression measurements are only available for a small number of regions, and the remaining regions were

ignored.
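The level-by-level construction described above can be sketched as follows. This is a minimal Python illustration, not the code used in this work. It assumes the ontology is available as a child-to-parent mapping; the function name `delineations`, the `parent` dictionary and the toy region labels are hypothetical and are not taken from the atlas.

```python
def delineations(leaves, parent, n_levels=6):
    """Return n_levels sets of regions, starting from the leaf level.

    `parent` maps each region to its parent region. A region with no
    entry in `parent` is kept as-is when moving up a level (an
    assumption of this sketch).
    """
    levels = [set(leaves)]
    for _ in range(n_levels - 1):
        levels.append({parent.get(region, region) for region in levels[-1]})
    return levels


# Hypothetical toy ontology (not real atlas labels).
toy_parent = {"a1": "a", "a2": "a", "b1": "b", "a": "root", "b": "root"}
toy_levels = delineations(["a1", "a2", "b1"], toy_parent, n_levels=3)
```

Each entry of the returned list is one delineation of the brain, from the most refined to the coarsest.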


2.3.3 Contribution of individual genes to the hourglass shape and functional analysis

To functionally characterize the hourglass shape, we calculated the contribution of each gene g to the inter-region distance as

\[ \mathrm{geneContribution}(g) = 1 - \frac{D_{-g}}{D_{g}}, \]

where $D_{g}$ is the mean dissimilarity (1-PCC) across all N pairs of regions,

\[ D_{g} = \frac{1}{N} \sum_{r_1, r_2} \bigl(1 - \rho(r_1, r_2)\bigr), \qquad (1) \]

and $D_{-g}$ is similarly measured, but after excluding the gene g. This was used to create a temporal

contribution profile for each gene.
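As an illustration, the contribution of a single gene at one time point can be sketched in Python. This is a minimal sketch under stated assumptions, not the original implementation: expression is given as one list of gene values per region, the contribution is read as one minus the ratio of the mean dissimilarity without the gene to the mean dissimilarity with all genes, and the function names and toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_dissimilarity(expr):
    """Mean of 1 - PCC over all pairs of regions.

    expr[r] is the list of expression values of region r, one per gene.
    """
    pairs = list(combinations(range(len(expr)), 2))
    return sum(1 - pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)


def gene_contribution(expr, g):
    """1 - D_without_g / D_all for the gene at column index g."""
    d_all = mean_dissimilarity(expr)
    d_without = mean_dissimilarity([row[:g] + row[g + 1:] for row in expr])
    return 1 - d_without / d_all


# Hypothetical toy data: 3 regions x 4 genes.
toy_expr = [[1.0, 2.0, 3.0, 9.0], [2.0, 3.0, 4.0, 1.0], [3.0, 5.0, 4.0, 2.0]]
toy_contribution = gene_contribution(toy_expr, 3)
```

Repeating the computation at every time point gives the temporal contribution profile of the gene.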

To find biological processes that share similar contribution profiles, we clustered the profiles using k-means

(k = 10, 15, 20, 25, 30, 35, 40, 45, 50). The resulting clusters were tested for Gene Ontology

functional enrichment (Ashburner et al. 2000). We limited the analysis to GO categories with at least 10

associated genes in our dataset (~0.5% of the dataset) and to GO categories related to nervous system

structure and function. This was done by taking several top-level categories, such as neurological system

process (GO:0050877) and nervous system development (GO:0007399), and collecting all of their descendant

categories in the GO graph. We added to this set several more biological process and cellular

component categories with their descendants, such as neuron projection, neuronal cell body and

synaptosome.

We tested for enrichment using a hypergeometric test. P-values were corrected for multiple comparisons

using a double-FDR approach: First, for each clustering result, we corrected the enrichment p-values using

False Discovery Rate (FDR, (Benjamini and Hochberg 1995)). Next, to correct for the fact that the clustering

was computed for ten different values of k, we corrected the 10 p-values each category received using

FDR as well. Finally, to present the most refined categories, we screened the resulting categories using

the hierarchical structure of the GO tree, and discarded categories that had a descendant category with a

lower p-value.
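The two building blocks of this procedure, the hypergeometric enrichment test and the Benjamini-Hochberg correction, can be sketched as follows. This is a minimal Python sketch assuming a one-sided (over-representation) test; the function names are hypothetical and the code is not the original implementation.

```python
from math import comb


def hypergeom_pvalue(n_total, n_category, n_cluster, n_overlap):
    """P(X >= n_overlap) when n_cluster genes are drawn without replacement
    from n_total genes, of which n_category belong to the category."""
    upper = min(n_category, n_cluster)
    tail = sum(
        comb(n_category, k) * comb(n_total - n_category, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_cluster)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under the double-FDR scheme, `benjamini_hochberg` would be applied once to the enrichment p-values of each clustering result, and then again to the p-values that each category received across the different values of k.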

To decide if a cluster represents the embryonic or the post-natal dissimilarity (or neither), we pooled all

contribution values of genes in the cluster in the embryonic time-points (E11.5, E13.5, E15.5, E18.5) and


separately pooled the ones in the post-natal developmental time points (P4, P14, P28). We then applied

a Wilcoxon signed-rank test to decide if there is a significant difference between the two samples. If there

was, we checked the direction of the difference by comparing the medians of the samples.

The contribution of each GO category C to inter-region dissimilarity was computed as the mean

contribution of all genes assigned to C:

\[ \mathrm{Contribution}_{cat}(C) = \frac{1}{\mathrm{size}(C)} \sum_{g \in C} \mathrm{Contribution}_{gene}(g). \]

This index captures both large categories and small categories with highly contributing genes.

2.3.4 Identifying genes with similar sequences

To identify genes from the same gene family we computed the similarity of their protein coding sequence

as measured by the Needleman-Wunsch algorithm (Needleman and Wunsch 1970). We used BLOck

SUbstitution Matrix 50 (BLOSUM50) as the scoring matrix for the global alignment, with a gap

penalty of 8. Pairs with a score of zero or higher were considered as matches.
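For readers unfamiliar with global alignment, the scoring recursion can be sketched in Python. This is a minimal sketch, not the alignment code used here. It assumes a linear gap penalty, and the BLOSUM50 matrix is not reproduced: `toy_score` is a hypothetical placeholder scoring function that would have to be replaced by a BLOSUM50 lookup to approximate the analysis.

```python
def needleman_wunsch_score(a, b, score, gap=8):
    """Global alignment score of sequences a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j symbols of b.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [-gap * i]
        for j, cb in enumerate(b, start=1):
            curr.append(max(
                prev[j - 1] + score(ca, cb),  # align ca with cb
                prev[j] - gap,                # ca aligned to a gap
                curr[j - 1] - gap,            # cb aligned to a gap
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    """Placeholder scoring: +5 for a match, -4 for a mismatch (NOT BLOSUM50)."""
    return 5 if x == y else -4


def is_match(a, b, score=toy_score, gap=8):
    """Pairs with an alignment score of zero or higher count as matches."""
    return needleman_wunsch_score(a, b, score, gap) >= 0
```

Only the last row of the dynamic programming table is kept, since the analysis needs the alignment score but not the alignment itself.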

2.3.5 Constructing reference curves for correlation analysis

To ensure that the enriched categories represent the 'embryonic' and 'post-natal' parts of the hourglass

curve we also did a complementary analysis. Instead of clustering profiles and searching for enrichment

as described above, we started from GO categories, and computed the dissimilarity curves using all genes

in each category. We treat this dissimilarity curve as the dissimilarity of the category. We then computed

the Pearson correlation of the dissimilarity profile of each function with two reference curves: One

capturing the embryonic part of the hourglass curve and one capturing the post-natal one (Figure 2.8). To

construct the embryonic curve, we computed the full hourglass curve and then assigned the value of the

last embryonic time point E18.5, to all of the post natal time points. Similarly, for the post-natal reference

curve, the embryonic time points were assigned the value of the first post-natal time point, P4. Figure 2.8

shows the top three highest correlating categories for the two reference curves. We computed an

empirical p-value for this correlation using a Monte Carlo approach: we sampled 10,000 groups of genes of

the same size as the category and computed the correlation coefficients of these groups with the reference

curves.
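The construction of the two reference curves and the Monte Carlo test can be sketched as follows. This is a minimal Python sketch under stated assumptions: the hourglass curve is a list of dissimilarity values ordered by time point, `curve_for` is a hypothetical callable that returns the dissimilarity curve of a given gene group, and the add-one estimate of the p-value is a common convention that is not specified in the text.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def reference_curves(hourglass, n_embryonic):
    """Build the embryonic and post-natal reference curves.

    hourglass is a list of dissimilarity values ordered by time point,
    the first n_embryonic of which are embryonic.
    """
    last_embryonic = hourglass[n_embryonic - 1]
    first_postnatal = hourglass[n_embryonic]
    n_postnatal = len(hourglass) - n_embryonic
    embryonic_ref = hourglass[:n_embryonic] + [last_embryonic] * n_postnatal
    postnatal_ref = [first_postnatal] * n_embryonic + hourglass[n_embryonic:]
    return embryonic_ref, postnatal_ref


def empirical_pvalue(observed_r, category_size, all_genes, curve_for,
                     reference, n_samples=10_000, seed=0):
    """Add-one estimate of the chance that a random gene group of the same
    size correlates with the reference at least as well as observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        group = rng.sample(all_genes, category_size)
        if pearson(curve_for(group), reference) >= observed_r:
            hits += 1
    return (hits + 1) / (n_samples + 1)
```

With four embryonic and three post-natal developmental time points, `reference_curves(curve, 4)` keeps the embryonic half of the curve and flattens the post-natal half in the first output, and does the reverse in the second.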


2.3.6 Visualizing inter-region distances

To visualize the temporal dynamics of the inter-region dissimilarity in the brain, we embedded the regions

in a two-dimensional space while preserving the pair-wise dissimilarity of their expression profiles, using

non-metric multidimensional scaling. For easier comparison of the time points, at each time point, the

location of the regions was adjusted to best match the location of the other regions using MATLAB’s

‘procrustes’.
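The alignment step can be illustrated with a simplified two-dimensional Procrustes fit. The sketch below is in Python rather than MATLAB and covers translation and rotation only, whereas MATLAB's procrustes can also fit scaling and reflection; it assumes that both embeddings list the same regions in the same order. The multidimensional scaling step itself is not shown, and the function names are hypothetical.

```python
from math import atan2, cos, sin


def center(points):
    """Translate 2D points so that their centroid is at the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]


def align_rotation(source, target):
    """Rotate the centered source points to best match the centered target
    points in the least-squares sense (rotation only)."""
    src, tgt = center(source), center(target)
    # The optimal angle maximizes the sum of dot products between each
    # rotated source point and its target point.
    a = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src, tgt))
    b = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src, tgt))
    theta = atan2(b, a)
    c, s = cos(theta), sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in src]
```

Aligning the embedding of each time point to a common reference in this way removes arbitrary rotations of the embedding, so that the positions of regions can be compared across time points.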

2.3.7 Dissimilarity of one region to the rest of the brain

To quantify the time course of expression specialization, we measured the dissimilarity between a region

R and the remaining brain regions. The region-specific index for a region R is defined as the average

dissimilarity between R and all the other regions,

\[ D(R) = \frac{1}{\#\mathrm{regions}} \sum_{r} \bigl(1 - \rho(R, r)\bigr), \]

divided by the mean

inter-region dissimilarity of Eq (1).
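A minimal Python sketch of this index is given below. It assumes a precomputed symmetric matrix of pairwise dissimilarities (1 - PCC) between regions and averages over the regions other than R; the function name and the toy matrix are hypothetical.

```python
def region_specific_index(dissim, region):
    """Average dissimilarity of one region to all others, divided by the
    mean dissimilarity over all pairs of regions."""
    n = len(dissim)
    to_others = sum(dissim[region][r] for r in range(n) if r != region) / (n - 1)
    all_pairs = [dissim[a][b] for a in range(n) for b in range(a + 1, n)]
    return to_others / (sum(all_pairs) / len(all_pairs))


# Hypothetical toy matrix: region 2 is far from regions 0 and 1.
toy_dissim = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.8],
    [0.8, 0.8, 0.0],
]
toy_index = region_specific_index(toy_dissim, 2)
```

Values above one indicate a region that is more dissimilar from the rest of the brain than the average pair of regions, and values below one indicate the opposite.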

2.3.8 Mouse-human comparison

To compare expression in mouse and human brains, we focused on four mouse brain regions which have

parallel regions in the human data of (Kang et al. 2011). Since the mouse cortical regions have data only

for P28, we used their parent region, the dorsal pallium, to compare with the 11 human cortical areas,

averaged to create one cortical expression profile. The human mediodorsal nucleus of the thalamus was

compared to the mouse thalamus, the human cerebellar cortex was compared to two mouse regions

which were averaged: rhombomere1 and isthmus, and the human and mouse striatum were compared

as well.

To identify human genes that are orthologous to the 2002 genes in the mouse dataset we used the

R/BioConductor package BioMart (Smedley et al. 2009).

The set of 1737 ortholog gene pairs was used to calculate the Spearman correlation between mouse and

human expression profiles, averaged over regions, for every two time points (Figure 2.11), and also for

the four parallel regions (Figure 2.12).
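The cross-species comparison amounts to a matrix of Spearman correlations, one entry per pair of mouse and human time points. The Python sketch below is illustrative only: it assumes that each time point is represented by a vector of expression values over the same ordered list of ortholog pairs, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cross_correlation(mouse, human):
    """Matrix of Spearman correlations between every mouse time point
    (rows) and every human time point (columns)."""
    return [[spearman(m, h) for h in human] for m in mouse]
```

Because Spearman correlation depends only on ranks, it is insensitive to the different measurement scales of ISH and microarray data.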


2.4 Discussion

We characterized how the dissimilarity between transcription profiles of brain regions changes during

development of the mouse brain. Based on the process of brain regionalization we expected to observe a

monotonic increase in transcription specialization, but we actually found that brain regions exhibit

increasingly more similar expression profiles during early embryonic development, until reaching a

"neurotypical" phase around the time of birth. After birth, brain regions tend to specialize and their

expression dissimilarity increases. Functional characterization of the hourglass shape suggests that it is

derived from two separate, complementary processes: the embryonic reduction in dissimilarity is

dominated by genes responsible for constructing and shaping the brain, while the post-natal increase in

specialization largely involves processes that govern the operation of the nervous system, like neural

activity and plasticity.

When visualizing the dissimilarity between the regions (Figure 2.9), it is apparent that the cerebellum

“breaks off” from the rest of the regions after birth. The dissimilarity between the cerebellum and other

regions grows at each post-natal time point and so does its dissimilarity from other regions of the

hindbrain. This dynamic is consistent with the view that cerebellar development follows unique cues from

the junction of the midbrain and hindbrain (Sato, Joyner, and Nakamura 2004; Wingate 2001), and

therefore its transcriptome may differ from other hindbrain regions significantly (Chizhikov et al. 2006).

Indeed, the cerebellum has been shown before to be the most unique region in terms of its expression

profiles (Wang and Zoghbi 2001; Zapala et al. 2005; Lein et al. 2007; Kang et al. 2011). One explanation

for the late specialization lies in the main function of the cerebellum as a motor coordination and sensory-

motor integration center.

The above findings are in partial accordance with a recent large-scale developmental-brain transcriptome

study in humans (Kang et al. 2011), where similarity between brain regions was aggregated across three

long life periods: embryonic, postnatal and adult. In both mouse and human dissimilarity decreases before

birth. However, on average, the similarity in these periods seems to grow from post-natal development

to adulthood in human. In the mouse dataset we see the opposite effect: a robust increase in dissimilarity

during post-natal development, following birth. While the cerebellum specializes after birth in both

species, other temporal dissimilarity profiles differ between the species. Further measurements are

needed to clarify if this mismatch reflects a fundamental difference between rodent and primate


development, or if it is due to differences in the experimental technique or the specific subset of six

regions measured in humans.

Interestingly, recent studies have shown examples of whole-organism developmental gene expression

profiles that follow an hourglass shape. (Kalinka et al. 2010) measured inter-species distances over

development for six species of flies and found that the distance is minimized during the presumed

'phylotypic' stage. (Domazet-Lošo and Tautz 2010b) analyzed the phylotypic stage further by looking into

the relative ages of genes expressed in different stages of development and finding that the genes

expressed during the phylotypic stage are more ancient, hence more stable in face of evolutionary

changes.

The above findings suggest that expression dissimilarity decreases at the same developmental phases

where brain regions become anatomically segregated and specialized. The question remains if the

reduced dissimilarity in mRNA is accompanied by reduced dissimilarity in regional protein abundance

profiles across the brain. Alternatively, post-transcription regulation mechanisms may take a larger role

in preserving specialization across brain regions and explain this apparent mismatch.

ISH provides a much higher spatial resolution than the one used in this study, which can be used to

investigate specialization at a finer scale of cell layers and even cell types. This is especially important

when considering the fact that gene expression as measured here reflects cell densities, as well as

transcript abundance. Quantifying and correcting for regional cell densities is a crucial step towards a

more accurate description of the neural transcriptome. Furthermore, the recent availability of

transcription measures from other species (©2012 Allen Institute for Brain Science, NIH

Blueprint Non-Human Primate (NHP) Atlas, available from

http://www.blueprintnhpatlas.org/; ©2012 Allen Institute for Brain Science, BrainSpan Atlas of

the Developing Human Brain, available from http://brainspan.org/) calls for a thorough study

of the similarities and differences of development as reflected in gene expression between species to

understand the genetic blueprint underlying brain development.


Chapter 3: Methods to represent neural ISH images

3.1 Introduction

Analysis of gene expression patterns in the brain poses a unique challenge due to the highly complex

nature of brain tissues; each region is composed of many types of neurons and glial cells (Dickson 2002;

Lein et al. 2007). When measuring gene expression using popular methods such as DNA microarrays or

RNA-sequencing, expression patterns of different cells are typically mixed because of the difficulty in

extracting RNA from single cells. Such cell-type-specific measurement has been achieved to some extent using cell sorting techniques (Cahoy

et al. 2008) and more recently by single-cell RNA-Seq (Wu et al. 2014). Another approach is to use high

resolution gene expression maps obtained using ISH. As discussed in section 1.2.3, in an ISH experiment,

a labeled probe is hybridized to a complementary mRNA strand in the tissue itself, which is thinly sliced,

mapping the transcripts to their original location. This method measures RNA expression at a very high,

even sub-cellular, spatial resolution, but each experiment is limited to measure expression for a small

number of genes.

The challenge of building a whole genome database of ISH for the mouse brain was met by the Allen

Institute for Brain Science, which has created a comprehensive mouse brain atlas, freely available online

at www.brain-map.org/. Examples of ISH images for two genes measured on adult brains are shown

in Figure 3.1. Expression was measured on adult brain tissues for every single gene in the mouse genome,

and for ~2000 genes along pre- and post-natal development (Thompson et al. 2014). The first version of

the atlas was completed in 2006, and was used to identify spatial expression patterns in the brain at a

genome-wide level (Lein et al. 2007), and to explore the structural organization of the brain from a

genomic perspective (Bohland et al. 2010). Since then, the atlas was extended in several ways, and now

also covers the human brain and non-human primates (Lein et al., 2007; Henry and Hohmann, 2012; Ng

et al., 2009).


This recent explosion of high-resolution expression data measured in mammalian brains calls for new

ways to analyze neural gene expression images. Most existing methods for bio-imaging analysis were

developed to handle data with very different characteristics, like Drosophila embryos (Frise et al., 2010;

Peng et al., 2007; Pruteanu-Malinici et al., 2011) or cellular imagery (Peng et al., 2010; Coelho et al.,

2010). The complex nature of the mammalian brain poses new challenges for analysis. The expression of

each gene was measured on a brain of a different individual mouse and the images cannot be easily

aligned to each other and compared. As with every other organ in the body, there is variability between

individuals, even on the level of neural structures and layers. Examples of images obtained from different

brains are shown in Figure 3.2. These differences make naïve approaches such as a simple correlation

between images impractical. Indeed, current approaches for analyzing brain images are based on smooth

nonlinear transformations to a reference atlas (Davis and Eddy, 2009; Hawrylycz et al., 2011) but these

methods may be insensitive to fine local patterns like those emerging from the layered structure of the

cerebellum or the spatial distribution of cortical interneurons.

This chapter describes several methods to represent and analyze the large collection of neural ISH images.

Specifically, we first describe ways to represent the images using visual features extracted from them.

Then, we describe a method we developed to represent the images in a functional way – adding a layer

of semantic interpretability to the representations. In the next chapter, we will describe the

implementation of these methods to extract meaningful biological information from the images such as

identification of layer-specific gene markers, spatial co-expression profiles, functional annotations for

genes and disease-gene markers.

Figure 3.1: Gene expression ISH images for the genes (A) Mapt and (B) Gria2. Expression is shown as black dots on the tissue, marking neural cells expressing the gene of interest.

3.2 A visual representation of ISH images

In this section, we first describe methods to extract visual features from the images. While these methods

have been originally developed for natural images, they hold several properties that make them suitable

to use in the case of neural ISH imagery. We then discuss a method to create a compact representation

from the large collection of features assembled from each image, and a method to incorporate spatial

information into the representation. Lastly, we describe using the representations as feature vectors for

classifiers as a way to classify or label the images.

3.2.1 Feature extraction

There are many ways to extract features from images; here I focus on two popular methods: scale

invariant feature transform (SIFT) and local binary patterns (LBP). Using these methods, we are able to

create a robust and compact representation for ISH images in several scales.

Scale invariant feature transform

Figure 3.2: Gene expression for each gene was measured on a brain from a different individual mouse, making the images difficult to compare with naïve methods such as pixel-pixel correlation.

The Scale Invariant Feature Transform (SIFT) was introduced by David Lowe in the late nineties (Lowe

1999). It was found to be very effective for matching objects in different images and is used extensively

in applications such as video tracking, object recognition and scene understanding. Using SIFT, we can

transform an image into a collection of features, called “descriptors”. The descriptors can be extracted in

specific “key-points” in the image, or on a grid. For natural images, usually the “key-point” strategy is

chosen, and a major step in the process of image representation is to identify the key-points. Here we will

use a grid-based strategy where descriptors are calculated on a regular grid. This strategy is more

appropriate for neural ISH imagery because of the dense and more uniform-looking nature of the

information contained in the images.

Specifically, we compute the descriptors as follows: any region to be represented by a descriptor is resized

to 16*16 pixels. Then, for each pixel, the orientation and magnitude of the intensity gradient are computed.

The region is divided into a 4*4 grid of cells, each 4*4 pixels in size, and for each cell a histogram of 8

gradient orientations is computed. These 16 histograms are then concatenated to form a 128D feature

vector (16 cells * 8 orientations = 128). The process of calculating a SIFT descriptor is depicted in Figure 3.3.
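The descriptor computation described above can be sketched in a few lines. This is a simplified illustration, not the VLFeat implementation used in this work: it omits the Gaussian weighting, interpolation between bins and clipping that full SIFT implementations apply, and the function name `sift_like_descriptor` and the final L2 normalization are assumptions made for the example.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Simplified 128-D SIFT-style descriptor of a 16x16 grayscale patch."""
    patch = np.asarray(patch, dtype=float)
    assert patch.shape == (16, 16)
    # Per-pixel intensity gradient: magnitude and orientation in [0, 2*pi).
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    # Quantize the orientation into 8 bins (guarding the upper edge).
    bins = np.minimum((orientation / (2 * np.pi) * 8).astype(int), 7)
    histograms = []
    for row in range(0, 16, 4):          # 4x4 grid of cells,
        for col in range(0, 16, 4):      # each cell is 4x4 pixels
            cell_bins = bins[row:row + 4, col:col + 4].ravel()
            cell_mag = magnitude[row:row + 4, col:col + 4].ravel()
            # Magnitude-weighted histogram of the 8 orientations.
            histograms.append(np.bincount(cell_bins, weights=cell_mag, minlength=8))
    descriptor = np.concatenate(histograms)   # 16 cells * 8 bins = 128
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor
```

For a patch whose intensity increases uniformly from left to right, every gradient points in the same direction, so only one orientation bin per cell is populated.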

Local binary patterns

Figure 3.3: Calculating SIFT descriptors. (A) Any region to be represented by a descriptor is resized to 16*16 pixels, and the orientation and magnitude of the intensity gradient are computed for each pixel. (B) The region is divided into a 4*4 grid of cells, each 4*4 pixels in size, and for each cell a histogram of 8 gradient orientations is computed. (C) The 16 histograms are concatenated together to form a 128D feature vector. Figure adapted from: Levi Gil, (2013, August 18). A Short introduction to descriptors [blog post]

When examining similar-scaled expression patterns at different regions in the brain we find cells that are

arranged in many different textures, as shown in Figure 3.4. The different patterns can reflect varying

compositions of cell types, or different distributions of cells in the tissue. An important aspect in feature

extraction from these images is, therefore, the ability to capture these different textures. An effective

feature extraction method that captures textures well is called Local Binary Patterns (LBP). LBP was first

described in the early nineties (Ojala, Pietikainen, and Maenpaa 2002).

The LBP feature vector, in its simplest form, is created in the following manner. The image is divided into

rectangular sections. Then the intensity value of each pixel in the section is compared to the intensity values

of its eight neighboring pixels. Each comparison is binarized: where the center pixel's value is greater than

the neighbor's value, a value of 1 is assigned. Otherwise, the value 0 is assigned (see Figure 3.5). Reading

the eight bits in a fixed order gives an 8-digit binary number, which can take 2^8 = 256 possible values.

For each section, we calculate a histogram that counts the frequency of each of these binary numbers,

and concatenate the histograms of the different sections to create the full image

representation.
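The simplest form of LBP described above can be sketched as follows. The comparison convention follows the text (a bit is 1 where the center pixel is greater than the neighbor). The bit ordering, the handling of border pixels (skipped here), the section size and the function names are assumptions made for illustration.

```python
import numpy as np

def lbp_codes(image):
    """8-bit LBP code for every interior pixel of a grayscale image."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    center = image[1:-1, 1:-1]
    # Eight neighbor offsets, clockwise starting at the top-left neighbor.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dr, dc) in enumerate(offsets):
        neighbor = image[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        # Bit is 1 where the center pixel is greater than this neighbor.
        codes |= (center > neighbor).astype(np.int64) << bit
    return codes

def lbp_histogram(image, section=8):
    """Concatenate 256-bin histograms of LBP codes over square sections."""
    codes = lbp_codes(image)
    histograms = []
    for r in range(0, codes.shape[0], section):
        for c in range(0, codes.shape[1], section):
            block = codes[r:r + section, c:c + section].ravel()
            histograms.append(np.bincount(block, minlength=256))
    return np.concatenate(histograms)
```

An isolated bright pixel on a dark background yields the code 255 (all eight comparisons are true), whereas a perfectly flat region yields the code 0 everywhere.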

Figure 3.4: various patterns of expression taken from one ISH image, at the same scale. Different regions in the brain exhibit expression signatures in many different textures.

Figure 3.5: Calculating LBP features. The image is divided into rectangular sections. The intensity value of each pixel in the section is compared to the intensity values of its eight neighboring pixels, and the comparisons are binarized: where the center pixel's value is greater than the neighbor's value, a value of 1 is assigned. Otherwise, the value 0 is assigned.

3.2.2 Feature aggregation using “Bags of visual words”

After the extraction of local features from an image, one way to represent them compactly is to

create a “visual vocabulary” using a method called “bag-of-visual-words”. This method is adapted from

the representation of large text corpora, where documents are represented as histograms of word

frequencies. Here, the image descriptors from all images are first divided into groups, for example using

K-Means clustering. A representative descriptor is then chosen for each group, such as the centroid of

the cluster, and is called a “visual word”. Together, the visual words form a “dictionary” or “vocabulary”.

Each descriptor in an image is then assigned to its closest visual word, so we can count the frequency of

each visual word in the image and represent the image as a histogram of visual words,

called a “bag-of-visual-words”. The process is described using natural image examples in Figure 3.6.
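The bag-of-visual-words construction can be sketched as below: a plain Lloyd k-means builds the dictionary, and each image is then summarized by counting its descriptors' nearest visual words. This is a toy illustration assuming Euclidean distance and small inputs that fit in memory. The function names and the defaults for `n_iter` and `seed` are hypothetical.

```python
import numpy as np

def assign_words(descriptors, centroids):
    """Index of the nearest centroid (Euclidean distance) per descriptor."""
    diff = descriptors[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd k-means; the returned centroids are the 'visual words'."""
    descriptors = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centroids by sampling k distinct data points.
    centroids = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign_words(descriptors, centroids)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def bag_of_words(descriptors, centroids):
    """Histogram counting how often each visual word occurs in one image."""
    words = assign_words(np.asarray(descriptors, dtype=float), centroids)
    return np.bincount(words, minlength=len(centroids))
```

In practice, the dictionary is learned once from descriptors pooled over all images, and `bag_of_words` is then applied to each image separately, so that all histograms share the same axes.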

Figure 3.6. Representing images using the Bag-of-visual-words model. (A) local features such as SIFT descriptors are extracted

from the images (B) the features from all the images in the dataset are aggregated together and (C) separated into groups using

methods such as K-means clustering. A representative feature is decided for each group. This can be, for example, the centroid

of each cluster. These representative features are called “visual words” (D) All of the descriptors in each image are assigned to

one visual word, using a distance measure such as the Cosine distance. Finally, each image is represented by a histogram counting

the number of occurrences of each visual word in it.

3.2.3 Applying a spatial pyramid kernel to the images

One of the main advantages of the visual BoW method described above is the ability to find small-scale

spatial patterns regardless of their location in the brain. The price is that the representation discards

spatial location. One way to incorporate location information into the representation is to use a method

called spatial pyramid kernels (Lazebnik, Schmid, and Ponce 2006). Using this approach, every image is

split into 4 and into 16 rectangles, and the bag-of-words method is applied to the full image and to each

rectangle separately (Figure 3.7). The resulting feature vector is a concatenation of the 1+4+16 = 21

histograms. This approach has been shown to be highly successful in machine vision tasks (Grauman and

Darrell 2005; Huang 2009). The downside of this approach is that it inflates the feature dimensionality

significantly, and requires reducing the dictionary size, creating a challenging tradeoff when computing

both local and global features. An alternative approach could be based on data-dependent segmentation

of images into anatomic structures (like the thalamus, cortex or cerebellum) followed by coding each

structure separately. Such segmentation is a topic for separate

research.
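The concatenation of the 1+4+16 = 21 histograms can be sketched as follows. The sketch assumes that each descriptor has already been assigned a visual-word index and that its (row, column) position in the image is known. The histograms are concatenated without the per-level weights used in the original spatial pyramid kernel, and the function and argument names are hypothetical.

```python
import numpy as np

def spatial_pyramid_histogram(word_ids, positions, image_shape, n_words,
                              levels=(1, 2, 4)):
    """Concatenate visual-word histograms over 1x1, 2x2 and 4x4 grids."""
    word_ids = np.asarray(word_ids)
    positions = np.asarray(positions)
    height, width = image_shape
    rows, cols = positions[:, 0], positions[:, 1]
    histograms = []
    for g in levels:
        # Grid cell containing each descriptor at this pyramid level.
        cell_r = np.minimum(rows * g // height, g - 1)
        cell_c = np.minimum(cols * g // width, g - 1)
        cell = cell_r * g + cell_c
        for idx in range(g * g):
            histograms.append(np.bincount(word_ids[cell == idx], minlength=n_words))
    return np.concatenate(histograms)   # length = 21 * n_words
```

With a dictionary of 500 visual words, this yields 21 * 500 = 10500 features per resolution level, which illustrates the inflation in dimensionality discussed above.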

Figure 3.7: A spatial pyramid approach to extracting dense SIFT features. Features were extracted in the full image (a) and the

image divided into four parts (b) and 16 parts (c).

3.2.4 Using the representations for classification

After extracting visual features and representing the images in a compact way, we can use the new

representations as input features for unsupervised or supervised machine learning methods in order to

group them or make decisions about them (Figure 3.8). The next section describes a way to use this

approach to create a functional, semantic representation of the images, and the next chapter discusses

using the principles described above to extract meaningful information from the ISH image collection

available in the Allen Brain Atlas in order to identify gene co-expression patterns that are interpretable,

infer gene functionality from spatial expression patterns, identify disease-related genes and find layer and

cell-specific expression patterns in the brain.

Figure 3.8: Using compact ISH image representation as input for classifiers. (A) After feature extraction and image

representation using methods such as SIFT-BoW, (B) the images can then be used as input features for classifiers, where genes serve

as positive and negative examples for different biological properties.

3.3 A functional representation of ISH images

One important challenge for automatic analysis of biological images lies in providing human interpretable

analysis. Most machine vision approaches are developed for tasks in analysis of natural images, like object

recognition. In such tasks, humans can understand the scene effortlessly, and infer complex relations

between objects easily. In bio-imaging however, the goal of image analysis is often to reveal features and

structures that are hardly seen even by experts. It is therefore important that an image analysis approach

provides meaningful interpretation to any patterns or structures that it detects.

Here we develop a method to learn functional representations of expression images by using predefined

functional ontologies. This approach has two main advantages: accuracy and interpretability, and it builds

on a growing body of work in object recognition in natural images, showing how images can be

represented using the activations of a large set of detectors (Deng, Berg, and Fei-Fei 2011; Torresani,

Szummer, and Fitzgibbon 2010; Li et al. 2010b; Li et al. 2010a; Malisiewicz 2012; Malisiewicz, Gupta, and

Efros 2011). For object recognition, the detectors may include common objects, like a detector for the

presence of a chair, a mug or a door. Here we show how to adapt this idea to represent gene expression

images, by training a large set of detectors, each corresponding to a known functional category, like axon

guidance or glutamatergic receptors. Once this representation is trained, every gene is represented as a

point in a low-dimensional space whose axes correspond to functionally meaningful categories.

3.3.1 Data filtering and preprocessing

We used whole-brain, expression-masked images of gene expression measured using ISH, publicly

available at the Allen Brain Atlas (www.brain-map.org). Expression was measured for the entire mouse

genome. For each gene, a different adult mouse brain was sliced into 100µm thick slices, mRNA

abundance was measured and the slice was imaged. The database holds image series for over 20K

transcripts. Most genes have one corresponding image series, containing ~25 imaged brain slices. Some

genes were imaged more than once and have several associated image series.

Choosing an image to represent each gene

In our analysis, we used the most medial slice for each image series, yielding a typical image size of 8Kx16K

pixels. 4823 out of the available 21174 images showed no expression in the brain and were ignored in

subsequent analysis, leaving 16351 images representing 15612 genes.

In order to take into fuller account the 3D structure of the brain, we repeated the full set of our

experiments while including two additional sagittal sections. The three sections used were taken from one

hemisphere, capturing the medial section and also the 30% and 50% marks on the medial-lateral axis. An

example of three such slices is shown in Figure 3.9. However, the results of the experiments using multiple

slices were inconclusive and so we report results based on the medial slice alone. One possible reason is

that the location of the non-medial slices is more variable, due to variation across

brains.

Figure 3.9: Each image series was represented with three slices, the most medial (a), and the 30% (b) and 50% (c) marks on the

medial-lateral axis.

Using expression masked images

Images in the Allen dataset are provided in two formats: the raw imagery, and images that were processed

to remove the background, yielding expression-masked images. The analysis was applied to the masked

images. This is a big advantage when examining expression patterns, as noise effects coming from

cytoarchitecture and underlying brain structures are reduced. Examples of a pair of images are given below

in Figure 3.10. Figure 3.11 shows examples of images, demonstrating the complexity of neural expression

patterns across brain regions and multiple scales. The images analyzed in our study were in grey scale but

are shown here as color-coded by expression intensity for better visualization.

Figure 3.10: Regular (A) and expression-masked (B) examples of ISH images as provided by the Allen Brain Atlas, for the gene Tuba1. While the expression masked images are presented in color, the color images are in fact derived from gray-scale images, which we have used in this work.

Figure 3.11. The raw data. ISH image for the gene Tuba1 shown (A) at different scales and (B) in three different regions.

3.3.2 Creating the representations

We present a method to identify similarities between neural ISH images and to explain these similarities

in functional terms.

Our method consists of a visual phase - where we transform the raw pixel images into a robust visual

representation, and a semantic phase - where we transform that visual representation using a set of 2081

gene-function detectors. The output of these detectors comprises a higher-order semantic

representation of the images in a gene-functional space (Figure 3.12). Similar two-phase systems have

recently been proposed and applied successfully for tasks such as cross-domain image similarity and

object detection in natural images (Malisiewicz 2012; Malisiewicz, Gupta, and Efros 2011; Li et al. 2010b;

Deng, Berg, and Fei-Fei 2011; Li et al. 2010a; Torresani, Szummer, and Fitzgibbon 2010).

For the first, visual, phase, we represent each image as a collection of local descriptors using SIFT features

(Lowe 2004). This step aims to address the problem that ISH brain images of the same gene vary

significantly in shape and size when measured in different brains (Kirsch, Liscovitch, and Chechik 2012).

SIFT features are histograms of oriented gradients on a small grid. The resulting image-patch SIFT

descriptor is invariant to small rotations and illumination changes (but not to scale), making imaged slices from

different brains more comparable. We computed SIFT descriptors of dimension 128 extracted on a dense

grid spanning the full image (Bosch et al., 2006; Bosch et al., 2007; Csurka and Dance, 2004), at four spatial

resolutions. In ISH images, different information lies at different descriptor sizes, and we want the

representation to capture spatial patterns at the level of single cells and micro-circuitry, as well as at the coarser

level of distribution of expression across brain layers. To capture information at multiple scales, we used

the VLFeat implementation of SIFT (Vedaldi and Fulkerson 2010), where scale-invariance is not

incorporated automatically. Specifically, each image is represented as a collection of ~1M SIFT

descriptors, computed after downsampling each image by factors of 1, 2, 4 and 8. Since the descriptors

were extracted from high resolution images which are mostly dark, many descriptors were completely

dark and were discarded.

Next, to achieve a compact non-linear representation of each image, we aggregate the descriptors from

all images for a given resolution level, and cluster them to form a dictionary of distinct “visual words” for

each resolution level (see Section 3.2.2). We used the original Lloyd optimization for k-Means with L2

distance, initializing the centroids by randomly sampling data points. The clustering procedure was

repeated multiple times (n=3), and the solution with the lowest energy was used. We tested 4 different

dictionary sizes (k=100, 200, 500, 1000), all yielding similar results (see Section 3.3.3), and report below

results for k=500 which obtained slightly higher accuracies. Next, we construct a standard “bag-of-words”

(Bosch, Zisserman, and Muñoz 2006) description of each image. As a result of this process, each image is

described by four concatenated 500-dimensional vectors counting how many times each “visual word”

appeared in it at a given resolution level. We also added a count of the number of zero descriptors per

resolution level, ending up with a 2004 dimensional vector describing each image. Using this approach,

similar spatial information from different brain regions is preserved, as opposed to using global

correlation-based approaches.
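The assembly of the 2004-dimensional image vector (four resolution levels, each contributing 500 visual-word counts plus one count of all-zero descriptors) can be sketched as follows. The ordering of the entries within the vector is an assumption made for illustration, as are the function and argument names. The toy distance computation below would not scale to the ~1M descriptors per image used in this work.

```python
import numpy as np

def image_vector(descriptors_per_level, dictionaries):
    """Per resolution level: visual-word histogram of the non-zero
    descriptors followed by the number of all-zero descriptors."""
    parts = []
    for descriptors, words in zip(descriptors_per_level, dictionaries):
        descriptors = np.asarray(descriptors, dtype=float)
        is_zero = ~descriptors.any(axis=1)     # completely dark descriptors
        kept = descriptors[~is_zero]
        # Squared L2 distance of every kept descriptor to every visual word.
        d2 = ((kept[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
        hist = np.bincount(d2.argmin(axis=1), minlength=len(words))
        parts.append(np.append(hist, is_zero.sum()))
    return np.concatenate(parts)   # 4 levels * (500 + 1) = 2004 entries
```

Counting the discarded zero descriptors separately keeps a record of how much of the image shows no expression at each resolution, without letting these descriptors dominate the clustering.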

We then turn to the second, "semantic", phase, and represent each image by a set of functional

descriptors. Given a set of predefined Gene Ontology (GO) annotations of each gene, we train one

separate classifier for each known biological annotation category, using the SIFT bag-of-words

representation as an input vector (see Section 3.2.4). Specifically, here we trained a set of 2081

L2-regularized logistic regression classifiers (using LIBLINEAR (Fan et al. 2008)) corresponding to

biological-process GO classes that have 15-500 annotated genes (see Section 3.3.3). We trained the classifiers

using two layers of 5-fold cross validation, performed as follows: The full set of 16351 gene images was

split into five non-overlapping equal sets (without controlling for the number of positives in each split),

training the classifiers on four of them and testing performance on the fifth, unseen test set of images.

This procedure was repeated five times, each time with a different set acting as the test set. All accuracy

and other results below are reported for a held-out test set that was not used during training.

To tune the logistic regression regularization hyperparameter, we used a second layer of cross validation.

We repeated the splitting procedure within each of the five training sets, splitting each of them again into

five subsets of images, using four for training and the fifth as a validation set. The regularization

hyperparameter was selected from the values {0.001, 0.01, 0.1, 1, 10, 100}. At the end of this process,

each gene is represented as a vector of "activations", each corresponding to the estimated likelihood that the gene

belongs to a given functional category, such as “forebrain development” or “regulation of fatty acid

transport”.
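The two layers of 5-fold cross validation can be sketched for a single GO category as follows. Several details are assumptions made for illustration rather than a description of the actual pipeline: a small gradient-descent solver stands in for LIBLINEAR, the candidate values are treated as weights of the L2 penalty (LIBLINEAR's own cost parameter acts inversely), and the regularization value is chosen by mean validation log-loss. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam, n_iter=200):
    """L2-regularized logistic regression by gradient descent (stand-in solver)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Step size from an upper bound on the curvature, so the updates stay stable.
    step = 1.0 / (0.25 * (np.mean(np.sum(X ** 2, axis=1)) + 1.0) + lam)
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= step * (X.T @ (p - y) / n + lam * w)
        b -= step * np.mean(p - y)
    return w, b

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def nested_cv_activations(X, y, lams=(0.001, 0.01, 0.1, 1, 10, 100), k=5, seed=0):
    """Held-out activation for every sample, using two layers of k-fold CV."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), k)
    activations = np.empty(len(y))
    for i, test_idx in enumerate(outer):
        train_idx = np.concatenate([f for j, f in enumerate(outer) if j != i])
        inner = np.array_split(train_idx, k)
        # Inner layer: choose the regularization value on validation folds.
        mean_loss = []
        for lam in lams:
            losses = []
            for m, val_idx in enumerate(inner):
                fit_idx = np.concatenate([f for n, f in enumerate(inner) if n != m])
                w, b = fit_logreg(X[fit_idx], y[fit_idx], lam)
                losses.append(log_loss(y[val_idx], sigmoid(X[val_idx] @ w + b)))
            mean_loss.append(np.mean(losses))
        best = lams[int(np.argmin(mean_loss))]
        # Outer layer: refit on the whole training part, score the unseen fold.
        w, b = fit_logreg(X[train_idx], y[train_idx], best)
        activations[test_idx] = sigmoid(X[test_idx] @ w + b)
    return activations
```

Running this once per GO category and stacking the resulting columns gives, for every gene, a vector of held-out activations in which no entry was produced by a classifier that saw that gene during training.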

The representation described above removes important information about global location in the brain.

We therefore also tested an approach using spatial pyramids (Lazebnik, Schmid, and Ponce 2006), where

descriptor histograms are computed separately for different parts of the image (see Section 3.2.3).

Unfortunately, this approach results in feature vectors whose dimensionality was too high for the current

dataset, and yielded poor classification results. We concluded that the increase in feature dimensionality hurts

more than is gained by describing different brain regions separately.

3.3.3 Choosing parameters for analysis

Choosing the dictionary size

In order to choose the size of the visual word dictionary, we performed analysis with four dictionary sizes:

100, 200, 500 and 1000. Figure 3.13 shows mean test-set AUC values obtained using the different

dictionary sizes. Mean AUC across categories is insensitive to the size of the dictionary (K). To check how

stable the representations are between the different K's, we measured the Pearson correlation between

AUC values of the 2081 GO categories using the different dictionary sizes. Correlation values are very high

and are shown in Table 3.1. The lowest correlation value is 0.846, between K=100 and K=1000, and is still

highly significant ($P < 10^{-100}$). The correspondence between AUC values for the 2081 GO categories obtained

using these two dictionary sizes is shown in Figure 3.14, indicating a high linear correspondence.

Figure 3.12. Illustration of the image processing pipeline. (A) Original image in pixel grayscale indicating level of

gene expression. (B) Local SIFT descriptors are extracted from image at 4 resolutions. (C) Descriptors from all

16351 images are clustered into 500 representative ‘visual words' for each resolution level using k-Means. (D) Each

image is represented as a histogram counting the occurrences of visual words. (E) L2-regularized logistic regression

classifiers are applied for 2081 GO categories. (F) The final 2081 dimensional image representation.

Figure 3.13: Mean test-AUC values for dictionary size K=100, 200, 500, 1000. Error bars indicate standard error of mean across

five folds in cross-validation data.

Dictionary size (K)    100      200      500      1000
100                    1        0.896    0.861    0.846
200                    0.896    1        0.896    0.883
500                    0.861    0.896    1        0.917
1000                   0.846    0.883    0.917    1

Table 3.1: Pearson's rho correlation values between AUC results for 2081 categories, compared across the 4 different dictionary

sizes. Correlations are high (the lowest is 0.846 between K=100 and K=1000).

Figure 3.14: Mean test-set AUCs for dictionary size K=100 versus K=1000. This pair of dictionary sizes is the least correlated

among all dictionary size pairs. It can be seen that even in this case, the correlation is high and indicative of a stable

representation.

Choice of GO category size

We chose GO categories with a number of annotations ranging from 15 to 500 genes. We set the lower

limit to 15 in order to provide enough positive examples for testing the classifiers across five cross-

validation partitions. The higher limit is set to 500 to preclude the resulting semantic explanations from

being very general (we use more specific categories such as "regulation of long-term neuronal synaptic

plasticity" or "glutamate receptor signaling pathway" and avoid general categories such as "transport" or

"biological regulation").

To make sure that this choice of categories did not cause a bias in the classification results, we checked

the relation between category size and test-set AUC scores. We found no significant relation between the

size of the GO category and the resulting AUC values (Figure 3.15).

Figure 3.15: Mean AUC (averaged over test-splits) for the GO categories vs. GO category size (number of genes in the category).

There is no significant relation between classification success of a category and the number of genes annotated to it.

Chapter 4: Analysis of neural ISH images

The previous chapter discusses methods to extract features from neural ISH images and represent them

using both visual and semantic features. In this chapter, I will present several ways to use these

representations in order to extract biological information from them. Specifically, Section 4.1 discusses

using the functional representations presented in Section 3.3 to identify gene-gene co-expression

patterns that can be explained in meaningful, semantic terms. Section 4.2 demonstrates using the ISH

images to identify neural-disease related genes, and in Section 4.3 the images are used to identify layer-

specific gene markers in the cerebellum.

4.1 Explainable gene coexpression patterns using ISH functional representations

In Section 3.3, we described a method to learn functional representations for neural ISH images. We now

turn to use these representations to identify gene co-expression patterns. Gene co-expression in the brain

has been extensively studied using the more popular methods to measure mRNA expression which

produce expression values aggregated for regions, without considering fine-resolution spatial

information that may differentiate between brain regions (see Sections 1.2.1, 1.2.2, 1.3.3). ISH image

analysis has been used in the past to infer gene biological functions from spatial co-expression in non-

neural tissues (Frise, Hammonds, and Celniker 2010). However, inferring functions based on gene

expression patterns in the brain is believed to be hard, since several studies found very low variability

between transcriptomic patterns of different brain regions, sometimes even lower than between-subject

variability for the same area (Khaitovich et al. 2004; Khaitovich et al. 2005). Here we identify spatial co-

expression patterns exploiting the fine-scaled patterns that exist in the brain ISH imagery, eventually

using this subtle, even cellular resolution, spatial information for functional inference.

4.1.1 Calculating image-image similarities

Following the method described in Section 3.3, each gene can now be represented as a vector of

functional category activations. In order to identify spatial co-expression patterns between the genes, we

can simply calculate the similarity between these functional representations. We use two gene-gene

similarity measures in this work. The first, flat-sim, is simply the linear correlation of two functional

category activation vectors. The second, GO-sim, takes into account the known directed acyclic graph

(DAG) structure among the functional categories of the GO annotation.

Formally, the flat-sim score between a pair of $L_2$-normalized feature vectors $a = (a_1, \ldots, a_m)$ and

$b = (b_1, \ldots, b_m)$ is given by their dot product, $\text{flat-sim}(a, b) = \sum_{i=1}^{m} a_i \cdot b_i$. This additive similarity measure

allows assessing the contribution of each individual feature to the overall similarity score, by setting the

contribution of feature $i$ (corresponding to GO category $i$) to $a_i \cdot b_i$. Thus, for each pair of similar

images, we can sort the GO categories by order of their contribution to the similarity, providing a semantic

interpretation of the correlation.
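The flat-sim score and its per-category decomposition can be illustrated with a short sketch. The function names and the toy category names are hypothetical and serve only to show how the additive terms are ranked.

```python
import numpy as np

def flat_sim(a, b):
    """Dot product of the L2-normalized vectors, plus the per-category terms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    contributions = a * b            # term i is the contribution of category i
    return contributions.sum(), contributions

def top_categories(contributions, names, n=3):
    """Categories sorted by decreasing contribution to the similarity."""
    order = np.argsort(contributions)[::-1][:n]
    return [(names[i], float(contributions[i])) for i in order]
```

Because the score is a plain sum, the top-ranked terms directly name the functional categories that make two expression images similar.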

However, flat-sim does not take into account that the activation of some functional categories can be far

more informative than others. For example, two genes that share a very specific function like "negative

regulation of systemic arterial blood pressure" are much more likely to be similar than a pair of genes

sharing a more general category like "metabolism". We address this issue by adapting a functional

similarity measure between gene products developed by Schlicker et al. (2006), which we refer to as

GO-sim. GO-sim is designed to give high similarity scores to gene pairs that share many specific and similar

functional categories. We treat our model’s functional activations as binary annotations (using a

threshold of 0.5), and calculate GO-sim as follows.

For each GO category $i$, we calculate its Information Content (IC) as

$$IC(i) = -\log_{10} \frac{\#\,\text{genes in } i}{\text{total } \#\text{ of genes}},$$

which measures the specificity of each category. For each pair of categories $i$ and $j$, we consider the set of their

common ancestors $anc(i, j)$ and define

$$sim_{rel}(i, j) = \max_{k \in anc(i, j)} \frac{2\, IC(k)}{IC(i) + IC(j)} \left(1 - 10^{-IC(k)}\right).$$

The measure $sim_{rel}$ is symmetric, bounded between 0 and 1, and attains larger values for pairs of categories which

are both specific and close to each other in the GO graph.
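The two definitions above can be checked on a toy ontology. In the sketch below the GO graph is given as a hypothetical `parents` dictionary, and each category is counted among its own ancestors. This is an assumption of the example, chosen so that a category compared with itself scores $1 - 10^{-IC(i)}$. The function names are illustrative.

```python
import math

def information_content(n_genes_in_category, n_genes_total):
    """IC(i) = -log10(#genes in i / total # of genes)."""
    return -math.log10(n_genes_in_category / n_genes_total)

def ancestors(term, parents):
    """All ancestors of a term in the DAG, including the term itself (assumption)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def sim_rel(i, j, parents, ic):
    """Return (sim_rel value, most informative common ancestor) for i and j."""
    best_score, best_term = 0.0, None
    denom = ic[i] + ic[j]
    for k in ancestors(i, parents) & ancestors(j, parents):
        score = 2 * ic[k] / denom * (1 - 10 ** (-ic[k])) if denom > 0 else 0.0
        if score > best_score:
            best_score, best_term = score, k
    return best_score, best_term
```

For example, with 1000 genes in total, two sibling categories of 10 genes each (IC = 2) whose only informative common ancestor has 100 genes (IC = 1) obtain a score of (2 * 1 / 4) * (1 - 0.1) = 0.45.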

In our method, each gene is annotated with multiple categories. Naively, we could calculate the mean

$sim_{rel}$ measure between all pairs of categories, but calculating this mean could give weight to many

irrelevant categories, and be sensitive to the addition of extra annotations to a gene. Instead, we use a

more robust method to measure similarity between two sets of function annotations, developed by

Schlicker et al. (2006). This method relies on the most similar category pairs, instead of all the pairs. For two

binary activation vectors $a = (a_1, \ldots, a_m)$ and $b = (b_1, \ldots, b_m)$, define a matrix $S_{ij} = sim_{rel}(i, j)\, a_i b_j$. Then we

define

$$sim_{a \to b} = \frac{1}{m} \sum_{i=1}^{m} \left( \max_{j = 1 \ldots m} S_{ij} \right),$$

which measures for each annotation of $a$ its most similar annotation in $b$, and averages across all of $a$'s

annotations. We similarly define $sim_{b \to a}$ with the roles of $a$ and $b$ switched, and use it to define

GO-sim $= \max(sim_{a \to b}, sim_{b \to a})$. To assess the contribution of individual gene functional annotations

to the GO-sim measure, we look at the category pairs $(i, j)$ corresponding to the highest values of $S_{ij}$.

Each such pair also has its “most informative common ancestor”

$$MICA(i, j) = \operatorname{argmax}_{k \in anc(i, j)} \frac{2\, IC(k)}{IC(i) + IC(j)} \left(1 - 10^{-IC(k)}\right).$$

These ancestor functional categories give a succinct interpretation

of the similarity between genes $a$ and $b$.
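Given a precomputed matrix of $sim_{rel}$ values, the GO-sim definition can be sketched as follows. The function name and the argument `sim_rel_matrix` are hypothetical. The sketch relies on the symmetry of $sim_{rel}$, so that switching the roles of the two genes amounts to taking column-wise instead of row-wise maxima.

```python
import numpy as np

def go_sim(a, b, sim_rel_matrix):
    """GO-sim between two binary activation vectors a and b.

    sim_rel_matrix[i, j] holds sim_rel(i, j) for every pair of categories
    and is assumed to be symmetric.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S = sim_rel_matrix * np.outer(a, b)    # S_ij = sim_rel(i, j) * a_i * b_j
    sim_a_to_b = S.max(axis=1).mean()      # (1/m) * sum_i max_j S_ij
    sim_b_to_a = S.max(axis=0).mean()      # roles of a and b switched
    return max(sim_a_to_b, sim_b_to_a), S
```

The returned matrix `S` can be inspected for its largest entries, for example with `np.unravel_index(S.argmax(), S.shape)`, to find the category pair that contributes most to the similarity.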

Computing GO-sim for n=16351 genes, each with m functional annotations, is computationally

burdensome, requiring $O(n^2 m^2)$ operations. We therefore use only 164 brain-related categories out of

the total 2081 functional categories for calculating GO-sim.

4.1.2 Robustness of bag-of-words representations

In order to validate the stability of the bag-of-words gene representations, we measured the similarities

between pairs of representations of images that are of the same gene but from different image series,

and the similarities between the representations of different genes.

Similarity is much higher for representations of the same gene (Wilcoxon difference of medians test,

$p < 10^{-200}$). The similarity values are shown in Figure 4.1. This implies that representations of the same gene,

derived from different image series, are indeed stable and are representative of the gene.

Figure 4.1: The similarity in the representation of same-gene pairs (blue) and different-gene pairs (red). Each curve shows the

histogram of similarity values. Same-gene image series have highly similar representations.

4.1.3 Predicting functional annotations using brain ISH images

We applied our method to 16K ISH images of 15K genes, and mapped each image to a vector

corresponding to 2000 GO categories as functional features. We used the Area Under the ROC Curve

(AUC) as a measure of classification accuracy. All evaluations were performed on a separate held-out test

set. We find that 37% of the GO categories tested yielded a test set AUC value that was significantly above

random (permutation test, p-value<0.05). This was encouraging, since the variability of expression

between brain regions was previously shown to be very low (Khaitovich et al. 2004; Khaitovich et al.

2005). This suggests that fine spatial resolution in neural tissues can reveal highly meaningful expression

patterns.
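A minimal sketch of this evaluation is given below. It computes the AUC through its rank-sum interpretation and estimates significance by shuffling the labels. The permutation scheme, the number of permutations and the toy scores are assumptions made for illustration; the exact procedure used in the thesis is not specified here.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_pvalue(scores, labels, n_perm=1000, seed=0):
    """Fraction of label shufflings whose AUC reaches the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy classifier outputs for ten held-out genes (illustrative only).
scores = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
observed_auc = auc(scores, labels)
p_value = permutation_pvalue(scores, labels)
```

A category would count as "significantly above random" in this sketch when `p_value` falls below 0.05.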

Which functional categories can be best predicted by ISH images? Table 4.1 lists the top 15 GO categories

that achieved the best test-set AUC classification scores. Interestingly, these include mostly

biosynthesis/metabolism processes and neural processes. To further test whether neural categories

achieve higher classification values based on neural expression patterns, Figure 4.2 compares the AUC

scores of 164 categories related to the nervous-system with the AUC scores of the remaining categories.

As expected, neural GO categories receive significantly higher AUCs (Wilcoxon, $p < 10^{-38}$), with 69%

of categories yielding significantly above random AUC values.

These AUC values suggest that when a gene is represented as a feature vector of classifier activations,

many of the features carry a meaningful signal. The axes of the new low-dimensional representation

correspond to functional properties of each gene, linking functions of the genes to the geometry of the

space in which they are embedded.

Figure 4.2. AUC scores for GO categories related to the nervous system (dashed, red) and the remaining categories (solid, blue).

AUC scores are significantly higher for neural categories (Wilcoxon test, $p < 10^{-38}$). The red and blue ticks indicate the median of

each set.

GO ID GO category name #genes AUC

GO:0060311 negative regulation of elastin catabolic process 17 1

GO:0042759 long-chain fatty acid biosynthetic process 23 0.98

GO:0009449 gamma-aminobutyric acid biosynthetic process 20 0.96

GO:0009448 gamma-aminobutyric acid metabolic process 23 0.96

GO:0032348 negative reg. of aldosterone biosynthetic process 21 0.94

GO:2000065 negative regulation of cortisol biosynthetic process 21 0.94

GO:0043206 fibril organization 23 0.94

GO:0031947 negative reg. of glucocorticoid biosynthetic process 22 0.94

GO:0042136 neurotransmitter biosynthetic process 23 0.94

GO:0022010 central nervous system myelination 29 0.89

GO:0008038 neuron recognition 20 0.87

GO:0042220 response to cocaine 30 0.87

GO:0050919 negative chemotaxis 16 0.86

GO:0042274 ribosomal small subunit biogenesis 15 0.86

GO:0016486 peptide hormone processing 17 0.85

Table 4.1. The GO categories classified with highest test-set AUC values.

4.1.4 Comparison with NeuroBlast, the ABA image-correlation tool

How well does the method presented here compare to other methods suggested for finding similarity

between these images? We compared our results with NeuroBlast, a method to detect image-image

similarities available on the ABA website (Hawrylycz et al., 2011). This method uses a non-linear mapping

of the images to a reference anatomical atlas to apply voxel-voxel correlation between the images.

To evaluate the quality of the similarity measure, we used three sets of pair-wise relations as evidence of

gene relatedness: (1) markers of known cell types (Cahoy et al. 2008), such as astrocytes or

oligodendrocytes, (2) occurrence in the same KEGG pathway (Kanehisa 2002), and (3) a set of known

protein-protein interactions taken from IntAct (Kerrien et al. 2012). For each of the 16351 genes we

ranked the 100 most similar genes according to 4 different similarity measures: (1) Functional

representation of the ISH images (FuncISH) GO-sim, (2) FuncISH flat-sim, (3) cosine similarity between the

SIFT bag-of-words representations, and (4) the ABA NeuroBlast tool. For each of the pair-wise relations

(cell type markers, KEGG pathway and PPIs) we plot the mean fraction of relations retrieved at the top-K

most similar genes (precision-at-k), a standard method in information retrieval (Manning and Raghavan,

2009). Figure 4.3 shows that for all three validation labels, FuncISH GO-sim provides superior precision

for the top 10 ranked similar genes. The superior precision of GO-sim over flat-sim is presumably because

GO-sim weights categories more appropriately, and possibly also because GO-sim was limited to brain-related

categories that tend to be more accurately predicted (Figure 4.2). On the other hand, we see that

NeuroBlast outperforms flat-sim in most cases.
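The precision-at-k measure used in this comparison can be sketched as follows. The data structures (`rankings`, `relations`) and the toy rankings are hypothetical; gene names are reused from the text purely as labels and do not reproduce actual results.

```python
def precision_at_k(ranked_genes, related_genes, k):
    """Fraction of the k top-ranked genes that are known to be related to the query."""
    top = ranked_genes[:k]
    return sum(g in related_genes for g in top) / k


def mean_precision_at_k(rankings, relations, k):
    """Average precision-at-k over all query genes that have known relations."""
    values = [precision_at_k(rankings[q], relations[q], k)
              for q in rankings if relations.get(q)]
    return sum(values) / len(values)


# Toy similarity rankings: for each query gene, other genes sorted by similarity.
rankings = {"Pvalb": ["Calb1", "Calb2", "Add2", "Synpo2"],
            "Calb1": ["Add2", "Pvalb", "Calb2", "Rasa4"]}
# Toy pair-wise relations (e.g. shared cell-type marker status).
relations = {"Pvalb": {"Calb1", "Calb2"}, "Calb1": {"Pvalb", "Calb2"}}
p_at_2 = mean_precision_at_k(rankings, relations, k=2)
```

Computing this value for k = 1, ..., 100 and for each similarity measure yields curves of the kind shown in Figure 4.3.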

Figure 4.3. Precision at top-K for similarity defined by (A) cell type marker (B) KEGG pathways (C) protein–protein interaction.

Precision was measured using functional representations (FuncISH, purple lines for GO-sim, orange for flat-sim), SIFT (red) and

NeuroBlast (blue).

4.1.5 Identifying and explaining similarities between GABAergic neuron markers

We now turn to a deeper look into the similarity predictions. Interestingly, the highest classification

scores were achieved for the neural related categories GABA biosynthetic process and GABA metabolic

process (shown in Table 4.1), implying that our algorithm can identify spatial patterns of GABAergic

neurons. A prominent member of the GABAergic neuron marker family is Parvalbumin B (Pvalb), which

encodes a calcium-binding protein. We examined the genes that are most similar to Pvalb, and found

that another GABAergic neuronal marker and calcium-binding protein, Calbindin D28K (Calb1), is in the

top-15 most similar gene lists for all associated image series. Pvalb and Calb1 belong to a family of cellular

Ca2+ buffers in GABAergic interneurons. The third member of this family is Calretinin (Calb2). Looking at

the similarity rank of Calb1 and Calb2, Calb2 ranks within the top 2% (out of 16351 images in the

dataset) in 16 out of 17 cases. Similarities between these three genes were not identified by NeuroBlast.

This may be because NeuroBlast uses spatial correlation measures that produce results heavily reliant on

the spatial location of expression, while using functional representations can identify patterns that can

appear in different regions of the brain. A major benefit of representing genes in the functional

embedding space is that similarities between genes can be "explained" in functional terms. Calb1, Pvalb

and Calb2 are all involved in regulation of synaptic plasticity (Schwaller 2012). When looking at the

semantic interpretations explaining the similarities between the genes, 6 out of the top-10 GO categories

are indeed directly related to synaptic plasticity such as "synaptic transmission", "regulation of synaptic

plasticity" and "learning".

4.1.6 Finding important spatial patterns in different scales using SIFT "visual words"

A major advantage of representing ISH images with SIFT descriptors is the ability to point directly to

spatial patterns in these complex images. Although their name (scale-invariant feature transform) suggests otherwise, SIFT descriptors at

several scales capture different types of patterns. Figure 4.4 shows three visual words for each of the four

scales, selected as the visual words that contributed most to classification. Scale invariance is often

assumed when analyzing natural images since objects are photographed at varying distances. ISH images,

however, contain distinctive information in the different scales. As Figure 4.4 demonstrates, the four sizes

of visual words correspond to grids capturing different neural entities. The smallest descriptors cover an

actual area of 36×36 µm² and capture fine-scaled information such as cell shapes and cell densities; the

medium-size discriminative descriptors of 72×72 µm² tend to trace thinner cell layers; larger descriptor

sizes of 144×144 µm² and 288×288 µm² can cover large and intricate patterns of a mixture of cells and cell

types in a tissue. Interestingly, the four visual words with the highest contribution to classification were

the words counting the zero descriptors in each scale. This means that the highest information content

lies in "least informative" descriptors; and that overall expression levels ("sparseness" of expression) are

important factors in functional prediction of genes based on their spatial expression. Our method

presents a new representation of ISH imagery as SIFT descriptors, and using multiple scales allows

revealing the multi-resolution nature of the images.

Which scale carries the most meaningful signal for functional prediction? Figure 4.4E shows the mean

absolute value of visual words weights in every scale for all GO categories, showing that all scales

contribute significantly to the scores, with the medium scale contributing most. Figure 4.4A-D shows

descriptors that contributed to classification of all the categories. Furthermore, each GO category has its

own visual words that are important to its classification, and looking into their details reveals spatial

properties that are unique to specific biological processes.

As an interesting example of this effect, we considered the gene Adducin beta (Add2). Add2 is annotated

to several GO categories including "positive regulation of protein binding" and "actin filament bundle

assembly". Figure 4.5 overlays the top weighted visual words of the two categories over the Add2 ISH

image. It is easy to see that the descriptors important for classification of "actin filament bundle assembly"

are much smaller than those important for classification of the more general category "positive regulation

of protein binding" (t-test, $p < 10^{-17}$). This implies that small-scale features such as specific cell

shapes are important to identify genes related to actin filament bundle assembly processes. Actin

assemblies are important to the navigation of neural growth cones, by reorienting growth cones away

from inhibitory cues (Challacombe, Snow, and Letourneau 1996). Representing the images with

histograms of oriented gradients could capture tiny differences in cell shapes that are in the process of

synapse formation, a developmental process occurring continuously throughout adulthood (Vidal-Sanz

et al. 1987).

Figure 4.4. Representing ISH images with visual words. (A, B, C, D) The three visual words with highest absolute weight (averaged

over all categories) at each scale. The SIFT descriptors (red grid) are plotted on top of each panel. The histogram of oriented

gradients used in the SIFT descriptor is plotted in the center of each element of the grid, as a set of red lines, where the length of

each line corresponds to the magnitude of the gradient in its direction. (E) Mean absolute weight for the four scales of visual words

calculated over classifiers for all categories.

Figure 4.5. The visual words important in classifying Add2 GO categories are overlaid on the Add2 ISH image. Larger descriptors

are needed for the classification of ‘regulation of protein binding’ (A), while the discriminative visual words for ‘actin filament

bundle assembly’ (B) are much smaller, capturing properties such as cell shapes. The descriptors are color-coded by their

importance in classification; the highest importance is shown in bright yellow.

4.1.7 Inferring new gene functions via explainable similarities

We now demonstrate how the semantic representations can be used to propose new gene functional

annotations. Consider as an example the gene Synaptopodin 2 (Synpo2) that is known to bind actin, but

otherwise has very little known associated information. Our method can be used to propose functional

annotations for Synpo2 by looking at the genes that are similar to Synpo2 and considering both the GO

functions that contribute to this similarity and the spatial pattern of expression.

First, we find that Synpo2 is similar to two other genes, Npepps and Rasa4, but for different reasons (the

list of top-5 semantic explanations for these similarities is shown in Table 4.2). Npepps is an

aminopeptidase that is active specifically in the brain (Hui 2007), and the similarity between Synpo2 and

Npepps is explained by processes related to protein processing such as ubiquitination and protein

proteolysis. At the same time, Rasa4 is a GTPase-activating protein that suppresses the Ras/mitogen-

activated protein kinase pathway in response to Ca2+ (Vigil et al. 2010), and the similarity between Synpo2

and Rasa4 is explained by high-level neural processes such as axon guidance or synaptic transmission.

Interestingly, Synpo2 and Rasa4 are expressed in different brain regions: Looking at their spatial

expression patterns reveals that Synpo2 is expressed exclusively in the thalamus, while Rasa4 is expressed

in olfactory areas. Therefore, their similarity is not in their global expression patterns across regions, but

rather in local spatial patterns. This could reflect expression in similar cell types or tissues that exhibit

similar spatial distribution at different brain regions. Npepps is more ubiquitously expressed in the brain,

and is also found in the thalamic area where Synpo2 is expressed. The co-location of Synpo2 and Npepps

suggests they could be participating in similar biological processes in these areas, possibly in protein

modification processes as suggested by the list of top explanations for the similarity.

Synpo2-Npepps

GO ID        GO name
GO:0070646   protein modification by small protein removal
GO:0006412   translation
GO:0016567   protein ubiquitination
GO:0051603   proteolysis involved in cellular protein catabolic process
GO:0032446   protein modification by small protein conjugation

Synpo2-Rasa4

GO ID        GO name
GO:0006836   neurotransmitter transport
GO:0051970   negative regulation of transmission of nerve impulse
GO:0050805   negative regulation of synaptic transmission
GO:0007411   axon guidance
GO:0031645   negative regulation of neurological system process

Table 4.2. Top-5 GO annotations explaining the similarities between the gene Synpo2 and Npepps (first list) and Rasa4

(second list).

4.2 Using ISH images to predict neural disease-related genes

In the previous section, we learned that representing the ISH images using SIFT-BoW can be used to

predict the inclusion of a gene in a biological process. This is one instance of applying the

“guilt by association” principle to co-expression (see Section 1.3.3), where the logistic regression classifier learns typical

patterns of genes from each functional category, and searches for genes with a similar pattern. Doing this

can prove to be particularly useful when trying to identify genes that are related to neural disease. The

idea is that by identifying disease related genes, we are able to better understand the underlying

molecular basis of the disease and consequently improve its prevention, diagnosis or treatment,

for example by identifying new drug targets. Indeed, many studies in recent years have focused on

analyzing gene co-expression in order to shed light on neural disease (Chen et al. 2013a; de Jong et al.

2012; Ponomarev et al. 2012; Torkamani et al. 2010; Voineagu et al. 2011).

4.2.1 Image classification based on disease-gene markers

We used the visual representations (SIFT-BoW) to try and predict disease related genes for two neural

diseases which have a strong genetic basis with many known associated genes: Parkinson’s disease and

epilepsy. We applied a logistic regression classifier for every disease, where positive examples are genes

already known to be associated with the disease, and negative examples are genes that are not known

yet to be involved in the disease. The classifiers were run using the same setup described in Section 3.3.2.

Disease-genes were taken from the database OMIM (Hamosh et al. 2005). For PD, we used mouse orthologs of

24 human PD-related genes, leaving 11 mouse genes: Dbh, Lrrk2, Ndufv2, Nr4a2, Park2,

Park7, Pink1, Snca, Sncaip, Tbp, Uchl1. For epilepsy, we used mouse orthologs of 43 human epilepsy-related

genes, leaving 16 genes: Cacnb4, Chrna4, Chrnb2, Clcn2, Cstb, Epm2a, Gabra1, Gabrg2, Kcnq2, Kcnq3, Lgi1,

Me2, Nhlrc1, Scn1a, Slc25a22, Syn1. The genes Ndufv2, Snca, Uchl1, Cacnb4 and Chrna4 have been

imaged twice and the gene Chrnb2 has 3 associated images, leading to an overall number of 14 images in

the positive set for PD and 16 images in the positive set for epilepsy. When splitting the positive set into

train and test sets, we made sure to have an equal number of positives in each split, and also that

no classifier was trained on an image of a gene that has another image in the test

set of that specific split. We evaluated our results using AUC scores. Test-set AUC scores are 0.799 and

0.73 for PD and epilepsy respectively (Figure 4.6).
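The splitting constraint described above (balanced positives, and all images of a gene kept on the same side of the split) can be sketched as follows. The greedy balancing strategy and the function name are assumptions made for illustration and are not necessarily the procedure used in this work; the example list contains eleven of the genes named above, three of them with two images each.

```python
import random
from collections import defaultdict


def split_by_gene(image_genes, seed=0):
    """Split image indices into two sets so that all images of a gene
    fall on the same side and the two sides have similar sizes."""
    by_gene = defaultdict(list)
    for idx, gene in enumerate(image_genes):
        by_gene[gene].append(idx)
    genes = sorted(by_gene)
    random.Random(seed).shuffle(genes)
    # Place genes with several images first, then fill up with single-image genes.
    genes.sort(key=lambda g: -len(by_gene[g]))
    train, test = [], []
    for gene in genes:
        side = train if len(train) <= len(test) else test
        side.extend(by_gene[gene])
    return train, test


# One entry per positive image; Ndufv2, Snca and Uchl1 were imaged twice.
positive_image_genes = ["Dbh", "Lrrk2", "Ndufv2", "Ndufv2", "Nr4a2", "Park2", "Park7",
                        "Pink1", "Snca", "Snca", "Sncaip", "Tbp", "Uchl1", "Uchl1"]
train, test = split_by_gene(positive_image_genes)
```

Because assignment is done per gene rather than per image, a classifier can never be evaluated on a second image of a gene it was trained on.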

Figure 4.6: ROC curves for (A) Parkinson’s disease predictions and (B) epilepsy predictions.

We applied the trained classifiers to the entire mouse genome and predicted dozens of new candidate genes

for epilepsy and PD. For example, a top epilepsy prediction is Tph2 (tryptophan hydroxylase 2). This gene is

important in the biosynthesis of serotonin, which has been hypothesized to be involved in epilepsy. Many

of the genes predicted to be PD-related are known to be related to other neural diseases, for example,

mutations in the gene App, the top prediction for PD, have been implicated in autosomal dominant

Alzheimer disease and cerebroarterial amyloidosis (cerebral amyloid angiopathy). Fkbp6, another top PD

prediction, is found to be deleted in Williams syndrome and deficiency of the protein product of Ndn is

implicated in the pathogenesis of the neurodevelopmental disorder Prader-Willi syndrome, just to name

a few examples. We also find that PD genes are widely expressed in the brain, while epilepsy genes show

much more localized patterns, notably in the periaqueductal grey, a region that has been shown to be

associated with the induction of audiogenic seizures in epilepsy-prone rats. Overall we predict 13 new

Parkinson’s disease genes and 247 new epilepsy genes. The top-10 predicted genes for PD and epilepsy

are listed in Table 4.3, along with corresponding logistic regression confidence scores.

PD predicted genes Score Epilepsy predicted genes Score

App 0.641 Slc17a8 0.886

Ifnar1 0.639 Gchfr 0.814

2610002F03Rik 0.638 Tph2 0.79

LOC652668 0.636 Ublcp1 0.734

Fkbp6 0.635 Yipf5 0.72

Apbb1 0.633 Chrm3 0.711

Cenpf 0.633 Icam5 0.693

Actc1 0.633 Slc36a1 0.689

LOC546142 0.633 Slc39a3 0.651

Ndn 0.632 Arhgdig 0.651

Table 4.3: Top 10 predicted genes for the two diseases, and corresponding prediction scores.

4.2.2 Validation of results

We validated our gene-disease predictions using a Gene Set Enrichment Analysis (GSEA) (Subramanian et

al. 2005) on our lists of prediction scores. We used two sources of validation: (1) the Genetic Association

Database (GAD) (Becker et al. 2004), and (2) disease-gene predictions from a previous attempt to identify

disease-related genes from co-expression patterns (Linghu et al. 2009). For PD predictions, GSEA scores

were significant (P<0.05) using the two datasets. Epilepsy predictions were significant in the GAD dataset

but not significant using the Linghu 2009 set as a validation set, presumably due to the small number of

positives in this dataset (N=30) (Table 4.4).

Disease Validation dataset Enrichment P-value N

PD GAD 5.7*10^-5 336

PD Linghu 2009 0.02 28

Epilepsy GAD 0.0028 81

Epilepsy Linghu 2009 0.058 30

Table 4.4: Prediction validation using two datasets: GAD and Linghu 2009. Results were validated using a GSEA analysis

(Subramanian et al. 2005).
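The running-sum statistic underlying GSEA can be illustrated with a simplified sketch. Note the assumptions: the version below is the unweighted (Kolmogorov-Smirnov-like) variant, whereas Subramanian et al. (2005) describe a running sum weighted by the ranking score; the gene identifiers are placeholders; and significance would still have to be assessed, for instance by permutation.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of the unweighted running sum.
    Assumes the gene set contains some, but not all, of the ranked genes."""
    n = len(ranked_genes)
    n_hits = sum(g in gene_set for g in ranked_genes)
    step_hit, step_miss = 1.0 / n_hits, 1.0 / (n - n_hits)
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # Move up when a validation-set gene is encountered, down otherwise.
        running += step_hit if g in gene_set else -step_miss
        if abs(running) > abs(best):
            best = running
    return best


# Placeholder genes sorted by decreasing prediction score (illustrative only).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
validation_set = {"g1", "g3"}
es = enrichment_score(ranked, validation_set)
```

A large positive score indicates that the validation genes concentrate near the top of the prediction list.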

4.3 Localizing genes to cerebellar layers using ISH image classification

Regions in the brain are organized in layered structures, where each layer has a distinct functional role

that is reflected in specialized expression of genes. Understanding cell-layer functionality depends on our

ability to identify and characterize its unique expression signature. In this section we focus on the highly

organized layered structure of the mouse cerebellum. The cerebellum is composed of four unique layers:

the Purkinje layer, where the Purkinje cell bodies are located; the molecular layer, containing the thick

dendritic trees of the Purkinje cells; the granular layer, which is the region with the highest density of

neurons known in the brain; and the innermost layer, the cerebellar white matter.

We use a machine vision approach to identify layer-specific genes. The method is based on modeling the

spatial expression patterns observed in ISH images of a few genes that are known layer-markers.

Specifically, we represent the images using histograms of local binary patterns (LBP, see Section 3.2.1),

and learn four separate classifiers for the four layers of the cerebellum. For full details on the methods

used see (Kirsch, Liscovitch, and Chechik 2012). All classifiers achieve a very high area under the ROC curve

(AUC > 0.94 for all four categories). Using the learned patterns, we then automatically scan the

genome-wide ISH database and detect all other layer-specific genes.
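To make the image representation concrete, the following is a minimal sketch of a local binary pattern histogram. It assumes the basic 8-neighbour, radius-1 operator on a gray-level image given as a list of rows; the exact LBP variant and parameters used in this work are those described in Section 3.2.1 and in (Kirsch, Liscovitch, and Chechik 2012), and may differ from this sketch.

```python
def lbp_histogram(image):
    """Histogram of 8-neighbour local binary pattern codes over interior pixels."""
    # Neighbours visited clockwise, starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the center.
                if image[r + dr][c + dc] >= image[r][c]:
                    code |= 1 << bit
            hist[code] += 1
    return hist


# A single bright spot surrounded by darker pixels (toy example).
spot = [[5, 5, 5],
        [5, 9, 5],
        [5, 5, 5]]
hist = lbp_histogram(spot)
```

The 256-bin histogram serves as the feature vector of an image region and can be passed to a per-layer classifier.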

4.3.2 Genome-wide predictions of cerebellum layer markers

After applying the classifiers to the full mouse genome (20,382 genes in the ABA database), we are able

to identify layer specific markers with high accuracy. Out of 13361 genes that are expressed in the

cerebellum, 454 genes are predicted to be primarily expressed in the Purkinje layer, 233 in the granular

layer, 14 in molecular layer and 16 in the white matter.

We validated the predictions by manually scanning the top predicted genes, visualizing their measured

expression patterns, and comparing them to the patterns expected at that layer. Out of the top 250 genes

predicted to be localized to the Purkinje layer we correctly classified 98.4%. Similarly, 98.1% of the top

250 granular layer predictions were accurate. The precision was worse for localization to the molecular

layer: all 14 predictions showed expression in the molecular layer, but 10 out of the 14 also showed expression in the granular layer.

Finally, 10 out of the 16 predicted white matter genes were correct. It should be noted, however, that many of

the genes that exhibited localized expression in one cerebellar layer are also expressed in other regions

of the brain, sometimes very widely. Also, despite the fact that most of the training images in the

molecular class show expression in the molecular layer and also in the Purkinje layer, our classifier was

able to identify genes that show expression only in the molecular layer.

Applying the white-matter classifier and the molecular layer classifier to the full genome yielded very few

positively scored genes. This could be attributed to the small number of positive samples in the training

set for these classes. Indeed, when we manually examined one thousand of the genes in the database we

only found one gene that was exclusively localized to the white matter (and one gene localized to the

molecular layer). In comparison, there were many more genes localized to the granular or the Purkinje

layers.

4.3.3 Characterizing layer-specific genes

The above results show that at least 450 genes, about 3.4% of the genes that are expressed in

the cerebellum, are primarily expressed in one layer (mostly the Purkinje and granular layers). There could

be many reasons for this highly structured expression pattern. For example, localized genes may reflect

unique cell-type dependent biological processes, like shaping the cell morphology or controlling the

connectivity between specific neuron types. Alternatively, localized expression may also reflect properties

that are not necessarily cell-type specific, like processes that depend on cell size, since Purkinje cells are

exceptionally large. We therefore turned to characterize the properties of localized genes, by testing their

functional annotations and comparing them with the transcriptome of Purkinje-deficient mice.

Comparison with Purkinje deficient mice

To better characterize the properties of genes localized to the Purkinje layer, we aimed to separate genes

whose expression is related to Purkinje cells from genes whose expression is related to non-Purkinje cells.

We compared our study with a study by Rong and colleagues (Rong, Wang, and Morgan 2004) who aimed

to identify Purkinje-cell specific genes. In this study, the cerebellar gene expression of two strains of mice

was compared: wild-type mice and PSD3J mice, which have a mutation in the gene Nna1 causing them to

lose their Purkinje cells by adulthood. Genes with reduced expression in the PSD3J mice presumably reflect

the loss of Purkinje cells.

We compared the list of genes that we predicted to localize to the Purkinje-layer with a list of

203 PSD3J genes whose expression decayed by more than 50% as provided by (Rong, Wang, and Morgan

2004). We sorted the predicted genes by the classifier margin, treated the PSD3J list as positives, and

computed the precision at the K top-ranked genes. Figure 4.7A shows that the top ranked predicted genes

have high overlap with the PSD3J list, reaching 33% at the top 10.

The cross-comparison between the two sets reveals genes that are localized to the Purkinje-layer, but are

not Purkinje-cell related. This may include genes that are expressed in non-Purkinje cells such as

Bergmann glia. The cross-comparison also reveals genes whose expression is affected by the deficient

Purkinje cells, but are not localized to the Purkinje layer. These may include genes that are expressed in

the dendritic arbors of Purkinje cells, or other genes that are not layer-specific but were affected by the

deficiency of the Purkinje cells. Finally, for those genes that are detected to be both Purkinje-layer related

and Purkinje-cell related, the cross-comparison strengthens their link to Purkinje cells.

Functional annotation

As the next step, we studied known functions of the genes that were localized to the four classified layers.

We used Gene Ontology (GO) annotations to find the biological processes that are over-represented in

the resulting gene sets for each layer.

As expected, genes localized to the white matter layer showed enrichment for myelination. More

interesting was the enrichment for neurogenesis, which is also known to take place in the white

matter (Zhang and Goldman 1996). The Purkinje layer was enriched for lipid metabolic processes and

more general processes, such as oxidation/reduction. Full lists of enriched categories are provided in

Tables 4.5-8.

GO ID Term # annotated # significant FDR q-value

0042552 Myelination 21 2 0.00045

0006665 Sphingolipid metabolic process 36 2 0.00134

0008654 Phospholipid biosynthetic process 49 2 0.00248

0022008 Neurogenesis 267 3 0.00703

Table 4.5. Functional enrichment of genes localized to the white matter.

GO ID Term # annotated # significant FDR q-value

0006816 Calcium ion transport 86 12 9.5*10^-5

0006937 Regulation of muscle contraction 22 5 0.0012

0030900 Forebrain development 73 9 0.0018

0006629 Lipid metabolic process 407 28 0.0018

0055114 Oxidation reduction 375 26 0.0023

0007264 Small GTPase mediated signal transduction 263 19 0.0057

0050767 Regulation of neurogenesis 61 7 0.0083

Table 4.6. Functional enrichment of genes localized to the Purkinje layer.


GO ID Term # annotated # significant FDR q-value

0016192 Vesicle mediated transport 279 14 0.00036

0009966 Regulation of signal transduction 317 12 0.00952

Table 4.7. Functional enrichment of genes localized to the granular layer.


GO ID Term # annotated # significant FDR q-value

0050804 Regulation of synaptic transmission 49 2 0.0024

0006457 Protein folding 76 2 0.0059

Table 4.8. Functional enrichment of genes localized to the molecular layer.

The cerebellar cortical layers are comprised of distinct types of neurons and glia. We asked whether the

genes expressed in the different layers are associated with specific cell types. To answer this question, we

used lists of genes that were found to be enriched in three major cell types: neurons, astrocytes and

oligodendrocytes (Cahoy et al. 2008). Enrichment was determined by isolating these cell populations

using Fluorescence-Activated Cell Sorting (FACS) and quantifying their expression using microarrays.

Genes with at least 20-fold over-expression were defined as cell-type specific markers. The lists of

cell type markers include 2036 genes for neurons, 2618 for astrocytes and 2228 for oligodendrocytes. We

tested for enrichment of these markers in our results, using the entire genome as background. Results are

presented in Figure 4.7B. As expected, genes that were found to be expressed in the cerebellar white

matter show a strong enrichment signal for oligodendrocytes. The granular layer, which contains large

amounts of densely packed granule cells, indeed shows enrichment for neuron-related genes. The

Purkinje cell layer, which is defined by the cell bodies of Purkinje neurons, shows, interestingly, a strong

enrichment signal for glial cells, notably astrocytes. This could be explained by the specialized astrocytes

that occupy this layer, the Bergmann glia, and also by the astrocyte processes derived from cells located

in the upper granular layer, covering the Purkinje cell bodies (Ghandour, Vincendon, and Gombos 1980).

Oligodendrocytes are also known to be localized close to the Purkinje cells (Ghandour, Vincendon, and

Gombos 1980). This fact can account for the enrichment of this cell type in the Purkinje cell layer.
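The enrichment test for cell-type markers can be sketched with an exact hypergeometric tail probability. The background size, the number of oligodendrocyte markers and the number of white matter genes are taken from the text; the overlap count is hypothetical, since per-layer overlaps are not listed here, and the function name is an assumption.

```python
from math import comb


def hypergeom_pvalue(n_background, n_markers, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_markers are cell-type markers."""
    upper = min(n_markers, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(comb(n_markers, x) * comb(n_background - n_markers, n_selected - x)
               for x in range(n_overlap, upper + 1))
    return tail / total


# 20,382 background genes, 2228 oligodendrocyte markers, 16 white matter genes.
# The overlap of 8 genes is a made-up value for illustration.
p_white_matter = hypergeom_pvalue(20382, 2228, 16, 8)
```

Python integers have arbitrary precision, so the binomial coefficients are computed exactly before the final division.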

Figure 4.7. Comparison with Purkinje-deficient mice and layer enrichment for cell types. (A) Comparison with Purkinje-deficient

mice genes from (Rong, Wang, and Morgan 2004): the overlap of the set of top-ranked genes that were localized to the Purkinje

layer with the PSD3J list. Precision is the fraction of Purkinje-localized genes that are found in the PSD3J list. (B) Enrichment for cell type

specific markers, taken from (Cahoy et al. 2008). For each layer, enrichment for cell type was tested using a hypergeometric test.

The dashed red line corresponds to p-value at random. As expected, the white matter is enriched for oligodendrocyte markers

and the granular layer is enriched for neuronal markers. Interestingly, Purkinje layer genes show a strong enrichment for

astrocyte markers.

Finally, we used the localization predictions to identify novel genetic markers for the different cerebellar

layers. Out of the hundreds of new markers, here we describe two examples of genes that were top-

ranked by our classifier in two layers. The first, Mitogen-activated protein kinase kinase 6 (Map2k6), was

the first-ranked gene in the white matter. Its cerebellar expression pattern, depicted in Figure 4.8, shows

it is indeed clearly localized to the white matter. Map2k6 is a member of the MAP kinase signal

transduction pathways, and is thus involved in cell proliferation and growth. It has been shown that the

human ortholog of Map2k6 is activated in the cerebellum in response to calcium, triggering a signaling

pathway which results in the expression of genes responsible for the survival of newly differentiated

neurons (Mao et al. 1999). Therefore, it is not surprising to find it in the white matter of the cerebellum,

and yet this expression pattern was never previously demonstrated. While Map2k6 is a relatively well-

studied gene, the second example we discuss, Fam107b (3110001A13Rik), ranked 6th by the Purkinje

layer detector, has little to no associated information. This gene shows a strong, localized expression in

the Purkinje layer (Figure 4.8B). Moreover, its expression is also largely specific to the cerebellum (Figure

4.8C).

Figure 4.8: Examples of novel genetic markers. Non-masked ISH images showing cerebellar expression of Map2k6 (A) and

Fam107b (B). These are the raw images before the application of the expression mask. The expression of actual labeled mRNA

target transcripts is marked with dark spots. (C) Whole-brain ISH image for Fam107b. Fam107b shows strong, highly localized

expression in the Purkinje layer of the cerebellum.

Chapter 5: Patterns of RNA editing in the brain

5.1 Introduction

The previous chapters discuss spatio-temporal patterns of mRNA expression in the developing and adult

brain. After transcription, mRNA is subjected to post-transcriptional modifications that may change its

properties, even leading to changes in protein structure upon translation. One such modification is

A-to-I RNA editing by adenosine deaminases acting on RNA (ADARs). This post-transcriptional modification

of pre-mRNA is essential for normal life and development in vertebrates (Nishikura 2010; Bass 2002; Savva,

Rieder, and Reenan 2012). Editing changes the sequences of encoded RNA, thus contributing to proteomic

and phenotypic diversity. To this day, thousands of human genes have been shown to be subject to A-to-

I RNA editing within their untranslated regions and introns (Bazak et al. 2013; Bahn et al. 2012; Li et al.

2009; Park et al. 2012; Ramaswami et al. 2012; Ramaswami et al. 2013; Peng et al. 2012). In primates,

these editing events take place mainly within Alu repeats (Kim et al. 2004; Levanon et al. 2004;

Athanasiadis, Rich, and Maas 2004; Blow et al. 2004), which are primate-specific, 300bp-long elements

that comprise about 10% of the human genome. Importantly, editing has been shown to operate in genes

encoding synaptic proteins or important neuromodulators, suggesting that editing may have an important

role in tuning molecular functions in the brain regions (Burns et al. 1997; Sanjana et al. 2012). Indeed,

known phenotypic effects of editing from Caenorhabditis elegans and Drosophila melanogaster to Mus

musculus are related to neural systems and behavior (Palladino et al. 2000; Tonkin et al. 2002; Higuchi et

al. 2000). In addition, editing was found to be dysregulated in several diseases, mainly related to the

neural system (Eran et al. 2012; Silberberg et al. 2012; Chen et al. 2013b).

Although most human genes have been shown to undergo editing (Bazak et al. 2013; Ramaswami and Li

2014), the exact role of RNA editing is still unclear, and various functions have been proposed to explain

its operation (Nishikura 2010). It has been proposed that 3' UTR editing may play a role in gene silencing

(Nishikura 2010); in augmenting or counteracting the RNAi mechanism (Nishikura 2010), and as an anti-

retroelement mechanism (Levanon et al. 2005). It has also been suggested that heavily-edited mRNA

transcripts are retained in the nucleus (Chen and Carmichael 2009; Prasanth et al. 2005; Zhang and

Carmichael 2001; Scadden 2005; Scadden and O’Connell 2005), or induce inosine specific degradation of

the edited transcripts by Tudor-SN nuclease (Scadden 2005; Scadden and Smith 2001a). Moreover,

hyper-edited transcripts were even shown to down-regulate gene expression in trans (Scadden 2007). Another

way in which editing might regulate gene expression in humans is through modification of micro-RNA

(miRNA) targets within 3' Alu elements (Liang and Landweber 2007) and changes to the recognition sites of

splicing enhancers/silencers (Lev-Maor et al. 2007). A common effect of all these proposed

mechanisms is that editing of a target gene is expected to reduce its expression. A direct prediction

stemming from this hypothesis is that expression of edited genes will be negatively correlated across

conditions with the expression of ADARs.

The above experimental findings seem to conflict with the abundance of editing targets in the human

genome in terms of the possible effects of RNA editing on expression. On one hand, as pointed out above,

editing was demonstrated to have a dramatic impact on inosine-containing transcripts. On the other hand,

if editing determined the fate of mRNA, it would have an overly massive effect on the human transcriptome.

This is because a large fraction of human transcripts contain double-stranded RNA structures formed by

Alus (Kim et al. 2004; Levanon et al. 2004; Bazak et al. 2013; Athanasiadis, Rich, and Maas 2004; Blow et

al. 2004), which are ideal ADAR targets, and therefore editing would impact a large fraction of human genes.

Moreover, since the rapid invasion of Alus into the genome is mostly specific to primates, evolution has

had only a short period to adapt to this recent increase in edited targets.

To address these two possibly conflicting views, the current work aims to chart co-expression patterns of

ADARs and their potential Alu editing targets in the human brain, using two large sets of mRNA expression

from postmortem brains. Surprisingly, when considering the correlation structure of ADAR and its targets

along development, we do not find evidence supporting the expected global negative correlation, since

the distribution of correlations is often bi-modal: ADAR is positively correlated with most of its targets,

and negatively correlated with other target genes. Our results suggest that in the course of primate

evolution, with the massive editing associated with Alu, editing-related mechanisms for gene regulation

were probably adjusted in such a way that their negative regulation of edited genes has changed.

5.2 Results

To characterize the spatial expression of ADAR (ADAR1) and ADARB1 (ADAR2) in the brain and how their

expression correlates with their potential editing targets, we analyzed genome-wide expression

measurements from two sources: A dataset containing 3702 samples from 6 adult human brains (Website:

©2012 Allen Institute for Brain Science. Allen Human Brain Atlas [Internet]. Available from:

http://human.brain-map.org/), and a dataset measured from 57 brains over development (Kang et al.

2011) (see Methods for details on both datasets). In the results below, we refer to them as ABA-2013 and

Kang-2011, respectively.

5.2.1 ADAR and ADARB1 expression in the brain

As a first step to characterize the expression of ADAR and ADARB1 in the human brain, we looked at their

pattern of expression across the major brain regions. Figure 5.1 shows the average expression over the

six adult brains in three consecutive coronal slices. ADAR expression is enriched mostly in sub-cortical

regions, the claustrum, pons and medulla oblongata, but also the cingulate gyrus. This expression pattern

is consistent with previous reports that editing targets HTR2C, the gene that codes for a serotonin receptor

that is expressed in sub-cortical regions, but not HTR2A which codes for a receptor in the same family

which is expressed in the cortex. ADARB1 expression is enriched particularly in highly functional regions

such as the cerebellar cortex, pons and thalamus. Over-expression of both ADARs in the pons is consistent

with a previous finding of high editing levels in this region in the rat brain (Paschen and Djuricic 1994).

Interestingly, the expression levels of ADAR were in general not exceptionally high in the neocortex, the

brain area that is dramatically oversized in primates and humans specifically.

As discussed above, RNA editing of Alu repeats has been suggested as a possible regulatory mechanism,

where switching of Adenosine to Inosine marks mRNA for degradation or nuclear retention (Chen and

Carmichael 2009; Prasanth et al. 2005; Zhang and Carmichael 2001). To examine the hypothesis that RNA

editing serves as a mechanism for down-regulation of gene expression, we calculated the spatial

correlation between ADARs and 7,864 potential editing targets (see Methods for details on how target

and background sets were defined) across brain regions in the ABA-2013 dataset, and 6,834 potential

editing targets in the Kang-2011 dataset. If ADARs edit their targets on a wide scale, and if RNA editing by

ADARs down-regulates their targets, regions with high levels of ADAR and ADARB1 mRNA would show

lower levels of their non-edited targets on average. As a consequence, we would expect to see negative

correlations between ADARs and their potential editing targets.

Figure 5.1. ADAR and ADARB1 expression in the human brain based on the ABA-2013 dataset. Heat map of

normalized mRNA expression in three coronal slices of a human brain. Expression was averaged over six adult brains.

(A) ADAR expression is enriched in the cingulate gyrus - CG, the pons - P, the claustrum - C and the medulla oblongata

- MO. (B) ADARB1 expression is enriched in the thalamus - TH, the pons - P and the cerebellar cortex - CBC. Figures

were created using the brain-expression-visualizer tool available from www.chechiklab.biu.ac.il.

5.2.2 ADAR expression is positively correlated with potential editing targets

We used the Illumina Human Body Map (HBM) RNA-Seq data from a brain sample to identify genes with

edited Alu elements, focusing on edited Alu repeats that reside within genes. We defined a gene as a

target if it contains at least one edited Alu (Kim et al. 2004; Levanon et al. 2004; Bazak et al. 2013;

Athanasiadis, Rich, and Maas 2004; Blow et al. 2004).

We computed the spatial correlation of ADAR and ADARB1 with their potential editing targets, across all

samples in our two datasets. As a baseline for comparison, we also computed the same correlations

with the spatial expression profiles of all genes in a background set (12,909 genes in ABA-2013 and 10,731

genes in Kang-2011; see Methods for details on how target and background sets were defined). Figure 5.2A shows the histograms of

correlations between ADAR and the target set (orange) and ADAR and the background set (light blue). Surprisingly,

the effect observed is opposite to what is predicted by the initial hypothesis. The correlation of ADAR

mRNA levels with the expression of its potential targets is actually more positive than correlations of ADAR

mRNA levels with the background set expression (median Pearson correlation with targets = 0.224,

median Pearson correlation with background = 0.104, Wilcoxon test for different medians z-value = 31.9,

p-value < 8.73*10^-223, n = 20,772, Figure 5.2A). This effect was consistent when we computed non-linear

spatial correlation (median Spearman correlation with targets = 0.219, median Spearman correlation with

background = 0.099, Wilcoxon test for different medians, z-value = 31.3, p-value < 9.92*10^-215, n = 20,772).

There was no significant effect found for the other editing enzyme, ADARB1, and this result is consistent

with the fact that ADAR is considered to be the main gene responsible for Alu editing (Wang et al. 2013;

Riedmann et al. 2008; Bahn et al. 2012).
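
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this work: the expression values are synthetic, the helper names (pearson, average_ranks, ranksum_z) are hypothetical, and the rank-sum z-value uses the plain normal approximation without a tie correction.

```python
import math
import random
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def ranksum_z(a, b):
    """Wilcoxon rank-sum z-value (normal approximation, no tie correction)."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2            # expected rank sum under the null
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


# Synthetic data (hypothetical): 60 samples, 40 target and 40 background genes.
rng = random.Random(0)
adar = [rng.gauss(0, 1) for _ in range(60)]
targets = [[v + rng.gauss(0, 1) for v in adar] for _ in range(40)]   # co-vary with ADAR
background = [[rng.gauss(0, 1) for _ in adar] for _ in range(40)]    # unrelated to ADAR

r_targets = [pearson(adar, g) for g in targets]
r_background = [pearson(adar, g) for g in background]
z = ranksum_z(r_targets, r_background)
print(median(r_targets), median(r_background), z)
```

A positive z-value here means that correlations with the target set tend to rank above those with the background set, which is the direction of the effect reported for ADAR.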

To further validate the high spatial correlation between ADAR and its targets, we computed the

distribution of spatial correlations in the second dataset, Kang-2011, which measured spatio-temporal

expression profiles throughout the human brain and in different ages (Kang et al. 2011). Results in this

second dataset were highly consistent with the first dataset: The correlation between ADAR and the set

of edited targets, computed using all the samples regardless of age, was significantly more positive than the

correlation with the background set (median Pearson correlation with targets = 0.063, median Pearson

correlation with background = -0.121, Wilcoxon test for different medians z-value = 41.2, p-value < 10^-223,

n = 17,564; median Spearman correlation with targets = 0.0567, median Spearman correlation with

background = -0.135, Wilcoxon test for different medians z-value = 41.7, p-value < 10^-250, n = 17,564,

Figure 5.2B). The results were also largely consistent at the gene-to-gene level: the per-gene correlations

with ADAR computed in one dataset were strongly correlated with those computed in the other (Spearman

rho = 0.44, p-value < 10^-16), even though the two datasets used were

measured in different subsets of brain regions.

Figure 5.2 shows the histograms of ADAR correlations with target and background sets. The difference in

the correlations of ADAR and targets versus the background set comes from two sources: a subset of

target genes that have strong positive correlations with ADAR, and also a group of genes that are not

edited but are strongly negatively correlated with ADAR. This “spike” in negative correlations is very

prominent and appears in both datasets. To characterize the highly negatively correlated genes, we

performed a Gene Ontology (GO) enrichment analysis using GOrilla (Eden et al. 2009). In ABA-2013 and

also in Kang-2011, we found that the lists of genes that are negatively correlated with ADAR are highly

enriched for olfactory receptor activity (p < 10^-50 for both datasets).

Figure 5.2. The distribution of spatial correlation values between ADAR and targets (orange) and between ADAR and

a background set (light blue). The results are shown for (A) ABA-2013 dataset (B) Kang-2011 dataset. The two

distributions differ due to two groups of genes: a larger number of target genes have positive correlations with ADAR,

and there also exists a group of genes that do not contain Alus, thus are not targeted by ADAR, but are strongly negatively

correlated with ADAR.

5.2.3 Effect of Alu location in the gene

Double stranded Alu structures appear in various locations in genes. To test if the strong positive

correlation of ADAR with its putative targets depends on the location of the target in the gene, we

repeated the analysis, but this time separating the targets into four groups of genes based on

the location of the Alu repeat (gene counts given as ABA-2013/Kang-2011): 3'UTR (1,024/878 genes), 5'UTR (92/55 genes),

intronic regions (7,494/6,525 genes) and coding sequences (CDS, 38/37 genes). We accounted for the different sizes of the

groups using bootstrap (see Methods). The spatial-correlation effect was significant in intronic Alus and

in 3'UTR Alus (Figure 5.3). Lack of differences in correlation between editing at the 3’ UTR and introns

argues against global gene regulation by editing at the 3’ UTR. The distribution of correlation values of

ADAR with each of the target groups and the background set is shown in Figure 5.4.

Figure 5.3: Effect vs. Alu location. Boxplot of the log-transformed p-values of a one-sided Wilcoxon test between ADAR correlations with targets versus a background set of genes is plotted against the location of the Alu repeat pairs in the gene (note that Alu in the CDS or 5’ UTR is rare). P-values for the two datasets are pooled and shown together. Error bars encompass data within 1.5 times the inter-quartile range, and the boxes show the lower and upper quartiles together with the median. Outliers are represented as circles. Lack of differences in correlation between editing at the 3’ UTR and introns argues against global gene regulation by editing at the 3’ UTR.

Figure 5.4. The distribution of spatial correlation values between ADAR and targets containing Alus in different

locations (orange) and between ADAR and a background set (light blue). The results are shown for (A) ABA-2013,

intron (B) Kang-2011, intron (C) ABA-2013, 3'UTR (D) Kang-2011, 3'UTR (E) ABA-2013, 5'UTR (F) Kang-2011, 5'UTR (G)

ABA-2013, CDS (H) Kang-2011, CDS.

5.2.4 Specificity of ADAR-target correlations

The difference in ADAR correlations with targets and background genes may not be specific to ADAR. For

instance, if a large group of target genes is highly positively inter-correlated, then many genes, not only

ADAR, would show a strong correlation with that group and as a result, significantly stronger correlation

than with the background set. To test if the difference in correlations is specific to ADAR, we repeated the

above analysis for all genes: for each gene, we calculated the Spearman correlation between the gene's

spatial expression pattern and the expression of the genes from the intronic target and background sets.

We ranked all genes based on the magnitude of this difference, measured as -log10 of the Wilcoxon test

p-value. ADAR is ranked 6 out of 20,773 genes in the ABA-2013 dataset and 22 out of 17,565 genes in

the Kang-2011 dataset. When looking at the intersection of the two sets, ADAR is one of only 10 genes

that are in the top 1% of both sets (10 out of 17,565, the top 0.1%). This means that the high

positive correlations of target genes with ADAR are not a common phenomenon in the genome, and this

result is largely specific to ADAR. The other 9 genes include DDX1, a putative RNA helicase which is

implicated in several processes involving alteration of RNA secondary structure (Li, Monckton, and

Godbout 2008) and the interferon receptor IFNAR1. Another gene that shows high correlation with editing

targets in both sets is NF2, which has been suggested to be involved in neural cell development (Lavado

et al. 2013). Brain development has been suggested to be controlled in part by RNA editing (Mehler and

Mattick 2007).

5.2.5 Relation of ADAR-target co-expression and editing potential in targets

Genes contain variable amounts of Alu repeats. If the positive spatial correlation of ADAR with its targets

is functionally meaningful, we would expect to see higher correlations of ADAR with genes that contain

more Alus. Figure 5.6 plots the correlations of intronic target genes with ADAR against the number of Alus

in the same genes. There is a significant positive correlation between the number of Alus that a gene

contains and its correlation with ADAR, in both datasets (Spearman correlation coefficient ρ = 0.084,

p-value = 4*10^-13 for the ABA-2013 dataset; ρ = 0.11, p-value = 4.4*10^-19 for the Kang-2011 dataset). Genes that

contain more Alu repeats tend to be longer, therefore the relation between spatial correlation with ADAR

and the number of Alus could be a side-effect of the increased gene length. To test this, we assembled

two sets of length-matched genes, one from the target set and another from the background set (see

Methods), and computed their correlations with ADAR. The correlations of ADAR with the target set were

strongly positive, as opposed to the correlations with the background set, for both ABA-2013 (median

Pearson correlation with targets = 0.241, median Pearson correlation with background = 0.104, Wilcoxon

test for different medians z-value = 25.1, p-value < 7.27*10^-139, n = 10,054) and Kang-2011 (median Pearson

correlation with targets = 0.065, median Pearson correlation with background = -0.102, Wilcoxon test for

different medians z-value = 27.5, p-value < 9.62*10^-167, n = 8,968). We conclude that the higher positive

correlations of ADAR with its targets are not simply due to the effect of gene length.

Figure 5.6: 2D histograms of the correlation of genes with ADAR vs. the number of Alu repeats the genes contain.

Positive correlation with ADAR increases with number of Alus. Points with more than 50 Alu repeats were ignored for

easier visualization. The results are shown for (D) ABA-2013 dataset (E) Kang-2011 dataset.

5.2.6 Correlations with ADAR over development

RNA editing has been suggested to be involved in brain development and neurodegeneration (Palladino

et al. 2000; Tonkin et al. 2002; Li and Church 2013; Higuchi et al. 2000). The Kang-2011 dataset is a neural

expression survey measured over development, allowing us to test whether the positive ADAR-target correlations

change over time. We examined the dynamics of the correlations over brain development, and found that

spatial correlations of ADAR and its targets are higher than with the background set throughout

development (Figure 5.7). Looking at the distribution of correlations in every time point reveals that for

at least some of the time points, the histograms of correlations between ADAR and targets are bi-modal

(see example time points in Figure 5.8; Figure 5.7 shows results for all time points).

To test the stability of the groups of target genes that are correlated with ADAR, and how these groups

may change across different time points, we calculated the cross-correlation between the lists of

correlations of target genes and ADAR at every two time points (Figure 5.9). We found that the target

genes correlated with ADAR are similar in two embryonic time points (10-13pcw and 13-16pcw), and in

most of the adult time points (excluding the last one, 60y+).
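
The temporal cross-correlation described above amounts to computing a Spearman correlation between the per-gene ADAR correlation vectors of every pair of time points. The sketch below illustrates this with made-up numbers; the dictionary corr_by_stage, its values, and the helper names are hypothetical and are not taken from the Kang-2011 data.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical input: for each time point, the correlation of each target gene
# with ADAR (same gene order in every list).
corr_by_stage = {
    "10-13pcw": [0.5, 0.2, -0.3, 0.1, 0.4],
    "13-16pcw": [0.6, 0.1, -0.2, 0.15, 0.3],
    "60y+": [-0.1, 0.3, 0.2, -0.4, 0.0],
}

stages = list(corr_by_stage)
matrix = [[spearman(corr_by_stage[a], corr_by_stage[b]) for b in stages]
          for a in stages]

for name, row in zip(stages, matrix):
    print(name, [round(v, 2) for v in row])
```

High off-diagonal values indicate that the same target genes are correlated with ADAR at both time points, which is what a heat map such as Figure 5.9 displays.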

In order to functionally characterize the bimodal distributions in these two clusters, we pooled together

data from all embryonic time points and all post-natal time points, and performed a GO enrichment

analysis on the positively correlated genes and the negatively correlated ones using GOrilla. The functional

analysis revealed that in the embryonic time points, the genes that are positively correlated with ADAR

are highly enriched for processes such as RNA binding, mRNA processing and gene expression. The

negatively correlated genes are enriched for "ion transport" (FDR q-value < 10^-7). In the post-natal time

points the positively correlated genes and the negatively correlated ones are not enriched for a particular

biological process.

Figure 5.7. The distribution of spatial correlation values between ADAR and targets with intronic Alus (orange) and

between ADAR and a background set (blue), at different developmental time points. Each panel is based on aggregate

measures from 2-7 brains within the designated age group.

Figure 5.8. ADAR-target correlations over development. The distribution of spatial correlation values between ADAR

and targets with intronic Alus (orange) and between ADAR and a background set (blue), at two developmental time

points: (A) 10-13 PCW and (B) 6-12 months.

Figure 5.9: Differential co-expression of ADAR and targets. Heatmap of Spearman correlation rho values showing the

temporal cross-correlation between target gene lists ranked by their correlation with ADAR.

5.3 Methods

5.3.1 The data

We used gene expression data from two sources: the Allen Human Brain Atlas (Website: ©2012 Allen

Institute for Brain Science. Allen Human Brain Atlas [Internet]. Available from: http://human.brain-map.org/)

and Kang-2011 (Kang et al. 2011). Neuroanatomical expression data from the Human Brain

Atlas was averaged across probes. We used the probe to gene mappings provided by the Allen Institute.

This averaging yields donor-specific gene-by-region expression profiles that cover between 185 and

348 brain regions per donor and provide expression data for 29,176 transcripts. Probes that are not mapped to

genes were discarded, leaving data for 20,773 transcripts. Donor age ranges from 24 to 57 years (more

information available at http://human.brain-map.org/).
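
As an illustration of the probe-averaging step, the following sketch collapses probe-level profiles into gene-level profiles and drops unmapped probes. It assumes a simple in-memory representation; the function name average_probes, the probe identifiers and the toy values are hypothetical and do not reflect the actual Allen Institute file format.

```python
from collections import defaultdict


def average_probes(probe_expr, probe_to_gene):
    """Average probe-level profiles into one profile per gene.

    probe_expr:    probe id -> list of expression values (one per brain region)
    probe_to_gene: probe id -> gene symbol, or None when the probe is unmapped
    """
    grouped = defaultdict(list)
    for probe, profile in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # discard probes that are not mapped to a gene
        grouped[gene].append(profile)
    # Average region by region across all probes of the same gene.
    return {
        gene: [sum(col) / len(col) for col in zip(*profiles)]
        for gene, profiles in grouped.items()
    }


# Toy example (hypothetical values): two probes for ADAR, one unmapped probe.
probe_expr = {
    "probe_1": [1.0, 2.0, 3.0],
    "probe_2": [3.0, 4.0, 5.0],
    "probe_3": [9.0, 9.0, 9.0],
}
probe_to_gene = {"probe_1": "ADAR", "probe_2": "ADAR", "probe_3": None}

gene_expr = average_probes(probe_expr, probe_to_gene)
print(gene_expr)
```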

Gene expression data from the Kang-2011 dataset covers 15 developmental stages across 30 time points.

The number of sampled brain regions ranged between 2-16 for each of the 41 donors. The gene

summarized exon array data contains profiles for 17565 genes across 1340 samples.

5.3.2 Choosing target and background sets

We used the Illumina Human BodyMap 2.0 Project (GEO accession number GSE30611, HBM) to find RNA

editing sites within Alu repeats. Genes containing Alu elements that were found to be edited in a brain

sample were included in our target set. The background set was defined as the complementary set of

genes in each dataset.

For the ABA-2013 set, the number of targets is 7,864, and the number of background genes is 12,909. For

the Kang-2011 set, the number of targets is 6,834, and the number of genes in the background set is 10,731.

When splitting the target groups based on the location of the Alu repeats, in the ABA-2013 dataset there are

7,494 genes with intronic Alus, 1,024 genes with Alus in the 3'UTR, 92 genes with Alus in the 5'UTR and 38

genes with Alus in the CDS, and in the Kang-2011 dataset there are 6,525 genes with intronic Alus, 878 genes

with Alus in the 3'UTR, 55 genes with Alus in the 5'UTR and 37 genes with Alus in the CDS. The number of

genes in all target and background sets is summarized in Table 5.1.

                     ABA-2013    Kang-2011
All targets             7,864        6,834
Intronic Alus           7,494        6,525
3’UTR Alus              1,024          878
5’UTR Alus                 92           55
CDS Alus                   38           37
Background genes       12,909       10,731

Table 5.1. Number of target genes and background genes used in the analyses.

5.3.3 Testing ADAR-target correlations at different Alu locations

To take into account the different sizes of target groups when split according to Alu location (CDS, intron,

3'UTR and 5'UTR), we applied a bootstrap approach by sampling subsets of targets in the size of the

smallest group, the CDS set, from all other groups 1,000 times, and calculating a p-value for each sample.
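
The bootstrap procedure can be sketched as follows. This is an illustration under several assumptions that are not stated in the text: subsets are drawn without replacement, each subset is compared with the full background set, and the one-sided Wilcoxon rank-sum p-value is obtained from a normal approximation without tie correction. All function names and the synthetic correlation values are hypothetical.

```python
import math
import random
from statistics import median


def ranksum_p_greater(a, b):
    """One-sided rank-sum p-value (normal approximation) for 'a tends to exceed b'."""
    pooled = sorted(list(a) + list(b))
    first, last = {}, {}
    for pos, v in enumerate(pooled, start=1):
        first.setdefault(v, pos)  # first sorted position of this value
        last[v] = pos             # last sorted position of this value
    w = sum((first[v] + last[v]) / 2 for v in a)  # tied values share their mean rank
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def bootstrap_pvalues(groups, background, n_iter=1000, seed=0):
    """Subsample every group to the size of the smallest one; one p-value per draw."""
    rng = random.Random(seed)
    size = min(len(vals) for vals in groups.values())
    return {
        name: [ranksum_p_greater(rng.sample(vals, size), background)
               for _ in range(n_iter)]
        for name, vals in groups.items()
    }


# Synthetic correlations with ADAR (hypothetical values, not thesis data).
rng = random.Random(1)
groups = {
    "intron": [rng.gauss(0.2, 0.2) for _ in range(300)],
    "3'UTR": [rng.gauss(0.2, 0.2) for _ in range(120)],
    "CDS": [rng.gauss(0.0, 0.2) for _ in range(38)],
}
background = [rng.gauss(0.0, 0.2) for _ in range(300)]

pvals = bootstrap_pvalues(groups, background)
for name, ps in pvals.items():
    print(name, median(ps))
```

Because every draw has the same size, the resulting p-value distributions can be compared across Alu locations without the larger groups appearing more significant simply because they contain more genes.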

5.3.4 Functional analysis of gene sets

To functionally characterize the target genes negatively and positively co-expressed with ADAR, we

calculated the spatial correlation of each target gene in the Kang-2011 dataset at each time point. We

ranked the genes based on the correlations in an ascending and descending order for embryonic and post-

natal time points, and performed a Gene Ontology (GO) enrichment analysis on ranked gene sets using

GOrilla (Eden et al. 2009).

5.4 Discussion

The current chapter addresses the question of what genome-wide impact RNA A-to-I editing may have on

expression in the brain. We aimed to resolve an apparent conflict: On one hand, it has been shown that

in some cases editing could dramatically impact expression of genes. On the other hand, the unique

abundance of editing targets in human genes would mean that if editing affects the expression of all its

targets, it would lead to massive expression changes.

Using two datasets that measured gene expression in multiple locations in human brains, we computed

the spatial correlation between the expression profile of ADARs and their known targets (Bazak et al.

2013). Surprisingly, we found that the distribution of correlations in many brain samples was bi-modal:

while some genes were negatively correlated with ADAR1 as expected, many targets of ADAR were

actually positively correlated with ADAR1 (but not ADAR2). This is somewhat surprising because it is

believed that edited genes would be down regulated in the presence of ADAR. The group of positively

correlated genes was enriched for functions including RNA processing, suggesting that ADAR operates as

part of wide RNA regulation mechanisms. This is in agreement with the fact that ADAR is located in the

spliceosome and is known to interact with multiple proteins involved in RNA processing (Weissbach and

Scadden 2012; Scadden and Smith 2001b; Agranat et al. 2008; Ota et al. 2013; Warf et al. 2012; Wang et

al. 2013; Heale et al. 2009; Nie et al. 2005; Raitskin et al. 2001; Nishikura 2010).

The spatial correlations of ADAR1 with a baseline set of genes were significantly more negative than its

correlations with its targets (p-value < 10^-90), a result that was consistent across the two datasets that we analyzed. Interestingly,

the distribution of correlations changes during development, and the correlation profile differs significantly

before and after birth. This is in agreement with the fact that the editing level of some key targets of

ADAR, such as genes coding for GluR5, GluR6 and Gabra3 receptors, have been shown to change

significantly along development (Dillman et al. 2013; Bernard and Khrestchatisky 1994; Hanrahan et al.

2000; Rula et al. 2008; Ohlson et al. 2007).

We controlled for several potential biases. First, genes that contain Alus tend to be longer, since Alu

insertions lengthen a gene (making it even more prone to Alu insertion). We tested if gene length

could lead to a bias in expression correlation but found no such effect.

Second, most Alus are located in introns, while most edited transcripts that were studied undergo editing

in their 3’ UTR. We found a similar distribution of spatial-correlations in genes, regardless of editing

location (3’ UTR, 5’UTR or introns). Third, to verify that the positive correlations we observed do not reflect

an epiphenomenon of genome-wide expression changes between brain regions, we computed the

correlations between ADAR targets and all genes. ADAR itself was highly ranked in this list (ranked 14, p-

value < 0.001), suggesting that the correlations we observe are largely ADAR-specific.

These results suggest that RNA editing in the human brain does not lead to consistent and wide alterations

in expression. This is in agreement with the idea that if editing were to lead to expression reduction in

primates, its effects would be overly massive since Alus are abundant in the primate genome. Such an

effect could have been magnified even further, since it has been shown that introducing hyper-edited

transcripts into the nucleus of Xenopus cells leads to reduction of transcription, which is not specific to

the hyperedited transcript (in trans) (Scadden 2007).

How robust are these results with respect to the set of target genes we tested? It has recently become clear

that the majority of human genes undergo editing. Here we defined the set of positive targets to contain

only genes where editing was observed, and the set of negatives as genes that do not contain Alu. While

it is possible that more genes would be shown to be edited, hence growing the positive set, the set of

positives is already comprehensive, containing 6-7K genes in the two datasets. We therefore expect the

results to be insensitive to adding more positive genes.

The above results are based on separating genes into two groups: edited and non-edited genes. Today, it

is still costly to measure the actual editing levels at a genome scale in each specific tissue. This is because

editing in Alu typically occurs at less than 1 percent per adenosine (Bazak et al. 2013), hence estimating

editing levels requires large coverage. We expect that these types of measurements will become feasible

in the near future, and could clarify the more detailed relation between editing and expression.

Furthermore, to obtain an accurate measure of the relation between expression and editing, one wishes

to measure both in single cells. Excitingly, new technologies now allow RNA to be extracted from single cells,

and are expected to shed more light on the relation between RNA editing and gene expression.

The above results suggest that editing does not necessarily lead to expression reduction on a large scale,

but leave important questions open. Foremost, what molecular mechanisms prevent expression reduction of

edited transcripts, and what could be the implications of the increased diversity of transcripts following

editing (Barak et al. 2009; Mattick and Mehler 2008; Paz-Yaacov et al. 2010)?

Concluding remarks

In recent years, there has been an explosion in the availability of genome-scale neural datasets: gene

expression has been measured for many conditions, time points, species and regions. There are several

methods to measure gene expression, yielding different types of datasets, from regionalized expression

profiles to high resolution images of expression maps. The different types of data call for the development

of specialized tools and computational methods to analyze them. The goal of the work presented in this

dissertation was to develop and implement such tools in order to gain understanding into brain

organization and function, but also into single gene function in the context of neural function and disease.

Towards this goal, we started by analyzing spatio-temporal patterns of expression in the brain, focusing

mainly on the question of dynamics of inter-region dissimilarities over development. We found that while

the brain develops prenatally and regions become more functionally specialized, expression variation

between the regions actually decreases, reaching a low point around the time of birth. Then, following

birth, variation in regional expression profiles increases again. A functional analysis of the genes

responsible for this “hourglass” shaped divergence profile revealed that the biological processes driving

the large prenatal variation are related to nervous system construction, and the processes driving the

post-natal variation are related to the utilization of the nervous system. We also found that post-natal

specialization in gene expression is most prominent in the cerebellum, an effect that can also clearly be

seen when looking at parallel developmental patterns in humans.

The next part of the dissertation focused on the analysis of high resolution ISH images of gene expression

in the mouse brain. The dataset analyzed contains maps of gene expression at cellular resolution for

every gene in the mouse genome. This vast amount of spatial information comes in the form of images,

and we discuss methods adapted from computer vision to extract information from the images and

represent them. Then, we discuss several implementations of these methods.

First, we present a method to learn functional representations of neural in situ hybridization (ISH) images,

where each image is represented as a point in a low dimensional space whose axes correspond to

meaningful functional annotations, yielding an interpretable measure of similarity between highly

complex images. We successfully infer over 700 functional annotations from neural ISH images, and use

them to detect gene-gene similarities, while providing semantic interpretations for the similarity, enabling

the explainable inference of new gene functions from spatial co-expression. The visual features calculated

for this purpose were also used for the inference of genes related to neural diseases such as Parkinson’s

disease and epilepsy.

We then present an approach to identify genes that are primarily expressed in specific brain layers or cell

types, based on analyzing the ISH images. By learning the spatial patterns of a few known cell markers in

the mouse cerebellum, we annotate the expression patterns of hundreds of new genes, and predict the

layers and cell types they are expressed in with very high accuracy (AUC>0.94 for all four cerebellar layers).

Overall, 454 genes are predicted to be primarily expressed in the Purkinje layer, 233 in the granular

layer, 14 in the molecular layer and 16 in the cerebellar white matter.

The last part of the dissertation focused on patterns of RNA editing in the human brain. Specifically, we

looked at co-expression patterns of the enzyme ADAR, which is responsible for editing, and its potential

editing targets. We aimed to resolve an apparent conflict: on the one hand, it has been shown that in some

cases editing could dramatically impact expression of genes. On the other hand, the unique abundance of

editing targets in human genes would mean that if editing affects the expression of all its targets, it would

lead to massive expression changes. We found that the distribution of correlations between ADAR and its

targets in many brain samples was bimodal: while some genes were negatively correlated with ADAR as expected,

many targets of ADAR were actually positively correlated with ADAR. This is somewhat surprising because

edited genes are believed to be downregulated in the presence of ADAR. The group of positively

correlated genes was enriched for functions including RNA processing, suggesting that ADAR operates as

part of broader RNA regulation mechanisms.
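
As an illustration of the type of co-expression computation summarized above, the sketch below computes the Pearson correlation between a reference gene (ADAR, for instance) and each candidate target across samples. It is a simplified example: the function name, the input layout (genes by samples), and the toy values are assumptions and do not reproduce the analysis in this work.

```python
import numpy as np


def correlations_with_reference(reference, targets):
    """Pearson correlation of each target gene with a reference gene.

    reference: array of shape (n_samples,), e.g. expression of the reference gene.
    targets:   array of shape (n_genes, n_samples), one row per candidate target.
    Returns an array of length n_genes. A target with constant expression
    has zero variance and yields NaN.
    """
    reference = np.asarray(reference, dtype=float)
    targets = np.asarray(targets, dtype=float)
    ref_centered = reference - reference.mean()
    tgt_centered = targets - targets.mean(axis=1, keepdims=True)
    numerator = tgt_centered @ ref_centered
    denominator = np.sqrt((tgt_centered ** 2).sum(axis=1) * (ref_centered ** 2).sum())
    return numerator / denominator


# Toy values (hypothetical): one target rises with the reference, one falls.
reference = np.array([1.0, 2.0, 3.0, 4.0])
targets = np.array([[2.0, 4.0, 6.0, 8.0],
                    [4.0, 3.0, 2.0, 1.0]])
r = correlations_with_reference(reference, targets)
n_positive = int((r > 0).sum())
n_negative = int((r < 0).sum())
```

A histogram of such correlations over all candidate targets is one simple way to check whether the distribution has two modes, one negative and one positive.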

The frameworks and analyses described in this dissertation can be extended in numerous ways. First, as

discussed in the introductory part of the thesis (section 1.1), we would like to be able to analyze protein

expression profiles rather than mRNA expression profiles. mRNA is subject to further changes and regulation,

and it reflects only around 30%-40% of actual protein abundance. In the future, better methods for

high-throughput measurement of the proteome are expected to be developed, making the analysis of gene

expression patterns more accurate. Second, the results of many of the analyses presented include functional

predictions for single genes. In order to make the most of these predictions, an important next step would

be to test the most promising ones in a wet lab.

The methods for analyzing ISH images of gene expression can also be applied to additional datasets; for

example, human brain ISH images are now partially available. Results from the analyses presented

in Chapter 4 suggest that while ISH images are noisy and hard to analyze, using the appropriate

computational methods to represent and classify them can yield an abundance of new biological

information. In section 4.1.6 we see that a large fraction of functional information exists, surprisingly, in

the smallest patterns of expression, and even in cell shapes. A natural extension of this idea would be to

develop methods to capture these cell shapes and cell distributions in an optimal way.

An important goal in the field of neurogenomics is the accurate mapping of gene expression profiles to

specific cell types and species, at higher temporal resolution and even at subcellular locations. Acquiring

accurate maps of expression will make it possible to link brain structure, evolution, development and, most

importantly, brain functionality. In recent years there has been much focus on methods that map connectivity

between brain regions and even neurons, and on attempts to create artificial models of neural function

through very large-scale modelling of electrical activity in the brain. However, what underlies all structural

properties of the brain and its continuous electrical activity is the information coded in our genome, and

implemented by the differential expression of genes. The continued availability of more specific and

accurate expression data will surely help shed light on brain functionality and complex neural processes

in the healthy and diseased brain.

References

Agranat, Raitskin, Sperling, and Sperling. 2008. “The Editing Enzyme ADAR1 and the mRNA Surveillance Protein hUpf1 Interact in the Cell Nucleus.” Proceedings of the National Academy of Sciences.

Ashburner, Ball, Blake, Botstein, Butler, Cherry, Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology.” Nature Genetics.

Athanasiadis, Rich, and Maas. 2004. “Widespread A-to-I RNA Editing of Alu-Containing mRNAs in the Human Transcriptome.” PLoS Biology.

Auer, and Doerge. 2010. “Statistical Design and Analysis of RNA Sequencing Data.” Genetics.

Bahn, Lee, Li, Greer, Peng, and Xiao. 2012. “Accurate Identification of A-to-I RNA Editing in Human by Transcriptome Sequencing.” Genome Research.

Barak, Levanon, Eisenberg, Paz, Rechavi, Church, and Mehr. 2009. “Evidence for Large Diversity in the Human Transcriptome Created by Alu RNA Editing.” Nucleic Acids Research.

Bass. 2002. “RNA Editing by Adenosine Deaminases That Act on RNA.” Annual Review of Biochemistry.

Bayer, Altman, Russo, and Zhang. 1993. “Timetables of Neurogenesis in the Human Brain Based on Experimentally Determined Patterns in the Rat.” Neurotoxicology.

Bazak, Haviv, Barak, Jacob-Hirsch, Deng, Zhang, Isaacs, et al. 2013. “A-to-I RNA Editing Occurs at over a Hundred Million Genomic Sites, Located in a Majority of Human Genes.” Genome Research.

Becker, Barnes, Bright, and Wang. 2004. “The Genetic Association Database.” Nature Genetics.

Benjamini, and Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society Series B.

Bernard, and Khrestchatisky. 1994. “Assessing the Extent of RNA Editing in the TMII Regions of GluR5 and GluR6 Kainate Receptors during Rat Brain Development.” Journal of Neurochemistry.

Bishop. 2006. Pattern Recognition and Machine Learning. Edited by Jordan, Kleinberg, and Schölkopf. Pattern Recognition. Information Science and Statistics.

Blow, Futreal, Wooster, and Stratton. 2004. “A Survey of RNA Editing in Human Brain.” Genome Research.

Boguski, and Jones. 2004. “Neurogenomics: At the Intersection of Neurobiology and Genome Sciences.” Nature Neuroscience.

Bohland, Bokil, Pathak, Lee, Ng, Lau, Kuan, Hawrylycz, and Mitra. 2010. “Clustering of Spatial Gene Expression Patterns in the Mouse Brain and Comparison with Classical Neuroanatomy.” Methods.

Bosch, Zisserman, and Munoz. 2006. “Scene Classification via pLSA.” Edited by Leonardis, Bischof, and Pinz. Analysis, Lecture Notes in Computer Science.

Bosch, Zisserman, and Munoz. 2007. “Image Classification Using Random Forests and Ferns.” IEEE 11th International Conference on Computer Vision (2007).

Briscoe, Sussel, Serup, Hartigan-O’Connor, Jessell, Rubenstein, and Ericson. 1999. “Homeobox Gene Nkx2.2 and Specification of Neuronal Identity by Graded Sonic Hedgehog Signalling.” Nature.

Bryant, Subrahmanyan, Tworoger, LaTray, Liu, Li, van den Engh, and Ruohola-Baker. 1999. “Characterization of Differentially Expressed Genes in Purified Drosophila Follicle Cells: Toward a General Strategy for Cell Type-Specific Developmental Analysis.” Proceedings of the National Academy of Sciences.

Burns, Chu, Rueter, Hutchinson, Canton, Sanders-Bush, and Emeson. 1997. “Regulation of Serotonin-2C Receptor G-Protein Coupling by RNA Editing.” Nature.

Buss, and Oppenheim. 2004. “Special Review Based on a Presentation Made at the 16th International Congress of the IFAA Role of Programmed Cell Death in Normal Neuronal Development and Function.” Anatomical Science International.

Cahoy, Emery, Kaushal, Foo, Zamanian, Christopherson, Xing, et al. 2008. “A Transcriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Understanding Brain Development and Function.” The Journal of Neuroscience.

Cavodeassi, and Houart. 2012. “Brain Regionalization: Of Signaling Centers and Boundaries.” Developmental Neurobiology.

Challacombe, Snow, and Letourneau. 1996. “Actin Filament Bundles Are Required for Microtubule Reorientation during Growth Cone Turning to Avoid an Inhibitory Guidance Cue.” Journal of Cell Science.

Chen, and Carmichael. 2009. “Nuclear Retention of mRNAs Containing Inverted Repeats in Human Embryonic Stem Cells: Functional Role of a Nuclear Noncoding RNA.” Molecular Cell.

Chen, Cheng, Grennan, Pibiri, Zhang, Badner, Gershon, and Liu. 2013a. “Two Gene Co-Expression Modules Differentiate Psychotics and Controls.” Molecular Psychiatry.

Chen, Li, Lin, Chan, Chow, Song, Liu, et al. 2013b. “Recoding RNA Editing of AZIN1 Predisposes to Hepatocellular Carcinoma.” Nature Medicine.

Chizhikov, Lindgren, Currle, Rose, Monuki, and Millen. 2006. “The Roof Plate Regulates Cerebellar Cell-Type Specification and Proliferation.” Development.

Choi, Yu, Yoo, and Kim. 2005. “Differential Coexpression Analysis Using Microarray Data and Its Application to Human Cancer.” Bioinformatics.

Clancy, and Darlington. 2001a. “Translating Developmental Time across Mammalian Species.” Neuroscience.

Coelho, Peng, and Murphy. 2010. “Quantifying the Distribution of Probes between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics.

Colantuoni, Lipska, Ye, Hyde, Tao, Leek, Colantuoni, et al. 2011. “Temporal Dynamics and Genetic Control of Transcription in the Human Prefrontal Cortex.” Nature.

Csurka, and Dance. 2004. “Visual Categorization with Bags of Keypoints.” Proc. of ECCV International Workshop on Statistical Learning in Computer Vision.

Datson, van der Perk, de Kloet, and Vreugdenhil. 2001. “Expression Profile of 30,000 Genes in Rat Hippocampus Using SAGE.” Hippocampus.

Davis, and Eddy. 2009. “A Tool for Identification of Genes Expressed in Patterns of Interest Using the Allen Brain Atlas.” Bioinformatics.

De Jong, Boks, Fuller, Strengman, Janson, de Kovel, Ori, et al. 2012. “A Gene Co-Expression Network in Whole Blood of Schizophrenia Patients Is Independent of Antipsychotic-Use and Enriched for Brain-Expressed Genes.” PloS One.

De la Fuente. 2010. “From ‘Differential Expression’ to ‘Differential Networking’ - Identification of Dysfunctional Regulatory Networks in Diseases.” Trends in Genetics.

Deng, Berg, and Fei-Fei. 2011. “Hierarchical Semantic Indexing for Large Scale Image Retrieval.” CVPR 2011.

Dickson. 2002. “Molecular Mechanisms of Axon Guidance.” Science.

Dillman, Hauser, Gibbs, Nalls, McCoy, Rudenko, Galter, and Cookson. 2013. “mRNA Expression, Splicing and Editing in the Embryonic and Adult Mouse Cerebral Cortex.” Nature Neuroscience.

Domazet-Lošo, and Tautz. 2010a. “A Phylogenetically Based Transcriptome Age Index Mirrors Ontogenetic Divergence Patterns.” Nature.

Eden, Navon, Steinfeld, Lipson, and Yakhini. 2009. “GOrilla: A Tool for Discovery and Visualization of Enriched GO Terms in Ranked Gene Lists.” BMC Bioinformatics.

Eran, Li, Vatalaro, McCarthy, Rahimov, Collins, Markianos, et al. 2012. “Comparative RNA Editing in Autistic and Neurotypical Cerebella.” Molecular Psychiatry.

Fan, Chang, Hsieh, Wang, and Lin. 2008. “LIBLINEAR: A Library for Large Linear Classification.” The Journal of Machine Learning Research.

Foss, Radulovic, Shaffer, Ruderfer, Bedalov, Goodlett, and Kruglyak. 2007. “Genetic Basis of Proteome Variation in Yeast.” Nature Genetics.

French, and Pavlidis. 2011. “Relationships between Gene Expression and Brain Wiring in the Adult Rodent Brain.” PLoS Computational Biology.

Frise, Hammonds, and Celniker. 2010. “Systematic Image-Driven Analysis of the Spatial Drosophila Embryonic Expression Landscape.” Molecular Systems Biology.

Fu, Keurentjes, Bouwmeester, America, Verstappen, Ward, Beale, et al. 2009. “System-Wide Molecular Evidence for Phenotypic Buffering in Arabidopsis.” Nature Genetics.

Gaiteri, Ding, French, Tseng, and Sibille. 2014. “Beyond Modules and Hubs: The Potential of Gene Coexpression Networks for Investigating Molecular Mechanisms of Complex Brain Disorders.” Genes, Brain, and Behavior.

Gewin. 2005. “A Golden Age of Brain Exploration.” PLoS Biology.

Ghandour, Vincendon, and Gombos. 1980. “Astrocyte and Oligodendrocyte Distribution in Adult Rat Cerebellum: An Immunohistological Study.” Journal of Neurocytology.

Ghazalpour, Bennett, Petyuk, Orozco, Hagopian, Mungrue, Farber, et al. 2011. “Comparative Analysis of Proteome and Transcriptome Variation in Mouse.” PLoS Genetics.

Gillis, and Pavlidis. 2011. “The Role of Indirect Connections in Gene Networks in Predicting Function.” Bioinformatics.

Grange, Bohland, Okaty, Sugino, Bokil, Nelson, Ng, Hawrylycz, and Mitra. 2014. “Cell-Type-Based Model Explaining Coexpression Patterns of Genes in the Brain.” Proceedings of the National Academy of Sciences.

Grauman, and Darrell. 2005. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. Tenth IEEE International Conference on Computer Vision ICCV05 Volume 1. ICCV ’05.

Gray, Fu, Luo, Zhao, Yu, Ferrari, Tenzen, et al. 2004. “Mouse Brain Organization Revealed through Direct Genome-Scale TF Expression Analysis.” Science.

Hamosh, Scott, Amberger, Bocchini, and McKusick. 2005. “Online Mendelian Inheritance in Man (OMIM), a Knowledgebase of Human Genes and Genetic Disorders.” Nucleic Acids Research.

Hanrahan, Palladino, Ganetzky, and Reenan. 2000. “RNA Editing of the Drosophila Para Na(+) Channel Transcript. Evolutionary Conservation and Developmental Regulation.” Genetics.

Hawrylycz, Lein, Guillozet-Bongaarts, Shen, Ng, Miller, van de Lagemaat, et al. 2012. “An Anatomically Comprehensive Atlas of the Adult Human Brain Transcriptome.” Nature.

Hawrylycz, Ng, Page, Morris, Lau, Faber, Faber, et al. 2011. “Multi-Scale Correlation Structure of Gene Expression in the Brain.” Neural Networks: The Official Journal of the International Neural Network Society.

Heale, Keegan, McGurk, Michlewski, Brindle, Stanton, Caceres, and O’Connell. 2009. “Editing Independent Effects of ADARs on the miRNA/siRNA Pathways.” The EMBO Journal.

Henry, and Hohmann. 2012. “High-Resolution Gene Expression Atlases for Adult and Developing Mouse Brain and Spinal Cord.” Mammalian Genome.

Higuchi, Maas, Single, Hartner, Rozov, Burnashev, Feldmeyer, Sprengel, and Seeburg. 2000. “Point Mutation in an AMPA Receptor Gene Rescues Lethality in Mice Deficient in the RNA-Editing Enzyme ADAR2.” Nature.

Horan, Jang, Bailey-Serres, Mittler, Shelton, Harper, Zhu, Cushman, Gollery, and Girke. 2008. “Annotating Genes of Known and Unknown Function by Large-Scale Coexpression Analysis.” Plant Physiology.

Huang. 2009. “Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification.” 2009 IEEE Conference on Computer Vision and Pattern Recognition.

Hui. 2007. “Brain-Specific Aminopeptidase: From Enkephalinase to Protector against Neurodegeneration.” Neurochemical Research.

Kalinka, Varga, Gerrard, Preibisch, Corcoran, Jarrells, Ohler, Bergman, and Tomancak. 2010a. “Gene Expression Divergence Recapitulates the Developmental Hourglass Model.” Nature.

Kalinka, Varga, Gerrard, Preibisch, Corcoran, Jarrells, Ohler, Bergman, and Tomancak. 2010b. “Gene Expression Divergence Recapitulates the Developmental Hourglass Model.” Nature.

Kanehisa. 2002. “The KEGG Database.” Novartis Foundation Symposium.

Kang, Kawasawa, Cheng, Zhu, Xu, Li, Sousa, et al. 2011. “Spatio-Temporal Transcriptome of the Human Brain.” Nature.

Kawahara, and Kwak. 2005. “Excitotoxicity and ALS: What Is Unique about the AMPA Receptors Expressed on Spinal Motor Neurons?” Amyotrophic Lateral Sclerosis and Other Motor Neuron Disorders: Official Publication of the World Federation of Neurology, Research Group on Motor Neuron Diseases.

Kerrien, Aranda, Breuza, Bridge, Broackes-Carter, Chen, Duesbury, et al. 2012. “The IntAct Molecular Interaction Database in 2012.” Nucleic Acids Research.

Khaitovich, Hellmann, Enard, Nowick, Leinweber, Franz, Weiss, Lachmann, and Pääbo. 2005. “Parallel Patterns of Evolution in the Genomes and Transcriptomes of Humans and Chimpanzees.” Science.

Khaitovich, Muetzel, She, Lachmann, Hellmann, Dietzsch, Steigele, et al. 2004. “Regional Patterns of Gene Expression in Human and Chimpanzee Brains.” Genome Research.

Kim, Kim, Walsh, Kobayashi, Matise, Buyske, and Gabriel. 2004. “Widespread RNA Editing of Embedded Alu Elements in the Human Transcriptome.” Genome Research.

Kirsch, and Chechik. “Human Areal Expression of Most Genes Is Governed by Regionalization.” In preparation.

Kirsch, Liscovitch, and Chechik. 2012. “Localizing Genes to Cerebellar Layers by Classifying ISH Images.” Edited by Ohler. PLoS Computational Biology.

Krauss, Johansen, Korzh, and Fjose. 1991. “Expression Pattern of Zebrafish Pax Genes Suggests a Role in Early Brain Regionalization.” Nature.

Lavado, He, Paré, Neale, Olson, Giovannini, and Cao. 2013. “Tumor Suppressor Nf2 Limits Expansion of the Neural Progenitor Pool by Inhibiting Yap/Taz Transcriptional Coactivators.” Development.

Lazebnik, Schmid, and Ponce. 2006. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.” 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Volume 2 CVPR06, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Lee, Hsu, Sajdak, Qin, and Pavlidis. 2004. “Coexpression Analysis of Human Genes across Many Microarray Data Sets.” Genome Research.

Lee, Weindruch, and Prolla. 2000. “Gene-Expression Profile of the Ageing Brain in Mice.” Nature Genetics.

Lein, Hawrylycz, Ao, Ayres, Bensinger, Bernard, et al. 2007. “Genome-Wide Atlas of Gene Expression in the Adult Mouse Brain.” Nature.

Levanon, Eisenberg, Rechavi, and Levanon. 2005. “Letter from the Editor: Adenosine-to-Inosine RNA Editing in Alu Repeats in the Human Genome.” EMBO Reports.

Levanon, Eisenberg, Yelin, Nemzer, Hallegger, Shemesh, Fligelman, et al. 2004. “Systematic Identification of Abundant A-to-I Editing Sites in the Human Transcriptome.” Nature Biotechnology.

Lev-Maor, Sorek, Levanon, Paz, Eisenberg, and Ast. 2007. “RNA-Editing-Mediated Exon Evolution.” Genome Biology.

Lewis. 1978. “A Gene Complex Controlling Segmentation in Drosophila.” Nature.

Li, Bickel, and Biggin. 2014. “System Wide Analyses Have Underestimated Protein Abundances and the Importance of Transcription in Mammals.” PeerJ.

Li, and Church. 2013. “Deciphering the Functions and Regulation of Brain-Enriched A-to-I RNA Editing.” Nature Neuroscience.

Li, Levanon, Yoon, Aach, Xie, Leproust, Zhang, Gao, and Church. 2009. “Genome-Wide Identification of Human RNA Editing Sites by Parallel DNA Capturing and Sequencing.” Science.

Li, Monckton, and Godbout. 2008. “A Role for DEAD Box 1 at DNA Double-Strand Breaks.” Molecular and Cellular Biology.

Li, Su, Lim, and Fei-Fei. 2010a. “Objects as Attributes for Scene Classification.” 12th European Conference of Computer Vision (ECCV), 1st International Workshop on Parts and Attributes.

Li, Su, Xing, and Fei-Fei. 2010b. “Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification.” Proceedings of the Neural Information Processing Systems (NIPS) 2010.

Liang, and Landweber. 2007. “Hypothesis: RNA Editing of microRNA Target Sites in Humans?” RNA.

Linghu, Snitkin, Hu, Xia, and Delisi. 2009. “Genome-Wide Prioritization of Disease Genes and Identification of Disease-Disease Associations from an Integrated Human Functional Linkage Network.” Genome Biology.

Lowe. 1999. “Object Recognition from Local Scale-Invariant Features.” Proceedings of the Seventh IEEE International Conference on Computer Vision.

Lowe. 2004. “Distinctive Image Features from Scale-Invariant Keypoints.” International Journal of Computer Vision.

Ma, Fode, Guillemot, and Anderson. 1999. “NEUROGENIN1 and NEUROGENIN2 Control Two Distinct Waves of Neurogenesis in Developing Dorsal Root Ganglia.” Genes & Development.

Malisiewicz, Gupta, and Efros. 2011. “Ensemble of Exemplar-SVMs for Object Detection and beyond.” 2011 International Conference on Computer Vision.

Malone, and Oliver. 2011. “Microarrays, Deep Sequencing and the True Measure of the Transcriptome.” BMC Biology.

Manning, and Raghavan. 2009. “An Introduction to Information Retrieval.” Edited by Salas. Online.

Mao, Bonni, Xia, Nadal-Vicens, and Greenberg. 1999. “Neuronal Activity-Dependent Cell Survival Mediated by Transcription Factor MEF2.” Science.

Martínez. 2001. “The Isthmic Organizer and Brain Regionalization.” The International Journal of Developmental Biology.

Mattick, and Mehler. 2008. “RNA Editing, DNA Recoding and the Evolution of Human Cognition.” Trends in Neurosciences.

McGinnis, and Krumlauf. 1992. “Homeobox Genes and Axial Patterning.” Cell.

Mehler, and Mattick. 2007. “Noncoding RNAs and RNA Editing in Brain Development, Functional Diversification, and Neurological Disease.” Physiological Reviews.

Miller, Ding, Sunkin, Smith, Ng, Szafer, Ebbert, et al. 2014. “Transcriptional Landscape of the Prenatal Human Brain.” Nature.

Miller, Robinson, Cleary, and Doe. 2009. “TU-Tagging: Cell Type-Specific RNA Isolation from Intact Complex Tissues.” Nature Methods.

Moens, and Prince. 2002. “Constructing the Hindbrain: Insights from the Zebrafish.” Developmental Dynamics.

Needleman, and Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology.

Ng, Bernard, Lau, Overly, Dong, Kuan, Pathak, et al. 2009. “An Anatomic Gene Expression Atlas of the Adult Mouse Brain.” Nature Neuroscience.

Nie, Ding, Kao, Braun, and Yang. 2005. “ADAR1 Interacts with NF90 through Double-Stranded RNA and Regulates NF90-Mediated Gene Expression Independently of RNA Editing.” Molecular and Cellular Biology.

Nishikura. 2010. “Functions and Regulation of RNA Editing by ADAR Deaminases.” Annual Review of Biochemistry.

O’Connor, and Tessier-Lavigne. 1999. “Identification of Maxillary Factor, a Maxillary Process-Derived Chemoattractant for Developing Trigeminal Sensory Axons.” Neuron.

Ohlson, Pedersen, Haussler, and Ohman. 2007. “Editing Modifies the GABA(A) Receptor Subunit alpha3.” RNA.

Ojala, Pietikainen, and Maenpaa. 2002. “Multiresolution Gray-Scale and Rotation Invariant Texture classification with Local Binary Patterns.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Oliver. 2000. “Guilt-by-Association Goes Global.” Nature.

Ota, Sakurai, Gupta, Valente, Wulff, Ariyoshi, Iizasa, Davuluri, and Nishikura. 2013. “ADAR1 Forms a Complex with Dicer to Promote microRNA Processing and RNA-Induced Gene Silencing.” Cell.

Palladino, Keegan, O’Connell, and Reenan. 2000. “A-to-I Pre-mRNA Editing in Drosophila Is Primarily Involved in Adult Nervous System Function and Integrity.” Cell.

Park, Williams, Wold, and Mortazavi. 2012. “RNA Editing in the Human ENCODE RNA-Seq Data.” Genome Research.

Paschen, and Djuricic. 1994. “Extent of RNA Editing of Glutamate Receptor Subunit GluR5 in Different Brain Regions of the Rat.” Cellular and Molecular Neurobiology.

Paz-Yaacov, Levanon, Nevo, Kinar, Harmelin, Jacob-Hirsch, Amariglio, Eisenberg, and Rechavi. 2010. “Adenosine-to-Inosine RNA Editing Shapes Transcriptome Diversity in Primates.” Proceedings of the National Academy of Sciences.

Peng, Bonamy, Glory-Afshar, Rines, Chanda, and Murphy. 2010. “Determining the Distribution of Probes between Different Subcellular Locations through Automated Unmixing of Subcellular Patterns.” Proceedings of the National Academy of Sciences.

Peng, Cheng, Tan, Kang, Tian, Zhu, Zhang, et al. 2012. “Comprehensive Analysis of RNA-Seq Data Reveals Extensive RNA Editing in a Human Transcriptome.” Nature Biotechnology.

Peng, Long, Zhou, Leung, Eisen, and Myers. 2007. “Automatic Image Analysis for Gene Expression Patterns of Fly Embryos.” BMC Cell Biology.

Ponomarev, Wang, Zhang, Harris, and Mayfield. 2012. “Gene Coexpression Networks in Human Brain Identify Epigenetic Modifications in Alcohol Dependence.” The Journal of Neuroscience.

Prasanth, Prasanth, Xuan, Hearn, Freier, Bennett, Zhang, and Spector. 2005. “Regulating Gene Expression through RNA Nuclear Retention.” Cell.

Pruteanu-Malinici, Mace, and Ohler. 2011. “Automatic Annotation of Spatial Expression Patterns via Sparse Bayesian Factor Models.” Edited by Bader. PLoS Computational Biology.

Puelles, and Rubenstein. 1993. “Expression Patterns of Homeobox and Other Putative Regulatory Genes in the Embryonic Mouse Forebrain Suggest a Neuromeric Organization.” Trends in Neurosciences.

Raitskin, Cho, Sperling, Nishikura, and Sperling. 2001. “RNA Editing Activity Is Associated with Splicing Factors in lnRNP Particles: The Nuclear Pre-mRNA Processing Machinery.” Proceedings of the National Academy of Sciences.

Ramaswami, and Li. 2014. “RADAR: A Rigorously Annotated Database of A-to-I RNA Editing.” Nucleic Acids Research.

Ramaswami, Lin, Piskol, Tan, Davis, and Li. 2012. “Accurate Identification of Human Alu and Non-Alu RNA Editing Sites.” Nature Methods.

Ramaswami, Zhang, Piskol, Keegan, Deng, O’Connell, and Li. 2013. “Identifying RNA Editing Sites Using RNA Sequencing Data Alone.” Nature Methods.

Riedmann, Schopoff, Hartner, and Jantsch. 2008. “Specificity of ADAR-Mediated RNA Editing in Newly Identified Targets.” RNA.

Rong, Wang, and Morgan. 2004. “Identification of Candidate Purkinje Cell-Specific Markers by Gene Expression Profiling in Wild-Type and pcd(3J) Mice.” Brain Research. Molecular Brain Research.

Rula, Lagrange, Jacobs, Hu, Macdonald, and Emeson. 2008. “Developmental Modulation of GABA(A) Receptor Function by RNA Editing.” The Journal of Neuroscience.

Saito, Hirai, and Yonekura-Sakakibara. 2008. “Decoding Genes with Coexpression Networks and Metabolomics - ‘Majority Report by Precogs’.” Trends in Plant Science.

Sandberg. 2000. “From the Cover: Regional and Strain-Specific Gene Expression Mapping in the Adult Mouse Brain.” Proceedings of the National Academy of Sciences.

Sanjana, Levanon, Hueske, Ambrose, and Li. 2012. “Activity-Dependent A-to-I RNA Editing in Rat Cortical Neurons.” Genetics.

Sato, Joyner, and Nakamura. 2004. “How Does Fgf Signaling from the Isthmic Organizer Induce Midbrain and Cerebellum Development?” Development, Growth & Differentiation.

Savva, Rieder, and Reenan. 2012. “The ADAR Protein Family.” Genome Biology.

Scadden. 2005. “The RISC Subunit Tudor-SN Binds to Hyper-Edited Double-Stranded RNA and Promotes Its Cleavage.” Nature Structural & Molecular Biology.

Scadden. 2007. “Inosine-Containing dsRNA Binds a Stress-Granule-like Complex and Downregulates Gene Expression In Trans.” Molecular Cell.

Scadden, and O’Connell. 2005. “Cleavage of dsRNAs Hyper-Edited by ADARs Occurs at Preferred Editing Sites.” Nucleic Acids Research.

Scadden, and Smith. 2001a. “Specific Cleavage of Hyper-Edited dsRNAs.” The EMBO Journal.

Scadden, and Smith. 2001b. “RNAi Is Antagonized by A-->I Hyper-Editing.” EMBO Reports.

Schena, Shalon, Davis, and Brown. 1995. “Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray.” Science.

Schlicker, Domingues, Rahnenführer, and Lengauer. 2006. “A New Measure for Functional Similarity of Gene Products Based on Gene Ontology.” BMC Bioinformatics.

Schwaller. 2012. “The Use of Transgenic Mouse Models to Reveal the Functions of Ca2+ Buffer Proteins in Excitable Cells.” Biochimica et Biophysica Acta.

Silberberg, Lundin, Navon, and Öhman. 2012. “Deregulation of the A-to-I RNA Editing Mechanism in Psychiatric Disorders.” Human Molecular Genetics.

Smedley, Haider, Ballester, Holland, London, Thorisson, and Kasprzyk. 2009. “BioMart--Biological Queries Made Easy.” BMC Genomics.

Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences.

Sunkin, Ng, Lau, Dolbeare, Gilbert, Thompson, Hawrylycz, and Dang. 2012. “Allen Brain Atlas: An Integrated Spatio-Temporal Portal for Exploring the Central Nervous System.” Nucleic Acids Research.

Taga, Miyoshi, Okajima, Matsuda, and Nadano. 2010. “Identification of Heterogeneous Nuclear Ribonucleoprotein A/B as a Cytoplasmic mRNA-Binding Protein in Early Involution of the Mouse Mammary Gland.” Cell Biochemistry and Function.

Takano-Maruyama, Chen, and Gaufo. 2011. “Differential Contribution of Neurog1 and Neurog2 on the Formation of Cranial Ganglia along the Anterior-Posterior Axis.” Developmental Dynamics.

Tau, and Peterson. 2010. “Normal Development of Brain Circuits.” Neuropsychopharmacology: Official Publication of the American College of Neuropsychopharmacology.

Thomas, Lee, Dalton, Nomie, Stoica, Costa-Mattioli, Chang, Nuzhdin, Arbeitman, and Dierick. 2012. “A Versatile Method for Cell-Specific Profiling of Translated mRNAs in Drosophila.” PloS One.

Thompson, Ng, Menon, Martinez, Lee, Glattfelder, Sunkin, et al. 2014. “A High-Resolution Spatiotemporal Atlas of Gene Expression of the Developing Mouse Brain.” Neuron.

Tonkin, Saccomanno, Morse, Brodigan, Krause, and Bass. 2002. “RNA Editing by ADARs Is Important for Normal Behavior in Caenorhabditis Elegans.” The EMBO Journal.

Torkamani, Dean, Schork, and Thomas. 2010. “Coexpression Network Analysis of Neural Tissue Reveals Perturbations in Developmental Processes in Schizophrenia.” Genome Research.

Torresani, Szummer, and Fitzgibbon. 2010. “Efficient Object Category Recognition Using Classemes.” Computer Vision–ECCV 2010.

Vedaldi, and Fulkerson. 2010. “VLFeat - An Open and Portable Library of Computer Vision Algorithms.” Design, MM ’10.

Venter, Adams, Myers, Li, Mural, Sutton, Smith, et al. 2001. “The Sequence of the Human Genome.” Science.

Vidal-Sanz, Bray, Villegas-Pérez, Thanos, and Aguayo. 1987. “Axonal Regeneration and Synapse Formation in the Superior Colliculus by Retinal Ganglion Cells in the Adult Rat.” The Journal of Neuroscience.

Vigil, Cherfils, Rossman, and Der. 2010. “Ras Superfamily GEFs and GAPs: Validated and Tractable Targets for Cancer Therapy?” Nature Reviews Cancer.

Vincent, DeVoss, Ryan, and Murphy. 2002. “Analysis of Neuronal Gene Expression with Laser Capture Microdissection.” Journal of Neuroscience Research.

Voineagu, Wang, Johnston, Lowe, Tian, Horvath, Mill, Cantor, Blencowe, and Geschwind. 2011. “Transcriptomic Analysis of Autistic Brain Reveals Convergent Molecular Pathology.” Nature.

Vollmer, and Clerc. 2002. “Homeobox Genes in the Developing Mouse Brain.” Journal of Neurochemistry.

Walker, Russell, and Hodgetts. 1987. “Is Schizophrenia a Neurodevelopmental Disorder?” British Medical Journal.

Walker, Volkmuth, and Klingler. 1999. “Pharmaceutical Target Discovery Using Guilt-by-Association: Schizophrenia and Parkinson's Disease Genes.” ISMB Proceedings 1999.

Wang, So, Devlin, Zhao, Wu, and Cheung. 2013. “ADAR Regulates RNA Editing, Transcript Stability, and Gene Expression.” Cell Reports.

Wang, and Zoghbi. 2001. “Genetic Regulation of Cerebellar Development.” Nature Reviews. Neuroscience.

Ward, McCann, DeWulf, Wu, and Rao. 2003. “Distinguishing between Directional Guidance and Motility Regulation in Neuronal Migration.” The Journal of Neuroscience.

Warf, Shepherd, Johnson, and Bass. 2012. “Effects of ADARs on Small RNA Processing Pathways in C. Elegans.” Genome Research.

Waterston, Lindblad-Toh, Birney, Rogers, Abril, Agarwal, Agarwala, et al. 2002. “Initial Sequencing and Comparative Analysis of the Mouse Genome.” Nature.

Website: ©2012 Allen Institute for Brain Science. Allen Human Brain Atlas [Internet]. Available from: http://human.brain-map.org/.

Website: ©2012 Allen Institute for Brain Science. BrainSpan Atlas of the Developing Human Brain [Internet]. Available from: http://brainspan.org/.

Website: ©2012 Allen Institute for Brain Science. NIH Blueprint Non-Human Primate (NHP) Atlas [Internet]. Available from: http://www.blueprintnhpatlas.org/.

Weissbach, and Scadden. 2012. “Tudor-SN and ADAR1 Are Components of Cytoplasmic Stress Granules.” RNA.

Wingate. 2001. “The Rhombic Lip and Early Cerebellar Development.” Current Opinion in Neurobiology.

Wolf, Goldberg, Manor, Sharan, and Ruppin. 2011. “Gene Expression in the Rodent Brain Is Associated with Its Regional Connectivity.” PLoS Computational Biology.

Wu, Neff, Kalisky, Dalerba, Treutlein, Rothenberg, Mburu, et al. 2014. “Quantitative Assessment of Single-Cell RNA-Sequencing Methods.” Nature Methods.

Zapala, Hovatta, Ellison, Wodicka, Del Rio, Tennant, Tynan, et al. 2005. “Adult Mouse Brain Gene Expression Patterns Bear an Embryologic Imprint.” Proceedings of the National Academy of Sciences.

Zhang, and Carmichael. 2001. “The Fate of dsRNA in the Nucleus: A p54nrb-Containing Complex Mediates the Nuclear Retention of Promiscuously A-to-I Edited RNAs.” Cell.

Zhang, and Goldman. 1996. “Generation of Cerebellar Interneurons from Dividing Progenitors in White Matter.” Neuron.

Zhong, and Sternberg. 2007. “Automated Data Integration for Developmental Biological Research.” Development.

Zirlinger, Lo, McMahon, McMahon, and Anderson. 2002. “Transient Expression of the bHLH Factor Neurogenin-2 Marks a Subpopulation of Neural Crest Cells Biased for a Sensory but Not a Neuronal Fate.” Proceedings of the National Academy of Sciences.
