Spatiotemporal Patterns of Genomic Expression in the Mammalian Brain
Noa Liscovitch
Interdisciplinary Studies Unit Gonda Multidisciplinary Brain Research Center
Ph.D. Thesis
Submitted to the Senate of Bar-Ilan University
Ramat-Gan, Israel July 2014
This work was carried out under the supervision of Dr. Gal Chechik
Gonda Multidisciplinary Brain Research Center, Bar-Ilan University
Acknowledgements
I’d like to express my deepest gratitude to my PhD advisor, Gal Chechik. As a biology student with no
computational background at all, Gal took a chance on me when accepting me as a student in his lab for
a 4-year long mutual commitment. This chance paid off tremendously for me as I had the opportunity to
be mentored by a PI with a rare “hands-on” approach: teaching me everything from high level biological
and computational principles to everyday advice on code preparation and organization, always with an
incredible amount of patience for any of my questions, trivial as they may have seemed to him. Gal always
encouraged me, as a young research student, to think independently and believe in my ideas, while
respecting my pace and giving me great personal space to progress as I see fit, making this long journey
towards a PhD much more manageable and enjoyable.
My lab mates, especially four of them who I have been lucky to meet at the lab almost every day for the
last four years: Hadas Taubman, Lior Kirsch, Ossnat Bar Shira and Uri Shalit. Thank you for your friendship
and for endless amounts of professional and personal advice.
The staff of the Gonda secretariat: Aliza Shadmi, Asi Kirsch, Henia Gal, Ma’ayan Tsibelman and Tami
Rubenov, have been incredibly helpful, with a welcoming smile on their faces, and I could always count
on them for help in figuring out complex university bureaucracies or lending coffee and milk when needed.
My family: my mother, Billa Harari-Liscovitch and brother Dror Liscovitch, for their continuous support
and interest in my research and attempts to read my papers, and my father, Moti Liscovitch, who passed
away just several days before I started this PhD, and who I have missed dearly every single day since.
Lastly, my husband Lior: thank you for making me laugh, for fixing my computer, and for your love and support.
Table of Contents
Abstract
Chapter 1: Introduction
1.1. What is gene expression? Measuring the transcriptome as a proxy to the proteome
1.2 Technologies for measuring mRNA transcripts
1.2.1 DNA microarrays
1.2.2 RNA-Sequencing
1.2.3 In-situ hybridization (ISH)
1.3 Using genome-wide patterns of expression to study neural processes
1.3.1 Gene expression in brain development
1.3.2 Spatial patterns of expression in multiple scales; regional to single-cells
1.3.3 Inferring gene function from neural co-expression patterns
1.3.4 Beyond expression - post-transcriptional mechanisms in the brain
1.4 Dissertation outline
Chapter 2: Specialization of neural expression during mouse development
2.1 Introduction
2.2 Results
2.2.1 Changes in expression regionalization during development
2.2.2 Functional characteristics of early and post-natal regionalization
2.2.3 Expression conservation across regions and their embryonic origins
2.2.4 Comparison with human development
2.3 Methods
2.3.1 Data acquisition and pre-processing
2.3.2 Selecting brain region delineation
2.3.3 Contribution of individual genes to the hourglass shape and functional analysis
2.3.4 Identifying genes with similar sequences
2.3.5 Constructing reference curves for correlation analysis
2.3.6 Visualizing inter-region distances
2.3.7 Dissimilarity of one region to the rest of the brain
2.3.8 Mouse-human comparison
2.4 Discussion
Chapter 3: Methods to represent neural ISH images
3.1 Introduction
3.2 A visual representation of ISH images
3.2.1 Feature extraction
3.2.2 Feature aggregation using “Bags of visual words”
3.2.3 Applying a spatial pyramid kernel to the images
3.2.4 Using the representations for classification
3.3 A functional representation of ISH images
3.3.1 Data filtering and preprocessing
3.3.2 Creating the representations
3.3.3 Choosing parameters for analysis
Chapter 4: Analysis of neural ISH images
4.1 Explainable gene coexpression patterns using ISH functional representations
4.1.1 Calculating image-image similarities
4.1.2 Robustness of bag-of-words representations
4.1.3 Predicting functional annotations using brain ISH images
4.1.4 Comparison with Neuroblast, the ABA image-correlation tool
4.1.5 Identifying and explaining similarities between GABAergic neuron markers
4.1.6 Finding important spatial patterns in different scales using SIFT "visual words"
4.1.7 Inferring new gene functions via explainable similarities
4.2 Using ISH images to predict neural disease-related genes
4.2.1 Image classification based on disease-gene markers
4.2.2 Validation of results
4.3 Localizing genes to cerebellar layers using ISH image classification
4.3.2 Genome-wide predictions of cerebellum layer markers
4.3.3 Characterizing layer-specific genes
Chapter 5: Patterns of RNA editing in the brain
5.1 Introduction
5.2 Results
5.2.1 ADAR and ADARB1 expression in the brain
5.2.2 ADAR expression is positively correlated with potential editing targets
5.2.3 Effect of Alu location in the gene
5.2.4 Specificity of ADAR-target correlations
5.2.5 Relation of ADAR-target co-expression and editing potential in targets
5.2.6 Correlations with ADAR over development
5.3 Methods
5.3.1 The data
5.3.2 Choosing target and background sets
5.3.3 Testing ADAR-target correlations at different Alu locations
5.3.4 Functional analysis of gene sets
5.4 Discussion
Concluding remarks
References
Hebrew abstract
List of Figures
Figure 1.1: Measuring gene expression on the tissue using in situ hybridization
Figure 1.2: Mammalian brain development
Figure 1.3: Brain regionalization and patterning is controlled by gene expression
Figure 2.1: mouse brain developmental timeline.
Figure 2.2: Mean pair-wise dissimilarities between the regions.
Figure 2.3: robustness of hourglass shape to the sampling genes.
Figure 2.4: The hourglass shape is robust when removing highly variable genes.
Figure 2.5: The hourglass shape is robust throughout the brain.
Figure 2.6. Functional characterization of hourglass shape.
Figure 2.7. Sequence similarity vs. spatial correlation of gene pairs belonging to the GO category
'neuron fate commitment'.
Figure 2.8: cosine dissimilarity curve when computed with the three categories showing the
largest correlation with each reference curve.
Figure 2.9: Changes in dissimilarity across individual brain regions.
Figure 2.10: Hierarchical clustering of 11 large brain regions over development.
Figure 2.11: Comparison with human data.
Figure 2.12: Cross-correlation between mouse and human expression profiles over development
Figure 2.13: Comparison with human data.
Figure 3.1: gene expression ISH images for the genes.
Figure 3.2: Gene expression for each gene was measured on a brain from a different individual
mouse.
Figure 3.3: Calculating SIFT descriptors.
Figure 3.4: various patterns of expression taken from one ISH image, at the same scale.
Figure 3.5: Calculating LBP features.
Figure 3.6. Representing images using the Bag-of-visual-words model.
Figure 3.7: A spatial pyramid approach to extracting dense SIFT features.
Figure 3.8: Using compact ISH image representation as input for classifiers.
Figure 3.9: Each image series was represented with three slices.
Figure 3.10: Regular and expression-masked examples of ISH images as provided by the Allen Brain
Atlas.
Figure 3.11. The raw data.
Figure 3.12. Illustration of the image processing pipeline.
Figure 3.13: Mean test-AUC values for dictionary size K=100, 200, 500, 1000.
Figure 3.14: Mean test-set AUCs for dictionary size K=100 versus K=1000.
Figure 3.15: Mean AUC (averaged over test-splits) for the GO categories vs. GO category size
(number of genes in the category).
Figure 4.1: The similarity in the representation of same-gene pairs and different-gene pairs.
Figure 4.2. AUC scores for GO categories related to the nervous system and the remaining
categories.
Figure 4.3. Precision at top-K for similarity.
Figure 4.4. Representing ISH images with visual words.
Figure 4.5. The visual words important in classifying Add2 GO categories are overlaid on the
Add2 ISH image.
Figure 4.6: ROC curves for (A) Parkinson’s disease predictions and (B) epilepsy predictions.
Figure 4.7. Comparison with Purkinje-deficient mice and layer enrichment for cell types.
Figure 4.8: Examples of novel genetic markers.
Figure 5.1. ADAR and ADARB1 expression in the human brain based on the ABA-2013 dataset.
Figure 5.2. The distribution of spatial correlation values between ADAR and targets and between
ADAR and a background set.
Figure 5.3: Effect vs. Alu location.
Figure 5.4. The distribution of spatial correlation values between ADAR and targets containing
Alus in different locations and between ADAR and a background set.
Figure 5.6: 2D histograms of the correlation of genes with ADAR vs. the number of Alu repeats the
genes contain.
Figure 5.7. The distribution of spatial correlation values between ADAR and targets with intronic
Alus (orange) and between ADAR and a background set (blue), at different time points
Figure 5.8. ADAR-target correlations over development.
Figure 5.9: Differential co-expression of ADAR and targets.
List of Tables
Table 2.1. Mean contribution values of GO categories at E11.5 and P28.
Table 3.1: Pearson's rho correlation values between AUC results for 2081 categories, compared
across the 4 different dictionary sizes.
Table 4.1. The GO categories classified with highest test-set AUC values.
Table 4.2. Top-10 GO annotations explaining the similarities between the gene Synpo2 and Npepps
and Rasa4.
Table 4.3: Top 10 predicted genes for the two diseases, and corresponding prediction scores.
Table 4.4: Prediction validation using two datasets: GAD and Linghu 2009
Table 4.5. Functional enrichment of genes localized to the white matter
Table 4.6. Functional enrichment of genes localized to the Purkinje layer
Table 4.7. Functional enrichment of genes localized to the granular layer
Table 4.8. Functional enrichment of genes localized to the molecular layer
Table 5.1. Number of target genes and background genes used in the analyses
List of Abbreviations
ABA: Allen Brain Atlas
AUC: Area Under the ROC Curve
BoW: Bag of Words
cDNA: Complementary DNA
devABA: Allen Developing Mouse Brain Atlas
FACS: Fluorescence-Activated Cell Sorting
GO: Gene Ontology
GBA: Guilt By Association
ISH: In-Situ Hybridization
LBP: Local Binary Patterns
PCA: Principal Component Analysis
PD: Parkinson’s disease
ROC: Receiver Operating Characteristic
SIFT: Scale Invariant Feature Transform
SVM: Support Vector Machine
Abstract
The vast complexity of the brain in terms of structure, function and development is enabled by the
coordinated work of thousands of genes in time and space. In fact, around 80% of genes are differentially
expressed in the brain, and more than half of mouse genes have been found to be involved in brain
development. Although considerable effort and progress have been made in the past century to advance
the field of neurogenomics, the rules and driving forces behind a proper execution of an organism's
functional brain are far from being fully understood.
Patterns of gene expression in the brain have been studied for decades in the context of single genes or
small groups of genes, but the complexity and scale of the mammalian brain calls for the usage of genomic
approaches to the study of expression patterns in the developing and the adult brain. In the last decade,
new high throughput methods for biological data collection enabled the accumulation of large neural
gene-expression datasets allowing the study of complex neural processes from an integrative, genomic
point of view.
In this dissertation, several large neural gene expression databases are used to analyze genome-scale
patterns of expression from several angles: developmental, spatial and functional. The intention is to shed
light on higher-order principles of brain organization and function, but also on single-gene function in
the context of neural processes.
We first look into one of the most complex biological processes: brain development. As the brain develops,
specific regions are formed, their structure and function reflected in unique sets of expressed genes. We
investigated the temporal dynamics of changes in regional gene expression patterns throughout mouse
brain development. We identify a neurotypic phase around the time of birth, in which patterns of gene
expression become more homogeneous across the brain, creating an ‘hourglass’ shaped expression
divergence profile. We characterize the biological processes, genes and brain regions responsible for this
pattern, and also compare mouse neurodevelopmental expression patterns with parallel data from
human, finding striking similarities and differences between the two species.
We then describe methods to exploit the abundance of spatial information that exists in high-resolution
images that show a mapping of gene expression in brain tissues, by employing computer vision methods
to represent and classify the images. Methods for feature extraction and image representation have been
developed for natural images, and high-resolution images of gene expression pose a unique challenge for
analysis. After creating a representation for the images, we can use it as input for classifiers. We use the
representations calculated for neural expression images to extract meaningful biological information such
as layer-specific gene markers in the mouse cerebellum, identify spatial co-expression profiles, and predict
functional annotations for genes and disease-gene markers.
Finally, we look beyond gene expression, and examine patterns of A-to-I RNA editing by ADARs, a post-
transcriptional modification of pre-mRNA that is essential for normal life and development in vertebrates.
Although most human genes have been shown to undergo editing, the exact role of RNA editing is still
unclear, and various functions have been proposed to explain its operation. We addressed one current
hypothesis stating that editing is a way to negatively regulate gene expression, by looking at co-expression
of ADAR and potential RNA editing targets. Instead of the negative co-expression expected, we found a
positive one, suggesting a complex regulation mediated by RNA editing in the human brain.
Chapter 1: Introduction
The brain is our most complex organ; the human brain is composed of 50-100 billion neurons, forming an
estimated 10^14 connections. There are numerous types of neurons that differ from each other
in their functional and morphological properties. Neurons in the vertebrate brain are also supported by a
vast number of glial cells, which comprise many subtypes as well, varying by function and location. Brain
cells are organized in different compositions and spatial patterns to form cell layers, which in turn form
functionally distinct brain regions. This complexity in structure and function is reflected in gene expression profiles
that are specialized in time and space. Given this, it is hardly surprising that around 80% of genes are
differentially expressed in the brain (Hawrylycz et al. 2012), and that more than half of mouse genes have
been found to be involved in brain development (Waterston et al. 2002).
Although considerable effort and progress have been made in the past century to advance the field of
neurogenomics, the rules and driving forces behind a proper execution of an organism's functional brain
are far from being fully understood. Many of the genes involved in brain patterning and function remain
unidentified. Gene expression in the brain is governed by regulatory networks of transcription factors and
other regulatory proteins controlling their concentrations in a highly localized and timely manner. Failure
to express the right gene at the right time and place can cause many neural diseases that have been found
to have a genetic basis: autism, Down syndrome, fragile X syndrome, Rett syndrome and
neurofibromatosis, to name a few (Walker, Russell, and Hodgetts 1987).
Patterns of gene expression in the brain have been studied for decades in the context of single genes or
small groups of genes, but the complexity and scale of the mammalian brain described above calls for the
usage of genomic approaches to the study of expression patterns in the developing and the adult brain.
Current methods to measure gene expression have significantly improved in accuracy, cost and runtime
(Malone and Oliver 2011). In the last decade, large amounts of data were accumulated using methods such
as DNA microarrays and RNA sequencing. Specifically, a couple of extensive gene-expression datasets are
available for the study of the brain (Lein et al. 2007; Sunkin et al. 2012). In addition, data from many
individual experiments were assembled into large repositories. As a result, a genomic approach to
neuroscience is enabling us to study complex neural processes from an integrative point of view (Boguski
and Jones 2004; Zhong and Sternberg 2007).
In this dissertation, several large neural gene expression databases are used to analyze genome-scale
patterns of expression from several angles: developmental, spatial and functional. The intention is to
shed light on higher-order principles of brain organization and function, but also on single-gene function
in the context of neural processes. Specifically, we look into patterns of expression in the developing brain,
explore methods to extract visual and semantic features from images showing high resolution expression
mapping in the brain, identify markers of layers in the brain and gene co-expression patterns, examine
patterns of genes known to be related to neural disease and finally, look into RNA editing, a process which
is especially known to be important in the brain.
The introductory chapter is organized as follows: I start by defining gene expression and discussing the
concept of measuring mRNA expression as a proxy for protein abundance. I then describe several popular
methods to measure gene expression. I continue by reviewing some of the approaches that have been
used so far to investigate genomic patterns of neural gene expression which are relevant to the work
presented in this thesis. The introductory chapter concludes with an outline of this dissertation.
1.1. What is gene expression? Measuring the transcriptome as a proxy to the proteome
A well-known concept in biology, called “the central dogma”, describes the main flow of genetic
information in the cell. The genetic information is stored in DNA which codes for proteins. Proteins carry
out most of the actual work in the cell. Information from DNA to protein is mediated by molecules called
messenger RNA (mRNA). The specific phenotype and function of each cell is determined by the subset of
proteins it expresses at any given moment. Proteins vary widely in size and 3D structure, while
mRNA molecules have a simpler and more uniform structure: each is a single-stranded
molecule built from four ribonucleotides. The complex and highly variable nature of proteins
makes it difficult to measure protein abundances in cells and tissues, and so high-throughput methods
have focused mainly on the measurements of mRNA abundances as a proxy for protein levels. It is clear
that mRNA levels do not precisely reflect protein abundances, due to many post-transcriptional regulatory
processes such as mRNA and protein degradation, and varying rates of translation. Still, the general
assumption is that mRNA levels are correlated with protein levels. The concordance between mRNA and
protein levels has been studied, for example in yeast (Foss et al. 2007) and in Arabidopsis (Fu et al. 2009),
finding a weak but significant correlation between the two measures. A more recent study conducted in
mouse shows a modest correlation (r = 0.27) between the transcriptome and the proteome (Ghazalpour
et al. 2011). Subsequently, another study argued that both protein abundances and the relative
importance of transcription have been significantly underestimated (Li, Bickel, and Biggin 2014). While this
issue is still controversial, the wide consensus today is to measure gene expression using mRNA
abundances. Approximation of protein abundances may improve in the coming years with the advent of
more accurate ways to measure mRNA transcripts, or even with the development of high-throughput
methods to measure the proteome itself.
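To make the notion of mRNA-protein concordance concrete, the following is a minimal sketch of how such a correlation could be computed for a set of genes. It assumes a Pearson correlation on log2-transformed abundances; the cited studies may have used a different measure or preprocessing, and the abundance values below are hypothetical numbers chosen purely for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical abundances for five genes (arbitrary units, illustration only).
mrna = [120.0, 15.0, 800.0, 60.0, 300.0]
protein = [900.0, 400.0, 2500.0, 150.0, 5000.0]

# Abundances span orders of magnitude, so they are compared on a log scale
# (an assumption of this sketch, not a step taken from the cited studies).
log_mrna = [math.log2(v) for v in mrna]
log_protein = [math.log2(v) for v in protein]

r = pearson_r(log_mrna, log_protein)
print(f"Pearson r between log2 mRNA and log2 protein: {r:.2f}")
```

A value of r close to 1 would mean that mRNA levels rank and scale genes much as protein levels do, whereas values such as the r = 0.27 reported above indicate that mRNA explains only a small part of the variation in protein abundance.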
1.2 Technologies for measuring mRNA transcripts
1.2.1 DNA microarrays
Most genome-scale gene expression studies in the past decade were conducted using DNA microarrays,
a method that measures expression levels of thousands of mRNA transcripts of interest, effectively
providing a “snapshot” of the transcriptome in a certain condition which can be a tissue, organism,
subject, time point, cell type etc. In a microarray experiment, mRNA extracted from a tissue is hybridized
to a matrix of wells, containing different complementary DNA (cDNA) strands for all the sequences of
interest. Every well in the matrix shows transcript abundance for a different gene. There are two types of
microarray experiments: dual channel experiments, where two experimental conditions are labeled in
different colors and compared to each other, and single channel experiments, where transcript
abundance for the genes is compared to overall mRNA concentration in the sample. Microarray
technology is very powerful because it provides fast measurements for thousands of genes at a very low
cost, as opposed to older methods such as northern blotting and RT-PCR, and has thus been widely used
in biological and medical sciences since its introduction in 1995 (Schena et al. 1995). The two main
technical challenges when using this method are (a) quantifying the fluorescent signal, and (b) the need
to know the sequences of interest in advance, in order to create the library of the probes.
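As an illustration of the dual-channel design described above, the sketch below computes a per-gene log2 ratio between the two labeled conditions. It is a simplified example under stated assumptions: the function name `log2_ratios`, the pseudocount, and the intensity values are hypothetical, and real pipelines typically add background correction and normalization steps that are omitted here.

```python
import math


def log2_ratios(red, green, pseudocount=1.0):
    """Per-gene log2(red / green) from two-channel spot intensities.

    The pseudocount is an assumption of this sketch; it avoids taking
    the logarithm of zero for spots with no signal.
    """
    if red.keys() != green.keys():
        raise ValueError("both channels must cover the same genes")
    return {
        gene: math.log2((red[gene] + pseudocount) / (green[gene] + pseudocount))
        for gene in red
    }


# Hypothetical spot intensities for three genes in two conditions.
condition_a = {"geneA": 1023.0, "geneB": 255.0, "geneC": 63.0}
condition_b = {"geneA": 255.0, "geneB": 255.0, "geneC": 255.0}

ratios = log2_ratios(condition_a, condition_b)
for gene, value in ratios.items():
    # Positive values: higher in condition A; negative: higher in condition B.
    print(f"{gene}: {value:+.1f}")
```

Working with log ratios makes up- and down-regulation symmetric around zero, which is one common reason this representation is used for two-condition comparisons.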
1.2.2 RNA-Sequencing
In recent years, several new sequencing technologies have been developed that make possible another
type of sequence-based mRNA annotation, which has been dubbed RNA-Seq (Mortazavi, Williams et al.
2008). Instead of sequencing RNAs one at a time, these new technologies produce sequence information
about most of the RNAs in a biological sample in a single experiment. The underlying sequencing
technologies, however, are incapable of producing full transcript sequences as is the case with full-length
cDNA sequencing. Instead they recover hundreds or thousands of short independent subsequences from
the mRNAs, with typical lengths of a few dozen base pairs. As with the earlier methods, these are
either aligned to a known reference genome, or they are mapped in an overlapping manner for a de-novo
assembly of a transcriptome. While RNA-Seq does not require previous knowledge of reference sequences
and can be quantified in a more exact manner (the output data is essentially transcript count) this method
is relatively new and the measurement noise and biases are not as well understood as in microarrays
(Auer and Doerge 2010; Malone and Oliver 2011).
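The idea that RNA-Seq output is essentially a transcript count can be illustrated with a toy sketch. This is not a real aligner: exact substring matching stands in for read alignment, and the function name `count_reads`, the two reference sequences and the reads are all hypothetical values chosen for illustration.

```python
def count_reads(reads, transcripts):
    """Count reads per transcript, keeping only uniquely mapping reads.

    reads: list of short read strings.
    transcripts: dict mapping transcript name to its reference sequence.
    """
    counts = {name: 0 for name in transcripts}
    for read in reads:
        # Exact substring match is a stand-in for real alignment (assumption).
        hits = [name for name, seq in transcripts.items() if read in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

# Hypothetical toy reference and reads.
transcripts = {"geneA": "ATGGCGTACGTTAGC", "geneB": "ATGCCCTTTGGGAAA"}
reads = ["GCGTAC", "GTTAGC", "CCCTTT", "TTTTTT", "ATG"]
counts = count_reads(reads, transcripts)
```

Here the read "ATG" matches both references and is discarded, and "TTTTTT" matches neither, which hints at why multi-mapping and unmapped reads are among the sources of bias in such counts.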
1.2.3 In-situ hybridization (ISH)
A complementing aspect of gene expression studies is the spatial localization of mRNA or its protein
product. A popular method for acquiring this information is In Situ Hybridization (ISH) (Lein et al. 2007).
ISH is a potent method allowing for the localization of nucleic acid targets in fixed tissues, therefore
obtaining high resolution spatial information about gene expression. ISH should not be confused with
single-cell fluorescent in situ hybridization (FISH) image analysis, which aims to identify subcellular
structures. The principle of ISH is that mRNA, after several steps of preservation and fixation, can be
detected by using a complementary, fluorescently labeled probe, and gene expression can then be
measured in the context of tissue/cell morphology.
In an ISH experiment, a fluorescently labeled probe is hybridized to a complementary mRNA strand in the
tissue itself, which is thinly sliced, mapping the transcripts to their original location. Figure 1.1 depicts the
ISH process. This method measures RNA expression at a very high, even sub-cellular, spatial resolution,
but each experiment is typically limited to measure expression for one gene or a small number of genes.
1.3 Using genome-wide patterns of expression to study neural processes
The work presented in this dissertation attempts to gain insight into brain and gene function from the
analysis of genomic patterns of expression in the brain. This is done considering several aspects of gene
expression: temporal, looking at patterns of expression that evolve over time in the developing brain;
spatial, considering patterns of expression over regions, cell layers and single cells; evolutionary, where
patterns arising in different species are compared; and functional, taking a closer look at gene
co-expression patterns in the healthy and diseased brain. Finally, I discuss possible mechanisms of
post-transcriptional regulation of expression via RNA editing in the brain.
1.3.1 Gene expression in brain development
The vertebrate nervous system develops from the neural tube that has a posterior part, which later
develops into the spinal cord, and an anterior part, which divides into three primary vesicles: the
prosencephalon, the mesencephalon and the rhombencephalon. The prosencephalon further develops into
two secondary vesicles: the telencephalon and the diencephalon. The most posterior vesicle, the
rhombencephalon, forms two secondary vesicles as well, the metencephalon and the myelencephalon
(Figure 1.2A).
Figure 1.1: Measuring gene expression in tissue using in situ hybridization. Complementary DNA (cDNA) of the target gene’s mRNA is fluorescently labeled and placed onto the tissue of interest, which is thinly sliced. The cDNA is hybridized to the target mRNA and the tissue is imaged. The cells containing the mRNA of the target gene are fluorescently labeled. Figure created by Lior Kirsch.
Brain regions are formed and wired through a series of partially overlapping cellular processes. The first
phase involves localized proliferation of neural precursor cells. These cells further divide to become
neurons and glia cells. The partially differentiated neural cells migrate from their site of origin to their final
locations in the nervous system (Ward et al. 2003). During migration, the neurons develop growth cones
that extend into axons and dendrites, forming synapses with other neurons (O’Connor and Tessier-Lavigne
1999). The neural cells further differentiate to form mature, specialized neurons and glia, and aggregate
into identifiable structures. Further steps include selective death of neurons, elimination of some
neuronal connections and stabilization of others (Buss and Oppenheim 2004). These modifications
continue throughout the organism’s lifetime, as a result of its experiences and its external and internal
environment. The major steps in human brain development are shown in Figure 1.2B, and a parallel
timeline of development for rat and human brains is shown in Figure 1.2C.
Figure 1.2: Mammalian brain development. (A) Division of the neural plate to the five embryonic vesicles. (B) Major events in
human brain development from conception to adulthood. Figure taken from (Tau and Peterson 2010) (C) the brains of rat and
human embryos at several matched stages of development (Bayer et al. 1993).
The creation of more refined and distinct regions in function and cytoarchitecture is called brain
regionalization. This process is usually carried out by genes whose localized expression in the brain signals
and induces the process of neural patterning: creating these functionally distinct regions in the brain. To
develop a properly functioning brain, the genes involved in development need to be expressed at precise
times and in particular locations. Characterizing the spatiotemporal expression patterns of those genes is
therefore of great importance in the field of developmental neuroscience.
Over the years, many development-related genes that have unique expression patterns have been
identified. For example, the spatial pattern of the Sonic Hedgehog gene (Shh) was found to be responsible
for generating different cell types in the neural tube, during early stages of vertebrate development, and
also responsible for generating the boundaries between the prethalamus and thalamus, and the pallium
and subpallium via regulatory interactions with other signaling molecules (Cavodeassi and Houart 2012)
(Figure 1.3A). Shh acts as a signaling molecule, whose concentration gradient determines the activation
levels of different homeobox transcription factors (Hox), responsible for the differentiation of neural cells
into different neuron types (Briscoe et al. 1999). The Hox gene family was first discovered by Edward
Lewis, who also found that the spatial patterning of the family members along the anterior-posterior axis
of the fruit-fly Drosophila melanogaster is responsible for the specification of each body segment as a
unique one with a unique function (Lewis 1978). Remarkably, these properties of the Hox family have
been conserved over evolution, and they are also responsible for hindbrain patterning in the vertebrate
brain (Figure 1.3B). They include 4 classes which together control hindbrain segmentation into
rhombomeres in a combinatorial manner (McGinnis and Krumlauf 1992).
Figure 1.3: Brain regionalization and patterning is controlled by gene expression. (A) Shh expression mediates the creation of
boundaries between the prethalamus and thalamus, and the pallium and subpallium via regulatory interactions with other
signaling molecules. (B) Hox family genes combinatorially control segmentation in an evolutionarily conserved manner.
Although considerable effort and progress have been made in the past century to advance the field of
developmental neurobiology, the rules and driving forces behind a proper execution of an organism's
developmental scheme are far from being fully understood, and many genes involved in development
remain unidentified. Since it’s hypothesized that more than half of our genes play a role in brain
development (Waterston et al. 2002), a natural approach to the issue is to analyze transcriptomic patterns
of expression, which allows the identification of new gene functions and larger principles of brain organization.
Most transcriptomic studies have been conducted on mouse; gene expression levels from publicly
available repositories have been used to identify putative transcription factors involved in M. musculus
brain development (Gray et al. 2004). Large-scale gene expression studies were also used to monitor
changes in gene expression in the neocortex and cerebellum of aging mice (Lee, Weindruch, and Prolla
2000), and to characterize different brain regions by expression patterns (Zapala et al. 2005). In recent
years several atlases of gene expression over development in the human brain have been created and
made available to the scientific community, making it possible to investigate genomic patterns of expression in the
brain, and this has led to important findings. New cortical germinal zones or postmitotic neurons have
been identified as sites of dynamic expression for many genes associated with neurological or psychiatric
disorders (Miller et al. 2014), differential gene expression was found to be more pronounced before birth
in the human brain (Kang et al. 2011), embryonic transcriptome signatures have been identified in adult
patterns of expression (Hawrylycz et al. 2012) and the relationship between genome and transcriptome
was studied in the developing brain, finding that race plays a minor role in the determination of gene
expression over cortical development, even when considering pronounced genetic differences between
individuals (Colantuoni et al. 2011).
The plethora of neural transcriptomic and genomic data, with the addition of other data modalities such
as protein/gene interactions and neural/neuronal connectivity data, suggests that the field of
developmental neurogenomics will provide exciting breakthroughs in the upcoming years. Chapter
two of this dissertation is dedicated to the study of the dynamics of inter-regional changes in gene
expression over mouse brain development, with a comparison to similar patterns in human brain
development.
1.3.2 Spatial patterns of expression at multiple scales: regional to single cells
The brain is organized in multiple scales: from large regions with different functionalities through
specialized sub-regions, to single cells composing each sub-region and even smaller structures such as the
synaptic cleft or post-synaptic density, or neural subcellular compartments such as the synapse, the axon
and the dendrite. Many questions remain unanswered regarding the functionality
of many of these neural entities, and their relation to one another. For example, it is still unknown which
regions are composed of which cell types, what the transcriptomic profiles of the different types of
neurons and glia are, and what the relationship is between different types of neurons and other important
neural cell types such as astrocytes.
Measuring transcriptomic profiles of regions using current methods usually involves a mixture of the many
cell types that exist in a tissue, making it difficult to delineate the patterns into functionally distinct ones.
There has been some progress in the purification of specific cell types and their gene expression signatures
(Bryant et al. 1999; Miller et al. 2009; Thomas et al. 2012; Vincent et al. 2002; Wu et al. 2014), but this has
been limited to a small number of cell types so far, and requires considerable experimental effort.
Computational methods have also been developed to tackle this problem; for example, co-expression patterns
have been used to infer cell type compositions for different brain regions (Grange et al. 2014).
In recent years, the availability of ISH images of brains has significantly grown, making it possible to capture neural
gene expression patterns at cellular resolution. The data comes in the format of images, calling for the
implementation of image processing and analysis techniques and the development of new, specialized
tools. Chapter 3 of this dissertation discusses several methods to extract features and analyze neural ISH
images using approaches adapted from computer vision, a field that generally focuses on extracting
information from natural images. In chapter 4 these techniques are implemented to discover layer specific
expression of genes, neural co-expression patterns and potential gene-disease candidates.
1.3.3 Inferring gene function from neural co-expression patterns
Since the completion of the human genome project, and consequently the sequencing of genomes of
many other organisms, many new genes have been discovered (Venter et al. 2001). In spite of much
progress in attempts to characterize new genes using high throughput methods and computational
analyses, we still don’t know the function of most genes. While it was shown that around 80% of the genes
are differentially expressed in the brain, it seems that the scientific community focuses its effort on a very
small number of genes. In fact, according to United States National Institute of Mental Health Director
Thomas Insel, over 99% of the neuroscience literature focuses on only 1% of the estimated 15,000–16,000
genes expressed in the brain (Gewin 2005). The genes that do have some known associated function are
likely to have more, unknown roles that change in different neural regions or in different developmental
stages. Kirsch and Chechik indeed show that the majority of human genes show a transcriptomic spatial
signature that reflects the embryonic origin of neural regions, suggesting that genes responsible for brain
patterning over development assume different roles in the adult brain (Kirsch and Chechik, in
preparation).
One way to infer new function is to look for genes that are expressed similarly to genes with known
functions. The similarity in expression suggests that the genes participate in similar biological processes.
This concept is well established as the “guilt by association principle” (GBA) (Oliver 2000). GBA essentially
means that genes that share functionality are likely to share other biological properties like similar
structure or protein domains, and are more likely to physically interact or associate in other manners.
This principle has been implemented successfully to discover new gene functions in many datasets (Horan
et al. 2008; Saito, Hirai, and Yonekura-Sakakibara 2008; Lee et al. 2004). An extension to the idea of
inferring functionality via coexpression is to look at differential coexpression, which is defined as changes
in gene–gene correlation structure between two sets of samples which are phenotypically distinct (de la
Fuente 2010). This approach can be used to detect regulatory transcriptional rewiring over development
(Gillis and Pavlidis 2011) or regulatory mechanisms that differ between healthy and unhealthy tissues or
subjects (Choi et al. 2005).
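The notion of differential coexpression, a change in gene-gene correlation structure between two sets of samples, can be sketched in a few lines. This is an illustrative sketch only, assuming Pearson correlation as the coexpression measure; the function name `differential_coexpression` and the simulated data are hypothetical and are not taken from the cited studies.

```python
import numpy as np

def differential_coexpression(expr_a, expr_b):
    """Change in gene-gene Pearson correlation between two sample sets.

    expr_a, expr_b: arrays of shape (n_genes, n_samples), same gene order.
    Returns corr(B) - corr(A), an (n_genes, n_genes) matrix.
    """
    return np.corrcoef(expr_b) - np.corrcoef(expr_a)

# Simulated data (hypothetical): genes 0 and 1 are co-expressed in
# condition A but independent in condition B; gene 2 is independent in both.
rng = np.random.default_rng(0)
n_samples = 200
shared = rng.normal(size=n_samples)
expr_a = np.vstack([
    shared,
    shared + 0.1 * rng.normal(size=n_samples),
    rng.normal(size=n_samples),
])
expr_b = rng.normal(size=(3, n_samples))

delta = differential_coexpression(expr_a, expr_b)
```

In this toy case the entry `delta[0, 1]` is strongly negative, reflecting a pair of genes that lose their coexpression between the two conditions.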
In the context of genes and the brain, gene co-expression analysis has been used to predict the spatial
distribution of neural cell types in the mouse brain (Grange et al. 2014), predict pharmaceutical target
candidates for schizophrenia and Parkinson’s disease (Walker, Volkmuth, and Klingler 1999), identify
epigenetic changes in alcoholism (Ponomarev et al. 2012), elucidate disease mechanisms in autism
(Voineagu et al. 2011) and many more (Gaiteri et al. 2014).
The concept of studying gene coexpression patterns in order to learn about new gene relationships and
function is used throughout this dissertation. Chapter 3 presents a method to identify gene-gene
expression similarities based on high resolution images that display a mapping of expression in the brain,
and chapter 5 discusses the coexpression patterns of one specific gene, ADAR, that is especially
important to normal brain function.
1.3.4 Beyond expression - post-transcriptional mechanisms in the brain
After the gene is transcribed it is subjected to several regulatory mechanisms that account for a large
fraction of the difference between the transcriptome and the proteome, as previously discussed in section
1.1. The major source of variation in the proteome is derived from alternative splicing, where exons of a
gene can be combined in different ways to create different versions of proteins with different functional
domains. Another way to diversify the transcriptome, albeit a more subtle one, is by RNA editing. This
process is carried out by enzymes that bind to RNA molecules and change nucleotide sequence through a
deamination process. There are two main families of these enzymes: adenosine deaminases acting on RNA
(ADAR), which convert adenosine to inosine, translated as guanosine by the translational machinery, and
cytosine deaminases acting on RNA (CDAR), also known as APOBEC proteins, which deaminate cytosine to
create uracil.
A-I RNA editing has been shown to be especially important in neural tissues. Only a few dozen sites of
RNA editing leading to protein recoding have been identified in mammals. These sites are highly
conserved and are significantly enriched in genes that are related to neural function such as ion channels.
For example, the glutamate-activated cation channel GluR2 AMPA receptor subunit undergoes editing
that changes an amino acid, and consequently changes the channel’s permeability to calcium. This editing
event is important to neuronal viability and interference with RNA editing can cause syndromes such as
epilepsy and ALS (Kawahara and Kwak 2005). Another example of functional RNA editing of an important
neuromodulator is the editing of the serotonin receptor 5-HT2CR. Aberrant patterns of editing for this gene
have been linked to several neuropsychiatric conditions such as depression and schizophrenia, and also
metabolic diseases such as obesity and diabetes (Nishikura 2010).
In recent years, there have been several attempts to identify RNA editing sites in genomes of different
species. The identification of an abundance of RNA editing sites in primate DNA has led to one particularly
captivating theory, where the large amount of editing is suggested as the main driving force in brain
evolution (Li and Church 2013). While most human genes have been shown to undergo editing, the
functionality of this process is still not clear. In some cases, RNA editing of genes has been shown to be a
part of their regulatory mechanisms via gene silencing or nuclear retention (Zhang and Carmichael 2001;
Nishikura 2010).
In chapter 5 of this dissertation we explore the coexpression structures of the A-I RNA editing enzymes
ADAR and ADARB1, and their putative editing targets in neural tissues, in an attempt to find out if there
is indeed a large scale regulatory mechanism that takes place in the human brain.
1.4 Dissertation outline
This thesis is organized as follows. Chapter 2 presents an investigation of the dynamics of inter-regional
dissimilarities in gene expression profiles in different mouse brain regions, using a coarse quantification
of regional gene expression from a genomic collection of ISH images. Chapter 3 describes methods to
exploit the abundance of spatial information that exists in the ISH images by employing computer vision
methods to represent the images and use them to extract biological information, and in chapter 4 these
methods are implemented and used to answer diverse biological questions. In chapter 5 we look beyond
gene expression, and examine patterns of RNA editing in the adult and developing human brain.
Chapter 2: Specialization of neural expression during mouse development
2.1 Introduction
The development of the nervous system is a highly complex process, involving the coordinated expression
of thousands of genes (Waterston et al. 2002; Colantuoni et al. 2011; Kang et al. 2011). Classical models
of development describe a process of brain regionalization, which transforms the neural plate through
several phases into increasingly refined regions (Krauss et al. 1991; Martínez 2001). In the adult, functional
compartments of the brain have been shown to exhibit unique transcriptome signatures (Sandberg 2000;
Datson et al. 2001), suggesting that the process of brain regionalization may be accompanied by a similar
trend in the transcriptome, where expression profiles become more region-specific as the brain develops.
Regional profiles of gene expression in the brain have been studied extensively. These profiles were used
to define new brain delineations based on gene expression (Bohland et al. 2010), conduct comparisons
between brains of different species (Khaitovich et al. 2004), predict neural connectivity (French and
Pavlidis 2011; Wolf et al. 2011), capture functional similarities between brain regions (Hawrylycz et al.
2012) and shed light on many aspects of human brain development (Kang et al. 2011; Colantuoni et al.
2011).
In this chapter, we look at changes in regional expression patterns in the mouse brain, aiming to study the
specific timing of functional specialization. We study expression across 36 developmental neural regions
which cover the complete mouse brain at several time points spanning embryonic and post-natal mouse
development, and also 41 adult brain regions. Expression was measured for thousands of genes, allowing
a large-scale, genomic approach to the study of brain regionalization. We also conduct an inter-species
comparison between expression patterns in mouse and human brain development.
This chapter studies three aspects of spatio-temporal transcriptome patterns: which biological processes
become spatially specialized, at what time points during development, and in which brain regions. We
first trace how expression regionalization changes during brain development. We then identify neural
processes that contribute to the regionalization at various developmental phases. Then, we identify the
brain regions which become largely dissimilar from other regions, and the genes that contribute to this
dissimilarity. Finally, we compare the specialization patterns we find in mouse with corresponding
patterns measured in human.
2.2 Results
To study gene expression specialization during development, we analyze expression primarily based on
ISH expression values obtained from the Allen Developing Mouse Brain Atlas (devABA) (Henry and
Hohmann 2012). In this data, mRNA transcript levels were measured for 2002 genes of special interest in
brain development at 7 developmental time-points spanning embryonic (E11.5, E13.5, E15.5, E18.5) and
post-natal phases (P4, P14, P28). We added another time point, P56, using expression measurements for
the same set of genes from the Allen Adult Mouse Brain Atlas (Lein et al. 2007) (Figure 2.1A). The genes
in the dataset, comprising around 10% of the mouse genome, were selected to include transcription
factors, neurotransmitters, neuroanatomical markers, genes important in brain development and genes
of general interest in neuroscience (see Section 2.3.1). We used per-region data that was quantified from
ISH images by combining all pixels with the same regional label, based on a mapping of each image to a
reference atlas made available by the Allen institute (http://www.brain-map.org). We analyze data from
36 anatomically-delineated regions of the developing brain and 41 regions of the adult brain. These
regions encompass the entire brain (see Section 2.3.2). The data and pre-processing are described in more
detail in Section 2.3.
Figure 2.1: Mouse brain developmental timeline. ISH for each gene was performed at eight time points during development.
Shown here are mid-sagittal slices for the gene Hmgn2, taken with permission from Allen Institute for Brain Science. Allen Mouse
Brain Atlas [Internet] Available from: http://mouse.brain-map.org/ (Lein et al. 2007)
2.2.1 Changes in expression regionalization during development
Aiming to understand how the transcriptome becomes specialized across different brain regions, we first
quantify the differences between expression profiles of brain regions, and examine how these differences
change during development.
We quantify the differences between brain regions in terms of the correlation between their gene
expression profiles. Specifically, for every pair of regions R1, R2, we represent each region as a vector of
expression levels, calculate their Pearson Correlation Coefficient (PCC) and compute 1 - PCC as the
dissimilarity between the regions. Figure 2.2 depicts the mean dissimilarity for each time point across all
pairs of brain regions. The dissimilarity varies significantly between ages (p-value < 10^-16, ANOVA), and its
overall profile follows an 'hourglass' shape. During early development, the dissimilarity is actually reduced,
reaching its lowest value around birth (in E18.5 and P4), although one would expect that the process of
region specialization would lead to an increase in dissimilarity in early embryonic development. After
birth, the dissimilarity rises again. The variance of inter-region dissimilarity follows the changes in the
mean dissimilarity and decreases around birth as well. Interestingly, similar hourglass shapes were also
observed in the profiles of transcriptome variability across species during early development, providing
striking molecular evidence for the 'phylotypic stage' hypothesis (Kalinka et al. 2010a; Domazet-Lošo and
Tautz 2010a). The reduction in expression specialization across brain regions suggests a neurotypic phase
around birth in which all brain regions tend to have a more similar transcriptome.
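The dissimilarity measure described above can be sketched as follows. This is a minimal illustration in Python with NumPy and not the code used in this work: the function name `mean_pairwise_dissimilarity`, the random toy data, and the use of the time-point index as the x-axis for the second-order polynomial fit are assumptions made for illustration.

```python
import numpy as np

def mean_pairwise_dissimilarity(expr):
    """Mean of 1 - PCC over all pairs of regions.

    expr: array of shape (n_regions, n_genes), one expression vector per region.
    """
    pcc = np.corrcoef(expr)                  # region-by-region Pearson correlations
    upper = np.triu_indices_from(pcc, k=1)   # each unordered pair of regions once
    return float(np.mean(1.0 - pcc[upper]))

# Toy data (hypothetical): 8 time points, 36 regions, 200 genes.
rng = np.random.default_rng(0)
ages = ["E11.5", "E13.5", "E15.5", "E18.5", "P4", "P14", "P28", "P56"]
data = {age: rng.normal(size=(36, 200)) for age in ages}

curve = np.array([mean_pairwise_dissimilarity(data[age]) for age in ages])

# Least-squares second-order polynomial through the curve; using the
# time-point index as x is an assumption of this sketch.
x = np.arange(len(ages))
coeffs = np.polyfit(x, curve, deg=2)
fitted = np.polyval(coeffs, x)
```

With real data, an hourglass profile would show up as a positive leading coefficient in `coeffs`, with the minimum of the fitted curve falling around the time points nearest birth.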
Figure 2.2: Mean pair-wise dissimilarities between the regions. The curve is a second-order polynomial which minimizes the squared error of the fit to the data. Error bars denote data within 1.5 times the inter-quartile range, and the boxes show the lower and upper quartiles together with the median.
To test if the overall hourglass shape is a broad effect or strongly depends on a small set of genes, we
measured the dissimilarity using 100 random subsets of sizes K = 1000, 500, 200 and 100 genes. We find
that the hourglass shape is largely insensitive to the subset of genes analyzed (Figure 2.3). To further
ensure that the hourglass effect is not driven by a small number of highly variable genes, we measured
the dissimilarity again, this time after removing the genes with the largest inter-region variability for each
time point. At each time point, we measured the standard deviation across regions for every gene, and
removed the top k genes with the highest standard deviation values (k = 50, 100, 200, 500). The hourglass
shape was robust even when removing the 500 most variable genes (25% of the dataset, Figure 2.4).
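The two gene-level robustness checks can be sketched for a single time point as follows. This is an illustrative sketch under stated assumptions and not the original analysis code: the helper names `mean_dissimilarity`, `random_subset_dissimilarity` and `drop_most_variable` are hypothetical, and the expression matrix is random toy data with the same dimensions as the dataset (36 regions, 2002 genes).

```python
import numpy as np

def mean_dissimilarity(expr):
    """Mean of 1 - PCC over all region pairs; expr is (n_regions, n_genes)."""
    pcc = np.corrcoef(expr)
    upper = np.triu_indices_from(pcc, k=1)
    return float(np.mean(1.0 - pcc[upper]))

def random_subset_dissimilarity(expr, k, n_repeats, rng):
    """Dissimilarity recomputed on n_repeats random subsets of k genes."""
    n_genes = expr.shape[1]
    values = np.empty(n_repeats)
    for i in range(n_repeats):
        genes = rng.choice(n_genes, size=k, replace=False)
        values[i] = mean_dissimilarity(expr[:, genes])
    return values

def drop_most_variable(expr, k):
    """Remove the k genes with the highest standard deviation across regions."""
    sd = expr.std(axis=0)
    keep = np.sort(np.argsort(sd)[: expr.shape[1] - k])
    return expr[:, keep]

# Toy expression matrix for one time point (hypothetical values).
rng = np.random.default_rng(0)
expr = rng.normal(size=(36, 2002))

subset_values = random_subset_dissimilarity(expr, k=100, n_repeats=20, rng=rng)
trimmed = drop_most_variable(expr, k=500)
trimmed_value = mean_dissimilarity(trimmed)
```

Repeating both computations at every time point and comparing the resulting curves with the full-data curve gives the kind of comparison summarized in Figures 2.3 and 2.4.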
Figure 2.3: Robustness of the hourglass shape to gene sampling. Dissimilarity curves were computed from random samples of genes of size (A) 1000, (B) 500, (C) 200 and (D) 100. The shape is robust and is largely retained even when using 100 genes, 5% of the full dataset.
We also tested the sensitivity of the hourglass shape to the selection of regions by computing the
dissimilarity repeatedly, each time with one region being excluded from the analysis ("leave one region
out", Figure 2.5A). To test how the delineation of the brain into regions may affect the results, we used
the hierarchical structure of the anatomical regions to select six sets of regions at increasing sizes (see
Section 2.3.2). Figure 2.5B depicts the dissimilarity profiles for each of the six sets, as computed at various
resolutions, from 488 developing and 631 adult small brain regions at the most refined level, to 48
developing and 13 adult brain regions at the coarsest level. The hourglass shape of the dissimilarity profile
is largely preserved in all delineations. Together, these results demonstrate that the hourglass shape is
robust throughout the dataset and is not constrained to specific genes or brain regions.
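The "leave one region out" test can be sketched as follows for a single time point. This is a minimal sketch, assuming the same 1 - PCC dissimilarity as above; the function names are hypothetical, and the test across levels of the anatomical hierarchy is not covered because it depends on the atlas ontology described in Section 2.3.2.

```python
import numpy as np

def mean_dissimilarity(expr):
    """Mean of 1 - PCC over all region pairs; expr is (n_regions, n_genes)."""
    pcc = np.corrcoef(expr)
    upper = np.triu_indices_from(pcc, k=1)
    return float(np.mean(1.0 - pcc[upper]))

def leave_one_region_out(expr):
    """Mean dissimilarity recomputed with each region excluded in turn."""
    n_regions = expr.shape[0]
    return np.array([
        mean_dissimilarity(np.delete(expr, r, axis=0))
        for r in range(n_regions)
    ])

# Tiny worked example (hypothetical values): regions 0 and 1 are identical,
# region 2 is their mirror image.
toy = np.array([[1.0, 2.0, 3.0],
                [1.0, 2.0, 3.0],
                [3.0, 2.0, 1.0]])
loo = leave_one_region_out(toy)
```

In the toy example, excluding the outlying region 2 drops the dissimilarity to zero, whereas excluding either of the other regions leaves it at its maximum. A dataset in which no single region has this kind of influence yields leave-one-out curves that stay close to the full curve.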
Figure 2.4: The hourglass shape is robust when removing highly variable genes. The inter-region distance curve was calculated for the data after withholding the top k most variable genes at each time point. Error bars represent standard error between brain regions.
Figure 2.5: The hourglass shape is robust throughout the brain. (A) The inter-region distance curve was calculated for the data
withholding one region at a time. The blue curve is the mean across brain regions; error bars represent standard deviations from
the mean. (B) The dissimilarity curve using sets of regions taken from different levels of the reference atlas regional ontology tree,
starting from the leaf regions (level 1).
2.2.2 Functional characteristics of early and post-natal regionalization
Which biological processes could underlie the pattern of inter-region dissimilarity? In principle, the
hourglass shape could stem from functions or genes whose individual expression profiles follow the
hourglass shape. Alternatively, the shape could be the result of a mix of several biological processes, some
contributing to the decreasing phase of the hourglass and some contributing to the increasing phase. To
test these alternatives, we created a temporal profile for each gene that quantifies its contribution to the
hourglass shape at developmental time points (E11.5 - P28) (see Section 2.3.3). We then used the k-Means
clustering algorithm (Bishop 2006) to group the profiles into distinct clusters of genes that have congruent
developmental dissimilarity patterns, and searched for functional enrichment in these clusters using Gene
Ontology (GO) categories (see Section 2.3.3).
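The clustering step can be sketched as follows. This is an illustrative sketch only: the per-gene contribution profiles are defined in Section 2.3.3, so the profiles below are simulated stand-ins (one decreasing and one increasing family over seven time points), and the number of clusters, the farthest-first initialization and all function names are assumptions of the sketch rather than the settings used in this work.

```python
import numpy as np

def farthest_first_init(x, k):
    """Deterministic initialization: start at x[0], then repeatedly add the
    point farthest from all centers chosen so far."""
    centers = [x[0]]
    for _ in range(1, k):
        dist = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[dist.argmax()])
    return np.array(centers)

def kmeans(x, k, n_iter=100):
    """Lloyd's k-means on the rows of x; returns (labels, centers)."""
    centers = farthest_first_init(x, k)
    for _ in range(n_iter):
        # Squared Euclidean distance of every profile to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Simulated profiles (hypothetical): 7 developmental time points, E11.5 to P28.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 7)
early = 1.0 - t + 0.05 * rng.normal(size=(30, 7))   # high contribution early
late = t + 0.05 * rng.normal(size=(30, 7))          # high contribution late
profiles = np.vstack([early, late])

labels, centers = kmeans(profiles, k=2)
```

Each resulting cluster of genes would then be tested for enrichment in GO categories, as described in Section 2.3.3.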
We found two main families of clusters that were functionally enriched (pFDR, q-value < 0.01), each family
accounting for a different phase of the hourglass shape, and depicted in Figure 2.6. Genes from the first
family contributed largely to the dissimilarity during early embryonic development and are related to
nervous-system development categories, such as neuron differentiation, axonogenesis and forebrain
development (an example is shown in Figure 2.6A). At the same time, genes from the second family have
a high contribution to dissimilarity in late post-natal developmental time points (P14 and P28) and tend
to be related to experience dependent plasticity, with enriched categories such as regulation of synaptic
transmission, behavior, learning and memory (Figure 2.6B).
To quantify the relative contribution of GO categories to early embryonic and late postnatal dissimilarity,
we computed a category contribution index (see Section 2.3.3). The top contributing categories at E11.5
are related to nervous system construction, including positive regulation of neuroblast proliferation and
axonogenesis (Table 2.1). The top scoring categories at P28 are related to the utilization of the nervous
system, including regulation of neurotransmitter secretion and visual perception. An exception to this rule
is the category hindbrain development, ranked at #10 at P28, which is in agreement with the postnatal
timeline of hindbrain development (Moens and Prince 2002).
| GO category | Contribution at E11.5 | GO category | Contribution at P28 |
|---|---|---|---|
| positive regulation of neuroblast proliferation | 0.0022 | neurotransmitter metabolic process | 0.0011 |
| retinal ganglion cell axon guidance | 0.002 | regulation of neurotransmitter secretion | 0.00039 |
| CNS projection neuron axonogenesis | 0.0018 | sensory perception of sound | 0.00032 |
| central nervous system neuron development | 0.0016 | regulation of neurotransmitter levels | 0.0003 |
| midbrain development | 0.0015 | sensory percept. of mechanical stimulus | 0.00028 |
| central nervous system neuron axonogenesis | 0.0015 | synaptic transmission, dopaminergic | 0.00027 |
| hindbrain development | 0.0012 | visual perception | 0.00027 |
| neural tube development | 0.0011 | regulation of long-term synaptic plasticity | 0.00027 |
| motor axon guidance | 0.0011 | sensory perception of light stimulus | 0.00024 |
| negative regulation of glial cell differentiation | 0.0011 | hindbrain development | 0.00024 |
Table 2.1. Mean contribution values of GO categories at E11.5 and P28.
The observed expression dissimilarity means that each of these neural processes contains a mixture of
genes with different spatial expression patterns. Such spatial differences could result from specialization
at the level of gene families: the same process may be carried out in different brain regions using different
members of a common gene family. This is the case, for example, with homeobox genes, which are well known to
operate as pattern specifiers in the brain (Puelles and Rubenstein 1993; Vollmer and Clerc 2002).
To search for spatial specialization within gene families of interest, we collected pairs of genes from the
17 enriched GO categories discussed above. We computed both their spatial correlation at developmental
ages with peak dissimilarity (E11.5 and P28), and their sequence similarity (see Section 2.3.4). Results for
an example category 'neuron fate commitment' are presented in Figure 2.7.
The spatial specialization of genes belonging to the same family could explain apparent
inconsistencies in reports of how these genes cooperate, once their different spatial patterns are considered.
One interesting example is the pair of paralogs Neurog1 and Neurog2, where there are mixed reports
suggesting that they sometimes operate in a synergistic way (Ma et al. 1999) and sometimes in a
redundant way (Takano-Maruyama, Chen, and Gaufo 2011). These genes are bHLH transcription factors
involved in neuronal differentiation determination and subtype specification during embryogenesis
(Zirlinger et al. 2002). Figure 2.6C shows that they display a complementary pattern of expression at E11.5
(ρ = -0.59, Pearson correlation): Neurog2 is prominently expressed in areas derived from the forebrain,
whereas Neurog1 is expressed more strongly in hindbrain areas. Their different spatial distributions could
explain why they were found to be redundant in some conditions (for example, in tissues where both are
expressed) but not in others.
Figure 2.6. Functional characterization of hourglass shape. (A), (B) Clusters of gene profiles that are functionally enriched. Each
profile is a measure of contribution to dissimilarity D (see Methods). Black bold curve: the mean of the cluster; blue lines: all
the genes in the cluster; red lines: genes that are in the cluster and in the category; grey lines: genes that are not in the cluster
even though they belong to the category. (A) Neuron migration shows decreasing dissimilarity. (B) Learning or memory shows a
post-natal increase in dissimilarity. (C) Spatial expression of the genes Neurog1 and Neurog2 at E11.5 in 11 coarse regions, selected as
neuron differentiation genes with highly similar sequence.
Figure 2.7. Sequence similarity vs. spatial correlation of gene pairs belonging to the GO category 'neuron fate commitment'.
Pairs of genes with sequence similarity > 0 and spatial correlation > 0.2 are marked in yellow. Pairs of genes with sequence
similarity > 0 and spatial correlation < -0.2 are marked in red.
Similar neural processes were found using a complementary analysis where we first created a dissimilarity
curve for each GO category and then correlated them to reference curves which represent the embryonic
part of the hourglass and the post-natal one (Figure 2.8; see Section 2.3.5 for full analysis details). It
therefore appears that early inter-structure specialization is dominated by genes related to the
construction of the nervous system, while late variability is dominated by genes related to its operation.
Surprisingly, these results suggest that region dissimilarity in brain construction processes strongly
decreases at the same developmental phases where brain regions are actually known to become
anatomically segregated and specialized.
2.2.3 Expression conservation across regions and their embryonic origins
To further understand how changes in dissimilarity relate to the process of regionalization throughout
development, we next asked which brain regions contribute to the overall dissimilarity.
Brain regions develop from three embryonic vesicles: the prosencephalon (forebrain), mesencephalon
(midbrain) and rhombencephalon (hindbrain). Zapala et al. (2005) showed that, in the adult mouse brain,
regions sharing an embryonic precursor also tend to share similar expression profiles. Here we
further examine the dynamics of this relation, testing how the embryonic origins of brain regions influence
the changes in their dissimilarity.
Specifically, we first visualize the changes in region dissimilarity over time. All regions were embedded in
a two dimensional space, while preserving the pair-wise dissimilarity of their expression profiles (using
non-metric multidimensional scaling (Bishop 2006), see Section 2.3.6). The embeddings for each time
point are shown in Figure 2.9, revealing how the hourglass shape manifests itself across individual regions.
In accordance with the hourglass shape, brain regions tend to be less dispersed in the two time points
that surround birth (Figure 2.9, E18.5, P4). To visualize the relation between expression profiles and the
embryonic origin of each region, we colored the regions in Figure 2.9 by their embryonic vesicle of origin.
Indeed, regions sharing the same origin tend to be clustered together throughout development. This
relation was also statistically significant (ρ = 0.33, p < 0.05, mean over all time points of Pearson correlation
between the dissimilarity and embryonic tree distance).
Figure 2.8: Dissimilarity computed with the three categories showing the largest correlation with each reference curve. The three highly correlated "embryonic" categories are related to construction of the nervous system: axonogenesis, neuron projection morphogenesis and cerebral cortex development. In contrast, the categories that are highly correlated to the post-natal dissimilarity curve are related to nervous system function: regulation of sensory perception of pain, regulation of neuronal synaptic plasticity and negative regulation of transmission of nerve impulse. These findings suggest that early dissimilarity is dominated by genes involved in axonogenesis and late dissimilarity by synaptic transmission and neural activity.
The regions that diverge most in the post-natal developmental time points are the isthmus and rhombomere
1, the two regions that give rise to the cerebellum (Figure 2.9, black arrows). In the adult time point, the
cerebellar cortex is, notably, the most unique region in the brain in terms of gene expression. These results
are largely consistent with previous analyses of cerebellar gene expression (Zapala et al. 2005;
Lein et al. 2007). The post-natal shift in cerebellar gene expression also agrees with the functional
role of the cerebellum, a motor coordination center that relies on sensory input
that becomes available only after birth. Cerebellar development is also known to take place largely
after birth (Wang and Zoghbi 2001).
The same effect can be seen when performing a hierarchical clustering analysis on 11 large brain regions.
The precursor region to the cerebellum, the pre-pontine hindbrain (PPH), is clustered with other hindbrain
regions throughout embryonic development. Immediately after birth the PPH detaches from the
hindbrain cluster, becoming the most specialized region in the brain (Figure 2.10).
We next turned to identify the specific genes contributing to the post-natal shift in cerebellar gene
expression. We defined the contribution of each gene g to the cerebellar dissimilarity, as the difference
between the total cerebellar dissimilarity with and without g (see Section 2.3.7), and listed the top twenty
genes that contribute most to cerebellar distance at each of the three post-natal developmental time
points. Overall, 78% (32/41 unique genes) of the top contributing genes are known to be related to the
cerebellum, including genes that play an important role in cerebellar development or function like
Neurod1, Pvalb, Zic1 and Zic5. The remaining top genes (8/41) have not been previously linked to the
cerebellum, even though some of them ranked very high in our contribution lists. Examples include
heterogeneous nuclear ribonucleoprotein A/B (Hnrpab), which is ranked 8 at P4, and
microfibrillar-associated protein 4 (Mfap4), which is ranked 20 at P4 and 13 at P14. Hnrpab is a DNA and RNA binding
protein, and is suggested to be involved in cytostatic activity (Taga et al. 2010). Mfap4 is thought to be an
extracellular matrix protein involved in cell adhesion or intercellular interactions, and little
other information is associated with it. Both of these genes make interesting targets for further investigation as
potentially important to cerebellar specialization.
Figure 2.9: Changes in dissimilarity across individual brain regions. Embedding of all regions onto a 2D plane using
multidimensional scaling. Each circle corresponds to a brain region, with a size that corresponds to the within-region expression
standard deviation and a color that corresponds to its embryonic origins. Red: forebrain, telencephalon; pink: forebrain,
diencephalon; cyan: midbrain; blue: hindbrain. Rhombomere 1 and isthmus at the post-natal developmental time points, and the
cerebellar cortex and cerebellar nuclei at P56, are marked with black arrows.
Figure 2.10: Hierarchical clustering of 11 large brain regions over development. Dendrogram of 11 brain regions, created by
their gene expression profile at all time-points. Region abbreviations: rostral secondary prosencephalon (RSP); telencephalic
vesicle (Tel); peduncular hypothalamus (PedHy); prosomere 3 (p3); prosomere 2 (p2); prosomere 1 (p1); midbrain (M);
prepontine hindbrain (PPH); pontine hindbrain (PH); pontomedullary hindbrain (PMH); medullary hindbrain (MH). Hindbrain
structures are colored in red. The PPH (cerebellar precursor) shifts dramatically after birth.
2.2.4 Comparison with human development
The above findings show how the specificity of the regional expression profiles in the brain changes during
development. How do these findings generalize to other mammals? A recent study provides a good
opportunity to test these findings in humans (Kang et al. 2011). Kang and colleagues used DNA microarrays
to measure the transcriptome of 57 human subjects in 11 cortical regions, the mediodorsal
nucleus of the thalamus, the striatum, the amygdala, the hippocampus and the cerebellar cortex.
We first aimed to assess if the gene expression levels in mouse and human can be compared. We
considered the human genes that are orthologous to the 2002 mouse genes and computed the Spearman
correlation of the gene expression profiles of every pair of time points, averaged over brain regions (see
Section 2.3.8). Figure 2.11 depicts the cross-correlation between the human and the mouse
developmental timelines. The correlation between the expression profiles of the two species is high,
and it peaks along the translation between the mouse and human brain development timelines proposed
by Clancy and Darlington (2001). This means that the expression profiles of the two species are highly
correlated at corresponding ages, with the strongest correlation at post-natal time points, and that the mouse
and human neurodevelopmental datasets can be directly compared to each other, even though they
were measured using different methods (ISH and microarrays) and in different brain regions.
We next turned to compare expression in specific regions of the mouse and human brains, focusing on
four mouse brain regions which have parallel regions in the human data (see Section 2.3.8). The human
cortical areas were averaged and compared to the mouse dorsal pallium, the human mediodorsal nucleus
of the thalamus was compared to the mouse thalamus, the human cerebellar cortex was compared to
two mouse regions which were averaged (rhombomere 1 and isthmus), and the human and mouse striatum
were compared as well. For each pair of parallel regions, we first looked at the overall temporal
correspondence of the mouse and human development timelines by computing the correlation between
expression levels of the two species during development. We computed the cross-species correlation as
described above for the four pairs of human regions and their parallel mouse regions, finding high
correlation values for all region pairs (Figure 2.12).
Figure 2.11: Comparison with human data. (A) Cross correlation between mouse and human gene expression. The black line is taken from known developmental timeline of the two species based on anchor events (Clancy and Darlington 2001a).
Figure 2.12: Cross-correlation between mouse and human expression profiles over development. Coherence between
expression profiles for orthologous genes was measured using Spearman correlation, for every pair of time points in
mouse and human. (A) Thalamus (B) Cortex (C) Striatum and (D) Cerebellum. The black line depicts the mapping
between neurodevelopmental timelines of the two species proposed by (Clancy and Darlington 2001b).
We looked at region-specific dissimilarity and traced how the dissimilarity of each of the four regions from
all other brain regions changes over development, in both mouse and human (see Section 2.3.7). The
specialization patterns in mouse and human show partial correspondence (Figure 2.13). While the
thalamus is specialized very early in human (Figure 2.13B), at 4-8 pcw, in the mouse it keeps a relatively
constant distance from the rest of the regions (Figure 2.13A). In mouse, the cortex becomes specialized right
before birth (Figure 2.13C), while in human there is a decrease in specialization over time (Figure 2.13D).
The striatum in mouse becomes specialized right before birth (Figure 2.13E), whereas in human it keeps a more or
less constant distance (Figure 2.13F). The region with the highest correspondence between mouse and
human is the cerebellum, which becomes specialized right after birth in both species (Figure 2.13G,H).
The differences between mouse and human regional specialization are striking, and the fact that the most
similar profile is that of the cerebellum is especially interesting given that the cerebellum shows the lowest
inter-species correlations for the post-natal time points (Figure 2.12).
Figure 2.13: Comparison with human data. Region-specific dissimilarity curves of four brain regions in mouse and human. (A)
mouse thalamus, (B) human mediodorsal nucleus of the thalamus, (C) mouse dorsal pallium, (D) human cortical regions, (E)
mouse striatum, (F) human striatum, (G) mouse rhombomere 1 and isthmus and (H) human cerebellar cortex. Error bars denote
standard deviation across regions.
2.3 Methods
2.3.1 Data acquisition and pre-processing
The detailed process of data acquisition is described in Lein et al. (2007). 2002 genes were chosen from
five classes: (1) Transcription factors, including homeobox, basic helix-loop-helix, forkhead, nuclear
receptor, high mobility group and POU domain genes. (2) Neuropeptides, neurotransmitters, and their
receptors, in particular genes involved in dopaminergic, serotonergic, glutamatergic and GABAergic
signaling. (3) Neuroanatomical marker genes. (4) Genes relevant to brain development including axon
guidance, receptor tyrosine kinases and their ligands. (5) Genes of general interest including common
drug targets, ion channels, cell adhesion, genes involved in neurotransmission, G-protein-coupled
receptors and genes involved in neurodevelopmental diseases. One animal was used to measure expression
for each gene.
Brain regions may change dramatically in size and shape during development, which makes it difficult to
compare gene expression in the brain across developmental stages. Here, expression density for each brain
region at each time point was measured while taking the differences in size into account. The expression density for each
brain region R is defined as the sum of expressing pixels in R divided by the total number of pixels that
intersect R (taken from: http://developingmouse.brain-map.org/docs/InformaticsDataProcessing.pdf).
Since expression measurements for different genes come from different individual brains whose 3D shapes
differ, the registration process is prone to mistakes, especially for small regions. To avoid errors that stem
from erroneous registration, we selected a set of regions that are large relative to the magnitude of the
registration perturbation.
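The density definition above can be illustrated with a minimal sketch. It assumes that the expression calls and the region annotation are available as same-shaped binary masks; the function name and the toy masks are hypothetical and are not part of the atlas pipeline, which operates on registered 3D volumes.

```python
def expression_density(expressing, region):
    """Fraction of the pixels inside a region that are flagged as expressing.

    `expressing` and `region` are same-shaped 2D lists of 0/1 values
    (hypothetical inputs used only for illustration).
    """
    region_pixels = 0
    expressing_pixels = 0
    for expr_row, region_row in zip(expressing, region):
        for is_expressing, in_region in zip(expr_row, region_row):
            if in_region:
                region_pixels += 1
                if is_expressing:
                    expressing_pixels += 1
    if region_pixels == 0:
        raise ValueError("region mask is empty")
    return expressing_pixels / region_pixels


# Toy 3x4 section: the region covers the two left columns (6 pixels),
# and 3 of those pixels are expressing.
expressing = [[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]]
region = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [1, 1, 0, 0]]
density = expression_density(expressing, region)
```

Because the count of expressing pixels is normalized by the region's own pixel count, a region that grows between time points is not assigned a higher value merely because it is larger.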
2.3.2 Selecting brain region delineation
We used the hierarchical structure of the anatomical regions as defined in the reference atlas ontologies
available in the Allen Brain Atlas website to define six delineations of the brain into sets of regions. These
delineations are achieved by considering several levels of the tree in a serial manner. We started with the
set of leaf regions and then repeatedly took their "parent" regions five times, yielding six sets of regions
corresponding to six levels of the ontology tree. The most refined level has 488 developing and 631 adult
small brain regions, and the coarsest level has 48 developing and 13 adult brain regions. For some time points,
expression measurements are only available for a small number of regions, and the remaining regions were
ignored.
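The construction of the six delineations can be sketched as follows. This is a minimal illustration under assumptions: the ontology is represented as a child-to-parent dictionary, and a region without a parent (the root) is kept as it is. The region names in the toy tree and the handling of branches of unequal depth are hypothetical, since the text does not specify them.

```python
def delineations(parent, n_levels=6):
    """Return n_levels sets of regions, ordered from the leaves upward.

    `parent` maps each region name to the name of its parent region
    (the root maps to None). Level 0 is the set of leaves; every further
    level replaces each region by its parent.
    """
    has_child = {p for p in parent.values() if p is not None}
    current = {region for region in parent if region not in has_child}
    levels = [current]
    for _ in range(n_levels - 1):
        # Assumption: a region with no parent stays in the set unchanged.
        current = {parent[r] if parent[r] is not None else r for r in current}
        levels.append(current)
    return levels


# Toy three-level ontology (hypothetical, far smaller than the atlas ontology).
toy_parent = {
    "brain": None,
    "forebrain": "brain",
    "hindbrain": "brain",
    "telencephalon": "forebrain",
    "diencephalon": "forebrain",
    "prepontine hindbrain": "hindbrain",
    "pontine hindbrain": "hindbrain",
}
toy_levels = delineations(toy_parent, n_levels=3)
```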
2.3.3 Contribution of individual genes to the hourglass shape and functional analysis
To functionally characterize the hourglass shape, we calculated the contribution of each gene g to the
inter-region distance as

$$\mathrm{geneContribution}(g) = 1 - D^{-g} / D^{+g},$$

where $D^{+g}$ is the mean dissimilarity (1-PCC, with $\rho$ denoting the Pearson correlation between the
expression profiles of two regions) across all N pairs of regions,

$$D^{+g} = \frac{1}{N} \sum_{r_1, r_2} \bigl(1 - \rho(r_1, r_2)\bigr), \qquad (1)$$

and $D^{-g}$ is similarly measured, but after excluding the gene g. This was used to create a temporal
contribution profile for each gene.
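A leave-one-gene-out computation of this kind could look like the sketch below, for a single time point. It assumes that each region is described by a list of expression values, one per gene, and that the contribution is one minus the ratio of the mean dissimilarity without gene g to the mean dissimilarity with it. All names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_dissimilarity(profiles):
    """Mean of 1 - PCC over all pairs of regions.

    `profiles` maps a region name to its expression values (one per gene).
    """
    pairs = list(combinations(profiles, 2))
    return sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def gene_contribution(profiles, g):
    """1 - (dissimilarity without gene g) / (dissimilarity with all genes)."""
    with_gene = mean_dissimilarity(profiles)
    without_gene = mean_dissimilarity(
        {region: values[:g] + values[g + 1:] for region, values in profiles.items()}
    )
    return 1 - without_gene / with_gene


# Toy data: three regions, four genes. The first three genes are perfectly
# correlated across regions, so gene 3 carries all of the dissimilarity.
toy_profiles = {
    "A": [1.0, 2.0, 3.0, 10.0],
    "B": [2.0, 4.0, 6.0, 0.0],
    "C": [3.0, 6.0, 9.0, 5.0],
}
contribution = gene_contribution(toy_profiles, 3)
```

Repeating the call for every gene at every time point would give one temporal contribution profile per gene.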
To find biological processes that share similar contribution profiles, we clustered the profiles using
k-means (k = 10, 15, 20, 25, 30, 35, 40, 45, 50). The resulting clusters were tested for Gene Ontology
functional enrichment (Ashburner et al. 2000). We limited the analysis to GO categories with at least 10
associated genes in our dataset (~0.5% of the dataset) and to GO categories related to nervous system
structure and function. This was done by taking several top-level categories, such as neurological system
process (GO:0050877) and nervous system development (GO:0007399), and collecting all of their descendant
categories in the GO graph. To these we added several more biological process and cellular
component categories, together with their descendants, such as neuron projection, neuronal cell body and
synaptosome.
We tested for enrichment using a hyper-geometric test. P-values were corrected for multiple comparisons
using a double-FDR approach: First, for each clustering result, we corrected the enrichment p-values using
False Discovery Rate (FDR, (Benjamini and Hochberg 1995)). Next, to correct for the fact that the clustering
was computed for ten different values of k, we corrected the 10 p-values each category received using
FDR as well. Finally, to present the most refined categories, we screened the resulting categories using
the hierarchical structure of the GO tree, and discarded categories that had a descendant category with a
lower p-value.
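The two statistical building blocks of this procedure, the hyper-geometric enrichment test and the Benjamini-Hochberg correction, can be sketched in plain Python as below. This is an illustration under assumptions: the upper-tail form of the test and the function names are choices made here, and the way the corrected values were combined across the different values of k is not reproduced.

```python
from math import comb


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k) when a cluster of n genes is drawn from N genes,
    of which K belong to the category and k of those fall in the cluster."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: a cluster of 2 genes out of 5, both belonging to a category
# that contains 3 of the 5 genes.
p_enrichment = hypergeom_pvalue(k=2, K=3, n=2, N=5)

# First stage: correct the enrichment p-values obtained within one clustering.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

For the second stage described above, the same `benjamini_hochberg` function would be applied to the list of p-values that a single category received across the different values of k.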
To decide if a cluster represents the embryonic or the post-natal dissimilarity (or neither), we pooled all
contribution values of genes in the cluster in the embryonic time-points (E11.5, E13.5, E15.5, E18.5) and
separately pooled the ones in the post-natal developmental time points (P4, P14, P28). We then applied
a Wilcoxon signed-rank test to decide if there is a significant difference between the two samples. If there
was, we checked the direction of the difference by comparing the medians of the samples.
The contribution of each GO category C to inter-region dissimilarity was computed as the mean
contribution of all genes assigned to C:

$$\mathrm{Contribution}_{cat}(C) = \frac{1}{\mathrm{size}(C)} \sum_{g \in C} \mathrm{Contribution}_{gene}(g).$$

This index captures both large categories and small categories with highly contributing genes.
2.3.4 Identifying genes with similar sequences
To identify genes from the same gene family we computed the similarity of their protein coding sequence
as measured by the Needleman-Wunsch algorithm (Needleman and Wunsch 1970). We used BLOck
SUbstitution Matrix 50 (BLOSUM50) as the scoring matrix for the global alignment, with a gap
penalty of 8. Pairs with a score of zero or higher were considered as matches.
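The global alignment score can be illustrated with a short dynamic-programming sketch of the Needleman-Wunsch recursion with a linear gap penalty of 8. The substitution scores below are a toy placeholder (an assumption made to keep the example self-contained) and not the BLOSUM50 values; the function names are hypothetical.

```python
def needleman_wunsch_score(a, b, score, gap=8):
    """Global alignment score of sequences a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align the two residues
                prev[j] - gap,                            # a[i-1] aligned to a gap
                curr[j - 1] - gap,                        # b[j-1] aligned to a gap
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    # Placeholder for a substitution matrix: +5 for identity, -4 otherwise
    # (assumed values, not BLOSUM50).
    return 5 if x == y else -4


identical = needleman_wunsch_score("ACD", "ACD", toy_score)
one_gap = needleman_wunsch_score("ACD", "AD", toy_score)
is_match = one_gap >= 0  # the "score of zero or higher" criterion
```

With a real substitution matrix, `toy_score` would be replaced by a lookup into that matrix; the recursion itself is unchanged.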
2.3.5 Constructing reference curves for correlation analysis
To ensure that the enriched categories represent the 'embryonic' and 'post-natal' parts of the hourglass
curve, we also performed a complementary analysis. Instead of clustering profiles and searching for enrichment
as described above, we started from GO categories and computed a dissimilarity curve using all genes
in each category. We treat this curve as the dissimilarity of the category. We then computed
the Pearson correlation of the dissimilarity curve of each category with two reference curves: one
capturing the embryonic part of the hourglass curve and one capturing the post-natal part (Figure 2.8). To
construct the embryonic reference curve, we computed the full hourglass curve and then assigned the value of the
last embryonic time point, E18.5, to all of the post-natal time points. Similarly, for the post-natal reference
curve, the embryonic time points were assigned the value of the first post-natal time point, P4. Figure 2.8
shows the three categories with the highest correlation to each of the two reference curves. We computed an
empirical p-value for this correlation using a Monte-Carlo approach: we sampled 10,000 groups of genes of
the same size as the category and computed the correlation coefficients of these groups to the reference
curves.
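The construction of the two reference curves and of the empirical p-value can be sketched as follows. The sketch assumes seven time points, the first four of which are embryonic, and it assumes that the p-value is the fraction of random gene groups whose correlation is at least as high as the observed one; the text does not state the exact formula. The curve values are invented for illustration.

```python
N_EMBRYONIC = 4  # E11.5, E13.5, E15.5, E18.5; the remaining points are post-natal


def reference_curves(hourglass, n_embryonic=N_EMBRYONIC):
    """Build the embryonic and post-natal reference curves from the full curve."""
    n_postnatal = len(hourglass) - n_embryonic
    last_embryonic = hourglass[n_embryonic - 1]   # value at E18.5
    first_postnatal = hourglass[n_embryonic]      # value at P4
    embryonic_ref = hourglass[:n_embryonic] + [last_embryonic] * n_postnatal
    postnatal_ref = [first_postnatal] * n_embryonic + hourglass[n_embryonic:]
    return embryonic_ref, postnatal_ref


def empirical_pvalue(observed, null_values):
    """Fraction of null correlations that reach the observed correlation."""
    return sum(1 for value in null_values if value >= observed) / len(null_values)


# Invented hourglass-shaped curve over the seven time points.
hourglass = [0.9, 0.7, 0.5, 0.3, 0.4, 0.6, 0.8]
embryonic_ref, postnatal_ref = reference_curves(hourglass)

# Invented correlations of four random gene groups with a reference curve.
p_value = empirical_pvalue(0.9, [0.1, 0.95, 0.5, -0.2])
```

In the full analysis the list of null correlations would hold one value per sampled gene group rather than the four invented values used here.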
2.3.6 Visualizing inter-region distances
To visualize the temporal dynamics of the inter-region dissimilarity in the brain, we embedded the regions
in a two-dimensional space while preserving the pair-wise dissimilarity of their expression profiles, using
non-metric multidimensional scaling. For easier comparison of the time points, at each time point, the
location of the regions was adjusted to best match the location of the other regions using MATLAB’s
‘procrustes’.
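The alignment step can be illustrated with a simplified, pure-Python Procrustes fit for two-dimensional embeddings. This is a sketch under assumptions and not a reimplementation of the MATLAB function: it fits translation, uniform scaling and rotation only, and omits the reflection that a general Procrustes fit may also allow. The function name and the toy points are hypothetical.

```python
from math import atan2, cos, sin, sqrt


def align_2d(target, points):
    """Translate, scale and rotate `points` to best match `target`
    in the least-squares sense. Both are lists of (x, y) tuples
    describing the same regions in the same order."""
    n = len(points)
    tx = sum(p[0] for p in target) / n
    ty = sum(p[1] for p in target) / n
    px = sum(p[0] for p in points) / n
    py = sum(p[1] for p in points) / n
    t_centered = [(x - tx, y - ty) for x, y in target]
    p_centered = [(x - px, y - py) for x, y in points]

    # Optimal rotation angle for centered 2D point sets.
    a = sum(t[0] * p[0] + t[1] * p[1] for t, p in zip(t_centered, p_centered))
    b = sum(t[1] * p[0] - t[0] * p[1] for t, p in zip(t_centered, p_centered))
    theta = atan2(b, a)
    c, s = cos(theta), sin(theta)

    # Optimal uniform scale given that rotation.
    scale = sqrt(a * a + b * b) / sum(x * x + y * y for x, y in p_centered)

    return [(scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
            for x, y in p_centered]


# Toy check: a unit square, and the same square rotated by 90 degrees,
# doubled in size and shifted.
target = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
moved = [(3.0, 5.0), (3.0, 7.0), (1.0, 7.0), (1.0, 5.0)]
aligned = align_2d(target, moved)
```

Aligning the embedding of each time point to a common reference in this way removes arbitrary rotations and shifts of the embedding, so that changes between panels reflect changes in the dissimilarities themselves.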
2.3.7 Dissimilarity of one region to the rest of the brain
To quantify the time course of expression specialization, we measured the dissimilarity between a region
R and the remaining brain regions. The region-specific index for a region R is defined as the average
dissimilarity between R and all the other regions,

$$D(R) = \frac{1}{\#\mathrm{regions}} \sum_{r} \bigl(1 - \rho(R, r)\bigr),$$

divided by the mean inter-region dissimilarity of Eq. (1).
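A minimal sketch of the region-specific index is given below. It assumes that the average is taken over the regions other than R and that the normalizing term is the mean of 1 - PCC over all region pairs; the names and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def region_specific_index(profiles, region):
    """Average dissimilarity of `region` to every other region,
    divided by the mean dissimilarity over all region pairs."""
    others = [r for r in profiles if r != region]
    to_others = sum(1 - pearson(profiles[region], profiles[r]) for r in others) / len(others)
    pairs = list(combinations(profiles, 2))
    overall = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
    return to_others / overall


# Toy data: A and B have similar profiles, C is very different from both.
toy_profiles = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 2.0, 3.0, 5.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
index_c = region_specific_index(toy_profiles, "C")
index_a = region_specific_index(toy_profiles, "A")
```

An index above 1 marks a region that is more distant from the rest of the brain than a typical pair of regions is from each other, which is the sense in which a region is described as specialized.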
2.3.8 Mouse-human comparison
To compare expression in mouse and human brains, we focused on four mouse brain regions which have
parallel regions in the human data of (Kang et al. 2011). Since the mouse cortical regions have data only
for P28, we used their parent region, the dorsal pallium, to compare with the 11 human cortical areas,
averaged to create one cortical expression profile. The human mediodorsal nucleus of the thalamus was
compared to the mouse thalamus, the human cerebellar cortex was compared to two mouse regions
which were averaged: rhombomere 1 and isthmus, and the human and mouse striatum were compared
as well.
To identify human genes that are orthologous to the 2002 genes in the mouse dataset we used the
R/BioConductor package BioMart (Smedley et al. 2009).
The set of 1737 ortholog gene pairs was used to calculate the Spearman correlation between mouse and
human expression profiles, averaged over regions, for every two time points (Figure 2.11), and also for
the four parallel regions (Figure 2.12).
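The cross-species comparison can be sketched as a table of Spearman correlations, one entry per pair of mouse and human time points. The sketch assumes that each time point is represented by a list of expression values over the same ordered set of ortholog pairs; the time-point labels and values in the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cross_correlation(mouse, human):
    """Spearman correlation for every (mouse time point, human time point) pair."""
    return {(m, h): spearman(mouse[m], human[h]) for m in mouse for h in human}


# Toy data over four ortholog pairs (values invented for illustration).
mouse = {"E11.5": [1.0, 2.0, 3.0, 4.0], "P28": [4.0, 3.0, 2.0, 1.0]}
human = {"early": [10.0, 20.0, 40.0, 80.0], "late": [9.0, 7.0, 5.0, 1.0]}
table = cross_correlation(mouse, human)
```

Because only ranks enter the computation, the comparison is insensitive to the different measurement scales of ISH-derived densities and microarray intensities.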
2.4 Discussion
We characterized how the dissimilarity between transcription profiles of brain regions changes during
development of the mouse brain. Based on the process of brain regionalization we expected to observe a
monotonic increase in transcription specialization, but we actually found that brain regions exhibit
increasingly more similar expression profiles during early embryonic development, until reaching a
"neurotypical" phase around the time of birth. After birth, brain regions tend to specialize and their
expression dissimilarity increases. Functional characterization of the hourglass shape suggests that it is
derived from two separate, complementary processes: the embryonic reduction in dissimilarity is
dominated by genes responsible for constructing and shaping the brain, while the post-natal increase in
specialization largely involves processes that govern the operation of the nervous system, like neural
activity and plasticity.
When visualizing the dissimilarity between the regions (Figure 2.9), it is apparent that the cerebellum
“breaks off” from the rest of the regions after birth. The dissimilarity between the cerebellum and other
regions grows at each post-natal time point and so does its dissimilarity from other regions of the
hindbrain. This dynamic is consistent with the view that cerebellar development follows unique cues from
the junction of the midbrain and hindbrain (Sato, Joyner, and Nakamura 2004; Wingate 2001), and
therefore its transcriptome may differ from other hindbrain regions significantly (Chizhikov et al. 2006).
Indeed, the cerebellum has been shown before to be the most unique region in terms of its expression
profiles (Wang and Zoghbi 2001; Zapala et al. 2005; Lein et al. 2007; Kang et al. 2011). One explanation
for the late specialization lies in the main function of the cerebellum as a motor coordination and sensory-
motor integration center.
The above findings are in partial accordance with a recent large-scale developmental brain transcriptome
study in humans (Kang et al. 2011), where similarity between brain regions was aggregated across three
long life periods: embryonic, postnatal and adult. In both mouse and human, dissimilarity decreases before
birth. However, on average, the similarity in these periods seems to grow from post-natal development
to adulthood in human. In the mouse dataset we see the opposite effect: a robust increase in dissimilarity
during post-natal development, following birth. While the cerebellum specializes after birth in both
species, other temporal dissimilarity profiles differ between the species. Further measurements are
needed to clarify if this mismatch reflects a fundamental difference between rodent and primate
development, or if it is due to differences in the experimental technique or the specific subset of six
regions measured in humans.
Interestingly, recent studies have shown examples of whole-organism developmental gene expression
profiles that follow an hourglass shape. Kalinka et al. (2010) measured inter-species distances over
development for six species of flies and found that the distance is minimized during the presumed
'phylotypic' stage. Domazet-Lošo and Tautz (2010b) analyzed the phylotypic stage further by examining
the relative ages of genes expressed in different stages of development, and found that the genes
expressed during the phylotypic stage are more ancient, hence more stable in the face of evolutionary
changes.
The above findings suggest that expression dissimilarity decreases at the same developmental phases
where brain regions become anatomically segregated and specialized. The question remains if the
reduced dissimilarity in mRNA is accompanied by reduced dissimilarity in regional protein abundance
profiles across the brain. Alternatively, post-transcriptional regulation mechanisms may take a larger role
in preserving specialization across brain regions and explain this apparent mismatch.
ISH provides a much higher spatial resolution than the one used in this study, which can be used to
investigate specialization at a finer scale of cell layers and even cell types. This is especially important
because gene expression as measured here reflects cell densities as well as
transcript abundance. Quantifying and correcting for regional cell densities is a crucial step towards a
more accurate description of the neural transcriptome. Furthermore, the recent availability of
transcription measures from other species (NIH Blueprint Non-Human Primate (NHP) Atlas, ©2012 Allen
Institute for Brain Science, available from http://www.blueprintnhpatlas.org/; BrainSpan Atlas of
the Developing Human Brain, ©2012 Allen Institute for Brain Science, available from
http://brainspan.org/) calls for a thorough study
of the similarities and differences of development as reflected in gene expression between species to
understand the genetic blueprint underlying brain development.
Chapter 3: Methods to represent neural ISH images
3.1 Introduction
Analysis of gene expression patterns in the brain poses a unique challenge due to the highly complex
nature of brain tissues: each region is composed of many types of neurons and glial cells (Dickson 2002;
Lein et al. 2007). When measuring gene expression using popular methods such as DNA microarrays or
RNA-sequencing, expression patterns of different cells are typically mixed because of the difficulty in
extracting RNA from single cells. Cell-type-specific measurement has been achieved to some extent using
cell sorting techniques (Cahoy et al. 2008) and more recently by single-cell RNA-Seq (Wu et al. 2014). Another approach is to use high
resolution gene expression maps obtained using ISH. As discussed in section 1.2.3, in an ISH experiment,
a labeled probe is hybridized to a complementary mRNA strand in the tissue itself, which is thinly sliced,
mapping the transcripts to their original location. This method measures RNA expression at a very high,
even sub-cellular, spatial resolution, but each experiment is limited to measure expression for a small
number of genes.
The challenge of building a whole genome database of ISH for the mouse brain was met by the Allen
Institute for Brain Science, which has created a comprehensive mouse brain atlas, freely available online
at www.brain-map.org/. Examples of ISH images for two genes measured on adult brains are shown
in Figure 3.1. Expression was measured on adult brain tissues for every single gene in the mouse genome,
and for ~2000 genes along pre- and post-natal development (Thompson et al. 2014). The first version of
the atlas was completed in 2006, and was used to identify spatial expression patterns in the brain at a
genome-wide level (Lein et al. 2007), and to explore the structural organization of the brain from a
genomic perspective (Bohland et al. 2010). Since then, the atlas was extended in several ways, and now
also covers the human brain and non-human primates (Lein et al., 2007; Henry and Hohmann, 2012; Ng
et al., 2009).
This recent explosion of high-resolution expression data measured in mammalian brains calls for new
ways to analyze neural gene expression images. Most existing methods for bio-imaging analysis were
developed to handle data with very different characteristics, like Drosophila embryos (Frise et al., 2010;
Peng et al., 2007; Pruteanu-Malinici et al., 2011) or cellular imagery (Peng et al., 2010; Coelho et al.,
2010). The complex nature of the mammalian brain poses new challenges for analysis. The expression of
each gene was measured on the brain of a different individual mouse, so the images cannot be easily
aligned to each other and compared. As with every other organ in the body, there is variability between
individuals, even at the level of neural structures and layers. Examples of images obtained from different
brains are shown in Figure 3.2. These differences make naïve approaches such as a simple correlation
between images impractical. Indeed, current approaches for analyzing brain images are based on smooth
nonlinear transformations to a reference atlas (Davis and Eddy, 2009; Hawrylycz et al., 2011) but these
methods may be insensitive to fine local patterns like those emerging from the layered structure of the
cerebellum or the spatial distribution of cortical interneurons.
This chapter describes several methods to represent and analyze the large collection of neural ISH images.
Specifically, we first describe ways to represent the images using visual features extracted from them.
Then, we describe a method we developed to represent the images in a functional way – adding a layer
of semantic interpretability to the representations. In the next chapter, we will describe the
implementation of these methods to extract meaningful biological information from the images such as
identification of layer-specific gene markers, spatial co-expression profiles, functional annotations for
genes and disease-gene markers.
Figure 3.1: Gene expression ISH images for the genes (A) Mapt and (B) Gria2. Expression is shown as black dots on the tissue, marking neural cells expressing the gene of interest.
3.2 A visual representation of ISH images
In this section, we first describe methods to extract visual features from the images. While these methods
have been originally developed for natural images, they hold several properties that make them suitable
to use in the case of neural ISH imagery. We then discuss a method to create a compact representation
from the large collection of features assembled from each image, and a method to incorporate spatial
information into the representation. Lastly, we describe using the representations as feature vectors for
classifiers as a way to classify or label the images.
3.2.1 Feature extraction
There are many ways to extract features from images; here I focus on two popular methods: scale
invariant feature transform (SIFT) and local binary patterns (LBP). Using these methods, we are able to
create a robust and compact representation for ISH images at several scales.
Scale invariant feature transform
Figure 3.2: Gene expression for each gene was measured on a brain from a different individual mouse, making the images difficult to compare with naïve methods such as pixel-pixel correlation.
The Scale Invariant Feature Transform (SIFT) was introduced by David Lowe in the late nineties (Lowe
1999). It was found to be very effective for matching objects in different images and is used extensively
in applications such as video tracking, object recognition and scene understanding. Using SIFT, we can
transform an image into a collection of features, called “descriptors”. The descriptors can be extracted in
specific “key-points” in the image, or on a grid. For natural images, usually the “key-point” strategy is
chosen, and a major step in the process of image representation is to identify the key-points. Here we will
use a grid-based strategy where descriptors are calculated on a regular grid. This strategy is more
appropriate for neural ISH imagery because of the dense and more uniform-looking nature of the
information contained in the images.
Specifically, we compute the descriptors as follows: any region to be represented by a descriptor is resized
to 16*16 pixels. Then, for each pixel, the orientation and magnitude of the intensity gradient are computed.
The region is divided into 4*4 cells of 4*4 pixels each, and for each cell, a histogram of 8 gradient
orientations is computed. These 16 histograms are then concatenated to form a 128D feature vector
(16 cells * 8 orientation bins). The process of calculating a SIFT descriptor is depicted in Figure 3.3.
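The descriptor computation described above can be sketched in a few lines of Python. This is a simplified illustration only, not the VLFeat implementation used later in this work: it omits the Gaussian weighting, the interpolation between bins and the clipping step of the full SIFT algorithm, and the function name `sift_like_descriptor` and the use of clamped central differences for the gradient are our own assumptions.

```python
import math

def sift_like_descriptor(patch):
    """Simplified 128-D descriptor for a 16x16 grayscale patch (list of 16 rows)."""
    size, cell, n_bins = 16, 4, 8
    assert len(patch) == size and all(len(row) == size for row in patch)
    n_cells = size // cell
    # One 8-bin orientation histogram for each of the 4x4 cells.
    hist = [[[0.0] * n_bins for _ in range(n_cells)] for _ in range(n_cells)]
    for y in range(size):
        for x in range(size):
            # Intensity gradient by central differences, clamped at the borders
            # (an assumption; other gradient filters are possible).
            dx = patch[y][min(x + 1, size - 1)] - patch[y][max(x - 1, 0)]
            dy = patch[min(y + 1, size - 1)][x] - patch[max(y - 1, 0)][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            orientation_bin = int(angle / (2 * math.pi) * n_bins) % n_bins
            # Each pixel votes for its orientation, weighted by gradient magnitude.
            hist[y // cell][x // cell][orientation_bin] += magnitude
    # Concatenate the 16 histograms into one 128-D vector and L2-normalize it.
    descriptor = [v for row in hist for cell_hist in row for v in cell_hist]
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor
```

Note that a patch with constant intensity yields an all-zero descriptor under this sketch.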
Local binary patterns
Figure 3.3: Calculating SIFT descriptors. (A) Any region to be represented by a descriptor is resized to 16*16 pixels, and the orientation and magnitude of the intensity gradient are computed for each pixel. (B) The region is divided into 4*4 cells of 4*4 pixels each, and for each cell, a histogram of 8 gradient orientations is computed. (C) The 16 histograms are concatenated together to form a 128D feature vector. Figure adapted from: Levi Gil, (2013, August 18). A Short introduction to descriptors [blog post]
When examining similar-scaled expression patterns at different regions in the brain we find cells that are
arranged in many different textures, as shown in Figure 3.4. The different patterns can reflect varying
compositions of cell types, or different distributions of cells in the tissue. An important aspect in feature
extraction from these images is, therefore, the ability to capture these different textures. An effective
feature extraction method that captures textures well is called Local Binary Patterns (LBP). LBP was first
described in the early nineties (Ojala, Pietikainen, and Maenpaa 2002).
The LBP feature vector, in its simplest form, is created in the following manner. The image is divided into
rectangular sections. Then the intensity value of each pixel in the section is compared to the intensity
values of its eight neighboring pixels. The comparisons are binarized: where the center pixel's value is
greater than the neighbor's value, a value of 1 is assigned; otherwise, the value 0 is assigned (see
Figure 3.5). This gives an 8-digit binary number for each pixel, with $2^8 = 256$ possible values. For each
section, we calculate a histogram that counts the frequency of each binary number, and concatenate the
histograms of the different sections to create the full image representation.
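As a minimal sketch of the procedure above, the following Python function computes the 256-bin LBP histogram of a single section. The function name `lbp_histogram`, the clockwise ordering of the neighbors (which determines the bit positions) and the choice to skip border pixels are assumptions made for illustration; the binarization follows the convention described in the text (1 when the center is greater than the neighbor).

```python
def lbp_histogram(section):
    """256-bin histogram of 8-neighbor LBP codes for one section (list of rows)."""
    # Neighbor offsets, clockwise from the top-left; the order fixes the bit positions.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(section), len(section[0])
    histogram = [0] * 256
    # Border pixels lack a full neighborhood and are skipped in this sketch.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            center = section[y][x]
            code = 0
            for dy, dx in offsets:
                bit = 1 if center > section[y + dy][x + dx] else 0
                code = (code << 1) | bit
            histogram[code] += 1
    return histogram
```

The full image representation would then be the concatenation of the histograms returned for all sections.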
Figure 3.4: various patterns of expression taken from one ISH image, at the same scale. Different regions in the brain exhibit expression signatures in many different textures.
Figure 3.5: Calculating LBP features. The image is divided into rectangular sections. The intensity value of each pixel in the section is compared to the intensity values of its eight neighboring pixels, and the comparisons are binarized: where the center pixel's value is greater than the neighbor's value, a value of 1 is assigned; otherwise, the value 0 is assigned.
3.2.2 Feature aggregation using “Bags of visual words”
After the extraction of local features from an image, one way to represent them compactly is to
create a “visual vocabulary” using a method called “bag-of-visual-words”. This method is adapted from
the representation of large corpora of text, where documents are represented as histograms of word
frequencies. Here, the image descriptors are divided into groups, for example using K-Means clustering,
and a representative descriptor from each group, such as the cluster centroid, is called a “visual word”.
The set of visual words forms a “dictionary” or “vocabulary”. We can then assign each descriptor in an
image to its closest visual word, count the frequency of each visual word in the image, and represent each
image as a histogram of visual words, called a “bag-of-visual-words”. The process is described using
natural image examples in Figure 3.6.
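The bag-of-visual-words construction can be illustrated with a small pure-Python sketch. It assumes plain Lloyd iterations for K-Means with squared L2 distance and random initialization from the data points; the function names, the fixed number of iterations and the handling of empty clusters are illustrative assumptions rather than a description of any particular library.

```python
import random

def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(descriptor, centroids):
    """Index of the visual word (centroid) closest to a descriptor."""
    return min(range(len(centroids)), key=lambda k: squared_l2(descriptor, centroids[k]))

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Lloyd's algorithm: returns k centroids, which serve as the visual words."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(d, centroids)].append(d)
        for idx, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[idx] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def bag_of_visual_words(image_descriptors, centroids):
    """Histogram counting how often each visual word occurs in one image."""
    histogram = [0] * len(centroids)
    for d in image_descriptors:
        histogram[nearest(d, centroids)] += 1
    return histogram
```

In use, `kmeans` would be run once on descriptors pooled from all images, and `bag_of_visual_words` would then be applied to the descriptors of each image separately.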
Figure 3.6. Representing images using the Bag-of-visual-words model. (A) local features such as SIFT descriptors are extracted
from the images (B) the features from all the images in the dataset are aggregated together and (C) separated into groups using
methods such as K-means clustering. A representative feature is decided for each group. This can be, for example, the centroid
of each cluster. These representative features are called “visual words” (D) All of the descriptors in each image are assigned to
one visual word, using a distance measure such as the Cosine distance. Finally, each image is represented by a histogram counting
the number of occurrences of each visual word in it.
3.2.3 Applying a spatial pyramid kernel to the images
One of the main advantages of the visual BoW method described above is the ability to find small-scale
spatial patterns regardless of their location in the brain. The flip side is that information about spatial
location is lost. One way to incorporate spatial
location information into the representation is to use a method called spatial pyramid kernels (Lazebnik,
Schmid, and Ponce 2006). Using this approach, every image is split into 4 and into 16 rectangles, and the
bag of words method is applied to the full image and to each rectangle separately (Figure 3.7). The
resulting feature vector is a concatenation of the 1+4+16 = 21 histograms. This approach has been shown
to be highly successful in machine vision tasks (Grauman and Darrell 2005; Huang 2009). The downside of
this approach is that it inflates the feature dimensionality significantly, and requires reducing the
dictionary size, creating a challenging tradeoff when computing both local and global features. An
alternative approach could be based on data-dependent segmentation of images into anatomic structures
(like the thalamus, cortex or cerebellum) followed by coding each structure separately. Such
segmentation is a topic for separate research.
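To make the dimensionality argument concrete, here is a hedged sketch of how the pyramid histograms could be assembled. It assumes that each descriptor has already been assigned a visual-word index and a normalized position in [0, 1); the function name `spatial_pyramid_histogram` and its arguments are hypothetical. For a dictionary of K visual words the output has 21*K entries, for example 21 * 500 = 10,500 for K=500.

```python
def spatial_pyramid_histogram(words, positions, n_words, levels=(1, 2, 4)):
    """Concatenate bag-of-words histograms over 1x1, 2x2 and 4x4 grids.

    words[i] is the visual-word index of descriptor i;
    positions[i] = (x, y) is its location, normalized to [0, 1).
    """
    feature = []
    for cells in levels:
        # One histogram per rectangle at this pyramid level.
        grid = [[0] * n_words for _ in range(cells * cells)]
        for word, (x, y) in zip(words, positions):
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            grid[row * cells + col][word] += 1
        for histogram in grid:
            feature.extend(histogram)
    return feature
```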
Figure 3.7: A spatial pyramid approach to extracting dense SIFT features. Features were extracted in the full image (a) and the
image divided into four parts (b) and 16 parts (c).
3.2.4 Using the representations for classification
After extracting visual features and representing the images in a compact way, we can use the new
representations as input features for unsupervised or supervised machine learning methods in order to
group them or make decisions about them (Figure 3.8). The next section describes a way to use this
approach to create a functional, semantic representation of the images, and the next chapter discusses
using the principles described above to extract meaningful information from the ISH image collection
available in the Allen Brain Atlas in order to identify gene co-expression patterns that are interpretable,
infer gene functionality from spatial expression patterns, identify disease-related genes and find layer and
cell-specific expression patterns in the brain.
Figure 3.8: Using compact ISH image representation as input for classifiers. (A) After feature extraction and image
representation using methods such as SIFT-BoW, (B) the images can then be used as input features for classifiers, where genes serve
as positive and negative examples for different biological properties.
3.3 A functional representation of ISH images
One important challenge for automatic analysis of biological images lies in providing human interpretable
analysis. Most machine vision approaches are developed for tasks in analysis of natural images, like object
recognition. In such tasks, humans can understand the scene effortlessly, and infer complex relations
between objects easily. In bio-imaging however, the goal of image analysis is often to reveal features and
structures that are hardly seen even by experts. It is therefore important that an image analysis approach
provides meaningful interpretation to any patterns or structures that it detects.
Here we develop a method to learn functional representations of expression images by using predefined
functional ontologies. This approach has two main advantages: accuracy and interpretability, and it builds
on a growing body of work in object recognition in natural images, showing how images can be
represented using the activations of a large set of detectors (Deng, Berg, and Fei-Fei 2011; Torresani,
Szummer, and Fitzgibbon 2010; Li et al. 2010b; Li et al. 2010a; Malisiewicz 2012; Malisiewicz, Gupta, and
Efros 2011). For object recognition, the detectors may include common objects, like a detector for the
presence of a chair, a mug or a door. Here we show how to adapt this idea to represent gene expression
images, by training a large set of detectors, each corresponding to a known functional category, like axon
guidance or glutamatergic receptors. Once this representation is trained, every gene is represented as a
point in a low dimensional space whose axes correspond to functionally meaningful categories.
3.3.1 Data filtering and preprocessing
We used whole-brain, expression-masked images of gene expression measured using ISH, publicly
available at the Allen Brain Atlas (www.brain-map.org). Expression was measured for the entire mouse
genome. For each gene, a different adult mouse brain was sliced into 100µm thick slices, mRNA
abundance was measured and the slice was imaged. The database holds image series for over 20K
transcripts. Most genes have one corresponding image series, containing ~25 imaged brain slices. Some
genes were imaged more than once and have several associated image series.
Choosing an image to represent each gene
In our analysis, we used the most medial slice for each image series, yielding a typical image size of 8Kx16K
pixels. 4823 out of the available 21174 images showed no expression in the brain and were ignored in
subsequent analysis, leaving 16351 images representing 15612 genes.
In order to take into fuller account the 3D structure of the brain, we repeated the full set of our
experiments while including two additional sagittal sections. The three sections used were taken from one
hemisphere, capturing the medial section and also the 30% and 50% marks on the medial-lateral axis. An
example of three such slices is shown in Figure 3.9. However, the results of the experiments using multiple
slices were inconclusive and so we report results based on the medial slice alone. One possible reason is
that the location of the non-medial slices is more variable, due to variation across
brains.
Figure 3.9: Each image series was represented with three slices, the most medial (a), and the 30% (b) and 50% (c) marks on the
medial-lateral axis.
Using expression masked images
Images in the Allen dataset are provided in two formats: the raw imagery, and images that were processed
to remove the background, yielding expression-masked images. The analysis was applied to the masked
images. This is a big advantage when examining expression patterns, as noise effects coming from
cytoarchitecture and underlying brain structures are reduced. Examples of a pair of images are given below
in Figure 3.10. Figure 3.11 shows examples of images, demonstrating the complexity of neural expression
patterns across brain regions and multiple scales. The images analyzed in our study were in grey scale but
are shown here as color-coded by expression intensity for better visualization.
Figure 3.10: Regular (A) and expression-masked (B) examples of ISH images as provided by the Allen Brain Atlas, for the gene Tuba1. While the expression-masked images are presented in color, the color images are in fact derived from gray-scale images, which we have used in this work.
Figure 3.11. The raw data. ISH image for the gene Tuba1 shown (A) at different scales and (B) in three different regions.
3.3.2 Creating the representations
We present a method to identify similarities between neural ISH images and to explain these similarities
in functional terms.
Our method consists of a visual phase - where we transform the raw pixel images into a robust visual
representation, and a semantic phase - where we transform that visual representation using a set of 2081
gene-function detectors. The output of these detectors comprises a higher-order semantic
representation of the images in a gene-functional space (Figure 3.12). Similar two-phase systems have
recently been proposed and applied successfully for tasks such as cross-domain image similarity and
object detection in natural images (Malisiewicz 2012; Malisiewicz, Gupta, and Efros 2011; Li et al. 2010b;
Deng, Berg, and Fei-Fei 2011; Li et al. 2010a; Torresani, Szummer, and Fitzgibbon 2010).
For the first, visual, phase, we represent each image as a collection of local descriptors using SIFT features
(Lowe 2004). This step aims to address the problem that ISH brain images of the same gene vary
significantly in shape and size when measured in different brains (Kirsch, Liscovitch, and Chechik 2012).
SIFT features are histograms of oriented gradients on a small grid. The resulting image-patch SIFT
descriptor is invariant to small rotation and illumination (but not to scale), making imaged-slices from
different brains more comparable. We computed SIFT descriptors of dimension 128 extracted on a dense
grid spanning the full image (Bosch et al., 2006; Bosch et al., 2007; Csurka and Dance, 2004), at four spatial
resolutions. In ISH images, different information lies in different descriptor sizes, and we wish that the
representation captures spatial patterns at the level of single cells and micro-circuitry, as well as at the
coarser level of distribution of expression across brain layers. To capture information at multiple scales,
we used the VLFeat implementation of SIFT (Vedaldi and Fulkerson 2010), where scale-invariance is not
incorporated automatically. Specifically, each image is represented as a collection of ~1M SIFT
descriptors, computed by downsampling each image by a factor of 1, 2, 4 and 8. Since the descriptors
were extracted from high resolution images which are mostly dark, many descriptors were completely
dark and were discarded.
Next, to achieve a compact non-linear representation of each image, we aggregate the descriptors from
all images for a given resolution level, and cluster them to form a dictionary of distinct “visual words” for
each resolution level (see Section 3.2.2). We used the original Lloyd optimization for k-Means with L2
distance, initializing the centroids by randomly sampling data points. The clustering procedure was
repeated multiple times (n=3), and the solution with the lowest energy was used. We tested 4 different
dictionary sizes (k=100, 200, 500, 1000), all yielding similar results (see Section 3.3.3), and report below
results for k=500 which obtained slightly higher accuracies. Next, we construct a standard “bag-of-words”
(Bosch, Zisserman, and Muñoz 2006) description of each image. As a result of this process, each image is
described by four concatenated 500-dimensional vectors counting how many times each “visual word”
appeared in it at a given resolution level. We also added a count of the number of zero descriptors per
resolution level, ending up with a 2004 dimensional vector describing each image. Using this approach,
similar spatial information from different brain regions is preserved, as opposed to using global
correlation-based approaches.
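The assembly of the final visual feature vector can be sketched as follows. This is an illustration under stated assumptions: the function name `image_feature_vector` and its arguments are hypothetical, and the position of the zero-descriptor counts within the vector (here, appended after each level's histogram) is our assumption, since only the total dimensionality of 4 * 500 + 4 = 2004 is specified above.

```python
def image_feature_vector(word_ids_per_level, zero_counts_per_level, n_words=500):
    """Concatenate one visual-word histogram and one zero-descriptor count per level.

    word_ids_per_level[r] lists the visual-word index of every non-dark
    descriptor at resolution level r; zero_counts_per_level[r] is the number
    of discarded all-zero descriptors at that level.
    """
    feature = []
    for word_ids, n_zero in zip(word_ids_per_level, zero_counts_per_level):
        histogram = [0] * n_words
        for word in word_ids:
            histogram[word] += 1
        feature.extend(histogram)
        feature.append(n_zero)  # placement of this count is an assumption
    return feature
```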
We then turn to the second, "semantic", phase, and represent each image by a set of functional
descriptors. Given a set of predefined Gene Ontology (GO) annotations of each gene, we train one
separate classifier for each known biological annotation category, using the SIFT bag-of-words
representation as an input vector (see Section 3.2.4). Specifically, here we trained a set of 2081
L2-regularized logistic regression classifiers (using LIBLINEAR (Fan et al. 2008)) corresponding to
biological-processes GO classes that have 15-500 annotated genes (see Section 3.3.3). We trained the classifiers
using two layers of 5-fold cross validation, performed as follows: The full set of 16351 gene images was
split into five non-overlapping equal sets (without controlling for the number of positives in each split),
training the classifiers on four of them and testing performance on the fifth, unseen test set of images.
This procedure was repeated five times, each time with a different set acting as the test set. All accuracy
and other results below are reported for a held-out test set that was not used during training.
To tune the logistic regression regularization hyperparameter, we used a second layer of cross validation.
We repeated the splitting procedure within each of the five training sets, splitting each of them again into
five subsets of images, using four for training and the fifth as a validation set. The regularization
hyperparameter was selected from the values {0.001, 0.01, 0.1, 1, 10, 100}. At the end of this process,
each gene is represented as a vector of "activations", each corresponding to the likelihood that the gene
belongs to one functional category, such as “forebrain development” or “regulation of fatty acid
transport”.
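The two layers of cross validation can be summarized by the following sketch of the index bookkeeping. It is not the training code itself: `fit_and_score` is a hypothetical callback standing in for training one classifier with a given regularization value and scoring it on held-out images. The sketch also assumes that, once the hyperparameter is selected, the classifier is retrained on the full outer training set, which is a common choice but is not stated explicitly above.

```python
import random

def k_fold_splits(indices, k=5, seed=0):
    """Shuffle the indices and yield (train, test) pairs for k non-overlapping folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, folds[held_out]

def nested_cross_validation(n_items, fit_and_score,
                            grid=(0.001, 0.01, 0.1, 1, 10, 100)):
    """Return one (selected value, test score) pair per outer fold.

    fit_and_score(train_idx, eval_idx, c) is a hypothetical callback that trains
    on train_idx with regularization value c and returns a score on eval_idx.
    """
    results = []
    for outer_train, outer_test in k_fold_splits(range(n_items)):
        # Inner layer: choose the hyperparameter using the outer training set only.
        def mean_validation_score(c):
            scores = [fit_and_score(train, validation, c)
                      for train, validation in k_fold_splits(outer_train, seed=1)]
            return sum(scores) / len(scores)
        best_c = max(grid, key=mean_validation_score)
        # Outer layer: retrain with the selected value, score on the unseen fold.
        results.append((best_c, fit_and_score(outer_train, outer_test, best_c)))
    return results
```

The key property of this design is that the images in each outer test fold never influence either the classifier weights or the choice of the regularization value.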
The representation described above removes important information about global location in the brain.
We therefore also tested an approach using spatial pyramids (Lazebnik, Schmid, and Ponce 2006), where
descriptor histograms are computed separately for different parts of the image (see Section 3.2.3).
Unfortunately, this approach results in feature vectors whose dimensionality was too high for the current
dataset, and yielded poor classification results. We concluded that the cost of the increase in feature
dimensionality outweighs the gain obtained by describing different brain regions separately.
3.3.3 Choosing parameters for analysis
Choosing the dictionary size
In order to choose the size of the visual word dictionary, we performed analysis with four dictionary sizes:
100, 200, 500 and 1000. Figure 3.13 shows mean test-set AUC values obtained using the different
dictionary sizes. Mean AUC across categories is insensitive to the size of the dictionary (K). To check how
stable the representations are between the different K's, we measured the Pearson correlation between
AUC values of the 2081 GO categories using the different dictionary sizes. Correlation values are very high
and are shown in Table 3.1. The lowest correlation value is 0.846, between K=100 and K=1000, and is still
highly significant ($P<10^{-100}$). The correspondence between AUC values for the 2081 GO categories
obtained using these two dictionary sizes is shown in Figure 3.14, showing indeed a high linear correspondence.
Figure 3.12. Illustration of the image processing pipeline. (A) Original image in pixel grayscale indicating level of
gene expression. (B) Local SIFT descriptors are extracted from image at 4 resolutions. (C) Descriptors from all
16351 images are clustered into 500 representative ‘visual words' for each resolution level using k-Means. (D) Each
image is represented as a histogram counting the occurrences of visual words. (E) L2-regularized logistic regression
classifiers are applied for 2081 GO categories. (F) The final 2081 dimensional image representation.
Figure 3.13: Mean test-AUC values for dictionary size K=100, 200, 500, 1000. Error bars indicate standard error of mean across
five folds in cross-validation data.
Dictionary size (K)    100      200      500      1000
100                    1        0.896    0.861    0.846
200                    0.896    1        0.896    0.883
500                    0.861    0.896    1        0.917
1000                   0.846    0.883    0.917    1
Table 3.1: Pearson's rho correlation values between AUC results for 2081 categories, compared across the 4 different dictionary
sizes. Correlations are high (the lowest is 0.846 between K=100 and K=1000).
Figure 3.14: Mean test-set AUCs for dictionary size K=100 versus K=1000. This pair of dictionary sizes is the least correlated
among all dictionary size pairs. It can be seen that even in this case, the correlation is high and indicative of a stable
representation.
Choice of GO category size
We chose GO categories with a number of annotations ranging from 15 to 500 genes. We set the lower
limit to 15 in order to provide enough positive examples for testing the classifiers across five cross-
validation partitions. The higher limit is set to 500 to preclude the resulting semantic explanations from
being very general (we use more specific categories such as "regulation of long-term neuronal synaptic
plasticity" or "glutamate receptor signaling pathway" and avoid general categories such as "transport" or
"biological regulation").
To make sure that this choice of categories did not cause a bias in the classification results, we checked
the relation between category size and test-set AUC scores. We found no significant relation between the
size of the GO category and the resulting AUC values (Figure 3.15).
Figure 3.15: Mean AUC (averaged over test-splits) for the GO categories vs. GO category size (number of genes in the category).
There's no significant relation between classification success of a category and the number of genes annotated to it.
Chapter 4: Analysis of neural ISH images
The previous chapter discusses methods to extract features from neural ISH images and represent them
using both visual and semantic features. In this chapter, I will present several ways to use these
representations in order to extract biological information from them. Specifically, Section 4.1 discusses
using the functional representations presented in Section 3.3 to identify gene-gene co-expression
patterns that can be explained in meaningful, semantic terms. Section 4.2 demonstrates using the ISH
images to identify neural-disease related genes, and in Section 4.3 the images are used to identify layer-
specific gene markers in the cerebellum.
4.1 Explainable gene coexpression patterns using ISH functional representations
In Section 3.3, we described a method to learn functional representations for neural ISH images. We now
turn to use these representations to identify gene co-expression patterns. Gene co-expression in the brain
has been extensively studied using the more popular methods to measure mRNA expression which
produce expression values aggregated for regions, without considering fine-resolution spatial
information that may differentiate between brain regions (see Sections 1.2.1, 1.2.2, 1.3.3). ISH image
analysis has been used in the past to infer gene biological functions from spatial co-expression in non-
neural tissues (Frise, Hammonds, and Celniker 2010). However, inferring functions based on gene
expression patterns in the brain is believed to be hard, since several studies found very low variability
between transcriptomic patterns of different brain regions, sometimes even lower than between-subject
variability for the same area (Khaitovich et al. 2004; Khaitovich et al. 2005). Here we identify spatial co-
expression patterns exploiting the fine-scaled patterns that exist in the brain ISH imagery, eventually
using this subtle, even cellular resolution, spatial information for functional inference.
4.1.1 Calculating image-image similarities
Following the method described in Section 3.3, each gene can now be represented as a vector of
functional category activations. In order to identify spatial co-expression patterns between the genes, we
can simply calculate the similarity between these functional representations. We use two gene-gene
similarity measures in this work. The first, flat-sim, is simply the linear correlation of two functional
category activation vectors. The second, GO-sim, takes into account the known directed acyclic graph
(DAG) structure among the functional categories of the GO annotation.
Formally, the flat-sim score between a pair of $L_2$-normalized feature vectors $a = (a_1, \ldots, a_m)$ and
$b = (b_1, \ldots, b_m)$ is given by their dot product, $\text{flat-sim}(a, b) = \sum_{i=1}^{m} a_i \cdot b_i$. This additive similarity measure
allows assessing the contribution of each individual feature to the overall similarity score, by setting the
contribution of the feature i (corresponding to GO category i) to 𝑎𝑖 ∙ 𝑏𝑖. Thus, for each pair of similar
images, we can sort the GO categories by order of their contribution to the similarity, providing a semantic
interpretation of the correlation.
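A minimal sketch of flat-sim and its per-category contributions is given below. The function names and the category labels used in the usage example are illustrative only, and the activation values are toy numbers rather than results.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def flat_sim(a, b, category_names, top=3):
    """Dot product of the L2-normalized activation vectors, together with the
    categories that contribute most to it (contribution of category i is a_i * b_i)."""
    a, b = l2_normalize(a), l2_normalize(b)
    contributions = [x * y for x, y in zip(a, b)]
    ranked = sorted(zip(category_names, contributions),
                    key=lambda pair: pair[1], reverse=True)
    return sum(contributions), ranked[:top]

# Toy usage with made-up activations for three categories.
score, explanation = flat_sim([3.0, 4.0, 0.0], [3.0, 4.0, 0.0],
                              ["axon guidance", "synaptic transmission", "myelination"])
```

Because the score is a plain sum, the ranked list returned alongside it accounts for the whole similarity value, which is what makes the measure directly interpretable.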
However, flat-sim does not take into account that the activation of some functional categories can be far
more informative than others. For example, two genes that share a very specific function like "negative
regulation of systemic arterial blood pressure" are much more likely to be similar than a pair of genes
sharing a more general category like "metabolism". We address this issue by adapting a functional
similarity measure between gene products developed by Schlicker et al. (2006), which we refer to as GO-sim.
GO-sim is designed to give high similarity scores to gene-pairs which share many specific and similar
functional categories. We treat our model’s functional activations as binary annotations (using a
threshold of 0.5), and calculate GO-sim as follows.
For each GO category i, we calculate its Information Content (IC) as
$IC(i) = -\log_{10}\left(\frac{\#\,\text{genes in}\ i}{\text{total}\ \#\ \text{of genes}}\right)$, which
measures the specificity of each category. For each pair of categories i and j, we consider the set of their
common ancestors $anc(i, j)$ and define
$sim_{rel}(i, j) = \max_{k \in anc(i,j)} \left[ \frac{2\,IC(k)}{IC(i)+IC(j)} \left(1 - 10^{-IC(k)}\right) \right]$. The measure
𝑠𝑖𝑚𝑟𝑒𝑙 is symmetric, bounded between 0 and 1, and attains larger values for pairs of categories which
are both specific and close to each other in the GO graph.
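The two definitions above can be illustrated on a toy hierarchy. In the sketch below the function names, the `parents` dictionary encoding the DAG and the gene counts are all hypothetical, and we assume that a category counts as its own ancestor, so that a category compared with itself can receive a high score.

```python
import math

def information_content(n_genes_in_category, n_genes_total):
    return -math.log10(n_genes_in_category / n_genes_total)

def ancestors(category, parents):
    """All ancestors of a category in the DAG, including the category itself
    (the inclusion of the category itself is an assumption of this sketch)."""
    seen, stack = {category}, [category]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def sim_rel(i, j, parents, ic):
    common = ancestors(i, parents) & ancestors(j, parents)
    denominator = ic[i] + ic[j]
    if not common or denominator == 0:
        return 0.0
    return max(2 * ic[k] / denominator * (1 - 10 ** (-ic[k])) for k in common)

# Toy hierarchy with made-up gene counts out of 1000 genes in total.
parents = {"synapse": ["neural"], "axon": ["neural"], "neural": ["root"]}
counts = {"root": 1000, "neural": 100, "synapse": 10, "axon": 10}
ic = {name: information_content(n, 1000) for name, n in counts.items()}
similarity = sim_rel("synapse", "axon", parents, ic)
```

In this toy case the best common ancestor is "neural" with IC = 1, giving 2 * 1 / (2 + 2) * (1 - 0.1) = 0.45, while the root contributes nothing because its IC is zero.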
In our method, each gene is annotated with multiple categories. Naively, we could calculate the mean
$sim_{rel}$ measure between all pairs of categories, but calculating this mean could give weight to many
irrelevant categories, and be sensitive to the addition of extra annotations to a gene. Instead, we use a
more robust method to measure similarity between two sets of function annotations, developed by
Schlicker et al. (2006). This method relies on the most similar category pairs, instead of all the pairs. For
two binary activation vectors $a = (a_1, \ldots, a_m)$, $b = (b_1, \ldots, b_m)$, define a matrix
$S_{ij} = sim_{rel}(i, j)\, a_i b_j$. Then we define
$sim_{a \to b} = \frac{1}{m} \sum_{i=1}^{m} \left( \max_{j=1 \ldots m} S_{ij} \right)$, which finds for each
annotation of $a$ its most similar annotation in $b$, and averages across all of $a$'s annotations. We
similarly define $sim_{b \to a}$ with the roles of $a$ and $b$ switched, and use it to define
$\text{GO-sim} = \max(sim_{a \to b}, sim_{b \to a})$. To assess the contribution of individual
gene functional annotations to the GO-sim measure, we look at the category pairs (i,j) corresponding to
the highest values of $S_{ij}$. Each such pair also has its “most informative common ancestor”,
$MICA(i, j) = \arg\max_{k \in anc(i,j)} \left[ \frac{2\,IC(k)}{IC(i)+IC(j)} \left(1 - 10^{-IC(k)}\right) \right]$.
These ancestor functional categories give a succinct interpretation
of the similarity between genes $a$ and $b$.
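The GO-sim computation can be sketched as follows, assuming the pairwise category similarities have already been computed and stored in a symmetric matrix. The function names are hypothetical, the similarity matrix in the usage example contains toy values, and treating an activation exactly equal to the threshold as 0 is our assumption.

```python
def binarize(activations, threshold=0.5):
    """Turn classifier activations into binary annotations."""
    return [1 if value > threshold else 0 for value in activations]

def directed_similarity(a, b, sim_rel_matrix):
    """For each category i, take the best match max_j S_ij, then divide by m."""
    m = len(a)
    total = 0.0
    for i in range(m):
        total += max(sim_rel_matrix[i][j] * a[i] * b[j] for j in range(m))
    return total / m

def go_sim(a, b, sim_rel_matrix):
    return max(directed_similarity(a, b, sim_rel_matrix),
               directed_similarity(b, a, sim_rel_matrix))

def top_category_pairs(a, b, sim_rel_matrix, top=3):
    """Category pairs (S_ij, i, j) with the highest S_ij, used for interpretation."""
    m = len(a)
    pairs = [(sim_rel_matrix[i][j] * a[i] * b[j], i, j)
             for i in range(m) for j in range(m)]
    return sorted(pairs, reverse=True)[:top]

# Toy usage: three categories with a made-up symmetric similarity matrix.
toy_sim = [[0.99, 0.45, 0.0],
           [0.45, 0.99, 0.0],
           [0.0, 0.0, 0.5]]
gene_a = binarize([0.9, 0.2, 0.1])
gene_b = binarize([0.3, 0.8, 0.4])
value = go_sim(gene_a, gene_b, toy_sim)
```

Each call costs on the order of m^2 operations for a pair of genes, which is the source of the quadratic dependence on the number of categories when all gene pairs are compared.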
Computing GO-sim for n=16351 genes, each with m functional annotations, is computationally
burdensome, requiring $O(n^2 m^2)$ operations. We therefore use only 164 brain-related categories out of
the total 2081 functional categories for calculating GO-sim.
4.1.2 Robustness of bag-of-words representations
In order to validate the stability of the bag-of-words gene representations, we measured the similarities
between pairs of representations of images that are of the same gene but from different image series,
and the similarities between the representations of different genes.
Similarity is much higher for representations of the same gene (Wilcoxon difference of medians test, $p<10^{-200}$).
The similarity values are shown in Figure 4.1. This implies that representations of the same gene,
derived from different image series are indeed stable and are representative of the gene.
Figure 4.1: The similarity in the representation of same-gene pairs (blue) and different-gene pairs (red). Each curve shows the
histogram of similarity values. Same-gene image series have highly similar representations.
4.1.3 Predicting functional annotations using brain ISH images
We applied our method to 16K ISH images of 15K genes, and mapped each image to a vector
of activations for 2081 GO categories, used as functional features. We used the Area Under the ROC Curve
(AUC) as a measure of classification accuracy. All evaluations were performed on a separate held-out test
set. We find that 37% of the GO categories tested yielded a test set AUC value that was significantly above
random (permutation test, p-value<0.05). This was encouraging, since the variability of expression
between brain regions was previously shown to be very low (Khaitovich et al. 2004; Khaitovich et al.
2005). This suggests that fine spatial resolution in neural tissues can reveal highly meaningful expression
patterns.
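For reference, the evaluation described above can be sketched as follows. The AUC is computed through its rank-sum formulation; the permutation test shown here (shuffling the labels and recomputing the AUC) is one common way to obtain such a p-value and is an assumption on our part, as are the function names, the number of permutations and the toy data in the usage lines.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive example
    is scored above a random negative one, counting ties as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """Fraction of label shufflings whose AUC reaches the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    at_least_as_high = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            at_least_as_high += 1
    # Add-one correction so that the estimated p-value is never exactly zero.
    return (at_least_as_high + 1) / (n_permutations + 1)

# Toy usage: ten positives scored above ten negatives.
toy_scores = [0.5 + 0.01 * i for i in range(10)] + [0.1 + 0.01 * i for i in range(10)]
toy_labels = [1] * 10 + [0] * 10
toy_auc = auc(toy_scores, toy_labels)
toy_p = permutation_p_value(toy_scores, toy_labels)
```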
Which functional categories can be best predicted by ISH images? Table 4.1 lists the top 15 GO categories
that achieved the best test-set AUC classification scores. Interestingly, these include mostly
biosynthesis/metabolism processes and neural processes. To further test whether neural categories
achieve higher classification values based on neural expression patterns, Figure 4.2 compares the AUC
scores of 164 categories related to the nervous-system with the AUC scores of the remaining categories.
As expected, neural GO categories receive significantly higher AUCs (Wilcoxon, $P<10^{-38}$), with 69%
of categories yielding significantly above random AUC values.
These AUC values suggest that when a gene is represented as a feature vector of classifier activations,
many of the features carry a meaningful signal. The axes of the new low-dimensional representation
correspond to functional properties of each gene, linking functions of the genes to the geometry of the
space in which they are embedded.
Figure 4.2. AUC scores for GO categories related to the nervous system (dashed, red) and the remaining categories (solid, blue).
AUC scores are significantly higher for neural categories (Wilcoxon test, p < 10^-38). The red and blue ticks indicate the median of
each set.
GO ID GO category name #genes AUC
GO:0060311 negative regulation of elastin catabolic process 17 1
GO:0042759 long-chain fatty acid biosynthetic process 23 0.98
GO:0009449 gamma-aminobutyric acid biosynthetic process 20 0.96
GO:0009448 gamma-aminobutyric acid metabolic process 23 0.96
GO:0032348 negative reg. of aldosterone biosynthetic process 21 0.94
GO:2000065 negative regulation of cortisol biosynthetic process 21 0.94
GO:0043206 fibril organization 23 0.94
GO:0031947 negative reg. of glucocorticoid biosynthetic process 22 0.94
GO:0042136 neurotransmitter biosynthetic process 23 0.94
GO:0022010 central nervous system myelination 29 0.89
GO:0008038 neuron recognition 20 0.87
GO:0042220 response to cocaine 30 0.87
GO:0050919 negative chemotaxis 16 0.86
GO:0042274 ribosomal small subunit biogenesis 15 0.86
GO:0016486 peptide hormone processing 17 0.85
Table 4.1. The GO categories classified with highest test-set AUC values.
4.1.4 Comparison with NeuroBlast, the ABA image-correlation tool
How well does the method presented here compare to other methods suggested for finding similarity
between these images? We compared our results with NeuroBlast, a method to detect image-image
similarities available on the ABA website (Hawrylycz et al., 2011). This method uses a non-linear mapping
of the images to a reference anatomical atlas to apply voxel-voxel correlation between the images.
To evaluate the quality of the similarity measure, we used three sets of pair-wise relations as evidence of
gene relatedness: (1) markers of known cell types (Cahoy et al. 2008), such as astrocytes or
oligodendrocytes, (2) occurrence in the same KEGG pathway (Kanehisa 2002), and (3) a set of known
protein-protein interactions taken from IntAct (Kerrien et al. 2012). For each of the 16531 genes we
ranked the 100 most similar genes according to 4 different similarity measures: (1) Functional
representation of the ISH images (FuncISH) GO-sim, (2) FuncISH flat-sim, (3) cosine similarity between the
SIFT bag-of-words representations, and (4) the ABA NeuroBlast tool. For each of the pair-wise relations
(cell type markers, KEGG pathway and PPIs) we plot the mean fraction of relations retrieved at the top-K
most similar genes (precision-at-k), a standard method in information retrieval (Manning and Raghavan,
2009). Figure 4.3 shows that for all three validation labels, FuncISH GO-sim provides superior precision
for the top 10 ranked similar genes. The superior precision of GO-sim over flat-sim is presumably because
GO-sim weighs categories more correctly, and possibly also because GO-sim was limited to brain-related
categories that tend to be more accurately predicted (Figure 4.2). On the other hand, we see that
NeuroBlast outperforms flat-sim in most cases.
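The precision-at-k measure used in this comparison can be sketched as follows. This is a minimal illustration: the dictionary layout (one ranked list and one set of related genes per query gene) and the function names are assumptions made for the example.

```python
def precision_at_k(ranked_genes, related, k):
    """Fraction of the top-k ranked genes that are related to the query gene."""
    top = ranked_genes[:k]
    if not top:
        return 0.0
    return sum(g in related for g in top) / len(top)


def mean_precision_at_k(rankings, relations, k):
    """Average precision-at-k over query genes with at least one known relation.

    rankings:  dict mapping a query gene to its list of genes, most similar first.
    relations: dict mapping a query gene to the set of genes it is related to
               (e.g. same pathway); both layouts are assumed for this sketch.
    """
    values = [precision_at_k(rankings[q], relations[q], k)
              for q in rankings if relations.get(q)]
    return sum(values) / len(values) if values else 0.0
```

Evaluating `mean_precision_at_k` for k = 1, ..., 100 under each similarity measure yields one curve per measure of the kind compared in Figure 4.3.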
Figure 4.3. Precision at top-K for similarity defined by (A) cell type marker (B) KEGG pathways (C) protein–protein interaction.
Precision was measured using functional representations (FuncISH, purple lines for GO-sim, orange for flat-sim), SIFT (red) and
NeuroBlast (blue).
4.1.5 Identifying and explaining similarities between GABAergic neuron markers
We now turn to a deeper look into the similarity predictions. Interestingly, the highest classification
scores were achieved for the neural related categories GABA biosynthetic process and GABA metabolic
process (shown in Table 4.1), implying that our algorithm can identify spatial patterns of GABAergic
neurons. A prominent member of the GABAergic neuron marker family is Parvalbumin B (Pvalb), which
encodes a calcium-binding protein. We examined the genes that are most similar to Pvalb, and found
that another GABAergic neuronal marker and calcium-binding protein, Calbindin D28K (Calb1), appears in the
top-15 most similar gene lists for all associated image series. Pvalb and Calb1 belong to a family of cellular
Ca2+ buffers in GABAergic interneurons. The third member of this family is Calretinin (Calb2). Considering
the similarity rank between Calb1 and Calb2, Calb2 falls within the top 2 percent (out of 16351 images in the
dataset) in 16 out of 17 cases. Similarities between these three genes were not identified by NeuroBlast.
This may be because NeuroBlast uses spatial correlation measures that produce results heavily reliant on
the spatial location of expression, while using functional representations can identify patterns that can
appear in different regions of the brain. A major benefit of representing genes in the functional
embedding space is that similarities between genes can be "explained" in functional terms. Calb1, Pvalb
and Calb2 are all involved in regulation of synaptic plasticity (Schwaller 2012). When looking at the
semantic interpretations explaining the similarities between the genes, 6 out of the top-10 GO categories
are indeed directly related to synaptic plasticity such as "synaptic transmission", "regulation of synaptic
plasticity" and "learning".
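One way to obtain such explanations is to decompose the similarity score into per-category contributions. The sketch below assumes, for illustration only, that the similarity between two genes is a (possibly weighted) dot product of their functional feature vectors; the function name and the weighting argument are hypothetical and do not reproduce the exact GO-sim definition.

```python
def explain_similarity(x, y, category_names, weights=None, top=10):
    """Rank categories by their contribution w_i * x_i * y_i to a weighted dot product.

    x, y:            functional feature vectors of the two genes.
    category_names:  GO category name for each feature dimension.
    weights:         optional per-category weights; None means all weights are 1.
    """
    if weights is None:
        weights = [1.0] * len(x)
    contributions = [(w * a * b, name)
                     for w, a, b, name in zip(weights, x, y, category_names)]
    # Largest contributions first; these serve as the "explanation" of the similarity.
    contributions.sort(key=lambda c: c[0], reverse=True)
    return [(name, value) for value, name in contributions[:top]]
```

The categories returned first are those in which both genes score highly, which is what makes the similarity interpretable in functional terms.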
4.1.6 Finding important spatial patterns in different scales using SIFT "visual words"
A major advantage of representing ISH images with SIFT descriptors is the ability to point directly to
spatial patterns in these complex images. Although their name suggests otherwise, SIFT descriptors at
several scales capture different types of patterns. Figure 4.4 shows three visual words for each of the four
scales, selected as the visual words that contributed most to classification. Scale invariance is often
assumed when analyzing natural images since objects are photographed at varying distances. ISH images,
however, contain distinctive information at the different scales. As Figure 4.4 demonstrates, the four sizes
of visual words correspond to grids capturing different neural entities. The smallest descriptors cover an
actual area of 36×36 µm² and capture fine-scale information such as cell shapes and cell densities; the
medium-size discriminative descriptors of 72×72 µm² tend to trace thinner cell layers; larger descriptor
sizes of 144×144 µm² and 288×288 µm² can cover large and intricate patterns of a mixture of cells and cell
types in a tissue. Interestingly, the four visual words with the highest contribution to classification were
the words counting the zero descriptors in each scale. This means that the highest information content
lies in "least informative" descriptors; and that overall expression levels ("sparseness" of expression) are
important factors in functional prediction of genes based on their spatial expression. Our method
presents a new representation of ISH imagery as SIFT descriptors, and using multiple scales allows
revealing the multi-resolution nature of the images.
Which scale carries the most meaningful signal for functional prediction? Figure 4.4E shows the mean
absolute value of visual words weights in every scale for all GO categories, showing that all scales
contribute significantly to the scores, with the medium contributing most. Figure 4.4A-D shows
descriptors that contributed to classification of all the categories. Furthermore, each GO category has its
own visual words that are important to its classification, and looking into their details reveals spatial
properties that are unique to specific biological processes.
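The per-scale summary shown in Figure 4.4E can be sketched as below. This is a minimal illustration under the assumption that every classifier has one weight per visual word and that each visual word is tagged with the scale it was extracted at; the argument names are hypothetical.

```python
def mean_abs_weight_by_scale(weights, word_scale):
    """Mean absolute classifier weight of the visual words belonging to each scale.

    weights:    list of weight vectors, one per GO-category classifier.
    word_scale: word_scale[j] is the scale label (e.g. 36, 72, 144, 288) of word j.
    """
    totals, counts = {}, {}
    for w in weights:
        for value, scale in zip(w, word_scale):
            totals[scale] = totals.get(scale, 0.0) + abs(value)
            counts[scale] = counts.get(scale, 0) + 1
    return {scale: totals[scale] / counts[scale] for scale in totals}
```

Restricting `weights` to the classifier of a single GO category gives the category-specific view discussed next.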
As an interesting example of this effect, we considered the gene Adducin beta (Add2). Add2 is annotated
to several GO categories including "positive regulation of protein binding" and "actin filament bundle
assembly". Figure 4.5 overlays the top weighted visual words of the two categories over the Add2 ISH
image. It is easy to see that the descriptors important for classification of "actin filament bundle assembly"
are much smaller than those important for classification of the more general category "positive regulation
of protein binding" (t-test, p-value < 10^-17). This implies that small-scale features such as specific cell
shapes are important to identify genes related to actin filament bundle assembly processes. Actin
assemblies are important to the navigation of neural growth cones, by reorienting growth cones away
from inhibitory cues (Challacombe, Snow, and Letourneau 1996). Representing the images with
histograms of oriented gradients could capture tiny differences in cell shapes that are in the process of
synapse formation, a developmental process occurring continuously throughout adulthood (Vidal-Sanz
et al. 1987).
Figure 4.4. Representing ISH images with visual words. (A, B, C, D) The three visual words with highest absolute weight (averaged
over all categories) at each scale. The SIFT descriptors (red grid) are plotted on top of each panel. The histogram of oriented
gradients used in the SIFT descriptor is plotted in the center of each element of the grid, as a set of red lines, where the length of
the line corresponds to the magnitude of the gradient in its direction. (E) Mean absolute weight for the four scales of visual words
calculated over classifiers for all categories.
Figure 4.5. The visual words important in classifying Add2 GO categories are overlaid on the Add2 ISH image. Larger descriptors
are needed for the classification of ‘regulation of protein binding’ (A), while the discriminative visual words for ‘actin filament
bundle assembly’ (B) are much smaller, capturing properties such as cell shapes. The descriptors are color-coded by their
importance in classification, highest importance is in bright yellow.
4.1.7 Inferring new gene functions via explainable similarities
We now demonstrate how the semantic representations can be used to propose new gene functional
annotations. Consider as an example the gene Synaptopodin 2 (Synpo2) that is known to bind actin, but
otherwise has very little known associated information. Our method can be used to propose functional
annotations for Synpo2 by looking at the genes that are similar to Synpo2 and considering both the GO
functions that contribute to this similarity, and the spatial pattern of expression.
First, we find that Synpo2 is similar to two other genes, Npepps and Rasa4, but for different reasons (the
list of top-5 semantic explanations for these similarities is shown in Table 4.2). Npepps is an
aminopeptidase that is active specifically in the brain (Hui 2007), and the similarity between Synpo2 and
Npepps is explained by processes related to protein processing such as ubiquitination and protein
proteolysis. At the same time, Rasa4 is a GTPase-activating protein that suppresses the Ras/mitogen-
activated protein kinase pathway in response to Ca2+ (Vigil et al. 2010), and the similarity between Synpo2
and Rasa4 is explained by high-level neural processes such as axon guidance or synaptic transmission.
Interestingly, Synpo2 and Rasa4 are expressed in different brain regions: Looking at their spatial
expression patterns reveals that Synpo2 is expressed exclusively in the thalamus, while Rasa4 is expressed
in olfactory areas. Therefore, their similarity is not in their global expression patterns across regions, but
rather in local spatial patterns. This could reflect expression in similar cell types or tissues that exhibit
similar spatial distribution at different brain regions. Npepps is more ubiquitously expressed in the brain,
and is located in the thalamic area where Synpo2 is expressed. The co-location of Synpo2 and Npepps
suggests they could be participating in similar biological processes in these areas, possibly in protein
modification processes as suggested by the list of top explanations for the similarity.
Synpo2-Npepps                                                           Synpo2-Rasa4
GO ID       GO name                                                     GO ID       GO name
GO:0070646  protein modification by small protein removal               GO:0006836  neurotransmitter transport
GO:0006412  Translation                                                 GO:0051970  negative regulation of transmission of nerve impulse
GO:0016567  protein ubiquitination                                      GO:0050805  negative regulation of synaptic transmission
GO:0051603  proteolysis involved in cellular protein catabolic process  GO:0007411  axon guidance
GO:0032446  protein modification by small protein conjugation           GO:0031645  negative regulation of neurological system process
Table 4.2. Top-5 GO annotations explaining the similarities between the gene Synpo2 and Npepps (left column) and Rasa4
(right column).
4.2 Using ISH images to predict neural disease-related genes
In the previous section, we learned that representing the ISH images using SIFT-BoW can be used to
predict the inclusion of a gene in a biological process. This is one instance of applying co-expression under the
“guilt by association” principle (see Section 1.3.3), where the logistic regression classifier learns typical
patterns of genes from each functional category, and searches for genes with a similar pattern. Doing this
can prove to be particularly useful when trying to identify genes that are related to neural disease. The
idea is that by identifying disease related genes, we are able to better understand the underlying
molecular basis of the disease and consequently improve at preventing, diagnosing or treating the
disease, by developing new drug targets, for example. Indeed, many studies in recent years focused on
analyzing gene co-expression in order to shed light on neural disease (Chen et al. 2013a; de Jong et al.
2012; Ponomarev et al. 2012; Torkamani et al. 2010; Voineagu et al. 2011).
4.2.1 Image classification based on disease-gene markers
We used the visual representations (SIFT-BoW) to try and predict disease related genes for two neural
diseases which have a strong genetic basis with many known associated genes: Parkinson’s disease and
epilepsy. We applied a logistic regression classifier for every disease, where positive examples are genes
already known to be associated with the disease, and negative examples are genes that are not known
yet to be involved in the disease. The classifiers were run using the same setup described in Section 3.3.2.
Disease-genes were taken from the database OMIM (Hamosh et al. 2005). We used mouse orthologs of
24 human genes found to be PD related, leaving 11 mouse genes: Dbh, Lrrk2, Ndufv2, Nr4a2, Park2,
Park7, Pink1, Snca, Sncaip, Tbp, Uchl1. For epilepsy, we used mouse orthologs of 43 human epilepsy related
genes, leaving 16 genes: Cacnb4, Chrna4, Chrnb2, Clcn2, Cstb, Epm2a, Gabra1, Gabrg2, Kcnq2, Kcnq3, Lgi1,
Me2, Nhlrc1, Scn1a, Slc25a22, Syn1. The genes Ndufv2, Snca, Uchl1, Cacnb4 and Chrna4 have been
imaged twice and the gene Chrnb2 has 3 associated images, leading to an overall number of 14 images in
the positive set for PD and 16 images in the positive set for epilepsy. When splitting the positive set into
train and test sets, we made sure to have an equal number of positives in each split, and also that
no gene with more than one associated image had images in both the train set and the test set of
that specific split. We evaluated our results using AUC scores. Test-set AUC scores are 0.799 and
0.73 for PD and epilepsy respectively (Figure 4.6).
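The splitting constraint described above (balanced positives, and no gene contributing images to both the train and the test side of a split) can be sketched as follows. The greedy assignment and the function name are assumptions made for illustration; the actual splitting code may differ.

```python
import random


def split_by_gene(image_genes, n_folds=2, seed=0):
    """Assign images to folds so that all images of one gene share a fold.

    image_genes: list whose entry i is the gene symbol of positive image i.
    Returns a list of folds, each a sorted list of image indices.
    """
    by_gene = {}
    for idx, gene in enumerate(image_genes):
        by_gene.setdefault(gene, []).append(idx)
    genes = sorted(by_gene)
    random.Random(seed).shuffle(genes)
    # Place genes with the most images first, always into the currently smallest
    # fold, which keeps the number of positives per fold roughly equal.
    genes.sort(key=lambda g: len(by_gene[g]), reverse=True)
    folds = [[] for _ in range(n_folds)]
    for gene in genes:
        min(folds, key=len).extend(by_gene[gene])
    return [sorted(fold) for fold in folds]
```

Training on one fold and testing on the other then guarantees that a classifier never sees a second image of a gene it is evaluated on.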
Figure 4.6: ROC curves for (A) Parkinson’s disease predictions and (B) epilepsy predictions.
We applied the trained classifiers to the entire mouse genome and predicted dozens of new candidate genes
for epilepsy and PD. For example, a top epilepsy prediction is Tph2 (tryptophan hydroxylase 2). This gene is
important in the biosynthesis of serotonin, which has been hypothesized to be involved in epilepsy. Many
of the genes predicted to be PD-related are known to be related to other neural diseases, for example,
mutations in the gene App, the top prediction for PD, have been implicated in autosomal dominant
Alzheimer disease and cerebroarterial amyloidosis (cerebral amyloid angiopathy). Fkbp6, another top PD
prediction, is deleted in Williams syndrome, and deficiency of the protein product of Ndn is
implicated in the pathogenesis of the neurodevelopmental disorder Prader-Willi syndrome, to name
a few examples. We also find that PD genes are widely expressed in the brain, while epilepsy genes show
much more localized patterns, notably in the periaqueductal grey, a region that has been shown to be
associated with the induction of audiogenic seizures in epilepsy-prone rats. Overall we predict 13 new
Parkinson’s disease genes and 247 new epilepsy genes. The top-10 predicted genes for PD and epilepsy
are listed in Table 4.3, along with corresponding logistic regression confidence scores.
PD predicted genes Score Epilepsy predicted genes Score
App 0.641 Slc17a8 0.886
Ifnar1 0.639 Gchfr 0.814
2610002F03Rik 0.638 Tph2 0.79
LOC652668 0.636 Ublcp1 0.734
Fkbp6 0.635 Yipf5 0.72
Apbb1 0.633 Chrm3 0.711
Cenpf 0.633 Icam5 0.693
Actc1 0.633 Slc36a1 0.689
LOC546142 0.633 Slc39a3 0.651
Ndn 0.632 Arhgdig 0.651
Table 4.3: Top 10 predicted genes for the two diseases, and corresponding prediction scores.
4.2.2 Validation of results
We validated our gene-disease predictions using a Gene Set Enrichment Analysis (GSEA) (Subramanian et
al. 2005) on our lists of prediction scores. We used two sources of validation: 1. Genetic Association
Database (GAD) (Becker et al. 2004) 2. Disease-gene predictions from a previous attempt to identify
disease-related genes from co-expression patterns (Linghu et al. 2009). For PD predictions, GSEA scores
were significant (P<0.05) using both datasets. Epilepsy predictions were significant in the GAD dataset
but not significant using the Linghu 2009 set as a validation set, presumably due to the small number of
positives in this dataset (N=30) (Table 4.4).
Disease Validation dataset Enrichment P-value N
PD GAD 5.7×10^-5 336
PD Linghu 2009 0.02 28
Epilepsy GAD 0.0028 81
Epilepsy Linghu 2009 0.058 30
Table 4.4: Prediction validation using two datasets: GAD and Linghu 2009. Results were validated using a GSEA analysis
(Subramanian et al. 2005).
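The idea behind the enrichment analysis can be illustrated with a simplified sketch. It uses the unweighted running-sum statistic (the Kolmogorov-Smirnov-like special case of the GSEA score) together with a gene-set permutation null; the weighting scheme and normalization of the full GSEA procedure are not reproduced, and all function names are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up on gene-set members, step down otherwise.

    ranked_genes: genes sorted by decreasing prediction score.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must partially overlap the ranked list")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def enrichment_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """One-sided p-value for enrichment near the top, using random sets of equal size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    exceed = 0
    for _ in range(n_perm):
        null_set = set(rng.sample(ranked_genes, size))
        if enrichment_score(ranked_genes, null_set) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Here `ranked_genes` would be the genome sorted by classifier confidence and `gene_set` a validation list such as the GAD disease genes.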
4.3 Localizing genes to cerebellar layers using ISH image classification
Regions in the brain are organized in layered structures, where each layer has a distinct functional role
that is reflected in specialized expression of genes. Understanding cell-layer functionality depends on our
ability to identify and characterize its unique expression signature. In this section we focus on the highly
organized layered structure of the mouse cerebellum. The cerebellum is composed of four distinct layers:
the Purkinje layer, where the Purkinje cell bodies are located; the molecular layer, containing the thick
dendritic trees of the Purkinje cells; the granular layer, which is the region with the highest density of
neurons known in the brain; and the innermost layer, the cerebellar white matter.
We use a machine vision approach to identify layer-specific genes. The method is based on modeling the
spatial expression patterns observed in ISH images of a few genes that are known layer-markers.
Specifically, we represent the images using histograms of local binary patterns (LBP, see Section 3.2.1),
and learn four separate classifiers for the four layers of the cerebellum. For full details on the methods
used see (Kirsch, Liscovitch, and Chechik 2012). All classifiers achieve a very high area under the ROC curve
(AUC > 0.94 for all four categories). Using the learned patterns, we then automatically scan the genome-wide
ISH database and detect all other layer-specific genes.
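As an illustration of the image representation, the sketch below computes a histogram of basic 3x3 local binary patterns in plain Python. It assumes the simplest LBP variant (8 neighbors, no rotation invariance or uniform-pattern grouping); the variant actually used is the one described in Section 3.2.1, and the function name is hypothetical.

```python
def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 local binary patterns.

    image: 2D list of gray levels (rows of equal length, at least 3x3).
    """
    # Neighbor offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            # Each neighbor that is at least as bright as the center sets one bit.
            for bit, (dr, dc) in enumerate(offsets):
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

The resulting 256-dimensional vector would serve as the input to a layer classifier.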
4.3.2 Genome-wide predictions of cerebellum layer markers
After applying the classifiers to the full mouse genome (20,382 genes in the ABA database), we are able
to identify layer specific markers with high accuracy. Out of 13361 genes that are expressed in the
cerebellum, 454 genes are predicted to be primarily expressed in the Purkinje layer, 233 in the granular
layer, 14 in molecular layer and 16 in the white matter.
We validated the predictions by manually scanning the top predicted genes, visualizing their measured
expression patterns, and comparing them to the patterns expected at that layer. Out of the top 250 genes
predicted to be localized to the Purkinje layer, 98.4% were classified correctly. Similarly, 98.1% of the top
250 granular layer predictions were accurate. Precision was lower for the molecular
layer: all 14 predictions showed molecular-layer expression, but 10 of the 14 also showed granular-layer expression.
Finally, 10 of the 16 predicted white-matter genes were correct. It should be clarified, however, that many of
the genes that exhibited localized expression in one cerebellar layer are also expressed in other regions
of the brain, sometimes very widely. Also, despite the fact that most of the training images in the
molecular class show expression in the molecular layer and also in the Purkinje layer, our classifier was
able to identify genes that show expression only in the molecular layer.
Applying the white-matter classifier and the molecular layer classifier to the full genome yielded very few
positively scored genes. This could be attributed to the small number of positive samples in the training
set for these classes. Indeed, when we manually examined one thousand of the genes in the database we
only found one gene that was exclusively localized to the white matter (and one gene localized to the
molecular layer). In comparison, there were many more genes localized to the granular or the Purkinje
layers.
4.3.3 Characterizing layer-specific genes
The above results show that at least 450 genes, which are more than 3.4% of genes that are expressed in
the cerebellum, are primarily expressed in one layer (mostly the Purkinje and granular layers). There could
be many reasons for this highly structured expression pattern. For example, localized genes may reflect
unique cell-type dependent biological processes, like shaping the cell morphology or controlling the
connectivity between specific neuron types. Alternatively, localized expression may also reflect properties
that are not necessarily cell-type specific, like processes that depend on cell size, since Purkinje cells are
exceptionally large. We therefore turned to characterize the properties of localized genes, by testing their
functional annotations and comparing them with the transcriptome of Purkinje-deficient mice.
Comparison with Purkinje deficient mice
To better characterize the properties of genes localized to the Purkinje layer, we aimed to separate genes
whose expression is related to Purkinje cells from genes whose expression is related to non-Purkinje cells.
We compared our study with a study by Rong and colleagues (Rong, Wang, and Morgan 2004) who aimed
to identify Purkinje-cell specific genes. In this study, the cerebellar gene expression of two strains of mice
was compared: wild-type mice and PSD3J mice, which have a mutation in the gene Nna1 causing them to
lose their Purkinje cells by adulthood. Genes with reduced expression in the PSD3J mice presumably reflect
the loss of Purkinje cells.
We compared the list of genes that we predicted to localize to the Purkinje-layer with a list of
203 PSD3J genes whose expression decayed by more than 50% as provided by (Rong, Wang, and Morgan
2004). We sorted the predicted genes by the classifier margin, treated the PSD3J list as positives, and
computed the precision at the K top-ranked genes. Figure 4.7A shows that the top ranked predicted genes
have high overlap with the PSD3J list, reaching 33% at the top 10.
The cross-comparison between the two sets reveals genes that are localized to the Purkinje-layer, but are
not Purkinje-cell related. This may include genes that are expressed in non-Purkinje cells such as
Bergmann glia. The cross-comparison also reveals genes whose expression is affected by the deficient
Purkinje cells, but are not localized to the Purkinje layer. These may include genes that are expressed in
the dendritic arbors of Purkinje cells, or other genes that are not layer-specific but were affected by the
deficiency of the Purkinje cells. Finally, for those genes that are detected to be both Purkinje-layer related
and Purkinje-cell related, the cross-comparison strengthens their link to Purkinje cells.
Functional annotation
As the next step, we studied known functions of the genes that were localized to the four classified layers.
We used Gene Ontology (GO) annotations to find the biological processes that are over-represented in
the resulting gene sets for each layer.
As expected, genes localized to the white matter layer showed enrichment for myelination. More
interesting was the enrichment for neurogenesis, which is also known to take place in the white
matter (Zhang and Goldman 1996). The Purkinje layer was enriched for lipid metabolic processes and
more general processes, such as oxidation/reduction. Full lists of enriched categories are provided in
Tables 4.5-8.
GO ID Term # annotated # significant FDR q-value
0042552 Myelination 21 2 0.00045
0006665 Sphingolipid metabolic process 36 2 0.00134
0008654 Phospholipid biosynthetic process 49 2 0.00248
0022008 Neurogenesis 267 3 0.00703
Table 4.5. Functional enrichment of genes localized to the white matter.
GO ID Term # annotated # significant FDR q-value
0006816 Calcium ion transport 86 12 9.5×10^-5
0006937 Regulation of muscle contraction 22 5 0.0012
0030900 Forebrain development 73 9 0.0018
0006629 Lipid metabolic process 407 28 0.0018
0055114 Oxidation reduction 375 26 0.0023
0007264 Small GTPase mediated signal transduction 263 19 0.0057
0050767 Regulation of neurogenesis 61 7 0.0083
Table 4.6. Functional enrichment of genes localized to the Purkinje layer.
GO ID Term # annotated # significant FDR q-value
0016192 Vesicle mediated transport 279 14 0.00036
0009966 Regulation of signal transduction 317 12 0.00952
Table 4.7. Functional enrichment of genes localized to the granular layer.
GO ID Term # annotated # significant FDR q-value
0050804 Regulation of synaptic transmission 49 2 0.0024
0006457 Protein folding 76 2 0.0059
Table 4.8. Functional enrichment of genes localized to the molecular layer.
The cerebellar cortical layers are comprised of distinct types of neurons and glia. We asked whether the
genes expressed in the different layers are associated with specific cell types. To answer this question, we
used lists of genes that were found to be enriched in three major cell types: neurons, astrocytes and
oligodendrocytes (Cahoy et al. 2008). Enrichment was determined by isolating these cell populations
using Fluorescence-Activated Cell Sorting (FACS) and quantifying their expression using microarrays.
Genes over-expressed 20-fold or more were defined as cell-type specific markers. The lists of
cell type markers include 2036 genes for neurons, 2618 for astrocytes and 2228 for oligodendrocytes. We
tested for enrichment of these markers in our results, using the entire genome as background. Results are
presented in Figure 4.7B. As expected, genes that were found to be expressed in the cerebellar white
matter show a strong enrichment signal for oligodendrocytes. The granular layer, which contains large
amounts of densely packed granule cells, indeed shows enrichment for neuron-related genes. The
Purkinje cell layer, which is defined by the cell bodies of Purkinje neurons, interestingly shows a strong
enrichment signal for glial cells, notably astrocytes. This could be explained by the specialized astrocytes
that occupy this layer, the Bergmann glia, and also by the astrocyte processes derived from cells located
in the upper granular layer, covering the Purkinje cell bodies (Ghandour, Vincendon, and Gombos 1980).
Oligodendrocytes are also known to be localized close to the Purkinje cells (Ghandour, Vincendon, and
Gombos 1980). This fact can account for the enrichment of this cell type in the Purkinje cell layer.
Figure 4.7. Comparison with Purkinje-deficient mice and layer enrichment for cell types. (A) Comparison with Purkinje-deficient
mice genes from (Rong, Wang, and Morgan 2004): the overlap between the top-ranked genes that were localized to the Purkinje
layer and the PSD3J list. Precision is the fraction of Purkinje-localized genes that are found in the PSD3J list. (B) Enrichment for
cell-type-specific markers, taken from (Cahoy et al. 2008). For each layer, enrichment for each cell type was tested using a hypergeometric test.
The dashed red line corresponds to p-value at random. As expected, the white matter was enriched for oligodendrocyte markers
and the granular layer is enriched for neuronal markers. Interestingly, Purkinje layer genes show a strong enrichment for
astrocyte markers.
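The hypergeometric enrichment test mentioned in the caption can be sketched as follows. The function and argument names are hypothetical; the sketch computes the standard upper-tail probability of observing at least the given overlap by chance.

```python
from math import comb


def hypergeom_enrichment_p(population, marked, drawn, overlap):
    """P(X >= overlap) for a hypergeometric variable X.

    population: number of genes in the background (e.g. the whole genome).
    marked:     number of background genes that are markers of the cell type.
    drawn:      number of genes predicted to localize to the layer.
    overlap:    number of predicted genes that are also cell-type markers.
    """
    upper = min(marked, drawn)
    total = comb(population, drawn)
    # Sum the probability mass of every overlap at least as large as the observed one.
    tail = sum(comb(marked, k) * comb(population - marked, drawn - k)
               for k in range(overlap, upper + 1))
    return tail / total
```

A small p-value indicates that the layer's gene set contains more markers of that cell type than expected from the background frequency.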
Finally, we used the localization predictions to identify novel genetic markers for the different cerebellar
layers. Out of the hundreds of new markers, here we describe two examples of genes that were top-
ranked by our classifier in two layers. The first, Mitogen-activated protein kinase kinase 6 (Map2k6), was
the first-ranked gene in the white matter. Its cerebellar expression pattern, depicted in Figure 4.8, shows
it is indeed clearly localized to the white matter. Map2k6 is a member of the Map kinase signal
transduction pathways, and is thus involved in cell proliferation and growth. It has been shown that the
human ortholog of Map2k6 is activated in the cerebellum in response to calcium, triggering a signaling
pathway which results in the expression of genes responsible for the survival of newly differentiated
neurons (Mao et al. 1999). Therefore, it is not surprising to find it in the white matter of the cerebellum,
and yet this expression pattern was never previously demonstrated. While Map2k6 is a relatively well-
studied gene, the second example we discuss, Fam107b (3110001A13Rik), ranked 6th by the Purkinje
layer detector, has little to no associated information. This gene shows a strong, localized expression in
the Purkinje layer (Figure 4.8B). Moreover, its expression is also largely specific to the cerebellum (Figure
4.8C).
Figure 4.8: Examples of novel genetic markers. Non-masked ISH images showing cerebellar expression of Map2k6 (A) and
Fam107b (B). These are the raw images before the application of the expression mask. The expression of actual labeled mRNA
target transcripts is marked with dark spots. (C) Whole-brain ISH image for Fam107b. Fam107b shows strong, highly localized
expression in the Purkinje layer of the cerebellum.
Chapter 5: Patterns of RNA editing in the brain
5.1 Introduction
The previous chapters discuss spatio-temporal patterns of mRNA expression in the developing and adult
brain. After transcription, mRNA is subjected to post-transcriptional modifications that may change its
properties, leading even to changes in protein structure during translation. One such modification is A-to-I
RNA editing by adenosine deaminases acting on RNA (ADARs), a post-transcriptional modification of pre-mRNA
that is essential for normal life and development in vertebrates (Nishikura 2010; Bass 2002; Savva,
Rieder, and Reenan 2012). Editing changes the sequences of encoded RNA, thus contributing to proteomic
and phenotypic diversity. To this day, thousands of human genes have been shown to be subject to A-to-
I RNA editing within their untranslated regions and introns (Bazak et al. 2013; Bahn et al. 2012; Li et al.
2009; Park et al. 2012; Ramaswami et al. 2012; Ramaswami et al. 2013; Peng et al. 2012). In primates,
these editing events take place mainly within Alu repeats (Kim et al. 2004; Levanon et al. 2004;
Athanasiadis, Rich, and Maas 2004; Blow et al. 2004), which are primate-specific, 300bp-long elements
that comprise about 10% of the human genome. Importantly, editing has been shown to operate in genes
encoding synaptic proteins or important neuromodulators, suggesting that editing may have an important
role in tuning molecular functions in brain regions (Burns et al. 1997; Sanjana et al. 2012). Indeed,
known phenotypic effects of editing from Caenorhabditis elegans and Drosophila melanogaster to Mus
musculus are related to neural systems and behavior (Palladino et al. 2000; Tonkin et al. 2002; Higuchi et
al. 2000). In addition, editing was found to be dysregulated in several diseases, mainly related to the
neural system (Eran et al. 2012; Silberberg et al. 2012; Chen et al. 2013b).
Although most human genes have been shown to undergo editing (Bazak et al. 2013; Ramaswami and Li
2014), the exact role of RNA editing is still unclear, and various functions have been proposed to explain
its operation (Nishikura 2010). It has been proposed that 3' UTR editing may play a role in gene silencing
(Nishikura 2010); in augmenting or counteracting the RNAi mechanism (Nishikura 2010), and as an anti-
retroelement mechanism (Levanon et al. 2005). It has also been suggested that heavily-edited mRNA
transcripts are retained in the nucleus (Chen and Carmichael 2009; Prasanth et al. 2005; Zhang and
Carmichael 2001; Scadden 2005; Scadden and O’Connell 2005), or induce inosine-specific degradation of
the edited transcripts by Tudor-SN nuclease (Scadden 2005; Scadden and Smith 2001a). Moreover,
hyper-edited transcripts were even shown to down-regulate gene expression in trans (Scadden 2007). Another
way in which editing might regulate gene expression in humans is through modification of micro-RNA
(miRNA) targets within 3' Alu elements (Liang and Landweber 2007) and alteration of splicing
enhancer/silencer recognition sites (Lev-Maor et al. 2007). A common effect of all these proposed
mechanisms is that editing of a target gene is expected to reduce its expression. A direct prediction
stemming from this hypothesis is that expression of edited genes will be negatively correlated across
conditions with the expression of ADARs.
The above experimental findings seem to conflict with the abundance of editing targets in the human
genome in terms of the possible effects of RNA editing on expression. On one hand, as pointed out above,
editing was demonstrated to have a dramatic impact on inosine-containing transcripts. On the other hand,
if editing determined the fate of mRNA, it would have an overly massive effect on the human transcriptome.
This is because a large fraction of human transcripts contain double-stranded RNA structures formed by
Alus (Kim et al. 2004; Levanon et al. 2004; Bazak et al. 2013; Athanasiadis, Rich, and Maas 2004; Blow et
al. 2004), which are ideal ADAR targets, and therefore editing would impact a large fraction of human genes.
Moreover, since the rapid invasion of Alus into the genome is mostly specific to primates, evolution has
had only a short period to adapt to this recent increase in edited targets.
To address these two possibly conflicting views, the current work aims to chart co-expression patterns of
ADARs and their potential Alu editing targets in the human brain, using two large sets of mRNA expression
from postmortem brains. Surprisingly, when considering the correlation structure of ADAR and its targets
along development, we do not find evidence supporting the expected global negative correlation, since
the distribution of correlations is often bi-modal: ADAR is positively correlated with most of its targets,
and negatively correlated with other target genes. Our results suggest that in the course of primate
evolution, with the massive editing associated with Alu, editing-related mechanisms for gene regulation
were probably adjusted in such a way that their negative regulation of edited genes has changed.
5.2 Results
To characterize the spatial expression of ADAR (ADAR1) and ADARB1 (ADAR2) in the brain and how their
expression correlates with their potential editing targets, we analyzed genome-wide expression
measurements from two sources: a dataset containing 3702 samples from 6 adult human brains (Website:
©2012 Allen Institute for Brain Science. Allen Human Brain Atlas [Internet]. Available from:
http://human.brain-map.org/), and a dataset measured from 57 brains over development (Kang et al.
2011) (see Methods for details on both datasets). In the results below, we refer to them as ABA-2013 and
Kang-2011, respectively.
5.2.1 ADAR and ADARB1 expression in the brain
As a first step to characterize the expression of ADAR and ADARB1 in the human brain, we looked at their
pattern of expression across the major brain regions. Figure 5.1 shows the average expression over the
six adult brains in three consecutive coronal slices. ADAR expression is enriched mostly in sub-cortical
regions, the claustrum, pons and medulla oblongata, but also the cingulate gyrus. This expression pattern
is consistent with previous reports that editing targets HTR2C, the gene that codes for a serotonin receptor
that is expressed in sub-cortical regions, but not HTR2A which codes for a receptor in the same family
which is expressed in the cortex. ADARB1 expression is enriched particularly in highly functional regions
such as the cerebellar cortex, pons and thalamus. Over-expression of both ADARs in the pons is consistent
with a previous finding of high editing levels in this region in the rat brain (Paschen and Djuricic 1994).
Interestingly, the expression levels of ADAR were in general not exceptionally high in the neocortex, the
brain area that is dramatically oversized in primates and humans specifically.
As discussed above, RNA editing of Alu repeats has been suggested as a possible regulatory mechanism,
where switching of Adenosine to Inosine marks mRNA for degradation or nuclear retention (Chen and
Carmichael 2009; Prasanth et al. 2005; Zhang and Carmichael 2001). To examine the hypothesis that RNA
editing serves as a mechanism for down-regulation of gene expression, we calculated the spatial
correlation between ADARs and 7,864 potential editing targets (see Methods for details on how target
and background sets were defined) across brain regions in the ABA-2013 dataset, and 6,834 potential
editing targets in the Kang-2011 dataset. If ADARs edit their targets on a wide scale, and if RNA editing by
ADARs down-regulates their targets, regions with high levels of ADAR and ADARB1 mRNA would show
lower levels of their non-edited targets on average. As a consequence, we would expect to see negative
correlations between ADARs and their potential editing targets.
Figure 5.1. ADAR and ADARB1 expression in the human brain based on the ABA-2013 dataset. Heat map of
normalized mRNA expression in three coronal slices of a human brain. Expression was averaged over six adult brains.
(A) ADAR expression is enriched in the cingulate gyrus - CG, the pons - P, the claustrum - C and the medulla oblongata
- MO. (B) ADARB1 expression is enriched in the thalamus - TH, the pons - P and the cerebellar cortex - CBC. Figures
were created using the brain-expression-visualizer tool available from www.chechiklab.biu.ac.il.
5.2.2 ADAR expression is positively correlated with potential editing targets
We used the Illumina Human Body Map (HBM) RNA-Seq data from a brain sample to identify genes with
edited Alu elements, focusing on edited Alu repeats that reside within genes. We defined a gene as a
target if it contains at least one edited Alu (Kim et al. 2004; Levanon et al. 2004; Bazak et al. 2013;
Athanasiadis, Rich, and Maas 2004; Blow et al. 2004).
We computed the spatial correlation of ADAR and ADARB1 with their potential editing targets, across all
samples in our two datasets. As a baseline for comparison, we also computed the same correlations but
this time with the spatial expression profile of all genes in a background set (12,909 genes in ABA-2013 and
10,731 genes in Kang-2011; see Methods for details on how target and background sets were defined). Figure 5.2A shows the histograms of
correlations between ADAR and the target set (red) and ADAR and the background set (blue). Surprisingly,
the effect observed is opposite to what is predicted by the initial hypothesis. The correlation of ADAR
mRNA levels with the expression of its potential targets is actually more positive than correlations of ADAR
mRNA levels with the background set expression (median Pearson correlation with targets = 0.224,
median Pearson correlation with background = 0.104, Wilcoxon test for different medians z-value = 31.9,
p-value < 8.73*10^-223, n = 20,772, Figure 5.2A). This effect was consistent when we computed non-linear
(rank-based) spatial correlation (median Spearman correlation with targets = 0.219, median Spearman correlation with
background = 0.099, Wilcoxon test for different medians, z-value = 31.3, p-value < 9.92*10^-215, n = 20,772).
There was no significant effect found for the other editing enzyme, ADARB1, and this result is consistent
with the fact that ADAR is considered to be the main gene responsible for Alu editing (Wang et al. 2013;
Riedmann et al. 2008; Bahn et al. 2012).
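The comparison described above can be illustrated with a minimal sketch. It is not the code used in this work: the expression values are made-up toy numbers, the function names are hypothetical, and the Wilcoxon rank-sum statistic is computed with a plain normal approximation without tie correction.

```python
import math
from statistics import median

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def ranksum_z(a, b):
    """Wilcoxon rank-sum z statistic (normal approximation, no tie correction)."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])                       # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd

# Toy expression profiles across five samples (hypothetical values).
adar = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = {
    "t1": [1.1, 2.0, 2.9, 4.2, 5.1],
    "t2": [2.0, 2.5, 3.5, 3.9, 6.0],
    "t3": [0.5, 1.5, 2.0, 3.5, 3.0],
}
background = {
    "b1": [5.0, 4.0, 3.0, 2.0, 1.0],
    "b2": [3.0, 3.1, 2.9, 3.0, 3.05],
    "b3": [4.0, 4.5, 2.0, 1.0, 2.0],
}

target_r = [pearson(adar, profile) for profile in targets.values()]
background_r = [pearson(adar, profile) for profile in background.values()]
z = ranksum_z(target_r, background_r)
print(median(target_r), median(background_r), z)
```

A positive z value here means that correlations with the target set tend to be higher than correlations with the background set, which is the direction reported above.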
To further validate the high spatial correlation between ADAR and its targets, we computed the
distribution of spatial correlations in the second dataset, Kang-2011, which measured spatio-temporal
expression profiles throughout the human brain and at different ages (Kang et al. 2011). Results in this
second dataset were highly consistent with the first dataset: the correlation between ADAR and the set
of edited targets, computed using all the samples regardless of age, was significantly more positive than
the correlation with the background set (median
Pearson correlation with targets = 0.063, median Pearson correlation with background = -0.121, Wilcoxon
test for different medians z-value = 41.2, p-value < 10^-223, n = 17,564; median Spearman correlation with
targets = 0.0567, median Spearman correlation with background = -0.135, Wilcoxon test for different
medians z-value = 41.7, p-value < 10^-250, n = 17,564, Figure 5.2B). The results were also largely consistent
at the gene-to-gene level: the per-gene correlations with ADAR in the two datasets were themselves
strongly correlated (Spearman rho = 0.44, p-value < 10^-16), even though the two datasets used were
measured in different subsets of brain regions.
Figure 5.2 shows the histograms of ADAR correlations with target and background sets. The difference in
the correlations of ADAR and targets versus the background set comes from two sources: a subset of
target genes that have strong positive correlations with ADAR, and also a group of genes that are not
edited but are strongly negatively correlated with ADAR. This “spike” in negative correlations is very
prominent and appears in both datasets. To characterize the highly negatively correlated genes, we
performed a Gene Ontology (GO) enrichment analysis using GOrilla (Eden et al. 2009). In ABA-2013 and
also in Kang-2011, we found that the lists of genes that are negatively correlated with ADAR are highly
enriched for olfactory receptor activity (p < 10^-50 for both datasets).
Figure 5.2. The distribution of spatial correlation values between ADAR and targets (orange) and between ADAR and
a background set (light blue). The results are shown for (A) ABA-2013 dataset (B) Kang-2011 dataset. The two
distributions differ due to two groups of genes: a larger number of target genes have positive correlations with ADAR,
and there also exists a group of genes that do not contain Alus, and thus are not targeted by ADAR, but are strongly negatively
correlated with ADAR.
5.2.3 Effect of Alu location in the gene
Double stranded Alu structures appear in various locations in genes. To test if the strong positive
correlation of ADAR with its putative targets depends on the location of the target in the gene, we
repeated the analysis, but this time separating the targets in each dataset into four groups of genes based on
the location of the Alu repeat (gene counts are given as ABA-2013/Kang-2011): 3'UTR (1,024/878 genes), 5'UTR (92/55 genes), intronic regions
(7,494/6,525 genes) and coding sequences (CDS, 38/37 genes). We accounted for the different sizes of the
groups using bootstrap (see Methods). The spatial-correlation effect was significant in intronic Alus and
in 3'UTR Alus (Figure 5.3). Lack of differences in correlation between editing at the 3’ UTR and introns
argues against global gene regulation by editing at the 3’ UTR. The distribution of correlation values of
ADAR with each of the target groups and the background set is shown in Figure 5.4.
Figure 5.3: Effect vs. Alu location. Boxplot of the log-transformed p-values of a one-sided Wilcoxon test between ADAR correlations with targets versus a background set of genes is plotted against the location of the Alu repeat pairs in the gene (note that Alu in the CDS or 5’ UTR is rare). P-values for the two datasets are pooled and shown together. Error bars encompass data within 1.5 times the inter-quartile range, and the boxes show the lower and upper quartiles together with the median. Outliers are represented as circles. Lack of differences in correlation between editing at the 3’ UTR and introns argues against global gene regulation by editing at the 3’ UTR.
Figure 5.4. The distribution of spatial correlation values between ADAR and targets containing Alus in different
locations (orange) and between ADAR and a background set (light blue). The results are shown for (A) ABA-2013,
intron (B) Kang-2011, intron (C) ABA-2013, 3'UTR (D) Kang-2011, 3'UTR (E) ABA-2013, 5'UTR (F) Kang-2011, 5'UTR (G)
ABA-2013, CDS (H) Kang-2011, CDS.
5.2.4 Specificity of ADAR-target correlations
The difference in ADAR correlations with targets and background genes may not be specific to ADAR. For
instance, if a large group of target genes is highly positively inter-correlated, then many genes, not only
ADAR, would show a strong correlation with that group and as a result, significantly stronger correlation
than with the background set. To test if the difference in correlations is specific to ADAR, we repeated the
above analysis for all genes: for each gene, we calculated the Spearman correlation between the gene's
spatial expression pattern and the expression of the genes from the intronic target and background sets.
We ranked all genes based on the magnitude of this difference, measured as -log10 of the Wilcoxon test
p-value. ADAR is ranked 6 out of 20,773 genes in the ABA-2013 dataset and 22 out of 17,565 genes in
the Kang-2011 dataset. When looking at the intersection of the two sets, ADAR is one of only 10 genes
that are in the top 1% of both sets (10 out of 17,565, top 0.1 percentile). This means that the high
positive correlations of target genes with ADAR are not a common phenomenon in the genome, and this
result is largely specific to ADAR. The other 9 genes include DDX1, a putative RNA helicase which is
implicated in several processes involving alteration of RNA secondary structure (Li, Monckton, and
Godbout 2008) and the interferon receptor IFNAR1. Another gene that shows high correlation with editing
targets in both sets is NF2, which has been suggested to be involved in neural cell development (Lavado
et al. 2013). Brain development has been suggested to be controlled in part by RNA editing (Mehler and
Mattick 2007).
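The ranking and intersection step can be sketched as follows. This is an illustrative example only: the scores are synthetic stand-ins for the -log10 p-values, and the helper names `ranked` and `top_fraction` are hypothetical.

```python
def ranked(scores):
    """Gene names ordered from the highest to the lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def top_fraction(scores, fraction=0.01):
    """Set of genes within the top `fraction` of the ranking (at least one gene)."""
    k = max(1, int(len(scores) * fraction))
    return set(ranked(scores)[:k])

# Synthetic -log10(p) scores for two datasets (hypothetical values).
scores_a = {f"g{i}": float(i) for i in range(300)}
scores_b = {f"g{i}": float(300 - i) for i in range(300)}
scores_a["ADAR"] = 500.0
scores_b["ADAR"] = 299.5

rank_a = ranked(scores_a).index("ADAR") + 1   # 1-based rank in dataset A
rank_b = ranked(scores_b).index("ADAR") + 1   # 1-based rank in dataset B
shared_top = top_fraction(scores_a) & top_fraction(scores_b)
print(rank_a, rank_b, shared_top)
```

Genes that appear in `shared_top` are those whose target-versus-background difference is extreme in both datasets, which is the criterion applied to ADAR above.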
5.2.5 Relation of ADAR-target co-expression and editing potential in targets
Genes contain variable amounts of Alu repeats. If the positive spatial correlation of ADAR with its targets
is functionally meaningful, we would expect to see higher correlations of ADAR with genes that contain
more Alus. Figure 5.6 plots the correlations of intronic target genes with ADAR against the number of Alus
in the same genes. There is a significant positive correlation between the number of Alus that a gene
contains and its correlation with ADAR, in both datasets (Spearman correlation coefficient ρ = 0.084,
p-value = 4*10^-13 for the ABA-2013 dataset; ρ = 0.11, p-value = 4.4*10^-19 for the Kang-2011 dataset). Genes that
contain more Alu repeats tend to be longer, therefore the relation between spatial correlation with ADAR
and the number of Alus could be a side-effect of the increased gene length. To test this, we assembled
two sets of length-matched genes, one from the target set and another from the background set (see
Methods), and computed their correlations with ADAR. The correlations of ADAR with the target set were
strongly positive, as opposed to the correlations with the background set, for both ABA-2013 (median
Pearson correlation with targets = 0.241, median Pearson correlation with background = 0.104, Wilcoxon
test for different medians z-value = 25.1, p-value < 7.27*10^-139, n = 10,054) and Kang-2011 (median Pearson
correlation with targets = 0.065, median Pearson correlation with background = -0.102, Wilcoxon test for
different medians z-value = 27.5, p-value < 9.62*10^-167, n = 8,968). We conclude that the higher positive
correlations of ADAR with its targets are not simply due to the effect of gene length.
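One possible way to assemble length-matched sets is sketched below. The matching procedure actually used is not specified in this text, so the greedy nearest-length pairing, the 10% tolerance, the function name `length_match` and the gene lengths are all assumptions made for illustration.

```python
def length_match(target_len, background_len, tol=0.1):
    """Greedy one-to-one matching of target and background genes by length.

    Each target gene is paired with the closest unused background gene whose
    length differs by at most `tol` (relative to the target length).
    Targets without a suitable partner are dropped.
    """
    used = set()
    pairs = []
    for gene, length in sorted(target_len.items(), key=lambda kv: kv[1]):
        best_name, best_diff = None, None
        for name, other in background_len.items():
            if name in used:
                continue
            diff = abs(other - length)
            if diff <= tol * length and (best_diff is None or diff < best_diff):
                best_name, best_diff = name, diff
        if best_name is not None:
            used.add(best_name)
            pairs.append((gene, best_name))
    return pairs

# Hypothetical gene lengths in base pairs.
target_len = {"T1": 1000, "T2": 50000, "T3": 200000}
background_len = {"B1": 1050, "B2": 48000, "B3": 5000, "B4": 900}

pairs = length_match(target_len, background_len)
print(pairs)
```

In this toy example T3 has no background gene of comparable length and is left out, so both matched sets end up with the same size and a similar length distribution.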
Figure 5.6: 2D histograms of the correlation of genes with ADAR vs. the number of Alu repeats the genes contain.
Positive correlation with ADAR increases with number of Alus. Points with more than 50 Alu repeats were ignored for
easier visualization. The results are shown for (D) ABA-2013 dataset (E) Kang-2011 dataset.
5.2.6 Correlations with ADAR over development
RNA editing has been suggested to be involved in brain development and neurodegeneration (Palladino
et al. 2000; Tonkin et al. 2002; Li and Church 2013; Higuchi et al. 2000). The Kang-2011 dataset is a neural
expression survey measured over development, allowing us to test if the positive ADAR-target correlations
change over time. We examined the dynamics of the correlations over brain development, and found that
spatial correlations of ADAR and its targets are higher than with the background set throughout
development (Figure 5.7). Looking at the distribution of correlations in every time point reveals that for
at least some of the time points, the histograms of correlations between ADAR and targets are bi-modal
(see example time points in Figure 5.8; Figure 5.7 shows results for all time points).
To test the stability of the groups of target genes that are correlated with ADAR, and how these groups
may change across different time points, we calculated the cross-correlation between the lists of
correlations of target genes and ADAR at every two time points (Figure 5.9). We found that the target
genes correlated with ADAR are similar in two embryonic time points (10-13 PCW and 13-16 PCW), and in
most of the adult time points (excluding the last one, 60y+).
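The temporal cross-correlation can be sketched as a Spearman correlation between the per-gene ADAR correlations at every pair of time points. The sketch below uses made-up values for five genes and three stages, hypothetical stage labels as dictionary keys, and the simple rank-difference formula, which assumes there are no tied values.

```python
def rank(values):
    """1-based ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position + 1
    return ranks

def spearman(x, y):
    """Spearman rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Correlation of each target gene with ADAR at each stage (hypothetical values;
# position i in every list refers to the same gene).
corr_by_stage = {
    "10-13 PCW": [0.5, 0.1, -0.3, 0.8, -0.6],
    "13-16 PCW": [0.4, 0.2, -0.2, 0.9, -0.5],
    "60y+": [-0.1, 0.3, 0.6, -0.4, 0.2],
}

stages = list(corr_by_stage)
cross = {
    (s1, s2): spearman(corr_by_stage[s1], corr_by_stage[s2])
    for s1 in stages
    for s2 in stages
}
print(cross[("10-13 PCW", "13-16 PCW")], cross[("10-13 PCW", "60y+")])
```

High values in `cross` indicate that the same target genes are co-expressed with ADAR at both time points; plotting this matrix gives a heatmap of the kind shown in Figure 5.9.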
In order to functionally characterize the bimodal distributions in these two clusters, we pooled together
data from all embryonic time points and all post-natal time points, and performed a GO enrichment
analysis on the positively correlated genes and the negatively correlated ones using GOrilla. The functional
analysis revealed that in the embryonic time points, the genes that are positively correlated with ADAR
are highly enriched for processes such as RNA binding, mRNA processing and gene expression. The
negatively correlated genes are enriched for "ion transport" (FDR q-value < 10^-7). In the post-natal time
points the positively correlated genes and the negatively correlated ones are not enriched for a particular
biological process.
Figure 5.7. The distribution of spatial correlation values between ADAR and targets with intronic Alus (orange) and
between ADAR and a background set (blue), at different developmental time points. Each panel is based on aggregate
measures from 2-7 brains within the designated age group.
Figure 5.8. ADAR-target correlations over development. The distribution of spatial correlation values between ADAR
and targets with intronic Alus (orange) and between ADAR and a background set (blue), at two developmental time
points: (A) 10-13 PCW and (B) 6-12 months.
Figure 5.9: Differential co-expression of ADAR and targets. Heatmap of Spearman correlation rho values showing the
temporal cross-correlation between target gene lists ranked by their correlation with ADAR.
5.3 Methods
5.3.1 The data
We used gene expression data from two sources: the Allen Human Brain Atlas (Website: ©2012 Allen
Institute for Brain Science. Allen Human Brain Atlas [Internet]. Available from:
http://human.brain-map.org/) and Kang-2011 (Kang et al. 2011). Neuroanatomical expression data from the Human Brain
Atlas was averaged across probes. We used the probe-to-gene mappings provided by the Allen Institute.
This averaging yields donor-specific gene-by-region expression profiles that cover between 185 and
348 brain regions per donor, with expression data for 29,176 transcripts. Probes that are not mapped to
genes were discarded, leaving data for 20,773 transcripts. Donor age ranges from 24 to 57 years (more
information available at http://human.brain-map.org/).
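The probe-averaging step can be sketched as follows. The probe identifiers, expression values and the function name `average_probes` are hypothetical, and the real probe-to-gene mapping is the one provided by the Allen Institute.

```python
from collections import defaultdict

def average_probes(probe_expr, probe_to_gene):
    """Average probe-level profiles into one profile per gene.

    probe_expr:    probe id -> list of expression values, one per brain region
    probe_to_gene: probe id -> gene symbol, or None when the probe is unmapped
    """
    grouped = defaultdict(list)
    for probe, profile in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:            # unmapped probes are discarded
            continue
        grouped[gene].append(profile)
    # Average region by region over all probes of the same gene.
    return {
        gene: [sum(column) / len(column) for column in zip(*profiles)]
        for gene, profiles in grouped.items()
    }

# Toy data: four probes measured in three regions (hypothetical values).
probe_expr = {
    "p1": [2.0, 4.0, 6.0],
    "p2": [4.0, 6.0, 8.0],
    "p3": [1.0, 1.0, 1.0],
    "p4": [9.0, 9.0, 9.0],
}
probe_to_gene = {"p1": "ADAR", "p2": "ADAR", "p3": "ADARB1", "p4": None}

gene_expr = average_probes(probe_expr, probe_to_gene)
print(gene_expr)
```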
Gene expression data from the Kang-2011 dataset covers 15 developmental stages across 30 time points.
The number of sampled brain regions ranged from 2 to 16 for each of the 41 donors. The
gene-summarized exon array data contains profiles for 17,565 genes across 1,340 samples.
5.3.2 Choosing target and background sets
We used the Illumina Human BodyMap 2.0 Project (GEO accession number GSE30611, HBM) to find RNA
editing sites within Alu repeats. Genes containing Alu elements that were found to be edited in a brain
sample were included in our target set. The background set was defined as the complementary set of
genes in each dataset.
For the ABA-2013 set, the number of targets is 7,864, and the number of background genes is 12,909. For the
Kang-2011 set, the number of targets is 6,834, and the number of genes in the background set is 10,731.
When splitting the target groups based on the location of the Alu repeats, in the ABA-2013 dataset there are
7,494 genes with intronic Alus, 1,024 genes with Alus in the 3'UTR, 92 genes with Alus in the 5'UTR and 38
genes with Alus in the CDS, and in the Kang-2011 dataset there are 6,525 genes with intronic Alus, 878 genes
with Alus in the 3'UTR, 55 genes with Alus in the 5'UTR and 37 genes with Alus in the CDS. The number of
genes in all target and background sets is summarized in Table 5.1.
                    ABA-2013    Kang-2011
All targets         7,864       6,834
Intronic Alus       7,494       6,525
3’UTR Alus          1,024       878
5’UTR Alus          92          55
CDS Alus            38          37
Background genes    12,909      10,731
Table 5.1. Number of target genes and background genes used in the analyses.
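As a quick consistency check on these counts, the short sketch below verifies that targets and background together account for all genes in each dataset (20,773 and 17,565, as given in section 5.3.1). It also shows that the location-specific groups sum to more than the number of targets, presumably because some genes carry edited Alus in more than one location; that interpretation is an assumption, not something stated in the table.

```python
counts = {
    "ABA-2013": {
        "targets": 7864, "background": 12909, "total": 20773,
        "by_location": {"intron": 7494, "3'UTR": 1024, "5'UTR": 92, "CDS": 38},
    },
    "Kang-2011": {
        "targets": 6834, "background": 10731, "total": 17565,
        "by_location": {"intron": 6525, "3'UTR": 878, "5'UTR": 55, "CDS": 37},
    },
}

excess = {}
for name, c in counts.items():
    # The background is the complement of the target set within each dataset.
    assert c["targets"] + c["background"] == c["total"]
    # Group memberships counted beyond one per target gene.
    excess[name] = sum(c["by_location"].values()) - c["targets"]

print(excess)
```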
5.3.3 Testing ADAR-target correlations at different Alu locations
To take into account the different sizes of target groups when split according to Alu location (CDS, intron,
3'UTR and 5'UTR), we applied a bootstrap approach: we sampled subsets of targets of the size of the
smallest group, the CDS set, from all other groups 1,000 times, and calculated a p-value for each sample.
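The bootstrap procedure can be sketched as below. Several details are assumptions made for illustration: subsets are drawn without replacement, the one-sided Wilcoxon p-value uses a normal approximation, the correlation values are synthetic, and only 100 resamples are drawn to keep the example fast (the analysis above used 1,000).

```python
import math
import random

def one_sided_p(sample, background):
    """One-sided rank-sum p-value (normal approximation) for the alternative
    that values in `sample` tend to exceed values in `background`."""
    n1, n2 = len(sample), len(background)
    # Mann-Whitney U: pairs where the sample value wins; ties count one half.
    u = sum((a > b) + 0.5 * (a == b) for a in sample for b in background)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 0.5 * math.erfc(z / math.sqrt(2))

def bootstrap_neg_log_p(group, background, size, n_boot=1000, seed=0):
    """-log10 p-values for `n_boot` random subsets of `group` with `size` genes."""
    rng = random.Random(seed)
    return [
        -math.log10(one_sided_p(rng.sample(group, size), background))
        for _ in range(n_boot)
    ]

# Synthetic correlations with ADAR (hypothetical values and group sizes).
rng = random.Random(1)
background = [rng.gauss(0.10, 0.2) for _ in range(200)]
groups = {
    "intron": [rng.gauss(0.22, 0.2) for _ in range(500)],
    "3'UTR": [rng.gauss(0.22, 0.2) for _ in range(120)],
    "CDS": [rng.gauss(0.22, 0.2) for _ in range(38)],
}

smallest = min(len(values) for values in groups.values())
scores = {
    name: bootstrap_neg_log_p(values, background, smallest, n_boot=100)
    for name, values in groups.items()
}
```

Because every subset has the same size, the resulting distributions of -log10 p-values are comparable across Alu locations, which is what a boxplot such as Figure 5.3 summarizes.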
5.3.4 Functional analysis of gene sets
To functionally characterize the target genes negatively and positively co-expressed with ADAR, we
calculated the spatial correlation of each target gene with ADAR in the Kang-2011 dataset at each time point. We
ranked the genes by these correlations in both ascending and descending order, separately for embryonic and
post-natal time points, and performed a Gene Ontology (GO) enrichment analysis on the ranked gene sets using
GOrilla (Eden et al. 2009).
5.4 Discussion
The current chapter addresses the question of what genome-wide impact RNA A-to-I editing may have on
expression in the brain. We aimed to resolve an apparent conflict: On one hand, it has been shown that
in some cases editing could dramatically impact expression of genes. On the other hand, the unique
abundance of editing targets in human genes would mean that if editing affects the expression of all its
targets, it would lead to massive expression changes.
Using two datasets that measured gene expression in multiple locations in human brains, we computed
the spatial correlation between the expression profile of ADARs and their known targets (Bazak et al.
2013). Surprisingly, we found that the distribution of correlations in many brain samples was bi-modal:
while some genes were negatively correlated with ADAR1 as expected, many targets of ADAR were
actually positively correlated with ADAR1 (but not ADAR2). This is somewhat surprising because it is
believed that edited genes would be down regulated in the presence of ADAR. The group of positively
correlated genes was enriched for functions including RNA processing, suggesting that ADAR operates as
part of wide RNA regulation mechanisms. This is in agreement with the fact that ADAR is located in the
spliceosome and is known to interact with multiple proteins involved in RNA processing (Weissbach and
Scadden 2012; Scadden and Smith 2001b; Agranat et al. 2008; Ota et al. 2013; Warf et al. 2012; Wang et
al. 2013; Heale et al. 2009; Nie et al. 2005; Raitskin et al. 2001; Nishikura 2010).
The spatial correlations between ADAR1 and a baseline set of genes were significantly more negative than
those between ADAR1 and its targets (p-value < 10^-90), and this was consistent across the two datasets that we analyzed. Interestingly,
the distribution of correlations changes during development, and the correlation profile differs significantly
before and after birth. This is in agreement with the fact that the editing levels of some key targets of
ADAR, such as genes coding for GluR5, GluR6 and Gabra3 receptors, have been shown to change
significantly along development (Dillman et al. 2013; Bernard and Khrestchatisky 1994; Hanrahan et al.
2000; Rula et al. 2008; Ohlson et al. 2007).
We controlled for several potential biases. First, genes that contain Alus tend to be longer, since Alu
insertions lengthen a gene (making it even more prone to Alu insertion). We tested if gene length
could lead to a bias in expression correlation but found no such effect.
Second, most Alus are located in introns, while most edited transcripts that were studied undergo editing
in their 3’ UTR. We found a similar distribution of spatial-correlations in genes, regardless of editing
location (3’ UTR, 5’ UTR or introns). Third, to verify that the positive correlations we observed do not reflect
an epiphenomenon of genome-wide expression changes between brain regions, we computed the
correlations between ADAR targets and all genes. ADAR itself was highly ranked in this list (ranked 14,
p-value < 0.001), suggesting that the correlations we observe are largely ADAR-specific.
These results suggest that RNA editing in the human brain does not lead to consistent and wide alterations
in expression. This is in agreement with the idea that if editing were to lead to expression reduction in
primates, its effects would be overly massive since Alus are abundant in the primate genome. Such an
effect could have been magnified even further, since it has been shown that introducing hyper-edited
transcripts into the nucleus of Xenopus cells leads to reduction of transcription, which is not specific to
the hyperedited transcript (in trans) (Scadden 2007).
How robust are these results with respect to the set of target genes we tested? It has recently become clear
that the majority of human genes undergo editing. Here we defined the set of positive targets to contain
only genes where editing was observed, and the set of negatives as genes that do not contain Alu. While
it is possible that more genes would be shown to be edited, hence growing the positive set, the set of
positives is already comprehensive, containing 6-7K genes in the two datasets. We therefore expect the
results to be insensitive to adding more positive genes.
The above results are based on separating genes into two groups: edited and non-edited genes. Today, it
is still costly to measure the actual editing levels at a genome scale in each specific tissue. This is because
editing in Alu typically occurs at less than 1 percent per adenosine (Bazak et al. 2013), hence estimating
editing levels requires large coverage. We expect that these types of measurements will become feasible
in the near future, and could clarify the more detailed relation between editing and expression.
Furthermore, to obtain an accurate measure of the relation between expression and editing, one wishes
to measure both in single cells. Excitingly, new technologies now allow the extraction of RNA from single cells,
and are expected to shed more light on the relation between RNA editing and gene expression.
The above results suggest that editing does not necessarily lead to expression reduction on a large scale,
but leave important questions open. Foremost, what molecular mechanisms prevent expression reduction of
edited transcripts, and what could be the implications of the increased diversity of transcripts following
editing (Barak et al. 2009; Mattick and Mehler 2008; Paz-Yaacov et al. 2010)?
Concluding remarks
In recent years, there has been an explosion in the availability of neural datasets on a genomic scale: gene
expression has been measured for many conditions, time points, species and regions. There are several
methods to measure gene expression, yielding different types of datasets, from regionalized expression
profiles to high resolution images of expression maps. The different types of data call for the development
of specialized tools and computational methods to analyze them. The goal of the work presented in this
dissertation was to develop and implement such tools in order to gain understanding into brain
organization and function, but also into single gene function in the context of neural function and disease.
Towards this goal, we started by analyzing spatio-temporal patterns of expression in the brain, focusing
mainly on the question of dynamics of inter-region dissimilarities over development. We found that while
the brain develops prenatally and regions become more functionally specialized, expression variation
between the regions actually decreases, reaching a low point around the time of birth. Then, following
birth, variation in regional expression profiles increases again. A functional analysis of the genes
responsible for this “hourglass” shaped divergence profile revealed that the biological processes driving
the large prenatal variation are related to nervous system construction, and the processes driving the
post-natal variation are related to the utilization of the nervous system. We also found that post-natal
specialization in gene expression is most prominent in the cerebellum, an effect that can also clearly be
seen when looking at parallel developmental patterns in human.
The next part of the dissertation focused on the analysis of high resolution ISH images of gene expression
in the mouse brain. The dataset analyzed contains mapping of gene expression at cellular resolution for
every gene in the mouse genome. This vast amount of spatial information comes in the form of images,
and we discuss methods adapted from computer vision to extract information from the images and
represent them. Then, we discuss several implementations of these methods.
First, we present a method to learn functional representations of neural in situ hybridization (ISH) images,
where each image is represented as a point in a low dimensional space whose axes correspond to
meaningful functional annotations, yielding an interpretable measure of similarity between highly
complex images. We successfully infer over 700 functional annotations from neural ISH images, and use
them to detect gene-gene similarities, while providing semantic interpretations for the similarity, enabling
the explainable inference of new gene functions from spatial co-expression. The visual features calculated
for this purpose were also used for the inference of genes related to neural disease such as Parkinson’s
disease and epilepsy.
We then present an approach to identify genes that are primarily expressed in specific brain layers or cell
types, based on analyzing the ISH images. By learning the spatial patterns of a few known cell markers in
the mouse cerebellum, we annotate the expression patterns of hundreds of new genes, and predict the
layers and cell types they are expressed in with very high accuracy (AUC>0.94 for all four cerebellar layers).
Overall, 454 genes are predicted to be primarily expressed in the Purkinje layer, 233 in the granular
layer, 14 in the molecular layer and 16 in the cerebellar white matter.
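As a reminder of what the AUC values quoted above measure, the following minimal Python sketch computes the area under the ROC curve as the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one. The labels and scores are made up for illustration, and the evaluation protocol used in this work may differ in its details.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = known marker of a given layer, 0 = not a marker.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values above 0.94 indicate that marker genes are ranked above non-markers in the large majority of pairs.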
The last part of the dissertation focused on patterns of RNA editing in the human brain. Specifically, we
looked at co-expression patterns of the enzyme ADAR, which is responsible for editing, and its potential
editing targets. We aimed to resolve an apparent conflict: on the one hand, it has been shown that in some
cases editing can dramatically affect the expression of genes. On the other hand, the unique abundance of
editing targets in human genes means that if editing affected the expression of all its targets, it would
lead to massive expression changes. We found that the distribution of correlations in many
brain samples was bimodal: while some genes were negatively correlated with ADAR as expected, many
targets of ADAR were actually positively correlated with it. This is somewhat surprising because
edited genes are believed to be downregulated in the presence of ADAR. The group of positively
correlated genes was enriched for functions including RNA processing, suggesting that ADAR operates as
part of wide RNA regulation mechanisms.
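The kind of co-expression analysis summarized above can be sketched in a few lines of Python: compute the Pearson correlation between ADAR expression and the expression of each candidate target across samples, then split the targets by the sign of the correlation. The expression values and the target names below are hypothetical, and the sketch omits the preprocessing and significance testing that a real analysis would need.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


# Hypothetical expression levels across five samples (illustrative values only).
adar: List[float] = [1.0, 2.0, 3.0, 4.0, 5.0]
targets: Dict[str, List[float]] = {
    "target_up": [2.1, 3.9, 6.2, 8.1, 9.8],
    "target_down": [9.0, 7.2, 5.1, 2.9, 1.0],
}

correlations = {gene: pearson(adar, profile) for gene, profile in targets.items()}
positive = sorted(g for g, r in correlations.items() if r > 0)
negative = sorted(g for g, r in correlations.items() if r < 0)
print(positive, negative)
```

With many targets, a histogram of the values in `correlations` is what would reveal whether the distribution has one mode or two.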
The frameworks and analyses described in this dissertation can be extended in numerous ways. First, as
discussed in the introductory part of the thesis (section 1.1), we would like to be able to analyze protein
expression profiles rather than mRNA expression profiles. mRNA is subject to changes and regulation,
and it reflects only around 30%-40% of actual protein abundance. In the future, better methods for
high-throughput measurement of the proteome are expected to be developed, making the analysis of gene
expression patterns more accurate. Second, the results of many of the analyses presented include functional
predictions for single genes. To make the most of these predictions, an important next step would
be to test the most promising ones in a wet lab.
The methods for analyzing ISH images of gene expression can also be applied to additional datasets; for
example, human brain ISH images are now partially available. Results from the analyses presented
in Chapter 4 suggest that while ISH images are noisy and hard to analyze, using appropriate
computational methods to represent and classify them can yield an abundance of new biological
information. In section 4.1.6 we saw that a large fraction of the functional information resides, surprisingly, in
the smallest patterns of expression, and even in cell shapes. A natural extension of this idea would be to
develop methods that capture these cell shapes and cell distributions in an optimal way.
An important goal in the field of neurogenomics is the accurate mapping of gene expression profiles to
specific cell types and species, at higher temporal resolution and even for subcellular locations. Acquiring
accurate maps of expression will make it possible to link brain structure, evolution, development and, most
importantly, brain functionality. In recent years there has been much focus on methods that map connectivity
between brain regions and even individual neurons, and on attempts to create artificial models of neural function
through very large-scale modelling of electrical activity in the brain. However, what underlies all structural
properties of the brain and its continuous electrical activity is the information coded in our genome and
implemented by the differential expression of genes. The continued availability of more specific and
accurate expression data will surely help shed light on brain functionality and complex neural processes
in the healthy and diseased brain.
References
Agranat, Raitskin, Sperling, and Sperling. 2008. “The Editing Enzyme ADAR1 and the mRNA Surveillance Protein hUpf1 Interact in the Cell Nucleus.” Proceedings of the National Academy of Sciences.
Ashburner, Ball, Blake, Botstein, Butler, Cherry, Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology.” Nature Genetics.
Athanasiadis, Rich, and Maas. 2004. “Widespread A-to-I RNA Editing of Alu-Containing mRNAs in the Human Transcriptome.” PLoS Biology.
Auer, and Doerge. 2010. “Statistical Design and Analysis of RNA Sequencing Data.” Genetics.
Bahn, Lee, Li, Greer, Peng, and Xiao. 2012. “Accurate Identification of A-to-I RNA Editing in Human by Transcriptome Sequencing.” Genome Research.
Barak, Levanon, Eisenberg, Paz, Rechavi, Church, and Mehr. 2009. “Evidence for Large Diversity in the Human Transcriptome Created by Alu RNA Editing.” Nucleic Acids Research.
Bass. 2002. “RNA Editing by Adenosine Deaminases That Act on RNA.” Annual Review of Biochemistry.
Bayer, Altman, Russo, and Zhang. 1993. “Timetables of Neurogenesis in the Human Brain Based on Experimentally Determined Patterns in the Rat.” Neurotoxicology.
Bazak, Haviv, Barak, Jacob-Hirsch, Deng, Zhang, Isaacs, et al. 2013. “A-to-I RNA Editing Occurs at over a Hundred Million Genomic Sites, Located in a Majority of Human Genes.” Genome Research.
Becker, Barnes, Bright, and Wang. 2004. “The Genetic Association Database.” Nature Genetics.
Benjamini, and Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society Series B.
Bernard, and Khrestchatisky. 1994. “Assessing the Extent of RNA Editing in the TMII Regions of GluR5 and GluR6 Kainate Receptors during Rat Brain Development.” Journal of Neurochemistry.
Bishop. 2006. Pattern Recognition and Machine Learning. Edited by Jordan, Kleinberg, and Schölkopf. Pattern Recognition. Information Science and Statistics.
Blow, Futreal, Wooster, and Stratton. 2004. “A Survey of RNA Editing in Human Brain.” Genome Research.
Boguski, and Jones. 2004. “Neurogenomics: At the Intersection of Neurobiology and Genome Sciences.” Nature Neuroscience.
Bohland, Bokil, Pathak, Lee, Ng, Lau, Kuan, Hawrylycz, and Mitra. 2010. “Clustering of Spatial Gene Expression Patterns in the Mouse Brain and Comparison with Classical Neuroanatomy.” Methods.
Bosch, Zisserman, and Munoz. 2006. “Scene Classification via pLSA.” Edited by Leonardis, Bischof, and Pinz. Analysis, Lecture Notes in Computer Science.
Bosch, Zisserman, and Munoz. 2007. “Image Classification Using Random Forests and Ferns.” IEEE 11th International Conference on Computer Vision (2007).
Briscoe, Sussel, Serup, Hartigan-O’Connor, Jessell, Rubenstein, and Ericson. 1999. “Homeobox Gene Nkx2.2 and Specification of Neuronal Identity by Graded Sonic Hedgehog Signalling.” Nature.
Bryant, Subrahmanyan, Tworoger, LaTray, Liu, Li, van den Engh, and Ruohola-Baker. 1999. “Characterization of Differentially Expressed Genes in Purified Drosophila Follicle Cells: Toward a General Strategy for Cell Type-Specific Developmental Analysis.” Proceedings of the National Academy of Sciences.
Burns, Chu, Rueter, Hutchinson, Canton, Sanders-Bush, and Emeson. 1997. “Regulation of Serotonin-2C Receptor G-Protein Coupling by RNA Editing.” Nature.
Buss, and Oppenheim. 2004. “Special Review Based on a Presentation Made at the 16th International Congress of the IFAA Role of Programmed Cell Death in Normal Neuronal Development and Function.” Anatomical Science International.
Cahoy, Emery, Kaushal, Foo, Zamanian, Christopherson, Xing, et al. 2008. “A Transcriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Understanding Brain Development and Function.” The Journal of Neuroscience.
Cavodeassi, and Houart. 2012. “Brain Regionalization: Of Signaling Centers and Boundaries.” Developmental Neurobiology.
Challacombe, Snow, and Letourneau. 1996. “Actin Filament Bundles Are Required for Microtubule Reorientation during Growth Cone Turning to Avoid an Inhibitory Guidance Cue.” Journal of Cell Science.
Chen, and Carmichael. 2009. “Nuclear Retention of mRNAs Containing Inverted Repeats in Human Embryonic Stem Cells: Functional Role of a Nuclear Noncoding RNA.” Molecular Cell.
Chen, Cheng, Grennan, Pibiri, Zhang, Badner, Gershon, and Liu. 2013a. “Two Gene Co-Expression Modules Differentiate Psychotics and Controls.” Molecular Psychiatry.
Chen, Li, Lin, Chan, Chow, Song, Liu, et al. 2013b. “Recoding RNA Editing of AZIN1 Predisposes to Hepatocellular Carcinoma.” Nature Medicine.
Chizhikov, Lindgren, Currle, Rose, Monuki, and Millen. 2006. “The Roof Plate Regulates Cerebellar Cell-Type Specification and Proliferation.” Development.
Choi, Yu, Yoo, and Kim. 2005. “Differential Coexpression Analysis Using Microarray Data and Its Application to Human Cancer.” Bioinformatics.
Clancy, and Darlington. 2001a. “Translating Developmental Time across Mammalian Species.” Neuroscience.
Coelho, Peng, and Murphy. 2010. “Quantifying the Distribution of Probes between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics.
Colantuoni, Lipska, Ye, Hyde, Tao, Leek, Colantuoni, et al. 2011. “Temporal Dynamics and Genetic Control of Transcription in the Human Prefrontal Cortex.” Nature.
Csurka, and Dance. 2004. “Visual Categorization with Bags of Keypoints.” Proc. of ECCV International Workshop on Statistical Learning in Computer Vision.
Datson, van der Perk, de Kloet, and Vreugdenhil. 2001. “Expression Profile of 30,000 Genes in Rat Hippocampus Using SAGE.” Hippocampus.
Davis, and Eddy. 2009. “A Tool for Identification of Genes Expressed in Patterns of Interest Using the Allen Brain Atlas.” Bioinformatics.
De Jong, Boks, Fuller, Strengman, Janson, de Kovel, Ori, et al. 2012. “A Gene Co-Expression Network in Whole Blood of Schizophrenia Patients Is Independent of Antipsychotic-Use and Enriched for Brain-Expressed Genes.” PloS One.
De la Fuente. 2010. “From ‘Differential Expression’ to ‘Differential Networking’ - Identification of Dysfunctional Regulatory Networks in Diseases.” Trends in Genetics.
Deng, Berg, and Fei-Fei. 2011. “Hierarchical Semantic Indexing for Large Scale Image Retrieval.” CVPR 2011.
Dickson. 2002. “Molecular Mechanisms of Axon Guidance.” Science.
Dillman, Hauser, Gibbs, Nalls, McCoy, Rudenko, Galter, and Cookson. 2013. “mRNA Expression, Splicing and Editing in the Embryonic and Adult Mouse Cerebral Cortex.” Nature Neuroscience.
Domazet-Lošo, and Tautz. 2010a. “A Phylogenetically Based Transcriptome Age Index Mirrors Ontogenetic Divergence Patterns.” Nature.
Eden, Navon, Steinfeld, Lipson, and Yakhini. 2009. “GOrilla: A Tool for Discovery and Visualization of Enriched GO Terms in Ranked Gene Lists.” BMC Bioinformatics.
Eran, Li, Vatalaro, McCarthy, Rahimov, Collins, Markianos, et al. 2012. “Comparative RNA Editing in Autistic and Neurotypical Cerebella.” Molecular Psychiatry.
Fan, Chang, Hsieh, Wang, and Lin. 2008. “LIBLINEAR: A Library for Large Linear Classification.” The Journal of Machine Learning Research.
Foss, Radulovic, Shaffer, Ruderfer, Bedalov, Goodlett, and Kruglyak. 2007. “Genetic Basis of Proteome Variation in Yeast.” Nature Genetics.
French, and Pavlidis. 2011. “Relationships between Gene Expression and Brain Wiring in the Adult Rodent Brain.” PLoS Computational Biology.
Frise, Hammonds, and Celniker. 2010. “Systematic Image-Driven Analysis of the Spatial Drosophila Embryonic Expression Landscape.” Molecular Systems Biology.
Fu, Keurentjes, Bouwmeester, America, Verstappen, Ward, Beale, et al. 2009. “System-Wide Molecular Evidence for Phenotypic Buffering in Arabidopsis.” Nature Genetics.
Gaiteri, Ding, French, Tseng, and Sibille. 2014. “Beyond Modules and Hubs: The Potential of Gene Coexpression Networks for Investigating Molecular Mechanisms of Complex Brain Disorders.” Genes, Brain, and Behavior.
Gewin. 2005. “A Golden Age of Brain Exploration.” PLoS Biology.
Ghandour, Vincendon, and Gombos. 1980. “Astrocyte and Oligodendrocyte Distribution in Adult Rat Cerebellum: An Immunohistological Study.” Journal of Neurocytology.
Ghazalpour, Bennett, Petyuk, Orozco, Hagopian, Mungrue, Farber, et al. 2011. “Comparative Analysis of Proteome and Transcriptome Variation in Mouse.” PLoS Genetics.
Gillis, and Pavlidis. 2011. “The Role of Indirect Connections in Gene Networks in Predicting Function.” Bioinformatics.
Grange, Bohland, Okaty, Sugino, Bokil, Nelson, Ng, Hawrylycz, and Mitra. 2014. “Cell-Type-Based Model Explaining Coexpression Patterns of Genes in the Brain.” Proceedings of the National Academy of Sciences.
Grauman, and Darrell. 2005. “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features.” Tenth IEEE International Conference on Computer Vision (ICCV ’05), Volume 1.
Gray, Fu, Luo, Zhao, Yu, Ferrari, Tenzen, et al. 2004. “Mouse Brain Organization Revealed through Direct Genome-Scale TF Expression Analysis.” Science.
Hamosh, Scott, Amberger, Bocchini, and McKusick. 2005. “Online Mendelian Inheritance in Man (OMIM), a Knowledgebase of Human Genes and Genetic Disorders.” Nucleic Acids Research.
Hanrahan, Palladino, Ganetzky, and Reenan. 2000. “RNA Editing of the Drosophila Para Na(+) Channel Transcript. Evolutionary Conservation and Developmental Regulation.” Genetics.
Hawrylycz, Lein, Guillozet-Bongaarts, Shen, Ng, Miller, van de Lagemaat, et al. 2012. “An Anatomically Comprehensive Atlas of the Adult Human Brain Transcriptome.” Nature.
Hawrylycz, Ng, Page, Morris, Lau, Faber, Faber, et al. 2011. “Multi-Scale Correlation Structure of Gene Expression in the Brain.” Neural Networks: The Official Journal of the International Neural Network Society.
Heale, Keegan, McGurk, Michlewski, Brindle, Stanton, Caceres, and O’Connell. 2009. “Editing Independent Effects of ADARs on the miRNA/siRNA Pathways.” The EMBO Journal.
Henry, and Hohmann. 2012. “High-Resolution Gene Expression Atlases for Adult and Developing Mouse Brain and Spinal Cord.” Mammalian Genome.
Higuchi, Maas, Single, Hartner, Rozov, Burnashev, Feldmeyer, Sprengel, and Seeburg. 2000. “Point Mutation in an AMPA Receptor Gene Rescues Lethality in Mice Deficient in the RNA-Editing Enzyme ADAR2.” Nature.
Horan, Jang, Bailey-Serres, Mittler, Shelton, Harper, Zhu, Cushman, Gollery, and Girke. 2008. “Annotating Genes of Known and Unknown Function by Large-Scale Coexpression Analysis.” Plant Physiology.
Huang. 2009. “Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification.” 2009 IEEE Conference on Computer Vision and Pattern Recognition.
Hui. 2007. “Brain-Specific Aminopeptidase: From Enkephalinase to Protector against Neurodegeneration.” Neurochemical Research.
Kalinka, Varga, Gerrard, Preibisch, Corcoran, Jarrells, Ohler, Bergman, and Tomancak. 2010a. “Gene Expression Divergence Recapitulates the Developmental Hourglass Model.” Nature.
Kalinka, Varga, Gerrard, Preibisch, Corcoran, Jarrells, Ohler, Bergman, and Tomancak. 2010b. “Gene Expression Divergence Recapitulates the Developmental Hourglass Model.” Nature.
Kanehisa. 2002. “The KEGG Database.” Novartis Foundation Symposium.
Kang, Kawasawa, Cheng, Zhu, Xu, Li, Sousa, et al. 2011. “Spatio-Temporal Transcriptome of the Human Brain.” Nature.
Kawahara, and Kwak. 2005. “Excitotoxicity and ALS: What Is Unique about the AMPA Receptors Expressed on Spinal Motor Neurons?” Amyotrophic Lateral Sclerosis and Other Motor Neuron Disorders: Official Publication of the World Federation of Neurology, Research Group on Motor Neuron Diseases.
Kerrien, Aranda, Breuza, Bridge, Broackes-Carter, Chen, Duesbury, et al. 2012. “The IntAct Molecular Interaction Database in 2012.” Nucleic Acids Research.
Khaitovich, Hellmann, Enard, Nowick, Leinweber, Franz, Weiss, Lachmann, and Pääbo. 2005. “Parallel Patterns of Evolution in the Genomes and Transcriptomes of Humans and Chimpanzees.” Science.
Khaitovich, Muetzel, She, Lachmann, Hellmann, Dietzsch, Steigele, et al. 2004. “Regional Patterns of Gene Expression in Human and Chimpanzee Brains.” Genome Research.
Kim, Kim, Walsh, Kobayashi, Matise, Buyske, and Gabriel. 2004. “Widespread RNA Editing of Embedded Alu Elements in the Human Transcriptome.” Genome Research.
Kirsch, and Chechik. “Human Areal Expression of Most Genes Is Governed by Regionalization.” In preparation.
Kirsch, Liscovitch, and Chechik. 2012. “Localizing Genes to Cerebellar Layers by Classifying ISH Images.” Edited by Ohler. PLoS Computational Biology.
Krauss, Johansen, Korzh, and Fjose. 1991. “Expression Pattern of Zebrafish Pax Genes Suggests a Role in Early Brain Regionalization.” Nature.
Lavado, He, Paré, Neale, Olson, Giovannini, and Cao. 2013. “Tumor Suppressor Nf2 Limits Expansion of the Neural Progenitor Pool by Inhibiting Yap/Taz Transcriptional Coactivators.” Development.
Lazebnik, Schmid, and Ponce. 2006. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.” 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’06), Volume 2.
Lee, Hsu, Sajdak, Qin, and Pavlidis. 2004. “Coexpression Analysis of Human Genes across Many Microarray Data Sets.” Genome Research.
Lee, Weindruch, and Prolla. 2000. “Gene-Expression Profile of the Ageing Brain in Mice.” Nature Genetics.
Lein, Hawrylycz, Ao, Ayres, Bensinger, Bernard, et al. 2007. “Genome-Wide Atlas of Gene Expression in the Adult Mouse Brain.” Nature.
Levanon, Eisenberg, Rechavi, and Levanon. 2005. “Letter from the Editor: Adenosine-to-Inosine RNA Editing in Alu Repeats in the Human Genome.” EMBO Reports.
Levanon, Eisenberg, Yelin, Nemzer, Hallegger, Shemesh, Fligelman, et al. 2004. “Systematic Identification of Abundant A-to-I Editing Sites in the Human Transcriptome.” Nature Biotechnology.
Lev-Maor, Sorek, Levanon, Paz, Eisenberg, and Ast. 2007. “RNA-Editing-Mediated Exon Evolution.” Genome Biology.
Lewis. 1978. “A Gene Complex Controlling Segmentation in Drosophila.” Nature.
Li, Bickel, and Biggin. 2014. “System Wide Analyses Have Underestimated Protein Abundances and the Importance of Transcription in Mammals.” PeerJ.
Li, and Church. 2013. “Deciphering the Functions and Regulation of Brain-Enriched A-to-I RNA Editing.” Nature Neuroscience.
Li, Levanon, Yoon, Aach, Xie, Leproust, Zhang, Gao, and Church. 2009. “Genome-Wide Identification of Human RNA Editing Sites by Parallel DNA Capturing and Sequencing.” Science.
Li, Monckton, and Godbout. 2008. “A Role for DEAD Box 1 at DNA Double-Strand Breaks.” Molecular and Cellular Biology.
Li, Su, Lim, and Fei-Fei. 2010a. “Objects as Attributes for Scene Classification.” 12th European Conference of Computer Vision (ECCV), 1st International Workshop on Parts and Attributes.
Li, Su, Xing, and Fei-Fei. 2010b. “Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification.” Proceedings of the Neural Information Processing Systems (NIPS) 2010.
Liang, and Landweber. 2007. “Hypothesis: RNA Editing of microRNA Target Sites in Humans?” RNA.
Linghu, Snitkin, Hu, Xia, and Delisi. 2009. “Genome-Wide Prioritization of Disease Genes and Identification of Disease-Disease Associations from an Integrated Human Functional Linkage Network.” Genome Biology.
Lowe. 1999. “Object Recognition from Local Scale-Invariant Features.” Proceedings of the Seventh IEEE International Conference on Computer Vision.
Lowe. 2004. “Distinctive Image Features from Scale-Invariant Keypoints.” International Journal of Computer Vision.
Ma, Fode, Guillemot, and Anderson. 1999. “NEUROGENIN1 and NEUROGENIN2 Control Two Distinct Waves of Neurogenesis in Developing Dorsal Root Ganglia.” Genes & Development.
Malisiewicz, Gupta, and Efros. 2011. “Ensemble of Exemplar-SVMs for Object Detection and beyond.” 2011 International Conference on Computer Vision.
Malone, and Oliver. 2011. “Microarrays, Deep Sequencing and the True Measure of the Transcriptome.” BMC Biology.
Manning, and Raghavan. 2009. “An Introduction to Information Retrieval.” Edited by Salas. Online.
Mao, Bonni, Xia, Nadal-Vicens, and Greenberg. 1999. “Neuronal Activity-Dependent Cell Survival Mediated by Transcription Factor MEF2.” Science.
Martínez. 2001. “The Isthmic Organizer and Brain Regionalization.” The International Journal of Developmental Biology.
Mattick, and Mehler. 2008. “RNA Editing, DNA Recoding and the Evolution of Human Cognition.” Trends in Neurosciences.
McGinnis, and Krumlauf. 1992. “Homeobox Genes and Axial Patterning.” Cell.
Mehler, and Mattick. 2007. “Noncoding RNAs and RNA Editing in Brain Development, Functional Diversification, and Neurological Disease.” Physiological Reviews.
Miller, Ding, Sunkin, Smith, Ng, Szafer, Ebbert, et al. 2014. “Transcriptional Landscape of the Prenatal Human Brain.” Nature.
Miller, Robinson, Cleary, and Doe. 2009. “TU-Tagging: Cell Type-Specific RNA Isolation from Intact Complex Tissues.” Nature Methods.
Moens, and Prince. 2002. “Constructing the Hindbrain: Insights from the Zebrafish.” Developmental Dynamics.
Needleman, and Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology.
Ng, Bernard, Lau, Overly, Dong, Kuan, Pathak, et al. 2009. “An Anatomic Gene Expression Atlas of the Adult Mouse Brain.” Nature Neuroscience.
Nie, Ding, Kao, Braun, and Yang. 2005. “ADAR1 Interacts with NF90 through Double-Stranded RNA and Regulates NF90-Mediated Gene Expression Independently of RNA Editing.” Molecular and Cellular Biology.
Nishikura. 2010. “Functions and Regulation of RNA Editing by ADAR Deaminases.” Annual Review of Biochemistry.
O’Connor, and Tessier-Lavigne. 1999. “Identification of Maxillary Factor, a Maxillary Process-Derived Chemoattractant for Developing Trigeminal Sensory Axons.” Neuron.
Ohlson, Pedersen, Haussler, and Ohman. 2007. “Editing Modifies the GABA(A) Receptor Subunit alpha3.” RNA.
Ojala, Pietikainen, and Maenpaa. 2002. “Multiresolution Gray-Scale and Rotation Invariant Texture classification with Local Binary Patterns.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Oliver. 2000. “Guilt-by-Association Goes Global.” Nature.
Ota, Sakurai, Gupta, Valente, Wulff, Ariyoshi, Iizasa, Davuluri, and Nishikura. 2013. “ADAR1 Forms a Complex with Dicer to Promote microRNA Processing and RNA-Induced Gene Silencing.” Cell.
Palladino, Keegan, O’Connell, and Reenan. 2000. “A-to-I Pre-mRNA Editing in Drosophila Is Primarily Involved in Adult Nervous System Function and Integrity.” Cell.
Park, Williams, Wold, and Mortazavi. 2012. “RNA Editing in the Human ENCODE RNA-Seq Data.” Genome Research.
Paschen, and Djuricic. 1994. “Extent of RNA Editing of Glutamate Receptor Subunit GluR5 in Different Brain Regions of the Rat.” Cellular and Molecular Neurobiology.
Paz-Yaacov, Levanon, Nevo, Kinar, Harmelin, Jacob-Hirsch, Amariglio, Eisenberg, and Rechavi. 2010. “Adenosine-to-Inosine RNA Editing Shapes Transcriptome Diversity in Primates.” Proceedings of the National Academy of Sciences.
Peng, Bonamy, Glory-Afshar, Rines, Chanda, and Murphy. 2010. “Determining the Distribution of Probes between Different Subcellular Locations through Automated Unmixing of Subcellular Patterns.” Proceedings of the National Academy of Sciences.
Peng, Cheng, Tan, Kang, Tian, Zhu, Zhang, et al. 2012. “Comprehensive Analysis of RNA-Seq Data Reveals Extensive RNA Editing in a Human Transcriptome.” Nature Biotechnology.
Peng, Long, Zhou, Leung, Eisen, and Myers. 2007. “Automatic Image Analysis for Gene Expression Patterns of Fly Embryos.” BMC Cell Biology.
Ponomarev, Wang, Zhang, Harris, and Mayfield. 2012. “Gene Coexpression Networks in Human Brain Identify Epigenetic Modifications in Alcohol Dependence.” The Journal of Neuroscience.
Prasanth, Prasanth, Xuan, Hearn, Freier, Bennett, Zhang, and Spector. 2005. “Regulating Gene Expression through RNA Nuclear Retention.” Cell.
Pruteanu-Malinici, Mace, and Ohler. 2011. “Automatic Annotation of Spatial Expression Patterns via Sparse Bayesian Factor Models.” Edited by Bader. PLoS Computational Biology.
Puelles, and Rubenstein. 1993. “Expression Patterns of Homeobox and Other Putative Regulatory Genes in the Embryonic Mouse Forebrain Suggest a Neuromeric Organization.” Trends in Neurosciences.
Raitskin, Cho, Sperling, Nishikura, and Sperling. 2001. “RNA Editing Activity Is Associated with Splicing Factors in lnRNP Particles: The Nuclear Pre-mRNA Processing Machinery.” Proceedings of the National Academy of Sciences.
Ramaswami, and Li. 2014. “RADAR: A Rigorously Annotated Database of A-to-I RNA Editing.” Nucleic Acids Research.
Ramaswami, Lin, Piskol, Tan, Davis, and Li. 2012. “Accurate Identification of Human Alu and Non-Alu RNA Editing Sites.” Nature Methods.
Ramaswami, Zhang, Piskol, Keegan, Deng, O’Connell, and Li. 2013. “Identifying RNA Editing Sites Using RNA Sequencing Data Alone.” Nature Methods.
Riedmann, Schopoff, Hartner, and Jantsch. 2008. “Specificity of ADAR-Mediated RNA Editing in Newly Identified Targets.” RNA.
Rong, Wang, and Morgan. 2004. “Identification of Candidate Purkinje Cell-Specific Markers by Gene Expression Profiling in Wild-Type and pcd(3J) Mice.” Brain Research. Molecular Brain Research.
Rula, Lagrange, Jacobs, Hu, Macdonald, and Emeson. 2008. “Developmental Modulation of GABA(A) Receptor Function by RNA Editing.” The Journal of Neuroscience.
Saito, Hirai, and Yonekura-Sakakibara. 2008. “Decoding Genes with Coexpression Networks and Metabolomics - ‘Majority Report by Precogs’.” Trends in Plant Science.
Sandberg. 2000. “From the Cover: Regional and Strain-Specific Gene Expression Mapping in the Adult Mouse Brain.” Proceedings of the National Academy of Sciences.
Sanjana, Levanon, Hueske, Ambrose, and Li. 2012. “Activity-Dependent A-to-I RNA Editing in Rat Cortical Neurons.” Genetics.
Sato, Joyner, and Nakamura. 2004. “How Does Fgf Signaling from the Isthmic Organizer Induce Midbrain and Cerebellum Development?” Development, Growth & Differentiation.
Savva, Rieder, and Reenan. 2012. “The ADAR Protein Family.” Genome Biology.
Scadden. 2005. “The RISC Subunit Tudor-SN Binds to Hyper-Edited Double-Stranded RNA and Promotes Its Cleavage.” Nature Structural & Molecular Biology.
Scadden. 2007. “Inosine-Containing dsRNA Binds a Stress-Granule-like Complex and Downregulates Gene Expression In Trans.” Molecular Cell.
Scadden, and O’Connell. 2005. “Cleavage of dsRNAs Hyper-Edited by ADARs Occurs at Preferred Editing Sites.” Nucleic Acids Research.
Scadden, and Smith. 2001a. “Specific Cleavage of Hyper-Edited dsRNAs.” The EMBO Journal.
Scadden, and Smith. 2001b. “RNAi Is Antagonized by A-->I Hyper-Editing.” EMBO Reports.
Schena, Shalon, Davis, and Brown. 1995. “Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray.” Science.
Schlicker, Domingues, Rahnenführer, and Lengauer. 2006. “A New Measure for Functional Similarity of Gene Products Based on Gene Ontology.” BMC Bioinformatics.
Schwaller. 2012. “The Use of Transgenic Mouse Models to Reveal the Functions of Ca2+ Buffer Proteins in Excitable Cells.” Biochimica et Biophysica Acta.
Silberberg, Lundin, Navon, and Öhman. 2012. “Deregulation of the A-to-I RNA Editing Mechanism in Psychiatric Disorders.” Human Molecular Genetics.
Smedley, Haider, Ballester, Holland, London, Thorisson, and Kasprzyk. 2009. “BioMart--Biological Queries Made Easy.” BMC Genomics.
Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences.
Sunkin, Ng, Lau, Dolbeare, Gilbert, Thompson, Hawrylycz, and Dang. 2012. “Allen Brain Atlas: An Integrated Spatio-Temporal Portal for Exploring the Central Nervous System.” Nucleic Acids Research.
Taga, Miyoshi, Okajima, Matsuda, and Nadano. 2010. “Identification of Heterogeneous Nuclear Ribonucleoprotein A/B as a Cytoplasmic mRNA-Binding Protein in Early Involution of the Mouse Mammary Gland.” Cell Biochemistry and Function.
Takano-Maruyama, Chen, and Gaufo. 2011. “Differential Contribution of Neurog1 and Neurog2 on the Formation of Cranial Ganglia along the Anterior-Posterior Axis.” Developmental Dynamics.
Tau, and Peterson. 2010. “Normal Development of Brain Circuits.” Neuropsychopharmacology: Official Publication of the American College of Neuropsychopharmacology.
Thomas, Lee, Dalton, Nomie, Stoica, Costa-Mattioli, Chang, Nuzhdin, Arbeitman, and Dierick. 2012. “A Versatile Method for Cell-Specific Profiling of Translated mRNAs in Drosophila.” PloS One.
Thompson, Ng, Menon, Martinez, Lee, Glattfelder, Sunkin, et al. 2014. “A High-Resolution Spatiotemporal Atlas of Gene Expression of the Developing Mouse Brain.” Neuron.
Tonkin, Saccomanno, Morse, Brodigan, Krause, and Bass. 2002. “RNA Editing by ADARs Is Important for Normal Behavior in Caenorhabditis Elegans.” The EMBO Journal.
Torkamani, Dean, Schork, and Thomas. 2010. “Coexpression Network Analysis of Neural Tissue Reveals Perturbations in Developmental Processes in Schizophrenia.” Genome Research.
Torresani, Szummer, and Fitzgibbon. 2010. “Efficient Object Category Recognition Using Classemes.” Computer Vision–ECCV 2010.
Vedaldi, and Fulkerson. 2010. “VLFeat - An Open and Portable Library of Computer Vision Algorithms.” Design, MM ’10.
Venter, Adams, Myers, Li, Mural, Sutton, Smith, et al. 2001. “The Sequence of the Human Genome.” Science.
Vidal-Sanz, Bray, Villegas-Pérez, Thanos, and Aguayo. 1987. “Axonal Regeneration and Synapse Formation in the Superior Colliculus by Retinal Ganglion Cells in the Adult Rat.” The Journal of Neuroscience.
Vigil, Cherfils, Rossman, and Der. 2010. “Ras Superfamily GEFs and GAPs: Validated and Tractable Targets for Cancer Therapy?” Nature Reviews Cancer.
Vincent, DeVoss, Ryan, and Murphy. 2002. “Analysis of Neuronal Gene Expression with Laser Capture Microdissection.” Journal of Neuroscience Research.
Voineagu, Wang, Johnston, Lowe, Tian, Horvath, Mill, Cantor, Blencowe, and Geschwind. 2011. “Transcriptomic Analysis of Autistic Brain Reveals Convergent Molecular Pathology.” Nature.
Vollmer, and Clerc. 2002. “Homeobox Genes in the Developing Mouse Brain.” Journal of Neurochemistry.
Walker, Russell, and Hodgetts. 1987. “Is Schizophrenia a Neurodevelopmental Disorder?” British Medical Journal.
Walker, Volkmuth, and Klingler. 1999. “Pharmaceutical Target Discovery Using Guilt-by-Association: Schizophrenia and Parkinson's Disease Genes.” ISMB proceedings 1999.
Wang, So, Devlin, Zhao, Wu, and Cheung. 2013. “ADAR Regulates RNA Editing, Transcript Stability, and Gene Expression.” Cell Reports.
Wang, and Zoghbi. 2001. “Genetic Regulation of Cerebellar Development.” Nature Reviews. Neuroscience.
Ward, McCann, DeWulf, Wu, and Rao. 2003. “Distinguishing between Directional Guidance and Motility Regulation in Neuronal Migration.” The Journal of Neuroscience.
Warf, Shepherd, Johnson, and Bass. 2012. “Effects of ADARs on Small RNA Processing Pathways in C. Elegans.” Genome Research.
Waterston, Lindblad-Toh, Birney, Rogers, Abril, Agarwal, Agarwala, et al. 2002. “Initial Sequencing and Comparative Analysis of the Mouse Genome.” Nature.
Website: ©2012 Allen Institute for Brain Science. Allen Human Brain Atlas [Internet]. Available from: http://human.brain-map.org/.
Website: ©2012 Allen Institute for Brain Science. BrainSpan Atlas of the Developing Human Brain [Internet]. Available from: http://brainspan.org/.
Website: ©2012 Allen Institute for Brain Science. NIH Blueprint Non-Human Primate (NHP) Atlas [Internet]. Available from: http://www.blueprintnhpatlas.org/.
Weissbach, and Scadden. 2012. “Tudor-SN and ADAR1 Are Components of Cytoplasmic Stress Granules.” RNA.
Wingate. 2001. “The Rhombic Lip and Early Cerebellar Development.” Current Opinion in Neurobiology.
Wolf, Goldberg, Manor, Sharan, and Ruppin. 2011. “Gene Expression in the Rodent Brain Is Associated with Its Regional Connectivity.” PLoS Computational Biology.
Wu, Neff, Kalisky, Dalerba, Treutlein, Rothenberg, Mburu, et al. 2014. “Quantitative Assessment of Single-Cell RNA-Sequencing Methods.” Nature Methods.
Zapala, Hovatta, Ellison, Wodicka, Del Rio, Tennant, Tynan, et al. 2005. “Adult Mouse Brain Gene Expression Patterns Bear an Embryologic Imprint.” Proceedings of the National Academy of Sciences.
Zhang, and Carmichael. 2001. “The Fate of dsRNA in the Nucleus: A p54nrb-Containing Complex Mediates the Nuclear Retention of Promiscuously A-to-I Edited RNAs.” Cell.
Zhang, and Goldman. 1996. “Generation of Cerebellar Interneurons from Dividing Progenitors in White Matter.” Neuron.
Zhong, and Sternberg. 2007. “Automated Data Integration for Developmental Biological Research.” Development.
Zirlinger, Lo, McMahon, McMahon, and Anderson. 2002. “Transient Expression of the bHLH Factor Neurogenin-2 Marks a Subpopulation of Neural Crest Cells Biased for a Sensory but Not a Neuronal Fate.” Proceedings of the National Academy of Sciences.