LUDWIG MAXIMILIAN UNIVERSITY OF MUNICH
FACULTY OF PHYSICS
Photometric redshifts from
SDSS images using
convolutional neural networks
and custom Gaussian label
smoothing
A Master’s Thesis by
Benjamin Alber
Supervisor:
Dr. Benjamin P. Moster
Submitted:
September 14, 2020
LUDWIG-MAXIMILIANS-UNIVERSITÄT MÜNCHEN
FACULTY OF PHYSICS
Photometric redshifts from
SDSS galaxy images using
convolutional neural networks
and "Gaussian label smoothing"
A Master's Thesis by
Benjamin Alber
First examiner:
Dr. Benjamin P. Moster
Submitted:
September 14, 2020
Contents

1 Introduction
2 Data
3 Network architectures
  3.1 Input layer
  3.2 Convolutional layer
  3.3 Pooling layer
  3.4 Activation functions
  3.5 Inception block
  3.6 Fully connected layers
  3.7 Dropout layer
  3.8 Output layer
  3.9 Complete architecture
4 Regression and classification
  4.1 Regression
  4.2 Classification
    4.2.1 Binning the data
    4.2.2 Encoding labels and label smoothing
    4.2.3 Returning to numerical values
5 Training
6 CNN results
  6.1 Comparing different input dimensions
  6.2 Comparing class 1 and class 2 method
  6.3 Comparing no smoothing and Gaussian smoothing
  6.4 Comparing regression and classification
7 Conclusion and outlook
Appendices
A CNN results
  A.1 Numeric results
  A.2 Plots 64 regression
  A.3 Plots 32 regression
  A.4 Plots 64 classification 1
  A.5 Plots 64 classification 2
  A.6 Plots 64 classification 1 Gaussian smoothing
  A.7 Plots 64 classification 2 Gaussian smoothing
  A.8 Plots 32 classification 1
  A.9 Plots 32 classification 2
  A.10 Plots 32 classification 1 Gaussian smoothing
  A.11 Plots 32 classification 2 Gaussian smoothing
B My SQL search
C My CNN architecture
References
1 Introduction
Robust distance estimates for galaxies are required in order to maximise cos-
mological information from current and upcoming large scale galaxy surveys.
These distances are inferred via the distance-redshift relation which relates
how the light emitted by a galaxy is stretched due to the expansion of the
universe as it travels from the galaxy to our detectors. This stretching leads
to an energy loss of the photon and a shift towards redder wavelengths, which
is known as the redshift. The further away the galaxy is from us, the longer
the light has been passing through the expanding universe, and the more
it becomes redshifted. Obtaining accurate spectroscopic redshifts, which
measure the redshifted spectral absorption and emission lines, is extremely
time-intensive, whereas obtaining photometric redshifts is much cheaper, but
less accurate.
Two main techniques are traditionally used for photometric redshifts:
template fitting and machine learning algorithms. The template fitting codes
(e.g. [1, 2, 3, 4]) match the broadband photometry of a galaxy to the synthetic
magnitudes of a suite of templates across a large redshift interval. These
methods do not rely on training samples of galaxies. However, they are often
computationally intensive, because a brute-force search must explore the
pre-generated grid of model photometry, and poorly known parameters such as
dust attenuation can lead to degeneracies in color-redshift space.
On the other hand, the machine learning methods (e.g. [5, 6, 7]) were
shown to have similar or better performances when a large spectroscopic
training set is available. However, they are only reliable within the limits of
the training set and the current lack of spectroscopic coverage in some color
space regions and at high redshift remains a major issue for this approach.
Within the standard machine learning approach, the choice of which photometric
input features to train the machine learning architecture on, from the
full list of possible photometric features, still has no definitive answer. Hoyle
et al. (2015) [8] performed an analysis of feature importance for photomet-
ric redshifts, which uses machine learning techniques to determine which of
the many possible photometric features produce the most predictive power.
But by passing the entire galaxy image into a convolutional neural network
(CNN), the manual feature extraction required by previous methods can be
bypassed (e.g. [9, 10, 11, 12]).
A commonly used technique in these works is separating the redshift space
into multiple small bins and performing a classification task. This thesis
evaluates whether this approach is justified and how it could be improved.
It is organized as follows. Section 2 describes the data
acquisition and preparation used in this study. Section 3 describes the dif-
ferent elements and the overall structure of the used CNN networks. Section
4 outlines the different regression and classification approaches that can be
taken for the redshift estimation task with a CNN. Section 5 describes the
data augmentation and training regime. The results are discussed and sum-
marized in sections 6 and 7.
2 Data
The Sloan Digital Sky Survey (hereafter SDSS) is a multiband imaging and
spectroscopic redshift survey using a dedicated 2.5 m telescope at Apache
Point Observatory in New Mexico (USA). It collects deep photometry (r <
22.5) in ugriz bands and makes them publicly available through the SDSS
website [sdss.org]. Data is stored on the SDSS Science Archive Server in
the form of FITS files containing a single 2048 x 1361 pixel (corresponding
to 13.51 x 8.98 arcminutes) "corrected frame" and additional World Coordinate
System (WCS) header information. These images are calibrated in nanomaggies
per pixel and have had a sky subtraction applied. Each frame can be uniquely
identified by its run, camcol and field number. An example frame can be seen
in figure 1.

Figure 1: RGB image from irg bands of frame 6073-4-50 (run-camcol-field)

The galaxy images for this thesis are drawn from the flux-limited
spectroscopic Main Galaxy Sample [13] of SDSS Data Release 16 [14].
For a galaxy to be used in this thesis, it needs to have a spectroscopic
redshift of 0 < zspec ≤ 0.4 (which is mostly already guaranteed by belonging
to the Main Galaxy Sample), a relative error of zerror/zspec < 0.1, and to
satisfy several photometric processing flags recommended by SDSS. These
include the removal of duplicates, objects with deblending problems, objects
with interpolation problems and suspicious detections1.
1 https://www.sdss.org/dr16/algorithms/photo_flags_recommend/
Running the SQL query (see appendix B) on the SDSS CasJobs website2
returns 346,733 galaxies in the redshift range from 0 to 0.3305794. For all
queried galaxies, the following characteristics have been obtained: equatorial
coordinates (RA, Dec), spectroscopic redshift and its error, and the radius
containing 90% of the Petrosian flux [15, 16, 17] for the ugriz bands.
Since observations in the five different filters happen 71.7 seconds apart
from one another in the order r-i-u-z-g3, all images need to be shifted to the
same frame. All SDSS FITS files come with header information about pixel
and world coordinates for a single reference pixel in that specific image. All
images have been shifted so that the world coordinates of their reference
pixel match those of the r band image. Due to the necessary rounding to
integers, positions are precise to ±1 pixel.
In order to keep the training times in check, a random, statistically
relevant subset of 100,000 out of the 346,733 found galaxies was selected for
this work. Pasquet et al. [10], who worked on nearly the same set of galaxies,
used the whole set and achieved excellent results with it. But they also
trained the same network on a subset of 100,000 galaxies and could show that
their results did not change significantly. The reduction of the dataset for
this work is therefore considered unproblematic.
Regarding the image size and bands used, no consensus has been established
by earlier works in this field. Hoyle (2016) [9] used 72x72 pixel RGBA
images, encoding colors (i - z, r - i, g - r) in the RGB layers and r band
magnitudes in the alpha layer, from which 100 random 60x60 pixel stamps were
used per galaxy. D'Isanto & Polsterer (2018) [11] used 28x28 pixel images in
the five SDSS bands as well as all 10 pairwise color combinations as input.
2 https://skyserver.sdss.org/casjobs/
3 https://www.sdss.org/instruments/camera/
Pasquet et al. (2019) [10] used 64x64 pixel images in the five SDSS bands
and no colors.
Just like in Pasquet et al. (2019) [10], the input for this thesis consists
only of ugriz images; no color images are used. Regarding the image size,
however, two different approaches have been examined. The first one follows
Pasquet et al. (2019) [10] and uses 64x64 pixel cutouts around the galaxy
center. If a galaxy is too close to the edge of its frame, it is discarded.
This leaves 99,842 galaxies. Due to the low number of lost galaxies, the
mosaic approach used in Pasquet et al. (2019) [10] to recover such galaxies
was not adopted.
The second approach uses cutouts of varying size around the galaxy, depending
on the apparent size of the galaxy, and then rescales the image to 32x32
pixels. The size of the cutout is calculated by

A = 4 · median(r_petro) / (0.396 arcsec/pixel)    (1)

where r_petro are the Petrosian radii in all five bands and 0.396 arcsec/pixel
is the pixel scale of all SDSS images. Since the Petrosian radii occasionally
contain a large outlier, the median turned out to be a more reliable statistic
than the mean or the maximum value. The idea is to let the important features
of smaller galaxies cover more of the image by effectively zooming in, and
not to cut off features of bigger galaxies by effectively zooming out. In
retrospect, the calculation for A was chosen quite generously, and images
therefore still partially include large areas with little signal (see
figure ??). Figure 2 shows the distribution of the values of A for all
galaxies. Since the highest value is 572, the plot has been cut at 160 for
visual reasons.
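Equation 1 can be sketched in a few lines of NumPy. The band radii below are hypothetical illustration values, not entries from the actual catalogue:

```python
import numpy as np

SDSS_PIXEL_SCALE = 0.396  # arcsec per pixel for all SDSS frames

def cutout_size(petro_r90_arcsec):
    """Cutout width/height A in pixels (equation 1) from the five ugriz
    Petrosian 90% flux radii; the median guards against single-band outliers."""
    return 4 * np.median(petro_r90_arcsec) / SDSS_PIXEL_SCALE

# hypothetical radii in arcsec for the u, g, r, i, z bands
A = cutout_size([21.0, 5.3, 5.5, 5.8, 4.1])  # the u-band value is an outlier
```

Using the median, the outlying u-band radius above has no influence on A, whereas the mean or the maximum would be pulled far upwards by it.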
Again, if a galaxy is too close to the edge of its frame, it is discarded.
In addition, all galaxies with A ≤ 32 are discarded, in order to prevent
artifacts from upscaling the image. This procedure leaves 96,388 galaxies.
More galaxies are lost, but still not enough to have a significant impact and
justify a mosaic approach.

Figure 2: Distribution of image width/height using equation 1 with cutoff at
A = 32 (red) and median at A = 54 (black)
If a galaxy has an A value of ∼ 64, the 32 pixel image is basically a
rescaled version of the original with no zoom of any kind (see top row of
figure 3). For A values smaller than 64, the 32 pixel image results in a
zoomed-in version (see middle row of figure 3). For A values bigger than 64,
the 32 pixel image results in a zoomed-out version (see bottom row of
figure 3). The respective image pairs in figure 3 visualize the corresponding
zoom quite clearly. Despite the generous calculation for A, a good number of
galaxies falls on either side of the boundary at 64, which may justify its
definition.
Figure 4 shows the redshift distribution for the whole set, the galaxies used
for the 64x64 approach and those used for the 32x32 approach. To show that
the corresponding subsets are representative of the whole set, a
Kolmogorov–Smirnov test [18] was conducted. This is a two-sided test for the
null hypothesis that two independent samples are drawn from the same
distribution. The corresponding p-values for every set combination can be
seen in Table 1. The combination of the 64 pixel and 32 pixel images is of
the highest interest, since it guarantees that the two different subsets, and
ultimately the different networks trained with them, can be compared. But the
other two set combinations have been tested for consistency as well. Since
all p-values are high enough, we cannot reject the null hypothesis and can
therefore conclude that all our subsets are drawn from the same distribution.
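The two-sample test described above is available as `scipy.stats.ks_2samp`; a minimal sketch with stand-in data (synthetic arrays, not the actual redshift labels):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# stand-ins for the spectroscopic redshifts of the whole set and one subset;
# the real test would use the actual label arrays
whole_set = rng.gamma(shape=4.0, scale=0.03, size=100_000)  # skewed, like fig. 4
subset = rng.choice(whole_set, size=20_000, replace=False)

stat, p_value = ks_2samp(whole_set, subset)
# a large p-value means the null hypothesis (same distribution) is not rejected
```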
Figure 3: u band comparison images for three different galaxies with given
SDSS bestobjid, zspec and Petrosian image size in pixels, which corresponds
to the value A in figure 2. Rescaled 32 pixel cutouts on the left and
original-sized 64 pixel cutouts on the right.

Set 1          Set 2          KS statistic   p-value
Whole set      64 px images   0.00282        0.56585
Whole set      32 px images   0.00362        0.27554
64 px images   32 px images   0.00346        0.59938

Table 1: Kolmogorov–Smirnov test results for different subset combinations

Figure 4: Redshift distribution of all samples found by the SQL search (blue)
in comparison with all samples used with the 64 pixel architecture (orange)
or the 32 pixel architecture (green)

As the final step in data preparation, a feature standardization makes the
values of each feature (image pixel values in this case) have zero mean and
unit variance via equation 2,

x' = (x − µ) / σ    (2)

where x' is the standardized feature value, x the raw feature value, µ the
mean of all feature values in the training set and σ the standard deviation
of all feature values in the training set. The final data products are
64x64x5 and 32x32x5 data cubes, respectively, consisting of ugriz bands
centered on the galaxy centers.
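Equation 2 can be sketched directly; note that µ and σ come from the training set only and are then applied to every split (the array shapes here are illustrative):

```python
import numpy as np

def standardize(train, *others):
    """Feature standardization (equation 2): mu and sigma are computed per
    pixel on the training set only, then applied to every split."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return tuple((x - mu) / sigma for x in (train, *others))

rng = np.random.default_rng(0)
train, val, test = rng.normal(3.0, 2.0, (3, 50, 32, 32, 5))
train_s, val_s, test_s = standardize(train, val, test)
```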
3 Network architectures
In this section the different elements of the models used for the experiments
are described. A schematic representation of a complete network can be seen
in figure 16 in appendix C.
3.1 Input layer
The input layer in a CNN defines the dimensions of the input that is fed into
the network. It is only worth discussing in this work because the effect of
different input dimensions shall be evaluated. This leads to the first
separation between different networks: one works with a 64x64x5 input, while
the other works with a 32x32x5 input, thereby increasing the number of
networks to evaluate from one to two.
3.2 Convolutional layer
The convolutional layer is the core building block of a CNN. The layer’s
parameters consist of a set of learnable filters (or kernels), which have a
small receptive field, but extend through the full depth of the input volume.
In this case the five SDSS bands define the depth. The number and size of
these filters are chosen beforehand and cannot be changed during training;
they are therefore so-called hyperparameters. During the forward
pass, each filter is convolved across the width and height of the input volume,
computing the dot product between the entries of the filter and the input and
producing a 2-dimensional activation map of that filter. During training, the
kernels, which had been initially populated with random values, are updated
to become progressively relevant to solve the underlying problem. As a result,
the network learns filters that activate when it detects some specific type of
feature at some spatial position in the input [19].
3.3 Pooling layer
Pooling layers reduce the dimensions of the data by combining the outputs
of neuron clusters at one layer into a single neuron in the next layer. Local
pooling combines small clusters, 2x2 in this case. Pooling may compute a
max or an average. Max pooling uses the maximum value from each of a
cluster of neurons at the prior layer, average pooling uses the average value
from each of a cluster of neurons at the prior layer. In all cases, pooling
helps to make the representation become approximately invariant to small
translations of the input. Invariance to translation means that if we translate
the input by a small amount, the values of most of the pooled outputs do
not change [20, 21]. Pasquet et al. (2019) [10] argue that most image pixels
in SDSS images are dominated by background noise, which is why they choose
average pooling over max pooling. The same has been done in this work.
3.4 Activation functions
In order to introduce non-linearity into the network, different non-linear ac-
tivation functions can be used. The most commonly used activation function
is the ReLU (Rectified Linear Unit, [22]), defined by f(x) = max(x, 0). A
problem with the ReLU is the "dying ReLU" problem, where some ReLU neurons
essentially die for all inputs and remain inactive no matter what input is
supplied; no gradient flows through them, and if a large number of dead
neurons are present in a neural network, its performance is affected. This is
a form of the vanishing gradient problem [23] and can be corrected by making
use of what is called a Leaky ReLU [24], which can be defined as

f(x) = x,      if x > 0
       0.01x,  otherwise    (3)

thus causing a "leak" and extending the range of the ReLU. The problem with
the Leaky ReLU is the arbitrarily chosen value of 0.01, which may not suit
different architectures equally and is therefore another hyperparameter that
needs tuning. More recently, He et al. (2015) [25] introduced a Parametric
ReLU (PReLU), defined as

f(x) = x,   if x > 0
       αx,  otherwise    (4)

where α is a trainable parameter instead of a fixed hyperparameter.
Another promising activation function is the Scaled Exponential Linear Unit
(SELU) [26], which has a self-normalizing effect on the neural network. When
using this activation function in practice, one must use "lecun normal"
weight initialization, and if dropout is to be applied, one should use
AlphaDropout. Although both were ensured, the SELU could not outperform the
PReLU. Therefore a PReLU was used for all activations, except for the output
layer of classification networks (see section 3.8).
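The three ReLU variants discussed above differ only in how they treat negative inputs; a minimal NumPy sketch of equations 3 and 4:

```python
import numpy as np

def relu(x):
    """f(x) = max(x, 0): negative inputs give zero output and zero gradient."""
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    """Equation 3: a small fixed slope keeps gradients flowing for x < 0."""
    return np.where(x > 0, x, slope * x)

def prelu(x, alpha):
    """Equation 4: same form, but alpha is a parameter learned in training."""
    return np.where(x > 0, x, alpha * x)
```

In a framework such as Keras, the trainable α is provided by a dedicated `PReLU` layer rather than a fixed function like the sketch above.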
The output layer of the classification networks uses a softmax activation
function. The softmax function outputs a vector of size N, where N is the
number of potential outcomes or classes, representing the probability
distribution over all potential outcomes. The probabilities sum to one, which
is appropriate here since the classes are mutually exclusive.

Figure 5: Inception module with dimension reductions as defined in [27]
3.5 Inception block
An inception layer (figure 5) is a combination of 1x1, 3x3 and 5x5
convolutional layers, whose output filter banks are concatenated into a
single output vector forming the input of the next stage [27]. It allows the
internal layers to pick and choose which filter size is relevant to learn the
required information. The 1x1 convolutions are mainly used for dimensionality
reduction before the expensive 3x3 and 5x5 convolutions. Besides being used
as reductions, they also include another activation layer, which makes them
dual-purpose by introducing additional non-linearity into the network.
Additionally, since pooling operations have been essential for the success of
current state-of-the-art convolutional networks, Szegedy et al. (2015) [27]
suggest that adding an alternative parallel pooling path in each such stage
should have an additional beneficial effect.
3.6 Fully connected layers
The convolutional layers do not make predictions but rather extract
meaningful features from the input image, which are then flattened into a
one-dimensional vector and fed into a series of so-called fully connected
layers. These consist of a number of single neurons, given by another
hyperparameter, each connected to every neuron in the preceding and following
layer, hence the name fully connected. These connections are also initially
populated with random values and updated during training.
3.7 Dropout layer
In order to prevent the network from overfitting, dropout layers [28] were used
between convolutional as well as between fully connected layers. Dropout
is a regularization method that approximates training a large number of
neural networks with different architectures in parallel. During training,
some number of layer outputs are randomly ignored or "dropped out". This has
the effect of making the layer look like, and be treated like, a layer with a
different number of nodes and connectivity to the prior layer. In effect,
each update to a layer during training is performed with a different "view"
of the configured layer.
3.8 Output layer
Depending on the underlying problem a machine learning algorithm has to
solve, it is usually a clear choice whether it is a regression or a
classification problem, which then dictates the size of the output layer. In
this case, however, the obvious regression task can be turned into a
classification task by separating the output space into multiple bins (see
section 4.2). For the regression version of the networks, the output layer
consists of a fully connected layer with a single neuron and a PReLU
activation layer. For the classification version it consists of a fully
connected layer with 110 neurons and a softmax activation layer [29]; the
110 neurons correspond to the number of classes, as explained in
section 4.2.1.
3.9 Complete architecture
In general, the networks are composed of a first convolutional layer, a
pooling layer, three inception blocks, a dropout layer, followed by three
fully connected layers each with dropout, and finally the output layer. A
schematic representation can be seen in figure 16 in appendix C. For the 64
version, the first convolution and pooling layer scale down the inputs, while
in the 32 version "same" padding was used in order to keep the dimensions
comparable. The output layers differ for regression and classification as
described in section 3.8, increasing the number of networks to evaluate from
two to four.
4 Regression and classification
The following sections outline the different regression and especially
classification approaches that can be taken for the redshift estimation task.

4.1 Regression

In order to treat this problem as the regression task that it fundamentally
is, the output layer must consist of a fully connected layer with only one
output unit and a PReLU activation function, as described in section 3.4.
The only difference between the regression variants is the input dimension
of the galaxy images, leading to one 64 variant and one 32 variant.
4.2 Classification
In order to transfer this regression task into a classification task, two
changes have to be made: (a) the output layer has to be a fully connected
layer of size N instead of one, with a softmax function as the activation
function; (b) the initial labels (true redshifts) have to be transformed from
their single float value to one-hot encoded vectors, where each column
represents a bin of size max(redshift)/N.
4.2.1 Binning the data
When it comes to binning the data, one must choose a reasonable compromise
between the number of galaxies in each bin and redshift quantization noise.
Hoyle [9] chose 94 classes over the redshift range 0 - 0.94, resulting in a
bin width of δz = 1.0 × 10^-2. Pasquet et al. [10] chose 180 classes over the
redshift range 0 - 0.4, resulting in a bin width of δz = 2.2 × 10^-3. Since
this work was done on nearly the same redshift range as Pasquet et al. [10],
but with fewer galaxy samples, the number of classes was chosen to be 110
over the redshift range 0 - 0.33, resulting in a bin width of
δz ≈ 3.0 × 10^-3.
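The binning reduces to building 110 equal-width bin edges; a sketch, assuming the range runs exactly from 0 to 0.33:

```python
import numpy as np

N_CLASSES = 110
Z_MAX = 0.33  # upper end of the redshift range used here

bin_edges = np.linspace(0.0, Z_MAX, N_CLASSES + 1)  # width = 3.0e-3
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

def redshift_to_class(z):
    """Index of the bin a spectroscopic redshift falls into."""
    idx = np.searchsorted(bin_edges, z, side="right") - 1
    return int(min(idx, N_CLASSES - 1))  # put z = Z_MAX into the last bin
```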
4.2.2 Encoding labels and label smoothing
Converting the numeric values to one-hot encoded vectors is a rather
straightforward process. Every redshift label gets converted into a
110-dimensional vector of all zeros with a one at the index corresponding to
the bin the particular galaxy falls in. At this point one could implement
label smoothing. Label smoothing is a regularization technique for
classification problems that prevents the model from predicting the labels
too confidently during training and generalizing poorly [30]. It usually
replaces the one-hot encoded label vector y_hot with a mixture of y_hot and a
uniform distribution, resulting in a smoothed label vector y_ls with

y_ls = (1 − α) · y_hot + α/N    (5)

where N is the number of label classes, and α is a hyperparameter that
determines the amount of smoothing. If α = 0, we obtain the original one-hot
encoded y_hot. If α = 1, we get the uniform distribution.
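Equation 5 in NumPy, applied to a one-hot label for an arbitrarily chosen bin:

```python
import numpy as np

def smooth_labels(y_hot, alpha):
    """Equation 5: mix the one-hot vector with a uniform distribution."""
    n = y_hot.shape[-1]
    return (1.0 - alpha) * y_hot + alpha / n

y_hot = np.eye(110)[42]          # one-hot label for a galaxy in bin 42
y_ls = smooth_labels(y_hot, alpha=0.1)
```

The smoothed vector still sums to one: the correct bin keeps weight 1 − α + α/N, and every other bin receives α/N.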
Another positive effect of label smoothing is that it softens the effect of
wrongly labeled data. Although wrongly labeled data can be neglected for this
work, it has been examined whether it could still be beneficial to apply
different weights to the wrong bins. In the one-hot encoded case, putting a
galaxy in the bin right next to the correct one leads to the same error as
putting the galaxy in a bin much further away from the correct bin. This
would be fine in nearly any classification task, but here the order of the
bins carries meaning. In order to incentivize the network to get closer to
the correct bin, a Gaussian label smoothing was applied via

y_ls,i = 1/(σ√(2π)) · e^(−(i−j)²/(2σ²))    (6)

with σ = 0.1 · N = 11 and j being the index where y_hot,j = 1. The value for
σ was chosen rather intuitively as a compromise between spanning a large area
of the bin spectrum and offering a steep enough slope to incentivize the
network to optimize towards the center of the distribution. As a last step in
the Gaussian label smoothing, the values are divided by the sum of all values
to keep the sum at 1. In order to evaluate the effect of Gaussian label
smoothing, the two classification networks (one 64 variant and one 32
variant) were trained once with and once without smoothing, increasing the
number of observed networks from four to six.
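Equation 6, including the final renormalization step, can be sketched as:

```python
import numpy as np

def gaussian_smooth_label(j, n_classes=110):
    """Equation 6 with sigma = 0.1 * N = 11, renormalized to sum to one."""
    sigma = 0.1 * n_classes
    i = np.arange(n_classes)
    y = np.exp(-((i - j) ** 2) / (2.0 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return y / y.sum()

y_ls = gaussian_smooth_label(j=55)  # galaxy whose true redshift falls in bin 55
```

Unlike the uniform smoothing of equation 5, the weight of a wrong bin now decays with its distance from the correct bin j, so misclassifications into nearby bins are penalized less than distant ones.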
4.2.3 Returning to numerical values
The predictions are then no longer a single value, but a vector of size N,
where each element stands for the predicted probability of a single sample
falling into the respective bin. In order to get a single value back out of
this vector, one would normally assign the value of the bin with the highest
probability. This method will be referred to as "class 1", e.g. in table 8.
But since we do not have a classic classification task at hand and our output
classes are ordered, we are offered another way of converting this vector
back to a single value: the softmax-weighted sum of the redshift values in
each bin. Since N is 110, this method, referred to as "class 2", e.g. in
table 8, shall be shown with a smaller fictive example. Here the output space
ranges from 0 to 3 and is separated into three bins.

Bin values               0.5   1.5   2.5
Predicted probabilities  0.15  0.75  0.1

Table 2: Fictive classifier prediction for a single sample
As said, one would normally assign the value of 1.5 in this example, since
the second bin got assigned the highest probability. But since our output
could in theory be any real number between 0 and 3, one can also weight every
bin value with its respective probability, which gives the following result:

x = 0.15 · 0.5 + 0.75 · 1.5 + 0.1 · 2.5 = 1.45
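The fictive example above can be reproduced directly, showing both decoding methods side by side:

```python
import numpy as np

bin_values = np.array([0.5, 1.5, 2.5])       # centers of the three bins
probs = np.array([0.15, 0.75, 0.1])          # softmax output of the network

z_class1 = float(bin_values[probs.argmax()])  # "class 1": most probable bin
z_class2 = float(probs @ bin_values)          # "class 2": weighted sum
```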
In order to evaluate the differences between these two techniques, the four
classification networks were evaluated once with the highest-probability
approach (class 1) and once with the weighted-vector approach (class 2),
increasing the number of networks to evaluate from six to finally ten. No
further training is needed for this distinction, since neither technique
influences the training process; they are only applied afterwards.
5 Training
The dataset has been divided into training, validation and test sets of sizes
60%, 20% and 20%, respectively. In order to minimize the effect of galaxy
orientation, data augmentation in the form of randomly flipping and/or
rotating the images between 0 and 360 degrees was applied. In addition, a
random translation of at most 1 pixel in the x and/or y direction was
applied, since the positions in the galaxy images are only precise to
±1 pixel.
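A minimal version of this augmentation, restricted to 90-degree rotations for brevity (the arbitrary-angle rotation described above would need e.g. scipy.ndimage.rotate):

```python
import numpy as np

def augment(cube, rng):
    """Random flip, rotation and ±1 px translation of one (H, W, 5) data cube."""
    if rng.random() < 0.5:
        cube = cube[:, ::-1]                       # random horizontal flip
    cube = np.rot90(cube, k=int(rng.integers(4)))  # rotate by 0/90/180/270 deg
    dy, dx = rng.integers(-1, 2, size=2)           # ±1 pixel jitter in x and y
    return np.roll(cube, (int(dy), int(dx)), axis=(0, 1))

rng = np.random.default_rng(0)
augmented = augment(np.ones((64, 64, 5)), rng)
```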
Although an ensemble of trained models would be desirable for each of the ten
network variations, due to time and resource restrictions one trained model
has to suffice for each network variant.
Each network has been trained for 200 epochs using the Adadelta optimizer
[31]. Adadelta is a more robust extension of Adagrad [32] that adapts
learning rates based on a moving window of gradient updates, instead of
accumulating all past gradients. This way, Adadelta continues learning even
when many updates have been done. The final model was chosen to be the one
that produced the smallest loss on the validation set. The loss function was
the mean squared error for regression networks and the categorical
cross-entropy for classification networks.
6 CNN results
Though there are commonly used statistics, there is no exact consensus on
evaluation metrics for deep-learning-based redshift estimation. Therefore the
following metrics, used in various other papers ([9, 10, 11, 12]), have been
used to evaluate the performance of each model on the test set:

• the residuals, ∆z = (zCNN − zspec)/(1 + zspec), following Cohen et al.
(2000) [33]

• the prediction bias, <∆z>, defined as the mean of the residuals

• σ68, σ95, corresponding to the 68.27% and 95.45% spread of ∆z

• the Median Absolute Deviation (MAD) of ∆z, defined as the median of
|∆z − Median(∆z)|

• σMAD = 1.4826 · MAD, the standard deviation of ∆z under the assumption of
a normal distribution

• the fraction of outliers η in percent with |∆z| > 0.05
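Apart from σ68 and σ95, which are percentile spreads of ∆z, the metrics above reduce to a few NumPy lines:

```python
import numpy as np

def evaluation_metrics(z_cnn, z_spec, threshold=0.05):
    """Residual-based test-set metrics as listed above."""
    dz = (z_cnn - z_spec) / (1.0 + z_spec)               # residuals
    bias = dz.mean()                                      # prediction bias <dz>
    mad = np.median(np.abs(dz - np.median(dz)))           # median abs. deviation
    sigma_mad = 1.4826 * mad                              # std. dev. if dz normal
    eta = 100.0 * np.mean(np.abs(dz) > threshold)         # outlier fraction in %
    return bias, sigma_mad, eta
```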
There is a lack of common ground especially on the definition of when to
count a prediction as an outlier. Hoyle (2016) [9] uses an absolute value of
0.15 for the threshold, while Mu et al. (2018) [12] choose their threshold as
3 · δ, where δ represents the standard deviation of ∆z; in their case the
thresholds range from 0.0885 to 0.1188 for different networks. Pasquet et
al. (2019) [10] choose their threshold as 5 · σMAD ≈ 0.05, with
σMAD = 1.4826 · MAD achieved by their network. Since the latter work also
covers the same redshift range of up to 0.4, the value of 0.05 was chosen as
the threshold for the further evaluation of all networks, regardless of the
MAD or σMAD of the specific network. A full table of all evaluation metrics
for the ten
mentioned networks can be found in appendix A.1, the corresponding plots
in Appendix A.2 to A.11. There are five different plots for each network:
• Upper left: spectroscopic redshift zspec against the redshift predicted by
the network, zCNN, in a scatter plot; the same plot with transparency on its
right.

• Upper right: zspec against zCNN in a density plot.

• Middle left: histogram of the residuals ∆z with mean and standard
deviation.

• Middle right: zspec against the residuals ∆z in a density plot with a
linear fit.

• Bottom: histograms of the residuals ∆z with mean and standard deviation
for different redshift bins.
All networks produce competitive results and are only beaten by those of Pasquet et al. [10]. A likely reason is that they used a deeper CNN with five instead of just three inception blocks. Since several different networks rather than a single one were examined in this work, a shallower architecture was chosen in order to keep training times in check.
With perfect predictions, the upper left plot would show just the positive half of an identity line (red line in the plots). For all networks the predictions are scattered more or less closely around that line, demonstrating general predictive power in every case. The upper right plots are better suited to show the density of this scatter and reflect the imbalance in the dataset, since the density drops sharply at the upper and lower ends of the redshift range.
The discrepancy between the standard deviations in the middle left plots and the corresponding σMAD reveals that the residuals are not normally distributed. Judging by the means alone, the redshift precision criterion defined by Knox et al. (2006) [34] of a bias better than 0.002 seems fulfilled for all networks except the one with ID 6. But taking a closer look via separate histograms for different redshift ranges, as in the bottom plots, reveals another picture. In the first panel of every bottom plot the mean is slightly above zero, whereas in the middle and right panels the mean is below zero and shifts further and further to the left, mostly exceeding a bias of 0.002. This effect shows even more clearly in the middle right plots, where the spectroscopic redshift zspec is plotted against the residuals ∆z in a density plot and a line has been fitted to the distribution. Every plot shows a clear negative trend with f(0) > 0, meaning the networks overestimate smaller redshift values and underestimate the larger ones. This is no surprise for an imbalanced dataset that lacks samples at both ends of the redshift range: estimates are drawn towards the centre of the distribution.
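The trend in the middle right plots corresponds to an ordinary least-squares line fitted to the residuals; a minimal sketch (a negative slope a together with a positive intercept b = f(0) reproduces the pattern described above):

```python
import numpy as np

def residual_trend(z_spec, dz):
    """Fit f(z_spec) = a * z_spec + b to the residuals.

    a < 0 with b = f(0) > 0 means low redshifts are overestimated
    and high redshifts underestimated.
    """
    a, b = np.polyfit(z_spec, dz, deg=1)
    return a, b
```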
To answer the question of how the different techniques influence the networks' ability to predict the redshift, the following sections compare averaged network results for each of the variations described before.
6.1 Comparing different input dimensions
For the comparison between different input dimensions of the galaxy images, the average results of the 64x64 networks (model IDs 1, 3, 4, 5 and 6) are compared to those of the 32x32 networks (model IDs 2, 7, 8, 9 and 10). The results can be seen in table 3. The average 64x64 network outperforms the 32x32 network in every metric: for the 32x32 networks σ68 is larger by 5.7%, σ95 by 4.5%, MAD by 5.8%, σMAD by 5.7% and η by 23.5%.
method σ68 σ95 MAD σMAD η [%]
64x64 0.02743 0.06538 0.00896 0.01329 0.86200
32x32 0.02900 0.06833 0.00948 0.01405 1.06443
Table 3: Comparison between 64x64 and 32x32 model results. Best value in
each column is underlined.
6.2 Comparing class 1 and class 2 method
For the comparison between the different methods of returning to single float values for classification networks, the average results of networks using the highest probability method (class 1, model IDs 3, 5, 7 and 9) are compared to those using the softmax weighted label vector method (class 2, model IDs 4, 6, 8 and 10). The results can be seen in table 4. For the class 1 networks σ68 is larger by 0.2%, σ95 by 2.3% and η by 20.5%; for the class 2 networks MAD is larger by 0.5% and σMAD by 0.7%. For most metrics both methods thus appear roughly equal, with differences smaller than 1%, but for σ95 and especially η, class 2 outperforms class 1. However, looking at the upper left plots of e.g. appendices A.6 and A.7, it is evident that the softmax weighted label vector method introduces a bias in the lower redshift region for networks with Gaussian label smoothing. Since evaluating both methods requires no additional training and therefore comes at virtually no computational cost, it can be advantageous to examine both whenever possible.
method σ68 σ95 MAD σMAD η [%]
class 1 0.02843 0.06803 0.00926 0.01372 1.08377
class 2 0.02836 0.06647 0.00931 0.01381 0.89927
Table 4: Comparison between highest probability and softmax weighted la-
bel vector method for classification networks. Best value in each column is
underlined.
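The two decoding methods compared here can be sketched as follows (the bin centres and probability vector are illustrative; the actual binning follows section 4.2):

```python
import numpy as np

def decode_class1(probs, bin_centers):
    """Highest probability method: take the centre of the most probable bin."""
    return float(bin_centers[np.argmax(probs)])

def decode_class2(probs, bin_centers):
    """Softmax weighted label vector method: probability-weighted mean of bin centres."""
    return float(np.dot(probs, bin_centers))
```

Both decodings reuse the same trained network output, which is why evaluating both costs no extra training, as noted above.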
6.3 Comparing no smoothing and Gaussian smoothing
For the comparison between classification networks without label smoothing and networks with Gaussian label smoothing, the average results of networks without label smoothing (model IDs 3, 4, 7 and 8) are compared to those of networks with Gaussian label smoothing (model IDs 5, 6, 9 and 10). The results can be seen in table 5. Networks with Gaussian label smoothing outperform networks without label smoothing in every metric: for networks without label smoothing σ68 is larger by 6.1%, σ95 by 8.3%, MAD by 4.5%, σMAD by 4.4% and η by 73.8%.
method σ68 σ95 MAD σMAD η [%]
no smoothing 0.02923 0.06994 0.00949 0.01406 1.25889
Gaussian smoothing 0.02756 0.06456 0.00908 0.01347 0.72415
Table 5: Comparison between classification networks without label smooth-
ing and with Gaussian label smoothing
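Gaussian label smoothing replaces the one-hot classification target with a normalized Gaussian centred on the true bin; a minimal sketch (the width parameter here is illustrative, the actual choice is described in section 4.2.2):

```python
import numpy as np

def gaussian_smooth_label(true_bin, n_bins, sigma=1.0):
    """Replace a one-hot label with a normalized Gaussian bump over the bins."""
    bins = np.arange(n_bins)
    label = np.exp(-0.5 * ((bins - true_bin) / sigma) ** 2)
    return label / label.sum()   # normalize so the label is a distribution
```

Compared to a hard one-hot target, this penalizes near-miss bins less than distant ones, which matches the ordered nature of redshift bins.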
6.4 Comparing regression and classification
For the comparison between regression and classification networks the process is not as straightforward as for the other comparisons, since there are multiple possible subsets of classification networks. Moreover, the results from sections 6.2 and 6.3 are not necessarily compatible, as explained in section 6.3. Therefore the average regression network results (model IDs 1 and 2) are compared to all possible subsets of classification model ensembles:
• A: all classification models (model IDs 3, 4, 5, 6, 7, 8, 9 and 10)
• B: all classification models with class 1 method (model IDs 3, 5, 7 and 9)
• C: all classification models with class 2 method (model IDs 4, 6, 8 and 10)
• D: all classification models without label smoothing (model IDs 3, 4, 7 and 8)
• E: all classification models with Gaussian label smoothing (model IDs 5, 6, 9 and 10)
• F: all classification models with class 1 method and without label smoothing (model IDs 3 and 7)
• G: all classification models with class 1 method and Gaussian label smoothing (model IDs 5 and 9)
• H: all classification models with class 2 method and without label smoothing (model IDs 4 and 8)
• I: all classification models with class 2 method and Gaussian label smoothing (model IDs 6 and 10)
The results can be seen in table 6. Table 7 shows the same results divided by the value of the regression ensemble in each column, which makes them easier to compare: a value above 1 indicates performance worse than the regression ensemble, a value below 1 indicates performance better than it. For σ68, MAD and σMAD only ensemble G is better; for σ95 and η, ensembles E, G and I are better than the regression ensemble. These are exactly the three ensembles with Gaussian label smoothing, further strengthening the belief in its positive contribution to model prediction performance. Comparing ensemble C (all class 2 models) with E (all models with Gaussian label smoothing) reveals that, of the two possible differentiations, Gaussian label smoothing holds more predictive power. Of ensembles E, G and I, only ensemble G was able to outperform the regression ensemble in every metric. A notable ensemble is H, since it is the combination commonly used for this task in previous works. It could not outperform the regression ensemble and may therefore not be the best suited for this task.
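The normalization used in table 7 is an element-wise division of each ensemble's metrics by the regression row of table 6; for example, for ensemble G (values taken from table 6):

```python
import numpy as np

# Metric order: sigma_68, sigma_95, MAD, sigma_MAD, eta [%]
regression = np.array([0.02748, 0.06527, 0.00897, 0.01330, 0.84998])
ensemble_g = np.array([0.02702, 0.06447, 0.00884, 0.01311, 0.76676])

# Values below 1 mean the ensemble beats the regression networks.
relative = ensemble_g / regression
```

For ensemble G every entry comes out below 1, reproducing its row in table 7.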
method σ68 σ95 MAD σMAD η [%]
regression 0.02748 0.06527 0.00897 0.01330 0.84998
A 0.02923 0.06994 0.00949 0.01406 1.25889
B 0.02843 0.06803 0.00926 0.01372 1.08377
C 0.02836 0.06647 0.00931 0.01381 0.89927
D 0.02923 0.06994 0.00949 0.01406 1.25889
E 0.02756 0.06456 0.00908 0.01347 0.72415
F 0.02984 0.07160 0.00967 0.01433 1.40078
G 0.02702 0.06447 0.00884 0.01311 0.76676
H 0.02861 0.06828 0.00930 0.01379 1.11700
I 0.02810 0.06466 0.00932 0.01382 0.68154
Table 6: Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Best value in each column is underlined.

method σ68 σ95 MAD σMAD η [%]
regression 1.000000 1.000000 1.000000 1.000000 1.000000
A 1.063683 1.071549 1.057971 1.057143 1.481082
B 1.034571 1.042286 1.032330 1.031579 1.275054
C 1.032023 1.018385 1.037904 1.038346 1.057990
D 1.063683 1.071549 1.057971 1.057143 1.481082
E 1.002911 0.989122 1.012263 1.012782 0.851961
F 1.085881 1.096982 1.078038 1.077444 1.648015
G 0.983261 0.987743 0.985507 0.985714 0.902092
H 1.041121 1.046116 1.036789 1.036842 1.314149
I 1.022562 0.990654 1.039019 1.039098 0.801831
Table 7: Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Values are divided by the value of the regression ensemble in each column. Best value in each column is underlined.

7 Conclusion and outlook
In this work I have presented multiple deep CNNs that were trained and tested on the Main Galaxy Sample of the SDSS at z ≤ 0.4 to estimate photometric redshifts. Regression as well as different classification approaches have been examined, in order to evaluate the currently common technique of turning the regression task into a classification task.
I could produce competitive results for each of the ten different networks and show that the commonly used classification approach does not necessarily outperform the regular regression approach. It has been shown that the current approach of using same-sized 64x64 images delivers better results than the newly proposed variable 32x32 images. I introduced a label smoothing technique (Gaussian label smoothing), which not only improves the overall performance of the classification networks, but also enables them to outperform the regression networks consistently. Finally, returning to single numerical values from a classification prediction via the softmax weighted label vector method still delivers the best results, except when combined with Gaussian label smoothing. I therefore strongly recommend the use of Gaussian label smoothing in the task of galaxy redshift estimation via deep CNNs.
Pasquet et al. [10] argue that for most galaxies the precision is limited by the signal-to-noise ratio of the SDSS images rather than by the method. I hope my work has shown a promising change in method that can still improve results. To investigate this further, a bigger analysis utilizing established ensemble techniques (i.e. training multiple instances of the same network version with the same specifications but different random initial layer weights and combining the results) would be interesting. If the focus lies on just one or two network versions rather than ten, as in this work, one could easily increase model complexity (e.g. five instead of three inception modules) while still using the same resources.
Besides increasing model complexity and increasing the sample size from existing and upcoming surveys, other options that have not been used in this work but may improve results are:
• using class weights in classification (and, with some adjustments, also in regression) approaches to counteract the effects of an inevitably imbalanced dataset;
• utilizing cyclical learning rate schedules with stochastic gradient descent (with momentum), since Adam/Adadelta can have a tendency to overfit;
• making use of more recent ensembling techniques like Fast Geometric Ensembling (FGE) [35] or Stochastic Weight Averaging (SWA) [36]. SWA has indeed been tried in this work and showed promising improvements, but I was not able to examine it to its full extent;
• training a pre-trained model on simulated galaxy images. This could both serve as a way to verify the simulated galaxy images and improve pre-trained models via transfer learning.
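Of these, Stochastic Weight Averaging [36], for instance, boils down to averaging the weights of checkpoints collected along the training trajectory; a framework-agnostic sketch with NumPy (checkpoint collection and batch-norm re-estimation are omitted, and the function name is illustrative):

```python
import numpy as np

def swa_average(checkpoints):
    """Average corresponding weight tensors across a list of checkpoints.

    `checkpoints` is a list of weight lists, e.g. collected every few
    epochs under a cyclical or constant learning rate as in SWA [36];
    the result is one averaged weight list of the same shapes.
    """
    return [np.mean(tensors, axis=0) for tensors in zip(*checkpoints)]
```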
Appendices
A CNN results
A.1 Numeric results
ID input method < ∆z > σ68 σ95 MAD σMAD η [%]
1 64x64x5 reg 0.00089 0.02724 0.06550 0.00887 0.01315 0.870
2 32x32x5 reg 0.00085 0.02771 0.06504 0.00906 0.01344 0.830
3 64x64x5 class 1 0.00011 0.02938 0.06982 0.00952 0.01411 1.235
4 64x64x5 class 2 0.00078 0.02826 0.06703 0.00926 0.01372 1.015
5 64x64x5 class 1 0.00149 0.02587 0.06256 0.00844 0.01251 0.605
6 64x64x5 class 2 0.00335 0.02637 0.06200 0.00873 0.01295 0.585
7 32x32x5 class 1 -0.00004 0.03031 0.07338 0.00982 0.01456 1.567
8 32x32x5 class 2 0.00077 0.02896 0.06954 0.00935 0.01386 1.219
9 32x32x5 class 1 -0.00081 0.02817 0.06638 0.00925 0.01371 0.929
10 32x32x5 class 2 0.00158 0.02984 0.06731 0.00991 0.01470 0.778
Hoyle [9] 0.00 0.030 0.10 - - 1.71
Polsterer [11] -0.0003 - - 0.0128 - -
Pasquet [10] 0.00010 - - 0.00615 0.00912 0.31
Table 8: Numeric results as defined in chapter 6 for all
ten final models. reg: regression; class 1/2: as described
in section 4.2.3; bold: Gaussian smoothing. Best value
in each column is underlined.
A.2 Plots 64 regression
Figure 6: Model evaluation plots for model with ID 1.
A.3 Plots 32 regression
Figure 7: Model evaluation plots for model with ID 2.
A.4 Plots 64 classification 1
Figure 8: Model evaluation plots for model with ID 3.
A.5 Plots 64 classification 2
Figure 9: Model evaluation plots for model with ID 4.
A.6 Plots 64 classification 1 Gaussian smoothing
Figure 10: Model evaluation plots for model with ID 5.
A.7 Plots 64 classification 2 Gaussian smoothing
Figure 11: Model evaluation plots for model with ID 6.
A.8 Plots 32 classification 1
Figure 12: Model evaluation plots for model with ID 7.
A.9 Plots 32 classification 2
Figure 13: Model evaluation plots for model with ID 8.
A.10 Plots 32 classification 1 Gaussian smoothing
Figure 14: Model evaluation plots for model with ID 9.
A.11 Plots 32 classification 2 Gaussian smoothing
Figure 15: Model evaluation plots for model with ID 10.
B My SQL search
SELECT TOP 500000
sp.bestobjid, sp.ra, sp.dec, ph.run, ph.camcol, ph.field, ph.obj,
sp.z, sp.zErr, ph.expRad_u, ph.expRad_g, ph.expRad_r, ph.expRad_i,
ph.expRad_z, ph.deVRad_u, ph.deVRad_g, ph.deVRad_r, ph.deVRad_i,
ph.deVRad_z, ph.petroR90_u, ph.petroR90_g, ph.petroR90_r,
ph.petroR90_i, ph.petroR90_z, ph.petroR90Err_u, ph.petroR90Err_g,
ph.petroR90Err_r, ph.petroR90Err_i, ph.petroR90Err_z
FROM SpecObj AS sp
JOIN PhotoObj AS ph ON ph.objid = sp.bestobjid
WHERE
class = 'GALAXY'
AND sp.z > 0
AND sp.z <= 0.4
AND (sp.zERR/sp.z) < 0.1
AND sp.zWarning = 0
AND sp.primTarget = 64
AND ph.dered_r > 0
AND ph.dered_r < 22.2
AND ph.dered_i > 0
AND ph.dered_i < 21.3
AND ph.dered_u > 0
AND ph.dered_u < 22
AND ph.dered_z > 0
AND ph.dered_z < 20.5
AND ph.dered_g > 0
AND ph.dered_g < 22.2
AND clean = 1
AND (calibStatus_u & 1) != 0
AND (calibStatus_g & 1) != 0
AND (calibStatus_r & 1) != 0
AND (calibStatus_i & 1) != 0
AND (calibStatus_z & 1) != 0
AND ((flags & 0x10000000) != 0)
AND ((flags & 0x8100000c00a0) = 0)
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_u <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_g <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_r <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_i <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_z <= 0.2))
AND (((flags & 0x100000000000) = 0) OR (flags & 0x1000) = 0)
--/Removing Duplicates
AND mode = 1
--/Removing Objects with Deblending Problems
AND (flags_u & 0x20) = 0
AND (flags_u & 0x80000) = 0
AND ((flags_u & 0x400000000000) = 0 OR psfmagerr_u <= 0.2)
AND (flags_g & 0x20) = 0
AND (flags_g & 0x80000) = 0
AND ((flags_g & 0x400000000000) = 0 OR psfmagerr_g <= 0.2)
AND (flags_r & 0x20) = 0
AND (flags_r & 0x80000) = 0
AND ((flags_r & 0x400000000000) = 0 OR psfmagerr_r <= 0.2)
AND (flags_i & 0x20) = 0
AND (flags_i & 0x80000) = 0
AND ((flags_i & 0x400000000000) = 0 OR psfmagerr_i <= 0.2)
AND (flags_z & 0x20) = 0
AND (flags_z & 0x80000) = 0
AND ((flags_z & 0x400000000000) = 0 OR psfmagerr_z <= 0.2)
--/Removing Objects with Interpolation Problems
AND (flags_u & 0x800000000000) = 0
AND (flags_u & 0x10000000000) = 0
AND ((flags_u & 0x100000000000) = 0 OR (flags_u & 0x1000) = 0)
AND (flags_g & 0x800000000000) = 0
AND (flags_g & 0x10000000000) = 0
AND ((flags_g & 0x100000000000) = 0 OR (flags_g & 0x1000) = 0)
AND (flags_r & 0x800000000000) = 0
AND (flags_r & 0x10000000000) = 0
AND ((flags_r & 0x100000000000) = 0 OR (flags_r & 0x1000) = 0)
AND (flags_i & 0x800000000000) = 0
AND (flags_i & 0x10000000000) = 0
AND ((flags_i & 0x100000000000) = 0 OR (flags_i & 0x1000) = 0)
AND (flags_z & 0x800000000000) = 0
AND (flags_z & 0x10000000000) = 0
AND ((flags_z & 0x100000000000) = 0 OR (flags_z & 0x1000) = 0)
--/Removing Suspicious Detections
AND (flags_u & 0x10000000) != 0
AND (flags_u & 0x40000) = 0
AND (flags_u & 0x80) = 0
AND (flags_g & 0x10000000) != 0
AND (flags_g & 0x40000) = 0
AND (flags_g & 0x80) = 0
AND (flags_r & 0x10000000) != 0
AND (flags_r & 0x40000) = 0
AND (flags_r & 0x80) = 0
AND (flags_i & 0x10000000) != 0
AND (flags_i & 0x40000) = 0
AND (flags_i & 0x80) = 0
AND (flags_z & 0x10000000) != 0
AND (flags_z & 0x40000) = 0
AND (flags_z & 0x80) = 0
ORDER BY z DESC
C My CNN architecture
Figure 16: Schematic network architecture.
References
[1] WA Baum. Problems of extragalactic research. In IAU Symposium, 1962, volume 15, pages 390–397, 1962.
[2] Stephane Arnouts, Stefano Cristiani, Lauro Moscardini, Sabino Matarrese, Francesco Lucchin, Adriano Fontana, and Emanuele Giallongo. Measuring and modelling the redshift evolution of clustering: the hubble deep field north. Monthly Notices of the Royal Astronomical Society, 310(2):540–556, 1999.
[3] David A Bohlender, Daniel Durand, and Thomas H Handley. Astronomical Data Analysis Software and Systems XI, volume 281. 2002.
[4] Gabriel B Brammer, Pieter G van Dokkum, and Paolo Coppi. Eazy: a fast, public photometric redshift code. The Astrophysical Journal, 686(2):1503, 2008.
[5] Adrian A Collister and Ofer Lahav. Annz: estimating photometric redshifts using artificial neural networks. Publications of the Astronomical Society of the Pacific, 116(818):345, 2004.
[6] I Csabai, L Dobos, M Trencseni, G Herczegh, P Jozsa, N Purger, T Budavari, and AS Szalay. Multidimensional indexing tools for the virtual observatory. Astronomische Nachrichten: Astronomical Notes, 328(8):852–857, 2007.
[7] Samuel Carliles, Tamas Budavari, Sebastien Heinis, Carey Priebe, and Alexander S Szalay. Random forests for photometric redshifts. The Astrophysical Journal, 712(1):511, 2010.
[8] Ben Hoyle, Markus Michael Rau, Roman Zitlau, Stella Seitz, and Jochen Weller. Feature importance for machine learning redshifts applied to sdss galaxies. Monthly Notices of the Royal Astronomical Society, 449(2):1275–1283, 2015.
[9] Ben Hoyle. Measuring photometric redshifts using galaxy images and deep neural networks. Astronomy and Computing, 16:34–40, 2016.
[10] Johanna Pasquet, Emmanuel Bertin, Marie Treyer, Stephane Arnouts, and Dominique Fouchez. Photometric redshifts from sdss images using a convolutional neural network. Astronomy & Astrophysics, 621:A26, 2019.
[11] Antonio D’Isanto and Kai Lars Polsterer. Photometric redshift estimation via deep learning: generalized and pre-classification-less, image based, fully probabilistic redshifts. Astronomy & Astrophysics, 609:A111, 2018.
[12] Yong-Huan Mu, Bo Qiu, Jian-Nan Zhang, Jun-Cheng Ma, and Xiao-Dong Fan. Photometric redshift estimation of galaxies with convolutional neural network.
[13] Michael A Strauss, David H Weinberg, Robert H Lupton, Vijay K Narayanan, James Annis, Mariangela Bernardi, Michael Blanton, Scott Burles, AJ Connolly, Julianne Dalcanton, et al. Spectroscopic target selection in the sloan digital sky survey: the main galaxy sample. The Astronomical Journal, 124(3):1810, 2002.
[14] Romina Ahumada, Carlos Allende Prieto, Andres Almeida, Friedrich Anders, Scott F Anderson, Brett H Andrews, Borja Anguiano, Riccardo Arcodia, Eric Armengaud, Marie Aubert, et al. The sixteenth data release of the sloan digital sky surveys: First release from the apogee-2 southern survey and full release of eboss spectra. arXiv preprint arXiv:1912.02905, 2019.
[15] Vahe Petrosian. Surface brightness and evolution of galaxies. The Astrophysical Journal, 209:L1–L5, 1976.
[16] Michael R Blanton, Julianne Dalcanton, Daniel Eisenstein, Jon Loveday, Michael A Strauss, Mark SubbaRao, David H Weinberg, John E Anderson Jr, James Annis, Neta A Bahcall, et al. The luminosity function of galaxies in sdss commissioning data. The Astronomical Journal, 121(5):2358, 2001.
[17] Naoki Yasuda, Masataka Fukugita, Vijay K Narayanan, Robert H Lupton, Iskra Strateva, Michael A Strauss, Zeljko Ivezic, Rita SJ Kim, David W Hogg, David H Weinberg, et al. Galaxy number counts from the sloan digital sky survey commissioning data. The Astronomical Journal, 122(3):1104, 2001.
[18] JL Hodges. The significance probability of the smirnov two-sample test. Arkiv for Matematik, 3(5):469–486, 1958.
[19] Aurelien Geron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media, 2019.
[20] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[21] Haigang Zhu, Xiaogang Chen, Weiqun Dai, Kun Fu, Qixiang Ye, and Jianbin Jiao. Orientation robust object detection in aerial images using deep convolutional neural network. In 2015 IEEE International Conference on Image Processing (ICIP), pages 3735–3739. IEEE, 2015.
[22] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951, 2000.
[23] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[24] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[26] Zhen Huang, Tim Ng, Leo Liu, Henry Mason, Xiaodan Zhuang, and Daben Liu. Sndcnn: Self-normalizing deep cnns with scaled exponential linear units for speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6854–6858. IEEE, 2020.
[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[29] John S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.
[30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[31] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[32] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
[33] Judith G Cohen, David W Hogg, Roger Blandford, Lennox L Cowie, Esther Hu, Antoinette Songaila, Patrick Shopbell, and Kevin Richberg. Caltech faint galaxy redshift survey. x. a redshift survey in the region of the hubble deep field north. The Astrophysical Journal, 538(1):29, 2000.
[34] Lloyd Knox, Yong-Seon Song, and Hu Zhan. Weighing the universe with photometric redshift surveys and the impact on dark energy forecasts. The Astrophysical Journal, 652(2):857, 2006.
[35] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns, 2018.
[36] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization, 2018.
List of Figures
1 RGB image from irg bands of frame 6073-4-50 (run-camcol-field)
2 Distribution of image width/height using equation 1 with cutoff at A = 32 (red) and median at A = 54 (black)
3 u band comparison images for three different galaxies with given SDSS bestobjid, zspec and Petrosian image size in pixels, which corresponds to the value A in Figure 2. Rescaled 32 pixel cutouts on the left and original sized 64 pixel cutouts on the right.
4 Redshift distribution of all samples found by the SQL search (blue) in comparison with all samples used with the 64 pixel architecture (orange) or the 32 pixel architecture (green)
5 Inception module with dimension reductions as defined in [27]
6 Model evaluation plots for model with ID 1.
7 Model evaluation plots for model with ID 2.
8 Model evaluation plots for model with ID 3.
9 Model evaluation plots for model with ID 4.
10 Model evaluation plots for model with ID 5.
11 Model evaluation plots for model with ID 6.
12 Model evaluation plots for model with ID 7.
13 Model evaluation plots for model with ID 8.
14 Model evaluation plots for model with ID 9.
15 Model evaluation plots for model with ID 10.
16 Schematic network architecture.
List of Tables
1 Kolmogorov-Smirnoff test results for different subset combinations
2 Fictive classifier prediction for a single sample
3 Comparison between 64x64 and 32x32 model results. Best value in each column is underlined.
4 Comparison between highest probability and softmax weighted label vector method for classification networks. Best value in each column is underlined.
5 Comparison between classification networks without label smoothing and with Gaussian label smoothing
6 Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Best value in each column is underlined.
7 Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Values are divided by the value of the regression ensemble in each column. Best value in each column is underlined.
8 Numeric results as defined in chapter 6 for all ten final models. reg: regression; class 1/2: as described in section 4.2.3; bold: Gaussian smoothing. Best value in each column is underlined.
Declaration of Authorship
I hereby declare that this thesis is my own work and that I have not used any sources and aids other than those stated in the thesis.
Erklärung
I hereby declare that I have written this thesis independently and have not used any sources or aids other than those stated in the thesis.
Munich, September 14, 2020: .........................................
Benjamin Alber