LUDWIG MAXIMILIAN UNIVERSITY OF MUNICH
FACULTY OF PHYSICS
Photometric redshifts from
SDSS images using
convolutional neural networks
and custom Gaussian label
smoothing
A Master’s Thesis by
Benjamin Alber
Supervisor:
Dr. Benjamin P. Moster
Submitted:
September 14, 2020
LUDWIG-MAXIMILIANS-UNIVERSITÄT MÜNCHEN
FACULTY OF PHYSICS
Photometric redshifts from
SDSS galaxy images using
convolutional neural networks
and "Gaussian label smoothing"
A Master's Thesis by
Benjamin Alber
First examiner:
Dr. Benjamin P. Moster
Submitted:
September 14, 2020
Contents

1 Introduction
2 Data
3 Network architectures
  3.1 Input layer
  3.2 Convolutional layer
  3.3 Pooling layer
  3.4 Activation functions
  3.5 Inception block
  3.6 Fully connected layers
  3.7 Dropout layer
  3.8 Output layer
  3.9 Complete architecture
4 Regression and classification
  4.1 Regression
  4.2 Classification
    4.2.1 Binning the data
    4.2.2 Encoding labels and label smoothing
    4.2.3 Returning to numerical values
5 Training
6 CNN results
  6.1 Comparing different input dimensions
  6.2 Comparing class 1 and class 2 method
  6.3 Comparing no smoothing and Gaussian smoothing
  6.4 Comparing regression and classification
7 Conclusion and outlook
Appendices
A CNN results
  A.1 Numeric results
  A.2 Plots 64 regression
  A.3 Plots 32 regression
  A.4 Plots 64 classification 1
  A.5 Plots 64 classification 2
  A.6 Plots 64 classification 1 Gaussian smoothing
  A.7 Plots 64 classification 2 Gaussian smoothing
  A.8 Plots 32 classification 1
  A.9 Plots 32 classification 2
  A.10 Plots 32 classification 1 Gaussian smoothing
  A.11 Plots 32 classification 2 Gaussian smoothing
B My SQL search
C My CNN architecture
References
1 Introduction
Robust distance estimates for galaxies are required in order to maximise cos-
mological information from current and upcoming large scale galaxy surveys.
These distances are inferred via the distance-redshift relation which relates
how the light emitted by a galaxy is stretched due to the expansion of the
universe as it travels from the galaxy to our detectors. This stretching leads
to an energy loss of the photon and a shift towards redder wavelengths, which
is known as the redshift. The further away the galaxy is from us, the longer
the light has been passing through the expanding universe, and the more
it becomes redshifted. Obtaining accurate spectroscopic redshifts, which
measure the redshifted spectral absorption and emission lines, is extremely
time-intensive, whereas obtaining photometric redshifts is much cheaper, but
less accurate.
Two main techniques are traditionally used for photometric redshifts:
template fitting and machine learning algorithms. The template fitting codes
(e.g. [1, 2, 3, 4]) match the broadband photometry of a galaxy to the synthetic
magnitudes of a suite of templates across a large redshift interval. These
methods do not rely on training samples of galaxies. However, they are often
computationally intensive, because a brute-force search must explore the
pre-generated grid of model photometry, and poorly known parameters such as
dust attenuation can lead to degeneracies in color-redshift space.
On the other hand, the machine learning methods (e.g. [5, 6, 7]) were
shown to have similar or better performances when a large spectroscopic
training set is available. However, they are only reliable within the limits of
the training set and the current lack of spectroscopic coverage in some color
space regions and at high redshift remains a major issue for this approach.
Within the standard machine learning approach, the choice of which photometric
input features to train the machine learning architecture on, from the
full list of possible photometric features, still has no definitive answer. Hoyle
et al. (2015) [8] performed an analysis of feature importance for photomet-
ric redshifts, which uses machine learning techniques to determine which of
the many possible photometric features produce the most predictive power.
But by passing the entire galaxy image into a convolutional neural network
(CNN), the manual feature extraction required by previous methods can be
bypassed (e.g. [9, 10, 11, 12]).
A commonly used technique in these works is separating the redshift space
into multiple small bins and performing a classification task. This thesis
evaluates whether this approach is justified and how it could be improved.
It is organized as follows. Section 2 describes the data
acquisition and preparation used in this study. Section 3 describes the dif-
ferent elements and the overall structure of the used CNN networks. Section
4 outlines the different regression and classification approaches that can be
taken for the redshift estimation task with a CNN. Section 5 describes the
data augmentation and training regime. The results are discussed and sum-
marized in sections 6 and 7.
2 Data
The Sloan Digital Sky Survey (hereafter SDSS) is a multiband imaging and
spectroscopic redshift survey using a dedicated 2.5 m telescope at Apache
Point Observatory in New Mexico (USA). It collects deep photometry (r <
22.5) in ugriz bands and makes them publicly available through the SDSS
website [sdss.org]. Data is stored on the SDSS Science Archive Server in
the form of FITS files containing a single 2048 x 1361 pixel (corresponding
to 13.51 x 8.98 arcminutes) "corrected frame" and additional World Coordinate
System (WCS) header information. These images are calibrated in nanomaggies
per pixel and have had a sky subtraction applied. Each frame can be uniquely
identified by its run, camcol and field number. An example frame can be seen
in figure 1.

Figure 1: RGB image from irg bands of frame 6073-4-50 (run-camcol-field)

The galaxy images for this thesis are drawn from the flux-limited
spectroscopic Main Galaxy Sample [13] of SDSS Data Release 16 [14].
For a galaxy to be used in this thesis, it needs to have a spectroscopic
redshift of 0 < zspec ≤ 0.4 (which is mostly already guaranteed by belonging
to the Main Galaxy Sample), a relative error of zerror/zspec < 0.1, and to
satisfy several photometric processing flags recommended by SDSS. These
include the removal of duplicates, objects with deblending problems, objects
with interpolation problems and suspicious detections1.
1 https://www.sdss.org/dr16/algorithms/photo_flags_recommend/
Running the SQL query (see appendix B) on the SDSS CasJobs website2
returns 346,733 galaxies in the redshift range from 0 to 0.3305794. For all
queried galaxies, the following characteristics have been obtained: equatorial
coordinates (RA, Dec), spectroscopic redshift and its error, and the radius
containing 90% of the Petrosian flux [15, 16, 17] for the ugriz bands.
Since observations in the five different filters happen 71.7 seconds apart
from one another in the order r-i-u-z-g3, all images need to be shifted to the
same frame. All SDSS FITS files come with header information about pixel
and world coordinates for a single reference pixel in that specific image. All
images have been shifted so that the world coordinates of their reference
pixel match those of the r band image. Due to the necessary rounding to
integers, positions are precise to ±1 pixel.
In order to keep the training times in check, a random, statistically
relevant subset of 100,000 out of the 346,733 found galaxies was selected for
this work. Pasquet et al. [10], who worked on nearly the same set of galaxies,
used the whole set and achieved excellent results with it. But they also
trained the same network on a subset of 100,000 galaxies and could show that
their results did not change significantly. The reduction of the dataset for
this work is therefore considered unproblematic.
Regarding the image size and bands used, no consensus has been established
by earlier works in this field. Hoyle (2016) [9] used 72x72 pixel RGBA
images, encoding colors (i - z, r - i, g - r) in the RGB layers and r band
magnitudes in the alpha layer, from which 100 random 60x60 pixel stamps were
used per galaxy. D'Isanto & Polsterer (2018) [11] used 28x28 pixel images in
the five SDSS bands as well as all 10 pairwise color combinations as input.
2 https://skyserver.sdss.org/casjobs/
3 https://www.sdss.org/instruments/camera/
Pasquet et al. (2019) [10] used 64x64 pixel images in the five SDSS bands
and no colors.
Just like in Pasquet et al. (2019) [10], the input for this thesis consists
only of ugriz images; no color images are used. Regarding the image size,
however, two different approaches have been examined. The first one follows
Pasquet et al. (2019) [10] and uses 64x64 pixel cutouts around the galaxy
center. If a galaxy is too close to the edge of its frame, it is discarded.
This leaves 99,842 galaxies. Due to the low number of lost galaxies, the
mosaic approach used in Pasquet et al. (2019) [10] to recover such galaxies
was not adopted.
The second approach uses cutouts of varying size around the galaxy, depending
on the apparent size of the galaxy, and then rescales the image to 32x32
pixels. The size of the cutout is calculated by

A = 4 · median(r_petro) / (0.396 arcsec/pixel)    (1)

where r_petro are the Petrosian radii in all five bands and 0.396 arcsec/pixel
is the pixel scale of all SDSS images. Since the Petrosian radii occasionally
contain a large outlier, the median turned out to be a more reliable statistic
than the mean or the maximum value. The idea is to let the important features
of smaller galaxies cover more of the image by effectively zooming in, and
not to cut off features of bigger galaxies by effectively zooming out. In
retrospect, the calculation for A was chosen quite generously, and images
therefore still partially include large areas with little signal (see
figure ??). Figure 2 shows the distribution of the values of A for all
galaxies. Since the highest value is 572, the plot has been cut at 160 for
visual reasons.
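Equation 1 can be sketched in a few lines of NumPy. The band radii below are hypothetical illustration values, not entries from the actual catalogue:

```python
import numpy as np

SDSS_PIXEL_SCALE = 0.396  # arcsec per pixel for all SDSS frames

def cutout_size(petro_r90_arcsec):
    """Cutout width/height A in pixels (equation 1) from the five ugriz
    Petrosian 90% flux radii; the median guards against single-band outliers."""
    return 4 * np.median(petro_r90_arcsec) / SDSS_PIXEL_SCALE

# hypothetical radii in arcsec for the u, g, r, i, z bands
A = cutout_size([21.0, 5.3, 5.5, 5.8, 4.1])  # the u-band value is an outlier
```

Using the median, the outlying u-band radius above has no influence on A, whereas the mean or the maximum would be pulled far upwards by it.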
Again, if a galaxy is too close to the edge of its frame, it is discarded.
In addition, all galaxies with A ≤ 32 are discarded, in order to prevent
artifacts from upscaling the image. This procedure leaves 96,388 galaxies.
More galaxies are lost, but still not enough to have a significant impact and
justify a mosaic approach.

Figure 2: Distribution of image width/height using equation 1 with cutoff at
A = 32 (red) and median at A = 54 (black)
If a galaxy has an A value of ∼ 64, the 32 pixel image is basically a
rescaled version of the original with no zoom of any kind (see top row of
figure 3). For A values smaller than 64, the 32 pixel image results in a
zoomed-in version (see middle row of figure 3). For A values bigger than 64,
the 32 pixel image results in a zoomed-out version (see bottom row of
figure 3). The respective image pairs in figure 3 visualize the corresponding
zoom quite clearly. Despite the generous calculation for A, a good number of
galaxies falls on either side of the boundary at 64, which may justify its
definition.
Figure 4 shows the redshift distribution for the whole set, the galaxies used
for the 64x64 approach and those used for the 32x32 approach. To show that
the corresponding subsets are representative of the whole set, a
Kolmogorov–Smirnov test [18] was conducted. This is a two-sided test for the
null hypothesis that two independent samples are drawn from the same
distribution. The corresponding p-values for every set combination can be
seen in Table 1. The combination of the 64 pixel and 32 pixel images is of
the highest interest, since it guarantees that the two different subsets, and
ultimately the different networks trained with them, can be compared. But the
other two set combinations have been tested for consistency as well. Since
all p-values are high enough, we cannot reject the null hypothesis and can
therefore conclude that all our subsets are drawn from the same distribution.
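The two-sample test described above is available as `scipy.stats.ks_2samp`; a minimal sketch with stand-in data (synthetic arrays, not the actual redshift labels):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# stand-ins for the spectroscopic redshifts of the whole set and one subset;
# the real test would use the actual label arrays
whole_set = rng.gamma(shape=4.0, scale=0.03, size=100_000)  # skewed, like fig. 4
subset = rng.choice(whole_set, size=20_000, replace=False)

stat, p_value = ks_2samp(whole_set, subset)
# a large p-value means the null hypothesis (same distribution) is not rejected
```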
Figure 3: u band comparison images for three different galaxies with given
SDSS bestobjid, zspec and Petrosian image size in pixels, which corresponds
to the value A in figure 2. Rescaled 32 pixel cutouts on the left and
original-sized 64 pixel cutouts on the right.

Set 1          Set 2          KS statistic   p-value
Whole set      64 px images   0.00282        0.56585
Whole set      32 px images   0.00362        0.27554
64 px images   32 px images   0.00346        0.59938

Table 1: Kolmogorov–Smirnov test results for different subset combinations

Figure 4: Redshift distribution of all samples found by the SQL search (blue)
in comparison with all samples used with the 64 pixel architecture (orange)
or the 32 pixel architecture (green)

As the final step in data preparation, a feature standardization makes the
values of each feature (image pixel values in this case) have zero mean and
unit variance via equation 2,

x' = (x − µ) / σ    (2)

where x' is the standardized feature value, x the raw feature value, µ the
mean of all feature values in the training set and σ the standard deviation
of all feature values in the training set. The final data products are
64x64x5 and 32x32x5 data cubes, respectively, consisting of ugriz bands
centered on the galaxy centers.
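Equation 2 can be sketched directly; note that µ and σ come from the training set only and are then applied to every split (the array shapes here are illustrative):

```python
import numpy as np

def standardize(train, *others):
    """Feature standardization (equation 2): mu and sigma are computed per
    pixel on the training set only, then applied to every split."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return tuple((x - mu) / sigma for x in (train, *others))

rng = np.random.default_rng(0)
train, val, test = rng.normal(3.0, 2.0, (3, 50, 32, 32, 5))
train_s, val_s, test_s = standardize(train, val, test)
```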
3 Network architectures
In this section the different elements of the models used for the experiments
are described. A schematic representation of a complete network can be seen
in figure 16 in appendix C.
3.1 Input layer
The input layer in a CNN defines the dimensions of the input that is fed into
the network. It is only worth discussing in this work because the effect of
different input dimensions shall be evaluated. This leads to the first
separation between different networks: one works with a 64x64x5 input, while
the other works with a 32x32x5 input, thereby increasing the number of
networks to evaluate from one to two.
3.2 Convolutional layer
The convolutional layer is the core building block of a CNN. The layer’s
parameters consist of a set of learnable filters (or kernels), which have a
small receptive field, but extend through the full depth of the input volume.
In this case the five SDSS bands define the depth. The number and size of
these filters are chosen beforehand and cannot be changed during training;
they are therefore so-called hyperparameters. During the forward
pass, each filter is convolved across the width and height of the input volume,
computing the dot product between the entries of the filter and the input and
producing a 2-dimensional activation map of that filter. During training, the
kernels, which had been initially populated with random values, are updated
to become progressively relevant to solve the underlying problem. As a result,
the network learns filters that activate when it detects some specific type of
feature at some spatial position in the input [19].
3.3 Pooling layer
Pooling layers reduce the dimensions of the data by combining the outputs
of neuron clusters at one layer into a single neuron in the next layer. Local
pooling combines small clusters, 2x2 in this case. Pooling may compute a
max or an average. Max pooling uses the maximum value from each of a
cluster of neurons at the prior layer, average pooling uses the average value
from each of a cluster of neurons at the prior layer. In all cases, pooling
helps to make the representation become approximately invariant to small
translations of the input. Invariance to translation means that if we translate
the input by a small amount, the values of most of the pooled outputs do
not change [20, 21]. Pasquet et al. (2019) [10] argue that most image pixels
in SDSS images are dominated by background noise, which is why they choose
average pooling over max pooling. The same has been done in this work.
3.4 Activation functions
In order to introduce non-linearity into the network, different non-linear ac-
tivation functions can be used. The most commonly used activation function
is the ReLU (Rectified Linear Unit, [22]), defined by f(x) = max(x, 0). A
problem with the ReLU is the "dying ReLU" problem, where some ReLU neurons
essentially die for all inputs and remain inactive no matter what input is
supplied; no gradient flows through them, and if a large number of dead
neurons are present in a neural network, its performance is affected. This is
a form of the vanishing gradient problem [23] and can be corrected by making
use of what is called a Leaky ReLU [24], which can be defined as

f(x) = x,      if x > 0
       0.01x,  otherwise    (3)

thus causing a "leak" and extending the range of the ReLU. The problem with
the Leaky ReLU is the arbitrarily chosen value of 0.01, which may not suit
different architectures equally and is therefore another hyperparameter that
needs tuning. More recently, He et al. (2015) [25] introduced a Parametric
ReLU (PReLU), defined as

f(x) = x,   if x > 0
       αx,  otherwise    (4)

where α is a trainable parameter instead of a fixed hyperparameter.
Another promising activation function is the Scaled Exponential Linear Unit
(SELU) [26], which has a self-normalizing effect on the neural network. When
using this activation function in practice, one must use "lecun normal"
weight initialization, and if dropout is to be applied, one should use
AlphaDropout. Although both were ensured, the SELU could not outperform the
PReLU. Therefore a PReLU was used for all activations, except for the output
layer of classification networks (see section 3.8).
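The three ReLU variants discussed above differ only in how they treat negative inputs; a minimal NumPy sketch of equations 3 and 4:

```python
import numpy as np

def relu(x):
    """f(x) = max(x, 0): negative inputs give zero output and zero gradient."""
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    """Equation 3: a small fixed slope keeps gradients flowing for x < 0."""
    return np.where(x > 0, x, slope * x)

def prelu(x, alpha):
    """Equation 4: same form, but alpha is a parameter learned in training."""
    return np.where(x > 0, x, alpha * x)
```

In a framework such as Keras, the trainable α is provided by a dedicated `PReLU` layer rather than a fixed function like the sketch above.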
The output layer of the classification networks uses a softmax activation
function. The softmax function outputs a vector of size N, where N is the
number of potential outcomes or classes, representing the probability
distribution over all potential outcomes. The probabilities sum to one, which
is appropriate here since the classes are mutually exclusive.

Figure 5: Inception module with dimension reductions as defined in [27]
3.5 Inception block
An inception layer (figure 5) is a combination of 1x1, 3x3 and 5x5
convolutional layers, whose output filter banks are concatenated into a
single output vector forming the input of the next stage [27]. It allows the
internal layers to pick and choose which filter size is relevant to learn the
required information. The 1x1 convolutions are mainly used for dimensionality
reduction before the expensive 3x3 and 5x5 convolutions. Besides being used
as reductions, they also include another activation layer, which makes them
dual-purpose by introducing additional non-linearity into the network.
Additionally, since pooling operations have been essential for the success of
current state-of-the-art convolutional networks, Szegedy et al. (2015) [27]
suggest that adding an alternative parallel pooling path in each such stage
should have an additional beneficial effect.
3.6 Fully connected layers
The convolutional layers do not make predictions but rather extract
meaningful features from the input image, which are then flattened into a
one-dimensional vector and fed into a series of so-called fully connected
layers. These consist of a number of single neurons, given by another
hyperparameter, each connected to every neuron in the preceding and following
layer, hence the name fully connected. These connections are also initially
populated with random values and updated during training.
3.7 Dropout layer
In order to prevent the network from overfitting, dropout layers [28] were used
between convolutional as well as between fully connected layers. Dropout
is a regularization method that approximates training a large number of
neural networks with different architectures in parallel. During training,
some number of layer outputs are randomly ignored or "dropped out". This has
the effect of making the layer look like, and be treated like, a layer with a
different number of nodes and connectivity to the prior layer. In effect,
each update to a layer during training is performed with a different "view"
of the configured layer.
3.8 Output layer
Depending on the underlying problem a machine learning algorithm has to
solve, it is usually a clear choice whether it is a regression or a
classification problem, which then dictates the size of the output layer. In
this case, however, the obvious regression task can be turned into a
classification task by separating the output space into multiple bins (see
section 4.2). For the regression version of the networks, the output layer
consists of a fully connected layer with a single neuron and a PReLU
activation layer. For the classification version it consists of a fully
connected layer with 110 neurons and a softmax activation layer [29]; the
110 neurons correspond to the number of classes, as explained in
section 4.2.1.
3.9 Complete architecture
In general, the networks are composed of a first convolutional layer, a
pooling layer, three inception blocks, a dropout layer, followed by three
fully connected layers each with dropout, and finally the output layer. A
schematic representation can be seen in figure 16 in appendix C. For the 64
version, the first convolution and pooling layer scale down the inputs, while
in the 32 version "same" padding was used in order to keep the dimensions
comparable. The output layers differ for regression and classification as
described in section 3.8, increasing the number of networks to evaluate from
two to four.
4 Regression and classification
The following sections outline the different regression and especially
classification approaches that can be taken for the redshift estimation task.

4.1 Regression

In order to treat this problem as the regression task that it fundamentally
is, the output layer must consist of a fully connected layer with only one
output unit and a PReLU activation function, as described in section 3.4.
The only difference between the regression variants is the input dimension
of the galaxy images, leading to one 64 variant and one 32 variant.
4.2 Classification
In order to transfer this regression task into a classification task, two
changes have to be made: (a) the output layer has to be a fully connected
layer of size N instead of one, with a softmax function as the activation
function; (b) the initial labels (true redshifts) have to be transformed from
their single float value to one-hot encoded vectors, where each column
represents a bin of size max(redshift)/N.
4.2.1 Binning the data
When it comes to binning the data, one must choose a reasonable compromise
between the number of galaxies in each bin and redshift quantization noise.
Hoyle [9] chose 94 classes over the redshift range 0 - 0.94, resulting in a
bin width of δz = 1.0 × 10^-2. Pasquet et al. [10] chose 180 classes over the
redshift range 0 - 0.4, resulting in a bin width of δz = 2.2 × 10^-3. Since
this work was done on nearly the same redshift range as Pasquet et al. [10],
but with fewer galaxy samples, the number of classes was chosen to be 110
over the redshift range 0 - 0.33, resulting in a bin width of
δz ≈ 3.0 × 10^-3.
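The binning reduces to building 110 equal-width bin edges; a sketch, assuming the range runs exactly from 0 to 0.33:

```python
import numpy as np

N_CLASSES = 110
Z_MAX = 0.33  # upper end of the redshift range used here

bin_edges = np.linspace(0.0, Z_MAX, N_CLASSES + 1)  # width = 3.0e-3
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

def redshift_to_class(z):
    """Index of the bin a spectroscopic redshift falls into."""
    idx = np.searchsorted(bin_edges, z, side="right") - 1
    return int(min(idx, N_CLASSES - 1))  # put z = Z_MAX into the last bin
```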
4.2.2 Encoding labels and label smoothing
Converting the numeric values to one-hot encoded vectors is a rather
straightforward process. Every redshift label gets converted into a
110-dimensional vector of all zeros with a one at the index corresponding to
the bin the particular galaxy falls in. At this point one could implement
label smoothing. Label smoothing is a regularization technique for
classification problems that prevents the model from predicting the labels
too confidently during training and generalizing poorly [30]. It usually
replaces the one-hot encoded label vector y_hot with a mixture of y_hot and a
uniform distribution, resulting in a smoothed label vector y_ls with

y_ls = (1 − α) · y_hot + α/N    (5)

where N is the number of label classes, and α is a hyperparameter that
determines the amount of smoothing. If α = 0, we obtain the original one-hot
encoded y_hot. If α = 1, we get the uniform distribution.
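Equation 5 in NumPy, applied to a one-hot label for an arbitrarily chosen bin:

```python
import numpy as np

def smooth_labels(y_hot, alpha):
    """Equation 5: mix the one-hot vector with a uniform distribution."""
    n = y_hot.shape[-1]
    return (1.0 - alpha) * y_hot + alpha / n

y_hot = np.eye(110)[42]          # one-hot label for a galaxy in bin 42
y_ls = smooth_labels(y_hot, alpha=0.1)
```

The smoothed vector still sums to one: the correct bin keeps weight 1 − α + α/N, and every other bin receives α/N.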
Another positive effect of label smoothing is that it softens the effect of
wrongly labeled data. Although wrongly labeled data can be neglected for this
work, it has been examined whether it could still be beneficial to apply
different weights to the wrong bins. In the one-hot encoded case, putting a
galaxy in the bin right next to the correct one leads to the same error as
putting the galaxy in a bin much further away from the correct bin. This
would be fine in nearly any classification task, but here the order of the
bins carries meaning. In order to incentivize the network to get closer to
the correct bin, a Gaussian label smoothing was applied via

y_ls,i = 1/(σ√(2π)) · e^(−(i−j)²/(2σ²))    (6)

with σ = 0.1 · N = 11 and j being the index where y_hot,j = 1. The value for
σ was chosen rather intuitively as a compromise between spanning a large area
of the bin spectrum and offering a steep enough slope to incentivize the
network to optimize towards the center of the distribution. As a last step in
the Gaussian label smoothing, the values are divided by the sum of all values
to keep the sum at 1. In order to evaluate the effect of Gaussian label
smoothing, the two classification networks (one 64 variant and one 32
variant) were trained once with and once without smoothing, increasing the
number of observed networks from four to six.
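Equation 6, including the final renormalization step, can be sketched as:

```python
import numpy as np

def gaussian_smooth_label(j, n_classes=110):
    """Equation 6 with sigma = 0.1 * N = 11, renormalized to sum to one."""
    sigma = 0.1 * n_classes
    i = np.arange(n_classes)
    y = np.exp(-((i - j) ** 2) / (2.0 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return y / y.sum()

y_ls = gaussian_smooth_label(j=55)  # galaxy whose true redshift falls in bin 55
```

Unlike the uniform smoothing of equation 5, the weight of a wrong bin now decays with its distance from the correct bin j, so misclassifications into nearby bins are penalized less than distant ones.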
4.2.3 Returning to numerical values
The predictions are then no longer a single value, but a vector of size N,
where each element stands for the predicted probability of a single sample
falling into the respective bin. In order to get a single value back out of
this vector, one would normally assign the value of the bin with the highest
probability. This method will be referred to as "class 1", e.g. in table 8.
But since we do not have a classic classification task at hand and our output
classes are ordered, we are offered another way of converting this vector
back to a single value: the softmax-weighted sum of the redshift values in
each bin. Since N is 110, this method, referred to as "class 2", e.g. in
table 8, shall be shown with a smaller fictive example. Here the output space
ranges from 0 to 3 and is separated into three bins.

Bin values               0.5   1.5   2.5
Predicted probabilities  0.15  0.75  0.1

Table 2: Fictive classifier prediction for a single sample
As said, one would normally assign the value of 1.5 in this example, since
the second bin got assigned the highest probability. But since our output
could in theory be any real number between 0 and 3, one can also weight every
bin value with its respective probability, which gives the following result:

x = 0.15 · 0.5 + 0.75 · 1.5 + 0.1 · 2.5 = 1.45
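The fictive example above can be reproduced directly, showing both decoding methods side by side:

```python
import numpy as np

bin_values = np.array([0.5, 1.5, 2.5])       # centers of the three bins
probs = np.array([0.15, 0.75, 0.1])          # softmax output of the network

z_class1 = float(bin_values[probs.argmax()])  # "class 1": most probable bin
z_class2 = float(probs @ bin_values)          # "class 2": weighted sum
```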
In order to evaluate the differences between these two techniques, the four
classification networks were evaluated once with the highest-probability
approach (class 1) and once with the weighted-vector approach (class 2),
increasing the number of networks to evaluate from six to finally ten. No
further training is needed for this distinction, since neither technique
influences the training process; they are only applied afterwards.
5 Training
The dataset has been divided into training, validation and test sets of sizes
60%, 20% and 20%, respectively. In order to minimize the effect of galaxy
orientation, data augmentation in the form of randomly flipping and/or
rotating the images between 0 and 360 degrees was applied. In addition, a
random translation of at most 1 pixel in the x and/or y direction was
applied, since the positions in the galaxy images are only precise to
±1 pixel.
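A minimal version of this augmentation, restricted to 90-degree rotations for brevity (the arbitrary-angle rotation described above would need e.g. scipy.ndimage.rotate):

```python
import numpy as np

def augment(cube, rng):
    """Random flip, rotation and ±1 px translation of one (H, W, 5) data cube."""
    if rng.random() < 0.5:
        cube = cube[:, ::-1]                       # random horizontal flip
    cube = np.rot90(cube, k=int(rng.integers(4)))  # rotate by 0/90/180/270 deg
    dy, dx = rng.integers(-1, 2, size=2)           # ±1 pixel jitter in x and y
    return np.roll(cube, (int(dy), int(dx)), axis=(0, 1))

rng = np.random.default_rng(0)
augmented = augment(np.ones((64, 64, 5)), rng)
```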
Although an ensemble of trained models would be desirable for each of the ten
network variations, due to time and resource restrictions one trained model
has to suffice for each network variant.
Each network has been trained for 200 epochs using the Adadelta optimizer
[31]. Adadelta is a more robust extension of Adagrad [32] that adapts
learning rates based on a moving window of gradient updates, instead of
accumulating all past gradients. This way, Adadelta continues learning even
when many updates have been done. The final model was chosen to be the one
that produced the smallest loss on the validation set. The loss function was
the mean squared error for regression networks and the categorical
cross-entropy for classification networks.
6 CNN results
Though there are commonly used statistics, there is no exact consensus on
evaluation metrics for deep-learning-based redshift estimation. Therefore the
following metrics, used in various other papers ([9, 10, 11, 12]), have been
used to evaluate the performance of each model on the test set:

• the residuals, ∆z = (zCNN − zspec)/(1 + zspec), following Cohen et al.
(2000) [33]

• the prediction bias, <∆z>, defined as the mean of the residuals

• σ68, σ95, corresponding to the 68.27% and 95.45% spread of ∆z

• the Median Absolute Deviation (MAD) of ∆z, defined as the median of
|∆z − Median(∆z)|

• σMAD = 1.4826 · MAD, the standard deviation of ∆z under the assumption of
a normal distribution

• the fraction of outliers η in percent with |∆z| > 0.05
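Apart from σ68 and σ95, which are percentile spreads of ∆z, the metrics above reduce to a few NumPy lines:

```python
import numpy as np

def evaluation_metrics(z_cnn, z_spec, threshold=0.05):
    """Residual-based test-set metrics as listed above."""
    dz = (z_cnn - z_spec) / (1.0 + z_spec)               # residuals
    bias = dz.mean()                                      # prediction bias <dz>
    mad = np.median(np.abs(dz - np.median(dz)))           # median abs. deviation
    sigma_mad = 1.4826 * mad                              # std. dev. if dz normal
    eta = 100.0 * np.mean(np.abs(dz) > threshold)         # outlier fraction in %
    return bias, sigma_mad, eta
```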
There is a lack of common ground especially on the definition of when to
count a prediction as an outlier. Hoyle (2016) [9] uses an absolute value of
0.15 for the threshold, while Mu et al. (2018) [12] choose their threshold as
3 · δ, where δ represents the standard deviation of ∆z; in their case the
thresholds range from 0.0885 to 0.1188 for different networks. Pasquet et
al. (2019) [10] choose their threshold as 5 · σMAD ≈ 0.05, with
σMAD = 1.4826 · MAD achieved by their network. Since the latter work also
covers the same redshift range of up to 0.4, the value of 0.05 was chosen as
the threshold for the further evaluation of all networks, regardless of the
MAD or σMAD of the specific network. A full table of all evaluation metrics
for the ten
mentioned networks can be found in appendix A.1, the corresponding plots
in Appendix A.2 to A.11. There are five different plots for each network:
• Upper left: spectroscopic redshift zspec against the redshift predicted by
the network, zCNN, in a scatter plot; the same plot with transparency on its
right.

• Upper right: zspec against zCNN in a density plot.

• Middle left: histogram of the residuals ∆z with mean and standard
deviation.

• Middle right: zspec against the residuals ∆z in a density plot with a
linear fit.

• Bottom: histograms of the residuals ∆z with mean and standard deviation
for different redshift bins.
All networks produce competitive results and are only beaten by those of Pasquet et al. [10]. A likely reason is that they used a deeper CNN with five instead of just three inception blocks. Since several different networks rather than a single one were examined in this work, a shallower architecture was chosen in order to keep training times in check.
With perfect predictions, the upper left plot would show just the positive half of an identity line (red line in the plots). For all networks the predictions are scattered more or less closely around that line, demonstrating general predictive power in every case. The upper right plots are better suited to show the density of this scatter and reflect the imbalance in the dataset, since the density drops sharply at the upper and lower ends of the redshift range.
The discrepancy between the standard deviations in the middle left plots and the corresponding σMAD reveals that the residuals are not normally distributed. Judging by the means alone, the redshift precision criterion defined by Knox et al. (2006) [34] of a bias better than 0.002 seems fulfilled for all networks except the one with ID 6. But taking a closer look via separate histograms for different redshift ranges, as in the bottom plots, reveals another picture. In the first panel of every bottom plot the mean is slightly above zero, whereas in the middle and right panels the mean is below zero and shifts further and further to the left, mostly exceeding a bias of 0.002. This effect shows even more clearly in the middle right plots, where the spectroscopic redshift zspec is plotted against the residuals ∆z in a density plot and a line has been fitted to the distribution. Every plot shows a clear negative trend with f(0) > 0, meaning the networks overestimate smaller redshift values and underestimate the larger ones. This is no surprise for an imbalanced dataset that lacks samples at both ends of the redshift range: estimates are drawn towards the centre of the distribution.
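The trend in the middle right plots corresponds to an ordinary least-squares line fitted to the residuals; a minimal sketch (a negative slope a together with a positive intercept b = f(0) reproduces the pattern described above):

```python
import numpy as np

def residual_trend(z_spec, dz):
    """Fit f(z_spec) = a * z_spec + b to the residuals.

    a < 0 with b = f(0) > 0 means low redshifts are overestimated
    and high redshifts underestimated.
    """
    a, b = np.polyfit(z_spec, dz, deg=1)
    return a, b
```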
To answer the question of how the different techniques influence the networks' ability to predict the redshift, the following sections compare averaged network results for each of the variations described before.
6.1 Comparing different input dimensions
For the comparison between different input dimensions of the galaxy images, the average results of the 64x64 networks (model IDs 1, 3, 4, 5 and 6) are compared to those of the 32x32 networks (model IDs 2, 7, 8, 9 and 10). The results can be seen in table 3. The average 64x64 network outperforms the 32x32 network in every metric: for the 32x32 networks σ68 is larger by 5.7%, σ95 by 4.5%, MAD by 5.8%, σMAD by 5.7% and η by 23.5%.
method σ68 σ95 MAD σMAD η [%]
64x64 0.02743 0.06538 0.00896 0.01329 0.86200
32x32 0.02900 0.06833 0.00948 0.01405 1.06443
Table 3: Comparison between 64x64 and 32x32 model results. Best value in
each column is underlined.
6.2 Comparing class 1 and class 2 method
For the comparison between the different methods of returning to single float values for classification networks, the average results of networks using the highest probability method (class 1, model IDs 3, 5, 7 and 9) are compared to those using the softmax weighted label vector method (class 2, model IDs 4, 6, 8 and 10). The results can be seen in table 4. For the class 1 networks σ68 is larger by 0.2%, σ95 by 2.3% and η by 20.5%; for the class 2 networks MAD is larger by 0.5% and σMAD by 0.7%. For most metrics both methods thus appear roughly equal, with differences smaller than 1%, but for σ95 and especially η, class 2 outperforms class 1. However, looking at the upper left plots of e.g. appendices A.6 and A.7, it is evident that the softmax weighted label vector method introduces a bias in the lower redshift region for networks with Gaussian label smoothing. Since evaluating both methods requires no additional training and therefore comes at virtually no computational cost, it can be advantageous to examine both whenever possible.
method σ68 σ95 MAD σMAD η [%]
class 1 0.02843 0.06803 0.00926 0.01372 1.08377
class 2 0.02836 0.06647 0.00931 0.01381 0.89927
Table 4: Comparison between highest probability and softmax weighted la-
bel vector method for classification networks. Best value in each column is
underlined.
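The two decoding methods compared here can be sketched as follows (the bin centres and probability vector are illustrative; the actual binning follows section 4.2):

```python
import numpy as np

def decode_class1(probs, bin_centers):
    """Highest probability method: take the centre of the most probable bin."""
    return float(bin_centers[np.argmax(probs)])

def decode_class2(probs, bin_centers):
    """Softmax weighted label vector method: probability-weighted mean of bin centres."""
    return float(np.dot(probs, bin_centers))
```

Both decodings reuse the same trained network output, which is why evaluating both costs no extra training, as noted above.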
6.3 Comparing no smoothing and Gaussian smoothing
For the comparison between classification networks without label smoothing and networks with Gaussian label smoothing, the average results of networks without label smoothing (model IDs 3, 4, 7 and 8) are compared to those of networks with Gaussian label smoothing (model IDs 5, 6, 9 and 10). The results can be seen in table 5. Networks with Gaussian label smoothing outperform networks without label smoothing in every metric: for networks without label smoothing σ68 is larger by 6.1%, σ95 by 8.3%, MAD by 4.5%, σMAD by 4.4% and η by 73.8%.
method σ68 σ95 MAD σMAD η [%]
no smoothing 0.02923 0.06994 0.00949 0.01406 1.25889
Gaussian smoothing 0.02756 0.06456 0.00908 0.01347 0.72415
Table 5: Comparison between classification networks without label smooth-
ing and with Gaussian label smoothing
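Gaussian label smoothing replaces the one-hot classification target with a normalized Gaussian centred on the true bin; a minimal sketch (the width parameter here is illustrative, the actual choice is described in section 4.2.2):

```python
import numpy as np

def gaussian_smooth_label(true_bin, n_bins, sigma=1.0):
    """Replace a one-hot label with a normalized Gaussian bump over the bins."""
    bins = np.arange(n_bins)
    label = np.exp(-0.5 * ((bins - true_bin) / sigma) ** 2)
    return label / label.sum()   # normalize so the label is a distribution
```

Compared to a hard one-hot target, this penalizes near-miss bins less than distant ones, which matches the ordered nature of redshift bins.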
6.4 Comparing regression and classification
For the comparison between regression and classification networks the process is not as straightforward as for the other comparisons, since there are multiple possible subsets of classification networks. Moreover, the results from sections 6.2 and 6.3 are not necessarily compatible, as explained in section 6.3. Therefore the average regression network results (model IDs 1 and 2) are compared to all possible subsets of classification model ensembles:
• A: all classification models (model IDs 3, 4, 5, 6, 7, 8, 9 and 10)
• B: all classification models with class 1 method (model IDs 3, 5, 7 and 9)
• C: all classification models with class 2 method (model IDs 4, 6, 8 and 10)
• D: all classification models without label smoothing (model IDs 3, 4, 7 and 8)
• E: all classification models with Gaussian label smoothing (model IDs 5, 6, 9 and 10)
• F: all classification models with class 1 method and without label smoothing (model IDs 3 and 7)
• G: all classification models with class 1 method and Gaussian label smoothing (model IDs 5 and 9)
• H: all classification models with class 2 method and without label smoothing (model IDs 4 and 8)
• I: all classification models with class 2 method and Gaussian label smoothing (model IDs 6 and 10)
The results can be seen in table 6. Table 7 shows the same results divided by the value of the regression ensemble in each column, which makes them easier to compare: a value above 1 indicates performance worse than the regression ensemble, a value below 1 indicates performance better than it. For σ68, MAD and σMAD only ensemble G is better; for σ95 and η, ensembles E, G and I are better than the regression ensemble. These are exactly the three ensembles with Gaussian label smoothing, further strengthening the belief in its positive contribution to model prediction performance. Comparing ensemble C (all class 2 models) with E (all models with Gaussian label smoothing) reveals that, of the two possible differentiations, Gaussian label smoothing holds more predictive power. Of ensembles E, G and I, only ensemble G was able to outperform the regression ensemble in every metric. A notable ensemble is H, since it is the combination commonly used for this task in previous works. It could not outperform the regression ensemble and may therefore not be the best suited for this task.
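The normalization used in table 7 is an element-wise division of each ensemble's metrics by the regression row of table 6; for example, for ensemble G (values taken from table 6):

```python
import numpy as np

# Metric order: sigma_68, sigma_95, MAD, sigma_MAD, eta [%]
regression = np.array([0.02748, 0.06527, 0.00897, 0.01330, 0.84998])
ensemble_g = np.array([0.02702, 0.06447, 0.00884, 0.01311, 0.76676])

# Values below 1 mean the ensemble beats the regression networks.
relative = ensemble_g / regression
```

For ensemble G every entry comes out below 1, reproducing its row in table 7.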
method σ68 σ95 MAD σMAD η [%]
regression 0.02748 0.06527 0.00897 0.01330 0.84998
A 0.02923 0.06994 0.00949 0.01406 1.25889
B 0.02843 0.06803 0.00926 0.01372 1.08377
C 0.02836 0.06647 0.00931 0.01381 0.89927
D 0.02923 0.06994 0.00949 0.01406 1.25889
E 0.02756 0.06456 0.00908 0.01347 0.72415
F 0.02984 0.07160 0.00967 0.01433 1.40078
G 0.02702 0.06447 0.00884 0.01311 0.76676
H 0.02861 0.06828 0.00930 0.01379 1.11700
I 0.02810 0.06466 0.00932 0.01382 0.68154
Table 6: Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Best value in each column is underlined.

method σ68 σ95 MAD σMAD η [%]
regression 1.000000 1.000000 1.000000 1.000000 1.000000
A 1.063683 1.071549 1.057971 1.057143 1.481082
B 1.034571 1.042286 1.032330 1.031579 1.275054
C 1.032023 1.018385 1.037904 1.038346 1.057990
D 1.063683 1.071549 1.057971 1.057143 1.481082
E 1.002911 0.989122 1.012263 1.012782 0.851961
F 1.085881 1.096982 1.078038 1.077444 1.648015
G 0.983261 0.987743 0.985507 0.985714 0.902092
H 1.041121 1.046116 1.036789 1.036842 1.314149
I 1.022562 0.990654 1.039019 1.039098 0.801831
Table 7: Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Values are divided by the value of the regression ensemble in each column. Best value in each column is underlined.

7 Conclusion and outlook
In this work I have presented multiple deep CNNs that were trained and tested on the Main Galaxy Sample of the SDSS at z ≤ 0.4 to estimate photometric redshifts. Regression as well as different classification approaches have been examined, in order to evaluate the currently common technique of turning the regression task into a classification task.
I could produce competitive results for each of the ten different networks and show that the commonly used classification approach does not necessarily outperform the regular regression approach. It has been shown that the current approach of using same-sized 64x64 images delivers better results than the newly proposed variable 32x32 images. I introduced a label smoothing technique (Gaussian label smoothing), which not only improves the overall performance of the classification networks, but also enables them to outperform the regression networks consistently. Finally, returning to single numerical values from a classification prediction via the softmax weighted label vector method still delivers the best results, except when combined with Gaussian label smoothing. I therefore strongly recommend the use of Gaussian label smoothing in the task of galaxy redshift estimation via deep CNNs.
Pasquet et al. [10] argue that for most galaxies the precision is limited by the signal-to-noise ratio of the SDSS images rather than by the method. I hope my work has shown a promising change in method that can still improve results. To investigate this further, a bigger analysis utilizing established ensemble techniques (i.e. training multiple instances of the same network version with the same specifications but different random initial layer weights and combining the results) would be interesting. If the focus lies on just one or two network versions rather than ten, as in this work, one could easily increase model complexity (e.g. five instead of three inception modules) while still using the same resources.
Besides increasing model complexity and increasing the sample size from existing and upcoming surveys, other options that have not been used in this work but may improve results are:
• using class weights in classification (and, with some adjustments, also in regression) approaches to counteract the effects of an inevitably imbalanced dataset;
• utilizing cyclical learning rate schedules with stochastic gradient descent (with momentum), since Adam/Adadelta can have a tendency to overfit;
• making use of more recent ensembling techniques like Fast Geometric Ensembling (FGE) [35] or Stochastic Weight Averaging (SWA) [36]. SWA has indeed been tried in this work and showed promising improvements, but I was not able to examine it to its full extent;
• training a pre-trained model on simulated galaxy images. This could both serve as a way to verify the simulated galaxy images and improve pre-trained models via transfer learning.
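Of these, Stochastic Weight Averaging [36], for instance, boils down to averaging the weights of checkpoints collected along the training trajectory; a framework-agnostic sketch with NumPy (checkpoint collection and batch-norm re-estimation are omitted, and the function name is illustrative):

```python
import numpy as np

def swa_average(checkpoints):
    """Average corresponding weight tensors across a list of checkpoints.

    `checkpoints` is a list of weight lists, e.g. collected every few
    epochs under a cyclical or constant learning rate as in SWA [36];
    the result is one averaged weight list of the same shapes.
    """
    return [np.mean(tensors, axis=0) for tensors in zip(*checkpoints)]
```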
Appendices
A CNN results
A.1 Numeric results
ID input method < ∆z > σ68 σ95 MAD σMAD η [%]
1 64x64x5 reg 0.00089 0.02724 0.06550 0.00887 0.01315 0.870
2 32x32x5 reg 0.00085 0.02771 0.06504 0.00906 0.01344 0.830
3 64x64x5 class 1 0.00011 0.02938 0.06982 0.00952 0.01411 1.235
4 64x64x5 class 2 0.00078 0.02826 0.06703 0.00926 0.01372 1.015
5 64x64x5 class 1 0.00149 0.02587 0.06256 0.00844 0.01251 0.605
6 64x64x5 class 2 0.00335 0.02637 0.06200 0.00873 0.01295 0.585
7 32x32x5 class 1 -0.00004 0.03031 0.07338 0.00982 0.01456 1.567
8 32x32x5 class 2 0.00077 0.02896 0.06954 0.00935 0.01386 1.219
9 32x32x5 class 1 -0.00081 0.02817 0.06638 0.00925 0.01371 0.929
10 32x32x5 class 2 0.00158 0.02984 0.06731 0.00991 0.01470 0.778
Hoyle [9] 0.00 0.030 0.10 - - 1.71
Polsterer [11] -0.0003 - - 0.0128 - -
Pasquet [10] 0.00010 - - 0.00615 0.00912 0.31
Table 8: Numeric results as defined in chapter 6 for all
ten final models. reg: regression; class 1/2: as described
in section 4.2.3; bold: Gaussian smoothing. Best value
in each column is underlined.
A.2 Plots 64 regression
Figure 6: Model evaluation plots for model with ID 1.
A.3 Plots 32 regression
Figure 7: Model evaluation plots for model with ID 2.
A.4 Plots 64 classification 1
Figure 8: Model evaluation plots for model with ID 3.
A.5 Plots 64 classification 2
Figure 9: Model evaluation plots for model with ID 4.
A.6 Plots 64 classification 1 Gaussian smoothing
Figure 10: Model evaluation plots for model with ID 5.
A.7 Plots 64 classification 2 Gaussian smoothing
Figure 11: Model evaluation plots for model with ID 6.
A.8 Plots 32 classification 1
Figure 12: Model evaluation plots for model with ID 7.
A.9 Plots 32 classification 2
Figure 13: Model evaluation plots for model with ID 8.
A.10 Plots 32 classification 1 Gaussian smoothing
Figure 14: Model evaluation plots for model with ID 9.
A.11 Plots 32 classification 2 Gaussian smoothing
Figure 15: Model evaluation plots for model with ID 10.
B My SQL search
SELECT TOP 500000
sp.bestobjid, sp.ra, sp.dec, ph.run, ph.camcol, ph.field, ph.obj,
sp.z, sp.zErr, ph.expRad_u, ph.expRad_g, ph.expRad_r, ph.expRad_i,
ph.expRad_z, ph.deVRad_u, ph.deVRad_g, ph.deVRad_r, ph.deVRad_i,
ph.deVRad_z, ph.petroR90_u, ph.petroR90_g, ph.petroR90_r,
ph.petroR90_i, ph.petroR90_z, ph.petroR90Err_u, ph.petroR90Err_g,
ph.petroR90Err_r, ph.petroR90Err_i, ph.petroR90Err_z
FROM SpecObj AS sp
JOIN PhotoObj AS ph ON ph.objid = sp.bestobjid
WHERE
class = 'GALAXY'
AND sp.z > 0
AND sp.z <= 0.4
AND (sp.zERR/sp.z) < 0.1
AND sp.zWarning = 0
AND sp.primTarget = 64
AND ph.dered_r > 0
AND ph.dered_r < 22.2
AND ph.dered_i > 0
AND ph.dered_i < 21.3
AND ph.dered_u > 0
AND ph.dered_u < 22
AND ph.dered_z > 0
AND ph.dered_z < 20.5
AND ph.dered_g > 0
AND ph.dered_g < 22.2
AND clean = 1
AND (calibStatus_u & 1) != 0
AND (calibStatus_g & 1) != 0
AND (calibStatus_r & 1) != 0
AND (calibStatus_i & 1) != 0
AND (calibStatus_z & 1) != 0
AND ((flags & 0x10000000) != 0)
AND ((flags & 0x8100000c00a0) = 0)
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_u <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_g <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_r <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_i <= 0.2))
AND (((flags & 0x400000000000) = 0) OR (psfmagerr_z <= 0.2))
AND (((flags & 0x100000000000) = 0) OR (flags & 0x1000) = 0)
--/Removing Duplicates
AND mode = 1
--/Removing Objects with Deblending Problems
AND (flags_u & 0x20) = 0
AND (flags_u & 0x80000) = 0
AND ((flags_u & 0x400000000000) = 0 OR psfmagerr_u <= 0.2)
AND (flags_g & 0x20) = 0
AND (flags_g & 0x80000) = 0
AND ((flags_g & 0x400000000000) = 0 OR psfmagerr_g <= 0.2)
AND (flags_r & 0x20) = 0
AND (flags_r & 0x80000) = 0
AND ((flags_r & 0x400000000000) = 0 OR psfmagerr_r <= 0.2)
AND (flags_i & 0x20) = 0
AND (flags_i & 0x80000) = 0
AND ((flags_i & 0x400000000000) = 0 OR psfmagerr_i <= 0.2)
AND (flags_z & 0x20) = 0
AND (flags_z & 0x80000) = 0
AND ((flags_z & 0x400000000000) = 0 OR psfmagerr_z <= 0.2)
--/Removing Objects with Interpolation Problems
AND (flags_u & 0x800000000000) = 0
AND (flags_u & 0x10000000000) = 0
AND ((flags_u & 0x100000000000) = 0 OR (flags_u & 0x1000) = 0)
AND (flags_g & 0x800000000000) = 0
AND (flags_g & 0x10000000000) = 0
AND ((flags_g & 0x100000000000) = 0 OR (flags_g & 0x1000) = 0)
AND (flags_r & 0x800000000000) = 0
AND (flags_r & 0x10000000000) = 0
AND ((flags_r & 0x100000000000) = 0 OR (flags_r & 0x1000) = 0)
AND (flags_i & 0x800000000000) = 0
AND (flags_i & 0x10000000000) = 0
AND ((flags_i & 0x100000000000) = 0 OR (flags_i & 0x1000) = 0)
AND (flags_z & 0x800000000000) = 0
AND (flags_z & 0x10000000000) = 0
AND ((flags_z & 0x100000000000) = 0 OR (flags_z & 0x1000) = 0)
--/Removing Suspicious Detections
AND (flags_u & 0x10000000) != 0
AND (flags_u & 0x40000) = 0
AND (flags_u & 0x80) = 0
AND (flags_g & 0x10000000) != 0
AND (flags_g & 0x40000) = 0
AND (flags_g & 0x80) = 0
AND (flags_r & 0x10000000) != 0
AND (flags_r & 0x40000) = 0
AND (flags_r & 0x80) = 0
AND (flags_i & 0x10000000) != 0
AND (flags_i & 0x40000) = 0
AND (flags_i & 0x80) = 0
AND (flags_z & 0x10000000) != 0
AND (flags_z & 0x40000) = 0
AND (flags_z & 0x80) = 0
ORDER BY z DESC
C My CNN architecture
Figure 16: Schematic network architecture.
References
[1] WA Baum. Problems of extragalactic research. In IAU Symposium, 1962, volume 15, pages 390–397, 1962.
[2] Stephane Arnouts, Stefano Cristiani, Lauro Moscardini, Sabino Matarrese, Francesco Lucchin, Adriano Fontana, and Emanuele Giallongo. Measuring and modelling the redshift evolution of clustering: the hubble deep field north. Monthly Notices of the Royal Astronomical Society, 310(2):540–556, 1999.
[3] David A Bohlender, Daniel Durand, and Thomas H Handley. Astronomical Data Analysis Software and Systems XI, volume 281. 2002.
[4] Gabriel B Brammer, Pieter G van Dokkum, and Paolo Coppi. Eazy: a fast, public photometric redshift code. The Astrophysical Journal, 686(2):1503, 2008.
[5] Adrian A Collister and Ofer Lahav. Annz: estimating photometric redshifts using artificial neural networks. Publications of the Astronomical Society of the Pacific, 116(818):345, 2004.
[6] I Csabai, L Dobos, M Trencseni, G Herczegh, P Jozsa, N Purger, T Budavari, and AS Szalay. Multidimensional indexing tools for the virtual observatory. Astronomische Nachrichten: Astronomical Notes, 328(8):852–857, 2007.
[7] Samuel Carliles, Tamas Budavari, Sebastien Heinis, Carey Priebe, and Alexander S Szalay. Random forests for photometric redshifts. The Astrophysical Journal, 712(1):511, 2010.
[8] Ben Hoyle, Markus Michael Rau, Roman Zitlau, Stella Seitz, and Jochen Weller. Feature importance for machine learning redshifts applied to sdss galaxies. Monthly Notices of the Royal Astronomical Society, 449(2):1275–1283, 2015.
[9] Ben Hoyle. Measuring photometric redshifts using galaxy images and deep neural networks. Astronomy and Computing, 16:34–40, 2016.
[10] Johanna Pasquet, Emmanuel Bertin, Marie Treyer, Stephane Arnouts, and Dominique Fouchez. Photometric redshifts from sdss images using a convolutional neural network. Astronomy & Astrophysics, 621:A26, 2019.
[11] Antonio D’Isanto and Kai Lars Polsterer. Photometric redshift estimation via deep learning: generalized and pre-classification-less, image based, fully probabilistic redshifts. Astronomy & Astrophysics, 609:A111, 2018.
[12] Yong-Huan Mu, Bo Qiu, Jian-Nan Zhang, Jun-Cheng Ma, and Xiao-Dong Fan. Photometric redshift estimation of galaxies with convolutional neural network.
[13] Michael A Strauss, David H Weinberg, Robert H Lupton, Vijay K Narayanan, James Annis, Mariangela Bernardi, Michael Blanton, Scott Burles, AJ Connolly, Julianne Dalcanton, et al. Spectroscopic target selection in the sloan digital sky survey: the main galaxy sample. The Astronomical Journal, 124(3):1810, 2002.
[14] Romina Ahumada, Carlos Allende Prieto, Andres Almeida, Friedrich Anders, Scott F Anderson, Brett H Andrews, Borja Anguiano, Riccardo Arcodia, Eric Armengaud, Marie Aubert, et al. The sixteenth data release of the sloan digital sky surveys: First release from the apogee-2 southern survey and full release of eboss spectra. arXiv preprint arXiv:1912.02905, 2019.
[15] Vahe Petrosian. Surface brightness and evolution of galaxies. The Astrophysical Journal, 209:L1–L5, 1976.
[16] Michael R Blanton, Julianne Dalcanton, Daniel Eisenstein, Jon Loveday, Michael A Strauss, Mark SubbaRao, David H Weinberg, John E Anderson Jr, James Annis, Neta A Bahcall, et al. The luminosity function of galaxies in sdss commissioning data. The Astronomical Journal, 121(5):2358, 2001.
[17] Naoki Yasuda, Masataka Fukugita, Vijay K Narayanan, Robert H Lupton, Iskra Strateva, Michael A Strauss, Zeljko Ivezic, Rita SJ Kim, David W Hogg, David H Weinberg, et al. Galaxy number counts from the sloan digital sky survey commissioning data. The Astronomical Journal, 122(3):1104, 2001.
[18] JL Hodges. The significance probability of the smirnov two-sample test. Arkiv for Matematik, 3(5):469–486, 1958.
[19] Aurelien Geron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media, 2019.
[20] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[21] Haigang Zhu, Xiaogang Chen, Weiqun Dai, Kun Fu, Qixiang Ye, and Jianbin Jiao. Orientation robust object detection in aerial images using deep convolutional neural network. In 2015 IEEE International Conference on Image Processing (ICIP), pages 3735–3739. IEEE, 2015.
[22] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951, 2000.
[23] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[24] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[26] Zhen Huang, Tim Ng, Leo Liu, Henry Mason, Xiaodan Zhuang, and Daben Liu. Sndcnn: Self-normalizing deep cnns with scaled exponential linear units for speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6854–6858. IEEE, 2020.
[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[29] John S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.
[30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[31] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[32] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
[33] Judith G Cohen, David W Hogg, Roger Blandford, Lennox L Cowie, Esther Hu, Antoinette Songaila, Patrick Shopbell, and Kevin Richberg. Caltech faint galaxy redshift survey. x. a redshift survey in the region of the hubble deep field north. The Astrophysical Journal, 538(1):29, 2000.
[34] Lloyd Knox, Yong-Seon Song, and Hu Zhan. Weighing the universe with photometric redshift surveys and the impact on dark energy forecasts. The Astrophysical Journal, 652(2):857, 2006.
[35] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns, 2018.
[36] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization, 2018.
List of Figures
1 RGB image from irg bands of frame 6073-4-50 (run-camcol-field)
2 Distribution of image width/height using equation 1 with cutoff at A = 32 (red) and median at A = 54 (black)
3 u band comparison images for three different galaxies with given SDSS bestobjid, zspec and Petrosian image size in pixels, which corresponds to the value A in Figure 2. Rescaled 32 pixel cutouts on the left and original sized 64 pixel cutouts on the right.
4 Redshift distribution of all samples found by the SQL search (blue) in comparison with all samples used with the 64 pixel architecture (orange) or the 32 pixel architecture (green)
5 Inception module with dimension reductions as defined in [27]
6 Model evaluation plots for model with ID 1.
7 Model evaluation plots for model with ID 2.
8 Model evaluation plots for model with ID 3.
9 Model evaluation plots for model with ID 4.
10 Model evaluation plots for model with ID 5.
11 Model evaluation plots for model with ID 6.
12 Model evaluation plots for model with ID 7.
13 Model evaluation plots for model with ID 8.
14 Model evaluation plots for model with ID 9.
15 Model evaluation plots for model with ID 10.
16 Schematic network architecture.
List of Tables
1 Kolmogorov-Smirnoff test results for different subset combinations
2 Fictive classifier prediction for a single sample
3 Comparison between 64x64 and 32x32 model results. Best value in each column is underlined.
4 Comparison between highest probability and softmax weighted label vector method for classification networks. Best value in each column is underlined.
5 Comparison between classification networks without label smoothing and with Gaussian label smoothing
6 Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Best value in each column is underlined.
7 Comparison between regression networks and different ensembles of classification networks as defined in section 6.4. Values are divided by the value of the regression ensemble in each column. Best value in each column is underlined.
8 Numeric results as defined in chapter 6 for all ten final models. reg: regression; class 1/2: as described in section 4.2.3; bold: Gaussian smoothing. Best value in each column is underlined.
Declaration of Authorship
I hereby declare that this thesis is my own work and that I have not used any sources and aids other than those stated in the thesis.
Erklärung
I hereby declare that I have written this thesis independently and have not used any sources or aids other than those stated in the thesis.
Munich, September 14, 2020: .........................................
Benjamin Alber