MSc Artificial Intelligence Master Thesis Bridging the Shifting Distribution Gap: Domain Adaptation for Semantic Segmentation and Visual Data Streams by Sindi Shkodrani 11128348 May 22, 2018 36 ECTS September 2017 - May 2018 Supervisor: Dr. E. Gavves Daily supervisor: Dr. M. Hofmann Assessor: Prof. Dr. C.G.M. Snoek


MSc Artificial Intelligence

Master Thesis

Bridging the Shifting Distribution Gap:

Domain Adaptation for Semantic Segmentation and Visual Data Streams

by

Sindi Shkodrani

11128348

May 22, 2018

36 ECTS

September 2017 - May 2018

Supervisor: Dr. E. Gavves

Daily supervisor: Dr. M. Hofmann

Assessor: Prof. Dr. C.G.M. Snoek


Abstract

The focus of this thesis is visual domain adaptation that is robust to domain shifts and to differences in data distribution statistics between potential source and target domains, across different dataset types and tasks. Associative domain adaptation is reformulated to work well in realistic scenarios and applications where source and target domains cannot be guaranteed to have similar class distribution statistics.

In addition, modern deep learning applications require methods that scale well with the amount of supervised and unsupervised data and are able to transfer the knowledge learned from previous datasets. This work revisits static domain adaptation in the context of domain shift that arises in open, dynamic data sources, such as image data streams in the wild. Traditional domain adaptation is usually applied to two static domains, whereas in practice the data is usually not all available at once. This thesis develops a framework that redefines static domain adaptation in a dynamic context, where data can be treated sequentially, similar to a streaming application.

Results are reported on several domain adaptation benchmark datasets for classification. Another application with growing interest in robust domain adaptation techniques is semantic segmentation, and this work develops domain adaptation for semantic segmentation with associative learning. Finally, a framework for adaptation over distribution shifts that change in time is introduced, and extensive experiments are reported that indicate how this framework can be used to adapt to unsupervised data bundles that arrive later in training in a streaming-like fashion.


Acknowledgements

The biggest thanks go to my supervisors. Because of them I had a very thorough thesis experience and learned a lot in the process. To Michael, for being involved and always finding time to discuss and put effort in this work despite his busy schedule as a manager. To Stratis, for his dedication and intuition on directing work in the right manner, by giving the proper advice on how to tackle things.

My gratitude goes out to them and Cees Snoek for agreeing to be in the committee and to read the final thesis on short notice.

Many thanks to my colleagues at TomTom who were always helpful and approachable for questions during my time as a master thesis intern.

Finally, I’d like to thank my family and friends for their support and understanding of my lack of presence recently. Special thanks to my researcher friends who are always motivating me with their dedication, to Peter for the constant patience and support and to Brian for relocating to bring a bit of home closer these months.


Contents

1 Introduction
  1.1 Motivation
  1.2 Overview and Contributions

2 Preliminaries and Related Research
  2.1 Domain Adaptation Overview
    2.1.1 Introduction
      Notation
      The domain shift problem
      Categories and lines of research
    2.1.2 Shallow Approaches
      Instance reweighting methods
      Parameter adaptation methods
      Feature augmentation methods
      Feature space alignment methods
      Feature transformation methods
    2.1.3 Deep Approaches
      Discrepancy based adaptation
      Adversarial domain adaptation
      Data reconstruction-based methods
    2.1.4 Domain Adaptation for Semantic Segmentation
      Adversarial discriminative methods
      Adversarial generative methods
      Other methods
  2.2 Learning Models for Streaming Data
    2.2.1 Introduction
      The concept drift problem
      Considerations
    2.2.2 Algorithms and methods
  2.3 Discussion

3 Method
  3.1 Domain Adaptation with Associative Learning
    3.1.1 Associative domain adaptation
      From maximum mean discrepancy to learning by associations
      Definition
    3.1.2 Relaxing the class distribution assumption
      Intuition
      Estimating visit loss weights
  3.2 Associative Domain Adaptation for Semantic Segmentation
    3.2.1 Fully convolutional networks for semantic segmentation
    3.2.2 Embeddings in semantic segmentation networks
      Visualizing behavior
      Analyzing embedding distance metrics
      The probabilistic interpretation of distances and importance of numerical stability
    3.2.3 Adapting segmentation models
      The designated embedding layer
      Handling memory constraints
      Adaptation
  3.3 Associative Adaptation for Streaming Data
    3.3.1 From domain adaptation to sequential adaptation for streaming
    3.3.2 Building a framework for adapting classifiers in time
    3.3.3 Adapting for streaming data

4 Experiments and Results
  4.1 Robust Associative Domain Adaptation for Image Classification
    4.1.1 Datasets and adaptation benchmarks
    4.1.2 Balancing distribution differences with weighted visit loss
    4.1.3 Understanding embedding associations: The effect of the metric and normalization method
  4.2 Domain Adaptation for Semantic Segmentation
    4.2.1 Datasets
    4.2.2 Towards semantic segmentation with patchwise classification
      The patchwise classification dataset
      Adaptation results
    4.2.3 Adapting Semantic Segmentation Models
      Adaptation results
  4.3 Adapting Models for Streaming Data
    4.3.1 Dataset and setup
    4.3.2 Adaptation results

5 Summary and Conclusions
  5.1 Summary
  5.2 Conclusions
  5.3 Future Work

References


1 Introduction

1.1 Motivation

There are things known and there are things unknown, and in between are the doors of perception.

Aldous Huxley

Recent improvements in deep architectures, in combination with the increasing availability of large annotated image datasets, have allowed for higher accuracy and usefulness of deep learning methods. However, labels are costly to obtain, especially for dense prediction tasks where every pixel in an image needs to be labelled. As an example, accurate pixel-level annotation of the Cityscapes dataset [10] for semantic segmentation, which consists of images from urban areas, took more than 1.5 hours per image. Developing methods to exploit the availability of unsupervised data is crucial to the scalability and applicability of deep learning advancements.

In addition, acquisition of a finite amount of labeled images results in an inevitable dataset bias, which inhibits learned models from generalizing well to data collected under different conditions. These biases can stem from small shifts such as lighting or pose as well as from large appearance or shape differences between objects. Robust approaches for transferring knowledge learned in a supervised manner to previously unseen data are therefore a necessity.

Domain adaptation is the subset of transfer learning that is concerned with transferring knowledge learned on a supervised source set to a target set where no annotations are available [11]. These sets share the same label space, but due to the dataset bias the data distribution differs between source and target sets, causing models learned on the source to fail on the target. This is known as the domain shift problem. Due to this shift, tailored methods have to be developed that exploit both supervised and unsupervised data to train models that work well for both. Shallow methods dominated the field until the last few years, when the first deep approaches succeeded in adapting models for simple classification tasks. With vast amounts of unlabeled data and various domain shifts, it is necessary to develop algorithms that generalize well across different adaptation tasks. Capturing representations that are consistent across domains and applicable across tasks with a robust domain adaptation method is at the forefront of this research.

A particularly interesting task for domain adaptation that has only recently been brought to the attention of researchers is semantic segmentation. Highly accurate semantic segmentation methods are becoming increasingly important with the advancements towards autonomous driving, where semantic understanding of images is necessary not only for mapping of roads but also for per-pixel analysis of images collected from in-car cameras. While algorithmic advancements are being made, dependency on annotated datasets remains a bottleneck.

Deep domain adaptation methods for dense prediction tasks were almost non-existent until Hoffman et al. [24] pioneered adaptation attempts for semantic segmentation. Due to the higher complexity compared to classification, as well as the deeper architectures and larger datasets used, adapting for dense prediction is considered to be a harder task. Recent advancements in adversarial methods and generative models have found application and are being used in this direction [62].

Particularly fundamental for the semantic segmentation task is the finding that adaptation can be achieved from models trained on synthetic data rendered from computer game engines to real-world images. Synthetic data comes at low cost and with highly accurate labels, so methods that exploit the synthetic domain properly can facilitate improvements in real domains immensely. Figure 1.1 illustrates the idea of adapting from synthetically generated data to real images. Synthetic-to-real adaptation for semantic segmentation is part of the focus of this work.

Figure 1.1: Adapting from synthetic to real domain images. A shift in object appearance can be observed across domains.

Besides cross-task applicability, taking a look at how domain adaptation is addressed reveals meaningful insights. Interestingly, to date domain adaptation is addressed mostly in the context of static datasets. A domain is defined over marginal and conditional distributions of a set of data with respect to its labels [56]. It is usually represented by a finite dataset, which is likely to contain a sample selection bias. However, in modern applications the data from a defined domain is not usually available all at once. If we consider, for instance, image data collected for training models for self-driving cars, the data collected from different cities comes in at different times in "bundles" or domain-specific sets. This data may have a similar distribution to the previously collected samples, but usually suffers from sample selection bias. We also cannot consider the incoming data as coming from an entirely different domain, as the difference in distribution will depend on the sample selection bias. This dynamic work setting requires flexible approaches that can be easily adapted to new data [17]. As we move from "closed" and static dataset sources to "open" and dynamic dataset sources, the non-stationary dataset statistics become a function of time.

Similarly, this happens in a more structured way with streaming data applications. Streaming data comes in real time or accumulated in bundles, and labeling it takes longer than it takes for the data to become available. Think, for instance, of data incoming from social media or video streams in the wild, where frames observed at any moment might change drastically over time. In Fig. 1.2 we show images from the GTA5 synthetic dataset [43], which is collected by rendering video-game play. Looking at this dataset sequentially as it was collected, we can observe visual differences across sequences which may lead classifiers to perform worse if the appearance of this data is considerably different from the dataset used for training. In streaming applications the distribution shift of data over time is called concept drift. In addition, the data is not assumed to be independent and identically distributed. Instead, we might have bundles of data representing very small parts of the distribution, which is where the sample selection bias is observed. Annotations for the abundant data are not cheap and they are usually obtained more slowly than the data arrives.

Figure 1.2: Observing changing data distribution over different timesteps in dynamic data. As the GTA5 dataset was collected during video game play, changing appearance of the scenes over time can be observed in the dataset.

Due to the distribution dynamics and further considerations, such as the memory limitations for storing the vast amount of incoming data, dealing with streaming data calls for frameworks and methods that differ from traditional tasks. However, including current tasks that deal with non-stationary distribution environments in a single framework, in line with modern application requirements, lays the grounds for dealing with visual data at large in the wild. This work attempts to develop a framework for streaming that can be applied across tasks for improving classifiers in an unsupervised manner, extending to better adaptive learning over time.

1.2 Overview and Contributions

In line with the motivation discussed above, this work proposes an adaptation framework that generalizes domain adaptation to a dynamic context and is applied across tasks. A robust domain adaptation method is explored and developed to work well for scenarios of evolving distribution statistics over time and tasks where complex class distribution statistics are observed across sets.

The work is structured as follows. In Section 2 some preliminaries of the introduced topics as well as an overview of related research are presented. We go especially in depth into deep visual domain adaptation for classification and dense prediction tasks and discuss how this leads to our decisions on the method.

In Section 3 we discuss the approach. Using associative learning [19] as a base method, an adaptation technique that associates embeddings of supervised and unsupervised samples in latent space, we reformulate it to work well and generalize in scenarios where class distribution statistics between source and target are dissimilar. This enables the application of the method to tasks such as semantic segmentation and streaming data classification, as further presented in this section.

Next, Section 4 reports extensive experimental results across several adaptation benchmarks for classification and semantic segmentation. Further experiments are reported to demonstrate classifier performance improvements within the proposed streaming framework.


A summary, detailed analysis and discussion of the findings, along with potential future directions of this research, are given in Section 5.

The contributions of this work are as follows:

• A broad review of research in domain adaptation including existing adaptation methods for semantic segmentation

• An overview of machine learning for streaming data and discussion on differences with static data methods

• A formulation of the associative learning method for domain adaptation that relaxes previous assumptions on source and target class distributions

• Considerations for applying associative domain adaptation to the semantic segmentation task

• A novel framework for adaptation of dynamic distribution shifts that allows exploitation of unsupervised data to improve classifiers over time

• Empirical evidence of the success of the reviewed approach in multiple domain adaptation benchmarks

• An in-depth analysis and discussion of extensibility and applicability of the above to dynamic or streaming data in the wild


2 Preliminaries and Related Research

2.1 Domain Adaptation Overview

2.1.1 Introduction

Notation

Unsupervised domain adaptation, or simply domain adaptation, is considered to be a subset of transductive transfer learning where the tasks for both source and target datasets are the same, but the distributions differ due to dataset bias and distribution mismatch [11, 62].

More formally, we want to adapt a model learned from a labeled source dataset, consisting of source features $\mathcal{X}_s$ and a marginal probability distribution $P(X_s)$ over the data, written $D_s = \{\mathcal{X}_s, P(X_s)\}$, to an unlabeled target dataset $D_t = \{\mathcal{X}_t, P(X_t)\}$. The source and target data distributions are different, so $P(X_s) \neq P(X_t)$ due to the dataset bias. The label space of source and target datasets is the same, so $Y_s = Y_t$. While source data is paired with its annotations as $(X_s, Y_s)$, only the data $X_t$ is available for the target.

In some cases a small set of labels is available in the target dataset, in which case semi-supervised domain adaptation approaches are used. However, in most cases the term domain adaptation refers to the unsupervised version.

Further, there are two approaches to unsupervised domain adaptation. Conservative domain adaptation assumes that the same classifier can be used for both source and target datasets, that is, that the source classifier space contains one that performs well on both. In the non-conservative case, this assumption is not made and the classifier differs from source to target [46].

The domain shift problem

A domain is defined over the marginal and conditional distributions of a set of data with respect to its labels [56]. A domain is usually represented by a finite dataset, which is likely to contain a sample selection bias. This distribution shift or domain shift between datasets causes classifiers learned in one domain to fail when applied to another domain of the same categories. This concept is illustrated in Figure 2.1.

There is a wide range of what can be considered a domain shift, including differences due to the acquisition of data, e.g. lighting, conditions and point of view in images. More complex domain shifts can be caused by intra-class variability and category biases between datasets. Sometimes research methods tackle small and large domain shifts differently [24].

In addition, homogeneous domain adaptation refers to cases where the source and target feature spaces are the same ($\mathcal{X}_s = \mathcal{X}_t$), while in heterogeneous domain adaptation features in source and target can have different representation spaces as well as different modalities ($\mathcal{X}_s \neq \mathcal{X}_t$) [11].


Figure 2.1: A domain shift due to sample selection bias causes classifiers trained on source to fail when applied to target domain data. Domain adaptation aims to correct classifiers for the shift. Image from [45].

Categories and lines of research

Initial domain adaptation attempts were shallow models, either homogeneous or heterogeneous, with the classifier assumption being conservative or non-conservative as explained above.

Later, a line of research combined deep features extracted from deep models with shallow classifiers to obtain higher domain adaptation performance than the previous shallow approaches, before the behavior of domains in deep models was well understood.

Last, deep domain adaptation methods started to emerge. These are categorized into discrepancy-based, adversarial and reconstruction-based methods [11, 62, 42].

2.1.2 Shallow Approaches

Early domain adaptation approaches relied on transformations and mappings of data statistics or on feature augmentation.

Instance reweighting methods weight each instance by estimating, for example, the ratio between the likelihoods of it being a source and a target sample. These likelihoods can be estimated independently with a classifier that labels samples as source or target. Many approaches use Maximum Mean Discrepancy between domain distributions [11]. Instance reweighting is illustrated in Figure 2.2.

Figure 2.2: Instance reweighting illustrated. Image from [33]
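
The reweighting idea can be sketched as follows. This is a minimal illustration and not the procedure of any cited work: it assumes a logistic-regression domain classifier fitted by plain gradient descent, and the function name `estimate_importance_weights` and the toy Gaussian data are hypothetical. The classifier odds p(target | x) / p(source | x) serve as a proxy for the likelihood ratio used to weight each source sample.

```python
import numpy as np

def estimate_importance_weights(xs, xt, steps=500, lr=0.1):
    """Weight each source sample by an estimate of p_target(x) / p_source(x)."""
    x = np.vstack([xs, xt])
    x = np.hstack([x, np.ones((len(x), 1))])                    # append a bias column
    d = np.concatenate([np.zeros(len(xs)), np.ones(len(xt))])   # domain labels: 0 = source, 1 = target
    w = np.zeros(x.shape[1])
    for _ in range(steps):                                      # gradient descent on the logistic loss
        p = 1.0 / (1.0 + np.exp(-x @ w))
        w -= lr * x.T @ (p - d) / len(x)
    p_src = 1.0 / (1.0 + np.exp(-x[:len(xs)] @ w))              # p(target | x) for source samples
    # classifier odds, corrected for the source/target sample-size ratio
    return p_src / (1.0 - p_src) * (len(xs) / len(xt))

# toy data (assumption): the target distribution is shifted to the right of the source
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(200, 1))
xt = rng.normal(1.0, 1.0, size=(200, 1))
weights = estimate_importance_weights(xs, xt)
```

In this toy setting, source samples that lie closer to the target distribution receive larger weights, so a classifier trained on the weighted source loss emphasizes the samples that resemble the target.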

Parameter adaptation methods are non-conservative domain adaptation methods, which do not assume that the same classifier may be used on both source and target. Typically these methods adapt the parameters of a model trained on source, e.g. an SVM. Adaptive SVM [64] uses perturbation functions to progressively adjust the classifier learned on source to target data.

Feature augmentation methods use an augmented feature space for source and target. In the "frustratingly easy" approach of [12], each feature is mapped to an augmented feature space by duplicating the feature vectors and padding with zero vectors, as $(x_s \; x_s \; 0)^T$ for source and $(x_t \; 0 \; x_t)^T$ for target. An SVM is then used to exploit features belonging to each domain and to both.
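
The augmentation map is simple enough to write down directly. The sketch below only illustrates the mapping described above; the helper names `augment_source` and `augment_target` are hypothetical, and the downstream SVM is omitted.

```python
import numpy as np

def augment_source(x):
    """Map x_s to (x_s, x_s, 0): shared copy, source-specific copy, zeros."""
    return np.hstack([x, x, np.zeros_like(x)])

def augment_target(x):
    """Map x_t to (x_t, 0, x_t): shared copy, zeros, target-specific copy."""
    return np.hstack([x, np.zeros_like(x), x])

xs = np.array([[1.0, 2.0]])
xt = np.array([[3.0, 4.0]])
aug_s = augment_source(xs)   # [[1, 2, 1, 2, 0, 0]]
aug_t = augment_target(xt)   # [[3, 4, 0, 0, 3, 4]]
```

With this layout, the inner product of two samples from the same domain is twice their original inner product, while a cross-domain pair only interacts through the shared block. A linear classifier on the augmented features can therefore learn weights that are shared as well as weights that are domain-specific.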

In [30], a common subspace is introduced where features from both domains can be projected with respective $W_1$ and $W_2$ matrices and a single classifier can be used (see Figure 2.3).

Figure 2.3: Source and target samples transformed into a heterogeneous feature space. Image from [30]

Feature space alignment methods minimize the domain shift by aligning source and target features. For instance, Correlation Alignment (CORAL) [52] aligns source and target by second-order statistics. More specifically, after computing the covariances $C_S$ and $C_T$ on source and target, the source data can be whitened as $D_S = D_S \, C_S^{-\frac{1}{2}}$ and re-colored with the target covariance as $D_S^{*} = D_S \, C_T^{\frac{1}{2}}$. A classifier trained on the transformed source data can then be applied to the target. Whitening both domains would not work well, as the data may lie on different subspaces [52]. Figure 2.4 demonstrates the process.

Figure 2.4: (a) Source and target normalized features have different covariance. (b) Source is whitened. (c) Source is re-colored with target covariance. (d) Whitening both sets wouldn’t work if they lie on different subspaces. Image from [52]
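
The whitening and re-coloring steps can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: matrix powers are computed by eigendecomposition, a small `eps` is added to both covariances for numerical stability, and the names `matrix_power_sym` and `coral` as well as the toy data are hypothetical.

```python
import numpy as np

def matrix_power_sym(c, power):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(c)
    return (vecs * vals ** power) @ vecs.T

def coral(source, target, eps=1e-6):
    """Whiten the source features, then re-color them with the target covariance."""
    d = source.shape[1]
    cs = np.cov(source, rowvar=False) + eps * np.eye(d)   # regularized source covariance
    ct = np.cov(target, rowvar=False) + eps * np.eye(d)   # regularized target covariance
    whitened = source @ matrix_power_sym(cs, -0.5)
    return whitened @ matrix_power_sym(ct, 0.5)

# toy data (assumption): two domains with different feature correlations
rng = np.random.default_rng(0)
source = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.3]])
target = rng.normal(size=(400, 3)) @ np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.0], [0.2, 0.0, 1.5]])
aligned = coral(source, target)
```

After the transformation, the covariance of `aligned` matches the target covariance up to the small regularization term, while the source labels are left untouched and can be used to train a classifier on the aligned features.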


Feature transformation methods are various and they attempt to minimize the discrepancy of source and target distributions in latent space by learning different transformations. For instance, Stacked Marginalized Denoising Autoencoders [7] learn representations by reconstruction, recovering denoised features by marginalizing the noise using correlations between source and target features. Multiple principles from feature transformation methods have been used in deep domain adaptation methods.

2.1.3 Deep Approaches

Discrepancy based adaptation

Discrepancy-based deep adaptation approaches are inspired by working concepts from early shallow methods, and usually attempt to minimize a discrepancy measure between source and target in latent space. Siamese architectures are often used, where the weights are shared between source and target in the conservative adaptation case, or the target weights are tuned further in the non-conservative case. Typically a loss function composed of the task loss and a discrepancy loss is used, thus $L = L_{task} + L_{da}$.

Deep Adaptation Networks [34], also known as DAN, use one such siamese architecture, where a multi-kernel Maximum Mean Discrepancy (MK-MMD) measure between the activations in the last three layers is minimized (see Figure 2.5). Previous approaches which used single-layer, single-kernel MMD (for instance [61]) were outperformed.

Figure 2.5: DAN siamese architecture using Multi-Kernel MMD for domain adaptation. Image from [34]
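
To make the discrepancy measure concrete, the sketch below computes a biased estimate of the squared MMD between two batches of activations under a mixture of Gaussian kernels. It is a simplified illustration and not the DAN implementation: the kernel weights are fixed to be equal and the bandwidths are arbitrary choices, both of which are assumptions of this sketch, as are the function names.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mk_mmd2(hs, ht, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD under an equally weighted kernel mixture."""
    total = 0.0
    for bw in bandwidths:
        total += (gaussian_kernel(hs, hs, bw).mean()
                  + gaussian_kernel(ht, ht, bw).mean()
                  - 2.0 * gaussian_kernel(hs, ht, bw).mean())
    return total / len(bandwidths)

# toy activations (assumption): b follows the same distribution as a, c is shifted
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(100, 4))
b = rng.normal(0.0, 1.0, size=(100, 4))
c = rng.normal(2.0, 1.0, size=(100, 4))
same_domain = mk_mmd2(a, b)
shifted_domain = mk_mmd2(a, c)
```

The estimate is close to zero when both batches come from the same distribution and grows with the shift, which is what makes it usable as a discrepancy term added to the task loss.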

The loss function including the task and discrepancy loss for the multi-kernel MMD becomes:

$$L = L_{classification} + \lambda \sum_{\ell = l_1}^{l_2} d_k^2(D_s^{\ell}, D_t^{\ell})$$

where the layer indices are set between 6 and 8, $d_k^2(D_s^{\ell}, D_t^{\ell})$ is the MK-MMD between source and target, and $D_*^{\ell} = h_i^{*\ell}$ is the hidden representation of source or target in layer $\ell$.

Similarly, in DeepCoral [53], which extends the shallow CORAL [52] approach mentioned above, the regularization loss for adaptation is:

$$L_{da} = \frac{1}{4d^2} \| C_S - C_T \|_F^2$$

where the squared Frobenius norm of the difference between source and target covariances is minimized.

In Residual Transfer Networks [35] a non-conservative approach is used, assuming that source and target classifiers differ by a residual function, $f_S(x) = f_T(x) + \Delta f(x)$. Residual blocks here are used not for feature mappings, but for the source-to-target mapping of the classifier.
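
As a concrete illustration of the DeepCoral regularizer above, the following sketch evaluates the covariance-matching loss on two batches of features. It assumes that $d$ is the feature dimension and that covariances are estimated per batch with `np.cov`; the function name `coral_loss` and the toy batch are hypothetical.

```python
import numpy as np

def coral_loss(hs, ht):
    """Squared Frobenius distance between batch covariances, scaled by 1 / (4 d^2)."""
    d = hs.shape[1]                      # feature dimension (assumed meaning of d)
    cs = np.cov(hs, rowvar=False)        # source batch covariance
    ct = np.cov(ht, rowvar=False)        # target batch covariance
    return np.sum((cs - ct) ** 2) / (4.0 * d * d)

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 3))
loss_identical = coral_loss(features, features)      # identical statistics
loss_scaled = coral_loss(features, 2.0 * features)   # target covariance is 4x the source
```

In a training loop this term would be added to the task loss, weighted by a trade-off coefficient, in the same way as the discrepancy term in $L = L_{task} + L_{da}$.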

In the recent Parameter Reference Loss [26] approach, an extra loss function is added besides the classifier and adapter loss. This loss component minimizes the distance between source and target parameters:

$$L = L_{task} + L_{MMD} + L_{PR} \quad \text{where} \quad L_{PR} = \sum_{i=1}^{N_P} \| p_{S_i} - p_{T_i} \|_1$$
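
The parameter reference term can be illustrated with a short sketch. It assumes that the source and target networks are given as lists of parameter arrays with matching shapes; the function name and the toy parameters are hypothetical and not taken from [26].

```python
import numpy as np

def parameter_reference_loss(source_params, target_params):
    """Sum of L1 distances between corresponding source and target parameters."""
    return sum(np.abs(ps - pt).sum() for ps, pt in zip(source_params, target_params))

# toy parameters (assumption): one weight vector and one 1x1 matrix per network
source_params = [np.array([1.0, 2.0]), np.array([[0.5]])]
target_params = [np.array([1.5, 2.0]), np.array([[0.0]])]
l_pr = parameter_reference_loss(source_params, target_params)
```

The term keeps the target parameters close to the source parameters while the other loss components adapt them.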

All these discrepancy-based methods share the limitation of having to choose a distance metric to optimize and often a kernel to map features in latent space. Associative Domain Adaptation [19], the approach on which this thesis builds, uses an association-based loss between domains instead of the discrepancy loss, in addition to the task loss component. A detailed explanation of this approach follows in Section 3.

Adversarial domain adaptation

Adversarial domain adaptation is categorized into adversarial discriminative and generative approaches. The first category typically utilizes a discriminator between source and target features in order to force the original classifier to output invariant features. The second category uses a generative model that learns to generate samples imitating the distribution of the other domain. A taxonomy of adversarial domain adaptation approaches is presented in Figure 2.6.

Figure 2.6: Categories of adversarial domain adaptation. Image from [60]

Discriminative methods Adversarial discriminative domain adaptation relies on the useof a discriminator between source and target that tries to distinguish features coming fromeither. Discriminator feedback aims to make the features indistinguishable, therefore invari-ant. Discriminative domain adaptation often relies on the assumption that there is a set ofinvariant features across domains on which a single classifier can be used.

One of the first discriminative approaches was Domain Adversarial Neural Networks (DANN) [16]. The feed-forward part of the network has a feature encoder and a label predictor component. The output of the feature extractor has a second head into the discriminator, where a gradient reversal layer ensures domain features are indistinguishable.


Adversarial Discriminative Domain Adaptation (ADDA) [60] is a recent approach which consists of two stages of training. In the first stage, a task-oriented encoder and classification network is trained on source. In the second stage, a copy of the pre-trained encoder model is further fine-tuned on target with a discriminator that predicts the domain from which the features come. The new fine-tuned encoder is then used with the original classifier as the class label predictor on target. Here the weights are shared only in the classifier part of the network, but not in the encoder, thus making it a non-conservative approach.

In [59], a multi-task domain adaptation approach simultaneously optimizes for domain invariance in addition to a soft-label distribution matching loss for transferring task correlation. The joint loss function considering classifier, discriminator and representation parameters

$$\begin{aligned} L(x_S, y_S, x_T, y_T, \theta_D; \theta_{\text{repr}}, \theta_C) = {} & L_C(x_S, y_S, x_T, y_T; \theta_{\text{repr}}, \theta_C) \\ & + \lambda L_{\text{conf}}(x_S, x_T, \theta_D; \theta_{\text{repr}}) \\ & + \nu L_{\text{soft}}(x_T, y_T; \theta_{\text{repr}}, \theta_C) \end{aligned}$$

is minimized, where $L_{\text{conf}}$ is the domain confusion loss of the discriminator and $L_{\text{soft}}$ is defined over per-category soft labels as

$$L_{\text{soft}}(x_T, y_T; \theta_{\text{repr}}, \theta_C) = -\sum_i l^{(y_T)}_i \log p_i$$

where $p$ is the softmax activation of the target image.
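As a small illustration of the soft-label term, the sketch below computes the cross entropy between a per-category soft label and the softmax of a set of target logits. The function names and the example values are assumptions made for illustration; they are not taken from [59].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(soft_label, logits):
    """Cross entropy -sum_i l_i log p_i between a soft label l and p = softmax(logits)."""
    p = softmax(logits)
    return -sum(l * math.log(q) for l, q in zip(soft_label, p))


# Hypothetical soft label for the class of one target image, and two sets of logits.
soft_label = [0.2, 0.3, 0.5]
loss_flat = soft_label_loss(soft_label, [0.0, 0.0, 0.0])
loss_matched = soft_label_loss(soft_label, [math.log(v) for v in soft_label])
print(loss_flat, loss_matched)
```

The loss is smallest when the predicted distribution matches the soft label, which is how the term transfers inter-class correlations rather than only the hard label.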

Generative methods. With the recent advances in generative adversarial networks (GANs) [18], many domain adaptation approaches use some version of GANs to learn an explicit mapping between source and target. A generator loss is added to the adversarial one to learn to map domain distributions.

In [4] a GAN is used to learn a mapping between source and target and generates source images that look like the target domain. The loss is composed of a domain GAN loss component, a task-specific (classification) component and a content similarity loss component between images. The objective becomes:

$$\min_{\theta_G, \theta_T} \max_{\theta_D} \; \alpha L_d(D, G) + \beta L_t(T, G) + \gamma L_c(G)$$

where $L_t$ is the classification task loss, i.e. cross entropy, and $L_d$ is the domain loss, for which the discriminator and generator are optimized in a minimax fashion with the generator conditioned on noise as well as on the source image:

$$L_d(D, G) = \mathbb{E}_{x^t}\left[\log D(x^t; \theta_D)\right] + \mathbb{E}_{x^s, z}\left[\log\left(1 - D(G(x^s, z; \theta_G); \theta_D)\right)\right]$$

For the similarity loss component $L_c$ a masked pairwise mean squared error between pixels is used instead of the commonly used L1 or L2 losses.

Coupled Generative Adversarial Networks (CoGANs) [31] use a pair of GANs to synthesize realistic images in each domain and discriminate whether the images are real or synthesized. A weight sharing constraint is applied that allows the network to learn a joint distribution from images without corresponding pairs.

Cycle-consistent Adversarial Domain Adaptation (CyCADA) [23] is a recent approach which uses a combination of cycle-consistent GAN losses that learn mappings from source to target and vice versa. Cycle consistency is introduced following the work of [69], where an inverse mapping is enforced besides the one-directional one.


Data reconstruction-based methods

These methods use data reconstruction to achieve feature invariance between domains, often by using autoencoders, where the encoder part learns the representation and the decoder part the reconstruction.

Domain Separation Networks [5] use stacked domain-specific and joint autoencoders, where the encoded features are used to learn common and specific representations. The decoders are attached to a reconstruction loss where they attempt to reconstruct the input samples jointly with classification training.

2.1.4 Domain Adaptation for Semantic Segmentation

Semantic segmentation is the task of predicting an object class label for every pixel of an image. Due to the higher complexity compared to image classification, domain adaptation approaches for semantic segmentation have only recently emerged. It was shown with Fully Convolutional Networks [32] that encoders from classification networks such as ResNet [22] or VGG [47] can be extended in a fully convolutional fashion with upsampling layers to produce dense output and achieve state of the art results. These architectures, or versions thereof with additional context or pooling modules in the upsampling, have been widely used as base architectures for domain adaptation methods.

Adversarial discriminative methods

FCNs in the Wild [24] was the first work to perform such adaptation using an adversarial approach. For the segmenter a siamese architecture of FCNs with a dilated convolutions context module in the upsampling stage is used [66]. A discriminator network takes encoder features as input and learns to discriminate between source and target, thus giving feedback to the segmenter, which has to learn to output invariant features for both domains. In addition, the authors argue that domain adversarial training only captures global domain shifts. In order to account for category specific shifts, a constrained multiple instance loss component is added. The output prediction map of the target is constrained by a lower and an upper bound on category presence statistics in an image, based on percentages of labels in images of the source domain. This procedure is illustrated in Figure 2.7.

Similarly, in [58] adversarial training is done between source and target to achieve feature invariance for domain adaptation. However, multi-level adversarial training is done by discriminating separately between output probability maps as well as feature-level activations in the network. Two separate discriminators are used at two levels of the network, which allows for better performance than single-level feature or pixel-level adaptation.

Similarly, in [57] adversarial training is performed at both the feature and the pixel level. In addition, novel regularizations for feature level adaptation are explored while using insights from semi-supervised learning.

In Reality Oriented Adaptation (ROAD) [8] domain classifiers are trained to achieve feature invariance on different areas of the images. More specifically, the image is split into a grid and for every part of the grid a separate domain classifier is trained. This is done in an attempt to capture more information on the spatial structure of the images collected from urban scenes.

Figure 2.7: FCNs in the Wild. Image from [24]

In addition to the grid-based domain adversarial training, a distillation loss is defined that keeps the network, which starts from weights pre-trained on real images (e.g. on ImageNet), from forgetting what it has learned on real objects while training on the synthetic source. This loss is defined as:

$$L_{\text{dist}} = \frac{1}{N} \sum_{i,j} \left\| x_{i,j} - z_{i,j} \right\|_2$$

where $x_{i,j}$ and $z_{i,j}$ are activations at position $(i, j)$ of the feature map and $N$ is the total number of activations. The Euclidean distance between activations is minimized in an attempt to guide the model to maintain the behavior learned on the real images.
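The distillation term can be sketched in a few lines. The sketch assumes that each activation is a channel vector at a spatial position of the feature map and that N counts those positions; both the nested-list layout and the function name are illustrative assumptions rather than details from [8].

```python
import math


def distillation_loss(x, z):
    """Mean Euclidean distance between activations x[i][j] and z[i][j].

    x and z are feature maps given as nested lists indexed [row][column][channel].
    """
    total = 0.0
    count = 0
    for row_x, row_z in zip(x, z):
        for vec_x, vec_z in zip(row_x, row_z):
            # Euclidean distance between the two activation vectors at (i, j)
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_x, vec_z)))
            count += 1
    return total / count


# Hypothetical 1 x 2 feature maps with 2 channels each.
current = [[[3.0, 4.0], [0.0, 0.0]]]
reference = [[[0.0, 0.0], [0.0, 0.0]]]
loss_dist = distillation_loss(current, reference)
print(loss_dist)
```

Here `reference` stands in for the activations of the frozen pre-trained network and `current` for those of the network being adapted.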

Adversarial generative methods

As in the respective classification approaches, these methods use a generator to learn an explicit mapping between source and target images. In [44] this mapping is learned by the generator G, which learns to produce "fake" source and target images. A discriminator D tries to distinguish generated source and target images from the real ones. In addition to the segmenter loss on source and the discriminator loss on generator outputs, an auxiliary segmentation loss on newly generated source images from target is added, as well as an L1 reconstruction loss for the accuracy of cross-domain generated images (see Figure 2.8).

Training is done in 3 stages. First, the discriminator D is trained to distinguish real and fake outputs of the G network, together with the auxiliary segmentation loss:

$$L_D = L^s_{\text{adv},D} + L^t_{\text{adv},D} + L^s_{\text{aux}}$$

Second, the generator G is trained to beat the discriminator D by making the features invariant, together with the reconstruction loss:

$$L_G = L^s_{\text{adv},G} + L^t_{\text{adv},G} + L^s_{\text{rec}} + L^t_{\text{rec}}$$

Third, the segmenter network (F + C) is trained with the segmentation and auxiliary losses:

$$L_F = L_{\text{seg}} + \alpha L^s_{\text{aux}} + \beta \left( L^s_{\text{adv},F} + L^t_{\text{adv},F} \right)$$


Figure 2.8: Unsupervised Domain Adaptation for Semantic Segmentation with GANs. Image from [44]

The CyCADA [23] approach mentioned above uses a GAN to learn a mapping between source and target in addition to cycle-consistency loss enforcement. Both pixel and feature level adaptation are performed, and competitive results are achieved in semantic segmentation as well as in image classification.

Other methods

A few other recent methods have attempted to solve domain adaptation for semantic segmentation with approaches other than the standard adversarial one.

A fully convolutional tri-branch network [67] splits the original segmentation network into 3 branches. The first two are trained on source and used to make target predictions. If these two branches agree on target predictions with a high confidence score, these predictions are used as pseudo-labels to train the third branch. The first two branches are encouraged to be diverse by a weight-constrained loss.

Curriculum Domain Adaptation is explored in [68], where domain adaptation is achieved in a curriculum-style learning fashion by first solving the easier tasks before the more complex ones. The authors use label distributions over images as an easy task and local distributions over landmark superpixels as a difficult one.


2.2 Learning Models for Streaming Data

2.2.1 Introduction

Streaming data is generated from continuous data sources such as social network feeds, transaction data, live video feeds etc. The process generating the data is often non-stationary and the data cannot be assumed to be independent and identically distributed and drawn from a single distribution. The temporal properties of streams require a different set of techniques for dealing with this kind of data.

Consider a stream $S$ of data that appear incrementally in sequences of single online samples or in portions or blocks. The sequential data $X^\tau$ enters the stream at times $\tau = 1, 2, \ldots, K$ and the labels $Y^\tau$ may or may not be provided. If the data is processed sample by sample, this is called online processing. In the case of portions or data blocks, these are usually accumulated in equal sizes and training is done once a block is available, which is known as block processing [50].
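The difference between online and block processing can be made concrete with a small generator. This is a minimal sketch: the name `blocks` and the choice to hold back a trailing incomplete block are assumptions made for illustration.

```python
def blocks(stream, block_size):
    """Accumulate samples from an iterable stream and yield equally sized blocks.

    A trailing, incomplete block is held back because block processing only
    trains once a full block is available.
    """
    block = []
    for sample in stream:
        block.append(sample)
        if len(block) == block_size:
            yield block
            block = []


# Online processing corresponds to block_size = 1.
full_blocks = list(blocks(range(7), 3))
online = list(blocks(range(3), 1))
print(full_blocks, online)
```

Because the generator only keeps the current block in memory, earlier samples can be discarded as soon as a block has been handed to the learner.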

The concept drift problem

Concepts in data are stable if they are generated from the same distribution. Often in streaming this is not the case. The distribution shift over time in streaming data is known as concept drift. Thus, if the data samples are generated from a source distribution $p^\tau(X, Y)$, then for two distinct points in time $\tau$ and $\tau + \Delta$ an $X$ exists such that $p^\tau(X, Y) \neq p^{\tau+\Delta}(X, Y)$ [50].

There are two types of drift. Real drift is defined as the variation of the posterior probability of classes $p^\tau(Y|X)$ over time, independent of variations in the evidence $p^\tau(X)$. Virtual drift is considered the change in the marginal distribution of the evidence $p^\tau(X)$ without affecting the posterior probability of classes $p^\tau(Y|X)$ [25].
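A toy example with discrete variables may help separate the two drift types. The sketch below represents a joint distribution as a dictionary from (x, y) pairs to probabilities and checks which of the evidence and the posterior changes between two time steps. All names and numbers are hypothetical, and the example assumes every value of X has non-zero probability at both time steps.

```python
def marginal_x(joint):
    """Evidence p(X) from a joint distribution {(x, y): probability}."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px


def posterior_y_given_x(joint):
    """Posterior p(Y|X) as {(x, y): probability}."""
    px = marginal_x(joint)
    return {(x, y): p / px[x] for (x, y), p in joint.items()}


def close(d1, d2, tol=1e-9):
    keys = set(d1) | set(d2)
    return all(abs(d1.get(k, 0.0) - d2.get(k, 0.0)) <= tol for k in keys)


def drift_type(joint_a, joint_b):
    """Classify the change between two joint distributions."""
    if not close(posterior_y_given_x(joint_a), posterior_y_given_x(joint_b)):
        return "real"      # the posterior changed, whatever happens to p(X)
    if not close(marginal_x(joint_a), marginal_x(joint_b)):
        return "virtual"   # only the evidence changed
    return "none"


p_now = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_virtual = {(0, 0): 0.64, (0, 1): 0.16, (1, 0): 0.04, (1, 1): 0.16}
p_real = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}
print(drift_type(p_now, p_virtual), drift_type(p_now, p_real))
```

In `p_virtual` the input X = 0 simply becomes more frequent while the class proportions given X stay at 0.8 and 0.2; in `p_real` the inputs are equally frequent as before but their class proportions are swapped.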

This drift can be sudden, when the current data distribution is suddenly replaced by a different one, or gradual, when a slow rate of change is observed in the stream distribution [50]. Noise and outliers should not affect the classifiers, which aim to capture the underlying distributions. Further, drifts can be permanent if the variation continues through time, or transient when the drift disappears after a while [14]. Often data from a similar distribution reappears, resulting in recurrent drifts.

Considerations

Besides the distribution shift over time, there are a few other important considerations to be taken into account when developing methods for streaming data classification.

First, memory constraints do not allow for all stream data to be stored simultaneously. Thus most of the data should be used and then discarded to free the memory for incoming streams. This creates the need for a one-pass learning approach where each sample pair or data block is only seen once in the training process before the data is discarded.

Second, labels for the sequence $S^\tau$ are not usually available with the data itself. When they arrive with a delay in the next sequence $S^{\tau+1}$, the model can be easily evaluated in a "train-then-test" scenario. This delay is known as verification latency, and the scenario in which labels are only provided at the beginning of a stream is known as "initially labeled nonstationary streaming". In these cases classifier knowledge needs to be propagated across several timesteps [14].
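Evaluation under verification latency can be sketched as a loop in which the predictions for one block are scored only once its labels arrive with the next block. The majority-class model below is a deliberately trivial stand-in, and all names are hypothetical; the point of the sketch is the order of operations, not the classifier.

```python
from collections import Counter


class MajorityClassifier:
    """Trivial stand-in model that predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def update(self, labels):
        self.counts.update(labels)


def evaluate_with_delayed_labels(stream_blocks):
    """stream_blocks yields (inputs, labels); labels of block t are only usable at t + 1."""
    model = MajorityClassifier()
    accuracies = []
    pending = None  # predictions and labels of the previous block
    for inputs, labels in stream_blocks:
        if pending is not None:
            predictions, delayed_labels = pending
            hits = sum(p == t for p, t in zip(predictions, delayed_labels))
            accuracies.append(hits / len(delayed_labels))
            model.update(delayed_labels)  # one pass: the block is used once, then dropped
        pending = ([model.predict(x) for x in inputs], labels)
    return accuracies


toy_stream = [([1, 2], ["a", "a"]), ([3, 4], ["a", "b"]), ([5, 6], ["b", "b"])]
accuracies = evaluate_with_delayed_labels(toy_stream)
print(accuracies)
```

The last block is never scored in this sketch because its labels would only arrive with a block that does not exist.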

Another important consideration is that sometimes smaller classes occur so rarely that being able to detect when they occur and adapt accordingly is an important part of the setup.


In addition, it is important for the classifier to be able to produce on-demand predictions regardless of whether a similar distribution of incoming stream data has been previously seen in the training data.

Last, different approaches and constraints call for different evaluation measures for the classifier. Usually streaming classification algorithms are evaluated by the required processing time, memory usage, predictive performance and ability to adapt [50].

2.2.2 Algorithms and methods

Approaches that deal with streaming data are either passive approaches, which use a single classifier or an ensemble, or active ones, where an extra decision is made on whether to update the classifier. Most often classification algorithms such as Decision Trees, Rule-Based and Nearest Neighbor are used, although adjustments to neural network architectures to account for streaming have also been proposed [1]. Figure 2.9 shows an overview of streaming data classification algorithms. It can be seen that not many deep approaches are used for this task, although a few attempts have been made [65].

Figure 2.9: Streaming data classification algorithms. Image from [25]

Active approaches. Active approaches to streaming data classification aim to detect the concept drift in streaming data. A change detector looks at the extracted features and, depending on whether there is a drift, decides whether the classifier should update. This change is usually measured by the variation in classification error as well as by inspection of the raw data features themselves [14]. Although these methods are usually able to adapt to concept drift, detecting it might be hard in cases where the drift happens gradually.
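One simple way to picture an error-based change detector is to compare the error rate in a recent window against a reference window and flag a drift when the gap exceeds a threshold. The class below is a hypothetical illustration, not a detector from the cited literature; window size and threshold are arbitrary.

```python
from collections import deque


class ErrorRateDriftDetector:
    """Flags a drift when the recent error rate exceeds the reference rate by a margin."""

    def __init__(self, window=5, threshold=0.3):
        self.window = window
        self.threshold = threshold
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)

    def add(self, error):
        """error is 1 for a misclassified sample and 0 otherwise; returns True on drift."""
        if len(self.reference) < self.window:
            self.reference.append(error)  # still filling the reference window
            return False
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        reference_rate = sum(self.reference) / self.window
        recent_rate = sum(self.recent) / self.window
        if recent_rate - reference_rate > self.threshold:
            # adopt the recent window as the new reference after a detected drift
            self.reference = deque(self.recent, maxlen=self.window)
            self.recent.clear()
            return True
        return False


errors = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
detector = ErrorRateDriftDetector(window=5, threshold=0.3)
drift_points = [i for i, e in enumerate(errors) if detector.add(e)]
print(drift_points)
```

A slow, gradual drift raises the recent error rate only a little at a time, so it can stay under the threshold for long, which matches the difficulty noted above.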

An example of an active approach is [2], where a complex sampling and filtering mechanism for active training and a random forest based classifier are used.

Passive approaches. These do not seek to detect a drift, but simply continue the training as new labeled data arrives. The models are either based on a single classifier or on ensembles. Single classifier models are less demanding computationally.

An example is [55], where a micro-cluster Nearest Neighbor is used, which makes use of statistical summaries for data streams.

However, ensembles usually do best due to the availability of multiple classifiers to make decisions. In addition, ensembles are flexible to change, as classifier members can be added or removed based on incoming data, although this process requires more effort. Still, ensemble-based methods are most often preferred [51, 63].

Figure 2.10: Active learning framework for stream classification. A change detection mechanism informs the model on whether to adapt or not given the newly incoming data. Image from [14]

Not many works look into exploiting unsupervised data for improving data stream classifiers. In [54] semi-supervised learning is used to adjust k-nearest neighbor weights over time. Due to the complexity of dealing with the labeled data itself, not much has been done to explore potential ways of boosting classifier results with unsupervised data.


2.3 Discussion

An overview of the topics introduced and related research was presented in this section. We discuss the motivation and choices made when developing the method that follows.

In the domain adaptation setting it can be noticed that not many methods are consistently used across tasks achieving competitive results. Usually domain adaptive solutions are tailored to the task. We believe that robust domain adaptation methods that can be applied across tasks without bells and whistles are important to the advancements in the field.

As can be observed, adaptation approaches for semantic segmentation are mostly based on adversarial DA and use either a discriminator or a GAN that learns a mapping between the images. Many insights acquired from classification approaches are yet to be exploited, with discrepancy-based methods being almost unrepresented. This may be due to a limited understanding of how multi-dimensional embeddings behave in latent space and to the complexity of semantic segmentation.

This thesis attempts to shed more light on the differences in behavior from classification to patchwise classification and further to semantic segmentation, laying the groundwork for a better understanding of domain adaptation for semantic segmentation.

Regarding the streaming setup, we discussed how domain adaptation can be considered a special case of streaming with two time steps and a single shift between them, where $p^\tau(X) \neq p^{\tau+1}(X)$. Thus we can generalize dynamic domain adaptation and stream adaptation over time in a joint framework. To show how this would work, we use a streaming setup with an "initially labeled environment" that, as described above, does not receive further labels, only raw images, in a semi-supervised setting.

From the evaluation methods of streaming data classification discussed above, in this work we evaluate our streaming classification models for predictive performance and ability to adapt. We do consider memory usage by simulating a streaming scenario where data is used and then discarded, but explicit memory optimization for streaming is not the focus of this work. The processing time component is not considered for evaluation within the scope of this work.


3 Method

3.1 Domain Adaptation with Associative Learning

3.1.1 Associative domain adaptation

From maximum mean discrepancy to learning by associations

Early domain adaptation approaches [41, 34] have used maximum mean discrepancy (MMD) as a distance measure between source and target feature distributions. MMD has been shown to be a good estimate of distribution distances through their mean embeddings [3]. Consider source and target datasets $x^s$ and $x^t$ with $n$ and $m$ samples respectively, and let $\phi : X \to H_k$ be a mapping where $H_k$ is a reproducing kernel Hilbert space (RKHS) [48]. Maximum mean discrepancy is defined as:

$$\mathrm{MMD}(x^s, x^t) = \left\| \frac{1}{n}\sum_{i=1}^{n} \phi(x^s_i) - \frac{1}{m}\sum_{i=1}^{m} \phi(x^t_i) \right\|_{H_k}$$

Explicitly minimizing the MMD between source and target yields improved adaptation results over a model trained with source labels only. MMD is computed with the kernel trick in quadratic runtime, although there are linear time estimator versions [34]. Besides the computational complexity, a kernel and its relevant parameters have to be chosen.
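To make the kernel trick and its quadratic cost concrete, the sketch below computes a biased estimate of the squared MMD between the mean embeddings of two small sample sets. The Gaussian (RBF) kernel and its bandwidth `sigma` are assumptions chosen for illustration, which also shows the kind of kernel choice the text refers to.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two vectors; sigma is an illustrative bandwidth."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mmd_squared(xs, xt, kernel=rbf_kernel):
    """Biased estimate of squared MMD using only kernel evaluations.

    Expands || mean phi(xs) - mean phi(xt) ||^2 into three kernel sums, which
    costs O((n + m)^2) kernel evaluations.
    """
    n, m = len(xs), len(xt)
    k_ss = sum(kernel(a, b) for a in xs for b in xs) / (n * n)
    k_tt = sum(kernel(a, b) for a in xt for b in xt) / (m * m)
    k_st = sum(kernel(a, b) for a in xs for b in xt) / (n * m)
    return k_ss + k_tt - 2.0 * k_st


source_samples = [[0.0], [1.0]]
target_samples = [[3.0], [4.0]]
mmd_same = mmd_squared(source_samples, source_samples)
mmd_shifted = mmd_squared(source_samples, target_samples)
print(mmd_same, mmd_shifted)
```

Changing `sigma` changes how strongly the shift between the two sets registers, which is the tuning burden that association-based losses avoid.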

Learning by association [20, 19] is a technique that uses association of embeddings in latent space, assuming that same-class data will have similar embeddings. Given a labeled and an unlabeled set of data, the unlabeled set should be associated to the closest labeled points, given that the overall distribution structure is maintained.

Similarly, in associative domain adaptation the aim is to minimize source and target discrepancy, but indirectly through source and target embedding associations, unlike explicit MMD-minimization based approaches, with the advantage of not having to choose a kernel and relevant parameters.

Definition

Considering the domain shift between source and target distributions, the goal is to associate the domains on the embedding level. Let us assume that $x^s$ and $x^t$ are source and target data whose distributions differ, so that due to the domain shift $P(x^s) \neq P(x^t)$. To perform adaptation alongside the supervised task, an additional loss component is added that operates on supervised source data and unsupervised target data. In associative domain adaptation this is the associative loss component, which acts as a regularizer on the source classifier in order to minimize the distribution shift on the feature level.

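$$L = L_{\text{task}} + L_{\text{assoc}} \tag{3.1}$$

The associative loss consists of two components: the walker and the visit loss.

$$L_{\text{assoc}} = \alpha L_{\text{walker}} + \beta L_{\text{visit}} \tag{3.2}$$

where $\alpha$ and $\beta$ are the weighting coefficients.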


Consider a network trained on a task that produces embeddings $\phi^s \in \mathbb{R}^{N \times D}$ and $\phi^t \in \mathbb{R}^{M \times D}$ for $x^s$ and $x^t$ respectively, where $N$ and $M$ are the numbers of data points in source and target and $D$ is an arbitrary embedding dimensionality. To associate the correct source and target embeddings, their pairwise distances or affinities need to be computed.

Affinities and transition probabilities. An affinity matrix $A \in \mathbb{R}^{N \times M}$ can be computed where every element $A_{ij}$ is the affinity of the source and target embeddings for the respective indices:

$$A_{ij} = \mathrm{aff}(\phi^s_i, \phi^t_j) \tag{3.3}$$

In [19], the unnormalized dot product between vector embeddings is used as a similarity measure, thus $A_{ij} = \langle \phi^s_i, \phi^t_j \rangle$. In Section 3.2.2 we investigate different affinity measures and their impact on the performance of the method.

If we think of embeddings as nodes in a graph where all source embedding nodes are connected to all target embedding nodes, and of the source to target affinities as edge or transition weights, affinities can be interpreted as transition probabilities between embeddings in source and target. Since every embedding point in the source is "connected" to a finite set of points in the target, it is convenient to interpret the affinities between a single sample in one domain and all samples in the other as a probability distribution. Thus $p(\phi^t | \phi^s_i)$ is a probability distribution over samples in $\phi^t$ given the $i$-th source embedding, and $\sum_{j'} p(\phi^t_{j'} | \phi^s_i) = 1$.

From here on, the notation $p^{s \to t}_{ij}$ is used to denote $p(\phi^t_j | \phi^s_i)$, with $s \to t$ indicating a transition from source to target. To get from the affinity matrix $A$ to transition probability distributions the softmax function can be used. Thus the affinity turned probability from an embedding vector $\phi^s_i$ in the source to an embedding vector $\phi^t_j$ in the target becomes:

$$p^{s \to t}_{ij} = \frac{\exp(A_{ij})}{\sum_{j'} \exp(A_{ij'})} \tag{3.4}$$

It can be argued that different normalization functions can be used to obtain a probability distribution from the affinity values. We discuss and investigate this empirically in Section 3.2.2.
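The step from embeddings to transition probabilities can be sketched with plain Python lists, using the dot-product affinity and a row-wise softmax. The helper names and the toy embeddings are assumptions made for illustration; a practical implementation would use a tensor library.

```python
import math


def affinity_matrix(phi_s, phi_t):
    """Dot-product affinities A[i][j] between source and target embeddings."""
    return [[sum(a * b for a, b in zip(s, t)) for t in phi_t] for s in phi_s]


def row_softmax(matrix):
    """Softmax over every row, so that each row becomes a probability distribution."""
    result = []
    for row in matrix:
        m = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Hypothetical embeddings: N = 2 source points, M = 3 target points, D = 2.
phi_s = [[1.0, 0.0], [0.0, 1.0]]
phi_t = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
p_st = row_softmax(affinity_matrix(phi_s, phi_t))
print(p_st)
```

Each row of `p_st` sums to one, and the first source embedding assigns most of its probability to the target embedding it is best aligned with.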

In addition, for convenience the probability of two transitions happening jointly can be modeled as the product of the separate transition events. For a transition from an embedding $\phi^s_i$ in the source to $\phi^t_j$ in the target and back to another source embedding $\phi^s_k$, the probability of this event can be written as:

$$p^{s \to t \to s}_{ijk} = p^{s \to t}_{ij} \, p^{t \to s}_{jk} \tag{3.5}$$

and the overall two-step probability as:

$$p^{s \to t \to s}_{ik} = \sum_j p^{s \to t}_{ij} \, p^{t \to s}_{jk} \tag{3.6}$$
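Summing over the intermediate target index amounts to a matrix product of the two transition matrices. The sketch below assumes that the target-to-source probabilities are obtained by applying the same softmax to the transposed affinity matrix; that choice, like the helper names, is an assumption made for illustration.

```python
import math


def row_softmax(matrix):
    result = []
    for row in matrix:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


def two_step_probabilities(affinity):
    """Round-trip probabilities source -> target -> source, an N x N matrix."""
    p_st = row_softmax(affinity)                          # N x M
    affinity_t = [list(col) for col in zip(*affinity)]    # transpose to M x N
    p_ts = row_softmax(affinity_t)                        # assumed target-to-source step
    return matmul(p_st, p_ts)


# Hypothetical affinities between 2 source and 3 target embeddings.
affinity = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
p_sts = two_step_probabilities(affinity)
print(p_sts)
```

Because both factors are row-stochastic, every row of the round-trip matrix again sums to one.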

The associative loss component is based on two principles around these transition probabilities, constraining the embedding behavior as follows.

The walker loss. To associate source and target embeddings, the labeled source embeddings can be leveraged to associate to target embedding points. The labels of the target domain embeddings are not known, but the distribution in the source and target is expected to follow a similar structure where same-class samples can be associated in embedding space.


Figure 3.1: (a) Associative walker loss. Arrows represent probabilities of transitions, which are to be maximized if there is a same-class target-to-source transition. (b) Associative visit loss. Source to target probabilities are distributed uniformly.

Thus, while minimizing same-class embedding distance from both domains, all source and target embeddings of the same class should lie closer together. This can be considered equivalent to minimizing both source-to-target and target-to-source embedding distance for every particular class. With the probabilistic interpretation of the distances, this would mean maximizing the joint probability of a source-to-target and target-to-source distance-based transition for same-class embeddings. As class specific target embeddings are not known, the two-step joint probability $p^{s \to t \to s}_{ik}$ and the labels in the source can be utilized.

This joint probability should be maximized if the $\phi^s_i$ and $\phi^s_k$ samples in the source belong to the same class. In other words, a walker transition from source to target and back should end up in the same class in source with high probability. This can be enforced through a cross-entropy loss:

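$$L_{\text{walker}} = H(E, P^{s \to t \to s}) \tag{3.7}$$

where $P^{s \to t \to s} \in \mathbb{R}^{N \times N}$ is a matrix containing elements $p^{s \to t \to s}_{ik} = \sum_j p^{s \to t}_{ij} \, p^{t \to s}_{jk}$ representing two-step transition probabilities from an embedding sample in source to one in target and back to source, and $E$ is the normalized equality matrix indicating labels of the same class:

$$E_{ik} = \begin{cases} 1 / |\phi^s_i| & \text{if } C_{\phi^s_i} = C_{\phi^s_k} \\ 0 & \text{otherwise} \end{cases} \tag{3.8}$$

Here $|\phi^s_i|$ is read as the number of source embeddings that share the class of $\phi^s_i$, so that every row of $E$ sums to one.

Based on this intuition, correct associations of target embeddings are encouraged if they lie close to multiple source embeddings which belong to the same class. For any $i$ and $k$, $p^{s \to t \to s}_{ik}$ should be maximized if $C_{\phi^s_i} = C_{\phi^s_k}$, where $C_{\phi_n}$ is the class of a sample embedding. This is illustrated in Figure 3.1a.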
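A minimal sketch of the walker loss follows, assuming that the normalization in the equality matrix is the number of source samples sharing a class and that the cross entropy is averaged over source rows. The small constant `eps` and the function names are illustrative choices, not part of the original formulation.

```python
import math
from collections import Counter


def equality_matrix(labels):
    """Row-normalized matrix with 1 / (class count) where two source labels agree."""
    counts = Counter(labels)
    return [[1.0 / counts[a] if a == b else 0.0 for b in labels] for a in labels]


def walker_loss(labels, p_sts, eps=1e-12):
    """Cross entropy between the equality matrix and round-trip probabilities."""
    e = equality_matrix(labels)
    total = 0.0
    for e_row, p_row in zip(e, p_sts):
        total -= sum(t * math.log(p + eps) for t, p in zip(e_row, p_row))
    return total / len(labels)


# Three hypothetical source samples: two of class 0, one of class 1.
labels = [0, 0, 1]
uniform_round_trip = [[1.0 / 3.0] * 3 for _ in range(3)]
ideal_round_trip = equality_matrix(labels)  # every walk ends in its own class
loss_uniform = walker_loss(labels, uniform_round_trip)
loss_ideal = walker_loss(labels, ideal_round_trip)
print(loss_uniform, loss_ideal)
```

The loss drops when round trips end in the starting class, and it does not require any target labels, since only the source labels enter through the equality matrix.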

The visit loss. The walker loss risks collapsing into minimizing distances only to the target samples that lie closest to the source samples, as they are easier to associate. This can partially be mitigated by the visit loss component, which ensures that the transition probabilities from source to target are distributed equally among target samples. This is enforced with a cross-entropy loss:

$$L_{\text{visit}} = H(V, P^{s \to t}) \tag{3.9}$$

where $P^{s \to t}_j = \sum_i p^{s \to t}_{ij}$. Considering all probabilities $p^{s \to t}_{ij}$ of transition from $\phi^s_i$ to $\phi^t_j$, $P^{s \to t}$ is enforced to be uniformly distributed, with $V \in \mathbb{R}^M$ having elements:

$$V_j = \frac{1}{|\phi^t|} \tag{3.10}$$

This is illustrated in Figure 3.1b.
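The visit loss can be sketched as a cross entropy between a uniform vector over target samples and the visit probabilities. As an assumption made for this sketch, the visit probabilities are averaged over source rows instead of summed, which only shifts the loss by a constant and makes the visit vector a proper distribution; `eps` and the function name are likewise illustrative.

```python
import math


def visit_loss(p_st, eps=1e-12):
    """Cross entropy between a uniform target V and the per-target visit probability."""
    n = len(p_st)       # number of source embeddings
    m = len(p_st[0])    # number of target embeddings
    # Average probability of landing on target j over all source starting points.
    visit = [sum(row[j] for row in p_st) / n for j in range(m)]
    v = 1.0 / m
    return -sum(v * math.log(p + eps) for p in visit)


# Hypothetical source-to-target transition matrices for 2 source and 2 target points.
evenly_visited = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[0.9, 0.1], [0.9, 0.1]]
loss_even = visit_loss(evenly_visited)
loss_collapsed = visit_loss(collapsed)
print(loss_even, loss_collapsed)
```

The loss is lowest when all target samples are visited equally often, and it grows when the walker concentrates on a few easy target samples.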

Assumptions. The definition of the visit loss in eq. (3.10) assumes that the source and target class distributions are similar on the batch level, where adaptation happens. The authors of [19] acknowledge the problem and recommend using a lower coefficient for the visit loss if this is not the case. In addition, they sample a uniform distribution over class labels per batch in the source, and attempt to alleviate the distribution difference in the target by sampling a batch 10-100 times larger than the number of classes. This often ensures that all classes will be present in a target batch, but does not necessarily approximate the uniform distribution in source. This is even harder to ensure in diverse realistic datasets or in tasks such as semantic segmentation, where target class distributions in the batch might vary a lot depending on the size of objects.

We argue that this assumption should be tackled on a theoretical level in the method. First, adjusting the coefficient $\beta$ for the visit loss would require implicit access to labels in the target domain in order to tune $\beta$. We show in experiments (Section 4) that an increased difference in KL-divergence between source and target during training deteriorates the method results.

In addition, we point out that a lower visit loss coefficient does not allow the network to exploit the full adaptation capability of the method. Below, we reformulate the approach to relax this distribution assumption and make the method robust to distribution differences between source and target, while preserving the full adaptation capacity between embeddings.

3.1.2 Relaxing the class distribution assumption

We discussed the assumption made by the original formulation of the visit loss in eq. (3.10). While setting a low coefficient for the visit loss component is a potential workaround, we show that this does not allow for full exploitation of the adaptation capabilities of the method and tends to fail if the class distribution in the target is far from uniform in KL-divergence.

Relaxing the distribution assumption is especially important for the scalability and usability of the method as well as its applicability to more complex tasks such as segmentation or streaming data classification. In the following sections we reformulate the method considering potentially different class distributions and show the impact of our new approach in these tasks.

Intuition

In Figure 3.2 we illustrate why the uniform distribution of source to target probabilities with the visit loss would fail. In Figure 3.2a, if a large class in the source corresponds to a small class in the target, a uniform distribution of source to target probabilities will enforce wrong associations from the large class to a wrong class in the target. If we were to consider class distributions, however, a larger portion of the source to target probabilities would be distributed among fewer samples in the target (see Figure 3.2b).

According to the intuition, we can rewrite the visit loss from eq. (3.9) as :

Lvisit = H(wV, P s→t) (3.11)


Figure 3.2: (a) Uniformly distributed visit loss may enforce wrong associations among embeddings. (b) Class balanced visit loss adjusts potentially wrong associations between embeddings when source and target class sizes vary.

with $V_j = 1/|\phi^t|$ and:

$$w_j = \frac{p_{src}(C_{\phi^t_j})}{p_{tgt}(C_{\phi^t_j})} \qquad (3.12)$$

where $C_{\phi^t_j}$ denotes the class label of the $j$-th target embedding.

The target distribution of the cross entropy can also be written as:

$$w_j V_j = \frac{p_{src}(C_{\phi^t_j}) \,/\, p_{tgt}(C_{\phi^t_j})}{|\phi^t|} \qquad (3.13)$$

To compute $w_j$ directly, we would need to know the class probability of this target sample in the source as well as in the target. We discuss our technique for estimating these weights in an unsupervised manner.
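To make eqs. (3.11) to (3.13) concrete, the following is a minimal sketch of the oracle case, in which the target class labels are assumed to be known. The function names `visit_target` and `visit_loss`, the toy batch and the `eps` constant are illustrative assumptions and not the thesis implementation; the visit probabilities $P^{s \to t}$ are simply passed in as an array.

```python
import numpy as np

def visit_target(tgt_labels, p_src):
    """Return w_j * V_j (eq. 3.13) for every target embedding j.

    tgt_labels: class index of each target embedding (known only to an oracle).
    p_src: batch-level class probabilities in the source, indexed by class.
    """
    tgt_labels = np.asarray(tgt_labels)
    p_src = np.asarray(p_src, dtype=np.float64)
    n_tgt = len(tgt_labels)
    # Batch-level class probabilities in the target.
    p_tgt = np.bincount(tgt_labels, minlength=len(p_src)) / n_tgt
    w = p_src[tgt_labels] / p_tgt[tgt_labels]   # eq. (3.12)
    return w / n_tgt                            # eq. (3.13), with V_j = 1/|phi^t|

def visit_loss(target, p_visit, eps=1e-8):
    """Cross entropy H(target, p_visit) over the target samples (eq. 3.11)."""
    return float(-np.sum(target * np.log(p_visit + eps)))

# Source batch sampled uniformly over 2 classes; target batch is imbalanced (6 vs 2).
p_src = np.array([0.5, 0.5])
tgt_labels = [0, 0, 0, 0, 0, 0, 1, 1]
target = visit_target(tgt_labels, p_src)

uniform_visits = np.full(8, 1.0 / 8)
loss_uniform = visit_loss(target, uniform_visits)
loss_balanced = visit_loss(target, target)
```

In this toy batch each sample of the small class receives a larger share of the visit probability (0.25) than each sample of the large class (1/12), and the weighted target still sums to one because every source class is present in the target.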

Estimating visit loss weights

Due to the lack of target labels, we cannot know which class $\phi^t_j$ belongs to, so we need to estimate its class probability in the source and in the target. Note that class probabilities here refer to those in the sampled batch where association takes place, not in the whole dataset. We can, however, leverage the labels in the source. By sampling a source batch uniformly among classes, we obtain a constant $p_{src}(C_{\phi^t_j})$ across classes in the source. This leaves us with the task of estimating class probabilities for every embedding in the target. We approximate this by retrieving clusters around the embeddings that attempt to approximate classes.

It is logical to expect that same-class embeddings cluster together in latent space, since a modern classifier must be able to discriminate between samples of different classes. If we could retrieve the clusters corresponding to the classes, we could compute the class distribution in a batch from the respective cluster sizes and the overall batch size. This is illustrated in Figure 3.3. The approximation holds under the assumption that the clusters are well aligned to the means of the respective, optimal classifiers.

There is a wide range of clustering methods that can be used for embedding clustering. We use an off-the-shelf hierarchical agglomerative clustering algorithm, which experimentally appears to allow for good alignment between the obtained clusters and the classes, and works well when clusters have very different sizes. We show in Section 4 that an off-the-shelf fast clustering approach yields close approximations of the accurate oracle visit loss weights and improves associative adaptation in cases where batch sampling is not equalized among source and target datasets.

Figure 3.3: Clustering to estimate the class probabilities for target embedding samples. Clusters may not be fully accurate, but they do approximate the class sizes for class probability estimation of samples in a batch.
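The estimation step described above can be sketched as follows, assuming a class-uniform source batch so that $p_{src}$ is the constant $1/K$. The use of SciPy's `linkage` and `fcluster` with Ward linkage, the function name `estimate_visit_target` and the toy embeddings are assumptions for illustration; the thesis does not specify which agglomerative implementation or linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def estimate_visit_target(tgt_emb, n_classes):
    """Estimate w_j * V_j from cluster sizes, assuming a class-uniform source batch."""
    n_tgt = len(tgt_emb)
    # Off-the-shelf agglomerative clustering (Ward linkage is an assumption here).
    tree = linkage(tgt_emb, method="ward")
    cluster_ids = fcluster(tree, t=n_classes, criterion="maxclust")  # ids start at 1
    sizes = np.bincount(cluster_ids)
    p_tgt = sizes[cluster_ids] / n_tgt      # estimated class probability per sample
    p_src = 1.0 / n_classes                 # constant under uniform source sampling
    return (p_src / p_tgt) / n_tgt, cluster_ids

# Toy target batch: a large cluster of 6 embeddings and a small one of 2.
tgt_emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                    [0.2, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]])
target, cluster_ids = estimate_visit_target(tgt_emb, n_classes=2)
```

When the clusters coincide with the classes, as in this toy batch, the estimate equals the oracle target of eq. (3.13); in practice the clusters only approximate the class sizes.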


3.2 Associative Domain Adaptation for Semantic Segmentation

While several domain adaptation approaches work well for classification, semantic segmentation is a more challenging task due to its complexity and the need for dense predictions. Current approaches mostly make use of adversarial training or GANs to learn a source and target mapping. Approaches that aim to minimize discrepancies in the latent space of embeddings generated by a network trained on a semantic segmentation task have not been explored.

Having relaxed the distribution assumption, the associative domain adaptation approach can be applied to tasks where source and target distributions are not uniform or uniformity cannot be approximated. We show that our associative adaptation approach, made robust to distribution differences, can achieve competitive results on dense prediction tasks.

3.2.1 Fully convolutional networks for semantic segmentation

Recent methods for semantic segmentation have shifted entirely towards fully convolutional networks. Large classification networks that include fully connected layers can easily be turned into segmentation networks by converting the fully connected layers into convolutions with kernels the size of the input region [32]. This facilitates not only the transfer of classification architectures to semantic segmentation tasks, but also the reuse of models trained on large classification datasets for segmentation tasks.
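The equivalence between a fully connected layer and a convolution whose kernel covers the whole input region can be checked numerically. The sketch below uses NumPy only; the sizes `k`, `c_in`, `n_out` and the toy weights are arbitrary assumptions and are not taken from any specific network.

```python
import numpy as np

k, c_in, n_out = 7, 4, 3   # illustrative sizes, not taken from a specific network

# Deterministic toy feature map (k x k x c_in) and fully connected weights.
x = np.arange(k * k * c_in, dtype=np.float64).reshape(k, k, c_in) / 100.0
w_fc = np.cos(np.arange(k * k * c_in * n_out, dtype=np.float64))
w_fc = w_fc.reshape(k * k * c_in, n_out)

# Fully connected layer: flatten the input region, then multiply.
fc_out = x.reshape(-1) @ w_fc

# The same weights viewed as a k x k convolution kernel with n_out output channels.
w_conv = w_fc.reshape(k, k, c_in, n_out)
# A single "valid" position of the convolution (cross-correlation, as in most
# deep learning frameworks): sum over height, width and input channels.
conv_out = np.tensordot(x, w_conv, axes=3)
```

On an input larger than `k × k`, sliding the same kernel would produce one such output vector per spatial position, which is what turns the classifier into a dense predictor.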

Good results can already be achieved with a very simple upsampling decoder consisting of a 1 × 1 convolution that adjusts the channels to the number of output classes, followed by upsampling with bilinear interpolation. Further work has developed more elaborate upsampling modules using atrous convolutions, refinement modules to capture global context, and skip connections from the downsampling to the upsampling layers [66, 27].
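A minimal NumPy sketch of such a simple decoder is given below. The helper names `conv1x1` and `bilinear_upsample`, the corner-aligned interpolation convention and the toy shapes are assumptions chosen for illustration; real frameworks offer several interpolation conventions.

```python
import numpy as np

def conv1x1(features, weights):
    """1x1 convolution: a per-pixel linear map from C channels to K classes."""
    return np.einsum("hwc,ck->hwk", features, weights)

def bilinear_upsample(x, factor):
    """Bilinear upsampling of an (H, W, K) map, keeping corner pixels aligned."""
    h, w, _ = x.shape
    ys = np.linspace(0, h - 1, h * factor)
    xs = np.linspace(0, w - 1, w * factor)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    # Interpolate along the width first, then along the height.
    top = (1 - wx) * x[y0][:, x0] + wx * x[y0][:, x1]
    bottom = (1 - wx) * x[y1][:, x0] + wx * x[y1][:, x1]
    return (1 - wy) * top + wy * bottom

features = np.ones((4, 6, 16))        # toy 8x-downsampled feature map, 16 channels
weights = np.full((16, 5), 0.5)       # 16 channels -> 5 output classes
logits = bilinear_upsample(conv1x1(features, weights), factor=8)
prediction = logits.argmax(axis=-1)   # per-pixel class index
```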

With small modifications, a classification network such as VGG16 [47] or ResNet101 [22] can be plugged in as a backbone and combined with one of the context-based upsampling modules to achieve state-of-the-art semantic segmentation results.

3.2.2 Embeddings in semantic segmentation networks

Visualizing behavior

To understand the applicability of the associative domain adaptation method to segmentation, it is important to understand the behavior of embeddings produced by a segmentation network. We expect that during training, embeddings of the same class start clustering together in latent space.

Recent approaches in semantic instance segmentation [40, 15] use embeddings produced by a semantic segmentation network and further regularize these to constrain same-instance embeddings to lie closer together and embeddings belonging to different instances to lie further apart. This is done in order to constrain pixels of different instances of the same class to cluster around their respective instances.

In addition, it has been shown that even semantic segmentation benefits slightly from further regularizing embeddings to lie closer together if they belong to the same class and further apart if they belong to different classes [21]. This can be interpreted as extra supervision with an additional metric on the task.

In principle, however, we expect that although segmentation is a more challenging task for deep learning, a semantic segmentation network trained with cross entropy loss on pixel-level labels should learn a good separation of embeddings of different classes and similarity among embeddings of the same class.


Figure 3.4: Embeddings produced by a segmentation network after (a) 1000, (b) 20000 and (c) 50000 iterations respectively. Visualized with t-SNE [36].

Visualizing this embedding space can also be complex. Embedding visualization requires high-dimensional embeddings to be projected down to 2D or 3D. This can be done simply with random projections or PCA, or through t-SNE [36], which is specialized for visualizing high-dimensional embeddings. Even with a specialized method, very large dimensionalities and numbers of data points, together with the more complex embedding space for segmentation, can make it difficult to end up with a sample of the data where every class forms a clear and separate cluster. We illustrate this further.
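As a small illustration of the simplest option mentioned above, the sketch below projects embeddings to 2D with PCA computed through an SVD. It is not the t-SNE procedure used for Figure 3.4; the function name `pca_project` and the synthetic two-class data are assumptions.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project (N, D) embeddings onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Two synthetic "classes" in a 64-dimensional space, separated along one axis.
class_a = rng.normal(size=(50, 64))
class_b = rng.normal(size=(50, 64))
class_b[:, 0] += 20.0
points_2d = pca_project(np.vstack([class_a, class_b]))
```

The resulting `points_2d` array can be passed to any scatter plot routine, coloring the points by class.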

A state-of-the-art segmentation network is trained on the Cityscapes dataset for semantic segmentation. In Figure 3.4 we can observe the behavior of embeddings at different stages of training, visualized with t-SNE at a fixed perplexity value. For simplicity, we show only 8 out of 21 classes present in the dataset. Although embedding separability clearly increases as training progresses, t-SNE fails to keep some clusters separate from each other later in training. Looking at the classes present, we observe that the clusters that fail to separate belong to pairs of object classes that are usually co-located in the image, such as road and sidewalk or sky and vegetation. Co-location in these t-SNE plots might also be explained by a higher misclassification rate between these classes.

Analyzing embedding distance metrics

The authors of [19] use the unnormalized dot product between embeddings as an affinity measure. Thus the affinity matrix $A$ has elements:

$$A_{ij} = \langle \phi^s_i, \phi^t_j \rangle$$

We discuss potential affinity measures based on similarity scores or distances and how to use them as affinities. Given two vectors $a = [a_1\, a_2 \dots a_d]$ and $b = [b_1\, b_2 \dots b_d]$, we describe the affinities as follows.

Dot product The dot product between the two vectors can be computed as:

$$a \cdot b = a^T b = \sum_{i=1}^{d} a_i b_i$$


From the definition, it can be observed that the dot product values are unbounded and the resulting value may grow without limit as the magnitude of the vectors increases. The authors of [19] argue that this affinity measure worked best for convergence. However, L2 regularization is necessary to prevent the weights from exploding towards values that maximize affinity by increasing vector magnitude. We argue that the dot product works well as a similarity measure because the wide range of values it can take feeds more signal to the network during the association of embeddings. However, using this affinity measure can get complex and cause very large weights if not regularized properly, especially in deeper architectures.

Cosine similarity Cosine similarity is the normalized version of the dot product, where the latter is divided by the vector norms. Considering an angle $\theta$ between the vectors $a$ and $b$, the cosine is defined as:

$$\cos(\theta) = \frac{a \cdot b}{\|a\|\,\|b\|} = \frac{\sum_{i=1}^{d} a_i b_i}{\sqrt{\sum_{i=1}^{d} a_i^2}\,\sqrt{\sum_{i=1}^{d} b_i^2}}$$

While conceptually this is a normalized dot product, its values range between -1 and 1, yielding the cosine of the angle, so the cosine similarity behaves very differently from the dot product. It can be argued that in theory the cosine similarity is a much more consistent measure of similarity between vectors. It equals the dot product when the vectors $a$ and $b$ have unit norm, but this cannot be guaranteed across neural network architectures.

While in theory it is a good similarity measure, in our experiments we observed that it contributes little to the embedding association. We interpret this in the section below.

Euclidean distance into affinity The Euclidean distance between vectors $a$ and $b$ is defined as:

$$d(a, b) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$

This distance can usually be turned into an affinity by using a radial basis function kernel or by taking the inverse $\frac{1}{1 + d(a, b)}$. Because different transformations of the Euclidean distance led to different adaptation behavior, we investigated this further and observed that the value range of the affinity measure affects the adaptation convergence. If the value range does not allow for enough variety, the signal propagated to the network is not strong enough to allow for the transformation of embeddings. We observe that taking the negative of the Euclidean distance as an affinity yields the best results, since it preserves the magnitudes of the distance measure itself.
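The affinity measures discussed in this section can be written compactly for a batch of source and target embeddings. The following NumPy sketch is illustrative only; the function names and the small `eps` guard against division by zero are assumptions, and the radial basis function variant is omitted.

```python
import numpy as np

def dot_affinity(src, tgt):
    """Unnormalized dot product, A_ij = <phi^s_i, phi^t_j>."""
    return src @ tgt.T

def cosine_affinity(src, tgt, eps=1e-8):
    """Dot product of the norm-scaled embeddings, bounded in [-1, 1]."""
    src_n = src / (np.linalg.norm(src, axis=1, keepdims=True) + eps)
    tgt_n = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + eps)
    return src_n @ tgt_n.T

def euclidean_distance(src, tgt):
    """Pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b."""
    sq = (src ** 2).sum(1)[:, None] + (tgt ** 2).sum(1)[None, :] - 2.0 * src @ tgt.T
    return np.sqrt(np.maximum(sq, 0.0))   # clip tiny negatives from rounding

def negative_euclidean_affinity(src, tgt):
    return -euclidean_distance(src, tgt)

def inverse_euclidean_affinity(src, tgt):
    return 1.0 / (1.0 + euclidean_distance(src, tgt))

src = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])   # N = 3 source embeddings
tgt = np.array([[2.0, 0.0], [0.0, -1.0]])              # M = 2 target embeddings
```

Note how the value ranges differ: the dot product is unbounded, the cosine lies in [-1, 1], the inverse distance lies in (0, 1], and the negative distance keeps the full magnitude of the distance.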

The probabilistic interpretation of distances and importance of numerical stability

We discussed how we transform affinities between embeddings into transition probabilities by using the softmax function in eq. (3.4). The softmax function will produce very small values if classes with a very small presence occur in a large batch of embeddings, which is very often the case in semantic segmentation. This may cause the gradients to explode. It is important to stabilize the values accordingly with a proper affinity measure. It can be argued that other normalization methods can be used to turn the affinity values into probability distributions. We show comparison results in Section 4, which suggest that softmax works best.

Initially, when training models on deep semantic segmentation architectures, exploding gradients would often occur, especially with the dot product similarity measure. To mitigate this, it proved important to stabilize the values fed to the softmax by the affinity measure. Using the negative Euclidean distance for the affinity computation worked best. Since values are softmaxed, the sign becomes unimportant but the magnitude still affects the behavior in a similar manner.
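The sensitivity of the softmax to the magnitude of its inputs can be illustrated with a small sketch. Subtracting the row maximum before exponentiating is a standard implementation trick and is an assumption here, not a step described in this work; the function name and the toy affinities are likewise illustrative.

```python
import numpy as np

def transition_probabilities(affinity):
    """Row-wise softmax turning an (N, M) affinity matrix into probabilities."""
    # Subtracting the row maximum leaves the result unchanged but avoids overflow.
    shifted = affinity - affinity.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Row 0 mimics large dot-product affinities, where a naive np.exp(1000.0) would
# overflow to inf. Row 1 mimics negative Euclidean distances of moderate size.
affinity = np.array([[1000.0, 990.0, 0.0],
                     [-3.0, -1.0, -2.0]])
p = transition_probabilities(affinity)
```

In the first row the probability of the least similar target underflows to zero, which hints at why large affinity magnitudes are problematic once such probabilities enter a logarithm in the loss.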

In addition, the data normalization method seemed to impact the training results for semantic segmentation and helped avoid the exploding gradient problem. Dataset mean subtraction and range normalization, where the mean is subtracted separately for source and target, worked best with the ResNet50-based DeepLab-V2 architecture. The ResNet50 uses batch normalization per layer, so using mean-centered data as the network input stabilizes the value ranges during training.

3.2.3 Adapting segmentation models

The designated embedding layer

Not only the training stage, but also the location in the network from which we extract pixel embeddings is important for obtaining good representations. If extracted too early in the network, they risk missing properties learned by later layers. However, we still want feature-level representations with relatively low spatial dimensionality.

We want to allow upsampling modules to capture global context earlier in the network than the features that will be used. In state-of-the-art architectures, the number of layers in the upsampling module is minimized so that each of them serves a clear purpose.

In addition, we observe that the dimensionality of pixel embeddings for semantic segmentation is crucial for convergence. If it is too large, the propagated gradients are noisy and adaptation is not very effective. However, the dimensionality has to be large enough to allow similar embeddings to group together while still preserving discriminable structures in latent space. Therefore it is important to add a specific embedding layer to the existing semantic segmentation architecture, where the embedding dimensionality can be adjusted according to the task. This layer is added in the decoder part of the base DeepLab-V2 [6] network right before the final bilinear upsampling. This extension is illustrated in Figure 3.5.

Handling memory constraints

Due to the need to compute pairwise distances between pixel embedding vectors, the memory requirements of the associative method are large in the case of semantic segmentation. The matrix $A$ of pairwise distances has dimensionality $N \times M$, where $N$ and $M$ are the numbers of embedding samples for the source and target domains respectively.

In architectures such as DeepLab-V2 [6], where dilated convolutions are used to capture global context, the penultimate layer is a tensor downsampled 8 times in each spatial dimension with respect to the input. Bilinear upsampling is then used to obtain an output with the original input spatial size. We add the embedding layer as explained above in between these, preserving the 8 times downsampled size. Using downsampled data allows us to fit more pixel-level embeddings in memory. A similar approach can be taken with multiple modern segmentation architectures.

In cases where memory availability is even more limited, a subsample of pixel embeddings can be used for association instead of the entire batch. We show in Section 4 that subsampling pixel embeddings to a few times fewer than the full batch does not hurt performance.


Figure 3.5: Extending a DeepLab-V2 architecture with the embedding layer of adjustable dimensionality D.

Other sampling methods such as uniform, grid-based or sampling by max pooling have been investigated in this work. For future work, it could be interesting to investigate density-based embedding sampling as in [37].
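To give a feeling for the memory requirements, the sketch below computes the size of a dense affinity matrix and shows uniform subsampling of pixel embeddings. It assumes one full-resolution 2048 × 1024 image per domain, float32 values and an embedding dimensionality of 64; actual training may use crops or other batch sizes, and the helper names are hypothetical.

```python
import numpy as np

def affinity_matrix_bytes(n_src, n_tgt, bytes_per_value=4):
    """Memory of a dense N x M affinity matrix (float32 by default)."""
    return n_src * n_tgt * bytes_per_value

# One 2048 x 1024 image, downsampled 8 times in each spatial dimension.
u, v = 1024 // 8, 2048 // 8
n_pixels = u * v                                   # 32768 pixel embeddings
full_gib = affinity_matrix_bytes(n_pixels, n_pixels) / 2**30

def subsample_embeddings(embeddings, n_keep, rng):
    """Uniformly sample n_keep pixel embeddings without replacement."""
    idx = rng.choice(len(embeddings), size=n_keep, replace=False)
    return embeddings[idx], idx

rng = np.random.default_rng(0)
embeddings = np.zeros((n_pixels, 64), dtype=np.float32)
kept, idx = subsample_embeddings(embeddings, n_pixels // 4, rng)
sub_gib = affinity_matrix_bytes(len(kept), len(kept)) / 2**30
```

Under these assumptions the full matrix takes 4 GiB, while keeping a quarter of the embeddings in each domain reduces it to 0.25 GiB, since the cost is quadratic in the number of embeddings.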

Adaptation

Consider a source dataset $D_S = \{x^S_i, y^S_i\}$, $i = 1, 2, \dots, N$, where every data sample $x^S_i$ with spatial dimensionality $H \times W$ is annotated by the respective pixel-level labels $y^S_i$ with the same spatial dimensionality. The target images $D_T = \{x^T_j\}$, $j = 1, \dots, M$, are available without annotations. Using a trained network we extract embeddings $\phi(x^s)$, $\phi(x^t)$ respectively, for which we use the simplified notation $\phi^s$, $\phi^t$.

Using the DeepLab-V2 semantic segmentation architecture as described above, we extract activations from the embedding layer in the decoder part of the network. These embeddings are considered at pixel level, on an activation map that is 8 times downsampled in each spatial dimension. This layer produces activations with dimensions $U \times V \times D$, where $U = H/8$, $V = W/8$ and $D$ is an arbitrary embedding dimensionality that is selected according to the task, usually between 64 and 128. In addition, we downsample the label annotations and use $y'^S$, where $y'_i \in \mathbb{R}^{U \times V}$, to match the downsampled source embeddings.
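The shapes involved can be sketched as follows. Nearest-neighbour downsampling of the labels by striding is an assumption, since the exact label downsampling scheme is not specified here; the helper names and toy sizes are also illustrative.

```python
import numpy as np

def flatten_embeddings(activations):
    """Reshape a (U, V, D) activation map into U*V pixel embeddings of size D."""
    u, v, d = activations.shape
    return activations.reshape(u * v, d)

def downsample_labels(labels, factor=8):
    """Nearest-neighbour label downsampling by striding (an assumed scheme)."""
    return labels[::factor, ::factor]

h, w, d = 64, 128, 64                       # toy input size; D = 64 embedding dims
labels = np.zeros((h, w), dtype=np.int64)
labels[:, w // 2:] = 1                      # right half of the image is class 1
activations = np.zeros((h // 8, w // 8, d), dtype=np.float32)

phi_s = flatten_embeddings(activations)     # shape (U*V, D)
y_small = downsample_labels(labels).reshape(-1)   # one label per pixel embedding
```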

For the experiments in this work we make use of all $U \times V$ pixel embeddings for both source and target, extracted from a siamese architecture that shares weights for both domains and is trained on the source branch. We further train with the walker and visit loss to associate pixel-level embeddings, as explained in Section 3.1.


3.3 Associative Adaptation for Streaming Data

3.3.1 From domain adaptation to sequential adaptation for streaming

Streaming data arrives in real time or accumulated in blocks, requiring this data to be processed and used for training sequentially. We discussed in Section 2.2 the problem of concept drift caused by the distribution shift of data over time in streaming. The data is not assumed to be independent and identically distributed. Instead, we might have blocks of data representing very small parts of the distribution, which is an instance of sample selection bias. This is analogous to the domain shift in domain adaptation.

Given the stream $S = \{S_1, \dots, S_K\}$ of data, every $S_k$ can be a sample pair or a block of data $X^\tau$ in which multiple samples are accumulated. The sequential data $X^\tau$ enters the stream at time $\tau = 1, 2, \dots, K$ and sometimes has access to the respective labels $Y^\tau$. Due to concept drift, the distribution of data changes over time, which can be represented formally: if the data samples are generated with a source distribution $p_\tau(X, Y)$, then for two distinct points in time $\tau$ and $\tau + \Delta$, an $X$ exists such that $p_\tau(X, Y) \neq p_{\tau+\Delta}(X, Y)$ [50].

Domain adaptation can be considered a special case of this setup in which only two time steps and a single domain shift between them are considered, with $p_\tau(X) \neq p_{\tau+1}(X)$. Thus we can generalize dynamic domain adaptation over multiple steps, just as in a streaming scenario.

Contrary to this intuition, domain adaptation has so far only been studied in a static setup with fixed source and target domains. In modern applications the data from a defined domain is usually not available all at once. If we consider image data collected for training models for self-driving cars, for instance, the data collected from different cities comes in at different times in "bundles" or "data blocks", which are domain-specific sets. This data may have a similar distribution to the previously collected samples, but usually suffers from sample selection bias. We also cannot consider the incoming data as coming from an entirely different domain, as the difference in distribution will depend on the sample selection bias.

These data blocks may be larger than what is usually perceived as a streaming setup, but formally, training networks on them does not differ much from the streaming training steps. We argue that large-scale domain adaptation behaves similarly to classifier adaptation in streaming data scenarios. Further, we propose an adaptation framework that generalizes to both, where data comes in as blocks of small mini-domains to which an initial model, trained on a single supervised static domain, can be adapted.

3.3.2 Building a framework for adapting classifiers in time

We introduce a framework for adaptation over distribution shifts that change in time, which can be applied to streaming data and sequential domain adaptation. We consider a streaming setup with an "initially labeled environment" that, as described above, does not receive further labels, only raw images, in a semi-supervised setting.

Formally, consider an initially labeled set $S_0 = \{X^0_i, Y^0_i\}$, $i = 1, \dots, N$. Incoming data is bulked into streaming sets, which are blocks of data coming in as sequences $S_{1..K}$ and usually have fewer samples than the stationary set $S_0$. This pre-labeled stationary training set is first used for off-line training of a predictive model.

An illustration is provided in Figure 3.6. We can adapt to these data blocks sequentially. An accumulator collects samples in data blocks, or stream batches as we call them. At every time step, a sliding window goes over the stream batches to adapt to. For simplicity we use a sliding window of size 1, but adaptation can be performed with multiple stream batches.

In this work we formulate a patchwise classification task for patches extracted from video frames. This aims to represent a simple approximation to a semantic segmentation task for video data.

In this framework the memory limitations of streaming applications are considered, since sequential adaptation allows previously adapted streaming batches to be discarded. Once adapted, the stream batch can be discarded, allowing the following batches to enter the pipeline.

Figure 3.6: Streaming framework for adaptation of classifiers in a semi-supervised manner on initially labeled environments.

3.3.3 Adapting for streaming data

We show how the incoming unsupervised streaming batches can be used to boost classifier accuracy on the next batches by using our distribution-balancing associative learning method to adapt from the source classifier to the unsupervised stream batches. We improve classifier accuracy over the incoming stream set in an unsupervised manner without requiring extra labels, and show how this can be generalized to the adaptation of data bundles over arbitrary time frames.

Consider a stream batch $S_k$ that has just entered the stream. The model trained on $S_0$ can be adapted to $S_k$ while $S_{k+1}$ is being collected. Predictions on $S_{k+1}$ made with the model trained on $S_0$ and adapted on $S_k$ are more accurate than those of a model trained offline on $S_0$ only. Adaptation on $S_k$ is expected to boost predictions for the streaming batches $S_{k+2}, \dots, S_m$ as well, if the distribution up to $S_m$ stays close enough to that of $S_k$. In the next stage, the previously adapted model is further adapted on $S_{k+1}$ to improve on $S_{k+2}$, and so forth. The size of the streaming batches can be arbitrary and should be determined based on the relative velocity of the distribution shift. We experiment with different sizes relative to the batch size.
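The sequential procedure with a sliding window of size 1 can be sketched as a simple loop. The callables `adapt` and `predict` are hypothetical placeholders for the associative adaptation step and the classifier forward pass; the toy stand-ins below only record which batches the model has seen.

```python
def adapt_over_stream(model, stream_batches, adapt, predict):
    """Predict on each incoming batch, then adapt the model to it.

    `adapt(model, batch)` and `predict(model, batch)` are hypothetical callables.
    Predictions on S_k therefore use the model adapted up to S_{k-1}.
    """
    predictions = []
    for batch in stream_batches:
        predictions.append(predict(model, batch))
        # Unsupervised adaptation on the current batch, which can then be discarded.
        model = adapt(model, batch)
    return model, predictions

# Toy stand-ins: the "model" is just the list of batches it has been adapted to.
toy_adapt = lambda model, batch: model + [batch]
toy_predict = lambda model, batch: (batch, tuple(model))
final_model, preds = adapt_over_stream(["S0"], ["S1", "S2", "S3"],
                                       toy_adapt, toy_predict)
```

In this toy run, the prediction for "S2" is made by a model that has seen "S0" and "S1", matching the description above.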

This approach improves predictions not only in the next stream batch, but also in the next few batches when on-demand predictions are requested. In this way we can adapt classifiers to deal with incoming streams by exploiting adaptability to unsupervised data.

The adaptation is done similarly to Section 3.1, considering here the stationary data S′ as source and extracting target embeddings from the stream batch $S_k$ being adapted.


4 Experiments and Results

4.1 Robust Associative Domain Adaptation for Image Classification

In this section we report extended experimental results motivating our formulation of the method. These experiments aim to adapt for image classification on several well-known adaptation benchmarks for digit classification, street sign classification and generic object recognition.

First, we show how increasing the source-to-target distribution divergence causes the original formulation of associative domain adaptation to fail. Further, we show how our proposed weighted visit loss yields improved results even with the weights estimated with an off-the-shelf clustering algorithm. We report how using an oracle-weighted visit loss, which is the theoretical upper bound if the unsupervised clustering method were to fully overlap with the classes, yields a maximum adaptation capacity that is stable even with very large distribution divergence.

Further, we report considerations on the method and metrics used.

4.1.1 Datasets and adaptation benchmarks

For digit classification, we report experiments on the following adaptation benchmarks:

MNIST → MNIST-M As a standard domain adaptation benchmark, digit classification with MNIST [29] as source and MNIST-M [16] as target was performed. MNIST-M is a modified version of MNIST where the original images are filled with a colorful background.

SVHN → MNIST The second version of Street View House Numbers [39], in which images are cropped to digits, is used as the source domain to adapt to MNIST as the target domain.

Synth digits → SVHN Synthetic Digits [16] is a dataset that aims to mimic the distribution of SVHN for domain adaptation purposes.

Further, we adapt between synthetic and real data for street sign classification:

Synth signs → GTSRB Synthetic Signs [38] is a dataset generated from different transformations of Wikipedia images of street signs, which aims to mimic the German Traffic Sign Recognition Benchmark [49] of cropped street signs.

Last, we adapt for object recognition with the following benchmark:

CIFAR-10 → STL-10 CIFAR-10 [28] contains 60000 32×32 images belonging to 10 classes. STL-10 [9] contains 1300 labeled images per class and a large unsupervised part. The images are 96×96, but we downscale them to the resolution of CIFAR-10. The two datasets overlap in 9 out of the 10 classes, therefore they can be used for adaptation purposes.

These are visualized in Figure 4.1.


Figure 4.1: Domain adaptation benchmarks for classification: (a) MNIST → MNIST-M, (b) SVHN → MNIST, (c) Synth Digits → SVHN, (d) Synth Signs → GTSRB, (e) CIFAR-10 → STL-10.

4.1.2 Balancing distribution differences with weighted visit loss

In Table 4.1 we show adaptation results over all classification benchmarks. Step by step, the source-to-target KL-divergence between class distributions is increased in a controlled setting. As the KL-divergence increases, the original formulation of associative domain adaptation [19] starts degrading rapidly.
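For reference, the KL-divergence between batch-level class distributions can be computed as in the sketch below. The direction $KL(p_{src} \,\|\, p_{tgt})$, the use of natural logarithms and the helper names are assumptions, since the exact convention used to obtain the values in Table 4.1 is not stated here.

```python
import numpy as np

def class_distribution(labels, n_classes):
    """Empirical class probabilities of a batch of integer labels."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two discrete distributions."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    mask = p > 0                      # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

p_src = class_distribution(np.array([0, 0, 1, 1]), n_classes=2)   # uniform batch
p_tgt = class_distribution(np.array([0, 0, 0, 1]), n_classes=2)   # skewed batch
kl = kl_divergence(p_src, p_tgt)
```

For this toy pair the divergence is $0.5 \ln(4/3) \approx 0.144$, which lies between the KL = 0.1 and KL = 0.2 settings of the table.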

Table 4.1: Adaptation accuracy as KL-divergence of source to target class distributions in a batch increases. The oracle version uses the true target class probabilities and serves as an upper bound.

| Src-Tgt divergence | Method | MNIST-MNISTM | SVHN-MNIST | Synth Dig.-SVHN | Synth Signs-GTSRB | CIFAR10-STL10 |
|---|---|---|---|---|---|---|
| - | Source only | 64.0 | 69.4 | 85.8 | 95.4 | 52.7 |
| - | Target only | 93.6 | 99.5 | 94.2 | 98.1 | 99.8 |
| KL = 0.05 | Adapted using [19] | 87.6 | 97.0 | 91.9 | 96.2 | 61.3 |
| KL = 0.05 | Ours | 88.3 | 97.2 | 92.6 | 96.5 | 61.2 |
| KL = 0.05 | Ours with oracle* | 90.0 | 97.2 | 92.8 | 97.5 | 61.5 |
| KL = 0.1 | Adapted using [19] | 87.8 | 96.8 | 91.6 | 96.0 | 60.0 |
| KL = 0.1 | Ours | 89.1 | 97.3 | 91.6 | 96.4 | 61.3 |
| KL = 0.1 | Ours with oracle* | 90.1 | 97.5 | 92.7 | 97.5 | 61.3 |
| KL = 0.2 | Adapted using [19] | 85.2 | 94.3 | 87.6 | 95.9 | 57.6 |
| KL = 0.2 | Ours | 87.6 | 96.9 | 89.9 | 95.6 | 58.3 |
| KL = 0.2 | Ours with oracle* | 90.1 | 97.8 | 92.8 | 97.3 | 61.2 |
| KL = 0.4 | Adapted using [19] | 81.7 | 94.2 | 87.1 | 95.5 | 53.4 |
| KL = 0.4 | Ours | 83.8 | 94.9 | 88.0 | 95.3 | 56.2 |
| KL = 0.4 | Ours with oracle* | 89.8 | 96.6 | 92.6 | 94.1 | 61.4 |

We show the respective results with our method. To verify our theory on distribution balance, we also show results with an oracle-weighted visit loss. This is a theoretical upper bound, corresponding to an ideal scenario with a perfect clustering algorithm that assigns all embeddings to clusters fully overlapping with their respective classes.


Our off-the-shelf agglomerative clustering method is not perfect, but it almost always outperforms the original formulation, most visibly in the cases where the KL-divergence is largest. These results often get very close to the oracle-weighted results, indicating that with a very exact clustering technique we would be able to achieve maximum adaptation performance.

We can conclude that in scenarios where a large class distribution shift is expected, our robust formulation of the method works best. Having verified this, we can move on to applying our method to realistic tasks and datasets with potentially large class distribution shift between domains.

4.1.3 Understanding embedding associations: The effect of the metric and normalization method

In order to investigate the effect of the affinity measure used between embeddings on the accuracy results, we rethink and analyze the decisions made in the original associative domain adaptation method (Section 3.1).

The authors report the dot product as working best for convergence. We compare it with cosine similarity and Euclidean distance based affinities. In addition, the softmax is a good function to produce probabilities from affinities, but it could be a cause of exploding gradients in deeper architectures. Therefore, we analyze the impact of using softmax and compare it to unity-based normalization of values. The results are reported in Table 4.2.

Table 4.2: Analyzing the effect of the affinity measure and normalization method for turning affinities into probabilities. Accuracy results are reported for the MNIST → MNIST-M adaptation benchmark.

| Setup | Affinity measure | Normalization method | Accuracy |
|---|---|---|---|
| Source only | - | - | 64.0 |
| Target only | - | - | 93.6 |
| Adaptation | dot product | softmax | 90.2 |
| Adaptation | dot product | unity-based | 79.2 |
| Adaptation | cosine similarity | softmax | 81.6 |
| Adaptation | cosine similarity | unity-based | 85.5 |
| Adaptation | euclidean (inverse) | softmax | 71.1 |
| Adaptation | euclidean (inverse) | unity-based | 81.2 |
| Adaptation | euclidean (negative) | softmax | 90.1 |
| Adaptation | euclidean (negative) | unity-based | 87.4 |

These results are interesting not only because they show which metric works best, but also because they uncover a further observation: the combination of metric and normalization method matters, because the value ranges it produces affect the overall results.

Given the abundance of embeddings, especially in dense prediction tasks, if the probabilities are very close to each other, there will not be enough signal coming from the adaptation loss due to the noisy gradients these produce. Softmax appears to be a good normalization function, as it seems to balance these ranges best.

In addition, we can observe that cosine similarity, for instance, works best with unity-based normalization. Since cosine similarity ranges in the interval [-1, 1], small value differences may be meaningful. It is possible that softmax smooths these meaningful differences, not allowing enough signal to be propagated to the classifier from the affinity computation.
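The smoothing effect can be made concrete with a small numerical sketch. Because cosine affinities lie in [-1, 1], the ratio between any two softmax probabilities in a row is at most $e^2 \approx 7.39$, however many targets there are. The `unity_based` function below is one assumed reading of unity-based normalization (min-max rescaling to [0, 1] followed by division by the sum); the exact variant used in the experiments is not specified here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def unity_based(x):
    """Min-max rescaling to [0, 1], then division by the sum (assumed variant)."""
    scaled = (x - x.min()) / (x.max() - x.min())
    return scaled / scaled.sum()

cosine_row = np.array([1.0, 0.5, 0.0, -1.0])   # affinities of one source embedding
p_soft = softmax(cosine_row)
p_unity = unity_based(cosine_row)
max_ratio = p_soft.max() / p_soft.min()        # exp(1 - (-1)) = e^2
```

With this variant, the least similar target receives zero probability, whereas softmax still assigns it a noticeable share.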

Another argument for the importance of value ranges comes from comparing the accuracy results of the Euclidean-based metrics. In principle we should expect the same behavior, as the underlying distance is the same. However, the function used to transform this distance into an affinity has a large impact on accuracy: with softmax, the inverse distance reaches 71.1 while the negative distance reaches 90.1.

By using the negative of the Euclidean distance as affinity, we get results similar to the original dot product scores. This conclusion is useful for replacing the dot product to further stabilize training in the semantic segmentation experiments. Because the dot product is unbounded and increases with the magnitude of both embedding vectors, the network might see fit to increase weights to very large values to maximize affinity. Thus, not only would the dot product require proper regularization, but it also risks causing exploding gradients. This is why, in the rest of this work, we continue with the Euclidean distance based affinity.
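The affinity measures and normalization methods compared in Table 4.2 can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation used for the experiments: the function names are hypothetical, the small epsilon constants are assumptions added for numerical safety, and unity-based normalization is assumed here to mean min-max rescaling of each row to [0, 1] followed by division by the row sum.

```python
import numpy as np

def pairwise_affinity(a, b, kind):
    """Affinity between every row of a (N, D) and every row of b (M, D)."""
    if kind == "dot":
        return a @ b.T
    if kind == "cosine":
        a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
        b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a_unit @ b_unit.T
    # Pairwise Euclidean distances via broadcasting: result has shape (N, M).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if kind == "neg_euclidean":
        return -dist
    if kind == "inv_euclidean":
        return 1.0 / (dist + 1e-8)  # epsilon is an assumption, avoids division by zero
    raise ValueError(f"unknown affinity: {kind}")

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def unity_based_rows(m):
    """Assumed variant: min-max rescale each row to [0, 1], then make it sum to one."""
    lo = m.min(axis=1, keepdims=True)
    hi = m.max(axis=1, keepdims=True)
    scaled = (m - lo) / (hi - lo + 1e-8)
    return scaled / (scaled.sum(axis=1, keepdims=True) + 1e-8)
```

Note that the dot product is the only one of these affinities that grows with the embedding norms, which is the property that motivates replacing it with the negative Euclidean distance.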


4.2 Domain Adaptation for Semantic Segmentation

4.2.1 Datasets

For the semantic segmentation adaptation setup we experiment with synthetic to real data adaptation for urban street images. We focus on the GTA5 to Cityscapes adaptation benchmark.

GTA5 GTA5 is a synthetically generated dataset whose images are collected from the game renderer during gameplay of a realistic race game. It contains 24966 images with resolution 1914 × 1052. Of these, 12500 are used for the training set and 6800 for validation.

Cityscapes A well known semantic segmentation benchmark dataset, Cityscapes consists of real images in 2048 × 1024 resolution, 5000 of which are finely annotated with pixel-level labels. Of these, 2975 images are used for training and 500 for validation. The 19 annotated classes matching the GTA5 categories are used.

In Figure 4.2 we plot the class probability distribution over the images in these datasets.

Figure 4.2: Distribution of class probabilities in images of the original GTA5 and Cityscapes datasets.

It can be observed that the probabilities vary widely between classes. Although the overall distributions look similar, for the small classes the differences between the two distributions are large relative to the class probability itself. Besides, we can expect a large variation of these distributions per batch, which makes our formulation of the associative domain adaptation method useful in this scenario.
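The comparison of class distributions described above can be sketched as follows: class probabilities are estimated by counting labeled pixels, and the gap between two distributions is measured with the KL-divergence used later in the experiments. The function names are hypothetical, and the ignore label value of 255 is an assumption about how unlabeled pixels are encoded.

```python
import numpy as np

def class_distribution(label_maps, num_classes=19, ignore_label=255):
    """Estimate class probabilities from an iterable of integer label maps."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for labels in label_maps:
        valid = labels[labels != ignore_label]  # 1-D array of labeled pixels
        counts += np.bincount(valid, minlength=num_classes)[:num_classes]
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Applying `class_distribution` to the label maps of a single mini-batch instead of the full dataset gives the per-batch distributions, whose variation is what the method has to be robust to.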

4.2.2 Towards semantic segmentation with patchwise classification

To take a smaller step from classification to semantic segmentation, we start with patchwise classification for patches extracted from the datasets above. In this way we can first take a classification network which has been shown to work well on other classification benchmarks and use it for these specific datasets before moving on to complex semantic segmentation architectures.

Figure 4.3: Patchwise Classification Dataset. (a) GTA5 patches, (b) Cityscapes patches

The patchwise classification dataset

We generate patchwise classification datasets from the original segmentation datasets, Cityscapes and GTA5.

From these datasets, 65 × 65 crops are extracted from the images and labeled with the class of the middle pixel of the patch. About 380000 patch images are generated for GTA5 and about 160000 for Cityscapes, approximating their original dataset size ratios. Some example images of the patchwise classification dataset are shown in Figure 4.3.

Patches are sampled such that as large a class diversity as possible is sampled from every image. A plot of the class distribution in the patchwise classification datasets is shown in Figure 4.4.

Figure 4.4: Patchwise classification dataset class distribution


This distribution is more balanced than the per-image class probability distributions because of the sampling method used to extract the patches; otherwise it would be expected to be similar to the class probabilities per image.
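The patch extraction described in this section can be sketched as below: for every class present in an image, a few pixel positions of that class are picked at random and a 65 × 65 crop centred on each is labeled with the class of its middle pixel. This is a minimal sketch under stated assumptions, not the sampling code used for the datasets: the function name, the `per_class` parameter and the ignore label of 255 are hypothetical, and only centres whose full patch fits inside the image are considered.

```python
import numpy as np

def sample_diverse_patches(image, labels, patch=65, per_class=1,
                           rng=None, ignore_label=255):
    """Sample up to `per_class` patches for every class present in `labels`.

    image:  (H, W, C) array, labels: (H, W) integer array.
    Returns (patches, targets) with patches of shape (N, patch, patch, C).
    """
    rng = np.random.default_rng() if rng is None else rng
    half = patch // 2
    h, w = labels.shape
    # Restrict candidate centres to positions where the whole patch fits.
    interior = labels[half:h - half, half:w - half]
    patches, targets = [], []
    for cls in np.unique(interior):
        if cls == ignore_label:
            continue
        ys, xs = np.nonzero(interior == cls)
        picks = rng.choice(len(ys), size=min(per_class, len(ys)), replace=False)
        for i in picks:
            cy, cx = ys[i] + half, xs[i] + half  # back to full-image coordinates
            patches.append(image[cy - half:cy + half + 1, cx - half:cx + half + 1])
            targets.append(int(cls))  # label of the middle pixel
    return np.stack(patches), np.array(targets)
```

Sampling per class rather than uniformly over pixels is what flattens the class distribution relative to the per-image pixel statistics.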

Adaptation results

In Table 4.3 we show the results for domain adaptation between the patchwise classification datasets. For these results we use a simple classification network made of three convolution blocks, each with two convolutions and a max pooling layer, followed by two fully connected layers.

Table 4.3: Adaptation accuracy results with the patchwise classification dataset.

Source to Target Divergence           Setup                              Accuracy
-                                     Source only                        35.1
-                                     Target only                        46.0
KL = 0.2, target uniform sampled      Adapted using [19]                 37.9
KL = 0.2, source uniform sampled      Adapted using [19]                 38.1
KL = 0, both sampled non-uniformly    Adapted using [19]                 40.2
KL = 0.2, source uniform sampled      Ours (cluster-estimated weights)   40.7
KL = 0.2, source uniform sampled      Ours (oracle weights)              40.9

It can be observed in this dataset as well that when the KL-divergence between source and target increases, the original associative method loses accuracy (37.9 and 38.1 compared to 40.2 at KL = 0), while our revision with weights estimated by clustering reaches 40.7, very close to its oracle-weighted counterpart at 40.9.

4.2.3 Adapting Semantic Segmentation Models

We validate our class distribution independent approach on the task of domain adaptation for semantic segmentation of urban street scenes. The task of synthetic to real domain adaptation is important because annotation scales well for synthetic data. We adapt between the synthetic GTA5 and real Cityscapes urban scene parsing datasets described above.

Setup As a segmentation network we use the DeepLab-V2 [6] architecture with a ResNet-50 [22] backbone and dilated convolutions. We extend the original DeepLab-V2 architecture with a D-dimensional embedding layer that can be adjusted for experiment purposes. A detailed figure of the architecture was given in Figure 3.5. We use a coefficient of 1 for the walker loss and 0.5 for the visit loss, adjusted for the magnitude of the loss values. We do not tune the visit loss coefficient further, as this would imply access to the target labels.

The respective training sets of GTA5 and Cityscapes are used as the domains for training. We test on the Cityscapes val set and report the results in Table 4.4. For all the experiments we use versions of the original datasets downsampled to 256×512 pixels. The source only model is trained for 30K iterations on the GTA5 training set. We initialize the weights of the ResNet-50 based layers from pretrained ImageNet [13] weights. This applies only to the encoder part of the network; the subsequent dilated convolution layers are trained from scratch.
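To make the role of the two loss coefficients concrete, the sketch below computes a walker and a visit loss in the standard, unweighted form known from the associative domain adaptation literature and combines them with the coefficients 1 and 0.5 stated above. It is an assumption-laden illustration: the function names are hypothetical, the negative Euclidean distance is used as affinity, and the distribution independent weighting introduced in Section 3 is deliberately not reproduced here.

```python
import numpy as np

def softmax_rows(m):
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def associative_losses(src, src_labels, tgt):
    """Unweighted walker and visit losses for source (N, D) and target (M, D) embeddings."""
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    affinity = -dist                       # negative Euclidean distance
    p_ab = softmax_rows(affinity)          # source -> target transition probabilities
    p_ba = softmax_rows(affinity.T)        # target -> source transition probabilities
    p_aba = p_ab @ p_ba                    # round trip source -> target -> source
    # Walker target: uniform over source samples that share the start sample's class.
    same = (src_labels[:, None] == src_labels[None, :]).astype(float)
    walker_target = same / same.sum(axis=1, keepdims=True)
    walker = -np.mean(np.sum(walker_target * np.log(p_aba + 1e-8), axis=1))
    # Visit target: every target sample should be visited equally often.
    p_visit = p_ab.mean(axis=0)
    visit = -np.sum(np.log(p_visit + 1e-8)) / len(p_visit)
    return walker, visit

def total_loss(classification_loss, walker, visit,
               walker_weight=1.0, visit_weight=0.5):
    """Combine the supervised loss with the two adaptation terms."""
    return classification_loss + walker_weight * walker + visit_weight * visit
```

The uniform visit target is exactly where an assumption of similar class distributions enters, which is why its coefficient is the sensitive one in this setup.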


Effect of similarity measures We discussed in Section 3.2.2 that in deeper semantic segmentation networks the distance measure is very important for numerical stability. The embedding visualizations in Figure 4.5 indicate how the variation in value range corresponds to the visual discriminability of objects in the image. For instance, the values for cosine similarity are bounded to the [-1, 1] range, and different objects are visibly harder to distinguish in the projected RGB image than with the other measures. The visualized embeddings are generated by reducing the embedding dimensionality from 64 to 3 with random projections and using the resulting values to select an RGB color.

Figure 4.5: Visualized embeddings of adaptation with different similarity measures: (a) Cosine similarity, (b) Dot product, (c) Negative Euclidean distance
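The visualization procedure used for Figure 4.5 can be sketched as follows: a random Gaussian matrix projects each pixel embedding to three values, which are then rescaled per channel to the 0 to 255 range. The function name, the fixed seed and the per-channel min-max rescaling are assumptions made for this illustration; the exact projection and color mapping behind the figure are not specified in the text.

```python
import numpy as np

def embeddings_to_rgb(emb, seed=0):
    """Project pixel embeddings (H, W, D) to an RGB image (H, W, 3) of dtype uint8."""
    h, w, d = emb.shape
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(d, 3))          # random projection matrix
    low = emb.reshape(-1, d) @ projection         # (H*W, 3)
    lo, hi = low.min(axis=0), low.max(axis=0)
    scaled = (low - lo) / (hi - lo + 1e-8)        # per-channel rescaling to [0, 1]
    return np.rint(scaled * 255).astype(np.uint8).reshape(h, w, 3)
```

Because the projection is random, the colors themselves carry no meaning; only the contrast between regions indicates how well the embeddings separate objects.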

Adaptation results

The adaptation results for the best performing adapted model are shown in Table 4.4. This model achieves a 3.1% mIoU and 9.1% accuracy improvement over the source only baseline, compared to the 1.7% mIoU and 6.0% accuracy improvement of the original formulation, which assumes similar distributions. The standard formulation is outperformed on 15 out of the 19 categories. This model uses a 64-dimensional embedding layer integrated in the original DeepLab-V2 architecture.

It can be noticed that the weighted version improves on mid-size classes such as car, bus and person, but also on smaller classes such as light and pole, which would be the most likely to be wrongly associated if the distributions were assumed similar. On the other hand, the original unweighted approach improves on large classes such as vegetation, terrain and sky, clearly at the cost of the smaller classes, which are wrongly associated because the uniformly distributed large classes cover the associative space of the smaller ones. For very small classes such as bike, train and motorbike, there is still room for improvement. Since these classes are rare and also correspond to relatively small objects in the image, wrong associations between embeddings during adaptation can easily damage improvement on them. Very good weight estimates from unsupervised clustering are necessary to be able to improve on these rare classes.

Table 4.4: GTA5 to Cityscapes domain adaptation. The last two columns show results of adapting with the unweighted version of the method and with the distribution independent one.

Class        NoAdapt   Adapt (no wght.)   Adapt (est. wght.)
road         33.8      59.9               63.8
sidewalk     23.2      29.8               31.3
building     67.5      67.1               68.4
wall         18.2      16.2               19.4
fence        20.1      10.7               19.6
pole         18.1      22.9               23.2
light        15.9      13.2               17.6
sign         21.8      9.1                11.8
vegetation   66.9      78.0               62.9
terrain      18.0      33.4               22.7
sky          72.4      75.7               61.0
person       33.0      41.9               52.1
rider        6.5       0.3                7.8
car          25.0      32.3               42.5
truck        15.8      12.4               13.4
bus          19.3      16.5               22.1
train        6.0       5.7                6.2
motorb.      8.4       2.9                9.1
bike         5.8       0.1                0.1
mIoU         26.1      27.8               29.2
Pixel Acc    72.8      78.8               81.9
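For reference, the per-class IoU, mIoU and pixel accuracy reported in Table 4.4 are conventionally derived from a confusion matrix, as in the sketch below. This shows the standard definitions and is not the evaluation script used for the table; the function names and the ignore label of 255 are assumptions.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=19, ignore_label=255):
    """Rows index the ground-truth class, columns the predicted class."""
    mask = gt != ignore_label
    idx = num_classes * gt[mask].astype(np.int64) + pred[mask].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_scores(conf):
    """Return (per-class IoU, mean IoU, pixel accuracy) from a confusion matrix."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    # Classes that never occur in prediction or ground truth get NaN and are skipped.
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return iou, float(np.nanmean(iou)), float(tp.sum() / conf.sum())
```

Since mIoU averages over classes while pixel accuracy averages over pixels, large classes such as road dominate the accuracy column but count only once in the mIoU.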

In Figure 4.6 we show qualitative visual results of how adaptation changes the segmentation masks compared to source only training. We can observe how adaptation recovers substantially on classes such as road, car and truck, where the source only network outputs very blobby predictions.

Figure 4.6: Visualizing domain adaptation results for Cityscapes

We reported above the adaptation results which yielded the highest absolute adaptation mIoU on the target. With a different setup for the embedding dimensionality we achieve a smaller absolute adaptation mIoU but a larger adaptation surplus of 4.5% mIoU. These results have been achieved with only little tuning of the embedding layer dimensionality. Further hyperparameters, such as the learning rate schedule and the adaptation loss coefficients, could be tuned for even better results. In addition, we use pre-trained ImageNet weights from a ResNet-50 architecture. Our DeepLab-V2 backbone is slightly different from the original ResNet-50, as dilated (atrous) convolutions are added in the last two residual convolution blocks. We do use the pre-trained weights, but we continue adapting further with a relatively small learning rate of 10^-5 with exponential decay. This might not be the optimal learning rate schedule, as some layers are initialized randomly while others are initialized with pre-trained weights. Freezing the pre-trained layers, or using a smaller learning rate for them than for the randomly initialized layers, could potentially produce even better results before and after adaptation. For future work there is plenty to improve on the current setup in this direction.

With the above in mind, we still compare to current state of the art methods in domain adaptation for semantic segmentation, which are usually engineered for the semantic segmentation task and do not aim to reuse cross-task adaptation methods. We report these comparisons in Table 4.5.

While our method does not outperform the state of the art, it achieves competitive results in terms of final adaptation score with a method not fully tailored to the task, exceeding or getting very close to 3 out of the 5 methods. In terms of adaptation surplus, our largest surplus comes from a model which does not yield the best absolute adaptation mIoU, using 128-dimensional pixel embeddings instead of 64-dimensional ones. In terms of surplus our models exceed only one out of the 5 existing methods to date. However, we compete with a strong pre-adaptation baseline and with a method developed for cross-task adaptation, unlike all the segmentation-tailored competitors.


Table 4.5: Comparing to other methods by mIoU values before and after adaptation.

Method                     NoAdapt baseline   Adapt   Adaptation surplus
Ours (64-D embeddings)     26.1               29.2    3.1
Ours (128-D embeddings)    23.4               27.9    4.5
FCNs in the Wild [24]      21.2               27.1    5.9
Curriculum DA [68]         23.0               29.0    6.0
FC Tri-branch [67]         26.2               30.5    4.3
CyCADA [23]                29.2               35.4    6.2
UDA with GANs [44]         29.6               37.1    7.5

Discussion Except for FCNs in the Wild [24], all the other methods were published while this thesis work was taking place, so they are all very recent. The release of the GTA5 dataset and the establishment of GTA5 to Cityscapes as a synthetic to real semantic segmentation adaptation benchmark have sparked greater interest in solving this problem, especially given its usefulness for new developments in self-driving cars.

A few things could potentially improve the results of our method as is, without theoretical changes, and bring it closer to the state of the art. First, we downsample the GTA5 and Cityscapes data from their original resolutions, 1052×1914 and 1024×2048 respectively, to 256×512 for both. This is approximately 4 times downsampling in each spatial direction for both datasets, and thus about 16 times fewer pixel annotations. We do this for computational and memory efficiency given the time limitations. However, full resolution images would be expected to yield even better results.
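The downsampling factors quoted above can be checked with a few lines of arithmetic. The helper name is hypothetical, and the GTA5 resolution is taken as 1914 × 1052 (width × height) as stated in Section 4.2.1.

```python
def downsampling_factors(orig_hw, new_hw=(256, 512)):
    """Return the (vertical, horizontal, pixel-count) reduction factors."""
    fy = orig_hw[0] / new_hw[0]
    fx = orig_hw[1] / new_hw[1]
    return fy, fx, fy * fx

# Cityscapes: exactly 4x per direction, 16x fewer pixels.
cityscapes = downsampling_factors((1024, 2048))
# GTA5: about 4.1x vertically and 3.7x horizontally, roughly 15.4x fewer pixels.
gta5 = downsampling_factors((1052, 1914))
```

The GTA5 factors differ slightly between the two directions, so the downsampling also changes the aspect ratio of the synthetic images a little.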

We use a DeepLab-V2 architecture with a ResNet-50 backbone, which is competitive with the other methods; these use variations of fully convolutional networks with either a VGG-16 or a ResNet-101 backbone, and some variation of a context module in the decoder with atrous convolutions, learned deconvolutions or simply bilinear upsampling.

In addition, some works use not only the Cityscapes train set as an unsupervised target to adapt to, but also include the Cityscapes val set in training. The labels for the Cityscapes test set are not publicly available, but predictions can be submitted to the Cityscapes adaptation benchmark to obtain the scores.

Another discussion item when comparing to the state of the art in domain adaptation is the novelty that the method brings. For instance, [24, 23, 44] are all adversarial approaches. While [24] follows an adversarial discriminative approach, [23] and [44] use an adversarial generative approach, learning an explicit mapping between source and target images so that a classifier trained on source can operate on transformed target images. There are further considerations tailored to semantic segmentation, but the underlying approach remains similar. There is clearly a large advantage in using generative adversarial networks to adapt for semantic segmentation tasks.

On the other hand, [67] with network branches and pseudo labels and [68] with curriculum learning explore entirely different ideas for tackling adaptation for semantic segmentation. However, discrepancy-based and reconstruction-based methods for domain adaptation remain largely unexplored for the semantic segmentation task. Our work brings the novelty of exploring a metric-based approach for semantic segmentation, showing that it is possible to achieve competitive results with a discrepancy-based method that carries over its theoretical basis and lessons from the classification task.


4.3 Adapting Models for Streaming Data

Next, we evaluate the method within our streaming data framework. We use video sequence data and extract patches for patchwise classification, as we explain in the next section. Since concept drift is expected to make the class distributions differ between the source and the target, we expect our method to work well for closing the distribution gap.

4.3.1 Dataset and setup

To simulate a streaming data scenario, we use crops from the sequentially ordered GTA5 dataset [43]. Since it is collected from video game play, this dataset, when viewed sequentially, contains video-like sequences that can be considered a stream of video frames. In Figure 1.2 we showed how a distribution shift can be observed in this dataset across different time steps.

The annotations for the data are synthetically rendered and are available at pixel level. To simplify the problem but still be able to capture properties of segmentation, we extract patches from the sequential images and work with a patchwise classification task.

Thus, from the GTA5 frame sequences, several patch crops with resolution 65×65 are extracted per image. The patches look visually similar to the patchwise classification dataset shown in Figure 4.3a. As in the previous case, these patches are extracted from a version of the GTA5 dataset downsampled to 256×512 resolution. Every patch is annotated with the label value of the middle pixel. Unlike the patchwise classification scenario for domain adaptation between GTA5 and Cityscapes, in this setup we do not sample crops for as many classes as possible in the image, since in streaming data we do not usually have access to the classes for crop sampling. Instead, the crops are collected at random and the distribution is expected to be close to the true distribution of classes in the video frames.

We consider an "initially labeled environment" streaming data setup, where a small set of stationary labeled data is pre-collected and available for training. Next, the streaming data comes in accumulated blocks which we refer to as streaming batches. For the stationary training data, we sample about 32000 patches of 65×65 pixels from the first 5000 images in the GTA5 dataset. For the incoming stream we sample patches from bundles of 1000 images each, collected sequentially. In the main experimental setup, 6000 patches are sampled from every bundle of images and accumulated in a streaming batch, but we show below what happens if these sizes are selected differently. We experiment with adapting to six of these sets following the stationary training set.

4.3.2 Adaptation results

To choose the size of the streaming batch, we accumulate bundles of data in different sizes, relative to the training mini-batch. For training we use a mini-batch of size 190, which is 10 times the number of classes. We create 3 setups of streaming batches, of sizes approximately 60, 30 and 10 times the size of the mini-batch respectively. We sample on average 6 random crops from every image in the GTA5 dataset. Thus, if we are collecting 1000 crops we look at the next 166 images in the GTA5 dataset, if collecting 6000 crops we look at the next 1000 images, and so forth.
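The bookkeeping behind the streaming batches can be sketched as below, using the numbers of the main setup: the first 5000 images form the stationary labeled set, and every following bundle of 1000 images contributes one streaming batch at about 6 crops per image. Both function names and the assumed total of 11000 images in the usage example are hypothetical.

```python
def images_needed(num_crops, crops_per_image=6):
    """Number of consecutive images to look at for a given number of crops."""
    return num_crops // crops_per_image

def stream_batches(num_images, stationary_images=5000,
                   images_per_batch=1000, crops_per_image=6):
    """Yield (first_image, end_image_exclusive, num_crops) per streaming batch."""
    start = stationary_images
    while start + images_per_batch <= num_images:
        yield start, start + images_per_batch, images_per_batch * crops_per_image
        start += images_per_batch

# Six streaming batches follow the stationary set when 11000 images are available.
batches = list(stream_batches(11000))
```

With 6000 crops per streaming batch and a mini-batch of 190, one pass over a streaming batch corresponds to roughly 31 mini-batches, which is the middle one of the three setups.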

The effect of the streaming batch size on source only accuracy for the batches is plotted in Figure 4.7a. It can be observed that smaller streaming batches, which pass over the data more slowly, follow the accuracy that the "faster" setups show on the same previously seen data. Thus we can conclude that the streaming batch size is not critical in itself and should be chosen according to the speed of concept drift. Having a sliding window over these stream batches could also potentially benefit the sequential modeling of concept drift over time.


Further, we continue with the second stream batch setup, selecting 6000 patches from each next 1000 images in the GTA5 dataset. In Table 4.6 we report the results of adapting a model trained on the stationary source to six streaming batches sequentially. It can be observed that the source only scores vary depending on the distribution shift observed in the current streaming batch: for a classifier trained only on the stationary source, classification accuracy fluctuates considerably over time. The drift does not always damage accuracy, as batches later in the stream are sometimes more similar to the stationary source than batches early in the stream.

At adaptation time for streaming batch S_k, we achieve the maximal adapted accuracy for that batch. However, while adapting to the previous stream batches ..., S_{k-2}, S_{k-1}, the accuracy for stream batch S_k also increases over time. This suggests that when set S_k enters the stream we already get on-demand classification with improved test scores, even before adaptation with S_k itself. At every adaptation step, we manage to recover performance that deteriorated due to concept drift over time.

Table 4.6: Accuracy of adapting over stream batches (SB) in GTA5. Adaptation is done with one streaming batch per round. Columns SB1 to SB6 are the test sets.

Timestep   Adapt Set     SB1     SB2     SB3     SB4     SB5     SB6
t=0        Source only   42.72   40.30   38.58   37.85   42.02   46.78
t=1        SB1           46.72   44.02   40.88   40.88   45.22   50.65
t=2        SB2           -       45.22   41.48   41.52   45.90   50.70
t=3        SB3           -       -       42.02   41.98   45.93   51.27
t=4        SB4           -       -       -       43.00   45.73   51.47
t=5        SB5           -       -       -       -       47.51   51.33
t=6        SB6           -       -       -       -       -       52.73
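The evaluation protocol behind Table 4.6 can be summarized by the loop below: the model is first scored on all stream batches without adaptation, then adapted to one batch per round and scored on the current and all later batches. The `adapt` and `evaluate` callables are hypothetical placeholders for the training and testing routines, which are not shown here.

```python
def adapt_over_stream(model, source, batches, adapt, evaluate):
    """Return table[t][j]: score on batch j after t adaptation rounds.

    Entries for batches that have already passed are None, matching the
    dashes in the table.
    """
    table = [[evaluate(model, b) for b in batches]]   # t = 0, source only
    for k, batch in enumerate(batches):
        # Unsupervised adaptation: the labels of `batch` are never used here.
        model = adapt(model, source, batch)
        row = [None] * k + [evaluate(model, b) for b in batches[k:]]
        table.append(row)
    return table
```

The model is carried over from round to round, which is what allows improvements from earlier batches to show up in the scores of batches that have not yet been adapted to.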

In Figure 4.7b we show the accuracy graph of sequential adaptation for the first four stream batches, to give an impression of the relative improvement between them. In this setting we use a fixed number of iterations to adapt to each batch, specifically 2000 training iterations per batch. We use a batch size of 190 patches for the source as well as for the adaptation target. This is different from the results in Table 4.6, where we train for 3000 iterations for every stream batch.

Further optimization can be achieved if every stream batch adaptation step is trained until convergence, regardless of the number of iterations. However, this requires tuning for each adaptation step. In real scenarios this could be useful, especially if larger stream batches are accumulated.

In addition, we observe that when adapting over a large number of stream batches, the original supervised model might tend to overfit to the source data. Therefore it is necessary to apply regularization accordingly. In our setup we use L2 regularization over the weights of the whole network. For the results reported here we use a simple classifier with 3 convolutional blocks, each with 3 convolutions and one max pooling layer. We experimented with VGG-16 and ResNet-50 architectures as classifiers, but these are more prone to overfitting because of their large number of parameters for a task which seems simple enough for a smaller network.

Figure 4.7: Adapting streaming batches. (a) Observing the effect of concept drift according to stream batch sizes, relative to the training mini-batch. (b) Accuracy increase while adapting stream batches sequentially.

In addition, we plot in Figure 4.8 how our formulation of the associative learning method not only improves on all adapted stream batches, but also performs better than the original formulation of the method. We can observe that a consistent improvement over the source only score is maintained across stream batches, even in the later batches. In addition, our method manages to recover most of the accuracy gap caused by the distribution shift and gets considerably close to the accuracy of joint training on source and target. While we would have access to this score only if all the streaming data were fully labeled, we have shown that we can approximate it by adapting in an entirely unsupervised manner over stream batches.

Figure 4.8: A constant adaptation surplus is achieved by adapting to incoming stream batches.

To conclude, we have shown how our approach can be used in an "initially labeled environment" streaming scenario, where labels are only available in a first stage for training. We adapt to further, entirely unsupervised streaming batches and recover the accuracy that deteriorated due to concept drift. Not only do we improve results after adaptation with the current streaming batch of data, but we also show how adaptations over previous streaming batches cascade their improvements to the following batches, allowing for higher accuracy when on-demand prediction is needed at entry time in the stream.


5 Summary and Conclusions

5.1 Summary

This thesis work focused on a domain adaptation method that is based on associations between embeddings to achieve adaptation between source and target sets. We apply our adaptation method to classification, semantic segmentation and streaming data applications.

In Section 1 the motivation for the topics researched and the contributions of this work were presented. Section 2 introduced the definition, overview and current status of research in domain adaptation and streaming data classification.

In Section 3 the associative domain adaptation method was introduced, together with a distribution independent approach which relaxes the original method's assumption that the class distributions of source and target are similar. Further, in this section we presented considerations on the implementation of our approach, robust to distribution differences, for adapting domains in the context of the semantic segmentation task, where adaptation is performed between synthetic and real domains. In addition, we interpreted static domain adaptation as a one-step concept drift in a streaming setup and built an adaptation framework for a semi-supervised streaming data classification scenario.

Next, in Section 4 extensive experiment results were reported. First, across several adaptation benchmarks for classification, we showed how our formulation of the method is robust to diverging source and target class distributions in a batch. Then we experimented with our setup of domain adaptation for semantic segmentation data, adapting from synthetic to real data of urban street scenes. Further experiments were reported to demonstrate classifier performance improvements within the proposed streaming framework.

5.2 Conclusions

In this work we formulated a robust associative learning method that works well under large class distribution differences between source and target. Theoretically, we relax an important assumption and develop an unsupervised estimation method for describing the structure of the target data. This is important for the practical implications and the applicability of the method to large scale datasets and tasks where this assumption does not hold.

In addition, domain adaptation methods are often heavily tailored to one task and not usually applicable across tasks. We showed that a robust domain adaptation method can be used to achieve competitive results across tasks with very few additional considerations, which also hold across tasks for very deep architectures.

Not only did we develop a method that works across tasks, but it is also one of the very few methods published to date on domain adaptation for semantic segmentation. As the few existing works mostly follow adversarial approaches, we have successfully applied a metric-based domain adaptation approach to the semantic segmentation task, by carefully carrying over the lessons from the steps taken to transform classification networks into segmentation ones.

Another contribution of this work is an adaptation framework for streaming. We interpreted domain adaptation as a one-step shift in the concept drift observed during streaming and generalized dynamic, continuous domain adaptation and streaming into a joint classifier adaptation framework.

In this setup we exploited a small set of labeled data and otherwise only unsupervised data streams in order to improve classifier prediction for further incoming stream batches. We have shown that we can exploit unsupervised data to achieve improvements over several streaming batches without any extra annotated samples.

5.3 Future Work

Although the field of domain adaptation has seen many breakthroughs in the deep learning era, the existing methods are still far from being able to entirely close the gap between source and target results. There is still a lot of work to be done in this direction for better domain adaptation results.

Especially for semantic segmentation, there is a wide range of possibilities to experiment with when it comes to pixel embedding associations. For instance, if a really good superpixel extractor were available, adapting segmentation data on superpixel level embeddings rather than pixel level ones could yield faster associations and probably less noise in the gradients, allowing for larger adaptation surpluses.

Last but not least, having considered a time-shifting distribution setup, we lay the grounds for using this framework for streaming data in the wild. In addition, the results for semantic segmentation indicate that, if combined with the streaming framework, this method could be applied to video segmentation in streaming video data. This is a more complex but very useful task. Being able to exploit unsupervised data to improve video segmentation networks offers considerable advantages given the large availability of unsupervised data.


Bibliography

[1] Charu C Aggarwal. A survey of stream classification algorithms, 2014.

[2] PV Srilakshmi Annapoorna and TT Mirnalinee. Streaming data classification. In Recent Trends in Information Technology (ICRTIT), 2016 International Conference on, pages 1–7. IEEE, 2016.

[3] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

[4] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.

[5] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.

[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.

[7] Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. Marginalized denoisingautoencoders for domain adaptation. arXiv preprint arXiv:1206.4683, 2012.

[8] Yuhua Chen, Wen Li, and Luc Van Gool. ROAD: reality oriented adaptation for se-mantic segmentation of urban scenes. In CVPR, 2018.

[9] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks inunsupervised feature learning. In Proceedings of the fourteenth international conferenceon artificial intelligence and statistics, pages 215–223, 2011.

[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler,Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes datasetfor semantic urban scene understanding. In Proceedings of the IEEE conference oncomputer vision and pattern recognition, pages 3213–3223, 2016.

[11] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey.arXiv preprint arXiv:1702.05374, 2017.

[12] Hal Daume III. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815,2009.

[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: Alarge-scale hierarchical image database. In Computer Vision and Pattern Recognition,2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.


[14] Gregory Ditzler, Manuel Roveri, Cesare Alippi, and Robi Polikar. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine, 10(4):12–25, 2015.

[15] Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun Oh Song, Sergio Guadarrama, and Kevin P Murphy. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277, 2017.

[16] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[17] Stratis Gavves, Thomas Mensink, Tatiana Tommasi, Cees Snoek, and Tinne Tuytelaars. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In Proceedings ICCV 2015, pages 2731–2739, 2015.

[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[19] Philip Haeusser, Thomas Frerix, Alexander Mordvintsev, and Daniel Cremers. Associative domain adaptation. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.

[20] Philip Haeusser, Alexander Mordvintsev, and Daniel Cremers. Learning by association-a versatile semi-supervised training method for neural networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.

[21] Adam W Harley, Konstantinos G Derpanis, and Iasonas Kokkinos. Learning dense convolutional embeddings for semantic segmentation. arXiv preprint arXiv:1511.04377, 2015.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[23] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.

[24] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649, 2016.

[25] Janardan and Shikha Mehta. Concept drift in streaming data classification: Algorithms, platforms and issues. Procedia Computer Science, 122:804–811, 2017. 5th International Conference on Information Technology and Quantitative Management, ITQM 2017.

[26] Jiren Jin, Richard G Calland, Takeru Miyato, Brian K Vogel, and Hideki Nakayama. Parameter reference loss for unsupervised domain adaptation. arXiv preprint arXiv:1711.07170, 2017.

[27] Ivan Kreso, Sinisa Segvic, and Josip Krapac. Ladder-style densenets for semantic segmentation of large natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 238–245, 2017.


[28] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[29] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[30] W. Li, L. Duan, D. Xu, and I. W. Tsang. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1134–1148, June 2014.

[31] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.

[32] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

[33] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417, June 2014.

[34] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.

[35] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.

[36] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

[37] R Manmatha, Chao-Yuan Wu, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2859–2867. IEEE, 2017.

[38] Boris Moiseev, Artem Konev, Alexander Chigorin, and Anton Konushin. Evaluation of traffic sign recognition methods trained on synthetically generated data. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 576–583. Springer, 2013.

[39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.

[40] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2274–2284, 2017.

[41] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.

[42] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, May 2015.


[43] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.

[44] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Unsupervised domain adaptation for semantic segmentation with gans. arXiv preprint arXiv:1711.06969, 2017.

[45] Yuan Shi and Fei Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. arXiv preprint arXiv:1206.6438, 2012.

[46] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.

[47] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[48] Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet. Universality, characteristic kernels and rkhs embedding of measures. Journal of Machine Learning Research, 12(Jul):2389–2410, 2011.

[49] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1453–1460. IEEE, 2011.

[50] Jerzy Stefanowski and Dariusz Brzezinski. Stream classification. In Encyclopedia of Machine Learning and Data Mining, pages 1191–1199. Springer, 2017.

[51] W Nick Street and YongSeog Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 377–382. ACM, 2001.

[52] Baochen Sun, Jiashi Feng, and Kate Saenko. Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications, pages 153–171. Springer, 2017.

[53] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In Gang Hua and Herve Jegou, editors, Computer Vision – ECCV 2016 Workshops, pages 443–450, Cham, 2016. Springer International Publishing.

[54] Chao Tan and Genlin Ji. Semi-supervised incremental feature extraction algorithm for large-scale data stream. Concurrency and Computation: Practice and Experience, 29(6), 2017.

[55] Mark Tennant, Frederic Stahl, Omer Rana, and Joao Bartolo Gomes. Scalable real-time classification of data streams with concept drift. Future Generation Computer Systems, 75:187–199, 2017.

[56] Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications, pages 37–55. Springer, 2017.

[57] Luan Tran, Kihyuk Sohn, Xiang Yu, Xiaoming Liu, and Manmohan Chandraker. Joint pixel and feature-level domain adaptation in the wild. arXiv preprint arXiv:1803.00068, 2018.


[58] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. arXiv preprint arXiv:1802.10349, 2018.

[59] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4068–4076. IEEE, 2015.

[60] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.

[61] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

[62] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. arXiv preprint arXiv:1802.03601, 2018.

[63] Ye Wang, Hu Li, Hua Wang, Bin Zhou, and Yanchun Zhang. Multi-window based ensemble learning for classification of imbalanced streaming data. In International Conference on Web Information Systems Engineering, pages 78–92. Springer, 2015.

[64] Jun Yang, Rong Yan, and Alexander G. Hauptmann. Cross-domain video concept detection using adaptive svms. In Proceedings of the 15th ACM International Conference on Multimedia, MM ’07, pages 188–197, New York, NY, USA, 2007. ACM.

[65] Yibin Ye, Stefano Squartini, and Francesco Piazza. Online sequential extreme learning machine in nonstationary environments. Neurocomputing, 116:94–101, 2013. Advanced Theory and Methodology in Intelligent Computing.

[66] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[67] Junting Zhang, Chen Liang, and C-C Jay Kuo. A fully convolutional tri-branch network (fctn) for domain adaptation. arXiv preprint arXiv:1711.03694, 2017.

[68] Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In The IEEE International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.

[69] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
