

CAFENet: Class-Agnostic Few-Shot Edge Detection Network

Young-Hyun Park1, Jun Seo1, and Jaekyun Moon1

School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea

{dnffkf369,tjwns0630}@kaist.ac.kr, [email protected]

Abstract. We tackle a novel few-shot learning challenge, which we call few-shot semantic edge detection, aiming to localize crisp boundaries of novel categories using only a few labeled samples. We also present a Class-Agnostic Few-shot Edge detection Network (CAFENet) based on a meta-learning strategy. CAFENet employs a small-scale semantic segmentation module to compensate for the lack of semantic information in edge labels. The predicted segmentation mask is used to generate an attention map that highlights the target object region and makes the decoder module concentrate on that region. We also propose a new regularization method based on multi-split matching. In meta-training, the metric-learning problem with high-dimensional vectors is divided into small subproblems with low-dimensional sub-vectors. Since there is no existing dataset for few-shot semantic edge detection, we construct two new datasets, FSE-1000 and SBD-5i, and evaluate the performance of the proposed CAFENet on them. Extensive simulation results confirm the performance merits of the techniques adopted in CAFENet.

Keywords: few-shot edge detection, few-shot learning, semantic edge detection

1 Introduction

Semantic edge detection aims to identify pixels that belong to boundaries of predefined categories. It has been shown that semantic edge detection is useful for a variety of computer vision tasks such as semantic segmentation [3,5,6,8,55], object reconstruction [14,46,59], image generation [25,49] and medical imaging [1,36]. Early edge detection algorithms interpret the problem as a low-level grouping problem exploiting hand-crafted features and local information [7,19,44]. Recently, there have been significant improvements in edge detection thanks to the advances in deep learning [4,21,24,51]. Moreover, beyond class-agnostic boundary detection, category-aware semantic edge detection has become possible [2,23,54,56]. However, it is impossible to train deep neural networks without massive amounts of annotated data.

arXiv:2003.08235v1 [cs.CV] 18 Mar 2020

To overcome the data scarcity issue in image classification, few-shot learning has been actively studied in recent years [16,31,43,47]. Few-shot learning algorithms train machines to learn previously unseen classification tasks using only a few relevant labeled examples. More recently, the idea of few-shot learning has been applied to computer vision tasks requiring highly laborious and expensive data labeling, such as semantic segmentation [12,42,48] and object detection [17,26,27]. Based on meta-learning across varying tasks, the machines can adapt to unencountered environments and demonstrate robust performance in various computer vision problems. In this paper, we consider a novel few-shot learning challenge, few-shot semantic edge detection, which is to detect semantic boundaries using only a few labeled samples. To tackle this elusive challenge, we also propose a class-agnostic few-shot edge detector (CAFENet) and present new datasets for evaluating few-shot semantic edge detection.

Fig. 1: Architecture overview of the proposed CAFENet. The feature extractor (encoder) extracts features from the image, the segmentator generates a segmentation mask based on metric learning, and the edge detector detects semantic boundaries using the segmentation mask and query features.

Fig. 1 shows the architecture of the proposed CAFENet. Since the edge labels do not contain enough semantic information due to the sparsity of labels, the performance of the edge detector severely degrades when the training dataset is very small. To overcome this, we perform segmentation before edge detection, using down-sized features and segmentation labels generated from the boundary labels. We utilize a simple metric-based segmentator that generates a segmentation mask through pixel-wise non-parametric feature matching with class prototypes, which are computed by the masked average pooling of [58]. The predicted segmentation mask provides semantic information to the edge detector. Multi-scale attention maps are generated from the segmentation mask and applied to the corresponding multi-scale features. The edge detector predicts the semantic boundaries using the attended features. Using this attention mechanism, the edge detector can focus on relevant regions while alleviating the noise effect of external details.


For meta-training of CAFENet, we introduce a simple yet powerful regularization method, Multi-Split Matching Regularization (MSMR), which performs metric learning on multiple low-dimensional embedding sub-spaces. During meta-training, the model is meta-learned to minimize distances between the pixels of query features and the prototype of the same class for segmentation. At the same time, the prototypes and the pixels of feature vectors are divided into multiple low-dimensional splits, and the model also learns to minimize distances between the pixels of query feature splits and their corresponding prototype splits of the same class. The proposed MSMR method achieves significant performance gains without additional learnable parameters.

To sum up, the main contributions of this paper are as follows:

• We introduce the few-shot semantic edge detection problem for performing semantic edge detection on previously unseen objects using only a few training examples.

• We propose to generate an attention map from a segmentation prediction mask and to attend convolutional features so as to localize semantically important regions prior to edge detection.

• We introduce a novel MSMR regularization method that divides high-dimensional vectors into low-dimensional sub-vectors and solves the resulting metric-learning sub-problems for few-shot semantic edge detection.

• We introduce two datasets for few-shot semantic edge detection and validate the proposed CAFENet techniques on them.

2 Related Work

2.1 Few-shot Learning

To tackle the few-shot learning challenge, many methods have been proposed based on meta-learning. Optimization-based methods [16,40,41] train a meta-learner which updates the parameters of the actual learner so that the learner can easily adapt to a new task with only a few labeled samples. Metric-based methods [15,29,43,45,47,52] train the feature extractor to gather features from the same class together in the embedding space while keeping features from different classes far apart. Recent metric-based approaches propose dense classification [22,30,31,38]. Dense classification trains an instance-wise classifier with a pixel-wise classification loss, which imposes coherent predictions over the spatial dimension and prevents overfitting as a result. Our model adopts the metric-based approach to few-shot learning. Inspired by dense classification, we propose multi-split matching regularization, which divides the feature vector into sub-vector splits and performs split-wise classification for regularization in meta-learning.

2.2 Few-shot Semantic Segmentation

The goal of few-shot segmentation is to perform semantic segmentation with only a few labeled samples based on meta-learning [12,39,42,48,57]. OSLSM of [42] adopts a two-branch structure: a conditioning branch generating element-wise scale and shift factors using the support set, and a segmentation branch performing segmentation with a fully convolutional network and task-conditioned features. Co-FCN [39] also utilizes a two-branch structure. The globally pooled prediction is generated using the support set in the conditioning branch, and fused with query features to predict the segmentation mask in the segmentation branch. SG-One of [58] proposes masked average pooling to compute prototypes from the pixels of support features. The cosine similarity scores are computed between the prototypes and the pixels of the query feature, and the similarity map guides the segmentation process. CANet of [57] also adopts masked average pooling to generate the global feature vector, and concatenates it with every location of the query feature for dense comparison in predicting the segmentation mask. PANet of [48] introduces prototype alignment for regularization, which predicts the segmentation mask of support samples using the query prediction results as labels of the query samples.

2.3 Semantic Edge Detection

Semantic edge detection aims to find the boundaries of objects in an image and classify the objects at the same time. The history of semantic edge detection [2,20,23,33,37,54,56] dates back to the work of [37], which adopts a support vector machine as a semantic classifier on top of the traditional Canny edge detector. Recently, many semantic edge detection algorithms rely on deep neural networks and multi-scale feature fusion. CASENet of [54] addresses semantic edge detection as a multi-label problem where each boundary pixel is labeled with the categories of adjacent objects. Dynamic Feature Fusion (DFF) of [23] proposes a novel way to leverage multi-scale features. The multi-scale features are fused by weighted summation with fusion weights generated dynamically for each image and each pixel. Meanwhile, Simultaneous Edge Alignment and Learning (SEAL) of [56] deals with the severe annotation noise of the existing edge dataset [20]. SEAL treats edge labels as latent variables and jointly trains them to align noisy, misaligned boundary annotations. Semantically Thinned Edge Alignment Learning (STEAL) of [2] improves the computational efficiency of edge label alignment through a lightweight level set formulation. In addition, STEAL optimizes the model for non-maximum suppression (NMS) during training, while previous works use NMS at the postprocessing step.

3 Problem Setup

For few-shot semantic edge detection, we use a training set $D_{train}$ and a test set $D_{test}$ consisting of non-overlapping categories $C_{train}$ and $C_{test}$. The model is trained using only $C_{train}$, and the test categories $C_{test}$ are never seen during the training phase. For meta-training of the model, we adopt episodic training as done in many previous few-shot learning works. Each episode is composed of a support set with a few labeled samples and a query set. When an episode is given, the model adapts to the given episode using the support set and detects the semantic boundaries of the query set. Through episodic training, the model learns to adapt to an unseen class using only a few labeled samples and to predict the semantic edges of query samples.

Fig. 2: Network architecture overview of the proposed CAFENet. The ResNet-34 encoder $E^{(1)} \sim E^{(4)}$ extracts multi-level semantic features. The segmentator module generates a segmentation prediction using the query feature from $E^{(4)}$ and the prototypes $P_{FG}$, $P_{BG}$ computed from the support set features. Small bottleneck blocks $S^{(0)} \sim S^{(4)}$ transform the original image and the multi-scale features from the encoder blocks to be more suitable for edge detection. The attention maps generated from the segmentation prediction are applied to the multi-scale features to localize the semantically related region. The decoder $D^{(0)} \sim D^{(4)}$ takes the attentive multi-scale features to produce the edge prediction. (In the diagram, MAP denotes masked average pooling, and the feature maps range from 320 × 320 down to 20 × 20.)

For the $N_c$-way $N_s$-shot setting, each training episode is constructed from $N_c$ classes sampled from $C_{train}$. When $N_c$ categories are given, $N_s$ support samples and $N_q$ query samples are randomly chosen from $D_{train}$ for each class. In evaluation, the performance of the model is evaluated using test episodes. The test episodes are constructed in the same way as the training episodes, except that the $N_c$ classes and the corresponding support and query samples are sampled from $C_{test}$ and $D_{test}$.

In this work, we address $N_c$-way $N_s$-shot semantic edge detection. The goal is to train the model to generalize to $N_c$ unseen classes given only $N_s$ images per class and their edge labels. Based on the few labeled support samples, the model should produce edge predictions for query images which belong to the $N_c$ unencountered classes.
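The episode construction described above can be sketched as follows. This is only an illustrative sketch: the function name `sample_episode`, the dictionary layout of the dataset, and the choice to keep support and query samples disjoint are assumptions, not details given in the paper.

```python
import random


def sample_episode(dataset, classes, n_way, n_shot, n_query, rng=random):
    """Sample one N_c-way N_s-shot episode.

    dataset: dict mapping a class name to a list of sample ids (hypothetical layout).
    classes: iterable of class names to sample from (C_train or C_test).
    """
    episode = {}
    for c in rng.sample(sorted(classes), n_way):
        # Draw support and query samples together so that they do not overlap
        # (disjointness is an assumption made for this sketch).
        picks = rng.sample(dataset[c], n_shot + n_query)
        episode[c] = {"support": picks[:n_shot], "query": picks[n_shot:]}
    return episode


# Usage: a 1-way 5-shot episode with 2 query samples per class.
toy_dataset = {name: list(range(10)) for name in ["cat", "dog", "horse"]}
episode = sample_episode(toy_dataset, toy_dataset.keys(), n_way=1, n_shot=5,
                         n_query=2, rng=random.Random(0))
```

Test episodes would be drawn by the same function, passing the test classes and test samples instead.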

4 Method

We propose a novel algorithm for few-shot semantic edge detection. Fig. 2 illustrates the network architecture of the proposed method. The proposed CAFENet adopts a semantic segmentation module to compensate for the lack of semantic information in edge labels. The predicted segmentation mask is utilized for attention in the skip connections. The final edge detection is done using attentive multi-scale features.

4.1 Semantic Segmentator

Most previous works on semantic edge detection directly predict edges from the given input image. However, direct edge prediction is a hard task when only a few labeled samples are given. To overcome this difficulty in few-shot edge detection, we adopt a semantic segmentation module ahead of edge prediction. With the assistance of the segmentation module, CAFENet can effectively localize the target object and extract semantic features from query samples. For few-shot segmentation, we employ metric learning with prototypes for foreground and background, as done in [12,48]. Given the support set $S = \{x_i^s, y_i^s\}_{i=1}^{N_s}$, the encoder $E$ extracts features $\{E(x_i^s)\}_{i=1}^{N_s}$ from $S$. Also, for the support labels $\{y_i^s\}_{i=1}^{N_s}$, we generate the dense segmentation masks $\{M_i^s\}_{i=1}^{N_s}$ using a rule-based preprocessor, considering the pixels inside the boundary as foreground pixels in the segmentation label. Using the down-sampled segmentation labels $\{m_i^s\}_{i=1}^{N_s}$, the prototype for foreground pixels $P_{FG}$ is computed as

$$P_{FG} = \frac{1}{N_s} \frac{1}{H \times W} \sum_{i} \sum_{j} E_j(x_i^s)\, m_{i,j}^s \tag{1}$$

where $j$ indexes the pixel location, $E_j(x)$ and $m_{i,j}^s$ denote the $j$th pixel of the feature $E(x)$ and of the segmentation mask $m_i^s$, and $H$, $W$ denote the height and width of the images. Likewise, the background prototype $P_{BG}$ is computed as

$$P_{BG} = \frac{1}{N_s} \frac{1}{H \times W} \sum_{i} \sum_{j} E_j(x_i^s)\, (1 - m_{i,j}^s). \tag{2}$$
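As a minimal sketch of Eqs. (1) and (2), the prototypes can be computed with plain Python lists. The function name `prototypes` and the flat list-of-pixels layout are assumptions made for illustration; an actual implementation would operate on feature tensors.

```python
def prototypes(support_feats, support_masks):
    """Foreground/background prototypes following Eqs. (1)-(2).

    support_feats: list over N_s support images; each image is a list of
                   H*W pixel feature vectors (lists of C floats).
    support_masks: list over N_s images; each is a list of H*W values in [0, 1]
                   (the down-sampled segmentation labels m^s).
    """
    n_s = len(support_feats)
    n_pix = len(support_feats[0])      # H * W
    c = len(support_feats[0][0])       # channel dimension C
    p_fg = [0.0] * c
    p_bg = [0.0] * c
    for feats, mask in zip(support_feats, support_masks):
        for e_j, m_j in zip(feats, mask):
            for ch in range(c):
                p_fg[ch] += e_j[ch] * m_j            # E_j(x) * m_j
                p_bg[ch] += e_j[ch] * (1.0 - m_j)    # E_j(x) * (1 - m_j)
    scale = 1.0 / (n_s * n_pix)        # the 1/N_s * 1/(H*W) factor
    return [v * scale for v in p_fg], [v * scale for v in p_bg]


# Toy example: one support image with 4 pixels and 2 channels.
feats = [[[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]]
masks = [[1.0, 1.0, 0.0, 0.0]]
p_fg, p_bg = prototypes(feats, masks)
```

Note that, as written in Eqs. (1) and (2), the sums are normalized by the total number of pixels rather than by the number of foreground or background pixels.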

Following the prototypical networks of [43], the probability that pixel $j$ belongs to the foreground for the query sample $x_i^q$ is

$$p(y_{i,j}^q = FG \mid x_i^q; E) = \frac{\exp(-\tau d(E_j(x_i^q), P_{FG}))}{\exp(-\tau d(E_j(x_i^q), P_{FG})) + \exp(-\tau d(E_j(x_i^q), P_{BG}))} \tag{3}$$

where $d(\cdot, \cdot)$ is the squared Euclidean distance between two vectors and $\tau$ is a learnable temperature parameter as used in [18,38]. With query samples $\{x_i^q\}_{i=1}^{N_q}$ and the down-sampled segmentation labels for the query $\{m_i^q\}_{i=1}^{N_q}$, the segmentation loss $L_{Seg}$ is calculated as the mean-squared error (MSE) loss between the predicted probabilities and the down-sized segmentation mask:

$$L_{Seg} = \frac{1}{N_q} \frac{1}{H \times W} \sum_{i=1}^{N_q} \sum_{j=1}^{H \times W} \left\{ \left( p(y_{i,j}^q = FG \mid x_i^q; E) - m_{i,j}^q \right)^2 \right\}. \tag{4}$$
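The foreground probability of Eq. (3) and the MSE loss of Eq. (4) can be sketched as below. The function names and the fixed default `tau=1.0` are assumptions for illustration; in the paper $\tau$ is learned. The two-class softmax is rewritten as a sigmoid of the distance gap, which is algebraically the same expression.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance d(u, v)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def fg_probability(e_j, p_fg, p_bg, tau=1.0):
    """Eq. (3): exp(-tau*d_fg) / (exp(-tau*d_fg) + exp(-tau*d_bg))."""
    gap = tau * (sq_dist(e_j, p_fg) - sq_dist(e_j, p_bg))
    # Numerically stable sigmoid(-gap); both branches give the same value.
    if gap >= 0:
        z = math.exp(-gap)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(gap))


def seg_loss(query_feats, query_masks, p_fg, p_bg, tau=1.0):
    """Eq. (4): mean squared error over all query pixels."""
    total, count = 0.0, 0
    for feats, mask in zip(query_feats, query_masks):
        for e_j, m_j in zip(feats, mask):
            total += (fg_probability(e_j, p_fg, p_bg, tau) - m_j) ** 2
            count += 1
    return total / count   # count equals N_q * H * W


# A pixel equidistant from both prototypes gets probability 0.5.
p_mid = fg_probability([0.5], [0.0], [1.0])
loss = seg_loss([[[0.5]]], [[1.0]], [0.0], [1.0])
```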

Fig. 3: Comparison between (a) high-dimensional feature matching used in [12,48] and (b) split-wise feature matching in MSMR.

Note that the segmentation mask is generated at a down-sized scale, so that any pixel near the boundaries can belong partly to the foreground and partly to the background. Therefore, we approach the problem as a regression using the MSE loss rather than the cross-entropy loss.

4.2 Multi-Split Matching Regularization

The metric-based few-shot segmentation method utilizes distance metrics between the high-dimensional feature vectors and prototypes, as seen in Fig. 3a. However, this approach is prone to overfitting due to the massive number of parameters in the feature vectors. To get around this issue, we propose a novel regularization method, multi-split matching regularization (MSMR). MSMR inherits the spirit of dense feature matching [30], where pixel-wise feature matching acts as a regularizer for high-dimensional embeddings. In MSMR, high-dimensional feature vectors are split into several low-dimensional feature vectors, and metric learning is conducted on each vector split, as in Fig. 3b.

With the query feature $E(x_i^q) \in \mathbb{R}^{C \times W \times H}$, where $C$ is the channel dimension and $H$, $W$ are the spatial dimensions, we divide $E(x_i^q)$ into $K$ sub-vectors $\{E^k(x_i^q)\}_{k=1}^{K}$ along the channel dimension. Each sub-vector $E^k(x_i^q)$ is in $\mathbb{R}^{\frac{C}{K} \times W \times H}$. Likewise, the prototypes $P_{FG}$ and $P_{BG}$ are also disassembled into $K$ sub-vectors $\{P_{FG}^k\}_{k=1}^{K}$ and $\{P_{BG}^k\}_{k=1}^{K}$ along the channel dimension, where $P_{FG}^k, P_{BG}^k \in \mathbb{R}^{\frac{C}{K}}$. For the $k$th sub-vector of the query feature $E^k(x_i^q)$, the probability that the $j$th pixel belongs to the foreground class is computed as follows:

$$p^k(y_{i,j}^q = FG \mid x_i^q; E) = \frac{\exp(-\tau d(E_j^k(x_i^q), P_{FG}^k))}{\exp(-\tau d(E_j^k(x_i^q), P_{FG}^k)) + \exp(-\tau d(E_j^k(x_i^q), P_{BG}^k))}. \tag{5}$$

Multi-split matching regularization divides the original metric learning problem into $K$ small sub-problems with fewer parameters each, and acts as a regularizer for high-dimensional embeddings. The prediction results of the $K$ sub-problems are reflected in learning by adding the split-wise segmentation losses to the original segmentation loss in Eq. (4). The total segmentation loss is calculated as

$$L_{Seg} = \frac{1}{N_q} \frac{1}{H \times W} \sum_{i=1}^{N_q} \sum_{j=1}^{H \times W} \left\{ (p_{i,j} - m_{i,j}^q)^2 + \sum_{k=1}^{K} (p_{i,j}^k - m_{i,j}^q)^2 \right\} \tag{6}$$

where $p_{i,j} = p(y_{i,j}^q = FG \mid x_i^q; E)$ and $p_{i,j}^k = p^k(y_{i,j}^q = FG \mid x_i^q; E)$.
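The split-wise matching of Eqs. (5) and (6) can be sketched for a single pixel as follows. The helper names are hypothetical, and splitting the channels into contiguous equal-sized chunks is an assumption; the text only states that the vectors are divided along the channel dimension.

```python
import math


def _fg_prob(e, p_fg, p_bg, tau):
    """Two-class softmax over negative scaled squared distances (Eqs. 3 and 5)."""
    gap = tau * (sum((a - b) ** 2 for a, b in zip(e, p_fg))
                 - sum((a - b) ** 2 for a, b in zip(e, p_bg)))
    if gap >= 0:
        z = math.exp(-gap)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(gap))


def split(vec, k):
    """Divide a C-dimensional vector into k contiguous sub-vectors of size C/k."""
    size = len(vec) // k
    assert size * k == len(vec), "C must be divisible by K"
    return [vec[i * size:(i + 1) * size] for i in range(k)]


def msmr_pixel_loss(e_j, m_j, p_fg, p_bg, k, tau=1.0):
    """Per-pixel term of Eq. (6): full-vector error plus K split-wise errors."""
    loss = (_fg_prob(e_j, p_fg, p_bg, tau) - m_j) ** 2
    for e_k, fg_k, bg_k in zip(split(e_j, k), split(p_fg, k), split(p_bg, k)):
        loss += (_fg_prob(e_k, fg_k, bg_k, tau) - m_j) ** 2
    return loss


# A 4-channel pixel halfway between the prototypes, split into K = 2 parts.
pixel_loss = msmr_pixel_loss([0.5] * 4, 1.0, [0.0] * 4, [1.0] * 4, k=2)
```

Averaging this per-pixel term over all query pixels gives the total loss of Eq. (6). No new parameters are introduced, since the splits reuse the same features and prototypes.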

4.3 Attentive Edge Detector

As shown in Fig. 2, we adopt the nested encoder structure of [34,51] to extract rich hierarchical features. The multi-scale side outputs from the encoder $E^{(1)} \sim E^{(4)}$ are post-processed through the bottleneck blocks $S^{(1)} \sim S^{(4)}$. Since ResNet-34 gives side outputs at a down-sized scale, we pass the original image through the bottleneck block $S^{(0)}$ to extract local details at the original scale. In front of $S^{(3)}$, we employ the Atrous Spatial Pyramid Pooling (ASPP) block of [9]. We have empirically found that placing ASPP there gives better performance.

In utilizing multi-scale features, we employ the predicted segmentation mask $M$ from the segmentator, where the $j$th pixel of $M$ is the predicted probability from Eq. (3). Note that we generate $M$ based on the entire feature vectors and prototypes instead of the sub-vectors, since the split-wise metric learning is used only for regularizing the segmentation module. For each layer $l$, $M^{(l)}$ denotes the segmentation mask upscaled to the corresponding feature size by bilinear interpolation. Using the segmentation prediction mask $M^{(l)}$, we generate the attention map $A^{(l)}$ as follows. First, predictions with a value lower than the threshold $\lambda$ are set to zero, to ignore activations in regions with low confidence. Second, we broaden the attention map using the morphological dilation of [13] as a second chance, since the segmentation module may not always guarantee fine results. The final attention map of the $l$th layer $A^{(l)}$ is computed as

$$A^{(l)} = \mathbb{1}(M^{(l)} > \lambda)\, M^{(l)} + \mathrm{Dilation}\big(\mathbb{1}(M^{(l)} > \lambda)\, M^{(l)}\big) \tag{7}$$

where $\mathbb{1}(M^{(l)} > \lambda)\, M^{(l)}$ is the thresholded prediction mask. The attention maps are applied to the multi-scale features of the corresponding bottleneck blocks $S^{(0)} \sim S^{(4)}$. We apply the residual attention of [22], where the initial multi-level side outputs from $S^{(l)}$ are weighted pixel-wise by $1 + A^{(l)}$, to strengthen the activation values of the semantically important region. We visualize the effect of semantic attention in Fig. 4.
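A small sketch of Eq. (7) and the residual weighting by $1 + A^{(l)}$ on a 2D mask is given below. The square structuring element with `radius=1` (a 3 × 3 window) is an assumption, since the dilation kernel is not specified here, and the function names are hypothetical.

```python
def threshold(mask, lam):
    """1(M > lam) * M: zero out low-confidence predictions."""
    return [[v if v > lam else 0.0 for v in row] for row in mask]


def dilate(mask, radius=1):
    """Grayscale dilation: each pixel takes the max over its square neighbourhood."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                mask[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
    return out


def attention_map(mask, lam, radius=1):
    """Eq. (7): thresholded mask plus its dilation."""
    t = threshold(mask, lam)
    d = dilate(t, radius)
    return [[a + b for a, b in zip(rt, rd)] for rt, rd in zip(t, d)]


def apply_residual_attention(feature, attn):
    """Weight one feature channel pixel-wise by (1 + A)."""
    return [[f * (1.0 + a) for f, a in zip(rf, ra)]
            for rf, ra in zip(feature, attn)]


# Only the confident centre pixel survives thresholding, then spreads by dilation.
M = [[0.25, 0.25, 0.25],
     [0.25, 0.75, 0.25],
     [0.25, 0.25, 0.25]]
A = attention_map(M, lam=0.5)
attended = apply_residual_attention([[1.0] * 3 for _ in range(3)], A)
```

Pixels outside the dilated region keep their original activation because their weight is exactly 1.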

As shown in Fig. 2, the decoder network is composed of five consecutive convolutional blocks. Each decoder block $D^{(l)}$ contains three 3 × 3 convolution layers. The outputs of the decoder blocks $D^{(1)} \sim D^{(4)}$ are bilinearly upsampled by a factor of two and passed to the next block. Similar to [13], the up-sampled decoder outputs are then concatenated with the skip connection features from the bottleneck blocks $S^{(0)} \sim S^{(4)}$ and the previous decoder blocks. Multi-scale semantic information and local details are transmitted through the skip architecture. The hierarchical decoder network in turn refines the outputs of the previous decoder blocks and finally produces the edge prediction $\hat{y}_i^q$ of the query sample $x_i^q$.

Fig. 4: An example of activation maps of [53] before and after pixel-wise semantic attention (warmer colors indicate higher values). As seen, the attention mechanism makes the encoder side-outputs attend to the regions of the target object (a horse in the figure).

Following the work of [11], we combine the cross-entropy loss and the Dice loss to produce crisp boundaries. Given a query set $Q = \{x_i^q, y_i^q\}_{i=1}^{N_q}$ and the prediction masks $\hat{y}_i^q$, the cross-entropy loss is computed as

$$L_{CE} = -\sum_{i=1}^{N_q} \Big\{ \sum_{j \in Y_+} \log(\hat{y}_{i,j}^q) + \sum_{j \in Y_-} \log(1 - \hat{y}_{i,j}^q) \Big\} \tag{8}$$

where $Y_+$ and $Y_-$ denote the sets of foreground and background pixels. The Dice loss is then computed as

$$L_{Dice} = \sum_{i=1}^{N_q} \Big\{ \frac{\sum_j (\hat{y}_{i,j}^q)^2 + \sum_j (y_{i,j}^q)^2}{2 \sum_j \hat{y}_{i,j}^q\, y_{i,j}^q} \Big\} \tag{9}$$

where $j$ denotes the pixels of a label. The final loss for meta-training is given by

$$L_{final} = L_{Seg} + L_{CE} + L_{Dice}. \tag{10}$$
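The edge losses of Eqs. (8) and (9) can be sketched for a single query image as follows. The clamping constant `eps` and the function names are assumptions added for numerical safety in this sketch; they are not part of the formulas above.

```python
import math


def cross_entropy(pred, label, eps=1e-7):
    """Eq. (8) for one query image; pred and label are flat lists of pixels."""
    loss = 0.0
    for p, y in zip(pred, label):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0) (assumption)
        loss -= math.log(p) if y > 0.5 else math.log(1.0 - p)
    return loss


def dice(pred, label, eps=1e-7):
    """Eq. (9) for one query image: (sum p^2 + sum y^2) / (2 * sum p*y)."""
    num = sum(p * p for p in pred) + sum(y * y for y in label)
    den = 2.0 * sum(p * y for p, y in zip(pred, label))
    return num / (den + eps)              # eps guards against an empty overlap


def edge_loss(preds, labels):
    """L_CE + L_Dice summed over the query images; L_Seg is added separately."""
    return sum(cross_entropy(p, y) + dice(p, y) for p, y in zip(preds, labels))


# A perfect prediction: cross-entropy is near 0 and the Dice term is near 1.
label = [1.0, 0.0, 1.0, 0.0]
ce_perfect = cross_entropy(label, label)
dice_perfect = dice(label, label)
```

With this form of the Dice term, the value for a perfect prediction is 1 rather than 0, and it grows as the overlap between prediction and label shrinks.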

5 Experiments

5.1 Datasets

FSE-1000. The datasets used in previous semantic edge detection research, such as SBD of [20] and Cityscapes of [10], are not suitable for few-shot learning as they have only 20 and 30 classes, respectively. We propose a new dataset for few-shot edge detection, which we call FSE-1000, based on FSS-1000 of [50]. FSS-1000 is a dataset for few-shot segmentation composed of 1000 classes with 10 images per class and foreground-background segmentation annotations. From the images and segmentation masks of FSS-1000, we build FSE-1000 by extracting boundary labels from the segmentation masks. In light of the difficulty associated with the few-shot setting, we extract thick edges whose thickness is around 2 to 3 pixels on average. For the dataset split, we divide the 1000 classes into 800 training classes and 200 test classes. We will provide the detailed class configuration in the Supplementary Material.

Fig. 5: Qualitative examples of 5-shot edge detection on the FSE-1000 dataset (each example shows the image, the ground truth and the prediction).
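One possible way to derive thick boundary labels from a binary segmentation mask is sketched below. The exact extraction procedure is not described here, so this is purely an assumed approach: a foreground pixel is marked as an edge when a background pixel (or the image border) lies within `thickness` pixels of it. The function name and parameter are hypothetical.

```python
def boundary_from_mask(mask, thickness=2):
    """Mark foreground pixels lying within `thickness` pixels of the background.

    mask: 2D list of 0/1 values. Pixels outside the image are treated as
    background, which is an assumption of this sketch.
    """
    h, w = len(mask), len(mask[0])
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for yy in range(y - thickness, y + thickness + 1):
                for xx in range(x - thickness, x + thickness + 1):
                    outside = yy < 0 or yy >= h or xx < 0 or xx >= w
                    if outside or not mask[yy][xx]:
                        edge[y][x] = 1
    return edge


# A 7 x 7 all-foreground mask yields a 2-pixel-wide band along the border.
edges = boundary_from_mask([[1] * 7 for _ in range(7)], thickness=2)
```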

SBD-5i. Based on the SBD dataset of [20] for semantic edge detection, we propose a new SBD-5i dataset. With reference to the setting of Pascal-5i, the 20 classes of the SBD dataset are divided into 4 splits. In the experiment with split i, the 5 classes in the ith split are used as the test classes $C_{test}$. The remaining 15 classes are utilized as the training classes $C_{train}$. The training set $D_{train}$ is constructed from all image-annotation pairs whose annotation includes at least one pixel from the classes in $C_{train}$. For each class, the boundary pixels which do not belong to that class are considered as background. The test set $D_{test}$ is constructed in the same way as $D_{train}$, this time using $C_{test}$. Considering the difficulty of the few-shot setting and the severe annotation noise of the SBD dataset, we extract thicker edges as done in FSE-1000. We utilize edges extracted from the segmentation mask as ground truth instead of the original boundary labels of the SBD dataset, and the thickness of the extracted edges lies between 3 and 4 pixels on average. We conduct 4 experiments, one for each split i = 0 to 3, and report the performance of each split as well as the averaged performance. Note that unlike Pascal-5i, we do not consider the division into training and test samples of the original SBD dataset. As a result, the images in $D_{train}$ might appear in $D_{test}$ with a different annotation from a class in $C_{test}$.
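The class division can be illustrated with a short sketch. It assumes that each split holds five consecutive class indices, which mirrors the usual Pascal-5i convention but is not stated explicitly here; the function name is hypothetical and the indices are 0-based.

```python
def sbd5i_split(i, num_classes=20, num_splits=4):
    """Return (train_classes, test_classes) for split i.

    Assumes contiguous blocks of class indices per split (an assumption).
    """
    per_split = num_classes // num_splits
    test = list(range(i * per_split, (i + 1) * per_split))
    train = [c for c in range(num_classes) if c not in test]
    return train, test


# Split i = 1: 5 test classes and the remaining 15 training classes.
train_classes, test_classes = sbd5i_split(1)
```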

Fig. 6: Qualitative examples of 5-shot edge detection on the SBD-5i dataset (each example shows the image, the ground truth and the prediction).

5.2 Evaluation Settings

We use two evaluation metrics to measure the few-shot semantic edge detection performance of our approach: the Average Precision (AP) and the maximum F-measure (MF) at optimal dataset scale (ODS).

In evaluation, we compare the unthinned raw prediction results with the ground truths without Non-Maximum Suppression (NMS), following [2,56]. For the evaluation of edge detection, an important parameter is the matching distance tolerance, which is an error threshold between the prediction result and the ground truth. Prior works on edge detection such as [2,20,54,56] adopt a non-zero distance tolerance to resolve the annotation noise of edge detection datasets. However, the proposed datasets for few-shot edge detection utilize thicker boundaries to overcome the annotation noise issue instead of adopting a distance tolerance. Moreover, evaluation with a non-zero distance tolerance requires additional heavy computation. This becomes more problematic under the few-shot setting, where the performance should be measured on the same test image multiple times due to the variation in the support set. For these reasons, we set the distance tolerance to 0 for both FSE-1000 and SBD-5i. In addition, we count the positive predictions in the area inside an object and in the zero-padded region as false positives, which is stricter than the evaluation protocol in the prior works of [20,54].
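With zero distance tolerance, the F-measure reduces to a direct pixel-wise comparison. The sketch below shows one way to compute the maximum F-measure at ODS, using a single threshold shared by the whole dataset; the function names and the threshold grid are assumptions, and this is not the authors' evaluation code.

```python
def f_measure_at(preds, labels, thr):
    """Pixel-wise F-measure over the whole dataset at one threshold."""
    tp = fp = fn = 0
    for pred, label in zip(preds, labels):
        for p, y in zip(pred, label):
            hit = p >= thr
            if hit and y:
                tp += 1
            elif hit:
                fp += 1      # includes predictions inside objects or padding
            elif y:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def max_f_ods(preds, labels, thresholds):
    """Maximum F-measure at the optimal dataset scale (one global threshold)."""
    return max(f_measure_at(preds, labels, t) for t in thresholds)


# Two tiny flattened images; the same threshold is applied to both.
preds = [[0.9, 0.6, 0.2, 0.1], [0.8, 0.3, 0.7, 0.2]]
labels = [[1, 1, 0, 0], [1, 0, 1, 0]]
mf = max_f_ods(preds, labels, [t / 10 for t in range(1, 10)])
```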

5.3 Implementation Detail

We implement our framework using the PyTorch library and adopt the scikit-learn library to construct the precision-recall curve and compute the average precision (AP). For the encoder, ResNet-34 pretrained on ImageNet is adopted. All parameters except the encoder parameters are learned from scratch. The entire network is trained using the Adam optimizer of [28] with the weight decay regularization of [35]. In the experiments on both FSE-1000 and SBD-5i, we use a learning rate of $10^{-4}$ and an $l_2$ weight decay rate of $10^{-2}$. For the FSE-1000 experiments, the model is trained with 40,000 episodes and the learning rate is decayed by 0.1 after training for 38,000 episodes. For the SBD-5i experiments, 30,000 episodes are used for training, and the learning rate is decayed by 0.1 after training for 28,000 episodes. The higher-shot training of [32] is employed in the 1-shot experiments for both datasets. In every experiment of our paper, a single NVIDIA GeForce GTX 1080 Ti GPU is used for computation.

Data preprocessing. During training, we adopt data augmentation with random rotation by multiples of 90 degrees for both FSE-1000 and SBD-5i. We additionally resize the SBD-5i data to 320 × 320, while no such resizing is performed on FSE-1000. During evaluation, the images of SBD-5i are zero-padded to 512 × 512. Again, the original image size is used for FSE-1000.
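The rotation augmentation can be sketched on plain 2D lists as below. The function names are hypothetical, and applying the same rotation to the image and its edge label is an assumption that seems necessary to keep the pair aligned.

```python
import random


def rot90(img, k):
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        img = [list(row) for row in zip(*img)][::-1]
    return img


def augment(image, edge_label, rng=random):
    """Apply the same random multiple-of-90-degree rotation to both inputs."""
    k = rng.randrange(4)
    return rot90(image, k), rot90(edge_label, k)


# Usage on a toy 2 x 2 image and its label.
toy = [[1, 2], [3, 4]]
rotated_image, rotated_label = augment(toy, toy, rng=random.Random(0))
```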

5.4 Experiment Result

Table 1 shows the experiment results on the FSE-1000 dataset. To examine the impact of the proposed MSMR and attentive decoder, we show the results of ablation experiments together. The baseline method conducts edge prediction in low resolution and utilizes the loss from edge prediction for meta-training. The edge prediction is done using a metric-based method utilizing prototypes which are computed using down-sampled edge labels. The method dubbed Seg utilizes a segmentation module without MSMR or attentive decoding. Seg directly matches high-dimensional query feature vectors with prototypes in both training and evaluation. In Seg, the segmentation module is utilized only to provide the segmentation loss, which helps the model learn to extract semantic features. Seg + Att employs the predicted segmentation mask for the additional attention process in the skip architecture. Seg + Att + MSMR additionally utilizes the MSMR regularization for training. For fair comparison, all methods use the same network architecture and training hyperparameters. For the SBD-5i dataset, the ablation experiments are done with the same model variations as for FSE-1000. The results on SBD-5i are shown in Table 2.

Tables 1 and 2 demonstrate that the use of the segmentation module in Seg gives significant performance advantages over baseline for both the FSE-1000 and SBD-5i datasets. It is also seen that the additional use of attentive decoding, Seg + Att, generally improves the performance over Seg. Finally, adding MSMR regularization gives substantial extra gains, as seen by the scores associated with Seg + Att + MSMR. Clearly, when compared with baseline, our overall approach Seg + Att + MSMR provides large gains.

5.5 Experiments on Multi-Split Matching Regularization

Feature matching method for segmentation In Table 3, we compare various feature matching methods between prototypes and query feature vectors for producing the segmentation prediction on SBD-5i. The method baseline refers to the original method generating the segmentation prediction using only the similarity metric between high-dimensional vectors, as done in Eq. (3). For the method average, segmentation predictions from low-dimensional feature splits (Eq. (5)) and original high-dimensional feature vectors (Eq. (3)) are averaged to generate the final prediction mask. The average method can be understood as a method utilizing MSMR not only for regularization, but also for inference. In the weighted sum method, the above five segmentation masks are combined using a weighted sum with learnable weights. As Table 3 shows, MSMR performs best when it is employed only for regularization.

Table 1: 1-way 1-shot and 1-way 5-shot results of the proposed CAFENet on FSE-1000. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are reported in %.

| Metric | Method | 1-way 1-shot | 1-way 5-shot |
|---|---|---|---|
| MF(ODS) | baseline | 52.71 | 53.52 |
| MF(ODS) | Seg | 56.89 | 59.65 |
| MF(ODS) | Seg + Att | 58.00 | 60.14 |
| MF(ODS) | Seg + Att + MSMR | 58.47 | 60.63 |
| AP | baseline | 53.66 | 54.59 |
| AP | Seg | 58.80 | 61.87 |
| AP | Seg + Att | 59.81 | 62.37 |
| AP | Seg + Att + MSMR | 60.54 | 63.92 |

Number of vector splits MSMR divides the high-dimensional feature into multiple splits. Table 4 shows the performance of the proposed CAFENet with varying numbers of splits K. Comparing the K = 1 case with the other cases, we can see that applying MSMR regularization generally improves performance. K = 4 results in the best AP and MF performance. The performance gain is marginal when the embedding is divided into pieces that are too small (K = 16) or too large (K = 2).
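To make the role of K concrete, the sketch below splits a query feature vector and a prototype into K equal sub-vectors and scores each pair separately. It is an illustration only: cosine similarity and contiguous, equal-size chunks are assumptions made here, since the matching metric of Eq. (3) and Eq. (5) is not restated in this section.

```python
import numpy as np

def split_similarities(query, prototype, num_splits):
    """Per-split cosine similarity between a query vector and a prototype.

    The vectors are cut into num_splits contiguous sub-vectors of equal
    length (assumed layout); num_splits=1 recovers the unsplit comparison.
    """
    query = np.asarray(query, dtype=float)
    prototype = np.asarray(prototype, dtype=float)
    if query.ndim != 1 or query.shape != prototype.shape:
        raise ValueError("query and prototype must be 1-D with equal length")
    if query.size % num_splits != 0:
        raise ValueError("dimension must be divisible by num_splits")
    sims = []
    for q, p in zip(np.split(query, num_splits), np.split(prototype, num_splits)):
        denom = np.linalg.norm(q) * np.linalg.norm(p) + 1e-8  # avoid division by zero
        sims.append(float(q @ p / denom))
    return sims

q_vec = [1.0, 0.0, 0.0, 1.0]
p_vec = [1.0, 0.0, 0.0, -1.0]
whole = split_similarities(q_vec, p_vec, num_splits=1)
halves = split_similarities(q_vec, p_vec, num_splits=2)
```

In this toy case the unsplit vectors are orthogonal, while the two halves agree and disagree respectively, which shows how sub-vector matching can expose structure that a single high-dimensional comparison averages away.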

6 Conclusion

In this paper, we establish the few-shot semantic edge detection problem. We propose the Class-Agnostic Few-shot Edge detection Network (CAFENet) based on a skip architecture utilizing multi-scale features. To compensate for the shortage of semantic information in edge labels, CAFENet employs a segmentation module in low resolution and utilizes segmentation masks to generate attention maps. The attention maps are applied to the multi-scale skip connections to localize the semantically related region. We also present the MSMR regularization method, which splits the feature vectors and prototypes into several low-dimensional sub-vectors and solves multiple metric-learning sub-problems with the sub-vectors. We built two novel datasets, FSE-1000 and SBD-5i, well suited to few-shot semantic edge detection. Experimental results demonstrate that the proposed techniques significantly improve the few-shot semantic edge detection performance relative to a baseline approach.

Table 2: Evaluation results of the proposed CAFENet on SBD-5i. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are reported in %.

| Split | Test classes |
|---|---|
| i=0 | aeroplane, bike, bird, boat, bottle |
| i=1 | bus, car, cat, chair, cow |
| i=2 | table, dog, horse, mbike, person |
| i=3 | plant, sheep, sofa, train, tv |

| Metric | Method (5-shot) | SBD-$5^0$ | SBD-$5^1$ | SBD-$5^2$ | SBD-$5^3$ | Mean |
|---|---|---|---|---|---|---|
| MF(ODS) | baseline | 22.27 | 19.64 | 20.41 | 20.41 | 20.20 |
| MF(ODS) | Seg | 30.61 | 31.62 | 28.06 | 24.97 | 28.82 |
| MF(ODS) | Seg + Att | 31.75 | 33.41 | 28.44 | 26.03 | 29.91 |
| MF(ODS) | Seg + Att + MSMR | 34.71 | 36.81 | 32.02 | 28.37 | 32.98 |
| AP | baseline | 18.68 | 15.57 | 14.97 | 14.05 | 15.82 |
| AP | Seg | 26.14 | 26.78 | 21.92 | 18.43 | 23.32 |
| AP | Seg + Att | 27.61 | 28.39 | 22.66 | 20.11 | 24.69 |
| AP | Seg + Att + MSMR | 30.47 | 32.40 | 27.01 | 23.06 | 28.24 |

| Metric | Method (1-shot) | SBD-$5^0$ | SBD-$5^1$ | SBD-$5^2$ | SBD-$5^3$ | Mean |
|---|---|---|---|---|---|---|
| MF(ODS) | baseline | 21.81 | 19.49 | 20.34 | 18.06 | 19.93 |
| MF(ODS) | Seg | 29.89 | 31.64 | 27.89 | 24.41 | 28.46 |
| MF(ODS) | Seg + Att | 30.72 | 33.03 | 28.63 | 25.04 | 29.36 |
| MF(ODS) | Seg + Att + MSMR | 31.54 | 34.75 | 29.47 | 26.68 | 30.61 |
| AP | baseline | 18.11 | 15.47 | 14.73 | 13.89 | 15.55 |
| AP | Seg | 25.16 | 26.15 | 21.52 | 18.52 | 22.84 |
| AP | Seg + Att | 26.10 | 27.21 | 22.47 | 18.81 | 23.65 |
| AP | Seg + Att + MSMR | 26.81 | 29.08 | 23.77 | 20.44 | 25.03 |

Table 3: Comparison of different feature matching methods on SBD-5i under the 1-way 5-shot setting. MF and AP scores are averaged over 4 splits.

| Feature matching method | baseline | average | weighted sum |
|---|---|---|---|
| AP | 34.61 | 31.05 | 31.44 |
| MF(ODS) | 29.91 | 26.20 | 26.46 |

Table 4: Comparison of different numbers of vector splits K on SBD-5i under the 1-way 5-shot setting. MF and AP scores are averaged over 4 splits.

| Number of splits | K = 1 | K = 2 | K = 4 | K = 8 | K = 16 |
|---|---|---|---|---|---|
| AP | 24.69 | 26.48 | 29.91 | 27.68 | 23.83 |
| MF(ODS) | 29.91 | 31.62 | 32.30 | 31.58 | 30.77 |


References

1. Abbass, H.H., Mousa, Z.R.: Edge detection of medical images using markov basis. Applied Mathematical Sciences 11(37), 1825–1833 (2017)

2. Acuna, D., Kar, A., Fidler, S.: Devil is in the edges: Learning semantic boundaries from noisy annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11075–11083 (2019)

3. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33(5), 898–916 (2010)

4. Bertasius, G., Shi, J., Torresani, L.: Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4380–4389 (2015)

5. Bertasius, G., Shi, J., Torresani, L.: High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 504–512 (2015)

6. Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3602–3610 (2016)

7. Canny, J.: A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence (6), 679–698 (1986)

8. Chen, L.C., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L.: Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4545–4554 (2016)

9. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2017)

10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

11. Deng, R., Shen, C., Liu, S., Wang, H., Liu, X.: Learning to predict crisp boundaries. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 562–578 (2018)

12. Dong, N., Xing, E.: Few-shot semantic segmentation with prototype learning. In: BMVC. vol. 3 (2018)

13. Feng, M., Lu, H., Ding, E.: Attentive feedback network for boundary-aware salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1623–1632 (2019)

14. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments for object detection. IEEE transactions on pattern analysis and machine intelligence 30(1), 36–51 (2007)

15. Fink, M.: Object classification from a single example utilizing class relevance metrics. In: Advances in neural information processing systems. pp. 449–456 (2005)

16. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1126–1135. JMLR.org (2017)


17. Fu, K., Zhang, T., Zhang, Y., Yan, M., Chang, Z., Zhang, Z., Sun, X.: Meta-ssd: Towards fast adaptation for few-shot object detection with meta-learning. IEEE Access 7, 77597–77606 (2019)

18. Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4367–4375 (2018)

19. Hancock, E.R., Kittler, J.: Edge-labeling using dictionary-based relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(2), 165–181 (1990)

20. Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: 2011 International Conference on Computer Vision. pp. 991–998. IEEE (2011)

21. He, J., Zhang, S., Yang, M., Shan, Y., Huang, T.: Bi-directional cascade network for perceptual edge detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3828–3837 (2019)

22. Hou, R., Chang, H., Bingpeng, M., Shan, S., Chen, X.: Cross attention network for few-shot classification. In: Advances in Neural Information Processing Systems. pp. 4005–4016 (2019)

23. Hu, Y., Chen, Y., Li, X., Feng, J.: Dynamic feature fusion for semantic edge detection. arXiv preprint arXiv:1902.09104 (2019)

24. Hwang, J.J., Liu, T.L.: Pixel-wise deep learning for contour detection. arXiv preprint arXiv:1504.01989 (2015)

25. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1125–1134 (2017)

26. Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature reweighting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8420–8429 (2019)

27. Karlinsky, L., Shtok, J., Harary, S., Schwartz, E., Aides, A., Feris, R., Giryes, R., Bronstein, A.M.: Repmet: Representative-based metric learning for classification and few-shot object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5197–5206 (2019)

28. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

29. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop. vol. 2. Lille (2015)

30. Kye, S.M., Lee, H.B., Kim, H., Hwang, S.J.: Transductive few-shot learning with meta-learned confidence (2020)

31. Lifchitz, Y., Avrithis, Y., Picard, S., Bursuc, A.: Dense classification and implanting for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9258–9267 (2019)

32. Liu, Y., Lee, J., Park, M., Kim, S., Yang, E., Hwang, S., Yang, Y.: Learning to propagate labels: Transductive propagation network for few-shot learning. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=SyVuRiC5K7

33. Liu, Y., Cheng, M.M., Fan, D.P., Zhang, L., Bian, J., Tao, D.: Semantic edge detection with diverse deep supervision. arXiv preprint arXiv:1804.02864 (2018)

34. Liu, Y., Cheng, M.M., Hu, X., Wang, K., Bai, X.: Richer convolutional features for edge detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3000–3009 (2017)

35. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)


36. Mehena, J.: Medical image edge detection using modified morphological edge detection approach (2019)

37. Prasad, M., Zisserman, A., Fitzgibbon, A., Kumar, M.P., Torr, P.H.: Learning class-specific edges for object detection and segmentation. In: Computer Vision, Graphics and Image Processing, pp. 94–105. Springer (2006)

38. Qi, H., Brown, M., Lowe, D.G.: Low-shot learning with imprinted weights. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5822–5830 (2018)

39. Rakelly, K., Shelhamer, E., Darrell, T., Efros, A., Levine, S.: Conditional networks for few-shot semantic segmentation (2018)

40. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning (2016)

41. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning with memory-augmented neural networks. In: International conference on machine learning. pp. 1842–1850 (2016)

42. Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B.: One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410 (2017)

43. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in neural information processing systems. pp. 4077–4087 (2017)

44. Sugihara, K.: Machine interpretation of line drawings. The Massachusetts Institute of Technology (1986)

45. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1199–1208 (2018)

46. Ullman, S., Basri, R.: Recognition by linear combination of models. Tech. rep., MASSACHUSETTS INST OF TECH CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB (1989)

47. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: Advances in neural information processing systems. pp. 3630–3638 (2016)

48. Wang, K., Liew, J.H., Zou, Y., Zhou, D., Feng, J.: Panet: Few-shot image semantic segmentation with prototype alignment. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9197–9206 (2019)

49. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8798–8807 (2018)

50. Wei, T., Li, X., Chen, Y.P., Tai, Y.W., Tang, C.K.: Fss-1000: A 1000-class dataset for few-shot segmentation. arXiv preprint arXiv:1907.12347 (2019)

51. Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE international conference on computer vision. pp. 1395–1403 (2015)

52. Yoon, S.W., Seo, J., Moon, J.: Tapnet: Neural network augmented with task-adaptive projection for few-shot learning. arXiv preprint arXiv:1905.06549 (2019)

53. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579 (2015)

54. Yu, Z., Feng, C., Liu, M.Y., Ramalingam, S.: Casenet: Deep category-aware semantic edge detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5964–5973 (2017)

55. Yu, Z., Liu, W., Liu, W., Peng, X., Hui, Z., Kumar, B.V.: Generalized transitive distance with minimum spanning random forest. In: IJCAI. pp. 2205–2211. Citeseer (2015)


56. Yu, Z., Liu, W., Zou, Y., Feng, C., Ramalingam, S., Vijaya Kumar, B., Kautz, J.: Simultaneous edge alignment and learning. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 388–404 (2018)

57. Zhang, C., Lin, G., Liu, F., Yao, R., Shen, C.: Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5217–5226 (2019)

58. Zhang, X., Wei, Y., Yang, Y., Huang, T.: Sg-one: Similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091 (2018)

59. Zhu, D., Li, J., Wang, X., Peng, J., Shi, W., Zhang, X.: Semantic edge based disparity estimation using adaptive dynamic programming for binocular sensors. Sensors 18(4), 1074 (2018)


CAFENet: Class-Agnostic Few-Shot Edge Detection Network: Supplementary Material

Young-Hyun Park1, Jun Seo1, and Jaekyun Moon1

School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea

{dnffkf369,tjwns0630}@kaist.ac.kr, [email protected]

1 Additional Experimental Setup

In this section, we provide detailed information about the experimental setup. We adopt the ImageNet-pretrained ResNet-34 with 64-128-256-512 channels for each residual block from [link] as the encoder. To construct the skip architecture, we employ the bottleneck block of ResNet as the post-processing blocks S(1) ∼ S(4). Each bottleneck block consists of two 1×1 convolutional layers and one 3×3 convolutional layer with an expansion rate of 4. Dropout with a ratio of 0.25 is applied at the end of each bottleneck block. For the ASPP module in front of S(3), we adopt dilation rates of 1, 4, 7, and 11. The segmentation module generates a segmentation prediction with a rounding threshold value λ of 0.4. For the decoder, each decoder block is composed of three consecutive 3×3 convolutional layers, and dropout with a ratio of 0.25 is again placed at the end of each layer.

During meta-training of CAFENet, we set the number of query samples in training episodes to 5 for FSE-1000 and 10 for SBD-5i. In evaluation, we employ the average_precision_score function of the scikit-learn library to measure the Average Precision (AP) score. We compute the AP score for each image and average the scores to measure the overall performance. For the Maximum F-measure (MF) score, we measure true positives (TP), false positives (FP) and false negatives (FN) at threshold intervals of 0.01 for each image, and accumulate the values over all images in the 1000 test episodes. The MF score is computed using the accumulated TP, FP, and FN values.
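The MF computation described above can be sketched in plain Python as follows. This is an illustrative reading of the procedure, not the authors' script: the threshold grid 0.01, 0.02, ..., 0.99 and the use of `>=` for binarization are assumptions, and the F-measure is the standard $F = 2TP / (2TP + FP + FN)$.

```python
def accumulate_counts(preds, gts, step=0.01):
    """Accumulate TP/FP/FN over all images at each threshold.

    preds: list of flat lists of edge probabilities (one list per image).
    gts:   list of flat lists of 0/1 edge labels with matching lengths.
    """
    num = round(1 / step)
    thresholds = [i / num for i in range(1, num)]  # assumed grid: 0.01 ... 0.99
    tp = [0] * len(thresholds)
    fp = [0] * len(thresholds)
    fn = [0] * len(thresholds)
    for pred, gt in zip(preds, gts):
        for k, t in enumerate(thresholds):
            for p, g in zip(pred, gt):
                positive = p >= t
                if positive and g:
                    tp[k] += 1
                elif positive:
                    fp[k] += 1
                elif g:
                    fn[k] += 1
    return thresholds, tp, fp, fn

def max_f_measure(tp, fp, fn):
    """Maximum F-measure over thresholds, from accumulated counts."""
    best = 0.0
    for t, p, n in zip(tp, fp, fn):
        denom = 2 * t + p + n
        if denom > 0:
            best = max(best, 2 * t / denom)
    return best

toy_preds = [[0.9, 0.8, 0.3, 0.1]]
toy_gts = [[1, 0, 1, 0]]
_, tp_acc, fp_acc, fn_acc = accumulate_counts(toy_preds, toy_gts)
mf = max_f_measure(tp_acc, fp_acc, fn_acc)
```

Accumulating counts before computing F means that images with many edge pixels weigh more than sparse ones, in contrast to the per-image averaging used for AP.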

2 Label Generator

2.1 Edge Label Generator

Algorithm 1 generates the edge labels from the segmentation labels. The edge label generator finds the regions where the pixel value of the segmentation label changes abruptly, and marks the pixels in those regions as the boundary. Note that the pixels at the border of the image are also marked as the boundary.

Algorithm 1 Edge Label Generation

Input: Segmentation label M of an image
Output: Edge label y of an image

y ← 0_{W×H}    ▷ Initialize y as a zero matrix having the same shape as M
for (h, w) in (1, 1), ..., (H, W) do    ▷ H/W is the height/width of the image
    if M_{h,w} = 1 then
        for (a, b) = (-r, -r), ..., (r, r) do    ▷ radius r determines the thickness of the edge
            if M_{h+a,w+b} = 0 then
                y_{h,w} ← 1    ▷ 0/1 means non-edge/edge pixel, respectively
                break
            else if (h + a < 0) or (h + a > H) or (w + b < 0) or (w + b > W) then
                y_{h,w} ← 1
                break
return y    ▷ Return label annotation

Fig. 1: Result of edge label generator. Each example shows, from left to right, the image, the segmentation label, and the edge label.
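Algorithm 1 can be sketched in NumPy as below. This is an illustrative, unoptimized transcription rather than the authors' code: it uses 0-indexed coordinates and checks the image bounds before reading the mask, which differs slightly from the order of the tests in the pseudocode but marks the same pixels.

```python
import numpy as np

def edge_label_from_segmentation(mask, radius):
    """Mark foreground pixels whose (2r+1) x (2r+1) window touches
    background or falls outside the image as edge pixels."""
    mask = np.asarray(mask).astype(bool)
    height, width = mask.shape
    edge = np.zeros((height, width), dtype=np.uint8)
    offsets = range(-radius, radius + 1)
    for h in range(height):
        for w in range(width):
            if not mask[h, w]:
                continue  # only foreground pixels can become edge pixels
            for a in offsets:
                for b in offsets:
                    hh, ww = h + a, w + b
                    outside = not (0 <= hh < height and 0 <= ww < width)
                    if outside or not mask[hh, ww]:
                        edge[h, w] = 1
                        break
                if edge[h, w]:
                    break
    return edge

square = np.zeros((5, 5), dtype=np.uint8)
square[1:4, 1:4] = 1  # 3 x 3 foreground block
square_edge = edge_label_from_segmentation(square, radius=1)
```

A larger `radius` widens the band of foreground pixels that count as boundary, which is how the thicker edges used in FSE-1000 and SBD-5i are obtained.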

2.2 Segmentation Label Generator

Algorithm 2 is the segmentation label generator, which generates the segmentation label from the edge label. Before the label generation, the pixels are divided into several groups based on the boundary labels. We employ the Breadth-First Search (BFS) algorithm and divide the pixels into groups $\{G_1, G_2, \ldots, G_n\}$. The segmentation label generator of Algorithm 2 classifies these groups into foreground and background. First, the algorithm sweeps each column and row to count the number of pixel value changes in the edge label. If there is a certain number of changes, the algorithm sweeps the column or row again, finds the locations of the foreground pixels, and marks them in a matrix $T$. Based on the pixel groups $\{G_1, G_2, \ldots, G_n\}$, the marking results in $T$ are then divided into pixel value groups $\{T^1, T^2, \ldots, T^n\}$. The probability that each group $G_i$ belongs to the foreground is calculated as the mean of the pixel values in $T^i$. The groups with probability above the threshold λ are determined to be the foreground groups, and the pixels belonging to foreground groups are marked as foreground pixels. We set the threshold value λ to 20/255.
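The grouping step that precedes Algorithm 2 can be illustrated with a standard BFS flood fill. The grouping rule used here (4-connected pixels that share the same edge-label value form one group) is an assumption, since the text only states that groups are formed from the boundary labels via BFS.

```python
from collections import deque

def group_pixels(edge):
    """Partition pixels into 4-connected groups of equal edge-label value.

    edge: 2-D list of 0/1 values. Returns a list of groups, each a list
    of (h, w) coordinates, in raster-scan order of their first pixel.
    """
    height, width = len(edge), len(edge[0])
    group_id = [[-1] * width for _ in range(height)]
    groups = []
    for h0 in range(height):
        for w0 in range(width):
            if group_id[h0][w0] != -1:
                continue  # already assigned to a group
            gid = len(groups)
            group_id[h0][w0] = gid
            members = []
            queue = deque([(h0, w0)])
            while queue:
                h, w = queue.popleft()
                members.append((h, w))
                for dh, dw in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nh, nw = h + dh, w + dw
                    if (0 <= nh < height and 0 <= nw < width
                            and group_id[nh][nw] == -1
                            and edge[nh][nw] == edge[h0][w0]):
                        group_id[nh][nw] = gid
                        queue.append((nh, nw))
            groups.append(members)
    return groups

ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
ring_groups = group_pixels(ring)
```

For the closed ring above, the outside, the ring itself, and the enclosed pixel end up in three separate groups, which is the structure Algorithm 2 then labels as foreground or background.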

Algorithm 2 Segmentation Label Generation

Input: Edge label y of an image, pixel groups G_1, G_2, ..., G_n
Output: Segmentation label M of an image

M, T ← 0_{W×H}    ▷ Initialize M, T as zero matrices having the same shape as y
for h = 1, ..., H do    ▷ H is the height of the image
    cnt, mode ← 0
    for w = 1, ..., W do    ▷ W is the width of the image
        if y_{h,w} = mod(mode + 1, 2) then    ▷ Accumulate changes of pixel value
            cnt ← cnt + 1
            mode ← mod(mode + 1, 2)
    if mod(cnt, 4) = 0 and cnt ≠ 0 then    ▷ Check if there are FG pixels in the row
        cnt′, mode′ ← 0
        for w′ = 1, ..., W do    ▷ Find location of FG pixels in the row
            if y_{h,w′} = mod(mode′ + 1, 2) then
                cnt′ ← cnt′ + 1
                mode′ ← mod(mode′ + 1, 2)
            if mod(cnt′, 4) = 2 then
                T_{h,w′} ← 1    ▷ Record location of FG pixels in the row
for w = 1, ..., W do    ▷ Repeat the same process for every column
    cnt, mode ← 0
    for h = 1, ..., H do
        if y_{h,w} = mod(mode + 1, 2) then
            cnt ← cnt + 1
            mode ← mod(mode + 1, 2)
    if mod(cnt, 4) = 0 and cnt ≠ 0 then
        cnt′, mode′ ← 0
        for h′ = 1, ..., H do
            if y_{h′,w} = mod(mode′ + 1, 2) then
                cnt′ ← cnt′ + 1
                mode′ ← mod(mode′ + 1, 2)
            if mod(cnt′, 4) = 2 then
                T_{h′,w} ← 1
for i = 1, ..., n do
    T^i ← T_{h,w} | (h,w) ∈ G_i
    if mean(T^i) ≥ λ then    ▷ Check the probability that G_i belongs to foreground
        M_{h,w} | (h,w) ∈ G_i ← 1    ▷ 1 means a foreground pixel
    else
        M_{h,w} | (h,w) ∈ G_i ← 0    ▷ 0 means a background pixel
return M    ▷ Return segmentation annotation

Fig. 2: Result of segmentation label generator. Panels, from left to right: image, edge label, marking result T, pixel groups, segmentation label.

3 Details on Datasets

3.1 FSE-1000

We build FSE-1000 using an existing few-shot segmentation dataset, FSS-1000. We extract the boundary labels from segmentation annotations using Algorithm 1. The radius value r in Algorithm 1 is set to 3 in FSE-1000. The 1000 categories in FSE-1000 are split into 800 train classes and 200 test classes. For the detailed class configuration, the reader may refer to our attachment on class configuration. Figure 1 visualizes the result of our edge extraction algorithm.

3.2 SBD-5i

SBD-5i is constructed based on the existing semantic edge detection dataset (SBD). Due to the noise of the boundary annotations in the original SBD, we utilize thicker edges as done in FSE-1000. To extract thicker edges, we generate the segmentation labels from the edge labels using Algorithm 2 instead of using the existing segmentation labels of SBD. Figure 2 shows the process of generating the segmentation label from the edge label. From the generated segmentation labels, we extract edge labels using Algorithm 1 with a radius value of 4. This process allows us to train the proposed CAFENet using only the edge labels.

While all images in FSE-1000 have the same size, images in SBD-5i have different sizes. However, constructing the training episode as a mini-batch requires images of the same size. Previous works on semantic edge detection typically apply random cropping to deal with this issue. For the few-shot setting, however, random cropping severely degrades the informativeness of the support set and consequently hinders learning. Alternatively, we utilize training examples resized to 320 × 320 to maintain the information of the images as much as possible. When resizing the edge labels for training, we first generate segmentation labels at the original scale using Algorithm 2 and resize the segmentation labels to 320 × 320. Then, we extract edge labels from the resized segmentation labels using Algorithm 1 with a radius value of 3.

4 Additional Qualitative Results

Figure 3 visualizes more qualitative results on SBD-5i. We illustrate and compare the boundary prediction results of the baseline, Seg, Seg + Att, and Seg + Att + MSMR methods. For fair comparison, all methods share the same support set. From the results, we can clearly see that the techniques proposed in CAFENet steadily improve the quality of edge prediction.

5 Additional Results with Multi-angle Input Test

In this section, we report the few-shot semantic edge prediction results with the multi-angle input test. In the multi-angle input test, the model predicts the edge by averaging the 4 edge prediction results from 4 copies of an input image rotated by multiples of 90 degrees. We have empirically found that the multi-angle input test significantly improves the performance. Tables 1 and 2 show the evaluation results with the multi-angle input test for FSE-1000 and SBD-5i, respectively. We can verify the effectiveness of the multi-angle input test from the results.

Fig. 3: Qualitative results comparison of proposed methods. Columns, from left to right: image, ground truth, baseline, Seg, Seg + Att, Seg + Att + MSMR.

Table 1: 1-way 5-shot results with the multi-angle input test on FSE-1000. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are reported in %.

| Metric | Method | 1-way 5-shot |
|---|---|---|
| MF(ODS) | baseline | 55.23 |
| MF(ODS) | Seg | 61.17 |
| MF(ODS) | Seg + Att | 61.90 |
| MF(ODS) | Seg + Att + MSMR | 62.23 |
| AP | baseline | 56.12 |
| AP | Seg | 63.32 |
| AP | Seg + Att | 63.83 |
| AP | Seg + Att + MSMR | 65.81 |

Table 2: 1-way 5-shot results with the multi-angle input test on SBD-5i. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are reported in %.

| Metric | Method (5-shot) | SBD-$5^0$ | SBD-$5^1$ | SBD-$5^2$ | SBD-$5^3$ | Mean |
|---|---|---|---|---|---|---|
| MF(ODS) | baseline | 24.38 | 23.78 | 22.81 | 20.39 | 22.84 |
| MF(ODS) | Seg | 32.46 | 33.18 | 29.63 | 26.46 | 30.43 |
| MF(ODS) | Seg + Att | 34.29 | 35.76 | 32.25 | 28.12 | 32.61 |
| MF(ODS) | Seg + Att + MSMR | 36.13 | 37.92 | 34.18 | 30.21 | 34.61 |
| AP | baseline | 21.03 | 20.39 | 17.48 | 16.31 | 18.80 |
| AP | Seg | 27.92 | 28.00 | 23.94 | 20.06 | 24.98 |
| AP | Seg + Att | 29.94 | 30.40 | 24.34 | 21.38 | 26.52 |
| AP | Seg + Att + MSMR | 31.92 | 33.51 | 29.28 | 24.91 | 29.91 |
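The multi-angle input test of this section can be sketched as a thin wrapper around any edge predictor. Here `predict_fn` is a hypothetical stand-in for the trained model (mapping an H × W input to an H × W edge map), and rotating each prediction back to the original orientation before averaging is an assumption about how the four results are aligned; how the support set is handled is not specified in the text and is left out.

```python
import numpy as np

def multi_angle_predict(predict_fn, image):
    """Average edge predictions over 4 copies of the input rotated by
    0, 90, 180 and 270 degrees, each rotated back before averaging."""
    outputs = []
    for k in range(4):
        rotated = np.rot90(image, k, axes=(0, 1))
        pred = predict_fn(rotated)
        outputs.append(np.rot90(pred, -k, axes=(0, 1)))  # undo the rotation
    return np.mean(outputs, axis=0)

# Toy check with an identity "model": the average equals the input itself.
toy_image = np.arange(12, dtype=float).reshape(3, 4)
toy_result = multi_angle_predict(lambda x: x * 1.0, toy_image)
```

The wrapper multiplies inference cost by four, so the gains reported in Tables 1 and 2 of this section come at a proportional increase in evaluation time.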