

Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition

Yanhua Cheng 1*, Xin Zhao 1, Rui Cai 2, Zhiwei Li 2, Kaiqi Huang 1,3, Yong Rui 2
1 CRIPAC & NLPR, CASIA   2 Microsoft Research   3 CAS Center for Excellence in Brain Science and Intelligence Technology
{yh.cheng, xzhao, kaiqi.huang}@nlpr.ia.ac.cn, {ruicai, zli, yongrui}@microsoft.com

Abstract
This paper studies the problem of RGB-D object recognition. Inspired by the great success of deep convolutional neural networks (DCNN) in AI, researchers have tried to apply it to improve the performance of RGB-D object recognition. However, DCNN always requires a large-scale annotated dataset to supervise its training. Manually labeling such a large RGB-D dataset is expensive and time consuming, which prevents DCNN from quickly promoting this research area. To address this problem, we propose a semi-supervised multimodal deep learning framework to train DCNN effectively based on very limited labeled data and massive unlabeled data. The core of our framework is a novel diversity preserving co-training algorithm, which can successfully guide DCNN to learn from the unlabeled RGB-D data by making full use of the complementary cues of the RGB and depth data in object representation. Experiments on the benchmark RGB-D dataset demonstrate that, with only 5% labeled training data, our approach achieves competitive performance for object recognition compared with the state-of-the-art results reported by fully-supervised methods.

1 Introduction
Recent years have witnessed RGB-D object recognition becoming a very active research area in computer vision and robotics with the rapid development of commodity depth cameras. Such off-the-shelf sensors, e.g., Microsoft Kinect and Intel RealSense, are capable of providing high quality synchronized RGB and depth information to depict multimodal characteristics of an object. Specifically, the RGB modality captures rich colors and textures, while the depth modality provides pure geometry and shape cues which are robust to lighting and color variations. This represents an opportunity to dramatically improve the performance of object recognition by combining the two complementary cues.

Remarkable efforts have been invested in RGB-D object recognition in the last few years. Most existing work falls into two kinds. One is about feature representation, including handcrafted features [Lai et al., 2011a; Bo et al., 2011a; R.C. et al., 2012] and learning-based features [Blum et al., 2012; Bo et al., 2012; Socher et al., 2012; Jhuo et al., 2015]. The other is about RGB-D fusion, like straightforward concatenation of RGB and depth features as well as learning-based fusion [Lai et al., 2011b; Cheng et al., 2015a]. Towards building a unified solution for feature learning and RGB-D fusion, a promising trend is to devise an end-to-end deep learning system via convolutional neural networks (DCNN) [Gupta et al., 2014; Eitel et al., 2015; Wang et al., 2015], such as the one shown in Fig. 1 (a). Such ideas were inspired by the great success of deep learning for image classification (only RGB data). It should be noticed that DCNN models always require a large-scale dataset for supervised training, e.g., ImageNet with millions of annotated images [Deng et al., 2009]. However, labeling such a large dataset for the emerging RGB-D object recognition task is still expensive and time consuming. This prevents DCNN from quickly promoting this research area. Thus it is necessary to develop a new, effective training framework for deep learning to benefit from massive unlabeled RGB-D data, which is often cheap and easily available.

* The work was performed at Microsoft Research.

To handle the aforementioned problem, a natural idea is to incorporate conventional semi-supervised learning methods into the deep learning framework. Although many successful semi-supervised methods exist in the literature [Zhu, 2005], we are particularly interested in the co-training algorithm due to its unique advantage on multimodal data. Theoretical proofs have been given in [Blum and Mitchell, 1998; Balcan et al., 2004] to guarantee the success of co-training in learning from unlabeled data on condition that: 1) each example contains two views, either of which is able to depict the example well; and 2) the two views should not be highly correlated. RGB-D data matches the two conditions well by providing two complementary cues of objects (i.e., RGB and depth). Therefore, the goal of this paper is to develop a semi-supervised multimodal deep learning framework based on co-training, as shown in Fig. 1 (b).

The pipeline of the framework can be summarized as follows. First, the RGB- and depth-DCNN models are trained on the given labeled data of the respective views. Then each model is applied to predict the unlabeled pool and label the most confident examples for the other model; these examples are informative to the other model and increase its capability through the next round of training. The two steps are repeated until no confident examples can be chosen for each other. Finally, we add a fusion layer to combine the two-stream networks for recognition and jointly train the whole model.
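For illustration, this loop could be organized as in the following Python sketch; the helper callables (train_view_model, select_confident, train_fusion) are hypothetical placeholders supplied by the caller, not part of the paper's implementation.

```python
def co_train(labeled, unlabeled, train_view_model, select_confident, train_fusion):
    """labeled: list of (rgb, depth, label) tuples; unlabeled: list of (rgb, depth) pairs."""
    rgb_model = train_view_model([(r, y) for r, d, y in labeled])      # step 1: per-view training
    depth_model = train_view_model([(d, y) for r, d, y in labeled])
    while unlabeled:
        # step 2: each view labels its most confident unlabeled examples for the other view
        picks = select_confident(rgb_model, depth_model, unlabeled)    # [(index, predicted_label), ...]
        if not picks:
            break                                                      # no confident examples left
        for idx, label in sorted(picks, reverse=True):                 # pop from the back first
            rgb, depth = unlabeled.pop(idx)
            labeled.append((rgb, depth, label))
        rgb_model = train_view_model([(r, y) for r, d, y in labeled])  # retrain both views
        depth_model = train_view_model([(d, y) for r, d, y in labeled])
    # finally, add a fusion layer on top of the two streams and fine-tune end-to-end
    return train_fusion(rgb_model, depth_model, labeled)
```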

Although the proposed framework looks quite straightforward, it is not a trivial task to make it work. In fact, starting the framework directly does not show any inspiring results in our experiments. There are two obstacles during the training of the framework. One is about the initial phase. Such a limited labeled set can hardly provide a good deep learning model for either the RGB or the depth modality due to over-fitting, even though each model can be pretrained on other datasets like ImageNet and then finetuned on the RGB-D object recognition task. The other is about the co-training phase. Since each DCNN model selects the most confident examples from every predicted class, it is prone to result in a biased distribution over each category in the labeled pool along with co-training, e.g., almost all "apples" will be red but few green, and "cups" with handles will dominate over those without handles, meaning that the intra-class diversity of each category is fading. As a result, the final DCNN models trained on the imbalanced labeled set have poor generalization ability for category-level object recognition on unseen data.

Two strategies are proposed in this paper to address these problems. First, we devise two reconstruction networks to better initialize the RGB- and depth-DCNN models for object recognition separately. The reconstruction networks make use of both the labeled and unlabeled data for unsupervised feature learning, which helps relieve the over-fitting problem effectively. Second, we introduce a diversity preserving co-training algorithm to balance the samples added from the unlabeled pool. To this end, we adopt convex clustering [Lashkari and Golland, 2007] to automatically discover the various intra-class attributes of each category, and then keep the added samples uniformly covering every attribute of every category during the iterations. Such informative and balanced samples can boost the RGB- and depth-DCNN models during every round of training. We demonstrate the effectiveness of the two strategies in the experiments.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces the proposed semi-supervised multimodal deep learning framework. Experimental results and detailed analysis are reported in Section 4. In Section 5, we finally draw our conclusions.

2 Related Work
RGB-D Object Recognition. Many successful methods have been proposed for RGB-D object recognition in recent years. Here we review the state-of-the-art supervised as well as semi-supervised approaches evaluated on the benchmark RGB-D datasets.

Supervised Methods. Early work can be divided into two groups. One was focused on feature extraction from the novel RGB-D data, including handcrafted features [Lai et al., 2011a] (such as SIFT and spin images), a series of appearance and shape kernel descriptors [Bo et al., 2011a], and automatically learned features obtained via successful machine learning methods [Blum et al., 2012; Bo et al., 2012; Socher et al., 2012; Jhuo et al., 2015]. The other tried to explore a more effective way for RGB and depth fusion [Lai et al., 2011b; Cheng et al., 2015a] instead of the direct feature concatenation used in the first group. Both groups of work were followed by an SVM or random forest classifier, fully supervised by all the training data for better object recognition. Very recently, researchers began to jointly learn the features, classifiers and RGB-D fusion using end-to-end deep learning [Gupta et al., 2014; Eitel et al., 2015; Wang et al., 2015]. Due to the lack of a large-scale annotated RGB-D object dataset, they spared no effort to augment the data, e.g., synthesizing objects via CAD rendering, as well as generating new samples via geometric transformations. Compared to the real data, the artificial data inevitably has a different distribution, and can hardly provide the same rich information to depict object categories. Thus the potential of deep learning to improve RGB-D object recognition is limited.

[Figure 1 diagram: in (a), a labeled dataset feeds RGB-DCNN and depth-DCNN streams whose RGB and depth features drive RGB, depth, and fusion classifiers; in (b), an unlabeled dataset additionally provides auto-labeled data to both streams.]

Figure 1: The structures of (a) supervised and (b) semi-supervised multimodal deep learning for RGB-D object recognition. Both (a) and (b) jointly learn the features and classifiers in an end-to-end fashion.

Semi-Supervised Methods. The most similar work to ours is that of [Cheng et al., 2014; 2015c], who also employed co-training for RGB-D object recognition to reduce the dependence on large annotated training sets. However, they only adopted co-training to retrain the RGB and depth SVM classifiers based on features extracted in advance. Different from them, this paper proposes a powerful semi-supervised deep learning method, which can jointly learn the features, classifiers and RGB-D fusion by making use of the unlabeled data. Experimental results show that our method achieves much better performance.



[Figure 2 diagram: two AlexNet-style streams for labeled RGB and labeled depth data (conv1 11x11x96 stride 4, conv2 5x5x256, conv3-conv4 3x3x384, conv5 3x3x256, fc6/fc7 4096 with dropout 0.9); each stream feeds a category classifier and a latent-attribute classifier, each stream is also trained on the confident set auto-labeled by the other modality's DCNN, the two fc7 layers are concatenated (Concat-FC7, 8192-d) into a fusion classifier, and predictions on the unlabeled RGB-D data drive the diversity preserving selection.]

Figure 2: Overview of our semi-supervised multimodal deep learning framework for RGB-D object recognition. The training of the framework mainly contains two iterative steps: 1) training the RGB- and depth-DCNN models over the respective data based on the labeled pool (indicated as solid lines); 2) applying each model to predict the unlabeled pool and select the most confident samples of each category for the other, whilst keeping these newly labeled examples to preserve intra-class diversity (indicated as dashed lines; see details in the text). After all iterations are completed, we add a fusion classification layer and optimize the entire model end-to-end. Note that all the classifiers are implemented with a hinge-loss layer. Best viewed in color.

Semi-Supervised Deep Learning. [Weston et al., 2012] proposed a semi-supervised deep learning method for single-modal data by embedding a pairwise loss in the middle layers. However, they required extra information about whether a pair of unlabeled images belongs to the same class, which is hard to obtain in reality. Differently, we focus on multimodal RGB-D object recognition, and propose a more explicit semi-supervised deep learning framework to learn from the unlabeled data via diversity preserving co-training.

3 Our Approach

3.1 Overview
We aim to learn a powerful semi-supervised multimodal deep learning model for RGB-D object recognition based on limited labeled data and massive unlabeled data. To be specific, we have a small labeled pool L = {(I_1, D_1, y_1), ..., (I_M, D_M, y_M)} with M pairwise RGB-D objects, where I_i and D_i denote the corresponding RGB and depth modalities of the i-th example with the category label y_i ∈ {1, ..., C}. Meanwhile, we have a large-scale unlabeled RGB-D dataset U = {(I_1, D_1), ..., (I_N, D_N)} with a data distribution similar to the labeled pool. The proposed framework is shown in Fig. 2, for which a diversity preserving co-training algorithm is introduced to learn from the unlabeled RGB-D data. We now detail the three important phases of the framework.

Initialization. Well-trained RGB- and depth-DCNN models before co-training are the first prerequisite of the whole system. This paper adopts the architecture of AlexNet [Krizhevsky et al., 2012] to represent both the RGB and depth data. However, the small labeled set L is insufficient to supervise the training of such deep learning models for object recognition. To address this problem, we devise two reconstruction networks (Section 3.2) to initialize the convolution layer parameters (i.e., conv1, ..., conv5) of the RGB- and depth-DCNN models, respectively. Each reconstruction network tries to encode and decode its inputs, taking advantage of all the labeled and unlabeled data to learn meaningful features. After being pretrained by the corresponding reconstruction network, the RGB- and depth-DCNN models finetuned on L can generalize well for object recognition.

Training. The training of the semi-supervised deep learning framework mainly involves two iterative steps, namely training each DCNN model and updating the labeled pool, as illustrated in Fig. 2. For clarity, we denote the state of the system at the t-th iteration as L_t, U_t. To select an informative and balanced set H_t = {H^RGB_t, H^depth_t} = {(I_i, D_i, y_i)} (y_i is the predicted category label) from U_t, so as to effectively update the next round of training of the deep learning models, a diversity preserving co-training algorithm (Section 3.3) is introduced.

The goal of the diversity preserving algorithm is to make sure that H_t captures as diverse intra-class attributes as possible for each category of each modality. To this end, convex clustering [Lashkari and Golland, 2007] is utilized to discover latent attributes over each category of each modality based on L_t, and it then gives an RGB as well as a depth attribute tag to each object, i.e., the labeled pool can be recorded as L_t = {(I_i, z^RGB_i, D_i, z^depth_i, y_i)}, where the attribute tags z^RGB_i ∈ {1, ..., |Z^RGB|} and z^depth_i ∈ {1, ..., |Z^depth|}. Note that Z^RGB (or Z^depth) is an attribute set integrating all the generated attributes over all categories of the RGB (or depth) modality. Compared to non-convex clustering methods like k-means, convex clustering is guaranteed to converge to the global optimum and automatically finds the optimal number of clusters given a temperature-like parameter. Such a characteristic is important for our method to search for the unknown representative attributes of each category. We then train an extra attribute classifier for each modality, which helps the category classifier select a diversity preserving confident set H_t consisting of uniform samples over each attribute of each category. Note that all the classifiers in Fig. 2 are implemented with a hinge-loss layer.

When no confident examples can be selected from U_t, the iteration stops. Finally, we add a fusion layer and train the whole network end-to-end based on the resulting model of the last iteration.



[Figure 3 diagram: Conv1-Conv5, followed by per-channel Encode1/Encode2/Decode1/Decode2 layers (4096 units each) that reconstruct 64x64 R, G, and B maps from a 148x148x3 input.]

Figure 3: The reconstruction network for the RGB modality, which is the same for the depth modality (we compute 3-channel surface normals to represent the depth data).

Inference. Given an unseen RGB-D object, we utilize the final RGB model, depth model, and fusion model to predict the category label, respectively. In the experiments, we compare the performance of our method with the state of the art evaluated on each modality as well as on both.

3.2 Reconstruction Networks for Pretraining
The architecture of our reconstruction network for each modality is shown in Fig. 3. It consists of 5 convolutional layers (with the same structure as the convolutional layers in Fig. 2) and 12 fully connected layers that decode each channel of the inputs. Note that the depth data is represented as 3-channel surface normals in this paper, since previous studies [Bo et al., 2012; Cheng et al., 2015c] have demonstrated that surface normals capture more robust geometry cues of objects than the original depth data. For simplicity, we still use the term "depth" instead of surface normals in this paper.

Both the labeled and unlabeled data are utilized to train the reconstruction network of each modality. Specifically, the input of the network is a rescaled RGB or depth image x ∈ R^{148×148×3}. The corresponding output is a reconstructed map R(x) ∈ R^{64×64×3} with downsampled resolution due to memory and computational cost. We train the network by minimizing the mean square reconstruction error

$$ \mathrm{Loss}^v_R = \frac{1}{M+N} \sum_{x \in \{L^v, U^v\}} \sum_{ch=1}^{3} \left\| \bar{x}_{ch} - R_{ch}(x) \right\|^2, \quad (1) $$

where M, N are the sizes of the labeled and unlabeled pools in the beginning, ch denotes the channel index of the input, and the modality v represents the RGB or depth data. The ground truth x̄ ∈ R^{64×64×3} is obtained by resizing x via bilinear interpolation.

We use the standard back-propagation algorithm based on stochastic gradient descent (SGD) to optimize the reconstruction network. When the network of each modality achieves convergence, the parameters of the convolutional layers (conv1, ..., conv5 in Fig. 3) are utilized to initialize the corresponding convolutional layers of each modality in the proposed framework of Fig. 2. The experiments will show that such a pretraining strategy is able to largely improve the generalization ability of the RGB- and depth-DCNN models for object recognition, even though very limited labeled samples in L are available to supervise the training.
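A minimal PyTorch-style sketch of the per-channel decoding head and the loss in Eq. (1) is given below; it is an illustration under our own assumptions, not the released implementation (the layer sizes follow Fig. 3, and the AlexNet conv1-conv5 trunk that produces conv_feat is abbreviated).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelDecoder(nn.Module):
    """One per-channel head of Fig. 3: Encode1/Encode2/Decode1/Decode2 (4096 units each)
    mapping flattened conv5 features to a reconstructed 64x64 map."""
    def __init__(self, conv_feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(conv_feat_dim, 4096), nn.ReLU(inplace=True),   # Encode1
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),            # Encode2
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),            # Decode1
            nn.Linear(4096, 64 * 64),                                # Decode2
        )

    def forward(self, conv_feat):
        return self.net(conv_feat).view(-1, 64, 64)

def reconstruction_loss(conv_feat, x, decoders):
    """Eq. (1) for one mini-batch: x is the (batch, 3, 148, 148) input, conv_feat the flattened
    conv5 features, decoders a list of three ChannelDecoder modules (one per channel)."""
    target = F.interpolate(x, size=(64, 64), mode='bilinear', align_corners=False)
    loss = 0.0
    for ch in range(3):
        recon = decoders[ch](conv_feat)
        loss = loss + ((target[:, ch] - recon) ** 2).sum(dim=(1, 2)).mean()
    return loss
```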

3.3 Diversity Preserving Co-Training
The goal of the diversity preserving co-training algorithm is to select highly confident examples with predicted labels from the unlabeled pool, while keeping these examples uniformly covering each category as well as each intra-class attribute of each category. Such newly labeled examples produced by one DCNN model can greatly boost the performance of the other one in the next round of training. We now introduce the three main components of the iterative algorithm. For clarity, we sometimes omit the iteration number t in the equations below.

Figure 4: A sketch map illustrating convex clustering over the category "coffee mug" in the labeled pool. For the RGB modality it finds 5 latent attributes, while the depth modality yields 3 attributes (fc7 features are used for convex clustering).

Convex Clustering. We apply convex clustering [Lashkari and Golland, 2007] to discover the diverse intra-class attributes of each category of each modality independently, based on the labeled pool L. Convex clustering solves the clustering problem by maximizing the following log-likelihood function

$$ l(\{q_x\}; L^v_c) = \frac{1}{|L^v_c|} \sum_{x \in L^v_c} \log \Big[ \sum_{x' \in L^v_c} q_{x'} \, e^{-\beta d_\phi(x, x')} \Big] \quad \text{s.t.} \;\; \sum_{x \in L^v_c} q_x = 1, \; q_x \ge 0, \quad (2) $$

where the category c ∈ {1, ..., C}, the modality v denotes the RGB or depth data, and d_φ(x, x') = ‖φ(x) − φ(x')‖² is the Euclidean distance between the DCNN features of two examples (fc7 is used). The scalar weight q_x denotes the representative degree of exemplar x, while β is a positive temperature-like parameter that controls the sparseness of {q_x} (q_x > 0 indicates that x is a cluster center, and q_x = 0 denotes an ordinary exemplar). Given β, we follow [Lashkari and Golland, 2007] to optimize the likelihood function for each category of each modality individually. Due to space limits, please refer to [Lashkari and Golland, 2007] for more details. As a result, we obtain two cluster sets

$$ Z^{RGB} = \{Z^{RGB}_1, \cdots, Z^{RGB}_C\}, \qquad Z^{depth} = \{Z^{depth}_1, \cdots, Z^{depth}_C\}, \quad (3) $$

where Z^v_c is a subset that contains all the clusters generated on category c of modality v. Fig. 4 illustrates, as an example, how convex clustering is applied to discover latent attributes for the category "coffee mug".
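For illustration, the log-likelihood in Eq. (2) can be maximized with a simple fixed-point (EM-style) iteration in the spirit of Lashkari and Golland; the numpy sketch below is our own simplification (one category of one modality at a time, fc7 features as rows of X), not the paper's code.

```python
import numpy as np

def convex_clustering(X, beta=1.0, n_iters=200, tol=1e-8):
    """Maximize Eq. (2) over the exemplar weights q for one category of one modality.
    X: (n, d) array of fc7 features. Returns q; entries with q[k] > 0 mark exemplar k
    as a cluster centre, i.e., a discovered latent attribute."""
    n = X.shape[0]
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # d_phi(x_j, x_k)
    kernel = np.exp(-beta * sq_dist)
    q = np.full(n, 1.0 / n)                                    # uniform initialization
    for _ in range(n_iters):
        p = kernel * q[None, :]                                # p(k | j) proportional to q_k * exp(-beta * d_jk)
        p /= p.sum(axis=1, keepdims=True)
        q_new = p.mean(axis=0)                                 # fixed-point (EM-style) update, keeps sum(q) = 1
        if np.abs(q_new - q).max() < tol:
            q = q_new
            break
        q = q_new
    return q

# exemplars with non-negligible weight are kept as the category's latent attributes, e.g.
# attributes = np.where(convex_clustering(features) > 1e-3)[0]
```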

Multitask Learning. For each modality, we define each cluster as an attribute, and assign every exemplar to its closest cluster to obtain its attribute label. We then train the DCNN model for both category and attribute recognition, as shown in Fig. 2. The loss function of the multitask learning for each modality is

$$ \mathrm{Loss}^v_{MT} = \sum_{x \in L^v} \max\big(0, 1 - y\,\varphi^v_{cat}(x)\big) + \lambda \sum_{x \in L^v} \max\big(0, 1 - z\,\varphi^v_{attr}(x)\big), \quad (4) $$

where y, z denote the ground-truth category label and attribute label for exemplar x of modality v, while φ^v_cat, φ^v_attr are the corresponding predicted probabilities of the DCNN model. We fix the coefficient λ = 1 in the experiments.
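A sketch of how Eq. (4) could be computed with a multi-class hinge loss follows (our own PyTorch illustration; the paper states only that every classifier is a hinge-loss layer and writes the binary hinge form, so the multi-class variant here is an assumption):

```python
import torch.nn as nn

hinge = nn.MultiMarginLoss(margin=1.0)   # multi-class hinge loss

def multitask_loss(cat_scores, attr_scores, y_cat, y_attr, lam=1.0):
    """Eq. (4): category hinge loss plus lambda times the attribute hinge loss.
    cat_scores: (batch, C) category scores; attr_scores: (batch, |Z^v|) attribute scores."""
    return hinge(cat_scores, y_cat) + lam * hinge(attr_scores, y_attr)
```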

Co-Training. Finally, the two well-trained attribute DCNN models are utilized to predict the unlabeled pool U over the respective modalities. A highly confident set H^v for the RGB or depth data is selected as follows:

$$ H^v = \bigcup_{z \in Z^v} \big\{ (x, z) \;\big|\; \mathrm{score}^v_{attr}(z|x) > \tau, \; x \in U^v \big\}, \quad (5) $$

where score^v_attr(z|x) = f(φ^v_attr(z|x)) is the predicted score for x carrying the attribute z via a softmax function f, and τ is a score threshold. To further keep the data balanced and accurate, we only reserve the top K examples with the highest scores for each attribute in H^v. For each remaining exemplar x ∈ H^v, it is then easy to obtain its category label y from the predicted attribute label z according to Eq. (3). We now attach both the RGB and depth data to x and update the labeled pool as

$$ L_{t+1} = L_t \cup H^{RGB} \cup H^{depth}, \quad (6) $$

where every exemplar in L_{t+1} still contains pairwise RGB-D data with a category label (the attribute labels will be updated by the next round of convex clustering). During the next round of training, H^RGB can greatly improve the depth-DCNN model, as these are new and informative labeled samples; the same holds for H^depth and the RGB-DCNN model.
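The selection of H^v in Eq. (5) and the pool update in Eq. (6) might look as follows (a numpy sketch with assumed shapes and names: softmax over attribute scores, threshold τ, top-K per attribute; not the authors' implementation):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def select_confident(attr_scores, tau=0.5, top_k=20):
    """Eq. (5): for each attribute z, keep the unlabeled examples whose softmax score for z
    exceeds tau, retaining only the top-K of them per attribute for balance.
    attr_scores: (N_unlabeled, n_attributes). Returns a list of (example index, attribute)."""
    probs = softmax(attr_scores)
    confident = []
    for z in range(probs.shape[1]):
        idx = np.where(probs[:, z] > tau)[0]
        idx = idx[np.argsort(-probs[idx, z])][:top_k]          # top-K highest scores per attribute
        confident.extend((int(i), z) for i in idx)
    return confident

# Eq. (6): every selected example joins the labeled pool with both modalities and the
# category that its attribute belongs to (attribute-to-category lookup from Eq. (3)), e.g.
#   labeled_pool += [(rgb[i], depth[i], attr_to_category[z]) for i, z in confident]
```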

4 Experiments

4.1 Experimental Setup
Dataset. We perform our experiments on the Washington RGB-D dataset [Lai et al., 2011a] captured by Microsoft Kinect. The dataset consists of 300 household objects, grouped into 51 categories. Each object is imaged from 3 vertical angles as well as multiple horizontal angles, resulting in roughly 600 images per object. We subsample every 5th frame of each instance and obtain around 41,877 images in total for category recognition.

To evaluate our semi-supervised learning, we first utilize one of the 10 random splits provided by [Lai et al., 2011a] to divide the dataset into a training set and a testing set. For any split, there are around 35,000 examples for training and around 6,877 for testing. Then we randomly label 5% of the samples (around 1,750) of the training set and leave the rest unlabeled (around 33,250). Finally, we train our model based on both the labeled and unlabeled data in the training set, and evaluate its performance on the testing set.
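For concreteness, the subsampling and 5% labeling split could be set up as in the following sketch (assumed data structures and function names, not the authors' scripts):

```python
import random

def make_split(frames_by_instance, test_instances, label_fraction=0.05, step=5, seed=0):
    """Subsample every `step`-th frame of each object instance, hold out the instances of one
    test split, and randomly label `label_fraction` of the remaining training frames."""
    rng = random.Random(seed)
    train, test = [], []
    for instance, frames in frames_by_instance.items():
        kept = frames[::step]                                    # every 5th frame
        (test if instance in test_instances else train).extend(kept)
    rng.shuffle(train)
    n_labeled = int(label_fraction * len(train))                 # ~1,750 of ~35,000 frames at 5%
    return train[:n_labeled], train[n_labeled:], test            # labeled, unlabeled, test
```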

Besides semi-supervised methods, we also compare our approach to the existing powerful supervised methods, for which all the objects in the training set are manually labeled to train their classifiers. All the experiments are repeated 10 times based on the given 10 splits, and the average accuracies are reported for comparison.

Table 1: Comparison of recent results on the Washington RGB-D object database for category recognition (accuracy in %).

Supervised Methods                  | Depth      | RGB        | Combine
[Lai et al., 2011a] linear SVM      | 53.1 ± 1.7 | 74.3 ± 3.3 | 81.9 ± 2.8
[Lai et al., 2011a] kernel SVM      | 64.7 ± 2.2 | 74.5 ± 3.1 | 83.8 ± 3.5
[Lai et al., 2011a] random forest   | 66.8 ± 2.5 | 74.7 ± 3.6 | 79.6 ± 4.0
[Lai et al., 2011b] IDL             | 70.2 ± 2.0 | 78.6 ± 3.1 | 85.4 ± 3.2
[R.C. et al., 2012] 3D SPMK         | 67.8       | –          | –
[Bo et al., 2011a] KDES             | 78.8 ± 2.7 | 77.7 ± 1.9 | 86.2 ± 2.1
[Blum et al., 2012] CKM             | –          | –          | 86.4 ± 2.3
[Bo et al., 2011b] HMP              | 70.3 ± 2.2 | 74.7 ± 2.5 | 82.1 ± 3.3
[Bo et al., 2012] SP-HMP            | 81.2 ± 2.3 | 82.4 ± 3.1 | 87.5 ± 2.9
[Socher et al., 2012] CNN-RNN       | 78.9 ± 3.8 | 80.8 ± 4.2 | 86.8 ± 3.3
[Schwarz et al., 2015] CNN          | –          | 83.1 ± 2.0 | 89.4 ± 1.3
[Jhuo et al., 2015] R2ICA           | 83.9 ± 2.8 | 85.7 ± 2.7 | 89.6 ± 3.8
[Eitel et al., 2015] FusCNN (HHA)   | 83.0 ± 2.7 | 84.1 ± 2.7 | 91.0 ± 1.9
[Eitel et al., 2015] FusCNN (jet)   | 83.8 ± 2.7 | 84.1 ± 2.7 | 91.3 ± 1.4
[Cheng et al., 2015a] warping       | –          | –          | 92.7 ± 1.0
[Wang et al., 2015] MMSS            | 75.6 ± 2.7 | 74.6 ± 2.9 | 88.5 ± 2.2
[Cheng et al., 2015b] CFK           | 85.8 ± 2.3 | 86.8 ± 2.2 | 91.2 ± 1.5

Semi-Supervised Methods             | Depth      | RGB        | Combine
[Cheng et al., 2014] CT+SVM1        | 71.8 ± 0.8 | 77.1 ± 2.3 | 81.6 ± 1.4
[Cheng et al., 2015c] CT+SVM2       | 75.4 ± 2.4 | 78.7 ± 1.4 | 83.7 ± 1.3
Our approach                        | 82.6 ± 2.3 | 85.5 ± 2.0 | 89.2 ± 1.3

Parameter Setting of Our Approach. We fix τ = 0.5, K = 20, λ = 1 for our semi-supervised learning method, although dynamically fine-tuning each parameter could result in better performance. For the reconstruction network of each modality, we use a mini-batch of b = 128 images and an initial learning rate η = 10^-5, multiplying the learning rate by 0.1 every s = 4000 iterations. For the training of the RGB- and depth-DCNN models for recognition during every iteration, we set b = 128, η = 10^-7, and s = 3000. Note that we only apply convex clustering for the first 5 iterations in consideration of efficiency, and then keep the attribute labels unchanged for the remaining iterations (around 400 attributes for RGB and 280 attributes for depth in the end).
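The learning-rate schedule described above corresponds to a simple step decay; a PyTorch sketch (our own illustration with a placeholder model, parameter names assumed):

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 51)                                     # stand-in for one DCNN stream
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)        # eta = 1e-5 (reconstruction stage)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4000, gamma=0.1)
# per mini-batch of b = 128 images:
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```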

4.2 Overall Performance
Table 1 presents the recognition accuracies of all recent methods. We can find that: 1) with only 5% labeled data, our method achieves very promising results on each modality (depth: 82.6%, RGB: 85.5%, both: 89.2%), demonstrating the effectiveness of the proposed semi-supervised multimodal deep learning framework based on the diversity preserving co-training algorithm; 2) compared to other semi-supervised methods [Cheng et al., 2014; 2015c], which adopted the co-training algorithm directly to retrain the RGB- and depth-SVM classifiers iteratively based on features extracted in advance, our end-to-end deep learning system shows a large improvement (nearly a 7% increase on each modality) for object recognition;



[Figure 5 plots: (a) classification accuracy (%) with reconstruction-network pretraining (depth 82.6, RGB 85.5, combine 89.2) versus ImageNet pretraining (81.4, 81.9, 86.6); (b) accuracy versus the score threshold; (c) and (d) accuracy versus the iteration number for the depth, RGB, and combined models.]

Figure 5: Performance analysis of the semi-supervised multimodal deep learning framework. See details in the text.

3) Our method is comparable to all state-of-the-art supervised methods, except for that of [Cheng et al., 2015a], which employs dense matching to obtain a query-adaptive similarity measure for RGB-D object recognition. Despite achieving the best performance, their algorithm is very time-consuming due to the plentiful dense matching required for a single query object, restricting its potential in practical use. Our method is efficient at test time, since recognition only needs one forward propagation of the network.

Furthermore, we also run our method on the fully supervised training data, and the average results are 84.0% accuracy for depth, 86.3% for RGB, and 91.3% for both, which is only slightly superior to our method based on just 5% labeled data. The results further demonstrate that our semi-supervised learning approach is able to make use of the unlabeled data very effectively. Readers may wonder why the deep learning model does not surpass traditional methods like Fisher kernel encoding [Cheng et al., 2015b] and the dense matching method [Cheng et al., 2015a]. We think the main reason is the scale of the Washington dataset, which is relatively small compared to ImageNet [Deng et al., 2009]. If given more labeled or unlabeled RGB-D data, we believe our method can achieve much higher performance for object recognition.

4.3 Detailed Analysis
In this section, we analyze the effects of the reconstruction networks, the score threshold τ, the diversity preserving co-training, and the initial labeled size on the performance of our semi-supervised multimodal deep learning framework. Note that we evaluate each of them while keeping all other settings the same as in Section 4.1.

The effect of the reconstruction networks. To demonstrate the effectiveness of our reconstruction networks for pretraining, we compare them with a popular pretraining scheme, which utilizes the AlexNet model [Krizhevsky et al., 2012] pretrained on ImageNet to initialize the parameters of both the RGB- and depth-DCNN models. As shown in Fig. 5 (a), the reconstruction networks better boost the performance of our semi-supervised learning. Our explanation is that, compared to the knowledge learned from other domains like ImageNet, a reconstruction network trained on the RGB-D data is able to learn cues better suited to RGB-D object representation.

The effect of the score threshold τ. As shown in Fig. 5 (b), our semi-supervised learning is robust to τ when 0.3 < τ < 0.7. Such a characteristic is very important in practical usage, since a wide range of τ keeps the algorithm successful. When τ is smaller than 0.3, performance drops a little because some unconfident examples (probably with wrongly predicted labels) can be involved and distract the next round of supervised training. When τ is larger than 0.7, the performance of our method begins to drop quickly. This is reasonable, since very few examples can then be added to the labeled pool to benefit the network training. Note that we always constrain the added confident examples of each attribute to be no more than K = 20 for balance and accuracy, whatever value τ is set to.

The effect of the diversity preserving co-training. As shown in Fig. 5 (c), our diversity preserving co-training algorithm can significantly increase the capability of each DCNN model along with the iterations. When we employ the conventional co-training algorithm without the diversity preserving constraint, the improvements are much inferior, as shown in Fig. 5 (d). We attribute this to the conventional co-training being prone to produce a biased labeled pool, which greatly limits the potential of each DCNN model.

The effect of the initial labeled size. When the initial labeled fraction of the training set is changed from 1% to 10%, the recognition accuracies of our method increase from (68.3%, 79.3%, 84.4%) to (83.8%, 86.1%, 91.2%) for (depth, RGB, combine). Adding more labeled examples does not show obvious improvements, as 10% labeled data is already sufficient for our method to learn from the unlabeled data successfully.

5 Conclusion

This paper proposes a semi-supervised multimodal deep learning framework for RGB-D object recognition, which is capable of reducing the dependence of deep learning methods on large-scale manually labeled RGB-D data. The framework has two key components: 1) the reconstruction networks for good initialization, and 2) the diversity preserving co-training algorithm for effective semi-supervised learning. Experimental results on the Washington RGB-D benchmark dataset demonstrate the effectiveness of our approach.



Acknowledgments
This work is funded by the National Basic Research Program of China (Grant No. 2012CB316302), the National Natural Science Foundation of China (Grant No. 61322209), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDA06040102).

References

[Balcan et al., 2004] Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In NIPS, 2004.

[Blum and Mitchell, 1998] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.

[Blum et al., 2012] Manuel Blum, Jost Tobias Springenberg, Jan Wulfing, and Martin Riedmiller. A learned feature descriptor for object recognition in RGB-D data. In ICRA, 2012.

[Bo et al., 2011a] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Depth kernel descriptors for object recognition. In IROS, 2011.

[Bo et al., 2011b] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Hierarchical matching pursuit for image classification: architecture and fast algorithms. In NIPS, 2011.

[Bo et al., 2012] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Unsupervised feature learning for RGB-D based object recognition. In ISER, June 2012.

[Cheng et al., 2014] Yanhua Cheng, Xin Zhao, Kaiqi Huang, and Tieniu Tan. Semi-supervised learning for RGB-D object recognition. In ICPR, 2014.

[Cheng et al., 2015a] Yanhua Cheng, Rui Cai, Chi Zhang, Zhiwei Li, Xin Zhao, Kaiqi Huang, and Yong Rui. Query adaptive similarity measure for RGB-D object recognition. In ICCV, 2015.

[Cheng et al., 2015b] Yanhua Cheng, Rui Cai, Xin Zhao, and Kaiqi Huang. Convolutional Fisher kernels for RGB-D object recognition. In 3DV, 2015.

[Cheng et al., 2015c] Yanhua Cheng, Xin Zhao, Kaiqi Huang, and Tieniu Tan. Semi-supervised learning and feature evaluation for RGB-D object recognition. Computer Vision and Image Understanding, 139:149–160, 2015.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[Eitel et al., 2015] Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard. Multimodal deep learning for robust RGB-D object recognition. In IROS, 2015.

[Gupta et al., 2014] Saurabh Gupta, Ross Girshick, Pablo Arbelaez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.

[Jhuo et al., 2015] I-Hong Jhuo, Shenghua Gao, Liansheng Zhuang, DT Lee, and Yi Ma. Unsupervised feature learning for RGB-D image classification. In ACCV, 2015.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[Lai et al., 2011a] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, 2011.

[Lai et al., 2011b] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Sparse distance learning for object recognition combining RGB and depth information. In ICRA, 2011.

[Lashkari and Golland, 2007] Danial Lashkari and Polina Golland. Convex clustering with exemplar-based models. In NIPS, 2007.

[R.C. et al., 2012] Carolina R.C., Roberto J. Lopez-Sastre, Javier Acevedo-Rodriguez, and Saturnino Maldonado-Bascon. Surfing the point clouds: selective 3D spatial pyramids for category-level object recognition. In CVPR, 2012.

[Schwarz et al., 2015] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In ICRA, 2015.

[Socher et al., 2012] Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, and Andrew Ng. Convolutional-recursive deep learning for 3D object classification. In NIPS, 2012.

[Wang et al., 2015] Anran Wang, Jianfei Cai, Jiwen Lu, and Tat-Jen Cham. MMSS: Multi-modal sharable and specific feature learning for RGB-D object recognition. In ICCV, 2015.

[Weston et al., 2012] Jason Weston, Frederic Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[Zhu, 2005] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
