Learning with a Wasserstein Loss (NIPS 2015)

(NIPS2015) Learning with a Wasserstein Loss

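Graduate School of Advanced Science and Engineering, Department of Electrical Engineering and Bioscience, Murata Laboratory (Information Learning Systems Laboratory)

Hayato Watanabe (渡邊隼人), first-year master's student

Pronunciation of "Wasserstein": ワッサースタイン

2015/Dec/18 Top Machine Learning Conference Reading Group vol.1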

What we want to do 3

Multi-label prediction: what labels (tags) would you give this photo?

Labels (tags) given by Flickr users
• water
• boat
• reflection

We want to predict multiple labels from a photo.

Problem setting 4

Image

Labels: water, boat, reflection

Vector of all labels: water, boat, reflection, education, weather, cow, spring, race, training, agriculture, …

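Problem setting 5

Image; vector of all labels; encoding of the labels

…

Problem setting 6

Image; vector of all labels; probability of belonging to each label

…

Problem setting 7

Image

…

Probability of belonging to each label

…

Problem setting 8

Image

Mapping (classifier)

…

Find the mapping (classifier).

Usual approach 9

Image

Multiclass logistic regression

…

Find the mapping (classifier).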
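The usual approach named on this slide, multiclass logistic regression, can be sketched as a linear score per label followed by a softmax. This is a minimal illustration only: the weights, bias, and feature values below are made-up toy numbers, and the function names are hypothetical rather than anything from the slides or the paper.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, bias, features):
    """One linear score per label, then softmax over the labels."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)

labels = ["water", "boat", "reflection"]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy parameters, not learned
bias = [0.0, 0.0, 0.0]
features = [2.0, 1.0]                            # toy image features

probs = predict_proba(weights, bias, features)
for name, p in zip(labels, probs):
    print(f"{name}: {p:.3f}")
```

Training would then adjust the weights so that the predicted probabilities match the encoded labels, which is where the choice of loss in the following slides comes in.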

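What is a better prediction? 10

Image

Classifier 1: KL loss 1.58

Classifier 2: KL loss 1.58

Probability of belonging to each label (prediction)

Probability of belonging to each label (ground truth)

• The two classifiers have the same KL loss, but isn't classifier 1, which predicts labels related to the true labels, making the better prediction?

• Compared with mistaking "boat" for "lake", mistaking it for "club" is worse. We want to impose a stronger penalty in the latter case.

• Taking the similarity between labels into account makes this penalty possible.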
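A small numeric sketch of the point above. The two predictions are hypothetical and were chosen so that both KL losses come out to log2(3) ≈ 1.58 bits; whether the slide's 1.58 was obtained this way is not stated. The distances are the "boat" row of the example label-distance table that appears later in these slides. With all of the true mass on one label, the transport cost reduces to the expected distance from the predicted labels to that label.

```python
import math

labels = ["water", "boat", "reflection", "river", "lake", "club"]

# Distances from "boat" to every label ("boat" row of the slides' example table).
dist_from_boat = [0.4, 0.0, 0.4, 0.3, 0.1, 0.8]

truth = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]            # all mass on "boat"
pred_lake = [0.0, 1 / 3, 0.0, 0.0, 2 / 3, 0.0]    # hypothetical: leftover mass on "lake"
pred_club = [0.0, 1 / 3, 0.0, 0.0, 0.0, 2 / 3]    # hypothetical: leftover mass on "club"

def kl_bits(p, q):
    """KL(p || q) in bits; terms with p_k = 0 contribute nothing."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def cost_to_point_mass(q, dist_to_target):
    """Transport cost when the target is a single label:
    every unit of predicted mass has to travel to that label."""
    return sum(qk * dk for qk, dk in zip(q, dist_to_target))

kl_lake = kl_bits(truth, pred_lake)
kl_club = kl_bits(truth, pred_club)
w_lake = cost_to_point_mass(pred_lake, dist_from_boat)
w_club = cost_to_point_mass(pred_club, dist_from_boat)

print(f"KL:        lake {kl_lake:.2f}  club {kl_club:.2f}")
print(f"transport: lake {w_lake:.3f}  club {w_club:.3f}")
```

KL cannot tell the two predictions apart, while the distance-aware cost penalizes the "club" mistake several times more heavily than the "lake" mistake.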

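Benefits of taking label similarity into account 11

We want to find the image on the left by keyword search.

True labels
• spring
• race
• training

• Use the predictions of a good classifier

• Use the predictions of a bad classifier

◆ The predicted labels may differ from the true labels, but labels similar to them may still be predicted.

◆ The labeling may even be better than human labeling (in the keyword-search sense).

(mountain road)

Problem setting (recap) 12

Image

Mapping (classifier)

…

Find the mapping (classifier).

Problem setting (revised) 13

Image

Mapping (classifier)

…

Find the mapping (classifier).

The similarity (distance) between labels is also known:

            water  boat  reflection  …
water       0      0.5   0.4         …
boat        0.5    0     0.2         …
reflection  0.4    0.2   0           …
…           …      …     …           …

Problem setting (revised) 14

Image

Mapping (classifier)

…

Find the mapping (classifier).

water       0      0.5   0.4         …
boat        0.5    0     0.2         …
…           …      …     …           …

The KL loss does not take label similarity into account, so we use the Wasserstein loss.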

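KL loss and Wasserstein loss 15

Both are measures of the difference between two distributions.

Distribution of label probabilities (ground truth)

Distribution of label probabilities (prediction)

KL

Does not take the relationships between dimensions into account (because it divides and multiplies dimension by dimension and then takes the sum).

[Figure: three bar charts over the labels water, boat, reflection, river, lake, club; vertical axis from 0.00 to 0.20]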

Wasserstein

[Figure: mass placed on the labels boat, lake, water, reflection, river, club in two different arrangements]

What is the minimum cost required to bring the two into the same state?

The loss is the minimum over all ways of moving the mass.
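Written out, this minimum-cost question is the standard optimal transport problem. The notation below is an assumption made for illustration and is not taken from the slides: $\hat{y}$ is the predicted distribution, $y$ the ground-truth distribution, $D_{ij}$ the distance between labels $i$ and $j$, and $T_{ij}$ the amount of mass moved from label $i$ to label $j$.

```latex
W(\hat{y}, y) \;=\; \min_{T \ge 0} \; \sum_{i,j} T_{ij}\, D_{ij}
\quad \text{subject to} \quad
\sum_{j} T_{ij} = \hat{y}_i \;\; \text{for all } i, \qquad
\sum_{i} T_{ij} = y_j \;\; \text{for all } j .
```

The two constraints say that all of the predicted mass is moved and that it ends up exactly where the ground-truth mass sits.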





Transport distance (distance between labels)

            water  boat  reflection  river  lake  club
water       0      0.4   0.5         0.2    0.3   0.4
boat        0.4    0     0.4         0.3    0.1   0.8
reflection  0.5    0.4   0           0.3    0.3   0.6
river       0.2    0.3   0.3         0      0.1   0.5
lake        0.3    0.1   0.3         0.1    0     0.6
club        0.4    0.8   0.4         0.5    0.6   0





Transport amount (rows: label the mass is taken from; columns: label it is moved to)

            water  boat  reflection  river  lake  club
water       6      0     0           0      0     0
boat        0      1     0           4      1     0
reflection  0      0     1           1      0     4
river       0      0     0           0      0     0
lake        0      0     0           0      0     0
club        0      0     0           0      0     0










The cost of a transport is the sum of the elementwise products of the transport amounts and the transport distances.
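The sum of elementwise products can be checked directly on the two tables from the slides. In this sketch the row and column order of the transport-amount table is assumed to match the distance table (water, boat, reflection, river, lake, club), and `transport_cost` is a hypothetical helper name.

```python
labels = ["water", "boat", "reflection", "river", "lake", "club"]

# Transport distances between labels, copied from the slides.
distance = [
    [0.0, 0.4, 0.5, 0.2, 0.3, 0.4],
    [0.4, 0.0, 0.4, 0.3, 0.1, 0.8],
    [0.5, 0.4, 0.0, 0.3, 0.3, 0.6],
    [0.2, 0.3, 0.3, 0.0, 0.1, 0.5],
    [0.3, 0.1, 0.3, 0.1, 0.0, 0.6],
    [0.4, 0.8, 0.4, 0.5, 0.6, 0.0],
]

# Transport amounts, copied from the slides (row = source, column = destination,
# assuming the same label order as the distance table).
amount = [
    [6, 0, 0, 0, 0, 0],
    [0, 1, 0, 4, 1, 0],
    [0, 0, 1, 1, 0, 4],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

def transport_cost(amount, distance):
    """Sum over all label pairs of (mass moved) x (distance travelled)."""
    return sum(a * d
               for row_a, row_d in zip(amount, distance)
               for a, d in zip(row_a, row_d))

cost = transport_cost(amount, distance)
print(f"total transport cost: {cost:.2f}")
```

Mass that stays on its own label costs nothing, so only the moves boat→river, boat→lake, reflection→river, and reflection→club contribute to the total.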





Why were we considering this loss in the first place?

• Compared with mistaking "boat" for "lake", mistaking it for "club" is worse. We want to impose a stronger penalty in the latter case.

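Optimization 26

For KL: take the partial derivatives with respect to the parameters to obtain the gradient, then apply a gradient method.

For Wasserstein: computing the (sub)gradient takes too much computation time…

The problem was therefore turned into a convex problem (the paper uses an approximation based on entropic regularization). The number of labels then has almost no effect.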
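A minimal sketch of how an entropy-regularized transport plan can be computed with the standard Sinkhorn matrix-scaling iteration, assuming this is the kind of fast existing method the slides refer to. The regularization strength `lam`, the iteration count, and the function name are assumptions, and the two histograms are the row and column sums of the slides' transport-amount table divided by the total mass of 18.

```python
import math

def sinkhorn_plan(a, b, D, lam=5.0, n_iter=1000):
    """Entropy-regularized transport plan between histograms a and b.

    D is the label distance matrix; lam is a hypothetical regularization
    strength (larger values stay closer to exact optimal transport).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-lam * D[i][j]) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Label order: water, boat, reflection, river, lake, club.
D = [
    [0.0, 0.4, 0.5, 0.2, 0.3, 0.4],
    [0.4, 0.0, 0.4, 0.3, 0.1, 0.8],
    [0.5, 0.4, 0.0, 0.3, 0.3, 0.6],
    [0.2, 0.3, 0.3, 0.0, 0.1, 0.5],
    [0.3, 0.1, 0.3, 0.1, 0.0, 0.6],
    [0.4, 0.8, 0.4, 0.5, 0.6, 0.0],
]
a = [6 / 18, 6 / 18, 6 / 18, 0.0, 0.0, 0.0]             # where the mass starts
b = [6 / 18, 1 / 18, 1 / 18, 5 / 18, 1 / 18, 4 / 18]    # where it must end up

T = sinkhorn_plan(a, b, D)
cost = sum(T[i][j] * D[i][j] for i in range(6) for j in range(6))
row_err = max(abs(sum(T[i]) - a[i]) for i in range(6))
col_err = max(abs(sum(T[i][j] for i in range(6)) - b[j]) for j in range(6))
print(f"regularized transport cost: {cost:.4f}")
print(f"marginal errors: rows {row_err:.1e}, columns {col_err:.1e}")
```

Each iteration costs only a pair of matrix-vector products, which is why this kind of scaling is far cheaper than solving the exact linear program for every training example.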

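Case study

Experimental setup 28

Flickr tagged image data
• Training, validation, and test sets of 10,000 images each; 1,000 tags

Feature extraction
• Convolutional Neural Networks (CNNs)

Distance between labels (tags)
• Convert each tag to a unit vector with word2vec and take the Euclidean distance

Problem setting (revised) (recap) 29

Image

Mapping (classifier)

…

Find the mapping (classifier).

water       0      0.5   0.4         …
boat        0.5    0     0.2         …
…           …      …     …           …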
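The label distances in the experimental setup come from embedding each tag as a unit vector and taking Euclidean distances. A sketch of that computation follows; the three-dimensional vectors are made up for illustration and are not real word2vec embeddings, and the helper names are hypothetical.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity of two unit vectors (just their dot product)."""
    return sum(a * b for a, b in zip(u, v))

# Toy embeddings, invented for this example only.
embeddings = {
    "boat": [0.9, 0.1, 0.2],
    "lake": [0.8, 0.3, 0.1],
    "club": [0.1, 0.9, 0.4],
}
units = {tag: unit(vec) for tag, vec in embeddings.items()}

d_boat_lake = euclidean(units["boat"], units["lake"])
d_boat_club = euclidean(units["boat"], units["club"])
print(f"boat-lake: {d_boat_lake:.3f}")
print(f"boat-club: {d_boat_club:.3f}")
```

For unit vectors the squared Euclidean distance equals 2 − 2·cos, so every distance lies between 0 and 2 and ranks tag pairs in the same order as cosine similarity.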


Experimental setup 30

Evaluation metric
• top-K cost

            water  boat  reflection  river  lake  club
water       0      0.4   0.5         0.2    0.3   0.4
boat        0.4    0     0.4         0.3    0.1   0.8
reflection  0.5    0.4   0           0.3    0.3   0.6
river       0.2    0.3   0.3         0      0.1   0.5
lake        0.3    0.1   0.3         0.1    0     0.6
club        0.4    0.8   0.4         0.5    0.6   0

Ground truth

Predictions 1, 2

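Experimental setup 31

Flickr tagged image data
• Training, validation, and test sets of 10,000 images each; 1,000 tags

Feature extraction
• Convolutional Neural Networks (CNNs)

Distance between labels (tags)
• Convert each tag to a unit vector with word2vec and take the Euclidean distance

Evaluation metrics
• top-K cost (good if the predicted labels are close in meaning to the true labels)
• AUC (good if the true labels are ranked as high as possible in the prediction)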
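For the AUC metric, one common definition is the fraction of (true label, other label) pairs in which the true label receives the higher score, with ties counted as half. The sketch below applies that pairwise definition to a single image with made-up scores; how the paper aggregates AUC over images or tags is not stated in these slides.

```python
def auc(scores, is_positive):
    """Pairwise-ranking AUC for one image: probability that a true label
    is scored above a non-true label (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = ["water", "boat", "reflection", "river", "lake", "club"]
scores = [0.30, 0.25, 0.05, 0.20, 0.15, 0.05]     # hypothetical predicted probabilities
truth = [True, True, True, False, False, False]   # water, boat, reflection are the true tags

value = auc(scores, truth)
print(f"AUC: {value:.3f}")
```

An AUC of 1.0 would mean every true label is ranked above every other label, while 0.5 corresponds to a random ranking.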

Experimental results | top-K cost 32

[Figure 5 plots. Horizontal axis: K (# of proposed tags), 5 to 20. Vertical axis: top-K cost, 0.70 to 1.00. Legend (Loss Function): Divergence, Wasserstein (α=0.5). Panels: (a) Original Flickr tags dataset; (b) Reduced-redundancy Flickr tags dataset.]

Figure 5: Top-K cost comparison of the proposed loss (Wasserstein) and the baseline (Divergence).

probability of the true digit goes to 1 while the probability for all other digits goes to 0. As p increases, the predictions become more evenly distributed over the neighboring digits, converging to a uniform distribution as p → ∞.⁵

6.2 Flickr tag prediction

We apply the Wasserstein loss to a real world multi-label learning problem, using the recently released Yahoo/Flickr Creative Commons 100M dataset [23].⁶ Our goal is tag prediction: we select 1000 descriptive tags along with two random sets of 10,000 images each, associated with these tags, for training and testing. We derive a distance metric between tags by using word2vec [24] to embed the tags as unit vectors, then taking their Euclidean distances. To extract image features we use MatConvNet [25]. Note that the set of tags is highly redundant and often many semantically equivalent or similar tags can apply to an image. The images are also partially tagged, as different users may prefer different tags. We therefore measure the prediction performance by the top-K cost, defined as $C_K = \frac{1}{K} \sum_{k=1}^{K} \min_j d_{\mathcal{K}}(\hat{\kappa}_k, \kappa_j)$, where $\{\kappa_j\}$ is the set of groundtruth tags, and $\{\hat{\kappa}_k\}$ are the tags with highest predicted probability. The standard AUC measure is also reported.
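The top-K cost defined above can be sketched in a few lines. The distance table used here is the illustrative one from the slides, not the word2vec distances of the actual experiment, and the two probability vectors are made up so that their top three tags mirror the proposals quoted for the water/boat/reflection image (without its fourth user tag, sunshine, which is not in this six-label table); `top_k_cost` is a hypothetical helper name.

```python
def top_k_cost(pred_probs, true_tags, distance, k):
    """C_K: average, over the K most probable tags, of the distance
    from each proposed tag to its nearest ground-truth tag."""
    ranked = sorted(range(len(pred_probs)), key=lambda i: pred_probs[i], reverse=True)
    top = ranked[:k]
    return sum(min(distance[i][j] for j in true_tags) for i in top) / k

# Label order: water, boat, reflection, river, lake, club.
distance = [
    [0.0, 0.4, 0.5, 0.2, 0.3, 0.4],
    [0.4, 0.0, 0.4, 0.3, 0.1, 0.8],
    [0.5, 0.4, 0.0, 0.3, 0.3, 0.6],
    [0.2, 0.3, 0.3, 0.0, 0.1, 0.5],
    [0.3, 0.1, 0.3, 0.1, 0.0, 0.6],
    [0.4, 0.8, 0.4, 0.5, 0.6, 0.0],
]
true_tags = [0, 1, 2]                                   # water, boat, reflection

proposed = [0.40, 0.05, 0.05, 0.30, 0.20, 0.00]         # top 3: water, river, lake
baseline = [0.30, 0.05, 0.05, 0.35, 0.00, 0.25]         # top 3: river, water, club

cost_proposed = top_k_cost(proposed, true_tags, distance, k=3)
cost_baseline = top_k_cost(baseline, true_tags, distance, k=3)
print(f"proposed: {cost_proposed:.3f}  baseline: {cost_baseline:.3f}")
```

A proposed tag that is itself a ground-truth tag contributes zero, and a near-synonym contributes little, so the metric rewards semantically close proposals even when they do not match the user's tags exactly.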

We find that a linear combination of the Wasserstein loss $W_p^p$ and the standard multiclass logistic loss KL yields the best prediction results. Specifically, we train a linear model by minimizing $W_p^p + \alpha \mathrm{KL}$ on the training set, where α controls the relative weight of KL. Note that KL taken alone is our baseline in these experiments. Figure 5a shows the top-K cost on the test set for the combined loss and the baseline KL loss. We additionally create a second dataset by removing redundant labels from the original dataset: this simulates the potentially more difficult case in which a single user tags each image, by selecting one tag to apply from amongst each cluster of applicable, semantically similar tags. Figure 5b shows that performance for both algorithms decreases on the harder dataset, while the combined Wasserstein loss continues to outperform the baseline.

In Figure 6, we show the effect on performance of varying the weight α on the KL loss. We observe that the optimum of the top-K cost is achieved when the Wasserstein loss is weighted more heavily than at the optimum of the AUC. This is consistent with a semantic smoothing effect of Wasserstein, which during training will favor mispredictions that are semantically similar to the ground truth, sometimes at the cost of lower AUC.⁷ We finally show two selected images from the test set in Figure 7. These illustrate cases in which both algorithms make predictions that are semantically relevant, despite overlapping very little with the ground truth. The image on the left shows errors made by both algorithms. More examples can be found in the appendix.

⁵ To avoid numerical issues, we scale down the ground metric such that all of the distance values are in the interval [0, 1).

⁶ The dataset used here is available at http://cbcl.mit.edu/wasserstein.

⁷ The Wasserstein loss can achieve a similar trade-off by choosing the metric parameter p, as discussed in Section 6.1. However, the relationship between p and the smoothing behavior is complex and it can be simpler to implement the trade-off by combining with the KL loss.

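Plot annotations: bad / good (a lower top-K cost is better).

The proposed loss does better than ordinary logistic regression that uses KL as the loss function.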

Experimental results | top-K cost & AUC 33

Plot annotations: bad / good.

Take the loss to be Wasserstein + αKL and vary α.

[Figure 6 plots. Horizontal axis: α, 0.0 to 2.0. One plot shows the top-K cost (0.65 to 0.95) for K = 1, 2, 3, 4; the other shows AUC (0.54 to 0.64) for Wasserstein AUC and Divergence AUC. Panels: (a) Original Flickr tags dataset; (b) Reduced-redundancy Flickr tags dataset.]

Figure 6: Trade-off between semantic smoothness and maximum likelihood.

(a) Flickr user tags: street, parade, dragon; our proposals: people, protest, parade; baseline proposals: music, car, band.

(b) Flickr user tags: water, boat, reflection, sunshine; our proposals: water, river, lake, summer; baseline proposals: river, water, club, nature.

Figure 7: Examples of images in the Flickr dataset. We show the groundtruth tags as well as tags proposed by our algorithm and the baseline.

7 Conclusions and future work

In this paper we have described a loss function for learning to predict a non-negative measure over a finite set, based on the Wasserstein distance. Although optimizing with respect to the exact Wasserstein loss is computationally costly, an approximation based on entropic regularization is efficiently computed. We described a learning algorithm based on this regularization and we proposed a novel extension of the regularized loss to unnormalized measures that preserves its efficiency. We also described a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space, and we demonstrated this property on a real-data tag prediction problem, showing improved performance over a baseline that doesn’t incorporate the metric.

An interesting direction for future work may be to explore the connection between the Wasserstein loss and Markov random fields, as the latter are often used to encourage smoothness of predictions, via inference at prediction time.

Plot annotations: bad / good.

If you want to predict semantically close labels, give the Wasserstein term more weight.

Experimental results | actual label (tag) predictions 34

Ground truth: zoo, run, mark
Proposed: running, summer, fun
Baseline: running, country, lake

Ground truth: travel, architecture, tourism
Proposed: sky, roof, building
Baseline: art, sky, beach

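Summary (contributions) 35

• The Wasserstein loss was used in supervised learning for the first time.

• Applied to a multi-label prediction problem, it predicted labels that may not match the ground-truth labels but are semantically close to them.

• The Wasserstein computation, which is slow if done directly, was carried out efficiently with a fast method (the method itself already existed).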

Appendix
