
Wrocław University of Technology

Various applications of restricted Boltzmann machines for bad quality training data

Maciej Zięba

Wroclaw University of Technology

20.06.2014

Motivation
Big data - 7 dimensions¹

Volume: size of data.

Velocity: speed, displacement of data.

Variety: diversity of data.

Viscosity: measures the resistance to flow in the volume of data.

Virality: measures how fast data is distributed and shared between nodes in a network (e.g. the Internet).

Veracity: trust and quality of the data.

Value: what is the added value that Big Data should bring?

¹ According to the ATOS company.


Veracity of Data
Typical problems with data - training context

Imbalanced data problem. One class dominates another in the training data.

Noisy labels problem. Some of the examples in the training data contain incorrectly assigned labels.

Missing values issue. Values of some features are unknown.

Unstructured data. The data is represented in unprocessed form: images, videos, documents, XML structures.

Semi-supervised data. Some portion of the training data is unlabelled.

[Figure: example of imbalanced data]


Methods
Restricted Boltzmann Machines (RBM)

An RBM is a bipartite Markov Random Field with visible and hidden units.

The joint distribution of visible and hidden units is the Gibbs distribution:

p(x, h | θ) = (1/Z) exp(−E(x, h | θ))

For binary visible units x ∈ {0, 1}^D and hidden units h ∈ {0, 1}^M the energy function is as follows:

E(x, h | θ) = −xᵀWh − bᵀx − cᵀh

Because there are no visible-to-visible or hidden-to-hidden connections, we have:

p(x_i = 1 | h, W, b) = sigm(W_{i·}h + b_i)
p(h_j = 1 | x, W, c) = sigm((W_{·j})ᵀx + c_j)
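The energy function and the two factorized conditionals above can be sketched directly in NumPy. This is a minimal illustration of the formulas, not the author's implementation; the array shapes (D visible units, M hidden units) follow the slide's notation:

```python
import numpy as np

def sigm(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def rbm_energy(x, h, W, b, c):
    """E(x, h | theta) = -x^T W h - b^T x - c^T h."""
    return -(x @ W @ h) - (b @ x) - (c @ h)

def p_x_given_h(h, W, b):
    """p(x_i = 1 | h) = sigm(W_{i.} h + b_i), vectorised over all i."""
    return sigm(W @ h + b)

def p_h_given_x(x, W, c):
    """p(h_j = 1 | x) = sigm((W_{.j})^T x + c_j), vectorised over all j."""
    return sigm(W.T @ x + c)
```

With all parameters at zero, both conditionals reduce to 0.5 per unit, which is a quick sanity check of the formulas.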

[Textbook-page excerpt (Chapter 27, "Latent variable models for discrete data", p. 984), included as an image. Figure 27.30: (a) A general Boltzmann machine, with an arbitrary graph structure. The shaded (visible) nodes are partitioned into input and output, although the model is actually symmetric and defines a joint density on all the nodes. (b) A restricted Boltzmann machine with a bipartite structure. Note the lack of intra-layer connections. The surrounding text contrasts products of experts with mixtures of experts and notes that, unlike in directed two-layer models, the RBM's hidden variables are conditionally independent given the visible ones, so the posterior factorizes: p(h | v, θ) = ∏_k p(h_k | v, θ).]

Methods
RBM for imbalanced data

Train the model on examples from the minority class by maximizing the scaled log-likelihood (MLL):

(1/N) log p(X | θ) = (1/N) Σ_{n=1}^{N} log Σ_h p(x_n, h | θ)

Generate artificial examples x̄_1, …, x̄_M using the Synthetic Minority Oversampling TEchnique (SMOTE).

For each newly created example x̄_m apply Gibbs sampling:

h_m ∼ p(h | x̄_m, θ)
x̃_m ∼ p(x | h_m, θ)

Label the newly created example x̃_m and store it in the training data.
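The oversampling pipeline above can be sketched as follows: SMOTE-style interpolation between two minority-class examples, then one Gibbs up-down pass through the minority-class RBM to push the interpolated point back onto the data manifold. The RBM parameters and examples below are hypothetical toy values, not a trained model:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def smote_like(x_a, x_b, rng):
    """SMOTE-style interpolation between a minority example and a
    neighbour: x_bar = x_a + u * (x_b - x_a), with u ~ U(0, 1)."""
    u = rng.uniform()
    return x_a + u * (x_b - x_a)

def gibbs_refine(x_bar, W, b, c, rng):
    """One Gibbs step through the minority-class RBM:
    h_m ~ p(h | x_bar, theta), then x_tilde ~ p(x | h_m, theta)."""
    p_h = sigm(W.T @ x_bar + c)
    h_m = (rng.uniform(size=p_h.shape) < p_h).astype(float)
    p_x = sigm(W @ h_m + b)
    x_tilde = (rng.uniform(size=p_x.shape) < p_x).astype(float)
    return x_tilde

# toy usage with random (untrained) RBM parameters
rng = np.random.default_rng(0)
D, M = 6, 4
W = rng.normal(scale=0.1, size=(D, M))
b = np.zeros(D)
c = np.zeros(M)
x_a = rng.integers(0, 2, size=D).astype(float)
x_b = rng.integers(0, 2, size=D).astype(float)
x_bar = smote_like(x_a, x_b, rng)        # interpolated (non-binary) point
x_tilde = gibbs_refine(x_bar, W, b, c, rng)  # binarised, RBM-refined sample
```

Note that SMOTE alone produces fractional pixel values; the Gibbs step returns a sample from the model's binary visible distribution, which is the refinement the slides attribute to the RBM.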



Methods
RBM for imbalanced data - example

SMOTE procedure:

[Figure: interpolation between minority-class examples A and B]

Generating artificial examples on MNIST data:

[Figure: EXAMPLE 1 and EXAMPLE 2, comparing samples produced by SMOTE and by SMOTE RBM]

RBM for other raw data issues

Problem of missing values.

An RBM is trained for each of the classes separately.

Gibbs sampling is applied to uncover the unknown values.

The RBM models are iteratively updated as each new training example is completed.
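The imputation step can be sketched as clamped Gibbs sampling: observed entries stay fixed while the missing entries are repeatedly resampled from the class-specific RBM. A minimal sketch under that assumption, with hypothetical (untrained) parameters in the test:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def impute_missing(x, missing, W, b, c, rng, n_steps=20):
    """Fill in the missing entries of a binary example by Gibbs
    sampling from a class-specific RBM, keeping observed entries
    clamped. `missing` is a boolean mask over the visible units."""
    x = x.copy()
    # random initialisation of the unknown entries
    x[missing] = rng.integers(0, 2, size=int(missing.sum()))
    for _ in range(n_steps):
        p_h = sigm(W.T @ x + c)
        h = (rng.uniform(size=p_h.shape) < p_h).astype(float)
        p_x = sigm(W @ h + b)
        x_new = (rng.uniform(size=p_x.shape) < p_x).astype(float)
        x[missing] = x_new[missing]  # only missing entries are resampled
    return x
```

Running this once per candidate class and keeping the most probable completion is one plausible way to realise the per-class scheme described above.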

Problem of noisy labels.

An RBM is trained for each of the classes separately.

Each of the trained models is used as an oracle to detect incorrectly labelled data.

Reconstruction error is used to identify mislabelled examples.
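One simple way to realise this oracle, sketched here as an assumption rather than the author's exact criterion: reconstruct the example through each class's RBM and flag it when its own class's RBM is not the best reconstructor:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruction_error(x, W, b, c):
    """Mean-field up-down pass: project x onto the hidden layer and
    back, then measure the squared distance to the original input."""
    h = sigm(W.T @ x + c)        # hidden activation probabilities
    x_rec = sigm(W @ h + b)      # reconstruction
    return float(np.sum((x - x_rec) ** 2))

def label_is_suspect(x, y, rbms):
    """rbms: one (W, b, c) triple per class. The label y is flagged
    when some other class's RBM reconstructs x better."""
    errors = [reconstruction_error(x, W, b, c) for (W, b, c) in rbms]
    return int(np.argmin(errors)) != y
```

In the toy check below, one hypothetical RBM is biased toward all-ones inputs and another toward all-zeros, so an all-ones example labelled with the second class is flagged.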

Problem of unstructured data.

The RBM is used as a domain-independent feature extractor that transforms raw data into hidden units.
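Feature extraction with an RBM amounts to replacing each raw input by its vector of hidden-unit activation probabilities p(h_j = 1 | x), which a downstream classifier can consume. A minimal sketch with hypothetical toy dimensions and random (untrained) weights:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_features(X, W, c):
    """Map each row of raw inputs X to its hidden activation
    probabilities; these serve as the extracted feature vector."""
    return sigm(X @ W + c)

# hypothetical toy usage: 5 raw binary examples, 8 visible, 3 hidden units
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 8)).astype(float)
W = rng.normal(scale=0.1, size=(8, 3))
c = np.zeros(3)
F = rbm_features(X, W, c)  # feature matrix, one 3-dim row per example
```

The same function works for any raw representation that can be encoded as a fixed-length visible vector, which is what makes the extractor domain-independent.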
