pascal voc 2010: semantic object segmentation and action recognition in still images


PASCAL VOC 2010: Semantic object segmentation and action recognition in still images

Andrew D. [email protected]

Departamento de Ciencias de la Computación, Universidad Autónoma de Barcelona

Xavier Pep Nataliya Wenjuan Fahad

The CVC PASCAL VOC Team CVC PASCAL VOC 2010




Overview

On 03/05/2010 the PASCAL VOC competition was announced and the training and validation sets published.
The 20 semantic categories for the competition remain the same:

aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor.





Old competitions, new competitions

There are two (+ 1/2) main challenges in PASCAL.
Image classification is the prediction of the presence/absence of an instance of a class in a test image.
Object detection is the prediction of the bounding box and label of each object from the twenty target classes in a test image.
Semantic image segmentation is the assignment of one of the twenty class labels to every pixel in a test image.
Image segmentation is becoming a mainstream competition.
Action recognition in still images was included as a new “taster challenge” this year.
Taster competitions are used to measure interest in new problems.





Our contributions to PASCAL VOC 2010

Last year we participated in the Detection, Classification and Segmentation challenges.
This year we decided to concentrate on Classification and Segmentation. Our segmentation technique relies heavily on classification.
We also fielded a team in Action Recognition this year to see what that’s all about.
As always, success in PASCAL VOC challenges is approximately 85% engineering, 10% inspiration and 5% luck (if you’re lucky).





Outline

1 Introduction
  Overview of the challenges
  Our contribution and main ideas

2 The harmony potential 2.0: fusing across scale
  Building on last year’s submission
  Fusing across scales and learning

3 Action recognition
  A torrent of features
  Exploiting the size of the problem

4 Discussion





Giving semantics to pixels

[Figure panels: Image, Object, Class]

Semantic image segmentation is not object segmentation.
Only for simple cases are they the same.





Turning a hard problem into a harder one

[Figure panels: Image, Object, Class]

The objective is to assign semantic labels to every pixel.
Fine distinctions must be made.





Make that a very hard one

[Figure panels: Image, Object, Class]

The objective is to assign semantic labels to every pixel.
Fine distinctions must be made.
Occlusions, varying viewpoint and size complicate things.





Action recognition in still images

New competition this year: human action recognition in still images.
Individual images sampled from the Flickr dataset.
A bounding box for the human in each image is provided.
Very important: we don’t have to solve the detection problem.
Action recognition is offered as a “taster challenge” in order to gauge interest in the general problem.
It was difficult to hypothesize about what would succeed and what would not in this challenge.





Action classes





Segmentation: the role of context

Context provides very important cues for making fine discriminations at the (super-) pixel scale.
We can exploit three levels of scale: local, mid-level and global [Zhu, NIPS2008].
Existing techniques apply overly-simplified models of context that do not generalize upward from local to global scales.





Segmentation: global constraints on label combinations

Our principal idea is to use global classification to enhance segmentation results.
Global image classification results tend to be less noisy than local ones.
We will use them to constrain the combinations of semantic labels we are likely to encounter during segmentation.
We showed last year how a tractable inference technique can be devised for this labeling problem (our PASCAL 2009 entry).
This year we also show how mid-level context can be incorporated in the form of object detections.
We also show how position priors can be similarly incorporated into the framework to provide class-specific location information.
Finally, we devised a stochastic steepest ascent technique for optimizing the many parameters in a class-specific way.





Action recognition: driven by data limitations

Initial experiments confirmed our intuition about the limitations of the data.

Structural learning: sampling of pose space not dense enough.
Latent SVM: object interactions under-sampled as well.
Multiple kernel learning: converges to simple selection.

From a very early stage, we decided to treat action recognition as an image classification problem.
We exploit the small size of the dataset by performing extensive cross validation.
Features are one of our strong points, and we had to get the feature pipeline running for Classification in any case.





HCRFs for labeling problem

We represent our segmentation problem as a graph: G = (V, E)

V is used for indexing random variables, and E is the set of undirected edges representing compatibility relationships between random variables.
$X = \{X_i\}$ denotes the set of random variables or nodes, for $i \in V$.
An energy function is defined over configurations of these random variables.
By the Hammersley-Clifford theorem, the probability of a configuration $x = \{x_i\}$ is proportional to the negative exponential of an energy function, $P(x) \propto \exp(-E(x))$, with

$$E(x) = \sum_{c \in C} \phi_c(x_c),$$

where $\phi_c$ is the potential function of clique $c \in C$.





Consistency potentials for labeling problems

The energy function of G can be written as:

$$E(x) = \sum_{i \in V} \phi(x_i) + \sum_{(i,j) \in E_L} \psi_L(x_i, x_j) + \sum_{(i,g) \in E_G} \psi_G(x_i, x_g).$$

The unary term $\phi(x_i)$ depends on a single probability $P(X_i = x_i \mid O_i)$, where $O_i$ is the observation that affects $X_i$ in the model.
The smoothness potential $\psi_L(x_i, x_j)$ determines the pairwise relationship between two local nodes.
The consistency potential $\psi_G(x_i, x_g)$ expresses the dependency between local nodes and a global node.
The Maximum a Posteriori (MAP) estimate of the optimal labeling is:

$$x^* = \arg\min_x E(x).$$
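
As a minimal sketch of the energy above, the following Python evaluates E(x) on a toy graph with three local nodes and one global node, and finds the MAP labeling by brute force. The unary costs, the constant smoothness cost and the constant consistency cost are illustrative assumptions, not the potentials used in our submission; the unary term of the global node is omitted for brevity.

```python
from itertools import product

LABELS = ("cow", "sheep")

# Hypothetical unary costs phi(x_i) for three superpixels (lower is better).
UNARY = [
    {"cow": 0.2, "sheep": 1.5},
    {"cow": 0.9, "sheep": 0.8},
    {"cow": 0.3, "sheep": 1.2},
]
LOCAL_EDGES = [(0, 1), (1, 2)]  # E_L: pairs of neighbouring superpixels
SMOOTH_COST = 0.5               # psi_L: paid when two neighbours disagree
CONSISTENCY_COST = 1.0          # psi_G: paid when a local label differs from the global one


def energy(local_labels, global_label):
    """E(x) = unary terms + local pairwise terms + local-global terms."""
    unary = sum(UNARY[i][x] for i, x in enumerate(local_labels))
    smooth = sum(SMOOTH_COST for i, j in LOCAL_EDGES
                 if local_labels[i] != local_labels[j])
    consistency = sum(CONSISTENCY_COST for x in local_labels
                      if x != global_label)
    return unary + smooth + consistency


def map_labeling():
    """Brute-force argmin of E(x); only feasible for toy problems."""
    configs = ((local, g)
               for local in product(LABELS, repeat=len(UNARY))
               for g in LABELS)
    return min(configs, key=lambda config: energy(*config))
```

Brute force is exponential in the number of nodes, which is why real problems need an approximate inference method such as graph cuts.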





HCRF models of image segmentation

[Figure: three HCRF models. Smoothness (Shotton et al, CVPR2008); Potts (Plath et al, ICML2009); Robust P^N (Ladicky et al, ICCV2009). The figure also carries the label “Free”.]

Colored nodes represent (hidden) semantic labels.
Dark nodes represent image measurements.
Red edges represent penalties imposed by the potentials.





Different features for discriminations

The previously mentioned approaches all try to make global distinctions using local information.
They do so either by voting over local observations (Potts), or by penalizing rampantly discordant local label assignments (Robust P^N).
None of these techniques try to exploit truly global information to constrain local labels.
And none incorporate the notion of encoding combinations of primitive node labels at the global level.





The harmony potential: selective subsets

Only labels that do not agree with the subset are penalized.
Can represent more diverse combinations.
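
The difference from a single-label consistency potential can be sketched in a few lines of Python. This is an illustration only: the constant penalty `gamma` and the function names are assumptions made for the example.

```python
def potts_penalty(local_label, global_label, gamma=1.0):
    """Potts-style consistency: the global node holds ONE label."""
    return 0.0 if local_label == global_label else gamma


def harmony_penalty(local_label, global_subset, gamma=1.0):
    """Harmony-style consistency: the global node holds a SUBSET of labels.

    A local label is penalized only if it lies outside that subset.
    """
    return 0.0 if local_label in global_subset else gamma


# Four superpixels from an image containing a person riding a horse.
superpixels = ["person", "horse", "person", "sofa"]
global_subset = frozenset({"person", "horse"})

# Potts must pick a single global label, so "horse" is penalized too.
potts_cost = sum(potts_penalty(x, "person") for x in superpixels)
# The harmony potential only penalizes the out-of-subset "sofa".
harmony_cost = sum(harmony_penalty(x, global_subset) for x in superpixels)
```

With these toy labels the Potts cost is 2.0 and the harmony cost is 1.0, since only the superpixel outside the subset is penalized.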





The harmony potential: overview





Ranked subsampling of P(L)

We can rank candidate labels for the global node using the following posterior:

$$P(\ell \subseteq x_g^* \mid O) \propto P(\ell \subseteq x_g^*)\, P(O \mid \ell \subseteq x_g^*).$$

This allows us to effectively rank possible global node labels, and thus to prioritize candidates in the search for the optimal label $x_g^*$.
$P(\ell \subseteq x_g^* \mid O)$ establishes an order on subsets of the (unknown) optimal labeling of the global node $x_g^*$ that guides the consideration of global labels.
We may not be able to exhaustively consider all labels in the power set P(L), but at least we consider the most likely candidates for $x_g^*$.
Image classification can give us an estimate of this posterior.
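
A minimal sketch of the ranking step, under a strong simplifying assumption: the posterior of a subset is approximated by the product of independent per-class classifier probabilities. The function name, the size limit and the toy probabilities are hypothetical.

```python
import heapq
import math
from itertools import combinations


def ranked_subsets(class_probs, top_m, max_size=3):
    """Return the top_m (score, subset) pairs, best first.

    class_probs maps each class to an estimated probability that it is
    present in the image. The score of a subset is the product of its
    members' probabilities (independence is an assumption of this sketch).
    """
    labels = sorted(class_probs)
    candidates = []
    for size in range(1, max_size + 1):
        for subset in combinations(labels, size):
            score = math.prod(class_probs[k] for k in subset)
            candidates.append((score, subset))
    return heapq.nlargest(top_m, candidates)


probs = {"cow": 0.9, "person": 0.8, "sheep": 0.1}
top = ranked_subsets(probs, top_m=3)
top_labels = [subset for _, subset in top]
```

Only the highest-ranked subsets are then offered to the global node as candidate labels, instead of the full power set.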





PASCAL 2010: pushing the limit

The previous slides describe the approach used for our PASCAL 2009 submission.
The discriminative model was based only on SVMs trained to discriminate object classes from their own backgrounds.
Starting with the harmony potential approach, this year we concentrated on adding cues derived from different levels of mid-level context.
We found the HCRF model with harmony potential to be very useful for performing this fusion.
Our hypothesis at the end of the 2009 competition was that detection would be essential for pushing forward the state-of-the-art.





PASCAL 2010: fusing across scales

1 FG/BG: 20 SVMs trained to discriminate classes from their own background. The same discriminative model used last year, essential for localizing object boundaries.

2 CLASS: 20 SVMs trained to discriminate each object class from the other object classes. Essential for distinguishing objects with similar backgrounds (e.g. cows from sheep, birds from planes). Incorporated directly into the unary potential.

3 LOC: 20 class-specific location priors. Computed from ground truth segmentations by simple spatial averaging. A form of top-down mid-level context.

4 OBJ: 20 class-specific object detectors [Felzenszwalb 2010] are converted to superpixel scores by selecting the highest scoring detection intersecting each pixel of the superpixel. A type of bottom-up mid-level context.
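
The LOC and OBJ cues can be sketched as follows. This is a simplified illustration, not our implementation: masks are assumed to be already resized to a common grid, detections are `(x0, y0, x1, y1, score)` tuples, pixels covered by no detection get a default score of 0.0, and the per-pixel scores are averaged over the superpixel (the averaging step is an assumption).

```python
def location_prior(masks):
    """LOC cue: pixel-wise average of same-sized binary ground-truth masks."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(m[r][c] for m in masks) / len(masks) for c in range(cols)]
            for r in range(rows)]


def pixel_detection_score(x, y, detections, default=0.0):
    """Highest score among detections whose box contains pixel (x, y)."""
    scores = [s for (x0, y0, x1, y1, s) in detections
              if x0 <= x <= x1 and y0 <= y <= y1]
    return max(scores, default=default)


def superpixel_detection_score(pixels, detections):
    """OBJ cue: best detection score per pixel, averaged over the superpixel."""
    total = sum(pixel_detection_score(x, y, detections) for x, y in pixels)
    return total / len(pixels)


masks = [[[1, 0], [0, 0]],
         [[1, 1], [0, 0]]]
prior = location_prior(masks)

detections = [(0, 0, 1, 1, 0.9), (1, 1, 3, 3, 0.4)]
superpixel = [(0, 0), (1, 1), (3, 3), (5, 5)]
obj_score = superpixel_detection_score(superpixel, detections)
```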





PASCAL 2010: learning unary potentials

We compute the unary potential by weighting the classification scores $\{s_i(k, x_i)\}_{k \in F}$ through a sigmoid function. The unary potential becomes:

$$\phi_i^L(x_i) = -\mu_L K_i \log \prod_{k \in F} \frac{1}{1 + \exp(f_i(k, x_i))}$$

$$f_i(k, x_i) = a(k, x_i)\, s_i(k, x_i) + b(k, x_i)$$

$\mu_L$ is the weighting factor of the local unary potential, and $K_i$ normalizes over the number of pixels inside the superpixel.
We have two sigmoid parameters for each class/cue pair: $a(k, x_i)$ and $b(k, x_i)$.
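
The unary potential can be evaluated directly from its definition. The sketch below does so for one superpixel and one candidate label, turning the log of the product into a sum of logs for numerical stability. The dictionary layout and the parameter values in the example are assumptions.

```python
import math


def log_inverse_logistic(f):
    """Numerically stable log(1 / (1 + exp(f)))."""
    return -(max(f, 0.0) + math.log1p(math.exp(-abs(f))))


def unary_potential(scores, a, b, mu_local, k_i):
    """phi_i^L(x_i) for one superpixel i and one candidate label x_i.

    scores, a and b are dicts keyed by cue name (the set F), already
    selected for the candidate label x_i.
    """
    log_product = 0.0
    for cue, s in scores.items():
        f = a[cue] * s + b[cue]              # f_i(k, x_i)
        log_product += log_inverse_logistic(f)
    return -mu_local * k_i * log_product


# One cue with a zero score: the sigmoid gives 0.5.
example = unary_potential({"FG/BG": 0.0}, {"FG/BG": -1.0}, {"FG/BG": 0.0},
                          mu_local=1.0, k_i=2.0)
```

With a negative `a`, a high classifier score drives the sigmoid towards 1 and the potential towards 0, so confident labels are cheap.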





Datasets

We have evaluated the harmony potential approach on two standard, publicly available datasets.
The PASCAL VOC 2010 Segmentation Challenge dataset contains 2250 color images of 20 different semantic classes.
This set is split into 750 images for training, 750 images for testing, and 750 for validation.
The Microsoft MSRC-21 dataset contains 591 color images of 21 object classes.
We do our own splits for cross-validation on MSRC-21.





Unsupervised segmentation

Images are first over-segmented with quick-shift to derive super-pixels [Fulkerson, ICCV 2009].
This preserves object boundaries while simplifying the representation.
Working at the super-pixel level reduces the number of nodes in the CRF by $10^2$ to $10^5$ per image.





Local classification scores: P(Xi = xi |Oi)

We extract patches with 50% overlap on a regular grid at several resolutions (12, 24, 36 and 48 pixels in diameter).
Patches are described with SIFT, color and, for MSRC-21, location features.
A vocabulary is constructed using k-means to quantize to 1000 SIFT words and 400 color words.
An SVM classifier using an intersection kernel is built for each semantic category.
A similar number of positive and negative examples are used: around a total of 8,000 superpixel samples for MSRC-21, and 20,000 for VOC 2010 for each class.
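
The dense sampling step can be sketched as below. It assumes square patches whose side equals the stated diameter and a stride of half the patch size, which gives the 50% overlap; the function name and return format are hypothetical.

```python
def patch_grid(width, height, sizes=(12, 24, 36, 48)):
    """Return (left, top, size) for every patch on a regular grid.

    Neighbouring patches of the same size overlap by 50% because the
    stride is half the patch size.
    """
    patches = []
    for size in sizes:
        stride = size // 2
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                patches.append((left, top, size))
    return patches


grid = patch_grid(48, 48)
```

Each patch would then be described (e.g. with SIFT) and quantized to its nearest vocabulary word.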





Global potential and general approach

For the PASCAL 2010 dataset we use our entry to the 2010 VOC Classification Challenge [Khan, IJCV2010 (submitted)].
It uses a bag-of-words representation based on SIFT and color SIFT, plus spatial pyramids and color attention [Khan, ICCV 2009].
An SVM classifier with a χ2 kernel is trained for each semantic category in the dataset.
The FG/BG and CLASS cues are computed by training a discriminative model using an SVM with histogram intersection kernel.
Except for the additional cues and optimization strategy, the architecture is the same as in our approach described at CVPR [Gonfaus, CVPR2010].





Learning the HCRF parameters

We found it to be essential to train the per-class sigmoid parameters through cross validation.
Classification scores are learned independently, are unbalanced and are effectively incomparable in many cases.
The sigmoid functions weight the importance of each cue for each class.
In addition to these (180) sigmoid parameters, we also must learn the weighting factors for each potential.
We use a stochastic steepest ascent technique to optimize these parameters on a validation set.
In each step we randomly generate new instances of parameters.
New parameter instances are generated using a Gibbs-like sampling strategy.
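
A minimal sketch of this kind of optimizer, assuming that "Gibbs-like" means resampling one parameter at a time while the others stay fixed, and that a candidate is kept only if it improves the validation score. The Gaussian proposal, the step size and the toy objective are all assumptions.

```python
import random


def stochastic_ascent(score_fn, params, n_iters=500, step=0.5, seed=0):
    """Keep random single-parameter perturbations that improve score_fn."""
    rng = random.Random(seed)
    best = list(params)
    best_score = score_fn(best)
    for _ in range(n_iters):
        candidate = list(best)
        i = rng.randrange(len(candidate))        # pick one parameter
        candidate[i] = rng.gauss(best[i], step)  # resample it, others fixed
        score = score_fn(candidate)
        if score > best_score:                   # accept only improvements
            best, best_score = candidate, score
    return best, best_score


def toy_validation_score(p):
    """Stand-in for segmentation accuracy on a validation set."""
    return -((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2)


best_params, best_score = stochastic_ascent(toy_validation_score, [0.0, 0.0])
```

In the real setting `score_fn` would run inference on the validation images with the candidate parameters, which is why the number of iterations matters.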





History: PASCAL VOC 2009

                   Background  Aeroplane  Bicycle  Bird  Boat  Bottle  Bus   Car   Cat   Chair
BONN               83.9        64.3       21.8     21.7  32.0  40.2    57.3  49.4  38.8  5.2
BROOKES            79.6        48.3       6.7      19.1  10.0  16.6    32.7  38.1  25.3  5.5
Harmony potential  80.5        62.3       24.1     28.3  30.5  32.7    42.2  48.1  22.8  9.1

                   Cow   Dining Table  Dog   Horse  Motorbike  Person  Potted Plant  Sheep  Sofa  Train  TV/Monitor  Average
BONN               28.5  22.0          19.6  33.6   45.5       33.6    27.3          40.4   18.1  33.6   46.1        36.3
BROOKES            9.4   25.1          13.3  12.3   35.5       20.7    13.4          17.1   18.4  37.5   36.4        24.8
Harmony potential  30.1  7.9           21.5  41.9   49.6       31.5    26.1          37.0   20.1  39.4   31.1        34.1





Qualitative results: MSRC-21





Quantitative results: MSRC-21

MSRC-21 contains more multi-class images than PASCAL.
Our performance demonstrates the benefits of incorporating global scale when making local decisions.





Qualitative results: PASCAL 2010





Quantitative results: PASCAL 2010

FG/BG shows the performance of our baseline (PASCAL 2009) approach.
At the top, performance on the validation set (i.e. how well we thought we were doing).
Image tags indicate how well the technique can perform with perfect global information.





The cost of segmentation

The optimal MAP label configuration x∗ is inferred using α-expansion graph cuts [Kolmogorov, PAMI2004].
The global node uses the 100 most probable label subsets obtained from ranked subsampling.
[Figure: mAP on MSRC-21 (axis 50 to 85) and mAP on PASCAL VOC 2010 (axis 30 to 50) as a function of the number of labels selected (1 to 200).]





Qualitative results: PASCAL 2010 failures

Context is sometimes weighted too much.
When the global classifier fails, little can be done.





Every little bit helps





A photo finish

[Figure: bar chart of results per cue (axis 15 to 40). FG-BG: 33.9; CLASS: 23.4; LOC: 20.1; OBJ: 26.2; FG-BG + CLASS: 36.6; All: 40.4.]
[Figure: mAP on PASCAL VOC 2010 (axis 30 to 42) as a function of the number of iterations (0 to 3000).]

The final results are tough to call between BONN and CVC.
In the end, fusion over many scales and per-class, per-feature parameter optimization won.





The action recognition taster

Images collected from Flickr using action queries. A set of nine actions was chosen in the end.
They are disjoint from the main challenge dataset.
Only a subset of the people is annotated (bounding box + action).
Each annotated person is labelled with exactly one action class.
Important point: we don’t have to solve the detection problem.
Most action classes in the challenge contain either large variation in scale or large variations in pose (or both).





Dataset breakdown

                   train       val         trainval    test
                   img   obj   img   obj   img   obj   img   obj
Phoning            25    25    25    26    50    51    -     -
Playinginstrument  27    38    27    38    54    76    -     -
Reading            25    26    26    27    51    53    -     -
Ridingbike         25    33    25    33    50    66    -     -
Ridinghorse        27    35    26    36    53    71    -     -
Running            26    47    25    47    51    94    -     -
Takingphoto        25    27    26    28    51    55    -     -
Usingcomputer      26    29    26    30    52    59    -     -
Walking            25    41    26    42    51    83    -     -
Total              226   301   228   307   454   608   -     -





Grouplets and poselets

Two state-of-the-art approaches to action recognition in still images. The grouplets of Fei-Fei Li [Yao et al, CVPR2010]:

And the latent poses of Greg Mori [Yang et al, CVPR2010]:





Treat it like image classification

Initial experiments confirmed our intuition about the limitations of the data.

Structural learning: sampling of pose space not dense enough.
Latent SVM: complexity of object interactions problematic.
Multiple kernel learning: converges to simple selection.

State-of-the-art techniques rely on learning complex structural models of pose variations over many
From a very early stage, we decided to treat action recognition as an image classification problem.
We exploit the small size of the dataset by performing extensive cross validation.





The classification pipeline





Action recognition: features

SIFT, color SIFT (normalized R/G and opponent), self-similarity, SURF, PHOG (good for capturing pose), and color attention (focuses on interesting color features).
Sparse and dense variations of most of these.
Plus a range of pyramid configurations (1, 2×2, 3×3, 4×4).
Object detectors also incorporated using a simple occurrence histogram [Felzenszwalb 2010].
The goal was to incorporate all of this into a BoVW classifier and push the limits of what is possible using classical BoW on actions.
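
The pyramid configurations can be illustrated with a small bag-of-words sketch. It assumes features are given as `(x, y, word)` with coordinates normalized to [0, 1) inside the region of interest; weighting of pyramid levels and histogram normalization are left out.

```python
def pyramid_histogram(features, vocab_size, levels=(1, 2, 3, 4)):
    """Concatenate visual-word histograms over 1x1, 2x2, 3x3 and 4x4 grids.

    features: iterable of (x, y, word), x and y in [0, 1),
    word in range(vocab_size).
    """
    features = list(features)
    histogram = []
    for n in levels:
        cells = [[0] * vocab_size for _ in range(n * n)]
        for x, y, word in features:
            col = min(int(x * n), n - 1)
            row = min(int(y * n), n - 1)
            cells[row * n + col][word] += 1
        for cell in cells:
            histogram.extend(cell)
    return histogram


toy_features = [(0.1, 0.1, 0), (0.9, 0.9, 1), (0.6, 0.2, 0)]
toy_hist = pyramid_histogram(toy_features, vocab_size=2)
```

The resulting vector has `vocab_size * (1 + 4 + 9 + 16)` entries, which is one reason the number of feature configurations grows so quickly.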





Action recognition: contextual pyramids

Context was also important for most object classes.
We used a type of foreground/background pyramid decomposition that split features into object or background.
This was done using a type of spatial soft-assign based on the distance to the boundary of the object.
For some classes, we also assigned contextual object regions that model the appearance of objects associated with them (the “horsey box”).
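
One way such a spatial soft-assign could look is sketched below. The logistic weighting of the signed distance to the bounding box and the `bandwidth` parameter are assumptions made for illustration; the slides do not specify the exact function.

```python
import math


def signed_distance_to_box(x, y, box):
    """Positive inside the box, negative outside, zero on the boundary."""
    x0, y0, x1, y1 = box
    dx = min(x - x0, x1 - x)
    dy = min(y - y0, y1 - y)
    if dx >= 0 and dy >= 0:
        return min(dx, dy)
    # Outside: Euclidean distance to the nearest point of the box.
    ox = max(x0 - x, 0, x - x1)
    oy = max(y0 - y, 0, y - y1)
    return -math.hypot(ox, oy)


def soft_assign(x, y, box, bandwidth=10.0):
    """Return (foreground_weight, background_weight), summing to 1."""
    z = signed_distance_to_box(x, y, box) / bandwidth
    if z >= 0:
        fg = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        fg = e / (1.0 + e)
    return fg, 1.0 - fg


person_box = (0, 0, 100, 100)
inside = soft_assign(50, 50, person_box)
on_edge = soft_assign(0, 50, person_box)
outside = soft_assign(150, 50, person_box)
```

Each feature then contributes to the foreground and background histograms in proportion to its two weights.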





Action recognition: learning in the design space

In the end, after all of the combinatorics introduced by pyramids and other variations, we had about 100 feature configurations in a big pool.
Most attempts to automatically learn the parameters of these features were total failures.
Except one. Initial experiments with multiple kernel learning showed that MKL starts converging quickly towards class-specific feature selection rather than mixing.
With such a small dataset, and a little heuristic trimming, we were able to exhaustively explore a part of the design space.
This resulted in the best per-class feature combinations.
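
The exhaustive exploration can be sketched as a plain search over feature subsets, scored by cross validation for one class at a time. The feature names, the size limit and the toy scoring function below are hypothetical stand-ins; in practice `cv_score` would train and evaluate a classifier on the folds.

```python
from itertools import combinations


def best_combination(feature_pool, cv_score, max_features=3):
    """Try every combination of up to max_features features; keep the best."""
    best_combo, best_score = None, float("-inf")
    for size in range(1, max_features + 1):
        for combo in combinations(feature_pool, size):
            score = cv_score(combo)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score


# Toy stand-in for cross-validated average precision on one action class.
TOY_GAINS = {"sift": 0.30, "color_sift": 0.10, "phog": 0.25, "surf": 0.05}


def toy_cv_score(combo):
    return sum(TOY_GAINS[f] for f in combo) - 0.08 * len(combo)


pool = ["sift", "color_sift", "phog", "surf"]
combo, score = best_combination(pool, toy_cv_score)
```

Running this once per action class yields a class-specific feature selection, which is also what MKL appeared to converge towards.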





Action recognition: classification

We experimented with a number of kernels (histogram intersection, χ2, bin-ratio distance).
There wasn’t a huge difference among these kernels.
In the end, we chose histogram intersection for our submission as it appeared to generalize better.
In addition to over-fitting less, there are no parameters to tune and it is very fast.
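
For reference, two of these kernels are easy to write down for a pair of histograms. The χ2 kernel is shown in its additive, parameter-free form, which is an assumption: the exponential χ2 variant is also common and has a bandwidth parameter. The bin-ratio distance is not sketched here.

```python
def histogram_intersection(h1, h2):
    """K(h1, h2) = sum_i min(h1_i, h2_i); no parameters to tune."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def additive_chi2(h1, h2):
    """K(h1, h2) = sum_i 2 * h1_i * h2_i / (h1_i + h2_i), skipping empty bins."""
    return sum(2.0 * a * b / (a + b) for a, b in zip(h1, h2) if a + b > 0)


h_a = [0.5, 0.5]
h_b = [0.2, 0.8]
k_hi = histogram_intersection(h_a, h_b)
k_chi2 = additive_chi2(h_a, h_b)
```

For L1-normalized histograms both kernels equal 1 when the two histograms are identical.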





Overall results: average precision





Per-class AP





Per technique median average precision





Qualitative results

When the horsey box and detectors fail, context dominates.
Classifier still surprisingly robust.





Qualitative results

Some fine discriminations very difficult to make.
Probably difficult even for humans.





Qualitative results

People taking photos should be banned.
Classes with large pose variations were the most difficult.




Discussion: semantic image segmentation

The harmony potential works well for fusing global information into local segmentations.
This year we showed that the harmony potential framework is also appropriate for incorporating different types of mid-level cues.
Ranked sub-sampling, driven by the same posterior as used to define the global potential function, renders the optimization problem tractable.
Most useful when multiple semantic classes co-occur frequently.
Per-class learning of parameters essential (about +5% in final results).




Discussion: action recognition

This year’s taster challenge on action recognition was little more than a toy.
However, we have demonstrated what is possible using proven techniques from image classification.
We feel that object context, in particular object interaction context, is the way forward.
The PASCAL data set is a step in the right direction (more general), but we need more samples.




The future: segmentation

Semantic image segmentation has come a long way, but still has a long way to go.
It is becoming a mainstream event in PASCAL.
This year we arrived at a sort of three-way détente between the CVC (winner 2010), BONN (winner 2009) and OXFORD (best paper award ECCV 2010) in segmentation.
Each has its own approach, with its own advantages and disadvantages.
Engineering can probably maximize results.
The field is becoming mature, and we can begin thinking about what new applications are enabled by such technologies.




The future: action recognition

It seems that action recognition in still images is a popular challenge.
The PASCAL organizers are keen to promote it for the future.
The focus will remain on still images, but perhaps with more emphasis on incorporating user interaction as well.
It seems that the community is becoming more interested in the “alternative” PASCAL challenges.
The multimedia community probably has an important role to play here.

