PASCAL VOC 2010: semantic object segmentation and action recognition in still images
DESCRIPTION
In this talk, I will discuss the extensions we have made to our approach to semantic image segmentation. I will show how the results of object detectors and spatial priors can be naturally integrated into our hierarchical conditional random field (HCRF) approach based on the harmony potential. The addition of these extra cues, as well as class-specific normalization of classifier outputs, significantly improves segmentation quality.
TRANSCRIPT
Introduction | Harmony potential 2.0: fusing across scale | Action recognition | Discussion
PASCAL VOC 2010
Semantic object segmentation and action recognition in still images
Andrew D. [email protected]
Departamento de Ciencias de la Computación, Universidad Autónoma de Barcelona
Xavier Pep Nataliya Wenjuan Fahad
The CVC PASCAL VOC Team CVC PASCAL VOC 2010
PASCAL VOC 2010 | Semantic image segmentation | Action recognition | Our main ideas
Overview
On 03/05/2010 the PASCAL VOC competition was announced and the training and validation sets were published.
The 20 semantic categories for the competition remain the same:
aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor.
Old competitions, new competitions
There are two (+ 1/2) main challenges in PASCAL.
Image classification is the prediction of the presence/absence of an instance of a class in a test image.
Object detection is the prediction of the bounding box and label of each object from the twenty target classes in a test image.
Semantic image segmentation is the assignment of one of the twenty class labels to every pixel in a test image.
Image segmentation is becoming a mainstream competition.
Action recognition in still images was included as a new “taster challenge” this year.
Taster competitions are used to measure interest in new problems.
Our contributions to PASCAL VOC 2010
Last year we participated in the Detection, Classification and Segmentation challenges.
This year we decided to concentrate on Classification and Segmentation. Our segmentation technique relies heavily on classification.
We also fielded a team in Action Recognition this year to see what that’s all about.
As always, success in PASCAL VOC challenges is approximately 85% engineering, 10% inspiration and 5% luck (if you’re lucky).
Outline
1 Introduction
  Overview of the challenges
  Our contribution and main ideas
2 The harmony potential 2.0: fusing across scale
  Building on last year’s submission
  Fusing across scales and learning
3 Action recognition
  A torrent of features
  Exploiting the size of the problem
4 Discussion
Giving semantics to pixels
[Figure: example panels labeled Image, Object and Class]
Semantic image segmentation is not object segmentation.
Only for simple cases are they the same.
Turning a hard problem into a harder one
[Figure: example panels labeled Image, Object and Class]
The objective is to assign semantic labels to every pixel.
Fine distinctions must be made.
Make that a very hard one
[Figure: example panels labeled Image, Object and Class]
The objective is to assign semantic labels to every pixel.
Fine distinctions must be made.
Occlusions, varying viewpoint and size complicate things.
Action recognition in still images
New competition this year: human action recognition in still images.
Individual images are sampled from the Flickr dataset.
Bounding boxes of the human in each image are provided.
Very important: we don’t have to solve the detection problem.
Action recognition is offered as a “taster challenge” in order to gauge interest in the general problem.
It was difficult to hypothesize about what would succeed and what would not in this challenge.
Action classes
Segmentation: the role of context
Context provides very important cues for making fine discriminations at the (super-)pixel scale.
We can exploit three levels of scale: local, mid-level and global [Zhu, NIPS2008].
Existing techniques apply overly simplified models of context that do not generalize upward from local to global scales.
Segmentation: global constraints on label combinations
Our principal idea is to use global Classification to enhance segmentation results.
Global image classification results tend to be less noisy than local ones.
We will use them to constrain the combinations of semantic labels we are likely to encounter during segmentation.
We showed last year how a tractable inference technique can be devised for this labeling problem (our PASCAL 2009 entry).
This year we also show how mid-level context can be incorporated in the form of object detections.
We also show how position priors can be similarly incorporated into the framework to provide class-specific location information.
Finally, we devised a stochastic steepest ascent technique for optimizing the many parameters in a class-specific way.
Action recognition: driven by data limitations
Initial experiments confirmed our intuition about the limitations of the data.
Structural learning: sampling of pose space not dense enough.
Latent SVM: object interactions under-sampled as well.
Multiple kernel learning: converges to simple selection.
From a very early stage, we decided to treat action recognition as an image classification problem.
We exploit the small size of the dataset by performing extensive cross validation.
Features are one of our strong points, and we had to get the feature pipeline running for Classification in any case.
Our point of departure | Datasets and implementation | Experimental results
HCRFs for labeling problem
We represent our segmentation problem as a graph: $G = (V, E)$.
$V$ is used for indexing random variables, and $E$ is the set of undirected edges representing compatibility relationships between random variables.
$X = \{X_i\}$ denotes the set of random variables or nodes, for $i \in V$.
An energy function will be defined over graphical configurations of random variables.
By the Hammersley-Clifford theorem, the probability of a configuration $x = \{x_i\}$ can be written as the exponential of the negative of an energy function, $P(x) \propto \exp(-E(x))$, with
$$E(x) = \sum_{c \in C} \phi_c(x_c),$$
where $\phi_c$ is the potential function of clique $c \in C$.
Consistency potentials for labeling problems
The energy function of $G$ can be written as:
$$E(x) = \sum_{i \in V} \phi(x_i) + \sum_{(i,j) \in E_L} \psi_L(x_i, x_j) + \sum_{(i,g) \in E_G} \psi_G(x_i, x_g).$$
The unary term $\phi(x_i)$ depends on a single probability $P(X_i = x_i \mid O_i)$, where $O_i$ is the observation that affects $X_i$ in the model.
The smoothness potential $\psi_L(x_i, x_j)$ determines the pairwise relationship between two local nodes.
The consistency potential $\psi_G(x_i, x_g)$ expresses the dependency between local nodes and a global node.
The Maximum a Posteriori (MAP) estimate of the optimal labeling is:
$$x^* = \arg\min_x E(x).$$
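To make the three-term energy concrete, here is a minimal sketch that evaluates $E(x)$ on a toy graph and finds the MAP labeling by brute force. The function names, the toy potentials and the exhaustive search are illustrative assumptions; the talk itself uses graph cuts for inference, not enumeration.

```python
from itertools import product

def energy(labels, global_label, unary, edges, smooth, consistency):
    """E(x) = sum_i phi(x_i) + sum_(i,j) psi_L(x_i, x_j) + sum_i psi_G(x_i, x_g)."""
    e = sum(unary[i][labels[i]] for i in range(len(labels)))
    e += sum(smooth(labels[i], labels[j]) for i, j in edges)
    e += sum(consistency(x_i, global_label) for x_i in labels)
    return e

def map_estimate(n_nodes, n_labels, global_labels, unary, edges, smooth, consistency):
    """Exhaustive arg min over all labelings; only feasible for toy problems."""
    best = None
    for x_g in global_labels:
        for labels in product(range(n_labels), repeat=n_nodes):
            e = energy(labels, x_g, unary, edges, smooth, consistency)
            if best is None or e < best[0]:
                best = (e, labels, x_g)
    return best

# Toy problem (hypothetical numbers): 3 nodes in a chain, 2 labels.
unary = [[0.2, 1.0], [0.9, 0.8], [1.0, 0.1]]
edges = [(0, 1), (1, 2)]
potts = lambda a, b: 0.0 if a == b else 0.5   # pairwise smoothness penalty
no_global = lambda x_i, x_g: 0.0              # consistency term switched off
best_energy, best_labels, _ = map_estimate(3, 2, [None], unary, edges, potts, no_global)
```

Swapping `no_global` for a non-trivial consistency function shows how a global node changes which local labelings are cheapest.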
HCRF models of image segmentation
[Figure: HCRF models compared. Panel labels: Smoothness, Potts, Robust $P^N$; the label “Free” also appears in the figure.]
(Shotton et al, CVPR2008) (Plath et al, ICML2009) (Ladicky et al, ICCV2009)
Colored nodes represent (hidden) semantic labels.
Dark nodes represent image measurements.
Red edges represent penalties imposed by the potential.
Different features for discriminations
The previously mentioned approaches all try to make global distinctions using local information.
Either by voting of local observations (Potts).
Or by penalizing rampantly discordant local label assignments ($P^N$).
None of these techniques try to exploit truly global information to constrain local labels.
And none incorporate the notion of encoding combinations of primitive node labels at the global level.
The harmony potential: selective subsets
Only labels that do not agree with the subset are penalized.
Can represent more diverse combinations.
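A minimal sketch of this idea: the global node carries a subset of class labels, and a local label pays a penalty only when it is missing from that subset. The constant penalty `gamma` and the function names are assumptions for illustration; the potential used in the actual system may weight disagreements differently.

```python
def harmony_penalty(local_label, global_subset, gamma=1.0):
    """Return gamma if the local label is absent from the global subset, else 0."""
    return 0.0 if local_label in global_subset else gamma

def total_harmony_cost(local_labels, global_subset, gamma=1.0):
    """Sum the penalty over all local (superpixel) labels."""
    return sum(harmony_penalty(x, global_subset, gamma) for x in local_labels)

# Hypothetical superpixel labeling checked against the global subset {cow, grass}.
superpixel_labels = ["cow", "cow", "grass", "sheep"]
cost = total_harmony_cost(superpixel_labels, frozenset({"cow", "grass"}))
```

Only the "sheep" superpixel is penalized here, while any mixture of "cow" and "grass" is free, which is what lets a subset-valued global label represent diverse combinations.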
The harmony potential: overview
Ranked subsampling of P(L)
We can do this using the following posterior:
$$P(\ell \subseteq x_g^* \mid O) \propto P(\ell \subseteq x_g^*)\, P(O \mid \ell \subseteq x_g^*).$$
This allows us to effectively rank possible global node labels, and thus to prioritize candidates in the search for the optimal label $x_g^*$.
$P(\ell \subseteq x_g^* \mid O)$ establishes an order on subsets of the (unknown) optimal labeling of the global node $x_g^*$ that guides the consideration of global labels.
We may not be able to exhaustively consider all labels in $\mathcal{P}(L)$, but at least we consider the most likely candidates for $x_g^*$.
Image classification can give us an estimate of this posterior.
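The ranking step can be sketched as follows. This is only an illustration under strong assumptions that are not stated in the talk: per-class presence probabilities from an image classifier are treated as independent, the prior is taken as uniform, so a subset is scored by the product of its class probabilities, and the empty subset is skipped. The function name and `max_size` cap are hypothetical.

```python
from itertools import combinations
from math import prod

def rank_subsets(class_probs, top_m, max_size=3):
    """Rank non-empty label subsets by prod_{k in subset} p_k and keep the top_m."""
    classes = sorted(class_probs)
    scored = []
    for size in range(1, max_size + 1):
        for subset in combinations(classes, size):
            score = prod(class_probs[k] for k in subset)
            scored.append((score, subset))
    # Highest score first; ties broken by subset name for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:top_m]

# Hypothetical classifier outputs for one image.
probs = {"cow": 0.9, "person": 0.6, "sofa": 0.05}
top = rank_subsets(probs, top_m=3)
```

Only the `top_m` candidates would then be tried as labels for the global node, instead of every element of the power set.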
PASCAL 2010: pushing the limit
The previous slides describe the approach used for our PASCAL 2009 submission.
The discriminative model was based only on SVMs trained to discriminate object classes from their own backgrounds.
Starting with the harmony potential approach, this year we concentrated on adding cues derived from different types of mid-level context.
We found the HCRF model with harmony potential to be very useful for performing this fusion.
Our hypothesis at the end of the 2009 competition was that detection would be essential for pushing forward the state of the art.
PASCAL 2010: fusing across scales
1 FG/BG: 20 SVMs trained to discriminate classes from their own background. The same discriminative model used last year, essential for localizing object boundaries.
2 CLASS: 20 SVMs trained to discriminate each object class from the other object classes. Essential for distinguishing objects with similar backgrounds (e.g. cows from sheep, birds from planes). Incorporated directly into the unary potential.
3 LOC: 20 class-specific location priors. Computed from ground truth segmentations by simple spatial averaging. A form of top-down mid-level context.
4 OBJ: 20 class-specific object detectors [Felzenszwalb 2010] are converted to superpixel scores by selecting the highest scoring detection intersecting each pixel of the superpixel. A type of bottom-up mid-level context.
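The OBJ cue can be sketched for a single class as below. The per-pixel maximum over covering detections follows the description above; averaging those pixel scores within each superpixel, the finite `background` score for uncovered pixels, and the function names are assumptions made for this illustration.

```python
import numpy as np

def detection_score_map(shape, detections, background=-1.0):
    """Per-pixel map holding the highest score of any box covering the pixel."""
    score = np.full(shape, background, dtype=float)
    for x0, y0, x1, y1, s in detections:  # boxes cover [x0, x1) x [y0, y1)
        region = score[y0:y1, x0:x1]      # view into the score map
        np.maximum(region, s, out=region)
    return score

def superpixel_scores(score_map, superpixels, n_superpixels):
    """Average the per-pixel scores inside each superpixel (assumed pooling)."""
    ids = superpixels.ravel()
    sums = np.bincount(ids, weights=score_map.ravel(), minlength=n_superpixels)
    counts = np.bincount(ids, minlength=n_superpixels)
    return sums / np.maximum(counts, 1)

# Toy 2x4 image with two superpixels and two hypothetical detections.
superpixels = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
detections = [(0, 0, 2, 2, 0.5), (1, 0, 3, 1, 0.9)]
score_map = detection_score_map(superpixels.shape, detections)
scores = superpixel_scores(score_map, superpixels, n_superpixels=2)
```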
PASCAL 2010: learning unary potentials
We compute the unary potential by weighting the classification scores $\{s_i(k, x_i)\}_{k \in F}$ through a sigmoid function. The unary potential becomes:
$$\phi^L_i(x_i) = -\mu_L K_i \log \prod_{k \in F} \frac{1}{1 + \exp(f_i(k, x_i))}$$
$$f_i(k, x_i) = a(k, x_i)\, s_i(k, x_i) + b(k, x_i)$$
$\mu_L$ is the weighting factor of the local unary potential, and $K_i$ normalizes over the number of pixels inside the superpixel.
We have two sigmoid parameters for each class/cue pair: $a(k, x_i)$ and $b(k, x_i)$.
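As a worked sketch of this formula for one superpixel and one candidate label: since $-\log \frac{1}{1+\exp(f)} = \log(1 + \exp(f))$, the potential reduces to a weighted sum of softplus terms over the cues. The function name and the example numbers are hypothetical.

```python
import numpy as np

def unary_potential(scores, a, b, mu_local=1.0, k_i=1.0):
    """phi = -mu_L * K_i * log prod_k 1 / (1 + exp(a_k * s_k + b_k))."""
    f = np.asarray(a, dtype=float) * np.asarray(scores, dtype=float) + np.asarray(b, dtype=float)
    # -log(1 / (1 + exp(f))) = log(1 + exp(f)), computed stably with logaddexp.
    return mu_local * k_i * np.sum(np.logaddexp(0.0, f))

# Two cues with zero scores: each contributes log(2).
value = unary_potential([0.0, 0.0], a=[-1.0, -1.0], b=[0.0, 0.0])
# With a negative slope a, a higher classifier score lowers the energy.
low = unary_potential([2.0], a=[-1.0], b=[0.0])
high = unary_potential([0.0], a=[-1.0], b=[0.0])
```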
Datasets
We have evaluated the harmony potential approach on two standard, publicly available datasets.
The PASCAL VOC 2010 Segmentation Challenge dataset contains 2250 color images of 20 different semantic classes.
This set is split into 750 images for training, 750 images for testing, and 750 for validation.
The Microsoft MSRC-21 dataset contains 591 color images of 21 object classes.
We do our own splits for cross-validation on MSRC-21.
Unsupervised segmentation
Images are first over-segmented with quick-shift to derive super-pixels [Fulkerson, ICCV 2009].
This preserves object boundaries while simplifying the representation.
Working at the super-pixel level reduces the number of nodes in the CRF by 10^2 to 10^5 per image.
Local classification scores: P(Xi = xi |Oi)
We extract patches with 50% overlap on a regular grid at several resolutions (12, 24, 36 and 48 pixels in diameter).
Patches are described with SIFT, color and, for MSRC-21, location features.
A vocabulary is constructed using k-means to quantize to 1000 SIFT words and 400 color words.
An SVM classifier using an intersection kernel is built for each semantic category.
A similar number of positive and negative examples are used: a total of around 8,000 superpixel samples for MSRC-21, and 20,000 for VOC 2010, for each class.
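For reference, the histogram intersection kernel between two bag-of-words histograms is the sum of their bin-wise minima. The sketch below computes the full kernel matrix with NumPy; the function name and the toy histograms are illustrative only.

```python
import numpy as np

def intersection_kernel(X, Y):
    """K[i, j] = sum_d min(X[i, d], Y[j, d]) for row-wise histograms X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

# Two toy L1-normalized histograms over a 3-word vocabulary.
A = np.array([[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]])
K = intersection_kernel(A, A)
```

A matrix like `K` can be handed to any SVM implementation that accepts precomputed kernels.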
Global potential and general approach
For the PASCAL 2010 dataset we use our entry to the 2010 VOC Classification Challenge [Khan, IJCV2010 (submitted)].
It uses a bag-of-words representation based on SIFT and color SIFT, plus spatial pyramids and color attention [Khan, ICCV 2009].
An SVM classifier with a χ2 kernel is trained for each semantic category in the dataset.
The FG/BG and CLASS cues are computed by training a discriminative model using an SVM with a histogram intersection kernel.
Except for the additional cues and the optimization strategy, the architecture is the same as in our approach described at CVPR [Gonfaus, CVPR2010].
Learning the HCRF parameters
We found it to be essential to train the per-class sigmoid parameters through cross validation.
Classification scores are learned independently, are unbalanced and are effectively incomparable in many cases.
The sigmoid functions weight the importance of each cue for each class.
In addition to these (180) sigmoid parameters, we also must learn the weighting factors for each potential.
We use a stochastic, steepest ascent technique to optimize these parameters on a validation set.
In each step we randomly generate new instances of parameters.
New parameter instances are generated using a Gibbs-like sampling strategy.
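One way such a search could look is sketched below: at each step a single randomly chosen parameter is perturbed (the "Gibbs-like" part, as interpreted here) and the change is kept only if the validation score improves. The Gaussian proposal, the accept-if-better rule, the function names and the toy objective are all assumptions; the talk does not give these details.

```python
import random

def stochastic_ascent(score_fn, params, n_iters=1000, step=0.5, seed=0):
    """Coordinate-wise random search that never decreases score_fn."""
    rng = random.Random(seed)
    best = list(params)
    best_score = score_fn(best)
    for _ in range(n_iters):
        candidate = list(best)
        idx = rng.randrange(len(candidate))     # resample one coordinate at a time
        candidate[idx] += rng.gauss(0.0, step)  # assumed Gaussian proposal
        s = score_fn(candidate)
        if s > best_score:                      # keep only improvements
            best, best_score = candidate, s
    return best, best_score

# Toy stand-in for "segmentation accuracy on the validation set".
target = [1.0, -2.0, 0.5]
score = lambda p: -sum((x - t) ** 2 for x, t in zip(p, target))
start = [0.0, 0.0, 0.0]
best, best_score = stochastic_ascent(score, start, n_iters=2000)
```

In the real setting `score_fn` would run inference on the validation images with the candidate parameters, which is why each evaluation is expensive and the number of iterations matters.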
History: PASCAL VOC 2009
                   Background  Aeroplane  Bicycle  Bird  Boat  Bottle  Bus   Car   Cat   Chair
BONN               83.9        64.3       21.8     21.7  32.0  40.2    57.3  49.4  38.8  5.2
BROOKES            79.6        48.3       6.7      19.1  10.0  16.6    32.7  38.1  25.3  5.5
Harmony potential  80.5        62.3       24.1     28.3  30.5  32.7    42.2  48.1  22.8  9.1

                   Cow   Dining Table  Dog   Horse  Motorbike  Person  Potted Plant  Sheep  Sofa  Train  TV/Monitor  Average
BONN               28.5  22.0          19.6  33.6   45.5       33.6    27.3          40.4   18.1  33.6   46.1        36.3
BROOKES            9.4   25.1          13.3  12.3   35.5       20.7    13.4          17.1   18.4  37.5   36.4        24.8
Harmony potential  30.1  7.9           21.5  41.9   49.6       31.5    26.1          37.0   20.1  39.4   31.1        34.1
Qualitative results: MSRC-21
Quantitative results: MSRC-21
MSRC-21 contains more multi-class images than PASCAL.
Our performance demonstrates the benefits of incorporating global scale when making local decisions.
Qualitative results: PASCAL 2010
Quantitative results: PASCAL 2010
FG/BG shows the performance of our baseline (PASCAL 2009) approach.
At the top, performance on the validation set (i.e. how well we thought we were doing).
Image tags indicate how well the technique can perform with perfect global information.
The cost of segmentation
The optimal MAP label configuration x∗ is inferred using α-expansion graph cuts [Kolmogorov, PAMI2004].
The global node uses the 100 most probable label subsets obtained from ranked subsampling.
[Figure: mAP on MSRC-21 (axis 50 to 85) and mAP on PASCAL VOC 2010 (axis 30 to 50) as a function of the number of labels selected (1 to 200)]
Qualitative results: PASCAL 2010 failures
Context is sometimes weighted too much.
When the global classifier fails, little can be done.
Every little bit helps
A photo finish
[Figure: bar chart of results per cue and cue combination (axis 15 to 40)]
Cue              Score
FG-BG            33.9
CLASS            23.4
LOC              20.1
OBJ              26.2
FG-BG + CLASS    36.6
All              40.4
[Figure: mAP on PASCAL VOC 2010 (axis 30 to 42) as a function of the number of iterations (0 to 3000)]
The final results are tough to call between BONN and CVC.
In the end, fusion over many scales and per-class, per-feature parameter optimization won.
The data | State-of-the-art | Our approach | Results
The action recognition taster
Images were collected from Flickr using action queries. A set of nine actions was chosen in the end.
They are disjoint from the main challenge dataset.
Only a subset of people are annotated (bounding box + action).
Each person in this subset is labelled with exactly one action class.
Important point: we don’t have to solve the detection problem.
Most action classes in the challenge contain either large variations in scale or large variations in pose (or both).
Dataset breakdown
                   train       val         trainval    test
                   img   obj   img   obj   img   obj   img   obj
Phoning            25    25    25    26    50    51    -     -
Playinginstrument  27    38    27    38    54    76    -     -
Reading            25    26    26    27    51    53    -     -
Ridingbike         25    33    25    33    50    66    -     -
Ridinghorse        27    35    26    36    53    71    -     -
Running            26    47    25    47    51    94    -     -
Takingphoto        25    27    26    28    51    55    -     -
Usingcomputer      26    29    26    30    52    59    -     -
Walking            25    41    26    42    51    83    -     -
Total              226   301   228   307   454   608   -     -
Grouplets and poselets
Two state-of-the-art techniques for action recognition in still images.
The grouplets of Fei-Fei Li [Yao et al, CVPR2010]:
And the latent poses of Greg Mori [Yang et al, CVPR2010]:
Treat it like image classification
Initial experiments confirmed our intuition about the limitations of the data.
Structural learning: sampling of pose space not dense enough.
Latent SVM: complexity of object interactions problematic.
Multiple kernel learning: converges to simple selection.
State-of-the-art techniques rely on learning complex structural models of pose-variations over many
From a very early stage, we decided to treat action recognition as an image classification problem.
We exploit the small size of the dataset by performing extensive cross validation.
The classification pipeline
Action recognition: features
SIFT, color SIFT (normalized R/G and opponent), self-similarity, SURF, PHOG (good for capturing pose), and color attention (focuses on interesting color features).
Sparse and dense variations of most of these.
Plus a range of pyramid configurations (1, 2×2, 3×3, 4×4).
Object detectors are also incorporated using a simple occurrence histogram [Felzenszwalb 2010].
The goal was to incorporate all of this into a BoVW classifier and push the limits of what is possible using classical BoW on actions.
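The pyramid configurations can be illustrated with a generic spatial pyramid bag-of-words histogram: for each grid level, a word histogram is built per cell and all of them are concatenated. This is a textbook sketch, not the team's pipeline; the function name, the normalized coordinates and the lack of per-level weighting are assumptions.

```python
import numpy as np

def pyramid_histogram(xy, words, vocab_size, grids=(1, 2, 3, 4)):
    """Concatenate per-cell visual word histograms for each g x g grid level."""
    xy = np.asarray(xy, dtype=float)      # keypoint positions normalized to [0, 1]
    words = np.asarray(words, dtype=int)  # visual word index per keypoint
    parts = []
    for g in grids:
        cols = np.minimum((xy[:, 0] * g).astype(int), g - 1)
        rows = np.minimum((xy[:, 1] * g).astype(int), g - 1)
        cell = rows * g + cols
        hist = np.bincount(cell * vocab_size + words, minlength=g * g * vocab_size)
        parts.append(hist)
    return np.concatenate(parts).astype(float)

# Two keypoints, a 2-word vocabulary, and the 1 and 2x2 levels only.
h = pyramid_histogram([[0.1, 0.1], [0.9, 0.9]], [0, 1], vocab_size=2, grids=(1, 2))
# With all four levels the descriptor has (1 + 4 + 9 + 16) * vocab_size bins.
full = pyramid_histogram([[0.1, 0.1], [0.9, 0.9]], [0, 1], vocab_size=2)
```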
Action recognition: contextual pyramids
Context was also important for most object classes.
We used a type of foreground/background pyramid decomposition that split features into object or background.
This was done using a type of spatial soft-assign based on the distance to the boundary of the object.
For some classes, we also assigned contextual object regions that model the appearance of objects associated with them (the “horsey box”).
Action recognition: learning in the design space
In the end, after all of the combinatorics introduced by pyramids and other variations, we had about 100 feature configurations in a big pool.
Most attempts to automatically learn the parameters of these features were total failures.
Except one. Initial experiments with multiple kernel learning showed that MKL starts converging quickly towards class-specific feature selection rather than mixing.
With such a small dataset, and a little heuristic trimming, we were able to exhaustively explore a part of the design space.
This resulted in the best per-class feature combinations.
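The exhaustive exploration can be sketched as a search over small feature subsets, scored per class by cross validation. Here `cv_score` is a hypothetical stand-in for "train a classifier on this feature combination and return its cross-validated average precision"; the feature names, the toy scoring rule and the `max_size` cap are assumptions.

```python
from itertools import combinations

def best_combination(features, cv_score, max_size=3):
    """Try every feature subset up to max_size and keep the best-scoring one."""
    best_combo, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            s = cv_score(combo)
            if s > best_score:
                best_combo, best_score = combo, s
    return best_combo, best_score

# Toy scoring rule: each feature adds some gain, each extra feature costs a bit.
gains = {"sift": 0.5, "phog": 0.3, "color": 0.1}
toy_cv_score = lambda combo: sum(gains[f] for f in combo) - 0.15 * len(combo)
combo, combo_score = best_combination(list(gains), toy_cv_score)
```

Running this once per action class yields a per-class feature combination, which is only affordable because the dataset is small.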
Action recognition: classification
We experimented with a number of kernels (histogram intersection, χ2, bin-ratio distance).
There wasn’t a huge difference among these kernels.
In the end, we chose histogram intersection for our submission as it appeared to generalize better.
In addition to over-fitting less, there are no parameters to tune and it is very fast.
Overall results: average precision
Per-class AP
Per technique median average precision
Qualitative results
When the horsey box and detectors fail, context dominates.
Classifier still surprisingly robust.
Qualitative results
Some fine discriminations are very difficult to make.
Probably difficult even for humans.
Qualitative results
People taking photos should be banned.
Classes with large pose variations were the most difficult.
Discussion: semantic image segmentation
The harmony potential works well for fusing global information into local segmentations.
This year we showed that the harmony potential framework is also appropriate for incorporating different types of mid-level cues.
Ranked sub-sampling, driven by the same posterior as used to define the global potential function, renders the optimization problem tractable.
Most useful when multiple semantic classes co-occur frequently.
Per-class learning of parameters is essential (about +5% in final results).
Discussion: action recognition
This year’s taster challenge on action recognition was little more than a toy.
However, we have demonstrated what is possible using proven techniques from image classification.
We feel that object context, in particular object interaction context, is the way forward.
The PASCAL data set is the right direction to go (more general), but we need more samples.
The future: segmentation
Semantic image segmentation has come a long way, but still has a long way to go.
It is becoming a mainstream event in PASCAL.
This year we arrived at a sort of three-way détente between the CVC (winner 2010), BONN (winner 2009) and OXFORD (best paper award ECCV 2010) in segmentation.
Each has its own approach, and each has its advantages and disadvantages.
Engineering can probably maximize results.
It is becoming mature, and we can begin thinking about what new applications are enabled by such technologies.
The future: action recognition
It seems that action recognition in still images is a popular challenge.
The PASCAL organizers are keen to promote it for the future.
The focus will remain on still images, perhaps with more emphasis on incorporating user interaction.
It seems that the community is becoming more interested in the “alternative” PASCAL challenges.
The multimedia community probably has an important role to play here.