CVPR 2011 Best Student Paper: Recognition Using Visual Phrases
TRANSCRIPT
Outline
- Introduction
- Related Works
- Approach
- Phrasal Recognition
- Decoding Multiple Detections
- Results
- Discussion
Introduction
- Visual phrases
- Traditional approach
  - Detect objects (person, dog, horse)
  - Relation between objects
  - NMS (non-maximum suppression)
  - PASCAL, other
- Disadvantage
Introduction: Contributions
- Introducing visual phrases as categories for recognition
- Introducing a novel dataset for phrasal recognition
- Showing a gain over state-of-the-art methods of modeling interactions
- A decoding algorithm
- Performance results in multi-class object recognition
Related Work: Object Recognition
- Deformable templates [IEEE2001, CVPR1998]
- Part-based models [CVPR2005, CVPR2003]
- Detectors
- Deformable part-based model [IEEE2010]

Related Work: Object Interactions
- Focus on relations [ECCV2008]
- Person with object [CVPR2010]
- Objects [ECCV2010]
- Relation of objects [ICCV2010]
  - left, right, top, down
  - label weight, confidence

Related Work: Scene Understanding
- Represent scenes with global features that take into account general information about images [Vision2001, CVPR2006]
- Cluster [ECCV2008]

Related Work: Machine Translation
- Statistical translation methods [Press2010]
  - Translation model
  - Language model
  - A decoding algorithm
- Output: a query sentence
- Allow multiple-to-multiple translation
Phrasal Recognition: Dataset
- Select 8 object classes (Pascal VOC 2008): person, bike, car, dog, horse, bottle, sofa, chair
- A list of 17 visual phrases + background class
- Examples: dog jumping, horse jumping, person riding horse
- 2769 images (822 negative images)
- 120 examples on average per class
- 5067 bounding boxes (1796 phrases, 3271 objects)
- As the complexity of visual phrases increases, the number of training examples decreases
Phrasal Recognition: Appearance models
- Deformable part model
- 17 phrases in our dataset, using provided bounding boxes
- 8 categories from Pascal are used as models for objects

NMS decoding
- Perfect detectors with excellent, tightly tuned models
- Natural decoding strategy better than NMS on interactions
- Greedily search the space of labels
- Well-designed features (nearby)
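The NMS baseline mentioned above can be sketched as follows. This is a minimal greedy non-maximum suppression, not the code used in the paper; the box format `(x1, y1, x2, y2)`, the helper names `iou` and `greedy_nms`, and the 0.5 overlap threshold are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap an already kept,
    higher-scoring box by more than iou_threshold (hypothetical default).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because each box is judged only by overlap with stronger boxes, NMS cannot use relations such as "person above horse", which is the gap the decoding step is meant to fill.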
Decoding Multiple Detections
- All detector responses
- Decoding
- Final outcome

Decoding process
- We compare our decoding algorithm with that of [2] on our phrase dataset
- Step 1: construct the features
- Step 2: run the algorithm to learn a set of weights that rescore the confidences of the bounding boxes based on interactions
- Step 3: rescore again until optimal
Discriminative models for multi-class object layout
Decoding Multiple Detections
- $x_i$: a bounding box in an image
- An image is represented as a collection of overlapping bounding boxes, $X = \{x_i : i = 1, \dots, M\}$, where $M$ is the total number of bounding boxes
- $K$ is the number of different categories
- The score of image $X$ with $Y$ (formula lost in extraction)
- The set of weights that corresponds to the class of the bounding box (symbol lost in extraction)
Decoding Multiple Detections: Representation
- Image = bounding boxes
- Confidence, overlap, size ratio
- Relation: above, below, overlapping
- Window, category, spatial bins
- Representation has K*3*3+1 dimensions

Decoding Multiple Detections: Inference
- Assume bounding boxes are independent given their features
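The representation and inference slides can be illustrated with a short sketch. It builds a K*3*3+1 feature vector per box (for each category and each spatial bin: confidence, overlap, and size ratio of the strongest neighbor, plus the box's own confidence), then keeps a box when a class-specific linear score is positive. The binning rule, the choice of the strongest neighbor per bin, the feature ordering, and the names `box_features` and `decode` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

N_BINS = 3   # spatial bins: 0 = above, 1 = below, 2 = overlapping
N_STATS = 3  # per bin: confidence, overlap, size ratio


def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a, b):
    """Intersection area of two boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def spatial_bin(ref, other):
    """Assumed rule: overlapping if boxes intersect, else compare centers.

    Image coordinates are assumed, so a smaller y means higher in the image.
    """
    if intersection(ref, other) > 0:
        return 2
    ref_cy = (ref[1] + ref[3]) / 2.0
    other_cy = (other[1] + other[3]) / 2.0
    return 0 if other_cy < ref_cy else 1


def box_features(i, boxes, labels, confs, num_classes):
    """Feature vector of length K*3*3+1 for box i."""
    feat = np.zeros(num_classes * N_BINS * N_STATS + 1)
    feat[-1] = confs[i]  # the box's own detector confidence
    best = np.full((num_classes, N_BINS), -np.inf)
    for j in range(len(boxes)):
        if j == i:
            continue
        k, b = labels[j], spatial_bin(boxes[i], boxes[j])
        if confs[j] > best[k, b]:  # keep the strongest neighbor per (class, bin)
            best[k, b] = confs[j]
            inter = intersection(boxes[i], boxes[j])
            union = area(boxes[i]) + area(boxes[j]) - inter
            base = (k * N_BINS + b) * N_STATS
            feat[base] = confs[j]
            feat[base + 1] = inter / union if union > 0 else 0.0
            feat[base + 2] = area(boxes[j]) / area(boxes[i]) if area(boxes[i]) > 0 else 0.0
    return feat


def decode(boxes, labels, confs, weights, num_classes):
    """Keep box i when its class-specific linear score is positive.

    Boxes are treated as independent given their features, so each
    decision is made separately. weights has shape (K, K*3*3+1).
    """
    keep = []
    for i in range(len(boxes)):
        phi = box_features(i, boxes, labels, confs, num_classes)
        keep.append(bool(weights[labels[i]] @ phi > 0))
    return keep
```

With K = 2 categories the vector has 2*3*3+1 = 19 entries. A negative weight on "overlapped by a confident box of class 1" lets the decoder suppress a class 0 box even when its own confidence is high.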
Learning
- A form of max-margin structured learning
Our inner maximization is exact and very fast. We solve this optimization problem by the subgradient descent method.
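The slide's own equations did not survive extraction, so the following is only a sketch of how max-margin learning by subgradient descent can look under stated assumptions: a Hamming loss over keep/drop labels, an L2 regularizer, and one weight vector per class. Under these assumptions the inner (loss-augmented) maximization splits into one independent decision per box, which is why it is exact and fast. The names `loss_augmented_labels`, `subgradient_step`, and `train`, and the values of `lam` and `lr`, are hypothetical.

```python
import numpy as np


def loss_augmented_labels(scores, h_true):
    """Exact inner maximization under an assumed Hamming loss.

    Keeping box i gains score_i plus 1 if the truth is 'drop';
    dropping it gains 1 if the truth is 'keep'. Each box is decided alone.
    """
    return (scores + 1.0 - 2.0 * h_true > 0).astype(float)


def subgradient_step(W, feats, labels, h_true, lam=0.01, lr=0.1):
    """One subgradient step on the regularized structured hinge loss.

    W: (K, D) class weights, feats: (N, D) box features,
    labels: (N,) integer class of each box, h_true: (N,) 0/1 keep labels.
    """
    scores = np.einsum("ij,ij->i", W[labels], feats)
    h_hat = loss_augmented_labels(scores, h_true)
    grad = lam * W  # gradient of the (lam / 2) * ||W||^2 regularizer
    # Add (h_hat_i - h_true_i) * feats_i to the row of each box's class.
    np.add.at(grad, labels, (h_hat - h_true)[:, None] * feats)
    return W - lr * grad


def train(feats, labels, h_true, num_classes, epochs=200, lam=0.01, lr=0.1):
    """Run a fixed number of subgradient steps from zero weights."""
    W = np.zeros((num_classes, feats.shape[1]))
    for _ in range(epochs):
        W = subgradient_step(W, feats, labels, h_true, lam, lr)
    return W
```

On a toy problem where the single informative feature is the detector confidence, the learned weights keep the positively scored boxes and drop the rest. A fixed step size is used here for brevity; a decaying schedule is the more common choice for subgradient methods.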
Single category detection
- Deformable part models for 17 visual phrases
- The trained models from [reference missing] for objects
- Use PASCAL dataset: 50 positive and 150 negative examples
- Show Precision-Recall (PR) curves
- Trained these detectors with at most 50 positive examples
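For readers unfamiliar with the PR curves and AP numbers reported in the results, here is a minimal sketch of how they can be computed from ranked detections. It uses plain, non-interpolated average precision and assumes every ground-truth positive appears somewhere in the ranked list; the PASCAL VOC evaluation uses an interpolated variant, so these values would not match the official tool exactly. The function names are hypothetical.

```python
import numpy as np


def precision_recall(scores, is_positive):
    """Precision and recall after each detection, ranked by score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_positive, dtype=float)[order]
    tp = np.cumsum(hits)        # true positives so far
    fp = np.cumsum(1.0 - hits)  # false positives so far
    recall = tp / max(hits.sum(), 1.0)
    precision = tp / (tp + fp)
    return precision, recall


def average_precision(scores, is_positive):
    """Area under the PR curve, summed over recall increments."""
    precision, recall = precision_recall(scores, is_positive)
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))
```

For example, with detections scored 0.9, 0.8, 0.7, 0.6 where the first and third are correct, precision at the two recall steps is 1 and 2/3, giving an AP of 5/6.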
Result
Result: Decoding

                     Paper decoding*   [2]     NMS
Overall AP           0.319             0.313   0.308
Mean per-class AP    0.495             0.493   0.491

[2] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV, 2010.
Discussion
- Introduce visual phrases and a phrasal recognition dataset
- A decoding algorithm
- The dimensionality of our features grows with the number of categories

Discussion: Future work
- The relations between attributes and objects
- Parts and objects
- Visual phrases and scenes
- Objects and visual phrases mirror one another

Discussion: Experience
- Low complexity
- Uses less data for detection
- Features grow with the number of categories (exponential, 2^n)
- But we don't need to consider all of the categories when we model the interactions
- Building long enough phrase tables is still a challenge