
EECS 442 – Computer Vision

3D Object Recognition and Scene Understanding

Interpreting the visual world

Object: Building, 8–10 meters away
Object: Car, ¾ view, 2–3 meters away
Object: Traffic light

How can we achieve all of this?

• 3D modeling – no semantics
  – Chen & Medioni ’92, Debevec et al. ’96, Hartley & Zisserman ’00, Levoy et al. ’00, Pollefeys et al. ’02, Brown & Lowe ’04, Nister ’04, Schindler et al. ’08, Snavely et al. ’08, Agarwal et al. ’09, etc.
• Semantic reasoning – no 3D geometry
• Joint 3D modeling and semantic reasoning

How can we achieve all of this?

• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
  – Weber et al. ’00, Felzenszwalb & Huttenlocher ’00, Ullman et al. ’02, Fergus et al. ’03, Torralba et al. ’03, Leibe & Schiele ’04, Kumar & Hebert ’04, Fei-Fei & Perona ’05, Sivic et al. ’05, Shotton et al. ’05, Grauman et al. ’05, Lazebnik et al. ’06, Maji & Malik ’07, Vedaldi & Soatto ’08, Zhu et al. ’08, etc.

How can we achieve all of this?

• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Semantic from range data – disjoint 3D modeling and recognition
  – Huber ’01, Rusu et al. ’08, Brostow et al. ’08, Son & Kim ’10, Tang et al. ’10, Adan et al. ’11, etc.

Courtesy of Adan et al., 2011

How can we achieve all of this?

• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Semantic from range data – disjoint 3D modeling and recognition
• Joint 3D modeling and semantic reasoning
  – Hoiem et al. ’06–’10, Gould et al. ’09, Hedau et al. ’09, Gupta et al. ’10, Ladický et al. ’10, Bao, Sun & Savarese ’10, Sun, Bao & Savarese ’10, Bao & Savarese ’11

Joint 3D modeling and recognition

• Given the scene layout, objects can be detected more robustly

• Objects and their geometrical attributes provide constraints for estimating the scene layout

In this lecture…

• 3D object detectors
  – robust to viewpoint transformations
  – allow estimation of pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
  – single view
  – multi-view
  – videos

3D Object Detectors

• Detect objects under generic viewpoints
• Estimate object pose
• General: work for any object category

Viewing sphere: viewpoints parameterized by azimuth and zenith angles.
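The viewing sphere can be made concrete with a small sketch. This is illustrative only (the fixed zenith of 75° and the eight 45°-spaced azimuth bins mirror common multi-view dataset conventions, not anything specified on the slides):

```python
import numpy as np

def view_direction(azimuth_deg, zenith_deg):
    """Unit viewing direction on the sphere for a given (azimuth, zenith).

    Azimuth rotates about the vertical axis; zenith is the angle from
    the vertical axis (zenith = 90 deg is a side-on view).
    """
    a = np.deg2rad(azimuth_deg)
    z = np.deg2rad(zenith_deg)
    return np.array([np.sin(z) * np.cos(a),
                     np.sin(z) * np.sin(a),
                     np.cos(z)])

# Eight azimuth bins at a fixed zenith, e.g. one bin per 45 degrees
views = [view_direction(a, 75.0) for a in range(0, 360, 45)]
```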

Single 3D object recognition
• Ballard ’81, Grimson & Lozano-Pérez ’87, Lowe ’87
• Edelman et al. ’91, Ullman & Basri ’91, Rothwell ’92, Lindeberg ’94, Murase & Nayar ’94
• Zhang et al. ’95, Schmid & Mohr ’96, Schiele & Crowley ’96, Lowe ’99, Jacobs & Basri ’99
• Rothganger et al. ’04, Ferrari et al. ’05, Brown & Lowe ’05, Snavely et al. ’06, Yin & Collins ’07

Single-view object categorization
• Leung et al. ’99, Weber et al. ’00, Ullman et al. ’02, Fergus et al. ’03, Torralba et al. ’03
• Felzenszwalb & Huttenlocher ’03, Fei-Fei et al. ’04, Leibe et al. ’04, Kumar & Hebert ’04
• Sivic et al. ’05, Shotton et al. ’05, Grauman et al. ’05, Sudderth et al. ’05, Torralba et al. ’05
• Lazebnik et al. ’06, Todorovic et al. ’06, Bosch et al. ’07, Vedaldi & Soatto ’08

3D Object Categorization

Three families of models:
• Mixture of 2D single-view models: Weber et al. ’00, Schneiderman et al. ’01, Bart et al. ’04, Gu & Ren ’10
• Implicit 3D models: Thomas et al. ’06, Kushal et al. ’07, Savarese et al. ’07, ’08, Sun et al. ’09, …
• Explicit 3D models: Chiu et al. ’07, Hoiem et al. ’07, Yan et al. ’07, …, Xiang & Savarese ’12

From single-view models, to mixtures of 2D models, to a 3D category model.

Mixtures of 2D models: Weber et al. ’00, Schneiderman et al. ’01, Ullman et al. ’02, Fergus et al. ’03, Torralba et al. ’03, Felzenszwalb & Huttenlocher ’03, Leibe et al. ’04, Shotton et al. ’05, Grauman et al. ’05, Savarese et al. ’06, Todorovic et al. ’06, Vedaldi & Soatto ’08, Zhu et al. ’08, Gu & Ren ’10

Cons of mixtures of single-view models:
• The single-view models are independent
• Not scalable to large numbers of categories/viewpoints
• Output just bounding boxes
• Cannot estimate 3D pose or 3D layout


Implicit 3D models

A sparse set of interest points or object parts is linked across views by implicit 3D transformations (homographies H, fundamental matrices F).

Linking features or parts across views:
• Perspective or affine transformation constraints: x′ = H x
• Epipolar transformation constraints: l′ = F x (the epipolar line of x in the other view)
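A minimal sketch of the two linking constraints, assuming a known homography H or fundamental matrix F between two model views (the helper names are illustrative):

```python
import numpy as np

def map_by_homography(H, x):
    """Map an image point into the other view: x' = H x (homogeneous)."""
    xp = H @ np.append(x, 1.0)
    return xp[:2] / xp[2]

def epipolar_line(F, x):
    """Epipolar line l' = F x in the second view; returns (a, b, c)
    with the line defined by a*x' + b*y' + c = 0."""
    return F @ np.append(x, 1.0)

def epipolar_distance(F, x, x_prime):
    """Distance of a candidate match x' from the epipolar line of x,
    a standard score for linking features across views."""
    a, b, c = epipolar_line(F, x)
    return abs(a * x_prime[0] + b * x_prime[1] + c) / np.hypot(a, b)
```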


Implicit 3D models by ISM representations
• Thomas et al. ’06, Leibe et al. ’04

Courtesy of Thomas et al., ’06

A set of region tracks connects the model views. Each track is composed of image regions of a single physical surface patch across the model views in which it is visible.

[Ferrari et al. ’04, ‘06]

Region tracks

Results

Savarese & Fei-Fei, ICCV ’07; Savarese & Fei-Fei, ECCV ’08; Sun et al., CVPR ’09, ICCV ’09

• Canonical parts capture view-invariant, diagnostic appearance information
• Parts and their relationships are modeled probabilistically
• Parameters are learned so as to maximize detection accuracy
• A 2½D structure links parts via weak geometry

Implicit 3D models by graph-based representations

Parameterization on the viewing sphere

• Model the object as a collection of parts for any viewpoint (T, S) on the viewing sphere

Multi-view generative part-based model

[Plate diagram: for each viewpoint (T, S), every observed feature n is encoded as a codeword Y_n and a location X_n; the model variables include the part-proportion prior, the part appearance A, and the part location/shape V, over K parts.]

Learning:
• Estimate the latent variables and relevant parameters, given the observations
• Variational EM can be used [Blei, ICML 2004]; a sampling sketch of the generative story follows
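As a concrete reading of the plate diagram, here is a hedged sketch of the generative story: for a viewpoint (T, S), each feature draws a part assignment from the part-proportion prior, then a codeword Y_n from that part's appearance distribution and a location X_n from its location/shape Gaussian. The variable names and shapes are assumptions for illustration, not the authors' exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_features(pi, theta_app, mu, Sigma, n_features):
    """Draw (codeword, location) pairs from a view-conditioned part model.

    pi        : (K,) part-proportion prior for the current viewpoint (T, S)
    theta_app : (K, V) per-part codeword distributions (V = vocabulary size)
    mu, Sigma : (K, 2) and (K, 2, 2) per-part location Gaussians
    """
    feats = []
    for _ in range(n_features):
        k = rng.choice(len(pi), p=pi)                        # part assignment
        y = rng.choice(theta_app.shape[1], p=theta_app[k])   # codeword Y_n
        x = rng.multivariate_normal(mu[k], Sigma[k])         # location X_n
        feats.append((y, x))
    return feats
```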

Incorporating geometrical constraints

• Within-triangle constraints: part centers m_i, m_j within a triangle T of the viewing sphere are related by pairwise transformations, m_i = M_ij m_j
• View-morphing constraints [Seitz & Dyer, SIGGRAPH ’96; Xiao & Shah, CVIU ’04]: the shape and center of parts in an intermediate view are linear combinations of those in two key views, I_h: {P1_h, P2_h, P3_h, O_h} and I_k: {P1_k, P2_k, P3_k, O_k} (a sketch of the morphing step follows)
• Both are encoded as penalty terms in variational EM

S. M. Seitz and C. R. Dyer, Proc. SIGGRAPH 96, 1996, 21–30.
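A minimal sketch of the view-morphing step: per Seitz & Dyer, linear interpolation of corresponding points yields a physically valid in-between view once the two key views are rectified. The function name and array shapes are illustrative:

```python
import numpy as np

def morph_part_locations(m_h, m_k, s):
    """Interpolate corresponding part centers between key views I_h and I_k.

    m_h, m_k : (P, 2) arrays of part centers in the two (rectified) views
    s        : interpolation parameter in [0, 1]; s = 0 gives I_h, s = 1 gives I_k
    """
    return (1.0 - s) * np.asarray(m_h, dtype=float) + s * np.asarray(m_k, dtype=float)
```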

Initializing the model

• Initial parts and part correspondences are defined via sequential RANSAC / J-linkage [Toldo et al. ’07]

Semi-supervised training

• Supervision: class label and object bounding box only
• No need to observe the same object instance from multiple views [unlike Savarese & Fei-Fei ’07, ’08]
• No pose labels [unlike Sun et al., CVPR ’09]
• No part labels

Incremental learning

• Enables unorganized, online collection of training images
• Increases learning efficiency (no large storage space needed)
• Each new training image is assigned to a triangle of the viewing sphere
• Evidence from the new image is used to update the model parameters
• Sufficient statistics are re-estimated in an iterative fashion (see the sketch below)
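The sketch referenced above. Running sufficient statistics (count, sum, sum of outer products) let each new training image update a Gaussian part-location model without revisiting old images; this is a generic online update, not the authors' exact estimator:

```python
import numpy as np

class GaussianPartStats:
    """Online sufficient statistics for a Gaussian part-location model."""

    def __init__(self, dim=2):
        self.n = 0
        self.s1 = np.zeros(dim)          # running sum of observations
        self.s2 = np.zeros((dim, dim))   # running sum of outer products

    def update(self, x):
        """Fold one new observation (e.g. a part center) into the stats."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.s1 += x
        self.s2 += np.outer(x, x)

    def mean(self):
        return self.s1 / self.n

    def cov(self):
        mu = self.mean()
        return self.s2 / self.n - np.outer(mu, mu)
```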

Evolution of learnt parts

Examples of learnt part-based models: car, travel iron

Experimental results

• Object detection from arbitrary viewing angles
• Accurate estimation of the object pose

Datasets: PASCAL VOC 2006, 3D Object Dataset

Detection examples: car, travel iron

[ROC curves on the 3D Object Dataset (car, bicycle): our model vs. Savarese & Fei-Fei, ICCV ’07 and Sun et al., CVPR ’09]

[Viewpoint classification accuracy on the 3D Object Dataset over the eight views V1–V8 (45º–315º): our model vs. Savarese & Fei-Fei, ICCV ’07]

Predicting object appearance from novel views

• Given the learnt model, predict the object's appearance from an unseen viewpoint on the viewing sphere [for natural scenes, see Hoiem et al. ’07; Saxena et al. ’07]
• Related work: Thomas et al. ’08, Cremers et al. ’09
• Comparison: an affine-transformation baseline vs. our model


Explicit 3D Models
• Chiu et al. ’07, Hoiem et al. ’07, Yan et al. ’07, …, Xiang & Savarese ’12

• The part configuration is modeled as a conditional random field with max-margin parameter estimation
• Enables 6-DOF object pose estimation
• Enables 3D layout estimation of object parts (part-to-part homographies H_ij)

Results on the 3D Object Dataset [Savarese & Fei-Fei ’07]: Xiang & Savarese, CVPR ’12

In this lecture…

• 3D object detectors
  – robust to viewpoint transformations
  – allow estimation of pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
  – single view
  – multi-view
  – videos

3D scene understanding from a single image
Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012

• A coherent probabilistic model captures the relationship between objects and supporting planes
• No assumptions on the camera
• Works both indoors and outdoors

Related work: Hoiem et al. ’06–’10, Gould et al. ’09, Hedau et al. ’09, Lee et al. ’09, ’10, Gupta et al. ’10, ’11, Tsai et al. ’11

In this lecture…

• 3D object detectors
  – robust to viewpoint transformations
  – allow estimation of pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
  – single view
  – multi-view
  – videos

3D scene understanding from multiple images: Semantic Structure from Motion (SSFM)
Bao & Savarese, CVPR 2011; Bao, Bagra & Savarese, CORP – ICCV 2011; Bao, Bagra, Chao & Savarese, CVPR 2012; Bao, Xiang & Savarese, ECCV 2012

• Measurements I:
  – points (x, y, scale)
  – objects (x, y, scale, pose)
  – regions (x, y, pose)
• Model parameters:
  – Q = 3D points
  – O = 3D objects
  – B = 3D regions
  – C = camera parameters (K, R, T)

A data-structure sketch of these variables follows.
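One plausible way to lay out the SSFM unknowns in code, assuming C collects pinhole camera parameters and Q, O, B are the 3D points, objects, and regions (all field names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Camera:                 # one element of C
    K: np.ndarray             # 3x3 intrinsics
    R: np.ndarray             # 3x3 rotation
    T: np.ndarray             # 3-vector translation

@dataclass
class Object3D:               # one element of O
    location: np.ndarray      # 3D position
    pose: float               # e.g. azimuth on the viewing sphere
    scale: float
    category: str

@dataclass
class SceneModel:
    cameras: list = field(default_factory=list)   # C = camera parameters
    points: list = field(default_factory=list)    # Q = 3D points
    objects: list = field(default_factory=list)   # O = 3D objects
    regions: list = field(default_factory=list)   # B = 3D regions
```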


[Factor graph: the camera parameters C are connected to points Q, objects O, and regions B through the compatibility potentials Ψ_CQ, Ψ_CO, Ψ_CB]


SSFM: point-level compatibility Ψ_CQ

• Point re-projection error
• Tomasi & Kanade ’92, Triggs et al. ’99, Soatto & Perona ’99, Hartley & Zisserman ’00, Dellaert et al. ’00

• The error compares each 3D point's projection against its 2D observation (see the sketch below)
• Pollefeys & Van Gool ’02, Nister ’04, Brown & Lowe ’07, Snavely et al. ’08
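A minimal sketch of the point re-projection error behind Ψ_CQ, assuming a pinhole camera (K, R, T) and one 2D point observation per view; the compatibility term rewards small errors:

```python
import numpy as np

def project(K, R, T, Q):
    """Pinhole projection of a 3D point Q into a camera (K, R, T)."""
    p = K @ (R @ Q + T)
    return p[:2] / p[2]

def point_reprojection_error(K, R, T, Q, x_obs):
    """Distance between the projection of Q and its 2D measurement."""
    return np.linalg.norm(project(K, R, T, Q) - np.asarray(x_obs, dtype=float))
```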


SSFM: object-level compatibility Ψ_CO

• Object "re-projection" error: a 3D object hypothesis is projected into each camera and compared with the per-view detections (e.g., class = "car" with scale = 1, pose = "back" in camera 1 and scale = 3, pose = "¾" in camera 2)
• Agreement with the measurements is computed using position, pose, and scale
• A 3D object detector returns the confidence (probability) that an object of class c with scale s and pose p is found at (x, y)
  – Savarese & Fei-Fei, ICCV ’07, ECCV ’08; Su et al., ICCV 2009; Sun et al., CVPR 2009, ECCV 2010; Yu & Savarese, CVPR 2012
• Efficiently implemented using a parallel computing architecture (a sketch of the compatibility term follows)
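A hedged sketch of the object-level term, reusing the Camera and Object3D structures from the earlier sketch: project the 3D object hypothesis into every view and accumulate the detector's confidence at the predicted location, scale, and pose. The detector_score interface and the scale/depth relation are illustrative assumptions; a full model would also transform the pose into each camera's frame:

```python
import numpy as np

def object_compatibility(cameras, obj, detector_score):
    """Sum of per-view log detector confidences for one 3D object hypothesis.

    detector_score(view_idx, x, y, scale, pose, category) -> probability
    is an assumed interface to the per-view 3D object detector.
    """
    total = 0.0
    for i, cam in enumerate(cameras):
        p = cam.K @ (cam.R @ obj.location + cam.T)
        x, y = p[:2] / p[2]              # predicted 2D position
        scale_2d = obj.scale / p[2]      # apparent scale shrinks with depth
        total += np.log(detector_score(i, x, y, scale_2d, obj.pose,
                                       obj.category) + 1e-9)
    return total
```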


SSFM: region-level compatibility Ψ_CB

• Region "re-projection" error

SSFM with interactions
Bao, Bagra, Chao, Savarese, CVPR 2012

• The factor graph is extended with pairwise potentials among the unknowns themselves, Ψ_QO (point–object), Ψ_QB (point–region), and Ψ_OB (object–region), alongside the camera terms Ψ_CQ, Ψ_CO, Ψ_CB
• Object–point interactions
• Point–region interactions
• Object–region interactions

Solving the SSFM problem

• A modified Markov chain Monte Carlo (MCMC) sampling algorithm (a generic sketch follows)
• Initialization of the cameras, objects, and points is critical for the sampling
• The camera configuration is initialized using:
  – SFM
  – consistency of object/region properties across views

F. Dellaert, S. Seitz, S. Thrun, and C. Thorpe. Feature correspondence: A Markov chain Monte Carlo approach. In NIPS, 2000.
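The generic sketch referenced above: a Metropolis-Hastings loop over the joint configuration, assuming symmetric proposals (the paper's modified MCMC differs in its proposal design and data-driven initialization):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ssfm(init_state, log_posterior, propose, n_iters=10000):
    """Metropolis-Hastings over the joint (C, Q, O, B) configuration.

    init_state    : initial configuration, e.g. from SFM
    log_posterior : sums the compatibility terms (Psi_CQ, Psi_CO, Psi_CB, ...)
    propose       : draws a perturbed configuration (assumed symmetric)
    """
    state = best = init_state
    lp = best_lp = log_posterior(state)
    for _ in range(n_iters):
        cand = propose(state)
        cand_lp = log_posterior(cand)
        if np.log(rng.random()) < cand_lp - lp:   # accept/reject step
            state, lp = cand, cand_lp
            if lp > best_lp:                      # keep the best sample (MAP-style)
                best, best_lp = state, lp
    return best
```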

Public Ford Campus Vision and LiDAR Dataset [Pandey et al., International Journal of Robotics Research, 2011]
• Object categories: cars
• Ground-truth depth provided by LiDAR

In-house Office dataset
• Object categories: mugs, mice, keyboards
• Ground-truth depth provided by Kinect

In-house Street dataset
• Object categories: humans
• No ground-truth depth available

Results

[Qualitative results: per-view observations (detections and segmentations, view 1 … view N) compared with the joint reconstruction & recognition output]

SSFM source code available: http://www.eecs.umich.edu/vision/research.html

Results

Object detection results

Average precision in detecting objects (cars) in the 2D image, FORD CAMPUS:
  DPM [1]:              54.5%
  SSFM (2011), 2 views: 61.3%
  SSFM (2012), 2 views: 62.8%
  SSFM (2012), 4 views: 66.5%

Accuracy in localizing objects in 3D space (AP):
  FORD CAMPUS – cars:                 Hoiem [2] 21.4%, SSFM (2011) 32.7%, SSFM (2012) 43.1%
  OFFICE – keyboards, mice, monitors: Hoiem [2] 15.5%, SSFM (2011) 20.2%, SSFM (2012) 21.6%

[1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2009.
[2] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.

Camera parameter reconstruction errors

Camera translation error:
  FORD CAMPUS: SFM [1] 26.5, SSFM (2011) 19.9, SSFM (2012) 12.1
  OFFICE:      SFM [1] 8.5,  SSFM (2011) 4.7,  SSFM (2012) 4.2
  STREET:      SFM [1] 27.1, SSFM (2011) 17.6, SSFM (2012) 11.4

Camera rotation error:
  FORD CAMPUS: SFM [1] <1,   SSFM (2011) <1,   SSFM (2012) <1
  OFFICE:      SFM [1] 9.6,  SSFM (2011) 4.2,  SSFM (2012) 3.5
  STREET:      SFM [1] 21.1, SSFM (2011) 3.1,  SSFM (2012) 3.0

[1] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. IJCV, (2), Nov. 2008.


In this lecture…

• 3D object detectors
  – robust to viewpoint transformations
  – allow estimation of pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
  – single view
  – multi-view
  – videos

Joint 3D modeling and recognition from videos

• Choi, Shahid & Savarese, WMC 2010
• Choi & Savarese, ECCV 2010
• Related work: Zhao et al. ’04, Wu et al. ’07, Breitenstein et al. ’09, Ess et al. ’09

Challenges:
• Monocular, uncalibrated cameras
• Arbitrary camera motion
• Highly cluttered scenes
• Occlusion and background clutter
• Moving targets

Joint tracking and camera estimation

• Ω: set of state variables (camera parameters, target locations in 3D, interest points in 3D)
• Χ: set of observations (pedestrian detections, tracked interest points)
• Additional evidence (3D depth, IMU, etc.) is easy to add
• Runs at 5 frames/second; code available online soon (a sketch of the state/observation split follows)
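A sketch of the Ω/Χ split as plain containers, mirroring the lists above (field names are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class TrackingState:
    """Omega: state variables estimated jointly at each time step."""
    camera_params: list = field(default_factory=list)   # camera parameters
    targets_3d: list = field(default_factory=list)      # target locations in 3D
    points_3d: list = field(default_factory=list)       # interest points in 3D

@dataclass
class TrackingObservations:
    """Chi: per-frame measurements feeding the joint model."""
    pedestrian_detections: list = field(default_factory=list)
    tracked_interest_points: list = field(default_factory=list)
```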

Safe Driving Applications

Autonomous navigation

Conclusions

• Intelligent vision requires joint reconstruction and recognition
• Geometry provides critical contextual cues for robust recognition
• High-level semantics help establish robust geometrical constraints for reconstruction
  – within a single view
  – across views
• High-level semantics improve scalability in reconstruction problems
  – fewer images are needed, with wider baselines

EECS 442 – Computer Vision

• Hope you have enjoyed this class!

• Good luck with your projects & presentations!
