Stockman MSU/CSE Fall 2008

Computer Vision and Human Perception

A brief intro from an AI perspective


Computer Vision and Human Perception

• What are the goals of CV?

• What are the applications?

• How do humans perceive the 3D world via images?

• Some methods of processing images.


Goal of computer vision

• Make useful decisions about real physical objects and scenes based on sensed images.

• Alternative (Aloimonos and Rosenfeld): the goal is the construction of scene descriptions from images.

• How do you find the door to leave?

• How do you determine if a person is friendly or hostile? ... an elder? ... a possible mate?


Critical Issues

• Sensing: how do sensors obtain images of the world?

• Information: how do we obtain color, texture, shape, motion, etc.?

• Representations: what representations should/does a computer [or brain] use?

• Algorithms: what algorithms process image information and construct scene descriptions?


Images: 2D projections of 3D

• The 3D world has color, texture, surfaces, volumes, light sources, objects, motion, betweenness, adjacency, connections, etc.

• A 2D image is a projection of a scene from a specific viewpoint; many 3D features are captured, some are not.

• Brightness or color = g(x,y) or f(row, column) for a certain instant of time.

• Images indicate familiar people, moving objects or animals, health of people or machines.


Image receives reflections

• Light reaches surfaces in 3D

• Surfaces reflect

• Sensor element receives light energy

• Intensity matters

• Angles matter

• Material matters


CCD camera has discrete elements

Lens collects light rays

CCD elements replace chemicals of film

Number of elements less than with film (so far)


Intensities near center of eye


Camera + Programs = Display

• Camera inputs to frame buffer

• Program can interpret data

• Program can add graphics

• Program can add imagery


Some image format issues

Spatial resolution; intensity resolution; image file format


Resolution is “pixels per unit of length”

Resolution decreases by one half in cases at left

Human faces can be recognized at 64 x 64 pixels per face


Features detected depend on the resolution

Can tell hearts from diamonds

Can tell face value

Generally need 2 pixels across line or small region (such as eye)


Human eye as a spherical camera

• 100M sensing elements in retina

• Rods sense intensity

• Cones sense color

• Fovea has tightly packed elements, more cones

• Periphery has more rods

• Focal length is about 20 mm

• Pupil/iris controls light entry

• Eye scans, or saccades, to image details on fovea

• 100M sensing cells funnel to 1M optic nerve connections to the brain


Image processing operations

Thresholding; edge detection; motion field computation


Find regions via thresholding

Region has brighter or darker or redder color, etc.

If pixel > threshold then pixel = 1 else pixel = 0
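The rule above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the image is a list of rows of gray levels; the sample image and the threshold of 100 are made up for the example.

```python
def threshold(image, t):
    """Return a binary image: 1 where a pixel exceeds t, otherwise 0."""
    return [[1 if pixel > t else 0 for pixel in row] for row in image]


# Hypothetical 3x4 gray-level image with a bright region on a dark background.
image = [
    [10, 12, 200, 210],
    [11, 190, 205, 13],
    [9, 14, 15, 12],
]
binary = threshold(image, 100)
for row in binary:
    print(row)
```

The 1-pixels form the candidate region. Choosing the threshold is the hard part in practice.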


Example red blood cell image

• Many blood cells are separate objects

• Many touch: bad!

• Salt and pepper noise from thresholding

• How usable is this data?


Robot vehicle must see stop sign

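sign = imread('Images/stopSign.jpg', 'jpg');   % RGB image
% Keep pixels with a strong red channel and weak green and blue channels
red = (sign(:,:,1) > 120) & (sign(:,:,2) < 100) & (sign(:,:,3) < 80);
out = uint8(red) * 200;                        % red pixels become gray level 200
imwrite(out, 'Images/stopRed120.jpg', 'jpg');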


Thresholding is usually not trivial


Can cluster pixels by color similarity and by adjacency

Figure panels: original RGB image; color clusters by K-means
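As a rough sketch of the color-similarity half of this idea, the following plain-Python K-means groups RGB triples around starting centers. It ignores adjacency entirely, and the function names, the sample pixels, and the starting centers are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two RGB triples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_colors(pixels, centers, max_iterations=20):
    """Cluster RGB pixels around the given starting centers.

    Returns (labels, centers); labels[i] is the cluster index of pixels[i].
    """
    centers = [tuple(float(v) for v in c) for c in centers]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each pixel joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in pixels
        ]
        # Update step: each center moves to the mean color of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(pixels, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(m[i] for m in members) / len(members) for i in range(3))
                )
            else:
                new_centers.append(old)  # an empty cluster stays where it was
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers


# Two reddish and two bluish pixels; starting centers are pure red and pure blue.
pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 240), (20, 15, 250)]
labels, centers = kmeans_colors(pixels, [(255, 0, 0), (0, 0, 255)])
print(labels, centers)
```

To account for adjacency as well, one common variation is to append the pixel's row and column to the feature vector, at the cost of having to weight position against color.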


Some image processing ops

Finding contrast in an image; using neighborhoods of pixels; detecting motion across 2 images


Differentiate to find object edges

• For each pixel, compute its contrast

• Can use max difference of its 8 neighbors

• Detects intensity change across boundary of adjacent regions

• LOG filter later on
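The "max difference of its 8 neighbors" measure might look like the sketch below. It assumes a gray-level image stored as a list of rows and simply skips neighbors that fall outside the image; the function name and test image are illustrative.

```python
def contrast(image, r, c):
    """Largest absolute intensity difference between pixel (r, c) and its 8 neighbors."""
    rows, cols = len(image), len(image[0])
    best = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the pixel itself
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                best = max(best, abs(image[r][c] - image[rr][cc]))
    return best


# Dark region on the left, bright region on the right.
image = [
    [10, 10, 90],
    [10, 10, 90],
    [10, 10, 90],
]
print(contrast(image, 1, 0))  # inside the dark region
print(contrast(image, 1, 1))  # next to the boundary
```

Pixels inside a uniform region get contrast 0, while pixels adjacent to the boundary get a large value.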


4 and 8 neighbors of a pixel

4 neighbors are at multiples of 90 degrees

.  N  .
W  *  E
.  S  .

8 neighbors are at every multiple of 45 degrees

NW  N  NE
W   *  E
SW  S  SE
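In code, the two neighborhoods are usually written as lists of (row, column) offsets. The sketch below assumes rows increase downward, so N is row offset -1; the names N4, N8, and neighbors are chosen for this example.

```python
# (row offset, column offset) pairs; north is row - 1.
N4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]
N8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def neighbors(r, c, rows, cols, offsets):
    """Coordinates of the neighbors of (r, c) that lie inside a rows x cols image."""
    return [
        (r + dr, c + dc)
        for dr, dc in offsets
        if 0 <= r + dr < rows and 0 <= c + dc < cols
    ]


print(neighbors(1, 1, 3, 3, N4))  # interior pixel: all four neighbors
print(neighbors(0, 0, 3, 3, N8))  # corner pixel: only three neighbors exist
```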


Detect Motion via Subtraction

• Constant background

• Moving object

• Produces pixel differences at boundary

• Reveals moving object and its shape

Differences computed over time rather than over space
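A minimal sketch of subtraction over time, assuming two gray-level frames of equal size stored as lists of rows. The tolerance t and the one-row frames are invented for the example.

```python
def frame_difference(prev, curr, t):
    """Mark pixels whose intensity changed by more than t between two frames."""
    return [
        [1 if abs(a - b) > t else 0 for a, b in zip(row_prev, row_curr)]
        for row_prev, row_curr in zip(prev, curr)
    ]


# A bright object two pixels wide moves one pixel to the right on a dark background.
frame_n = [[0, 50, 50, 0, 0]]
frame_n1 = [[0, 0, 50, 50, 0]]
moving = frame_difference(frame_n, frame_n1, 10)
print(moving)
```

Only the trailing and leading edges of the object change; its uniform interior does not, which is why the differences appear at the boundary.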


Two frames of aerial imagery

Video frames N and N+1 show slight movement: most pixels are the same, just in different locations.


Best matching blocks between video frames N+1 to N (motion vectors)

The bulk of the vectors show the true motion of the airplane taking the pictures. The long vectors are incorrect motion vectors, but they do work well for compression of image I2!

Best matches from 2nd to first image shown as vectors overlaid on the 2nd image. (Work by Dina Eldin.)
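One common way to find such best matches is an exhaustive search that minimizes the sum of absolute differences (SAD) between blocks. The sketch below assumes that criterion; the slides do not say which measure was actually used, and the function names, block size, and search radius are illustrative.

```python
def sad(frame_a, ra, ca, frame_b, rb, cb, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ra + i][ca + j] - frame_b[rb + i][cb + j])
        for i in range(size)
        for j in range(size)
    )


def best_match(curr, prev, r, c, size, radius):
    """Motion vector (dr, dc) from the block at (r, c) in curr to its best match in prev."""
    rows, cols = len(prev), len(prev[0])
    best = None
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            # Only consider candidate blocks that lie fully inside prev.
            if 0 <= rr <= rows - size and 0 <= cc <= cols - size:
                cost = sad(curr, r, c, prev, rr, cc, size)
                if best is None or cost < best[0]:
                    best = (cost, dr, dc)
    return best[1], best[2]


# A 2x2 textured patch moves one pixel to the right between frame N and frame N+1.
frame_n = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 6, 0, 0],
    [0, 0, 0, 0, 0],
]
frame_n1 = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
    [0, 0, 0, 0, 0],
]
vector = best_match(frame_n1, frame_n, 1, 2, 2, 1)
print(vector)
```

The vector points from the block's position in frame N+1 back to where it was in frame N. A low-cost match is good enough for compression even when it does not reflect the true motion, which is the point made about the long vectors above.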


Gradient from 3x3 neighborhood

Estimate both magnitude and direction of the edge.


Prewitt versus Sobel masks

Sobel mask uses weights of 1,2,1 and -1,-2,-1 in order to give more weight to center estimate. The scaling factor is thus 1/8 and not 1/6.
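The scaling factors can be checked numerically. The sketch below uses the standard Prewitt and Sobel masks, assumes rows increase downward, and applies them to a ramp that rises by one gray level per column; with scaling 1/6 and 1/8 respectively, both should report a gradient magnitude of 1. The function names are chosen for this example.

```python
import math

PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def apply_mask(image, r, c, mask):
    """Weighted sum of the 3x3 neighborhood centered on (r, c)."""
    return sum(
        mask[i][j] * image[r - 1 + i][c - 1 + j]
        for i in range(3)
        for j in range(3)
    )


def gradient(image, r, c, mask_x, mask_y, scale):
    """Return (magnitude, direction in radians) of the intensity gradient at (r, c)."""
    gx = apply_mask(image, r, c, mask_x) / scale
    gy = apply_mask(image, r, c, mask_y) / scale
    return math.hypot(gx, gy), math.atan2(gy, gx)


# Intensity ramp: one gray level per column, constant down each column.
ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
print(gradient(ramp, 1, 1, PREWITT_X, PREWITT_Y, 6))
print(gradient(ramp, 1, 1, SOBEL_X, SOBEL_Y, 8))
```

The divisor is the sum of the positive weights times the two-pixel distance between the outer columns: (1+1+1)·2 = 6 for Prewitt and (1+2+1)·2 = 8 for Sobel.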


Computational short cuts


2 rows of intensity vs difference


“Masks” show how to combine neighborhood values

Multiply the mask by the image neighborhood to get first derivatives of intensity versus x and versus y


Curves of contrasting pixels


Boundaries not always found well


Canny boundary operator


LOG filter creates zero crossing at step edges (2nd derivative of Gaussian)

• 3x3 mask applied at each image position

• Detects spots

• Detects step edges

• Marr-Hildreth theory of edge detection


G(x,y): Mexican hat filter


Positive center

Negative surround


Properties of LOG filter

• Has zero response on constant region

• Has zero response on intensity ramp

• Exaggerates a step edge by making it larger

• Responds to a step in any direction across the “receptive field”.

• Responds to a spot about the size of the center
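These properties can be checked on tiny test images. The slides do not give the mask values, so the sketch below assumes a common 3x3 approximation with a positive center of 4 and a negative surround of -1 at the 4 neighbors; the test images are invented.

```python
# Assumed 3x3 approximation: positive center, negative surround, weights sum to zero.
LOG_MASK = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]


def log_response(image, r, c, mask=LOG_MASK):
    """Weighted sum of the 3x3 neighborhood centered on (r, c)."""
    return sum(
        mask[i][j] * image[r - 1 + i][c - 1 + j]
        for i in range(3)
        for j in range(3)
    )


constant = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
step = [[0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10]]
spot = [[0, 0, 0], [0, 10, 0], [0, 0, 0]]

print(log_response(constant, 1, 1))  # zero on a constant region
print(log_response(ramp, 1, 1))      # zero on an intensity ramp
print(log_response(step, 1, 1), log_response(step, 1, 2))  # sign change across the step
print(log_response(spot, 1, 1))      # strong response to a spot the size of the center
```

The pair of responses on either side of the step has opposite signs, so the zero crossing between them marks the edge location.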


Human receptive field is analogous to a mask

X[j] are the image intensities.

W[j] are gains (weights) in the mask


Human receptive fields amplify contrast


3D neural network in brain

Level j

Level j+1


Mach band effect shows human bias


Human bias and illusions support receptive field theory of edge detection


Human brain as a network

• 100B neurons, or nodes

• Half are involved in vision

• 10 trillion connections

• Neuron can have “fanout” of 10,000

• Visual cortex highly structured to process 2D signals in multiple ways


Color and shading

• Used heavily in human vision

• Color is a pixel property, making some recognition problems easy

• Visible spectrum for humans is 400 nm (blue) to 700 nm (red)

• Machines can “see” much more; e.g., X-rays, infrared, radio waves


Imaging Process (review)


Factors that Affect Perception

• Light: the spectrum of energy that illuminates the object surface

• Reflectance: ratio of reflected light to incoming light

• Specularity: highly specular (shiny) vs. matte surface

• Distance: distance to the light source

• Angle: angle between surface normal and light source

• Sensitivity: how sensitive is the sensor


Some physics of color

White light is composed of all visible wavelengths (400-700 nm)

Ultraviolet and X-rays are of much shorter wavelength

Infrared and radio waves are of much longer wavelength


Models of Reflectance

We need to look at models for the physics of illumination and reflection that will

1. help computer vision algorithms extract information about the 3D world, and

2. help computer graphics algorithms render realistic images of model scenes.

Physics-based vision is the subarea of computer vision that uses physical models to understand image formation in order to better analyze real-world images.


The Lambertian Model: Diffuse Surface Reflection

A diffuse reflecting surface reflects light uniformly in all directions

Uniform brightness for all viewpoints of a planar surface.
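Under the usual statement of the Lambertian model (Lambert's cosine law), the reflected intensity depends on the angle between the surface normal and the direction to the light, and not on the viewing direction. The sketch below assumes that form; the function names and the albedo and light parameters are illustrative.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)


def lambertian(normal, to_light, albedo=1.0, light=1.0):
    """Diffuse intensity: albedo * light * cos(angle between normal and light direction).

    No viewing direction appears, so every viewpoint sees the same brightness.
    """
    n = normalize(normal)
    s = normalize(to_light)
    cos_theta = sum(a * b for a, b in zip(n, s))
    return albedo * light * max(0.0, cos_theta)  # surfaces facing away receive no light


up = (0, 0, 1)
print(lambertian(up, (0, 0, 1)))  # light directly overhead
print(lambertian(up, (math.sin(math.radians(60)), 0, math.cos(math.radians(60)))))  # 60 degrees off the normal
print(lambertian(up, (1, 0, 0)))  # light grazing the surface
```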


Real matte objects

Light from ring around camera lens


Specular reflection is highly directional and mirrorlike.

R is the ray of reflection

V is the direction from the surface toward the viewpoint

α is the shininess parameter
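A sketch of the specular term, assuming the Phong form (R·V)^α with unit vectors, where R is the mirror reflection of the light direction about the surface normal. The function names are chosen for this example.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reflect(normal, to_light):
    """Mirror the direction toward the light about the surface normal: R = 2(N.S)N - S."""
    n = normalize(normal)
    s = normalize(to_light)
    k = 2.0 * dot(n, s)
    return tuple(k * ni - si for ni, si in zip(n, s))


def specular(normal, to_light, to_viewer, alpha):
    """Phong-style specular term (R.V)^alpha, clamped at zero."""
    r = reflect(normal, to_light)
    v = normalize(to_viewer)
    return max(0.0, dot(r, v)) ** alpha


up = (0, 0, 1)
viewer = (math.sin(math.radians(20)), 0, math.cos(math.radians(20)))  # 20 degrees off R
print(specular(up, up, up, 50))      # viewer exactly along the reflected ray
print(specular(up, up, viewer, 5))   # dull surface: broad highlight
print(specular(up, up, viewer, 50))  # shiny surface: highlight falls off quickly
```

A larger shininess parameter concentrates the highlight into a narrower cone around R.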


CV: Perceiving 3D from 2D

Many cues from 2D images enable interpretation of the structure of the 3D world producing them


Many 3D cues

How can humans and other machines reconstruct the 3D nature of a scene from 2D images?

What other world knowledge needs to be added in the process?


Labeling image contours interprets the 3D scene structure

+

An egg and a thin cup on a table top lighted from the top right

Logo on cup is a “mark” on the material

“shadow” relates to illumination, not material


“Intrinsic Image” stores 3D info in “pixels” and not intensity.

For each point of the image, we want depth to the 3D surface point, surface normal at that point, albedo of the surface material, and illumination of that surface point.


3D scene versus 2D image

3D scene: creases; corners; faces; occlusions (for some viewpoint)

2D image: edges; junctions; regions; blades, limbs, T’s


Labeling of simple polyhedra

Labeling of a block floating in space. BJ and KI are convex creases. Blades AB, BC, CD, etc model the occlusion of the background. Junction K is a convex trihedral corner. Junction D is a T-junction modeling the occlusion of blade CD by blade JE.


Trihedral Blocks World Image Junctions: only 16 cases!

Only 16 possible junctions in 2D formed by viewing 3D corners formed by 3 planes and viewed from a general viewpoint! From top to bottom: L-junctions, arrows, forks, and T-junctions.


How do we obtain the catalog?

think about solid/empty assignments to the 8 octants about the X-Y-Z-origin

think about non-accidental viewpoints

account for all possible topologies of junctions and edges

then handle T-junction occlusions


Blocks world labeling

Left: block floating in space

Right: block glued to a wall at the back


Try labeling these: interpret the 3D structure, then label parts

What does it mean if we can’t label them? If we can label them?


1975 researchers very excited

• very strong constraints on interpretations

• several hundred in catalogue when cracks and shadows allowed (Waltz): algorithm works very well with them

• but, world is not made of blocks!

• later on, curved blocks world work done but not as interesting


Backtracking or interpretation tree


“Necker cube” has multiple interpretations

A human staring at one of these cubes typically experiences changing interpretations. The interpretation of the two forks (G and H) flip-flops between “front corner” and “back corner”. What is the explanation?

Label the different interpretations


Depth cues in 2D images


“Interposition” cue

Def: Interposition occurs when one object occludes another object, thus indicating that the occluding object is closer to the viewer than the occluded object.


interposition

• T-junctions indicate occlusion: top is occluding edge while bar is the occluded edge

• Bench occludes lamp post

• leg occludes bench

• lamp post occludes fence

• railing occludes trees

• trees occlude steeple


• Perspective scaling: railing looks smaller at the left; bench looks smaller at the right; 2 steeples are far away

• Foreshortening: the bench is sharply angled relative to the viewpoint; image length is affected accordingly
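Perspective scaling follows from the pinhole projection relation: image size is proportional to object size divided by distance. The sketch below assumes that relation and, purely for illustration, borrows the 20 mm focal length quoted for the eye; the function name and the sample sizes are made up.

```python
def image_size(object_size, distance, focal_length=20.0):
    """Size of the image of an object under pinhole projection (same length units throughout)."""
    return focal_length * object_size / distance


# A 1000 mm tall post seen at 2 m and at 4 m.
near = image_size(1000.0, 2000.0)
far = image_size(1000.0, 4000.0)
print(near, far)
```

Doubling the distance halves the image size, which is why equal-sized railing posts look smaller the farther away they are.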


Texture gradient reveals surface orientation

Texture Gradient: change of image texture along some direction, often corresponding to a change in distance or orientation in the 3D world containing the objects creating the texture.

(In East Lansing, we call it “corn”, not “maize”.)

Note also that the rows appear to converge in 2D


3D Cues from Perspective


3D Cues from perspective


More 3D cues

Virtual lines

Falsely perceived interposition


Irving Rock: The Logic of Perception; 1982

Summarized an entire career in visual psychology

Concluded that the human visual system acts as a problem-solver

Triangle unlikely to be accidental; must be object in front of background; must be brighter since it’s closer


More 3D cues

2D alignment usually means 3D alignment

2D image curves create perception of 3D surface


“structured light” can enhance surfaces in industrial vision

Sculpted object; potatoes with light stripes


Shape (normals) from shading

Cylinder with white paper and pen stripes

Intensities plotted as a surface

Clearly intensity encodes shape in this case


Shape (normals) from shading

Plot of intensity of one image row reveals the 3D shape of these diffusely reflecting objects.
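As a rough sketch of how intensity can encode shape, suppose a diffusely reflecting surface of constant albedo is lit from the viewing direction, so that intensity is the maximum intensity times the cosine of the angle between the surface normal and the viewer. These are assumptions made for this example, not conditions stated on the slide. Inverting the cosine then gives the surface tilt at each pixel of a row.

```python
import math


def surface_angle_degrees(intensity, max_intensity):
    """Angle between surface normal and viewing direction, assuming I = I_max * cos(angle)."""
    ratio = min(1.0, max(0.0, intensity / max_intensity))  # guard against noisy values
    return math.degrees(math.acos(ratio))


# Hypothetical intensities along one image row crossing a cylinder.
row = [100, 141, 173, 200, 173, 141, 100]
angles = [round(surface_angle_degrees(i, 200)) for i in row]
print(angles)
```

Intensity alone gives only the magnitude of the tilt; whether the surface turns left or right has to come from other constraints, such as smoothness across the row.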


What about models for recognition

“recognition” = to know again;

How does memory store models of faces, rooms, chairs, etc.?


Human capability extensive

• Child age 6 might recognize 3000 words

• And 30,000 objects

• Junkyard robot must recognize nearly all objects

• Hundreds of styles of lamps, chairs, tools, …


Some methods: recognize

• Via geometric alignment: CAD

• Via trained neural net

• Via parts of objects and how they join

• Via the function/behavior of an object


Side view classes of Ford Taurus (Chen and Stockman)

These were made in the PRIP Lab from a scale model.

Viewpoints in between can be generated from x and y curvature stored on boundary.

Viewpoints matched to real image boundaries via optimization.


Matching image edges to model limbs

Could recognize car model at stoplight or gate.


Object as parts + relations

• Parts have size, color, shape

• Connect together at concavities

• Relations are: connect, above, right of, inside of, …


Functional models

• Inspired by JJ Gibson’s Theory of Affordances.

• An object is what an object does
  * container: holds stuff
  * club: hits stuff
  * chair: supports humans


Louise Stark: chair model

• Dozens of CAD models of chairs

• Program analyzed for
  * stable pose
  * seat of right size
  * height off ground right size
  * no obstruction to body on seat

• Program would accept a trash can (which could also pass as a container)


Minsky’s theory of frames (Schank’s theory of scripts)

Frames are learned expectations – frame for a room, a car, a party, an argument, …

Frame is evoked by current situation – how? (hard)

Human “fills in” the details of the current frame (easier)


Summary

• Images have many low level features

• Can detect uniform regions and contrast

• Can organize regions and boundaries

• Human vision uses several simultaneous channels: color, edge, motion

• Use of models/knowledge diverse and difficult

• Last 2 issues difficult in computer vision
