stockman msu/cse fall 20081 computer vision and human perception a brief intro from an ai...
TRANSCRIPT
Stockman MSU/CSE Fall 2008 1
Computer Vision and Human Perception
A brief intro from an AI perspective
Stockman MSU/CSE Fall 2008 2
Computer Vision and Human Perception
What are the goals of CV? What are the applications? How do humans perceive the 3D
world via images? Some methods of processing
images.
Stockman MSU/CSE Fall 2008 3
Goal of computer vision Make useful decisions about real physical
objects and scenes based on sensed images.
Alternative (Aloimonos and Rosenfeld): goal is the construction of scene descriptions from images.
How do you find the door to leave? How do you determine if a person is friendly
or hostile? .. an elder? .. a possible mate?
Stockman MSU/CSE Fall 2008 4
Critical Issues Sensing: how do sensors obtain images
of the world? Information: how do we obtain color,
texture, shape, motion, etc.? Representations: what representations
should/does a computer [or brain] use? Algorithms: what algorithms process
image information and construct scene descriptions?
Stockman MSU/CSE Fall 2008 5
Images: 2D projections of 3D 3D world has color, texture, surfaces,
volumes, light sources, objects, motion, betweeness, adjacency, connections, etc.
2D image is a projection of a scene from a specific viewpoint; many 3D features are captured, some are not.
Brightness or color = g(x,y) or f(row, column) for a certain instant of time
Images indicate familiar people, moving objects or animals, health of people or machines
Stockman MSU/CSE Fall 2008 6
Image receives reflections Light reaches
surfaces in 3D Surfaces reflect Sensor element
receives light energy
Intensity matters Angles matter Material maters
Stockman MSU/CSE Fall 2008 7
CCD Camera has discrete elts
Lens collects light rays
CCD elts replace chemicals of film
Number of elts less than with film (so far)
Stockman MSU/CSE Fall 2008 8
Intensities near center of eye
Stockman MSU/CSE Fall 2008 9
Camera + Programs = Display Camera
inputs to frame buffer
Program can interpret data
Program can add graphics
Program can add imagery
Stockman MSU/CSE Fall 2008 10
Some image format issues
Spatial resolution; intensity resolution; image file format
Stockman MSU/CSE Fall 2008 11
Resolution is “pixels per unit of length”
Resolution decreases by one half in cases at left
Human faces can be recognized at 64 x 64 pixels per face
Stockman MSU/CSE Fall 2008 12
Features detected depend on the resolution
Can tell hearts from diamonds
Can tell face value
Generally need 2 pixels across line or small region (such as eye)
Stockman MSU/CSE Fall 2008 13
Human eye as a spherical camera 100M sensing elts in
retina Rods sense intensity Cones sense color Fovea has tightly
packed elts, more cones
Periphery has more rods
Focal length is about 20mm
Pupil/iris controls light entry
• Eye scans, or saccades to image details on fovea
• 100M sensing cells funnel to 1M optic nerve connections to the brain
Stockman MSU/CSE Fall 2008 14
Image processing operations
Thresholding;Edge detection;
Motion field computation
Stockman MSU/CSE Fall 2008 15
Find regions via thresholding
Region has brighter or darker or redder color, etc.
If pixel > threshold then pixel = 1 else pixel = 0
Stockman MSU/CSE Fall 2008 16
Example red blood cell image
Many blood cells are separate objects
Many touch – bad! Salt and pepper
noise from thresholding
How useable is this data?
Stockman MSU/CSE Fall 2008 17
Robot vehicle must see stop sign
sign = imread('Images/stopSign.jpg','jpg'); red = (sign(:, :, 1)>120) & (sign(:,:,2)<100) & (sign(:,:,3)<80); out = red*200; imwrite(out, 'Images/stopRed120.jpg', 'jpg')
Stockman MSU/CSE Fall 2008 18
Thresholding is usually not trivial
Stockman MSU/CSE Fall 2008 19
Can cluster pixels by color similarity and by adjacency
Original RGB Image Color Clusters by K-Means
Stockman MSU/CSE Fall 2008 20
Some image processing ops
Finding contrast in an image; using neighborhoods of pixels;
detecting motion across 2 images
Stockman MSU/CSE Fall 2008 21
Differentiate to find object edges For each pixel,
compute its contrast
Can use max difference of its 8 neighbors
Detects intensity change across boundary of adjacent regions
LOG filter later on
Stockman MSU/CSE Fall 2008 22
4 and 8 neighbors of a pixel 4 neighbors are at
multiples of 90 degrees
. N . W * E . S .
8 neighbors are at every multiple of 45 degrees
NW N NE W * E SW S SE
Stockman MSU/CSE Fall 2008 23
Detect Motion via Subtraction
Constant background
Moving object Produces
pixel differences at boundary
Reveals moving object and its shapeDifferences computed over
time rather than over space
Stockman MSU/CSE Fall 2008 24
Two frames of aerial imagery
Video frame N and N+1 shows slight movement: most pixels are same, just in different locations.
Stockman MSU/CSE Fall 2008 25
Best matching blocks between video frames N+1 to N (motion vectors)
The bulk of the vectors show the true motion of the airplane taking the pictures. The long vectors are incorrect motion vectors, but they do work well for compression of image I2!
Best matches from 2nd to first image shown as vectors overlaid on the 2nd image. (Work by Dina Eldin.)
Stockman MSU/CSE Fall 2008 26
Gradient from 3x3 neighborhoodEstimate both magnitude and direction of the edge.
Stockman MSU/CSE Fall 2008 27
Prewitt versus Sobel masks
Sobel mask uses weights of 1,2,1 and -1,-2,-1 in order to give more weight to center estimate. The scaling factor is thus 1/8 and not 1/6.
Stockman MSU/CSE Fall 2008 28
Computational short cuts
Stockman MSU/CSE Fall 2008 29
2 rows of intensity vs difference
Stockman MSU/CSE Fall 2008 30
“Masks” show how to combine neighborhood values
Multiply the mask by the image neighborhood to get first derivatives of intensity versus x and versus y
Stockman MSU/CSE Fall 2008 31
Curves of contrasting pixels
Stockman MSU/CSE Fall 2008 32
Boundaries not always found well
Stockman MSU/CSE Fall 2008 33
Canny boundary operator
Stockman MSU/CSE Fall 2008 34
LOG filter creates zero crossing at step edges (2nd der. of Gaussian)
3x3 mask applied at each image position
Detects spots
Detects steps edgesMarr-Hildreth theory of edge detection
Stockman MSU/CSE Fall 2008 35
G(x,y): Mexican hat filter
Stockman MSU/CSE Fall 2008 36
Positive center
Negative surround
Stockman MSU/CSE Fall 2008 37
Properties of LOG filter Has zero response on constant
region Has zero response on intensity ramp Exaggerates a step edge by making
it larger Responds to a step in any direction
across the “receptive field”. Responds to a spot about the size of
the center
Stockman MSU/CSE Fall 2008 38
Human receptive field is analogous to a mask
X[j] are the image intensities.
W[j] are gains (weights) in the mask
Stockman MSU/CSE Fall 2008 39
Human receptive fields amplify contrast
Stockman MSU/CSE Fall 2008 40
3D neural network in brain
Level j
Level j+1
Stockman MSU/CSE Fall 2008 41
Mach band effect shows human bias
Stockman MSU/CSE Fall 2008 42
Human bias and illusions supports receptive field theory of edge detection
Stockman MSU/CSE Fall 2008 43
Human brain as a network 100B neurons, or nodes Half are involved in vision 10 trillion connections Neuron can have “fanout” of 10,000 Visual cortex highly structured to
process 2D signals in multiple ways
Stockman MSU/CSE Fall 2008 44
Color and shading Used heavily in human
vision Color is a pixel property,
making some recognition problems easy
Visible spectrum for humans is 400nm (blue) to 700 nm (red)
Machines can “see” much more; ex. X-rays, infrared, radio waves
Stockman MSU/CSE Fall 2008 45
Imaging Process (review)
Stockman MSU/CSE Fall 2008 46
Factors that Affect Perception
• Light: the spectrum of energy that illuminates the object surface
• Reflectance: ratio of reflected light to incoming light
• Specularity: highly specular (shiny) vs. matte surface
• Distance: distance to the light source
• Angle: angle between surface normal and light source
• Sensitivity how sensitive is the sensor
Stockman MSU/CSE Fall 2008 47
Some physics of color
White light is composed of all visible frequencies (400-700)
Ultraviolet and X-rays are of much smaller wavelength
Infrared and radio waves are of much longer wavelength
Stockman MSU/CSE Fall 2008 48
Models of Reflectance
We need to look at models for the physics of illumination and reflection that will1. help computer vision algorithms extract information about the 3D world, and2. help computer graphics algorithms render realistic images of model scenes.
Physics-based vision is the subarea of computer visionthat uses physical models to understand image formationin order to better analyze real-world images.
Stockman MSU/CSE Fall 2008 49
The Lambertian Model:Diffuse Surface Reflection
A diffuse reflectingsurface reflects lightuniformly in all directions
Uniform brightness forall viewpoints of a planarsurface.
Stockman MSU/CSE Fall 2008 50
Real matte objects
Light from ring around camera lens
Stockman MSU/CSE Fall 2008 51
Specular reflection is highly directional and mirrorlike.
R is the ray of reflectionV is direction from the surface toward the viewpoint is the shininess parameter
Stockman MSU/CSE Fall 2008 52
CV: Perceiving 3D from 2D
Many cues from 2D images enable
interpretation of the structure of the 3D
world producing them
Stockman MSU/CSE Fall 2008 53
Many 3D cues
How can humans and other machines reconstruct the 3D nature of a scene from 2D images?
What other world knowledge needs to be added in the process?
Stockman MSU/CSE Fall 2008 54
Labeling image contours interprets the 3D scene structure
+
An egg and a thin cup on a table top lighted from the top right
Logo on cup is a “mark” on the material
“shadow” relates to illumination, not material
Stockman MSU/CSE Fall 2008 55
“Intrinsic Image” stores 3D info in “pixels” and not intensity.
For each point of the image, we want depth to the 3D surface point, surface normal at that point, albedo of the surface material, and illumination of that surface point.
Stockman MSU/CSE Fall 2008 56
3D scene versus 2D image
Creases Corners Faces Occlusions (for
some viewpoint)
Edges Junctions Regions Blades, limbs, T’s
Stockman MSU/CSE Fall 2008 57
Labeling of simple polyhedra
Labeling of a block floating in space. BJ and KI are convex creases. Blades AB, BC, CD, etc model the occlusion of the background. Junction K is a convex trihedral corner. Junction D is a T-junction modeling the occlusion of blade CD by blade JE.
Stockman MSU/CSE Fall 2008 58
Trihedral Blocks World Image Junctions: only 16 cases!
Only 16 possible junctions in 2D formed by viewing 3D corners formed by 3 planes and viewed from a general viewpoint! From top to bottom: L-junctions, arrows, forks, and T-junctions.
Stockman MSU/CSE Fall 2008 59
How do we obtain the catalog?
think about solid/empty assignments to the 8 octants about the X-Y-Z-origin
think about non-accidental viewpoints
account for all possible topologies of junctions and edges
then handle T-junction occlusions
Stockman MSU/CSE Fall 2008 60
Blocks world labeling
Left: block floating in space
Right: block glued to a wall at the back
Stockman MSU/CSE Fall 2008 61
Try labeling these: interpret the 3D structure, then label parts
What does it mean if we can’t label them? If we can label them?
Stockman MSU/CSE Fall 2008 62
1975 researchers very excited very strong constraints on
interpretations several hundred in catalogue when
cracks and shadows allowed (Waltz): algorithm works very well with them
but, world is not made of blocks! later on, curved blocks world work
done but not as interesting
Stockman MSU/CSE Fall 2008 63
Backtracking or interpretation tree
Stockman MSU/CSE Fall 2008 64
“Necker cube” has multiple interpretations
A human staring at one of these cubes typically experiences changing interpretations. The interpretation of the two forks (G and H) flip-flops between “front corner” and “back corner”. What is the explanation?
Label the different interpretations
Stockman MSU/CSE Fall 2008 65
Depth cues in 2D images
Stockman MSU/CSE Fall 2008 66
“Interposition” cue
Def: Interposition occurs when one object occludes another object, thus indicating that the occluding object is closer to the viewer than the occluded object.
Stockman MSU/CSE Fall 2008 67
interposition
• T-junctions indicate occlusion: top is occluding edge while bar is the occluded edge
• Bench occludes lamp post
• leg occludes bench
• lamp post occludes fence
• railing occludes trees
• trees occlude steeple
Stockman MSU/CSE Fall 2008 68
• Perspective scaling: railing looks smaller at the left; bench looks smaller at the right; 2 steeples are far away
• Forshortening: the bench is sharply angled relative to the viewpoint; image length is affected accordingly
Stockman MSU/CSE Fall 2008 69
Texture gradient reveals surface orientation
Texture Gradient: change of image texture along some direction, often corresponding to a change in distance or orientation in the 3D world containing the objects creating the texture.
( In East Lansing, we call it “corn” not “maize’. )
Note also that the rows appear to converge in 2D
Stockman MSU/CSE Fall 2008 70
3D Cues from Perspective
Stockman MSU/CSE Fall 2008 71
3D Cues from perspective
Stockman MSU/CSE Fall 2008 72
More 3D cues
Virtual lines Falsely perceived interposition
Stockman MSU/CSE Fall 2008 73
Irving Rock: The Logic of Perception; 1982
Summarized an entire career in visual psychology
Concluded that the human visual system acts as a problem-solver
Triangle unlikely to be accidental; must be object in front of background; must be brighter since it’s closer
Stockman MSU/CSE Fall 2008 74
More 3D cues
2D alignment usually means 3d alignment
2D image curves create perception of 3D surface
Stockman MSU/CSE Fall 2008 75
“structured light” can enhance surfaces in industrial vision
Sculpted object Potatoes with light stripes
Stockman MSU/CSE Fall 2008 76
Models of Reflectance
We need to look at models for the physics of illumination and reflection that will1. help computer vision algorithms extract information about the 3D world, and2. help computer graphics algorithms render realistic images of model scenes.
Physics-based vision is the subarea of computer visionthat uses physical models to understand image formationin order to better analyze real-world images.
Stockman MSU/CSE Fall 2008 77
The Lambertian Model:Diffuse Surface Reflection
A diffuse reflectingsurface reflects lightuniformly in all directions
Uniform brightness forall viewpoints of a planarsurface.
Stockman MSU/CSE Fall 2008 78
Shape (normals) from shading
Cylinder with white paper and pen stripes
Intensities plotted as a surface
Clearly intensity encodes shape in this case
Stockman MSU/CSE Fall 2008 79
Shape (normals) from shading
Plot of intensity of one image row reveals the 3D shape of these diffusely reflecting objects.
Stockman MSU/CSE Fall 2008 80
Specular reflection is highly directional and mirrorlike.
R is the ray of reflectionV is direction from the surface toward the viewpoint is the shininess parameter
Stockman MSU/CSE Fall 2008 81
What about models for recognition
“recognition” = to know again;
How does memory store models of faces, rooms,
chairs, etc.?
Stockman MSU/CSE Fall 2008 82
Human capability extensive
Child age 6 might recognize 3000 words
And 30,000 objects Junkyard robot must recognize
nearly all objects Hundreds of styles of lamps,
chairs, tools, …
Stockman MSU/CSE Fall 2008 83
Some methods: recognize Via geometric alignment: CAD Via trained neural net Via parts of objects and how they
join Via the function/behavior of an
object
Stockman MSU/CSE Fall 2008 84
Side view classes of Ford Taurus (Chen and Stockman)
These were made in the PRIP Lab from a scale model.
Viewpoints in between can be generated from x and y curvature stored on boundary.
Viewpoints matched to real image boundaries via optimization.
Stockman MSU/CSE Fall 2008 85
Matching image edges to model limbs
Could recognize car model at stoplight or gate.
Stockman MSU/CSE Fall 2008 86
Object as parts + relations
Parts have size, color, shape Connect together at concavities Relations are: connect, above, right of, inside of,
…
Stockman MSU/CSE Fall 2008 87
Functional models Inspired by JJ Gibson’s Theory of
Affordances. An object is what an object does * container: holds stuff * club: hits stuff * chair: supports humans
Stockman MSU/CSE Fall 2008 88
Louise Stark: chair model Dozens of CAD models of chairs Program analyzed for * stable pose * seat of right size * height off ground right size * no obstruction to body on seat * program would accept a trash can (which could also pass as a container)
Stockman MSU/CSE Fall 2008 89
Minski’s theory of frames(Schank’s theory of scripts)
Frames are learned expectations – frame for a room, a car, a party, an argument, …
Frame is evoked by current situation – how? (hard)
Human “fills in” the details of the current frame (easier)
Stockman MSU/CSE Fall 2008 90
summary Images have many low level features Can detect uniform regions and contrast Can organize regions and boundaries Human vision uses several simultaneous
channels: color, edge, motion Use of models/knowledge diverse and
difficult Last 2 issues difficult in computer vision