how does the kinect work? - dickinson college
TRANSCRIPT
![Page 1: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/1.jpg)
How does the Kinect work?
John MacCormick
![Page 2: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/2.jpg)
Xbox demo
![Page 3: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/3.jpg)
Laptop demo
![Page 4: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/4.jpg)
The Kinect uses structured light and machine learning
• Inferring body position is a two-stage process: first compute a depth map (using structured light), then infer body position (using machine learning)
• The results are great!
• The system uses many college-level math concepts, and demonstrates the remarkable advances in computer vision in the last 20 years
![Page 5: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/5.jpg)
Inferring body position is a two-stage process: first compute a depth map,
then infer body position
![Page 6: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/6.jpg)
Inferring body position is a two-stage process: first compute a depth map,
then infer body position
![Page 7: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/7.jpg)
RGB CAMERA
MULTI-ARRAY MIC MOTORIZED TILT
3D DEPTH SENSORS
Figure copied from Kinect for Windows SDK Quickstarts
Stage 1: The depth map is constructed by analyzing a speckle pattern of infrared laser light
![Page 8: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/8.jpg)
Remarks:
• Microsoft licensed this technology from a company called PrimeSense
• The depth computation is all done by the PrimeSense hardware built into Kinect
• Details are not publicly available; this description is speculative (based mostly on PrimeSense patent applications) and may be wrong
Stage 1: The depth map is constructed by analyzing a speckle pattern of infrared
laser light
![Page 9: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/9.jpg)
The technique of analyzing a known pattern is called structured light
![Page 10: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/10.jpg)
Structured light general principle: project a known pattern onto the scene and
infer depth from the deformation of that pattern
Zhang et al, 3DPVT (2002)
![Page 11: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/11.jpg)
The Kinect uses infrared laser light, with a speckle pattern
Shpunt et al, PrimeSense patent application US 2008/0106746
![Page 12: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/12.jpg)
Stage 1: The depth map is constructed by analyzing a speckle pattern of
infrared laser light
• The Kinect uses an infrared projector and sensor; it does not use its RGB camera for depth computation
• The technique of analyzing a known pattern is called structured light
• The Kinect combines structured light with two classic computer vision techniques: depth from focus, and depth from stereo
![Page 13: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/13.jpg)
Depth from focus uses the principle that stuff that is more blurry is further
away
Watanabe and Nayar, IJCV 27(3), 1998
![Page 14: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/14.jpg)
Depth from focus uses the principle that stuff that is more blurry is further
away
• The Kinect dramatically improves the accuracy of traditional depth from focus
• The Kinect uses a special (“astigmatic”) lens with different focal length in x- and y- directions
• A projected circle then becomes an ellipse whose orientation depends on depth
![Page 15: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/15.jpg)
The astigmatic lens causes a projected circle to become an ellipse whose
orientation depends on depth
Freedman et al, PrimeSense patent application US 2010/0290698
![Page 16: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/16.jpg)
Depth from stereo uses parallax
• If you look at the scene from another angle, stuff that is close gets shifted to the side more than stuff that is far away
• The Kinect analyzes the shift of the speckle pattern by projecting from one location and observing from another
![Page 17: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/17.jpg)
Depth from stereo uses parallax
Isard and M, ACCV (2006)
![Page 18: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/18.jpg)
Inferring body position is a two-stage process: first compute a depth map,
then infer body position
• Stage 1: The depth map is constructed by analyzing a speckle pattern of infrared laser light
• Stage 2: Body parts are inferred using a randomized decision forest, learned from over 1 million training examples
Note: details of Stage 2 were published in Shotton et al (CVPR 2011)
![Page 19: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/19.jpg)
Inferring body position is a two-stage process: first compute a depth map, then
infer body position
![Page 20: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/20.jpg)
Stage 2 has 2 substages (use intermediate “body parts”
representation)
Shotton et al, CVPR(2011)
![Page 21: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/21.jpg)
Stage 2.1 starts with 100,000 depth images with known skeletons (from a
motion capture system)
Shotton et al, CVPR(2011)
![Page 22: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/22.jpg)
For each real image, render dozens more using computer graphics
techniques
• Use computer graphics to render all sequences for 15 different body types, and vary several other parameters
• Thus obtain over a million training examples
![Page 23: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/23.jpg)
For each real image, render dozens more using computer graphics
techniques
Shotton et al, CVPR(2011)
![Page 24: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/24.jpg)
Stage 2.1 transforms depth image to body part image
• Start with 100,000 depth images with known skeleton (from a motion capture system)
• For each real image, render dozens more using computer graphics techniques.
• Learn a randomized decision forest, mapping depth images to body parts
![Page 25: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/25.jpg)
A randomized decision forest is a more sophisticated version of the classic
decision tree
![Page 26: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/26.jpg)
A decision tree is like a pre-planned game of “twenty questions”
Ntoulas et al, WWW (2006)
![Page 27: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/27.jpg)
What kind of “questions” can the Kinect ask in its twenty questions?
• Simplified version:
– “is the pixel at that offset in the background?”
• Real version:
– “how does the (normalized) depth at that pixel compare to this pixel?” [see Shotton et al, equation 1]
Shotton et al, CVPR(2011)
![Page 28: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/28.jpg)
To learn a decision tree, you choose as the next question the one that is most “useful” on (the relevant part of) the
training data
• E.g. for umbrella tree, is “raining?” or “cloudy?” more useful?
• In practice, “useful” = information gain G (which is derived from entropy H):
Shotton et al, CVPR(2011)
![Page 29: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/29.jpg)
Kinect actually uses a randomized decision forest
• Randomized:
– Too many possible questions, so use a random selection of 2000 questions each time
• Forest:
– learn multiple trees
– to classify, add outputs of the trees
– outputs are actually probability distributions, not single decisions
![Page 30: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/30.jpg)
Shotton et al, CVPR(2011)
Kinect actually uses a randomized decision forest
![Page 31: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/31.jpg)
Learning the Kinect decision forest requires 24,000 CPU-hours, but takes
only a day using hundreds of computers simultaneously
“To keep the training times down we employ a distributed implementation. Training 3 trees to depth 20 from 1 million images takes about a day on a 1000 core cluster.”
—Shotton et al, CVPR(2011)
![Page 32: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/32.jpg)
Learning the Kinect decision forest requires 24,000 CPU-hours, but takes
only a day using hundreds of computers simultaneously
“To keep the training times down we employ a distributed implementation. Training 3 trees to depth 20 from 1 million images takes about a day on a 1000 core cluster.”
—Shotton et al, CVPR(2011)
![Page 33: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/33.jpg)
Stage 2 has 2 substages (use intermediate “body parts”
representation)
Shotton et al, CVPR(2011)
![Page 34: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/34.jpg)
Stage 2.2 transforms the body part image into a skeleton
• The mean shift algorithm is used to robustly compute modes of probability distributions
• Mean shift is simple, fast, and effective
![Page 35: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/35.jpg)
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Mean Shift vector
Objective : Find the densest region
Slide taken from Ukrainitz & Sarel
![Page 36: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/36.jpg)
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Mean Shift vector
Objective : Find the densest region
Slide taken from Ukrainitz & Sarel
![Page 37: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/37.jpg)
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Mean Shift vector
Objective : Find the densest region
Slide taken from Ukrainitz & Sarel
![Page 38: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/38.jpg)
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Mean Shift vector
Objective : Find the densest region
Slide taken from Ukrainitz & Sarel
![Page 39: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/39.jpg)
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Mean Shift vector
Objective : Find the densest region
Slide taken from Ukrainitz & Sarel
![Page 40: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/40.jpg)
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Mean Shift vector
Objective : Find the densest region
Slide taken from Ukrainitz & Sarel
![Page 41: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/41.jpg)
Intuitive Description
Distribution of identical billiard balls
Region of interest
Center of mass
Objective : Find the densest region
Slide taken from Ukrainitz & Sarel
![Page 42: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/42.jpg)
The Kinect uses structured light and machine learning
• Inferring body position is a two-stage process: first compute a depth map (using structured light), then infer body position (using machine learning)
• The results are great!
• The system uses many college-level math concepts, and demonstrates the remarkable advances in computer vision in the last 20 years
![Page 43: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/43.jpg)
The results are great!
![Page 44: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/44.jpg)
The results are great!
• But how can we make the results even better?
![Page 45: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/45.jpg)
How can we make the results even better?
• Use temporal information
– The Xbox team wrote a tracking algorithm that does this
• Reject “bad” skeletons and accept “good” ones
– This is what I worked on last summer
• Of course there are plenty of other ways (next summer???)
![Page 46: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/46.jpg)
Last summer (1)
![Page 47: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/47.jpg)
Last summer (2): Use the same ground truth dataset to infer skeleton
likelihoods
likely unlikely
![Page 48: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/48.jpg)
Next summer: ????
• (working with Andrew Fitzgibbon at MSR Cambridge)
![Page 49: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/49.jpg)
The Kinect uses structured light and machine learning
• Inferring body position is a two-stage process: first compute a depth map (using structured light), then infer body position (using machine learning)
• The results are great!
• The system uses many college-level math concepts, and demonstrates the remarkable advances in computer vision in the last 20 years
![Page 50: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/50.jpg)
What kind of math is used in this stuff?
• PROBABILITY and STATISTICS
• MULTIVARIABLE CALCULUS
• LINEAR ALGEBRA
• Complex analysis, combinatorics, graph algorithms, geometry, differential equations, topology
![Page 51: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/51.jpg)
Personal reflection: computer vision has come a long way since I started working on
it in 1996.
• Key enablers of this progress: – Use of probabilistic and statistical models
– Use of machine learning
– Vastly faster hardware
– Parallel and distributed implementations (clusters, multicore, GPU)
– Cascades of highly flexible, simple features (not detailed models)
![Page 52: How does the Kinect work? - Dickinson College](https://reader036.vdocuments.net/reader036/viewer/2022071600/613d230a736caf36b759c057/html5/thumbnails/52.jpg)
The Kinect uses structured light and machine learning